Thoughts from recording the policy loss of an AC algorithm

While recording the losses of AC algorithms such as DDPG, I observed curves like the following:
[Figures: recorded training loss curves; loss_pi starts near 0, rises, then flattens]

My initial confusion: isn't the loss of the policy π just the negative of the Q value? If loss_pi is increasing, that means Q is decreasing, yet isn't the policy supposed to be moving in the direction that increases Q?
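As background on what is being plotted: in a typical PyTorch-style DDPG implementation (a minimal sketch with hypothetical `actor` and `critic` modules, not necessarily the exact code behind the curves above), loss_pi really is just the negative batch mean of the critic's Q estimate.

```python
import torch
import torch.nn as nn

# Minimal sketch (hypothetical `actor` and `critic` modules, PyTorch-style DDPG):
# loss_pi is the negative batch mean of the critic's Q estimate, so minimizing it
# pushes mu(s) toward actions the critic scores higher.
def actor_loss(actor: nn.Module, critic: nn.Module, states: torch.Tensor) -> torch.Tensor:
    actions = actor(states)             # a = mu(s)
    q_values = critic(states, actions)  # Q(s, mu(s)), one value per batch element
    return -q_values.mean()             # loss_pi = -mean(Q)
```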

After discussing with others and thinking it over, I came to the following conclusions:

  • All rewards in my environment are negative. This is the starting point for reasoning about the problem.
  • Because all rewards are negative, the Q value under any policy is negative, including the Q value under the optimal policy.
  • The critic network's weights are very close to 0 after initialization, so all of its predicted Q values are close to 0. Since loss_pi is the average of batch_size negated Q values, loss_pi is also close to 0 at the start, which explains why the loss_pi curve begins at 0.
  • To be clear: an increasing loss_pi does not mean the policy network is failing to converge. In most AC algorithms the actor is updated with a policy gradient, and its loss is merely the average of the negated Q values, where Q approaches the optimal Q (in a purely negative-reward environment the optimal Q is negative; more precisely, it approaches the Q of the current policy, which is also negative). So expecting the actor's loss to converge to 0 is actually unreasonable, and that expectation was the source of my initial confusion.
  • During training, all the actor does at each update is, for the sampled state s_t, push its output μ(s_t) in the direction that makes the critic's output larger. Meanwhile the critic's output is decreasing (from 0 toward a negative value) as it approaches the Q value of the current policy, so the actor's loss gradually increases (from 0 toward a positive value) as a consequence of the critic's updates. The toy sketch after this list walks through exactly this dynamic.
  • Why does loss_pi flatten out in the end? The AC update scheme resembles policy iteration (note: resembles, not equals). Both the actor and the critic are updated dynamically: the actor starts from a poor policy and keeps improving it according to the critic, while the critic gradually converges to the Q value under the current policy, which at least shrinks the distance to the optimal Q value. The two networks "fight" each other (somewhat like GANs) and eventually converge. loss_pi flattening means the critic has moved from its initial state, where its output of 0 was far from the Q value of any policy, to a state where its output is close to the Q value of the actor's current policy. At that point the critic changes slowly, so loss_pi levels off. In the ideal case the critic converges to the Q value under the optimal policy and the actor is the optimal policy; the Q value has then reached its maximum over all policies (the definition of Q*), so loss_pi no longer trends up or down. (A sketch of one interleaved actor-critic update also follows this list.)
  • Reinforcement learning is not supervised learning, and supervised-learning intuitions do not always carry over. The loss is not expected to converge to 0; in fact, a loss that converges to 0 can be a sign that something is wrong. Whether the actor has learned anything cannot be judged directly from its loss. As long as the critic network is converging properly, the actor's effort to maximize the critic's output in every state will keep making the policy better.
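To make the shape of the curve concrete, here is a toy numeric sketch (made-up numbers, not taken from the actual training run): the critic's prediction starts near 0 and drifts toward the negative Q of the current policy, so loss_pi = -Q_pred climbs from roughly 0 to a positive plateau even though nothing is going wrong.

```python
# Toy illustration with assumed numbers: the critic's prediction starts near 0
# (near-zero weight initialization) and moves toward the true, negative Q of the
# current policy, so loss_pi = -Q_pred rises from ~0 to a positive plateau.
q_true = -10.0   # Q of the current policy; negative because every reward is negative
q_pred = 0.0     # critic's initial prediction, close to 0 after initialization
rate = 0.1       # assumed speed at which the critic closes the gap per update

for step in range(51):
    q_pred += rate * (q_true - q_pred)  # critic prediction approaches q_true
    loss_pi = -q_pred                   # actor loss = negative predicted Q
    if step % 10 == 0:
        print(f"step {step:2d}: Q_pred = {q_pred:6.2f}, loss_pi = {loss_pi:5.2f}")
# loss_pi climbs from about 1.0 toward 10 and flattens, mirroring the recorded curve.
```

For completeness, here is a minimal sketch of one interleaved DDPG update (assumed PyTorch-style code; `actor`, `critic`, `target_actor`, `target_critic` and their optimizers are hypothetical): the critic regresses toward the Q of the current policy while the actor chases the critic's current estimate, which is the "fight" described above.

```python
import torch
import torch.nn as nn

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99):
    """One interleaved update: the critic chases the Q of the current policy,
    the actor chases the critic's current estimate."""
    states, actions, rewards, next_states, dones = batch

    # Critic step: regress toward the bootstrapped target built from the target networks.
    with torch.no_grad():
        next_q = target_critic(next_states, target_actor(next_states))
        q_target = rewards + gamma * (1.0 - dones) * next_q
    critic_loss = nn.functional.mse_loss(critic(states, actions), q_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor step: push mu(s) in the direction that raises the critic's output.
    # As the critic's Q estimates sink toward their true negative values,
    # this loss rises from 0 toward a positive plateau.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    return critic_loss.item(), actor_loss.item()
```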
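A usage note on the toy sketch: the values -10.0 and 0.1 are arbitrary; the point is only the shape of the curve, which matches the argument in the list rather than any particular environment.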

Source: blog.csdn.net/weixin_43145941/article/details/115342794