Temporal Difference Learning

Background

Temporal Difference learning, abbreviated TD learning, learns from episodes just as Monte Carlo does and likewise needs no model of the environment, but unlike Monte Carlo it does not require complete episodes.
Without knowing the state transition model or the reward function, TD learns from incomplete trajectories: value estimates are updated with the Bellman recurrence (bootstrapping) to work toward the optimal solution.
Advantages: real-time online learning and the ability to learn from incomplete trajectories, which makes it well suited to control problems.
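As a concrete instance of the bootstrapping idea, the simplest TD method, TD(0), moves the value estimate of a state toward the one-step target R + γV(S') after every transition instead of waiting for the episode to finish. A minimal sketch in Python (not from the original post; the dictionary representation of V and the step size α = 0.1 and discount γ = 0.9 are illustrative assumptions):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V(s) toward the bootstrapped target R + gamma * V(s')."""
    td_target = r + gamma * V.get(s_next, 0.0)   # bootstrap from the current estimate of V(s')
    td_error = td_target - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return V

# Example: one observed transition from state "A" to state "B" with reward 1.0
V = {}
V = td0_update(V, "A", 1.0, "B")
print(V)   # {'A': 0.1} with the default alpha = 0.1
```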
On-policy vs. off-policy
On-policy: the policy used to generate the samples is the same policy being evaluated and improved.
Off-policy: the policy used to generate the samples differs from the policy being evaluated and improved. This makes it easier to learn from the experience of humans or of other agents, to learn from old policies, and to compare the merits of two policies; perhaps the most important reason is that it allows following an exploratory behaviour policy while optimizing an existing target policy.
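To make the contrast concrete, these are the standard update rules of the two control methods discussed below, written out from the usual textbook definitions (the original figure is not reproduced here):

On-policy (SARSA), where $A^{\prime}$ is the action actually chosen in $S^{\prime}$ by the same ε-greedy policy that chose $A$:
$Q(S, A) \leftarrow Q(S, A)+\alpha\left[R+\gamma Q\left(S^{\prime}, A^{\prime}\right)-Q(S, A)\right]$

Off-policy (Q-learning), where the bootstrap uses the greedy target policy regardless of which action the behaviour policy takes next:
$Q(S, A) \leftarrow Q(S, A)+\alpha\left[R+\gamma \max _{a} Q\left(S^{\prime}, a\right)-Q(S, A)\right]$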

SARSA

On-policy temporal difference control:
[Figure: the SARSA algorithm]
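A minimal tabular SARSA sketch in Python (not from the original post; the tiny chain environment in `step`, the ε-greedy exploration rate, and the other hyperparameters are illustrative assumptions). The key point is that the action A' used in the bootstrap is the action the same ε-greedy behaviour policy will actually execute next, which is what makes SARSA on-policy:

```python
import random
from collections import defaultdict

# Toy chain environment (an assumption for illustration): states 0..4,
# action 0 moves left, action 1 moves right; reaching state 4 gives reward 1 and ends the episode.
def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(4, s + 1)
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

def epsilon_greedy(Q, s, epsilon=0.1):
    # Behaviour policy: explore with probability epsilon, otherwise act greedily w.r.t. Q.
    if random.random() < epsilon:
        return random.choice([0, 1])
    return max([0, 1], key=lambda a: Q[(s, a)])

def sarsa(episodes=500, alpha=0.1, gamma=0.9):
    Q = defaultdict(float)                       # tabular Q(s, a), initialized to 0
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s)                 # choose A with the behaviour policy
        done = False
        while not done:
            s2, r, done = step(s, a)
            a2 = epsilon_greedy(Q, s2)           # choose A' with the SAME policy (on-policy)
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # SARSA update
            s, a = s2, a2
    return Q

if __name__ == "__main__":
    Q = sarsa()
    print({k: round(v, 2) for k, v in sorted(Q.items())})
```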
In SARSA, Q(s, a) is stored in a large table, which does not scale to problems with very large state or action spaces.

Off-policy temporal difference control (Q-learning):

Algorithm steps:
Step 1: Initialization — input the number of iterations T, the state set S, and the action set A; initialize Q(s, a) for all state–action pairs.
Step 2: Choose action A from the current state S (for example, ε-greedily with respect to Q).
Step 3: Take action A and observe the reward R and the new state S'.
Step 4: Update the value function: $Q(S, A) \leftarrow Q(S, A)+\alpha\left[R+\gamma \max _{a} Q\left(S^{\prime}, a\right)-Q(S, A)\right]$.
Step 5: $S \leftarrow S^{\prime}$ — set the current state to the new state and repeat from Step 2 until the episode ends or the iteration budget T is exhausted.
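A minimal tabular Q-learning sketch following the steps above (again, the toy chain environment and the hyperparameters are illustrative assumptions, not part of the original post). Unlike SARSA, the bootstrap term takes $\max_a Q(S', a)$, the greedy target policy, regardless of which action the ε-greedy behaviour policy selects next:

```python
import random
from collections import defaultdict

def step(s, a):
    # Same toy chain environment as in the SARSA sketch (an illustrative assumption).
    s2 = max(0, s - 1) if a == 0 else min(4, s + 1)
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

def q_learning(T=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                            # Step 1: initialize Q(s, a) to 0
    for _ in range(T):                                # T iterations (episodes)
        s, done = 0, False
        while not done:
            # Step 2: choose A from S with the epsilon-greedy behaviour policy
            if random.random() < epsilon:
                a = random.choice([0, 1])
            else:
                a = max([0, 1], key=lambda act: Q[(s, act)])
            # Step 3: take action A, observe reward R and new state S'
            s2, r, done = step(s, a)
            # Step 4: Q(S,A) <- Q(S,A) + alpha * [R + gamma * max_a Q(S',a) - Q(S,A)]
            best_next = 0.0 if done else max(Q[(s2, act)] for act in [0, 1])
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            # Step 5: S <- S'
            s = s2
    return Q

if __name__ == "__main__":
    Q = q_learning()
    print({k: round(v, 2) for k, v in sorted(Q.items())})
```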


Origin blog.csdn.net/zx_zhang01/article/details/103817428