(1) Deep Reinforcement Learning Basics [Fundamental Concepts]

1. Terminologies

Reinforcement learning uses many technical terms. To get started with reinforcement learning, you first need to understand these terms.

1] state and action
State $s$ (the current frame).

Action $a \in \{\text{left}, \text{right}, \text{up}\}$.

The entity that performs the action is called the agent.

2] policy

Policy $\pi$: based on the observed state, the policy makes the decisions that control how the agent moves.

- Policy function $\pi : (s, a) \rightarrow [0, 1]$:
  $\pi(a \mid s) = P(A = a \mid S = s).$
- It is the probability of taking action $A = a$ given state $s$, e.g.,
  - $\pi(\text{left} \mid s) = 0.2,$
  - $\pi(\text{right} \mid s) = 0.1,$
  - $\pi(\text{up} \mid s) = 0.7.$
- Upon observing state $S = s$, the agent's action $A$ can be random (see the sketch below).
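
To make the policy function concrete, here is a minimal Python sketch (my illustration, not part of the original notes): the policy is stored as a table of action probabilities per state, and an action is drawn at random from that distribution. The state key `"s"` and the probabilities simply mirror the example numbers above.

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical policy table: for each state, a distribution pi(.|s)
# over the actions {left, right, up}, matching the example above.
policy = {
    "s": {"left": 0.2, "right": 0.1, "up": 0.7},
}

def sample_action(state):
    """Sample an action A ~ pi(. | state) from the policy table."""
    actions = list(policy[state].keys())
    probs = list(policy[state].values())
    return rng.choice(actions, p=probs)

print(sample_action("s"))  # 'up' about 70% of the time
```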

3] reward

Reward $R$:
- Collect a coin: $R = +1$
- Win the game: $R = +10000$
- Touch a Goomba: $R = -10000$ (game over)
- Nothing happens: $R = 0$

4] state transition

- old state $\overset{\text{action}}{\longrightarrow}$ new state
- E.g., the "up" action leads to a new state.
- State transition can be random.
- The randomness comes from the environment.
- $p(s' \mid s, a) = P(S' = s' \mid S = s, A = a)$ (a toy sampling sketch follows).
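
Analogously, $p(s' \mid s, a)$ can be pictured as a table that maps a (state, action) pair to a distribution over next states. The sketch below is only an illustration; the states `"s_jump"`, `"s_stay"` and their probabilities are made up.

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical state-transition table p(s' | s, a).
transition = {
    ("s", "up"): {"s_jump": 0.8, "s_stay": 0.2},
}

def sample_next_state(state, action):
    """Sample S' ~ p(. | state, action) from the transition table."""
    dist = transition[(state, action)]
    next_states = list(dist.keys())
    probs = list(dist.values())
    return rng.choice(next_states, p=probs)

print(sample_next_state("s", "up"))  # 's_jump' about 80% of the time
```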

5] agent environment interaction

The environment here is the game program, and the agent is Mario. The state $s_t$ is what the environment tells us; in Super Mario, we can take the current frame as the state $s_t$. Upon seeing the state $s_t$, the agent needs to make an action $a_t$, which can be left, right, or up.

After making the action $a_t$, the environment returns a new state and a reward $r_t$.
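
This interaction protocol (the agent sends $a_t$, the environment answers with a new state and $r_t$) can be sketched with a toy environment class. The `ToyEnv` below and its dynamics are invented for illustration; the real environment would be the game program itself.

```python
import random

class ToyEnv:
    """Made-up environment: takes an action, returns (next_state, reward)."""
    def __init__(self):
        self.state = "s0"

    def step(self, action):
        # Invented dynamics: "up" usually pays off, anything else does not.
        if action == "up" and random.random() < 0.8:
            next_state, reward = "s1", 1.0   # e.g. collected a coin
        else:
            next_state, reward = "s0", 0.0   # nothing happened
        self.state = next_state
        return next_state, reward

env = ToyEnv()
a_t = random.choice(["left", "right", "up"])  # stand-in for a_t ~ pi(.|s_t)
s_next, r_t = env.step(a_t)                   # environment replies with (s_{t+1}, r_t)
print(a_t, s_next, r_t)
```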

2. Randomness in Reinforcement Learning

1] Actions have randomness

- Given state $s$, the action can be random, e.g.,
  - $\pi(\text{left} \mid s) = 0.2$
  - $\pi(\text{right} \mid s) = 0.1$
  - $\pi(\text{up} \mid s) = 0.7$

Actions are sampled from the policy function.

2] State transitions have randomness

- Given state $S = s$ and action $A = a$, the environment randomly generates a new state $S'$.

The new state is sampled from the state transition function.

3. Play the game using AI

- Observe a frame (state $s_1$)
- $\Rightarrow$ Make action $a_1$ (left, right, or up)
- $\Rightarrow$ Observe a new frame (state $s_2$) and reward $r_1$
- $\Rightarrow$ Make action $a_2$
- $\Rightarrow$ $\ldots$

- (state, action, reward) trajectory (collected by the sketch below):
  $s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T.$
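
A hedged sketch of collecting such a trajectory, assuming the toy `ToyEnv` environment from the earlier sketch and a uniform random policy in place of a learned $\pi$:

```python
import random

def collect_trajectory(env, num_steps=5):
    """Roll out a policy and record the (s_t, a_t, r_t) triples."""
    trajectory = []
    state = env.state
    for _ in range(num_steps):
        action = random.choice(["left", "right", "up"])  # a_t ~ pi(.|s_t)
        next_state, reward = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory

print(collect_trajectory(ToyEnv()))  # e.g. [('s0', 'up', 1.0), ('s1', 'left', 0.0), ...]
```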

4. Rewards and Returns (important)

4.1 Return

Definition: Return (cumulative future reward)

- $U_t = R_t + R_{t+1} + R_{t+2} + R_{t+3} + \ldots$

Question: Are $R_t$ and $R_{t+1}$ equally important?
- Which of the following do you prefer?
  - I give you $100 right now.
  - I will give you $100 one year later.
- Future reward is less valuable than present reward.
- $R_{t+1}$ should be given less weight than $R_t$.

Definition: Discounted return (cumulative discounted future reward)

- $\gamma$: discount rate (a tuning hyper-parameter).

- $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \ldots$ (a small worked example follows)
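
As a small worked example (the rewards and $\gamma$ below are arbitrary), the discounted return is just the sum of $\gamma^k R_{t+k}$ over future steps:

```python
def discounted_return(rewards, gamma=0.9):
    """U_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ...
    computed from a finite list of future rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Arbitrary rewards R_t, R_{t+1}, R_{t+2}, R_{t+3}:
print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))
# 1.0 + 0 + 0 + 0.9**3 * 10  ≈ 8.29
```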

4.2 Randomness in Returns

Definition: Discounted return (at time step $t$)

- $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \ldots$

At time step $t$, the return $U_t$ is random.
- Two sources of randomness:
  1. Action can be random: $P[A = a \mid S = s] = \pi(a \mid s).$
  2. New state can be random: $P[S' = s' \mid S = s, A = a] = p(s' \mid s, a).$

- For any $i \geq t$, the reward $R_i$ depends on $S_i$ and $A_i$.

- Thus, given $s_t$, the return $U_t$ depends on the random variables:
  - $A_t, A_{t+1}, A_{t+2}, \ldots$ and $S_{t+1}, S_{t+2}, \ldots$

5. Value Function

5.1 Action-Value Function $Q(s, a)$

Definition: Return (cumulative future reward)
- $U_t = R_t + R_{t+1} + R_{t+2} + R_{t+3} + \ldots$

Definition: Action-value function for policy $\pi$
- $Q_\pi(s_t, a_t) = \mathbb{E}[U_t \mid S_t = s_t, A_t = a_t]$

- Return $U_t$ (random variable) depends on actions $A_t, A_{t+1}, A_{t+2}, \ldots$ and states $S_t, S_{t+1}, S_{t+2}, \ldots$

- Actions are random: $P[A = a \mid S = s] = \pi(a \mid s).$ (Policy function)

- States are random: $P[S' = s' \mid S = s, A = a] = p(s' \mid s, a).$ (State transition)

The action-value function answers the question: if the agent uses policy $\pi$, how good or bad is it to take action $a_t$ in state $s_t$? Given the policy function $\pi$, $Q_\pi$ can score every action $a$ in the current state.

Definition: Optimal action-value function
- $Q^*(s_t, a_t) = \underset{\pi}{\max}\, Q_\pi(s_t, a_t)$
It evaluates action $a$ independently of any particular policy and can tell the agent which action is best (see the Monte Carlo sketch below for one way to approximate $Q_\pi$).
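
One hedged way to make $Q_\pi$ concrete (not how the original notes compute it, only an illustration): estimate $Q_\pi(s_0, a)$ by Monte Carlo, i.e. average the discounted returns of many rollouts that take action $a$ first and then follow $\pi$. The `make_env` factory and `policy_fn` are assumed to behave like the toy environment and policy sketched earlier.

```python
def estimate_q(make_env, policy_fn, state0, first_action,
               gamma=0.9, num_rollouts=1000, horizon=20):
    """Monte Carlo estimate of Q_pi(state0, first_action): the average
    discounted return over rollouts that take `first_action` first and
    then keep following the policy `policy_fn`."""
    total = 0.0
    for _ in range(num_rollouts):
        env = make_env()                      # fresh episode starting at state0
        state, action = state0, first_action
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            state, reward = env.step(action)  # environment gives (s', r)
            ret += discount * reward
            discount *= gamma
            action = policy_fn(state)         # next action a ~ pi(.|s)
        total += ret
    return total / num_rollouts

# With the toy pieces from earlier (assumptions, not the real game):
# estimate_q(ToyEnv, lambda s: "up", state0="s0", first_action="up")
```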

5.2 State-Value Function $V(s)$

Definition: State-value function
- $V_\pi(s_t) = \mathbb{E}_A[Q_\pi(s_t, A)] = \sum_a \pi(a \mid s_t) \cdot Q_\pi(s_t, a)$. (Actions are discrete)

- $V_\pi(s_t) = \mathbb{E}_A[Q_\pi(s_t, A)] = \int \pi(a \mid s_t) \cdot Q_\pi(s_t, a)\, da$. (Actions are continuous)

$V_\pi(s_t)$ judges the current situation: it tells us, for instance, whether we are likely to win or lose (a small numeric example follows).
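
For discrete actions the definition above is just a weighted average of action values. A minimal numeric sketch with made-up values of $\pi(a \mid s)$ and $Q_\pi(s, a)$:

```python
# Hypothetical policy pi(.|s) and action values Q_pi(s, .) for one state s.
pi = {"left": 0.2, "right": 0.1, "up": 0.7}
q = {"left": 1.0, "right": 0.5, "up": 3.0}

# V_pi(s) = sum_a pi(a|s) * Q_pi(s, a)
v = sum(pi[a] * q[a] for a in pi)
print(v)  # 0.2*1.0 + 0.1*0.5 + 0.7*3.0 = 2.35
```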

5.3 Understanding the Value Functions

- Action-value function: $Q_\pi(s_t, a_t) = \mathbb{E}[U_t \mid S_t = s_t, A_t = a_t]$.
- For policy $\pi$, $Q_\pi(s, a)$ evaluates how good it is for an agent to pick action $a$ while being in state $s$.

- State-value function: $V_\pi(s_t) = \mathbb{E}_A[Q_\pi(s_t, A)]$
- For a fixed policy $\pi$, $V_\pi(s)$ evaluates how good the situation is in state $s$.
- $\mathbb{E}_S[V_\pi(S)]$ evaluates how good the policy $\pi$ is.

6. How does AI control the agent?

Suppose we have a good policy $\pi(a \mid s)$.
- Upon observing the state $s_t$,
- random sampling: $a_t \sim \pi(\cdot \mid s_t)$.

Suppose we know the optimal action-value function $Q^*(s, a)$.
- Upon observing the state $s_t$,
- choose the action that maximizes the value: $a_t = \mathrm{argmax}_a\, Q^*(s_t, a)$ (see the sketch below).
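
A hedged sketch of the two control modes above, using a made-up policy table and a made-up Q-table (neither comes from the original notes):

```python
import numpy as np

rng = np.random.default_rng()

policy = {"s": {"left": 0.2, "right": 0.1, "up": 0.7}}   # pi(a|s), made up
q_star = {"s": {"left": 1.0, "right": 0.5, "up": 3.0}}   # Q*(s,a), made up

def act_with_policy(state):
    """Policy-based control: sample a_t ~ pi(.|s_t)."""
    actions, probs = zip(*policy[state].items())
    return rng.choice(list(actions), p=list(probs))

def act_with_q(state):
    """Value-based control: a_t = argmax_a Q*(s_t, a)."""
    return max(q_star[state], key=q_star[state].get)

print(act_with_policy("s"))  # random: 'up' most of the time
print(act_with_q("s"))       # deterministic: always 'up' for this table
```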

7. Summary

Agent, Environment, State $s$, Action $a$, Reward $r$, Policy $\pi(a \mid s)$, State transition $p(s' \mid s, a)$.

Return: $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \ldots$

Action-value function: $Q_\pi(s_t, a_t) = \mathbb{E}[U_t \mid S_t = s_t, A_t = a_t]$.

Optimal action-value function: $Q^*(s_t, a_t) = \underset{\pi}{\max}\, Q_\pi(s_t, a_t)$.

State-value function: $V_\pi(s_t) = \mathbb{E}_A[Q_\pi(s_t, A)]$.


Reposted from blog.csdn.net/weixin_49716548/article/details/125960576