1. Terminologies
Reinforcement learning uses many technical terms. To get started with reinforcement learning, you must first understand them.
1] state and action
state $s$ (this frame)
action $a \in$ {left, right, up}
The agent is the entity that performs the action.
2] policy
policy $\pi$: according to the observed state, it makes decisions and controls the movement of the agent.
- Policy function $\pi: (s, a) \to [0, 1]$:
  $\pi(a|s) = P(A=a \mid S=s).$
- It is the probability of taking action $A=a$ given state $s$, e.g.,
  - $\pi(\mathrm{left} \mid s) = 0.2,$
  - $\pi(\mathrm{right} \mid s) = 0.1,$
  - $\pi(\mathrm{up} \mid s) = 0.7.$
- Upon observing state $S=s$, the agent's action $A$ can be random.
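The random action above can be sampled with a few lines of Python. This is a minimal sketch: the action names and probabilities are just the toy values from the example.

```python
import random

# Toy policy pi(.|s) for a single state; probabilities from the example above.
policy = {"left": 0.2, "right": 0.1, "up": 0.7}

def sample_action(policy, rng=random):
    """Sample one action a ~ pi(.|s) from a dict of action probabilities."""
    actions = list(policy.keys())
    probs = list(policy.values())
    return rng.choices(actions, weights=probs, k=1)[0]

action = sample_action(policy)
```

Over many samples, "up" comes out about 70% of the time, matching $\pi(\mathrm{up} \mid s) = 0.7$.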
3] reward
reward $R$
- Collect a coin: $R = +1$
- Win the game: $R = +10000$
- Touch a Goomba: $R = -10000$ (game over)
- Nothing happens: $R = 0$
4] state transition
old state $\overset{\text{action}}{\longrightarrow}$ new state
- E.g., the "up" action leads to a new state.
- State transition can be random.
- The randomness comes from the environment.
- $p(s'|s,a) = P(S'=s' \mid S=s, A=a).$
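The random state transition can also be sketched in code. The states and probabilities below are hypothetical, chosen only to show one draw of $S' \sim p(\cdot \mid s, a)$.

```python
import random

# Hypothetical transition probabilities p(s'|s, a) for one (state, action) pair:
# after taking "up" in state "s0", the environment lands in one of two states.
transitions = {("s0", "up"): {"s1": 0.8, "s2": 0.2}}

def sample_next_state(transitions, s, a, rng=random):
    """Sample s' ~ p(.|s, a); this randomness comes from the environment."""
    dist = transitions[(s, a)]
    states = list(dist.keys())
    probs = list(dist.values())
    return rng.choices(states, weights=probs, k=1)[0]

next_state = sample_next_state(transitions, "s0", "up")
```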
5] agent environment interaction
The environment here is the game program, the agent is Mario, and the state $s_t$ is what the environment tells us. In Super Mario, we can take the current frame as the state $s_t$. Upon seeing the state $s_t$, the agent makes an action $a_t$, which can be left, up, or right.
After making the action $a_t$, we get a new state and a reward $r_t$.
2. Randomness in Reinforcement Learning
1] Actions have randomness
- Given state $s$, the action can be random, e.g.,
  - $\pi(\mathrm{left} \mid s) = 0.2$
  - $\pi(\mathrm{right} \mid s) = 0.1$
  - $\pi(\mathrm{up} \mid s) = 0.7$
Actions are sampled from the policy function.
2] State transitions have randomness
- Given state $S=s$ and action $A=a$, the environment randomly generates a new state $S'$.
The new state is sampled from the state-transition function.
3. Play the game using AI
- Observe a frame (state $s_1$)
- ⇒ Make action $a_1$ (left, right, or up)
- ⇒ Observe a new frame (state $s_2$) and reward $r_1$
- ⇒ Make action $a_2$
- ⇒ ...
- (state, action, reward) trajectory:
  $s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T.$
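The observe–act–observe loop above can be sketched as a trajectory-collection function. The environment (`env_step`) and the uniform policy below are hypothetical stand-ins, just enough to run the loop once.

```python
import random

def play_episode(env_step, policy, s1, max_steps=100, rng=random):
    """Collect a (state, action, reward) trajectory s1,a1,r1,...,sT,aT,rT."""
    trajectory = []
    s = s1
    for _ in range(max_steps):
        actions, probs = zip(*policy(s).items())
        a = rng.choices(actions, weights=probs, k=1)[0]   # a_t ~ pi(.|s_t)
        s_next, r, done = env_step(s, a)                  # environment responds
        trajectory.append((s, a, r))
        if done:
            break
        s = s_next
    return trajectory

# Hypothetical two-step environment and uniform policy, just to run the loop.
def env_step(s, a):
    return s + 1, 1.0, s + 1 >= 2   # next state, reward, game over?

uniform = lambda s: {"left": 1 / 3, "right": 1 / 3, "up": 1 / 3}
traj = play_episode(env_step, uniform, s1=0)
```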
4. Rewards and Returns (important)
4.1 Return
Definition: Return (cumulative future reward)
- $U_t = R_t + R_{t+1} + R_{t+2} + R_{t+3} + \cdots$
Question: Are $R_t$ and $R_{t+1}$ equally important?
- Which of the following do you prefer?
  - I give you $100 right now.
  - I will give you $100 one year later.
- Future reward is less valuable than present reward.
- $R_{t+1}$ should be given less weight than $R_t$.
Definition: Discounted return (cumulative discounted future reward)
- $\gamma$: discount rate (a tuning hyper-parameter).
- $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \cdots$
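The discounted sum is usually computed backwards, since $U_t = R_t + \gamma U_{t+1}$. A minimal sketch over a list of rewards:

```python
def discounted_return(rewards, gamma):
    """U_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ..., computed backwards."""
    u = 0.0
    for r in reversed(rewards):
        u = r + gamma * u
    return u

# Example: rewards observed from time t onward, with gamma = 0.5:
# 1 + 0.5*2 + 0.25*3 = 2.75
u_t = discounted_return([1.0, 2.0, 3.0], gamma=0.5)
```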
4.2 Randomness in Returns
Definition: Discounted return (at time step $t$)
- $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \cdots$
At time step $t$, the return $U_t$ is random.
- Two sources of randomness:
  1. The action can be random: $P[A=a \mid S=s] = \pi(a|s).$
  2. The new state can be random: $P[S'=s' \mid S=s, A=a] = p(s'|s,a).$
- For any $i \geq t$, the reward $R_i$ depends on $S_i$ and $A_i$.
- Thus, given $s_t$, the return $U_t$ depends on the random variables:
  - $A_t, A_{t+1}, A_{t+2}, \ldots$ and $S_{t+1}, S_{t+2}, \ldots$
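This randomness is easy to see by simulation: starting from the same $s_t$, repeated rollouts give different realized returns. The one-step toy setup below (actions, transition probabilities, rewards) is entirely hypothetical.

```python
import random

rng = random.Random(0)

# Toy setup: from a fixed s_t, both the action A_t ~ pi and the next state
# S_{t+1} ~ p are random, so the realized return U_t varies across episodes.
def rollout_return(gamma=0.9):
    a = rng.choices(["up", "left"], weights=[0.7, 0.3], k=1)[0]
    r_t = 1.0 if a == "up" else 0.0
    s_next = rng.choices(["good", "bad"], weights=[0.8, 0.2], k=1)[0]
    r_next = 10.0 if s_next == "good" else -10.0
    return r_t + gamma * r_next

# Same s_t every time, yet the set of observed returns has several values.
returns = {rollout_return() for _ in range(50)}
```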
5. Value Function
5.1 Action-Value Function $Q(s,a)$
Definition: Return (cumulative future reward)
- $U_t = R_t + R_{t+1} + R_{t+2} + R_{t+3} + \cdots$
Definition: Action-value function for policy $\pi$
- $Q_\pi(s_t, a_t) = E[U_t \mid S_t=s_t, A_t=a_t]$
- The return $U_t$ (a random variable) depends on the actions $A_t, A_{t+1}, A_{t+2}, \ldots$ and the states $S_t, S_{t+1}, S_{t+2}, \ldots$
- Actions are random: $P[A=a \mid S=s] = \pi(a|s).$ (Policy function)
- States are random: $P[S'=s' \mid S=s, A=a] = p(s'|s,a).$ (State transition)
The action-value function tells us: if we use the policy $\pi$, how good or bad it is to take action $a_t$ in state $s_t$. Given the policy $\pi$, $Q_\pi$ can score every action $a$ in the current state.
Definition: Optimal action-value function
- $Q^*(s_t, a_t) = \max_{\pi} Q_\pi(s_t, a_t)$
- $Q^*$ evaluates each action $a$ and tells us the best action, regardless of which policy is used.
5.2 State-Value Function $V(s)$
Definition: State-value function
- $V_\pi(s_t) = E_A[Q_\pi(s_t, A)] = \sum_a \pi(a|s_t) \cdot Q_\pi(s_t, a).$ (Actions are discrete.)
- $V_\pi(s_t) = E_A[Q_\pi(s_t, A)] = \int \pi(a|s_t) \cdot Q_\pi(s_t, a)\, da.$ (Actions are continuous.)
$V_\pi(s_t)$ judges the current situation: it tells us, for example, whether we are likely to win or lose.
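For discrete actions, the expectation above is just a probability-weighted sum. A minimal sketch with hypothetical policy probabilities and Q-values for one state:

```python
def state_value(pi_s, q_s):
    """V_pi(s) = sum_a pi(a|s) * Q_pi(s, a) for a discrete action set."""
    return sum(p * q_s[a] for a, p in pi_s.items())

# Hypothetical pi(.|s) and Q_pi(s, .) for a single state s.
pi_s = {"left": 0.2, "right": 0.1, "up": 0.7}
q_s = {"left": 1.0, "right": -2.0, "up": 3.0}

# 0.2*1.0 + 0.1*(-2.0) + 0.7*3.0 = 2.1
v = state_value(pi_s, q_s)
```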
5.3 Understanding the Value Functions
- Action-value function: $Q_\pi(s_t, a_t) = E[U_t \mid S_t=s_t, A_t=a_t]$.
- For policy $\pi$, $Q_\pi(s,a)$ evaluates how good it is for the agent to pick action $a$ while being in state $s$.
- State-value function: $V_\pi(s_t) = E_A[Q_\pi(s_t, A)]$
- For a fixed policy $\pi$, $V_\pi(s)$ evaluates how good the situation is in state $s$.
- $E_S[V_\pi(S)]$ evaluates how good the policy $\pi$ is.
6. How does AI control the agent?
Suppose we have a good policy $\pi(a|s)$.
- Upon observing the state $s_t$,
- randomly sample: $a_t \sim \pi(\cdot \mid s_t)$.
Suppose we know the optimal action-value function $Q^*(s,a)$.
- Upon observing the state $s_t$,
- choose the action that maximizes the value: $a_t = \mathrm{argmax}_a\, Q^*(s_t, a)$.
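Both control strategies can be sketched side by side; the policy probabilities and Q-values below are hypothetical placeholders for one state $s_t$.

```python
import random

# Strategy 1: sample a_t ~ pi(.|s_t) from a known policy.
def policy_action(pi_s, rng=random):
    actions, probs = zip(*pi_s.items())
    return rng.choices(actions, weights=probs, k=1)[0]

# Strategy 2: act greedily w.r.t. a known optimal action-value function,
# a_t = argmax_a Q*(s_t, a).
def greedy_action(q_star_s):
    return max(q_star_s, key=q_star_s.get)

pi_s = {"left": 0.2, "right": 0.1, "up": 0.7}
q_star_s = {"left": 1.0, "right": -2.0, "up": 3.0}
a_greedy = greedy_action(q_star_s)  # "up"
```

Strategy 1 keeps the agent stochastic; strategy 2 is deterministic given $Q^*$.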
7. Summary
Agent, Environment, State $s$, Action $a$, Reward $r$, Policy $\pi(a|s)$, State transition $p(s'|s,a)$.
Return: $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \cdots$
Action-value function: $Q_\pi(s_t,a_t) = E[U_t \mid S_t=s_t, A_t=a_t]$.
Optimal action-value function: $Q^*(s_t,a_t) = \max_{\pi} Q_\pi(s_t,a_t)$.
State-value function: $V_\pi(s_t) = E_A[Q_\pi(s_t, A)]$.