Goals & Rewards
Rewards
Immediate reward: At each time step, the reward is a simple number, $R_t \in \mathbb{R}$.
Cumulative reward (return): The total amount of immediate reward the agent receives over time.
Reward Hypothesis: That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
Goals
The goal of an agent is to maximize the total amount of reward it receives in the long run, not merely the immediate reward.
Returns & Episodes
Episodes
The agent–environment interaction can break naturally into subsequences, which we call episodes. Tasks of this kind are called episodic tasks.
Returns
- Return—“sum”
In general, we define the cumulative reward as the expected return, where the return is denoted $G_t$. In the simplest case, if the episode is finite with a natural notion of a final time step, the return is the sum of the rewards:
$$G_t \doteq R_{t+1} + R_{t+2} + \cdots + R_T, \quad \text{where } T \text{ is the final time step.}$$
- $T$ is a random variable that normally varies from episode to episode.
However, in many cases the agent–environment interaction does not break naturally into identifiable episodes, but goes on continually without limit. We call these continuing tasks; their final time step is $T \rightarrow \infty$.
- Return—“discounted”
An alternative way to define the expected return is based on the concept of discounting. In this case, the agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized. The expected return then becomes the expected discounted return, where the discounted return is defined as:
$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \quad \text{where } 0 \leq \gamma \leq 1.$$
The discounted return can also be written in a recursive form, which is important for the theory and algorithms of reinforcement learning:
$$\begin{aligned} G_t &\doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots \\ &= R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots \right) \\ &= R_{t+1} + \gamma G_{t+1} \end{aligned}$$
Note that although the return is a sum of an infinite number of terms, it is still finite if the reward is nonzero and constant, provided $\gamma < 1$. For example, if the reward is a constant $+1$, then $G_t = \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma}$.
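The two forms of the discounted return, the direct sum and the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, can be checked against each other with a small sketch (the reward sequence and $\gamma$ below are illustrative values, not from the text):

```python
def discounted_return(rewards, gamma):
    """Direct sum: G_t = sum_k gamma^k * R_{t+k+1}."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def discounted_return_recursive(rewards, gamma):
    """Recursive form: G_t = R_{t+1} + gamma * G_{t+1},
    evaluated backwards from the final step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0, 3.0]  # illustrative episode rewards
gamma = 0.9
print(discounted_return(rewards, gamma))            # 4.807
print(discounted_return_recursive(rewards, gamma))  # 4.807 (same value)
```

Working backwards is exactly the recursion above: each pass folds one more reward into $\gamma G_{t+1}$.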
- Unified Notation
In order to unify the notation for episodic and continuing tasks, we introduce a special absorbing state that transitions only to itself and generates only rewards of zero. (In state-transition diagrams, the absorbing state is conventionally drawn as a solid square.)
Using the convention of omitting episode numbers when they are not needed, we can then write the return in a single form:
$$G_t \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k,$$
including the possibility that $T = \infty$ or $\gamma = 1$ (but not both).
Policies & Value Functions
Policy
Formally, a policy is a mapping from states to probabilities of selecting each possible action. If the agent follows policy $\pi$ at time $t$, then $\pi(a|s)$ is the probability that $A_t = a$ if $S_t = s$.
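In tabular form, a stochastic policy is just a table of probabilities $\pi(a|s)$, and acting means sampling $A_t \sim \pi(\cdot|S_t)$. A minimal sketch (the states `"s0"`, `"s1"` and actions `"left"`, `"right"` are made up for illustration):

```python
import random

# pi[s][a] = probability of taking action a in state s (rows sum to 1)
pi = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(pi, state, rng=random):
    """Draw A_t ~ pi(.|s) for the current state."""
    actions, probs = zip(*pi[state].items())
    return rng.choices(actions, weights=probs, k=1)[0]

action = sample_action(pi, "s0")  # "left" about 80% of the time
```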
Value Function
The value function of a state $s$ under policy $\pi$, denoted $v_{\pi}(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. For MDPs, we can define $v_{\pi}(s)$ formally by:
$$v_{\pi}(s) \doteq \mathbb{E}_{\pi}[G_t \mid S_t = s] = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right], \quad \text{for all } s \in \mathcal{S}.$$
It estimates “how good” it is for the agent to be in state $s$.
The value function can be written in recursive form using the Bellman equation:
$$\begin{aligned} v_{\pi}(s) &= \mathbb{E}_{\pi}[G_t \mid S_t = s], \quad \text{where } G_t = R_{t+1} + \gamma G_{t+1} \\ &= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\ &= \sum_{a} \pi(a|s) \sum_{s'} \sum_{r} p(s', r \mid s, a) \left[ r + \gamma \mathbb{E}_{\pi}[G_{t+1} \mid S_{t+1} = s'] \right] \\ &= \sum_{a} \pi(a|s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_{\pi}(s') \right], \quad \text{for all } s \in \mathcal{S}. \end{aligned}$$
Optimal value function:
$$v_*(s) \doteq \max_{\pi} \, v_{\pi}(s)$$
Action-value Function
The action-value function gives the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_{\pi}(s,a)$:
$$q_{\pi}(s,a) \doteq \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a] = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right].$$
It estimates “how good” it is for the agent to take action $a$ in state $s$.
Optimal action-value function:
$$q_*(s,a) \doteq \max_{\pi} \, q_{\pi}(s,a)$$
Unlike the state-value function, the optimal action-value function yields an optimal policy directly, without requiring the environment's dynamics:
$$\pi_*(s) \doteq \underset{a}{\arg\max} \ q_*(s,a)$$
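Extracting the greedy policy from a tabular $q_*$ is a per-state argmax over the available actions. A sketch with made-up Q-values:

```python
# q[(s, a)] = optimal action-value; the states, actions, and numbers
# below are illustrative only.
q = {
    ("s0", "left"): 1.0, ("s0", "right"): 2.5,
    ("s1", "left"): 0.3, ("s1", "right"): 0.1,
}

def greedy_policy(q):
    """pi*(s) = argmax_a q*(s, a), over the actions seen for each state."""
    best = {}
    for (s, a), value in q.items():
        if s not in best or value > q[(s, best[s])]:
            best[s] = a
    return best

print(greedy_policy(q))  # {'s0': 'right', 's1': 'left'}
```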
Advantage Function
The advantage function estimates how much better taking action $a$ is than the expected return obtained by following policy $\pi$ from state $s$:
$$A_{\pi}(s,a) \doteq Q_{\pi}(s,a) - V_{\pi}(s)$$
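Since $v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s,a)$, the advantage averages to zero under the policy: $\sum_a \pi(a|s)\, A_\pi(s,a) = 0$. A one-state sketch with illustrative numbers checks this:

```python
# pi(a|s) and q_pi(s, a) for a single state s; all numbers are made up.
pi_s = {"left": 0.25, "right": 0.75}
q_s = {"left": 1.0, "right": 2.0}

# v_pi(s) = E_pi[q_pi(s, A)] = sum_a pi(a|s) q_pi(s, a)
v_s = sum(pi_s[a] * q_s[a] for a in pi_s)

# A_pi(s, a) = q_pi(s, a) - v_pi(s)
advantage = {a: q_s[a] - v_s for a in q_s}

# Sanity check: the advantage is zero on average under pi.
assert abs(sum(pi_s[a] * advantage[a] for a in pi_s)) < 1e-12
```

A positive advantage marks an action that is better than the policy's average behavior in that state.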