Basic terminology in RL

Goals & Rewards

Rewards

Immediate reward: at each time step, the reward is a simple number, $R_t \in \mathbb{R}$.
Cumulative reward (return): the total amount of immediate reward the agent receives over time.

Reward Hypothesis: that all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).

Goals

The goal of the agent is to maximize the total amount of reward it receives in the long run, not the immediate reward at any single step.

Returns & Episodes

Episodes

The agent-environment interaction often breaks naturally into subsequences, which we call episodes. Tasks of this kind are called episodic tasks.

Returns

- Return: simple sum

In general, we want to maximize the expected return, where the return is denoted $G_t$. In the simplest case, if the episode is finite with a natural notion of a final time step, the return is simply the sum of the rewards:

$$G_t \doteq R_{t+1} + R_{t+2} + \cdots + R_T, \quad \text{where } T \text{ is the final time step.}$$

  • $T$ is a random variable that normally varies from episode to episode.
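
As a quick illustration with made-up numbers: if an episode lasts $T = 3$ steps and yields rewards $R_1 = 1$, $R_2 = 0$, $R_3 = 5$, then the return from the start of the episode is

$$G_0 = R_1 + R_2 + R_3 = 1 + 0 + 5 = 6.$$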

However, in many cases the agent-environment interaction does not break naturally into identifiable episodes, but goes on continually without limit. We call these continuing tasks; their final time step is $T \rightarrow \infty$.

- Return: discounted sum

Another way to define the return is based on the concept of discounting. In this case, the agent selects actions so as to maximize the sum of the discounted rewards it receives over the future. The expected return then becomes the expected discounted return, where the discounted return is defined as:

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \quad \text{where } 0 \leq \gamma \leq 1.$$

The discounted return can also be written in a recursive form, which is important for the theory and algorithms of reinforcement learning:

$$\begin{aligned} G_t &\doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots \\ &= R_{t+1} + \gamma\left(R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots\right) \\ &= R_{t+1} + \gamma G_{t+1} \end{aligned}$$

Note that although the return is a sum of an infinite number of terms, it is still finite if the reward is nonzero and constant, provided $\gamma < 1$. For example, if the reward is a constant $+1$, then $G_t = \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma}$.
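
As a concrete sketch (with a hypothetical reward sequence and discount factor, not anything taken from the text above), the two forms of the discounted return can be computed and compared in a few lines of Python:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Direct sum: G_t = sum_k gamma^k * R_{t+k+1}."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def returns_recursive(rewards, gamma):
    """Backward recursion G_t = R_{t+1} + gamma * G_{t+1}, giving G_0, ..., G_{T-1}."""
    G = 0.0
    out = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        out[t] = G
    return out

rewards = [1.0, 0.0, 5.0, 2.0]   # hypothetical rewards R_1, ..., R_T
gamma = 0.9

print(discounted_return(rewards, gamma))   # G_0 via the sum formula (truncated at T)
print(returns_recursive(rewards, gamma))   # G_0, ..., G_3 via the recursion
```

The first element of the recursive result matches the direct sum, illustrating $G_t = R_{t+1} + \gamma G_{t+1}$.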

- Unified notation

To unify the notation for episodic and continuing tasks, we introduce a special absorbing state that transitions only to itself and generates only rewards of zero. In the transition diagram below, the solid square represents this absorbing state.
[Figure: state-transition diagram of an episode ending in the absorbing state, drawn as a solid square]
With this convention, we can define the return by omitting episode numbers when they are not needed and allowing $\gamma = 1$ as long as the sum remains defined. Equivalently, we can write:

$$G_t \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k,$$

including the possibility that $T = \infty$ or $\gamma = 1$ (but not both).
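
To check that this single expression covers both cases: with $\gamma = 1$ and finite $T$ it reduces to the episodic sum, and with $T = \infty$ it reduces (after re-indexing with $k' = k - t - 1$) to the discounted return:

$$\sum_{k=t+1}^{T} 1^{\,k-t-1} R_k = R_{t+1} + \cdots + R_T, \qquad \sum_{k=t+1}^{\infty} \gamma^{k-t-1} R_k = \sum_{k'=0}^{\infty} \gamma^{k'} R_{t+k'+1}.$$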

Policies & Value Functions

Policy

Formally, a policy is a mapping from states to probabilities of selecting each possible action. If the agent follows policy $\pi$ at time $t$, then $\pi(a|s)$ is the probability that $A_t = a$ if $S_t = s$.
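
In the tabular case a policy can be represented simply as a table of probabilities, one row per state. Below is a minimal sketch with made-up states and actions (the names `s0`, `left`, etc. are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# pi(a|s) as a nested dict: for each state, a probability for each action.
policy = {
    "s0": {"left": 0.3, "right": 0.7},
    "s1": {"left": 0.9, "right": 0.1},
}

def sample_action(pi, state):
    """Draw A_t ~ pi(. | S_t = state)."""
    actions = list(pi[state])
    probs = [pi[state][a] for a in actions]
    return rng.choice(actions, p=probs)

print(sample_action(policy, "s0"))  # e.g. "right" with probability 0.7
```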

Value Function

The value function of a state $s$ under policy $\pi$ is denoted $v_\pi(s)$. For MDPs, we can define $v_\pi(s)$ formally by:

$$v_\pi(s) \doteq E_\pi[G_t \mid S_t = s] = E_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right], \quad \text{for all } s \in \mathcal{S}.$$

It estimates "how good" it is for the agent to be in state $s$.

The value function can also be written in a recursive form using the Bellman equation:

$$\begin{aligned} v_\pi(s) &= E_\pi[G_t \mid S_t = s], \quad \text{where } G_t = R_{t+1} + \gamma G_{t+1} \\ &= E_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\ &= \sum_a \pi(a|s) \sum_{s'} \sum_r p(s', r \mid s, a)\big[r + \gamma E_\pi[G_{t+1} \mid S_{t+1} = s']\big] \\ &= \sum_a \pi(a|s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_\pi(s')\big], \quad \text{for all } s \in \mathcal{S} \end{aligned}$$
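
As a minimal sketch of how this recursion is used in practice, the code below repeatedly applies the Bellman equation to evaluate an equiprobable policy on a tiny made-up two-state MDP (the dynamics `p` and the policy are invented for illustration):

```python
import numpy as np

# p[s][a] is a list of (prob, next_state, reward) triples, i.e. p(s', r | s, a).
p = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}  # equiprobable random policy
gamma = 0.9

v = np.zeros(len(p))
for _ in range(1000):  # sweep until the values stop changing (approximately)
    v_new = np.zeros_like(v)
    for s in p:
        for a, pi_sa in pi[s].items():
            for prob, s_next, r in p[s][a]:
                # v(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
                v_new[s] += pi_sa * prob * (r + gamma * v[s_next])
    v = v_new

print(v)  # approximate v_pi(s) for s = 0, 1
```

This is just iterative policy evaluation: each sweep replaces the current estimate of $v_\pi$ with the right-hand side of the Bellman equation.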

Optimal value function:
$$v_*(s) \doteq \max_\pi \, v_\pi(s)$$

Action-value Function

The action-value function gives the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_\pi(s, a)$:

$$q_\pi(s, a) \doteq E_\pi[G_t \mid S_t = s, A_t = a] = E_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$$

It estimates "how good" it is for the agent to take action $a$ in state $s$.
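
A useful companion identity (not derived above, but it follows by writing out the same expectation one step ahead, as in the Bellman equation for $v_\pi$) relates the two functions:

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_\pi(s')\big].$$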

Optimal action-value function:
$$q_*(s, a) \doteq \max_\pi \, q_\pi(s, a)$$
Unlike with the optimal state-value function, an optimal policy can be obtained directly from the optimal action-value function by acting greedily, without a model of the environment dynamics:
$$\pi_*(s) \doteq \underset{a}{\operatorname{argmax}} \ q_*(s, a)$$
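
In the tabular case this greedy read-off is a single `argmax` over a Q-table; the values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical optimal action-value table q_*(s, a): rows are states, columns are actions.
q_star = np.array([[1.0, 2.5],
                   [0.3, 0.1]])

pi_star = q_star.argmax(axis=1)  # pi_*(s) = argmax_a q_*(s, a)
print(pi_star)                   # -> [1 0]: action 1 in state 0, action 0 in state 1
```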

Advantage Function

The advantage function estimates how much better it is to take action $a$ in state $s$ (and follow $\pi$ thereafter) than to act according to $\pi$ from $s$, i.e., relative to the expected return $v_\pi(s)$:
$$A_\pi(s, a) \doteq q_\pi(s, a) - v_\pi(s)$$
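
Given tabular estimates of $q_\pi$ and the policy, the advantage follows directly, using the identity $v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s, a)$; the numbers below are invented for illustration:

```python
import numpy as np

q = np.array([[1.0, 2.0],     # hypothetical q_pi(s, a)
              [0.5, 0.5]])
pi = np.array([[0.25, 0.75],  # hypothetical pi(a|s)
               [0.50, 0.50]])

v = (pi * q).sum(axis=1)      # v_pi(s) = sum_a pi(a|s) q_pi(s, a)
advantage = q - v[:, None]    # A_pi(s, a) = q_pi(s, a) - v_pi(s)
print(advantage)
```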
