Goals & Rewards
Rewards
Immediate reward: At each time step, the reward is a simple number, $R_t \in \mathbb{R}$.
Cumulative reward (return): The total amount of immediate reward the agent receives over time.
Reward Hypothesis: That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
Goals
The goal of an agent is to maximize the total amount of reward it receives in the long run, not merely the immediate reward.
Returns & Episodes
Episodes
The agent–environment interaction can break naturally into subsequences, which we call episodes. Tasks of this kind are called episodic tasks.
Returns
- Return—“sum”
In general, we define the cumulative reward as the expected return, where the return is denoted $G_t$. In the simplest case, if the episode is finite with a natural notion of a final time step, the return is the sum of the rewards:
$$G_t \doteq R_{t+1} + R_{t+2} + \cdots + R_T, \quad \text{where } T \text{ is the final time step.}$$
- $T$ is a random variable that normally varies from episode to episode.
However, in many cases the agent–environment interaction does not break naturally into identifiable episodes, but goes on continually without limit. We call these continuing tasks; their final time step is $T \rightarrow \infty$.
- Return—“discounted”
An alternative way to define the expected return is based on the concept of discounting. In this case, the agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized. The expected return then becomes the expected discounted return, where the discounted return is defined as:
$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \quad \text{where } 0 \leq \gamma \leq 1.$$
The discounted return can also be written in a recursive form, which is important for the theory and algorithms of reinforcement learning:
$$\begin{aligned} G_t &\doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots \\ &= R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots \right) \\ &= R_{t+1} + \gamma G_{t+1} \end{aligned}$$
Note that although the return is a sum of an infinite number of terms, it is still finite if the reward is nonzero and constant, provided $\gamma < 1$. For example, if the reward is a constant $+1$, then $G_t = \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma}$.
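The two forms of the discounted return, the direct sum and the recursion $G_t = R_{t+1} + \gamma G_{t+1}$, can be checked against each other with a small sketch (the reward sequence and $\gamma$ below are illustrative values, not from the text):

```python
def discounted_return(rewards, gamma):
    """Direct sum: G_t = sum_k gamma^k * R_{t+k+1}."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def discounted_return_recursive(rewards, gamma):
    """Recursive form: G_t = R_{t+1} + gamma * G_{t+1},
    evaluated backwards from the final step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0, 3.0]  # illustrative episode rewards
gamma = 0.9
print(discounted_return(rewards, gamma))            # 4.807
print(discounted_return_recursive(rewards, gamma))  # 4.807 (same value)
```

Working backwards is exactly the recursion above: each pass folds one more reward into $\gamma G_{t+1}$.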
- Unified Notation
In order to unify the notation for episodic and continuing tasks, we introduce a special absorbing state that transitions only to itself and generates only rewards of zero. (In state-transition diagrams, the absorbing state is conventionally drawn as a solid square.)
Using the convention of omitting episode numbers when they are not needed, we can then write the return in a single form:
$$G_t \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k,$$
including the possibility that $T = \infty$ or $\gamma = 1$ (but not both).
Policies & Value Functions
Policy
Formally, a policy is a mapping from states to probabilities of selecting each possible action. If the agent follows policy $\pi$ at time $t$, then $\pi(a|s)$ is the probability that $A_t = a$ if $S_t = s$.
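In tabular form, a stochastic policy is just a table of probabilities $\pi(a|s)$, and acting means sampling $A_t \sim \pi(\cdot|S_t)$. A minimal sketch (the states `"s0"`, `"s1"` and actions `"left"`, `"right"` are made up for illustration):

```python
import random

# pi[s][a] = probability of taking action a in state s (rows sum to 1)
pi = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(pi, state, rng=random):
    """Draw A_t ~ pi(.|s) for the current state."""
    actions, probs = zip(*pi[state].items())
    return rng.choices(actions, weights=probs, k=1)[0]

action = sample_action(pi, "s0")  # "left" about 80% of the time
```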
Value Function
The value function of a state $s$ under policy $\pi$, denoted $v_{\pi}(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. For MDPs, we can define $v_{\pi}(s)$ formally by:
$$v_{\pi}(s) \doteq \mathbb{E}_{\pi}[G_t \mid S_t = s] = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right], \quad \text{for all } s \in \mathcal{S}.$$
It estimates “how good” it is for the agent to be in state $s$.
The value function can be written in recursive form using the Bellman equation:
$$\begin{aligned} v_{\pi}(s) &= \mathbb{E}_{\pi}[G_t \mid S_t = s], \quad \text{where } G_t = R_{t+1} + \gamma G_{t+1} \\ &= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\ &= \sum_{a} \pi(a|s) \sum_{s'} \sum_{r} p(s', r \mid s, a) \left[ r + \gamma \mathbb{E}_{\pi}[G_{t+1} \mid S_{t+1} = s'] \right] \\ &= \sum_{a} \pi(a|s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_{\pi}(s') \right], \quad \text{for all } s \in \mathcal{S}. \end{aligned}$$
Optimal value function:
$$v_*(s) \doteq \max_{\pi} \, v_{\pi}(s)$$
Action-value Function
The action-value function gives the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_{\pi}(s,a)$:
$$q_{\pi}(s,a) \doteq \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a] = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right].$$
It estimates “how good” it is for the agent to take action $a$ in state $s$.
Optimal action-value function:
$$q_*(s,a) \doteq \max_{\pi} \, q_{\pi}(s,a)$$
Unlike the state-value function, the optimal action-value function yields an optimal policy directly, without requiring the environment's dynamics:
$$\pi_*(s) \doteq \underset{a}{\arg\max} \ q_*(s,a)$$
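Extracting the greedy policy from a tabular $q_*$ is a per-state argmax over the available actions. A sketch with made-up Q-values:

```python
# q[(s, a)] = optimal action-value; the states, actions, and numbers
# below are illustrative only.
q = {
    ("s0", "left"): 1.0, ("s0", "right"): 2.5,
    ("s1", "left"): 0.3, ("s1", "right"): 0.1,
}

def greedy_policy(q):
    """pi*(s) = argmax_a q*(s, a), over the actions seen for each state."""
    best = {}
    for (s, a), value in q.items():
        if s not in best or value > q[(s, best[s])]:
            best[s] = a
    return best

print(greedy_policy(q))  # {'s0': 'right', 's1': 'left'}
```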
Advantage Function
The advantage function estimates how much better taking action $a$ is than the expected return obtained by following policy $\pi$ from state $s$:
$$A_{\pi}(s,a) \doteq Q_{\pi}(s,a) - V_{\pi}(s)$$
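Since $v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s,a)$, the advantage averages to zero under the policy: $\sum_a \pi(a|s)\, A_\pi(s,a) = 0$. A one-state sketch with illustrative numbers checks this:

```python
# pi(a|s) and q_pi(s, a) for a single state s; all numbers are made up.
pi_s = {"left": 0.25, "right": 0.75}
q_s = {"left": 1.0, "right": 2.0}

# v_pi(s) = E_pi[q_pi(s, A)] = sum_a pi(a|s) q_pi(s, a)
v_s = sum(pi_s[a] * q_s[a] for a in pi_s)

# A_pi(s, a) = q_pi(s, a) - v_pi(s)
advantage = {a: q_s[a] - v_s for a in q_s}

# Sanity check: the advantage is zero on average under pi.
assert abs(sum(pi_s[a] * advantage[a] for a in pi_s)) < 1e-12
```

A positive advantage marks an action that is better than the policy's average behavior in that state.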