Chapter 7 n-step Bootstrapping

The core idea is to take several more steps forward before bootstrapping.

7.1 n-step TD Prediction

The backup diagrams of n-step methods
Extending temporal-difference updates over n steps gives what are called n-step TD methods.

n-step returns

$G_{t:t+n} \doteq R_{t+1}+\gamma R_{t+2}+\cdots+\gamma^{n-1} R_{t+n}+\gamma^n V_{t+n-1}(S_{t+n})$

where $V_t:S \rightarrow \mathbb{R}$ is the estimate of $v_{\pi}$ at time t.

Because the return looks n steps ahead, the update cannot be made until $R_{t+n}$ has been observed and $V_{t+n-1}$ has been computed.

$V_{t+n}(S_t) \doteq V_{t+n-1}(S_t)+\alpha[G_{t:t+n}-V_{t+n-1}(S_t)], \qquad 0 \leq t \lt T$

n-step TD for estimating
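The n-step return and the update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the book's pseudocode: the function names, the dictionary-based value table, and the convention that the caller passes 0.0 as the bootstrap value when $S_{t+n}$ is terminal are all assumptions made for this sketch.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Compute R_{t+1} + gamma*R_{t+2} + ... + gamma^{n-1}*R_{t+n} + gamma^n * V(S_{t+n}).

    rewards is [R_{t+1}, ..., R_{t+n}]; bootstrap_value is V(S_{t+n}),
    or 0.0 if S_{t+n} is terminal (an assumption of this sketch).
    """
    g = bootstrap_value
    # Work backwards so that each step applies exactly one factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def n_step_td_update(V, state, rewards, bootstrap_value, alpha, gamma):
    """Apply V(S_t) <- V(S_t) + alpha * (G_{t:t+n} - V(S_t)) in place."""
    g = n_step_return(rewards, bootstrap_value, gamma)
    V[state] += alpha * (g - V[state])
    return g
```

With `rewards=[1.0, 2.0]`, `bootstrap_value=10.0` and `gamma=0.5`, the return is 1 + 0.5·(2 + 0.5·10) = 4.5.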

error reduction property of n-step returns
the worst error of the expected n-step return is guaranteed to be less than or equal to $\gamma^n$ times the worst error under $V_{t+n-1}$ :

$\underset{s}{\max}|\mathbb{E}_{\pi}[G_{t:t+n}|S_t=s]-v_{\pi}(s)| \leq \gamma^n \underset{s}{\max}|V_{t+n-1}(s)-v_{\pi}(s)|$

This shows that all n-step TD methods converge to the correct predictions under appropriate technical conditions.
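The error reduction property can be checked numerically on a small example. The sketch below assumes a hypothetical two-state Markov reward process (the transition matrix `P`, rewards `r`, and starting estimate `V` are made up for illustration) and uses the fact that the expected n-step return is what you get by applying the expected one-step backup n times to the current estimate.

```python
def bellman_backup(V, P, r, gamma):
    """One expected step: (T V)(s) = r[s] + gamma * sum_s' P[s][s'] * V[s']."""
    return [r[s] + gamma * sum(p * v for p, v in zip(P[s], V))
            for s in range(len(V))]


def expected_n_step_return(V, P, r, gamma, n):
    """E[G_{t:t+n} | S_t = s] for every s, via n expected backups of V."""
    for _ in range(n):
        V = bellman_backup(V, P, r, gamma)
    return V


def max_error(V, v_pi):
    """Worst-case absolute error over states."""
    return max(abs(a - b) for a, b in zip(V, v_pi))


# Hypothetical two-state chain induced by some fixed policy pi.
P = [[0.5, 0.5], [0.2, 0.8]]   # transition probabilities
r = [1.0, 0.0]                 # expected reward on leaving each state
gamma = 0.9

# Approximate v_pi by iterating the backup until it has converged.
v_pi = expected_n_step_return([0.0, 0.0], P, r, gamma, 2000)

V = [0.0, 5.0]                 # arbitrary current estimate
for n in (1, 2, 5):
    lhs = max_error(expected_n_step_return(V, P, r, gamma, n), v_pi)
    rhs = gamma ** n * max_error(V, v_pi)
    print(n, lhs <= rhs + 1e-12)   # small tolerance for floating point
```

The bound holds because each expected backup averages the error over next states and then scales it by $\gamma$.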

7.2 n-step Sarsa

Compared with the Sarsa introduced earlier, the only change is that G becomes the n-step return:

$G_{t:t+n} \doteq R_{t+1}+\gamma R_{t+2}+\cdots+\gamma^{n-1} R_{t+n}+\gamma^n Q_{t+n-1}(S_{t+n},A_{t+n}), \qquad n \geq 1,0 \leq t \lt T-n$

The update rule is essentially unchanged:

$Q_{t+n}(S_t,A_t) \doteq Q_{t+n-1}(S_t,A_t)+\alpha[G_{t:t+n}-Q_{t+n-1}(S_t,A_t)], \qquad 0 \leq t \lt T$

The backup diagrams for the spectrum of n-step methods for state-action values

Expected Sarsa, shown in the figure above, is like n-step Sarsa except for the last term:

$G_{t:t+n} \doteq R_{t+1}+\cdots+\gamma^{n-1}R_{t+n}+\gamma^n \bar V_{t+n-1}(S_{t+n}), \qquad t+n \lt T,$

with

$G_{t:t+n} \doteq G_t \text{ for } t+n \geq T$,

where $\bar V_t(s)$ is the expected approximate value of state s under the target policy:

$\bar V_t(s) \doteq \sum_a \pi(a|s)Q_t(s,a), \qquad \text{for all } s \in S$
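As a rough sketch, the expected approximate value and the n-step Expected Sarsa return could be computed as follows. The data layout is an assumption made for illustration: `Q` is a dictionary keyed by `(state, action)` and `pi[state]` is a dictionary mapping each action to its probability.

```python
def expected_value(Q, pi, state):
    """V_bar(s) = sum_a pi(a|s) * Q(s, a)."""
    return sum(prob * Q[(state, action)] for action, prob in pi[state].items())


def n_step_expected_sarsa_return(rewards, Q, pi, last_state, gamma):
    """R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n * V_bar(S_{t+n}).

    rewards is [R_{t+1}, ..., R_{t+n}]; last_state is S_{t+n}.
    Covers only the case t + n < T (S_{t+n} is not terminal).
    """
    g = expected_value(Q, pi, last_state)
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

The only difference from the n-step Sarsa return is the bootstrap term: an expectation over all actions in $S_{t+n}$ rather than the value of the single sampled action $A_{t+n}$.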

7.3 n-step Off-policy Learning by Importance Sampling

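This section gives a good introduction to off-policy learning. Off-policy learning means learning the value function of one policy $\pi$ while following experience generated by another policy b. Typically $\pi$ is the greedy policy with respect to the current action-value estimates, and b is a more exploratory policy, perhaps $\varepsilon\text{-greedy}$.

The importance sampling ratio is again required: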

$\rho_{t:h} \doteq \prod_{k=t}^{\min(h,T-1)} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}$
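A minimal sketch of the ratio, assuming the trajectory is stored as two lists `states` and `actions` indexed by time step and that `pi[s][a]` and `b[s][a]` hold the action probabilities of the target and behavior policies (these names and layouts are illustrative assumptions):

```python
def importance_ratio(pi, b, states, actions, t, h, T):
    """rho_{t:h}: product of pi(A_k|S_k) / b(A_k|S_k) for k = t .. min(h, T-1)."""
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        s, a = states[k], actions[k]
        rho *= pi[s][a] / b[s][a]
    return rho
```

If any action along the way would never be taken by $\pi$, the whole ratio is zero and the corresponding n-step return is ignored entirely. For example, with a greedy $\pi$ that always picks `"a"` and a uniform b over two actions, two consecutive `"a"` actions give a ratio of 4.0, while a single `"b"` anywhere in the window gives 0.0.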

Update rule:

$V_{t+n}(S_t) \doteq V_{t+n-1}(S_t)+\alpha \rho_{t:t+n-1}[G_{t:t+n}-V_{t+n-1}(S_t)], \qquad 0 \leq t \lt T$

Off-policy form of n-step Sarsa. The ratio starts and ends one step later than for n-step TD, because a state-action pair is being updated and $A_t$ has already been selected:

$Q_{t+n}(S_t,A_t) \doteq Q_{t+n-1}(S_t,A_t)+\alpha \rho_{t+1:t+n}[G_{t:t+n}-Q_{t+n-1}(S_t,A_t)], \qquad 0 \leq t \lt T$

off-policy n-step Sarsa

7.4 *Per-decision Off-policy Methods with Control Variates

A more sophisticated approach would use per-decision importance sampling ideas

The n-step return can be written recursively as

$G_{t:h} = R_{t+1}+\gamma G_{t+1:h}, \qquad t \lt h \lt T,$

Off-policy definition of the n-step return ending at horizon h:

$G_{t:h} \doteq \rho_t(R_{t+1}+\gamma G_{t+1:h})+(1-\rho_t)V_{h-1}(S_t), \qquad t \lt h \lt T, \tag {7.13}$

together with

$G_{h:h} \doteq V_{h-1}(S_h)$

where $\rho_t = \frac{\pi(A_t|S_t)}{b(A_t|S_t)}$ is the single-step ratio. The second term in (7.13) is called a control variate. It does not change the expected update: as noted in Section 5.9, the importance sampling ratio has expected value 1, so $(1-\rho_t)V_{h-1}(S_t)$ has expected value zero.
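The recursion in (7.13) unrolls naturally from the horizon backwards. The following sketch assumes the per-step quantities are stored in lists indexed by time step, with `rewards[k]` holding $R_{k+1}$, `rhos[k]` holding $\rho_k$, and `values[k]` holding the current estimate $V(S_k)$. These names and the indexing convention are assumptions of this sketch, and it covers only the case $h \lt T$.

```python
def control_variate_return(rewards, rhos, values, t, h, gamma):
    """G_{t:h} from (7.13), starting from G_{h:h} = V(S_h).

    rewards[k] = R_{k+1}, rhos[k] = rho_k, values[k] = V(S_k).
    """
    g = values[h]
    # Walk back from step h-1 to step t, applying (7.13) once per step.
    for k in range(h - 1, t - 1, -1):
        g = rhos[k] * (rewards[k] + gamma * g) + (1.0 - rhos[k]) * values[k]
    return g
```

When every ratio is 1 this reduces to the ordinary n-step return. When $\rho_t = 0$ the target collapses to $V(S_t)$, so the update leaves the estimate unchanged instead of shrinking it toward zero, which is where the variance reduction comes from.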

An off-policy form with control variates

$\begin{aligned} G_{t:h} &\doteq R_{t+1}+\gamma(\rho_{t+1}G_{t+1:h}+\bar V_{h-1}(S_{t+1})-\rho_{t+1}Q_{h-1}(S_{t+1},A_{t+1})) \\ & = R_{t+1}+\gamma \rho_{t+1}(G_{t+1:h}-Q_{h-1}(S_{t+1},A_{t+1}))+\gamma \bar V_{h-1}(S_{t+1}), \qquad t \lt h \leq T. \end{aligned}$
If $h \lt T$, the recursion ends with $G_{h:h} \doteq Q_{h-1}(S_h,A_h)$; if $h \geq T$, the recursion ends with $G_{T-1:h} \doteq R_T$.

Control variates are a variance-reduction technique.

7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm

An off-policy method that needs no importance sampling.
tree-backup update

General form of the tree-backup n-step return:

$G_{t:t+n} \doteq R_{t+1}+\gamma \sum_{a \neq A_{t+1}} \pi(a|S_{t+1})Q_{t+n-1}(S_{t+1},a)+\gamma \pi(A_{t+1}|S_{t+1})G_{t+1:t+n}, \qquad t \lt T-1, n \geq 2$

The n=1 case is the Expected Sarsa target,

$G_{t:t+1} \doteq R_{t+1}+\gamma \sum_a \pi(a|S_{t+1})Q_t(S_{t+1},a), \qquad t \lt T-1$

and at the end of the episode

$G_{T-1:t+n} \doteq R_T$

This target is then used with the usual action-value update rule from n-step Sarsa:

$Q_{t+n}(S_t,A_t) \doteq Q_{t+n-1}(S_t,A_t)+\alpha[G_{t:t+n}-Q_{t+n-1}(S_t,A_t)], \qquad 0 \leq t \lt T,$

n-step Tree Backup for estimating
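A compact way to see how the tree-backup return is assembled is to compute it backwards from the horizon. The sketch below is not the book's algorithm box. It assumes a stored trajectory (`states[k]`, `actions[k]`, and `rewards[k]` holding $R_{k+1}$), a dictionary `Q` keyed by `(state, action)`, and `pi[state]` mapping actions to probabilities, all of which are illustrative choices.

```python
def tree_backup_return(rewards, states, actions, Q, pi, t, h, T, gamma):
    """Tree-backup return G_{t:h}, computed backwards from the horizon h."""
    if h >= T:
        # The episode ends inside the window: G_{T-1:h} = R_T.
        g = rewards[T - 1]
        start = T - 2
    else:
        # One-step (Expected Sarsa) target at the last step before h.
        s_h = states[h]
        g = rewards[h - 1] + gamma * sum(p * Q[(s_h, a)] for a, p in pi[s_h].items())
        start = h - 2
    for k in range(start, t - 1, -1):
        s1, a1 = states[k + 1], actions[k + 1]
        # Actions not taken contribute their current estimates (the leaves).
        leaves = sum(p * Q[(s1, a)] for a, p in pi[s1].items() if a != a1)
        # The action taken contributes the deeper return, weighted by pi.
        g = rewards[k] + gamma * leaves + gamma * pi[s1][a1] * g
    return g
```

No behavior-policy probabilities appear anywhere, which is the point of the method: every branch is weighted by $\pi$ alone.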

7.6 *A Unifying Algorithm: n-step $Q(\sigma)$

As in the previous sections, only the way of looking ahead changes and everything else stays the same; see the figure below.
The backup diagrams

Rewrite the tree-backup return (7.16) in terms of the horizon $h=t+n$ as follows:

$\begin{aligned} G_{t:h} & = R_{t+1}+\gamma \sum_{a \neq A_{t+1}} \pi(a|S_{t+1})Q_{h-1}(S_{t+1},a)+\gamma \pi(A_{t+1}|S_{t+1})G_{t+1:h}\\ & = R_{t+1}+\gamma \bar V_{h-1}(S_{t+1})-\gamma \pi(A_{t+1}|S_{t+1})Q_{h-1}(S_{t+1},A_{t+1})+\gamma \pi(A_{t+1}|S_{t+1})G_{t+1:h}\\ & = R_{t+1}+\gamma \pi(A_{t+1}|S_{t+1})(G_{t+1:h}-Q_{h-1}(S_{t+1},A_{t+1}))+\gamma \bar V_{h-1}(S_{t+1}), \end{aligned}$

This is n-step Sarsa with control variates, except with the action probability $\pi(A_{t+1}|S_{t+1})$ in place of the importance-sampling ratio $\rho_{t+1}$. Sliding linearly between the two with a degree of sampling $\sigma_{t+1} \in [0,1]$ gives

$G_{t:h} \doteq R_{t+1}+\gamma(\sigma_{t+1}\rho_{t+1}+(1-\sigma_{t+1})\pi(A_{t+1}|S_{t+1}))(G_{t+1:h}-Q_{h-1}(S_{t+1},A_{t+1}))+\gamma \bar V_{h-1}(S_{t+1})$

for $t \lt h \leq T$. If $h \lt T$, the recursion ends with $G_{h:h} \doteq Q_{h-1}(S_h,A_h)$; if $h=T$, it ends with $G_{T-1:T} \doteq R_T$.

Off-policy n-step $Q(\sigma)$
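The unified return can be sketched in the same backward style. Here the degree of sampling is written `sigma` (1 for full sampling as in Sarsa, 0 for pure expectation as in tree backup) and is assumed to be given as a list indexed by time step. The trajectory layout (`states[k]`, `actions[k]`, `rewards[k]` holding $R_{k+1}$) and the dictionaries `Q`, `pi`, and `b` are illustrative assumptions, and the recursion is taken to end with $Q(S_h, A_h)$ when $h \lt T$ and with $R_T$ when $h = T$.

```python
def q_sigma_return(rewards, states, actions, Q, pi, b, sigma, t, h, T, gamma):
    """Unified n-step return G_{t:h} for t < h <= T, computed backwards from h."""
    if h == T:
        g, start = rewards[T - 1], T - 2             # G_{T-1:T} = R_T
    else:
        g, start = Q[(states[h], actions[h])], h - 1  # G_{h:h} = Q(S_h, A_h)
    for k in range(start, t - 1, -1):
        s1, a1 = states[k + 1], actions[k + 1]
        v_bar = sum(p * Q[(s1, a)] for a, p in pi[s1].items())
        rho = pi[s1][a1] / b[s1][a1]
        # Blend the sampling weight (rho) with the expectation weight (pi).
        weight = sigma[k + 1] * rho + (1.0 - sigma[k + 1]) * pi[s1][a1]
        g = rewards[k] + gamma * weight * (g - Q[(s1, a1)]) + gamma * v_bar
    return g
```

Setting every `sigma` entry to 0 removes the behavior policy from the computation and recovers the tree-backup return, while setting every entry to 1 gives n-step Sarsa with control variates.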