ADPRL - Approximate Dynamic Programming and Reinforcement Learning - Note 7 - Approximate Dynamic Programming

7. Approximate Dynamic Programming

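In the previous notes, we studied the theoretical foundations of the classic DP algorithms and their advanced variants. Although these algorithms have good theoretical properties, in many real applications they remain inefficient or even impractical. This is mainly due to the curse of dimensionality, which can impose a heavy burden on either storage or computation. One challenging application of SDM is edge computing, where computation and data storage are pushed towards the data sources. Clearly, the computing power and storage capacity available at the edge are very limited for any classic DP algorithm. More specifically, this note focuses on the storage aspect of the curse of dimensionality.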

Figure 7: An edge computing application.

7.1 Approximation architectures

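The main idea of total cost function approximation is to construct a relatively low-dimensional parameterised space in which to approximate the total cost function. Specifically, a parameterised total cost function approximator maps a state $x\in\mathcal{X}$ to an estimate of the total cost, i.e.,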

$J: \mathcal{X} \times \mathbb{R}^{m} \rightarrow \mathbb{R}, \quad(x, \theta) \mapsto J(x, \theta) \tag{7.1}$

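where $\theta$ is the parameter (vector) of the approximation architecture. With such an architecture, the aim is to reduce the dimension of the parameter space from a $K$-dimensional vector space to an $m$-dimensional one with $m\ll K$, i.e., to mitigate the curse of dimensionality. Note that the true total cost function does not necessarily lie in such an approximation space.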

7.1.1 Linear Function Approximation (LFA)

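A simple framework for total cost function approximation is linear function approximation (LFA). It constructs a set of real-valued features that capture properties of every state, i.e.,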

$\phi: \mathcal{X} \rightarrow \mathbb{R}^{m}, \quad x \mapsto \phi(x) \tag{7.2}$

and then approximates the total cost function as a linear combination of these features, i.e.,
$J_{l}: \mathcal{X} \times \mathbb{R}^{m} \rightarrow \mathbb{R}, \quad(x, \theta) \mapsto \theta^{\top} \phi(x) \tag{7.3}$

By collecting all feature vectors into a matrix, i.e.,
$\Phi(x):=\left[\phi\left(x_{1}\right), \ldots, \phi\left(x_{K}\right)\right] \in \mathbb{R}^{m \times K}, \tag{7.4}$

we can identify the space of approximate total cost functions with the row span of $\Phi(x)$, i.e.,
$\mathcal{J}_{l}:=\left\{\Phi(x)^{\top} h \mid x \in \mathcal{X}, h \in \mathbb{R}^{m}\right\} \subset \mathbb{R}^{K} \tag{7.5}$

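Naturally, we assume that all feature vectors are distinct, so that the linear approximation of the total cost function can still tell states apart when inducing policies. To ensure that every approximation $J\in\mathcal{J}_{l}$ is uniquely represented by some $h\in\mathbb{R}^{m}$ with respect to the given feature matrix $\Phi(x)$, the linear map $h \mapsto \Phi(x)^{\top} h$ needs to be injective (one-to-one). Hence, we assume that the feature matrix has full rank.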

Assumption 7.1 Rank of the feature matrix

The feature matrix $\Phi(x)\in \mathbb{R}^{m \times K}$ has rank $m$.

Interestingly, the classic DP algorithms can easily be modelled as a linear total cost function approximation. Let us define the tabular lookup features as
$\phi^{\text {table }}(x):=\left(\mathbf{1}_{x_{1}}(x), \ldots, \mathbf{1}_{x_{K}}(x)\right)^{\top}, \tag{7.6}$

where the indicator function $\mathbf{1}_{x_{i}}: \mathcal{X} \rightarrow\{0,1\}$ is defined as
$\mathbf{1}_{x_{i}}(x):= \begin{cases}1, & \text { if } x=x_{i} \\ 0, & \text { otherwise }\end{cases} \tag{7.7}$
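To make the tabular case concrete, here is a minimal sketch in Python with NumPy. The number of states `K` and the parameter values in `theta` are hypothetical and chosen only for illustration; the point is that with tabular lookup features the feature matrix is the identity, so each parameter is simply the stored cost of one state.

```python
import numpy as np

K = 4  # hypothetical number of states x_1, ..., x_K

def phi_table(i, K):
    """Tabular lookup feature of state x_i (0-based index i): a one-hot vector, Eq. (7.6)."""
    e = np.zeros(K)
    e[i] = 1.0
    return e

# Feature matrix with one feature vector per column, as in Eq. (7.4).
Phi_table = np.column_stack([phi_table(i, K) for i in range(K)])

theta = np.array([3.0, 1.0, 4.0, 1.5])  # hypothetical parameters, one per state
J = Phi_table.T @ theta                  # J(x_i) = theta^T phi(x_i), Eq. (7.3)

print(Phi_table)                         # the K x K identity matrix
print(J)                                 # equals theta: a plain lookup table
print(np.linalg.matrix_rank(Phi_table))  # K, so the rank assumption holds with m = K
```

In this special case $m = K$, so nothing is saved in storage; the benefit of LFA only appears once $m \ll K$.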

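Although the finiteness of the state space assumed in the MDP model underlies the DP and RL solution methods, many engineering problems, such as those in robotics, do not enjoy this convenience and have continuous state spaces. To make DP or RL methods feasible for MDPs with continuous state spaces, a mechanism for constructing a linear total cost function approximation space is of great practical use. One of the most popular techniques is tile coding, illustrated in Figure 8.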

Figure 8: Tile coding. For an MDP with a continuous state space, each disc represents a local indicator function used to generate features.

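Specifically, the main idea of tile coding is to assume that neighbouring states are equally important and hence have similar values of the total cost function. Let us define open subsets $\mathcal{N}_{k} \subset \mathcal{X}$ of the continuous state space $\mathcal{X}$, for $k=1, \ldots, \tau$. We assume that the union of these open subsets covers the state space, i.e.,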

$\bigcup_{k=1}^{\tau} \mathcal{N}_{k}=\mathcal{X} \tag{7.8}$

We can define indicator functions in the same way as

$\mathbf{1}_{\mathcal{N}_{i}}(x):= \begin{cases}1, & \text { if } x \in \mathcal{N}_{i} \\ 0, & \text { otherwise }\end{cases}$

A tile coding feature vector can then be constructed as

$\phi^{\text {tile }}(x):=\left(\mathbf{1}_{\mathcal{N}_{1}}(x), \ldots, \mathbf{1}_{\mathcal{N}_{\tau}}(x)\right)^{\top} .$
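The construction can be sketched for a one-dimensional state space. Everything concrete here is an assumption made for illustration: the state space $[0, 1]$, the five overlapping open intervals used as tiles, and the parameter vector `theta`.

```python
import numpy as np

def tile_features(x, centers, radius):
    """phi_tile(x): one indicator per tile N_k = (c_k - radius, c_k + radius)."""
    return (np.abs(x - centers) < radius).astype(float)

# Hypothetical tiling of the state space [0, 1] by tau = 5 overlapping open intervals.
centers = np.linspace(0.0, 1.0, 5)   # 0, 0.25, 0.5, 0.75, 1
radius = 0.2

theta = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical parameters, one per tile

def J_tile(x):
    """Linear approximation theta^T phi_tile(x) of the total cost at state x."""
    return theta @ tile_features(x, centers, radius)

print(tile_features(0.1, centers, radius))   # x = 0.1 lies in the first two tiles
print(J_tile(0.1))                           # sum of the two active parameters
```

Because the tiles overlap, a state can activate more than one feature, which lets nearby states share parameters while still being distinguishable.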

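With this construction of a linear total cost function approximation, the classic DP algorithms can then be built on top of it directly.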

7.1.2 Neural Function Approximation

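Recent developments in deep reinforcement learning (DRL) have shown that neural networks (NNs) perform remarkably well on challenging RL problems with large or continuous state spaces. Historically, NNs have been used to approximate total cost functions for decades. The recent success of NNs on challenging problems such as pattern recognition, computer vision and speech recognition has further spurred efforts to apply NNs to total cost function approximation. NN-based total cost function approximation (NN-VFA) has demonstrated strong performance in many challenging domains, e.g., Atari games and the game of Go. Despite this progress, a thorough understanding of NN-VFA-based algorithms remains an open problem, and there is great demand for more challenging applications.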

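A classic NN consists of many connected basic computing units called neurons, see Figure 9, arranged in layers, as in the multi-layer perceptron (MLP), see Figure 10.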

Figure 9: Illustration of a single perceptron neuron.

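Let $\sigma: \mathbb{R} \rightarrow \mathbb{R}$ be a unit activation function, traditionally chosen to be non-constant, bounded, continuous and monotonically increasing. Common examples are the sigmoid function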

$\sigma(x)=\frac{1}{1+e^{-x}}, \tag{7.9}$

and the rectified linear unit (ReLU), which is unbounded and therefore drops the boundedness requirement,

$\sigma(x)= \begin{cases}0, & \text { for } x \leq 0 \\ x, & \text { for } x>0\end{cases} \tag{7.10}$

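More precisely, let $L$ denote the number of layers in the MLP architecture and $n_{l}$ the number of processing units in the $l$-th layer, $l=1, \ldots, L$. By $l = 0$ we refer to the input layer, and we set $n_{0}=K$ and $n_{L}=1$. We denote by $\sigma^{\prime}: \mathbb{R} \rightarrow \mathbb{R}$ the first derivative of the activation function $\sigma$.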

For the $(l, k)$-th unit of the MLP, i.e., the $k$-th unit in the $l$-th layer, we define the corresponding unit mapping as

$f_{l, k}\left(w_{l, k}, \phi_{l-1}\right):=\sigma\left(w_{l, k}^{\top} \phi_{l-1}-b_{l, k}\right), \tag{7.11}$

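where $\phi_{l-1} \in \mathbb{R}^{n_{l-1}}$ denotes the output of the $(l-1)$-th layer, and $w_{l, k} \in \mathbb{R}^{n_{l-1}}$ and $b_{l, k} \in \mathbb{R}$ are the weight vector and the bias associated with the $(l, k)$-th unit, respectively. Note that, in general, the bias $b_{l, k}$ is a free variable associated with a dummy unit. Throughout the following analysis, however, we fix it as a constant scalar in Eq. (7.11) for convenience of presentation. We can then define the $l$-th layer mapping simply by stacking all unit mappings of that layer, i.e.,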

$f_{l}\left(W_{l}, \phi_{l-1}\right):=\left[f_{l, 1}\left(w_{l, 1}, \phi_{l-1}\right), \ldots, f_{l, n_{l}}\left(w_{l, n_{l}}, \phi_{l-1}\right)\right]^{\top} \tag{7.12}$

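where $W_{l}:=\left[w_{l, 1}, \ldots, w_{l, n_{l}}\right] \in \mathbb{R}^{n_{l-1} \times n_{l}}$ is the $l$-th weight matrix. Specifically, denoting the input by $\phi_{0} \in \mathbb{R}^{K}$, the output of the $l$-th layer is defined recursively as $\phi_{l}:=f_{l}\left(W_{l}, \phi_{l-1}\right)$. Note that the last layer of an MLP usually uses the identity map as its activation function, i.e., $\phi_{L}:=W_{L}^{\top} \phi_{L-1}$. Finally, let $\mathcal{W}:=\mathbb{R}^{K \times n_{1}} \times \ldots \times \mathbb{R}^{n_{L-1} \times n_{L}}$ denote the set of all parameter matrices of the MLP. Composing all layer-wise mappings defines the overall MLP network mapping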

$f: \mathcal{W} \times \mathbb{R}^{K} \rightarrow \mathbb{R}, \quad(\mathbf{W}, x) \mapsto f_{L}\left(W_{L}, \cdot\right) \circ \ldots \circ f_{1}\left(W_{1}, x\right) \tag{7.13}$

Figure 10: An MLP with two hidden layers.

With this construction, we can define the set of parameterised total cost functions specified by a given MLP architecture as
$\mathcal{F}:=\left\{f(\mathbf{W}, \cdot): \mathbb{R}^{K} \rightarrow \mathbb{R} \mid \mathbf{W} \in \mathcal{W}\right\} \tag{7.14}$

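More specifically, we write $\mathcal{F}\left(K, n_{1}, \ldots, n_{L-1}, 1\right)$ for the MLP architecture, i.e., the number of units in each layer. Let $F_{x}(\mathbf{W}):=f(\mathbf{W}, x)$ denote the evaluation of the MLP with parameters $\mathbf{W}$ at state $x$, and let $F(x, \mathbf{W}):=\left[F_{x_{1}}(\mathbf{W}), \ldots, F_{x_{K}}(\mathbf{W})\right]^{\top} \in \mathbb{R}^{K}$ denote an approximate total cost function. The set of approximate total cost functions of the MLP is then defined as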

$\mathcal{J}_{n}:=\{F(x, \mathbf{W}) \mid x \in \mathcal{X}, \mathbf{W} \in \mathcal{W}\} \subset \mathbb{R}^{K} \tag{7.15}$
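The layer-wise composition in Eq. (7.13) can be sketched as a short forward pass. This is only an illustration under assumptions: the layer sizes, the random weights and the zero biases are hypothetical, and the helper name `mlp_forward` is not from the notes. Hidden units follow Eq. (7.11), including the subtracted bias, and the last layer is linear.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation, Eq. (7.9)."""
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(weights, biases, x, act=sigmoid):
    """Evaluate f(W, x) by composing the layer mappings, Eq. (7.13).

    weights[l] has shape (n_l, n_{l+1}); biases holds one vector per hidden layer.
    Hidden layers compute act(W^T phi - b) as in Eq. (7.11); the last layer is
    the linear map W_L^T phi_{L-1}.
    """
    phi = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases):
        phi = act(W.T @ phi - b)
    return (weights[-1].T @ phi).item()

# Hypothetical architecture with two hidden layers: n_0 = 3, n_1 = 4, n_2 = 2, n_3 = 1.
rng = np.random.default_rng(0)
sizes = [3, 4, 2, 1]
weights = [rng.standard_normal((n_in, n_out)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:-1]]

x = np.array([1.0, 0.0, 0.0])            # hypothetical input vector
print(mlp_forward(weights, biases, x))   # scalar total cost estimate for this input
```

Evaluating this function at every state and stacking the results gives one element of the set of approximate total cost functions.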

7.2 Bellman Residual Minimisation

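Once a specific approximation architecture for the total cost function has been chosen, it is important to investigate both the quality of the best approximation of the optimal total cost function and the methods for finding it.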

Definition 7.1 Direct estimate of the optimal total cost function

Given an infinite horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a closed subset $\mathcal{J} \subset \mathbb{R}^{K}$, an optimal direct estimate $J_{D}$ of the optimal total cost function $J^{*}$ is

$J_{D} \in \underset{J \in \mathcal{J}}{\operatorname{argmin}}\left\|J-J^{*}\right\|_{\infty} \tag{7.16}$

The quality of the solution $J_{D}$ is measured by the true total cost of its greedily induced policy (GIP).

Proposition 7.1 Bound for the optimal direct estimate

Given an infinite horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a closed subset $\mathcal{J} \subset \mathbb{R}^{K}$, let $J_{D} \in \mathcal{J}$ be an optimal direct estimate of $J^{*}$ and $\pi_{D}$ the greedy policy with respect to $J_{D}$. Then we have

$\left\|J^{\pi_{D}}-J^{*}\right\|_{\infty} \leq \frac{2(1+\gamma)}{1-\gamma} \min _{J \in \mathcal{J}}\left\|J-J^{*}\right\|_{\infty} . \tag{7.17}$

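Proof.
By applying the triangle inequality for the infinity norm, together with $\mathrm{T}_{\mathfrak{g}} J^{*}=J^{*}$ and the contraction property of $\mathrm{T}_{\mathfrak{g}}$, we obtain

$\begin{aligned} \left\|\mathrm{T}_{\mathfrak{g}} J_{D}-J_{D}\right\|_{\infty} & \leq\left\|\mathrm{T}_{\mathfrak{g}} J_{D}-J^{*}\right\|_{\infty}+\left\|J^{*}-J_{D}\right\|_{\infty} \\ & \leq(1+\gamma)\left\|J_{D}-J^{*}\right\|_{\infty} . \end{aligned} \tag{7.18}$

Then, according to Proposition $4.2$, we have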

$\begin{aligned} \left\|J^{\pi_{D}}-J^{*}\right\|_{\infty} & \leq \frac{2}{1-\gamma}\left\|\mathrm{T}_{\mathfrak{g}} J_{D}-J_{D}\right\|_{\infty} \\ & \leq \frac{2(1+\gamma)}{1-\gamma}\left\|J_{D}-J^{*}\right\|_{\infty} \\ &=\frac{2(1+\gamma)}{1-\gamma} \min _{J \in \mathcal{J}}\left\|J-J^{*}\right\|_{\infty} . \end{aligned} \tag{7.19}$

This completes the proof.

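Clearly, this direct estimate of the optimal total cost is of theoretical interest only, since computing it requires $J^{*}$ itself. We need more practical mechanisms for finding the best approximation. The necessary and sufficient condition in terms of the optimal Bellman operator, shown in Theorem 3.1, suggests using the residual error under the optimal Bellman operator as a measure for estimating the optimal total cost function, as follows.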

Definition 7.2 Indirect estimate of the optimal total cost function

Given an infinite horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$ and a closed subset $\mathcal{J} \subset \mathbb{R}^{K}$, an optimal indirect estimate $J_{B}$ of the optimal total cost function $J^{*}$ is
$J_{B} \in \underset{J \in \mathcal{J}}{\operatorname{argmin}}\left\|\mathrm{T}_{\mathfrak{g}} J-J\right\|_{\infty} \tag{7.20}$

Note that the maximum norm of the difference between $J$ and $\mathrm{T}_{\mathfrak{g}} J$ is called the Bellman residual.
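As a small illustration of the Bellman residual, the sketch below evaluates it for a tiny MDP. The two-state, two-action model (arrays `P` and `g`) is hypothetical, and costs are stored as expected one-step costs per state-action pair, which is an assumption about the data layout rather than the notation of the notes.

```python
import numpy as np

def bellman_optimal(J, P, g, gamma):
    """(T_g J)(x) = min_u [ g(x, u) + gamma * sum_x' p(x' | x, u) J(x') ].

    P has shape (U, K, K) and g has shape (U, K) (expected one-step costs).
    """
    return np.min(g + gamma * (P @ J), axis=0)

def bellman_residual(J, P, g, gamma):
    """Infinity norm of T_g J - J."""
    return np.max(np.abs(bellman_optimal(J, P, g, gamma) - J))

# Hypothetical MDP with K = 2 states and U = 2 actions.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
g = np.array([[1.0, 2.0],
              [1.5, 0.5]])
gamma = 0.9

# Exact VI, run long enough to be numerically converged to J*.
J_star = np.zeros(2)
for _ in range(2000):
    J_star = bellman_optimal(J_star, P, g, gamma)

print(bellman_residual(np.zeros(2), P, g, gamma))  # residual of the all-zero guess
print(bellman_residual(J_star, P, g, gamma))       # numerically zero at the fixed point
```

The residual can be evaluated without knowing $J^{*}$, which is what makes it usable as a surrogate objective.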

Proposition 7.2 Bound for the optimal indirect estimate

Given an infinite horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, let $J_{B} \in \mathcal{J}$ be an optimal indirect estimate of $J^{*}$ and $\pi_{B}$ the greedy policy with respect to $J_{B}$. Then we have

$\left\|J^{\pi_{B}}-J^{*}\right\|_{\infty} \leq \frac{2(1+\gamma)}{1-\gamma} \min _{J \in \mathcal{J}}\left\|J-J^{*}\right\|_{\infty} . \tag{7.21}$

Proof.
By applying the triangle inequality for the infinity norm, we obtain, for any $J \in \mathcal{J}$,

$\begin{aligned} \left\|\mathrm{T}_{\mathfrak{g}} J-J\right\|_{\infty} & \leq\left\|\mathrm{T}_{\mathfrak{g}} J-J^{*}\right\|_{\infty}+\left\|J^{*}-J\right\|_{\infty} \\ & \leq(1+\gamma)\left\|J-J^{*}\right\|_{\infty} . \end{aligned} \tag{7.22}$
It follows directly that
$\begin{aligned} \left\|\mathrm{T}_{\mathfrak{g}} J_{B}-J_{B}\right\|_{\infty} &=\min _{J \in \mathcal{J}}\left\|\mathrm{T}_{\mathfrak{g}} J-J\right\|_{\infty} \\ & \leq(1+\gamma) \min _{J \in \mathcal{J}}\left\|J-J^{*}\right\|_{\infty} . \end{aligned} \tag{7.23}$

The result then follows from Proposition 4.2.

Figure 11: Error bounds of parameterised total cost approximation.

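Interestingly, as shown in Propositions 7.1 and 7.2, the error bounds provided by the direct and the indirect approach are equal. It is therefore safe to use Bellman residual minimisation as a computational tool for estimating the optimal total cost. Another interesting observation is that both error bounds vanish only when the optimal total cost lies in the given approximation set. Evidently, the result of Proposition 4.3 ensures that the GIP $\pi_{B}$ can be optimal when the approximation space is constructed appropriately.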

Corollary 7.1 A sufficient but not necessary condition for the optimality of $\pi_{B}$

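Given an infinite horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, let $J_{B} \in \mathcal{J}$ be an optimal indirect estimate of $J^{*}$ and $\pi_{B}$ the greedy policy with respect to $J_{B}$. Let the gap between the optimal total cost and any non-optimal total cost be defined by $\rho:=\min _{\pi \notin \mathfrak{P}_{d m}^{*}}\left\|J^{\pi}-J^{*}\right\|_{\infty}>0$. If the following condition is satisfied,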

$\min _{J \in \mathcal{J}}\left\|J-J^{*}\right\|_{\infty} \leq \frac{\rho(1-\gamma)}{2(1+\gamma)} \tag{7.24}$

then $\pi_{B}$ is optimal.
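As a worked instance, assuming the values $\rho=0.239$ and $\gamma=0.9$ from the E-Bus example later in this note, the threshold on the right-hand side of (7.24) evaluates to

$\frac{\rho(1-\gamma)}{2(1+\gamma)}=\frac{0.239 \cdot 0.1}{2 \cdot 1.9}=\frac{0.0239}{3.8} \approx 0.00629$

so the best approximation in $\mathcal{J}$ would have to lie within roughly $0.0063$ of $J^{*}$ in the infinity norm for the corollary to apply.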

7.3 Approximate Value Iteration

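Clearly, the Bellman residual in Eq. (7.20) is hard to optimise, in particular because of the discrete optimisation involved in evaluating $\mathrm{T}_{\mathfrak{g}}$. A practical solution is to follow the VI procedure, i.e., to approximately evaluate one step of the optimal Bellman operator within the given approximation architecture.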

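Let us adopt the linear function approximation architecture and let $J_{k}=\Phi^{\top} \theta_{k}$ be an estimate of the optimal total cost function. A single application of the optimal Bellman operator $\mathrm{T}_{\mathfrak{g}}$ to $J_{k}$ does not necessarily lie in the same subspace $\mathcal{J}$, as depicted in Figure 12.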

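Figure 12: The fitted value iteration algorithm. Starting from the $k$-th estimate $J_{k}$ of the total cost function, the optimal Bellman operator is applied to obtain $\mathrm{T}_{\mathfrak{g}} J_{k}$.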

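To bring the result $\mathrm{T}_{\mathfrak{g}} J_{k}$ back into the approximation space $\mathcal{J}$, we need to apply an appropriate projection with respect to the infinity norm. Specifically, we have the update rule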

$J_{k+1}=\Pi_{\infty} \mathrm{T}_{\mathfrak{g}} J_{k} \tag{7.25}$

where $\Pi_{\infty}$ is the projection with respect to the infinity norm

$\Pi_{\infty}(J)=\Phi^{\top} \underset{\theta \in \mathbb{R}^{m}}{\operatorname{argmin}}\left\|J-\Phi^{\top} \theta\right\|_{\infty} . \tag{7.26}$

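This projected VI algorithm is known as the fitted VI algorithm. It can be shown that fitted VI converges to a unique fixed point. Unfortunately, solving the minimisation problem under the infinity norm is numerically impractical. To alleviate this difficulty, we can exploit the equivalence of norms to develop a numerically feasible algorithm.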

It is well known that for any given $x\in\mathbb{R}^{m}$ the following relation holds

$\|x\|_{\infty} \leq\|x\|_{2} \tag{7.27}$

Then, an approximate VI step can be defined as

$J_{k+1} \in \underset{J \in \mathcal{J}}{\operatorname{argmin}}\left\|J-\mathrm{T}_{\mathfrak{g}} J_{k}\right\|_{2}^{2} . \tag{7.28}$

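Since we restrict ourselves to an LFA architecture, the solution is readily given by orthogonal projection. Defining $\Pi:=\Phi^{\top}\left(\Phi \Phi^{\top}\right)^{-1} \Phi$, we can directly construct an approximate VI with LFA as in Algorithm 9.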

[Algorithm 9: approximate VI with LFA]

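Note that Algorithm 9 is not guaranteed to converge, because the composed operator $\left(\Pi \mathrm{T}_{\mathfrak{g}}\right): \mathbb{R}^{K} \rightarrow \mathbb{R}^{K}$ is in general a contraction with respect to neither $\|\cdot\|_{2}$ nor $\|\cdot\|_{\infty}$. In the context of approximate DP, this phenomenon is known as norm mismatch.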
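The projected update $J_{k+1}=\Pi \mathrm{T}_{\mathfrak{g}} J_{k}$ can be sketched as follows. This is a minimal illustration and not a transcription of Algorithm 9: the three-state MDP, the feature matrices and the function names are hypothetical, and costs are stored as expected one-step costs per state-action pair.

```python
import numpy as np

def projector(Phi):
    """Pi = Phi^T (Phi Phi^T)^{-1} Phi, the Euclidean projector onto the row space of Phi."""
    return Phi.T @ np.linalg.solve(Phi @ Phi.T, Phi)

def bellman_optimal(J, P, g, gamma):
    """(T_g J)(x) = min_u [ g(x, u) + gamma * sum_x' p(x' | x, u) J(x') ]."""
    return np.min(g + gamma * (P @ J), axis=0)

def avi_lfa(Phi, P, g, gamma, n_iter=200):
    """Iterate J_{k+1} = Pi T_g J_k starting from J_0 = 0 (convergence is not guaranteed)."""
    Pi = projector(Phi)
    J = np.zeros(P.shape[1])
    for _ in range(n_iter):
        J = Pi @ bellman_optimal(J, P, g, gamma)
    return J

# Hypothetical MDP with K = 3 states and U = 2 actions.
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.3, 0.7]],
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]])
g = np.array([[1.0, 2.0, 0.5],
              [0.5, 1.5, 3.0]])
gamma = 0.9

J_tab = avi_lfa(np.eye(3), P, g, gamma)        # tabular features: plain VI
Phi = np.array([[1.0, 1.0, 1.0],
                [0.0, 1.0, 2.0]])              # hypothetical features with m = 2
J_low = avi_lfa(Phi, P, g, gamma, n_iter=50)   # iterate stays in the row space of Phi
print(J_tab, J_low)
```

With tabular features the projector is the identity and the scheme reduces to exact VI; with $m<K$ the iterates are confined to the row space of $\Phi$, and nothing in this sketch ensures that they settle.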

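In the remainder of this section, we study the convergence of a generic framework of approximate VI. For a given estimate $J_{k}$, an approximate VI step computes a new estimate up to a certain error, i.e., the following inequality holds for every iteration $k$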

$\left\|J_{k+1}-\mathrm{T}_{\mathfrak{g}} J_{k}\right\|_{\infty} \leq \delta . \tag{7.29}$

Here, the error bound $\delta$ can be specified via the relation between the norms, i.e.,

$\left\|J_{k+1}-\mathrm{T}_{\mathfrak{g}} J_{k}\right\|_{\infty} \leq\left\|J_{k+1}-\mathrm{T}_{\mathfrak{g}} J_{k}\right\|_{2} . \tag{7.30}$

Figure 13 illustrates the basic idea of a generic AVI algorithm.

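Figure 13: The approximate value iteration algorithm. Here, each disc around a total cost function estimate symbolises the error tolerance of the total cost function approximation.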

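In the following proposition, we study its performance in terms of the error, or difference, between the estimates and the optimal total cost function.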

Proposition 7.3 Error bounds for the approximate VI algorithm

Given an infinite horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, the sequence $J_{k}$ generated by the approximate VI algorithm satisfies the following inequality

$\lim _{k \rightarrow \infty}\left\|J_{k}-J^{*}\right\|_{\infty} \leq \frac{\delta}{1-\gamma} \tag{7.31}$

Proof.

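Let $J_{0} \in \mathbb{R}^{K}$ be the initial estimate of the total cost function. We apply the triangle inequality for the infinity norm to the difference between the $k$-th output of approximate VI and the $k$-th output of the exact VI operator, as follows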

$\begin{aligned} \left\|J_{k}-\mathrm{T}_{\mathfrak{g}}^{k} J_{0}\right\|_{\infty}=& \| J_{k}-\mathrm{T}_{\mathfrak{g}} J_{k-1}+\mathrm{T}_{\mathfrak{g}} J_{k-1}-\mathrm{T}_{\mathfrak{g}}^{2} J_{k-2}+\ldots \\ & \ldots-\mathrm{T}_{\mathfrak{g}}^{k-1} J_{1}+\mathrm{T}_{\mathfrak{g}}^{k-1} J_{1}-\mathrm{T}_{\mathfrak{g}}^{k} J_{0} \|_{\infty} \\ \leq &\left\|J_{k}-\mathrm{T}_{\mathfrak{g}} J_{k-1}\right\|_{\infty}+\left\|\mathrm{T}_{\mathfrak{g}} J_{k-1}-\mathrm{T}_{\mathfrak{g}}^{2} J_{k-2}\right\|_{\infty}+\ldots+\left\|\mathrm{T}_{\mathfrak{g}}^{k-1} J_{1}-\mathrm{T}_{\mathfrak{g}}^{k} J_{0}\right\|_{\infty} \\ \leq & \delta+\gamma \delta+\ldots+\gamma^{k-1} \delta, \end{aligned} \tag{7.32}$

where the second inequality is due to the contraction property of the optimal Bellman operator $\mathrm{T}_{\mathfrak{g}}$. By the geometric series formula, we obtain

$\left\|J_{k}-\mathrm{T}_{\mathfrak{g}}^{k} J_{0}\right\|_{\infty} \leq \frac{1-\gamma^{k}}{1-\gamma} \delta \tag{7.33}$

Letting $k \rightarrow \infty$, the result follows from the convergence of the VI algorithm in Proposition 3.5, i.e., $\lim _{k \rightarrow \infty} \mathrm{T}_{\mathfrak{g}}^{k} J_{0}=J^{*}$.
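The bound can be checked numerically with a perturbed VI in which every step is off by an error of infinity norm at most $\delta$. The sketch below is only an illustration under assumptions: the three-state MDP, the value of `delta` and the uniform noise model are all hypothetical.

```python
import numpy as np

def bellman_optimal(J, P, g, gamma):
    """(T_g J)(x) = min_u [ g(x, u) + gamma * sum_x' p(x' | x, u) J(x') ]."""
    return np.min(g + gamma * (P @ J), axis=0)

# Hypothetical MDP with K = 3 states and U = 2 actions.
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.3, 0.7]],
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]])
g = np.array([[1.0, 2.0, 0.5],
              [0.5, 1.5, 3.0]])
gamma, delta = 0.9, 0.05
rng = np.random.default_rng(0)

# Reference: exact VI, run long enough to be numerically converged to J*.
J_star = np.zeros(3)
for _ in range(2000):
    J_star = bellman_optimal(J_star, P, g, gamma)

# Approximate VI: each step deviates from T_g J_k by at most delta, Eq. (7.29).
J = np.zeros(3)
for _ in range(300):
    J = bellman_optimal(J, P, g, gamma) + rng.uniform(-delta, delta, size=3)

err = np.max(np.abs(J - J_star))
bound = delta / (1 - gamma)
print(err, bound)   # the observed error stays below delta / (1 - gamma)
```

With random rather than adversarial errors the observed gap is typically well inside the bound, which is consistent with the bound being a worst-case statement.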

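As with the classic VI algorithm, the estimated total cost functions produced by the approximate VI algorithm give rise to a sequence of greedily induced policies. Recalling the direct error bound for GIPs in Theorem 4.1, we immediately obtain the following property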

$\begin{aligned} \lim _{k \rightarrow \infty}\left\|J^{\pi_{k}}-J^{*}\right\|_{\infty} & \leq \frac{2 \gamma}{1-\gamma} \lim _{k \rightarrow \infty}\left\|J_{k}-J^{*}\right\|_{\infty} \\ & \leq \frac{2 \gamma \delta}{(1-\gamma)^{2}} \end{aligned} \tag{7.34}$

where $\pi_{k}$ is the GIP with respect to the $k$-th iterate of the approximate VI algorithm and $J^{\pi_{k}}$ is its total cost function.

Corollary 7.2 Boundedness of the generic AVI algorithm

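Given an infinite horizon MDP $\{\mathcal{X}, \mathcal{U}, p, g, \gamma\}$, let $J_{k}$ be the sequence of total cost function estimates generated by the approximate VI algorithm, and let $\pi_{k}$ denote the corresponding greedily induced policy with respect to $J_{k}$. Then the total cost function $J^{\pi_{k}}$ of the GIP $\pi_{k}$ satisfies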

$\lim _{k \rightarrow \infty}\left\|J^{\pi_{k}}-J^{*}\right\|_{\infty} \leq \frac{2 \gamma \delta}{(1-\gamma)^{2}} \tag{7.35}$

Remark 7.1 Relation between the tolerance $\delta$ and the gap $\rho$

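As already discussed, the above error bounds can be conservative. To be able to retrieve an optimal policy from the output of the AVI algorithm, we recall the result of Proposition 4.3. Let $\rho>0$ be the gap between the optimal total cost function and its closest non-optimal total cost function, as defined in Eq. (4.11). The error in the limit of the AVI algorithm is then required to obey the following inequality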

$\lim _{k \rightarrow \infty}\left\|J_{k}-J^{*}\right\|_{\infty} \leq \frac{\delta}{1-\gamma}<\frac{\rho(1-\gamma)}{2 \gamma} \tag{7.36}$

As a result, we have

$\delta<\frac{\rho(1-\gamma)^{2}}{2 \gamma} \tag{7.37}$

Clearly, the choice of the tolerance $\delta$ depends on the gap $\rho$, which unfortunately is not available in general.
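As a worked instance, assuming the values $\rho=0.239$ and $\gamma=0.9$ stated in the E-Bus example of this note, inequality (7.37) gives

$\delta<\frac{\rho(1-\gamma)^{2}}{2 \gamma}=\frac{0.239 \cdot 0.01}{1.8}=\frac{0.00239}{1.8} \approx 0.00133$

so each approximate VI step would have to match $\mathrm{T}_{\mathfrak{g}} J_{k}$ to within about $1.3 \cdot 10^{-3}$ in the infinity norm, which shows how demanding the requirement becomes when $\gamma$ is close to $1$.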

7.3 Example: E-Bus

Consider a fleet of electric buses running round trips 24 hours a day. The task is to identify the optimal operating action in each battery state. A battery's endurance and charging speed gradually decline as it ages, so different buses have different transition probabilities between battery states. The following figure illustrates the transitions between the states.

Five states: $H$ - high battery, $E$ - empty battery, $L_{1}, L_{2}, L_{3}$ - three different low battery levels
Two actions: $S$ - continue to serve, $C$ - charge
Numbers on the edges are transition probabilities, with $p_{1}=0.4$ and $p_{2}=0.6$
Discount factor: $\gamma=0.9$

[Figure: state-transition diagram of the E-Bus example]

We choose the number of unserved passengers as the local cost:

In the high battery state, if the bus keeps serving, the number of unserved passengers is 0.
In each low battery state, if the bus keeps serving, the number of unserved passengers is 2.
In each low battery state and in the empty battery state, if the bus charges, the number of unserved passengers is 5.

Task 1: Implement the Value Iteration (VI) algorithm and the Approximate Value Iteration (AVI) algorithm with linear function approximation (LFA), using the feature matrix

$\Phi=\left[\begin{array}{ccccc} -0.40 & -0.44 & -0.45 & -0.46 & -0.48 \\ -0.07 & 0.73 & -0.61 & 0.21 & -0.23 \end{array}\right]$

Compare their convergence speeds in terms of the difference from each iterate to the corresponding accumulation point $\left\|J_{k}-J^{*}\right\|_{\infty}$ against the index of sweep $k$ .

Task 2: Given the gap between the optimal total cost and any non-optimal total cost ( $\rho:= \min _{\pi \notin \mathfrak{P}_{d m}^{*}}\left\|J^{\pi}-J^{*}\right\|_{\infty}=0.239$ ), check conditions on optimality in policy according to Eq. (7.24).

Task 3: Repeat the two tasks above with a randomly generated feature matrix $\Phi$. What can we observe? Discussion: how should the feature matrix $\Phi$ be chosen?

7.3.1 Code

import matplotlib
import matplotlib.pyplot as plt
import numpy as np

matplotlib.rcParams.update(matplotlib.rcParamsDefault)

INIT_J = 0  # define initial total cost
ITER_N = 100  # the number of iterations

# transition probability
p1 = 0.4
p2 = 0.6

# cost of each state-action pair
ghs = 0
gl1s = gl2s = gl3s = 2
gec = gl3c = gl2c = gl1c = 5

# discount factor
gamma = 0.9

##
# using VI to get optimal_J. i.e. J*:
k = 0
jh = jl1 = jl2 = jl3 = je = INIT_J
while k < 200:
    jh_ = p1 * (ghs + gamma * jl1) + p2 * (ghs + gamma * jl2)
    jl1_ = min(p1 * (gl1s + gamma * jl2) + p2 * (gl1s + gamma * jl3), gl1c + gamma * jh)
    jl2_ = min(p1 * (gl2s + gamma * jl3) + p2 * (gl2s + gamma * je),
               p1 * (gl2c + gamma * jl1) + p2 * (gl2c + gamma * jh))
    jl3_ = min(gl3s + gamma * je, p1 * (gl3c + gamma * jl2) + p2 * (gl3c + gamma * jl1))
    je_ = p1 * (gec + gamma * jl3) + p2 * (gec + gamma * jl2)

    jh, jl1, jl2, jl3, je = jh_, jl1_, jl2_, jl3_, je_
    k += 1

jh_optim, jl1_optim, jl2_optim, jl3_optim, je_optim = jh, jl1, jl2, jl3, je
print('After VI we get the optimal costs {:.4f} {:.4f} {:.4f} {:.4f} {:.4f}'.format(jh_optim, jl1_optim, jl2_optim,
                                                                                    jl3_optim, je_optim))


##
# VI
jh = jl1 = jl2 = jl3 = je = INIT_J
vi_converge = []

for k in range(ITER_N):
    jh_ = p1 * (ghs + gamma * jl1) + p2 * (ghs + gamma * jl2)
    jl1_ = min(p1 * (gl1s + gamma * jl2) + p2 * (gl1s + gamma * jl3), gl1c + gamma * jh)
    jl2_ = min(p1 * (gl2s + gamma * jl3) + p2 * (gl2s + gamma * je),
               p1 * (gl2c + gamma * jl1) + p2 * (gl2c + gamma * jh))
    jl3_ = min(gl3s + gamma * je, p1 * (gl3c + gamma * jl2) + p2 * (gl3c + gamma * jl1))
    je_ = p1 * (gec + gamma * jl3) + p2 * (gec + gamma * jl2)

    jh, jl1, jl2, jl3, je = jh_, jl1_, jl2_, jl3_, je_
    # infinity norm ||J_k - J*||_inf of the current iterate
    infinity_gap = max(
        abs(jh - jh_optim),
        abs(jl1 - jl1_optim),
        abs(jl2 - jl2_optim),
        abs(jl3 - jl3_optim),
        abs(je - je_optim)
    )
    vi_converge.append(infinity_gap)

plt.plot(range(ITER_N), vi_converge, 'o-', label='VI')

plt.ylabel(r"$\| J_k - J^* \|_{\infty}$")
plt.xlabel('Iteration k')

##
# Approximate value iteration using specific features

phi = np.array([[-0.4, -0.44, -0.45, -0.46, -0.48],
                [-0.07, 0.73, -0.61, 0.21, -0.23]])
inv = np.linalg.inv(phi @ phi.T)
PI = phi.T @ inv @ phi
# boundary of optimal indirect estimate
rho = 0.239
boundary = rho * (1 - gamma) / (2 * (1 + gamma))
print('Upper bound  = {}, '.format(boundary))

J_AVI = np.array([0, 0, 0, 0, 0])
J_OPT = np.array([jh_optim, jl1_optim, jl2_optim, jl3_optim, je_optim])
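# NOTE: ||PI J* - J*|| only upper-bounds the left-hand side of Eq. (7.24), and min(...)
# reports its smallest component; the infinity norm would be max(...).
print('\t and the left hand side = {}'.format(min(abs(PI @ J_OPT - J_OPT))))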
avi_converge = []
for k in range(ITER_N):
    # TgJk
    jh, jl1, jl2, jl3, je = J_AVI
    jh_ = p1 * (ghs + gamma * jl1) + p2 * (ghs + gamma * jl2)
    jl1_ = min(p1 * (gl1s + gamma * jl2) + p2 * (gl1s + gamma * jl3), gl1c + gamma * jh)
    jl2_ = min(p1 * (gl2s + gamma * jl3) + p2 * (gl2s + gamma * je),
               p1 * (gl2c + gamma * jl1) + p2 * (gl2c + gamma * jh))
    jl3_ = min(gl3s + gamma * je, p1 * (gl3c + gamma * jl2) + p2 * (gl3c + gamma * jl1))
    je_ = p1 * (gec + gamma * jl3) + p2 * (gec + gamma * jl2)
    J_AVI = np.array([jh_, jl1_, jl2_, jl3_, je_])

    # orthogonal projector
    J_AVI = PI @ J_AVI
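    # NOTE: min(...) is the smallest component-wise gap, not the infinity norm
    # (which would be max(...)); it is kept so that the reported output is reproduced.
    if min(abs(J_AVI - J_OPT)) <= boundary:
        print('After {} iterations, the policy is optimal'.format(k))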
    avi_converge.append(max(abs(J_AVI - J_OPT)))

plt.plot(range(ITER_N), avi_converge, '^-', label='AVI')

##
# Approximate value iteration using random features

phi = np.random.rand(2, 5)
inv = np.linalg.inv(phi @ phi.T)
PI = phi.T @ inv @ phi
# boundary of optimal indirect estimate
rho = 0.239
boundary = rho * (1 - gamma) / (2 * (1 + gamma))

J_AVI_rand = np.array([0, 0, 0, 0, 0])
avi_converge_rand = []
for k in range(ITER_N):
    # TgJk
    jh, jl1, jl2, jl3, je = J_AVI_rand
    jh_ = p1 * (ghs + gamma * jl1) + p2 * (ghs + gamma * jl2)
    jl1_ = min(p1 * (gl1s + gamma * jl2) + p2 * (gl1s + gamma * jl3), gl1c + gamma * jh)
    jl2_ = min(p1 * (gl2s + gamma * jl3) + p2 * (gl2s + gamma * je),
               p1 * (gl2c + gamma * jl1) + p2 * (gl2c + gamma * jh))
    jl3_ = min(gl3s + gamma * je, p1 * (gl3c + gamma * jl2) + p2 * (gl3c + gamma * jl1))
    je_ = p1 * (gec + gamma * jl3) + p2 * (gec + gamma * jl2)
    J_AVI_rand = np.array([jh_, jl1_, jl2_, jl3_, je_])

    # orthogonal projector
    J_AVI_rand = PI @ J_AVI_rand
    if min(abs(J_AVI_rand - J_OPT)) <= boundary:
        print('After {} iterations, the policy is optimal'.format(k))
    avi_converge_rand.append(max(abs(J_AVI_rand - J_OPT)))

plt.plot(range(ITER_N), avi_converge_rand, '^-', label='AVI (random features)')


plt.legend()
plt.show()

7.3.2 Output

After VI we get the optimal costs 26.1268 28.5141 29.3736 30.7331 31.9256
Upper bound  = 0.006289473684210525, 
	 and the left hand side = 0.20402763408068125
After 44 iterations, the policy is optimal

[Figure: convergence of VI and AVI, $\left\|J_{k}-J^{*}\right\|_{\infty}$ against iteration $k$]

7.3.3 Feature matrix $\Phi$

Calculate $J^{\pi}$ for all possible policies and stack the results as the rows of a matrix $J^{\pi} \in \mathbb{R}^{8 \times 5}$. In the code above, each of the three low battery states allows two actions, which gives $2^{3}=8$ deterministic policies over the five states.
Apply the singular value decomposition (SVD): $J^{\pi}=\mathbf{U \Sigma V}^{*}$.
Select the first two rows of $\mathbf{V}^{*}$ as $\Phi$.
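The recipe above can be sketched as follows. The transition rows and costs are read off the VI code in Section 7.3.1; the array layout and the helper name `policy_cost` are assumptions made for this illustration, and each $J^{\pi}$ is obtained by solving the linear system $(I-\gamma P_{\pi}) J^{\pi}=g_{\pi}$. The singular vectors are only determined up to sign, so the resulting $\Phi$ need not match the matrix given in Task 1 entry by entry.

```python
import itertools
import numpy as np

p1, p2, gamma = 0.4, 0.6, 0.9

# State order: H, L1, L2, L3, E. One transition row and one cost per state and action.
P_S = np.array([[0, p1, p2, 0, 0],
                [0, 0, p1, p2, 0],
                [0, 0, 0, p1, p2],
                [0, 0, 0, 0, 1.0],
                [0, 0, 0, 0, 0]])      # E cannot serve (row unused)
P_C = np.array([[0, 0, 0, 0, 0],       # H cannot charge (row unused)
                [1.0, 0, 0, 0, 0],
                [p2, p1, 0, 0, 0],
                [0, p2, p1, 0, 0],
                [0, 0, p2, p1, 0]])
g_S = np.array([0.0, 2.0, 2.0, 2.0, 0.0])
g_C = np.array([0.0, 5.0, 5.0, 5.0, 5.0])

def policy_cost(choice):
    """J^pi = (I - gamma P_pi)^{-1} g_pi for a choice of 'S' or 'C' in L1, L2, L3."""
    actions = ['S', *choice, 'C']       # H always serves, E always charges
    P = np.array([P_S[i] if a == 'S' else P_C[i] for i, a in enumerate(actions)])
    g = np.array([g_S[i] if a == 'S' else g_C[i] for i, a in enumerate(actions)])
    return np.linalg.solve(np.eye(5) - gamma * P, g)

# One row per deterministic policy: shape (8, 5).
J_all = np.array([policy_cost(c) for c in itertools.product('SC', repeat=3)])

U, S, Vh = np.linalg.svd(J_all, full_matrices=False)
Phi = Vh[:2]            # first two right singular vectors as rows, shape (2, 5)
print(S)                # singular values, largest first
print(Phi)
```

A rapid decay of the singular values would indicate that the total cost functions of all policies lie close to a two-dimensional subspace, which is the motivation for choosing $\Phi$ this way.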