Literature notes: Deterministic Policy Gradient Algorithms

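Under a stochastic policy, the action taken in a given state is drawn from a probability distribution, so the same policy in the same state can produce different actions. A deterministic policy is simpler. Intuitively, although the actions available in a state have different probabilities, only one of them has the highest probability; if we always take that action and drop the distribution, things become much simpler. Under a deterministic policy, the same policy in the same state always yields one uniquely determined action, so the policy becomes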

\[\pi _ { \theta } ( s ) = a\]
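To make the contrast concrete, here is a minimal sketch. The linear mean `theta @ s`, the unit-variance Gaussian noise, and the function names are illustrative assumptions, not something taken from the paper.

```python
import numpy as np

def stochastic_policy(theta, s, rng):
    """Sample a ~ N(theta @ s, 1): the same state can give different actions."""
    return float(theta @ s + rng.standard_normal())

def deterministic_policy(theta, s):
    """Return theta @ s: the same state always gives the same action."""
    return float(theta @ s)

rng = np.random.default_rng(0)
theta = np.array([0.5, -1.0])   # hypothetical policy parameters
s = np.array([1.0, 2.0])        # one fixed state

# Query both policies three times in the same state.
samples = [stochastic_policy(theta, s, rng) for _ in range(3)]
actions = [deterministic_policy(theta, s) for _ in range(3)]

print("stochastic:   ", samples)
print("deterministic:", actions)
```

The deterministic calls all return the same value, while the stochastic calls differ from draw to draw.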

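The paper first reviews the stochastic policy gradient algorithm and its function-approximation theory, noting that if the following two conditions hold: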

(1) $Q ^ { w } ( s , a ) = \nabla _ { \theta } \log \pi _ { \theta } ( a | s ) ^ { \top } w$, and (2) $w$ minimizes the mean-squared error $\epsilon ^ { 2 } ( w ) = \mathbb { E } _ { s \sim \rho ^ { \pi } , a \sim \pi _ { \theta } } \left[ \left( Q ^ { w } ( s , a ) - Q ^ { \pi } ( s , a ) \right) ^ { 2 } \right]$, then using the approximator $Q ^ { w } ( s , a ) \approx Q ^ { \pi } ( s , a )$ in place of $Q ^ { \pi }$ introduces no bias into the policy gradient.

For deterministic policies, the paper first gives the update rule for the policy, where $\mu_{\theta}(s)$ is the policy function:

\[\theta ^ { k + 1 } = \theta ^ { k } + \alpha \mathbb { E } _ { s \sim \rho ^ { \mu ^ { k } } } \left[ \nabla _ { \theta } Q ^ { \mu ^ { k } } \left( s , \mu _ { \theta } ( s ) \right) \right]\]

Applying the chain rule, this can be rewritten as:

\[\theta ^ { k + 1 } = \theta ^ { k } + \alpha \mathbb { E } _ { s \sim \rho ^ { \mu ^ { k } } } \left[ \nabla _ { \theta } \mu _ { \theta } ( s ) \nabla _ { a } Q ^ { \mu ^ { k } } \left. ( s , a ) \right| _ { a = \mu _ { \theta } ( s ) } \right]\]
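The chain-rule step can be checked numerically on a toy problem. The sketch below assumes a scalar action, a linear policy `mu(theta, s) = theta @ s`, and a made-up critic `q(s, a) = -(a - sum(s))^2`; none of these choices come from the paper, they only make both gradients easy to compute.

```python
import numpy as np

def mu(theta, s):
    """Hypothetical linear deterministic policy with a scalar action."""
    return theta @ s

def q(s, a):
    """Hypothetical action-value function, largest at a = sum(s)."""
    return -(a - s.sum()) ** 2

def grad_chain_rule(theta, s):
    """grad_theta mu(s) times grad_a Q(s, a) evaluated at a = mu(s)."""
    a = mu(theta, s)
    dmu_dtheta = s                      # gradient of theta @ s w.r.t. theta
    dq_da = -2.0 * (a - s.sum())        # derivative of q w.r.t. a
    return dmu_dtheta * dq_da

def grad_finite_difference(theta, s, eps=1e-6):
    """Central differences of theta -> Q(s, mu_theta(s))."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (q(s, mu(theta + e, s)) - q(s, mu(theta - e, s))) / (2 * eps)
    return g

theta = np.array([0.2, -0.4, 1.0])
s = np.array([1.0, 2.0, -1.0])
g_chain = grad_chain_rule(theta, s)
g_fd = grad_finite_difference(theta, s)
print(g_chain, g_fd)
```

Both routes give the same gradient up to numerical error, which is all the rewritten update relies on.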

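Section 3.2 of the paper then gives the theoretical justification for this policy update. Let $\rho^{\mu}(s)$ denote the stationary distribution generated by the policy $\mu_{\theta}(s)$; the optimization objective can then be written as: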

\[\begin{aligned} J \left( \mu _ { \theta } \right) & = \int _ { \mathcal { S } } \rho ^ { \mu } ( s ) r \left( s , \mu _ { \theta } ( s ) \right) \mathrm { d } s \\ & = \mathbb { E } _ { s \sim \rho ^ { \mu } } \left[ r \left( s , \mu _ { \theta } ( s ) \right) \right] \end{aligned}\]

(My own reading: the expression above is the average one-step reward.)

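The proof parallels the one in the function-approximation paper Policy Gradient Methods for Reinforcement Learning with Function Approximation, and it shows that the gradient of the objective does not require differentiating the state distribution. The theorem states:

\[\nabla _ { \theta } J \left( \mu _ { \theta } \right) = \mathbb { E } _ { s \sim \rho ^ { \mu } } \left[ \nabla _ { \theta } \mu _ { \theta } ( s ) \nabla _ { a } Q ^ { \mu } \left. ( s , a ) \right| _ { a = \mu _ { \theta } ( s ) } \right]\]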

The authors first give the on-policy DPG algorithm:

\[\begin{aligned} \delta _ { t } & = r _ { t } + \gamma Q ^ { w } \left( s _ { t + 1 } , a _ { t + 1 } \right) - Q ^ { w } \left( s _ { t } , a _ { t } \right) \\ w _ { t + 1 } & = w _ { t } + \alpha _ { w } \delta _ { t } \nabla _ { w } Q ^ { w } \left( s _ { t } , a _ { t } \right) \\ \theta _ { t + 1 } & = \theta _ { t } + \alpha _ { \theta } \nabla _ { \theta } \mu _ { \theta } \left( s _ { t } \right) \nabla _ { a } Q ^ { w } \left. \left( s _ { t } , a _ { t } \right) \right| _ { a = \mu _ { \theta } ( s ) } \end{aligned}\]

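Here $w$ is the parameter vector of $Q^w$, the approximator of $Q$, and its update rule is derived from a mean-squared-error criterion. $\theta$ is the parameter vector of the policy function $\mu_{\theta}$.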
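The three on-policy updates can be sketched as one step function. The linear policy, the critic features `[a * s, s]`, the step sizes, and all function names below are assumptions chosen to keep the example small; the paper does not prescribe them.

```python
import numpy as np

def mu(theta, s):
    """Hypothetical linear deterministic policy with a scalar action."""
    return float(theta @ s)

def features(s, a):
    """Hypothetical critic features; Q^w is linear in these."""
    return np.concatenate([a * s, s])

def q(w, s, a):
    """Critic Q^w(s, a) = w . features(s, a)."""
    return float(w @ features(s, a))

def dq_da(w, s):
    """Derivative of Q^w w.r.t. a; for these features it does not depend on a."""
    return float(w[: s.size] @ s)

def on_policy_dpg_step(theta, w, s, a, r, s_next, a_next,
                       gamma=0.9, alpha_w=0.1, alpha_theta=0.01):
    # TD error using the next action actually taken (Sarsa-style).
    delta = r + gamma * q(w, s_next, a_next) - q(w, s, a)
    # Critic update: grad_w Q^w(s, a) is the feature vector.
    w_new = w + alpha_w * delta * features(s, a)
    # Actor update: grad_theta mu(s) = s, times grad_a Q^w at a = mu(s).
    theta_new = theta + alpha_theta * s * dq_da(w, s)
    return theta_new, w_new, delta

theta = np.array([0.5, 0.5])
w = np.array([1.0, 0.0, 0.0, 0.0])
s, s_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
a, a_next = mu(theta, s), mu(theta, s_next)   # on-policy: actions come from mu
theta_new, w_new, delta = on_policy_dpg_step(theta, w, s, a, 1.0, s_next, a_next)
print(delta, w_new, theta_new)
```

Because both `a` and `a_next` come from the policy being learned, the critic here evaluates that same policy.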

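The paper then gives the off-policy DPG algorithm. The policy we want to evaluate is $\mu_{\theta}$, while the policy actually followed for the sake of exploration is $\beta$. Let $\rho^{\beta}(s)$ denote the stationary distribution generated by $\beta$; the gradient of the objective can then be written as: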

\[\begin{aligned} \nabla _ { \theta } J _ { \beta } \left( \mu _ { \theta } \right) & \approx \int _ { \mathcal { S } } \rho ^ { \beta } ( s ) \nabla _ { \theta } \mu _ { \theta } ( a | s ) Q ^ { \mu } ( s , a ) \mathrm { d } s \\ & = \mathbb { E } _ { s \sim \rho ^ { \beta } } \left[ \nabla _ { \theta } \mu _ { \theta } ( s ) \nabla _ { a } Q ^ { \mu } \left. ( s , a ) \right| _ { a = \mu _ { \theta } ( s ) } \right] \end{aligned}\]

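Note that a deterministic policy has no notion of the probability of selecting an action, so the update can avoid the usual probability-based importance-sampling computation. The algorithm can therefore be stated as: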

\[\begin{aligned} \delta _ { t } & = r _ { t } + \gamma Q ^ { w } \left( s _ { t + 1 } , \mu _ { \theta } \left( s _ { t + 1 } \right) \right) - Q ^ { w } \left( s _ { t } , a _ { t } \right) \\ w _ { t + 1 } & = w _ { t } + \alpha _ { w } \delta _ { t } \nabla _ { w } Q ^ { w } \left( s _ { t } , a _ { t } \right) \\ \theta _ { t + 1 } & = \theta _ { t } + \alpha _ { \theta } \nabla _ { \theta } \mu _ { \theta } \left( s _ { t } \right) \nabla _ { a } Q ^ { w } \left. \left( s _ { t } , a _ { t } \right) \right| _ { a = \mu _ { \theta } ( s ) } \end{aligned}\]
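The only change from the on-policy version is the bootstrap target: it uses the target policy's action at the next state rather than the action the behavior policy took. A minimal sketch follows, again assuming a linear policy, critic features `[a * s, s]`, and hypothetical names and step sizes that are not from the paper.

```python
import numpy as np

def mu(theta, s):
    """Hypothetical linear target policy with a scalar action."""
    return float(theta @ s)

def features(s, a):
    """Hypothetical critic features; Q^w is linear in these."""
    return np.concatenate([a * s, s])

def q(w, s, a):
    return float(w @ features(s, a))

def dq_da(w, s):
    """Derivative of Q^w w.r.t. a; for these features it does not depend on a."""
    return float(w[: s.size] @ s)

def off_policy_dpg_step(theta, w, s, a, r, s_next,
                        gamma=0.9, alpha_w=0.1, alpha_theta=0.01):
    # Bootstrap with the target policy's action mu(s_next), not the behavior
    # policy's next action, so no importance weight appears anywhere.
    delta = r + gamma * q(w, s_next, mu(theta, s_next)) - q(w, s, a)
    w_new = w + alpha_w * delta * features(s, a)
    theta_new = theta + alpha_theta * s * dq_da(w, s)
    return theta_new, w_new, delta

theta = np.array([0.5, 0.5])
w = np.array([1.0, 2.0, 0.0, 0.5])
s, s_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
a = 1.0   # exploratory action from the behavior policy; mu(theta, s) would be 0.5
theta_new, w_new, delta = off_policy_dpg_step(theta, w, s, a, 1.0, s_next)
print(delta, w_new, theta_new)
```

In practice the behavior action might be `mu(theta, s)` plus exploration noise; a fixed value is used here only so the numbers are reproducible.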

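Finally, the paper gives a theorem on DPG with linear function approximation, an update algorithm that adds a baseline under linear approximation, and an extension whose updates minimize the MSPBE (a quadratic loss, optimized with gradient temporal-difference learning).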
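As a rough illustration of the linear case with a baseline, the paper's compatible approximator has the form $Q^w(s,a) = (a - \mu_\theta(s))^\top \nabla_\theta \mu_\theta(s)^\top w + V^v(s)$. The sketch below instantiates it for a scalar action with a linear policy and a linear baseline `v @ s`; those concrete choices and the function names are assumptions made for the example.

```python
import numpy as np

def mu(theta, s):
    """Hypothetical linear deterministic policy, so grad_theta mu(s) = s."""
    return float(theta @ s)

def compatible_q(w, v, theta, s, a):
    """Q^w(s, a) = (a - mu(s)) * (grad_theta mu(s) . w) + V^v(s)."""
    advantage = (a - mu(theta, s)) * float(s @ w)
    baseline = float(v @ s)            # state-value baseline V^v(s)
    return advantage + baseline

theta = np.array([0.5, -0.5])
w = np.array([2.0, 1.0])
v = np.array([0.3, 0.3])
s = np.array([1.0, 2.0])

a0 = mu(theta, s)
q_at_mu = compatible_q(w, v, theta, s, a0)   # advantage term vanishes at a = mu(s)

# Finite-difference check that grad_a Q^w equals grad_theta mu(s) . w.
eps = 1e-4
dq_da = (compatible_q(w, v, theta, s, a0 + eps)
         - compatible_q(w, v, theta, s, a0 - eps)) / (2 * eps)
print(q_at_mu, dq_da, float(s @ w))
```

At the policy's own action the approximator reduces to the baseline, and its slope in the action matches `s @ w`, which is the quantity the actor update needs.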
