Deep Reinforcement Learning: Derivation of the Deterministic Policy Gradient Algorithm

Introduction

1 The Deterministic Policy Gradient Theorem

2 Derivation of the Deterministic Policy Gradient Theorem

3 A Commonly Used Form of the Deterministic Policy Gradient Theorem

4 Derivation of the Commonly Used Form

5 Summary


Introduction

We previously derived the (stochastic) policy gradient algorithm in detail; readers interested in that derivation can refer to my earlier post 深度强化学习-策略梯度算法推导. In a continuous action space the number of actions is infinite. A conventional value-based method would need to compute \max_{a}q(s,a;\theta ), and with infinitely many actions this maximization is usually intractable. To address this, D. Silver et al. proposed deterministic policies in the paper "Deterministic Policy Gradient Algorithms" for handling continuous action spaces. This post derives the policy gradient for deterministic policies over continuous action spaces.
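To make the difficulty concrete, here is a purely illustrative sketch (not from the original post; the function q and all names are hypothetical): with a continuous action, \max_{a}q(s,a) cannot be found by enumeration and must be approximated, e.g. by an inner gradient-ascent loop over a. This inner optimization is exactly what a deterministic actor \pi (s;\theta ) is trained to replace.

```python
import torch

# Hypothetical differentiable critic q(s, a); in practice this would be a learned network.
def q(s, a):
    return -((a - s) ** 2).sum()  # maximized at a = s

s = torch.tensor([0.7, -1.2])
a = torch.zeros(2, requires_grad=True)   # continuous action: cannot enumerate all values
opt = torch.optim.SGD([a], lr=0.1)

for _ in range(200):                     # approximate max_a q(s, a) by gradient ascent on a
    opt.zero_grad()
    loss = -q(s, a)                      # minimizing -q is maximizing q
    loss.backward()
    opt.step()

print(a.detach())                        # close to s; a deterministic actor avoids this inner loop
```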

1 The Deterministic Policy Gradient Theorem

For a deterministic policy over a continuous action space, \pi (a\mid s;\theta ) is not a function in the usual sense, and its gradient with respect to the policy parameter \theta, \triangledown \pi (a\mid s;\theta ), does not exist (because the action at state s is uniquely determined). However, a deterministic policy can be written as \pi (s;\theta )(s\in S), which can be differentiated with respect to \theta in the usual way.

When the policy is a deterministic policy \pi (s;\theta )(s\in S) over a continuous action space, the deterministic policy gradient theorem states that


\triangledown E_{\pi (\theta )}\left [ G_{0} \right ]=E\left [ \sum_{t=0}^{+\infty }\gamma ^{t}\triangledown \pi (S_{t};\theta )\left [ \triangledown _{a}q_{\pi (\theta )}(S_{t},a) \right ]_{a=\pi (S_{t};\theta )} \right ]
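In practice this gradient is rarely formed explicitly: backpropagating q(s,\pi (s;\theta )) through the critic into the actor yields the same quantity. Below is a minimal PyTorch sketch (my own illustration with hypothetical actor and critic networks, not code from the original post or the paper) checking that the two routes agree.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical actor pi(s; theta) and critic q(s, a): 3-dim state, 1-dim action.
actor = nn.Sequential(nn.Linear(3, 16), nn.Tanh(), nn.Linear(16, 1))
critic = nn.Sequential(nn.Linear(3 + 1, 16), nn.Tanh(), nn.Linear(16, 1))

s = torch.randn(3)

# Route 1: backpropagate q(s, pi(s; theta)) directly into the actor parameters theta.
grad_direct = torch.autograd.grad(
    critic(torch.cat([s, actor(s)])).sum(), actor.parameters()
)

# Route 2: the theorem's factorization  grad pi(s; theta) * [grad_a q(s, a)]_{a = pi(s; theta)}.
a = actor(s)                                  # pi(s; theta), still in the graph of theta
a_leaf = a.detach().requires_grad_(True)      # cut the graph to differentiate q w.r.t. the action
dq_da = torch.autograd.grad(critic(torch.cat([s, a_leaf])).sum(), a_leaf)[0]
grad_chain = torch.autograd.grad(a, actor.parameters(), grad_outputs=dq_da)

print(all(torch.allclose(g1, g2) for g1, g2 in zip(grad_direct, grad_chain)))  # True
```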

2 Derivation of the Deterministic Policy Gradient Theorem

Consider the Bellman expectation equations:

v_{\pi (\theta )}(s)=q_{\pi (\theta )}(s,\pi (s;\theta )),s\in S

q_{\pi (\theta )}(s,\pi (s;\theta ))=r(s,\pi (s;\theta ))+\gamma \sum_{s^{'}}^{}p(s^{'}\mid s,\pi (\theta ))v_{\pi (\theta )}(s^{'}),s\in S

Taking the gradient of both equations with respect to \theta gives

\triangledown v_{\pi (\theta )}(s)=\triangledown q_{\pi (\theta )}(s,\pi (s;\theta )),s\in S

\triangledown q_{\pi (\theta )}(s,\pi (s;\theta ))=\left [ \triangledown _{a}r(s,a) \right ]_{a=\pi (s;\theta )}\triangledown \pi (s;\theta )+\gamma \sum_{s^{'}}^{}\left \{ \left [ \triangledown _{a}p(s^{'}\mid s,a) \right ]_{a=\pi (s;\theta )}\left [ \triangledown \pi (s;\theta ) \right ]v_{\pi (\theta )}(s^{'})+p(s^{'}\mid s;\pi(\theta))\triangledown v_{\pi(\theta)}(s^{'}) \right \}

=\triangledown \pi(s;\theta)\left [ \triangledown _{a}r(s,a)+\gamma\sum_{s^{'}}^{}\triangledown _{a}p(s^{'}\mid s,a)v_{\pi(\theta)}(s^{'}) \right ]_{a=\pi (s;\theta)}+\gamma\sum_{s^{'}}^{}p(s^{'}\mid s;\pi(\theta))\triangledown v_{\pi(\theta)}(s^{'})

=\triangledown \pi(s;\theta)\left [ \triangledown _{a}q_{\pi(\theta)}(s,a) \right ]_{a=\pi (s;\theta)}+\gamma\sum_{s^{'}}^{}p(s^{'}\mid s;\pi(\theta))\triangledown v_{\pi(\theta)}(s^{'}),s\in S

Substituting the expression for \triangledown q_{\pi (\theta )}(s,\pi (s;\theta )) into the expression for \triangledown v_{\pi (\theta )}(s) gives

\triangledown v_{\pi (\theta )}(s)

=\triangledown \pi(s;\theta)\left [ \triangledown _{a}q_{\pi(\theta)}(s,a) \right ]_{a=\pi (s;\theta)}+\gamma\sum_{s^{'}}^{}p(s^{'}\mid s;\pi(\theta))\triangledown v_{\pi(\theta)}(s^{'}),s\in S

Taking the expectation of the above with respect to S_{t} gives

E\left [ \triangledown v_{\pi(\theta)}(S_{t}) \right ]

=\sum_{s}^{}Pr\left [ S_{t}=s \right ]\triangledown v_{\pi(\theta)}(s)

=\sum_{s}^{}Pr\left [ S_{t}=s \right ]\left [ \triangledown \pi(s;\theta)\left [ \triangledown _{a}q_{\pi (\theta)}(s,a) \right ]_{a=\pi(s; \theta)}+\gamma \sum_{s^{'}}^{}p(s^{'} \mid s,\pi(\theta))\triangledown v_{\pi (\theta)}(s^{'}) \right ]

=\sum_{s}^{}Pr\left [ S_{t}=s \right ]\left [ \triangledown \pi(s;\theta)\left [ \triangledown _{a}q_{\pi (\theta)}(s,a) \right ]_{a=\pi(s; \theta)}+\gamma \sum_{s^{'}}^{}Pr\left [ S_{t+1}=s^{'} \mid S_{t}=s; \pi(\theta) \right ]\triangledown v_{\pi (\theta)}(s^{'}) \right ]

=\sum_{s}^{}Pr\left [ S_{t}=s \right ] \triangledown \pi(s;\theta)\left [ \triangledown _{a}q_{\pi (\theta)}(s,a) \right ]_{a=\pi(s; \theta)}+\gamma \sum_{s}^{}Pr\left [ S_{t}=s \right ] \sum_{s^{'}}^{}Pr\left [ S_{t+1}=s^{'} \mid S_{t}=s; \pi(\theta) \right ]\triangledown v_{\pi (\theta)}(s^{'})

=\sum_{s}^{}Pr\left [ S_{t}=s \right ] \triangledown \pi(s;\theta)\left [ \triangledown _{a}q_{\pi (\theta)}(s,a) \right ]_{a=\pi(s; \theta)}+\gamma \sum_{s^{'}}^{}Pr\left [ S_{t+1}=s^{'} ; \pi(\theta) \right ]\triangledown v_{\pi (\theta)}(s^{'})

=E\left [ \triangledown \pi(S_{t};\theta)\left [ \triangledown _{a}q_{\pi (\theta)}(S_{t},a) \right ]_{a=\pi(S_{t}; \theta)} \right ] +\gamma E\left [\triangledown v_{\pi (\theta)}(S_{t+1}) \right ]

This gives a recursion from E\left [ \triangledown v_{\pi(\theta)}(S_{t}) \right ] to E\left [\triangledown v_{\pi (\theta)}(S_{t+1}) \right ]. Note that the gradient we ultimately care about (since our goal is to maximize the expected cumulative discounted reward) is

\triangledown E_{\pi (\theta )}\left [ G_{0} \right ]=E\left [\triangledown v_{\pi (\theta)}(S_{0}) \right ]

Therefore,

 \triangledown E_{\pi (\theta )}\left [ G_{0} \right ]

=E\left [\triangledown v_{\pi (\theta)}(S_{0}) \right ]

=E\left [ \triangledown \pi(S_{0};\theta)\left [ \triangledown _{a}q_{\pi (\theta)}(S_{0},a) \right ]_{a=\pi(S_{0}; \theta)} \right ] +\gamma E\left [\triangledown v_{\pi (\theta)}(S_{1}) \right ]

=E\left [ \triangledown \pi(S_{0};\theta)\left [ \triangledown _{a}q_{\pi (\theta)}(S_{0},a) \right ]_{a=\pi(S_{0}; \theta)} \right ] +

\gamma E\left [ \triangledown \pi(S_{1};\theta)\left [ \triangledown _{a}q_{\pi (\theta)}(S_{1},a) \right ]_{a=\pi(S_{1}; \theta)} \right ] +\gamma^{2} E\left [\triangledown v_{\pi (\theta)}(S_{2}) \right ]

=\cdots

=\sum_{t=0}^{+\infty }E\left [\gamma ^{t} \triangledown \pi(S_{t};\theta)\left [ \triangledown _{a}q_{\pi (\theta)}(S_{t},a) \right ]_{a=\pi(S_{t}; \theta)} \right ]

=E\left [ \sum_{t=0}^{+\infty }\gamma ^{t}\triangledown \pi (S_{t};\theta )\left [ \triangledown _{a}q_{\pi (\theta )}(S_{t},a) \right ]_{a=\pi (S_{t};\theta )} \right ]

This yields a form analogous to the policy gradient theorem for stochastic policies derived earlier.
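As a quick sanity check of this formula, consider a toy one-step problem (my own illustration, not from the original post): a single state, a scalar policy \pi(s;\theta)=\theta, reward r(a)=-(a-3)^{2}, and immediate termination, so E_{\pi (\theta )}\left [ G_{0} \right ]=-(\theta-3)^{2}. The theorem gives \triangledown \pi(s;\theta)=1 and \left [ \triangledown _{a}q(s,a) \right ]_{a=\theta}=-2(\theta-3), which matches a finite-difference estimate of the gradient:

```python
# Toy one-step problem: pi(s; theta) = theta, r(a) = -(a - 3)^2, episode ends immediately.
def expected_return(theta):
    a = theta                                  # deterministic action
    return -(a - 3.0) ** 2                     # E[G_0] = -(theta - 3)^2

theta = 1.25

# Deterministic policy gradient: grad_theta pi = 1, [grad_a q(s, a)]_{a = theta} = -2 (theta - 3)
dpg = 1.0 * (-2.0 * (theta - 3.0))

# Finite-difference estimate of grad_theta E[G_0]
eps = 1e-5
fd = (expected_return(theta + eps) - expected_return(theta - eps)) / (2 * eps)

print(dpg, fd)                                 # both approximately 3.5
```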

3 A Commonly Used Form of the Deterministic Policy Gradient Theorem

For deterministic policies over continuous action spaces, another form is more commonly used:

 \triangledown E_{\pi (\theta )}\left [ G_{0} \right ]=E_{S\sim \rho_{\pi (\theta)} }\left [\triangledown \pi(S;\theta)\left [ \triangledown _{a}q_{\pi (\theta)}(S,a) \right ]_{a=\pi(S; \theta)} \right ]

where the expectation is taken with respect to the discounted state distribution

\rho _{\pi }(s)=\int_{s_{0}\in S}^{}p_{s_{0}}(s_{0})\sum_{t=0}^{+\infty }\gamma^{t}Pr\left [ S_{t}=s \mid S_{0}=s_{0}; \pi(\theta) \right ]ds_{0}
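For intuition, if the state space is finite the integral becomes a sum, and \rho can then be computed in closed form as p_{0}^{T}(I-\gamma P)^{-1}, where P is the state transition matrix induced by the deterministic policy. A small numerical illustration (my own, with made-up numbers):

```python
import numpy as np

gamma = 0.9
# Transition matrix induced by the deterministic policy: P[s, s'] = p(s' | s, pi(s; theta))
P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])
p0 = np.array([1.0, 0.0, 0.0])                 # initial state distribution

# rho(s) = sum_t gamma^t Pr[S_t = s]  =  (p0 @ sum_t (gamma P)^t)(s)  =  (p0 @ inv(I - gamma P))(s)
rho = p0 @ np.linalg.inv(np.eye(3) - gamma * P)

print(rho)                                     # unnormalized: rho.sum() == 1 / (1 - gamma) == 10
```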

4 Derivation of the Commonly Used Form

\triangledown E_{\pi (\theta )}\left [ G_{0} \right ]

=\sum_{t=0}^{+\infty }E\left [\gamma ^{t} \triangledown \pi(S_{t};\theta)\left [ \triangledown _{a}q_{\pi (\theta)}(S_{t},a) \right ]_{a=\pi(S_{t}; \theta)} \right ]

=\sum_{t=0}^{+\infty } \int_{s}^{} p_{s_{t}}(s)\gamma ^{t} \triangledown \pi(s;\theta)\left [ \triangledown _{a}q_{\pi (\theta)}(s,a) \right ]_{a=\pi(s; \theta)}ds

=\sum_{t=0}^{+\infty } \int_{s}^{} \left ( \int_{s_{0}}^{}p_{s_{0}}(s_{0})Pr\left [ S_{t}=s \mid S_{0}=s_{0}; \pi(\theta) \right ]ds_{0} \right ) \gamma^{t}\triangledown \pi(s;\theta)\left [ \triangledown _{a}q_{\pi (\theta)}(s,a) \right ]_{a=\pi(s; \theta)}ds

=\int_{s}^{} \left ( \int_{s_{0}}^{}p_{s_{0}}(s_{0}) \sum_{t=0}^{+\infty } \gamma^{t}Pr\left [ S_{t}=s \mid S_{0}=s_{0}; \pi(\theta) \right ]ds_{0} \right )\triangledown \pi(s;\theta)\left [ \triangledown _{a}q_{\pi (\theta)}(s,a) \right ]_{a=\pi(s; \theta)}ds

 =\int_{s}^{} \rho _{\pi (\theta)}(s)\triangledown \pi(s;\theta)\left [ \triangledown _{a}q_{\pi (\theta)}(s,a) \right ]_{a=\pi(s; \theta)}ds

=E_{S\sim \rho_{\pi (\theta)} }\left [\triangledown \pi(S;\theta)\left [ \triangledown _{a}q_{\pi (\theta)}(S,a) \right ]_{a=\pi(S; \theta)} \right ]
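In off-policy algorithms such as DDPG and TD3, the expectation over S\sim \rho_{\pi (\theta)} is approximated by states sampled from a replay buffer (whose distribution generally differs from \rho; this is a standard practical approximation, not part of the theorem). A hedged sketch of the resulting actor update, with hypothetical networks of my own:

```python
import torch
import torch.nn as nn

# Hypothetical networks (same shapes as in the earlier sketch); nothing here comes from the post.
actor = nn.Sequential(nn.Linear(3, 32), nn.Tanh(), nn.Linear(32, 1))
critic = nn.Sequential(nn.Linear(3 + 1, 32), nn.Tanh(), nn.Linear(32, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

def actor_update(state_batch):
    """One ascent step on the sampled objective E_S[q(S, pi(S; theta))], updating only the actor."""
    actor_opt.zero_grad()
    a = actor(state_batch)                              # pi(S; theta) for each sampled state
    loss = -critic(torch.cat([state_batch, a], dim=-1)).mean()
    loss.backward()                                     # critic grads are computed but never applied:
    actor_opt.step()                                    # actor_opt only holds the actor's parameters
    return loss.item()

actor_update(torch.randn(64, 3))                        # stand-in for a replay-buffer batch of states
```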

5 Summary

This post derived the deterministic policy gradient theorem and its commonly used form. It lies at the core of many deterministic-policy algorithms such as DDPG and TD3, so it is well worth understanding. (This post largely follows Xiao Zhiqing's 《强化学习原理与Python实现》, i.e. Reinforcement Learning: Principles and Python Implementation.)

If anything above is wrong, please feel free to call it out!


Reposted from blog.csdn.net/weixin_46133643/article/details/124439392