Reinforcement Learning (9): Policy Gradients

Policy Gradient Methods

Almost all of the reinforcement learning methods we have studied so far are so-called action-value methods: they first learn the value of each action in each state and then, in each state, select an action based on those estimated action values. This can be seen as an 'indirect' approach. The real goal of reinforcement learning is to learn how to act, and these methods use action values only as an intermediate quantity to support that decision. It is an intuitive, easy-to-understand way of thinking. There is also a more 'direct' alternative: learn the policy itself, without a value function as an intermediary. This is more direct because what comes out of learning is precisely how to act, but it is less intuitive. With a value function as a helper there is essentially no interpretability problem: pick whichever action has the higher value. When the policy is learned directly there is no such reference point, so it is harder to explain why a particular action is chosen, and this is the main reason these methods feel harder to understand than action-value methods.

In fact, just as in deep learning, all we need is a function that fits the decision process well. The decision process can itself be viewed as a decision function, so we only need to apply reinforcement learning methods to approximate that function as closely as possible. Once we are learning a decision function, it is natural to learn its parameters, which turns the policy into a parameterized policy. A parameterized policy does not need a value function to select actions; the value function takes no part in the decision, but it may still be used when learning the policy parameters.

The popularity of deep learning has made gradient-based algorithms ubiquitous, and gradient-based policy learning has likewise become the mainstream. Gradient-based learning needs an objective: a scalar performance measure of the parameterized policy, which we write as \(J(\theta)\). The rest is simple: we maximize this performance measure, updating the parameters by
\[ \theta_{t+1} = \theta_t + \alpha \widehat{\nabla J (\theta_t)} \]
This is the general form of policy gradient algorithms, where \(\widehat{\nabla J (\theta_t)}\) is a stochastic estimate whose expectation approximates the gradient of the performance measure with respect to \(\theta_t\).

Policy Approximation and its Advantages

If the action space is discrete and not too large, a natural parameterization forms a numerical preference \(h(s,a,\theta)\) for each state-action pair. In each state, the action with the highest preference gets the highest probability of being selected; the most common way to turn preferences into probabilities is the softmax:
\[ \pi(a|s,\theta) \dot = \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}} \]
The preference function \(h(s,a,\theta)\) can itself be computed by an ANN, or simply as a linear combination of features.
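As a concrete illustration, here is a minimal Python/NumPy sketch of such a linear softmax policy. The state-action feature function x(s, a) and the list of actions are hypothetical placeholders, not something defined above; the gradient used below is the standard one for a linear softmax, \(\nabla \ln\pi(a|s,\theta) = x(s,a) - \sum_b \pi(b|s,\theta)\,x(s,b)\).

# Linear softmax policy: h(s,a,theta) = theta . x(s,a)  (sketch)
import numpy as np

def softmax_policy(theta, x, s, actions):
    """Return the vector pi(.|s,theta) for a linear preference function."""
    h = np.array([theta @ x(s, a) for a in actions])  # action preferences
    h -= h.max()                                      # subtract max for numerical stability
    p = np.exp(h)
    return p / p.sum()

def grad_log_pi(theta, x, s, a, actions):
    """Gradient of ln pi(a|s,theta): x(s,a) minus the expected feature vector under pi."""
    probs = softmax_policy(theta, x, s, actions)
    expected = sum(p * x(s, b) for p, b in zip(probs, actions))
    return x(s, a) - expected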

Parameterized policies have several advantages. First, a parameterized policy can approach a deterministic policy: as learning pushes the preference of the best action ever further above the others, the softmax probabilities approach 0 and 1, which ε-greedy action selection can never do.

Second, a softmax over action preferences can select actions with arbitrary probabilities, which matters in problems where the best achievable policy is stochastic.

Third, for some problems the policy is a simpler function to approximate than the action-value function, so a parameterized policy can be easier to learn than an action-value method.

Finally, policy parameterization offers a convenient way to inject prior knowledge about the desired form of the policy.

The Policy Gradient Theorem

The advantages above are practical reasons to prefer policy parameterization over action-value methods. There is also an important theoretical advantage: the policy gradient theorem, which gives an analytic expression for the gradient of the performance measure:
\[ \nabla J(\theta) \propto \sum_s \mu(s)\sum_a q_{\pi}(s,a)\nabla\pi(a|s,\theta) \]
where \(\mu(s)\) is the on-policy distribution under \(\pi\).

REINFORCE: Monte Carlo Policy Gradient

\[ \begin{array}{lll} \nabla J(\theta)& \propto& \sum_s \mu(s)\sum_a q_{\pi}(s,a)\nabla\pi(a|s,\theta) \\ &=& E_{\pi}\left[\sum_a \pi(a|S_t,\theta)\,q_{\pi}(S_t,a)\, \frac{\nabla\pi(a|S_t,\theta)}{\pi(a|S_t,\theta)}\right]\\ &=& E_{\pi}\left[q_{\pi}(S_t,A_t)\frac{\nabla\pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\right]\qquad\qquad (\text{replacing } a \text{ by the sample } A_t \sim \pi)\\ &=& E_{\pi}\left[ G_t \frac{\nabla\pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\right]\qquad\qquad\qquad (\text{because } E_{\pi}[G_t \mid S_t,A_t] = q_{\pi}(S_t,A_t)) \end{array} \]

where \(G_t\) is the return. From the above we obtain the REINFORCE parameter update:
\[ \theta_{t+1} = \theta_t + \alpha G_t \frac{\nabla\pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)} \]

# REINFORCE: Monte-Carlo Policy-Gradient Control (episodic), for estimating pi*
Algorithm parameter: step size alpha > 0
Initialize policy parameter theta (a vector)
Loop forever (for each episode):
    Generate an episode S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T, following pi(.|., theta)
    Loop for each step of the episode t = 0, 1, ..., T-1:
        G = sum_{k=t+1}^T gamma^{k-t-1} R_k
        theta = theta + alpha gamma^t G grad(ln pi(A_t|S_t, theta))
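Here is a minimal Python sketch of this algorithm, reusing the linear softmax helpers from the earlier sketch (softmax_policy and grad_log_pi). The environment interface is an assumption for illustration: reset() returns an initial state, and step(a) returns (next_state, reward, done).

# REINFORCE with a linear softmax policy (sketch; assumes a gym-like env interface)
import numpy as np

def reinforce(env, x, actions, d, alpha=2e-4, gamma=0.99, episodes=1000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    for _ in range(episodes):
        # Generate an episode S_0, A_0, R_1, ... following pi(.|., theta).
        s, done, traj = env.reset(), False, []
        while not done:
            probs = softmax_policy(theta, x, s, actions)
            a = actions[rng.choice(len(actions), p=probs)]
            s_next, r, done = env.step(a)
            traj.append((s, a, r))          # r plays the role of R_{t+1}
            s = s_next
        # Compute the returns G_t backwards, then apply the update for each step t.
        G, returns = 0.0, [0.0] * len(traj)
        for t in reversed(range(len(traj))):
            G = traj[t][2] + gamma * G
            returns[t] = G
        for t, (s_t, a_t, _) in enumerate(traj):
            theta += alpha * (gamma ** t) * returns[t] * grad_log_pi(theta, x, s_t, a_t, actions)
    return theta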

REINFORCE with Baseline

Being a Monte Carlo method, REINFORCE has high variance, which makes learning slow. Introducing a baseline can reduce the variance, and as long as the baseline does not depend on the action it leaves the expected value of the update unchanged:
\[ \nabla J(\theta) \propto \sum_s \mu(s)\sum_a \big( q_{\pi}(s,a)-b(s)\big)\nabla\pi(a|s,\theta) \]
so that:
\[ \theta_{t+1} = \theta_t + \alpha (G_t-b(S_t)) \frac{\nabla\pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)} \]
A natural choice for the baseline is the state-value function \(v(s,w)\), learned alongside the policy.

# REINFORCE with baseline (episodic), for estimating pi_theta = pi*
Input: a differentiable policy parameterization pi(a|s,theta)
Input: a differentiable state-value function parameterization v(s,w)
Algorithm parameters: step sizes alpha_theta > 0, alpha_w > 0
Initialize policy parameter theta and state-value weights w

Loop forever (for each episode):
    Generate an episode S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T following pi(.|., theta)
    Loop for each step of the episode t = 0, 1, ..., T-1:
        G = sum_{k=t+1}^T gamma^{k-t-1} R_k
        delta = G - v(S_t, w)
        w = w + alpha_w gamma^t delta grad(v(S_t, w))
        theta = theta + alpha_theta gamma^t delta grad(ln pi(A_t|S_t, theta))
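To show how the baseline plugs into the earlier REINFORCE sketch, here is the per-step update only, assuming a linear state-value baseline v(s,w) = w · x_s(s) with a hypothetical state-feature function x_s; episode generation and the return G are as in the previous sketch.

# Per-step REINFORCE-with-baseline update (sketch; linear baseline)
import numpy as np

def reinforce_baseline_step(theta, w, x, x_s, actions, s_t, a_t, G, t,
                            alpha_theta=2e-4, alpha_w=2e-3, gamma=0.99):
    delta = G - w @ x_s(s_t)                               # G_t minus the baseline v(S_t, w)
    w = w + alpha_w * (gamma ** t) * delta * x_s(s_t)      # gradient of the linear v is x_s(s_t)
    theta = theta + alpha_theta * (gamma ** t) * delta * grad_log_pi(theta, x, s_t, a_t, actions)
    return theta, w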

Actor-Critic Method

One-step actor-critic replaces the full Monte Carlo return of REINFORCE with the one-step bootstrapped return \(G_{t:t+1}\), so the policy can be updated online at every step instead of only at the end of an episode:
\[ \begin{array}{lll} \theta_{t+1} &\dot =& \theta_t + \alpha \left(G_{t:t+1}-\hat v(S_t,w)\right) \frac{\nabla\pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\\ &=& \theta_t + \alpha\left(R_{t+1} + \gamma\hat v(S_{t+1},w) - \hat v(S_t,w)\right) \frac{\nabla\pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\\ &=& \theta_t + \alpha\, \delta_t\, \frac{\nabla\pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)} \end{array} \]
Here the parameterized policy plays the role of the actor, and the learned value function \(\hat v(s,w)\), which supplies the TD error \(\delta_t\), plays the role of the critic.

# One-step Actor-Critic (episodic), for estimating pi_theta = pi*
Input: a differentiable policy parameterization pi(a|s,theta)
Input: a differentiable state-value function parameterization v(s,w)
Parameters: step sizes alpha_theta > 0, alpha_w > 0
Initialize policy parameter theta and state-value weights w
Loop forever (for each episode):
    Initialize S (first state of episode)
    I = 1
    Loop while S is not terminal (for each time step):
        A ~ pi(.|S, theta)
        Take action A, observe S', R
        delta = R + gamma v(S', w) - v(S, w)      (v(S', w) = 0 if S' is terminal)
        w = w + alpha_w I delta grad(v(S, w))
        theta = theta + alpha_theta I delta grad(ln pi(A|S, theta))
        I = gamma I
        S = S'
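A minimal Python sketch of this loop, reusing the linear softmax helpers from the earlier sketches and assuming a linear critic v(s,w) = w · x_s(s) and the same gym-like environment interface as before; all of these interfaces are assumptions for illustration, not part of the algorithm box above.

# One-step actor-critic with linear actor and critic (sketch)
import numpy as np

def one_step_actor_critic(env, x, x_s, actions, d_theta, d_w,
                          alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99,
                          episodes=1000, seed=0):
    rng = np.random.default_rng(seed)
    theta, w = np.zeros(d_theta), np.zeros(d_w)
    for _ in range(episodes):
        s, I, done = env.reset(), 1.0, False
        while not done:
            probs = softmax_policy(theta, x, s, actions)
            a = actions[rng.choice(len(actions), p=probs)]
            s_next, r, done = env.step(a)
            v_next = 0.0 if done else w @ x_s(s_next)     # terminal state has value 0
            delta = r + gamma * v_next - w @ x_s(s)       # one-step TD error
            w += alpha_w * I * delta * x_s(s)             # critic update (grad of linear v is x_s)
            theta += alpha_theta * I * delta * grad_log_pi(theta, x, s, a, actions)  # actor update
            I *= gamma
            s = s_next
    return theta, w
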
# Actor-Critic with Eligibility Traces (episodic), for estimating pi_theta = pi*
Input: a differentiable policy parameterization pi(a|s,theta)
Input: a differentiable state-value function parameterization v(s,w)
Parameters: trace-decay rates lambda_theta in [0,1], lambda_w in [0,1]; step sizes alpha_theta > 0, alpha_w > 0
Initialize policy parameter theta and state-value weights w

Loop forever (for each episode):
    Initialize S (first state of episode)
    z_theta = 0 (d'-component eligibility trace vector)
    z_w = 0 (d-component eligibility trace vector)
    I = 1
    Loop while S is not terminal (for each time step):
        A ~ pi(.|S, theta)
        Take action A, observe S', R
        delta = R + gamma v(S', w) - v(S, w)      (v(S', w) = 0 if S' is terminal)
        z_w = gamma lambda_w z_w + I grad(v(S, w))
        z_theta = gamma lambda_theta z_theta + I grad(ln pi(A|S, theta))
        w = w + alpha_w delta z_w
        theta = theta + alpha_theta delta z_theta
        I = gamma I
        S = S'
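Relative to the one-step sketch above, the only change is that the per-step gradients are first accumulated into decaying trace vectors, and the TD error then scales the traces rather than the raw gradients. A hedged fragment of the inner-loop update, with the same assumed linear parameterizations:

# Eligibility-trace version of the per-step actor-critic update (sketch)
def trace_update(z_w, z_theta, w, theta, delta, I, grad_v, grad_ln_pi,
                 alpha_w, alpha_theta, gamma, lam_w, lam_theta):
    z_w = gamma * lam_w * z_w + I * grad_v                   # critic trace accumulates grad v(S, w)
    z_theta = gamma * lam_theta * z_theta + I * grad_ln_pi   # actor trace accumulates grad ln pi
    w = w + alpha_w * delta * z_w
    theta = theta + alpha_theta * delta * z_theta
    return z_w, z_theta, w, theta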

Policy Gradient for Continuing Problems

For continuing problems, which have no episode boundaries, the performance measure is defined in terms of the average reward per time step:
\[ \begin{array}{lll} J(\theta) \dot = r(\pi) &\dot =& \lim_{h\rightarrow\infty}\frac{1}{h}\sum_{t=1}^h E[R_t\mid A_{0:t-1}\sim \pi]\\ &=&\lim_{t\rightarrow \infty} E[R_t\mid A_{0:t-1} \sim \pi]\\ &=& \sum_{s}\mu(s)\sum_a \pi(a|s) \sum_{s',r}p(s',r|s,a)\,r \end{array} \]
where \(\mu\) is the steady-state distribution under \(\pi\), \(\mu(s)\dot = \lim_{t\rightarrow\infty}P\{S_t = s\mid A_{0:t}\sim \pi\}\), which is assumed to exist and to be independent of \(S_0\) (an ergodicity assumption).

This distribution is special: if states are drawn from it and actions are selected according to \(\pi\), the resulting next states follow the same distribution:
\[ \sum_s\mu(s)\sum_{a}\pi(a|s,\theta)p(s'|s,a) = \mu(s'), \qquad s' \in S \]
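As a small numerical check of this stationarity property, here is a hedged example with a made-up two-state, two-action MDP and a fixed policy; the transition probabilities and policy below are purely illustrative.

# Numerical check that mu is stationary under pi (toy example)
import numpy as np

# Hypothetical dynamics p[s, a, s'] = P(s'|s, a) and policy pi[s, a] = pi(a|s).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

# State-to-state transition matrix under pi: P[s, s'] = sum_a pi(a|s) p(s'|s, a).
P = np.einsum('sa,sax->sx', pi, p)

# Steady-state distribution mu: left eigenvector of P for eigenvalue 1, normalized to sum to 1.
vals, vecs = np.linalg.eig(P.T)
mu = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
mu = mu / mu.sum()

print(mu, mu @ P)   # the two vectors agree: sum_s mu(s) sum_a pi(a|s) p(s'|s,a) = mu(s')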

# Actor-Critic with Eligibility Traces (continuing), for estimating pi_theta = pi*
Input: a differentiable policy parameterization pi(a|s,theta)
Input: a differentiable state-value function parameterization v(s,w)
Parameters: trace-decay rates lambda_theta in [0,1], lambda_w in [0,1]; step sizes alpha_theta > 0, alpha_w > 0, alpha_R_bar > 0
Initialize R_bar (e.g., to 0)
Initialize policy parameter theta and state-value weights w
Initialize S
z_w = 0 (d-component eligibility trace vector)
z_theta = 0 (d'-component eligibility trace vector)

Loop forever (for each time step):
    A ~ pi(.|S, theta)
    Take action A, observe S', R
    delta = R - R_bar + v(S', w) - v(S, w)
    R_bar = R_bar + alpha_R_bar delta
    z_w = lambda_w z_w + grad(v(S, w))
    z_theta = lambda_theta z_theta + grad(ln pi(A|S, theta))
    w = w + alpha_w delta z_w
    theta = theta + alpha_theta delta z_theta
    S = S'

Policy Parameterization for Continuous Actions

When the action space is continuous, policy-based methods instead learn the statistics of a probability distribution over actions. If each action is a real number, the policy can be parameterized as a normal (Gaussian) probability density whose mean and standard deviation depend on the state:
\[ \pi(a|s,\pmb \theta) \dot = \frac{1}{\sigma(s,\pmb\theta)\sqrt{2\pi}}\exp \bigg(-\frac{(a - \mu(s,\pmb\theta))^2}{2\sigma(s,\pmb\theta)^2}\bigg) \]
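A common concrete choice, used here only as an assumption for illustration, is to split \(\pmb\theta\) into two parts with a linear mean \(\mu(s,\pmb\theta) = \theta_{\mu}^{\top} x(s)\) and a log-linear standard deviation \(\sigma(s,\pmb\theta) = \exp(\theta_{\sigma}^{\top} x(s))\), which keeps \(\sigma\) positive. A minimal NumPy sketch with a hypothetical state-feature function x:

# Gaussian policy for a real-valued action (sketch; linear mean, log-linear std)
import numpy as np

def gaussian_policy_sample(theta_mu, theta_sigma, x, s, rng):
    mu = theta_mu @ x(s)
    sigma = np.exp(theta_sigma @ x(s))      # exp keeps the standard deviation positive
    return rng.normal(mu, sigma), mu, sigma

def gaussian_grad_log_pi(theta_mu, theta_sigma, x, s, a):
    """Gradients of ln pi(a|s,theta) with respect to theta_mu and theta_sigma."""
    feats = x(s)
    mu = theta_mu @ feats
    sigma = np.exp(theta_sigma @ feats)
    grad_mu = (a - mu) / sigma**2 * feats                 # d ln pi / d theta_mu
    grad_sigma = ((a - mu)**2 / sigma**2 - 1.0) * feats   # d ln pi / d theta_sigma
    return grad_mu, grad_sigma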


Reposted from www.cnblogs.com/vpegasus/p/pg.html