[ICLR 2019] An Algorithmic Framework for Model-Based Deep Reinforcement Learning with Theoretical Guarantees

Paper title: Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees

Author and title information

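What problem does it address?

The paper proposes an algorithmic framework for model-based reinforcement learning (MBRL) with theoretical guarantees. It designs a meta-algorithm that is guaranteed, in theory, to improve monotonically toward a local maximum of the expected reward. Instantiating this framework for MBRL yields the Stochastic Lower Bounds Optimization (SLBO) algorithm. (As in related work, the reward function is assumed to be known.)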

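Background

Model-free reinforcement learning algorithms have been very successful, but they are expensive in terms of samples. Model-based methods plan and learn on a learned model, which makes them far more sample-efficient.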

Our meta-algorithm (Algorithm 1) extends the optimism-in-face-of-uncertainty principle to non-linear dynamical models in a way that requires no explicit uncertainty quantification of the dynamical models.

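What method is used?

The proposed monotonic-improvement framework

The SLBO algorithm

The dynamics model is trained with a multi-step prediction loss under the $\ell_{2}$ norm. Writing $\hat{s}$ for the states predicted by the learned model and $s$ for the observed states, the loss is defined as: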

$\mathcal{L}_{\phi}^{(H)}\left(\left(s_{t: t+h}, a_{t: t+h}\right) ; \phi\right)=\frac{1}{H} \sum_{i=1}^{H}\left\|\left(\hat{s}_{t+i}-\hat{s}_{t+i-1}\right)-\left(s_{t+i}-s_{t+i-1}\right)\right\|_{2}$
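To make the loss concrete, here is a minimal pure-Python sketch. It is not the implementation from the SLBO repository. It assumes that the rollout starts from the observed state ($\hat{s}_{t}=s_{t}$) and that the model is then applied to its own predictions; `model` is a hypothetical callable `model(s, a) -> next_state`, and states and actions are plain lists of floats.

```python
import math

def multi_step_loss(model, states, actions, horizon):
    """Mean L2 error between predicted and observed state differences.

    model   : hypothetical callable, model(s, a) -> predicted next state
    states  : observed states s_t, ..., s_{t+H} (H + 1 entries)
    actions : actions a_t, ..., a_{t+H-1} (at least H entries)
    horizon : H, the number of prediction steps
    """
    s_hat_prev = list(states[0])  # assumed initialisation: s_hat_t = s_t
    total = 0.0
    for i in range(1, horizon + 1):
        # Roll the model forward on its own previous prediction.
        s_hat = model(s_hat_prev, actions[i - 1])
        pred_delta = [p - q for p, q in zip(s_hat, s_hat_prev)]
        true_delta = [p - q for p, q in zip(states[i], states[i - 1])]
        # L2 norm (not squared) of the difference between the two deltas.
        total += math.sqrt(sum((u - v) ** 2 for u, v in zip(pred_delta, true_delta)))
        s_hat_prev = s_hat
    return total / horizon


if __name__ == "__main__":
    # Toy 1-D system with true dynamics s' = s + a.
    states = [[0.0], [1.0], [3.0]]
    actions = [[1.0], [2.0]]
    exact = lambda s, a: [x + u for x, u in zip(s, a)]
    biased = lambda s, a: [x + u + 0.1 for x, u in zip(s, a)]
    print(multi_step_loss(exact, states, actions, 2))
    print(multi_step_loss(biased, states, actions, 2))
```

Because the loss compares state differences rather than raw states, a model with a constant per-step bias is penalised by that bias at every step, as the `biased` toy model shows.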

Adding the policy parameters $\theta$, the overall objective (Eq. 6.2 in the paper) is:

$\max _{\phi, \theta} V^{\pi_{\theta}, \operatorname{sg}\left(\widehat{M}_{\phi}\right)}-\lambda \underset{\left(s_{t: t+h}, a_{t: t+h}\right) \sim \pi_{k}, M^{\star}}{\mathbb{E}}\left[\mathcal{L}_{\phi}^{(H)}\left(\left(s_{t: t+h}, a_{t: t+h}\right) ; \phi\right)\right]$
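One way to read this objective, assuming $\operatorname{sg}$ denotes a stop-gradient operator (this post does not define it, so treat that reading as an assumption): write the objective as $J(\phi, \theta)$, a name introduced here only for illustration. With the gradient through the model blocked inside the value term, the two parameter groups are each driven by a single term:

$\begin{aligned} \nabla_{\theta} J(\phi, \theta) &= \nabla_{\theta} V^{\pi_{\theta}, \operatorname{sg}\left(\widehat{M}_{\phi}\right)} \\ \nabla_{\phi} J(\phi, \theta) &= -\lambda\, \nabla_{\phi} \underset{\left(s_{t: t+h}, a_{t: t+h}\right) \sim \pi_{k}, M^{\star}}{\mathbb{E}}\left[\mathcal{L}_{\phi}^{(H)}\left(\left(s_{t: t+h}, a_{t: t+h}\right) ; \phi\right)\right] \end{aligned}$

Under that reading, the policy is improved inside the current model, while the model is fitted only to real trajectories collected with $\pi_{k}$ and is never adjusted to inflate the predicted value.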

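The original paper also contains a large amount of theoretical derivation. I will come back to it if my research calls for it; interested readers can take a look.

What results were achieved?

Experimental results

Publication information? Author information?

This is an ICLR 2019 paper. The author is a third-year PhD student in the Department of Computer Science at Princeton University, advised by Sanjeev Arora, and previously studied in the Yao Class at Tsinghua University. The author's research focuses on machine learning, especially reinforcement learning algorithms.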

Reference links

Sanjeev Arora works mainly on theoretical convergence analysis in machine learning.

Sanjeev Arora's homepage: https://www.cs.princeton.edu/~arora/
Code: https://github.com/roosephu/slbo

Further reading

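Let $V^{\pi}$ be the value function in the real environment and $\widehat{V}^{\pi}$ the value function under the estimated model. The paper designs a provable upper bound $D^{\pi,\widehat{M}}$ on the value-estimation error between the estimated and the real dynamical model. Subtracting $D^{\pi,\widehat{M}}$ from the model's value gives a lower bound on the true value function: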

$V^{\pi} \geq \widehat{V}^{\pi}-D^{\pi, \widehat{M}}$

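The algorithm first collects data by interacting with the environment, builds the lower bound above, and then maximizes it over both the dynamical model $\widehat{M}$ and the policy $\pi$. Any RL algorithm can be used to optimize the lower bound, because it is optimized with sample trajectories from a fixed reference policy rather than through an interactive policy-iteration process.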
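A short sketch of why maximizing this lower bound gives monotonic improvement, under assumptions that this post does not state: the true model $M^{\star}$ is among the candidate models, the bound vanishes when the model is exact ($D^{\pi, M^{\star}}=0$), and $\left(\pi_{k+1}, M_{k+1}\right)$ maximizes $V^{\pi, M}-D^{\pi, M}$ built from data collected with $\pi_{k}$. Any constraint on how far $\pi$ may move from $\pi_{k}$ is omitted here. Writing $V^{\pi, M}$ for the value of $\pi$ under model $M$:

$\begin{aligned} V^{\pi_{k+1}, M^{\star}} &\geq V^{\pi_{k+1}, M_{k+1}}-D^{\pi_{k+1}, M_{k+1}} \\ &\geq V^{\pi_{k}, M^{\star}}-D^{\pi_{k}, M^{\star}} \\ &= V^{\pi_{k}, M^{\star}} \end{aligned}$

The first line is the lower bound itself, the second holds because $\left(\pi_{k}, M^{\star}\right)$ is a feasible candidate for the maximization, and the third uses the assumption that the bound is zero for the exact model.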

The value function is defined as:

$V^{\pi, M}(s)=\underset{\forall t \geq 0, A_{t} \sim \pi\left(\cdot | S_{t}\right) ,S_{t+1} \sim M(\cdot|S_{t},A_{t})}{\mathbb{E}}\left[\sum_{t=0}^{\infty} \gamma^{t} R\left(S_{t}, A_{t}\right) | S_{0}=s\right]$
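The expectation in this definition can be approximated by averaging discounted returns over sampled rollouts. The sketch below is illustrative only and does not come from the paper or its code: `policy`, `model`, and `reward` are hypothetical callables, the infinite sum is truncated at `horizon` steps, and the reward function is taken as known, matching the assumption noted earlier.

```python
import random

def rollout_value(policy, model, reward, s0, gamma=0.99,
                  horizon=200, n_rollouts=100, seed=0):
    """Monte Carlo estimate of V^{pi, M}(s0) with a truncated horizon.

    policy(s, rng)   -> action sampled from pi(. | s)      (hypothetical)
    model(s, a, rng) -> next state sampled from M(. | s, a) (hypothetical)
    reward(s, a)     -> known reward R(s, a)
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s, rng)
            ret += discount * reward(s, a)  # gamma^t * R(S_t, A_t)
            s = model(s, a, rng)            # S_{t+1} ~ M(. | S_t, A_t)
            discount *= gamma
        total += ret
    return total / n_rollouts


if __name__ == "__main__":
    # Toy check: constant reward 1 and gamma = 0.5 give a value close to 2.
    v = rollout_value(
        policy=lambda s, rng: 0,
        model=lambda s, a, rng: s,
        reward=lambda s, a: 1.0,
        s0=0, gamma=0.5, horizon=60, n_rollouts=3,
    )
    print(v)
```

Passing the learned model as `model` gives an estimate of $\widehat{V}^{\pi}$, while rolling out in the real environment gives an estimate of $V^{\pi}$. The truncation error is at most $\gamma^{\text{horizon}} R_{\max} /(1-\gamma)$ for rewards bounded in absolute value by $R_{\max}$.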

To be continued...

小小何先生