David Silver RL Course, Lecture 1 (Introduction to Reinforcement Learning)

Category: Other. Posted 2018-11-18 12:23:56.

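1. How reinforcement learning differs from traditional supervised/unsupervised learning:

There is no supervisor, only a reward signal (like a child learning by trial and error).
Feedback is delayed, not instantaneous (a bad decision does not show its consequences immediately; they only become visible several steps later).
Time really matters: the data are sequential, not i.i.d. (the i.i.d. assumption is broken, because the agent keeps acting in response to a changing environment and successive samples are therefore correlated).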

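2. Reinforcement learning can be applied in many fields: a reward mechanism is used to optimise decision-making, and different problems call for different collections of data. In a game, for example, a strong strategy is found through repeated trial and error.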

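3. Rewards

A reward is a scalar feedback signal, modelled as a random variable.
Whatever we care about has to be converted into this single scalar signal, and it must be informative enough to rank outcomes on a common scale (that is, to express priorities).
Every decision step has a corresponding reward; the agent's job is to maximise the sum of the per-step rewards.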

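4. Goal: select actions to maximise total future reward

The aim is a unified framework: use machine-learning methods and one common formalism to handle different sequential decision problems, plan ahead, and maximise future reward.

Actions may have long-term consequences.
Reward may be delayed.
It may be better to sacrifice immediate reward to gain more long-term reward.

The agent has to plan ahead because consequences are long-term. An action may not give the result we want right now, yet lead to it a few steps later. That means giving up some immediate reward in exchange for a higher reward in the near future, so the agent should not be too greedy and must take the long view. Examples are long-term investment and managing an aircraft's fuel consumption over a flight.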

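5. Direction of data flow:

The agent is responsible for taking actions, and every action is based on the information it currently has. The agent has two inputs, the observation of the outside world and the reward it receives, and together they determine its next action. Our goal is to work out the algorithm that sits inside this "brain".

On the other side is the external environment. The agent and the environment interact in a loop over time: at each step the agent receives an observation from the outside world and takes an action; the action produces a new situation in the environment, which emits the next observation and its corresponding reward (score). We cannot control the environment directly; the agent's actions are the only channel through which it can be influenced.

Reinforcement learning is based on a time series of observations, rewards and actions.
This time series is the agent's experience, and that experience is the data used for reinforcement learning.
The reinforcement learning problem therefore centres on this data source, that is, on this stream of data.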
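
The interaction loop described above can be sketched in a few lines of Python. This is only an illustrative sketch: `CoinFlipEnv`, `CountingAgent` and `run` are hypothetical names invented here, and the coin-guessing task is an assumption chosen because it is tiny, not something taken from the lecture.

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses a biased coin."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def step(self, action):
        # The environment receives an action and emits (observation, reward).
        outcome = 1 if self.rng.random() < 0.7 else 0
        reward = 1.0 if action == outcome else 0.0
        return outcome, reward

class CountingAgent:
    """Hypothetical agent: guesses the outcome it has observed most often."""
    def __init__(self):
        self.counts = [0, 0]

    def act(self):
        return 1 if self.counts[1] >= self.counts[0] else 0

    def observe(self, observation, reward):
        # This simple agent only uses the observation; the reward is ignored.
        self.counts[observation] += 1

def run(steps=100, seed=0):
    env, agent = CoinFlipEnv(seed), CountingAgent()
    experience = []  # the stream of (action, observation, reward) tuples
    for _ in range(steps):
        action = agent.act()                    # agent -> environment
        observation, reward = env.step(action)  # environment -> agent
        agent.observe(observation, reward)
        experience.append((action, observation, reward))
    return experience

experience = run()
total_reward = sum(r for _, _, r in experience)
```

The list `experience` plays the role of the data stream: it is the only thing a learning algorithm would get to see.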

6. History: the history is the sequence of observations, actions and rewards.

What happens next depends on the history:
The agent selects actions depending on the history (it builds a mapping from history to action).
The environment selects observations/rewards (the environment changes according to the history and emits rewards).
However, the history is usually enormous.

7. State: a concise summary of the history, used in place of the history

State is the information used to determine what happens next.
State is a function of the history.
States are divided into the agent state and the environment state.
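
To make "state is a function of the history" concrete, here is a small sketch with two different state functions applied to the same history. The history contents and the function names `last_k_observations` and `observation_counts` are made up for illustration; each history entry is assumed to be an (observation, reward, action) tuple.

```python
from collections import Counter

# Hypothetical history: one (observation, reward, action) tuple per time step.
history = [
    ("light", 0, "press"),
    ("bell", 0, "press"),
    ("light", 1, "wait"),
    ("light", 0, "press"),
]

def last_k_observations(history, k=2):
    """State = the k most recent observations."""
    return tuple(obs for obs, _, _ in history[-k:])

def observation_counts(history):
    """State = how often each observation has occurred so far."""
    return Counter(obs for obs, _, _ in history)

state_a = last_k_observations(history)
state_b = observation_counts(history)
```

Both functions compress the same history, but they keep different information, so an agent using one may predict and act differently from an agent using the other.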

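8. An information state (Markov state) contains all useful information from the history.

Markov chain (Markov property)

A state $S_t$ is Markov if and only if

$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$$

Given the current state, the next state does not depend on earlier states; it depends only on the present.

Once the state is known, the history may be thrown away.
The state is a sufficient statistic of the future.
The environment state is Markov.
The history is Markov (by definition, since it stores the entire history).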
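
The Markov property can be checked numerically on a small example. The sketch below assumes a made-up two-state chain (the transition table `P` and start distribution `start` are arbitrary) and compares the next-state distribution conditioned on the full history with the one conditioned on the current state only.

```python
from itertools import product

# Hypothetical two-state Markov chain.
P = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.5, "B": 0.5}}
start = {"A": 0.6, "B": 0.4}

def joint(s1, s2, s3):
    """Probability of the three-step trajectory s1 -> s2 -> s3."""
    return start[s1] * P[s1][s2] * P[s2][s3]

def cond_given_history(s3, s1, s2):
    """P[S3 = s3 | S1 = s1, S2 = s2]"""
    return joint(s1, s2, s3) / sum(joint(s1, s2, x) for x in P)

def cond_given_current(s3, s2):
    """P[S3 = s3 | S2 = s2], with S1 marginalised out."""
    num = sum(joint(x, s2, s3) for x in P)
    den = sum(joint(x, s2, y) for x in P for y in P)
    return num / den

# Largest difference between the two conditionals over all trajectories.
max_gap = max(
    abs(cond_given_history(s3, s1, s2) - cond_given_current(s3, s2))
    for s1, s2, s3 in product(P, repeat=3)
)
```

For this chain `max_gap` is zero up to floating-point rounding, which is what "the history may be thrown away" means in practice.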

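9. Full observability (fully observable environment; most of the course deals with this case)

The agent directly observes the environment state (the numbers that represent the state).
The agent state and the environment state are the same.
This is a Markov decision process (MDP).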

10. Partial observability: the agent indirectly observes the environment

E.g. a robot, or a poker-playing agent.
Here the agent state and the environment state are not the same.
This is a partially observable Markov decision process (POMDP).

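11. Building the agent (constructing its state)

Writing $H_t$ for the history up to time $t$, $S^a_t$ for the agent state, $S^e_t$ for the environment state and $O_t$ for the observation, the options are:
Complete history (remember every observation, action and reward): $S^a_t = H_t$
Beliefs of environment state (a Bayesian approach): $S^a_t = (\mathbb{P}[S^e_t = s^1], \ldots, \mathbb{P}[S^e_t = s^n])$
Recurrent neural network: combine the agent's previous state with the latest observation through a linear combination, passed through a nonlinearity, to obtain the new state: $S^a_t = \sigma(S^a_{t-1} W_s + O_t W_o)$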
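
As an illustration of the "beliefs of environment state" option, the sketch below applies Bayes' rule to keep a probability for each hidden environment state. Everything concrete here is an assumption made for the example: the two hidden states, the observation likelihoods, and the simplification that the hidden state does not change between observations.

```python
def update_belief(belief, observation, likelihood):
    """One Bayes-rule step: posterior is proportional to prior times likelihood."""
    unnormalised = {s: belief[s] * likelihood[s][observation] for s in belief}
    total = sum(unnormalised.values())
    return {s: p / total for s, p in unnormalised.items()}

# Hypothetical sensor model: P[observation | hidden environment state].
likelihood = {
    "door_open": {"bright": 0.8, "dark": 0.2},
    "door_closed": {"bright": 0.3, "dark": 0.7},
}

# Start from a uniform belief and fold in three observations.
belief = {"door_open": 0.5, "door_closed": 0.5}
for obs in ["bright", "bright", "dark"]:
    belief = update_belief(belief, obs, likelihood)
```

The agent state here is the dictionary `belief`, not the hidden state itself, which the agent never sees directly.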

12. An RL agent may include one or more of these components:

Policy: the agent's behaviour function (a mapping from state to action).
Value function: how good each state and/or action is (the expected reward).
Model: the agent's representation of the environment (used to predict how the environment will change).

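13. Policy

A policy is the agent's behaviour.
It is a map from state to action.
Deterministic policy: $a = \pi(s)$
Stochastic policy: maps each state to a probability distribution over actions, $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$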
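
The two kinds of policy can be contrasted in code. This is a minimal sketch in which the states, the actions and the probabilities are invented for illustration: the deterministic policy is a lookup table, while the stochastic policy samples an action from a per-state distribution.

```python
import random

def deterministic_policy(state):
    """Deterministic policy: each state maps to exactly one action."""
    table = {"low_battery": "recharge", "high_battery": "search"}
    return table[state]

# Stochastic policy: each state maps to a distribution over actions.
stochastic_table = {
    "low_battery": {"recharge": 0.9, "search": 0.1},
    "high_battery": {"recharge": 0.2, "search": 0.8},
}

def stochastic_policy(state, rng):
    """Sample an action according to the distribution for this state."""
    actions = list(stochastic_table[state])
    weights = [stochastic_table[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)
samples = [stochastic_policy("low_battery", rng) for _ in range(1000)]
recharge_fraction = samples.count("recharge") / len(samples)
```

Over many samples `recharge_fraction` should land close to the 0.9 specified in the table, whereas the deterministic policy returns the same action every time.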

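14. Value: a prediction of future reward

The value function is a prediction of future reward.
It is used to evaluate the goodness or badness of states,
and therefore to select between actions.
For a policy $\pi$, the value function is

$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s]$$

where $R_{t+1}$ is the reward at the next step and $\gamma$ is a weight smaller than 1, the discount factor. Discounting means we care more about near-term reward than about reward far in the future.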
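
The effect of discounting is easy to see on a single reward sequence. The sketch below computes the discounted sum of one sampled sequence of rewards with a discount factor `gamma`; the value function is the expectation of this quantity over trajectories, which this example does not attempt to estimate. The function name and the reward sequences are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards[k] * gamma**k, accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Same rewards, different discount factors.
g_half = discounted_return([1.0, 0.0, 2.0], gamma=0.5)

# A small immediate reward versus a larger delayed one.
g_now = discounted_return([1.0, 0.0, 0.0], gamma=0.9)
g_later = discounted_return([0.0, 0.0, 10.0], gamma=0.9)
```

With `gamma=0.9` the delayed reward of 10 is still worth about 8.1 today, so sacrificing the immediate reward of 1 is the better choice; with a much smaller `gamma` the comparison could flip.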

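15. Model: the model is not the environment itself, and it is not a required component.

A model predicts what the environment will do next.
Transition model: $\mathcal{P}$ predicts the next state (the dynamics), $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
Reward model: $\mathcal{R}$ predicts the next (immediate) reward, $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
The state transition model gives the probability of the environment's next state, given the current state and action.

The expected reward is likewise conditioned on the current state and action.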
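
A tabular model can be written down directly as two lookup tables. The sketch below assumes a made-up two-state, two-action problem; `transition`, `reward`, `sample_next_state` and `expected_reward` are hypothetical names, and a learned model would estimate these tables from experience rather than have them given.

```python
import random

# Transition model: (state, action) -> distribution over next states.
transition = {
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s0", "stay"): {"s0": 1.0},
    ("s1", "go"): {"s0": 1.0},
    ("s1", "stay"): {"s1": 1.0},
}

# Reward model: (state, action) -> expected immediate reward.
reward = {
    ("s0", "go"): 0.0,
    ("s0", "stay"): 0.0,
    ("s1", "go"): 0.0,
    ("s1", "stay"): 1.0,
}

def sample_next_state(state, action, rng):
    """Draw a next state from the transition model."""
    dist = transition[(state, action)]
    states = list(dist)
    return rng.choices(states, weights=[dist[s] for s in states], k=1)[0]

def expected_reward(state, action):
    """Look up the expected immediate reward for this state and action."""
    return reward[(state, action)]

rng = random.Random(0)
imagined_next = sample_next_state("s0", "go", rng)
```

With such a model the agent can imagine what would happen after an action without actually taking it in the environment.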

16. Categorising RL agents by which of these three key components they contain:

Value Based: No Policy (implicit, i.e. no explicit policy is needed); Value Function
Policy Based: Policy; No Value Function
Actor Critic: Policy; Value Function
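
The phrase "implicit policy" for value-based agents can be illustrated as follows. In this sketch, which assumes a small made-up table of action values `q`, no policy is stored anywhere: the action is read off the values by picking the one with the highest estimate.

```python
# Hypothetical action-value table: q[state][action] is the estimated value.
q = {
    "s0": {"left": 0.1, "right": 0.7},
    "s1": {"left": 0.4, "right": 0.2},
}

def greedy_action(q, state):
    """Implicit policy: choose the action with the highest estimated value."""
    return max(q[state], key=q[state].get)

chosen = {state: greedy_action(q, state) for state in q}
```

A policy-based agent would instead store and adjust the mapping from state to action directly, and an actor-critic agent keeps both a policy and a value function.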

17. Categorising by model:

Model Free: Policy and/or Value Function; No Model
Model Based: Policy and/or Value Function; Model

Reposted from blog.csdn.net/baidu_32239977/article/details/84196786
