Train a DDPG Agent to Swing Up and Balance a Pendulum from Image Observations

Simple Pendulum Model
Create Environment Interface
Create DDPG Agent
Train Agent
Simulate DDPG Agent

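This example shows how to train a deep deterministic policy gradient (DDPG) agent to swing up and balance a pendulum using an image observation modeled in MATLAB®.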

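For more information on DDPG agents, see Deep Deterministic Policy Gradient Agents.

Simple Pendulum Model

The reinforcement learning environment for this example is a simple frictionless pendulum that initially hangs in the downward position. The training goal is to bring the pendulum upright and keep it there using minimal control effort.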

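For this environment:

The upright balanced position of the pendulum is 0 radians, and the downward hanging position is pi radians.
The torque action signal from the agent to the environment ranges from -2 to 2 N·m.
The observations from the environment are an image indicating the location of the pendulum mass and the pendulum angular velocity.
The reward $r_t$, provided at every time step, is

$$r_t = -\left(\theta_t^2 + 0.1\,\dot{\theta}_t^2 + 0.001\,u_{t-1}^2\right)$$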

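Here:

$\theta_t$ is the angle of displacement from the upright position.
$\dot{\theta}_t$ is the derivative of the displacement angle.
$u_{t-1}$ is the control effort from the previous time step.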
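
To make the reward concrete, the following is a minimal sketch of a quadratic penalty on the three quantities defined above. The weights 1, 0.1, and 0.001 are assumptions taken from the MathWorks version of this example, and pendulumReward is a hypothetical helper name, not part of the toolbox.

```matlab
% Quadratic penalty on angle, angular velocity, and previous control effort.
% The weights (1, 0.1, 0.001) are assumed; pendulumReward is a hypothetical name.
pendulumReward = @(theta, thetaDot, uPrev) ...
    -(theta.^2 + 0.1*thetaDot.^2 + 0.001*uPrev.^2);

rHanging = pendulumReward(pi, 0, 0);  % at rest, hanging straight down
rUpright = pendulumReward(0, 0, 0);   % at rest, upright
rTorque  = pendulumReward(0, 0, 2);   % upright, maximum torque on the previous step

fprintf('hanging: %.4f, upright: %.4f, upright with torque: %.4f\n', ...
    rHanging, rUpright, rTorque);
```

Under these assumed weights, a pendulum that simply hangs at rest loses about pi^2 ≈ 9.87 per step, so the agent can only accumulate a small total penalty by swinging up quickly and then applying little torque.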

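For more information on this model, see Load Predefined Control System Environments.

Create Environment Interface

Create a predefined environment interface for the pendulum.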

env = rlPredefinedEnv('SimplePendulumWithImage-Continuous')

The interface has a continuous action space in which the agent can apply a torque between -2 and 2 N·m.

Obtain the observation and action specifications from the environment interface.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
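
The specification objects can be inspected before any network is built. The sketch below prints the name and dimension of each observation channel and the torque limits of the action channel; it assumes Reinforcement Learning Toolbox is installed and uses only the Name, Dimension, LowerLimit, and UpperLimit properties of the specification objects.

```matlab
% Requires Reinforcement Learning Toolbox.
env = rlPredefinedEnv('SimplePendulumWithImage-Continuous');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% Print the name and dimension of each observation channel.
for k = 1:numel(obsInfo)
    fprintf('Observation %d: %s, dimension [%s]\n', ...
        k, obsInfo(k).Name, num2str(obsInfo(k).Dimension));
end

% Print the torque limits of the action channel.
fprintf('Action limits: [%g, %g] N·m\n', actInfo.LowerLimit, actInfo.UpperLimit);
```

The observation names printed here are the ones passed later as the 'Observation' argument when the critic and actor representations are created.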

Fix the random generator seed for reproducibility.

rng(0)

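Create DDPG Agent

A DDPG agent uses a critic value function representation to approximate the long-term reward given observations and actions. To create the critic, first create a deep convolutional neural network (CNN) with three inputs (the image, the angular velocity, and the action) and one output. For more information on creating representations, see Create Policy and Value Function Representations.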

hiddenLayerSize1 = 400;
hiddenLayerSize2 = 300;

imgPath = [
    imageInputLayer(obsInfo(1).Dimension,'Normalization','none','Name',obsInfo(1).Name)
    convolution2dLayer(10,2,'Name','conv1','Stride',5,'Padding',0)
    reluLayer('Name','relu1')
    fullyConnectedLayer(2,'Name','fc1')
    concatenationLayer(3,2,'Name','cat1')
    fullyConnectedLayer(hiddenLayerSize1,'Name','fc2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(hiddenLayerSize2,'Name','fc3')
    additionLayer(2,'Name','add')
    reluLayer('Name','relu3')
    fullyConnectedLayer(1,'Name','fc4')
    ];
dthetaPath = [
    imageInputLayer(obsInfo(2).Dimension,'Normalization','none','Name',obsInfo(2).Name)
    fullyConnectedLayer(1,'Name','fc5','BiasLearnRateFactor',0,'Bias',0)
    ];
actPath =[
    imageInputLayer(actInfo(1).Dimension,'Normalization','none','Name','action')
    fullyConnectedLayer(hiddenLayerSize2,'Name','fc6','BiasLearnRateFactor',0,'Bias',zeros(hiddenLayerSize2,1))
    ];

criticNetwork = layerGraph(imgPath);
criticNetwork = addLayers(criticNetwork,dthetaPath);
criticNetwork = addLayers(criticNetwork,actPath);
criticNetwork = connectLayers(criticNetwork,'fc5','cat1/in2');
criticNetwork = connectLayers(criticNetwork,'fc6','add/in2');
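
It can help to check how many features the convolution layer hands to fc1. The sketch below applies the standard output-size formula for an unpadded, strided convolution to the layer settings used above (filter size 10, 2 filters, stride 5, padding 0). The 50-by-50-by-1 image size is an assumption; in practice, read it from obsInfo(1).Dimension.

```matlab
% Assumed image size; replace with obsInfo(1).Dimension when the toolbox is available.
imageSize  = [50 50 1];
filterSize = 10;
numFilters = 2;
stride     = 5;
padding    = 0;

% Spatial output size of the convolution: floor((in + 2*pad - filter)/stride) + 1.
outSize = floor((imageSize(1:2) + 2*padding - filterSize) / stride) + 1;

% Number of values that reach the first fully connected layer.
numFeatures = prod(outSize) * numFilters;

fprintf('conv1 output: %d-by-%d-by-%d (%d features)\n', ...
    outSize(1), outSize(2), numFilters, numFeatures);
```

With the assumed image size, conv1 produces a 9-by-9-by-2 activation, which fc1 compresses to 2 values before they are concatenated with the angular velocity.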

View the critic network configuration.

figure
plot(criticNetwork)

Specify options for the critic representation using rlRepresentationOptions.

criticOptions = rlRepresentationOptions('LearnRate',1e-03,'GradientThreshold',1);
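
The GradientThreshold option limits how large a single gradient step can be. The sketch below illustrates clipping by L2 norm with a threshold of 1, assuming the gradient threshold method is left at 'l2norm'. The function handle clipL2 and the sample gradients are hypothetical and only illustrate the idea; they are not toolbox code.

```matlab
gradientThreshold = 1;

% Rescale a gradient so that its L2 norm does not exceed the threshold.
% Gradients already within the threshold are returned unchanged.
clipL2 = @(g) g * min(1, gradientThreshold / max(norm(g(:)), eps));

gLarge = [3; 4];      % L2 norm 5, exceeds the threshold
gSmall = [0.3; 0.4];  % L2 norm 0.5, within the threshold

clippedLarge = clipL2(gLarge);  % direction kept, norm reduced to 1
clippedSmall = clipL2(gSmall);  % unchanged

fprintf('norm before: %.2f, after: %.2f\n', norm(gLarge), norm(clippedLarge));
```

Clipping keeps the direction of the update but bounds its size, which tends to stabilize critic training when early temporal-difference errors are large.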

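Uncomment the following line to use the GPU to accelerate training of the critic CNN. Using a GPU requires Parallel Computing Toolbox™ software and a CUDA®-enabled NVIDIA® GPU with compute capability 3.0 or higher.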

% criticOptions.UseDevice = 'gpu';

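Create the critic representation using the specified neural network and options. You must also specify the action and observation information for the critic, which you obtain from the environment interface. For more information, see rlQValueRepresentation.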

critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,...
    'Observation',{'pendImage','angularRate'},'Action',{'action'},criticOptions);

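A DDPG agent uses an actor representation to decide which action to take given the observations. To create the actor, first create a deep convolutional neural network (CNN) with two inputs (the image and the angular velocity) and one output (the action).

Construct the actor in a manner similar to the critic.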

imgPath = [
    imageInputLayer(obsInfo(1).Dimension,'Normalization','none','Name',obsInfo(1).Name)
    convolution2dLayer(10,2,'Name','conv1','Stride',5,'Padding',0)
    reluLayer('Name','relu1')
    fullyConnectedLayer(2,'Name','fc1')
    concatenationLayer(3,2,'Name','cat1')
    fullyConnectedLayer(hiddenLayerSize1,'Name','fc2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(hiddenLayerSize2,'Name','fc3')
    reluLayer('Name','relu3')
    fullyConnectedLayer(1,'Name','fc4')
    tanhLayer('Name','tanh1')
    scalingLayer('Name','scale1','Scale',max(actInfo.UpperLimit))
    ];
dthetaPath = [
    imageInputLayer(obsInfo(2).Dimension,'Normalization','none','Name',obsInfo(2).Name)
    fullyConnectedLayer(1,'Name','fc5','BiasLearnRateFactor',0,'Bias',0)
    ];

actorNetwork = layerGraph(imgPath);
actorNetwork = addLayers(actorNetwork,dthetaPath);
actorNetwork = connectLayers(actorNetwork,'fc5','cat1/in2');

actorOptions = rlRepresentationOptions('LearnRate',1e-04,'GradientThreshold',1);
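
The last two layers of the actor network, tanhLayer followed by scalingLayer, keep the commanded torque inside the action limits. The sketch below reproduces that mapping in plain MATLAB for a few sample pre-activation values. The value maxTorque = 2 is assumed to equal max(actInfo.UpperLimit), and the input values are arbitrary.

```matlab
maxTorque = 2;                    % assumed value of max(actInfo.UpperLimit)
preActivation = [-10 -1 0 1 10];  % arbitrary outputs of the fc4 layer

% tanh squashes the input into (-1, 1); scaling stretches it to (-maxTorque, maxTorque).
action = maxTorque * tanh(preActivation);

disp(action)
```

Because the bound is built into the network, the actor cannot request a torque outside the range that the environment accepts, however large the pre-activation becomes.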

Uncomment the following line to use the GPU to accelerate training of the actor CNN.

% actorOptions.UseDevice = 'gpu';

Create the actor representation using the specified neural network and options. For more information, see rlDeterministicActorRepresentation.

actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,...
    'Observation',{'pendImage','angularRate'},'Action',{'scale1'},actorOptions);

View the actor network configuration.

figure
plot(actorNetwork)

To create the DDPG agent, first specify the DDPG agent options using rlDDPGAgentOptions.

agentOptions = rlDDPGAgentOptions(...
    'SampleTime',env.Ts,...
    'TargetSmoothFactor',1e-3,...
    'ExperienceBufferLength',1e6,...
    'DiscountFactor',0.99,...
    'MiniBatchSize',128);
agentOptions.NoiseOptions.Variance = 0.6;
agentOptions.NoiseOptions.VarianceDecayRate = 1e-6;
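
Two of these options are easier to judge with numbers. The sketch below shows, in plain MATLAB, a soft target update with TargetSmoothFactor = 1e-3 and the decay of the exploration noise variance. It assumes that the variance is multiplied by (1 - VarianceDecayRate) at every sample step. The parameter vectors are hypothetical stand-ins for network weights.

```matlab
% Soft target update: the target network moves a small fraction tau
% toward the learned network at each update.
tau = 1e-3;
criticParams = [1.0; -2.0];  % hypothetical learned weights
targetParams = [0; 0];       % hypothetical target weights
targetParams = tau*criticParams + (1 - tau)*targetParams;

% Exploration noise: assumed rule variance(n) = variance0 * (1 - decayRate)^n.
variance0 = 0.6;
decayRate = 1e-6;
varianceAfter = @(n) variance0 * (1 - decayRate).^n;

% Number of agent steps until the variance has halved.
stepsToHalve = ceil(log(0.5) / log(1 - decayRate));

fprintf('target after one update: [%g, %g]\n', targetParams(1), targetParams(2));
fprintf('variance after 1e5 steps: %.4f\n', varianceAfter(1e5));
fprintf('steps to halve the variance: %d\n', stepsToHalve);
```

Under this assumed decay rule, the variance takes roughly 690,000 steps to halve, so exploration stays strong for most of a training run with 400-step episodes.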

Then create the agent using the specified actor representation, critic representation, and agent options. For more information, see rlDDPGAgent.

agent = rlDDPGAgent(actor,critic,agentOptions);

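Train Agent

To train the agent, first specify the training options. For this example, use the following options.

Run training for at most 5000 episodes, with each episode lasting at most 400 time steps.
Display the training progress in the Episode Manager dialog box (set the Plots option).
Stop training when the agent receives a moving average cumulative reward greater than -740 over ten consecutive episodes. At this point, the agent can quickly bring the pendulum to the upright position using minimal control effort.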

maxepisodes = 5000;
maxsteps = 400;
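% Average the episode reward over ten episodes to match the stopping rule above.
trainingOptions = rlTrainingOptions(...
    'MaxEpisodes',maxepisodes,...
    'MaxStepsPerEpisode',maxsteps,...
    'ScoreAveragingWindowLength',10,...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',-740);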
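
The stopping rule can be illustrated without running any training. The sketch below applies a ten-episode moving average to a made-up sequence of episode rewards and reports the first episode at which the average exceeds -740. The reward values are hypothetical and are not results from this example.

```matlab
% Hypothetical cumulative rewards per episode (not measured results).
episodeReward = [-3900 -2500 -1400 -900 -760 -720 -700 -690 ...
                 -705 -695 -680 -700 -710 -690 -700];
windowLength = 10;   % number of episodes averaged
stopValue    = -740; % threshold on the moving average

stopEpisode = NaN;
for k = windowLength:numel(episodeReward)
    % Average over the most recent windowLength episodes.
    avgReward = mean(episodeReward(k-windowLength+1:k));
    if avgReward > stopValue
        stopEpisode = k;
        break
    end
end

fprintf('Training would stop after episode %d\n', stopEpisode);
```

Several individual episodes score above -740 well before the average does, which is why the criterion is applied to the moving average rather than to single episodes.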

You can use the plot function to visualize the pendulum during training or simulation.

plot(env)

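Training this agent is a computationally intensive process that takes a long time to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.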

doTraining = false;
if doTraining    
    % Train the agent.
    trainingStats = train(agent,env,trainingOptions);
else
    % Load pretrained agent for the example.
    load('SimplePendulumWithImageDDPG.mat','agent')       
end


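Simulate DDPG Agent

To validate the performance of the trained agent, simulate it within the pendulum environment. For more information on agent simulation, see rlSimulationOptions and sim.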

simOptions = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOptions);
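
A common next step is to sum the rewards collected during the simulation. The sketch below assumes that the structure returned by sim stores the rewards as a timeseries in its Reward field. So that it runs without the toolbox, it builds a stand-in timeseries with made-up values in place of a real simulation result.

```matlab
% Stand-in for the output of sim; the times and reward values are hypothetical.
t = (0:4)' * 0.05;
experience.Reward = timeseries([-9.8; -7.1; -3.2; -0.6; -0.1], t);

% Total reward and number of steps in the simulated episode.
totalReward = sum(experience.Reward.Data);
numSteps    = numel(experience.Reward.Data);

fprintf('Total reward over %d steps: %.2f\n', numSteps, totalReward);
```

With a real simulation result, the two lines that create the stand-in are dropped and the rest applies to the experience variable returned by sim.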