创建自定义强化学习算法的智能体

创建环境
定义策略
自定义智能体类
智能体属性
构造函数
相关函数
可选功能
创建自定义智能体
训练自定义智能体
自定义智能体仿真

本示例说明如何为您自己的自定义强化学习算法创建自定义智能体。这样做使您可以利用Reinforcement Learning Toolbox™软件的以下内置功能。

访问所有智能体函数，包括train和sim
使用Episode Manager可视化训练进度
在Simulink®环境中训练智能体

在此示例中，您将自定义REINFORCE训练循环转换为自定义智能体类。有关REINFORCE自定义训练回路的更多信息，请参阅 Train Reinforcement Learning Policy Using Custom Training Loop。有关编写自定义智能体类的更多信息，请参见 Custom Agents。

固定随机生成器种子的再现性。

rng(0)

创建环境

创建使用“Train Reinforcement Learning Policy Using Custom Training Loop example”中使用的相同训练环境。该环境是具有离散动作空间的平衡杆环境。使用rlPredefinedEnv函数创建环境。

env = rlPredefinedEnv('CartPole-Discrete');

从环境中提取观察和动作规范。

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

获取观察数（numObs）和动作数（numAct）。

numObs = obsInfo.Dimension(1);
numAct = numel(actInfo.Elements);

定义策略

在此示例中，强化学习策略是离散动作随机策略。它由一个深度神经网络表示，该网络包含fullyConnectedLayer，reluLayer和softmaxLayer层。给定当前观测值，该网络输出每个离散动作的概率。 softmaxLayer可以确保表示形式输出的概率值范围为[0 1]，并且所有概率之和为1。

为行动者创建深度神经网络。

actorNetwork = [featureInputLayer(numObs,'Normalization','none','Name','state')
                fullyConnectedLayer(24,'Name','fc1')
                reluLayer('Name','relu1')
                fullyConnectedLayer(24,'Name','fc2')
                reluLayer('Name','relu2')
                fullyConnectedLayer(2,'Name','output')
                softmaxLayer('Name','actionProb')];

使用rlStochasticActorRepresentation对象创建行动者表示。

actorOpts = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);
actor = rlStochasticActorRepresentation(actorNetwork,...
    obsInfo,actInfo,'Observation','state',actorOpts);

自定义智能体类

要定义您的自定义智能体，请首先创建一个类，该类是rl.agent.CustomAgent类的子类。此示例的自定义智能体类在CustomReinforceAgent.m中定义。

CustomReinforceAgent类具有以下类定义，该定义指示智能体类名称和关联的抽象智能体。

classdef CustomReinforceAgent < rl.agent.CustomAgent

要定义您的智能体，您必须指定以下内容：

智能体属性
构造函数
评估长期奖励折扣的评论者表示形式（如果学习需要）
根据当前观察选择动作的行动者表示（如果学习需要）
所需的智能体方法
可选智能体方法

智能体属性

在类文件的属性部分中，指定创建和训练智能体所需的任何参数。

rl.Agent.CustomAgent类已经包含智能体采样时间（SampleTime）以及操作和观察规范（分别为ActionInfo和ObservationInfo）的属性。

定制REINFORCE智能体定义了以下其他智能体属性。

properties
    % Actor representation
    Actor
    
    % Agent options
    Options
    
    % Experience buffer
    ObservationBuffer
    ActionBuffer
    RewardBuffer
end

properties (Access = private)
    % Training utilities
    Counter
    NumObservation
    NumAction
end

构造函数

要创建自定义智能体，您必须定义一个构造函数。构造函数执行以下操作。

定义动作和观察规范。有关创建这些规范的更多信息，请参见rlNumericSpec和rlFiniteSetSpec。
设置智能体属性。
调用基本抽象类的构造函数。
定义采样时间（在Simulink环境中进行训练所需）。

例如，CustomREINFORCEAgent构造函数根据输入行动者表示形式定义动作和观察空间。

function obj = CustomReinforceAgent(Actor,Options)
    %CUSTOMREINFORCEAGENT Construct custom agent
    %   AGENT = CUSTOMREINFORCEAGENT(ACTOR,OPTIONS) creates custom
    %   REINFORCE AGENT from rlStochasticActorRepresentation ACTOR
    %   and structure OPTIONS. OPTIONS has fields:
    %       - DiscountFactor
    %       - MaxStepsPerEpisode
    
    % (required) Call the abstract class constructor.
    obj = obj@rl.agent.CustomAgent();
    obj.ObservationInfo = Actor.ObservationInfo;
    obj.ActionInfo = Actor.ActionInfo;
    
    % (required for Simulink environment) Register sample time. 
    % For MATLAB environment, use -1.
    obj.SampleTime = -1;
    
    % (optional) Register actor and agent options.
    Actor = setLoss(Actor,@lossFunction);
    obj.Actor = Actor;
    obj.Options = Options;
    
    % (optional) Cache the number of observations and actions.
    obj.NumObservation = prod(obj.ObservationInfo.Dimension);
    obj.NumAction = prod(obj.ActionInfo.Dimension);
    
    % (optional) Initialize buffer and counter.
    reset(obj);
end

构造函数使用函数句柄将actor表示的损失函数设置为lossFunction，该函数在CustomREINFORCEAgent.m中作为局部函数实现。

function loss = lossFunction(policy,lossData)

    % Create the action indication matrix.
    batchSize = lossData.batchSize;
    Z = repmat(lossData.actInfo.Elements',1,batchSize);
    actionIndicationMatrix = lossData.actionBatch(:,:) == Z;
    
    % Resize the discounted return to the size of policy.
    G = actionIndicationMatrix .* lossData.discountedReturn;
    G = reshape(G,size(policy));
    
    % Round any policy values less than eps to eps.
    policy(policy < eps) = eps;
    
    % Compute the loss.
    loss = -sum(G .* log(policy),'all');
    
end

相关函数

要创建自定义强化学习智能体，您必须定义以下实现功能。

getActionImpl —评估智能体策略并在模拟过程中选择一个智能体。
getActionWithExplorationImpl —评估策略并在训练期间选择具有探索性的操作。
learningImpl —智能体如何从当前经验中学习

要在自己的代码中调用这些函数，请使用抽象基类中的wrapper方法。例如，要调用getActionImpl，请使用getAction。 wrapper 方法与实现方法具有相同的输入和输出参数。

getActionImpl Function

getActionImpl函数用于评估智能体的策略，并在使用sim函数模拟智能体时选择操作。此函数必须具有以下签名，其中obj是智能体对象，observation是当前观察值，而Action是选定的动作。

 function Action = getActionImpl(obj,Observation)

对于自定义REINFORCE智能体，您可以通过调用行动者表示形式的getAction函数来选择一个动作。离散rlStochasticActorRepresentation根据观察值生成离散分布，并从该分布中采样一个动作。

function Action = getActionImpl(obj,Observation)
    % Compute an action using the policy given the current 
    % observation.
    
    Action = getAction(obj.Actor,Observation);
end

getActionWithExplorationImpl Function

使用训练函数训练智能体时，getActionWithExplorationImpl函数使用智能体的探索模型选择动作。使用此函数，您可以实现诸如epsilon-greedy探索或添加高斯噪声之类的探索技术。此函数必须具有以下签名，其中obj是智能体对象，observation是当前观察值，而Action是选定的动作。

function Action = getActionWithExplorationImpl(obj,Observation)

对于自定义REINFORCE智能体，getActionWithExplorationImpl函数与getActionImpl相同。默认情况下，随机行动者总是进行探索，也就是说，他们总是根据概率分布选择一个动作。

function Action = getActionWithExplorationImpl(obj,Observation)
    % Compute an action using the exploration policy given the  
    % current observation.
    
    % REINFORCE: Stochastic actors always explore by default
    % (sample from a probability distribution)
    Action = getAction(obj.Actor,Observation);
end

learnImpl Function

learningImpl函数定义智能体如何从当前经验中学习。此函数通过更新策略参数并选择探索下一个状态的动作来实现智能体的自定义学习算法。该函数必须具有以下签名，其中obj是智能体对象，Experience是当前的智能体经验，而Action是选定的操作。

function Action = learnImpl(obj,Experience)

智能体经验是单元格数组Experience = {state，action，reward，nextstate，isdone}。这里：

状态是当前的观察。
动作是当前动作。这不同于输出参数Action，后者是下一个状态的动作。
奖励是当前的奖励。
nextState是下一个观察值。
isDone是一个逻辑标志，指示训练episode 已完成。

对于自定义REINFORCE智能体，请在“使用自定义训练循环的强化训练策略”中重复自定义训练循环的步骤2至7。您将省略步骤1、8和9，因为您将使用内置的训练函数来训练您的智能体。

function Action = learnImpl(obj,Experience)
    % Define how the agent learns from an Experience, which is a
    % cell array with the following format.
    %   Experience = {
    
    observation,action,reward,nextObservation,isDone}
    
    % Reset buffer at the beginning of the episode.
    if obj.Counter < 2
        resetBuffer(obj);
    end
    
    % Extract data from experience.
    Obs = Experience{
    
    1};
    Action = Experience{
    
    2};
    Reward = Experience{
    
    3};
    NextObs = Experience{
    
    4};
    IsDone = Experience{
    
    5};
    
    % Save data to buffer.
    obj.ObservationBuffer(:,:,obj.Counter) = Obs{
    
    1};
    obj.ActionBuffer(:,:,obj.Counter) = Action{
    
    1};
    obj.RewardBuffer(:,obj.Counter) = Reward;
    
    if ~IsDone
        % Choose an action for the next state.
        
        Action = getActionWithExplorationImpl(obj, NextObs);
        obj.Counter = obj.Counter + 1;
    else
        % Learn from episodic data.
        
        % Collect data from the buffer.
        BatchSize = min(obj.Counter,obj.Options.MaxStepsPerEpisode);
        ObservationBatch = obj.ObservationBuffer(:,:,1:BatchSize);
        ActionBatch = obj.ActionBuffer(:,:,1:BatchSize);
        RewardBatch = obj.RewardBuffer(:,1:BatchSize);
        
        % Compute the discounted future reward.
        DiscountedReturn = zeros(1,BatchSize);
        for t = 1:BatchSize
            G = 0;
            for k = t:BatchSize
                G = G + obj.Options.DiscountFactor ^ (k-t) * RewardBatch(k);
            end
            DiscountedReturn(t) = G;
        end
        
        % Organize data to pass to the loss function.
        LossData.batchSize = BatchSize;
        LossData.actInfo = obj.ActionInfo;
        LossData.actionBatch = ActionBatch;
        LossData.discountedReturn = DiscountedReturn;
        
        % Compute the gradient of the loss with respect to the
        % actor parameters.
        ActorGradient = gradient(obj.Actor,'loss-parameters',...
            {
    
    ObservationBatch},LossData);
        
        % Update the actor parameters using the computed gradients.
        obj.Actor = optimize(obj.Actor,ActorGradient);
        
        % Reset the counter.
        obj.Counter = 1;
    end
end

可选功能

（可选）您可以通过指定带有以下函数签名的resetImpl函数来定义训练开始时如何重置智能体，其中obj是智能体对象。

function resetImpl(obj)

使用此函数，您可以在训练前将智能体设置为已知或随机条件。

function resetImpl(obj)
    % (Optional) Define how the agent is reset before training/
    
    resetBuffer(obj);
    obj.Counter = 1;
end

另外，您可以根据需要在自定义智能体类中定义任何其他帮助函数。例如，自定义REINFORCE智能体定义了resetBuffer函数，用于在每个训练episode开始时重新初始化体验缓冲区。

function resetBuffer(obj)
    % Reinitialize all experience buffers.
    
    obj.ObservationBuffer = zeros(obj.NumObservation,1,obj.Options.MaxStepsPerEpisode);
    obj.ActionBuffer = zeros(obj.NumAction,1,obj.Options.MaxStepsPerEpisode);
    obj.RewardBuffer = zeros(1,obj.Options.MaxStepsPerEpisode);
end

创建自定义智能体

定义自定义智能体类后，在MATLAB工作空间中创建它的实例。要创建自定义REINFORCE智能体，请首先指定智能体选项。

options.MaxStepsPerEpisode = 250;
options.DiscountFactor = 0.995;

然后，使用选项和先前定义的行动者表示形式，调用自定义智能体构造函数。

agent = CustomReinforceAgent(actor,options);

训练自定义智能体

配置训练以使用以下选项。

将训练设置为最多持续5000个episode，每个episode最多持续250个步骤。
在达到最大episode次数后或在100个发作中的平均奖励达到240的值时终止训练。

numEpisodes = 5000;
aveWindowSize = 100;
trainingTerminationValue = 240;
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',numEpisodes,...
    'MaxStepsPerEpisode',options.MaxStepsPerEpisode,...
    'ScoreAveragingWindowLength',aveWindowSize,...
    'StopTrainingValue',trainingTerminationValue);

使用训练函数训练智能体。训练此智能体是一个计算密集型过程，需要几分钟才能完成。为了节省运行本示例的时间，请通过将doTraining设置为false来加载预训练的智能体。 要自己训练智能体，请将doTraining设置为true。

doTraining = false;
if doTraining
    % Train the agent.
    trainStats = train(agent,env,trainOpts);
else
    % Load pretrained agent for the example.
    load('CustomReinforce.mat','agent');
end

在这里插入图片描述

自定义智能体仿真

启用环境可视化，该环境可视化在每次调用环境步骤功能时更新。

plot(env)

要验证训练后的智能体的表现，请在倒立摆环境中对其进行仿真。有关智能体模拟的更多信息，请参见rlSimulationOptions和sim。

simOpts = rlSimulationOptions('MaxSteps',options.MaxStepsPerEpisode);
experience = sim(env,agent,simOpts);