Reinforcement Learning Series: The Actor-Critic Algorithm

Introduction

The Actor-Critic algorithm is a reinforcement learning method that combines value-function estimation with policy gradients. It trains two networks at the same time: a policy network (the actor) and a value-function network (the critic). The actor selects actions according to the current policy, while the critic estimates the value of the current state and serves as a baseline when computing the policy-gradient update.

The detailed steps of the Actor-Critic algorithm are:

  1. Initialize the policy network parameters $\theta$ and the value network parameters $\omega$.
  2. For each episode:
    • Initialize the state $s$.
    • Select an action $a$ according to the policy network.
    • Execute action $a$, then observe the reward $r$ and the next state $s'$.
    • Use the value network to estimate the value of the current state, $V(s;\omega)$.
    • Compute the return $G_t$, either by Monte Carlo estimation or bootstrapped with the discount factor $\gamma$:
      $G_t = r + \gamma V(s';\omega)$
    • Compute the advantage $A_t$, which estimates the policy-gradient update:
      $A_t = G_t - V(s;\omega)$
    • Update the policy network parameters $\theta$ with the policy-gradient formula:
      $\Delta\theta = \alpha \nabla_\theta \log \pi(a \mid s;\theta)\, A_t$
    • Update the value network parameters $\omega$ with a value-function loss, for example the mean squared error (MSE):
      $\Delta\omega = \beta\,(G_t - V(s;\omega))\,\nabla_\omega V(s;\omega)$
    • Apply the parameter updates:
      $\theta \leftarrow \theta + \Delta\theta$
      $\omega \leftarrow \omega + \Delta\omega$
  3. Repeat step 2 until the specified number of episodes or a stopping condition is reached.

In the Actor-Critic algorithm, the policy network plays the role of the actor: it is updated with the policy-gradient method so as to maximize the cumulative reward. The value network plays the role of the critic: it estimates the state value and serves as a baseline when computing the advantage. By training the actor and the critic together, the algorithm learns the policy more stably and reduces the variance of the gradient estimate.

Note that the hyperparameters $\alpha$ and $\beta$ above are learning rates that control the step size of the parameter updates, while the discount factor $\gamma$ balances the importance of immediate versus future rewards. These hyperparameters need to be tuned for the specific problem.
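
As a bridge between the update equations above and the full implementations below, here is a minimal sketch of a single one-step Actor-Critic update using linear function approximation and plain NumPy. The feature dimension, the linear-softmax policy, and the helper names (policy, value, one_step_update) are illustrative assumptions for this sketch, not part of any library:

import numpy as np

# Illustrative sizes and hyperparameters (assumptions for this sketch)
num_features, num_actions = 8, 3
alpha, beta, gamma = 0.01, 0.1, 0.99

theta = np.zeros((num_features, num_actions))   # policy (actor) parameters
omega = np.zeros(num_features)                  # value (critic) parameters

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def policy(phi_s):
    return softmax(phi_s @ theta)               # pi(a|s; theta)

def value(phi_s):
    return phi_s @ omega                        # V(s; omega)

def one_step_update(phi_s, a, r, phi_s_next, done):
    """Apply the update equations for a single transition (s, a, r, s')."""
    global theta, omega
    G_t = r + (0.0 if done else gamma * value(phi_s_next))   # G_t = r + gamma * V(s'; omega)
    A_t = G_t - value(phi_s)                                  # A_t = G_t - V(s; omega)

    # grad_theta log pi(a|s): for a linear-softmax policy this is phi(s) outer (one_hot(a) - pi(.|s))
    probs = policy(phi_s)
    grad_log_pi = np.outer(phi_s, -probs)
    grad_log_pi[:, a] += phi_s

    theta += alpha * A_t * grad_log_pi                        # Delta theta = alpha * grad log pi * A_t
    omega += beta * (G_t - value(phi_s)) * phi_s              # Delta omega = beta * (G_t - V(s)) * grad V(s)

In the neural-network versions below, these same two updates are expressed as an actor loss and a critic loss and handed to an optimizer.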

Example code (PyTorch)

Below is a simple example implementation of the Actor-Critic algorithm in PyTorch:

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np

class ActorCritic(nn.Module):
    def __init__(self, num_states, num_actions, alpha_actor, alpha_critic, gamma):
        super(ActorCritic, self).__init__()
        self.num_states = num_states
        self.num_actions = num_actions
        self.alpha_actor = alpha_actor
        self.alpha_critic = alpha_critic
        self.gamma = gamma

        self.actor = nn.Sequential(
            nn.Linear(num_states, 32),
            nn.ReLU(),
            nn.Linear(32, num_actions),
            nn.Softmax(dim=-1)
        )

        self.critic = nn.Sequential(
            nn.Linear(num_states, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )

        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=alpha_actor)
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=alpha_critic)

    def get_action(self, state):
        state = torch.from_numpy(state).float().unsqueeze(0)
        action_probs = self.actor(state)
        action = torch.multinomial(action_probs, num_samples=1).item()
        return action

    def update(self, states, actions, rewards, next_states, dones):
        states = torch.tensor(np.array(states), dtype=torch.float32)
        actions = torch.tensor(actions, dtype=torch.int64).unsqueeze(1)
        rewards = torch.tensor(rewards, dtype=torch.float32).unsqueeze(1)
        next_states = torch.tensor(np.array(next_states), dtype=torch.float32)
        dones = torch.tensor(dones, dtype=torch.float32).unsqueeze(1)

        # TD target: G_t = r + gamma * V(s'), zeroed for terminal states
        with torch.no_grad():
            next_state_values = self.critic(next_states)
        target_values = rewards + self.gamma * next_state_values * (1 - dones)

        # Advantage: A_t = G_t - V(s)
        advantages = target_values - self.critic(states)

        # Actor loss: -log pi(a|s) * A_t, using the log-probability of the action actually taken
        log_probs = torch.log(self.actor(states).gather(1, actions))
        actor_loss = -(advantages.detach() * log_probs).mean()
        # Critic loss: MSE between V(s) and the TD target
        critic_loss = F.mse_loss(self.critic(states), target_values)

        self.actor_optimizer.zero_grad()
        self.critic_optimizer.zero_grad()
        actor_loss.backward()
        critic_loss.backward()
        self.actor_optimizer.step()
        self.critic_optimizer.step()

# The environment here is a placeholder; define it according to your specific problem
class Environment:
    def __init__(self, num_states, num_actions, max_steps=100):
        self.num_states = num_states
        self.num_actions = num_actions
        self.max_steps = max_steps   # episode length cap so the toy example terminates
        self.step_count = 0

    def reset(self):
        self.step_count = 0
        return np.zeros((self.num_states,))

    def step(self, action):
        self.step_count += 1
        next_state = np.random.randn(self.num_states)  # random state transition for this toy example
        reward = np.random.randn()                      # random reward for this toy example
        done = self.step_count >= self.max_steps        # terminate after max_steps so the loop ends
        return next_state, reward, done

# Problem configuration for this example
num_states = 10
num_actions = 4
alpha_actor = 0.001
alpha_critic = 0.01
gamma = 0.99
num_episodes = 1000

env = Environment(num_states, num_actions)
agent = ActorCritic(num_states, num_actions, alpha_actor, alpha_critic, gamma)

for episode in range(num_episodes):
    state = env.reset()
    done = False
    total_reward = 0

    while not done:
        action = agent.get_action(state)
        next_state, reward, done = env.step(action)
        agent.update([state], [action], [reward], [next_state], [done])
        state = next_state
        total_reward += reward

    print(f"Episode: {
      
      episode}, Total reward: {
      
      total_reward}")

In this example, we implement a simple Actor-Critic agent in PyTorch. The actor network is a multilayer perceptron (MLP) used as the policy network; it outputs a probability distribution over actions. The critic network is also an MLP, used as the value network; it outputs an estimate of the state value.

In each episode, the agent selects actions according to the current policy and interacts with the environment. The target value and the advantage are then computed from the reward and the next state; the advantage drives the policy-network update, and the target value drives the value-network update. This process repeats until the specified number of episodes is reached.
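
The Environment above is only a random placeholder, so the agent has nothing meaningful to learn from it. As a rough illustration of how the same agent could be driven by a real task, here is a sketch using Gymnasium's CartPole-v1; it assumes the gymnasium package is installed and relies on CartPole's 4-dimensional observation and 2 discrete actions:

import gymnasium as gym

env = gym.make("CartPole-v1")
agent = ActorCritic(num_states=4, num_actions=2,
                    alpha_actor=1e-3, alpha_critic=1e-2, gamma=0.99)

for episode in range(500):
    state, _ = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        action = agent.get_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # One-step (online) update on the single transition just observed
        agent.update([state], [action], [reward], [next_state], [done])
        state = next_state
        total_reward += reward
    print(f"Episode: {episode}, Total reward: {total_reward}")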

Example code (TensorFlow)

Below is a simple Python example demonstrating an Actor-Critic implementation in TensorFlow:

import numpy as np
import tensorflow as tf

class ActorCritic:
    def __init__(self, num_states, num_actions, alpha_actor, alpha_critic, gamma):
        self.num_states = num_states
        self.num_actions = num_actions
        self.alpha_actor = alpha_actor
        self.alpha_critic = alpha_critic
        self.gamma = gamma

        self.actor_network = self.build_actor_network()
        self.critic_network = self.build_critic_network()

    def build_actor_network(self):
        inputs = tf.keras.Input(shape=(self.num_states,))
        x = tf.keras.layers.Dense(32, activation='relu')(inputs)
        x = tf.keras.layers.Dense(self.num_actions, activation='softmax')(x)
        model = tf.keras.Model(inputs=inputs, outputs=x)
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=self.alpha_actor), loss='categorical_crossentropy')
        return model

    def build_critic_network(self):
        inputs = tf.keras.Input(shape=(self.num_states,))
        x = tf.keras.layers.Dense(32, activation='relu')(inputs)
        x = tf.keras.layers.Dense(1)(x)
        model = tf.keras.Model(inputs=inputs, outputs=x)
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=self.alpha_critic), loss='mean_squared_error')
        return model

    def get_action(self, state):
        state = np.reshape(state, (1, self.num_states))
        action_probs = self.actor_network.predict(state, verbose=0)[0]
        action = np.random.choice(np.arange(self.num_actions), p=action_probs)
        return action

    def update(self, states, actions, rewards, next_states, dones):
        states = np.array(states, dtype=np.float32)
        actions = np.array(actions)
        rewards = np.array(rewards, dtype=np.float32)
        next_states = np.array(next_states, dtype=np.float32)
        dones = np.array(dones, dtype=np.float32)

        # TD target: G_t = r + gamma * V(s'), zeroed for terminal states
        next_state_values = self.critic_network.predict(next_states, verbose=0)
        next_state_values = np.squeeze(next_state_values, axis=-1)
        target_values = rewards + self.gamma * next_state_values * (1 - dones)

        # Advantage: A_t = G_t - V(s)
        advantages = target_values - self.critic_network.predict(states, verbose=0).flatten()

        # Actor update: categorical cross-entropy weighted by the advantage is
        # equivalent to the policy-gradient loss -log pi(a|s) * A_t
        self.actor_network.fit(states, tf.keras.utils.to_categorical(actions, num_classes=self.num_actions),
                               sample_weight=advantages, verbose=0)
        # Critic update: regress V(s) toward the TD target
        self.critic_network.fit(states, target_values, verbose=0)

# The environment here is a placeholder; define it according to your specific problem
class Environment:
    def __init__(self, num_states, num_actions, max_steps=100):
        self.num_states = num_states
        self.num_actions = num_actions
        self.max_steps = max_steps   # episode length cap so the toy example terminates
        self.step_count = 0

    def reset(self):
        self.step_count = 0
        return np.zeros((self.num_states,))

    def step(self, action):
        self.step_count += 1
        next_state = np.random.randn(self.num_states)  # random state transition for this toy example
        reward = np.random.randn()                      # random reward for this toy example
        done = self.step_count >= self.max_steps        # terminate after max_steps so the loop ends
        return next_state, reward, done

# Problem configuration for this example
num_states = 10
num_actions = 4
alpha_actor = 0.001
alpha_critic = 0.01
gamma = 0.99
num_episodes = 1000

env = Environment(num_states, num_actions)
agent = ActorCritic(num_states, num_actions, alpha_actor, alpha_critic, gamma)

for episode in range(num_episodes):
    state = env.reset()
    done = False
    total_reward = 0

    while not done:
        action = agent.get_action(state)
        next_state, reward, done = env.step(action)
        agent.update([state], [action], [reward], [next_state], [done])
        state = next_state
        total_reward += reward

    print(f"Episode: {
      
      episode}, Total reward: {
      
      total_reward}")

In this example, we define a simple problem environment (Environment) and an Actor-Critic agent (ActorCritic). The agent uses one neural network as the policy network (actor) and another as the value network (critic). The policy network outputs a probability distribution over actions through a softmax, and the value network outputs an estimate of the state value.


In each episode, the agent selects actions according to the current policy and interacts with the environment. The target value and the advantage are then computed from the reward and the next state; the advantage drives the policy-network update and the target value drives the value-network update, repeating until the specified number of episodes is reached.
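
The fit(..., sample_weight=advantages) call above works because categorical cross-entropy weighted by the advantage reduces to the policy-gradient loss -log pi(a|s) * A_t. If you prefer to make that loss explicit, one possible alternative is a custom training step with tf.GradientTape; this is a sketch that reuses the networks defined above but creates its own optimizer objects (separate from the ones attached by compile):

actor_optimizer = tf.keras.optimizers.Adam(learning_rate=alpha_actor)
critic_optimizer = tf.keras.optimizers.Adam(learning_rate=alpha_critic)

def train_step(agent, states, actions, target_values):
    """Explicit actor and critic losses for one batch of transitions."""
    states = tf.convert_to_tensor(states, dtype=tf.float32)
    actions = tf.convert_to_tensor(actions, dtype=tf.int32)
    target_values = tf.convert_to_tensor(target_values, dtype=tf.float32)

    with tf.GradientTape() as actor_tape, tf.GradientTape() as critic_tape:
        values = tf.squeeze(agent.critic_network(states), axis=-1)        # V(s)
        advantages = tf.stop_gradient(target_values - values)             # A_t = G_t - V(s)

        probs = agent.actor_network(states)                               # pi(.|s)
        indices = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
        log_probs = tf.math.log(tf.gather_nd(probs, indices) + 1e-8)      # log pi(a|s)

        actor_loss = -tf.reduce_mean(log_probs * advantages)              # policy-gradient loss
        critic_loss = tf.reduce_mean(tf.square(target_values - values))   # MSE toward the TD target

    actor_grads = actor_tape.gradient(actor_loss, agent.actor_network.trainable_variables)
    critic_grads = critic_tape.gradient(critic_loss, agent.critic_network.trainable_variables)
    actor_optimizer.apply_gradients(zip(actor_grads, agent.actor_network.trainable_variables))
    critic_optimizer.apply_gradients(zip(critic_grads, agent.critic_network.trainable_variables))
    return actor_loss, critic_loss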
