[Reinforcement Learning in Action] Policy Gradient Methods: Cart-Pole Balancing in Python

Code link

Case study

This article considers the pole-balancing problem from the Gym library (CartPole-v0). As shown in the figure below, a cart can move along a straight track. A pole is attached to the cart at one end, and its other end hangs free, so the pole is generally not perfectly upright. The initial position of the cart and the initial angle of the pole are chosen randomly within certain ranges. At each step the agent pushes the cart either to the left or to the right with a fixed force (the magnitude of the push is fixed, and the agent cannot choose to apply no force). The episode ends when any of the following occurs:

· the tilt angle of the pole exceeds 12 degrees;

· the cart moves more than 2.4 units from the center;

· the episode reaches 200 steps.

The agent receives a reward of 1 unit for every step taken, so we want episodes to last as long as possible. The problem is conventionally considered solved when the average reward over 100 consecutive episodes is ≥ 195.
[Figure 1: the cart-pole system]
In this task, the observation has 4 components: the cart position, the cart velocity, the pole angle, and the pole angular velocity. Their value ranges are shown in Table 7-1. The action is taken from {0, 1}, representing a push to the left and a push to the right respectively.
[Table 7-1: value ranges of the four observation components]
With a uniformly random policy, the episode reward is about 9 to 10.
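
Before building any agent, the environment can be created and its spaces inspected. A minimal setup sketch (the variable name env is an assumption, chosen to match its use in the code below):

import gym

env = gym.make('CartPole-v0')
print(env.observation_space)  # Box with 4 components: position, velocity, angle, angular velocity
print(env.action_space)       # Discrete(2): 0 pushes the cart left, 1 pushes it right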

On-policy policy gradient algorithm for finding the optimal policy

First, we use an on-policy algorithm to find the optimal policy. The VPGAgent class in the code below is the agent for this algorithm; it supports both a version without a baseline and a version with a baseline, and it approximates the policy function with an artificial neural network.
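
Concretely, for an episode $S_0, A_0, R_1, S_1, A_1, R_2, \dots$ with discount factor $\gamma$, the episode update that the learn method below implements (this is a summary of the code, written out here for reference) moves the policy parameters $\theta$ in the direction

$$\gamma^t G_t \, \nabla_\theta \ln \pi(A_t \mid S_t; \theta), \qquad G_t = \sum_{k \ge t} \gamma^{k-t} R_{k+1},$$

for every step $t$ of the episode. The baseline version replaces $G_t$ with $G_t - v(S_t; w)$, which reduces the variance of the estimate without changing its expectation.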

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras


class VPGAgent:
    def __init__(self, env, policy_kwargs, baseline_kwargs=None,
            gamma=0.99):
        self.action_n = env.action_space.n
        self.gamma = gamma
        self.trajectory = []  # trajectory storage

        self.policy_net = self.build_network(output_size=self.action_n,
                output_activation=tf.nn.softmax,
                loss=keras.losses.categorical_crossentropy,
                **policy_kwargs)
        if baseline_kwargs:  # baseline
            self.baseline_net = self.build_network(output_size=1,
                    **baseline_kwargs)

    def build_network(self, hidden_sizes, output_size,
            activation=tf.nn.relu, output_activation=None,
            loss=keras.losses.mse, learning_rate=0.01):
        model = keras.Sequential()
        for hidden_size in hidden_sizes:
            model.add(keras.layers.Dense(units=hidden_size,
                    activation=activation))
        model.add(keras.layers.Dense(units=output_size,
                activation=output_activation))
        optimizer = keras.optimizers.Adam(learning_rate)
        model.compile(optimizer=optimizer, loss=loss)
        return model

    def decide(self, observation):
        probs = self.policy_net.predict(observation[np.newaxis])[0]
        action = np.random.choice(self.action_n, p=probs)
        return action

    def learn(self, observation, action, reward, done):
        self.trajectory.append((observation, action, reward))

        if done:  # update once at the end of each episode
            df = pd.DataFrame(self.trajectory,
                    columns=['observation', 'action', 'reward'])
            df['discount'] = self.gamma ** df.index.to_series()
            df['discounted_reward'] = df['discount'] * df['reward']
            df['discounted_return'] = df['discounted_reward'][::-1].cumsum()
            df['psi'] = df['discounted_return']

            x = np.stack(df['observation'])
            if hasattr(self, 'baseline_net'):  # baseline logic
                df['baseline'] = self.baseline_net.predict(x)
                df['psi'] -= (df['baseline'] * df['discount'])
                df['return'] = df['discounted_return'] / df['discount']
                y = df['return'].values[:, np.newaxis]
                self.baseline_net.fit(x, y, verbose=0)

            y = np.eye(self.action_n)[df['action']] * \
                    df['psi'].values[:, np.newaxis]
            self.policy_net.fit(x, y, verbose=0)

            self.trajectory = []  # reset for the next episode
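
One detail of learn is worth spelling out: the policy network is compiled with categorical_crossentropy, and the target passed to fit is the one-hot encoding of the action scaled by psi, so the per-sample loss equals -psi * ln pi(A|S; theta), and a gradient-descent step on this loss is a policy-gradient ascent step. The following standalone check (my own illustration, assuming TensorFlow 2 with eager execution; none of these variables appear in the original code) confirms the equivalence numerically:

import numpy as np
from tensorflow import keras

probs = np.array([[0.3, 0.7]])      # pi(a|s) as produced by the policy network
psi = 2.0                           # psi for this time step
action = 0
y_true = np.eye(2)[[action]] * psi  # target built the same way as in VPGAgent.learn

loss = keras.losses.categorical_crossentropy(y_true, probs).numpy()[0]
print(loss, -psi * np.log(probs[0, action]))  # both print approximately 2.408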

When the construction parameter baseline_kwargs of the VPGAgent class takes its default value None, an agent without a baseline is built. We can construct such an agent with the following code:

policy_kwargs = {'hidden_sizes': [10,], 'activation': tf.nn.relu,
        'learning_rate': 0.01}
agent = VPGAgent(env, policy_kwargs=policy_kwargs)

When the construction parameter baseline_kwargs of the VPGAgent class is a dict of baseline network parameters, a neural network v(S;w) is built to serve as the baseline. We can construct an agent with a baseline as follows:

policy_kwargs = {'hidden_sizes': [10,], 'activation': tf.nn.relu,
        'learning_rate': 0.01}
baseline_kwargs = {'hidden_sizes': [10,], 'activation': tf.nn.relu,
        'learning_rate': 0.01}
agent = VPGAgent(env, policy_kwargs=policy_kwargs,
        baseline_kwargs=baseline_kwargs)

The interaction between the agent and the environment is implemented by the following function. With it we can both train and test this episode-update policy gradient agent.

def play_montecarlo(env, agent, render=False, train=False):
    observation = env.reset()
    episode_reward = 0.
    while True:
        if render:
            env.render()
        action = agent.decide(observation)
        next_observation, reward, done, _ = env.step(action)
        episode_reward += reward
        if train:
            agent.learn(observation, action, reward, done)
        if done:
            break
        observation = next_observation
    return episode_reward
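
Besides training, a single episode can be watched or scored without any learning, for example (render=True assumes a local display is available):

episode_reward = play_montecarlo(env, agent, render=True)
env.close()  # close the window opened by env.render()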

import matplotlib.pyplot as plt

episodes = 500
episode_rewards = []
for episode in range(episodes):
    episode_reward = play_montecarlo(env, agent, train=True)
    episode_rewards.append(episode_reward)
plt.plot(episode_rewards);
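
Once training has run, the criterion from the task description (average reward of at least 195 over 100 consecutive episodes) can be checked, and the learned policy can be evaluated without further updates. A small sketch, assuming the loop above has filled episode_rewards:

print('mean reward over the last 100 training episodes:',
        np.mean(episode_rewards[-100:]))

# Evaluate the trained policy without learning (train defaults to False).
test_rewards = [play_montecarlo(env, agent) for _ in range(100)]
print('mean test reward:', np.mean(test_rewards))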

[Figure: training episode rewards for the simple policy gradient algorithm without a baseline]

[Figure: training episode rewards for the simple policy gradient algorithm with a baseline]

Comparing the two curves shows that the version with a baseline has smaller variance than the version without one.

Off-policy policy gradient algorithm for finding the optimal policy

Next, we find the optimal policy with an off-policy algorithm based on importance sampling. The agent class for this algorithm is given below; it likewise supports both a version without a baseline and a version with a baseline.

class OffPolicyVPGAgent(VPGAgent):
    def __init__(self, env, policy_kwargs, baseline_kwargs=None, 
            gamma=0.99):
        self.action_n = env.action_space.n
        self.gamma = gamma

        self.trajectory = []

        # Custom loss: -sum(y_true * y_pred). With y_true set to
        # onehot(action) * psi / b(action|state) and y_pred the policy
        # probabilities, its gradient is the importance-sampling policy gradient.
        def dot(y_true, y_pred):
            return -tf.reduce_sum(y_true * y_pred, axis=-1)
        
        self.policy_net = self.build_network(output_size=self.action_n,
                output_activation=tf.nn.softmax, loss=dot, **policy_kwargs)
        if baseline_kwargs:
            self.baseline_net = self.build_network(output_size=1,
                    **baseline_kwargs)
    
    def learn(self, observation, action, behavior, reward, done):
        self.trajectory.append((observation, action, behavior, reward))

        if done:
            df = pd.DataFrame(self.trajectory, columns=
                    ['observation', 'action', 'behavior', 'reward'])
            df['discount'] = self.gamma ** df.index.to_series()
            df['discounted_reward'] = df['discount'] * df['reward']
            df['discounted_return'] = \
                    df['discounted_reward'][::-1].cumsum()
            df['psi'] = df['discounted_return']
            
            x = np.stack(df['observation'])
            if hasattr(self, 'baseline_net'):
                df['baseline'] = self.baseline_net.predict(x)
                df['psi'] -= df['baseline'] * df['discount']
                df['return'] = df['discounted_return'] / df['discount']
                y = df['return'].values[:, np.newaxis]
                self.baseline_net.fit(x, y, verbose=0)
                
            # Importance-sampling correction: scale psi by 1 / b(A|S), the
            # probability of the chosen action under the behavior policy.
            y = np.eye(self.action_n)[df['action']] * \
                    (df['psi'] / df['behavior']).values[:, np.newaxis]
            self.policy_net.fit(x, y, verbose=0)

            self.trajectory = []  # reset the experience list for the next episode

An off-policy algorithm needs not only the target policy being learned but also a behavior policy that generates the experience. The simplest behavior policy is a uniformly random policy:

class RandomAgent:
    def __init__(self, env):
        self.action_n = env.action_space.n
        
    def decide(self, observation):
        action = np.random.choice(self.action_n)
        behavior = 1. / self.action_n  # probability of the chosen action under the behavior policy
        return action, behavior
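
The training loop below refers to behavior_agent and agent, whose construction is not shown in this excerpt; presumably it mirrors the on-policy case, along these lines:

behavior_agent = RandomAgent(env)

policy_kwargs = {'hidden_sizes': [10,], 'activation': tf.nn.relu,
        'learning_rate': 0.01}
baseline_kwargs = {'hidden_sizes': [10,], 'activation': tf.nn.relu,
        'learning_rate': 0.01}
agent = OffPolicyVPGAgent(env, policy_kwargs=policy_kwargs,
        baseline_kwargs=baseline_kwargs)  # drop baseline_kwargs for the no-baseline version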

Using the off-policy agent together with the random behavior policy, we can train and test the episode-update policy gradient algorithm based on importance sampling. The training code is as follows:

episodes = 1500
episode_rewards = []
for episode in range(episodes):
    observation = env.reset()
    episode_reward = 0.
    while True:
        action, behavior = behavior_agent.decide(observation)
        next_observation, reward, done, _ = env.step(action)
        episode_reward += reward
        agent.learn(observation, action, behavior, reward, done)
        if done:
            break
        observation = next_observation

    # Track progress: evaluate the learned target policy once per episode.
    episode_reward = play_montecarlo(env, agent)
    episode_rewards.append(episode_reward)
plt.plot(episode_rewards);

[Figure: training episode rewards for the importance-sampling policy gradient algorithm without a baseline]

[Figure: training episode rewards for the importance-sampling policy gradient algorithm with a baseline]

Comparison and conclusions

Policy gradient algorithms fall into two categories: episode update and temporal-difference update. This article has covered the episode-update approach, which can only be applied to episodic tasks. Episode-update methods do not use bootstrapping and therefore introduce no bias, but they tend to have very large variance.

Origin blog.csdn.net/wangyifan123456zz/article/details/109286039