This article is shared from the Huawei Cloud Community " MindSpore Reinforcement Learning: Training using PPO with the environment HalfCheetah-v2 ", author: irrational.
HalfCheetah is a reinforcement learning environment based on MuJoCo, proposed by P. Wawrzyński in "A Cat-Like Robot Real-Time Learning to Run". The half-cheetah in this environment is a 2D robot made of 9 links and 8 joints connecting them (including two paws). The goal is to make the cheetah run forward (to the right) as fast as possible by applying torque to the joints; the agent receives a positive reward proportional to the distance traveled and a negative reward for moving backward. The cheetah's torso and head are fixed, and torque can only be applied to the joints of the front and rear thighs, shins, and feet.
The action space is a Box(-1, 1, (6,), float32), where each action is a torque applied between links. The observation space contains the position and velocity values of the cheetah's body parts, with all position values first and all velocity values after. By default the observation does not include the x-coordinate of the cheetah's center of mass; it can be included by passing exclude_current_positions_from_observation=False when the environment is built. In that case the observation space has 18 dimensions, and the first dimension is the x-coordinate of the center of mass.
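A quick way to inspect these spaces (assuming Gym and mujoco-py are installed; note that the exclude_current_positions_from_observation keyword is exposed by newer versions of the environment, so treat that part as an assumption for v2):

```python
import gym

# Create the environment and inspect its spaces (assumes mujoco-py is installed).
env = gym.make("HalfCheetah-v2")
print(env.action_space)       # Box(-1.0, 1.0, (6,), float32)
print(env.observation_space)  # 17-dimensional by default (x-coordinate excluded)

# Newer versions (HalfCheetah-v3/v4) accept the keyword directly:
# env = gym.make("HalfCheetah-v4", exclude_current_positions_from_observation=False)
```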
The reward has two parts: a forward reward and a control cost. The forward reward is computed from the change in the x-coordinate before and after the action, and the control cost penalizes the cheetah for taking overly large actions. The total reward is the forward reward minus the control cost.
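The reward computation can be sketched as follows; the weights shown (forward_reward_weight=1.0, ctrl_cost_weight=0.1) are the documented defaults for this environment, and the function itself is only an illustration:

```python
import numpy as np

def halfcheetah_reward(x_before, x_after, action, dt,
                       forward_reward_weight=1.0, ctrl_cost_weight=0.1):
    """Illustrative sketch: forward progress reward minus control cost."""
    forward_reward = forward_reward_weight * (x_after - x_before) / dt  # reward for moving right
    ctrl_cost = ctrl_cost_weight * np.sum(np.square(action))            # penalty for large torques
    return forward_reward - ctrl_cost
```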
Each episode starts from the state (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), to which noise is added to increase stochasticity. The first 8 values are positions and the last 9 are velocities. Uniform noise is added to the position values, while standard normal noise is added to the initial velocity values (all zeros).
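The reset noise can be sketched like this; the noise scales (uniform in [-0.1, 0.1] for positions, 0.1 times standard normal for velocities) follow the Gym implementation and are an assumption beyond the text above:

```python
import numpy as np

def reset_state(init_qpos, init_qvel, rng=np.random):
    """Illustrative sketch of the initial-state noise described above."""
    qpos = init_qpos + rng.uniform(low=-0.1, high=0.1, size=init_qpos.shape)  # uniform noise on positions
    qvel = init_qvel + 0.1 * rng.standard_normal(init_qvel.shape)             # scaled normal noise on velocities
    return qpos, qvel
```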
When the length of an episode exceeds 1000, the episode will be truncated.
Detailed information about this environment can be found at: https://www.gymlibrary.dev/environments/mujoco/half_cheetah/
This is more complex than many environments.
But that doesn’t matter: we have the PPO algorithm, which can train reinforcement learning agents and even large language models.
The PPO (Proximal Policy Optimization) algorithm is a policy optimization method for reinforcement learning. It is designed to address the trust-region problem of traditional policy gradient methods such as TRPO (Trust Region Policy Optimization).
The PPO algorithm introduces clipping and importance sampling techniques to reduce the variance of the gradient estimates, thereby improving the algorithm's convergence speed and stability.
In the PPO algorithm, there are two key concepts:
- Policy : A policy is a function that defines the probability distribution over actions a given a state s.
- Value Function : The value function estimates the expected return obtained by starting from state s and following a given policy until a terminal state is reached (a minimal network sketch follows this list).
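As a purely illustrative sketch (not the networks used by MindSpore RL's built-in PPO), a shared-body policy head and value head for HalfCheetah's 17-dimensional observations and 6-dimensional actions could look like this in MindSpore; the class name and layer sizes are assumptions:

```python
import mindspore.nn as nn

class PolicyAndValue(nn.Cell):
    """Illustrative actor-critic network: a Gaussian policy mean plus a state-value head."""

    def __init__(self, obs_dim=17, act_dim=6, hidden=64):
        super().__init__()
        self.shared = nn.SequentialCell(nn.Dense(obs_dim, hidden), nn.Tanh())
        self.mu = nn.Dense(hidden, act_dim)   # mean of the action distribution
        self.value = nn.Dense(hidden, 1)      # state-value estimate V(s)

    def construct(self, obs):
        h = self.shared(obs)
        return self.mu(h), self.value(h)
```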
The main steps of the PPO algorithm include:
- Sampling : Sample data from the current policy, including the state, action, reward, and next state.
- Calculating Targets : Use the target policy to compute the target value function and the KL divergence of the target policy.
- Updating Policy : Update the policy using importance sampling and clipping techniques (a minimal sketch follows this list).
- Updating Value Function : Update the value function using the policy gradient method.
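As a hedged illustration of the policy-update step, here is a minimal NumPy sketch of the clipped surrogate loss; the function name and arguments are illustrative and this is not the MindSpore RL implementation:

```python
import numpy as np

def clipped_surrogate_loss(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective to be minimized (illustrative sketch)."""
    ratio = np.exp(log_prob_new - log_prob_old)               # importance-sampling ratio
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)  # keep the update near the old policy
    # Taking the elementwise minimum means overly large policy changes are not rewarded.
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))
```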
The core idea of the PPO algorithm is to alternately update the policy and the value function so that they are optimized jointly. This effectively reduces the variance of the gradient estimates and improves the algorithm's convergence speed and stability.
The following is a simplified Markdown outline of the PPO algorithm:
# Proximal Policy Optimization (PPO) Algorithm

## 1. Sampling
Sample data from the current policy, including state $s$, action $a$, reward $r$, and next state $s'$.

## 2. Calculating Targets
Calculate the target value function using the target policy and compute the KL divergence of the target policy.

## 3. Updating Policy
Update the policy using importance sampling and clipping techniques.

## 4. Updating Value Function
Update the value function using policy gradient methods.

Repeat steps 1-4 to achieve joint optimization of the policy and the value function.
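For reference, the clipped surrogate objective from the original PPO paper (Schulman et al., 2017), where $\hat{A}_t$ is the advantage estimate and $\epsilon$ is the clipping parameter:

$$
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
$$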
This is only a simplified description. In practice, the PPO algorithm also includes many other details and techniques, such as experience replay and learning-rate scheduling.
```python
import argparse
import os

from mindspore import context
from mindspore import dtype as mstype
from mindspore.communication import get_rank, init

import mindspore_rl.distribution.distribution_policies as DP
from mindspore_rl.algorithm.ppo import config
from mindspore_rl.algorithm.ppo.ppo_session import PPOSession
from mindspore_rl.algorithm.ppo.ppo_trainer import PPOTrainer

parser = argparse.ArgumentParser(description="MindSpore Reinforcement PPO")
parser.add_argument("--episode", type=int, default=650, help="total episode numbers.")
parser.add_argument(
    "--device_target",
    type=str,
    default="Auto",
    choices=["Ascend", "CPU", "GPU", "Auto"],
    help="Choose a device to run the ppo example(Default: Auto).",
)
parser.add_argument(
    "--precision_mode",
    type=str,
    default="fp32",
    choices=["fp32", "fp16"],
    help="Precision mode",
)
parser.add_argument(
    "--env_yaml",
    type=str,
    default="../env_yaml/HalfCheetah-v2.yaml",
    help="Choose an environment yaml to update the ppo example(Default: HalfCheetah-v2.yaml).",
)
parser.add_argument(
    "--algo_yaml",
    type=str,
    default=None,
    help="Choose an algo yaml to update the ppo example(Default: None).",
)
parser.add_argument(
    "--enable_distribute",
    type=bool,
    default=False,
    help="Train in distribute mode (Default: False).",
)
parser.add_argument(
    "--worker_num", type=int, default=2, help="Worker num (Default: 2)."
)
parser.add_argument(
    "--graph_op_run", type=int, default=1, help="Run kernel by kernel (Default: 1)."
)
options, _ = parser.parse_known_args()
```
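The parsed options are then typically used to configure the MindSpore context and build the PPO session; the exact wiring in the MindSpore RL example script may differ, so treat the following as a hedged outline:

```python
# Hedged outline of how the options above are typically consumed
# (may differ from the actual MindSpore RL example script).
if options.device_target != "Auto":
    context.set_context(device_target=options.device_target)
context.set_context(mode=context.GRAPH_MODE)

episode = options.episode
ppo_session = PPOSession(options.env_yaml, options.algo_yaml)
```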
```bash
wget https://www.roboti.us/download/mujoco200_linux.zip
# unzip the archive before moving it into place
unzip mujoco200_linux.zip
mv mujoco200_linux ~/.mujoco/mujoco200
wget https://www.roboti.us/file/mjkey.txt
cp mjkey.txt /home/kewei/.mujoco/mjkey.txt
wget https://download-ib01.fedoraproject.org/pub/epel/7/x86_64/Packages/p/patchelf-0.12-1.el7.x86_64.rpm
yum localinstall patchelf-0.12-1.el7.x86_64.rpm
pip install 'mujoco_py==2.0.2.13'
```
It will take a while to compile mujoco for the first time.
Add the following to your ~/.bashrc:
```bash
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/.mujoco/mujoco200/bin
export MUJOCO_KEY_PATH=~/.mujoco${MUJOCO_KEY_PATH}
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/kewei/.mujoco/mujoco210/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/nvidia
```
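A quick sanity check that mujoco-py compiled and the environment loads (assuming the environment variables above have been sourced):

```python
# Run after `source ~/.bashrc`; the first import triggers mujoco-py compilation if needed.
import mujoco_py  # noqa: F401
import gym

env = gym.make("HalfCheetah-v2")
obs = env.reset()
print(obs.shape)  # expect (17,) with the default observation settings
```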
Then you can start training. Reuse the `RealTimeCaptureAndDisplayOutput` context manager (the `with` block) from the previous section to capture and display the output as it is produced:
```python
# dqn_session.run(class_type=DQNTrainer, episode=episode)
with RealTimeCaptureAndDisplayOutput() as captured_new:
    ppo_session.run(class_type=PPOTrainer, episode=episode, duration=duration)
```