This article is shared from the Huawei Cloud Community " MindSpore Reinforcement Learning: Training using PPO with the environment HalfCheetah-v2 ", author: irrational.
HalfCheetah is a reinforcement learning environment based on MuJoCo, proposed by P. Wawrzyński in "A Cat-Like Robot Real-Time Learning to Run". The half-cheetah in this environment is a 2D robot made of 9 links and 8 joints connecting them (including two paws). The goal is to make the cheetah run forward (to the right) as fast as possible by applying torque to the joints; the agent receives a positive reward proportional to the distance traveled and a negative reward for moving backward. The cheetah's torso and head are fixed, and torque can only be applied to the joints of the front and rear thighs, shins, and feet.
The action space is a Box(-1, 1, (6,), float32), where each action is a torque applied between links. The observation space contains the position and velocity values of the cheetah's body parts, with all position values first and all velocity values after. By default the observation does not include the x-coordinate of the cheetah's center of mass; it can be included by passing exclude_current_positions_from_observation=False when the environment is built. In that case the observation space has 18 dimensions, and the first dimension is the x-coordinate of the center of mass.
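A quick way to inspect these spaces (assuming Gym and mujoco-py are installed; note that the exclude_current_positions_from_observation keyword is exposed by newer versions of the environment, so treat that part as an assumption for v2):

```python
import gym

# Create the environment and inspect its spaces (assumes mujoco-py is installed).
env = gym.make("HalfCheetah-v2")
print(env.action_space)       # Box(-1.0, 1.0, (6,), float32)
print(env.observation_space)  # 17-dimensional by default (x-coordinate excluded)

# Newer versions (HalfCheetah-v3/v4) accept the keyword directly:
# env = gym.make("HalfCheetah-v4", exclude_current_positions_from_observation=False)
```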
The reward has two parts: a forward reward and a control cost. The forward reward is computed from the change in the x-coordinate before and after the action, and the control cost penalizes the cheetah for taking overly large actions. The total reward is the forward reward minus the control cost.
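The reward computation can be sketched as follows; the weights shown (forward_reward_weight=1.0, ctrl_cost_weight=0.1) are the documented defaults for this environment, and the function itself is only an illustration:

```python
import numpy as np

def halfcheetah_reward(x_before, x_after, action, dt,
                       forward_reward_weight=1.0, ctrl_cost_weight=0.1):
    """Illustrative sketch: forward progress reward minus control cost."""
    forward_reward = forward_reward_weight * (x_after - x_before) / dt  # reward for moving right
    ctrl_cost = ctrl_cost_weight * np.sum(np.square(action))            # penalty for large torques
    return forward_reward - ctrl_cost
```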
Each episode starts from the state (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0), to which noise is added to increase stochasticity. The first 8 values are positions and the last 9 are velocities. Uniform noise is added to the position values, while standard normal noise is added to the initial velocity values (all zeros).
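The reset noise can be sketched like this; the noise scales (uniform in [-0.1, 0.1] for positions, 0.1 times standard normal for velocities) follow the Gym implementation and are an assumption beyond the text above:

```python
import numpy as np

def reset_state(init_qpos, init_qvel, rng=np.random):
    """Illustrative sketch of the initial-state noise described above."""
    qpos = init_qpos + rng.uniform(low=-0.1, high=0.1, size=init_qpos.shape)  # uniform noise on positions
    qvel = init_qvel + 0.1 * rng.standard_normal(init_qvel.shape)             # scaled normal noise on velocities
    return qpos, qvel
```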
When the length of an episode exceeds 1000, the episode will be truncated.
Detailed information about this environment can be found at: https://www.gymlibrary.dev/environments/mujoco/half_cheetah/
This is more complex than many environments.
But that doesn’t matter: we have the PPO algorithm, which can train reinforcement learning agents and even large language models.
The PPO (Proximal Policy Optimization) algorithm is a policy optimization method for reinforcement learning. It is designed to address the trust-region problem of traditional policy gradient methods such as TRPO (Trust Region Policy Optimization).
The PPO algorithm introduces clipping and importance sampling techniques to reduce the variance of the gradient estimates, thereby improving the algorithm's convergence speed and stability.
In the PPO algorithm, there are two key concepts:
- Policy : A policy is a function that defines the probability distribution over actions a given a state s.
- Value Function : The value function estimates the expected return obtained by starting from state s and following a given policy until a terminal state is reached (a minimal network sketch follows this list).
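As a purely illustrative sketch (not the networks used by MindSpore RL's built-in PPO), a shared-body policy head and value head for HalfCheetah's 17-dimensional observations and 6-dimensional actions could look like this in MindSpore; the class name and layer sizes are assumptions:

```python
import mindspore.nn as nn

class PolicyAndValue(nn.Cell):
    """Illustrative actor-critic network: a Gaussian policy mean plus a state-value head."""

    def __init__(self, obs_dim=17, act_dim=6, hidden=64):
        super().__init__()
        self.shared = nn.SequentialCell(nn.Dense(obs_dim, hidden), nn.Tanh())
        self.mu = nn.Dense(hidden, act_dim)   # mean of the action distribution
        self.value = nn.Dense(hidden, 1)      # state-value estimate V(s)

    def construct(self, obs):
        h = self.shared(obs)
        return self.mu(h), self.value(h)
```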
The main steps of the PPO algorithm include:
- Sampling : Sample data from the current policy, including the state, action, reward, and next state.
- Calculating Targets : Use the target policy to compute the target value function and the KL divergence of the target policy.
- Updating Policy : Update the policy using importance sampling and clipping techniques (a minimal sketch follows this list).
- Updating Value Function : Update the value function using the policy gradient method.
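As a hedged illustration of the policy-update step, here is a minimal NumPy sketch of the clipped surrogate loss; the function name and arguments are illustrative and this is not the MindSpore RL implementation:

```python
import numpy as np

def clipped_surrogate_loss(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective to be minimized (illustrative sketch)."""
    ratio = np.exp(log_prob_new - log_prob_old)               # importance-sampling ratio
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)  # keep the update near the old policy
    # Taking the elementwise minimum means overly large policy changes are not rewarded.
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))
```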
The core idea of the PPO algorithm is to alternately update the policy and the value function so that they are optimized jointly. This effectively reduces the variance of the gradient estimates and improves the algorithm's convergence speed and stability.
The following is a simplified Markdown outline of the PPO algorithm:
# Proximal Policy Optimization (PPO) Algorithm

## 1. Sampling
Sample data from the current policy, including state $s$, action $a$, reward $r$, and next state $s'$.

## 2. Calculating Targets
Calculate the target value function using the target policy and compute the KL divergence of the target policy.

## 3. Updating Policy
Update the policy using importance sampling and clipping techniques.

## 4. Updating Value Function
Update the value function using policy gradient methods.

Repeat steps 1-4 to achieve joint optimization of the policy and the value function.
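For reference, the clipped surrogate objective from the original PPO paper (Schulman et al., 2017), where $\hat{A}_t$ is the advantage estimate and $\epsilon$ is the clipping parameter:

$$
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
$$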
This is only a simplified description. In practice, the PPO algorithm also includes many other details and techniques, such as experience replay and learning-rate scheduling.
```python
import argparse
import os

from mindspore import context
from mindspore import dtype as mstype
from mindspore.communication import get_rank, init

import mindspore_rl.distribution.distribution_policies as DP
from mindspore_rl.algorithm.ppo import config
from mindspore_rl.algorithm.ppo.ppo_session import PPOSession
from mindspore_rl.algorithm.ppo.ppo_trainer import PPOTrainer

parser = argparse.ArgumentParser(description="MindSpore Reinforcement PPO")
parser.add_argument("--episode", type=int, default=650, help="total episode numbers.")
parser.add_argument(
    "--device_target",
    type=str,
    default="Auto",
    choices=["Ascend", "CPU", "GPU", "Auto"],
    help="Choose a device to run the ppo example(Default: Auto).",
)
parser.add_argument(
    "--precision_mode",
    type=str,
    default="fp32",
    choices=["fp32", "fp16"],
    help="Precision mode",
)
parser.add_argument(
    "--env_yaml",
    type=str,
    default="../env_yaml/HalfCheetah-v2.yaml",
    help="Choose an environment yaml to update the ppo example(Default: HalfCheetah-v2.yaml).",
)
parser.add_argument(
    "--algo_yaml",
    type=str,
    default=None,
    help="Choose an algo yaml to update the ppo example(Default: None).",
)
parser.add_argument(
    "--enable_distribute",
    type=bool,
    default=False,
    help="Train in distribute mode (Default: False).",
)
parser.add_argument(
    "--worker_num", type=int, default=2, help="Worker num (Default: 2)."
)
parser.add_argument(
    "--graph_op_run", type=int, default=1, help="Run kernel by kernel (Default: 1)."
)
options, _ = parser.parse_known_args()
```
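The parsed options are then typically used to configure the MindSpore context and build the PPO session; the exact wiring in the MindSpore RL example script may differ, so treat the following as a hedged outline:

```python
# Hedged outline of how the options above are typically consumed
# (may differ from the actual MindSpore RL example script).
if options.device_target != "Auto":
    context.set_context(device_target=options.device_target)
context.set_context(mode=context.GRAPH_MODE)

episode = options.episode
ppo_session = PPOSession(options.env_yaml, options.algo_yaml)
```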
```bash
wget https://www.roboti.us/download/mujoco200_linux.zip
# unzip the archive before moving it into place
unzip mujoco200_linux.zip
mv mujoco200_linux ~/.mujoco/mujoco200
wget https://www.roboti.us/file/mjkey.txt
cp mjkey.txt /home/kewei/.mujoco/mjkey.txt
wget https://download-ib01.fedoraproject.org/pub/epel/7/x86_64/Packages/p/patchelf-0.12-1.el7.x86_64.rpm
yum localinstall patchelf-0.12-1.el7.x86_64.rpm
pip install 'mujoco_py==2.0.2.13'
```
It will take a while to compile mujoco for the first time.
Add the following to your ~/.bashrc:
```bash
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/.mujoco/mujoco200/bin
export MUJOCO_KEY_PATH=~/.mujoco${MUJOCO_KEY_PATH}
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/kewei/.mujoco/mujoco210/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/nvidia
```
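A quick sanity check that mujoco-py compiled and the environment loads (assuming the environment variables above have been sourced):

```python
# Run after `source ~/.bashrc`; the first import triggers mujoco-py compilation if needed.
import mujoco_py  # noqa: F401
import gym

env = gym.make("HalfCheetah-v2")
obs = env.reset()
print(obs.shape)  # expect (17,) with the default observation settings
```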
Then you can start training. Reuse the `RealTimeCaptureAndDisplayOutput` context manager (the `with` block) from the previous section to capture and display the output as it is produced:
```python
# dqn_session.run(class_type=DQNTrainer, episode=episode)
with RealTimeCaptureAndDisplayOutput() as captured_new:
    ppo_session.run(class_type=PPOTrainer, episode=episode, duration=duration)
```