Authors: Yan Kexi, Jing Yuheng, Liu Zhihao, Feng Xiaokun, Lei Shiqi (Institute of Automation, Chinese Academy of Sciences)
Summary
For this reinforcement learning course project, our group chose the topic "connecting a new environment/algorithm to MindSpore Reinforcement Learning". We integrated SISL (Stanford Intelligent Systems Laboratory), a game test environment for multi-agent cooperation scenarios, into the MindSpore platform, and implemented benchmark tests of the QMIX and MAPPO algorithms on this environment. This post introduces and analyzes the work in five parts: topic selection, environment, algorithms, experimental results, and conclusions.
The project code is available at the repository below:
https://gitee.com/lemonifolds/reinforcement/tree/c03e5f8e6104b5880fdfa53d579332078c7dfb99
01
Topic introduction
Reinforcement learning is the study of sequential decision-making: by analyzing the interaction between an agent and its environment, it learns policies that maximize cumulative return. The environment is a key element of reinforcement learning. It not only determines the input data format of the algorithm model, but is also closely tied to the research task itself. Classic reinforcement learning algorithms are often associated with classic verification environments. For single-agent perception and decision-making tasks, the classic DQN (Deep Q-learning) [1] algorithm was validated on Atari games, the canonical environment for that task; for perfect-information zero-sum games and imperfect-information multi-player mixed games, the representative works AlphaGo [2][3] and AlphaStar [4] were even named after their verification environments (Go and StarCraft). This illustrates the important role that verification environments play in reinforcement learning.
A wide variety of environments are used in reinforcement learning; typical ones include Gym [5], MuJoCo [6], MPE [7], Atari [1], PySC2 [8], SMAC [9], TORCS, and ISAAC. MindSpore currently ships with two environments, Gym and SMAC (StarCraft Multi-Agent Challenge), targeting single-agent and multi-agent reinforcement learning respectively. For the latter, the complex, high-dimensional state, action, and decision spaces of StarCraft make it an important challenge benchmark for multi-agent algorithms. In actual research practice, however, a newly proposed algorithm is usually validated first in smaller environments (such as the Atari games in Gym). For this reason, our group chose SISL (Stanford Intelligent Systems Laboratory), a small multi-agent test environment, and integrated it into the MindSpore platform to provide a more diverse set of multi-agent verification environments. In addition, we implemented two typical multi-agent reinforcement learning algorithms, MAPPO [10] and QMIX [11], for the SISL environment and provide the corresponding benchmark results.
The rest of this post first gives a brief introduction to the SISL environment (Chapter 2) and to the QMIX and MAPPO algorithms used (Chapter 3); it then describes the integration process, the benchmark results obtained, and the problems encountered in the experiments (Chapter 4); finally, conclusions are drawn from the experimental results (Chapter 5).
02
Environment introduction
PettingZoo [12] is a Python library commonly used in multi-agent reinforcement learning research. SISL (Stanford Intelligent Systems Laboratory) is one of its environment families and contains three sub-environments: Multiwalker, Pursuit, and Waterworld.
The basic usage of PettingZoo is similar to Gym. The following is a code example for building a randomly acting agent in the Waterworld environment:
from pettingzoo.sisl import waterworld_v4

env = waterworld_v4.env(render_mode='human')
env.reset()
for agent in env.agent_iter():                     # iterate over agents in turn (AEC API)
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        action = None                              # a finished agent must be stepped with None
    else:
        action = env.action_space(agent).sample()  # sample a random action
    env.step(action)
env.close()
The following is an introduction to the three sub-environments in SISL.
2.1 Multiwalker
In the Multiwalker environment, several bipedal robots jointly carry a large package and try to walk to the right; they must cooperate to keep the package moving, as shown in the figure below.
Multiwalker environment diagram
At each time step, every agent receives a reward of forward_reward multiplied by the change in the package's position relative to the previous time step. If the package falls, or it crosses the left boundary of the terrain, the episode ends and every agent receives a reward of -100. If the package falls off the right edge of the terrain, the episode also ends, with a reward of 0.
If an agent falls, it receives an additional reward of -10. If terminate_on_fall = False, the episode does not terminate when an agent falls; otherwise the episode terminates as soon as an agent falls, with an additional reward of -100. If remove_on_fall = True, a fallen agent is removed from the environment. Each agent also receives -5 times the change of its head angle, to encourage keeping the head level. If shared_reward = True, the individual rewards are averaged and the mean is returned to every agent.
Each agent applies force at the two joints of each of its two legs, so its continuous action space is a 4-dimensional vector. The observation space of each agent is a 31-dimensional vector containing noisy lidar readings of the terrain and surrounding agents, the angles and velocities of its own joints, and other information.
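The reward-shaping switches mentioned above are constructor parameters of the environment. Below is a small sketch; the parameter names follow the PettingZoo documentation for multiwalker_v9 and should be verified against the installed version:

from pettingzoo.sisl import multiwalker_v9

env = multiwalker_v9.env(
    n_walkers=3,             # number of bipedal walkers carrying the package
    forward_reward=1.0,      # weight on the package's forward displacement
    terminate_on_fall=True,  # end the episode as soon as any walker falls
    remove_on_fall=True,     # remove a fallen walker from the environment
    shared_reward=False,     # keep individual rewards instead of averaging them
)
env.reset()
print(env.observation_space(env.agents[0]).shape)  # expected: (31,)
print(env.action_space(env.agents[0]).shape)       # expected: (4,) continuous actions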
2.2 Pursuit
In the Pursuit environment, pursuer agents try to chase and surround the evaders, as shown in the figure below (red: the controlled pursuer agents; blue: randomly moving evaders).
Pursuit environment diagram
By default, 30 evaders and 8 pursuer agents are placed in a 16×16 grid with an obstacle (the white region) in the center of the map. When pursuers fully surround an evader, each surrounding agent receives a reward of 5 and the evader is removed from the environment. An agent also receives a reward of 0.01 each time it touches an evader.
Each agent has a discrete action space: up, down, left, right, stay. The observation space of each agent is the 7×7 grid around it, shown in orange in the figure.
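As a quick sanity check of the spaces described above, the snippet below (default Pursuit settings) prints what the PettingZoo documentation leads us to expect; it is illustrative only:

from pettingzoo.sisl import pursuit_v4

env = pursuit_v4.env()
env.reset()
agent = env.agents[0]
print(len(env.agents))                     # 8 pursuer agents by default
print(env.action_space(agent))             # Discrete(5): up, down, left, right, stay
print(env.observation_space(agent).shape)  # 7x7 local window around each pursuer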
2.3 Waterworld
The Waterworld environment simulates archea trying to survive in their surroundings: each agent must try to consume food while avoiding toxins, as shown in the figure below.
Waterworld environment diagram
Depending on the input parameters, agents may need to cooperate to consume food, so the setting can be cooperative and competitive at the same time. Likewise, the rewards can be given individually or averaged across agents. The whole environment is a continuous two-dimensional space, and rewards are based on contact with food and toxins.
The action space of each agent is a 2-dimensional vector: the thrust (acceleration) in the horizontal and vertical directions. The observation space of each agent consists of the readings of its sensors, which report the distance to food, toxins, and other agents, together with collision indicators for food and toxins; the total dimension depends on the number of sensors and the parameter settings.
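The sketch below illustrates how the sensor configuration drives the observation size; the parameter names follow the PettingZoo documentation for waterworld_v4 and should be treated as assumptions to verify against the installed version:

from pettingzoo.sisl import waterworld_v4

env = waterworld_v4.env(
    n_pursuers=2,         # number of learning agents ("archea")
    n_coop=1,             # how many agents must touch food together to consume it
    n_sensors=30,         # sensors per agent; drives the observation dimension
    speed_features=True,  # include speed readings in each sensor's observation
)
env.reset()
agent = env.agents[0]
print(env.observation_space(agent).shape)  # grows with n_sensors and speed_features
print(env.action_space(agent).shape)       # (2,): horizontal and vertical thrust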
03
Algorithm introduction
In this section, we will introduce the MAPPO[10] algorithm and QMIX[11] algorithm used in the experiment.
3.1 MAPPO algorithm
The full name of MAPPO is Multi-Agent Proximal Policy Optimization. As the name suggests, MAPPO extends the classic PPO [13] algorithm to multi-agent settings. A cooperative multi-agent environment is commonly described by the tuple $\langle n, S, A, O, R, P, \gamma \rangle$, where $n$ is the number of agents; $S$ is the state space of the environment; $A$ is the joint action space formed by the actions of all agents; $o_i = O(s; i)$ is the local observation obtained by agent $i$ from the global state $s$ (in a fully observable environment, $o_i = s$); $P(s' \mid s, A)$ is the probability of transitioning from $s$ to $s'$ under a joint action; $R(s, A)$ is the shared reward function; and $\gamma$ is the discount factor.
The MAPPO algorithm uses the classic actor-critic structure, which trains two separate neural networks: a policy network $\pi_\theta$ (the actor) and a value function $V_\phi$ (the critic).
The policy network $\pi_\theta$ learns a mapping from an observation $o_i$ to a distribution over actions $a_i$; its optimization objective is
$$L(\theta) = \frac{1}{Bn}\sum_{i=1}^{B}\sum_{k=1}^{n} \min\!\Big( r^{(k)}_{\theta,i} A^{(k)}_i,\; \operatorname{clip}\big(r^{(k)}_{\theta,i}, 1-\epsilon, 1+\epsilon\big) A^{(k)}_i \Big) + \sigma \frac{1}{Bn}\sum_{i=1}^{B}\sum_{k=1}^{n} S\big[\pi_\theta(o^{(k)}_i)\big],$$
where $r^{(k)}_{\theta,i} = \frac{\pi_\theta(a^{(k)}_i \mid o^{(k)}_i)}{\pi_{\theta_{\mathrm{old}}}(a^{(k)}_i \mid o^{(k)}_i)}$ is the importance ratio, $B$ is the batch size, $A^{(k)}_i$ is the advantage computed with the GAE [14] method, $S$ denotes the policy entropy, and $\sigma$ is a weight-coefficient hyperparameter.
The value function $V_\phi$ learns a mapping from the global state $s$ to a return estimate; its optimization objective is
$$L(\phi) = \frac{1}{Bn}\sum_{i=1}^{B}\sum_{k=1}^{n} \max\!\Big( \big(V_\phi(s^{(k)}_i) - \hat{R}_i\big)^2,\; \big(\operatorname{clip}\big(V_\phi(s^{(k)}_i),\, V_{\phi_{\mathrm{old}}}(s^{(k)}_i)-\varepsilon,\, V_{\phi_{\mathrm{old}}}(s^{(k)}_i)+\varepsilon\big) - \hat{R}_i\big)^2 \Big),$$
where $\hat{R}_i$ is the computed discounted return. Because MAPPO adopts centralized training with decentralized execution, the value function can take the global state $s$ directly as input, while the policy function only processes each agent's local observation $o_i$.
Based on the optimization objectives above, the overall training procedure of MAPPO follows; the original paper [10] provides the full pseudocode of the recurrent MAPPO training loop.
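To make the two objectives above concrete, here is a minimal NumPy sketch of the clipped policy loss and value loss, written as losses to be minimized; the function and argument names are ours, and this is not the MindSpore implementation:

import numpy as np

def mappo_policy_loss(logp_new, logp_old, adv, entropy, eps=0.2, sigma=0.01):
    # importance ratio r = pi_theta(a|o) / pi_theta_old(a|o), computed in log space
    ratio = np.exp(logp_new - logp_old)
    surrogate = np.minimum(ratio * adv, np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)
    # negative sign: the paper maximizes the objective, here we minimize a loss
    return -np.mean(surrogate) - sigma * np.mean(entropy)

def mappo_value_loss(v_new, v_old, returns, eps=0.2):
    # clipped value loss: take the larger of the clipped and unclipped squared errors
    v_clipped = v_old + np.clip(v_new - v_old, -eps, eps)
    return np.mean(np.maximum((v_new - returns) ** 2, (v_clipped - returns) ** 2))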
3.2 QMIX algorithm
In this section, $u$ denotes actions and $\tau$ denotes action-observation histories. The design idea of QMIX is similar to that of VDN: each agent's greedy action with respect to its individual $Q_i$ should be consistent with the greedy joint action of a global $Q_{tot}$. It suffices that
$$\operatorname*{argmax}_{u} Q_{tot}(\tau, u) = \Big( \operatorname*{argmax}_{u^1} Q_1(\tau^1, u^1),\; \ldots,\; \operatorname*{argmax}_{u^n} Q_n(\tau^n, u^n) \Big).$$
For this to hold, it is enough that $Q_{tot}$ is monotonic with respect to each $Q_i$, that is,
$$\frac{\partial Q_{tot}}{\partial Q_i} \ge 0, \quad \forall i.$$
The neural network structure of QMIX is shown in the figure below.
QMIX framework diagram
For each agent $i$, an agent network computes its individual $Q_i(\tau^i, u^i)$, as shown in Figure (c). This network is a DRQN, i.e., a DQN in which the fully connected layer is replaced by a GRU; its inputs are the agent's observation $o^i_t$ at time $t$ and its action $u^i_{t-1}$ at time $t-1$.
The mixing network, shown in Figure (a), is a feedforward network that takes the $n$ agent-network outputs and mixes them monotonically, producing $Q_{tot}$. The weights of the mixing network are generated by separate hypernetworks. Each hypernetwork takes the global state $s_t$ as input and generates the weights of one layer of the mixing network; it consists of a single linear layer followed by an absolute-value activation, which guarantees that the mixing-network weights are non-negative.
The QMIX network is trained end-to-end to minimize the loss
$$L(\theta) = \sum_{i=1}^{b} \Big( y^{tot}_i - Q_{tot}(\tau, u, s; \theta) \Big)^2, \qquad y^{tot} = r + \gamma \max_{u'} Q_{tot}(\tau', u', s'; \theta^-),$$
where $b$ is the batch size and $\theta^-$ are the parameters of the target network (as in DQN).
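To make the monotonic mixing step concrete, here is a minimal NumPy sketch for a single transition; the hypernetworks are reduced to single linear layers and all names are ours, so this illustrates the idea rather than the MindSpore implementation:

import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1.0)

def qmix_mix(q_agents, state, hyper, hidden_dim=32):
    """q_agents: (n,) per-agent Q values; state: (s_dim,) global state.
    'hyper' holds linear hypernetwork parameters mapping the state to the
    mixing-network weights; the abs() calls enforce dQ_tot/dQ_i >= 0."""
    n = q_agents.shape[0]
    w1 = np.abs(hyper["W1"] @ state).reshape(n, hidden_dim)   # first-layer weights
    b1 = hyper["B1"] @ state                                  # first-layer bias
    w2 = np.abs(hyper["W2"] @ state).reshape(hidden_dim, 1)   # second-layer weights
    b2 = hyper["B2"] @ state                                  # second-layer bias
    hidden = elu(q_agents @ w1 + b1)                          # monotonic mixing layer
    return float(hidden @ w2 + b2)                            # Q_tot

# Example with random parameters: n = 3 agents, state dimension 10.
rng = np.random.default_rng(0)
n, s_dim, h = 3, 10, 32
hyper = {"W1": rng.normal(size=(n * h, s_dim)), "B1": rng.normal(size=(h, s_dim)),
         "W2": rng.normal(size=(h, s_dim)),     "B2": rng.normal(size=(1, s_dim))}
q_tot = qmix_mix(rng.normal(size=n), rng.normal(size=s_dim), hyper, hidden_dim=h)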
04
Experimental results
This section presents the experiments based on the introductions to the SISL environment and the MAPPO and QMIX algorithms above. First, we describe the integration of the SISL environment (Section 4.1); then, starting from the MAPPO and QMIX implementations provided in the original MindSpore RL repository, we modify and adapt them to run benchmark experiments in the newly integrated SISL environment (Sections 4.2 and 4.3), and we present the corresponding engineering improvements, experimental results, and reflections.
4.1 SISL environment access process
We installed the basic SISL environment library with the pip install pettingzoo[sisl] command. Alternatively, the environment can be installed locally as follows:
cd ~/reinforcement/mindspore_rl/environment
git clone https://github.com/Farama-Foundation/PettingZoo.git
cd PettingZoo
pip install -e .[sisl]
On top of this base environment, we wrap SISL following the way MindSpore Reinforcement wraps its existing environments. See sisl_environment.py for the concrete implementation.
The SISL environment wrapper class mainly inherits from the following class:
from mindspore_rl.environment import Environment
To stay compatible with MindSpore Reinforcement and the MindSpore training framework, the SISL wrapper class inherits or uses the following data structures:
import mindspore as ms
from mindspore.ops import operations as P
from mindspore_rl.environment.space import Space
After studying the MindSpore Reinforcement code base, we found that each algorithm is adapted only to specific environments; environments and algorithms are not interchangeable in general. For example, the MAPPO algorithm is only adapted to the MPE environment, so it supports only continuous vector state spaces and discrete action spaces; moreover, since the MPE environment is implemented with multiple processes, the MAPPO implementation is specialized for multi-process execution. Similarly, the QMIX algorithm is only adapted to the SMAC environment, so it also supports only continuous vector state spaces and discrete action spaces; SMAC runs in a single process, so QMIX supports only single-process execution. In short, the existing MindSpore Reinforcement algorithms cannot universally handle discrete and continuous state or action spaces, and state spaces are generally expected to be vectors; other input forms such as images would require additional backbone networks such as CNNs. Integrating a new environment therefore also requires adapting it to the single-process or multi-process implementation of each algorithm.
To adapt to the QMIX algorithm, we implemented the single-process SISL wrapper class SISLEnvironment and wrapped all of the environment's APIs in the MindSpore Reinforcement format.
To adapt to the MAPPO algorithm, we implemented the multi-process SISL wrapper class SISLMultiEnvironment, together with a multi-process scheduling class EnvironmentProcessNew based on Python's multiprocessing library, again wrapping all of the environment's APIs in the MindSpore Reinforcement format.
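To give a feel for the encapsulation, here is a simplified, hypothetical skeleton of the single-process wrapper. The class name and details are ours; the actual SISLEnvironment in sisl_environment.py inherits from mindspore_rl.environment.Environment and implements its full API:

import numpy as np
from pettingzoo.sisl import pursuit_v4
from mindspore_rl.environment.space import Space

class SISLEnvironmentSketch:
    """Hypothetical, simplified sketch of the single-process SISL wrapper."""

    def __init__(self):
        self._env = pursuit_v4.parallel_env()      # parallel API: one step moves all agents
        self._agents = self._env.possible_agents
        obs_space = self._env.observation_space(self._agents[0])
        act_space = self._env.action_space(self._agents[0])
        self._obs_shape = obs_space.shape
        # Expose MindSpore Reinforcement Space objects describing the per-agent spaces.
        self.observation_space = Space(obs_space.shape, np.float32,
                                       low=float(obs_space.low.min()),
                                       high=float(obs_space.high.max()))
        self.action_space = Space((1,), np.int32, low=0, high=act_space.n)

    def reset(self):
        out = self._env.reset()
        obs = out[0] if isinstance(out, tuple) else out   # handle both PettingZoo conventions
        return np.stack([obs[a] for a in self._agents]).astype(np.float32)

    def step(self, actions):
        acts = {a: int(actions[i]) for i, a in enumerate(self._agents)}
        obs, rewards, terms, truncs, _ = self._env.step(acts)
        done = all(terms.values()) or all(truncs.values())
        # pad agents that have already finished so shapes stay fixed
        stacked = np.stack([obs.get(a, np.zeros(self._obs_shape)) for a in self._agents])
        reward = np.array([rewards.get(a, 0.0) for a in self._agents], dtype=np.float32)
        return stacked.astype(np.float32), reward, np.array(done)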
4.2 MAPPO algorithm benchmark test experiment
reinforcement_MAPPO
├─ example
│ ├─ make_plot.py
│ ├─ scripts
│ │ ├─ mappo_train_log.txt
│ │ └─ run_standalone_train.sh
│ └─ train_SISL.py
└─ mindspore_rl
├─ algorithm
│ ├─ config.py
│ ├─ config_SISL.py
│ ├─ mappo.py
│ ├─ mappo_replaybuffer.py
│ ├─ mappo_session.py
│ ├─ mappo_trainer.py
│ ├─ mappo_vmap.py
│ ├─ mappo_vmap_trainer.py
│ ├─ mpe
│ ├─ mpe_environment.patch
│ ├─ mpe_environment.py
│ ├─ on-policy
│ ├─ sisl_environment.py
│ └─ __init__.py
└─ environment
└─ sisl_environment.py
Code tree of the MAPPO implementation with the SISL environment
Decoupling the MAPPO algorithm from the environment
In MindSpore, MAPPO natively supports the MPE environment, and in our group's trials it ran directly on MPE, successfully completing environment interaction, training, and parameter updates, which confirmed the correctness of the MAPPO code. SISL offers different maps as experimental environments, namely Multiwalker, Pursuit, and Waterworld, but simply editing config.py was not enough to run on SISL. After investigation and discussion, we found that in the MindSpore implementation, the MAPPO algorithm is highly coupled with its environment. In MAPPOSession.py, for instance, all environment-related variables, such as the number of agents, the state and observation dimensions, and the number of available actions, are hard-coded as concrete numbers rather than read from the environment; likewise, sisl_environment.py explicitly defines variables such as the number of agents that should have been configuration parameters. Because of these hard-coded definitions, changes to config.py were not propagated into the algorithm, which caused runtime errors at the interface where features are fed into the network.
We improved the original implementation by adding a decoupling layer between the algorithm and the environment, so that the framework correctly reads the environment information from config.py and passes it to the algorithm, truly decoupling the environment from the MAPPO algorithm.
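The idea of the decoupling is sketched below: the environment-dependent quantities are queried from the created environment and placed into the configuration, and only these values are passed to the algorithm. The keys and helper names here are illustrative, not the exact ones in the repository:

from pettingzoo.sisl import pursuit_v4

def build_env_config():
    env = pursuit_v4.env()
    env.reset()
    agent = env.agents[0]
    return {
        "agent_num": len(env.agents),                   # read from the environment
        "obs_dim": env.observation_space(agent).shape,  # instead of hard-coded numbers
        "action_dim": env.action_space(agent).n,
    }

config = build_env_config()
# The actor/critic networks and replay buffers are then built only from
# config["agent_num"], config["obs_dim"] and config["action_dim"].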
MAPPO access to the SISL environment
With the work above completed, we could be sure that the MAPPO algorithm itself was correct and that the configuration parameters were passed into the algorithm properly. The group then started integrating the SISL environment. In the MindSpore framework, different algorithms are tied to different environments and are executed differently. During debugging we found that, for the multi-process version of the SISL environment, the number of processes and the number of environments per process cannot simply be changed: changing either directly causes the algorithm to hang at some point. In addition, data type mismatches caused trouble during the environment conversion; for example, numpy.int32 and int32 are not interchangeable. Finally, the environment returns the one-hot encoding of actions, which cannot be fed directly into the defined MAPPO network for training.
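The last two adaptations amount to small pieces of glue code like the hypothetical lines below; the names (ACTION_DIM, raw_obs) are illustrative, not from the repository:

import numpy as np

ACTION_DIM = 5
one_hot_action = np.eye(ACTION_DIM, dtype=np.float32)[2]              # e.g. action 2 in one-hot form
action_index = np.int32(np.argmax(one_hot_action))                    # one-hot -> integer index for the network
back_to_one_hot = np.eye(ACTION_DIM, dtype=np.float32)[action_index]  # index -> one-hot, if needed
raw_obs = [0.1, 0.2, 0.3]
obs = np.asarray(raw_obs, dtype=np.float32)                           # enforce an explicit, consistent dtype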
After solving the data-type and environment-interaction problems, we trained MAPPO on the SISL environment. However, with the default training code the program crashed after every 19 episodes of training. After group discussion and line-by-line debugging, we found the cause: the original training code does not emit a truncation signal to end an episode, so when the next round of training starts, the environment continues from the state of the previous round and its step counter is never reset. Since the Pursuit environment has a maximum number of steps, once that maximum is reached the environment framework clears all rewards and observations inside the run function of sisl_environment.py, and the run function does not handle these cleared variables; as a result, the program crashes whenever the environment's step limit is hit, which is exactly why it failed after every 19 training episodes. We therefore added a check on whether the configured maximum number of steps has been reached; if so, the environment emits a truncation signal, the run function terminates, and the environment is reset.
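A hypothetical sketch of the fix is shown below; the real change lives inside the run function of sisl_environment.py, and the variable names here are ours:

def check_truncation(step_count, max_cycles):
    """Return (truncation, new_step_count): signal truncation once the episode
    step limit is reached so the caller can reset the environment instead of
    letting stale state leak into the next training round."""
    step_count += 1
    truncation = step_count >= max_cycles
    if truncation:
        step_count = 0          # the counter restarts together with the environment
    return truncation, step_count

# Inside the run loop (pseudocode):
#   truncation, step_count = check_truncation(step_count, max_cycles)
#   if termination or truncation:
#       env.reset()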
After these difficulties were resolved, the MAPPO algorithm ran successfully on the SISL environment. We set the state required by MAPPO to be the concatenation of all agents' observations and ran at most 500 episodes of 500 time steps each. We ran the MAPPO experiments on the SISL environment and plotted the reward and loss curves during training; after training we obtained the following results:
4.3 QMIX algorithm benchmark test experiment
reinforcement_QMIX
├─ example
│ ├─ qmix
│ │ ├─ eval.py
│ │ ├─ scripts
│ │ │ ├─ run_standalone_eval.sh
│ │ │ └─ run_standalone_train.sh
│ │ └─ train.py
│ └─ __init__.py
└─ mindspore_rl
├─ algorithm
│ ├─ qmix
│ │ ├─ config.py
│ │ ├─ qmix.py
│ │ ├─ qmix_session.py
│ │ ├─ qmix_trainer.py
│ │ ├─ _config.py
│ │ ├─ __init__.py
│ └─ __init__.py
└─ environment
├─ sc2_environment.py
└─ sisl_environment.py
Code tree of the QMIX implementation with the SISL environment
QMIX algorithm correction
During our implementation we found that, in the MindSpore framework, the existing QMIX algorithm and its corresponding SMAC experimental environment could not run; the eval.py example also failed with errors. Therefore, to verify the correctness of the algorithm and the feasibility of the environment, we first fixed the QMIX implementation in the framework and its corresponding environment.
Since this was our first time using the MindSpore framework, we found during the experiments that its error messages are not very informative. After group discussion we concluded that this is because the framework makes heavy use of inheritance and overloading, and because the underlying computation translates Python code into a C++ computational graph; as a result, many errors look almost identical, and the framework's debugger cannot always point to the corresponding error location.
Through line-by-line debugging, we found that the implemented QMIXTrainer class should have returned save_reward but incorrectly returned the SMAC environment's step_info instead. After this fix, the QMIX algorithm could be verified correctly in one of the SMAC environments.
Decoupling the QMIX algorithm from the environment
Based on the group's earlier experience, SMAC provides different maps as experimental environments, such as the 2s3z map already implemented in MindSpore, as well as 3m, 3s5z, and others. However, when we changed the corresponding config.py file as described in the documentation, the original program did not run. After investigation we found that, in the MindSpore implementation, the QMIX algorithm is highly coupled with its environment: in QMIXTrainer, almost all environment-related variables, such as the number of agents Agent_num, the observation dimension obs_shape, and the number of available actions, are hard-coded as concrete numbers rather than read from the environment.
- The evaluate() function mentioned above was reused in both the training and testing phases, which caused the problems described; we separated the two uses to avoid functional confusion.
- Variables such as Agent_num and obs_shape are properties of the environment, not the algorithm. We introduced local variables and refactored the code according to the MindSpore RL documentation, decoupling the algorithm from the environment and complying with the framework specifications.
In summary, by debugging and revising the framework we resolved the excessive coupling between environment and algorithm and truly decoupled the QMIX algorithm from the SMAC environment. We also submitted a pull request with this version of the code to the original repository for the convenience of future users of the framework.
In the 3s5z environment, testing QMIX with our code yields the following results:
QMIX access to SISL environment
With the work above completed, we could be sure that the QMIX algorithm was correct and produced the expected results in the corresponding SMAC environments. The group then started integrating the SISL environment. As mentioned earlier, in the MindSpore framework different algorithms are tied to different environments and executed differently; QMIX is implemented only in a single-process version, so it is not interchangeable with the MAPPO environment described above. During debugging we found that the single-process version of the SISL environment consistently failed to run, and the error was hard to localize; its message was: Unable to cast Python instance to C++ type.
The group discussed and debugged line by line, and found that because the MindSpore framework compiles to an underlying C++ computational graph for accelerated computation, there is a Python-to-C++ translation step that differs subtly from ordinary Python execution. A traditional Python program is interpreted and runs essentially line by line, whereas C++ is a compiled language that must be fully compiled before running. We suspect that this mixture is what makes the error messages mentioned earlier so uninformative. It also means the data type of every variable must be checked constantly while writing code; unlike in ordinary Python, numpy.int32 and int32 are not interchangeable, which cost a large amount of time checking data types at every step from the environment to the algorithm.
Facing these difficulties, the group checked the data types of every value passed between the environment and the algorithm in turn, but still could not locate the specific variable or the exact place where the problem occurred. In addition, we had already implemented the multi-process version of the SISL environment and run the MAPPO code on it. After discussion, the group concluded that the problem was caused by a mismatch between a Python-side data type and the underlying C++ type it maps to; such a problem is difficult to locate with single-point debugging and would require compilation-level debugging. Since this issue has little to do with the content of the reinforcement learning course, we did not spend further time debugging the single-process SISL implementation of the QMIX algorithm.
05
Conclusion
In this reinforcement learning course project, our group successfully integrated SISL (Stanford Intelligent Systems Laboratory), a game test environment for multi-agent cooperation scenarios, into the MindSpore platform, and benchmarked the QMIX and MAPPO algorithms on it. We completed all of the requirements of the assignment and gained a deeper understanding of MindSpore's underlying architecture, which will help us apply the MindSpore RL library more proficiently in our future research.
References
[1]. Mnih V, Kavukcuoglu K, Silver D, et al. Playing atari with deep reinforcement learning[J]. arXiv preprint arXiv:1312.5602, 2013.
[2]. Silver D, Huang A, Maddison C J, et al. Mastering the game of Go with deep neural networks and tree search[J]. Nature, 2016, 529(7587): 484-489.
[3]. Silver D, Schrittwieser J, Simonyan K, et al. Mastering the game of Go without human knowledge[J]. Nature, 2017, 550(7676): 354-359.
[4]. Vinyals O, Babuschkin I, Czarnecki W M, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning[J]. Nature, 2019, 575(7782): 350-354.
[5]. Brockman G, Cheung V, Pettersson L, et al. Openai gym[J]. arXiv preprint arXiv:1606.01540, 2016.
[6]. Todorov E, Erez T, Tassa Y. Mujoco: A physics engine for model-based control[C]//2012 IEEE/RSJ international conference on intelligent robots and systems. IEEE, 2012: 5026-5033.
[7]. Mordatch I, Abbeel P. Emergence of grounded compositional language in multi-agent populations[C]//Proceedings of the AAAI conference on artificial intelligence. 2018, 32(1).
[8]. Romo L, Jain M. PySC2 Reinforcement Learning[J].
[9]. Samvelyan M, Rashid T, De Witt C S, et al. The starcraft multi-agent challenge[J]. arXiv preprint arXiv:1902.04043, 2019.
[10]. Yu C, Velu A, Vinitsky E, et al. The surprising effectiveness of ppo in cooperative multi-agent games[J]. Advances in Neural Information Processing Systems, 2022, 35: 24611-24624.
[11]. Rashid T, Samvelyan M, De Witt C S, et al. Monotonic value function factorisation for deep multi-agent reinforcement learning[J]. The Journal of Machine Learning Research, 2020, 21(1): 7234-7284.
[12]. https://pettingzoo.farama.org/environments/sisl/
[13]. Schulman J, Wolski F, Dhariwal P, et al. Proximal policy optimization algorithms[J]. arXiv preprint arXiv:1707.06347, 2017.
[14]. Schulman J, Moritz P, Levine S, et al. High-dimensional continuous control using generalized advantage estimation[C]//Proceedings of the International Conference on Learning Representations (ICLR), 2016.