OpenAI Gym 3

Observations

The last blog post covered a demo using OpenAI Gym's CartPole (inverted pendulum) environment. If you want to do better than taking random actions at each step, it helps to actually understand how your actions affect the environment.
The environment's step function returns exactly the information you need. It returns four values: observation, reward, done, and info. Specifically:

  • observation (object): an environment-specific object describing what you observe, such as camera pixel data, a robot's angular velocities and angular accelerations, or the board state in a board game.
  • reward (float): the reward returned for the previous action. How it is computed differs from environment to environment, but the goal is always to increase your total reward.
  • done (boolean): whether it is time to reset the environment. Most tasks are divided into well-defined episodes, and done being True indicates that the episode has terminated.
  • info (dict): diagnostic information useful for debugging, and occasionally for learning, although official evaluations do not allow an agent to use this information for learning.
This is the classic agent-environment loop: at each timestep the agent selects an action, and the environment returns an observation and a reward.

(Figure: the agent-environment loop.)

The process is started by calling reset, which returns an initial observation. A better way to write the code from the previous post is therefore to respect the done flag:
import gym
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()  # start a new episode; reset returns the initial observation
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()  # pick a random action from the action space
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break

When done is True, control has failed and the episode ends. The total reward of an episode equals the number of timesteps it lasts (t+1), because the environment returns a reward of 1 for every step the pole stays up: the longer the pole is kept balanced, the larger the return. With the purely random action selection above, the average episode reward is only about 20; a small sketch for checking this empirically follows the sample output below.

Sample output (truncated):

[ 0.00753165  0.8075176  -0.15841931 -1.63740717]
[ 0.023682    1.00410306 -0.19116745 -1.97497356]
Episode finished after 26 timesteps
[-0.01027234 -0.00503277  0.01774634  0.01849733]
[-0.01037299 -0.20040467  0.01811628  0.31672619]
[-0.01438109 -0.00554538  0.02445081  0.02981111]
[-0.01449199  0.18921755  0.02504703 -0.25505814]
[-0.01070764  0.38397309  0.01994587 -0.53973677]
[-0.00302818  0.57880906  0.00915113 -0.8260689 ]
[ 0.008548    0.77380468 -0.00737025 -1.11585968]
[ 0.02402409  0.9690226  -0.02968744 -1.41084543]
[ 0.04340455  1.16449982 -0.05790435 -1.71265888]
[ 0.06669454  1.36023677 -0.09215753 -2.0227866 ]
[ 0.09389928  1.55618414 -0.13261326 -2.34251638]
[ 0.12502296  1.75222707 -0.17946359 -2.67287294]
Episode finished after 12 timesteps
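
The "about 20 on average" claim is easy to check empirically. Below is a minimal sketch (not from the original post: the number of episodes is an arbitrary choice, and rendering is skipped for speed) that runs the same random policy for many episodes and prints the mean return:

import gym
env = gym.make('CartPole-v0')
returns = []
for _ in range(100):  # 100 episodes is an arbitrary sample size
    observation = env.reset()
    total, done = 0.0, False
    while not done:
        action = env.action_space.sample()  # random policy, as above
        observation, reward, done, info = env.step(action)
        total += reward  # +1 per surviving timestep
    returns.append(total)
print("average return over {} random episodes: {}".format(len(returns), sum(returns) / len(returns)))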

Spaces

In the example above, random actions were sampled from the environment's action space. But what exactly are these actions? Every environment comes with first-class Space objects that describe its valid actions and observations:

import gym
env = gym.make('CartPole-v0')
print(env.action_space)
#> Discrete(2)
print(env.observation_space)
#> Box(4,)      

A Discrete space allows a fixed range of non-negative integers, so in this case the valid actions are 0 and 1. A Box space represents an n-dimensional box, so a valid observation here is an array of 4 numbers. You can also inspect the Box's bounds:

print(env.observation_space.high)
#> array([ 2.4       ,         inf,  0.20943951,         inf])
print(env.observation_space.low)
#> array([-2.4       ,        -inf, -0.20943951,        -inf])
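
As a quick sanity check (a minimal sketch using the same API, not part of the original post), you can confirm that the observations returned by step really do fall inside this Box:

import gym
env = gym.make('CartPole-v0')
observation = env.reset()
assert env.observation_space.contains(observation)  # the initial observation lies inside the Box
for _ in range(10):
    observation, reward, done, info = env.step(env.action_space.sample())
    assert env.observation_space.contains(observation)
    if done:
        break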

This kind of introspection can help you write generic code that works across many different environments. Box and Discrete are the most commonly used spaces; you can sample from a space or check whether a value belongs to it:

from gym import spaces
space = spaces.Discrete(8) # Set with 8 elements {0, 1, 2, ..., 7}
x = space.sample()
assert space.contains(x)
assert space.n == 8
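
To illustrate the "generic code" point above, here is a minimal sketch (the helper name and the extra MountainCar-v0 environment are illustrative choices, not from the original post) that branches on the concrete space type so the same code can describe any environment:

import gym
from gym import spaces

def describe_space(space):
    # Branch on the concrete Space subclass.
    if isinstance(space, spaces.Discrete):
        return "Discrete with {} choices".format(space.n)
    if isinstance(space, spaces.Box):
        return "Box with shape {}".format(space.shape)
    return str(space)

for env_id in ['CartPole-v0', 'MountainCar-v0']:
    env = gym.make(env_id)
    print(env_id, describe_space(env.action_space), describe_space(env.observation_space))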

For CartPole-v0, one of the two actions applies a force to the left and the other applies a force to the right.
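
Knowing what the two actions mean already lets a simple hand-written policy do noticeably better than random actions. The sketch below (not from the original post) assumes the usual CartPole observation layout of cart position, cart velocity, pole angle, and pole angular velocity (the bounds printed above, 2.4 for position and about 0.21 rad for angle, match indices 0 and 2), and pushes the cart in the direction the pole is leaning:

import gym
env = gym.make('CartPole-v0')
returns = []
for _ in range(20):
    observation = env.reset()
    total, done = 0.0, False
    while not done:
        # observation[2] is assumed to be the pole angle (positive when leaning right):
        # push right (action 1) when the pole leans right, otherwise push left (action 0).
        action = 1 if observation[2] > 0 else 0
        observation, reward, done, info = env.step(action)
        total += reward
    returns.append(total)
print("average return with the angle heuristic: {}".format(sum(returns) / len(returns)))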

Environments

The main purpose of Gym is to provide a large collection of environments that expose a common interface and are versioned so that comparisons stay meaningful. You can list the environments that your installation provides:

from gym import envs
print(envs.registry.all())

[EnvSpec(PredictActionsCartpole-v0), EnvSpec(Asteroids-ramDeterministic-v0), EnvSpec(Asteroids-ramDeterministic-v3), EnvSpec(Gopher-ramDeterministic-v3), EnvSpec(Gopher-ramDeterministic-v0), EnvSpec(DoubleDunk-ramDeterministic-v3), EnvSpec(DoubleDunk-ramDeterministic-v0), EnvSpec(Tennis-ramNoFrameskip-v3), EnvSpec(RoadRunner-ramDeterministic-v0), EnvSpec(Robotank-ram-v3), EnvSpec(CartPole-v0), EnvSpec(CartPole-v1), EnvSpec(Gopher-ram-v3), EnvSpec(Gopher-ram-v0)...
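
Each entry is an EnvSpec whose id can be passed to gym.make. A minimal sketch (the 'CartPole' filter is just an illustrative choice, not from the original post) for counting and filtering the registry:

from gym import envs
env_ids = [spec.id for spec in envs.registry.all()]
print(len(env_ids))  # total number of registered environments
print([env_id for env_id in env_ids if 'CartPole' in env_id])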
