A reinforcement learning demo in the OpenAI Gym environment

OpenAI Gym environment: this experiment uses the CartPole environment, in which a cart moves along a one-dimensional frictionless track. A pole is loosely attached to the cart and sways left and right; the agent applies a positive or negative force to the cart to keep the pole upright without toppling, as shown in the figure below.
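For reference, here is a minimal sketch of interacting with this environment using a random policy. It assumes the classic gym API used in the source code at the end of this post, where env.step returns a 4-tuple.

import gym

env = gym.make('CartPole-v0')           # cart-pole balancing task
observation = env.reset()               # 4-dim state: position, velocity, pole angle, angular velocity
for _ in range(100):
    action = env.action_space.sample()  # random action: 0 or 1 (push the cart in one of two directions)
    observation, reward, done, info = env.step(action)  # reward is +1 for every step the pole stays up
    if done:                            # episode ends when the pole falls or the cart leaves the track
        observation = env.reset()
env.close()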

This experiment uses a policy network: by observing the environment state it directly predicts the most probable action to execute, and executing that action is intended to yield the maximum expected return (current plus future Reward).

(1) Policy network design

This experiment uses a four-layer MLP with 2 hidden layers.

Input layer: the current environment state, described by four parameters: cart position, cart velocity, pole angle, and pole angular velocity. The input is therefore a four-dimensional vector, so the input layer has 4 neurons.

Hidden layers: 2 hidden layers with 40 neurons each.

Output layer: the output is the probability of an Action. There are two possible Actions, applying a positive or a negative force to the cart, so the output layer needs only one neuron; a sigmoid activation yields the probability value Pa. A random probability Pr is drawn; if Pr < Pa the Action is 1, otherwise it is 0.
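A minimal sketch of this network and the Pr < Pa sampling rule in TensorFlow 1.x. It mirrors the full source code at the end of this post; the all-zero state x is only a stand-in for a real observation.

import numpy as np
import tensorflow as tf

input_dim, hidden_nodes = 4, 40
observations = tf.placeholder(tf.float32, [None, input_dim], name="input_x")
w1 = tf.get_variable("W1", shape=[input_dim, hidden_nodes],
                     initializer=tf.contrib.layers.xavier_initializer())
w2 = tf.get_variable("W2", shape=[hidden_nodes, hidden_nodes],
                     initializer=tf.contrib.layers.xavier_initializer())
w3 = tf.get_variable("W3", shape=[hidden_nodes, 1],
                     initializer=tf.contrib.layers.xavier_initializer())
layer1 = tf.nn.relu(tf.matmul(observations, w1))    # hidden layer 1
layer2 = tf.nn.relu(tf.matmul(layer1, w2))          # hidden layer 2
probability = tf.nn.sigmoid(tf.matmul(layer2, w3))  # Pa = P(action = 1)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    x = np.zeros([1, input_dim], dtype=np.float32)  # stand-in state, for illustration only
    pa = sess.run(probability, feed_dict={observations: x})[0, 0]
    action = 1 if np.random.uniform() < pa else 0   # Pr < Pa -> Action 1, otherwise Action 0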

(2) Hyperparameter design

Activation functions: ReLU in the hidden layers, sigmoid in the output layer.

Learning rate: fixed at 0.1.

Gradient update strategy: Adam, combined with the policy gradient method: the model learns from the feedback each Action receives from the environment and uses the resulting gradients to update the model parameters (a sketch of the corresponding loss follows at the end of this section).

Number of iterations: at most 10000; iteration stops once the Reward reaches 200, at which point training has converged and the pole on the cart is stable.
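A simplified sketch of the policy gradient loss driven by this update. The dense layer here merely stands in for the MLP above, and minimize is called per step for brevity; the full source code at the end instead accumulates gradients over batch_size episodes with tf.gradients and apply_gradients.

import tensorflow as tf

observations = tf.placeholder(tf.float32, [None, 4], name="input_x")
probability = tf.nn.sigmoid(tf.layers.dense(observations, 1))      # Pa = P(action = 1)

input_y = tf.placeholder(tf.float32, [None, 1], name="input_y")    # input_y = 1 - action
advantages = tf.placeholder(tf.float32, name="reward_signal")      # normalized discounted Reward
# loglik reduces to log(Pa) when action = 1 and log(1 - Pa) when action = 0
loglik = tf.log(input_y*(input_y - probability) + (1 - input_y)*(input_y + probability))
loss = -tf.reduce_mean(loglik * advantages)                        # return-weighted log-likelihood
train_op = tf.train.AdamOptimizer(learning_rate=0.1).minimize(loss)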

(3) Reward design

The Reward is computed with the commonly used Discounted Future Reward: each future reward is multiplied by a decay factor, usually a number slightly less than but close to 1; in this experiment the decay factor is 0.99. The Reward at step t is computed as

R_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + … + γ^(T−t)·r_T,  with γ = 0.99.
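As a small worked example, the discount_rewards function from the source code below applied to a hypothetical 4-step episode (CartPole gives a reward of 1 at every step):

import numpy as np

gamma = 0.99  # decay factor

def discount_rewards(r):
    # walk backwards through the episode, accumulating the decayed return
    discounted_r = np.zeros_like(r)
    running_add = 0.0
    for t in reversed(range(r.size)):
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r

# a 4-step episode of rewards [1, 1, 1, 1] yields
# [1 + 0.99 + 0.99^2 + 0.99^3, 1 + 0.99 + 0.99^2, 1 + 0.99, 1]
print(discount_rewards(np.ones(4)))  # -> [3.940399 2.9701 1.99 1.]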

(4) Training process

For each iteration (episode):

  • Initialization: initialize the environment, the model parameters, and a GradBuffer used to accumulate gradients; the accumulated gradients are applied to the model parameters only after a full batch_size of episodes has been completed;
  • Forward pass: run the network to obtain the probability Pa, i.e. the probability that the Action is 1; draw a random value in (0, 1); if the random value < Pa then action = 1, otherwise action = 0;
  • Execute the action in the current environment state to obtain the new state, the reward for the action, and the termination flag done; when done is True the episode ends, the Discounted Future Reward is computed, and the gradients are computed and accumulated.
  • Iteration stops when the reward exceeds the preset threshold or the maximum number of iterations is reached.

In the best run, the average Reward within a batch reached 200 after roughly 160 iterations, at which point the pole on the cart was stable. The figure below plots the Reward against the number of iterations.
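The plotting calls are commented out in the source code below; here is a sketch of how such a curve could be drawn from logged values. plot_reward_curve is a hypothetical helper, and episodes / avg_rewards are assumed to have been collected during training, e.g. by appending episode_number and reward_sum/batch_size at every batch boundary.

from matplotlib import pyplot

def plot_reward_curve(episodes, avg_rewards):
    # average batch Reward versus episode number
    pyplot.plot(episodes, avg_rewards, marker='.')
    pyplot.xlabel('episode_number')
    pyplot.ylabel('reward')
    pyplot.show()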

Finally, the source code:

# -*- coding: utf-8 -*-
"""
Created on Fri Sep  7 10:14:37 2018

"""
import numpy as np
import tensorflow as tf
import gym
from matplotlib import pyplot

#CartPole experiment
env = gym.make('CartPole-v0')
env.reset()

#reward function (Discounted Future Reward)
def discount_rewards(r):
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(range(r.size)):
        running_add = running_add*gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r

#############################MLP network parameters and architecture#####
# hyperparameters
hidden_nodes = 40 # number of hidden layer neurons
batch_size = 20
learning_rate = 1e-1 # learning rate
gamma = 0.99 # discount factor for reward
input_dim = 4 # input dimensionality: cart position, cart velocity, pole angle, pole angular velocity
tf.reset_default_graph()

#4 layers NN
observations = tf.placeholder(tf.float32, [None,input_dim], name="input_x")
w1 = tf.get_variable("W1",shape=[input_dim,hidden_nodes],initializer=tf.contrib.layers.xavier_initializer())
layer1 = tf.nn.relu(tf.matmul(observations,w1))
w2 = tf.get_variable("W2",shape=[hidden_nodes,hidden_nodes],initializer=tf.contrib.layers.xavier_initializer())
layer2 = tf.nn.relu(tf.matmul(layer1,w2))
w3 = tf.get_variable("W3",shape=[hidden_nodes,1],initializer=tf.contrib.layers.xavier_initializer())
score = tf.matmul(layer2,w3)
probability = tf.nn.sigmoid(score)

#############################input parameters########################
input_y = tf.placeholder(tf.float32, [None,1], name="input_y")
advantages = tf.placeholder(tf.float32,name="reward_signal")
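# input_y = 1 - action, so loglik is log(probability) when action = 1 and log(1 - probability) when action = 0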
loglik = tf.log(input_y*(input_y-probability)+(1-input_y)*(input_y+probability))
loss = -tf.reduce_mean(loglik*advantages)
tvars = tf.trainable_variables()
newGrads = tf.gradients(loss,tvars)

#adam,batch
adam = tf.train.AdamOptimizer(learning_rate = learning_rate)
w1Grad = tf.placeholder(tf.float32,name="batch_grad1")
w2Grad = tf.placeholder(tf.float32,name="batch_grad2")
w3Grad = tf.placeholder(tf.float32,name="batch_grad3")
batchGrad = [w1Grad,w2Grad,w3Grad]
updateGrads = adam.apply_gradients(zip(batchGrad,tvars))

xs,ys,drs = [],[],[]
reward_sum = 0
episode_number = 1
total_episodes = 10000
init = tf.global_variables_initializer()

#############################Training process########################
with tf.Session() as sess:
    rendering = False
    sess.run(init)
    # pyplot.figure(1)
    # observation initialization
    observation = env.reset()
    print(observation)
    gradBuffer = sess.run(tvars)
    # gradient buffer initialization
    for ix,grad in enumerate(gradBuffer):
        gradBuffer[ix] = grad * 0
    while episode_number <= total_episodes:
        if reward_sum/batch_size > 100 or rendering == True:
            env.render()
            rendering = True
        # forward pass: compute Pa = P(action = 1)
        x = np.reshape(observation,[1,input_dim])
        tfprob = sess.run(probability,feed_dict={observations:x})
        action = 1 if np.random.uniform()<tfprob else 0
        xs.append(x)
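        # 'fake label': y = 1 - action, matching the loglik definition above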
        y = 1 - action
        ys.append(y)
        observation,reward,done,info = env.step(action)
        reward_sum += reward
        drs.append(reward)
        
        if done:
            episode_number += 1
            epx = np.vstack(xs)
            epy = np.vstack(ys)
            epr = np.vstack(drs)
            xs,ys,drs = [],[],[]
            discounted_epr = discount_rewards(epr)
            discounted_epr -= np.mean(discounted_epr)
            discounted_epr /= np.std(discounted_epr)
            tGrad = sess.run(newGrads,feed_dict = {observations:epx,input_y:epy,advantages:discounted_epr})
            for ix,grad in enumerate(tGrad):
                gradBuffer[ix] += grad
            
            if episode_number % batch_size == 0:
                sess.run(updateGrads,feed_dict={w1Grad:gradBuffer[0],w2Grad:gradBuffer[1],w3Grad:gradBuffer[2]})
                for ix,grad in enumerate(gradBuffer):
                    gradBuffer[ix] = grad * 0
                print("Average reward for episode %d:%f."%(episode_number,reward_sum/batch_size))
                # pyplot.scatter(episode_number,reward_sum/batch_size,c='r',marker='.')
                # pyplot.xlabel('episode_number')
                # pyplot.ylabel('reward')
                # pyplot.pause(0.00001)
                    
                if reward_sum/batch_size >= 200:
                    print("OK:",episode_number)
                    break
                reward_sum = 0
            observation = env.reset()
    # pyplot.show()

Reposted from blog.csdn.net/a40850273/article/details/84639169