Reinforcement Learning Notes on Multi-Armed Bandits

I recently needed reinforcement learning algorithms for a path-planning project, and the first thing I looked at was the bandit problem. It turned out not to be all that closely related, probably because I didn't fully understand it (...). In the spirit of learning by asking questions as well as learning by doing, I'm writing up both topics here, and I'd be grateful if anyone more experienced could point out where I go wrong!

First, here is the rough RL formulation I have in mind for path planning: the vehicle is the agent; it can move in four directions (up, down, left, right), which gives four actions (go forward, brake, turn left, turn right); the drivable area is divided into a grid, and the state is the cell the vehicle currently occupies; rewards are -1000 for hitting an obstacle, 0 for moving backward, and for correct moves a reward that increases the closer the cell is to the destination. Then the reinforcement learning itself starts... (I'm out of ideas at this point, so I'll stop here and fill it in later; a rough sketch of the environment is below.)
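To make the formulation above a bit more concrete for myself, here is a minimal sketch of what such a grid environment could look like in numpy. Everything in it is my own placeholder (the class name GridEnv, the goal bonus of 100, the Manhattan-distance shaping), not something from the article I was reading, and it does not yet handle the "moving backward gives 0" rule.

import numpy as np

class GridEnv:
    def __init__(self, grid, goal):
        self.grid = np.array(grid)    # 0 = free cell, 1 = obstacle
        self.goal = goal              # (row, col) of the destination
        self.pos = (0, 0)             # start in the top-left cell
        # four actions: up, down, left, right
        self.moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    def step(self, action):
        r, c = self.pos
        dr, dc = self.moves[action]
        nr, nc = r + dr, c + dc
        rows, cols = self.grid.shape
        # driving off the grid or into an obstacle: big negative reward
        if not (0 <= nr < rows and 0 <= nc < cols) or self.grid[nr, nc] == 1:
            return self.pos, -1000, False
        self.pos = (nr, nc)
        if self.pos == self.goal:
            return self.pos, 100, True
        # shaping: the closer the new cell is to the destination, the higher the reward
        dist = abs(nr - self.goal[0]) + abs(nc - self.goal[1])
        return self.pos, -dist, False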

1. Contextual multi-armed bandits

First, here is the original article I learned from, so that nobody gets led astray by my notes: https://zhuanlan.zhihu.com/p/26075668

Before this I went through a non-contextual bandit, i.e. one without any state. I only half understood that one too, but the ideas are similar and I will need state anyway, so I'm jumping straight to recording this one. Here the state is which of several bandits you are facing, but judging from the code the states seem to be independent of each other, whereas for path planning I want states that are related to each other (please don't tell me all this is wasted effort...).

The idea (as I understand it): the agent is the gambler (that's me, (^-^)V), an action is pulling an arm, the reward is what you get after pulling it (1 or -1 in the code), and the added state is which of the several bandits I am currently playing. The agent has to learn the return of taking each action in each state (i.e. at each bandit).

Code idea: (I'm not quite sure what the weights mean here; in the earlier bandit the weights corresponded to the actions, and in this code each bandit also has its own probability coefficients?) The code uses TensorFlow to build a neural network that takes the state as input and outputs a weight for each action. With policy-gradient updates, the agent learns how to obtain the largest return in each state.

Code block 1: define the bandits. Three multi-armed bandits are used; each bandit has a different probability distribution over its arms, so a different action has to be taken to get the best result on each one. (I don't really get the sentence "different bandits have different probability distributions" yet; maybe I just don't know how to gamble.)

import tensorflow as tf
import tensorflow.contrib.slim as slim
import numpy as np


class contextual_bandit():
    def __init__(self):
        self.state = 0
        #List out our bandits. Currently arms 4, 2, and 1 (respectively) are the most optimal.
        self.bandits = np.array([[0.2,0,-0.0,-5],[0.1,-5,1,0.25],[-5,5,5,5]])
        self.num_bandits = self.bandits.shape[0]
        self.num_actions = self.bandits.shape[1]
        
    def getBandit(self):
        self.state = np.random.randint(0,len(self.bandits)) #Returns a random state for each episode.
        return self.state
        
    def pullArm(self,action):
        #Get a random number.
        bandit = self.bandits[self.state,action]
        result = np.random.randn(1)
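        # the lower this arm's value, the easier it is for the standard-normal
        # sample to exceed it, so more negative entries return +1 more often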
        if result > bandit:
            #return a positive reward.
            return 1
        else:
            #return a negative reward.
            return -1

Code block 2: the policy-gradient agent. The input is the current state and the output is the action to take, so the agent can behave differently in different states. The agent uses a set of weights, each of which serves as an estimate of the return from taking a particular action in a given state.

class agent():
    def __init__(self, lr, s_size,a_size):
        #These lines establish the feed-forward part of the network. The agent takes a state and produces an action.
        self.state_in= tf.placeholder(shape=[1],dtype=tf.int32)
        state_in_OH = slim.one_hot_encoding(self.state_in,s_size)
        output = slim.fully_connected(state_in_OH,a_size,\
            biases_initializer=None,activation_fn=tf.nn.sigmoid,weights_initializer=tf.ones_initializer())
        self.output = tf.reshape(output,[-1])
        self.chosen_action = tf.argmax(self.output,0)

        #The next six lines establish the training procedure. We feed the reward and chosen action into the network
        #to compute the loss, and use it to update the network.
        self.reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)
        self.action_holder = tf.placeholder(shape=[1],dtype=tf.int32)
        self.responsible_weight = tf.slice(self.output,self.action_holder,[1])
        self.loss = -(tf.log(self.responsible_weight)*self.reward_holder)
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr)
        self.update = optimizer.minimize(self.loss)
    
tf.reset_default_graph() #Clear the Tensorflow graph.

Finally, training:

cBandit = contextual_bandit() #Load the bandits.
myAgent = agent(lr=0.001,s_size=cBandit.num_bandits,a_size=cBandit.num_actions) #Load the agent.
weights = tf.trainable_variables()[0] #The weights we will evaluate to look into the network.

total_episodes = 10000 #Set total number of episodes to train agent on.
total_reward = np.zeros([cBandit.num_bandits,cBandit.num_actions]) #Set scoreboard for bandits to 0.
e = 0.1 #Set the chance of taking a random action.

init = tf.global_variables_initializer()

# Launch the tensorflow graph
with tf.Session() as sess:
    sess.run(init)
    i = 0
    while i < total_episodes:
        s = cBandit.getBandit() #Get a state from the environment.
        
        #Choose either a random action or one from our network.
        if np.random.rand(1) < e:
            action = np.random.randint(cBandit.num_actions)
        else:
            action = sess.run(myAgent.chosen_action,feed_dict={myAgent.state_in:[s]})
        
        reward = cBandit.pullArm(action) #Get our reward for taking an action given a bandit.
        
        #Update the network.
        feed_dict={myAgent.reward_holder:[reward],myAgent.action_holder:[action],myAgent.state_in:[s]}
        _,ww = sess.run([myAgent.update,weights], feed_dict=feed_dict)
        
        #Update our running tally of scores.
        total_reward[s,action] += reward
        if i % 500 == 0:
            print "Mean reward for each of the " + str(cBandit.num_bandits) + " bandits: " + str(np.mean(total_reward,axis=1))
        i+=1
for a in range(cBandit.num_bandits):
    print("The agent thinks action " + str(np.argmax(ww[a])+1) + " for bandit " + str(a+1) + " is the most promising....")
    if np.argmax(ww[a]) == np.argmin(cBandit.bandits[a]):
        print("...and it was right!")
    else:
        print("...and it was wrong!")

Debugging: nothing special, just run it, and the output looks like this:

[screenshot of the training output: mean reward for each of the three states]

Let me admit something embarrassing: at first I couldn't see why the more negative an entry is, the higher its score ends up (sigh!!!). Looking back up, the printed score is computed with

np.mean(total_reward, axis=1). A quick primer on mean (yes, I had to look it up): numpy.mean(a, axis, dtype, out, keepdims) computes the mean; for an m*n matrix:

If axis is not set, it averages all m*n entries and returns a single number.

axis = 0: collapse the rows, i.e. average each column, returning an array of length n.

axis = 1: collapse the columns, i.e. average each row, returning an array of length m.
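For example (a quick check of my own, not taken from the original article):

import numpy as np

scores = np.arange(12).reshape(3, 4)   # a 3x4 matrix, same shape as total_reward
print(np.mean(scores))                 # one number: the mean of all 12 entries
print(np.mean(scores, axis=0))         # length 4: the mean of each column
print(np.mean(scores, axis=1))         # length 3: the mean of each row, i.e. one score per state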

So that line simply takes the per-row (per-state) mean of the total_reward matrix, and

total_reward[s, action] += reward is just the running sum of rewards, where s is one of the 3 states and action is one of the 4 arm choices. The reward comes from reward = cBandit.pullArm(action); cBandit is the contextual_bandit() from the first code block, and its pullArm function returns 1 or -1 by comparing bandit against result, a sample drawn from the standard normal distribution (np.random.randn returns one or more samples from the standard normal, roughly 99% of which fall between -2.58 and +2.58, whereas np.random.rand returns uniform samples in [0, 1)). In bandit = self.bandits[self.state, action], self.state is the randomly chosen state (1, 2, or 3) and action is a random integer below num_actions; self.num_actions = self.bandits.shape[1] is the number of columns, and self.bandits was the np.array([[1000, 0, -60, 1000], [-80, 0, 1000, 0], [-100, 0, 1000, 1000]]) I was experimenting with (a 3-row, 4-column array), so self.bandits[self.state, action] just indexes one of the 3 rows and one of the 4 columns.
The overall process: repeatedly pick an entry of the 3x4 array, compare it with result to get a reward, accumulate those rewards, and finally average the accumulated rewards for each of the three states.
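Writing it out like this also answers my earlier question about the negative entries: pullArm pays +1 exactly when the standard-normal sample is larger than the entry, so the more negative the entry, the more often that arm wins. A quick check of my own (using the bandit values from the code listing above, not something from the original post):

import numpy as np

bandits = np.array([[0.2, 0, -0.0, -5], [0.1, -5, 1, 0.25], [-5, 5, 5, 5]])
samples = np.random.randn(100000)
# empirical probability that pullArm returns +1 for each arm of each bandit
win_rate = (samples[None, None, :] > bandits[:, :, None]).mean(axis=2)
print(np.round(win_rate, 3))
# the largest win rate in each row sits at that row's most negative entry,
# i.e. arm 4 for bandit 1, arm 2 for bandit 2, arm 1 for bandit 3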

Sigh, what use is the mean, though? What I actually want to see is how the agent gets the largest reward, and all this gives is the mean per state (!!!). Pulling myself together, on to the second part: how the action is obtained. For each state (a+1) the chosen action is str(np.argmax(ww[a]) + 1) (np.argmax returns the index of the largest value), and ww comes from

_, ww = sess.run([myAgent.update, weights], feed_dict=feed_dict). A note on sess.run: session.run([fetch1, fetch2]) evaluates several fetches at once, and feed_dict supplies values for the tensors created with placeholder. The two fetches here are myAgent.update and weights. myAgent.update comes from
myAgent = agent(lr=0.001, s_size=cBandit.num_bandits, a_size=cBandit.num_actions), i.e. agent(lr, 3, 4); inside the agent class, self.update = optimizer.minimize(self.loss) is the step of the gradient-descent optimizer that minimizes the loss.
self.loss = -(tf.log(self.responsible_weight) * self.reward_holder): tf.log computes the element-wise natural logarithm of its input.
self.responsible_weight = tf.slice(self.output, self.action_holder, [1]) extracts from self.output a slice of size [1] starting at position self.action_holder, i.e. the weight of the chosen action.
self.output = tf.reshape(output, [-1]) reshapes the tensor, flattening it to one dimension.
output = slim.fully_connected(state_in_OH, a_size, biases_initializer=None, activation_fn=tf.nn.sigmoid, weights_initializer=tf.ones_initializer()): in slim's fully_connected, the first two arguments are the layer input and the number of output units; activation_fn is the activation function (the default is nn.relu, here sigmoid); weights_initializer is the weight initializer (the default is initializers.xavier_initializer(), here all ones).
state_in_OH = slim.one_hot_encoding(self.state_in, s_size): one-hot encoding gives each state its own position in a vector, and at any time exactly one position is set to 1.
self.state_in = tf.placeholder(shape=[1], dtype=tf.int32): a placeholder can be thought of as a formal parameter used when defining the computation; the actual value is fed in only when the graph is run.
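To convince myself of what those training lines actually compute, here is the same update redone in plain numpy. This is my own re-derivation, not code from the original post; it assumes the network really is nothing more than a 3x4 weight matrix followed by a sigmoid, which is what the ones-initialized fully_connected layer with a one-hot input amounts to.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.001
W = np.ones((3, 4))   # same shape and initialization as the slim layer's weights

def train_step(s, a, r):
    # one policy-gradient step: loss = -log(sigmoid(W[s, a])) * r
    out = sigmoid(W[s, a])       # the "responsible weight" of the chosen arm
    grad = -r * (1.0 - out)      # d(loss)/dW[s, a], since d log(sigmoid(x))/dx = 1 - sigmoid(x)
    W[s, a] -= lr * grad         # gradient descent, like optimizer.minimize

A positive reward nudges W[s, a] up and a negative one nudges it down, which is why np.argmax(ww[a]) in the final loop ends up pointing at the arm the agent believes is best for bandit a.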

 

Reposted from blog.csdn.net/chongge369/article/details/82215466