[TensorLayer Series] Deep Reinforcement Learning: Introducing FrozenLake and Solving It with Tabular Q-Learning


TensorLayer Deep Reinforcement Learning series:

1. TensorLayer Deep Reinforcement Learning: Installing TensorLayer

2.4 Introduction to the gym Reinforcement Learning Environments

  This section describes what the environments provided by gym look like.

2.4.1 Installation

  The installation of Gym and related instructions are covered in this article (https://mp.weixin.qq.com/s?__biz=MzU1OTkwNzk4NQ==&mid=2247484108&idx=1&sn=0c9ff7488185c6287fbe56a3fa24a286&chksm=fc115732cb66de24dab450f458cc39effea9ffe4441010d5d3e00078badcdf132a54eb5388ba&token=366879770&lang=zh_CN#rd) and are not repeated here.

2.4.2 FrozenLake-v0

2.4.2.1 Description

  FrozenLake-v0 is a 4×4 grid in which each cell is the start cell, the goal cell, a frozen cell, or a hole. The objective is for the agent to learn how to move from the start cell to the goal cell without stepping into a hole. The agent can move up, down, left, or right, but a gust of wind may also push it onto a cell other than the one it intended. Under these conditions a policy that acts perfectly at every step is impossible, but learning to avoid the holes and reach the goal is certainly feasible.
Put more plainly: winter has arrived, and while you and your friends were tossing a frisbee in the park, a wild throw sent it out to the middle of a lake. The water is mostly frozen, but the ice has melted in a few spots, leaving holes. If you step into one of them, you fall into the freezing water. Since there is no spare frisbee, you have to cross the lake and retrieve it. However, the ice is slippery, so you will not always move in the direction you intend.

  The ice surface can be described by the following grid:

SFFF (S: starting point, safe)
FHFH (F: frozen surface, safe)
FFFH (H: hole, fall to your doom)
HFFG (G: goal, where the frisbee is located)

  The episode ends when you reach the goal or fall into a hole. Reaching the goal gives a reward of 1; otherwise the reward is 0.
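
  Before looking at the implementation, the environment can be exercised directly with random actions. The snippet below is a minimal sketch (not part of the original tutorial, and assuming the same classic gym API used throughout this article, where step() returns a 4-tuple):

import gym

env = gym.make('FrozenLake-v0')   # 4x4 slippery map by default
s = env.reset()                   # the agent always starts on the 'S' tile (state 0)
done = False
while not done:
    a = env.action_space.sample()       # random action: 0=Left, 1=Down, 2=Right, 3=Up
    s, r, done, info = env.step(a)      # r is 1 only when the 'G' tile is reached
    print('state: {:2d}  reward: {}  done: {}'.format(s, r, done))
env.render()                            # print the grid with the current cell highlighted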

2.4.2.2 Code

import sys
from contextlib import closing

import numpy as np
from six import StringIO, b

from gym import utils
from gym.envs.toy_text import discrete

LEFT = 0
DOWN = 1
RIGHT = 2
UP = 3

MAPS = {
    "4x4": [
        "SFFF",
        "FHFH",
        "FFFH",
        "HFFG"
    ],
    "8x8": [
        "SFFFFFFF",
        "FFFFFFFF",
        "FFFHFFFF",
        "FFFFFHFF",
        "FFFHFFFF",
        "FHHFFFHF",
        "FHFFHFHF",
        "FFFHFFFG"
    ],
}


def generate_random_map(size=8, p=0.8):
    """Generates a random valid map (one that has a path from start to goal)
    :param size: size of each side of the grid
    :param p: probability that a tile is frozen
    """
    valid = False

    # DFS to check that it's a valid path.
    def is_valid(res):
        frontier, discovered = [], set()
        frontier.append((0,0))
        while frontier:
            r, c = frontier.pop()
            if not (r,c) in discovered:
                discovered.add((r,c))
                directions = [(1, 0), (0, 1), (-1, 0), (0, -1)]
                for x, y in directions:
                    r_new = r + x
                    c_new = c + y
                    if r_new < 0 or r_new >= size or c_new < 0 or c_new >= size:
                        continue
                    if res[r_new][c_new] == 'G':
                        return True
                    if (res[r_new][c_new] not in '#H'):
                        frontier.append((r_new, c_new))
        return False

    while not valid:
        p = min(1, p)
        res = np.random.choice(['F', 'H'], (size, size), p=[p, 1-p])
        res[0][0] = 'S'
        res[-1][-1] = 'G'
        valid = is_valid(res)
    return ["".join(x) for x in res]


class FrozenLakeEnv(discrete.DiscreteEnv):
    """
    Winter is here. You and your friends were tossing around a frisbee at the park
    when you made a wild throw that left the frisbee out in the middle of the lake.
    The water is mostly frozen, but there are a few holes where the ice has melted.
    If you step into one of those holes, you'll fall into the freezing water.
    At this time, there's an international frisbee shortage, so it's absolutely imperative that
    you navigate across the lake and retrieve the disc.
    However, the ice is slippery, so you won't always move in the direction you intend.
    The surface is described using a grid like the following
        SFFF
        FHFH
        FFFH
        HFFG
    S : starting point, safe
    F : frozen surface, safe
    H : hole, fall to your doom
    G : goal, where the frisbee is located
    The episode ends when you reach the goal or fall in a hole.
    You receive a reward of 1 if you reach the goal, and zero otherwise.
    """

    metadata = {'render.modes': ['human', 'ansi']}

    def __init__(self, desc=None, map_name="4x4",is_slippery=True):
        if desc is None and map_name is None:
            desc = generate_random_map()
        elif desc is None:
            desc = MAPS[map_name]
        self.desc = desc = np.asarray(desc,dtype='c')
        self.nrow, self.ncol = nrow, ncol = desc.shape
        self.reward_range = (0, 1)

        nA = 4
        nS = nrow * ncol

        isd = np.array(desc == b'S').astype('float64').ravel()
        isd /= isd.sum()

        P = {s : {a : [] for a in range(nA)} for s in range(nS)}

        def to_s(row, col):
            return row*ncol + col

        def inc(row, col, a):
            if a == LEFT:
                col = max(col-1,0)
            elif a == DOWN:
                row = min(row+1,nrow-1)
            elif a == RIGHT:
                col = min(col+1,ncol-1)
            elif a == UP:
                row = max(row-1,0)
            return (row, col)

        for row in range(nrow):
            for col in range(ncol):
                s = to_s(row, col)
                for a in range(4):
                    li = P[s][a]
                    letter = desc[row, col]
                    if letter in b'GH':
                        li.append((1.0, s, 0, True))
                    else:
                        if is_slippery:
                            for b in [(a-1)%4, a, (a+1)%4]:
                                newrow, newcol = inc(row, col, b)
                                newstate = to_s(newrow, newcol)
                                newletter = desc[newrow, newcol]
                                done = bytes(newletter) in b'GH'
                                rew = float(newletter == b'G')
                                li.append((1.0/3.0, newstate, rew, done))
                        else:
                            newrow, newcol = inc(row, col, a)
                            newstate = to_s(newrow, newcol)
                            newletter = desc[newrow, newcol]
                            done = bytes(newletter) in b'GH'
                            rew = float(newletter == b'G')
                            li.append((1.0, newstate, rew, done))

        super(FrozenLakeEnv, self).__init__(nS, nA, P, isd)

    def render(self, mode='human'):
        outfile = StringIO() if mode == 'ansi' else sys.stdout

        row, col = self.s // self.ncol, self.s % self.ncol
        desc = self.desc.tolist()
        desc = [[c.decode('utf-8') for c in line] for line in desc]
        desc[row][col] = utils.colorize(desc[row][col], "red", highlight=True)
        if self.lastaction is not None:
            outfile.write("  ({})\n".format(["Left","Down","Right","Up"][self.lastaction]))
        else:
            outfile.write("\n")
        outfile.write("\n".join(''.join(line) for line in desc)+"\n")

        if mode != 'human':
            with closing(outfile):
                return outfile.getvalue()
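
  The slippery dynamics are encoded entirely in the transition table P built in __init__: P[s][a] is a list of (probability, next_state, reward, done) tuples. The short sketch below (inspection code added for illustration, not part of the gym source) makes the 1/3 split visible for going RIGHT from the start state, where the agent slips onto the intended tile or one of the two perpendicular neighbours:

import gym

RIGHT = 2                  # same action encoding as in the source above

env = gym.make('FrozenLake-v0')
P = env.unwrapped.P        # env.unwrapped is the FrozenLakeEnv instance defined above
# In state 0 (the 'S' tile), intending to go RIGHT can end up right (state 1),
# down (state 4), or bumping into the top wall (state 0), each with probability 1/3.
for prob, next_state, reward, done in P[0][RIGHT]:
    print('prob: {:.3f}  next_state: {:2d}  reward: {}  done: {}'.format(prob, next_state, reward, done))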

2.5 Reinforcement Learning Algorithms

2.5.1 Tabular Q-Learning

2.5.1.1 Code

  The principles of tabular Q-learning are not covered again here; see another article for details (Chapter 5: Model-Free Prediction and Control Based on Temporal Difference and Q-Learning, Part 1, https://mp.weixin.qq.com/s?__biz=MzU1OTkwNzk4NQ==&mid=2247484656&idx=1&sn=a0804ea632ff65b4f629dca5d4d23574&chksm=fc11510ecb66d818d8b91b7043254d5fe807fe123be4271caaca27282425f6b52952790f661e&token=366879770&lang=zh_CN#rd).
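
  For reference, the update that the code below applies at every step, with learning rate $\alpha$ (lr in the code) and discount factor $\gamma$ (lambd in the code), is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$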

"""Q-Table learning algorithm.
Non deep learning - TD Learning, Off-Policy, e-Greedy Exploration
Q(S, A) <- Q(S, A) + alpha * (R + lambda * Q(newS, newA) - Q(S, A))
See David Silver RL Tutorial Lecture 5 - Q-Learning for more details.
For Q-Network, see tutorial_frozenlake_q_network.py
EN: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0#.5m3361vlw
CN: https://zhuanlan.zhihu.com/p/25710327
tensorflow==2.0.0a0
tensorlayer==2.0.0
"""

import argparse
import os
import time

import gym
import matplotlib.pyplot as plt
import numpy as np

parser = argparse.ArgumentParser()
parser.add_argument('--train', dest='train', action='store_true', default=True)
parser.add_argument('--test', dest='test', action='store_true', default=True)

parser.add_argument(
    '--save_path', default=None, help='folder to save if mode == train else model path,'
    'qnet will be saved once target net update'
)
parser.add_argument('--seed', help='random seed', type=int, default=0)
parser.add_argument('--env_id', default='FrozenLake-v0')
args = parser.parse_args()

## Load the environment

alg_name = 'Qlearning'
env_id = args.env_id
env = gym.make(env_id)
render = True # display the game environment

##================= Implement Q-Table learning algorithm =====================##

## Initialize table with all zeros

Q = np.zeros([env.observation_space.n, env.action_space.n])

## Set learning parameters

lr = .85 # alpha, if use value function approximation, we can ignore it
lambd = .99 # decay factor
num_episodes = 10000
t0 = time.time()

if args.train:
    all_episode_reward = []
    for i in range(num_episodes):
        ## Reset environment and get first new observation
        s = env.reset()
        rAll = 0
        ## The Q-Table learning algorithm
        for j in range(99):
            if render: env.render()
            ## Choose an action by greedily (with noise) picking from Q table
            a = np.argmax(Q[s, :] + np.random.randn(1, env.action_space.n) * (1. / (i + 1)))
            ## Get new state and reward from environment
            s1, r, d, _ = env.step(a)
            ## Update Q-Table with new knowledge
            Q[s, a] = Q[s, a] + lr * (r + lambd * np.max(Q[s1, :]) - Q[s, a])
            rAll += r
            s = s1
            if d is True:
                break
        print(
            'Training | Episode: {}/{} | Episode Reward: {:.4f} | Running Time: {:.4f}'.format(
                i + 1, num_episodes, rAll,
                time.time() - t0
            )
        )
        if i == 0:
            all_episode_reward.append(rAll)
        else:
            all_episode_reward.append(all_episode_reward[-1] * 0.9 + rAll * 0.1)

    # save
    path = os.path.join('model', '_'.join([alg_name, env_id]))
    if not os.path.exists(path):
        os.makedirs(path)
    np.save(os.path.join(path, 'Q_table.npy'), Q)

    plt.plot(all_episode_reward)
    if not os.path.exists('image'):
        os.makedirs('image')
    plt.savefig(os.path.join('image', '_'.join([alg_name, env_id])))

    # print("Final Q-Table Values:/n %s" % Q)

if args.test:
    path = os.path.join('model', '_'.join([alg_name, env_id]))
    Q = np.load(os.path.join(path, 'Q_table.npy'))
    for i in range(num_episodes):
        ## Reset environment and get first new observation
        s = env.reset()
        rAll = 0
        ## Follow the learned Q-Table greedily (no exploration noise at test time)
        for j in range(99):
            ## Choose an action by greedily picking from Q table
            a = np.argmax(Q[s, :])
            ## Get new state and reward from environment
            s1, r, d, _ = env.step(a)
            rAll += r
            s = s1
            if d is True:
                break
        print(
            'Testing | Episode: {}/{} | Episode Reward: {:.4f} | Running Time: {:.4f}'.format(
                i + 1, num_episodes, rAll,
                time.time() - t0
            )
        )
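
  Because both --train and --test are store_true flags with default=True, running the script with no command-line arguments first trains for num_episodes = 10000 episodes, saves the table to model/Qlearning_FrozenLake-v0/Q_table.npy along with a reward curve under image/, and then evaluates the greedy policy. Note that render = True prints the grid at every training step, which slows training considerably; set it to False for faster runs.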

2.5.1.2 Experimental Results

  The final learned Q-table is as follows:

[[6.20622965e-01 8.84762425e-03 3.09373823e-03 6.55067399e-03]
 [6.49198039e-04 3.04069914e-04 8.78667903e-04 5.91638052e-01]
 [1.92065690e-03 4.33985167e-01 3.49151873e-03 1.97126703e-03]
 [2.70187111e-03 0.00000000e+00 0.00000000e+00 4.35444853e-01]
 [6.34931610e-01 1.09286085e-04 1.86982907e-03 2.76783612e-04]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [6.48093009e-07 1.13896350e-04 1.65719637e-01 1.90614063e-05]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 3.84251979e-03 1.48921362e-03 7.46942896e-01]
 [0.00000000e+00 8.03386378e-01 6.92688383e-04 0.00000000e+00]
 [8.40889312e-01 9.86082253e-06 1.25967676e-04 6.83892296e-05]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 9.61587991e-01 6.98637543e-03]
 [0.00000000e+00 9.99905944e-01 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]
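
  Each row of this table corresponds to one of the 16 grid cells (row-major, state = row*4 + col) and each column to an action (Left, Down, Right, Up). Below is a small sketch of how to read off the greedy policy from the saved table (illustrative only, assuming the training run above has produced Q_table.npy):

import numpy as np

# path written by the training code above
Q = np.load('model/Qlearning_FrozenLake-v0/Q_table.npy')
actions = np.array(['L', 'D', 'R', 'U'])                   # 0=Left, 1=Down, 2=Right, 3=Up
greedy_policy = actions[np.argmax(Q, axis=1)].reshape(4, 4)
print(greedy_policy)   # one greedy action per cell; all-zero rows (holes and goal) default to 'L'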

  The cumulative reward curve across episodes is shown below:

Figure 4: Cumulative reward per episode

  The success rates of three test runs are as follows:

Figure 5: Test success rates
