**Chapter 1**
Introduction
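Learning from interaction is a foundational idea underlying nearly all theories of learning and intelligence. We discuss how to design effective machines for solving learning problems of scientific or economic interest, and we evaluate these designs through mathematical analysis or computational experiments. The approach we explore is called "reinforcement learning". Compared with other approaches to machine learning, reinforcement learning focuses much more on goal-directed learning from interaction.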
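1.1 Reinforcement Learning
Reinforcement learning is learning what to do (that is, how to map the current situation to an action) so as to maximize a numerical reward signal.
Its two most distinguishing features are trial-and-error search and delayed reward.
A Markov decision process captures three aspects of the problem: sensation, action, and goal. Any method well suited to solving problems of this kind is considered a reinforcement learning method.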
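The interaction described above can be sketched as a loop in which the agent observes a state, picks an action, and receives a reward. This is a minimal illustration, not anything from the notes themselves: `CorridorEnv`, `run_episode`, and both policies are hypothetical names, and the corridor task is an assumed toy problem chosen because its only reward is delayed until the goal.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: the agent starts at cell 0 and must reach the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); the agent cannot move below cell 0
        self.position = max(0, self.position + action)
        done = self.position >= self.length
        reward = 1.0 if done else 0.0  # the only reward comes at the goal, so it is delayed
        return self.position, reward, done


def run_episode(env, policy, max_steps=1000):
    """Interact until the episode ends; return (total reward, steps taken)."""
    state = env.reset()
    total_reward = 0.0
    for steps in range(1, max_steps + 1):
        action = policy(state)                    # situation -> action
        state, reward, done = env.step(action)    # environment responds
        total_reward += reward
        if done:
            return total_reward, steps
    return total_reward, max_steps


def always_right(state):
    return 1


def random_policy(state):
    return random.choice([-1, 1])


print(run_episode(CorridorEnv(), always_right))   # (1.0, 5)
print(run_episode(CorridorEnv(), random_policy))  # varies from run to run
```

The random policy usually needs many more steps than the fixed one, which is the kind of difference a learning method would be expected to close through trial and error.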
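Reinforcement learning differs from both supervised and unsupervised learning:
Supervised learning: learns from a training set of labeled examples provided by an external supervisor.
Unsupervised learning: is typically about finding structure hidden in collections of unlabeled data.
A challenge unique to reinforcement learning is the trade-off between exploration and exploitation.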
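One common way to balance the two is epsilon-greedy action selection: with a small probability the agent tries a random action, and otherwise it takes the action it currently rates highest. The sketch below assumes action values are kept in a plain list; the function name and its parameters are illustrative only.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Return an action index: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        # explore: pick any action uniformly at random
        return rng.randrange(len(action_values))
    # exploit: pick a highest-valued action, breaking ties at random
    best = max(action_values)
    best_actions = [a for a, v in enumerate(action_values) if v == best]
    return rng.choice(best_actions)


estimates = [0.1, 0.9, 0.3]                     # hypothetical value estimates for three actions
print(epsilon_greedy(estimates, epsilon=0.0))   # 1, purely greedy
print(epsilon_greedy(estimates, epsilon=0.1))   # usually 1, occasionally another index
```

Setting `epsilon` to 0 gives pure exploitation and setting it to 1 gives pure exploration; the tic-tac-toe players in Section 1.5 use the same idea through their `epsilon` parameter.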
1.2 Examples
1.3 Elements of Reinforcement Learning
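Beyond the agent and the environment, a reinforcement learning system has four main elements: a policy, a reward signal, a value function, and, optionally, a model of the environment.
(1) Policy: defines the learning agent's way of behaving at a given time. Roughly speaking, it is a mapping from states of the environment to actions, and it corresponds to what psychology calls "stimulus-response" rules. The policy alone is sufficient to determine behavior, which makes it the core of a reinforcement learning agent. In general, a policy may be stochastic, specifying a probability for each action in each state.
(2) Reward signal: defines the goal of a reinforcement learning problem.
The agent's sole objective is to maximize the total reward it receives over the long run. The reward signal therefore defines what is good and what is bad for the agent, and it is the primary basis for altering the policy. In general, the reward signal may be a stochastic function of the state of the environment and the action taken in it.
(3) Value function: specifies what is good in the long run. Roughly speaking, the value of a state is the total reward an agent can expect to accumulate in the future, starting from that state.
Actions are chosen on the basis of value judgments. Rewards are essentially given directly by the environment, whereas values must be estimated and re-estimated from the sequences of rewards the agent observes over its entire lifetime.
(4) Model of the environment: allows inferences to be made about how the environment will behave. Models are used for planning, that is, deciding on a course of action by considering possible future situations before they are actually experienced.
Methods that solve reinforcement learning problems using models and planning are called model-based methods. Simpler model-free methods, by contrast, learn directly by trial and error.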
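To make the role of a model concrete, the sketch below plans one step ahead: it asks the model what each action would lead to and picks the action with the best predicted reward plus next-state value. Everything here is assumed for illustration: the two-state problem, the `model` and `values` dictionaries, and the function name `plan_one_step`. A model-free method would have no `model` to query and would have to try the actions in the real environment instead.

```python
# Hypothetical deterministic model: model[state][action] -> (next_state, reward)
model = {
    "A": {"left": ("A", 0.0), "right": ("B", 0.0)},
    "B": {"left": ("A", 0.0), "right": ("goal", 1.0)},
}

# Hypothetical value estimates: long-run desirability of each state
values = {"A": 0.2, "B": 0.8, "goal": 0.0}


def plan_one_step(state, model, values):
    """Pick the action whose predicted reward plus next-state value is largest."""
    def predicted_outcome(action):
        next_state, reward = model[state][action]
        return reward + values[next_state]

    return max(model[state], key=predicted_outcome)


print(plan_one_step("A", model, values))  # right
print(plan_one_step("B", model, values))  # right
```

In state "A" neither action yields an immediate reward, so the choice is driven entirely by the value estimates, which illustrates why actions are selected on value judgments and not on immediate reward alone.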
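1.4 Limitations and Scope
State: the state serves as input to the policy and the value function, and as both input to and output from the model. In general, the state can be regarded as a signal conveyed to the agent that tells it how the environment currently is.
1.5 Tic-Tac-Toe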
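The listing in this section learns state values with a temporal-difference update, visible in `Player.backup`. Written out, with $V$ standing for the `estimations` table and $\alpha$ for `step_size` (the symbols are this note's notation, not the code's), the rule moves the value of an earlier state a fraction of the way toward the value of the state that followed it:

```latex
V(S_t) \leftarrow V(S_t) + \alpha \left[ V(S_{t+1}) - V(S_t) \right]

% Worked example with the listing's defaults: \alpha = 0.1,
% V(S_t) = 0.5 (initial estimate of a non-terminal state),
% V(S_{t+1}) = 1 (a winning terminal state)
V(S_t) \leftarrow 0.5 + 0.1 \times (1 - 0.5) = 0.55
```

The code also multiplies the bracketed difference by `self.greedy[i]`, so entries flagged `False` after an exploratory move contribute no update.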
import numpy as np
import pickle
BOARD_ROWS = 3
BOARD_COLS = 3
BOARD_SIZE = BOARD_ROWS * BOARD_COLS


class State:
    def __init__(self):
        # the board is represented by an n * n array,
        # 1 represents a chessman of the player who moves first,
        # -1 represents a chessman of another player
        # 0 represents an empty position
        self.data = np.zeros((BOARD_ROWS, BOARD_COLS))
        self.winner = None
        self.hash_val = None
        self.end = None

    # compute the hash value for one state, it's unique
    def hash(self):
        if self.hash_val is None:
            self.hash_val = 0
            for i in np.nditer(self.data):
                self.hash_val = self.hash_val * 3 + i + 1
        return self.hash_val

    # check whether a player has won the game, or it's a tie
    def is_end(self):
        if self.end is not None:
            return self.end
        results = []
        # check row
        for i in range(BOARD_ROWS):
            results.append(np.sum(self.data[i, :]))
        # check columns
        for i in range(BOARD_COLS):
            results.append(np.sum(self.data[:, i]))

        # check diagonals
        trace = 0
        reverse_trace = 0
        for i in range(BOARD_ROWS):
            trace += self.data[i, i]
            reverse_trace += self.data[i, BOARD_ROWS - 1 - i]
        results.append(trace)
        results.append(reverse_trace)

        for result in results:
            if result == 3:
                self.winner = 1
                self.end = True
                return self.end
            if result == -3:
                self.winner = -1
                self.end = True
                return self.end

        # whether it's a tie
        sum_values = np.sum(np.abs(self.data))
        if sum_values == BOARD_SIZE:
            self.winner = 0
            self.end = True
            return self.end

        # game is still going on
        self.end = False
        return self.end

    # @symbol: 1 or -1
    # put chessman symbol in position (i, j)
    def next_state(self, i, j, symbol):
        new_state = State()
        new_state.data = np.copy(self.data)
        new_state.data[i, j] = symbol
        return new_state

    # print the board
    def print_state(self):
        for i in range(BOARD_ROWS):
            print('-------------')
            out = '| '
            for j in range(BOARD_COLS):
                if self.data[i, j] == 1:
                    token = '*'
                elif self.data[i, j] == -1:
                    token = 'x'
                else:
                    token = '0'
                out += token + ' | '
            print(out)
        print('-------------')


def get_all_states_impl(current_state, current_symbol, all_states):
    for i in range(BOARD_ROWS):
        for j in range(BOARD_COLS):
            if current_state.data[i][j] == 0:
                new_state = current_state.next_state(i, j, current_symbol)
                new_hash = new_state.hash()
                if new_hash not in all_states:
                    is_end = new_state.is_end()
                    all_states[new_hash] = (new_state, is_end)
                    if not is_end:
                        get_all_states_impl(new_state, -current_symbol, all_states)


def get_all_states():
    current_symbol = 1
    current_state = State()
    all_states = dict()
    all_states[current_state.hash()] = (current_state, current_state.is_end())
    get_all_states_impl(current_state, current_symbol, all_states)
    return all_states


# all possible board configurations
all_states = get_all_states()


class Judger:
    # @player1: the player who will move first, its chessman will be 1
    # @player2: another player with a chessman -1
    def __init__(self, player1, player2):
        self.p1 = player1
        self.p2 = player2
        self.current_player = None
        self.p1_symbol = 1
        self.p2_symbol = -1
        self.p1.set_symbol(self.p1_symbol)
        self.p2.set_symbol(self.p2_symbol)
        self.current_state = State()

    def reset(self):
        self.p1.reset()
        self.p2.reset()

    def alternate(self):
        while True:
            yield self.p1
            yield self.p2

    # @print_state: if True, print each board during the game
    def play(self, print_state=False):
        alternator = self.alternate()
        self.reset()
        current_state = State()
        self.p1.set_state(current_state)
        self.p2.set_state(current_state)
        if print_state:
            current_state.print_state()
        while True:
            player = next(alternator)
            i, j, symbol = player.act()
            next_state_hash = current_state.next_state(i, j, symbol).hash()
            current_state, is_end = all_states[next_state_hash]
            self.p1.set_state(current_state)
            self.p2.set_state(current_state)
            if print_state:
                current_state.print_state()
            if is_end:
                return current_state.winner


# AI player
class Player:
    # @step_size: the step size to update estimations
    # @epsilon: the probability to explore
    def __init__(self, step_size=0.1, epsilon=0.1):
        self.estimations = dict()
        self.step_size = step_size
        self.epsilon = epsilon
        self.states = []
        self.greedy = []
        self.symbol = 0

    def reset(self):
        self.states = []
        self.greedy = []

    def set_state(self, state):
        self.states.append(state)
        self.greedy.append(True)

    def set_symbol(self, symbol):
        self.symbol = symbol
        for hash_val in all_states:
            state, is_end = all_states[hash_val]
            if is_end:
                if state.winner == self.symbol:
                    self.estimations[hash_val] = 1.0
                elif state.winner == 0:
                    # we need to distinguish between a tie and a loss
                    self.estimations[hash_val] = 0.5
                else:
                    self.estimations[hash_val] = 0
            else:
                self.estimations[hash_val] = 0.5

    # update value estimation
    def backup(self):
        states = [state.hash() for state in self.states]

        for i in reversed(range(len(states) - 1)):
            state = states[i]
            td_error = self.greedy[i] * (
                self.estimations[states[i + 1]] - self.estimations[state]
            )
            self.estimations[state] += self.step_size * td_error

    # choose an action based on the state
    def act(self):
        state = self.states[-1]
        next_states = []
        next_positions = []
        for i in range(BOARD_ROWS):
            for j in range(BOARD_COLS):
                if state.data[i, j] == 0:
                    next_positions.append([i, j])
                    next_states.append(state.next_state(
                        i, j, self.symbol).hash())

        if np.random.rand() < self.epsilon:
            action = next_positions[np.random.randint(len(next_positions))]
            action.append(self.symbol)
            self.greedy[-1] = False
            return action

        values = []
        for hash_val, pos in zip(next_states, next_positions):
            values.append((self.estimations[hash_val], pos))
        # shuffle first so that ties between equal values are broken at random,
        # because Python's sort is stable
        np.random.shuffle(values)
        values.sort(key=lambda x: x[0], reverse=True)
        action = values[0][1]
        action.append(self.symbol)
        return action

    def save_policy(self):
        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'wb') as f:
            pickle.dump(self.estimations, f)

    def load_policy(self):
        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'rb') as f:
            self.estimations = pickle.load(f)


# human interface
# press a key to place a chessman; the keys map to board positions as follows
# | q | w | e |
# | a | s | d |
# | z | x | c |
class HumanPlayer:
    def __init__(self, **kwargs):
        self.symbol = None
        self.keys = ['q', 'w', 'e', 'a', 's', 'd', 'z', 'x', 'c']
        self.state = None

    def reset(self):
        pass

    def set_state(self, state):
        self.state = state

    def set_symbol(self, symbol):
        self.symbol = symbol

    def act(self):
        self.state.print_state()
        key = input("Input your position:")
        data = self.keys.index(key)
        i = data // BOARD_COLS
        j = data % BOARD_COLS
        return i, j, self.symbol


def train(epochs, print_every_n=500):
    player1 = Player(epsilon=0.01)
    player2 = Player(epsilon=0.01)
    judger = Judger(player1, player2)
    player1_win = 0.0
    player2_win = 0.0
    for i in range(1, epochs + 1):
        winner = judger.play(print_state=False)
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        if i % print_every_n == 0:
            print('Epoch %d, player 1 winrate: %.02f, player 2 winrate: %.02f' % (i, player1_win / i, player2_win / i))
        player1.backup()
        player2.backup()
        judger.reset()
    player1.save_policy()
    player2.save_policy()


def compete(turns):
    player1 = Player(epsilon=0)
    player2 = Player(epsilon=0)
    judger = Judger(player1, player2)
    player1.load_policy()
    player2.load_policy()
    player1_win = 0.0
    player2_win = 0.0
    for _ in range(turns):
        winner = judger.play()
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        judger.reset()
    print('%d turns, player 1 win %.02f, player 2 win %.02f' % (turns, player1_win / turns, player2_win / turns))


# The game is a zero sum game. If both players are playing with an optimal strategy, every game will end in a tie.
# So we test whether the AI can guarantee at least a tie if it goes second.
def play():
    while True:
        player1 = HumanPlayer()
        player2 = Player(epsilon=0)
        judger = Judger(player1, player2)
        player2.load_policy()
        winner = judger.play()
        if winner == player2.symbol:
            print("You lose!")
        elif winner == player1.symbol:
            print("You win!")
        else:
            print("It is a tie!")


if __name__ == '__main__':
    train(int(1e5))
    compete(int(1e3))
    play()
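
The `State.hash` method in the listing treats the nine cells as digits of a base-3 number, which is why each board gets a distinct key in `all_states`. The standalone sketch below reproduces that idea in plain Python without NumPy; `encode_board` and `decode_board` are hypothetical helper names, and the decoder is an addition for illustration that the listing itself does not need.

```python
def encode_board(cells):
    """Encode a flat sequence of cells (-1, 0, or 1) as a base-3 integer."""
    code = 0
    for cell in cells:
        code = code * 3 + (cell + 1)   # map -1, 0, 1 to the digits 0, 1, 2
    return code


def decode_board(code, size=9):
    """Invert encode_board: recover the flat list of cells from the integer."""
    cells = []
    for _ in range(size):
        code, digit = divmod(code, 3)
        cells.append(digit - 1)
    return cells[::-1]                 # digits come out least significant first


empty = [0] * 9
print(encode_board(empty))             # 9841

board = [1, 0, -1, 0, 1, 0, 0, 0, -1]
assert decode_board(encode_board(board)) == board
```

Because every cell contributes its own base-3 digit, two boards share a code only if they are identical, which is the uniqueness the comment on `hash` refers to.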
Reinforcement learning has an explicit goal.