Machine Learning: An Analysis of the Q-Learning Algorithm for Grid World


A Q-Learning implementation based on the Grid World game, from an open-source project on GitHub.
GitHub URL: https://github.com/rlcode/reinforcement-learning/tree/master/1-grid-world/5-q-learning

Q-Learning

Q-Learning is a model-free reinforcement learning technique that can find an optimal action-selection policy for a (finite) MDP. It works by learning an action-value function and, given the current state, ultimately yields the expected best action under the optimal policy. One of its strengths is that it can compare the expected utility of the available actions without requiring a model of the environment, which is why it is called model-free.

The original Wikipedia passage:

Q-learning is a model-free reinforcement learning technique. Specifically, Q-learning can be used to find an optimal action-selection policy for any given (finite) Markov decision process (MDP). It works by learning an action-value function that ultimately gives the expected utility of taking a given action in a given state and following the optimal policy thereafter. A policy is a rule that the agent follows in selecting actions, given the state it is in. When such an action-value function is learned, the optimal policy can be constructed by simply selecting the action with the highest value in each state. One of the strengths of Q-learning is that it is able to compare the expected utility of the available actions without requiring a model of the environment. Additionally, Q-learning can handle problems with stochastic transitions and rewards, without requiring any adaptations. It has been proven that for any finite MDP, Q-learning eventually finds an optimal policy, in the sense that the expected value of the total reward return over all successive steps, starting from the current state, is the maximum achievable.[1]
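
For reference, the standard Q-Learning update rule that the code in the next section implements can be written as follows (α is the learning rate, γ the discount factor, r the reward, s' the next state, and a' ranges over the actions available in s'):

    Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]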

Code Implementation

The main difference between the Q-Learning implementation and the SARSA algorithm we analyzed earlier lies in its learning function:

    # update q function with sample <s, a, r, s'>
    def learn(self, state, action, reward, next_state):
        current_q = self.q_table[state][action]
        # using Bellman Optimality Equation to update q function
        new_q = reward + self.discount_factor * max(self.q_table[next_state])
        self.q_table[state][action] += self.learning_rate * (new_q - current_q)

In the Q-Learning learning function, the Bellman optimality equation is used: the maximum Q-value over the actions available in the next state is fed directly into the new target value (new_q), which is then blended into the q_table entry for the current state and action at the learning rate.
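
Below is a minimal sketch of how such a learn function is typically driven by an episode loop. The environment and agent method names (env.reset, env.step returning next_state, reward, done, and agent.get_action for epsilon-greedy selection) follow the Gym-style conventions used by examples like the linked repo, but they are assumptions for illustration, not verified code:

    # Hypothetical training loop; env/agent method names are assumptions.
    for episode in range(1000):
        state = env.reset()                      # start a new Grid World episode
        done = False
        while not done:
            # epsilon-greedy action selection from the current Q-table
            action = agent.get_action(str(state))
            # apply the action in the environment
            next_state, reward, done = env.step(action)
            # update the Q-table with the sample <s, a, r, s'>
            # (str(...) assumes the Q-table is keyed by the state's string form)
            agent.learn(str(state), action, reward, str(next_state))
            state = next_state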

The difference from SARSA is that SARSA's learning function takes one extra argument, the next action A, which is the action actually chosen by the current policy for the next step; Q-Learning instead uses the action with the maximum value in the next state. The upside is that learning can be faster; the downside is that Q-values can be overestimated, a problem that Double Q-Learning addresses.
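
For contrast, here is a sketch of what the SARSA counterpart of this learning function looks like: it receives the next action that was actually selected and uses that action's Q-value as the target, rather than the maximum over all actions. This follows the standard SARSA update; the exact code in the repo's SARSA example may differ in details:

    # update q function with sample <s, a, r, s', a'>
    def learn(self, state, action, reward, next_state, next_action):
        current_q = self.q_table[state][action]
        # on-policy target: the Q-value of the action actually taken next,
        # not the maximum over all actions as in Q-Learning
        new_q = reward + self.discount_factor * self.q_table[next_state][next_action]
        self.q_table[state][action] += self.learning_rate * (new_q - current_q)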

Variants of the Q-Learning Algorithm

DeepMind, acquired by Google, combined Q-Learning with deep learning into a new algorithm called deep reinforcement learning, or deep Q-networks (DQN), which can play some Atari 2600 games at the level of an expert human player.

The original Wikipedia passage:

A recent application of Q-learning to deep learning, by Google DeepMind, titled “deep reinforcement learning” or “deep Q-networks”, has been successful at playing some Atari 2600 games at expert human levels. Preliminary results were presented in 2014, with a paper published in February 2015 in Nature.[12]
