RL[0] - First Look
Structure
- Background
- Q-Learning with table
- Q-Learning with network
- Postscript
Background
RL is short for reinforcement learning, a subfield of machine learning. A rigorous definition reads as follows:
Reinforcement learning (RL) is an area of machine learning inspired by behaviourist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
My understanding is that RL is a search for an optimal solution: through various tricks, the computer learns to take the most advantageous action in situations where it has to act, such as playing a game.
Q-Learning with table
Q-learning is one branch of RL algorithms. The definition lifted from the wiki reads: Q-learning is a model-free reinforcement learning technique. Specifically, Q-learning can be used to find an optimal action-selection policy for any given (finite) Markov decision process (MDP).
I started out from Playing Atari with Deep Reinforcement Learning and Simple Reinforcement Learning with Tensorflow; the paper mainly covers DQN.
The paper describes the MDP and the Bellman equation in detail; a brief distillation:
At each state our agent can take one action out of a set of actions (A = {1, . . . , K}), and taking an action yields a corresponding reward. Which action to take depends on two factors:
- how good the successor state is
- the immediate reward the current action brings
The reward may be returned by the system or observed by the agent (just as a human plays a game by watching the screen). Choosing the current action means considering not only which action yields the largest immediate reward, but also which state that action leads to, because the successor state determines the future rewards. So we should choose the action a that satisfies
Q(s, a) = r + γ · max_a' Q(s', a')
where γ is a discount factor weighting the present against the future, s is the current state and s' is the state after the transition; this selection rule is the Bellman equation.
The mathematical model and framework for the whole environment, its actions, and its rewards is the MDP (Markov decision process).
If the system returns the reward of the current action, then as long as we know the optimal values of all subsequent steps, every step can simply follow the Bellman equation. The problem therefore reduces to computing the optimal Q value of every state, storing them all, and having the agent look them up at execution time.
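As an illustration (not from the paper: the tiny 2-state, 2-action MDP, its rewards, and the value of gamma below are all made up), repeatedly applying Bellman backups to a stored table could look like this:

import numpy as np

# hypothetical toy MDP with 2 states and 2 actions, purely for illustration
gamma = 0.9
reward = np.array([[0., 1.],    # reward[s, a]: immediate reward of taking action a in state s
                   [0., 0.]])
next_state = np.array([[0, 1],  # next_state[s, a]: state reached by taking action a in state s
                       [1, 1]])
q_table = np.zeros([2, 2])

# sweep Q(s, a) = r + gamma * max_a' Q(s', a') until the table stops changing
for _ in range(50):
    for s in range(2):
        for a in range(2):
            q_table[s, a] = reward[s, a] + gamma * np.max(q_table[next_state[s, a]])

print(q_table)  # at execution time the agent just picks argmax_a Q(s, a)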
Below we use FrozenLake as an example to see how the table of Q values is computed.
The examples in this series all come from Simple Reinforcement Learning with Tensorflow.
With OpenAI gym it is easy to simulate many toy games.
The FrozenLake environment consists of a 4x4 grid of blocks, each one either being the start block, the goal block, a safe frozen block, or a dangerous hole. The objective is to have an agent learn to navigate from the start to the goal without moving onto a hole. At any given time the agent can choose to move either up, down, left, or right.
Each state has 4 actions and there are 16 states in total, so the table is 16x4. Decision making in an MDP is partly random and partly under the control of the decision maker; this kind of exploration is usually called ε-greedy (strictly speaking, the code below explores by adding decaying random noise to the Q values, while the network version later uses true ε-greedy). I refactored the author's original variable names.
The code follows (comments starting with # are the original author's; comments marked '# #' or wrapped in triple quotes are mine):
# coding=utf-8
import numpy as np
import gym
env = gym.make('FrozenLake-v0')
q_table = np.zeros([env.observation_space.n, env.action_space.n])
# Set learning parameters
lr = .8
y = .95
num_episodes = 2000
rewards = []
for i in range(num_episodes):
    # Reset environment and get first new observation
    s = env.reset()
    reward_episode = 0
    game_over = False
    j = 0
    # The Q-Table learning algorithm
    while j < 99:
        j += 1
        # Choose an action by greedily (with noise) picking from Q table
        """
        randn draws from the standard normal distribution; as i grows the noise
        term shrinks, so randomness has less and less impact on the decision.
        At the very beginning the choice is essentially random.
        """
        action_to_be_taken = np.argmax(q_table[s, :] + np.random.randn(1, env.action_space.n) * (1. / (i + 1)))
        # Get new state and reward from environment
        new_state, reward, game_over, _ = env.step(action_to_be_taken)
        # Update Q-Table with new knowledge
        """
        This is the standard Q-learning update:
        Q(s, a) <- Q(s, a) + lr * (reward + y * max_a' Q(s', a') - Q(s, a))
        With lr = 1 it reduces exactly to the Bellman target reward + y * max_a' Q(s', a').
        Because max_a' Q(s', a') keeps changing during training, the table entry is
        updated incrementally on every step instead of being set once.
        """
        q_table[s, action_to_be_taken] = q_table[s, action_to_be_taken] + lr * (reward + y * np.max(q_table[new_state, :]) - q_table[s, action_to_be_taken])
        reward_episode += reward
        s = new_state
        if game_over:
            break
    rewards.append(reward_episode)
print("Score over time: " + str(sum(rewards) / num_episodes))
print("Final Q-Table Values")
print(q_table)
The original code runs 2000 episodes; I also tried 10000, 20000, and 30000, and with more episodes the values in the Q-table converge and become stable.
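As a quick sanity check (this snippet is not in the original post), a minimal sketch that replays the greedy policy read off the learned table, reusing env and q_table from the code above:

# # play 100 episodes greedily from the learned q_table (reuses env and q_table above)
wins = 0.
for _ in range(100):
    s = env.reset()
    game_over = False
    steps = 0
    reward = 0.
    while not game_over and steps < 99:
        s, reward, game_over, _ = env.step(np.argmax(q_table[s, :]))
        steps += 1
    wins += reward
print("greedy success rate: " + str(wins / 100))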
Q-Learning with network
Although the table approach is efficient, for real-world problems the table can be terrifyingly large and impossible to hold in memory. That suggests another idea: instead of storing every Q value in a table, approximate them — given the current state s, compute a Q value for each action, and the action with the largest Q value is the most advantageous choice.
In the FrozenLake example we represent the current state with a 1x16 one-hot input layer, and the output is the Q values of the 4 actions, so the weight matrix is 16x4. We train this matrix with tensorflow,
using the loss function
loss = Σ (Q_target − Q)²
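To make this concrete, here is a numpy-only sketch of what the forward pass of this tiny "network" computes (the random W here is just a stand-in for the weights tensorflow learns below):

import numpy as np

W = np.random.randn(16, 4) * 0.1           # hypothetical weights; tensorflow learns the real ones below
s = 5                                      # example state index
one_hot_state = np.identity(16)[s:s + 1]   # shape (1, 16): 1 at column s, 0 elsewhere
q_values = one_hot_state.dot(W)            # shape (1, 4): one Q value per action
best_action = np.argmax(q_values)          # the most advantageous action in state s
print(q_values, best_action)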
The code follows:
# coding=utf-8
import matplotlib.pyplot as plt
import numpy as np
import gym
import tensorflow as tf
env = gym.make('FrozenLake-v0')
tf.reset_default_graph()
# These lines establish the feed-forward part of the network used to choose actions
input_state = tf.placeholder(shape=[1, 16], dtype=tf.float32)
xavier_init = tf.contrib.layers.xavier_initializer()
W = tf.Variable(xavier_init([16, 4]))
q_out = tf.matmul(input_state, W)
predict = tf.argmax(q_out, 1)[0]
# Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
target_q = tf.placeholder(shape=[1, 4], dtype=tf.float32)
loss = tf.reduce_sum(tf.square(target_q - q_out))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
update_model = trainer.minimize(loss)
init = tf.global_variables_initializer()
# Set learning parameters
y = .99
e = 0.1
num_episodes = 2000
# create lists to contain total rewards and steps per episode
rewards = []
counts = []
with tf.Session() as sess:
    sess.run(init)
    for i in range(num_episodes):
        # Reset environment and get first new observation
        s = env.reset()
        reward_episode = 0
        d = False
        j = 0
        # The Q-Network
        # # cap the steps per episode so the agent cannot get stuck in an endless loop
        while j < 99:
            j += 1
            # Choose an action by greedily (with e chance of random action) from the Q-network
            # # np.identity(16)[s:s + 1] is a 1x16 one-hot row: 1 at index s, 0 elsewhere
            a, q_out_value = sess.run([predict, q_out], feed_dict={input_state: np.identity(16)[s:s + 1]})
            # # ε-greedy selection
            if np.random.rand(1) < e:
                a = env.action_space.sample()
            # Get new state and reward from environment
            new_state, r, d, _ = env.step(a)
            # Obtain the Q' values by feeding the new state through our network
            new_q = sess.run(q_out, feed_dict={input_state: np.identity(16)[new_state:new_state + 1]})
            # Obtain maxQ' and set our target value for chosen action.
            new_max_q = np.max(new_q)
            target_value = q_out_value
            # # only the chosen action's target changes; the other entries keep the current predictions
            target_value[0, a] = r + y * new_max_q
            # Train our network using target and predicted Q values
            _, W1 = sess.run([update_model, W], feed_dict={input_state: np.identity(16)[s:s + 1], target_q: target_value})
            reward_episode += r
            s = new_state
            if d:
                # Reduce chance of random action as we train the model.
                e = 1. / ((i // 50) + 10)
                break
        rewards.append(reward_episode)
        counts.append(j)
print(W1)
print("Percent of successful episodes: " + str(sum(rewards) / num_episodes) + "%")
plt.plot(rewards)
plt.plot(counts)
plt.show()
The network in this example is so simple that it runs fine on a CPU; around 750 episodes is enough to reach a decent score. Borrowing the plotted figures:
[figures: plots of reward per episode and steps per episode]
Postscript
This was a first look at RL. More interesting things are coming: for example, a human playing a game reacts to the images they see, so a convolutional neural network should extract image features instead of feeding in a one-hot-style array; the agent can cache past training transitions and sample random batches from them, which greatly strengthens training (experience replay); and there are variants such as Double DQN, which trains with two networks, and Dueling DQN, which separates a (advantage) and v (value) within one network.
Thanks to the Medium author for the painstaking explanations and to DeepMind for generously sharing their paper.