
【RLaI】Computing the optimal policy with the value iteration algorithm

2018-11-18  哪种生活可以永远很轻松

Problem

Example 4.3: Gambler’s Problem
A gambler has the opportunity to make bets on the outcomes of a sequence of coin flips. If the coin comes up heads, he wins as many dollars as he has staked on that flip; if it is tails, he loses his stake. The game ends when the gambler wins by reaching his goal of $100, or loses by running out of money. On each flip, the gambler must decide what portion of his capital to stake, in integer numbers of dollars. This problem can be formulated as an undiscounted, episodic, finite MDP. The state is the gambler’s capital, s ∈ {1, 2, ..., 99}, and the actions are stakes, a ∈ {0, 1, ..., min(s, 100 − s)}. The reward is zero on all transitions except those on which the gambler reaches his goal, when it is +1. The state-value function then gives the probability of winning from each state. A policy is a mapping from levels of capital to stakes. The optimal policy maximizes the probability of reaching the goal. Let p_h denote the probability of the coin coming up heads. If p_h is known, then the entire problem is known and it can be solved, for instance, by value iteration. Figure 4.3 shows the change in the value function over successive sweeps of value iteration, and the final policy found, for the case of p_h = 0.4. This policy is optimal, but not unique. In fact, there is a whole family of optimal policies, all corresponding to ties for the argmax action selection with respect to the optimal value function. Can you guess what the entire family looks like?


(Figure: Example 4.3, Gambler's Problem, screenshot from the book)

Problem formulation

This problem can be formulated as an undiscounted, episodic, finite MDP. The state is the gambler’s capital, s ∈ {1, 2, ..., 99}, and the actions are stakes, a ∈ {0, 1, ..., min(s, 100 − s)}. The reward is zero on all transitions except those on which the gambler reaches his goal, when it is +1.

Coin flips
The capital held is the state, and the probability of heads is 0.4.
The reward is always 0 until the goal (a capital of 100) is reached, when it is +1.
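
In symbols, a single flip from capital s with stake a has the following dynamics (this just restates the description above, with p_h = 0.4):

    \begin{aligned}
    &\text{with probability } p_h:     & s' &= s + a, & r &= \mathbf{1}[\,s + a = 100\,],\\
    &\text{with probability } 1 - p_h: & s' &= s - a, & r &= 0.
    \end{aligned}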

Python implementation

Data structure definition

Defining and recording the environment model
The environment model records all of the information about the environment: for each state, every possible action together with its reward and transition probability. A Python dict is used for this, keyed by state; the value is another dict mapping each action available in that state to its environment data.

dict({state: dict({    # the current state
     action : list[       # a possible action
                pvalue,                  # the value of (state, action), i.e. its q-value
                list[reward, state', probability],   # state' in the first outcome
                list[reward, state', probability]    # state' in the second outcome
               ]
     })}) 
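
The post does not show how this dict is filled in. A minimal sketch of one way to build it for the Gambler's Problem follows; the slot order matches the structure above, and listing the heads outcome first (and omitting stakes of 0) are assumptions of this sketch, not something stated in the original:

P_HEADS = 0.4
GOAL = 100

environment = {}
for state in range(1, GOAL):
    environment[state] = {}
    for action in range(1, min(state, GOAL - state) + 1):
        win = state + action                      # capital after a winning flip
        lose = state - action                     # capital after a losing flip
        reward_win = 1 if win == GOAL else 0      # +1 only when the goal is reached
        environment[state][action] = [
            0.0,                                  # pvalue: q(state, action), updated later
            [reward_win, win, P_HEADS],           # first outcome: heads
            [0, lose, 1 - P_HEADS],               # second outcome: tails
        ]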

The value iteration algorithm

Rough outline of the algorithm:

# Loop until the required precision is reached:
#     sweep over all states; for each state:
#         loop over every action available in that state
#               compute its q-value and update the environment
#               keep track of the largest q-value
#         update value[state] to that largest q-value
#        
#         record how much value[state] changed, so convergence can be checked
# The loop ends once the precision requirement is met
# Then sweep over all states once more
#         for each state, store every action that attains its value in the policy
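
The q-value computed in the inner loop is the one-step lookahead for this problem, and the state update takes the maximum over actions; written out (undiscounted, p_h = 0.4):

    q(s, a) = p_h \bigl[ r_{\text{win}} + V(s + a) \bigr] + (1 - p_h)\, V(s - a),
    \qquad
    V(s) \leftarrow \max_{a} q(s, a),

where r_win = 1 if s + a = 100 and 0 otherwise.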

Main implementation of the algorithm

def valueIteration():
    pre = precision      # precision tracker for the current sweep
    times = 0            # number of sweeps performed
    # keep sweeping and updating the value(state) array until the precision requirement is met
    while pre >= precision:
        times += 1
        # note: the precision tracker must be reset to 0 at the start of each sweep
        pre = 0
        for state in range(1, 100):
            v_old = value[state]
            # compute the q-value of every action still available in this state
            for action in actions(state):
                # some actions have already been pruned, so not every entry of actions(state) is valid
                if action not in environment[state]:
                    continue
                tmp = environment[state][action]
                # update environment[state][action][0], the q-value of (state, action)
                environment[state][action][0] = tmp[1][2]*(tmp[1][0] + value[tmp[1][1]]) + tmp[2][2]*(tmp[2][0] + value[tmp[2][1]])
            # take the largest q-value over all actions as the new value of this state
            max_value = 0
            for action in actions(state):
                if action not in environment[state]:
                    continue
                if environment[state][action][0] > max_value:
                    max_value = environment[state][action][0]
            value[state] = max_value
            # value[state] is now the maximum q-value over the available actions,
            # which is exactly the value-iteration update
            # keep only the actions that attain the maximum; delete the rest
            for action in actions(state):
                if action not in environment[state]:
                    continue
                if environment[state][action][0] != max_value:
                    del environment[state][action]
            # record how much this state's value changed, for the convergence check
            pre = max(pre, abs(value[state] - v_old))

    # for every state, store the actions that attain its value in the policy
    for state in range(1, 100):
        policy[state] = []
        for action in actions(state):
            if action not in environment[state]:
                continue
            if environment[state][action][0] == value[state]:
                policy[state].append(action)
  
    print("iteration times: ", times, "precision: ", precision)

In addition, plots are drawn to record the value of each state, how it evolves over sweeps, and the resulting policy.

    plt.figure(3)
    plt.plot(range(100), value[:-1])
    plt.grid()
    plt.title('state-value')

    # inside the value-iteration loop, the value array is recorded after every sweep,
    #     value_s_t.append(value.copy())
    # so that each state's value can be plotted against the sweep number
    plt.figure(2)
    for i in [80, 90, 97, 98, 99, 100]:
        plt.plot([t for t in range(times)], [value_s_t[t][i] for t in range(times)], label='state' + str(i) + "'s value")
    plt.legend()
    plt.title("state's value for each time step")

    for state in range(1, 100):
        policy[state] = []
        for action in actions(state):
            if action not in environment[state]:
                continue
            if environment[state][action][0] == value[state]:
                policy[state].append(action)
                plt.figure(1)
                plt.scatter(state, action)
                plt.title("state-policy")

Results

(Figure 4.3 as given in the book)


With the algorithm and model in place, the program was run with several precision settings.

1st precision = 0.000000000001

fig3 value(state)
fig2 state's value
fig1 state's policy
Number of iterations

2nd precision = 0.01

3rd precision = 0.1

4th precision = 1
