
A First Look at d3rlpy's MDPDataset

2022-06-11  吃醋不吃辣的雷儿

d3rlpy: An offline deep reinforcement learning library
d3rlpy is an offline deep reinforcement learning library for practitioners and researchers.



website: https://takuseno.github.io/d3rlpy/
code: https://github.com/takuseno/d3rlpy
paper: https://arxiv.org/abs/2111.03788
docs: https://d3rlpy.readthedocs.io/en/v1.1.0/
I strongly recommend reading the docs; they are among the best documentation I have ever read.

Install

pip install d3rlpy

d3rlpy provides state-of-the-art offline deep reinforcement learning algorithms through out-of-the-box scikit-learn-style APIs. Unlike other RL libraries, the algorithms it provides can reach extremely strong performance, beyond their original papers, with only a few tweaks.
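
As a quick taste of that API, here is a minimal sketch, assuming d3rlpy v1.x; DiscreteCQL and the built-in cartpole dataset (introduced below) are just one possible choice:

from d3rlpy.algos import DiscreteCQL
from d3rlpy.datasets import get_cartpole

# load a built-in dataset together with its Gym environment
dataset, env = get_cartpole()

# scikit-learn style: construct the algorithm object, then call fit()
cql = DiscreteCQL(use_gpu=False)
cql.fit(dataset, n_epochs=1)

# predict() takes a batch of observations, like a scikit-learn estimator
actions = cql.predict(dataset.observations[:10])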

Google Colaboratory

This tutorial is also available on Google Colaboratory
The project provides an official Jupyter Notebook tutorial on Google Colab, so you can try d3rlpy without setting up any environment. It is what I used myself.


Custom datasets: MDPDataset

Here I want to focus on the dataset part. The datasets shipped with the library:


(screenshot: the list of built-in datasets in the docs)

As you can see, the library ships datasets for the cartpole, pendulum, Atari, and d4rl environments. These are clearly not enough for us, though: for any task outside those environments, we have to build a custom dataset before we can use the algorithms in d3rlpy.
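
For reference, each of these can be pulled with a one-line loader. A sketch based on the v1.x docs; the Atari and d4rl loaders additionally require the d4rl-atari and d4rl packages, and the dataset names below are only examples:

from d3rlpy.datasets import get_cartpole, get_pendulum, get_atari, get_d4rl

# each loader returns a (MDPDataset, gym.Env) pair
dataset, env = get_cartpole()                    # CartPole replay data
dataset, env = get_pendulum()                    # Pendulum replay data
dataset, env = get_atari('breakout-expert-v0')   # needs d4rl-atari
dataset, env = get_d4rl('hopper-medium-v0')      # needs d4rl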


MDPDataset
The library provides the MDPDataset class as the interface for custom datasets, so let's look at MDPDataset in detail. Example code:
import numpy as np
from d3rlpy.dataset import MDPDataset

# 1000 steps of observations with shape of (100,)
observations = np.random.random((1000, 100))
# 1000 steps of actions with shape of (4,)
actions = np.random.random((1000, 4))
# 1000 steps of rewards
rewards = np.random.random(1000)
# 1000 steps of terminal flags
terminals = np.random.randint(2, size=1000)

dataset = MDPDataset(observations, actions, rewards, terminals)

# automatically split into d3rlpy.dataset.Episode objects
dataset.episodes

# each episode is also split into d3rlpy.dataset.Transition objects
episode = dataset.episodes[0]
episode[0].observation
episode[0].action
episode[0].reward
episode[0].next_observation
episode[0].terminal

# d3rlpy.dataset.Transition objects hold pointers to the previous and next
# transitions, like a linked list.
transition = episode[0]
while transition.next_transition:
    transition = transition.next_transition

# save as HDF5
dataset.dump('dataset.h5')

# load from HDF5
new_dataset = MDPDataset.load('dataset.h5')

From this code we can see that once we feed observations, actions, rewards, and terminals to MDPDataset, the resulting dataset is ready to use. But these two comments were not entirely clear to me at first:
automatically split into d3rlpy.dataset.Episode objects
each episode is also split into d3rlpy.dataset.Transition objects
So I defined some data of my own in the tutorial and ran it.

import numpy as np
from d3rlpy.dataset import MDPDataset
observations = np.random.random((5, 4))
actions = np.random.random((5, 2))
rewards = np.random.random(5)
#terminal_flags = np.random.randint(2, size=5)
terminal_flags = np.array([0, 0, 0, 1, 0])
print("************************")
print(observations, "\n", actions, "\n", rewards, "\n", terminal_flags)

dataset = MDPDataset(observations, actions, rewards, terminal_flags)

# automatically splitted into d3rlpy.dataset.Episode objects
dataset.episodes

# each episode is also splitted into d3rlpy.dataset.Transition objects
episode = dataset.episodes[0]
print("************************")
print(episode[0].observation, episode[0].action, episode[0].reward, episode[0].next_observation, episode[0].terminal)

First, why did I use
terminal_flags = np.array([0, 0, 0, 1, 0])
rather than
terminal_flags = np.random.randint(2, size=5)
The reason is simple: I got the following error:


(screenshot of the error: IndexError: list index out of range)

This was a rather confusing error: indexing dataset.episodes[0] was somehow out of range. When I printed dataset.episodes I found it was an empty list [], and looking back at the randomly drawn terminal_flags above, [0 0 0 0 0], the confusion resolved itself: because the terminal flag is always 0, the first episode never ends, so there is no complete episode to collect and the list stays empty. Setting terminal_flags = np.array([0, 0, 0, 1, 0]) confirms the guess: len(dataset.episodes) is now 1. With terminal_flags fixed, running the code again gives:

************************
[[0.24846654 0.21551815 0.131713   0.87425005]
 [0.8034738  0.4311574  0.88165318 0.11492858]
 [0.37038025 0.48792868 0.53824753 0.83126296]
 [0.83917952 0.60787212 0.17332436 0.88745315]
 [0.185564   0.5873282  0.1410559  0.38302   ]] 
 [[0.49543267 0.28684972]
 [0.35513441 0.40191936]
 [0.5038486  0.44437554]
 [0.50707039 0.04350966]
 [0.19581844 0.38760234]] 
 [0.70419544 0.98312524 0.40066976 0.95934234 0.64369036] 
 [0 0 0 1 0]
************************
[0.24846654 0.21551815 0.131713   0.87425005] [0.49543267 0.2868497 ] 0.7041954398155212 [0.8034738  0.4311574  0.8816532  0.11492857] 0.0

The first part is as expected:
observations: 5x4
actions: 5x2
rewards: 5
terminal_flags: 5
But the last line puzzled me: why does episode[0].observation have shape (4,)? With terminal_flags defined as [0 0 0 1 0], the first episode ends after 4 steps, so episode[0] should, I assumed, contain 4 observations. Reading the code again, though:
episode = dataset.episodes[0]
print(episode[0].observation)
That is, episode is the first element of dataset.episodes, and what we print is episode[0].observation, the first element of that first episode. So what is an element of an episode under this definition? My guess was a single step, so I wrote another snippet to check.

import numpy as np
from d3rlpy.dataset import MDPDataset
observations = np.random.random((5, 4))
actions = np.random.random((5, 2))
rewards = np.random.random(5)
#terminal_flags = np.random.randint(2, size=5)
terminal_flags = np.array([0, 0, 0, 1, 0])
print("************************")
print(observations, "\n", actions, "\n", rewards, "\n", terminal_flags)

dataset = MDPDataset(observations, actions, rewards, terminal_flags)

# save as HDF5
dataset.dump('dataset.h5')

# load from HDF5
new_dataset = MDPDataset.load('dataset.h5')
#dataset.episodes
print(len(new_dataset.episodes))
episode = new_dataset.episodes[0]
print(episode)
print("************************")
for transition in episode:
  print("transition:", transition.observation, transition.action, transition.reward, transition.terminal, transition.next_observation, transition.prev_transition, transition.next_transition)

Output:

************************
[[0.92753347 0.50440883 0.53963388 0.95715031]
 [0.2850818  0.52517806 0.78821108 0.71171217]
 [0.29113115 0.5574677  0.72235792 0.70583846]
 [0.08542663 0.4999773  0.67266474 0.69565128]
 [0.21867826 0.04446595 0.84564964 0.06929721]] 
 [[0.33829284 0.99447056]
 [0.84802802 0.67714815]
 [0.13405515 0.010628  ]
 [0.86649034 0.45765866]
 [0.5213394  0.32108906]] 
 [0.13738827 0.3193022  0.59849716 0.03844971 0.62750793] 
 [0 0 0 1 0]
1
<d3rlpy.dataset.Episode object at 0x7efdbc9b2e10>
************************
transition: [0.92753345 0.50440884 0.53963387 0.9571503 ] [0.33829284 0.99447054] 0.13738827407360077 0.0 [0.2850818  0.5251781  0.78821105 0.7117122 ] None <d3rlpy.dataset.Transition object at 0x7efdbc90f4d0>
transition: [0.2850818  0.5251781  0.78821105 0.7117122 ] [0.848028   0.67714816] 0.31930220127105713 0.0 [0.29113114 0.5574677  0.7223579  0.70583844] <d3rlpy.dataset.Transition object at 0x7efdbc914b50> <d3rlpy.dataset.Transition object at 0x7efdbc90ff50>
transition: [0.29113114 0.5574677  0.7223579  0.70583844] [0.13405515 0.010628  ] 0.5984971523284912 0.0 [0.08542663 0.4999773  0.67266476 0.6956513 ] <d3rlpy.dataset.Transition object at 0x7efdbc90f4d0> <d3rlpy.dataset.Transition object at 0x7efdbc90f350>
transition: [0.08542663 0.4999773  0.67266476 0.6956513 ] [0.86649036 0.45765865] 0.03844970837235451 1.0 [0. 0. 0. 0.] <d3rlpy.dataset.Transition object at 0x7efdbc90ff50> None

This settles it: a training run consists of many episodes, and each episode consists of many steps.
A d3rlpy.dataset.MDPDataset is composed of d3rlpy.dataset.Episode objects, and each episode is composed of d3rlpy.dataset.Transition objects. Note also that only four transitions were printed: the fifth step, which follows the terminal at index 3 and never reaches a terminal of its own, does not form a complete episode and is dropped, consistent with the empty-list behavior we saw earlier. A d3rlpy.dataset.Transition is much like a step in online reinforcement learning, just slightly different; intuitively, wrapping each transition in an object feels like a very sensible design, cleaner than gym.Env's step() method.
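
One related detail that goes slightly beyond this tutorial (based on the v1.x docs, so treat it as an aside): episodes are cut only where terminals is 1, so a trajectory that was truncated by a timeout rather than a real termination would be glued onto the next one. For that case MDPDataset accepts an optional episode_terminals argument; a minimal sketch:

import numpy as np
from d3rlpy.dataset import MDPDataset

observations = np.random.random((5, 4))
actions = np.random.random((5, 2))
rewards = np.random.random(5)

# no real environment termination anywhere...
terminals = np.array([0, 0, 0, 0, 0])
# ...but the trajectory was cut off (e.g. by a timeout) after step 4
episode_terminals = np.array([0, 0, 0, 1, 0])

dataset = MDPDataset(observations, actions, rewards, terminals,
                     episode_terminals=episode_terminals)

# the first episode now exists even though terminals is all zeros
print(len(dataset.episodes))  # 1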

Summary

Dataset matters.
d3rlpy covers the offline training and evaluation side, but we still need online scenarios to test the trained model, plus a GUI to inspect it intuitively. The online part varies from project to project; build it around whatever environment you need. I am curious whether d3rlpy offers anything for the GUI part; for now I will keep reading and writing code to connect my own needs with offline RL.
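
For the online-testing part, at least, d3rlpy itself ships a scorer that rolls a trained policy out in a live Gym environment. A minimal sketch, assuming d3rlpy v1.x and reusing the cartpole dataset and DiscreteCQL from earlier:

from d3rlpy.algos import DiscreteCQL
from d3rlpy.datasets import get_cartpole
from d3rlpy.metrics.scorer import evaluate_on_environment

dataset, env = get_cartpole()

cql = DiscreteCQL(use_gpu=False)
cql.fit(dataset, n_epochs=1)

# run 10 evaluation episodes in the live environment and average the returns
scorer = evaluate_on_environment(env, n_trials=10)
print(scorer(cql))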
