A First Look at d3rlpy's MDPDataset
d3rlpy: An offline deep reinforcement learning library
d3rlpy is an offline deep reinforcement learning library for practitioners and researchers.
d3rlpy
website: https://takuseno.github.io/d3rlpy/
code: https://github.com/takuseno/d3rlpy
paper: https://arxiv.org/abs/2111.03788
docs: https://d3rlpy.readthedocs.io/en/v1.1.0/
I strongly recommend reading the documentation; it is one of the best docs I have ever read.
Install
pip install d3rlpy
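Note that this post follows the v1.1.0 documentation linked above; the dataset API was reworked in the later 2.x releases, so if you want to reproduce the snippets here verbatim it may be safest to pin the version:
pip install d3rlpy==1.1.0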
d3rlpy provides state-of-the-art offline deep reinforcement learning algorithms through an out-of-the-box scikit-learn-style API. Unlike other RL libraries, the provided algorithms can reach extremely strong performance, beyond what their papers report, with just a few tweaks.
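To make the scikit-learn flavor concrete, here is a minimal sketch of the v1.x training loop as I understand it from the 1.1.0 docs (the algorithm choice and epoch count are arbitrary placeholders):

import d3rlpy

# grab one of the built-in datasets (more on datasets below)
dataset, env = d3rlpy.datasets.get_pendulum()
# instantiate an algorithm, scikit-learn style
cql = d3rlpy.algos.CQL(use_gpu=False)
# train offline on the logged data
cql.fit(dataset, n_epochs=1)
# predict actions for a batch of observations
print(cql.predict(dataset.observations[:10]))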
Google Colaboratory
This tutorial is also available on Google Colaboratory
The authors provide a Jupyter Notebook tutorial on Google Colab, so you can use d3rlpy without configuring any environment. This is what I used myself.
Custom datasets: MDPDataset
Here I want to focus on the dataset part. The officially provided datasets:
Datasets
As you can see, the library ships datasets for the cartpole, pendulum, atari, and d4rl environments. These are clearly not enough for us: for tasks beyond those environments, if we want to use the algorithms in d3rlpy, we need to build our own datasets.
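For reference, the built-in datasets are one call away; a minimal sketch assuming the v1.x d3rlpy.datasets helpers:

from d3rlpy.datasets import get_cartpole

# downloads (and caches) logged CartPole data plus the matching gym environment
dataset, env = get_cartpole()
print(len(dataset.episodes))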
MDPDataset
The library provides the MDPDataset class as the interface for custom datasets, so let's look at it closely. Example code:
import numpy as np
from d3rlpy.dataset import MDPDataset
# 1000 steps of observations with shape of (100,)
observations = np.random.random((1000, 100))
# 1000 steps of actions with shape of (4,)
actions = np.random.random((1000, 4))
# 1000 steps of rewards
rewards = np.random.random(1000)
# 1000 steps of terminal flags
terminals = np.random.randint(2, size=1000)
dataset = MDPDataset(observations, actions, rewards, terminals)
# automatically splitted into d3rlpy.dataset.Episode objects
dataset.episodes
# each episode is also splitted into d3rlpy.dataset.Transition objects
episode = dataset.episodes[0]
episode[0].observation
episode[0].action
episode[0].reward
episode[0].next_observation
episode[0].terminal
# d3rlpy.dataset.Transition object has pointers to previous and next
# transitions like linked list.
transition = episode[0]
while transition.next_transition:
    transition = transition.next_transition
# save as HDF5
dataset.dump('dataset.h5')
# load from HDF5
new_dataset = MDPDataset.load('dataset.h5')
From this code we can see that once we feed observations, actions, rewards, and terminals to MDPDataset, the resulting dataset can be used directly. But at first I didn't fully understand these two comments:
automatically splitted into d3rlpy.dataset.Episode objects
each episode is also splitted into d3rlpy.dataset.Transition objects
So I ran the tutorial notebook with some custom data of my own.
import numpy as np
from d3rlpy.dataset import MDPDataset
observations = np.random.random((5, 4))
actions = np.random.random((5, 2))
rewards = np.random.random(5)
#terminal_flags = np.random.randint(2, size=5)
terminal_flags = np.array([0, 0, 0, 1, 0])
print("************************")
print(observations, "\n", actions, "\n", rewards, "\n", terminal_flags)
dataset = MDPDataset(observations, actions, rewards, terminal_flags)
# automatically splitted into d3rlpy.dataset.Episode objects
dataset.episodes
# each episode is also splitted into d3rlpy.dataset.Transition objects
episode = dataset.episodes[0]
print("************************")
print(episode[0].observation, episode[0].action, episode[0].reward, episode[0].next_observation, episode[0].terminal)
First, let me explain why I used
terminal_flags = np.array([0, 0, 0, 1, 0])
instead of
terminal_flags = np.random.randint(2, size=5)
The reason is simple: I got an error like this:
[error screenshot: dataset.episodes[0] raises list index out of range]
This was quite puzzling: how can dataset.episodes[0] be out of range? I then printed dataset.episodes and found it was the empty list []. Looking back at the terminal_flags above, which came out as [0 0 0 0 0], the confusion resolved itself: with terminal_flag always 0, the first episode never ends, so there is no complete first episode to extract and the list stays empty. To verify this guess I set terminal_flags = np.array([0, 0, 0, 1, 0]), and indeed len(dataset.episodes) became 1. QED. With terminal_flags fixed, running the code again gives:
************************
[[0.24846654 0.21551815 0.131713 0.87425005]
[0.8034738 0.4311574 0.88165318 0.11492858]
[0.37038025 0.48792868 0.53824753 0.83126296]
[0.83917952 0.60787212 0.17332436 0.88745315]
[0.185564 0.5873282 0.1410559 0.38302 ]]
[[0.49543267 0.28684972]
[0.35513441 0.40191936]
[0.5038486 0.44437554]
[0.50707039 0.04350966]
[0.19581844 0.38760234]]
[0.70419544 0.98312524 0.40066976 0.95934234 0.64369036]
[0 0 0 1 0]
************************
[0.24846654 0.21551815 0.131713 0.87425005] [0.49543267 0.2868497 ] 0.7041954398155212 [0.8034738 0.4311574 0.8816532 0.11492857] 0.0
The first part is as expected:
observations: 5x4
actions: 5x2
rewards: 5
terminal_flags: 5
But the later part puzzled me: why does episode[0].observation have shape (4,)? By my definition of terminal_flags as [0 0 0 1 0], the first episode ends after 4 steps, so episode[0] should supposedly hold 4 observations. But when I reread the code carefully, I noticed
episode = dataset.episodes[0]
print(episode[0].observation)
In other words, episode is the first element of dataset.episodes, while what we print is episode[0].observation, i.e. the first element of the first episode. What is an element under this indexing? My guess was a single step, so I wrote another snippet to check.
import numpy as np
from d3rlpy.dataset import MDPDataset
observations = np.random.random((5, 4))
actions = np.random.random((5, 2))
rewards = np.random.random(5)
#terminal_flags = np.random.randint(2, size=5)
terminal_flags = np.array([0, 0, 0, 1, 0])
print("************************")
print(observations, "\n", actions, "\n", rewards, "\n", terminal_flags)
dataset = MDPDataset(observations, actions, rewards, terminal_flags)
# save as HDF5
dataset.dump('dataset.h5')
# load from HDF5
new_dataset = MDPDataset.load('dataset.h5')
#dataset.episodes
print(len(new_dataset.episodes))
episode = new_dataset.episodes[0]
print(episode)
print("************************")
for transition in episode:
    print("transition:", transition.observation, transition.action,
          transition.reward, transition.terminal, transition.next_observation,
          transition.prev_transition, transition.next_transition)
Output:
************************
[[0.92753347 0.50440883 0.53963388 0.95715031]
[0.2850818 0.52517806 0.78821108 0.71171217]
[0.29113115 0.5574677 0.72235792 0.70583846]
[0.08542663 0.4999773 0.67266474 0.69565128]
[0.21867826 0.04446595 0.84564964 0.06929721]]
[[0.33829284 0.99447056]
[0.84802802 0.67714815]
[0.13405515 0.010628 ]
[0.86649034 0.45765866]
[0.5213394 0.32108906]]
[0.13738827 0.3193022 0.59849716 0.03844971 0.62750793]
[0 0 0 1 0]
1
<d3rlpy.dataset.Episode object at 0x7efdbc9b2e10>
************************
transition: [0.92753345 0.50440884 0.53963387 0.9571503 ] [0.33829284 0.99447054] 0.13738827407360077 0.0 [0.2850818 0.5251781 0.78821105 0.7117122 ] None <d3rlpy.dataset.Transition object at 0x7efdbc90f4d0>
transition: [0.2850818 0.5251781 0.78821105 0.7117122 ] [0.848028 0.67714816] 0.31930220127105713 0.0 [0.29113114 0.5574677 0.7223579 0.70583844] <d3rlpy.dataset.Transition object at 0x7efdbc914b50> <d3rlpy.dataset.Transition object at 0x7efdbc90ff50>
transition: [0.29113114 0.5574677 0.7223579 0.70583844] [0.13405515 0.010628 ] 0.5984971523284912 0.0 [0.08542663 0.4999773 0.67266476 0.6956513 ] <d3rlpy.dataset.Transition object at 0x7efdbc90f4d0> <d3rlpy.dataset.Transition object at 0x7efdbc90f350>
transition: [0.08542663 0.4999773 0.67266476 0.6956513 ] [0.86649036 0.45765865] 0.03844970837235451 1.0 [0. 0. 0. 0.] <d3rlpy.dataset.Transition object at 0x7efdbc90ff50> None
At this point everything became clear: a training run contains many episodes, and each episode contains many steps.
A d3rlpy.dataset.MDPDataset is composed of d3rlpy.dataset.Episode objects, and each episode is composed of d3rlpy.dataset.Transition objects. A d3rlpy.dataset.Transition is very similar to a step in online reinforcement learning, but slightly different. Intuitively, wrapping each transition into an object feels like a very reasonable design, arguably cleaner than the step() method of gym.Env.
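One follow-up note: in the runs above, the trailing fifth step (terminal 0 after the episode ends at index 3) is silently dropped, because it never forms a complete episode. According to the d3rlpy 1.1.0 docs, MDPDataset also accepts an episode_terminals argument to mark episode boundaries that are not real environment terminations (e.g. timeouts or truncated logs). A minimal sketch under that assumption:

import numpy as np
from d3rlpy.dataset import MDPDataset

observations = np.random.random((6, 4))
actions = np.random.random((6, 2))
rewards = np.random.random(6)
# the environment truly terminates only at step 3
terminals = np.array([0, 0, 0, 1, 0, 0])
# ...but the log is also cut off after step 5 (a timeout, not a real terminal)
episode_terminals = np.array([0, 0, 0, 1, 0, 1])
dataset = MDPDataset(observations, actions, rewards, terminals,
                     episode_terminals=episode_terminals)
print(len(dataset.episodes))  # 2: one terminated episode, one truncated one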
Summary
Dataset matters.
d3rlpy covers the offline training and evaluation side, but we still need online scenarios to test the trained models, plus a GUI to inspect them intuitively. The online part varies from person to person; just build it around the environment you need. I am curious whether d3rlpy offers any solution for the GUI part; for now I will keep reading and writing code to connect my own requirements with offline RL.
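For the online-testing part, d3rlpy v1.x itself ships a scorer that rolls a policy out in a gym-style environment. A minimal sketch, assuming evaluate_on_environment from the d3rlpy 1.1.0 metrics docs (the environment name and the untrained CQL instance are placeholders; in practice you would evaluate a model you trained or loaded):

import gym
import d3rlpy
from d3rlpy.metrics.scorer import evaluate_on_environment

# any gym-style environment matching the dataset's spaces (placeholder name)
env = gym.make('Pendulum-v0')
# placeholder algorithm; train it first or load saved parameters
algo = d3rlpy.algos.CQL()
algo.build_with_env(env)  # build networks from the env's observation/action spaces
# run n_trials episodes online and return the average episode return
scorer = evaluate_on_environment(env, n_trials=10)
print(scorer(algo))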