Offline RL

2021-06-25  臻甄

1. Overview

Reference: https://zhuanlan.zhihu.com/p/341502874

1.1 Offline RL at a glance

Reference: Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
Authors' videos: Video 1, Video 2


In the figure referenced above, panels (a) and (b) show online RL and off-policy RL respectively; the difference is whether there is a replay buffer. Panel (c) shows offline RL: a behavior policy samples a batch of data into a buffer, the policy is trained entirely offline from that buffer, and the agent only interacts with the environment again at test time.
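A minimal Python sketch of this setup, just to make the data flow concrete (the `env`, `behavior_policy`, and `policy` interfaces are assumptions of mine, not from the tutorial): training reads only from the fixed buffer, and the environment is touched again only at evaluation time.

```python
import random

# Minimal sketch of the offline RL setting; env/policy interfaces are illustrative.

def collect_dataset(env, behavior_policy, num_steps):
    """The behavior policy is run once to build the static dataset D."""
    D, s = [], env.reset()
    for _ in range(num_steps):
        a = behavior_policy(s)
        s_next, r, done = env.step(a)
        D.append((s, a, r, s_next, done))
        s = env.reset() if done else s_next
    return D

def train_offline(D, policy, num_updates, batch_size=256):
    """Offline training: samples only from the fixed dataset, never touches the env."""
    for _ in range(num_updates):
        batch = random.sample(D, batch_size)
        policy.update(batch)  # e.g. a Q-learning or actor-critic update on the batch
    return policy

def evaluate(env, policy, episodes=10):
    """Only at test time does the learned policy interact with the environment."""
    returns = []
    for _ in range(episodes):
        s, done, total = env.reset(), False, 0.0
        while not done:
            s, r, done = env.step(policy.act(s))
            total += r
        returns.append(total)
    return sum(returns) / len(returns)
```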

1.2 Overview of the offline RL problem

Any off-policy RL method can be used for offline RL. There are four families of methods:
(1) Policy gradient
(2) Approximate dynamic programming: the general approach of fitting a Q-function (all variants rely on the Bellman equation), including Q-learning (see the fitted objective below)
(3) Actor-critic algorithms: combine policy gradients with approximate dynamic programming
(4) Model-based RL: estimates the transition function T (① learn T only and plan with it, ② also learn a policy, ③ use the model to augment the dataset)
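
For reference, the objective fitted by family (2) on the offline dataset $\mathcal{D}$, written in standard notation (a generic fitted Q-learning form, not taken from a specific paper):

$$
\min_{Q}\;\mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}\Big[\big(Q(s,a)-r-\gamma\max_{a'}Q(s',a')\big)^{2}\Big]
$$

In deep RL implementations the Q inside the max is usually a lagged target network.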

1.3 Offline RL methods

Method 1: Offline RL and off-policy evaluation via importance sampling

Method 2: Offline RL via dynamic programming

Two steps: step 1, learn a Q-function from the dataset; step 2, improve the policy using that Q-function (see the sketch below).
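
A rough NumPy sketch of the two steps under simplifying assumptions (small discrete state and action spaces, tabular Q; names and hyperparameters are illustrative, not from the reference):

```python
import numpy as np

def fit_q_from_dataset(dataset, num_states, num_actions, gamma=0.99, iters=100, lr=0.5):
    """Step 1: fit a Q-function to the fixed dataset by repeated Bellman backups."""
    Q = np.zeros((num_states, num_actions))
    for _ in range(iters):
        for s, a, r, s_next, done in dataset:
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += lr * (target - Q[s, a])  # move Q(s, a) toward the Bellman target
    return Q

def greedy_policy(Q):
    """Step 2: policy improvement by acting greedily w.r.t. the learned Q."""
    return Q.argmax(axis=1)  # one action index per state
```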

Method 3: Model-based offline RL (Offline Model-Based RL)


2. Offline Reinforcement Learning NeurIPS 2020 Tutorial

Aviral Kumar, Sergey Levine
UC Berkeley
video

2.1

Machine learning works because of large-scale data: it can recognize images and speech and do translation.
Reinforcement learning, by contrast, needs its dataset to be collected and updated online in real time. Can we develop data-driven RL methods instead?


(Figure: tutorial outline)

What offline RL requires:
(1) A good dataset that contains both good and bad action trajectories
(2) Generalization ability
(3) Stitching: useful sub-trajectories can be combined, e.g. from a path from A to B and a path from B to C we can stitch together a path from A to C.

A case study:



Training used the Offline QT-Opt and Fine-tuned QT-Opt algorithms.
Point: adding online fine-tuning on top of purely offline RL (pure offline dataset), with an online dataset that can be about 10x smaller than the offline one, yields a much higher success rate than offline-only training (87% -> 96%).

Why is offline RL so hard?
(1) Possible overfitting? Experiments show that dataset size has little effect on the HalfCheetah results and they do not look like overfitting, but the smaller the dataset, the more easily the Q-function is overestimated.
(2) The training data may simply not be good enough.
(3) Distribution shift: the behavior policy behind the dataset and the learned policy are not the same (see the note below).
(4) Sampling & function-approximation error: already present in online RL, and worse in offline RL.
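
One common way to see how (1) and (3) interact, in standard notation (not a specific formula from the slides): the Bellman target evaluates the learned Q at actions chosen by the new policy, which can lie outside the support of the behavior policy $\pi_\beta$ that produced the data, so approximation errors at those out-of-distribution actions are never corrected by data and show up as overestimation:

$$
y(s,a) \;=\; r(s,a) + \gamma\, \mathbb{E}_{a'\sim\pi_{\mathrm{new}}(\cdot\mid s')}\big[Q(s',a')\big],
\qquad \pi_{\mathrm{new}} \neq \pi_{\beta}
$$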

Offline RL with policy gradient
(1) Use importance sampling (the standard estimator is sketched below)
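
The basic (trajectory-wise) importance-sampling estimator re-weights the returns of trajectories collected by the behavior policy $\pi_\beta$ so that they estimate the value of the learned policy $\pi_\theta$ (standard form, not specific to one paper):

$$
J(\pi_\theta)\;\approx\;\frac{1}{N}\sum_{i=1}^{N}\left(\prod_{t=0}^{H}\frac{\pi_\theta(a_t^i\mid s_t^i)}{\pi_\beta(a_t^i\mid s_t^i)}\right)\sum_{t=0}^{H}\gamma^{t}\,r_t^{i}
$$

The product of per-step ratios makes the variance grow quickly with the horizon H, which is why weighted, per-decision, and doubly-robust variants are used in practice.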

An Optimistic Perspective on Offline Reinforcement Learning

Paper: https://arxiv.org/abs/1907.04543
Code: https://github.com/google-research/batch_rl
Chinese write-up: https://www.linkresearcher.com/theses/14edb429-a231-4009-a0f5-70b7712300d7

motivation:

  1. The agent normally interacts with an online environment, which limits online RL's applicability to complex real-world problems (data collection is expensive, or a high-fidelity simulator is required).
  2. Offline RL can enable better generalization by incorporating diverse prior experiences.

contribution:

  1. An offline RL setup is proposed for evaluating algorithms on Atari 2600 games.
  2. Show that recent off-policy RL algorithms trained solely on offline data can be successful; success is attributed to the size and diversity of the offline dataset as well as the choice of RL algorithm.
  3. Present Random Ensemble Mixture (REM), which outperforms offline QR-DQN.


(Figure: different DQN variants)

problem:

  1. Using a fixed dataset of experiences isolates an RL algorithm's ability to exploit experience and generalize from its ability to explore effectively (it separates exploitation from exploration).
  2. Training is done without correcting for distribution mismatch: the mismatch between the current policy and the offline data-collection policy makes returns hard to estimate.

algorithm:

  1. Ensembling is used to improve generalization in the offline setting.
  2. Random Ensemble Mixture (REM): applies ensembling over an exponential number of Q-value estimates in a computationally efficient way.
    (1) First, multiple parameterized Q-functions are used to estimate Q-values.
    (2) Key point: a convex combination of several Q-value estimates can itself be treated as a Q-value estimate: train a family of Q-function approximators defined by mixing probabilities on a (K − 1)-simplex (see the sketch below).
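
A rough PyTorch-style sketch of that key point (my own simplified rendering of the idea, not the official google-research/batch_rl code; the tensor shapes, per-mini-batch sampling of the mixture weights, and the Huber loss are assumptions):

```python
import torch
import torch.nn.functional as F

def rem_loss(q_heads, q_heads_target, actions, rewards, dones, gamma=0.99):
    """q_heads, q_heads_target: [batch, K, num_actions]; actions: long tensor [batch];
    rewards, dones: float tensors [batch]."""
    batch, K, num_actions = q_heads.shape
    # Draw mixing weights alpha on the (K-1)-simplex (uniform samples, then normalize).
    alpha = torch.rand(K)
    alpha = alpha / alpha.sum()
    # A convex combination of the K Q-estimates is itself treated as a Q-estimate.
    q_mix = (q_heads * alpha.view(1, K, 1)).sum(dim=1)                # [batch, num_actions]
    q_mix_target = (q_heads_target * alpha.view(1, K, 1)).sum(dim=1)  # [batch, num_actions]
    q_sa = q_mix.gather(1, actions.view(-1, 1)).squeeze(1)            # Q(s, a) of taken actions
    td_target = rewards + gamma * (1.0 - dones) * q_mix_target.max(dim=1).values
    return F.smooth_l1_loss(q_sa, td_target.detach())                 # Huber TD loss
```

The target heads and the detached TD target mirror standard DQN practice; the only REM-specific part is the random convex combination of Q-heads.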

some points:

  1. Increasing the number of models used for ensembling typically improves the performance of supervised learning models
