on-policy RL, off-policy RL, offline RL
on-policy
on-policy: the policy that collects the data and the policy being updated are the same policy. The agent interacts with the environment using the current policy, collects a number of transitions (s, a, r, s', terminal_flag), and then updates that same policy. There is no replay buffer: the data is discarded after it is used, so there is no experience replay.
Behaviour policy (the policy used for data generation) == the policy used for action selection and update.
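A minimal sketch of this collect-then-update-then-discard loop, assuming a toy 2-armed bandit and a softmax policy trained REINFORCE-style (environment, hyperparameters, and policy form are illustrative choices, not taken from the references):

```python
# Minimal on-policy sketch (REINFORCE-style) on a toy 2-armed bandit.
# The bandit environment, softmax policy, and learning rate are illustrative
# placeholders for the idea: collect with the current policy, update it, discard the data.
import math, random

def pull(action):                      # toy environment: arm 1 pays more on average
    return random.gauss(1.0 if action == 1 else 0.2, 0.1)

theta = [0.0, 0.0]                     # policy parameters (one logit per arm)

def policy_probs():
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [p / s for p in z]

alpha = 0.1
for iteration in range(200):
    # 1) collect a small batch of data with the CURRENT policy
    batch = []
    for _ in range(16):
        probs = policy_probs()
        a = random.choices([0, 1], weights=probs)[0]
        r = pull(a)
        batch.append((a, r))
    # 2) update the same policy with that data (one policy-gradient step)
    probs = policy_probs()
    for a, r in batch:
        for i in range(2):
            grad_log = (1.0 if i == a else 0.0) - probs[i]
            theta[i] += alpha * r * grad_log / len(batch)
    # 3) the batch is discarded here -- no replay buffer, no reuse
print("final action probabilities:", policy_probs())
```

Note that the batch collected in step 1 is only ever used for one update of the policy that generated it, which is the defining property of the on-policy setting.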
off-policy
off-policy: the policy that collects the data and the policy being updated are different policies, but the agent still interacts with the environment. The agent interacts with the environment using the current policy, pushes the collected transitions (s, a, r, s', terminal_flag) into a replay buffer, and then samples a batch of transitions from the replay buffer to update the current policy.
Off-policy learning allows the use of older samples (collected using older policies) in the update. To update the policy, experiences are sampled from a replay buffer that contains interactions collected by its own predecessor policies. This improves sample efficiency, since we don't need to recollect samples whenever the policy changes.
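A minimal off-policy sketch, assuming a tiny chain MDP, tabular Q-learning, and an epsilon-greedy behaviour policy (all illustrative assumptions): transitions go into a replay buffer, and each update samples transitions that may have been generated by older versions of the policy.

```python
# Minimal off-policy sketch: tabular Q-learning with a replay buffer on a tiny chain MDP.
# Environment, buffer size, and hyperparameters are illustrative assumptions.
import random
from collections import deque

N_STATES = 5                                # states 0..4, reward when reaching state 4

def step(s, a):                             # a=0: move left, a=1: move right
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    done = s_next == N_STATES - 1
    return s_next, r, done

Q = [[0.0, 0.0] for _ in range(N_STATES)]
buffer = deque(maxlen=1000)                 # replay buffer of (s, a, r, s', terminal_flag)
alpha, gamma, eps = 0.1, 0.9, 0.3

for episode in range(200):
    s, done = 0, False
    while not done:
        # behaviour policy: epsilon-greedy w.r.t. the current Q (generates the data)
        a = random.randint(0, 1) if random.random() < eps else max((0, 1), key=lambda x: Q[s][x])
        s_next, r, done = step(s, a)
        buffer.append((s, a, r, s_next, done))
        s = s_next

        # sample old transitions (possibly from earlier policies) and update the greedy target
        batch = random.sample(buffer, min(len(buffer), 8))
        for (bs, ba, br, bs2, bdone) in batch:
            target = br if bdone else br + gamma * max(Q[bs2])
            Q[bs][ba] += alpha * (target - Q[bs][ba])

print("greedy action per state:", [max((0, 1), key=lambda a: Q[s][a]) for s in range(N_STATES)])
```

Here the behaviour policy (epsilon-greedy) differs from the target policy (greedy max over Q), and updates reuse transitions stored from earlier, different policies.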
offline
offline: the data-collection policy is unknown and there is no environment interaction. The agent does not interact with the environment; instead it uses a previously collected dataset, sampling batches of transitions (s, a, r, s', terminal_flag) from it to update the current policy. No new data is generated.
Offline reinforcement learning algorithms utilize previously collected data without any additional online data collection. The agent no longer has the ability to interact with the environment and collect additional transitions using the behaviour policy. The learning algorithm is provided with a static dataset of fixed interactions, D, and must learn the best policy it can from this dataset; it has no access to additional data because it cannot interact with the environment.
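A minimal offline-RL sketch under the same toy chain-MDP assumption: a static dataset D is produced beforehand by some behaviour policy (here simply uniform random, standing in for "previously collected data"), and the learning phase only samples from D, never calling the environment.

```python
# Minimal offline sketch: batch Q-learning over a fixed dataset D on a toy chain MDP.
# The dataset-generation step only stands in for "previously collected data";
# during learning the agent never interacts with the environment.
import random

N_STATES = 5

def step(s, a):                                     # used ONLY to build the static dataset
    s_next = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, r, s_next == N_STATES - 1

# Static dataset D of (s, a, r, s', terminal_flag), collected beforehand by an
# unknown behaviour policy (here: uniformly random actions).
D = []
for _ in range(500):
    s, done = 0, False
    while not done:
        a = random.randint(0, 1)
        s_next, r, done = step(s, a)
        D.append((s, a, r, s_next, done))
        s = s_next

# Learning phase: repeatedly sample minibatches from the fixed dataset D.
# No new transitions are ever added -- there is no environment access here.
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma = 0.1, 0.9
for _ in range(2000):
    for (s, a, r, s_next, done) in random.sample(D, 32):
        target = r if done else r + gamma * max(Q[s_next])
        Q[s][a] += alpha * (target - Q[s][a])

print("greedy action per state:", [max((0, 1), key=lambda a: Q[s][a]) for s in range(N_STATES)])
```

The structural difference from the off-policy sketch is that the buffer is frozen: the policy being learned never generates new data, which is exactly the offline constraint described above.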
nice references
https://slideslive.com/38935785/offline-reinforcement-learning-from-algorithms-to-practical-challenges
https://kowshikchilamkurthy.medium.com/off-policy-vs-on-policy-vs-offline-reinforcement-learning-demystified-f7f87e275b48