[Chapter 5] Reinforcement Learning

2021-05-30  超级超级小天才

Function Approximation

While we are learning Q-functions, how do we represent or record the Q-values? For a discrete and finite state space and action space, we can use a big table of size |S| \times |A| to store the Q-values for all (s,a) pairs. However, when the state space or action space is huge, or, as is usually the case, continuous and infinite, a tabular method no longer works.

We need function approximation to represent the utility and Q-functions with some parameters {\theta} to be learned. Taking the grid environment as our example again, we can represent a state by its coordinates (x,y); then one simple function approximation looks like this:

\hat{U}_{\theta} (x,y)={\theta}_0+{\theta}_1 x+{\theta}_2 y
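The linear form above can be sketched in a few lines of Python; the parameter values here are made-up examples, not learned ones:

```python
import numpy as np

# Linear utility approximation for a grid world:
# U_hat(x, y) = theta_0 + theta_1 * x + theta_2 * y
def utility(theta, state):
    x, y = state
    features = np.array([1.0, x, y])  # bias term plus the two coordinates
    return features @ theta

theta = np.array([0.5, 0.1, -0.2])   # hypothetical parameter values
print(utility(theta, (2, 3)))        # 0.5 + 0.1*2 - 0.2*3 = approximately 0.1
```

Note that the gradient of this linear form with respect to {\theta} is just the feature vector [1, x, y], which is what makes the update rules below so simple.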

Of course, you can design more complex functions when you have a much larger state space.

In this case, our reinforcement learning agent instead learns the parameters {\theta} that approximate the evaluation functions (\hat{U}_{\theta} or \hat{Q}_{\theta}).

For Monte Carlo learning, we can collect a set of training samples (trials) with inputs and labels, which turns this into a supervised learning problem. With a squared-error loss and a linear function, we get a standard linear regression problem.
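As a sketch of this reduction: suppose we have already run some trials and recorded, for each visited state, the return observed from it (the states and returns below are made-up illustrative data). Fitting {\theta} by least squares is then one library call:

```python
import numpy as np

# Hypothetical Monte Carlo data: visited states and observed returns
states  = np.array([(1, 1), (2, 1), (3, 2), (4, 3)], dtype=float)
returns = np.array([0.2, 0.35, 0.6, 0.85])

# Design matrix [1, x, y]; minimizing squared error is linear regression
X = np.hstack([np.ones((len(states), 1)), states])
theta, *_ = np.linalg.lstsq(X, returns, rcond=None)
print(theta)  # fitted theta ~= [-0.05, 0.15, 0.10]
```

Each row of X is the feature vector of one sampled state and each entry of `returns` is its Monte Carlo label, so the RL problem really has become ordinary supervised regression.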

For Temporal Difference learning, the agent adjusts the parameters to reduce the temporal difference (the TD error). The parameters are updated by gradient descent:

{\theta}_i \leftarrow {\theta}_i + \alpha \left( R(s) + \gamma \hat{Q}_{\theta}(s', a') - \hat{Q}_{\theta}(s, a) \right) \frac{\partial \hat{Q}_{\theta}(s, a)}{\partial \theta_i} \quad \text{(SARSA, on-policy)}

{\theta}_i \leftarrow {\theta}_i + \alpha \left( R(s) + \gamma \max_{a'} \hat{Q}_{\theta}(s', a') - \hat{Q}_{\theta}(s, a) \right) \frac{\partial \hat{Q}_{\theta}(s, a)}{\partial \theta_i} \quad \text{(Q-learning, off-policy)}

Going Deep

One of the greatest advances in reinforcement learning is combining it with deep learning. As stated above, in most cases we cannot use a tabular method to represent the evaluation functions; we need approximation! And as you may have guessed, a deep network is a good function approximator: the network takes the state as input and outputs the Q-values or utilities. Using deep networks in RL is called deep reinforcement learning (DRL).

Why do we need a deep network? When the state is high-dimensional raw input (for example, the pixels of a game screen), hand-designed linear features are hard to come by; a deep network can learn a feature representation and the value function jointly.

One of the DRL algorithms is the Deep Q-Network (DQN); the pseudo code is shown here, but we will not go into detail:

(figure: DQN pseudo code)
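To make the overall shape concrete, here is a minimal sketch of the DQN training loop. It is not the algorithm from the figure: the tiny two-layer network, the random toy "environment", and all hyperparameters are hypothetical stand-ins, kept small so the structure (replay buffer, target network, TD targets) stays visible:

```python
import random
from collections import deque
import numpy as np

STATE_DIM, N_ACTIONS, HIDDEN = 2, 4, 16   # toy sizes, chosen arbitrarily
GAMMA, LR = 0.9, 0.01

rng = np.random.default_rng(0)

def init_net():
    # A tiny two-layer network standing in for the deep Q-network
    return {"W1": rng.normal(0, 0.1, (HIDDEN, STATE_DIM)), "b1": np.zeros(HIDDEN),
            "W2": rng.normal(0, 0.1, (N_ACTIONS, HIDDEN)), "b2": np.zeros(N_ACTIONS)}

def forward(net, s):
    h = np.maximum(0.0, net["W1"] @ s + net["b1"])  # ReLU hidden layer
    return net["W2"] @ h + net["b2"], h             # Q-values for all actions

net = init_net()
target_net = {k: v.copy() for k, v in net.items()}  # frozen copy for TD targets
replay = deque(maxlen=10_000)                       # experience replay buffer

def train_step(batch):
    # One SGD step on the squared TD error; gradients by hand for this tiny net
    for s, a, r, s_next, done in batch:
        q, h = forward(net, s)
        q_next, _ = forward(target_net, s_next)     # target network, not net
        target = r if done else r + GAMMA * q_next.max()
        err = q[a] - target                         # TD error for the taken action
        grad_h = err * net["W2"][a] * (h > 0)       # backprop through ReLU
        net["W2"][a] -= LR * err * h
        net["b2"][a] -= LR * err
        net["W1"] -= LR * np.outer(grad_h, s)
        net["b1"] -= LR * grad_h

# Toy interaction loop: random transitions stand in for a real environment
for step in range(200):
    s = rng.random(STATE_DIM)
    if rng.random() < 0.1:                          # epsilon-greedy exploration
        a = int(rng.integers(N_ACTIONS))
    else:
        a = int(forward(net, s)[0].argmax())
    s_next, r, done = rng.random(STATE_DIM), float(rng.random()), False  # fake env
    replay.append((s, a, r, s_next, done))
    if len(replay) >= 32:
        train_step(random.sample(replay, 32))       # learn from a random minibatch
    if step % 50 == 0:
        target_net = {k: v.copy() for k, v in net.items()}  # sync target network
```

The two ingredients that distinguish DQN from plain Q-learning with a network are visible here: transitions are replayed out of order to break correlations, and TD targets come from a periodically synced target network rather than the network being trained.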