Reinforcement Learning Study - 6

 Q Learning (TD Control)

On-Policy vs Off-Policy

 Imagine you're watching a friend play a game. The friend improves through the experience of playing. How about you? You can also learn just by watching someone else play. The way the friend learns, from actions chosen by their own policy, is called the 'on-policy' method. What you do, learning from actions chosen by someone else's policy, is called the 'off-policy' method.

These methods can be defined as follows:
  On-Policy : Target Policy == Behavior Policy
  Off-Policy : Target Policy != Behavior Policy

 The target policy is the policy that the agent wants to train, and the behavior policy is the policy that actually interacts with the environment. In other words, we want to train our 'target policy' while watching our friend's 'behavior policy'.
 The off-policy method has three advantages over the on-policy method. First, an agent can reuse previous experience, because the target policy does not have to be the same as the behavior policy. Second, high-quality data created by experts can be used to train our policy. Third, one-to-many or many-to-one learning is possible (one behavior policy can feed many target policies, or several behavior policies can feed one target policy). A minimal sketch of the first advantage follows below.
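The snippet below is a sketch of experience reuse, not code from this post; the names (buffer, q, alpha, gamma, actions) are illustration-only. Transitions collected by any behavior policy are stored and replayed later to update the Q-table, which is only valid because the Q-learning target does not depend on who chose the stored action.

import random
from collections import defaultdict

q = defaultdict(float)            # tabular Q-values keyed by (state, action); illustration-only
buffer = []                       # transitions collected by ANY behavior policy (ours, a friend's, an expert's)
alpha, gamma = 0.1, 0.9
actions = [0, 1, 2, 3]

def replay_update(batch_size=32):
    # Reuse old experience: the Q-learning target takes a max over actions,
    # so it does not matter which policy originally picked the stored action.
    batch = random.sample(buffer, min(batch_size, len(buffer)))
    for s, a, r, s_next in batch:
        best_next = max(q[(s_next, a2)] for a2 in actions)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])

The same buffer could also hold expert demonstrations (the second advantage), since the update never asks how the action was selected.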


How can Q-learning be Off-policy?

 Q-learning uses the Bellman optimality equation: Q*(s, a) = E[ r + gamma * max_a' Q*(s', a') ]. Looking at this equation, there is no policy probability (pi) in it, because only the action with the maximum value is used. Therefore, it does not matter which policy is used to explore the environment.
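To make the difference concrete, here is a small comparison (the names gamma and next_a are illustrative, matching the pseudocode below): the SARSA target needs the action the behavior policy actually took, while the Q-learning target only needs the max over actions.

# SARSA (on-policy): the target uses next_a, the action the behavior policy actually chose
sarsa_target = reward + gamma * q[next_x, next_y, next_a]

# Q-learning (off-policy): the target takes the max over actions, so the behavior policy is irrelevant
q_target = reward + gamma * max(q[next_x, next_y, :])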

 Q-learning is a TD-based method. Compared with the TD code from the previous post, only a few lines need to change:

...

for episode in range(num_episodes):    # repeat for the number of learning episodes

    while not done:                    # run one episode step by step

        # Q-learning update: bootstrap from the greedy (max) action value of the next state
        # (the discount factor gamma is omitted here, i.e. gamma = 1 is assumed)
        q[x, y, a] = q[x, y, a] + alpha * (reward + max(q[next_x, next_y, :]) - q[x, y, a])

...
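Filling in the elided parts, a complete runnable sketch might look like the following. The 5x5 grid world, the reward scheme, and the epsilon-greedy behavior policy are assumptions for illustration, not the environment from this study series.

import numpy as np

SIZE = 5                                          # assumed 5x5 grid, start at (0, 0), goal at (4, 4)
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # up, down, left, right
q = np.zeros((SIZE, SIZE, len(actions)))          # q[x, y, a], as in the pseudocode above
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def step(x, y, a):
    # Move within the grid; reaching the bottom-right corner ends the episode.
    dx, dy = actions[a]
    nx = min(max(x + dx, 0), SIZE - 1)
    ny = min(max(y + dy, 0), SIZE - 1)
    done = (nx, ny) == (SIZE - 1, SIZE - 1)
    reward = 1.0 if done else -0.1
    return nx, ny, reward, done

for episode in range(1000):
    x, y, done = 0, 0, False
    while not done:
        # epsilon-greedy behavior policy; any exploring policy would work for off-policy learning
        if np.random.rand() < epsilon:
            a = np.random.randint(len(actions))
        else:
            a = int(np.argmax(q[x, y]))
        next_x, next_y, reward, done = step(x, y, a)
        # Q-learning update: greedy max over the next state's action values
        target = reward if done else reward + gamma * np.max(q[next_x, next_y])
        q[x, y, a] += alpha * (target - q[x, y, a])
        x, y = next_x, next_y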






