Reinforcement Learning Study - 6

 Q Learning (TD Control)

On-Policy vs Off-Policy

 Imagine you're watching a friend play a game. The friend improves through the experience of playing. How about you? You can also learn just by watching someone else play. The way the friend learns, from actions chosen by their own policy, is called the 'on-policy' method. What you do, learning from actions chosen by someone else's policy, is called the 'off-policy' method.

These methods can be defined as follows:
  On-Policy : Target Policy == Behavior Policy
  Off-Policy : Target Policy != Behavior Policy

 The target policy is the policy that the agent wants to train, and the behavior policy is the policy that actually interacts with the environment. In other words, we want to train our 'target policy' while watching our friend's 'behavior policy'.
 The off-policy method has three advantages over the on-policy method. First, an agent can reuse previous experience, because the target policy does not have to be the same as the behavior policy. Second, high-quality data created by experts can be used to train our policy. Third, one-to-many or many-to-one learning is possible (one behavior policy can feed many target policies, or several behavior policies can feed one target policy). A minimal sketch of the first advantage follows below.
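The snippet below is a sketch of experience reuse, not code from this post; the names (buffer, q, alpha, gamma, actions) are illustration-only. Transitions collected by any behavior policy are stored and replayed later to update the Q-table, which is only valid because the Q-learning target does not depend on who chose the stored action.

import random
from collections import defaultdict

q = defaultdict(float)            # tabular Q-values keyed by (state, action); illustration-only
buffer = []                       # transitions collected by ANY behavior policy (ours, a friend's, an expert's)
alpha, gamma = 0.1, 0.9
actions = [0, 1, 2, 3]

def replay_update(batch_size=32):
    # Reuse old experience: the Q-learning target takes a max over actions,
    # so it does not matter which policy originally picked the stored action.
    batch = random.sample(buffer, min(batch_size, len(buffer)))
    for s, a, r, s_next in batch:
        best_next = max(q[(s_next, a2)] for a2 in actions)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])

The same buffer could also hold expert demonstrations (the second advantage), since the update never asks how the action was selected.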


How can Q-learning be Off-policy?

 Q-learning uses the Bellman optimality equation: Q*(s, a) = E[ r + gamma * max_a' Q*(s', a') ]. Looking at this equation, there is no policy probability (pi) in it, because only the action with the maximum value is used. Therefore, it does not matter which policy is used to explore the environment.
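To make the difference concrete, here is a small comparison (the names gamma and next_a are illustrative, matching the pseudocode below): the SARSA target needs the action the behavior policy actually took, while the Q-learning target only needs the max over actions.

# SARSA (on-policy): the target uses next_a, the action the behavior policy actually chose
sarsa_target = reward + gamma * q[next_x, next_y, next_a]

# Q-learning (off-policy): the target takes the max over actions, so the behavior policy is irrelevant
q_target = reward + gamma * max(q[next_x, next_y, :])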

 Q-learning is a TD-based method. Compared with the TD code from the previous post, only a few lines need to change:

...

for episode in range(num_episodes):    # repeat for the number of learning episodes

    while not done:                    # run one episode step by step

        # Q-learning update: bootstrap from the greedy (max) action value of the next state
        # (the discount factor gamma is omitted here, i.e. gamma = 1 is assumed)
        q[x, y, a] = q[x, y, a] + alpha * (reward + max(q[next_x, next_y, :]) - q[x, y, a])

...
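Filling in the elided parts, a complete runnable sketch might look like the following. The 5x5 grid world, the reward scheme, and the epsilon-greedy behavior policy are assumptions for illustration, not the environment from this study series.

import numpy as np

SIZE = 5                                          # assumed 5x5 grid, start at (0, 0), goal at (4, 4)
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # up, down, left, right
q = np.zeros((SIZE, SIZE, len(actions)))          # q[x, y, a], as in the pseudocode above
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def step(x, y, a):
    # Move within the grid; reaching the bottom-right corner ends the episode.
    dx, dy = actions[a]
    nx = min(max(x + dx, 0), SIZE - 1)
    ny = min(max(y + dy, 0), SIZE - 1)
    done = (nx, ny) == (SIZE - 1, SIZE - 1)
    reward = 1.0 if done else -0.1
    return nx, ny, reward, done

for episode in range(1000):
    x, y, done = 0, 0, False
    while not done:
        # epsilon-greedy behavior policy; any exploring policy would work for off-policy learning
        if np.random.rand() < epsilon:
            a = np.random.randint(len(actions))
        else:
            a = int(np.argmax(q[x, y]))
        next_x, next_y, reward, done = step(x, y, a)
        # Q-learning update: greedy max over the next state's action values
        target = reward if done else reward + gamma * np.max(q[next_x, next_y])
        q[x, y, a] += alpha * (target - q[x, y, a])
        x, y = next_x, next_y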






