Reinforcement Learning Study - 7

Deep Reinforcement Learning

In the real world, the state or action space is often too big to record all the information about the model in a (value) table. To generalize this information, the most powerful generalization tool (function approximator), the neural network, is used. A node is the fundamental component of a neural network: each node linearly combines the inputs entering it (Wx + b) and then passes the result through a nonlinear function (sigmoid, ReLU, etc.).
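As a minimal sketch of a single layer of such nodes (assuming PyTorch; the sizes here are just illustrative):

import torch
import torch.nn as nn

x = torch.randn(1, 4)        # an input with 4 features (e.g. a state vector)
layer = nn.Linear(4, 8)      # linear combination Wx + b for 8 nodes
h = torch.relu(layer(x))     # nonlinear activation (ReLU) applied to each node's output
print(h.shape)               # torch.Size([1, 8])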
 

Value-based agent

Value-based learning is a method where a neural network is used to approximate the value function. To update the parameters of the neural network, a 'loss function' is used, defined as the difference between the predicted value and the real value.
In Q-learning, the (optimal) action-value function is defined by the Bellman optimality equation:

    Q_*(s, a) = E[ r + \gamma \max_{a'} Q_*(s', a') ]

so the loss is defined as below:

    L(\theta) = E[ ( Q_*(s, a) - Q_\theta(s, a) )^2 ]
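In PyTorch terms this loss would look like the sketch below (q, s, a and q_true are hypothetical names, and q_true is exactly the quantity we do not have, which is the problem addressed just below):

import torch.nn.functional as F

q_pred = q(s).gather(1, a)          # Q_theta(s, a): the network's prediction for the chosen actions
loss = F.mse_loss(q_pred, q_true)   # squared difference between prediction and the real value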

In fact, we can't use this loss directly, because we don't know the real value Q_*. The smart way to get around this problem is to use the expected value: the real Q_* inside the loss is replaced by the TD target r + \gamma \max_{a'} Q_\theta(s', a'), and the expectation is taken over the transitions the agent actually experiences.

    L(\theta) = E[ ( r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) )^2 ]

By using the expected value, we get two benefits. First, we can actually use the loss function defined above, because the expectation can be estimated from sampled experience instead of the unknown real Q. Second, states that are visited frequently receive more weight in the loss, which means the values of important states are estimated more accurately.
One thing to note here is that the TD target should not be updated, even though it is a function of theta, because the target plays the role of the real value and so has to be treated as a fixed constant.
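In PyTorch terms this means computing the target under torch.no_grad() (or calling .detach() on it); a rough sketch, reusing the tensor names from the pseudo code later in this post:

import torch
import torch.nn.functional as F

with torch.no_grad():                                    # nothing inside is tracked for gradients
    target = r + gamma * q(s_prime).max(1)[0].unsqueeze(1) * done_mask
loss = F.smooth_l1_loss(q(s).gather(1, a), target)       # gradients flow only through the prediction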
Additionally, in Deep Q-Learning (DQN), there are two methods used to improve the performance of the network: 'Experience Replay' and 'Target Network'. Experience replay stores previously recorded transitions and reuses them to train the network; during training, data is randomly sampled from this record, which reduces the correlation between consecutive training samples.
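A minimal replay buffer sketch (the buffer size and names are illustrative; a real implementation would also stack the sampled transitions into tensors):

import collections
import random

class ReplayBuffer:
    def __init__(self, buffer_limit=50000):
        self.buffer = collections.deque(maxlen=buffer_limit)  # old transitions are dropped automatically

    def put(self, transition):
        self.buffer.append(transition)           # store one (s, a, r, s_prime, done_mask) tuple

    def sample(self, n):
        return random.sample(self.buffer, n)     # uniform random sampling reduces correlation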
DQN also uses a target network for training. The TD target depends on theta, so whenever theta changes through an update, the target changes as well, and we need to keep the target fixed to update stably. The 'Target Network' is a separate copy of the network that is frozen for a certain period of time and used only to compute the target.
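A sketch of the idea, using the q and q_target networks that appear in the pseudo code below (the update interval is arbitrary):

q_target.load_state_dict(q.state_dict())            # q_target starts as an exact copy of q

for n_epi in range(num_episodes):
    ...                                              # train q as usual, computing TD targets with q_target
    if n_epi % target_update_interval == 0:
        q_target.load_state_dict(q.state_dict())     # refresh the frozen target once in a while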

To update theta, we need to differentiate the loss function and use gradient descent. In this post, I'm not going to go into the details of gradient descent.



DQN pseudo code

...

q = Qnet()
q_target = Qnet()
q_target.load_state_dict(q.state_dict())       # the target network starts as a copy of q
memory = ReplayBuffer()                        # deque / put / sample
optimizer = optim.Adam(q.parameters(), lr=learning_rate)

for n_epi in range(num_episodes):

    s = env.reset()                            # generic environment interface (assumed)
    done = False
    while not done:                            # collect one episode of experience
        a = q.sample_action(s, epsilon)        # epsilon-greedy action (helper assumed on Qnet)
        s_prime, r, done = env.step(a)
        done_mask = 0.0 if done else 1.0       # 1 while running, 0 at the terminal state
        memory.put((s, a, r, s_prime, done_mask))
        s = s_prime

    for i in range(batch_updates):
        s, a, r, s_prime, done_mask = memory.sample(batch_size)   # random mini-batch

        q_a = q(s).gather(1, a)                                   # Q_theta(s, a) for the taken actions
        max_q_prime = q_target(s_prime).max(1)[0].unsqueeze(1)    # predict with the target (fixed) network
        target = r + gamma * max_q_prime * done_mask              # TD target; done_mask zeroes it at the end point
        loss = torch.nn.functional.smooth_l1_loss(q_a, target)

        optimizer.zero_grad()
        loss.backward()                                           # compute gradients
        optimizer.step()                                          # update the weights

    if n_epi % target_update_interval == 0:
        q_target.load_state_dict(q.state_dict())                  # periodically refresh the target network

...
