Reinforcement Learning Study - 7

Deep Reinforcement Learning

In the real world, the state or action space is often too big to record all the information about the model in a (value) table. To generalize this information, the most powerful generalization tool (function approximator), the neural network, is used. A node is the fundamental component of a neural network: each node linearly combines the inputs entering it (Wx + b) and then passes the result through a nonlinear function (sigmoid, ReLU, etc.).
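As a minimal sketch of a single layer of such nodes (assuming PyTorch; the sizes here are just illustrative):

import torch
import torch.nn as nn

x = torch.randn(1, 4)        # an input with 4 features (e.g. a state vector)
layer = nn.Linear(4, 8)      # linear combination Wx + b for 8 nodes
h = torch.relu(layer(x))     # nonlinear activation (ReLU) applied to each node's output
print(h.shape)               # torch.Size([1, 8])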
 

Value-based agent

Value-based learning is a method where a neural network is used to approximate the value function. To update the parameters of the neural network, a 'loss function' is used, defined as the difference between the predicted value and the real value.
In Q-learning, the (optimal) action-value function is defined by the Bellman optimality equation:

    Q_*(s, a) = E[ r + \gamma \max_{a'} Q_*(s', a') ]

so the loss is defined as below:

    L(\theta) = E[ ( Q_*(s, a) - Q_\theta(s, a) )^2 ]
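In PyTorch terms this loss would look like the sketch below (q, s, a and q_true are hypothetical names, and q_true is exactly the quantity we do not have, which is the problem addressed just below):

import torch.nn.functional as F

q_pred = q(s).gather(1, a)          # Q_theta(s, a): the network's prediction for the chosen actions
loss = F.mse_loss(q_pred, q_true)   # squared difference between prediction and the real value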

In fact, we can't use this loss directly, because we don't know the real value Q_*. The smart way to get around this problem is to use the expected value: the real Q_* inside the loss is replaced by the TD target r + \gamma \max_{a'} Q_\theta(s', a'), and the expectation is taken over the transitions the agent actually experiences.

    L(\theta) = E[ ( r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) )^2 ]

By using the expected value, we get two benefits. First, we can actually use the loss function defined above, because the expectation can be estimated from sampled experience instead of the unknown real Q. Second, states that are visited frequently receive more weight in the loss, which means the values of important states are estimated more accurately.
One thing to note here is that the TD target should not be updated, even though it is a function of theta, because the target plays the role of the real value and so has to be treated as a fixed constant.
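In PyTorch terms this means computing the target under torch.no_grad() (or calling .detach() on it); a rough sketch, reusing the tensor names from the pseudo code later in this post:

import torch
import torch.nn.functional as F

with torch.no_grad():                                    # nothing inside is tracked for gradients
    target = r + gamma * q(s_prime).max(1)[0].unsqueeze(1) * done_mask
loss = F.smooth_l1_loss(q(s).gather(1, a), target)       # gradients flow only through the prediction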
Additionally, in Deep Q-Learning (DQN), there are two methods used to improve the performance of the network: 'Experience Replay' and 'Target Network'. Experience replay stores previously recorded transitions and reuses them to train the network; during training, data is randomly sampled from this record, which reduces the correlation between consecutive training samples.
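A minimal replay buffer sketch (the buffer size and names are illustrative; a real implementation would also stack the sampled transitions into tensors):

import collections
import random

class ReplayBuffer:
    def __init__(self, buffer_limit=50000):
        self.buffer = collections.deque(maxlen=buffer_limit)  # old transitions are dropped automatically

    def put(self, transition):
        self.buffer.append(transition)           # store one (s, a, r, s_prime, done_mask) tuple

    def sample(self, n):
        return random.sample(self.buffer, n)     # uniform random sampling reduces correlation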
DQN also uses a target network for training. The TD target depends on theta, so whenever theta changes through an update, the target changes as well, and we need to keep the target fixed to update stably. The 'Target Network' is a separate copy of the network that is frozen for a certain period of time and used only to compute the target.
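A sketch of the idea, using the q and q_target networks that appear in the pseudo code below (the update interval is arbitrary):

q_target.load_state_dict(q.state_dict())            # q_target starts as an exact copy of q

for n_epi in range(num_episodes):
    ...                                              # train q as usual, computing TD targets with q_target
    if n_epi % target_update_interval == 0:
        q_target.load_state_dict(q.state_dict())     # refresh the frozen target once in a while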

To update theta, we need to differentiate the loss function and use gradient descent. In this post, I'm not going to go into the details of gradient descent.



DQN pseudo code

...

q = Qnet()
q_target = Qnet()
q_target.load_state_dict(q.state_dict())       # the target network starts as a copy of q
memory = ReplayBuffer()                        # deque / put / sample
optimizer = optim.Adam(q.parameters(), lr=learning_rate)

for n_epi in range(num_episodes):

    s = env.reset()                            # generic environment interface (assumed)
    done = False
    while not done:                            # collect one episode of experience
        a = q.sample_action(s, epsilon)        # epsilon-greedy action (helper assumed on Qnet)
        s_prime, r, done = env.step(a)
        done_mask = 0.0 if done else 1.0       # 1 while running, 0 at the terminal state
        memory.put((s, a, r, s_prime, done_mask))
        s = s_prime

    for i in range(batch_updates):
        s, a, r, s_prime, done_mask = memory.sample(batch_size)   # random mini-batch

        q_a = q(s).gather(1, a)                                   # Q_theta(s, a) for the taken actions
        max_q_prime = q_target(s_prime).max(1)[0].unsqueeze(1)    # predict with the target (fixed) network
        target = r + gamma * max_q_prime * done_mask              # TD target; done_mask zeroes it at the end point
        loss = torch.nn.functional.smooth_l1_loss(q_a, target)

        optimizer.zero_grad()
        loss.backward()                                           # compute gradients
        optimizer.step()                                          # update the weights

    if n_epi % target_update_interval == 0:
        q_target.load_state_dict(q.state_dict())                  # periodically refresh the target network

...
