Posts

Showing posts from February, 2022

Reinforcement Learning Study - 7

Deep Reinforcement Learning

In the real world, the state or action space is often too big to record all information about the model in a (value) table. To generalize this information, the most powerful generalization tool (function approximator), the neural network, is used. A node is the fundamental component of a neural network: each node linearly combines its inputs (WX + b) and then passes the result through a nonlinear function (sigmoid, ReLU, etc.).

Value-based agent

Value-based learning is a method where a neural network is used to predict the value function. In a neural network, a 'loss function' is used to update the network's parameters. The loss function is defined as the difference between the predicted value and the real value. In Q-learning, the value function is defined as $Q_*(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q_*(s', a') \right]$, so the loss is defined as $L(\theta) = \mathbb{E}\left[ \left( Q_*(s, a) - Q_\theta(s, a) \right)^2 \right]$. In fact, we can't use that equation directly, because we don't know the real value $Q_*$. So here is a smart way to solve this problem. That's the expected val...
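As a minimal sketch of the idea above: since the true $Q_*$ is unknown, the loss below replaces it with a TD target computed from the network itself. The two-layer network, its sizes, and the batch of (s, a, r, s', done) tuples are my own assumptions for illustration, not part of the original post.

    import torch
    import torch.nn as nn

    # Tiny Q-network: linear combinations (Wx + b) followed by a ReLU nonlinearity.
    q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    gamma = 0.9

    def q_learning_loss(s, a, r, s_next, done):
        # Predicted value Q(s, a) for the actions actually taken.
        q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        # We don't know the real Q, so use the TD target r + gamma * max_a' Q(s', a') instead.
        with torch.no_grad():
            q_target = r + gamma * q_net(s_next).max(dim=1).values * (1 - done)
        return ((q_target - q_pred) ** 2).mean()

    # Typical usage: loss = q_learning_loss(...); optimizer.zero_grad(); loss.backward(); optimizer.step()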

Reinforcement Learning Study - 6

Q Learning (TD Control)

On-Policy vs Off-Policy

Imagine you're watching a friend play a game. The friend gets better through the experience of playing. How about you? We can also learn by watching other people play. In this situation, the way the friend learns is called the 'on-policy' method, while what you do is called the 'off-policy' method. These methods can be defined as follows:

On-Policy: Target Policy == Behavior Policy
Off-Policy: Target Policy != Behavior Policy

The target policy is the policy the agent wants to train, and the behavior policy is the policy that actually interacts with the environment. In other words, we want to train our 'target policy' while watching our friend's 'behavior policy'.

The off-policy method has three advantages over the on-policy method. First, an agent can reuse previous experiences, because the target policy does not have to be the same as the behavior policy. Second, high quality...
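To make the distinction concrete, here is a small sketch contrasting the SARSA (on-policy) and Q-learning (off-policy) update targets. The tabular setup, state/action counts, and hyperparameters are assumptions for illustration only.

    import numpy as np

    num_states, num_actions = 25, 4          # assumed sizes for a small grid world
    alpha, gamma, epsilon = 0.1, 0.9, 0.1
    Q = np.zeros((num_states, num_actions))
    rng = np.random.default_rng(0)

    def behavior_policy(s):
        # Epsilon-greedy behavior policy: the policy that actually interacts with the environment.
        if rng.random() < epsilon:
            return int(rng.integers(num_actions))
        return int(np.argmax(Q[s]))

    def sarsa_update(s, a, r, s_next, a_next):
        # On-policy: the target uses a_next, the action the behavior policy actually took.
        target = r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])

    def q_learning_update(s, a, r, s_next):
        # Off-policy: the target uses the greedy (target-policy) action, max over a'.
        target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])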

Reinforcement Learning Study - 5

Control in Model-free MDP

From now on, we are going to learn about three methods for solving the control problem: MC Control, SARSA, and Q-Learning.

The policy iteration method we learned before is a good way to solve model-based problems. However, in a model-free MDP we can't use it, because policy iteration relies on the Bellman expectation equation, which can only be used when we know the model. Another reason is that the agent doesn't know the next state when it selects an action, so we can't build a greedy policy.

MC Control

So here is the solution, obtained with a few changes to policy iteration:
1. Use MC instead of the Bellman expectation equation. Using MC, we can evaluate each state empirically.
2. Use Q instead of V. Even though the agent doesn't know which next state each action leads to, all it has to do is select the action with the highest expected value.
3. Explore with probability epsilon. In the greedy way, if an action is eva...
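The three changes listed above can be combined into a short sketch of MC control. The environment interface (env.reset / env.step) and the hyperparameters are assumptions, not taken from the original post.

    import numpy as np
    from collections import defaultdict

    gamma, alpha, epsilon = 1.0, 0.1, 0.1
    num_actions = 4
    Q = defaultdict(lambda: np.zeros(num_actions))   # 2. use Q(s, a) instead of V(s)
    rng = np.random.default_rng(0)

    def epsilon_greedy(state):
        # 3. Explore with probability epsilon, otherwise act greedily w.r.t. Q.
        if rng.random() < epsilon:
            return int(rng.integers(num_actions))
        return int(np.argmax(Q[state]))

    def mc_control_episode(env):
        # Collect one full episode with the current epsilon-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = epsilon_greedy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # 1. Evaluate Q(s, a) empirically from the sampled return G (MC evaluation).
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            Q[state][action] += alpha * (G - Q[state][action])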

Reinforcement Learning Study - 4

Temporal Difference Method (Prediction)

In the Monte Carlo method, our agent can learn only after an episode ends, but in the real world there are few situations like that. In Temporal Difference (TD) learning, the agent can learn before an episode ends, by updating the current value with the estimate of the next state. So the equation to update a value in TD is $V(s) \leftarrow V(s) + \alpha \left( r + \gamma V(s') - V(s) \right)$ (it uses the estimate of the next state instead of the cumulative reward), and this can be written in pseudo code like below.

    for (the number of learning episodes)
        # in each step
        value[x][y] = value[x][y] + alpha * (reward + gamma * value[x_prime][y_prime] - value[x][y])

Result of test (TD)

MC vs TD

Time of Learning: MC can only be used in episodic MDPs, where there is always an end. TD can be used in non-episodic MDPs as well as episodic ones.

Bias: MC uses the return $G_t$; TD uses the target $r + \gamma V(s')$. The cumulative reward $G_t$ used by MC is a real (sampled) value. Therefore $G_t$ is ...
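Filling the pseudo code above out into something runnable might look like the following. The 5x5 grid, the random policy, and the step function are my assumptions for illustration, not the original post's exact setup.

    import numpy as np

    size, alpha, gamma = 5, 0.01, 1.0
    value = np.zeros((size, size))               # V(s) table for the grid world
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
    rng = np.random.default_rng(0)

    def step(x, y, action):
        # Assumed dynamics: reward of -1 per move; episode ends at the bottom-right corner.
        dx, dy = moves[action]
        nx, ny = min(max(x + dx, 0), size - 1), min(max(y + dy, 0), size - 1)
        return nx, ny, -1, (nx, ny) == (size - 1, size - 1)

    for _ in range(10000):                       # number of learning episodes
        x, y, done = 0, 0, False
        while not done:
            x_prime, y_prime, reward, done = step(x, y, int(rng.integers(4)))
            # TD(0) update: move V(s) toward r + gamma * V(s'), without waiting for the episode to end.
            target = reward + gamma * value[x_prime][y_prime] * (not done)
            value[x][y] += alpha * (target - value[x][y])
            x, y = x_prime, y_prime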

Reinforcement Learning Study - 3 (Value Iteration, using Bellman optimality equation)

Value Iteration method (grid world)

In the last post, we solved the grid world problem with the 'policy iteration' method, which uses the Bellman expectation equation. This time, we'll solve the problem with the 'value iteration' method, which uses the Bellman optimality equation: $V_*(s) = \max_a \left( R_s^a + \gamma \sum_{s'} P_{ss'}^{a} V_*(s') \right)$.
1. Like the previous method, initialize every value in the table to 0.
2. Evaluate each state with the Bellman optimality equation.
3. Repeat process '2'.
After repeating the process, each value in the table represents the number of steps left to reach the goal when following the optimal path, and in this case the optimal policy can be obtained in a greedy way.

Prediction in a model-free environment

So far, we've dealt with MDP problems where we know everything about the environment (model-based MDP). But in the real world, there are few such problems. There are two methods to evaluate a state (Prediction) in a model-free ...
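A rough sketch of the procedure described above follows. The 4x4 deterministic grid, the reward of -1 per step, and the fixed number of sweeps are my assumptions, since the original table and figures aren't reproduced here.

    import numpy as np

    size, gamma = 4, 1.0
    V = np.zeros((size, size))                   # 1. initialize all values with 0
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    goal = (size - 1, size - 1)

    for _ in range(100):                         # 3. repeat process '2'
        new_V = np.zeros_like(V)
        for x in range(size):
            for y in range(size):
                if (x, y) == goal:
                    continue
                # 2. Bellman optimality update: V(s) = max_a ( r + gamma * V(s') )
                candidates = []
                for dx, dy in moves:
                    nx = min(max(x + dx, 0), size - 1)
                    ny = min(max(y + dy, 0), size - 1)
                    candidates.append(-1 + gamma * V[nx, ny])
                new_V[x, y] = max(candidates)
        V = new_V

    # After convergence, |V[x, y]| equals the number of steps to the goal along the optimal path,
    # and acting greedily with respect to V gives the optimal policy.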

Reinforcement Learning Study - 2 (Bellman Equation)

Elements of Markov Decision Process

An MDP is the mathematical form of a reinforcement learning problem. An MDP consists of S (state), A (action), P (transition probability), R (reward), and γ (discount rate). Among them, P (transition probability) and R (reward) differ from those of the MRP (Markov Reward Process): the agent's action has to be added to each element, so they are defined as below. 'P' means the probability of transitioning to state s' when an agent takes action 'a' in state 's'. 'R' means the expectation of the reward when an agent takes action 'a' in state 's'. One thing to note here is that the reward is not deterministic. For example, if an agent wants to step forward in a strongly windy environment, the wind may prevent the agent from going forward, and the agent may step sideways. So the definition of the value of a state given in the previous post should be changed, because the value of a state depends on what the agent does (actions). agent h...
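The definitions referred to above were shown as images in the original post; written out in the usual MDP notation (the notation itself is an assumption on my part, following standard conventions), they are:

    P_{ss'}^{a} = \mathbb{P}\left[ S_{t+1} = s' \mid S_t = s,\; A_t = a \right]

    R_{s}^{a} = \mathbb{E}\left[ R_{t+1} \mid S_t = s,\; A_t = a \right]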

Reinforcement Learning Study - 1

This post summarizes my personal studies; its purpose is to reorganize what I studied. I'm not good at English, so there may be some incorrect information or awkward expressions. If you find anything like that, I would appreciate it if you could let me know my mistakes.

What is Reinforcement Learning?

It is the process of learning through trial and error to maximize the cumulative return in a sequential decision-making problem. It is similar to the way a human baby learns: babies learn what is good and what is bad by taking (seemingly) random actions.

Difference from Supervised/Unsupervised Learning

Reinforcement learning is different from supervised learning, which learns from a supervisor (labeled data). And reinforcement learning is different from unsupervised learning. ..

Markov Decision Process

What is the meaning of a 'sequential decision-making problem'? We can define such problems with an MDP (M...