Reinforcement Learning Study - 2 (Bellman Equation)
Elements of Markov Decision Process
An MDP is the mathematical formulation of a reinforcement learning problem, and it consists of S (state), A (action), P (transition probability), R (reward), and γ (discount rate).
Among them, P (transition probability) and R (reward) differ from their counterparts in an MRP (Markov Reward Process): the agent's action has to be added to each of them, so they are defined as below.
'P' is the probability of transitioning to state s' when the agent takes action 'a' in state 's'.
'R' is the expected reward when the agent takes action 'a' in state 's'. One thing to note here is that the outcome of an action (and thus the reward) is not deterministic. For example, if an agent tries to step forward in a strongly windy environment, the wind may prevent it from going forward, and the agent may end up stepping sideways instead.
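In the standard notation, these two elements are written as:

$$P^a_{ss'} = \mathbb{P}\left[\, S_{t+1} = s' \mid S_t = s,\ A_t = a \,\right]$$

$$R^a_s = \mathbb{E}\left[\, R_{t+1} \mid S_t = s,\ A_t = a \,\right]$$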
So the definition of the value of a state from the previous post has to change, because the value of a state now depends on what the agent does (its actions). The agent has a 'policy' (π) for selecting an action.
A policy is the probability that the agent will take a particular action in a given state.
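Formally, a policy is written as:

$$\pi(a \mid s) = \mathbb{P}\left[\, A_t = a \mid S_t = s \,\right]$$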
So, when following a policy π, the 'state-value function' v(s) is defined as follows.
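In the standard notation, the state-value function under a policy π is the expected return starting from state s:

$$v_\pi(s) = \mathbb{E}_\pi\left[\, G_t \mid S_t = s \,\right], \qquad G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$$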
Prediction and Control
What does it mean to solve an MDP problem? Prediction and control are the two tasks of interest.
Prediction is the task of evaluating the value of each state when a policy is given.
Control is the task of finding the optimal policy (π*).
The value function obtained when the agent follows the optimal policy is called the 'optimal value function'.
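In other words, the optimal state-value function is the best value achievable over all policies:

$$v_*(s) = \max_\pi v_\pi(s)$$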
Bellman Equation
The Bellman equation is a tool for evaluating value in an MDP by defining a recursive relation between the value at time t and the value at time t+1.
Bellman expectation equation
Derivation of the Bellman expectation equation:
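Starting from the definition of v_π(s) and splitting the return into the immediate reward and the discounted return from the next state:

$$\begin{aligned} v_\pi(s) &= \mathbb{E}_\pi\left[\, G_t \mid S_t = s \,\right] \\ &= \mathbb{E}_\pi\left[\, R_{t+1} + \gamma G_{t+1} \mid S_t = s \,\right] \\ &= \mathbb{E}_\pi\left[\, R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \,\right] \end{aligned}$$

Expanding the expectation over the policy and the transition probabilities gives the form that relates v_π(s) to the values of its successor states:

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \left( R^a_s + \gamma \sum_{s'} P^a_{ss'} \, v_\pi(s') \right)$$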
The equation above can be used for problems where we have complete information about the MDP (transition probabilities and rewards). This way of solving is called 'planning' or 'model-based' solving.
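As a concrete illustration, here is a minimal sketch of planning by iterative policy evaluation, which repeatedly applies the Bellman expectation equation until the values converge. The 3-state, 2-action MDP, its transition matrix P, reward table R, and the uniform-random policy are all invented for this example; only the backup rule itself comes from the equation above.

```python
# Minimal sketch: model-based policy evaluation with the Bellman expectation equation.
# The MDP below (3 states, 2 actions) and the random policy are made up for illustration.
import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9

# P[a][s][s'] : probability of moving from s to s' when taking action a
P = np.array([
    [[0.8, 0.2, 0.0],   # action 0
     [0.0, 0.6, 0.4],
     [0.0, 0.0, 1.0]],
    [[0.1, 0.9, 0.0],   # action 1
     [0.0, 0.1, 0.9],
     [0.0, 0.0, 1.0]],
])

# R[s][a] : expected immediate reward for taking action a in state s
R = np.array([
    [0.0, 1.0],
    [0.0, 2.0],
    [0.0, 0.0],
])

# pi[s][a] : probability of taking action a in state s (uniform random policy)
pi = np.full((n_states, n_actions), 1.0 / n_actions)

# Iterative policy evaluation: repeatedly apply the Bellman expectation backup
# v(s) <- sum_a pi(a|s) * ( R(s,a) + gamma * sum_s' P(s'|s,a) * v(s') )
v = np.zeros(n_states)
for _ in range(1000):
    v_new = np.zeros(n_states)
    for s in range(n_states):
        for a in range(n_actions):
            v_new[s] += pi[s, a] * (R[s, a] + gamma * P[a, s] @ v)
    if np.max(np.abs(v_new - v)) < 1e-8:  # stop once the values have converged
        v = v_new
        break
    v = v_new

print("State values under the random policy:", v)
```

Because the transition probabilities and rewards are known in advance, no interaction with the environment is needed; this is exactly what makes it planning rather than learning.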