Reinforcement Learning Study - 2 (Bellman Equation)


Elements of Markov Decision Process

 An MDP (Markov Decision Process) is the mathematical formulation of a reinforcement learning problem. An MDP consists of S (states), A (actions), P (transition probabilities), R (rewards), and γ (discount rate).

Among them, P (transition probability) and R (reward) differ from their counterparts in an MRP (Markov Reward Process): the agent's action has to be added to each element, so they are defined as below.

 'P' means the probability of transitioning to state s' when the agent takes action a in state s.
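In standard MDP notation, this is written as:

  P^a_{ss'} = P[ S_{t+1} = s' | S_t = s, A_t = a ]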


 'R' means the expected reward when the agent takes action a in state s. One thing to note here is that the reward is not deterministic. For example, if the agent tries to step forward in a strongly windy environment, the wind may prevent it from going forward and the agent may end up stepping sideways instead.
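In the same notation:

  R^a_s = E[ R_{t+1} | S_t = s, A_t = a ]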


 So the definition of the value of a state given in the previous post has to change, because the value of a state now depends on what the agent does (its actions). The agent has a 'policy' (π) for selecting an action.

A policy gives the probability that the agent will take each action in a given state.
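Formally:

  π(a|s) = P[ A_t = a | S_t = s ]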


So, when following a policy π, the 'state value function' v(s) is defined as follows.
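Using the return G_t = R_{t+1} + γR_{t+2} + γ^2 R_{t+3} + ..., the standard definition (the subscript π marks that the agent follows policy π) is:

  v_π(s) = E_π[ G_t | S_t = s ]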



Naturally, we can then ask, 'Can we evaluate each action as well?' So, when following a policy π, the 'state-action value function' q(s,a) is defined as follows.
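In the same way:

  q_π(s,a) = E_π[ G_t | S_t = s, A_t = a ]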


Prediction and Control

 What does it mean to solve an MDP problem? Prediction and control are the two tasks of interest here.

 Prediction is the task of evaluating the value of each state when a policy is given.

 Control is the task of finding the optimal policy (π*).

And the value function obtained when the agent follows the optimal policy is called the 'optimal value function'.
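That is:

  v*(s) = max_π v_π(s),   q*(s,a) = max_π q_π(s,a)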




Bellman Equation

 The Bellman equation is a tool for evaluating values in an MDP by defining a recursive relation between the value at time t and the value at time t+1.


Bellman Expectation Equation

  Bellman equation derivation. 
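A sketch of the usual derivation, starting from the return G_t = R_{t+1} + γG_{t+1}:

  v_π(s) = E_π[ G_t | S_t = s ]
         = E_π[ R_{t+1} + γ v_π(S_{t+1}) | S_t = s ]
         = Σ_a π(a|s) ( R^a_s + γ Σ_{s'} P^a_{ss'} v_π(s') )

and similarly for the state-action value function:

  q_π(s,a) = R^a_s + γ Σ_{s'} P^a_{ss'} Σ_{a'} π(a'|s') q_π(s',a')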

   

  

  The equations above can be used in problems where we have full information about the MDP (transition probabilities and rewards). This way of solving is called 'planning' or 'model-based solving'.



Bellman Optimality Equation
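The optimality equations replace the expectation over the policy with a max over actions:

  v*(s) = max_a ( R^a_s + γ Σ_{s'} P^a_{ss'} v*(s') )
  q*(s,a) = R^a_s + γ Σ_{s'} P^a_{ss'} max_{a'} q*(s',a')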



 In the expectation equation there are two probabilistic factors: the policy (π) and the transition probability.
But in the optimality equation there is no probability over the policy (i.e., over action selection), because the agent always selects the optimal action.
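Equivalently, once q* is known, an optimal policy simply acts greedily with respect to it:

  π*(a|s) = 1 if a = argmax_{a'} q*(s,a'), and 0 otherwise.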



Policy Iteration

 Consider a 4x4 grid world. The agent has 4 actions (step east, west, south, or north), each step gives the agent a reward of -1, and we know the MDP fully (reward: -1, transition probability: 1, i.e., deterministic).
 As this problem is small and simple, we can use a tabular method. Let's solve it using policy iteration, which uses the Bellman expectation equation. Policy iteration consists of policy evaluation and policy improvement; a minimal code sketch of the whole procedure follows the steps below.
     4x4 grid world where s15 is the terminal state.


 1. Initialize a table representing the value of each state with all zeros.
    


 2. Update all state values in the table using the Bellman expectation equation (this is called policy evaluation).

 Progress of evaluating the value of s0. (If the agent tries to move off the grid at the edge of the world, it stays in place and still gets the -1 reward.)

  Result of the first evaluation. (This process can be repeated several more times: iterative policy evaluation.)


 3. Update the policy with the result in a greedy way (select the move to the neighboring state with the highest value). (This is called policy improvement.)


 4. Repeat steps 2 and 3 sufficiently many times, and we get a result like the one below.
 
     In this case, the policy is the optimal policy, and if we keep evaluating the values indefinitely, they converge.
Finally, we have solved an MDP for the first time: we have obtained the optimal values and the optimal policy. In other words, we have solved both the 'prediction' and the 'control' problems.
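As a rough sketch of the procedure above (the function names, the number of iterations, and the tie-breaking rule are my own choices, not from this post), policy iteration on this 4x4 grid world could look like the Python code below. It assumes s15 = (3,3) is the only terminal state, every step costs -1, transitions are deterministic, moving off the grid leaves the agent in place, and γ = 1.

import numpy as np

N = 4                                            # 4x4 grid world
GAMMA = 1.0                                      # no discounting in this example
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # north, south, west, east
TERMINAL = (3, 3)                                # s15 is the end point

def step(state, action):
    # Deterministic transition: moving off the grid keeps the agent in place,
    # and every move from a non-terminal state gives reward -1.
    if state == TERMINAL:
        return state, 0.0
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < N and 0 <= c < N):
        r, c = state
    return (r, c), -1.0

def policy_evaluation(policy, values, sweeps=5):
    # Iterative policy evaluation: apply the Bellman expectation equation.
    for _ in range(sweeps):
        new_values = np.zeros_like(values)
        for r in range(N):
            for c in range(N):
                if (r, c) == TERMINAL:
                    continue
                v = 0.0
                for a, prob in enumerate(policy[r, c]):
                    (nr, nc), reward = step((r, c), ACTIONS[a])
                    v += prob * (reward + GAMMA * values[nr, nc])
                new_values[r, c] = v
        values = new_values
    return values

def policy_improvement(values):
    # Greedy policy improvement: pick the action leading to the highest value.
    policy = np.zeros((N, N, len(ACTIONS)))
    for r in range(N):
        for c in range(N):
            q = np.zeros(len(ACTIONS))
            for a in range(len(ACTIONS)):
                (nr, nc), reward = step((r, c), ACTIONS[a])
                q[a] = reward + GAMMA * values[nr, nc]
            best = np.flatnonzero(q == q.max())
            policy[r, c, best] = 1.0 / len(best)   # split ties evenly
    return policy

# Start from an all-zero value table and a uniform random policy.
values = np.zeros((N, N))
policy = np.full((N, N, len(ACTIONS)), 1.0 / len(ACTIONS))

for _ in range(20):                                # repeat evaluation and improvement
    values = policy_evaluation(policy, values)
    policy = policy_improvement(values)

print(np.round(values, 1))                         # converges toward the optimal values

Running the loop longer only refines the values; the policy itself typically becomes optimal after a few evaluation and improvement rounds.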


