Reinforcement Learning Study - 3 (Value Iteration, using Bellman optimality equation)

 Value Iteration method (grid world)

 In the last post, we solved the grid world problem with the 'policy iteration' method, which uses the Bellman expectation equation. This time, we'll solve the same problem with the 'value iteration' method, which uses the Bellman optimality equation.




The Bellman optimality equation (used as the value iteration update):

v_{k+1}(s) = max_a [ R(s, a) + γ · Σ_{s'} P(s'|s, a) · v_k(s') ]


1. Like the previous method, initialize all values of the table to 0.

2. Evaluate each state with the Bellman optimality equation, as below

    (example of s0).


3. Repeat step 2 until the values stop changing.

After repeating the process, the value table converges as shown above. Each value in the table represents the number of steps left to the goal when following the optimal path, and in this case the optimal policy can be obtained greedily, by moving to the neighboring state with the best value. A minimal sketch of this procedure is shown below.
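
To make steps 1-3 concrete, here is a minimal Python sketch of value iteration on a small deterministic grid world. The grid size, the goal position and the reward of -1 per move are assumptions for illustration, not the exact setup used in the original code.

import numpy as np

# Minimal value iteration sketch on a small deterministic grid world.
# The grid size, the reward of -1 per move and the goal position are illustrative assumptions.
N = 4                                            # 4x4 grid
GOAL = (3, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right
GAMMA = 1.0

def step(state, action):
    """Deterministic transition: bumping into a wall keeps the agent in place."""
    x = min(max(state[0] + action[0], 0), N - 1)
    y = min(max(state[1] + action[1], 0), N - 1)
    return (x, y)

value = np.zeros((N, N))                         # 1. initialize all values to 0

for _ in range(100):                             # 3. repeat the backup until the values converge
    new_value = np.zeros((N, N))
    for x in range(N):
        for y in range(N):
            if (x, y) == GOAL:
                continue                         # the terminal state keeps value 0
            # 2. Bellman optimality backup: max over actions of (reward + gamma * V(s'))
            new_value[x, y] = max(-1 + GAMMA * value[step((x, y), a)] for a in ACTIONS)
    delta = np.max(np.abs(new_value - value))
    value = new_value
    if delta < 1e-6:
        break

print(value)   # -value[x, y] is the number of steps from (x, y) to the goal on the optimal path

The greedy policy is then just the argmax over the same one-step lookahead at each state.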






Prediction in a model-free environment

 So far, we've dealt with MDP problems where we know everything about the environment (model-based MDP). But in the real world, there are few such problems. There are two methods to evaluate states (prediction) in a model-free problem: MC (Monte Carlo) and TD (Temporal Difference). Here, the 'model' means the reward function (R) and the transition probability function (P). In other words, we don't know R and P in a model-free MDP.
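
To make the distinction concrete, here is a tiny sketch (with made-up names and a toy deterministic model) of what 'knowing the model' means: in the model-based setting P and R are tables we can read, while in the model-free setting the agent can only act and observe samples.

# Toy illustration (hypothetical names): model-based vs. model-free access to an MDP.

# Model-based: P and R are known tables, so planning (e.g. value iteration) is possible.
P = {("s0", "right"): "s1", ("s1", "right"): "goal"}   # transition function (known)
R = {("s0", "right"): -1.0, ("s1", "right"): 0.0}      # reward function (known)

# Model-free: the agent cannot read P and R; it can only act and observe what comes back.
def env_step(state, action):
    """Hidden from the agent: it only sees the returned sample."""
    return P[(state, action)], R[(state, action)]

next_state, reward = env_step("s0", "right")           # one experience sample (s, a, r, s')
print(next_state, reward)                              # s1 -1.0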


Monte Carlo Prediction

 In the Monte Carlo method, the agent learns about the environment from whole episodes of direct interaction (it cannot plan). So, to update the state values, each episode must end first. When an episode ends, the agent evaluates the states it visited using the cumulative rewards.

 Evaluating from an episode, each state value can be updated with the following equations:

G_t = R_{t+1} + γ · R_{t+2} + γ² · R_{t+3} + ...

and

V(s) ← (1 − α) · V(s) + α · G_t

meaning that a certain percentage (α) of the new return is mixed into the original value. Writing it differently, it becomes:

V(s) ← V(s) + α · (G_t − V(s))
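
As a concrete, made-up example of this update: with α = 0.1, a current estimate V(s) = 0.5 and a sampled return G_t = 1.0, the new value becomes 0.5 + 0.1 × (1.0 − 0.5) = 0.55, i.e. the estimate moves 10% of the way toward the new return.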


I've tested whether the MC method could solve the prediction problem.
The equation above can be written in pseudocode like this:

...
for (each learning episode)
    while (the episode has not ended)
        record the episode history (an array of (state, reward) pairs)

    # when the episode ends, check the history array from the end point
    g_t = 0
    for (each entry of the history, from the end)
        value[x][y] = value[x][y] + alpha * (g_t - value[x][y])
        g_t = reward + gamma * g_t   # this order assumes the recorded reward is the one received on entering (x, y)
...
'g_t' refers to the cumulative discounted reward, i.e. the return.
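
For reference, here is a minimal runnable Python sketch of the same every-visit MC prediction idea on a toy grid world. The environment (grid size, rewards, random policy) and the hyperparameters are assumptions for illustration; note that because the recorded reward here is the one received after leaving a state, the return is accumulated before the value update.

import random

# Minimal every-visit Monte Carlo prediction sketch on a small grid world.
# The grid size, rewards, random policy and hyperparameters are illustrative assumptions.
N, GOAL = 4, (3, 3)
ALPHA, GAMMA = 0.1, 0.9
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # up, down, left, right
value = [[0.0] * N for _ in range(N)]

def step(state, action):
    """Deterministic move; reward -1 per step, +10 when the goal is reached (episode ends)."""
    x = min(max(state[0] + action[0], 0), N - 1)
    y = min(max(state[1] + action[1], 0), N - 1)
    reward = 10.0 if (x, y) == GOAL else -1.0
    return (x, y), reward, (x, y) == GOAL

for _ in range(1000):                             # number of learning episodes
    state, done, history = (0, 0), False, []
    while not done:                               # an episode must end before any update
        next_state, reward, done = step(state, random.choice(ACTIONS))
        history.append((state, reward))           # reward received after leaving `state`
        state = next_state

    g_t = 0.0                                     # cumulative discounted reward (the return)
    for (x, y), reward in reversed(history):
        g_t = reward + GAMMA * g_t                # the return for (x, y) includes its own reward here
        value[x][y] += ALPHA * (g_t - value[x][y])

for row in value:
    print(["%.2f" % v for v in row])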



 The result is as below.

The result looks pretty good.


