Reinforcement Learning Study - 5

Control in a Model-free MDP

 From now on, we are going to learn about three methods for solving the control problem: MC Control, SARSA, and Q-Learning.

 The policy iteration method we learned before works well for model-based problems. In a model-free MDP, however, we can't use it: policy iteration relies on the Bellman expectation equation, which requires knowledge of the model. Also, since the agent doesn't know which state an action will lead to, it can't build a greedy policy over state values.


MC Control

 So here is the solution: we make a few changes to policy iteration.

1. Use MC instead of the Bellman expectation equation. With MC, we can evaluate each state empirically from sampled episodes.

2. Use Q instead of V. Even though the agent doesn't know which state each action leads to, it can simply select the action with the highest action value Q(s, a).

3. Explore with probability epsilon. With a purely greedy policy, if one action is evaluated just slightly higher than the others, the agent will always select it, even though a better path to the goal may exist. So we need a way to explore other actions. We will use the 'decaying epsilon-greedy' method, where the exploration probability (epsilon) is relatively large at the beginning of training and gradually decreases. (A small sketch of this action selection is shown right below.)
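For example, decaying epsilon-greedy action selection over a Q table could look like the sketch below. The table shape and the epsilon schedule follow the 5x7 grid world and the comments later in this post; the helper names are my own.

import random
import numpy as np

q_table = np.zeros((5, 7, 4))   # 5x7 grid, 4 actions (0: left, 1: up, 2: right, 3: down)

def select_action(state, eps):
    x, y = state
    if random.random() < eps:
        return random.randint(0, 3)              # explore: random action
    return int(np.argmax(q_table[x, y, :]))      # exploit: current best action

def decay_epsilon(eps):
    return max(eps - 0.01, 0.1)                  # e.g. 90% -> 10%, -1% per episode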


Using these solutions, we can write pseudocode like the one below.

Think about a new grid world (5x7).



...

for (number of episodes)

    while (episode not in terminal state)

        agent takes an action  # with prob. 1 - epsilon: follow the greedy policy / with prob. epsilon: select a random action

        reward is received, state transition takes place

        record history (state, action, reward, next_state)


    # the episode ends, update the q value table

    g_t = 0

    for transition in history (in reverse order)

        g_t = reward + gamma * g_t   # return from this transition

        q_table[x, y, a] = q_table[x, y, a] + alpha * (g_t - q_table[x, y, a])


    decay_epsilon()  # 90% -> 10% (-1% per episode)

...
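To make the pseudocode concrete, here is a minimal runnable sketch in Python. The grid world details (start at (0, 0), goal at (4, 6), reward -1 per step, no walls) are my own assumptions for illustration; only the Q table update and the decaying epsilon-greedy policy follow the pseudocode above.

import random
import numpy as np

class GridWorld:
    # assumed 5x7 grid: start at (0, 0), goal at (4, 6), reward -1 per step, no walls
    def __init__(self):
        self.x, self.y = 0, 0

    def reset(self):
        self.x, self.y = 0, 0
        return (self.x, self.y)

    def step(self, a):
        # 0: left, 1: up, 2: right, 3: down
        if a == 0:   self.y = max(self.y - 1, 0)
        elif a == 1: self.x = max(self.x - 1, 0)
        elif a == 2: self.y = min(self.y + 1, 6)
        else:        self.x = min(self.x + 1, 4)
        done = (self.x, self.y) == (4, 6)
        return (self.x, self.y), -1, done

q_table = np.zeros((5, 7, 4))
alpha, gamma, eps = 0.01, 1.0, 0.9

def select_action(state):
    x, y = state
    if random.random() < eps:
        return random.randint(0, 3)              # explore
    return int(np.argmax(q_table[x, y, :]))      # exploit

env = GridWorld()
for episode in range(1000):
    state, done, history = env.reset(), False, []
    while not done:
        action = select_action(state)
        next_state, reward, done = env.step(action)
        history.append((state, action, reward, next_state))
        state = next_state

    # the episode ends: update the q value table backwards through the history
    g_t = 0.0
    for (s, a, r, s_next) in reversed(history):
        x, y = s
        g_t = r + gamma * g_t
        q_table[x, y, a] += alpha * (g_t - q_table[x, y, a])

    eps = max(eps - 0.01, 0.1)                   # decaying epsilon: 90% -> 10%

With gamma = 1 and a reward of -1 per step, the greedy policy read off this table should converge toward the shortest path to the goal.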





Result of the test (each element of the table shows the action with the highest Q value for that cell)

(0: left, 1: up, 2: right, 3: down)
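For reference, a table like this can be read off the learned Q table with np.argmax over the action axis. A small sketch; the random table below is just a stand-in, since the learned values aren't reproduced here.

import numpy as np

q_table = np.random.rand(5, 7, 4)            # stand-in for a learned (5, 7, 4) Q table
greedy_actions = np.argmax(q_table, axis=2)  # best action index for every cell
print(greedy_actions)                        # 0: left, 1: up, 2: right, 3: down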




TD Control - SARSA

 Can we use TD instead of MC? Yes. Instead of waiting until an episode ends, we update the Q value after every step using the next state and the next action, which is where the name SARSA (State, Action, Reward, next State, next Action) comes from.

The pseudocode can be written as follows.

...

for (number of episodes)

    while (episode not in terminal state)

        select an action, take a step, then select the next action

        # update with the information (state, action, reward, next_state, next_action)

        q[x, y, a] = q[x, y, a] + alpha * (reward + gamma * q[next_x, next_y, next_action] - q[x, y, a])

...
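Here is a matching runnable sketch of SARSA, with the same assumed 5x7 grid world as in the MC sketch above (start (0, 0), goal (4, 6), reward -1 per step). The key difference is that the Q value is updated at every step, using the next action the agent actually selects.

import random
import numpy as np

# same assumed grid world as in the MC sketch: 5x7, start (0, 0), goal (4, 6), reward -1 per step
def step(x, y, a):
    if a == 0:   y = max(y - 1, 0)   # left
    elif a == 1: x = max(x - 1, 0)   # up
    elif a == 2: y = min(y + 1, 6)   # right
    else:        x = min(x + 1, 4)   # down
    return x, y, -1, (x, y) == (4, 6)

q = np.zeros((5, 7, 4))
alpha, gamma, eps = 0.1, 1.0, 0.9

def select_action(x, y):
    if random.random() < eps:
        return random.randint(0, 3)      # explore
    return int(np.argmax(q[x, y, :]))    # exploit

for episode in range(1000):
    x, y, done = 0, 0, False
    a = select_action(x, y)
    while not done:
        nx, ny, reward, done = step(x, y, a)
        na = select_action(nx, ny)
        # SARSA update: bootstrap from the action actually selected in the next state
        target = reward if done else reward + gamma * q[nx, ny, na]
        q[x, y, a] += alpha * (target - q[x, y, a])
        x, y, a = nx, ny, na
    eps = max(eps - 0.01, 0.1)           # decaying epsilon, as in MC control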






The result of the test also looks good (0: left, 1: up, 2: right, 3: down).


