Reinforcement Learning Study - 5

Control in a Model-free MDP

 From now on, we are going to learn about three methods for solving the control problem: MC Control, SARSA, and Q-Learning.

 The policy iteration method we learned before works well for model-based problems. In a model-free MDP, however, we can't use it: policy iteration relies on the Bellman expectation equation, which requires knowledge of the model. Also, since the agent doesn't know which state an action will lead to, it can't build a greedy policy over state values.


MC Control

 So here is the solution: we make a few changes to policy iteration.

1. Use MC instead of the Bellman expectation equation. With MC, we can evaluate each state empirically from sampled episodes.

2. Use Q instead of V. Even though the agent doesn't know which state each action leads to, it can simply select the action with the highest action value Q(s, a).

3. Explore with probability epsilon. With a purely greedy policy, if one action is evaluated just slightly higher than the others, the agent will always select it, even though a better path to the goal may exist. So we need a way to explore other actions. We will use the 'decaying epsilon-greedy' method, where the exploration probability (epsilon) is relatively large at the beginning of training and gradually decreases. (A small sketch of this action selection is shown right below.)
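For example, decaying epsilon-greedy action selection over a Q table could look like the sketch below. The table shape and the epsilon schedule follow the 5x7 grid world and the comments later in this post; the helper names are my own.

import random
import numpy as np

q_table = np.zeros((5, 7, 4))   # 5x7 grid, 4 actions (0: left, 1: up, 2: right, 3: down)

def select_action(state, eps):
    x, y = state
    if random.random() < eps:
        return random.randint(0, 3)              # explore: random action
    return int(np.argmax(q_table[x, y, :]))      # exploit: current best action

def decay_epsilon(eps):
    return max(eps - 0.01, 0.1)                  # e.g. 90% -> 10%, -1% per episode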


Using these solutions, we can write pseudocode like the one below.

Think about a new grid world (5x7).



...

for (number of episodes)

    while (episode not in terminal state)

        agent takes an action  # with prob. 1 - epsilon: follow the greedy policy / with prob. epsilon: select a random action

        reward is received, state transition takes place

        record history (state, action, reward, next_state)


    # the episode ends, update the q value table

    g_t = 0

    for transition in history (in reverse order)

        g_t = reward + gamma * g_t   # return from this transition

        q_table[x, y, a] = q_table[x, y, a] + alpha * (g_t - q_table[x, y, a])


    decay_epsilon()  # 90% -> 10% (-1% per episode)

...
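To make the pseudocode concrete, here is a minimal runnable sketch in Python. The grid world details (start at (0, 0), goal at (4, 6), reward -1 per step, no walls) are my own assumptions for illustration; only the Q table update and the decaying epsilon-greedy policy follow the pseudocode above.

import random
import numpy as np

class GridWorld:
    # assumed 5x7 grid: start at (0, 0), goal at (4, 6), reward -1 per step, no walls
    def __init__(self):
        self.x, self.y = 0, 0

    def reset(self):
        self.x, self.y = 0, 0
        return (self.x, self.y)

    def step(self, a):
        # 0: left, 1: up, 2: right, 3: down
        if a == 0:   self.y = max(self.y - 1, 0)
        elif a == 1: self.x = max(self.x - 1, 0)
        elif a == 2: self.y = min(self.y + 1, 6)
        else:        self.x = min(self.x + 1, 4)
        done = (self.x, self.y) == (4, 6)
        return (self.x, self.y), -1, done

q_table = np.zeros((5, 7, 4))
alpha, gamma, eps = 0.01, 1.0, 0.9

def select_action(state):
    x, y = state
    if random.random() < eps:
        return random.randint(0, 3)              # explore
    return int(np.argmax(q_table[x, y, :]))      # exploit

env = GridWorld()
for episode in range(1000):
    state, done, history = env.reset(), False, []
    while not done:
        action = select_action(state)
        next_state, reward, done = env.step(action)
        history.append((state, action, reward, next_state))
        state = next_state

    # the episode ends: update the q value table backwards through the history
    g_t = 0.0
    for (s, a, r, s_next) in reversed(history):
        x, y = s
        g_t = r + gamma * g_t
        q_table[x, y, a] += alpha * (g_t - q_table[x, y, a])

    eps = max(eps - 0.01, 0.1)                   # decaying epsilon: 90% -> 10%

With gamma = 1 and a reward of -1 per step, the greedy policy read off this table should converge toward the shortest path to the goal.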





Result of the test (each element of the table shows the action with the highest Q value for that cell)

(0: left, 1: up, 2: right, 3: down)
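For reference, a table like this can be read off the learned Q table with np.argmax over the action axis. A small sketch; the random table below is just a stand-in, since the learned values aren't reproduced here.

import numpy as np

q_table = np.random.rand(5, 7, 4)            # stand-in for a learned (5, 7, 4) Q table
greedy_actions = np.argmax(q_table, axis=2)  # best action index for every cell
print(greedy_actions)                        # 0: left, 1: up, 2: right, 3: down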




TD Control - SARSA

 Can we use TD instead of MC? Yes. Instead of waiting until an episode ends, we update the Q value after every step using the next state and the next action, which is where the name SARSA (State, Action, Reward, next State, next Action) comes from.

The pseudocode can be written as follows.

...

for (number of episodes)

    while (episode not in terminal state)

        select an action, take a step, then select the next action

        # update with the information (state, action, reward, next_state, next_action)

        q[x, y, a] = q[x, y, a] + alpha * (reward + gamma * q[next_x, next_y, next_action] - q[x, y, a])

...
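Here is a matching runnable sketch of SARSA, with the same assumed 5x7 grid world as in the MC sketch above (start (0, 0), goal (4, 6), reward -1 per step). The key difference is that the Q value is updated at every step, using the next action the agent actually selects.

import random
import numpy as np

# same assumed grid world as in the MC sketch: 5x7, start (0, 0), goal (4, 6), reward -1 per step
def step(x, y, a):
    if a == 0:   y = max(y - 1, 0)   # left
    elif a == 1: x = max(x - 1, 0)   # up
    elif a == 2: y = min(y + 1, 6)   # right
    else:        x = min(x + 1, 4)   # down
    return x, y, -1, (x, y) == (4, 6)

q = np.zeros((5, 7, 4))
alpha, gamma, eps = 0.1, 1.0, 0.9

def select_action(x, y):
    if random.random() < eps:
        return random.randint(0, 3)      # explore
    return int(np.argmax(q[x, y, :]))    # exploit

for episode in range(1000):
    x, y, done = 0, 0, False
    a = select_action(x, y)
    while not done:
        nx, ny, reward, done = step(x, y, a)
        na = select_action(nx, ny)
        # SARSA update: bootstrap from the action actually selected in the next state
        target = reward if done else reward + gamma * q[nx, ny, na]
        q[x, y, a] += alpha * (target - q[x, y, a])
        x, y, a = nx, ny, na
    eps = max(eps - 0.01, 0.1)           # decaying epsilon, as in MC control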






The result of the test also looks good (0: left, 1: up, 2: right, 3: down).


