Reinforcement Learning Study - 4

 Temporal Difference Method (Prediction)

  In the Monte Carlo method, the agent can learn only after an episode ends, but in the real world there are few situations like that. In Temporal Difference (TD) learning, the agent can learn before the episode ends by updating the current value estimate toward the estimate of the next state.

 So the equation to update a value in TD is:

V(S_t) ← V(S_t) + α * ( R_{t+1} + γ * V(S_{t+1}) − V(S_t) )

(it uses the value estimate of the next state instead of the cumulative reward)

This can be written in pseudocode like below:

...
for episode in range(num_episodes):
    # at each step of the episode, update the value of the current state (x, y)
    value[x][y] = value[x][y] + alpha * (reward + gamma * value[x_prime][y_prime] - value[x][y])
...
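
For reference, here is a minimal runnable sketch of TD(0) prediction. The 5x5 gridworld, the random policy, and the step() helper below are assumptions for illustration; only the update line corresponds to the pseudocode above.

import random

# A minimal TD(0) prediction sketch (assumed setup): a 5x5 gridworld where the
# agent moves randomly, every move gives reward -1, and the bottom-right cell is terminal.
SIZE = 5
GOAL = (SIZE - 1, SIZE - 1)
ALPHA, GAMMA, EPISODES = 0.1, 0.9, 1000

value = [[0.0] * SIZE for _ in range(SIZE)]

def step(x, y):
    # apply a random move, clipped to the grid; return (next_x, next_y, reward, done)
    dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
    nx = min(max(x + dx, 0), SIZE - 1)
    ny = min(max(y + dy, 0), SIZE - 1)
    return nx, ny, -1.0, (nx, ny) == GOAL

for episode in range(EPISODES):
    x, y = 0, 0
    done = False
    while not done:
        x_prime, y_prime, reward, done = step(x, y)
        # TD(0): move value[x][y] toward the one-step target (a terminal state has value 0)
        target = reward + GAMMA * value[x_prime][y_prime] * (not done)
        value[x][y] = value[x][y] + ALPHA * (target - value[x][y])
        x, y = x_prime, y_prime

print(value[0][0])  # estimated value of the start state under the random policy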

 Result of test (TD)


MC vs TD

 Time of Learning

MC can only be used in episodic MDPs, where every episode is guaranteed to end.

TD can be used in non-episodic (continuing) MDPs as well as episodic ones.


 Bias

MC uses the full return G_t as its target.

TD uses the one-step TD target R_{t+1} + γ * V(S_{t+1}).

The cumulative reward G_t used by MC is made of actually observed rewards, so G_t is an unbiased estimate of the true value.

But TD estimates V(S_t) using V(S_{t+1}). In other words, TD uses V(S_{t+1}) (an estimate of the value) instead of G_t (the real return), and there is no guarantee that this estimate equals the true value. Therefore the TD target R_{t+1} + γ * V(S_{t+1}) is biased.
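
As a small illustration (the function names and the recorded-episode representation are assumptions, not from the original post), the two targets differ only in how the tail of the return is handled:

# Assumed convention: states[t] is the state at step t, rewards[t] is the reward
# received after leaving it, and V maps a state to its current value estimate.

def mc_target(rewards, t, gamma):
    # full return G_t: built only from actually observed rewards, so unbiased
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

def td_target(rewards, states, t, gamma, V):
    # one-step target R_{t+1} + gamma * V(S_{t+1}): biased, because V is itself an estimate
    return rewards[t] + gamma * V[states[t + 1]]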


 Variance

 In the MC method, the agent updates values only after the episode ends, so the target accumulates the randomness of every step from the start point to the end point. In TD, the agent can update values after just one step, so only one step of randomness enters the target. As a result, TD has smaller variance than MC.
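
A toy simulation (all numbers here are assumed, not from the post) makes this concrete: the MC target accumulates the noise of every step, while the TD target contains only one step of noise.

import random
import statistics

GAMMA, STEPS, RUNS = 0.9, 10, 10000
V_NEXT = 0.0  # assumed (fixed) value estimate of the successor state

mc_targets, td_targets = [], []
for _ in range(RUNS):
    rewards = [random.choice([-1, 0, 1]) for _ in range(STEPS)]  # one noisy episode
    mc_targets.append(sum(GAMMA ** k * r for k, r in enumerate(rewards)))  # uses every random reward
    td_targets.append(rewards[0] + GAMMA * V_NEXT)                         # uses only the first one

print(statistics.variance(mc_targets), statistics.variance(td_targets))  # MC variance is clearly larger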



n-step TD

 Looking at the TD target R_{t+1} + γ * V(S_{t+1}) again, we do not necessarily have to bootstrap after just one step; we can also use an n-step estimate to update the values of earlier states. The one-step version above is called TD(0), and the n-step target is

G_t^(n) = R_{t+1} + γ * R_{t+2} + ... + γ^(n-1) * R_{t+n} + γ^n * V(S_{t+n})


 (If n is extended all the way to the end of the episode, n-step TD becomes identical to MC.)

 The larger n is, the more the n-step TD method behaves like MC: its target has larger variance but smaller bias.
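
A sketch of the n-step target (same assumed trajectory representation as in the bias example above; the terminal state is not stored, so len(states) == len(rewards)):

def n_step_target(rewards, states, t, n, gamma, V):
    # G_t^(n) = R_{t+1} + gamma*R_{t+2} + ... + gamma^(n-1)*R_{t+n} + gamma^n * V(S_{t+n})
    n = min(n, len(rewards) - t)                       # truncate at the end of the episode
    G = sum(gamma ** k * rewards[t + k] for k in range(n))
    if t + n < len(states):                            # bootstrap only if S_{t+n} exists
        G = G + gamma ** n * V[states[t + n]]
    return G

# n = 1 gives the TD(0) target; n large enough to reach the end of the episode gives the MC return G_t.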


