Reinforcement Learning Study - 4
Temporal Difference Method (Prediction)
In the Monte Carlo (MC) method, our agent can learn only after an episode ends, but in the real world there are few situations like that. In Temporal Difference (TD) learning, the agent can learn before an episode ends by updating the current value estimate toward the estimate of the next state.
So the equation to update the value in TD (the TD(0) update) is:

V(S_t) <- V(S_t) + α [ R_{t+1} + γ V(S_{t+1}) - V(S_t) ]
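A minimal sketch of this update rule in Python. The environment interface (reset/step), the policy function, and the step size alpha and discount gamma are assumptions for illustration, not part of the original post:

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=0.99):
    """Tabular TD(0) prediction: update V(s) after every single step."""
    V = defaultdict(float)  # value estimates, default 0 for unseen states
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD target bootstraps from the current estimate of the next state
            td_target = reward + gamma * V[next_state] * (not done)
            # Move V(s) a small step toward the TD target
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V
```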
MC vs TD
Time of Learning
MC can only be used in episodic MDPs, where every episode is guaranteed to end.
TD can be used in continuing (non-episodic) MDPs as well as episodic ones, as in the sketch below.
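For contrast, here is a sketch of every-visit Monte Carlo prediction using the same hypothetical env/policy interface as above. Note that nothing can be updated until the episode terminates, which is why MC needs an episodic MDP:

```python
from collections import defaultdict

def mc_prediction(env, policy, num_episodes, alpha=0.1, gamma=0.99):
    """Every-visit Monte Carlo prediction: update only after the episode ends."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        # Generate a full episode before any update is possible
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state
        # Walk backwards, accumulating the real return G for each visited state
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            V[state] += alpha * (G - V[state])
    return V
```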
Bias
The cumulative reward G_t used by MC is the real sampled return, so G_t is an unbiased estimate of the true value.
But TD estimates V(S_t) from V(S_{t+1}). In fact, TD uses R_{t+1} + γ V(S_{t+1}) (an estimate of the value) instead of G_t (the real return). This means there is no guarantee that this estimate will converge to the real value. In other words, R_{t+1} + γ V(S_{t+1}) (the TD target) is biased.
Variance
In the MC method, an agent updates values only after the end of an episode, so from the start point to the end point many stochastic factors (actions, transitions, rewards) accumulate into the return. In TD, an agent can update the value after just one step, so the target depends on far fewer random quantities. Therefore TD has smaller variance than MC.
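A rough illustration of this variance gap on a toy random-walk chain. The chain, the true-value table, and the sample counts are all assumptions made up for this sketch; it simply draws many samples of the MC return G and of the one-step TD target from the same start state and compares their spread:

```python
import random
import statistics

# Toy random walk: states 1..5, terminals at 0 and 6.
# Reward is +1 only on reaching state 6; gamma = 1.
TRUE_V = {s: s / 6 for s in range(1, 6)}  # known true values for this chain
TRUE_V[0] = TRUE_V[6] = 0.0

def rollout_return(start=3):
    """Sample the full Monte Carlo return G from `start` (a whole episode)."""
    s = start
    while s not in (0, 6):
        s += random.choice((-1, 1))
    return 1.0 if s == 6 else 0.0

def td_target_sample(start=3):
    """Sample a one-step TD target R + V(next) using the true values."""
    s_next = start + random.choice((-1, 1))
    reward = 1.0 if s_next == 6 else 0.0
    return reward + TRUE_V[s_next]

mc_samples = [rollout_return() for _ in range(10_000)]
td_samples = [td_target_sample() for _ in range(10_000)]
print("MC return std: ", statistics.pstdev(mc_samples))  # around 0.5
print("TD target std: ", statistics.pstdev(td_samples))  # around 0.17
```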