Reinforcement Learning Study - 1

 


 This post summarizes my personal studies; its purpose is to reorganize what I have learned.

And I'm not good at English, so there may be some incorrect information or awkward expressions. If you find something like that, I would appreciate it if you could let me know.

 

 

 What is Reinforcement Learning?

 The learning process of trial and error that maximizes cumulative return in a sequential decision-making problem.

 It is similar to the way a human baby learns: babies learn what is good and what is bad by taking (seemingly) random actions.


Difference from Supervised/Unsupervised Learning

 Reinforcement learning is different from supervised learning, which learns from a supervisor (labeled data): in reinforcement learning there are no correct labels, only a reward signal obtained by interacting with the environment. It is also different from unsupervised learning, which finds hidden structure in unlabeled data, because reinforcement learning aims to maximize reward rather than to discover structure.



Markov Decision Process

 What does 'sequential decision-making problem' mean? We can define such problems with an MDP (Markov Decision Process). Before learning more about MDPs, it helps to first understand the Markov Process and the Markov Reward Process.

 A Markov Process is a process in which state transitions occur according to a probability table until a terminal state is reached. It consists of a set of states and a transition probability matrix; each element of the matrix is the probability of transitioning from one state to another.

 Another important feature of a Markov process is the 'Markov property': only the current state S(t) affects the next state S(t+1). In other words, the states before the present (t-1, t-2, ...) do not affect the next state.
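As a sketch, such a state-transition walk can be simulated with a hand-made transition table. The states and probabilities below are hypothetical, chosen only for illustration:

```python
import random

# A toy Markov process (hypothetical states and probabilities).
# Each row gives P(next state | current state); rows sum to 1.
transitions = {
    "sleep": {"study": 0.6, "sleep": 0.4},
    "study": {"study": 0.3, "rest": 0.5, "end": 0.2},
    "rest":  {"study": 0.7, "end": 0.3},
}

def sample_episode(start="sleep", terminal="end", seed=None):
    """Walk the chain until the terminal state is reached.

    The Markov property is visible here: the next state is drawn
    using only the current state, never the earlier history.
    """
    rng = random.Random(seed)
    state, episode = start, [start]
    while state != terminal:
        nexts, probs = zip(*transitions[state].items())
        state = rng.choices(nexts, weights=probs)[0]
        episode.append(state)
    return episode

print(sample_episode(seed=0))
```

Running it a few times with different seeds produces different episodes, but every episode is generated one step at a time from the current state alone.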


A Markov Reward Process (MRP) is a Markov Process with a 'reward': a reward is obtained when a state transitions to the next state. In this process, we can consider the sum of rewards, the 'return'. The return G(t) is defined as below.

    G(t) = R(t+1) + γR(t+2) + γ^2R(t+3) + ...    (γ is the discount rate; R(t) is the reward received at time t)

 The return does not include past rewards. Reinforcement learning is a learning process that maximizes this return.
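The definition of G(t) translates directly into code. In this minimal sketch, `rewards` lists the rewards received from time t+1 onward (the numbers in the example call are arbitrary); rewards from the past are simply not part of the sum:

```python
def discounted_return(rewards, gamma=0.9):
    """G(t) = R(t+1) + γ·R(t+2) + γ^2·R(t+3) + ...

    `rewards` holds the rewards from time t+1 onward.
    """
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1, 1, 1], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With γ < 1, rewards further in the future contribute less, which also keeps the sum finite for infinitely long episodes.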

  In an MRP, we can consider the value of a state: the greater the expected rewards, the greater the value. So the value of a state, V(s), can be expressed in terms of the return as below.

  V(s) = E[G(t) | S(t) = s]    (E denotes expectation.)
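Since V(s) is an expectation over returns, it can be approximated by sampling many episodes from s and averaging their returns. Below is a sketch on a hypothetical two-state MRP (the states, rewards, and probabilities are my own example, not from the text): from "study" you either keep studying (reward -1, probability 0.4) or reach the terminal state (reward +10, probability 0.6):

```python
import random

# Hypothetical MRP: (next_state, probability, reward) triples per state.
mrp_transitions = {
    "study": [("study", 0.4, -1.0), ("end", 0.6, 10.0)],
}

def sample_return(state="study", gamma=0.9, rng=random):
    """Sample one episode and accumulate its discounted return G(t)."""
    g, discount = 0.0, 1.0
    while state != "end":
        r, cum = rng.random(), 0.0
        for nxt, p, reward in mrp_transitions[state]:
            cum += p
            if r < cum:
                g += discount * reward
                discount *= gamma
                state = nxt
                break
    return g

def estimate_value(state="study", episodes=20_000, gamma=0.9, seed=0):
    """Approximate V(s) = E[G(t) | S(t) = s] by averaging sampled returns."""
    rng = random.Random(seed)
    return sum(sample_return(state, gamma, rng) for _ in range(episodes)) / episodes

print(estimate_value())
```

For this particular chain the value can also be solved exactly: V = 0.4·(-1 + 0.9·V) + 0.6·10 gives V = 8.75, and the Monte Carlo estimate lands close to that.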


 A Markov Decision Process (MDP) has one more thing: an 'agent'. The agent can take actions in the environment, so the transition probability and the reward are now defined differently from an MRP — they depend on the chosen action. In summary, an MDP consists of S (the set of states), A (the set of actions), P (the transition probability matrix), R (the reward function), and γ (the discount rate).
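The five ingredients S, A, P, R, γ can be collected into one small container. This is only a minimal sketch with made-up states and actions; note how, unlike in an MRP, both P and R are keyed on the (state, action) pair:

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: set          # S
    actions: set         # A
    transitions: dict    # P: (s, a) -> list of (next_state, probability)
    rewards: dict        # R: (s, a) -> expected reward
    gamma: float = 0.9   # γ

# Hypothetical two-state example.
mdp = MDP(
    states={"s0", "s1"},
    actions={"stay", "go"},
    transitions={
        ("s0", "stay"): [("s0", 1.0)],
        ("s0", "go"):   [("s1", 0.8), ("s0", 0.2)],
        ("s1", "stay"): [("s1", 1.0)],
        ("s1", "go"):   [("s0", 1.0)],
    },
    rewards={("s0", "stay"): 0.0, ("s0", "go"): 1.0,
             ("s1", "stay"): 0.0, ("s1", "go"): -1.0},
)
```

Keying P and R on (s, a) instead of s alone is exactly what the agent's actions add on top of an MRP.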




Reference :

 바닥부터 배우는 강화학습 (노승은)

 Reinforcement Learning: An Introduction (Richard S. Sutton, Andrew G. Barto)
