Markov Decision Process
Lesson 2 Markov Decision Process
.5 Markov Decision Process - 1: A process for making decisions with a single agent. STATES s: a set of tokens that represents every state one can be in. In the grid example, the state is which part of the grid I am in, written as (x,y) coordinates, so the state set is the entire grid minus the blocked states. MODEL: T(s,a,s') ~ Pr(s'|s,a)
.6 Markov Decision Process - 2:
ACTIONS: the things you can do in a particular state, here UP, DOWN, LEFT, RIGHT.
Actions can be written as a function of the state, A(s), or as a single set of actions A when the same actions are available everywhere.
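The states and actions above can be sketched in a few lines. This is only an illustration: the 4x3 grid size and the blocked cell at (2,2) are assumptions made for the example, not values given in these notes, and the names STATES, ACTIONS and A are hypothetical.

```python
# Hypothetical grid layout (assumed for illustration, not from the notes).
WIDTH, HEIGHT = 4, 3
BLOCKED = {(2, 2)}

# STATES: every (x, y) cell of the grid minus the blocked cells.
STATES = [
    (x, y)
    for x in range(1, WIDTH + 1)
    for y in range(1, HEIGHT + 1)
    if (x, y) not in BLOCKED
]

ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT"]


def A(s):
    """Actions available in state s; in this sketch every state allows all four."""
    return ACTIONS
```

Writing A as a function leaves room for states that allow fewer actions, even though this sketch returns the same list everywhere.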
MODEL, aka the transition model: describes the rules of the world, i.e. how to play the game.
The transition model is a function of three variables: state, action, and next state (aka state_prime, s').
s' can equal s, which means staying in place.
The transition model outputs the probability of ending up in s' given that one starts in s and takes action a.
Deterministic case: there is no noise, so every action executes with certainty (100%). Nondeterministic case: the action executes faithfully only 80% of the time, and the rest of the probability goes to other outcomes: 0.8, 0.1, 0.1.
The model describes the rules of the game. It also captures what happens if you do something: the physics of the world.
Transition models are probabilistic by nature.
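A minimal sketch of a transition model T(s,a,s') with the 0.8, 0.1, 0.1 noise described above. Several details are assumptions made for the example: the 4x3 grid with a blocked cell at (2,2), the choice that the two 0.1 outcomes are the directions at right angles to the intended one, and the rule that bumping into a wall or blocked cell leaves the agent where it is. The helper names move and SIDEWAYS are hypothetical.

```python
# Hypothetical grid layout (assumed for illustration).
WIDTH, HEIGHT = 4, 3
BLOCKED = {(2, 2)}
STATES = [
    (x, y)
    for x in range(1, WIDTH + 1)
    for y in range(1, HEIGHT + 1)
    if (x, y) not in BLOCKED
]

MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}

# Assumption: the noise sends the agent at right angles to the intended move.
SIDEWAYS = {
    "UP": ("LEFT", "RIGHT"),
    "DOWN": ("LEFT", "RIGHT"),
    "LEFT": ("UP", "DOWN"),
    "RIGHT": ("UP", "DOWN"),
}


def move(s, direction):
    """Deterministic result of moving; stay in s if the move is not possible."""
    dx, dy = MOVES[direction]
    nxt = (s[0] + dx, s[1] + dy)
    inside = 1 <= nxt[0] <= WIDTH and 1 <= nxt[1] <= HEIGHT
    return nxt if inside and nxt not in BLOCKED else s


def T(s, a, s_prime):
    """Pr(s' | s, a): 0.8 for the intended move, 0.1 for each sideways slip."""
    p = 0.0
    if move(s, a) == s_prime:
        p += 0.8
    for side in SIDEWAYS[a]:
        if move(s, side) == s_prime:
            p += 0.1
    return p
```

For example, T((1, 1), "UP", (1, 2)) is 0.8, and for any fixed s and a the probabilities over all s' in STATES add up to 1. The deterministic case corresponds to replacing 0.8 with 1.0 and dropping the sideways terms.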
.7 Markov Decision Process - 3: Markovian property. Markov means you don't have to condition on anything past the most recent state: only the present matters. The next state depends only on the current state s. In Pr(s'|s,a) there is a single dependency on s, not on the earlier states s1, s2, s3.
You can turn anything into a Markovian process by making sure the current state remembers everything it needs from the past.
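One way to picture "making the current state remember the past" is a toy process whose next value depends on the last two values. The process itself is made up for this illustration and the function names are hypothetical. With only the latest value as the state it is not Markovian; with the pair (previous, current) as the state it is.

```python
def step_non_markov(history):
    """Next value needs the last TWO values, so the latest value alone is not enough."""
    return history[-1] + history[-2]


def step_markov(state):
    """Same process, but the state is the pair (previous, current).

    Everything needed from the past is folded into the state, so the next
    state depends only on the current state.
    """
    previous, current = state
    return (current, previous + current)
```

Both versions generate the same sequence of values. The second simply carries the one piece of history it needs inside the state, at the cost of a larger state space.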
Second property of MDPs: nothing ever changes. Things are stationary: the rules don't change over time.
Reward: R(s) is the reward for being in a state; R(s,a) is the reward for being in a state and taking an action; R(s,a,s') is the reward for being in a state, taking an action, and ending up in s'. All three are mathematically equivalent. Intuition:
Green, or the goal, is great: you want to be there. Red is punishment, a restricted area. The reward encompasses the domain knowledge: the usefulness of entering that state.
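The three reward signatures can be sketched as follows. The positions of the green and red cells and the values +1, -1 and 0 are assumptions chosen for the example, not numbers from these notes. The sketch only shows the easy direction of the equivalence: a reward defined on states can be wrapped into the other two forms by ignoring the extra arguments.

```python
# Hypothetical cell positions and reward values (assumed for illustration).
GOAL = (4, 3)  # green cell
RED = (4, 2)   # red cell


def R_s(s):
    """R(s): reward for being in state s."""
    if s == GOAL:
        return 1.0
    if s == RED:
        return -1.0
    return 0.0


def R_sa(s, a):
    """R(s, a): same reward, the action is ignored in this sketch."""
    return R_s(s)


def R_sas(s, a, s_prime):
    """R(s, a, s'): same reward, action and next state are ignored in this sketch."""
    return R_s(s)
```

Going the other way, from R(s,a,s') back to R(s), generally means changing what counts as a state, which this sketch does not attempt.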
.8 Markov Decision Process - 4: An MDP describes a problem; the solution is described by a policy, Pi. Pi(s) --> a takes in a state and outputs the action to take. It is a solution to the MDP.
Pi*, or policy star, is the optimal policy: the one that maximizes your long-term reward over time.
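A policy is just a lookup from state to action. The sketch below writes one down by hand for a hypothetical 4x3 grid (blocked cell at (2,2), goal at (4,3), red cell at (4,2), all assumed for the example) and follows it in the deterministic, no-noise case. The policy is hand-picked for illustration and is not claimed to be Pi*. The names policy, pi and follow are hypothetical.

```python
# Hypothetical grid layout (assumed for illustration).
WIDTH, HEIGHT = 4, 3
BLOCKED = {(2, 2)}
GOAL = (4, 3)

MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}

# A hand-written policy: one action per non-terminal state.
# The red cell (4, 2) and the goal are treated as terminal and left out.
policy = {
    (1, 1): "UP", (1, 2): "UP", (1, 3): "RIGHT",
    (2, 1): "LEFT", (2, 3): "RIGHT",
    (3, 1): "UP", (3, 2): "UP", (3, 3): "RIGHT",
    (4, 1): "LEFT",
}


def pi(s):
    """Pi(s) --> a: take in a state, return the action to take."""
    return policy[s]


def follow(s, max_steps=20):
    """Follow the policy from s with no noise; return the states visited."""
    path = [s]
    while s != GOAL and len(path) <= max_steps:
        dx, dy = MOVES[pi(s)]
        nxt = (s[0] + dx, s[1] + dy)
        inside = 1 <= nxt[0] <= WIDTH and 1 <= nxt[1] <= HEIGHT
        s = nxt if inside and nxt not in BLOCKED else s
        path.append(s)
    return path
```

Note that the policy says what to do in every state, not just along one path: starting from (1, 1) or from (4, 1), follow reaches the goal by looking up one action at a time.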