Saturday, March 2, 2019

Udacity Machine Learning Nanodegree Mega Review Part 8

Markov Decision Process

For more posts in this series, visit the main course outline page.

Lesson 2 Markov Decision Process
.5 Markov Decision Process - 1: A single agent, and a process for making decisions. There are STATES s: a set of tokens that represent every state one can be in. In the grid world, the state is which part of the grid I am in, given as (x,y) coordinates, and the state set is the entire grid minus the blocked states. MODEL: T(s,a,s') ~ Pr(s'|s,a)

.6 Markov Decision Process - 2:

Actions: things you can do in a particular STATE: UP, DOWN, LEFT, RIGHT.
Actions can be a function of the state, A(s), or simply a fixed set of actions, A.

The Model, aka the transition model, describes the rules of the world: how to play the game.

The transition model is a function of three variables: state, action, and next state, aka state_prime (s').

S' can equal S, which means staying in place.

The transition model outputs the probability of ending up in S', given that the agent starts in S and takes action a.

Deterministic case: there is no noise, so every action executes with certainty (100%). In the nondeterministic case, an action executes faithfully only 80% of the time: the probabilities are 0.8, 0.1, 0.1.
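To make the noisy model concrete, here is a minimal Python sketch of T(s,a,s') on a small grid. Several details are assumptions that the notes do not specify: the 4x3 grid size, the position of the blocked cell, the idea that the two 0.1 slips go to the directions perpendicular to the intended one, and the rule that bumping into a wall or blocked cell leaves the agent where it is. All names (STATES, move, transition) are hypothetical.

```python
from collections import defaultdict

# Hypothetical 4x3 grid with (x, y) coordinates and one blocked cell (assumption).
WIDTH, HEIGHT = 4, 3
BLOCKED = {(1, 1)}
STATES = [(x, y) for x in range(WIDTH) for y in range(HEIGHT)
          if (x, y) not in BLOCKED]

MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}

# Assumption: the 0.1 / 0.1 noise slips to the two perpendicular directions.
PERPENDICULAR = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
                 "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}


def move(state, action):
    """Noise-free move; hitting a wall or blocked cell means staying (s' == s)."""
    dx, dy = MOVES[action]
    nxt = (state[0] + dx, state[1] + dy)
    return nxt if nxt in STATES else state


def transition(state, action):
    """Return {s': Pr(s' | s, a)} for the 0.8 / 0.1 / 0.1 model."""
    probs = defaultdict(float)
    probs[move(state, action)] += 0.8
    for slip in PERPENDICULAR[action]:
        probs[move(state, slip)] += 0.1
    return dict(probs)


print(transition((0, 0), "UP"))
```

From the corner (0, 0), going UP succeeds with probability 0.8, slips RIGHT with 0.1, and the LEFT slip hits the wall, so the agent stays put with 0.1. The probabilities for any (s, a) pair sum to 1.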

The model describes the rules of the game. It also captures what happens if you do something: the physics of the world.
Transition models are probabilistic by nature.

.7 Markov Decision Process - 3: The Markovian property. Markov means you don't have to condition on anything past the most recent state: only the present matters. The next state depends only on the current state s. In Pr(s'|s,a) there is only one dependency, on s, not on s1, s2, s3.

You can turn anything into a Markovian process by making sure the current state remembers everything it needs from the past.
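As a small illustration of that trick, consider a made-up process whose next position depends on the last two positions. The sketch below is hypothetical (the dynamics and function names are not from the course); it shows the same rule written once over a history and once over an augmented state that carries the previous position along.

```python
def step_with_history(history):
    """Non-Markov view: the next position needs the last TWO positions."""
    return history[-1] + (history[-1] - history[-2])


def step_markov(state):
    """Markov view: the state is (previous, current), so the present state suffices."""
    prev, cur = state
    nxt = cur + (cur - prev)
    return (cur, nxt)


print(step_with_history([0, 2]))  # next position from the history
print(step_markov((0, 2)))        # same dynamics from the augmented state
```

The cost of the trick is a larger state space: the augmented state here is a pair of positions rather than a single one.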

Second property of MDPs: nothing ever changes. Things are stationary: the rules don't change over time.

Reward: R(s) is the reward for being in a state; R(s,a) is the reward for being in a state and taking an action; R(s,a,s') is the reward for being in a state, taking an action, and ending up in s'. All three are mathematically equivalent. Intuition:
Green, or the goal, is great: you want to be there. Red is punishment, a restricted area. The reward encompasses the domain knowledge: the usefulness of entering that state.
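One way to read the "mathematically equivalent" remark, offered as a sketch rather than something stated in the lecture: the richest form R(s,a,s') collapses to R(s,a) by averaging over next states under the transition model T, and R(s) is the special case where the reward does not depend on the action.

```latex
R(s,a) = \sum_{s'} T(s,a,s')\, R(s,a,s')
```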

.8 Markov Decision Process - 4: An MDP describes a problem; the solution is described by Pi, the policy. Pi(s) --> a takes in a state and outputs the action to take. It is a solution to the MDP.

Pi*, or policy star, is the optimal policy: the one that maximizes your long-term reward across time.
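A policy can be pictured as a lookup table from states to actions. The sketch below uses a hypothetical 1-D corridor with a deterministic transition for brevity (the world, reward values, and function names are all assumptions, not the course's grid) and compares two policies by the total reward they collect.

```python
# Hypothetical corridor with states 0..3; state 3 is the goal (assumption).
GOAL = 3
REWARD = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0}  # R(s), illustrative values only


def step(state, action):
    """Deterministic transition for brevity: LEFT/RIGHT, clipped at the ends."""
    delta = {"LEFT": -1, "RIGHT": 1}[action]
    return min(max(state + delta, 0), GOAL)


def rollout(policy, start, max_steps=10):
    """Follow pi(s) -> a from start and add up the rewards of states entered."""
    state, total = start, 0.0
    for _ in range(max_steps):
        if state == GOAL:
            break
        state = step(state, policy[state])
        total += REWARD[state]
    return total


# Two candidate policies: each maps every state to an action.
always_right = {s: "RIGHT" for s in range(GOAL + 1)}
always_left = {s: "LEFT" for s in range(GOAL + 1)}

print(rollout(always_right, start=0))
print(rollout(always_left, start=0))
```

In this toy world always_right reaches the goal and collects its reward, while always_left never leaves state 0, so always_right is the better of the two. With a noisy transition model the comparison would be over expected reward rather than a single rollout.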
