Markov Decision Processes (Briefing)

Principles of Reinforcement Learning
- The reward hypothesis
- Examples of state/reward/action
Markov Decision Processes (MDP)
Utility and Q functions
Notes

Principles of Reinforcement Learning

Random exploration - little prior information is required
Online search, possibly in simulation
Learning from interaction with the environment
Reward function - some states and actions give rewards

Familiar model

  ---------         ---------
  |       | reward  |       |
  |   <------------------   |
  |       |  state  |       |
  |   <------------------   |
  |       |         |       |
  | AGENT |         | ENV.  |
  |       | action  |       |
  |   ------------------>   |
  |       |         |       |
  ---------         ---------

The reward hypothesis

Any goal can be formalised as the outcome of maximising a cumulative reward

Examples of state/reward/action

	State	Action	Reward
Chess	Layout of pieces on the board	A move	Win (+1), Loss (0), or draw (½)
Investment portfolio	Stock market and portfolio	Sell/buy	Change in portfolio value
Robot walk	Orientation and position of robot limbs	Move joints	Movement (+), falling (-)

Markov Decision Processes (MDP)

Probabilistic State Machine
State+Action gives a probability distribution on the new state
- instead of the deterministic state transition seen previously
In theory, we could use a tree search
- but the randomness can cause infinite loops
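
A minimal sketch of such a probabilistic state machine, assuming a hypothetical two-state example. The state names, action names, and probabilities are invented for illustration and do not come from the notes above. The point is only that a (state, action) pair yields a distribution over next states, from which the environment samples.

```python
import random

# Hypothetical transition model: P[(state, action)] maps each possible
# next state to its probability P(s' | s, a). All names and numbers
# here are illustrative assumptions.
P = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "warm": 0.5},
    ("warm", "slow"): {"cool": 0.5, "warm": 0.5},
    ("warm", "fast"): {"warm": 1.0},
}


def step(state, action, rng=random):
    """Sample the next state from the distribution P(s' | s, a)."""
    dist = P[(state, action)]
    next_states = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(next_states, weights=weights, k=1)[0]


# Simulate a short trajectory that always takes the action "fast".
rng = random.Random(0)
trajectory = ["cool"]
for _ in range(5):
    trajectory.append(step(trajectory[-1], "fast", rng))
print(trajectory)
```

Because the same state can be revisited with non-zero probability, a search tree built from this model has no guaranteed finite depth, which is the looping problem mentioned above.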

Utility and Q functions

States \(S\); Actions \(A\)
Reward function assigns a value to each combination of state, possible action, and possible resulting state \[R: S\times A\times S \to \mathbb{R}\]
Utility is the expected total future reward in a state \(s\) \[ U: S \to \mathbb{R}\]
If we can calculate \(U(s)\) for every state \(s\), we have a hill climbing problem.

Consider an agent in state \(s\). If he takes action \(a\) and ends up in state \(s'\), his utility is the reward, paid immediately, plus the expected future payout, which is \(U(s')\). Normally we consider an immediate payout to be worth more than a future promise; that’s why we pay interest on loans. Hence we introduce a discount factor \(\gamma\), and say that the utility is \[ R(s,a,s') + \gamma U(s')\] Note that \(\gamma<1\) is also needed to make some of the mathematical formulæ converge.
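
To make the role of \(\gamma\) concrete: unrolling the recursion \(R + \gamma U(s')\) along a single trajectory weights the rewards by \(1, \gamma, \gamma^2, \ldots\). The sketch below assumes a made-up reward sequence and \(\gamma = 0.5\); the function name is hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite sequence of rewards."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Illustrative reward sequence (an assumption, not data from the notes).
rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With a constant reward of 1 at every step and \(\gamma<1\), the sum is a geometric series bounded by \(1/(1-\gamma)\), which is why the discount factor keeps the total finite.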

The game is probabilistic, so the new state \(s'\) is a random variable. Hence we are interested in the expected payout, rather than a given payout, and for a given action \(a\) we can define the utility as \[ U(s) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma U(s')]\] Now, this expression for \(U(s)\) depends on the action \(a\). To emphasise this, we introduce the \(Q\)-function \[ Q(s,a) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma U(s')]\] Assuming that we always choose the best possible action, the utility should be the maximum \[ U(s) = \max_a Q(s,a)\] Thus we can substitute for \(U\) in the formula for \(Q\) and write \[ Q(s,a) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma \max_{a'}Q(s',a')]\] If we can compute \(Q\) for any pair \((s,a)\), we get an optimal decision policy \[\pi^*(s) = \arg\max_a Q(s,a)\]

The challenge is to compute \(Q\), which is defined recursively in terms of itself; expanding the recursion leads to an infinite series.
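
One standard way to approach this recursion, when the transition probabilities and rewards are known, is fixed-point iteration on the equation for \(Q\) (often called Q-value iteration). The sketch below is a minimal illustration under stated assumptions, not necessarily the method these notes go on to use: the toy MDP, its numbers, and the names `T`, `q_iteration`, and `greedy_policy` are all hypothetical.

```python
# Hypothetical toy MDP; states, actions, probabilities and rewards are
# invented for illustration only.
STATES = ["cool", "warm"]
ACTIONS = ["slow", "fast"]
GAMMA = 0.9

# T[(s, a)] lists (probability, next_state, reward) triples, i.e.
# P(s'|s,a) together with R(s,a,s').
T = {
    ("cool", "slow"): [(1.0, "cool", 1.0)],
    ("cool", "fast"): [(0.5, "cool", 2.0), (0.5, "warm", 2.0)],
    ("warm", "slow"): [(0.5, "cool", 1.0), (0.5, "warm", 1.0)],
    ("warm", "fast"): [(1.0, "warm", -1.0)],
}


def q_iteration(T, states, actions, gamma, tol=1e-9, max_iter=10_000):
    """Repeatedly apply Q(s,a) <- sum_s' P * [R + gamma * max_a' Q(s',a')]."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(max_iter):
        new_Q = {}
        for s in states:
            for a in actions:
                new_Q[(s, a)] = sum(
                    p * (r + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for p, s2, r in T[(s, a)]
                )
        # Largest change in any entry; stop once the update has settled.
        delta = max(abs(new_Q[k] - Q[k]) for k in Q)
        Q = new_Q
        if delta < tol:
            break
    return Q


def greedy_policy(Q, states, actions):
    """pi*(s) = argmax_a Q(s, a)."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}


Q = q_iteration(T, STATES, ACTIONS, GAMMA)
policy = greedy_policy(Q, STATES, ACTIONS)
print(policy)
```

The iteration converges here because \(\gamma<1\) makes each update shrink the error by at least a factor \(\gamma\). It relies on knowing \(P\) and \(R\) explicitly, so it is a model-based approach.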

Notes

Policy-based versus Utility-based
Model-based or Model-free
Horizon
Dynamic Decision Network Ch 16.1.4