Revision f9c10f9fb250601bee5825b36a8c8c30d66e2eef (click the page title to view the current version)

Markov Decision Processes (Briefing)

Principles of Reinforcemnet Learning

  1. Random exploration - little prior information is required
  2. Online search, possibly in simulation
  3. Learning from interaction with the environment
  4. Reward function - some states and actions give rewards

Familiar model

  ---------         ---------
  |       | reward  |       |
  |   <------------------   |
  |       |  state  |       |
  |   <------------------   |
  |       |         |       |
  | AGENT |         | ENV.  |
  |       | action  |       |
  |   ------------------>   |
  |       |         |       |
  ---------         ---------

The reward hypothesis

Any goal can be formalised as the outcome of maximising a cummulative reward

Examples of state/reward/action

State Action Reward
Chess Layout of pieces on the board A move Win (+1), Loss (0), or draw (½)
Investment portfolio Stock market and portfolio Sell/buy Valued change
Robot walk Orientation and position of robot limbs Move joints Movement (+), falling (-)

Markov Decision Processes (MDP)

  1. Probablistic State Machine
  2. State+Action gives a probability distribution on the new state
    • instead of the deterministic state transition seen previously
  3. In theory, we could use a tree search
    • but the randomness can cause infinite loops

Utility and Q functions

  1. States \(S\); Actions \(A\)
  2. Reward function assigns a value for each state, each possible action, and each possible resulting state \[R: S\times A\times S \to \mathbb{R}\]
  3. Utility is the expected total future reward in a state \(s\) \[ U: S \to \mathbb{R}\]
  4. If we can calculate \(U(s)\) for every state \(s\), we have a hill climbing problem.

Consider an agent in state \(s\). If he takes action \(a\) and ends up in state \(s'\), his utility is the reward, paid immediately, plus the expected future payout which is \(U(s')\). Normally we consider immediate payout to be more worth than a future promise; that’s why we pay interest on loans. Hence we introduce a discount factor \(\gamma\), and say that the utility is \[ R(s,a,s') + \gamma U(s')\] Note that \(\gamma<1\) is needed to make some of the mathematical formulæ converge as well.

The game is probabilistic, so the new state \(s'\) is a random variable. Hence we are interested in the expected payout, rather than a given payout, and we can define the utility as \[ U(s) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma U(s')]\] Now, \(U(s)\) depends on the action \(a\). To emphasise this, we introduce the \(Q\)-function \[ Q(s,a) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma U(s')]\] Since we always choose the best possible action, the utility should be the maximum \[ U(s) = \max_a Q(s,a)\] Thus we can substitute for \(U\) in the formula for \(Q\) and write \[ Q(s,a) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma \max_{a'}Q(s',a')]\] If we can compute \(Q\) for any pair \((s,a)\), we get an optimal decision policy \[\pi^*(s) = \argmax_a Q(s,a)\]

The challenge is to compute \(Q\), which is a recursive sum, leading to an infinite series.


  1. Policy-based versus Utility-based
  2. Model-based or Model-free
  3. Horizon
  4. Dynamic Decision Network Ch 16.1.4