---
title: Markov Decision Processes (Briefing)
categories: lecture
---


# Principles of Reinforcement Learning

1.  Random exploration - little prior information is required
2.  Online search, possibly in simulation
3.  Learning from interaction with the environment
4.  Reward function - some states and actions give rewards

The familiar agent-environment interaction model:
```
  ---------         ---------
  |       | reward  |       |
  |   <------------------   |
  |       |  state  |       |
  |   <------------------   |
  |       |         |       |
  | AGENT |         | ENV.  |
  |       | action  |       |
  |   ------------------>   |
  |       |         |       |
  ---------         ---------
```
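
This loop can be written as a short program.  Below is a minimal sketch in Python; the `Environment` and `Agent` classes are made up purely to illustrate how state, action, and reward flow between the two boxes:

```python
import random

class Environment:
    """Toy environment: states 0..4 on a line, reward only at the right end."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); transitions are deterministic here
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        return self.state, reward

class Agent:
    """Agent that explores randomly (principle 1: little prior information)."""
    def act(self, state):
        return random.choice([-1, +1])

env, agent = Environment(), Agent()
state, total_reward = env.state, 0.0
for _ in range(20):                      # interaction loop: action -> (state, reward)
    action = agent.act(state)
    state, reward = env.step(action)
    total_reward += reward
print("total reward:", total_reward)
```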

## The reward hypothesis

> Any goal can be formalised as the outcome of maximising a cumulative reward

## Examples of states, actions, and rewards

| Example | State | Action | Reward |
| :- | :- | :- | :- |
| Chess | Layout of pieces on the board | A move | Win (+1), draw (½), or loss (0) |
| Investment portfolio | Stock market and portfolio | Buy/sell | Change in portfolio value |
| Robot walk | Orientation and position of robot limbs | Move joints | Movement (+), falling (-) |


# Markov Decision Processes (MDP)

1. Probabilistic state machine
2. State + action gives a probability distribution over the new state (see the sketch after this list)
    - instead of the deterministic state transition seen previously
3. In theory, we could use a tree search
    - but the randomness can cause infinite loops
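
For concreteness, here is one way a probabilistic state machine could be represented in Python: a table mapping each (state, action) pair to a distribution over next states.  The states, actions, probabilities, and rewards below are made up for illustration.

```python
import random

# P[(s, a)] is a list of (next_state, probability, reward) triples -- illustrative values
P = {
    ("s0", "a"): [("s0", 0.3, 0.0), ("s1", 0.7, 1.0)],
    ("s0", "b"): [("s0", 0.9, 0.0), ("s1", 0.1, 5.0)],
    ("s1", "a"): [("s1", 1.0, 0.0)],
    ("s1", "b"): [("s0", 0.5, 0.0), ("s1", 0.5, 0.0)],
}

def sample_transition(s, a):
    """Draw the new state (and reward) according to the distribution P(s' | s, a)."""
    u = random.random()
    acc = 0.0
    for s_next, p, reward in P[(s, a)]:
        acc += p
        if u <= acc:
            return s_next, reward
    return s_next, reward  # guard against floating-point round-off

print(sample_transition("s0", "a"))
```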

# Utility and Q functions

1. States $S$; Actions $A$
2.  Reward function assigns a value to each state, each possible action,
   and each possible resulting state:
   $$R: S\times A\times S \to \mathbb{R}$$
3.  Utility is the expected total future reward in a state $s$:
   $$ U: S \to \mathbb{R}$$
4. If we can calculate $U(s)$ for every state $s$, we have a hill climbing
   problem.

Consider an agent in state $s$.  If it takes action $a$ and ends up
in state $s'$, its utility is the reward, paid immediately, plus
the expected future payout, which is $U(s')$.  Normally we consider
an immediate payout to be worth more than a future promise; that is why
we pay interest on loans. Hence we introduce a discount factor $\gamma$
and say that the utility is
$$ R(s,a,s') + \gamma U(s')$$
Note that $\gamma<1$ is needed to make some of the mathematical formulæ
converge as well.
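To see why: if every reward is bounded by some $R_{\max}$, the total
discounted reward is bounded by the geometric series
$$ \sum_{t=0}^{\infty} \gamma^t R_{\max} = \frac{R_{\max}}{1-\gamma}, $$
which is finite only when $\gamma < 1$.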

The game is probabilistic, so the new state $s'$ is a random variable.
Hence we are interested in the *expected* payout, rather than a given
payout, and we can define the utility as
$$ U(s) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma U(s')]$$
Now, $U(s)$ depends on the action $a$.  To emphasise this, we
introduce the $Q$-function
$$ Q(s,a) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma U(s')]$$
Since we always choose the best possible action, the utility should
be the maximum
$$ U(s) = \max_a Q(s,a)$$
Thus we can substitute for $U$ in the formula for $Q$ and write
$$ Q(s,a) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma \max_{a'}Q(s',a')]$$
If we can compute $Q$ for any pair $(s,a)$, we get an optimal
decision policy
$$\pi^*(s) = \arg\max_a Q(s,a)$$

The challenge is to compute $Q$: its definition is recursive, so
expanding it leads to an infinite series.
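
One standard way around this is to iterate the equation to a fixed point instead of expanding it (value iteration on $Q$).  The sketch below is only illustrative: the two-state MDP, the transition table `P`, the value of `gamma`, and the iteration count are all made up.

```python
gamma = 0.9

# P[(s, a)] lists (next_state, probability, reward) triples -- an illustrative two-state MDP
P = {
    ("s0", "a"): [("s0", 0.3, 0.0), ("s1", 0.7, 1.0)],
    ("s0", "b"): [("s0", 0.9, 0.0), ("s1", 0.1, 5.0)],
    ("s1", "a"): [("s1", 1.0, 0.0)],
    ("s1", "b"): [("s0", 0.5, 0.0), ("s1", 0.5, 0.0)],
}
states = {"s0", "s1"}
actions = {"a", "b"}

# Repeatedly apply the Bellman optimality update
#   Q(s,a) <- sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * max_{a'} Q(s',a')]
Q = {(s, a): 0.0 for s in states for a in actions}
for _ in range(100):                     # enough iterations for this tiny example
    Q_new = {}
    for (s, a), transitions in P.items():
        Q_new[(s, a)] = sum(p * (r + gamma * max(Q[(s2, a2)] for a2 in actions))
                            for s2, p, r in transitions)
    Q = Q_new

# The optimal policy picks the action with the largest Q-value in each state
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
print(Q)
print("optimal policy:", policy)
```

Because $\gamma<1$, the update is a contraction, so the iteration converges to the unique fixed point regardless of how $Q$ is initialised.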

# Notes

1. Policy-based versus Utility-based
2. Model-based or Model-free
3. Horizon
4. Dynamic Decision Network Ch 16.1.4