# Briefing on MDP

## Changes from 6a95b67e980772e1fa2a83e7c0de0ff069e51bf1 to 42e33d8b72a5fd08503828436a221824fbddd30e

---
title: Markov Decision Processes (Briefing)
categories: lecture
---

# Principles of Reinforcemnet Learning

1.  Random exploration - little prior information is required
2.  Online search, possibly in simulation
3.  Learning from interaction with the environment
4.  Reward function - some states and actions give rewards

Familiar model

---------         ---------
|       | reward  |       |
|   <------------------   |
|       |  state  |       |
|   <------------------   |
|       |         |       |
| AGENT |         | ENV.  |
|       | action  |       |
|   ------------------>   |
|       |         |       |
---------         ---------


## The reward hypothesis

> Any goal can be formalised as the outcome of maximising a cummulative reward

## Examples of state/reward/action

|    | State | Action | Reward |
| :- | :- | :- | :- |
| Chess | Layout of pieces on the board | A move | Win (+1), Loss (0), or draw (½) |
| Investment portfolio | Stock market and portfolio | Sell/buy | Valued change |
| Robot walk | Orientation and position of robot limbs | Move joints | Movement (+), falling (-) |

# Markov Decision Processes (MDP)

1. Probablistic State Machine
2. State+Action gives a probability distribution on the new state
- instead of the deterministic state transition seen previously
3. In theory, we could use a tree search
- but the randomness can cause infinite loops
4. Horizon

# Utility and Q functions

1. States $S$; Actions $A$
1. Reward function assigns a value for each state, each possible action,
and each possible resulting state
$$R: S\times A\times S \to \mathbb{R}$$
1. Utility is the expected total future reward in a state $s$
$$U: S \to \mathbb{R}$$
4. If we can calculate $U(s)$ for every state $s$, we have a hill climbing
problem.

Consider an agent in state $s$.  If he takes action $a$ and ends up
in state $s'$, his utility is the reward, paid immediately, plus
the expected future payout which is $U(s')$.  Normally we consider
immediate payout to be more worth than a future promise; that's why
we pay interest on loans. Hence we introduce a discount factor $\gamma$,
and say that the utility is
$$R(s,a,s') + \gamma U(s')$$
Note that $\gamma\lt1$ is needed to make some of the mathematical formulæ
converge as well.

The game is probabilistic, so the new state $s'$ is a random variable.
Hence we are interested in the *expected* payout, rather than a given
payout, and we can define the utility as
$$U(s) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma U(s')]$$
Now, $U(s)$ depends on the action $a$.  To emphasise this, we
introduce the $Q$-function
$$Q(s,a) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma U(s')]$$
Since we always choose the best possible action, the utility should
be the maximum
$$U(s) = \max_a Q(s,a)$$
Thus we can substitute for $U$ in the formula for $Q$ and write
$$Q(s,a) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma \max_{a'}Q(s',a')]$$
If we can compute $Q$ for any pair $(s,a)$, we get an  optimal
decision policy
$$\pi^*(s) = \arg\max_a Q(s,a)$$

The challenge is to compute $Q$, which is a recursive sum, leading to
an infinite series.

# Notes

1. Policy-based versus Utility-based
2. Model-based or Model-free
3. Horizon
4. Dynamic Decision Network Ch 16.1.4