---
title: Markov Decision Processes (Briefing)
categories: lecture
---
# Principles of Reinforcemnet Learning
1. Random exploration - little prior information is required
2. Online search, possibly in simulation
3. Learning from interaction with the environment
4. Reward function - some states and actions give rewards
Familiar model
```
--------- ---------
| | reward | |
| <------------------ |
| | state | |
| <------------------ |
| | | |
| AGENT | | ENV. |
| | action | |
| ------------------> |
| | | |
--------- ---------
```
## The reward hypothesis
> Any goal can be formalised as the outcome of maximising a cummulative reward
## Examples of state/reward/action
| | State | Action | Reward |
| :- | :- | :- | :- |
| Chess | Layout of pieces on the board | A move | Win (+1), Loss (0), or draw (½) |
| Investment portfolio | Stock market and portfolio | Sell/buy | Valued change |
| Robot walk | Orientation and position of robot limbs | Move joints | Movement (+), falling (-) |
# Markov Decision Processes (MDP)
1. Probablistic State Machine
2. State+Action gives a probability distribution on the new state
- instead of the deterministic state transition seen previously
3. In theory, we could use a tree search
- but the randomness can cause infinite loops
# Utility and Q functions
1. States $S$; Actions $A$
1. Reward function assigns a value for each state, each possible action,
and each possible resulting state
$$R: S\times A\times S \to \mathbb{R}$$
1. Utility is the expected total future reward in a state $s$
$$ U: S \to \mathbb{R}$$
4. If we can calculate $U(s)$ for every state $s$, we have a hill climbing
problem.
Consider an agent in state $s$. If he takes action $a$ and ends up
in state $s'$, his utility is the reward, paid immediately, plus
the expected future payout which is $U(s')$. Normally we consider
immediate payout to be more worth than a future promise; that's why
we pay interest on loans. Hence we introduce a discount factor $\gamma$,
and say that the utility is
$$ R(s,a,s') + \gamma U(s')$$
Note that $\gamma<1$ is needed to make some of the mathematical formulæ
converge as well.
The game is probabilistic, so the new state $s'$ is a random variable.
Hence we are interested in the *expected* payout, rather than a given
payout, and we can define the utility as
$$ U(s) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma U(s')]$$
Now, $U(s)$ depends on the action $a$. To emphasise this, we
introduce the $Q$-function
$$ Q(s,a) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma U(s')]$$
Since we always choose the best possible action, the utility should
be the maximum
$$ U(s) = \max_a Q(s,a)$$
Thus we can substitute for $U$ in the formula for $Q$ and write
$$ Q(s,a) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma \max_{a'}Q(s',a')]$$
If we can compute $Q$ for any pair $(s,a)$, we get an optimal
decision policy
$$\pi^*(s) = \arg\max_a Q(s,a)$$
The challenge is to compute $Q$, which is a recursive sum, leading to
an infinite series.
# Notes
1. Policy-based versus Utility-based
2. Model-based or Model-free
3. Horizon
4. Dynamic Decision Network Ch 16.1.4