# Principles of Reinforcement Learning

1. Random exploration - little prior information is required
2. Online search, possibly in simulation
3. Learning from interaction with the environment
4. Reward function - some states and actions give rewards

The familiar agent-environment model:

    +--------+          +--------+
    |        |  reward  |        |
    |        |<---------|        |
    |        |  state   |        |
    | AGENT  |<---------|  ENV.  |
    |        |          |        |
    |        |  action  |        |
    |        |--------->|        |
    +--------+          +--------+
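
A minimal sketch of this interaction loop in Python; the environment (`step`), the noisy-walk dynamics, and the goal state are made up for illustration:

```python
import random

def step(state, action):
    """Hypothetical environment: a noisy walk on the integers 0..5.
    Returns (next_state, reward); reaching state 5 pays reward 1."""
    next_state = max(0, min(5, state + action + random.choice([-1, 0, 1])))
    return next_state, (1.0 if next_state == 5 else 0.0)

def agent(state):
    """Trivial hard-coded policy: always step right."""
    return 1

state = 0
for t in range(20):
    action = agent(state)                # agent -> environment: action
    state, reward = step(state, action)  # environment -> agent: state, reward
    print(t, state, reward)
```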

## The reward hypothesis

Any goal can be formalised as the outcome of maximising a cumulative reward.

## Examples of state/action/reward

| Problem | State | Action | Reward |
|---------|-------|--------|--------|
| Chess | Layout of pieces on the board | A move | Win (+1), loss (0), or draw (½) |
| Investment portfolio | Stock market and portfolio | Sell/buy | Change in value |
| Robot walk | Orientation and position of robot limbs | Move joints | Movement (+), falling (-) |

# Markov Decision Processes (MDP)

1. Probabilistic state machine
2. State + action gives a probability distribution over the new state (see the sketch after this list)
• instead of the deterministic state transitions seen previously
3. In theory, we could use a tree search
• but the randomness can cause infinite loops
4. Horizon: how far into the future the agent plans (finite or infinite)
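
A minimal sketch of such a probabilistic state machine; the states, actions, and transition probabilities below are made up for illustration:

```python
import random

# Hypothetical model: P[(state, action)] is a list of (probability, next_state) pairs
P = {
    ("s0", "go"):   [(0.8, "s1"), (0.2, "s0")],  # the action may fail
    ("s0", "stay"): [(1.0, "s0")],
    ("s1", "go"):   [(0.9, "s2"), (0.1, "s0")],
    ("s1", "stay"): [(1.0, "s1")],
}

def sample_next_state(state, action):
    """Draw s' from the distribution P(s' | s, a)."""
    probs, states = zip(*P[(state, action)])
    return random.choices(states, weights=probs)[0]

print(sample_next_state("s0", "go"))  # usually "s1", sometimes "s0"
```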

# Utility and Q functions

1. States $$S$$; Actions $$A$$
2. Reward function assigns a value for each state, each possible action, and each possible resulting state $R: S\times A\times S \to \mathbb{R}$
3. Utility is the expected total future reward from a state $$s$$: $U: S \to \mathbb{R}$
4. If we can calculate $$U(s)$$ for every state $$s$$, we have a hill climbing problem.

Consider an agent in state $$s$$. If it takes action $$a$$ and ends up in state $$s'$$, its utility is the reward, paid immediately, plus the expected future payout, which is $$U(s')$$. Normally we consider an immediate payout to be worth more than a future promise; that’s why we pay interest on loans. Hence we introduce a discount factor $$\gamma$$ and say that the utility is $R(s,a,s') + \gamma U(s')$. Note that $$\gamma < 1$$ is also needed to make some of the mathematical formulæ converge.

The game is probabilistic, so the new state $$s'$$ is a random variable. Hence we are interested in the expected payout rather than a given payout, and we can define the utility as

$$U(s) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma U(s')]$$

Now $$U(s)$$ depends on the action $$a$$. To emphasise this, we introduce the $$Q$$-function

$$Q(s,a) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma U(s')]$$

Since we always choose the best possible action, the utility should be the maximum:

$$U(s) = \max_a Q(s,a)$$

Thus we can substitute for $$U$$ in the formula for $$Q$$ and write

$$Q(s,a) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma \max_{a'} Q(s',a')]$$

If we can compute $$Q$$ for every pair $$(s,a)$$, we get an optimal decision policy

$$\pi^*(s) = \arg\max_a Q(s,a)$$

The challenge is to compute $$Q$$, which is a recursive sum, leading to an infinite series.
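
Despite the infinite recursion, iterating the update converges because $$\gamma < 1$$ shrinks the contribution of distant rewards. A minimal value-iteration sketch on a made-up three-state model; all states, probabilities, and rewards here are illustrative assumptions:

```python
# Value iteration: Q(s,a) = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * max_{a'} Q(s',a'))
GAMMA = 0.9
STATES = ["s0", "s1", "s2"]
ACTIONS = ["go", "stay"]

# Hypothetical transition model: P[(s, a)] maps next states to probabilities
P = {
    ("s0", "go"):   {"s1": 0.8, "s0": 0.2},
    ("s0", "stay"): {"s0": 1.0},
    ("s1", "go"):   {"s2": 0.9, "s0": 0.1},
    ("s1", "stay"): {"s1": 1.0},
    ("s2", "go"):   {"s2": 1.0},
    ("s2", "stay"): {"s2": 1.0},
}

def R(s, a, s_next):
    """Reward 1 for entering the goal state s2, otherwise 0."""
    return 1.0 if s_next == "s2" and s != "s2" else 0.0

# Iterate the recursion from Q = 0; gamma < 1 makes the updates converge.
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
for _ in range(100):
    Q = {(s, a): sum(p * (R(s, a, s2) + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS))
                     for s2, p in P[(s, a)].items())
         for s in STATES for a in ACTIONS}

# The optimal policy: pi*(s) = argmax_a Q(s,a)
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}
print(policy)  # expect "go" in s0 and s1
```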

1. Policy-based versus utility-based learning
2. Model-based versus model-free learning (see the sketch after this list)
3. Horizon
4. Dynamic Decision Networks (Ch. 16.1.4)
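
For item 2: a model-free learner never consults $$P$$ or $$R$$ directly; it improves $$Q$$ from sampled transitions alone. A minimal tabular Q-learning sketch against the hypothetical noisy-walk environment from the first example; the hyperparameters are illustrative assumptions:

```python
import random
from collections import defaultdict

def step(state, action):
    """Hypothetical environment, as before: noisy walk on 0..5, goal state 5."""
    next_state = max(0, min(5, state + action + random.choice([-1, 0, 1])))
    return next_state, (1.0 if next_state == 5 else 0.0)

ACTIONS = [-1, 0, 1]
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1  # illustrative hyperparameters
Q = defaultdict(float)                 # Q[(state, action)], defaults to 0

for episode in range(500):
    state = 0
    for t in range(100):                  # cap episode length
        if random.random() < EPSILON:     # epsilon-greedy: sometimes explore
            action = random.choice(ACTIONS)
        else:                             # otherwise exploit, breaking ties randomly
            action = max(ACTIONS, key=lambda a: (Q[(state, a)], random.random()))
        next_state, reward = step(state, action)
        # Move Q(s,a) towards the sampled Bellman target.
        target = reward + GAMMA * max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state = next_state
        if state == 5:
            break

# Greedy policy learned without ever seeing the transition model
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(5)})
```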