---
title: Markov Decision Processes (Briefing)
categories: lecture
---

# Principles of Reinforcement Learning

1. Random exploration - little prior information is required
2. Online search, possibly in simulation
3. Learning from interaction with the environment
4. Reward function - some states and actions give rewards

Familiar model

```
 ---------                       ---------
|         |       reward        |         |
|         | <------------------ |         |
|         |       state         |         |
|         | <------------------ |         |
|         |                     |         |
|  AGENT  |                     |  ENV.   |
|         |       action        |         |
|         | ------------------> |         |
|         |                     |         |
 ---------                       ---------
```

## The reward hypothesis

> Any goal can be formalised as the outcome of maximising a cumulative reward

## Examples of state/reward/action

| | State | Action | Reward |
| :- | :- | :- | :- |
| Chess | Layout of pieces on the board | A move | Win (+1), loss (0), or draw (½) |
| Investment portfolio | Stock market and portfolio | Sell/buy | Change in portfolio value |
| Robot walk | Orientation and position of robot limbs | Move joints | Movement (+), falling (−) |

# Markov Decision Processes (MDP)

1. Probabilistic state machine
2. A state and an action give a probability distribution over the new state
   - instead of the deterministic state transition seen previously
3. In theory, we could use a tree search
   - but the randomness can cause infinite loops

# Utility and Q functions

1. States $S$; actions $A$
2. The reward function assigns a value for each state, each possible action, and each possible resulting state
   $$R: S\times A\times S \to \mathbb{R}$$
3. Utility is the expected total future reward in a state $s$
   $$U: S \to \mathbb{R}$$
4. If we can calculate $U(s)$ for every state $s$, we have a hill-climbing problem.

Consider an agent in state $s$. If it takes action $a$ and ends up in state $s'$, its utility is the reward, paid immediately, plus the expected future payout, which is $U(s')$. Normally we consider an immediate payout to be worth more than a future promise; that is why we pay interest on loans. Hence we introduce a discount factor $\gamma$, and say that the utility is
$$R(s,a,s') + \gamma U(s')$$
Note that $\gamma<1$ is also needed to make some of the mathematical formulæ converge.

The game is probabilistic, so the new state $s'$ is a random variable. Hence we are interested in the *expected* payout, rather than a given payout, and we can define the utility as
$$U(s) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma U(s')]$$

Now, $U(s)$ depends on the action $a$. To emphasise this, we introduce the $Q$-function
$$Q(s,a) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma U(s')]$$

Since we always choose the best possible action, the utility should be the maximum
$$U(s) = \max_a Q(s,a)$$

Thus we can substitute for $U$ in the formula for $Q$ and write
$$Q(s,a) = \sum_{s'} P(s'|s,a)\cdot[R(s,a,s') + \gamma \max_{a'}Q(s',a')]$$

If we can compute $Q$ for every pair $(s,a)$, we get an optimal decision policy
$$\pi^*(s) = \arg\max_a Q(s,a)$$

The challenge is to compute $Q$, which is a recursive sum, leading to an infinite series; a small numerical sketch of one way to handle this is given at the end of these notes.

# Notes

1. Policy-based versus utility-based
2. Model-based or model-free
3. Horizon
4. Dynamic Decision Network, Ch 16.1.4
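# Sketch: computing Q by fixed-point iteration

To make the recursive definition of $Q$ concrete, the sketch below applies plain Q-value iteration to a tiny, made-up two-state MDP. The state and action names, transition probabilities, rewards, and the discount factor $\gamma = 0.9$ are all illustrative assumptions, not part of the lecture; the point is only that repeatedly applying the Bellman update truncates the infinite series and converges to a fixed point when $\gamma < 1$.

```python
# Minimal sketch: Q-value iteration on a small, made-up MDP.
# P[(s, a)] lists (probability, next_state, reward) triples,
# i.e. samples of P(s'|s,a) together with R(s,a,s').
P = {
    ("poor", "save"):  [(0.9, "rich", 1.0), (0.1, "poor", 0.0)],
    ("poor", "spend"): [(1.0, "poor", 0.5)],
    ("rich", "save"):  [(1.0, "rich", 1.0)],
    ("rich", "spend"): [(0.5, "rich", 2.0), (0.5, "poor", 2.0)],
}
states = {"poor", "rich"}
actions = {"save", "spend"}
gamma = 0.9  # discount factor (assumed value)

# Initialise Q(s,a) = 0 and repeatedly apply the Bellman update
#   Q(s,a) <- sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * max_{a'} Q(s',a')]
# until the values stop changing.
Q = {(s, a): 0.0 for s in states for a in actions}
for _ in range(1000):
    delta = 0.0
    for (s, a), outcomes in P.items():
        new_q = sum(p * (r + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for p, s2, r in outcomes)
        delta = max(delta, abs(new_q - Q[(s, a)]))
        Q[(s, a)] = new_q
    if delta < 1e-9:
        break

# The optimal policy is the greedy policy with respect to Q.
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
print(Q)
print(policy)
```

Running the script prints the converged Q-table and the greedy policy $\pi^*(s) = \arg\max_a Q(s,a)$. Note that this sketch assumes the transition model $P$ is known; when it is not, the model-free methods hinted at in the Notes above are needed instead.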