---
title: Markov Decision Processes
categories: session
---

This session is a precursor to 
[Reinforcement Learning]().

+ **Reading:** Russel and Norvig, Chapter 16.1-3 + cursory 16.4
+ [A video demo](https://s3.amazonaws.com/media-p.slid.es/videos/1961471/NL0eUKU3/learning_dexterity.mp4)
+ [Briefing on MDP]()
+ Eirik's lecture notes from 2022
    + [Slides (PDF)](Reinforcement Learning Slides 1.pdf)
    + [Notes (PDF)](Reinforcement Learning Notes 1.pdf)

# Exercises

## Task 1

+ Recall the requirements to model a problem with the MDP framework:
    - Sequential decision problem     
    - Stochastic Environment
    - Fully observable
    - Markovian Transition model 
        - the probability distribution depends only on the current state and not past history
    - Additive rewards
+ and recoll the model elements
    - A set of states $s \in S$
    - A set of actions $a \in A$
    - A probabilistic transition function $T(s,a,s')$
    - A reward function $R(s,a,s')$
    - A start state $S_0$
+ Consider one of the following problems
    - The Frozen Lake
    - The game of Black Jack
    - any problem you have previously studied in the semester, e.g. on CodinGame
    - The moon lander (see the [animation from Gymnasium](https://gymnasium.farama.org/))
+ Form groups of 2-4 students and choose one problem each from the
  list above.

### 1A. Individual Part

Analyse the chosen probler, by

+ checking if it fulfills the requirements above
+ identify all the model elements (above)

If the problem does not satisfy the MDP requirements, 
can you amend it so that it does?
E.g. could the game be changed into a random one?
If it cannot be adapte, try another problem.

Using the properties of an MDP:

+ Make a graphical representation of the (possibly modified) problem from Task A.
    - if the state space is very large, you may draw only a subset for illustration
    - you can use either the Dynamic Decision Network (R&N ch 16.1.4), a state machine
      graph, or any other simple representation (e.g. like 
      [this](https://towardsdatascience.com/real-world-applications-of-markov-decision-process-mdp-a39685546026)
      or [this](https://en.wikipedia.org/wiki/Markov_decision_process#/media/File:Markov_Decision_Process.svg) )

### 1B. Group Work

1.  Present your drawings and analysis to each other
    - is the analysis convincing?
2.  Discuss together,

    - Does the problem have a finite or infinite horizon? 
        - If you were to attempt to solve the MDP, could the current horizon pose a problem, why/why not?
    - Does the problem have a discounted reward?
        - If you were to attempt to solve the MDP,
          what discount factor would make sense to use for the utility function?

## Task 2. Getting started with Gymnasium

We will be using the Gymnasium framework to test concepts and ideas
from reinforcement learning.  You may want to consult the documentation,
but you should try playing with it first.

+ [Gymnasium](https://gymnasium.farama.org/)
  formerly known as Gym from OpenAI
+ [Frozen Lake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/)

Install with these statements.  (I am not sure if you need pygame or not.)
```bash
pip install gymnasium
pip install pygame
```

**Familiarize yourself with the FrozenLake environment**
You can import and start the simulation of the Frozen Lake like so:
```python
import gymnasium as gym
env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=True,render_mode="human")
obs, info = env.reset(return_info=True)
```
You should now hopefully see a render of the environment.

**Try out some of these functions and see what they do:**

```python
env.action_space.sample()
observation, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.P # The MDP
env.P[s] # Transition matrix of state s
env.P[s][a] # Transitions from state s given action a
```

The step action is used to execute an action, say 0 to move left:
```python
env.step(0)
```

*What happens?*

**Try to make a custom Frozen Lake map**

When creating a frozen-lake environment you can add a custom-map with the ```desc``` argument, e.g:
```python
fl_map = ["HFFFS", "FHHFF", "FFFFH", "HFFHG"] 
env = gym.make('FrozenLake-v1', desc=fl_map, is_slippery=True)
```
For a custom 5x4 map.

## Task 3.  Hill Climbing.

Given a FrozenLake map, and a list of pre-calculated expected utilities,
(e.g.:
```python
utilities = [0.41,0.38,0.35,0.34,0.43,0,0.12,0,0.45,0.48,0.43,0,0,0.59,0.71,1]
```
for the default FrozenLake 4x4 map).

1.  Visualise the utilities by drawing the lake and annotating it with the utilities.
    - do the utilities look plausible?
    - Try to solve the game manually using a hill climbing approach; 
      do the utilities still look plausible?
2.  Implement a Hill Climbing solver for the Froen Lake Game.


Your code should look something like this:

```
import gymnasium as gym
env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=True,render_mode="human")
env.reset()
state, info = env.reset()

utilities = [0.41,0.38,0.35,0.34,0.43,0,0.12,0,0.45,0.48,0.43,0,0,0.59,0.71,1]
done = False

while not done:

    action = # Make your own definition
    # Find the expected utility following each action and choose the best one.

    state, reward, terminated, truncated, info = env.step(action)

    print()  # Add some diagnostic output so that you can see your decision

    done = terminated or truncated

```

4.  Test your program; does it work rationally?  Does it win the game?
5.  Reflect on your solution.  Can we do better?

## Task 4.  Value iteration

Recall the value/utility-function:
$$U(s) = \mathop\max\limits_{a \in A(s)} \sum\limits_{s'}P(s'|s,a)[R(s,a,s') + \gamma U(s')]$$

Value iteration starts with an arbitrary postulated utility function $U_0$,
and for each iteration $i=1,2,3,\ldots$ it calculates a new estimated $U_i$, by the rule
$$U_i(s) = \mathop\max\limits_{a \in A(s)} \sum\limits_{s'}P(s'|s,a)[R(s,a,s') + \gamma U_{i-1}(s')]$$

Implement value iteration in python to estimate the utilities of the
Frozen Lake.

**or if you prefer**

Execute value iteration manually, to find decent estiamtes.

## Task 5.  Policy iteration (optional)


Recall Q-Function:
$$Q(s,a) = \sum\limits_{s'}P(s'|s,a)[R(s,a,s') + \gamma \mathop\max\limits_{a'}Q(s',a')]$$
and the function to extract an optimal policy from the Q-Function:
$$\pi^*(s) = \mathop{\mathrm{argmax}}\limits_aQ(s,a)$
Recall also that a policy is a map from state to action.

Policy iteration starts with both a utility estimate $U_0$ and a tentative
policy $\pi_0$.
Each iteration $i$

1. estimates the utility $U_{i+1}$ by assuming the policy $\pi_i$
2. calculate a new policy $\pi_{i+1}$ by assuming utilitie $U_{i+1}$.

Implement policy iteration in python.