Revision 54c78ad01d667d5ac7a191819eb4ab58f5eaa0a3 (click the page title to view the current version)

Markov Decision Processes

Exercises

This session is a precursor to Reinforcement Learning.

Reading: Russel and Norvig, Chapter 16.1-3 + cursory 16.4
A video demo
Briefing on MDP
Eirik’s lecture notes from 2022
- Slides (PDF)
- Notes (PDF)

Exercises

Task 1

Recall the requirements to model a problem with the MDP framework:
- Sequential decision problem
- Stochastic Environment
- Fully observable
- Markovian Transition model
  - the probability distribution depends only on the current state and not past history
- Additive rewards
and recoll the model elements
- A set of states \(s \in S\)
- A set of actions \(a \in A\)
- A probabilistic transition function \(T(s,a,s')\)
- A reward function \(R(s,a,s')\)
- A start state \(S_0\)
Consider one of the following problems
- The Frozen Lake
- The game of Black Jack
- any problem you have previously studied in the semester, e.g. on CodinGame
- The moon lander (see the animation from Gymnasium)
Form groups of 2-4 students and choose one problem each from the list above.

Individual Part

Analyse the chosen probler, by

checking if it fulfills the requirements above
identify all the model elements (above)

If the problem does not satisfy the MDP requirements, can you amend it so that it does? E.g. could the game be changed into a random one? If it cannot be adapte, try another problem.

Using the properties of an MDP:

Make a graphical representation of the (possibly modified) problem from Task A.
- if the state space is very large, you may draw only a subset for illustration
- you can use either the Dynamic Decision Network (R&N ch 16.1.4), a state machine graph, or any other simple representation (e.g. like this or this )

Group Work

Present your drawings and analysis to each other
- is the analysis convincing?
Discuss together,
- Does the problem have a finite or infinite horizon?
  - If you were to attempt to solve the MDP, could the current horizon pose a problem, why/why not?
- Does the problem have a discounted reward?
  - If you were to attempt to solve the MDP, what discount factor would make sense to use for the utility function?

Exploring the Frozen Lake (Task 2)

We will be using the Gymnasium framework to test concepts and ideas from reinforcement learning. You may want to consult the documentation, but you should try playing with it first.

Gymnasium formerly known as Gym from OpenAI
Frozen Lake

Install with these statements. (I am not sure if you need pygame or not.)

pip install gymnasium
pip install pygame

Part A

Familiarize yourself with the FrozenLake environment You can import and start the simulation of the Frozen Lake like so:

import gymnasium as gym
env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=True,render_mode="human")
obs, info = env.reset(return_info=True)

You should now hopefully see a render of the environment.

Try out some of these functions and see what they do:

env.action_space.sample()
observation, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.P # The MDP
env.P[s] # Transition matrix of state s
env.P[s][a] # Transitions from state s given action a

Try to make a custom Frozen Lake map

When creating a frozen-lake environment you can add a custom-map with the desc argument, e.g:

fl_map = ["HFFFS", "FHHFF", "FFFFH", "HFFHG"] 
env = gym.make('FrozenLake-v1', desc=fl_map, is_slippery=True)

For a custom 5x4 map.

Task 3

Recall the value/utility-function:

\[U(s) = \mathop\max\limits_{a \in A(s)} \sum\limits_{s'}P(s'|s,a)[R(s,a,s') + \gamma U(s')]\]

The Q-Function:

\[Q(s,a) = \sum\limits_{s'}P(s'|s,a)[R(s,a,s') + \gamma \mathop\max\limits_{a'}Q(s',a')]\]

And the function to extract an optimal policy from the Q-Function:

\[\pi^*(s) = \mathop{\mathrm{argmax}}\limits_aQ(s,a)\]

Part A

Implement the above functions in Python

Part B

Given a FrozenLake map, and a list of pre-calculated expected utilities, (e.g.:

utilities = [0.41,0.38,0.35,0.34,0.43,0,0.12,0,0.45,0.48,0.43,0,0,0.59,0.71,1]

for the default FrozenLake 4x4 map)

Test out the utility-function, and see if it makes sense/work as it should.
Use the function to extract an optimal policy to move around the map (discount factor can be e.g. 0.99).