---
title: Reinforcement Learning
categories: session
---

+ **Goal** Understand and be able to implement Q-learning
+ **Reading** Russell and Norvig Chapter 23
+ [Eirik's slides from 2022](Reinforcement Learning Slides 2.pdf)

# Exercises

Last session we discussed the Q-function,

$$Q(s,a) = \sum_{s'}P(s'|s,a)[R(s,a,s') + \gamma \max_{a'}Q(s',a')]$$

and the function for the optimal policy based on the results from the Q-function:

$$\pi^*(s) = \mathop{\text{argmax }}\limits_aQ(s,a)$$

We also discussed iterative estimation of the utilities and the policies.

This session, we will implement an iterative estimation algorithm for the Q-values, known as Q-learning. This is a model-free, off-policy reinforcement learning algorithm.

The exercise outline below is based partly on Eirik's assignment in 2022 and partly on the Gymnasium [tutorial on Blackjack](https://www.gymlibrary.dev/environments/toy_text/blackjack/).

Note that I have not asked you explicitly to output any diagnostics on the way. You will almost certainly have to do this yourself, so that you know what is going on.

## Goal and overview

The goal for this session is to implement an agent that can solve the Frozen Lake problem as well as possible, using Q-learning.

The skeleton for the Agent will look like this:

```python
class Agent:
    def __init__(
            self, env, learning_rate=0.1, initial_epsilon=1.0,
            epsilon_decay=10**(-50000), final_epsilon=0.1,
            discount_factor=0.95):
        pass

    def get_action(self, obs):
        pass

    def update(self, obs, action, reward, terminated, next_obs):
        pass

    def decay_epsilon(self):
        pass
```

Thus we need four methods. The most obvious ones are the constructor, the move generator, and the model updater. The last method reduces $\epsilon$, which is the probability of making a random move instead of the best move according to the model.

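To get a feeling for how $\epsilon$ evolves over the episodes, here is a minimal sketch, assuming a linear schedule with a lower bound as in the Blackjack tutorial. The helper `epsilon_schedule` and the step size 0.01 are hypothetical and only serve as illustration. The last call uses the skeleton's default step, which is too small to be represented as a float and evaluates to `0.0` in Python, so $\epsilon$ stays at its initial value.

```python
def epsilon_schedule(initial_epsilon, epsilon_decay, final_epsilon, n_episodes):
    """Return the exploration rate used in each episode under linear decay."""
    epsilon = initial_epsilon
    history = []
    for _ in range(n_episodes):
        history.append(epsilon)
        # Subtract a fixed step, but never drop below the floor.
        epsilon = max(final_epsilon, epsilon - epsilon_decay)
    return history


# Hypothetical values: decay from 1.0 towards 0.1 in steps of 0.01.
schedule = epsilon_schedule(1.0, 0.01, 0.1, n_episodes=200)
print(schedule[0], schedule[-1])

# A step that underflows to zero leaves epsilon untouched.
frozen = epsilon_schedule(1.0, 10**(-50000), 0.1, n_episodes=200)
print(frozen[0], frozen[-1])
```

When you choose the decay step, it is worth checking that it is matched to the number of episodes, so that the agent spends some time exploring and some time exploiting.
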
In order to run the model, you can use the following script:

```python
import matplotlib.pyplot as plt
from tqdm import tqdm
from Agent import Agent
import gymnasium as gym

env = gym.make('FrozenLake-v1', desc=None, map_name="4x4",
               is_slippery=False, render_mode="human")

# Smoke test: reset the environment and take one random step
done = False
observation, info = env.reset()
action = env.action_space.sample()
observation, reward, terminated, truncated, info = env.step(action)

agent = Agent(env)

for episode in range(30):
    obs, info = env.reset()
    done = False

    # play one episode
    while not done:
        action = agent.get_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)

        # update the agent
        agent.update(obs, action, reward, terminated, next_obs)

        # update if the environment is done and the current obs
        done = terminated or truncated
        obs = next_obs

    # reduce the exploration rate once per episode
    agent.decay_epsilon()
```

**Note** We have set `is_slippery=False` above. That's useful for the initial testing; we will change it to `True` later.

**Note 2** We use 30 episodes. This is ridiculously few, but the animation is slow, and we need to be able to run it several times in testing.

**Note 3** You can turn off the animation by changing to `render_mode="rgb_array"`. This is a lot faster, but you will need some other way to see what is going on.

## 1. Constructor

Implement the constructor. You need to record all the parameters and initialise the Q-table.

You can use Eirik's initial Q-values below, or use a `defaultdict`, as the [Blackjack tutorial](https://www.gymlibrary.dev/environments/toy_text/blackjack/) does.

```python
import numpy as np

initialQ = np.array([
    [0.009, 0.192, 0.007, 0.009],
    [0.003, 0.002, 0.003, 0.17],
    [0.003, 0.002, 0.001, 0.067],
    [0.001, 0.001, 0.002, 0.037],
    [0.526, 0.002, 0.001, 0.002],
    [0., 0., 0., 0.],
    [0.046, 0., 0., 0.],
    [0., 0., 0., 0.],
    [0.002, 0.002, 0.002, 0.709],
    [0.001, 0.597, 0.001, 0.001],
    [0.945, 0., 0., 0.],
    [0., 0., 0., 0.],
    [0., 0., 0., 0.],
    [0.02, 0.012, 0.898, 0.016],
    [0.061, 0.991, 0.092, 0.068],
    [0., 0., 0., 0.]
])
```

In this format `initialQ[state][action]` is the tentative value for $Q$(`state`,`action`).

## 2. Move generator

The move generator `get_action()` has to return a valid action, that is an integer in the 0--3 range for the Frozen Lake problem. With probability $\epsilon$ you want to return a random action (see last session for code example), and with probability $1-\epsilon$, the action which maximises $Q$ according to the current estimate.

+ Implement `get_action()`.
+ Test the simulator. It should work already at this stage.

## 3. Diagnostic output

+ Add code to count the number of times you win the game.
+ Turn off the animation and increase the number of episodes.
+ Is the default strategy ever able to win the game?

## 4. Model updater

Now we need some way to update the Q-table. Q-learning is based on one very simple update rule:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left(\left[ R(s,a,s') + \gamma \max\limits_{a'}Q(s',a')\right] - Q(s,a)\right),$$

where $\alpha$ is the learning rate, which controls the speed of convergence.

+ Implement `update()`

## 5. Epsilon decay

```python
self.epsilon = max(self.final_epsilon, self.epsilon - self.epsilon_decay)
```

+ What does the above line do?
+ Do the attribute names match the ones you have used?
+ Implement `decay_epsilon()`.

## 6. Testing

+ Test the system. Add more diagnostic output as required.
+ Turn the animation off to be able to test realistically.
+ Try both Eirik's default Q-table and one initialised with zeroes only. Does this matter a lot?
+ Does the Q-table change a lot during training?
+ What happens when you change the training parameters (input to the Agent constructor)?

## 7. The slippery ice

Change the environment to be slippery:

```python
env = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=True)
```

+ Repeat the tests from Section 6. What do you observe?

## 8. Reflection

+ Reflect upon your solution.
+ What are the key elements of Q-learning?
+ Which design decisions are critical to make Q-learning work?
+ Is Q-learning an appropriate solution to the problem?

## 9. Optional

Adapt your solution for other problems in Gymnasium, such as [Blackjack](https://www.gymlibrary.dev/environments/toy_text/blackjack/).
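
One thing that changes in Blackjack is that the observations are tuples rather than integers, so a fixed-size array is less convenient as a Q-table. The following is a minimal sketch of the `defaultdict` approach, not a complete solution. It assumes two actions (0 = stick, 1 = hit) and observations of the form (player sum, dealer card, usable ace); the names `N_ACTIONS`, `q_values` and `greedy_action` are hypothetical.

```python
from collections import defaultdict

N_ACTIONS = 2  # assumption: Blackjack has two actions, 0 = stick and 1 = hit

# Observations that have not been seen yet get a row of zeros on first access.
q_values = defaultdict(lambda: [0.0] * N_ACTIONS)

# Hypothetical observation: (player sum, dealer card, usable ace)
obs = (14, 10, 0)
q_values[obs][1] = 0.5  # pretend we have learnt that hitting is worth 0.5 here


def greedy_action(q_values, obs):
    """Return the index of the largest Q-value for this observation."""
    row = q_values[obs]
    return max(range(len(row)), key=row.__getitem__)


print(greedy_action(q_values, obs))
```

With this table the rest of the agent can stay largely as it is, because `q_values[obs]` works for any hashable observation.
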