diff --git a/M2/Reinforcement Learning/Lab 2 - Maze Game as a Markov Decision Process Part 1.ipynb b/M2/Reinforcement Learning/Lab 2 - Maze Game as a Markov Decision Process Part 1.ipynb new file mode 100644 index 0000000..38b504a --- /dev/null +++ b/M2/Reinforcement Learning/Lab 2 - Maze Game as a Markov Decision Process Part 1.ipynb @@ -0,0 +1,1400 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "44b75d44", + "metadata": {}, + "source": [ + "# Lab 2 - Maze Game as a Markov Decision Process Part 1\n", + "\n", + "## **1. Objectives**\n", + "\n", + "In this lab, we will:\n", + "\n", + "- Model a simple **maze game** as a **Markov Decision Process (MDP)** by defining:\n", + " - **States**\n", + " - **Actions**\n", + " - **Transition probabilities**\n", + " - **Rewards**\n", + "\n", + "- Implement **policy evaluation** to compute the value function of a given policy.\n", + "\n", + "This week, we **do not** improve the policy and search for an optimal one yet. \n", + "We will continue working on the Maze Game **next week**, where we will use these components to compute an **optimal policy**.\n", + "\n", + "We consider a **discounted MDP** with discount factor $\\gamma \\in (0,1)$.\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "id": "100d1e0d", + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "\n", + "np.set_printoptions(\n", + " precision=3, suppress=True\n", + ") # (not mandatory) This line is for limiting floats to 3 decimal places, avoiding scientific notation (like 1.23e-04) for small numbers.\n", + "\n", + "# For reproducibility\n", + "rng = np.random.default_rng(seed=42) # This line creates a random number generator.\n" + ] + }, + { + "cell_type": "markdown", + "id": "1018deab", + "metadata": {}, + "source": [ + "## 2. Maze definition and MDP formulation\n", + "\n", + "We consider a small 2D maze on a grid. 
The agent is a **robot** that moves on the grid.\n", + "\n", + "- `S` : start state\n", + "- `G` : goal state, with positive reward\n", + "- `#` : wall (not accessible)\n", + "- `.` : empty cell\n", + "- `X` : \"trap\" (negative reward)\n", + "\n", + "At each step, the robot can choose among 4 actions:\n", + "\n", + "$$\n", + "\\mathcal{A} = \\{\\text{Up} \\uparrow, \\quad \\text{Right} \\rightarrow, \\quad \\text{Down} \\downarrow, \\quad \\text{Left}\\leftarrow\\}.\n", + "$$\n", + "\n", + "The movement is not fully deterministic: we add a small probability of “error” to make the example more realistic.\n", + "- With probability $1 - p_{\\text{error}}$, the robot moves in the chosen direction.\n", + "- With probability $p_{\\text{error}}$, it moves in one of the three *other* directions, chosen uniformly at random.\n", + "- If the movement would hit a wall or go outside the grid, the robot stays in place.\n", + "\n", + "We will represent the MDP with:\n", + "\n", + "- A list of **states** $\\mathcal{S} = \\{0, \\dots, n_S - 1\\}$, **each corresponding to a non-wall grid cell.**\n", + "- For each action $a$, a transition matrix $P[a]$ of size $(n_S, n_S)$, where\n", + " $$\n", + " P[a][s, s'] = \\mathbb{P}(S_{t+1} = s' \\mid S_t = s, A_t = a).\n", + " $$\n", + "- A reward vector $R$ of length $n_S$, where $R[s]$ is the immediate reward obtained when **leaving** state $s$.\n", + "\n", + "We will use a discount factor $\\gamma = 0.95$.\n" + ] + }, + { + "cell_type": "markdown", + "id": "ca4fa301-c14f-44ec-b04f-b01ca42d979a", + "metadata": {}, + "source": [ + "### 2.1 Define the maze \n", + "\n", + "Let us now define the maze as follows."
+ ] + }, + { + "cell_type": "code", + "execution_count": 74, + "id": "f91cda05", + "metadata": {}, + "outputs": [], + "source": [ + "maze_str = [\n", + " \"#######\",\n", + " \"S...#.#\",\n", + " \"#.#...#\",\n", + " \"#.#..##\",\n", + " \"#..#..G\",\n", + " \"#..X..#\",\n", + " \"#######\",\n", + "]\n" + ] + }, + { + "cell_type": "markdown", + "id": "99820cf4-292d-49ba-b662-f9f05f901f62", + "metadata": {}, + "source": [ + "**Exercise 1.** Compute the dimensions of the maze (complete the “TO DO” parts):\n", + "- How many rows does the maze have?\n", + "- How many columns does the maze have?" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "id": "564cb757-eefe-4be6-9b6f-bb77ace42a97", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "7\n", + "7\n" + ] + } + ], + "source": [ + "n_rows = len(maze_str)\n", + "print(n_rows)\n", + "n_cols = len(maze_str[0])\n", + "print(n_cols)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "id": "26c821d3-2362-4b60-8c77-3d09296d130d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Maze:\n", + "#######\n", + "S...#.#\n", + "#.#...#\n", + "#.#..##\n", + "#..#..G\n", + "#..X..#\n", + "#######\n" + ] + } + ], + "source": [ + "print(\"Maze:\")\n", + "for row in maze_str:\n", + " print(row)\n" + ] + }, + { + "cell_type": "markdown", + "id": "adc49d58-2730-41d8-96fb-ca7c9cb4fcdf", + "metadata": {}, + "source": [ + "### 2.2 Map each walkable cell (not a wall '#') to a state index\n", + "\n", + "Now we convert the maze grid into state indices for the MDP.\n", + "\n", + "\n", + "The cells where the robot is allowed to stand are \n", + "\n", + "- . 
: empty space\n", + "\n", + "- S : start\n", + "\n", + "- G : goal\n", + "\n", + "- X : trap\n", + "\n", + "Everything else (i.e., #) is a wall and cannot be a state in the MDP.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "id": "7116044b-c134-43de-9f30-01ab62325300", + "metadata": {}, + "outputs": [], + "source": [ + "FREE = {\n", + "    \".\",\n", + "    \"S\",\n", + "    \"G\",\n", + "    \"X\",\n", + "}  # The set FREE contains the cell types that the agent is allowed to move into.\n" + ] + }, + { + "cell_type": "markdown", + "id": "1c9ad05e-9c6c-4e00-918c-44b858f45298", + "metadata": {}, + "source": [ + "**Dictionaries to convert between grid and state index**\n", + "\n", + "We now want to identify all **valid states** of the maze (all non-wall cells). \n", + "To do this, we need two mappings:\n", + "\n", + "1. `state_to_pos[s] = (i, j)`: Given a state index $s$, return its grid coordinates (row, column).\n", + "2. `pos_to_state[(i, j)] = s`: Given coordinates (i, j), return the corresponding state index $s$.\n", + "\n", + "These two dictionaries allow easy conversion between **MDP state indices** and the **physical maze positions**.
" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "id": "a1258de4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of states (non-wall cells): 22\n", + "Start state: 0 at (1, 0)\n", + "Goal states: [16] at (4, 6)\n", + "Trap states: [19] at (5, 3)\n" + ] + } + ], + "source": [ + "state_to_pos = {} # s -> (i,j)\n", + "pos_to_state = {} # (i,j) -> s\n", + "\n", + "start_state = None # will store the state index of start state\n", + "goal_states = [] # will store the state index of goal state # We use a list in case there are multiple goals\n", + "trap_states = [] # will store the state index of trap state # We use a list in case there are multiple traps\n", + "\n", + "s = 0\n", + "for i in range(n_rows): # i = row index\n", + " for j in range(n_cols): # j = column index\n", + " cell = maze_str[i][j] # cell = the character at that position (S, ., #, etc.)\n", + "\n", + " if (\n", + " cell in FREE\n", + " ): # FREE contains: free cells \".\", start cell \"S\", goal cell \"G\" and trap cell \"X\"\n", + " # Walls # are ignored, they are not MDP states.\n", + " state_to_pos[s] = (i, j)\n", + " pos_to_state[(i, j)] = s\n", + "\n", + " if cell == \"S\":\n", + " start_state = s\n", + " elif cell == \"G\":\n", + " goal_states.append(s)\n", + " elif cell == \"X\":\n", + " trap_states.append(s)\n", + "\n", + " s += 1\n", + "\n", + "n_states = s\n", + "\n", + "print(\"Number of states (non-wall cells):\", n_states)\n", + "print(\"Start state:\", start_state, \"at\", state_to_pos[start_state])\n", + "print(\"Goal states:\", goal_states, \"at\", state_to_pos[goal_states[0]])\n", + "print(\"Trap states:\", trap_states, \"at\", state_to_pos[trap_states[0]])\n" + ] + }, + { + "cell_type": "markdown", + "id": "721b968c-a355-46eb-aae4-5950441ba604", + "metadata": {}, + "source": [ + "*Hint.* If you don’t know what a dictionary is in Python, try the following code to help you understand." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 79, + "id": "68744dd6-7278-4c20-8b82-34212685352f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "value2\n" + ] + } + ], + "source": [ + "my_dict = {\"key1\": \"value1\", \"key2\": \"value2\"}\n", + "print(my_dict[\"key2\"])\n" + ] + }, + { + "cell_type": "markdown", + "id": "0c76f4e1-b0ba-49c5-b9d5-cfb523024ba9", + "metadata": {}, + "source": [ + "**Exercise 2.** Read the program above and answer the following questions:\n", + "1. What is the purpose of state_to_pos and pos_to_state?\n", + "2. Why do we only assign states to cells in FREE?\n", + "3. What would happen if the maze had multiple goal cells?\n", + "4. What is the total number of states (n_states) in this maze? Does this match the number of non-wall cells you can count visually?" + ] + }, + { + "cell_type": "markdown", + "id": "4c26a18f-2d03-401c-8eae-f9a17ac55f6d", + "metadata": {}, + "source": [ + "1. What is the purpose of `state_to_pos` and `pos_to_state`? These dictionaries establish a bijective mapping between the mathematical representation of the state and its spatial representation:\n", + "\n", + " `state_to_pos`: Maps the scalar state index `s` (an integer used for matrix/vector operations in RL algorithms like Q-learning) to the grid coordinates (i,j).\n", + "\n", + " `pos_to_state`: Maps the grid coordinates (`i,j`) (used to calculate movement and dynamics within the 2D grid) back to the unique state index s.\n", + "\n", + "2. Why do we only assign states to cells in FREE? In a Markov Decision Process (MDP), walls (#) are obstructions, not valid states.\n", + "\n", + " The agent can never \"be\" in a wall, so assigning a state index to a wall would needlessly increase the dimensionality of the state space (∣S∣).\n", + "\n", + " Excluding walls ensures the transition matrices and value vectors remain compact and contain only reachable positions.\n", + "\n", + "3. 
What would happen if the maze had multiple goal cells?\n", + "\n", + " In the code: The logic is robust. Since goal_states is initialized as a list (`[]`), the code would simply append the state index `s` of every `G` cell found during the iteration. The list would contain multiple integers representing all terminal states.\n", + "\n", + " Caveat: While the logic holds, the final print statement in the provided script (`state_to_pos[goal_states[0]]`) would only display the coordinates of the first goal found, ignoring the others in the console output.\n", + "\n", + "4. What is the total number of states (`n_states`) in this maze? Does this match the number of non-wall cells you can count visually?\n", + "\n", + " `n_states` represents the total count of walkable cells (Start, Goal, Trap, and empty space).\n", + "\n", + " Yes, this value matches exactly the number of non-wall cells visible in the maze, as the counter s is incremented precisely when a cell is found in the FREE set." + ] + }, + { + "cell_type": "markdown", + "id": "6d0fa298-7b7c-44fc-bbed-15ea002037c2", + "metadata": {}, + "source": [ + "-----\n", + "\n", + "The following function `plot_maze_with_states` creates a figure showing:\n", + "- the maze walls and free cells\n", + "- the state index for each non-wall cell\n", + "- special labels and colors for S (start state), G (goal state), and X (trap state). " + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "id": "fc61ceef-217c-47f4-8eba-0353369210db", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "def plot_maze_with_states():\n", + " \"\"\"Plot the maze with state indices.\"\"\"\n", + " grid = np.ones(\n", + " (n_rows, n_cols)\n", + " ) # Start with a matrix of ones. Here 1 means “free cell”\n", + " for i in range(n_rows):\n", + " for j in range(n_cols):\n", + " if maze_str[i][j] == \"#\":\n", + " grid[i, j] = 0 # We replace walls (#) with 0\n", + "\n", + " fig, ax = plt.subplots()\n", + " ax.imshow(grid, cmap=\"gray\", alpha=0.7)\n", + "\n", + " # Plot state indices\n", + " for (\n", + " s,\n", + " (i, j),\n", + " ) in state_to_pos.items(): # Calling .items() returns a list-like sequence of (key, value) pairs in the dictionary.\n", + " cell = maze_str[i][j]\n", + "\n", + " if cell == \"S\":\n", + " label = f\"S\\n{s}\"\n", + " color = \"green\"\n", + " elif cell == \"G\":\n", + " label = f\"G\\n{s}\"\n", + " color = \"blue\"\n", + " elif cell == \"X\":\n", + " label = f\"X\\n{s}\"\n", + " color = \"red\"\n", + " else:\n", + " label = str(s)\n", + " color = \"black\"\n", + "\n", + " ax.text(\n", + " j,\n", + " i,\n", + " label, # Attention : matplotlib, text(x, y, ...) expects (column, row)\n", + " ha=\"center\",\n", + " va=\"center\",\n", + " fontsize=10,\n", + " fontweight=\"bold\",\n", + " color=color,\n", + " )\n", + "\n", + " ax.set_xticks([]) # remove numeric axes, we don't need.\n", + " ax.set_yticks([])\n", + " ax.set_title(\"Maze with state indices\")\n", + "\n", + " plt.show()\n", + "\n", + "\n", + "plot_maze_with_states()" + ] + }, + { + "cell_type": "markdown", + "id": "db078d86", + "metadata": {}, + "source": [ + "### 2.4 Actions and deterministic movement" + ] + }, + { + "cell_type": "markdown", + "id": "96e7f1f2-9d73-410b-853d-e39f40dfb5da", + "metadata": {}, + "source": [ + "We first define integer codes for each action. \n", + "\n", + "**Exercise 3.** How many possible actions can the agent take in the maze?" 
+ ] + }, + { + "cell_type": "markdown", + "id": "22259ab4-527e-4d7c-bb30-98fb240da6d5", + "metadata": {}, + "source": [ + "We have four possible actions in the maze. \n", + "\n", + "In this following cell, each action is mapped to an integer (0,1,2,3). This makes it easy to store and use actions inside arrays and matrices\n", + "\n", + "Here we use Unicode arrow character:\n", + "\n", + "- \"\\u2191\" : ↑ (up arrow)\n", + "\n", + "- \"\\u2192\" : → (right arrow)\n", + "\n", + "- \"\\u2193\" : ↓ (down arrow)\n", + "\n", + "- \"\\u2190\" : ← (left arrow)" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "id": "f7f0b8e4-1f48-4d03-9e5f-a47e59c3e827", + "metadata": {}, + "outputs": [], + "source": [ + "A_UP, A_RIGHT, A_DOWN, A_LEFT = 0, 1, 2, 3\n", + "ACTIONS = [A_UP, A_RIGHT, A_DOWN, A_LEFT]\n", + "action_names = {A_UP: \"\\u2191\", A_RIGHT: \"\\u2192\", A_DOWN: \"\\u2193\", A_LEFT: \"\\u2190\"}" + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "id": "3773781c-a0cd-48db-967b-d4b432d17046", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "↑\n" + ] + } + ], + "source": [ + "print(action_names[0])" + ] + }, + { + "cell_type": "markdown", + "id": "4b957f5a-ee39-4437-abc1-4809105ad83c", + "metadata": {}, + "source": [ + "**Exercise 4.** Now we define a **deterministic movement function** `move_deterministic(i, j, a)`. \n", + "\n", + "This function simulates the robot trying to move from (i, j) in direction a.\n", + "\n", + "But if the movement hits a wall or boundary, the agent stays in place." + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "id": "4b06da5e-bc63-48e5-a336-37bce952443d", + "metadata": {}, + "outputs": [], + "source": [ + "def move_deterministic(i: int, j: int, a: int) -> tuple[int, int]:\n", + " \"\"\"Deterministic movement on the grid. 
If the movement hits a wall or boundary, the agent stays in place.\n", + "\n", + " Args:\n", + " i (int): current row index\n", + " j (int): current column index\n", + " a (int): action to take (A_UP, A_DOWN, A_LEFT, A_RIGHT)\n", + "\n", + " Returns:\n", + " (tuple[int, int]): new (row, column) position after taking action a\n", + "\n", + " \"\"\"\n", + " candidate_i, candidate_j = (\n", + " i,\n", + " j,\n", + " ) # It means “Unless the action succeeds, the robot stays in place.”\n", + "\n", + " # Now each action changes the coordinates of the robot:\n", + " if a == A_UP:\n", + " candidate_i, candidate_j = (\n", + " i - 1,\n", + " j,\n", + " ) # if the action is UP, then row becomes row -1\n", + " elif a == A_DOWN:\n", + " candidate_i, candidate_j = (\n", + " i + 1,\n", + " j,\n", + " ) # if the action is DOWN, then row becomes row +1\n", + " elif a == A_LEFT:\n", + " candidate_i, candidate_j = (\n", + " i,\n", + " j - 1,\n", + " ) # if the action is LEFT, then column becomes column -1\n", + " elif a == A_RIGHT:\n", + " candidate_i, candidate_j = (\n", + " i,\n", + " j + 1,\n", + " ) # if the action is RIGHT, then column becomes column +1\n", + "\n", + " # Check boundaries\n", + " if not (0 <= candidate_i < n_rows and 0 <= candidate_j < n_cols):\n", + " # If the robot tries to move outside the maze\n", + " # It will not move and it stays at (i, j).\n", + " return i, j\n", + "\n", + " # Check wall\n", + " if maze_str[candidate_i][candidate_j] == \"#\":\n", + " # If the next cell is a wall, the robot stays in place.\n", + " return i, j\n", + "\n", + " return candidate_i, candidate_j # Otherwise, return the new position\n" + ] + }, + { + "cell_type": "markdown", + "id": "c9e620e6", + "metadata": {}, + "source": [ + "### 2.5 Transition probabilities and reward function" + ] + }, + { + "cell_type": "markdown", + "id": "80bd2bca-7717-4b5f-bffa-76fe86a51d35", + "metadata": {}, + "source": [ + "Recall that we set the discount factor $\\gamma \\in(0,1)$, that is, the 
future rewards are multiplied by $\\gamma$, so immediate rewards matter a little bit more than future ones. \n", + "\n", + "\n", + "Moreover, we consider a probability error $p_{\\text{error}}$, which means, with probability $p_{\\text{error}}$, the robot **does not** execute the intended action but one of the 3 other directions (chosen uniformly). With probability $1-p_{\\text{error}}$, the robot executes the action that we asked." + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "id": "610253e7-f3f7-4a30-be3e-2ec5a1e2ed04", + "metadata": {}, + "outputs": [], + "source": [ + "gamma = 0.95\n", + "p_error = 0.1 # probability of the error to a random other direction\n" + ] + }, + { + "cell_type": "markdown", + "id": "0d1ceff8-86e0-4c45-83d3-af9fae974608", + "metadata": {}, + "source": [ + "Now we initialize the state–transition probability : the probability of reaching next state $s'$ after taking action $a$ in state $s$. \n", + "$$\n", + " p(s' \\mid s, a)\n", + " = \\mathbb{P} \\big[S_t=s'\\,|\\, S_{t-1}=s, \\,A_{t-1}=a\\big]\n", + "$$\n", + "\n", + "We store these transition probabilities in the 3D array `P` (`P[a][s, s_next]`), which has shape `(n_actions, n_states, n_states)`:\n", + "\n", + "`P[a, s, s_next] = P(S_{t+1} = s_next | S_t = s, A_t = a)`.\n", + "\n", + "We also initialize the reward vector `R`, which has length `n_states`, where `R[s]` is the reward received when the agent is in state `s`.\n", + "\n", + "In this maze game, we assume that the reward depends only on the current state, which is natural: in navigation tasks, being in a particular location is what matters, not the direction you used to reach it." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 85, + "id": "7a51f242-fe4e-4e74-8a1f-a8df32b194b8", + "metadata": {}, + "outputs": [], + "source": [ + "# Initialize transition matrices and reward vector\n", + "P = np.zeros((len(ACTIONS), n_states, n_states))\n", + "R = np.zeros(n_states)" + ] + }, + { + "cell_type": "markdown", + "id": "c08f4af5-a2a7-4baa-b5da-c7ce636d8a4a", + "metadata": {}, + "source": [ + "Now we assign the reward to each state. \n", + "\n", + "For each state index s:\n", + "\n", + "1. If s is a goal, then the reward = +1.0\n", + "2. If s is a trap, then the reward = −1.0\n", + "3. Otherwise for the normal cell, the reward = −0.01 every time you leave this cell.\n", + "\n", + "Recall that rewards are received at the moment the agent executes an action. Here when the agent moves out of the cell, we set reward −0.01. " + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "id": "49d54d1f-dc29-45b6-ad31-ad0e848f920d", + "metadata": {}, + "outputs": [], + "source": [ + "# Set rewards for each state\n", + "step_penalty = -0.01\n", + "goal_reward = 1.0\n", + "trap_reward = -1.0\n" + ] + }, + { + "cell_type": "markdown", + "id": "dd571ec8-c36a-4e20-bec6-9e6458dc622b", + "metadata": {}, + "source": [ + "**Exercise 5.** Why do we set the step penalty to -0.01 in this MDP?" + ] + }, + { + "cell_type": "markdown", + "id": "1e8ea171", + "metadata": {}, + "source": [ + "We set a small negative step penalty (`-0.01`) for two main reasons:\n", + "\n", + "- Incentivize Efficiency: It forces the agent to find the shortest path to the goal. By losing a small amount of reward at every step, the agent learns that the faster it reaches the goal, the higher its total cumulative return will be.\n", + "\n", + "- Prevent Loitering: It discourages infinite loops or wandering. 
Without this penalty (i.e., if step reward = 0), the agent might be indifferent between reaching the goal now or in 1000 steps, potentially leading to a policy that never terminates." + ] + }, + { + "cell_type": "markdown", + "id": "07bfb065-b1af-4df1-885e-780fe250f2fb", + "metadata": {}, + "source": [ + "**Exercise 6.** We now define the reward vector. Recall that we have already initialized\n", + "`R = np.zeros(n_states)`.\n", + "If a state belongs to `goal_states`, we assign the `goal_reward`.\n", + "If it belongs to `trap_states`, we assign the `trap_reward`.\n", + "Otherwise, we assign the `step_penalty`. " + ] + }, + { + "cell_type": "code", + "execution_count": 87, + "id": "b9b7495a-c233-425c-99c0-5bddaf6c3225", + "metadata": {}, + "outputs": [], + "source": [ + "for s in range(n_states):\n", + " if s in goal_states:\n", + " R[s] = goal_reward\n", + " elif s in trap_states:\n", + " R[s] = trap_reward\n", + " else:\n", + " R[s] = step_penalty\n" + ] + }, + { + "cell_type": "markdown", + "id": "b90fb80c-9452-48a2-889f-286703c2ae93", + "metadata": {}, + "source": [ + "Now we define terminal states and a helper function. Here terminal_states is a set containing all absorbing states, which means, reaching them ends the episode conceptually. \n", + "\n", + "Moreover, `is_terminal(s)` is a small helper to check if a state is terminal." + ] + }, + { + "cell_type": "code", + "execution_count": 88, + "id": "eca4c571-39c7-468b-af86-0bab9489415e", + "metadata": {}, + "outputs": [], + "source": [ + "terminal_states = set(goal_states + trap_states)\n", + "\n", + "\n", + "def is_terminal(s: int) -> bool:\n", + " \"\"\"Check if a state is terminal (goal or trap).\"\"\"\n", + " return s in terminal_states\n" + ] + }, + { + "cell_type": "markdown", + "id": "3a9a1d54-8339-402b-84e9-105961ed78d7", + "metadata": {}, + "source": [ + "Now we need to fill the transition matrices `P[a][s, s_next]`. 
\n" + ] + }, + { + "cell_type": "markdown", + "id": "d9cfd15c-12cc-48bb-bd88-07f3ae3db31c", + "metadata": {}, + "source": [ + "**Exercise 7.** **Complete the `# TO DO` part in the program below** to fill the transition matrices `P[a][s, s_next]`. " + ] + }, + { + "cell_type": "code", + "execution_count": 89, + "id": "2d03276b-e206-4d1f-9024-f6948ca61523", + "metadata": {}, + "outputs": [], + "source": [ + "for s in range(n_states):  # Loop over all states s.\n", + "    i, j = state_to_pos[\n", + "        s\n", + "    ]  # Recover the grid coordinates (i, j) of state s.\n", + "\n", + "    # In a goal or trap state, whichever action is “chosen”,\n", + "    # the agent stays in the same state with probability 1.\n", + "    # This makes the terminal states absorbing.\n", + "    if is_terminal(s):\n", + "        # Terminal states: stay forever\n", + "        for a in ACTIONS:\n", + "            P[a, s, s] = 1.0  # a probability, not a reward\n", + "        continue\n", + "\n", + "    # If the state is non-terminal, we define the stochastic movement.\n", + "    # For a given state s and intended action a:\n", + "    # with probability 1 - p_error, the robot moves in direction a;\n", + "    # with probability p_error, it moves in one of the other 3 directions, each with probability p_error / 3.\n", + "    for a in ACTIONS:\n", + "        # main action (intended action)\n", + "        main_i, main_j = move_deterministic(i, j, a)\n", + "        s_main = pos_to_state[\n", + "            (main_i, main_j)\n", + "        ]  # s_main is the state index of that next cell.\n", + "        P[a, s, s_main] += (\n", + "            1 - p_error\n", + "        )  # Add probability 1 - p_error to P[a, s, s_main].\n", + "\n", + "        # error actions\n", + "        other_actions = [\n", + "            a2 for a2 in ACTIONS if a2 != a\n", + "        ]  # other_actions = the 3 actions different from a.\n", + "        for a2 in other_actions:  # for each error action,\n", + "            error_i, error_j = move_deterministic(i, j, a2)\n", + "            s_error = pos_to_state[(error_i, error_j)]  # get its state index s_error\n", + "            P[a, s, s_error]
+= p_error / len(\n", + " other_actions\n", + " ) # add p_error / 3 to P[a, s, s_error]\n", + "# So for each (s,a), probabilities over all s_next sum to 1.\n" + ] + }, + { + "cell_type": "markdown", + "id": "7841b264-af00-4322-b728-adcffac0ef89", + "metadata": {}, + "source": [ + "Now we check if the transition matrices `P[a][s, s_next]` are computed correctly.\n", + "For each action `a`, we sum the transition probabilities over all possible next states `s_next` and verify that these sums are equal to 1.\n", + "\n", + "This is because the matrix `P[a, s, s_next]` stores the transition probability\n", + "\n", + "$\\mathbb{P} \\big[S_t=s_{\\text{next}}\\,|\\, S_{t-1}=s, \\,A_{t-1}=a\\big]$. \n", + "\n", + "Therefore, for each action $a$, and for each state $s$, the sum over $s_{\\text{next}}$ of $\\mathbb{P} \\big[S_t=s_{\\text{next}}\\,|\\, S_{t-1}=s, \\,A_{t-1}=a\\big]$ should be 1. " + ] + }, + { + "cell_type": "code", + "execution_count": 90, + "id": "341fe630-8f87-4773-84ad-92d3516e53e2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Action ↑: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n", + "Action →: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n", + "Action ↓: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n", + "Action ←: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n" + ] + } + ], + "source": [ + "for a in ACTIONS:\n", + " # For each action a:\n", + " # P[a] is a matrix of shape (n_states, n_states).\n", + " # P[a].sum(axis=1) sums over next states s_next, giving for each state s:\n", + " # We print these row sums.\n", + " # If everything is correct, they should be very close to 1.\n", + "\n", + " probs = P[a].sum(axis=1)\n", + " print(f\"Action {action_names[a]}:\", probs)\n" + ] + }, + { + "cell_type": "markdown", + "id": "46d23991", + "metadata": {}, + "source": [ + "## 3. 
Policy evaluation\n", + "\n", + "### 3.1 Bellman expectation equation" + ] + }, + { + "cell_type": "markdown", + "id": "305b047c-e83b-4f42-b64e-e2050d5deeff", + "metadata": {}, + "source": [ + "Recall that the value function under a policy $\\pi$ is defined as:\n", + "$$\n", + "V^{\\pi}(s)=\\mathbb{E}\\Big[\\:G_t \\:\\Big|\\: S_t=s\\:\\Big]\n", + "$$\n", + "where the return $G_t$ is\n", + "$$\n", + "G_t=R_t +\\gamma R_{t+1}+\\gamma^2 R_{t+2}+\\dots \n", + "$$\n", + "In words: *the value of a state is the expected discounted sum of all future rewards\n", + "when following policy $\\pi$.*\n", + "\n", + "We know that $G_t=R_t+\\gamma G_{t+1}$. Plugging this into the definition of $V^{\\pi}(s)$ and using the linearity of expectation, we get \n", + "$$\n", + "V^{\\pi}(s)=\\mathbb{E}\\Big[\\:R_t \\:\\Big|\\: S_t=s\\:\\Big]+\\gamma\\mathbb{E}\\Big[\\:G_{t+1} \\:\\Big|\\: S_t=s\\:\\Big]. \n", + "$$\n", + "This step simply says: “total future reward = immediate reward + discounted future reward from the next state.”" + ] + }, + { + "cell_type": "markdown", + "id": "88ea8d56-3b62-4690-9ff7-469e43726fbc", + "metadata": {}, + "source": [ + "For the expected immediate reward $\\mathbb{E}[R_t| S_t=s]$: in this maze problem, the reward depends only on the current state, not on the time step, i.e., $\\mathbb{E}[R_t| S_t=s]=R(s)$. Hence we get \n", + "$$\n", + "V^{\\pi}(s)=R(s)+\\gamma\\mathbb{E}\\Big[\\:G_{t+1} \\:\\Big|\\: S_t=s\\:\\Big]. \n", + "$$\n", + "\n", + "Moreover, in this maze problem, we consider a deterministic policy $A_t=\\pi(s)$ (the action depends only on the state). Therefore, \n", + "$$\n", + "V^{\\pi}(s)=R(s)+\\gamma\\mathbb{E}\\Big[\\:G_{t+1} \\:\\Big|\\: S_t=s, A_t=\\pi(s)\\:\\Big].
\n", + "$$\n", + "\n", + "Now **given the state $S_t=s$ and $A_t=a$**, the next state is random (because of the error probability) and we know the transition probability \n", + "$$\n", + "\\mathbb{P}\\big(\\:S_{t+1}=s' \\:|\\:S_t=s, \\, A_t=a\\big)=P\\big(s'\\:\\big|\\:s, a\\big). \n", + "$$" + ] + }, + { + "cell_type": "markdown", + "id": "c25e255d-8f58-4eaf-9485-cee6ab3bea6c", + "metadata": {}, + "source": [ + "Therefore,\n", + "$$\n", + "\\mathbb{E}\\big[\\,G_{t+1}\\,|\\,S_t=s,A_t=a\\,\\big] =\\sum_{s'}\\mathbb{E}\\big[\\,G_{t+1}\\,|\\,S_{t+1}=s'\\,\\big]\\times \\mathbb{P}\\big[S_{t+1}=s'\\,\\big|\\,S_t=s, A_t=a\\, \\big]\n", + "$$\n", + "$$\n", + "\\hspace{-1.2cm}=\\sum_{s'}V^{\\pi}(s')P\\big(s'\\:\\big|\\:s, a\\big),\n", + "$$\n", + "where here we use the Markov property. (**Question: Can you show the detailed computations here?**)" + ] + }, + { + "cell_type": "markdown", + "id": "9a2b6cff-e848-44a2-b504-973067b367b3", + "metadata": {}, + "source": [ + "In conclusion, we have (the Bellman expectation equation)\n", + "$$\n", + "V^{\\pi}(s)=R(s)+\\gamma \\sum_{s'}P\\big(\\,s'\\,\\big|\\,s, \\pi(s)\\,\\big)V^{\\pi}(s').\n", + "$$" + ] + }, + { + "cell_type": "markdown", + "id": "15049fdb-f3af-4f78-b556-817284260ed0", + "metadata": {}, + "source": [ + "### 3.2 Define a function which computes the value function $V^{\\pi}(s)$ for a given deterministic policy. \n", + "\n", + "\n", + "**Exercise $8^*$.** Now we define `policy_evaluation(...)`, which computes the value function $V^{\\pi}(s)$ for a given deterministic policy. \n", + "\n", + "The input of this function `policy_evaluation(...)` are:\n", + "1. policy: array of size `n_states`, each entry is an action 0,1,2,3, which correspond to UP, RIGHT, DOWN, LEFT.\n", + "2. `P`: the transition probabilities `P[a, s, s']`.\n", + "3. `R`: the reward vector `R[s]`.\n", + "4. gamma: the discount factor $\\gamma\\in(0,1)$.\n", + "5. theta: convergence threshold.\n", + "6. 
max_iter: which is used to avoid infinite loops.\n", + "\n", + "How can we apply the Bellman expectation equation\n", + "$$\n", + "V^{\\pi}(s)=R(s)+\\gamma \\sum_{s'}P\\big(\\,s'\\,\\big|\\,s, \\pi(s)\\,\\big)V^{\\pi}(s').\n", + "$$\n", + "here ?\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "5c48f489-3508-4981-8b35-5bedc2e5838c", + "metadata": {}, + "source": [ + "We start with an initial guess of $V^{\\pi}$(e.g., all values = 0) and repeatedly apply the Bellman equation to update each state:\n", + "$$\n", + "V_{k+1}^\\pi(s) \\leftarrow R(s)+\\gamma \\sum_{s'}P\\big(\\,s'\\,\\big|\\,s, \\pi(s)\\,\\big)V^{\\pi}_k(s').\n", + "$$\n", + "until values converge." + ] + }, + { + "cell_type": "code", + "execution_count": 91, + "id": "2fffe0b7", + "metadata": {}, + "outputs": [], + "source": [ + "def policy_evaluation( # noqa: PLR0913\n", + " policy: np.ndarray,\n", + " P: np.ndarray,\n", + " R: np.ndarray,\n", + " gamma: float,\n", + " theta: float = 1e-6,\n", + " max_iter: int = 10_000,\n", + ") -> np.ndarray:\n", + " \"\"\"Evaluate a deterministic policy for the given MDP.\n", + "\n", + " Args:\n", + " policy: array of shape (n_states,), with values in {0,1,2,3}\n", + " P: array of shape (n_actions, n_states, n_states)\n", + " R: array of shape (n_states,)\n", + " gamma: discount factor\n", + " theta: convergence threshold\n", + " max_iter: maximum number of iterations\n", + "\n", + " \"\"\"\n", + " n_states = len(R) # get the number of states\n", + " V = np.zeros(n_states) # initialize the value function\n", + "\n", + " for _it in range(max_iter): # Main iterative loop\n", + " V_new = np.zeros_like(\n", + " V\n", + " ) # Create a new value vector and we will compute an updated value for each state.\n", + "\n", + " # Now we update each state using the Bellman expectation equation\n", + " for s in range(n_states):\n", + " a = policy[s] # Extract the action chosen by the policy in state\n", + " V_new[s] = R[s] + gamma * np.sum(P[a, s, :] * V)\n", + "\n", + " 
delta = np.max(\n", + " np.abs(V_new - V)\n", + " ) # This measures how much the value function changed in this iteration:\n", + " # If delta is small, the values start to converge; otherwise, we need to keep iterating.\n", + " V = V_new # Update V, i.e. Set the new values for the next iteration.\n", + "\n", + " if delta < theta: # Check convergence: When changes are tiny, we stop.\n", + " break\n", + "\n", + " return V # Return the final value function, this is our estimate for V^{pi}(s), s in the state set.\n" + ] + }, + { + "cell_type": "markdown", + "id": "09ef3439", + "metadata": {}, + "source": [ + "### 3.3 Evaluating a random policy" + ] + }, + { + "cell_type": "markdown", + "id": "eecbca15-f89f-47bf-a13d-7d7c051699b8", + "metadata": {}, + "source": [ + "Now we use the policy evaluation function `policy_evaluation` to evaluate a random policy. \n", + "\n", + "We first generate a `random_policy`, which is an array like [2, 0, 1, 3, 0, 2, ...] and has the size `n_states`. (Recall that the policy is a mapping from states to actions)." + ] + }, + { + "cell_type": "code", + "execution_count": 92, + "id": "b4a44e38", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[0 3 2 1 1 3 0 2 0 0 2 3 2 3 2 3 2 0 3 1 2 1]\n" + ] + } + ], + "source": [ + "# Random policy: for each state, pick a random action\n", + "random_policy = rng.integers(low=0, high=len(ACTIONS), size=n_states)\n", + "\n", + "print(random_policy)" + ] + }, + { + "cell_type": "markdown", + "id": "3fe07992-ce82-4124-aebc-a6384d417f64", + "metadata": {}, + "source": [ + "Now we call the function `policy_evaluation(...)` to compute $V^{\\pi_{\\text{random}}}(s)$." 
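, + "\n", + "\n", + "As a sanity check on the iterative result, note that for a fixed policy the Bellman expectation equation is linear in $V^{\\pi}$. A sketch, where $P_{\\pi}$ is notation introduced only for this remark (the $n_S \\times n_S$ matrix whose row $s$ is $P[\\pi(s)][s, \\cdot]$):\n", + "$$\n", + "V^{\\pi} = R + \\gamma P_{\\pi} V^{\\pi} \\quad\\Longleftrightarrow\\quad (I - \\gamma P_{\\pi})\\,V^{\\pi} = R \\quad\\Longleftrightarrow\\quad V^{\\pi} = (I - \\gamma P_{\\pi})^{-1} R.\n", + "$$\n", + "Since the update is a $\\gamma$-contraction in the sup norm, stopping when $\\|V_{k+1} - V_k\\|_\\infty < \\theta$ should leave an error of at most $\\gamma\\theta/(1-\\gamma)$; with $\\gamma = 0.95$ and the default $\\theta = 10^{-6}$ this is $1.9 \\times 10^{-5}$."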
+ ] + }, + { + "cell_type": "code", + "execution_count": 93, + "id": "c5f559b2-452a-477c-a1fa-258b40805670", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Value function under random policy:\n", + "[ -0.2 -0.2 -0.201 -0.204 -0.205 -0.202 -0.214 -0.429 -0.212\n", + " -0.207 -0.276 -0.459 -0.352 -0.366 -5.827 -4.605 20. -0.366\n", + " -0.999 -20. -6.4 -3.163]\n" + ] + } + ], + "source": [ + "V_random = policy_evaluation(policy=random_policy, P=P, R=R, gamma=gamma)\n", + "print(\"Value function under random policy:\")\n", + "print(V_random)\n" + ] + }, + { + "cell_type": "markdown", + "id": "f46c70ba-2932-49af-b568-b5477260bc94", + "metadata": {}, + "source": [ + "In this value vector:\n", + "- A negative value means that, from that state, the agent tends to wander aimlessly, fall into traps, or take too long to reach the goal.\n", + "- A higher value means that the agent is closer to the goal or more likely to reach it." + ] + }, + { + "cell_type": "markdown", + "id": "1efcb076-467c-42d8-94e8-87453f688bbd", + "metadata": {}, + "source": [ + "Now we define a function `plot_values`, which displays the value function $V(s)$ as a heatmap on the maze grid. It helps to visually understand:\n", + "- which states are good (high value, near the goal),\n", + "- which states are bad (low value, near traps),\n", + "- how a policy affects the long-term expected reward."
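, + "\n", + "\n", + "A quick way to read the extremes of the color scale, assuming the goal and the trap are absorbing states with constant rewards $r = +1$ and $r = -1$ (an assumption, not shown in this section, that is consistent with the printed values above): an absorbing state satisfies\n", + "$$\n", + "V^{\\pi}(s) = r + \\gamma V^{\\pi}(s) \\quad\\Longrightarrow\\quad V^{\\pi}(s) = \\frac{r}{1-\\gamma} = \\frac{\\pm 1}{1 - 0.95} = \\pm 20.\n", + "$$"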
+ ] + }, + { + "cell_type": "code", + "execution_count": 94, + "id": "4c428327", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAdcAAAGbCAYAAACWHtrWAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjEsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvc2/+5QAAAAlwSFlzAAAPYQAAD2EBqD+naQAARjhJREFUeJzt3Qd4FGX+B/DvptcNoSQhEHqHUAUUUWnSBEQ9FRtgvb/lUBEBGyJFBBThlAPvTkVQBPQUEAREBFHpJfROIAFSSCC9kez8n/dddskmm7rvbrLL9+MzLjsz+868O5v5zdtmdJqmaSAiIiJl3NQlRURERAKDKxERkWIMrkRERIoxuBIRESnG4EpERKQYgysREZFiDK5ERESKMbgSEREpxuBKRESkGIOrg507dw46nQ6LFi2qku2vX78eHTt2hI+Pj9yPlJQUVEdi3yZPnlzVu1GtiN+M+F7Eb8hViWMu8lhYo0aNMHr06CrbJ6LKYHAtxbBhw+Dn54f09PQS13nsscfg5eWF5ORkVHdiHx966CH4+vpi/vz5WLJkCfz9/atsf37++WcGUCJySR5VvQPVmQicP/30E3788UeMHDmy2PKsrCysWrUKAwcORK1atVDd7d69W14oTJ06Ff369avq3ZHBVQR5awE2OzsbHh78eRJw4sQJuLmxHEDOhb/YMkqugYGBWLp0qdXlIrBmZmbKIOwMEhMT5WuNGjVQ3Ylq6+ocXMXzLsQFANmft7c3PD09q3o3iCqEwbUUovr0/vvvx6ZNm8yBqTARdEXwFUH4ypUrGDduHCIjIxEQEAC9Xo9BgwbhwIEDZW6nV69ecipKtDOJ9qbCDAYD5s6di7Zt28oAFBoair///e+4evVqmdsYNWqU/HfXrl1lu5apHaukNq2i+7Vlyxb5uRUrVmD69OmoX7++3Ie+ffvi9OnTxT6/c+dODB48GMHBwbL6uX379pg3b545b6LUKog0TVNpba779++X36n4bsV3LLa7Y8cOq+2Sf/31F8aOHYs6derIbd933324fPmyxbqpqak4fvy4fC2L+I6GDBmCDRs24JZbbpG/jc8++0wu+/LLL9GnTx+EhITIQNCmTRssWLCgxDT+/PNPdOvWTX53TZo0weLFi4ute+TIEZmm2I74nqdNmyaPvTX/+te/5O9BbDs8PBwvvvhisbZ0cRzbtWuHgwcP4q677pLNHc2aNcP3338vl//+++/o3r273F7Lli3x66+/lvmdmH4Py5cvx5tvvomwsDD5XYu/h9jY2GLrf/fdd+jSpYvcRu3atfH444/j4sWLZW7H2u9T5O/VV1+Vy0S+xXckapeSkpKQkZEh9+Pll18ultaFCxfg7u6OGTNmlLldIpuIR85RyX755RfxSD7tk08+sZifnJyseXp6aiNHjpTvd+/erTVt2lSbOHGi9tlnn2lTpkzR6tWrpwUFBWkXL140fy46Olqm9+WXX5rn3XXXXXIqatSoUVrDhg0t5j3zzDOah4eH9uyzz2oLFy7UJkyYoPn7+2tdu3bV8vLySs3Hc889J7ct9m3JkiXatm3b5DKxDbGtooru1+bNm+XnO3XqpHXp0kX7+OOPtcmTJ2t+fn5at27dim3Py8tLpv3uu+9qCxYs0MaMGaP169dPLhfbvvvuu2V6Yl9Mk4mYLz5ncvjwYZnPunXralOnTtU++OADrXHjxpq3t7e2Y8cO83riezXtY58+feRxe+211zR3d3ftoYcesthH07qFj0VJRD6aNWumBQcHy2MsvnvxfQjiux89erT8PsT2+vfvL9P99NNPi6XRsmVLLTQ0VHvzzTfl8s6dO2s6nU7mzyQuLk6rU6eO3Jb4fm
fPnq01b95ca9++vUxX/IZMxHck5onvVWz7pZdeknkt+nsQxzE8PFyLiIjQXn/9dblumzZt5LrLli3TwsLC5Lbmzp1r/t2mpaWV+p2Yfg+RkZFy3+bMmSO/Gx8fH61FixZaVlZWse9a7Jf4nsR6vr6+WqNGjbSrV68Wy0/R763w7zM9PV1r166d3HfxdyB+W+I3IdLev3+/XOexxx6T33N+fr5FWrNmzZLf9/nz58s85kS2YHAtg/jjFCf02267zWK+OLmKk8CGDRvk+5ycHK2goMBiHXESFCd/EcxUBNc//vhDfvabb76xWG/9+vVW5xdlOsGJC4HCKhpcW7dureXm5prnz5s3T84/dOiQ+TsTgU+kW/jEKRgMBvO/X3zxxWIn0pKC6/Dhw2WwPnPmjHnepUuXtMDAQO3OO+8slkcRbApv69VXX5Un45SUlEoHV7Gu+K6LKhxETAYMGKA1adLEahpbt241z0tMTJS/EXEBYPLKK6/I9Xbu3Gmxngh4hYOrmCe+ExHMC//2RNAW633xxRfmeeI4inlLly41zzt+/Lic5+bmZnGBIn7T5fleTL8HEYwLB+IVK1bI+eJ3IYggHxISIgNidna2eb01a9bI9SZNmlSh4CrWF+v88MMPxfbJdMxNeVi3bp3FcnERYO1vjUg1VguXQVQhjRgxAtu3b7cYAiGqhEWVrKiaFETVlKnTRUFBgeyZK6ouRRXbvn37lOyLqFYLCgrC3XffLau/TJOoahPb2rx5MxzhySeflD2kTe644w75evbsWXP1bXR0NF555ZVi7btFh1mUh/g+f/nlFwwfPlxWo5rUrVsXjz76qKxmTUtLs/jMc889Z7EtsY8infPnz5vniapGEcfLO8yjcePGGDBgQLH5oprTRFQxi2Miql7F91G0yllUGZu+L0FUW4vfiOm7M3X0uvXWW2XVceH1irbti6rbvLw8+T0X7vDz7LPPyqrztWvXWqwvfiPit2witiuOT+vWrWWVsInp34X3qTSiOlY0j5j87W9/k8dG5EPYs2ePbFZ54YUXZFW4yT333INWrVoV28+y/O9//0OHDh1kVX9RpmMuOuyJKvJvvvnGvOzw4cOyWlxURxPZG4NrOZhOaqaOTaLd5o8//pAnKhF8BdEe9vHHH6N58+Yy0Io2JXFCFH/M5WnTK49Tp07JtETbnki78CTamay1C9tDgwYNLN6LNlXB1O575swZ+Sra+FQQbaWiZ7YIBkWJwCC++6JtfGXtY2WI4GqNaN8VJ3PRzieClTgeog1SKHrsi+6Xad8K75e4ABC/o6KK5t90oVB0vrjwERchhS8kBNEuWfTiRlysRUREFJtXke+q6L6KbYj2XNPFaEn7KYjgWnQ/yyJ+X2X9tsTFhvi7XblypfztCCLQiuD+4IMPVmh7RJVRfbtjViOiZChOAt9++608aYpXUeIpXJJ4//338c477+Cpp56SQ11q1qwp/8BFqaKkjiiFT0bGmlBLoqRVmEhHBNbCV+OFiZN6ZZRUmhTbN108FGZtnmAtD1XFHvtYuIRa+EQvai/E72POnDkyUIngJkpt4mKr6LGvyu+upG07w/GsDFGinj17tgywjzzyiLw4Fh3KTBcPRPbE4FpOIpCK4ClKouKPVFyti163JqLXZe/evfH5558X69UoSrGlESUXa1VwRa/omzZtKqsCb7/9dqsn+soS27d2pyax/cLVsOUl9tNUDVfaeNryVhGLiwbRu1WMdyxK9PYVFzFFS1+OIsZB5+bmYvXq1RalUluq6Bs2bChrKYoqmn+xnml+4eMkqopFtbyjxjIX3VcRlEXvcdE7vOh+ih7QhYl5puUV+X2J31ZZROm2U6dO8mJUlNpjYmLwySefVGhbRJXFauFyMpVSJ02ahKioqGLtX+Lqv+iVvmgjLc9QA3GyEEGi8FARMYRHVDcWJu6uJEqTomRcVH5+fqVvZSi2L4a0iJOyyZo1a6wOpyiPzp07yypUMWSo6D4V/o
5Md4cqa7/Fd9u/f385rrhwu3dCQoK80OnZs6dsY6yoigzFKW3fiuZLpCeG51SWGL4kjseuXbvM88Rvo2iNhQieopT8z3/+02L74gJP7INo03QEMZSo8F3MxIVmXFycHDYliKFLosZl4cKF8kLEZN26dTh27FiF9/OBBx6Qfx/i5i5FFf0bfOKJJ2R7vfgtihu9mPaJyN5Yci0nESx69OghT/BC0eAqqpumTJkiO/uI9Q4dOiRPhuUp+YmqZFGlKDrLPP3007LtVJyIxNjFwh11RCcZMaZVjNETAV4EHDG4XpQcRCAXY0hFZ5KKeuaZZ+QJUdxpSgRwUdX59ddfm0ugFSVKkmKc59ChQ+V9jMV3Ijq4iEAmxm+KsaKm6nZhzJgxMu+mzmPWiHGeGzdulIFUdIwRN5gQ40zFyXrWrFmV2k9xchb7JgJhZe9dK46BCHAir+LYiLbv//znPzKYiABTGePHj5e3phTHQ4zVFBch//73v2UJT9ScFC7Rv/HGG3jvvffkumJ8qSgJinGvolbFUR13RBOIOC7iuxQXPCKQiTZX0bFKEL/RmTNnyuXiNyyqaMV64vcqxqmK8aoV8frrr8vfq2g7FX874nckxpmL2gPxdyM6O5mIDm/i+xTH+vnnn+fNKMhxlPc/dmHz58+X3fuLjuk0DcURwynEsB0xfu/222/Xtm/fXmw4i7WhOMLXX38th26IoRUdO3aUQwmsjXMV/v3vf8txpmI7YiiKGGc4fvx4OTSlMkNxhI8++kgOqRDDQsS+79mzp8ShON99953FZ0vK059//inHsop9FGNUxTCIwuOFxZCdf/zjH3JMpxh7WPjnWHQojrBv3z45xCUgIECOre3du7d5rG5ZeTTtu2lsamWG4txzzz1Wl61evVrmTYzvFOM2Z86cKYfBFB2TWlIa1oZiHTx4UM4TaYrjIsZxfv7558XSNA29adWqlRx3LcZ2Pv/888WGQIm02rZtW+58ie2IoVKlMX2n3377rfbGG2/I4TbiNynSszaOdPny5XL8sfiN1axZU45FvXDhgsU65RmKYxpnLsb0iu9G/M3Ur19frpOUlFRsu4MHD5ZpFv2tENmTTvzPgbGciFyEuEOT6Gcgak0qU2PiKGLIjqhJsnYXMSJ7YZsrEbksUTUvxtGKtlciR2KbKxG5HNFbWnQI/O9//yvbWUV7OJEjseRKRC5HPIhAlFZFkP3qq6/kQwWIHIltrkRERIqx5EpERFQVba7iFm6XLl2SN+euzI3XiYioaolKSnGzD/FAg8IPelApJyfH4mY0thDjxws/6MElg6sIrFV1ezkiIlJH3HlN3A7SHoG1ceN6iI+/oiQ90U4u2sydNcCWK7iaHiclDkplbjNHRERVS9ztTRSSCj8eUCVRYhWB9dz5FdDr/WxKKy0tC40aPiTTdOngaqoKFoGVwZWIyHnZu2lPH+ADfYCNDxYp40lizoDjXImISB0RGA02BkcXCK7sLUxERKQYS65ERKQOS64SgysREakj7kuk2XhvIhe4txGrhYmIiBRjyZWIiNQxaAqqhZ2/5MrgSkRE6rDNVWK1MBERkWIsuRIRkTosuUoMrkREpA6Dq8TgSkRE6mgKgqtIw8mxzZWIiEgxllyJiEgZnWaQk61pODsGVyIiUodtrhKrhYmIiBRjyZWIiBTfoUmzPQ0nx+BKRETqsFpYYrUwERGRYiy5EhGROiy5SgyuRESk+HmuBtvTcHKsFiYiIlKMJVciIlKH1cISgysREanDoTgSgysREanDkqvENlciIiLFWHIlIiJ1+Mg5icGViIiU0RkMcrI1DWfHamEiIiLFWHIlIiLFN5HQbE/DyTG4EhGROuwtLLFamIiISDGWXImISB2WXCUGVyIiUod3aJIYXImISB2WXCW2uRIRESnGkisRESmuFjbYnoaTY3AlIiJ1OM5VYrUwERGRYiy5EhGROuzQJDG4EhGROqJK18BqYVYLExGRU9u6dSuGDh2K8P
Bw6HQ6rFy50mL56NGj5fzC08CBA+26Tyy5EhGRU1cLZ2ZmokOHDnjqqadw//33W11HBNMvv/zS/N7b2xv2xOBKREROHVwHDRokp9KIYBoWFgZHYbUwERFVS2lpaRZTbm5updPasmULQkJC0LJlSzz//PNITk6GPTG4EhGR+nsLG2ycAERERCAoKMg8zZgxo1K7JKqEFy9ejE2bNmHmzJn4/fffZUm3oKAA9sJqYSIiUkczGCdb0wAQGxsLvV5vczvpiBEjzP+OjIxE+/bt0bRpU1ma7du3L+yBJVciIqqWJVe9Xm8xqeqE1KRJE9SuXRunT5+GvTC4EhHRTeXChQuyzbVu3bp22warhYmIyKl7C2dkZFiUQqOjoxEVFYWaNWvK6b333sMDDzwgewufOXMG48ePR7NmzTBgwADYC4MrERE59cPS9+zZg969e5vfjx07Vr6OGjUKCxYswMGDB/HVV18hJSVF3miif//+mDp1ql3HulYouK7vPQV+7vYdeEvk6obsmg5Xs6bbW3A1rnicXFWvXr2glXLLxA0bNsDRWHIlIiJ1+DxXicGViIiculq4OmJvYSIiIsVYciUiIoUU3ERCpOHkGFyJiEgdVgtLrBYmIiJSjCVXIiJShyVXicGViIic+g5N1RGDKxERqcOSq8Q2VyIiIsVYciUiInVYcpUYXImISB22uUqsFiYiIlKMJVciIlJHPJ1Gs7Fa19bPVwMMrkREpA7bXCVWCxMRESnGkisREanDkqvE4EpEROqIJ+IYbOzta/NTdaoeq4WJiIgUY8mViIjUYbWwxOBKRETqiBpdg63BFU7PYcE1uH0DRE4YBv+IWsiIScahmauQcijW6roht7dE05F3ILBpGLT8AlzZfw5HPl6LnMQ08zqhd7VGm38MhE+IHqnHL+HA9B+ReT4JjuRqeXK1/LhqnlwRj5MLYcnVcW2unnpfdJ0zEudW7MCGvtNw/rsd6DZnJDwCfKyu7xHgjTOL/8CmobPw2/APcS0zF53fH2Fe7t+gNjpNeQhH5v6MDf2mI2nPWXT98HHo3B3XhOxqeXK1/LhqnlwRjxO5Iof82sJ6tUHO5TTErNoDw7UC+ZqbnC7nW3Npw0Ek/nUCBdl5KMi5huhlfyG4bYT5j6PeoI5I3nsWiX+egCEvH6c+3wyv4ADU7NjQEdlxyTy5Wn5cNU+uiMfJtWgGTcnk7BwSXPXNwpB2Ms5innivbx5Wrs/X6twY6ecuQyswWE1PzM+ITpTzHcXV8uRq+bG2D66QJ1fE4+Sitz/UbJycnEOCq7ufF/LTcyzmXUvPgYefd5mf1beoi5Z/74ejH681z/Pw85KfL5qeu3/Z6anianlytfy4ap5cEY8TuSK7dGiqN6ADIt+4V/47Oz4FSbvOyHYViw0H+CAvJbPUdAKbhqLbvFE4PPsnmYZJflZesfYYzwBvFGTmwl5cLU+ulh9XzZMr4nFycezQZL/genHDATmZRAzrgsYjehS74oxe+lepfzi3fvoUjs3fgIvrb6QlpJ2OR1CLuub3oq0loHEI0s4kwF5cLU+ulh9XzZMr4nFycQyujqsWjt9yFD4hQfKPSOfhLl99agcifssRq+sHNAmRfzgnFm7EhTX7ii2/uC4KtW5pgpAeLeDm6Y7mT/VCXmqW7JLvKK6WJ1fLj6vmyRXxOJEr0mla2S3HaWlpCAoKwvLOr8HPvXLtFsEdGiJyvHEcW2ZsEg59sBpXD8XIZT6hQei1/GVseXgechJS0eGd+1H/nk6yJ2BhpuWC6EnY+qUB8o8y9cQlHJj2g+PHULpYnlwtP9U1T0N2TYerWdPtLZs+z+Nkf6bzeGpqKvR6vd3Svzr7Seh9vWxLKzsPwa9/abd9dangSkSuedJWEVyrI1c7Tg4LrjNHqwmuExY5dXDlqGoiIiLFeG9hIiJSRlSGajZ2SCpHhWq1x+BKRETqsLewxOBKRETqMLhKbHMlIiKntnXrVgwdOhTh4eHQ6XRYuX
JlsWrmSZMmoW7duvD19UW/fv1w6tQpu+4TgysREakvuRpsnCogMzMTHTp0wPz5860unzVrFv75z39i4cKF2LlzJ/z9/TFgwADk5FjeJlMlVgsTEZE6Km68r1Xs84MGDZKT9aQ0zJ07F2+//Tbuvdd4283FixcjNDRUlnBHjLjxuEKVWHIlIqJqKS0tzWLKza34/aGjo6MRHx8vq4JNxHjc7t27Y/v27bAXBlciIlJGM6iZhIiICBkITdOMGTMqvD8isAqipFqYeG9aZg+sFiYiomrZWzg2NtbiDk3e3s5zh0CWXImIqFrS6/UWU2WCa1hYmHxNSLB8KpJ4b1pmDwyuRETk1L2FS9O4cWMZRDdt2mSeJ9pvRa/h2267DfbCamEiIlKmcJtpZVX08xkZGTh9+rRFJ6aoqCjUrFkTDRo0wCuvvIJp06ahefPmMti+8847ckzs8OHDYS8MrkRE5NT27NmD3r17m9+PHTtWvo4aNQqLFi3C+PHj5VjY5557DikpKejZsyfWr18PHx8fu+0TgysREamjKajWreA41169epV6s39x16YpU6bIyVEYXImISB1RpWtQkIaTY3AlIiJlxOPmNFsfOccb9xMREVFRLLkSEZE6rBaWGFyJiEgdUaOrKUjDybFamIiIqCpLrgM3T7K4z6OzW9PtLbiaIbumw5W44jFa7YJ54lU6mbBDkxGrhYmISB22uUq84CQiIlKMJVciInLqewtXRwyuRESkDquFJVYLExERKcaSKxERKcNqYSMGVyIiUkeMojEoSMPJMbgSEZEy4slvmmOfOFctsc2ViIhIMZZciYhIGba5GjG4EhGROhyKI7FamIiISDGWXImISBlWCxsxuBIRkTLsLWzEamEiIiLFWHIlIiJ1DDrjZGsaTo7BlYiIlGGbqxGrhYmIiBRjyZWIiJTRNJ2cbE3D2TG4EhGRMqwWNmJwJSIitUNxDLan4ewYXG0Q3L4BIicMg39ELWTEJOPQzFVIORRrdd2Q21ui6cg7ENg0DFp+Aa7sP4cjH69FTmKaeZ3Qu1qjzT8GwidEj9Tjl3Bg+o/IPJ/kwBy5HpXHKLBJCFq/Mhg1WoXDq4Y/1veZivyMHAfnCKhZKE+ZMck4OHMVrpaQp8IaDu+KDm8Ox+E5a3F22TY5L+S2FmjzjwHwCQmSZzTxuzs892ekn0mAI7nicaKbGzs0VZKn3hdd54zEuRU7sKHvNJz/bge6zRkJjwAfq+t7BHjjzOI/sGnoLPw2/ENcy8xF5/dHmJf7N6iNTlMewpG5P2NDv+lI2nMWXT98HDp3HqLqcowM+QbE/XoIUVP+h6rMk8hD9IodWN93GqK/24HupeTJxLt2IJo+3hNpp+It5qeejMP2fyzC+n7TsGHgDCT8dQLdZj0GR3LF43QzM7W5ajZOzo5n7koK69UGOZfTELNqDwzXCuRrbnK6nG/NpQ0HkfjXCRRk56Eg5xqil/2F4LYR5uBZb1BHJO89i8Q/T8CQl49Tn2+GV3AAanZs6OCcuQ7VxygzJgmxq/c6vFRXWF0recpJTpfzS9N+/DCc/GIz8tKyLOaL70NMJprBAN+6NRx6UeeKx+mmZtBBs3HiONebmL5ZGNJOxlnME+/1zcPK9flanRsj/dxlaAUGq+mJ+RnRiXJ+8t5oxXt/c1B9jJw1T3X7tIWHvzcu/ByFBkO7FFvuGxqEXkv/AQ8/b0AHnPzyd4fm2RWPExGDayW5+3khP92yHedaeo7xBFUGfYu6aPn3ftj7xrfmeR5+XvLzRdNz9y87PXLMMaouebpWgTx5BvqgzZiB2PGPRSWmmZ2QinV9p8m0I+7pjJyEVDiSKx6nmxnvLWzE4FpO9QZ0QOQb98p/Z8enIGnXGdlWVJhoI8pLySw1ncCmoeg2bxQOz/5JpmGSn5VXrI3JM8AbBZm5SvPhyux9jKoqTx2u5ymrhDx5Bvggt4Q8tRkzCDGr9yIzNrnMbRVk5eHc9zsx8Jc3sXXUv5
B16SrswRWPE93Aca5GDK7ldHHDATmZRAzrgsYjehS7io5e+lepJ4NbP30Kx+ZvwMX1N9IS0k7HI6hFXfN70X4U0DgEaWw3qjbHqDrkqcGwLmhiJU9nSshTna5NZZWw6TMiENdoXU+25e+ZaKW0pwPcvD1ku6u9gqsrHieiotihqZLitxyVwxfEiUHn4S5ffWoHIn7LEavrBzQJkSeDEws34sKafcWWX1wXhVq3NEFIjxZw83RH86d6IS81Sw4zoOpxjAQ3Lw+4eRqvSd283OV7R4q7nqcG1/PUoIw8/fH0Qmx57BP8/vincko5dhGnv/4TB2esksvD746Ef/2agE4nS4uRY4egIPuaHJLjKK54nG5mtnZm0kydmipg8uTJ0Ol0FlOrVq1QlfiLq6RradnY/doSRI4fhnbjhiIzNgm7xy4xt4f5iE4iy1/GlofnyTaspo/1hFewH9q8OlhOJqbloodj1Lvfoe3Ye+SJJvXEJZk+O2lUn2MkSnN9V71unt9//ZvyddO9s5Edl+KwPO16bYns/Rs5bigyYpOws1CeROek3stfxuaH58m21NzkDIvPi57oYsynuHAT/OoGo/WL/eEdHCB73149egHbX/oS+Q5sjnDF43Qzq6o217Zt2+LXX381v/fwqNrwptO0srORlpaGoKAgpKamQq/Xw1Ws6fYWXM2QXdPhSlzxGLni5ZIrVoG52t+Svc/jpvRPDXscgZ5eNqWVfi0PzVd/Xe59FSXXlStXIioqCtWFK/5NEBGRC9xEIi0tzWLKzS25RuXUqVMIDw9HkyZN8NhjjyEmJgZVicGViIiUMRh0SiYhIiJCloZN04wZM6xus3v37li0aBHWr1+PBQsWIDo6GnfccQfS02/cIMXR2OZKRETVss01NjbWolrY29v62OdBgwaZ/92+fXsZbBs2bIgVK1bg6aefRlVgcCUiompJr9dXqn24Ro0aaNGiBU6fPo2qwmphIiJyqRv3Z2Rk4MyZM6hb98a9AxyNwZWIiJw6uI4bNw6///47zp07h23btuG+++6Du7s7HnnkEVQVVgsTEZFTu3DhggykycnJqFOnDnr27IkdO3bIf1cVBlciIlLGoOnkZGsaFbFs2TJUNwyuRESkTGVuX1iUrZ+vDtjmSkREpBhLrkREpAyf52rE4EpERMoYoKDNVTz70MmxWpiIiEgxllyJiEgZFTeB0Gz8fHXA4EpERMqIwGhgcGVwJSIidVhyNWKbKxERkWIsuRIRkTKG65MtbP18dcDgSkREyrBa2IjVwkRERIqx5EpENhuya3pV7wJVEwat4jfet5aGs2NwJSIiZVgtbMRqYSIiIsVYciUiIsXVwrA5DWfH4EpERMqwWtiI1cJERESKseRKRERqHzkHPnKOwZWIiJThw9KNGFyJiEgZMcbVYPM4V+cvubLNlYiISDGWXImISBlNQZurxjZXIiKiG9jmasRqYSIiIsVYciUiImXYocmIwZWIiJQR7aUa21xZLUxERKQaS65ERKQMb9xvxOBKRETKsM3ViNXCREREirHkSkREyrBDkxGDKxERKcM2VyMGVyIiUoYlVyO2uRIRESnGkqsNgts3QOSEYfCPqIWMmGQcmrkKKYdira4bcntLNB15BwKbhkHLL8CV/edw5OO1yElMk8sDm4Sg9SuDUaNVOLxq+GN9n6nIz8hxcI5cj8pjJDR7shcaDL8FnoG+yLp4Bcc+3YCknacdmCOgZqE8ZcYk4+DMVbhaQp4Kazi8Kzq8ORyH56zF2WXbzPN9QvRo9+pg1OnWTL6/euQCdoxZZNc8kOtitbARS66V5Kn3Rdc5I3FuxQ5s6DsN57/bgW5zRsIjwMfq+h4B3jiz+A9sGjoLvw3/ENcyc9H5/RHm5YZ8A+J+PYSoKf9zYC5cm+pjFHpXazR5rCd2j12CDX2m4uzSv3DLrMfkdhyZJ5GH6BU7sL7vNER/twPdS8mTiXftQDR9vCfSTsVbzHf38USPfz2N1F
Px+GXoLKzv/z6OL9ho51zQzTAUx2DjVFHz589Ho0aN4OPjg+7du2PXrl2oSgyulRTWqw1yLqchZtUeGK4VyNfc5HQ535pLGw4i8a8TKMjOQ0HONUQv+wvBbSOgczcegsyYJMSu3ov0MwkOzonrUn2M/OrVROrRC+ZjdHFdFNw83OR8R6lrJU85yelyfmnajx+Gk19sRl5alsX8iCGdkZeahVNfbEFBVh60AgNSjl20cy6I1Fq+fDnGjh2Ld999F/v27UOHDh0wYMAAJCYmoqowuFaSvlkY0k7GWcwT7/XNw8r1+VqdGyP93GV5MiPnOEZxGw/Bu1Yg9C3qAm461B/SGdmJaQ69IKpMnur2aQsPf29c+DnKah5zElPRfe4oDNz4Fu786gWE9Ghhl32nm4OmaBLS0tIsptzcXFgzZ84cPPvss3jyySfRpk0bLFy4EH5+fvjiiy9QVRhcK8ndzwv56ZZtotfSc+Dh513mZ8XJueXf++Hox2vtuIek+hjlXsmQJds7vnoBg/98D23H3oOD7/8IQ14+HJknkYfy5skz0AdtxgzEwQ9WWV3upfdF3V5tcf7HXdgwcIYs3d7ywSPwr++40ji54MPSNdsmU2/hiIgIBAUFmacZM2YU215eXh727t2Lfv36mee5ubnJ99u3b0dVYYemcqo3oAMi37hX/js7PgVJu84Ua2sT7V55KZmlphPYNBTd5o3C4dk/yTTIeY5R82f6IOT2Ftj8t4+RdekqanVqhC4fPIIdL36JtFNxdstTh+t5yiohT54BPsgtIU9txgxCzOq9yIxNtro8PzsPVw7FIP73Y/K9eE09fgl1ujdH5oWdyvNDVBGxsbHQ6/Xm997exS8ik5KSUFBQgNDQUIv54v3x48dRVRhcy+nihgNyMokY1gWNR/QoVtqJXvpXqSftWz99Csfmb8DF9TfSIuc4RkEt6+LSpsOyl7CQvC9adhCq3a2p3YJr0Tw1GNYFTazk6UwJearTtamsEjZ9RgTiGq3roWbHhtgz8VuknYxH7a5N7LLvdHMSjSgGBWkIIrAWDq7OhNXClRS/5Sh8QoLkCVzn4S5ffWoHIn7LEavrBzQJkSftEws34sKafVbXcfPygJun8XrHzctdvqfqc4zEcJe6fdrBN6yGeZhPjTb1i7WB2lPc9Tw1uJ6nBmXk6Y+nF2LLY5/g98c/lZPorHT66z9xcIaxmjj25/0IahmO0J4tAZ1Ovor3iTtOOSxP5Fo0Ua2r2T6VV+3ateHu7o6EBMu+D+J9WFj5+lfYA8/elXQtLRu7X1uCyPHD0G7cUGTGJskhGqb2MJ/QIPRa/jK2PDwPOQmpaPpYT3gF+6HNq4PlZGJa7lu3Bvquet08v//6N+XrpntnIzsupQpy6PxUH6Mzi7fKKtke/3kWngG+spfu8QW/IGn3GYfmaddrS2Tv38hxQ5ERm4SdhfLkGxqE3stfxuaH5yE7IRW5yRkWnxftw2L8tOghLIhSuCjBtn15ELpMexiZF65g94Sl5tI5UXXn5eWFLl26YNOmTRg+fLicZzAY5PuXXnqpyvZLp2lamcN1RS8t0ZicmprqtEV0a9Z0ewuuZsiu6XAlrniMXLF/+DAX+925Inufx03pL+08Dn7uZXcaLE1WQS4e3fdhufdVDMUZNWoUPvvsM3Tr1g1z587FihUrZJtr0bZYR2HJlYiInPoOTQ8//DAuX76MSZMmIT4+Hh07dsT69eurLLAKDK5EROT0N+5/6aWXqrQauCh2aCIiIlKMJVciIlKGN+43YnAlIiJl+DxXI1YLExERKcaSKxERKcNqYSMGVyIiUobB1YjVwkRERIqx5EpERMqwQ5MRgysRESkjbqhrsLFat+yb8lZ/rBYmIiJSjCVXIiKqls9zdWYMrkREpExFn8dqja2frw4YXImISBmWXI3Y5kpERKQYS65ERKQMbyJhxOBKRETKiLioKUjD2bFamIiISDGWXImISHG1sM7mNJzdTR1cXa
FHWlGru70FVzJs13S4mjebToGrOdzS9fI0/uidcCX5BZkO2Q6rhY1YLUxERKTYTV1yJSIitdhb2IjBlYiIlOFNJIxYLUxERKQYS65ERKSMeFycxkfOMbgSEZE64kHnBj4sncGViIjUYcnViG2uREREirHkSkREyrC3sBGDKxERKcNxrkasFiYiIlKMJVciIlKG9xY2YnAlIiJlWC1sxGphIiIixVhyJSIiZTjO1YjBlYiIlOFQHCNWCxMR0U2jUaNG0Ol0FtMHH3ygfDssuRIR0U3VoWnKlCl49tlnze8DAwOVb4PBlYiIquVQnLS0NIv53t7ecrKVCKZhYWGwJ1YLExGR8pKrwcZJiIiIQFBQkHmaMWOGkn0U1cC1atVCp06dMHv2bOTn50M1llyJiKhaio2NhV6vN79XUWodM2YMOnfujJo1a2Lbtm144403EBcXhzlz5kAlBlcb1GzfAJEThsE/ohYyY5JxcOYqXD0Ua3XdoJbh6PDmcPiFB0PnpkN6dCKOzv8FV/afk8trdW6M2xc+g/ysXPNnYtfsx6EPf3LK/LSfeC/qD+xw4wNuOnj4eOH3J+Yj9cQlR2XJ5dSoF4TxW19Gbmaeed7ZHeew5LllJX7mloc64c7neiCwTgDSEtLx2ydbceCnw3JZeNsw3Pf+EATXNx7HxNOXsWHWJpzbHYOq0PGhzhg8dQg2vr8Bu7/aWeJ6gaGB6PfmADTu0US+v3TgIpY98415eaeHO6PH/90B3xq+iNl1Hmvf/gmZlzPssMc6uOmaQ6cLBuAJIA8GLQaaFn99uTvcdC2g09WSfWAN2kVo2vlS0itr/Yqm53jiWayaoue5isBaOLiWZOLEiZg5c2ap6xw7dgytWrXC2LFjzfPat28PLy8v/P3vf5elYhXB24TBtZI89b7oNmckjn6yHhd+3o/6gzuh+5yR+PW+j5CfkVNs/az4q9g9YSmy41Pk+7q92uDWOSOxfuD7MOQaqySupWdjXd9pcIX8HPxglZxMmj56Oxre15WBVZGZt3+MnPQbF2IlqdsmDMPeG4xFT34jg3DTHo0x8r+PIO5YPBJPJ+HqxVR88/x3SLmUKtdv278VRv33EUzv9hHyr/8uHSUgJAC3Pn0bEk8klLqep68nHls8EodWHsTaN1fjWs41hLWpa17e8NZG6D2uH5Y9/Q0un0pE/3cG4t4P78PSUUvssNciCOShwHAAgPg70cPdLRIGLRcarsrAC50nCgw7ZPB1d+sAA3KgadbzWNb6FU2vKmgKOiRpFVz/tddew+jRo0tdp0kT44VYUd27d5fVwufOnUPLli2hCttcK0kEk5zLaYhZtQeGawXyNSc5Xc635lpqtjkQQaeDZtDg4e8Nn1rqe6lVx/w0GHYLYn7aa88skBU169dAysUUGViFM9uikRqXipBmdeT77JRsc2DV6QCDwQDvAG9ZynW0AZMG489//SH3qTTt7+uArKvZ+GvBH8jLzINWoCHu0I2Ltvb3d8Th1Qdx6eBFXMu+hi0f/YYGXRuiRv0adthrUXoU363pAjQNGlKg0wXJ06tOFwKDIRqAuFDJliVNN92NCwFLZa1f0fRuHnXq1JGl0tImUUK1JioqCm5ubggJCVG6Tyy5VpK+WRjSTsZZzBPv9c1L74E2aNPbcPf1gpuHO2LX7kPWpavmZWJ+/7UTZKBK3h+No59skAHPWfNjEhwZAf8GtRC7Zp/y/b5Zvbzuebh5uOHCgYtYP/NXXD6bbHW9k3+cQa8Xe6LZ7U1wZttZNOvZFL6BPji3x7La95394+Hl5wV3Dzfs++EArl64fuHkIK0GtJZB/fCqg+jwQMdS123QrSHSE9Lw8H8eRXj7eki5cBVb523Bma2n5fKQliHYs2S3ef3M5ExkJmWgTssQpNg9X27QQQ+DlgjADzqdKL8Uqo7WMgBdgxI+W9b6FU2valTnoTjbt2/Hzp070bt3b9ljWLx/9dVX8fjjjyM4WFTtq8PgWknufl
64lm5ZXSree/iVXmcvqn3dvD0Q3rst3LxFG41RxvnL+P3xT5F+7jK8g/3R9pXB6PbRE9g66l8OuReY6vwU1uDeW5Dw5wnkXslUus83o6yrWfjXff/FpaPx8PL1RO+X7sSTix/HvIELkJtxox3WRJTcolYewhP/flheAGkFBvxv4mpkJFkei6mdZsHD2wPtBraWr47ko/dBn/H98O1T35Rv/SBfNOzeCD/84zt89/wyNLurOe7/54P477CFuBpzVV4k5Bb5Leek5cDLX117WkncdC2hIQsaLoueCdC0AotKTk2WOEv6ft3LWL+s5dVDdX4qjre3N5YtW4bJkycjNzcXjRs3lsG1cDusKtXrqFRj9QZ0QIc37pX/zopPQdKuM7KdsjDPAB/kppQdQESb5IX1B9Br2RhknLuMKwfOIzc5Q06CeD3w/koM/u0dBDSohYzzSU6XHxNRqq3XNxJ731muPA83gw7D2mH4tCHy36J6d96ghbhw0FgFKtpc183YiI73RqJB5wic2nqm2Oe7PNgRPZ+5DQse+AIJJxIQ2jIUI/87AjlpuTix5ZTFuqKNNWrVIby87v9w+UwSzu+13pnNVm2HtsOg94x5Sr2UgotRF3Dg+yhcPX+lXJ/Py8rDxf0XcHLTCflevMYfiUPjnk1xdekeudw70DKQegf6IC+z7DZqWxg7Nvleb38VRCB0u94uawwXOnnKLaktu6z1K5oeFSV6Ce/YIdqr7Y/BtZwubjggJ5MGw7qgyYgeFuvoW9TFmaV/lTtNUZIQPXMLByMzO5dWHZWfev3b41pmLhK2nVS05zeXA6sPy6k0Wim/lfA2YTj5+2nEHzd2eBGvp/84ixZ3NS0WXE3cPd1Rq1FNuwXXIz8dlpPJC5vGyCrhrqO6y/fi33XbhSOiSwP8MOa7Yp9PPJ6ARrc2LjH9xBOJCG11oznDr6YfAuoE4PIJUVVrz8Cqvx5YRRAUsq4HQf8bVbk60ZZd0gVrWetXNL2qUZ2rhR2JHZoqKW7LUfiEBMmgpPNwl68+tQMRv+WI1fVDe7aEvlkodO5ucPf2RPPRd8EnRI9k01CcLo3lsBbBM8hXDmVJP5uIjNhkp8yPiUhHtMW6xF9LNVC/Qz3UaVpbDpvx8vPEgPF95fk2Zt8Fq+vH7L+A5nc2RUhzYwcm8dr8jqayWllo2bs5wlqGwM1dB08fD9z1fE/ow/QOHYrz1cOfyyrdz+/9TE5xh+Ow4/NtWPfuGqvri17CoW3C0KyX6DkL+Sren/3DWHI/+EMU2g6LRN3IcHj4eKDX2D6I2X3ebu2txsAadD2wFi5FGqBpiXBzExcC7gB84aarB4Nm2beh/OtXNL2qoSn6z9mx5FpJ19Kyseu1JWg/fhgixw1FRmwSdo5dYm639A0NQu/lL2Pzw/OQnZAKryA/tH15EHzq6FGQl4/00wnY+epiZF28Yh432nny3+Cp90N+Zi6S9p7FzrGLHRaUVOdHCGhcB8Ft62PfOysckoebQc0GNXD3q71lb17Rnhp74CK+HP01cjOMVZ5BdfV4ZcMLmDvgX0iNS5Ol3hrhQRj57xHwr+WPrJQs7P0+Cnu/i5Lr+9f0w+A374Y+VC+rhRNOJmLxM9/iSkzxjmn2klmk/Vf8nkR+sq8aew2LEqzovPRhZ+PN1VNir+KHl79Hvwl3Y/icB3A15oos4Yr5wvkd57Blzm944NOHZHuuCKyrxv1op733hptbPWiaAe5ut5nniqExBu0kDNopuKHF9WWmcak3hs24uUVC01KhacaLmbLWL2s5VR86rbQ6pUL3dxS3nkpNTS3XgF5nsbrbW1W9C1SGYbumw9W82XQKXE2AC16mjz96J1xJWlomatUcYrfzuClOvNzwDXi7+diUVq4hB/POz3DqmOOCfxJERFRVqnNvYUdicCUiImXYocmIHZqIiIgUY8mViIiUEb14NFvvLewCJVcGVyIiUsZwfbKFrZ+vDlgtTEREpBhLrkREpAw7NB
kxuBIRkToK2lzhAsGV1cJERESKseRKRETKsEOTEYMrEREpw6E4RqwWJiIiUowlVyIiUobVwkYMrkREpIx40JpmY72urZ+vDhhciYhIGY5zNWKbKxERkWIsuRIRkTJ8nqsRgysRESnDamEjVgsTEREpxpIrEREpw5KrEYMrEREpbnPVbE7D2TG4upg8g2vV9Ot0rvcTHddoElxNkKcrnA4tebj3givxcE+r6l24qbjemYuIiKoMq4WNGFyJiEgZ3rjfiMGViIiUEe2tBpvbXJ0/urpWAx0REVE1wJIrEREpw2phIwZXIiJSho+cM2K1MBERkWIsuRIRkTJ8nqsRgysRESnDca5GrBYmIqKbxvTp09GjRw/4+fmhRo0aVteJiYnBPffcI9cJCQnB66+/jvz8/ApthyVXIiJSxqBgnKvBjuNc8/Ly8OCDD+K2227D559/Xmx5QUGBDKxhYWHYtm0b4uLiMHLkSHh6euL9998v93YYXImISO2N+zXb07CX9957T74uWrTI6vJffvkFR48exa+//orQ0FB07NgRU6dOxYQJEzB58mR4eXmVazusFiYiomopLS3NYsrNzbX7Nrdv347IyEgZWE0GDBggt3/kyJFyp8PgSkREyquFDTZOQkREBIKCgszTjBkz7L7/8fHxFoFVML0Xy8qL1cJERKT2Dk2wPQ0hNjYWer3ePN/b29vq+hMnTsTMmTNLTfPYsWNo1aoVHIXBlYiIqmWHJr1ebxFcS/Laa69h9OjRpa7TpEmTcm1bdGTatWuXxbyEhATzsvJicCUiIqdWp04dOakgehGL4TqJiYlyGI6wceNGGeTbtGlT7nQYXImISBmDpqDkasc7NIkxrFeuXJGvYthNVFSUnN+sWTMEBASgf//+Mog+8cQTmDVrlmxnffvtt/Hiiy+WWC1tDYMrEREpI57FqlXj57lOmjQJX331lfl9p06d5OvmzZvRq1cvuLu7Y82aNXj++edlKdbf3x+jRo3ClClTKrQdBlciIrppLFq0qMQxriYNGzbEzz//bNN2GFyJiEgZTcEj4zQ4PwZXG9Rs3wCRE4bBP6IWMmOScXDmKlw9FGt13aCW4ejw5nD4hQdD56ZDenQijs7/BVf2n5PL20+8F/UHdrjxATcdPHy88PsT85F64pLd8+JTKwCd37oXwa3D4VtHj42PzkfqydLHdIXf1RqRLw+Ab0ggUo7HYe/UlUg/n1Tu5aoNHjwYEya8jsjIdrh27Rq2bv0Dr7wyFhcvXjSvc++9wzB79kzUq1cP+/btxzPPPIcTJ06UmGZZ61c0PRV8Ar0x5K270e7uVnD3dMfl6GQseHgRruWUfu/TgeP6oO+LPbHoueU4svHGPt76SGf0efEO+Af74syO8/hu4k9Iv5wBR6ndOgy9Jw+Fvn4wdDodrpy9jO0f/4q4vTFW16/ZrA5uf70/6rQJh2+wH/5z2wfIS7e8uUBwk9roOWEAwjpGwJBvwNlNx7F50moH5ejmVt1vf+govIlEJXnqfdFtzkhEr9iB9X2nIfq7Heg+ZyQ8Anysrp8VfxW7JyzF+runY13faTjz9Z+4dc5IuHkbr28OfrAKP/eaYp5OLPwVGecvOySwmh7xFL/tFLaNW1qu9QMa1ka3aX/DwTk/Y3WfGUjcfRY95jwGnbtbuZbbQ1CQHjNnzkZERCM0btxM3lFlxYpl5uUtWrTAN98swauvjkPNmnXw22+bsWrVD7KNxZqy1q9oeirodMBTnz8CwzUDZvb5FJM6zMT3b6xBQX7pZYW6rUPRpm9zpCakW8xvelsjDJ7YD0te+h6Tb/kI6UkZeHTufXCk9EupWP/KCnx++yz8t8dMRC3ahiH/ehTu1/82ihLB8vSGo9j09kqry/3qBGD4F6PkOl/c+SG+7PURDn1rObSCyN4YXCupbq82yLmchphVe2C4ViBfc5LT5XxrrqVmIzs+xfhGp4Nm0ODh7w2fWoFW128w7BbE/LQXjpJ7JRNnv9
+Fq0dulPJK03BQB1zeE424P0/CkJePY//dAu9gf9Tu2LBcy+3h22+XyXaSzMxMZGVlYe7cf6J7927mYPf4449h8+YtWLt2rbyN2tSp02RX+zvuuMNqemWtX9H0VGjZqzlqhAdh5eR1yE7NkYPtLx2NlwGnJKKm5MEZQ7By8noUXCuwWNb1wY7Yt/IgYqMu4lr2Nayb/RuadG+ImhHWnxZiD7mp2UiPS72+s4BWoMHL3xt+tQOsrp9yLhnHftiPK6cSrS7vOPI2XNgZLdcpyM2Xf59Jx8p/Zx1S8zxXzcbJ2bFauJL0zcKQdjLOYp54r29e+iDjQZvehruvF9w83BG7dh+yLl0ttk5wZAT8G9RC7Jp9qK6CmocipVD+tQID0qIT5fzLe6PLXO4Id911p7wri+huL7RvH4moqAPm5eIRUkePHpPzt2zZUuzzZa1f0fRUaNq9IZLPX8GIOfehxR1NZPXtls+2Ye8PB0v8zJ1P34q444k4u/N8sWV1W4Xgr692m99nJGXKNMX8K7HXLwYd5JltE+DpJ/423HB8VRTSL1Zu++G3NETS8Xjcv+RJBDeujStnLmPbhxuRcKh8F45kG1YLGzG4VpK7nxeupedYzBPvPfxKHwclqoRFVXB477Zw8/a0uk6De29Bwp8nZGmyuvLwLSH//t7lWm5vxidZvIcHHxxhnifGsKWkWJ6wxfvAQOu1B2WtX9H0VPCt4YNmPRrjx3fXYfm4lYhoXw9PL3oUVy6kIHpX8TZKUQLtMbIr5g75t9X0vP28kJ1meZzEe28HHafCRJWwqApuenfrEquEy8MnyBfNB7fDT//3DRIPXUTbB7vgnvmP4JshnyK3SF6J7IXBtZzqDeiADm/cK/+dFZ+CpF1nZLtrYZ4BPshNKTsgGnLzcWH9AfRaNgYZ5y7jyoEbJQpRqq3XNxJ731kOe4oY2B5d3hwm/50Zl4qND39Soc/nZ+fJ/BYm3udn5pZruQqPPvoIPvtsgfz3+fPn0a6dsUNYu3btsG7dGrz00hj52CiTjIwMefPvwsT79HTLdsjyrl/R9Cqj073t8MD0IfLfVy+m4NSfZ5FyKRXbFhtLm+f2xsrOSW36tLAaXP/2/hBs+GizrEK2JjcrT3aQKswn0Ae5Co9TUS3uiUSvd415Sr+Ugm+HG4+hIKpxT645hEdWPo+Us0mI22+9g2BprmXlIf5ALOKvf/bQt7vR6enbEdahPs7/cVphTsgallyNGFzL6eKGA3IyaTCsC5qM6GGxjr5FXZxZ+le50xRVw6KnceHgWq9/e1zLzEXCtpOwp9j1B+VUWamnEhDU4kYVuOiopG9cB6mnE8q1XIWlS7+VU2EisP766wZMnPgmvvnGsnPWwYOH0LHjjR7ZHh4eaNOmNQ4dOmw1/bLWr2h6lbF/1WE5mdzytw6IHNi63J9v3rMJwtuEYdg7A+R73yAfjPhoOHat2I+fpv0iq4vFchP/Wn7QhwTI+fZycu0hOZX1txHUsFalgmvSiXjZzkxVw/RcG1vY+vnqgB2aKiluy1H4hATJIKvzcJevPrUDEb/F+vP+Qnu2hL5ZqAwy7t6eaD76LviE6JF8fSiOiUhHtMXC4PgrNzcvDznJf3u6G/8tuqdacX7dAYR0bYKw25vLdVs/fRdyU7KQtP98uZbbg7hlmQisb789CYsW3bgDi8nXX3+DPn16Y9CgQfKBx2+99SaSkpKwdetWq+mVtX5F01Ph8Ibj8PD2wK2PdpEBJKJjPbTt1xJHfrU+/GfabR/j43s+M09pCelYPW0Dfv2ncR93fxeFzsMjEdEhHJ4+Hhj0eh/ZNuvI9taGdzVHrRYh0LmL4Wce6PJsT/iH6nFpb8m/FXcvd7hf/62KV/He5Oj3+9C4dyuERtaT31Hbh7rA3dMDcVEVD9RUtY+cc2YsuVbStbRs7HptCdqPH4bIcUOREZuEnWOXmNsZfUOD0Hv5y9j88DxkJ6TCK8gPbV8eBJ86ehTk5SP9dA
J2vroYWRevmNMMaFwHwW3rY987K6okT/dve9f8775f/Z98/f3vn+Py3nOyl2/Pfz6BlXdOk/Mzzidh1zvfo8Nr98AvRI+rJy5h29hvZMel8iy3h3Hjxsqbd3/88UdyMmnTJlI+uurkyZN4/PGRmDdvDurXry/HpQ4bdp+5w1PPnj1ldXJgoLGnbFnrl7XcHnLSc/HF09/ivvcGYehbdyMlPg0/vvszzu0xBo7GXRvg6S8fxdvtPpDvU+Mtq6gNBg1ZV7PN7axntp/Dulm/YeSCh+AX5IMzO89j6Ss/wpF8a/jh9nH9ERCqR35uPpJPJWDtC0uRFmvs7Fe3cwMMXfgY/t3N+CzPwPAgjPzlFfPnn/p9nHxd3H+uHNYjSrtb31+H/rMfgE+wH5JPJWLti0uLjYUlsiedVo4+z2K8oGhLSk1NLdfjf5zF6m5vwdXkGVyrMuLBvfZ/OLKjjWs0Ca6mkb/zlzSKevHwjYtNV2Dv87gp/e765+Ghs61DXL6Wi51pC5w65rDkSkREyhiu/2cLWz9fHbhWMYeIiKgaYMmViIiU0XQaNJ2tvYWdv5mBwZWIiJTRFPT21VwguLJamIiISDGWXImISBnRGUnHDk0MrkREpA7v0GTEamEiIiLFWHIlIiJlDDoDdDb2Fma1MBERUSFsczVicCUiImUYXI3Y5kpERKQYS65ERKQMewsbMbgSEZEyBhRAhwKb03B2rBYmIiJSjCVXIiJSRtwXWLO5Wtj57y3M4EpERMpwnKsRq4WJiIgUY8mViIgUd2hyszkNZ8fgSkRECtk+FEek4exYLUxERKTYTV1yHbZrelXvApVBw9Sq3gUiqgCDJqp03RSk4dxu6uBKRERq8Q5NRgyuRESkjIYCaDaWXEUazo5trkREdNOYPn06evToAT8/P9SoUcPqOjqdrti0bNmyCm2HJVciIlLGeAMIg4I07CMvLw8PPvggbrvtNnz++eclrvfll19i4MCB5vclBeKSMLgSEdFNc/vD9957T74uWrSo1PVEMA0LC6v0dlgtTERE1VJaWprFlJub67Btv/jii6hduza6deuGL774AppWsYDPkisRESmjaaJDk87mNISIiAiL+e+++y4mT54Me5syZQr69Okj22V/+eUXvPDCC8jIyMCYMWPKnQaDKxERVcs219jYWOj1evN8b29vq+tPnDgRM2fOLDXNY8eOoVWrVuXa/jvvvGP+d6dOnZCZmYnZs2czuBIRkfPT6/UWwbUkr732GkaPHl3qOk2aNKn0fnTv3h1Tp06V1dIlBfiiGFyJiEjxOFedzWlURJ06deRkL1FRUQgODi53YBUYXImISBlNU3CHJs1+Q3FiYmJw5coV+VpQUCADp9CsWTMEBATgp59+QkJCAm699Vb4+Phg48aNeP/99zFu3LgKbYfBlYiIbhqTJk3CV199ZdGmKmzevBm9evWCp6cn5s+fj1dffVX2EBZBd86cOXj22WcrtB2dVo7+xaILdFBQEFJTU8tV/01ERNWLvc/jpvRr67vDTWdbuc2g5SMpbadTxxyWXImIqFoOxXFmDK5ERHTT3KHJUXiHJiIiIsVYciUiIsW9hXU2p+HsGFyJiEgh0eZqexrOjtXCREREirHkSkREyhirdHUK0nBuDK5ERKQMg6sRq4WJiIgUY8mViIiUEY+L09l8437nL7kyuBIRkTKsFjZitTAREZFiLLkSEZEyKu4LrPHewkREREXvC2xQkIZzY3AlIiJlVLSXamxzJSIioqJYciUiImVYcjVicCUiImVUjFHVXGCcK6uFiYiIFGPJlYiIlGG1sBGDKxERKcPgasRqYSIiIsVYciUiIoVUlDqdv+TK4EpERMqwWtiI1cJERESKseRKRETKcJyrEYMrEREpo2kKbtwv03BuDK5ERKSQeFyczuayq7NjmysREZFiLLkSEZEyxp6+OhvTcP6SK4MrEREpZHtwZbUwERERFcOSKxERqaOgWhisFiYiIrpBU1
Clq7FamIiIiIpiyZWIiBRihyaBwZWIiBTSFHT2ZbUwERERVabkahrQm5aWVp7ViYiomjGdv+1/gwbRHcn5S54OCa7p6enyNSIiwt77Q0REdiTO50FBQcrT9fLyQlhYGOLj45WkJ9ISaTornVaOyxiDwYBLly4hMDAQOp2td94gIiJHE6d6EVjDw8Ph5mafFsGcnBzk5eUpSUsEVh8fH7h0cCUiIqLyY4cmIiIixRhciYiIFGNwJSIiUozBlYiISDEGVyIiIsUYXImIiBRjcCUiIoJa/w/0ndzpKjipNgAAAABJRU5ErkJggg==", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "def plot_values(V: np.ndarray, title=\"Value function\") -> None:\n", + " \"\"\"Plot the value function V on the maze as a heatmap.\"\"\"\n", + " grid_values = np.full(\n", + " (n_rows, n_cols), np.nan\n", + " ) # Initializes a grid the same size as the maze. Every cell starts as NaN.\n", + " for (\n", + " s,\n", + " (i, j),\n", + " ) in (\n", + " state_to_pos.items()\n", + " ): # recall that state_to_pos maps each state index to its maze coordinates (i,j).\n", + " grid_values[i, j] = V[\n", + " s\n", + " ] # For each reachable cell, we write the value V[s] in the grid.\n", + " # Walls # never get values, and they stay as NaN.\n", + "\n", + " fig, ax = plt.subplots()\n", + " im = ax.imshow(grid_values, cmap=\"magma\")\n", + " plt.colorbar(im, ax=ax)\n", + "\n", + " # For each state:\n", + " # Place the text label at (column j, row i).\n", + " # Display value to two decimals.\n", + " # Use white text so it’s visible on the heatmap.\n", + " # Center the text inside each cell.\n", + "\n", + " for s, (i, j) in state_to_pos.items():\n", + " ax.text(\n", + " j, i, f\"{V[s]:.2f}\", ha=\"center\", va=\"center\", color=\"white\", fontsize=9\n", + " )\n", + "\n", + " # Remove axis ticks and set title\n", + " ax.set_xticks([])\n", + " ax.set_yticks([])\n", + " ax.set_title(title)\n", + " plt.show()\n", + "\n", + "\n", + "plot_values(V_random, title=\"Value function: random policy\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "8275a1eb-b58e-4e05-ae5d-5635ff9a1556", + "metadata": {}, + "source": [ + "The next function `plot_policy` visualizes a policy on the maze.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 95, + "id": "c1ab67f0-bd5e-4ffe-b655-aec030401b78", + "metadata": {}, + "outputs": [], + "source": [ + "def plot_policy(policy: np.ndarray, title=\"Policy\") -> None:\n", + " \"\"\"Plot the given policy on the maze.\"\"\"\n", + " _fig, ax = plt.subplots()\n", + " # 
draw walls as dark cells\n", + " wall_grid = np.zeros((n_rows, n_cols))\n", + " for i in range(n_rows):\n", + " for j in range(n_cols):\n", + " if maze_str[i][j] == \"#\":\n", + " wall_grid[i, j] = 1\n", + " ax.imshow(wall_grid, cmap=\"Greys\", alpha=0.5)\n", + "\n", + " for s, (i, j) in state_to_pos.items():\n", + " cell = maze_str[i][j]\n", + " if cell == \"#\":\n", + " continue\n", + "\n", + " if s in goal_states:\n", + " ax.text(\n", + " j,\n", + " i,\n", + " \"G\",\n", + " ha=\"center\",\n", + " va=\"center\",\n", + " fontsize=14,\n", + " fontweight=\"bold\",\n", + " color=\"blue\",\n", + " )\n", + " elif s in trap_states:\n", + " ax.text(\n", + " j,\n", + " i,\n", + " \"X\",\n", + " ha=\"center\",\n", + " va=\"center\",\n", + " fontsize=14,\n", + " fontweight=\"bold\",\n", + " color=\"red\",\n", + " )\n", + " elif s == start_state:\n", + " ax.text(\n", + " j,\n", + " i,\n", + " \"S\",\n", + " ha=\"center\",\n", + " va=\"center\",\n", + " fontsize=14,\n", + " fontweight=\"bold\",\n", + " color=\"green\",\n", + " )\n", + " else:\n", + " a = policy[s]\n", + " ax.text(\n", + " j,\n", + " i,\n", + " action_names[a],\n", + " ha=\"center\",\n", + " va=\"center\",\n", + " fontsize=14,\n", + " color=\"black\",\n", + " )\n", + "\n", + " ax.set_xticks(np.arange(-0.5, n_cols, 1))\n", + " ax.set_yticks(np.arange(-0.5, n_rows, 1))\n", + " ax.set_xticklabels([])\n", + " ax.set_yticklabels([])\n", + " ax.grid(True)\n", + " ax.set_title(title)\n", + " plt.show()\n" + ] + }, + { + "cell_type": "markdown", + "id": "48037254-dccc-4f9c-a4d7-349adba5c74f", + "metadata": {}, + "source": [ + "Now let’s visualize the `random_policy`. Does it seem like a good policy?" + ] + }, + { + "cell_type": "code", + "execution_count": 96, + "id": "d452681c-c89c-41cc-95dc-df75993b0391", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "plot_policy(policy=random_policy, title=\"Policy\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "cbad5bf1-0150-4c3f-8cce-c82e0f1d1695", + "metadata": {}, + "source": [ + "**Exercise 9.** Define your own policy and evaluate it using the functions `policy_evaluation(...)` and `plot_values(...)`. **Can you identify an optimal policy visually?** Plot your own policy using `plot_policy`. \n" + ] + }, + { + "cell_type": "code", + "execution_count": 97, + "id": "929707e6-3022-4d86-96cc-12f251f890a9", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "my_policy = [\n", + " A_RIGHT,\n", + " A_RIGHT,\n", + " A_RIGHT,\n", + " A_DOWN,\n", + " A_DOWN, # First row\n", + " A_UP,\n", + " A_DOWN,\n", + " A_DOWN,\n", + " A_LEFT, # Second row\n", + " A_UP,\n", + " A_RIGHT,\n", + " A_DOWN, # Third row\n", + " A_UP,\n", + " A_LEFT,\n", + " A_RIGHT,\n", + " A_RIGHT,\n", + " A_RIGHT, # Fourth row\n", + " A_UP,\n", + " A_LEFT,\n", + " A_DOWN,\n", + " A_RIGHT,\n", + " A_UP, # Fifth row\n", + "]\n", + "\n", + "V_my_policy = policy_evaluation(policy=my_policy, P=P, R=R, gamma=gamma)\n", + "\n", + "plot_values(V=V_my_policy, title=\"Value function: my policy\")\n", + "plot_policy(policy=my_policy, title=\"My policy\")" + ] + }, + { + "cell_type": "markdown", + "id": "e61f5ee8-f9cd-4fbc-96c0-0a8d661bd1e5", + "metadata": {}, + "source": [ + "**Exercise 10.** (optional) How can we find an optimal policy?\n", + "(We will discuss this question next week, but you can already start thinking about it!)" + ] + }, + { + "cell_type": "markdown", + "id": "00ae548b", + "metadata": {}, + "source": [ + "To find an optimal policy $π^*$ (a policy that yields the highest possible expected return from every state), we generally use one of two main dynamic programming algorithms:\n", + "\n", + "1. **Policy Iteration**: This method alternates between two steps until convergence:\n", + "\n", + "- *Policy Evaluation*: Calculate the value function Vπ(s) for the current specific policy (as we did in Exercise 8).\n", + "\n", + "- *Policy Improvement*: Update the policy to be greedy with respect to the current values. For every state s, we choose the action a that maximizes the expected next value:\n", + " $$π_{new}​(s) = argmax​_{a} \\sum_{s\\prime} ​P({s \\prime}∣s,a)[R(s)+ \\gamma V_{\\pi}({s\\prime})]$$\n", + "\n", + "1. 
**Value Iteration**: Instead of evaluating each policy until convergence, we update the value function directly using the *Bellman optimality equation*:\n", + " $$V_{k+1}(s) = \\max_a \\left( R(s) + \\gamma \\sum_{s'} P(s' \\mid s, a)\\, V_k(s') \\right)$$\n", + "\n", + " Once the values converge to the optimal values $V^{*}$, we extract an optimal policy by acting greedily with respect to those values." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "studies", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/M2/Reinforcement Learning/Lab 2 - Second maze.ipynb b/M2/Reinforcement Learning/Lab 2 - Second maze.ipynb new file mode 100644 index 0000000..b1a8ebc --- /dev/null +++ b/M2/Reinforcement Learning/Lab 2 - Second maze.ipynb @@ -0,0 +1,872 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "44b75d44", + "metadata": {}, + "source": [ + "# Lab 2 - Second maze\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 535, + "id": "100d1e0d", + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "\n", + "np.set_printoptions(\n", + " precision=3, suppress=True\n", + ") # (not mandatory) This line is for limiting floats to 3 decimal places, avoiding scientific notation (like 1.23e-04) for small numbers.\n", + "\n", + "# For reproducibility\n", + "rng = np.random.default_rng(seed=42) # This line creates a random number generator.\n" + ] + }, + { + "cell_type": "markdown", + "id": "1018deab", + "metadata": {}, + "source": [ + "## 2. 
Maze definition and MDP formulation\n" + ] + }, + { + "cell_type": "markdown", + "id": "ca4fa301-c14f-44ec-b04f-b01ca42d979a", + "metadata": {}, + "source": [ + "### 2.1 Define the maze " + ] + }, + { + "cell_type": "code", + "execution_count": 536, + "id": "f91cda05", + "metadata": {}, + "outputs": [], + "source": [ + "maze_str = [\n", + " \"############\",\n", + " \"#S#.......X#\",\n", + " \"#.#.###.#.##\",\n", + " \"#.....#X#..#\",\n", + " \"#.###.####.#\",\n", + " \"#...#X#X...#\",\n", + " \"###.######X#\",\n", + " \"#.....X...##\",\n", + " \"#.###.#.#..#\",\n", + " \"#...#...X#.#\",\n", + " \"#X#.X#X##G.#\",\n", + " \"############\",\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": 537, + "id": "564cb757-eefe-4be6-9b6f-bb77ace42a97", + "metadata": {}, + "outputs": [], + "source": [ + "n_rows = len(maze_str)\n", + "n_cols = len(maze_str[0])\n", + "\n", + "figsize = (n_cols / 2 if n_cols > n_rows else 8, n_rows / 2 if n_rows > n_cols else 8)" + ] + }, + { + "cell_type": "markdown", + "id": "adc49d58-2730-41d8-96fb-ca7c9cb4fcdf", + "metadata": {}, + "source": [ + "### 2.2 Map each walkable cell (not a wall '#') to a state index\n" + ] + }, + { + "cell_type": "code", + "execution_count": 538, + "id": "7116044b-c134-43de-9f30-01ab62325300", + "metadata": {}, + "outputs": [], + "source": [ + "FREE = {\n", + " \".\",\n", + " \"S\",\n", + " \"G\",\n", + " \"X\",\n", + "} # The set FREE lists the cell types that the agent is allowed to move into.\n" + ] + }, + { + "cell_type": "markdown", + "id": "1c9ad05e-9c6c-4e00-918c-44b858f45298", + "metadata": {}, + "source": [ + "**Dictionaries to convert between grid and state index**" + ] + }, + { + "cell_type": "code", + "execution_count": 539, + "id": "a1258de4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of states (non-wall cells): 62\n", + "Start state: 0 at (1, 1)\n", + "Goal states: [60] at (10, 9)\n", + "Trap states: [8, 18, 27, 28, 33, 39, 
54, 56, 58, 59] at (1, 10)\n" + ] + } + ], + "source": [ + "state_to_pos = {} # s -> (i,j)\n", + "pos_to_state = {} # (i,j) -> s\n", + "\n", + "start_state = None # will store the state index of start state\n", + "goal_states = [] # will store the state index of goal state # We use a list in case there are multiple goals\n", + "trap_states = [] # will store the state index of trap state # We use a list in case there are multiple traps\n", + "\n", + "s = 0\n", + "for i in range(n_rows): # i = row index\n", + " for j in range(n_cols): # j = column index\n", + " cell = maze_str[i][j] # cell = the character at that position (S, ., #, etc.)\n", + "\n", + " if (\n", + " cell in FREE\n", + " ): # FREE contains: free cells \".\", start cell \"S\", goal cell \"G\" and trap cell \"X\"\n", + " # Walls # are ignored, they are not MDP states.\n", + " state_to_pos[s] = (i, j)\n", + " pos_to_state[(i, j)] = s\n", + "\n", + " if cell == \"S\":\n", + " start_state = s\n", + " elif cell == \"G\":\n", + " goal_states.append(s)\n", + " elif cell == \"X\":\n", + " trap_states.append(s)\n", + "\n", + " s += 1\n", + "\n", + "n_states = s\n", + "\n", + "print(\"Number of states (non-wall cells):\", n_states)\n", + "print(\"Start state:\", start_state, \"at\", state_to_pos[start_state])\n", + "print(\"Goal states:\", goal_states, \"at\", state_to_pos[goal_states[0]])\n", + "print(\"Trap states:\", trap_states, \"at\", state_to_pos[trap_states[0]])\n" + ] + }, + { + "cell_type": "code", + "execution_count": 540, + "id": "fc61ceef-217c-47f4-8eba-0353369210db", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "def plot_maze_with_states():\n", + " \"\"\"Plot the maze with state indices.\"\"\"\n", + " grid = np.ones(\n", + " (n_rows, n_cols)\n", + " ) # Start with a matrix of ones. Here 1 means “free cell”\n", + " for i in range(n_rows):\n", + " for j in range(n_cols):\n", + " if maze_str[i][j] == \"#\":\n", + " grid[i, j] = 0 # We replace walls (#) with 0\n", + "\n", + " fig, ax = plt.subplots(figsize=figsize)\n", + " ax.imshow(grid, cmap=\"gray\", alpha=0.7)\n", + "\n", + " # Plot state indices\n", + " for (\n", + " s,\n", + " (i, j),\n", + " ) in state_to_pos.items(): # Calling .items() returns a list-like sequence of (key, value) pairs in the dictionary.\n", + " cell = maze_str[i][j]\n", + "\n", + " if cell == \"S\":\n", + " label = f\"S\\n{s}\"\n", + " color = \"green\"\n", + " elif cell == \"G\":\n", + " label = f\"G\\n{s}\"\n", + " color = \"blue\"\n", + " elif cell == \"X\":\n", + " label = f\"X\\n{s}\"\n", + " color = \"red\"\n", + " else:\n", + " label = str(s)\n", + " color = \"black\"\n", + "\n", + " ax.text(\n", + " j,\n", + " i,\n", + " label, # Attention : matplotlib, text(x, y, ...) 
expects (column, row)\n", + " ha=\"center\",\n", + " va=\"center\",\n", + " fontsize=10,\n", + " fontweight=\"bold\",\n", + " color=color,\n", + " )\n", + "\n", + " ax.set_xticks([]) # remove the axis ticks; we don't need them.\n", + " ax.set_yticks([])\n", + " ax.set_title(\"Maze with state indices\")\n", + "\n", + " plt.show()\n", + "\n", + "\n", + "plot_maze_with_states()" + ] + }, + { + "cell_type": "markdown", + "id": "db078d86", + "metadata": {}, + "source": [ + "### 2.4 Actions and deterministic movement" + ] + }, + { + "cell_type": "code", + "execution_count": 541, + "id": "f7f0b8e4-1f48-4d03-9e5f-a47e59c3e827", + "metadata": {}, + "outputs": [], + "source": [ + "A_UP, A_RIGHT, A_DOWN, A_LEFT = 0, 1, 2, 3\n", + "ACTIONS = [A_UP, A_RIGHT, A_DOWN, A_LEFT]\n", + "action_names = {A_UP: \"\\u2191\", A_RIGHT: \"\\u2192\", A_DOWN: \"\\u2193\", A_LEFT: \"\\u2190\"}" + ] + }, + { + "cell_type": "code", + "execution_count": 542, + "id": "4b06da5e-bc63-48e5-a336-37bce952443d", + "metadata": {}, + "outputs": [], + "source": [ + "def move_deterministic(i: int, j: int, a: int) -> tuple[int, int]:\n", + " \"\"\"Deterministic movement on the grid. 
If the movement hits a wall or boundary, the agent stays in place.\n", + "\n", + " Args:\n", + " i (int): current row index\n", + " j (int): current column index\n", + " a (int): action to take (A_UP, A_DOWN, A_LEFT, A_RIGHT)\n", + "\n", + " Returns:\n", + " (tuple[int, int]): new (row, column) position after taking action a\n", + "\n", + " \"\"\"\n", + " candidate_i, candidate_j = (\n", + " i,\n", + " j,\n", + " ) # It means “Unless the action succeeds, the robot stays in place.”\n", + "\n", + " # Now each action changes the coordinates of the robot:\n", + " if a == A_UP:\n", + " candidate_i, candidate_j = (\n", + " i - 1,\n", + " j,\n", + " ) # if the action is UP, then row becomes row -1\n", + " elif a == A_DOWN:\n", + " candidate_i, candidate_j = (\n", + " i + 1,\n", + " j,\n", + " ) # if the action is DOWN, then row becomes row +1\n", + " elif a == A_LEFT:\n", + " candidate_i, candidate_j = (\n", + " i,\n", + " j - 1,\n", + " ) # if the action is LEFT, then column becomes column -1\n", + " elif a == A_RIGHT:\n", + " candidate_i, candidate_j = (\n", + " i,\n", + " j + 1,\n", + " ) # if the action is RIGHT, then column becomes column +1\n", + "\n", + " # Check boundaries\n", + " if not (0 <= candidate_i < n_rows and 0 <= candidate_j < n_cols):\n", + " # If the robot tries to move outside the maze\n", + " # It will not move and it stays at (i, j).\n", + " return i, j\n", + "\n", + " # Check wall\n", + " if maze_str[candidate_i][candidate_j] == \"#\":\n", + " # If the next cell is a wall, the robot stays in place.\n", + " return i, j\n", + "\n", + " return candidate_i, candidate_j # Otherwise, return the new position\n" + ] + }, + { + "cell_type": "markdown", + "id": "c9e620e6", + "metadata": {}, + "source": [ + "### 2.5 Transition probabilities and reward function" + ] + }, + { + "cell_type": "code", + "execution_count": 543, + "id": "610253e7-f3f7-4a30-be3e-2ec5a1e2ed04", + "metadata": {}, + "outputs": [], + "source": [ + "gamma = 0.95\n", + "p_error = 0.1 # 
probability of slipping into one of the other three directions\n" + ] + }, + { + "cell_type": "code", + "execution_count": 544, + "id": "7a51f242-fe4e-4e74-8a1f-a8df32b194b8", + "metadata": {}, + "outputs": [], + "source": [ + "# Initialize transition matrices and reward vector\n", + "P = np.zeros((len(ACTIONS), n_states, n_states))\n", + "R = np.zeros(n_states)" + ] + }, + { + "cell_type": "code", + "execution_count": 545, + "id": "49d54d1f-dc29-45b6-ad31-ad0e848f920d", + "metadata": {}, + "outputs": [], + "source": [ + "# Set rewards for each state\n", + "step_penalty = -0.01\n", + "goal_reward = 1.0\n", + "trap_reward = -1.0\n" + ] + }, + { + "cell_type": "code", + "execution_count": 546, + "id": "b9b7495a-c233-425c-99c0-5bddaf6c3225", + "metadata": {}, + "outputs": [], + "source": [ + "for s in range(n_states):\n", + " if s in goal_states:\n", + " R[s] = goal_reward\n", + " elif s in trap_states:\n", + " R[s] = trap_reward\n", + " else:\n", + " R[s] = step_penalty\n" + ] + }, + { + "cell_type": "code", + "execution_count": 547, + "id": "eca4c571-39c7-468b-af86-0bab9489415e", + "metadata": {}, + "outputs": [], + "source": [ + "terminal_states = set(goal_states + trap_states)\n", + "\n", + "\n", + "def is_terminal(s: int) -> bool:\n", + " \"\"\"Check if a state is terminal (goal or trap).\"\"\"\n", + " return s in terminal_states\n" + ] + }, + { + "cell_type": "code", + "execution_count": 548, + "id": "2d03276b-e206-4d1f-9024-f6948ca61523", + "metadata": {}, + "outputs": [], + "source": [ + "for s in range(n_states): # We loop over all states s.\n", + " i, j = state_to_pos[\n", + " s\n", + " ] # We map state s back to its coordinates (i, j) in the maze.\n", + "\n", + " # First, in a goal or trap state,\n", + " # no matter which action you “choose”, you stay in the same state with probability 1.\n", + " # This makes the terminal states absorbing.\n", + " # Note: R[s] is still collected on every self-loop, so V(s) = R[s] / (1 - gamma) there (±20 for gamma = 0.95).\n", + " if is_terminal(s):\n", + " # Terminal states: stay forever\n", + " for a in ACTIONS:\n", + " P[a, s, s] 
= 1.0 # a probability (self-loop with probability 1), not the reward goal_reward\n", + " continue\n", + "\n", + " # If the state is non-terminal, we define the stochastic movement.\n", + " # For a given state s and intended action a:\n", + " # with probability 1 - p_error, the robot moves in direction a;\n", + " # with probability p_error, the robot moves in one of the other 3 directions, each with probability p_error / 3.\n", + " for a in ACTIONS:\n", + " # main action (intended action)\n", + " main_i, main_j = move_deterministic(i, j, a)\n", + " s_main = pos_to_state[\n", + " (main_i, main_j)\n", + " ] # s_main is the state index of that next cell.\n", + " P[a, s, s_main] += (\n", + " 1 - p_error\n", + " ) # We add probability 1 - p_error to P[a, s, s_main].\n", + "\n", + " # error actions\n", + " other_actions = [\n", + " a2 for a2 in ACTIONS if a2 != a\n", + " ] # other_actions = the 3 actions different from a.\n", + " for a2 in other_actions: # for each error action,\n", + " error_i, error_j = move_deterministic(i, j, a2)\n", + " s_error = pos_to_state[(error_i, error_j)] # get its state index s_error\n", + " P[a, s, s_error] += p_error / len(\n", + " other_actions\n", + " ) # add p_error / 3 to P[a, s, s_error]\n", + "# So for each (s, a), the probabilities over all s_next sum to 1.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 549, + "id": "341fe630-8f87-4773-84ad-92d3516e53e2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Action ↑: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.\n", + " 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.\n", + " 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n", + "Action →: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.\n", + " 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.\n", + " 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n", + "Action ↓: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 
1.\n", + " 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.\n", + " 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n", + "Action ←: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.\n", + " 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.\n", + " 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n" + ] + } + ], + "source": [ + "for a in ACTIONS:\n", + " # For each action a:\n", + " # P[a] is a matrix of shape (n_states, n_states).\n", + " # P[a].sum(axis=1) sums over next states s_next, giving for each state s\n", + " # the total probability mass of the row P[a][s, :]. We print these row sums.\n", + " # If everything is correct, they should be very close to 1.\n", + "\n", + " probs = P[a].sum(axis=1)\n", + " print(f\"Action {action_names[a]}:\", probs)\n" + ] + }, + { + "cell_type": "markdown", + "id": "46d23991", + "metadata": {}, + "source": [ + "## 3. Policy evaluation\n", + "\n", + "### 3.1 Bellman expectation equation" + ] + }, + { + "cell_type": "code", + "execution_count": 550, + "id": "2fffe0b7", + "metadata": {}, + "outputs": [], + "source": [ + "def policy_evaluation( # noqa: PLR0913\n", + " policy: np.ndarray,\n", + " P: np.ndarray,\n", + " R: np.ndarray,\n", + " gamma: float,\n", + " theta: float = 1e-6,\n", + " max_iter: int = 10_000,\n", + ") -> np.ndarray:\n", + " \"\"\"Evaluate a deterministic policy for the given MDP.\n", + "\n", + " Args:\n", + " policy: array of shape (n_states,), with values in {0,1,2,3}\n", + " P: array of shape (n_actions, n_states, n_states)\n", + " R: array of shape (n_states,)\n", + " gamma: discount factor\n", + " theta: convergence threshold\n", + " max_iter: maximum number of iterations\n", + "\n", + " Returns:\n", + " array of shape (n_states,) with the estimated value of each state\n", + "\n", + " \"\"\"\n", + " n_states = len(R) # get the number of states\n", + " V = np.zeros(n_states) # initialize the value function\n", + "\n", + " for _it in range(max_iter): # Main iterative loop\n", + " V_new = np.zeros_like(\n", + " V\n", + " ) # Create a new value vector and we will compute an updated value for each 
state.\n", + "\n", + " # Now we update each state using the Bellman expectation equation\n", + " for s in range(n_states):\n", + " a = policy[s] # Extract the action chosen by the policy in state s\n", + " V_new[s] = R[s] + gamma * np.sum(P[a, s, :] * V)\n", + "\n", + " delta = np.max(\n", + " np.abs(V_new - V)\n", + " ) # This measures how much the value function changed in this iteration.\n", + " # If delta is small, the values have nearly converged; otherwise, we need to keep iterating.\n", + " V = V_new # Update V, i.e. set the new values for the next iteration.\n", + "\n", + " if delta < theta: # Check convergence: when changes are tiny, we stop.\n", + " break\n", + "\n", + " return V # Return the final value function: our estimate of V^{pi}(s) for every state s.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 551, + "id": "4c428327", + "metadata": {}, + "outputs": [], + "source": [ + "def plot_values(V: np.ndarray, title=\"Value function\") -> None:\n", + " \"\"\"Plot the value function V on the maze as a heatmap.\"\"\"\n", + " grid_values = np.full(\n", + " (n_rows, n_cols), np.nan\n", + " ) # Initializes a grid the same size as the maze. 
Every cell starts as NaN.\n", + " for (\n", + " s,\n", + " (i, j),\n", + " ) in (\n", + " state_to_pos.items()\n", + " ): # recall that state_to_pos maps each state index to its maze coordinates (i,j).\n", + " grid_values[i, j] = V[\n", + " s\n", + " ] # For each reachable cell, we write the value V[s] in the grid.\n", + " # Walls # never get values, and they stay as NaN.\n", + "\n", + " fig, ax = plt.subplots(figsize=figsize)\n", + " im = ax.imshow(grid_values, cmap=\"magma\")\n", + " plt.colorbar(im, ax=ax)\n", + "\n", + " # For each state:\n", + " # Place the text label at (column j, row i).\n", + " # Display value to two decimals.\n", + " # Use white text so it’s visible on the heatmap.\n", + " # Center the text inside each cell.\n", + "\n", + " for s, (i, j) in state_to_pos.items():\n", + " ax.text(\n", + " j, i, f\"{V[s]:.2f}\", ha=\"center\", va=\"center\", color=\"white\", fontsize=9\n", + " )\n", + "\n", + " # Remove axis ticks and set title\n", + " ax.set_xticks([])\n", + " ax.set_yticks([])\n", + " ax.set_title(title)\n", + " plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 552, + "id": "c1ab67f0-bd5e-4ffe-b655-aec030401b78", + "metadata": {}, + "outputs": [], + "source": [ + "def plot_policy(policy: np.ndarray, title=\"Policy\") -> None:\n", + " \"\"\"Plot the given policy on the maze.\"\"\"\n", + " _fig, ax = plt.subplots(figsize=figsize)\n", + " # draw walls as dark cells\n", + " wall_grid = np.zeros((n_rows, n_cols))\n", + " for i in range(n_rows):\n", + " for j in range(n_cols):\n", + " if maze_str[i][j] == \"#\":\n", + " wall_grid[i, j] = 1\n", + " ax.imshow(wall_grid, cmap=\"Greys\", alpha=0.5)\n", + "\n", + " for s, (i, j) in state_to_pos.items():\n", + " cell = maze_str[i][j]\n", + " if cell == \"#\":\n", + " continue\n", + "\n", + " if s in goal_states:\n", + " ax.text(\n", + " j,\n", + " i,\n", + " \"G\",\n", + " ha=\"center\",\n", + " va=\"center\",\n", + " fontsize=14,\n", + " fontweight=\"bold\",\n", + " 
color=\"blue\",\n", + " )\n", + " elif s in trap_states:\n", + " ax.text(\n", + " j,\n", + " i,\n", + " \"X\",\n", + " ha=\"center\",\n", + " va=\"center\",\n", + " fontsize=14,\n", + " fontweight=\"bold\",\n", + " color=\"red\",\n", + " )\n", + " elif s == start_state:\n", + " ax.text(\n", + " j,\n", + " i,\n", + " \"S\",\n", + " ha=\"center\",\n", + " va=\"center\",\n", + " fontsize=14,\n", + " fontweight=\"bold\",\n", + " color=\"green\",\n", + " )\n", + " else:\n", + " a = policy[s]\n", + " ax.text(\n", + " j,\n", + " i,\n", + " action_names[a],\n", + " ha=\"center\",\n", + " va=\"center\",\n", + " fontsize=14,\n", + " color=\"black\",\n", + " )\n", + "\n", + " ax.set_xticks(np.arange(-0.5, n_cols, 1))\n", + " ax.set_yticks(np.arange(-0.5, n_rows, 1))\n", + " ax.set_xticklabels([])\n", + " ax.set_yticklabels([])\n", + " ax.grid(True)\n", + " ax.set_title(title)\n", + " plt.show()\n" + ] + }, + { + "cell_type": "markdown", + "id": "813121a9", + "metadata": {}, + "source": [ + "### 3.3 Evaluating a random policy" + ] + }, + { + "cell_type": "code", + "execution_count": 553, + "id": "ceb5dfe2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[0 3 2 1 1 3 0 2 0 0 2 3 2 3 2 3 2 0 3 1 2 1 0 3 3 2 1 3 2 1 1 0 0 2 3 0 3\n", + " 3 1 2 0 3 2 1 0 3 1 3 2 3 3 0 1 1 1 0 2 0 2 2 3 2]\n" + ] + } + ], + "source": [ + "# Random policy: for each state, pick a random action\n", + "random_policy = rng.integers(low=0, high=len(ACTIONS), size=n_states)\n", + "\n", + "print(random_policy)" + ] + }, + { + "cell_type": "code", + "execution_count": 554, + "id": "8f3e2ac2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Value function under random policy:\n", + "[ -0.2 -0.295 -0.534 -1.301 -1.394 -1.467 -0.827 -1.176 -20.\n", + " -0.2 -0.207 -6.086 -0.548 -0.2 -0.201 -0.204 -0.28 -0.481\n", + " -20. -0.546 -0.566 -0.2 -1.126 -0.588 -0.201 -0.203 -0.209\n", + " -20. -20. 
-1.769 -1.186 -1.222 -0.229 -20. -1.944 -0.862\n", + " -0.824 -1.381 -18.279 -20. -7.557 -6.924 -0.44 -5.78 -17.248\n", + " -7.364 -0.214 -0.207 -18.427 -17.386 -16.41 -16.347 -17.519 -18.483\n", + " -20. -0.013 -20. -15.666 -20. -20. 20. 5.496]\n" + ] + } + ], + "source": [ + "V_random = policy_evaluation(policy=random_policy, P=P, R=R, gamma=gamma)\n", + "print(\"Value function under random policy:\")\n", + "print(V_random)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 555, + "id": "cf45291e", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "plot_policy(policy=random_policy, title=\"Policy\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 557, + "id": "5a82a3b7", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "my_policy = [\n", + " A_DOWN,\n", + " A_DOWN,\n", + " A_LEFT,\n", + " A_LEFT,\n", + " A_LEFT,\n", + " A_LEFT,\n", + " A_LEFT,\n", + " A_LEFT,\n", + " A_LEFT,\n", + " A_DOWN,\n", + " A_DOWN,\n", + " A_UP,\n", + " A_UP,\n", + " A_DOWN,\n", + " A_LEFT,\n", + " A_LEFT,\n", + " A_LEFT,\n", + " A_LEFT,\n", + " A_UP,\n", + " A_UP,\n", + " A_LEFT,\n", + " A_DOWN,\n", + " A_UP,\n", + " A_UP,\n", + " A_RIGHT,\n", + " A_RIGHT,\n", + " A_DOWN,\n", + " A_UP,\n", + " A_RIGHT,\n", + " A_RIGHT,\n", + " A_RIGHT,\n", + " A_UP,\n", + " A_DOWN,\n", + " A_UP,\n", + " A_RIGHT,\n", + " A_RIGHT,\n", + " A_RIGHT,\n", + " A_RIGHT,\n", + " A_DOWN,\n", + " A_RIGHT,\n", + " A_RIGHT,\n", + " A_RIGHT,\n", + " A_DOWN,\n", + " A_UP,\n", + " A_DOWN,\n", + " A_UP,\n", + " A_RIGHT,\n", + " A_DOWN,\n", + " A_UP,\n", + " A_LEFT,\n", + " A_LEFT,\n", + " A_RIGHT,\n", + " A_RIGHT,\n", + " A_UP,\n", + " A_LEFT,\n", + " A_DOWN,\n", + " A_UP,\n", + " A_UP,\n", + " A_LEFT,\n", + " A_UP,\n", + " A_DOWN,\n", + " A_LEFT,\n", + "]\n", + "\n", + "V_my_policy = policy_evaluation(policy=my_policy, P=P, R=R, gamma=gamma)\n", + "\n", + "plot_values(V=V_my_policy, title=\"Value function: my policy\")\n", + "plot_policy(policy=my_policy, title=\"My policy\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "446c93e4", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "studies", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}