diff --git a/M2/Reinforcement Learning/Lab 3 - Solving Maze Game with Dynamic Programming.ipynb b/M2/Reinforcement Learning/Lab 3 - Solving Maze Game with Dynamic Programming.ipynb new file mode 100644 index 0000000..623abe8 --- /dev/null +++ b/M2/Reinforcement Learning/Lab 3 - Solving Maze Game with Dynamic Programming.ipynb @@ -0,0 +1,2152 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "44b75d44", + "metadata": {}, + "source": [ + "# Lab 3 - Maze Game as a Markov Decision Process Part 2\n", + "\n", + "## **1. Objectives**\n", + "\n", + "Last week in Lab 2, we \n", + "\n", + "- Modeled a simple **maze game** as a **Markov Decision Process (MDP)** by defining:\n", + " - **States**\n", + " - **Actions**\n", + " - **Transition probabilities**\n", + " - **Rewards**\n", + "\n", + "- Implemented **policy evaluation** to compute the value function of a given policy.\n", + "\n", + "We consider a **discounted MDP** with discount factor $\\gamma \\in (0,1)$.\n", + "\n", + "\n", + "This week, we will use **dynamic programming** to find **an optimal policy**.\n", + "\n", + "**Important: Lab 3 starts with Question 12. Questions 1–11 are already included in Lab 2.**\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "100d1e0d", + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "\n", + "np.set_printoptions(precision=3, suppress=True)\n", + "# (not mandatory) This line is for limiting floats to 3 decimal places,\n", + "# avoiding scientific notation (like 1.23e-04) for small numbers.\n", + "\n", + "# For reproducibility\n", + "rng = np.random.default_rng(seed=42) # This line creates a random number generator.\n" + ] + }, + { + "cell_type": "markdown", + "id": "1018deab", + "metadata": {}, + "source": [ + "## 2. Maze definition and MDP formulation\n", + "\n", + "We consider a small 2D maze on a grid. 
The agent is a **robot** that moves on the grid.\n", + "\n", + "- `S` : start state\n", + "- `G` : goal state, with positive reward\n", + "- `#` : wall (not accessible)\n", + "- `.` : empty cell\n", + "- `X` : \"trap\" (negative reward)\n", + "\n", + "At each step, the robot can choose among 4 actions:\n", + "\n", + "$$\n", + "\\mathcal{A} = \\{\\text{Up} \\uparrow, \\quad \\text{Right} \\rightarrow, \\quad \\text{Down} \\downarrow, \\quad \\text{Left}\\leftarrow\\}.\n", + "$$\n", + "\n", + "The intended movement is deterministic, but we add a small probability of “error” to make the example more realistic:\n", + "- With probability $1 - p_{\\text{error}}$, the robot moves in the chosen direction.\n", + "- With probability $p_{\\text{error}}$, the robot moves in one of the 3 *other* directions, chosen uniformly at random.\n", + "- If the movement would hit a wall or go outside the grid, the agent stays in place.\n", + "\n", + "We will represent the MDP with:\n", + "\n", + "- A list of **states** $\\mathcal{S} = \\{0, \\dots, n_S - 1\\}$, each corresponding to a grid cell.\n", + "- For each action $a$, a transition matrix $P[a]$ of size $(n_S, n_S)$, where\n", + " $$\n", + " P[a][s, s'] = \\mathbb{P}(S_{t+1} = s' \\mid S_t = s, A_t = a).\n", + " $$\n", + "- A reward vector $R$ of length $n_S$, where $R[s]$ is the immediate reward obtained when **leaving** state $s$.\n", + "\n", + "We will use a discount factor $\\gamma = 0.95$.\n" + ] + }, + { + "cell_type": "markdown", + "id": "ca4fa301-c14f-44ec-b04f-b01ca42d979a", + "metadata": {}, + "source": [ + "### 2.1 Define the maze \n", + "\n", + "Let us now define the maze as follows." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "f91cda05", + "metadata": {}, + "outputs": [], + "source": [ + "maze_str = [\n", + " \"#######\",\n", + " \"S...#.#\",\n", + " \"#.#...#\",\n", + " \"#.#..##\",\n", + " \"#..#..G\",\n", + " \"#..X..#\",\n", + " \"#######\",\n", + "]\n" + ] + }, + { + "cell_type": "markdown", + "id": "99820cf4-292d-49ba-b662-f9f05f901f62", + "metadata": {}, + "source": [ + "**Exercise 1.** Compute the dimensions of the maze (complete the “TO DO” parts):\n", + "- How many rows does the maze have?\n", + "- How many columns does the maze have?" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "24d7b74c-66c7-4615-b5e6-c2973a975fc9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "7\n", + "7\n" + ] + } + ], + "source": [ + "# Solution 1.\n", + "\n", + "n_rows = len(maze_str)\n", + "print(n_rows)\n", + "n_cols = len(maze_str[0])\n", + "print(n_cols)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "26c821d3-2362-4b60-8c77-3d09296d130d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Maze:\n", + "#######\n", + "S...#.#\n", + "#.#...#\n", + "#.#..##\n", + "#..#..G\n", + "#..X..#\n", + "#######\n" + ] + } + ], + "source": [ + "print(\"Maze:\")\n", + "for row in maze_str:\n", + " print(row)" + ] + }, + { + "cell_type": "markdown", + "id": "adc49d58-2730-41d8-96fb-ca7c9cb4fcdf", + "metadata": {}, + "source": [ + "### 2.2 Map each walkable cell (not a wall '#') to a state index\n", + "\n", + "Now we convert the maze grid into state indices for the MDP.\n", + "\n", + "\n", + "The cells where the robot is allowed to stand are \n", + "\n", + "- . 
: empty space\n", + "\n", + "- S : start\n", + "\n", + "- G : goal\n", + "\n", + "- X : trap\n", + "\n", + "Everything else (i.e., #) is a wall and cannot be a state in the MDP.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "7116044b-c134-43de-9f30-01ab62325300", + "metadata": {}, + "outputs": [], + "source": [ + "FREE = {\n", + "    \".\",\n", + "    \"S\",\n", + "    \"G\",\n", + "    \"X\",\n", + "} # The set FREE contains the cell types that the agent is allowed to move into." + ] + }, + { + "cell_type": "markdown", + "id": "1c9ad05e-9c6c-4e00-918c-44b858f45298", + "metadata": {}, + "source": [ + "**Dictionaries to convert between grid and state index**\n", + "\n", + "We now want to identify all **valid states** of the maze (all non-wall cells). \n", + "To do this, we need two mappings:\n", + "\n", + "1. `state_to_pos[s] = (i, j)`: Given a state index $s$, return its grid coordinates (row, column).\n", + "2. `pos_to_state[(i, j)] = s`: Given coordinates (i, j), return the corresponding state index $s$.\n", + "\n", + "These two dictionaries allow easy conversion between **MDP state indices** and the **physical maze positions**. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "a1258de4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of states (non-wall cells): 22\n", + "Start state: 0 at (1, 0)\n", + "Goal states: [16] at (4, 6)\n", + "Trap states: [19] at (5, 3)\n" + ] + } + ], + "source": [ + "state_to_pos = {} # s -> (i,j)\n", + "pos_to_state = {} # (i,j) -> s\n", + "\n", + "start_state = None # will store the state index of the start state\n", + "goal_states = [] # will store the state indices of the goal states\n", + "trap_states = [] # will store the state indices of the trap states\n", + "\n", + "s = 0\n", + "for i in range(n_rows): # i = row index\n", + "    for j in range(n_cols): # j = column index\n", + "        cell = maze_str[i][j] # cell = the character at that position (S, ., #, etc.)\n", + "\n", + "        if cell in FREE:\n", + "            # FREE contains: free cells \".\", start cell \"S\", goal cell \"G\" and trap cell \"X\"\n", + "            # Walls # are ignored, they are not MDP states.\n", + "            state_to_pos[s] = (i, j)\n", + "            pos_to_state[(i, j)] = s\n", + "\n", + "            if cell == \"S\":\n", + "                start_state = s\n", + "            elif cell == \"G\":\n", + "                goal_states.append(s)\n", + "            elif cell == \"X\":\n", + "                trap_states.append(s)\n", + "\n", + "            s += 1\n", + "\n", + "n_states = s\n", + "\n", + "print(\"Number of states (non-wall cells):\", n_states)\n", + "print(\"Start state:\", start_state, \"at\", state_to_pos[start_state])\n", + "print(\"Goal states:\", goal_states, \"at\", state_to_pos[goal_states[0]])\n", + "print(\"Trap states:\", trap_states, \"at\", state_to_pos[trap_states[0]])\n" + ] + }, + { + "cell_type": "markdown", + "id": "721b968c-a355-46eb-aae4-5950441ba604", + "metadata": {}, + "source": [ + "*Hint.* If you don’t know what a dictionary is in Python, try the following code to help you understand." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "68744dd6-7278-4c20-8b82-34212685352f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "value2\n" + ] + } + ], + "source": [ + "my_dict = {\"key1\": \"value1\", \"key2\": \"value2\"}\n", + "print(my_dict[\"key2\"])" + ] + }, + { + "cell_type": "markdown", + "id": "0c76f4e1-b0ba-49c5-b9d5-cfb523024ba9", + "metadata": {}, + "source": [ + "**Exercise 2.** Read the program above and answer the following questions:\n", + "1. What is the purpose of state_to_pos and pos_to_state?\n", + "2. Why do we only assign states to cells in FREE?\n", + "3. What would happen if the maze had multiple goal cells?\n", + "4. What is the total number of states (n_states) in this maze? Does this match the number of non-wall cells you can count visually?" + ] + }, + { + "cell_type": "markdown", + "id": "d45828e3-43be-4318-a14c-1242d3a0dcbc", + "metadata": {}, + "source": [ + "**Solution 2.**\n", + "1. `state_to_pos` maps: \n", + "$$\n", + "\\text{state index} \\quad s\\quad \\rightarrow \\quad \\text{grid position} \\quad (i, j)\n", + "$$\n", + "\n", + "`pos_to_state` maps: \n", + "$$\n", + " \\text{grid position} \\quad (i, j) \\quad\\rightarrow \\quad \\text{state index} \\quad s\n", + "$$\n", + "\n", + "We need both because:\n", + "\n", + "- `state_to_pos` lets us visualize, display, or plot the value function on the grid. \n", + "- `pos_to_state` lets us convert a grid position into the correct MDP state index, useful when building transition probabilities.\n", + "\n", + "2. We only assign states to cells in `FREE = {'.', 'S', 'G', 'X'}` because only these cells are **walkable**. Wall cells (`'#'`) **cannot be entered** by the agent, so they are **not included as MDP states**.\n", + "\n", + "3. If the maze had multiple `'G'` cells (several goal locations), we store them in a **list**, for example, goal_states = [5, 12, 23].\n", + "\n", + "4. 22 states. 
(Row 1: 5 free cells; Row 2: 4 free cells; Row 3: 3 free cells; Row 4: 5 free cells; Row 5: 5 free cells)\n" + ] + }, + { + "cell_type": "markdown", + "id": "6d0fa298-7b7c-44fc-bbed-15ea002037c2", + "metadata": {}, + "source": [ + "-----\n", + "\n", + "The following function `plot_maze_with_states` creates a figure showing:\n", + "- the maze walls and free cells\n", + "- the state index for each non-wall cell\n", + "- special labels and colors for S (start state), G (goal state), and X (trap state). " + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "fc61ceef-217c-47f4-8eba-0353369210db", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "def plot_maze_with_states() -> None:\n", + " \"\"\"Plot the maze with state indices.\"\"\"\n", + " grid = np.ones(\n", + " (n_rows, n_cols),\n", + " ) # Start with a matrix of ones. Here 1 means “free cell”\n", + " for i in range(n_rows):\n", + " for j in range(n_cols):\n", + " if maze_str[i][j] == \"#\":\n", + " grid[i, j] = 0 # We replace walls (#) with 0\n", + "\n", + " _fig, ax = plt.subplots()\n", + " ax.imshow(grid, cmap=\"gray\", alpha=0.7)\n", + "\n", + " # Plot state indices\n", + " for (\n", + " s,\n", + " (i, j),\n", + " ) in state_to_pos.items():\n", + " cell = maze_str[i][j]\n", + "\n", + " if cell == \"S\":\n", + " label = f\"S\\n{s}\"\n", + " color = \"green\"\n", + " elif cell == \"G\":\n", + " label = f\"G\\n{s}\"\n", + " color = \"blue\"\n", + " elif cell == \"X\":\n", + " label = f\"X\\n{s}\"\n", + " color = \"red\"\n", + " else:\n", + " label = str(s)\n", + " color = \"black\"\n", + "\n", + " ax.text(\n", + " j,\n", + " i,\n", + " label, # Attention : matplotlib, text(x, y, ...) expects (column, row)\n", + " ha=\"center\",\n", + " va=\"center\",\n", + " fontsize=10,\n", + " fontweight=\"bold\",\n", + " color=color,\n", + " )\n", + "\n", + " ax.set_xticks([]) # remove numeric axes, we don't need.\n", + " ax.set_yticks([])\n", + " ax.set_title(\"Maze with state indices\")\n", + "\n", + " plt.show()\n", + "\n", + "\n", + "plot_maze_with_states()\n" + ] + }, + { + "cell_type": "markdown", + "id": "db078d86", + "metadata": {}, + "source": [ + "### 2.4 Actions and deterministic movement" + ] + }, + { + "cell_type": "markdown", + "id": "96e7f1f2-9d73-410b-853d-e39f40dfb5da", + "metadata": {}, + "source": [ + "We first define integer codes for each action. \n", + "\n", + "**Exercise 3.** How many possible actions can the agent take in the maze?" 
+ ] + }, + { + "cell_type": "markdown", + "id": "22259ab4-527e-4d7c-bb30-98fb240da6d5", + "metadata": {}, + "source": [ + "We have four possible actions in the maze. \n", + "\n", + "In the following cell, each action is mapped to an integer (0,1,2,3). This makes it easy to store and use actions inside arrays and matrices.\n", + "\n", + "Here we use Unicode arrow characters:\n", + "\n", + "- \"\\u2191\" : ↑ (up arrow)\n", + "\n", + "- \"\\u2192\" : → (right arrow)\n", + "\n", + "- \"\\u2193\" : ↓ (down arrow)\n", + "\n", + "- \"\\u2190\" : ← (left arrow)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "f7f0b8e4-1f48-4d03-9e5f-a47e59c3e827", + "metadata": {}, + "outputs": [], + "source": [ + "A_UP, A_RIGHT, A_DOWN, A_LEFT = 0, 1, 2, 3\n", + "ACTIONS = [A_UP, A_RIGHT, A_DOWN, A_LEFT]\n", + "action_names = {A_UP: \"\\u2191\", A_RIGHT: \"\\u2192\", A_DOWN: \"\\u2193\", A_LEFT: \"\\u2190\"}" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "3773781c-a0cd-48db-967b-d4b432d17046", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "↑\n" + ] + } + ], + "source": [ + "print(action_names[0])" + ] + }, + { + "cell_type": "markdown", + "id": "4b957f5a-ee39-4437-abc1-4809105ad83c", + "metadata": {}, + "source": [ + "**Exercise 4.** Now we define a **deterministic movement function** `move_deterministic(i, j, a)`. \n", + "\n", + "This function simulates the robot trying to move from (i, j) in direction a.\n", + "\n", + "If the movement hits a wall or boundary, the agent stays in place.\n", + "\n", + "**Complete the `# !!TO DO HERE !!` part in the program below.**" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "4b06da5e-bc63-48e5-a336-37bce952443d", + "metadata": {}, + "outputs": [], + "source": [ + "def move_deterministic(i: int, j: int, a: int) -> tuple[int, int]:\n", + "    \"\"\"Deterministic movement on the grid. 
If the movement hits a wall or boundary, the agent stays in place.\n", + "\n", + "    Args:\n", + "        i (int): current row index\n", + "        j (int): current column index\n", + "        a (int): action to take (A_UP, A_DOWN, A_LEFT, A_RIGHT)\n", + "\n", + "    Returns:\n", + "        (tuple[int, int]): new (row, column) position after taking action a\n", + "\n", + "    \"\"\"\n", + "    candidate_i, candidate_j = (\n", + "        i,\n", + "        j,\n", + "    ) # Default: unless the action succeeds, the robot stays in place.\n", + "\n", + "    # Now each action changes the coordinates of the robot:\n", + "    if a == A_UP:\n", + "        candidate_i, candidate_j = (\n", + "            i - 1,\n", + "            j,\n", + "        ) # if the action is UP, then row becomes row - 1\n", + "    elif a == A_DOWN:\n", + "        candidate_i, candidate_j = (\n", + "            i + 1,\n", + "            j,\n", + "        ) # if the action is DOWN, then row becomes row + 1\n", + "    elif a == A_LEFT:\n", + "        candidate_i, candidate_j = (\n", + "            i,\n", + "            j - 1,\n", + "        ) # if the action is LEFT, then column becomes column - 1\n", + "    elif a == A_RIGHT:\n", + "        candidate_i, candidate_j = (\n", + "            i,\n", + "            j + 1,\n", + "        ) # if the action is RIGHT, then column becomes column + 1\n", + "\n", + "    # Check boundaries\n", + "    if not (0 <= candidate_i < n_rows and 0 <= candidate_j < n_cols):\n", + "        # If the robot tries to move outside the maze,\n", + "        # it does not move and stays at (i, j).\n", + "        return i, j\n", + "\n", + "    # Check wall\n", + "    if maze_str[candidate_i][candidate_j] == \"#\":\n", + "        # If the next cell is a wall, the robot stays in place.\n", + "        return i, j\n", + "\n", + "    return candidate_i, candidate_j # Otherwise, return the new position\n" + ] + }, + { + "cell_type": "markdown", + "id": "c9e620e6", + "metadata": {}, + "source": [ + "### 2.5 Transition probabilities and reward function" + ] + }, + { + "cell_type": "markdown", + "id": "80bd2bca-7717-4b5f-bffa-76fe86a51d35", + "metadata": {}, + "source": [ + "Recall that we set the discount factor $\\gamma \\in(0,1)$, that is, the 
future rewards are multiplied by $\\gamma$, so immediate rewards matter a little bit more than future ones. \n", + "\n", + "\n", + "Moreover, we consider an error probability $p_{\\text{error}}$: with probability $p_{\\text{error}}$, the robot **does not** execute the intended action but moves in one of the 3 other directions (chosen uniformly). With probability $1-p_{\\text{error}}$, the robot executes the requested action." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "610253e7-f3f7-4a30-be3e-2ec5a1e2ed04", + "metadata": {}, + "outputs": [], + "source": [ + "gamma = 0.95\n", + "p_error = 0.1 # probability of moving in a random other direction instead of the intended one\n" + ] + }, + { + "cell_type": "markdown", + "id": "0d1ceff8-86e0-4c45-83d3-af9fae974608", + "metadata": {}, + "source": [ + "Now we initialize the state-transition probabilities: the probability of reaching the next state $s'$ after taking action $a$ in state $s$. \n", + "$$\n", + " p(s' \\mid s, a)\n", + " = \\mathbb{P} \\big[S_{t+1}=s'\\,|\\, S_t=s, \\,A_t=a\\big]\n", + "$$\n", + "\n", + "We store these transition probabilities in the 3D array `P` (`P[a][s, s_next]`), which has shape `(n_actions, n_states, n_states)`:\n", + "\n", + "`P[a, s, s_next] = P(S_{t+1} = s_next | S_t = s, A_t = a)`.\n", + "\n", + "We also initialize the reward vector `R`, which has length `n_states`, where `R[s]` is the reward received when the agent is in state `s`.\n", + "\n", + "In this maze game, we assume that the reward depends only on the current state, which is natural: in navigation tasks, being in a particular location is what matters, not the direction you used to reach it." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "7a51f242-fe4e-4e74-8a1f-a8df32b194b8", + "metadata": {}, + "outputs": [], + "source": [ + "# Initialize transition matrices and reward vector\n", + "P = np.zeros((len(ACTIONS), n_states, n_states))\n", + "R = np.zeros(n_states)" + ] + }, + { + "cell_type": "markdown", + "id": "c08f4af5-a2a7-4baa-b5da-c7ce636d8a4a", + "metadata": {}, + "source": [ + "Now we assign the reward to each state. \n", + "\n", + "For each state index s:\n", + "\n", + "1. If s is a goal, then the reward = +1.0\n", + "2. If s is a trap, then the reward = −1.0\n", + "3. Otherwise for the normal cell, the reward = −0.01 every time you leave this cell.\n", + "\n", + "Recall that rewards are received at the moment the agent executes an action. Here when the agent moves out of the cell, we set reward −0.01. " + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "49d54d1f-dc29-45b6-ad31-ad0e848f920d", + "metadata": {}, + "outputs": [], + "source": [ + "# Set rewards for each state\n", + "step_penalty = -0.01\n", + "goal_reward = 1.0\n", + "trap_reward = -1.0" + ] + }, + { + "cell_type": "markdown", + "id": "dd571ec8-c36a-4e20-bec6-9e6458dc622b", + "metadata": {}, + "source": [ + "**Exercise 5.** Why do we set the step penalty to -0.01 in this MDP?" + ] + }, + { + "cell_type": "markdown", + "id": "00c51189-3ff0-4a5e-ad52-92747b971e16", + "metadata": {}, + "source": [ + "**Solution 5** We assign a small negative reward for every step, which encourages the agent to reach the goal quickly.\n" + ] + }, + { + "cell_type": "markdown", + "id": "07bfb065-b1af-4df1-885e-780fe250f2fb", + "metadata": {}, + "source": [ + "**Exercise 6.** We now define the reward vector. Recall that we have already initialized\n", + "`R = np.zeros(n_states)`.\n", + "If a state belongs to `goal_states`, we assign the `goal_reward`.\n", + "If it belongs to `trap_states`, we assign the `trap_reward`.\n", + "Otherwise, we assign the `step_penalty`. 
\n", + "\n", + "**Complete the `# TO DO` part in the program below.** " + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "c70885b4-a301-42f2-ab70-2901d941cde7", + "metadata": {}, + "outputs": [], + "source": [ + "for s in range(n_states):\n", + "    if s in goal_states:\n", + "        R[s] = goal_reward\n", + "    elif s in trap_states:\n", + "        R[s] = trap_reward\n", + "    else:\n", + "        R[s] = step_penalty" + ] + }, + { + "cell_type": "markdown", + "id": "b90fb80c-9452-48a2-889f-286703c2ae93", + "metadata": {}, + "source": [ + "Now we define terminal states and a helper function. Here `terminal_states` is a set containing all absorbing states: once the agent reaches one of them, the episode is conceptually over. \n", + "\n", + "Moreover, `is_terminal(s)` is a small helper to check if a state is terminal." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eca4c571-39c7-468b-af86-0bab9489415e", + "metadata": {}, + "outputs": [], + "source": [ + "terminal_states = set(goal_states + trap_states)\n", + "\n", + "\n", + "def is_terminal(s: int) -> bool:\n", + "    \"\"\"Check if a state is terminal.\"\"\"\n", + "    return s in terminal_states\n" + ] + }, + { + "cell_type": "markdown", + "id": "3a9a1d54-8339-402b-84e9-105961ed78d7", + "metadata": {}, + "source": [ + "Now we need to fill the transition matrices `P[a][s, s_next]`. \n" + ] + }, + { + "cell_type": "markdown", + "id": "d9cfd15c-12cc-48bb-bd88-07f3ae3db31c", + "metadata": {}, + "source": [ + "**Exercise 7.** **Complete the `# TO DO` part in the program below** to fill the transition matrices `P[a][s, s_next]`. 
(There are only 2 # TO DO here)" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "2d03276b-e206-4d1f-9024-f6948ca61523", + "metadata": {}, + "outputs": [], + "source": [ + "for s in range(n_states): # We loop over all states s.\n", + "    i, j = state_to_pos[\n", + "        s\n", + "    ] # Recover the coordinates (i, j) of state s in the maze.\n", + "\n", + "    # First, in a goal or trap state,\n", + "    # no matter which action you “choose”, you stay in the same state with probability 1.\n", + "    # This makes the terminal states absorbing.\n", + "    if is_terminal(s):\n", + "        # Terminal states: stay forever\n", + "        for a in ACTIONS:\n", + "            P[a, s, s] = 1.0 # a probability, not a reward\n", + "        continue\n", + "\n", + "    # If the state is non-terminal, we define the stochastic movement.\n", + "    # For a given state s and intended action a,\n", + "    # with probability 1 - p_error, the robot will move in direction a;\n", + "    # with probability p_error, the robot will move in one of the other 3 directions, each with probability p_error / 3.\n", + "    for a in ACTIONS:\n", + "        # main action (intended action)\n", + "        main_i, main_j = move_deterministic(i, j, a)\n", + "        s_main = pos_to_state[\n", + "            (main_i, main_j)\n", + "        ] # s_main is the state index of that next cell.\n", + "        P[a, s, s_main] += (\n", + "            1 - p_error\n", + "        ) # We add probability 1 - p_error to P[a, s, s_main].\n", + "\n", + "        # error actions\n", + "        other_actions = [\n", + "            a2 for a2 in ACTIONS if a2 != a\n", + "        ] # other_actions = the 3 actions different from a.\n", + "        for a2 in other_actions: # for each error action,\n", + "            error_i, error_j = move_deterministic(i, j, a2)\n", + "            s_error = pos_to_state[(error_i, error_j)] # get its state index s_error\n", + "            P[a, s, s_error] += p_error / len(\n", + "                other_actions,\n", + "            ) # add p_error / 3 to P[a, s, s_error]\n", + "# So for each (s,a), probabilities over all s_next sum to 1.\n" + ] + }, + { + "cell_type": "markdown", + "id": 
"7841b264-af00-4322-b728-adcffac0ef89", + "metadata": {}, + "source": [ + "Now we check if the transition matrices `P[a][s, s_next]` are computed correctly.\n", + "For each action `a`, we sum the transition probabilities over all possible next states `s_next` and verify that these sums are equal to 1.\n", + "\n", + "This is because the array `P[a, s, s_next]` stores the transition probability\n", + "\n", + "$\\mathbb{P} \\big[S_{t+1}=s_{\\text{next}}\\,|\\, S_t=s, \\,A_t=a\\big]$. \n", + "\n", + "Therefore, for each action $a$, and for each state $s$, the sum over $s_{\\text{next}}$ of $\\mathbb{P} \\big[S_{t+1}=s_{\\text{next}}\\,|\\, S_t=s, \\,A_t=a\\big]$ should be 1. " + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "341fe630-8f87-4773-84ad-92d3516e53e2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Action ↑: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n", + "Action →: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n", + "Action ↓: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n", + "Action ←: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n" + ] + } + ], + "source": [ + "for a in ACTIONS:\n", + "    # For each action a:\n", + "    # P[a] is a matrix of shape (n_states, n_states).\n", + "    # P[a].sum(axis=1) sums over next states s_next, giving one row sum per state s.\n", + "    # We print these row sums.\n", + "    # If everything is correct, they should be very close to 1.\n", + "\n", + "    probs = P[a].sum(axis=1)\n", + "    print(f\"Action {action_names[a]}:\", probs)\n" + ] + }, + { + "cell_type": "markdown", + "id": "46d23991", + "metadata": {}, + "source": [ + "## 3. 
Policy evaluation\n", + "\n", + "### 3.1 Bellman expectation equation" + ] + }, + { + "cell_type": "markdown", + "id": "305b047c-e83b-4f42-b64e-e2050d5deeff", + "metadata": {}, + "source": [ + "Recall that the value function under a policy $\\pi$ is defined as:\n", + "$$\n", + "V^{\\pi}(s)=\\mathbb{E}\\Big[\\:G_t \\:\\Big|\\: S_t=s\\:\\Big]\n", + "$$\n", + "where the return $G_t$ is\n", + "$$\n", + "G_t=R_t +\\gamma R_{t+1}+\\gamma^2 R_{t+2}+\\dots \n", + "$$\n", + "This means: *the value of a state is the expected discounted sum of all future rewards\n", + "when following policy $\\pi$.*\n", + "\n", + "We know that $G_t=R_t+\\gamma G_{t+1}$, and plugging this equation into the definition of $V^{\\pi}(s)$, we get \n", + "$$\n", + "V^{\\pi}(s)=\\mathbb{E}\\Big[\\:R_t \\:\\Big|\\: S_t=s\\:\\Big]+\\gamma\\mathbb{E}\\Big[\\:G_{t+1} \\:\\Big|\\: S_t=s\\:\\Big]. \n", + "$$\n", + "This step simply says: “the total future reward = immediate reward + discounted reward from the next state.”" + ] + }, + { + "cell_type": "markdown", + "id": "88ea8d56-3b62-4690-9ff7-469e43726fbc", + "metadata": {}, + "source": [ + "For the expected immediate reward part $\\mathbb{E}[R_t| S_t=s]$, as we are in a maze problem, the reward depends only on the current state, not the time step, i.e., $\\mathbb{E}[R_t| S_t=s]=R(s)$. Hence we get \n", + "$$\n", + "V^{\\pi}(s)=R(s)+\\gamma\\mathbb{E}\\Big[\\:G_{t+1} \\:\\Big|\\: S_t=s\\:\\Big]. \n", + "$$\n", + "\n", + "Moreover, in this maze problem, we consider a deterministic policy $A_t=\\pi(S_t)$ (the action depends only on the state). Therefore, \n", + "$$\n", + "V^{\\pi}(s)=R(s)+\\gamma\\mathbb{E}\\Big[\\:G_{t+1} \\:\\Big|\\: S_t=s, A_t=\\pi(s)\\:\\Big]. 
\n", + "$$\n", + "\n", + "Now **given the state $S_t=s$ and $A_t=a$**, the next state is random (because of the error probability) and we know the transition probability \n", + "$$\n", + "\\mathbb{P}\\big(\\:S_{t+1}=s' \\:|\\:S_t=s, \\, A_t=a\\big)=P\\big(s'\\:\\big|\\:s, a\\big). \n", + "$$" + ] + }, + { + "cell_type": "markdown", + "id": "c25e255d-8f58-4eaf-9485-cee6ab3bea6c", + "metadata": {}, + "source": [ + "Therefore,\n", + "$$\n", + "\\mathbb{E}\\big[\\,G_{t+1}\\,|\\,S_t=s,A_t=a\\,\\big] =\\sum_{s'}\\mathbb{E}\\big[\\,G_{t+1}\\,|\\,S_{t+1}=s'\\,\\big]\\times \\mathbb{P}\\big[S_{t+1}=s'\\,\\big|\\,S_t=s, A_t=a\\, \\big]\n", + "$$\n", + "$$\n", + "\\hspace{-1.2cm}=\\sum_{s'}V^{\\pi}(s')P\\big(s'\\:\\big|\\:s, a\\big),\n", + "$$\n", + "where we use the Markov property. (**Question: Can you show the detailed computations here?**)" + ] + }, + { + "cell_type": "markdown", + "id": "9a2b6cff-e848-44a2-b504-973067b367b3", + "metadata": {}, + "source": [ + "In conclusion, we obtain the Bellman expectation equation\n", + "$$\n", + "V^{\\pi}(s)=R(s)+\\gamma \\sum_{s'}P\\big(\\,s'\\,\\big|\\,s, \\pi(s)\\,\\big)V^{\\pi}(s').\n", + "$$" + ] + }, + { + "cell_type": "markdown", + "id": "15049fdb-f3af-4f78-b556-817284260ed0", + "metadata": {}, + "source": [ + "### 3.2 Define a function which computes the value function $V^{\\pi}(s)$ for a given deterministic policy. \n", + "\n", + "\n", + "**Exercise $8^*$.** Now we define `policy_evaluation(...)`, which computes the value function $V^{\\pi}(s)$ for a given deterministic policy. \n", + "\n", + "The inputs of the function `policy_evaluation(...)` are:\n", + "1. `policy`: array of size `n_states`, where each entry is an action 0,1,2,3, corresponding to UP, RIGHT, DOWN, LEFT.\n", + "2. `P`: the transition probabilities `P[a, s, s']`.\n", + "3. `R`: the reward vector `R[s]`.\n", + "4. `gamma`: the discount factor $\\gamma\\in(0,1)$.\n", + "5. `theta`: convergence threshold.\n", + "6. 
`max_iter`: maximum number of iterations, used to avoid infinite loops.\n", + "\n", + "How can we apply the Bellman expectation equation\n", + "$$\n", + "V^{\\pi}(s)=R(s)+\\gamma \\sum_{s'}P\\big(\\,s'\\,\\big|\\,s, \\pi(s)\\,\\big)V^{\\pi}(s')\n", + "$$\n", + "here?\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "20ef113f-0872-46e1-95ab-3cf5016f5a14", + "metadata": {}, + "source": [ + "We start with an initial guess of $V^{\\pi}$ (e.g., all values = 0) and repeatedly apply the Bellman equation to update each state:\n", + "$$\n", + "V_{k+1}^\\pi(s) \\leftarrow R(s)+\\gamma \\sum_{s'}P\\big(\\,s'\\,\\big|\\,s, \\pi(s)\\,\\big)V^{\\pi}_k(s'),\n", + "$$\n", + "until the values converge.\n", + "\n", + "**Complete the `# TO DO HERE` part in the program below** " + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "3a05f8bc-2b8f-4a4c-9931-6d28c3b0db35", + "metadata": {}, + "outputs": [], + "source": [ + "def policy_evaluation( # noqa: PLR0913\n", + "    policy: np.ndarray,\n", + "    P: np.ndarray,\n", + "    R: np.ndarray,\n", + "    gamma: float,\n", + "    theta: float = 1e-6,\n", + "    max_iter: int = 10_000,\n", + ") -> np.ndarray:\n", + "    \"\"\"Evaluate a deterministic policy for the given MDP.\n", + "\n", + "    Args:\n", + "        policy: array of shape (n_states,), with values in {0,1,2,3}\n", + "        P: array of shape (n_actions, n_states, n_states)\n", + "        R: array of shape (n_states,)\n", + "        gamma: discount factor\n", + "        theta: convergence threshold\n", + "        max_iter: maximum number of iterations\n", + "\n", + "    \"\"\"\n", + "    n_states = len(R) # get the number of states\n", + "    V = np.zeros(n_states) # initialize the value function\n", + "\n", + "    for _it in range(max_iter): # Main iterative loop\n", + "        V_new = np.zeros_like(\n", + "            V,\n", + "        ) # Create a new value vector; we will compute an updated value for each state.\n", + "\n", + "        # Now we update each state using the Bellman expectation equation\n", + "        for s in range(n_states):\n", + "            a = policy[s] # Extract the action 
chosen by the policy in state s\n", + "            V_new[s] = R[s] + gamma * np.sum(P[a, s, :] * V)\n", + "\n", + "        delta = np.max(\n", + "            np.abs(V_new - V),\n", + "        ) # This measures how much the value function changed in this iteration:\n", + "        # If delta is small, the values start to converge; otherwise, we need to keep iterating.\n", + "        V = V_new # Update V, i.e. set the new values for the next iteration.\n", + "\n", + "        if delta < theta: # Check convergence: when changes are tiny, we stop.\n", + "            break\n", + "\n", + "    return V # Return the final value function: our estimate of V^{pi}(s) for every state s.\n" + ] + }, + { + "cell_type": "markdown", + "id": "09ef3439", + "metadata": {}, + "source": [ + "### 3.3 Evaluating a random policy" + ] + }, + { + "cell_type": "markdown", + "id": "eecbca15-f89f-47bf-a13d-7d7c051699b8", + "metadata": {}, + "source": [ + "Now we use the policy evaluation function `policy_evaluation` to evaluate a random policy. \n", + "\n", + "We first generate a `random_policy`, which is an array like [2, 0, 1, 3, 0, 2, ...] of size `n_states`. (Recall that a policy is a mapping from states to actions.)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "b4a44e38", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[0 3 2 1 1 3 0 2 0 0 2 3 2 3 2 3 2 0 3 1 2 1]\n" + ] + } + ], + "source": [ + "# Random policy: for each state, pick a random action\n", + "random_policy = rng.integers(low=0, high=len(ACTIONS), size=n_states)\n", + "\n", + "print(random_policy)" + ] + }, + { + "cell_type": "markdown", + "id": "3fe07992-ce82-4124-aebc-a6384d417f64", + "metadata": {}, + "source": [ + "Now we call the function `policy_evaluation(...)` to compute $V^{\\pi_{\\text{random}}}(s)$." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "c5f559b2-452a-477c-a1fa-258b40805670", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Value function under random policy:\n", + "[ -0.2 -0.2 -0.201 -0.204 -0.205 -0.202 -0.214 -0.429 -0.212\n", + " -0.207 -0.276 -0.459 -0.352 -0.366 -5.827 -4.605 20. -0.366\n", + " -0.999 -20. -6.4 -3.163]\n" + ] + } + ], + "source": [ + "V_random = policy_evaluation(random_policy, P, R, gamma)\n", + "print(\"Value function under random policy:\")\n", + "print(V_random)" + ] + }, + { + "cell_type": "markdown", + "id": "f46c70ba-2932-49af-b568-b5477260bc94", + "metadata": {}, + "source": [ + "In this value vector of the policy:\n", + "- A negative value means that, starting from this state, the agent tends to move around aimlessly, fall into traps, or take too long.\n", + "- A higher value means that the agent is closer to the goal or more likely to reach it." + ] + }, + { + "cell_type": "markdown", + "id": "1efcb076-467c-42d8-94e8-87453f688bbd", + "metadata": {}, + "source": [ + "Now we define a function `plot_values`, which displays the value function $V(s)$ as a heatmap on the maze grid. It helps students visually understand:\n", + "- which states are good (high value, near the goal),\n", + "- which states are bad (low value, near traps),\n", + "- how a policy affects the long-term expected reward." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "4c428327", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "def plot_values(V: np.ndarray, title: str = \"Value function\") -> None:\n", + " \"\"\"Plot the value function V on the maze as a heatmap.\"\"\"\n", + " grid_values = np.full(\n", + " (n_rows, n_cols),\n", + " np.nan,\n", + " ) # Initializes a grid the same size as the maze. Every cell starts as NaN.\n", + " for (\n", + " s,\n", + " (i, j),\n", + " ) in (\n", + " state_to_pos.items()\n", + " ): # recall that state_to_pos maps each state index to its maze coordinates (i,j).\n", + " grid_values[i, j] = V[\n", + " s\n", + " ] # For each reachable cell, we write the value V[s] in the grid.\n", + " # Walls # never get values, and they stay as NaN.\n", + "\n", + " _fig, ax = plt.subplots()\n", + " im = ax.imshow(grid_values, cmap=\"magma\")\n", + " plt.colorbar(im, ax=ax)\n", + "\n", + " # For each state:\n", + " # Place the text label at (column j, row i).\n", + " # Display value to two decimals.\n", + " # Use white text so it's visible on the heatmap.\n", + " # Center the text inside each cell.\n", + "\n", + " for s, (i, j) in state_to_pos.items():\n", + " ax.text(\n", + " j,\n", + " i,\n", + " f\"{V[s]:.2f}\",\n", + " ha=\"center\",\n", + " va=\"center\",\n", + " color=\"white\",\n", + " fontsize=9,\n", + " )\n", + "\n", + " # Remove axis ticks and set title\n", + " ax.set_xticks([])\n", + " ax.set_yticks([])\n", + " ax.set_title(title)\n", + " plt.show()\n", + "\n", + "\n", + "plot_values(V_random, title=\"Value function: random policy\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "8275a1eb-b58e-4e05-ae5d-5635ff9a1556", + "metadata": {}, + "source": [ + "The next function `plot_policy` visualizes a policy on the maze.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "c1ab67f0-bd5e-4ffe-b655-aec030401b78", + "metadata": {}, + "outputs": [], + "source": [ + "def plot_policy(policy: np.ndarray, title: str = \"Policy\") -> None:\n", + " \"\"\"Plot the 
given policy on the maze.\"\"\"\n", + " _fig, ax = plt.subplots()\n", + " # draw walls as dark cells\n", + " wall_grid = np.zeros((n_rows, n_cols))\n", + " for i in range(n_rows):\n", + " for j in range(n_cols):\n", + " if maze_str[i][j] == \"#\":\n", + " wall_grid[i, j] = 1\n", + " ax.imshow(wall_grid, cmap=\"Greys\", alpha=0.5)\n", + "\n", + " for s, (i, j) in state_to_pos.items():\n", + " cell = maze_str[i][j]\n", + " if cell == \"#\":\n", + " continue\n", + "\n", + " if s in goal_states:\n", + " ax.text(\n", + " j,\n", + " i,\n", + " \"G\",\n", + " ha=\"center\",\n", + " va=\"center\",\n", + " fontsize=14,\n", + " fontweight=\"bold\",\n", + " color=\"blue\",\n", + " )\n", + " elif s in trap_states:\n", + " ax.text(\n", + " j,\n", + " i,\n", + " \"X\",\n", + " ha=\"center\",\n", + " va=\"center\",\n", + " fontsize=14,\n", + " fontweight=\"bold\",\n", + " color=\"red\",\n", + " )\n", + " elif s == start_state:\n", + " ax.text(\n", + " j,\n", + " i,\n", + " \"S\",\n", + " ha=\"center\",\n", + " va=\"center\",\n", + " fontsize=14,\n", + " fontweight=\"bold\",\n", + " color=\"green\",\n", + " )\n", + " else:\n", + " a = policy[s]\n", + " ax.text(\n", + " j,\n", + " i,\n", + " action_names[a],\n", + " ha=\"center\",\n", + " va=\"center\",\n", + " fontsize=14,\n", + " color=\"black\",\n", + " )\n", + "\n", + " ax.set_xticks(np.arange(-0.5, n_cols, 1))\n", + " ax.set_yticks(np.arange(-0.5, n_rows, 1))\n", + " ax.set_xticklabels([])\n", + " ax.set_yticklabels([])\n", + " ax.grid(visible=True)\n", + " ax.set_title(title)\n", + " plt.show()\n" + ] + }, + { + "cell_type": "markdown", + "id": "48037254-dccc-4f9c-a4d7-349adba5c74f", + "metadata": {}, + "source": [ + "Now let’s visualize the `random_policy`. Does it seem like a good policy?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "d452681c-c89c-41cc-95dc-df75993b0391", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYoAAAGgCAYAAAC0SSBAAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjEsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvc2/+5QAAAAlwSFlzAAAPYQAAD2EBqD+naQAAH6JJREFUeJzt3Q1wlNWh//Ff1pCQxLw0AQUvEVulcm16UyJSShlem5bSoS1cKCUoOmqrcOWC5Ho7KUVIteoES6UjU65ixb6kV+xM6aU4XAIBW4tN9Z8/oyGASlHBvzLgSyIGQ0L2P+fETUKCxw0meXZPvp+Z4+4+2eA5eZ7d356XZ5+EcDgcFgAAHyH0UT8AAMAgKAAATgQFAMCJoAAAOBEUAAAnggIA4ERQAACcCAoAgBNBAQBwIiiADiZNmmRLxCuvvKKEhARt3Lgx0HoBQSIoEPfMm7h5M4+UgQMH6rOf/axuu+02HTt2LOjqAXEvMegKAD3lxz/+sT796U/rgw8+0NNPP61f/OIXevLJJ1VTU6PU1NTz+jeHDx+uU6dOacCAAT1eXyBeEBTwxte//nWNHj3a3r/55puVk5OjNWvW6I9//KPmzZt3Xv9mpIcC9GcMPcFbU6ZMsbeHDx9Wc3Oz7rrrLl1++eVKTk7WZZddph/+8IdqbGx0/hsfNUdx4MABfec739HgwYOVkpKiK6+8UsuXL7c/27Vrl/2dP/zhD13+vfLycvuzZ555pkfbCvQmggLeOnTokL01PQvTw7jzzjtVUFCgn/3sZ5o4caLuvfdeffe73+32v/v888/ri1/8oiorK/W9731Pa9eu1be//W1t2bLF/txMhufm5uq3v/1tl98120xYfelLX+qBFgJ9g6EneKOurk4nTpywcxR//etf7ZyF+bQ/cuRI3XrrrTYsHn74YfvcRYsW6aKLLtL9999vewCTJ0+O+v+zePFimcu4VFdX69JLL23bft9999lb02O49tpr7bCXqVNmZqbdfvz4cW3fvr2t5wHEC3oU8MZXvvIVOxRkPs2bnsKFF15oh3/27Nljf75s2bKznl9cXGxvt27dGvX/w7zZ//nPf9aNN954VkhEAiJiwYIFdljr97//fdu2xx9/3A6BmRAB4gk9Cnhj3bp1dllsYmKiLr74YjtvEAqFbFiY2yuuuOKs5w8ZMkRZWVl69dVXo/5//OMf/7C3eXl5zueZXsw111xjh5puuukmu83cHzt2bJd6ALGOoIA3xowZ07bq6Vw6fuLvC6ZXsWTJEh09etT2Lv72t7/pwQcf7NM6AD2BoSd4z5wL0dLSopdeeums7eZkvHfffdf+PFqf+cxn7K05N+PjmOGvCy64QL/73e9sb8KcizF37tzzaAEQLIIC3ps+fbq9feCBB87abiabjW984xtR/1tmDmTChAn65S9/qddee+2sn5kJ7o4GDRpkz+34zW9+Y4Ni2rRpdhsQbxh6gvfy8/N1/fXX66GHHrI9CLM09u9//7see+wxu6y1OyuejJ///OcaP368XWr7/e9/354Nbs63MJPie/fu7TL8NHv2bHvfnMcBxCOCAv3Chg0b7LCROXHOTG6bieySkhKtXLnyvILHzDesWLHCfk2IWY5rhq/MCXidzZgxQ5/61Kfs0Nc3v/nNHmoN0LcSwp37ywB6jFkOe8kll9jAeOSRR4KuDnBemKMAetHmzZvtuRdmCAqIV/QogF5QVVVlv+rDzEuYCWxzFjcQr+hRAL3AzF0sXLjQfk3Ir371q6CrA3wi9CgAAE70KAAATgQFAKBnzqMw31XT8SIvZl3422+/bb/rv6+/QwcA8M
mYWYf33nvPLt82X5rZI0FhLvJSWlr6CasGAIglR44c0bBhw3pmMrtzj8JckMV8H/8dd9zhzYXnTapeddVVqq2ttT2meOdbewzaFB9oU+xramrS6tWr7dfaRC6u9Yl7FOY6w6Z0ZkIiKSlJvhwIqamptj0+HAi+tcegTfGBNsWPaKYOmMwGADgRFAAAJ4ICAOBEUAAAnAgKAIATQQEAcCIoAABOBAUAwImgAAA4ERQAACeCAgDgRFAAAJwICgCAE0EBAHAiKAAATgQFAMCJoAAAOBEUAAAnggIA4ERQAACcCAoAgBNBAQBwIigAAE4EBQDAiaAAADgRFAAAJ4ICAOBEUAAAnAgKAIATQQEAcCIoAABOBAUAwImgAAA4ERQAACeCAgDgRFAAAJwICgCAE0EBAHAiKAAATgQFAMCJoAAAOBEUAAAnggIA4ERQAACcCAoAgBNBAQBwIigAAE4EBQDAKTHaJzY2NtoSUV9fb29DoZAtPoi0g/bELtoUH2hT7OtOOxLC4XA4mieuWrVKpaWlXbaXl5crNTW1ezUEAASqoaFBRUVFqqurU0ZGRs8Exbl6FLm5uXrjjTeUk5MjHzQ1NamiokKFhYUaMGCA4p1v7fG9TTU1NWppaZEvn1bz8vLYTzHs9OnTuvvuu6MKiqiHnpKTk23pzBwEvhwIvrbJt/b42ibz5uPDG1BH7KfY1Z02+DHYBgDoNQQFAMCJoAAAOBEUAAAnggIA4ERQAACcCAoAgBNBAQBwIigAAE4EBQDAiaAAADgRFAAAJ4ICAOBEUAAAnAgKAIATQQEA6JkLF/W1/675bz2691HtfXOv3j71tlIHpCo7JVvDM4cr/+J8TR8xXV+74mtBVxMfWrJkiR5//HG9+eabQVcFQH8IigV/WKBfP//rs7bVN9bb8sq7r+ipV5/Sq3WvEhQxxFxO8dixY0FXA0B/CIptL287KySuHnq1vnb513Rh0oU63nBc1W9U65mjzwRaRwDoT2IuKLYf2t52/4rsK1R1c5UuCF1w1nNMz+KFYy8o3q5Pe8cdd+imm27SVVddFXR1cB4Xol+2bJndh8OHDw+6OjgH9lE/msxubmluu//uB+/aoabOMpIz9OVLv6x4cebMGV133XVas2aNHn300aCrg/NQXV2tDRs2aMKECTp06FDQ1cE5sI/6UVAUDC1ou3+i4YQ+++BndfVDV+vWP92qh//Pw3r57ZcVT5qbmzVv3jyVl5dr0aJFKisrC7pKOA9jx47Vli1bdPz4cftGdPDgwaCrhE7YR/1o6Onaf7lW655dp+f+33P2cUu4xc5LmBIx/tLxevDrDyp/SL5i3Zw5c7R582ZlZmYqISFBixcvjnoV0YgRI3q9fpBWrFihd955J6rn5uXl6dlnn9XEiRP11FNP6corr+z1+oF9FLSYC4rEUKIqF1Tq3qfv1S//7y917P2uK2mefu1pFf66UPsW7dPgtMGK5XmJ3bt3t60KWrduXdS/O3v2bIKij5jhwNdff71bv2NWeNXW1vIm1EfYR8GKuaEnIz05XfdMvUdvFL+hmoU1euSbj+j6/OuVnpTe9hyzAqrzEtpYEwqFtHPnTmVnZysrK0tVVVUKh8NRlUmTJgVd/X7j6NGjUe2TkydPtu2X0tJSzZw5M+iq9xvso2DFZFBEmKGaz130Od046kZt/PZGPb/weYUS2qv80lsvKdYVFBSosrJSiYmJKiws1J49e+TT5OH69eu7bN+/f7/Wrl0rn7z33nuaNm2a7SHed999uvPOO4OuEjphH/WjoafH9j6mD5o/0LzPz7OrmzpKG5Bmg8LMWxhZA7MUD/Lz87Vr1y5NnTrV9jDGjRsnHyxfvlzbtm1TQ0ND27Z9+/ZpypQpOnXqlGbNmqXc3Fz54OWXX9YLL7xgV67dfvvtQVcH58A+6kdBcfjdwyp9qlRL/3epnbT+wsVfsF/d8dapt/T72t+ftXx22hXTFC/MBJt5Ex00aJB8sWnTJk2fPl
3FxcUaPLh1rmjy5Ml2Pfv27du9CQlj1KhR9o3Ip/3nG/ZRPwqKCNOr2PGPHbacy/cKvqeJl01UPPHtAE5PT7c9ihkzZtgeU2Q58I4dOzR69Gj5xrf95yP2UT8JiqVjl+rzF31elYcr9dwbz+nNk2/q+PvHdSZ8RoNTB+vqS662E9uz/nlW0FWFGQ5MS9PWrVv1rW99y85ZVFRU2E92APwRc0Fh5h3+9ap/tQXxISUlxQ41AfBTTK96AgAEj6AAADgRFAAAJ4ICAOBEUAAAnAgKAIATQQEAcCIoAABOBAUAwImgAAA4ERQAACeCAgDgRFAAAJwICgCAE0EBAHAiKAAATgQFAMCJoAAA9MylUBsbG22JqK+vt7dNTU22+CDSDtoTu3xuUyjkz+e2SFvYT7GrO+1ICIfD4WieuGrVKpWWlnbZXl5ertTU1O7VEAAQqIaGBhUVFamurk4ZGRk906MoKSnRsmXLzupR5Obmqra2VklJSfIlYfPy8lRYWKgBAwbIh09AFRUVqqmpUUtLi3zg2z7quJ9oU2xr8uz1dPr06aifG3VQJCcn29KZ+YP58EfryBzYvhzcBvsoPtCm+NDiyeupO23wY7ANANBrCAoAgBNBAQBwIigAAE4EBQDAiaAAADgRFAAAJ4ICAOBEUAAAnAgKAIATQQEAcCIoAABOBAUAwImgAAA4ERQAACeCAgDgRFAAnTQ3N2v+/PkaOXKkXnzxRfliyZIlGjJkiHzh636KRQQF0OnykHPmzLHXgj948KAmTZqkAwcOyAfm2sjHjh2TD3zeT7GIoAA+1NjYqFmzZmnz5s1tF5s/efKkfRPat29f0NXDh9hPfY+g6MPr0xYXF6u2tjboquAjzJ07V1u3blVJSYlmzpxpt23fvl2nTp3S5MmTdfTo0aCrCPZTIBKD+d/2L2fOnNGCBQtsNzkUCmn16tVBVwnnsHTpUl1zzTVavny5brjhBrtt7Nixqqio0JYtWzRs2LCgqwj2UyAIij6YcCsqKtITTzyhRYsWqaysLOgq4SOYoQtTOhszZowtiA3sp75HUPQyM+FmxlIzMzOVkJCgxYsXR71CZcSIEb1ePwD4OARFL89L7N69u23Fybp166L+3dmzZxMUAGICk9m9yMxH7Ny5U9nZ2crKylJVVZXC4XBU5VxdawAIAkHRywoKClRZWanExEQVFhZqz549QVcJ/Uh1dbXWr1/fZfv+/fu1du3aQOqE+MPQUx/Iz8/Xrl27NHXqVNvDGDduXNBVQj9hVgZt27ZNDQ0NbdvMuQZTpkyxy0nN+Qi5ubmB1hGxj6DoI3l5efYFOmjQoKCrgn5k06ZNmj59uj2HZ/DgwXabOdfAnNlszj0gJBANhp76ECGBvpaenm57FCYcjh8/3rZke8eOHfbcAyAaBAXgubS0NHsms5kjy8nJscOfo0ePDrpaiCMEBXAOGzdutKvPfJGSkmKHmk6cOKFRo0bJF77tp1hFUAAAnAgKAIATQQEAcCIoAABOBAUAwImgAAA4ERQAACeCAgDgRFAAAJwICgCAE0EBAHAiKAAATgQFAMCJoAAAOBEUAAAnggIA4ERQAACcCAoAgFOiotTY2GhLRH19vb0NhUK2+CDSjqamJvkg0g5f9o+P+6hjW3xsU1lZmVpaWuTLsZeXl+fN66k77UgIR3nB2VWrVqm0tLTL9vLycqWmpnavhgCAQDU0NKioqEh1dXXKyMjomR5FSUmJli1bdlaPIjc3V7W1tUpKSpJPnxgKCws1YMAA+fCprqKiQjU1Nd59qvNlH3XcTz62ycdjr8aTNp0+fTrq50YdFMnJybZ0Zv5gPvzROjIvVl9esAb7KD742CYfj70WT9rUnTb4MdgGAOg1BAUAwImgAAA4ERQAACeCAgDgRFAAAJwICgCAE0EBAHAiKAAATgQFAMCJoAAAOBEUAAAnggIA4ERQAACcCAoAgBNBAQBwIij6WHNzs+bPn6+RI0fqxRdfDLo6+AhLli
zRkCFDgq4GEBMIij6+9OCcOXPsdcYPHjyoSZMm6cCBA0FXC+dgriN87NixoKsBxASCoo80NjZq1qxZ2rx5c9uFzE+ePGnDYt++fUFXDwA+EkHRR+bOnautW7eqpKREM2fOtNu2b9+uU6dOafLkyTp69GjQVYSHzHWRi4uLVVtbG3RVEMcSg65Af7F06VJdc801Wr58uW644Qa7bezYsaqoqNCWLVs0bNiwoKsIz5w5c0YLFiywQ52hUEirV68OukqIUwRFHzFDTKZ0NmbMGFuAnl40UVRUpCeeeEKLFi1SWVlZ0FVCHCMoAA+ZRRNmPiwzM1MJCQlavHhx1Ku9RowY0ev1Q3whKAAP5yV2797dtnpr3bp1Uf/u7NmzCQp0wWQ24BkzH7Fz505lZ2crKytLVVVVCofDUZVzDY8CBAXwoerqaq1fv77L9v3792vt2rWKJwUFBaqsrFRiYqIKCwu1Z8+eoKuEOMbQE/AhsyJt27ZtamhoaNtmznGZMmWKXcZszoPJzc1VvMjPz9euXbs0depU28MYN25c0FVCnCIogA9t2rRJ06dPt+cdDB482G4z57iYM+rNOS/xFBIReXl5NuwGDRoUdFUQxxh6Aj6Unp5uexQmHI4fP962zHTHjh32nJd4RUjgkyIogA7S0tLsGfRmXD8nJ8cO2YwePTroagGBIigCsHHjRrvCBLEpJSXFDjWdOHFCo0aNCro6QOAICgCAE0EBAHAiKAAATgQFAMCJoAAAOBEUAAAnggIA4ERQAACcCAoAgBNBAQBwIigAAE4EBQDAiaAAADgRFAAAJ4ICAOBEUAAAnAgKAIATQQEAcEpUlBobG22JqK+vt7ehUMgWH0Ta0dTUJB9E2lFcXKwBAwbIlzZVVFSorKxMLS0t8uW4y8vL8+a48/3YK/akTW+99ZbuueeeqJ6bEI7y4s2rVq1SaWlpl+3l5eVKTU3tfi0BAIFpaGhQUVGR6urqlJGR0TNBca4eRW5urn70ox8pKSlJPn2yKyws9OITQ+QTkC/t6dimmpoa73oUPu4n2hTbPYqhQ4dGFRRRDz0lJyfb0pl5sfrygo0wB4EPB4Kv7TE47uIDbYpd3WmDH5MLAIBeQ1AAAJwICgCAE0EBAHAiKAAATgQFAMCJoAAAOBEUAAAnggIA4ERQAACcCAoAgBNBAQBwIigAAE4EBQDAiaAAADgRFAFYsmSJhgwZEnQ10M9w3MWut96Sysqkr35VuuQSaeBAcw0gaehQacIE6Y47pL/8RYruMnM9L+oLF6HnmCtKHTt2LOhqoJ/huItNDz0kLVsmvf9+15+9+WZrMSFx//3SG29IQWQ9QQEAAVm9WvrP/2x/nJAgTZ4sjR0rXXih9Pbb0t690tNPSx98EFw9CQoACMD+/VJJSfvjnBzpf/5HGjeu63NPnpR+/WspJUWBYI4C581cs7q4uFi1tbVBVwX9iC/H3c9/Lp050/54/fpzh4RhehcLF0qZmQoEPQqclzNnzmjBggUqLy9XKBTSatOHBnqZT8fdzp3t9z/1KWnWLMUsehTotubmZs2bN8++WBctWqQys1wD6GW+HXevv95+f8QIKdTh3fjAgdb5is7lhhsCqSo9CnTfnDlztHnzZmVmZiohIUGLFy+OennmCPOKAM6Dz8ddQoJiGkGBbo8P7969u2255bp166L+3dmzZ8f8Cxaxycfj7p/+SXrppdb75tacIxEJjIsual0RZaxcKTU0BFdPg6GnPlJdXa31Zraqk/3792vt2rWKF2ZceOfOncrOzlZWVpaqqqoUDoejKpMmTQq6+v0Ox13sHndTp7bfN8tgzYqniOxs6T/+o7UEtdKpI4KijyxfvlwLFy7UmjVr2rbt27fPHsQrVqzQkSNHFC8KCgpUWVmpxMREFRYWas+ePUFXCR+B4y523XabdMEF7Y9vvbX1nIlYRFD0kU2bNmn8+PF2Wd+TTz5pt02ePFmNjY3avn27cnNzFU/y8/O1a9cuDR
w40H7SQ2ziuItdn/ucdNdd7Y/NGdijR0szZkirVkk/+Yl0881Sfb0CxxxFH0lPT9e2bds0Y8YMe6BHVnHs2LFDo83REYfy8vLsp9NBgwYFXRV8BI672FZSIqWltZ6d3djYel7Fn/7UWs7FnJQXBHoUfSgtLU1bt2613eacnBz7iSheX6wRPrxYfcdxF9v+/d+lw4dbexHjx0uDB0uJia1zE5deKhUWtv6sulr66U+DqSM9ij6WkpJiu/xAX+K4i21Dh7aubjIlFtGjAAA4ERQAACeCAgDgRFAAAJwICgCAE0EBAHAiKAAATgQFAMCJoAAAOBEUAAAnggIA4ERQAACcCAoAgBNBAQBwIigAAE4EBQDAiaAAADgRFAAAJ4ICANAz18xubGy0JaK+vt7ehkIhW3wQaUdTU5N8EGmHL+3p2Jbi4mINGDBAvrSpoqLCy/3kY5vKysrU0tKieHf69Omon5sQDofD0Txx1apVKi0t7bK9vLxcqamp3ashACBQDQ0NKioqUl1dnTIyMnqmR1FSUqJly5ad1aPIzc1VbW2tkpKS5EuPIi8vT4WFhV58Wo18UvWlPQZtig8+t6mmpqbf9SiiDork5GRbOjN/MB/+aB2ZA9uXg9vH9hi0KT742KYWT97zutMGPyYXAAC9hqAAADgRFAAAJ4ICAOBEUAAAnAgKAIATQQEAcCIoAABOBAUAwImgAAA4ERQAACeCAgDgRFAAAJwICgCAE0EBAHAiKPpYc3Oz5s+fr5EjR+rFF18MujowDh2SLrxQSkhoLV/9qtT5wo/mcWFh+3PS0qSXXlI8WbJkiYYMGRJ0NRCHCIo+vqLUnDlz7OVjDx48qEmTJunAgQNBVwuXXy799KftjysqpHXrzn7Ogw9KO3a0P77/fmnECMUTc8nLY8eOBV0NxCGCoo80NjZq1qxZ2rx5c9v1aU+ePGnDYt++fUFXD7fcIk2f3v74Bz+QIj0+c2seR0ybJi1c2Pd1BAJCUPSRuXPnauvWrfba4zNnzrTbtm/frlOnTmny5Mk6evRo0FXEhg1STk7r/YYGacECk/DSdddJp061bs/Olh55JNBqAn2NoOgjS5cu1d1336177rmnbdvYsWPtxdpvueUWDRs2TPF4zd3i4mLV1tbKC0OHSr/4RfvjqippzBjp739v32Z+fsklgVQPHz+0e9ttt+nVV18NuireISj6iBliWr58eZftY8aM0V133aV4c+bMGV133XVas2aNHn30UXljzhxp/vz2x88/336/qEj6zncCqRY+XnV1tTZs2KAJEybokFmggB5DUOC8Vm7NmzfPTsovWrRIZWVl8oqZuDa9i44uvrjrBDdiiumhb9myRcePH7dhYRaMoGck9tC/g37ErNwyk/KZmZlKSEjQ4sWLo16eOSIeVgqZ+aK33z57m3n8yivSF74QVK36tRUrVuidd96J6rl5eXl69tlnNXHiRD311FO68sore71+viMo0O15id27d7ctt1zXjU/Zs2fPjv2gaGpqnbw2k9jn2v7cc1JyclC167fM8Obrr7/erd8xS4HN/BlB8ckx9IRuCYVC2rlzp7Kzs5WVlaWqqiqFw+GoipmniXkrV0p797Y//rd/a79fUyP96EeKt3H79evXd9m+f/9+rV27VvHCrAqM5hiLLDk3SktL21YY4pMhKNBtBQUFqqysVGJiogoLC7Vnzx55wbSj43zLjTe2zlfcdFP7tjVrpL/8RfHCLKBYuHChXXQQYc7bMW+mZjjnyJEj8sV7772nadOm2R7vfffdpzvvvDPoKnmDoMB5yc/P165duzRw4EDbw4h7778vXX+9Wc7V+viyy6QHHmi9b24/85nW+y0trc87eVLxYNOmTRo/frxdxvzkk0/abea8HXMCqDmPJzc3V754+eWX9cILL9hQ/EHHEyTxiTFHgfNmJg3Np9NBgwYp7hUXm3ea1vuhkPTYY1J6eutj8z1Qv/qVNHFia5AcPizdfrv08MOKde
np6dq2bZtmzJhhgz2yam3Hjh0aPXq0fDJq1CgbFl4cjzGGHgU+ES9elNu2Sf/1X+2PTQhMmHD2c7785bO/xsOcxb11q+JBWlqa/VYAM0yYk5Nje4C+hYRXx2MMokcRgI0bN9qCGGG+u6nzt8Wey09+0lriUEpKih1qAs4HPQoAgBNBAQBwIigAAE4EBQDAiaAAADgRFAAAJ4ICAOBEUAAAnAgKAIATQQEAcCIoAABOBAUAwImgAAA4ERQAACeCAgDgRFAAAJwICgCAE0EBAOiZS6E2NjbaElFfX29vQ6GQLT6ItKOpqUk+iLTDl/YYtCk++NymkGfvd9FICIejuViwtGrVKpWWlnbZXl5ertTU1O7VEAAQqIaGBhUVFamurk4ZGRk906MoKSnRsmXLzupR5Obmqra2VklJSfIlYfPy8lRTU6OWlhbFO9/aY9Cm+ECbYt/p06ejfm7UQZGcnGxLZ+YP5sMfzec2+dYegzbFB9oUu7rTBj8G2wAAvYagAAA4ERQAACeCAgDgRFAAAJwICgCAE0EBAHAiKAAATgQFAMCJoAAAOBEUAAAnggIA4ERQAACcCAoAgBNBAQBwIigAAE4EBQDAiaAAADgRFAAAJ4ICAOBEUAAAnAgKAIATQQEAcCIoAABOBAUAwImgAAA4ERQAACeCAgDgRFAAAJwICgCAE0EBAHAiKAAATgQFAMCJoAAAOBEUAAAnggIA4ERQAACcCAoAgBNBAQBwIigAAE4EBQDAiaAAADgRFAAAJ4ICAOBEUAAAnAgKAIATQQEAcCIoAABOBAUAwClRUWpsbLQloq6uzt42NTXJF6FQSA0NDTp9+rRaWloU73xrj0Gb4gNtin2R9+5wOPzxTw5HaeXKleZfo1AoFIr8KYcOHfrY9/8E85/z6VG8++67Gj58uF577TVlZmbKB/X19crNzdWRI0eUkZGheOdbewzaFB9oU+wzo0KXXnqp3nnnHWVlZfXM0FNycrItnZmQ8OGP1pFpj09t8q09Bm2KD7QpPobUPvY5fVITAEDcIigAAL0TFGYYauXKleccjopXvrXJt/YYtCk+0Ca/2hP1ZDYAoH9i6AkA4ERQAACcCAoAgBNBAQBwIigAAE4EBQDAiaAAADgRFAAAufx/wkBHS6w2LYIAAAAASUVORK5CYII=", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "plot_policy(random_policy, title=\"Policy\")" + ] + }, + { + "cell_type": "markdown", + "id": "cbad5bf1-0150-4c3f-8cce-c82e0f1d1695", + "metadata": {}, + "source": [ + "**Exercise 9.** Define your own policy and evaluate it using the functions `policy_evaluation(...)` and `plot_values(...)`. **Can you identify an optimal policy visually?** Plot your own policy using `plot_policy`. \n" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "929707e6-3022-4d86-96cc-12f251f890a9", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "my_policy = np.array(\n", + " [\n", + " A_RIGHT,\n", + " A_RIGHT,\n", + " A_RIGHT,\n", + " A_DOWN,\n", + " A_DOWN, # First row\n", + " A_UP,\n", + " A_DOWN,\n", + " A_DOWN,\n", + " A_LEFT, # Second row\n", + " A_UP,\n", + " A_RIGHT,\n", + " A_DOWN, # Third row\n", + " A_UP,\n", + " A_LEFT,\n", + " A_RIGHT,\n", + " A_RIGHT,\n", + " A_RIGHT, # Fourth row\n", + " A_UP,\n", + " A_LEFT,\n", + " A_DOWN,\n", + " A_RIGHT,\n", + " A_UP, # Fifth row\n", + " ],\n", + ")\n", + "\n", + "V_my_policy = policy_evaluation(policy=my_policy, P=P, R=R, gamma=gamma)\n", + "\n", + "plot_values(V=V_my_policy, title=\"Value function: my policy\")\n", + "plot_policy(policy=my_policy, title=\"My policy\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "9cd7b3d0-e5ea-48c1-a68a-be3b3e782e9f", + "metadata": {}, + "source": [ + "-----------------------------------\n", + "\n", + "## 4. Dynamic programming : Policy improvement and Policy iteration" + ] + }, + { + "cell_type": "markdown", + "id": "bb35acc8-6469-499b-b565-3f2d590b13bc", + "metadata": {}, + "source": [ + "**Exercise 12.** \n", + "\n", + "Write a `policy_improvement` function whose inputs are the state-value function `V`, the transition probability matrix `P`, the reward vector `R`, and the discount factor $\\gamma$. \n", + "The function should return a **greedy policy** that, for each state, selects the action that maximizes the expected return according to the input `V`.\n", + "\n", + "\n", + "*Question: Why don’t we input the old policy in this policy improvement step?*\n", + "\n", + "\n", + "*Remark.* In this maze game, we consider a deterministic policy $\\pi:s\\in\\mathcal{S}\\mapsto a\\in\\mathcal{A}$ that assigns one single action to each state.\n" + ] + }, + { + "cell_type": "markdown", + "id": "94f00eeb-63d4-43a1-813f-37dda7276693", + "metadata": {}, + "source": [ + "------------------\n", + "\n", + "*Hint.* 1. 
This exercise can be completed in two steps. \n", + "\n", + "In the first step, compute the action-value function $q^{\\pi}(s,a)$ from the state-value function $s' \\mapsto v^{\\pi}(s') $, for a fixed state $ s $. \n", + "Which formula should be used to express $ q^{\\pi}(s,a) $ in terms of $ v^{\\pi} $?\n", + "\n", + "In the second step, perform the greedy policy improvement step by computing a new policy $ \\pi' $ such that\n", + "$$\n", + "\\pi'(s) = \\arg\\max_{a} q^{\\pi}(s,a).\n", + "$$\n", + "\n", + "Note that for terminal states the action choice is irrelevant, so we can simply assign action 0 to them. \n", + "\n", + "2. Bellman action-value equation for the maze: \n", + "\n", + "In this maze environment, the **immediate reward depends only on the current state (for non-terminal states)**:\n", + "\n", + "$$\n", + "r(s,a,s') = R(s).\n", + "$$\n", + "\n", + "This means:\n", + "\n", + "- The reward does **not** depend on the action taken.\n", + "- The reward does **not** depend on the next state.\n", + "- All actions taken from the same state yield the same immediate reward.\n", + "\n", + "The general Bellman equation for the action-value function is:\n", + "\n", + "$$\n", + "Q(s,a)\n", + "=\n", + "\\sum_{s', r} P(s',r\\mid s,a)\n", + "\\left(\n", + "r(s,a,s') + \\gamma V(s')\n", + "\\right).\n", + "$$\n", + "\n", + "Since the reward satisfies \n", + "$$\n", + "r(s,a,s') = R(s),\n", + "$$\n", + "we can simplify the expression:\n", + "\n", + "$$\n", + "\\begin{aligned}\n", + "Q(s,a)\n", + "&= \\sum_{s'} P(s' \\mid s,a)\n", + "\\left(\n", + "R(s) + \\gamma V(s')\n", + "\\right) \\\\\n", + "&= R(s) + \\gamma \\sum_{s'} P(s' \\mid s,a) V(s').\n", + "\\end{aligned}\n", + "$$\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "998e3005-9a8b-4759-a008-aeccedd25924", + "metadata": {}, + "outputs": [], + "source": [ + "def policy_improvement(\n", + " V: np.ndarray, P: np.ndarray, R: np.ndarray, gamma: float,\n", + ") -> np.ndarray:\n", + " 
\"\"\"Given a value function V, compute the improved policy.\n", + "\n", + " Args:\n", + " V: array of shape (n_states,)\n", + " P: array of shape (n_actions, n_states, n_states)\n", + " R: array of shape (n_states,)\n", + " gamma: discount factor\n", + "\n", + " Returns:\n", + " policy: array of shape (n_states,), with values in {0,1,2,3}\n", + "\n", + " \"\"\"\n", + " n_actions = P.shape[0]\n", + " policy = np.zeros(n_states, dtype=int)\n", + " for s in range(n_states): # We decide the best action separately for each state.\n", + " # For terminal states, action choice is irrelevant; keep 0\n", + " if is_terminal(s):\n", + " continue\n", + "\n", + " Q_values = np.zeros(n_actions)\n", + " for a in range(n_actions):\n", + " Q_values[a] = R[s] + gamma * np.dot(P[a, s, :], V)\n", + " policy[s] = np.argmax(Q_values)\n", + "\n", + " return policy" + ] + }, + { + "cell_type": "markdown", + "id": "28800a10-f76b-4f27-a697-1238678f6bb3", + "metadata": {}, + "source": [ + "**Exercise 13.** \n", + "\n", + "Write a `policy_iteration` function whose inputs are the initial policy `initial_policy`, the transition probability matrix `P`, the reward vector `R`, the discount factor $\\gamma$ `gamma`, the tolerance parameter `theta` used in policy evaluation (the evaluation stops when the value function changes by less than `theta`), and `max_iter`, which serves as a safety limit to prevent the loop from running indefinitely. \n", + "\n", + "The function should return two outputs: \n", + "- `policy`, the final (optimal) policy, represented as an array of action indices; \n", + "- `V`, the value function corresponding to this policy.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "22663608-c3a4-47d0-b1c9-b7d1bb67fa64", + "metadata": {}, + "source": [ + "--------------------------\n", + "\n", + "*Hint.* The `policy_iteration` algorithm consists of two main steps. \n", + "First, the **policy evaluation** step, where you will use the function implemented in **Exercise 8**. 
\n", + "Second, the **policy improvement** step, where you will use the function implemented in **Exercise 12**.\n", + "\n", + "--------------------------" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b6c2216", + "metadata": {}, + "outputs": [], + "source": [ + "def policy_iteration( # noqa: PLR0913\n", + " initial_policy: np.ndarray,\n", + " P: np.ndarray,\n", + " R: np.ndarray,\n", + " gamma: float,\n", + " theta: float = 1e-6,\n", + " max_iter: int = 1000,\n", + ") -> tuple[np.ndarray, np.ndarray]:\n", + " \"\"\"Policy Iteration.\n", + "\n", + " Goal:\n", + " Learn an optimal policy by alternating:\n", + " 1) Policy Evaluation\n", + " 2) Policy Improvement\n", + "\n", + " Inputs:\n", + " initial_policy : array of shape (num_states,)\n", + " Initial deterministic policy.\n", + " P : transition probabilities\n", + " R : reward function\n", + " gamma : discount factor\n", + " theta : stopping threshold for policy evaluation\n", + " max_iter : maximum number of policy iteration steps\n", + "\n", + " Returns:\n", + " policy : optimal policy\n", + " V : value function of the optimal policy\n", + "\n", + " \"\"\"\n", + " policy = initial_policy\n", + "\n", + " for _it in range(max_iter):\n", + " # Compute the value function V^pi for the current policy.\n", + " V = policy_evaluation(policy, P, R, gamma, theta)\n", + "\n", + " # Improve the policy by acting greedily with respect to V.\n", + " new_policy = policy_improvement(V, P, R, gamma)\n", + "\n", + " # Check whether the policy has stopped changing.\n", + " if np.array_equal(new_policy, policy):\n", + " break\n", + " policy = new_policy\n", + " return policy, V\n" + ] + }, + { + "cell_type": "markdown", + "id": "7ed35eca-81fd-45ac-bc8c-550117124e21", + "metadata": {}, + "source": [ + "**Exercise 14.** \n", + "\n", + "Starting from a random policy (see Section 3.3), compute an optimal policy for the Maze game. 
\n", + "Then, plot the value function of this optimal policy and visualize the policy itself by displaying arrows on the maze.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "e12d62ee-3324-4e1b-b5e2-6be96404ac2c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Optimal policy found by Policy Iteration:\n", + "[1 1 1 2 2 0 2 2 3 0 1 2 0 3 1 1 0 0 0 0 1 0]\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "opt_policy, V = policy_iteration(random_policy, P, R, gamma)\n", + "print(\"Optimal policy found by Policy Iteration:\")\n", + "print(opt_policy)\n", + "plot_policy(opt_policy, title=\"Optimal policy found by Policy Iteration\")\n", + "plot_values(V, title=\"Value function of the optimal policy\")" + ] + }, + { + "cell_type": "markdown", + "id": "7d52252c-7aee-4be9-9896-939348add5de", + "metadata": {}, + "source": [ + "## 5. Dynamic programming : Value iteration" + ] + }, + { + "cell_type": "markdown", + "id": "00c8cbd1-e0ea-4919-b61f-ff0bc3c95880", + "metadata": {}, + "source": [ + "**Exercise 15.** Write a `value_iteration` function whose inputs are the transition probability matrix `P`, the reward vector `R`, the discount factor $\\gamma$ `gamma`, the parameter `theta`, which is a stopping tolerance (stop when the value function changes by less than theta), and `max_iter`, which serves as a safety limit to prevent the loop from running indefinitely. \n", + "\n", + "The outputs of value_iteration are `V`, which is an approximation of the optimal value function, and `policy`, which is a greedy policy derived from the final `V`.\n", + "\n", + "*Question:* Do `value_iteration` and `policy_iteration` find the same optimal policy?" + ] + }, + { + "cell_type": "markdown", + "id": "fd52fa5c-dbaa-4281-a1a6-dd7180123756", + "metadata": {}, + "source": [ + "*Hint.* Value iteration repeatedly applies the Bellman optimality operator. 
In the maze case, it is \n", + "$$\n", + "(\\mathcal{T}^* V)(s)=\\max_a \\Big\\{ R(s) + \\gamma \\sum_{s'}P(s'|s,a)V(s')\\Big\\}\n", + "$$" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "293ba7fc-f9dc-41b0-ad78-677af1ac7e0f", + "metadata": {}, + "outputs": [], + "source": [ + "def value_iteration(\n", + " P: np.ndarray,\n", + " R: np.ndarray,\n", + " gamma: float,\n", + " theta: float = 1e-6,\n", + " max_iter: int = 10_000,\n", + ") -> tuple[np.ndarray, np.ndarray]:\n", + " \"\"\"Value Iteration (student version).\n", + "\n", + " Goal:\n", + " Approximate the optimal value function V*\n", + " and derive an optimal policy.\n", + "\n", + " Inputs:\n", + " P : array of shape (n_actions, n_states, n_states)\n", + " Transition probabilities.\n", + " R : array of shape (n_states,)\n", + " Reward for each state.\n", + " gamma : float\n", + " Discount factor.\n", + " theta : float\n", + " Stopping tolerance for convergence.\n", + " max_iter : int\n", + " Maximum number of iterations.\n", + "\n", + " Returns:\n", + " V : array of shape (n_states,)\n", + " Approximation of the optimal value function V*.\n", + " policy : array of shape (n_states,)\n", + " Greedy policy derived from V.\n", + "\n", + " \"\"\"\n", + " n_states = len(R)\n", + " n_actions = len(P)\n", + " V = np.zeros(n_states)\n", + "\n", + " # Main value iteration loop\n", + " for _it in range(max_iter):\n", + " V_new = np.zeros_like(V)\n", + "\n", + " # Loop over all states\n", + " for s in range(n_states):\n", + " if is_terminal(s):\n", + " V_new[s] = R[s] / (1 - gamma)\n", + " continue\n", + "\n", + " Q_values = np.zeros(n_actions)\n", + " for a in range(n_actions):\n", + " Q_values[a] = R[s] + gamma * np.sum(P[a, s, :] * V)\n", + " V_new[s] = np.max(Q_values)\n", + "\n", + " delta = np.max(np.abs(V_new - V))\n", + " if delta < theta:\n", + " break\n", + "\n", + " V = V_new\n", + " policy = policy_improvement(V, P, R, gamma)\n", + "\n", + " return V, policy\n" + ] + }, + { + 
"cell_type": "code", + "execution_count": 35, + "id": "9b6ff9d3-ccc9-4f35-a6c3-545aeed552f7", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n" + ] + } + ], + "source": [ + "V_vi, policy_vi = value_iteration(P, R, gamma)\n", + "\n", + "plot_values(V_vi, title=\"Optimal value function (value iteration)\")\n", + "plot_policy(policy_vi, title=\"Optimal policy (value iteration)\")\n", + "\n", + "print(np.abs(opt_policy - policy_vi))\n" + ] + }, + { + "cell_type": "markdown", + "id": "f4db246d-07c2-4587-b185-7298fe292674", + "metadata": {}, + "source": [ + "## 6. Advanced exercises \n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "b0c04200-39f7-41fe-a479-62c87efab8a3", + "metadata": {}, + "source": [ + "**Exercise 16 (Policy Iteration vs Value Iteration)**\n", + "\n", + "In this exercise, we compare the number of iterations required by **policy iteration** and **value iteration** to reach an optimal policy in the Maze game.\n", + "\n", + "1. Modify the definition of `policy_iteration` and `value_iteration` so that they can record:\n", + " - the **number of iterations** until convergence,\n", + " - and optionally the runtime.\n", + "2. Run both algorithms starting from:\n", + " - the same random initialization (a random policy for policy iteration, and $V_0 \\equiv 0$ for value iteration),\n", + " - and repeat the experiment over several random seeds in order to compute the **average number of iterations** and **average runtime**.\n", + "3. 
Report and interpret the results.\n", + "\n", + "*Question:* What do you observe?\n", + "\n", + "-------------\n", + "\n", + "*Hint.* the word “iteration” means something different for **policy iteration** and **value iteration**:\n", + "\n", + "- Policy iteration: one “iteration” = one outer loop step = policy evaluation + policy improvement.\n", + "- Value iteration: one “iteration” = one Bellman optimality sweep over all states.\n", + "\n", + "-------------\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cce78e3c-ca82-4002-9a8f-08af9457147c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Same policy? True\n", + "Same policy? True\n", + "Same policy? True\n", + "Same policy? True\n", + "Same policy? True\n", + "Same policy? True\n", + "Same policy? True\n", + "Same policy? True\n", + "Same policy? True\n", + "Same policy? True\n", + "Policy Iteration - outer iterations: [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]\n", + "Value Iteration - iterations: [100000, 100000, 100000, 100000, 100000, 100000, 100000, 100000, 100000, 100000]\n", + "Mean PI iterations: 1000.0\n", + "Mean VI iterations: 100000.0\n", + "Mean PI runtime: 0.022669649124145506\n", + "Mean VI runtime: 0.008730292320251465\n" + ] + } + ], + "source": [ + "import time\n", + "\n", + "\n", + "def policy_iteration_count( # noqa: PLR0913\n", + " P: np.ndarray,\n", + " R: np.ndarray,\n", + " gamma: float,\n", + " theta: float = 1e-6,\n", + " max_iter: int = 1_000,\n", + " seed: int = 0,\n", + ") -> tuple[np.ndarray, np.ndarray, int, float]:\n", + " \"\"\"Policy Iteration with iteration count and runtime.\"\"\"\n", + " start_time = time.time()\n", + " rng = np.random.default_rng(seed)\n", + " policy = rng.integers(low=0, high=len(ACTIONS), size=n_states)\n", + " for _it in range(max_iter):\n", + " V = policy_evaluation(policy, P, R, gamma, theta)\n", + "\n", + " new_policy = policy_improvement(V, P, R, 
gamma)\n", + "\n", + "        if np.array_equal(new_policy, policy):\n", + "            break\n", + "        policy = new_policy\n", + "    runtime = time.time() - start_time\n", + "    # Report the number of outer iterations actually performed, not the cap max_iter.\n", + "    return policy, V, _it + 1, runtime\n", + "\n", + "\n", + "def value_iteration_count(\n", + "    P: np.ndarray,\n", + "    R: np.ndarray,\n", + "    gamma: float,\n", + "    theta: float = 1e-6,\n", + "    max_iter: int = 100_000,\n", + ") -> tuple[np.ndarray, np.ndarray, int, float]:\n", + "    \"\"\"Value Iteration with iteration count and runtime.\"\"\"\n", + "    start_time = time.time()\n", + "    n_states = len(R)\n", + "    n_actions = len(P)\n", + "    V = np.zeros(n_states)\n", + "    for _it in range(max_iter):\n", + "        V_new = np.zeros_like(V)\n", + "        for s in range(n_states):\n", + "            if is_terminal(s):\n", + "                V_new[s] = R[s] / (1 - gamma)\n", + "                continue\n", + "            Q_values = np.zeros(n_actions)\n", + "            for a in range(n_actions):\n", + "                Q_values[a] = R[s] + gamma * np.sum(P[a, s, :] * V)\n", + "            V_new[s] = np.max(Q_values)\n", + "        delta = np.max(np.abs(V_new - V))\n", + "        if delta < theta:\n", + "            break\n", + "        V = V_new\n", + "    runtime = time.time() - start_time\n", + "    policy = policy_improvement(V, P, R, gamma)\n", + "\n", + "    # Report the number of sweeps actually performed, not the cap max_iter.\n", + "    return V, policy, _it + 1, runtime\n", + "\n", + "\n", + "# Next, run the comparison over several seeds\n", + "\n", + "gamma = 0.9\n", + "theta = 1e-6\n", + "seeds = list(range(10))\n", + "\n", + "pi_iters = []\n", + "vi_iters = []\n", + "pi_times = []\n", + "vi_times = []\n", + "\n", + "for seed in seeds:\n", + "    # Policy iteration\n", + "    pi_policy, pi_V, n_pi, t_pi = policy_iteration_count(  # noqa: N816\n", + "        P,\n", + "        R,\n", + "        gamma,\n", + "        theta=theta,\n", + "        seed=seed,\n", + "    )\n", + "    # Value iteration\n", + "    vi_V, vi_policy, n_vi, t_vi = value_iteration_count(P, R, gamma, theta=theta)  # noqa: N816\n", + "\n", + "    pi_iters.append(n_pi)\n", + "    vi_iters.append(n_vi)\n", + "    pi_times.append(t_pi)\n", + "    vi_times.append(t_vi)\n", + "\n", + "    # Optional: check both found the same final 
policy\n", + "\n", + "    print(\"Same policy?\", np.array_equal(pi_policy, vi_policy))\n", + "\n", + "print(\"Policy Iteration - outer iterations:\", pi_iters)\n", + "print(\"Value Iteration - iterations:\", vi_iters)\n", + "\n", + "print(\"Mean PI iterations:\", np.mean(pi_iters))\n", + "print(\"Mean VI iterations:\", np.mean(vi_iters))\n", + "print(\"Mean PI runtime:\", np.mean(pi_times))\n", + "print(\"Mean VI runtime:\", np.mean(vi_times))\n" + ] + }, + { + "cell_type": "markdown", + "id": "c4e197e2-b0e4-4d8c-b5ab-8028385c4cd3", + "metadata": {}, + "source": [ + "**Exercise 17.** (Asynchronous Value Iteration)\n", + "\n", + "Implement asynchronous value iteration, where the value function is updated in place:\n", + "$$\n", + "V(s) \\leftarrow \\max_a\\left\\{R(s)+\\gamma \\sum_{s'}P(s'|s,a)V(s')\\right\\}.\n", + "$$\n", + "\n", + "Compare the number of iterations needed for convergence with the synchronous version.\n", + "\n", + "-------------------\n", + "\n", + "*Hint.* Synchronous value iteration uses a copy `V_new` and updates all states from the old `V`. 
Asynchronous value iteration updates `V[s]` immediately, so later states in the same sweep can use the newest values.\n", + "\n", + "-------------------\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "0b3469a6", + "metadata": {}, + "outputs": [], + "source": [ + "def asynchronous_value_iteration(\n", + "    P: np.ndarray,\n", + "    R: np.ndarray,\n", + "    gamma: float,\n", + "    theta: float = 1e-6,\n", + "    max_iter: int = 200_000,\n", + ") -> tuple[np.ndarray, np.ndarray, int]:\n", + "    \"\"\"Asynchronous (in-place) value iteration: updates V[s] immediately inside the loop.\"\"\"\n", + "    n_states = len(R)\n", + "    n_actions = len(P)\n", + "    V = np.zeros(n_states)\n", + "\n", + "    for _it in range(max_iter):\n", + "        delta = 0\n", + "        for s in range(n_states):\n", + "            if is_terminal(s):\n", + "                V[s] = R[s] / (1 - gamma)\n", + "                continue\n", + "\n", + "            Q_values = np.zeros(n_actions)\n", + "            for a in range(n_actions):\n", + "                Q_values[a] = R[s] + gamma * np.sum(P[a, s, :] * V)\n", + "            v_new = np.max(Q_values)\n", + "\n", + "            delta = max(delta, abs(v_new - V[s]))\n", + "            V[s] = v_new\n", + "\n", + "        if delta < theta:\n", + "            break\n", + "\n", + "    pi = policy_improvement(V, P, R, gamma)\n", + "\n", + "    return (\n", + "        V,\n", + "        pi,\n", + "        _it + 1,\n", + "    )  # final value function, greedy policy from V, and number of sweeps performed (_it is 0-based)\n" + ] + }, + { + "cell_type": "markdown", + "id": "48ae870c-1f18-4f83-8b6d-1ae393d08de3", + "metadata": {}, + "source": [ + "**Exercise 18.** (Bellman Optimality Operator)\n", + "\n", + "Show numerically that the Bellman optimality operator $\\mathcal{T}^*$ satisfies the contraction property, that is, for arbitrary value functions $V$ and $W$,\n", + "$$\n", + "\\big\\Vert \\mathcal{T}^* V - \\mathcal{T}^* W \\big\\Vert_{\\infty}\\leq \\gamma \\Vert V-W\\Vert_{\\infty}.\n", + "$$\n", + "\n", + "---------------\n", + "\n", + "*Hint.* Generate random value functions $V$ and $W$, apply one Bellman 
optimality update to each, and compare both sides of the inequality.\n", + "\n", + "---------------" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "b64a1e78", + "metadata": {}, + "outputs": [], + "source": [ + "def T_opt(V: np.ndarray, P: np.ndarray, R: np.ndarray, gamma: float) -> np.ndarray:\n", + "    \"\"\"Bellman optimality operator in the maze: (T* V)(s) = max_a [ R[s] + gamma * sum_{s'} P[a,s,s'] V[s'] ].\"\"\"\n", + "    n_states = len(R)\n", + "    n_actions = len(P)\n", + "    V_new = np.zeros_like(V)\n", + "\n", + "    for s in range(n_states):\n", + "        if is_terminal(s):\n", + "            V_new[s] = R[s] / (1 - gamma)\n", + "            continue\n", + "\n", + "        Q_values = np.zeros(n_actions)\n", + "        for a in range(n_actions):\n", + "            Q_values[a] = R[s] + gamma * np.sum(P[a, s, :] * V)\n", + "        V_new[s] = np.max(Q_values)\n", + "\n", + "    return V_new\n" + ] + }, + { + "cell_type": "markdown", + "id": "d58d20fd-509a-41e4-bde1-eca6c9da6ddc", + "metadata": {}, + "source": [ + "**Exercise 19** (Effect of the Discount Factor)\n", + "\n", + "Recall that the discount factor controls how future rewards are weighted relative to immediate rewards. Run value iteration for different values of the discount factor $\\gamma\\in\\{0.2, 0.5, 0.9, 0.99\\}$. \n", + "\n", + "For each value of $\\gamma$: \n", + "\n", + "1. Compute the optimal value function $V^*$.\n", + "2. Compute the corresponding optimal policy.\n", + "3. Plot the value function and visualize the policy on the maze.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "47219cdc-b99b-4b73-a3d1-c133afb0e215", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Max difference between V_opt and V_vi: 18.749999999999982\n", + "Asynchronous VI iterations for gamma = 0.2 : 9\n", + "Max difference between V_async and V_vi: 18.749999999999982\n", + "Same policy between async VI and VI? 
True\n", + "Max difference between V_opt and V_vi: 17.999999999999982\n", + "Asynchronous VI iterations for gamma = 0.5 : 14\n", + "Max difference between V_async and V_vi: 17.999999999999982\n", + "Same policy between async VI and VI? True\n", + "Max difference between V_opt and V_vi: 9.99999999999998\n", + "Asynchronous VI iterations for gamma = 0.9 : 21\n", + "Max difference between V_async and V_vi: 9.99999999999998\n", + "Same policy between async VI and VI? True\n", + "Max difference between V_opt and V_vi: 79.99999999999993\n", + "Asynchronous VI iterations for gamma = 0.99 : 25\n", + "Max difference between V_async and V_vi: 79.99999999999993\n", + "Same policy between async VI and VI? True\n" + ] + } + ], + "source": [ + "for gamma in [0.2, 0.5, 0.9, 0.99]:\n", + "    V_opt = T_opt(V_vi, P, R, gamma)\n", + "    print(\"Max difference between V_opt and V_vi:\", np.max(np.abs(V_opt - V_vi)))\n", + "\n", + "    V_async, pi_async, n_async = asynchronous_value_iteration(P, R, gamma, theta=theta)\n", + "    print(\"Asynchronous VI iterations for gamma =\", gamma, \":\", n_async)\n", + "\n", + "    print(\"Max difference between V_async and V_vi:\", np.max(np.abs(V_async - V_vi)))\n", + "    print(\"Same policy between async VI and VI?\", np.array_equal(pi_async, policy_vi))" + ] + }, + { + "cell_type": "markdown", + "id": "31083905-cc29-431e-9f87-6595e187e5d0", + "metadata": {}, + "source": [ + "**Exercise 20** (What we will learn in the next weeks)\n", + "\n", + "Assume now that the transition matrix $P$ is unknown.\n", + "\n", + "1. Which parts of policy iteration and value iteration can no longer be applied?\n", + "2. Which quantities would need to be learned from data?\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "e8a7738a-584d-43ae-ba2f-2598608b38fa", + "metadata": {}, + "source": [ + "**Exercise 21.** Try different configurations of the maze and compute an optimal policy." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "453188a8-bc26-463b-9784-be9c68328495", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "studies", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}