mirror of
https://github.com/ArthurDanjou/ArtStudies.git
synced 2026-01-24 01:51:52 +01:00
2437 lines
408 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "44b75d44",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Lab 3 - Maze Game as a Markov Decision Process Part 2\n",
|
||
"\n",
|
||
"## **1. Objectives**\n",
|
||
"\n",
|
||
"Last week in Lab 2, we \n",
|
||
"\n",
|
||
"- Modeled a simple **maze game** as a **Markov Decision Process (MDP)** by defining:\n",
|
"  - **States**\n",
"  - **Actions**\n",
"  - **Transition probabilities**\n",
"  - **Rewards**\n",
|
||
"\n",
|
||
"- Implemented **policy evaluation** to compute the value function of a given policy.\n",
|
||
"\n",
|
||
"We consider a **discounted MDP** with discount factor $\\gamma \\in (0,1)$.\n",
|
||
"\n",
|
||
"\n",
|
||
"This week, we will use **dynamic programming** to find **an optimal policy**.\n",
|
||
"\n",
|
||
"**<span style=\"color:red;\">Important: Lab 3 starts with Question 12. Questions 1–11 are already included in Lab 2.</span>**\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"id": "100d1e0d",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"import matplotlib.pyplot as plt\n",
|
||
"import numpy as np\n",
|
||
"\n",
|
||
"np.set_printoptions(precision=3, suppress=True)\n",
|
||
"# (not mandatory) This line is for limiting floats to 3 decimal places,\n",
|
||
"# avoiding scientific notation (like 1.23e-04) for small numbers.\n",
|
||
"\n",
|
||
"# For reproducibility\n",
|
||
"rng = np.random.default_rng(seed=42) # This line creates a random number generator.\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "1018deab",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 2. Maze definition and MDP formulation\n",
|
||
"\n",
|
||
"We consider a small 2D maze on a grid. The agent is a **robot** that moves on the grid.\n",
|
||
"\n",
|
||
"- `S` : start state\n",
|
||
"- `G` : goal state, with positive reward\n",
|
||
"- `#` : wall (not accessible)\n",
|
||
"- `.` : empty cell\n",
|
||
"- `X` : \"trap\" (negative reward)\n",
|
||
"\n",
|
||
"At each step, the robot can choose among 4 actions:\n",
|
||
"\n",
|
||
"$$\n",
|
||
"\\mathcal{A} = \\{\\text{Up} \\uparrow, \\quad \\text{Right} \\rightarrow, \\quad \\text{Down} \\downarrow, \\quad \\text{Left}\\leftarrow\\}.\n",
|
||
"$$\n",
|
||
"\n",
|
"The intended movement is deterministic, but we add a small probability of “error” to make the example more realistic:\n",
|
||
"- With probability $1 - p_{\\text{error}}$, it moves in the chosen direction.\n",
|
||
"- With probability $p_{\\text{error}}$, it moves in a random *other* direction.\n",
|
||
"- If the movement would hit a wall or go outside the grid, the agent stays in place.\n",
|
||
"\n",
|
||
"We will represent the MDP with:\n",
|
||
"\n",
|
"- A list of **states** $\\mathcal{S} = \\{0, \\dots, n_S - 1\\}$, each corresponding to a grid cell.\n",
|
||
"- For each action $a$, a transition matrix $P[a]$ of size $(n_S, n_S)$, where\n",
|
||
" $$\n",
|
||
" P[a][s, s'] = \\mathbb{P}(S_{t+1} = s' \\mid S_t = s, A_t = a).\n",
|
||
" $$\n",
|
||
"- A reward vector $R$ of length $n_S$, where $R[s]$ is the immediate reward obtained when **leaving** state $s$.\n",
|
||
"\n",
|
||
"We will use a discount factor $\\gamma = 0.95$.\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "ca4fa301-c14f-44ec-b04f-b01ca42d979a",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 2.1 Define the maze \n",
|
||
"\n",
|
||
"Let us now define the maze as follows."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"id": "f91cda05",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"maze_str = [\n",
|
"    \"#######\",\n",
"    \"S...#.#\",\n",
"    \"#.#...#\",\n",
"    \"#.#..##\",\n",
"    \"#..#..G\",\n",
"    \"#..X..#\",\n",
"    \"#######\",\n",
|
||
"]\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "99820cf4-292d-49ba-b662-f9f05f901f62",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 1.** Compute the dimensions of the maze (complete the “TO DO” parts):\n",
|
||
"- How many rows does the maze have?\n",
|
||
"- How many columns does the maze have?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"id": "24d7b74c-66c7-4615-b5e6-c2973a975fc9",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"7\n",
|
||
"7\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# Solution 1.\n",
|
||
"\n",
|
||
"n_rows = len(maze_str)\n",
|
||
"print(n_rows)\n",
|
||
"n_cols = len(maze_str[0])\n",
|
||
"print(n_cols)\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"id": "26c821d3-2362-4b60-8c77-3d09296d130d",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Maze:\n",
|
||
"#######\n",
|
||
"S...#.#\n",
|
||
"#.#...#\n",
|
||
"#.#..##\n",
|
||
"#..#..G\n",
|
||
"#..X..#\n",
|
||
"#######\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(\"Maze:\")\n",
|
||
"for row in maze_str:\n",
|
"    print(row)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "adc49d58-2730-41d8-96fb-ca7c9cb4fcdf",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 2.2 Map each walkable cell (not a wall '#') to a state index\n",
|
||
"\n",
|
||
"Now we convert the maze grid into state indices for the MDP.\n",
|
||
"\n",
|
||
"\n",
|
||
"The cells where the robot is allowed to stand are \n",
|
||
"\n",
|
||
"- . : empty space\n",
|
||
"\n",
|
||
"- S : start\n",
|
||
"\n",
|
||
"- G : goal\n",
|
||
"\n",
|
||
"- X : trap\n",
|
||
"\n",
|
||
"Everything else (i.e., #) is a wall and cannot be a state in the MDP.\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"id": "7116044b-c134-43de-9f30-01ab62325300",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"FREE = {\n",
|
"    \".\",\n",
"    \"S\",\n",
"    \"G\",\n",
"    \"X\",\n",
"}  # The set FREE contains the cell types that the agent is allowed to move into."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "1c9ad05e-9c6c-4e00-918c-44b858f45298",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Dictionaries to convert between grid and state index**\n",
|
||
"\n",
|
||
"We now want to identify all **valid states** of the maze (all non-wall cells). \n",
|
||
"To do this, we need two mappings:\n",
|
||
"\n",
|
||
"1. `state_to_pos[s] = (i, j)`: Given a state index $s$, return its grid coordinates (row, column).\n",
|
||
"2. `pos_to_state[(i, j)] = s`: Given coordinates (i, j), return the corresponding state index $s$.\n",
|
||
"\n",
|
||
"These two dictionaries allow easy conversion between **MDP state indices** and the **physical maze positions**. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"id": "a1258de4",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Number of states (non-wall cells): 22\n",
|
||
"Start state: 0 at (1, 0)\n",
|
||
"Goal states: [16] at (4, 6)\n",
|
||
"Trap states: [19] at (5, 3)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
"state_to_pos = {}  # s -> (i, j)\n",
"pos_to_state = {}  # (i, j) -> s\n",
"\n",
"start_state = None  # will store the state index of the start state\n",
"goal_states = []  # will store the state indices of the goal states\n",
"trap_states = []  # will store the state indices of the trap states\n",
"\n",
"s = 0\n",
"for i in range(n_rows):  # i = row index\n",
"    for j in range(n_cols):  # j = column index\n",
"        cell = maze_str[i][j]  # the character at that position (S, ., #, etc.)\n",
"\n",
"        if cell in FREE:\n",
"            # FREE contains: free cells \".\", start cell \"S\", goal cell \"G\" and trap cell \"X\"\n",
"            # Walls \"#\" are ignored: they are not MDP states.\n",
"            state_to_pos[s] = (i, j)\n",
"            pos_to_state[(i, j)] = s\n",
"\n",
"            if cell == \"S\":\n",
"                start_state = s\n",
"            elif cell == \"G\":\n",
"                goal_states.append(s)\n",
"            elif cell == \"X\":\n",
"                trap_states.append(s)\n",
"\n",
"            s += 1\n",
"\n",
"n_states = s\n",
"\n",
"print(\"Number of states (non-wall cells):\", n_states)\n",
"print(\"Start state:\", start_state, \"at\", state_to_pos[start_state])\n",
"print(\"Goal states:\", goal_states, \"at\", state_to_pos[goal_states[0]])\n",
"print(\"Trap states:\", trap_states, \"at\", state_to_pos[trap_states[0]])\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "721b968c-a355-46eb-aae4-5950441ba604",
|
||
"metadata": {},
|
||
"source": [
|
||
"*Hint.* If you don’t know what a dictionary is in Python, try the following code to help you understand."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"id": "68744dd6-7278-4c20-8b82-34212685352f",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"value2\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"my_dict = {\"key1\": \"value1\", \"key2\": \"value2\"}\n",
|
||
"print(my_dict[\"key2\"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "0c76f4e1-b0ba-49c5-b9d5-cfb523024ba9",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 2.** Read the program above and answer the following questions:\n",
|
||
"1. What is the purpose of state_to_pos and pos_to_state?\n",
|
||
"2. Why do we only assign states to cells in FREE?\n",
|
||
"3. What would happen if the maze had multiple goal cells?\n",
|
||
"4. What is the total number of states (n_states) in this maze? Does this match the number of non-wall cells you can count visually?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "d45828e3-43be-4318-a14c-1242d3a0dcbc",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Solution 2.**\n",
|
||
"1. `state_to_pos` maps: \n",
|
||
"$$\n",
|
||
"\\text{state index} \\quad s\\quad \\rightarrow \\quad \\text{grid position} \\quad (i, j)\n",
|
||
"$$\n",
|
||
"\n",
|
||
"`pos_to_state` maps: \n",
|
||
"$$\n",
|
||
" \\text{grid position} \\quad (i, j) \\quad\\rightarrow \\quad \\text{state index} \\quad s\n",
|
||
"$$\n",
|
||
"\n",
|
||
"We need both because:\n",
|
||
"\n",
|
||
"- `state_to_pos` lets us visualize, display, or plot the value function on the grid. \n",
|
||
"- `pos_to_state` lets us convert a grid position into the correct MDP state index, useful when building transition probabilities.\n",
|
||
"\n",
|
||
"2. We only assign states to cells in `FREE = {'.', 'S', 'G', 'X'}` because only these cells are **walkable**. Wall cells (`'#'`) **cannot be entered** by the agent, so they are **not included as MDP states**.\n",
|
||
"\n",
|
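"3. If the maze had multiple `'G'` cells (several goal locations), each of them would be appended to the **list** `goal_states`, for example `goal_states = [5, 12, 23]`. Note that the final `print` only shows the position of the first goal.\n",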
|
||
"\n",
|
||
"4. 22 states. (Row 1: 5 free cells; Row 2: 4 free cells; Row 3: 3 free cells; Row 4: 5 free cells; Row 5: 5 free cells)\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "6d0fa298-7b7c-44fc-bbed-15ea002037c2",
|
||
"metadata": {},
|
||
"source": [
|
||
"-----\n",
|
||
"\n",
|
||
"The following function `plot_maze_with_states` creates a figure showing:\n",
|
||
"- the maze walls and free cells\n",
|
||
"- the state index for each non-wall cell\n",
|
||
"- special labels and colors for S (start state), G (goal state), and X (trap state). "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"id": "fc61ceef-217c-47f4-8eba-0353369210db",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"image/png": "",
|
||
"text/plain": [
|
||
"<Figure size 640x480 with 1 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
"def plot_maze_with_states() -> None:\n",
"    \"\"\"Plot the maze with state indices.\"\"\"\n",
"    # Start with a matrix of ones. Here 1 means “free cell”.\n",
"    grid = np.ones((n_rows, n_cols))\n",
"    for i in range(n_rows):\n",
"        for j in range(n_cols):\n",
"            if maze_str[i][j] == \"#\":\n",
"                grid[i, j] = 0  # We replace walls (#) with 0\n",
"\n",
"    _fig, ax = plt.subplots()\n",
"    ax.imshow(grid, cmap=\"gray\", alpha=0.7)\n",
"\n",
"    # Plot state indices\n",
"    for s, (i, j) in state_to_pos.items():\n",
"        cell = maze_str[i][j]\n",
"\n",
"        if cell == \"S\":\n",
"            label = f\"S\\n{s}\"\n",
"            color = \"green\"\n",
"        elif cell == \"G\":\n",
"            label = f\"G\\n{s}\"\n",
"            color = \"blue\"\n",
"        elif cell == \"X\":\n",
"            label = f\"X\\n{s}\"\n",
"            color = \"red\"\n",
"        else:\n",
"            label = str(s)\n",
"            color = \"black\"\n",
"\n",
"        # Note: ax.text(x, y, ...) expects (column, row), hence (j, i).\n",
"        ax.text(\n",
"            j,\n",
"            i,\n",
"            label,\n",
"            ha=\"center\",\n",
"            va=\"center\",\n",
"            fontsize=10,\n",
"            fontweight=\"bold\",\n",
"            color=color,\n",
"        )\n",
"\n",
"    ax.set_xticks([])  # remove the axis ticks, which we do not need\n",
"    ax.set_yticks([])\n",
"    ax.set_title(\"Maze with state indices\")\n",
"\n",
"    plt.show()\n",
"\n",
"\n",
"plot_maze_with_states()\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "db078d86",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 2.4 Actions and deterministic movement"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "96e7f1f2-9d73-410b-853d-e39f40dfb5da",
|
||
"metadata": {},
|
||
"source": [
|
||
"We first define integer codes for each action. \n",
|
||
"\n",
|
||
"**Exercise 3.** How many possible actions can the agent take in the maze?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "22259ab4-527e-4d7c-bb30-98fb240da6d5",
|
||
"metadata": {},
|
||
"source": [
|
||
"We have four possible actions in the maze. \n",
|
||
"\n",
|
"In the following cell, each action is mapped to an integer (0, 1, 2, 3). This makes it easy to store and use actions inside arrays and matrices.\n",
"\n",
"Here we use Unicode arrow characters:\n",
|
||
"\n",
|
||
"- \"\\u2191\" : ↑ (up arrow)\n",
|
||
"\n",
|
||
"- \"\\u2192\" : → (right arrow)\n",
|
||
"\n",
|
||
"- \"\\u2193\" : ↓ (down arrow)\n",
|
||
"\n",
|
||
"- \"\\u2190\" : ← (left arrow)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 9,
|
||
"id": "f7f0b8e4-1f48-4d03-9e5f-a47e59c3e827",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"A_UP, A_RIGHT, A_DOWN, A_LEFT = 0, 1, 2, 3\n",
|
||
"ACTIONS = [A_UP, A_RIGHT, A_DOWN, A_LEFT]\n",
|
||
"action_names = {A_UP: \"\\u2191\", A_RIGHT: \"\\u2192\", A_DOWN: \"\\u2193\", A_LEFT: \"\\u2190\"}"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 10,
|
||
"id": "3773781c-a0cd-48db-967b-d4b432d17046",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"↑\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(action_names[0])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "4b957f5a-ee39-4437-abc1-4809105ad83c",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 4.** Now we define a **deterministic movement function** `move_deterministic(i, j, a)`. \n",
|
||
"\n",
|
||
"This function simulates the robot trying to move from (i, j) in direction a.\n",
|
||
"\n",
|
||
"But if the movement hits a wall or boundary, the agent stays in place.\n",
|
||
"\n",
|
||
"**Complete the `# !!TO DO HERE !!` part in the program below.**"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 11,
|
||
"id": "4b06da5e-bc63-48e5-a336-37bce952443d",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
"def move_deterministic(i: int, j: int, a: int) -> tuple[int, int]:\n",
"    \"\"\"Deterministic movement on the grid. If the movement hits a wall or boundary, the agent stays in place.\n",
"\n",
"    Args:\n",
"        i (int): current row index\n",
"        j (int): current column index\n",
"        a (int): action to take (A_UP, A_DOWN, A_LEFT, A_RIGHT)\n",
"\n",
"    Returns:\n",
"        (tuple[int, int]): new (row, column) position after taking action a\n",
"\n",
"    \"\"\"\n",
"    # Default: unless the action succeeds, the robot stays in place.\n",
"    candidate_i, candidate_j = i, j\n",
"\n",
"    # Now each action changes the coordinates of the robot:\n",
"    if a == A_UP:\n",
"        candidate_i, candidate_j = i - 1, j  # UP: row becomes row - 1\n",
"    elif a == A_DOWN:\n",
"        candidate_i, candidate_j = i + 1, j  # DOWN: row becomes row + 1\n",
"    elif a == A_LEFT:\n",
"        candidate_i, candidate_j = i, j - 1  # LEFT: column becomes column - 1\n",
"    elif a == A_RIGHT:\n",
"        candidate_i, candidate_j = i, j + 1  # RIGHT: column becomes column + 1\n",
"\n",
"    # Check boundaries\n",
"    if not (0 <= candidate_i < n_rows and 0 <= candidate_j < n_cols):\n",
"        # If the robot tries to move outside the maze,\n",
"        # it does not move and stays at (i, j).\n",
"        return i, j\n",
"\n",
"    # Check wall\n",
"    if maze_str[candidate_i][candidate_j] == \"#\":\n",
"        # If the next cell is a wall, the robot stays in place.\n",
"        return i, j\n",
"\n",
"    return candidate_i, candidate_j  # Otherwise, return the new position\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "c9e620e6",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 2.5 Transition probabilities and reward function"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "80bd2bca-7717-4b5f-bffa-76fe86a51d35",
|
||
"metadata": {},
|
||
"source": [
|
"Recall that we set the discount factor $\\gamma \\in(0,1)$: a reward received $k$ steps in the future is multiplied by $\\gamma^k$, so immediate rewards matter slightly more than future ones. \n",
"\n",
"\n",
"Moreover, we consider an error probability $p_{\\text{error}}$: with probability $p_{\\text{error}}$, the robot **does not** execute the intended action but moves in one of the 3 other directions (chosen uniformly). With probability $1-p_{\\text{error}}$, the robot executes the intended action."
|
||
]
|
||
},
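The error model above can be sketched as a probability vector over the action that is actually executed. This is a minimal illustration, not part of the lab code: the helper name `noisy_action_distribution` and its `n_actions` parameter are assumptions introduced here.

```python
import numpy as np


def noisy_action_distribution(
    intended: int, p_error: float, n_actions: int = 4
) -> np.ndarray:
    """Probability that each action is actually executed (hypothetical helper)."""
    # Every non-intended action shares the error mass uniformly.
    probs = np.full(n_actions, p_error / (n_actions - 1))
    # The intended action is executed with probability 1 - p_error.
    probs[intended] = 1.0 - p_error
    return probs


dist = noisy_action_distribution(intended=0, p_error=0.1)
print(dist, dist.sum())
```

With `p_error = 0.1`, the intended action gets probability 0.9 and each of the three other actions gets 0.1 / 3, so the vector sums to 1.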
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 12,
|
||
"id": "610253e7-f3f7-4a30-be3e-2ec5a1e2ed04",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"gamma = 0.95\n",
|
||
"p_error = 0.1 # probability of the error to a random other direction\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "0d1ceff8-86e0-4c45-83d3-af9fae974608",
|
||
"metadata": {},
|
||
"source": [
|
"Now we initialize the state-transition probabilities, that is, the probability of reaching the next state $s'$ after taking action $a$ in state $s$: \n",
|
||
"$$\n",
|
||
" p(s' \\mid s, a)\n",
|
||
" = \\mathbb{P} \\big[S_t=s'\\,|\\, S_{t-1}=s, \\,A_{t-1}=a\\big]\n",
|
||
"$$\n",
|
||
"\n",
|
||
"We store these transition probabilities in the 3D array `P` (`P[a][s, s_next]`), which has shape `(n_actions, n_states, n_states)`:\n",
|
||
"\n",
|
||
"`P[a, s, s_next] = P(S_{t+1} = s_next | S_t = s, A_t = a)`.\n",
|
||
"\n",
|
||
"We also initialize the reward vector `R`, which has length `n_states`, where `R[s]` is the reward received when the agent is in state `s`.\n",
|
||
"\n",
|
||
"In this maze game, we assume that the reward depends only on the current state, which is natural: in navigation tasks, being in a particular location is what matters, not the direction you used to reach it."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 13,
|
||
"id": "7a51f242-fe4e-4e74-8a1f-a8df32b194b8",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Initialize transition matrices and reward vector\n",
|
||
"P = np.zeros((len(ACTIONS), n_states, n_states))\n",
|
||
"R = np.zeros(n_states)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "c08f4af5-a2a7-4baa-b5da-c7ce636d8a4a",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now we assign the reward to each state. \n",
|
||
"\n",
|
||
"For each state index s:\n",
|
||
"\n",
|
||
"1. If s is a goal, then the reward = +1.0\n",
|
||
"2. If s is a trap, then the reward = −1.0\n",
|
||
"3. Otherwise for the normal cell, the reward = −0.01 every time you leave this cell.\n",
|
||
"\n",
|
"Recall that a reward is received at the moment the agent executes an action: each time the agent leaves a normal cell, it receives the reward −0.01. "
|
||
]
|
||
},
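As an alternative to an explicit loop, the three reward rules above can also be written with NumPy indexing. This is only a sketch on a toy 6-state example: the names `n_states_demo`, `goal_demo`, `trap_demo` and `R_demo` are hypothetical and are not the notebook's variables.

```python
import numpy as np

# Toy setting (assumed for illustration): 6 states, state 4 is a goal, state 5 is a trap.
n_states_demo = 6
goal_demo = [4]
trap_demo = [5]

# Start from the step penalty everywhere, then overwrite the special states.
R_demo = np.full(n_states_demo, -0.01)
R_demo[goal_demo] = 1.0
R_demo[trap_demo] = -1.0

print(R_demo)
```

The order matters: the default value is assigned first, and the goal and trap entries are overwritten afterwards.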
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 14,
|
||
"id": "49d54d1f-dc29-45b6-ad31-ad0e848f920d",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Set rewards for each state\n",
|
||
"step_penalty = -0.01\n",
|
||
"goal_reward = 1.0\n",
|
||
"trap_reward = -1.0"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "dd571ec8-c36a-4e20-bec6-9e6458dc622b",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 5.** Why do we set the step penalty to -0.01 in this MDP?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "00c51189-3ff0-4a5e-ad52-92747b971e16",
|
||
"metadata": {},
|
||
"source": [
|
"**Solution 5.** A small negative reward on every step makes longer paths accumulate more penalty, which encourages the agent to reach the goal quickly.\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "07bfb065-b1af-4df1-885e-780fe250f2fb",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 6.** We now define the reward vector. Recall that we have already initialized\n",
|
||
"`R = np.zeros(n_states)`.\n",
|
||
"If a state belongs to `goal_states`, we assign the `goal_reward`.\n",
|
||
"If it belongs to `trap_states`, we assign the `trap_reward`.\n",
|
||
"Otherwise, we assign the `step_penalty`. \n",
|
||
"\n",
|
||
"**Complete the `# TO DO` part in the program below.** "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 15,
|
||
"id": "c70885b4-a301-42f2-ab70-2901d941cde7",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
"for s in range(n_states):\n",
"    if s in goal_states:\n",
"        R[s] = goal_reward\n",
"    elif s in trap_states:\n",
"        R[s] = trap_reward\n",
"    else:\n",
"        R[s] = step_penalty"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "b90fb80c-9452-48a2-889f-286703c2ae93",
|
||
"metadata": {},
|
||
"source": [
|
"Now we define terminal states and a helper function. Here `terminal_states` is a set containing all absorbing states (the goal and trap cells): once the agent reaches one of them, the episode is considered finished. \n",
|
||
"\n",
|
||
"Moreover, `is_terminal(s)` is a small helper to check if a state is terminal."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 16,
|
||
"id": "eca4c571-39c7-468b-af86-0bab9489415e",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"terminal_states = set(goal_states + trap_states)\n",
|
||
"\n",
|
||
"\n",
|
"def is_terminal(s: int) -> bool:\n",
"    \"\"\"Check if a state is terminal.\"\"\"\n",
"    return s in terminal_states\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "3a9a1d54-8339-402b-84e9-105961ed78d7",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now we need to fill the transition matrices `P[a][s, s_next]`. \n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "d9cfd15c-12cc-48bb-bd88-07f3ae3db31c",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 7.** **Complete the `# TO DO` part in the program below** to fill the transition matrices `P[a][s, s_next]`. (There are only 2 # TO DO here)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 17,
|
||
"id": "2d03276b-e206-4d1f-9024-f6948ca61523",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
"for s in range(n_states):  # We loop over all states s.\n",
"    i, j = state_to_pos[s]  # Coordinates (i, j) of state s in the maze.\n",
"\n",
"    # First, in a goal or trap state,\n",
"    # no matter which action you “choose”, you stay in the same state with probability 1.\n",
"    # This makes the terminal states absorbing states.\n",
"    if is_terminal(s):\n",
"        # Terminal states: stay forever\n",
"        for a in ACTIONS:\n",
"            P[a, s, s] = 1.0  # a probability, not a reward\n",
"        continue\n",
"\n",
"    # If the state is non-terminal, we define the stochastic movement.\n",
"    # For a given state s and intended action a,\n",
"    # with probability 1 - p_error, the robot moves in direction a;\n",
"    # with probability p_error, the robot moves in one of the other 3 directions, each with probability p_error / 3.\n",
"    for a in ACTIONS:\n",
"        # main action (intended action)\n",
"        main_i, main_j = move_deterministic(i, j, a)\n",
"        s_main = pos_to_state[(main_i, main_j)]  # state index of that next cell\n",
"        P[a, s, s_main] += 1 - p_error  # add probability 1 - p_error\n",
"\n",
"        # error actions\n",
"        other_actions = [a2 for a2 in ACTIONS if a2 != a]  # the 3 actions different from a\n",
"        for a2 in other_actions:  # for each error action,\n",
"            error_i, error_j = move_deterministic(i, j, a2)\n",
"            s_error = pos_to_state[(error_i, error_j)]  # get its state index s_error\n",
"            P[a, s, s_error] += p_error / len(other_actions)  # add p_error / 3\n",
"# So for each (s, a), the probabilities over all s_next sum to 1.\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "7841b264-af00-4322-b728-adcffac0ef89",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now we check if the transition matrices `P[a][s, s_next]` are computed correctly.\n",
|
||
"For each action `a`, we sum the transition probabilities over all possible next states `s_next` and verify that these sums are equal to 1.\n",
|
||
"\n",
|
||
"This is because the matrix `P[a, s, s_next]` stores the transition probability\n",
|
||
"\n",
|
||
"$\\mathbb{P} \\big[S_t=s_{\\text{next}}\\,|\\, S_{t-1}=s, \\,A_{t-1}=a\\big]$. \n",
|
||
"\n",
|
||
"Therefore, for each action $a$, and for each state $s$, the sum over $s_{\\text{next}}$ of $\\mathbb{P} \\big[S_t=s_{\\text{next}}\\,|\\, S_{t-1}=s, \\,A_{t-1}=a\\big]$ should be 1. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 18,
|
||
"id": "341fe630-8f87-4773-84ad-92d3516e53e2",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Action ↑: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n",
|
||
"Action →: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n",
|
||
"Action ↓: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n",
|
||
"Action ←: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
"for a in ACTIONS:\n",
"    # For each action a:\n",
"    # P[a] is a matrix of shape (n_states, n_states).\n",
"    # P[a].sum(axis=1) sums over next states s_next, giving for each state s\n",
"    # the total probability of all transitions out of s.\n",
"    # We print these row sums.\n",
"    # If everything is correct, they should be very close to 1.\n",
"\n",
"    probs = P[a].sum(axis=1)\n",
"    print(f\"Action {action_names[a]}:\", probs)\n"
|
||
]
|
||
},
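Reading 88 printed row sums by eye is error-prone, so the same check can be automated. The sketch below is an assumption-laden illustration on a toy array: `is_row_stochastic` and `P_demo` are hypothetical names, and the tolerance is an arbitrary choice.

```python
import numpy as np


def is_row_stochastic(P: np.ndarray, atol: float = 1e-12) -> bool:
    """Check that every P[a, s, :] is a probability distribution (hypothetical helper)."""
    non_negative = np.all(P >= 0)
    # Sum over the last axis (next states) and compare every row sum with 1.
    rows_sum_to_one = np.allclose(P.sum(axis=-1), 1.0, atol=atol)
    return bool(non_negative and rows_sum_to_one)


# Toy example: 2 actions, 2 states, with state 1 absorbing under both actions.
P_demo = np.array(
    [
        [[0.9, 0.1], [0.0, 1.0]],
        [[0.5, 0.5], [0.0, 1.0]],
    ]
)
print(is_row_stochastic(P_demo))
```

A check of this kind also catches negative entries, which the printed row sums alone would not reveal.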
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "46d23991",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 3. Policy evaluation\n",
|
||
"\n",
|
||
"### 3.1 Bellman expectation equation"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "305b047c-e83b-4f42-b64e-e2050d5deeff",
|
||
"metadata": {},
|
||
"source": [
|
||
"Recall that the value function under a policy $\\pi$ is defined as:\n",
|
||
"$$\n",
|
||
"V^{\\pi}(s)=\\mathbb{E}\\Big[\\:G_t \\:\\Big|\\: S_t=s\\:\\Big]\n",
|
||
"$$\n",
|
||
"where the return $G_t$ is\n",
|
||
"$$\n",
|
||
"G_t=R_t +\\gamma R_{t+1}+\\gamma^2 R_{t+2}+... . \n",
|
||
"$$\n",
|
"This means that *the value of a state is the expected discounted sum of all future rewards\n",
"when following policy $\\pi$.*\n",
|
||
"\n",
|
||
"We know that $G_t=R_t+\\gamma G_{t+1}$, and plugging this equation into the definition of $V^{\\pi}(s)$, we get \n",
|
||
"$$\n",
|
||
"V^{\\pi}(s)=\\mathbb{E}\\Big[\\:R_t \\:\\Big|\\: S_t=s\\:\\Big]+\\gamma\\mathbb{E}\\Big[\\:G_{t+1} \\:\\Big|\\: S_t=s\\:\\Big]. \n",
|
||
"$$\n",
|
"This step simply says: *total future reward = immediate reward + discounted future reward from the next state*."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "88ea8d56-3b62-4690-9ff7-469e43726fbc",
|
||
"metadata": {},
|
||
"source": [
|
||
"For the expected immediate reward part $\\mathbb{E}[R_t| S_t=s]$, as we are in a maze problem, the reward depends only on the current state, not the time step, i.e., $\\mathbb{E}[R_t| S_t=s]=R(s)$. Hence we get \n",
|
||
"$$\n",
|
||
"V^{\\pi}(s)=R(s)+\\gamma\\mathbb{E}\\Big[\\:G_{t+1} \\:\\Big|\\: S_t=s\\:\\Big]. \n",
|
||
"$$\n",
|
||
"\n",
|
||
"Moreover, in this maze problem, we consider a deterministic policy $A_t=\\pi(s)$ (the action depends only on the state). Therefore, \n",
|
||
"$$\n",
|
"V^{\\pi}(s)=R(s)+\\gamma\\mathbb{E}\\Big[\\:G_{t+1} \\:\\Big|\\: S_t=s, A_t=\\pi(s)\\:\\Big]. \n",
|
||
"$$\n",
|
||
"\n",
|
||
"Now **given the state $S_t=s$ and $A_t=a$**, the next state is random (because of the error probability) and we know the transition probability \n",
|
||
"$$\n",
|
||
"\\mathbb{P}\\big(\\:S_{t+1}=s' \\:|\\:S_t=s, \\, A_t=a\\big)=P\\big(s'\\:\\big|\\:s, a\\big). \n",
|
||
"$$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "c25e255d-8f58-4eaf-9485-cee6ab3bea6c",
|
||
"metadata": {},
|
||
"source": [
|
||
"Therefore,\n",
|
"$$\n",
"\\begin{aligned}\n",
"\\mathbb{E}\\big[\\,G_{t+1}\\,|\\,S_t=s,A_t=a\\,\\big] &=\\sum_{s'}\\mathbb{E}\\big[\\,G_{t+1}\\,|\\,S_{t+1}=s'\\,\\big]\\times \\mathbb{P}\\big[S_{t+1}=s'\\,\\big|\\,S_t=s, A_t=a\\, \\big] \\\\\n",
"&=\\sum_{s'}V^{\\pi}(s')P\\big(s'\\:\\big|\\:s, a\\big),\n",
"\\end{aligned}\n",
"$$\n",
"where we use the Markov property. (**Question: Can you show the detailed computations here?**)"
|
||
]
|
||
},
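One possible way to fill in the computation, sketched under the assumption that the policy $\pi$ is stationary (the same rule is applied at every time step): first condition on the next state with the law of total expectation, then drop the conditioning on $(S_t, A_t)$ using the Markov property.

$$
\begin{aligned}
\mathbb{E}\big[\,G_{t+1}\,|\,S_t=s,A_t=a\,\big]
&=\sum_{s'}\mathbb{P}\big[S_{t+1}=s'\,\big|\,S_t=s, A_t=a\,\big]\;\mathbb{E}\big[\,G_{t+1}\,|\,S_{t+1}=s',\,S_t=s,\,A_t=a\,\big] \\
&=\sum_{s'}P\big(s'\:\big|\:s, a\big)\;\mathbb{E}\big[\,G_{t+1}\,|\,S_{t+1}=s'\,\big] \\
&=\sum_{s'}P\big(s'\:\big|\:s, a\big)\,V^{\pi}(s').
\end{aligned}
$$

The second line holds because, given $S_{t+1}=s'$, the future rewards do not depend on how the agent arrived in $s'$. The third line is the definition of $V^{\pi}$ applied at time $t+1$.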
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "9a2b6cff-e848-44a2-b504-973067b367b3",
|
||
"metadata": {},
|
||
"source": [
|
||
"In conclusion, we have (the Bellman expectation equation)\n",
|
||
"$$\n",
|
||
"V^{\\pi}(s)=R(s)+\\gamma \\sum_{s'}P\\big(\\,s'\\,\\big|\\,s, \\pi(s)\\,\\big)V^{\\pi}(s').\n",
|
||
"$$"
|
||
]
|
||
},
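For a fixed policy, the Bellman expectation equation is linear in $V^{\pi}$, so on a small problem it can be solved directly as $(I - \gamma P_\pi)V = R$ and compared with repeated application of the update. The sketch below uses a toy 2-state chain, not the maze: `P_pi`, `R_demo`, `gamma_demo` and the numbers in them are assumptions chosen for illustration.

```python
import numpy as np

gamma_demo = 0.9

# Toy chain under a fixed policy (assumed values):
# from state 0 the agent stays with prob. 0.2 and reaches state 1 with prob. 0.8;
# state 1 is absorbing.
P_pi = np.array([[0.2, 0.8], [0.0, 1.0]])
R_demo = np.array([-0.01, 1.0])

# Direct solution of the linear system (I - gamma * P_pi) V = R.
V_exact = np.linalg.solve(np.eye(2) - gamma_demo * P_pi, R_demo)

# Iterative solution: apply V <- R + gamma * P_pi V starting from zero.
V_iter = np.zeros(2)
for _ in range(1000):
    V_iter = R_demo + gamma_demo * P_pi @ V_iter

print(V_exact, V_iter)
```

Both approaches should agree up to numerical precision, because the update is a contraction with factor `gamma_demo`. The direct solve is convenient for small state spaces, while the iterative form is the one that generalizes to the dynamic programming methods of this lab.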
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "15049fdb-f3af-4f78-b556-817284260ed0",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 3.2 Define a function which computes the value function $V^{\\pi}(s)$ for a given deterministic policy. \n",
|
||
"\n",
|
||
"\n",
|
||
"**Exercise $8^*$.** Now we define `policy_evaluation(...)`, which computes the value function $V^{\\pi}(s)$ for a given deterministic policy. \n",
|
||
"\n",
|
"The inputs of `policy_evaluation(...)` are:\n",
"1. `policy`: an array of size `n_states`; each entry is an action in {0, 1, 2, 3}, corresponding to UP, RIGHT, DOWN, LEFT.\n",
"2. `P`: the transition probabilities `P[a, s, s']`.\n",
"3. `R`: the reward vector `R[s]`.\n",
"4. `gamma`: the discount factor $\\gamma\\in(0,1)$.\n",
"5. `theta`: the convergence threshold.\n",
"6. `max_iter`: the maximum number of iterations, used to avoid infinite loops.\n",
|
||
"\n",
|
||
"How can we apply the Bellman expectation equation\n",
|
||
"$$\n",
|
||
"V^{\\pi}(s)=R(s)+\\gamma \\sum_{s'}P\\big(\\,s'\\,\\big|\\,s, \\pi(s)\\,\\big)V^{\\pi}(s').\n",
|
||
"$$\n",
|
||
"here ?\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "20ef113f-0872-46e1-95ab-3cf5016f5a14",
|
||
"metadata": {},
|
||
"source": [
|
||
"We start with an initial guess of $V^{\\pi}$(e.g., all values = 0) and repeatedly apply the Bellman equation to update each state:\n",
|
||
"$$\n",
|
||
"V_{k+1}^\\pi(s) \\leftarrow R(s)+\\gamma \\sum_{s'}P\\big(\\,s'\\,\\big|\\,s, \\pi(s)\\,\\big)V^{\\pi}_k(s').\n",
|
||
"$$\n",
|
||
"until values converge.\n",
|
||
"\n",
|
||
"**Complete the `# TO DO HERE` part in the program below** "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 19,
|
||
"id": "3a05f8bc-2b8f-4a4c-9931-6d28c3b0db35",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def policy_evaluation( # noqa: PLR0913\n",
|
||
" policy: np.ndarray,\n",
|
||
" P: np.ndarray,\n",
|
||
" R: np.ndarray,\n",
|
||
" gamma: float,\n",
|
||
" theta: float = 1e-6,\n",
|
||
" max_iter: int = 10_000,\n",
|
||
") -> np.ndarray:\n",
|
||
" \"\"\"Evaluate a deterministic policy for the given MDP.\n",
|
||
"\n",
|
||
" Args:\n",
|
||
" policy: array of shape (n_states,), with values in {0,1,2,3}\n",
|
||
" P: array of shape (n_actions, n_states, n_states)\n",
|
||
" R: array of shape (n_states,)\n",
|
||
" gamma: discount factor\n",
|
||
" theta: convergence threshold\n",
|
||
" max_iter: maximum number of iterations\n",
|
||
"\n",
|
||
" \"\"\"\n",
|
||
" n_states = len(R) # get the number of states\n",
|
||
" V = np.zeros(n_states) # initialize the value function\n",
|
||
"\n",
|
||
" for _it in range(max_iter): # Main iterative loop\n",
|
||
" V_new = np.zeros_like(\n",
|
||
" V,\n",
|
||
" ) # Create a new value vector and we will compute an updated value for each state.\n",
|
||
"\n",
|
||
" # Now we update each state using the Bellman expectation equation\n",
|
||
" for s in range(n_states):\n",
|
||
" a = policy[s] # Extract the action chosen by the policy in state\n",
|
||
" V_new[s] = R[s] + gamma * np.sum(P[a, s, :] * V)\n",
|
||
"\n",
|
||
" delta = np.max(\n",
|
||
" np.abs(V_new - V),\n",
|
||
" ) # This measures how much the value function changed in this iteration:\n",
|
||
" # If delta is small, the values start to converge; otherwise, we need to keep iterating.\n",
|
||
" V = V_new # Update V, i.e. Set the new values for the next iteration.\n",
|
||
"\n",
|
||
" if delta < theta: # Check convergence: When changes are tiny, we stop.\n",
|
||
" break\n",
|
||
"\n",
|
||
" return V # Return the final value function, this is our estimate for V^{pi}(s), s in the state set.\n"
|
||
]
|
||
},
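{
"cell_type": "markdown",
"id": "c3d8e5f1-vectorized-sketch",
"metadata": {},
"source": [
"The inner loop over states in `policy_evaluation` can also be written as one matrix-vector product per sweep. The cell below is only a sketch and not part of the original lab: the name `policy_evaluation_vectorized` and the two-state toy MDP (`P_toy`, `R_toy`, `policy_toy`) are hypothetical, and it assumes that `P` has shape `(n_actions, n_states, n_states)` and that `policy` is an integer array, as above. In the toy example state 1 is terminal, so $V(1)=10$ and $V(0)=-1+0.9\\times 10=8$."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c3d8e5f2-vectorized-code",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def policy_evaluation_vectorized(policy, P, R, gamma, theta=1e-6, max_iter=10_000):\n",
"    \"\"\"Iterative policy evaluation, one matrix-vector product per sweep (sketch).\"\"\"\n",
"    n_states = len(R)\n",
"    # Row s of P_pi is P[policy[s], s, :], the transition law under the policy.\n",
"    P_pi = P[policy, np.arange(n_states), :]\n",
"    V = np.zeros(n_states)\n",
"    for _ in range(max_iter):\n",
"        V_new = R + gamma * P_pi @ V  # Bellman expectation update for all states\n",
"        if np.max(np.abs(V_new - V)) < theta:\n",
"            return V_new\n",
"        V = V_new\n",
"    return V\n",
"\n",
"\n",
"# Hypothetical toy MDP with 2 actions and 2 states; state 1 is terminal (zero rows).\n",
"P_toy = np.zeros((2, 2, 2))\n",
"P_toy[0, 0, 1] = 1.0  # action 0: state 0 -> state 1\n",
"P_toy[1, 0, 0] = 1.0  # action 1: stay in state 0\n",
"R_toy = np.array([-1.0, 10.0])\n",
"policy_toy = np.array([0, 0])\n",
"\n",
"V_toy = policy_evaluation_vectorized(policy_toy, P_toy, R_toy, gamma=0.9)\n",
"assert np.allclose(V_toy, [8.0, 10.0])\n",
"print(V_toy)"
]
},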
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "09ef3439",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 3.3 Evaluating a random policy"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "eecbca15-f89f-47bf-a13d-7d7c051699b8",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now we use the policy evaluation function `policy_evaluation` to evaluate a random policy. \n",
|
||
"\n",
|
||
"We first generate a `random_policy`, which is an array like [2, 0, 1, 3, 0, 2, ...] and has the size `n_states`. (Recall that the policy is a mapping from states to actions)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 20,
|
||
"id": "b4a44e38",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[0 3 2 1 1 3 0 2 0 0 2 3 2 3 2 3 2 0 3 1 2 1]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# Random policy: for each state, pick a random action\n",
|
||
"random_policy = rng.integers(low=0, high=len(ACTIONS), size=n_states)\n",
|
||
"\n",
|
||
"print(random_policy)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "3fe07992-ce82-4124-aebc-a6384d417f64",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now we call the function `policy_evaluation(...)` to compute $V^{\\pi_{\\text{random}}}(s)$."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 21,
|
||
"id": "c5f559b2-452a-477c-a1fa-258b40805670",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Value function under random policy:\n",
|
||
"[ -0.2 -0.2 -0.201 -0.204 -0.205 -0.202 -0.214 -0.429 -0.212\n",
|
||
" -0.207 -0.276 -0.459 -0.352 -0.366 -5.827 -4.605 20. -0.366\n",
|
||
" -0.999 -20. -6.4 -3.163]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"V_random = policy_evaluation(random_policy, P, R, gamma)\n",
|
||
"print(\"Value function under random policy:\")\n",
|
||
"print(V_random)"
|
||
]
|
||
},
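{
"cell_type": "markdown",
"id": "e9a4b7c1-direct-solve-sketch",
"metadata": {},
"source": [
"Since the Bellman expectation equation is linear in $V^{\\pi}$, the iterative result can be cross-checked against a direct solve. Writing $P_{\\pi}$ for the matrix whose row $s$ is $P(\\,\\cdot\\,|\\,s,\\pi(s))$, the equation reads $(I-\\gamma P_{\\pi})V^{\\pi}=R$. The cell below is a sketch under that assumption and not part of the original lab; `policy_evaluation_direct` and the toy arrays `P_toy`, `R_toy` are hypothetical names. On the maze, `policy_evaluation_direct(random_policy, P, R, gamma)` would be expected to agree with `V_random` up to a small numerical tolerance."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e9a4b7c2-direct-solve-code",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def policy_evaluation_direct(policy, P, R, gamma):\n",
"    \"\"\"Solve the linear system (I - gamma * P_pi) V = R for V (sketch).\"\"\"\n",
"    n_states = len(R)\n",
"    # Row s of P_pi is P[policy[s], s, :].\n",
"    P_pi = P[policy, np.arange(n_states), :]\n",
"    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)\n",
"\n",
"\n",
"# Hypothetical toy MDP with 2 actions and 2 states; state 1 is terminal (zero rows).\n",
"P_toy = np.zeros((2, 2, 2))\n",
"P_toy[0, 0, 1] = 1.0  # action 0: state 0 -> state 1\n",
"P_toy[1, 0, 0] = 1.0  # action 1: stay in state 0\n",
"R_toy = np.array([-1.0, 10.0])\n",
"\n",
"# Policy \"move to the terminal state\": V(0) = -1 + 0.9 * 10 = 8.\n",
"assert np.allclose(policy_evaluation_direct(np.array([0, 0]), P_toy, R_toy, 0.9), [8.0, 10.0])\n",
"# Policy \"stay forever\": V(0) = -1 / (1 - 0.9) = -10.\n",
"assert np.allclose(policy_evaluation_direct(np.array([1, 0]), P_toy, R_toy, 0.9), [-10.0, 10.0])"
]
},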
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "f46c70ba-2932-49af-b568-b5477260bc94",
|
||
"metadata": {},
|
||
"source": [
|
||
"Here in this value vector of the policy, \n",
|
||
"- If it is a negative values, then the agent tends to move around aimlessly, fall in traps, or take too long.\n",
|
||
"- It it is a higher values, then the agent is closer to the goal or more likely to reach it"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "1efcb076-467c-42d8-94e8-87453f688bbd",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now we define a function `plot_values`, which displays the value function $V(s)$ and displays it on the maze grid. It helps students visually understand:\n",
|
||
"- which states are good (high value, near the goal),\n",
|
||
"- which states are bad (low value, near traps),\n",
|
||
"- how a policy affects the long-term expected reward."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 22,
|
||
"id": "4c428327",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAdcAAAGbCAYAAACWHtrWAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjEsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvc2/+5QAAAAlwSFlzAAAPYQAAD2EBqD+naQAARjhJREFUeJzt3Qd4FGX+B/DvptcNoSQhEHqHUAUUUWnSBEQ9FRtgvb/lUBEBGyJFBBThlAPvTkVQBPQUEAREBFHpJfROIAFSSCC9kez8n/dddskmm7rvbrLL9+MzLjsz+868O5v5zdtmdJqmaSAiIiJl3NQlRURERAKDKxERkWIMrkRERIoxuBIRESnG4EpERKQYgysREZFiDK5ERESKMbgSEREpxuBKRESkGIOrg507dw46nQ6LFi2qku2vX78eHTt2hI+Pj9yPlJQUVEdi3yZPnlzVu1GtiN+M+F7Eb8hViWMu8lhYo0aNMHr06CrbJ6LKYHAtxbBhw+Dn54f09PQS13nsscfg5eWF5ORkVHdiHx966CH4+vpi/vz5WLJkCfz9/atsf37++WcGUCJySR5VvQPVmQicP/30E3788UeMHDmy2PKsrCysWrUKAwcORK1atVDd7d69W14oTJ06Ff369avq3ZHBVQR5awE2OzsbHh78eRJw4sQJuLmxHEDOhb/YMkqugYGBWLp0qdXlIrBmZmbKIOwMEhMT5WuNGjVQ3Ylq6+ocXMXzLsQFANmft7c3PD09q3o3iCqEwbUUovr0/vvvx6ZNm8yBqTARdEXwFUH4ypUrGDduHCIjIxEQEAC9Xo9BgwbhwIEDZW6nV69ecipKtDOJ9qbCDAYD5s6di7Zt28oAFBoair///e+4evVqmdsYNWqU/HfXrl1lu5apHaukNq2i+7Vlyxb5uRUrVmD69OmoX7++3Ie+ffvi9OnTxT6/c+dODB48GMHBwbL6uX379pg3b545b6LUKog0TVNpba779++X36n4bsV3LLa7Y8cOq+2Sf/31F8aOHYs6derIbd933324fPmyxbqpqak4fvy4fC2L+I6GDBmCDRs24JZbbpG/jc8++0wu+/LLL9GnTx+EhITIQNCmTRssWLCgxDT+/PNPdOvWTX53TZo0weLFi4ute+TIEZmm2I74nqdNmyaPvTX/+te/5O9BbDs8PBwvvvhisbZ0cRzbtWuHgwcP4q677pLNHc2aNcP3338vl//+++/o3r273F7Lli3x66+/lvmdmH4Py5cvx5tvvomwsDD5XYu/h9jY2GLrf/fdd+jSpYvcRu3atfH444/j4sWLZW7H2u9T5O/VV1+Vy0S+xXckapeSkpKQkZEh9+Pll18ultaFCxfg7u6OGTNmlLldIpuIR85RyX755RfxSD7tk08+sZifnJyseXp6aiNHjpTvd+/erTVt2lSbOHGi9tlnn2lTpkzR6tWrpwUFBWkXL140fy46Olqm9+WXX5rn3XXXXXIqatSoUVrDhg0t5j3zzDOah4eH9uyzz2oLFy7UJkyYoPn7+2tdu3bV8vLySs3Hc889J7ct9m3JkiXatm3b5DKxDbGtooru1+bNm+XnO3XqpHXp0kX7+OOPtcmTJ2t+fn5at27dim3Py8tLpv3uu+9qCxYs0MaMGaP169dPLhfbvvvuu2V6Yl9Mk4mYLz5ncvjwYZnPunXralOnTtU++OADrXHjxpq3t7e2Y8cO83riezXtY58+feRxe+211zR3d3ftoYcesthH07qFj0VJRD6aNWumBQcHy2MsvnvxfQjiux89erT8PsT2+vfvL9P99NNPi6XRsmVLLTQ0VHvzzTfl8s6dO2s6nU7mzyQuLk6rU6eO3Jb4fmfPnq01b95ca9++vUxX/IZMxHck5onvVWz7pZdeknkt+nsQxzE8PFyLiIjQXn/9dblumzZt5LrLli3TwsLC5Lbmzp1r/t2mpaWV+p2Yfg+RkZFy3+bMmSO/Gx8fH61FixZaVl
ZWse9a7Jf4nsR6vr6+WqNGjbSrV68Wy0/R763w7zM9PV1r166d3HfxdyB+W+I3IdLev3+/XOexxx6T33N+fr5FWrNmzZLf9/nz58s85kS2YHAtg/jjFCf02267zWK+OLmKk8CGDRvk+5ycHK2goMBiHXESFCd/EcxUBNc//vhDfvabb76xWG/9+vVW5xdlOsGJC4HCKhpcW7dureXm5prnz5s3T84/dOiQ+TsTgU+kW/jEKRgMBvO/X3zxxWIn0pKC6/Dhw2WwPnPmjHnepUuXtMDAQO3OO+8slkcRbApv69VXX5Un45SUlEoHV7Gu+K6LKhxETAYMGKA1adLEahpbt241z0tMTJS/EXEBYPLKK6/I9Xbu3Gmxngh4hYOrmCe+ExHMC//2RNAW633xxRfmeeI4inlLly41zzt+/Lic5+bmZnGBIn7T5fleTL8HEYwLB+IVK1bI+eJ3IYggHxISIgNidna2eb01a9bI9SZNmlSh4CrWF+v88MMPxfbJdMxNeVi3bp3FcnERYO1vjUg1VguXQVQhjRgxAtu3b7cYAiGqhEWVrKiaFETVlKnTRUFBgeyZK6ouRRXbvn37lOyLqFYLCgrC3XffLau/TJOoahPb2rx5MxzhySeflD2kTe644w75evbsWXP1bXR0NF555ZVi7btFh1mUh/g+f/nlFwwfPlxWo5rUrVsXjz76qKxmTUtLs/jMc889Z7EtsY8infPnz5vniapGEcfLO8yjcePGGDBgQLH5oprTRFQxi2Miql7F91G0yllUGZu+L0FUW4vfiOm7M3X0uvXWW2XVceH1irbti6rbvLw8+T0X7vDz7LPPyqrztWvXWqwvfiPit2witiuOT+vWrWWVsInp34X3qTSiOlY0j5j87W9/k8dG5EPYs2ePbFZ54YUXZFW4yT333INWrVoV28+y/O9//0OHDh1kVX9RpmMuOuyJKvJvvvnGvOzw4cOyWlxURxPZG4NrOZhOaqaOTaLd5o8//pAnKhF8BdEe9vHHH6N58+Yy0Io2JXFCFH/M5WnTK49Tp07JtETbnki78CTamay1C9tDgwYNLN6LNlXB1O575swZ+Sra+FQQbaWiZ7YIBkWJwCC++6JtfGXtY2WI4GqNaN8VJ3PRzieClTgeog1SKHrsi+6Xad8K75e4ABC/o6KK5t90oVB0vrjwERchhS8kBNEuWfTiRlysRUREFJtXke+q6L6KbYj2XNPFaEn7KYjgWnQ/yyJ+X2X9tsTFhvi7XblypfztCCLQiuD+4IMPVmh7RJVRfbtjViOiZChOAt9++608aYpXUeIpXJJ4//338c477+Cpp56SQ11q1qwp/8BFqaKkjiiFT0bGmlBLoqRVmEhHBNbCV+OFiZN6ZZRUmhTbN108FGZtnmAtD1XFHvtYuIRa+EQvai/E72POnDkyUIngJkpt4mKr6LGvyu+upG07w/GsDFGinj17tgywjzzyiLw4Fh3KTBcPRPbE4FpOIpCK4ClKouKPVFyti163JqLXZe/evfH5558X69UoSrGlESUXa1VwRa/omzZtKqsCb7/9dqsn+soS27d2pyax/cLVsOUl9tNUDVfaeNryVhGLiwbRu1WMdyxK9PYVFzFFS1+OIsZB5+bmYvXq1RalUluq6Bs2bChrKYoqmn+xnml+4eMkqopFtbyjxjIX3VcRlEXvcdE7vOh+ih7QhYl5puUV+X2J31ZZROm2U6dO8mJUlNpjYmLwySefVGhbRJXFauFyMpVSJ02ahKioqGLtX+Lqv+iVvmgjLc9QA3GyEEGi8FARMYRHVDcWJu6uJEqTomRcVH5+fqVvZSi2L4a0iJOyyZo1a6wOpyiPzp07yypUMWSo6D4V/o5Md4cqa7/Fd9u/f385rrhwu3dCQoK80OnZs6dsY6yoigzFKW3fiuZLpCeG51SWGL4kjseuXbvM88Rvo2iNhQieopT8z3/+02L74gJP7INo03QEMZSo8F3MxIVmXFycHDYliK
FLosZl4cKF8kLEZN26dTh27FiF9/OBBx6Qfx/i5i5FFf0bfOKJJ2R7vfgtihu9mPaJyN5Yci0nESx69OghT/BC0eAqqpumTJkiO/uI9Q4dOiRPhuUp+YmqZFGlKDrLPP3007LtVJyIxNjFwh11RCcZMaZVjNETAV4EHDG4XpQcRCAXY0hFZ5KKeuaZZ+QJUdxpSgRwUdX59ddfm0ugFSVKkmKc59ChQ+V9jMV3Ijq4iEAmxm+KsaKm6nZhzJgxMu+mzmPWiHGeGzdulIFUdIwRN5gQ40zFyXrWrFmV2k9xchb7JgJhZe9dK46BCHAir+LYiLbv//znPzKYiABTGePHj5e3phTHQ4zVFBch//73v2UJT9ScFC7Rv/HGG3jvvffkumJ8qSgJinGvolbFUR13RBOIOC7iuxQXPCKQiTZX0bFKEL/RmTNnyuXiNyyqaMV64vcqxqmK8aoV8frrr8vfq2g7FX874nckxpmL2gPxdyM6O5mIDm/i+xTH+vnnn+fNKMhxlPc/dmHz58+X3fuLjuk0DcURwynEsB0xfu/222/Xtm/fXmw4i7WhOMLXX38th26IoRUdO3aUQwmsjXMV/v3vf8txpmI7YiiKGGc4fvx4OTSlMkNxhI8++kgOqRDDQsS+79mzp8ShON99953FZ0vK059//inHsop9FGNUxTCIwuOFxZCdf/zjH3JMpxh7WPjnWHQojrBv3z45xCUgIECOre3du7d5rG5ZeTTtu2lsamWG4txzzz1Wl61evVrmTYzvFOM2Z86cKYfBFB2TWlIa1oZiHTx4UM4TaYrjIsZxfv7558XSNA29adWqlRx3LcZ2Pv/888WGQIm02rZtW+58ie2IoVKlMX2n3377rfbGG2/I4TbiNynSszaOdPny5XL8sfiN1axZU45FvXDhgsU65RmKYxpnLsb0iu9G/M3Ur19frpOUlFRsu4MHD5ZpFv2tENmTTvzPgbGciFyEuEOT6Gcgak0qU2PiKGLIjqhJsnYXMSJ7YZsrEbksUTUvxtGKtlciR2KbKxG5HNFbWnQI/O9//yvbWUV7OJEjseRKRC5HPIhAlFZFkP3qq6/kQwWIHIltrkRERIqx5EpERFQVba7iFm6XLl2SN+euzI3XiYioaolKSnGzD/FAg8IPelApJyfH4mY0thDjxws/6MElg6sIrFV1ezkiIlJH3HlN3A7SHoG1ceN6iI+/oiQ90U4u2sydNcCWK7iaHiclDkplbjNHRERVS9ztTRSSCj8eUCVRYhWB9dz5FdDr/WxKKy0tC40aPiTTdOngaqoKFoGVwZWIyHnZu2lPH+ADfYCNDxYp40lizoDjXImISB0RGA02BkcXCK7sLUxERKQYS65ERKQOS64SgysREakj7kuk2XhvIhe4txGrhYmIiBRjyZWIiNQxaAqqhZ2/5MrgSkRE6rDNVWK1MBERkWIsuRIRkTosuUoMrkREpA6Dq8TgSkRE6mgKgqtIw8mxzZWIiEgxllyJiEgZnWaQk61pODsGVyIiUodtrhKrhYmIiBRjyZWIiBTfoUmzPQ0nx+BKRETqsFpYYrUwERGRYiy5EhGROiy5SgyuRESk+HmuBtvTcHKsFiYiIlKMJVciIlKH1cISgysREanDoTgSgysREanDkqvENlciIiLFWHIlIiJ1+Mg5icGViIiU0RkMcrI1DWfHamEiIiLFWHIlIiLFN5HQbE/DyTG4EhGROuwtLLFamIiISDGWXImISB2WXCUGVyIiUod3aJIYXImISB2WXCW2uRIRESnGkisRESmuFjbYnoaTY3AlIiJ1OM5VYrUwERGRYiy5EhGROuzQJDG4EhGROqJK18BqYVYLExGRU9u6dSuGDh2K8PBw6HQ6rFy50mL56NGj5fzC08CBA+26Tyy5EhGRU1cLZ2ZmokOHDnjqqadw//33W11HBNMvv/zS/N7b2xv2xOBKREROHVwHDRokp9KIYBoWFgZHYbUwERFVS2lpaRZTbm5upd
PasmULQkJC0LJlSzz//PNITk6GPTG4EhGR+nsLG2ycAERERCAoKMg8zZgxo1K7JKqEFy9ejE2bNmHmzJn4/fffZUm3oKAA9sJqYSIiUkczGCdb0wAQGxsLvV5vczvpiBEjzP+OjIxE+/bt0bRpU1ma7du3L+yBJVciIqqWJVe9Xm8xqeqE1KRJE9SuXRunT5+GvTC4EhHRTeXChQuyzbVu3bp22warhYmIyKl7C2dkZFiUQqOjoxEVFYWaNWvK6b333sMDDzwgewufOXMG48ePR7NmzTBgwADYC4MrERE59cPS9+zZg969e5vfjx07Vr6OGjUKCxYswMGDB/HVV18hJSVF3miif//+mDp1ql3HulYouK7vPQV+7vYdeEvk6obsmg5Xs6bbW3A1rnicXFWvXr2glXLLxA0bNsDRWHIlIiJ1+DxXicGViIiculq4OmJvYSIiIsVYciUiIoUU3ERCpOHkGFyJiEgdVgtLrBYmIiJSjCVXIiJShyVXicGViIic+g5N1RGDKxERqcOSq8Q2VyIiIsVYciUiInVYcpUYXImISB22uUqsFiYiIlKMJVciIlJHPJ1Gs7Fa19bPVwMMrkREpA7bXCVWCxMRESnGkisREanDkqvE4EpEROqIJ+IYbOzta/NTdaoeq4WJiIgUY8mViIjUYbWwxOBKRETqiBpdg63BFU7PYcE1uH0DRE4YBv+IWsiIScahmauQcijW6roht7dE05F3ILBpGLT8AlzZfw5HPl6LnMQ08zqhd7VGm38MhE+IHqnHL+HA9B+ReT4JjuRqeXK1/LhqnlwRj5MLYcnVcW2unnpfdJ0zEudW7MCGvtNw/rsd6DZnJDwCfKyu7xHgjTOL/8CmobPw2/APcS0zF53fH2Fe7t+gNjpNeQhH5v6MDf2mI2nPWXT98HHo3B3XhOxqeXK1/LhqnlwRjxO5Iof82sJ6tUHO5TTErNoDw7UC+ZqbnC7nW3Npw0Ek/nUCBdl5KMi5huhlfyG4bYT5j6PeoI5I3nsWiX+egCEvH6c+3wyv4ADU7NjQEdlxyTy5Wn5cNU+uiMfJtWgGTcnk7BwSXPXNwpB2Ms5innivbx5Wrs/X6twY6ecuQyswWE1PzM+ITpTzHcXV8uRq+bG2D66QJ1fE4+Sitz/UbJycnEOCq7ufF/LTcyzmXUvPgYefd5mf1beoi5Z/74ejH681z/Pw85KfL5qeu3/Z6anianlytfy4ap5cEY8TuSK7dGiqN6ADIt+4V/47Oz4FSbvOyHYViw0H+CAvJbPUdAKbhqLbvFE4PPsnmYZJflZesfYYzwBvFGTmwl5cLU+ulh9XzZMr4nFycezQZL/genHDATmZRAzrgsYjehS74oxe+lepfzi3fvoUjs3fgIvrb6QlpJ2OR1CLuub3oq0loHEI0s4kwF5cLU+ulh9XzZMr4nFycQyujqsWjt9yFD4hQfKPSOfhLl99agcifssRq+sHNAmRfzgnFm7EhTX7ii2/uC4KtW5pgpAeLeDm6Y7mT/VCXmqW7JLvKK6WJ1fLj6vmyRXxOJEr0mla2S3HaWlpCAoKwvLOr8HPvXLtFsEdGiJyvHEcW2ZsEg59sBpXD8XIZT6hQei1/GVseXgechJS0eGd+1H/nk6yJ2BhpuWC6EnY+qUB8o8y9cQlHJj2g+PHULpYnlwtP9U1T0N2TYerWdPtLZs+z+Nkf6bzeGpqKvR6vd3Svzr7Seh9vWxLKzsPwa9/abd9dangSkSuedJWEVyrI1c7Tg4LrjNHqwmuExY5dXDlqGoiIiLFeG9hIiJSRlSGajZ2SCpHhWq1x+BKRETqsLewxOBKRETqMLhKbHMlIiKntnXrVgwdOhTh4eHQ6XRYuXJlsWrmSZMmoW7duvD19UW/fv1w6tQpu+4TgysREakvuRpsnCogMzMTHTp0wPz5860unzVrFv75z39i4cKF2LlzJ/z9/TFgwADk5FjeJlMlVgsTEZE6Km68r1Xs84MGDZKT9a
Q0zJ07F2+//Tbuvdd4283FixcjNDRUlnBHjLjxuEKVWHIlIqJqKS0tzWLKza34/aGjo6MRHx8vq4JNxHjc7t27Y/v27bAXBlciIlJGM6iZhIiICBkITdOMGTMqvD8isAqipFqYeG9aZg+sFiYiomrZWzg2NtbiDk3e3s5zh0CWXImIqFrS6/UWU2WCa1hYmHxNSLB8KpJ4b1pmDwyuRETk1L2FS9O4cWMZRDdt2mSeJ9pvRa/h2267DfbCamEiIlKmcJtpZVX08xkZGTh9+rRFJ6aoqCjUrFkTDRo0wCuvvIJp06ahefPmMti+8847ckzs8OHDYS8MrkRE5NT27NmD3r17m9+PHTtWvo4aNQqLFi3C+PHj5VjY5557DikpKejZsyfWr18PHx8fu+0TgysREamjKajWreA41169epV6s39x16YpU6bIyVEYXImISB1RpWtQkIaTY3AlIiJlxOPmNFsfOccb9xMREVFRLLkSEZE6rBaWGFyJiEgdUaOrKUjDybFamIiIqCpLrgM3T7K4z6OzW9PtLbiaIbumw5W44jFa7YJ54lU6mbBDkxGrhYmISB22uUq84CQiIlKMJVciInLqewtXRwyuRESkDquFJVYLExERKcaSKxERKcNqYSMGVyIiUkeMojEoSMPJMbgSEZEy4slvmmOfOFctsc2ViIhIMZZciYhIGba5GjG4EhGROhyKI7FamIiISDGWXImISBlWCxsxuBIRkTLsLWzEamEiIiLFWHIlIiJ1DDrjZGsaTo7BlYiIlGGbqxGrhYmIiBRjyZWIiJTRNJ2cbE3D2TG4EhGRMqwWNmJwJSIitUNxDLan4ewYXG0Q3L4BIicMg39ELWTEJOPQzFVIORRrdd2Q21ui6cg7ENg0DFp+Aa7sP4cjH69FTmKaeZ3Qu1qjzT8GwidEj9Tjl3Bg+o/IPJ/kwBy5HpXHKLBJCFq/Mhg1WoXDq4Y/1veZivyMHAfnCKhZKE+ZMck4OHMVrpaQp8IaDu+KDm8Ox+E5a3F22TY5L+S2FmjzjwHwCQmSZzTxuzs892ekn0mAI7nicaKbGzs0VZKn3hdd54zEuRU7sKHvNJz/bge6zRkJjwAfq+t7BHjjzOI/sGnoLPw2/ENcy8xF5/dHmJf7N6iNTlMewpG5P2NDv+lI2nMWXT98HDp3HqLqcowM+QbE/XoIUVP+h6rMk8hD9IodWN93GqK/24HupeTJxLt2IJo+3hNpp+It5qeejMP2fyzC+n7TsGHgDCT8dQLdZj0GR3LF43QzM7W5ajZOzo5n7koK69UGOZfTELNqDwzXCuRrbnK6nG/NpQ0HkfjXCRRk56Eg5xqil/2F4LYR5uBZb1BHJO89i8Q/T8CQl49Tn2+GV3AAanZs6OCcuQ7VxygzJgmxq/c6vFRXWF0recpJTpfzS9N+/DCc/GIz8tKyLOaL70NMJprBAN+6NRx6UeeKx+mmZtBBs3HiONebmL5ZGNJOxlnME+/1zcPK9flanRsj/dxlaAUGq+mJ+RnRiXJ+8t5oxXt/c1B9jJw1T3X7tIWHvzcu/ByFBkO7FFvuGxqEXkv/AQ8/b0AHnPzyd4fm2RWPExGDayW5+3khP92yHedaeo7xBFUGfYu6aPn3ftj7xrfmeR5+XvLzRdNz9y87PXLMMaouebpWgTx5BvqgzZiB2PGPRSWmmZ2QinV9p8m0I+7pjJyEVDiSKx6nmxnvLWzE4FpO9QZ0QOQb98p/Z8enIGnXGdlWVJhoI8pLySw1ncCmoeg2bxQOz/5JpmGSn5VXrI3JM8AbBZm5SvPhyux9jKoqTx2u5ymrhDx5Bvggt4Q8tRkzCDGr9yIzNrnMbRVk5eHc9zsx8Jc3sXXUv5B16SrswRWPE93Aca5GDK7ldHHDATmZRAzrgsYjehS7io5e+lepJ4NbP30Kx+ZvwMX1N9IS0k7HI6hFXfN70X4U0DgEaWw3qjbHqDrkqcGwLmhiJU9nSshTna5NZZWw6TMiEN
doXU+25e+ZaKW0pwPcvD1ku6u9gqsrHieiotihqZLitxyVwxfEiUHn4S5ffWoHIn7LEavrBzQJkSeDEws34sKafcWWX1wXhVq3NEFIjxZw83RH86d6IS81Sw4zoOpxjAQ3Lw+4eRqvSd283OV7R4q7nqcG1/PUoIw8/fH0Qmx57BP8/vincko5dhGnv/4TB2esksvD746Ef/2agE4nS4uRY4egIPuaHJLjKK54nG5mtnZm0kydmipg8uTJ0Ol0FlOrVq1QlfiLq6RradnY/doSRI4fhnbjhiIzNgm7xy4xt4f5iE4iy1/GlofnyTaspo/1hFewH9q8OlhOJqbloodj1Lvfoe3Ye+SJJvXEJZk+O2lUn2MkSnN9V71unt9//ZvyddO9s5Edl+KwPO16bYns/Rs5bigyYpOws1CeROek3stfxuaH58m21NzkDIvPi57oYsynuHAT/OoGo/WL/eEdHCB73149egHbX/oS+Q5sjnDF43Qzq6o217Zt2+LXX381v/fwqNrwptO0srORlpaGoKAgpKamQq/Xw1Ws6fYWXM2QXdPhSlzxGLni5ZIrVoG52t+Svc/jpvRPDXscgZ5eNqWVfi0PzVd/Xe59FSXXlStXIioqCtWFK/5NEBGRC9xEIi0tzWLKzS25RuXUqVMIDw9HkyZN8NhjjyEmJgZVicGViIiUMRh0SiYhIiJCloZN04wZM6xus3v37li0aBHWr1+PBQsWIDo6GnfccQfS02/cIMXR2OZKRETVss01NjbWolrY29v62OdBgwaZ/92+fXsZbBs2bIgVK1bg6aefRlVgcCUiompJr9dXqn24Ro0aaNGiBU6fPo2qwmphIiJyqRv3Z2Rk4MyZM6hb98a9AxyNwZWIiJw6uI4bNw6///47zp07h23btuG+++6Du7s7HnnkEVQVVgsTEZFTu3DhggykycnJqFOnDnr27IkdO3bIf1cVBlciIlLGoOnkZGsaFbFs2TJUNwyuRESkTGVuX1iUrZ+vDtjmSkREpBhLrkREpAyf52rE4EpERMoYoKDNVTz70MmxWpiIiEgxllyJiEgZFTeB0Gz8fHXA4EpERMqIwGhgcGVwJSIidVhyNWKbKxERkWIsuRIRkTKG65MtbP18dcDgSkREyrBa2IjVwkRERIqx5EpENhuya3pV7wJVEwat4jfet5aGs2NwJSIiZVgtbMRqYSIiIsVYciUiIsXVwrA5DWfH4EpERMqwWtiI1cJERESKseRKRERqHzkHPnKOwZWIiJThw9KNGFyJiEgZMcbVYPM4V+cvubLNlYiISDGWXImISBlNQZurxjZXIiKiG9jmasRqYSIiIsVYciUiImXYocmIwZWIiJQR7aUa21xZLUxERKQaS65ERKQMb9xvxOBKRETKsM3ViNXCREREirHkSkREyrBDkxGDKxERKcM2VyMGVyIiUoYlVyO2uRIRESnGkqsNgts3QOSEYfCPqIWMmGQcmrkKKYdira4bcntLNB15BwKbhkHLL8CV/edw5OO1yElMk8sDm4Sg9SuDUaNVOLxq+GN9n6nIz8hxcI5cj8pjJDR7shcaDL8FnoG+yLp4Bcc+3YCknacdmCOgZqE8ZcYk4+DMVbhaQp4Kazi8Kzq8ORyH56zF2WXbzPN9QvRo9+pg1OnWTL6/euQCdoxZZNc8kOtitbARS66V5Kn3Rdc5I3FuxQ5s6DsN57/bgW5zRsIjwMfq+h4B3jiz+A9sGjoLvw3/ENcyc9H5/RHm5YZ8A+J+PYSoKf9zYC5cm+pjFHpXazR5rCd2j12CDX2m4uzSv3DLrMfkdhyZJ5GH6BU7sL7vNER/twPdS8mTiXftQDR9vCfSTsVbzHf38USPfz2N1FPx+GXoLKzv/z6OL9ho51zQzTAUx2DjVFHz589Ho0aN4OPjg+7du2PXrl2oSgyulRTWqw1yLqchZtUeGK4VyNfc5HQ535pLGw4i8a8TKMjOQ0HONUQv+wvBbSOgczcegsyYJM
Su3ov0MwkOzonrUn2M/OrVROrRC+ZjdHFdFNw83OR8R6lrJU85yelyfmnajx+Gk19sRl5alsX8iCGdkZeahVNfbEFBVh60AgNSjl20cy6I1Fq+fDnGjh2Ld999F/v27UOHDh0wYMAAJCYmoqowuFaSvlkY0k7GWcwT7/XNw8r1+VqdGyP93GV5MiPnOEZxGw/Bu1Yg9C3qAm461B/SGdmJaQ69IKpMnur2aQsPf29c+DnKah5zElPRfe4oDNz4Fu786gWE9Ghhl32nm4OmaBLS0tIsptzcXFgzZ84cPPvss3jyySfRpk0bLFy4EH5+fvjiiy9QVRhcK8ndzwv56ZZtotfSc+Dh513mZ8XJueXf++Hox2vtuIek+hjlXsmQJds7vnoBg/98D23H3oOD7/8IQ14+HJknkYfy5skz0AdtxgzEwQ9WWV3upfdF3V5tcf7HXdgwcIYs3d7ywSPwr++40ji54MPSNdsmU2/hiIgIBAUFmacZM2YU215eXh727t2Lfv36mee5ubnJ99u3b0dVYYemcqo3oAMi37hX/js7PgVJu84Ua2sT7V55KZmlphPYNBTd5o3C4dk/yTTIeY5R82f6IOT2Ftj8t4+RdekqanVqhC4fPIIdL36JtFNxdstTh+t5yiohT54BPsgtIU9txgxCzOq9yIxNtro8PzsPVw7FIP73Y/K9eE09fgl1ujdH5oWdyvNDVBGxsbHQ6/Xm997exS8ik5KSUFBQgNDQUIv54v3x48dRVRhcy+nihgNyMokY1gWNR/QoVtqJXvpXqSftWz99Csfmb8DF9TfSIuc4RkEt6+LSpsOyl7CQvC9adhCq3a2p3YJr0Tw1GNYFTazk6UwJearTtamsEjZ9RgTiGq3roWbHhtgz8VuknYxH7a5N7LLvdHMSjSgGBWkIIrAWDq7OhNXClRS/5Sh8QoLkCVzn4S5ffWoHIn7LEavrBzQJkSftEws34sKafVbXcfPygJun8XrHzctdvqfqc4zEcJe6fdrBN6yGeZhPjTb1i7WB2lPc9Tw1uJ6nBmXk6Y+nF2LLY5/g98c/lZPorHT66z9xcIaxmjj25/0IahmO0J4tAZ1Ovor3iTtOOSxP5Fo0Ua2r2T6VV+3ateHu7o6EBMu+D+J9WFj5+lfYA8/elXQtLRu7X1uCyPHD0G7cUGTGJskhGqb2MJ/QIPRa/jK2PDwPOQmpaPpYT3gF+6HNq4PlZGJa7lu3Bvquet08v//6N+XrpntnIzsupQpy6PxUH6Mzi7fKKtke/3kWngG+spfu8QW/IGn3GYfmaddrS2Tv38hxQ5ERm4SdhfLkGxqE3stfxuaH5yE7IRW5yRkWnxftw2L8tOghLIhSuCjBtn15ELpMexiZF65g94Sl5tI5UXXn5eWFLl26YNOmTRg+fLicZzAY5PuXXnqpyvZLp2lamcN1RS8t0ZicmprqtEV0a9Z0ewuuZsiu6XAlrniMXLF/+DAX+925Inufx03pL+08Dn7uZXcaLE1WQS4e3fdhufdVDMUZNWoUPvvsM3Tr1g1z587FihUrZJtr0bZYR2HJlYiInPoOTQ8//DAuX76MSZMmIT4+Hh07dsT69eurLLAKDK5EROT0N+5/6aWXqrQauCh2aCIiIlKMJVciIlKGN+43YnAlIiJl+DxXI1YLExERKcaSKxERKcNqYSMGVyIiUobB1YjVwkRERIqx5EpERMqwQ5MRgysRESkjbqhrsLFat+yb8lZ/rBYmIiJSjCVXIiKqls9zdWYMrkREpExFn8dqja2frw4YXImISBmWXI3Y5kpERKQYS65ERKQMbyJhxOBKRETKiLioKUjD2bFamIiISDGWXImISHG1sM7mNJzdTR1cXaFHWlGru70FVzJs13S4mjebToGrOdzS9fI0/uidcCX5BZkO2Q6rhY1YLUxERKTYTV1yJSIitdhb2IjBlYiIlOFNJIxYLUxERKQYS65ERKSMeFycxkfOMbgSEZE64kHnBj4snc
GViIjUYcnViG2uREREirHkSkREyrC3sBGDKxERKcNxrkasFiYiIlKMJVciIlKG9xY2YnAlIiJlWC1sxGphIiIixVhyJSIiZTjO1YjBlYiIlOFQHCNWCxMR0U2jUaNG0Ol0FtMHH3ygfDssuRIR0U3VoWnKlCl49tlnze8DAwOVb4PBlYiIquVQnLS0NIv53t7ecrKVCKZhYWGwJ1YLExGR8pKrwcZJiIiIQFBQkHmaMWOGkn0U1cC1atVCp06dMHv2bOTn50M1llyJiKhaio2NhV6vN79XUWodM2YMOnfujJo1a2Lbtm144403EBcXhzlz5kAlBlcb1GzfAJEThsE/ohYyY5JxcOYqXD0Ua3XdoJbh6PDmcPiFB0PnpkN6dCKOzv8FV/afk8trdW6M2xc+g/ysXPNnYtfsx6EPf3LK/LSfeC/qD+xw4wNuOnj4eOH3J+Yj9cQlR2XJ5dSoF4TxW19Gbmaeed7ZHeew5LllJX7mloc64c7neiCwTgDSEtLx2ydbceCnw3JZeNsw3Pf+EATXNx7HxNOXsWHWJpzbHYOq0PGhzhg8dQg2vr8Bu7/aWeJ6gaGB6PfmADTu0US+v3TgIpY98415eaeHO6PH/90B3xq+iNl1Hmvf/gmZlzPssMc6uOmaQ6cLBuAJIA8GLQaaFn99uTvcdC2g09WSfWAN2kVo2vlS0itr/Yqm53jiWayaoue5isBaOLiWZOLEiZg5c2ap6xw7dgytWrXC2LFjzfPat28PLy8v/P3vf5elYhXB24TBtZI89b7oNmckjn6yHhd+3o/6gzuh+5yR+PW+j5CfkVNs/az4q9g9YSmy41Pk+7q92uDWOSOxfuD7MOQaqySupWdjXd9pcIX8HPxglZxMmj56Oxre15WBVZGZt3+MnPQbF2IlqdsmDMPeG4xFT34jg3DTHo0x8r+PIO5YPBJPJ+HqxVR88/x3SLmUKtdv278VRv33EUzv9hHyr/8uHSUgJAC3Pn0bEk8klLqep68nHls8EodWHsTaN1fjWs41hLWpa17e8NZG6D2uH5Y9/Q0un0pE/3cG4t4P78PSUUvssNciCOShwHAAgPg70cPdLRIGLRcarsrAC50nCgw7ZPB1d+sAA3KgadbzWNb6FU2vKmgKOiRpFVz/tddew+jRo0tdp0kT44VYUd27d5fVwufOnUPLli2hCttcK0kEk5zLaYhZtQeGawXyNSc5Xc635lpqtjkQQaeDZtDg4e8Nn1rqe6lVx/w0GHYLYn7aa88skBU169dAysUUGViFM9uikRqXipBmdeT77JRsc2DV6QCDwQDvAG9ZynW0AZMG489//SH3qTTt7+uArKvZ+GvBH8jLzINWoCHu0I2Ltvb3d8Th1Qdx6eBFXMu+hi0f/YYGXRuiRv0adthrUXoU363pAjQNGlKg0wXJ06tOFwKDIRqAuFDJliVNN92NCwFLZa1f0fRuHnXq1JGl0tImUUK1JioqCm5ubggJCVG6Tyy5VpK+WRjSTsZZzBPv9c1L74E2aNPbcPf1gpuHO2LX7kPWpavmZWJ+/7UTZKBK3h+No59skAHPWfNjEhwZAf8GtRC7Zp/y/b5Zvbzuebh5uOHCgYtYP/NXXD6bbHW9k3+cQa8Xe6LZ7U1wZttZNOvZFL6BPji3x7La95394+Hl5wV3Dzfs++EArl64fuHkIK0GtJZB/fCqg+jwQMdS123QrSHSE9Lw8H8eRXj7eki5cBVb523Bma2n5fKQliHYs2S3ef3M5ExkJmWgTssQpNg9X27QQQ+DlgjADzqdKL8Uqo7WMgBdgxI+W9b6FU2valTnoTjbt2/Hzp070bt3b9ljWLx/9dVX8fjjjyM4WFTtq8PgWknufl64lm5ZXSree/iVXmcvqn3dvD0Q3rst3LxFG41RxvnL+P3xT5F+7jK8g/3R9pXB6PbRE9g66l8OuReY6vwU1uDeW5Dw5wnkXslUus83o6yrWfjXff/FpaPx8PL1RO+X7sSTix
/HvIELkJtxox3WRJTcolYewhP/flheAGkFBvxv4mpkJFkei6mdZsHD2wPtBraWr47ko/dBn/H98O1T35Rv/SBfNOzeCD/84zt89/wyNLurOe7/54P477CFuBpzVV4k5Bb5Leek5cDLX117WkncdC2hIQsaLoueCdC0AotKTk2WOEv6ft3LWL+s5dVDdX4qjre3N5YtW4bJkycjNzcXjRs3lsG1cDusKtXrqFRj9QZ0QIc37pX/zopPQdKuM7KdsjDPAB/kppQdQESb5IX1B9Br2RhknLuMKwfOIzc5Q06CeD3w/koM/u0dBDSohYzzSU6XHxNRqq3XNxJ731muPA83gw7D2mH4tCHy36J6d96ghbhw0FgFKtpc183YiI73RqJB5wic2nqm2Oe7PNgRPZ+5DQse+AIJJxIQ2jIUI/87AjlpuTix5ZTFuqKNNWrVIby87v9w+UwSzu+13pnNVm2HtsOg94x5Sr2UgotRF3Dg+yhcPX+lXJ/Py8rDxf0XcHLTCflevMYfiUPjnk1xdekeudw70DKQegf6IC+z7DZqWxg7Nvleb38VRCB0u94uawwXOnnKLaktu6z1K5oeFSV6Ce/YIdqr7Y/BtZwubjggJ5MGw7qgyYgeFuvoW9TFmaV/lTtNUZIQPXMLByMzO5dWHZWfev3b41pmLhK2nVS05zeXA6sPy6k0Wim/lfA2YTj5+2nEHzd2eBGvp/84ixZ3NS0WXE3cPd1Rq1FNuwXXIz8dlpPJC5vGyCrhrqO6y/fi33XbhSOiSwP8MOa7Yp9PPJ6ARrc2LjH9xBOJCG11oznDr6YfAuoE4PIJUVVrz8Cqvx5YRRAUsq4HQf8bVbk60ZZd0gVrWetXNL2qUZ2rhR2JHZoqKW7LUfiEBMmgpPNwl68+tQMRv+WI1fVDe7aEvlkodO5ucPf2RPPRd8EnRI9k01CcLo3lsBbBM8hXDmVJP5uIjNhkp8yPiUhHtMW6xF9LNVC/Qz3UaVpbDpvx8vPEgPF95fk2Zt8Fq+vH7L+A5nc2RUhzYwcm8dr8jqayWllo2bs5wlqGwM1dB08fD9z1fE/ow/QOHYrz1cOfyyrdz+/9TE5xh+Ow4/NtWPfuGqvri17CoW3C0KyX6DkL+Sren/3DWHI/+EMU2g6LRN3IcHj4eKDX2D6I2X3ebu2txsAadD2wFi5FGqBpiXBzExcC7gB84aarB4Nm2beh/OtXNL2qoSn6z9mx5FpJ19Kyseu1JWg/fhgixw1FRmwSdo5dYm639A0NQu/lL2Pzw/OQnZAKryA/tH15EHzq6FGQl4/00wnY+epiZF28Yh432nny3+Cp90N+Zi6S9p7FzrGLHRaUVOdHCGhcB8Ft62PfOysckoebQc0GNXD3q71lb17Rnhp74CK+HP01cjOMVZ5BdfV4ZcMLmDvgX0iNS5Ol3hrhQRj57xHwr+WPrJQs7P0+Cnu/i5Lr+9f0w+A374Y+VC+rhRNOJmLxM9/iSkzxjmn2klmk/Vf8nkR+sq8aew2LEqzovPRhZ+PN1VNir+KHl79Hvwl3Y/icB3A15oos4Yr5wvkd57Blzm944NOHZHuuCKyrxv1op733hptbPWiaAe5ut5nniqExBu0kDNopuKHF9WWmcak3hs24uUVC01KhacaLmbLWL2s5VR86rbQ6pUL3dxS3nkpNTS3XgF5nsbrbW1W9C1SGYbumw9W82XQKXE2AC16mjz96J1xJWlomatUcYrfzuClOvNzwDXi7+diUVq4hB/POz3DqmOOCfxJERFRVqnNvYUdicCUiImXYocmIHZqIiIgUY8mViIiUEb14NFvvLewCJVcGVyIiUsZwfbKFrZ+vDlgtTEREpBhLrkREpAw7NBkxuBIRkToK2lzhAsGV1cJERESKseRKRETKsEOTEYMrEREpw6E4RqwWJiIiUowlVyIiUobVwkYMrkREpIx40JpmY72urZ+vDhhciYhIGY5zNWKbKxERkWIsuRIRkTJ8nqsRgy
sRESnDamEjVgsTEREpxpIrEREpw5KrEYMrEREpbnPVbE7D2TG4upg8g2vV9Ot0rvcTHddoElxNkKcrnA4tebj3givxcE+r6l24qbjemYuIiKoMq4WNGFyJiEgZ3rjfiMGViIiUEe2tBpvbXJ0/urpWAx0REVE1wJIrEREpw2phIwZXIiJSho+cM2K1MBERkWIsuRIRkTJ8nqsRgysRESnDca5GrBYmIqKbxvTp09GjRw/4+fmhRo0aVteJiYnBPffcI9cJCQnB66+/jvz8/ApthyVXIiJSxqBgnKvBjuNc8/Ly8OCDD+K2227D559/Xmx5QUGBDKxhYWHYtm0b4uLiMHLkSHh6euL9998v93YYXImISO2N+zXb07CX9957T74uWrTI6vJffvkFR48exa+//orQ0FB07NgRU6dOxYQJEzB58mR4eXmVazusFiYiomopLS3NYsrNzbX7Nrdv347IyEgZWE0GDBggt3/kyJFyp8PgSkREyquFDTZOQkREBIKCgszTjBkz7L7/8fHxFoFVML0Xy8qL1cJERKT2Dk2wPQ0hNjYWer3ePN/b29vq+hMnTsTMmTNLTfPYsWNo1aoVHIXBlYiIqmWHJr1ebxFcS/Laa69h9OjRpa7TpEmTcm1bdGTatWuXxbyEhATzsvJicCUiIqdWp04dOakgehGL4TqJiYlyGI6wceNGGeTbtGlT7nQYXImISBmDpqDkasc7NIkxrFeuXJGvYthNVFSUnN+sWTMEBASgf//+Mog+8cQTmDVrlmxnffvtt/Hiiy+WWC1tDYMrEREpI57FqlXj57lOmjQJX331lfl9p06d5OvmzZvRq1cvuLu7Y82aNXj++edlKdbf3x+jRo3ClClTKrQdBlciIrppLFq0qMQxriYNGzbEzz//bNN2GFyJiEgZTcEj4zQ4PwZXG9Rs3wCRE4bBP6IWMmOScXDmKlw9FGt13aCW4ejw5nD4hQdD56ZDenQijs7/BVf2n5PL20+8F/UHdrjxATcdPHy88PsT85F64pLd8+JTKwCd37oXwa3D4VtHj42PzkfqydLHdIXf1RqRLw+Ab0ggUo7HYe/UlUg/n1Tu5aoNHjwYEya8jsjIdrh27Rq2bv0Dr7wyFhcvXjSvc++9wzB79kzUq1cP+/btxzPPPIcTJ06UmGZZ61c0PRV8Ar0x5K270e7uVnD3dMfl6GQseHgRruWUfu/TgeP6oO+LPbHoueU4svHGPt76SGf0efEO+Af74syO8/hu4k9Iv5wBR6ndOgy9Jw+Fvn4wdDodrpy9jO0f/4q4vTFW16/ZrA5uf70/6rQJh2+wH/5z2wfIS7e8uUBwk9roOWEAwjpGwJBvwNlNx7F50moH5ejmVt1vf+govIlEJXnqfdFtzkhEr9iB9X2nIfq7Heg+ZyQ8Anysrp8VfxW7JyzF+runY13faTjz9Z+4dc5IuHkbr28OfrAKP/eaYp5OLPwVGecvOySwmh7xFL/tFLaNW1qu9QMa1ka3aX/DwTk/Y3WfGUjcfRY95jwGnbtbuZbbQ1CQHjNnzkZERCM0btxM3lFlxYpl5uUtWrTAN98swauvjkPNmnXw22+bsWrVD7KNxZqy1q9oeirodMBTnz8CwzUDZvb5FJM6zMT3b6xBQX7pZYW6rUPRpm9zpCakW8xvelsjDJ7YD0te+h6Tb/kI6UkZeHTufXCk9EupWP/KCnx++yz8t8dMRC3ahiH/ehTu1/82ihLB8vSGo9j09kqry/3qBGD4F6PkOl/c+SG+7PURDn1rObSCyN4YXCupbq82yLmchphVe2C4ViBfc5LT5XxrrqVmIzs+xfhGp4Nm0ODh7w2fWoFW128w7BbE/LQXjpJ7JRNnv9+Fq0dulPJK03BQB1zeE424P0/CkJePY//dAu9gf9Tu2LBcy+3h22+XyXaSzMxMZGVlYe7cf6J7927mYPf4449h8+YtWLt2rbyN2tSp02RX+zvuuMNqemWtX9H0VGjZqzlqhA
dh5eR1yE7NkYPtLx2NlwGnJKKm5MEZQ7By8noUXCuwWNb1wY7Yt/IgYqMu4lr2Nayb/RuadG+ImhHWnxZiD7mp2UiPS72+s4BWoMHL3xt+tQOsrp9yLhnHftiPK6cSrS7vOPI2XNgZLdcpyM2Xf59Jx8p/Zx1S8zxXzcbJ2bFauJL0zcKQdjLOYp54r29e+iDjQZvehruvF9w83BG7dh+yLl0ttk5wZAT8G9RC7Jp9qK6CmocipVD+tQID0qIT5fzLe6PLXO4Id911p7wri+huL7RvH4moqAPm5eIRUkePHpPzt2zZUuzzZa1f0fRUaNq9IZLPX8GIOfehxR1NZPXtls+2Ye8PB0v8zJ1P34q444k4u/N8sWV1W4Xgr692m99nJGXKNMX8K7HXLwYd5JltE+DpJ/423HB8VRTSL1Zu++G3NETS8Xjcv+RJBDeujStnLmPbhxuRcKh8F45kG1YLGzG4VpK7nxeupedYzBPvPfxKHwclqoRFVXB477Zw8/a0uk6De29Bwp8nZGmyuvLwLSH//t7lWm5vxidZvIcHHxxhnifGsKWkWJ6wxfvAQOu1B2WtX9H0VPCt4YNmPRrjx3fXYfm4lYhoXw9PL3oUVy6kIHpX8TZKUQLtMbIr5g75t9X0vP28kJ1meZzEe28HHafCRJWwqApuenfrEquEy8MnyBfNB7fDT//3DRIPXUTbB7vgnvmP4JshnyK3SF6J7IXBtZzqDeiADm/cK/+dFZ+CpF1nZLtrYZ4BPshNKTsgGnLzcWH9AfRaNgYZ5y7jyoEbJQpRqq3XNxJ731kOe4oY2B5d3hwm/50Zl4qND39Soc/nZ+fJ/BYm3udn5pZruQqPPvoIPvtsgfz3+fPn0a6dsUNYu3btsG7dGrz00hj52CiTjIwMefPvwsT79HTLdsjyrl/R9Cqj073t8MD0IfLfVy+m4NSfZ5FyKRXbFhtLm+f2xsrOSW36tLAaXP/2/hBs+GizrEK2JjcrT3aQKswn0Ae5Co9TUS3uiUSvd415Sr+Ugm+HG4+hIKpxT645hEdWPo+Us0mI22+9g2BprmXlIf5ALOKvf/bQt7vR6enbEdahPs7/cVphTsgallyNGFzL6eKGA3IyaTCsC5qM6GGxjr5FXZxZ+le50xRVw6KnceHgWq9/e1zLzEXCtpOwp9j1B+VUWamnEhDU4kYVuOiopG9cB6mnE8q1XIWlS7+VU2EisP766wZMnPgmvvnGsnPWwYOH0LHjjR7ZHh4eaNOmNQ4dOmw1/bLWr2h6lbF/1WE5mdzytw6IHNi63J9v3rMJwtuEYdg7A+R73yAfjPhoOHat2I+fpv0iq4vFchP/Wn7QhwTI+fZycu0hOZX1txHUsFalgmvSiXjZzkxVw/RcG1vY+vnqgB2aKiluy1H4hATJIKvzcJevPrUDEb/F+vP+Qnu2hL5ZqAwy7t6eaD76LviE6JF8fSiOiUhHtMXC4PgrNzcvDznJf3u6G/8tuqdacX7dAYR0bYKw25vLdVs/fRdyU7KQtP98uZbbg7hlmQisb789CYsW3bgDi8nXX3+DPn16Y9CgQfKBx2+99SaSkpKwdetWq+mVtX5F01Ph8Ibj8PD2wK2PdpEBJKJjPbTt1xJHfrU+/GfabR/j43s+M09pCelYPW0Dfv2ncR93fxeFzsMjEdEhHJ4+Hhj0eh/ZNuvI9taGdzVHrRYh0LmL4Wce6PJsT/iH6nFpb8m/FXcvd7hf/62KV/He5Oj3+9C4dyuERtaT31Hbh7rA3dMDcVEVD9RUtY+cc2YsuVbStbRs7HptCdqPH4bIcUOREZuEnWOXmNsZfUOD0Hv5y9j88DxkJ6TCK8gPbV8eBJ86ehTk5SP9dAJ2vroYWRevmNMMaFwHwW3rY987K6okT/dve9f8775f/Z98/f3vn+Py3nOyl2/Pfz6BlXdOk/Mzzidh1zvfo8Nr98AvRI+rJy5h29hvZMel8iy3h3Hjxsqbd3/88UdyMmnTJl
I+uurkyZN4/PGRmDdvDurXry/HpQ4bdp+5w1PPnj1ldXJgoLGnbFnrl7XcHnLSc/HF09/ivvcGYehbdyMlPg0/vvszzu0xBo7GXRvg6S8fxdvtPpDvU+Mtq6gNBg1ZV7PN7axntp/Dulm/YeSCh+AX5IMzO89j6Ss/wpF8a/jh9nH9ERCqR35uPpJPJWDtC0uRFmvs7Fe3cwMMXfgY/t3N+CzPwPAgjPzlFfPnn/p9nHxd3H+uHNYjSrtb31+H/rMfgE+wH5JPJWLti0uLjYUlsiedVo4+z2K8oGhLSk1NLdfjf5zF6m5vwdXkGVyrMuLBvfZ/OLKjjWs0Ca6mkb/zlzSKevHwjYtNV2Dv87gp/e765+Ghs61DXL6Wi51pC5w65rDkSkREyhiu/2cLWz9fHbhWMYeIiKgaYMmViIiU0XQaNJ2tvYWdv5mBwZWIiJTRFPT21VwguLJamIiISDGWXImISBnRGUnHDk0MrkREpA7v0GTEamEiIiLFWHIlIiJlDDoDdDb2Fma1MBERUSFsczVicCUiImUYXI3Y5kpERKQYS65ERKQMewsbMbgSEZEyBhRAhwKb03B2rBYmIiJSjCVXIiJSRtwXWLO5Wtj57y3M4EpERMpwnKsRq4WJiIgUY8mViIgUd2hyszkNZ8fgSkRECtk+FEek4exYLUxERKTYTV1yHbZrelXvApVBw9Sq3gUiqgCDJqp03RSk4dxu6uBKRERq8Q5NRgyuRESkjIYCaDaWXEUazo5trkREdNOYPn06evToAT8/P9SoUcPqOjqdrti0bNmyCm2HJVciIlLGeAMIg4I07CMvLw8PPvggbrvtNnz++eclrvfll19i4MCB5vclBeKSMLgSEdFNc/vD9957T74uWrSo1PVEMA0LC6v0dlgtTERE1VJaWprFlJub67Btv/jii6hduza6deuGL774AppWsYDPkisRESmjaaJDk87mNISIiAiL+e+++y4mT54Me5syZQr69Okj22V/+eUXvPDCC8jIyMCYMWPKnQaDKxERVcs219jYWOj1evN8b29vq+tPnDgRM2fOLDXNY8eOoVWrVuXa/jvvvGP+d6dOnZCZmYnZs2czuBIRkfPT6/UWwbUkr732GkaPHl3qOk2aNKn0fnTv3h1Tp06V1dIlBfiiGFyJiEjxOFedzWlURJ06deRkL1FRUQgODi53YBUYXImISBlNU3CHJs1+Q3FiYmJw5coV+VpQUCADp9CsWTMEBATgp59+QkJCAm699Vb4+Phg48aNeP/99zFu3LgKbYfBlYiIbhqTJk3CV199ZdGmKmzevBm9evWCp6cn5s+fj1dffVX2EBZBd86cOXj22WcrtB2dVo7+xaILdFBQEFJTU8tV/01ERNWLvc/jpvRr67vDTWdbuc2g5SMpbadTxxyWXImIqFoOxXFmDK5ERHTT3KHJUXiHJiIiIsVYciUiIsW9hXU2p+HsGFyJiEgh0eZqexrOjtXCREREirHkSkREyhirdHUK0nBuDK5ERKQMg6sRq4WJiIgUY8mViIiUEY+L09l8437nL7kyuBIRkTKsFjZitTAREZFiLLkSEZEyKu4LrPHewkREREXvC2xQkIZzY3AlIiJlVLSXamxzJSIioqJYciUiImVYcjVicCUiImVUjFHVXGCcK6uFiYiIFGPJlYiIlGG1sBGDKxERKcPgasRqYSIiIsVYciUiIoVUlDqdv+TK4EpERMqwWtiI1cJERESKseRKRETKcJyrEYMrEREpo2kKbtwv03BuDK5ERKSQeFyczuayq7NjmysREZFiLLkSEZEyxp6+OhvTcP6SK4MrEREpZHtwZbUwERERFcOSKxERqaOgWhisFiYiIrpBU1Clq7FamIiIiIpiyZWIiBRihyaBwZWIiBTSFHT2ZbUwERERVabkahrQm5aWVp7ViYiomjGdv+1/gwbRHcn5S54OCa7p6enyNSIiwt77Q0REdiTO50FBQcrT9fLyQlhYGOLj45
WkJ9ISaTornVaOyxiDwYBLly4hMDAQOp2td94gIiJHE6d6EVjDw8Ph5mafFsGcnBzk5eUpSUsEVh8fH7h0cCUiIqLyY4cmIiIixRhciYiIFGNwJSIiUozBlYiISDEGVyIiIsUYXImIiBRjcCUiIoJa/w/0ndzpKjipNgAAAABJRU5ErkJggg==",
|
||
"text/plain": [
|
||
"<Figure size 640x480 with 2 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"def plot_values(V: np.ndarray, title: str = \"Value function\") -> None:\n",
|
||
" \"\"\"Plot the value function V on the maze as a heatmap.\"\"\"\n",
|
||
" grid_values = np.full(\n",
|
||
" (n_rows, n_cols),\n",
|
||
" np.nan,\n",
|
||
" ) # Initializes a grid the same size as the maze. Every cell starts as NaN.\n",
|
||
" for (\n",
|
||
" s,\n",
|
||
" (i, j),\n",
|
||
" ) in (\n",
|
||
" state_to_pos.items()\n",
|
||
" ): # recall that state_to_pos maps each state index to its maze coordinates (i,j).\n",
|
||
" grid_values[i, j] = V[\n",
|
||
" s\n",
|
||
" ] # For each reachable cell, we write the value V[s] in the grid.\n",
|
||
" # Walls # never get values, and they stay as NaN.\n",
|
||
"\n",
|
||
" _fig, ax = plt.subplots()\n",
|
||
" im = ax.imshow(grid_values, cmap=\"magma\")\n",
|
||
" plt.colorbar(im, ax=ax)\n",
|
||
"\n",
|
||
" # For each state:\n",
|
||
" # Place the text label at (column j, row i).\n",
|
||
" # Display value to two decimals.\n",
|
||
" # Use white text so it's visible on the heatmap.\n",
|
||
" # Center the text inside each cell.\n",
|
||
"\n",
|
||
" for s, (i, j) in state_to_pos.items():\n",
|
||
" ax.text(\n",
|
||
" j,\n",
|
||
" i,\n",
|
||
" f\"{V[s]:.2f}\",\n",
|
||
" ha=\"center\",\n",
|
||
" va=\"center\",\n",
|
||
" color=\"white\",\n",
|
||
" fontsize=9,\n",
|
||
" )\n",
|
||
"\n",
|
||
" # Remove axis ticks and set title\n",
|
||
" ax.set_xticks([])\n",
|
||
" ax.set_yticks([])\n",
|
||
" ax.set_title(title)\n",
|
||
" plt.show()\n",
|
||
"\n",
|
||
"\n",
|
||
"plot_values(V_random, title=\"Value function: random policy\")\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "8275a1eb-b58e-4e05-ae5d-5635ff9a1556",
|
||
"metadata": {},
|
||
"source": [
|
||
"The next function `plot_policy` visualizes a policy on the maze.\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 23,
|
||
"id": "c1ab67f0-bd5e-4ffe-b655-aec030401b78",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def plot_policy(policy: np.ndarray, title: str = \"Policy\") -> None:\n",
|
||
" \"\"\"Plot the given policy on the maze.\"\"\"\n",
|
||
" _fig, ax = plt.subplots()\n",
|
||
" # draw walls as dark cells\n",
|
||
" wall_grid = np.zeros((n_rows, n_cols))\n",
|
||
" for i in range(n_rows):\n",
|
||
" for j in range(n_cols):\n",
|
||
" if maze_str[i][j] == \"#\":\n",
|
||
" wall_grid[i, j] = 1\n",
|
||
" ax.imshow(wall_grid, cmap=\"Greys\", alpha=0.5)\n",
|
||
"\n",
|
||
" for s, (i, j) in state_to_pos.items():\n",
|
||
" cell = maze_str[i][j]\n",
|
||
" if cell == \"#\":\n",
|
||
" continue\n",
|
||
"\n",
|
||
" if s in goal_states:\n",
|
||
" ax.text(\n",
|
||
" j,\n",
|
||
" i,\n",
|
||
" \"G\",\n",
|
||
" ha=\"center\",\n",
|
||
" va=\"center\",\n",
|
||
" fontsize=14,\n",
|
||
" fontweight=\"bold\",\n",
|
||
" color=\"blue\",\n",
|
||
" )\n",
|
||
" elif s in trap_states:\n",
|
||
" ax.text(\n",
|
||
" j,\n",
|
||
" i,\n",
|
||
" \"X\",\n",
|
||
" ha=\"center\",\n",
|
||
" va=\"center\",\n",
|
||
" fontsize=14,\n",
|
||
" fontweight=\"bold\",\n",
|
||
" color=\"red\",\n",
|
||
" )\n",
|
||
" elif s == start_state:\n",
|
||
" ax.text(\n",
|
||
" j,\n",
|
||
" i,\n",
|
||
" \"S\",\n",
|
||
" ha=\"center\",\n",
|
||
" va=\"center\",\n",
|
||
" fontsize=14,\n",
|
||
" fontweight=\"bold\",\n",
|
||
" color=\"green\",\n",
|
||
" )\n",
|
||
" else:\n",
|
||
" a = policy[s]\n",
|
||
" ax.text(\n",
|
||
" j,\n",
|
||
" i,\n",
|
||
" action_names[a],\n",
|
||
" ha=\"center\",\n",
|
||
" va=\"center\",\n",
|
||
" fontsize=14,\n",
|
||
" color=\"black\",\n",
|
||
" )\n",
|
||
"\n",
|
||
" ax.set_xticks(np.arange(-0.5, n_cols, 1))\n",
|
||
" ax.set_yticks(np.arange(-0.5, n_rows, 1))\n",
|
||
" ax.set_xticklabels([])\n",
|
||
" ax.set_yticklabels([])\n",
|
||
" ax.grid(visible=True)\n",
|
||
" ax.set_title(title)\n",
|
||
" plt.show()\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "48037254-dccc-4f9c-a4d7-349adba5c74f",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now let’s visualize the `random_policy`. Does it seem like a good policy?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 24,
|
||
"id": "d452681c-c89c-41cc-95dc-df75993b0391",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"image/png": "",
|
||
"text/plain": [
|
||
"<Figure size 640x480 with 1 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"plot_policy(random_policy, title=\"Policy\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "cbad5bf1-0150-4c3f-8cce-c82e0f1d1695",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 9.** Define your own policy and evaluate it using the functions `policy_evaluation(...)` and `plot_values(...)`. **Can you identify an optimal policy visually?** Plot your own policy using `plot_policy`. \n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 25,
|
||
"id": "929707e6-3022-4d86-96cc-12f251f890a9",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"image/png": "",
|
||
"text/plain": [
|
||
"<Figure size 640x480 with 2 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"image/png": "",
|
||
"text/plain": [
|
||
"<Figure size 640x480 with 1 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"my_policy = np.array(\n",
|
||
" [\n",
|
||
" A_RIGHT,\n",
|
||
" A_RIGHT,\n",
|
||
" A_RIGHT,\n",
|
||
" A_DOWN,\n",
|
||
" A_DOWN, # First row\n",
|
||
" A_UP,\n",
|
||
" A_DOWN,\n",
|
||
" A_DOWN,\n",
|
||
" A_LEFT, # Second row\n",
|
||
" A_UP,\n",
|
||
" A_RIGHT,\n",
|
||
" A_DOWN, # Third row\n",
|
||
" A_UP,\n",
|
||
" A_LEFT,\n",
|
||
" A_RIGHT,\n",
|
||
" A_RIGHT,\n",
|
||
" A_RIGHT, # Fourth row\n",
|
||
" A_UP,\n",
|
||
" A_LEFT,\n",
|
||
" A_DOWN,\n",
|
||
" A_RIGHT,\n",
|
||
" A_UP, # Fifth row\n",
|
||
" ],\n",
|
||
")\n",
|
||
"\n",
|
||
"V_my_policy = policy_evaluation(policy=my_policy, P=P, R=R, gamma=gamma)\n",
|
||
"\n",
|
||
"plot_values(V=V_my_policy, title=\"Value function: my policy\")\n",
|
||
"plot_policy(policy=my_policy, title=\"My policy\")\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "9cd7b3d0-e5ea-48c1-a68a-be3b3e782e9f",
|
||
"metadata": {},
|
||
"source": [
|
||
"-----------------------------------\n",
|
||
"\n",
|
||
"## 4. Dynamic programming: Policy improvement and policy iteration"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "bb35acc8-6469-499b-b565-3f2d590b13bc",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 12.** \n",
|
||
"\n",
|
||
"Write a `policy_improvement` function whose inputs are the state-value function `V`, the transition probability matrix `P`, the reward vector `R`, and the discount factor $\\gamma$. \n",
|
||
"The function should return a **greedy policy** that, for each state, selects the action that maximizes the expected return according to the input `V`.\n",
|
||
"\n",
|
||
"\n",
|
||
"*Question: Why don’t we input the old policy in this policy improvement step?*\n",
|
||
"\n",
|
||
"\n",
|
||
"*Remark.* In this maze game, we consider a deterministic policy $\\pi:s\\in\\mathcal{S}\\mapsto a\\in\\mathcal{A}$ that assigns one single action to each state.\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "94f00eeb-63d4-43a1-813f-37dda7276693",
|
||
"metadata": {},
|
||
"source": [
|
||
"------------------\n",
|
||
"\n",
|
||
"*Hint.* 1. This exercise can be completed in two steps. \n",
|
||
"\n",
|
||
"In the first step, compute the action-value function $q^{\\pi}(s,a)$ from the state-value function $s' \\mapsto v^{\\pi}(s') $, for a fixed state $ s $. \n",
|
||
"Which formula should be used to express $ q^{\\pi}(s,a) $ in terms of $ v^{\\pi} $?\n",
|
||
"\n",
|
||
"In the second step, perform the greedy policy improvement step by computing a new policy $ \\pi' $ such that\n",
|
||
"$$\n",
|
||
"\\pi'(s) = \\arg\\max_{a} q^{\\pi}(s,a).\n",
|
||
"$$\n",
|
||
"\n",
|
||
"Note: for terminal states the action choice is irrelevant, so we can simply assign action 0 to them. \n",
|
||
"\n",
|
||
"2. Bellman action-value equation for the maze: \n",
|
||
"\n",
|
||
"In this maze environment, the **immediate reward depends only on the current state (for non-terminal states)**:\n",
|
||
"\n",
|
||
"$$\n",
|
||
"r(s,a,s') = R(s).\n",
|
||
"$$\n",
|
||
"\n",
|
||
"This means:\n",
|
||
"\n",
|
||
"- The reward does **not** depend on the action taken.\n",
|
||
"- The reward does **not** depend on the next state.\n",
|
||
"- All actions taken from the same state yield the same immediate reward.\n",
|
||
"\n",
|
||
"The general Bellman equation for the action-value function is:\n",
|
||
"\n",
|
||
"$$\n",
|
||
"Q(s,a)\n",
|
||
"=\n",
|
||
"\\sum_{s'} P(s' \\mid s,a)\n",
|
||
"\\left(\n",
|
||
"r(s,a,s') + \\gamma V(s')\n",
|
||
"\\right).\n",
|
||
"$$\n",
|
||
"\n",
|
||
"Since the reward satisfies \n",
|
||
"$$\n",
|
||
"r(s,a,s') = R(s),\n",
|
||
"$$\n",
|
||
"we can simplify the expression:\n",
|
||
"\n",
|
||
"$$\n",
|
||
"\\begin{aligned}\n",
|
||
"Q(s,a)\n",
|
||
"&= \\sum_{s'} P(s' \\mid s,a)\n",
|
||
"\\left(\n",
|
||
"R(s) + \\gamma V(s')\n",
|
||
"\\right) \\\\\n",
|
||
"&= R(s) + \\gamma \\sum_{s'} P(s' \\mid s,a) V(s').\n",
|
||
"\\end{aligned}\n",
|
||
"$$\n",
|
||
"\n"
|
||
]
|
||
},
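As a worked illustration of the simplified formula $Q(s,a) = R(s) + \gamma \sum_{s'} P(s' \mid s,a) V(s')$, here is a minimal sketch on a hypothetical 3-state, 2-action toy MDP. The toy transition table, rewards, value estimate, and the helper names `q_values` and `greedy_action` are assumptions made for this example; they are not the lab's maze or its NumPy arrays, and plain Python lists are used so the snippet runs on its own.

```python
# Toy MDP (hypothetical, NOT the lab's maze): 3 states, 2 actions.
# P[a][s][s2] is the probability of moving from s to s2 under action a.
P = [
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],  # action 0
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # action 1
]
R = [-1.0, -1.0, 0.0]  # reward depends only on the current state
V = [0.0, 5.0, 10.0]   # an arbitrary value estimate
GAMMA = 0.5


def q_values(s, V, P, R, gamma):
    """Return [Q(s, a) for every a], with Q(s,a) = R(s) + gamma * sum_s' P(s'|s,a) V(s')."""
    return [
        R[s] + gamma * sum(p * v for p, v in zip(P[a][s], V))
        for a in range(len(P))
    ]


def greedy_action(s, V, P, R, gamma):
    """Index of the action with the largest Q(s, a); ties go to the lowest index."""
    q = q_values(s, V, P, R, gamma)
    return max(range(len(q)), key=lambda a: q[a])


print(q_values(0, V, P, R, GAMMA))       # Q-values of state 0
print(greedy_action(0, V, P, R, GAMMA))  # greedy action in state 0
```

Because $R(s)$ is the same for every action, the greedy choice is driven entirely by the expected next-state value $\sum_{s'} P(s' \mid s,a) V(s')$.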
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 26,
|
||
"id": "998e3005-9a8b-4759-a008-aeccedd25924",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def policy_improvement(\n",
|
||
" V: np.ndarray,\n",
|
||
" P: np.ndarray,\n",
|
||
" R: np.ndarray,\n",
|
||
" gamma: float,\n",
|
||
") -> np.ndarray:\n",
|
||
" \"\"\"Given a value function V, output a greedy policy.\n",
|
||
"\n",
|
||
" Args:\n",
|
||
" V: array of shape (n_states,)\n",
|
||
" P: array of shape (n_actions, n_states, n_states)\n",
|
||
" R: array of shape (n_states,)\n",
|
||
" gamma: discount factor\n",
|
||
"\n",
|
||
" Returns:\n",
|
||
" policy: array of shape (n_states,), with values in {0,1,2,3}\n",
|
||
"\n",
|
||
"    \"\"\"\n",
|
||
"    n_states = len(R)\n",
|
||
" n_actions = P.shape[0]\n",
|
||
" policy = np.zeros(\n",
|
||
" n_states,\n",
|
||
" dtype=int,\n",
|
||
" )\n",
|
||
"\n",
|
||
" for s in range(n_states):\n",
|
||
" if is_terminal(s):\n",
|
||
" policy[s] = 0\n",
|
||
" continue\n",
|
||
"\n",
|
||
" Q_values = np.zeros(n_actions)\n",
|
||
" for a in range(n_actions):\n",
|
||
" Q_values[a] = R[s] + gamma * np.dot(\n",
|
||
" P[a, s, :],\n",
|
||
" V,\n",
|
||
" )\n",
|
||
" policy[s] = int(\n",
|
||
" np.argmax(Q_values),\n",
|
||
" )\n",
|
||
" return policy\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "28800a10-f76b-4f27-a697-1238678f6bb3",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 13.** \n",
|
||
"\n",
|
||
"Write a `policy_iteration` function whose inputs are the initial policy `initial_policy`, the transition probability matrix `P`, the reward vector `R`, the discount factor $\\gamma$ `gamma`, the tolerance parameter `theta` used in policy evaluation (the evaluation stops when the value function changes by less than `theta`), and `max_iter`, which serves as a safety limit to prevent the loop from running indefinitely. \n",
|
||
"\n",
|
||
"The function should return two outputs: \n",
|
||
"- `policy`, the final (optimal) policy, represented as an array of action indices; \n",
|
||
"- `V`, the value function corresponding to this policy.\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "22663608-c3a4-47d0-b1c9-b7d1bb67fa64",
|
||
"metadata": {},
|
||
"source": [
|
||
"--------------------------\n",
|
||
"\n",
|
||
"*Hint.* The `policy_iteration` algorithm consists of two main steps. \n",
|
||
"First, the **policy evaluation** step, where you will use the function implemented in **Exercise 8**. \n",
|
||
"Second, the **policy improvement** step, where you will use the function implemented in **Exercise 12**.\n",
|
||
"\n",
|
||
"--------------------------"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 27,
|
||
"id": "8b6c2216",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def policy_iteration( # noqa: PLR0913\n",
|
||
" initial_policy: np.ndarray,\n",
|
||
" P: np.ndarray,\n",
|
||
" R: np.ndarray,\n",
|
||
" gamma: float,\n",
|
||
" theta: float = 1e-6,\n",
|
||
" max_iter: int = 1000,\n",
|
||
") -> tuple[np.ndarray, np.ndarray]:\n",
|
||
" \"\"\"Policy Iteration.\n",
|
||
"\n",
|
||
" Goal:\n",
|
||
" Learn an optimal policy by alternating:\n",
|
||
" 1) Policy Evaluation\n",
|
||
" 2) Policy Improvement\n",
|
||
"\n",
|
||
" Inputs:\n",
|
||
"        initial_policy : array of shape (n_states,)\n",
|
||
" Initial deterministic policy.\n",
|
||
" P : transition probabilities\n",
|
||
" R : reward function\n",
|
||
" gamma : discount factor\n",
|
||
" theta : stopping threshold for policy evaluation\n",
|
||
" max_iter : maximum number of policy iteration steps\n",
|
||
"\n",
|
||
" Returns:\n",
|
||
" policy : optimal policy\n",
|
||
" V : value function of the optimal policy\n",
|
||
"\n",
|
||
" \"\"\"\n",
|
||
" policy = initial_policy\n",
|
||
"\n",
|
||
" for _it in range(max_iter):\n",
|
||
" V = policy_evaluation(policy, P, R, gamma, theta)\n",
|
||
" new_policy = policy_improvement(V, P, R, gamma)\n",
|
||
" if np.array_equal(new_policy, policy):\n",
|
||
" break\n",
|
||
" policy = new_policy\n",
|
||
" return policy, V\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "7ed35eca-81fd-45ac-bc8c-550117124e21",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 14.** \n",
|
||
"\n",
|
||
"Starting from a random policy (see Section 3.3), compute an optimal policy for the Maze game. \n",
|
||
"Then, plot the value function of this optimal policy and visualize the policy itself by displaying arrows on the maze.\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 28,
|
||
"id": "e12d62ee-3324-4e1b-b5e2-6be96404ac2c",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[0 3 3 2 1 3 2 1 1 0 0 2 3 0 3 3 1 2 0 3 2 1]\n",
|
||
"[1 1 1 2 2 0 2 2 3 0 1 2 0 3 1 1 0 0 0 0 1 0]\n",
|
||
"Optimal value function:\n",
|
||
"[ 11.62 12.311 13.095 13.901 13.875 11.596 14.756 15.6 14.699\n",
|
||
" 10.921 15.631 16.589 10.263 9.633 17.643 18.804 20. 9.633\n",
|
||
" 8.156 -20. 15.522 17.679]\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"image/png": "",
|
||
"text/plain": [
|
||
"<Figure size 640x480 with 2 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"image/png": "",
|
||
"text/plain": [
|
||
"<Figure size 640x480 with 1 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"random_policy = rng.integers(low=0, high=len(ACTIONS), size=n_states)\n",
|
||
"print(random_policy)\n",
|
||
"\n",
|
||
"opt_policy, V_opt = policy_iteration(\n",
|
||
" random_policy,\n",
|
||
" P,\n",
|
||
" R,\n",
|
||
" gamma,\n",
|
||
" theta=1e-6,\n",
|
||
" max_iter=1000,\n",
|
||
")\n",
|
||
"print(opt_policy)\n",
|
||
"print(\"Optimal value function:\")\n",
|
||
"print(V_opt)\n",
|
||
"\n",
|
||
"plot_values(V_opt, title=\"Optimal value function (policy iteration)\")\n",
|
||
"plot_policy(opt_policy, title=\"Optimal policy (policy iteration)\")\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "7d52252c-7aee-4be9-9896-939348add5de",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 5. Dynamic programming: Value iteration"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "00c8cbd1-e0ea-4919-b61f-ff0bc3c95880",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 15.** Write a `value_iteration` function whose inputs are the transition probability matrix `P`, the reward vector `R`, the discount factor `gamma` ($\\gamma$), the stopping tolerance `theta` (stop when the largest change of the value function between two sweeps is less than `theta`), and `max_iter`, a safety limit that prevents the loop from running indefinitely.  \n",
|
||
"\n",
|
||
"`value_iteration` returns `V`, an approximation of the optimal value function, and `policy`, a greedy policy derived from the final `V`.\n",
|
||
"\n",
|
||
"*Question:* Do `value_iteration` and `policy_iteration` find the same optimal policy?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "fd52fa5c-dbaa-4281-a1a6-dd7180123756",
|
||
"metadata": {},
|
||
"source": [
|
||
"*Hint.* Value iteration repeatedly applies the Bellman optimality operator. In the maze case, it is \n",
|
||
"$$\n",
|
||
"(\\mathcal{T}^* V)(s)=\\max_a \\Big\\{ R(s) + \\gamma \\sum_{s'}P(s'|s,a)V(s')\\Big\\}\n",
|
||
"$$"
|
||
]
|
||
},
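To make "repeatedly applying the Bellman optimality operator" concrete, the following sketch runs synchronous value iteration on a hypothetical 3-state toy MDP until the sup-norm change drops below a tolerance. The toy data and the names `bellman_sweep` and `value_iteration_toy` are assumptions for illustration only. Unlike the lab's maze, the absorbing state here is handled by its self-loop rather than by a separate terminal-state rule.

```python
# Toy MDP (hypothetical, NOT the lab's maze): state 2 is absorbing with reward 0.
# P[a][s][s2] is the probability of moving from s to s2 under action a.
P = [
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],  # action 0: move left
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # action 1: move right
]
R = [-1.0, -1.0, 0.0]
GAMMA = 0.5


def bellman_sweep(V):
    """One application of T*: (T*V)(s) = max_a { R(s) + gamma * sum_s' P(s'|s,a) V(s') }."""
    return [
        max(
            R[s] + GAMMA * sum(p * v for p, v in zip(P[a][s], V))
            for a in range(len(P))
        )
        for s in range(len(R))
    ]


def value_iteration_toy(theta=1e-9, max_iter=1000):
    """Iterate V <- T*V from V = 0; return the final V and the number of sweeps."""
    V = [0.0] * len(R)
    for sweep in range(1, max_iter + 1):
        V_new = bellman_sweep(V)
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < theta:
            return V, sweep
    return V, max_iter


V_star, n_sweeps = value_iteration_toy()
print(V_star, n_sweeps)
```

On this toy problem the iterates reach the fixed point exactly after a few sweeps; on the maze, convergence is only approximate and is controlled by `theta`.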
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 29,
|
||
"id": "293ba7fc-f9dc-41b0-ad78-677af1ac7e0f",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def value_iteration(\n",
|
||
" P: np.ndarray,\n",
|
||
" R: np.ndarray,\n",
|
||
" gamma: float,\n",
|
||
" theta: float = 1e-6,\n",
|
||
" max_iter: int = 10_000,\n",
|
||
") -> tuple[np.ndarray, np.ndarray]:\n",
|
||
" \"\"\"Value Iteration (student version).\n",
|
||
"\n",
|
||
" Goal:\n",
|
||
" Approximate the optimal value function V*\n",
|
||
" and derive an optimal policy.\n",
|
||
"\n",
|
||
" Inputs:\n",
|
||
" P : array of shape (n_actions, n_states, n_states)\n",
|
||
" Transition probabilities.\n",
|
||
" R : array of shape (n_states,)\n",
|
||
" Reward for each state.\n",
|
||
" gamma : float\n",
|
||
" Discount factor.\n",
|
||
" theta : float\n",
|
||
" Stopping tolerance for convergence.\n",
|
||
" max_iter : int\n",
|
||
" Maximum number of iterations.\n",
|
||
"\n",
|
||
" Returns:\n",
|
||
" V : array of shape (n_states,)\n",
|
||
" Approximation of the optimal value function V*.\n",
|
||
" policy : array of shape (n_states,)\n",
|
||
" Greedy policy derived from V.\n",
|
||
"\n",
|
||
" \"\"\"\n",
|
||
" n_states = len(R)\n",
|
||
" n_actions = P.shape[0]\n",
|
||
" V = np.zeros(n_states)\n",
|
||
"\n",
|
||
" for _it in range(max_iter):\n",
|
||
" V_new = np.zeros_like(V)\n",
|
||
"\n",
|
||
" for s in range(n_states):\n",
|
||
" if is_terminal(s):\n",
|
||
" V_new[s] = R[s] / (1 - gamma)\n",
|
||
" continue\n",
|
||
"\n",
|
||
" Q_values = np.zeros(\n",
|
||
" n_actions,\n",
|
||
" )\n",
|
||
" for a in range(n_actions):\n",
|
||
" Q_values[a] = R[s] + gamma * np.dot(P[a, s, :], V)\n",
|
||
" V_new[s] = np.max(Q_values)\n",
|
||
"\n",
|
||
" delta = np.max(np.abs(V_new - V))\n",
|
||
" V = V_new\n",
|
||
" if delta < theta:\n",
|
||
" break\n",
|
||
"\n",
|
||
" policy = policy_improvement(\n",
|
||
" V,\n",
|
||
" P,\n",
|
||
" R,\n",
|
||
" gamma,\n",
|
||
" )\n",
|
||
" return V, policy\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 30,
|
||
"id": "9b6ff9d3-ccc9-4f35-a6c3-545aeed552f7",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"image/png": "",
|
||
"text/plain": [
|
||
"<Figure size 640x480 with 2 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"image/png": "",
|
||
"text/plain": [
|
||
"<Figure size 640x480 with 1 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"V_vi, policy_vi = value_iteration(P, R, gamma)\n",
|
||
"\n",
|
||
"plot_values(V_vi, title=\"Optimal value function (value iteration)\")\n",
|
||
"plot_policy(policy_vi, title=\"Optimal policy (value iteration)\")\n",
|
||
"\n",
|
||
"print(np.abs(opt_policy - policy_vi))\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "f4db246d-07c2-4587-b185-7298fe292674",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 6. Advanced exercises \n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "b0c04200-39f7-41fe-a479-62c87efab8a3",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 16 (Policy Iteration vs Value Iteration)**\n",
|
||
"\n",
|
||
"In this exercise, we compare the number of iterations required by **policy iteration** and **value iteration** to reach an optimal policy in the Maze game.\n",
|
||
"\n",
|
||
"1. Modify the definition of `policy_iteration` and `value_iteration` so that they can record:\n",
|
||
" - the **number of iterations** until convergence,\n",
|
||
" - and optionally the runtime.\n",
|
||
"2. Run both algorithms starting from:\n",
|
||
" - the same random initialization (a random policy for policy iteration, and $V_0 \\equiv 0$ for value iteration),\n",
|
||
" - and repeat the experiment over several random seeds in order to compute the **average number of iterations** and **average runtime**.\n",
|
||
"3. Report and interpret the results.\n",
|
||
"\n",
|
||
"*Question:* What do you observe?\n",
|
||
"\n",
|
||
"-------------\n",
|
||
"\n",
|
||
"*Hint.* The word “iteration” means something different for **policy iteration** and **value iteration**:\n",
|
||
"\n",
|
||
"- Policy iteration: one “iteration” = one outer loop step = policy evaluation + policy improvement.\n",
|
||
"- Value iteration: one “iteration” = one Bellman optimality sweep over all states.\n",
|
||
"\n",
|
||
"-------------\n"
|
||
]
|
||
},
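For the runtime part of the comparison, one possible approach is a small timing helper that calls a function several times and averages the wall-clock duration measured with `time.perf_counter`. The helper name `mean_runtime` and the stand-in workload are hypothetical; in the lab, the callable would wrap a call to `policy_iteration` or `value_iteration`.

```python
import time


def mean_runtime(fn, repeats=5):
    """Call fn() `repeats` times; return the last result and the mean duration in seconds."""
    durations = []
    result = None
    for _ in range(repeats):
        t0 = time.perf_counter()
        result = fn()
        durations.append(time.perf_counter() - t0)
    return result, sum(durations) / len(durations)


# Stand-in workload (hypothetical); replace with e.g. lambda: value_iteration(P, R, gamma)
result, avg_seconds = mean_runtime(lambda: sum(range(1000)), repeats=10)
print(result, avg_seconds)
```

Averaging over repeats reduces timing noise, but the iteration counts remain the more reliable quantity to compare, since one policy-iteration step contains a full policy evaluation.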
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 31,
|
||
"id": "cce78e3c-ca82-4002-9a8f-08af9457147c",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Same policy? True\n",
|
||
"Same policy? True\n",
|
||
"Same policy? True\n",
|
||
"Same policy? True\n",
|
||
"Same policy? True\n",
|
||
"Same policy? True\n",
|
||
"Same policy? True\n",
|
||
"Same policy? True\n",
|
||
"Same policy? True\n",
|
||
"Same policy? True\n",
|
||
"Policy Iteration - outer iterations: [3, 3, 2, 3, 4, 3, 3, 3, 3, 4]\n",
|
||
"Value Iteration - iterations: [34, 34, 34, 34, 34, 34, 34, 34, 34, 34]\n",
|
||
"Mean PI iterations: 3.1\n",
|
||
"Mean VI iterations: 34.0\n",
|
||
"Mean PI runtime: 0.033542495701112784\n",
|
||
"Mean VI runtime: 0.007521004001318943\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"import time\n",
|
||
"\n",
|
||
"\n",
|
||
"def policy_iteration_count( # noqa: PLR0913\n",
|
||
" P: np.ndarray,\n",
|
||
" R: np.ndarray,\n",
|
||
" gamma: float,\n",
|
||
" theta: float = 1e-6,\n",
|
||
" max_iter: int = 1_000,\n",
|
||
" seed: int = 0,\n",
|
||
") -> tuple[np.ndarray, np.ndarray, int, float]:\n",
|
||
" \"\"\"Policy Iteration that counts the number of outer iterations and runtime.\n",
|
||
"\n",
|
||
" Args:\n",
|
||
" P: array of shape (n_actions, n_states, n_states)\n",
|
||
" R: array of shape (n_states,)\n",
|
||
" gamma: discount factor\n",
|
||
" theta: convergence threshold for policy evaluation\n",
|
||
" max_iter: maximum number of outer iterations\n",
|
||
" seed: random seed for initial policy\n",
|
||
"\n",
|
||
" Returns:\n",
|
||
" policy: optimal policy found\n",
|
||
" V: value function of the optimal policy\n",
|
||
" n_iterations: number of outer iterations until convergence\n",
|
||
" runtime: total runtime in seconds\n",
|
||
"\n",
|
||
" \"\"\"\n",
|
||
" rng = np.random.default_rng(\n",
|
||
" seed,\n",
|
||
" )\n",
|
||
" n_states = len(R)\n",
|
||
" n_actions = P.shape[0]\n",
|
||
"\n",
|
||
" policy = rng.integers(\n",
|
||
" low=0,\n",
|
||
" high=n_actions,\n",
|
||
" size=n_states,\n",
|
||
" )\n",
|
||
"\n",
|
||
" t0 = time.perf_counter()\n",
|
||
"\n",
|
||
" for it in range(max_iter):\n",
|
||
" V = policy_evaluation(\n",
|
||
" policy,\n",
|
||
" P,\n",
|
||
" R,\n",
|
||
" gamma,\n",
|
||
" theta=theta,\n",
|
||
" )\n",
|
||
"\n",
|
||
" new_policy = policy_improvement(\n",
|
||
" V,\n",
|
||
" P,\n",
|
||
" R,\n",
|
||
" gamma,\n",
|
||
" )\n",
|
||
"\n",
|
||
" if np.array_equal(\n",
|
||
" new_policy,\n",
|
||
" policy,\n",
|
||
" ):\n",
|
||
" runtime = time.perf_counter() - t0\n",
|
||
" return (\n",
|
||
" policy,\n",
|
||
" V,\n",
|
||
" it + 1,\n",
|
||
" runtime,\n",
|
||
" )\n",
|
||
"\n",
|
||
" policy = new_policy\n",
|
||
"\n",
|
||
" runtime = time.perf_counter() - t0\n",
|
||
" return (\n",
|
||
" policy,\n",
|
||
" V,\n",
|
||
" max_iter,\n",
|
||
" runtime,\n",
|
||
" )\n",
|
||
"\n",
|
||
"\n",
|
||
"def value_iteration_count(\n",
|
||
" P: np.ndarray,\n",
|
||
" R: np.ndarray,\n",
|
||
" gamma: float,\n",
|
||
" theta: float = 1e-6,\n",
|
||
" max_iter: int = 100_000,\n",
|
||
") -> tuple[np.ndarray, np.ndarray, int, float]:\n",
|
||
" \"\"\"Value Iteration that counts the number of iterations and runtime.\n",
|
||
"\n",
|
||
" Args:\n",
|
||
" P: array of shape (n_actions, n_states, n_states)\n",
|
||
" R: array of shape (n_states,)\n",
|
||
" gamma: discount factor\n",
|
||
" theta: convergence threshold\n",
|
||
" max_iter: maximum number of iterations\n",
|
||
"\n",
|
||
" Returns:\n",
|
||
" V: array of shape (n_states,)\n",
|
||
" Approximation of the optimal value function V*.\n",
|
||
" policy: array of shape (n_states,)\n",
|
||
" Greedy policy derived from V.\n",
|
||
" n_iterations: number of iterations until convergence\n",
|
||
" runtime: total runtime in seconds\n",
|
||
"\n",
|
||
" \"\"\"\n",
|
||
" n_states = len(R)\n",
|
||
" n_actions = P.shape[0]\n",
|
||
"\n",
|
||
" V = np.zeros(n_states)\n",
|
||
" t0 = time.perf_counter()\n",
|
||
"\n",
|
||
" for it in range(max_iter):\n",
|
||
" V_new = np.zeros_like(V)\n",
|
||
"\n",
|
||
" for s in range(n_states):\n",
|
||
" if is_terminal(s):\n",
|
||
" V_new[s] = R[s] / (1 - gamma)\n",
|
||
" continue\n",
|
||
"\n",
|
||
" Q = np.zeros(n_actions)\n",
|
||
" for a in range(n_actions):\n",
|
||
" Q[a] = R[s] + gamma * np.dot(P[a, s, :], V)\n",
|
||
" V_new[s] = np.max(Q)\n",
|
||
"\n",
|
||
" delta = np.max(np.abs(V_new - V))\n",
|
||
" V = V_new\n",
|
||
"\n",
|
||
" if delta < theta:\n",
|
||
" runtime = time.perf_counter() - t0\n",
|
||
" policy = policy_improvement(V, P, R, gamma)\n",
|
||
" return (\n",
|
||
" V,\n",
|
||
" policy,\n",
|
||
" it + 1,\n",
|
||
" runtime,\n",
|
||
" )\n",
|
||
"\n",
|
||
" runtime = time.perf_counter() - t0\n",
|
||
" policy = policy_improvement(\n",
|
||
" V,\n",
|
||
" P,\n",
|
||
" R,\n",
|
||
" gamma,\n",
|
||
" )\n",
|
||
" return (\n",
|
||
" V,\n",
|
||
" policy,\n",
|
||
" max_iter,\n",
|
||
" runtime,\n",
|
||
" )\n",
|
||
"\n",
|
||
"\n",
|
||
"# Next, run the comparison over several seeds\n",
|
||
"gamma = 0.9\n",
|
||
"theta = 1e-6\n",
|
||
"seeds = list(range(10))\n",
|
||
"\n",
|
||
"pi_iters = []\n",
|
||
"vi_iters = []\n",
|
||
"pi_times = []\n",
|
||
"vi_times = []\n",
|
||
"\n",
|
||
"for seed in seeds:\n",
|
||
" pi_policy, pi_V, n_pi, t_pi = policy_iteration_count( # noqa: N816\n",
|
||
" P,\n",
|
||
" R,\n",
|
||
" gamma,\n",
|
||
" theta=theta,\n",
|
||
" seed=seed,\n",
|
||
" )\n",
|
||
" vi_V, vi_policy, n_vi, t_vi = value_iteration_count(P, R, gamma, theta=theta) # noqa: N816\n",
|
||
"\n",
|
||
" pi_iters.append(n_pi)\n",
|
||
" vi_iters.append(n_vi)\n",
|
||
" pi_times.append(t_pi)\n",
|
||
" vi_times.append(t_vi)\n",
|
||
"\n",
|
||
" print(\"Same policy?\", np.array_equal(pi_policy, vi_policy))\n",
|
||
"\n",
|
||
"\n",
|
||
"print(\"Policy Iteration - outer iterations:\", pi_iters)\n",
|
||
"print(\"Value Iteration - iterations:\", vi_iters)\n",
|
||
"\n",
|
||
"print(\"Mean PI iterations:\", np.mean(pi_iters))\n",
|
||
"print(\"Mean VI iterations:\", np.mean(vi_iters))\n",
|
||
"print(\"Mean PI runtime:\", np.mean(pi_times))\n",
|
||
"print(\"Mean VI runtime:\", np.mean(vi_times))\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "c4e197e2-b0e4-4d8c-b5ab-8028385c4cd3",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 17.** (Asynchronous Value Iteration)\n",
|
||
"\n",
|
||
"Implement asynchronous value iteration, where the value function is updated in place:\n",
|
||
"$$\n",
|
||
"V(s) \\leftarrow \\max_a\\left\\{R(s)+\\gamma \\sum_{s'}P(s'|s,a)V(s')\\right\\}.\n",
|
||
"$$\n",
|
||
"\n",
|
||
"Compare the number of iterations needed for convergence with the synchronous version.\n",
|
||
"\n",
|
||
"-------------------\n",
|
||
"\n",
|
||
"*Hint.* Synchronous value iteration uses a copy `V_new` and updates all states from the old `V`. Asynchronous value iteration updates `V[s]` immediately, so later states in the same sweep can use the newest values.\n",
|
||
"\n",
|
||
"-------------------\n",
|
||
"\n"
|
||
]
|
||
},
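The difference between the two update schemes can be seen on a hypothetical 5-state corridor whose goal sits at index 0. Everything in this sketch (the corridor, the fixed goal value `V_GOAL`, zero step reward, and the function names) is an assumption chosen so that the arithmetic is exact; it is not the lab's maze.

```python
GAMMA = 0.5
N = 5          # states 0..4 on a line; state 0 is the terminal goal
V_GOAL = 8.0   # fixed value of the terminal state (toy convention)


def backup(s, V):
    """Bellman optimality backup for state s (reward 0 on non-terminal states)."""
    if s == 0:
        return V_GOAL
    left = V[s - 1]
    right = V[min(s + 1, N - 1)]  # bumping into the right wall means staying put
    return GAMMA * max(left, right)


def synchronous(theta=1e-9, max_iter=1000):
    V = [0.0] * N
    for sweep in range(1, max_iter + 1):
        V_new = [backup(s, V) for s in range(N)]  # every backup reads the old V
        delta = max(abs(a - b) for a, b in zip(V_new, V))
        V = V_new
        if delta < theta:
            return V, sweep
    return V, max_iter


def asynchronous(theta=1e-9, max_iter=1000):
    V = [0.0] * N
    for sweep in range(1, max_iter + 1):
        delta = 0.0
        for s in range(N):
            old = V[s]
            V[s] = backup(s, V)  # later states see this update immediately
            delta = max(delta, abs(V[s] - old))
        if delta < theta:
            return V, sweep
    return V, max_iter


V_sync, sweeps_sync = synchronous()
V_async, sweeps_async = asynchronous()
print(sweeps_sync, sweeps_async)
```

Here the in-place version needs far fewer sweeps because the states are visited in the same direction in which information flows away from the goal. With a less favourable sweep order the gain can be much smaller, so this should be read as a best case.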
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 32,
|
||
"id": "0b3469a6",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Same policy? True\n",
|
||
"Async iterations: 22\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"def asynchronous_value_iteration(\n",
|
||
" P: np.ndarray,\n",
|
||
" R: np.ndarray,\n",
|
||
" gamma: float,\n",
|
||
" theta: float = 1e-6,\n",
|
||
" max_iter: int = 200_000,\n",
|
||
") -> tuple[np.ndarray, np.ndarray, int]:\n",
|
||
" \"\"\"Asynchronous Value Iteration.\n",
|
||
"\n",
|
||
" Args:\n",
|
||
" P: array of shape (n_actions, n_states, n_states)\n",
|
||
" R: array of shape (n_states,)\n",
|
||
" gamma: discount factor\n",
|
||
" theta: convergence threshold\n",
|
||
" max_iter: maximum number of iterations\n",
|
||
"\n",
|
||
" Returns:\n",
|
||
" V: array of shape (n_states,)\n",
|
||
" Approximation of the optimal value function V*.\n",
|
||
" policy: array of shape (n_states,)\n",
|
||
" Greedy policy derived from V.\n",
|
||
" n_iterations: number of iterations until convergence\n",
|
||
"\n",
|
||
" \"\"\"\n",
|
||
" n_states = len(R)\n",
|
||
" n_actions = P.shape[0]\n",
|
||
" V = np.zeros(n_states)\n",
|
||
"\n",
|
||
" for _it in range(max_iter):\n",
|
||
" delta = 0.0\n",
|
||
"\n",
|
||
" for s in range(n_states):\n",
|
||
" v_old = V[s]\n",
|
||
"\n",
|
||
" if is_terminal(s):\n",
|
||
" V[s] = R[s] / (1 - gamma)\n",
|
||
" else:\n",
|
||
" Q = np.zeros(n_actions)\n",
|
||
" for a in range(n_actions):\n",
|
||
" Q[a] = R[s] + gamma * np.dot(P[a, s, :], V)\n",
|
||
" V[s] = np.max(Q)\n",
|
||
"\n",
|
||
" delta = max(\n",
|
||
" delta,\n",
|
||
" abs(V[s] - v_old),\n",
|
||
" )\n",
|
||
"\n",
|
||
" if delta < theta:\n",
|
||
" break\n",
|
||
"\n",
|
||
" pi = policy_improvement(V, P, R, gamma)\n",
|
||
" return V, pi, _it + 1\n",
|
||
"\n",
|
||
"\n",
|
||
"gamma = 0.9\n",
|
||
"V_sync, pi_sync = value_iteration(P, R, gamma, theta=1e-6)\n",
|
||
"V_async, pi_async, it_async = asynchronous_value_iteration(P, R, gamma, theta=1e-6)\n",
|
||
"\n",
|
||
"\n",
|
||
"print(\"Same policy?\", np.array_equal(pi_sync, pi_async))\n",
|
||
"print(\"Async iterations:\", it_async)\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "48ae870c-1f18-4f83-8b6d-1ae393d08de3",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 18.** (Bellman Optimality Operator)\n",
|
||
"\n",
|
||
"Show numerically that the Bellman optimality operator $\\mathcal{T}^*$ is a $\\gamma$-contraction in the sup norm, i.e. for arbitrary value functions $V$ and $W$, \n",
|
||
"$$\n",
|
||
"\\big\\Vert \\mathcal{T}^* V - \\mathcal{T}^* W \\big\\Vert_{\\infty}\\leq \\gamma \\Vert V-W\\Vert_{\\infty}\n",
|
||
"$$\n",
|
||
"\n",
|
||
"---------------\n",
|
||
"\n",
|
||
"*Hint.* Generate random value functions $V$ and $W$, apply one Bellman optimality update to each, and compare both sides of the inequality.\n",
|
||
"\n",
|
||
"---------------"
|
||
]
|
||
},
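Before checking the inequality numerically, it may help to see why it should hold. The following is a short proof sketch under the maze's assumption $r(s,a,s') = R(s)$. It uses the elementary fact $\lvert \max_a f(a) - \max_a g(a) \rvert \le \max_a \lvert f(a) - g(a) \rvert$. For any state $s$, the reward term $R(s)$ cancels:

$$
\begin{aligned}
\big\lvert (\mathcal{T}^* V)(s) - (\mathcal{T}^* W)(s) \big\rvert
&\le \max_a \, \gamma \, \Big\lvert \sum_{s'} P(s' \mid s,a) \big( V(s') - W(s') \big) \Big\rvert \\
&\le \gamma \max_a \sum_{s'} P(s' \mid s,a) \, \big\lvert V(s') - W(s') \big\rvert \\
&\le \gamma \, \Vert V - W \Vert_{\infty}.
\end{aligned}
$$

Taking the maximum over $s$ on the left-hand side gives the contraction property. If terminal states are pinned to a constant value, as in the value iteration code of this lab, the difference at those states is zero, so the bound still holds.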
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 33,
|
||
"id": "b64a1e78",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"fails: 0 out of 50\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"def T_opt(V: np.ndarray, P: np.ndarray, R: np.ndarray, gamma: float) -> np.ndarray:\n",
|
||
"    \"\"\"Apply the Bellman optimality operator T^* to V.\n",
|
||
"\n",
|
||
" Args:\n",
|
||
" V: array of shape (n_states,)\n",
|
||
" P: array of shape (n_actions, n_states, n_states)\n",
|
||
" R: array of shape (n_states,)\n",
|
||
" gamma: discount factor\n",
|
||
"\n",
|
||
" Returns:\n",
|
||
" out: array of shape (n_states,)\n",
|
||
"\n",
|
||
" \"\"\"\n",
|
||
" n_states = len(R)\n",
|
||
" n_actions = P.shape[0]\n",
|
||
" out = np.zeros(n_states)\n",
|
||
"\n",
|
||
" for s in range(n_states):\n",
|
||
" if is_terminal(s):\n",
|
||
" out[s] = R[s] / (1 - gamma)\n",
|
||
" else:\n",
|
||
" Q = np.zeros(n_actions)\n",
|
||
" for a in range(n_actions):\n",
|
||
" Q[a] = R[s] + gamma * np.dot(P[a, s, :], V)\n",
|
||
" out[s] = np.max(Q)\n",
|
||
"\n",
|
||
" return out\n",
|
||
"\n",
|
||
"\n",
|
||
"def sup_norm(x: np.ndarray) -> float:\n",
|
||
" \"\"\"Compute the sup norm (infinity norm) of vector x.\"\"\"\n",
|
||
" return np.max(np.abs(x))\n",
|
||
"\n",
|
||
"\n",
|
||
"gamma = 0.9\n",
|
||
"n_states = len(R)\n",
|
||
"rng = np.random.default_rng(seed=0)  # fixed seed so the test is reproducible\n",
|
||
"\n",
|
||
"num_tests = 50\n",
|
||
"fails_numbers = 0\n",
|
||
"\n",
|
||
"for _ in range(num_tests):\n",
|
||
" V = rng.standard_normal(n_states)\n",
|
||
" W = rng.standard_normal(n_states)\n",
|
||
"\n",
|
||
" lhs = sup_norm(T_opt(V, P, R, gamma) - T_opt(W, P, R, gamma))\n",
|
||
" rhs = gamma * sup_norm(V - W)\n",
|
||
"\n",
|
||
" if lhs > rhs + 1e-10:\n",
|
||
" fails_numbers += 1\n",
|
||
"\n",
|
||
"print(\"fails:\", fails_numbers, \"out of\", num_tests)"
|
||
]
},
{
"cell_type": "markdown",
"id": "d58d20fd-509a-41e4-bde1-eca6c9da6ddc",
"metadata": {},
"source": [
"**Exercise 19** (Effect of the Discount Factor)\n",
"\n",
"Recall that the discount factor controls how future rewards are weighted relative to immediate rewards. Run value iteration for different values of the discount factor $\\gamma\\in\\{0.2, 0.5, 0.9, 0.99\\}$.\n",
"\n",
"For each value of $\\gamma$:\n",
"\n",
"1. Compute the optimal value function $V^*$.\n",
"2. Compute the corresponding optimal policy.\n",
"3. Plot the value function and visualize the policy on the maze.\n"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "47219cdc-b99b-4b73-a3d1-c133afb0e215",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gamma=0.2: computed V* and pi*\n"
]
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"gamma=0.5: computed V* and pi*\n"
]
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"gamma=0.9: computed V* and pi*\n"
]
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"gamma=0.99: computed V* and pi*\n"
]
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"gammas = [0.2, 0.5, 0.9, 0.99]\n",
"theta = 1e-6  # tolerance passed to value_iteration\n",
"\n",
"results = {}\n",
"\n",
"for gamma in gammas:\n",
"    # Solve the MDP for this discount factor, then report and plot the result.\n",
"    V_star, pi_star = value_iteration(P, R, gamma, theta=theta)\n",
"    print(f\"gamma={gamma}: computed V* and pi*\")\n",
"\n",
"    results[gamma] = {\"V\": V_star, \"pi\": pi_star}\n",
"\n",
"    plot_values(V_star, title=f\"Optimal Value Function (gamma={gamma})\")\n",
"    plot_policy(pi_star, title=f\"Optimal Policy (gamma={gamma})\")\n"
]
},
{
"cell_type": "markdown",
"id": "31083905-cc29-431e-9f87-6595e187e5d0",
"metadata": {},
"source": [
"**Exercise 20** (What we will learn in the coming weeks)\n",
"\n",
"Assume now that the transition matrix $P$ is unknown.\n",
"\n",
"1. Which parts of policy iteration and value iteration can no longer be applied?\n",
"2. Which quantities would need to be learned from data?\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "e8a7738a-584d-43ae-ba2f-2598608b38fa",
"metadata": {},
"source": [
"**Exercise 21.** Try different configurations of the maze and compute an optimal policy."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "453188a8-bc26-463b-9784-be9c68328495",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "studies",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}