mirror of
https://github.com/ArthurDanjou/ArtStudies.git
synced 2026-01-14 13:54:06 +01:00
1401 lines
133 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "44b75d44",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Lab 2 - Maze Game as a Markov Decision Process Part 1\n",
|
||
"\n",
|
||
"## **1. Objectives**\n",
|
||
"\n",
|
||
"In this lab, we will:\n",
|
||
"\n",
|
||
"- Model a simple **maze game** as a **Markov Decision Process (MDP)** by defining:\n",
|
||
" - **States**\n",
|
||
" - **Actions**\n",
|
||
" - **Transition probabilities**\n",
|
||
" - **Rewards**\n",
|
||
"\n",
|
||
"- Implement **policy evaluation** to compute the value function of a given policy.\n",
|
||
"\n",
|
||
"This week, we **do not** improve the policy and search for an optimal one yet. \n",
|
||
"We will continue working on the Maze Game **next week**, where we will use these components to compute an **optimal policy**.\n",
|
||
"\n",
|
||
"We consider a **discounted MDP** with discount factor $\\gamma \\in (0,1)$.\n",
|
||
"\n",
|
||
"\n",
|
||
"\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 73,
|
||
"id": "100d1e0d",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"import matplotlib.pyplot as plt\n",
|
||
"import numpy as np\n",
|
||
"\n",
|
||
"np.set_printoptions(\n",
|
||
" precision=3, suppress=True\n",
|
||
") # (not mandatory) This line is for limiting floats to 3 decimal places, avoiding scientific notation (like 1.23e-04) for small numbers.\n",
|
||
"\n",
|
||
"# For reproducibility\n",
|
||
"rng = np.random.default_rng(seed=42) # This line creates a random number generator.\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "1018deab",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 2. Maze definition and MDP formulation\n",
|
||
"\n",
|
||
"We consider a small 2D maze on a grid. The agent is a **robot** that moves on the grid.\n",
|
||
"\n",
|
||
"- `S` : start state\n",
|
||
"- `G` : goal state, with positive reward\n",
|
||
"- `#` : wall (not accessible)\n",
|
||
"- `.` : empty cell\n",
|
||
"- `X` : \"trap\" (negative reward)\n",
|
||
"\n",
|
||
"At each step, the robot can choose among 4 actions:\n",
|
||
"\n",
|
||
"$$\n",
|
||
"\\mathcal{A} = \\{\\text{Up} \\uparrow, \\quad \\text{Right} \\rightarrow, \\quad \\text{Down} \\downarrow, \\quad \\text{Left}\\leftarrow\\}.\n",
|
||
"$$\n",
|
||
"\n",
|
||
"The movement is deterministic, but here we set a small probability of “error” to make the example more realistic.\n",
|
||
"- With probability $1 - p_{\\text{error}}$, it moves in the chosen direction.\n",
|
||
"- With probability $p_{\\text{error}}$, it moves in a random *other* direction.\n",
|
||
"- If the movement would hit a wall or go outside the grid, the agent stays in place.\n",
|
||
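"\n",
"As a quick sanity check of this error model, here is a small worked example, assuming $p_{\\text{error}} = 0.1$ (the value set later in this lab). For any intended action, the four directions receive the probabilities\n",
"$$\n",
"\\underbrace{1 - p_{\\text{error}}}_{\\text{intended direction}} = 0.9, \\qquad \\underbrace{\\tfrac{p_{\\text{error}}}{3}}_{\\text{each other direction}} \\approx 0.033, \\qquad 0.9 + 3 \\times \\tfrac{0.1}{3} = 1.\n",
"$$\n",
"When several directions are blocked by walls or by the boundary, their probabilities all add up on the “stay in place” outcome.\n",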
"\n",
|
||
"We will represent the MDP with:\n",
|
||
"\n",
|
||
"- A list of **states** $\\mathcal{S} = \\{0, \\dots, n_{S - 1}\\}$, **each corresponding to a grid cell.**\n",
|
||
"- For each action $a$, a transition matrix $P[a]$ of size $(n_S, n_S)$, where\n",
|
||
" $$\n",
|
||
" P[a][s, s'] = \\mathbb{P}(S_{t+1} = s' \\mid S_t = s, A_t = a).\n",
|
||
" $$\n",
|
||
"- A reward vector $R$ of length $n_S$, where $R[s]$ is the immediate reward obtained when **leaving** state $s$.\n",
|
||
"\n",
|
||
"We will use a discount factor $\\gamma = 0.95$.\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "ca4fa301-c14f-44ec-b04f-b01ca42d979a",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 2.1 Define the maze \n",
|
||
"\n",
|
||
"Let us now define the maze as follows."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 74,
|
||
"id": "f91cda05",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"maze_str = [\n",
|
||
" \"#######\",\n",
|
||
" \"S...#.#\",\n",
|
||
" \"#.#...#\",\n",
|
||
" \"#.#..##\",\n",
|
||
" \"#..#..G\",\n",
|
||
" \"#..X..#\",\n",
|
||
" \"#######\",\n",
|
||
"]\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "99820cf4-292d-49ba-b662-f9f05f901f62",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 1.** Compute the dimensions of the maze (complete the “TO DO” parts):\n",
|
||
"- How many rows does the maze have?\n",
|
||
"- How many columns does the maze have?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 75,
|
||
"id": "564cb757-eefe-4be6-9b6f-bb77ace42a97",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"7\n",
|
||
"7\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"n_rows = len(maze_str)\n",
|
||
"print(n_rows)\n",
|
||
"n_cols = len(maze_str[0])\n",
|
||
"print(n_cols)\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 76,
|
||
"id": "26c821d3-2362-4b60-8c77-3d09296d130d",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Maze:\n",
|
||
"#######\n",
|
||
"S...#.#\n",
|
||
"#.#...#\n",
|
||
"#.#..##\n",
|
||
"#..#..G\n",
|
||
"#..X..#\n",
|
||
"#######\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(\"Maze:\")\n",
|
||
"for row in maze_str:\n",
|
||
" print(row)\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "adc49d58-2730-41d8-96fb-ca7c9cb4fcdf",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 2.2 Map each walkable cell (not a wall '#') to a state index\n",
|
||
"\n",
|
||
"Now we convert the maze grid into state indices for the MDP.\n",
|
||
"\n",
|
||
"\n",
|
||
"The cells where the robot is allowed to stand are \n",
|
||
"\n",
|
||
"- . : empty space\n",
|
||
"\n",
|
||
"- S : start\n",
|
||
"\n",
|
||
"- G : goal\n",
|
||
"\n",
|
||
"- X : trap\n",
|
||
"\n",
|
||
"Everything else (i.e., #) is a wall and cannot be a state in the MDP.\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 77,
|
||
"id": "7116044b-c134-43de-9f30-01ab62325300",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"FREE = {\n",
|
||
" \".\",\n",
|
||
" \"S\",\n",
|
||
" \"G\",\n",
|
||
" \"X\",\n",
|
||
"} # The vector Free represents cells that the agent is allowed to move into.\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "1c9ad05e-9c6c-4e00-918c-44b858f45298",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Dictionaries to convert between grid and state index**\n",
|
||
"\n",
|
||
"We now want to identify all **valid states** of the maze (all non-wall cells). \n",
|
||
"To do this, we need two mappings:\n",
|
||
"\n",
|
||
"1. `state_to_pos[s] = (i, j)`: Given a state index $s$, return its grid coordinates (row, column).\n",
|
||
"2. `pos_to_state[(i, j)] = s`: Given coordinates (i, j), return the corresponding state index $s$.\n",
|
||
"\n",
|
||
"These two dictionaries allow easy conversion between **MDP state indices** and the **physical maze positions**. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 78,
|
||
"id": "a1258de4",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Number of states (non-wall cells): 22\n",
|
||
"Start state: 0 at (1, 0)\n",
|
||
"Goal states: [16] at (4, 6)\n",
|
||
"Trap states: [19] at (5, 3)\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"state_to_pos = {} # s -> (i,j)\n",
|
||
"pos_to_state = {} # (i,j) -> s\n",
|
||
"\n",
|
||
"start_state = None # will store the state index of start state\n",
|
||
"goal_states = [] # will store the state index of goal state # We use a list in case there are multiple goals\n",
|
||
"trap_states = [] # will store the state index of trap state # We use a list in case there are multiple traps\n",
|
||
"\n",
|
||
"s = 0\n",
|
||
"for i in range(n_rows): # i = row index\n",
|
||
" for j in range(n_cols): # j = column index\n",
|
||
" cell = maze_str[i][j] # cell = the character at that position (S, ., #, etc.)\n",
|
||
"\n",
|
||
" if (\n",
|
||
" cell in FREE\n",
|
||
" ): # FREE contains: free cells \".\", start cell \"S\", goal cell \"G\" and trap cell \"X\"\n",
|
||
" # Walls # are ignored, they are not MDP states.\n",
|
||
" state_to_pos[s] = (i, j)\n",
|
||
" pos_to_state[(i, j)] = s\n",
|
||
"\n",
|
||
" if cell == \"S\":\n",
|
||
" start_state = s\n",
|
||
" elif cell == \"G\":\n",
|
||
" goal_states.append(s)\n",
|
||
" elif cell == \"X\":\n",
|
||
" trap_states.append(s)\n",
|
||
"\n",
|
||
" s += 1\n",
|
||
"\n",
|
||
"n_states = s\n",
|
||
"\n",
|
||
"print(\"Number of states (non-wall cells):\", n_states)\n",
|
||
"print(\"Start state:\", start_state, \"at\", state_to_pos[start_state])\n",
|
||
"print(\"Goal states:\", goal_states, \"at\", state_to_pos[goal_states[0]])\n",
|
||
"print(\"Trap states:\", trap_states, \"at\", state_to_pos[trap_states[0]])\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "721b968c-a355-46eb-aae4-5950441ba604",
|
||
"metadata": {},
|
||
"source": [
|
||
"*Hint.* If you don’t know what a dictionary is in Python, try the following code to help you understand."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 79,
|
||
"id": "68744dd6-7278-4c20-8b82-34212685352f",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"value2\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"my_dict = {\"key1\": \"value1\", \"key2\": \"value2\"}\n",
|
||
"print(my_dict[\"key2\"])\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "0c76f4e1-b0ba-49c5-b9d5-cfb523024ba9",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 2.** Read the program above and answer the following questions:\n",
|
||
"1. What is the purpose of state_to_pos and pos_to_state?\n",
|
||
"2. Why do we only assign states to cells in FREE?\n",
|
||
"3. What would happen if the maze had multiple goal cells?\n",
|
||
"4. What is the total number of states (n_states) in this maze? Does this match the number of non-wall cells you can count visually?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "4c26a18f-2d03-401c-8eae-f9a17ac55f6d",
|
||
"metadata": {},
|
||
"source": [
|
||
"1. What is the purpose of `state_to_pos` and `pos_to_state`? These dictionaries establish a bijective mapping between the mathematical representation of the state and its spatial representation:\n",
|
||
"\n",
|
||
" `state_to_pos`: Maps the scalar state index `s` (an integer used for matrix/vector operations in RL algorithms like Q-learning) to the grid coordinates (i,j).\n",
|
||
"\n",
|
||
" `pos_to_state`: Maps the grid coordinates (`i,j`) (used to calculate movement and dynamics within the 2D grid) back to the unique state index s.\n",
|
||
"\n",
|
||
"2. Why do we only assign states to cells in FREE? In a Markov Decision Process (MDP), walls (#) are obstructions, not valid states.\n",
|
||
"\n",
|
||
" The agent can never \"be\" in a wall, so assigning a state index to a wall would needlessly increase the dimensionality of the state space (∣S∣).\n",
|
||
"\n",
|
||
" Excluding walls ensures the transition matrices and value vectors remain compact and contain only reachable positions.\n",
|
||
"\n",
|
||
"3. What would happen if the maze had multiple goal cells?\n",
|
||
"\n",
|
||
" In the code: The logic is robust. Since goal_states is initialized as a list (`[]`), the code would simply append the state index `s` of every `G` cell found during the iteration. The list would contain multiple integers representing all terminal states.\n",
|
||
"\n",
|
||
" Caveat: While the logic holds, the final print statement in the provided script (`state_to_pos[goal_states[0]]`) would only display the coordinates of the first goal found, ignoring the others in the console output.\n",
|
||
"\n",
|
||
"4. What is the total number of states (`n_states`) in this maze? Does this match the number of non-wall cells you can count visually?\n",
|
||
"\n",
|
||
" `n_states` represents the total count of walkable cells (Start, Goal, Trap, and empty space).\n",
|
||
"\n",
|
||
" Yes, this value matches exactly the number of non-wall cells visible in the maze, as the counter s is incremented precisely when a cell is found in the FREE set."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "6d0fa298-7b7c-44fc-bbed-15ea002037c2",
|
||
"metadata": {},
|
||
"source": [
|
||
"-----\n",
|
||
"\n",
|
||
"The following function `plot_maze_with_states` creates a figure showing:\n",
|
||
"- the maze walls and free cells\n",
|
||
"- the state index for each non-wall cell\n",
|
||
"- special labels and colors for S (start state), G (goal state), and X (trap state). "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 80,
|
||
"id": "fc61ceef-217c-47f4-8eba-0353369210db",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"image/png": "",
|
||
"text/plain": [
|
||
"<Figure size 640x480 with 1 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"def plot_maze_with_states():\n",
|
||
" \"\"\"Plot the maze with state indices.\"\"\"\n",
|
||
" grid = np.ones(\n",
|
||
" (n_rows, n_cols)\n",
|
||
" ) # Start with a matrix of ones. Here 1 means “free cell”\n",
|
||
" for i in range(n_rows):\n",
|
||
" for j in range(n_cols):\n",
|
||
" if maze_str[i][j] == \"#\":\n",
|
||
" grid[i, j] = 0 # We replace walls (#) with 0\n",
|
||
"\n",
|
||
" fig, ax = plt.subplots()\n",
|
||
" ax.imshow(grid, cmap=\"gray\", alpha=0.7)\n",
|
||
"\n",
|
||
" # Plot state indices\n",
|
||
" for (\n",
|
||
" s,\n",
|
||
" (i, j),\n",
|
||
" ) in state_to_pos.items(): # Calling .items() returns a list-like sequence of (key, value) pairs in the dictionary.\n",
|
||
" cell = maze_str[i][j]\n",
|
||
"\n",
|
||
" if cell == \"S\":\n",
|
||
" label = f\"S\\n{s}\"\n",
|
||
" color = \"green\"\n",
|
||
" elif cell == \"G\":\n",
|
||
" label = f\"G\\n{s}\"\n",
|
||
" color = \"blue\"\n",
|
||
" elif cell == \"X\":\n",
|
||
" label = f\"X\\n{s}\"\n",
|
||
" color = \"red\"\n",
|
||
" else:\n",
|
||
" label = str(s)\n",
|
||
" color = \"black\"\n",
|
||
"\n",
|
||
" ax.text(\n",
|
||
" j,\n",
|
||
" i,\n",
|
||
" label, # Attention : matplotlib, text(x, y, ...) expects (column, row)\n",
|
||
" ha=\"center\",\n",
|
||
" va=\"center\",\n",
|
||
" fontsize=10,\n",
|
||
" fontweight=\"bold\",\n",
|
||
" color=color,\n",
|
||
" )\n",
|
||
"\n",
|
||
" ax.set_xticks([]) # remove numeric axes, we don't need.\n",
|
||
" ax.set_yticks([])\n",
|
||
" ax.set_title(\"Maze with state indices\")\n",
|
||
"\n",
|
||
" plt.show()\n",
|
||
"\n",
|
||
"\n",
|
||
"plot_maze_with_states()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "db078d86",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 2.4 Actions and deterministic movement"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "96e7f1f2-9d73-410b-853d-e39f40dfb5da",
|
||
"metadata": {},
|
||
"source": [
|
||
"We first define integer codes for each action. \n",
|
||
"\n",
|
||
"**Exercise 3.** How many possible actions can the agent take in the maze?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "22259ab4-527e-4d7c-bb30-98fb240da6d5",
|
||
"metadata": {},
|
||
"source": [
|
||
"We have four possible actions in the maze. \n",
|
||
"\n",
|
||
"In this following cell, each action is mapped to an integer (0,1,2,3). This makes it easy to store and use actions inside arrays and matrices\n",
|
||
"\n",
|
||
"Here we use Unicode arrow character:\n",
|
||
"\n",
|
||
"- \"\\u2191\" : ↑ (up arrow)\n",
|
||
"\n",
|
||
"- \"\\u2192\" : → (right arrow)\n",
|
||
"\n",
|
||
"- \"\\u2193\" : ↓ (down arrow)\n",
|
||
"\n",
|
||
"- \"\\u2190\" : ← (left arrow)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 81,
|
||
"id": "f7f0b8e4-1f48-4d03-9e5f-a47e59c3e827",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"A_UP, A_RIGHT, A_DOWN, A_LEFT = 0, 1, 2, 3\n",
|
||
"ACTIONS = [A_UP, A_RIGHT, A_DOWN, A_LEFT]\n",
|
||
"action_names = {A_UP: \"\\u2191\", A_RIGHT: \"\\u2192\", A_DOWN: \"\\u2193\", A_LEFT: \"\\u2190\"}"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 82,
|
||
"id": "3773781c-a0cd-48db-967b-d4b432d17046",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"↑\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(action_names[0])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "4b957f5a-ee39-4437-abc1-4809105ad83c",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 4.** Now we define a **deterministic movement function** `move_deterministic(i, j, a)`. \n",
|
||
"\n",
|
||
"This function simulates the robot trying to move from (i, j) in direction a.\n",
|
||
"\n",
|
||
"But if the movement hits a wall or boundary, the agent stays in place."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 83,
|
||
"id": "4b06da5e-bc63-48e5-a336-37bce952443d",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"def move_deterministic(i: int, j: int, a: int) -> tuple[int, int]:\n",
|
||
" \"\"\"Deterministic movement on the grid. If the movement hits a wall or boundary, the agent stays in place.\n",
|
||
"\n",
|
||
" Args:\n",
|
||
" i (int): current row index\n",
|
||
" j (int): current column index\n",
|
||
" a (int): action to take (A_UP, A_DOWN, A_LEFT, A_RIGHT)\n",
|
||
"\n",
|
||
" Returns:\n",
|
||
" (tuple[int, int]): new (row, column) position after taking action a\n",
|
||
"\n",
|
||
" \"\"\"\n",
|
||
" candidate_i, candidate_j = (\n",
|
||
" i,\n",
|
||
" j,\n",
|
||
" ) # It means “Unless the action succeeds, the robot stays in place.”\n",
|
||
"\n",
|
||
" # Now each action changes the coordinates of the robot:\n",
|
||
" if a == A_UP:\n",
|
||
" candidate_i, candidate_j = (\n",
|
||
" i - 1,\n",
|
||
" j,\n",
|
||
" ) # if the action is UP, then row becomes row -1\n",
|
||
" elif a == A_DOWN:\n",
|
||
" candidate_i, candidate_j = (\n",
|
||
" i + 1,\n",
|
||
" j,\n",
|
||
" ) # if the action is DOWN, then row becomes row +1\n",
|
||
" elif a == A_LEFT:\n",
|
||
" candidate_i, candidate_j = (\n",
|
||
" i,\n",
|
||
" j - 1,\n",
|
||
" ) # if the action is LEFT, then column becomes column -1\n",
|
||
" elif a == A_RIGHT:\n",
|
||
" candidate_i, candidate_j = (\n",
|
||
" i,\n",
|
||
" j + 1,\n",
|
||
" ) # if the action is RIGHT, then column becomes column +1\n",
|
||
"\n",
|
||
" # Check boundaries\n",
|
||
" if not (0 <= candidate_i < n_rows and 0 <= candidate_j < n_cols):\n",
|
||
" # If the robot tries to move outside the maze\n",
|
||
" # It will not move and it stays at (i, j).\n",
|
||
" return i, j\n",
|
||
"\n",
|
||
" # Check wall\n",
|
||
" if maze_str[candidate_i][candidate_j] == \"#\":\n",
|
||
" # If the next cell is a wall, the robot stays in place.\n",
|
||
" return i, j\n",
|
||
"\n",
|
||
" return candidate_i, candidate_j # Otherwise, return the new position\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "c9e620e6",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 2.5 Transition probabilities and reward function"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "80bd2bca-7717-4b5f-bffa-76fe86a51d35",
|
||
"metadata": {},
|
||
"source": [
|
||
"Recall that we set the discount factor $\\gamma \\in(0,1)$, that is, the future rewards are multiplied by $\\gamma$, so immediate rewards matter a little bit more than future ones. \n",
|
||
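"\n",
"To get a feel for the discounting, here is a small worked computation, assuming the value $\\gamma = 0.95$ used in this lab. A reward received $k$ steps in the future is weighted by $\\gamma^k$, so\n",
"$$\n",
"\\gamma^{1} = 0.95, \\qquad \\gamma^{10} \\approx 0.599, \\qquad \\gamma^{50} \\approx 0.077.\n",
"$$\n",
"A reward that arrives 50 steps later is therefore worth less than a tenth of the same reward received now.\n",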
"\n",
|
||
"\n",
|
||
"Moreover, we consider a probability error $p_{\\text{error}}$, which means, with probability $p_{\\text{error}}$, the robot **does not** execute the intended action but one of the 3 other directions (chosen uniformly). With probability $1-p_{\\text{error}}$, the robot executes the action that we asked."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 84,
|
||
"id": "610253e7-f3f7-4a30-be3e-2ec5a1e2ed04",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"gamma = 0.95\n",
|
||
"p_error = 0.1 # probability of the error to a random other direction\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "0d1ceff8-86e0-4c45-83d3-af9fae974608",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now we initialize the state–transition probability : the probability of reaching next state $s'$ after taking action $a$ in state $s$. \n",
|
||
"$$\n",
|
||
" p(s' \\mid s, a)\n",
|
||
" = \\mathbb{P} \\big[S_t=s'\\,|\\, S_{t-1}=s, \\,A_{t-1}=a\\big]\n",
|
||
"$$\n",
|
||
"\n",
|
||
"We store these transition probabilities in the 3D array `P` (`P[a][s, s_next]`), which has shape `(n_actions, n_states, n_states)`:\n",
|
||
"\n",
|
||
"`P[a, s, s_next] = P(S_{t+1} = s_next | S_t = s, A_t = a)`.\n",
|
||
"\n",
|
||
"We also initialize the reward vector `R`, which has length `n_states`, where `R[s]` is the reward received when the agent is in state `s`.\n",
|
||
"\n",
|
||
"In this maze game, we assume that the reward depends only on the current state, which is natural: in navigation tasks, being in a particular location is what matters, not the direction you used to reach it."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 85,
|
||
"id": "7a51f242-fe4e-4e74-8a1f-a8df32b194b8",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Initialize transition matrices and reward vector\n",
|
||
"P = np.zeros((len(ACTIONS), n_states, n_states))\n",
|
||
"R = np.zeros(n_states)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "c08f4af5-a2a7-4baa-b5da-c7ce636d8a4a",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now we assign the reward to each state. \n",
|
||
"\n",
|
||
"For each state index s:\n",
|
||
"\n",
|
||
"1. If s is a goal, then the reward = +1.0\n",
|
||
"2. If s is a trap, then the reward = −1.0\n",
|
||
"3. Otherwise for the normal cell, the reward = −0.01 every time you leave this cell.\n",
|
||
"\n",
|
||
"Recall that rewards are received at the moment the agent executes an action. Here when the agent moves out of the cell, we set reward −0.01. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 86,
|
||
"id": "49d54d1f-dc29-45b6-ad31-ad0e848f920d",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Set rewards for each state\n",
|
||
"step_penalty = -0.01\n",
|
||
"goal_reward = 1.0\n",
|
||
"trap_reward = -1.0\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "dd571ec8-c36a-4e20-bec6-9e6458dc622b",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 5.** Why do we set the step penalty to -0.01 in this MDP?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "1e8ea171",
|
||
"metadata": {},
|
||
"source": [
|
||
"We set a small negative step penalty (`-0.01`) for two main reasons:\n",
|
||
"\n",
|
||
"- Incentivize Efficiency: It forces the agent to find the shortest path to the goal. By losing a small amount of reward at every step, the agent learns that the faster it reaches the goal, the higher its total cumulative return will be.\n",
|
||
"\n",
|
||
"- Prevent Loitering: It discourages infinite loops or wandering. Without this penalty (i.e., if step reward = 0), the agent might be indifferent between reaching the goal now or in 1000 steps, potentially leading to a policy that never terminates."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "07bfb065-b1af-4df1-885e-780fe250f2fb",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 6.** We now define the reward vector. Recall that we have already initialized\n",
|
||
"`R = np.zeros(n_states)`.\n",
|
||
"If a state belongs to `goal_states`, we assign the `goal_reward`.\n",
|
||
"If it belongs to `trap_states`, we assign the `trap_reward`.\n",
|
||
"Otherwise, we assign the `step_penalty`. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 87,
|
||
"id": "b9b7495a-c233-425c-99c0-5bddaf6c3225",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"for s in range(n_states):\n",
|
||
" if s in goal_states:\n",
|
||
" R[s] = goal_reward\n",
|
||
" elif s in trap_states:\n",
|
||
" R[s] = trap_reward\n",
|
||
" else:\n",
|
||
" R[s] = step_penalty\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "b90fb80c-9452-48a2-889f-286703c2ae93",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now we define terminal states and a helper function. Here terminal_states is a set containing all absorbing states, which means, reaching them ends the episode conceptually. \n",
|
||
"\n",
|
||
"Moreover, `is_terminal(s)` is a small helper to check if a state is terminal."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 88,
|
||
"id": "eca4c571-39c7-468b-af86-0bab9489415e",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"terminal_states = set(goal_states + trap_states)\n",
|
||
"\n",
|
||
"\n",
|
||
"def is_terminal(s: int) -> bool:\n",
|
||
" \"\"\"Check if a state is terminal (goal or trap).\"\"\"\n",
|
||
" return s in terminal_states\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "3a9a1d54-8339-402b-84e9-105961ed78d7",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now we need to fill the transition matrices `P[a][s, s_next]`. \n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "d9cfd15c-12cc-48bb-bd88-07f3ae3db31c",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 7.** **Complete the `# TO DO` part in the program below** to fill the transition matrices `P[a][s, s_next]`. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 89,
|
||
"id": "2d03276b-e206-4d1f-9024-f6948ca61523",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"for s in range(n_states): # We loop over all states s.\n",
|
||
" i, j = state_to_pos[\n",
|
||
" s\n",
|
||
" ] # We recover the states to their coordinates (i, j) in the maze.\n",
|
||
"\n",
|
||
" # First, in a goal or trap state,\n",
|
||
" # No matter which action you “choose”, you stay in the same state with probability 1.\n",
|
||
" # This makes the terminal states as the absorbing states.\n",
|
||
" if is_terminal(s):\n",
|
||
" # Terminal states: stay forever\n",
|
||
" for a in ACTIONS:\n",
|
||
" P[a, s, s] = goal_reward\n",
|
||
" continue\n",
|
||
"\n",
|
||
" # If the state is non-terminal, we define the stochastic movement.\n",
|
||
" # For a given state s and intended action a,\n",
|
||
" # With probability 1 - p_error, the robot will move in direction a;\n",
|
||
" # With probability p_error, the robot will move in one of the other 3 directions, each with probability p_error / 3.\n",
|
||
" for a in ACTIONS:\n",
|
||
" # main action (intended action)\n",
|
||
" main_i, main_j = move_deterministic(i, j, a)\n",
|
||
" s_main = pos_to_state[\n",
|
||
" (main_i, main_j)\n",
|
||
" ] # s_main is the state index of that next cell.\n",
|
||
" P[a, s, s_main] += (\n",
|
||
" 1 - p_error\n",
|
||
" ) # We add probability 1 - p_error to P[a, s, s_main].\n",
|
||
"\n",
|
||
" # error actions\n",
|
||
" other_actions = [\n",
|
||
" a2 for a2 in ACTIONS if a2 != a\n",
|
||
" ] # other_actions = the 3 actions different from a.\n",
|
||
" for a2 in other_actions: # for each of the error action,\n",
|
||
" error_i, error_j = move_deterministic(i, j, a2)\n",
|
||
" s_error = pos_to_state[(error_i, error_j)] # get its state index s_error\n",
|
||
" P[a, s, s_error] += p_error / len(\n",
|
||
" other_actions\n",
|
||
" ) # add p_error / 3 to P[a, s, s_error]\n",
|
||
"# So for each (s,a), probabilities over all s_next sum to 1.\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "7841b264-af00-4322-b728-adcffac0ef89",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now we check if the transition matrices `P[a][s, s_next]` are computed correctly.\n",
|
||
"For each action `a`, we sum the transition probabilities over all possible next states `s_next` and verify that these sums are equal to 1.\n",
|
||
"\n",
|
||
"This is because the matrix `P[a, s, s_next]` stores the transition probability\n",
|
||
"\n",
|
||
"$\\mathbb{P} \\big[S_t=s_{\\text{next}}\\,|\\, S_{t-1}=s, \\,A_{t-1}=a\\big]$. \n",
|
||
"\n",
|
||
"Therefore, for each action $a$, and for each state $s$, the sum over $s_{\\text{next}}$ of $\\mathbb{P} \\big[S_t=s_{\\text{next}}\\,|\\, S_{t-1}=s, \\,A_{t-1}=a\\big]$ should be 1. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 90,
|
||
"id": "341fe630-8f87-4773-84ad-92d3516e53e2",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Action ↑: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n",
|
||
"Action →: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n",
|
||
"Action ↓: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n",
|
||
"Action ←: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"for a in ACTIONS:\n",
|
||
" # For each action a:\n",
|
||
" # P[a] is a matrix of shape (n_states, n_states).\n",
|
||
" # P[a].sum(axis=1) sums over next states s_next, giving for each state s:\n",
|
||
" # We print these row sums.\n",
|
||
" # If everything is correct, they should be very close to 1.\n",
|
||
"\n",
|
||
" probs = P[a].sum(axis=1)\n",
|
||
" print(f\"Action {action_names[a]}:\", probs)\n"
|
||
]
|
||
},
|
||
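{
"cell_type": "markdown",
"id": "worked-example-transition-row",
"metadata": {},
"source": [
"As an illustration of what these matrices contain, here is a hand computation for the start state $s = 0$ at position $(1, 0)$, assuming $p_{\\text{error}} = 0.1$ as set above. From this cell only the move Right is possible (to state $1$); Up and Down hit walls and Left leaves the grid, so those three moves leave the robot in state $0$.\n",
"\n",
"$$\n",
"P[\\rightarrow][0, 1] = 0.9, \\qquad P[\\rightarrow][0, 0] = 3 \\times \\tfrac{0.1}{3} = 0.1,\n",
"$$\n",
"$$\n",
"P[\\uparrow][0, 0] = 0.9 + 2 \\times \\tfrac{0.1}{3} \\approx 0.967, \\qquad P[\\uparrow][0, 1] = \\tfrac{0.1}{3} \\approx 0.033.\n",
"$$\n",
"\n",
"In both cases the row sums to 1, in agreement with the check above. Printing `P[A_RIGHT, 0]` and `P[A_UP, 0]` is one way to compare the computed rows with these values."
]
},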
{
|
||
"cell_type": "markdown",
|
||
"id": "46d23991",
|
||
"metadata": {},
|
||
"source": [
|
||
"## 3. Policy evaluation\n",
|
||
"\n",
|
||
"### 3.1 Bellman expectation equation"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "305b047c-e83b-4f42-b64e-e2050d5deeff",
|
||
"metadata": {},
|
||
"source": [
|
||
"Recall that the value function under a policy $\\pi$ is defined as:\n",
|
||
"$$\n",
|
||
"V^{\\pi}(s)=\\mathbb{E}\\Big[\\:G_t \\:\\Big|\\: S_t=s\\:\\Big]\n",
|
||
"$$\n",
|
||
"where the return $G_t$ is\n",
|
||
"$$\n",
|
||
"G_t=R_t +\\gamma R_{t+1}+\\gamma^2 R_{t+2}+... . \n",
|
||
"$$\n",
|
||
"This means *The value of a state is the expected discounted sum of all future rewards\n",
|
||
"when following policy $\\pi$.*\n",
|
||
"\n",
|
||
"We know that $G_t=R_t+\\gamma G_{t+1}$, and plugging this equation into the definition of $V^{\\pi}(s)$, we get \n",
|
||
"$$\n",
|
||
"V^{\\pi}(s)=\\mathbb{E}\\Big[\\:R_t \\:\\Big|\\: S_t=s\\:\\Big]+\\gamma\\mathbb{E}\\Big[\\:G_{t+1} \\:\\Big|\\: S_t=s\\:\\Big]. \n",
|
||
"$$\n",
|
||
"This step shows simply ``The total future reward = immediate reward + discounted reward from next state.''"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "88ea8d56-3b62-4690-9ff7-469e43726fbc",
|
||
"metadata": {},
|
||
"source": [
|
||
"For the expected immediate reward part $\\mathbb{E}[R_t| S_t=s]$, as we are in a maze problem, the reward depends only on the current state, not the time step, i.e., $\\mathbb{E}[R_t| S_t=s]=R(s)$. Hence we get \n",
|
||
"$$\n",
|
||
"V^{\\pi}(s)=R(s)+\\gamma\\mathbb{E}\\Big[\\:G_{t+1} \\:\\Big|\\: S_t=s\\:\\Big]. \n",
|
||
"$$\n",
|
||
"\n",
|
||
"Moreover, in this maze problem, we consider a deterministic policy $A_t=\\pi(s)$ (the action depends only on the state). Therefore, \n",
|
||
"$$\n",
|
||
"V^{\\pi}(s)=\\mathbb{E}\\Big[\\:R_t \\:\\Big|\\: S_t=s\\:\\Big]+\\gamma\\mathbb{E}\\Big[\\:G_{t+1} \\:\\Big|\\: S_t=s, A_t=\\pi(s)\\:\\Big]. \n",
|
||
"$$\n",
|
||
"\n",
|
||
"Now **given the state $S_t=s$ and $A_t=a$**, the next state is random (because of the error probability) and we know the transition probability \n",
|
||
"$$\n",
|
||
"\\mathbb{P}\\big(\\:S_{t+1}=s' \\:|\\:S_t=s, \\, A_t=a\\big)=P\\big(s'\\:\\big|\\:s, a\\big). \n",
|
||
"$$"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "c25e255d-8f58-4eaf-9485-cee6ab3bea6c",
|
||
"metadata": {},
|
||
"source": [
|
||
"Therefore,\n",
|
"$$\n",
"\\begin{aligned}\n",
"\\mathbb{E}\\big[\\,G_{t+1}\\,\\big|\\,S_t=s,A_t=a\\,\\big] &=\\sum_{s'}\\mathbb{E}\\big[\\,G_{t+1}\\,\\big|\\,S_{t+1}=s'\\,\\big]\\,\\mathbb{P}\\big(S_{t+1}=s'\\,\\big|\\,S_t=s, A_t=a\\big)\\\\\n",
"&=\\sum_{s'}V^{\\pi}(s')\\,P\\big(s'\\:\\big|\\:s, a\\big),\n",
"\\end{aligned}\n",
"$$\n",
"where we use the Markov property. (**Question: Can you show the detailed computations here?**)"
|
||
]
|
||
},
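{
"cell_type": "markdown",
"id": "added-markov-step",
"metadata": {},
"source": [
"One possible way to fill in the details of the computation above is sketched here. It assumes that the policy $\\pi$ is followed from time $t+1$ onward and that the MDP is time-homogeneous, so that $\\mathbb{E}[\\,G_{t+1}\\,|\\,S_{t+1}=s'\\,]$ does not depend on $t$.\n",
"$$\n",
"\\begin{aligned}\n",
"\\mathbb{E}\\big[\\,G_{t+1}\\,\\big|\\,S_t=s,A_t=a\\,\\big] &=\\sum_{s'}\\mathbb{E}\\big[\\,G_{t+1}\\,\\big|\\,S_{t+1}=s',S_t=s,A_t=a\\,\\big]\\,\\mathbb{P}\\big(S_{t+1}=s'\\,\\big|\\,S_t=s, A_t=a\\big)\\\\\n",
"&=\\sum_{s'}\\mathbb{E}\\big[\\,G_{t+1}\\,\\big|\\,S_{t+1}=s'\\,\\big]\\,P\\big(s'\\,\\big|\\,s, a\\big)\\\\\n",
"&=\\sum_{s'}V^{\\pi}(s')\\,P\\big(s'\\,\\big|\\,s, a\\big).\n",
"\\end{aligned}\n",
"$$\n",
"The first equality is the law of total expectation (we condition on the next state), the second is the Markov property, and the third is the definition of $V^{\\pi}$ applied at time $t+1$. The Markov property is what allows us to drop $S_t$ and $A_t$ from the conditioning: once $S_{t+1}=s'$ is known, the future rewards do not depend on how the robot arrived in $s'$."
]
},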
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "9a2b6cff-e848-44a2-b504-973067b367b3",
|
||
"metadata": {},
|
||
"source": [
|
"In conclusion, we obtain the **Bellman expectation equation**\n",
|
||
"$$\n",
|
||
"V^{\\pi}(s)=R(s)+\\gamma \\sum_{s'}P\\big(\\,s'\\,\\big|\\,s, \\pi(s)\\,\\big)V^{\\pi}(s').\n",
|
||
"$$"
|
||
]
|
||
},
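{
"cell_type": "markdown",
"id": "added-matrix-form",
"metadata": {},
"source": [
"Since the state space is finite, the Bellman expectation equation is a linear system in the unknowns $V^{\\pi}(s)$. As a sketch (the notation $P_{\\pi}$ and $I$ is introduced here only for illustration), write $V^{\\pi}$ and $R$ as column vectors and let $P_{\\pi}$ be the square matrix with entries $(P_{\\pi})_{s,s'}=P\\big(s'\\,\\big|\\,s,\\pi(s)\\big)$. Then\n",
"$$\n",
"V^{\\pi}=R+\\gamma P_{\\pi}V^{\\pi}\n",
"\\quad\\Longleftrightarrow\\quad\n",
"(I-\\gamma P_{\\pi})\\,V^{\\pi}=R\n",
"\\quad\\Longleftrightarrow\\quad\n",
"V^{\\pi}=(I-\\gamma P_{\\pi})^{-1}R.\n",
"$$\n",
"The matrix $I-\\gamma P_{\\pi}$ is invertible: the entries of $P_{\\pi}$ are nonnegative and each row sums to at most 1 (exactly 1 when every row is a probability distribution), so with $\\gamma<1$ all eigenvalues of $\\gamma P_{\\pi}$ have modulus strictly less than 1. In particular, the value function of a given policy is unique."
]
},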
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "15049fdb-f3af-4f78-b556-817284260ed0",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 3.2 Define a function which computes the value function $V^{\\pi}(s)$ for a given deterministic policy. \n",
|
||
"\n",
|
||
"\n",
|
||
"**Exercise $8^*$.** Now we define `policy_evaluation(...)`, which computes the value function $V^{\\pi}(s)$ for a given deterministic policy. \n",
|
||
"\n",
|
"The inputs of `policy_evaluation(...)` are:\n",
"1. `policy`: array of size `n_states`; each entry is an action in {0, 1, 2, 3}, corresponding to UP, RIGHT, DOWN, LEFT.\n",
"2. `P`: the transition probabilities `P[a, s, s']`.\n",
"3. `R`: the reward vector `R[s]`.\n",
"4. `gamma`: the discount factor $\\gamma\\in(0,1)$.\n",
"5. `theta`: the convergence threshold.\n",
"6. `max_iter`: the maximum number of iterations, used to avoid infinite loops.\n",
|
||
"\n",
|
||
"How can we apply the Bellman expectation equation\n",
|
||
"$$\n",
|
||
"V^{\\pi}(s)=R(s)+\\gamma \\sum_{s'}P\\big(\\,s'\\,\\big|\\,s, \\pi(s)\\,\\big)V^{\\pi}(s').\n",
|
||
"$$\n",
|
"here?\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "5c48f489-3508-4981-8b35-5bedc2e5838c",
|
||
"metadata": {},
|
||
"source": [
|
"We start with an initial guess $V_0^{\\pi}$ (e.g., all values equal to 0) and repeatedly apply the Bellman equation to update each state,\n",
"$$\n",
"V_{k+1}^\\pi(s) \\leftarrow R(s)+\\gamma \\sum_{s'}P\\big(\\,s'\\,\\big|\\,s, \\pi(s)\\,\\big)V^{\\pi}_k(s'),\n",
"$$\n",
"until the values converge."
|
||
]
|
||
},
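{
"cell_type": "markdown",
"id": "added-convergence-sketch",
"metadata": {},
"source": [
"Why does this iteration converge? Here is a brief sketch, assuming $\\gamma\\in(0,1)$ and that each $P\\big(\\cdot\\,\\big|\\,s,\\pi(s)\\big)$ is a probability distribution. Subtracting the Bellman expectation equation from the update rule, the rewards cancel and, for every state $s$,\n",
"$$\n",
"\\big|V^{\\pi}_{k+1}(s)-V^{\\pi}(s)\\big|\n",
"=\\gamma\\,\\Big|\\sum_{s'}P\\big(\\,s'\\,\\big|\\,s, \\pi(s)\\,\\big)\\big(V^{\\pi}_k(s')-V^{\\pi}(s')\\big)\\Big|\n",
"\\leq \\gamma\\,\\max_{s'}\\big|V^{\\pi}_k(s')-V^{\\pi}(s')\\big|.\n",
"$$\n",
"Hence the largest error shrinks by at least a factor $\\gamma$ at every sweep, so that\n",
"$$\n",
"\\max_{s}\\big|V^{\\pi}_k(s)-V^{\\pi}(s)\\big|\\leq \\gamma^{k}\\,\\max_{s}\\big|V^{\\pi}_0(s)-V^{\\pi}(s)\\big|,\n",
"$$\n",
"which tends to 0 as $k\\to\\infty$. The closer $\\gamma$ is to 1, the more sweeps are needed."
]
},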
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 91,
|
||
"id": "2fffe0b7",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
"def policy_evaluation(  # noqa: PLR0913\n",
"    policy: np.ndarray,\n",
"    P: np.ndarray,\n",
"    R: np.ndarray,\n",
"    gamma: float,\n",
"    theta: float = 1e-6,\n",
"    max_iter: int = 10_000,\n",
") -> np.ndarray:\n",
"    \"\"\"Evaluate a deterministic policy for the given MDP.\n",
"\n",
"    Args:\n",
"        policy: array of shape (n_states,), with values in {0,1,2,3}\n",
"        P: array of shape (n_actions, n_states, n_states)\n",
"        R: array of shape (n_states,)\n",
"        gamma: discount factor\n",
"        theta: convergence threshold\n",
"        max_iter: maximum number of iterations\n",
"\n",
"    Returns:\n",
"        Array of shape (n_states,) with the estimated values V^pi(s).\n",
"\n",
"    \"\"\"\n",
"    n_states = len(R)  # number of states\n",
"    V = np.zeros(n_states)  # initial guess: V(s) = 0 for every state\n",
"\n",
"    for _it in range(max_iter):  # main iterative loop\n",
"        # Fresh vector that will hold the updated value of each state.\n",
"        V_new = np.zeros_like(V)\n",
"\n",
"        # Update each state using the Bellman expectation equation.\n",
"        for s in range(n_states):\n",
"            a = policy[s]  # action chosen by the policy in state s\n",
"            V_new[s] = R[s] + gamma * np.sum(P[a, s, :] * V)\n",
"\n",
"        # Largest change of the value function in this iteration:\n",
"        # a small delta means the values have (almost) converged.\n",
"        delta = np.max(np.abs(V_new - V))\n",
"        V = V_new  # use the new values in the next iteration\n",
"\n",
"        if delta < theta:  # stop once the changes are tiny\n",
"            break\n",
"\n",
"    return V  # estimate of V^pi(s) for every state s\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "09ef3439",
|
||
"metadata": {},
|
||
"source": [
|
||
"### 3.3 Evaluating a random policy"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "eecbca15-f89f-47bf-a13d-7d7c051699b8",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now we use the policy evaluation function `policy_evaluation` to evaluate a random policy. \n",
|
||
"\n",
|
"We first generate `random_policy`, an array of size `n_states` such as `[2, 0, 1, 3, 0, 2, ...]`. (Recall that a policy is a mapping from states to actions.)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 92,
|
||
"id": "b4a44e38",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[0 3 2 1 1 3 0 2 0 0 2 3 2 3 2 3 2 0 3 1 2 1]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# Random policy: for each state, pick a random action\n",
|
||
"random_policy = rng.integers(low=0, high=len(ACTIONS), size=n_states)\n",
|
||
"\n",
|
||
"print(random_policy)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "3fe07992-ce82-4124-aebc-a6384d417f64",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now we call the function `policy_evaluation(...)` to compute $V^{\\pi_{\\text{random}}}(s)$."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 93,
|
||
"id": "c5f559b2-452a-477c-a1fa-258b40805670",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Value function under random policy:\n",
|
||
"[ -0.2 -0.2 -0.201 -0.204 -0.205 -0.202 -0.214 -0.429 -0.212\n",
|
||
" -0.207 -0.276 -0.459 -0.352 -0.366 -5.827 -4.605 20. -0.366\n",
|
||
" -0.999 -20. -6.4 -3.163]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"V_random = policy_evaluation(policy=random_policy, P=P, R=R, gamma=gamma)\n",
|
||
"print(\"Value function under random policy:\")\n",
|
||
"print(V_random)\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "f46c70ba-2932-49af-b568-b5477260bc94",
|
||
"metadata": {},
|
||
"source": [
|
"In this value vector:\n",
"- A negative value means that, starting from that state, the agent tends to wander aimlessly, fall into traps, or take too long to reach the goal.\n",
"- A higher value means that the agent is closer to the goal or more likely to reach it."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "1efcb076-467c-42d8-94e8-87453f688bbd",
|
||
"metadata": {},
|
||
"source": [
|
"Now we define a function `plot_values`, which displays the value function $V(s)$ on the maze grid as a heatmap. It helps to see visually:\n",
|
||
"- which states are good (high value, near the goal),\n",
|
||
"- which states are bad (low value, near traps),\n",
|
||
"- how a policy affects the long-term expected reward."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 94,
|
||
"id": "4c428327",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"image/png": "",
|
||
"text/plain": [
|
||
"<Figure size 640x480 with 2 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
"def plot_values(V: np.ndarray, title: str = \"Value function\") -> None:\n",
"    \"\"\"Plot the value function V on the maze as a heatmap.\"\"\"\n",
"    # Grid with the same shape as the maze; every cell starts as NaN.\n",
"    grid_values = np.full((n_rows, n_cols), np.nan)\n",
"\n",
"    # Recall that state_to_pos maps each state index to its maze coordinates (i, j).\n",
"    # Walls are not states, so they never get a value and stay NaN.\n",
"    for s, (i, j) in state_to_pos.items():\n",
"        grid_values[i, j] = V[s]\n",
"\n",
"    fig, ax = plt.subplots()\n",
"    im = ax.imshow(grid_values, cmap=\"magma\")\n",
"    plt.colorbar(im, ax=ax)\n",
"\n",
"    # Write each value (two decimals) centered in its cell at (column j, row i),\n",
"    # in white so that it is visible on the heatmap.\n",
"    for s, (i, j) in state_to_pos.items():\n",
"        ax.text(\n",
"            j, i, f\"{V[s]:.2f}\", ha=\"center\", va=\"center\", color=\"white\", fontsize=9\n",
"        )\n",
"\n",
"    # Remove axis ticks and set title\n",
"    ax.set_xticks([])\n",
"    ax.set_yticks([])\n",
"    ax.set_title(title)\n",
"    plt.show()\n",
"\n",
"\n",
"plot_values(V_random, title=\"Value function: random policy\")\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "8275a1eb-b58e-4e05-ae5d-5635ff9a1556",
|
||
"metadata": {},
|
||
"source": [
|
||
"The next function `plot_policy` visualizes a policy on the maze.\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 95,
|
||
"id": "c1ab67f0-bd5e-4ffe-b655-aec030401b78",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
"def plot_policy(policy: np.ndarray, title: str = \"Policy\") -> None:\n",
"    \"\"\"Plot the given policy on the maze.\"\"\"\n",
"    _fig, ax = plt.subplots()\n",
"\n",
"    # Draw walls as dark cells.\n",
"    wall_grid = np.zeros((n_rows, n_cols))\n",
"    for i in range(n_rows):\n",
"        for j in range(n_cols):\n",
"            if maze_str[i][j] == \"#\":\n",
"                wall_grid[i, j] = 1\n",
"    ax.imshow(wall_grid, cmap=\"Greys\", alpha=0.5)\n",
"\n",
"    for s, (i, j) in state_to_pos.items():\n",
"        if maze_str[i][j] == \"#\":\n",
"            continue\n",
"\n",
"        # Goal, trap and start cells get a bold colored letter;\n",
"        # every other cell shows the action chosen by the policy.\n",
"        if s in goal_states:\n",
"            label, color, weight = \"G\", \"blue\", \"bold\"\n",
"        elif s in trap_states:\n",
"            label, color, weight = \"X\", \"red\", \"bold\"\n",
"        elif s == start_state:\n",
"            label, color, weight = \"S\", \"green\", \"bold\"\n",
"        else:\n",
"            label, color, weight = action_names[policy[s]], \"black\", \"normal\"\n",
"\n",
"        ax.text(\n",
"            j,\n",
"            i,\n",
"            label,\n",
"            ha=\"center\",\n",
"            va=\"center\",\n",
"            fontsize=14,\n",
"            fontweight=weight,\n",
"            color=color,\n",
"        )\n",
"\n",
"    # Draw grid lines on the cell borders and hide the tick labels.\n",
"    ax.set_xticks(np.arange(-0.5, n_cols, 1))\n",
"    ax.set_yticks(np.arange(-0.5, n_rows, 1))\n",
"    ax.set_xticklabels([])\n",
"    ax.set_yticklabels([])\n",
"    ax.grid(True)\n",
"    ax.set_title(title)\n",
"    plt.show()\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "48037254-dccc-4f9c-a4d7-349adba5c74f",
|
||
"metadata": {},
|
||
"source": [
|
||
"Now let’s visualize the `random_policy`. Does it seem like a good policy?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 96,
|
||
"id": "d452681c-c89c-41cc-95dc-df75993b0391",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"image/png": "",
|
||
"text/plain": [
|
||
"<Figure size 640x480 with 1 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"plot_policy(policy=random_policy, title=\"Policy\")\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "cbad5bf1-0150-4c3f-8cce-c82e0f1d1695",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 9.** Define your own policy and evaluate it using the functions `policy_evaluation(...)` and `plot_values(...)`. **Can you identify an optimal policy visually?** Plot your own policy using `plot_policy`. \n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 97,
|
||
"id": "929707e6-3022-4d86-96cc-12f251f890a9",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"image/png": "",
|
||
"text/plain": [
|
||
"<Figure size 640x480 with 2 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"data": {
|
||
"image/png": "",
|
||
"text/plain": [
|
||
"<Figure size 640x480 with 1 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"my_policy = [\n",
|
||
" A_RIGHT,\n",
|
||
" A_RIGHT,\n",
|
||
" A_RIGHT,\n",
|
||
" A_DOWN,\n",
|
||
" A_DOWN, # First row\n",
|
||
" A_UP,\n",
|
||
" A_DOWN,\n",
|
||
" A_DOWN,\n",
|
||
" A_LEFT, # Second row\n",
|
||
" A_UP,\n",
|
||
" A_RIGHT,\n",
|
||
" A_DOWN, # Third row\n",
|
||
" A_UP,\n",
|
||
" A_LEFT,\n",
|
||
" A_RIGHT,\n",
|
||
" A_RIGHT,\n",
|
||
" A_RIGHT, # Fourth row\n",
|
||
" A_UP,\n",
|
||
" A_LEFT,\n",
|
||
" A_DOWN,\n",
|
||
" A_RIGHT,\n",
|
||
" A_UP, # Fifth row\n",
|
||
"]\n",
|
||
"\n",
|
||
"V_my_policy = policy_evaluation(policy=my_policy, P=P, R=R, gamma=gamma)\n",
|
||
"\n",
|
||
"plot_values(V=V_my_policy, title=\"Value function: my policy\")\n",
|
||
"plot_policy(policy=my_policy, title=\"My policy\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "e61f5ee8-f9cd-4fbc-96c0-0a8d661bd1e5",
|
||
"metadata": {},
|
||
"source": [
|
||
"**Exercise 10.** (optional) How can we find an optimal policy?\n",
|
||
"(We will discuss this question next week, but you can already start thinking about it!)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "00ae548b",
|
||
"metadata": {},
|
||
"source": [
|
"To find an optimal policy $\\pi^*$ (a policy that yields the highest possible expected return from every state), we generally use one of two main dynamic programming algorithms:\n",
"\n",
"1. **Policy Iteration**: This method alternates between two steps until the policy no longer changes:\n",
"\n",
"   - *Policy Evaluation*: Compute the value function $V^{\\pi}(s)$ of the current policy (as we did in Exercise 8).\n",
"\n",
"   - *Policy Improvement*: Update the policy to be greedy with respect to the current values. For every state $s$, we choose the action $a$ that maximizes the expected next value:\n",
"     $$\\pi_{\\text{new}}(s) = \\arg\\max_{a} \\sum_{s'} P\\big(s'\\,\\big|\\,s,a\\big)\\big[R(s)+ \\gamma V^{\\pi}(s')\\big]$$\n",
"\n",
"2. **Value Iteration**: Instead of evaluating each policy until convergence, we iteratively update the value function directly using the *Bellman optimality equation*:\n",
"   $$V_{k+1}(s) = \\max_a \\Big(R(s)+ \\gamma \\sum_{s'} P\\big(s'\\,\\big|\\,s,a\\big)V_k(s')\\Big)$$\n",
"\n",
"   Once the values have converged to the optimal values $V^{*}$, we extract an optimal policy by acting greedily with respect to those values."
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "studies",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.13.9"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 5
|
||
}
|