{ "cells": [ { "cell_type": "markdown", "id": "44b75d44", "metadata": {}, "source": [ "# Lab 2 - Maze Game as a Markov Decision Process Part 1\n", "\n", "## **1. Objectives**\n", "\n", "In this lab, we will:\n", "\n", "- Model a simple **maze game** as a **Markov Decision Process (MDP)** by defining:\n", " - **States**\n", " - **Actions**\n", " - **Transition probabilities**\n", " - **Rewards**\n", "\n", "- Implement **policy evaluation** to compute the value function of a given policy.\n", "\n", "This week, we **do not** improve the policy and search for an optimal one yet. \n", "We will continue working on the Maze Game **next week**, where we will use these components to compute an **optimal policy**.\n", "\n", "We consider a **discounted MDP** with discount factor $\\gamma \\in (0,1)$.\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "100d1e0d", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "np.set_printoptions(\n", " precision=3,\n", " suppress=True,\n", ")\n", "# (not mandatory) This line is for limiting floats to 3 decimal places,\n", "# avoiding scientific notation (like 1.23e-04) for small numbers.\n", "\n", "# For reproducibility\n", "rng = np.random.default_rng(seed=42) # This line creates a random number generator." ] }, { "cell_type": "markdown", "id": "1018deab", "metadata": {}, "source": [ "## 2. Maze definition and MDP formulation\n", "\n", "We consider a small 2D maze on a grid. 
The agent is a **robot** that moves on the grid.\n", "\n", "- `S` : start state\n", "- `G` : goal state, with positive reward\n", "- `#` : wall (not accessible)\n", "- `.` : empty cell\n", "- `X` : \"trap\" (negative reward)\n", "\n", "At each step, the robot can choose among 4 actions:\n", "\n", "$$\n", "\\mathcal{A} = \\{\\text{Up} \\uparrow, \\quad \\text{Right} \\rightarrow, \\quad \\text{Down} \\downarrow, \\quad \\text{Left}\\leftarrow\\}.\n", "$$\n", "\n", "The intended movement is deterministic, but we add a small probability of “error” to make the example more realistic:\n", "- With probability $1 - p_{\\text{error}}$, the robot moves in the chosen direction.\n", "- With probability $p_{\\text{error}}$, it moves in a random *other* direction.\n", "- If the movement would hit a wall or go outside the grid, the robot stays in place.\n", "\n", "We will represent the MDP with:\n", "\n", "- A list of **states** $\\mathcal{S} = \\{0, \\dots, n_S - 1\\}$, **each corresponding to a walkable (non-wall) grid cell.**\n", "- For each action $a$, a transition matrix $P[a]$ of size $(n_S, n_S)$, where\n", " $$\n", " P[a][s, s'] = \\mathbb{P}(S_{t+1} = s' \\mid S_t = s, A_t = a).\n", " $$\n", "- A reward vector $R$ of length $n_S$, where $R[s]$ is the immediate reward obtained when **leaving** state $s$.\n", "\n", "We will use a discount factor $\\gamma = 0.95$.\n" ] }, { "cell_type": "markdown", "id": "ca4fa301-c14f-44ec-b04f-b01ca42d979a", "metadata": {}, "source": [ "### 2.1 Define the maze \n", "\n", "Let us now define the maze as follows." 
] }, { "cell_type": "code", "execution_count": 74, "id": "f91cda05", "metadata": {}, "outputs": [], "source": [ "maze_str = [\n", "    \"#######\",\n", "    \"S...#.#\",\n", "    \"#.#...#\",\n", "    \"#.#..##\",\n", "    \"#..#..G\",\n", "    \"#..X..#\",\n", "    \"#######\",\n", "]" ] }, { "cell_type": "markdown", "id": "99820cf4-292d-49ba-b662-f9f05f901f62", "metadata": {}, "source": [ "**Exercise 1.** Compute the dimensions of the maze (complete the “TO DO” parts):\n", "- How many rows does the maze have?\n", "- How many columns does the maze have?" ] }, { "cell_type": "code", "execution_count": 75, "id": "564cb757-eefe-4be6-9b6f-bb77ace42a97", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "7\n", "7\n" ] } ], "source": [ "n_rows = len(maze_str)\n", "print(n_rows)\n", "n_cols = len(maze_str[0])\n", "print(n_cols)" ] }, { "cell_type": "code", "execution_count": 76, "id": "26c821d3-2362-4b60-8c77-3d09296d130d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Maze:\n", "#######\n", "S...#.#\n", "#.#...#\n", "#.#..##\n", "#..#..G\n", "#..X..#\n", "#######\n" ] } ], "source": [ "print(\"Maze:\")\n", "for row in maze_str:\n", "    print(row)" ] }, { "cell_type": "markdown", "id": "adc49d58-2730-41d8-96fb-ca7c9cb4fcdf", "metadata": {}, "source": [ "### 2.2 Map each walkable cell (not a wall '#') to a state index\n", "\n", "Now we convert the maze grid into state indices for the MDP.\n", "\n", "\n", "The cells where the robot is allowed to stand are:\n", "\n", "- `.` : empty space\n", "\n", "- `S` : start\n", "\n", "- `G` : goal\n", "\n", "- `X` : trap\n", "\n", "Everything else (i.e., `#`) is a wall and cannot be a state in the MDP.\n" ] }, { "cell_type": "code", "execution_count": 77, "id": "7116044b-c134-43de-9f30-01ab62325300", "metadata": {}, "outputs": [], "source": [ "FREE = {\n", "    \".\",\n", "    \"S\",\n", "    \"G\",\n", "    \"X\",\n", "}  # The set FREE contains the cell types that the agent is allowed to move into." 
] }, { "cell_type": "markdown", "id": "1c9ad05e-9c6c-4e00-918c-44b858f45298", "metadata": {}, "source": [ "**Dictionaries to convert between grid and state index**\n", "\n", "We now want to identify all **valid states** of the maze (all non-wall cells). \n", "To do this, we need two mappings:\n", "\n", "1. `state_to_pos[s] = (i, j)`: Given a state index $s$, return its grid coordinates (row, column).\n", "2. `pos_to_state[(i, j)] = s`: Given coordinates (i, j), return the corresponding state index $s$.\n", "\n", "These two dictionaries allow easy conversion between **MDP state indices** and the **physical maze positions**. " ] }, { "cell_type": "code", "execution_count": null, "id": "a1258de4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of states (non-wall cells): 22\n", "Start state: 0 at (1, 0)\n", "Goal states: [16] at (4, 6)\n", "Trap states: [19] at (5, 3)\n" ] } ], "source": [ "state_to_pos = {}  # s -> (i,j)\n", "pos_to_state = {}  # (i,j) -> s\n", "\n", "start_state = None  # will store the state index of the start state\n", "goal_states = []  # will store the state indices of the goal states\n", "trap_states = []  # will store the state indices of the trap states\n", "\n", "s = 0\n", "for i in range(n_rows):  # i = row index\n", "    for j in range(n_cols):  # j = column index\n", "        cell = maze_str[i][j]  # cell = the character at that position (S, ., #, etc.)\n", "\n", "        if cell in FREE:\n", "            # FREE contains: free cells \".\", start cell \"S\", goal cell \"G\" and trap cell \"X\"\n", "            # Walls \"#\" are ignored; they are not MDP states.\n", "            state_to_pos[s] = (i, j)\n", "            pos_to_state[(i, j)] = s\n", "\n", "            if cell == \"S\":\n", "                start_state = s\n", "            elif cell == \"G\":\n", "                goal_states.append(s)\n", "            elif cell == \"X\":\n", "                trap_states.append(s)\n", "\n", "            s += 1\n", "\n", "n_states = s\n", "\n", "print(\"Number of states (non-wall cells):\", n_states)\n", "print(\"Start state:\", start_state, \"at\", 
state_to_pos[start_state])\n", "print(\"Goal states:\", goal_states, \"at\", state_to_pos[goal_states[0]])\n", "print(\"Trap states:\", trap_states, \"at\", state_to_pos[trap_states[0]])" ] }, { "cell_type": "markdown", "id": "721b968c-a355-46eb-aae4-5950441ba604", "metadata": {}, "source": [ "*Hint.* If you don’t know what a dictionary is in Python, try the following code to help you understand." ] }, { "cell_type": "code", "execution_count": 79, "id": "68744dd6-7278-4c20-8b82-34212685352f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "value2\n" ] } ], "source": [ "my_dict = {\"key1\": \"value1\", \"key2\": \"value2\"}\n", "print(my_dict[\"key2\"])" ] }, { "cell_type": "markdown", "id": "0c76f4e1-b0ba-49c5-b9d5-cfb523024ba9", "metadata": {}, "source": [ "**Exercise 2.** Read the program above and answer the following questions:\n", "1. What is the purpose of state_to_pos and pos_to_state?\n", "2. Why do we only assign states to cells in FREE?\n", "3. What would happen if the maze had multiple goal cells?\n", "4. What is the total number of states (n_states) in this maze? Does this match the number of non-wall cells you can count visually?" ] }, { "cell_type": "markdown", "id": "4c26a18f-2d03-401c-8eae-f9a17ac55f6d", "metadata": {}, "source": [ "1. What is the purpose of `state_to_pos` and `pos_to_state`? These dictionaries establish a bijective mapping between the mathematical representation of the state and its spatial representation:\n", "\n", " `state_to_pos`: Maps the scalar state index `s` (an integer used for matrix/vector operations in RL algorithms like Q-learning) to the grid coordinates (i,j).\n", "\n", " `pos_to_state`: Maps the grid coordinates (`i,j`) (used to calculate movement and dynamics within the 2D grid) back to the unique state index s.\n", "\n", "2. Why do we only assign states to cells in FREE? 
In a Markov Decision Process (MDP), walls (#) are obstructions, not valid states.\n", "\n", " The agent can never \"be\" in a wall, so assigning a state index to a wall would needlessly increase the dimensionality of the state space (∣S∣).\n", "\n", " Excluding walls ensures the transition matrices and value vectors remain compact and contain only reachable positions.\n", "\n", "3. What would happen if the maze had multiple goal cells?\n", "\n", " In the code: The logic is robust. Since goal_states is initialized as a list (`[]`), the code would simply append the state index `s` of every `G` cell found during the iteration. The list would contain multiple integers representing all terminal states.\n", "\n", " Caveat: While the logic holds, the final print statement in the provided script (`state_to_pos[goal_states[0]]`) would only display the coordinates of the first goal found, ignoring the others in the console output.\n", "\n", "4. What is the total number of states (`n_states`) in this maze? Does this match the number of non-wall cells you can count visually?\n", "\n", " `n_states` represents the total count of walkable cells (Start, Goal, Trap, and empty space).\n", "\n", " Yes, this value matches exactly the number of non-wall cells visible in the maze, as the counter s is incremented precisely when a cell is found in the FREE set." ] }, { "cell_type": "markdown", "id": "6d0fa298-7b7c-44fc-bbed-15ea002037c2", "metadata": {}, "source": [ "-----\n", "\n", "The following function `plot_maze_with_states` creates a figure showing:\n", "- the maze walls and free cells\n", "- the state index for each non-wall cell\n", "- special labels and colors for S (start state), G (goal state), and X (trap state). 
" ] }, { "cell_type": "code", "execution_count": null, "id": "fc61ceef-217c-47f4-8eba-0353369210db", "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAGbCAYAAAAr/4yjAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjEsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvc2/+5QAAAAlwSFlzAAAPYQAAD2EBqD+naQAAK99JREFUeJzt3Qd4VFX6x/E3CS1CEiSAofcS+spKEZCmgiBZKQqKiIuCrIDArmBBKSu661IUQYoNpFhoYhCpSlEEFekIiHThTwslEaRm/s97xjlMMAlRk9wp38/zXObOzZA5d+7k/O4pdybE5XK5BAAAEQl1ugAAAN9BKAAALEIBAGARCgAAi1AAAFiEAgDAIhQAABahAACwCAUAgEUowHEhISEydOjQDD+2d+/eEoyaNGlilqz8nfv27TOv8ZQpUzL1eeA/CAUfoX+E+seoy5dffvmbn+unkZQoUcL8/O6775ZA9tVXX5mQOH36tF8+9/jx46lU4bdyOF0ApJQnTx557733pGHDhim2r1y5Un766SfJnTu3BJpffvlFcuTIkaJiHjZsmDz88MOSP3/+bC1LZjy3hkLBggXN78hMS5YskaxWqlQpczxy5syZ5c8F30RLwce0atVKZs2aJZcvX06xXYOidu3aEhMTI4EYhN6hgNTlypXLLFlJW6J6PMLCwrL0eeC7CAUfc//990tCQoIsXbrUbrt48aLMnj1bHnjggVT/z8iRI+XWW2+V6OhoCQ8PN+Ghj/emZ62e7qlrF+/+/AsXLsiQIUOkfPnyplWiXVYDBw4029Pz2muvmYrEu9tl1KhR5vf/85//tNuuXLkiERER8tRTT9lt3mXQ2wEDBpj1MmXK2DJqX7e3efPmSbVq1UwZq1atKosWLZKMGDt2rHn8DTfcIDfeeKP89a9/NYGbkeeePHmyNGvWTAoXLmyet0qVKjJhwoQUv7906dKybds207Lz/H/vPnt9ffr162deV/0d+jq//PLLkpyc/Lv7/1esWGF+/8yZM+XFF1+U4sWLmwq9efPm8uOPP/7m/7/xxhtSrlw58x6pU6eOfPHFF795TFpjCjt27JD77rtPChUqZP5/pUqVZNCgQSkec+jQIenWrZvcdNNN9ri88847v+sYwHmcnvkYrVTq168v77//vtx1111m28KFC+XMmTPSqVMnU/lea8yYMRIXFyedO3c2AfLBBx/IvffeK5988om0bt3aPOaxxx6T22+/PcX/04p0xowZppJTWjHp79ExjR49ekhsbKxs2bJFXnnlFfnhhx9MRZyWRo0amf+v/9cz5qGVTmhoaIrKZ8OGDfLzzz/LbbfdlurvadeunXku3X99Xu2GUVoZeehzzJ07Vx5//HETMPqatG/fXg4cOGCCMS1vvvmmPPHEE9KhQwfp27evnD9/XjZv3ixff/21CdzrPbcGgFZm+hppy2b+/PmmDLrfvXr1Mo959dVXpU+fPpIvXz5baWolqc6dOyeNGzc2lacej5IlS5ruqmeeeUb+7//+z/zfP+K///2veZ2ffPJJ8z753//+Z94Lul8eb7/9tnlOPXnQUNqzZ4/ZjwIFCpiASo++Rnp8tUtJ3xf6Ht29e7fZfw0jdfToUalXr56dCKCvmb5vH3nkEUlMTDTPmZFjAB+g36cA502ePFm/18L17bffusaNG+eKiIhwnTt3zvzs3nvvdTVt2tSslypVytW6desU/9fzOI+LFy+6qlWr5mrWrFmaz7dr1y5XVFSU64477nBdvnzZbJs2bZorNDTU9cUXX6R47MSJE03ZVq9enebvu3LliisyMtI1cOBAcz85OdkVHR1tyh4WFuZKSkoy20eP
Hm2e49SpU/b/6u8eMmSIvT9ixAizbe/evb95Ht2eK1cu148//mi3bdq0yWwfO3asKz1/+9vfXFWrVk33Mek997Wvs2rRooWrbNmyKbbpczRu3Pg3j33hhRdcefPmdf3www8ptj/99NPmNTpw4EC6ZdPf6f17ly9fbsoaGxvrunDhgt0+ZswYs33Lli32/VC4cGFXrVq1UjzujTfeMI/z/p2637pN348et912m3k/7t+/P0V59Bh7PPLII64iRYq4Tpw4keIxnTp1Mu8zz2uXkWMAZ9F95IO0ma6DfXqmn5SUZG7TO4vS5rzHqVOnzNmintmtX78+1cefPXtW2rZta5ruelbs6T/WsQxtHVSuXFlOnDhhF+0yUcuXL0+zDHqmqmehq1atMve3b99uusGefvppM3NqzZo1Zru2GrTb588MIGuLR7tBPGrUqCGRkZHm7Dc9+pw6WP/tt9/+oef1fp31NdbXRs/89Xn1/vXo66vHRV9379dX90e71Tyv3e/197//PcVYgz6H8rwe69atk2PHjknPnj1TPE67FKOiotL93cePHzfl0m4hbdl401aB0uM7Z84cadOmjVn33rcWLVqY18bzXvyzxwBZj+4jH6RNb60otJ9Vuxy0wtDmdlo0NIYPHy4bN25M0ffv+aO9Vvfu3U3zX7suvLtbdu3aZSpz764ab1qxpEcrI+2X10DTyr9IkSJy8803S82aNc39O+64w3T9aOj9GddWTkorWg3E9Og4xrJly0x/uvbl33nnnSZsGzRokKHnXb16tRlv0YDT4+JNK77rVbD6+mpXyR99fTP6euhroTyvx/79+81thQoVUjxOu4PKli2b7u/2BIsGeXrBoWMlOmahS3r79mePAbIeoeCj9A9FK+8jR46YsYW0zqy1stW+Ye2j16mQWhHrH7sOiqY2eKfjD9o6mD59utSqVSvFz7RvvHr16jJ69OhUn+t6fc86jfbSpUum0tRyec5Y9Vbv62ClViCe7X9UWjNjrvfNstoK2rlzpwlRHU/Rs1t9zQYPHmymoaZHQ1QHcLUVpa+PvhZ61v3pp5+a8YeMDBTrYzQYdeA+NRUrVpTsfD0yi2ffH3zwQenatWuqj9HW3J89BsgehIKP0u4dHRhcu3atfPjhh2k+Tv+odMbJ4sWLU1zDoKFwLa2YdTBSB/10IPJa2iWzadMmU/ml1cpIj579aUWpz6OLZyaPBpYOMH722Wf2fnr+yHNnVN68eaVjx45m0UF5HVzWwVId7NXXMa3n1kFVbYXFx8enODNPrUstrd+hr68Osl874J8d1x54WiqerkClAb53717TkkuLpyWxdevWNB+jLR8d8NcWbUb27XrHAM5iTMFH6ewVne2i3THaV5veWaJWQvoH6T2t8NqZQjq7Rbtt9Gx+xIgRqf4u/bnOjNEK/FraJaRjEenRP+hbbrnFtER0JpB3S0H/v84S0opRWzPXqzRUZl/RrGMc3jTAdFqpnlFrBZnec3vOxr3PvrXLKLXw1d+RWtn19dVWlAb4tfTx116bkll0yqdW3BMnTjSVsIdOO73ea6z/T0Ncp5bqMfXmeS30tdHZX3qCklp4aOvw9xwDOIuWgg9LqynuTaecandGy5YtTZeT9t2+/vrrpr9W+689dBqg/nFq14VOWb22aa9Lly5dzJx3HZDUM2Dt59Ww0W4f3a6VmVYw6dEA0CmS2r+uXVFKp7zqvHbtNsjIVb56nYXSKZ06DVe7wzQYPRX2H6X913rxn+6XThPV8ZNx48aZ11DPdNN7bv2/WoHpurbg9Ixfw1P3TQP32vJroOs4jx4HfYyeoWvLSVsaOmVXXwd9nAatTvvV60o0zD3TYDOT7oOWRcut5dAzdG0haKBdb0xBaZjryYSOD+mUVL2GQ8u6YMECM46l9Jjre6Zu3bqm21Mr+pMnT5oBZh1D0PWMHgM4zOHZT0hlSmp6UpuS+vbbb7sqVKjgyp07t6ty5crmd+kUT+/Dq9MO9X5q
i/d0UJ2++PLLL5tpg/r7brzxRlft2rVdw4YNc505c+a6+7FgwQLzO++6664U2x999FGzXct6rWvL4Jm+WaxYMTN91XuKqK736tUr1dela9eu6ZZt0qRJZnqlTpXVfStXrpxrwIABv9mvtJ47Pj7eVaNGDVeePHlcpUuXNq/TO++885sprEeOHDHHSKdxXjvlU6fmPvPMM67y5cubqbUFCxZ03Xrrra6RI0ea1/6PTEmdNWtWiselNq1UjR8/3lWmTBmz73/9619dq1at+s3vTOv/bt261dW2bVtX/vz5zf5XqlTJ9fzzz6d4zNGjR82xKVGihCtnzpyumJgYV/Pmzc3U1997DOCcEP3H6WACAPgGxhQAABahAACwCAUAgEUoAAAsQgEA8PuuU9DL2A8fPmzmEWfl1aYAgKyhE031AzaLFi1qPsDyT4WCBsL1PvcGAOD7Dh48aL6Q6U+FgudKQ/1cE742EQD8j36Mil5dfr0rxzNUw3u6jDQQ+EJvAPBf1xsCYKAZAGARCgAAi1AAAFiEAgDAIhQAABahAACwCAUAgEUoAAAsQgEAYBEKAACLUAAAWIQCAMAiFAAAFqEAALAIBQCARSgAACxCAQBgEQoAAItQAABYhAIAwCIUAAAWoQAAsAgFAIBFKAAALEIBAGARCgAAi1AAAFiEAgDAIhQAABahAACwCAUAgEUoAAAsQgEAYBEKAACLUAAAWIQCAMAiFAAAFqEAALAIBQCARSgAACxCAQBgEQoAAItQAABYhAIAwCIUAAAWoQAAsAgFAIBFKAAALEIBAGARCgAAK4f8Dh9++KFERkb+nv8C4BpxcXESaOLj4yXQxAXgccoIWgoAAItQAABYhAIAwCIUAAAWoQAAsAgFAIBFKAAALEIBAGARCgAAi1AAAFiEAgDAIhQAAH/sA/GywsUrF2XkVyNl+ubpsv/MfgkLCZPCeQtL9Zuqy9DGQ6VmTE2niwgAQcPxlsKAJQNk0OeDZPuJ7VIsopiUzl9ajp09JvN2zJNdJ3c5XTwACCqOtxQ+3PahuR1822AZ1nSYWXe5XPLVwa9MiwEAEEShkOxKNrdL9iyRW4rdIrcUvUVuyneTNCjZwOmiAUDQcbz76PFbHje3a39aK23ebyMxo2Kk8rjK8sLKF+T85fNOFw8AgorjoTC0yVCZe99caVOxjUTmdn+r286EnTJ4xWDp+UlPp4sHAEHF8VBQbWPbSvz98XLqqVPyzaPfSPXC1c12HWwGAARRKDz3+XOy8chGd2FCQs24QsXoiuZ+VJ4oh0sHAMHF8YHmt9a/JS9+8aIUvKGglIwqaaaj/pT4k/nZA9UecLp4ABBUHG8pDG82XP5W6W8SkStCdpzYYUKhUnQlGdJ4iLzQ7AXxV6tWrZJWrVpJoUKFJCQkxCwTJ04UfzVq1Chp0qSJFClSRHLnzi2lSpWSrl27yp49e8Sfvfrqq1KzZk3Jnz+/2a/ixYvLvffeK5s3b3a6aEjFfffdZ/+eOnXq5HRxApLjLYVHb37ULIFm/fr1snTpUilbtqycOHFC/N3YsWPlwIEDUqlSJQkPD5e9e/fK1KlTZcmSJbJz506JjHRPEvA3K1eulOPHj5vjdP78ebMvs2fPls8//9zsb968eZ0uIn41efJkmTVrltPFCHiOtxQCVZcuXSQxMVEWL14sgaB79+6yb98+2b59u2kd9OvXz2w/cuSIfPbZZ+Kv3n//fTl8+LAJ8e+//16effZZs/3kyZOyY8cOp4uHX+3evVueeOIJqV+/vmnNIesQClkkOjranFEHikGDBknJkiXt/UaNGtl17XbxV3ny5JGPPvpI6tWrJ1WqVJGXXnrJbNduv4oV3RMe4KzLly9L586dJTQ0VGbMmCFhYWFOFymgOd59BP9z5coVeeONN8y6drs0b95c/NnRo0fl66+/tvfLlCkj8+fPl4iICEfLBbdhw4aZ4zN9+nRzbJC1aCngdzl79qy0bdvWdIvFxMSYytOf
WwqqZ8+ekpycLPv375eOHTua8RK9TUpKcrpoQW/dunXyn//8Rx588EHTWkDWIxSQYTp+0LhxYxME2rWyevVq0+USCHQ2i3aPecYUtm3bZsYb4KytW7ealqkO/ufLl88sOgFAzZkzx9w/c+aM08UMKIQCMkQrSe13/+6778x4wpo1a0zXkT9LSEiQadOmycWLF+22Tz/9NEWrCL5BZ4bp8dBFP0XZM9bgfR8BFgofbP1Abp50s4S/GC4FXi4gHWZ2kN0nd4u/mjt3rpQvX97M7fcYPHiw2eaPzeB27dqZ7hWl3Sp6DYaGhC5vvfWW+CPdj4ceeshco1C9enXTUnjmmWfMz3Q8QfcZznr44YdNpe+96DUySrv49L4ePwTYQPPb69+WR+e7r1Uok7+MJPySIHO2z5EvDnwhm3pukph8MeJvdDqqTqPzpvPhdfHHKXUXLlyw6xs3uj+WxKNly5bij7Qy0QugvvnmG3OsLl26JCVKlDBdZNqN5Kl8gGAS4spA20sruKioKNN3l9kXKenXcRYbXUxOnDsh7WPby+z7ZsvhpMPm47OTLiZJnzp95LW7XsvU5wScFBcXJ4EmPj5eAk1cgB0nPelZtGjRdetxx7uPvj30rQkEpaGgikYUlXrF65n1RT8ucrR8ABBMHA+Fg4kH7br312/qt6+pA2fcMw0AAEEQCmlhRgEABGEolIgsYdf1E1KvXdeP0wYABEko6JfqRIdHm3WdcaR0oFm/s1m1LO+fM1sAwB85Hgq5wnLJS81fsqFQdkxZiX091sw80i/eebrh004XEQCChuOhoHrU7iHT206XWjG1TCshREKkXWw7+arbV2YmEgAgiC5eU51rdDYLACDIWwoAAN9AKAAALEIBAGARCgAAi1AAAFiEAgDAIhQAABahAACwCAUAgEUoAAAsQgEAYBEKAIA/9oF4HTt2lJw5c0qg4MvGfV8gHqNA3CcEDloKAACLUAAAWIQCAMAiFAAAFqEAALAIBQCARSgAACxCAQBgEQoAAItQAABYhAIAwCIUAAAWoQAAsAgFAIBFKAAALEIBAGARCgAAi1AAAFiEAgDAIhQAABahkAWGDh0qISEhqS6XL192uni4xvHjx6VPnz5SqlQpyZUrlxQsWFCaN28ue/bsEX+zb9++NN97uuh709+cPXtWBg4cKBUqVJAbbrhBoqKipEaNGjJixAhxuVxOFy/g5HC6AIFMK5dy5cql2KZ/mPAdJ06ckLp168revXtNIFSsWNFUNGvWrJHDhw9L2bJlxZ/kzp3b7I+306dPy86dO816kSJFxN/06tVL3n33XbNetWpVOXPmjGzZssUERZ48eUygI/MQClmodevWMmXKFKeLgXQ899xzJhC0slm6dKmtNC9evOiXZ6Fa/rVr16bY1rt3bxMKN954o3Tu3Fn8zZdffmluW7ZsKQsXLpRffvlFChQoIOfPn5f9+/c7XbyAQ/dRFpozZ46Eh4ebP9S7775bNmzY4HSR4EUr/ZkzZ5r1EiVKyB133CF58+aVmjVrmmOnZ93+LiEhQSZPnmzW//GPf0i+fPnE3zRq1MjcLlq0SKpVq2ZacxoIuv1f//qX08ULOIRCFgkLC5OYmBgpXbq0HDlyRBYsWCD169cnGHxsLOHUqVO2wtFuFj2b3rx5szzwwAMye/Zs8Xfjx4+Xc+fOmYDz126WiRMnykMPPWTWt23bJj/99JPp6tNxBT1eyFyEQhbQCuXYsWOya9cu2b59u6lw1IULF+T11193unj4lfegf2xsrBlY1kXX1bhx48Sfeb/fHnzwQXOS4o9eeeUVmTZtmjRo0MD8XWkwREREmH17+umnnS5ewCEUsoA2b7XP06NFixYSHR1t1g8cOOBgyeCtUKFC5oxTaZeRruui656ZPP5s6tSpcvToUTO5wV+7WbSV8/zzz5uuvvbt25tjVqVKFRMQatmyZU4XMeAQClng5ZdfTlH56wCm9u0q7U6Cb8iZM6fcdtttZl27
jC5dumQWXVc6BdJfaSU6atQoO+HB0/rxx1DwtOi+++47c6vjCdpaUDoGhMxFKGSBCRMmmMpf573rWY22FDxv4H79+jldPHgZPny4aR18//33UqZMGbPouo4JPfvss+Kv5s+fb6ehDhgwQPx5WrcnuGfMmGGCWv+2du/ebbZ17drV4RIGHkIhC2hlohc/6Vmn9lFrOOhUQD3T0ZCA79A5/Z9//rk0adLEDDrrWejtt98uq1evlqZNm4q/GjlypLmtU6eOrVT91bx588w1Cdotq9eO6HRhPW7Tp0+Xxx9/3OniBZwQVwYmYycmJpqrCHWesDa5A0V8fLzTRch0cXFxEkgC8RjBP8QF2N+SnqTqpBe9+C8yMjLNx9FSAABYhAIAwCIUAAAWoQAAsAgFAIBFKAAALEIBAGARCgAAi1AAAFiEAgDAIhQAABahAACwCAUAgEUoAAAsQgEAYBEKAACLUAAAWIQCAMAiFAAAFqEAALByXF0FgD8m0L7kPpjRUgAAWIQCAMAiFAAAFqEAALAIBQCARSgAACxCAQBgEQoAAItQAABYhAIAwCIUAAAWoQAAsAgFAIBFKAAALEIBAGARCgAAi1AAAFiEAgDAIhQAABahAACwCAUAgEUoZJGff/5ZhgwZIpUrV5bw8HApWrSo/OMf/5BTp045XbSgtWrVKmnVqpUUKlRIQkJCzDJx4sQUj7l06ZIMGzZMypYtK7ly5ZLixYtL//79zfH0x/2ZNGmSNGzYUPLmzWsfs2PHDsfKDN9HKGSRNm3ayL///W/58ccfpWLFiqZS0T/YO++8Uy5fvux08YLS+vXrZenSpVKgQIE0H9OtWzcZOnSo7N+/3wTDsWPH5NVXX5W7775bkpOTxd/2Z+HChbJhwwYTHEBGEApZ4Pvvv5cVK1aY9TFjxsimTZvku+++M/fXrVsnM2fOdLiEwalLly6SmJgoixcvTrOSnT59uj1uekY9Z84cc3/lypUyb9488af9UePHjzeP0aADMoJQyALeZ5ShoaEpbtWyZcscKVewi46ONl156Z1Ve7Rv397ctm7dWvLkyWPWFy1aJP60P0q7LcPCwrKtTPB/hEIWiI2NlWrVqpn1Pn36SK1ateTmm2+2Pz906JCDpUNaDh48aNcLFy5sw7xgwYJm/cCBA46VDcguhEIW0DMzPevs3LmzqVD27NkjjRo1knLlypmf58yZ0+ki4ndwuVxOFwHINjmy76mCi85a8fRPq/Pnz0tMTIxZr1SpkoMlQ1pKlChh13WAuUiRIqYrMCEhwWwrWbKkg6UDsgcthSyig5ZJSUlm/cqVKzJgwAA5c+aMud+xY0eHS4fUtGzZ0q57BpgXLFhgAv3anwOBilDIIu+8847pl65evbppIYwbN85s79evn9SpU8fp4gWluXPnSvny5aVJkyZ22+DBg8027eqrXbu23H///WZ73759zdiQZ8BZu//uuece8af9UU899ZS5r7ceLVq0MNtee+01R8oN30b3URbRin/58uVmPEH7pLXC0YvXHnnkEaeLFrR0aubu3btTbDt+/LhZtLtPvfvuu1KhQgWZOnWqeazO7+/QoYMMHz48xQwyf9mfo0eP/uYxngHzkydPZmNp4S9CXBkYRdM3X1RUlGk+B9IgaXx8vASauLg4CSSBeIwCUaC97wKRXq2v06q1GzsyMjLNx/nWqQ8AwFGEAgDAIhQAABahAACwCAUAgEUoAAAsQgEAYBEKAACLUAAAWIQCAMAiFAAAFqEAALAIBQCARSgAACxCAQBgEQoAAItQAABYhAIAwCIUAAAWoQAAsHJcXUUgCLQvug/EL4QPtGMUqPsUaBITEyUqKuq6j6OlAACwCAUAgEUoAAAsQgEAYBEKAACLUAAAWIQCAMAiFAAAFqEAALAIBQCARSgAACxCAQBgEQoAAItQAIBsdv68yCuviNx6q0j+/CK5c4uULCly++0io0c7WzY+OhsAslFCgkjz5iKbNrnv33CDSMWK
IklJIitXinz2mcg//+lc+WgpAEA26t37aiD07esOiS1bRPbtEzlxQmTyZGfLR0sBALLJ6dMis2a512vWdHcVhXqdmut34Dz8sDiKlgIAZJMffhC5csW93qjR1UC45x6RkJCry5QpzpWRUAAAB4R61b6VKrlbDr6AUACAbFKpkkhYmHv9q6+ubn/5ZZEPPhCfQCgAQDaJihK57z73+rp1IkOGXO1O8hWEQiZYtWqVtGrVSgoVKiQhISFmmThxov15UlKS9OvXT2rXri0FCxaU8PBwqVixojz//PPmZ/64T6pbt25SoUIFyZcvn+TNm1fKlSsnTzzxhJw8edKxcgeTjBwjD32f6fG53uP8YZ+aNGlif+a9NGzYUPzB2LEiNWq41//9b5ECBUT+8hfdL/EJhEImWL9+vSxdulQK6NFNRUJCgowZM0a2bdsmxYsXN5Xorl27ZPjw4dKxY0fxx31SH3/8sVy5ckUqV65swm7Pnj0yduxYeeCBB7K1rMEqI8fIo3fv3ub4BNI+lS1bVurWrWuXqlWrij+IjhZZu9bdZVS7tkhyssiOHSLh4SItWohoBurAs1MIhUzQpUsXSUxMlMWLF6f68zx58siIESPk+PHjsnHjRjl48KDUq1fP/GzhwoVy6tQp8bd9UocOHTIVzbp162T//v32TG316tXZWNLglZFjpGbOnClTp06V+zz9FgGwT0pb2mvXrrXLpEmTxF+Eh4sMHOjuQtLOgl9+Edm7V2TRIpHHHnNf5ewUQiETREdHmy6htMTExMiTTz4pERERNiRuueUWsx4aGio5cuTwu33y7If+YepZWunSpeXLL7802/2lGe/vMnKM9ATkscceM12X2jINhH3y6N+/v+TOndu0GHr06CFHjx7N8vIFA0LBAceOHZM5c+aY9U6dOtmw8EfaDfbNN9+YloK6/fbbzZkpnJecnGzOvC9duiTvvfee5MyZUwKFBkexYsXM2MPevXvlzTfflPr168vZs2edLprfIxSy2e7du82Z9OHDh6VBgwY+O+CXUR988IFcvHhRNmzYINWqVZNly5ZJr169nC4WRMw41sqVK82tTmwIFK+88orpct26datpCT3zzDNmu4bDRx995HTx/B6hkI3WrFljxhL07LpNmzayZMkSv24leOgZaK1ataR79+7m/rRp0+QHvXQTjtr06wfs9O3b10xu8B6I1dlwt+pHdPqhv/zlL6bbSOmsI++JDQcOHHCwZIGBUMgms2fPlmbNmsmJEyekT58+Mm/ePLlBPx7RT3377beyYsUKe19bC9pK8KAZ7zv0WOhy7tw5u+3ChQsp7vtT1+vo0aNTTOX+8MMP7bqObeHPIRQywdy5c6V8+fJm/rTH4MGDzbbOnTubriKd+XH+/HnJlSuX6YPXszRtNeii0/D8bZ90em3Tpk3N1EFtJRQpUkTmz59vHqf3a/rKNfsB7HrHaMqUKeJyueyi3SseEyZMMDPh/G2fNMj+9a9/mfddbGyslCxZ0g6g6/127dqJr1m1SqRVK5FCha5+tlFqvcabN4t06OB+XK5cIsWKXb3QLTv53rQXP6RT6HSswJtOP9VFr0vQs2j9o1S6/vXXX//m//vbPun4QcuWLU0Xxffffy9hYWHmj7J169by7LPPmllVcPYYBeI+6cDyoEGDTNerPu6XX34x18ncc889MnDgQDMjztesXy+ydKleV+H+aOzU6MS9O+90T02NjBTRnr6ff9ZrgbK7tCIhLk9tdZ0DFRUVZSqBQJrBEB8f73QRcB1xcXESaHjfBZeEBPcX6eiM2TJl3NsmTBDp2dO9rjVwlSruC9g6dxZ58033dQxKe8kya9jRU4+fOXNGIjV50sDpHABk8RXM4elceqHdRhoInoDQD83Tz0hq1sz9UdvZjVAAAAft3Hl1/b333K0KtXy5+/OQ9BvZshOhAAAOunz56vojj7hbDToHQD9iW8cVsvsLdwgFAHBQsWJX13/99Bsz9qCzkBQt
BQAIInXquGccKf2APKWfGnP8uHu9QoXsLQ+hAABZaO5ckfLlU35fwuDB7m0620gHoYcOdW9/6y293sL91Zz65TsxMSI9emRveQkFAMhCiYn6mWfus38PbQXotkOH3Pf793cHQrVq7o/Q1mmoXbq4Ww6ebqTswsVrAJCFHn7YvVyPDjLr4jRaCgAAi1AAAFiEAgDAIhQAABahAACwCAUAgEUoAAAsQgEAYBEKAACLUAAAWIQCAMAiFAAAFh+IB5/Gl9zDKXFxcRJILl26lKHH0VIAAFiEAgDAIhQAABahAACwCAUAgEUoAAAsQgEAYBEKAACLUAAAWIQCAMAiFAAAFqEAALAIBQCARSggOC1fLhIaKhISIvLf/17dfuWKSL167u2lSokkJjpZSiDbEQoITk2bivTt614fMkRk82b3ugbE11+7Q+Hdd0UiIx0tJpDdCAUEr//8RyQ2VuTiRZEuXUS++UZk2DD3z/r1E2nSxOkSAtmOUEDwypNHZNo0kRw53C2Fxo31m0hEqlQReeklp0sHOIJQQHCrXVvkuefc6+fPi4SFuYNCAwMIQoQCsGtXyoHmffucLA3gKEIBwW3OHJEZM9zrOttIPfaYyNGjjhYLcAqhgOClFX/Pnu71Vq1E1qwRiY4WOXFCpHt3p0sHOIJQyASrVq2SVq1aSaFChSQkJMQsEydOtD+fMmWK3Z7asmLFCvG3fVK7d++WBx98UEqUKCG5c+eWggULSuPGjeXjjz8Wv/Doo+4AKFBA5K23RIoUEZkwwf2z+fNF3n5bfNmoUaOkSZMmUqRIEfP6lypVSrp27Sp79uyxj7l06ZIMGzZMypYtK7ly5ZLixYtL//795eeffxZ/3adJkyZJw4YNJW/evPa9uWPHDkfLHUgIhUywfv16Wbp0qRTQyiUVWrHWrVs3xaJveo+YmBjxt31yuVxyxx13yIwZM+T48eNStWpVuXLligmTtm3byqZNm8SnaQh88ol7ffx4dyCoe+8V6dzZvd6/v0+PL4wdO9a83vnz55dixYrJgQMHZOrUqdKgQQNJ/PWiu27dusnQoUNl//79JhiOHTsmr776qtx9992SnJws/rhPCxculA0bNpi/K2Q+QiETdOnSxbxhFy9enOrPW7duLWvXrk2x6Fm10oq1cuXK4m/7dOjQIdm7d69Z1zNRDZG5c+fawDh48KD4fCvB5XIvHTum/Nn06e7tWgmVLi2+qnv37rJv3z7Zvn27OZPup9dWiMiRI0fks88+M8dkuu6LiIwZM8acTc/RMRQRWblypcybN0/8bZ/U+PHjzXtTww6Zj1DIBNHR0RIeHp7hxy9atEi2bNli1gcMGCD+uE/a0ilfvrxZHzJkiNx8883Srl07yZEjhzk7veuuu7KxtMFp0KBBUrJkSXu/UaNGdl27XvSM2qN9+/b2BCXPr9Nt9X3ob/ukihYtKmE6dRhZglBwwIgRI8xtzZo1TUvBH+kf5fLly6V27dpy4cIF05w/ffq03HjjjSYg+KPNXtp198Ybb5h17SZq3rx5itZa4cKFzW1oaKhtpWrXjL/tE7IeoZDNtPL8/PPPzfqTTz4p/kr7o3v27Cnfffed9O3b1wxczpo1y4wv9O7d2ye7JgLV2bNnzTiOdvXp+NT8+fPtWXVqtHsv0PYJmYdQyGYjR440tzpjp1OnTuKvtH93wYIFZl1nh+hMkA4dOkjkrx8gt2zZModLGBy0r11nfGmlWbFiRVm9erVU0Y/p+PU95qEDzJ4wT0hIMOve3TT+sk/IeoRCNtLm+syZM826nl1r/7u/OnPmjF1ft26duf3hhx8kKSnJrGtIIGtt27ZN6tWrZ1pr2ve+Zs0a083i0bJlS7vuGWDWID+vH+dxzc/9ZZ+Q9QiFTKCzbnTQVedXewwePNhs6+yZ3ihipgJevnxZoqKipEePHuLP+9S0aVMzfqC0G6l69epmLEG7JnLmzCn333+/+JxVq9wXqelURv1obF2uufbCTEF9
+GH31c06IFupksj//qen2OJrdGBfp5oqDWO9rkQrVF3eeustM97jOQ56EhIbG2sHnLXCveeee8Tf9kk99dRT5n2otx4tWrQw21577TXHyh4o/PdU1Yfo9Di9kMub9q3rohcLec6sPW9qDYSIiAjx533S2UnarH/xxRfliy++kF27dpmQ0Gb/c889J7Vq1RKfs369yNKlOmrpvmjtWsePi9Sp477Nl09Epwpv3aq1kMjhw5rq4kt0gN9j48aNKX7maQW8++67UqFCBTPXX4+nzu3Xbr7hw4ebQWdfk5F9Onr06G/em55B85MnT2ZLOQNZiCsDo05aQejZrR4UPQsMFPHx8U4XAdlJ+9JvuMH98RZlyri36RXMno+60IvYevVyr2sYVK0q8uabmuLuT0/V6zK8+ukR2OLi4iSQ6NXtOg1ZT1A9Y3+p8b1TBSCr6OcapXc9iXcXkecs2nOrn56qX+EJBDhCAfDQ8QbtNlJ164poF5inFaEOHXKsaEB2IRQADx1rWLLE/f3N2kLQcQQddNYBaRVAXadAWggFwFv9+iJ6ceHp0zq5Xz9Rzv05SEpnIgEBjlAAvH35pXv8QJ06pZedu9f1oyH4mAUEAUIBwUM/xVU/xM/r2gsZPNi9zXM9iY4haADUqCGi04m/+so980ivZ9CZS0CAIxQQPPSjsHV++68XRxl6TYJu8wwi33mniE7X27lTRK841/vanfTrRV9AoOPiNQQPHTTWJT2jR7sXIEjRUgAAWIQCAMAiFAAAFqEAALAIBQCARSgAACxCAQBgEQoAAItQAABYhAIAwCIUAAAWoQAAsIL6A/EC7Yu5AeDPoqUAALAIBQCARSgAACxCAQBgEQoAAItQAABYhAIAwCIUAAAWoQAAsAgFAIBFKAAALEIBAGARCgAAi1AAAFiEAgDAIhQAABahAACwCAUAgEUoAAAsQgEAYBEKAACLUAAAWIQCAMAiFAAAFqEAALAIBQCARSgAACxCAQBgEQoAAItQAABYhAIAwCIUAAAWoQAAsAgFAIBFKAAALEIBAGARCgAAi1AAAFiEAgDAIhQAABahAACwCAUAgEUoAAAsQgEAYBEKAACLUAAAWIQCAMAiFAAAFqEAALAIBQCARSgAAKwckgEul8vcXr58OSMPBwD4GE/97anP/1QoJCUlmdtly5ZlRtkAAA7R+jwqKirNn4e4rhcbIpKcnCyHDx+WiIgICQkJyewyAgCymFb1GghFixaV0NDQPxcKAIDgwEAzAMAiFAAAFqEAALAIBQCARSgAACxCAQBgEQoAAPH4f8EOdr53tgDzAAAAAElFTkSuQmCC", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def plot_maze_with_states() -> None:\n", " \"\"\"Plot the maze with state indices.\"\"\"\n", " grid = np.ones(\n", " (n_rows, n_cols),\n", " ) # Start with a matrix of ones. Here 1 means “free cell”\n", " for i in range(n_rows):\n", " for j in range(n_cols):\n", " if maze_str[i][j] == \"#\":\n", " grid[i, j] = 0 # We replace walls (#) with 0\n", "\n", " _fig, ax = plt.subplots()\n", " ax.imshow(grid, cmap=\"gray\", alpha=0.7)\n", "\n", " # Plot state indices\n", " for (\n", " s,\n", " (i, j),\n", " ) in state_to_pos.items():\n", " cell = maze_str[i][j]\n", "\n", " if cell == \"S\":\n", " label = f\"S\\n{s}\"\n", " color = \"green\"\n", " elif cell == \"G\":\n", " label = f\"G\\n{s}\"\n", " color = \"blue\"\n", " elif cell == \"X\":\n", " label = f\"X\\n{s}\"\n", " color = \"red\"\n", " else:\n", " label = str(s)\n", " color = \"black\"\n", "\n", " ax.text(\n", " j,\n", " i,\n", " label, # Attention : matplotlib, text(x, y, ...) expects (column, row)\n", " ha=\"center\",\n", " va=\"center\",\n", " fontsize=10,\n", " fontweight=\"bold\",\n", " color=color,\n", " )\n", "\n", " ax.set_xticks([]) # remove numeric axes, we don't need.\n", " ax.set_yticks([])\n", " ax.set_title(\"Maze with state indices\")\n", "\n", " plt.show()\n", "\n", "\n", "plot_maze_with_states()" ] }, { "cell_type": "markdown", "id": "db078d86", "metadata": {}, "source": [ "### 2.4 Actions and deterministic movement" ] }, { "cell_type": "markdown", "id": "96e7f1f2-9d73-410b-853d-e39f40dfb5da", "metadata": {}, "source": [ "We first define integer codes for each action. \n", "\n", "**Exercise 3.** How many possible actions can the agent take in the maze?" ] }, { "cell_type": "markdown", "id": "22259ab4-527e-4d7c-bb30-98fb240da6d5", "metadata": {}, "source": [ "We have four possible actions in the maze. \n", "\n", "In this following cell, each action is mapped to an integer (0,1,2,3). 
This makes it easy to store and use actions inside arrays and matrices.\n", "\n", "Here we use Unicode arrow characters:\n", "\n", "- \"\\u2191\" : ↑ (up arrow)\n", "\n", "- \"\\u2192\" : → (right arrow)\n", "\n", "- \"\\u2193\" : ↓ (down arrow)\n", "\n", "- \"\\u2190\" : ← (left arrow)" ] }, { "cell_type": "code", "execution_count": 81, "id": "f7f0b8e4-1f48-4d03-9e5f-a47e59c3e827", "metadata": {}, "outputs": [], "source": [ "A_UP, A_RIGHT, A_DOWN, A_LEFT = 0, 1, 2, 3\n", "ACTIONS = [A_UP, A_RIGHT, A_DOWN, A_LEFT]\n", "action_names = {A_UP: \"\\u2191\", A_RIGHT: \"\\u2192\", A_DOWN: \"\\u2193\", A_LEFT: \"\\u2190\"}" ] }, { "cell_type": "code", "execution_count": 82, "id": "3773781c-a0cd-48db-967b-d4b432d17046", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "↑\n" ] } ], "source": [ "print(action_names[0])" ] }, { "cell_type": "markdown", "id": "4b957f5a-ee39-4437-abc1-4809105ad83c", "metadata": {}, "source": [ "**Exercise 4.** Now we define a **deterministic movement function** `move_deterministic(i, j, a)`. \n", "\n", "This function simulates the robot trying to move from (i, j) in direction a.\n", "\n", "If the movement would hit a wall or leave the grid, the agent stays in place." ] }, { "cell_type": "code", "execution_count": null, "id": "4b06da5e-bc63-48e5-a336-37bce952443d", "metadata": {}, "outputs": [], "source": [ "def move_deterministic(i: int, j: int, a: int) -> tuple[int, int]:\n", "    \"\"\"Deterministic movement on the grid. If the movement hits a wall or boundary, the agent stays in place.\n", "\n", "    Args:\n", "        i (int): current row index\n", "        j (int): current column index\n", "        a (int): action to take (A_UP, A_RIGHT, A_DOWN, A_LEFT)\n", "\n", "    Returns:\n", "        (tuple[int, int]): new (row, column) position after taking action a\n", "\n", "    \"\"\"\n", "    # Default candidate: unless the action succeeds, the robot stays in place.\n", "    candidate_i, candidate_j = i, j\n", "\n", "    # Each action changes the coordinates of the robot:\n", "    if a == A_UP:\n", "        candidate_i, candidate_j = i - 1, j  # UP: row becomes row - 1\n", "    elif a == A_DOWN:\n", "        candidate_i, candidate_j = i + 1, j  # DOWN: row becomes row + 1\n", "    elif a == A_LEFT:\n", "        candidate_i, candidate_j = i, j - 1  # LEFT: column becomes column - 1\n", "    elif a == A_RIGHT:\n", "        candidate_i, candidate_j = i, j + 1  # RIGHT: column becomes column + 1\n", "\n", "    # Check boundaries\n", "    if not (0 <= candidate_i < n_rows and 0 <= candidate_j < n_cols):\n", "        # If the robot tries to move outside the maze,\n", "        # it does not move and stays at (i, j).\n", "        return i, j\n", "\n", "    # Check wall\n", "    if maze_str[candidate_i][candidate_j] == \"#\":\n", "        # If the next cell is a wall, the robot stays in place.\n", "        return i, j\n", "\n", "    return candidate_i, candidate_j  # Otherwise, return the new position" ] }, { "cell_type": "markdown", "id": "c9e620e6", "metadata": {}, "source": [ "### 2.5 Transition probabilities and reward function" ] }, { "cell_type": "markdown", "id": "80bd2bca-7717-4b5f-bffa-76fe86a51d35", "metadata": {}, "source": [ "Recall that we set the discount factor $\\gamma \\in (0,1)$, that is, future rewards are multiplied by $\\gamma$ at each time step, so immediate rewards matter slightly more than future ones. 
\n", "\n", "\n", "Moreover, we introduce an error probability $p_{\\text{error}}$: with probability $p_{\\text{error}}$, the robot **does not** execute the intended action but moves in one of the 3 other directions (chosen uniformly, so each with probability $p_{\\text{error}}/3$). With probability $1-p_{\\text{error}}$, the robot executes the intended action." ] }, { "cell_type": "code", "execution_count": 84, "id": "610253e7-f3f7-4a30-be3e-2ec5a1e2ed04", "metadata": {}, "outputs": [], "source": [ "gamma = 0.95\n", "p_error = 0.1  # probability of moving in a random other direction instead of the intended one" ] }, { "cell_type": "markdown", "id": "0d1ceff8-86e0-4c45-83d3-af9fae974608", "metadata": {}, "source": [ "Now we initialize the state-transition probabilities: the probability of reaching the next state $s'$ after taking action $a$ in state $s$. \n", "$$\n", " p(s' \\mid s, a)\n", " = \\mathbb{P} \\big[S_{t+1}=s' \\mid S_t=s, \\,A_t=a\\big]\n", "$$\n", "\n", "We store these transition probabilities in the 3D array `P` (`P[a][s, s_next]`), which has shape `(n_actions, n_states, n_states)`:\n", "\n", "`P[a, s, s_next] = P(S_{t+1} = s_next | S_t = s, A_t = a)`.\n", "\n", "We also initialize the reward vector `R`, which has length `n_states`, where `R[s]` is the reward received when the agent takes an action in state `s`.\n", "\n", "In this maze game, we assume that the reward depends only on the current state, which is natural: in navigation tasks, being in a particular location is what matters, not the direction you used to reach it." ] }, { "cell_type": "code", "execution_count": 85, "id": "7a51f242-fe4e-4e74-8a1f-a8df32b194b8", "metadata": {}, "outputs": [], "source": [ "# Initialize transition matrices and reward vector\n", "P = np.zeros((len(ACTIONS), n_states, n_states))\n", "R = np.zeros(n_states)" ] }, { "cell_type": "markdown", "id": "c08f4af5-a2a7-4baa-b5da-c7ce636d8a4a", "metadata": {}, "source": [ "Now we assign the reward to each state. \n", "\n", "For each state index s:\n", "\n", "1. 
If `s` is a goal, the reward is +1.0.\n", "2. If `s` is a trap, the reward is −1.0.\n", "3. Otherwise (normal cell), the reward is −0.01, received every time the agent leaves this cell.\n", "\n", "Recall that rewards are received at the moment the agent executes an action: whenever the agent moves out of a normal cell, it receives the reward −0.01. " ] }, { "cell_type": "code", "execution_count": 86, "id": "49d54d1f-dc29-45b6-ad31-ad0e848f920d", "metadata": {}, "outputs": [], "source": [ "# Set rewards for each state\n", "step_penalty = -0.01\n", "goal_reward = 1.0\n", "trap_reward = -1.0" ] }, { "cell_type": "markdown", "id": "dd571ec8-c36a-4e20-bec6-9e6458dc622b", "metadata": {}, "source": [ "**Exercise 5.** Why do we set the step penalty to -0.01 in this MDP?" ] }, { "cell_type": "markdown", "id": "1e8ea171", "metadata": {}, "source": [ "We set a small negative step penalty (`-0.01`) for two main reasons:\n", "\n", "- Incentivize efficiency: It encourages the agent to take a short path to the goal. Because the agent loses a small amount of reward at every step, the sooner it reaches the goal, the higher its cumulative return.\n", "\n", "- Prevent loitering: It discourages loops and aimless wandering. Without this penalty (i.e., if the step reward were 0), the only incentive to reach the goal quickly would come from the discount factor $\\gamma$." ] }, { "cell_type": "markdown", "id": "07bfb065-b1af-4df1-885e-780fe250f2fb", "metadata": {}, "source": [ "**Exercise 6.** We now define the reward vector. Recall that we have already initialized\n", "`R = np.zeros(n_states)`.\n", "If a state belongs to `goal_states`, we assign the `goal_reward`.\n", "If it belongs to `trap_states`, we assign the `trap_reward`.\n", "Otherwise, we assign the `step_penalty`. 
" ] }, { "cell_type": "code", "execution_count": 87, "id": "b9b7495a-c233-425c-99c0-5bddaf6c3225", "metadata": {}, "outputs": [], "source": [ "for s in range(n_states):\n", " if s in goal_states:\n", " R[s] = goal_reward\n", " elif s in trap_states:\n", " R[s] = trap_reward\n", " else:\n", " R[s] = step_penalty" ] }, { "cell_type": "markdown", "id": "b90fb80c-9452-48a2-889f-286703c2ae93", "metadata": {}, "source": [ "Now we define terminal states and a helper function. Here terminal_states is a set containing all absorbing states, which means, reaching them ends the episode conceptually. \n", "\n", "Moreover, `is_terminal(s)` is a small helper to check if a state is terminal." ] }, { "cell_type": "code", "execution_count": 88, "id": "eca4c571-39c7-468b-af86-0bab9489415e", "metadata": {}, "outputs": [], "source": [ "terminal_states = set(goal_states + trap_states)\n", "\n", "\n", "def is_terminal(s: int) -> bool:\n", " \"\"\"Check if a state is terminal (goal or trap).\"\"\"\n", " return s in terminal_states" ] }, { "cell_type": "markdown", "id": "3a9a1d54-8339-402b-84e9-105961ed78d7", "metadata": {}, "source": [ "Now we need to fill the transition matrices `P[a][s, s_next]`. \n" ] }, { "cell_type": "markdown", "id": "d9cfd15c-12cc-48bb-bd88-07f3ae3db31c", "metadata": {}, "source": [ "**Exercise 7.** **Complete the `# TO DO` part in the program below** to fill the transition matrices `P[a][s, s_next]`. 
" ] }, { "cell_type": "code", "execution_count": 89, "id": "2d03276b-e206-4d1f-9024-f6948ca61523", "metadata": {}, "outputs": [], "source": [ "for s in range(n_states): # We loop over all states s.\n", " i, j = state_to_pos[\n", " s\n", " ] # We recover the states to their coordinates (i, j) in the maze.\n", "\n", " # First, in a goal or trap state,\n", " # No matter which action you “choose”, you stay in the same state with probability 1.\n", " # This makes the terminal states as the absorbing states.\n", " if is_terminal(s):\n", " # Terminal states: stay forever\n", " for a in ACTIONS:\n", " P[a, s, s] = goal_reward\n", " continue\n", "\n", " # If the state is non-terminal, we define the stochastic movement.\n", " # For a given state s and intended action a,\n", " # With probability 1 - p_error, the robot will move in direction a;\n", " # With probability p_error, the robot will move in one of the other 3 directions, each with probability p_error / 3.\n", " for a in ACTIONS:\n", " # main action (intended action)\n", " main_i, main_j = move_deterministic(i, j, a)\n", " s_main = pos_to_state[\n", " (main_i, main_j)\n", " ] # s_main is the state index of that next cell.\n", " P[a, s, s_main] += (\n", " 1 - p_error\n", " ) # We add probability 1 - p_error to P[a, s, s_main].\n", "\n", " # error actions\n", " other_actions = [\n", " a2 for a2 in ACTIONS if a2 != a\n", " ] # other_actions = the 3 actions different from a.\n", " for a2 in other_actions: # for each of the error action,\n", " error_i, error_j = move_deterministic(i, j, a2)\n", " s_error = pos_to_state[(error_i, error_j)] # get its state index s_error\n", " P[a, s, s_error] += p_error / len(\n", " other_actions,\n", " ) # add p_error / 3 to P[a, s, s_error]\n", "# So for each (s,a), probabilities over all s_next sum to 1." 
] }, { "cell_type": "markdown", "id": "7841b264-af00-4322-b728-adcffac0ef89", "metadata": {}, "source": [ "Now we check if the transition matrices `P[a][s, s_next]` are computed correctly.\n", "For each action `a`, we sum the transition probabilities over all possible next states `s_next` and verify that these sums are equal to 1.\n", "\n", "This is because the matrix `P[a, s, s_next]` stores the transition probability\n", "\n", "$\\mathbb{P} \\big[S_t=s_{\\text{next}}\\,|\\, S_{t-1}=s, \\,A_{t-1}=a\\big]$. \n", "\n", "Therefore, for each action $a$, and for each state $s$, the sum over $s_{\\text{next}}$ of $\\mathbb{P} \\big[S_t=s_{\\text{next}}\\,|\\, S_{t-1}=s, \\,A_{t-1}=a\\big]$ should be 1. " ] }, { "cell_type": "code", "execution_count": 90, "id": "341fe630-8f87-4773-84ad-92d3516e53e2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Action ↑: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n", "Action →: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n", "Action ↓: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n", "Action ←: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n" ] } ], "source": [ "for a in ACTIONS:\n", " # For each action a:\n", " # P[a] is a matrix of shape (n_states, n_states).\n", " # P[a].sum(axis=1) sums over next states s_next, giving for each state s:\n", " # We print these row sums.\n", " # If everything is correct, they should be very close to 1.\n", "\n", " probs = P[a].sum(axis=1)\n", " print(f\"Action {action_names[a]}:\", probs)" ] }, { "cell_type": "markdown", "id": "46d23991", "metadata": {}, "source": [ "## 3. 
Policy evaluation\n", "\n", "### 3.1 Bellman expectation equation" ] }, { "cell_type": "markdown", "id": "305b047c-e83b-4f42-b64e-e2050d5deeff", "metadata": {}, "source": [ "Recall that the value function under a policy $\\pi$ is defined as:\n", "$$\n", "V^{\\pi}(s)=\\mathbb{E}\\Big[\\:G_t \\:\\Big|\\: S_t=s\\:\\Big]\n", "$$\n", "where the return $G_t$ is\n", "$$\n", "G_t=R_t +\\gamma R_{t+1}+\\gamma^2 R_{t+2}+\\dots \n", "$$\n", "In words: *the value of a state is the expected discounted sum of all future rewards\n", "when following policy $\\pi$.*\n", "\n", "We know that $G_t=R_t+\\gamma G_{t+1}$, and plugging this equation into the definition of $V^{\\pi}(s)$, we get \n", "$$\n", "V^{\\pi}(s)=\\mathbb{E}\\Big[\\:R_t \\:\\Big|\\: S_t=s\\:\\Big]+\\gamma\\mathbb{E}\\Big[\\:G_{t+1} \\:\\Big|\\: S_t=s\\:\\Big]. \n", "$$\n", "This step simply says: *the total future reward = immediate reward + discounted reward from the next state.*" ] }, { "cell_type": "markdown", "id": "88ea8d56-3b62-4690-9ff7-469e43726fbc", "metadata": {}, "source": [ "For the expected immediate reward part $\\mathbb{E}[R_t| S_t=s]$, as we are in a maze problem, the reward depends only on the current state, not the time step, i.e., $\\mathbb{E}[R_t| S_t=s]=R(s)$. Hence we get \n", "$$\n", "V^{\\pi}(s)=R(s)+\\gamma\\mathbb{E}\\Big[\\:G_{t+1} \\:\\Big|\\: S_t=s\\:\\Big]. \n", "$$\n", "\n", "Moreover, in this maze problem, we consider a deterministic policy $A_t=\\pi(S_t)$ (the action depends only on the state). Therefore, \n", "$$\n", "V^{\\pi}(s)=R(s)+\\gamma\\mathbb{E}\\Big[\\:G_{t+1} \\:\\Big|\\: S_t=s, A_t=\\pi(s)\\:\\Big]. \n", "$$\n", "\n", "Now **given the state $S_t=s$ and $A_t=a$**, the next state is random (because of the error probability) and we know the transition probability \n", "$$\n", "\\mathbb{P}\\big(\\:S_{t+1}=s' \\:|\\:S_t=s, \\, A_t=a\\big)=P\\big(s'\\:\\big|\\:s, a\\big). 
\n", "$$" ] }, { "cell_type": "markdown", "id": "c25e255d-8f58-4eaf-9485-cee6ab3bea6c", "metadata": {}, "source": [ "Therefore,\n", "$$\n", "\\mathbb{E}\\big[\\,G_{t+1}\\,|\\,S_t=s,A_t=a\\,\\big] =\\sum_{s'}\\mathbb{E}\\big[\\,G_{t+1}\\,|\\,S_{t+1}=s'\\,\\big]\\times \\mathbb{P}\\big[S_{t+1}=s'\\,\\big|\\,S_t=s, A_t=a\\, \\big]\n", "$$\n", "$$\n", "\\hspace{-1.2cm}=\\sum_{s'}V^{\\pi}(s')P\\big(s'\\:\\big|\\:s, a\\big),\n", "$$\n", "where we use the law of total expectation and the Markov property. (**Question: Can you show the detailed computations here?**)" ] }, { "cell_type": "markdown", "id": "9a2b6cff-e848-44a2-b504-973067b367b3", "metadata": {}, "source": [ "In conclusion, we obtain the Bellman expectation equation\n", "$$\n", "V^{\\pi}(s)=R(s)+\\gamma \\sum_{s'}P\\big(\\,s'\\,\\big|\\,s, \\pi(s)\\,\\big)V^{\\pi}(s').\n", "$$" ] }, { "cell_type": "markdown", "id": "15049fdb-f3af-4f78-b556-817284260ed0", "metadata": {}, "source": [ "### 3.2 Define a function which computes the value function $V^{\\pi}(s)$ for a given deterministic policy. \n", "\n", "\n", "**Exercise $8^*$.** Now we define `policy_evaluation(...)`, which computes the value function $V^{\\pi}(s)$ for a given deterministic policy. \n", "\n", "The inputs of the function `policy_evaluation(...)` are:\n", "1. `policy`: array of size `n_states`; each entry is an action 0, 1, 2, 3, corresponding to UP, RIGHT, DOWN, LEFT.\n", "2. `P`: the transition probabilities `P[a, s, s']`.\n", "3. `R`: the reward vector `R[s]`.\n", "4. `gamma`: the discount factor $\\gamma\\in(0,1)$.\n", "5. `theta`: convergence threshold.\n", "6. 
`max_iter`: maximum number of iterations, used to avoid infinite loops.\n", "\n", "How can we apply the Bellman expectation equation\n", "$$\n", "V^{\\pi}(s)=R(s)+\\gamma \\sum_{s'}P\\big(\\,s'\\,\\big|\\,s, \\pi(s)\\,\\big)V^{\\pi}(s').\n", "$$\n", "here?\n", "\n" ] }, { "cell_type": "markdown", "id": "5c48f489-3508-4981-8b35-5bedc2e5838c", "metadata": {}, "source": [ "We start with an initial guess of $V^{\\pi}$ (e.g., all values = 0) and repeatedly apply the Bellman equation to update each state:\n", "$$\n", "V_{k+1}^\\pi(s) \\leftarrow R(s)+\\gamma \\sum_{s'}P\\big(\\,s'\\,\\big|\\,s, \\pi(s)\\,\\big)V^{\\pi}_k(s').\n", "$$\n", "until the values converge, i.e., until $\\max_s \\left| V^{\\pi}_{k+1}(s) - V^{\\pi}_k(s) \\right| < \\theta$." ] }, { "cell_type": "code", "execution_count": null, "id": "2fffe0b7", "metadata": {}, "outputs": [], "source": [ "def policy_evaluation( # noqa: PLR0913\n", " policy: np.ndarray,\n", " P: np.ndarray,\n", " R: np.ndarray,\n", " gamma: float,\n", " theta: float = 1e-6,\n", " max_iter: int = 10_000,\n", ") -> np.ndarray:\n", " \"\"\"Evaluate a deterministic policy for the given MDP.\n", "\n", " Args:\n", " policy: array of shape (n_states,), with values in {0,1,2,3}\n", " P: array of shape (n_actions, n_states, n_states)\n", " R: array of shape (n_states,)\n", " gamma: discount factor\n", " theta: convergence threshold\n", " max_iter: maximum number of iterations\n", "\n", " Returns:\n", " array of shape (n_states,), the estimated value function of the policy\n", "\n", " \"\"\"\n", " n_states = len(R) # get the number of states\n", " V = np.zeros(n_states) # initialize the value function\n", "\n", " for _it in range(max_iter): # Main iterative loop\n", " V_new = np.zeros_like(\n", " V,\n", " ) # Create a new value vector; we will compute an updated value for each state.\n", "\n", " # Now we update each state using the Bellman expectation equation\n", " for s in range(n_states):\n", " a = policy[s] # Extract the action chosen by the policy in state s\n", " V_new[s] = R[s] + gamma * np.sum(P[a, s, :] * V)\n", "\n", " delta = np.max(\n", " np.abs(V_new - V),\n", " ) # This measures how much the value function changed in this 
iteration:\n", " # If delta is small, the values start to converge; otherwise, we need to keep iterating.\n", " V = V_new # Update V, i.e. Set the new values for the next iteration.\n", "\n", " if delta < theta: # Check convergence: When changes are tiny, we stop.\n", " break\n", "\n", " return V # Return the final value function, this is our estimate for V^{pi}(s), s in the state set." ] }, { "cell_type": "markdown", "id": "09ef3439", "metadata": {}, "source": [ "### 3.3 Evaluating a random policy" ] }, { "cell_type": "markdown", "id": "eecbca15-f89f-47bf-a13d-7d7c051699b8", "metadata": {}, "source": [ "Now we use the policy evaluation function `policy_evaluation` to evaluate a random policy. \n", "\n", "We first generate a `random_policy`, which is an array like [2, 0, 1, 3, 0, 2, ...] and has the size `n_states`. (Recall that the policy is a mapping from states to actions)." ] }, { "cell_type": "code", "execution_count": 92, "id": "b4a44e38", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 3 2 1 1 3 0 2 0 0 2 3 2 3 2 3 2 0 3 1 2 1]\n" ] } ], "source": [ "# Random policy: for each state, pick a random action\n", "random_policy = rng.integers(low=0, high=len(ACTIONS), size=n_states)\n", "\n", "print(random_policy)" ] }, { "cell_type": "markdown", "id": "3fe07992-ce82-4124-aebc-a6384d417f64", "metadata": {}, "source": [ "Now we call the function `policy_evaluation(...)` to compute $V^{\\pi_{\\text{random}}}(s)$." ] }, { "cell_type": "code", "execution_count": 93, "id": "c5f559b2-452a-477c-a1fa-258b40805670", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Value function under random policy:\n", "[ -0.2 -0.2 -0.201 -0.204 -0.205 -0.202 -0.214 -0.429 -0.212\n", " -0.207 -0.276 -0.459 -0.352 -0.366 -5.827 -4.605 20. -0.366\n", " -0.999 -20. 
-6.4 -3.163]\n" ] } ], "source": [ "V_random = policy_evaluation(policy=random_policy, P=P, R=R, gamma=gamma)\n", "print(\"Value function under random policy:\")\n", "print(V_random)" ] }, { "cell_type": "markdown", "id": "f46c70ba-2932-49af-b568-b5477260bc94", "metadata": {}, "source": [ "In this value vector:\n", "- A negative value means that, starting from this state, the agent tends to wander aimlessly, fall into a trap, or take too long to reach the goal.\n", "- A higher value means that the agent is closer to the goal or more likely to reach it.\n", "\n", "The values $20$ and $-20$ belong to the goal and trap states: these states are absorbing and yield their reward $\\pm 1$ at every step, so their value is $\\pm 1/(1-\\gamma) = \\pm 20$." ] }, { "cell_type": "markdown", "id": "1efcb076-467c-42d8-94e8-87453f688bbd", "metadata": {}, "source": [ "Now we define a function `plot_values`, which displays the value function $V(s)$ on the maze grid as a heatmap. It helps to see:\n", "- which states are good (high value, near the goal),\n", "- which states are bad (low value, near traps),\n", "- how a policy affects the long-term expected reward." ] }, { "cell_type": "code", "execution_count": null, "id": "4c428327", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def plot_values(V: np.ndarray, title: str = \"Value function\") -> None:\n", " \"\"\"Plot the value function V on the maze as a heatmap.\"\"\"\n", " grid_values = np.full(\n", " (n_rows, n_cols),\n", " np.nan,\n", " ) # Initializes a grid the same size as the maze. Every cell starts as NaN.\n", " for (\n", " s,\n", " (i, j),\n", " ) in (\n", " state_to_pos.items()\n", " ): # recall that state_to_pos maps each state index to its maze coordinates (i,j).\n", " grid_values[i, j] = V[\n", " s\n", " ] # For each reachable cell, we write the value V[s] in the grid.\n", " # Walls # never get values, and they stay as NaN.\n", "\n", " _fig, ax = plt.subplots()\n", " im = ax.imshow(grid_values, cmap=\"magma\")\n", " plt.colorbar(im, ax=ax)\n", "\n", " # For each state:\n", " # Place the text label at (column j, row i).\n", " # Display value to two decimals.\n", " # Use white text so it's visible on the heatmap.\n", " # Center the text inside each cell.\n", "\n", " for s, (i, j) in state_to_pos.items():\n", " ax.text(\n", " j,\n", " i,\n", " f\"{V[s]:.2f}\",\n", " ha=\"center\",\n", " va=\"center\",\n", " color=\"white\",\n", " fontsize=9,\n", " )\n", "\n", " # Remove axis ticks and set title\n", " ax.set_xticks([])\n", " ax.set_yticks([])\n", " ax.set_title(title)\n", " plt.show()\n", "\n", "\n", "plot_values(V_random, title=\"Value function: random policy\")" ] }, { "cell_type": "markdown", "id": "8275a1eb-b58e-4e05-ae5d-5635ff9a1556", "metadata": {}, "source": [ "The next function `plot_policy` visualizes a policy on the maze.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "c1ab67f0-bd5e-4ffe-b655-aec030401b78", "metadata": {}, "outputs": [], "source": [ "def plot_policy(policy: np.ndarray, title: str =\"Policy\") -> None:\n", " \"\"\"Plot the given policy on the maze.\"\"\"\n", " _fig, ax = plt.subplots()\n", " # draw walls as dark cells\n", " wall_grid = np.zeros((n_rows, n_cols))\n", " 
for i in range(n_rows):\n", " for j in range(n_cols):\n", " if maze_str[i][j] == \"#\":\n", " wall_grid[i, j] = 1\n", " ax.imshow(wall_grid, cmap=\"Greys\", alpha=0.5)\n", "\n", " for s, (i, j) in state_to_pos.items():\n", " cell = maze_str[i][j]\n", " if cell == \"#\":\n", " continue\n", "\n", " if s in goal_states:\n", " ax.text(\n", " j,\n", " i,\n", " \"G\",\n", " ha=\"center\",\n", " va=\"center\",\n", " fontsize=14,\n", " fontweight=\"bold\",\n", " color=\"blue\",\n", " )\n", " elif s in trap_states:\n", " ax.text(\n", " j,\n", " i,\n", " \"X\",\n", " ha=\"center\",\n", " va=\"center\",\n", " fontsize=14,\n", " fontweight=\"bold\",\n", " color=\"red\",\n", " )\n", " elif s == start_state:\n", " ax.text(\n", " j,\n", " i,\n", " \"S\",\n", " ha=\"center\",\n", " va=\"center\",\n", " fontsize=14,\n", " fontweight=\"bold\",\n", " color=\"green\",\n", " )\n", " else:\n", " a = policy[s]\n", " ax.text(\n", " j,\n", " i,\n", " action_names[a],\n", " ha=\"center\",\n", " va=\"center\",\n", " fontsize=14,\n", " color=\"black\",\n", " )\n", "\n", " ax.set_xticks(np.arange(-0.5, n_cols, 1))\n", " ax.set_yticks(np.arange(-0.5, n_rows, 1))\n", " ax.set_xticklabels([])\n", " ax.set_yticklabels([])\n", " ax.grid(visible=True)\n", " ax.set_title(title)\n", " plt.show()" ] }, { "cell_type": "markdown", "id": "48037254-dccc-4f9c-a4d7-349adba5c74f", "metadata": {}, "source": [ "Now let’s visualize the `random_policy`. Does it seem like a good policy?" ] }, { "cell_type": "code", "execution_count": 96, "id": "d452681c-c89c-41cc-95dc-df75993b0391", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot_policy(policy=random_policy, title=\"Policy\")" ] }, { "cell_type": "markdown", "id": "cbad5bf1-0150-4c3f-8cce-c82e0f1d1695", "metadata": {}, "source": [ "**Exercise 9.** Define your own policy and evaluate it using the functions `policy_evaluation(...)` and `plot_values(...)`. **Can you identify an optimal policy visually?** Plot your own policy using `plot_policy`. \n" ] }, { "cell_type": "code", "execution_count": null, "id": "929707e6-3022-4d86-96cc-12f251f890a9", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "my_policy = np.ndarray(\n", " [\n", " A_RIGHT,\n", " A_RIGHT,\n", " A_RIGHT,\n", " A_DOWN,\n", " A_DOWN, # First row\n", " A_UP,\n", " A_DOWN,\n", " A_DOWN,\n", " A_LEFT, # Second row\n", " A_UP,\n", " A_RIGHT,\n", " A_DOWN, # Third row\n", " A_UP,\n", " A_LEFT,\n", " A_RIGHT,\n", " A_RIGHT,\n", " A_RIGHT, # Fourth row\n", " A_UP,\n", " A_LEFT,\n", " A_DOWN,\n", " A_RIGHT,\n", " A_UP, # Fifth row\n", " ],\n", ")\n", "\n", "V_my_policy = policy_evaluation(policy=my_policy, P=P, R=R, gamma=gamma)\n", "\n", "plot_values(V=V_my_policy, title=\"Value function: my policy\")\n", "plot_policy(policy=my_policy, title=\"My policy\")\n" ] }, { "cell_type": "markdown", "id": "e61f5ee8-f9cd-4fbc-96c0-0a8d661bd1e5", "metadata": {}, "source": [ "**Exercise 10.** (optional) How can we find an optimal policy?\n", "(We will discuss this question next week, but you can already start thinking about it!)" ] }, { "cell_type": "markdown", "id": "00ae548b", "metadata": {}, "source": [ "To find an optimal policy $π^*$ (a policy that yields the highest possible expected return from every state), we generally use one of two main dynamic programming algorithms:\n", "\n", "1. **Policy Iteration**: This method alternates between two steps until convergence:\n", "\n", "- *Policy Evaluation*: Calculate the value function Vπ(s) for the current specific policy (as we did in Exercise 8).\n", "\n", "- *Policy Improvement*: Update the policy to be greedy with respect to the current values. For every state s, we choose the action a that maximizes the expected next value:\n", " $$π_{new}​(s) = argmax​_{a} \\sum_{s\\prime} ​P({s \\prime}∣s,a)[R(s)+ \\gamma V_{\\pi}({s\\prime})]$$\n", "\n", "1. 
**Value Iteration**: Instead of evaluating a specific policy until convergence every time, we iteratively update the value function directly using the *Bellman Optimality Equation*:\n", " $$V_{k+1}(s) = \\max_a \\Big( R(s) + \\gamma \\sum_{s'} P(s' \\mid s, a) V_k(s') \\Big)$$\n", "\n", " Once the values converge to the optimal values $V^{*}$, we simply extract the optimal policy by acting greedily with respect to those values." ] } ], "metadata": { "kernelspec": { "display_name": "studies", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.12" } }, "nbformat": 4, "nbformat_minor": 5 }