Refactor and enhance code in Reinforcement Learning notebook; add new R script for EM algorithm in Unsupervised Learning; update README to include new section for Unsupervised Learning.

2025-11-26 13:20:18 +01:00
parent 5d968fa5e5
commit 08cf8fbeda
8 changed files with 1480 additions and 212 deletions


@@ -0,0 +1,313 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 4 - Bonus: Recurrent network with Embedding\n",
"\n",
"In this session we trained a model to imitate the poetic style of Baudelaire, specifically the work *Les Fleurs du Mal*. Here we want to see how to use the [`Embedding`](https://keras.io/api/layers/core_layers/embedding/) layer and what it lets us do.\n",
"\n",
"Let's start by importing the data."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"import keras\n",
"import numpy as np\n",
"import seaborn as sns\n",
"\n",
"sns.set(style=\"whitegrid\")\n",
"\n",
"\n",
"start = False\n",
"book = open(\"Beaudelaire.txt\", encoding=\"utf8\") # noqa: SIM115\n",
"lines = book.readlines()\n",
"verses = []\n",
"\n",
"for line in lines:\n",
" line_stripped = line.strip().lower()\n",
" if \"AU LECTEUR\".lower() in line_stripped and not start:\n",
" start = True\n",
" if (\n",
" \"End of the Project Gutenberg EBook of Les Fleurs du Mal, by Charles Baudelaire\".lower()\n",
" in line_stripped\n",
" ):\n",
" break\n",
" if not start or len(line_stripped) == 0:\n",
" continue\n",
" verses.append(line_stripped)\n",
"\n",
"book.close()\n",
"text = \" \".join(verses)\n",
"characters = sorted(set(text))\n",
"n_characters = len(characters)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the main lab we one-hot encoded the text. The [`Embedding`](https://keras.io/api/layers/core_layers/embedding/) layer takes a sequence of integers as input, so we need to change how $X$ and $y$ are built.\n",
"\n",
"**Task**: Building on the previous work, construct the feature matrix $X$ and the target vector $y$. Then split the dataset into a training set and a validation set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"X_train shape: (108720, 40)\n",
"y_train shape: (108720,)\n",
"X_val shape: (27181, 40)\n",
"y_val shape: (27181,)\n"
]
}
],
"source": [
"# Create character to index and index to character mappings\n",
"char_to_idx = {char: idx for idx, char in enumerate(characters)}\n",
"idx_to_char = dict(enumerate(characters))\n",
"\n",
"# Parameters\n",
"sequence_length = 40\n",
"\n",
"# Create sequences\n",
"X = []\n",
"y = []\n",
"\n",
"for i in range(len(text) - sequence_length):\n",
" # Input sequence: convert characters to indices\n",
" sequence = text[i : i + sequence_length]\n",
" X.append([char_to_idx[char] for char in sequence])\n",
"\n",
" # Target: next character as index\n",
" target = text[i + sequence_length]\n",
" y.append(char_to_idx[target])\n",
"\n",
"X = np.array(X)\n",
"y = np.array(y)\n",
"\n",
"# Split into training and validation sets\n",
"split_ratio = 0.8\n",
"split_index = int(len(X) * split_ratio)\n",
"\n",
"X_train, X_val = X[:split_index], X[split_index:]\n",
"y_train, y_val = y[:split_index], y[split_index:]\n",
"\n",
"print(f\"X_train shape: {X_train.shape}\")\n",
"print(f\"y_train shape: {y_train.shape}\")\n",
"print(f\"X_val shape: {X_val.shape}\")\n",
"print(f\"y_val shape: {y_val.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [`Embedding`](https://keras.io/api/layers/core_layers/embedding/) layer takes the following parameters:\n",
"* *input_dim*: the size of the vocabulary under consideration, here *n_characters*\n",
"* *output_dim*: the embedding dimension; in other words, each character will be represented as a vector with *output_dim* dimensions\n",
"\n",
"We want to measure the impact of the *output_dim* parameter.\n",
"\n",
"**Task**: Define a function `get_model` that takes as parameters:\n",
"* *dimension*: an integer corresponding to the output dimension of the embedding\n",
"* *vocabulary_size*: the size of the vocabulary\n",
"\n",
"The function returns a recurrent neural network with an embedding layer configured according to the function's parameters. Keep the model reasonably sized.\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"def get_model(dimension: int, vocabulary_size: int) -> keras.Model:\n",
" \"\"\"Create and return a SimpleRNN Keras model.\n",
"\n",
" Args:\n",
" dimension (int): The embedding dimension.\n",
" vocabulary_size (int): The size of the vocabulary.\n",
"\n",
" Returns:\n",
" keras.Model: The constructed Keras model.\n",
"\n",
" \"\"\"\n",
" model = keras.Sequential()\n",
" model.add(\n",
" keras.layers.Embedding(\n",
" input_dim=vocabulary_size,\n",
" output_dim=dimension,\n",
" )\n",
" )\n",
" model.add(keras.layers.SimpleRNN(128, return_sequences=False))\n",
" model.add(keras.layers.Dense(vocabulary_size, activation=\"softmax\"))\n",
" return model"
]
},
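{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (the two-sequence batch below is purely illustrative), an `Embedding` layer maps an integer batch of shape `(batch, steps)` to a tensor of shape `(batch, steps, output_dim)`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"embedding = keras.layers.Embedding(input_dim=n_characters, output_dim=8)\n",
"batch = X_train[:2]  # two sequences of 40 character indices\n",
"print(embedding(batch).shape)  # (2, 40, 8)"
]
},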
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Task**: Write a training loop that stores in a list the best value reached during training over at most 10 epochs. Each element of the list will be a dictionary with the keys:\n",
"* *dimension*: the embedding dimension\n",
"* *val_loss*: the minimum loss value reached on the validation dataset during training"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">Model: \"sequential_1\"</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1mModel: \"sequential_1\"\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓\n",
"┃<span style=\"font-weight: bold\"> Layer (type) </span>┃<span style=\"font-weight: bold\"> Output Shape </span>┃<span style=\"font-weight: bold\"> Param # </span>┃\n",
"┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩\n",
"│ embedding_1 (<span style=\"color: #0087ff; text-decoration-color: #0087ff\">Embedding</span>) │ ? │ <span style=\"color: #00af00; text-decoration-color: #00af00\">0</span> (unbuilt) │\n",
"├─────────────────────────────────┼────────────────────────┼───────────────┤\n",
"│ simple_rnn_1 (<span style=\"color: #0087ff; text-decoration-color: #0087ff\">SimpleRNN</span>) │ ? │ <span style=\"color: #00af00; text-decoration-color: #00af00\">0</span> (unbuilt) │\n",
"├─────────────────────────────────┼────────────────────────┼───────────────┤\n",
"│ dense_1 (<span style=\"color: #0087ff; text-decoration-color: #0087ff\">Dense</span>) │ ? │ <span style=\"color: #00af00; text-decoration-color: #00af00\">0</span> (unbuilt) │\n",
"└─────────────────────────────────┴────────────────────────┴───────────────┘\n",
"</pre>\n"
],
"text/plain": [
"┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓\n",
"┃\u001b[1m \u001b[0m\u001b[1mLayer (type) \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mOutput Shape \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1m Param #\u001b[0m\u001b[1m \u001b[0m┃\n",
"┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩\n",
"│ embedding_1 (\u001b[38;5;33mEmbedding\u001b[0m) │ ? │ \u001b[38;5;34m0\u001b[0m (unbuilt) │\n",
"├─────────────────────────────────┼────────────────────────┼───────────────┤\n",
"│ simple_rnn_1 (\u001b[38;5;33mSimpleRNN\u001b[0m) │ ? │ \u001b[38;5;34m0\u001b[0m (unbuilt) │\n",
"├─────────────────────────────────┼────────────────────────┼───────────────┤\n",
"│ dense_1 (\u001b[38;5;33mDense\u001b[0m) │ ? │ \u001b[38;5;34m0\u001b[0m (unbuilt) │\n",
"└─────────────────────────────────┴────────────────────────┴───────────────┘\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\"> Total params: </span><span style=\"color: #00af00; text-decoration-color: #00af00\">0</span> (0.00 B)\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m Total params: \u001b[0m\u001b[38;5;34m0\u001b[0m (0.00 B)\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\"> Trainable params: </span><span style=\"color: #00af00; text-decoration-color: #00af00\">0</span> (0.00 B)\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m Trainable params: \u001b[0m\u001b[38;5;34m0\u001b[0m (0.00 B)\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\"> Non-trainable params: </span><span style=\"color: #00af00; text-decoration-color: #00af00\">0</span> (0.00 B)\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m Non-trainable params: \u001b[0m\u001b[38;5;34m0\u001b[0m (0.00 B)\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"dimension = 64\n",
"vocabulary_size = n_characters\n",
"\n",
"model = get_model(dimension, vocabulary_size)\n",
"model.summary()"
]
},
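{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of such a training loop, assuming the objects defined above; the list of dimensions, the optimizer, and the batch size are illustrative choices, not prescribed ones:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"results = []\n",
"\n",
"for dimension in [8, 16, 32, 64]:\n",
"    model = get_model(dimension, vocabulary_size)\n",
"    # Integer targets, so sparse categorical cross-entropy\n",
"    model.compile(optimizer=\"adam\", loss=\"sparse_categorical_crossentropy\")\n",
"    history = model.fit(\n",
"        X_train,\n",
"        y_train,\n",
"        validation_data=(X_val, y_val),\n",
"        epochs=10,\n",
"        batch_size=128,\n",
"        verbose=0,\n",
"    )\n",
"    # Keep the minimum validation loss reached during training\n",
"    results.append({\"dimension\": dimension, \"val_loss\": min(history.history[\"val_loss\"])})"
]
},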
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Task**: Change the structure of `results` so that it is a list of tuples containing the mean and the standard deviation over the trainings for each embedding dimension."
]
},
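{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible sketch, assuming each dimension is trained `n_runs` times; the value of `n_runs` and the list of dimensions are illustrative:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n_runs = 3\n",
"results = []\n",
"\n",
"for dimension in [8, 16, 32, 64]:\n",
"    losses = []\n",
"    for _ in range(n_runs):\n",
"        model = get_model(dimension, vocabulary_size)\n",
"        model.compile(optimizer=\"adam\", loss=\"sparse_categorical_crossentropy\")\n",
"        history = model.fit(\n",
"            X_train,\n",
"            y_train,\n",
"            validation_data=(X_val, y_val),\n",
"            epochs=10,\n",
"            batch_size=128,\n",
"            verbose=0,\n",
"        )\n",
"        losses.append(min(history.history[\"val_loss\"]))\n",
"    # Mean and standard deviation of the best validation loss per dimension\n",
"    results.append((dimension, np.mean(losses), np.std(losses)))"
]
},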
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Task**: Visualize and then comment on the results."
]
},
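{
"cell_type": "markdown",
"metadata": {},
"source": [
"A plotting sketch, assuming `results` is a list of `(dimension, mean, std)` tuples as described above; seaborn styling is already active from the first cell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"dims = [r[0] for r in results]\n",
"means = [r[1] for r in results]\n",
"stds = [r[2] for r in results]\n",
"\n",
"# Error bars show the spread across repeated trainings\n",
"plt.errorbar(dims, means, yerr=stds, marker=\"o\", capsize=3)\n",
"plt.xlabel(\"Embedding dimension (output_dim)\")\n",
"plt.ylabel(\"Best validation loss\")\n",
"plt.show()"
]
},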
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "studies",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

File diff suppressed because one or more lines are too long