Refactor and enhance code in Reinforcement Learning notebook; add new R script for EM algorithm in Unsupervised Learning; update README to include new section for Unsupervised Learning.

2025-11-26 13:20:18 +01:00
parent 5d968fa5e5
commit 08cf8fbeda
8 changed files with 1480 additions and 212 deletions


@@ -0,0 +1,313 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 4 - Bonus: Recurrent network with Embedding\n",
"\n",
"In this session we trained a model to imitate the poetic style of Baudelaire, specifically the work *Les Fleurs du Mal*. Here we want to see how to use the [`Embedding`](https://keras.io/api/layers/core_layers/embedding/) layer and what it lets us do.\n",
"\n",
"Let's start by importing the data."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"import keras\n",
"import numpy as np\n",
"import seaborn as sns\n",
"\n",
"sns.set(style=\"whitegrid\")\n",
"\n",
"\n",
"start = False\n",
"book = open(\"Beaudelaire.txt\", encoding=\"utf8\") # noqa: SIM115\n",
"lines = book.readlines()\n",
"verses = []\n",
"\n",
"for line in lines:\n",
" line_stripped = line.strip().lower()\n",
" if \"AU LECTEUR\".lower() in line_stripped and not start:\n",
" start = True\n",
" if (\n",
" \"End of the Project Gutenberg EBook of Les Fleurs du Mal, by Charles Baudelaire\".lower()\n",
" in line_stripped\n",
" ):\n",
" break\n",
" if not start or len(line_stripped) == 0:\n",
" continue\n",
" verses.append(line_stripped)\n",
"\n",
"book.close()\n",
"text = \" \".join(verses)\n",
"characters = sorted(set(text))\n",
"n_characters = len(characters)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the main lab we one-hot encoded the text. The [`Embedding`](https://keras.io/api/layers/core_layers/embedding/) layer takes a sequence of integers as input, so we need to change how $X$ and $y$ are built.\n",
"\n",
"**Task**: Building on the previous work, construct the feature matrix $X$ and the target vector $y$. Then split the dataset into a training set and a validation set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"X_train shape: (108720, 40)\n",
"y_train shape: (108720,)\n",
"X_val shape: (27181, 40)\n",
"y_val shape: (27181,)\n"
]
}
],
"source": [
"# Create character to index and index to character mappings\n",
"char_to_idx = {char: idx for idx, char in enumerate(characters)}\n",
"idx_to_char = dict(enumerate(characters))\n",
"\n",
"# Parameters\n",
"sequence_length = 40\n",
"\n",
"# Create sequences\n",
"X = []\n",
"y = []\n",
"\n",
"for i in range(len(text) - sequence_length):\n",
" # Input sequence: convert characters to indices\n",
" sequence = text[i : i + sequence_length]\n",
" X.append([char_to_idx[char] for char in sequence])\n",
"\n",
" # Target: next character as index\n",
" target = text[i + sequence_length]\n",
" y.append(char_to_idx[target])\n",
"\n",
"X = np.array(X)\n",
"y = np.array(y)\n",
"\n",
"# Split into training and validation sets\n",
"split_ratio = 0.8\n",
"split_index = int(len(X) * split_ratio)\n",
"\n",
"X_train, X_val = X[:split_index], X[split_index:]\n",
"y_train, y_val = y[:split_index], y[split_index:]\n",
"\n",
"print(f\"X_train shape: {X_train.shape}\")\n",
"print(f\"y_train shape: {y_train.shape}\")\n",
"print(f\"X_val shape: {X_val.shape}\")\n",
"print(f\"y_val shape: {y_val.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [`Embedding`](https://keras.io/api/layers/core_layers/embedding/) layer takes the following parameters:\n",
"* *input_dim*: the size of the vocabulary under consideration, here *n_characters*\n",
"* *output_dim*: the embedding dimension; in other words, each character will be represented as a vector with *output_dim* dimensions\n",
"\n",
"We want to measure the impact of the *output_dim* parameter.\n",
"\n",
"**Task**: Define a function `get_model` that takes as parameters:\n",
"* *dimension*: an integer corresponding to the output dimension of the embedding\n",
"* *vocabulary_size*: the size of the vocabulary\n",
"\n",
"The function returns a recurrent neural network with an embedding layer configured according to the function's parameters. Keep the model reasonably sized.\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"def get_model(dimension: int, vocabulary_size: int) -> keras.Model:\n",
" \"\"\"Create and return a SimpleRNN Keras model.\n",
"\n",
" Args:\n",
" dimension (int): The embedding dimension.\n",
" vocabulary_size (int): The size of the vocabulary.\n",
"\n",
" Returns:\n",
" keras.Model: The constructed Keras model.\n",
"\n",
" \"\"\"\n",
" model = keras.Sequential()\n",
" model.add(\n",
" keras.layers.Embedding(\n",
" input_dim=vocabulary_size,\n",
" output_dim=dimension,\n",
" )\n",
" )\n",
" model.add(keras.layers.SimpleRNN(128, return_sequences=False))\n",
" model.add(keras.layers.Dense(vocabulary_size, activation=\"softmax\"))\n",
" return model"
]
},
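{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (the two-sequence batch below is purely illustrative), an `Embedding` layer maps an integer batch of shape `(batch, steps)` to a tensor of shape `(batch, steps, output_dim)`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"embedding = keras.layers.Embedding(input_dim=n_characters, output_dim=8)\n",
"batch = X_train[:2]  # two sequences of 40 character indices\n",
"print(embedding(batch).shape)  # (2, 40, 8)"
]
},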
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Task**: Write a training loop that stores in a list the best value reached during training over at most 10 epochs. Each element of the list will be a dictionary with the keys:\n",
"* *dimension*: the embedding dimension\n",
"* *val_loss*: the minimum loss value reached on the validation dataset during training"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\">Model: \"sequential_1\"</span>\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1mModel: \"sequential_1\"\u001b[0m\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓\n",
"┃<span style=\"font-weight: bold\"> Layer (type) </span>┃<span style=\"font-weight: bold\"> Output Shape </span>┃<span style=\"font-weight: bold\"> Param # </span>┃\n",
"┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩\n",
"│ embedding_1 (<span style=\"color: #0087ff; text-decoration-color: #0087ff\">Embedding</span>) │ ? │ <span style=\"color: #00af00; text-decoration-color: #00af00\">0</span> (unbuilt) │\n",
"├─────────────────────────────────┼────────────────────────┼───────────────┤\n",
"│ simple_rnn_1 (<span style=\"color: #0087ff; text-decoration-color: #0087ff\">SimpleRNN</span>) │ ? │ <span style=\"color: #00af00; text-decoration-color: #00af00\">0</span> (unbuilt) │\n",
"├─────────────────────────────────┼────────────────────────┼───────────────┤\n",
"│ dense_1 (<span style=\"color: #0087ff; text-decoration-color: #0087ff\">Dense</span>) │ ? │ <span style=\"color: #00af00; text-decoration-color: #00af00\">0</span> (unbuilt) │\n",
"└─────────────────────────────────┴────────────────────────┴───────────────┘\n",
"</pre>\n"
],
"text/plain": [
"┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓\n",
"┃\u001b[1m \u001b[0m\u001b[1mLayer (type) \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mOutput Shape \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1m Param #\u001b[0m\u001b[1m \u001b[0m┃\n",
"┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩\n",
"│ embedding_1 (\u001b[38;5;33mEmbedding\u001b[0m) │ ? │ \u001b[38;5;34m0\u001b[0m (unbuilt) │\n",
"├─────────────────────────────────┼────────────────────────┼───────────────┤\n",
"│ simple_rnn_1 (\u001b[38;5;33mSimpleRNN\u001b[0m) │ ? │ \u001b[38;5;34m0\u001b[0m (unbuilt) │\n",
"├─────────────────────────────────┼────────────────────────┼───────────────┤\n",
"│ dense_1 (\u001b[38;5;33mDense\u001b[0m) │ ? │ \u001b[38;5;34m0\u001b[0m (unbuilt) │\n",
"└─────────────────────────────────┴────────────────────────┴───────────────┘\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\"> Total params: </span><span style=\"color: #00af00; text-decoration-color: #00af00\">0</span> (0.00 B)\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m Total params: \u001b[0m\u001b[38;5;34m0\u001b[0m (0.00 B)\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\"> Trainable params: </span><span style=\"color: #00af00; text-decoration-color: #00af00\">0</span> (0.00 B)\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m Trainable params: \u001b[0m\u001b[38;5;34m0\u001b[0m (0.00 B)\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"><span style=\"font-weight: bold\"> Non-trainable params: </span><span style=\"color: #00af00; text-decoration-color: #00af00\">0</span> (0.00 B)\n",
"</pre>\n"
],
"text/plain": [
"\u001b[1m Non-trainable params: \u001b[0m\u001b[38;5;34m0\u001b[0m (0.00 B)\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"dimension = 64\n",
"vocabulary_size = n_characters\n",
"\n",
"model = get_model(dimension, vocabulary_size)\n",
"model.summary()"
]
},
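{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of such a training loop, assuming the objects defined above; the list of dimensions, the optimizer, and the batch size are illustrative choices, not prescribed ones:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"results = []\n",
"\n",
"for dimension in [8, 16, 32, 64]:\n",
"    model = get_model(dimension, vocabulary_size)\n",
"    # Integer targets, so sparse categorical cross-entropy\n",
"    model.compile(optimizer=\"adam\", loss=\"sparse_categorical_crossentropy\")\n",
"    history = model.fit(\n",
"        X_train,\n",
"        y_train,\n",
"        validation_data=(X_val, y_val),\n",
"        epochs=10,\n",
"        batch_size=128,\n",
"        verbose=0,\n",
"    )\n",
"    # Keep the minimum validation loss reached during training\n",
"    results.append({\"dimension\": dimension, \"val_loss\": min(history.history[\"val_loss\"])})"
]
},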
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Task**: Change the structure of `results` so that it is a list of tuples containing the mean and the standard deviation over the trainings for each embedding dimension."
]
},
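{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible sketch, assuming each dimension is trained `n_runs` times; the value of `n_runs` and the list of dimensions are illustrative:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"n_runs = 3\n",
"results = []\n",
"\n",
"for dimension in [8, 16, 32, 64]:\n",
"    losses = []\n",
"    for _ in range(n_runs):\n",
"        model = get_model(dimension, vocabulary_size)\n",
"        model.compile(optimizer=\"adam\", loss=\"sparse_categorical_crossentropy\")\n",
"        history = model.fit(\n",
"            X_train,\n",
"            y_train,\n",
"            validation_data=(X_val, y_val),\n",
"            epochs=10,\n",
"            batch_size=128,\n",
"            verbose=0,\n",
"        )\n",
"        losses.append(min(history.history[\"val_loss\"]))\n",
"    # Mean and standard deviation of the best validation loss per dimension\n",
"    results.append((dimension, np.mean(losses), np.std(losses)))"
]
},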
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Task**: Visualize and then comment on the results."
]
},
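{
"cell_type": "markdown",
"metadata": {},
"source": [
"A plotting sketch, assuming `results` is a list of `(dimension, mean, std)` tuples as described above; seaborn styling is already active from the first cell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"dims = [r[0] for r in results]\n",
"means = [r[1] for r in results]\n",
"stds = [r[2] for r in results]\n",
"\n",
"# Error bars show the spread across repeated trainings\n",
"plt.errorbar(dims, means, yerr=stds, marker=\"o\", capsize=3)\n",
"plt.xlabel(\"Embedding dimension (output_dim)\")\n",
"plt.ylabel(\"Best validation loss\")\n",
"plt.show()"
]
},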
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "studies",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

File diff suppressed because one or more lines are too long