{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Session 4 - Bonus: Recurrent network with an Embedding layer\n",
"\n",
"In this session we trained a model to imitate the poetic style of Baudelaire, specifically the collection *Les fleurs du mal*. Here we look at how to use the [`Embedding`](https://keras.io/api/layers/core_layers/embedding/) layer and what it makes possible.\n",
"\n",
"Let's start by importing the data."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"import keras\n",
"import numpy as np\n",
"import seaborn as sns\n",
"\n",
"sns.set(style=\"whitegrid\")\n",
"\n",
"\n",
"start = False\n",
"book = open(\"Beaudelaire.txt\", encoding=\"utf8\")  # noqa: SIM115\n",
"lines = book.readlines()\n",
"verses = []\n",
"\n",
"for line in lines:\n",
"    line_striped = line.strip().lower()\n",
"    if \"AU LECTEUR\".lower() in line_striped and not start:\n",
"        start = True\n",
"    if (\n",
"        \"End of the Project Gutenberg EBook of Les Fleurs du Mal, by Charles Baudelaire\".lower()\n",
"        in line_striped\n",
"    ):\n",
"        break\n",
"    if not start or len(line_striped) == 0:\n",
"        continue\n",
"    verses.append(line_striped)\n",
"\n",
"book.close()\n",
"text = \" \".join(verses)\n",
"characters = sorted(set(text))\n",
"n_characters = len(characters)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the main lab we one-hot encoded the text. The [`Embedding`](https://keras.io/api/layers/core_layers/embedding/) layer takes a sequence of integers as input, so we need to change how we build $X$ and $y$.\n",
"\n",
"**Task**: Drawing on the previous work, build the data matrix $X$ and the response vector $y$. Then split the dataset into a training set and a validation set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"X_train shape: (108720, 40)\n",
"y_train shape: (108720,)\n",
"X_val shape: (27181, 40)\n",
"y_val shape: (27181,)\n"
]
}
],
"source": [
"# Create character-to-index and index-to-character mappings\n",
"char_to_idx = {char: idx for idx, char in enumerate(characters)}\n",
"idx_to_char = dict(enumerate(characters))\n",
"\n",
"# Parameters\n",
"sequence_length = 40\n",
"\n",
"# Create sequences\n",
"X = []\n",
"y = []\n",
"\n",
"for i in range(len(text) - sequence_length):\n",
"    # Input sequence: convert characters to indices\n",
"    sequence = text[i : i + sequence_length]\n",
"    X.append([char_to_idx[char] for char in sequence])\n",
"\n",
"    # Target: next character as index\n",
"    target = text[i + sequence_length]\n",
"    y.append(char_to_idx[target])\n",
"\n",
"X = np.array(X)\n",
"y = np.array(y)\n",
"\n",
"# Split into training and validation sets\n",
"split_ratio = 0.8\n",
"split_index = int(len(X) * split_ratio)\n",
"\n",
"X_train, X_val = X[:split_index], X[split_index:]\n",
"y_train, y_val = y[:split_index], y[split_index:]\n",
"\n",
"print(f\"X_train shape: {X_train.shape}\")\n",
"print(f\"y_train shape: {y_train.shape}\")\n",
"print(f\"X_val shape: {X_val.shape}\")\n",
"print(f\"y_val shape: {y_val.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [`Embedding`](https://keras.io/api/layers/core_layers/embedding/) layer has the following parameters:\n",
"* *input_dim*: the size of the vocabulary under consideration, here *n_characters*\n",
"* *output_dim*: the embedding dimension; in other words, each character will be represented as a vector with *output_dim* dimensions\n",
"\n",
"We want to measure the impact of the *output_dim* parameter.\n",
"\n",
"**Task**: Define a function `get_model` that takes as parameters:\n",
"* *dimension*: an integer corresponding to the output dimension of the embedding\n",
"* *vocabulary_size*: the size of the vocabulary\n",
"\n",
"The function returns a recurrent neural network with an embedding layer configured according to the function's parameters. Keep the model a reasonable size.\n"
]
},
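{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration (a minimal sketch, independent of the model built below; the toy sizes 5 and 3 are arbitrary), an `Embedding` layer simply maps each integer index to a trainable dense vector:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy example: vocabulary of 5 symbols, each embedded in 3 dimensions\n",
"demo_embedding = keras.layers.Embedding(input_dim=5, output_dim=3)\n",
"demo_out = demo_embedding(np.array([[0, 1, 4]]))\n",
"# One sequence of 3 indices, each mapped to a 3-d vector\n",
"print(demo_out.shape)  # expected: (1, 3, 3)"
]
},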
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"def get_model(dimension: int, vocabulary_size: int) -> keras.Model:\n",
"    \"\"\"Create and return a SimpleRNN Keras model.\n",
"\n",
"    Args:\n",
"        dimension (int): The embedding dimension.\n",
"        vocabulary_size (int): The size of the vocabulary.\n",
"\n",
"    Returns:\n",
"        keras.Model: The constructed Keras model.\n",
"\n",
"    \"\"\"\n",
"    model = keras.Sequential()\n",
"    model.add(\n",
"        keras.layers.Embedding(\n",
"            input_dim=vocabulary_size,\n",
"            output_dim=dimension,\n",
"        )\n",
"    )\n",
"    model.add(keras.layers.SimpleRNN(128, return_sequences=False))\n",
"    model.add(keras.layers.Dense(vocabulary_size, activation=\"softmax\"))\n",
"    return model"
]
},
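{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (a sketch, assuming `n_characters` and `sequence_length` are defined as above), we can build one model and verify that the output has shape `(batch, vocabulary_size)` with each softmax row summing to 1:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check: feed a dummy batch of index sequences through the model\n",
"check_model = get_model(dimension=16, vocabulary_size=n_characters)\n",
"dummy_batch = np.zeros((2, sequence_length), dtype=\"int32\")\n",
"probas = check_model.predict(dummy_batch, verbose=0)\n",
"print(probas.shape)  # expected: (2, n_characters)\n",
"print(probas[0].sum())  # each softmax row sums to ~1"
]
},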
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Task**: Write a training loop that stores in a list the best value reached during training over up to 10 epochs. Each element of the list will be a dictionary with keys:\n",
"* *dimension*: the embedding dimension\n",
"* *val_loss*: the minimum loss value reached on the validation dataset during training"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List of embedding dimensions to test\n",
"dimensions = [8, 16, 32, 64, 128]\n",
"n_epochs = 10\n",
"results = []\n",
"\n",
"for dimension in dimensions:\n",
"    print(f\"Training with embedding dimension: {dimension}\")\n",
"    model = get_model(dimension, n_characters)\n",
"    model.compile(\n",
"        loss=\"sparse_categorical_crossentropy\",\n",
"        optimizer=keras.optimizers.Adam(),\n",
"        metrics=[\"accuracy\"],\n",
"    )\n",
"\n",
"    history = model.fit(\n",
"        X_train,\n",
"        y_train,\n",
"        batch_size=64,\n",
"        epochs=n_epochs,\n",
"        validation_data=(X_val, y_val),\n",
"        verbose=1,\n",
"    )\n",
"\n",
"    min_val_loss = min(history.history[\"val_loss\"])\n",
"    results.append({\"dimension\": dimension, \"val_loss\": min_val_loss})\n",
"    print(f\"Min val_loss for dimension {dimension}: {min_val_loss:.4f}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Task**: Change the structure of `results` into a list of tuples giving, for each embedding dimension, the mean and the standard deviation over its training runs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Convert results to a DataFrame for easier manipulation\n",
"df_results = pd.DataFrame(results)\n",
"\n",
"# Compute mean and std for each dimension (if multiple runs were done).\n",
"# Since we have one run per dimension here, the std is reported as 0.0\n",
"results_stats = [\n",
"    (row[\"dimension\"], row[\"val_loss\"], 0.0) for _, row in df_results.iterrows()\n",
"]\n",
"\n",
"print(\"Results: (dimension, mean_val_loss, std_val_loss)\")\n",
"for dimension, mean_loss, std_loss in results_stats:\n",
"    print(f\"Dimension {dimension}: mean={mean_loss:.4f}, std={std_loss:.4f}\")"
]
},
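{
"cell_type": "markdown",
"metadata": {},
"source": [
"If several runs per dimension had been collected (hypothetical data below; the values are made up for illustration), the mean and standard deviation could be computed directly with a pandas `groupby`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical example: aggregate several runs per dimension\n",
"multi_runs = pd.DataFrame(\n",
"    {\"dimension\": [8, 8, 16, 16], \"val_loss\": [2.1, 2.3, 1.9, 2.0]}\n",
")\n",
"stats = multi_runs.groupby(\"dimension\")[\"val_loss\"].agg([\"mean\", \"std\"])\n",
"# itertuples yields (dimension, mean, std) tuples, matching results_stats\n",
"results_multi = list(stats.itertuples(index=True, name=None))\n",
"print(results_multi)"
]
},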
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Task**: Visualize and then comment on the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# Extract data for plotting\n",
"dims = [r[0] for r in results_stats]\n",
"val_losses = [r[1] for r in results_stats]\n",
"\n",
"# Create the plot\n",
"plt.figure(figsize=(10, 6))\n",
"plt.plot(dims, val_losses, marker=\"o\", linewidth=2, markersize=8)\n",
"plt.xlabel(\"Embedding Dimension\", fontsize=12)\n",
"plt.ylabel(\"Minimum Validation Loss\", fontsize=12)\n",
"plt.title(\"Impact of Embedding Dimension on Model Performance\", fontsize=14)\n",
"plt.xticks(dims)\n",
"plt.grid(True, alpha=0.3)\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Comments on the results\n",
"\n",
"The results show the impact of the embedding dimension on model performance:\n",
"\n",
"1. **Small dimensions (8-16)**: An embedding dimension that is too small cannot capture enough information about the relationships between characters, which can lead to underfitting.\n",
"\n",
"2. **Medium dimensions (32-64)**: These dimensions generally offer a good trade-off between representational capacity and model complexity.\n",
"\n",
"3. **Large dimensions (128+)**: A dimension that is too large can lead to overfitting, or to needlessly longer training without a significant performance gain.\n",
"\n",
"The Embedding layer represents each character as a dense vector of fixed dimension, which is more efficient than one-hot encoding, especially for larger vocabularies."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "studies",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}