mirror of
https://github.com/ArthurDanjou/ArtStudies.git
synced 2026-01-14 15:54:13 +01:00
{
"cells": [
{
"cell_type": "markdown",
"id": "8514812a",
"metadata": {},
"source": [
"# TP2 - Retrieval Augmented Generation\n",
"\n",
"In this lab we will build a complete RAG system: knowledge base, vectorization, and a call to a language model.\n",
"\n",
"Some functions will be reused in later sessions, so we encourage writing general, optimized, and robust functions. Keep in mind that this notebook is purely pedagogical and is not necessarily up to date, since the field evolves quickly.\n",
"\n",
"In this lab we aim to supply Machine Learning knowledge, even though the model already has plenty of it, using the course notes in PDF format at our disposal.\n",
"\n",
"\n",
"## Building the knowledge base\n",
"\n",
"To build a RAG system, we start with a knowledge base. In our case it will consist of PDF documents. We begin by extracting the text they contain.\n",
"\n",
"**Task**: From the available files, write a function `pdf_parser` that takes the file name as a parameter and returns the associated text. Use the [`PyPDFLoader`](https://python.langchain.com/docs/how_to/document_loader_pdf/#simple-and-fast-text-extraction) class and its `load` method to load the document.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "6a4a00a2",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.document_loaders import PyPDFLoader\n",
"\n",
"def pdf_parser(file_path: str):\n",
"    loader = PyPDFLoader(file_path=file_path)\n",
"    return loader.load()"
]
},
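{
"cell_type": "markdown",
"id": "b7e1c2d0",
"metadata": {},
"source": [
"Since these helpers will be reused in later sessions, here is an optional, more defensive variant. This is only a sketch: the name `safe_pdf_parser` and its error message are illustrative choices, not part of the assignment. It fails early with a clear error when the path does not exist, instead of a deep loader traceback."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c9f3a1e2",
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"from langchain_community.document_loaders import PyPDFLoader\n",
"\n",
"def safe_pdf_parser(file_path: str):\n",
"    # Fail early with a clear message instead of a loader traceback\n",
"    if not Path(file_path).is_file():\n",
"        raise FileNotFoundError(f\"PDF not found: {file_path}\")\n",
"    loader = PyPDFLoader(file_path=file_path)\n",
"    return loader.load()"
]
},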
{
"cell_type": "markdown",
"id": "77905595",
"metadata": {},
"source": [
"**Task**: Use the `pdf_parser` function to load the file 'ML.pdf', then inspect its contents."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "8ec332e6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"page_content='Chapitre 1\n",
"Introduction au Machine Learning\n",
"Les termes d’intelligence artificielle (IA) et Machine Learning (ML) sont fréquemment confondu et\n",
"leur hiérarchie n’est pas toujours clair. Unalgorithme est une séquence d’instructions logique ordonnée\n",
"pour répondre explicitement à un problème. Par exemple, une recette de cuisine est un algorithme, mais\n",
"tous les algorithmes ne sont pas des recettes de cuisine. Un algorithme d’intelligence d’artificielle est un\n",
"algorithme, mais il n’est pas explicitement construit pour répondre à un problème : il va s’adapter. S’il\n",
"s’appuie sur des données, alors on parle d’algorithme de Machine Learning1.\n",
"Le terme d’intelligence artificielle vient de la conférence de Dartmouth en 1957 où l’objectif était de\n",
"copier le fonctionnement des neurones. Mais les concepts d’intelligence artificielle était déjà proposé par\n",
"Alan Turing, et la méthode des moindres carrés de Legendre (la fameuse tendance linéaire dans Excel)\n",
"date de bien avant 1957. Depuis, le domaine s’est structuré autour d’une philosophie d’ouverture. Ainsi,\n",
"nous avons des datasets commun, des algorithmes identiques et des compétitions commune pour pouvoir\n",
"progresser ensemble.\n",
"Nous proposons dans ce chapitre d’introduire les différentes approches du Machine Learning et les\n",
"grands principes. Pour le rendre aussi général que possible, nous ne discuterons pas d’algorithmes en\n",
"particulier, mais supposerons que nous en avons un. La description de ces objets sera le coeur des prochains\n",
"chapitre.\n",
"1.1 Les différentes approches du Machine Learning\n",
"Quand on parle de Machine Learning, on parle d’un grand ensemble contenant plusieurs approches\n",
"différentes. Leur point commun est que la donnée est la source de l’apprentissage de paramètres optimaux\n",
"selon une procédure donnée. Pour saisir les différences entre ces approches, regardons ce dont chacune a\n",
"besoin pour être suivie.\n",
"• Apprentissage supervisé: je dispose d’une base de données qui contient une colonne que je\n",
"souhaite prédire\n",
"• Apprentissage non-supervisé: je dispose seulement d’une base de données composée d’indicateurs\n",
"Ces deux approches représentent l’écrasante majorité des utilisations en entreprise. Se développe\n",
"également une troisième approche : l’apprentissage par renforcement, qui nécessiterai un cours dédié2.\n",
"Au sein de ces deux grandes approches se trouvent des sous catégories :\n",
"• Apprentissage supervisé: je dispose d’une base de données qui contient une colonne que je\n",
"souhaite prédire qui est ...\n",
"– Régression: ... une valeur continue\n",
"1. Et si la classe d’algorithme est un réseau de neurone, alors on parle de Deep Learning. Ce n’est pas au programme du\n",
"cours.\n",
"2. Elle est au coeur de l’alignement des modèles de langage avec la préférence humaine par exemple.\n",
"6' metadata={'producer': 'pdfTeX-1.40.26', 'creator': 'TeX', 'creationdate': '2025-07-20T15:41:06+02:00', 'moddate': '2025-07-20T15:41:06+02:00', 'trapped': '/False', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.26 (TeX Live 2024) kpathsea version 6.4.0', 'source': 'ML.pdf', 'total_pages': 140, 'page': 5, 'page_label': '6'}\n"
]
}
],
"source": [
"ml_doc = pdf_parser(\"ML.pdf\")\n",
"print(ml_doc[5])"
]
},
{
"cell_type": "markdown",
"id": "0473470e",
"metadata": {},
"source": [
"We have text and metadata. We will first focus on the text. For it to be digestible by the RAG system, we must split it into several *chunks*. The [`CharacterTextSplitter`](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.CharacterTextSplitter.html) class performs this operation."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "bea1f928",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 1471 chunks.\n"
]
}
],
"source": [
"from langchain.text_splitter import CharacterTextSplitter\n",
"\n",
"text_splitter = CharacterTextSplitter(\n",
"    separator=\"\\n\",\n",
"    chunk_size=256,\n",
"    chunk_overlap=0,\n",
"    length_function=len,\n",
"    is_separator_regex=False,\n",
")\n",
"\n",
"texts = text_splitter.split_documents(documents=ml_doc)\n",
"print(f\"There are {len(texts)} chunks.\")"
]
},
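{
"cell_type": "markdown",
"id": "d4e5f6a7",
"metadata": {},
"source": [
"Aside (an optional sketch, not required by this lab): `CharacterTextSplitter` splits on a single separator, which can leave very short chunks. `RecursiveCharacterTextSplitter`, assumed to be importable from the same `langchain.text_splitter` module, falls back through several separators (paragraphs, lines, spaces) and often yields more uniform chunks. A minimal comparison with the same parameters:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e8f9a0b1",
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"\n",
"# Same chunk budget as above, but with hierarchical separator fallback\n",
"recursive_splitter = RecursiveCharacterTextSplitter(\n",
"    chunk_size=256,\n",
"    chunk_overlap=0,\n",
"    length_function=len,\n",
")\n",
"recursive_texts = recursive_splitter.split_documents(documents=ml_doc)\n",
"print(f\"Recursive splitting: {len(recursive_texts)} chunks.\")"
]
},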
{
"cell_type": "markdown",
"id": "96d05d6a",
"metadata": {},
"source": [
"**Task**: After inspecting the contents of the *texts* variable, plot the distribution of chunk lengths."
]
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"id": "b30cc5de",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAqYAAAImCAYAAACBy0hHAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8ekN5oAAAACXBIWXMAAA9hAAAPYQGoP6dpAABIuUlEQVR4nO3dB5wU9f3/8c/BUQ4Bg4QWS0AQkEgvASOIqIAKUTRNhQhKUVR+goCKoFSDikDAICIQREBExUI0ihCjooAUOx0B0VCkSO+3/8f762P2v3vcwd1x5Tt3r+fjsY+9m5mdnZnvze17vmU2IRKJRAwAAADIZQVyewMAAAAAIZgCAADACwRTAAAAeIFgCgAAAC8QTAEAAOAFgikAAAC8QDAFAACAFwimAAAA8ALBFMhDfPi+DB+2AQAQTgRTIId06NDBqlWrFn1Ur17d6tatazfeeKNNnTrVjh8/Hrd8ixYt7MEHH0z3+ufPn28PPPDAaZfTOrXuzL5PWvbu3Wt9+/a1pUuXxu2zHrll9uzZ7lh///33Z7SerNiPxYsXu23RM8JzrLQdY8eOzXf7DeSWxFx7ZyAfqlGjhj366KPu5xMnTtiePXvsww8/tL/97W8u0I0ePdoKFPj5evHpp5+24sWLp3vdU6ZMSddy3bt3t7/+9a+W1VauXGlvvPGG3XTTTdFpwb4CAJAeBFMgBylo1qlTJ26aaiwvvPBCGzZsmP3rX/+y3//+99EQmx0uuOACyylVqlTJsfcCAIQfTfmAB9q3b2/lypWzmTNnptnEHoTWWrVqWePGja137962bds2N0/NzJ9++ql7BE2BQbOg1nnFFVdYvXr17OOPPz6pKV+OHTtmQ4cOtYYNG1qDBg1cl4Bdu3adsik7ttlRj6AWVs/Bsilfd+TIEfvHP/5hrVu3tpo1a1rLli1twoQJlpycHPdeDz/8sJvevHlzt9xf/vIX+/LLL095DLWOcePGudfUrl3b1QyrRjqlNWvWWLdu3dzx0OPuu++2zZs3W0bo2AwaNMgd10suucQaNWrk1pPRLgNfffWV3XHHHfbb3/7Wbcudd95pa9euPekYL1y40G6//Xa3X7/73e/sySefdDXugf3799sjjzxiTZo0cd1Devbs6WrQ9dpTddlIravD6Y5PWt0jUq5fy6jWX11V9Dern9Oiv9FWrVq55XQu/O9//ztpGU3r1auXO9Y6DrfddputWLEibplTnSNp2b59u/t7D46d3v+zzz6LW0bHV3+Tem8t06NHD9uxY0eGjq26A1x99dX23//+19q2bev+brTPr7/+eprbdvToUVfu+vtQi4R8/fXXbt/r16/vtqVjx472+eefn3IfgTAhmAIeUPO9PhgVvlL2NZVly5a5/psKcs8995w99NBDtmjRIrv//vujTeaqYdXjpZdest/85jfR1yoQ6INXwUUfZKn597//bd98840NHz7cLasPzy5dusSFn1PR+2n9oufUmvA1KErBa+LEifbHP/7Rxo8f7wKqui+kXP7dd991fWb79+9vI0eOdCHg3nvvPeX2KKwp9P7hD39w+/yLX/zCnnrqqbhlNmzY4ELuzp077fHHH3e11ApdN998s5uWHtoPBTeFfAWfSZMm2T333OPCY0a6Lqj89L7y2GOPuQuDLVu2uO1bv3593LJ6HwURHbM2bdq4Y/jyyy9H5yuEqwx1jEaNGmUHDhw4ad/TIyuOTyxtr0LYmDFjXAhLzbRp09xxu/zyy92FhULngAEDTroQ0Hbpb1TztG+6ELn11lujx+p050hqdJy0b7oA6NOnj/u7KVKkiAuDGzdujC6nPuC6ePv73//u1vef//zHBg8enOHj8eOPP7rX6eJNF17nnXeeO99Slrfo/4AuMBREJ0+ebBdffLELyJ07d7ZSpUq5oKuyPnTokLu42bdvX4a3B/ARTfmAJ375y1+6D7+ffvrJ/RxLH7pFixa1rl27WuHChd00BS/VuC
koqck86I+asqvALbfc4gLgqeiDTgGrWLFi0d9VU6b+r6oVPB29d9Bsr+fUmvC1rk8++cQFzeuuu85NU+2f9ksf+Pqwvuiii6IfytqeYJ8UIPQBrloj1TSlNvDqhRdesE6dOrmQKE2bNnW1YR999FF0OQWPpKQkV5sYrFsXBFdddZULe+kZPKZ1ah1aVrXLohqt7777zl0UpJfC1a9//WsXUAoWLOimXXbZZa5WTUFOxySgIK/yCLZ33rx57uJBYU2BWMFKQUWhTJo1a+YCbGqB51Sy4vjE0vFRmaRFf7sKo9dee63169cvegwUwGJbD55//nl3Xrz44ot27rnnRvdRr9Nx0vE63TmSkJBw0vu/9tpr9sMPP7hnBT9RLfENN9xgS5YssYoVK7ppqrV/4oknosfjiy++sA8++MAySiFSYV/rEK1f55fWVbly5ehyCt2qgVW5/vOf/4xeaK5bt852797tzhVtp6gbkP7udI6UKFEiw9sE+IYaU8ATwW2WUvsAVRO7PtQUNhRoNFBKH+AKYaktHyv4wD0V1VYFoTRomkxMTHQfzllF3Qy0zpQhOehTq/mB2KAt6uYgOgapUVOmQn3KEH3NNdfE/a4aNDXHKsAo/Oqh91GAUmhOD22LatBUg6lmWtWcKhQvX77cNb2mx8GDB11g0vYFoVRKlizp9iH2WEjKmu7y5cu7dQT7VKhQIRceY2vgFdoyKiuOT0b+9r799ltXE3u6clP41rp07IPt0j4qnAbblZlzRGFWtZax26lgrhp7XQwEVNax9BpdDGVG7IWjylGCsgyMGDHC5syZ4wKoQnFAF27nnHOOa3lQy8R7773nLmJV2xusCwg7akwBT6gvnAKBanlSUjBRzZpqslSDop/1gaQPqNPdxig2cKalTJkycb/rQ1+1ppn98E2N+ntqnbFBLPa9Y5siFQ5Sbo/E9kVNuW7R+lNbd0C1bm+//bZ7pKQP/PR68803Xc2vmt5VXgo2Krv00r7qQiRlzbhoWspm2ZTr1vEILmRUg6ZtCI5RoHTp0pZRWXV80vu3l5Fy27RpU1wXlVgKpJk5R7Te9BynlPsRe/wzKvZvOyizlOtSlwoFbdUU//nPf45emJ111lk2ffp0e+aZZ1zXDdWU6m/j+uuvd91egppiIMwIpoAHVAOkZjs1z6UMbgE1TeuhD2HVbKnWTv0S1SdPgz3OhD6gY6kvpwJP7Id2yv6dKWt5Tufss89269R6YvdRTeOphZOMCF6r2jc1baa1X2rqvPTSS1NtXlZtbnqoJk5N2go76tsXhAY19aoGLj20HarFix1AE9sPMbWLk7To/XVcFdpjw2lqfUJPV4bpOT5B7WPKiwQ1JZ9JucVKrdxUk6s+pKkJAllGzxGtN7UBa6r91t9rbPP66Zzp+RFryJAhbn9Vc6xBduruENDfdzD4TX3SdYs2dXHQ3TbU/xQIO5ryAQ+o5kOBJBgMk5IGouj+oKpZUY2Lmj6D/n7BCOaUNWYZoebo2EFXasrU7+o7KWrO3bp1a9xrUoawtAJ1QB+0Wuc777xzUu1jas2lGaHaMtUcpVz3+++/f9I2qJ+eajjVRKqH+qyqlk3NoumhEdsKZRpoFIRShYSgSTmtWt2UNXB6X9V6xQYa1ZSq72hGjkVwXDUgJ6C/E/VDjZWeMkzP8Qm6WMSuS31ZU4bJ9FAfywoVKqSr3FSLWKlSpeh26aFQ9sorr7i/vfScIympi4IGd8XeCUF3jlDZar3plZ5jmxGq6VWtse5CoEGA+jsRHSfdbUD/K7TP+rsfOHCg6wKS1j4CYUONKZCDNKgjuLWLAoxquhYsWOCCqfpaBoNXUtKHkZonNSBCy6k/pQajqGZN80QfTgpN6o+X0Xug6oNOH8aqBdRoZDVTa2BSMEhDH/IKPvoiAPU/Va1hytvcBAMvFKxU26Rvtoql/oAKumpyVLcFzVdfSo2gbteu3Rnd81RNnBqZrhH+CiU6JhpQkjLgaBkNGNKoel
0EaAS2jr1CnAbQpEdQ86bR1QpCao5W8+qqVauiNWXp+WIEje5WjasG62iAmspUzc/qpxoMdEoPNfmqrHQ7I9XA/upXv3KhavXq1XF9K1WGzz77rHuoBlHlqVrFjB4flaEuAnQHh//7v/9zNaWal5Fa3oC2T3cc0LHQ34X6H+v8UA1gLN0SSSFUzxoxr5pWdTeYNWuWG32f3nMkJd3KSv2D77rrLncLKK03GIGvMkmv9BzbzFBZ6DzTgCnVZKtFRf839Pehvxv93Su06oImrf8dQNhQYwrkIN13UX3G9NAHn5omFWhU6xGM+k1rcJIGRKhmR4M5VJOiAKYP0SAQ6NY5GgSj2zxpBHxGaFvUbK8PPI1y1i1+NEI7CDYKYFqv7hOpD0QF4JRBTgMzNPBEIU1hIyWtSx/c+rBVDZzWoxog7Ytul3SmFKY0slvrVNBQMEs5ilxhWNunbdGxVxhRKNdtptL7wa5gpoEnOgY6JgpoCoPBfTrTW1Om0K8gdfjwYXcMdBsk1cAqbFWtWjVD+67bBumCQYN+FBbVtK1gGds3UsdHA3p0twMdH+23Ak9Gj48ugHQHANX0Bn8vek7tbgnpob8Zbb8CqbZLFxMpb8UU3ONXI/J1rqjfqJqxtf0Kq+k9R1LSBYRuV6Uwqebz++67zwU/veb8889P9z6k59hmhlpBdCx0Aasa4bJly7qwrYtAXYjofXULLZVHWuEbCJuESGZ7cAMAcp1ud6RQd+WVV8YNklKoVDO1boUEAGFBUz4AhJhq1dR8rWCqLxdQ30Pdu3Xu3Lmu6wUAhAk1pgAQcurPqOZ2fQGBBkJpNLlG1quZHADChGAKAAAALzD4CQAAAF4gmAIAAMALBFMAAAB4IfSj8nUvQXWT1f0bAQAA4B99cYXukaxvLMvTNaYKpdk1fkvr1bewMD4sfCi78KLswouyCy/KLrwiISm79Oa10NeYBjWl+t7krKavFtTtV/RVibHfoAL/UXbhRdmFF2UXXpRdeB0MSdl99dVX6Vou9DWmAAAAyBsIpgAAAPACwRQAAABeIJgCAADACwRTAAAAeIFgCgAAAC8QTAEAAOAFgikAAAC8QDAFAACAFwimAAAA8ALBFAAAAF4gmAIAAMALBFMAAAB4gWAKAAAALxBMAQAA4AWCKQAAALxAMAUAAIAXCKYAAADwAsEUAAAAXiCYAgAAwAsEUwAAkGHJyRHLy/L6/vkqMbc3AAAAhE+BAgk2Yvoy+37bPstrzitXwnrfWj+3NyNfIpgCAIBMUShd/8Oe3N4M5CE05QMAAMALBFMAAAB4gWAKAAAALxBMAQAA4AWCKQAAALxAMAUAAIAXCKYAAADwAsEUAAAAXiCYAgAAwAsEUwAAAHiBYAoAAAAvEEwBAMjnEhISLCkpyT0DuSkxV98dAIA8KDk5YgUKhCfkKZTWqFEjtzcDIJgCAJDVFEpHTF9m32/bZ3lRvepl7a/XEmSR9QimAABkA4XS9T/ssbzovLLFc3sTkEfRxxQAAADhDqYbNmywunXr2uzZs6PTVq5cae3bt7c6depYixYtbOrUqXGvSU5OtjFjxljTpk3dMl26dLHNmzef2R4AAAAg/wbTY8eOWe/eve3gwYPRabt377ZOnTrZBRdcYK+++qrdfffdNmLECPdzYNy4cTZjxgwbMmSIzZw50wXVzp0729GjR7NmbwAAAJC/gunYsWOtePH4/iWzZs2yQoUK2eDBg61y5cp20003WceOHW3ChAluvsLn5MmTrUePHta8eXOrXr26jRo1yrZu3Wpz587Nmr0BAABA/gmmS5YssZdeesmGDx8eN33p0qXWqFEjS0z8/+OpGjdubBs3brQdO3bYqlWr7MCBA9akSZPo/JIlS7rbU2idAAAAyN8yNCp/79691rdvX+vfv79VqFAhbp5qPqtWrRo3rWzZsu55y5Ytbr6kfJ2WCeZlViQSietWkFUOHToU94zwoOzCi7ILL8ou/mb1CD/9LStj+O
xQSM47Hcf0fIFDhoLpwIED3YCntm3bnjTv8OHDVrhw4bhpRYoUcc9HjhyJHrDUltmz58xup6E+rxp4lV1U64twouzCi7ILr/xedtysPu/QQG/fA1+YzruUGfCMgunrr7/umuvnzJmT6vyiRYueNIhJgVSKFSvm5ouWCX4OljnTK0v1ba1SpYplNf0xqqArVqzI1W/IUHbhRdmFF2X3M77WM++oVKlSKGpMN4bgvFu3bl26lkt3MNXo+p07d7qBS7EeffRRe/vtt618+fK2ffv2uHnB7+XKlbPjx49Hp2nkfuwy1apVszP9J6Dwm11U0Nm5fmQfyi68KLvwouyQV/gc9MJ23qX3gi3dwVS3flJzfayWLVu6Ufa///3v7Y033nC3gDpx4oQVLFjQzV+0aJG72ihdurSVKFHCjeRfvHhxNJiqz+qKFSvcvU8BAACQv6U7mKrWMzUKnZqn20NNnDjRHn74YXdv0i+//NKmTJligwYNivYrUABVwD3nnHPs3HPPtSeffNLVtCrgAgAAIHOD7fJKF5IMDX46FQVUBdNhw4ZZu3btrEyZMm4Ev34OqHZVTfoa1a/a14YNG9qkSZNcH1EAAAAf/KJEEUtOjliBAgl5erBdsof7eEbBdPXq1XG/16pVy93jNC1q4u/Tp497AAAA+Kh4UiEX2EZMX2bfb9tnedF55UpY71vrW56tMQUAAMhLFErX/3Bmt7REDnwlKQAAAJDVCKYAAADwAsEUAAAAXiCYAgAAwAsEUwAAAHiBYAoAAAAvEEwBAADgBYIpAAAAvEAwBQAAgBcIpgAAAPACwRQAAABeIJgCAADACwRTAAAAeIFgCgAAAC8QTAEAAOAFgikAAAC8QDAFAACAFwimAAAA8ALBFAAAAF4gmAIAAMALBFMAAAB4gWAKAAAALxBMAQAA4AWCKQAAALxAMAUAAIAXCKYAAADwAsEUAAAAXiCYAgAAwAsEUwAAAHiBYAoAAAAvEEwBAADgBYIpAAAAvEAwBQAAgBcIpgAAAPACwRQAAABeIJgCAADACwRTAAAAeIFgCgAAAC8QTAEAABDOYLpz507r06ePNW7c2OrWrWtdu3a19evXR+f379/fqlWrFvdo0aJFdH5ycrKNGTPGmjZtanXq1LEuXbrY5s2bs26PAAAAkD+C6d13322bNm2yCRMm2CuvvGJFixa1jh072qFDh9z81atX25133mkLFiyIPrRcYNy4cTZjxgwbMmSIzZw50wXVzp0729GjR7N2zwAAAJB3g+mePXvs3HPPtaFDh1qtWrWscuXK1r17d9u+fbutXbvWIpGIrVu3zi655BIrU6ZM9HHOOee41yt8Tp482Xr06GHNmze36tWr26hRo2zr1q02d+7c7NpHAAAA5LVgevbZZ9tTTz1lVatWdb/v2rXLpkyZYuXLl7cqVarYd999ZwcPHrQLL7ww1devWrXKDhw4YE2aNIlOK1mypNWoUcOWLFlypvsCAACAEEvM7AsHDBhgs2bNssKFC9szzzxjxYoVszVr1rh5L7zwgn344YdWoEABa9asmfXs2dNKlCjhakalQoUKcesqW7ZsdF5mqKZWgTirBd0TgmeEB2UXXpRdeFF2P0tISLCkpKTc3gwgXXS+KkdlN72Hzo1sC6a33Xab/fnPf7bp06e7fqfqN6pgqjCqoDl+/HhXg/rEE0+4Zv7nn38++s9KYTZWkSJFXDeBzDp27JitXLnSssvGjRuzbd3IXpRdeFF24ZXfy06hVC2BQBhs2LAhxy4mU+a/LA2marqXYcOG2RdffGHTpk1zP99yyy1WqlQpN09N/upj+qc//cm++uorN1Aq6Gsa/CxHjhw5o6vLQoUKRbcnK6mg9A+2YsWKXP2GDGUXXpRdeFF2P0tPrRDgi0qVKuVIjanGIKVHhoKp+pQuXLjQWrVqZYmJP79UNaQKhRoApZ+DUBq46KKL3LOa6oMmfC17wQUXRJfR77qt1Jn8E1BXguyif7DZuX
5kH8ouvCi78KLsgPBIyqGLyPResGVo8NOOHTusV69eLpzGNqOvWLHCjdDv27evu3VULNWUisKrRuEXL17cFi9eHJ2/d+9e9/qGDRtmZFMAAACQx2QomKppXoOZdLsojaJXn9IHH3zQhUsFUtWkKrQ+/fTTrn/pBx98YP369bM2bdq44Kq+Be3bt7cRI0bY/Pnz3Sh9DYzSqP6WLVtm314CAADAexnuYzpy5Eh3yygFyn379lmDBg3cAKhf/epX7jF69Gh38/3nnnvOjcRv27at3XfffdHX6x6mx48fd98QdfjwYVdTOmnSJNdPFAAAAPlXhoOpwubAgQPdIzXXXHONe6SlYMGC7itN9QAAAAAy/ZWkAAAAQHYgmAIAAMALBFMAAAB4gWAKAAAALxBMAQAA4AWCKQAAALxAMAUAAIAXCKYAAADwAsEUAAAAXiCYAgAAwAsEUwAAAHiBYAoAAAAvEEwBAADgBYIpAAAAvEAwBQAAgBcIpgAAAPACwRQAAABeIJgCAADACwRTAAAAeIFgCgAAAC8QTAEAAOAFgikAAAC8QDAFAACAFwimAAAA8ALBFAAAAF4gmAIAAMALBFMAAAB4gWAKAAAALxBMAQAA4AWCKQAAALxAMAUAAIAXCKYAAADwAsEUAAAAXiCYAgAAwAsEUwAAAHiBYAoAAAAvEEwBAADgBYIpAAAAvEAwBQAAgBcIpgAAAAhnMN25c6f16dPHGjdubHXr1rWuXbva+vXro/NXrlxp7du3tzp16liLFi1s6tSpca9PTk62MWPGWNOmTd0yXbp0sc2bN2fN3gAAACD/BNO7777bNm3aZBMmTLBXXnnFihYtah07drRDhw7Z7t27rVOnTnbBBRfYq6++6pYdMWKE+zkwbtw4mzFjhg0ZMsRmzpzpgmrnzp3t6NGjWb1vAAAACJHEjCy8Z88eO/fcc61bt25WtWpVN6179+52/fXX29q1a23hwoVWqFAhGzx4sCUmJlrlypWjIfamm25y4XPy5MnWu3dva968uXv9qFGjXO3p3LlzrU2bNtmzlwAAAMhbNaZnn322PfXUU9FQumvXLpsyZYqVL1/eqlSpYkuXLrVGjRq5UBpQk//GjRttx44dtmrVKjtw4IA1adIkOr9kyZJWo0YNW7JkSVbuFwAAAPJyjWmsAQMG2KxZs6xw4cL2zDPPWLFixWzr1q3R0BooW7ase96yZYubLxUqVDhpmWBeZkQiETt48KBlNXVPiH1GeFB24UXZhRdl97OEhARLSkrK7c0A0kXnq3JUdtN76NzItmB622232Z///GebPn2660uqfqOHDx92QTVWkSJF3PORI0ei/6xSW0bdBDLr2LFjbtBVdlGNL8KJsgsvyi688nvZKZSqJRAIgw0bNuTYxWTK/JelwVRN9zJs2DD74osvbNq0aW4gVMpBTAqkohpVzRctE/wcLHMmV5fq1xpsT1ZSQekfbMWKFbn6DRnKLrwou/Ci7H6WnlohwBeVKlXKkRrTdevWpWu5DAVT9SnVAKdWrVpF+5EWKFDAhcLt27e7vqZ6jhX8Xq5cOTt+/Hh0mkbuxy5TrVo1O5N/Agq+2UX/YLNz/cg+lF14UXbhRdkB4ZGUQxeR6b1gy9DgJw1g6tWrlwunsc3oK1ascCPwGzZsaMuWLbMTJ05E5y9atMil8dKlS1v16tWtePHitnjx4uj8vXv3utfrtQAAAMi/MhRMNbCpWbNmNnToUDeKfs2aNfbggw+6cKl7meqWUPv377eHH37YVdnOnj3bjdrX7aWCvgW6+b7ubTp//nw3Sr9nz56uprVly5bZtY8AAAAIgQz3MR05cqS7ZZQC5b59+6xBgwZuANSvfvUrN3/ixImu32m7du2sTJky1rdvX/dzoEePHq5Jv3///m6wlGpKJ02a5PqJAgAAIP/KcDAtUaKEDRw40D1SU6tWLXvppZfSfH3BggXdV5rqAQAAAGT6K0kBAACA7EAwBQAAgBcIpgAAAP
ACwRQAAABeIJgCAADACwRTAAAAeIFgCgAAAC8QTAEAAOAFgikAAAC8QDAFAACAFwimAAAA8ALBFAAAAF4gmAIAAMALBFMAAAB4gWAKAAAALxBMAQAA4AWCKQAAALxAMAUAAIAXCKYAAADwAsEUAAAAXiCYAgAAwAsEUwAAAHiBYAoAAAAvEEwBAADgBYIpAAAAvEAwBQAAgBcIpgAAAPACwRQAAABeIJgCAADACwRTAAAAeIFgCgAAAC8QTAEAAOAFgikAAAC8QDAFAACAFwimAAAA8ALBFAAAAF4gmAIAAMALBFMAAAB4gWAKAAAALxBMAQAAEM5g+tNPP9kjjzxizZo1s3r16tnNN99sS5cujc7v1KmTVatWLe7RoUOH6PwjR47YoEGDrEmTJla3bl27//77bdeuXVm3RwAAAAilxIy+oFevXvbjjz/ayJEjrXTp0vbCCy/YHXfcYa+99ppdeOGFtnr1ahs4cKBdddVV0dcUKlQo+rPmKciOHTvWChcubI8++qj16NHDpk2blnV7BQAAgLwdTDdt2mQff/yxzZgxw+rXr++mDRgwwD766CObM2eOtW/f3nbu3Gm1a9e2MmXKnPT6bdu22euvv27jx4+3Bg0auGkKuK1bt7bPPvvM1aACAAAgf8pQU36pUqVswoQJVrNmzei0hIQE99i7d6+rLdXPlSpVSvX1y5Ytc8+NGzeOTtOy5cqVsyVLlmR+LwAAAJC/akxLlixpl19+edy0d99919Wk9uvXz9asWWMlSpSwwYMHu5rVYsWKudrQ7t27u2Z71Zgq3BYpUiRuHWXLlrWtW7dmeicikYgdPHjQstqhQ4finhEelF14UXbhRdn9TBU0SUlJub0ZQLrofFWOym56D50bWd7HNNby5cvtoYcespYtW1rz5s1dONXgplq1arlBUCtXrrQnnnjC/ve//7ln7bwCakoKqnpdZh07dsy9V3bZuHFjtq0b2YuyCy/KLrzye9kplNaoUSO3NwNIlw0bNuTYxWRqGTDLgum8efOsd+/ebmT+iBEj3DTVlD7wwAN29tlnu9+rVq3qBj717NnT+vbta0WLFrWjR4+etC6F0jO5utR7VKlSxbKaCkr/YCtWrMjVb8hQduFF2YUXZfez9NQKAb6oVKlSjtSYrlu3Ll3LZSqYagT9sGHDXDP9448/Hk3AiYmJ0VAauOiii9yzmurLly/vbjelcBqbmrdv3+76mZ7JPwF1G8gu+gebnetH9qHswouyCy/KDgiPpBy6iEzvBVuG72OqEflDhgyxW2+91Y2ojw2Yul+pmvZjffXVV65GU1fQGsmfnJwcHQQVVCGr72nDhg0zuikAAADIQzJUY6oQ+dhjj9nVV19t3bp1sx07dkTnqZm+VatWbr76mF522WUulKpvqe5zWrx4cfe47rrrrH///m45pXTdx7RRo0ZWp06d7Ng/AAAA5MVgqhH4Gmj03nvvuUesdu3a2fDhw11VrW66r+Cpe5l27NjRunbtGl1Ota2ad88997jf9Q1SCqoAAADI3zIUTO+88073OBU18euRFvU7Gjp0qHsAAAAAme5jCgAAAGQHgikAAAC8QDAFAACAFwimAAAA8ALBFAAAAF4gmAIAAMALBFMAAAB4gWAKAAAALxBMAQAA4AWCKQAAALxAMAUAAIAXCKYAAADwAsEUAAAAXiCYAgAAwAsEUwAAAHiBYAoAAAAvEEwBAADgBYIpAAAAvEAwBQAAgBcIpgAAAPACwRQAAABeIJgCAADACwRTAAAAeIFgCgAAAC8QTAEAAOAFgikAAAC8QDAFAACAFwimAAAA8ALBFAAAAF4gmAIAAMALBFMAAAB4gWAKAAAALxBMAQAA4AWCKQAAALxAMAUAAIAXCKYAAADwAsEUAAAAXiCYAgAAwAsEUwAAAHiBYAoAAIBwBtOffvrJHnnkEWvWrJnVq1fPbr75Zlu6dGl0/sKFC+3GG2+02rVrW+vWre2tt96Ke/2RI0ds0K
BB1qRJE6tbt67df//9tmvXrqzZGwAAAOSfYNqrVy/77LPPbOTIkfbqq6/axRdfbHfccYd9++23tn79euvWrZs1bdrUZs+ebX/84x+tb9++LqwGBg4caAsWLLCxY8fa888/717Xo0ePrN4vAAAAhExiRhbetGmTffzxxzZjxgyrX7++mzZgwAD76KOPbM6cObZz506rVq2a9ezZ082rXLmyrVixwiZOnOhqSLdt22avv/66jR8/3ho0aOCWUcBVzarCrmpQAQAAkD9lqMa0VKlSNmHCBKtZs2Z0WkJCgnvs3bvXNekrgMZq3LixLVu2zCKRiHsOpgUqVapk5cqVsyVLlpz53gAAACB/1JiWLFnSLr/88rhp7777rqtJ7devn7322mtWvnz5uPlly5a1Q4cO2e7du12NqcJtkSJFTlpm69atmd4Jhd6DBw9aVtN2xz4jPCi78KLswouy+5kqa5KSknJ7M4B00fmqHJXd9B46N7I0mKa0fPlye+ihh6xly5bWvHlzO3z4sBUuXDhumeD3o0ePup1POV8UVDUoKrOOHTtmK1eutOyycePGbFs3shdlF16UXXjl97JTKK1Ro0ZubwaQLhs2bMixi8nUMmCWBdN58+ZZ79693cj8ESNGRAOmAmis4HedqEWLFj1pviiUnsnVZaFChaxKlSqW1VRQ+gdbsWJFrn5DhrILL8ouvCi7n6WnVgjwRaVKlXKkxnTdunXpWi5TwXTatGk2bNgwN2jp8ccfjybgChUq2Pbt2+OW1e/FihWzEiVKuGZ+3W5K4TQ2NWsZ9TM9k38Ceo/son+w2bl+ZB/KLrwou/Ci7IDwSMqhi8j0XrBl+HZRGpE/ZMgQu/XWW92I+tiAqZH2n376adzyixYtcrWqBQoUcCP5k5OTo4Oggipk9T1t2LBhRjcFAAAAeUiGgqlC5GOPPWZXX321u1/pjh077Mcff3SPffv2WYcOHezLL790Tfu6p+nkyZPtnXfesc6dO7vXq1b0uuuus/79+9vixYvdsrovaqNGjaxOnTrZtY8AAAAIgQw15WsEvgYavffee+4Rq127djZ8+HAbN26cPfnkk+7m+eedd577OfYWUqptVbi955573O/6BikFVQAAAORvGQqmd955p3ucioKmHmlRv6OhQ4e6BwAAAJDpPqYAAABAdiCYAgAAwAsEUwAAAHiBYAoAAAAvEEwBAADgBYIpAAAAvEAwBQAAgBcIpgAAAPACwRQAAABeIJgCAADACwRTAAAAeIFgCgAAAC8QTAEAAOAFgikAAAC8QDAFAACAFwimAAAA8ALBFAAAAF4gmAIAAMALBFMAAAB4gWAKAAAALxBMAQAA4AWCKQAAALxAMAUAAIAXCKYAAADwAsEUAAAAXiCYAgAAwAsEUwAAAHiBYAoAAAAvEEwBADkuOTmS25sAwEOJub0BAID8p0CBBBsxfZl9v22f5TX1qpe1v15bI7c3AwglgikAIFcolK7/YY/lNeeVLZ7bmwCEFk35AAAA8ALBFAAAAF4gmAIAAMALBFMAAAB4gWAKAAAALxBMAQAA4AWCKQAAALxAMAUAAIAXCKYAAAAIfzB99tlnrUOHDnHT+vfvb9WqVYt7tGjRIjo/OTnZxowZY02bNrU6depYly5dbPPmzWeyGQAAAMjPwXT69Ok2evTok6avXr3a7rzzTluwYEH08corr0Tnjxs3zmbMmGFDhgyxmTNnuqDauXNnO3r0aOb3AgAAAPkvmG7bts0FzxEjRljFihXj5kUiEVu3bp1dcsklVqZMmejjnHPOcfMVPidPnmw9evSw5s2bW/Xq1W3UqFG2detWmzt3btbtFQAAAPJ+MP3mm2+sUKFC9uabb1rt2rXj5n333Xd28OBBu/DCC1N97apVq+zAgQPWpEmT6LSSJUtajRo1bMmSJZnZfgAAAOQRiRl9gfqLxvYZjbVmzRr3/MILL9iHH35oBQoUsGbNmlnPnj2tRIkSrmZUKlSoEPe6sm
XLRudlhmpqFYiz2qFDh+KeER6UXXhRdnm/7BISEiwpKSmHtgrAqeh8VY7KbnoPnftZHkxPRcFUYVRBc/z48a4G9YknnrC1a9fa888/H/1nVbhw4bjXFSlSxPbs2ZPp9z127JitXLnSssvGjRuzbd3IXpRdeFF2ebfsFErVUgYg923YsCHHKgJS5r9sD6Z33XWX3XLLLVaqVCn3e9WqVV0f0z/96U/21VdfWdGiRaN9TYOf5ciRI2d09ayuBVWqVLGspoLSP1j1peXqPlwou/Ci7PJ+2aWn1gRAzqhUqVKO1JhqDFJ6ZGkwVW1pEEoDF110kXtWU33QhL99+3a74IILosvod91WKrP0T65YsWKZfv3p6B9sdq4f2YeyCy/KLrwoOyA8knKoAiC9F6RZeoP9vn37WseOHeOmqaZUVKOpUfjFixe3xYsXR+fv3bvXVqxYYQ0bNszKTQEAAEDIZGkwbdWqlS1cuNCefvpp17/0gw8+sH79+lmbNm2scuXKrm9B+/bt3a2m5s+f70bpa2BU+fLlrWXLllm5KQAAAAiZLG3Kv/LKK91N9ydMmGDPPfecG4nftm1bu++++6LL6B6mx48fd98QdfjwYVdTOmnSJNdPFAAAAPnXGQXT4cOHnzTtmmuucY+0FCxY0Pr06eMeAAAAQLY05QMAAACZRTAFAACAFwimAAAA8ALBFAAAAF4gmAIAAMALBFMAAAB4gWAKAAAALxBMAQAA4AWCKQAAALxAMAUAAIAXCKYAAADwAsEUAAAAXiCYAgAAwAsEUwAAAHiBYAoAAAAvEEwBAADgBYIpAAAAvEAwBQAAgBcIpgAAAPACwRQAAABeIJgCAADACwRTAAAAeIFgCgAAAC8QTAEAAOAFgikAAAC8QDAFAACAFwimAAAA8ALBFAAAAF4gmAIAAMALBFMAAAB4gWAKAAAALxBMAQAA4AWCKQAAALxAMAUAAIAXCKYAAADwAsEUAAAAXiCYAgAAwAsEUwAAAHiBYAoAAAAvEEwBAAAQ/mD67LPPWocOHeKmrVy50tq3b2916tSxFi1a2NSpU+PmJycn25gxY6xp06ZumS5dutjmzZvPZDMAAACQn4Pp9OnTbfTo0XHTdu/ebZ06dbILLrjAXn31Vbv77rttxIgR7ufAuHHjbMaMGTZkyBCbOXOmC6qdO3e2o0ePntmeAAAAINQSM/qCbdu22aOPPmqLFy+2ihUrxs2bNWuWFSpUyAYPHmyJiYlWuXJl27Rpk02YMMFuuukmFz4nT55svXv3tubNm7vXjBo1ytWezp0719q0aZN1ewYAAIC8XWP6zTffuPD55ptvWu3atePmLV261Bo1auRCaaBx48a2ceNG27Fjh61atcoOHDhgTZo0ic4vWbKk1ahRw5YsWXKm+wIAAID8VGOqfqN6pGbr1q1WtWrVuGlly5Z1z1u2bHHzpUKFCictE8zLjEgkYgcPHrSsdujQobhnhAdlF16UXd4vu4SEBEtKSsqhrQJwKjpflaOym95D536WB9NTOXz4sBUuXDhuWpEiRdzzkSNHov+sUltmz549mX7fY8eOuUFX2UU1vggnyi68KLu8W3YKpWopA5D7NmzYkGMVASnzX7YH06JFi540iEmBVIoVK+bmi5YJfg6WOZOrZ3UtqFKlimU1FZT+waovLVf34ULZhRdll/fLLj21JgByRqVKlXKkxnTdunXpWi5Lg2n58uVt+/btcdOC38uVK2fHjx+PTtPI/dhlqlWrlun31T85Bd/son+w2bl+ZB/KLrwou/Ci7IDwSMqhCoD0XpBm6Q32GzZsaMuWLbMTJ05Epy1atMil8dKlS1v16tWtePHibkR/YO/evbZixQr3WgAAAORfWRpMdUuo/fv328MPP+yqbGfPnm1Tpkyxbt26RfsW6Ob7urfp/Pnz3Sj9nj17uprWli1bZuWmAAAAIGSytClftaITJ060YcOGWbt27axMmTLWt29f93OgR48erkm/f//+br
CUakonTZrk+okCAAAg/zqjYDp8+PCTptWqVcteeumlNF9TsGBB69Onj3sAAAAA2dKUDwAAAGQWwRQAAABeIJgCAADACwRTAAAAeIFgCgAAAC8QTAEAAOAFgikAAAC8QDAFAACAFwimAAAA8ALBFAAAAF4gmAIAAMALBFMAAAB4gWAKAAAALxBMAQAA4AWCKQAAALxAMAUAAIAXCKYAAADwAsEUAAAAXiCYAgAAwAsEUwAAAHiBYAoAAAAvEEwBAADgBYIpAAAAvEAwBQAAgBcIpgAAAPACwRQAAABeIJgCAADACwRTAAAAeIFgCgAAAC8QTAEAAOAFgikAAAC8QDAFAACAFwimAAAA8ALBFAAAAF4gmAIAAMALBFMAAAB4gWAKAAAALxBMAQAA4AWCKQAAALxAMAUAAEDeDKbbtm2zatWqnfSYPXu2m79y5Upr37691alTx1q0aGFTp07N6k0AAABACCVm9QpXrVplRYoUsXnz5llCQkJ0eokSJWz37t3WqVMnF0gHDRpkn3/+uXs+66yz7KabbsrqTQEAAEB+DqZr1qyxihUrWtmyZU+a9/zzz1uhQoVs8ODBlpiYaJUrV7ZNmzbZhAkTCKYAAAD5XJY35a9evdoFztQsXbrUGjVq5EJpoHHjxrZx40bbsWNHVm8KAAAA8nuNaalSpezWW2+1DRs22K9//Wu76667rFmzZrZ161arWrVq3PJBzeqWLVvsl7/8ZabeMxKJ2MGDBy2rHTp0KO4Z4UHZhRdll/fLTt28kpKScmirAJyKzlflqOym94jt4pkjwfT48eP27bffWpUqVezBBx+04sWL21tvvWVdu3a1f/7zn3b48GErXLhw3GvUH1WOHDmS6fc9duyYG1SVXVSji3Ci7MKLssu7ZadQWqNGjRzbHgBpUyViTlUEpMyA2R5M1US/ePFiK1iwoBUtWtRNu+SSS2zt2rU2adIkN+3o0aNxrwkCabFixTL9vuq3qjCc1VRQ+gerPrNc3YcLZRdelF3eL7v01JoAyBmVKlXKkRrTdevW5U5TvkbYp3TRRRfZggULrHz58rZ9+/a4ecHv5cqVy/R76p/cmQTb09E/2OxcP7IPZRdelF14UXZAeCTlUAVAei9Is3Twk2pG69Wr52pNY3399deuRrNhw4a2bNkyO3HiRHTeokWLXFovXbp0Vm4KAAAAQiZLg6lG41944YXudlAagb9+/Xr729/+5u5XqgFQuiXU/v377eGHH3ZVurrp/pQpU6xbt25ZuRkAAAAIoSxtyi9QoICNHz/ennrqKbvvvvts7969roO7Bj4Fo/EnTpxow4YNs3bt2lmZMmWsb9++7mcAAADkb1nex1S3fFItaVpq1aplL730Ula/LQAAAEIuy2+wDwAAAGQGwRQAAABeIJgCAADACwRTAAAAeIFgCgAAAC8QTAEAAOAFgikAAAC8QDAFAACAFwimAAAA8ALBFAAAAF4gmAIAAMALBFMAAAB4gWAKAAAALxBMAQAA4AWCKQAAALxAMAUAAIAXCKYAAADwAsEUAAAAXiCYAgAAwAsEUwAAAHiBYAoAAAAvEEwBAADgBYIpAAAAvEAwBQAAgBcIpgBCJzk5YnldfthHAEgp8aQpAOC5AgUSbMT0Zfb9tn2WF51XroT1vrV+bm8GAOQ4gimAUFIoXf/DntzeDABAFqIpH4BXEhISLCkpyT0DAPIXakyBPNo/Uc3dYaRQWqNGjdzeDABALiCYAnlQXu6DWa96WfvrtQRXH1HbDeBMEUyBPCqv9sE8r2xxy+t+UaJIKGu9qe0GcKYIpgDgmeJJhaj1BpAvEUwBwFPUegPIbxiVj3yJm5cDAOAfakyRL9FMCgCAfwimyLdoJgUAwC805QMAAMALBFMAAAB4gWCaCXl94Exe2D9u9A0AQPjQxzQT8vLAmYsrnWNdrq9pYceNvgEACJ9cCabJycn29NNP28svv2z79u2zhg0b2iOPPGLnn3++hUVeHjiTl4
O3MGodAAA/5UowHTdunM2YMcOGDx9u5cuXtyeffNI6d+5sc+bMscKFC+fGJiGfBG9h1DoAAH7K8T6mR48etcmTJ1uPHj2sefPmVr16dRs1apRt3brV5s6dm9ObAwAAgPwaTFetWmUHDhywJk2aRKeVLFnS9QdcsmRJTm8OAAAAPJEQiURydAi2akXvvfde++KLL6xo0aLR6f/3f/9nhw8ftmeffTZD61u+fLlpFwoVKpTl26r1Hj9+3BITE+NGd+vnPfuP2vETyZbXFClU0IoXK5Rn9y8/7CP7F355fR/Zv/DL6/uY1/dPEgsWsLOLF3ZZJyccO3bM5ad69er51cf00KFD7jllX9IiRYrYnj0Z79MYBMbsuC2Q1plWn1cVZl6W1/cvP+wj+xd+eX0f2b/wy+v7mNf3T3Lqtop6n/S8V44H06CWVH1NY2tMjxw54m7xk1F169bN0u0DAABAPuljWqFCBfe8ffv2uOn6vVy5cjm9OQAAAMivwVSj8IsXL26LFy+OTtu7d6+tWLHC3c8UAAAA+VOON+Wrz2b79u1txIgRds4559i5557r7mOq+5m2bNkypzcHAAAA+fkG+7qHqUa79+/f343EV03ppEmTsmVkPQAAAMIhx28XBQAAAHjRxxQAAABIDcEUAAAAXiCYAgAAwAsEUwAAAHiBYAoAAAAvEEwBAADgBYIpAAAAvEAwTUNycrKNGTPGmjZtanXq1LEuXbrY5s2bc3uzkIpt27ZZtWrVTnrMnj3bzV+5cqX7tjGVY4sWLWzq1Km5vcn53rPPPmsdOnSIm3a6cuKc9Lfs9GUpKc8/lWGAsss9P/30kz3yyCPWrFkzq1evnt188822dOnS6PyFCxfajTfeaLVr17bWrVvbW2+9Fff6I0eO2KBBg6xJkyZWt25du//++23Xrl25sCf5z0+nKbtOnTqddN7FnpuhLTvdYB8nGzt2bOS3v/1t5P3334+sXLkycvvtt0datmwZOXLkSG5vGlL473//G6lZs2Zk27Ztke3bt0cfhw4diuzatcuV40MPPRRZt25d5JVXXnHL6hm5Y9q0aZHq1atH2rdvH52WnnLinPSz7OQPf/hDZOTIkXHn386dO6PzKbvc06lTp0ibNm0iS5YsiXz77beRQYMGRWrVqhVZv369O9d0nqns9PPEiRMjNWrUiHzyySfR1z/44IORq666yr3+iy++iNxwww2RW2+9NVf3Kb/odIqykyZNmkRmzJgRd97t3r079GVHME2F/lnWrVs3Mn369Oi0PXv2uD+IOXPm5Oq24WQTJkyItG3bNtV548ePj1x22WWRY8eORac99dRT7kMROWvr1q2Rbt26RerUqRNp3bp1XLg5XTlxTvpbdsnJyW763LlzU30tZZd7Nm7cGKlatWpk6dKlceWlsDJ69OjIgAED3EVFrF69erkLh6DcdSGii/+AApLWuXz58hzck/xn42nKbseOHW7+N998k+rrw1x2NOWnYtWqVXbgwAFX/R0oWbKk1ahRw5YsWZKr24aTrV692ipXrpzqPDV7NGrUyBITE6PTGjdubBs3brQdO3bk4Fbim2++sUKFCtmbb77pmg0zUk6ck/6W3XfffWcHDx60Cy+8MNXXUna5p1SpUjZhwgSrWbNmdFpCQoJ77N271513seUSnHfLli1TpZV7DqYFKlWqZOXKlaPscrnsVq9e7X5WeaQmzGVHME3F1q1b3XOFChXippctWzY6D/5Ys2aN6zdz66232qWXXur64Xz44YdunsqrfPnyJ5WjbNmyJVe2N79Sn8OxY8fa+eeff9K805UT56S/ZafzT1544QW33FVXXWWDBw+2ffv2uemUXe7RBcDll19uhQsXjk579913bdOmTa6/b1rn3aFDh2z37t2u/74CUpEiRU5ahrLL3bJbs2aNlShRwp1r6oOq/sGjR4+2o0ePumXDXHYE01TopJTYPwhRAaszMfxx/Phx+/bbb23Pnj127733uitMDa
7o2rWr69R/+PDhVMtRKEt/nK6cOCf9pQ/IAgUKuA+88ePH24MPPmgLFiyw7t27u0FPlJ0/li9fbg899JC1bNnSmjdvnup5F/yugKOySzlfKLvcL7s1a9a4MqhVq5ZNnDjR7rrrLnv55ZfdQEQJc9n9/3YzRBUtWjR6YgY/iwozKSkpF7cMKanpd/HixVawYMFoWV1yySW2du1amzRpkpsWXEEGgpOyWLFiubLNONnpyolz0l/6QLzllltc7YxUrVrVypQpY3/605/sq6++ouw8MW/ePOvdu7cb3T1ixIhoSEl53gW/q2xSOy+Fssv9shs8eLA98MADdvbZZ0fPO3W36dmzp/Xt2zfUZUeNaSqCJqft27fHTdfv6p8Bv5x11llxH3hy0UUXuaYMNVOlVo5CWfrjdOXEOekv1ZYGoTT2/BM1GVJ2uW/atGmuRemKK65wtdpBa4TKJrVy0cWgmol1XuqWRSkDDmWX+2WXmJgYDaWpnXdhLjuCaSqqV69uxYsXdzVxAXU2XrFihTVs2DBXtw3xVDOqq8jYspKvv/7aqlSp4spLncBPnDgRnbdo0SLXCbx06dK5sMVIzenKiXPSX6qd6dixY9w01ZSKzkHKLnfNmDHDhgwZ4vrgjxw5Mq55t0GDBvbpp5/GLa/zTv9TdcFRv3591x0jGEgjGzZscBf9lF3ull2HDh1c037K8061phUrVgx12RFMU6HC142+VWU+f/58N6pU1eO6AlH/DvhDo/E1GljNGhphun79evvb3/5mn3/+uWtivOmmm2z//v328MMP27p169xN96dMmWLdunXL7U1HjNOVE+ekv1q1auX6cz/99NNuhP4HH3xg/fr1szZt2rjzk7LLPQoijz32mF199dXuXNIdLn788Uf30OA0hZsvv/zSlY3+d06ePNneeecd69y5s3u9atauu+46129RFxZatlevXu4OGurLj9wru1atWtkbb7xhL774ovuyirffftueeOIJu+OOO9yFYJjLLkH3jMrtjfCRam50haIPSHUQ1xWGvoHhvPPOy+1NQwo6YZ966in76KOPXE2MbkOj/jiqDRCdkMOGDXM1NOr7dvvtt7sPSuQeDZD54Ycf3EjuwOnKiXPS37L797//7QYeaiCimoDbtm1r9913X7TZkbLLHWr6HTVqVKrz2rVrZ8OHD3d3MHnyySfdrdlUHmo2vvbaa6PL6VZgCkgaES4aAa6wk7L7BnK+7KZPn+4eCqZBv24N/FVtd5jLjmAKAAAAL9CUDwAAAC8QTAEAAOAFgikAAAC8QDAFAACAFwimAAAA8ALBFAAAAF4gmAIAshR3IQSQWQRTAFlK3yajLzkIvpYypRYtWribtOcEvY/eL7fpxvLVqlWz77//3vK6l19+2R5//PGTpuvb2XQM9EUKAJAWgimALKdv+tH3OB89ejS3NwU57JlnnrGffvrppOn6xhldJPz973/Ple0CEA4EUwBZTl9LuXbtWvvHP/6R25sCT+hrEhVKH3744dzeFAAeI5gCyHIXX3yx3XDDDTZx4kT7+uuvT1u7qu971ver16pVy5o3b24jRoywI0eOxDXJ33HHHfbSSy/ZVVdd5Zb7y1/+Yhs2bLD333/fvbZ27dr2xz/+0VauXHnSe+h1Wq9ed9ttt9mKFSvimtnV9UBN0L/73e+sUaNGtm7dOjdv3rx5duONN1rNmjXdvKFDh7rvnz6V5ORkGzdunHs/bVP37t1tz549Jy23Zs0a69atm9WrV8897r77bved16fzwQcfuH2vU6eOXXbZZe475/fu3Rudv2TJEnes9H30l1xyiaulHDt2rNsuUXcCNan/85//tNatW7ttfPXVV6P7e8stt1jdunXdazVfZRNr+/bt9sADD1iTJk3ccu3bt7fPPvvMzdN7/fDDD/baa6/FdV343//+Z7169XLbq+/5TlkGp9qm9Byn559/3r1O5dS0aVMbOHCg7d+//7THEoCHIgCQhdq3b+8eP/
30U+R3v/tdpE2bNpEjR45E519xxRWRBx54IPp7v379Ir/5zW8io0ePjixYsCAyYcKESO3atSO33357JDk52S2j5evWrevW9d5770X+9a9/RRo0aBC56qqrIldffXVkzpw5kXnz5rn3u/baa6Pr1usuvvjiyGWXXRZ57bXX3Guvv/76SL169SI//PCDW+bVV1+NVK1aNdK6devI+++/H5k9e7Z73zfffNNNv//++yMffPBBZMaMGZGGDRtGbrvttuh2pWb48OGRGjVqRMaOHRv58MMPIw899JDbP61r8+bNbplvv/3W7c9NN90UmTt3buTtt9+OtG3b1m3/jh070lz3f/7zn0i1atUi3bt3d9uqfWrSpIk7VrJy5Ur33r169Yp89NFH7v379Onj3lvHTLQN+l3v/8orr0TeeeedyJYtW9z6NH3o0KGRTz75xL1X586d3bTPP//cvXb//v2RFi1aRC6//HJ33FReeu86depENmzYEPnmm2/cPnTp0iXy2WefuXLfuXNnpGnTppGWLVu6Y6oy0N+HXrNu3bpTblN6jpPKXsd36tSpkcWLF0defPFFt+6+fftm6u8XQO4imALIlmAq8+fPd4Fj5MiRqQbTtWvXuvnPPvts3Dpef/11N/2///2v+13L6/cgyMgjjzzipilEBSZNmuSm7dmzJ+51X3zxRXSZ7du3R2rVquUCZGww1XsGFDybNWsWueOOO+K2S++lZRXiUqP3VUh68skn46ZrPbHBVMHx0ksvjezbty+6zO7duyP169ePbldq2rVrF7nhhhvigvFbb73lQt+PP/7ogqrC5IkTJ6Lz9bPWO2DAgLgQqAuCWM8991zcBUOwTbHl88ILL7hgvGLFiugyBw8edO8/a9asVC88VPY1a9aMfP/999FpCqxXXnll5N577z3lNqXnOGm/WrVqFbfPb7zxhguqAMInMbdrbAHkXWra/f3vf++a9Fu2bGm/+c1v4uZ/+umn7vm6666Lm67fNXhq8eLFdvnll7tpZ599tlWuXDm6zC9/+Uv3rGbfwC9+8Qv3rKbtkiVLup/PP/9814QfKFOmjGsGV5N3yu4HgW+//da2bt3qmpCPHz8ena7m8eLFi9vHH3/smupT+vzzz+3YsWN2xRVXxE2/5ppr7KOPPor+vmjRItdloGjRotH1a70NGjSwTz75JNVjefjwYdf8fe+991pCQkJ0+rXXXuseou4TeqgbhLo5bNq0yXVtUHcJbVda+yudO3d2zwcOHHCv/e6776J3VggGsS1btszOO++8uNcmJSXZu+++a2lZuHChW75cuXLRfVV/02bNmtmbb755ym1Kz3Fq3Lix66qhLhfq5qG/F3XtiD1GAMKDYAogW2k0tsKJgmbQbzAQ9L1UWIyVmJhopUqVsn379kWnKZCkplixYqd8/yDAxipdurRt2bIlzfUEo8oHDRrkHimpn2Vqgv3RtsdKuX9a/9tvv+0eKZ1zzjlprlutXNr2tCi8DhkyxN544w0X5BQi1Q9UxzPlvUVTHrddu3bZo48+6vqZKtT9+te/dgFQgtdqu0/1/qnRaxSQU16UBA4dOpTmNqXnOCmUq//sjBkzXN9e9ac999xzrXfv3tHADiA8CKYAspVqOjUYRYNWFBxSzpMff/zRhYmAavd27959UsDLjNQGHun90gqAEtS29u3b19XYpRRsd0rB9u7cudMuvPDC6PSUt0/SXQsuvfRS69Sp00nrUIhMjYK5AqMCZCzVjqpmUTXHTz31lKu9HD16tFt/EPQ0UOl0FORUUzxlyhQXZgsXLuxC46xZs+K2O7V7sS5fvvykGu3Y1+gY6limRu+TlvQepzZt2riHLmQWLFhgzz33nPXp08fq16/vamoBhAej8gFkOzWxKjhMmDAhLlgFoe+tt96KW16/q/lZweJMBc3SAdWUahT5b3/72zRfo1CpmkGFMI30Dh4KOQp/sSPKYynQqdn5nXfeiZuuOwfECkb+q+k6WLdGwS
sUvvfee6mu+6yzznLLp1zXhx9+aF27dnW1uGpq137peAehVHdF0DEPRuWnRa9Vdwu9PgiLWrcEr1UNqkbE61ZgscFY3QteeeWVaDN9yn1VGVSqVCnuWKpWV68pWLBgmtuUnuN03333uYueIMiq24TuhKAa47RqtgH4ixpTADliwIABrmZvx44d0WlVqlRxtw8aM2aMq51TH071iXz66addQNKtf85UkSJF7K677rKePXu6sKt7aaovqm5ZlBaFJS2vWzHpZ/UZVb9V1fhu27YtzWZphUeFItVYqu+l+j/q9k4pw6SW0S2f1If15ptvdtuofpJqRtexSEuPHj3cvujWS+pLqmM5cuRIF0SrVq3q+tL++9//thdffNHVXq5atcrd8F41rbFN5qnRa+fMmeP2rXz58q4WVBcSsa9VP84XXnjBbYO2RTXEU6dOdTXcus1UUNus4K7+w1pnx44dXQjV8+233+5eo6Z51cSqe8eppOc46RirC4K+bUr9VlVO+vupWLGiVa9e/ZTrB+AfgimAHKEwqCb9e+65J276sGHDXH9G9T9VE2zZsmXtr3/9qwslKWvfMkP3KG3VqpV7bzX1qlm7X79+p2zKF90TVUFTA7cUhlQDqfto6h6rGlCVFoUoLat7a+qhWlTd91PvH1Bg0v1BR40a5Zq41YdTwVJfSHDllVemuW4F5PHjx7vgpVpC7YMG+qjGMrjfq0KigrEGLKmPqUKkah3/85//uGCeluHDh7v+qXqIgp3612qA0tKlS6PdCaZNm2ZPPPGEW041qRpIpnAaHBOFz8cee8zdS1X3JVUt68yZM11Ns46Bali1bpX7H/7wh1OWQXqOk4Kr9lnvoX6mqrFWGaspv1ChQqdcPwD/JGhofm5vBAAAAEAfUwAAAHiBYAoAAAAvEEwBAADgBYIpAAAAvEAwBQAAgBcIpgAAAPACwRQAAABeIJgCAADACwRTAAAAeIFgCgAAAC8QTAEAAOAFgikAAADMB/8PCTteVC5D8lUAAAAASUVORK5CYII=",
      "text/plain": [
       "<Figure size 800x600 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns; sns.set_theme(style=\"whitegrid\")\n",
    "\n",
    "\n",
    "length = np.array([len(doc.page_content) for doc in texts])\n",
    "\n",
    "plt.figure(figsize=(8, 6))\n",
    "plt.hist(length)\n",
    "plt.title(\"Distribution de la longueur des chunks\")\n",
    "plt.xlabel(\"Nombre de caractères\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43bf41cd",
   "metadata": {},
   "source": [
    "Nous observons des chunks avec très peu de caractères. Inspecter les contenus des documents avec moins de 100 caractères et noter les améliorations possibles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8d300959",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INTRODUCTION AU MACHINE LEARNING\n",
      "2022-2026\n",
      "Théo Lopès-Quintas\n",
      "------------------------------\n",
      "vue un peu plus complète du domaine, ainsi qu’un aperçu plus récent des développements en cours.\n",
      "2\n",
      "------------------------------\n",
      "3. À condition que l’algorithme soit performant.\n",
      "7\n",
      "------------------------------\n",
      "Pour essayer de comprendre ce passage, faisons un exercice :\n",
      "4. Voir l’équation (2.3).\n",
      "8\n",
      "------------------------------\n",
      "11\n",
      "------------------------------\n",
      "le résultat, on peut vérifier la cohérence de la formule avec un exercice.\n",
      "15\n",
      "------------------------------\n",
      "valeur moyenne. La vision est donc bien complémentaire à celle de laRMSE.\n",
      "17\n",
      "------------------------------\n",
      "• FP : Faux positif - une baisse identifiée comme une hausse\n",
      "28\n",
      "------------------------------\n",
      "L’idée est de partitionner l’espace engendré parD, dont voici la procédure à chaque étape :\n",
      "33\n",
      "------------------------------\n",
      "définir ce que l’on appelle intuitivementla meilleure coupure.\n",
      "34\n",
      "------------------------------\n",
      "Devant cet exemple jouet, on peut imaginer une situation plus proche de la réalité :\n",
      "37\n",
      "------------------------------\n",
      "Pour saisir l’intérêt de la proposition, résolvons l’exercice suivant.\n",
      "38\n",
      "------------------------------\n",
      "40\n",
      "------------------------------\n",
      "des champions.\n",
      "41\n",
      "------------------------------\n",
      "42\n",
      "------------------------------\n",
      "fm(x) = fm−1(x) − γ\n",
      "nX\n",
      "i=1\n",
      "∂C\n",
      "∂fm−1\n",
      "\u0000\n",
      "x(i)\u0001\n",
      "\u0010\n",
      "yi, fm−1\n",
      "\u0010\n",
      "x(i)\n",
      "\u0011\u0011\n",
      "= fm−1(x) + γ′hm(x)\n",
      "45\n",
      "------------------------------\n",
      "peut visualiser ce résultat avec la figure (5.3).\n",
      "47\n",
      "------------------------------\n",
      "i (xi − µk)\n",
      "2. Conclure sur la convergence deJ.\n",
      "53\n",
      "------------------------------\n",
      "pour amener le clustering vers sa meilleure version.\n",
      "62\n",
      "------------------------------\n",
      "3. Que nous ne démontrerons pas\n",
      "68\n",
      "------------------------------\n",
      "6. Puisqu’on peut normaliser la distance par rapport au voisin le plus éloigné.\n",
      "71\n",
      "------------------------------\n",
      "2. Largement inspiré du schéma de Park ChangUk.\n",
      "77\n",
      "------------------------------\n",
      "8. Avec des valeurs non nulle dans la majorité des coordonnées.\n",
      "84\n",
      "------------------------------\n",
      "10. Pour plus de détails, voir la section (G.1)\n",
      "88\n",
      "------------------------------\n",
      "11. Dépendant donc de la méthode de tokenization et de la taille du vocabulaire.\n",
      "89\n",
      "------------------------------\n",
      "Appendices\n",
      "93\n",
      "------------------------------\n",
      "donner. Il nous faudrait une caractérisation plus simple d’utilisation :\n",
      "95\n",
      "------------------------------\n",
      "existe deux minimaux globaux et on aboutit à une absurdité en exploitant la stricte convexité.\n",
      "98\n",
      "------------------------------\n",
      "∥xi∥. Alors lak-ième erreur de classification du perceptron aura lieu avant :\n",
      "k ⩽\n",
      "\u0012R\n",
      "γ\n",
      "\u00132\n",
      "∥w∗∥2\n",
      "103\n",
      "------------------------------\n",
      "P({y = k}) × P\n",
      "\n",
      "\n",
      "d\\\n",
      "j=1\n",
      "xj | {y = k}\n",
      "\n",
      "\n",
      "P\n",
      "\n",
      "\n",
      "d\\\n",
      "j=1\n",
      "xj\n",
      "\n",
      "\n",
      "(C.1)\n",
      "109\n",
      "------------------------------\n",
      "exploratoire et d’augmentation des données pour répondre à un problème de Machine Learning.\n",
      "113\n",
      "------------------------------\n",
      "aléatoirement entre−1 et 1. Puis on normalise le vecteurx.\n",
      "114\n",
      "------------------------------\n",
      "époque il y avait également Yann Le Cun, à la tête de la recherche chez Meta.\n",
      "116\n",
      "------------------------------\n",
      "118\n",
      "------------------------------\n",
      "2. Kernel en allemand.\n",
      "125\n",
      "------------------------------\n",
      "s’améliore! Deux phénomènes contre-intuitifs se réalisent :\n",
      "132\n",
      "------------------------------\n",
      "computing. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.\n",
      "139\n",
      "------------------------------\n"
     ]
    }
   ],
   "source": [
    "for doc in texts:\n",
    "    if len(doc.page_content) < 100:\n",
    "        print(doc.page_content)\n",
    "        print(\"-\" * 30)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f69b2033",
   "metadata": {},
   "source": [
    "Nous avons à présent un ensemble de chunks, il nous reste à construire l'embedding pour stocker toutes ces informations. Nous faisons les choix suivants :\n",
    "* Nous utiliserons l'embedding [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) pour sa taille et son entraînement spécifique à notre tâche\n",
    "* Nous utiliserons le *vector store* [FAISS](https://python.langchain.com/docs/integrations/vectorstores/faiss/) puisque nous l'avons couvert en cours.\n",
    "* Nous récupérerons les trois chunks les plus proches, pour commencer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "40021b12",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_huggingface import HuggingFaceEmbeddings\n",
    "from langchain_community.vectorstores import FAISS\n",
    "import os\n",
    "\n",
    "os.environ['USE_TF'] = 'false'\n",
    "os.environ['USE_TORCH'] = 'true'\n",
    "os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'\n",
    "\n",
    "\n",
    "embedding_model = HuggingFaceEmbeddings(model_name=\"all-MiniLM-L6-v2\")\n",
    "vectordb = FAISS.from_documents(texts, embedding_model)\n",
    "n_doc_to_retrieve = 3\n",
    "retriever = vectordb.as_retriever(search_kwargs={\"k\": n_doc_to_retrieve})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed148169",
   "metadata": {},
   "source": [
    "Notre base de connaissance est réalisée ! Passons maintenant à l'augmentation du modèle de langage.\n",
    "\n",
    "## Génération\n",
    "\n",
    "Pour cette étape, il nous reste à définir le modèle de langage et comment nous allons nous adresser à lui.\n",
    "\n",
    "**Consigne** : Définir la variable *model* à partir de la classe [OllamaLLM](https://python.langchain.com/api_reference/ollama/llms/langchain_ollama.llms.OllamaLLM.html#ollamallm) et du modèle de votre choix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "4abfbda6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_ollama import OllamaLLM\n",
    "\n",
    "model = OllamaLLM(model=\"gemma3:4b\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d42c7f56",
   "metadata": {},
   "source": [
    "**Consigne** : À l'aide de la classe [PromptTemplate](https://python.langchain.com/api_reference/core/prompts/langchain_core.prompts.prompt.PromptTemplate.html#langchain_core.prompts.prompt.PromptTemplate) et en s'inspirant éventuellement de [cet exemple](https://smith.langchain.com/hub/rlm/rag-prompt), définir un template de prompt qui aura deux *input_variable* : 'context' et 'question'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "2c3c7729",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.prompts import PromptTemplate\n",
    "\n",
    "prompt_template = PromptTemplate(\n",
    "    template=\"\"\"\n",
    "    You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. \n",
    "    If you don't know the answer, just say that you don't know. Answer in the language of the question asked.\n",
    "\n",
    "    Question: {question}\n",
    "    Context:\\n{context}\n",
    "    Answer:\n",
    "    \"\"\",\n",
    "    input_variables=[\"context\", \"question\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0da52ea4",
   "metadata": {},
   "source": [
    "Pour construire la chaîne de RAG, LangChain utilise le [LangChain Expression Language (LCEL)](https://python.langchain.com/v0.2/docs/concepts/#langchain-expression-language-lcel), voici dans notre cas comment cela se traduit :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "c51afe07",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.runnables import RunnablePassthrough\n",
    "from langchain_core.output_parsers import StrOutputParser\n",
    "\n",
    "def format_docs(docs):\n",
    "    return \"\\n\\n\".join(doc.page_content for doc in docs)\n",
    "\n",
    "\n",
    "rag_chain = (\n",
    "    {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
    "    | prompt_template\n",
    "    | model\n",
    "    | StrOutputParser()\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7db86940",
   "metadata": {},
   "source": [
    "Une fois la chaîne définie, nous pouvons lui poser des questions :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "02444b65",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Answer: Nous ne pouvons qu’avoir un aperçu du futur, mais cela suffit pour comprendre qu’il y a beaucoup à faire.\n",
      "— Alan Turing (1950)\n"
     ]
    }
   ],
   "source": [
    "query = \"Quelle est la citation d'Alan Turing ?\"\n",
    "result = rag_chain.invoke(query)\n",
    "print(\"Answer:\", result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3ffe0531",
   "metadata": {},
   "source": [
    "LangChain ne permet pas nativement d'afficher quels chunks ont été utilisés pour produire la réponse, ni le score de similarité. Pour le faire, nous allons utiliser directement FAISS.\n",
    "\n",
    "**Consigne** : À l'aide de la méthode [`similarity_search_with_score`](https://python.langchain.com/v0.2/docs/integrations/vectorstores/llm_rails/#similarity-search-with-score) de `FAISS`, afficher les trois documents utilisés dans le RAG."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "95d81fe2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Similarity Score: 0.5376\n",
      "Document Content: s’entraîneront, propageant ainsi les biais des premiers. Évidemment les usages malveillants malgré un\n",
      "travail sur la sécurité et la toxicité toujours plus important.\n",
      "Finalement, la fameuse citation d’Alan Turing est plus que jamais d’actualité.\n",
      "--------------------------------------------------\n",
      "Similarity Score: 0.6169\n",
      "Document Content: Cadre et approche du cours\n",
      "Alan Turing publieComputing Machinery and Intelligenceen 1950 [Tur50], qui deviendra un article\n",
      "fondamental pour l’intelligence artificielle. Une citation devenue célèbre a motivé l’écriture de ce cours :\n",
      "--------------------------------------------------\n",
      "Similarity Score: 0.6388\n",
      "Document Content: Nous ne pouvons qu’avoir un aperçu du futur, mais cela suffit pour comprendre qu’il y a\n",
      "beaucoup à faire.\n",
      "— Alan Turing (1950)\n",
      "C’est par cette vision des années 1950 que nous nous proposons de remonter le temps et de découvrir\n",
      "--------------------------------------------------\n"
     ]
    }
   ],
   "source": [
    "results_with_scores = vectordb.similarity_search_with_score(query, k=n_doc_to_retrieve)\n",
    "\n",
    "for doc, score in results_with_scores:\n",
    "    print(f\"Similarity Score: {score:.4f}\")\n",
    "    print(f\"Document Content: {doc.page_content}\")\n",
    "    print(\"-\" * 50)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6aeeadf8",
   "metadata": {},
   "source": [
    "Nous avons finalement bien défini notre premier RAG !\n",
    "\n",
    "## Amélioration de notre RAG\n",
    "\n",
    "Mais nous pouvons faire mieux, notamment afficher la source dans la génération pour que l'utilisateur puisse vérifier, et mesurer les performances de notre RAG. Une fois ces deux améliorations réalisées, nous pourrons modifier plusieurs points techniques spécifiques et mesurer l'apport en performance.\n",
    "\n",
    "### Exploiter les méta-données\n",
    "\n",
    "Nous avons utilisé la classe `PyPDFLoader` qui charge chaque page dans un document. Nous avons largement utilisé le contenu *page_content* mais l'attribut *metadata* contient deux informations qui nous intéressent : *source* et *page*. \n",
    "\n",
    "**Consigne** : Modifier la fonction `format_docs` pour qu'elle prenne en paramètre une liste de documents LangChain et qu'elle renvoie la source et la page en plus du seul contenu texte."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "cae9a90c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def format_docs(docs):\n",
    "    formatted = []\n",
    "    for doc in docs:\n",
    "        source = doc.metadata.get(\"source\", \"unknown\")\n",
    "        # 'page' est indexée à partir de 0 par PyPDFLoader et peut être absente\n",
    "        page = doc.metadata.get(\"page\")\n",
    "        page_display = page + 1 if isinstance(page, int) else \"unknown\"\n",
    "        content = doc.page_content.strip()\n",
    "        formatted.append(f\"[Source: {source}, Page: {page_display}]\\n{content}\")\n",
    "    return \"\\n\\n\".join(formatted)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0363d832",
   "metadata": {},
   "source": [
    "Maintenant que nous passons des informations sur les métadonnées, il faut s'assurer que le modèle de langage les utilise.\n",
    "\n",
    "**Consigne** : Modifier le prompt template défini plus tôt pour intégrer cette règle."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "a57e10a6",
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt_template = PromptTemplate(\n",
    "    template=\"\"\"\n",
    "    You are an assistant for question-answering tasks. \n",
    "    Use the following retrieved pieces of context (with source and page information) to answer the question. \n",
    "    If you don't know the answer, just say that you don't know. Answer in the same language as the question.\n",
    "    When possible, cite the source and page in your answer. \n",
    "\n",
    "    Question: {question}\n",
    "    Context:\\n{context}\n",
    "    Answer:\n",
    "    \"\"\",\n",
    "    input_variables=[\"context\", \"question\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "260f39f4",
   "metadata": {},
   "source": [
    "Testons à présent avec la même question sur une nouvelle chaîne RAG prenant en compte nos améliorations.\n",
    "\n",
    "**Consigne** : Définir un nouveau RAG prenant en compte les informations des méta-données, puis poser la même question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "b3824802",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Answer: Selon ML.pdf, page 92, la citation d'Alan Turing est : « Nous ne pouvons qu’avoir un aperçu du futur, mais cela suffit pour comprendre qu’il y a beaucoup à faire. »\n"
     ]
    }
   ],
   "source": [
    "rag_chain = (\n",
    "    {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
    "    | prompt_template\n",
    "    | model\n",
    "    | StrOutputParser()\n",
    ")\n",
    "\n",
    "query = \"Quelle est la citation d'Alan Turing ?\"\n",
    "result = rag_chain.invoke(query)\n",
    "print(\"Answer:\", result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "973dfa8d",
   "metadata": {},
   "source": [
    "C'est ce que nous souhaitions obtenir ! Mais nous pourrions avoir un format un peu plus structuré et moins libre. Pour cela, nous allons modifier notre système pour qu'il renvoie des JSON !\n",
    "Commençons par modifier le template de prompt pour lui donner les instructions :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "d4892e8d",
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt_template = PromptTemplate(\n",
    "    template=\"\"\"\n",
    "    You are an assistant for question-answering tasks, use the retrieved context to answer the question. Each piece of context includes metadata (source + page).\n",
    "    If you don’t know the answer, respond with: {{\"answer\": \"I don't know\", \"sources\": []}}\n",
    "    Otherwise, return your answer in JSON with this exact structure:\n",
    "    {{\n",
    "        \"answer\": \"your answer here\",\n",
    "        \"sources\": [\"source:page\", \"source:page\"]\n",
    "    }}\n",
    "    Rules:\n",
    "    - Answer in the same language as the question.\n",
    "    - Always include the sources (source:page).\n",
    "    - Never add extra fields.\n",
    "\n",
    "    Question: {question}\n",
    "    Context:\\n{context}\n",
    "    Answer:\n",
    "    \"\"\",\n",
    "    input_variables=[\"context\", \"question\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "01e34935",
   "metadata": {},
   "source": [
    "Puisque nous demandons ici de répondre par exemple : [\"ML.pdf:91\"], nous allons lui faciliter la tâche en modifiant la fonction `format_docs`.\n",
    "\n",
    "**Consigne** : Modifier la fonction `format_docs` pour prendre en compte le formatage 'source:page'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "547f6ea2",
   "metadata": {},
   "outputs": [],
   "source": [
    "def format_docs(docs):\n",
    "    formatted = []\n",
    "    for doc in docs:\n",
    "        source = doc.metadata.get(\"source\", \"unknown\")\n",
    "        # 'page' est indexée à partir de 0 par PyPDFLoader et peut être absente\n",
    "        page = doc.metadata.get(\"page\")\n",
    "        page_display = page + 1 if isinstance(page, int) else \"unknown\"\n",
    "        content = doc.page_content.strip()\n",
    "        formatted.append(f\"[{source}:{page_display}]\\n{content}\")\n",
    "    return \"\\n\\n\".join(formatted)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0238f9f6",
   "metadata": {},
   "source": [
    "Si nous souhaitons obtenir un JSON, ou un dictionnaire, en sortie du modèle, nous devons modifier la chaîne RAG définie précédemment.\n",
    "\n",
    "**Consigne** : Utiliser la classe [`JsonOutputParser`](https://python.langchain.com/api_reference/core/output_parsers/langchain_core.output_parsers.json.JsonOutputParser.html) à la place de [`StrOutputParser`](https://python.langchain.com/api_reference/core/output_parsers/langchain_core.output_parsers.string.StrOutputParser.html#langchain_core.output_parsers.string.StrOutputParser), puis tester la nouvelle chaîne RAG avec la même question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "c0f90db7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Answer: {'answer': 'Nous ne pouvons qu’avoir un aperçu du futur, mais cela suffit pour comprendre qu’il y a beaucoup à faire.', 'sources': ['ML.pdf:2']}\n"
     ]
    }
   ],
   "source": [
    "from langchain_core.output_parsers import JsonOutputParser\n",
    "\n",
    "\n",
    "rag_chain = (\n",
    "    {\"context\": retriever | format_docs, \"question\": RunnablePassthrough()}\n",
    "    | prompt_template\n",
    "    | model\n",
    "    | JsonOutputParser()\n",
    ")\n",
    "\n",
    "query = \"Quelle est la citation d'Alan Turing ?\"\n",
    "result = rag_chain.invoke(query)\n",
    "print(\"Answer:\", result)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3db037d1",
   "metadata": {},
   "source": [
    "C'est mieux ! Il nous reste à présent à mesurer la performance de notre système.\n",
    "\n",
    "\n",
    "### Mesurer les performances\n",
    "\n",
    "Nous avons défini manuellement plusieurs questions, dont les réponses sont contenues dans le cours, dans le fichier JSON *eval_dataset*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "d4398984",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'question': \"Qu'est-ce qu'un algorithme ?\", 'answer': 'Un algorithme est une séquence d’instructions logique ordonnée pour répondre explicitement à un problème.', 'sources': 'ML.pdf:6'}\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "with open(\"eval_dataset.json\", \"r\", encoding=\"utf-8\") as file:\n",
    "    eval_dataset = json.load(file)\n",
    "\n",
    "print(eval_dataset[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37b8eb75",
   "metadata": {},
   "source": [
    "Il sera probablement difficile de mesurer la performance de manière frontale. Ainsi, nous optons pour une méthodologie *LLM as a Judge*.\n",
    "\n",
    "**Consigne** : Définir une fonction `evaluate_rag` qui prend en paramètre une chaîne RAG et un dataset pour évaluation. La fonction renverra une liste de dictionnaires avec pour clés :\n",
    "* *question* : la question posée\n",
    "* *expected_answer* : la réponse attendue\n",
    "* *predicted_answer* : la réponse obtenue\n",
    "* *expected_sources* : la ou les sources attendues\n",
    "* *predicted_sources* : la ou les sources obtenues"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "4a3a70a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "def evaluate_rag(rag_chain, dataset):\n",
    "    results = []\n",
    "    for example in dataset:\n",
    "        prediction = rag_chain.invoke(example[\"question\"])\n",
    "\n",
    "        results.append({\n",
    "            \"question\": example[\"question\"],\n",
    "            \"expected_answer\": example[\"answer\"],\n",
    "            \"predicted_answer\": prediction[\"answer\"],\n",
    "            \"expected_sources\": example[\"sources\"],\n",
    "            \"predicted_sources\": prediction[\"sources\"]\n",
    "        })\n",
    "    return results"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "da59e623",
   "metadata": {},
   "source": [
    "**Consigne** : Tester la fonction précédente avec les trois premières questions puis afficher le résultat sous la forme d'un dataframe pandas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "a33db551",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{'question': \"Qu'est-ce qu'un algorithme ?\", 'expected_answer': 'Un algorithme est une séquence d’instructions logique ordonnée pour répondre explicitement à un problème.', 'predicted_answer': \"Un algorithme est un objet dont nous supposerons l'existence, et dont la description sera le cœur des prochains chapitres.\", 'expected_sources': 'ML.pdf:6', 'predicted_sources': ['ML.pdf:6', 'ML.pdf:134']}, {'question': \"Qu'est-ce qu'un hackathon ?\", 'expected_answer': 'Un hackathon en Machine Learning est une compétition entre data-scientists (ou étudiants) dont le but est de trouver la meilleure manière de répondre à une tâche donnée.', 'predicted_answer': \"I don't know\", 'expected_sources': 'ML.pdf:10', 'predicted_sources': []}, {'question': \"Quel est l'inconvénient de la méthode Leave-One-Out Cross-Validation ?\", 'expected_answer': 'L’un des inconvénients majeur est que cela peut devenir très long et très coûteux en opération de calcul puisqu’il faut entraîner n fois l’algorithme sur presque l’ensemble du dataset', 'predicted_answer': \"L'inconvénient de la méthode Leave-One-Out Cross-Validation est que pour chaque point de données, le modèle est entraîné sur tous les autres points de données et testé sur le point de données restant. Cela peut entraîner des estimations de l'erreur beaucoup plus élevées que celles obtenues par la validation croisée standard.\", 'expected_sources': 'ML.pdf:10', 'predicted_sources': ['ML.pdf:10', 'ML.pdf:10', 'ML.pdf:128']}]\n"
     ]
    }
   ],
   "source": [
    "results = evaluate_rag(rag_chain, dataset=eval_dataset[:3])\n",
    "print(results)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14393690",
   "metadata": {},
   "source": [
    "Nous sommes capables d'obtenir un ensemble de réponses de la part d'un modèle avec un RAG, il nous reste à mettre en place le juge.\n",
    "\n",
    "**Consigne** : Définir un prompt pour décrire le rôle du juge."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "a9eacd88",
   "metadata": {},
   "outputs": [],
   "source": [
    "judge_prompt = PromptTemplate(\n",
    "    template=\"\"\"\n",
    "    You are an evaluator. Your task is to compare a student's answer with the reference answer. \n",
    "    The student answer may still be valid even if it is phrased differently.\n",
    "\n",
    "    Question: {question}\n",
    "    Reference Answer: {expected_answer}\n",
    "    Expected Sources: {expected_sources}\n",
    "\n",
    "    Student Answer: {predicted_answer}\n",
    "    Student Sources: {predicted_sources}\n",
    "\n",
    "    Evaluation Instructions:\n",
    "    - If the student's answer correctly matches the meaning of the reference answer, mark it as CORRECT. \n",
    "    - If it is wrong or missing important details, mark it as INCORRECT.\n",
    "    - For sources, check if the student listed at least the expected sources. Extra sources are allowed.\n",
    "    - Return your judgment strictly as JSON:\n",
    "    {{\n",
    "        \"answer_correct\": true/false,\n",
    "        \"sources_correct\": true/false\n",
    "    }}\n",
    "    \"\"\",\n",
    "    input_variables=[\n",
    "        \"question\",\n",
    "        \"expected_answer\",\n",
    "        \"predicted_answer\",\n",
    "        \"expected_sources\",\n",
    "        \"predicted_sources\",\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc714900",
   "metadata": {},
   "source": [
    "**Consigne** : Définir une chaîne pour le juge, de la même manière que le RAG : prompt --> model --> JSONParser"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "b3c30cc3",
   "metadata": {},
   "outputs": [],
   "source": [
    "judge_model = OllamaLLM(model=\"gemma3:4b\")\n",
    "json_parser = JsonOutputParser()\n",
    "\n",
    "judge_chain = judge_prompt | judge_model | json_parser"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6069627d",
   "metadata": {},
   "source": [
    "**Consigne** : Modifier la fonction `evaluate_rag` pour qu'elle note directement la performance du modèle et renvoie les résultats sous la forme d'un dataframe pandas. On implémentera également des mesures temporelles pour le RAG et le juge, ainsi que des blocs *try...except...* pour ne pas bloquer l'exécution de toutes les requêtes si une renvoie une erreur.\n",
    "Pour pouvoir suivre l'avancement de l'évaluation, on utilisera la barre de progression tqdm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "0556cbed",
   "metadata": {},
   "outputs": [],
   "source": [
    "from tqdm import tqdm\n",
    "import time\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def evaluate_rag(rag_chain, dataset, judge_chain):\n",
    "    \"\"\"\n",
    "    Evaluate a RAG chain against a dataset using a judge LLM.\n",
    "\n",
    "    Args:\n",
    "        rag_chain: LangChain RAG chain.\n",
    "        dataset: List of dicts with 'question', 'answer', 'sources'.\n",
    "        judge_chain: LangChain judge chain that outputs JSON with 'answer_correct' and 'sources_correct'.\n",
    "\n",
    "    Returns:\n",
    "        pandas.DataFrame with predictions, judgment, and timings.\n",
    "    \"\"\"\n",
    "    results = []\n",
    "\n",
    "    iterator = tqdm(dataset, desc=\"Evaluating RAG\", unit=\"query\")\n",
    "\n",
    "    for example in iterator:\n",
    "        rag_start = time.time()\n",
    "        try:\n",
    "            prediction = rag_chain.invoke(example[\"question\"])\n",
    "        except Exception as e:\n",
    "            prediction = {\"answer\": \"\", \"sources\": []}\n",
    "            print(f\"[RAG ERROR] Question: {example['question']} | {e}\")\n",
    "        rag_end = time.time()\n",
    "\n",
    "        judge_input = {\n",
    "            \"question\": example[\"question\"],\n",
    "            \"expected_answer\": example[\"answer\"],\n",
    "            \"predicted_answer\": prediction.get(\"answer\", \"\"),\n",
    "            \"expected_sources\": example[\"sources\"],\n",
    "            \"predicted_sources\": prediction.get(\"sources\", []),\n",
    "        }\n",
    "\n",
    "        judge_start = time.time()\n",
    "        try:\n",
    "            judgment = judge_chain.invoke(judge_input)\n",
    "        except Exception as e:\n",
    "            judgment = {\"answer_correct\": False, \"sources_correct\": False, \"explanation\": f\"Judge error: {e}\"}\n",
    "            print(f\"[JUDGE ERROR] Question: {example['question']} | {e}\")\n",
    "        judge_end = time.time()\n",
    "\n",
    "        results.append({\n",
    "            **judge_input,\n",
    "            **judgment,\n",
    "            \"rag_time\": rag_end - rag_start,\n",
    "            \"judge_time\": judge_end - judge_start,\n",
    "            \"total_time\": judge_end - rag_start\n",
    "        })\n",
    "\n",
    "    return pd.DataFrame(results)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73d842ea",
   "metadata": {},
   "source": [
    "**Consigne** : Utiliser cette fonction sur les trois premières questions du dataset d'évaluation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "afad101d",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Evaluating RAG: 100%|██████████| 10/10 [00:46<00:00, 4.64s/query]\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>question</th>\n",
       "      <th>expected_answer</th>\n",
       "      <th>predicted_answer</th>\n",
       "      <th>expected_sources</th>\n",
       "      <th>predicted_sources</th>\n",
       "      <th>answer_correct</th>\n",
       "      <th>sources_correct</th>\n",
       "      <th>rag_time</th>\n",
       "      <th>judge_time</th>\n",
       "      <th>total_time</th>\n",
       "    </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>Qu'est-ce qu'un algorithme ?</td>\n",
|
||
" <td>Un algorithme est une séquence d’instructions ...</td>\n",
|
||
" <td>Nous ne discuterons pas d’algorithmes en parti...</td>\n",
|
||
" <td>ML.pdf:6</td>\n",
|
||
" <td>[ML.pdf:6]</td>\n",
|
||
" <td>False</td>\n",
|
||
" <td>True</td>\n",
|
||
" <td>2.782175</td>\n",
|
||
" <td>1.656888</td>\n",
|
||
" <td>4.439065</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>Qu'est-ce qu'un hackathon ?</td>\n",
|
||
" <td>Un hackathon en Machine Learning est une compé...</td>\n",
|
||
" <td>I don't know</td>\n",
|
||
" <td>ML.pdf:10</td>\n",
|
||
" <td>[]</td>\n",
|
||
" <td>False</td>\n",
|
||
" <td>False</td>\n",
|
||
" <td>1.868308</td>\n",
|
||
" <td>1.657052</td>\n",
|
||
" <td>3.525366</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>Quel est l'inconvénient de la méthode Leave-On...</td>\n",
|
||
" <td>L’un des inconvénients majeur est que cela peu...</td>\n",
|
||
" <td>L'inconvénient de la méthode Leave-One-Out Cro...</td>\n",
|
||
" <td>ML.pdf:10</td>\n",
|
||
" <td>[ML.pdf:10, ML.pdf:128]</td>\n",
|
||
" <td>True</td>\n",
|
||
" <td>True</td>\n",
|
||
" <td>4.339367</td>\n",
|
||
" <td>1.844820</td>\n",
|
||
" <td>6.184189</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>Qu'est-ce que la régression polynomiale ?</td>\n",
|
||
" <td>Une régression polynomiale est une régression ...</td>\n",
|
||
" <td>Une régression polynomiale est une régression ...</td>\n",
|
||
" <td>ML.pdf:21</td>\n",
|
||
" <td>[ML.pdf:21]</td>\n",
|
||
" <td>True</td>\n",
|
||
" <td>True</td>\n",
|
||
" <td>3.342725</td>\n",
|
||
" <td>1.751531</td>\n",
|
||
" <td>5.094258</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>What is exercise 3.5 about ?</td>\n",
|
||
" <td>Mail classification</td>\n",
|
||
" <td>I don't know</td>\n",
|
||
" <td>ML.pdf:30</td>\n",
|
||
" <td>[]</td>\n",
|
||
" <td>False</td>\n",
|
||
" <td>False</td>\n",
|
||
" <td>2.151726</td>\n",
|
||
" <td>1.553353</td>\n",
|
||
" <td>3.705080</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>5</th>\n",
|
||
" <td>Quel est l'autre nom du bagging ?</td>\n",
|
||
" <td>La solution donne son nom à la section : nous ...</td>\n",
|
||
" <td>Le bagging est également connu sous le nom d’a...</td>\n",
|
||
" <td>ML.pdf:39</td>\n",
|
||
" <td>[ML.pdf:40, ML.pdf:68]</td>\n",
|
||
" <td>True</td>\n",
|
||
" <td>True</td>\n",
|
||
" <td>2.952315</td>\n",
|
||
" <td>1.646025</td>\n",
|
||
" <td>4.598341</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6</th>\n",
|
||
" <td>Qu'est-ce qu'une souche en Machine Learning ?</td>\n",
|
||
" <td>Les weak learners d’AdaBoost sont appelés des ...</td>\n",
|
||
" <td>En Machine Learning, une souche (ou lineage) f...</td>\n",
|
||
" <td>ML.pdf:42</td>\n",
|
||
" <td>[ML.pdf:113]</td>\n",
|
||
" <td>False</td>\n",
|
||
" <td>True</td>\n",
|
||
" <td>4.658800</td>\n",
|
||
" <td>1.877533</td>\n",
|
||
" <td>6.536340</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>7</th>\n",
|
||
" <td>Quelle sont les trois propriétés mathématiques...</td>\n",
|
||
" <td>Indiscernabilité, symétrie et sous-additivité</td>\n",
|
||
" <td>I don't know</td>\n",
|
||
" <td>ML.pdf:51</td>\n",
|
||
" <td>[]</td>\n",
|
||
" <td>False</td>\n",
|
||
" <td>False</td>\n",
|
||
" <td>2.128439</td>\n",
|
||
" <td>1.583463</td>\n",
|
||
" <td>3.711939</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>8</th>\n",
|
||
" <td>Pourquoi KMeans a été introduit ?</td>\n",
|
||
" <td>Kmeans++ : un meilleur départ\\nSuivre cette mé...</td>\n",
|
||
" <td>I don't know</td>\n",
|
||
" <td>ML.pdf:54</td>\n",
|
||
" <td>[]</td>\n",
|
||
" <td>False</td>\n",
|
||
" <td>False</td>\n",
|
||
" <td>1.878088</td>\n",
|
||
" <td>1.763518</td>\n",
|
||
" <td>3.641612</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>9</th>\n",
|
||
" <td>Dans quel article a été introduit le lemme de ...</td>\n",
|
||
" <td>Cette similitude est expliquée par le titre de...</td>\n",
|
||
" <td>Le lemme de Johnson-Lindenstrauss a été introd...</td>\n",
|
||
" <td>ML.pdf:63</td>\n",
|
||
" <td>[ML.pdf:64]</td>\n",
|
||
" <td>True</td>\n",
|
||
" <td>True</td>\n",
|
||
" <td>3.118761</td>\n",
|
||
" <td>1.801741</td>\n",
|
||
" <td>4.920507</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" question \\\n",
|
||
"0 Qu'est-ce qu'un algorithme ? \n",
|
||
"1 Qu'est-ce qu'un hackathon ? \n",
|
||
"2 Quel est l'inconvénient de la méthode Leave-On... \n",
|
||
"3 Qu'est-ce que la régression polynomiale ? \n",
|
||
"4 What is exercise 3.5 about ? \n",
|
||
"5 Quel est l'autre nom du bagging ? \n",
|
||
"6 Qu'est-ce qu'une souche en Machine Learning ? \n",
|
||
"7 Quelle sont les trois propriétés mathématiques... \n",
|
||
"8 Pourquoi KMeans a été introduit ? \n",
|
||
"9 Dans quel article a été introduit le lemme de ... \n",
|
||
"\n",
|
||
" expected_answer \\\n",
|
||
"0 Un algorithme est une séquence d’instructions ... \n",
|
||
"1 Un hackathon en Machine Learning est une compé... \n",
|
||
"2 L’un des inconvénients majeur est que cela peu... \n",
|
||
"3 Une régression polynomiale est une régression ... \n",
|
||
"4 Mail classification \n",
|
||
"5 La solution donne son nom à la section : nous ... \n",
|
||
"6 Les weak learners d’AdaBoost sont appelés des ... \n",
|
||
"7 Indiscernabilité, symétrie et sous-additivité \n",
|
||
"8 Kmeans++ : un meilleur départ\\nSuivre cette mé... \n",
|
||
"9 Cette similitude est expliquée par le titre de... \n",
|
||
"\n",
|
||
" predicted_answer expected_sources \\\n",
|
||
"0 Nous ne discuterons pas d’algorithmes en parti... ML.pdf:6 \n",
|
||
"1 I don't know ML.pdf:10 \n",
|
||
"2 L'inconvénient de la méthode Leave-One-Out Cro... ML.pdf:10 \n",
|
||
"3 Une régression polynomiale est une régression ... ML.pdf:21 \n",
|
||
"4 I don't know ML.pdf:30 \n",
|
||
"5 Le bagging est également connu sous le nom d’a... ML.pdf:39 \n",
|
||
"6 En Machine Learning, une souche (ou lineage) f... ML.pdf:42 \n",
|
||
"7 I don't know ML.pdf:51 \n",
|
||
"8 I don't know ML.pdf:54 \n",
|
||
"9 Le lemme de Johnson-Lindenstrauss a été introd... ML.pdf:63 \n",
|
||
"\n",
|
||
" predicted_sources answer_correct sources_correct rag_time \\\n",
|
||
"0 [ML.pdf:6] False True 2.782175 \n",
|
||
"1 [] False False 1.868308 \n",
|
||
"2 [ML.pdf:10, ML.pdf:128] True True 4.339367 \n",
|
||
"3 [ML.pdf:21] True True 3.342725 \n",
|
||
"4 [] False False 2.151726 \n",
|
||
"5 [ML.pdf:40, ML.pdf:68] True True 2.952315 \n",
|
||
"6 [ML.pdf:113] False True 4.658800 \n",
|
||
"7 [] False False 2.128439 \n",
|
||
"8 [] False False 1.878088 \n",
|
||
"9 [ML.pdf:64] True True 3.118761 \n",
|
||
"\n",
|
||
" judge_time total_time \n",
|
||
"0 1.656888 4.439065 \n",
|
||
"1 1.657052 3.525366 \n",
|
||
"2 1.844820 6.184189 \n",
|
||
"3 1.751531 5.094258 \n",
|
||
"4 1.553353 3.705080 \n",
|
||
"5 1.646025 4.598341 \n",
|
||
"6 1.877533 6.536340 \n",
|
||
"7 1.583463 3.711939 \n",
|
||
"8 1.763518 3.641612 \n",
|
||
"9 1.801741 4.920507 "
|
||
]
|
||
},
|
||
"execution_count": 24,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"results = evaluate_rag(rag_chain, dataset=eval_dataset[:10], judge_chain=judge_chain)\n",
|
||
"results"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "91231c6d",
|
||
"metadata": {},
|
||
"source": [
|
||
    "**Instruction**: From the results above, report performance statistics for the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "59d821db",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy: 40.00%\n",
      "Accuracy source: 60.00%\n",
      "RAG time (avg): 2.92s\n",
      "Judge time (avg): 1.71s\n",
      "Total time (avg): 4.64s\n"
     ]
    }
   ],
   "source": [
    "accuracy = results[\"answer_correct\"].astype(int).mean()\n",
    "source_accuracy = results[\"sources_correct\"].astype(int).mean()\n",
    "avg_rag_time = results[\"rag_time\"].mean()\n",
    "avg_judge_time = results[\"judge_time\"].mean()\n",
    "avg_total_time = results[\"total_time\"].mean()\n",
    "\n",
    "print(f\"Accuracy: {100 * accuracy:.2f}%\")\n",
    "print(f\"Accuracy source: {100 * source_accuracy:.2f}%\")\n",
    "print(f\"RAG time (avg): {avg_rag_time:.2f}s\")\n",
    "print(f\"Judge time (avg): {avg_judge_time:.2f}s\")\n",
    "print(f\"Total time (avg): {avg_total_time:.2f}s\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "289c97f8",
   "metadata": {},
   "source": [
    "## Going further\n",
    "\n",
    "There are several possible improvements, among others:\n",
    "* Better text extraction from the PDFs: for example with [Docling](https://python.langchain.com/docs/integrations/document_loaders/docling/)\n",
    "* A better way to split the text into chunks: for example with [RecursiveCharacterTextSplitter](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html#recursivecharactertextsplitter), or by changing the chunk size...\n",
    "* A better embedding model: see the embedding [leaderboard](https://huggingface.co/spaces/mteb/leaderboard)\n",
    "* Better retrieval: a better search method, for example [MMR](https://python.langchain.com/v0.2/docs/how_to/example_selectors_mmr/)\n",
    "* Better prompts\n",
    "* A better performance measure: more questions, for example\n",
    "\n",
    "We encourage students to try whichever improvements they wish, and above all to measure each contribution separately. We also encourage you to use your own documents and your own benchmark.\n",
    "To speed up the evaluation a little further, here is an asynchronous version of the evaluation function:"
   ]
  },
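  {
   "cell_type": "markdown",
   "id": "chunking-sketch-md",
   "metadata": {},
   "source": [
    "As an illustration of the chunking improvement listed above, here is a minimal sketch using `RecursiveCharacterTextSplitter`. The `chunk_size` and `chunk_overlap` values are assumptions to tune against your own benchmark, and `documents` stands for the list of `Document` objects loaded earlier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "chunking-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
    "# Splits on paragraphs first, then lines, then words, so chunks tend to\n",
    "# respect the document structure better than a fixed-size character split.\n",
    "recursive_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=1000,   # assumed value, to tune\n",
    "    chunk_overlap=200, # assumed value, to tune\n",
    ")\n",
    "# chunks = recursive_splitter.split_documents(documents)"
   ]
  },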
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "7ae5fd5d",
   "metadata": {},
   "outputs": [],
   "source": [
    "import asyncio\n",
    "from tqdm.asyncio import tqdm_asyncio\n",
    "\n",
    "async def evaluate_rag_async(rag_chain, dataset, judge_chain, max_concurrency=5):\n",
    "    \"\"\"\n",
    "    Async evaluation of a RAG chain against a dataset using a judge LLM.\n",
    "    \"\"\"\n",
    "    results = []\n",
    "    semaphore = asyncio.Semaphore(max_concurrency)\n",
    "\n",
    "    async def process_example(example):\n",
    "        async with semaphore:\n",
    "            rag_start = time.time()\n",
    "            try:\n",
    "                prediction = await rag_chain.ainvoke(example[\"question\"])\n",
    "            except Exception as e:\n",
    "                prediction = {\"answer\": \"\", \"sources\": []}\n",
    "                print(f\"[RAG ERROR] Question: {example['question']} | {e}\")\n",
    "            rag_end = time.time()\n",
    "\n",
    "            judge_input = {\n",
    "                \"question\": example[\"question\"],\n",
    "                \"expected_answer\": example[\"answer\"],\n",
    "                \"predicted_answer\": prediction.get(\"answer\", \"\"),\n",
    "                \"expected_sources\": example[\"sources\"],\n",
    "                \"predicted_sources\": prediction.get(\"sources\", []),\n",
    "            }\n",
    "\n",
    "            judge_start = time.time()\n",
    "            try:\n",
    "                judgment = await judge_chain.ainvoke(judge_input)\n",
    "            except Exception as e:\n",
    "                judgment = {\"answer_correct\": False, \"sources_correct\": False, \"explanation\": f\"Judge error: {e}\"}\n",
    "                print(f\"[JUDGE ERROR] Question: {example['question']} | {e}\")\n",
    "            judge_end = time.time()\n",
    "\n",
    "            results.append({\n",
    "                **judge_input,\n",
    "                **judgment,\n",
    "                \"rag_time\": rag_end - rag_start,\n",
    "                \"judge_time\": judge_end - judge_start,\n",
    "                \"total_time\": judge_end - rag_start\n",
    "            })\n",
    "\n",
    "    tasks = [process_example(example) for example in dataset]\n",
    "    for f in tqdm_asyncio.as_completed(tasks, desc=\"Evaluating RAG\", total=len(dataset)):\n",
    "        await f\n",
    "\n",
    "    return pd.DataFrame(results)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "studies",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}