ArtStudies/M2/Machine Learning/TP_3/2025_TP_3_M2_ISF.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "8750d15b",
   "metadata": {},
   "source": [
    "# Course 3: Machine Learning - Supervised algorithms (1/2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7c08ae5",
   "metadata": {},
   "source": [
    "## Preamble"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec7ecb4b",
   "metadata": {},
   "source": [
    "The objectives of this session (3h) are:\n",
    "* Prepare the modeling datasets (sampling)\n",
    "* Apply a simple supervised model\n",
    "* Build a Machine Learning model (cross-validation and hyperparameter tuning) to solve a regression problem\n",
    "* Analyze the model's performance"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e99c600",
   "metadata": {},
   "source": [
    "## Workspace setup"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1b01045",
   "metadata": {},
   "source": [
    "### Library imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "97d58527",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Plotting\n",
    "import plotly.express as px\n",
    "import plotly.graph_objects as gp\n",
    "import seaborn as sns\n",
    "\n",
    "# Statistics\n",
    "from scipy.stats import chi2_contingency\n",
    "from sklearn import metrics\n",
    "\n",
    "# Machine Learning\n",
    "import sklearn.preprocessing as preproc\n",
    "from sklearn.cluster import KMeans\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.model_selection import KFold, train_test_split\n",
    "from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor\n",
    "\n",
    "# Apply seaborn's default plot theme\n",
    "sns.set()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06153286",
   "metadata": {},
   "source": [
    "### Function definitions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c67db932",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "985e4e97",
   "metadata": {},
   "source": [
    "### Constants"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "c9597b48",
   "metadata": {},
   "outputs": [],
   "source": [
    "input_path = \"./1_inputs\"\n",
    "output_path = \"./2_outputs\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2b035d2",
   "metadata": {},
   "source": [
    "### Data import"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "8051b5f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "path = input_path + \"/base_retraitee.csv\"\n",
    "data_retraitee = pd.read_csv(path, sep=\",\", decimal=\".\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2578ba1",
   "metadata": {},
   "source": [
    "## Supervised algorithm: CART"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aaa0b27d",
   "metadata": {},
   "source": [
    "In this part, the goal is to build a simple model (the CART algorithm) in order to walk through the steps needed to fit a model.\n",
    "We will model the claim cost directly."
   ]
  },
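  {
   "cell_type": "markdown",
   "id": "sketch-cart-md",
   "metadata": {},
   "source": [
    "The steps involved in fitting a CART model can be sketched on synthetic data. This is a minimal illustration, not the model built in this notebook: the data, the `toy_` variable names and the settings (`max_depth=2`, a 25% hold-out) are assumptions chosen for the example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-cart-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "\n",
    "# Synthetic data (hypothetical): the target is a step function of a single feature\n",
    "toy_rng = np.random.default_rng(0)\n",
    "toy_X = toy_rng.uniform(0, 10, size=(200, 1))\n",
    "toy_y = np.where(toy_X[:, 0] < 5, 1000.0, 3000.0)\n",
    "\n",
    "# Hold out 25% of the rows to evaluate the fitted tree\n",
    "toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(\n",
    "    toy_X, toy_y, test_size=0.25, random_state=0\n",
    ")\n",
    "\n",
    "# A shallow regression tree: each leaf predicts the mean target of its training rows\n",
    "toy_tree = DecisionTreeRegressor(max_depth=2, random_state=0)\n",
    "toy_tree.fit(toy_X_train, toy_y_train)\n",
    "\n",
    "# R^2 on the held-out rows\n",
    "print(toy_tree.score(toy_X_test, toy_y_test))"
   ]
  },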
  {
   "cell_type": "markdown",
   "id": "a0458a05",
   "metadata": {},
   "source": [
    "### Building the model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3715c37",
   "metadata": {},
   "source": [
    "The first step is to compute the average cost per claim (the target, or response variable). This is the variable to predict from the explanatory variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "c427a4b8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.microsoft.datawrangler.viewer.v0+json": {
       "columns": [
        {
         "name": "index",
         "rawType": "int64",
         "type": "integer"
        },
        {
         "name": "ANNEE_CTR",
         "rawType": "int64",
         "type": "integer"
        },
        {
         "name": "CONTRAT_ANCIENNETE",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "FREQUENCE_PAIEMENT_COTISATION",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "GROUPE_KM",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "ZONE_RISQUE",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "AGE_ASSURE_PRINCIPAL",
         "rawType": "int64",
         "type": "integer"
        },
        {
         "name": "GENRE",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "DEUXIEME_CONDUCTEUR",
         "rawType": "bool",
         "type": "boolean"
        },
        {
         "name": "ANCIENNETE_PERMIS",
         "rawType": "int64",
         "type": "integer"
        },
        {
         "name": "ANNEE_CONSTRUCTION",
         "rawType": "float64",
         "type": "float"
        },
        {
         "name": "ENERGIE",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "EQUIPEMENT_SECURITE",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "VALEUR_DU_BIEN",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "CM",
         "rawType": "float64",
         "type": "float"
        }
       ],
       "ref": "e76df045-0c83-40e9-a027-c48f278ec1d6",
       "rows": [
        [
         "10",
         "2019",
         "(0,1]",
         "MENSUEL",
         "[0;20000[",
         "C",
         "40",
         "M",
         "False",
         "37",
         "2017.0",
         "ESSENCE",
         "VRAI",
         "[15000;20000[",
         "1072.98"
        ],
        [
         "34",
         "2020",
         "(-1,0]",
         "MENSUEL",
         "[20000;40000[",
         "C",
         "27",
         "M",
         "True",
         "13",
         "2018.0",
         "AUTRE",
         "FAUX",
         "[35000;99999[",
         "3750.0"
        ],
        [
         "36",
         "2019",
         "(-1,0]",
         "MENSUEL",
         "[20000;40000[",
         "L",
         "19",
         "M",
         "False",
         "2",
         "2017.0",
         "ESSENCE",
         "VRAI",
         "[0;10000[",
         "1838.49"
        ],
        [
         "78",
         "2019",
         "(-1,0]",
         "MENSUEL",
         "[20000;40000[",
         "B",
         "40",
         "M",
         "False",
         "45",
         "2018.0",
         "DIESEL",
         "FAUX",
         "[15000;20000[",
         "4892.74"
        ],
        [
         "89",
         "2018",
         "(1,2]",
         "MENSUEL",
         "[20000;40000[",
         "C",
         "20",
         "M",
         "False",
         "11",
         "2014.0",
         "ESSENCE",
         "FAUX",
         "[25000;35000[",
         "166.73"
        ]
       ],
       "shape": {
        "columns": 14,
        "rows": 5
       }
      },
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ANNEE_CTR</th>\n",
       "      <th>CONTRAT_ANCIENNETE</th>\n",
       "      <th>FREQUENCE_PAIEMENT_COTISATION</th>\n",
       "      <th>GROUPE_KM</th>\n",
       "      <th>ZONE_RISQUE</th>\n",
       "      <th>AGE_ASSURE_PRINCIPAL</th>\n",
       "      <th>GENRE</th>\n",
       "      <th>DEUXIEME_CONDUCTEUR</th>\n",
       "      <th>ANCIENNETE_PERMIS</th>\n",
       "      <th>ANNEE_CONSTRUCTION</th>\n",
       "      <th>ENERGIE</th>\n",
       "      <th>EQUIPEMENT_SECURITE</th>\n",
       "      <th>VALEUR_DU_BIEN</th>\n",
       "      <th>CM</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2019</td>\n",
       "      <td>(0,1]</td>\n",
       "      <td>MENSUEL</td>\n",
       "      <td>[0;20000[</td>\n",
       "      <td>C</td>\n",
       "      <td>40</td>\n",
       "      <td>M</td>\n",
       "      <td>False</td>\n",
       "      <td>37</td>\n",
       "      <td>2017.0</td>\n",
       "      <td>ESSENCE</td>\n",
       "      <td>VRAI</td>\n",
       "      <td>[15000;20000[</td>\n",
       "      <td>1072.98</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>2020</td>\n",
       "      <td>(-1,0]</td>\n",
       "      <td>MENSUEL</td>\n",
       "      <td>[20000;40000[</td>\n",
       "      <td>C</td>\n",
       "      <td>27</td>\n",
       "      <td>M</td>\n",
       "      <td>True</td>\n",
       "      <td>13</td>\n",
       "      <td>2018.0</td>\n",
       "      <td>AUTRE</td>\n",
       "      <td>FAUX</td>\n",
       "      <td>[35000;99999[</td>\n",
       "      <td>3750.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36</th>\n",
       "      <td>2019</td>\n",
       "      <td>(-1,0]</td>\n",
       "      <td>MENSUEL</td>\n",
       "      <td>[20000;40000[</td>\n",
       "      <td>L</td>\n",
       "      <td>19</td>\n",
       "      <td>M</td>\n",
       "      <td>False</td>\n",
       "      <td>2</td>\n",
       "      <td>2017.0</td>\n",
       "      <td>ESSENCE</td>\n",
       "      <td>VRAI</td>\n",
       "      <td>[0;10000[</td>\n",
       "      <td>1838.49</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>78</th>\n",
       "      <td>2019</td>\n",
       "      <td>(-1,0]</td>\n",
       "      <td>MENSUEL</td>\n",
       "      <td>[20000;40000[</td>\n",
       "      <td>B</td>\n",
       "      <td>40</td>\n",
       "      <td>M</td>\n",
       "      <td>False</td>\n",
       "      <td>45</td>\n",
       "      <td>2018.0</td>\n",
       "      <td>DIESEL</td>\n",
       "      <td>FAUX</td>\n",
       "      <td>[15000;20000[</td>\n",
       "      <td>4892.74</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>89</th>\n",
       "      <td>2018</td>\n",
       "      <td>(1,2]</td>\n",
       "      <td>MENSUEL</td>\n",
       "      <td>[20000;40000[</td>\n",
       "      <td>C</td>\n",
       "      <td>20</td>\n",
       "      <td>M</td>\n",
       "      <td>False</td>\n",
       "      <td>11</td>\n",
       "      <td>2014.0</td>\n",
       "      <td>ESSENCE</td>\n",
       "      <td>FAUX</td>\n",
       "      <td>[25000;35000[</td>\n",
       "      <td>166.73</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    ANNEE_CTR CONTRAT_ANCIENNETE FREQUENCE_PAIEMENT_COTISATION      GROUPE_KM  \\\n",
       "10       2019              (0,1]                       MENSUEL      [0;20000[   \n",
       "34       2020             (-1,0]                       MENSUEL  [20000;40000[   \n",
       "36       2019             (-1,0]                       MENSUEL  [20000;40000[   \n",
       "78       2019             (-1,0]                       MENSUEL  [20000;40000[   \n",
       "89       2018              (1,2]                       MENSUEL  [20000;40000[   \n",
       "\n",
       "   ZONE_RISQUE  AGE_ASSURE_PRINCIPAL GENRE  DEUXIEME_CONDUCTEUR  \\\n",
       "10           C                    40     M                False   \n",
       "34           C                    27     M                 True   \n",
       "36           L                    19     M                False   \n",
       "78           B                    40     M                False   \n",
       "89           C                    20     M                False   \n",
       "\n",
       "    ANCIENNETE_PERMIS  ANNEE_CONSTRUCTION  ENERGIE EQUIPEMENT_SECURITE  \\\n",
       "10                 37              2017.0  ESSENCE                VRAI   \n",
       "34                 13              2018.0    AUTRE                FAUX   \n",
       "36                  2              2017.0  ESSENCE                VRAI   \n",
       "78                 45              2018.0   DIESEL                FAUX   \n",
       "89                 11              2014.0  ESSENCE                FAUX   \n",
       "\n",
       "   VALEUR_DU_BIEN       CM  \n",
       "10  [15000;20000[  1072.98  \n",
       "34  [35000;99999[  3750.00  \n",
       "36      [0;10000[  1838.49  \n",
       "78  [15000;20000[  4892.74  \n",
       "89  [25000;35000[   166.73  "
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_model = data_retraitee.copy()\n",
    "\n",
    "# Keep only the rows with at least one claim (NB > 0)\n",
    "data_model = data_model[data_model[\"NB\"] > 0]\n",
    "\n",
    "# Compute the \"theoretical\" average claim cost: CHARGE divided by the number of claims NB\n",
    "data_model[\"CM\"] = data_model[\"CHARGE\"] / data_model[\"NB\"]\n",
    "data_model = data_model.drop([\"CHARGE\", \"NB\", \"EXPO\"], axis=1)\n",
    "data_model.head()"
   ]
  },
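  {
   "cell_type": "markdown",
   "id": "sketch-cm-md",
   "metadata": {},
   "source": [
    "The average-cost computation above can be checked by hand on a toy table. This is a minimal sketch, not the course dataset: the values and the `toy_` names are made up, and it assumes that `CHARGE` holds the total claim cost of a row and `NB` its number of claims. Rows without a claim are dropped first because the ratio is undefined when `NB` is 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-cm-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data (not the course dataset)\n",
    "toy_claims = pd.DataFrame({\"NB\": [0, 1, 2], \"CHARGE\": [0.0, 1200.0, 900.0]})\n",
    "\n",
    "# Keep the rows with at least one claim, then divide total cost by claim count\n",
    "toy_with_claims = toy_claims[toy_claims[\"NB\"] > 0]\n",
    "toy_cm = toy_with_claims[\"CHARGE\"] / toy_with_claims[\"NB\"]\n",
    "print(toy_cm.tolist())  # [1200.0, 450.0]"
   ]
  },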
  {
   "cell_type": "markdown",
   "id": "e3e85088",
   "metadata": {},
   "source": [
    "**Exercise:** compute the descriptive statistics of the dataset used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "c8fd3ee1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.microsoft.datawrangler.viewer.v0+json": {
       "columns": [
        {
         "name": "index",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "ANNEE_CTR",
         "rawType": "float64",
         "type": "float"
        },
        {
         "name": "CONTRAT_ANCIENNETE",
         "rawType": "object",
         "type": "unknown"
        },
        {
         "name": "FREQUENCE_PAIEMENT_COTISATION",
         "rawType": "object",
         "type": "unknown"
        },
        {
         "name": "GROUPE_KM",
         "rawType": "object",
         "type": "unknown"
        },
        {
         "name": "ZONE_RISQUE",
         "rawType": "object",
         "type": "unknown"
        },
        {
         "name": "AGE_ASSURE_PRINCIPAL",
         "rawType": "float64",
         "type": "float"
        },
        {
         "name": "GENRE",
         "rawType": "object",
         "type": "unknown"
        },
        {
         "name": "DEUXIEME_CONDUCTEUR",
         "rawType": "object",
         "type": "unknown"
        },
        {
         "name": "ANCIENNETE_PERMIS",
         "rawType": "float64",
         "type": "float"
        },
        {
         "name": "ANNEE_CONSTRUCTION",
         "rawType": "float64",
         "type": "float"
        },
        {
         "name": "ENERGIE",
         "rawType": "object",
         "type": "unknown"
        },
        {
         "name": "EQUIPEMENT_SECURITE",
         "rawType": "object",
         "type": "unknown"
        },
        {
         "name": "VALEUR_DU_BIEN",
         "rawType": "object",
         "type": "unknown"
        },
        {
         "name": "CM",
         "rawType": "float64",
         "type": "float"
        }
       ],
       "ref": "b2f9efdd-d035-4c51-9797-2e202b404c15",
       "rows": [
        [
         "count",
         "824.0",
         "824",
         "824",
         "824",
         "824",
         "824.0",
         "824",
         "824",
         "824.0",
         "824.0",
         "824",
         "824",
         "824",
         "824.0"
        ],
        [
         "unique",
         null,
         "5",
         "3",
         "4",
         "14",
         null,
         "2",
         "2",
         null,
         null,
         "3",
         "2",
         "6",
         null
        ],
        [
         "top",
         null,
         "(0,1]",
         "MENSUEL",
         "[0;20000[",
         "C",
         null,
         "M",
         "False",
         null,
         null,
         "ESSENCE",
         "FAUX",
         "[10000;15000[",
         null
        ],
        [
         "freq",
         null,
         "297",
         "398",
         "391",
         "269",
         null,
         "483",
         "663",
         null,
         null,
         "413",
         "517",
         "213",
         null
        ],
        [
         "mean",
         "2018.384708737864",
         null,
         null,
         null,
         null,
         "44.383495145631066",
         null,
         null,
         "35.68810679611651",
         "2015.2123786407767",
         null,
         null,
         null,
         "4246.01697815534"
        ],
        [
         "std",
         "1.515832735580178",
         null,
         null,
         null,
         null,
         "13.808216667998865",
         null,
         null,
         "19.370620845496358",
         "3.1637823115731556",
         null,
         null,
         null,
         "6869.61691660173"
        ],
        [
         "min",
         "2016.0",
         null,
         null,
         null,
         null,
         "19.0",
         null,
         null,
         "1.0",
         "1998.0",
         null,
         null,
         null,
         "7.5"
        ],
        [
         "25%",
         "2017.0",
         null,
         null,
         null,
         null,
         "34.0",
         null,
         null,
         "18.0",
         "2014.0",
         null,
         null,
         null,
         "1159.96125"
        ],
        [
         "50%",
         "2018.0",
         null,
         null,
         null,
         null,
         "43.0",
         null,
         null,
         "35.0",
         "2016.0",
         null,
         null,
         null,
         "2541.6499999999996"
        ],
        [
         "75%",
         "2020.0",
         null,
         null,
         null,
         null,
         "53.0",
         null,
         null,
         "53.0",
         "2017.0",
         null,
         null,
         null,
         "4193.797500000001"
        ],
        [
         "max",
         "2021.0",
         null,
         null,
         null,
         null,
         "94.0",
         null,
         null,
         "70.0",
         "2021.0",
         null,
         null,
         null,
         "83421.85"
        ]
       ],
       "shape": {
        "columns": 14,
        "rows": 11
       }
      },
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ANNEE_CTR</th>\n",
       "      <th>CONTRAT_ANCIENNETE</th>\n",
       "      <th>FREQUENCE_PAIEMENT_COTISATION</th>\n",
       "      <th>GROUPE_KM</th>\n",
       "      <th>ZONE_RISQUE</th>\n",
       "      <th>AGE_ASSURE_PRINCIPAL</th>\n",
       "      <th>GENRE</th>\n",
       "      <th>DEUXIEME_CONDUCTEUR</th>\n",
       "      <th>ANCIENNETE_PERMIS</th>\n",
       "      <th>ANNEE_CONSTRUCTION</th>\n",
       "      <th>ENERGIE</th>\n",
       "      <th>EQUIPEMENT_SECURITE</th>\n",
       "      <th>VALEUR_DU_BIEN</th>\n",
       "      <th>CM</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>824.000000</td>\n",
       "      <td>824</td>\n",
       "      <td>824</td>\n",
       "      <td>824</td>\n",
       "      <td>824</td>\n",
       "      <td>824.000000</td>\n",
       "      <td>824</td>\n",
       "      <td>824</td>\n",
       "      <td>824.000000</td>\n",
       "      <td>824.000000</td>\n",
       "      <td>824</td>\n",
       "      <td>824</td>\n",
       "      <td>824</td>\n",
       "      <td>824.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unique</th>\n",
       "      <td>NaN</td>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>14</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top</th>\n",
       "      <td>NaN</td>\n",
       "      <td>(0,1]</td>\n",
       "      <td>MENSUEL</td>\n",
       "      <td>[0;20000[</td>\n",
       "      <td>C</td>\n",
       "      <td>NaN</td>\n",
       "      <td>M</td>\n",
       "      <td>False</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>ESSENCE</td>\n",
       "      <td>FAUX</td>\n",
       "      <td>[10000;15000[</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>freq</th>\n",
       "      <td>NaN</td>\n",
       "      <td>297</td>\n",
       "      <td>398</td>\n",
       "      <td>391</td>\n",
       "      <td>269</td>\n",
       "      <td>NaN</td>\n",
       "      <td>483</td>\n",
       "      <td>663</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>413</td>\n",
       "      <td>517</td>\n",
       "      <td>213</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>2018.384709</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>44.383495</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>35.688107</td>\n",
       "      <td>2015.212379</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4246.016978</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>1.515833</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>13.808217</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>19.370621</td>\n",
       "      <td>3.163782</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6869.616917</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>2016.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>19.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1998.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>2017.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>34.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>18.000000</td>\n",
       "      <td>2014.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1159.961250</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>2018.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>43.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>35.000000</td>\n",
       "      <td>2016.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2541.650000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>2020.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>53.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>53.000000</td>\n",
       "      <td>2017.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4193.797500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>2021.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>94.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>70.000000</td>\n",
       "      <td>2021.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>83421.850000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          ANNEE_CTR CONTRAT_ANCIENNETE FREQUENCE_PAIEMENT_COTISATION  \\\n",
       "count    824.000000                824                           824   \n",
       "unique          NaN                  5                             3   \n",
       "top             NaN              (0,1]                       MENSUEL   \n",
       "freq            NaN                297                           398   \n",
       "mean    2018.384709                NaN                           NaN   \n",
       "std        1.515833                NaN                           NaN   \n",
       "min     2016.000000                NaN                           NaN   \n",
       "25%     2017.000000                NaN                           NaN   \n",
       "50%     2018.000000                NaN                           NaN   \n",
       "75%     2020.000000                NaN                           NaN   \n",
       "max     2021.000000                NaN                           NaN   \n",
       "\n",
       "        GROUPE_KM ZONE_RISQUE  AGE_ASSURE_PRINCIPAL GENRE DEUXIEME_CONDUCTEUR  \\\n",
       "count         824         824            824.000000   824                 824   \n",
       "unique          4          14                   NaN     2                   2   \n",
       "top     [0;20000[           C                   NaN     M               False   \n",
       "freq          391         269                   NaN   483                 663   \n",
       "mean          NaN         NaN             44.383495   NaN                 NaN   \n",
       "std           NaN         NaN             13.808217   NaN                 NaN   \n",
       "min           NaN         NaN             19.000000   NaN                 NaN   \n",
       "25%           NaN         NaN             34.000000   NaN                 NaN   \n",
       "50%           NaN         NaN             43.000000   NaN                 NaN   \n",
       "75%           NaN         NaN             53.000000   NaN                 NaN   \n",
       "max           NaN         NaN             94.000000   NaN                 NaN   \n",
       "\n",
       "        ANCIENNETE_PERMIS  ANNEE_CONSTRUCTION  ENERGIE EQUIPEMENT_SECURITE  \\\n",
       "count          824.000000          824.000000      824                 824   \n",
       "unique                NaN                 NaN        3                   2   \n",
       "top                   NaN                 NaN  ESSENCE                FAUX   \n",
       "freq                  NaN                 NaN      413                 517   \n",
       "mean            35.688107         2015.212379      NaN                 NaN   \n",
       "std             19.370621            3.163782      NaN                 NaN   \n",
       "min              1.000000         1998.000000      NaN                 NaN   \n",
       "25%             18.000000         2014.000000      NaN                 NaN   \n",
       "50%             35.000000         2016.000000      NaN                 NaN   \n",
       "75%             53.000000         2017.000000      NaN                 NaN   \n",
       "max             70.000000         2021.000000      NaN                 NaN   \n",
       "\n",
       "       VALEUR_DU_BIEN            CM  \n",
       "count             824    824.000000  \n",
       "unique              6           NaN  \n",
       "top     [10000;15000[           NaN  \n",
       "freq              213           NaN  \n",
       "mean              NaN   4246.016978  \n",
       "std               NaN   6869.616917  \n",
       "min               NaN      7.500000  \n",
       "25%               NaN   1159.961250  \n",
       "50%               NaN   2541.650000  \n",
       "75%               NaN   4193.797500  \n",
       "max               NaN  83421.850000  "
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_model.describe(include='all')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92d6156a",
   "metadata": {},
   "source": [
    "#### Correlation analysis among the explanatory variables"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d7327570",
   "metadata": {},
   "source": [
    "**Question:** In your opinion, why should we look at the correlation between variables?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "475e141b",
   "metadata": {},
   "source": [
    "*Answer*: To obtain a better-fitting model, to flag a potential relationship between the features and the target (correlation alone does not establish causality), and to select variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "1b156435",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_set = data_model.drop(\"CM\", axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "0ef0fcc0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split the columns into numerical and categorical variables\n",
    "variables_na = []\n",
    "variables_numeriques = []\n",
    "variables_01 = []\n",
    "variables_categorielles = []\n",
    "for colu in data_set.columns:\n",
    "    if data_set[colu].isna().any():\n",
    "        # Columns with missing values are set aside\n",
    "        variables_na.append(data_set[colu])\n",
    "    elif (\n",
    "        str(data_set[colu].dtypes) in [\"int32\", \"int64\", \"float64\"]\n",
    "        and data_set[colu].nunique() != 2\n",
    "    ):\n",
    "        variables_numeriques.append(data_set[colu])\n",
    "    else:\n",
    "        # Non-numeric columns and binary numeric columns are treated as categorical\n",
    "        variables_categorielles.append(data_set[colu])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e82fcade",
   "metadata": {},
   "source": [
    "##### Corrélation des variables catégorielles :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "e130aae5",
   "metadata": {},
   "outputs": [],
   "source": [
    "vars_categorielles = pd.DataFrame(variables_categorielles).transpose()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c39e2ad0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.plotly.v1+json": {
       "config": {
        "plotlyServerURL": "https://plot.ly"
       },
       "data": [
        {
         "coloraxis": "coloraxis",
         "hovertemplate": "x: %{x}<br>y: %{y}<br>color: %{z}<extra></extra>",
         "name": "0",
         "texttemplate": "%{z:.2f}",
         "type": "heatmap",
         "x": [
          "CONTRAT_ANCIENNETE",
          "FREQUENCE_PAIEMENT_COTISATION",
          "GROUPE_KM",
          "ZONE_RISQUE",
          "GENRE",
          "DEUXIEME_CONDUCTEUR",
          "ENERGIE",
          "EQUIPEMENT_SECURITE",
          "VALEUR_DU_BIEN"
         ],
         "xaxis": "x",
         "y": [
          "CONTRAT_ANCIENNETE",
          "FREQUENCE_PAIEMENT_COTISATION",
          "GROUPE_KM",
          "ZONE_RISQUE",
          "GENRE",
          "DEUXIEME_CONDUCTEUR",
          "ENERGIE",
          "EQUIPEMENT_SECURITE",
          "VALEUR_DU_BIEN"
         ],
         "yaxis": "y",
         "z": {
          "bdata": "AAAAAAAA8D8AAAAAAAAAACoCGzzITrA/jS6+t390sj/aAKYMJa2eP5RMqUS3uZs/ytNpsBVXkz8AAAAAAAAAAJsekiMPM4I/AAAAAAAAAAAAAAAAAADwPwAAAAAAAAAAAAAAAAAAAABgNwyfFOK3Px3tLvtk1qI/VTS7w965nj/DbHQwNU6sP6xOyIjBVMQ/KwIbPMhOsD8AAAAAAAAAAAAAAAAAAPA/JGwWgOwjwz/Y12crRVC2P1AU8aUpk3Y/tZ25v8HgyT9++YWBDBq6PxMKBP1KAMk/ki6+t390sj8AAAAAAAAAACNsFoDsI8M/AAAAAAAA8D8AAAAAAAAAAOzpAHMW1bU/OToUIB5twT+gpoD1ZjrEP/5ATjN+vpg/0gCmDCWtnj9gNwyfFOK3P9jXZytFULY/AAAAAAAAAAAAAAAAAADwPwAAAAAAAAAA2p0N4q1bwz/UsLoqS0u5PxFqf8IHB9E/lEypRLe5mz8d7S77ZNaiP1AU8aUpk3Y/7OkAcxbVtT8AAAAAAAAAAAAAAAAAAPA/AAAAAAAAAAAAAAAAAAAAAOYlMsJ0brs/ytNpsBVXkz9RNLvD3rmeP7edub/B4Mk/OjoUIB5twT/anQ3irVvDPwAAAAAAAAAAAAAAAAAA8D8nEbUEUmnAP+SA2g/TvNE/AAAAAAAAAADDbHQwNU6sP335hYEMGro/oKaA9WY6xD/UsLoqS0u5PwAAAAAAAAAAJxG1BFJpwD8AAAAAAADwP+fmCf6XRco/mx6SIw8zgj+rTsiIwVTEPxIKBP1KAMk//kBOM36+mD8Ran/CBwfRP+YlMsJ0brs/5YDaD9O80T/n5gn+l0XKPwAAAAAAAPA/",
          "dtype": "f8",
          "shape": "9, 9"
         }
        }
       ],
       "layout": {
        "coloraxis": {
         "colorscale": [
          [
           0,
           "rgb(5,48,97)"
          ],
          [
           0.1,
           "rgb(33,102,172)"
          ],
          [
           0.2,
           "rgb(67,147,195)"
          ],
          [
           0.3,
           "rgb(146,197,222)"
          ],
          [
           0.4,
           "rgb(209,229,240)"
          ],
          [
           0.5,
           "rgb(247,247,247)"
          ],
          [
           0.6,
           "rgb(253,219,199)"
          ],
          [
           0.7,
           "rgb(244,165,130)"
          ],
          [
           0.8,
           "rgb(214,96,77)"
          ],
          [
           0.9,
           "rgb(178,24,43)"
          ],
          [
           1,
           "rgb(103,0,31)"
          ]
         ]
        },
        "template": {
         "data": {
          "bar": [
           {
            "error_x": {
             "color": "#2a3f5f"
            },
            "error_y": {
             "color": "#2a3f5f"
            },
            "marker": {
             "line": {
              "color": "#E5ECF6",
              "width": 0.5
             },
             "pattern": {
              "fillmode": "overlay",
              "size": 10,
              "solidity": 0.2
             }
            },
            "type": "bar"
           }
          ],
          "barpolar": [
           {
            "marker": {
             "line": {
              "color": "#E5ECF6",
              "width": 0.5
             },
             "pattern": {
              "fillmode": "overlay",
              "size": 10,
              "solidity": 0.2
             }
            },
            "type": "barpolar"
           }
          ],
          "carpet": [
           {
            "aaxis": {
             "endlinecolor": "#2a3f5f",
             "gridcolor": "white",
             "linecolor": "white",
             "minorgridcolor": "white",
             "startlinecolor": "#2a3f5f"
            },
            "baxis": {
             "endlinecolor": "#2a3f5f",
             "gridcolor": "white",
             "linecolor": "white",
             "minorgridcolor": "white",
             "startlinecolor": "#2a3f5f"
            },
            "type": "carpet"
           }
          ],
          "choropleth": [
           {
            "colorbar": {
             "outlinewidth": 0,
             "ticks": ""
            },
            "type": "choropleth"
           }
          ],
          "contour": [
           {
            "colorbar": {
             "outlinewidth": 0,
             "ticks": ""
            },
            "colorscale": [
             [
              0,
              "#0d0887"
             ],
             [
              0.1111111111111111,
              "#46039f"
             ],
             [
              0.2222222222222222,
              "#7201a8"
             ],
             [
              0.3333333333333333,
              "#9c179e"
             ],
             [
              0.4444444444444444,
              "#bd3786"
             ],
             [
              0.5555555555555556,
              "#d8576b"
             ],
             [
              0.6666666666666666,
              "#ed7953"
             ],
             [
              0.7777777777777778,
              "#fb9f3a"
             ],
             [
              0.8888888888888888,
              "#fdca26"
             ],
             [
              1,
              "#f0f921"
             ]
            ],
            "type": "contour"
           }
          ],
          "contourcarpet": [
           {
            "colorbar": {
             "outlinewidth": 0,
             "ticks": ""
            },
            "type": "contourcarpet"
           }
          ],
          "heatmap": [
           {
            "colorbar": {
             "outlinewidth": 0,
             "ticks": ""
            },
            "colorscale": [
             [
              0,
              "#0d0887"
             ],
             [
              0.1111111111111111,
              "#46039f"
             ],
             [
              0.2222222222222222,
              "#7201a8"
             ],
             [
              0.3333333333333333,
              "#9c179e"
             ],
             [
              0.4444444444444444,
              "#bd3786"
             ],
             [
              0.5555555555555556,
              "#d8576b"
             ],
             [
              0.6666666666666666,
              "#ed7953"
             ],
             [
              0.7777777777777778,
              "#fb9f3a"
             ],
             [
              0.8888888888888888,
              "#fdca26"
             ],
             [
              1,
              "#f0f921"
             ]
            ],
            "type": "heatmap"
           }
          ],
          "histogram": [
           {
            "marker": {
             "pattern": {
              "fillmode": "overlay",
              "size": 10,
              "solidity": 0.2
             }
            },
            "type": "histogram"
           }
          ],
          "histogram2d": [
           {
            "colorbar": {
             "outlinewidth": 0,
             "ticks": ""
            },
            "colorscale": [
             [
              0,
              "#0d0887"
             ],
             [
              0.1111111111111111,
              "#46039f"
             ],
             [
              0.2222222222222222,
              "#7201a8"
             ],
             [
              0.3333333333333333,
              "#9c179e"
             ],
             [
              0.4444444444444444,
              "#bd3786"
             ],
             [
              0.5555555555555556,
              "#d8576b"
             ],
             [
              0.6666666666666666,
              "#ed7953"
             ],
             [
              0.7777777777777778,
              "#fb9f3a"
             ],
             [
              0.8888888888888888,
              "#fdca26"
             ],
             [
              1,
              "#f0f921"
             ]
            ],
            "type": "histogram2d"
           }
          ],
          "histogram2dcontour": [
           {
            "colorbar": {
             "outlinewidth": 0,
             "ticks": ""
            },
            "colorscale": [
             [
              0,
              "#0d0887"
             ],
             [
              0.1111111111111111,
              "#46039f"
             ],
             [
              0.2222222222222222,
              "#7201a8"
             ],
             [
              0.3333333333333333,
              "#9c179e"
             ],
             [
              0.4444444444444444,
              "#bd3786"
             ],
             [
              0.5555555555555556,
              "#d8576b"
             ],
             [
              0.6666666666666666,
              "#ed7953"
             ],
             [
              0.7777777777777778,
              "#fb9f3a"
             ],
             [
              0.8888888888888888,
              "#fdca26"
             ],
             [
              1,
              "#f0f921"
             ]
            ],
            "type": "histogram2dcontour"
           }
          ],
          "mesh3d": [
           {
            "colorbar": {
             "outlinewidth": 0,
             "ticks": ""
            },
            "type": "mesh3d"
           }
          ],
          "parcoords": [
           {
            "line": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "type": "parcoords"
           }
          ],
          "pie": [
           {
            "automargin": true,
            "type": "pie"
           }
          ],
          "scatter": [
           {
            "fillpattern": {
             "fillmode": "overlay",
             "size": 10,
             "solidity": 0.2
            },
            "type": "scatter"
           }
          ],
          "scatter3d": [
           {
            "line": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "marker": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "type": "scatter3d"
           }
          ],
          "scattercarpet": [
           {
            "marker": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "type": "scattercarpet"
           }
          ],
          "scattergeo": [
           {
            "marker": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "type": "scattergeo"
           }
          ],
          "scattergl": [
           {
            "marker": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "type": "scattergl"
           }
          ],
          "scattermap": [
           {
            "marker": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "type": "scattermap"
           }
          ],
          "scattermapbox": [
           {
            "marker": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "type": "scattermapbox"
           }
          ],
          "scatterpolar": [
           {
            "marker": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "type": "scatterpolar"
           }
          ],
          "scatterpolargl": [
           {
            "marker": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "type": "scatterpolargl"
           }
          ],
          "scatterternary": [
           {
            "marker": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "type": "scatterternary"
           }
          ],
          "surface": [
           {
            "colorbar": {
             "outlinewidth": 0,
             "ticks": ""
            },
            "colorscale": [
             [
              0,
              "#0d0887"
             ],
             [
              0.1111111111111111,
              "#46039f"
             ],
             [
              0.2222222222222222,
              "#7201a8"
             ],
             [
              0.3333333333333333,
              "#9c179e"
             ],
             [
              0.4444444444444444,
              "#bd3786"
             ],
             [
              0.5555555555555556,
              "#d8576b"
             ],
             [
              0.6666666666666666,
              "#ed7953"
             ],
             [
              0.7777777777777778,
              "#fb9f3a"
             ],
             [
              0.8888888888888888,
              "#fdca26"
             ],
             [
              1,
              "#f0f921"
             ]
            ],
            "type": "surface"
           }
          ],
          "table": [
           {
            "cells": {
             "fill": {
              "color": "#EBF0F8"
             },
             "line": {
              "color": "white"
             }
            },
            "header": {
             "fill": {
              "color": "#C8D4E3"
             },
             "line": {
              "color": "white"
             }
            },
            "type": "table"
           }
          ]
         },
         "layout": {
          "annotationdefaults": {
           "arrowcolor": "#2a3f5f",
           "arrowhead": 0,
           "arrowwidth": 1
          },
          "autotypenumbers": "strict",
          "coloraxis": {
           "colorbar": {
            "outlinewidth": 0,
            "ticks": ""
           }
          },
          "colorscale": {
           "diverging": [
            [
             0,
             "#8e0152"
            ],
            [
             0.1,
             "#c51b7d"
            ],
            [
             0.2,
             "#de77ae"
            ],
            [
             0.3,
             "#f1b6da"
            ],
            [
             0.4,
             "#fde0ef"
            ],
            [
             0.5,
             "#f7f7f7"
            ],
            [
             0.6,
             "#e6f5d0"
            ],
            [
             0.7,
             "#b8e186"
            ],
            [
             0.8,
             "#7fbc41"
            ],
            [
             0.9,
             "#4d9221"
            ],
            [
             1,
             "#276419"
            ]
           ],
           "sequential": [
            [
             0,
             "#0d0887"
            ],
            [
             0.1111111111111111,
             "#46039f"
            ],
            [
             0.2222222222222222,
             "#7201a8"
            ],
            [
             0.3333333333333333,
             "#9c179e"
            ],
            [
             0.4444444444444444,
             "#bd3786"
            ],
            [
             0.5555555555555556,
             "#d8576b"
            ],
            [
             0.6666666666666666,
             "#ed7953"
            ],
            [
             0.7777777777777778,
             "#fb9f3a"
            ],
            [
             0.8888888888888888,
             "#fdca26"
            ],
            [
             1,
             "#f0f921"
            ]
           ],
           "sequentialminus": [
            [
             0,
             "#0d0887"
            ],
            [
             0.1111111111111111,
             "#46039f"
            ],
            [
             0.2222222222222222,
             "#7201a8"
            ],
            [
             0.3333333333333333,
             "#9c179e"
            ],
            [
             0.4444444444444444,
             "#bd3786"
            ],
            [
             0.5555555555555556,
             "#d8576b"
            ],
            [
             0.6666666666666666,
             "#ed7953"
            ],
            [
             0.7777777777777778,
             "#fb9f3a"
            ],
            [
             0.8888888888888888,
             "#fdca26"
            ],
            [
             1,
             "#f0f921"
            ]
           ]
          },
          "colorway": [
           "#636efa",
           "#EF553B",
           "#00cc96",
           "#ab63fa",
           "#FFA15A",
           "#19d3f3",
           "#FF6692",
           "#B6E880",
           "#FF97FF",
           "#FECB52"
          ],
          "font": {
           "color": "#2a3f5f"
          },
          "geo": {
           "bgcolor": "white",
           "lakecolor": "white",
           "landcolor": "#E5ECF6",
           "showlakes": true,
           "showland": true,
           "subunitcolor": "white"
          },
          "hoverlabel": {
           "align": "left"
          },
          "hovermode": "closest",
          "mapbox": {
           "style": "light"
          },
          "paper_bgcolor": "white",
          "plot_bgcolor": "#E5ECF6",
          "polar": {
           "angularaxis": {
            "gridcolor": "white",
            "linecolor": "white",
            "ticks": ""
           },
           "bgcolor": "#E5ECF6",
           "radialaxis": {
            "gridcolor": "white",
            "linecolor": "white",
            "ticks": ""
           }
          },
          "scene": {
           "xaxis": {
            "backgroundcolor": "#E5ECF6",
            "gridcolor": "white",
            "gridwidth": 2,
            "linecolor": "white",
            "showbackground": true,
            "ticks": "",
            "zerolinecolor": "white"
           },
           "yaxis": {
            "backgroundcolor": "#E5ECF6",
            "gridcolor": "white",
            "gridwidth": 2,
            "linecolor": "white",
            "showbackground": true,
            "ticks": "",
            "zerolinecolor": "white"
           },
           "zaxis": {
            "backgroundcolor": "#E5ECF6",
            "gridcolor": "white",
            "gridwidth": 2,
            "linecolor": "white",
            "showbackground": true,
            "ticks": "",
            "zerolinecolor": "white"
           }
          },
          "shapedefaults": {
           "line": {
            "color": "#2a3f5f"
           }
          },
          "ternary": {
           "aaxis": {
            "gridcolor": "white",
            "linecolor": "white",
            "ticks": ""
           },
           "baxis": {
            "gridcolor": "white",
            "linecolor": "white",
            "ticks": ""
           },
           "bgcolor": "#E5ECF6",
           "caxis": {
            "gridcolor": "white",
            "linecolor": "white",
            "ticks": ""
           }
          },
          "title": {
           "x": 0.05
          },
          "xaxis": {
           "automargin": true,
           "gridcolor": "white",
           "linecolor": "white",
           "ticks": "",
           "title": {
            "standoff": 15
           },
           "zerolinecolor": "white",
           "zerolinewidth": 2
          },
          "yaxis": {
           "automargin": true,
           "gridcolor": "white",
           "linecolor": "white",
           "ticks": "",
           "title": {
            "standoff": 15
           },
           "zerolinecolor": "white",
           "zerolinewidth": 2
          }
         }
        },
        "title": {
         "text": "Matrice de corrélation des variables catégorielles (V de Cramér)"
        },
        "xaxis": {
         "anchor": "y",
         "domain": [
          0,
          1
         ]
        },
        "yaxis": {
         "anchor": "x",
         "autorange": "reversed",
         "domain": [
          0,
          1
         ]
        }
       }
      }
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Matrice de corrélation pour les variables catégorielles (V de Cramér)\n",
    "def cramers_v(confusion_matrix):\n",
    "    \"\"\"Calcule le V de Cramér à partir d'une matrice de contingence\"\"\"\n",
    "    chi2 = chi2_contingency(confusion_matrix)[0]\n",
    "    n = confusion_matrix.sum().sum()\n",
    "    phi2 = chi2 / n\n",
    "    r, k = confusion_matrix.shape\n",
    "    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))\n",
    "    rcorr = r - ((r-1)**2)/(n-1)\n",
    "    kcorr = k - ((k-1)**2)/(n-1)\n",
    "    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))\n",
    "\n",
    "# Créer la matrice de corrélation\n",
    "categorical_cols = vars_categorielles.columns\n",
    "n_vars = len(categorical_cols)\n",
    "cramers_matrix = np.zeros((n_vars, n_vars))\n",
    "\n",
    "for i, col1 in enumerate(categorical_cols):\n",
    "    for j, col2 in enumerate(categorical_cols):\n",
    "        if i == j:\n",
    "            cramers_matrix[i, j] = 1.0\n",
    "        else:\n",
    "            confusion_matrix = pd.crosstab(vars_categorielles[col1], vars_categorielles[col2])\n",
    "            cramers_matrix[i, j] = cramers_v(confusion_matrix)\n",
    "\n",
    "# Créer le DataFrame de corrélation\n",
    "correlation_cat = pd.DataFrame(cramers_matrix,\n",
    "                               index=categorical_cols,\n",
    "                               columns=categorical_cols)\n",
    "\n",
    "# Visualiser avec Plotly\n",
    "fig = px.imshow(correlation_cat,\n",
    "                text_auto='.2f', # type: ignore\n",
    "                aspect=\"auto\",\n",
    "                color_continuous_scale='RdBu_r',\n",
    "                title='Matrice de corrélation des variables catégorielles (V de Cramér)')\n",
    "fig.show()"
   ]
  },
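  {
   "cell_type": "markdown",
   "id": "b7d4c2e9",
   "metadata": {},
   "source": [
    "For reference, the `cramers_v` function above appears to implement the bias-corrected form of Cramér's V. Read off the code, with $\\chi^2$ the chi-square statistic, $n$ the total count, and $r$, $k$ the numbers of rows and columns of the contingency table, the computation can be written as:\n",
    "\n",
    "$$\\varphi^2 = \\frac{\\chi^2}{n}, \\qquad \\tilde{\\varphi}^2 = \\max\\left(0,\\ \\varphi^2 - \\frac{(k-1)(r-1)}{n-1}\\right)$$\n",
    "\n",
    "$$\\tilde{r} = r - \\frac{(r-1)^2}{n-1}, \\qquad \\tilde{k} = k - \\frac{(k-1)^2}{n-1}, \\qquad V = \\sqrt{\\frac{\\tilde{\\varphi}^2}{\\min(\\tilde{k}-1,\\ \\tilde{r}-1)}}$$\n",
    "\n",
    "The symbols $\\tilde{\\varphi}^2$, $\\tilde{r}$ and $\\tilde{k}$ are notation chosen here for `phi2corr`, `rcorr` and `kcorr`. $V$ lies between 0 (no association) and 1 (perfect association) and, unlike Pearson's coefficient, carries no sign."
   ]
  },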
  {
   "cell_type": "markdown",
   "id": "8f615121",
   "metadata": {},
   "source": [
    "##### Corrélation des variables numériques :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "a16215ab",
   "metadata": {},
   "outputs": [],
   "source": [
    "vars_numeriques = pd.DataFrame(variables_numeriques).transpose()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "532ca6c4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.plotly.v1+json": {
       "config": {
        "plotlyServerURL": "https://plot.ly"
       },
       "data": [
        {
         "coloraxis": "coloraxis",
         "hovertemplate": "x: %{x}<br>y: %{y}<br>color: %{z}<extra></extra>",
         "name": "0",
         "texttemplate": "%{z}",
         "type": "heatmap",
         "x": [
          "ANNEE_CTR",
          "AGE_ASSURE_PRINCIPAL",
          "ANCIENNETE_PERMIS",
          "ANNEE_CONSTRUCTION"
         ],
         "xaxis": "x",
         "y": [
          "ANNEE_CTR",
          "AGE_ASSURE_PRINCIPAL",
          "ANCIENNETE_PERMIS",
          "ANNEE_CONSTRUCTION"
         ],
         "yaxis": "y",
         "z": {
          "bdata": "AAAAAAAA8D+ybZcEUUCbP/CBLCtO46Q/qr2Q49LN2D+ybZcEUUCbPwAAAAAAAPA/slV7SAtP4T84L73yETWgv/CBLCtO46Q/slV7SAtP4T8AAAAAAADwP0I6y25dD6E/qr2Q49LN2D84L73yETWgv0I6y25dD6E/AAAAAAAA8D8=",
          "dtype": "f8",
          "shape": "4, 4"
         }
        }
       ],
       "layout": {
        "coloraxis": {
         "colorscale": [
          [
           0,
           "rgb(5,48,97)"
          ],
          [
           0.1,
           "rgb(33,102,172)"
          ],
          [
           0.2,
           "rgb(67,147,195)"
          ],
          [
           0.3,
           "rgb(146,197,222)"
          ],
          [
           0.4,
           "rgb(209,229,240)"
          ],
          [
           0.5,
           "rgb(247,247,247)"
          ],
          [
           0.6,
           "rgb(253,219,199)"
          ],
          [
           0.7,
           "rgb(244,165,130)"
          ],
          [
           0.8,
           "rgb(214,96,77)"
          ],
          [
           0.9,
           "rgb(178,24,43)"
          ],
          [
           1,
           "rgb(103,0,31)"
          ]
         ]
        },
        "template": {
         "data": {
          "bar": [
           {
            "error_x": {
             "color": "#2a3f5f"
            },
            "error_y": {
             "color": "#2a3f5f"
            },
            "marker": {
             "line": {
              "color": "#E5ECF6",
              "width": 0.5
             },
             "pattern": {
              "fillmode": "overlay",
              "size": 10,
              "solidity": 0.2
             }
            },
            "type": "bar"
           }
          ],
          "barpolar": [
           {
            "marker": {
             "line": {
              "color": "#E5ECF6",
              "width": 0.5
             },
             "pattern": {
              "fillmode": "overlay",
              "size": 10,
              "solidity": 0.2
             }
            },
            "type": "barpolar"
           }
          ],
          "carpet": [
           {
            "aaxis": {
             "endlinecolor": "#2a3f5f",
             "gridcolor": "white",
             "linecolor": "white",
             "minorgridcolor": "white",
             "startlinecolor": "#2a3f5f"
            },
            "baxis": {
             "endlinecolor": "#2a3f5f",
             "gridcolor": "white",
             "linecolor": "white",
             "minorgridcolor": "white",
             "startlinecolor": "#2a3f5f"
            },
            "type": "carpet"
           }
          ],
          "choropleth": [
           {
            "colorbar": {
             "outlinewidth": 0,
             "ticks": ""
            },
            "type": "choropleth"
           }
          ],
          "contour": [
           {
            "colorbar": {
             "outlinewidth": 0,
             "ticks": ""
            },
            "colorscale": [
             [
              0,
              "#0d0887"
             ],
             [
              0.1111111111111111,
              "#46039f"
             ],
             [
              0.2222222222222222,
              "#7201a8"
             ],
             [
              0.3333333333333333,
              "#9c179e"
             ],
             [
              0.4444444444444444,
              "#bd3786"
             ],
             [
              0.5555555555555556,
              "#d8576b"
             ],
             [
              0.6666666666666666,
              "#ed7953"
             ],
             [
              0.7777777777777778,
              "#fb9f3a"
             ],
             [
              0.8888888888888888,
              "#fdca26"
             ],
             [
              1,
              "#f0f921"
             ]
            ],
            "type": "contour"
           }
          ],
          "contourcarpet": [
           {
            "colorbar": {
             "outlinewidth": 0,
             "ticks": ""
            },
            "type": "contourcarpet"
           }
          ],
          "heatmap": [
           {
            "colorbar": {
             "outlinewidth": 0,
             "ticks": ""
            },
            "colorscale": [
             [
              0,
              "#0d0887"
             ],
             [
              0.1111111111111111,
              "#46039f"
             ],
             [
              0.2222222222222222,
              "#7201a8"
             ],
             [
              0.3333333333333333,
              "#9c179e"
             ],
             [
              0.4444444444444444,
              "#bd3786"
             ],
             [
              0.5555555555555556,
              "#d8576b"
             ],
             [
              0.6666666666666666,
              "#ed7953"
             ],
             [
              0.7777777777777778,
              "#fb9f3a"
             ],
             [
              0.8888888888888888,
              "#fdca26"
             ],
             [
              1,
              "#f0f921"
             ]
            ],
            "type": "heatmap"
           }
          ],
          "histogram": [
           {
            "marker": {
             "pattern": {
              "fillmode": "overlay",
              "size": 10,
              "solidity": 0.2
             }
            },
            "type": "histogram"
           }
          ],
          "histogram2d": [
           {
            "colorbar": {
             "outlinewidth": 0,
             "ticks": ""
            },
            "colorscale": [
             [
              0,
              "#0d0887"
             ],
             [
              0.1111111111111111,
              "#46039f"
             ],
             [
              0.2222222222222222,
              "#7201a8"
             ],
             [
              0.3333333333333333,
              "#9c179e"
             ],
             [
              0.4444444444444444,
              "#bd3786"
             ],
             [
              0.5555555555555556,
              "#d8576b"
             ],
             [
              0.6666666666666666,
              "#ed7953"
             ],
             [
              0.7777777777777778,
              "#fb9f3a"
             ],
             [
              0.8888888888888888,
              "#fdca26"
             ],
             [
              1,
              "#f0f921"
             ]
            ],
            "type": "histogram2d"
           }
          ],
          "histogram2dcontour": [
           {
            "colorbar": {
             "outlinewidth": 0,
             "ticks": ""
            },
            "colorscale": [
             [
              0,
              "#0d0887"
             ],
             [
              0.1111111111111111,
              "#46039f"
             ],
             [
              0.2222222222222222,
              "#7201a8"
             ],
             [
              0.3333333333333333,
              "#9c179e"
             ],
             [
              0.4444444444444444,
              "#bd3786"
             ],
             [
              0.5555555555555556,
              "#d8576b"
             ],
             [
              0.6666666666666666,
              "#ed7953"
             ],
             [
              0.7777777777777778,
              "#fb9f3a"
             ],
             [
              0.8888888888888888,
              "#fdca26"
             ],
             [
              1,
              "#f0f921"
             ]
            ],
            "type": "histogram2dcontour"
           }
          ],
          "mesh3d": [
           {
            "colorbar": {
             "outlinewidth": 0,
             "ticks": ""
            },
            "type": "mesh3d"
           }
          ],
          "parcoords": [
           {
            "line": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "type": "parcoords"
           }
          ],
          "pie": [
           {
            "automargin": true,
            "type": "pie"
           }
          ],
          "scatter": [
           {
            "fillpattern": {
             "fillmode": "overlay",
             "size": 10,
             "solidity": 0.2
            },
            "type": "scatter"
           }
          ],
          "scatter3d": [
           {
            "line": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "marker": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "type": "scatter3d"
           }
          ],
          "scattercarpet": [
           {
            "marker": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "type": "scattercarpet"
           }
          ],
          "scattergeo": [
           {
            "marker": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "type": "scattergeo"
           }
          ],
          "scattergl": [
           {
            "marker": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "type": "scattergl"
           }
          ],
          "scattermap": [
           {
            "marker": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "type": "scattermap"
           }
          ],
          "scattermapbox": [
           {
            "marker": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "type": "scattermapbox"
           }
          ],
          "scatterpolar": [
           {
            "marker": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "type": "scatterpolar"
           }
          ],
          "scatterpolargl": [
           {
            "marker": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "type": "scatterpolargl"
           }
          ],
          "scatterternary": [
           {
            "marker": {
             "colorbar": {
              "outlinewidth": 0,
              "ticks": ""
             }
            },
            "type": "scatterternary"
           }
          ],
          "surface": [
           {
            "colorbar": {
             "outlinewidth": 0,
             "ticks": ""
            },
            "colorscale": [
             [
              0,
              "#0d0887"
             ],
             [
              0.1111111111111111,
              "#46039f"
             ],
             [
              0.2222222222222222,
              "#7201a8"
             ],
             [
              0.3333333333333333,
              "#9c179e"
             ],
             [
              0.4444444444444444,
              "#bd3786"
             ],
             [
              0.5555555555555556,
              "#d8576b"
             ],
             [
              0.6666666666666666,
              "#ed7953"
             ],
             [
              0.7777777777777778,
              "#fb9f3a"
             ],
             [
              0.8888888888888888,
              "#fdca26"
             ],
             [
              1,
              "#f0f921"
             ]
            ],
            "type": "surface"
           }
          ],
          "table": [
           {
            "cells": {
             "fill": {
              "color": "#EBF0F8"
             },
             "line": {
              "color": "white"
             }
            },
            "header": {
             "fill": {
              "color": "#C8D4E3"
             },
             "line": {
              "color": "white"
             }
            },
            "type": "table"
           }
          ]
         },
         "layout": {
          "annotationdefaults": {
           "arrowcolor": "#2a3f5f",
           "arrowhead": 0,
           "arrowwidth": 1
          },
          "autotypenumbers": "strict",
          "coloraxis": {
           "colorbar": {
            "outlinewidth": 0,
            "ticks": ""
           }
          },
          "colorscale": {
           "diverging": [
            [
             0,
             "#8e0152"
            ],
            [
             0.1,
             "#c51b7d"
            ],
            [
             0.2,
             "#de77ae"
            ],
            [
             0.3,
             "#f1b6da"
            ],
            [
             0.4,
             "#fde0ef"
            ],
            [
             0.5,
             "#f7f7f7"
            ],
            [
             0.6,
             "#e6f5d0"
            ],
            [
             0.7,
             "#b8e186"
            ],
            [
             0.8,
             "#7fbc41"
            ],
            [
             0.9,
             "#4d9221"
            ],
            [
             1,
             "#276419"
            ]
           ],
           "sequential": [
            [
             0,
             "#0d0887"
            ],
            [
             0.1111111111111111,
             "#46039f"
            ],
            [
             0.2222222222222222,
             "#7201a8"
            ],
            [
             0.3333333333333333,
             "#9c179e"
            ],
            [
             0.4444444444444444,
             "#bd3786"
            ],
            [
             0.5555555555555556,
             "#d8576b"
            ],
            [
             0.6666666666666666,
             "#ed7953"
            ],
            [
             0.7777777777777778,
             "#fb9f3a"
            ],
            [
             0.8888888888888888,
             "#fdca26"
            ],
            [
             1,
             "#f0f921"
            ]
           ],
           "sequentialminus": [
            [
             0,
             "#0d0887"
            ],
            [
             0.1111111111111111,
             "#46039f"
            ],
            [
             0.2222222222222222,
             "#7201a8"
            ],
            [
             0.3333333333333333,
             "#9c179e"
            ],
            [
             0.4444444444444444,
             "#bd3786"
            ],
            [
             0.5555555555555556,
             "#d8576b"
            ],
            [
             0.6666666666666666,
             "#ed7953"
            ],
            [
             0.7777777777777778,
             "#fb9f3a"
            ],
            [
             0.8888888888888888,
             "#fdca26"
            ],
            [
             1,
             "#f0f921"
            ]
           ]
          },
          "colorway": [
           "#636efa",
           "#EF553B",
           "#00cc96",
           "#ab63fa",
           "#FFA15A",
           "#19d3f3",
           "#FF6692",
           "#B6E880",
           "#FF97FF",
           "#FECB52"
          ],
          "font": {
           "color": "#2a3f5f"
          },
          "geo": {
           "bgcolor": "white",
           "lakecolor": "white",
           "landcolor": "#E5ECF6",
           "showlakes": true,
           "showland": true,
           "subunitcolor": "white"
          },
          "hoverlabel": {
           "align": "left"
          },
          "hovermode": "closest",
          "mapbox": {
           "style": "light"
          },
          "paper_bgcolor": "white",
          "plot_bgcolor": "#E5ECF6",
          "polar": {
           "angularaxis": {
            "gridcolor": "white",
            "linecolor": "white",
            "ticks": ""
           },
           "bgcolor": "#E5ECF6",
           "radialaxis": {
            "gridcolor": "white",
            "linecolor": "white",
            "ticks": ""
           }
          },
          "scene": {
           "xaxis": {
            "backgroundcolor": "#E5ECF6",
            "gridcolor": "white",
            "gridwidth": 2,
            "linecolor": "white",
            "showbackground": true,
            "ticks": "",
            "zerolinecolor": "white"
           },
           "yaxis": {
            "backgroundcolor": "#E5ECF6",
            "gridcolor": "white",
            "gridwidth": 2,
            "linecolor": "white",
            "showbackground": true,
            "ticks": "",
            "zerolinecolor": "white"
           },
           "zaxis": {
            "backgroundcolor": "#E5ECF6",
            "gridcolor": "white",
            "gridwidth": 2,
            "linecolor": "white",
            "showbackground": true,
            "ticks": "",
            "zerolinecolor": "white"
           }
          },
          "shapedefaults": {
           "line": {
            "color": "#2a3f5f"
           }
          },
          "ternary": {
           "aaxis": {
            "gridcolor": "white",
            "linecolor": "white",
            "ticks": ""
           },
           "baxis": {
            "gridcolor": "white",
            "linecolor": "white",
            "ticks": ""
           },
           "bgcolor": "#E5ECF6",
           "caxis": {
            "gridcolor": "white",
            "linecolor": "white",
            "ticks": ""
           }
          },
          "title": {
           "x": 0.05
          },
          "xaxis": {
           "automargin": true,
           "gridcolor": "white",
           "linecolor": "white",
           "ticks": "",
           "title": {
            "standoff": 15
           },
           "zerolinecolor": "white",
           "zerolinewidth": 2
          },
          "yaxis": {
           "automargin": true,
           "gridcolor": "white",
           "linecolor": "white",
           "ticks": "",
           "title": {
            "standoff": 15
           },
           "zerolinecolor": "white",
           "zerolinewidth": 2
          }
         }
        },
        "title": {
         "text": "Correlation matrix of the numeric variables"
        },
        "xaxis": {
         "anchor": "y",
         "domain": [
          0,
          1
         ]
        },
        "yaxis": {
         "anchor": "x",
         "autorange": "reversed",
         "domain": [
          0,
          1
         ]
        }
       }
      }
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Compute the correlation matrix once, then display it as a heatmap\n",
    "corr_matrix = vars_numeriques.corr()\n",
    "fig = px.imshow(corr_matrix,\n",
    "                text_auto=True,\n",
    "                aspect=\"auto\",\n",
    "                color_continuous_scale='RdBu_r',\n",
    "                title='Correlation matrix of the numeric variables')\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "98c7dba6",
   "metadata": {},
   "source": [
    "**Question:** what do you observe?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "212209ec",
   "metadata": {},
   "source": [
    "#### Preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65aca700",
   "metadata": {},
   "source": [
    "Two steps are needed before training a model; together they are known as *preprocessing*:\n",
    "\n",
    "* Most models in the \"sklearn\" library only accept numeric inputs, so categorical variables must be converted to numeric ones. The technique used here is *One Hot Encoding*.\n",
    "* The numeric variables must be standardized (zero mean, unit variance)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95f5cc9f",
   "metadata": {},
   "source": [
    "**Exercise:** write a code snippet that one-hot encodes the categorical variables. You can use the \"preproc.OneHotEncoder\" class from the sklearn library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "b8530717",
   "metadata": {},
   "outputs": [],
   "source": [
    "# drop='first' removes one level per variable to avoid perfectly collinear columns\n",
    "encoder = preproc.OneHotEncoder(sparse_output=False, drop='first')\n",
    "vars_categorielles_enc = encoder.fit_transform(vars_categorielles)\n",
    "# Rebuild a DataFrame with readable column names\n",
    "vars_categorielles_enc = pd.DataFrame(vars_categorielles_enc, columns=encoder.get_feature_names_out()) # type: ignore"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b70abc5c",
   "metadata": {},
   "source": [
    "**Exercise:** write a code snippet that standardizes the numeric variables in the dataset. You can use the \"preproc.StandardScaler\" class from the sklearn library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "4ff3847d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Standardize each numeric variable to zero mean and unit variance\n",
    "scaler = preproc.StandardScaler()\n",
    "vars_numeriques_scaled = scaler.fit_transform(vars_numeriques)\n",
    "# Rebuild a DataFrame with the original column names\n",
    "vars_numeriques_scaled = pd.DataFrame(vars_numeriques_scaled, columns=vars_numeriques.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62d49546",
   "metadata": {},
   "source": [
    "#### Sampling"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64d229f4",
   "metadata": {},
   "source": [
    "**Exercise:** write a code snippet that builds the training set (80% of the data) and the test set (20%)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "6a1c7907",
   "metadata": {},
   "outputs": [],
   "source": [
    "train, test = train_test_split(data_model, test_size=0.2, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84dc7a07",
   "metadata": {},
   "source": [
    "#### Fitting"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97c7b783",
   "metadata": {},
   "source": [
    "**Exercise:** write a code snippet that builds the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd26339b",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "8d624704",
   "metadata": {},
   "source": [
    "**Exercise:** write a code snippet that evaluates the model's performance (MAE, MSE and RMSE)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c4ca2cf9",
   "metadata": {},
   "outputs": [],
   "source": []
  },
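  {
   "cell_type": "markdown",
   "id": "e1a4b7c0",
   "metadata": {},
   "source": [
    "As a reminder of what the three metrics measure, here is a minimal sketch that computes them by hand. The function name `regression_metrics` and the toy values are assumptions made for illustration; they do not come from the dataset used in this notebook. MAE is the mean of the absolute errors, MSE the mean of the squared errors, and RMSE the square root of MSE.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "\n",
    "def regression_metrics(y_true, y_pred):\n",
    "    # Errors between observed and predicted values\n",
    "    errors = [t - p for t, p in zip(y_true, y_pred)]\n",
    "    mae = sum(abs(e) for e in errors) / len(errors)\n",
    "    mse = sum(e * e for e in errors) / len(errors)\n",
    "    rmse = math.sqrt(mse)\n",
    "    return mae, mse, rmse\n",
    "\n",
    "\n",
    "# Toy values, purely illustrative\n",
    "y_true = [3.0, 5.0, 2.0, 7.0]\n",
    "y_pred = [2.0, 5.0, 4.0, 8.0]\n",
    "mae, mse, rmse = regression_metrics(y_true, y_pred)\n",
    "print(f'MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}')\n",
    "```\n",
    "\n",
    "On real data, `metrics.mean_absolute_error` and `metrics.mean_squared_error` from sklearn compute the same quantities. RMSE, unlike MSE, is expressed in the same unit as the target, which usually makes it easier to interpret."
   ]
  },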
  {
   "cell_type": "markdown",
   "id": "fb2fe98c",
   "metadata": {},
   "source": [
    "**Question:** what do you think of this model's performance?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ecba832",
   "metadata": {},
   "source": [
    "## Supervised algorithm: Random Forest"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "efcb8987",
   "metadata": {},
   "source": [
    "So far, we have covered the steps needed to run a Machine Learning algorithm. These steps alone, however, are not enough to build a well-performing model.  \n",
    "To get there, the Data Scientist must tune how the model is trained. In what follows, we will:\n",
    "* Switch to a more powerful algorithm (Random Forest)\n",
    "* Run a *grid search* over the model's hyperparameters\n",
    "* Train the model with cross-validation\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d6723a2f",
   "metadata": {},
   "source": [
    "### Model with Cross-Validation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3716b09f",
   "metadata": {},
   "source": [
    "#### Sampling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ab1e1367",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "3f5d735e",
   "metadata": {},
   "source": [
    "#### Fitting with Cross-Validation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc819f8f",
   "metadata": {},
   "source": [
    "**Exercise:** build an RF model (RandomForestRegressor) using cross-validation. Remember to store the model's performance (MAE, MSE and RMSE) for each fold in a variable or list."
   ]
  },
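  {
   "cell_type": "markdown",
   "id": "e2b5c8d1",
   "metadata": {},
   "source": [
    "The cross-validation loop described above can be sketched as follows. This is one possible approach, not the expected solution: the data are synthetic, and every name ending in `_demo` is an assumption introduced for this illustration, chosen so that it does not clash with the variables of the notebook.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn import metrics\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.model_selection import KFold\n",
    "\n",
    "# Synthetic data standing in for the notebook's features and target (assumption)\n",
    "rng = np.random.default_rng(42)\n",
    "X_demo = rng.normal(size=(200, 4))\n",
    "y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)\n",
    "\n",
    "kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)\n",
    "mae_demo, mse_demo, rmse_demo = [], [], []\n",
    "\n",
    "for train_idx, val_idx in kf_demo.split(X_demo):\n",
    "    # A fresh model per fold, evaluated on rows it was not trained on\n",
    "    model = RandomForestRegressor(n_estimators=50, random_state=42)\n",
    "    model.fit(X_demo[train_idx], y_demo[train_idx])\n",
    "    y_hat = model.predict(X_demo[val_idx])\n",
    "\n",
    "    mse = metrics.mean_squared_error(y_demo[val_idx], y_hat)\n",
    "    mae_demo.append(metrics.mean_absolute_error(y_demo[val_idx], y_hat))\n",
    "    mse_demo.append(mse)\n",
    "    rmse_demo.append(np.sqrt(mse))\n",
    "\n",
    "print('Mean RMSE over folds:', np.mean(rmse_demo))\n",
    "```\n",
    "\n",
    "Comparing the per-fold values, and not only their mean, shows how stable the model is from one split to another."
   ]
  },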
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b515460e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialization\n",
    "# Number of folds for cross-validation\n",
    "num_splits = 5\n",
    "\n",
    "# Random Forest regressor\n",
    "rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)\n",
    "\n",
    "# KFold cross-validation splitter\n",
    "kf = KFold(n_splits=num_splits)\n",
    "\n",
    "# Lists storing the model's performance on each fold\n",
    "MAE_scores = []\n",
    "MSE_scores = []\n",
    "RMSE_scores = []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eebb394f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Training with cross-validation\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b067126c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Metrics for each fold\n",
    "\n",
    "# MAE\n",
    "for fold, mae in enumerate(MAE_scores, start=1):\n",
    "    print(f\"Fold {fold} MAE:\", mae)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6597152c",
   "metadata": {},
   "outputs": [],
   "source": [
    "#MSE\n",
    "for fold, mse in enumerate(MSE_scores, start=1):\n",
    "    print(f\"Fold {fold} MSE:\", mse)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63ff1c9d",
   "metadata": {},
   "outputs": [],
   "source": [
    "#RMSE\n",
    "for fold, rmse in enumerate(RMSE_scores, start=1):\n",
    "    print(f\"Fold {fold} RMSE:\", rmse)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec1961c2",
   "metadata": {},
   "source": [
    "**Question:** comment on the results."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a8163ef",
   "metadata": {},
   "source": [
    "### Adding a Grid Search over the hyperparameters"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a6adbfe",
   "metadata": {},
   "source": [
    "#### Sampling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d9342ad6",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "dce52b11",
   "metadata": {},
   "source": [
    "#### Fitting with Cross-Validation and *Grid Search*"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e3a9dd0",
   "metadata": {},
   "source": [
    "**Exercise:** add a Grid Search to find the model's optimal hyperparameters."
   ]
  },
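  {
   "cell_type": "markdown",
   "id": "e3c6d9f2",
   "metadata": {},
   "source": [
    "A manual grid search combined with cross-validation can be sketched as follows. This is an illustration under stated assumptions, not the expected solution: the data are synthetic, the grid values are arbitrary rather than recommended, and the names `grid`, `best_rmse_demo` and `best_params_demo` are introduced only for this example. Each combination of hyperparameters is scored by its mean RMSE over the folds, and the lowest one is kept.\n",
    "\n",
    "```python\n",
    "from itertools import product\n",
    "\n",
    "import numpy as np\n",
    "from sklearn import metrics\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.model_selection import KFold\n",
    "\n",
    "# Synthetic data standing in for the notebook's features and target (assumption)\n",
    "rng = np.random.default_rng(0)\n",
    "X_demo = rng.normal(size=(150, 3))\n",
    "y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.3, size=150)\n",
    "\n",
    "# Illustrative grid; these values are arbitrary, not recommendations\n",
    "grid = {\n",
    "    'n_estimators': [20, 50],\n",
    "    'max_depth': [3, None],\n",
    "    'min_samples_split': [2, 10],\n",
    "}\n",
    "\n",
    "kf_demo = KFold(n_splits=3, shuffle=True, random_state=0)\n",
    "best_rmse_demo, best_params_demo = np.inf, None\n",
    "\n",
    "for values in product(*grid.values()):\n",
    "    params = dict(zip(grid.keys(), values))\n",
    "    fold_rmse = []\n",
    "    for train_idx, val_idx in kf_demo.split(X_demo):\n",
    "        model = RandomForestRegressor(random_state=0, **params)\n",
    "        model.fit(X_demo[train_idx], y_demo[train_idx])\n",
    "        y_hat = model.predict(X_demo[val_idx])\n",
    "        fold_rmse.append(np.sqrt(metrics.mean_squared_error(y_demo[val_idx], y_hat)))\n",
    "    # Keep the combination with the lowest mean RMSE across folds\n",
    "    mean_rmse = float(np.mean(fold_rmse))\n",
    "    if mean_rmse < best_rmse_demo:\n",
    "        best_rmse_demo, best_params_demo = mean_rmse, params\n",
    "\n",
    "print('Best parameters:', best_params_demo)\n",
    "print('Best mean RMSE:', best_rmse_demo)\n",
    "```\n",
    "\n",
    "The explicit loops make the mechanics visible, which is the point of the exercise. sklearn also provides `GridSearchCV` in `sklearn.model_selection`, which wraps the same idea."
   ]
  },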
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6d58dbc2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialization\n",
    "# Number of folds for cross-validation\n",
    "num_splits = 5\n",
    "\n",
    "# KFold cross-validation splitter\n",
    "kf = KFold(n_splits=num_splits)\n",
    "\n",
    "# Lists storing the model's performance on each fold\n",
    "MAE_scores = []\n",
    "MSE_scores = []\n",
    "RMSE_scores = []\n",
    "\n",
    "# Hyperparameter values to test\n",
    "n_estimators_values = []  # Fill in the values to test\n",
    "max_depth_values = []  # Fill in the values to test\n",
    "min_samples_split_values = []  # Fill in the values to test\n",
    "\n",
    "# Variables storing the best results\n",
    "best_score = np.inf\n",
    "best_params = {}\n",
    "\n",
    "MAE_best_score = []\n",
    "MSE_best_score = []\n",
    "RMSE_best_score = []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "47da5172",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Complete here with your code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4936c46",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Best results\n",
    "print(\"Best parameters:\", best_params)\n",
    "print(\"Best RMSE:\", best_score)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3215c463",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Metrics for each fold\n",
    "\n",
    "# RMSE\n",
    "for fold, rmse in enumerate(RMSE_best_score, start=1):\n",
    "    print(f\"Fold {fold} RMSE:\", rmse)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb9a5c9b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# MSE\n",
    "for fold, mse in enumerate(MSE_best_score, start=1):\n",
    "    print(f\"Fold {fold} MSE:\", mse)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f0768ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "# MAE\n",
    "for fold, mae in enumerate(MAE_best_score, start=1):\n",
    "    print(f\"Fold {fold} MAE:\", mae)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "802a625f",
   "metadata": {},
   "source": [
    "**Question:** comment on the results."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "studies",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}