ArtStudies/M2/Machine Learning/TP_3/2025_TP_3_M2_ISF.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "8750d15b",
   "metadata": {},
   "source": [
    "# Course 3: Machine Learning - Supervised algorithms (1/2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7c08ae5",
   "metadata": {},
   "source": [
    "## Preamble"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec7ecb4b",
   "metadata": {},
   "source": [
    "The objectives of this session (3 h) are to:\n",
    "* Prepare the modelling datasets (sampling)\n",
    "* Apply a simple supervised model\n",
    "* Build a Machine Learning model (cross-validation and hyperparameter tuning) to solve a regression problem\n",
    "* Analyse the model's performance"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e99c600",
   "metadata": {},
   "source": [
    "## Workspace setup"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1b01045",
   "metadata": {},
   "source": [
    "### Library imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "97d58527",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Plotting\n",
    "import plotly.express as px\n",
    "import plotly.graph_objects as gp\n",
    "import seaborn as sns\n",
    "\n",
    "# Statistics\n",
    "from scipy.stats import chi2_contingency\n",
    "from sklearn import metrics\n",
    "\n",
    "# Machine Learning\n",
    "import sklearn.preprocessing as preproc\n",
    "from sklearn.cluster import KMeans\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.model_selection import KFold, train_test_split\n",
    "from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor\n",
    "\n",
    "# Apply the default seaborn theme to all plots\n",
    "sns.set()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06153286",
   "metadata": {},
   "source": [
    "### Function definitions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c67db932",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "985e4e97",
   "metadata": {},
   "source": [
    "### Constants"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "c9597b48",
   "metadata": {},
   "outputs": [],
   "source": [
    "input_path = \"./1_inputs\"\n",
    "output_path = \"./2_outputs\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2b035d2",
   "metadata": {},
   "source": [
    "### Data import"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "8051b5f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "path = input_path + \"/base_retraitee.csv\"\n",
    "data_retraitee = pd.read_csv(path, sep=\",\", decimal=\".\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2578ba1",
   "metadata": {},
   "source": [
    "## Supervised algorithm: CART"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aaa0b27d",
   "metadata": {},
   "source": [
    "In this part, the goal is to build a simple model (the CART algorithm) in order to walk through the steps required to fit a model.\n",
    "We will model the claim cost directly."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0458a05",
   "metadata": {},
   "source": [
    "### Building the model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3715c37",
   "metadata": {},
   "source": [
    "The first step is to compute the average cost per claim (the target, or response variable). This is the variable to predict from the explanatory variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "c427a4b8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.microsoft.datawrangler.viewer.v0+json": {
       "columns": [
        {
         "name": "index",
         "rawType": "int64",
         "type": "integer"
        },
        {
         "name": "ANNEE_CTR",
         "rawType": "int64",
         "type": "integer"
        },
        {
         "name": "CONTRAT_ANCIENNETE",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "FREQUENCE_PAIEMENT_COTISATION",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "GROUPE_KM",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "ZONE_RISQUE",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "AGE_ASSURE_PRINCIPAL",
         "rawType": "int64",
         "type": "integer"
        },
        {
         "name": "GENRE",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "DEUXIEME_CONDUCTEUR",
         "rawType": "bool",
         "type": "boolean"
        },
        {
         "name": "ANCIENNETE_PERMIS",
         "rawType": "int64",
         "type": "integer"
        },
        {
         "name": "ANNEE_CONSTRUCTION",
         "rawType": "float64",
         "type": "float"
        },
        {
         "name": "ENERGIE",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "EQUIPEMENT_SECURITE",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "VALEUR_DU_BIEN",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "CM",
         "rawType": "float64",
         "type": "float"
        }
       ],
       "ref": "e76df045-0c83-40e9-a027-c48f278ec1d6",
       "rows": [
        [
         "10",
         "2019",
         "(0,1]",
         "MENSUEL",
         "[0;20000[",
         "C",
         "40",
         "M",
         "False",
         "37",
         "2017.0",
         "ESSENCE",
         "VRAI",
         "[15000;20000[",
         "1072.98"
        ],
        [
         "34",
         "2020",
         "(-1,0]",
         "MENSUEL",
         "[20000;40000[",
         "C",
         "27",
         "M",
         "True",
         "13",
         "2018.0",
         "AUTRE",
         "FAUX",
         "[35000;99999[",
         "3750.0"
        ],
        [
         "36",
         "2019",
         "(-1,0]",
         "MENSUEL",
         "[20000;40000[",
         "L",
         "19",
         "M",
         "False",
         "2",
         "2017.0",
         "ESSENCE",
         "VRAI",
         "[0;10000[",
         "1838.49"
        ],
        [
         "78",
         "2019",
         "(-1,0]",
         "MENSUEL",
         "[20000;40000[",
         "B",
         "40",
         "M",
         "False",
         "45",
         "2018.0",
         "DIESEL",
         "FAUX",
         "[15000;20000[",
         "4892.74"
        ],
        [
         "89",
         "2018",
         "(1,2]",
         "MENSUEL",
         "[20000;40000[",
         "C",
         "20",
         "M",
         "False",
         "11",
         "2014.0",
         "ESSENCE",
         "FAUX",
         "[25000;35000[",
         "166.73"
        ]
       ],
       "shape": {
        "columns": 14,
        "rows": 5
       }
      },
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ANNEE_CTR</th>\n",
       "      <th>CONTRAT_ANCIENNETE</th>\n",
       "      <th>FREQUENCE_PAIEMENT_COTISATION</th>\n",
       "      <th>GROUPE_KM</th>\n",
       "      <th>ZONE_RISQUE</th>\n",
       "      <th>AGE_ASSURE_PRINCIPAL</th>\n",
       "      <th>GENRE</th>\n",
       "      <th>DEUXIEME_CONDUCTEUR</th>\n",
       "      <th>ANCIENNETE_PERMIS</th>\n",
       "      <th>ANNEE_CONSTRUCTION</th>\n",
       "      <th>ENERGIE</th>\n",
       "      <th>EQUIPEMENT_SECURITE</th>\n",
       "      <th>VALEUR_DU_BIEN</th>\n",
       "      <th>CM</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2019</td>\n",
       "      <td>(0,1]</td>\n",
       "      <td>MENSUEL</td>\n",
       "      <td>[0;20000[</td>\n",
       "      <td>C</td>\n",
       "      <td>40</td>\n",
       "      <td>M</td>\n",
       "      <td>False</td>\n",
       "      <td>37</td>\n",
       "      <td>2017.0</td>\n",
       "      <td>ESSENCE</td>\n",
       "      <td>VRAI</td>\n",
       "      <td>[15000;20000[</td>\n",
       "      <td>1072.98</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>2020</td>\n",
       "      <td>(-1,0]</td>\n",
       "      <td>MENSUEL</td>\n",
       "      <td>[20000;40000[</td>\n",
       "      <td>C</td>\n",
       "      <td>27</td>\n",
       "      <td>M</td>\n",
       "      <td>True</td>\n",
       "      <td>13</td>\n",
       "      <td>2018.0</td>\n",
       "      <td>AUTRE</td>\n",
       "      <td>FAUX</td>\n",
       "      <td>[35000;99999[</td>\n",
       "      <td>3750.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36</th>\n",
       "      <td>2019</td>\n",
       "      <td>(-1,0]</td>\n",
       "      <td>MENSUEL</td>\n",
       "      <td>[20000;40000[</td>\n",
       "      <td>L</td>\n",
       "      <td>19</td>\n",
       "      <td>M</td>\n",
       "      <td>False</td>\n",
       "      <td>2</td>\n",
       "      <td>2017.0</td>\n",
       "      <td>ESSENCE</td>\n",
       "      <td>VRAI</td>\n",
       "      <td>[0;10000[</td>\n",
       "      <td>1838.49</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>78</th>\n",
       "      <td>2019</td>\n",
       "      <td>(-1,0]</td>\n",
       "      <td>MENSUEL</td>\n",
       "      <td>[20000;40000[</td>\n",
       "      <td>B</td>\n",
       "      <td>40</td>\n",
       "      <td>M</td>\n",
       "      <td>False</td>\n",
       "      <td>45</td>\n",
       "      <td>2018.0</td>\n",
       "      <td>DIESEL</td>\n",
       "      <td>FAUX</td>\n",
       "      <td>[15000;20000[</td>\n",
       "      <td>4892.74</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>89</th>\n",
       "      <td>2018</td>\n",
       "      <td>(1,2]</td>\n",
       "      <td>MENSUEL</td>\n",
       "      <td>[20000;40000[</td>\n",
       "      <td>C</td>\n",
       "      <td>20</td>\n",
       "      <td>M</td>\n",
       "      <td>False</td>\n",
       "      <td>11</td>\n",
       "      <td>2014.0</td>\n",
       "      <td>ESSENCE</td>\n",
       "      <td>FAUX</td>\n",
       "      <td>[25000;35000[</td>\n",
       "      <td>166.73</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    ANNEE_CTR CONTRAT_ANCIENNETE FREQUENCE_PAIEMENT_COTISATION      GROUPE_KM  \\\n",
       "10       2019              (0,1]                       MENSUEL      [0;20000[   \n",
       "34       2020             (-1,0]                       MENSUEL  [20000;40000[   \n",
       "36       2019             (-1,0]                       MENSUEL  [20000;40000[   \n",
       "78       2019             (-1,0]                       MENSUEL  [20000;40000[   \n",
       "89       2018              (1,2]                       MENSUEL  [20000;40000[   \n",
       "\n",
       "   ZONE_RISQUE  AGE_ASSURE_PRINCIPAL GENRE  DEUXIEME_CONDUCTEUR  \\\n",
       "10           C                    40     M                False   \n",
       "34           C                    27     M                 True   \n",
       "36           L                    19     M                False   \n",
       "78           B                    40     M                False   \n",
       "89           C                    20     M                False   \n",
       "\n",
       "    ANCIENNETE_PERMIS  ANNEE_CONSTRUCTION  ENERGIE EQUIPEMENT_SECURITE  \\\n",
       "10                 37              2017.0  ESSENCE                VRAI   \n",
       "34                 13              2018.0    AUTRE                FAUX   \n",
       "36                  2              2017.0  ESSENCE                VRAI   \n",
       "78                 45              2018.0   DIESEL                FAUX   \n",
       "89                 11              2014.0  ESSENCE                FAUX   \n",
       "\n",
       "   VALEUR_DU_BIEN       CM  \n",
       "10  [15000;20000[  1072.98  \n",
       "34  [35000;99999[  3750.00  \n",
       "36      [0;10000[  1838.49  \n",
       "78  [15000;20000[  4892.74  \n",
       "89  [25000;35000[   166.73  "
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_model = data_retraitee.copy()\n",
    "\n",
    "# Keep only the rows with at least one claim (NB > 0)\n",
    "data_model = data_model[data_model[\"NB\"] > 0]\n",
    "\n",
    "# Compute the \"theoretical\" average claim cost (total charge divided by number of claims)\n",
    "data_model[\"CM\"] = data_model[\"CHARGE\"] / data_model[\"NB\"]\n",
    "data_model = data_model.drop([\"CHARGE\", \"NB\", \"EXPO\"], axis=1)\n",
    "data_model.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3e85088",
   "metadata": {},
   "source": [
    "**Exercise:** compute the descriptive statistics of the dataset used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "c8fd3ee1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.microsoft.datawrangler.viewer.v0+json": {
       "columns": [
        {
         "name": "index",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "ANNEE_CTR",
         "rawType": "float64",
         "type": "float"
        },
        {
         "name": "CONTRAT_ANCIENNETE",
         "rawType": "object",
         "type": "unknown"
        },
        {
         "name": "FREQUENCE_PAIEMENT_COTISATION",
         "rawType": "object",
         "type": "unknown"
        },
        {
         "name": "GROUPE_KM",
         "rawType": "object",
         "type": "unknown"
        },
        {
         "name": "ZONE_RISQUE",
         "rawType": "object",
         "type": "unknown"
        },
        {
         "name": "AGE_ASSURE_PRINCIPAL",
         "rawType": "float64",
         "type": "float"
        },
        {
         "name": "GENRE",
         "rawType": "object",
         "type": "unknown"
        },
        {
         "name": "DEUXIEME_CONDUCTEUR",
         "rawType": "object",
         "type": "unknown"
        },
        {
         "name": "ANCIENNETE_PERMIS",
         "rawType": "float64",
         "type": "float"
        },
        {
         "name": "ANNEE_CONSTRUCTION",
         "rawType": "float64",
         "type": "float"
        },
        {
         "name": "ENERGIE",
         "rawType": "object",
         "type": "unknown"
        },
        {
         "name": "EQUIPEMENT_SECURITE",
         "rawType": "object",
         "type": "unknown"
        },
        {
         "name": "VALEUR_DU_BIEN",
         "rawType": "object",
         "type": "unknown"
        },
        {
         "name": "CM",
         "rawType": "float64",
         "type": "float"
        }
       ],
       "ref": "b2f9efdd-d035-4c51-9797-2e202b404c15",
       "rows": [
        [
         "count",
         "824.0",
         "824",
         "824",
         "824",
         "824",
         "824.0",
         "824",
         "824",
         "824.0",
         "824.0",
         "824",
         "824",
         "824",
         "824.0"
        ],
        [
         "unique",
         null,
         "5",
         "3",
         "4",
         "14",
         null,
         "2",
         "2",
         null,
         null,
         "3",
         "2",
         "6",
         null
        ],
        [
         "top",
         null,
         "(0,1]",
         "MENSUEL",
         "[0;20000[",
         "C",
         null,
         "M",
         "False",
         null,
         null,
         "ESSENCE",
         "FAUX",
         "[10000;15000[",
         null
        ],
        [
         "freq",
         null,
         "297",
         "398",
         "391",
         "269",
         null,
         "483",
         "663",
         null,
         null,
         "413",
         "517",
         "213",
         null
        ],
        [
         "mean",
         "2018.384708737864",
         null,
         null,
         null,
         null,
         "44.383495145631066",
         null,
         null,
         "35.68810679611651",
         "2015.2123786407767",
         null,
         null,
         null,
         "4246.01697815534"
        ],
        [
         "std",
         "1.515832735580178",
         null,
         null,
         null,
         null,
         "13.808216667998865",
         null,
         null,
         "19.370620845496358",
         "3.1637823115731556",
         null,
         null,
         null,
         "6869.61691660173"
        ],
        [
         "min",
         "2016.0",
         null,
         null,
         null,
         null,
         "19.0",
         null,
         null,
         "1.0",
         "1998.0",
         null,
         null,
         null,
         "7.5"
        ],
        [
         "25%",
         "2017.0",
         null,
         null,
         null,
         null,
         "34.0",
         null,
         null,
         "18.0",
         "2014.0",
         null,
         null,
         null,
         "1159.96125"
        ],
        [
         "50%",
         "2018.0",
         null,
         null,
         null,
         null,
         "43.0",
         null,
         null,
         "35.0",
         "2016.0",
         null,
         null,
         null,
         "2541.6499999999996"
        ],
        [
         "75%",
         "2020.0",
         null,
         null,
         null,
         null,
         "53.0",
         null,
         null,
         "53.0",
         "2017.0",
         null,
         null,
         null,
         "4193.797500000001"
        ],
        [
         "max",
         "2021.0",
         null,
         null,
         null,
         null,
         "94.0",
         null,
         null,
         "70.0",
         "2021.0",
         null,
         null,
         null,
         "83421.85"
        ]
       ],
       "shape": {
        "columns": 14,
        "rows": 11
       }
      },
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ANNEE_CTR</th>\n",
       "      <th>CONTRAT_ANCIENNETE</th>\n",
       "      <th>FREQUENCE_PAIEMENT_COTISATION</th>\n",
       "      <th>GROUPE_KM</th>\n",
       "      <th>ZONE_RISQUE</th>\n",
       "      <th>AGE_ASSURE_PRINCIPAL</th>\n",
       "      <th>GENRE</th>\n",
       "      <th>DEUXIEME_CONDUCTEUR</th>\n",
       "      <th>ANCIENNETE_PERMIS</th>\n",
       "      <th>ANNEE_CONSTRUCTION</th>\n",
       "      <th>ENERGIE</th>\n",
       "      <th>EQUIPEMENT_SECURITE</th>\n",
       "      <th>VALEUR_DU_BIEN</th>\n",
       "      <th>CM</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>824.000000</td>\n",
       "      <td>824</td>\n",
       "      <td>824</td>\n",
       "      <td>824</td>\n",
       "      <td>824</td>\n",
       "      <td>824.000000</td>\n",
       "      <td>824</td>\n",
       "      <td>824</td>\n",
       "      <td>824.000000</td>\n",
       "      <td>824.000000</td>\n",
       "      <td>824</td>\n",
       "      <td>824</td>\n",
       "      <td>824</td>\n",
       "      <td>824.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unique</th>\n",
       "      <td>NaN</td>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "      <td>14</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>6</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top</th>\n",
       "      <td>NaN</td>\n",
       "      <td>(0,1]</td>\n",
       "      <td>MENSUEL</td>\n",
       "      <td>[0;20000[</td>\n",
       "      <td>C</td>\n",
       "      <td>NaN</td>\n",
       "      <td>M</td>\n",
       "      <td>False</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>ESSENCE</td>\n",
       "      <td>FAUX</td>\n",
       "      <td>[10000;15000[</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>freq</th>\n",
       "      <td>NaN</td>\n",
       "      <td>297</td>\n",
       "      <td>398</td>\n",
       "      <td>391</td>\n",
       "      <td>269</td>\n",
       "      <td>NaN</td>\n",
       "      <td>483</td>\n",
       "      <td>663</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>413</td>\n",
       "      <td>517</td>\n",
       "      <td>213</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>2018.384709</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>44.383495</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>35.688107</td>\n",
       "      <td>2015.212379</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4246.016978</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>1.515833</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>13.808217</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>19.370621</td>\n",
       "      <td>3.163782</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6869.616917</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>2016.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>19.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1998.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>2017.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>34.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>18.000000</td>\n",
       "      <td>2014.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1159.961250</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>2018.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>43.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>35.000000</td>\n",
       "      <td>2016.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2541.650000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>2020.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>53.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>53.000000</td>\n",
       "      <td>2017.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4193.797500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>2021.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>94.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>70.000000</td>\n",
       "      <td>2021.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>83421.850000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          ANNEE_CTR CONTRAT_ANCIENNETE FREQUENCE_PAIEMENT_COTISATION  \\\n",
       "count    824.000000                824                           824   \n",
       "unique          NaN                  5                             3   \n",
       "top             NaN              (0,1]                       MENSUEL   \n",
       "freq            NaN                297                           398   \n",
       "mean    2018.384709                NaN                           NaN   \n",
       "std        1.515833                NaN                           NaN   \n",
       "min     2016.000000                NaN                           NaN   \n",
       "25%     2017.000000                NaN                           NaN   \n",
       "50%     2018.000000                NaN                           NaN   \n",
       "75%     2020.000000                NaN                           NaN   \n",
       "max     2021.000000                NaN                           NaN   \n",
       "\n",
       "        GROUPE_KM ZONE_RISQUE  AGE_ASSURE_PRINCIPAL GENRE DEUXIEME_CONDUCTEUR  \\\n",
       "count         824         824            824.000000   824                 824   \n",
       "unique          4          14                   NaN     2                   2   \n",
       "top     [0;20000[           C                   NaN     M               False   \n",
       "freq          391         269                   NaN   483                 663   \n",
       "mean          NaN         NaN             44.383495   NaN                 NaN   \n",
       "std           NaN         NaN             13.808217   NaN                 NaN   \n",
       "min           NaN         NaN             19.000000   NaN                 NaN   \n",
       "25%           NaN         NaN             34.000000   NaN                 NaN   \n",
       "50%           NaN         NaN             43.000000   NaN                 NaN   \n",
       "75%           NaN         NaN             53.000000   NaN                 NaN   \n",
       "max           NaN         NaN             94.000000   NaN                 NaN   \n",
       "\n",
       "        ANCIENNETE_PERMIS  ANNEE_CONSTRUCTION  ENERGIE EQUIPEMENT_SECURITE  \\\n",
       "count          824.000000          824.000000      824                 824   \n",
       "unique                NaN                 NaN        3                   2   \n",
       "top                   NaN                 NaN  ESSENCE                FAUX   \n",
       "freq                  NaN                 NaN      413                 517   \n",
       "mean            35.688107         2015.212379      NaN                 NaN   \n",
       "std             19.370621            3.163782      NaN                 NaN   \n",
       "min              1.000000         1998.000000      NaN                 NaN   \n",
       "25%             18.000000         2014.000000      NaN                 NaN   \n",
       "50%             35.000000         2016.000000      NaN                 NaN   \n",
       "75%             53.000000         2017.000000      NaN                 NaN   \n",
       "max             70.000000         2021.000000      NaN                 NaN   \n",
       "\n",
       "       VALEUR_DU_BIEN            CM  \n",
       "count             824    824.000000  \n",
       "unique              6           NaN  \n",
       "top     [10000;15000[           NaN  \n",
       "freq              213           NaN  \n",
       "mean              NaN   4246.016978  \n",
       "std               NaN   6869.616917  \n",
       "min               NaN      7.500000  \n",
       "25%               NaN   1159.961250  \n",
       "50%               NaN   2541.650000  \n",
       "75%               NaN   4193.797500  \n",
       "max               NaN  83421.850000  "
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_model.describe(include='all')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92d6156a",
   "metadata": {},
   "source": [
    "#### Correlation analysis among the explanatory variables"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d7327570",
   "metadata": {},
   "source": [
    "**Question:** In your opinion, why should we look at the correlation between variables?"
   ]
  },
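  {
   "cell_type": "markdown",
   "id": "475e141b",
   "metadata": {},
   "source": [
    "*Answer*: To obtain a better-fitting model, to spot potential relationships between the features and the target (keeping in mind that correlation alone does not establish causality), and to select variables, for instance by dropping explanatory variables that are redundant with one another."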
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "1b156435",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_set = data_model.drop(\"CM\", axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "0ef0fcc0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split the explanatory variables into numerical and categorical ones\n",
    "variables_na = []\n",
    "variables_numeriques = []\n",
    "variables_01 = []  # defined but left empty in this cell\n",
    "variables_categorielles = []\n",
    "for colu in data_set.columns:\n",
    "    if data_set[colu].isna().any():\n",
    "        # Columns containing missing values are set aside\n",
    "        variables_na.append(data_set[colu])\n",
    "    elif (\n",
    "        str(data_set[colu].dtype) in [\"int32\", \"int64\", \"float64\"]\n",
    "        and data_set[colu].nunique() != 2\n",
    "    ):\n",
    "        variables_numeriques.append(data_set[colu])\n",
    "    else:\n",
    "        # Non-numeric columns and binary numeric columns are treated as categorical\n",
    "        variables_categorielles.append(data_set[colu])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e82fcade",
   "metadata": {},
   "source": [
    "##### Correlation of the categorical variables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "e130aae5",
   "metadata": {},
   "outputs": [],
   "source": [
    "vars_categorielles = pd.DataFrame(variables_categorielles).transpose()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "c39e2ad0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.microsoft.datawrangler.viewer.v0+json": {
       "columns": [
        {
         "name": "index",
         "rawType": "int64",
         "type": "integer"
        },
        {
         "name": "CONTRAT_ANCIENNETE",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "FREQUENCE_PAIEMENT_COTISATION",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "GROUPE_KM",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "ZONE_RISQUE",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "GENRE",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "DEUXIEME_CONDUCTEUR",
         "rawType": "object",
         "type": "unknown"
        },
        {
         "name": "ENERGIE",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "EQUIPEMENT_SECURITE",
         "rawType": "object",
         "type": "string"
        },
        {
         "name": "VALEUR_DU_BIEN",
         "rawType": "object",
         "type": "string"
        }
       ],
       "ref": "089d2df2-1504-4d62-9804-f974629bdaaa",
       "rows": [
        [
         "10",
         "(0,1]",
         "MENSUEL",
         "[0;20000[",
         "C",
         "M",
         "False",
         "ESSENCE",
         "VRAI",
         "[15000;20000["
        ],
        [
         "34",
         "(-1,0]",
         "MENSUEL",
         "[20000;40000[",
         "C",
         "M",
         "True",
         "AUTRE",
         "FAUX",
         "[35000;99999["
        ],
        [
         "36",
         "(-1,0]",
         "MENSUEL",
         "[20000;40000[",
         "L",
         "M",
         "False",
         "ESSENCE",
         "VRAI",
         "[0;10000["
        ],
        [
         "78",
         "(-1,0]",
         "MENSUEL",
         "[20000;40000[",
         "B",
         "M",
         "False",
         "DIESEL",
         "FAUX",
         "[15000;20000["
        ],
        [
         "89",
         "(1,2]",
         "MENSUEL",
         "[20000;40000[",
         "C",
         "M",
         "False",
         "ESSENCE",
         "FAUX",
         "[25000;35000["
        ]
       ],
       "shape": {
        "columns": 9,
        "rows": 5
       }
      },
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>CONTRAT_ANCIENNETE</th>\n",
       "      <th>FREQUENCE_PAIEMENT_COTISATION</th>\n",
       "      <th>GROUPE_KM</th>\n",
       "      <th>ZONE_RISQUE</th>\n",
       "      <th>GENRE</th>\n",
       "      <th>DEUXIEME_CONDUCTEUR</th>\n",
       "      <th>ENERGIE</th>\n",
       "      <th>EQUIPEMENT_SECURITE</th>\n",
       "      <th>VALEUR_DU_BIEN</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>(0,1]</td>\n",
       "      <td>MENSUEL</td>\n",
       "      <td>[0;20000[</td>\n",
       "      <td>C</td>\n",
       "      <td>M</td>\n",
       "      <td>False</td>\n",
       "      <td>ESSENCE</td>\n",
       "      <td>VRAI</td>\n",
       "      <td>[15000;20000[</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>(-1,0]</td>\n",
       "      <td>MENSUEL</td>\n",
       "      <td>[20000;40000[</td>\n",
       "      <td>C</td>\n",
       "      <td>M</td>\n",
       "      <td>True</td>\n",
       "      <td>AUTRE</td>\n",
       "      <td>FAUX</td>\n",
       "      <td>[35000;99999[</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36</th>\n",
       "      <td>(-1,0]</td>\n",
       "      <td>MENSUEL</td>\n",
       "      <td>[20000;40000[</td>\n",
       "      <td>L</td>\n",
       "      <td>M</td>\n",
       "      <td>False</td>\n",
       "      <td>ESSENCE</td>\n",
       "      <td>VRAI</td>\n",
       "      <td>[0;10000[</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>78</th>\n",
       "      <td>(-1,0]</td>\n",
       "      <td>MENSUEL</td>\n",
       "      <td>[20000;40000[</td>\n",
       "      <td>B</td>\n",
       "      <td>M</td>\n",
       "      <td>False</td>\n",
       "      <td>DIESEL</td>\n",
       "      <td>FAUX</td>\n",
       "      <td>[15000;20000[</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>89</th>\n",
       "      <td>(1,2]</td>\n",
       "      <td>MENSUEL</td>\n",
       "      <td>[20000;40000[</td>\n",
       "      <td>C</td>\n",
       "      <td>M</td>\n",
       "      <td>False</td>\n",
       "      <td>ESSENCE</td>\n",
       "      <td>FAUX</td>\n",
       "      <td>[25000;35000[</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   CONTRAT_ANCIENNETE FREQUENCE_PAIEMENT_COTISATION      GROUPE_KM  \\\n",
       "10              (0,1]                       MENSUEL      [0;20000[   \n",
       "34             (-1,0]                       MENSUEL  [20000;40000[   \n",
       "36             (-1,0]                       MENSUEL  [20000;40000[   \n",
       "78             (-1,0]                       MENSUEL  [20000;40000[   \n",
       "89              (1,2]                       MENSUEL  [20000;40000[   \n",
       "\n",
       "   ZONE_RISQUE GENRE DEUXIEME_CONDUCTEUR  ENERGIE EQUIPEMENT_SECURITE  \\\n",
       "10           C     M               False  ESSENCE                VRAI   \n",
       "34           C     M                True    AUTRE                FAUX   \n",
       "36           L     M               False  ESSENCE                VRAI   \n",
       "78           B     M               False   DIESEL                FAUX   \n",
       "89           C     M               False  ESSENCE                FAUX   \n",
       "\n",
       "   VALEUR_DU_BIEN  \n",
       "10  [15000;20000[  \n",
       "34  [35000;99999[  \n",
       "36      [0;10000[  \n",
       "78  [15000;20000[  \n",
       "89  [25000;35000[  "
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vars_categorielles.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f615121",
   "metadata": {},
   "source": [
    "##### Correlation of the numeric variables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "a16215ab",
   "metadata": {},
   "outputs": [],
   "source": [
    "vars_numeriques = pd.DataFrame(variables_numeriques).transpose()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "532ca6c4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Matrice de corrélation des variables numériques:\n",
      "                      ANNEE_CTR  AGE_ASSURE_PRINCIPAL  ANCIENNETE_PERMIS  \\\n",
      "ANNEE_CTR              1.000000              0.026613           0.040797   \n",
      "AGE_ASSURE_PRINCIPAL   0.026613              1.000000           0.540899   \n",
      "ANCIENNETE_PERMIS      0.040797              0.540899           1.000000   \n",
      "ANNEE_CONSTRUCTION     0.387562             -0.031655           0.033320   \n",
      "\n",
      "                      ANNEE_CONSTRUCTION  \n",
      "ANNEE_CTR                       0.387562  \n",
      "AGE_ASSURE_PRINCIPAL           -0.031655  \n",
      "ANCIENNETE_PERMIS               0.033320  \n",
      "ANNEE_CONSTRUCTION              1.000000  \n"
     ]
    }
   ],
   "source": [
    "# Correlations between the numeric variables\n",
    "correlation_matrix = vars_numeriques.corr()\n",
    "print(\"Matrice de corrélation des variables numériques:\")\n",
    "print(correlation_matrix)\n",
    "\n",
    "# \"coolwarm\" is a matplotlib colormap name, not a plotly colorscale; \"RdBu_r\" is a diverging plotly scale\n",
    "fig = px.imshow(\n",
    "    correlation_matrix, text_auto=True, color_continuous_scale=\"RdBu_r\", aspect=\"auto\"\n",
    ")\n",
    "fig.update_layout(title=\"Matrice de corrélation des variables numériques\")\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "98c7dba6",
   "metadata": {},
   "source": [
    "**Question:** what are your comments on these results?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "212209ec",
   "metadata": {},
   "source": [
    "#### Preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65aca700",
   "metadata": {},
   "source": [
    "Two steps are needed before training a model; together they are known as *preprocessing*:\n",
    "\n",
    "* The models in the \"sklearn\" library generally expect numeric inputs. Categorical variables must therefore be converted into numeric ones, for instance with *One Hot Encoding*, which creates one 0/1 indicator column per category.\n",
    "* Normalise the numeric data so that all numeric variables are on a comparable scale."
   ]
  },
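  {
   "cell_type": "markdown",
   "id": "e7a10001",
   "metadata": {},
   "source": [
    "The two preprocessing steps above can be sketched on a toy table. This is a minimal illustration under stated assumptions, not a reference solution: the DataFrame `toy`, its values and the lists `cat_cols` / `num_cols` are made up for the example; only `preproc.OneHotEncoder` and `preproc.StandardScaler` come from the text. In a real workflow the encoder and the scaler would normally be fitted on the training set only and then applied to the test set, to avoid leaking information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7a10002",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import sklearn.preprocessing as preproc\n",
    "\n",
    "# Hypothetical toy data: column names echo the notebook, values are invented\n",
    "toy = pd.DataFrame(\n",
    "    {\n",
    "        \"ENERGIE\": [\"ESSENCE\", \"DIESEL\", \"AUTRE\", \"ESSENCE\"],\n",
    "        \"GENRE\": [\"M\", \"F\", \"M\", \"F\"],\n",
    "        \"AGE_ASSURE_PRINCIPAL\": [25.0, 40.0, 33.0, 58.0],\n",
    "    }\n",
    ")\n",
    "cat_cols = [\"ENERGIE\", \"GENRE\"]\n",
    "num_cols = [\"AGE_ASSURE_PRINCIPAL\"]\n",
    "\n",
    "# One Hot Encoding: one 0/1 indicator column per category level\n",
    "encoder = preproc.OneHotEncoder(handle_unknown=\"ignore\")\n",
    "encoded = encoder.fit_transform(toy[cat_cols]).toarray()\n",
    "toy_cat = pd.DataFrame(\n",
    "    encoded, columns=encoder.get_feature_names_out(cat_cols), index=toy.index\n",
    ")\n",
    "\n",
    "# Standardisation: zero mean and unit variance for each numeric column\n",
    "scaler = preproc.StandardScaler()\n",
    "toy_num = pd.DataFrame(\n",
    "    scaler.fit_transform(toy[num_cols]), columns=num_cols, index=toy.index\n",
    ")\n",
    "\n",
    "# Final modelling table: scaled numeric columns next to the indicator columns\n",
    "toy_ready = pd.concat([toy_num, toy_cat], axis=1)\n",
    "print(toy_ready.shape)"
   ]
  },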
  {
   "cell_type": "markdown",
   "id": "95f5cc9f",
   "metadata": {},
   "source": [
    "**Exercise:** write a snippet that one-hot encodes the categorical variables. You can use the \"preproc.OneHotEncoder\" class from the sklearn library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8530717",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "b70abc5c",
   "metadata": {},
   "source": [
    "**Exercise:** write a snippet that normalises the numeric variables in the dataset. You can use the \"preproc.StandardScaler\" class from the sklearn library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4ff3847d",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "62d49546",
   "metadata": {},
   "source": [
    "#### Sampling"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64d229f4",
   "metadata": {},
   "source": [
    "**Exercise:** write a snippet that builds the training set (80% of the data) and the test set (20%)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a1c7907",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "84dc7a07",
   "metadata": {},
   "source": [
    "#### Fitting"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97c7b783",
   "metadata": {},
   "source": [
    "**Exercise:** write a snippet that builds the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd26339b",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "8d624704",
   "metadata": {},
   "source": [
    "**Exercise:** write a snippet that evaluates the model's performance (MAE, MSE and RMSE)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c4ca2cf9",
   "metadata": {},
   "outputs": [],
   "source": []
  },
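  {
   "cell_type": "markdown",
   "id": "e7a10003",
   "metadata": {},
   "source": [
    "As a reminder of what the three metrics measure, here is a small worked example on made-up numbers. The arrays `y_true` and `y_pred` are hypothetical and do not come from any model in this notebook. MAE averages the absolute errors, MSE averages the squared errors, and RMSE is the square root of the MSE, which brings it back to the unit of the target."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7a10004",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn import metrics\n",
    "\n",
    "# Hypothetical observed and predicted values, for illustration only\n",
    "y_true = np.array([100.0, 200.0, 300.0, 400.0])\n",
    "y_pred = np.array([110.0, 190.0, 330.0, 360.0])\n",
    "\n",
    "# Absolute errors are 10, 10, 30 and 40\n",
    "mae = metrics.mean_absolute_error(y_true, y_pred)  # (10 + 10 + 30 + 40) / 4\n",
    "mse = metrics.mean_squared_error(y_true, y_pred)  # (100 + 100 + 900 + 1600) / 4\n",
    "rmse = np.sqrt(mse)  # square root of the MSE, in the unit of the target\n",
    "\n",
    "print(f\"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}\")"
   ]
  },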
  {
   "cell_type": "markdown",
   "id": "fb2fe98c",
   "metadata": {},
   "source": [
    "**Question:** what do you think of this model's performance?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ecba832",
   "metadata": {},
   "source": [
    "## Supervised algorithm: Random Forest"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "efcb8987",
   "metadata": {},
   "source": [
    "So far we have covered the steps needed to run a Machine Learning algorithm. These steps alone, however, are not enough to build a well-performing model.  \n",
    "To get there, the Data Scientist has to act on how the model is trained. In what follows, we will:\n",
    "* Switch to an algorithm that usually performs better (Random Forest)\n",
    "* Run a *grid search* over the model's hyperparameters\n",
    "* Train with cross-validation\n"
   ]
  },
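  {
   "cell_type": "markdown",
   "id": "e7a10005",
   "metadata": {},
   "source": [
    "The cross-validation and grid-search ideas listed above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the expected solution of the exercises: the data are synthetic (`X` and `y` are generated with NumPy), the grid `param_grid` is arbitrary and deliberately tiny, and the mean RMSE across folds is chosen here as the selection criterion. sklearn also provides `GridSearchCV`, which wraps the same double loop; the explicit version is shown because it makes each step visible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7a10006",
   "metadata": {},
   "outputs": [],
   "source": [
    "import itertools\n",
    "\n",
    "import numpy as np\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.metrics import mean_squared_error\n",
    "from sklearn.model_selection import KFold\n",
    "\n",
    "# Hypothetical synthetic regression data (not the insurance dataset)\n",
    "rng = np.random.default_rng(42)\n",
    "X = rng.normal(size=(120, 4))\n",
    "y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=120)\n",
    "\n",
    "# Small illustrative grid; the values are arbitrary\n",
    "param_grid = {\"n_estimators\": [20, 50], \"max_depth\": [3, None]}\n",
    "kf = KFold(n_splits=5, shuffle=True, random_state=42)\n",
    "\n",
    "best_score, best_params = np.inf, {}\n",
    "for n_estimators, max_depth in itertools.product(*param_grid.values()):\n",
    "    fold_rmse = []\n",
    "    # Each fold is used once for validation, the other folds for training\n",
    "    for train_idx, val_idx in kf.split(X):\n",
    "        model = RandomForestRegressor(\n",
    "            n_estimators=n_estimators, max_depth=max_depth, random_state=42\n",
    "        )\n",
    "        model.fit(X[train_idx], y[train_idx])\n",
    "        pred = model.predict(X[val_idx])\n",
    "        fold_rmse.append(np.sqrt(mean_squared_error(y[val_idx], pred)))\n",
    "    # Keep the combination with the lowest mean RMSE over the folds\n",
    "    mean_rmse = float(np.mean(fold_rmse))\n",
    "    if mean_rmse < best_score:\n",
    "        best_score = mean_rmse\n",
    "        best_params = {\"n_estimators\": n_estimators, \"max_depth\": max_depth}\n",
    "\n",
    "print(\"Best parameters:\", best_params)\n",
    "print(\"Best mean RMSE:\", best_score)"
   ]
  },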
  {
   "cell_type": "markdown",
   "id": "d6723a2f",
   "metadata": {},
   "source": [
    "### Model with cross-validation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3716b09f",
   "metadata": {},
   "source": [
    "#### Sampling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ab1e1367",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "3f5d735e",
   "metadata": {},
   "source": [
    "#### Fitting with cross-validation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc819f8f",
   "metadata": {},
   "source": [
    "**Exercise:** build an RF model (RandomForestRegressor) using cross-validation. Remember to store the model's performance (MAE, MSE and RMSE) on each fold in a variable or list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b515460e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialisation\n",
    "# Number of folds for the cross-validation\n",
    "num_splits = 5\n",
    "\n",
    "# Random Forest regressor\n",
    "rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)\n",
    "\n",
    "# KFold cross-validation splitter\n",
    "kf = KFold(n_splits=num_splits)\n",
    "\n",
    "# Lists that store the model's performance on each fold\n",
    "MAE_scores = []\n",
    "MSE_scores = []\n",
    "RMSE_scores = []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eebb394f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Training with cross-validation\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b067126c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Metrics for every fold\n",
    "\n",
    "#MAE\n",
    "for fold, mae in enumerate(MAE_scores, start=1):\n",
    "    print(f\"Fold {fold} MAE:\", mae)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6597152c",
   "metadata": {},
   "outputs": [],
   "source": [
    "#MSE\n",
    "for fold, mse in enumerate(MSE_scores, start=1):\n",
    "    print(f\"Fold {fold} MSE:\", mse)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63ff1c9d",
   "metadata": {},
   "outputs": [],
   "source": [
    "#RMSE\n",
    "for fold, rmse in enumerate(RMSE_scores, start=1):\n",
    "    print(f\"Fold {fold} RMSE:\", rmse)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec1961c2",
   "metadata": {},
   "source": [
    "**Question:** Comment on the results."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a8163ef",
   "metadata": {},
   "source": [
    "### Adding a grid search over the hyperparameters"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a6adbfe",
   "metadata": {},
   "source": [
    "#### Sampling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d9342ad6",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "dce52b11",
   "metadata": {},
   "source": [
    "#### Fitting with cross-validation and *grid search*"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e3a9dd0",
   "metadata": {},
   "source": [
    "**Exercise:** Add a grid search to find the model's optimal hyperparameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6d58dbc2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialisation\n",
    "# Number of folds for the cross-validation\n",
    "num_splits = 5\n",
    "\n",
    "# KFold cross-validation splitter\n",
    "kf = KFold(n_splits=num_splits)\n",
    "\n",
    "# Lists that store the model's performance on each fold\n",
    "MAE_scores = []\n",
    "MSE_scores = []\n",
    "RMSE_scores = []\n",
    "\n",
    "# Hyperparameters to test\n",
    "n_estimators_values = []  # Fill in the values to test\n",
    "max_depth_values = []  # Fill in the values to test\n",
    "min_samples_split_values = []  # Fill in the values to test\n",
    "\n",
    "# Variables that keep track of the best results\n",
    "best_score = np.inf\n",
    "best_params = {}\n",
    "\n",
    "MAE_best_score = []\n",
    "MSE_best_score = []\n",
    "RMSE_best_score = []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "47da5172",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Complete here with your code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4936c46",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Best results\n",
    "print(\"Best parameters:\", best_params)\n",
    "print(\"Best RMSE:\", best_score)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3215c463",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Metrics for every fold\n",
    "\n",
    "#RMSE\n",
    "for fold, rmse in enumerate(RMSE_best_score, start=1):\n",
    "    print(f\"Fold {fold} RMSE:\", rmse)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb9a5c9b",
   "metadata": {},
   "outputs": [],
   "source": [
    "#MSE\n",
    "for fold, mse in enumerate(MSE_best_score, start=1):\n",
    "    print(f\"Fold {fold} MSE:\", mse)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f0768ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "#MAE\n",
    "for fold, mae in enumerate(MAE_best_score, start=1):\n",
    "    print(f\"Fold {fold} MAE:\", mae)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "802a625f",
   "metadata": {},
   "source": [
    "**Question:** Comment on the results."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "studies",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}