{
"cells": [
{
"cell_type": "markdown",
"id": "8750d15b",
"metadata": {},
"source": [
"# Course 3: Machine Learning - Supervised algorithms (1/2)"
]
},
{
"cell_type": "markdown",
"id": "f7c08ae5",
"metadata": {},
"source": [
"## Preamble"
]
},
{
"cell_type": "markdown",
"id": "ec7ecb4b",
"metadata": {},
"source": [
"The objectives of this session (3 hours) are:\n",
"* Prepare the modelling datasets (sampling)\n",
"* Apply a simple supervised model\n",
"* Build a Machine Learning model (cross-validation and hyperparameter tuning) to solve a regression problem\n",
"* Analyse the model's performance"
]
},
{
"cell_type": "markdown",
"id": "4e99c600",
"metadata": {},
"source": [
"## Workspace setup"
]
},
{
"cell_type": "markdown",
"id": "c1b01045",
"metadata": {},
"source": [
"### Library imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "97d58527",
"metadata": {},
"outputs": [],
"source": [
"# Data\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Plotting\n",
"import seaborn as sns\n",
"\n",
"sns.set()\n",
"import plotly.express as px\n",
"import plotly.graph_objects as gp\n",
"import sklearn.preprocessing as preproc\n",
"\n",
"# Statistics\n",
"from scipy.stats import chi2_contingency\n",
"from sklearn import metrics\n",
"\n",
"# Machine Learning\n",
"from sklearn.cluster import KMeans\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.model_selection import KFold, train_test_split\n",
"from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor"
]
},
{
"cell_type": "markdown",
"id": "06153286",
"metadata": {},
"source": [
"### Function definitions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c67db932",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "985e4e97",
"metadata": {},
"source": [
"### Constants"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "c9597b48",
"metadata": {},
"outputs": [],
"source": [
"input_path = \"./1_inputs\"\n",
"output_path = \"./2_outputs\""
]
},
{
"cell_type": "markdown",
"id": "b2b035d2",
"metadata": {},
"source": [
"### Data import"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "8051b5f4",
"metadata": {},
"outputs": [],
"source": [
"path = input_path + \"/base_retraitee.csv\"\n",
"data_retraitee = pd.read_csv(path, sep=\",\", decimal=\".\")"
]
},
{
"cell_type": "markdown",
"id": "a2578ba1",
"metadata": {},
"source": [
"## Supervised algorithm: CART"
]
},
{
"cell_type": "markdown",
"id": "aaa0b27d",
"metadata": {},
"source": [
"In this part, the goal is to build a simple model (the CART algorithm) in order to walk through the steps required to fit a model.\n",
"We will model the claim cost directly."
]
},
{
"cell_type": "markdown",
"id": "a0458a05",
"metadata": {},
"source": [
"### Building the model"
]
},
{
"cell_type": "markdown",
"id": "b3715c37",
"metadata": {},
"source": [
"The first step is to compute the average cost per claim (the target, or response variable). This is the variable to predict from the explanatory variables."
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "c427a4b8",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.microsoft.datawrangler.viewer.v0+json": {
"columns": [
{
"name": "index",
"rawType": "int64",
"type": "integer"
},
{
"name": "ANNEE_CTR",
"rawType": "int64",
"type": "integer"
},
{
"name": "CONTRAT_ANCIENNETE",
"rawType": "object",
"type": "string"
},
{
"name": "FREQUENCE_PAIEMENT_COTISATION",
"rawType": "object",
"type": "string"
},
{
"name": "GROUPE_KM",
"rawType": "object",
"type": "string"
},
{
"name": "ZONE_RISQUE",
"rawType": "object",
"type": "string"
},
{
"name": "AGE_ASSURE_PRINCIPAL",
"rawType": "int64",
"type": "integer"
},
{
"name": "GENRE",
"rawType": "object",
"type": "string"
},
{
"name": "DEUXIEME_CONDUCTEUR",
"rawType": "bool",
"type": "boolean"
},
{
"name": "ANCIENNETE_PERMIS",
"rawType": "int64",
"type": "integer"
},
{
"name": "ANNEE_CONSTRUCTION",
"rawType": "float64",
"type": "float"
},
{
"name": "ENERGIE",
"rawType": "object",
"type": "string"
},
{
"name": "EQUIPEMENT_SECURITE",
"rawType": "object",
"type": "string"
},
{
"name": "VALEUR_DU_BIEN",
"rawType": "object",
"type": "string"
},
{
"name": "CM",
"rawType": "float64",
"type": "float"
}
],
"ref": "e76df045-0c83-40e9-a027-c48f278ec1d6",
"rows": [
[
"10",
"2019",
"(0,1]",
"MENSUEL",
"[0;20000[",
"C",
"40",
"M",
"False",
"37",
"2017.0",
"ESSENCE",
"VRAI",
"[15000;20000[",
"1072.98"
],
[
"34",
"2020",
"(-1,0]",
"MENSUEL",
"[20000;40000[",
"C",
"27",
"M",
"True",
"13",
"2018.0",
"AUTRE",
"FAUX",
"[35000;99999[",
"3750.0"
],
[
"36",
"2019",
"(-1,0]",
"MENSUEL",
"[20000;40000[",
"L",
"19",
"M",
"False",
"2",
"2017.0",
"ESSENCE",
"VRAI",
"[0;10000[",
"1838.49"
],
[
"78",
"2019",
"(-1,0]",
"MENSUEL",
"[20000;40000[",
"B",
"40",
"M",
"False",
"45",
"2018.0",
"DIESEL",
"FAUX",
"[15000;20000[",
"4892.74"
],
[
"89",
"2018",
"(1,2]",
"MENSUEL",
"[20000;40000[",
"C",
"20",
"M",
"False",
"11",
"2014.0",
"ESSENCE",
"FAUX",
"[25000;35000[",
"166.73"
]
],
"shape": {
"columns": 14,
"rows": 5
}
},
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ANNEE_CTR</th>\n",
" <th>CONTRAT_ANCIENNETE</th>\n",
" <th>FREQUENCE_PAIEMENT_COTISATION</th>\n",
" <th>GROUPE_KM</th>\n",
" <th>ZONE_RISQUE</th>\n",
" <th>AGE_ASSURE_PRINCIPAL</th>\n",
" <th>GENRE</th>\n",
" <th>DEUXIEME_CONDUCTEUR</th>\n",
" <th>ANCIENNETE_PERMIS</th>\n",
" <th>ANNEE_CONSTRUCTION</th>\n",
" <th>ENERGIE</th>\n",
" <th>EQUIPEMENT_SECURITE</th>\n",
" <th>VALEUR_DU_BIEN</th>\n",
" <th>CM</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>2019</td>\n",
" <td>(0,1]</td>\n",
" <td>MENSUEL</td>\n",
" <td>[0;20000[</td>\n",
" <td>C</td>\n",
" <td>40</td>\n",
" <td>M</td>\n",
" <td>False</td>\n",
" <td>37</td>\n",
" <td>2017.0</td>\n",
" <td>ESSENCE</td>\n",
" <td>VRAI</td>\n",
" <td>[15000;20000[</td>\n",
" <td>1072.98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34</th>\n",
" <td>2020</td>\n",
" <td>(-1,0]</td>\n",
" <td>MENSUEL</td>\n",
" <td>[20000;40000[</td>\n",
" <td>C</td>\n",
" <td>27</td>\n",
" <td>M</td>\n",
" <td>True</td>\n",
" <td>13</td>\n",
" <td>2018.0</td>\n",
" <td>AUTRE</td>\n",
" <td>FAUX</td>\n",
" <td>[35000;99999[</td>\n",
" <td>3750.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36</th>\n",
" <td>2019</td>\n",
" <td>(-1,0]</td>\n",
" <td>MENSUEL</td>\n",
" <td>[20000;40000[</td>\n",
" <td>L</td>\n",
" <td>19</td>\n",
" <td>M</td>\n",
" <td>False</td>\n",
" <td>2</td>\n",
" <td>2017.0</td>\n",
" <td>ESSENCE</td>\n",
" <td>VRAI</td>\n",
" <td>[0;10000[</td>\n",
" <td>1838.49</td>\n",
" </tr>\n",
" <tr>\n",
" <th>78</th>\n",
" <td>2019</td>\n",
" <td>(-1,0]</td>\n",
" <td>MENSUEL</td>\n",
" <td>[20000;40000[</td>\n",
" <td>B</td>\n",
" <td>40</td>\n",
" <td>M</td>\n",
" <td>False</td>\n",
" <td>45</td>\n",
" <td>2018.0</td>\n",
" <td>DIESEL</td>\n",
" <td>FAUX</td>\n",
" <td>[15000;20000[</td>\n",
" <td>4892.74</td>\n",
" </tr>\n",
" <tr>\n",
" <th>89</th>\n",
" <td>2018</td>\n",
" <td>(1,2]</td>\n",
" <td>MENSUEL</td>\n",
" <td>[20000;40000[</td>\n",
" <td>C</td>\n",
" <td>20</td>\n",
" <td>M</td>\n",
" <td>False</td>\n",
" <td>11</td>\n",
" <td>2014.0</td>\n",
" <td>ESSENCE</td>\n",
" <td>FAUX</td>\n",
" <td>[25000;35000[</td>\n",
" <td>166.73</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ANNEE_CTR CONTRAT_ANCIENNETE FREQUENCE_PAIEMENT_COTISATION GROUPE_KM \\\n",
"10 2019 (0,1] MENSUEL [0;20000[ \n",
"34 2020 (-1,0] MENSUEL [20000;40000[ \n",
"36 2019 (-1,0] MENSUEL [20000;40000[ \n",
"78 2019 (-1,0] MENSUEL [20000;40000[ \n",
"89 2018 (1,2] MENSUEL [20000;40000[ \n",
"\n",
" ZONE_RISQUE AGE_ASSURE_PRINCIPAL GENRE DEUXIEME_CONDUCTEUR \\\n",
"10 C 40 M False \n",
"34 C 27 M True \n",
"36 L 19 M False \n",
"78 B 40 M False \n",
"89 C 20 M False \n",
"\n",
" ANCIENNETE_PERMIS ANNEE_CONSTRUCTION ENERGIE EQUIPEMENT_SECURITE \\\n",
"10 37 2017.0 ESSENCE VRAI \n",
"34 13 2018.0 AUTRE FAUX \n",
"36 2 2017.0 ESSENCE VRAI \n",
"78 45 2018.0 DIESEL FAUX \n",
"89 11 2014.0 ESSENCE FAUX \n",
"\n",
" VALEUR_DU_BIEN CM \n",
"10 [15000;20000[ 1072.98 \n",
"34 [35000;99999[ 3750.00 \n",
"36 [0;10000[ 1838.49 \n",
"78 [15000;20000[ 4892.74 \n",
"89 [25000;35000[ 166.73 "
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_model = data_retraitee.copy()\n",
"\n",
"# Keep only the rows with at least one claim (NB > 0)\n",
"data_model = data_model[data_model[\"NB\"] > 0]\n",
"\n",
"# Compute the \"theoretical\" average cost per claim\n",
"data_model[\"CM\"] = data_model[\"CHARGE\"] / data_model[\"NB\"]\n",
"data_model = data_model.drop([\"CHARGE\", \"NB\", \"EXPO\"], axis=1)\n",
"data_model.head()"
]
},
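{
"cell_type": "markdown",
"id": "7b2e4d10",
"metadata": {},
"source": [
"As a compact restatement of the cell above (the row index $i$ is notation introduced here for illustration, not something taken from the dataset), the target for each retained row $i$ can be written as:\n",
"\n",
"$$\\mathrm{CM}_i = \\frac{\\mathrm{CHARGE}_i}{\\mathrm{NB}_i}, \\qquad \\mathrm{NB}_i > 0$$\n",
"\n",
"Filtering on $\\mathrm{NB}_i > 0$ before dividing avoids a division by zero for rows without any claim. Assuming `CHARGE` is the total claim amount of the row and `NB` its number of claims, `CM` is then the average cost per claim."
]
},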
{
"cell_type": "markdown",
"id": "e3e85088",
"metadata": {},
"source": [
"**Exercise:** compute descriptive statistics for the dataset used."
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "c8fd3ee1",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.microsoft.datawrangler.viewer.v0+json": {
"columns": [
{
"name": "index",
"rawType": "object",
"type": "string"
},
{
"name": "ANNEE_CTR",
"rawType": "float64",
"type": "float"
},
{
"name": "CONTRAT_ANCIENNETE",
"rawType": "object",
"type": "unknown"
},
{
"name": "FREQUENCE_PAIEMENT_COTISATION",
"rawType": "object",
"type": "unknown"
},
{
"name": "GROUPE_KM",
"rawType": "object",
"type": "unknown"
},
{
"name": "ZONE_RISQUE",
"rawType": "object",
"type": "unknown"
},
{
"name": "AGE_ASSURE_PRINCIPAL",
"rawType": "float64",
"type": "float"
},
{
"name": "GENRE",
"rawType": "object",
"type": "unknown"
},
{
"name": "DEUXIEME_CONDUCTEUR",
"rawType": "object",
"type": "unknown"
},
{
"name": "ANCIENNETE_PERMIS",
"rawType": "float64",
"type": "float"
},
{
"name": "ANNEE_CONSTRUCTION",
"rawType": "float64",
"type": "float"
},
{
"name": "ENERGIE",
"rawType": "object",
"type": "unknown"
},
{
"name": "EQUIPEMENT_SECURITE",
"rawType": "object",
"type": "unknown"
},
{
"name": "VALEUR_DU_BIEN",
"rawType": "object",
"type": "unknown"
},
{
"name": "CM",
"rawType": "float64",
"type": "float"
}
],
"ref": "b2f9efdd-d035-4c51-9797-2e202b404c15",
"rows": [
[
"count",
"824.0",
"824",
"824",
"824",
"824",
"824.0",
"824",
"824",
"824.0",
"824.0",
"824",
"824",
"824",
"824.0"
],
[
"unique",
null,
"5",
"3",
"4",
"14",
null,
"2",
"2",
null,
null,
"3",
"2",
"6",
null
],
[
"top",
null,
"(0,1]",
"MENSUEL",
"[0;20000[",
"C",
null,
"M",
"False",
null,
null,
"ESSENCE",
"FAUX",
"[10000;15000[",
null
],
[
"freq",
null,
"297",
"398",
"391",
"269",
null,
"483",
"663",
null,
null,
"413",
"517",
"213",
null
],
[
"mean",
"2018.384708737864",
null,
null,
null,
null,
"44.383495145631066",
null,
null,
"35.68810679611651",
"2015.2123786407767",
null,
null,
null,
"4246.01697815534"
],
[
"std",
"1.515832735580178",
null,
null,
null,
null,
"13.808216667998865",
null,
null,
"19.370620845496358",
"3.1637823115731556",
null,
null,
null,
"6869.61691660173"
],
[
"min",
"2016.0",
null,
null,
null,
null,
"19.0",
null,
null,
"1.0",
"1998.0",
null,
null,
null,
"7.5"
],
[
"25%",
"2017.0",
null,
null,
null,
null,
"34.0",
null,
null,
"18.0",
"2014.0",
null,
null,
null,
"1159.96125"
],
[
"50%",
"2018.0",
null,
null,
null,
null,
"43.0",
null,
null,
"35.0",
"2016.0",
null,
null,
null,
"2541.6499999999996"
],
[
"75%",
"2020.0",
null,
null,
null,
null,
"53.0",
null,
null,
"53.0",
"2017.0",
null,
null,
null,
"4193.797500000001"
],
[
"max",
"2021.0",
null,
null,
null,
null,
"94.0",
null,
null,
"70.0",
"2021.0",
null,
null,
null,
"83421.85"
]
],
"shape": {
"columns": 14,
"rows": 11
}
},
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ANNEE_CTR</th>\n",
" <th>CONTRAT_ANCIENNETE</th>\n",
" <th>FREQUENCE_PAIEMENT_COTISATION</th>\n",
" <th>GROUPE_KM</th>\n",
" <th>ZONE_RISQUE</th>\n",
" <th>AGE_ASSURE_PRINCIPAL</th>\n",
" <th>GENRE</th>\n",
" <th>DEUXIEME_CONDUCTEUR</th>\n",
" <th>ANCIENNETE_PERMIS</th>\n",
" <th>ANNEE_CONSTRUCTION</th>\n",
" <th>ENERGIE</th>\n",
" <th>EQUIPEMENT_SECURITE</th>\n",
" <th>VALEUR_DU_BIEN</th>\n",
" <th>CM</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>824.000000</td>\n",
" <td>824</td>\n",
" <td>824</td>\n",
" <td>824</td>\n",
" <td>824</td>\n",
" <td>824.000000</td>\n",
" <td>824</td>\n",
" <td>824</td>\n",
" <td>824.000000</td>\n",
" <td>824.000000</td>\n",
" <td>824</td>\n",
" <td>824</td>\n",
" <td>824</td>\n",
" <td>824.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>unique</th>\n",
" <td>NaN</td>\n",
" <td>5</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>14</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>6</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>top</th>\n",
" <td>NaN</td>\n",
" <td>(0,1]</td>\n",
" <td>MENSUEL</td>\n",
" <td>[0;20000[</td>\n",
" <td>C</td>\n",
" <td>NaN</td>\n",
" <td>M</td>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>ESSENCE</td>\n",
" <td>FAUX</td>\n",
" <td>[10000;15000[</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>freq</th>\n",
" <td>NaN</td>\n",
" <td>297</td>\n",
" <td>398</td>\n",
" <td>391</td>\n",
" <td>269</td>\n",
" <td>NaN</td>\n",
" <td>483</td>\n",
" <td>663</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>413</td>\n",
" <td>517</td>\n",
" <td>213</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>2018.384709</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>44.383495</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>35.688107</td>\n",
" <td>2015.212379</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4246.016978</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>1.515833</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>13.808217</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>19.370621</td>\n",
" <td>3.163782</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>6869.616917</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>2016.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>19.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.000000</td>\n",
" <td>1998.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>7.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>2017.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>34.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>18.000000</td>\n",
" <td>2014.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1159.961250</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>2018.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>43.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>35.000000</td>\n",
" <td>2016.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2541.650000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>2020.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>53.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>53.000000</td>\n",
" <td>2017.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4193.797500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>2021.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>94.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>70.000000</td>\n",
" <td>2021.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>83421.850000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ANNEE_CTR CONTRAT_ANCIENNETE FREQUENCE_PAIEMENT_COTISATION \\\n",
"count 824.000000 824 824 \n",
"unique NaN 5 3 \n",
"top NaN (0,1] MENSUEL \n",
"freq NaN 297 398 \n",
"mean 2018.384709 NaN NaN \n",
"std 1.515833 NaN NaN \n",
"min 2016.000000 NaN NaN \n",
"25% 2017.000000 NaN NaN \n",
"50% 2018.000000 NaN NaN \n",
"75% 2020.000000 NaN NaN \n",
"max 2021.000000 NaN NaN \n",
"\n",
" GROUPE_KM ZONE_RISQUE AGE_ASSURE_PRINCIPAL GENRE DEUXIEME_CONDUCTEUR \\\n",
"count 824 824 824.000000 824 824 \n",
"unique 4 14 NaN 2 2 \n",
"top [0;20000[ C NaN M False \n",
"freq 391 269 NaN 483 663 \n",
"mean NaN NaN 44.383495 NaN NaN \n",
"std NaN NaN 13.808217 NaN NaN \n",
"min NaN NaN 19.000000 NaN NaN \n",
"25% NaN NaN 34.000000 NaN NaN \n",
"50% NaN NaN 43.000000 NaN NaN \n",
"75% NaN NaN 53.000000 NaN NaN \n",
"max NaN NaN 94.000000 NaN NaN \n",
"\n",
" ANCIENNETE_PERMIS ANNEE_CONSTRUCTION ENERGIE EQUIPEMENT_SECURITE \\\n",
"count 824.000000 824.000000 824 824 \n",
"unique NaN NaN 3 2 \n",
"top NaN NaN ESSENCE FAUX \n",
"freq NaN NaN 413 517 \n",
"mean 35.688107 2015.212379 NaN NaN \n",
"std 19.370621 3.163782 NaN NaN \n",
"min 1.000000 1998.000000 NaN NaN \n",
"25% 18.000000 2014.000000 NaN NaN \n",
"50% 35.000000 2016.000000 NaN NaN \n",
"75% 53.000000 2017.000000 NaN NaN \n",
"max 70.000000 2021.000000 NaN NaN \n",
"\n",
" VALEUR_DU_BIEN CM \n",
"count 824 824.000000 \n",
"unique 6 NaN \n",
"top [10000;15000[ NaN \n",
"freq 213 NaN \n",
"mean NaN 4246.016978 \n",
"std NaN 6869.616917 \n",
"min NaN 7.500000 \n",
"25% NaN 1159.961250 \n",
"50% NaN 2541.650000 \n",
"75% NaN 4193.797500 \n",
"max NaN 83421.850000 "
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_model.describe(include='all')"
]
},
{
"cell_type": "markdown",
"id": "92d6156a",
"metadata": {},
"source": [
"#### Correlations among the explanatory variables"
]
},
{
"cell_type": "markdown",
"id": "d7327570",
"metadata": {},
"source": [
"**Question:** In your view, why should we look at the correlations between variables?"
]
},
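{
"cell_type": "markdown",
"id": "475e141b",
"metadata": {},
"source": [
"*Answer*: To obtain a better-fitting model, to identify possible relationships between the features and the target (correlation alone does not establish causality), and to select variables, for example by dropping redundant ones that are strongly correlated with each other."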
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "1b156435",
"metadata": {},
"outputs": [],
"source": [
"data_set = data_model.drop(\"CM\", axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "0ef0fcc0",
"metadata": {},
"outputs": [],
"source": [
"#Séparation en variables qualitatives ou catégorielles\n",
"variables_na = []\n",
"variables_numeriques = []\n",
"variables_01 = []\n",
"variables_categorielles = []\n",
"for colu in data_set.columns:\n",
" if True in data_set[colu].isna().unique() :\n",
" variables_na.append(data_set[colu])\n",
" else :\n",
" if str(data_set[colu].dtypes) in [\"int32\",\"int64\",\"float64\"]:\n",
" if len(data_set[colu].unique())==2 :\n",
" variables_categorielles.append(data_set[colu])\n",
" else :\n",
" variables_numeriques.append(data_set[colu])\n",
" else :\n",
" if len(data_set[colu].unique())==2 :\n",
" variables_categorielles.append(data_set[colu])\n",
" else :\n",
" variables_categorielles.append(data_set[colu])"
]
},
{
"cell_type": "markdown",
"id": "e82fcade",
"metadata": {},
"source": [
"##### Correlation of the categorical variables:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "e130aae5",
"metadata": {},
"outputs": [],
"source": [
"vars_categorielles = pd.DataFrame(variables_categorielles).transpose()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c39e2ad0",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.plotly.v1+json": {
"config": {
"plotlyServerURL": "https://plot.ly"
},
"data": [
{
"coloraxis": "coloraxis",
"hovertemplate": "x: %{x}<br>y: %{y}<br>color: %{z}<extra></extra>",
"name": "0",
"texttemplate": "%{z:.2f}",
"type": "heatmap",
"x": [
"CONTRAT_ANCIENNETE",
"FREQUENCE_PAIEMENT_COTISATION",
"GROUPE_KM",
"ZONE_RISQUE",
"GENRE",
"DEUXIEME_CONDUCTEUR",
"ENERGIE",
"EQUIPEMENT_SECURITE",
"VALEUR_DU_BIEN"
],
"xaxis": "x",
"y": [
"CONTRAT_ANCIENNETE",
"FREQUENCE_PAIEMENT_COTISATION",
"GROUPE_KM",
"ZONE_RISQUE",
"GENRE",
"DEUXIEME_CONDUCTEUR",
"ENERGIE",
"EQUIPEMENT_SECURITE",
"VALEUR_DU_BIEN"
],
"yaxis": "y",
"z": {
"bdata": "AAAAAAAA8D8AAAAAAAAAACoCGzzITrA/jS6+t390sj/aAKYMJa2eP5RMqUS3uZs/ytNpsBVXkz8AAAAAAAAAAJsekiMPM4I/AAAAAAAAAAAAAAAAAADwPwAAAAAAAAAAAAAAAAAAAABgNwyfFOK3Px3tLvtk1qI/VTS7w965nj/DbHQwNU6sP6xOyIjBVMQ/KwIbPMhOsD8AAAAAAAAAAAAAAAAAAPA/JGwWgOwjwz/Y12crRVC2P1AU8aUpk3Y/tZ25v8HgyT9++YWBDBq6PxMKBP1KAMk/ki6+t390sj8AAAAAAAAAACNsFoDsI8M/AAAAAAAA8D8AAAAAAAAAAOzpAHMW1bU/OToUIB5twT+gpoD1ZjrEP/5ATjN+vpg/0gCmDCWtnj9gNwyfFOK3P9jXZytFULY/AAAAAAAAAAAAAAAAAADwPwAAAAAAAAAA2p0N4q1bwz/UsLoqS0u5PxFqf8IHB9E/lEypRLe5mz8d7S77ZNaiP1AU8aUpk3Y/7OkAcxbVtT8AAAAAAAAAAAAAAAAAAPA/AAAAAAAAAAAAAAAAAAAAAOYlMsJ0brs/ytNpsBVXkz9RNLvD3rmeP7edub/B4Mk/OjoUIB5twT/anQ3irVvDPwAAAAAAAAAAAAAAAAAA8D8nEbUEUmnAP+SA2g/TvNE/AAAAAAAAAADDbHQwNU6sP335hYEMGro/oKaA9WY6xD/UsLoqS0u5PwAAAAAAAAAAJxG1BFJpwD8AAAAAAADwP+fmCf6XRco/mx6SIw8zgj+rTsiIwVTEPxIKBP1KAMk//kBOM36+mD8Ran/CBwfRP+YlMsJ0brs/5YDaD9O80T/n5gn+l0XKPwAAAAAAAPA/",
"dtype": "f8",
"shape": "9, 9"
}
}
],
"layout": {
"coloraxis": {
"colorscale": [
[
0,
"rgb(5,48,97)"
],
[
0.1,
"rgb(33,102,172)"
],
[
0.2,
"rgb(67,147,195)"
],
[
0.3,
"rgb(146,197,222)"
],
[
0.4,
"rgb(209,229,240)"
],
[
0.5,
"rgb(247,247,247)"
],
[
0.6,
"rgb(253,219,199)"
],
[
0.7,
"rgb(244,165,130)"
],
[
0.8,
"rgb(214,96,77)"
],
[
0.9,
"rgb(178,24,43)"
],
[
1,
"rgb(103,0,31)"
]
]
},
"template": {
"data": {
"bar": [
{
"error_x": {
"color": "#2a3f5f"
},
"error_y": {
"color": "#2a3f5f"
},
"marker": {
"line": {
"color": "#E5ECF6",
"width": 0.5
},
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "bar"
}
],
"barpolar": [
{
"marker": {
"line": {
"color": "#E5ECF6",
"width": 0.5
},
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "barpolar"
}
],
"carpet": [
{
"aaxis": {
"endlinecolor": "#2a3f5f",
"gridcolor": "white",
"linecolor": "white",
"minorgridcolor": "white",
"startlinecolor": "#2a3f5f"
},
"baxis": {
"endlinecolor": "#2a3f5f",
"gridcolor": "white",
"linecolor": "white",
"minorgridcolor": "white",
"startlinecolor": "#2a3f5f"
},
"type": "carpet"
}
],
"choropleth": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "choropleth"
}
],
"contour": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "contour"
}
],
"contourcarpet": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "contourcarpet"
}
],
"heatmap": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "heatmap"
}
],
"histogram": [
{
"marker": {
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "histogram"
}
],
"histogram2d": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "histogram2d"
}
],
"histogram2dcontour": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "histogram2dcontour"
}
],
"mesh3d": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "mesh3d"
}
],
"parcoords": [
{
"line": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "parcoords"
}
],
"pie": [
{
"automargin": true,
"type": "pie"
}
],
"scatter": [
{
"fillpattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
},
"type": "scatter"
}
],
"scatter3d": [
{
"line": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatter3d"
}
],
"scattercarpet": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattercarpet"
}
],
"scattergeo": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattergeo"
}
],
"scattergl": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattergl"
}
],
"scattermap": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattermap"
}
],
"scattermapbox": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattermapbox"
}
],
"scatterpolar": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterpolar"
}
],
"scatterpolargl": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterpolargl"
}
],
"scatterternary": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterternary"
}
],
"surface": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "surface"
}
],
"table": [
{
"cells": {
"fill": {
"color": "#EBF0F8"
},
"line": {
"color": "white"
}
},
"header": {
"fill": {
"color": "#C8D4E3"
},
"line": {
"color": "white"
}
},
"type": "table"
}
]
},
"layout": {
"annotationdefaults": {
"arrowcolor": "#2a3f5f",
"arrowhead": 0,
"arrowwidth": 1
},
"autotypenumbers": "strict",
"coloraxis": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"colorscale": {
"diverging": [
[
0,
"#8e0152"
],
[
0.1,
"#c51b7d"
],
[
0.2,
"#de77ae"
],
[
0.3,
"#f1b6da"
],
[
0.4,
"#fde0ef"
],
[
0.5,
"#f7f7f7"
],
[
0.6,
"#e6f5d0"
],
[
0.7,
"#b8e186"
],
[
0.8,
"#7fbc41"
],
[
0.9,
"#4d9221"
],
[
1,
"#276419"
]
],
"sequential": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"sequentialminus": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
]
},
"colorway": [
"#636efa",
"#EF553B",
"#00cc96",
"#ab63fa",
"#FFA15A",
"#19d3f3",
"#FF6692",
"#B6E880",
"#FF97FF",
"#FECB52"
],
"font": {
"color": "#2a3f5f"
},
"geo": {
"bgcolor": "white",
"lakecolor": "white",
"landcolor": "#E5ECF6",
"showlakes": true,
"showland": true,
"subunitcolor": "white"
},
"hoverlabel": {
"align": "left"
},
"hovermode": "closest",
"mapbox": {
"style": "light"
},
"paper_bgcolor": "white",
"plot_bgcolor": "#E5ECF6",
"polar": {
"angularaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"bgcolor": "#E5ECF6",
"radialaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
}
},
"scene": {
"xaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
},
"yaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
},
"zaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
}
},
"shapedefaults": {
"line": {
"color": "#2a3f5f"
}
},
"ternary": {
"aaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"baxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"bgcolor": "#E5ECF6",
"caxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
}
},
"title": {
"x": 0.05
},
"xaxis": {
"automargin": true,
"gridcolor": "white",
"linecolor": "white",
"ticks": "",
"title": {
"standoff": 15
},
"zerolinecolor": "white",
"zerolinewidth": 2
},
"yaxis": {
"automargin": true,
"gridcolor": "white",
"linecolor": "white",
"ticks": "",
"title": {
"standoff": 15
},
"zerolinecolor": "white",
"zerolinewidth": 2
}
}
},
"title": {
"text": "Correlation matrix of the categorical variables (Cramér's V)"
},
"xaxis": {
"anchor": "y",
"domain": [
0,
1
]
},
"yaxis": {
"anchor": "x",
"autorange": "reversed",
"domain": [
0,
1
]
}
}
}
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Correlation matrix for the categorical variables (Cramér's V)\n",
"def cramers_v(confusion_matrix):\n",
"    \"\"\"Compute Cramér's V from a contingency table.\"\"\"\n",
"    chi2 = chi2_contingency(confusion_matrix)[0]\n",
"    n = confusion_matrix.sum().sum()\n",
"    phi2 = chi2 / n\n",
"    r, k = confusion_matrix.shape\n",
"    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))\n",
"    rcorr = r - ((r-1)**2)/(n-1)\n",
"    kcorr = k - ((k-1)**2)/(n-1)\n",
"    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))\n",
"\n",
"# Build the matrix of pairwise Cramér's V values\n",
"categorical_cols = vars_categorielles.columns\n",
"n_vars = len(categorical_cols)\n",
"cramers_matrix = np.zeros((n_vars, n_vars))\n",
"\n",
"for i, col1 in enumerate(categorical_cols):\n",
"    for j, col2 in enumerate(categorical_cols):\n",
"        if i == j:\n",
"            cramers_matrix[i, j] = 1.0\n",
"        else:\n",
"            confusion_matrix = pd.crosstab(vars_categorielles[col1], vars_categorielles[col2])\n",
"            cramers_matrix[i, j] = cramers_v(confusion_matrix)\n",
"\n",
"# Wrap the matrix in a labelled DataFrame\n",
"correlation_cat = pd.DataFrame(cramers_matrix,\n",
"                               index=categorical_cols,\n",
"                               columns=categorical_cols)\n",
"\n",
"# Visualise with Plotly\n",
"fig = px.imshow(correlation_cat,\n",
"                text_auto='.2f',  # type: ignore\n",
"                aspect=\"auto\",\n",
"                color_continuous_scale='RdBu_r',\n",
"                title=\"Correlation matrix of the categorical variables (Cramér's V)\")\n",
"fig.show()"
]
},
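{
"cell_type": "markdown",
"id": "3f9a1c7e",
"metadata": {},
"source": [
"For reference, the `cramers_v` function above appears to follow the bias-corrected form of Cramér's V. The notation below is introduced here for illustration and is not part of the original notebook: $n$ is the total count of the contingency table, $r$ and $k$ are its numbers of rows and columns, and $\\chi^2$ is the statistic returned by `chi2_contingency`.\n",
"\n",
"$$\\varphi^2 = \\frac{\\chi^2}{n}, \\qquad \\tilde{\\varphi}^2 = \\max\\left(0,\\ \\varphi^2 - \\frac{(k-1)(r-1)}{n-1}\\right)$$\n",
"\n",
"$$\\tilde{r} = r - \\frac{(r-1)^2}{n-1}, \\qquad \\tilde{k} = k - \\frac{(k-1)^2}{n-1}$$\n",
"\n",
"$$V = \\sqrt{\\frac{\\tilde{\\varphi}^2}{\\min(\\tilde{k}-1,\\ \\tilde{r}-1)}}$$\n",
"\n",
"The value is 0 when no association is detected and grows towards 1 as the association strengthens. The formula is symmetric in the two variables, which is consistent with the symmetric matrix built above. Note that `chi2_contingency` applies Yates' continuity correction by default on 2x2 tables, so pairs of binary variables are computed with that correction."
]
},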
{
"cell_type": "markdown",
"id": "8f615121",
"metadata": {},
"source": [
"##### Correlation of the numeric variables:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "a16215ab",
"metadata": {},
"outputs": [],
"source": [
"vars_numeriques = pd.DataFrame(variables_numeriques).transpose()"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "532ca6c4",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.plotly.v1+json": {
"config": {
"plotlyServerURL": "https://plot.ly"
},
"data": [
{
"coloraxis": "coloraxis",
"hovertemplate": "x: %{x}<br>y: %{y}<br>color: %{z}<extra></extra>",
"name": "0",
"texttemplate": "%{z}",
"type": "heatmap",
"x": [
"ANNEE_CTR",
"AGE_ASSURE_PRINCIPAL",
"ANCIENNETE_PERMIS",
"ANNEE_CONSTRUCTION"
],
"xaxis": "x",
"y": [
"ANNEE_CTR",
"AGE_ASSURE_PRINCIPAL",
"ANCIENNETE_PERMIS",
"ANNEE_CONSTRUCTION"
],
"yaxis": "y",
"z": {
"bdata": "AAAAAAAA8D+ybZcEUUCbP/CBLCtO46Q/qr2Q49LN2D+ybZcEUUCbPwAAAAAAAPA/slV7SAtP4T84L73yETWgv/CBLCtO46Q/slV7SAtP4T8AAAAAAADwP0I6y25dD6E/qr2Q49LN2D84L73yETWgv0I6y25dD6E/AAAAAAAA8D8=",
"dtype": "f8",
"shape": "4, 4"
}
}
],
"layout": {
"coloraxis": {
"colorscale": [
[
0,
"rgb(5,48,97)"
],
[
0.1,
"rgb(33,102,172)"
],
[
0.2,
"rgb(67,147,195)"
],
[
0.3,
"rgb(146,197,222)"
],
[
0.4,
"rgb(209,229,240)"
],
[
0.5,
"rgb(247,247,247)"
],
[
0.6,
"rgb(253,219,199)"
],
[
0.7,
"rgb(244,165,130)"
],
[
0.8,
"rgb(214,96,77)"
],
[
0.9,
"rgb(178,24,43)"
],
[
1,
"rgb(103,0,31)"
]
]
},
"template": {
"data": {
"bar": [
{
"error_x": {
"color": "#2a3f5f"
},
"error_y": {
"color": "#2a3f5f"
},
"marker": {
"line": {
"color": "#E5ECF6",
"width": 0.5
},
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "bar"
}
],
"barpolar": [
{
"marker": {
"line": {
"color": "#E5ECF6",
"width": 0.5
},
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "barpolar"
}
],
"carpet": [
{
"aaxis": {
"endlinecolor": "#2a3f5f",
"gridcolor": "white",
"linecolor": "white",
"minorgridcolor": "white",
"startlinecolor": "#2a3f5f"
},
"baxis": {
"endlinecolor": "#2a3f5f",
"gridcolor": "white",
"linecolor": "white",
"minorgridcolor": "white",
"startlinecolor": "#2a3f5f"
},
"type": "carpet"
}
],
"choropleth": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "choropleth"
}
],
"contour": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "contour"
}
],
"contourcarpet": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "contourcarpet"
}
],
"heatmap": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "heatmap"
}
],
"histogram": [
{
"marker": {
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "histogram"
}
],
"histogram2d": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "histogram2d"
}
],
"histogram2dcontour": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "histogram2dcontour"
}
],
"mesh3d": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "mesh3d"
}
],
"parcoords": [
{
"line": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "parcoords"
}
],
"pie": [
{
"automargin": true,
"type": "pie"
}
],
"scatter": [
{
"fillpattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
},
"type": "scatter"
}
],
"scatter3d": [
{
"line": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatter3d"
}
],
"scattercarpet": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattercarpet"
}
],
"scattergeo": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattergeo"
}
],
"scattergl": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattergl"
}
],
"scattermap": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattermap"
}
],
"scattermapbox": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattermapbox"
}
],
"scatterpolar": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterpolar"
}
],
"scatterpolargl": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterpolargl"
}
],
"scatterternary": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterternary"
}
],
"surface": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "surface"
}
],
"table": [
{
"cells": {
"fill": {
"color": "#EBF0F8"
},
"line": {
"color": "white"
}
},
"header": {
"fill": {
"color": "#C8D4E3"
},
"line": {
"color": "white"
}
},
"type": "table"
}
]
},
"layout": {
"annotationdefaults": {
"arrowcolor": "#2a3f5f",
"arrowhead": 0,
"arrowwidth": 1
},
"autotypenumbers": "strict",
"coloraxis": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"colorscale": {
"diverging": [
[
0,
"#8e0152"
],
[
0.1,
"#c51b7d"
],
[
0.2,
"#de77ae"
],
[
0.3,
"#f1b6da"
],
[
0.4,
"#fde0ef"
],
[
0.5,
"#f7f7f7"
],
[
0.6,
"#e6f5d0"
],
[
0.7,
"#b8e186"
],
[
0.8,
"#7fbc41"
],
[
0.9,
"#4d9221"
],
[
1,
"#276419"
]
],
"sequential": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"sequentialminus": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
]
},
"colorway": [
"#636efa",
"#EF553B",
"#00cc96",
"#ab63fa",
"#FFA15A",
"#19d3f3",
"#FF6692",
"#B6E880",
"#FF97FF",
"#FECB52"
],
"font": {
"color": "#2a3f5f"
},
"geo": {
"bgcolor": "white",
"lakecolor": "white",
"landcolor": "#E5ECF6",
"showlakes": true,
"showland": true,
"subunitcolor": "white"
},
"hoverlabel": {
"align": "left"
},
"hovermode": "closest",
"mapbox": {
"style": "light"
},
"paper_bgcolor": "white",
"plot_bgcolor": "#E5ECF6",
"polar": {
"angularaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"bgcolor": "#E5ECF6",
"radialaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
}
},
"scene": {
"xaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
},
"yaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
},
"zaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
}
},
"shapedefaults": {
"line": {
"color": "#2a3f5f"
}
},
"ternary": {
"aaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"baxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"bgcolor": "#E5ECF6",
"caxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
}
},
"title": {
"x": 0.05
},
"xaxis": {
"automargin": true,
"gridcolor": "white",
"linecolor": "white",
"ticks": "",
"title": {
"standoff": 15
},
"zerolinecolor": "white",
"zerolinewidth": 2
},
"yaxis": {
"automargin": true,
"gridcolor": "white",
"linecolor": "white",
"ticks": "",
"title": {
"standoff": 15
},
"zerolinecolor": "white",
"zerolinewidth": 2
}
}
},
"title": {
         "text": "Correlation matrix of the numerical variables"
},
"xaxis": {
"anchor": "y",
"domain": [
0,
1
]
},
"yaxis": {
"anchor": "x",
"autorange": "reversed",
"domain": [
0,
1
]
}
}
}
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
    "# Pearson correlation matrix of the numerical variables\n",
    "correlation_num = vars_numeriques.corr()\n",
    "fig = px.imshow(correlation_num,\n",
    "                text_auto=True,\n",
    "                aspect=\"auto\",\n",
    "                color_continuous_scale='RdBu_r',\n",
    "                title='Correlation matrix of the numerical variables')\n",
    "fig.show()"
]
},
{
"cell_type": "markdown",
"id": "98c7dba6",
"metadata": {},
"source": [
    "**Question:** what are your observations?"
]
},
{
"cell_type": "markdown",
"id": "212209ec",
"metadata": {},
"source": [
"#### Preprocessing"
]
},
{
"cell_type": "markdown",
"id": "65aca700",
"metadata": {},
"source": [
    "Two steps are required before training a model; together they are known as *preprocessing*:\n",
    "\n",
    "* The models provided by the \"sklearn\" library only handle numerical variables. Categorical variables must therefore be converted into numerical ones; this process is called *One Hot Encoding*.\n",
    "* Standardize the numerical data (zero mean, unit variance)."
]
},
{
"cell_type": "markdown",
"id": "95f5cc9f",
"metadata": {},
"source": [
    "**Exercise:** write a snippet that performs the One Hot Encoding of the categorical variables. You can use the \"preproc.OneHotEncoder\" class from the sklearn library."
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "b8530717",
"metadata": {},
"outputs": [],
"source": [
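    "# drop='first' removes one level per variable to avoid redundant (collinear) columns\n",
    "encoder = preproc.OneHotEncoder(sparse_output=False, drop='first')\n",
    "vars_categorielles_enc = encoder.fit_transform(vars_categorielles)\n",
    "# Keep the original index so the encoded rows stay aligned with the source data\n",
    "vars_categorielles_enc = pd.DataFrame(vars_categorielles_enc,\n",
    "                                      columns=encoder.get_feature_names_out(), # type: ignore\n",
    "                                      index=vars_categorielles.index)"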
]
},
{
"cell_type": "markdown",
"id": "b70abc5c",
"metadata": {},
"source": [
    "**Exercise:** write a snippet that standardizes the numerical variables in the dataset. You can use the \"preproc.StandardScaler\" class from the sklearn library."
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "4ff3847d",
"metadata": {},
"outputs": [],
"source": [
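    "# Standardize each numerical column to zero mean and unit variance\n",
    "scaler = preproc.StandardScaler()\n",
    "vars_numeriques_scaled = scaler.fit_transform(vars_numeriques)\n",
    "# Keep the original index so the scaled rows stay aligned with the source data\n",
    "vars_numeriques_scaled = pd.DataFrame(vars_numeriques_scaled,\n",
    "                                      columns=vars_numeriques.columns,\n",
    "                                      index=vars_numeriques.index)"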
]
},
{
"cell_type": "markdown",
"id": "62d49546",
"metadata": {},
"source": [
"#### Sampling"
]
},
{
"cell_type": "markdown",
"id": "64d229f4",
"metadata": {},
"source": [
    "**Exercise:** write a snippet that builds the training set (80% of the data) and the test set (20%)."
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "6a1c7907",
"metadata": {},
"outputs": [],
"source": [
"train, test = train_test_split(data_model, test_size=0.2, random_state=42)"
]
},
{
"cell_type": "markdown",
"id": "84dc7a07",
"metadata": {},
"source": [
"#### Fitting"
]
},
{
"cell_type": "markdown",
"id": "97c7b783",
"metadata": {},
"source": [
    "**Exercise:** write a snippet that builds the model."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bd26339b",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "8d624704",
"metadata": {},
"source": [
    "**Exercise:** write a snippet that evaluates the model's performance (MAE, MSE and RMSE)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4ca2cf9",
"metadata": {},
"outputs": [],
"source": []
},
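  {
   "cell_type": "markdown",
   "id": "sketch-baseline-md",
   "metadata": {},
   "source": [
    "A minimal sketch of the two exercises above, under stated assumptions: the notebook does not name the baseline model, so `DecisionTreeRegressor` (imported in the workspace setup) is used as a stand-in, and the data are synthetic. All names ending in `_demo` are hypothetical and do not refer to the insurance dataset. The cell fits the model on an 80/20 split and computes MAE, MSE and RMSE on the held-out part."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-baseline-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn import metrics\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "\n",
    "# Synthetic regression data (assumption: stands in for the real features/target)\n",
    "rng_demo = np.random.default_rng(42)\n",
    "X_demo = rng_demo.normal(size=(200, 3))\n",
    "y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng_demo.normal(scale=0.1, size=200)\n",
    "\n",
    "X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(\n",
    "    X_demo, y_demo, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "# Fit a simple tree and predict on the held-out 20%\n",
    "tree_demo = DecisionTreeRegressor(max_depth=5, random_state=42)\n",
    "tree_demo.fit(X_train_demo, y_train_demo)\n",
    "y_pred_demo = tree_demo.predict(X_test_demo)\n",
    "\n",
    "mae_demo = metrics.mean_absolute_error(y_test_demo, y_pred_demo)\n",
    "mse_demo = metrics.mean_squared_error(y_test_demo, y_pred_demo)\n",
    "rmse_demo = np.sqrt(mse_demo)  # RMSE is the square root of the MSE\n",
    "print(f\"MAE: {mae_demo:.3f} | MSE: {mse_demo:.3f} | RMSE: {rmse_demo:.3f}\")"
   ]
  },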
{
"cell_type": "markdown",
"id": "fb2fe98c",
"metadata": {},
"source": [
    "**Question:** what do you think of this model's performance?"
]
},
{
"cell_type": "markdown",
"id": "7ecba832",
"metadata": {},
"source": [
    "## Supervised algorithm: Random Forest"
]
},
{
"cell_type": "markdown",
"id": "efcb8987",
"metadata": {},
"source": [
    "At this point, we have covered the steps needed to run a Machine Learning algorithm. These steps alone, however, are not enough to build a well-performing model.  \n",
    "To get there, the Data Scientist has to act on how the model is trained. In what follows, we will:\n",
    "* Switch to a more powerful algorithm (Random Forest)\n",
    "* Run a *grid search* over the model's hyperparameters\n",
    "* Train with cross-validation\n"
]
},
{
"cell_type": "markdown",
"id": "d6723a2f",
"metadata": {},
"source": [
    "### Model with Cross-Validation"
]
},
{
"cell_type": "markdown",
"id": "3716b09f",
"metadata": {},
"source": [
"#### Sampling"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ab1e1367",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "3f5d735e",
"metadata": {},
"source": [
    "#### Fitting with Cross-Validation"
]
},
{
"cell_type": "markdown",
"id": "bc819f8f",
"metadata": {},
"source": [
    "**Exercise:** build an RF model (RandomForestRegressor) using cross-validation. Remember to store the model's performance (MAE, MSE & RMSE) for each fold in a variable/list."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b515460e",
"metadata": {},
"outputs": [],
"source": [
    "# Initialization\n",
    "# Number of folds for the cross-validation\n",
    "num_splits = 5\n",
    "\n",
    "# Random Forest regressor\n",
    "rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)\n",
    "\n",
    "# KFold cross-validation splitter\n",
    "kf = KFold(n_splits=num_splits)\n",
    "\n",
    "# Lists storing the model's performance on each fold\n",
    "MAE_scores = []\n",
    "MSE_scores = []\n",
    "RMSE_scores = []"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eebb394f",
"metadata": {},
"outputs": [],
"source": [
    "# Training with cross-validation\n"
]
},
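  {
   "cell_type": "markdown",
   "id": "sketch-kfold-md",
   "metadata": {},
   "source": [
    "The loop this exercise asks for can be sketched as follows. This is an illustration under assumptions, not the expected solution: the data are synthetic, every name ending in `_demo` is hypothetical, and `shuffle=True` is a choice made here that the initialization cell above does not make. Each fold trains a model on the remaining folds, predicts on the held-out fold, and appends the three metrics to its own list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-kfold-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn import metrics\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.model_selection import KFold\n",
    "\n",
    "# Synthetic regression data (assumption: not the insurance dataset)\n",
    "rng_cv_demo = np.random.default_rng(0)\n",
    "X_cv_demo = rng_cv_demo.normal(size=(150, 4))\n",
    "y_cv_demo = X_cv_demo[:, 0] + 0.5 * X_cv_demo[:, 1] ** 2 + rng_cv_demo.normal(scale=0.1, size=150)\n",
    "\n",
    "kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)\n",
    "mae_scores_demo, mse_scores_demo, rmse_scores_demo = [], [], []\n",
    "\n",
    "for train_idx, val_idx in kf_demo.split(X_cv_demo):\n",
    "    # Fit a new model on the training folds only\n",
    "    model_demo = RandomForestRegressor(n_estimators=50, random_state=42)\n",
    "    model_demo.fit(X_cv_demo[train_idx], y_cv_demo[train_idx])\n",
    "    y_val_pred = model_demo.predict(X_cv_demo[val_idx])\n",
    "\n",
    "    # Score on the held-out fold\n",
    "    fold_mse = metrics.mean_squared_error(y_cv_demo[val_idx], y_val_pred)\n",
    "    mae_scores_demo.append(metrics.mean_absolute_error(y_cv_demo[val_idx], y_val_pred))\n",
    "    mse_scores_demo.append(fold_mse)\n",
    "    rmse_scores_demo.append(np.sqrt(fold_mse))\n",
    "\n",
    "print(\"Mean RMSE over folds:\", np.mean(rmse_scores_demo))"
   ]
  },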
{
"cell_type": "code",
"execution_count": null,
"id": "b067126c",
"metadata": {},
"outputs": [],
"source": [
    "# Metrics for each fold\n",
    "\n",
    "# MAE\n",
    "for fold, mae in enumerate(MAE_scores, start=1):\n",
    "    print(f\"Fold {fold} MAE:\", mae)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6597152c",
"metadata": {},
"outputs": [],
"source": [
"#MSE\n",
"for fold, mse in enumerate(MSE_scores, start=1):\n",
" print(f\"Fold {fold} MSE:\", mse)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "63ff1c9d",
"metadata": {},
"outputs": [],
"source": [
"#RMSE\n",
"for fold, rmse in enumerate(RMSE_scores, start=1):\n",
" print(f\"Fold {fold} RMSE:\", rmse)"
]
},
{
"cell_type": "markdown",
"id": "ec1961c2",
"metadata": {},
"source": [
    "**Question:** Comment on the results."
]
},
{
"cell_type": "markdown",
"id": "5a8163ef",
"metadata": {},
"source": [
    "### Adding a Grid Search over the hyperparameters"
]
},
{
"cell_type": "markdown",
"id": "5a6adbfe",
"metadata": {},
"source": [
"#### Sampling"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d9342ad6",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "dce52b11",
"metadata": {},
"source": [
    "#### Fitting with Cross-Validation and *Grid Search*"
]
},
{
"cell_type": "markdown",
"id": "7e3a9dd0",
"metadata": {},
"source": [
    "**Exercise:** Add a Grid Search to find the model's optimal hyperparameters."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d58dbc2",
"metadata": {},
"outputs": [],
"source": [
    "# Initialization\n",
    "# Number of folds for the cross-validation\n",
    "num_splits = 5\n",
    "\n",
    "# KFold cross-validation splitter\n",
    "kf = KFold(n_splits=num_splits)\n",
    "\n",
    "# Lists storing the model's performance on each fold\n",
    "MAE_scores = []\n",
    "MSE_scores = []\n",
    "RMSE_scores = []\n",
    "\n",
    "# Hyperparameters to test\n",
    "n_estimators_values = [] # Fill in the values to test\n",
    "max_depth_values = [] # Fill in the values to test\n",
    "min_samples_split_values = [] # Fill in the values to test\n",
    "\n",
    "# Variables storing the best results\n",
    "best_score = np.inf\n",
    "best_params = {}\n",
    "\n",
    "MAE_best_score = []\n",
    "MSE_best_score = []\n",
    "RMSE_best_score = []"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "47da5172",
"metadata": {},
"outputs": [],
"source": [
    "# Complete here with your code"
]
},
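  {
   "cell_type": "markdown",
   "id": "sketch-grid-md",
   "metadata": {},
   "source": [
    "One possible shape for this search, sketched under assumptions: the data are synthetic, the grid values are arbitrary illustrations rather than recommended settings, and every name ending in `_demo` is hypothetical. The code tries each combination of the three hyperparameters, scores it by mean RMSE across the folds, and keeps the best one. scikit-learn also provides `GridSearchCV` in `sklearn.model_selection`, which automates the same procedure; the manual loop is shown here because it mirrors the variables initialized above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-grid-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from itertools import product\n",
    "\n",
    "import numpy as np\n",
    "from sklearn import metrics\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.model_selection import KFold\n",
    "\n",
    "# Synthetic regression data (assumption: not the insurance dataset)\n",
    "rng_gs_demo = np.random.default_rng(1)\n",
    "X_gs_demo = rng_gs_demo.normal(size=(120, 4))\n",
    "y_gs_demo = X_gs_demo[:, 0] - X_gs_demo[:, 2] + rng_gs_demo.normal(scale=0.1, size=120)\n",
    "\n",
    "# Illustrative grid (assumption: arbitrary values, chosen to keep the run short)\n",
    "grid_demo = {\n",
    "    \"n_estimators\": [20, 50],\n",
    "    \"max_depth\": [3, None],\n",
    "    \"min_samples_split\": [2, 10],\n",
    "}\n",
    "kf_gs_demo = KFold(n_splits=3, shuffle=True, random_state=42)\n",
    "\n",
    "best_rmse_demo, best_params_demo = np.inf, {}\n",
    "for values in product(*grid_demo.values()):\n",
    "    params_demo = dict(zip(grid_demo.keys(), values))\n",
    "    fold_rmse = []\n",
    "    for train_idx, val_idx in kf_gs_demo.split(X_gs_demo):\n",
    "        model_demo = RandomForestRegressor(random_state=42, **params_demo)\n",
    "        model_demo.fit(X_gs_demo[train_idx], y_gs_demo[train_idx])\n",
    "        fold_mse = metrics.mean_squared_error(\n",
    "            y_gs_demo[val_idx], model_demo.predict(X_gs_demo[val_idx])\n",
    "        )\n",
    "        fold_rmse.append(np.sqrt(fold_mse))\n",
    "    # Keep the combination with the lowest mean RMSE across folds\n",
    "    if np.mean(fold_rmse) < best_rmse_demo:\n",
    "        best_rmse_demo, best_params_demo = np.mean(fold_rmse), params_demo\n",
    "\n",
    "print(\"Best parameters:\", best_params_demo)\n",
    "print(\"Best mean RMSE:\", best_rmse_demo)"
   ]
  },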
{
"cell_type": "code",
"execution_count": null,
"id": "d4936c46",
"metadata": {},
"outputs": [],
"source": [
    "# Best results\n",
    "print(\"Best parameters:\", best_params)\n",
    "print(\"Best RMSE:\", best_score)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3215c463",
"metadata": {},
"outputs": [],
"source": [
    "# Metrics for each fold\n",
    "\n",
    "# RMSE\n",
    "for fold, rmse in enumerate(RMSE_best_score, start=1):\n",
    "    print(f\"Fold {fold} RMSE:\", rmse)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bb9a5c9b",
"metadata": {},
"outputs": [],
"source": [
    "# MSE\n",
    "for fold, mse in enumerate(MSE_best_score, start=1):\n",
    "    print(f\"Fold {fold} MSE:\", mse)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0f0768ad",
"metadata": {},
"outputs": [],
"source": [
    "# MAE\n",
    "for fold, mae in enumerate(MAE_best_score, start=1):\n",
    "    print(f\"Fold {fold} MAE:\", mae)"
]
},
{
"cell_type": "markdown",
"id": "802a625f",
"metadata": {},
"source": [
    "**Question:** Comment on the results."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "studies",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}