{
"cells": [
{
"cell_type": "markdown",
"id": "8750d15b",
"metadata": {},
"source": [
"# Course 3: Machine Learning - Supervised Algorithms (1/2)"
]
},
{
"cell_type": "markdown",
"id": "f7c08ae5",
"metadata": {},
"source": [
"## Preamble"
]
},
{
"cell_type": "markdown",
"id": "ec7ecb4b",
"metadata": {},
"source": [
"The objectives of this session (3 h) are to:\n",
"* Prepare the modeling datasets (sampling)\n",
"* Apply a simple supervised model\n",
"* Build a Machine Learning model (cross-validation and hyperparameter tuning) to solve a regression problem\n",
"* Analyze the model's performance"
]
},
{
"cell_type": "markdown",
"id": "4e99c600",
"metadata": {},
"source": [
"## Workspace setup"
]
},
{
"cell_type": "markdown",
"id": "c1b01045",
"metadata": {},
"source": [
"### Library imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "97d58527",
"metadata": {},
"outputs": [],
"source": [
"# Data\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Plotting\n",
"import seaborn as sns\n",
"\n",
"sns.set()\n",
"import plotly.express as px\n",
"\n",
"# Statistics\n",
"from scipy.stats import chi2_contingency\n",
"\n",
"import sklearn.preprocessing as preproc\n",
"from sklearn import metrics\n",
"\n",
"# Machine Learning\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.model_selection import KFold, cross_val_score, train_test_split\n",
"from sklearn.tree import DecisionTreeRegressor"
]
},
{
"cell_type": "markdown",
"id": "06153286",
"metadata": {},
"source": [
"### Function definitions"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c67db932",
"metadata": {},
"outputs": [],
"source": [
"def cramers_V(var1, var2):\n",
"    \"\"\"Return the squared Cramér's V: chi2 / (n * (min(rows, cols) - 1)).\n",
"\n",
"    Take the square root of the result to get the usual Cramér's V.\n",
"    \"\"\"\n",
"    # Contingency table of the two variables\n",
"    crosstab = np.array(pd.crosstab(var1, var2, rownames=None, colnames=None))\n",
"    # Chi-squared statistic (Yates' correction is applied by default on 2x2 tables)\n",
"    stat = chi2_contingency(crosstab)[0]\n",
"    obs = np.sum(crosstab)  # Number of observations\n",
"    # Smaller of the number of rows and columns, minus one\n",
"    mini = min(crosstab.shape) - 1\n",
"    return stat / (obs * mini)"
]
},
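{
"cell_type": "markdown",
"id": "sketch-cramers-v-md",
"metadata": {},
"source": [
"For comparison, here is a minimal sketch of the conventional Cramér's V, sqrt(chi2 / (n * (min(r, c) - 1))), that is, the square root of the quantity returned by `cramers_V` above. The name `cramers_v_sketch`, the toy series, and the choice of `correction=False` (which turns off Yates' continuity correction on 2x2 tables) are assumptions made for illustration and are not part of the course material.\n",
"\n",
"With Yates' correction left on, a binary variable crossed with itself can give a value slightly below 1, which is one reason a sketch like this might disable it."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-cramers-v-code",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from scipy.stats import chi2_contingency\n",
"\n",
"\n",
"def cramers_v_sketch(var1, var2):\n",
"    \"\"\"Conventional Cramér's V in [0, 1] (hypothetical helper for illustration).\"\"\"\n",
"    crosstab = pd.crosstab(var1, var2).to_numpy()\n",
"    # correction=False disables Yates' continuity correction (assumed choice)\n",
"    chi2 = chi2_contingency(crosstab, correction=False)[0]\n",
"    n = crosstab.sum()\n",
"    k = min(crosstab.shape) - 1\n",
"    return np.sqrt(chi2 / (n * k))\n",
"\n",
"\n",
"# Toy data, made up for illustration\n",
"toy_x = pd.Series([\"a\", \"b\", \"a\", \"b\", \"a\", \"b\"])\n",
"toy_y = toy_x.copy()\n",
"v_identical = cramers_v_sketch(toy_x, toy_y)\n",
"print(v_identical)"
]
},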
{
"cell_type": "markdown",
"id": "985e4e97",
"metadata": {},
"source": [
"### Constants"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "c9597b48",
"metadata": {},
"outputs": [],
"source": [
"input_path = \"./1_inputs\"\n",
"output_path = \"./2_outputs\""
]
},
{
"cell_type": "markdown",
"id": "b2b035d2",
"metadata": {},
"source": [
"### Data import"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8051b5f4",
"metadata": {},
"outputs": [],
"source": [
"path = input_path + \"/base_retraitee.csv\"\n",
"data_retraitee = pd.read_csv(path, sep=\",\", decimal=\".\")"
]
},
{
"cell_type": "markdown",
"id": "a2578ba1",
"metadata": {},
"source": [
"## Supervised algorithm: CART"
]
},
{
"cell_type": "markdown",
"id": "aaa0b27d",
"metadata": {},
"source": [
"In this part, the goal is to build a simple model (the CART algorithm) in order to walk through the steps needed to run a model.\n",
"We will model the claim cost directly."
]
},
{
"cell_type": "markdown",
"id": "a0458a05",
"metadata": {},
"source": [
"### Building the model"
]
},
{
"cell_type": "markdown",
"id": "b3715c37",
"metadata": {},
"source": [
"The first step is to compute the average cost per claim (the target, or response variable). This is the variable to predict from the explanatory variables."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "c427a4b8",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/var/folders/tp/_ld5_pzs6nx6mv1pbjhq1l740000gn/T/ipykernel_79947/358057511.py:7: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" data_model[\"CM\"] = data_model[\"CHARGE\"] / data_model[\"NB\"]\n"
]
},
{
"data": {
"application/vnd.microsoft.datawrangler.viewer.v0+json": {
"columns": [
{
"name": "index",
"rawType": "int64",
"type": "integer"
},
{
"name": "ANNEE_CTR",
"rawType": "int64",
"type": "integer"
},
{
"name": "CONTRAT_ANCIENNETE",
"rawType": "object",
"type": "string"
},
{
"name": "FREQUENCE_PAIEMENT_COTISATION",
"rawType": "object",
"type": "string"
},
{
"name": "GROUPE_KM",
"rawType": "object",
"type": "string"
},
{
"name": "ZONE_RISQUE",
"rawType": "object",
"type": "string"
},
{
"name": "AGE_ASSURE_PRINCIPAL",
"rawType": "int64",
"type": "integer"
},
{
"name": "GENRE",
"rawType": "object",
"type": "string"
},
{
"name": "DEUXIEME_CONDUCTEUR",
"rawType": "bool",
"type": "boolean"
},
{
"name": "ANCIENNETE_PERMIS",
"rawType": "int64",
"type": "integer"
},
{
"name": "ANNEE_CONSTRUCTION",
"rawType": "float64",
"type": "float"
},
{
"name": "ENERGIE",
"rawType": "object",
"type": "string"
},
{
"name": "EQUIPEMENT_SECURITE",
"rawType": "object",
"type": "string"
},
{
"name": "VALEUR_DU_BIEN",
"rawType": "object",
"type": "string"
},
{
"name": "CM",
"rawType": "float64",
"type": "float"
}
],
"ref": "a70f0dbd-403e-4585-990e-4028b5b0673d",
"rows": [
[
"10",
"2019",
"(0,1]",
"MENSUEL",
"[0;20000[",
"C",
"40",
"M",
"False",
"37",
"2017.0",
"ESSENCE",
"VRAI",
"[15000;20000[",
"1072.98"
],
[
"34",
"2020",
"(-1,0]",
"MENSUEL",
"[20000;40000[",
"C",
"27",
"M",
"True",
"13",
"2018.0",
"AUTRE",
"FAUX",
"[35000;99999[",
"3750.0"
],
[
"36",
"2019",
"(-1,0]",
"MENSUEL",
"[20000;40000[",
"L",
"19",
"M",
"False",
"2",
"2017.0",
"ESSENCE",
"VRAI",
"[0;10000[",
"1838.49"
],
[
"78",
"2019",
"(-1,0]",
"MENSUEL",
"[20000;40000[",
"B",
"40",
"M",
"False",
"45",
"2018.0",
"DIESEL",
"FAUX",
"[15000;20000[",
"4892.74"
],
[
"89",
"2018",
"(1,2]",
"MENSUEL",
"[20000;40000[",
"C",
"20",
"M",
"False",
"11",
"2014.0",
"ESSENCE",
"FAUX",
"[25000;35000[",
"166.73"
]
],
"shape": {
"columns": 14,
"rows": 5
}
},
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ANNEE_CTR</th>\n",
" <th>CONTRAT_ANCIENNETE</th>\n",
" <th>FREQUENCE_PAIEMENT_COTISATION</th>\n",
" <th>GROUPE_KM</th>\n",
" <th>ZONE_RISQUE</th>\n",
" <th>AGE_ASSURE_PRINCIPAL</th>\n",
" <th>GENRE</th>\n",
" <th>DEUXIEME_CONDUCTEUR</th>\n",
" <th>ANCIENNETE_PERMIS</th>\n",
" <th>ANNEE_CONSTRUCTION</th>\n",
" <th>ENERGIE</th>\n",
" <th>EQUIPEMENT_SECURITE</th>\n",
" <th>VALEUR_DU_BIEN</th>\n",
" <th>CM</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>2019</td>\n",
" <td>(0,1]</td>\n",
" <td>MENSUEL</td>\n",
" <td>[0;20000[</td>\n",
" <td>C</td>\n",
" <td>40</td>\n",
" <td>M</td>\n",
" <td>False</td>\n",
" <td>37</td>\n",
" <td>2017.0</td>\n",
" <td>ESSENCE</td>\n",
" <td>VRAI</td>\n",
" <td>[15000;20000[</td>\n",
" <td>1072.98</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34</th>\n",
" <td>2020</td>\n",
" <td>(-1,0]</td>\n",
" <td>MENSUEL</td>\n",
" <td>[20000;40000[</td>\n",
" <td>C</td>\n",
" <td>27</td>\n",
" <td>M</td>\n",
" <td>True</td>\n",
" <td>13</td>\n",
" <td>2018.0</td>\n",
" <td>AUTRE</td>\n",
" <td>FAUX</td>\n",
" <td>[35000;99999[</td>\n",
" <td>3750.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36</th>\n",
" <td>2019</td>\n",
" <td>(-1,0]</td>\n",
" <td>MENSUEL</td>\n",
" <td>[20000;40000[</td>\n",
" <td>L</td>\n",
" <td>19</td>\n",
" <td>M</td>\n",
" <td>False</td>\n",
" <td>2</td>\n",
" <td>2017.0</td>\n",
" <td>ESSENCE</td>\n",
" <td>VRAI</td>\n",
" <td>[0;10000[</td>\n",
" <td>1838.49</td>\n",
" </tr>\n",
" <tr>\n",
" <th>78</th>\n",
" <td>2019</td>\n",
" <td>(-1,0]</td>\n",
" <td>MENSUEL</td>\n",
" <td>[20000;40000[</td>\n",
" <td>B</td>\n",
" <td>40</td>\n",
" <td>M</td>\n",
" <td>False</td>\n",
" <td>45</td>\n",
" <td>2018.0</td>\n",
" <td>DIESEL</td>\n",
" <td>FAUX</td>\n",
" <td>[15000;20000[</td>\n",
" <td>4892.74</td>\n",
" </tr>\n",
" <tr>\n",
" <th>89</th>\n",
" <td>2018</td>\n",
" <td>(1,2]</td>\n",
" <td>MENSUEL</td>\n",
" <td>[20000;40000[</td>\n",
" <td>C</td>\n",
" <td>20</td>\n",
" <td>M</td>\n",
" <td>False</td>\n",
" <td>11</td>\n",
" <td>2014.0</td>\n",
" <td>ESSENCE</td>\n",
" <td>FAUX</td>\n",
" <td>[25000;35000[</td>\n",
" <td>166.73</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ANNEE_CTR CONTRAT_ANCIENNETE FREQUENCE_PAIEMENT_COTISATION GROUPE_KM \\\n",
"10 2019 (0,1] MENSUEL [0;20000[ \n",
"34 2020 (-1,0] MENSUEL [20000;40000[ \n",
"36 2019 (-1,0] MENSUEL [20000;40000[ \n",
"78 2019 (-1,0] MENSUEL [20000;40000[ \n",
"89 2018 (1,2] MENSUEL [20000;40000[ \n",
"\n",
" ZONE_RISQUE AGE_ASSURE_PRINCIPAL GENRE DEUXIEME_CONDUCTEUR \\\n",
"10 C 40 M False \n",
"34 C 27 M True \n",
"36 L 19 M False \n",
"78 B 40 M False \n",
"89 C 20 M False \n",
"\n",
" ANCIENNETE_PERMIS ANNEE_CONSTRUCTION ENERGIE EQUIPEMENT_SECURITE \\\n",
"10 37 2017.0 ESSENCE VRAI \n",
"34 13 2018.0 AUTRE FAUX \n",
"36 2 2017.0 ESSENCE VRAI \n",
"78 45 2018.0 DIESEL FAUX \n",
"89 11 2014.0 ESSENCE FAUX \n",
"\n",
" VALEUR_DU_BIEN CM \n",
"10 [15000;20000[ 1072.98 \n",
"34 [35000;99999[ 3750.00 \n",
"36 [0;10000[ 1838.49 \n",
"78 [15000;20000[ 4892.74 \n",
"89 [25000;35000[ 166.73 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_model = data_retraitee\n",
"\n",
"# Keep only the rows with at least one claim (NB > 0)\n",
"data_model = data_model[data_model[\"NB\"] > 0]\n",
"\n",
"# Compute the \"theoretical\" average cost per claim\n",
"data_model[\"CM\"] = data_model[\"CHARGE\"] / data_model[\"NB\"]\n",
"data_model = data_model.drop([\"CHARGE\", \"NB\", \"EXPO\"], axis=1)\n",
"data_model.head()"
]
},
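{
"cell_type": "markdown",
"id": "sketch-copy-md",
"metadata": {},
"source": [
"The `SettingWithCopyWarning` in the output above is raised because `data_model` is a filtered slice of `data_retraitee` at the moment the `CM` column is assigned. One common way to avoid it, sketched here on made-up toy data, is to take an explicit `.copy()` right after filtering. The `toy` frame and its values are assumptions for illustration; only the column names mirror the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-copy-code",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy data, made up for illustration\n",
"toy = pd.DataFrame(\n",
"    {\"CHARGE\": [1000.0, 0.0, 900.0], \"NB\": [2, 0, 3], \"EXPO\": [1.0, 1.0, 0.5]}\n",
")\n",
"\n",
"# .copy() makes the filtered frame independent of the original, so the\n",
"# assignment below writes to a real DataFrame rather than to a slice\n",
"toy_model = toy[toy[\"NB\"] > 0].copy()\n",
"toy_model[\"CM\"] = toy_model[\"CHARGE\"] / toy_model[\"NB\"]\n",
"toy_model = toy_model.drop(columns=[\"CHARGE\", \"NB\", \"EXPO\"])\n",
"print(toy_model)"
]
},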
{
"cell_type": "markdown",
"id": "e3e85088",
"metadata": {},
"source": [
"**Exercise:** compute descriptive statistics for the dataset used."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "c8fd3ee1",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.microsoft.datawrangler.viewer.v0+json": {
"columns": [
{
"name": "index",
"rawType": "object",
"type": "string"
},
{
"name": "ANNEE_CTR",
"rawType": "float64",
"type": "float"
},
{
"name": "CONTRAT_ANCIENNETE",
"rawType": "object",
"type": "unknown"
},
{
"name": "FREQUENCE_PAIEMENT_COTISATION",
"rawType": "object",
"type": "unknown"
},
{
"name": "GROUPE_KM",
"rawType": "object",
"type": "unknown"
},
{
"name": "ZONE_RISQUE",
"rawType": "object",
"type": "unknown"
},
{
"name": "AGE_ASSURE_PRINCIPAL",
"rawType": "float64",
"type": "float"
},
{
"name": "GENRE",
"rawType": "object",
"type": "unknown"
},
{
"name": "DEUXIEME_CONDUCTEUR",
"rawType": "object",
"type": "unknown"
},
{
"name": "ANCIENNETE_PERMIS",
"rawType": "float64",
"type": "float"
},
{
"name": "ANNEE_CONSTRUCTION",
"rawType": "float64",
"type": "float"
},
{
"name": "ENERGIE",
"rawType": "object",
"type": "unknown"
},
{
"name": "EQUIPEMENT_SECURITE",
"rawType": "object",
"type": "unknown"
},
{
"name": "VALEUR_DU_BIEN",
"rawType": "object",
"type": "unknown"
},
{
"name": "CM",
"rawType": "float64",
"type": "float"
}
],
"ref": "63d03be7-3681-4d8e-b0be-ed765b8f2594",
"rows": [
[
"count",
"824.0",
"824",
"824",
"824",
"824",
"824.0",
"824",
"824",
"824.0",
"824.0",
"824",
"824",
"824",
"824.0"
],
[
"unique",
null,
"5",
"3",
"4",
"14",
null,
"2",
"2",
null,
null,
"3",
"2",
"6",
null
],
[
"top",
null,
"(0,1]",
"MENSUEL",
"[0;20000[",
"C",
null,
"M",
"False",
null,
null,
"ESSENCE",
"FAUX",
"[10000;15000[",
null
],
[
"freq",
null,
"297",
"398",
"391",
"269",
null,
"483",
"663",
null,
null,
"413",
"517",
"213",
null
],
[
"mean",
"2018.384708737864",
null,
null,
null,
null,
"44.383495145631066",
null,
null,
"35.68810679611651",
"2015.2123786407767",
null,
null,
null,
"4246.01697815534"
],
[
"std",
"1.515832735580178",
null,
null,
null,
null,
"13.808216667998865",
null,
null,
"19.370620845496358",
"3.1637823115731556",
null,
null,
null,
"6869.61691660173"
],
[
"min",
"2016.0",
null,
null,
null,
null,
"19.0",
null,
null,
"1.0",
"1998.0",
null,
null,
null,
"7.5"
],
[
"25%",
"2017.0",
null,
null,
null,
null,
"34.0",
null,
null,
"18.0",
"2014.0",
null,
null,
null,
"1159.96125"
],
[
"50%",
"2018.0",
null,
null,
null,
null,
"43.0",
null,
null,
"35.0",
"2016.0",
null,
null,
null,
"2541.6499999999996"
],
[
"75%",
"2020.0",
null,
null,
null,
null,
"53.0",
null,
null,
"53.0",
"2017.0",
null,
null,
null,
"4193.797500000001"
],
[
"max",
"2021.0",
null,
null,
null,
null,
"94.0",
null,
null,
"70.0",
"2021.0",
null,
null,
null,
"83421.85"
]
],
"shape": {
"columns": 14,
"rows": 11
}
},
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ANNEE_CTR</th>\n",
" <th>CONTRAT_ANCIENNETE</th>\n",
" <th>FREQUENCE_PAIEMENT_COTISATION</th>\n",
" <th>GROUPE_KM</th>\n",
" <th>ZONE_RISQUE</th>\n",
" <th>AGE_ASSURE_PRINCIPAL</th>\n",
" <th>GENRE</th>\n",
" <th>DEUXIEME_CONDUCTEUR</th>\n",
" <th>ANCIENNETE_PERMIS</th>\n",
" <th>ANNEE_CONSTRUCTION</th>\n",
" <th>ENERGIE</th>\n",
" <th>EQUIPEMENT_SECURITE</th>\n",
" <th>VALEUR_DU_BIEN</th>\n",
" <th>CM</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>824.000000</td>\n",
" <td>824</td>\n",
" <td>824</td>\n",
" <td>824</td>\n",
" <td>824</td>\n",
" <td>824.000000</td>\n",
" <td>824</td>\n",
" <td>824</td>\n",
" <td>824.000000</td>\n",
" <td>824.000000</td>\n",
" <td>824</td>\n",
" <td>824</td>\n",
" <td>824</td>\n",
" <td>824.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>unique</th>\n",
" <td>NaN</td>\n",
" <td>5</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>14</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>6</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>top</th>\n",
" <td>NaN</td>\n",
" <td>(0,1]</td>\n",
" <td>MENSUEL</td>\n",
" <td>[0;20000[</td>\n",
" <td>C</td>\n",
" <td>NaN</td>\n",
" <td>M</td>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>ESSENCE</td>\n",
" <td>FAUX</td>\n",
" <td>[10000;15000[</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>freq</th>\n",
" <td>NaN</td>\n",
" <td>297</td>\n",
" <td>398</td>\n",
" <td>391</td>\n",
" <td>269</td>\n",
" <td>NaN</td>\n",
" <td>483</td>\n",
" <td>663</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>413</td>\n",
" <td>517</td>\n",
" <td>213</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>2018.384709</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>44.383495</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>35.688107</td>\n",
" <td>2015.212379</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4246.016978</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>1.515833</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>13.808217</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>19.370621</td>\n",
" <td>3.163782</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>6869.616917</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>2016.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>19.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.000000</td>\n",
" <td>1998.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>7.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>2017.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>34.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>18.000000</td>\n",
" <td>2014.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1159.961250</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>2018.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>43.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>35.000000</td>\n",
" <td>2016.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2541.650000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>2020.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>53.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>53.000000</td>\n",
" <td>2017.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4193.797500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>2021.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>94.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>70.000000</td>\n",
" <td>2021.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>83421.850000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ANNEE_CTR CONTRAT_ANCIENNETE FREQUENCE_PAIEMENT_COTISATION \\\n",
"count 824.000000 824 824 \n",
"unique NaN 5 3 \n",
"top NaN (0,1] MENSUEL \n",
"freq NaN 297 398 \n",
"mean 2018.384709 NaN NaN \n",
"std 1.515833 NaN NaN \n",
"min 2016.000000 NaN NaN \n",
"25% 2017.000000 NaN NaN \n",
"50% 2018.000000 NaN NaN \n",
"75% 2020.000000 NaN NaN \n",
"max 2021.000000 NaN NaN \n",
"\n",
" GROUPE_KM ZONE_RISQUE AGE_ASSURE_PRINCIPAL GENRE DEUXIEME_CONDUCTEUR \\\n",
"count 824 824 824.000000 824 824 \n",
"unique 4 14 NaN 2 2 \n",
"top [0;20000[ C NaN M False \n",
"freq 391 269 NaN 483 663 \n",
"mean NaN NaN 44.383495 NaN NaN \n",
"std NaN NaN 13.808217 NaN NaN \n",
"min NaN NaN 19.000000 NaN NaN \n",
"25% NaN NaN 34.000000 NaN NaN \n",
"50% NaN NaN 43.000000 NaN NaN \n",
"75% NaN NaN 53.000000 NaN NaN \n",
"max NaN NaN 94.000000 NaN NaN \n",
"\n",
" ANCIENNETE_PERMIS ANNEE_CONSTRUCTION ENERGIE EQUIPEMENT_SECURITE \\\n",
"count 824.000000 824.000000 824 824 \n",
"unique NaN NaN 3 2 \n",
"top NaN NaN ESSENCE FAUX \n",
"freq NaN NaN 413 517 \n",
"mean 35.688107 2015.212379 NaN NaN \n",
"std 19.370621 3.163782 NaN NaN \n",
"min 1.000000 1998.000000 NaN NaN \n",
"25% 18.000000 2014.000000 NaN NaN \n",
"50% 35.000000 2016.000000 NaN NaN \n",
"75% 53.000000 2017.000000 NaN NaN \n",
"max 70.000000 2021.000000 NaN NaN \n",
"\n",
" VALEUR_DU_BIEN CM \n",
"count 824 824.000000 \n",
"unique 6 NaN \n",
"top [10000;15000[ NaN \n",
"freq 213 NaN \n",
"mean NaN 4246.016978 \n",
"std NaN 6869.616917 \n",
"min NaN 7.500000 \n",
"25% NaN 1159.961250 \n",
"50% NaN 2541.650000 \n",
"75% NaN 4193.797500 \n",
"max NaN 83421.850000 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_model.describe(include=\"all\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "2d32ae2b",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.plotly.v1+json": {
"config": {
"plotlyServerURL": "https://plot.ly"
},
"data": [
{
"bingroup": "x",
"hovertemplate": "CM=%{x}<br>count=%{y}<extra></extra>",
"legendgroup": "",
"marker": {
"color": "#636efa",
"pattern": {
"shape": ""
}
},
"name": "",
"orientation": "v",
"showlegend": false,
"type": "histogram",
"x": {
"bdata": "",
"dtype": "f8"
},
"xaxis": "x",
"yaxis": "y"
}
],
"layout": {
"barmode": "relative",
"legend": {
"tracegroupgap": 0
},
"margin": {
"t": 60
},
"template": {
"data": {
"bar": [
{
"error_x": {
"color": "#2a3f5f"
},
"error_y": {
"color": "#2a3f5f"
},
"marker": {
"line": {
"color": "#E5ECF6",
"width": 0.5
},
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "bar"
}
],
"barpolar": [
{
"marker": {
"line": {
"color": "#E5ECF6",
"width": 0.5
},
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "barpolar"
}
],
"carpet": [
{
"aaxis": {
"endlinecolor": "#2a3f5f",
"gridcolor": "white",
"linecolor": "white",
"minorgridcolor": "white",
"startlinecolor": "#2a3f5f"
},
"baxis": {
"endlinecolor": "#2a3f5f",
"gridcolor": "white",
"linecolor": "white",
"minorgridcolor": "white",
"startlinecolor": "#2a3f5f"
},
"type": "carpet"
}
],
"choropleth": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "choropleth"
}
],
"contour": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "contour"
}
],
"contourcarpet": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "contourcarpet"
}
],
"heatmap": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "heatmap"
}
],
"histogram": [
{
"marker": {
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "histogram"
}
],
"histogram2d": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "histogram2d"
}
],
"histogram2dcontour": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "histogram2dcontour"
}
],
"mesh3d": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "mesh3d"
}
],
"parcoords": [
{
"line": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "parcoords"
}
],
"pie": [
{
"automargin": true,
"type": "pie"
}
],
"scatter": [
{
"fillpattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
},
"type": "scatter"
}
],
"scatter3d": [
{
"line": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatter3d"
}
],
"scattercarpet": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattercarpet"
}
],
"scattergeo": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattergeo"
}
],
"scattergl": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattergl"
}
],
"scattermap": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattermap"
}
],
"scattermapbox": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattermapbox"
}
],
"scatterpolar": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterpolar"
}
],
"scatterpolargl": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterpolargl"
}
],
"scatterternary": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterternary"
}
],
"surface": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "surface"
}
],
"table": [
{
"cells": {
"fill": {
"color": "#EBF0F8"
},
"line": {
"color": "white"
}
},
"header": {
"fill": {
"color": "#C8D4E3"
},
"line": {
"color": "white"
}
},
"type": "table"
}
]
},
"layout": {
"annotationdefaults": {
"arrowcolor": "#2a3f5f",
"arrowhead": 0,
"arrowwidth": 1
},
"autotypenumbers": "strict",
"coloraxis": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"colorscale": {
"diverging": [
[
0,
"#8e0152"
],
[
0.1,
"#c51b7d"
],
[
0.2,
"#de77ae"
],
[
0.3,
"#f1b6da"
],
[
0.4,
"#fde0ef"
],
[
0.5,
"#f7f7f7"
],
[
0.6,
"#e6f5d0"
],
[
0.7,
"#b8e186"
],
[
0.8,
"#7fbc41"
],
[
0.9,
"#4d9221"
],
[
1,
"#276419"
]
],
"sequential": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"sequentialminus": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
]
},
"colorway": [
"#636efa",
"#EF553B",
"#00cc96",
"#ab63fa",
"#FFA15A",
"#19d3f3",
"#FF6692",
"#B6E880",
"#FF97FF",
"#FECB52"
],
"font": {
"color": "#2a3f5f"
},
"geo": {
"bgcolor": "white",
"lakecolor": "white",
"landcolor": "#E5ECF6",
"showlakes": true,
"showland": true,
"subunitcolor": "white"
},
"hoverlabel": {
"align": "left"
},
"hovermode": "closest",
"mapbox": {
"style": "light"
},
"paper_bgcolor": "white",
"plot_bgcolor": "#E5ECF6",
"polar": {
"angularaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"bgcolor": "#E5ECF6",
"radialaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
}
},
"scene": {
"xaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
},
"yaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
},
"zaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
}
},
"shapedefaults": {
"line": {
"color": "#2a3f5f"
}
},
"ternary": {
"aaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"baxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"bgcolor": "#E5ECF6",
"caxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
}
},
"title": {
"x": 0.05
},
"xaxis": {
"automargin": true,
"gridcolor": "white",
"linecolor": "white",
"ticks": "",
"title": {
"standoff": 15
},
"zerolinecolor": "white",
"zerolinewidth": 2
},
"yaxis": {
"automargin": true,
"gridcolor": "white",
"linecolor": "white",
"ticks": "",
"title": {
"standoff": 15
},
"zerolinecolor": "white",
"zerolinewidth": 2
}
}
},
"xaxis": {
"anchor": "y",
"domain": [
0,
1
],
"title": {
"text": "CM"
}
},
"yaxis": {
"anchor": "x",
"domain": [
0,
1
],
"title": {
"text": "count"
}
}
}
}
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Inspect the distribution of the average claim cost (CM)\n",
"fig = px.histogram(data_model, x=\"CM\")\n",
"fig.show()"
]
},
{
"cell_type": "markdown",
"id": "92d6156a",
"metadata": {},
"source": [
"#### Correlation analysis among the explanatory variables"
]
},
{
"cell_type": "markdown",
"id": "d7327570",
"metadata": {},
"source": [
"**Question:** In your opinion, why should we look at the correlation between variables?"
]
},
{
"cell_type": "markdown",
"id": "475e141b",
"metadata": {},
"source": [
"*Answer*: To obtain a model that fits better, to spot a potential relationship between features and the target (correlation alone does not establish causality), and to select a subset of variables."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "1b156435",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(824, 13)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_set = data_model.drop(\"CM\", axis=1)\n",
"data_set.shape"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "0ef0fcc0",
"metadata": {},
"outputs": [],
"source": [
"# Split the columns into numeric and categorical variables\n",
"variables_na = []\n",
"variables_numeriques = []\n",
"variables_01 = []\n",
"variables_categorielles = []\n",
"for colu in data_set.columns:\n",
"    if True in data_set[colu].isna().unique():\n",
"        # Columns with missing values are set aside\n",
"        variables_na.append(data_set[colu])\n",
"    elif str(data_set[colu].dtypes) in [\"int32\", \"int64\", \"float64\"]:\n",
"        # Numeric columns with only two distinct values are treated as categorical\n",
"        if len(data_set[colu].unique()) == 2:\n",
"            variables_categorielles.append(data_set[colu])\n",
"        else:\n",
"            variables_numeriques.append(data_set[colu])\n",
"    elif len(data_set[colu].unique()) == 2:\n",
"        variables_categorielles.append(data_set[colu])\n",
"    else:\n",
"        variables_categorielles.append(data_set[colu])"
]
},
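{
"cell_type": "markdown",
"id": "sketch-dtypes-md",
"metadata": {},
"source": [
"As an alternative to the explicit loop above, pandas can split columns by dtype with `DataFrame.select_dtypes`. This is a rough sketch on made-up toy data, not an exact equivalent: unlike the loop, it does not set aside columns with missing values and does not treat two-valued numeric columns as categorical. The `toy` frame and the names `numeric_cols` and `categorical_cols` are assumptions for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-dtypes-code",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy data, made up for illustration\n",
"toy = pd.DataFrame(\n",
"    {\n",
"        \"AGE\": [40, 27, 19],\n",
"        \"GENRE\": [\"M\", \"F\", \"M\"],\n",
"        \"DEUXIEME_CONDUCTEUR\": [False, True, False],\n",
"    }\n",
")\n",
"\n",
"# Numeric columns; boolean columns are not matched by \"number\"\n",
"numeric_cols = toy.select_dtypes(include=\"number\").columns.tolist()\n",
"# Everything else is treated as categorical in this sketch\n",
"categorical_cols = toy.select_dtypes(exclude=\"number\").columns.tolist()\n",
"print(numeric_cols, categorical_cols)"
]
},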
{
"cell_type": "markdown",
"id": "e82fcade",
"metadata": {},
"source": [
"##### Correlation of the categorical variables:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "e130aae5",
"metadata": {},
"outputs": [],
"source": [
"vars_categorielles = pd.DataFrame(variables_categorielles).transpose()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "c39e2ad0",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.microsoft.datawrangler.viewer.v0+json": {
"columns": [
{
"name": "index",
"rawType": "object",
"type": "string"
},
{
"name": "CONTRAT_ANCIENNETE",
"rawType": "float64",
"type": "float"
},
{
"name": "FREQUENCE_PAIEMENT_COTISATION",
"rawType": "float64",
"type": "float"
},
{
"name": "GROUPE_KM",
"rawType": "float64",
"type": "float"
},
{
"name": "ZONE_RISQUE",
"rawType": "float64",
"type": "float"
},
{
"name": "GENRE",
"rawType": "float64",
"type": "float"
},
{
"name": "DEUXIEME_CONDUCTEUR",
"rawType": "float64",
"type": "float"
},
{
"name": "ENERGIE",
"rawType": "float64",
"type": "float"
},
{
"name": "EQUIPEMENT_SECURITE",
"rawType": "float64",
"type": "float"
},
{
"name": "VALEUR_DU_BIEN",
"rawType": "float64",
"type": "float"
}
],
"ref": "b82309b4-707a-46f5-b3fe-c9c1324ee757",
"rows": [
[
"CONTRAT_ANCIENNETE",
"1.0",
"0.0",
"0.01",
"0.02",
"0.01",
"0.01",
"0.01",
"0.0",
"0.01"
],
[
"FREQUENCE_PAIEMENT_COTISATION",
"0.0",
"1.0",
"0.0",
"0.01",
"0.01",
"0.0",
"0.0",
"0.01",
"0.03"
],
[
"GROUPE_KM",
"0.01",
"0.0",
"1.0",
"0.04",
"0.01",
"0.0",
"0.04",
"0.01",
"0.04"
],
[
"ZONE_RISQUE",
"0.02",
"0.01",
"0.04",
"1.0",
"0.01",
"0.02",
"0.03",
"0.04",
"0.02"
],
[
"GENRE",
"0.01",
"0.01",
"0.01",
"0.01",
"1.0",
"0.0",
"0.03",
"0.01",
"0.08"
],
[
"DEUXIEME_CONDUCTEUR",
"0.01",
"0.0",
"0.0",
"0.02",
"0.0",
"0.99",
"0.0",
"0.0",
"0.02"
],
[
"ENERGIE",
"0.01",
"0.0",
"0.04",
"0.03",
"0.03",
"0.0",
"1.0",
"0.02",
"0.08"
],
[
"EQUIPEMENT_SECURITE",
"0.0",
"0.01",
"0.01",
"0.04",
"0.01",
"0.0",
"0.02",
"0.99",
"0.05"
],
[
"VALEUR_DU_BIEN",
"0.01",
"0.03",
"0.04",
"0.02",
"0.08",
"0.02",
"0.08",
"0.05",
"1.0"
]
],
"shape": {
"columns": 9,
"rows": 9
}
},
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>CONTRAT_ANCIENNETE</th>\n",
" <th>FREQUENCE_PAIEMENT_COTISATION</th>\n",
" <th>GROUPE_KM</th>\n",
" <th>ZONE_RISQUE</th>\n",
" <th>GENRE</th>\n",
" <th>DEUXIEME_CONDUCTEUR</th>\n",
" <th>ENERGIE</th>\n",
" <th>EQUIPEMENT_SECURITE</th>\n",
" <th>VALEUR_DU_BIEN</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>CONTRAT_ANCIENNETE</th>\n",
" <td>1.00</td>\n",
" <td>0.00</td>\n",
" <td>0.01</td>\n",
" <td>0.02</td>\n",
" <td>0.01</td>\n",
" <td>0.01</td>\n",
" <td>0.01</td>\n",
" <td>0.00</td>\n",
" <td>0.01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FREQUENCE_PAIEMENT_COTISATION</th>\n",
" <td>0.00</td>\n",
" <td>1.00</td>\n",
" <td>0.00</td>\n",
" <td>0.01</td>\n",
" <td>0.01</td>\n",
" <td>0.00</td>\n",
" <td>0.00</td>\n",
" <td>0.01</td>\n",
" <td>0.03</td>\n",
" </tr>\n",
" <tr>\n",
" <th>GROUPE_KM</th>\n",
" <td>0.01</td>\n",
" <td>0.00</td>\n",
" <td>1.00</td>\n",
" <td>0.04</td>\n",
" <td>0.01</td>\n",
" <td>0.00</td>\n",
" <td>0.04</td>\n",
" <td>0.01</td>\n",
" <td>0.04</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ZONE_RISQUE</th>\n",
" <td>0.02</td>\n",
" <td>0.01</td>\n",
" <td>0.04</td>\n",
" <td>1.00</td>\n",
" <td>0.01</td>\n",
" <td>0.02</td>\n",
" <td>0.03</td>\n",
" <td>0.04</td>\n",
" <td>0.02</td>\n",
" </tr>\n",
" <tr>\n",
" <th>GENRE</th>\n",
" <td>0.01</td>\n",
" <td>0.01</td>\n",
" <td>0.01</td>\n",
" <td>0.01</td>\n",
" <td>1.00</td>\n",
" <td>0.00</td>\n",
" <td>0.03</td>\n",
" <td>0.01</td>\n",
" <td>0.08</td>\n",
" </tr>\n",
" <tr>\n",
" <th>DEUXIEME_CONDUCTEUR</th>\n",
" <td>0.01</td>\n",
" <td>0.00</td>\n",
" <td>0.00</td>\n",
" <td>0.02</td>\n",
" <td>0.00</td>\n",
" <td>0.99</td>\n",
" <td>0.00</td>\n",
" <td>0.00</td>\n",
" <td>0.02</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ENERGIE</th>\n",
" <td>0.01</td>\n",
" <td>0.00</td>\n",
" <td>0.04</td>\n",
" <td>0.03</td>\n",
" <td>0.03</td>\n",
" <td>0.00</td>\n",
" <td>1.00</td>\n",
" <td>0.02</td>\n",
" <td>0.08</td>\n",
" </tr>\n",
" <tr>\n",
" <th>EQUIPEMENT_SECURITE</th>\n",
" <td>0.00</td>\n",
" <td>0.01</td>\n",
" <td>0.01</td>\n",
" <td>0.04</td>\n",
" <td>0.01</td>\n",
" <td>0.00</td>\n",
" <td>0.02</td>\n",
" <td>0.99</td>\n",
" <td>0.05</td>\n",
" </tr>\n",
" <tr>\n",
" <th>VALEUR_DU_BIEN</th>\n",
" <td>0.01</td>\n",
" <td>0.03</td>\n",
" <td>0.04</td>\n",
" <td>0.02</td>\n",
" <td>0.08</td>\n",
" <td>0.02</td>\n",
" <td>0.08</td>\n",
" <td>0.05</td>\n",
" <td>1.00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" CONTRAT_ANCIENNETE \\\n",
"CONTRAT_ANCIENNETE 1.00 \n",
"FREQUENCE_PAIEMENT_COTISATION 0.00 \n",
"GROUPE_KM 0.01 \n",
"ZONE_RISQUE 0.02 \n",
"GENRE 0.01 \n",
"DEUXIEME_CONDUCTEUR 0.01 \n",
"ENERGIE 0.01 \n",
"EQUIPEMENT_SECURITE 0.00 \n",
"VALEUR_DU_BIEN 0.01 \n",
"\n",
" FREQUENCE_PAIEMENT_COTISATION GROUPE_KM \\\n",
"CONTRAT_ANCIENNETE 0.00 0.01 \n",
"FREQUENCE_PAIEMENT_COTISATION 1.00 0.00 \n",
"GROUPE_KM 0.00 1.00 \n",
"ZONE_RISQUE 0.01 0.04 \n",
"GENRE 0.01 0.01 \n",
"DEUXIEME_CONDUCTEUR 0.00 0.00 \n",
"ENERGIE 0.00 0.04 \n",
"EQUIPEMENT_SECURITE 0.01 0.01 \n",
"VALEUR_DU_BIEN 0.03 0.04 \n",
"\n",
" ZONE_RISQUE GENRE DEUXIEME_CONDUCTEUR \\\n",
"CONTRAT_ANCIENNETE 0.02 0.01 0.01 \n",
"FREQUENCE_PAIEMENT_COTISATION 0.01 0.01 0.00 \n",
"GROUPE_KM 0.04 0.01 0.00 \n",
"ZONE_RISQUE 1.00 0.01 0.02 \n",
"GENRE 0.01 1.00 0.00 \n",
"DEUXIEME_CONDUCTEUR 0.02 0.00 0.99 \n",
"ENERGIE 0.03 0.03 0.00 \n",
"EQUIPEMENT_SECURITE 0.04 0.01 0.00 \n",
"VALEUR_DU_BIEN 0.02 0.08 0.02 \n",
"\n",
" ENERGIE EQUIPEMENT_SECURITE VALEUR_DU_BIEN \n",
"CONTRAT_ANCIENNETE 0.01 0.00 0.01 \n",
"FREQUENCE_PAIEMENT_COTISATION 0.00 0.01 0.03 \n",
"GROUPE_KM 0.04 0.01 0.04 \n",
"ZONE_RISQUE 0.03 0.04 0.02 \n",
"GENRE 0.03 0.01 0.08 \n",
"DEUXIEME_CONDUCTEUR 0.00 0.00 0.02 \n",
"ENERGIE 1.00 0.02 0.08 \n",
"EQUIPEMENT_SECURITE 0.02 0.99 0.05 \n",
"VALEUR_DU_BIEN 0.08 0.05 1.00 "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Cramér's V for every pair of categorical variables\n",
"rows = []\n",
"\n",
"for var1 in vars_categorielles:\n",
"    col = []\n",
"    for var2 in vars_categorielles:\n",
"        cramers = cramers_V(\n",
"            vars_categorielles[var1],\n",
"            vars_categorielles[var2],\n",
"        )  # Cramér's V\n",
"        col.append(round(cramers, 2))  # round the result\n",
"    rows.append(col)\n",
"\n",
"cramers_results = np.array(rows)\n",
"v_cramer_resultats = pd.DataFrame(\n",
"    cramers_results,\n",
"    columns=vars_categorielles.columns,\n",
"    index=vars_categorielles.columns,\n",
")\n",
"\n",
"v_cramer_resultats"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "1755a2a4",
"metadata": {},
"outputs": [],
"source": [
"# Flag pairs of categorical variables that are too strongly associated\n",
"for i in range(v_cramer_resultats.shape[0]):\n",
"    for j in range(i + 1, v_cramer_resultats.shape[0]):\n",
"        if v_cramer_resultats.iloc[i, j] > 0.7:\n",
"            print(\n",
"                v_cramer_resultats.index.to_numpy()[i]\n",
"                + \" and \"\n",
"                + v_cramer_resultats.columns[j]\n",
"                + \" are too dependent, Cramér's V = \"\n",
"                + str(v_cramer_resultats.iloc[i, j]),\n",
"            )"
]
},
{
"cell_type": "markdown",
"id": "8f615121",
"metadata": {},
"source": [
"##### Correlation of numerical variables:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "a16215ab",
"metadata": {},
"outputs": [],
"source": [
"vars_numeriques = pd.DataFrame(variables_numeriques).transpose()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "532ca6c4",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.microsoft.datawrangler.viewer.v0+json": {
"columns": [
{
"name": "index",
"rawType": "object",
"type": "string"
},
{
"name": "ANNEE_CTR",
"rawType": "float64",
"type": "float"
},
{
"name": "AGE_ASSURE_PRINCIPAL",
"rawType": "float64",
"type": "float"
},
{
"name": "ANCIENNETE_PERMIS",
"rawType": "float64",
"type": "float"
},
{
"name": "ANNEE_CONSTRUCTION",
"rawType": "float64",
"type": "float"
}
],
"ref": "d1b7089c-632a-4c1a-8ee0-ef265e2f24f3",
"rows": [
[
"ANNEE_CTR",
"1.0",
"0.0266125353863182",
"0.04079670216583853",
"0.38756248686965"
],
[
"AGE_ASSURE_PRINCIPAL",
"0.0266125353863182",
"1.0",
"0.5408989349040694",
"-0.03165489280817585"
],
[
"ANCIENNETE_PERMIS",
"0.04079670216583853",
"0.5408989349040694",
"1.0",
"0.033320350432053406"
],
[
"ANNEE_CONSTRUCTION",
"0.38756248686965",
"-0.03165489280817585",
"0.033320350432053406",
"1.0"
]
],
"shape": {
"columns": 4,
"rows": 4
}
},
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ANNEE_CTR</th>\n",
" <th>AGE_ASSURE_PRINCIPAL</th>\n",
" <th>ANCIENNETE_PERMIS</th>\n",
" <th>ANNEE_CONSTRUCTION</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>ANNEE_CTR</th>\n",
" <td>1.000000</td>\n",
" <td>0.026613</td>\n",
" <td>0.040797</td>\n",
" <td>0.387562</td>\n",
" </tr>\n",
" <tr>\n",
" <th>AGE_ASSURE_PRINCIPAL</th>\n",
" <td>0.026613</td>\n",
" <td>1.000000</td>\n",
" <td>0.540899</td>\n",
" <td>-0.031655</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ANCIENNETE_PERMIS</th>\n",
" <td>0.040797</td>\n",
" <td>0.540899</td>\n",
" <td>1.000000</td>\n",
" <td>0.033320</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ANNEE_CONSTRUCTION</th>\n",
" <td>0.387562</td>\n",
" <td>-0.031655</td>\n",
" <td>0.033320</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ANNEE_CTR AGE_ASSURE_PRINCIPAL ANCIENNETE_PERMIS \\\n",
"ANNEE_CTR 1.000000 0.026613 0.040797 \n",
"AGE_ASSURE_PRINCIPAL 0.026613 1.000000 0.540899 \n",
"ANCIENNETE_PERMIS 0.040797 0.540899 1.000000 \n",
"ANNEE_CONSTRUCTION 0.387562 -0.031655 0.033320 \n",
"\n",
" ANNEE_CONSTRUCTION \n",
"ANNEE_CTR 0.387562 \n",
"AGE_ASSURE_PRINCIPAL -0.031655 \n",
"ANCIENNETE_PERMIS 0.033320 \n",
"ANNEE_CONSTRUCTION 1.000000 "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Pearson correlation\n",
"correlations_num = vars_numeriques.corr(method=\"pearson\")\n",
"correlations_num"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "6c3bd9b2",
"metadata": {},
"outputs": [],
"source": [
"# Flag pairs of numerical variables that are too strongly correlated\n",
"nb_variables = correlations_num.shape[0]\n",
"for i in range(nb_variables):\n",
"    for j in range(i + 1, nb_variables):\n",
"        if abs(correlations_num.iloc[i, j]) > 0.7:\n",
"            print(\n",
"                correlations_num.index.to_numpy()[i]\n",
"                + \" and \"\n",
"                + correlations_num.columns[j]\n",
"                + \" are too dependent, corr = \"\n",
"                + str(correlations_num.iloc[i, j]),\n",
"            )"
]
},
{
"cell_type": "markdown",
"id": "98c7dba6",
"metadata": {},
"source": [
"**Question:** what are your comments?"
]
},
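{
"cell_type": "markdown",
"id": "67406b54",
"metadata": {},
"source": [
"*Answer*: No pair of variables is strongly correlated: none exceeds the 0.7 threshold used above. The largest Cramér's V between two distinct categorical variables is 0.08, and the largest Pearson correlation is 0.54 (AGE_ASSURE_PRINCIPAL and ANCIENNETE_PERMIS), followed by 0.39 (ANNEE_CTR and ANNEE_CONSTRUCTION)."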
]
},
{
"cell_type": "markdown",
"id": "212209ec",
"metadata": {},
"source": [
"#### Preprocessing"
]
},
{
"cell_type": "markdown",
"id": "65aca700",
"metadata": {},
"source": [
"Two steps are needed before training a model; together they are known as *preprocessing*:\n",
"\n",
"* The models provided by the \"sklearn\" library only handle numerical variables. Categorical variables must therefore be converted into numerical ones: this process is called *One Hot Encoding*.\n",
"* Standardise the numerical data (zero mean, unit variance)"
]
},
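{
"cell_type": "markdown",
"id": "a91c04e7",
"metadata": {},
"source": [
"As an illustration of the two steps above, here is a minimal sketch on a toy table. The frame `toy`, its values and the `AUTRE` category are assumptions made up for the example, not data from the course; only `preproc.OneHotEncoder` and `preproc.StandardScaler` come from the text.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"import sklearn.preprocessing as preproc\n",
"\n",
"# Toy data (hypothetical values, not taken from the course dataset)\n",
"toy = pd.DataFrame(\n",
"    {\n",
"        \"ENERGIE\": [\"DIESEL\", \"ESSENCE\", \"AUTRE\", \"DIESEL\"],\n",
"        \"AGE\": [25.0, 40.0, 31.0, 58.0],\n",
"    }\n",
")\n",
"\n",
"# One hot encoding: one 0/1 column per category, first category dropped\n",
"ohe = preproc.OneHotEncoder(drop=\"first\", sparse_output=False)\n",
"toy_ohe = pd.DataFrame(\n",
"    ohe.fit_transform(toy[[\"ENERGIE\"]]),\n",
"    columns=ohe.get_feature_names_out([\"ENERGIE\"]),\n",
")\n",
"\n",
"# Standardisation: the numeric column is rescaled to mean 0 and unit variance\n",
"scaler = preproc.StandardScaler()\n",
"toy_scaled = pd.DataFrame(scaler.fit_transform(toy[[\"AGE\"]]), columns=[\"AGE\"])\n",
"\n",
"print(toy_ohe)\n",
"print(toy_scaled)\n",
"```\n",
"\n",
"With `drop=\"first\"`, the first category in alphabetical order (`AUTRE` here) becomes the reference level and gets no column of its own, which avoids a redundant column."
]
},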
{
"cell_type": "markdown",
"id": "95f5cc9f",
"metadata": {},
"source": [
"**Exercise:** write a piece of code that performs the One Hot Encoding of the categorical variables. You can use the \"preproc.OneHotEncoder\" class from the sklearn library"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "b8530717",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.microsoft.datawrangler.viewer.v0+json": {
"columns": [
{
"name": "index",
"rawType": "int64",
"type": "integer"
},
{
"name": "CONTRAT_ANCIENNETE_(0,1]",
"rawType": "float64",
"type": "float"
},
{
"name": "CONTRAT_ANCIENNETE_(1,2]",
"rawType": "float64",
"type": "float"
},
{
"name": "CONTRAT_ANCIENNETE_(2,5]",
"rawType": "float64",
"type": "float"
},
{
"name": "CONTRAT_ANCIENNETE_(5,10]",
"rawType": "float64",
"type": "float"
},
{
"name": "FREQUENCE_PAIEMENT_COTISATION_MENSUEL",
"rawType": "float64",
"type": "float"
},
{
"name": "FREQUENCE_PAIEMENT_COTISATION_TRIMESTRIEL",
"rawType": "float64",
"type": "float"
},
{
"name": "GROUPE_KM_[20000;40000[",
"rawType": "float64",
"type": "float"
},
{
"name": "GROUPE_KM_[40000;60000[",
"rawType": "float64",
"type": "float"
},
{
"name": "GROUPE_KM_[60000;99999[",
"rawType": "float64",
"type": "float"
},
{
"name": "ZONE_RISQUE_B",
"rawType": "float64",
"type": "float"
},
{
"name": "ZONE_RISQUE_C",
"rawType": "float64",
"type": "float"
},
{
"name": "ZONE_RISQUE_D",
"rawType": "float64",
"type": "float"
},
{
"name": "ZONE_RISQUE_E",
"rawType": "float64",
"type": "float"
},
{
"name": "ZONE_RISQUE_F",
"rawType": "float64",
"type": "float"
},
{
"name": "ZONE_RISQUE_G",
"rawType": "float64",
"type": "float"
},
{
"name": "ZONE_RISQUE_H",
"rawType": "float64",
"type": "float"
},
{
"name": "ZONE_RISQUE_I",
"rawType": "float64",
"type": "float"
},
{
"name": "ZONE_RISQUE_J",
"rawType": "float64",
"type": "float"
},
{
"name": "ZONE_RISQUE_K",
"rawType": "float64",
"type": "float"
},
{
"name": "ZONE_RISQUE_L",
"rawType": "float64",
"type": "float"
},
{
"name": "ZONE_RISQUE_M",
"rawType": "float64",
"type": "float"
},
{
"name": "ZONE_RISQUE_T",
"rawType": "float64",
"type": "float"
},
{
"name": "GENRE_M",
"rawType": "float64",
"type": "float"
},
{
"name": "DEUXIEME_CONDUCTEUR_True",
"rawType": "float64",
"type": "float"
},
{
"name": "ENERGIE_DIESEL",
"rawType": "float64",
"type": "float"
},
{
"name": "ENERGIE_ESSENCE",
"rawType": "float64",
"type": "float"
},
{
"name": "EQUIPEMENT_SECURITE_VRAI",
"rawType": "float64",
"type": "float"
},
{
"name": "VALEUR_DU_BIEN_[10000;15000[",
"rawType": "float64",
"type": "float"
},
{
"name": "VALEUR_DU_BIEN_[15000;20000[",
"rawType": "float64",
"type": "float"
},
{
"name": "VALEUR_DU_BIEN_[20000;25000[",
"rawType": "float64",
"type": "float"
},
{
"name": "VALEUR_DU_BIEN_[25000;35000[",
"rawType": "float64",
"type": "float"
},
{
"name": "VALEUR_DU_BIEN_[35000;99999[",
"rawType": "float64",
"type": "float"
}
],
"ref": "e9b7e285-2962-4a24-989a-bde4f9adf740",
"rows": [
[
"0",
"1.0",
"0.0",
"0.0",
"0.0",
"1.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"1.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"1.0",
"0.0",
"0.0",
"1.0",
"1.0",
"0.0",
"1.0",
"0.0",
"0.0",
"0.0"
],
[
"1",
"0.0",
"0.0",
"0.0",
"0.0",
"1.0",
"0.0",
"1.0",
"0.0",
"0.0",
"0.0",
"1.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"1.0",
"1.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"1.0"
],
[
"2",
"0.0",
"0.0",
"0.0",
"0.0",
"1.0",
"0.0",
"1.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"1.0",
"0.0",
"0.0",
"1.0",
"0.0",
"0.0",
"1.0",
"1.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0"
],
[
"3",
"0.0",
"0.0",
"0.0",
"0.0",
"1.0",
"0.0",
"1.0",
"0.0",
"0.0",
"1.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"1.0",
"0.0",
"1.0",
"0.0",
"0.0",
"0.0",
"1.0",
"0.0",
"0.0",
"0.0"
],
[
"4",
"0.0",
"1.0",
"0.0",
"0.0",
"1.0",
"0.0",
"1.0",
"0.0",
"0.0",
"0.0",
"1.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"0.0",
"1.0",
"0.0",
"0.0",
"1.0",
"0.0",
"0.0",
"0.0",
"0.0",
"1.0",
"0.0"
]
],
"shape": {
"columns": 32,
"rows": 5
}
},
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>CONTRAT_ANCIENNETE_(0,1]</th>\n",
" <th>CONTRAT_ANCIENNETE_(1,2]</th>\n",
" <th>CONTRAT_ANCIENNETE_(2,5]</th>\n",
" <th>CONTRAT_ANCIENNETE_(5,10]</th>\n",
" <th>FREQUENCE_PAIEMENT_COTISATION_MENSUEL</th>\n",
" <th>FREQUENCE_PAIEMENT_COTISATION_TRIMESTRIEL</th>\n",
" <th>GROUPE_KM_[20000;40000[</th>\n",
" <th>GROUPE_KM_[40000;60000[</th>\n",
" <th>GROUPE_KM_[60000;99999[</th>\n",
" <th>ZONE_RISQUE_B</th>\n",
" <th>...</th>\n",
" <th>GENRE_M</th>\n",
" <th>DEUXIEME_CONDUCTEUR_True</th>\n",
" <th>ENERGIE_DIESEL</th>\n",
" <th>ENERGIE_ESSENCE</th>\n",
" <th>EQUIPEMENT_SECURITE_VRAI</th>\n",
" <th>VALEUR_DU_BIEN_[10000;15000[</th>\n",
" <th>VALEUR_DU_BIEN_[15000;20000[</th>\n",
" <th>VALEUR_DU_BIEN_[20000;25000[</th>\n",
" <th>VALEUR_DU_BIEN_[25000;35000[</th>\n",
" <th>VALEUR_DU_BIEN_[35000;99999[</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>...</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 32 columns</p>\n",
"</div>"
],
"text/plain": [
" CONTRAT_ANCIENNETE_(0,1] CONTRAT_ANCIENNETE_(1,2] \\\n",
"0 1.0 0.0 \n",
"1 0.0 0.0 \n",
"2 0.0 0.0 \n",
"3 0.0 0.0 \n",
"4 0.0 1.0 \n",
"\n",
" CONTRAT_ANCIENNETE_(2,5] CONTRAT_ANCIENNETE_(5,10] \\\n",
"0 0.0 0.0 \n",
"1 0.0 0.0 \n",
"2 0.0 0.0 \n",
"3 0.0 0.0 \n",
"4 0.0 0.0 \n",
"\n",
" FREQUENCE_PAIEMENT_COTISATION_MENSUEL \\\n",
"0 1.0 \n",
"1 1.0 \n",
"2 1.0 \n",
"3 1.0 \n",
"4 1.0 \n",
"\n",
" FREQUENCE_PAIEMENT_COTISATION_TRIMESTRIEL GROUPE_KM_[20000;40000[ \\\n",
"0 0.0 0.0 \n",
"1 0.0 1.0 \n",
"2 0.0 1.0 \n",
"3 0.0 1.0 \n",
"4 0.0 1.0 \n",
"\n",
" GROUPE_KM_[40000;60000[ GROUPE_KM_[60000;99999[ ZONE_RISQUE_B ... \\\n",
"0 0.0 0.0 0.0 ... \n",
"1 0.0 0.0 0.0 ... \n",
"2 0.0 0.0 0.0 ... \n",
"3 0.0 0.0 1.0 ... \n",
"4 0.0 0.0 0.0 ... \n",
"\n",
" GENRE_M DEUXIEME_CONDUCTEUR_True ENERGIE_DIESEL ENERGIE_ESSENCE \\\n",
"0 1.0 0.0 0.0 1.0 \n",
"1 1.0 1.0 0.0 0.0 \n",
"2 1.0 0.0 0.0 1.0 \n",
"3 1.0 0.0 1.0 0.0 \n",
"4 1.0 0.0 0.0 1.0 \n",
"\n",
" EQUIPEMENT_SECURITE_VRAI VALEUR_DU_BIEN_[10000;15000[ \\\n",
"0 1.0 0.0 \n",
"1 0.0 0.0 \n",
"2 1.0 0.0 \n",
"3 0.0 0.0 \n",
"4 0.0 0.0 \n",
"\n",
" VALEUR_DU_BIEN_[15000;20000[ VALEUR_DU_BIEN_[20000;25000[ \\\n",
"0 1.0 0.0 \n",
"1 0.0 0.0 \n",
"2 0.0 0.0 \n",
"3 1.0 0.0 \n",
"4 0.0 0.0 \n",
"\n",
" VALEUR_DU_BIEN_[25000;35000[ VALEUR_DU_BIEN_[35000;99999[ \n",
"0 0.0 0.0 \n",
"1 0.0 1.0 \n",
"2 0.0 0.0 \n",
"3 0.0 0.0 \n",
"4 1.0 0.0 \n",
"\n",
"[5 rows x 32 columns]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# One hot encoding of the categorical variables\n",
"preproc_ohe = preproc.OneHotEncoder(drop=\"first\", sparse_output=False).fit(\n",
"    vars_categorielles,\n",
")\n",
"\n",
"variables_categorielles_ohe = preproc_ohe.transform(vars_categorielles)\n",
"variables_categorielles_ohe = pd.DataFrame(\n",
"    variables_categorielles_ohe,\n",
"    columns=preproc_ohe.get_feature_names_out(vars_categorielles.columns),\n",
")\n",
"variables_categorielles_ohe.head()"
]
},
{
"cell_type": "markdown",
"id": "b70abc5c",
"metadata": {},
"source": [
"**Exercise:** write a piece of code that standardises the numerical variables in the dataset. You can use the \"preproc.StandardScaler\" class from the sklearn library"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "4ff3847d",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.microsoft.datawrangler.viewer.v0+json": {
"columns": [
{
"name": "index",
"rawType": "int64",
"type": "integer"
},
{
"name": "ANNEE_CTR",
"rawType": "float64",
"type": "float"
},
{
"name": "AGE_ASSURE_PRINCIPAL",
"rawType": "float64",
"type": "float"
},
{
"name": "ANCIENNETE_PERMIS",
"rawType": "float64",
"type": "float"
},
{
"name": "ANNEE_CONSTRUCTION",
"rawType": "float64",
"type": "float"
}
],
"ref": "e1f3a979-59a5-4ca5-8c34-9aaf039cc275",
"rows": [
[
"0",
"0.40615626262983295",
"-0.31764836563527515",
"0.067767057718506",
"0.5653698304986595"
],
[
"1",
"1.06626032654885",
"-1.2596885906311412",
"-1.1719751563806404",
"0.8816391722032739"
],
[
"2",
"0.40615626262983295",
"-1.839405652167059",
"-1.740190337842749",
"0.5653698304986595"
],
[
"3",
"0.40615626262983295",
"-0.31764836563527515",
"0.48101446241822143",
"0.8816391722032739"
],
[
"4",
"-0.25394780128918387",
"-1.7669410194750692",
"-1.2752870075555691",
"-0.38343819461518397"
]
],
"shape": {
"columns": 4,
"rows": 5
}
},
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ANNEE_CTR</th>\n",
" <th>AGE_ASSURE_PRINCIPAL</th>\n",
" <th>ANCIENNETE_PERMIS</th>\n",
" <th>ANNEE_CONSTRUCTION</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.406156</td>\n",
" <td>-0.317648</td>\n",
" <td>0.067767</td>\n",
" <td>0.565370</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1.066260</td>\n",
" <td>-1.259689</td>\n",
" <td>-1.171975</td>\n",
" <td>0.881639</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.406156</td>\n",
" <td>-1.839406</td>\n",
" <td>-1.740190</td>\n",
" <td>0.565370</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.406156</td>\n",
" <td>-0.317648</td>\n",
" <td>0.481014</td>\n",
" <td>0.881639</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>-0.253948</td>\n",
" <td>-1.766941</td>\n",
" <td>-1.275287</td>\n",
" <td>-0.383438</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ANNEE_CTR AGE_ASSURE_PRINCIPAL ANCIENNETE_PERMIS ANNEE_CONSTRUCTION\n",
"0 0.406156 -0.317648 0.067767 0.565370\n",
"1 1.066260 -1.259689 -1.171975 0.881639\n",
"2 0.406156 -1.839406 -1.740190 0.565370\n",
"3 0.406156 -0.317648 0.481014 0.881639\n",
"4 -0.253948 -1.766941 -1.275287 -0.383438"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Standardisation of the numerical variables\n",
"preproc_scale = preproc.StandardScaler(with_mean=True, with_std=True)\n",
"preproc_scale.fit(vars_numeriques)\n",
"\n",
"vars_numeriques_scaled = preproc_scale.transform(vars_numeriques)\n",
"vars_numeriques_scaled = pd.DataFrame(\n",
"    vars_numeriques_scaled,\n",
"    columns=vars_numeriques.columns,\n",
")\n",
"vars_numeriques_scaled.head()"
]
},
{
"cell_type": "markdown",
"id": "62d49546",
"metadata": {},
"source": [
"#### Sampling"
]
},
{
"cell_type": "markdown",
"id": "64d229f4",
"metadata": {},
"source": [
"**Exercise:** write a piece of code that builds the training set (80% of the data) and the test set (20%)."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "6a1c7907",
"metadata": {},
"outputs": [],
"source": [
"X_global = vars_numeriques_scaled.merge(\n",
"    variables_categorielles_ohe,\n",
"    left_index=True,\n",
"    right_index=True,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "58a14153",
"metadata": {},
"outputs": [],
"source": [
"# Reorganise the data into features and target\n",
"X = X_global.to_numpy()\n",
"Y = data_model[\"CM\"]\n",
"\n",
"# Split into 80% train and 20% test\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    X,\n",
"    Y,\n",
"    test_size=0.2,\n",
"    random_state=42,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "84dc7a07",
"metadata": {},
"source": [
"#### Fitting"
]
},
{
"cell_type": "markdown",
"id": "97c7b783",
"metadata": {},
"source": [
"**Exercise:** write a piece of code that builds the model"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "053e013c",
"metadata": {},
"outputs": [],
"source": [
"# Initialise the model object\n",
"model_CART = DecisionTreeRegressor()\n",
"\n",
"# Train the decision tree regressor\n",
"model_CART = model_CART.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"id": "8d624704",
"metadata": {},
"source": [
"**Exercise:** write a piece of code that evaluates the model's performance (MAE, MSE and RMSE)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "c4ca2cf9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MAE: 0.00\n",
"MSE: 0.00\n",
"RMSE: 0.00\n"
]
}
],
"source": [
"# Predictions on the training set\n",
"y_pred_train = model_CART.predict(X_train)\n",
"\n",
"mae = metrics.mean_absolute_error(y_train, y_pred_train)\n",
"mse = metrics.mean_squared_error(y_train, y_pred_train)\n",
"rmse = metrics.root_mean_squared_error(y_train, y_pred_train)\n",
"\n",
"print(f\"MAE: {mae:.2f}\")\n",
"print(f\"MSE: {mse:.2f}\")\n",
"print(f\"RMSE: {rmse:.2f}\")"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "4b739d5b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MAE: 5047.78\n",
"MSE: 88252585.29\n",
"RMSE: 9394.28\n"
]
}
],
"source": [
"# Predictions on the test set\n",
"y_pred_test = model_CART.predict(X_test)\n",
"\n",
"mae = metrics.mean_absolute_error(y_test, y_pred_test)\n",
"mse = metrics.mean_squared_error(y_test, y_pred_test)\n",
"rmse = metrics.root_mean_squared_error(y_test, y_pred_test)\n",
"\n",
"print(f\"MAE: {mae:.2f}\")\n",
"print(f\"MSE: {mse:.2f}\")\n",
"print(f\"RMSE: {rmse:.2f}\")"
]
},
{
"cell_type": "markdown",
"id": "fb2fe98c",
"metadata": {},
"source": [
"**Question:** what do you think of this model's performance?"
]
},
{
"cell_type": "markdown",
"id": "bdd7ccd6",
"metadata": {},
"source": [
"*Answer*: \n",
"\n",
"**Mean Absolute Error (MAE):** the MAE is the average absolute gap between the model's predictions and the actual values. A test MAE of 5047.78 means that, on average, the model is off by that amount, expressed in the unit of the target variable. It is the most direct indicator of the average prediction error.\n",
"\n",
"**Root Mean Squared Error (RMSE):** the RMSE is the square root of the mean of the squared errors ($RMSE = \\sqrt{MSE}$). Because the errors are squared, this metric is particularly sensitive to large errors. The test value is 9394.28, well above the MAE, which suggests that a few predictions are far off.\n",
"\n",
"**Train versus test:** on the training set all three metrics are 0.00, whereas the test errors are large. The unconstrained tree has memorised the training data and generalises poorly, which is a sign of overfitting."
]
},
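{
"cell_type": "markdown",
"id": "d27b8f15",
"metadata": {},
"source": [
"The gap between the training errors (0.00) and the test errors printed above is typical of a tree grown without any depth limit. The sketch below reproduces that behaviour on synthetic data; `X_demo`, `y_demo`, `results` and the choice `max_depth=3` are assumptions for illustration, not part of the course material.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn import metrics\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"# Synthetic data (assumption: stands in for the real features and target)\n",
"rng = np.random.default_rng(0)\n",
"X_demo = rng.normal(size=(300, 4))\n",
"y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=300)\n",
"\n",
"X_tr, X_te, y_tr, y_te = train_test_split(\n",
"    X_demo, y_demo, test_size=0.2, random_state=42\n",
")\n",
"\n",
"# An unconstrained tree can memorise the training set; max_depth limits that\n",
"results = {}\n",
"for depth in [None, 3]:\n",
"    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_tr, y_tr)\n",
"    mae_train = metrics.mean_absolute_error(y_tr, tree.predict(X_tr))\n",
"    mae_test = metrics.mean_absolute_error(y_te, tree.predict(X_te))\n",
"    results[depth] = (mae_train, mae_test)\n",
"    print(f\"max_depth={depth}: train MAE = {mae_train:.2f}, test MAE = {mae_test:.2f}\")\n",
"```\n",
"\n",
"Comparing the two lines of output shows how much of the training score comes from memorisation rather than from structure that carries over to unseen data."
]
},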
{
"cell_type": "markdown",
"id": "7ecba832",
"metadata": {},
"source": [
"## Supervised algorithm: Random Forest"
]
},
{
"cell_type": "markdown",
"id": "efcb8987",
"metadata": {},
"source": [
"At this point, we have seen the steps required to run a Machine Learning algorithm. These steps alone, however, are not enough to build a well-performing model.\n",
"To do so, the Data Scientist has to act on how the model is trained. In what follows we will:\n",
"* Switch to a more powerful algorithm (Random Forest)\n",
"* Run a *grid search* over the model's parameters\n",
"* Train the model with cross-validation\n"
]
},
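{
"cell_type": "markdown",
"id": "e5c3a0b9",
"metadata": {},
"source": [
"scikit-learn also ships `GridSearchCV`, which combines a grid search with cross-validation in a single object. The sketch below is an alternative to the manual loops used in this notebook, run on synthetic data; `X_demo`, `y_demo` and the grid values are assumptions chosen to keep it fast.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.model_selection import GridSearchCV, KFold\n",
"\n",
"# Synthetic regression data (assumption: stands in for the real X and Y)\n",
"rng = np.random.default_rng(42)\n",
"X_demo = rng.normal(size=(200, 5))\n",
"y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)\n",
"\n",
"# Small illustrative grid; the values are arbitrary\n",
"param_grid = {\"n_estimators\": [20, 40], \"min_samples_split\": [5, 10]}\n",
"\n",
"search = GridSearchCV(\n",
"    RandomForestRegressor(random_state=42),\n",
"    param_grid=param_grid,\n",
"    cv=KFold(n_splits=5),\n",
"    scoring=\"neg_mean_absolute_error\",  # sklearn maximises scores, hence the sign\n",
")\n",
"search.fit(X_demo, y_demo)\n",
"\n",
"print(search.best_params_)\n",
"print(f\"CV MAE: {-search.best_score_:.3f}\")\n",
"```\n",
"\n",
"Writing the loops by hand, as done in the rest of this section, makes each step visible; `GridSearchCV` is mainly a more compact way to express the same search."
]
},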
{
"cell_type": "markdown",
"id": "d6723a2f",
"metadata": {},
"source": [
"### Model with cross-validation"
]
},
{
"cell_type": "markdown",
"id": "3716b09f",
"metadata": {},
"source": [
"#### Sampling"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "ab1e1367",
"metadata": {},
"outputs": [],
"source": [
"X_global = vars_numeriques_scaled.merge(\n",
"    variables_categorielles_ohe,\n",
"    left_index=True,\n",
"    right_index=True,\n",
")\n",
"\n",
"# Reorganise the data into features and target\n",
"X = X_global.to_numpy()\n",
"Y = np.array(data_model[\"CM\"])"
]
},
{
"cell_type": "markdown",
"id": "3f5d735e",
"metadata": {},
"source": [
"#### Fitting with cross-validation"
]
},
{
"cell_type": "markdown",
"id": "bc819f8f",
"metadata": {},
"source": [
"**Exercise:** build an RF model (RandomForestRegressor) using cross-validation. Remember to store the model's performance (MAE, MSE and RMSE) for each fold in a variable or list."
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "b515460e",
"metadata": {},
"outputs": [],
"source": [
"# Initialisation\n",
"# Number of folds for the cross-validation\n",
"num_splits = 5\n",
"\n",
"# Random Forest regressor\n",
"rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)\n",
"\n",
"# Initialise the KFold cross-validation splitter\n",
"kf = KFold(n_splits=num_splits)\n",
"\n",
"# Lists that store the model's performance on each fold\n",
"MAE_scores = []\n",
"MSE_scores = []\n",
"RMSE_scores = []"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "eebb394f",
"metadata": {},
"outputs": [],
"source": [
"# Training with cross-validation\n",
"for train_index, test_index in kf.split(X):\n",
"    X_train, X_test = X[train_index], X[test_index]\n",
"    y_train, y_test = Y[train_index], Y[test_index]\n",
"\n",
"    # Fitting\n",
"    rf_regressor.fit(X_train, y_train)\n",
"\n",
"    # Model evaluation on the held-out fold\n",
"    y_pred_test = rf_regressor.predict(X_test)\n",
"\n",
"    MAE = metrics.mean_absolute_error(y_test, y_pred_test)\n",
"    MSE = metrics.mean_squared_error(y_test, y_pred_test)\n",
"    RMSE = metrics.root_mean_squared_error(y_test, y_pred_test)\n",
"\n",
"    # Store the results\n",
"    MAE_scores.append(MAE)\n",
"    MSE_scores.append(MSE)\n",
"    RMSE_scores.append(RMSE)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "b067126c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fold 1 MAE: 4007.8326951515155\n",
"Fold 2 MAE: 3651.8632978787878\n",
"Fold 3 MAE: 4718.226707878788\n",
"Fold 4 MAE: 4031.310562727273\n",
"Fold 5 MAE: 4410.05992957317\n"
]
}
],
"source": [
"# Metrics on all folds\n",
"\n",
"# MAE\n",
"for fold, mae in enumerate(MAE_scores, start=1):\n",
"    print(f\"Fold {fold} MAE:\", mae)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "6597152c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fold 1 MSE: 32761893.668576293\n",
"Fold 2 MSE: 50894497.0512714\n",
"Fold 3 MSE: 106861487.03512044\n",
"Fold 4 MSE: 35487273.569623545\n",
"Fold 5 MSE: 54729524.04672807\n"
]
}
],
"source": [
"# MSE\n",
"for fold, mse in enumerate(MSE_scores, start=1):\n",
"    print(f\"Fold {fold} MSE:\", mse)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "63ff1c9d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fold 1 RMSE: 5723.8006314490285\n",
"Fold 2 RMSE: 7134.037920509772\n",
"Fold 3 RMSE: 10337.38298773536\n",
"Fold 4 RMSE: 5957.119569861222\n",
"Fold 5 RMSE: 7397.940527385177\n"
]
}
],
"source": [
"# RMSE\n",
"for fold, rmse in enumerate(RMSE_scores, start=1):\n",
"    print(f\"Fold {fold} RMSE:\", rmse)"
]
},
{
"cell_type": "markdown",
"id": "ec1961c2",
"metadata": {},
"source": [
"**Question:** comment on the results."
]
},
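{
"cell_type": "markdown",
"id": "f40d6a2c",
"metadata": {},
"source": [
"One possible way to start commenting on the results is to summarise the folds by their mean and standard deviation. The values below are copied, rounded to two decimals, from the outputs printed above; the names `mae_folds` and `rmse_folds` are hypothetical.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Per-fold scores copied (rounded) from the outputs printed above\n",
"mae_folds = np.array([4007.83, 3651.86, 4718.23, 4031.31, 4410.06])\n",
"rmse_folds = np.array([5723.80, 7134.04, 10337.38, 5957.12, 7397.94])\n",
"\n",
"# Mean and sample standard deviation across the five folds\n",
"for name, scores in [(\"MAE\", mae_folds), (\"RMSE\", rmse_folds)]:\n",
"    print(f\"{name}: mean = {scores.mean():.2f}, std = {scores.std(ddof=1):.2f}\")\n",
"```\n",
"\n",
"Fold 3 has the largest MAE and RMSE, so the spread across folds is worth mentioning alongside the averages."
]
},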
{
"cell_type": "markdown",
"id": "5a8163ef",
"metadata": {},
"source": [
"### Adding a Grid Search over the hyperparameters"
]
},
{
"cell_type": "markdown",
"id": "5a6adbfe",
"metadata": {},
"source": [
"#### Sampling"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "d9342ad6",
"metadata": {},
"outputs": [],
"source": [
"X_global = vars_numeriques_scaled.merge(\n",
"    variables_categorielles_ohe,\n",
"    left_index=True,\n",
"    right_index=True,\n",
")\n",
"# Reorganise the data into features and target\n",
"X = X_global.to_numpy()\n",
"Y = np.array(data_model[\"CM\"])"
]
},
{
"cell_type": "markdown",
"id": "dce52b11",
"metadata": {},
"source": [
"#### Fitting with Cross-Validation and *Grid Search*"
]
},
{
"cell_type": "markdown",
"id": "7e3a9dd0",
"metadata": {},
"source": [
"**Exercise:** Add a Grid Search to find the model's optimal hyperparameters."
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "6d58dbc2",
"metadata": {},
"outputs": [],
"source": [
"# Initialization\n",
"# Number of folds for the cross-validation\n",
"num_splits = 5\n",
"\n",
"# KFold cross-validation splitter\n",
"kf = KFold(n_splits=num_splits)\n",
"\n",
"# Lists recording the model's per-fold performance\n",
"MAE_scores = []\n",
"MSE_scores = []\n",
"RMSE_scores = []\n",
"\n",
"# Hyperparameter values to test\n",
"n_estimators_values = [60, 65, 70, 75]\n",
"max_depth_values = [None, 1, 2, 3]\n",
"min_samples_split_values = [5, 8, 10, 11, 13, 14, 15]\n",
"\n",
"# Best score and best parameters found so far\n",
"best_score = np.inf\n",
"best_params = {}\n",
"\n",
"MAE_best_score = []\n",
"MSE_best_score = []\n",
"RMSE_best_score = []"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "47da5172",
"metadata": {},
"outputs": [],
"source": [
"# Manual grid search\n",
"for n_estimators in n_estimators_values:\n",
"    for max_depth in max_depth_values:\n",
"        for min_samples_split in min_samples_split_values:\n",
"            # Reset the per-fold scores\n",
"            MAE_scores = []\n",
"            MSE_scores = []\n",
"            RMSE_scores = []\n",
"\n",
"            # Cross-validation loop\n",
"            for train_index, test_index in kf.split(X):\n",
"                X_train, X_test = X[train_index], X[test_index]\n",
"                y_train, y_test = Y[train_index], Y[test_index]\n",
"\n",
"                # Model with the current hyperparameters\n",
"                rf_regressor = RandomForestRegressor(\n",
"                    n_estimators=n_estimators,\n",
"                    max_depth=max_depth,\n",
"                    min_samples_split=min_samples_split,\n",
"                    random_state=42,\n",
"                )\n",
"\n",
"                rf_regressor.fit(X_train, y_train)\n",
"\n",
"                # Model evaluation\n",
"                y_pred_test = rf_regressor.predict(X_test)\n",
"\n",
"                MAE = metrics.mean_absolute_error(y_test, y_pred_test)\n",
"                MSE = metrics.mean_squared_error(y_test, y_pred_test)\n",
"                RMSE = metrics.root_mean_squared_error(y_test, y_pred_test)\n",
"\n",
"                # Store the scores of this fold\n",
"                MAE_scores.append(MAE)\n",
"                MSE_scores.append(MSE)\n",
"                RMSE_scores.append(RMSE)\n",
"\n",
"            # Score kept for this parameter set: the lowest RMSE among the folds\n",
"            # (averaging over the folds with np.mean is the more usual criterion)\n",
"            min_rmse = np.min(RMSE_scores)\n",
"\n",
"            # Update the best score if needed\n",
"            if min_rmse < best_score:\n",
"                best_score = min_rmse\n",
"                best_params = {\n",
"                    \"n_estimators\": n_estimators,\n",
"                    \"max_depth\": max_depth,\n",
"                    \"min_samples_split\": min_samples_split,\n",
"                }\n",
"\n",
"                # Keep the per-fold scores for analysis\n",
"                MAE_best_score = MAE_scores\n",
"                MSE_best_score = MSE_scores\n",
"                RMSE_best_score = RMSE_scores\n",
"\n",
"                # Keep the model (fitted on the last fold) for direct reuse\n",
"                best_model_regressor = rf_regressor"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "d4936c46",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Meilleurs paramètres: {'n_estimators': 65, 'max_depth': 1, 'min_samples_split': 5}\n",
"Meilleure RMSE : 4548.156488811854\n"
]
}
],
"source": [
"# Best results (best_score is the lowest single-fold RMSE of the selected parameter set)\n",
"print(\"Meilleurs paramètres:\", best_params)\n",
"print(\"Meilleure RMSE :\", best_score)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "3215c463",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fold 1 RMSE: 5168.96443207593\n",
"Fold 2 RMSE: 6779.919772901815\n",
"Fold 3 RMSE: 10081.628056733409\n",
"Fold 4 RMSE: 4548.156488811854\n",
"Fold 5 RMSE: 6713.822743503048\n"
]
}
],
"source": [
"# Metrics for every fold\n",
"\n",
"# RMSE\n",
"for fold, rmse in enumerate(RMSE_best_score, start=1):\n",
"    print(f\"Fold {fold} RMSE:\", rmse)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "bb9a5c9b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fold 1 MSE: 26718193.300066035\n",
"Fold 2 MSE: 45967312.126985006\n",
"Fold 3 MSE: 101639224.27431424\n",
"Fold 4 MSE: 20685727.446721368\n",
"Fold 5 MSE: 45075415.831178784\n"
]
}
],
"source": [
"# MSE\n",
"for fold, mse in enumerate(MSE_best_score, start=1):\n",
"    print(f\"Fold {fold} MSE:\", mse)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "0f0768ad",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fold 1 MAE: 3516.8014139306597\n",
"Fold 2 MAE: 3209.253810522964\n",
"Fold 3 MAE: 4545.1440942571835\n",
"Fold 4 MAE: 3088.226098509521\n",
"Fold 5 MAE: 3576.4647056529234\n"
]
}
],
"source": [
"# MAE\n",
"for fold, mae in enumerate(MAE_best_score, start=1):\n",
"    print(f\"Fold {fold} MAE:\", mae)"
]
},
{
"cell_type": "markdown",
"id": "802a625f",
"metadata": {},
"source": [
"**Question:** Comment on the results."
]
},
{
"cell_type": "markdown",
"id": "bd1e91ee",
"metadata": {},
"source": [
"### Implementation with existing libraries"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "4b8cc48d",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.model_selection import GridSearchCV, KFold"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "f0e5d591",
"metadata": {},
"outputs": [],
"source": [
"# Split into 80% train and 20% test\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    X,\n",
"    Y,\n",
"    test_size=0.2,\n",
"    random_state=42,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "71177a63",
"metadata": {},
"outputs": [],
"source": [
"# Uses the training data X_train and y_train from the split above\n",
"\n",
"# Hyperparameter grid to search\n",
"param_grid = {\n",
"    \"n_estimators\": [60, 65, 70, 75],\n",
"    \"max_depth\": [None, 1, 2, 3],\n",
"    \"min_samples_split\": [5, 8, 10, 11, 13, 14, 15],\n",
"}\n",
"# Number of folds for the cross-validation\n",
"num_folds = 5"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "e463b9d7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Meilleurs hyperparamètres : {'max_depth': 1, 'min_samples_split': 5, 'n_estimators': 60}\n"
]
}
],
"source": [
"# Initialize the RandomForestRegressor model\n",
"rf = RandomForestRegressor(random_state=42)\n",
"\n",
"# GridSearchCV object: grid search with cross-validation\n",
"grid_search = GridSearchCV(\n",
"    estimator=rf,\n",
"    param_grid=param_grid,\n",
"    cv=KFold(\n",
"        n_splits=num_folds,\n",
"        shuffle=True,\n",
"        random_state=42,\n",
"    ),  # 5-fold cross-validation\n",
"    scoring=\"neg_mean_squared_error\",  # Negated MSE: the score is maximized, so closer to 0 is better\n",
"    n_jobs=-1,  # Use all CPU cores\n",
")\n",
"\n",
"# Run the grid search\n",
"grid_search.fit(X_train, y_train)\n",
"\n",
"# Display the best hyperparameters\n",
"best_params = grid_search.best_params_\n",
"print(\"Meilleurs hyperparamètres : \", best_params)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "d1b84e91",
"metadata": {},
"outputs": [],
"source": [
"# Initialize the final model with the best hyperparameters\n",
"best_rf = RandomForestRegressor(random_state=42, **best_params)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "c46d32a7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RMSE pour le fold 1: -8836.353449486982\n",
"RMSE pour le fold 2: -5242.128416843558\n",
"RMSE pour le fold 3: -7205.432382938018\n",
"RMSE pour le fold 4: -4902.177844748944\n",
"RMSE pour le fold 5: -7707.687751500834\n",
"\n",
"\n",
"MSE pour le fold 1: -78081142.28426048\n",
"MSE pour le fold 2: -27479910.338678744\n",
"MSE pour le fold 3: -51918255.825091854\n",
"MSE pour le fold 4: -24031347.6215474\n",
"MSE pour le fold 5: -59408450.47463598\n",
"\n",
"\n",
"MAE pour le fold 1: -4047.520107345083\n",
"MAE pour le fold 2: -3389.6166968886077\n",
"MAE pour le fold 3: -3373.620497619359\n",
"MAE pour le fold 4: -3186.2100657449696\n",
"MAE pour le fold 5: -4145.078817961569\n"
]
}
],
"source": [
"# Cross-validation\n",
"# The neg_* scorers return the negated error, so the printed values are negative\n",
"# RMSE of each fold\n",
"rmse_scores = cross_val_score(\n",
"    best_rf,\n",
"    X_train,\n",
"    y_train,\n",
"    cv=num_folds,\n",
"    scoring=\"neg_root_mean_squared_error\",\n",
")\n",
"\n",
"# Display the score of each fold\n",
"for i, score in enumerate(rmse_scores):\n",
"    print(f\"RMSE pour le fold {i + 1}: {score}\")\n",
"\n",
"# MSE of each fold\n",
"mse_scores = cross_val_score(\n",
"    best_rf,\n",
"    X_train,\n",
"    y_train,\n",
"    cv=num_folds,\n",
"    scoring=\"neg_mean_squared_error\",\n",
")\n",
"\n",
"# Display the score of each fold\n",
"print(\"\\n\")\n",
"for i, score in enumerate(mse_scores):\n",
"    print(f\"MSE pour le fold {i + 1}: {score}\")\n",
"\n",
"# MAE of each fold\n",
"mae_scores = cross_val_score(\n",
"    best_rf,\n",
"    X_train,\n",
"    y_train,\n",
"    cv=num_folds,\n",
"    scoring=\"neg_mean_absolute_error\",\n",
")\n",
"\n",
"# Display the score of each fold\n",
"print(\"\\n\")\n",
"for i, score in enumerate(mae_scores):\n",
"    print(f\"MAE pour le fold {i + 1}: {score}\")"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "3ba2274c",
"metadata": {},
"outputs": [],
"source": [
"# Fit the final model on the full training set\n",
"best_rf.fit(X_train, y_train)\n",
"\n",
"# Predict on the test set\n",
"y_pred = best_rf.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "ec717a0c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RMSE : 6792.775060864194\n",
"MSE : 46141793.02749855\n",
"MAE : 3387.6746891178996\n"
]
}
],
"source": [
"# Compute the performance metric (here, RMSE)\n",
"rmse = metrics.root_mean_squared_error(y_test, y_pred)\n",
"print(f\"RMSE : {rmse}\")\n",
"\n",
"# Compute the performance metric (here, MSE)\n",
"mse = metrics.mean_squared_error(y_test, y_pred)\n",
"print(f\"MSE : {mse}\")\n",
"\n",
"# Compute the performance metric (here, MAE)\n",
"mae = metrics.mean_absolute_error(y_test, y_pred)\n",
"print(f\"MAE : {mae}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "001baf7d",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "studies",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}