{
"cells": [
{
"cell_type": "markdown",
"id": "8750d15b",
"metadata": {},
"source": [
"# Course 3: Machine Learning - Supervised algorithms (1/2)"
]
},
{
"cell_type": "markdown",
"id": "f7c08ae5",
"metadata": {},
"source": [
"## Preamble"
]
},
{
"cell_type": "markdown",
"id": "ec7ecb4b",
"metadata": {},
"source": [
"The objectives of this session (3 h) are:\n",
"* Prepare the modelling datasets (sampling)\n",
"* Apply a simple supervised model\n",
"* Build a machine learning model (cross-validation and hyperparameter tuning) to solve a regression problem\n",
"* Analyse the model's performance"
]
},
{
"cell_type": "markdown",
"id": "4e99c600",
"metadata": {},
"source": [
"## Workspace setup"
]
},
{
"cell_type": "markdown",
"id": "c1b01045",
"metadata": {},
"source": [
"### Library imports"
]
},
{
"cell_type": "code",
"execution_count": 157,
"id": "97d58527",
"metadata": {},
"outputs": [],
"source": [
"# Data\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Plotting\n",
"import seaborn as sns\n",
"\n",
"sns.set()\n",
"import plotly.express as px\n",
"import plotly.graph_objects as gp\n",
"import sklearn.preprocessing as preproc\n",
"\n",
"# Statistics\n",
"from scipy.stats import chi2_contingency\n",
"from sklearn import metrics\n",
"\n",
"# Machine learning\n",
"from sklearn.cluster import KMeans\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.model_selection import KFold, train_test_split\n",
"from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor"
]
},
{
"cell_type": "markdown",
"id": "06153286",
"metadata": {},
"source": [
"### Function definitions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c67db932",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "985e4e97",
"metadata": {},
"source": [
"### Constants"
]
},
{
"cell_type": "code",
"execution_count": 158,
"id": "c9597b48",
"metadata": {},
"outputs": [],
"source": [
"input_path = \"./1_inputs\"\n",
"output_path = \"./2_outputs\""
]
},
{
"cell_type": "markdown",
"id": "b2b035d2",
"metadata": {},
"source": [
"### Data import"
]
},
{
"cell_type": "code",
"execution_count": 159,
"id": "8051b5f4",
"metadata": {},
"outputs": [],
"source": [
"path = input_path + \"/base_retraitee.csv\"\n",
"data_retraitee = pd.read_csv(path, sep=\",\", decimal=\".\")"
]
},
{
"cell_type": "markdown",
"id": "a2578ba1",
"metadata": {},
"source": [
"## Supervised algorithm: CART"
]
},
{
"cell_type": "markdown",
"id": "aaa0b27d",
"metadata": {},
"source": [
"In this part, the goal is to build a simple model (the CART algorithm) in order to walk through the steps needed to fit a model.\n",
"We will model the claim cost directly."
]
},
{
"cell_type": "markdown",
"id": "a0458a05",
"metadata": {},
"source": [
"### Building the model"
]
},
{
"cell_type": "markdown",
"id": "b3715c37",
"metadata": {},
"source": [
"The first step is to compute the average cost per claim (the target, or response variable). This is the variable to predict from the explanatory variables."
]
},
{
"cell_type": "code",
"execution_count": 160,
"id": "c427a4b8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(824, 14)"
]
},
"execution_count": 160,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_model = data_retraitee.copy()\n",
"\n",
"# Keep only the rows with at least one claim (NB > 0)\n",
"data_model = data_model[data_model[\"NB\"] > 0]\n",
"\n",
"# Compute the \"theoretical\" average claim cost: total charge divided by number of claims\n",
"data_model[\"CM\"] = data_model[\"CHARGE\"] / data_model[\"NB\"]\n",
"data_model = data_model.drop([\"CHARGE\", \"NB\", \"EXPO\"], axis=1)\n",
"data_model.shape"
]
},
{
"cell_type": "markdown",
"id": "e3e85088",
"metadata": {},
"source": [
"**Exercise:** compute descriptive statistics for the dataset used."
]
},
{
"cell_type": "code",
"execution_count": 161,
"id": "c8fd3ee1",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.microsoft.datawrangler.viewer.v0+json": {
"columns": [
{
"name": "index",
"rawType": "object",
"type": "string"
},
{
"name": "ANNEE_CTR",
"rawType": "float64",
"type": "float"
},
{
"name": "CONTRAT_ANCIENNETE",
"rawType": "object",
"type": "unknown"
},
{
"name": "FREQUENCE_PAIEMENT_COTISATION",
"rawType": "object",
"type": "unknown"
},
{
"name": "GROUPE_KM",
"rawType": "object",
"type": "unknown"
},
{
"name": "ZONE_RISQUE",
"rawType": "object",
"type": "unknown"
},
{
"name": "AGE_ASSURE_PRINCIPAL",
"rawType": "float64",
"type": "float"
},
{
"name": "GENRE",
"rawType": "object",
"type": "unknown"
},
{
"name": "DEUXIEME_CONDUCTEUR",
"rawType": "object",
"type": "unknown"
},
{
"name": "ANCIENNETE_PERMIS",
"rawType": "float64",
"type": "float"
},
{
"name": "ANNEE_CONSTRUCTION",
"rawType": "float64",
"type": "float"
},
{
"name": "ENERGIE",
"rawType": "object",
"type": "unknown"
},
{
"name": "EQUIPEMENT_SECURITE",
"rawType": "object",
"type": "unknown"
},
{
"name": "VALEUR_DU_BIEN",
"rawType": "object",
"type": "unknown"
},
{
"name": "CM",
"rawType": "float64",
"type": "float"
}
],
"ref": "e80a8f38-8160-41fb-bbfa-ae1f7b39de11",
"rows": [
[
"count",
"824.0",
"824",
"824",
"824",
"824",
"824.0",
"824",
"824",
"824.0",
"824.0",
"824",
"824",
"824",
"824.0"
],
[
"unique",
null,
"5",
"3",
"4",
"14",
null,
"2",
"2",
null,
null,
"3",
"2",
"6",
null
],
[
"top",
null,
"(0,1]",
"MENSUEL",
"[0;20000[",
"C",
null,
"M",
"False",
null,
null,
"ESSENCE",
"FAUX",
"[10000;15000[",
null
],
[
"freq",
null,
"297",
"398",
"391",
"269",
null,
"483",
"663",
null,
null,
"413",
"517",
"213",
null
],
[
"mean",
"2018.384708737864",
null,
null,
null,
null,
"44.383495145631066",
null,
null,
"35.68810679611651",
"2015.2123786407767",
null,
null,
null,
"4246.01697815534"
],
[
"std",
"1.515832735580178",
null,
null,
null,
null,
"13.808216667998865",
null,
null,
"19.370620845496358",
"3.1637823115731556",
null,
null,
null,
"6869.61691660173"
],
[
"min",
"2016.0",
null,
null,
null,
null,
"19.0",
null,
null,
"1.0",
"1998.0",
null,
null,
null,
"7.5"
],
[
"25%",
"2017.0",
null,
null,
null,
null,
"34.0",
null,
null,
"18.0",
"2014.0",
null,
null,
null,
"1159.96125"
],
[
"50%",
"2018.0",
null,
null,
null,
null,
"43.0",
null,
null,
"35.0",
"2016.0",
null,
null,
null,
"2541.6499999999996"
],
[
"75%",
"2020.0",
null,
null,
null,
null,
"53.0",
null,
null,
"53.0",
"2017.0",
null,
null,
null,
"4193.797500000001"
],
[
"max",
"2021.0",
null,
null,
null,
null,
"94.0",
null,
null,
"70.0",
"2021.0",
null,
null,
null,
"83421.85"
]
],
"shape": {
"columns": 14,
"rows": 11
}
},
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ANNEE_CTR</th>\n",
" <th>CONTRAT_ANCIENNETE</th>\n",
" <th>FREQUENCE_PAIEMENT_COTISATION</th>\n",
" <th>GROUPE_KM</th>\n",
" <th>ZONE_RISQUE</th>\n",
" <th>AGE_ASSURE_PRINCIPAL</th>\n",
" <th>GENRE</th>\n",
" <th>DEUXIEME_CONDUCTEUR</th>\n",
" <th>ANCIENNETE_PERMIS</th>\n",
" <th>ANNEE_CONSTRUCTION</th>\n",
" <th>ENERGIE</th>\n",
" <th>EQUIPEMENT_SECURITE</th>\n",
" <th>VALEUR_DU_BIEN</th>\n",
" <th>CM</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>824.000000</td>\n",
" <td>824</td>\n",
" <td>824</td>\n",
" <td>824</td>\n",
" <td>824</td>\n",
" <td>824.000000</td>\n",
" <td>824</td>\n",
" <td>824</td>\n",
" <td>824.000000</td>\n",
" <td>824.000000</td>\n",
" <td>824</td>\n",
" <td>824</td>\n",
" <td>824</td>\n",
" <td>824.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>unique</th>\n",
" <td>NaN</td>\n",
" <td>5</td>\n",
" <td>3</td>\n",
" <td>4</td>\n",
" <td>14</td>\n",
" <td>NaN</td>\n",
" <td>2</td>\n",
" <td>2</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" <td>6</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>top</th>\n",
" <td>NaN</td>\n",
" <td>(0,1]</td>\n",
" <td>MENSUEL</td>\n",
" <td>[0;20000[</td>\n",
" <td>C</td>\n",
" <td>NaN</td>\n",
" <td>M</td>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>ESSENCE</td>\n",
" <td>FAUX</td>\n",
" <td>[10000;15000[</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>freq</th>\n",
" <td>NaN</td>\n",
" <td>297</td>\n",
" <td>398</td>\n",
" <td>391</td>\n",
" <td>269</td>\n",
" <td>NaN</td>\n",
" <td>483</td>\n",
" <td>663</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>413</td>\n",
" <td>517</td>\n",
" <td>213</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>2018.384709</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>44.383495</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>35.688107</td>\n",
" <td>2015.212379</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4246.016978</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>1.515833</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>13.808217</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>19.370621</td>\n",
" <td>3.163782</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>6869.616917</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>2016.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>19.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1.000000</td>\n",
" <td>1998.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>7.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>2017.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>34.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>18.000000</td>\n",
" <td>2014.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1159.961250</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>2018.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>43.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>35.000000</td>\n",
" <td>2016.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>2541.650000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>2020.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>53.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>53.000000</td>\n",
" <td>2017.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4193.797500</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>2021.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>94.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>70.000000</td>\n",
" <td>2021.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>83421.850000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"          ANNEE_CTR  CONTRAT_ANCIENNETE  FREQUENCE_PAIEMENT_COTISATION  \\\n",
"count    824.000000                 824                            824   \n",
"unique          NaN                   5                              3   \n",
"top             NaN               (0,1]                        MENSUEL   \n",
"freq            NaN                 297                            398   \n",
"mean    2018.384709                 NaN                            NaN   \n",
"std        1.515833                 NaN                            NaN   \n",
"min     2016.000000                 NaN                            NaN   \n",
"25%     2017.000000                 NaN                            NaN   \n",
"50%     2018.000000                 NaN                            NaN   \n",
"75%     2020.000000                 NaN                            NaN   \n",
"max     2021.000000                 NaN                            NaN   \n",
"\n",
"        GROUPE_KM  ZONE_RISQUE  AGE_ASSURE_PRINCIPAL  GENRE  DEUXIEME_CONDUCTEUR  \\\n",
"count         824          824            824.000000    824                  824   \n",
"unique          4           14                   NaN      2                    2   \n",
"top     [0;20000[            C                   NaN      M                False   \n",
"freq          391          269                   NaN    483                  663   \n",
"mean          NaN          NaN             44.383495    NaN                  NaN   \n",
"std           NaN          NaN             13.808217    NaN                  NaN   \n",
"min           NaN          NaN             19.000000    NaN                  NaN   \n",
"25%           NaN          NaN             34.000000    NaN                  NaN   \n",
"50%           NaN          NaN             43.000000    NaN                  NaN   \n",
"75%           NaN          NaN             53.000000    NaN                  NaN   \n",
"max           NaN          NaN             94.000000    NaN                  NaN   \n",
"\n",
"        ANCIENNETE_PERMIS  ANNEE_CONSTRUCTION  ENERGIE  EQUIPEMENT_SECURITE  \\\n",
"count          824.000000          824.000000      824                  824   \n",
"unique                NaN                 NaN        3                    2   \n",
"top                   NaN                 NaN  ESSENCE                 FAUX   \n",
"freq                  NaN                 NaN      413                  517   \n",
"mean            35.688107         2015.212379      NaN                  NaN   \n",
"std             19.370621            3.163782      NaN                  NaN   \n",
"min              1.000000         1998.000000      NaN                  NaN   \n",
"25%             18.000000         2014.000000      NaN                  NaN   \n",
"50%             35.000000         2016.000000      NaN                  NaN   \n",
"75%             53.000000         2017.000000      NaN                  NaN   \n",
"max             70.000000         2021.000000      NaN                  NaN   \n",
"\n",
"        VALEUR_DU_BIEN            CM  \n",
"count              824    824.000000  \n",
"unique               6           NaN  \n",
"top      [10000;15000[           NaN  \n",
"freq               213           NaN  \n",
"mean               NaN   4246.016978  \n",
"std                NaN   6869.616917  \n",
"min                NaN      7.500000  \n",
"25%                NaN   1159.961250  \n",
"50%                NaN   2541.650000  \n",
"75%                NaN   4193.797500  \n",
"max                NaN  83421.850000  "
]
},
"execution_count": 161,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_model.describe(include='all')"
]
},
{
"cell_type": "markdown",
"id": "92d6156a",
"metadata": {},
"source": [
"#### Correlations among the explanatory variables"
]
},
{
"cell_type": "markdown",
"id": "d7327570",
"metadata": {},
"source": [
"**Question:** In your view, why should we look at the correlation between variables?"
]
},
{
"cell_type": "markdown",
"id": "475e141b",
"metadata": {},
"source": [
"*Answer*: To obtain a better-fitting model, to detect possible associations between features and the target (correlation alone does not establish causality), and to guide variable selection, for example by dropping redundant explanatory variables."
]
},
{
"cell_type": "code",
"execution_count": 162,
"id": "1b156435",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(824, 13)"
]
},
"execution_count": 162,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_set = data_model.drop(\"CM\", axis=1)\n",
"data_set.shape"
]
},
{
"cell_type": "code",
"execution_count": 163,
"id": "0ef0fcc0",
"metadata": {},
"outputs": [],
"source": [
"# Split the columns into numeric and categorical variables\n",
"variables_na = []\n",
"variables_numeriques = []\n",
"variables_01 = []\n",
"variables_categorielles = []\n",
"for colu in data_set.columns:\n",
"    if data_set[colu].isna().any():\n",
"        # Columns with missing values are set aside\n",
"        variables_na.append(data_set[colu])\n",
"    elif (str(data_set[colu].dtypes) in [\"int32\", \"int64\", \"float64\"]\n",
"          and data_set[colu].nunique() != 2):\n",
"        # Numeric columns with more than two distinct values\n",
"        variables_numeriques.append(data_set[colu])\n",
"    else:\n",
"        # Binary numeric columns and all non-numeric columns are treated as categorical\n",
"        variables_categorielles.append(data_set[colu])"
]
},
{
"cell_type": "markdown",
"id": "e82fcade",
"metadata": {},
"source": [
"##### Correlation of the categorical variables:"
]
},
{
"cell_type": "code",
"execution_count": 164,
"id": "e130aae5",
"metadata": {},
"outputs": [],
"source": [
"vars_categorielles = pd.DataFrame(variables_categorielles).transpose()"
]
},
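{
"cell_type": "markdown",
"id": "a7c3e9d1",
"metadata": {},
"source": [
"As a reading aid: the `cramers_v` function defined in the next cell appears to implement the bias-corrected variant of Cramér's V. A sketch of that formula, assuming $\\chi^2$ is the chi-squared statistic of an $r \\times k$ contingency table built from $n$ observations (the tilde notation is introduced here for illustration and is not part of the course material):\n",
"\n",
"$$\\tilde{\\varphi}^2 = \\max\\left(0,\\ \\frac{\\chi^2}{n} - \\frac{(k-1)(r-1)}{n-1}\\right), \\qquad \\tilde{r} = r - \\frac{(r-1)^2}{n-1}, \\qquad \\tilde{k} = k - \\frac{(k-1)^2}{n-1}$$\n",
"\n",
"$$\\tilde{V} = \\sqrt{\\frac{\\tilde{\\varphi}^2}{\\min(\\tilde{k}-1,\\ \\tilde{r}-1)}}$$\n",
"\n",
"The value lies between 0 (no association) and 1 (perfect association). Unlike a Pearson coefficient, it is unsigned."
]
},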
{
"cell_type": "code",
"execution_count": 165,
"id": "c39e2ad0",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.plotly.v1+json": {
"config": {
"plotlyServerURL": "https://plot.ly"
},
"data": [
{
"coloraxis": "coloraxis",
"hovertemplate": "x: %{x}<br>y: %{y}<br>color: %{z}<extra></extra>",
"name": "0",
"texttemplate": "%{z:.2f}",
"type": "heatmap",
"x": [
"CONTRAT_ANCIENNETE",
"FREQUENCE_PAIEMENT_COTISATION",
"GROUPE_KM",
"ZONE_RISQUE",
"GENRE",
"DEUXIEME_CONDUCTEUR",
"ENERGIE",
"EQUIPEMENT_SECURITE",
"VALEUR_DU_BIEN"
],
"xaxis": "x",
"y": [
"CONTRAT_ANCIENNETE",
"FREQUENCE_PAIEMENT_COTISATION",
"GROUPE_KM",
"ZONE_RISQUE",
"GENRE",
"DEUXIEME_CONDUCTEUR",
"ENERGIE",
"EQUIPEMENT_SECURITE",
"VALEUR_DU_BIEN"
],
"yaxis": "y",
"z": {
"bdata": "AAAAAAAA8D8AAAAAAAAAACoCGzzITrA/jS6+t390sj/aAKYMJa2eP5RMqUS3uZs/ytNpsBVXkz8AAAAAAAAAAJsekiMPM4I/AAAAAAAAAAAAAAAAAADwPwAAAAAAAAAAAAAAAAAAAABgNwyfFOK3Px3tLvtk1qI/VTS7w965nj/DbHQwNU6sP6xOyIjBVMQ/KwIbPMhOsD8AAAAAAAAAAAAAAAAAAPA/JGwWgOwjwz/Y12crRVC2P1AU8aUpk3Y/tZ25v8HgyT9++YWBDBq6PxMKBP1KAMk/ki6+t390sj8AAAAAAAAAACNsFoDsI8M/AAAAAAAA8D8AAAAAAAAAAOzpAHMW1bU/OToUIB5twT+gpoD1ZjrEP/5ATjN+vpg/0gCmDCWtnj9gNwyfFOK3P9jXZytFULY/AAAAAAAAAAAAAAAAAADwPwAAAAAAAAAA2p0N4q1bwz/UsLoqS0u5PxFqf8IHB9E/lEypRLe5mz8d7S77ZNaiP1AU8aUpk3Y/7OkAcxbVtT8AAAAAAAAAAAAAAAAAAPA/AAAAAAAAAAAAAAAAAAAAAOYlMsJ0brs/ytNpsBVXkz9RNLvD3rmeP7edub/B4Mk/OjoUIB5twT/anQ3irVvDPwAAAAAAAAAAAAAAAAAA8D8nEbUEUmnAP+SA2g/TvNE/AAAAAAAAAADDbHQwNU6sP335hYEMGro/oKaA9WY6xD/UsLoqS0u5PwAAAAAAAAAAJxG1BFJpwD8AAAAAAADwP+fmCf6XRco/mx6SIw8zgj+rTsiIwVTEPxIKBP1KAMk//kBOM36+mD8Ran/CBwfRP+YlMsJ0brs/5YDaD9O80T/n5gn+l0XKPwAAAAAAAPA/",
"dtype": "f8",
"shape": "9, 9"
}
}
],
"layout": {
"coloraxis": {
"colorscale": [
[
0,
"rgb(5,48,97)"
],
[
0.1,
"rgb(33,102,172)"
],
[
0.2,
"rgb(67,147,195)"
],
[
0.3,
"rgb(146,197,222)"
],
[
0.4,
"rgb(209,229,240)"
],
[
0.5,
"rgb(247,247,247)"
],
[
0.6,
"rgb(253,219,199)"
],
[
0.7,
"rgb(244,165,130)"
],
[
0.8,
"rgb(214,96,77)"
],
[
0.9,
"rgb(178,24,43)"
],
[
1,
"rgb(103,0,31)"
]
]
},
"template": {
"data": {
"bar": [
{
"error_x": {
"color": "#2a3f5f"
},
"error_y": {
"color": "#2a3f5f"
},
"marker": {
"line": {
"color": "#E5ECF6",
"width": 0.5
},
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "bar"
}
],
"barpolar": [
{
"marker": {
"line": {
"color": "#E5ECF6",
"width": 0.5
},
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "barpolar"
}
],
"carpet": [
{
"aaxis": {
"endlinecolor": "#2a3f5f",
"gridcolor": "white",
"linecolor": "white",
"minorgridcolor": "white",
"startlinecolor": "#2a3f5f"
},
"baxis": {
"endlinecolor": "#2a3f5f",
"gridcolor": "white",
"linecolor": "white",
"minorgridcolor": "white",
"startlinecolor": "#2a3f5f"
},
"type": "carpet"
}
],
"choropleth": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "choropleth"
}
],
"contour": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "contour"
}
],
"contourcarpet": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "contourcarpet"
}
],
"heatmap": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "heatmap"
}
],
"histogram": [
{
"marker": {
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "histogram"
}
],
"histogram2d": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "histogram2d"
}
],
"histogram2dcontour": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "histogram2dcontour"
}
],
"mesh3d": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "mesh3d"
}
],
"parcoords": [
{
"line": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "parcoords"
}
],
"pie": [
{
"automargin": true,
"type": "pie"
}
],
"scatter": [
{
"fillpattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
},
"type": "scatter"
}
],
"scatter3d": [
{
"line": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatter3d"
}
],
"scattercarpet": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattercarpet"
}
],
"scattergeo": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattergeo"
}
],
"scattergl": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattergl"
}
],
"scattermap": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattermap"
}
],
"scattermapbox": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattermapbox"
}
],
"scatterpolar": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterpolar"
}
],
"scatterpolargl": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterpolargl"
}
],
"scatterternary": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterternary"
}
],
"surface": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "surface"
}
],
"table": [
{
"cells": {
"fill": {
"color": "#EBF0F8"
},
"line": {
"color": "white"
}
},
"header": {
"fill": {
"color": "#C8D4E3"
},
"line": {
"color": "white"
}
},
"type": "table"
}
]
},
"layout": {
"annotationdefaults": {
"arrowcolor": "#2a3f5f",
"arrowhead": 0,
"arrowwidth": 1
},
"autotypenumbers": "strict",
"coloraxis": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"colorscale": {
"diverging": [
[
0,
"#8e0152"
],
[
0.1,
"#c51b7d"
],
[
0.2,
"#de77ae"
],
[
0.3,
"#f1b6da"
],
[
0.4,
"#fde0ef"
],
[
0.5,
"#f7f7f7"
],
[
0.6,
"#e6f5d0"
],
[
0.7,
"#b8e186"
],
[
0.8,
"#7fbc41"
],
[
0.9,
"#4d9221"
],
[
1,
"#276419"
]
],
"sequential": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"sequentialminus": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
]
},
"colorway": [
"#636efa",
"#EF553B",
"#00cc96",
"#ab63fa",
"#FFA15A",
"#19d3f3",
"#FF6692",
"#B6E880",
"#FF97FF",
"#FECB52"
],
"font": {
"color": "#2a3f5f"
},
"geo": {
"bgcolor": "white",
"lakecolor": "white",
"landcolor": "#E5ECF6",
"showlakes": true,
"showland": true,
"subunitcolor": "white"
},
"hoverlabel": {
"align": "left"
},
"hovermode": "closest",
"mapbox": {
"style": "light"
},
"paper_bgcolor": "white",
"plot_bgcolor": "#E5ECF6",
"polar": {
"angularaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"bgcolor": "#E5ECF6",
"radialaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
}
},
"scene": {
"xaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
},
"yaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
},
"zaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
}
},
"shapedefaults": {
"line": {
"color": "#2a3f5f"
}
},
"ternary": {
"aaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"baxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"bgcolor": "#E5ECF6",
"caxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
}
},
"title": {
"x": 0.05
},
"xaxis": {
"automargin": true,
"gridcolor": "white",
"linecolor": "white",
"ticks": "",
"title": {
"standoff": 15
},
"zerolinecolor": "white",
"zerolinewidth": 2
},
"yaxis": {
"automargin": true,
"gridcolor": "white",
"linecolor": "white",
"ticks": "",
"title": {
"standoff": 15
},
"zerolinecolor": "white",
"zerolinewidth": 2
}
}
},
"title": {
"text": "Correlation matrix of the categorical variables (Cramér's V)"
},
"xaxis": {
"anchor": "y",
"domain": [
0,
1
]
},
"yaxis": {
"anchor": "x",
"autorange": "reversed",
"domain": [
0,
1
]
}
}
}
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Correlation matrix for the categorical variables (Cramér's V)\n",
"def cramers_v(confusion_matrix):\n",
"    \"\"\"Compute the bias-corrected Cramér's V from a contingency table.\"\"\"\n",
"    chi2 = chi2_contingency(confusion_matrix)[0]\n",
"    n = confusion_matrix.sum().sum()\n",
"    phi2 = chi2 / n\n",
"    r, k = confusion_matrix.shape\n",
"    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))\n",
"    rcorr = r - ((r-1)**2)/(n-1)\n",
"    kcorr = k - ((k-1)**2)/(n-1)\n",
"    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))\n",
"\n",
"# Build the correlation matrix\n",
"categorical_cols = vars_categorielles.columns\n",
"n_vars = len(categorical_cols)\n",
"cramers_matrix = np.zeros((n_vars, n_vars))\n",
"\n",
"for i, col1 in enumerate(categorical_cols):\n",
"    for j, col2 in enumerate(categorical_cols):\n",
"        if i == j:\n",
"            cramers_matrix[i, j] = 1.0\n",
"        else:\n",
"            confusion_matrix = pd.crosstab(vars_categorielles[col1], vars_categorielles[col2])\n",
"            cramers_matrix[i, j] = cramers_v(confusion_matrix)\n",
"\n",
"# Build the correlation DataFrame\n",
"correlation_cat = pd.DataFrame(cramers_matrix,\n",
"                               index=categorical_cols,\n",
"                               columns=categorical_cols)\n",
"\n",
"# Visualise with Plotly\n",
"fig = px.imshow(correlation_cat,\n",
"                text_auto='.2f', # type: ignore\n",
"                aspect=\"auto\",\n",
"                color_continuous_scale='RdBu_r',\n",
"                title=\"Correlation matrix of the categorical variables (Cramér's V)\")\n",
"fig.show()"
]
},
{
"cell_type": "markdown",
"id": "8f615121",
"metadata": {},
"source": [
"##### Correlation of the numerical variables:"
]
},
{
"cell_type": "code",
"execution_count": 166,
"id": "a16215ab",
"metadata": {},
"outputs": [],
"source": [
"vars_numeriques = pd.DataFrame(variables_numeriques).transpose()"
]
},
{
"cell_type": "code",
"execution_count": 167,
"id": "532ca6c4",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.plotly.v1+json": {
"config": {
"plotlyServerURL": "https://plot.ly"
},
"data": [
{
"coloraxis": "coloraxis",
"hovertemplate": "x: %{x}<br>y: %{y}<br>color: %{z}<extra></extra>",
"name": "0",
"texttemplate": "%{z}",
"type": "heatmap",
"x": [
"ANNEE_CTR",
"AGE_ASSURE_PRINCIPAL",
"ANCIENNETE_PERMIS",
"ANNEE_CONSTRUCTION"
],
"xaxis": "x",
"y": [
"ANNEE_CTR",
"AGE_ASSURE_PRINCIPAL",
"ANCIENNETE_PERMIS",
"ANNEE_CONSTRUCTION"
],
"yaxis": "y",
"z": {
"bdata": "AAAAAAAA8D+ybZcEUUCbP/CBLCtO46Q/qr2Q49LN2D+ybZcEUUCbPwAAAAAAAPA/slV7SAtP4T84L73yETWgv/CBLCtO46Q/slV7SAtP4T8AAAAAAADwP0I6y25dD6E/qr2Q49LN2D84L73yETWgv0I6y25dD6E/AAAAAAAA8D8=",
"dtype": "f8",
"shape": "4, 4"
}
}
],
"layout": {
"coloraxis": {
"colorscale": [
[
0,
"rgb(5,48,97)"
],
[
0.1,
"rgb(33,102,172)"
],
[
0.2,
"rgb(67,147,195)"
],
[
0.3,
"rgb(146,197,222)"
],
[
0.4,
"rgb(209,229,240)"
],
[
0.5,
"rgb(247,247,247)"
],
[
0.6,
"rgb(253,219,199)"
],
[
0.7,
"rgb(244,165,130)"
],
[
0.8,
"rgb(214,96,77)"
],
[
0.9,
"rgb(178,24,43)"
],
[
1,
"rgb(103,0,31)"
]
]
},
"template": {
"data": {
"bar": [
{
"error_x": {
"color": "#2a3f5f"
},
"error_y": {
"color": "#2a3f5f"
},
"marker": {
"line": {
"color": "#E5ECF6",
"width": 0.5
},
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "bar"
}
],
"barpolar": [
{
"marker": {
"line": {
"color": "#E5ECF6",
"width": 0.5
},
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "barpolar"
}
],
"carpet": [
{
"aaxis": {
"endlinecolor": "#2a3f5f",
"gridcolor": "white",
"linecolor": "white",
"minorgridcolor": "white",
"startlinecolor": "#2a3f5f"
},
"baxis": {
"endlinecolor": "#2a3f5f",
"gridcolor": "white",
"linecolor": "white",
"minorgridcolor": "white",
"startlinecolor": "#2a3f5f"
},
"type": "carpet"
}
],
"choropleth": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "choropleth"
}
],
"contour": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "contour"
}
],
"contourcarpet": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "contourcarpet"
}
],
"heatmap": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "heatmap"
}
],
"histogram": [
{
"marker": {
"pattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
}
},
"type": "histogram"
}
],
"histogram2d": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "histogram2d"
}
],
"histogram2dcontour": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "histogram2dcontour"
}
],
"mesh3d": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"type": "mesh3d"
}
],
"parcoords": [
{
"line": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "parcoords"
}
],
"pie": [
{
"automargin": true,
"type": "pie"
}
],
"scatter": [
{
"fillpattern": {
"fillmode": "overlay",
"size": 10,
"solidity": 0.2
},
"type": "scatter"
}
],
"scatter3d": [
{
"line": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatter3d"
}
],
"scattercarpet": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattercarpet"
}
],
"scattergeo": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattergeo"
}
],
"scattergl": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattergl"
}
],
"scattermap": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattermap"
}
],
"scattermapbox": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scattermapbox"
}
],
"scatterpolar": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterpolar"
}
],
"scatterpolargl": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterpolargl"
}
],
"scatterternary": [
{
"marker": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"type": "scatterternary"
}
],
"surface": [
{
"colorbar": {
"outlinewidth": 0,
"ticks": ""
},
"colorscale": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"type": "surface"
}
],
"table": [
{
"cells": {
"fill": {
"color": "#EBF0F8"
},
"line": {
"color": "white"
}
},
"header": {
"fill": {
"color": "#C8D4E3"
},
"line": {
"color": "white"
}
},
"type": "table"
}
]
},
"layout": {
"annotationdefaults": {
"arrowcolor": "#2a3f5f",
"arrowhead": 0,
"arrowwidth": 1
},
"autotypenumbers": "strict",
"coloraxis": {
"colorbar": {
"outlinewidth": 0,
"ticks": ""
}
},
"colorscale": {
"diverging": [
[
0,
"#8e0152"
],
[
0.1,
"#c51b7d"
],
[
0.2,
"#de77ae"
],
[
0.3,
"#f1b6da"
],
[
0.4,
"#fde0ef"
],
[
0.5,
"#f7f7f7"
],
[
0.6,
"#e6f5d0"
],
[
0.7,
"#b8e186"
],
[
0.8,
"#7fbc41"
],
[
0.9,
"#4d9221"
],
[
1,
"#276419"
]
],
"sequential": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
],
"sequentialminus": [
[
0,
"#0d0887"
],
[
0.1111111111111111,
"#46039f"
],
[
0.2222222222222222,
"#7201a8"
],
[
0.3333333333333333,
"#9c179e"
],
[
0.4444444444444444,
"#bd3786"
],
[
0.5555555555555556,
"#d8576b"
],
[
0.6666666666666666,
"#ed7953"
],
[
0.7777777777777778,
"#fb9f3a"
],
[
0.8888888888888888,
"#fdca26"
],
[
1,
"#f0f921"
]
]
},
"colorway": [
"#636efa",
"#EF553B",
"#00cc96",
"#ab63fa",
"#FFA15A",
"#19d3f3",
"#FF6692",
"#B6E880",
"#FF97FF",
"#FECB52"
],
"font": {
"color": "#2a3f5f"
},
"geo": {
"bgcolor": "white",
"lakecolor": "white",
"landcolor": "#E5ECF6",
"showlakes": true,
"showland": true,
"subunitcolor": "white"
},
"hoverlabel": {
"align": "left"
},
"hovermode": "closest",
"mapbox": {
"style": "light"
},
"paper_bgcolor": "white",
"plot_bgcolor": "#E5ECF6",
"polar": {
"angularaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"bgcolor": "#E5ECF6",
"radialaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
}
},
"scene": {
"xaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
},
"yaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
},
"zaxis": {
"backgroundcolor": "#E5ECF6",
"gridcolor": "white",
"gridwidth": 2,
"linecolor": "white",
"showbackground": true,
"ticks": "",
"zerolinecolor": "white"
}
},
"shapedefaults": {
"line": {
"color": "#2a3f5f"
}
},
"ternary": {
"aaxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"baxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
},
"bgcolor": "#E5ECF6",
"caxis": {
"gridcolor": "white",
"linecolor": "white",
"ticks": ""
}
},
"title": {
"x": 0.05
},
"xaxis": {
"automargin": true,
"gridcolor": "white",
"linecolor": "white",
"ticks": "",
"title": {
"standoff": 15
},
"zerolinecolor": "white",
"zerolinewidth": 2
},
"yaxis": {
"automargin": true,
"gridcolor": "white",
"linecolor": "white",
"ticks": "",
"title": {
"standoff": 15
},
"zerolinecolor": "white",
"zerolinewidth": 2
}
}
},
"title": {
"text": "Matrice de corrélation des variables numériques"
},
"xaxis": {
"anchor": "y",
"domain": [
0,
1
]
},
"yaxis": {
"anchor": "x",
"autorange": "reversed",
"domain": [
0,
1
]
}
}
}
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"vars_numeriques.corr()\n",
"fig = px.imshow(vars_numeriques.corr(),\n",
" text_auto=True,\n",
" aspect=\"auto\",\n",
" color_continuous_scale='RdBu_r',\n",
" title='Matrice de corrélation des variables numériques')\n",
"fig.show()"
]
},
{
"cell_type": "markdown",
"id": "98c7dba6",
"metadata": {},
"source": [
"**Question :** quels sont vos commentaires ?"
]
},
{
"cell_type": "markdown",
"id": "67406b54",
"metadata": {},
"source": [
"*Réponse*: Aucune des variables ne semblent corrélées."
]
},
{
"cell_type": "markdown",
"id": "212209ec",
"metadata": {},
"source": [
"#### Preprocessing"
]
},
{
"cell_type": "markdown",
"id": "65aca700",
"metadata": {},
"source": [
"Deux étapes sont nécessaires avant de lancer l'apprentissage d'un modèle, c'est ce qu'on connait comme le *Preprocessing* :\n",
"\n",
"* Les modèles proposés par la librairie \"sklearn\" ne gèrent que des variables numériques. Il est donc nécessaire de transformer les variables catégorielles en variables numériques : ce processus s'appelle le *One Hot Encoding*.\n",
"* Normaliser les données numériques"
]
},
{
"cell_type": "markdown",
"id": "95f5cc9f",
"metadata": {},
"source": [
"**Exercice :** proposez un bout de code permettant de réaliser le One Hot Encoding des variables catégorielles. Vous pourrez utiliser la fonction \"preproc.OneHotEncoder\" de la librairie sklearn"
]
},
{
"cell_type": "code",
"execution_count": 168,
"id": "b8530717",
"metadata": {},
"outputs": [],
"source": [
"encoder = preproc.OneHotEncoder()\n",
"encoder.fit(vars_categorielles)\n",
"vars_categorielles_enc = encoder.transform(vars_categorielles)\n",
"vars_categorielles_enc = pd.DataFrame(vars_categorielles_enc.toarray(), columns=encoder.get_feature_names_out(vars_categorielles.columns))"
]
},
{
"cell_type": "markdown",
"id": "b70abc5c",
"metadata": {},
"source": [
"**Exercice :** proposez un bout de code permettant normaliser les variables numériques présentes dans la base. Vous pourrez utiliser la fonction \"preproc.StandardScaler\" de la librairie sklearn"
]
},
{
"cell_type": "code",
"execution_count": 169,
"id": "4ff3847d",
"metadata": {},
"outputs": [],
"source": [
"scaler = preproc.StandardScaler()\n",
"scaler.fit(vars_numeriques)\n",
"vars_numeriques_scaled = scaler.transform(vars_numeriques)\n",
"vars_numeriques_scaled = pd.DataFrame(vars_numeriques_scaled, columns=vars_numeriques.columns)"
]
},
{
"cell_type": "markdown",
"id": "62d49546",
"metadata": {},
"source": [
"#### Sampling"
]
},
{
"cell_type": "markdown",
"id": "64d229f4",
"metadata": {},
"source": [
"**Exercice :** proposez un bout de code permettant construire la base d'apprentissage (80% des données) et la base de test (20%)."
]
},
{
"cell_type": "code",
"execution_count": 170,
"id": "6a1c7907",
"metadata": {},
"outputs": [],
"source": [
"X = data_model_preprocessed = vars_numeriques_scaled.merge(vars_categorielles_enc, left_index=True, right_index=True) # type: ignore\n",
"Y = data_model[\"CM\"]\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)"
]
},
{
"cell_type": "markdown",
"id": "84dc7a07",
"metadata": {},
"source": [
"#### Fitting"
]
},
{
"cell_type": "markdown",
"id": "97c7b783",
"metadata": {},
"source": [
"**Exercice :** proposez un bout de code permettant construire le modèle"
]
},
{
"cell_type": "code",
"execution_count": 171,
"id": "053e013c",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>#sk-container-id-4 {\n",
" /* Definition of color scheme common for light and dark mode */\n",
" --sklearn-color-text: #000;\n",
" --sklearn-color-text-muted: #666;\n",
" --sklearn-color-line: gray;\n",
" /* Definition of color scheme for unfitted estimators */\n",
" --sklearn-color-unfitted-level-0: #fff5e6;\n",
" --sklearn-color-unfitted-level-1: #f6e4d2;\n",
" --sklearn-color-unfitted-level-2: #ffe0b3;\n",
" --sklearn-color-unfitted-level-3: chocolate;\n",
" /* Definition of color scheme for fitted estimators */\n",
" --sklearn-color-fitted-level-0: #f0f8ff;\n",
" --sklearn-color-fitted-level-1: #d4ebff;\n",
" --sklearn-color-fitted-level-2: #b3dbfd;\n",
" --sklearn-color-fitted-level-3: cornflowerblue;\n",
"\n",
" /* Specific color for light theme */\n",
" --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n",
" --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, white)));\n",
" --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, black)));\n",
" --sklearn-color-icon: #696969;\n",
"\n",
" @media (prefers-color-scheme: dark) {\n",
" /* Redefinition of color scheme for dark theme */\n",
" --sklearn-color-text-on-default-background: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n",
" --sklearn-color-background: var(--sg-background-color, var(--theme-background, var(--jp-layout-color0, #111)));\n",
" --sklearn-color-border-box: var(--sg-text-color, var(--theme-code-foreground, var(--jp-content-font-color1, white)));\n",
" --sklearn-color-icon: #878787;\n",
" }\n",
"}\n",
"\n",
"#sk-container-id-4 {\n",
" color: var(--sklearn-color-text);\n",
"}\n",
"\n",
"#sk-container-id-4 pre {\n",
" padding: 0;\n",
"}\n",
"\n",
"#sk-container-id-4 input.sk-hidden--visually {\n",
" border: 0;\n",
" clip: rect(1px 1px 1px 1px);\n",
" clip: rect(1px, 1px, 1px, 1px);\n",
" height: 1px;\n",
" margin: -1px;\n",
" overflow: hidden;\n",
" padding: 0;\n",
" position: absolute;\n",
" width: 1px;\n",
"}\n",
"\n",
"#sk-container-id-4 div.sk-dashed-wrapped {\n",
" border: 1px dashed var(--sklearn-color-line);\n",
" margin: 0 0.4em 0.5em 0.4em;\n",
" box-sizing: border-box;\n",
" padding-bottom: 0.4em;\n",
" background-color: var(--sklearn-color-background);\n",
"}\n",
"\n",
"#sk-container-id-4 div.sk-container {\n",
" /* jupyter's `normalize.less` sets `[hidden] { display: none; }`\n",
" but bootstrap.min.css set `[hidden] { display: none !important; }`\n",
" so we also need the `!important` here to be able to override the\n",
" default hidden behavior on the sphinx rendered scikit-learn.org.\n",
" See: https://github.com/scikit-learn/scikit-learn/issues/21755 */\n",
" display: inline-block !important;\n",
" position: relative;\n",
"}\n",
"\n",
"#sk-container-id-4 div.sk-text-repr-fallback {\n",
" display: none;\n",
"}\n",
"\n",
"div.sk-parallel-item,\n",
"div.sk-serial,\n",
"div.sk-item {\n",
" /* draw centered vertical line to link estimators */\n",
" background-image: linear-gradient(var(--sklearn-color-text-on-default-background), var(--sklearn-color-text-on-default-background));\n",
" background-size: 2px 100%;\n",
" background-repeat: no-repeat;\n",
" background-position: center center;\n",
"}\n",
"\n",
"/* Parallel-specific style estimator block */\n",
"\n",
"#sk-container-id-4 div.sk-parallel-item::after {\n",
" content: \"\";\n",
" width: 100%;\n",
" border-bottom: 2px solid var(--sklearn-color-text-on-default-background);\n",
" flex-grow: 1;\n",
"}\n",
"\n",
"#sk-container-id-4 div.sk-parallel {\n",
" display: flex;\n",
" align-items: stretch;\n",
" justify-content: center;\n",
" background-color: var(--sklearn-color-background);\n",
" position: relative;\n",
"}\n",
"\n",
"#sk-container-id-4 div.sk-parallel-item {\n",
" display: flex;\n",
" flex-direction: column;\n",
"}\n",
"\n",
"#sk-container-id-4 div.sk-parallel-item:first-child::after {\n",
" align-self: flex-end;\n",
" width: 50%;\n",
"}\n",
"\n",
"#sk-container-id-4 div.sk-parallel-item:last-child::after {\n",
" align-self: flex-start;\n",
" width: 50%;\n",
"}\n",
"\n",
"#sk-container-id-4 div.sk-parallel-item:only-child::after {\n",
" width: 0;\n",
"}\n",
"\n",
"/* Serial-specific style estimator block */\n",
"\n",
"#sk-container-id-4 div.sk-serial {\n",
" display: flex;\n",
" flex-direction: column;\n",
" align-items: center;\n",
" background-color: var(--sklearn-color-background);\n",
" padding-right: 1em;\n",
" padding-left: 1em;\n",
"}\n",
"\n",
"\n",
"/* Toggleable style: style used for estimator/Pipeline/ColumnTransformer box that is\n",
"clickable and can be expanded/collapsed.\n",
"- Pipeline and ColumnTransformer use this feature and define the default style\n",
"- Estimators will overwrite some part of the style using the `sk-estimator` class\n",
"*/\n",
"\n",
"/* Pipeline and ColumnTransformer style (default) */\n",
"\n",
"#sk-container-id-4 div.sk-toggleable {\n",
" /* Default theme specific background. It is overwritten whether we have a\n",
" specific estimator or a Pipeline/ColumnTransformer */\n",
" background-color: var(--sklearn-color-background);\n",
"}\n",
"\n",
"/* Toggleable label */\n",
"#sk-container-id-4 label.sk-toggleable__label {\n",
" cursor: pointer;\n",
" display: flex;\n",
" width: 100%;\n",
" margin-bottom: 0;\n",
" padding: 0.5em;\n",
" box-sizing: border-box;\n",
" text-align: center;\n",
" align-items: start;\n",
" justify-content: space-between;\n",
" gap: 0.5em;\n",
"}\n",
"\n",
"#sk-container-id-4 label.sk-toggleable__label .caption {\n",
" font-size: 0.6rem;\n",
" font-weight: lighter;\n",
" color: var(--sklearn-color-text-muted);\n",
"}\n",
"\n",
"#sk-container-id-4 label.sk-toggleable__label-arrow:before {\n",
" /* Arrow on the left of the label */\n",
" content: \"▸\";\n",
" float: left;\n",
" margin-right: 0.25em;\n",
" color: var(--sklearn-color-icon);\n",
"}\n",
"\n",
"#sk-container-id-4 label.sk-toggleable__label-arrow:hover:before {\n",
" color: var(--sklearn-color-text);\n",
"}\n",
"\n",
"/* Toggleable content - dropdown */\n",
"\n",
"#sk-container-id-4 div.sk-toggleable__content {\n",
" max-height: 0;\n",
" max-width: 0;\n",
" overflow: hidden;\n",
" text-align: left;\n",
" /* unfitted */\n",
" background-color: var(--sklearn-color-unfitted-level-0);\n",
"}\n",
"\n",
"#sk-container-id-4 div.sk-toggleable__content.fitted {\n",
" /* fitted */\n",
" background-color: var(--sklearn-color-fitted-level-0);\n",
"}\n",
"\n",
"#sk-container-id-4 div.sk-toggleable__content pre {\n",
" margin: 0.2em;\n",
" border-radius: 0.25em;\n",
" color: var(--sklearn-color-text);\n",
" /* unfitted */\n",
" background-color: var(--sklearn-color-unfitted-level-0);\n",
"}\n",
"\n",
"#sk-container-id-4 div.sk-toggleable__content.fitted pre {\n",
" /* unfitted */\n",
" background-color: var(--sklearn-color-fitted-level-0);\n",
"}\n",
"\n",
"#sk-container-id-4 input.sk-toggleable__control:checked~div.sk-toggleable__content {\n",
" /* Expand drop-down */\n",
" max-height: 200px;\n",
" max-width: 100%;\n",
" overflow: auto;\n",
"}\n",
"\n",
"#sk-container-id-4 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {\n",
" content: \"▾\";\n",
"}\n",
"\n",
"/* Pipeline/ColumnTransformer-specific style */\n",
"\n",
"#sk-container-id-4 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
" color: var(--sklearn-color-text);\n",
" background-color: var(--sklearn-color-unfitted-level-2);\n",
"}\n",
"\n",
"#sk-container-id-4 div.sk-label.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
" background-color: var(--sklearn-color-fitted-level-2);\n",
"}\n",
"\n",
"/* Estimator-specific style */\n",
"\n",
"/* Colorize estimator box */\n",
"#sk-container-id-4 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
" /* unfitted */\n",
" background-color: var(--sklearn-color-unfitted-level-2);\n",
"}\n",
"\n",
"#sk-container-id-4 div.sk-estimator.fitted input.sk-toggleable__control:checked~label.sk-toggleable__label {\n",
" /* fitted */\n",
" background-color: var(--sklearn-color-fitted-level-2);\n",
"}\n",
"\n",
"#sk-container-id-4 div.sk-label label.sk-toggleable__label,\n",
"#sk-container-id-4 div.sk-label label {\n",
" /* The background is the default theme color */\n",
" color: var(--sklearn-color-text-on-default-background);\n",
"}\n",
"\n",
"/* On hover, darken the color of the background */\n",
"#sk-container-id-4 div.sk-label:hover label.sk-toggleable__label {\n",
" color: var(--sklearn-color-text);\n",
" background-color: var(--sklearn-color-unfitted-level-2);\n",
"}\n",
"\n",
"/* Label box, darken color on hover, fitted */\n",
"#sk-container-id-4 div.sk-label.fitted:hover label.sk-toggleable__label.fitted {\n",
" color: var(--sklearn-color-text);\n",
" background-color: var(--sklearn-color-fitted-level-2);\n",
"}\n",
"\n",
"/* Estimator label */\n",
"\n",
"#sk-container-id-4 div.sk-label label {\n",
" font-family: monospace;\n",
" font-weight: bold;\n",
" display: inline-block;\n",
" line-height: 1.2em;\n",
"}\n",
"\n",
"#sk-container-id-4 div.sk-label-container {\n",
" text-align: center;\n",
"}\n",
"\n",
"/* Estimator-specific */\n",
"#sk-container-id-4 div.sk-estimator {\n",
" font-family: monospace;\n",
" border: 1px dotted var(--sklearn-color-border-box);\n",
" border-radius: 0.25em;\n",
" box-sizing: border-box;\n",
" margin-bottom: 0.5em;\n",
" /* unfitted */\n",
" background-color: var(--sklearn-color-unfitted-level-0);\n",
"}\n",
"\n",
"#sk-container-id-4 div.sk-estimator.fitted {\n",
" /* fitted */\n",
" background-color: var(--sklearn-color-fitted-level-0);\n",
"}\n",
"\n",
"/* on hover */\n",
"#sk-container-id-4 div.sk-estimator:hover {\n",
" /* unfitted */\n",
" background-color: var(--sklearn-color-unfitted-level-2);\n",
"}\n",
"\n",
"#sk-container-id-4 div.sk-estimator.fitted:hover {\n",
" /* fitted */\n",
" background-color: var(--sklearn-color-fitted-level-2);\n",
"}\n",
"\n",
"/* Specification for estimator info (e.g. \"i\" and \"?\") */\n",
"\n",
"/* Common style for \"i\" and \"?\" */\n",
"\n",
".sk-estimator-doc-link,\n",
"a:link.sk-estimator-doc-link,\n",
"a:visited.sk-estimator-doc-link {\n",
" float: right;\n",
" font-size: smaller;\n",
" line-height: 1em;\n",
" font-family: monospace;\n",
" background-color: var(--sklearn-color-background);\n",
" border-radius: 1em;\n",
" height: 1em;\n",
" width: 1em;\n",
" text-decoration: none !important;\n",
" margin-left: 0.5em;\n",
" text-align: center;\n",
" /* unfitted */\n",
" border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n",
" color: var(--sklearn-color-unfitted-level-1);\n",
"}\n",
"\n",
".sk-estimator-doc-link.fitted,\n",
"a:link.sk-estimator-doc-link.fitted,\n",
"a:visited.sk-estimator-doc-link.fitted {\n",
" /* fitted */\n",
" border: var(--sklearn-color-fitted-level-1) 1pt solid;\n",
" color: var(--sklearn-color-fitted-level-1);\n",
"}\n",
"\n",
"/* On hover */\n",
"div.sk-estimator:hover .sk-estimator-doc-link:hover,\n",
".sk-estimator-doc-link:hover,\n",
"div.sk-label-container:hover .sk-estimator-doc-link:hover,\n",
".sk-estimator-doc-link:hover {\n",
" /* unfitted */\n",
" background-color: var(--sklearn-color-unfitted-level-3);\n",
" color: var(--sklearn-color-background);\n",
" text-decoration: none;\n",
"}\n",
"\n",
"div.sk-estimator.fitted:hover .sk-estimator-doc-link.fitted:hover,\n",
".sk-estimator-doc-link.fitted:hover,\n",
"div.sk-label-container:hover .sk-estimator-doc-link.fitted:hover,\n",
".sk-estimator-doc-link.fitted:hover {\n",
" /* fitted */\n",
" background-color: var(--sklearn-color-fitted-level-3);\n",
" color: var(--sklearn-color-background);\n",
" text-decoration: none;\n",
"}\n",
"\n",
"/* Span, style for the box shown on hovering the info icon */\n",
".sk-estimator-doc-link span {\n",
" display: none;\n",
" z-index: 9999;\n",
" position: relative;\n",
" font-weight: normal;\n",
" right: .2ex;\n",
" padding: .5ex;\n",
" margin: .5ex;\n",
" width: min-content;\n",
" min-width: 20ex;\n",
" max-width: 50ex;\n",
" color: var(--sklearn-color-text);\n",
" box-shadow: 2pt 2pt 4pt #999;\n",
" /* unfitted */\n",
" background: var(--sklearn-color-unfitted-level-0);\n",
" border: .5pt solid var(--sklearn-color-unfitted-level-3);\n",
"}\n",
"\n",
".sk-estimator-doc-link.fitted span {\n",
" /* fitted */\n",
" background: var(--sklearn-color-fitted-level-0);\n",
" border: var(--sklearn-color-fitted-level-3);\n",
"}\n",
"\n",
".sk-estimator-doc-link:hover span {\n",
" display: block;\n",
"}\n",
"\n",
"/* \"?\"-specific style due to the `<a>` HTML tag */\n",
"\n",
"#sk-container-id-4 a.estimator_doc_link {\n",
" float: right;\n",
" font-size: 1rem;\n",
" line-height: 1em;\n",
" font-family: monospace;\n",
" background-color: var(--sklearn-color-background);\n",
" border-radius: 1rem;\n",
" height: 1rem;\n",
" width: 1rem;\n",
" text-decoration: none;\n",
" /* unfitted */\n",
" color: var(--sklearn-color-unfitted-level-1);\n",
" border: var(--sklearn-color-unfitted-level-1) 1pt solid;\n",
"}\n",
"\n",
"#sk-container-id-4 a.estimator_doc_link.fitted {\n",
" /* fitted */\n",
" border: var(--sklearn-color-fitted-level-1) 1pt solid;\n",
" color: var(--sklearn-color-fitted-level-1);\n",
"}\n",
"\n",
"/* On hover */\n",
"#sk-container-id-4 a.estimator_doc_link:hover {\n",
" /* unfitted */\n",
" background-color: var(--sklearn-color-unfitted-level-3);\n",
" color: var(--sklearn-color-background);\n",
" text-decoration: none;\n",
"}\n",
"\n",
"#sk-container-id-4 a.estimator_doc_link.fitted:hover {\n",
" /* fitted */\n",
" background-color: var(--sklearn-color-fitted-level-3);\n",
"}\n",
"</style><div id=\"sk-container-id-4\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>DecisionTreeRegressor()</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-4\" type=\"checkbox\" checked><label for=\"sk-estimator-id-4\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow\"><div><div>DecisionTreeRegressor</div></div><div><a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.6/modules/generated/sklearn.tree.DecisionTreeRegressor.html\">?<span>Documentation for DecisionTreeRegressor</span></a><span class=\"sk-estimator-doc-link fitted\">i<span>Fitted</span></span></div></label><div class=\"sk-toggleable__content fitted\"><pre>DecisionTreeRegressor()</pre></div> </div></div></div></div>"
],
"text/plain": [
"DecisionTreeRegressor()"
]
},
"execution_count": 171,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tree = DecisionTreeRegressor()\n",
"tree.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"id": "8d624704",
"metadata": {},
"source": [
"**Exercice :** proposez un bout de code permettant d'évaluer les performances du modèle (MAE, MSE et RMSE)"
]
},
{
"cell_type": "code",
"execution_count": 172,
"id": "c4ca2cf9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MAE: 0.00\n",
"MSE: 0.00\n",
"RMSE: 0.00\n"
]
}
],
"source": [
"# Prédictions sur l'ensemble d'entraînement\n",
"y_pred_train = tree.predict(X_train)\n",
"\n",
"mae = metrics.mean_absolute_error(y_train, y_pred_train)\n",
"mse = metrics.mean_squared_error(y_train, y_pred_train)\n",
"rmse = metrics.root_mean_squared_error(y_train, y_pred_train)\n",
"\n",
"print(f\"MAE: {mae:.2f}\")\n",
"print(f\"MSE: {mse:.2f}\")\n",
"print(f\"RMSE: {rmse:.2f}\")"
]
},
{
"cell_type": "code",
"execution_count": 173,
"id": "4b739d5b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MAE: 5950.05\n",
"MSE: 160067768.70\n",
"RMSE: 12651.79\n"
]
}
],
"source": [
"y_pred_test = tree.predict(X_test)\n",
"\n",
"mae = metrics.mean_absolute_error(y_test, y_pred_test)\n",
"mse = metrics.mean_squared_error(y_test, y_pred_test)\n",
"rmse = metrics.root_mean_squared_error(y_test, y_pred_test)\n",
"\n",
"print(f\"MAE: {mae:.2f}\")\n",
"print(f\"MSE: {mse:.2f}\")\n",
"print(f\"RMSE: {rmse:.2f}\")\n"
]
},
{
"cell_type": "markdown",
"id": "fb2fe98c",
"metadata": {},
"source": [
"**Question :** que pensez-vous des performances de ce modèle ?"
]
},
{
"cell_type": "markdown",
"id": "7ecba832",
"metadata": {},
"source": [
"## Algorithme supervisé : Random Forest "
]
},
{
"cell_type": "markdown",
"id": "efcb8987",
"metadata": {},
"source": [
"A ce stade, nous avons vu les différentes étapes pour lancer un algorithme de Machine Learning. Néanmoins, ces étapes ne sont pas suffisantes pour construire un modèle performant. \n",
"En effet, afin de construire un modèle performant le Data Scientist doit agir sur l'apprentissage du modèle. Dans ce qui suit nous :\n",
"* Changerons d'algorithme pour utiliser un algorithme plus performant (Random Forest)\n",
"* Raliserons un *grid search* sur les paramètres du modèle\n",
"* Appliquerons l'apprentissage par validation croisée\n"
]
},
{
"cell_type": "markdown",
"id": "d6723a2f",
"metadata": {},
"source": [
"### Modèle avec Validation Croisée"
]
},
{
"cell_type": "markdown",
"id": "3716b09f",
"metadata": {},
"source": [
"#### Sampling"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ab1e1367",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "3f5d735e",
"metadata": {},
"source": [
"#### Fitting avec Cross-Validation"
]
},
{
"cell_type": "markdown",
"id": "bc819f8f",
"metadata": {},
"source": [
"**Exercice :** construisez un modèle RF (RandomForestRegressor) en implémentant la technique de validation croisée. Pensez à enregistrer au sein d'une variable/liste les performances (MAE, MSE & RMSE) du modèle au sein de chaque fold."
]
},
{
"cell_type": "code",
"execution_count": 174,
"id": "b515460e",
"metadata": {},
"outputs": [],
"source": [
"#Initialisation\n",
"# Nombre de sous-échantillons pour la cross-validation\n",
"num_splits = 5\n",
"\n",
"# Random Forest regressor\n",
"rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)\n",
"\n",
"# Initialisation du KFold cross-validation splitter\n",
"kf = KFold(n_splits=num_splits)\n",
"\n",
"# Listes pour enregistrer les performances du modèle\n",
"MAE_scores = []\n",
"MSE_scores = []\n",
"RMSE_scores = []"
]
},
{
"cell_type": "code",
"execution_count": 175,
"id": "eebb394f",
"metadata": {},
"outputs": [],
"source": [
"# Entrainement avec cross-validation\n"
]
},
{
"cell_type": "code",
"execution_count": 176,
"id": "b067126c",
"metadata": {},
"outputs": [],
"source": [
"# Métriques sur tous les folds\n",
"\n",
"#MAE\n",
"for fold, mae in enumerate(MAE_scores, start=1):\n",
" print(f\"Fold {fold} MAE:\", mae)"
]
},
{
"cell_type": "code",
"execution_count": 177,
"id": "6597152c",
"metadata": {},
"outputs": [],
"source": [
"#MSE\n",
"for fold, mse in enumerate(MSE_scores, start=1):\n",
" print(f\"Fold {fold} MSE:\", mse)"
]
},
{
"cell_type": "code",
"execution_count": 178,
"id": "63ff1c9d",
"metadata": {},
"outputs": [],
"source": [
"#RMSE\n",
"for fold, rmse in enumerate(RMSE_scores, start=1):\n",
" print(f\"Fold {fold} RMSE:\", rmse)"
]
},
{
"cell_type": "markdown",
"id": "ec1961c2",
"metadata": {},
"source": [
"**Question :** Commentez les résultats."
]
},
{
"cell_type": "markdown",
"id": "5a8163ef",
"metadata": {},
"source": [
"### Ajout d'un Grid Search pour les hyper paramètres"
]
},
{
"cell_type": "markdown",
"id": "5a6adbfe",
"metadata": {},
"source": [
"#### Sampling"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d9342ad6",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "dce52b11",
"metadata": {},
"source": [
"#### Fitting avec Cross-Validation et *Grid Search*"
]
},
{
"cell_type": "markdown",
"id": "7e3a9dd0",
"metadata": {},
"source": [
"**Exercice :** Intégrez la technique de Grid Search pour rechercher les paramètres optimaux du modèle."
]
},
{
"cell_type": "code",
"execution_count": 179,
"id": "6d58dbc2",
"metadata": {},
"outputs": [],
"source": [
"#Initialisation\n",
"# Nombre de sous-échantillons pour la cross-validation\n",
"num_splits = 5\n",
"\n",
"# Initialisation du KFold cross-validation splitter\n",
"kf = KFold(n_splits=num_splits)\n",
"\n",
"# Listes pour enregistrer les performances du modèle\n",
"MAE_scores = []\n",
"MSE_scores = []\n",
"RMSE_scores = []\n",
"\n",
"# Hyperparamètres à tester\n",
"n_estimators_values = [] #Complétez ici par les paramètres à tester\n",
"max_depth_values = [] #Complétez ici par les paramètres à tester\n",
"min_samples_split_values = [] #Complétez ici par les paramètres à tester\n",
"\n",
"# Liste pour sauveagrder les meilleurs résultats\n",
"best_score = np.inf\n",
"best_params = {}\n",
"\n",
"MAE_best_score = []\n",
"MSE_best_score = []\n",
"RMSE_best_score = []"
]
},
{
"cell_type": "code",
"execution_count": 180,
"id": "47da5172",
"metadata": {},
"outputs": [],
"source": [
"#Complétez ici avec votre code"
]
},
{
"cell_type": "code",
"execution_count": 181,
"id": "d4936c46",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Meilleurs paramètres: {}\n",
"Meilleure RMSE : inf\n"
]
}
],
"source": [
"# Meilleurs résultats\n",
"print(\"Meilleurs paramètres:\", best_params)\n",
"print(\"Meilleure RMSE :\", best_score)"
]
},
{
"cell_type": "code",
"execution_count": 182,
"id": "3215c463",
"metadata": {},
"outputs": [],
"source": [
"# Métriques sur tous les folds\n",
"\n",
"#RMSE\n",
"for fold, rmse in enumerate(RMSE_best_score, start=1):\n",
" print(f\"Fold {fold} RMSE:\", rmse)\n"
]
},
{
"cell_type": "code",
"execution_count": 183,
"id": "bb9a5c9b",
"metadata": {},
"outputs": [],
"source": [
"#MAE\n",
"for fold, mse in enumerate(MSE_best_score, start=1):\n",
" print(f\"Fold {fold} MSE:\", mse)"
]
},
{
"cell_type": "code",
"execution_count": 184,
"id": "0f0768ad",
"metadata": {},
"outputs": [],
"source": [
"#MSE\n",
"for fold, mae in enumerate(MAE_best_score, start=1):\n",
" print(f\"Fold {fold} MAE:\", mae)"
]
},
{
"cell_type": "markdown",
"id": "802a625f",
"metadata": {},
"source": [
"**Question :** Commentez les résultats"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "studies",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}