mirror of
https://github.com/ArthurDanjou/ArtStudies.git
synced 2026-01-23 23:51:51 +01:00
1934 lines
60 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "8750d15b",
|
|
"metadata": {},
|
|
"source": [
"# Course 3: Machine Learning - Supervised Algorithms (1/2)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "f7c08ae5",
|
|
"metadata": {},
|
|
"source": [
"## Preamble"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "ec7ecb4b",
|
|
"metadata": {},
|
|
"source": [
"The objectives of this session (3 h) are:\n",
"* Prepare the modeling datasets (sampling)\n",
"* Apply a simple supervised model\n",
"* Build a Machine Learning model (cross-validation and hyperparameter tuning) to solve a regression problem\n",
"* Analyze the model's performance"
|
|
]
|
|
},
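{
"cell_type": "markdown",
"id": "sketch-kfold-md",
"metadata": {},
"source": [
"The workflow listed above can be sketched on synthetic data: hold out a test set, then estimate the error of a regression tree by K-fold cross-validation. This is only an illustrative sketch: the synthetic data, `max_depth=3`, the 80/20 split and the 5 folds are assumptions, not values taken from this course."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-kfold-code",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.model_selection import KFold, train_test_split\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"# Synthetic data (assumption): 200 rows, 3 features, noisy linear target\n",
"rng = np.random.default_rng(0)\n",
"X = rng.normal(size=(200, 3))\n",
"y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)\n",
"\n",
"# Sampling: keep 20% of the rows aside as a final test set\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)\n",
"\n",
"# 5-fold cross-validation on the training set only\n",
"kf = KFold(n_splits=5, shuffle=True, random_state=0)\n",
"fold_rmse = []\n",
"for train_idx, valid_idx in kf.split(X_train):\n",
"    tree = DecisionTreeRegressor(max_depth=3, random_state=0)\n",
"    tree.fit(X_train[train_idx], y_train[train_idx])\n",
"    pred = tree.predict(X_train[valid_idx])\n",
"    fold_rmse.append(mean_squared_error(y_train[valid_idx], pred) ** 0.5)\n",
"\n",
"print('Mean validation RMSE:', np.mean(fold_rmse))"
]
},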
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "4e99c600",
|
|
"metadata": {},
|
|
"source": [
"## Workspace setup"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "c1b01045",
|
|
"metadata": {},
|
|
"source": [
"### Library imports"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "97d58527",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
"# Data\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Plotting\n",
"import plotly.express as px\n",
"import plotly.graph_objects as gp\n",
"import seaborn as sns\n",
"\n",
"# Statistics\n",
"from scipy.stats import chi2_contingency\n",
"from sklearn import metrics\n",
"\n",
"# Machine Learning\n",
"import sklearn.preprocessing as preproc\n",
"from sklearn.cluster import KMeans\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.model_selection import KFold, train_test_split\n",
"from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor\n",
"\n",
"# Apply seaborn's default plot theme\n",
"sns.set()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "06153286",
|
|
"metadata": {},
|
|
"source": [
"### Function definitions"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "c67db932",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "985e4e97",
|
|
"metadata": {},
|
|
"source": [
"### Constants"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 24,
|
|
"id": "c9597b48",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"input_path = \"./1_inputs\"\n",
|
|
"output_path = \"./2_outputs\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "b2b035d2",
|
|
"metadata": {},
|
|
"source": [
"### Data import"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 25,
|
|
"id": "8051b5f4",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
"path = input_path + '/base_retraitee.csv'\n",
"data_retraitee = pd.read_csv(path, sep=\",\", decimal=\".\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "a2578ba1",
|
|
"metadata": {},
|
|
"source": [
"## Supervised algorithm: CART"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "aaa0b27d",
|
|
"metadata": {},
|
|
"source": [
"In this part, the goal is to build a simple model (the CART algorithm) in order to walk through the steps needed to run a model.\n",
"We will model the claim cost directly."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "a0458a05",
|
|
"metadata": {},
|
|
"source": [
"### Building the model"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "b3715c37",
|
|
"metadata": {},
|
|
"source": [
"The first step is to compute the average claim cost (the target, or response variable). This is the variable we will predict from the explanatory variables."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 26,
|
|
"id": "c427a4b8",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"application/vnd.microsoft.datawrangler.viewer.v0+json": {
|
|
"columns": [
|
|
{
|
|
"name": "index",
|
|
"rawType": "int64",
|
|
"type": "integer"
|
|
},
|
|
{
|
|
"name": "ANNEE_CTR",
|
|
"rawType": "int64",
|
|
"type": "integer"
|
|
},
|
|
{
|
|
"name": "CONTRAT_ANCIENNETE",
|
|
"rawType": "object",
|
|
"type": "string"
|
|
},
|
|
{
|
|
"name": "FREQUENCE_PAIEMENT_COTISATION",
|
|
"rawType": "object",
|
|
"type": "string"
|
|
},
|
|
{
|
|
"name": "GROUPE_KM",
|
|
"rawType": "object",
|
|
"type": "string"
|
|
},
|
|
{
|
|
"name": "ZONE_RISQUE",
|
|
"rawType": "object",
|
|
"type": "string"
|
|
},
|
|
{
|
|
"name": "AGE_ASSURE_PRINCIPAL",
|
|
"rawType": "int64",
|
|
"type": "integer"
|
|
},
|
|
{
|
|
"name": "GENRE",
|
|
"rawType": "object",
|
|
"type": "string"
|
|
},
|
|
{
|
|
"name": "DEUXIEME_CONDUCTEUR",
|
|
"rawType": "bool",
|
|
"type": "boolean"
|
|
},
|
|
{
|
|
"name": "ANCIENNETE_PERMIS",
|
|
"rawType": "int64",
|
|
"type": "integer"
|
|
},
|
|
{
|
|
"name": "ANNEE_CONSTRUCTION",
|
|
"rawType": "float64",
|
|
"type": "float"
|
|
},
|
|
{
|
|
"name": "ENERGIE",
|
|
"rawType": "object",
|
|
"type": "string"
|
|
},
|
|
{
|
|
"name": "EQUIPEMENT_SECURITE",
|
|
"rawType": "object",
|
|
"type": "string"
|
|
},
|
|
{
|
|
"name": "VALEUR_DU_BIEN",
|
|
"rawType": "object",
|
|
"type": "string"
|
|
},
|
|
{
|
|
"name": "CM",
|
|
"rawType": "float64",
|
|
"type": "float"
|
|
}
|
|
],
|
|
"ref": "e76df045-0c83-40e9-a027-c48f278ec1d6",
|
|
"rows": [
|
|
[
|
|
"10",
|
|
"2019",
|
|
"(0,1]",
|
|
"MENSUEL",
|
|
"[0;20000[",
|
|
"C",
|
|
"40",
|
|
"M",
|
|
"False",
|
|
"37",
|
|
"2017.0",
|
|
"ESSENCE",
|
|
"VRAI",
|
|
"[15000;20000[",
|
|
"1072.98"
|
|
],
|
|
[
|
|
"34",
|
|
"2020",
|
|
"(-1,0]",
|
|
"MENSUEL",
|
|
"[20000;40000[",
|
|
"C",
|
|
"27",
|
|
"M",
|
|
"True",
|
|
"13",
|
|
"2018.0",
|
|
"AUTRE",
|
|
"FAUX",
|
|
"[35000;99999[",
|
|
"3750.0"
|
|
],
|
|
[
|
|
"36",
|
|
"2019",
|
|
"(-1,0]",
|
|
"MENSUEL",
|
|
"[20000;40000[",
|
|
"L",
|
|
"19",
|
|
"M",
|
|
"False",
|
|
"2",
|
|
"2017.0",
|
|
"ESSENCE",
|
|
"VRAI",
|
|
"[0;10000[",
|
|
"1838.49"
|
|
],
|
|
[
|
|
"78",
|
|
"2019",
|
|
"(-1,0]",
|
|
"MENSUEL",
|
|
"[20000;40000[",
|
|
"B",
|
|
"40",
|
|
"M",
|
|
"False",
|
|
"45",
|
|
"2018.0",
|
|
"DIESEL",
|
|
"FAUX",
|
|
"[15000;20000[",
|
|
"4892.74"
|
|
],
|
|
[
|
|
"89",
|
|
"2018",
|
|
"(1,2]",
|
|
"MENSUEL",
|
|
"[20000;40000[",
|
|
"C",
|
|
"20",
|
|
"M",
|
|
"False",
|
|
"11",
|
|
"2014.0",
|
|
"ESSENCE",
|
|
"FAUX",
|
|
"[25000;35000[",
|
|
"166.73"
|
|
]
|
|
],
|
|
"shape": {
|
|
"columns": 14,
|
|
"rows": 5
|
|
}
|
|
},
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>ANNEE_CTR</th>\n",
|
|
" <th>CONTRAT_ANCIENNETE</th>\n",
|
|
" <th>FREQUENCE_PAIEMENT_COTISATION</th>\n",
|
|
" <th>GROUPE_KM</th>\n",
|
|
" <th>ZONE_RISQUE</th>\n",
|
|
" <th>AGE_ASSURE_PRINCIPAL</th>\n",
|
|
" <th>GENRE</th>\n",
|
|
" <th>DEUXIEME_CONDUCTEUR</th>\n",
|
|
" <th>ANCIENNETE_PERMIS</th>\n",
|
|
" <th>ANNEE_CONSTRUCTION</th>\n",
|
|
" <th>ENERGIE</th>\n",
|
|
" <th>EQUIPEMENT_SECURITE</th>\n",
|
|
" <th>VALEUR_DU_BIEN</th>\n",
|
|
" <th>CM</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>10</th>\n",
|
|
" <td>2019</td>\n",
|
|
" <td>(0,1]</td>\n",
|
|
" <td>MENSUEL</td>\n",
|
|
" <td>[0;20000[</td>\n",
|
|
" <td>C</td>\n",
|
|
" <td>40</td>\n",
|
|
" <td>M</td>\n",
|
|
" <td>False</td>\n",
|
|
" <td>37</td>\n",
|
|
" <td>2017.0</td>\n",
|
|
" <td>ESSENCE</td>\n",
|
|
" <td>VRAI</td>\n",
|
|
" <td>[15000;20000[</td>\n",
|
|
" <td>1072.98</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>34</th>\n",
|
|
" <td>2020</td>\n",
|
|
" <td>(-1,0]</td>\n",
|
|
" <td>MENSUEL</td>\n",
|
|
" <td>[20000;40000[</td>\n",
|
|
" <td>C</td>\n",
|
|
" <td>27</td>\n",
|
|
" <td>M</td>\n",
|
|
" <td>True</td>\n",
|
|
" <td>13</td>\n",
|
|
" <td>2018.0</td>\n",
|
|
" <td>AUTRE</td>\n",
|
|
" <td>FAUX</td>\n",
|
|
" <td>[35000;99999[</td>\n",
|
|
" <td>3750.00</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>36</th>\n",
|
|
" <td>2019</td>\n",
|
|
" <td>(-1,0]</td>\n",
|
|
" <td>MENSUEL</td>\n",
|
|
" <td>[20000;40000[</td>\n",
|
|
" <td>L</td>\n",
|
|
" <td>19</td>\n",
|
|
" <td>M</td>\n",
|
|
" <td>False</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>2017.0</td>\n",
|
|
" <td>ESSENCE</td>\n",
|
|
" <td>VRAI</td>\n",
|
|
" <td>[0;10000[</td>\n",
|
|
" <td>1838.49</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>78</th>\n",
|
|
" <td>2019</td>\n",
|
|
" <td>(-1,0]</td>\n",
|
|
" <td>MENSUEL</td>\n",
|
|
" <td>[20000;40000[</td>\n",
|
|
" <td>B</td>\n",
|
|
" <td>40</td>\n",
|
|
" <td>M</td>\n",
|
|
" <td>False</td>\n",
|
|
" <td>45</td>\n",
|
|
" <td>2018.0</td>\n",
|
|
" <td>DIESEL</td>\n",
|
|
" <td>FAUX</td>\n",
|
|
" <td>[15000;20000[</td>\n",
|
|
" <td>4892.74</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>89</th>\n",
|
|
" <td>2018</td>\n",
|
|
" <td>(1,2]</td>\n",
|
|
" <td>MENSUEL</td>\n",
|
|
" <td>[20000;40000[</td>\n",
|
|
" <td>C</td>\n",
|
|
" <td>20</td>\n",
|
|
" <td>M</td>\n",
|
|
" <td>False</td>\n",
|
|
" <td>11</td>\n",
|
|
" <td>2014.0</td>\n",
|
|
" <td>ESSENCE</td>\n",
|
|
" <td>FAUX</td>\n",
|
|
" <td>[25000;35000[</td>\n",
|
|
" <td>166.73</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" ANNEE_CTR CONTRAT_ANCIENNETE FREQUENCE_PAIEMENT_COTISATION GROUPE_KM \\\n",
|
|
"10 2019 (0,1] MENSUEL [0;20000[ \n",
|
|
"34 2020 (-1,0] MENSUEL [20000;40000[ \n",
|
|
"36 2019 (-1,0] MENSUEL [20000;40000[ \n",
|
|
"78 2019 (-1,0] MENSUEL [20000;40000[ \n",
|
|
"89 2018 (1,2] MENSUEL [20000;40000[ \n",
|
|
"\n",
|
|
" ZONE_RISQUE AGE_ASSURE_PRINCIPAL GENRE DEUXIEME_CONDUCTEUR \\\n",
|
|
"10 C 40 M False \n",
|
|
"34 C 27 M True \n",
|
|
"36 L 19 M False \n",
|
|
"78 B 40 M False \n",
|
|
"89 C 20 M False \n",
|
|
"\n",
|
|
" ANCIENNETE_PERMIS ANNEE_CONSTRUCTION ENERGIE EQUIPEMENT_SECURITE \\\n",
|
|
"10 37 2017.0 ESSENCE VRAI \n",
|
|
"34 13 2018.0 AUTRE FAUX \n",
|
|
"36 2 2017.0 ESSENCE VRAI \n",
|
|
"78 45 2018.0 DIESEL FAUX \n",
|
|
"89 11 2014.0 ESSENCE FAUX \n",
|
|
"\n",
|
|
" VALEUR_DU_BIEN CM \n",
|
|
"10 [15000;20000[ 1072.98 \n",
|
|
"34 [35000;99999[ 3750.00 \n",
|
|
"36 [0;10000[ 1838.49 \n",
|
|
"78 [15000;20000[ 4892.74 \n",
|
|
"89 [25000;35000[ 166.73 "
|
|
]
|
|
},
|
|
"execution_count": 26,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
"data_model = data_retraitee.copy()\n",
"\n",
"# Keep only the rows with at least one claim (NB > 0)\n",
"data_model = data_model[data_model['NB'] > 0]\n",
"\n",
"# Compute the \"theoretical\" average cost per claim\n",
"data_model[\"CM\"] = data_model[\"CHARGE\"] / data_model[\"NB\"]\n",
"data_model = data_model.drop(['CHARGE', 'NB', \"EXPO\"], axis=1)\n",
"data_model.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "e3e85088",
|
|
"metadata": {},
|
|
"source": [
"**Exercise:** compute descriptive statistics for the dataset used."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 27,
|
|
"id": "c8fd3ee1",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"application/vnd.microsoft.datawrangler.viewer.v0+json": {
|
|
"columns": [
|
|
{
|
|
"name": "index",
|
|
"rawType": "object",
|
|
"type": "string"
|
|
},
|
|
{
|
|
"name": "ANNEE_CTR",
|
|
"rawType": "float64",
|
|
"type": "float"
|
|
},
|
|
{
|
|
"name": "CONTRAT_ANCIENNETE",
|
|
"rawType": "object",
|
|
"type": "unknown"
|
|
},
|
|
{
|
|
"name": "FREQUENCE_PAIEMENT_COTISATION",
|
|
"rawType": "object",
|
|
"type": "unknown"
|
|
},
|
|
{
|
|
"name": "GROUPE_KM",
|
|
"rawType": "object",
|
|
"type": "unknown"
|
|
},
|
|
{
|
|
"name": "ZONE_RISQUE",
|
|
"rawType": "object",
|
|
"type": "unknown"
|
|
},
|
|
{
|
|
"name": "AGE_ASSURE_PRINCIPAL",
|
|
"rawType": "float64",
|
|
"type": "float"
|
|
},
|
|
{
|
|
"name": "GENRE",
|
|
"rawType": "object",
|
|
"type": "unknown"
|
|
},
|
|
{
|
|
"name": "DEUXIEME_CONDUCTEUR",
|
|
"rawType": "object",
|
|
"type": "unknown"
|
|
},
|
|
{
|
|
"name": "ANCIENNETE_PERMIS",
|
|
"rawType": "float64",
|
|
"type": "float"
|
|
},
|
|
{
|
|
"name": "ANNEE_CONSTRUCTION",
|
|
"rawType": "float64",
|
|
"type": "float"
|
|
},
|
|
{
|
|
"name": "ENERGIE",
|
|
"rawType": "object",
|
|
"type": "unknown"
|
|
},
|
|
{
|
|
"name": "EQUIPEMENT_SECURITE",
|
|
"rawType": "object",
|
|
"type": "unknown"
|
|
},
|
|
{
|
|
"name": "VALEUR_DU_BIEN",
|
|
"rawType": "object",
|
|
"type": "unknown"
|
|
},
|
|
{
|
|
"name": "CM",
|
|
"rawType": "float64",
|
|
"type": "float"
|
|
}
|
|
],
|
|
"ref": "b2f9efdd-d035-4c51-9797-2e202b404c15",
|
|
"rows": [
|
|
[
|
|
"count",
|
|
"824.0",
|
|
"824",
|
|
"824",
|
|
"824",
|
|
"824",
|
|
"824.0",
|
|
"824",
|
|
"824",
|
|
"824.0",
|
|
"824.0",
|
|
"824",
|
|
"824",
|
|
"824",
|
|
"824.0"
|
|
],
|
|
[
|
|
"unique",
|
|
null,
|
|
"5",
|
|
"3",
|
|
"4",
|
|
"14",
|
|
null,
|
|
"2",
|
|
"2",
|
|
null,
|
|
null,
|
|
"3",
|
|
"2",
|
|
"6",
|
|
null
|
|
],
|
|
[
|
|
"top",
|
|
null,
|
|
"(0,1]",
|
|
"MENSUEL",
|
|
"[0;20000[",
|
|
"C",
|
|
null,
|
|
"M",
|
|
"False",
|
|
null,
|
|
null,
|
|
"ESSENCE",
|
|
"FAUX",
|
|
"[10000;15000[",
|
|
null
|
|
],
|
|
[
|
|
"freq",
|
|
null,
|
|
"297",
|
|
"398",
|
|
"391",
|
|
"269",
|
|
null,
|
|
"483",
|
|
"663",
|
|
null,
|
|
null,
|
|
"413",
|
|
"517",
|
|
"213",
|
|
null
|
|
],
|
|
[
|
|
"mean",
|
|
"2018.384708737864",
|
|
null,
|
|
null,
|
|
null,
|
|
null,
|
|
"44.383495145631066",
|
|
null,
|
|
null,
|
|
"35.68810679611651",
|
|
"2015.2123786407767",
|
|
null,
|
|
null,
|
|
null,
|
|
"4246.01697815534"
|
|
],
|
|
[
|
|
"std",
|
|
"1.515832735580178",
|
|
null,
|
|
null,
|
|
null,
|
|
null,
|
|
"13.808216667998865",
|
|
null,
|
|
null,
|
|
"19.370620845496358",
|
|
"3.1637823115731556",
|
|
null,
|
|
null,
|
|
null,
|
|
"6869.61691660173"
|
|
],
|
|
[
|
|
"min",
|
|
"2016.0",
|
|
null,
|
|
null,
|
|
null,
|
|
null,
|
|
"19.0",
|
|
null,
|
|
null,
|
|
"1.0",
|
|
"1998.0",
|
|
null,
|
|
null,
|
|
null,
|
|
"7.5"
|
|
],
|
|
[
|
|
"25%",
|
|
"2017.0",
|
|
null,
|
|
null,
|
|
null,
|
|
null,
|
|
"34.0",
|
|
null,
|
|
null,
|
|
"18.0",
|
|
"2014.0",
|
|
null,
|
|
null,
|
|
null,
|
|
"1159.96125"
|
|
],
|
|
[
|
|
"50%",
|
|
"2018.0",
|
|
null,
|
|
null,
|
|
null,
|
|
null,
|
|
"43.0",
|
|
null,
|
|
null,
|
|
"35.0",
|
|
"2016.0",
|
|
null,
|
|
null,
|
|
null,
|
|
"2541.6499999999996"
|
|
],
|
|
[
|
|
"75%",
|
|
"2020.0",
|
|
null,
|
|
null,
|
|
null,
|
|
null,
|
|
"53.0",
|
|
null,
|
|
null,
|
|
"53.0",
|
|
"2017.0",
|
|
null,
|
|
null,
|
|
null,
|
|
"4193.797500000001"
|
|
],
|
|
[
|
|
"max",
|
|
"2021.0",
|
|
null,
|
|
null,
|
|
null,
|
|
null,
|
|
"94.0",
|
|
null,
|
|
null,
|
|
"70.0",
|
|
"2021.0",
|
|
null,
|
|
null,
|
|
null,
|
|
"83421.85"
|
|
]
|
|
],
|
|
"shape": {
|
|
"columns": 14,
|
|
"rows": 11
|
|
}
|
|
},
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>ANNEE_CTR</th>\n",
|
|
" <th>CONTRAT_ANCIENNETE</th>\n",
|
|
" <th>FREQUENCE_PAIEMENT_COTISATION</th>\n",
|
|
" <th>GROUPE_KM</th>\n",
|
|
" <th>ZONE_RISQUE</th>\n",
|
|
" <th>AGE_ASSURE_PRINCIPAL</th>\n",
|
|
" <th>GENRE</th>\n",
|
|
" <th>DEUXIEME_CONDUCTEUR</th>\n",
|
|
" <th>ANCIENNETE_PERMIS</th>\n",
|
|
" <th>ANNEE_CONSTRUCTION</th>\n",
|
|
" <th>ENERGIE</th>\n",
|
|
" <th>EQUIPEMENT_SECURITE</th>\n",
|
|
" <th>VALEUR_DU_BIEN</th>\n",
|
|
" <th>CM</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>count</th>\n",
|
|
" <td>824.000000</td>\n",
|
|
" <td>824</td>\n",
|
|
" <td>824</td>\n",
|
|
" <td>824</td>\n",
|
|
" <td>824</td>\n",
|
|
" <td>824.000000</td>\n",
|
|
" <td>824</td>\n",
|
|
" <td>824</td>\n",
|
|
" <td>824.000000</td>\n",
|
|
" <td>824.000000</td>\n",
|
|
" <td>824</td>\n",
|
|
" <td>824</td>\n",
|
|
" <td>824</td>\n",
|
|
" <td>824.000000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>unique</th>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>4</td>\n",
|
|
" <td>14</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>6</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>top</th>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>(0,1]</td>\n",
|
|
" <td>MENSUEL</td>\n",
|
|
" <td>[0;20000[</td>\n",
|
|
" <td>C</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>M</td>\n",
|
|
" <td>False</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>ESSENCE</td>\n",
|
|
" <td>FAUX</td>\n",
|
|
" <td>[10000;15000[</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>freq</th>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>297</td>\n",
|
|
" <td>398</td>\n",
|
|
" <td>391</td>\n",
|
|
" <td>269</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>483</td>\n",
|
|
" <td>663</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>413</td>\n",
|
|
" <td>517</td>\n",
|
|
" <td>213</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>mean</th>\n",
|
|
" <td>2018.384709</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>44.383495</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>35.688107</td>\n",
|
|
" <td>2015.212379</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>4246.016978</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>std</th>\n",
|
|
" <td>1.515833</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>13.808217</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>19.370621</td>\n",
|
|
" <td>3.163782</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>6869.616917</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>min</th>\n",
|
|
" <td>2016.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>19.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>1.000000</td>\n",
|
|
" <td>1998.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>7.500000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>25%</th>\n",
|
|
" <td>2017.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>34.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>18.000000</td>\n",
|
|
" <td>2014.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>1159.961250</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>50%</th>\n",
|
|
" <td>2018.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>43.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>35.000000</td>\n",
|
|
" <td>2016.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>2541.650000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>75%</th>\n",
|
|
" <td>2020.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>53.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>53.000000</td>\n",
|
|
" <td>2017.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>4193.797500</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>max</th>\n",
|
|
" <td>2021.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>94.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>70.000000</td>\n",
|
|
" <td>2021.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>83421.850000</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" ANNEE_CTR CONTRAT_ANCIENNETE FREQUENCE_PAIEMENT_COTISATION \\\n",
|
|
"count 824.000000 824 824 \n",
|
|
"unique NaN 5 3 \n",
|
|
"top NaN (0,1] MENSUEL \n",
|
|
"freq NaN 297 398 \n",
|
|
"mean 2018.384709 NaN NaN \n",
|
|
"std 1.515833 NaN NaN \n",
|
|
"min 2016.000000 NaN NaN \n",
|
|
"25% 2017.000000 NaN NaN \n",
|
|
"50% 2018.000000 NaN NaN \n",
|
|
"75% 2020.000000 NaN NaN \n",
|
|
"max 2021.000000 NaN NaN \n",
|
|
"\n",
|
|
" GROUPE_KM ZONE_RISQUE AGE_ASSURE_PRINCIPAL GENRE DEUXIEME_CONDUCTEUR \\\n",
|
|
"count 824 824 824.000000 824 824 \n",
|
|
"unique 4 14 NaN 2 2 \n",
|
|
"top [0;20000[ C NaN M False \n",
|
|
"freq 391 269 NaN 483 663 \n",
|
|
"mean NaN NaN 44.383495 NaN NaN \n",
|
|
"std NaN NaN 13.808217 NaN NaN \n",
|
|
"min NaN NaN 19.000000 NaN NaN \n",
|
|
"25% NaN NaN 34.000000 NaN NaN \n",
|
|
"50% NaN NaN 43.000000 NaN NaN \n",
|
|
"75% NaN NaN 53.000000 NaN NaN \n",
|
|
"max NaN NaN 94.000000 NaN NaN \n",
|
|
"\n",
|
|
" ANCIENNETE_PERMIS ANNEE_CONSTRUCTION ENERGIE EQUIPEMENT_SECURITE \\\n",
|
|
"count 824.000000 824.000000 824 824 \n",
|
|
"unique NaN NaN 3 2 \n",
|
|
"top NaN NaN ESSENCE FAUX \n",
|
|
"freq NaN NaN 413 517 \n",
|
|
"mean 35.688107 2015.212379 NaN NaN \n",
|
|
"std 19.370621 3.163782 NaN NaN \n",
|
|
"min 1.000000 1998.000000 NaN NaN \n",
|
|
"25% 18.000000 2014.000000 NaN NaN \n",
|
|
"50% 35.000000 2016.000000 NaN NaN \n",
|
|
"75% 53.000000 2017.000000 NaN NaN \n",
|
|
"max 70.000000 2021.000000 NaN NaN \n",
|
|
"\n",
|
|
" VALEUR_DU_BIEN CM \n",
|
|
"count 824 824.000000 \n",
|
|
"unique 6 NaN \n",
|
|
"top [10000;15000[ NaN \n",
|
|
"freq 213 NaN \n",
|
|
"mean NaN 4246.016978 \n",
|
|
"std NaN 6869.616917 \n",
|
|
"min NaN 7.500000 \n",
|
|
"25% NaN 1159.961250 \n",
|
|
"50% NaN 2541.650000 \n",
|
|
"75% NaN 4193.797500 \n",
|
|
"max NaN 83421.850000 "
|
|
]
|
|
},
|
|
"execution_count": 27,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data_model.describe(include='all')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "92d6156a",
|
|
"metadata": {},
|
|
"source": [
"#### Correlation among the explanatory variables"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "d7327570",
|
|
"metadata": {},
|
|
"source": [
"**Question:** In your view, why should we look at the correlation between variables?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "475e141b",
|
|
"metadata": {},
|
|
"source": [
"*Answer*: To get a better-fitting model, to spot potential relationships between features and target (correlation alone does not establish causality), and to select variables, for example by dropping one of two highly correlated features."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 28,
|
|
"id": "1b156435",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"data_set = data_model.drop(\"CM\", axis=1)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 29,
|
|
"id": "0ef0fcc0",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
"# Split the columns into numeric and categorical variables\n",
"variables_na = []\n",
"variables_numeriques = []\n",
"variables_01 = []\n",
"variables_categorielles = []\n",
"for colu in data_set.columns:\n",
"    if data_set[colu].isna().any():\n",
"        # Columns with missing values are set aside\n",
"        variables_na.append(data_set[colu])\n",
"    elif str(data_set[colu].dtypes) in [\"int32\", \"int64\", \"float64\"]:\n",
"        if data_set[colu].nunique() == 2:\n",
"            # Binary numeric columns are treated as categorical\n",
"            variables_categorielles.append(data_set[colu])\n",
"        else:\n",
"            variables_numeriques.append(data_set[colu])\n",
"    else:\n",
"        variables_categorielles.append(data_set[colu])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "e82fcade",
|
|
"metadata": {},
|
|
"source": [
"##### Correlation of the categorical variables:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 16,
|
|
"id": "e130aae5",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"vars_categorielles = pd.DataFrame(variables_categorielles).transpose()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 31,
|
|
"id": "c39e2ad0",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"application/vnd.microsoft.datawrangler.viewer.v0+json": {
|
|
"columns": [
|
|
{
|
|
"name": "index",
|
|
"rawType": "int64",
|
|
"type": "integer"
|
|
},
|
|
{
|
|
"name": "CONTRAT_ANCIENNETE",
|
|
"rawType": "object",
|
|
"type": "string"
|
|
},
|
|
{
|
|
"name": "FREQUENCE_PAIEMENT_COTISATION",
|
|
"rawType": "object",
|
|
"type": "string"
|
|
},
|
|
{
|
|
"name": "GROUPE_KM",
|
|
"rawType": "object",
|
|
"type": "string"
|
|
},
|
|
{
|
|
"name": "ZONE_RISQUE",
|
|
"rawType": "object",
|
|
"type": "string"
|
|
},
|
|
{
|
|
"name": "GENRE",
|
|
"rawType": "object",
|
|
"type": "string"
|
|
},
|
|
{
|
|
"name": "DEUXIEME_CONDUCTEUR",
|
|
"rawType": "object",
|
|
"type": "unknown"
|
|
},
|
|
{
|
|
"name": "ENERGIE",
|
|
"rawType": "object",
|
|
"type": "string"
|
|
},
|
|
{
|
|
"name": "EQUIPEMENT_SECURITE",
|
|
"rawType": "object",
|
|
"type": "string"
|
|
},
|
|
{
|
|
"name": "VALEUR_DU_BIEN",
|
|
"rawType": "object",
|
|
"type": "string"
|
|
}
|
|
],
|
|
"ref": "089d2df2-1504-4d62-9804-f974629bdaaa",
|
|
"rows": [
|
|
[
|
|
"10",
|
|
"(0,1]",
|
|
"MENSUEL",
|
|
"[0;20000[",
|
|
"C",
|
|
"M",
|
|
"False",
|
|
"ESSENCE",
|
|
"VRAI",
|
|
"[15000;20000["
|
|
],
|
|
[
|
|
"34",
|
|
"(-1,0]",
|
|
"MENSUEL",
|
|
"[20000;40000[",
|
|
"C",
|
|
"M",
|
|
"True",
|
|
"AUTRE",
|
|
"FAUX",
|
|
"[35000;99999["
|
|
],
|
|
[
|
|
"36",
|
|
"(-1,0]",
|
|
"MENSUEL",
|
|
"[20000;40000[",
|
|
"L",
|
|
"M",
|
|
"False",
|
|
"ESSENCE",
|
|
"VRAI",
|
|
"[0;10000["
|
|
],
|
|
[
|
|
"78",
|
|
"(-1,0]",
|
|
"MENSUEL",
|
|
"[20000;40000[",
|
|
"B",
|
|
"M",
|
|
"False",
|
|
"DIESEL",
|
|
"FAUX",
|
|
"[15000;20000["
|
|
],
|
|
[
|
|
"89",
|
|
"(1,2]",
|
|
"MENSUEL",
|
|
"[20000;40000[",
|
|
"C",
|
|
"M",
|
|
"False",
|
|
"ESSENCE",
|
|
"FAUX",
|
|
"[25000;35000["
|
|
]
|
|
],
|
|
"shape": {
|
|
"columns": 9,
|
|
"rows": 5
|
|
}
|
|
},
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>CONTRAT_ANCIENNETE</th>\n",
|
|
" <th>FREQUENCE_PAIEMENT_COTISATION</th>\n",
|
|
" <th>GROUPE_KM</th>\n",
|
|
" <th>ZONE_RISQUE</th>\n",
|
|
" <th>GENRE</th>\n",
|
|
" <th>DEUXIEME_CONDUCTEUR</th>\n",
|
|
" <th>ENERGIE</th>\n",
|
|
" <th>EQUIPEMENT_SECURITE</th>\n",
|
|
" <th>VALEUR_DU_BIEN</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>10</th>\n",
|
|
" <td>(0,1]</td>\n",
|
|
" <td>MENSUEL</td>\n",
|
|
" <td>[0;20000[</td>\n",
|
|
" <td>C</td>\n",
|
|
" <td>M</td>\n",
|
|
" <td>False</td>\n",
|
|
" <td>ESSENCE</td>\n",
|
|
" <td>VRAI</td>\n",
|
|
" <td>[15000;20000[</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>34</th>\n",
|
|
" <td>(-1,0]</td>\n",
|
|
" <td>MENSUEL</td>\n",
|
|
" <td>[20000;40000[</td>\n",
|
|
" <td>C</td>\n",
|
|
" <td>M</td>\n",
|
|
" <td>True</td>\n",
|
|
" <td>AUTRE</td>\n",
|
|
" <td>FAUX</td>\n",
|
|
" <td>[35000;99999[</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>36</th>\n",
|
|
" <td>(-1,0]</td>\n",
|
|
" <td>MENSUEL</td>\n",
|
|
" <td>[20000;40000[</td>\n",
|
|
" <td>L</td>\n",
|
|
" <td>M</td>\n",
|
|
" <td>False</td>\n",
|
|
" <td>ESSENCE</td>\n",
|
|
" <td>VRAI</td>\n",
|
|
" <td>[0;10000[</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>78</th>\n",
|
|
" <td>(-1,0]</td>\n",
|
|
" <td>MENSUEL</td>\n",
|
|
" <td>[20000;40000[</td>\n",
|
|
" <td>B</td>\n",
|
|
" <td>M</td>\n",
|
|
" <td>False</td>\n",
|
|
" <td>DIESEL</td>\n",
|
|
" <td>FAUX</td>\n",
|
|
" <td>[15000;20000[</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>89</th>\n",
|
|
" <td>(1,2]</td>\n",
|
|
" <td>MENSUEL</td>\n",
|
|
" <td>[20000;40000[</td>\n",
|
|
" <td>C</td>\n",
|
|
" <td>M</td>\n",
|
|
" <td>False</td>\n",
|
|
" <td>ESSENCE</td>\n",
|
|
" <td>FAUX</td>\n",
|
|
" <td>[25000;35000[</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" CONTRAT_ANCIENNETE FREQUENCE_PAIEMENT_COTISATION GROUPE_KM \\\n",
|
|
"10 (0,1] MENSUEL [0;20000[ \n",
|
|
"34 (-1,0] MENSUEL [20000;40000[ \n",
|
|
"36 (-1,0] MENSUEL [20000;40000[ \n",
|
|
"78 (-1,0] MENSUEL [20000;40000[ \n",
|
|
"89 (1,2] MENSUEL [20000;40000[ \n",
|
|
"\n",
|
|
" ZONE_RISQUE GENRE DEUXIEME_CONDUCTEUR ENERGIE EQUIPEMENT_SECURITE \\\n",
|
|
"10 C M False ESSENCE VRAI \n",
|
|
"34 C M True AUTRE FAUX \n",
|
|
"36 L M False ESSENCE VRAI \n",
|
|
"78 B M False DIESEL FAUX \n",
|
|
"89 C M False ESSENCE FAUX \n",
|
|
"\n",
|
|
" VALEUR_DU_BIEN \n",
|
|
"10 [15000;20000[ \n",
|
|
"34 [35000;99999[ \n",
|
|
"36 [0;10000[ \n",
|
|
"78 [15000;20000[ \n",
|
|
"89 [25000;35000[ "
|
|
]
|
|
},
|
|
"execution_count": 31,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"vars_categorielles.head()"
|
|
]
|
|
},
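{
"cell_type": "markdown",
"id": "sketch-cramers-v-md",
"metadata": {},
"source": [
"Pearson correlation does not apply to categorical columns. One common option, shown here only as a hedged sketch, is Cramér's V, derived from the chi-squared statistic of the contingency table (`chi2_contingency` is already imported above). The helper `cramers_v` and the toy values are assumptions made for illustration, not part of the original notebook. The formula is undefined when one of the variables has a single category."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-cramers-v-code",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from scipy.stats import chi2_contingency\n",
"\n",
"\n",
"def cramers_v(x, y):\n",
"    # Hypothetical helper: association between two categorical series (0 = none, 1 = perfect)\n",
"    table = pd.crosstab(x, y)\n",
"    # correction=False: no Yates continuity correction (only matters for 2x2 tables)\n",
"    chi2 = chi2_contingency(table, correction=False)[0]\n",
"    n = table.to_numpy().sum()\n",
"    k = min(table.shape) - 1\n",
"    return np.sqrt(chi2 / (n * k))\n",
"\n",
"\n",
"# Toy data (assumption): column names borrowed from the notebook, values invented\n",
"toy = pd.DataFrame({\n",
"    'ENERGIE': ['ESSENCE', 'DIESEL', 'ESSENCE', 'DIESEL', 'ESSENCE', 'DIESEL'],\n",
"    'GENRE': ['M', 'F', 'M', 'F', 'M', 'F'],\n",
"})\n",
"print(cramers_v(toy['ENERGIE'], toy['GENRE']))"
]
},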
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "8f615121",
|
|
"metadata": {},
|
|
"source": [
"##### Correlation of the numeric variables:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 22,
|
|
"id": "a16215ab",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"vars_numeriques = pd.DataFrame(variables_numeriques).transpose()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "532ca6c4",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Matrice de corrélation des variables numériques:\n",
|
|
" ANNEE_CTR AGE_ASSURE_PRINCIPAL ANCIENNETE_PERMIS \\\n",
|
|
"ANNEE_CTR 1.000000 0.026613 0.040797 \n",
|
|
"AGE_ASSURE_PRINCIPAL 0.026613 1.000000 0.540899 \n",
|
|
"ANCIENNETE_PERMIS 0.040797 0.540899 1.000000 \n",
|
|
"ANNEE_CONSTRUCTION 0.387562 -0.031655 0.033320 \n",
|
|
"\n",
|
|
" ANNEE_CONSTRUCTION \n",
|
|
"ANNEE_CTR 0.387562 \n",
|
|
"AGE_ASSURE_PRINCIPAL -0.031655 \n",
|
|
"ANCIENNETE_PERMIS 0.033320 \n",
|
|
"ANNEE_CONSTRUCTION 1.000000 \n"
|
|
]
|
|
},
|
|
{
|
|
"ename": "ValueError",
|
|
"evalue": "\n Invalid value of type 'builtins.str' received for the 'colorscale' property of imshow\n Received value: 'coolwarm'\n\n The 'colorscale' property is a colorscale and may be\n specified as:\n - A list of colors that will be spaced evenly to create the colorscale.\n Many predefined colorscale lists are included in the sequential, diverging,\n and cyclical modules in the plotly.colors package.\n - A list of 2-element lists where the first element is the\n normalized color level value (starting at 0 and ending at 1),\n and the second item is a valid color string.\n (e.g. [[0, 'green'], [0.5, 'red'], [1.0, 'rgb(0, 0, 255)']])\n - One of the following named colorscales:\n ['aggrnyl', 'agsunset', 'algae', 'amp', 'armyrose', 'balance',\n 'blackbody', 'bluered', 'blues', 'blugrn', 'bluyl', 'brbg',\n 'brwnyl', 'bugn', 'bupu', 'burg', 'burgyl', 'cividis', 'curl',\n 'darkmint', 'deep', 'delta', 'dense', 'earth', 'edge', 'electric',\n 'emrld', 'fall', 'geyser', 'gnbu', 'gray', 'greens', 'greys',\n 'haline', 'hot', 'hsv', 'ice', 'icefire', 'inferno', 'jet',\n 'magenta', 'magma', 'matter', 'mint', 'mrybm', 'mygbm', 'oranges',\n 'orrd', 'oryel', 'oxy', 'peach', 'phase', 'picnic', 'pinkyl',\n 'piyg', 'plasma', 'plotly3', 'portland', 'prgn', 'pubu', 'pubugn',\n 'puor', 'purd', 'purp', 'purples', 'purpor', 'rainbow', 'rdbu',\n 'rdgy', 'rdpu', 'rdylbu', 'rdylgn', 'redor', 'reds', 'solar',\n 'spectral', 'speed', 'sunset', 'sunsetdark', 'teal', 'tealgrn',\n 'tealrose', 'tempo', 'temps', 'thermal', 'tropic', 'turbid',\n 'turbo', 'twilight', 'viridis', 'ylgn', 'ylgnbu', 'ylorbr',\n 'ylorrd'].\n Appending '_r' to a named colorscale reverses it.\n",
|
|
"output_type": "error",
|
|
"traceback": [
|
|
"\u001b[31m---------------------------------------------------------------------------\u001b[39m",
|
|
"\u001b[31mValueError\u001b[39m Traceback (most recent call last)",
|
|
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[32]\u001b[39m\u001b[32m, line 6\u001b[39m\n\u001b[32m 3\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33m\"\u001b[39m\u001b[33mMatrice de corrélation des variables numériques:\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m 4\u001b[39m \u001b[38;5;28mprint\u001b[39m(correlation_matrix)\n\u001b[32m----> \u001b[39m\u001b[32m6\u001b[39m fig = \u001b[43mpx\u001b[49m\u001b[43m.\u001b[49m\u001b[43mimshow\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 7\u001b[39m \u001b[43m \u001b[49m\u001b[43mcorrelation_matrix\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mtext_auto\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mcolor_continuous_scale\u001b[49m\u001b[43m=\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mcoolwarm\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43maspect\u001b[49m\u001b[43m=\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mauto\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\n\u001b[32m 8\u001b[39m \u001b[43m)\u001b[49m\n\u001b[32m 9\u001b[39m fig.update_layout(title=\u001b[33m\"\u001b[39m\u001b[33mMatrice de corrélation des variables numériques\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m 10\u001b[39m fig.show()\n",
|
|
"\u001b[36mFile \u001b[39m\u001b[32m~/Workspace/studies/.venv/lib/python3.13/site-packages/plotly/express/_imshow.py:423\u001b[39m, in \u001b[36mimshow\u001b[39m\u001b[34m(img, zmin, zmax, origin, labels, x, y, animation_frame, facet_col, facet_col_wrap, facet_col_spacing, facet_row_spacing, color_continuous_scale, color_continuous_midpoint, range_color, title, template, width, height, aspect, contrast_rescaling, binary_string, binary_backend, binary_compression_level, binary_format, text_auto)\u001b[39m\n\u001b[32m 420\u001b[39m layout[\u001b[33m\"\u001b[39m\u001b[33myaxis\u001b[39m\u001b[33m\"\u001b[39m][\u001b[33m\"\u001b[39m\u001b[33mconstrain\u001b[39m\u001b[33m\"\u001b[39m] = \u001b[33m\"\u001b[39m\u001b[33mdomain\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 421\u001b[39m colorscale_validator = ColorscaleValidator(\u001b[33m\"\u001b[39m\u001b[33mcolorscale\u001b[39m\u001b[33m\"\u001b[39m, \u001b[33m\"\u001b[39m\u001b[33mimshow\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m 422\u001b[39m layout[\u001b[33m\"\u001b[39m\u001b[33mcoloraxis1\u001b[39m\u001b[33m\"\u001b[39m] = \u001b[38;5;28mdict\u001b[39m(\n\u001b[32m--> \u001b[39m\u001b[32m423\u001b[39m colorscale=\u001b[43mcolorscale_validator\u001b[49m\u001b[43m.\u001b[49m\u001b[43mvalidate_coerce\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 424\u001b[39m \u001b[43m \u001b[49m\u001b[43margs\u001b[49m\u001b[43m[\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mcolor_continuous_scale\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m]\u001b[49m\n\u001b[32m 425\u001b[39m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m,\n\u001b[32m 426\u001b[39m cmid=color_continuous_midpoint,\n\u001b[32m 427\u001b[39m cmin=zmin,\n\u001b[32m 428\u001b[39m cmax=zmax,\n\u001b[32m 429\u001b[39m )\n\u001b[32m 430\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m labels[\u001b[33m\"\u001b[39m\u001b[33mcolor\u001b[39m\u001b[33m\"\u001b[39m]:\n\u001b[32m 431\u001b[39m 
layout[\u001b[33m\"\u001b[39m\u001b[33mcoloraxis1\u001b[39m\u001b[33m\"\u001b[39m][\u001b[33m\"\u001b[39m\u001b[33mcolorbar\u001b[39m\u001b[33m\"\u001b[39m] = \u001b[38;5;28mdict\u001b[39m(title_text=labels[\u001b[33m\"\u001b[39m\u001b[33mcolor\u001b[39m\u001b[33m\"\u001b[39m])\n",
|
|
"\u001b[36mFile \u001b[39m\u001b[32m~/Workspace/studies/.venv/lib/python3.13/site-packages/_plotly_utils/basevalidators.py:1636\u001b[39m, in \u001b[36mColorscaleValidator.validate_coerce\u001b[39m\u001b[34m(self, v)\u001b[39m\n\u001b[32m 1631\u001b[39m v = [\n\u001b[32m 1632\u001b[39m [e[\u001b[32m0\u001b[39m], ColorValidator.perform_validate_coerce(e[\u001b[32m1\u001b[39m])] \u001b[38;5;28;01mfor\u001b[39;00m e \u001b[38;5;129;01min\u001b[39;00m v\n\u001b[32m 1633\u001b[39m ]\n\u001b[32m 1635\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m v_valid:\n\u001b[32m-> \u001b[39m\u001b[32m1636\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mraise_invalid_val\u001b[49m\u001b[43m(\u001b[49m\u001b[43mv\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1638\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m v\n",
|
|
"\u001b[36mFile \u001b[39m\u001b[32m~/Workspace/studies/.venv/lib/python3.13/site-packages/_plotly_utils/basevalidators.py:298\u001b[39m, in \u001b[36mBaseValidator.raise_invalid_val\u001b[39m\u001b[34m(self, v, inds)\u001b[39m\n\u001b[32m 295\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m i \u001b[38;5;129;01min\u001b[39;00m inds:\n\u001b[32m 296\u001b[39m name += \u001b[33m\"\u001b[39m\u001b[33m[\u001b[39m\u001b[33m\"\u001b[39m + \u001b[38;5;28mstr\u001b[39m(i) + \u001b[33m\"\u001b[39m\u001b[33m]\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m--> \u001b[39m\u001b[32m298\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[32m 299\u001b[39m \u001b[38;5;250m \u001b[39m\u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m 300\u001b[39m \u001b[33;03m Invalid value of type {typ} received for the '{name}' property of {pname}\u001b[39;00m\n\u001b[32m 301\u001b[39m \u001b[33;03m Received value: {v}\u001b[39;00m\n\u001b[32m 302\u001b[39m \n\u001b[32m 303\u001b[39m \u001b[33;03m{valid_clr_desc}\"\"\"\u001b[39;00m.format(\n\u001b[32m 304\u001b[39m name=name,\n\u001b[32m 305\u001b[39m pname=\u001b[38;5;28mself\u001b[39m.parent_name,\n\u001b[32m 306\u001b[39m typ=type_str(v),\n\u001b[32m 307\u001b[39m v=\u001b[38;5;28mrepr\u001b[39m(v),\n\u001b[32m 308\u001b[39m valid_clr_desc=\u001b[38;5;28mself\u001b[39m.description(),\n\u001b[32m 309\u001b[39m )\n\u001b[32m 310\u001b[39m )\n",
|
|
"\u001b[31mValueError\u001b[39m: \n Invalid value of type 'builtins.str' received for the 'colorscale' property of imshow\n Received value: 'coolwarm'\n\n The 'colorscale' property is a colorscale and may be\n specified as:\n - A list of colors that will be spaced evenly to create the colorscale.\n Many predefined colorscale lists are included in the sequential, diverging,\n and cyclical modules in the plotly.colors package.\n - A list of 2-element lists where the first element is the\n normalized color level value (starting at 0 and ending at 1),\n and the second item is a valid color string.\n (e.g. [[0, 'green'], [0.5, 'red'], [1.0, 'rgb(0, 0, 255)']])\n - One of the following named colorscales:\n ['aggrnyl', 'agsunset', 'algae', 'amp', 'armyrose', 'balance',\n 'blackbody', 'bluered', 'blues', 'blugrn', 'bluyl', 'brbg',\n 'brwnyl', 'bugn', 'bupu', 'burg', 'burgyl', 'cividis', 'curl',\n 'darkmint', 'deep', 'delta', 'dense', 'earth', 'edge', 'electric',\n 'emrld', 'fall', 'geyser', 'gnbu', 'gray', 'greens', 'greys',\n 'haline', 'hot', 'hsv', 'ice', 'icefire', 'inferno', 'jet',\n 'magenta', 'magma', 'matter', 'mint', 'mrybm', 'mygbm', 'oranges',\n 'orrd', 'oryel', 'oxy', 'peach', 'phase', 'picnic', 'pinkyl',\n 'piyg', 'plasma', 'plotly3', 'portland', 'prgn', 'pubu', 'pubugn',\n 'puor', 'purd', 'purp', 'purples', 'purpor', 'rainbow', 'rdbu',\n 'rdgy', 'rdpu', 'rdylbu', 'rdylgn', 'redor', 'reds', 'solar',\n 'spectral', 'speed', 'sunset', 'sunsetdark', 'teal', 'tealgrn',\n 'tealrose', 'tempo', 'temps', 'thermal', 'tropic', 'turbid',\n 'turbo', 'twilight', 'viridis', 'ylgn', 'ylgnbu', 'ylorbr',\n 'ylorrd'].\n Appending '_r' to a named colorscale reverses it.\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Calcul des corrélations entre variables numériques\n",
|
|
"correlation_matrix = vars_numeriques.corr()\n",
|
|
"print(\"Matrice de corrélation des variables numériques:\")\n",
|
|
"print(correlation_matrix)\n",
|
|
"\n",
|
|
"fig = px.imshow(\n",
|
|
" correlation_matrix, text_auto=True, color_continuous_scale=\"coolwarm\", aspect=\"auto\"\n",
|
|
")\n",
|
|
"fig.update_layout(title=\"Matrice de corrélation des variables numériques\")\n",
|
|
"fig.show()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "98c7dba6",
|
|
"metadata": {},
|
|
"source": [
|
|
"**Question :** quels sont vos commentaires ?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "212209ec",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Preprocessing"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "65aca700",
|
|
"metadata": {},
|
|
"source": [
|
|
"Deux étapes sont nécessaires avant de lancer l'apprentissage d'un modèle, c'est ce qu'on connait comme le *Preprocessing* :\n",
|
|
"\n",
|
|
"* Les modèles proposés par la librairie \"sklearn\" ne gèrent que des variables numériques. Il est donc nécessaire de transformer les variables catégorielles en variables numériques : ce processus s'appelle le *One Hot Encoding*.\n",
|
|
"* Normaliser les données numériques"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "95f5cc9f",
|
|
"metadata": {},
|
|
"source": [
|
|
"**Exercice :** proposez un bout de code permettant de réaliser le One Hot Encoding des variables catégorielles. Vous pourrez utiliser la fonction \"preproc.OneHotEncoder\" de la librairie sklearn"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "b8530717",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
},
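{
"cell_type": "markdown",
"id": "a3f1c0d1",
"metadata": {},
"source": [
"One possible sketch for the exercise above, not the reference solution. The course dataset is not reproduced here, so `df_demo` and its columns `TYPE_VEHICULE` and `ZONE` are hypothetical. The `sparse_output` argument assumes scikit-learn 1.2 or later."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3f1c0d2",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import sklearn.preprocessing as preproc\n",
"\n",
"# Hypothetical toy data standing in for the course dataset\n",
"df_demo = pd.DataFrame(\n",
"    {\n",
"        \"TYPE_VEHICULE\": [\"A\", \"B\", \"A\", \"C\"],\n",
"        \"ZONE\": [\"urbain\", \"rural\", \"urbain\", \"rural\"],\n",
"    }\n",
")\n",
"\n",
"# Dense output so the result fits directly into a DataFrame;\n",
"# categories unseen at fit time are encoded as all zeros\n",
"encoder = preproc.OneHotEncoder(sparse_output=False, handle_unknown=\"ignore\")\n",
"encoded = encoder.fit_transform(df_demo)\n",
"\n",
"# One column per (variable, category) pair, e.g. TYPE_VEHICULE_A\n",
"df_encoded = pd.DataFrame(\n",
"    encoded,\n",
"    columns=encoder.get_feature_names_out(),\n",
"    index=df_demo.index,\n",
")\n",
"print(df_encoded.head())"
]
},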
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "b70abc5c",
|
|
"metadata": {},
|
|
"source": [
|
|
"**Exercice :** proposez un bout de code permettant normaliser les variables numériques présentes dans la base. Vous pourrez utiliser la fonction \"preproc.StandardScaler\" de la librairie sklearn"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "4ff3847d",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
},
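{
"cell_type": "markdown",
"id": "a3f1c0d3",
"metadata": {},
"source": [
"A minimal sketch of the standardization step, under the assumption that the numeric columns can be scaled independently. The values in `df_num_demo` are made up; only the column names are taken from the correlation matrix printed earlier. In a real pipeline the scaler would usually be fitted on the training set only and then applied to the test set."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3f1c0d4",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import sklearn.preprocessing as preproc\n",
"\n",
"# Hypothetical toy values for two of the numeric columns\n",
"df_num_demo = pd.DataFrame(\n",
"    {\n",
"        \"AGE_ASSURE_PRINCIPAL\": [25, 40, 58, 33],\n",
"        \"ANCIENNETE_PERMIS\": [3, 20, 35, 10],\n",
"    }\n",
")\n",
"\n",
"# Centre each column on 0 and rescale it to unit variance\n",
"scaler = preproc.StandardScaler()\n",
"scaled = scaler.fit_transform(df_num_demo)\n",
"df_scaled = pd.DataFrame(scaled, columns=df_num_demo.columns, index=df_num_demo.index)\n",
"\n",
"# Sanity check: means close to 0, population standard deviations close to 1\n",
"print(df_scaled.mean().round(6))\n",
"print(df_scaled.std(ddof=0).round(6))"
]
},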
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "62d49546",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Sampling"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "64d229f4",
|
|
"metadata": {},
|
|
"source": [
|
|
"**Exercice :** proposez un bout de code permettant construire la base d'apprentissage (80% des données) et la base de test (20%)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "6a1c7907",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
},
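{
"cell_type": "markdown",
"id": "a3f1c0d5",
"metadata": {},
"source": [
"The 80/20 split can be sketched with `train_test_split`, which is already imported at the top of the notebook. `X_demo` and `y_demo` are synthetic placeholders for the preprocessed features and the target, which are not shown in this section."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3f1c0d6",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Synthetic placeholders for the real features and target\n",
"rng = np.random.default_rng(42)\n",
"X_demo = rng.normal(size=(100, 3))\n",
"y_demo = rng.normal(size=100)\n",
"\n",
"# test_size=0.2 keeps 20% of the rows for testing; random_state makes the split reproducible\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    X_demo, y_demo, test_size=0.2, random_state=42\n",
")\n",
"print(X_train.shape, X_test.shape)"
]
},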
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "84dc7a07",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Fitting"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "97c7b783",
|
|
"metadata": {},
|
|
"source": [
|
|
"**Exercice :** proposez un bout de code permettant construire le modèle"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "bd26339b",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
},
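{
"cell_type": "markdown",
"id": "a3f1c0d7",
"metadata": {},
"source": [
"A hedged sketch of the fitting step. This section does not say which simple model is expected; `DecisionTreeRegressor` is assumed here because it is imported at the top of the notebook. The data are synthetic and `max_depth=5` is an arbitrary illustrative choice."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3f1c0d8",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"# Synthetic regression problem: the target depends mostly on the first feature\n",
"rng = np.random.default_rng(42)\n",
"X_demo = rng.normal(size=(200, 3))\n",
"y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    X_demo, y_demo, test_size=0.2, random_state=42\n",
")\n",
"\n",
"# Assumed model: a shallow regression tree (limiting depth reduces overfitting)\n",
"tree_model = DecisionTreeRegressor(max_depth=5, random_state=42)\n",
"tree_model.fit(X_train, y_train)\n",
"\n",
"# Predictions on data the model has not seen during training\n",
"y_pred = tree_model.predict(X_test)\n",
"print(y_pred[:5])"
]
},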
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "8d624704",
|
|
"metadata": {},
|
|
"source": [
|
|
"**Exercice :** proposez un bout de code permettant d'évaluer les performances du modèle (MAE, MSE et RMSE)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "c4ca2cf9",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
},
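{
"cell_type": "markdown",
"id": "a3f1c0d9",
"metadata": {},
"source": [
"The three metrics can be computed as sketched below. The arrays `y_true_demo` and `y_pred_demo` are made-up values used only to show the calls; with a real model they would be the test targets and the predictions. The RMSE is taken as the square root of the MSE, which keeps it in the same unit as the target."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3f1c0da",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn import metrics\n",
"\n",
"# Made-up observed values and predictions\n",
"y_true_demo = np.array([100.0, 150.0, 200.0, 250.0])\n",
"y_pred_demo = np.array([110.0, 140.0, 195.0, 270.0])\n",
"\n",
"# MAE: mean of the absolute errors\n",
"mae = metrics.mean_absolute_error(y_true_demo, y_pred_demo)\n",
"# MSE: mean of the squared errors (penalizes large errors more)\n",
"mse = metrics.mean_squared_error(y_true_demo, y_pred_demo)\n",
"# RMSE: square root of the MSE\n",
"rmse = np.sqrt(mse)\n",
"\n",
"print(f\"MAE: {mae:.2f}  MSE: {mse:.2f}  RMSE: {rmse:.2f}\")"
]
},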
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "fb2fe98c",
|
|
"metadata": {},
|
|
"source": [
|
|
"**Question :** que pensez-vous des performances de ce modèle ?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "7ecba832",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Algorithme supervisé : Random Forest "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "efcb8987",
|
|
"metadata": {},
|
|
"source": [
|
|
"A ce stade, nous avons vu les différentes étapes pour lancer un algorithme de Machine Learning. Néanmoins, ces étapes ne sont pas suffisantes pour construire un modèle performant. \n",
|
|
"En effet, afin de construire un modèle performant le Data Scientist doit agir sur l'apprentissage du modèle. Dans ce qui suit nous :\n",
|
|
"* Changerons d'algorithme pour utiliser un algorithme plus performant (Random Forest)\n",
|
|
"* Raliserons un *grid search* sur les paramètres du modèle\n",
|
|
"* Appliquerons l'apprentissage par validation croisée\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "d6723a2f",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Modèle avec Validation Croisée"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "3716b09f",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Sampling"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "ab1e1367",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "3f5d735e",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Fitting avec Cross-Validation"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "bc819f8f",
|
|
"metadata": {},
|
|
"source": [
|
|
"**Exercice :** construisez un modèle RF (RandomForestRegressor) en implémentant la technique de validation croisée. Pensez à enregistrer au sein d'une variable/liste les performances (MAE, MSE & RMSE) du modèle au sein de chaque fold."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "b515460e",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#Initialisation\n",
|
|
"# Nombre de sous-échantillons pour la cross-validation\n",
|
|
"num_splits = 5\n",
|
|
"\n",
|
|
"# Random Forest regressor\n",
|
|
"rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)\n",
|
|
"\n",
|
|
"# Initialisation du KFold cross-validation splitter\n",
|
|
"kf = KFold(n_splits=num_splits)\n",
|
|
"\n",
|
|
"# Listes pour enregistrer les performances du modèle\n",
|
|
"MAE_scores = []\n",
|
|
"MSE_scores = []\n",
|
|
"RMSE_scores = []"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "eebb394f",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Entrainement avec cross-validation\n"
|
|
]
|
|
},
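{
"cell_type": "markdown",
"id": "a3f1c0db",
"metadata": {},
"source": [
"A possible shape for the cross-validation loop, shown on synthetic data. The names suffixed with `_demo` are hypothetical and chosen so that they do not overwrite `kf`, `MAE_scores`, `MSE_scores` or `RMSE_scores` defined above. Shuffling the folds is an assumption made here; the splitter defined above does not shuffle."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3f1c0dc",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn import metrics\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.model_selection import KFold\n",
"\n",
"# Synthetic placeholders for the real features and target\n",
"rng = np.random.default_rng(42)\n",
"X_demo = rng.normal(size=(200, 3))\n",
"y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)\n",
"\n",
"kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)\n",
"mae_demo, mse_demo, rmse_demo = [], [], []\n",
"\n",
"for train_idx, val_idx in kf_demo.split(X_demo):\n",
"    # A fresh model per fold, trained on 4/5 of the data\n",
"    model = RandomForestRegressor(n_estimators=50, random_state=42)\n",
"    model.fit(X_demo[train_idx], y_demo[train_idx])\n",
"\n",
"    # Evaluate on the held-out fold and record the three metrics\n",
"    y_pred = model.predict(X_demo[val_idx])\n",
"    mse = metrics.mean_squared_error(y_demo[val_idx], y_pred)\n",
"    mae_demo.append(metrics.mean_absolute_error(y_demo[val_idx], y_pred))\n",
"    mse_demo.append(mse)\n",
"    rmse_demo.append(np.sqrt(mse))\n",
"\n",
"print(\"Mean RMSE over folds:\", np.mean(rmse_demo))"
]
},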
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "b067126c",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Métriques sur tous les folds\n",
|
|
"\n",
|
|
"#MAE\n",
|
|
"for fold, mae in enumerate(MAE_scores, start=1):\n",
|
|
" print(f\"Fold {fold} MAE:\", mae)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "6597152c",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#MSE\n",
|
|
"for fold, mse in enumerate(MSE_scores, start=1):\n",
|
|
" print(f\"Fold {fold} MSE:\", mse)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "63ff1c9d",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#RMSE\n",
|
|
"for fold, rmse in enumerate(RMSE_scores, start=1):\n",
|
|
" print(f\"Fold {fold} RMSE:\", rmse)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "ec1961c2",
|
|
"metadata": {},
|
|
"source": [
|
|
"**Question :** Commentez les résultats."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "5a8163ef",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Ajout d'un Grid Search pour les hyper paramètres"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "5a6adbfe",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Sampling"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "d9342ad6",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "dce52b11",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### Fitting avec Cross-Validation et *Grid Search*"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "7e3a9dd0",
|
|
"metadata": {},
|
|
"source": [
|
|
"**Exercice :** Intégrez la technique de Grid Search pour rechercher les paramètres optimaux du modèle."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "6d58dbc2",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#Initialisation\n",
|
|
"# Nombre de sous-échantillons pour la cross-validation\n",
|
|
"num_splits = 5\n",
|
|
"\n",
|
|
"# Initialisation du KFold cross-validation splitter\n",
|
|
"kf = KFold(n_splits=num_splits)\n",
|
|
"\n",
|
|
"# Listes pour enregistrer les performances du modèle\n",
|
|
"MAE_scores = []\n",
|
|
"MSE_scores = []\n",
|
|
"RMSE_scores = []\n",
|
|
"\n",
|
|
"# Hyperparamètres à tester\n",
|
|
"n_estimators_values = [] #Complétez ici par les paramètres à tester\n",
|
|
"max_depth_values = [] #Complétez ici par les paramètres à tester\n",
|
|
"min_samples_split_values = [] #Complétez ici par les paramètres à tester\n",
|
|
"\n",
|
|
"# Liste pour sauveagrder les meilleurs résultats\n",
|
|
"best_score = np.inf\n",
|
|
"best_params = {}\n",
|
|
"\n",
|
|
"MAE_best_score = []\n",
|
|
"MSE_best_score = []\n",
|
|
"RMSE_best_score = []"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "47da5172",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#Complétez ici avec votre code"
|
|
]
|
|
},
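{
"cell_type": "markdown",
"id": "a3f1c0dd",
"metadata": {},
"source": [
"One way to combine the grid search with cross-validation is a plain loop over every hyperparameter combination, keeping the combination with the lowest mean RMSE. This is only a sketch on synthetic data: the grid values in `grid_demo` are illustrative and not taken from the course, and the `_demo` names are hypothetical. scikit-learn also provides `GridSearchCV` in `sklearn.model_selection`, which automates the same procedure."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3f1c0de",
"metadata": {},
"outputs": [],
"source": [
"from itertools import product\n",
"\n",
"import numpy as np\n",
"from sklearn import metrics\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.model_selection import KFold\n",
"\n",
"# Synthetic placeholders for the real features and target\n",
"rng = np.random.default_rng(42)\n",
"X_demo = rng.normal(size=(150, 3))\n",
"y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=150)\n",
"\n",
"# Illustrative grid only; these values are assumptions\n",
"grid_demo = {\n",
"    \"n_estimators\": [20, 50],\n",
"    \"max_depth\": [3, None],\n",
"    \"min_samples_split\": [2, 10],\n",
"}\n",
"\n",
"kf_demo = KFold(n_splits=3, shuffle=True, random_state=42)\n",
"best_rmse_demo, best_params_demo = np.inf, {}\n",
"\n",
"# Try every combination of the grid (2 x 2 x 2 = 8 candidates)\n",
"for values in product(*grid_demo.values()):\n",
"    params = dict(zip(grid_demo.keys(), values))\n",
"    fold_rmse = []\n",
"    for train_idx, val_idx in kf_demo.split(X_demo):\n",
"        model = RandomForestRegressor(random_state=42, **params)\n",
"        model.fit(X_demo[train_idx], y_demo[train_idx])\n",
"        y_pred = model.predict(X_demo[val_idx])\n",
"        fold_rmse.append(np.sqrt(metrics.mean_squared_error(y_demo[val_idx], y_pred)))\n",
"\n",
"    # Keep the candidate with the lowest mean RMSE across folds\n",
"    mean_rmse = np.mean(fold_rmse)\n",
"    if mean_rmse < best_rmse_demo:\n",
"        best_rmse_demo, best_params_demo = mean_rmse, params\n",
"\n",
"print(\"Best parameters:\", best_params_demo)\n",
"print(\"Best mean RMSE:\", best_rmse_demo)"
]
},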
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "d4936c46",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Meilleurs résultats\n",
|
|
"print(\"Meilleurs paramètres:\", best_params)\n",
|
|
"print(\"Meilleure RMSE :\", best_score)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "3215c463",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Métriques sur tous les folds\n",
|
|
"\n",
|
|
"#RMSE\n",
|
|
"for fold, rmse in enumerate(RMSE_best_score, start=1):\n",
|
|
" print(f\"Fold {fold} RMSE:\", rmse)\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "bb9a5c9b",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#MAE\n",
|
|
"for fold, mse in enumerate(MSE_best_score, start=1):\n",
|
|
" print(f\"Fold {fold} MSE:\", mse)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "0f0768ad",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#MSE\n",
|
|
"for fold, mae in enumerate(MAE_best_score, start=1):\n",
|
|
" print(f\"Fold {fold} MAE:\", mae)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "802a625f",
|
|
"metadata": {},
|
|
"source": [
|
|
"**Question :** Commentez les résultats"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "studies",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.13.3"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|