{ "cells": [ { "cell_type": "markdown", "id": "8750d15b", "metadata": {}, "source": [ "# Course 3: Machine Learning - Supervised Algorithms (1/2)" ] }, { "cell_type": "markdown", "id": "f7c08ae5", "metadata": {}, "source": [ "## Preamble" ] }, { "cell_type": "markdown", "id": "ec7ecb4b", "metadata": {}, "source": [ "The objectives of this session (3h) are:\n", "* Prepare the modeling datasets (sampling)\n", "* Apply a simple supervised model\n", "* Build a Machine Learning model (cross-validation and hyperparameter tuning) to solve a regression problem\n", "* Analyze the model's performance" ] }, { "cell_type": "markdown", "id": "4e99c600", "metadata": {}, "source": [ "## Workspace setup" ] }, { "cell_type": "markdown", "id": "c1b01045", "metadata": {}, "source": [ "### Library imports" ] }, { "cell_type": "code", "execution_count": null, "id": "97d58527", "metadata": {}, "outputs": [], "source": [ "# Data\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# Plotting\n", "import seaborn as sns\n", "\n", "sns.set()\n", "import plotly.express as px\n", "import plotly.graph_objects as gp\n", "import sklearn.preprocessing as preproc\n", "\n", "# Statistics\n", "from scipy.stats import chi2_contingency\n", "\n", "# Machine Learning\n", "from sklearn.cluster import KMeans\n", "import sklearn.metrics as metrics\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.model_selection import KFold, train_test_split\n", "from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor" ] }, { "cell_type": "markdown", "id": "06153286", "metadata": {}, "source": [ "### Function definitions" ] }, { "cell_type": "code", "execution_count": null, "id": "c67db932", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "985e4e97", "metadata": {}, "source": [ "### Constants" ] }, { "cell_type": "code", "execution_count": 
91, "id": "c9597b48", "metadata": {}, "outputs": [], "source": [ "input_path = \"./1_inputs\"\n", "output_path = \"./2_outputs\"" ] }, { "cell_type": "markdown", "id": "b2b035d2", "metadata": {}, "source": [ "### Data import" ] }, { "cell_type": "code", "execution_count": 92, "id": "8051b5f4", "metadata": {}, "outputs": [], "source": [ "path = input_path + '/base_retraitee.csv'\n", "data_retraitee = pd.read_csv(path, sep=\",\", decimal=\".\")" ] }, { "cell_type": "markdown", "id": "a2578ba1", "metadata": {}, "source": [ "## Supervised algorithm: CART" ] }, { "cell_type": "markdown", "id": "aaa0b27d", "metadata": {}, "source": [ "In this part, the goal is to build a simple model (the CART algorithm) in order to walk through the steps needed to fit a model.\n", "We will model the cost of claims directly." ] }, { "cell_type": "markdown", "id": "a0458a05", "metadata": {}, "source": [ "### Building the model" ] }, { "cell_type": "markdown", "id": "b3715c37", "metadata": {}, "source": [ "The first step is to compute the average cost per claim for each row (the target, or response variable). This is the variable to predict from the explanatory variables." 
] }, { "cell_type": "code", "execution_count": 93, "id": "c427a4b8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(824, 14)" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_model = data_retraitee.copy()\n", "\n", "# Keep only the rows with at least one claim (NB > 0)\n", "data_model = data_model[data_model['NB'] > 0]\n", "\n", "# Compute the \"theoretical\" average cost per claim\n", "data_model[\"CM\"] = (data_model[\"CHARGE\"] / data_model[\"NB\"])\n", "data_model = data_model.drop(['CHARGE', 'NB', \"EXPO\"], axis=1)\n", "data_model.shape" ] }, { "cell_type": "markdown", "id": "e3e85088", "metadata": {}, "source": [ "**Exercise:** compute the descriptive statistics of the dataset used." ] }, { "cell_type": "code", "execution_count": 94, "id": "c8fd3ee1", "metadata": {}, "outputs": [ { "data": { "application/vnd.microsoft.datawrangler.viewer.v0+json": { "columns": [ { "name": "index", "rawType": "object", "type": "string" }, { "name": "ANNEE_CTR", "rawType": "float64", "type": "float" }, { "name": "CONTRAT_ANCIENNETE", "rawType": "object", "type": "unknown" }, { "name": "FREQUENCE_PAIEMENT_COTISATION", "rawType": "object", "type": "unknown" }, { "name": "GROUPE_KM", "rawType": "object", "type": "unknown" }, { "name": "ZONE_RISQUE", "rawType": "object", "type": "unknown" }, { "name": "AGE_ASSURE_PRINCIPAL", "rawType": "float64", "type": "float" }, { "name": "GENRE", "rawType": "object", "type": "unknown" }, { "name": "DEUXIEME_CONDUCTEUR", "rawType": "object", "type": "unknown" }, { "name": "ANCIENNETE_PERMIS", "rawType": "float64", "type": "float" }, { "name": "ANNEE_CONSTRUCTION", "rawType": "float64", "type": "float" }, { "name": "ENERGIE", "rawType": "object", "type": "unknown" }, { "name": "EQUIPEMENT_SECURITE", "rawType": "object", "type": "unknown" }, { "name": "VALEUR_DU_BIEN", "rawType": "object", "type": "unknown" }, { "name": "CM", "rawType": "float64", "type": "float" } 
], "ref": "8d8166c3-6828-4361-92de-ebce2dadb512", "rows": [ [ "count", "824.0", "824", "824", "824", "824", "824.0", "824", "824", "824.0", "824.0", "824", "824", "824", "824.0" ], [ "unique", null, "5", "3", "4", "14", null, "2", "2", null, null, "3", "2", "6", null ], [ "top", null, "(0,1]", "MENSUEL", "[0;20000[", "C", null, "M", "False", null, null, "ESSENCE", "FAUX", "[10000;15000[", null ], [ "freq", null, "297", "398", "391", "269", null, "483", "663", null, null, "413", "517", "213", null ], [ "mean", "2018.384708737864", null, null, null, null, "44.383495145631066", null, null, "35.68810679611651", "2015.2123786407767", null, null, null, "4246.01697815534" ], [ "std", "1.515832735580178", null, null, null, null, "13.808216667998865", null, null, "19.370620845496358", "3.1637823115731556", null, null, null, "6869.61691660173" ], [ "min", "2016.0", null, null, null, null, "19.0", null, null, "1.0", "1998.0", null, null, null, "7.5" ], [ "25%", "2017.0", null, null, null, null, "34.0", null, null, "18.0", "2014.0", null, null, null, "1159.96125" ], [ "50%", "2018.0", null, null, null, null, "43.0", null, null, "35.0", "2016.0", null, null, null, "2541.6499999999996" ], [ "75%", "2020.0", null, null, null, null, "53.0", null, null, "53.0", "2017.0", null, null, null, "4193.797500000001" ], [ "max", "2021.0", null, null, null, null, "94.0", null, null, "70.0", "2021.0", null, null, null, "83421.85" ] ], "shape": { "columns": 14, "rows": 11 } }, "text/html": [ "
|  | ANNEE_CTR | CONTRAT_ANCIENNETE | FREQUENCE_PAIEMENT_COTISATION | GROUPE_KM | ZONE_RISQUE | AGE_ASSURE_PRINCIPAL | GENRE | DEUXIEME_CONDUCTEUR | ANCIENNETE_PERMIS | ANNEE_CONSTRUCTION | ENERGIE | EQUIPEMENT_SECURITE | VALEUR_DU_BIEN | CM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 824.000000 | 824 | 824 | 824 | 824 | 824.000000 | 824 | 824 | 824.000000 | 824.000000 | 824 | 824 | 824 | 824.000000 |
| unique | NaN | 5 | 3 | 4 | 14 | NaN | 2 | 2 | NaN | NaN | 3 | 2 | 6 | NaN |
| top | NaN | (0,1] | MENSUEL | [0;20000[ | C | NaN | M | False | NaN | NaN | ESSENCE | FAUX | [10000;15000[ | NaN |
| freq | NaN | 297 | 398 | 391 | 269 | NaN | 483 | 663 | NaN | NaN | 413 | 517 | 213 | NaN |
| mean | 2018.384709 | NaN | NaN | NaN | NaN | 44.383495 | NaN | NaN | 35.688107 | 2015.212379 | NaN | NaN | NaN | 4246.016978 |
| std | 1.515833 | NaN | NaN | NaN | NaN | 13.808217 | NaN | NaN | 19.370621 | 3.163782 | NaN | NaN | NaN | 6869.616917 |
| min | 2016.000000 | NaN | NaN | NaN | NaN | 19.000000 | NaN | NaN | 1.000000 | 1998.000000 | NaN | NaN | NaN | 7.500000 |
| 25% | 2017.000000 | NaN | NaN | NaN | NaN | 34.000000 | NaN | NaN | 18.000000 | 2014.000000 | NaN | NaN | NaN | 1159.961250 |
| 50% | 2018.000000 | NaN | NaN | NaN | NaN | 43.000000 | NaN | NaN | 35.000000 | 2016.000000 | NaN | NaN | NaN | 2541.650000 |
| 75% | 2020.000000 | NaN | NaN | NaN | NaN | 53.000000 | NaN | NaN | 53.000000 | 2017.000000 | NaN | NaN | NaN | 4193.797500 |
| max | 2021.000000 | NaN | NaN | NaN | NaN | 94.000000 | NaN | NaN | 70.000000 | 2021.000000 | NaN | NaN | NaN | 83421.850000 |
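
The descriptive statistics above mix numeric columns (e.g. `AGE_ASSURE_PRINCIPAL`) with categorical ones (e.g. `EQUIPEMENT_SECURITE`), and scikit-learn trees only accept numeric inputs. The sketch below shows one common way to prepare such a table: standardize the numeric features and one-hot encode the categorical ones. It is an assumption, not necessarily the preprocessing used in this notebook: the `toy` frame, its values, and the choice of `StandardScaler` plus `pd.get_dummies` are all illustrative.

```python
import pandas as pd
import sklearn.preprocessing as preproc

# Toy stand-in for data_model (hypothetical values; column names borrowed from the notebook)
toy = pd.DataFrame({
    "AGE_ASSURE_PRINCIPAL": [25.0, 40.0, 55.0, 70.0],
    "EQUIPEMENT_SECURITE": ["FAUX", "VRAI", "FAUX", "VRAI"],
    "CM": [1200.0, 800.0, 4300.0, 2500.0],
})

target = "CM"
features = toy.drop(columns=[target])
num_cols = features.select_dtypes(include="number").columns
cat_cols = features.select_dtypes(exclude="number").columns

# Standardize numeric features (zero mean, unit variance); the target is left untouched
scaled = pd.DataFrame(
    preproc.StandardScaler().fit_transform(features[num_cols]),
    columns=num_cols,
    index=toy.index,
)

# One indicator column per category level, named <column>_<level>
dummies = pd.get_dummies(features[cat_cols], dtype=float)

encoded = pd.concat([scaled, dummies, toy[target]], axis=1)
print(encoded.columns.tolist())
```

Standardization does not change the splits a decision tree finds, since CART only depends on the ordering of each feature; it is kept here because the same encoded table can then be reused by scale-sensitive models.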
|  | ANNEE_CTR | AGE_ASSURE_PRINCIPAL | ANCIENNETE_PERMIS | ANNEE_CONSTRUCTION | CONTRAT_ANCIENNETE_(-1,0] | CONTRAT_ANCIENNETE_(0,1] | CONTRAT_ANCIENNETE_(1,2] | CONTRAT_ANCIENNETE_(2,5] | CONTRAT_ANCIENNETE_(5,10] | FREQUENCE_PAIEMENT_COTISATION_ANNUEL | ... | ENERGIE_ESSENCE | EQUIPEMENT_SECURITE_FAUX | EQUIPEMENT_SECURITE_VRAI | VALEUR_DU_BIEN_[0;10000[ | VALEUR_DU_BIEN_[10000;15000[ | VALEUR_DU_BIEN_[15000;20000[ | VALEUR_DU_BIEN_[20000;25000[ | VALEUR_DU_BIEN_[25000;35000[ | VALEUR_DU_BIEN_[35000;99999[ | CM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.406156 | -0.317648 | 0.067767 | 0.565370 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1072.980 |
| 1 | 1.066260 | -1.259689 | -1.171975 | 0.881639 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3750.000 |
| 2 | 0.406156 | -1.839406 | -1.740190 | 0.565370 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1838.490 |
| 3 | 0.406156 | -0.317648 | 0.481014 | 0.881639 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 4892.740 |
| 4 | -0.253948 | -1.766941 | -1.275287 | -0.383438 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 166.730 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 819 | -0.914052 | 0.406998 | 0.894262 | -2.597324 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1216.755 |
| 820 | -0.253948 | 0.406998 | 1.565789 | 0.249100 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2071.560 |
| 821 | 0.406156 | -1.766941 | -1.533567 | 0.565370 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 5077.640 |
| 822 | -0.253948 | -1.766941 | -1.275287 | -1.648516 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5228.550 |
| 823 | 1.066260 | 0.406998 | 0.067767 | 0.565370 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 5880.340 |

824 rows × 46 columns

DecisionTreeRegressor(max_depth=5, random_state=42)
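
The output above shows a `DecisionTreeRegressor` configured with `max_depth=5` and `random_state=42`. As a minimal sketch of how such a tree can be fitted and evaluated on a held-out sample, the example below uses synthetic data in place of the encoded claims table. The synthetic `X` and `y`, the 80/20 split, and the choice of RMSE and MAE as metrics are assumptions made for illustration, not the notebook's actual cells.

```python
import numpy as np
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the encoded features and the CM target (hypothetical data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1000 + 500 * (X[:, 0] > 0) + rng.normal(scale=50, size=200)

# Hold out 20% of the rows to measure performance on unseen data (assumed split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same hyperparameters as the estimator displayed above
tree = DecisionTreeRegressor(max_depth=5, random_state=42)
tree.fit(X_train, y_train)

# Evaluate on the test sample
y_pred = tree.predict(X_test)
rmse = float(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
mae = float(metrics.mean_absolute_error(y_test, y_pred))
print(f"depth={tree.get_depth()}  RMSE={rmse:.1f}  MAE={mae:.1f}")
```

Limiting `max_depth` caps the number of successive splits, which is a simple guard against overfitting on a sample of only a few hundred claims; comparing the test-set error with the training-set error is one way to check whether the chosen depth is reasonable.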