{ "cells": [ { "cell_type": "markdown", "id": "8750d15b", "metadata": {}, "source": [ "# Course 3: Machine Learning - Supervised Algorithms (1/2)" ] }, { "cell_type": "markdown", "id": "f7c08ae5", "metadata": {}, "source": [ "## Preamble" ] }, { "cell_type": "markdown", "id": "ec7ecb4b", "metadata": {}, "source": [ "The objectives of this session (3h) are:\n", "* Prepare the modeling datasets (sampling).\n", "* Apply a simple supervised model.\n", "* Build a Machine Learning model (cross-validation and hyperparameter tuning) to solve a regression problem.\n", "* Analyze the model's performance." ] }, { "cell_type": "markdown", "id": "4e99c600", "metadata": {}, "source": [ "## Workspace setup" ] }, { "cell_type": "markdown", "id": "c1b01045", "metadata": {}, "source": [ "### Library imports" ] }, { "cell_type": "code", "execution_count": 56, "id": "97d58527", "metadata": {}, "outputs": [], "source": [ "# Data\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# Plots\n", "import seaborn as sns\n", "\n", "sns.set()\n", "import plotly.express as px\n", "import plotly.graph_objects as gp\n", "import sklearn.metrics as metrics\n", "import sklearn.preprocessing as preproc\n", "\n", "# Statistics\n", "from scipy.stats import chi2_contingency\n", "\n", "# Machine Learning\n", "from sklearn.cluster import KMeans\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.model_selection import KFold, train_test_split\n", "from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor\n" ] }, { "cell_type": "markdown", "id": "06153286", "metadata": {}, "source": [ "### Function definitions" ] }, { "cell_type": "code", "execution_count": null, "id": "c67db932", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "985e4e97", "metadata": {}, "source": [ "### Constants" ] }, { "cell_type": "code", "execution_count": 57, "id": "c9597b48", "metadata": 
{}, "outputs": [], "source": [ "input_path = \"./1_inputs\"\n", "output_path = \"./2_outputs\"" ] }, { "cell_type": "markdown", "id": "b2b035d2", "metadata": {}, "source": [ "### Data import" ] }, { "cell_type": "code", "execution_count": 58, "id": "8051b5f4", "metadata": {}, "outputs": [], "source": [ "path = input_path + '/base_retraitee.csv'\n", "data_retraitee = pd.read_csv(path, sep=\",\", decimal=\".\")" ] }, { "cell_type": "markdown", "id": "a2578ba1", "metadata": {}, "source": [ "## Supervised algorithm: CART" ] }, { "cell_type": "markdown", "id": "aaa0b27d", "metadata": {}, "source": [ "In this part, the goal is to build a simple model (the CART algorithm) in order to walk through the steps needed to fit a model.\n", "We will model the claim cost directly." ] }, { "cell_type": "markdown", "id": "a0458a05", "metadata": {}, "source": [ "### Model construction" ] }, { "cell_type": "markdown", "id": "b3715c37", "metadata": {}, "source": [ "The first step is to compute the average cost per claim (the target, or response variable). This is the variable to predict from the explanatory variables." ] }, { "cell_type": "code", "execution_count": 59, "id": "c427a4b8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(824, 14)" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_model = data_retraitee.copy()\n", "\n", "# Keep only rows with at least one claim (NB > 0)\n", "data_model = data_model[data_model['NB'] > 0]\n", "\n", "# Compute the \"theoretical\" average claim cost\n", "data_model[\"CM\"] = (data_model[\"CHARGE\"] / data_model[\"NB\"])\n", "data_model = data_model.drop(['CHARGE', 'NB', \"EXPO\"], axis=1)\n", "data_model.shape" ] }, { "cell_type": "markdown", "id": "e3e85088", "metadata": {}, "source": [ "**Exercise:** build the descriptive statistics of the dataset used."
] }, { "cell_type": "code", "execution_count": 60, "id": "c8fd3ee1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | ANNEE_CTR | CONTRAT_ANCIENNETE | FREQUENCE_PAIEMENT_COTISATION | GROUPE_KM | ZONE_RISQUE | AGE_ASSURE_PRINCIPAL | GENRE | DEUXIEME_CONDUCTEUR | ANCIENNETE_PERMIS | ANNEE_CONSTRUCTION | ENERGIE | EQUIPEMENT_SECURITE | VALEUR_DU_BIEN | CM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 824.000000 | 824 | 824 | 824 | 824 | 824.000000 | 824 | 824 | 824.000000 | 824.000000 | 824 | 824 | 824 | 824.000000 |
| unique | NaN | 5 | 3 | 4 | 14 | NaN | 2 | 2 | NaN | NaN | 3 | 2 | 6 | NaN |
| top | NaN | (0,1] | MENSUEL | [0;20000[ | C | NaN | M | False | NaN | NaN | ESSENCE | FAUX | [10000;15000[ | NaN |
| freq | NaN | 297 | 398 | 391 | 269 | NaN | 483 | 663 | NaN | NaN | 413 | 517 | 213 | NaN |
| mean | 2018.384709 | NaN | NaN | NaN | NaN | 44.383495 | NaN | NaN | 35.688107 | 2015.212379 | NaN | NaN | NaN | 4246.016978 |
| std | 1.515833 | NaN | NaN | NaN | NaN | 13.808217 | NaN | NaN | 19.370621 | 3.163782 | NaN | NaN | NaN | 6869.616917 |
| min | 2016.000000 | NaN | NaN | NaN | NaN | 19.000000 | NaN | NaN | 1.000000 | 1998.000000 | NaN | NaN | NaN | 7.500000 |
| 25% | 2017.000000 | NaN | NaN | NaN | NaN | 34.000000 | NaN | NaN | 18.000000 | 2014.000000 | NaN | NaN | NaN | 1159.961250 |
| 50% | 2018.000000 | NaN | NaN | NaN | NaN | 43.000000 | NaN | NaN | 35.000000 | 2016.000000 | NaN | NaN | NaN | 2541.650000 |
| 75% | 2020.000000 | NaN | NaN | NaN | NaN | 53.000000 | NaN | NaN | 53.000000 | 2017.000000 | NaN | NaN | NaN | 4193.797500 |
| max | 2021.000000 | NaN | NaN | NaN | NaN | 94.000000 | NaN | NaN | 70.000000 | 2021.000000 | NaN | NaN | NaN | 83421.850000 |
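
The rows of this table (count, unique, top, freq, mean, std, quantiles) match what `DataFrame.describe(include="all")` returns in pandas, so the exercise can probably be answered with a single call. A minimal sketch on a toy frame, assuming the same filtering and `CM` computation as in the notebook; the values and the `toy` / `toy_model` names are hypothetical, not taken from `base_retraitee.csv`:

```python
import pandas as pd

# Toy data with hypothetical values; column names follow the notebook.
toy = pd.DataFrame(
    {
        "GENRE": ["M", "F", "F", "M"],
        "CHARGE": [1200.0, 0.0, 900.0, 3000.0],
        "NB": [1, 0, 2, 1],
    }
)

# Keep rows with at least one claim, then derive the average cost per claim.
toy_model = toy[toy["NB"] > 0].copy()
toy_model["CM"] = toy_model["CHARGE"] / toy_model["NB"]
toy_model = toy_model.drop(columns=["CHARGE", "NB"])

# include="all" reports count/unique/top/freq for categorical columns and
# mean/std/quantiles for numeric ones; statistics that do not apply are NaN.
stats = toy_model.describe(include="all")
print(stats)
```

Filtering on `NB > 0` before dividing avoids a division by zero in `CM`, which is why the filter comes first in the notebook as well.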
DecisionTreeRegressor()