{ "cells": [ { "cell_type": "markdown", "id": "8750d15b", "metadata": {}, "source": [ "# Cours 4 : Machine Learning - Algorithmes supervisés (2/2)" ] }, { "cell_type": "markdown", "id": "f7c08ae5", "metadata": {}, "source": [ "## Préambule" ] }, { "cell_type": "markdown", "id": "ec7ecb4b", "metadata": {}, "source": [ "Les objectifs de cette séance (3h) sont :\n", "* Préparation des bases de modélisation (sampling)\n", "* Construire un modèle de Machine Learning (cross-validation et hyperparamétrage) pour résoudre un problème de classification\n", "* Analyser les performances du modèle" ] }, { "cell_type": "markdown", "id": "4e99c600", "metadata": {}, "source": [ "## Préparation du workspace" ] }, { "cell_type": "markdown", "id": "c1b01045", "metadata": {}, "source": [ "### Import de librairies " ] }, { "cell_type": "code", "execution_count": null, "id": "97d58527", "metadata": {}, "outputs": [], "source": [ "# Données\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# Graphiques\n", "import seaborn as sns\n", "\n", "sns.set()\n", "import plotly.express as px\n", "\n", "# Machine Learning\n", "import sklearn.preprocessing as preproc\n", "from imblearn.over_sampling import RandomOverSampler\n", "\n", "# Statistiques\n", "from scipy.stats import chi2_contingency\n", "from sklearn import metrics\n", "from sklearn.ensemble import GradientBoostingClassifier\n", "from sklearn.model_selection import (\n", " GridSearchCV,\n", " KFold,\n", " StratifiedKFold,\n", " cross_val_score,\n", " train_test_split,\n", ")\n" ] }, { "cell_type": "markdown", "id": "06153286", "metadata": {}, "source": [ "### Définition des fonctions " ] }, { "cell_type": "code", "execution_count": 104, "id": "c67db932", "metadata": {}, "outputs": [], "source": [ "def cramers_V(var1,var2) :\n", " crosstab = np.array(pd.crosstab(var1,var2, rownames=None, colnames=None)) # Cross table building\n", " stat = chi2_contingency(crosstab)[0] # Keeping of the test statistic of the Chi2 test\n", " obs = 
np.sum(crosstab) # Number of observations\n", " mini = min(crosstab.shape)-1 # Take the minimum value between the columns and the rows of the cross table\n", " return (stat/(obs*mini))" ] }, { "cell_type": "markdown", "id": "985e4e97", "metadata": {}, "source": [ "### Constantes" ] }, { "cell_type": "code", "execution_count": 105, "id": "c9597b48", "metadata": {}, "outputs": [], "source": [ "input_path = \"./1_inputs\"\n", "output_path = \"./2_outputs\"" ] }, { "cell_type": "markdown", "id": "b2b035d2", "metadata": {}, "source": [ "### Import des données" ] }, { "cell_type": "code", "execution_count": 106, "id": "8051b5f4", "metadata": {}, "outputs": [], "source": [ "path = input_path + '/base_retraitee.csv'\n", "data_retraitee = pd.read_csv(path, sep=\",\", decimal=\".\")" ] }, { "cell_type": "markdown", "id": "a2578ba1", "metadata": {}, "source": [ "## Préparation de la base de données" ] }, { "cell_type": "markdown", "id": "b3715c37", "metadata": {}, "source": [ "Dans cette partie nous souhaitons expliquer la survenance d'un sinistre en fonction des variables explicatives i.e. une variable binaire qui : \n", "* est égale à 1 si la personne a eu 1 ou plus de sinistres.\n", "* est égale à 0 le cas échéant." 
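] }, { "cell_type": "markdown", "id": "e1a2b3c4", "metadata": {}, "source": [ "The binary target described above can be sketched on toy data. This is a minimal illustration, not the course dataset: the frame `toy`, its values and the column name `target` are hypothetical, and only the `NB` claim-count column mirrors the real data.

```python
import pandas as pd

# Hypothetical toy data: 'NB' mimics a claim-count column, one row per policy.
toy = pd.DataFrame({'NB': [0, 2, 0, 1, 0, 0]})

# Binary target: 1 if the policy had at least one claim, 0 otherwise.
toy['target'] = (toy['NB'] > 0).astype(int)

# Share of each class, sorted by class label (0 first, then 1).
class_share = toy['target'].value_counts(normalize=True).sort_index()
print(class_share)
```

Checking the class shares first is a natural step before any resampling, since a rare positive class is what motivates the sampling step listed in the objectives."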
] }, { "cell_type": "code", "execution_count": 107, "id": "b9b98d36", "metadata": {}, "outputs": [ { "data": { "application/vnd.microsoft.datawrangler.viewer.v0+json": { "columns": [ { "name": "index", "rawType": "int64", "type": "integer" }, { "name": "ANNEE_CTR", "rawType": "int64", "type": "integer" }, { "name": "CONTRAT_ANCIENNETE", "rawType": "object", "type": "string" }, { "name": "FREQUENCE_PAIEMENT_COTISATION", "rawType": "object", "type": "string" }, { "name": "GROUPE_KM", "rawType": "object", "type": "string" }, { "name": "ZONE_RISQUE", "rawType": "object", "type": "string" }, { "name": "AGE_ASSURE_PRINCIPAL", "rawType": "int64", "type": "integer" }, { "name": "GENRE", "rawType": "object", "type": "string" }, { "name": "DEUXIEME_CONDUCTEUR", "rawType": "bool", "type": "boolean" }, { "name": "ANCIENNETE_PERMIS", "rawType": "int64", "type": "integer" }, { "name": "ANNEE_CONSTRUCTION", "rawType": "float64", "type": "float" }, { "name": "ENERGIE", "rawType": "object", "type": "string" }, { "name": "EQUIPEMENT_SECURITE", "rawType": "object", "type": "string" }, { "name": "VALEUR_DU_BIEN", "rawType": "object", "type": "string" }, { "name": "NB", "rawType": "int64", "type": "integer" }, { "name": "CHARGE", "rawType": "float64", "type": "float" }, { "name": "EXPO", "rawType": "float64", "type": "float" }, { "name": "sinistré", "rawType": "int64", "type": "integer" } ], "ref": "b979eb39-686f-4927-8f14-5b4f00e866e5", "rows": [ [ "0", "2019", "(-1,0]", "ANNUEL", "[20000;40000[", "B", "54", "M", "False", "47", "2016.0", "ESSENCE", "FAUX", "[10000;15000[", "0", "0.0", "245.3278688524592", "0" ], [ "1", "2019", "(-1,0]", "ANNUEL", "[20000;40000[", "B", "88", "F", "True", "55", "2018.0", "DIESEL", "VRAI", "[20000;25000[", "0", "0.0", "230.36885245901655", "0" ], [ "2", "2021", "(1,2]", "ANNUEL", "[0;20000[", "D", "35", "F", "True", "16", "2017.0", "ESSENCE", "FAUX", "[15000;20000[", "0", "0.0", "300.0", "0" ], [ "3", "2021", "(2,5]", "ANNUEL", "[0;20000[", "C", "46", 
"M", "False", "44", "2018.0", "ESSENCE", "VRAI", "[35000;99999[", "0", "0.0", "303.99999999999994", "0" ], [ "4", "2018", "(2,5]", "MENSUEL", "[20000;40000[", "A", "46", "F", "False", "31", "2009.0", "DIESEL", "FAUX", "[10000;15000[", "0", "0.0", "365.0", "0" ], [ "5", "2019", "(2,5]", "MENSUEL", "[0;20000[", "A", "67", "M", "False", "22", "2015.0", "ESSENCE", "VRAI", "[10000;15000[", "0", "0.0", "364.5874316939892", "0" ], [ "6", "2016", "(0,1]", "MENSUEL", "[0;20000[", "C", "37", "F", "False", "15", "2016.0", "ESSENCE", "VRAI", "[10000;15000[", "0", "868.11", "365.0", "0" ], [ "7", "2017", "(1,2]", "MENSUEL", "[0;20000[", "A", "46", "F", "False", "37", "2015.0", "ESSENCE", "FAUX", "[10000;15000[", "0", "0.0", "300.0", "0" ], [ "8", "2016", "(0,1]", "MENSUEL", "[0;20000[", "A", "44", "F", "False", "63", "2014.0", "ESSENCE", "FAUX", "[0;10000[", "0", "0.0", "56.84426229508204", "0" ], [ "9", "2019", "(2,5]", "MENSUEL", "[0;20000[", "B", "59", "F", "False", "68", "2014.0", "ESSENCE", "FAUX", "[10000;15000[", "0", "2794.96", "364.00000000000006", "0" ], [ "10", "2019", "(0,1]", "MENSUEL", "[0;20000[", "C", "40", "M", "False", "37", "2017.0", "ESSENCE", "VRAI", "[15000;20000[", "1", "1072.98", "364.8415300546447", "1" ], [ "11", "2018", "(-1,0]", "MENSUEL", "[0;20000[", "C", "30", "M", "False", "12", "2017.0", "DIESEL", "FAUX", "[20000;25000[", "0", "0.0", "272.00000000000006", "0" ], [ "12", "2020", "(0,1]", "MENSUEL", "[20000;40000[", "D", "30", "M", "True", "15", "2020.0", "ESSENCE", "FAUX", "[20000;25000[", "0", "0.0", "365.0", "0" ], [ "13", "2021", "(0,1]", "MENSUEL", "[20000;40000[", "B", "58", "M", "False", "39", "2017.0", "DIESEL", "FAUX", "[10000;15000[", "0", "0.0", "303.99999999999994", "0" ], [ "14", "2019", "(-1,0]", "MENSUEL", "[20000;40000[", "C", "39", "M", "False", "36", "2014.0", "DIESEL", "FAUX", "[10000;15000[", "0", "0.0", "203.44262295081973", "0" ], [ "15", "2019", "(0,1]", "ANNUEL", "[0;20000[", "A", "26", "F", "False", "14", "2016.0", 
"DIESEL", "FAUX", "[15000;20000[", "0", "0.0", "364.2049180327869", "0" ], [ "16", "2017", "(-1,0]", "ANNUEL", "[0;20000[", "D", "26", "M", "False", "17", "2018.0", "ESSENCE", "FAUX", "[35000;99999[", "0", "0.0", "268.00000000000006", "0" ], [ "17", "2016", "(0,1]", "TRIMESTRIEL", "[0;20000[", "A", "57", "F", "False", "61", "2011.0", "ESSENCE", "VRAI", "[10000;15000[", "0", "287.73", "365.0", "0" ], [ "18", "2018", "(-1,0]", "TRIMESTRIEL", "[0;20000[", "B", "25", "M", "False", "17", "2017.0", "DIESEL", "VRAI", "[35000;99999[", "0", "0.0", "350.99999999999983", "0" ], [ "19", "2018", "(2,5]", "ANNUEL", "[20000;40000[", "D", "61", "M", "True", "28", "2014.0", "DIESEL", "FAUX", "[20000;25000[", "0", "0.0", "365.0", "0" ], [ "20", "2020", "(1,2]", "MENSUEL", "[20000;40000[", "F", "37", "F", "False", "20", "2018.0", "DIESEL", "FAUX", "[25000;35000[", "0", "0.0", "365.0", "0" ], [ "21", "2020", "(2,5]", "TRIMESTRIEL", "[0;20000[", "D", "25", "M", "True", "18", "2014.0", "DIESEL", "VRAI", "[15000;20000[", "0", "0.0", "102.71857923497252", "0" ], [ "22", "2021", "(2,5]", "MENSUEL", "[20000;40000[", "C", "30", "F", "True", "14", "2018.0", "ESSENCE", "FAUX", "[15000;20000[", "0", "0.0", "303.99999999999994", "0" ], [ "23", "2017", "(-1,0]", "MENSUEL", "[0;20000[", "B", "26", "F", "False", "15", "2016.0", "DIESEL", "FAUX", "[15000;20000[", "0", "0.0", "158.99999999999986", "0" ], [ "24", "2016", "(0,1]", "TRIMESTRIEL", "[0;20000[", "A", "62", "M", "False", "64", "2013.0", "DIESEL", "FAUX", "[10000;15000[", "0", "0.0", "365.0", "0" ], [ "25", "2020", "(-1,0]", "MENSUEL", "[20000;40000[", "C", "45", "F", "False", "44", "2020.0", "ESSENCE", "FAUX", "[15000;20000[", "0", "0.0", "330.42349726775944", "0" ], [ "26", "2020", "(0,1]", "MENSUEL", "[20000;40000[", "E", "60", "M", "False", "66", "2018.0", "DIESEL", "FAUX", "[35000;99999[", "0", "0.0", "365.0", "0" ], [ "27", "2020", "(0,1]", "TRIMESTRIEL", "[0;20000[", "C", "42", "F", "False", "18", "2018.0", "ESSENCE", "FAUX", 
"[10000;15000[", "0", "0.0", "365.0", "0" ], [ "28", "2021", "(2,5]", "MENSUEL", "[0;20000[", "C", "60", "M", "False", "52", "2016.0", "DIESEL", "VRAI", "[15000;20000[", "0", "0.0", "277.9999999999999", "0" ], [ "29", "2021", "(2,5]", "MENSUEL", "[20000;40000[", "C", "44", "M", "False", "27", "2017.0", "ESSENCE", "FAUX", "[10000;15000[", "0", "0.0", "234.99999999999991", "0" ], [ "30", "2021", "(-1,0]", "MENSUEL", "[20000;40000[", "D", "44", "F", "False", "40", "2020.0", "ESSENCE", "FAUX", "[15000;20000[", "0", "0.0", "180.99999999999997", "0" ], [ "31", "2017", "(1,2]", "ANNUEL", "[20000;40000[", "A", "37", "M", "False", "56", "2013.0", "DIESEL", "VRAI", "[35000;99999[", "0", "0.0", "93.99999999999984", "0" ], [ "32", "2017", "(0,1]", "ANNUEL", "[20000;40000[", "A", "25", "F", "True", "12", "2016.0", "DIESEL", "FAUX", "[20000;25000[", "0", "0.0", "365.0", "0" ], [ "33", "2021", "(1,2]", "ANNUEL", "[0;20000[", "B", "62", "M", "False", "50", "2014.0", "DIESEL", "FAUX", "[20000;25000[", "0", "0.0", "238.99999999999991", "0" ], [ "34", "2020", "(-1,0]", "MENSUEL", "[20000;40000[", "C", "27", "M", "True", "13", "2018.0", "AUTRE", "FAUX", "[35000;99999[", "1", "3750.0", "306.9945355191256", "1" ], [ "35", "2021", "(1,2]", "ANNUEL", "[0;20000[", "C", "60", "F", "False", "61", "2020.0", "ESSENCE", "FAUX", "[15000;20000[", "0", "0.0", "303.99999999999994", "0" ], [ "36", "2019", "(-1,0]", "MENSUEL", "[20000;40000[", "L", "19", "M", "False", "2", "2017.0", "ESSENCE", "VRAI", "[0;10000[", "1", "1838.49", "344.80327868852464", "1" ], [ "37", "2016", "(-1,0]", "ANNUEL", "[0;20000[", "C", "56", "F", "False", "65", "2010.0", "ESSENCE", "FAUX", "[0;10000[", "0", "0.0", "280.0", "0" ], [ "38", "2019", "(0,1]", "MENSUEL", "[0;20000[", "C", "57", "F", "False", "36", "2021.0", "ESSENCE", "FAUX", "[10000;15000[", "0", "0.0", "364.2677595628415", "0" ], [ "39", "2017", "(-1,0]", "MENSUEL", "[0;20000[", "A", "24", "F", "False", "12", "2017.0", "ESSENCE", "FAUX", "[15000;20000[", "0", 
"2637.39", "195.00000000000009", "0" ], [ "40", "2018", "(0,1]", "ANNUEL", "[20000;40000[", "C", "49", "M", "True", "20", "2017.0", "DIESEL", "FAUX", "[20000;25000[", "0", "0.0", "365.0", "0" ], [ "41", "2018", "(0,1]", "ANNUEL", "[0;20000[", "B", "51", "M", "True", "42", "2017.0", "ESSENCE", "FAUX", "[15000;20000[", "0", "0.0", "365.0", "0" ], [ "42", "2020", "(1,2]", "MENSUEL", "[20000;40000[", "C", "57", "M", "False", "63", "2018.0", "ESSENCE", "FAUX", "[15000;20000[", "0", "0.0", "365.0", "0" ], [ "43", "2019", "(1,2]", "MENSUEL", "[20000;40000[", "C", "40", "M", "False", "69", "2013.0", "ESSENCE", "FAUX", "[10000;15000[", "0", "0.0", "364.2240437158468", "0" ], [ "44", "2021", "(1,2]", "MENSUEL", "[20000;40000[", "B", "60", "M", "False", "28", "2018.0", "DIESEL", "FAUX", "[35000;99999[", "0", "0.0", "303.99999999999994", "0" ], [ "45", "2020", "(2,5]", "ANNUEL", "[0;20000[", "B", "52", "F", "False", "55", "2017.0", "DIESEL", "VRAI", "[35000;99999[", "0", "0.0", "365.0", "0" ], [ "46", "2020", "(2,5]", "ANNUEL", "[0;20000[", "C", "41", "M", "False", "47", "2018.0", "ESSENCE", "FAUX", "[15000;20000[", "0", "0.0", "365.0", "0" ], [ "47", "2020", "(0,1]", "MENSUEL", "[0;20000[", "B", "51", "F", "False", "59", "2016.0", "ESSENCE", "FAUX", "[10000;15000[", "0", "0.0", "118.67486338797818", "0" ], [ "48", "2019", "(-1,0]", "MENSUEL", "[20000;40000[", "C", "49", "M", "False", "21", "2020.0", "ESSENCE", "FAUX", "[25000;35000[", "0", "0.0", "267.26775956284155", "0" ], [ "49", "2020", "(2,5]", "ANNUEL", "[0;20000[", "B", "73", "M", "True", "24", "2018.0", "DIESEL", "FAUX", "[20000;25000[", "0", "0.0", "193.4699453551912", "0" ] ], "shape": { "columns": 17, "rows": 14236 } }, "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>ANNEE_CTR</th><th>CONTRAT_ANCIENNETE</th><th>FREQUENCE_PAIEMENT_COTISATION</th><th>GROUPE_KM</th><th>ZONE_RISQUE</th><th>AGE_ASSURE_PRINCIPAL</th><th>GENRE</th><th>DEUXIEME_CONDUCTEUR</th><th>ANCIENNETE_PERMIS</th><th>ANNEE_CONSTRUCTION</th><th>ENERGIE</th><th>EQUIPEMENT_SECURITE</th><th>VALEUR_DU_BIEN</th><th>NB</th><th>CHARGE</th><th>EXPO</th><th>sinistré</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>2019</td><td>(-1,0]</td><td>ANNUEL</td><td>[20000;40000[</td><td>B</td><td>54</td><td>M</td><td>False</td><td>47</td><td>2016.0</td><td>ESSENCE</td><td>FAUX</td><td>[10000;15000[</td><td>0</td><td>0.0</td><td>245.327869</td><td>0</td></tr>\n",
"    <tr><th>1</th><td>2019</td><td>(-1,0]</td><td>ANNUEL</td><td>[20000;40000[</td><td>B</td><td>88</td><td>F</td><td>True</td><td>55</td><td>2018.0</td><td>DIESEL</td><td>VRAI</td><td>[20000;25000[</td><td>0</td><td>0.0</td><td>230.368852</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>2021</td><td>(1,2]</td><td>ANNUEL</td><td>[0;20000[</td><td>D</td><td>35</td><td>F</td><td>True</td><td>16</td><td>2017.0</td><td>ESSENCE</td><td>FAUX</td><td>[15000;20000[</td><td>0</td><td>0.0</td><td>300.000000</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>2021</td><td>(2,5]</td><td>ANNUEL</td><td>[0;20000[</td><td>C</td><td>46</td><td>M</td><td>False</td><td>44</td><td>2018.0</td><td>ESSENCE</td><td>VRAI</td><td>[35000;99999[</td><td>0</td><td>0.0</td><td>304.000000</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>2018</td><td>(2,5]</td><td>MENSUEL</td><td>[20000;40000[</td><td>A</td><td>46</td><td>F</td><td>False</td><td>31</td><td>2009.0</td><td>DIESEL</td><td>FAUX</td><td>[10000;15000[</td><td>0</td><td>0.0</td><td>365.000000</td><td>0</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>14231</th><td>2021</td><td>(2,5]</td><td>MENSUEL</td><td>[0;20000[</td><td>D</td><td>55</td><td>M</td><td>False</td><td>49</td><td>2017.0</td><td>ESSENCE</td><td>FAUX</td><td>[20000;25000[</td><td>0</td><td>0.0</td><td>181.000000</td><td>0</td></tr>\n",
"    <tr><th>14232</th><td>2019</td><td>(2,5]</td><td>MENSUEL</td><td>[20000;40000[</td><td>A</td><td>33</td><td>M</td><td>False</td><td>14</td><td>2017.0</td><td>ESSENCE</td><td>FAUX</td><td>[10000;15000[</td><td>0</td><td>0.0</td><td>364.669399</td><td>0</td></tr>\n",
"    <tr><th>14233</th><td>2017</td><td>(-1,0]</td><td>ANNUEL</td><td>[0;20000[</td><td>A</td><td>62</td><td>M</td><td>False</td><td>58</td><td>2017.0</td><td>ESSENCE</td><td>VRAI</td><td>[10000;15000[</td><td>0</td><td>0.0</td><td>182.000000</td><td>0</td></tr>\n",
"    <tr><th>14234</th><td>2018</td><td>(-1,0]</td><td>TRIMESTRIEL</td><td>[20000;40000[</td><td>D</td><td>20</td><td>M</td><td>False</td><td>7</td><td>2016.0</td><td>DIESEL</td><td>FAUX</td><td>[25000;35000[</td><td>0</td><td>0.0</td><td>9.000000</td><td>0</td></tr>\n",
"    <tr><th>14235</th><td>2017</td><td>(-1,0]</td><td>ANNUEL</td><td>[0;20000[</td><td>C</td><td>73</td><td>F</td><td>False</td><td>41</td><td>2017.0</td><td>ESSENCE</td><td>FAUX</td><td>[10000;15000[</td><td>0</td><td>0.0</td><td>52.000000</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>14236 rows × 17 columns</p>\n",
"</div>
" ], "text/plain": [ " ANNEE_CTR CONTRAT_ANCIENNETE FREQUENCE_PAIEMENT_COTISATION \\\n", "0 2019 (-1,0] ANNUEL \n", "1 2019 (-1,0] ANNUEL \n", "2 2021 (1,2] ANNUEL \n", "3 2021 (2,5] ANNUEL \n", "4 2018 (2,5] MENSUEL \n", "... ... ... ... \n", "14231 2021 (2,5] MENSUEL \n", "14232 2019 (2,5] MENSUEL \n", "14233 2017 (-1,0] ANNUEL \n", "14234 2018 (-1,0] TRIMESTRIEL \n", "14235 2017 (-1,0] ANNUEL \n", "\n", " GROUPE_KM ZONE_RISQUE AGE_ASSURE_PRINCIPAL GENRE \\\n", "0 [20000;40000[ B 54 M \n", "1 [20000;40000[ B 88 F \n", "2 [0;20000[ D 35 F \n", "3 [0;20000[ C 46 M \n", "4 [20000;40000[ A 46 F \n", "... ... ... ... ... \n", "14231 [0;20000[ D 55 M \n", "14232 [20000;40000[ A 33 M \n", "14233 [0;20000[ A 62 M \n", "14234 [20000;40000[ D 20 M \n", "14235 [0;20000[ C 73 F \n", "\n", " DEUXIEME_CONDUCTEUR ANCIENNETE_PERMIS ANNEE_CONSTRUCTION ENERGIE \\\n", "0 False 47 2016.0 ESSENCE \n", "1 True 55 2018.0 DIESEL \n", "2 True 16 2017.0 ESSENCE \n", "3 False 44 2018.0 ESSENCE \n", "4 False 31 2009.0 DIESEL \n", "... ... ... ... ... \n", "14231 False 49 2017.0 ESSENCE \n", "14232 False 14 2017.0 ESSENCE \n", "14233 False 58 2017.0 ESSENCE \n", "14234 False 7 2016.0 DIESEL \n", "14235 False 41 2017.0 ESSENCE \n", "\n", " EQUIPEMENT_SECURITE VALEUR_DU_BIEN NB CHARGE EXPO sinistré \n", "0 FAUX [10000;15000[ 0 0.0 245.327869 0 \n", "1 VRAI [20000;25000[ 0 0.0 230.368852 0 \n", "2 FAUX [15000;20000[ 0 0.0 300.000000 0 \n", "3 VRAI [35000;99999[ 0 0.0 304.000000 0 \n", "4 FAUX [10000;15000[ 0 0.0 365.000000 0 \n", "... ... ... .. ... ... ... 
\n", "14231 FAUX [20000;25000[ 0 0.0 181.000000 0 \n", "14232 FAUX [10000;15000[ 0 0.0 364.669399 0 \n", "14233 VRAI [10000;15000[ 0 0.0 182.000000 0 \n", "14234 FAUX [25000;35000[ 0 0.0 9.000000 0 \n", "14235 FAUX [10000;15000[ 0 0.0 52.000000 0 \n", "\n", "[14236 rows x 17 columns]" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Compute the \"sinistré\" variable: 1 if the policyholder had one or more claims, 0 otherwise\n", "data_retraitee[\"sinistré\"] = data_retraitee[\"NB\"] > 0\n", "data_retraitee[\"sinistré\"] = data_retraitee[\"sinistré\"].astype(int)\n", "data_retraitee" ] }, { "cell_type": "markdown", "id": "657ebd89", "metadata": {}, "source": [ "**Exercise:** build descriptive statistics for the dataset, in particular the distribution of the response variable." ] }, { "cell_type": "code", "execution_count": 108, "id": "47cf4b69", "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "bingroup": "x", "hovertemplate": "sinistré=%{x}
count=%{y}", "legendgroup": "", "marker": { "color": "#636efa", "pattern": { "shape": "" } }, "name": "", "orientation": "v", "showlegend": false, "type": "histogram", "x": { "bdata": "", "dtype": "i1" }, "xaxis": "x", "yaxis": "y" } ], "layout": { "barmode": "relative", "legend": { "tracegroupgap": 0 }, "template": { "data": { "bar": [ { "error_x": { "color": "#2a3f5f" }, "error_y": { "color": "#2a3f5f" }, "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "#E5ECF6", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "baxis": { "endlinecolor": "#2a3f5f", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "#2a3f5f" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 
0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "heatmap" } ], "histogram": [ { "marker": { "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "fillpattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattergl" } ], "scattermap": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": 
"scattermap" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 0, "ticks": "" }, "colorscale": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "#EBF0F8" }, "line": { "color": "white" } }, "header": { "fill": { "color": "#C8D4E3" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowcolor": "#2a3f5f", "arrowhead": 0, "arrowwidth": 1 }, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 0, "ticks": "" } }, "colorscale": { "diverging": [ [ 0, "#8e0152" ], [ 0.1, "#c51b7d" ], [ 0.2, "#de77ae" ], [ 0.3, "#f1b6da" ], [ 0.4, "#fde0ef" ], [ 0.5, "#f7f7f7" ], [ 0.6, "#e6f5d0" ], [ 0.7, "#b8e186" ], [ 0.8, "#7fbc41" ], [ 0.9, "#4d9221" ], [ 1, "#276419" ] ], "sequential": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ], "sequentialminus": [ [ 0, "#0d0887" ], [ 0.1111111111111111, "#46039f" ], [ 0.2222222222222222, "#7201a8" ], [ 0.3333333333333333, "#9c179e" ], [ 
0.4444444444444444, "#bd3786" ], [ 0.5555555555555556, "#d8576b" ], [ 0.6666666666666666, "#ed7953" ], [ 0.7777777777777778, "#fb9f3a" ], [ 0.8888888888888888, "#fdca26" ], [ 1, "#f0f921" ] ] }, "colorway": [ "#636efa", "#EF553B", "#00cc96", "#ab63fa", "#FFA15A", "#19d3f3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52" ], "font": { "color": "#2a3f5f" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "#E5ECF6", "showlakes": true, "showland": true, "subunitcolor": "white" }, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "paper_bgcolor": "white", "plot_bgcolor": "#E5ECF6", "polar": { "angularaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "radialaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "scene": { "xaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "yaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" }, "zaxis": { "backgroundcolor": "#E5ECF6", "gridcolor": "white", "gridwidth": 2, "linecolor": "white", "showbackground": true, "ticks": "", "zerolinecolor": "white" } }, "shapedefaults": { "line": { "color": "#2a3f5f" } }, "ternary": { "aaxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "baxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" }, "bgcolor": "#E5ECF6", "caxis": { "gridcolor": "white", "linecolor": "white", "ticks": "" } }, "title": { "x": 0.05 }, "xaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 }, "yaxis": { "automargin": true, "gridcolor": "white", "linecolor": "white", "ticks": "", "title": { "standoff": 15 }, "zerolinecolor": "white", "zerolinewidth": 2 } } }, "title": 
{ "text": "Distribution de la variable 'sinistré'" }, "xaxis": { "anchor": "y", "domain": [ 0, 1 ], "title": { "text": "sinistré" } }, "yaxis": { "anchor": "x", "domain": [ 0, 1 ], "title": { "text": "count" } } } } }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Observation de la distribution\n", "fig = px.histogram(data_retraitee, x=\"sinistré\", title=\"Distribution de la variable 'sinistré'\")\n", "fig.show()" ] }, { "cell_type": "markdown", "id": "92d6156a", "metadata": {}, "source": [ "#### Etude des corrélations parmi les variables explicatives" ] }, { "cell_type": "code", "execution_count": 109, "id": "a0bc6278", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(14236, 16)" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_set = data_retraitee.drop(\"sinistré\", axis=1)\n", "data_set.shape" ] }, { "cell_type": "code", "execution_count": 110, "id": "73d31ea4", "metadata": {}, "outputs": [], "source": [ "# Séparation en variables qualitatives ou catégorielles\n", "variables_na = []\n", "variables_numeriques = []\n", "variables_01 = []\n", "variables_categorielles = []\n", "for colu in data_set.columns:\n", " if True in data_set[colu].isna().unique():\n", " variables_na.append(data_set[colu])\n", " else:\n", " if str(data_set[colu].dtypes) in [\"int32\", \"int64\", \"float64\"]:\n", " if len(data_set[colu].unique()) == 2:\n", " variables_categorielles.append(data_set[colu])\n", " else:\n", " variables_numeriques.append(data_set[colu])\n", " else:\n", " if len(data_set[colu].unique()) == 2:\n", " variables_categorielles.append(data_set[colu])\n", " else:\n", " variables_categorielles.append(data_set[colu])\n" ] }, { "cell_type": "markdown", "id": "e82fcade", "metadata": {}, "source": [ "##### Corrélation des variables catégorielles :" ] }, { "cell_type": "code", "execution_count": 111, "id": "30df8bd5", "metadata": {}, "outputs": [], "source": [ "vars_categorielles = 
pd.DataFrame(variables_categorielles).transpose()" ] }, { "cell_type": "code", "execution_count": 112, "id": "be7a7d00", "metadata": {}, "outputs": [ { "data": { "application/vnd.microsoft.datawrangler.viewer.v0+json": { "columns": [ { "name": "index", "rawType": "object", "type": "string" }, { "name": "CONTRAT_ANCIENNETE", "rawType": "float64", "type": "float" }, { "name": "FREQUENCE_PAIEMENT_COTISATION", "rawType": "float64", "type": "float" }, { "name": "GROUPE_KM", "rawType": "float64", "type": "float" }, { "name": "ZONE_RISQUE", "rawType": "float64", "type": "float" }, { "name": "GENRE", "rawType": "float64", "type": "float" }, { "name": "DEUXIEME_CONDUCTEUR", "rawType": "float64", "type": "float" }, { "name": "ENERGIE", "rawType": "float64", "type": "float" }, { "name": "EQUIPEMENT_SECURITE", "rawType": "float64", "type": "float" }, { "name": "VALEUR_DU_BIEN", "rawType": "float64", "type": "float" } ], "ref": "0d7eb6cc-5877-455f-9d93-0374286dc27c", "rows": [ [ "CONTRAT_ANCIENNETE", "1.0", "0.0", "0.01", "0.02", "0.0", "0.0", "0.0", "0.01", "0.0" ], [ "FREQUENCE_PAIEMENT_COTISATION", "0.0", "1.0", "0.0", "0.0", "0.01", "0.0", "0.0", "0.01", "0.02" ], [ "GROUPE_KM", "0.01", "0.0", "1.0", "0.01", "0.01", "0.0", "0.04", "0.01", "0.02" ], [ "ZONE_RISQUE", "0.02", "0.0", "0.01", "1.0", "0.0", "0.0", "0.01", "0.03", "0.0" ], [ "GENRE", "0.0", "0.01", "0.01", "0.0", "1.0", "0.0", "0.02", "0.01", "0.07" ], [ "DEUXIEME_CONDUCTEUR", "0.0", "0.0", "0.0", "0.0", "0.0", "1.0", "0.0", "0.0", "0.0" ], [ "ENERGIE", "0.0", "0.0", "0.04", "0.01", "0.02", "0.0", "1.0", "0.02", "0.08" ], [ "EQUIPEMENT_SECURITE", "0.01", "0.01", "0.01", "0.03", "0.01", "0.0", "0.02", "1.0", "0.07" ], [ "VALEUR_DU_BIEN", "0.0", "0.02", "0.02", "0.0", "0.07", "0.0", "0.08", "0.07", "1.0" ] ], "shape": { "columns": 9, "rows": 9 } }, "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>CONTRAT_ANCIENNETE</th><th>FREQUENCE_PAIEMENT_COTISATION</th><th>GROUPE_KM</th><th>ZONE_RISQUE</th><th>GENRE</th><th>DEUXIEME_CONDUCTEUR</th><th>ENERGIE</th><th>EQUIPEMENT_SECURITE</th><th>VALEUR_DU_BIEN</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>CONTRAT_ANCIENNETE</th><td>1.00</td><td>0.00</td><td>0.01</td><td>0.02</td><td>0.00</td><td>0.0</td><td>0.00</td><td>0.01</td><td>0.00</td></tr>\n",
"    <tr><th>FREQUENCE_PAIEMENT_COTISATION</th><td>0.00</td><td>1.00</td><td>0.00</td><td>0.00</td><td>0.01</td><td>0.0</td><td>0.00</td><td>0.01</td><td>0.02</td></tr>\n",
"    <tr><th>GROUPE_KM</th><td>0.01</td><td>0.00</td><td>1.00</td><td>0.01</td><td>0.01</td><td>0.0</td><td>0.04</td><td>0.01</td><td>0.02</td></tr>\n",
"    <tr><th>ZONE_RISQUE</th><td>0.02</td><td>0.00</td><td>0.01</td><td>1.00</td><td>0.00</td><td>0.0</td><td>0.01</td><td>0.03</td><td>0.00</td></tr>\n",
"    <tr><th>GENRE</th><td>0.00</td><td>0.01</td><td>0.01</td><td>0.00</td><td>1.00</td><td>0.0</td><td>0.02</td><td>0.01</td><td>0.07</td></tr>\n",
"    <tr><th>DEUXIEME_CONDUCTEUR</th><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>0.00</td><td>1.0</td><td>0.00</td><td>0.00</td><td>0.00</td></tr>\n",
"    <tr><th>ENERGIE</th><td>0.00</td><td>0.00</td><td>0.04</td><td>0.01</td><td>0.02</td><td>0.0</td><td>1.00</td><td>0.02</td><td>0.08</td></tr>\n",
"    <tr><th>EQUIPEMENT_SECURITE</th><td>0.01</td><td>0.01</td><td>0.01</td><td>0.03</td><td>0.01</td><td>0.0</td><td>0.02</td><td>1.00</td><td>0.07</td></tr>\n",
"    <tr><th>VALEUR_DU_BIEN</th><td>0.00</td><td>0.02</td><td>0.02</td><td>0.00</td><td>0.07</td><td>0.0</td><td>0.08</td><td>0.07</td><td>1.00</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " CONTRAT_ANCIENNETE \\\n", "CONTRAT_ANCIENNETE 1.00 \n", "FREQUENCE_PAIEMENT_COTISATION 0.00 \n", "GROUPE_KM 0.01 \n", "ZONE_RISQUE 0.02 \n", "GENRE 0.00 \n", "DEUXIEME_CONDUCTEUR 0.00 \n", "ENERGIE 0.00 \n", "EQUIPEMENT_SECURITE 0.01 \n", "VALEUR_DU_BIEN 0.00 \n", "\n", " FREQUENCE_PAIEMENT_COTISATION GROUPE_KM \\\n", "CONTRAT_ANCIENNETE 0.00 0.01 \n", "FREQUENCE_PAIEMENT_COTISATION 1.00 0.00 \n", "GROUPE_KM 0.00 1.00 \n", "ZONE_RISQUE 0.00 0.01 \n", "GENRE 0.01 0.01 \n", "DEUXIEME_CONDUCTEUR 0.00 0.00 \n", "ENERGIE 0.00 0.04 \n", "EQUIPEMENT_SECURITE 0.01 0.01 \n", "VALEUR_DU_BIEN 0.02 0.02 \n", "\n", " ZONE_RISQUE GENRE DEUXIEME_CONDUCTEUR \\\n", "CONTRAT_ANCIENNETE 0.02 0.00 0.0 \n", "FREQUENCE_PAIEMENT_COTISATION 0.00 0.01 0.0 \n", "GROUPE_KM 0.01 0.01 0.0 \n", "ZONE_RISQUE 1.00 0.00 0.0 \n", "GENRE 0.00 1.00 0.0 \n", "DEUXIEME_CONDUCTEUR 0.00 0.00 1.0 \n", "ENERGIE 0.01 0.02 0.0 \n", "EQUIPEMENT_SECURITE 0.03 0.01 0.0 \n", "VALEUR_DU_BIEN 0.00 0.07 0.0 \n", "\n", " ENERGIE EQUIPEMENT_SECURITE VALEUR_DU_BIEN \n", "CONTRAT_ANCIENNETE 0.00 0.01 0.00 \n", "FREQUENCE_PAIEMENT_COTISATION 0.00 0.01 0.02 \n", "GROUPE_KM 0.04 0.01 0.02 \n", "ZONE_RISQUE 0.01 0.03 0.00 \n", "GENRE 0.02 0.01 0.07 \n", "DEUXIEME_CONDUCTEUR 0.00 0.00 0.00 \n", "ENERGIE 1.00 0.02 0.08 \n", "EQUIPEMENT_SECURITE 0.02 1.00 0.07 \n", "VALEUR_DU_BIEN 0.08 0.07 1.00 " ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Test du V de Cramer\n", "rows = []\n", "\n", "for var1 in vars_categorielles:\n", " col = []\n", " for var2 in vars_categorielles:\n", " cramers = cramers_V(\n", " vars_categorielles[var1], vars_categorielles[var2]\n", " ) # V de Cramer\n", " col.append(round(cramers, 2)) # arrondi du résultat\n", " rows.append(col)\n", "\n", "cramers_results = np.array(rows)\n", "v_cramer_resultats = pd.DataFrame(\n", " cramers_results,\n", " columns=vars_categorielles.columns,\n", " index=vars_categorielles.columns,\n", ")\n", 
"\n", "v_cramer_resultats\n" ] }, { "cell_type": "code", "execution_count": 113, "id": "b3297dca", "metadata": {}, "outputs": [], "source": [ "# On repère les variables trop corrélées\n", "for i in range(v_cramer_resultats.shape[0]):\n", " for j in range(i + 1, v_cramer_resultats.shape[0]):\n", " if v_cramer_resultats.iloc[i, j] > 0.7:\n", " print(\n", " v_cramer_resultats.index.to_numpy()[i]\n", " + \" et \"\n", " + v_cramer_resultats.columns[j]\n", " + \" sont trop dépendantes, V-CRAMER = \"\n", " + str(v_cramer_resultats.iloc[i, j])\n", " )\n" ] }, { "cell_type": "markdown", "id": "8f615121", "metadata": {}, "source": [ "##### Corrélation des variables numériques :" ] }, { "cell_type": "code", "execution_count": 114, "id": "d1fa12fc", "metadata": {}, "outputs": [], "source": [ "vars_numeriques = pd.DataFrame(variables_numeriques).transpose()" ] }, { "cell_type": "markdown", "id": "5777d20f", "metadata": {}, "source": [ "**Question :** quels sont vos commentaires ?" ] }, { "cell_type": "code", "execution_count": 115, "id": "c70946b4", "metadata": {}, "outputs": [ { "data": { "application/vnd.microsoft.datawrangler.viewer.v0+json": { "columns": [ { "name": "index", "rawType": "object", "type": "string" }, { "name": "ANNEE_CTR", "rawType": "float64", "type": "float" }, { "name": "AGE_ASSURE_PRINCIPAL", "rawType": "float64", "type": "float" }, { "name": "ANCIENNETE_PERMIS", "rawType": "float64", "type": "float" }, { "name": "ANNEE_CONSTRUCTION", "rawType": "float64", "type": "float" }, { "name": "NB", "rawType": "float64", "type": "float" }, { "name": "CHARGE", "rawType": "float64", "type": "float" }, { "name": "EXPO", "rawType": "float64", "type": "float" } ], "ref": "6775fec4-a2fa-4d45-a7e7-55334dc80d4d", "rows": [ [ "ANNEE_CTR", "1.0", "0.048023234802924315", "0.043983174120495815", "0.3615499864845018", "-0.05775190894636334", "-0.028901069139582642", "-0.04770515515535773" ], [ "AGE_ASSURE_PRINCIPAL", "0.048023234802924315", "1.0", "0.4987430846753776", 
"-0.0591835157827114", "-0.012425345899111317", "-0.020907992524227155", "0.06096340138959582" ], [ "ANCIENNETE_PERMIS", "0.043983174120495815", "0.4987430846753776", "1.0", "-0.0298138263902136", "-0.008703999957333864", "-0.011347002839350888", "0.0324606537737922" ], [ "ANNEE_CONSTRUCTION", "0.3615499864845018", "-0.0591835157827114", "-0.0298138263902136", "1.0", "-0.01437673371578632", "-0.0012301736578250726", "-0.07395284013392618" ], [ "NB", "-0.05775190894636334", "-0.012425345899111317", "-0.008703999957333864", "-0.01437673371578632", "1.0", "0.5071071150738479", "0.0507022890091039" ], [ "CHARGE", "-0.028901069139582642", "-0.020907992524227155", "-0.011347002839350888", "-0.0012301736578250726", "0.5071071150738479", "1.0", "-0.021418687122216843" ], [ "EXPO", "-0.04770515515535773", "0.06096340138959582", "0.0324606537737922", "-0.07395284013392618", "0.0507022890091039", "-0.021418687122216843", "1.0" ] ], "shape": { "columns": 7, "rows": 7 } }, "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ANNEE_CTRAGE_ASSURE_PRINCIPALANCIENNETE_PERMISANNEE_CONSTRUCTIONNBCHARGEEXPO
ANNEE_CTR1.0000000.0480230.0439830.361550-0.057752-0.028901-0.047705
AGE_ASSURE_PRINCIPAL0.0480231.0000000.498743-0.059184-0.012425-0.0209080.060963
ANCIENNETE_PERMIS0.0439830.4987431.000000-0.029814-0.008704-0.0113470.032461
ANNEE_CONSTRUCTION0.361550-0.059184-0.0298141.000000-0.014377-0.001230-0.073953
NB-0.057752-0.012425-0.008704-0.0143771.0000000.5071070.050702
CHARGE-0.028901-0.020908-0.011347-0.0012300.5071071.000000-0.021419
EXPO-0.0477050.0609630.032461-0.0739530.050702-0.0214191.000000
\n", "
" ], "text/plain": [ " ANNEE_CTR AGE_ASSURE_PRINCIPAL ANCIENNETE_PERMIS \\\n", "ANNEE_CTR 1.000000 0.048023 0.043983 \n", "AGE_ASSURE_PRINCIPAL 0.048023 1.000000 0.498743 \n", "ANCIENNETE_PERMIS 0.043983 0.498743 1.000000 \n", "ANNEE_CONSTRUCTION 0.361550 -0.059184 -0.029814 \n", "NB -0.057752 -0.012425 -0.008704 \n", "CHARGE -0.028901 -0.020908 -0.011347 \n", "EXPO -0.047705 0.060963 0.032461 \n", "\n", " ANNEE_CONSTRUCTION NB CHARGE EXPO \n", "ANNEE_CTR 0.361550 -0.057752 -0.028901 -0.047705 \n", "AGE_ASSURE_PRINCIPAL -0.059184 -0.012425 -0.020908 0.060963 \n", "ANCIENNETE_PERMIS -0.029814 -0.008704 -0.011347 0.032461 \n", "ANNEE_CONSTRUCTION 1.000000 -0.014377 -0.001230 -0.073953 \n", "NB -0.014377 1.000000 0.507107 0.050702 \n", "CHARGE -0.001230 0.507107 1.000000 -0.021419 \n", "EXPO -0.073953 0.050702 -0.021419 1.000000 " ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Corrélation de Pearson\n", "correlations_num = vars_numeriques.corr(method=\"pearson\")\n", "correlations_num" ] }, { "cell_type": "code", "execution_count": 116, "id": "4c29f1f0", "metadata": {}, "outputs": [], "source": [ "# On repère les variables trop corrélées\n", "nb_variables = correlations_num.shape[0]\n", "for i in range(nb_variables):\n", " for j in range(i + 1, nb_variables):\n", " if abs(correlations_num.iloc[i, j]) > 0.7:\n", " print(\n", " correlations_num.index.to_numpy()[i]\n", " + \" et \"\n", " + correlations_num.columns[j]\n", " + \" sont trop dépendantes, corr = \"\n", " + str(correlations_num.iloc[i, j])\n", " )" ] }, { "cell_type": "markdown", "id": "212209ec", "metadata": {}, "source": [ "#### Preprocessing" ] }, { "cell_type": "markdown", "id": "65aca700", "metadata": {}, "source": [ "Deux étapes sont nécessaires avant de lancer l'apprentissage d'un modèle, c'est ce qu'on connait comme le *Preprocessing* :\n", "\n", "* Les modèles proposés par la librairie \"sklearn\" ne gèrent que des variables numériques. 
Categorical variables must therefore be converted into numeric variables: this process is called *One Hot Encoding*.\n", "* Normalize (standardize) the numeric data" ] }, { "cell_type": "markdown", "id": "6c23d236", "metadata": {}, "source": [ "**Exercise:** write a code snippet that one-hot encodes the categorical variables. You can use the \"preproc.OneHotEncoder\" class from the sklearn library." ] }, { "cell_type": "code", "execution_count": 117, "id": "b8530717", "metadata": {}, "outputs": [ { "data": { "application/vnd.microsoft.datawrangler.viewer.v0+json": { "columns": [ { "name": "index", "rawType": "int64", "type": "integer" }, { "name": "CONTRAT_ANCIENNETE_(0,1]", "rawType": "float64", "type": "float" }, { "name": "CONTRAT_ANCIENNETE_(1,2]", "rawType": "float64", "type": "float" }, { "name": "CONTRAT_ANCIENNETE_(2,5]", "rawType": "float64", "type": "float" }, { "name": "CONTRAT_ANCIENNETE_(5,10]", "rawType": "float64", "type": "float" }, { "name": "FREQUENCE_PAIEMENT_COTISATION_MENSUEL", "rawType": "float64", "type": "float" }, { "name": "FREQUENCE_PAIEMENT_COTISATION_TRIMESTRIEL", "rawType": "float64", "type": "float" }, { "name": "GROUPE_KM_[20000;40000[", "rawType": "float64", "type": "float" }, { "name": "GROUPE_KM_[40000;60000[", "rawType": "float64", "type": "float" }, { "name": "GROUPE_KM_[60000;99999[", "rawType": "float64", "type": "float" }, { "name": "ZONE_RISQUE_B", "rawType": "float64", "type": "float" }, { "name": "ZONE_RISQUE_C", "rawType": "float64", "type": "float" }, { "name": "ZONE_RISQUE_D", "rawType": "float64", "type": "float" }, { "name": "ZONE_RISQUE_E", "rawType": "float64", "type": "float" }, { "name": "ZONE_RISQUE_F", "rawType": "float64", "type": "float" }, { "name": "ZONE_RISQUE_G", "rawType": "float64", "type": "float" }, { "name": "ZONE_RISQUE_H", "rawType": "float64", "type": "float" }, { "name": "ZONE_RISQUE_I", "rawType": "float64", "type": "float" }, { "name": 
"ZONE_RISQUE_J", "rawType": "float64", "type": "float" }, { "name": "ZONE_RISQUE_K", "rawType": "float64", "type": "float" }, { "name": "ZONE_RISQUE_L", "rawType": "float64", "type": "float" }, { "name": "ZONE_RISQUE_M", "rawType": "float64", "type": "float" }, { "name": "ZONE_RISQUE_R", "rawType": "float64", "type": "float" }, { "name": "ZONE_RISQUE_S", "rawType": "float64", "type": "float" }, { "name": "ZONE_RISQUE_T", "rawType": "float64", "type": "float" }, { "name": "ZONE_RISQUE_X", "rawType": "float64", "type": "float" }, { "name": "GENRE_M", "rawType": "float64", "type": "float" }, { "name": "DEUXIEME_CONDUCTEUR_True", "rawType": "float64", "type": "float" }, { "name": "ENERGIE_DIESEL", "rawType": "float64", "type": "float" }, { "name": "ENERGIE_ESSENCE", "rawType": "float64", "type": "float" }, { "name": "EQUIPEMENT_SECURITE_VRAI", "rawType": "float64", "type": "float" }, { "name": "VALEUR_DU_BIEN_[10000;15000[", "rawType": "float64", "type": "float" }, { "name": "VALEUR_DU_BIEN_[15000;20000[", "rawType": "float64", "type": "float" }, { "name": "VALEUR_DU_BIEN_[20000;25000[", "rawType": "float64", "type": "float" }, { "name": "VALEUR_DU_BIEN_[25000;35000[", "rawType": "float64", "type": "float" }, { "name": "VALEUR_DU_BIEN_[35000;99999[", "rawType": "float64", "type": "float" } ], "ref": "babc19df-3fb0-454f-b931-5edcdd6c6a55", "rows": [ [ "0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "1.0", "0.0", "0.0", "1.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "1.0", "0.0", "0.0", "1.0", "0.0", "1.0", "0.0", "0.0", "0.0", "0.0" ], [ "1", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "1.0", "0.0", "0.0", "1.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "1.0", "1.0", "0.0", "1.0", "0.0", "0.0", "1.0", "0.0", "0.0" ], [ "2", "0.0", "1.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "1.0", "0.0", "0.0", "0.0", "0.0", 
"0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "1.0", "0.0", "1.0", "0.0", "0.0", "1.0", "0.0", "0.0", "0.0" ], [ "3", "0.0", "0.0", "1.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "1.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "1.0", "0.0", "0.0", "1.0", "1.0", "0.0", "0.0", "0.0", "0.0", "1.0" ], [ "4", "0.0", "0.0", "1.0", "0.0", "1.0", "0.0", "1.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "1.0", "0.0", "0.0", "1.0", "0.0", "0.0", "0.0", "0.0" ] ], "shape": { "columns": 35, "rows": 5 } }, "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CONTRAT_ANCIENNETE_(0,1]CONTRAT_ANCIENNETE_(1,2]CONTRAT_ANCIENNETE_(2,5]CONTRAT_ANCIENNETE_(5,10]FREQUENCE_PAIEMENT_COTISATION_MENSUELFREQUENCE_PAIEMENT_COTISATION_TRIMESTRIELGROUPE_KM_[20000;40000[GROUPE_KM_[40000;60000[GROUPE_KM_[60000;99999[ZONE_RISQUE_B...GENRE_MDEUXIEME_CONDUCTEUR_TrueENERGIE_DIESELENERGIE_ESSENCEEQUIPEMENT_SECURITE_VRAIVALEUR_DU_BIEN_[10000;15000[VALEUR_DU_BIEN_[15000;20000[VALEUR_DU_BIEN_[20000;25000[VALEUR_DU_BIEN_[25000;35000[VALEUR_DU_BIEN_[35000;99999[
00.00.00.00.00.00.01.00.00.01.0...1.00.00.01.00.01.00.00.00.00.0
10.00.00.00.00.00.01.00.00.01.0...0.01.01.00.01.00.00.01.00.00.0
20.01.00.00.00.00.00.00.00.00.0...0.01.00.01.00.00.01.00.00.00.0
30.00.01.00.00.00.00.00.00.00.0...1.00.00.01.01.00.00.00.00.01.0
40.00.01.00.01.00.01.00.00.00.0...0.00.01.00.00.01.00.00.00.00.0
\n", "

5 rows × 35 columns

\n", "
" ], "text/plain": [ " CONTRAT_ANCIENNETE_(0,1] CONTRAT_ANCIENNETE_(1,2] \\\n", "0 0.0 0.0 \n", "1 0.0 0.0 \n", "2 0.0 1.0 \n", "3 0.0 0.0 \n", "4 0.0 0.0 \n", "\n", " CONTRAT_ANCIENNETE_(2,5] CONTRAT_ANCIENNETE_(5,10] \\\n", "0 0.0 0.0 \n", "1 0.0 0.0 \n", "2 0.0 0.0 \n", "3 1.0 0.0 \n", "4 1.0 0.0 \n", "\n", " FREQUENCE_PAIEMENT_COTISATION_MENSUEL \\\n", "0 0.0 \n", "1 0.0 \n", "2 0.0 \n", "3 0.0 \n", "4 1.0 \n", "\n", " FREQUENCE_PAIEMENT_COTISATION_TRIMESTRIEL GROUPE_KM_[20000;40000[ \\\n", "0 0.0 1.0 \n", "1 0.0 1.0 \n", "2 0.0 0.0 \n", "3 0.0 0.0 \n", "4 0.0 1.0 \n", "\n", " GROUPE_KM_[40000;60000[ GROUPE_KM_[60000;99999[ ZONE_RISQUE_B ... \\\n", "0 0.0 0.0 1.0 ... \n", "1 0.0 0.0 1.0 ... \n", "2 0.0 0.0 0.0 ... \n", "3 0.0 0.0 0.0 ... \n", "4 0.0 0.0 0.0 ... \n", "\n", " GENRE_M DEUXIEME_CONDUCTEUR_True ENERGIE_DIESEL ENERGIE_ESSENCE \\\n", "0 1.0 0.0 0.0 1.0 \n", "1 0.0 1.0 1.0 0.0 \n", "2 0.0 1.0 0.0 1.0 \n", "3 1.0 0.0 0.0 1.0 \n", "4 0.0 0.0 1.0 0.0 \n", "\n", " EQUIPEMENT_SECURITE_VRAI VALEUR_DU_BIEN_[10000;15000[ \\\n", "0 0.0 1.0 \n", "1 1.0 0.0 \n", "2 0.0 0.0 \n", "3 1.0 0.0 \n", "4 0.0 1.0 \n", "\n", " VALEUR_DU_BIEN_[15000;20000[ VALEUR_DU_BIEN_[20000;25000[ \\\n", "0 0.0 0.0 \n", "1 0.0 1.0 \n", "2 1.0 0.0 \n", "3 0.0 0.0 \n", "4 0.0 0.0 \n", "\n", " VALEUR_DU_BIEN_[25000;35000[ VALEUR_DU_BIEN_[35000;99999[ \n", "0 0.0 0.0 \n", "1 0.0 0.0 \n", "2 0.0 0.0 \n", "3 0.0 1.0 \n", "4 0.0 0.0 \n", "\n", "[5 rows x 35 columns]" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# One hot encoding des variables catégorielles\n", "preproc_ohe = preproc.OneHotEncoder(handle_unknown=\"ignore\")\n", "preproc_ohe = preproc.OneHotEncoder(drop=\"first\", sparse_output=False).fit(\n", " vars_categorielles\n", ")\n", "\n", "variables_categorielles_ohe = preproc_ohe.transform(vars_categorielles)\n", "variables_categorielles_ohe = pd.DataFrame(\n", " variables_categorielles_ohe,\n", " 
columns=preproc_ohe.get_feature_names_out(vars_categorielles.columns),\n", ")\n", "variables_categorielles_ohe.head()" ] }, { "cell_type": "markdown", "id": "2be6a3e4", "metadata": {}, "source": [ "**Exercise:** write a code snippet that standardizes the numeric variables in the dataset. You can use the \"preproc.StandardScaler\" class from the sklearn library." ] }, { "cell_type": "code", "execution_count": 118, "id": "4ff3847d", "metadata": {}, "outputs": [ { "data": { "application/vnd.microsoft.datawrangler.viewer.v0+json": { "columns": [ { "name": "index", "rawType": "int64", "type": "integer" }, { "name": "ANNEE_CTR", "rawType": "float64", "type": "float" }, { "name": "AGE_ASSURE_PRINCIPAL", "rawType": "float64", "type": "float" }, { "name": "ANCIENNETE_PERMIS", "rawType": "float64", "type": "float" }, { "name": "ANNEE_CONSTRUCTION", "rawType": "float64", "type": "float" }, { "name": "NB", "rawType": "float64", "type": "float" }, { "name": "CHARGE", "rawType": "float64", "type": "float" }, { "name": "EXPO", "rawType": "float64", "type": "float" } ], "ref": "46a8d9a1-3a1b-4f12-80a5-7301880114ee", "rows": [ [ "0", "0.1393559608666301", "0.6582867283271144", "0.5635879287137437", "0.1740107784615837", "-0.24202868219585674", "-0.181253980627111", "-0.289146035458737" ], [ "1", "0.1393559608666301", "3.1516280073827847", "0.9874335016275682", "0.7442069902648635", "-0.24202868219585674", "-0.181253980627111", "-0.42709265252699025" ], [ "2", "1.3471924655222902", "-0.7350510452628191", "-1.078813666327326", "0.45910888436322356", "-0.24202868219585674", "-0.181253980627111", "0.215020504730438" ], [ "3", "1.3471924655222902", "0.0716181920787214", "0.40464583887105954", "0.7442069902648635", "-0.24202868219585674", "-0.181253980627111", "0.25190705219855114" ], [ "4", "-0.4645622914611999", "0.0716181920787214", "-0.28410321711390524", "-1.8216759628498953", "-0.24202868219585674", "-0.181253980627111", "0.8144269010872852" ] ], 
"shape": { "columns": 7, "rows": 5 } }, "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ANNEE_CTRAGE_ASSURE_PRINCIPALANCIENNETE_PERMISANNEE_CONSTRUCTIONNBCHARGEEXPO
00.1393560.6582870.5635880.174011-0.242029-0.181254-0.289146
10.1393563.1516280.9874340.744207-0.242029-0.181254-0.427093
21.347192-0.735051-1.0788140.459109-0.242029-0.1812540.215021
31.3471920.0716180.4046460.744207-0.242029-0.1812540.251907
4-0.4645620.071618-0.284103-1.821676-0.242029-0.1812540.814427
\n", "
" ], "text/plain": [ " ANNEE_CTR AGE_ASSURE_PRINCIPAL ANCIENNETE_PERMIS ANNEE_CONSTRUCTION \\\n", "0 0.139356 0.658287 0.563588 0.174011 \n", "1 0.139356 3.151628 0.987434 0.744207 \n", "2 1.347192 -0.735051 -1.078814 0.459109 \n", "3 1.347192 0.071618 0.404646 0.744207 \n", "4 -0.464562 0.071618 -0.284103 -1.821676 \n", "\n", " NB CHARGE EXPO \n", "0 -0.242029 -0.181254 -0.289146 \n", "1 -0.242029 -0.181254 -0.427093 \n", "2 -0.242029 -0.181254 0.215021 \n", "3 -0.242029 -0.181254 0.251907 \n", "4 -0.242029 -0.181254 0.814427 " ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Normalisation des varibales numériques\n", "preproc_scale = preproc.StandardScaler(with_mean=True, with_std=True)\n", "preproc_scale.fit(vars_numeriques)\n", "\n", "vars_numeriques_scaled = preproc_scale.transform(vars_numeriques)\n", "vars_numeriques_scaled = pd.DataFrame(\n", " vars_numeriques_scaled, columns=vars_numeriques.columns\n", ")\n", "vars_numeriques_scaled.head()" ] }, { "cell_type": "markdown", "id": "7ecba832", "metadata": {}, "source": [ "## Algorithme supervisé : Gradient Boosting" ] }, { "cell_type": "markdown", "id": "efcb8987", "metadata": {}, "source": [ "A ce stade, nous avons vu les différentes étapes pour lancer un algorithme de Machine Learning. Néanmoins, ces étapes ne sont pas suffisantes pour construire un modèle performant. \n", "En effet, afin de construire un modèle performant le Data Scientist doit agir sur l'apprentissage du modèle. 
In what follows, we will:\n", "* Switch to a better-performing algorithm (Gradient Boosting)\n", "* Run a *grid search* over the model's hyperparameters\n", "* Train the model with cross-validation\n" ] }, { "cell_type": "markdown", "id": "3feaff44", "metadata": {}, "source": [ "**Exercise:** Implement the Gradient Boosting algorithm using the techniques covered in the previous sessions (sampling, grid search and cross-validation).  \n", "**Remarks:**\n", "* You can use the \"GradientBoostingClassifier\" and \"GridSearchCV\" classes from the sklearn library. \n", "* Remember to use metrics suited to classification problems." ] }, { "cell_type": "markdown", "id": "5a6adbfe", "metadata": {}, "source": [ "#### Sampling" ] }, { "cell_type": "code", "execution_count": 119, "id": "d9342ad6", "metadata": {}, "outputs": [], "source": [ "X_global = vars_numeriques_scaled.merge(\n", "    variables_categorielles_ohe, left_index=True, right_index=True\n", ")\n", "\n", "# Build the feature matrix and the target\n", "X = X_global.to_numpy()\n", "Y = data_retraitee[\"sinistré\"]\n", "\n", "# Split into 80% train and 20% test, stratified on the target\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    X, Y, test_size=0.2, random_state=42, stratify=Y\n", ")" ] }, { "cell_type": "markdown", "id": "76ece01f", "metadata": {}, "source": [ "#### Fitting with Cross-Validation and *Grid Search*" ] }, { "cell_type": "code", "execution_count": 120, "id": "cb60fe19", "metadata": {}, "outputs": [], "source": [ "# Hyperparameter grid to search\n", "param_grid = {\n", "    \"n_estimators\": [100, 200, 250],\n", "    \"learning_rate\": [0.5, 0.7, 0.9],\n", "}\n", "scoring = \"recall\"\n", "# Number of folds for cross-validation\n", "num_folds = 5" ] }, { "cell_type": "code", "execution_count": 121, "id": "b976720e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best hyperparameters: {'learning_rate': 0.5, 'n_estimators': 100}\n" ] } ], "source": [ "# Initialize the GradientBoostingClassifier model\n", "gbc = GradientBoostingClassifier(random_state=42)\n", "\n", "# Create the GridSearchCV object: grid search with cross-validation\n", "grid_search = GridSearchCV(\n", "    estimator=gbc,\n", "    param_grid=param_grid,\n", "    cv=StratifiedKFold(\n", "        n_splits=num_folds, shuffle=True, random_state=42\n", "    ),  # Stratified cross-validation with num_folds folds\n", "    scoring=scoring,  # Evaluation metric (recall: higher is better)\n", "    n_jobs=-1,  # Use all available CPU cores\n", ")\n", "\n", "# Run the grid search\n", "grid_search.fit(X_train, y_train)\n", "\n", "# Print the best hyperparameters\n", "best_params = grid_search.best_params_\n", "print(\"Best hyperparameters:\", best_params)\n" ] }, { "cell_type": "code", "execution_count": 122, "id": "0a35a4bf", "metadata": {}, "outputs": [], "source": [ "# Initialize the final model with the best hyperparameters\n", "best_gbc = GradientBoostingClassifier(random_state=42, **best_params)" ] }, { "cell_type": "code", "execution_count": 123, "id": "e12177a8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "recall for fold 1: 1.0\n", "recall for fold 2: 1.0\n", "recall for fold 3: 1.0\n", "recall for fold 4: 1.0\n", "recall for fold 5: 1.0\n" ] } ], "source": [ "# Cross-validation of the final model\n", "# Score of each fold, using the same metric as the grid search (recall)\n", "cv_scores = cross_val_score(best_gbc, X_train, y_train, cv=num_folds, scoring=scoring)\n", "\n", "# Print the score of each fold\n", "for i, score in enumerate(cv_scores):\n", "    print(f\"{scoring} for fold {i + 1}: {score}\")\n" ] }, { "cell_type": "markdown", "id": "3a723cbc", "metadata": {}, "source": [ "#### Model validation - metrics" ] }, { "cell_type": "markdown", "id": "60c0312d", "metadata": {}, "source": [ "**Exercise:** \n", "* Build the confusion matrix (metrics.confusion_matrix).\n", "* Compute the metrics: accuracy, recall and precision." ] }, { "cell_type": "code", "execution_count": null, "id": "5d9ef448", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "studies", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.3" } }, "nbformat": 4, "nbformat_minor": 5 }