Compare commits

...

3 Commits

5 changed files with 33082 additions and 14482 deletions

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large

File diff suppressed because it is too large

View File

@@ -0,0 +1,541 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "8750d15b",
"metadata": {},
"source": [
"# Course 2: Unsupervised Algorithms"
]
},
{
"cell_type": "markdown",
"id": "f7c08ae5",
"metadata": {},
"source": [
"## Preamble"
]
},
{
"cell_type": "markdown",
"id": "ec7ecb4b",
"metadata": {},
"source": [
"The objectives of this session (3 h) are:\n",
"* Apply unsupervised models in practice (K-means and hierarchical agglomerative clustering, HAC)"
]
},
{
"cell_type": "markdown",
"id": "4e99c600",
"metadata": {},
"source": [
"## Workspace Setup"
]
},
{
"cell_type": "markdown",
"id": "c1b01045",
"metadata": {},
"source": [
"### Library Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "97d58527",
"metadata": {},
"outputs": [],
"source": [
"# Data\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Plotting\n",
"import seaborn as sns\n",
"sns.set()\n",
"import plotly.express as px\n",
"import plotly.graph_objects as gp\n",
"\n",
"# Statistics\n",
"from scipy.stats import chi2_contingency\n",
"\n",
"# Machine Learning\n",
"from sklearn.cluster import KMeans\n",
"import matplotlib.pyplot as plt\n",
"from scipy.cluster.hierarchy import dendrogram, linkage\n",
"from sklearn.cluster import AgglomerativeClustering"
]
},
{
"cell_type": "markdown",
"id": "985e4e97",
"metadata": {},
"source": [
"### Constants"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c9597b48",
"metadata": {},
"outputs": [],
"source": [
"input_path = \"./1_inputs\"\n",
"output_path = \"./2_outputs\""
]
},
{
"cell_type": "markdown",
"id": "b2ff398d",
"metadata": {},
"source": [
"## Exercise (implementing the exercises from the course handout)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea2a0164",
"metadata": {},
"outputs": [],
"source": [
"# Definition of E\n",
"x = ...  # Complete with your code\n",
"\n",
"# Plot the points on a line (one y value per element of E)\n",
"y = [0, 0, 0, 0, 0]\n",
"plt.scatter(x, y)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "5e4abc23",
"metadata": {},
"source": [
"### K-means: Question 1"
]
},
{
"cell_type": "markdown",
"id": "5dea6f90",
"metadata": {},
"source": [
"**Determine the optimal partition with k-means, taking elements 1, 2 and 18 as the initial centers**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "41cc10ba",
"metadata": {},
"outputs": [],
"source": [
"# Initial centers\n",
"init_points = ...  # Complete with your code (a NumPy array, since .reshape is called below)\n",
"\n",
"# Initialize the algorithm\n",
"kmeans = KMeans(init=init_points.reshape(-1, 1),\n",
"                n_clusters=...,  # Complete with your code\n",
"                n_init=1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "54857e7b",
"metadata": {},
"outputs": [],
"source": [
"# Reshape the data: several samples, each of dimension 1\n",
"data_x = np.array(x)\n",
"data_x = data_x.reshape(-1, 1)\n",
"\n",
"# Fit the model\n",
"kmeans.fit(data_x)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "72efd783",
"metadata": {},
"outputs": [],
"source": [
"# Final centroids and cluster labels\n",
"final_centroids = kmeans.cluster_centers_\n",
"labels = kmeans.labels_\n",
"\n",
"final_centroids"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3110c8ca",
"metadata": {},
"outputs": [],
"source": [
"# Plot the partition (one color per cluster)\n",
"plt.scatter(x, y, c=labels, cmap='viridis')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "a24927bc",
"metadata": {},
"source": [
"### K-means: Question 2"
]
},
{
"cell_type": "markdown",
"id": "c18297ba",
"metadata": {},
"source": [
"**Determine the optimal partition with k-means, taking elements 18, 20 and 31 as the initial centers**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d0ccbcf3-a06f-4757-bdd8-2cc3bd1626c6",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "b957bbe8",
"metadata": {},
"outputs": [],
"source": [
"# Complete with your code"
]
},
{
"cell_type": "markdown",
"id": "2b85bc73",
"metadata": {},
"source": [
"### K-means: Question 3"
]
},
{
"cell_type": "markdown",
"id": "0c085473",
"metadata": {},
"source": [
"**Determine the optimal partition with k-means, taking {{1},{2,18},{20,31}} as the initial partition**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0047b80a",
"metadata": {},
"outputs": [],
"source": [
"# Complete with your code"
]
},
{
"cell_type": "markdown",
"id": "5eaad20e",
"metadata": {},
"source": [
"### Hierarchical Agglomerative Clustering with Single Linkage"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1ebaaa05",
"metadata": {},
"outputs": [],
"source": [
"# Definition of E\n",
"x = ...  # Complete with your code\n",
"\n",
"# Plot the points on a line (one y value per element of E)\n",
"y = [0, 0, 0, 0, 0, 0, 0]\n",
"plt.scatter(x, y)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e96f7f3",
"metadata": {},
"outputs": [],
"source": [
"# HAC with single linkage\n",
"data = list(zip(x))  # one 1-tuple per observation\n",
"\n",
"linkage_data = linkage(data,\n",
"                       method=...,  # Complete with your code\n",
"                       metric=...)  # Complete with your code\n",
"\n",
"dendrogram(linkage_data, labels=x)\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "874c878c",
"metadata": {},
"outputs": [],
"source": [
"# Partition of the space\n",
"# Note: `metric` replaces `affinity`, which was deprecated in scikit-learn 1.2 and removed in 1.4\n",
"hierarchical_cluster = AgglomerativeClustering(n_clusters=...,  # Complete with your code\n",
"                                               metric=...,  # Complete with your code\n",
"                                               linkage=...)  # Complete with your code\n",
"\n",
"labels = hierarchical_cluster.fit_predict(data)\n",
"print(labels)\n",
"\n",
"# Plot the partition (one color per cluster)\n",
"plt.scatter(x, y, c=labels, cmap='viridis')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "75420ae4",
"metadata": {},
"source": [
"### Hierarchical Agglomerative Clustering with Complete Linkage"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f098bc3",
"metadata": {},
"outputs": [],
"source": [
"# Complete with your code"
]
},
{
"cell_type": "markdown",
"id": "99bc3508",
"metadata": {},
"source": [
"## K-means: Case Study"
]
},
{
"cell_type": "markdown",
"id": "b2b035d2",
"metadata": {},
"source": [
"### Data Import"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8051b5f4",
"metadata": {},
"outputs": [],
"source": [
"path = input_path + '/base_retraitee.csv'\n",
"data_retraitee = pd.read_csv(path, sep=\",\", decimal=\".\")"
]
},
{
"cell_type": "markdown",
"id": "aeff9cff",
"metadata": {},
"source": [
"**Exercise:** Group the geographical zones into 5 homogeneous zones in terms of:\n",
"* Claim frequency (the number of claims divided by the exposure)\n",
"* Claim amount\n",
"* Claim frequency x claim amount\n",
" \n",
"In each case:\n",
"* Display the coordinates of the centroids\n",
"* Plot the resulting partition"
]
},
{
"cell_type": "markdown",
"id": "1c4333b8",
"metadata": {},
"source": [
"### Grouping zones by frequency"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e35f286",
"metadata": {},
"outputs": [],
"source": [
"# Complete with your code"
]
},
{
"cell_type": "markdown",
"id": "9c738659",
"metadata": {},
"source": [
"### Grouping zones by average cost"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f461bfb8",
"metadata": {},
"outputs": [],
"source": [
"# Complete with your code"
]
},
{
"cell_type": "markdown",
"id": "6b154f4a",
"metadata": {},
"source": [
"### Grouping zones by (frequency; average cost)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1d89f70e",
"metadata": {},
"outputs": [],
"source": [
"# Complete with your code"
]
},
{
"cell_type": "markdown",
"id": "f1cac03f",
"metadata": {},
"source": [
"## HAC: Case Study"
]
},
{
"cell_type": "markdown",
"id": "bffff328",
"metadata": {},
"source": [
"**Exercise:** Compare the results obtained with K-means to those of an HAC (single linkage), for the frequency and for (frequency; average cost)\n",
" \n",
"In each case:\n",
"* Draw the corresponding dendrogram\n",
"* Plot the resulting partition"
]
},
{
"cell_type": "markdown",
"id": "8453bf02",
"metadata": {},
"source": [
"### Grouping zones by frequency"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "341bf2b2",
"metadata": {},
"outputs": [],
"source": [
"# Complete with your code"
]
},
{
"cell_type": "markdown",
"id": "6ace7bc5",
"metadata": {},
"source": [
"### Grouping zones by (frequency; average cost)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "16103b5b",
"metadata": {},
"outputs": [],
"source": [
"# Complete with your code"
]
},
{
"cell_type": "markdown",
"id": "12961201",
"metadata": {},
"source": [
"## Application: Creating Model Points (K-means)"
]
},
{
"cell_type": "markdown",
"id": "6567c970",
"metadata": {},
"source": [
"In some cases, line-by-line modeling is not appropriate. This is the case for group insurance products, or when the number of individuals is too large. \n",
"In such situations, the information must be aggregated to obtain \"representative individuals\". Each of these individuals is called a *Model Point*. \n",
"The k-means algorithm can be useful for grouping individuals into *Model Points* when the explanatory variables are numerical. \n",
" \n",
"To illustrate this, we will aggregate the dataset on the variables ANNEE_CTR, AGE_ASSURE_PRINCIPAL, ANCIENNETE_PERMIS and ANNEE_CONSTRUCTION in order to create 100 Model Points."
]
},
{
"cell_type": "markdown",
"id": "a250bff9",
"metadata": {},
"source": [
"**Exercise:** Build the new modeling dataset (the Model Points become the new individuals, and each variable takes the value of its class centroid)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b42c2b5",
"metadata": {},
"outputs": [],
"source": [
"# Complete with your code"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
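
Review note on the notebook above: K-means Questions 1 and 2 differ only in their initial centers, so a small sketch may help show why the starting point matters. It assumes E = {1, 2, 18, 20, 31}, which is inferred from the elements named in the exercise statements and is not given explicitly in the notebook. `run_kmeans` is a hypothetical helper introduced for illustration only; this is one possible way to fill in the blanks, not the course's reference solution.

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumption: E = {1, 2, 18, 20, 31}, inferred from the exercise statements.
E = np.array([1, 2, 18, 20, 31], dtype=float).reshape(-1, 1)


def run_kmeans(initial_centers):
    """Run k-means on E from explicit initial centers (hypothetical helper)."""
    init = np.array(initial_centers, dtype=float).reshape(-1, 1)
    # n_init=1 because the starting centers are fixed by the exercise.
    model = KMeans(n_clusters=len(initial_centers), init=init, n_init=1)
    model.fit(E)
    return model.labels_, model.cluster_centers_.ravel(), model.inertia_


# Question 1: initial centers 1, 2, 18
labels_q1, centers_q1, inertia_q1 = run_kmeans([1, 2, 18])
# Question 2: initial centers 18, 20, 31
labels_q2, centers_q2, inertia_q2 = run_kmeans([18, 20, 31])

print("Q1:", labels_q1, centers_q1, inertia_q1)
print("Q2:", labels_q2, centers_q2, inertia_q2)
```

Under the stated assumption on E, the two starts end in different partitions, and comparing `inertia_q1` with `inertia_q2` shows which one has the lower within-cluster sum of squares. K-means only guarantees a local optimum for a given initialization.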

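Review note on the Model Points section of the notebook: the aggregation it describes could be sketched as below. The four column names and the target of 100 Model Points come from the notebook. The synthetic `portfolio` table, its value ranges, and the `N_POLICES` weight column are assumptions made up for this illustration, since `base_retraitee.csv` is not part of the diff.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

features = ["ANNEE_CTR", "AGE_ASSURE_PRINCIPAL",
            "ANCIENNETE_PERMIS", "ANNEE_CONSTRUCTION"]

# Synthetic stand-in for the real portfolio (hypothetical values and ranges).
rng = np.random.default_rng(0)
n_rows = 2000
portfolio = pd.DataFrame({
    "ANNEE_CTR": rng.integers(2015, 2021, n_rows),
    "AGE_ASSURE_PRINCIPAL": rng.integers(18, 90, n_rows),
    "ANCIENNETE_PERMIS": rng.integers(0, 60, n_rows),
    "ANNEE_CONSTRUCTION": rng.integers(1995, 2021, n_rows),
})

# One cluster per Model Point, as requested in the notebook.
n_model_points = 100
kmeans = KMeans(n_clusters=n_model_points, n_init=5, random_state=0)
labels = kmeans.fit_predict(portfolio[features])

# Each Model Point is a centroid; N_POLICES (hypothetical name) stores how
# many original rows it represents, so totals can still be recovered.
model_points = pd.DataFrame(kmeans.cluster_centers_, columns=features)
model_points["N_POLICES"] = np.bincount(labels, minlength=n_model_points)

print(model_points.head())
```

The features are clustered on their raw scale here to keep the sketch short. Standardizing them first (for example with `sklearn.preprocessing.StandardScaler`) is a common alternative when the variables have very different ranges.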
View File

@@ -1,10 +1,11 @@
# Studies
# ArtStudies
[Studies Projects](https://github.com/ArthurDanjou/studies) is a curated collection of academic projects completed throughout my mathematics studies. The repository showcases work in both _Python_ and _R_, focusing on mathematical modeling, data analysis, and numerical methods.
[ArtStudies Projects](https://github.com/ArthurDanjou/artstudies) is a curated collection of academic projects completed throughout my mathematics studies. The repository showcases work in both _Python_ and _R_, focusing on mathematical modeling, data analysis, and numerical methods.
The projects are organized into three main sections:
- **L3** Third year of the Bachelor's degree in Mathematics
- **M1** First year of the Master's degree in Mathematics
- **M2** Second year of the Master's degree in Mathematics
## 📁 File Structure
@@ -27,6 +28,10 @@ The projects are organized into two main sections:
- `Portfolio Management`
- `Statistical Learning`
- `M2`
- `Machine Learning`
- `SQL`
## 🛠️ Technologies & Tools
- [Python](https://www.python.org): A high-level, interpreted programming language, widely used for data science, machine learning, and scientific computing.
@@ -38,6 +43,8 @@ The projects are organized into two main sections:
- [Scikit-learn](https://scikit-learn.org): A robust library offering simple and efficient tools for machine learning and statistical modeling, including classification, regression, and clustering.
- [TensorFlow](https://www.tensorflow.org): A comprehensive open-source framework for building and deploying machine learning and deep learning models.
- [Matplotlib](https://matplotlib.org): A versatile plotting library for creating high-quality static, animated, and interactive visualizations in Python.
- [Plotly](https://plotly.com): An interactive graphing library for creating dynamic visualizations in Python and R.
- [Seaborn](https://seaborn.pydata.org): A statistical data visualization library built on top of Matplotlib, providing a high-level interface for drawing attractive and informative graphics.
- [RMarkdown](https://rmarkdown.rstudio.com): A dynamic tool for combining code, results, and narrative into high-quality documents and presentations.
- [FactoMineR](https://factominer.free.fr/): An R package focused on multivariate exploratory data analysis (e.g., PCA, MCA, CA).
- [ggplot2](https://ggplot2.tidyverse.org): A grammar-based graphics package for creating complex and elegant visualizations in R.