Add commentary

AntoninDurousseau
2025-06-05 08:15:02 +02:00
parent a5aacba14c
commit f62f2410e7


@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 1,
"id": "4e6f6cb1",
"metadata": {},
"outputs": [],
@@ -20,7 +20,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 2,
"id": "4dd5223b",
"metadata": {},
"outputs": [],
@@ -31,7 +31,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 3,
"id": "c1ab7ec9",
"metadata": {},
"outputs": [
@@ -86,7 +86,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 4,
"id": "754dce9b",
"metadata": {},
"outputs": [
@@ -95,13 +95,15 @@
"output_type": "stream",
"text": [
"Number of samples: 116\n",
"Number of features: 9\n"
"Number of features: 9\n",
"Number of classes: 2\n"
]
}
],
"source": [
"print(\"Number of samples:\", X.shape[0])\n",
"print(\"Number of features:\", X.shape[1])"
"print(\"Number of features:\", X.shape[1])\n",
"print(\"Number of classes:\", len(np.unique(y)))"
]
},
{
@@ -122,7 +124,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 5,
"id": "b2e03ac1",
"metadata": {},
"outputs": [
@@ -150,6 +152,16 @@
"print(\"The best k for k-NN is k =\", k_optimal)"
]
},
+{
+"cell_type": "markdown",
+"id": "698d03a8",
+"metadata": {},
+"source": [
+"In k-NN classification, to achieve the best prediction performance, we need to find the number of neighbors that maximizes the evaluation score of our models. Here, we use the $f1\_score$ from sklearn.metrics, as it provides a good balance between precision (the fraction of patients predicted as sick who are actually sick) and recall (the fraction of truly sick patients who are correctly identified as sick).\n",
+"\n",
+"To determine this hyperparameter, we use 5-fold cross-validation. We chose 5 folds instead of 10 due to the limited amount of data, as this gives a better balance between the sizes of the training and validation folds. After cross-validation, the optimal number of neighbors turns out to be $k = 23$."
+]
+},
{
"cell_type": "markdown",
"id": "9f74eaee",
@@ -168,28 +180,19 @@
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 6,
"id": "70281897",
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # train/test split with 70% training and 30% testing\n",
"\n",
"\n",
"# Feature scaling\n",
"scaler = StandardScaler()\n",
"\n",
"X_train_scaled = pd.DataFrame(\n",
" scaler.fit_transform(X_train),\n",
" columns=X_train.columns,\n",
" index=X_train.index\n",
")\n",
"X_train_scaled = scaler.fit_transform(X_train)\n",
"\n",
"X_test_scaled = pd.DataFrame(\n",
" scaler.transform(X_test),\n",
" columns=X_test.columns,\n",
" index=X_test.index\n",
")"
"X_test_scaled = scaler.transform(X_test)"
]
},
{
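
One point the scaling change above relies on: the scaler is fit on the training split only, and the test split reuses the training statistics via transform. A minimal standalone sketch of that asymmetry (toy data, names hypothetical):

    from sklearn.preprocessing import StandardScaler
    import numpy as np

    X_train = np.array([[1.0], [2.0], [3.0]])  # toy training split
    X_test = np.array([[2.0], [4.0]])          # toy test split

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train
    X_test_scaled = scaler.transform(X_test)        # reuses train statistics

    print(scaler.mean_, scaler.scale_)  # statistics come from X_train only

Fitting the scaler on the full dataset before splitting would leak test information into the training pipeline.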
@@ -202,7 +205,7 @@
},
{
"cell_type": "code",
"execution_count": 16,
"execution_count": 7,
"id": "064a5aa7",
"metadata": {},
"outputs": [
@@ -247,7 +250,6 @@
}
],
"source": [
"\n",
"knn = KNeighborsClassifier(n_neighbors=k_optimal) # using the best k founded earlier\n",
"knn.fit(X_train_scaled, y_train)\n",
"\n",
@@ -295,12 +297,12 @@
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c158b385",
"cell_type": "markdown",
"id": "9bf7ed62",
"metadata": {},
"outputs": [],
"source": []
"source": [
"In this optimized k-NN classification, we aim to maximize recall while maintaining good accuracy, in order to minimize the number of misclassifications, particularly cases where a sick patient is incorrectly predicted as healthy. We achieve this goal with a recall of 89% and an accuracy of 80%."
]
}
],
"metadata": {