Add commentary

AntoninDurousseau
2025-06-05 08:15:02 +02:00
parent a5aacba14c
commit f62f2410e7


@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 1,
"id": "4e6f6cb1",
"metadata": {},
"outputs": [],
@@ -20,7 +20,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 2,
"id": "4dd5223b",
"metadata": {},
"outputs": [],
@@ -31,7 +31,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 3,
"id": "c1ab7ec9",
"metadata": {},
"outputs": [
@@ -86,7 +86,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 4,
"id": "754dce9b",
"metadata": {},
"outputs": [
@@ -95,13 +95,15 @@
"output_type": "stream",
"text": [
"Number of samples: 116\n",
"Number of features: 9\n"
"Number of features: 9\n",
"Number of classes: 2\n"
]
}
],
"source": [
"print(\"Number of samples:\", X.shape[0])\n",
"print(\"Number of features:\", X.shape[1])"
"print(\"Number of features:\", X.shape[1])\n",
"print(\"Number of classes:\", len(np.unique(y)))"
]
},
{
@@ -122,7 +124,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 5,
"id": "b2e03ac1",
"metadata": {},
"outputs": [
@@ -150,6 +152,16 @@
"print(\"The best k for k-NN is k =\", k_optimal)"
]
},
+{
+"cell_type": "markdown",
+"id": "698d03a8",
+"metadata": {},
+"source": [
+"In k-NN classification, to achieve the best prediction performance, we need to find the number of neighbors that maximizes the evaluation score of our models. Here, we use the $f1\_score$ from sklearn.metrics, as it provides a good balance between precision (the fraction of patients predicted as sick who are actually sick) and recall (the fraction of truly sick patients who are correctly identified as sick).\n",
+"\n",
+"To determine this hyperparameter, we use 5-fold cross-validation. We chose 5 folds instead of 10 due to the limited amount of data, as this gives a better balance between the sizes of the training and validation folds. After cross-validation, the optimal number of neighbors turns out to be $k = 23$."
+]
+},
{
"cell_type": "markdown",
"id": "9f74eaee",
@@ -168,28 +180,19 @@
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 6,
"id": "70281897",
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # train/test split with 70% training and 30% testing\n",
"\n",
"\n",
"# Feature scaling\n",
"scaler = StandardScaler()\n",
"\n",
"X_train_scaled = pd.DataFrame(\n",
" scaler.fit_transform(X_train),\n",
" columns=X_train.columns,\n",
" index=X_train.index\n",
")\n",
"X_train_scaled = scaler.fit_transform(X_train)\n",
"\n",
"X_test_scaled = pd.DataFrame(\n",
" scaler.transform(X_test),\n",
" columns=X_test.columns,\n",
" index=X_test.index\n",
")"
"X_test_scaled = scaler.transform(X_test)"
]
},
{
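
One point the scaling change above relies on: the scaler is fit on the training split only, and the test split reuses the training statistics via transform. A minimal standalone sketch of that asymmetry (toy data, names hypothetical):

    from sklearn.preprocessing import StandardScaler
    import numpy as np

    X_train = np.array([[1.0], [2.0], [3.0]])  # toy training split
    X_test = np.array([[2.0], [4.0]])          # toy test split

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train
    X_test_scaled = scaler.transform(X_test)        # reuses train statistics

    print(scaler.mean_, scaler.scale_)  # statistics come from X_train only

Fitting the scaler on the full dataset before splitting would leak test information into the training pipeline.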
@@ -202,7 +205,7 @@
},
{
"cell_type": "code",
"execution_count": 16,
"execution_count": 7,
"id": "064a5aa7",
"metadata": {},
"outputs": [
@@ -247,7 +250,6 @@
}
],
"source": [
"\n",
"knn = KNeighborsClassifier(n_neighbors=k_optimal) # using the best k founded earlier\n",
"knn.fit(X_train_scaled, y_train)\n",
"\n",
@@ -295,12 +297,12 @@
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c158b385",
"cell_type": "markdown",
"id": "9bf7ed62",
"metadata": {},
"outputs": [],
"source": []
"source": [
"In this optimized k-NN classification, we aim to maximize recall while maintaining good accuracy, in order to minimize the number of misclassifications, particularly cases where a sick patient is incorrectly predicted as healthy. We achieve this goal with a recall of 89% and an accuracy of 80%."
]
}
],
"metadata": {