diff --git a/knn.ipynb b/knn.ipynb
index 6ef404f..d4ee2e1 100644
--- a/knn.ipynb
+++ b/knn.ipynb
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 1,
    "id": "4e6f6cb1",
    "metadata": {},
    "outputs": [],
@@ -20,7 +20,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 2,
    "id": "4dd5223b",
    "metadata": {},
    "outputs": [],
@@ -31,7 +31,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": 3,
    "id": "c1ab7ec9",
    "metadata": {},
    "outputs": [
@@ -86,7 +86,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 13,
+   "execution_count": 4,
    "id": "754dce9b",
    "metadata": {},
    "outputs": [
@@ -95,13 +95,15 @@
      "output_type": "stream",
      "text": [
       "Number of samples: 116\n",
-      "Number of features: 9\n"
+      "Number of features: 9\n",
+      "Number of classes: 2\n"
      ]
     }
    ],
    "source": [
     "print(\"Number of samples:\", X.shape[0])\n",
-    "print(\"Number of features:\", X.shape[1])"
+    "print(\"Number of features:\", X.shape[1])\n",
+    "print(\"Number of classes:\", len(np.unique(y)))"
    ]
   },
@@ -122,7 +124,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 14,
+   "execution_count": 5,
    "id": "b2e03ac1",
    "metadata": {},
    "outputs": [
@@ -150,6 +152,16 @@
     "print(\"The best k for k-NN is k =\", k_optimal)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "698d03a8",
+   "metadata": {},
+   "source": [
+    "In k-NN classification, to achieve the best prediction performance, we need to find the number of neighbors that maximizes the evaluation score of our models. Here, we use the `f1_score` from sklearn.metrics, as it balances precision (the fraction of patients predicted as sick who really are sick) and recall (the fraction of truly sick patients who are correctly identified as sick).\n",
+    "\n",
+    "To determine this hyperparameter, we use 5-fold cross-validation. We chose 5 folds instead of 10 due to the limited amount of data, as this provides a better balance between the sizes of the training and validation sets. After cross-validation, the optimal number of neighbors turns out to be $k = 23$."
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "9f74eaee",
    "metadata": {},
@@ -168,28 +180,19 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 15,
+   "execution_count": 6,
    "id": "70281897",
    "metadata": {},
    "outputs": [],
    "source": [
     "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # train/test split with 70% training and 30% testing\n",
     "\n",
-    "\n",
     "# Feature scaling\n",
     "scaler = StandardScaler()\n",
     "\n",
-    "X_train_scaled = pd.DataFrame(\n",
-    "    scaler.fit_transform(X_train),\n",
-    "    columns=X_train.columns,\n",
-    "    index=X_train.index\n",
-    ")\n",
+    "X_train_scaled = scaler.fit_transform(X_train)\n",
     "\n",
-    "X_test_scaled = pd.DataFrame(\n",
-    "    scaler.transform(X_test),\n",
-    "    columns=X_test.columns,\n",
-    "    index=X_test.index\n",
-    ")"
+    "X_test_scaled = scaler.transform(X_test)"
    ]
   },
   {
@@ -202,7 +205,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 16,
+   "execution_count": 7,
    "id": "064a5aa7",
    "metadata": {},
    "outputs": [
@@ -247,7 +250,6 @@
     }
    ],
    "source": [
-    "\n",
     "knn = KNeighborsClassifier(n_neighbors=k_optimal) # using the best k found earlier\n",
     "knn.fit(X_train_scaled, y_train)\n",
     "\n",
@@ -295,12 +297,12 @@
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "c158b385",
+   "cell_type": "markdown",
+   "id": "9bf7ed62",
    "metadata": {},
-   "outputs": [],
-   "source": []
+   "source": [
+    "In this optimized k-NN classification, we aim to maximize recall while maintaining good accuracy, in order to minimize the most harmful misclassifications: cases where a sick patient is incorrectly predicted as healthy (false negatives). We achieve this goal with a recall of 89% and an accuracy of 80%."
+   ]
  }
 ],
 "metadata": {
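Note on the new markdown cell 698d03a8: the diff does not show the search cell that produces `k_optimal`, only its print statement and the reported result $k = 23$. A minimal sketch of that selection, assuming a `cross_val_score`-based grid over odd k (the candidate range is a guess; `X`, `y`, and `k_optimal` are the notebook's own names), might look like this:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical candidate range; the notebook only reports the winner, k = 23.
k_values = list(range(1, 50, 2))

mean_f1 = []
for k in k_values:
    # Scaling inside the pipeline so each CV fold is standardized
    # using only its own training split (no leakage into validation).
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")  # 5-fold CV, F1 metric
    mean_f1.append(scores.mean())

k_optimal = k_values[int(np.argmax(mean_f1))]
print("The best k for k-NN is k =", k_optimal)
```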
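Likewise, the "optimized" variant described in new cell 9bf7ed62 (recall 89%, accuracy 80%) trades precision for recall by some means the diff does not show. One common way to do that, sketched here purely as an assumption, is lowering the decision threshold on `predict_proba` below the default 0.5; the threshold value is illustrative, and `knn`, `X_test_scaled`, and `y_test` refer to the notebook's variables:

```python
from sklearn.metrics import accuracy_score, recall_score

# Probability of the positive ("sick") class for each test sample,
# assuming labels are encoded as 0 = healthy, 1 = sick.
y_prob = knn.predict_proba(X_test_scaled)[:, 1]

# Hypothetical threshold below 0.5: more patients get flagged as sick,
# raising recall at some cost in precision/accuracy.
threshold = 0.35
y_pred_opt = (y_prob >= threshold).astype(int)

print("Recall:  ", recall_score(y_test, y_pred_opt))
print("Accuracy:", accuracy_score(y_test, y_pred_opt))
```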