Mirror of https://github.com/ArthurDanjou/breast-cancer-detection.git (synced 2026-01-14 13:54:06 +01:00)
Add commentary
knn.ipynb (54 lines changed)
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 1,
    "id": "4e6f6cb1",
    "metadata": {},
    "outputs": [],
@@ -20,7 +20,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 2,
    "id": "4dd5223b",
    "metadata": {},
    "outputs": [],
@@ -31,7 +31,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": 3,
    "id": "c1ab7ec9",
    "metadata": {},
    "outputs": [
@@ -86,7 +86,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 13,
+   "execution_count": 4,
    "id": "754dce9b",
    "metadata": {},
    "outputs": [
@@ -95,13 +95,15 @@
      "output_type": "stream",
      "text": [
       "Number of samples: 116\n",
-      "Number of features: 9\n"
+      "Number of features: 9\n",
+      "Number of classes: 2\n"
      ]
     }
    ],
    "source": [
     "print(\"Number of samples:\", X.shape[0])\n",
-    "print(\"Number of features:\", X.shape[1])"
+    "print(\"Number of features:\", X.shape[1])\n",
+    "print(\"Number of classes:\", len(np.unique(y)))"
    ]
   },
   {
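This hunk adds a class count to the dataset-summary cell. A minimal stand-in sketch of that cell — `X` and `y` here are hypothetical arrays shaped like the notebook's 116-sample, 9-feature, 2-class dataset, not the real data:

```python
import numpy as np

# Hypothetical stand-ins shaped like the notebook's dataset.
X = np.zeros((116, 9))
y = np.array([0, 1] * 58)

print("Number of samples:", X.shape[0])   # -> 116
print("Number of features:", X.shape[1])  # -> 9
print("Number of classes:", len(np.unique(y)))  # -> 2
```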
@@ -122,7 +124,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 14,
+   "execution_count": 5,
    "id": "b2e03ac1",
    "metadata": {},
    "outputs": [
@@ -150,6 +152,16 @@
     "print(\"The best k for k-NN is k =\", k_optimal)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "698d03a8",
+   "metadata": {},
+   "source": [
+    "In k-NN classification, to achieve the best prediction performance, we need to find the number of neighbors that maximizes the evaluation score of our models. Here, we use the $f1\\_score$ from sklearn.metrics, as it balances precision (the proportion of patients predicted as sick who are actually sick) and recall (the proportion of truly sick patients who are correctly identified as sick).\n",
+    "\n",
+    "To determine this hyperparameter, we use 5-fold cross-validation. We chose 5 folds instead of 10 due to the limited amount of data, as this gives a better balance between the sizes of the training and validation sets. Cross-validation shows that the optimal number of neighbors is $k = 23$."
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "9f74eaee",
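The markdown cell added in this hunk describes choosing k by 5-fold cross-validation on the f1 score. A hedged sketch of that selection loop, on synthetic stand-in data of the same shape — the notebook reports k = 23 on its real data, whereas this toy data will generally pick a different k:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the notebook's 116-sample, 9-feature dataset.
X, y = make_classification(n_samples=116, n_features=9, random_state=42)

# Score each candidate k with 5-fold cross-validation on f1, keep the best.
k_values = range(1, 31)
scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                    cv=5, scoring="f1").mean()
    for k in k_values
]
k_optimal = list(k_values)[int(np.argmax(scores))]
print("The best k for k-NN is k =", k_optimal)
```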
@@ -168,28 +180,19 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 15,
+   "execution_count": 6,
    "id": "70281897",
    "metadata": {},
    "outputs": [],
    "source": [
     "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # train/test split with 70% training and 30% testing\n",
     "\n",
-    "\n",
     "# Feature scaling\n",
     "scaler = StandardScaler()\n",
     "\n",
-    "X_train_scaled = pd.DataFrame(\n",
-    "    scaler.fit_transform(X_train),\n",
-    "    columns=X_train.columns,\n",
-    "    index=X_train.index\n",
-    ")\n",
+    "X_train_scaled = scaler.fit_transform(X_train)\n",
     "\n",
-    "X_test_scaled = pd.DataFrame(\n",
-    "    scaler.transform(X_test),\n",
-    "    columns=X_test.columns,\n",
-    "    index=X_test.index\n",
-    ")"
+    "X_test_scaled = scaler.transform(X_test)"
    ]
   },
   {
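This hunk simplifies the scaling cell from DataFrame-wrapped output to plain arrays while keeping the key pattern: the scaler is fit on the training split only, then reused on the test split. A minimal sketch on synthetic stand-in data (`X` and `y` are placeholders, not the notebook's dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(116, 9))      # stand-in features
y = rng.integers(0, 2, size=116)   # stand-in labels

# 70% training / 30% testing, as in the notebook.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
```

Fitting the scaler only on the training split avoids leaking test-set statistics into the model.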
@@ -202,7 +205,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 16,
+   "execution_count": 7,
    "id": "064a5aa7",
    "metadata": {},
    "outputs": [
@@ -247,7 +250,6 @@
     }
    ],
    "source": [
-    "\n",
     "knn = KNeighborsClassifier(n_neighbors=k_optimal) # using the best k found earlier\n",
     "knn.fit(X_train_scaled, y_train)\n",
     "\n",
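The cell touched here fits the classifier with the previously found k and predicts on the scaled test set. A self-contained sketch, assuming `k_optimal = 23` as the commentary reports, on synthetic stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(116, 9))      # stand-in features
y = rng.integers(0, 2, size=116)   # stand-in labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

k_optimal = 23  # assumed: the value the notebook's cross-validation reports
knn = KNeighborsClassifier(n_neighbors=k_optimal)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)
```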
@@ -295,12 +297,12 @@
     ]
    },
    {
-    "cell_type": "code",
-    "execution_count": null,
-    "id": "c158b385",
+    "cell_type": "markdown",
+    "id": "9bf7ed62",
     "metadata": {},
-    "outputs": [],
-    "source": []
+    "source": [
+     "In this optimized k-NN classification, we aim to maximize recall while maintaining good accuracy, in order to minimize the number of misclassifications, particularly cases where a sick patient is incorrectly predicted as healthy. We achieve this goal with a recall of 89% and an accuracy of 80%."
+    ]
   }
  ],
  "metadata": {
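The concluding markdown cell reports recall and accuracy. A hedged sketch of how those two scores are computed with sklearn.metrics — the label vectors below are tiny made-up examples, so the scores will not match the notebook's reported 89% / 80%:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up true labels and predictions (1 = sick, 0 = healthy).
y_test = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 1, 0, 1])

recall = recall_score(y_test, y_pred)      # fraction of sick patients flagged
accuracy = accuracy_score(y_test, y_pred)  # fraction of correct predictions
print(f"recall={recall:.3f}, accuracy={accuracy:.3f}")
```

Maximizing recall prioritizes avoiding the costliest error here: predicting a sick patient as healthy.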