diff --git a/README.md b/README.md
index 8c883cf..f3283a2 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,62 @@
-# breast-cancer-detection
-Binary classification of breast cancer based on biomedical markers using logistic regression and statistical learning techniques.
+# 🩺 Early Breast Cancer Detection using Blood Biomarkers
+
+**Université Paris Dauphine - PSL**
+Statistical Learning Project — Academic Year 2024–2025
+**Supervisors:** Prof. Gabriel Turinici, Dr. Laetitia Comminges
+
+## Project Objective
+
+The goal of this project is to predict the presence of breast cancer using blood-based biomarkers. Several supervised classification models are compared, with a strong focus on **recall** to reflect the clinical need to minimize false negatives (i.e., undiagnosed patients).
+
+## Dataset
+
+- **Source:** Breast Cancer Coimbra Dataset (116 patients, 9 biomarkers)
+- **Target variable:** `Classification` (0 = healthy, 1 = cancer)
+- **Preprocessing steps:**
+  - Logarithmic transformation of skewed variables
+  - Standardization (Z-score)
+  - Stratified train/test split (92/24)
+
+## Models Compared
+
+| Model                       | Recall | F1-score | AUC   |
+|-----------------------------|--------|----------|-------|
+| k-Nearest Neighbors (k=23)  | 0.92   | 0.788    | 0.88  |
+| Neural Network (MLP)        | 0.92   | 0.69     | 0.83  |
+| Logistic Regression (L2)    | 0.69   | 0.75     | 0.79  |
+| Gaussian Naïve Bayes        | 0.58   | 0.68     | 0.72  |
+
+**Best model for clinical usage (recall priority):** k-NN (k = 23)
+
+## Repository Structure
+
+- `eda_analysis.ipynb` – Data exploration, visualization, and preprocessing
+- `logistic_regression.ipynb` – Logistic regression (basic and optimized via GridSearchCV)
+- `knn.ipynb` – k-Nearest Neighbors with cross-validation and performance tuning
+- `neural_network.ipynb` – Feedforward neural network (MLPClassifier)
+- `naive_bayes.ipynb` – Gaussian Naïve Bayes with log-transformed inputs
+- `svm.ipynb` – Preliminary experiments with SVM (bonus, not included in the final report)
+- `Subject_3_Ouabdesselam_Forest_Durousseau_Danjou_vonSiemens.pdf` – Final report (comprehensive analysis and conclusions)
+- `README.md` – This file
+
+## Evaluation Metrics
+
+- **Recall:** prioritized (to avoid false negatives)
+- **F1-score:** balances precision and recall
+- **ROC & AUC:** overall discriminative ability
+
+## Clinical Recommendation
+
+Assuming false positives are acceptable to avoid missing cancer cases, the k-NN (k = 23) model is preferred. It offers the best compromise between recall and F1-score, and reliably identifies patients at risk.
+
+## Authors
+
+Erwan Ouabdesselam
+Antonin Durousseau
+Moritz von Siemens
+Arthur Danjou
+Thaïs Forest
+
+---
+
+Project completed as part of the Statistical Learning course in the Master’s program in Applied Mathematics at Université Paris Dauphine - PSL.
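
Note on the metrics named in the README: recall and F1-score can be computed from the counts of true positives, false positives, and false negatives. The sketch below uses only the Python standard library. The function names and the label vectors are hypothetical, made up for illustration; they are not taken from the project's notebooks and do not reproduce any figure in the results table.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # cancer case correctly flagged
            else:
                fp += 1  # healthy patient flagged (false alarm)
        else:
            if truth == positive:
                fn += 1  # cancer case missed (the costly error here)
            else:
                tn += 1  # healthy patient correctly cleared
    return tp, fp, fn, tn


def recall_precision_f1(y_true, y_pred, positive=1):
    """Compute recall, precision, and F1 for the positive class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1


# Made-up labels for illustration only (1 = cancer, 0 = healthy).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

recall, precision, f1 = recall_precision_f1(y_true, y_pred)
print(f"recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
```

Because recall divides by `tp + fn`, each missed cancer case lowers it directly, while false alarms only affect precision and, through it, F1. That is why a recall-first comparison can prefer a model with a lower F1-score.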