# 🩺 Early Breast Cancer Detection using Blood Biomarkers
**Université Paris Dauphine - PSL**
Statistical Learning Project, Academic Year 2024-2025
**Supervisors:** Prof. Gabriel Turinici, Dr. Laetitia Comminges
## Project Objective
The goal of this project is to predict the presence of breast cancer using blood-based biomarkers. Several supervised classification models are compared, with a strong focus on **recall** to reflect the clinical need to minimize false negatives (i.e., undiagnosed patients).
## Dataset
- **Source:** Breast Cancer Coimbra Dataset (116 patients, 9 biomarkers)
- **Target variable:** `Classification` (0 = healthy, 1 = cancer)
- **Preprocessing steps:**
  - Logarithmic transformation of skewed variables
  - Standardization (Z-score)
  - Stratified train/test split (92/24)
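
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual code: the data are synthetic stand-ins, the class counts and `skewed_cols` indices are assumptions, and fitting the scaler on the training split only is a design choice made here to avoid leaking test-set statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real table: 116 rows, 9 positive-valued features.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=1.0, sigma=0.8, size=(116, 9))
y = np.array([0] * 52 + [1] * 64)  # class counts assumed for illustration

# Log-transform the columns treated as skewed (hypothetical indices).
skewed_cols = [2, 3, 4]
X_log = X.copy()
X_log[:, skewed_cols] = np.log(X_log[:, skewed_cols])

# Stratified split that keeps class proportions: 92 training rows, 24 test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X_log, y, test_size=24, stratify=y, random_state=0
)

# Z-score standardization: fit on the training rows, apply to both splits.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
```

After this, `X_train_std` has zero mean and unit variance per column, while `X_test_std` is scaled with the same training statistics.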
## Models Compared
| Model | Recall | F1-score | AUC |
|-----------------------------|--------|----------|-------|
| k-Nearest Neighbors (k=23) | 0.92 | 0.788 | 0.88 |
| Neural Network (MLP) | 0.92 | 0.69 | 0.83 |
| Logistic Regression (L2) | 0.69 | 0.75 | 0.79 |
| Gaussian Naïve Bayes | 0.58 | 0.68 | 0.72 |
**Best model for clinical usage (recall priority):** k-NN (k = 23)
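
As a hedged sketch of how a value of k might be selected with recall as the criterion, the snippet below runs a cross-validated grid search over odd k values. The synthetic data, the grid range, the 5-fold setup, and the use of `scoring="recall"` are all assumptions made for illustration; the tuning actually performed is in `knn.ipynb`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 92-row training split with 9 features.
X_train, y_train = make_classification(
    n_samples=92, n_features=9, n_informative=5, random_state=0
)

# Scaling sits inside the pipeline so each CV fold is standardized
# using only its own training portion.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Odd values of k avoid tied votes in binary classification.
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 32, 2))}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="recall", cv=cv)
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print(f"best k = {best_k}, mean CV recall = {search.best_score_:.2f}")
```

Optimizing recall alone can favor a model that over-predicts the positive class, so checking F1-score and AUC for the selected k is a reasonable sanity check.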
## Repository Structure
- `eda_analysis.ipynb`: Data exploration, visualization, and preprocessing
- `logistic_regression.ipynb`: Logistic regression (basic and optimized via GridSearchCV)
- `knn.ipynb`: k-Nearest Neighbors with cross-validation and performance tuning
- `neural_network.ipynb`: Feedforward neural network (MLPClassifier)
- `naive_bayes.ipynb`: Gaussian Naïve Bayes with log-transformed inputs
- `svm.ipynb`: Preliminary experiments with SVM (bonus, not included in the final report)
- `Subject_3_Ouabdesselam_Forest_Durousseau_Danjou_vonSiemens.pdf`: Final report (comprehensive analysis and conclusions)
- `README.md`: This file
## Evaluation Metrics
- **Recall:** prioritized (to avoid false negatives)
- **F1-score:** balances precision and recall
- **ROC & AUC:** overall discriminative ability
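
To make the three metrics concrete, here is a small dependency-free sketch that computes them on a toy example. The labels, scores, and the 0.5 decision threshold are made up for illustration and are not project results.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = cancer."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def recall(y_true, y_pred):
    """Share of true cancer cases that are detected: TP / (TP + FN)."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if tp + fn else 0.0


def precision(y_true, y_pred):
    """Share of positive predictions that are correct: TP / (TP + FP)."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if tp + fp else 0.0


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall."""
    p, r = precision(y_true, y_pred), recall(y_true, y_pred)
    return 2 * p * r / (p + r) if p + r else 0.0


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative
    (ties count as one half); assumes both classes are present."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 4 cancer cases, 4 healthy controls, hypothetical model scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(recall(y_true, y_pred))    # 0.75: one cancer case is missed
print(f1_score(y_true, y_pred))  # 0.75
print(auc(y_true, scores))       # 0.8125
```

The missed case (score 0.3) is the false negative that a recall-focused evaluation penalizes; AUC, by contrast, does not depend on the chosen threshold.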
## Clinical Recommendation
Assuming false positives are an acceptable price for not missing cancer cases, the k-NN (k = 23) model is preferred. It ties with the MLP for the highest recall (0.92) and has the highest F1-score (0.788) and AUC (0.88) of the four models compared.
## Authors
- Erwan Ouabdesselam
- Antonin Durousseau
- Moritz von Siemens
- Arthur Danjou
- Thaïs Forest
---
Project completed as part of the Statistical Learning course in the Master's program in Applied Mathematics at Université Paris Dauphine - PSL.