
🩺 Early Breast Cancer Detection using Blood Biomarkers

Université Paris Dauphine - PSL
Statistical Learning Project, Academic Year 2024-2025
Supervisors: Prof. Gabriel Turinici, Dr. Laetitia Comminges

Project Objective

The goal of this project is to predict the presence of breast cancer using blood-based biomarkers. Several supervised classification models are compared, with a strong focus on recall to reflect the clinical need to minimize false negatives (i.e., undiagnosed patients).

Dataset

  • Source: Breast Cancer Coimbra Dataset (116 patients, 9 biomarkers)
  • Target variable: Classification (0 = healthy, 1 = cancer)
  • Preprocessing steps:
    • Logarithmic transformation of skewed variables
    • Standardization (Z-score)
    • Stratified train/test split (92 training / 24 test patients)
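
The preprocessing steps above could be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: the data is a random placeholder with the dataset's shape, the column names and the skewness threshold of 1.0 are hypothetical, and the choice to split first and fit the scaler on the training set only is an assumption made here to avoid leaking test information.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder data with the dataset's shape (116 patients, 9 biomarkers).
# Column names are hypothetical; replace with the real Coimbra table.
X = pd.DataFrame(
    rng.lognormal(mean=1.0, sigma=0.8, size=(116, 9)),
    columns=[f"biomarker_{i}" for i in range(9)],
)
y = np.arange(116) % 2  # placeholder labels: 0 = healthy, 1 = cancer

# Stratified split: 24 test patients, 92 training patients.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=24, stratify=y, random_state=0
)
X_train, X_test = X_train.copy(), X_test.copy()

# Log-transform columns that look right-skewed on the training set.
# The threshold 1.0 is an illustrative assumption.
skewness = X_train.skew()
skewed_columns = list(skewness[skewness > 1.0].index)
for col in skewed_columns:
    X_train[col] = np.log(X_train[col])
    X_test[col] = np.log(X_test[col])

# Z-score standardization, fitted on the training set only.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.shape, X_test_std.shape)
```

Fitting the scaler on the training split and reusing it on the test split keeps the test patients out of the estimated means and standard deviations.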

Models Compared

| Model | Recall | F1-score | AUC |
|---|---|---|---|
| k-Nearest Neighbors (k=23) | 0.92 | 0.788 | 0.88 |
| Neural Network (MLP) | 0.92 | 0.69 | 0.83 |
| Logistic Regression (L2) | 0.69 | 0.75 | 0.79 |
| Gaussian Naïve Bayes | 0.58 | 0.68 | 0.72 |

Best model for clinical usage (recall priority): k-NN (k = 23)
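
One way a value such as k = 23 can be chosen with recall as the priority is a cross-validated grid search scored on recall. The sketch below is an assumption about the general approach, not the exact procedure in knn.ipynb: it uses synthetic data from `make_classification`, a hypothetical grid of odd k values from 1 to 31, and 5-fold stratified cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the dataset (116 x 9).
X, y = make_classification(
    n_samples=116, n_features=9, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=24, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so each CV fold is standardized
# using its own training portion only.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Hypothetical grid: odd k values keep binary votes free of ties.
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 32, 2))}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# scoring="recall" selects the k with the best mean recall across folds.
search = GridSearchCV(pipe, param_grid, scoring="recall", cv=cv)
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
test_recall = recall_score(y_test, search.predict(X_test))
print(f"best k = {best_k}, test recall = {test_recall:.2f}")
```

On the synthetic data the selected k will generally differ from 23; the point is only the selection mechanism.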

Repository Structure

  • eda_analysis.ipynb: Data exploration, visualization, and preprocessing
  • logistic_regression.ipynb: Logistic regression (basic and optimized via GridSearchCV)
  • knn.ipynb: k-Nearest Neighbors with cross-validation and performance tuning
  • neural_network.ipynb: Feedforward neural network (MLPClassifier)
  • naive_bayes.ipynb: Gaussian Naïve Bayes with log-transformed inputs
  • svm.ipynb: Preliminary experiments with SVM (bonus, not included in the final report)
  • Subject_3_Ouabdesselam_Forest_Durousseau_Danjou_vonSiemens.pdf: Final report (comprehensive analysis and conclusions)
  • README.md: This file
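
As a rough illustration of what the model notebooks listed above do, the following sketch fits the four model families on one shared split and collects the same three metrics. It is not the notebooks' code: the data is synthetic, and the hyperparameters (hidden layer size, iteration limit, C) are placeholder assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 116 x 9 biomarker table.
X, y = make_classification(
    n_samples=116, n_features=9, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=24, stratify=y, random_state=0
)

# Hyperparameters other than k=23 are illustrative assumptions.
models = {
    "k-NN (k=23)": KNeighborsClassifier(n_neighbors=23),
    "MLP": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    # LogisticRegression applies an L2 penalty by default.
    "Logistic Regression (L2)": LogisticRegression(C=1.0),
    "Gaussian NB": GaussianNB(),
}

results = {}
for name, model in models.items():
    clf = make_pipeline(StandardScaler(), model).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_score = clf.predict_proba(X_test)[:, 1]  # probability of class 1
    results[name] = {
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_score),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```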

Evaluation Metrics

  • Recall: prioritized (to avoid false negatives)
  • F1-score: balances precision and recall
  • ROC & AUC: overall discriminative ability
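
To make the three metrics concrete, here is a small worked example on eight hand-written toy predictions (not project data), computed with scikit-learn. With 3 true positives, 1 false negative, and 1 false positive, recall and precision are both 3/4, so the F1-score is also 0.75; 15 of the 16 positive/negative score pairs are ordered correctly, giving an AUC of 0.9375.

```python
from sklearn.metrics import confusion_matrix, f1_score, recall_score, roc_auc_score

# Toy example: 4 cancer cases (1) and 4 healthy cases (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]  # hard class predictions
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]  # predicted P(cancer)

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = recall_score(y_true, y_pred)  # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)   # uses scores, not hard predictions

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"recall={recall:.2f} f1={f1:.2f} auc={auc:.4f}")
# recall=0.75 f1=0.75 auc=0.9375
```

The one false negative (the missed cancer case) is what lowers recall, which is why this metric is the one prioritized here.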

Clinical Recommendation

Assuming false positives are acceptable in order to avoid missing cancer cases, the k-NN (k = 23) model is preferred. It matches the MLP for the highest recall (0.92) while also achieving the highest F1-score (0.788) and AUC (0.88) of the four models compared.
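
The selection rule behind this recommendation can be expressed as a sort on recall first and F1-score second. The snippet below is only an illustration of that rule using the values reported in the comparison table; the dictionary layout and variable names are hypothetical.

```python
# Metrics as reported in the model comparison table.
results = {
    "k-Nearest Neighbors (k=23)": {"recall": 0.92, "f1": 0.788, "auc": 0.88},
    "Neural Network (MLP)": {"recall": 0.92, "f1": 0.69, "auc": 0.83},
    "Logistic Regression (L2)": {"recall": 0.69, "f1": 0.75, "auc": 0.79},
    "Gaussian Naïve Bayes": {"recall": 0.58, "f1": 0.68, "auc": 0.72},
}

# Rank by recall (clinical priority), breaking ties with the F1-score.
ranked = sorted(
    results,
    key=lambda name: (results[name]["recall"], results[name]["f1"]),
    reverse=True,
)

for name in ranked:
    print(name, results[name])
# The first entry printed is "k-Nearest Neighbors (k=23)".
```

k-NN and the MLP tie on recall, so the F1-score decides between them.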

Authors

Erwan Ouabdesselam
Antonin Durousseau
Moritz von Siemens
Arthur Danjou
Thaïs Forest


Project completed as part of the Statistical Learning course in the Master's program in Applied Mathematics at Université Paris Dauphine - PSL.
