Add R script for package management in Classification and Regression module

- Created a new R script 'packages.R' to manage necessary packages for the Classification and Regression module.
- Included a list of required packages and a function to install any missing packages.
- Implemented loading of all packages and added a success message upon completion.
2026-03-16 03:11:46 +01:00 · 2026-03-02 09:34:25 +01:00
parent d3d56fd6ab
commit 8b72b281f9
5 changed files with 1258385 additions and 0 deletions
--- a/Regression/Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.html
+++ b/Regression/Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.html
--- a/Regression/Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd
+++ b/Regression/Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd
--- a/Regression/README.md
+++ b/Regression/README.md
@@ -0,0 +1,310 @@
+# Implied Volatility Prediction from Options Data
+
+[![R](https://img.shields.io/badge/R-4.0+-276DC3.svg)](https://www.r-project.org/)
+[![Course](https://img.shields.io/badge/Course-Classification%20%26%20Regression-orange.svg)]()
+[![License](https://img.shields.io/badge/License-Academic-blue.svg)]()
+
+> **M2 Master's Project** – Predicting implied volatility using advanced regression techniques and machine learning models on financial options data.
+
+This project explores the prediction of **implied volatility** from options market data, combining classical statistical methods with modern machine learning approaches. The analysis covers data preprocessing, feature engineering, model benchmarking, and interpretability analysis using real-world financial panel data.
+
+---
+
+## 📋 Project Overview
+
+### Problem Statement
+
+Implied volatility represents the market's forward-looking expectation of an asset's future volatility. Accurate prediction is crucial for:
+- **Option pricing** and valuation
+- **Risk management** and hedging strategies
+- **Trading strategies** based on volatility arbitrage
+
+### Dataset
+
+The project uses a comprehensive panel dataset tracking **3,887 assets** across **544 observation dates** (2019-2022):
+
+| File | Description | Shape |
+|------|-------------|-------|
+| `Train_ISF.csv` | Training data with target variable | 1,909,465 rows × 21 columns |
+| `Test_ISF.csv` | Test data for prediction | 1,251,308 rows × 18 columns |
+| `hat_y.csv` | Final predictions from both models | 1,251,308 rows × 2 columns |
+
+### Key Variables
+
+**Target Variable:**
+- `implied_vol_ref` – The implied volatility to predict
+
+**Feature Categories:**
+- **Identifiers:** `asset_id`, `obs_date`
+- **Market Activity:** `call_volume`, `put_volume`, `call_oi`, `put_oi`, `total_contracts`
+- **Volatility Metrics:** `realized_vol_short`, `realized_vol_mid1-3`, `realized_vol_long1-4`, `market_vol_index`
+- **Option Structure:** `strike_dispersion`, `maturity_count`
+
+---
+
+## 🏗️ Methodology
+
+### Data Pipeline
+
+```
+Raw Data
+    ↓
+┌─────────────────────────────────────────────────────────┐
+│  Data Splitting (Chronological 80/20)                  │
+│  - Training: 2019-10 to 2021-07                         │
+│  - Validation: 2021-07 to 2022-03                     │
+└─────────────────────────────────────────────────────────┘
+    ↓
+┌─────────────────────────────────────────────────────────┐
+│  Feature Engineering                                   │
+│  - Aggregation of volatility horizons                 │
+│  - Creation of financial indicators                   │
+└─────────────────────────────────────────────────────────┘
+    ↓
+┌─────────────────────────────────────────────────────────┐
+│  Data Preprocessing (tidymodels)                       │
+│  - Winsorization (99.5th percentile)                  │
+│  - Log/Yeo-Johnson transformations                    │
+│  - Z-score normalization                              │
+│  - PCA (95% variance retention)                       │
+└─────────────────────────────────────────────────────────┘
+    ↓
+Three Datasets Generated:
+├── Tree-based (raw, scale-invariant)
+├── Linear (normalized, winsorized)
+└── PCA (dimensionality-reduced)
+```
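+
+The preprocessing stage can be illustrated with a minimal base R sketch. It is not the project's tidymodels recipe: the toy data, the upper-tail-only capping, the use of `log1p`, and the order of the steps are all assumptions made for illustration. The point it shows is that every statistic (cap, mean, standard deviation) is estimated on the training split and then reused on the validation split.
+
+```r
+# Toy heavy-tailed feature (hypothetical data, not the project's)
+set.seed(2025)
+train_x <- c(rexp(1000), 250)   # one extreme outlier
+valid_x <- rexp(200)
+
+# Winsorization: cap at the 99.5th percentile of the training split
+# (assumption: only the upper tail is capped)
+cap <- unname(quantile(train_x, 0.995))
+winsorize <- function(x, upper) pmin(x, upper)
+
+# Log transform for a non-negative feature; a Yeo-Johnson transform
+# would be the alternative for features that can be negative
+train_t <- log1p(winsorize(train_x, cap))
+valid_t <- log1p(winsorize(valid_x, cap))
+
+# Z-score normalization with training statistics only
+mu <- mean(train_t)
+sigma <- sd(train_t)
+train_z <- (train_t - mu) / sigma
+valid_z <- (valid_t - mu) / sigma
+
+# PCA keeping the smallest number of components that reach 95% variance
+X <- cbind(
+  a = train_z,
+  b = train_z + rnorm(length(train_z), sd = 0.1),
+  c = rnorm(length(train_z))
+)
+pca <- prcomp(X, center = TRUE, scale. = TRUE)
+cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
+n_comp <- which(cum_var >= 0.95)[1]
+```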
+
+### Feature Engineering
+
+New financial indicators created to capture market dynamics:
+
+| Feature | Description | Formula |
+|---------|-------------|---------|
+| `pulse_ratio` | Volatility trend direction | RV_short / RV_long |
+| `stress_spread` | Asset vs market stress | RV_short - Market_VIX |
+| `put_call_ratio_volume` | Immediate market stress | Put_Volume / Call_Volume |
+| `put_call_ratio_oi` | Long-term risk structure | Put_OI / Call_OI |
+| `liquidity_ratio` | Market depth | Total_Volume / Total_OI |
+| `option_dispersion` | Market uncertainty | Strike_Dispersion / Total_Contracts |
+| `put_low_strike` | Downside protection density | Strike_Dispersion / Put_OI |
+| `put_proportion` | Hedging vs speculation | Put_Volume / Total_Volume |
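+
+The formulas in the table translate directly into column arithmetic. The sketch below uses two made-up rows and the column names listed under Key Variables. It assumes that `RV_long` is the row mean of the four long-horizon columns, that `Total_Volume` is call plus put volume, and that `Total_OI` is call plus put open interest; the README does not spell out these aggregations.
+
+```r
+# Two hypothetical observations (values are illustrative only)
+opts <- data.frame(
+  realized_vol_short = c(22, 35),
+  realized_vol_long1 = c(20, 25),
+  realized_vol_long2 = c(18, 27),
+  realized_vol_long3 = c(21, 26),
+  realized_vol_long4 = c(21, 22),
+  market_vol_index   = c(19, 30),
+  call_volume        = c(400, 100),
+  put_volume         = c(100, 300),
+  call_oi            = c(2000, 1000),
+  put_oi             = c(1000, 1500),
+  strike_dispersion  = c(12, 8),
+  total_contracts    = c(60, 40)
+)
+
+# Assumed aggregations (not confirmed by the README)
+rv_long      <- rowMeans(opts[, paste0("realized_vol_long", 1:4)])
+total_volume <- opts$call_volume + opts$put_volume
+total_oi     <- opts$call_oi + opts$put_oi
+
+# One column per row of the feature table
+features <- data.frame(
+  pulse_ratio           = opts$realized_vol_short / rv_long,
+  stress_spread         = opts$realized_vol_short - opts$market_vol_index,
+  put_call_ratio_volume = opts$put_volume / opts$call_volume,
+  put_call_ratio_oi     = opts$put_oi / opts$call_oi,
+  liquidity_ratio       = total_volume / total_oi,
+  option_dispersion     = opts$strike_dispersion / opts$total_contracts,
+  put_low_strike        = opts$strike_dispersion / opts$put_oi,
+  put_proportion        = opts$put_volume / total_volume
+)
+print(features)
+```
+
+Rows with a zero denominator (for example no call volume) would yield `Inf` or `NaN` here, so real data would need a guard or an imputation step before modeling.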
+
+---
+
+## 🤖 Models Implemented
+
+### Linear Models
+
+| Model | Description | Best RMSE |
+|-------|-------------|-----------|
+| **OLS** | Ordinary Least Squares | 11.26 |
+| **Ridge** | L2 regularization | 12.48 |
+| **Lasso** | L1 regularization (variable selection) | 12.03 |
+| **Elastic Net** | L1 + L2 combined | ~12.03 |
+| **PLS** | Partial Least Squares (on PCA) | 12.79 |
+
+### Linear Mixed-Effects Models (LMM)
+
+Advanced panel data models accounting for asset-specific effects:
+
+| Model | Features | RMSE |
+|-------|----------|------|
+| LMM Baseline | All variables + Random Intercept | 8.77 |
+| LMM Reduced | Collinearity removal | ~8.77 |
+| LMM Interactions | Financial interaction terms | ~8.77 |
+| LMM + Quadratic | Convexity terms (vol of vol) | 8.41 |
+| **LMM + Random Slopes (mod_lmm_5)** | Asset-specific betas | **8.10** ⭐ |
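+
+The progression from random intercept to quadratic terms to random slopes can be sketched with `lme4::lmer` on simulated panel data. This is not the project's `mod_lmm_5`: the single predictor `rv_z`, the simulated coefficients, and the model formulas are assumptions chosen to show the syntax, and the fit is skipped when `lme4` is not installed.
+
+```r
+# Simulated panel: 30 hypothetical assets, 40 observations each
+set.seed(2025)
+n_assets <- 30
+n_obs <- 40
+idx <- rep(seq_len(n_assets), each = n_obs)
+rv_z <- rnorm(n_assets * n_obs)        # standardized realized volatility
+a_i <- rnorm(n_assets, sd = 3)         # asset-specific intercept shifts
+b_i <- rnorm(n_assets, sd = 1.5)       # asset-specific slope shifts
+panel <- data.frame(
+  asset_id = factor(idx),
+  rv_z = rv_z,
+  iv = 30 + a_i[idx] + (6 + b_i[idx]) * rv_z + 0.5 * rv_z^2 +
+    rnorm(length(idx), sd = 2)
+)
+
+# In-sample RMSE from the conditional residuals of a fitted model
+rmse <- function(model) sqrt(mean(residuals(model)^2))
+
+if (requireNamespace("lme4", quietly = TRUE)) {
+  # Random intercept per asset
+  m_intercept <- lme4::lmer(iv ~ rv_z + (1 | asset_id), data = panel)
+  # Adds a quadratic (convexity) term
+  m_quadratic <- lme4::lmer(iv ~ rv_z + I(rv_z^2) + (1 | asset_id), data = panel)
+  # Adds an asset-specific slope on rv_z
+  m_slopes <- lme4::lmer(iv ~ rv_z + I(rv_z^2) + (1 + rv_z | asset_id), data = panel)
+  print(c(
+    intercept = rmse(m_intercept),
+    quadratic = rmse(m_quadratic),
+    slopes = rmse(m_slopes)
+  ))
+}
+```
+
+The printed values are in-sample errors on simulated data and are not comparable to the validation RMSEs reported in the table.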
+
+### Tree-Based Models
+
+| Model | Strategy | Validation RMSE | Training RMSE |
+|-------|----------|-----------------|---------------|
+| **XGBoost** | Level-wise, Bayesian tuning | 10.70 | 0.57 |
+| **LightGBM** | Leaf-wise, feature regularization | **10.61** ⭐ | 10.90 |
+| Random Forest | Bagging | DNF* | - |
+
+*DNF: Did Not Finish (computational constraints)
+
+### Neural Networks
+
+| Model | Architecture | Status |
+|-------|--------------|--------|
+| MLP | 128-64 units, tanh activation | Failed to converge |
+
+---
+
+## 📊 Results Summary
+
+### Model Comparison
+
+```
+RMSE Performance (Lower is Better)
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+Linear Mixed-Effects (LMM5)     8.38 ████████████████████ Best Linear
+Linear Mixed-Effects (LMM4)     8.41 ███████████████████
+Linear Mixed-Effects (Baseline) 8.77 ██████████████████
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+LightGBM                       10.61 ███████████████ Best Non-Linear
+XGBoost                        10.70 ██████████████
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+OLS (with interactions)        11.26 █████████████
+OLS (baseline)                 12.01 ███████████
+Lasso                          12.03 ███████████
+Ridge                          12.48 ██████████
+PLS                            12.79 █████████
+━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+```
+
+### Key Findings
+
+1. **Best Linear Model:** LMM with Random Slopes (RMSE = 8.38)
+   - Captures asset-specific volatility sensitivities
+   - Includes quadratic terms for convexity effects
+
+2. **Best Non-Linear Model:** LightGBM (RMSE = 10.61)
+   - Superior generalization vs XGBoost
+   - Feature regularization prevents overfitting
+
+3. **Interpretability Insights (SHAP Analysis):**
+   - `realized_vol_mid` dominates (57% of gain)
+   - Volatility clustering confirmed as primary driver
+   - Non-linear regime switching in stress_spread
+
+---
+
+## 📁 Repository Structure
+
+```
+PROJECT/
+├── Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd    # Main analysis (Quarto)
+├── Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.html    # Rendered report
+├── packages.R                                         # R dependencies installer
+├── Train_ISF.csv                                      # Training data (~1.9M rows)
+├── Test_ISF.csv                                       # Test data (~1.25M rows)
+├── hat_y.csv                                          # Final predictions
+├── README.md                                          # This file
+└── results/
+    ├── lightgbm/                                      # LightGBM model outputs
+    └── xgboost/                                       # XGBoost model outputs
+```
+
+---
+
+## 🚀 Getting Started
+
+### Prerequisites
+
+- **R** ≥ 4.0
+- Required packages (auto-installed via `packages.R`)
+
+### Installation
+
+```r
+# Install all dependencies
+source("packages.R")
+```
+
+Or manually install key packages:
+
+```r
+install.packages(c(
+  "tidyverse", "tidymodels", "caret", "glmnet",
+  "lme4", "lmerTest", "xgboost", "lightgbm",
+  "ranger", "pls", "shapviz", "rBayesianOptimization"
+))
+```
+
+### Running the Analysis
+
+1. **Open the Quarto document:**
+   ```r
+   # In RStudio
+   rstudioapi::navigateToFile("Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd")
+   ```
+
+2. **Render the document:**
+   ```r
+   quarto::quarto_render("Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd")
+   ```
+
+3. **Or run specific sections interactively** using the code chunks in the `.qmd` file
+
+---
+
+## 🛠️ Technical Details
+
+### Data Split Strategy
+
+- **Chronological split** at 80th percentile of dates
+- Prevents look-ahead bias and data leakage
+- Training: ~1.53M observations
+- Validation: ~376K observations
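+
+A chronological split of this kind can be sketched in base R as follows. The toy panel is hypothetical, and two details are assumptions: the percentile is taken over distinct observation dates rather than over rows, and the boundary date is assigned to the training set.
+
+```r
+# Toy panel: 5 hypothetical assets observed on 100 consecutive days
+set.seed(2025)
+dates <- seq(as.Date("2019-10-01"), by = "day", length.out = 100)
+panel <- data.frame(
+  asset_id = rep(1:5, times = 100),
+  obs_date = rep(dates, each = 5),
+  implied_vol_ref = runif(500, 10, 60)
+)
+
+# Cutoff at the 80th percentile of the distinct dates
+unique_dates <- sort(unique(panel$obs_date))
+cutoff <- unique_dates[ceiling(0.8 * length(unique_dates))]
+
+# Everything up to the cutoff trains; everything after validates
+train <- panel[panel$obs_date <= cutoff, ]
+valid <- panel[panel$obs_date > cutoff, ]
+
+# No validation date precedes a training date, so nothing leaks backward
+stopifnot(max(train$obs_date) < min(valid$obs_date))
+```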
+
+### Hyperparameter Tuning
+
+- **Method:** Bayesian Optimization (Gaussian Processes)
+- **Acquisition:** Expected Improvement (UCB)
+- **Goal:** Maximize negative RMSE
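+
+The "maximize negative RMSE" goal means the objective handed to the optimizer returns `-RMSE`, so that a larger score is a better model. The sketch below shows such an objective for a toy ridge regression in base R. A plain grid search stands in for the Bayesian optimizer, and the data, the `log_lambda` parameter, and the bounds are hypothetical. The `list(Score, Pred)` return shape follows the convention that `rBayesianOptimization::BayesianOptimization()` expects from its `FUN` argument.
+
+```r
+# Toy regression data with a fixed train/validation split
+set.seed(2025)
+n <- 200
+X <- matrix(rnorm(n * 5), n, 5)
+y <- drop(X %*% c(2, -1, 0, 0, 0.5)) + rnorm(n)
+tr <- 1:150
+va <- 151:200
+
+# Objective: fit ridge in closed form, score by negative validation RMSE
+objective <- function(log_lambda) {
+  lambda <- exp(log_lambda)
+  beta <- solve(
+    crossprod(X[tr, ]) + lambda * diag(ncol(X)),
+    crossprod(X[tr, ], y[tr])
+  )
+  rmse <- sqrt(mean((y[va] - drop(X[va, ] %*% beta))^2))
+  list(Score = -rmse, Pred = 0)
+}
+
+# Stand-in for the optimizer: evaluate the objective on a coarse grid
+grid <- seq(-5, 5, by = 1)
+scores <- vapply(grid, function(l) objective(l)$Score, numeric(1))
+best_log_lambda <- grid[which.max(scores)]
+```
+
+A Bayesian optimizer replaces the grid with a sequence of points chosen by the acquisition function, which matters when each evaluation is as expensive as a boosted-tree fit.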
+
+### Evaluation Metric
+
+**Exponential RMSE**, that is, RMSE computed on the original scale after exponentiating the log-scale predictions:
+
+$$
+RMSE_{real} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \exp(\hat{y}_{\log, i}) - y_i \right)^2}
+$$
+
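+Models are trained on a log-transformed target for variance stabilization, so predictions are back-transformed with `exp` before being compared with the observed values.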
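+
+As a minimal sketch, the metric can be written as a small R function. The name `rmse_real` mirrors the formula's subscript and the sample values are made up.
+
+```r
+# RMSE on the original scale from log-scale predictions
+rmse_real <- function(pred_log, y) {
+  stopifnot(length(pred_log) == length(y))
+  sqrt(mean((exp(pred_log) - y)^2))
+}
+
+# Hypothetical observed values and log-scale predictions
+y <- c(20, 35, 50)
+pred_log <- log(c(22, 33, 50))
+
+# Errors on the original scale are 2, -2 and 0, so the result is sqrt(8 / 3)
+rmse_real(pred_log, y)
+```
+
+Exponentiating a log-scale prediction targets the conditional median rather than the mean of the original variable, which is worth keeping in mind when comparing against models trained directly on the original scale.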
+
+---
+
+## 📖 Key Concepts
+
+### Financial Theories Applied
+
+1. **Volatility Clustering** – Past volatility predicts future volatility
+2. **Variance Risk Premium** – Spread between implied and realized volatility
+3. **Fear Gauge** – Put-call ratio as sentiment indicator
+4. **Mean Reversion** – Volatility tends to return to long-term average
+5. **Liquidity Premium** – Illiquid assets command higher volatility
+
+### Statistical Methods
+
+- Panel data modeling with fixed and random effects
+- Principal Component Analysis (PCA)
+- Bayesian hyperparameter optimization
+- SHAP values for model interpretability
+
+---
+
+## 👥 Authors
+
+**Team:**
+- Arthur DANJOU
+- Camille LEGRAND  
+- Axelle MERIC
+- Moritz VON SIEMENS
+
+**Course:** Classification and Regression (M2)
+**Academic Year:** 2025-2026
+
+---
+
+## 📝 Notes
+
+- **Computational Constraints:** Some models (Random Forest, MLP) failed due to hardware limitations (16GB RAM, CPU-only)
+- **Reproducibility:** Set `seed = 2025` for consistent results
+- **Language:** Analysis documented in English, course materials in French
+
+---
+
+## 📚 References
+
+Key R packages used:
+- `tidymodels` – Modern modeling framework
+- `glmnet` – Regularized regression
+- `lme4` / `lmerTest` – Mixed-effects models
+- `xgboost` / `lightgbm` – Gradient boosting
+- `shapviz` – Model interpretability
+- `rBayesianOptimization` – Hyperparameter tuning
--- a/Regression/hat_y.csv
+++ b/Regression/hat_y.csv
--- a/Regression/packages.R
+++ b/Regression/packages.R
@@ -0,0 +1,44 @@
+# List of required packages
+packages <- c(
+  "tidyverse",
+  "rsample",
+  "scales",
+  "dplyr",
+  "tidyr",
+  "glue",
+  "corrplot",
+  "ggfortify",
+  "carData",
+  "car",
+  "MASS",
+  "ggplot2",
+  "DataExplorer",
+  "skimr",
+  "plotly",
+  "gridExtra",
+  "grid",
+  "rlang",
+  "caret",
+  "reshape2",
+  "class",
+  "ROCR",
+  "randomForest",
+  "fitdistrplus",
+  "hexbin",
+  "paletteer"
+)
+
+# Install a package only if it is not already available
+install_if_missing <- function(p) {
+  # requireNamespace() checks availability without attaching the package
+  if (!requireNamespace(p, quietly = TRUE)) {
+    install.packages(p, dependencies = TRUE)
+  }
+}
+
+# Apply the function to the whole list
+invisible(sapply(packages, install_if_missing))
+
+# Load all libraries
+invisible(lapply(packages, library, character.only = TRUE))
+
+message("All packages were installed and loaded successfully!")