Add R script for package management in Classification and Regression module

- Created a new R script 'packages.R' to manage necessary packages for the Classification and Regression module.
- Included a list of required packages and a function to install any missing packages.
- Implemented loading of all packages and added a success message upon completion.
This commit is contained in:
2026-03-02 09:34:25 +01:00
parent d3d56fd6ab
commit 8b72b281f9
5 changed files with 1258385 additions and 0 deletions

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,310 @@
# Implied Volatility Prediction from Options Data
[![R](https://img.shields.io/badge/R-4.0+-276DC3.svg)](https://www.r-project.org/)
[![Course](https://img.shields.io/badge/Course-Classification%20%26%20Regression-orange.svg)]()
[![License](https://img.shields.io/badge/License-Academic-blue.svg)]()
> **M2 Master's Project** Predicting implied volatility using advanced regression techniques and machine learning models on financial options data.
This project explores the prediction of **implied volatility** from options market data, combining classical statistical methods with modern machine learning approaches. The analysis covers data preprocessing, feature engineering, model benchmarking, and interpretability analysis using real-world financial panel data.
---
## 📋 Project Overview
### Problem Statement
Implied volatility represents the market's forward-looking expectation of an asset's future volatility. Accurate prediction is crucial for:
- **Option pricing** and valuation
- **Risk management** and hedging strategies
- **Trading strategies** based on volatility arbitrage
### Dataset
The project uses a comprehensive panel dataset tracking **3,887 assets** across **544 observation dates** (2019-2022):
| File | Description | Shape |
|------|-------------|-------|
| `Train_ISF.csv` | Training data with target variable | 1,909,465 rows × 21 columns |
| `Test_ISF.csv` | Test data for prediction | 1,251,308 rows × 18 columns |
| `hat_y.csv` | Final predictions from both models | 1,251,308 rows × 2 columns |
### Key Variables
**Target Variable:**
- `implied_vol_ref` The implied volatility to predict
**Feature Categories:**
- **Identifiers:** `asset_id`, `obs_date`
- **Market Activity:** `call_volume`, `put_volume`, `call_oi`, `put_oi`, `total_contracts`
- **Volatility Metrics:** `realized_vol_short`, `realized_vol_mid1-3`, `realized_vol_long1-4`, `market_vol_index`
- **Option Structure:** `strike_dispersion`, `maturity_count`
---
## 🏗️ Methodology
### Data Pipeline
```
Raw Data
┌─────────────────────────────────────────────────────────┐
│ Data Splitting (Chronological 80/20) │
│ - Training: 2019-10 to 2021-07 │
│ - Validation: 2021-07 to 2022-03 │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Feature Engineering │
│ - Aggregation of volatility horizons │
│ - Creation of financial indicators │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Data Preprocessing (tidymodels) │
│ - Winsorization (99.5th percentile) │
│ - Log/Yeo-Johnson transformations │
│ - Z-score normalization │
│ - PCA (95% variance retention) │
└─────────────────────────────────────────────────────────┘
Three Datasets Generated:
├── Tree-based (raw, scale-invariant)
├── Linear (normalized, winsorized)
└── PCA (dimensionality-reduced)
```
### Feature Engineering
New financial indicators created to capture market dynamics:
| Feature | Description | Formula |
|---------|-------------|---------|
| `pulse_ratio` | Volatility trend direction | RV_short / RV_long |
| `stress_spread` | Asset vs market stress | RV_short - Market_VIX |
| `put_call_ratio_volume` | Immediate market stress | Put_Volume / Call_Volume |
| `put_call_ratio_oi` | Long-term risk structure | Put_OI / Call_OI |
| `liquidity_ratio` | Market depth | Total_Volume / Total_OI |
| `option_dispersion` | Market uncertainty | Strike_Dispersion / Total_Contracts |
| `put_low_strike` | Downside protection density | Strike_Dispersion / Put_OI |
| `put_proportion` | Hedging vs speculation | Put_Volume / Total_Volume |
---
## 🤖 Models Implemented
### Linear Models
| Model | Description | Best RMSE |
|-------|-------------|-----------|
| **OLS** | Ordinary Least Squares | 11.26 |
| **Ridge** | L2 regularization | 12.48 |
| **Lasso** | L1 regularization (variable selection) | 12.03 |
| **Elastic Net** | L1 + L2 combined | ~12.03 |
| **PLS** | Partial Least Squares (on PCA) | 12.79 |
### Linear Mixed-Effects Models (LMM)
Advanced panel data models accounting for asset-specific effects:
| Model | Features | RMSE |
|-------|----------|------|
| LMM Baseline | All variables + Random Intercept | 8.77 |
| LMM Reduced | Collinearity removal | ~8.77 |
| LMM Interactions | Financial interaction terms | ~8.77 |
| LMM + Quadratic | Convexity terms (vol of vol) | 8.41 |
| **LMM + Random Slopes (mod_lmm_5)** | Asset-specific betas | **8.10** ⭐ |
### Tree-Based Models
| Model | Strategy | Validation RMSE | Training RMSE |
|-------|----------|-----------------|---------------|
| **XGBoost** | Level-wise, Bayesian tuning | 10.70 | 0.57 |
| **LightGBM** | Leaf-wise, feature regularization | **10.61** ⭐ | 10.90 |
| Random Forest | Bagging | DNF* | - |
*DNF: Did Not Finish (computational constraints)
### Neural Networks
| Model | Architecture | Status |
|-------|--------------|--------|
| MLP | 128-64 units, tanh activation | Failed to converge |
---
## 📊 Results Summary
### Model Comparison
```
RMSE Performance (Lower is Better)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Linear Mixed-Effects (LMM5) 8.38 ████████████████████ Best Linear
Linear Mixed-Effects (LMM4) 8.41 ███████████████████
Linear Mixed-Effects (Baseline) 8.77 ██████████████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
LightGBM 10.61 ███████████████ Best Non-Linear
XGBoost 10.70 ██████████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
OLS (with interactions) 11.26 █████████████
Lasso 12.03 ███████████
OLS (baseline) 12.01 ███████████
Ridge 12.48 ██████████
PLS 12.79 █████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
### Key Findings
1. **Best Linear Model:** LMM with Random Slopes (RMSE = 8.38)
- Captures asset-specific volatility sensitivities
- Includes quadratic terms for convexity effects
2. **Best Non-Linear Model:** LightGBM (RMSE = 10.61)
- Superior generalization vs XGBoost
- Feature regularization prevents overfitting
3. **Interpretability Insights (SHAP Analysis):**
- `realized_vol_mid` dominates (57% of gain)
- Volatility clustering confirmed as primary driver
- Non-linear regime switching in stress_spread
---
## 📁 Repository Structure
```
PROJECT/
├── Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd # Main analysis (Quarto)
├── Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.html # Rendered report
├── packages.R # R dependencies installer
├── Train_ISF.csv # Training data (~1.9M rows)
├── Test_ISF.csv # Test data (~1.25M rows)
├── hat_y.csv # Final predictions
├── README.md # This file
└── results/
├── lightgbm/ # LightGBM model outputs
└── xgboost/ # XGBoost model outputs
```
---
## 🚀 Getting Started
### Prerequisites
- **R** ≥ 4.0
- Required packages (auto-installed via `packages.R`)
### Installation
```r
# Install all dependencies
source("packages.R")
```
Or manually install key packages:
```r
install.packages(c(
"tidyverse", "tidymodels", "caret", "glmnet",
"lme4", "lmerTest", "xgboost", "lightgbm",
"ranger", "pls", "shapviz", "rBayesianOptimization"
))
```
### Running the Analysis
1. **Open the Quarto document:**
```r
# In RStudio
rstudioapi::navigateToFile("Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd")
```
2. **Render the document:**
```r
quarto::quarto_render("Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd")
```
3. **Or run specific sections interactively** using the code chunks in the `.qmd` file
---
## 🛠️ Technical Details
### Data Split Strategy
- **Chronological split** at 80th percentile of dates
- Prevents look-ahead bias and data leakage
- Training: ~1.53M observations
- Validation: ~376K observations
### Hyperparameter Tuning
- **Method:** Bayesian Optimization (Gaussian Processes)
- **Acquisition:** Expected Improvement (UCB)
- **Goal:** Maximize negative RMSE
### Evaluation Metric
**Exponential RMSE** on original scale:
$$
RMSE_{real} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \exp(\hat{y}_{\log, i}) - y_i \right)^2}
$$
Models trained on log-transformed target for variance stabilization.
---
## 📖 Key Concepts
### Financial Theories Applied
1. **Volatility Clustering** Past volatility predicts future volatility
2. **Variance Risk Premium** Spread between implied and realized volatility
3. **Fear Gauge** Put-call ratio as sentiment indicator
4. **Mean Reversion** Volatility tends to return to long-term average
5. **Liquidity Premium** Illiquid assets command higher volatility
### Statistical Methods
- Panel data modeling with fixed and random effects
- Principal Component Analysis (PCA)
- Bayesian hyperparameter optimization
- SHAP values for model interpretability
---
## 👥 Authors
**Team:**
- Arthur DANJOU
- Camille LEGRAND
- Axelle MERIC
- Moritz VON SIEMENS
**Course:** Classification and Regression (M2)
**Academic Year:** 2025-2026
---
## 📝 Notes
- **Computational Constraints:** Some models (Random Forest, MLP) failed due to hardware limitations (16GB RAM, CPU-only)
- **Reproducibility:** Set `seed = 2025` for consistent results
- **Language:** Analysis documented in English, course materials in French
---
## 📚 References
Key R packages used:
- `tidymodels` Modern modeling framework
- `glmnet` Regularized regression
- `lme4` / `lmerTest` Mixed-effects models
- `xgboost` / `lightgbm` Gradient boosting
- `shapviz` Model interpretability
- `rBayesianOptimization` Hyperparameter tuning

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,44 @@
# Liste des packages nécessaires
packages <- c(
"tidyverse",
"rsample",
"scales",
"dplyr",
"tidyr",
"glue",
"corrplot",
"ggfortify",
"carData",
"car",
"MASS",
"ggplot2",
"DataExplorer",
"skimr",
"plotly",
"gridExtra",
"grid",
"rlang",
"caret",
"reshape2",
"class",
"ROCR",
"randomForest",
"fitdistrplus",
"hexbin",
"paletteer"
)
# Fonction pour installer les packages manquants
install_if_missing <- function(p) {
if (!require(p, character.only = TRUE)) {
install.packages(p, dependencies = TRUE)
}
}
# Application de la fonction sur toute la liste
invisible(sapply(packages, install_if_missing))
# Chargement de toutes les librairies
invisible(lapply(packages, library, character.only = TRUE))
message("Tous les packages ont été installés et chargés avec succès !")