mirror of https://github.com/ArthurDanjou/ArtStudies.git synced 2026-03-16 07:10:13 +01:00


Implied Volatility Prediction from Options Data

M2 Master's Project – Predicting implied volatility using advanced regression techniques and machine learning models on financial options data.

This project explores the prediction of implied volatility from options market data, combining classical statistical methods with modern machine learning approaches. The analysis covers data preprocessing, feature engineering, model benchmarking, and interpretability analysis using real-world financial panel data.

📋 Project Overview

Problem Statement

Implied volatility represents the market's forward-looking expectation of an asset's future volatility. Accurate prediction is crucial for:

Option pricing and valuation
Risk management and hedging strategies
Trading strategies based on volatility arbitrage

Dataset

The project uses a comprehensive panel dataset tracking 3,887 assets across 544 observation dates (2019-2022):

| File | Description | Shape |
|------|-------------|-------|
| `Train_ISF.csv` | Training data with target variable | 1,909,465 rows × 21 columns |
| `Test_ISF.csv` | Test data for prediction | 1,251,308 rows × 18 columns |
| `hat_y.csv` | Final predictions from both models | 1,251,308 rows × 2 columns |

Key Variables

Target Variable:

implied_vol_ref – The implied volatility to predict

Feature Categories:

Identifiers: asset_id, obs_date
Market Activity: call_volume, put_volume, call_oi, put_oi, total_contracts
Volatility Metrics: realized_vol_short, realized_vol_mid1-3, realized_vol_long1-4, market_vol_index
Option Structure: strike_dispersion, maturity_count

🏗️ Methodology

Data Pipeline

Raw Data
    ↓
┌─────────────────────────────────────────────────────────┐
│  Data Splitting (Chronological 80/20)                   │
│  - Training: 2019-10 to 2021-07                         │
│  - Validation: 2021-07 to 2022-03                       │
└─────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────┐
│  Feature Engineering                                    │
│  - Aggregation of volatility horizons                   │
│  - Creation of financial indicators                     │
└─────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────┐
│  Data Preprocessing (tidymodels)                        │
│  - Winsorization (99.5th percentile)                    │
│  - Log/Yeo-Johnson transformations                      │
│  - Z-score normalization                                │
│  - PCA (95% variance retention)                         │
└─────────────────────────────────────────────────────────┘
    ↓
Three Datasets Generated:
├── Tree-based (raw, scale-invariant)
├── Linear (normalized, winsorized)
└── PCA (dimensionality-reduced)
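
The preprocessing steps for the linear and PCA datasets can be illustrated in base R. This is only a sketch under stated assumptions: the project itself uses tidymodels, and the helper names `winsorize_upper` and `zscore`, the choice to cap only the upper tail, and the toy data are all hypothetical.

```r
# Hypothetical helpers illustrating preprocessing steps from the pipeline.
# Assumption: winsorization caps only the upper tail at the 99.5th percentile.
winsorize_upper <- function(x, prob = 0.995) {
  cap <- stats::quantile(x, probs = prob, na.rm = TRUE, names = FALSE)
  pmin(x, cap)
}

# Z-score normalization: centre on the mean, scale by the standard deviation.
zscore <- function(x) {
  (x - mean(x, na.rm = TRUE)) / stats::sd(x, na.rm = TRUE)
}

set.seed(2025)
# Made-up skewed "volume" variable with one extreme outlier appended.
toy_volume <- c(stats::rexp(999, rate = 1 / 100), 1e6)
capped <- winsorize_upper(toy_volume)
scaled <- zscore(log1p(capped))  # log transform, then standardize

# PCA: count the components needed to retain at least 95% of the variance.
toy_matrix <- cbind(
  a = scaled,
  b = scaled + stats::rnorm(1000, sd = 0.1),  # nearly collinear with a
  c = stats::rnorm(1000)                      # independent noise
)
pca <- stats::prcomp(toy_matrix, center = TRUE, scale. = TRUE)
explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_components <- which(explained >= 0.95)[1]
```

In a real pipeline, the cap, the mean and standard deviation, and the PCA rotation would normally be estimated on the training split only and then applied unchanged to the validation and test data.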

Feature Engineering

New financial indicators created to capture market dynamics:

| Feature | Description | Formula |
|---------|-------------|---------|
| `pulse_ratio` | Volatility trend direction | RV_short / RV_long |
| `stress_spread` | Asset vs market stress | RV_short - Market_VIX |
| `put_call_ratio_volume` | Immediate market stress | Put_Volume / Call_Volume |
| `put_call_ratio_oi` | Long-term risk structure | Put_OI / Call_OI |
| `liquidity_ratio` | Market depth | Total_Volume / Total_OI |
| `option_dispersion` | Market uncertainty | Strike_Dispersion / Total_Contracts |
| `put_low_strike` | Downside protection density | Strike_Dispersion / Put_OI |
| `put_proportion` | Hedging vs speculation | Put_Volume / Total_Volume |
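
As a rough illustration, the formulas above could be computed in base R as follows. Several details are assumptions rather than the project's actual code: the function name `add_indicators`, the use of the row mean of `realized_vol_long1` to `realized_vol_long4` as RV_long, the mapping of Market_VIX to `market_vol_index`, and the definition of the totals as call plus put. The toy values are made up.

```r
# Toy rows using the README's raw column names; the values are made up.
toy <- data.frame(
  call_volume        = c(120, 40),
  put_volume         = c(60, 80),
  call_oi            = c(1000, 500),
  put_oi             = c(800, 700),
  total_contracts    = c(50, 20),
  realized_vol_short = c(22, 35),
  realized_vol_long1 = c(18, 30),
  realized_vol_long2 = c(20, 28),
  realized_vol_long3 = c(19, 27),
  realized_vol_long4 = c(23, 25),
  market_vol_index   = c(20, 30),
  strike_dispersion  = c(5, 8)
)

# Hypothetical helper that appends the engineered indicators to a data frame.
add_indicators <- function(df) {
  # Assumption: RV_long is the row mean of the four long-horizon columns.
  rv_long <- rowMeans(df[, paste0("realized_vol_long", 1:4)])
  # Assumption: totals are the sum of the call and put sides.
  total_volume <- df$call_volume + df$put_volume
  total_oi     <- df$call_oi + df$put_oi

  df$pulse_ratio           <- df$realized_vol_short / rv_long
  df$stress_spread         <- df$realized_vol_short - df$market_vol_index
  df$put_call_ratio_volume <- df$put_volume / df$call_volume
  df$put_call_ratio_oi     <- df$put_oi / df$call_oi
  df$liquidity_ratio       <- total_volume / total_oi
  df$option_dispersion     <- df$strike_dispersion / df$total_contracts
  df$put_low_strike        <- df$strike_dispersion / df$put_oi
  df$put_proportion        <- df$put_volume / total_volume
  df
}

features <- add_indicators(toy)
```

Ratios with a zero denominator evaluate to `Inf` or `NaN` in R, so such rows may need explicit handling before modelling.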

🤖 Models Implemented

Linear Models

| Model | Description | Best RMSE |
|-------|-------------|-----------|
| OLS | Ordinary Least Squares | 11.26 |
| Ridge | L2 regularization | 12.48 |
| Lasso | L1 regularization (variable selection) | 12.03 |
| Elastic Net | L1 + L2 combined | ~12.03 |
| PLS | Partial Least Squares (on PCA) | 12.79 |

Linear Mixed-Effects Models (LMM)

Advanced panel data models accounting for asset-specific effects:

| Model | Features | RMSE |
|-------|----------|------|
| LMM Baseline | All variables + Random Intercept | 8.77 |
| LMM Reduced | Collinearity removal | ~8.77 |
| LMM Interactions | Financial interaction terms | ~8.77 |
| LMM + Quadratic | Convexity terms (vol of vol) | 8.41 |
| LMM + Random Slopes (mod_lmm_5) | Asset-specific betas | 8.38 ⭐ |

Tree-Based Models

| Model | Strategy | Validation RMSE | Training RMSE |
|-------|----------|-----------------|---------------|
| XGBoost | Level-wise, Bayesian tuning | 10.70 | 0.57 |
| LightGBM | Leaf-wise, feature regularization | 10.61 ⭐ | 10.90 |
| Random Forest | Bagging | DNF* | - |

*DNF: Did Not Finish (computational constraints)

Neural Networks

| Model | Architecture | Status |
|-------|--------------|--------|
| MLP | 128-64 units, tanh activation | Failed to converge |

📊 Results Summary

Model Comparison

RMSE Performance (Lower is Better)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Linear Mixed-Effects (LMM5)     8.38 ████████████████████ Best Linear
Linear Mixed-Effects (LMM4)     8.41 ███████████████████
Linear Mixed-Effects (Baseline) 8.77 ██████████████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
LightGBM                       10.61 ███████████████ Best Non-Linear
XGBoost                        10.70 ██████████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
OLS (with interactions)        11.26 █████████████
OLS (baseline)                 12.01 ███████████
Lasso                          12.03 ███████████
Ridge                          12.48 ██████████
PLS                            12.79 █████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Key Findings

Best Linear Model: LMM with Random Slopes (RMSE = 8.38)
- Captures asset-specific volatility sensitivities
- Includes quadratic terms for convexity effects
Best Non-Linear Model: LightGBM (RMSE = 10.61)
- Superior generalization vs XGBoost
- Feature regularization prevents overfitting
Interpretability Insights (SHAP Analysis):
- realized_vol_mid dominates (57% of gain)
- Volatility clustering confirmed as primary driver
- Non-linear regime switching in stress_spread

📁 Repository Structure

PROJECT/
├── Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd    # Main analysis (Quarto)
├── Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.html    # Rendered report
├── packages.R                                         # R dependencies installer
├── Train_ISF.csv                                      # Training data (~1.9M rows)
├── Test_ISF.csv                                       # Test data (~1.25M rows)
├── hat_y.csv                                          # Final predictions
├── README.md                                          # This file
└── results/
    ├── lightgbm/                                      # LightGBM model outputs
    └── xgboost/                                       # XGBoost model outputs

🚀 Getting Started

Prerequisites

R ≥ 4.0
Required packages (auto-installed via packages.R)

Installation

# Install all dependencies
source("packages.R")

Or manually install key packages:

install.packages(c(
  "tidyverse", "tidymodels", "caret", "glmnet",
  "lme4", "lmerTest", "xgboost", "lightgbm",
  "ranger", "pls", "shapviz", "rBayesianOptimization"
))
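
The contents of `packages.R` are not reproduced in this README. A minimal sketch of an install-if-missing helper, assuming nothing about the actual script, might look like this; the function name `ensure_packages` and its `install` argument are hypothetical.

```r
# Hypothetical helper: find which requested packages are absent and,
# unless `install = FALSE`, install them from the configured repository.
ensure_packages <- function(pkgs, install = TRUE) {
  missing <- setdiff(pkgs, rownames(utils::installed.packages()))
  if (install && length(missing) > 0) {
    utils::install.packages(missing)
  }
  invisible(missing)
}

# Dry run: list which of these packages are not yet installed.
required <- c("tidyverse", "tidymodels", "glmnet", "lme4", "lightgbm")
not_installed <- ensure_packages(required, install = FALSE)
```

Running the helper with `install = FALSE` first shows what would be installed without changing the library.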

Running the Analysis

Open the Quarto document:

# In RStudio
rstudioapi::navigateToFile("Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd")

Render the document:

quarto::quarto_render("Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd")

Alternatively, run specific sections interactively using the code chunks in the .qmd file.

🛠️ Technical Details

Data Split Strategy

Chronological split at 80th percentile of dates
Prevents look-ahead bias and data leakage
Training: ~1.53M observations
Validation: ~376K observations
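
The split described above can be sketched in base R on toy data. The assumption here, not confirmed by the source, is that the cutoff is taken at the 80th percentile of the distinct observation dates; the toy panel and the variable names `cutoff`, `train`, and `valid` are illustrative only.

```r
set.seed(2025)
# Toy panel: 5 assets observed on the same 100 weekly dates (made-up values).
toy <- data.frame(
  asset_id        = rep(1:5, each = 100),
  obs_date        = rep(seq(as.Date("2019-10-01"), by = "week", length.out = 100),
                        times = 5),
  implied_vol_ref = stats::runif(500, min = 10, max = 60)
)

# Assumption: cutoff = 80th percentile of the distinct observation dates.
dates  <- sort(unique(toy$obs_date))
cutoff <- dates[ceiling(0.8 * length(dates))]

# Everything up to the cutoff trains the model; later dates validate it.
train <- toy[toy$obs_date <= cutoff, ]
valid <- toy[toy$obs_date >  cutoff, ]
```

Splitting on dates rather than on random rows keeps every validation observation strictly later than the training data, which is what rules out look-ahead bias.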

Hyperparameter Tuning

Method: Bayesian Optimization (Gaussian Processes)
Acquisition: Expected Improvement (UCB)
Goal: Maximize the negative RMSE, which is equivalent to minimizing the RMSE

Evaluation Metric

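RMSE on the original scale, computed after exponentiating the log-scale predictions:

$$
\mathrm{RMSE}_{\text{real}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \exp(\hat{y}_{\log, i}) - y_i \right)^2}
$$

Here $\hat{y}_{\log, i}$ is the model's prediction on the log scale and $y_i$ is the observed implied volatility. Models are trained on a log-transformed target for variance stabilization.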
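
A small base R sketch of the metric above, assuming predictions are produced on the log scale; the function name `rmse_real` and the toy numbers are hypothetical.

```r
# Hypothetical helper implementing the formula above: back-transform the
# log-scale predictions with exp() and compute the RMSE against observed values.
rmse_real <- function(pred_log, y) {
  stopifnot(length(pred_log) == length(y))
  sqrt(mean((exp(pred_log) - y)^2))
}

# Toy example: observed volatilities and log-scale predictions (made up).
y        <- c(20, 30, 40)
pred_log <- log(c(22, 27, 40))
score    <- rmse_real(pred_log, y)
```

Evaluating on the original scale keeps the score in the same units as `implied_vol_ref`, so it stays comparable across models regardless of how the target was transformed during training.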

📖 Key Concepts

Financial Theories Applied

Volatility Clustering – Past volatility predicts future volatility
Variance Risk Premium – Spread between implied and realized volatility
Fear Gauge – Put-call ratio as sentiment indicator
Mean Reversion – Volatility tends to return to long-term average
Liquidity Premium – Illiquid assets command higher volatility

Statistical Methods

Panel data modeling with fixed and random effects
Principal Component Analysis (PCA)
Bayesian hyperparameter optimization
SHAP values for model interpretability

👥 Authors

Team:

Arthur DANJOU
Camille LEGRAND
Axelle MERIC
Moritz VON SIEMENS

Course: Classification and Regression (M2)
Academic Year: 2025-2026

📝 Notes

Computational Constraints: Some models (Random Forest, MLP) failed due to hardware limitations (16GB RAM, CPU-only)
Reproducibility: Set seed = 2025 for consistent results
Language: Analysis documented in English, course materials in French

📚 References

Key R packages used:

tidymodels – Modern modeling framework
glmnet – Regularized regression
lme4 / lmerTest – Mixed-effects models
xgboost / lightgbm – Gradient boosting
shapviz – Model interpretability
rBayesianOptimization – Hyperparameter tuning
