
Implied Volatility Prediction from Options Data


M2 Master's Project: predicting implied volatility using advanced regression techniques and machine learning models on financial options data.

This project explores the prediction of implied volatility from options market data, combining classical statistical methods with modern machine learning approaches. The analysis covers data preprocessing, feature engineering, model benchmarking, and interpretability analysis using real-world financial panel data.


📋 Project Overview

Problem Statement

Implied volatility represents the market's forward-looking expectation of an asset's future volatility. Accurate prediction is crucial for:

  • Option pricing and valuation
  • Risk management and hedging strategies
  • Trading strategies based on volatility arbitrage

Dataset

The project uses a comprehensive panel dataset tracking 3,887 assets across 544 observation dates (2019-2022):

| File | Description | Shape |
|---|---|---|
| Train_ISF.csv | Training data with target variable | 1,909,465 rows × 21 columns |
| Test_ISF.csv | Test data for prediction | 1,251,308 rows × 18 columns |
| hat_y.csv | Final predictions from both models | 1,251,308 rows × 2 columns |

Key Variables

Target Variable:

  • implied_vol_ref: the implied volatility to predict

Feature Categories:

  • Identifiers: asset_id, obs_date
  • Market Activity: call_volume, put_volume, call_oi, put_oi, total_contracts
  • Volatility Metrics: realized_vol_short, realized_vol_mid1-3, realized_vol_long1-4, market_vol_index
  • Option Structure: strike_dispersion, maturity_count

🏗️ Methodology

Data Pipeline

Raw Data
    ↓
┌─────────────────────────────────────────────────────────┐
│  Data Splitting (Chronological 80/20)                  │
│  - Training: 2019-10 to 2021-07                         │
│  - Validation: 2021-07 to 2022-03                     │
└─────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────┐
│  Feature Engineering                                   │
│  - Aggregation of volatility horizons                 │
│  - Creation of financial indicators                   │
└─────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────┐
│  Data Preprocessing (tidymodels)                       │
│  - Winsorization (99.5th percentile)                  │
│  - Log/Yeo-Johnson transformations                    │
│  - Z-score normalization                              │
│  - PCA (95% variance retention)                       │
└─────────────────────────────────────────────────────────┘
    ↓
Three Datasets Generated:
├── Tree-based (raw, scale-invariant)
├── Linear (normalized, winsorized)
└── PCA (dimensionality-reduced)
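
The preprocessing steps for the linear and PCA datasets can be sketched in base R. This is a minimal illustration on simulated data, not the project's tidymodels recipe: the toy values and the helper names `winsorize_upper` and `pca_retain` are assumptions, and a plain `log1p` stands in for the Log/Yeo-Johnson step.

```r
# Toy data standing in for a few right-skewed, non-negative features (made-up values).
set.seed(2025)
toy <- data.frame(
  call_volume        = rexp(1000, rate = 1 / 500),
  put_volume         = rexp(1000, rate = 1 / 400),
  realized_vol_short = rlnorm(1000, meanlog = 3, sdlog = 0.5)
)

# Hypothetical helper: cap values at the column's 99.5th percentile.
winsorize_upper <- function(x, prob = 0.995) {
  cap <- quantile(x, probs = prob, na.rm = TRUE, names = FALSE)
  pmin(x, cap)
}

# Hypothetical helper: keep the fewest principal components whose
# cumulative explained variance reaches `threshold`.
pca_retain <- function(x, threshold = 0.95) {
  pca <- prcomp(x)
  cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  k <- which(cum_var >= threshold)[1]
  pca$x[, seq_len(k), drop = FALSE]
}

capped <- as.data.frame(lapply(toy, winsorize_upper))
logged <- as.data.frame(lapply(capped, log1p))  # log transform for skewed inputs
scaled <- scale(logged)                         # z-score: mean 0, sd 1 per column
scores <- pca_retain(scaled)
dim(scores)
```

In a real pipeline the percentile caps, means, standard deviations, and PCA loadings would be estimated on the training split only and then applied unchanged to the validation split, to avoid leakage.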

Feature Engineering

New financial indicators created to capture market dynamics:

| Feature | Description | Formula |
|---|---|---|
| pulse_ratio | Volatility trend direction | RV_short / RV_long |
| stress_spread | Asset vs market stress | RV_short - Market_VIX |
| put_call_ratio_volume | Immediate market stress | Put_Volume / Call_Volume |
| put_call_ratio_oi | Long-term risk structure | Put_OI / Call_OI |
| liquidity_ratio | Market depth | Total_Volume / Total_OI |
| option_dispersion | Market uncertainty | Strike_Dispersion / Total_Contracts |
| put_low_strike | Downside protection density | Strike_Dispersion / Put_OI |
| put_proportion | Hedging vs speculation | Put_Volume / Total_Volume |
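
The formulas above translate directly into column arithmetic. The sketch below uses the column names listed under Key Variables, with made-up values. Three points are assumptions rather than facts from the project: `RV_long` is taken as the row mean of `realized_vol_long1` to `realized_vol_long4`, `Market_VIX` is mapped to `market_vol_index`, and the totals are computed as call + put. The function name `add_indicators` is hypothetical.

```r
# Two toy rows; values are invented for illustration only.
toy <- data.frame(
  realized_vol_short = c(20, 35),
  realized_vol_long1 = c(18, 25), realized_vol_long2 = c(19, 24),
  realized_vol_long3 = c(20, 26), realized_vol_long4 = c(23, 25),
  market_vol_index   = c(16, 30),
  call_volume = c(100, 50),  put_volume = c(50, 150),
  call_oi     = c(1000, 400), put_oi    = c(500, 800),
  strike_dispersion = c(10, 12), total_contracts = c(40, 60)
)

add_indicators <- function(df) {
  # Assumption: long-horizon volatilities are aggregated with a row mean.
  rv_long <- rowMeans(df[, paste0("realized_vol_long", 1:4)])
  # Assumption: totals are the sum of the call and put sides.
  total_volume <- df$call_volume + df$put_volume
  total_oi     <- df$call_oi + df$put_oi

  df$pulse_ratio           <- df$realized_vol_short / rv_long
  df$stress_spread         <- df$realized_vol_short - df$market_vol_index
  df$put_call_ratio_volume <- df$put_volume / df$call_volume
  df$put_call_ratio_oi     <- df$put_oi / df$call_oi
  df$liquidity_ratio       <- total_volume / total_oi
  df$option_dispersion     <- df$strike_dispersion / df$total_contracts
  df$put_low_strike        <- df$strike_dispersion / df$put_oi
  df$put_proportion        <- df$put_volume / total_volume
  df
}

features <- add_indicators(toy)
features[, c("pulse_ratio", "stress_spread", "put_proportion")]
```

A zero denominator yields `Inf` or `NaN` in R, so real data would need a guard (for example, a small offset or an explicit `NA`) before these ratios reach a model.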

🤖 Models Implemented

Linear Models

| Model | Description | Best RMSE |
|---|---|---|
| OLS | Ordinary Least Squares | 11.26 |
| Ridge | L2 regularization | 12.48 |
| Lasso | L1 regularization (variable selection) | 12.03 |
| Elastic Net | L1 + L2 combined | ~12.03 |
| PLS | Partial Least Squares (on PCA) | 12.79 |
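
In glmnet, the Ridge, Lasso, and Elastic Net penalties differ only in the mixing parameter `alpha`. A minimal sketch on simulated data follows; the data, the `alpha = 0.5` choice for Elastic Net, and the object names are illustrative assumptions, not the project's settings.

```r
library(glmnet)

# Simulated regression problem: only the first 3 of 8 predictors matter.
set.seed(2025)
n <- 300; p <- 8
x <- matrix(rnorm(n * p), n, p)
y <- drop(x[, 1:3] %*% c(2, -1, 0.5)) + rnorm(n)

# alpha selects the penalty mix: 0 = ridge (L2), 1 = lasso (L1),
# values in between = elastic net. cv.glmnet picks lambda by cross-validation.
fits <- lapply(c(ridge = 0, elastic_net = 0.5, lasso = 1), function(a) {
  cv.glmnet(x, y, alpha = a)
})

# Count non-zero slopes (intercept excluded) at the CV-selected lambda.
n_nonzero <- sapply(fits, function(f) {
  sum(as.matrix(coef(f, s = "lambda.min"))[-1, 1] != 0)
})
n_nonzero
```

Ridge shrinks coefficients without setting them to zero, whereas the L1 part of Lasso and Elastic Net can drop predictors entirely, which is the variable-selection behavior noted in the table.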

Linear Mixed-Effects Models (LMM)

Advanced panel data models accounting for asset-specific effects:

| Model | Features | RMSE |
|---|---|---|
| LMM Baseline | All variables + Random Intercept | 8.77 |
| LMM Reduced | Collinearity removal | ~8.77 |
| LMM Interactions | Financial interaction terms | ~8.77 |
| LMM + Quadratic | Convexity terms (vol of vol) | 8.41 |
| LMM + Random Slopes (mod_lmm_5) | Asset-specific betas | 8.10 |
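
The step from a random intercept to quadratic terms and random slopes can be sketched with lme4 on a simulated panel. Everything here is an assumption for illustration: the single predictor `rv`, the response `log_iv`, the panel size, and the formulas, none of which reproduce `mod_lmm_5`.

```r
library(lme4)

# Simulated panel: 30 hypothetical assets, 40 observations each, with
# asset-specific intercepts and asset-specific slopes on one predictor.
set.seed(2025)
n_assets <- 30; n_obs <- 40
asset_id <- factor(rep(seq_len(n_assets), each = n_obs))
rv <- rnorm(n_assets * n_obs)
a  <- rep(rnorm(n_assets, sd = 1),   each = n_obs)  # intercept deviations
b  <- rep(rnorm(n_assets, sd = 0.5), each = n_obs)  # slope deviations
log_iv <- 3 + a + (0.6 + b) * rv + 0.1 * rv^2 +
  rnorm(n_assets * n_obs, sd = 0.3)
panel <- data.frame(asset_id, rv, log_iv)

# Random intercept per asset only.
m_intercept <- lmer(log_iv ~ rv + (1 | asset_id), data = panel, REML = FALSE)

# Quadratic (convexity) term plus a random slope on rv per asset.
m_slopes <- lmer(log_iv ~ rv + I(rv^2) + (1 + rv | asset_id),
                 data = panel, REML = FALSE)

# In-sample RMSE from conditional residuals.
rmse <- function(model) sqrt(mean(residuals(model)^2))
c(intercept = rmse(m_intercept), slopes = rmse(m_slopes))
```

The comparison here is in-sample on the log scale, so it only illustrates the model structure; the RMSE values reported in this README come from the project's own evaluation.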

Tree-Based Models

| Model | Strategy | Validation RMSE | Training RMSE |
|---|---|---|---|
| XGBoost | Level-wise, Bayesian tuning | 10.70 | 0.57 |
| LightGBM | Leaf-wise, feature regularization | 10.61 | 10.90 |
| Random Forest | Bagging | DNF* | - |

*DNF: Did Not Finish (computational constraints)

Neural Networks

| Model | Architecture | Status |
|---|---|---|
| MLP | 128-64 units, tanh activation | Failed to converge |

📊 Results Summary

Model Comparison

RMSE Performance (Lower is Better)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Linear Mixed-Effects (LMM5)     8.38 ████████████████████ Best Linear
Linear Mixed-Effects (LMM4)     8.41 ███████████████████
Linear Mixed-Effects (Baseline) 8.77 ██████████████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
LightGBM                       10.61 ███████████████ Best Non-Linear
XGBoost                        10.70 ██████████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
OLS (with interactions)        11.26 █████████████
OLS (baseline)                 12.01 ███████████
Lasso                          12.03 ███████████
Ridge                          12.48 ██████████
PLS                            12.79 █████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Key Findings

  1. Best Linear Model: LMM with Random Slopes (RMSE = 8.38)
    • Captures asset-specific volatility sensitivities
    • Includes quadratic terms for convexity effects

  2. Best Non-Linear Model: LightGBM (RMSE = 10.61)
    • Superior generalization vs XGBoost
    • Feature regularization prevents overfitting

  3. Interpretability Insights (SHAP Analysis):
    • realized_vol_mid dominates (57% of gain)
    • Volatility clustering confirmed as primary driver
    • Non-linear regime switching in stress_spread

📁 Repository Structure

PROJECT/
├── Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd    # Main analysis (Quarto)
├── Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.html    # Rendered report
├── packages.R                                         # R dependencies installer
├── Train_ISF.csv                                      # Training data (~1.9M rows)
├── Test_ISF.csv                                       # Test data (~1.25M rows)
├── hat_y.csv                                          # Final predictions
├── README.md                                          # This file
└── results/
    ├── lightgbm/                                      # LightGBM model outputs
    └── xgboost/                                       # XGBoost model outputs

🚀 Getting Started

Prerequisites

  • R ≥ 4.0
  • Required packages (auto-installed via packages.R)

Installation

# Install all dependencies
source("packages.R")

Or manually install key packages:

install.packages(c(
  "tidyverse", "tidymodels", "caret", "glmnet",
  "lme4", "lmerTest", "xgboost", "lightgbm",
  "ranger", "pls", "shapviz", "rBayesianOptimization"
))

Running the Analysis

  1. Open the Quarto document:

    # In RStudio
    rstudioapi::navigateToFile("Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd")
    
  2. Render the document:

    quarto::quarto_render("Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd")
    
  3. Or run specific sections interactively using the code chunks in the .qmd file


🛠️ Technical Details

Data Split Strategy

  • Chronological split at 80th percentile of dates
  • Prevents look-ahead bias and data leakage
  • Training: ~1.53M observations
  • Validation: ~376K observations
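
A chronological split of this kind can be sketched in base R as follows. The toy panel is simulated, and cutting on the distinct observation dates (rather than on a quantile over all rows) is an assumption about how the 80th percentile is taken.

```r
# Simulated panel: 5 hypothetical assets observed on 100 consecutive dates.
dates <- seq(as.Date("2019-10-01"), by = "day", length.out = 100)
panel <- data.frame(
  asset_id = rep(1:5, times = length(dates)),
  obs_date = rep(dates, each = 5)
)

# Cutoff = the date sitting at the 80% position among the sorted distinct dates.
unique_dates <- sort(unique(panel$obs_date))
cutoff <- unique_dates[round(0.8 * length(unique_dates))]

# Everything up to the cutoff trains the model; later dates validate it.
train <- panel[panel$obs_date <= cutoff, ]
valid <- panel[panel$obs_date >  cutoff, ]

c(train = nrow(train), valid = nrow(valid))
```

Because every validation date is strictly later than every training date, no future observation can inform the fit, which is what rules out look-ahead bias.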

Hyperparameter Tuning

  • Method: Bayesian Optimization (Gaussian Processes)
  • Acquisition: Upper Confidence Bound (UCB)
  • Goal: Maximize negative RMSE
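
The "maximize negative RMSE" convention follows from how rBayesianOptimization works: `BayesianOptimization()` maximizes the `Score` returned by the objective function. The sketch below tunes a single ridge penalty on simulated data as a cheap stand-in for boosted-tree training. The data, bounds, iteration counts, and the function name `ridge_score` are assumptions; `acq = "ucb"` is the package default, and `"ei"` selects expected improvement instead.

```r
library(rBayesianOptimization)

# Simulated data with a fixed train/validation split.
set.seed(2025)
n <- 200
x <- matrix(rnorm(n * 5), ncol = 5)
y <- drop(x %*% c(1, -0.5, 0, 0, 0.25)) + rnorm(n, sd = 0.5)
train_idx <- seq_len(150)
valid_idx <- 151:n

# Objective: validation RMSE of a closed-form ridge fit, negated because
# the optimizer maximizes Score. Pred is required by the package; 0 is a placeholder.
ridge_score <- function(log_lambda) {
  lambda <- exp(log_lambda)
  xt <- x[train_idx, ]
  beta <- solve(crossprod(xt) + lambda * diag(ncol(xt)),
                crossprod(xt, y[train_idx]))
  pred <- drop(x[valid_idx, ] %*% beta)
  list(Score = -sqrt(mean((pred - y[valid_idx])^2)), Pred = 0)
}

opt <- BayesianOptimization(
  FUN = ridge_score,
  bounds = list(log_lambda = c(-5, 5)),
  init_points = 5,   # random evaluations before the Gaussian process is fitted
  n_iter = 5,        # sequential evaluations guided by the acquisition function
  acq = "ucb",
  verbose = FALSE
)
opt$Best_Par
```

For a gradient-boosting model, the body of the objective would train the model with the proposed hyperparameters and return the negated validation RMSE in the same `list(Score = ..., Pred = ...)` shape.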

Evaluation Metric

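RMSE on the original scale, computed after exponentiating the log-scale predictions:

$$
RMSE_{real} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \exp(\hat{y}_{\log, i}) - y_i \right)^2}
$$

Models are trained on a log-transformed target for variance stabilization, so each prediction $\hat{y}_{\log, i}$ is mapped back with $\exp$ before it is compared with the observed $y_i$.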
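
As a small worked example of this metric, the following base R sketch computes the RMSE on the original scale from log-scale predictions. The function name `rmse_real` and the numbers are hypothetical.

```r
# RMSE on the original scale for a model fitted to log(y).
rmse_real <- function(pred_log, y) {
  stopifnot(length(pred_log) == length(y))
  sqrt(mean((exp(pred_log) - y)^2))
}

y        <- c(20, 30, 45)        # made-up observed implied volatilities
pred_log <- log(c(22, 27, 45))   # made-up predictions on the log scale

# Errors after back-transformation are 2, -3 and 0, so the result is sqrt(13 / 3).
rmse_real(pred_log, y)
```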


📖 Key Concepts

Financial Theories Applied

  1. Volatility Clustering: past volatility predicts future volatility
  2. Variance Risk Premium: spread between implied and realized volatility
  3. Fear Gauge: put-call ratio as sentiment indicator
  4. Mean Reversion: volatility tends to return to its long-term average
  5. Liquidity Premium: illiquid assets command higher volatility

Statistical Methods

  • Panel data modeling with fixed and random effects
  • Principal Component Analysis (PCA)
  • Bayesian hyperparameter optimization
  • SHAP values for model interpretability

👥 Authors

Team:

  • Arthur DANJOU
  • Camille LEGRAND
  • Axelle MERIC
  • Moritz VON SIEMENS

Course: Classification and Regression (M2)
Academic Year: 2025-2026


📝 Notes

  • Computational Constraints: Random Forest did not finish and the MLP failed to converge, both under hardware limitations (16 GB RAM, CPU-only)
  • Reproducibility: Set seed = 2025 for consistent results
  • Language: Analysis documented in English, course materials in French

📚 References

Key R packages used:

  • tidymodels: modern modeling framework
  • glmnet: regularized regression
  • lme4 / lmerTest: mixed-effects models
  • xgboost / lightgbm: gradient boosting
  • shapviz: model interpretability
  • rBayesianOptimization: hyperparameter tuning