---
slug: implied-volatility-prediction-from-options-data
title: Implied Volatility Prediction from Options Data
type: Academic Project
description: A large-scale statistical study comparing Generalized Linear Models (GLMs) and black-box machine learning architectures to predict the implied volatility of S&P 500 options.
shortDescription: Predicting implied volatility using advanced regression techniques and machine learning models on financial options data.
publishedAt: 2026-02-28
readingTime: 3
status: Completed
tags:
- R
- GLM
- Finance
- Machine Learning
- Statistical Modeling
icon: i-ph-graph-duotone
---
> **M2 Master's Project:** Predicting implied volatility using advanced regression techniques and machine learning models on financial options data.

This project explores the prediction of **implied volatility** from options market data, combining classical statistical methods with modern machine learning approaches. The analysis covers data preprocessing, feature engineering, model benchmarking, and interpretability analysis using real-world financial panel data.

- **GitHub Repository:** [Implied-Volatility-from-Options-Data](https://github.com/ArthurDanjou/Implied-Volatility-from-Options-Data)
---
::BackgroundTitle{title="Project Overview"}
::
### Problem Statement
Implied volatility represents the market's forward-looking expectation of an asset's future volatility. Accurate prediction is crucial for:
- **Option pricing** and valuation
- **Risk management** and hedging strategies
- **Trading strategies** based on volatility arbitrage
### Dataset
The project uses a comprehensive panel dataset tracking **3,887 assets** across **544 observation dates** (2019-2022):
| File | Description | Shape |
|------|-------------|-------|
| `Train_ISF.csv` | Training data with target variable | 1,909,465 rows × 21 columns |
| `Test_ISF.csv` | Test data for prediction | 1,251,308 rows × 18 columns |
| `hat_y.csv` | Final predictions from both models | 1,251,308 rows × 2 columns |
### Key Variables
**Target Variable:**
- `implied_vol_ref`: the implied volatility to predict
**Feature Categories:**
- **Identifiers:** `asset_id`, `obs_date`
- **Market Activity:** `call_volume`, `put_volume`, `call_oi`, `put_oi`, `total_contracts`
- **Volatility Metrics:** `realized_vol_short`, `realized_vol_mid1-3`, `realized_vol_long1-4`, `market_vol_index`
- **Option Structure:** `strike_dispersion`, `maturity_count`
---
::BackgroundTitle{title="Methodology"}
::
### Data Pipeline
```
Raw Data
    ↓
┌──────────────────────────────────────────┐
│ Data Splitting (Chronological 80/20)     │
│   - Training: 2019-10 to 2021-07         │
│   - Validation: 2021-07 to 2022-03       │
└──────────────────────────────────────────┘
    ↓
┌──────────────────────────────────────────┐
│ Feature Engineering                      │
│   - Aggregation of volatility horizons   │
│   - Creation of financial indicators     │
└──────────────────────────────────────────┘
    ↓
┌──────────────────────────────────────────┐
│ Data Preprocessing (tidymodels)          │
│   - Winsorization (99.5th percentile)    │
│   - Log/Yeo-Johnson transformations      │
│   - Z-score normalization                │
│   - PCA (95% variance retention)         │
└──────────────────────────────────────────┘
    ↓
Three Datasets Generated:
├── Tree-based (raw, scale-invariant)
├── Linear (normalized, winsorized)
└── PCA (dimensionality-reduced)
```
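
The winsorization and z-score steps of the pipeline can be sketched in base R. This is only an illustration and not the project's tidymodels recipe: the helper names `fit_preprocess` and `apply_preprocess` are hypothetical, the cap is assumed to be one-sided (upper tail only), and `log1p` is assumed as the log transform for a non-negative feature such as a volume. The parameters are estimated on the training split only and then reused on the validation split, which keeps the preprocessing free of look-ahead.

```r
# Estimate preprocessing parameters on the training split only.
# Assumptions: upper-tail winsorization, log1p transform, non-negative input.
fit_preprocess <- function(x_train, upper = 0.995) {
  cap <- unname(quantile(x_train, probs = upper, na.rm = TRUE))
  x_log <- log1p(pmin(x_train, cap))
  list(cap = cap,
       center = mean(x_log, na.rm = TRUE),
       scale = sd(x_log, na.rm = TRUE))
}

# Apply the stored parameters to any split (training, validation, test).
apply_preprocess <- function(x, params) {
  (log1p(pmin(x, params$cap)) - params$center) / params$scale
}

set.seed(2025)
x_train <- rexp(10000, rate = 1 / 100)           # toy heavy-tailed feature
x_valid <- c(rexp(1000, rate = 1 / 100), 1e6)    # includes one extreme outlier

params  <- fit_preprocess(x_train)
z_train <- apply_preprocess(x_train, params)
z_valid <- apply_preprocess(x_valid, params)

# The outlier in the validation split is capped at the training threshold.
range(z_valid)
```

Tree-based models are scale-invariant, which is consistent with the pipeline keeping a raw dataset for them and reserving this kind of transformation for the linear models.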
### Feature Engineering
New financial indicators created to capture market dynamics:
| Feature | Description | Formula |
|---------|-------------|---------|
| `pulse_ratio` | Volatility trend direction | RV_short / RV_long |
| `stress_spread` | Asset vs market stress | RV_short - Market_VIX |
| `put_call_ratio_volume` | Immediate market stress | Put_Volume / Call_Volume |
| `put_call_ratio_oi` | Long-term risk structure | Put_OI / Call_OI |
| `liquidity_ratio` | Market depth | Total_Volume / Total_OI |
| `option_dispersion` | Market uncertainty | Strike_Dispersion / Total_Contracts |
| `put_low_strike` | Downside protection density | Strike_Dispersion / Put_OI |
| `put_proportion` | Hedging vs speculation | Put_Volume / Total_Volume |
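
The formulas in the table translate almost directly into column arithmetic. The sketch below is a hedged illustration rather than the project's code: the function name `add_indicators` is hypothetical, `realized_vol_long` stands in for an aggregated long-horizon column, and total volume and total open interest are assumed to be the sum of the call and put sides. Zero denominators would produce `Inf` or `NaN` and need separate handling, which is not shown here.

```r
# Hypothetical helper: derive the eight indicators from the raw columns.
add_indicators <- function(df) {
  total_volume <- df$call_volume + df$put_volume  # assumption: calls + puts
  total_oi     <- df$call_oi + df$put_oi          # assumption: calls + puts

  df$pulse_ratio           <- df$realized_vol_short / df$realized_vol_long
  df$stress_spread         <- df$realized_vol_short - df$market_vol_index
  df$put_call_ratio_volume <- df$put_volume / df$call_volume
  df$put_call_ratio_oi     <- df$put_oi / df$call_oi
  df$liquidity_ratio       <- total_volume / total_oi
  df$option_dispersion     <- df$strike_dispersion / df$total_contracts
  df$put_low_strike        <- df$strike_dispersion / df$put_oi
  df$put_proportion        <- df$put_volume / total_volume
  df
}

# Two toy rows with made-up values, for illustration only.
toy <- data.frame(
  call_volume = c(100, 50),  put_volume = c(50, 150),
  call_oi = c(400, 200),     put_oi = c(200, 300),
  total_contracts = c(20, 10), strike_dispersion = c(5, 2),
  realized_vol_short = c(30, 18), realized_vol_long = c(20, 24),
  market_vol_index = c(25, 22)
)

features <- add_indicators(toy)
features[, c("pulse_ratio", "stress_spread", "put_proportion")]
```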
---
::BackgroundTitle{title="Models Implemented"}
::
### Linear Models
| Model | Description | Best RMSE |
|-------|-------------|-----------|
| **OLS** | Ordinary Least Squares | 11.26 |
| **Ridge** | L2 regularization | 12.48 |
| **Lasso** | L1 regularization (variable selection) | 12.03 |
| **Elastic Net** | L1 + L2 combined | ~12.03 |
| **PLS** | Partial Least Squares (on PCA) | 12.79 |
### Linear Mixed-Effects Models (LMM)
Advanced panel data models accounting for asset-specific effects:
| Model | Features | RMSE |
|-------|----------|------|
| LMM Baseline | All variables + Random Intercept | 8.77 |
| LMM Reduced | Collinearity removal | ~8.77 |
| LMM Interactions | Financial interaction terms | ~8.77 |
| LMM + Quadratic | Convexity terms (vol of vol) | 8.41 |
| **LMM + Random Slopes (mod_lmm_5)** | Asset-specific betas | **8.38** ⭐ |
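
To make the difference between a random intercept and random slopes concrete, here is a minimal `lme4` sketch on simulated data. It is an assumption-laden toy and not the project's `mod_lmm_5` specification: the variables `log_iv` and `rv` are hypothetical stand-ins for the log target and a standardized realized-volatility predictor, and the simulated effect sizes are arbitrary.

```r
library(lme4)

set.seed(2025)
n_assets <- 40
n_obs    <- 50
asset_idx <- rep(seq_len(n_assets), each = n_obs)

rv  <- rnorm(n_assets * n_obs)          # standardized volatility proxy (toy)
a_i <- rnorm(n_assets, sd = 0.5)        # asset-specific intercept shifts
b_i <- rnorm(n_assets, sd = 0.3)        # asset-specific slope shifts ("betas")

log_iv <- 3 + a_i[asset_idx] + (0.6 + b_i[asset_idx]) * rv +
  rnorm(n_assets * n_obs, sd = 0.2)

toy <- data.frame(asset_id = factor(asset_idx), rv = rv, log_iv = log_iv)

# Baseline: one intercept per asset, common slope.
fit_intercept <- lmer(log_iv ~ rv + (1 | asset_id), data = toy, REML = FALSE)

# Random slopes plus a quadratic (convexity) term: each asset gets its own
# sensitivity to rv in addition to its own intercept.
fit_slopes <- lmer(log_iv ~ rv + I(rv^2) + (1 + rv | asset_id),
                   data = toy, REML = FALSE)

fixef(fit_slopes)                        # population-level coefficients
AIC(fit_intercept, fit_slopes)           # lower AIC favors the richer model here
```

In this toy setup the data are generated with asset-specific slopes, so the random-slopes model fits better by construction; on real data the comparison has to be made on held-out RMSE, as in the table above.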
### Tree-Based Models
| Model | Strategy | Validation RMSE | Training RMSE |
|-------|----------|-----------------|---------------|
| **XGBoost** | Level-wise, Bayesian tuning | 10.70 | 0.57 |
| **LightGBM** | Leaf-wise, feature regularization | **10.61** ⭐ | 10.90 |
| Random Forest | Bagging | DNF* | - |

\*DNF: Did Not Finish (computational constraints)
### Neural Networks
| Model | Architecture | Status |
|-------|--------------|--------|
| MLP | 128-64 units, tanh activation | Failed to converge |
---
::BackgroundTitle{title="Results Summary"}
::
### Model Comparison
```
RMSE Performance (Lower is Better)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Linear Mixed-Effects (LMM5)       8.38  ████████████████████  Best Linear
Linear Mixed-Effects (LMM4)       8.41  ███████████████████
Linear Mixed-Effects (Baseline)   8.77  ██████████████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
LightGBM                         10.61  ███████████████  Best Non-Linear
XGBoost                          10.70  ██████████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
OLS (with interactions)          11.26  █████████████
OLS (baseline)                   12.01  ███████████
Lasso                            12.03  ███████████
Ridge                            12.48  ██████████
PLS                              12.79  █████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
### Key Findings
1. **Best Linear Model:** LMM with Random Slopes (RMSE = 8.38)
   - Captures asset-specific volatility sensitivities
   - Includes quadratic terms for convexity effects
2. **Best Non-Linear Model:** LightGBM (RMSE = 10.61)
   - Superior generalization vs XGBoost
   - Feature regularization prevents overfitting
3. **Interpretability Insights (SHAP Analysis):**
   - `realized_vol_mid` dominates (57% of gain)
   - Volatility clustering confirmed as primary driver
   - Non-linear regime switching in `stress_spread`
---
::BackgroundTitle{title="Repository Structure"}
::
```
PROJECT/
├── Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd   # Main analysis (Quarto)
├── Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.html  # Rendered report
├── packages.R                                       # R dependencies installer
├── Train_ISF.csv                                    # Training data (~1.9M rows)
├── Test_ISF.csv                                     # Test data (~1.25M rows)
├── hat_y.csv                                        # Final predictions
├── README.md                                        # This file
└── results/
    ├── lightgbm/                                    # LightGBM model outputs
    └── xgboost/                                     # XGBoost model outputs
```
---
::BackgroundTitle{title="Getting Started"}
::
### Prerequisites
- **R** ≥ 4.0
- Required packages (auto-installed via `packages.R`)
### Installation
```r
# Install all dependencies
source("packages.R")
```
Or manually install key packages:
```r
install.packages(c(
  "tidyverse", "tidymodels", "caret", "glmnet",
  "lme4", "lmerTest", "xgboost", "lightgbm",
  "ranger", "pls", "shapviz", "rBayesianOptimization"
))
```
### Running the Analysis
1. **Open the Quarto document:**
   ```r
   # In RStudio
   rstudioapi::navigateToFile("Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd")
   ```
2. **Render the document:**
   ```r
   quarto::quarto_render("Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd")
   ```
3. **Or run specific sections interactively** using the code chunks in the `.qmd` file
---
::BackgroundTitle{title="Technical Details"}
::
### Data Split Strategy
- **Chronological split** at 80th percentile of dates
- Prevents look-ahead bias and data leakage
- Training: ~1.53M observations
- Validation: ~376K observations
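
A chronological split of this kind can be sketched in a few lines of base R. The example below uses a small simulated panel rather than `Train_ISF.csv`, and it assumes the cut-off is taken at the 80th percentile of the distinct observation dates; whether the project computes the percentile over distinct dates or over all rows is not stated here.

```r
# Toy panel standing in for the training file (values are illustrative only).
set.seed(2025)
dates <- seq(as.Date("2019-10-01"), by = "day", length.out = 100)
panel <- data.frame(
  asset_id = rep(1:5, each = 100),
  obs_date = rep(dates, times = 5),
  implied_vol_ref = runif(500, min = 10, max = 60)
)

# Cut-off date: 80th percentile of the distinct observation dates (assumption).
unique_dates <- sort(unique(panel$obs_date))
cutoff <- unique_dates[ceiling(0.8 * length(unique_dates))]

# Everything up to the cut-off trains the model; later dates validate it.
train_set <- panel[panel$obs_date <= cutoff, ]
valid_set <- panel[panel$obs_date >  cutoff, ]

c(train = nrow(train_set), valid = nrow(valid_set))
```

Because every validation date is strictly later than every training date, no future information reaches the model during fitting, which is the look-ahead protection described above.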
### Hyperparameter Tuning
- **Method:** Bayesian Optimization (Gaussian Processes)
- **Acquisition:** Upper Confidence Bound (UCB)
- **Goal:** Maximize negative RMSE
### Evaluation Metric
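**RMSE on the original scale**, computed after exponentiating the log-scale predictions:
$$
RMSE_{real} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \exp(\hat{y}_{\log, i}) - y_i \right)^2}
$$
Models are trained on the log-transformed target for variance stabilization, so each prediction is exponentiated before it is compared with the observed implied volatility.
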
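
As a small worked example of the formula above, the following base R sketch computes the back-transformed RMSE on three made-up observations. The function name `rmse_real` is chosen here to mirror the notation and is not taken from the project code.

```r
# RMSE on the original scale for predictions made on the log scale.
rmse_real <- function(pred_log, y) {
  stopifnot(length(pred_log) == length(y))
  sqrt(mean((exp(pred_log) - y)^2))
}

y        <- c(20, 25, 40)          # observed implied volatility (toy values)
pred_log <- log(c(22, 25, 37))     # log-scale predictions (toy values)

# Errors after back-transformation are 2, 0 and -3, so the result is sqrt(13 / 3).
rmse_real(pred_log, y)
```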
---
::BackgroundTitle{title="Key Concepts"}
::
### Financial Theories Applied
1. **Volatility Clustering:** Past volatility predicts future volatility
2. **Variance Risk Premium:** Spread between implied and realized volatility
3. **Fear Gauge:** Put-call ratio as sentiment indicator
4. **Mean Reversion:** Volatility tends to return to long-term average
5. **Liquidity Premium:** Illiquid assets command higher volatility
### Statistical Methods
- Panel data modeling with fixed and random effects
- Principal Component Analysis (PCA)
- Bayesian hyperparameter optimization
- SHAP values for model interpretability
---
::BackgroundTitle{title="Authors"}
::
**Team:**
- Arthur DANJOU
- Camille LEGRAND
- Axelle MERIC
- Moritz VON SIEMENS

**Course:** Classification and Regression (M2)

**Academic Year:** 2025-2026

---
::BackgroundTitle{title="Notes"}
::
- **Computational Constraints:** Random Forest did not finish and the MLP failed to converge on the available hardware (16 GB RAM, CPU only)
- **Reproducibility:** Set `seed = 2025` for consistent results
- **Language:** Analysis documented in English, course materials in French
---
::BackgroundTitle{title="References"}
::
Key R packages used:
- `tidymodels`: Modern modeling framework
- `glmnet`: Regularized regression
- `lme4` / `lmerTest`: Mixed-effects models
- `xgboost` / `lightgbm`: Gradient boosting
- `shapviz`: Model interpretability
- `rBayesianOptimization`: Hyperparameter tuning