artsite/content/projects/glm-implied-volatility.md

---
slug: implied-volatility-prediction-from-options-data
title: Implied Volatility Prediction from Options Data
type: Academic Project
description: A large-scale statistical study comparing Generalized Linear Models (GLMs) and black-box machine learning architectures to predict the implied volatility of S&P 500 options.
shortDescription: Predicting implied volatility using advanced regression techniques and machine learning models on financial options data.
publishedAt: 2026-02-28
readingTime: 3
status: Completed
tags:
  - R
  - GLM
  - Finance
  - Machine Learning
  - Statistical Modeling
icon: i-ph-graph-duotone
---

> **M2 Master's Project** – Predicting implied volatility using advanced regression techniques and machine learning models on financial options data.

This project explores the prediction of **implied volatility** from options market data, combining classical statistical methods with modern machine learning approaches. The analysis covers data preprocessing, feature engineering, model benchmarking, and interpretability analysis using real-world financial panel data.

- **GitHub Repository:** [Implied-Volatility-from-Options-Data](https://github.com/ArthurDanjou/Implied-Volatility-from-Options-Data)

---

::BackgroundTitle{title="Project Overview"}
::

### Problem Statement

Implied volatility represents the market's forward-looking expectation of an asset's future volatility. Accurate prediction is crucial for:
- **Option pricing** and valuation
- **Risk management** and hedging strategies
- **Trading strategies** based on volatility arbitrage

### Dataset

The project uses a comprehensive panel dataset tracking **3,887 assets** across **544 observation dates** (2019-2022):

| File | Description | Shape |
|------|-------------|-------|
| `Train_ISF.csv` | Training data with target variable | 1,909,465 rows × 21 columns |
| `Test_ISF.csv` | Test data for prediction | 1,251,308 rows × 18 columns |
| `hat_y.csv` | Final predictions from both models | 1,251,308 rows × 2 columns |

### Key Variables

**Target Variable:**
- `implied_vol_ref` – The implied volatility to predict

**Feature Categories:**
- **Identifiers:** `asset_id`, `obs_date`
- **Market Activity:** `call_volume`, `put_volume`, `call_oi`, `put_oi`, `total_contracts`
- **Volatility Metrics:** `realized_vol_short`, `realized_vol_mid1` to `realized_vol_mid3`, `realized_vol_long1` to `realized_vol_long4`, `market_vol_index`
- **Option Structure:** `strike_dispersion`, `maturity_count`

---

::BackgroundTitle{title="Methodology"}
::

### Data Pipeline

```
Raw Data
    ↓
┌─────────────────────────────────────────────────────────┐
│  Data Splitting (Chronological 80/20)                   │
│  - Training: 2019-10 to 2021-07                         │
│  - Validation: 2021-07 to 2022-03                       │
└─────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────┐
│  Feature Engineering                                    │
│  - Aggregation of volatility horizons                   │
│  - Creation of financial indicators                     │
└─────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────┐
│  Data Preprocessing (tidymodels)                        │
│  - Winsorization (99.5th percentile)                    │
│  - Log/Yeo-Johnson transformations                      │
│  - Z-score normalization                                │
│  - PCA (95% variance retention)                         │
└─────────────────────────────────────────────────────────┘
    ↓
Three Datasets Generated:
├── Tree-based (raw, scale-invariant)
├── Linear (normalized, winsorized)
└── PCA (dimensionality-reduced)
```
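
The winsorization and z-score steps can be sketched in base R as follows. The project itself runs these steps through tidymodels; the helper names `fit_preproc` and `apply_preproc` are hypothetical, and capping only the upper tail is an assumption. The point of the sketch is that caps and scaling parameters are learned on the training split and then reused unchanged on any other split.

```r
# Hypothetical helpers: learn the cap and scaling on training data only,
# then apply the same parameters to any other split.
fit_preproc <- function(x, upper = 0.995) {
  cap <- unname(quantile(x, probs = upper, na.rm = TRUE))
  x_w <- pmin(x, cap)  # winsorize: cap values above the 99.5th percentile
  list(cap = cap, mean = mean(x_w, na.rm = TRUE), sd = sd(x_w, na.rm = TRUE))
}

apply_preproc <- function(x, p) {
  (pmin(x, p$cap) - p$mean) / p$sd  # cap first, then z-score
}

set.seed(2025)
train_x <- c(rlnorm(999), 1e6)  # toy skewed feature with one extreme outlier
valid_x <- rlnorm(200)

p       <- fit_preproc(train_x)
train_z <- apply_preproc(train_x, p)
valid_z <- apply_preproc(valid_x, p)
```

After this step the training feature has mean 0 and standard deviation 1, and the outlier is pulled back to the cap instead of dominating the scale.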

### Feature Engineering

New financial indicators were created to capture market dynamics:

| Feature | Description | Formula |
|---------|-------------|---------|
| `pulse_ratio` | Volatility trend direction | RV_short / RV_long |
| `stress_spread` | Asset vs market stress | RV_short - Market_VIX |
| `put_call_ratio_volume` | Immediate market stress | Put_Volume / Call_Volume |
| `put_call_ratio_oi` | Long-term risk structure | Put_OI / Call_OI |
| `liquidity_ratio` | Market depth | Total_Volume / Total_OI |
| `option_dispersion` | Market uncertainty | Strike_Dispersion / Total_Contracts |
| `put_low_strike` | Downside protection density | Strike_Dispersion / Put_OI |
| `put_proportion` | Hedging vs speculation | Put_Volume / Total_Volume |
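
As an illustration, the formulas in the table could be computed along these lines. This is a sketch under assumptions the source does not confirm: `RV_long` is taken as the mean of the four long-horizon columns, `Total_Volume` and `Total_OI` are taken as call plus put sums, and the helper name `add_indicators` is hypothetical.

```r
# Assumptions (not confirmed by the source): RV_long is the mean of the four
# long-horizon columns; Total_Volume and Total_OI are call + put sums.
add_indicators <- function(df) {
  rv_long   <- rowMeans(df[, paste0("realized_vol_long", 1:4)])
  total_vol <- df$call_volume + df$put_volume
  total_oi  <- df$call_oi + df$put_oi

  df$pulse_ratio           <- df$realized_vol_short / rv_long
  df$stress_spread         <- df$realized_vol_short - df$market_vol_index
  df$put_call_ratio_volume <- df$put_volume / df$call_volume
  df$put_call_ratio_oi     <- df$put_oi / df$call_oi
  df$liquidity_ratio       <- total_vol / total_oi
  df$option_dispersion     <- df$strike_dispersion / df$total_contracts
  df$put_low_strike        <- df$strike_dispersion / df$put_oi
  df$put_proportion        <- df$put_volume / total_vol
  df
}

# Toy rows with made-up values, only to show the shape of the input
toy <- data.frame(
  realized_vol_short = c(30, 12),
  realized_vol_long1 = c(18, 12), realized_vol_long2 = c(20, 12),
  realized_vol_long3 = c(22, 12), realized_vol_long4 = c(20, 12),
  market_vol_index   = c(25, 15),
  call_volume = c(200, 50), put_volume = c(100, 50),
  call_oi     = c(400, 80), put_oi     = c(200, 20),
  total_contracts = c(50, 10), strike_dispersion = c(10, 4)
)
toy_feat <- add_indicators(toy)
```

Ratios with a zero denominator evaluate to `Inf` or `NaN` in R, so real data would need a guard (for example a small offset or filtering) before these columns are used for modeling.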

---

::BackgroundTitle{title="Models Implemented"}
::

### Linear Models

| Model | Description | Best RMSE |
|-------|-------------|-----------|
| **OLS** | Ordinary Least Squares | 11.26 |
| **Ridge** | L2 regularization | 12.48 |
| **Lasso** | L1 regularization (variable selection) | 12.03 |
| **Elastic Net** | L1 + L2 combined | ~12.03 |
| **PLS** | Partial Least Squares (on PCA) | 12.79 |
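
For reference, the Ridge, Lasso, and Elastic Net penalties in the table can be written as a single objective in the parametrization used by `glmnet`, where $\alpha = 0$ gives Ridge, $\alpha = 1$ gives Lasso, and intermediate values give Elastic Net. The symbols follow that package's convention; the tuned values of $\lambda$ and $\alpha$ used in the project are not stated here.

$$
\min_{\beta_0, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda \left[ \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right]
$$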

### Linear Mixed-Effects Models (LMM)

Advanced panel data models accounting for asset-specific effects:

| Model | Features | RMSE |
|-------|----------|------|
| LMM Baseline | All variables + Random Intercept | 8.77 |
| LMM Reduced | Collinearity removal | ~8.77 |
| LMM Interactions | Financial interaction terms | ~8.77 |
| LMM + Quadratic | Convexity terms (vol of vol) | 8.41 |
| **LMM + Random Slopes (mod_lmm_5)** | Asset-specific betas | **8.38** ⭐ |
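
As a reading aid, a model with a random intercept and random slopes per asset can be written in the generic form below. This is the textbook formulation, not the exact specification of `mod_lmm_5`: which covariates enter $\mathbf{z}_{it}$ (the ones given asset-specific slopes) and the covariance structure $\Sigma$ are assumptions of the sketch. Here $i$ indexes assets and $t$ observation dates.

$$
y_{it} = \beta_0 + \mathbf{x}_{it}^\top \boldsymbol{\beta} + u_{0i} + \mathbf{z}_{it}^\top \mathbf{u}_{i} + \varepsilon_{it},
\qquad
\begin{pmatrix} u_{0i} \\ \mathbf{u}_{i} \end{pmatrix} \sim \mathcal{N}(\mathbf{0}, \Sigma),
\qquad
\varepsilon_{it} \sim \mathcal{N}(0, \sigma^2)
$$

The fixed effects $\boldsymbol{\beta}$ are shared across assets, while $u_{0i}$ shifts the level and $\mathbf{u}_{i}$ adjusts the sensitivities for asset $i$, which is what "asset-specific betas" refers to.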

### Tree-Based Models

| Model | Strategy | Validation RMSE | Training RMSE |
|-------|----------|-----------------|---------------|
| **XGBoost** | Level-wise, Bayesian tuning | 10.70 | 0.57 |
| **LightGBM** | Leaf-wise, feature regularization | **10.61** ⭐ | 10.90 |
| Random Forest | Bagging | DNF* | - |

*DNF: Did Not Finish (computational constraints)

### Neural Networks

| Model | Architecture | Status |
|-------|--------------|--------|
| MLP | 128-64 units, tanh activation | Failed to converge |

---

::BackgroundTitle{title="Results Summary"}
::

### Model Comparison

```
RMSE Performance (Lower is Better)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Linear Mixed-Effects (LMM5)     8.38 ████████████████████ Best Linear
Linear Mixed-Effects (LMM4)     8.41 ███████████████████
Linear Mixed-Effects (Baseline) 8.77 ██████████████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
LightGBM                       10.61 ███████████████ Best Non-Linear
XGBoost                        10.70 ██████████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
OLS (with interactions)        11.26 █████████████
OLS (baseline)                 12.01 ███████████
Lasso                          12.03 ███████████
Ridge                          12.48 ██████████
PLS                            12.79 █████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

### Key Findings

1. **Best Linear Model:** LMM with Random Slopes (RMSE = 8.38)
   - Captures asset-specific volatility sensitivities
   - Includes quadratic terms for convexity effects

2. **Best Non-Linear Model:** LightGBM (RMSE = 10.61)
   - Superior generalization vs XGBoost
   - Feature regularization prevents overfitting

3. **Interpretability Insights (SHAP Analysis):**
   - `realized_vol_mid` dominates (57% of gain)
   - Volatility clustering confirmed as primary driver
   - Non-linear regime switching in `stress_spread`

---

::BackgroundTitle{title="Repository Structure"}
::

```
PROJECT/
├── Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd     # Main analysis (Quarto)
├── Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.html    # Rendered report
├── packages.R                                         # R dependencies installer
├── Train_ISF.csv                                      # Training data (~1.9M rows)
├── Test_ISF.csv                                       # Test data (~1.25M rows)
├── hat_y.csv                                          # Final predictions
├── README.md                                          # Project README
└── results/
    ├── lightgbm/                                      # LightGBM model outputs
    └── xgboost/                                       # XGBoost model outputs
```

---

::BackgroundTitle{title="Getting Started"}
::


### Prerequisites

- **R** ≥ 4.0
- Required packages (auto-installed via `packages.R`)

### Installation

```r
# Install all dependencies
source("packages.R")
```

Or manually install key packages:

```r
install.packages(c(
  "tidyverse", "tidymodels", "caret", "glmnet",
  "lme4", "lmerTest", "xgboost", "lightgbm",
  "ranger", "pls", "shapviz", "rBayesianOptimization"
))
```

### Running the Analysis

1. **Open the Quarto document:**
   ```r
   # In RStudio
   rstudioapi::navigateToFile("Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd")
   ```

2. **Render the document:**
   ```r
   quarto::quarto_render("Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd")
   ```

3. **Or run specific sections interactively** using the code chunks in the `.qmd` file

---

::BackgroundTitle{title="Technical Details"}
::

### Data Split Strategy

- **Chronological split** at 80th percentile of dates
- Prevents look-ahead bias and data leakage
- Training: ~1.53M observations
- Validation: ~376K observations
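
A chronological split of this kind can be sketched in base R as below. The helper name `chrono_split` is hypothetical, and taking the 80th percentile over distinct dates (rather than over rows) is an assumption; the column name `obs_date` comes from the dataset description.

```r
# Hypothetical helper: split a panel at a quantile of its distinct dates,
# so every validation row is strictly later than every training row.
chrono_split <- function(df, date_col = "obs_date", prop = 0.8) {
  dates  <- sort(unique(df[[date_col]]))
  cutoff <- dates[ceiling(prop * length(dates))]
  list(
    train  = df[df[[date_col]] <= cutoff, , drop = FALSE],
    valid  = df[df[[date_col]] >  cutoff, , drop = FALSE],
    cutoff = cutoff
  )
}

# Toy panel: 3 assets observed on 10 consecutive days
dates <- seq(as.Date("2021-01-01"), by = "day", length.out = 10)
panel <- data.frame(
  asset_id = rep(1:3, times = 10),
  obs_date = rep(dates, each = 3)
)
parts <- chrono_split(panel)
```

Because the cut is made on dates rather than on randomly sampled rows, no asset's future observations leak into the training set.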

### Hyperparameter Tuning

- **Method:** Bayesian Optimization (Gaussian Processes)
- **Acquisition:** Expected Improvement (UCB)
- **Goal:** Maximize negative RMSE
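
The reason for negating the RMSE is that optimizers such as `rBayesianOptimization::BayesianOptimization()` maximize the `Score` returned by the objective function. The sketch below shows an objective in that `list(Score, Pred)` shape. It is illustrative only: the closed-form ridge on simulated data stands in for the project's models, the name `ridge_objective` is hypothetical, and a plain grid evaluation replaces the actual optimizer so the block runs with base R alone.

```r
set.seed(2025)
n <- 200
X <- cbind(1, matrix(rnorm(n * 5), n, 5))          # toy design with intercept
y <- as.vector(X %*% c(1, 2, -1, 0, 0, 0.5) + rnorm(n))
train <- 1:150
valid <- 151:200

# Objective in the list(Score, Pred) shape; Score is -RMSE so that
# maximizing the score minimizes the validation error.
ridge_objective <- function(log_lambda) {
  lambda <- exp(log_lambda)
  Xt <- X[train, ]
  # Closed-form ridge (intercept penalized too, to keep the toy example short)
  beta <- solve(crossprod(Xt) + lambda * diag(ncol(Xt)),
                crossprod(Xt, y[train]))
  pred <- as.vector(X[valid, ] %*% beta)
  list(Score = -sqrt(mean((pred - y[valid])^2)), Pred = pred)
}

# Stand-in for the optimizer: evaluate the objective on a small grid
grid   <- seq(-6, 6, by = 1)
scores <- vapply(grid, function(l) ridge_objective(l)$Score, numeric(1))
best   <- grid[which.max(scores)]
```

With the real package, the same function would be passed as `FUN` together with `bounds = list(log_lambda = c(-6, 6))`, and the optimizer would choose the evaluation points instead of the fixed grid.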

### Evaluation Metric

**Exponentiated RMSE**, computed on the original (non-log) scale:

$$
RMSE_{real} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \exp(\hat{y}_{\log, i}) - y_i \right)^2}
$$

Models are trained on the log-transformed target for variance stabilization, so predictions are exponentiated before the error is computed.
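
A minimal base-R version of this metric is shown below. The function name `rmse_real` is hypothetical (it mirrors the subscript in the formula), and the three values are made up for the example.

```r
# Hypothetical helper: pred_log holds predictions on the log scale,
# y holds observed implied volatilities on the original scale.
rmse_real <- function(pred_log, y) {
  sqrt(mean((exp(pred_log) - y)^2))
}

y        <- c(20, 35, 50)
pred_log <- log(c(22, 35, 46))  # back-transformed errors: 2, 0, -4
rmse_real(pred_log, y)          # sqrt(20 / 3), about 2.58
```

Because the error is measured after the back-transformation, large volatilities weigh more in the metric than they do in the log-scale training loss.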

---

::BackgroundTitle{title="Key Concepts"}
::

### Financial Theories Applied

1. **Volatility Clustering** – Past volatility predicts future volatility
2. **Variance Risk Premium** – Spread between implied and realized volatility
3. **Fear Gauge** – Put-call ratio as sentiment indicator
4. **Mean Reversion** – Volatility tends to return to long-term average
5. **Liquidity Premium** – Illiquid assets command higher volatility

### Statistical Methods

- Panel data modeling with fixed and random effects
- Principal Component Analysis (PCA)
- Bayesian hyperparameter optimization
- SHAP values for model interpretability

---

::BackgroundTitle{title="Authors"}
::

**Team:**
- Arthur DANJOU
- Camille LEGRAND
- Axelle MERIC
- Moritz VON SIEMENS

**Course:** Classification and Regression (M2)

**Academic Year:** 2025-2026

---

::BackgroundTitle{title="Notes"}
::

- **Computational Constraints:** Some models did not complete on the available hardware (16 GB RAM, CPU-only): Random Forest did not finish training and the MLP failed to converge
- **Reproducibility:** Set `seed = 2025` for consistent results
- **Language:** Analysis documented in English, course materials in French

---

::BackgroundTitle{title="References"}
::

Key R packages used:
- `tidymodels` – Modern modeling framework
- `glmnet` – Regularized regression
- `lme4` / `lmerTest` – Mixed-effects models
- `xgboost` / `lightgbm` – Gradient boosting
- `shapviz` – Model interpretability
- `rBayesianOptimization` – Hyperparameter tuning