Problem Statement

Implied volatility represents the market's forward-looking expectation of an asset's future volatility. Accurate prediction is crucial for:

- Option pricing and valuation
- Risk management and hedging strategies
- Trading strategies based on volatility arbitrage

Dataset

The project uses a comprehensive panel dataset tracking 3,887 assets across 544 observation dates (2019-2022):

| File | Description | Shape |
| --- | --- | --- |
| `Train_ISF.csv` | Training data with target variable | 1,909,465 rows × 21 columns |
| `Test_ISF.csv` | Test data for prediction | 1,251,308 rows × 18 columns |
| `hat_y.csv` | Final predictions from both models | 1,251,308 rows × 2 columns |

Key Variables

Target Variable:

- `implied_vol_ref` – The implied volatility to predict

Feature Categories:

- Identifiers: `asset_id`, `obs_date`
- Market Activity: `call_volume`, `put_volume`, `call_oi`, `put_oi`, `total_contracts`
- Volatility Metrics: `realized_vol_short`, `realized_vol_mid1-3`, `realized_vol_long1-4`, `market_vol_index`
- Option Structure: `strike_dispersion`, `maturity_count`

::BackgroundTitle{title="Methodology"} ::

Data Pipeline

Raw Data
    ↓
┌─────────────────────────────────────────────────────────┐
│  Data Splitting (Chronological 80/20)                   │
│  - Training: 2019-10 to 2021-07                         │
│  - Validation: 2021-07 to 2022-03                       │
└─────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────┐
│  Feature Engineering                                    │
│  - Aggregation of volatility horizons                   │
│  - Creation of financial indicators                     │
└─────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────┐
│  Data Preprocessing (tidymodels)                        │
│  - Winsorization (99.5th percentile)                    │
│  - Log/Yeo-Johnson transformations                      │
│  - Z-score normalization                                │
│  - PCA (95% variance retention)                         │
└─────────────────────────────────────────────────────────┘
    ↓
Three Datasets Generated:
├── Tree-based (raw, scale-invariant)
├── Linear (normalized, winsorized)
└── PCA (dimensionality-reduced)
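
The preprocessing stage can be sketched in base R as follows. This is only an illustration under stated assumptions: the project itself uses tidymodels, whereas this sketch swaps in plain base R functions, replaces the Log/Yeo-Johnson step with a simple `log1p`, and runs on simulated columns rather than `Train_ISF.csv`. The helper `winsorize_upper` is a hypothetical name.

```r
# Simulated right-skewed columns standing in for real features (assumption)
set.seed(2025)
n <- 1000
toy <- data.frame(
  call_volume        = rlnorm(n, meanlog = 5, sdlog = 1.5),
  put_volume         = rlnorm(n, meanlog = 5, sdlog = 1.5),
  realized_vol_short = rlnorm(n, meanlog = 3, sdlog = 0.5)
)

# Upper-tail winsorization: cap values at the 99.5th percentile
winsorize_upper <- function(x, p = 0.995) {
  cap <- quantile(x, probs = p, na.rm = TRUE, names = FALSE)
  pmin(x, cap)
}
capped <- as.data.frame(lapply(toy, winsorize_upper))

# log1p tolerates zeros; scale() then z-scores every column
z <- scale(log1p(as.matrix(capped)))

# PCA on the standardized matrix, keeping the fewest components
# whose cumulative explained variance reaches 95%
pca <- prcomp(z, center = FALSE, scale. = FALSE)
explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(explained >= 0.95)[1]
scores <- pca$x[, seq_len(k), drop = FALSE]
```

In a real pipeline the percentile caps, means, standard deviations, and PCA loadings would be estimated on the training split only and then applied unchanged to the validation and test data.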

Feature Engineering

New financial indicators were created to capture market dynamics:

| Feature | Description | Formula |
| --- | --- | --- |
| `pulse_ratio` | Volatility trend direction | RV_short / RV_long |
| `stress_spread` | Asset vs market stress | RV_short - Market_VIX |
| `put_call_ratio_volume` | Immediate market stress | Put_Volume / Call_Volume |
| `put_call_ratio_oi` | Long-term risk structure | Put_OI / Call_OI |
| `liquidity_ratio` | Market depth | Total_Volume / Total_OI |
| `option_dispersion` | Market uncertainty | Strike_Dispersion / Total_Contracts |
| `put_low_strike` | Downside protection density | Strike_Dispersion / Put_OI |
| `put_proportion` | Hedging vs speculation | Put_Volume / Total_Volume |
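
As a rough illustration, the formulas in the table could be computed like this in base R. Several mappings are assumptions, not taken from the project code: `RV_long` is represented by a single `realized_vol_long1` column, `Market_VIX` by `market_vol_index`, and the totals are taken as calls plus puts. The function `add_indicators` and the two data rows are made up for the example.

```r
# Two made-up rows using the column names listed under Key Variables
toy <- data.frame(
  realized_vol_short = c(30, 18),
  realized_vol_long1 = c(20, 24),   # stands in for RV_long (assumption)
  market_vol_index   = c(22, 22),   # stands in for Market_VIX (assumption)
  call_volume        = c(100, 400),
  put_volume         = c(300, 100),
  call_oi            = c(1000, 2000),
  put_oi             = c(1500, 1000),
  total_contracts    = c(50, 80),
  strike_dispersion  = c(10, 8)
)

add_indicators <- function(d) {
  transform(
    d,
    pulse_ratio           = realized_vol_short / realized_vol_long1,
    stress_spread         = realized_vol_short - market_vol_index,
    put_call_ratio_volume = put_volume / call_volume,
    put_call_ratio_oi     = put_oi / call_oi,
    # assumption: Total_Volume and Total_OI are calls plus puts
    liquidity_ratio       = (call_volume + put_volume) / (call_oi + put_oi),
    option_dispersion     = strike_dispersion / total_contracts,
    put_low_strike        = strike_dispersion / put_oi,
    put_proportion        = put_volume / (call_volume + put_volume)
  )
}

feats <- add_indicators(toy)
feats[, c("pulse_ratio", "stress_spread", "put_proportion")]
```

Ratios with a zero denominator (for example an asset with no call volume on a given date) evaluate to `Inf` or `NaN` in R, so real data would need a guard or a small offset before modeling.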

::BackgroundTitle{title="Models Implemented"} ::

Linear Models

| Model | Description | Best RMSE |
| --- | --- | --- |
| OLS | Ordinary Least Squares | 11.26 |
| Ridge | L2 regularization | 12.48 |
| Lasso | L1 regularization (variable selection) | 12.03 |
| Elastic Net | L1 + L2 combined | ~12.03 |
| PLS | Partial Least Squares (on PCA) | 12.79 |

Linear Mixed-Effects Models (LMM)

Advanced panel data models accounting for asset-specific effects:

| Model | Features | RMSE |
| --- | --- | --- |
| LMM Baseline | All variables + Random Intercept | 8.77 |
| LMM Reduced | Collinearity removal | ~8.77 |
| LMM Interactions | Financial interaction terms | ~8.77 |
| LMM + Quadratic | Convexity terms (vol of vol) | 8.41 |
| LMM + Random Slopes (mod_lmm_5) | Asset-specific betas | 8.38 ⭐ |
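
To make the difference between a random intercept and random slopes concrete, here is a minimal sketch using `lme4` (one of the packages listed in the installation section). It is not the project's `mod_lmm_5`: the panel is simulated, `log_iv` is a hypothetical log-scale target, and only one predictor is used.

```r
library(lme4)

# Simulated panel: 40 assets, 30 dates each, with asset-specific
# intercepts and slopes baked into the data-generating process
set.seed(2025)
n_assets <- 40
n_obs <- 30
asset <- rep(seq_len(n_assets), each = n_obs)
x <- rnorm(n_assets * n_obs)
u0 <- rnorm(n_assets, sd = 0.5)   # per-asset intercept deviations
u1 <- rnorm(n_assets, sd = 0.3)   # per-asset slope deviations
log_iv <- 3 + u0[asset] + (0.6 + u1[asset]) * x + 0.1 * x^2 +
  rnorm(n_assets * n_obs, sd = 0.2)
panel <- data.frame(
  asset_id = factor(asset),
  realized_vol_short = x,
  log_iv = log_iv
)

# Random intercept only: every asset shares the same slope
m_intercept <- lmer(log_iv ~ realized_vol_short + (1 | asset_id), data = panel)

# Quadratic term plus a random slope: each asset gets its own sensitivity
m_slopes <- lmer(
  log_iv ~ realized_vol_short + I(realized_vol_short^2) +
    (1 + realized_vol_short | asset_id),
  data = panel
)

# In-sample RMSE on the log scale, for illustration only
rmse <- function(m) sqrt(mean(residuals(m)^2))
c(intercept = rmse(m_intercept), slopes = rmse(m_slopes))
```

`coef(m_slopes)$asset_id` returns one row of coefficients per asset, which is where the asset-specific betas can be inspected. The in-sample figures printed here are not comparable to the validation RMSE values in the table.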

Tree-Based Models

| Model | Strategy | Validation RMSE | Training RMSE |
| --- | --- | --- | --- |
| XGBoost | Level-wise, Bayesian tuning | 10.70 | 0.57 |
| LightGBM | Leaf-wise, feature regularization | 10.61 ⭐ | 10.90 |
| Random Forest | Bagging | DNF* | - |

*DNF: Did Not Finish (computational constraints)

Neural Networks

| Model | Architecture | Status |
| --- | --- | --- |
| MLP | 128-64 units, tanh activation | Failed to converge |

::BackgroundTitle{title="Results Summary"} ::

Model Comparison

RMSE Performance (Lower is Better)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Linear Mixed-Effects (LMM5)     8.38 ████████████████████ Best Linear
Linear Mixed-Effects (LMM4)     8.41 ███████████████████
Linear Mixed-Effects (Baseline) 8.77 ██████████████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
LightGBM                       10.61 ███████████████ Best Non-Linear
XGBoost                        10.70 ██████████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
OLS (with interactions)        11.26 █████████████
OLS (baseline)                 12.01 ███████████
Lasso                          12.03 ███████████
Ridge                          12.48 ██████████
PLS                            12.79 █████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Key Findings

- Best Linear Model: LMM with Random Slopes (RMSE = 8.38)
  - Captures asset-specific volatility sensitivities
  - Includes quadratic terms for convexity effects
- Best Non-Linear Model: LightGBM (RMSE = 10.61)
  - Superior generalization vs XGBoost
  - Feature regularization prevents overfitting
- Interpretability Insights (SHAP Analysis):
  - realized_vol_mid dominates (57% of gain)
  - Volatility clustering confirmed as primary driver
  - Non-linear regime switching in stress_spread
::BackgroundTitle{title="Repository Structure"} ::

PROJECT/
├── Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd     # Main analysis (Quarto)
├── Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.html    # Rendered report
├── packages.R                                         # R dependencies installer
├── Train_ISF.csv                                      # Training data (~1.9M rows)
├── Test_ISF.csv                                       # Test data (~1.25M rows)
├── hat_y.csv                                          # Final predictions
├── README.md                                          # This file
└── results/
    ├── lightgbm/                                      # LightGBM model outputs
    └── xgboost/                                       # XGBoost model outputs

::BackgroundTitle{title="Getting Started"} ::

Prerequisites

- R ≥ 4.0
- Required packages (auto-installed via `packages.R`)

Installation

```r
# Install all dependencies
source("packages.R")
```

Or manually install key packages:

```r
install.packages(c(
  "tidyverse", "tidymodels", "caret", "glmnet",
  "lme4", "lmerTest", "xgboost", "lightgbm",
  "ranger", "pls", "shapviz", "rBayesianOptimization"
))
```

Running the Analysis

Open the Quarto document:

```r
# In RStudio
rstudioapi::navigateToFile("Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd")
```

Render the document:

```r
quarto::quarto_render("Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd")
```

Or run specific sections interactively using the code chunks in the `.qmd` file.

::BackgroundTitle{title="Technical Details"} ::

Data Split Strategy

- Chronological split at the 80th percentile of observation dates
- Validating only on dates after the training period prevents look-ahead bias and data leakage
- Training: ~1.53M observations
- Validation: ~376K observations
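
A chronological split of this kind might look as follows in base R. The toy panel, its dates, and the choice of `type = 1` in `quantile()` (so that the cutoff is an actual date from the data) are assumptions made for the sketch, not details taken from the project.

```r
# Toy panel: 5 assets observed on 100 consecutive days (simulated)
set.seed(2025)
dates <- seq(as.Date("2019-10-01"), by = "day", length.out = 100)
panel <- data.frame(
  asset_id        = rep(1:5, times = 100),
  obs_date        = rep(dates, each = 5),
  implied_vol_ref = runif(500, min = 10, max = 60)
)

# Cutoff at the 80th percentile of observation dates
cutoff <- as.Date(
  quantile(as.numeric(panel$obs_date), probs = 0.8, type = 1, names = FALSE),
  origin = "1970-01-01"
)

# Everything up to the cutoff trains; everything strictly later validates
train <- panel[panel$obs_date <= cutoff, ]
valid <- panel[panel$obs_date >  cutoff, ]

c(train = nrow(train), valid = nrow(valid))
```

Because rows are assigned by date rather than at random, no validation row can precede a training row, which is what rules out look-ahead bias.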

Hyperparameter Tuning

- Method: Bayesian Optimization (Gaussian Processes)
- Acquisition: Upper Confidence Bound (UCB)
- Goal: Maximize negative RMSE (equivalent to minimizing RMSE)
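
The tuning loop can be illustrated with `rBayesianOptimization` on a deliberately small problem. Everything below is a sketch under assumptions: the objective is a closed-form ridge regression on simulated data (not XGBoost or LightGBM), `neg_rmse` and `log_lambda` are hypothetical names, and `acq = "ucb"` is simply the package default (`"ei"` selects Expected Improvement instead). The package maximizes `Score`, which is why the objective returns the negative RMSE.

```r
library(rBayesianOptimization)

# Simulated regression problem with a fixed 80/20 split
set.seed(2025)
n <- 300
x <- matrix(rnorm(n * 5), nrow = n, ncol = 5)
y <- drop(x %*% c(1, -0.5, 0, 0, 0.3)) + rnorm(n, sd = 0.5)
idx_train <- 1:240
x_train <- x[idx_train, ]
y_train <- y[idx_train]
x_valid <- x[-idx_train, ]
y_valid <- y[-idx_train]

# Objective: closed-form ridge fit, scored by negative validation RMSE
neg_rmse <- function(log_lambda) {
  lambda <- exp(log_lambda)
  beta <- solve(
    crossprod(x_train) + lambda * diag(ncol(x_train)),
    crossprod(x_train, y_train)
  )
  pred <- drop(x_valid %*% beta)
  list(Score = -sqrt(mean((y_valid - pred)^2)), Pred = 0)
}

opt <- BayesianOptimization(
  FUN = neg_rmse,
  bounds = list(log_lambda = c(-5, 5)),
  init_points = 5,   # random evaluations before the Gaussian process is fit
  n_iter = 5,        # guided evaluations chosen by the acquisition function
  acq = "ucb",
  verbose = FALSE
)

opt$Best_Par     # best log_lambda found
opt$Best_Value   # corresponding negative RMSE
```

Searching over `log_lambda` rather than `lambda` keeps the search space roughly uniform across orders of magnitude, a common choice for regularization strengths and learning rates.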

Evaluation Metric

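Exponential RMSE on the original scale:

$$
RMSE_{real} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \exp(\hat{y}_{\log, i}) - y_i \right)^2}
$$

Here $\hat{y}_{\log, i}$ is the prediction on the log scale and $y_i$ is the observed implied volatility. Models are trained on the log-transformed target for variance stabilization, so predictions are exponentiated before the error is computed.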
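
A direct translation of this metric into R, with a hypothetical function name and made-up numbers, could be:

```r
# RMSE on the original scale from log-scale predictions
rmse_original_scale <- function(y_true, y_hat_log) {
  sqrt(mean((exp(y_hat_log) - y_true)^2))
}

y_true    <- c(10, 20)        # observed implied volatilities
y_hat_log <- log(c(12, 18))   # log-scale predictions, i.e. 12 and 18 after exp()

# Errors are +2 and -2, so the RMSE is 2
rmse_original_scale(y_true, y_hat_log)
```

Computing the error after back-transformation keeps the metric in the same units as `implied_vol_ref`, so values are comparable across models regardless of how the target was transformed during training.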

::BackgroundTitle{title="Key Concepts"} ::

Financial Theories Applied

- Volatility Clustering – Past volatility predicts future volatility
- Variance Risk Premium – Spread between implied and realized volatility
- Fear Gauge – Put-call ratio as sentiment indicator
- Mean Reversion – Volatility tends to return to its long-term average
- Liquidity Premium – Options on illiquid assets tend to carry higher implied volatility

Statistical Methods

- Panel data modeling with fixed and random effects
- Principal Component Analysis (PCA)
- Bayesian hyperparameter optimization
- SHAP values for model interpretability

::BackgroundTitle{title="Authors"} ::

Team:

- Arthur DANJOU
- Camille LEGRAND
- Axelle MERIC
- Moritz VON SIEMENS

Course: Classification and Regression (M2)

Academic Year: 2025-2026

::BackgroundTitle{title="Notes"} ::

- Computational Constraints: Random Forest did not finish and the MLP failed to converge on the available hardware (16 GB RAM, CPU-only)
- Reproducibility: Set seed = 2025 for consistent results
- Language: Analysis documented in English, course materials in French

::BackgroundTitle{title="References"} ::

Key R packages used:

- tidymodels – Modern modeling framework
- glmnet – Regularized regression
- lme4 / lmerTest – Mixed-effects models
- xgboost / lightgbm – Gradient boosting
- shapviz – Model interpretability
- rBayesianOptimization – Hyperparameter tuning
