---
slug: implied-volatility-modeling
title: Implied Volatility Surface Modeling
type: Academic Project
description: A large-scale statistical study comparing Generalized Linear Models (GLMs) and black-box machine learning architectures to predict the implied volatility of S&P 500 options.
shortDescription: Predicting the SPX volatility surface using GLMs and black-box models on 1.2 million observations.
publishedAt: 2026-02-28
readingTime: 3
status: In progress
tags:
  - R
  - GLM
  - Finance
  - Machine Learning
icon: i-ph-graph-duotone
---

This project targets high-precision calibration of the **Implied Volatility Surface** using a large-scale dataset of S&P 500 (SPX) European options.

The core objective is to stress-test classic statistical models against modern predictive algorithms. **Generalized Linear Models (GLMs)** provide a transparent baseline, while more complex "black-box" architectures are evaluated on whether their accuracy gains justify reduced interpretability in a risk management context.

::BackgroundTitle{title="Dataset & Scale"}
::

The modeling is performed on a high-dimensional dataset with over **1.2 million observations**.

- **Target Variable**: `implied_vol_ref` (implied volatility).
- **Features**: Option strike price ($K$), underlying asset price ($S$), and time to maturity ($\tau$).
- **Volume**: A training set of $1,251,307$ rows and a test set of identical size.

::BackgroundTitle{title="Modeling Methodology"}
::

The project follows a rigorous statistical pipeline to compare two modeling philosophies:

### 1. The Statistical Baseline (GLM)
Using R's GLM framework, I implement models with targeted link functions and error distributions (such as **Gamma** or **Inverse Gaussian**) to capture the global structure of the volatility surface. These models serve as the benchmark for transparency and stability.

### 2. The Black-Box Challenge
To capture local non-linearities such as the volatility smile and skew, I explore more complex architectures. Performance is evaluated by **Root Mean Squared Error (RMSE)** relative to the GLM baselines.

### 3. Feature Engineering
Key financial indicators are derived from the raw data:
- **Moneyness**: Calculated as the ratio $K/S$.
- **Temporal Dynamics**: Transformations of time to maturity to linearize the term structure.

::BackgroundTitle{title="Evaluation & Reproducibility"}
::

Performance is measured strictly via RMSE on the original scale of the target variable. To ensure reproducibility and precise comparisons across model iterations, a fixed random seed is maintained throughout the workflow.

```r
set.seed(2025)

TrainData <- read.csv("train_ISF.csv", stringsAsFactors = FALSE)
TestX <- read.csv("test_ISF.csv", stringsAsFactors = FALSE)

rmse_eval <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

```

::BackgroundTitle{title="Critical Analysis"}
::

Beyond pure prediction, the project addresses:

- Model Limits: Identifying market regimes where models fail (e.g., deep out-of-the-money options).
- Interpretability: Quantifying the trade-off between complexity and practical utility in a risk management context.
- Future Extensions: Considering richer dynamics, such as historical volatility or skew-specific targets.