IMPLIED VOLATILITY FROM OPTIONS DATA
Preliminary
We begin by loading the necessary packages for this analysis.
library(pacman)
p_load(
  brulee,
  car,
  carData,
  caret,
  class,
  corrplot,
  DataExplorer,
  data.table,
  dplyr,
  fitdistrplus,
  glmnet,
  ggfortify,
  ggplot2,
  glue,
  grid,
  gridExtra,
  hexbin,
  kableExtra,
  MASS,
  lightgbm,
  lme4,
  lmerTest,
  paletteer,
  plotly,
  pls,
  randomForest,
  ranger,
  rBayesianOptimization,
  reshape2,
  rlang,
  ROCR,
  rsample,
  shapviz,
  scales,
  skimr,
  tibble,
  tidyr,
  tidymodels,
  tidyverse,
  xgboost
)
set.seed(2025)
We fix the random seed to ensure the reproducibility of our results. This is a critical step in any data analysis or machine learning pipeline, as it allows others to replicate our findings and verify the robustness of our models.
Introduction
Financial Context & Problem Formulation
Implied volatility is the market’s anticipation of the future level of an underlying asset’s volatility. The most direct way to obtain it is to invert the Black–Scholes formula using observed option prices.
Unlike historical volatility, also known as realized volatility, which measures past price fluctuations, implied volatility is forward-looking: it reflects the risk and uncertainty perceived by investors.
Obtaining an accurate measure of implied volatility is important not only for understanding the market environment and investor sentiment, but also for option pricing, risk management, and hedging.
Predicting implied volatility allows us to better understand and anticipate market dynamics. Accurate forecasts are crucial for pricing options correctly, hedging positions, and improving risk management to reduce unexpected losses. For volatility-based trading strategies, anticipating movements in volatility can generate profitable opportunities.
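The inversion mentioned above can be sketched numerically: since the Black–Scholes call price is strictly increasing in volatility, a one-dimensional root-finder such as uniroot() recovers the implied volatility from an observed price. All numeric inputs below (spot, strike, rate, maturity, price) are illustrative values, not taken from the dataset.

```r
# Sketch: back out implied volatility by inverting the Black-Scholes call price.
bs_call <- function(S, K, r, tau, sigma) {
  d1 <- (log(S / K) + (r + sigma^2 / 2) * tau) / (sigma * sqrt(tau))
  d2 <- d1 - sigma * sqrt(tau)
  S * pnorm(d1) - K * exp(-r * tau) * pnorm(d2)
}

implied_vol <- function(price, S, K, r, tau) {
  # The BS price is strictly increasing in sigma, so the root is unique
  uniroot(
    function(sigma) bs_call(S, K, r, tau, sigma) - price,
    interval = c(1e-4, 5)
  )$root
}

implied_vol(price = 10.45, S = 100, K = 100, r = 0.05, tau = 1)  # ~0.20
```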
Raw Dataset Overview
The dataset used for this study provides a comprehensive view of option market dynamics. It is structured as panel data, tracking multiple underlying assets over a specific time horizon. The raw dataset contains 1,909,465 observations and 19 variables.
data_train <- read_csv("Train_ISF.csv")
test <- read_csv("Test_ISF.csv")
cat("Rows:", nrow(data_train), "\n")
Rows: 1909465
cat("Columns:", ncol(data_train), "\n")
Columns: 19
head(data_train, 10)
# A tibble: 10 × 19
   asset_id obs_date   strike_dispersion call_volume put_volume call_oi  put_oi
   <chr>    <date>                 <dbl>       <dbl>      <dbl>   <dbl>   <dbl>
 1 A        2019-10-14             17.1        19878       1001    9365   12705
 2 AA       2019-10-14              4.42        6194       3010   54013   38674
 3 AABA     2019-10-14              5.23        4301       2863  159161   53453
 4 AAL      2019-10-14              5.28       20034       7695  321062  217407
 5 AAN      2019-10-14              7.99         133        520     621     657
 6 AAOI     2019-10-14              9.13         568        867   13355   29898
 7 AAON     2019-10-14             12.1           60        151     145     933
 8 AAP      2019-10-14              5.04         921        943   11301    7080
 9 AAPL     2019-10-14              2.88       84986      47927 1136840  960145
10 AAT      2019-10-14             12.2            2         73       3      78
# ℹ 12 more variables: maturity_count <dbl>, implied_vol_ref <dbl>,
#   total_contracts <dbl>, realized_vol_short <dbl>, realized_vol_mid1 <dbl>,
#   realized_vol_mid2 <dbl>, realized_vol_mid3 <dbl>, realized_vol_long1 <dbl>,
#   realized_vol_long2 <dbl>, realized_vol_long3 <dbl>,
#   realized_vol_long4 <dbl>, market_vol_index <dbl>
The variables can be categorized into five main groups describing the market conditions:

- Identifiers: asset_id (categorical) and obs_date (temporal).
- Target Variable: implied_vol_ref, representing the implied volatility we aim to predict.
- Market Activity & Liquidity: trading volumes (call_volume, put_volume), open interest (call_oi, put_oi), and the total number of contracts exchanged (total_contracts).
- Volatility Metrics: historical realized volatility at different horizons (realized_vol_short, mid, long) and the global market stress index (market_vol_index).
- Option Structure: variables such as strike_dispersion and maturity_count, which describe the depth and breadth of the available option chain.
nb_assets <- uniqueN(data_train$asset_id)
nb_dates <- uniqueN(data_train$obs_date)

print(paste("Number of assets:", nb_assets))
[1] "Number of assets: 3887"
print(paste("Number of dates:", nb_dates))
[1] "Number of dates: 544"
The dataset covers a universe of 3,887 unique underlying assets across 544 distinct observation dates.
It is important to note that the panel is unbalanced: the theoretical maximum number of observations (3,887 x 544 = 2,114,528) exceeds the actual row count (1,909,465), indicating that not every asset is quoted or has available data on every date in the period. This sparsity is visualized in the following figure, which shows the availability of data points for a subset of assets.
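The sparsity calculation above can be illustrated on a toy panel; the same ratio of observed rows to the full asset-by-date grid, applied to the training data, gives the figure quoted in the text.

```r
library(dplyr)

# Toy panel: asset B is missing on the second date, so the panel is unbalanced
toy <- tibble::tibble(
  asset_id = c("A", "A", "B"),
  obs_date = as.Date(c("2019-10-14", "2019-10-15", "2019-10-14"))
)

# Coverage = observed rows / theoretical asset x date grid
coverage <- nrow(toy) / (n_distinct(toy$asset_id) * n_distinct(toy$obs_date))
coverage  # 3 / (2 * 2) = 0.75

# On the full training data: 1909465 / (3887 * 544) is roughly 0.90
```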
selected_assets <- unique(data_train$asset_id)[100:120]
selected_dates <- unique(data_train$obs_date)[1:100]

data_subset <- data_train |>
  filter(asset_id %in% selected_assets) |>
  filter(obs_date %in% selected_dates)

ggplot(
  data_subset,
  aes(
    x = obs_date,
    y = implied_vol_ref,
    group = asset_id,
    color = as.factor(asset_id)
  )
) +
  geom_line() +
  geom_point(size = 1) +
  theme_minimal() +
  labs(
    title = "Evolution of the Target for Selected Assets over Time",
    x = "Date",
    y = "Value of the Target",
    color = "Asset ID"
  ) +
  theme(legend.position = "right")
Data Pipeline & Exploratory Analysis
Data Splitting Strategy: Preventing Data Leakage
To ensure the reliability of our model and avoid look-ahead bias, the dataset was split into training and validation sets before performing any data manipulation, scaling, or feature engineering.
Instead of splitting randomly, we adopted a chronological splitting strategy. We identified the unique observation dates, sorted them, and established a temporal cutoff at the 80% mark. The first 80% of dates constitute the training set, which is used to learn patterns. The subsequent 20% form the validation set, used to evaluate performance on unseen future data. This approach ensures that the model is evaluated on a future market regime it has never encountered during training. All data transformations were calibrated strictly on the training set alone; the resulting fixed statistical parameters were then applied deterministically to the out-of-sample datasets to guarantee methodological isolation.
all_dates <- sort(unique(data_train$obs_date))

cutoff_index <- floor(length(all_dates) * 0.8)
cutoff_date <- all_dates[cutoff_index]

print(paste("Cutoff Date:", cutoff_date))
[1] "Cutoff Date: 2021-07-20"
train <- data_train |> arrange(obs_date) |> filter(obs_date <= cutoff_date)
val <- data_train |> arrange(obs_date) |> filter(obs_date > cutoff_date)

cat("Training Set Size:", nrow(train), "\n")
Training Set Size: 1533234
cat("Validation Set Size:", nrow(val), "\n")
Validation Set Size: 376231
cat("Last date in the Training set:", as.character(max(train$obs_date)), "\n")
Last date in the Training set: 2021-07-20
cat("First date in the Validation set:", as.character(min(val$obs_date)), "\n")
First date in the Validation set: 2021-07-21
Following this split, the training set covers the period up to 2021-07-20, containing 1,533,234 observations. The validation set begins immediately after, ensuring a continuous timeline without overlap.
Data Cleansing & Outlier Management
Financial market data is inherently characterized by fat-tailed distributions. Metrics such as realized volatility, trading volumes, and strike dispersions frequently exhibit extreme, asymmetric spikes driven by macroeconomic shocks, earnings announcements, or transient liquidity crises. However, before addressing these structural anomalies, the baseline integrity of the dataset must be established.
Foundational Data Integrity: Completeness and Uniqueness
skimr::skim(data_train) |> rmarkdown::paged_table()
A systematic programmatic verification was executed on the raw dataset prior to any feature engineering. This audit confirmed two critical structural properties:
- Absolute Uniqueness: no duplicate rows were detected across the primary temporal and cross-sectional keys (asset_id and obs_date).
- Strict Completeness: the dataset was entirely free of missing values (NA).

The innate completeness of the data made computationally expensive imputation algorithms (such as MICE or KNN) unnecessary during the exploratory phase. Nevertheless, following strict MLOps principles, a median imputation step (step_impute_median()) was retained within the unified tidymodels pipeline. This acts as a robust fail-safe to prevent pipeline crashes during future out-of-sample inference should upstream data pipelines temporarily drop feature payloads. With structural purity confirmed, the primary data quality challenge shifted exclusively to the management of extreme values.
The Mathematical Threat to Linear and Gradient-Based Models
While extreme spikes in volatility or volume are genuine market phenomena rather than measurement errors, their presence poses a severe mathematical threat to specific families of predictive algorithms. Models relying on distance metrics and continuous optimization, such as Ordinary Least Squares (OLS), penalized regressions (Elastic Net), and Multi-Layer Perceptrons (MLP), are highly sensitive to these outliers for two primary reasons:

- Quadratic Loss Functions: algorithms optimizing Mean Squared Error (MSE) heavily penalize large deviations. A single extreme outlier forces the algorithm to shift the entire regression hyperplane to minimize that localized error, destroying the model’s generalization capability on the remaining 99% of “normal” observations.
- Scaling Distortion: linear models mathematically require strict feature standardization (Z-score normalization). If raw outliers are left untreated, they artificially inflate the standard deviation (σ) of the feature. Consequently, the normalized values of the vast majority of the data are compressed into an extremely narrow band around zero, erasing the predictive signal and preventing the L1/L2 regularizations from operating fairly.
Dynamic Thresholding Strategy
To neutralize this threat without discarding valuable rows of data (which would disrupt the temporal continuity of the dataset), we implemented a rigorous thresholding strategy known as Winsorization (or capping).
Instead of applying arbitrary fixed values, the limits are dynamically calculated from the empirical distribution of the data:

- Single-Sided Capping: for strictly positive features characterized by severe right-skewness (e.g., realized_vol_short, put_call_ratio_volume, strike_dispersion), all values exceeding the \(99.5\)th percentile are capped at that exact threshold.
- Dual-Sided Winsorization: for symmetrical features that can take both negative and positive values (e.g., stress_spread), values are constrained within the \(0.5\)th and \(99.5\)th percentiles.

Algorithmic Differentiation and Data Leakage Prevention
A fundamental MLOps principle enforced in this study is the strict prevention of data leakage. The \(99.5\)th and \(0.5\)th percentiles are computed exclusively on the training set. These “frozen” thresholds are subsequently applied to the validation and hidden test sets, ensuring that the out-of-sample evaluations remain statistically robust and untainted by future information.
Furthermore, acknowledging the divergent mechanics of machine learning algorithms, this capping procedure is intentionally isolated. As detailed in the architecture overview (Section 2.6), the Winsorization step is applied only to the linear and neural network pipelines (rec_linear and rec_pca). Tree-based models (LightGBM, XGBoost) operate via orthogonal, axis-aligned splits and naturally isolate extreme values into terminal leaves. Feeding them raw, un-capped data allows the boosting ensembles to fully capture the true non-linear magnitude of extreme market stress without suffering from the gradient distortion that afflicts linear architectures.
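The frozen-threshold mechanics can be sketched on toy vectors (the variable names here are illustrative): the limit is estimated once on the training distribution, then re-applied unchanged to out-of-sample data.

```r
set.seed(2025)
train_x <- c(rnorm(1000), 50)  # training feature with one extreme spike
val_x <- c(rnorm(200), 80)     # validation feature with an even larger spike

# Threshold computed exclusively on the training distribution
upper <- quantile(train_x, probs = 0.995, na.rm = TRUE)

train_capped <- pmin(train_x, upper)
val_capped <- pmin(val_x, upper)  # same frozen limit: no re-estimation on val

max(val_capped) == upper  # TRUE: the out-of-sample spike is capped at the train threshold
```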
calc_upper_limit <- function(x) {
  quantile(x, 0.995, na.rm = TRUE)
}

calc_dual_limits <- function(x) {
  quantile(x, probs = c(0.005, 0.995), na.rm = TRUE)
}

clip_max_func <- function(x, var_name, stats_list) {
  limit <- stats_list[[var_name]]
  pmin(x, limit)
}

clip_dual_func <- function(x, var_name, stats_list) {
  limits <- stats_list[[var_name]]
  pmax(pmin(x, limits[2]), limits[1])
}

vars_clip_max <- c(
  "realized_vol_short",
  "realized_vol_mid",
  "realized_vol_long",
  "put_volume",
  "call_volume",
  "put_oi",
  "call_oi",
  "strike_dispersion",
  "total_contracts",
  "pulse_ratio",
  "put_call_ratio_volume",
  "put_call_ratio_oi",
  "liquidity_ratio",
  "option_dispersion",
  "put_low_strike"
)

vars_clip_dual <- c(
  "stress_spread"
)
Feature Engineering
Prior to constructing advanced indicators, we streamlined the input space by averaging the granular mid-term and long-term realized volatility measures. This aggregation mitigates multicollinearity among highly correlated historical horizons while preserving the integrity of the volatility signal.
Then, to capture the complex dynamics of option markets, we constructed a series of advanced financial indicators. These features are designed to isolate specific phenomena: market stress, liquidity constraints, asymmetric fear, or the structure of speculation.
Volatility Regime Indicators
Pulse Ratio
This indicator gives the direction of volatility: whether it is currently increasing or decreasing. Absolute volatility levels alone would not reveal this tendency, because volatility levels differ significantly across underlying assets.
When there are massive market moves, investors tend to panic and buy options for protection, which pushes implied volatility higher.
\[
\text{Pulse Ratio} = \dfrac{\text{Realized Short Volatility}}{\text{Realized Long Volatility}}
\]
Stress Spread
The Stress Spread assesses whether the underlying is aligned with overall market stress as represented by the VIX, the market volatility index. It measures the underlying’s stress relative to market stress, i.e. whether the underlying is more or less volatile than the market.
If the Stress Spread increases, the underlying becomes more volatile than the market: its volatility is driven not only by market conditions but also by idiosyncratic risk.
\[
\text{Stress Spread} = \text{Realized Short Volatility} - \text{Market Volatility Index}
\]
Market Sentiment Indicators
Sentiment analysis is based on the dichotomy between immediate flow (Volume) and the stock of positions (Open Interest).
Put-Call Ratio on Volume
The Put-Call Volume Ratio is an indicator of immediate market stress. Volatility is skewed because fear is asymmetric: the fear of the downside is stronger than the desire to speculate on an upside.
A strong increase in the Put-Call Volume Ratio means markets are panicking and therefore buying protection via put options. This higher demand leads to higher prices, meaning higher implied volatility.
\[
\text{Put-Call Ratio Volume} = \frac{\text{Put Volume}}{\text{Call Volume}}
\]
Put-Call Ratio on Open Interest
This is an indicator of “risk structure,” or long-term conviction.
\[
\text{Put-Call Ratio OI} = \frac{\text{Put Open Interest}}{\text{Call Open Interest}}
\]
Put Proportion
The Put Proportion measures the balance between put buyers and call buyers. It detects whether trading activity is dominated by hedging or by market optimism.
Because fear is asymmetric, a surge in put buying signals urgent hedging and leads to higher implied volatility. A surge in call buying also raises implied volatility, but less than a comparable surge in puts.
\[
\text{Put Proportion} = \frac{\text{Put Volume}}{\text{Total Volume}}
\]
Structure and Liquidity Indicators
Liquidity Ratio
The Liquidity Ratio measures the speed at which contracts are being exchanged relative to the total number of contracts outstanding in the market. It helps determine whether the market is overheating or experiencing a blockage.
The Liquidity Ratio shows that volatility is also affected by the liquidity of the underlying asset. Unusually high activity amplifies price movements, whereas unusually low activity means that even small orders can shift the price up or down. In both cases, the imbalance between flow and market depth justifies a higher volatility forecast.
\[
\text{Liquidity Ratio} = \frac{\text{Total Volume}}{\text{Total Open Interest}}
\]
Option Dispersion
Option Dispersion indicates whether activity is concentrated around a specific strike, i.e. a specific scenario.
A concentrated market means that liquidity is high at a certain strike, order flow is predictable, and the spread, which can be seen as an uncertainty premium, decreases.
Orders split across a very large number of strikes mean the market struggles to determine the direction of the underlying. This results in lower liquidity, meaning the market has a lower capacity to absorb large orders.
\[
\text{Option Dispersion} = \frac{\text{Strike Dispersion}}{\text{Total Contracts}}
\]
Put Low Strike (Liquidity Trap)
This indicator focuses on the density of “Out-of-the-Money” protection.
It helps identify “breaking points”: if a mass of contracts concentrates on a specific strike (high ratio), it creates a “trap.” If the asset price approaches this level, the forced hedging mechanisms of market makers can trigger a volatility explosion via a snowball effect.
\[
\text{Put Low Strike} = \frac{\text{Strike Dispersion}}{\text{Put Open Interest}}
\]
create_features <- function(df) {
  epsilon <- 1e-6

  df_enriched <- df |>
    mutate(
      realized_vol_mid = (realized_vol_mid1 +
        realized_vol_mid2 +
        realized_vol_mid3) /
        3,
      realized_vol_long = (realized_vol_long1 +
        realized_vol_long2 +
        realized_vol_long3 +
        realized_vol_long4) /
        4
    ) |>
    dplyr::select(
      -realized_vol_mid1,
      -realized_vol_mid2,
      -realized_vol_mid3,
      -realized_vol_long1,
      -realized_vol_long2,
      -realized_vol_long3,
      -realized_vol_long4
    ) |>
    mutate(
      pulse_ratio = realized_vol_short / (realized_vol_long + epsilon),

      put_call_ratio_volume = put_volume / (call_volume + epsilon),
      put_call_ratio_oi = put_oi / (call_oi + epsilon),
      liquidity_ratio = (put_volume + call_volume) /
        (call_oi + put_oi + epsilon),

      option_dispersion = strike_dispersion / (total_contracts + epsilon),
      put_low_strike = strike_dispersion / (put_oi + epsilon),
      put_proportion = put_volume / (put_volume + call_volume + epsilon),

      stress_spread = realized_vol_short - market_vol_index
    )

  return(df_enriched)
}

train_eng <- create_features(train)
val_eng <- create_features(val)
test_eng <- create_features(test)
Data Scaling & Normalization
Following the isolation and neutralization of extreme outliers (Winsorization), the core distributions of the financial predictors remain inherently skewed. Standard financial variables, such as realized volatility and trading volumes, are strictly positive and typically follow log-normal distributions. While tree-based ensembles are indifferent to such monotonic skewness, linear frameworks, distance-based algorithms, and neural networks require symmetric, standardized feature spaces to ensure gradient stability and unbiased coefficient estimation.
To fulfill these strict algorithmic prerequisites, we engineered a multi-step transformation pipeline using targeted statistical mappings.
Asymmetry Correction: Logarithmic and Power Transformations
The first objective of the transformation phase is to center the data mass and correct severe right-skewness, thereby approximating a Gaussian distribution. We employed two distinct mathematical approaches depending on the domain of the predictors:

1. Logarithmic Transformation for Bounded Features: for strictly positive, highly skewed variables (e.g., realized_vol_short, put_call_ratio_volume, strike_dispersion), a natural logarithmic transformation \(f(x) = \log(x + c)\) was applied. This mapping compresses the long right tail while expanding the lower range, stabilizing the variance across the feature space (homoscedasticity). To prevent mathematically undefined operations (\(\log(0)\)) on features like volumes, a strictly positive offset (\(c = 1\)) was systematically added prior to transformation.
2. Yeo-Johnson Transformation for Unbounded Features: certain engineered features, such as stress_spread and vol_slope, cross the zero bound, taking both negative and positive values depending on market contango or backwardation regimes. The standard logarithmic and Box-Cox transformations are mathematically invalid for negative inputs. Consequently, we applied the Yeo-Johnson power transformation, a generalization of Box-Cox that smoothly handles the entire real line \(\mathbb{R}\). The optimal transformation parameter (\(\lambda\)) for each variable was estimated by maximum likelihood strictly on the training set.
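These two corrections can be sketched as a tidymodels recipe on toy data; the column names below are illustrative, and the offset and \(\lambda\) handling mirrors the steps described above (the normalization step from the next subsection is included so the recipe is complete).

```r
library(recipes)

set.seed(2025)
toy_train <- data.frame(
  y = rnorm(100),
  x_pos = rexp(100),   # strictly positive, right-skewed
  x_real = rnorm(100)  # crosses zero
)

rec <- recipe(y ~ ., data = toy_train) |>
  step_log(x_pos, offset = 1) |>            # log(x + 1) for skewed positives
  step_YeoJohnson(x_real) |>                # lambda estimated at prep() time
  step_normalize(all_numeric_predictors())  # z-score; mu and sigma frozen at prep()

rec_prepped <- prep(rec, training = toy_train)
baked <- bake(rec_prepped, new_data = toy_train)

round(colMeans(baked[c("x_pos", "x_real")]), 8)  # ~0 after normalization
```

Because the learned parameters live in the prepped recipe, bake(rec_prepped, new_data = ...) applies exactly the same transformations to validation or test data without re-estimation.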
Feature Standardization (\(Z\)-Score Normalization)
Once the distributions were geometrically symmetrized, the final mathematical prerequisite for the linear and neural pipelines was absolute scale homogenization.
Financial features are inherently measured in vastly different units: implied volatility is expressed as a percentage, strike dispersion in nominal currency, and ratios as unitless scalars. If passed raw to a penalized regression (such as Elastic Net), the algorithm’s objective function would unfairly penalize features with large absolute numerical values while ignoring the coefficients of small-scale variables, regardless of their actual predictive power.
To enforce mathematical equity, a strict \(Z\)-score normalization was applied to all numerical predictors:
\[
z = \frac{x - \mu}{\sigma}
\]
This operation rescales every feature to a mean of zero (\(\mu = 0\)) and unit variance (\(\sigma = 1\)). Consequently, the \(L_1\) (Lasso) and \(L_2\) (Ridge) regularization penalties evaluate features purely on their predictive signal rather than their arbitrary measurement scale.
Algorithmic Execution and Methodological Integrity
Consistent with our strict temporal isolation strategy, the parameters governing these transformations, namely the Yeo-Johnson \(\lambda\) values, the empirical means \(\mu\), and the standard deviations \(\sigma\), were computed exclusively on the training set. These static parameters were encapsulated within the tidymodels recipe state and deterministically projected onto the validation and test sets via the step_log(), step_YeoJohnson(), and step_normalize() functions.
Furthermore, as established in our architectural blueprint (Section 2.6), these transformations were explicitly omitted from the tree-based pipeline (rec_tree) to preserve the natural, interpretable scale of the financial metrics for post-hoc SHAP analysis.
Exploratory Data Analysis (EDA)
Target Variable Analysis
Before implementing any regression model, we conducted a thorough statistical analysis of the target variable, implied_vol_ref. Understanding the distribution of the target is crucial, as linear models generally assume that the residuals follow a normal distribution.
We first plotted the summary and the empirical density of the target. The initial histogram revealed a strictly positive, right-skewed distribution with a heavy tail, characteristic of financial volatility data.
target <- train_eng$implied_vol_ref
summary(target)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00   28.84   41.33   47.15   59.22  149.00
target_dist <- ggplot(train_eng, aes(x = implied_vol_ref)) +
  geom_histogram(
    aes(y = after_stat(density)),
    binwidth = 1,
    fill = "#E4CBF9",
    color = "white",
    lwd = 0.1,
    alpha = 0.8
  ) + # draw the histogram
  geom_line(stat = "density", color = "#983399", linewidth = 1) +
  theme_minimal() +
  labs(title = "Target distribution", x = "Target", y = "Density")
-ggplotly(target_dist)
-To identify the most appropriate theoretical distribution families, we used the Cullen and Frey graph. We performed a bootstrap analysis to assess the robustness of the statistical properties.
-descdist(sample(target, 5000), boot = 100)
summary statistics
-------
-min: 1.03 max: 145.5
-median: 41.26
-mean: 47.44376
-estimated sd: 25.80176
-estimated skewness: 1.179019
-estimated kurtosis: 4.303189
-We proceeded to fit the three candidate distributions (Gamma, Log-normal, and Beta) to the data using the Maximum Likelihood Estimation (MLE) method.
-n <- length(target)
-
-# Rescale the target to the open interval (0, 1) so that a Beta distribution
-# can be fitted; the (x * (n - 1) + 0.5) / n adjustment keeps values strictly
-# away from the boundaries 0 and 1.
-target_norm <- (target - min(target)) / (max(target) - min(target))
-target_norm <- (target_norm * (n - 1) + 0.5) / n
-
-fit_g_norm <- fitdist(target_norm, "gamma")
-fit_ln_norm <- fitdist(target_norm, "lnorm")
-fit_b_norm <- fitdist(target_norm, "beta")
-
-statistique_norm <- gofstat(list(fit_g_norm, fit_b_norm, fit_ln_norm))
-print(statistique_norm)
Goodness-of-fit statistics
- 1-mle-gamma 2-mle-beta 3-mle-lnorm
-Kolmogorov-Smirnov statistic 0.0269811 7.736077e-02 0.0234458
-Cramer-von Mises statistic 399.8647559 3.381029e+03 288.8318332
-Anderson-Darling statistic 2490.5876688 Inf 2331.8231830
-
-Goodness-of-fit criteria
- 1-mle-gamma 2-mle-beta 3-mle-lnorm
-Akaike's Information Criterion -1423631 -1247091 -1359067
-Bayesian Information Criterion -1423607 -1247066 -1359043
-The Beta distribution yielded incoherent results. Since the Beta distribution is bounded on the interval [0,1], it struggled to capture the tail dynamics of the volatility, even after normalization. The fitting process showed poor convergence and failed to represent the data structure adequately. Consequently, we discarded the Beta distribution.
-fit_g <- fitdist(target, "gamma")
-fit_ln <- fitdist(target, "lnorm")
-
-statistique <- gofstat(list(fit_g, fit_ln))
-print(statistique)
Goodness-of-fit statistics
- 1-mle-gamma 2-mle-lnorm
-Kolmogorov-Smirnov statistic 2.903834e-02 1.810247e-02
-Cramer-von Mises statistic 4.619020e+02 1.675541e+02
-Anderson-Darling statistic 2.820148e+03 1.503318e+03
-
-Goodness-of-fit criteria
- 1-mle-gamma 2-mle-lnorm
-Akaike's Information Criterion 13899334 13931707
-Bayesian Information Criterion 13899359 13931732
-The final comparison was made between the Gamma and Log-normal distributions. We evaluated them using both statistical information criteria (AIC and BIC) and visual inspection.
-While the two distributions minimized different statistical criteria, the graphical analysis provided a decisive conclusion. We superimposed the theoretical density curves of both distributions onto the empirical histogram of the target.
-denscomp(
- list(fit_g, fit_ln),
- legendtext = c("Gamma", "Lognorm"),
- fitcol = c("#983399", "#E4CBF9"),
- fitlwd = c(2, 2),
- fitlty = c(1, 1)
-)
-As illustrated in the figure above, the Log-normal distribution fits the empirical data almost perfectly, capturing both the peak and the fat tail of the implied volatility.
-Based on this analysis, we conclude that \(\log(\texttt{implied\_vol\_ref})\) follows a Normal distribution. This justifies the use of a log-transformation on the target variable for our Linear Models, ensuring that the normality assumption of the regression is respected.
-Analysis of other relevant features
-Beyond the target variable itself, our modeling strategy relies on three fundamental pillars: the asset’s memory, the systemic environment, and the market structure. We analyze their relationship with implied volatility below.
-Historical Volatility: The Anchor Effect
-The short-term realized volatility acts as the “memory” of the asset. Financial theory suggests a strong autocorrelation known as “volatility clustering”.
-g1 <- ggplot(
- train_eng,
- aes(x = realized_vol_short, y = log(implied_vol_ref))
-) +
- geom_hex(bins = 70) +
- scale_fill_viridis_c() +
- geom_abline(
- intercept = 0,
- slope = 1,
- color = "#A3D5FF",
- linetype = "dashed"
- ) +
- labs(
- title = "Implied vs Realized Volatility",
- x = "Realized Vol Short",
- y = "Log(Implied Vol)"
- ) +
- theme_minimal()
-
-plot(g1)
This plot reveals a dense, elliptical cloud of points that indicates a robust positive linear relationship (\(Correlation \approx 0.8\)). This strong correlation confirms that market pricing is heavily anchored to the asset’s recent physical behavior, meaning implied volatility is rarely disconnected from its realized counterpart. However, the dispersion observed in the graph demonstrates that implied volatility is not merely a perfect copy of the past; the vertical spread represents the “Variance Risk Premium” investors pay for future uncertainty, which is precisely the variation our model aims to capture using additional features.
-Market Volatility Index: The Systemic Driver
-This variable represents the “tide that lifts all boats”. We analyze how the average implied volatility of our 3,887 assets correlates with the global market stress.
-daily_stats <- train_eng |>
- group_by(obs_date) |>
- summarise(
- Avg_Implied = mean(implied_vol_ref, na.rm = TRUE),
- Market_Index = mean(market_vol_index, na.rm = TRUE)
- )
-
-g2 <- ggplot(daily_stats, aes(x = as.Date(obs_date))) +
- geom_line(aes(y = Avg_Implied, color = "Average Asset Vol"), size = 0.8) +
- geom_line(
- aes(y = Market_Index, color = "Market Vol Index"),
- size = 0.8,
- linetype = "dashed"
- ) +
- labs(
- title = "Systemic Risk Correlation",
- x = "Date",
- y = "Volatility Level",
- color = "Legend"
- ) +
- scale_color_manual(
- values = c("Average Asset Vol" = "#983399", "Market Vol Index" = "#E4CBF9")
- ) +
- theme_minimal() +
- theme(legend.position = "bottom")
-
-plot(g2)
This second plot highlights the synchronization between idiosyncratic and systemic risk, particularly during crisis events. The massive spike in early 2020, corresponding to the COVID-19 crisis, is clearly visible on both curves; when the Market Index (represented by the light purple dashed line) jumps, the average asset volatility (the dark purple line) reacts instantly and violently. This visualisation demonstrates a strong regime dependency: in high-stress environments, correlations tend towards 1 as macro-factors dominate market behavior, whereas in calmer periods, such as late 2020, the curves flatten and diverge slightly, allowing asset-specific drivers to take precedence over systemic panic.
-Multicollinearity & Dimensionality Reduction (PCA)
-Despite the initial pairwise correlation filter applied in the base recipe (removing features with a Spearman correlation above \(0.90\)), residual multicollinearity inevitably persists within the financial feature space. While penalized regressions (Lasso, Ridge) are mathematically equipped to handle this redundancy through their regularization norms, unpenalized models like Ordinary Least Squares (OLS) and deep architectures like Multi-Layer Perceptrons (MLP) remain highly vulnerable.
-To provide a mathematically optimal feature space for these specific algorithms, we implemented a dimensionality reduction phase using Principal Component Analysis (PCA).
-Mathematical Justification
--
-
1. Stabilizing Unpenalized Linear Models (OLS): In standard OLS regression, the parameter vector is estimated via the normal equation: \(\hat{\beta} = (X^T X)^{-1} X^T Y\). If residual multicollinearity exists, the covariance matrix \(X^T X\) becomes ill-conditioned (approaching singularity). This mathematical instability leads to an inflated Variance Inflation Factor (VIF), making the coefficient estimates highly erratic and hypersensitive to minor variations in the training data. PCA projects the original features into a new subspace of strictly orthogonal (zero-correlation) principal components, guaranteeing a perfectly invertible covariance matrix.
-2. Optimizing Neural Network Convergence (MLP): For Multi-Layer Perceptrons, feeding highly correlated inputs leads to elongated, non-isotropic error surfaces. This forces the Stochastic Gradient Descent (SGD) algorithm to oscillate inefficiently, slowing down convergence and increasing the risk of trapping the network in local minima. By supplying orthogonal components, PCA ensures a symmetrical error topology, accelerating gradient convergence and stabilizing the weight updates.
-
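-The stabilization argument in point 1 can be illustrated with a short simulation (illustrative only; the data and variable names below are synthetic, not the report’s dataset). A nearly duplicated predictor inflates the Variance Inflation Factor, whereas regressing on the orthogonal principal components brings every VIF back to 1:

```r
library(car)  # for vif(); car is loaded in the preamble

set.seed(1)
n  <- 5000
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # nearly a copy of x1 -> strong collinearity
x3 <- rnorm(n)
y  <- x1 + x3 + rnorm(n)

# VIFs on the raw, collinear design: those of x1 and x2 explode (~1 / (1 - R^2))
vif_raw <- car::vif(lm(y ~ x1 + x2 + x3))

# VIFs on the orthogonal principal components: all exactly 1
pcs <- prcomp(cbind(x1, x2, x3), center = TRUE, scale. = TRUE)$x
vif_pca <- car::vif(lm(y ~ pcs[, 1] + pcs[, 2] + pcs[, 3]))
```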
Execution and Variance Thresholding
-Because PCA seeks to maximize projected variance, it is fundamentally scale-sensitive. If applied to raw data, features with naturally large nominal values (such as unscaled volumes) would disproportionately dominate the principal components, regardless of their actual informational value. Therefore, this step is exclusively applied after the rigorous \(Z\)-score standardization detailed in Section 2.4.
-Rather than selecting an arbitrary number of components, we dynamically threshold the PCA to retain exactly \(95\%\) of the cumulative explained variance. This approach acts as a secondary, mathematical feature selection: it captures the core structural signal of the market while discarding the remaining \(5\%\) of variance as idiosyncratic, stochastic noise.
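-The dynamic thresholding logic can be sketched in a few lines of base R (a toy example on simulated data; in the pipeline itself this selection is delegated to $step_pca$ with threshold = 0.95):

```r
set.seed(2)
# Ten simulated features where columns 6-10 nearly duplicate columns 1-5,
# so roughly five components should carry almost all of the variance.
X <- matrix(rnorm(1000 * 10), ncol = 10)
X[, 6:10] <- X[, 1:5] + matrix(rnorm(1000 * 5, sd = 0.1), ncol = 5)

pca     <- prcomp(X, center = TRUE, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components reaching 95% of cumulative explained variance
k <- which(cum_var >= 0.95)[1]
```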
-Methodological Isolation
-Maintaining strict adherence to our statistical isolation protocols, the PCA projection matrix (the eigenvectors and eigenvalues) is computed exclusively using the covariance structure of the standardized training set. These fixed geometric rotations are then deterministically applied to the validation and test sets via the $step_pca$ function, ensuring that no future variance distributions leak into the model’s structural parameters.
Unified Data Pipeline Implementation (tidymodels)
-To ensure strict methodological rigor, guarantee reproducibility, and absolutely prevent any data leakage from the validation and test sets, all the theoretical preprocessing steps detailed in sections 2.2 through 2.6 were computationally encapsulated into a unified, sequential pipeline. Leveraging the recipes package from the tidymodels ecosystem, we constructed a multi-branch data blueprint. This architecture explicitly creates three distinct datasets, each mathematically optimized for a specific family of machine learning algorithms.
--
-
-1. The Base Recipe and Tree-Based Dataset ($rec_tree$): The foundation of our pipeline ($rec_base$) begins by assigning the target role to the log-transformed implied volatility (log_implied_vol). Missing values are handled via median imputation, and an aggressive multicollinearity filter ($step_corr$ at \(0.90\)) removes redundant information. This base recipe directly forms our first dataset: the Tree-Based Dataset. Decision trees (LightGBM, XGBoost) operate via orthogonal splits and are scale-invariant. To preserve the natural financial interpretability of the predictors for post-hoc SHAP analysis, this dataset intentionally bypasses all clipping, geometric transformations, and standardization steps.
-2. The Classic Linear Dataset ($rec_linear$): Distance-based and penalized algorithms (Lasso, Ridge, Elastic Net) require symmetric, standardized, and outlier-free feature spaces. Branching off from the base recipe, we create our second dataset: the Classic Linear Dataset. This branch dynamically applies our Winsorization thresholds (computed strictly on the training set) via $step_mutate$ to neutralize extreme outliers. Subsequently, right-skewed variables undergo a logarithmic transformation, while unbounded spreads undergo a Yeo-Johnson transformation. Finally, strict \(Z\)-score normalization ($step_normalize$) is applied to enforce mathematical equity among features prior to \(L_1/L_2\) penalization.
-3. The Dimensionality Reduction Dataset ($rec_acp$): Our third dataset targets algorithms highly sensitive to even minor multicollinearity and dense gradient spaces, such as Ordinary Least Squares (OLS) or Multi-Layer Perceptrons (MLP). Derived directly from the fully scaled linear dataset, this branch incorporates a Principal Component Analysis ($step_pca$). The transformation is calibrated to retain exactly \(95\%\) of the underlying training variance, providing a perfectly orthogonal and dimensionally reduced feature space.
-4. State Execution and Strict Data Isolation ($prep()$ and $bake()$): The theoretical guarantee against data leakage is computationally enforced during the pipeline’s execution phase. The $prep()$ function estimates all statistical parameters (medians, limits, variances, PCA eigenvectors) strictly from train_data. The $bake()$ function then deterministically projects these frozen transformations onto the validation and test sets, generating the three final data triads ($_tree$, $_linear$, $_acp$) used for modeling.
-
-This architecture is implemented in the code below: the Winsorization limits are first computed strictly on the training set, and the three recipe branches are then defined.
-stats_max <- lapply(train_eng[vars_clip_max], calc_upper_limit)
-stats_dual <- lapply(train_eng[vars_clip_dual], calc_dual_limits)
-
-rec_base <- recipe(implied_vol_ref ~ ., data = train_eng) |>
- update_role(asset_id, obs_date, new_role = "id") |>
- step_impute_median(all_numeric_predictors())
-
-rec_tree <- rec_base |>
- step_corr(all_numeric_predictors(), threshold = 0.90)
-
-rec_linear <- rec_base |>
- step_mutate(
- realized_vol_short = clip_max_func(
- realized_vol_short,
- "realized_vol_short",
- stats_max
- ),
- realized_vol_mid = clip_max_func(
- realized_vol_mid,
- "realized_vol_mid",
- stats_max
- ),
- realized_vol_long = clip_max_func(
- realized_vol_long,
- "realized_vol_long",
- stats_max
- ),
- put_volume = clip_max_func(put_volume, "put_volume", stats_max),
- call_volume = clip_max_func(call_volume, "call_volume", stats_max),
- put_oi = clip_max_func(put_oi, "put_oi", stats_max),
- call_oi = clip_max_func(call_oi, "call_oi", stats_max),
- strike_dispersion = clip_max_func(
- strike_dispersion,
- "strike_dispersion",
- stats_max
- ),
- total_contracts = clip_max_func(
- total_contracts,
- "total_contracts",
- stats_max
- ),
- pulse_ratio = clip_max_func(pulse_ratio, "pulse_ratio", stats_max),
- put_call_ratio_volume = clip_max_func(
- put_call_ratio_volume,
- "put_call_ratio_volume",
- stats_max
- ),
- put_call_ratio_oi = clip_max_func(
- put_call_ratio_oi,
- "put_call_ratio_oi",
- stats_max
- ),
- liquidity_ratio = clip_max_func(
- liquidity_ratio,
- "liquidity_ratio",
- stats_max
- ),
- option_dispersion = clip_max_func(
- option_dispersion,
- "option_dispersion",
- stats_max
- ),
- put_low_strike = clip_max_func(put_low_strike, "put_low_strike", stats_max),
- stress_spread = clip_dual_func(stress_spread, "stress_spread", stats_dual)
- ) |>
- step_log(
- any_of(c(
- "realized_vol_short",
- "realized_vol_mid",
- "realized_vol_long",
- "put_volume",
- "call_volume",
- "put_oi",
- "call_oi",
- "pulse_ratio",
- "put_call_ratio_volume",
- "put_call_ratio_oi",
- "liquidity_ratio"
- )),
- offset = 1
- ) |>
- step_log(
- any_of(c(
- "strike_dispersion",
- "total_contracts",
- "option_dispersion",
- "put_low_strike"
- )),
- offset = 0
- ) |>
- step_YeoJohnson(
- any_of(c(
- "stress_spread"
- ))
- ) |>
- step_normalize(all_numeric_predictors())
-
-rec_pca <- rec_linear |>
-  step_pca(all_numeric_predictors(), threshold = 0.95)
State Execution and Strict Data Isolation ($prep()$ and $bake()$)
-prep_rec_tree <- prep(rec_tree, training = train_eng)
-
-train_tree <- bake(prep_rec_tree, new_data = NULL)
-val_tree <- bake(prep_rec_tree, new_data = val_eng)
-test_tree <- bake(prep_rec_tree, new_data = test_eng)
-
-prep_rec_linear <- prep(rec_linear, training = train_eng)
-
-train_linear <- bake(prep_rec_linear, new_data = NULL)
-val_linear <- bake(prep_rec_linear, new_data = val_eng)
-test_linear <- bake(prep_rec_linear, new_data = test_eng)
-
-prep_rec_pca <- prep(rec_pca, training = train_eng)
-
-train_pca <- bake(prep_rec_pca, new_data = NULL)
-val_pca <- bake(prep_rec_pca, new_data = val_eng)
-test_pca <- bake(prep_rec_pca, new_data = test_eng)
Experimental Framework & Optimization Strategy
-Evaluation Metrics: The Exponential RMSE
-The primary objective of this study is to minimize the forecasting error of implied volatility. The standard metric for continuous predictive tasks is the Root Mean Squared Error (RMSE), which heavily penalizes large deviations due to its quadratic loss function.
-However, a critical methodological adjustment is required due to our preprocessing architecture. As established in Section 2.4, all models were strictly trained on the natural logarithm of the implied volatility (\(\log(Y)\)) to stabilize variance and correct the right-skewness of the financial distribution. Evaluating the models on this logarithmic scale would yield artificially compressed error metrics that fail to reflect the true financial magnitude of the forecasting errors.
-Therefore, to guarantee absolute fairness across all models (linear, tree-based, and neural networks) and to assess performance in the true operational domain, the predictions (\(\hat{y}_{\log}\)) must be exponentially transformed prior to calculating the final validation and test errors.
-The evaluation metric is strictly defined as the Exponential RMSE on the original scale:
-\[
-RMSE_{real} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \exp(\hat{y}_{\log, i}) - Y_i \right)^2}
-\]
-This metric ensures that the bias-variance trade-off is evaluated exactly as it would impact a real-world financial portfolio, where absolute volatility spreads dictate pricing and risk management decisions.
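-Concretely, the metric reduces to a one-line helper (the function name rmse_exp is ours, for illustration):

```r
# Exponential RMSE: back-transform log-scale predictions to the original
# volatility scale before computing the quadratic error.
rmse_exp <- function(y_hat_log, y_true) {
  sqrt(mean((exp(y_hat_log) - y_true)^2))
}

# Sanity check: perfect log-scale predictions give (numerically) zero error.
rmse_exp(log(c(10, 20, 30)), c(10, 20, 30))
```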
-Hyperparameter Tuning: Grid Search vs. Bayesian Optimization
-Financial machine learning models, whether penalized regressions like Elastic Net or complex ensembles like LightGBM, rely heavily on hyperparameter configurations. To discover the optimal parameter set, we must navigate the computational trade-off between exhaustive search and hardware constraints.
--
-
1. The Limits of Grid Search: Traditional Grid Search operates by defining a discrete matrix of hyperparameter combinations and evaluating the objective function (validation RMSE) for every single point. While mathematically exhaustive, this approach scales exponentially with the number of dimensions (the curse of dimensionality). For high-capacity models requiring the simultaneous optimization of learning rates, depths, structural penalties, and bagging fractions, Grid Search becomes computationally intractable, resulting in an immense waste of CPU cycles evaluating areas of the hyperparameter space that yield poor performance.
-2. The Probabilistic Superiority of Bayesian Optimization: To resolve this computational bottleneck, we discarded Grid Search in favor of Bayesian Optimization using Gaussian Processes (GP). Unlike naive grid or random searches, Bayesian Optimization treats the hyperparameter tuning process as a probabilistic regression problem.
-
The algorithm builds a surrogate probability model of the objective function (the validation RMSE) based on past evaluations. At each iteration, it uses an acquisition function (typically Expected Improvement) to determine the next set of hyperparameters to evaluate. This acquisition function mathematically balances two competing objectives:
-- Exploration: sampling regions of the hyperparameter space with high uncertainty.
-- Exploitation: sampling regions where the surrogate model predicts a very low RMSE.
-By learning from previous iterations, Bayesian Optimization converges toward the global minimum significantly faster and with fewer total evaluations than exhaustive methods. This probabilistic efficiency was crucial for tuning our tree-based ensembles (Section 5) on dense tabular data without exceeding the processing limits of our local computational infrastructure. All hyperparameter tuning loops in this study were executed using this Bayesian framework, maximizing the negative Exponential RMSE to pinpoint the optimal architectural configurations.
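-The optimization loop can be sketched with the rBayesianOptimization package loaded in the preamble. The objective below is a synthetic stand-in for “fit the model, return the negative validation Exponential RMSE”; the parameter names are illustrative, not the study’s actual search space:

```r
library(rBayesianOptimization)

# Synthetic objective: a smooth bowl whose minimum sits at
# learning_rate = 0.1, num_leaves = 63. Score is the NEGATIVE pseudo-RMSE,
# so maximizing Score is equivalent to minimizing the error.
obj_fun <- function(learning_rate, num_leaves) {
  pseudo_rmse <- (learning_rate - 0.1)^2 + ((num_leaves - 63) / 100)^2
  list(Score = -pseudo_rmse, Pred = 0)
}

set.seed(2025)
opt <- BayesianOptimization(
  obj_fun,
  bounds = list(learning_rate = c(0.01, 0.3), num_leaves = c(15, 255)),
  init_points = 4,   # random warm-up evaluations
  n_iter = 8,        # Gaussian-process guided iterations
  acq = "ei",        # Expected Improvement acquisition function
  verbose = FALSE
)

opt$Best_Par   # hyperparameters of the best evaluation found
```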
-Linear & Interpretable Models
-Linear regressions on the regular dataset
-Baseline Linear Regression
-train_linear_lm <- train_linear |> dplyr::select(-asset_id, -obs_date)
-train_linear_lm$implied_vol_ref <- log(train_linear_lm$implied_vol_ref)
-
-val_linear_lm <- val_linear |> dplyr::select(-asset_id, -obs_date)
-val_linear_lm$implied_vol_ref <- log(val_linear_lm$implied_vol_ref)
Linear Hypotheses
-mod1 <- lm(implied_vol_ref ~ ., data = train_linear_lm)
-summary(mod1)
-After the logarithmic transformation, we use a basic linear model as a benchmark. We validate the linear-regression assumptions visually with autoplot.
-autoplot(mod1)
-While the sheer volume of residual points makes interpretation difficult, the blue trend lines visually support our P1-P4 hypotheses. On the Q-Q plot, the central bulk of the points follows the y = x line, so we tentatively validate the normality assumption; only the left tail drops below the line and the right tail rises above it, which can be attributed to outliers at the extremities of the dataset. We now examine the evolution of the AIC with a stepwise selection method.
-stepAIC(mod1, ~., trace = TRUE, direction = "forward")
-stepAIC(mod1, ~., trace = TRUE, direction = "backward")
-
-Y_val <- val_linear_lm$implied_vol_ref
-Y_hat1 <- predict(mod1, newdata = val_linear_lm, type = "response")
-
-MSS_1 <- mean((exp(Y_val) - exp(Y_hat1))**2)
-
-print(paste0("The RMSE is: ", sqrt(MSS_1)))
-We first look at a general model with all features but without any interactions. The objective of this exercise was to identify the variables with little impact on the model’s predictive capability, in order to eliminate them from our future models and kickstart the variable selection. Only one variable produced a Fisher-test p-value above 0.05, namely \(call\_volume\). When testing with a stepwise method, both forward and backward, the complete model was judged the best. However, in order to introduce interactions without stretching the machine’s limits, we decided to eliminate both \(realized\_vol\_long\) and \(call\_volume\), \(realized\_vol\_long\) being the only other variable with a p-value above \(10^{-3}\). The RMSE of the general model was 12.01.
-First particular model
-mod2 <- lm(
- implied_vol_ref ~ realized_vol_short *
- realized_vol_mid *
- strike_dispersion *
- (put_volume +
- call_oi +
- put_oi +
- maturity_count +
- total_contracts +
- market_vol_index +
- pulse_ratio +
- put_call_ratio_volume +
- put_call_ratio_oi +
- liquidity_ratio +
- option_dispersion +
- put_low_strike +
- put_proportion +
- stress_spread),
- data = train_linear_lm
-)
-summary(mod2)
-
-Y_hat2 <- predict(mod2, newdata = val_linear_lm, type = "response")
-Y_val <- val_linear_lm$implied_vol_ref
-
-MSS_2 <- mean((exp(Y_val) - exp(Y_hat2))**2)
-
-print(paste0("The RMSE is: ", sqrt(MSS_2)))
-Given that all remaining variables showed high relevance for the modeling task, finding the right interactions was based on intuition as well as trial and error. We remove the aforementioned two variables and add the interactions between \(realized\_vol\_short\), \(realized\_vol\_mid\), \(strike\_dispersion\) and all other variables. This seems intuitive, as realized volatility is a straightforward indicator of implied volatility: measuring its interactions with the other features shows how those features modulate volatility, and the generalization from realized to implied volatility is more direct than from the other variables. \(strike\_dispersion\) is selected as a first representative of the option-related features. This model drastically improves upon the general, interaction-free model, with an RMSE of 11.32.
-Second particular model
-mod3 <- lm(
- implied_vol_ref ~ realized_vol_mid *
- strike_dispersion *
- option_dispersion *
- (market_vol_index +
- liquidity_ratio +
- realized_vol_short +
- put_volume +
- call_oi +
- put_oi +
- maturity_count +
- total_contracts +
- pulse_ratio +
- put_call_ratio_volume +
- put_call_ratio_oi +
- option_dispersion +
- put_low_strike +
- put_proportion +
- stress_spread),
- data = train_linear_lm
-)
-summary(mod3)
-
-Y_hat3 <- predict(mod3, newdata = val_linear_lm, type = "response")
-Y_val <- val_linear_lm$implied_vol_ref
-
-MSS_3 <- mean((exp(Y_val) - exp(Y_hat3))**2)
-
-print(paste0("The RMSE is: ", sqrt(MSS_3)))
-The following thought occurred to us: could a single realized-volatility feature interacting with all others suffice? Because of the size of the dataset, we could not add interactions indiscriminately, which turned the variable selection process into an interaction selection process. We tested a model with the interactions between \(realized\_vol\_mid\), \(strike\_dispersion\), \(option\_dispersion\) and all other variables. This model improved upon the general model but was less precise than the previous particular model, with an RMSE of 11.42.
-Third particular model
-mod4 <- lm(
- implied_vol_ref ~ market_vol_index *
- realized_vol_mid *
- strike_dispersion *
- (realized_vol_short +
- put_volume +
- call_oi +
- put_oi +
- maturity_count +
- total_contracts +
- pulse_ratio +
- put_call_ratio_volume +
- put_call_ratio_oi +
- liquidity_ratio +
- option_dispersion +
- put_low_strike +
- put_proportion +
- stress_spread),
- data = train_linear_lm
-)
-summary(mod4)
-Call:
-lm(formula = implied_vol_ref ~ market_vol_index * realized_vol_mid *
- strike_dispersion * (realized_vol_short + put_volume + call_oi +
- put_oi + maturity_count + total_contracts + pulse_ratio +
- put_call_ratio_volume + put_call_ratio_oi + liquidity_ratio +
- option_dispersion + put_low_strike + put_proportion + stress_spread),
- data = train_linear_lm)
-
-Residuals:
- Min 1Q Median 3Q Max
--4.5906 -0.1230 -0.0010 0.1256 3.7556
-
-Coefficients:
- Estimate
-(Intercept) 3.6599088
-market_vol_index -0.0126850
-realized_vol_mid -0.0045858
-strike_dispersion 0.2916720
-realized_vol_short 0.5922568
-put_volume 0.0470339
-call_oi -0.0022644
-put_oi -0.0668333
-maturity_count 0.0033262
-total_contracts -0.1122176
-pulse_ratio -0.2229992
-put_call_ratio_volume 0.0096309
-put_call_ratio_oi 0.0139323
-liquidity_ratio -0.0165726
-option_dispersion -0.3294306
-put_low_strike -0.0020143
-put_proportion -0.0216820
-stress_spread -0.0770662
-market_vol_index:realized_vol_mid -0.0517138
-market_vol_index:strike_dispersion 0.0242089
-realized_vol_mid:strike_dispersion 0.0117204
-market_vol_index:realized_vol_short 0.0071117
-market_vol_index:put_volume -0.0036265
-market_vol_index:call_oi -0.0036837
-market_vol_index:put_oi 0.0124329
-market_vol_index:maturity_count -0.0035557
-market_vol_index:total_contracts -0.0255472
-market_vol_index:pulse_ratio 0.0341983
-market_vol_index:put_call_ratio_volume -0.0038803
-market_vol_index:put_call_ratio_oi -0.0006246
-market_vol_index:liquidity_ratio 0.0063348
-market_vol_index:option_dispersion -0.0564068
-market_vol_index:put_low_strike 0.0173188
-market_vol_index:put_proportion 0.0023782
-market_vol_index:stress_spread 0.0003182
-realized_vol_mid:realized_vol_short 0.0492174
-realized_vol_mid:put_volume 0.0151216
-realized_vol_mid:call_oi 0.0377888
-realized_vol_mid:put_oi -0.0386033
-realized_vol_mid:maturity_count -0.0266407
-realized_vol_mid:total_contracts 0.0952750
-realized_vol_mid:pulse_ratio 0.0067576
-realized_vol_mid:put_call_ratio_volume 0.0111785
-realized_vol_mid:put_call_ratio_oi -0.0017824
-realized_vol_mid:liquidity_ratio -0.0104341
-realized_vol_mid:option_dispersion 0.0861263
-realized_vol_mid:put_low_strike -0.0378495
-realized_vol_mid:put_proportion -0.0078994
-realized_vol_mid:stress_spread -0.0597938
-strike_dispersion:realized_vol_short 0.0335257
-strike_dispersion:put_volume 0.0053345
-strike_dispersion:call_oi -0.0007664
-strike_dispersion:put_oi 0.0163583
-strike_dispersion:maturity_count 0.0281394
-strike_dispersion:total_contracts -0.0707158
-strike_dispersion:pulse_ratio -0.0059344
-strike_dispersion:put_call_ratio_volume -0.0012166
-strike_dispersion:put_call_ratio_oi 0.0024963
-strike_dispersion:liquidity_ratio 0.0057028
-strike_dispersion:option_dispersion -0.0302037
-strike_dispersion:put_low_strike 0.0155818
-strike_dispersion:put_proportion -0.0034315
-strike_dispersion:stress_spread -0.0266549
-market_vol_index:realized_vol_mid:strike_dispersion 0.0001718
-market_vol_index:realized_vol_mid:realized_vol_short 0.0073971
-market_vol_index:realized_vol_mid:put_volume -0.0061054
-market_vol_index:realized_vol_mid:call_oi -0.0065984
-market_vol_index:realized_vol_mid:put_oi 0.0231759
-market_vol_index:realized_vol_mid:maturity_count 0.0062026
-market_vol_index:realized_vol_mid:total_contracts -0.0072373
-market_vol_index:realized_vol_mid:pulse_ratio 0.0024777
-market_vol_index:realized_vol_mid:put_call_ratio_volume -0.0005433
-market_vol_index:realized_vol_mid:put_call_ratio_oi 0.0015789
-market_vol_index:realized_vol_mid:liquidity_ratio -0.0015917
-market_vol_index:realized_vol_mid:option_dispersion -0.0026361
-market_vol_index:realized_vol_mid:put_low_strike 0.0121182
-market_vol_index:realized_vol_mid:put_proportion 0.0030698
-market_vol_index:realized_vol_mid:stress_spread 0.0062935
-market_vol_index:strike_dispersion:realized_vol_short -0.0068279
-market_vol_index:strike_dispersion:put_volume -0.0027751
-market_vol_index:strike_dispersion:call_oi -0.0041980
-market_vol_index:strike_dispersion:put_oi -0.0042709
-market_vol_index:strike_dispersion:maturity_count -0.0066489
-market_vol_index:strike_dispersion:total_contracts 0.0051983
-market_vol_index:strike_dispersion:pulse_ratio 0.0010537
-market_vol_index:strike_dispersion:put_call_ratio_volume 0.0005396
-market_vol_index:strike_dispersion:put_call_ratio_oi -0.0012008
-market_vol_index:strike_dispersion:liquidity_ratio -0.0034812
-market_vol_index:strike_dispersion:option_dispersion -0.0046580
-market_vol_index:strike_dispersion:put_low_strike -0.0073667
-market_vol_index:strike_dispersion:put_proportion -0.0017369
-market_vol_index:strike_dispersion:stress_spread -0.0002573
-realized_vol_mid:strike_dispersion:realized_vol_short 0.0109381
-realized_vol_mid:strike_dispersion:put_volume -0.0066819
-realized_vol_mid:strike_dispersion:call_oi 0.0128811
-realized_vol_mid:strike_dispersion:put_oi 0.0110468
-realized_vol_mid:strike_dispersion:maturity_count 0.0059875
-realized_vol_mid:strike_dispersion:total_contracts -0.0105222
-realized_vol_mid:strike_dispersion:pulse_ratio 0.0086745
-realized_vol_mid:strike_dispersion:put_call_ratio_volume 0.0016179
-realized_vol_mid:strike_dispersion:put_call_ratio_oi -0.0021449
-realized_vol_mid:strike_dispersion:liquidity_ratio 0.0007098
-realized_vol_mid:strike_dispersion:option_dispersion 0.0129298
-realized_vol_mid:strike_dispersion:put_low_strike 0.0026190
-realized_vol_mid:strike_dispersion:put_proportion 0.0059383
-realized_vol_mid:strike_dispersion:stress_spread -0.0045061
-market_vol_index:realized_vol_mid:strike_dispersion:realized_vol_short 0.0106188
-market_vol_index:realized_vol_mid:strike_dispersion:put_volume 0.0029074
-market_vol_index:realized_vol_mid:strike_dispersion:call_oi 0.0003358
-market_vol_index:realized_vol_mid:strike_dispersion:put_oi -0.0013165
-market_vol_index:realized_vol_mid:strike_dispersion:maturity_count 0.0055705
-market_vol_index:realized_vol_mid:strike_dispersion:total_contracts -0.0261061
-market_vol_index:realized_vol_mid:strike_dispersion:pulse_ratio -0.0091059
-market_vol_index:realized_vol_mid:strike_dispersion:put_call_ratio_volume -0.0005469
-market_vol_index:realized_vol_mid:strike_dispersion:put_call_ratio_oi 0.0007077
-market_vol_index:realized_vol_mid:strike_dispersion:liquidity_ratio -0.0014683
-market_vol_index:realized_vol_mid:strike_dispersion:option_dispersion -0.0241698
-market_vol_index:realized_vol_mid:strike_dispersion:put_low_strike 0.0018223
-market_vol_index:realized_vol_mid:strike_dispersion:put_proportion 0.0009296
-market_vol_index:realized_vol_mid:strike_dispersion:stress_spread 0.0012156
- Std. Error
-(Intercept) 0.0004425
-market_vol_index 0.0009575
-realized_vol_mid 0.0007645
-strike_dispersion 0.0056724
-realized_vol_short 0.0017430
-put_volume 0.0010158
-call_oi 0.0009061
-put_oi 0.0011265
-maturity_count 0.0004960
-total_contracts 0.0053052
-pulse_ratio 0.0004931
-put_call_ratio_volume 0.0006479
-put_call_ratio_oi 0.0006530
-liquidity_ratio 0.0004054
-option_dispersion 0.0100279
-put_low_strike 0.0006396
-put_proportion 0.0005094
-stress_spread 0.0017457
-market_vol_index:realized_vol_mid 0.0010083
-market_vol_index:strike_dispersion 0.0061632
-realized_vol_mid:strike_dispersion 0.0051933
-market_vol_index:realized_vol_short 0.0011768
-market_vol_index:put_volume 0.0010721
-market_vol_index:call_oi 0.0007694
-market_vol_index:put_oi 0.0012196
-market_vol_index:maturity_count 0.0005702
-market_vol_index:total_contracts 0.0056958
-market_vol_index:pulse_ratio 0.0004737
-market_vol_index:put_call_ratio_volume 0.0004600
-market_vol_index:put_call_ratio_oi 0.0004606
-market_vol_index:liquidity_ratio 0.0003864
-market_vol_index:option_dispersion 0.0107983
-market_vol_index:put_low_strike 0.0006681
-market_vol_index:put_proportion 0.0004760
-market_vol_index:stress_spread 0.0003809
-realized_vol_mid:realized_vol_short 0.0006906
-realized_vol_mid:put_volume 0.0009527
-realized_vol_mid:call_oi 0.0007889
-realized_vol_mid:put_oi 0.0010691
-realized_vol_mid:maturity_count 0.0004938
-realized_vol_mid:total_contracts 0.0048759
-realized_vol_mid:pulse_ratio 0.0003325
-realized_vol_mid:put_call_ratio_volume 0.0004510
-realized_vol_mid:put_call_ratio_oi 0.0004545
-realized_vol_mid:liquidity_ratio 0.0003850
-realized_vol_mid:option_dispersion 0.0091457
-realized_vol_mid:put_low_strike 0.0006215
-realized_vol_mid:put_proportion 0.0004742
-realized_vol_mid:stress_spread 0.0006742
-strike_dispersion:realized_vol_short 0.0016526
-strike_dispersion:put_volume 0.0009816
-strike_dispersion:call_oi 0.0008453
-strike_dispersion:put_oi 0.0010904
-strike_dispersion:maturity_count 0.0004287
-strike_dispersion:total_contracts 0.0011110
-strike_dispersion:pulse_ratio 0.0004919
-strike_dispersion:put_call_ratio_volume 0.0006163
-strike_dispersion:put_call_ratio_oi 0.0006110
-strike_dispersion:liquidity_ratio 0.0003899
-strike_dispersion:option_dispersion 0.0009638
-strike_dispersion:put_low_strike 0.0005679
-strike_dispersion:put_proportion 0.0005029
-strike_dispersion:stress_spread 0.0016014
-market_vol_index:realized_vol_mid:strike_dispersion 0.0054271
-market_vol_index:realized_vol_mid:realized_vol_short 0.0004767
-market_vol_index:realized_vol_mid:put_volume 0.0009875
-market_vol_index:realized_vol_mid:call_oi 0.0006936
-market_vol_index:realized_vol_mid:put_oi 0.0011018
-market_vol_index:realized_vol_mid:maturity_count 0.0004888
-market_vol_index:realized_vol_mid:total_contracts 0.0050053
-market_vol_index:realized_vol_mid:pulse_ratio 0.0003078
-market_vol_index:realized_vol_mid:put_call_ratio_volume 0.0004338
-market_vol_index:realized_vol_mid:put_call_ratio_oi 0.0004330
-market_vol_index:realized_vol_mid:liquidity_ratio 0.0003596
-market_vol_index:realized_vol_mid:option_dispersion 0.0094431
-market_vol_index:realized_vol_mid:put_low_strike 0.0005928
-market_vol_index:realized_vol_mid:put_proportion 0.0004475
-market_vol_index:realized_vol_mid:stress_spread 0.0003147
-market_vol_index:strike_dispersion:realized_vol_short 0.0012178
-market_vol_index:strike_dispersion:put_volume 0.0010071
-market_vol_index:strike_dispersion:call_oi 0.0007007
-market_vol_index:strike_dispersion:put_oi 0.0011734
-market_vol_index:strike_dispersion:maturity_count 0.0004192
-market_vol_index:strike_dispersion:total_contracts 0.0010577
-market_vol_index:strike_dispersion:pulse_ratio 0.0004755
-market_vol_index:strike_dispersion:put_call_ratio_volume 0.0004035
-market_vol_index:strike_dispersion:put_call_ratio_oi 0.0003982
-market_vol_index:strike_dispersion:liquidity_ratio 0.0003553
-market_vol_index:strike_dispersion:option_dispersion 0.0009145
-market_vol_index:strike_dispersion:put_low_strike 0.0005954
-market_vol_index:strike_dispersion:put_proportion 0.0004490
-market_vol_index:strike_dispersion:stress_spread 0.0003695
-realized_vol_mid:strike_dispersion:realized_vol_short 0.0005537
-realized_vol_mid:strike_dispersion:put_volume 0.0007668
-realized_vol_mid:strike_dispersion:call_oi 0.0006374
-realized_vol_mid:strike_dispersion:put_oi 0.0008762
-realized_vol_mid:strike_dispersion:maturity_count 0.0003502
-realized_vol_mid:strike_dispersion:total_contracts 0.0008224
-realized_vol_mid:strike_dispersion:pulse_ratio 0.0002809
-realized_vol_mid:strike_dispersion:put_call_ratio_volume 0.0003734
-realized_vol_mid:strike_dispersion:put_call_ratio_oi 0.0003838
-realized_vol_mid:strike_dispersion:liquidity_ratio 0.0003100
-realized_vol_mid:strike_dispersion:option_dispersion 0.0007656
-realized_vol_mid:strike_dispersion:put_low_strike 0.0004706
-realized_vol_mid:strike_dispersion:put_proportion 0.0003858
-realized_vol_mid:strike_dispersion:stress_spread 0.0005582
-market_vol_index:realized_vol_mid:strike_dispersion:realized_vol_short 0.0003844
-market_vol_index:realized_vol_mid:strike_dispersion:put_volume 0.0008397
-market_vol_index:realized_vol_mid:strike_dispersion:call_oi 0.0006260
-market_vol_index:realized_vol_mid:strike_dispersion:put_oi 0.0009684
-market_vol_index:realized_vol_mid:strike_dispersion:maturity_count 0.0003514
-market_vol_index:realized_vol_mid:strike_dispersion:total_contracts 0.0008171
-market_vol_index:realized_vol_mid:strike_dispersion:pulse_ratio 0.0002558
-market_vol_index:realized_vol_mid:strike_dispersion:put_call_ratio_volume 0.0003748
-market_vol_index:realized_vol_mid:strike_dispersion:put_call_ratio_oi 0.0003735
-market_vol_index:realized_vol_mid:strike_dispersion:liquidity_ratio 0.0003115
-market_vol_index:realized_vol_mid:strike_dispersion:option_dispersion 0.0007605
-market_vol_index:realized_vol_mid:strike_dispersion:put_low_strike 0.0004944
-market_vol_index:realized_vol_mid:strike_dispersion:put_proportion 0.0003994
-market_vol_index:realized_vol_mid:strike_dispersion:stress_spread 0.0002648
- t value
-(Intercept) 8271.085
-market_vol_index -13.248
-realized_vol_mid -5.999
-strike_dispersion 51.420
-realized_vol_short 339.795
-put_volume 46.301
-call_oi -2.499
-put_oi -59.330
-maturity_count 6.705
-total_contracts -21.152
-pulse_ratio -452.276
-put_call_ratio_volume 14.865
-put_call_ratio_oi 21.335
-liquidity_ratio -40.885
-option_dispersion -32.851
-put_low_strike -3.149
-put_proportion -42.564
-stress_spread -44.145
-market_vol_index:realized_vol_mid -51.290
-market_vol_index:strike_dispersion 3.928
-realized_vol_mid:strike_dispersion 2.257
-market_vol_index:realized_vol_short 6.043
-market_vol_index:put_volume -3.383
-market_vol_index:call_oi -4.788
-market_vol_index:put_oi 10.194
-market_vol_index:maturity_count -6.236
-market_vol_index:total_contracts -4.485
-market_vol_index:pulse_ratio 72.198
-market_vol_index:put_call_ratio_volume -8.435
-market_vol_index:put_call_ratio_oi -1.356
-market_vol_index:liquidity_ratio 16.395
-market_vol_index:option_dispersion -5.224
-market_vol_index:put_low_strike 25.923
-market_vol_index:put_proportion 4.996
-market_vol_index:stress_spread 0.835
-realized_vol_mid:realized_vol_short 71.267
-realized_vol_mid:put_volume 15.872
-realized_vol_mid:call_oi 47.903
-realized_vol_mid:put_oi -36.109
-realized_vol_mid:maturity_count -53.947
-realized_vol_mid:total_contracts 19.540
-realized_vol_mid:pulse_ratio 20.322
-realized_vol_mid:put_call_ratio_volume 24.788
-realized_vol_mid:put_call_ratio_oi -3.921
-realized_vol_mid:liquidity_ratio -27.101
-realized_vol_mid:option_dispersion 9.417
-realized_vol_mid:put_low_strike -60.901
-realized_vol_mid:put_proportion -16.659
-realized_vol_mid:stress_spread -88.694
-strike_dispersion:realized_vol_short 20.286
-strike_dispersion:put_volume 5.435
-strike_dispersion:call_oi -0.907
-strike_dispersion:put_oi 15.003
-strike_dispersion:maturity_count 65.634
-strike_dispersion:total_contracts -63.652
-strike_dispersion:pulse_ratio -12.064
-strike_dispersion:put_call_ratio_volume -1.974
-strike_dispersion:put_call_ratio_oi 4.086
-strike_dispersion:liquidity_ratio 14.626
-strike_dispersion:option_dispersion -31.337
-strike_dispersion:put_low_strike 27.437
-strike_dispersion:put_proportion -6.823
-strike_dispersion:stress_spread -16.645
-market_vol_index:realized_vol_mid:strike_dispersion 0.032
-market_vol_index:realized_vol_mid:realized_vol_short 15.519
-market_vol_index:realized_vol_mid:put_volume -6.183
-market_vol_index:realized_vol_mid:call_oi -9.514
-market_vol_index:realized_vol_mid:put_oi 21.035
-market_vol_index:realized_vol_mid:maturity_count 12.689
-market_vol_index:realized_vol_mid:total_contracts -1.446
-market_vol_index:realized_vol_mid:pulse_ratio 8.048
-market_vol_index:realized_vol_mid:put_call_ratio_volume -1.252
-market_vol_index:realized_vol_mid:put_call_ratio_oi 3.646
-market_vol_index:realized_vol_mid:liquidity_ratio -4.427
-market_vol_index:realized_vol_mid:option_dispersion -0.279
-market_vol_index:realized_vol_mid:put_low_strike 20.442
-market_vol_index:realized_vol_mid:put_proportion 6.860
-market_vol_index:realized_vol_mid:stress_spread 20.001
-market_vol_index:strike_dispersion:realized_vol_short -5.607
-market_vol_index:strike_dispersion:put_volume -2.755
-market_vol_index:strike_dispersion:call_oi -5.991
-market_vol_index:strike_dispersion:put_oi -3.640
-market_vol_index:strike_dispersion:maturity_count -15.860
-market_vol_index:strike_dispersion:total_contracts 4.915
-market_vol_index:strike_dispersion:pulse_ratio 2.216
-market_vol_index:strike_dispersion:put_call_ratio_volume 1.337
-market_vol_index:strike_dispersion:put_call_ratio_oi -3.015
-market_vol_index:strike_dispersion:liquidity_ratio -9.798
-market_vol_index:strike_dispersion:option_dispersion -5.094
-market_vol_index:strike_dispersion:put_low_strike -12.373
-market_vol_index:strike_dispersion:put_proportion -3.868
-market_vol_index:strike_dispersion:stress_spread -0.697
-realized_vol_mid:strike_dispersion:realized_vol_short 19.754
-realized_vol_mid:strike_dispersion:put_volume -8.715
-realized_vol_mid:strike_dispersion:call_oi 20.207
-realized_vol_mid:strike_dispersion:put_oi 12.607
-realized_vol_mid:strike_dispersion:maturity_count 17.096
-realized_vol_mid:strike_dispersion:total_contracts -12.794
-realized_vol_mid:strike_dispersion:pulse_ratio 30.881
-realized_vol_mid:strike_dispersion:put_call_ratio_volume 4.333
-realized_vol_mid:strike_dispersion:put_call_ratio_oi -5.588
-realized_vol_mid:strike_dispersion:liquidity_ratio 2.290
-realized_vol_mid:strike_dispersion:option_dispersion 16.888
-realized_vol_mid:strike_dispersion:put_low_strike 5.565
-realized_vol_mid:strike_dispersion:put_proportion 15.393
-realized_vol_mid:strike_dispersion:stress_spread -8.073
-market_vol_index:realized_vol_mid:strike_dispersion:realized_vol_short 27.626
-market_vol_index:realized_vol_mid:strike_dispersion:put_volume 3.462
-market_vol_index:realized_vol_mid:strike_dispersion:call_oi 0.536
-market_vol_index:realized_vol_mid:strike_dispersion:put_oi -1.359
-market_vol_index:realized_vol_mid:strike_dispersion:maturity_count 15.852
-market_vol_index:realized_vol_mid:strike_dispersion:total_contracts -31.948
-market_vol_index:realized_vol_mid:strike_dispersion:pulse_ratio -35.603
-market_vol_index:realized_vol_mid:strike_dispersion:put_call_ratio_volume -1.459
-market_vol_index:realized_vol_mid:strike_dispersion:put_call_ratio_oi 1.895
-market_vol_index:realized_vol_mid:strike_dispersion:liquidity_ratio -4.713
-market_vol_index:realized_vol_mid:strike_dispersion:option_dispersion -31.782
-market_vol_index:realized_vol_mid:strike_dispersion:put_low_strike 3.686
-market_vol_index:realized_vol_mid:strike_dispersion:put_proportion 2.327
-market_vol_index:realized_vol_mid:strike_dispersion:stress_spread 4.591
- Pr(>|t|)
-(Intercept) < 2e-16
-market_vol_index < 2e-16
-realized_vol_mid 1.99e-09
-strike_dispersion < 2e-16
-realized_vol_short < 2e-16
-put_volume < 2e-16
-call_oi 0.012456
-put_oi < 2e-16
-maturity_count 2.01e-11
-total_contracts < 2e-16
-pulse_ratio < 2e-16
-put_call_ratio_volume < 2e-16
-put_call_ratio_oi < 2e-16
-liquidity_ratio < 2e-16
-option_dispersion < 2e-16
-put_low_strike 0.001637
-put_proportion < 2e-16
-stress_spread < 2e-16
-market_vol_index:realized_vol_mid < 2e-16
-market_vol_index:strike_dispersion 8.57e-05
-realized_vol_mid:strike_dispersion 0.024018
-market_vol_index:realized_vol_short 1.51e-09
-market_vol_index:put_volume 0.000718
-market_vol_index:call_oi 1.69e-06
-market_vol_index:put_oi < 2e-16
-market_vol_index:maturity_count 4.50e-10
-market_vol_index:total_contracts 7.28e-06
-market_vol_index:pulse_ratio < 2e-16
-market_vol_index:put_call_ratio_volume < 2e-16
-market_vol_index:put_call_ratio_oi 0.175051
-market_vol_index:liquidity_ratio < 2e-16
-market_vol_index:option_dispersion 1.75e-07
-market_vol_index:put_low_strike < 2e-16
-market_vol_index:put_proportion 5.86e-07
-market_vol_index:stress_spread 0.403479
-realized_vol_mid:realized_vol_short < 2e-16
-realized_vol_mid:put_volume < 2e-16
-realized_vol_mid:call_oi < 2e-16
-realized_vol_mid:put_oi < 2e-16
-realized_vol_mid:maturity_count < 2e-16
-realized_vol_mid:total_contracts < 2e-16
-realized_vol_mid:pulse_ratio < 2e-16
-realized_vol_mid:put_call_ratio_volume < 2e-16
-realized_vol_mid:put_call_ratio_oi 8.81e-05
-realized_vol_mid:liquidity_ratio < 2e-16
-realized_vol_mid:option_dispersion < 2e-16
-realized_vol_mid:put_low_strike < 2e-16
-realized_vol_mid:put_proportion < 2e-16
-realized_vol_mid:stress_spread < 2e-16
-strike_dispersion:realized_vol_short < 2e-16
-strike_dispersion:put_volume 5.49e-08
-strike_dispersion:call_oi 0.364603
-strike_dispersion:put_oi < 2e-16
-strike_dispersion:maturity_count < 2e-16
-strike_dispersion:total_contracts < 2e-16
-strike_dispersion:pulse_ratio < 2e-16
-strike_dispersion:put_call_ratio_volume 0.048380
-strike_dispersion:put_call_ratio_oi 4.40e-05
-strike_dispersion:liquidity_ratio < 2e-16
-strike_dispersion:option_dispersion < 2e-16
-strike_dispersion:put_low_strike < 2e-16
-strike_dispersion:put_proportion 8.91e-12
-strike_dispersion:stress_spread < 2e-16
-market_vol_index:realized_vol_mid:strike_dispersion 0.974748
-market_vol_index:realized_vol_mid:realized_vol_short < 2e-16
-market_vol_index:realized_vol_mid:put_volume 6.31e-10
-market_vol_index:realized_vol_mid:call_oi < 2e-16
-market_vol_index:realized_vol_mid:put_oi < 2e-16
-market_vol_index:realized_vol_mid:maturity_count < 2e-16
-market_vol_index:realized_vol_mid:total_contracts 0.148194
-market_vol_index:realized_vol_mid:pulse_ratio 8.40e-16
-market_vol_index:realized_vol_mid:put_call_ratio_volume 0.210426
-market_vol_index:realized_vol_mid:put_call_ratio_oi 0.000266
-market_vol_index:realized_vol_mid:liquidity_ratio 9.56e-06
-market_vol_index:realized_vol_mid:option_dispersion 0.780128
-market_vol_index:realized_vol_mid:put_low_strike < 2e-16
-market_vol_index:realized_vol_mid:put_proportion 6.88e-12
-market_vol_index:realized_vol_mid:stress_spread < 2e-16
-market_vol_index:strike_dispersion:realized_vol_short 2.06e-08
-market_vol_index:strike_dispersion:put_volume 0.005861
-market_vol_index:strike_dispersion:call_oi 2.08e-09
-market_vol_index:strike_dispersion:put_oi 0.000273
-market_vol_index:strike_dispersion:maturity_count < 2e-16
-market_vol_index:strike_dispersion:total_contracts 8.89e-07
-market_vol_index:strike_dispersion:pulse_ratio 0.026680
-market_vol_index:strike_dispersion:put_call_ratio_volume 0.181097
-market_vol_index:strike_dispersion:put_call_ratio_oi 0.002566
-market_vol_index:strike_dispersion:liquidity_ratio < 2e-16
-market_vol_index:strike_dispersion:option_dispersion 3.51e-07
-market_vol_index:strike_dispersion:put_low_strike < 2e-16
-market_vol_index:strike_dispersion:put_proportion 0.000110
-market_vol_index:strike_dispersion:stress_spread 0.486101
-realized_vol_mid:strike_dispersion:realized_vol_short < 2e-16
-realized_vol_mid:strike_dispersion:put_volume < 2e-16
-realized_vol_mid:strike_dispersion:call_oi < 2e-16
-realized_vol_mid:strike_dispersion:put_oi < 2e-16
-realized_vol_mid:strike_dispersion:maturity_count < 2e-16
-realized_vol_mid:strike_dispersion:total_contracts < 2e-16
-realized_vol_mid:strike_dispersion:pulse_ratio < 2e-16
-realized_vol_mid:strike_dispersion:put_call_ratio_volume 1.47e-05
-realized_vol_mid:strike_dispersion:put_call_ratio_oi 2.30e-08
-realized_vol_mid:strike_dispersion:liquidity_ratio 0.022033
-realized_vol_mid:strike_dispersion:option_dispersion < 2e-16
-realized_vol_mid:strike_dispersion:put_low_strike 2.63e-08
-realized_vol_mid:strike_dispersion:put_proportion < 2e-16
-realized_vol_mid:strike_dispersion:stress_spread 6.86e-16
-market_vol_index:realized_vol_mid:strike_dispersion:realized_vol_short < 2e-16
-market_vol_index:realized_vol_mid:strike_dispersion:put_volume 0.000536
-market_vol_index:realized_vol_mid:strike_dispersion:call_oi 0.591674
-market_vol_index:realized_vol_mid:strike_dispersion:put_oi 0.174004
-market_vol_index:realized_vol_mid:strike_dispersion:maturity_count < 2e-16
-market_vol_index:realized_vol_mid:strike_dispersion:total_contracts < 2e-16
-market_vol_index:realized_vol_mid:strike_dispersion:pulse_ratio < 2e-16
-market_vol_index:realized_vol_mid:strike_dispersion:put_call_ratio_volume 0.144564
-market_vol_index:realized_vol_mid:strike_dispersion:put_call_ratio_oi 0.058120
-market_vol_index:realized_vol_mid:strike_dispersion:liquidity_ratio 2.44e-06
-market_vol_index:realized_vol_mid:strike_dispersion:option_dispersion < 2e-16
-market_vol_index:realized_vol_mid:strike_dispersion:put_low_strike 0.000228
-market_vol_index:realized_vol_mid:strike_dispersion:put_proportion 0.019942
-market_vol_index:realized_vol_mid:strike_dispersion:stress_spread 4.41e-06
-
-(Intercept) ***
-market_vol_index ***
-realized_vol_mid ***
-strike_dispersion ***
-realized_vol_short ***
-put_volume ***
-call_oi *
-put_oi ***
-maturity_count ***
-total_contracts ***
-pulse_ratio ***
-put_call_ratio_volume ***
-put_call_ratio_oi ***
-liquidity_ratio ***
-option_dispersion ***
-put_low_strike **
-put_proportion ***
-stress_spread ***
-market_vol_index:realized_vol_mid ***
-market_vol_index:strike_dispersion ***
-realized_vol_mid:strike_dispersion *
-market_vol_index:realized_vol_short ***
-market_vol_index:put_volume ***
-market_vol_index:call_oi ***
-market_vol_index:put_oi ***
-market_vol_index:maturity_count ***
-market_vol_index:total_contracts ***
-market_vol_index:pulse_ratio ***
-market_vol_index:put_call_ratio_volume ***
-market_vol_index:put_call_ratio_oi
-market_vol_index:liquidity_ratio ***
-market_vol_index:option_dispersion ***
-market_vol_index:put_low_strike ***
-market_vol_index:put_proportion ***
-market_vol_index:stress_spread
-realized_vol_mid:realized_vol_short ***
-realized_vol_mid:put_volume ***
-realized_vol_mid:call_oi ***
-realized_vol_mid:put_oi ***
-realized_vol_mid:maturity_count ***
-realized_vol_mid:total_contracts ***
-realized_vol_mid:pulse_ratio ***
-realized_vol_mid:put_call_ratio_volume ***
-realized_vol_mid:put_call_ratio_oi ***
-realized_vol_mid:liquidity_ratio ***
-realized_vol_mid:option_dispersion ***
-realized_vol_mid:put_low_strike ***
-realized_vol_mid:put_proportion ***
-realized_vol_mid:stress_spread ***
-strike_dispersion:realized_vol_short ***
-strike_dispersion:put_volume ***
-strike_dispersion:call_oi
-strike_dispersion:put_oi ***
-strike_dispersion:maturity_count ***
-strike_dispersion:total_contracts ***
-strike_dispersion:pulse_ratio ***
-strike_dispersion:put_call_ratio_volume *
-strike_dispersion:put_call_ratio_oi ***
-strike_dispersion:liquidity_ratio ***
-strike_dispersion:option_dispersion ***
-strike_dispersion:put_low_strike ***
-strike_dispersion:put_proportion ***
-strike_dispersion:stress_spread ***
-market_vol_index:realized_vol_mid:strike_dispersion
-market_vol_index:realized_vol_mid:realized_vol_short ***
-market_vol_index:realized_vol_mid:put_volume ***
-market_vol_index:realized_vol_mid:call_oi ***
-market_vol_index:realized_vol_mid:put_oi ***
-market_vol_index:realized_vol_mid:maturity_count ***
-market_vol_index:realized_vol_mid:total_contracts
-market_vol_index:realized_vol_mid:pulse_ratio ***
-market_vol_index:realized_vol_mid:put_call_ratio_volume
-market_vol_index:realized_vol_mid:put_call_ratio_oi ***
-market_vol_index:realized_vol_mid:liquidity_ratio ***
-market_vol_index:realized_vol_mid:option_dispersion
-market_vol_index:realized_vol_mid:put_low_strike ***
-market_vol_index:realized_vol_mid:put_proportion ***
-market_vol_index:realized_vol_mid:stress_spread ***
-market_vol_index:strike_dispersion:realized_vol_short ***
-market_vol_index:strike_dispersion:put_volume **
-market_vol_index:strike_dispersion:call_oi ***
-market_vol_index:strike_dispersion:put_oi ***
-market_vol_index:strike_dispersion:maturity_count ***
-market_vol_index:strike_dispersion:total_contracts ***
-market_vol_index:strike_dispersion:pulse_ratio *
-market_vol_index:strike_dispersion:put_call_ratio_volume
-market_vol_index:strike_dispersion:put_call_ratio_oi **
-market_vol_index:strike_dispersion:liquidity_ratio ***
-market_vol_index:strike_dispersion:option_dispersion ***
-market_vol_index:strike_dispersion:put_low_strike ***
-market_vol_index:strike_dispersion:put_proportion ***
-market_vol_index:strike_dispersion:stress_spread
-realized_vol_mid:strike_dispersion:realized_vol_short ***
-realized_vol_mid:strike_dispersion:put_volume ***
-realized_vol_mid:strike_dispersion:call_oi ***
-realized_vol_mid:strike_dispersion:put_oi ***
-realized_vol_mid:strike_dispersion:maturity_count ***
-realized_vol_mid:strike_dispersion:total_contracts ***
-realized_vol_mid:strike_dispersion:pulse_ratio ***
-realized_vol_mid:strike_dispersion:put_call_ratio_volume ***
-realized_vol_mid:strike_dispersion:put_call_ratio_oi ***
-realized_vol_mid:strike_dispersion:liquidity_ratio *
-realized_vol_mid:strike_dispersion:option_dispersion ***
-realized_vol_mid:strike_dispersion:put_low_strike ***
-realized_vol_mid:strike_dispersion:put_proportion ***
-realized_vol_mid:strike_dispersion:stress_spread ***
-market_vol_index:realized_vol_mid:strike_dispersion:realized_vol_short ***
-market_vol_index:realized_vol_mid:strike_dispersion:put_volume ***
-market_vol_index:realized_vol_mid:strike_dispersion:call_oi
-market_vol_index:realized_vol_mid:strike_dispersion:put_oi
-market_vol_index:realized_vol_mid:strike_dispersion:maturity_count ***
-market_vol_index:realized_vol_mid:strike_dispersion:total_contracts ***
-market_vol_index:realized_vol_mid:strike_dispersion:pulse_ratio ***
-market_vol_index:realized_vol_mid:strike_dispersion:put_call_ratio_volume
-market_vol_index:realized_vol_mid:strike_dispersion:put_call_ratio_oi .
-market_vol_index:realized_vol_mid:strike_dispersion:liquidity_ratio ***
-market_vol_index:realized_vol_mid:strike_dispersion:option_dispersion ***
-market_vol_index:realized_vol_mid:strike_dispersion:put_low_strike ***
-market_vol_index:realized_vol_mid:strike_dispersion:put_proportion *
-market_vol_index:realized_vol_mid:strike_dispersion:stress_spread ***
----
-Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-
-Residual standard error: 0.2595 on 1533114 degrees of freedom
-Multiple R-squared: 0.7837, Adjusted R-squared: 0.7837
-F-statistic: 4.667e+04 on 119 and 1533114 DF, p-value: < 2.2e-16
-Y_hat4 <- predict(mod4, newdata = val_linear_lm, type = "response")
-Y_val <- val_linear_lm$implied_vol_ref
-
-MSS_4 <- mean((exp(Y_val) - exp(Y_hat4))**2)
-MedSS_4 <- median((exp(Y_val) - exp(Y_hat4))**2)
-
-print(paste0("The RMSE is: ", sqrt(MSS_4)))
-[1] "The RMSE is: 11.2618186831934"
-print(paste0("The root of the median squared error is: ", sqrt(MedSS_4)))
-[1] "The root of the median squared error is: 3.68978501195564"
-However, the best results came from a model in which \(option\_dispersion\) was replaced by \(market\_vol\_index\). Since \(market\_vol\_index\) is a global indicator of the risk perceived by the market on a given day, the aim was to combine asset-level volatility information with market-wide information. Indeed, this model proved to be the best among our classic linear models, with an RMSE of 11.26. To get a better grasp of its performance, we also looked at the root of the median squared error. At 3.69, it is substantially lower than the RMSE, which suggests that a small number of very large errors inflate the mean-based metric. This could stem from the model's inability to adapt quickly to extreme changes, for example during crises or panic movements.
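To make the comparison between the two metrics explicit: since the model is fitted on \(\log\) implied volatility, errors are computed after back-transforming to the original scale, exactly as in the code above,

\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(e^{y_i} - e^{\hat{y}_i}\right)^2},
\qquad
\mathrm{RMedSE} = \sqrt{\operatorname*{median}_{i}\left(e^{y_i} - e^{\hat{y}_i}\right)^2},
\]

where \(y_i\) is the observed log implied volatility and \(\hat{y}_i\) its prediction. The median is robust to a handful of extreme errors, so a large gap between the two metrics is a sign of a heavy-tailed error distribution rather than uniformly poor predictions.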
-Linear regressions on the PCA Dataset
-Train-test data import
-train_pca_lm <- train_pca |> dplyr::select(-asset_id, -obs_date)
-train_pca_lm$implied_vol_ref <- log(train_pca$implied_vol_ref)
-val_pca_lm <- val_pca |> dplyr::select(-asset_id, -obs_date)
-val_pca_lm$implied_vol_ref <- log(val_pca$implied_vol_ref)
-First model
-mod1_pca <- lm(implied_vol_ref ~ ., data = train_pca_lm)
-summary(mod1_pca)
-Y_hat1_pca <- predict(mod1_pca, newdata = val_pca_lm, type = "response")
-Y_val_pca <- val_pca_lm$implied_vol_ref
-
-MSS_1_pca <- mean((exp(Y_val_pca) - exp(Y_hat1_pca))**2)
-MedSS_1_pca <- median((exp(Y_val_pca) - exp(Y_hat1_pca))**2)
-
-print(paste0("The RMSE is: ", sqrt(MSS_1_pca)))
-print(paste0("The root of the median squared error is: ", sqrt(MedSS_1_pca)))
-We now turn to linear models trained on the datasets created through the principal component analysis. The first model, without interactions, shows that all components have a strong impact on the prediction, with p-values below \(2\mathrm{e}{-16}\). However, its RMSE of 12.79 is worse than that of the previous linear model: the PCA has indeed led to a loss of information. The root of the median squared error, at 4.11, is also substantially worse than before. Not only is the prediction worse on average, but the errors grow much faster.
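Since the same back-transformed metrics are recomputed for each model, they could be wrapped in a small helper. The sketch below is ours, not part of the original analysis; the function name `eval_log_model` is hypothetical, and it assumes (as in the code above) that the response column `implied_vol_ref` in `newdata` is already on the log scale.

```r
# Sketch of a reusable evaluation helper for models fitted on
# log(implied_vol_ref): returns RMSE and root median squared error
# on the original (exponentiated) volatility scale.
eval_log_model <- function(model, newdata) {
  y_hat <- predict(model, newdata = newdata, type = "response")
  y_obs <- newdata$implied_vol_ref            # observed log implied vol
  sq_err <- (exp(y_obs) - exp(y_hat))^2       # squared errors, original scale
  c(rmse = sqrt(mean(sq_err)),
    rmedse = sqrt(median(sq_err)))
}

# Usage, e.g. for the first PCA model:
# eval_log_model(mod1_pca, val_pca_lm)
```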
-First selected model
-mod2_pca <- lm(
- implied_vol_ref ~ PC1 *
- PC2 *
- PC3 *
- PC4 *
- (PC5 + PC6 + PC7 + PC8 + PC9),
- data = train_pca_lm
-)
-summary(mod2_pca)
-Y_hat2_pca <- predict(mod2_pca, newdata = val_pca_lm, type = "response")
-Y_val_pca <- val_pca_lm$implied_vol_ref
-
-MSS_2_pca <- mean((exp(Y_val_pca) - exp(Y_hat2_pca))**2)
-
-print(paste0("The RMSE is: ", sqrt(MSS_2_pca)))
-Second selected model
-mod3_pca <- lm(
- implied_vol_ref ~ PC1 *
- PC2 *
- PC3 *
- PC4 *
- PC5 *
- (PC6 + PC7 + PC8 + PC9),
- data = train_pca_lm
-)
-summary(mod3_pca)
-Call:
-lm(formula = implied_vol_ref ~ PC1 * PC2 * PC3 * PC4 * PC5 *
- (PC6 + PC7 + PC8 + PC9), data = train_pca_lm)
-
-Residuals:
- Min 1Q Median 3Q Max
--4.4189 -0.1271 -0.0003 0.1312 2.4098
-
-Coefficients:
- Estimate Std. Error t value Pr(>|t|)
-(Intercept) 3.712e+00 2.969e-04 12501.039 < 2e-16 ***
-PC1 4.176e-02 1.375e-04 303.726 < 2e-16 ***
-PC2 2.137e-01 1.463e-04 1460.678 < 2e-16 ***
-PC3 -4.172e-02 4.759e-04 -87.663 < 2e-16 ***
-PC4 -1.515e-02 3.103e-04 -48.826 < 2e-16 ***
-PC5 -1.271e-01 3.538e-04 -359.242 < 2e-16 ***
-PC6 5.418e-02 4.907e-04 110.395 < 2e-16 ***
-PC7 -6.105e-02 4.326e-04 -141.111 < 2e-16 ***
-PC8 5.400e-03 4.328e-04 12.477 < 2e-16 ***
-PC9 -5.401e-02 5.126e-04 -105.354 < 2e-16 ***
-PC1:PC2 1.088e-03 6.671e-05 16.305 < 2e-16 ***
-PC1:PC3 -7.032e-03 1.645e-04 -42.746 < 2e-16 ***
-PC2:PC3 -3.980e-03 1.976e-04 -20.139 < 2e-16 ***
-PC1:PC4 -5.043e-03 1.269e-04 -39.758 < 2e-16 ***
-PC2:PC4 2.175e-02 1.396e-04 155.808 < 2e-16 ***
-PC3:PC4 7.605e-03 2.353e-04 32.322 < 2e-16 ***
-PC1:PC5 -1.336e-02 1.589e-04 -84.127 < 2e-16 ***
-PC2:PC5 1.243e-03 1.501e-04 8.281 < 2e-16 ***
-PC3:PC5 2.338e-02 3.689e-04 63.386 < 2e-16 ***
-PC4:PC5 -3.858e-03 2.832e-04 -13.619 < 2e-16 ***
-PC1:PC6 -5.465e-03 1.799e-04 -30.372 < 2e-16 ***
-PC1:PC7 -1.078e-03 1.589e-04 -6.788 1.13e-11 ***
-PC1:PC8 -9.518e-03 1.608e-04 -59.188 < 2e-16 ***
-PC1:PC9 6.692e-03 1.783e-04 37.530 < 2e-16 ***
-PC2:PC6 7.085e-03 2.206e-04 32.115 < 2e-16 ***
-PC2:PC7 -1.125e-03 1.935e-04 -5.814 6.12e-09 ***
-PC2:PC8 1.111e-02 1.929e-04 57.581 < 2e-16 ***
-PC2:PC9 7.667e-03 2.334e-04 32.855 < 2e-16 ***
-PC3:PC6 8.246e-03 2.321e-04 35.519 < 2e-16 ***
-PC3:PC7 -5.427e-03 2.797e-04 -19.402 < 2e-16 ***
-PC3:PC8 3.661e-03 3.256e-04 11.242 < 2e-16 ***
-PC3:PC9 1.372e-03 3.419e-04 4.012 6.03e-05 ***
-PC4:PC6 -4.867e-03 3.402e-04 -14.308 < 2e-16 ***
-PC4:PC7 4.357e-02 2.769e-04 157.358 < 2e-16 ***
-PC4:PC8 -2.920e-03 3.407e-04 -8.572 < 2e-16 ***
-PC4:PC9 -2.537e-03 3.510e-04 -7.227 4.94e-13 ***
-PC5:PC6 2.964e-02 4.739e-04 62.536 < 2e-16 ***
-PC5:PC7 -3.977e-02 3.781e-04 -105.160 < 2e-16 ***
-PC5:PC8 3.083e-02 3.971e-04 77.654 < 2e-16 ***
-PC5:PC9 1.636e-02 4.897e-04 33.414 < 2e-16 ***
-PC1:PC2:PC3 -6.898e-04 6.753e-05 -10.214 < 2e-16 ***
-PC1:PC2:PC4 1.060e-03 5.531e-05 19.159 < 2e-16 ***
-PC1:PC3:PC4 2.410e-03 8.242e-05 29.237 < 2e-16 ***
-PC2:PC3:PC4 -1.041e-02 1.019e-04 -102.204 < 2e-16 ***
-PC1:PC2:PC5 3.672e-03 6.139e-05 59.817 < 2e-16 ***
-PC1:PC3:PC5 4.743e-03 1.270e-04 37.348 < 2e-16 ***
-PC2:PC3:PC5 -4.989e-03 1.269e-04 -39.310 < 2e-16 ***
-PC1:PC4:PC5 3.417e-04 1.155e-04 2.959 0.003089 **
-PC2:PC4:PC5 -4.490e-03 1.114e-04 -40.324 < 2e-16 ***
-PC3:PC4:PC5 -1.494e-03 1.904e-04 -7.846 4.29e-15 ***
-PC1:PC2:PC6 8.052e-04 7.800e-05 10.323 < 2e-16 ***
-PC1:PC2:PC7 -2.933e-03 6.832e-05 -42.934 < 2e-16 ***
-PC1:PC2:PC8 7.180e-04 7.150e-05 10.043 < 2e-16 ***
-PC1:PC2:PC9 1.111e-04 7.037e-05 1.579 0.114372
-PC1:PC3:PC6 1.571e-03 7.998e-05 19.642 < 2e-16 ***
-PC1:PC3:PC7 -4.418e-04 8.632e-05 -5.117 3.10e-07 ***
-PC1:PC3:PC8 8.451e-04 9.232e-05 9.154 < 2e-16 ***
-PC1:PC3:PC9 -1.635e-03 9.626e-05 -16.983 < 2e-16 ***
-PC2:PC3:PC6 1.460e-03 9.834e-05 14.846 < 2e-16 ***
-PC2:PC3:PC7 -9.543e-04 1.032e-04 -9.244 < 2e-16 ***
-PC2:PC3:PC8 -7.973e-04 1.342e-04 -5.943 2.80e-09 ***
-PC2:PC3:PC9 9.474e-04 1.308e-04 7.243 4.39e-13 ***
-PC1:PC4:PC6 -1.191e-03 1.197e-04 -9.948 < 2e-16 ***
-PC1:PC4:PC7 4.337e-03 1.047e-04 41.405 < 2e-16 ***
-PC1:PC4:PC8 -1.029e-03 1.160e-04 -8.866 < 2e-16 ***
-PC1:PC4:PC9 4.835e-04 1.282e-04 3.771 0.000163 ***
-PC2:PC4:PC6 -3.747e-03 1.553e-04 -24.127 < 2e-16 ***
-PC2:PC4:PC7 -7.401e-03 1.115e-04 -66.373 < 2e-16 ***
-PC2:PC4:PC8 -1.281e-03 1.472e-04 -8.697 < 2e-16 ***
-PC2:PC4:PC9 -6.620e-03 1.571e-04 -42.148 < 2e-16 ***
-PC3:PC4:PC6 -3.426e-04 1.294e-04 -2.646 0.008135 **
-PC3:PC4:PC7 -6.179e-03 1.505e-04 -41.061 < 2e-16 ***
-PC3:PC4:PC8 2.580e-03 1.920e-04 13.437 < 2e-16 ***
-PC3:PC4:PC9 1.076e-03 1.796e-04 5.991 2.09e-09 ***
-PC1:PC5:PC6 4.750e-03 1.646e-04 28.862 < 2e-16 ***
-PC1:PC5:PC7 -2.139e-03 1.372e-04 -15.587 < 2e-16 ***
-PC1:PC5:PC8 2.334e-03 1.438e-04 16.235 < 2e-16 ***
-PC1:PC5:PC9 1.865e-03 1.638e-04 11.384 < 2e-16 ***
-PC2:PC5:PC6 -2.924e-03 1.852e-04 -15.794 < 2e-16 ***
-PC2:PC5:PC7 -5.402e-04 1.375e-04 -3.928 8.58e-05 ***
-PC2:PC5:PC8 -1.093e-02 1.404e-04 -77.859 < 2e-16 ***
-PC2:PC5:PC9 3.502e-03 1.767e-04 19.826 < 2e-16 ***
-PC3:PC5:PC6 -5.136e-03 1.833e-04 -28.027 < 2e-16 ***
-PC3:PC5:PC7 7.055e-03 1.748e-04 40.358 < 2e-16 ***
-PC3:PC5:PC8 -8.107e-03 2.481e-04 -32.674 < 2e-16 ***
-PC3:PC5:PC9 -1.075e-03 2.590e-04 -4.152 3.30e-05 ***
-PC4:PC5:PC6 1.197e-02 3.065e-04 39.074 < 2e-16 ***
-PC4:PC5:PC7 -1.942e-02 2.467e-04 -78.712 < 2e-16 ***
-PC4:PC5:PC8 3.484e-03 2.782e-04 12.523 < 2e-16 ***
-PC4:PC5:PC9 8.209e-03 3.252e-04 25.241 < 2e-16 ***
-PC1:PC2:PC3:PC4 -1.082e-03 3.274e-05 -33.045 < 2e-16 ***
-PC1:PC2:PC3:PC5 -1.154e-03 4.599e-05 -25.100 < 2e-16 ***
-PC1:PC2:PC4:PC5 9.923e-05 4.290e-05 2.313 0.020715 *
-PC1:PC3:PC4:PC5 -1.463e-03 6.087e-05 -24.030 < 2e-16 ***
-PC2:PC3:PC4:PC5 2.283e-03 7.615e-05 29.981 < 2e-16 ***
-PC1:PC2:PC3:PC6 1.251e-04 3.270e-05 3.827 0.000130 ***
-PC1:PC2:PC3:PC7 8.523e-04 3.328e-05 25.612 < 2e-16 ***
-PC1:PC2:PC3:PC8 4.727e-06 3.867e-05 0.122 0.902701
-PC1:PC2:PC3:PC9 4.017e-04 3.482e-05 11.538 < 2e-16 ***
-PC1:PC2:PC4:PC6 4.511e-04 5.234e-05 8.619 < 2e-16 ***
-PC1:PC2:PC4:PC7 -1.544e-03 4.077e-05 -37.872 < 2e-16 ***
-PC1:PC2:PC4:PC8 3.957e-04 4.884e-05 8.102 5.43e-16 ***
-PC1:PC2:PC4:PC9 4.233e-04 5.066e-05 8.356 < 2e-16 ***
-PC1:PC3:PC4:PC6 -1.308e-04 3.911e-05 -3.344 0.000826 ***
-PC1:PC3:PC4:PC7 -1.074e-03 4.065e-05 -26.432 < 2e-16 ***
-PC1:PC3:PC4:PC8 2.722e-04 4.935e-05 5.516 3.47e-08 ***
-PC1:PC3:PC4:PC9 6.807e-04 4.810e-05 14.152 < 2e-16 ***
-PC2:PC3:PC4:PC6 9.268e-04 5.299e-05 17.490 < 2e-16 ***
-PC2:PC3:PC4:PC7 2.758e-03 5.391e-05 51.157 < 2e-16 ***
-PC2:PC3:PC4:PC8 1.022e-03 7.899e-05 12.934 < 2e-16 ***
-PC2:PC3:PC4:PC9 1.003e-03 7.097e-05 14.131 < 2e-16 ***
-PC1:PC2:PC5:PC6 -5.341e-04 6.390e-05 -8.359 < 2e-16 ***
-PC1:PC2:PC5:PC7 1.193e-03 5.056e-05 23.586 < 2e-16 ***
-PC1:PC2:PC5:PC8 -1.742e-03 5.441e-05 -32.013 < 2e-16 ***
-PC1:PC2:PC5:PC9 -1.082e-03 5.675e-05 -19.070 < 2e-16 ***
-PC1:PC3:PC5:PC6 -3.494e-04 5.899e-05 -5.923 3.17e-09 ***
-PC1:PC3:PC5:PC7 1.350e-03 5.467e-05 24.695 < 2e-16 ***
-PC1:PC3:PC5:PC8 -1.515e-03 7.197e-05 -21.053 < 2e-16 ***
-PC1:PC3:PC5:PC9 -4.855e-04 7.025e-05 -6.911 4.81e-12 ***
-PC2:PC3:PC5:PC6 4.911e-04 7.375e-05 6.660 2.75e-11 ***
-PC2:PC3:PC5:PC7 -5.853e-04 6.143e-05 -9.528 < 2e-16 ***
-PC2:PC3:PC5:PC8 2.373e-03 8.840e-05 26.844 < 2e-16 ***
-PC2:PC3:PC5:PC9 -2.703e-03 8.236e-05 -32.812 < 2e-16 ***
-PC1:PC4:PC5:PC6 3.735e-03 9.754e-05 38.290 < 2e-16 ***
-PC1:PC4:PC5:PC7 -1.355e-04 8.398e-05 -1.613 0.106736
-PC1:PC4:PC5:PC8 -7.141e-04 8.737e-05 -8.173 3.00e-16 ***
-PC1:PC4:PC5:PC9 1.510e-03 1.033e-04 14.619 < 2e-16 ***
-PC2:PC4:PC5:PC6 4.562e-03 1.244e-04 36.678 < 2e-16 ***
-PC2:PC4:PC5:PC7 3.101e-03 9.395e-05 33.006 < 2e-16 ***
-PC2:PC4:PC5:PC8 -3.605e-03 1.073e-04 -33.605 < 2e-16 ***
-PC2:PC4:PC5:PC9 2.848e-03 1.149e-04 24.783 < 2e-16 ***
-PC3:PC4:PC5:PC6 -2.638e-03 9.787e-05 -26.959 < 2e-16 ***
-PC3:PC4:PC5:PC7 3.449e-03 8.882e-05 38.836 < 2e-16 ***
-PC3:PC4:PC5:PC8 1.130e-03 1.320e-04 8.561 < 2e-16 ***
-PC3:PC4:PC5:PC9 -3.072e-03 1.246e-04 -24.658 < 2e-16 ***
-PC1:PC2:PC3:PC4:PC5 2.170e-04 2.429e-05 8.931 < 2e-16 ***
-PC1:PC2:PC3:PC4:PC6 -1.236e-05 1.568e-05 -0.788 0.430827
-PC1:PC2:PC3:PC4:PC7 3.321e-04 1.529e-05 21.713 < 2e-16 ***
-PC1:PC2:PC3:PC4:PC8 7.917e-05 1.994e-05 3.971 7.16e-05 ***
-PC1:PC2:PC3:PC4:PC9 -1.315e-04 1.981e-05 -6.640 3.14e-11 ***
-PC1:PC2:PC3:PC5:PC6 1.486e-04 2.349e-05 6.324 2.56e-10 ***
-PC1:PC2:PC3:PC5:PC7 -6.312e-04 2.046e-05 -30.847 < 2e-16 ***
-PC1:PC2:PC3:PC5:PC8 4.307e-04 2.690e-05 16.014 < 2e-16 ***
-PC1:PC2:PC3:PC5:PC9 1.682e-06 2.492e-05 0.067 0.946206
-PC1:PC2:PC4:PC5:PC6 1.396e-03 3.798e-05 36.743 < 2e-16 ***
-PC1:PC2:PC4:PC5:PC7 1.066e-03 3.079e-05 34.612 < 2e-16 ***
-PC1:PC2:PC4:PC5:PC8 -8.999e-04 3.377e-05 -26.645 < 2e-16 ***
-PC1:PC2:PC4:PC5:PC9 6.552e-05 3.665e-05 1.788 0.073781 .
-PC1:PC3:PC4:PC5:PC6 -6.516e-04 2.846e-05 -22.897 < 2e-16 ***
-PC1:PC3:PC4:PC5:PC7 2.073e-04 2.484e-05 8.346 < 2e-16 ***
-PC1:PC3:PC4:PC5:PC8 6.207e-04 3.265e-05 19.011 < 2e-16 ***
-PC1:PC3:PC4:PC5:PC9 -4.809e-04 3.283e-05 -14.650 < 2e-16 ***
-PC2:PC3:PC4:PC5:PC6 4.775e-05 3.916e-05 1.220 0.222617
-PC2:PC3:PC4:PC5:PC7 -1.424e-03 3.364e-05 -42.325 < 2e-16 ***
-PC2:PC3:PC4:PC5:PC8 6.694e-05 5.311e-05 1.260 0.207536
-PC2:PC3:PC4:PC5:PC9 -5.787e-04 4.377e-05 -13.222 < 2e-16 ***
-PC1:PC2:PC3:PC4:PC5:PC6 -1.123e-04 1.144e-05 -9.816 < 2e-16 ***
-PC1:PC2:PC3:PC4:PC5:PC7 -2.817e-04 9.656e-06 -29.174 < 2e-16 ***
-PC1:PC2:PC3:PC4:PC5:PC8 1.209e-04 1.336e-05 9.050 < 2e-16 ***
-PC1:PC2:PC3:PC4:PC5:PC9 -2.575e-05 1.361e-05 -1.892 0.058483 .
----
-Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-
-Residual standard error: 0.2593 on 1533074 degrees of freedom
-Multiple R-squared: 0.7839, Adjusted R-squared: 0.7839
-F-statistic: 3.498e+04 on 159 and 1533074 DF, p-value: < 2.2e-16
-Y_hat3_pca <- predict(mod3_pca, newdata = val_pca_lm, type = "response")
-Y_val_pca <- val_pca_lm$implied_vol_ref
-
-MSS_3_pca <- mean((exp(Y_val_pca) - exp(Y_hat3_pca))**2)
-
-print(paste0("The RMSE is: ", sqrt(MSS_3_pca)))[1] "The RMSE is: 11.7313443364877"
-We now add interactions. First, we include the pairwise interactions of the first four components, both with each other and with the remaining components; this improves the RMSE to 11.88. These models run much more quickly than the previous ones fitted on the full dataset, with a nearly identical RMSE. Adding the fifth component's interactions lowers the RMSE further, to 11.73. However, adding higher-order interactions leads to computational problems that cannot be resolved with 16 GB of RAM.
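The memory pressure is easy to anticipate with a back-of-the-envelope calculation. The sketch below assumes a dense double-precision model matrix, with the row and column counts taken from the regression output above (1,533,234 observations, 159 terms plus an intercept); it shows why richer interaction sets stop fitting in 16 GB once `lm()` makes its internal copies.

```r
# Rough memory footprint of the dense model matrix built by lm() (sketch)
n <- 1533234              # training observations (residual df + parameters above)
p <- 160                  # model terms including the intercept
bytes <- n * p * 8        # one double-precision value takes 8 bytes
round(bytes / 1024^3, 2)  # ~1.83 GiB for the matrix alone, before lm()'s copies
```

Each extra interaction order multiplies the column count, so a handful of additional terms quickly exhausts the available memory.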
-Biased Regression
-Since the regular (non-reduced) dataset gave better performance, we use it for the penalised regressions. The only exception is partial least squares, which requires the reduced dimension in order to run.
-Ridge
-The first biased regression model used is Ridge. It adds a regularisation parameter \(\lambda > 0\) to the least-squares objective.
-\[
-\hat{\beta}^{R}(\lambda) \;=\; \arg\min_{\beta}\; \|Y - X\beta\|_{n}^{2} \;+\; \lambda\,\|\beta\|^{2}
-\]
-The penalty does not affect the intercept, and no variable selection occurs since no estimate is shrunk exactly to zero. We choose the best \(\lambda\) through cross-validation.
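For a fixed \(\lambda\), the Ridge estimator also has a well-known closed form; under the empirical norm \(\|\cdot\|_{n}^{2} = \frac{1}{n}\|\cdot\|^{2}\) used above, it reads

\[
\hat{\beta}^{R}(\lambda) = \left(X^{\top}X + n\lambda I_{p}\right)^{-1} X^{\top} Y,
\]

which makes explicit that the penalty regularises \(X^{\top}X\) and shrinks, but never zeroes, the coefficients.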
-train_noY <- train_linear_lm |> dplyr::select(-implied_vol_ref)
-y_train_log <- train_linear_lm$implied_vol_ref
-
-val_noY <- val_linear_lm |> dplyr::select(-implied_vol_ref)
-y_val_log <- val_linear_lm$implied_vol_ref
-
-cv.ridge <- cv.glmnet(as.matrix(train_noY), y_train_log, alpha = 0)
-s_ridge = cv.ridge$lambda.min
-fit.ridge <- glmnet(
- train_noY,
- y_train_log,
- lambda = s_ridge,
- alpha = 0
-)
-coef(fit.ridge)
-20 x 1 sparse Matrix of class "dgCMatrix"
- s0
-(Intercept) 3.707945581
-strike_dispersion 0.066642443
-call_volume 0.008091265
-put_volume 0.017378847
-call_oi 0.008054584
-put_oi -0.014366114
-maturity_count 0.016893062
-total_contracts 0.021135478
-realized_vol_short 0.195001834
-market_vol_index 0.046033081
-realized_vol_mid 0.068107180
-realized_vol_long 0.150329552
-pulse_ratio -0.060030239
-put_call_ratio_volume 0.010104841
-put_call_ratio_oi 0.005461722
-liquidity_ratio -0.013761483
-option_dispersion 0.025614892
-put_low_strike 0.030860894
-put_proportion -0.011503510
-stress_spread 0.039569879
-Y_hatridge <- predict(
- fit.ridge,
- newx = as.matrix(val_noY),
- type = "response",
- s = s_ridge
-)
-
-MSS_ridge <- mean((exp(y_val_log) - exp(Y_hatridge))**2)
-
-print(paste0("The Ridge RMSE is: ", sqrt(MSS_ridge)))[1] "The Ridge RMSE is: 12.4870766968991"
-Looking at the coefficients, the intercept has the biggest impact on the final prediction. Among the features, the short- and long-term realised volatilities have the largest coefficients and therefore the most influence on the result. Nonetheless, Ridge is the worst model so far, with an RMSE of 12.48. This seems coherent: Ridge's main strength, preventing overfitting by shrinking the coefficients, is not relevant here because we are not in an overfitting situation.
-Lasso Regression
-Lasso regression is close to Ridge regression, but the penalty changes slightly.
-\[
-\hat{\beta}^{L}(\lambda)=\arg\min_{\beta}\;\|Y - X\beta\|_{n}^{2} + \lambda\|\beta\|_{1}
-\]
-Unlike Ridge, Lasso can set estimates exactly to zero, performing built-in variable selection.
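In the special case of an orthonormal design (\(X^{\top}X/n = I_p\)), this selection mechanism is explicit: each Lasso coefficient is the soft-thresholded OLS estimate,

\[
\hat{\beta}^{L}_{j}(\lambda) = \operatorname{sign}\!\left(\hat{\beta}^{OLS}_{j}\right)\left(\left|\hat{\beta}^{OLS}_{j}\right| - \tfrac{\lambda}{2}\right)_{+},
\]

so any coefficient whose OLS magnitude falls below \(\lambda/2\) is set exactly to zero (the exact threshold depends on the normalisation convention of the loss).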
-cv.lasso <- cv.glmnet(as.matrix(train_noY), y_train_log, alpha = 1)
-s_lasso = cv.lasso$lambda.min
-fit.lasso <- glmnet(
- train_noY,
- y_train_log,
- lambda = s_lasso,
- alpha = 1,
- standardize = FALSE
-)
-coef(fit.lasso)
-20 x 1 sparse Matrix of class "dgCMatrix"
- s0
-(Intercept) 3.7079455811
-strike_dispersion 0.0828260235
-call_volume 0.0016005427
-put_volume 0.0497937673
-call_oi 0.0008723776
-put_oi -0.0322067399
-maturity_count 0.0151152360
-total_contracts 0.0098843542
-realized_vol_short 0.5161194376
-market_vol_index 0.0485475729
-realized_vol_mid .
-realized_vol_long 0.0053070551
-pulse_ratio -0.2148769609
-put_call_ratio_volume 0.0089017316
-put_call_ratio_oi 0.0111171482
-liquidity_ratio -0.0160239070
-option_dispersion .
-put_low_strike 0.0344600349
-put_proportion -0.0198225534
-stress_spread -0.0029979084
-Y_hatlasso <- predict(
- fit.lasso,
- newx = as.matrix(val_noY),
- type = "response"
-)
-
-
-MSS_lasso <- mean((exp(y_val_log) - exp(Y_hatlasso))**2)
-
-print(paste0("The Lasso RMSE is: ", sqrt(MSS_lasso)))[1] "The Lasso RMSE is: 12.0370753712144"
-This leads to better results, with an RMSE of 12.03. We are in a situation that calls for variable selection rather than an overfitting scenario. The eliminated variables are \(realized vol mid\) and \(option dispersion\), which Lasso deems irrelevant for the prediction. This contrasts with the linear models, which used those variables and achieved better results than Lasso: while Lasso benefits from automatic variable selection, it cannot replicate human variable selection. As it remains worse than our classical linear regression, we move on to the next penalised model.
-Elastic-Net
-Elastic-Net combines the Ridge and Lasso penalties, attempting to preserve the benefits of both models: reducing overfitting while performing variable selection. For \(\lambda_{1}, \lambda_{2} > 0\),
-\[
-\hat{\beta}^{EN}(\lambda_{1}, \lambda_{2}) = \arg\min_{\beta}\;\|Y-X\beta\|_{n}^{2} + \lambda_{1}\|\beta\|_{1} + \lambda_{2}\|\beta\|_{2}^{2}
-\]
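Note that glmnet uses a single-\(\lambda\) parameterisation of this trade-off: with mixing parameter \(\alpha \in [0,1]\), its penalty is

\[
\lambda \left( \alpha \|\beta\|_{1} + \frac{1-\alpha}{2}\,\|\beta\|_{2}^{2} \right),
\]

so the call with \(\alpha = 0.5\) below mixes the Lasso and Ridge penalties rather than tuning \(\lambda_{1}\) and \(\lambda_{2}\) separately.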
-cv.elasticnet <- cv.glmnet(
- as.matrix(train_noY),
- y_train_log,
- alpha = 0.5
-)
-s_en = cv.elasticnet$lambda.min
-fit.elasticnet <- glmnet(
- train_noY,
- y_train_log,
- lambda = s_en,
- alpha = 0.5
-)
-coef(fit.elasticnet)
-20 x 1 sparse Matrix of class "dgCMatrix"
- s0
-(Intercept) 3.7079455811
-strike_dispersion 0.0968051818
-call_volume 0.0006080303
-put_volume 0.0514668950
-call_oi 0.0017100680
-put_oi -0.0343559893
-maturity_count 0.0143531881
-total_contracts .
-realized_vol_short 0.5114617598
-market_vol_index 0.0483281737
-realized_vol_mid -0.0002823096
-realized_vol_long 0.0092017012
-pulse_ratio -0.2117217524
-put_call_ratio_volume 0.0089154925
-put_call_ratio_oi 0.0115496767
-liquidity_ratio -0.0163079430
-option_dispersion -0.0222861716
-put_low_strike 0.0341647693
-put_proportion -0.0205208052
-stress_spread -0.0035417269
-Y_hatEN <- predict(
- fit.elasticnet,
- newx = as.matrix(val_noY),
- type = "response",
- s = s_en
-)
-
-MSS_EN <- mean((exp(y_val_log) - exp(Y_hatEN))**2)
-
-print(paste0("The E-N RMSE is: ", sqrt(MSS_EN)))[1] "The E-N RMSE is: 12.0313627692977"
-Elastic-Net does not eliminate the same variables as Lasso; instead, it eliminates \(total contracts\). This confirms that it preserves Lasso's variable-selection behaviour. The RMSE is slightly improved, yet the change compared to Lasso is negligible.
-Partial Least Squares
-The main idea behind Partial Least Squares is to construct new features that are linear combinations of the original dataset features and mutually orthogonal. The computation is done on the already dimension-reduced dataset.
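Formally, the first PLS direction maximises the covariance between the projected features and the response,

\[
w_{1} = \arg\max_{\|w\|=1} \operatorname{Cov}\!\left(Xw,\, Y\right)^{2},
\]

and subsequent directions solve the same problem under orthogonality constraints with the earlier components. Unlike PCA, the construction is therefore supervised: it uses \(Y\), not just the variance structure of \(X\).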
-train_pca_noY <- train_pca_lm |> dplyr::select(-implied_vol_ref)
-val_pca_noY <- val_pca_lm |> dplyr::select(-implied_vol_ref)
-
-fit.pls <- plsr(
- train_pca_lm$implied_vol_ref ~ as.matrix(train_pca_noY),
- ncomp = 8,
- validation = "CV"
-)
-coef(fit.pls)
-, , 8 comps
-
- train_pca_lm$implied_vol_ref
-as.matrix(train_pca_noY)PC1 0.032235567
-as.matrix(train_pca_noY)PC2 0.215238058
-as.matrix(train_pca_noY)PC3 -0.016715731
-as.matrix(train_pca_noY)PC4 -0.014672291
-as.matrix(train_pca_noY)PC5 -0.098566933
-as.matrix(train_pca_noY)PC6 0.077464828
-as.matrix(train_pca_noY)PC7 -0.061847857
-as.matrix(train_pca_noY)PC8 0.005858342
-as.matrix(train_pca_noY)PC9 -0.033099268
-explained_variance <- explvar(fit.pls)
-
-plot(
- 1:length(explained_variance),
- explained_variance,
- type = "b",
- pch = 16,
- col = "blue",
- xlab = "PLS Component",
- ylab = "Variance Explained (%)",
- main = "PLS: Variance Explained by Each Component",
- ylim = c(0, max(explained_variance) * 1.1)
-)
-
-grid()
-
-text(
- 1:length(explained_variance),
- explained_variance,
- labels = round(explained_variance, 2),
- pos = 3,
- col = "blue"
-)
After visualising the new components, we observe a steep drop-off in explained variance after the first two, which together account for 50.7% of the variance. More than half of the information present can therefore be captured by just two linear combinations. This suggests that information is initially easy to extract but becomes progressively harder to capture, with the fifth component explaining only 6.91% of the variance.
-pred_val <- predict(fit.pls, newdata = as.matrix(val_pca_noY), ncomp = 8)
-Y_hatPLS <- pred_val[, 1, 1]
-
-MSS_PLS <- mean((exp(val_pca_lm$implied_vol_ref) - exp(Y_hatPLS))**2)
-
-print(paste0("The PLS RMSE is: ", sqrt(MSS_PLS)))[1] "The PLS RMSE is: 12.7949712506425"
-Since we reduce the dimension of an already dimension-reduced dataset, the loss of information is too great: PLS is the worst model, with an RMSE of 12.79.
-Linear Mixed-Effects Models (LMM)
-Given the panel structure of our dataset (\(N=3887\) assets, \(T=544\) dates), we try a Linear Mixed-Effects Model. This approach allows us to model a global market trend (Fixed Effects) while estimating a specific baseline level for each asset (Random Intercept), capturing the idiosyncratic risk inherent to each underlying instrument.
-Model LMM 1: The Baseline
-Our first approach was to include all available explanatory variables: the raw market data and our engineered features. The objective was to establish a performance benchmark.
-The model specification is:
-\[
-\log(\texttt{ImpliedVol}_{it}) = \beta_0 + \sum_{k=1}^{p} \beta_k X_{k,it} + u_i + \epsilon_{it}
-\]
-where \(u_i \sim \mathcal{N}(0, \sigma_u^2)\) represents the random intercept for asset \(i\).
-mod_lmm_1 <- lmer(
- log(implied_vol_ref) ~ strike_dispersion +
- call_volume +
- put_volume +
- call_oi +
- put_oi +
- maturity_count +
- total_contracts +
- realized_vol_long +
- realized_vol_mid +
- realized_vol_short +
- market_vol_index +
- pulse_ratio +
- put_call_ratio_volume +
- put_call_ratio_oi +
- liquidity_ratio +
- option_dispersion +
- put_low_strike +
- put_proportion +
- stress_spread +
- (1 | asset_id),
- data = train_linear
-)
-
-summary(mod_lmm_1)
-
-predictions_log_1 <- predict(
- mod_lmm_1,
- newdata = val_linear,
- allow.new.levels = TRUE
-)
-predictions_real_1 <- exp(predictions_log_1)
-erreurs_1 <- val_linear$implied_vol_ref - predictions_real_1
-rmse_score_1 <- sqrt(mean(erreurs_1^2))
-print(paste("RMSE of the first LMM :", round(rmse_score_1, 4)))This baseline model achieved a Root Mean Square Error (RMSE) of 8.77. It is an improvement from the other linear models. However, while it captures the general variance, the inclusion of all variables likely introduced multicollinearity, inflating the standard errors of the coefficients.
-Model LMM 2: Feature Selection and Collinearity Reduction
-num_var <- train_linear |> dplyr::select(-asset_id, -obs_date)
-
-correlation_matrix <- cor(num_var, method = "pearson", use = "complete.obs")
-
-melted_cormat <- reshape2::melt(correlation_matrix)
-
-ggplot(melted_cormat, aes(x = Var1, y = Var2, fill = value)) +
- geom_tile() +
- scale_fill_gradientn(colors = paletteer_c("grDevices::Sunset", 30)) +
- labs(
- title = "Correlation Matrix for Numerical Variables",
- x = NULL,
- y = NULL,
- fill = "Corr"
- ) +
- theme_minimal() +
- theme(axis.text.x = element_text(angle = 45, hjust = 1))
To improve robustness, we analysed the correlation matrix and removed redundant variables that carried duplicate information. This simplification stabilised the model without significantly degrading the RMSE, confirming that a parsimonious set of features is sufficient to describe market dynamics.
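One convenient way to automate this pruning is `caret::findCorrelation()` (caret is already loaded), which scans a correlation matrix and proposes which member of each highly correlated pair to drop. A minimal sketch on simulated data:

```r
library(caret)

set.seed(2025)
X <- matrix(rnorm(500 * 4), ncol = 4)
X[, 4] <- X[, 1] + rnorm(500, sd = 0.05)   # V4 nearly duplicates V1
colnames(X) <- paste0("V", 1:4)

# Returns the names of columns to drop so that no remaining pair of
# predictors exceeds the correlation cutoff
findCorrelation(cor(X), cutoff = 0.9, names = TRUE)
```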
-mod_lmm_2 <- lmer(
- log(implied_vol_ref) ~ strike_dispersion +
- call_volume +
- call_oi +
- maturity_count +
- total_contracts +
- realized_vol_short +
- realized_vol_mid +
- realized_vol_long +
- market_vol_index +
- pulse_ratio +
- put_call_ratio_volume +
- liquidity_ratio +
- option_dispersion +
- put_low_strike +
- put_proportion +
- stress_spread +
- (1 | asset_id),
- data = train_linear
-)
-
-summary(mod_lmm_2)
-
-predictions_log_2 <- predict(
- mod_lmm_2,
- newdata = val_linear,
- allow.new.levels = TRUE
-)
-predictions_real_2 <- exp(predictions_log_2)
-erreurs_2 <- val_linear$implied_vol_ref - predictions_real_2
-rmse_score_2 <- sqrt(mean(erreurs_2^2))
-print(paste("RMSE of the second LMM :", round(rmse_score_2, 4)))Model LMM 3: Financial Interactions
-Markets are non-linear systems where factors amplify each other. Based on our financial analysis, we introduced five key interaction terms to capture these complex dynamics:
-The volatility Beta effect:
-\[
-\texttt{market_vol_index : realized_vol_long}
-\]
-This interaction captures the Volatility Beta Effect. It measures the sensitivity of the underlying's long-term realized volatility to the market volatility index, indicating how the volatility of the underlying is affected by the market environment, that is, whether markets are calm or panicking.
-The fear factor interaction:
-\[
-\texttt{put_call_ratio_volume : stress_spread}
-\]
-This Fear Factor Interaction combines market sentiment (the Put Call Volume Ratio) with the idiosyncratic stress specific to the underlying. It helps distinguish daily hedging from “urgent hedging” caused by market panic: high idiosyncratic risk combined with a high Put Call Volume Ratio confirms a strong panic signal that should trigger a massive surge in volatility.
-The market depth ratio:
-\[
-\texttt{total_contracts : liquidity_ratio}
-\]
-The Market Depth Ratio measures the impact of transactions on the market's structural stability. Depending on liquidity and how deep the market is, a large nominal transaction will have a different impact on volatility: a deep market absorbs a surge in contracts smoothly, whereas a very thin market does not.
-The skew tension ratio:
-\[
-\texttt{put_low_strike : market_vol_index}
-\]
-The Skew Tension Ratio indicates how convex the fear is. In a stable environment, crash protection is cheap. When risk appears, investors buy put options for protection, which pushes implied volatility up and leads to higher option prices. This ratio integrates the skew into the model to capture the sudden surges in implied volatility that occur when market participants stop calculating value and start buying protection at any price.
-The volatility shock ratio:
-\[
-\texttt{realized_vol_short : realized_vol_long}
-\]
-The Volatility Shock Ratio compares short-term realized volatility to long-term realized volatility. It identifies the nature of a volatility spike based on the principle of mean reversion: in the long term, volatility should converge to its mean. The point is to distinguish ordinary risk from a stress peak, which helps determine whether the current price movement is a passing anomaly or a fundamental shift in the risk profile.
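The lmer call below mixes the two R formula interaction operators, so it is worth recalling how they expand: `a * b` adds the main effects plus their interaction, while `a:b` adds the interaction term only. A quick check:

```r
# `*` expands to main effects plus interaction; `:` is the interaction alone
f_star  <- ~ a * b
f_colon <- ~ a:b

attr(terms(f_star),  "term.labels")   # "a"  "b"  "a:b"
attr(terms(f_colon), "term.labels")   # "a:b"
```

This is why the last two interactions use `:`; their main effects are already listed separately in the formula.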
-mod_lmm_3 <- lmer(
- log(implied_vol_ref) ~
- realized_vol_long *
- market_vol_index +
- put_call_ratio_volume * stress_spread +
- total_contracts * liquidity_ratio +
-
- put_low_strike:market_vol_index +
- realized_vol_short:realized_vol_long +
-
- strike_dispersion +
- call_volume +
- call_oi +
- maturity_count +
- realized_vol_short +
- realized_vol_mid +
- put_low_strike +
- put_proportion +
- option_dispersion +
- pulse_ratio +
-
- (1 | asset_id),
- data = train_linear
-)
-
-summary(mod_lmm_3)
-
-predictions_log_3 <- predict(
- mod_lmm_3,
- newdata = val_linear,
- allow.new.levels = TRUE
-)
-predictions_real_3 <- exp(predictions_log_3)
-erreurs_3 <- val_linear$implied_vol_ref - predictions_real_3
-rmse_score_3 <- sqrt(mean(erreurs_3^2))
-print(paste("RMSE of the third LMM :", round(rmse_score_3, 4)))Model LMM 4: Addition of Quadratic Terms
-Volatility often exhibits a convex behavior (the “Vol of Vol”). Extreme variations in realized volatility tend to have a disproportionate impact on implied volatility. To capture this, we added squared terms (\(X^2\)) for the most significant variables: realized_vol_short, realized_vol_long, market_vol_index, and pulse_ratio.
mod_lmm_4 <- lmer(
- log(implied_vol_ref) ~
- realized_vol_long *
- market_vol_index +
- put_call_ratio_volume * stress_spread +
- total_contracts * liquidity_ratio +
-
- put_low_strike:market_vol_index +
- realized_vol_short:realized_vol_long +
-
- strike_dispersion +
- call_volume +
- call_oi +
- maturity_count +
- realized_vol_short +
- realized_vol_mid +
- put_low_strike +
- put_proportion +
- option_dispersion +
- pulse_ratio +
-
- I(realized_vol_short^2) +
- I(market_vol_index^2) +
- I(realized_vol_long^2) +
- I(pulse_ratio^2) +
-
- (1 | asset_id),
- data = train_linear
-)
-
-summary(mod_lmm_4)
-Linear mixed model fit by REML. t-tests use Satterthwaite's method [
-lmerModLmerTest]
-Formula: log(implied_vol_ref) ~ realized_vol_long * market_vol_index +
- put_call_ratio_volume * stress_spread + total_contracts *
- liquidity_ratio + put_low_strike:market_vol_index + realized_vol_short:realized_vol_long +
- strike_dispersion + call_volume + call_oi + maturity_count +
- realized_vol_short + realized_vol_mid + put_low_strike +
- put_proportion + option_dispersion + pulse_ratio + I(realized_vol_short^2) +
- I(market_vol_index^2) + I(realized_vol_long^2) + I(pulse_ratio^2) +
- (1 | asset_id)
- Data: train_linear
-
-REML criterion at convergence: -506957.9
-
-Scaled residuals:
- Min 1Q Median 3Q Max
--22.1432 -0.3906 0.0188 0.4271 12.0260
-
-Random effects:
- Groups Name Variance Std.Dev.
- asset_id (Intercept) 0.07753 0.2784
- Residual 0.04137 0.2034
-Number of obs: 1533234, groups: asset_id, 3886
-
-Fixed effects:
- Estimate Std. Error df t value
-(Intercept) 3.742e+00 4.492e-03 3.867e+03 833.120
-realized_vol_long 2.739e-03 2.411e-03 1.531e+06 1.136
-market_vol_index 1.245e-01 6.470e-04 1.532e+06 192.344
-put_call_ratio_volume 8.798e-03 2.364e-04 1.531e+06 37.217
-stress_spread 3.691e-02 1.146e-03 1.530e+06 32.209
-total_contracts -3.338e-02 6.236e-03 1.532e+06 -5.352
-liquidity_ratio -1.142e-02 3.397e-04 1.532e+06 -33.624
-strike_dispersion 7.688e-02 6.692e-03 1.532e+06 11.489
-call_volume 2.207e-02 6.914e-04 1.531e+06 31.921
-call_oi -1.782e-02 7.491e-04 1.532e+06 -23.789
-maturity_count 2.737e-03 4.421e-04 1.533e+06 6.192
-realized_vol_short 2.723e-01 3.209e-03 1.531e+06 84.858
-realized_vol_mid 1.735e-02 6.415e-04 1.530e+06 27.053
-put_low_strike 1.212e-02 3.792e-04 1.531e+06 31.972
-put_proportion 2.784e-03 2.847e-04 1.531e+06 9.778
-option_dispersion -8.780e-02 1.186e-02 1.532e+06 -7.404
-pulse_ratio -1.513e-01 1.801e-03 1.531e+06 -84.026
-I(realized_vol_short^2) -3.701e-02 4.622e-04 1.531e+06 -80.059
-I(market_vol_index^2) -4.417e-03 1.080e-04 1.533e+06 -40.908
-I(realized_vol_long^2) -1.276e-02 3.811e-04 1.532e+06 -33.469
-I(pulse_ratio^2) 1.220e-02 2.064e-04 1.531e+06 59.097
-realized_vol_long:market_vol_index -2.543e-02 3.112e-04 1.531e+06 -81.700
-put_call_ratio_volume:stress_spread 4.601e-03 1.245e-04 1.530e+06 36.963
-total_contracts:liquidity_ratio -2.600e-03 2.485e-04 1.532e+06 -10.465
-market_vol_index:put_low_strike -2.088e-03 1.728e-04 1.530e+06 -12.081
-realized_vol_long:realized_vol_short 4.591e-02 7.569e-04 1.531e+06 60.657
- Pr(>|t|)
-(Intercept) < 2e-16 ***
-realized_vol_long 0.256
-market_vol_index < 2e-16 ***
-put_call_ratio_volume < 2e-16 ***
-stress_spread < 2e-16 ***
-total_contracts 8.68e-08 ***
-liquidity_ratio < 2e-16 ***
-strike_dispersion < 2e-16 ***
-call_volume < 2e-16 ***
-call_oi < 2e-16 ***
-maturity_count 5.94e-10 ***
-realized_vol_short < 2e-16 ***
-realized_vol_mid < 2e-16 ***
-put_low_strike < 2e-16 ***
-put_proportion < 2e-16 ***
-option_dispersion 1.32e-13 ***
-pulse_ratio < 2e-16 ***
-I(realized_vol_short^2) < 2e-16 ***
-I(market_vol_index^2) < 2e-16 ***
-I(realized_vol_long^2) < 2e-16 ***
-I(pulse_ratio^2) < 2e-16 ***
-realized_vol_long:market_vol_index < 2e-16 ***
-put_call_ratio_volume:stress_spread < 2e-16 ***
-total_contracts:liquidity_ratio < 2e-16 ***
-market_vol_index:put_low_strike < 2e-16 ***
-realized_vol_long:realized_vol_short < 2e-16 ***
----
-Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-predictions_log_4 <- predict(
- mod_lmm_4,
- newdata = val_linear,
- allow.new.levels = TRUE
-)
-predictions_real_4 <- exp(predictions_log_4)
-erreurs_4 <- val_linear$implied_vol_ref - predictions_real_4
-rmse_score_4 <- sqrt(mean(erreurs_4^2))
-print(paste("RMSE of the fourth LMM :", round(rmse_score_4, 4)))[1] "RMSE of the fourth LMM : 8.3192"
-selected_assets <- sample(unique(val_linear$asset_id), 5)
-
-n_train_display <- floor(544 / 2)
-
-dates_train_all <- sort(unique(train_linear$obs_date))
-dates_train_subset <- tail(dates_train_all, n_train_display)
-
-df_truth_train <- train_linear |>
- filter(asset_id %in% selected_assets) |>
- filter(obs_date %in% dates_train_subset) |>
- dplyr::select(asset_id, obs_date, implied_vol_ref)
-
-df_truth_val <- val_linear |>
- filter(asset_id %in% selected_assets) |>
- dplyr::select(asset_id, obs_date, implied_vol_ref)
-
-df_truth_full <- bind_rows(df_truth_train, df_truth_val) |>
- mutate(obs_date = as.Date(obs_date))
-
-df_pred <- val_linear |>
- filter(asset_id %in% selected_assets) %>%
- mutate(
- pred_log = predict(mod_lmm_4, newdata = ., allow.new.levels = TRUE),
- prediction = exp(pred_log),
- obs_date = as.Date(obs_date)
- ) |>
- dplyr::select(asset_id, obs_date, prediction)
-
-ggplot() +
- geom_line(
- data = df_truth_full,
- aes(x = obs_date, y = implied_vol_ref, color = as.factor(asset_id)),
- size = 0.7,
- alpha = 0.8
- ) +
- geom_line(
- data = df_pred,
- aes(x = obs_date, y = prediction, color = as.factor(asset_id)),
- linetype = "dashed",
- size = 0.7
- ) +
- geom_vline(
- xintercept = as.numeric(min(df_pred$obs_date)),
- linetype = "dotted",
- color = "black",
- size = 1
- ) +
- theme_minimal() +
- labs(
- title = "Predictions (Dashed) vs. Reality (Solid)",
- subtitle = "Visualisation of 10 Random Assets (Focus on Train/Val Transition)",
- x = "Date",
- y = "Implied Volatility",
- color = "Asset ID"
- ) +
- theme(legend.position = "right")
For this model, we plotted the predictions against the actual values for a random subset of assets. While the model captures the general trend well, it often underpredicts extreme volatility spikes. This suggests that a simple Random Intercept (\(u_i\)) is insufficient: assets do not merely have different levels of volatility, they have different sensitivities to market stress. A defensive stock (low beta) and a tech stock (high beta) do not react with the same intensity to a VIX spike.
-Model LMM 5: Random Slopes
-To address this issue, we introduce Random Slopes. Instead of forcing a global coefficient for key variables, we allow the slope to vary by asset:
-\[
-\log(Y_{it}) = (\beta_{market} + b_{i,m})X_{market,t} + \dots + u_i + \epsilon_{it}
-\]
-We included realized_vol_short and realized_vol_long:market_vol_index in the random effects structure (1 + realized_vol_short + realized_vol_long:market_vol_index | asset_id). This modification allows the model to learn the specific beta and reactivity of each asset, significantly improving the fit for high-volatility profiles.
mod_lmm_5 <- lmer(
- log(implied_vol_ref) ~
- put_call_ratio_volume *
- stress_spread +
- total_contracts * liquidity_ratio +
-
- put_low_strike:market_vol_index +
- realized_vol_short:realized_vol_long +
-
- strike_dispersion +
- call_volume +
- call_oi +
- maturity_count +
- realized_vol_long +
- realized_vol_mid +
- market_vol_index +
- put_low_strike +
- put_proportion +
- option_dispersion +
- pulse_ratio +
-
- I(realized_vol_short^2) +
- I(market_vol_index^2) +
- I(realized_vol_long^2) +
- I(pulse_ratio^2) +
-
- (1 + realized_vol_short + realized_vol_long:market_vol_index | asset_id),
- data = train_linear
-)
-
-summary(mod_lmm_5)
-Linear mixed model fit by REML. t-tests use Satterthwaite's method [
-lmerModLmerTest]
-Formula: log(implied_vol_ref) ~ put_call_ratio_volume * stress_spread +
- total_contracts * liquidity_ratio + put_low_strike:market_vol_index +
- realized_vol_short:realized_vol_long + strike_dispersion +
- call_volume + call_oi + maturity_count + realized_vol_long +
- realized_vol_mid + market_vol_index + put_low_strike + put_proportion +
- option_dispersion + pulse_ratio + I(realized_vol_short^2) +
- I(market_vol_index^2) + I(realized_vol_long^2) + I(pulse_ratio^2) +
- (1 + realized_vol_short + realized_vol_long:market_vol_index |
- asset_id)
- Data: train_linear
-
-REML criterion at convergence: -794671.7
-
-Scaled residuals:
- Min 1Q Median 3Q Max
--24.9130 -0.3898 0.0179 0.4286 12.6215
-
-Random effects:
- Groups Name Variance Std.Dev. Corr
- asset_id (Intercept) 0.100984 0.31778
- realized_vol_short 0.014391 0.11996 0.10
- realized_vol_long:market_vol_index 0.007767 0.08813 -0.21 -0.18
- Residual 0.033598 0.18330
-Number of obs: 1533234, groups: asset_id, 3886
-
-Fixed effects:
- Estimate Std. Error df t value
-(Intercept) 3.732e+00 5.024e-03 3.594e+03 742.886
-put_call_ratio_volume 8.216e-03 2.249e-04 1.529e+06 36.530
-stress_spread 5.963e-02 1.132e-03 5.737e+05 52.669
-total_contracts -5.658e-02 6.074e-03 1.530e+06 -9.315
-liquidity_ratio -8.112e-03 3.227e-04 1.530e+06 -25.138
-strike_dispersion 9.638e-02 6.521e-03 1.530e+06 14.779
-call_volume 1.408e-02 6.495e-04 1.528e+06 21.674
-call_oi -1.244e-02 7.171e-04 1.530e+06 -17.345
-maturity_count 5.641e-03 4.275e-04 1.531e+06 13.196
-realized_vol_long 1.345e-01 1.661e-03 2.118e+04 80.967
-realized_vol_mid 1.976e-02 6.284e-04 1.523e+06 31.443
-market_vol_index 1.495e-01 6.363e-04 8.498e+05 234.917
-put_low_strike 1.026e-02 3.584e-04 1.529e+06 28.630
-put_proportion 6.142e-04 2.676e-04 1.528e+06 2.295
-option_dispersion -1.282e-01 1.155e-02 1.530e+06 -11.095
-pulse_ratio -4.017e-02 1.166e-03 1.476e+04 -34.443
-I(realized_vol_short^2) -1.144e-01 5.545e-04 1.702e+05 -206.242
-I(market_vol_index^2) -6.087e-03 1.016e-04 1.529e+06 -59.903
-I(realized_vol_long^2) -5.797e-02 4.443e-04 4.486e+05 -130.472
-I(pulse_ratio^2) 2.290e-02 2.287e-04 9.545e+05 100.136
-put_call_ratio_volume:stress_spread 3.270e-03 1.215e-04 1.527e+06 26.906
-total_contracts:liquidity_ratio 1.708e-04 2.380e-04 1.531e+06 0.718
-put_low_strike:market_vol_index 1.041e-04 2.531e-04 1.326e+06 0.411
-realized_vol_short:realized_vol_long 1.506e-01 9.633e-04 3.105e+05 156.367
- Pr(>|t|)
-(Intercept) <2e-16 ***
-put_call_ratio_volume <2e-16 ***
-stress_spread <2e-16 ***
-total_contracts <2e-16 ***
-liquidity_ratio <2e-16 ***
-strike_dispersion <2e-16 ***
-call_volume <2e-16 ***
-call_oi <2e-16 ***
-maturity_count <2e-16 ***
-realized_vol_long <2e-16 ***
-realized_vol_mid <2e-16 ***
-market_vol_index <2e-16 ***
-put_low_strike <2e-16 ***
-put_proportion 0.0217 *
-option_dispersion <2e-16 ***
-pulse_ratio <2e-16 ***
-I(realized_vol_short^2) <2e-16 ***
-I(market_vol_index^2) <2e-16 ***
-I(realized_vol_long^2) <2e-16 ***
-I(pulse_ratio^2) <2e-16 ***
-put_call_ratio_volume:stress_spread <2e-16 ***
-total_contracts:liquidity_ratio 0.4730
-put_low_strike:market_vol_index 0.6809
-realized_vol_short:realized_vol_long <2e-16 ***
----
-Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-optimizer (nloptwrap) convergence code: 0 (OK)
-Model failed to converge with max|grad| = 0.00268325 (tol = 0.002, component 1)
-predictions_log_5 <- predict(
- mod_lmm_5,
- newdata = val_linear,
- allow.new.levels = TRUE
-)
-predictions_real_5 <- exp(predictions_log_5)
-erreurs_5 <- val_linear$implied_vol_ref - predictions_real_5
-rmse_score_5 <- sqrt(mean(erreurs_5^2))
-print(paste("RMSE of the fifth LMM :", round(rmse_score_5, 4)))[1] "RMSE of the fifth LMM : 8.1011"
-selected_assets <- sample(unique(val_linear$asset_id), 5)
-
-n_train_display <- floor(544 / 2)
-
-dates_train_all <- sort(unique(train_linear$obs_date))
-dates_train_subset <- tail(dates_train_all, n_train_display)
-
-df_truth_train <- train_linear |>
- filter(asset_id %in% selected_assets) |>
- filter(obs_date %in% dates_train_subset) |>
- dplyr::select(asset_id, obs_date, implied_vol_ref)
-
-df_truth_val <- val_linear |>
- filter(asset_id %in% selected_assets) |>
- dplyr::select(asset_id, obs_date, implied_vol_ref)
-
-df_truth_full <- bind_rows(df_truth_train, df_truth_val) |>
- mutate(
- obs_date = as.Date(obs_date),
- implied_vol_ref = implied_vol_ref
- )
-
-df_pred_temp <- val_linear |>
- filter(asset_id %in% selected_assets)
-
-df_pred <- df_pred_temp |>
- mutate(
- pred_log = predict(
- mod_lmm_5,
- newdata = df_pred_temp,
- allow.new.levels = TRUE
- ),
- prediction = exp(pred_log),
- obs_date = as.Date(obs_date)
- ) |>
- dplyr::select(asset_id, obs_date, prediction)
-
-ggplot() +
- geom_line(
- data = df_truth_full,
- aes(x = obs_date, y = implied_vol_ref, color = as.factor(asset_id)),
- linewidth = 0.7,
- alpha = 0.8
- ) +
- geom_line(
- data = df_pred,
- aes(x = obs_date, y = prediction, color = as.factor(asset_id)),
- linetype = "dashed",
- linewidth = 0.7
- ) +
- geom_vline(
- xintercept = as.numeric(min(df_pred$obs_date)),
- linetype = "dotted",
- color = "black",
- linewidth = 1
- ) +
- theme_minimal() +
- labs(
- title = "Predictions (Dashed) vs. Reality (Solid)",
- subtitle = "Visualisation of 5 Random Assets",
- x = "Date",
- y = "Implied Volatility",
- color = "Asset ID"
- ) +
- theme(legend.position = "right")
Random Effects Analysis:
-The summary of the model confirms our hypothesis that assets are structurally different. We observe an intercept variance of \(0.101\). This large variance confirms that different assets have vastly different baseline volatility levels. The non-zero standard deviations of the random slopes (\(0.120\) for realized_vol_short and \(0.088\) for the realized_vol_long:market_vol_index interaction) show that assets also have unique sensitivities: some are “high beta” (e.g., tech stocks) and react violently to market stress, while others are “defensive” and react mildly. This random-slope structure allows the model to “learn” the specific risk profile of each of the 3,886 assets in the training set, correcting the “parallel trend” bias observed in earlier models.
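The random-effects structure reported in the summary corresponds to the following specification, with \(i\) indexing assets and \(t\) observation dates (abbreviating realized_vol_short as \(rv^{s}\), realized_vol_long as \(rv^{l}\), and market_vol_index as \(mvi\)):

\[
\log(\mathrm{IV}_{it}) = \boldsymbol{\beta}^{\top}\mathbf{x}_{it} + b_{0i} + b_{1i}\, rv^{s}_{it} + b_{2i}\,\big(rv^{l}_{it} \times mvi_{it}\big) + \varepsilon_{it},
\qquad (b_{0i}, b_{1i}, b_{2i})^{\top} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}),
\]

where the reported variances \(0.101\), \(0.014\), and \(0.008\) form the diagonal of \(\boldsymbol{\Sigma}\) and the residual variance is \(\sigma^2_{\varepsilon} = 0.034\).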
-Fixed Effects Analysis:
-The Fixed Effects part of the summary provides powerful insights into the general laws of the options market. With \(t = 234.9\), the market volatility index is by far the strongest driver: implied volatility is systemically linked to the VIX, and when the market panics, all assets follow. Long- and mid-horizon realized volatility are also strong predictors (\(t = 81.0\) and \(31.4\)): the past remains the best guide to the future. The quadratic terms are equally informative: the negative coefficients on the squared realized-volatility measures indicate a concave relationship, with implied volatility rising in realized volatility but at a decreasing rate, so extreme moves in the underlying are dampened rather than extrapolated linearly. Moreover, the positive coefficient of the put-call volume ratio confirms the “Fear Gauge” theory: a surge in put buying volume exerts strong upward pressure on volatility prices.
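The role of the quadratic terms can be made explicit from the reported estimates. Holding the other regressors fixed and ignoring the interaction with short-horizon volatility, the marginal effect of (scaled) long realized volatility on log implied volatility is

\[
\frac{\partial\,\mathbb{E}[\log \mathrm{IV}]}{\partial\,\mathrm{realized\_vol\_long}} \approx 0.135 \;-\; 2 \times 0.058 \times \mathrm{realized\_vol\_long},
\]

using the coefficients \(1.345 \times 10^{-1}\) and \(-5.797 \times 10^{-2}\) from the summary: the effect is positive at typical levels but shrinks as realized volatility rises.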
-Analysis of the interactions:
-Our financial engineering efforts yielded mixed but informative results:
-Volatility Shock Ratio (realized_vol_short:realized_vol_long): With a t-value of \(156.4\), this is by far the most significant interaction. Read together with the negative quadratic terms, the positive cross term penalizes divergence between short- and long-term volatility: it mathematically encodes Mean Reversion, applying a correction when short-term volatility deviates too far from its long-term level and preventing the prediction from exploding linearly.
-Market Depth (total_contracts:liquidity_ratio): With a t-value of \(0.72\) (\(p = 0.47\)), this interaction is not statistically significant. Any amplification of volatility by liquidity constraints appears to be already captured by the main effects.
-Skew Tension (put_low_strike:market_vol_index): With a t-value of \(0.41\) (\(p = 0.68\)), this interaction is not statistically significant either, so the “Crash Convexity” theory receives no incremental support beyond the main effects of its components.
-Fear Factor (put_call_ratio_volume:stress_spread): With a t-value of \(26.9\), this interaction is highly significant. It confirms the amplification mechanism behind the “Fear Gauge”: a surge in put buying exerts even stronger upward pressure on volatility when market stress is already elevated.
-
Finally, concerning model diagnostics, we note that the REML criterion dropped to \(-775,843\), indicating a substantially better fit than the baseline models. However, we also observe a convergence warning (max|grad| = 0.0027 against a tolerance of 0.002). This is common in complex mixed models fitted to large datasets (1.5 million observations) with rich random structures. While scaling the variables helped, the random-slope optimization remains a computational challenge; given the extremely small standard errors, the estimates nevertheless remain robust.
Non-Linear & Black-Box Models
-Following the evaluation of linear frameworks, this section explores high-capacity, non-linear algorithms capable of mapping complex, multidimensional interactions within the financial feature space. A critical methodological distinction must be noted regarding the data pipeline: while linear models and deep learning architectures rely on the PCA-reduced dataset to mitigate multicollinearity and stabilize gradients, tree-based models were trained exclusively on the raw dataset (\(train\_final\)). Decision trees naturally handle collinearity through their splitting mechanism, and applying PCA beforehand would orthogonally mix the original financial indicators, thereby destroying the localized, non-linear thresholds that tree ensembles are designed to capture.
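To make this concrete, here is a small synthetic sketch (illustrative data and threshold, not the project's): a target defined by a single axis-aligned cutoff is recovered exactly by one tree split on the raw features, but only approximately once a PCA rotation mixes the inputs.

```r
# Illustrative sketch: axis-aligned tree splits vs. PCA-rotated inputs (synthetic data).
library(rpart)  # ships with standard R distributions

set.seed(2025)
n  <- 2000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + 0.6 * rnorm(n)           # correlated, so PCA genuinely rotates
y  <- factor(x1 > 0)                      # target depends on one raw threshold

pcs <- prcomp(cbind(x1, x2))$x            # PCA mixes x1 and x2
df_raw <- data.frame(y = y, x1 = x1, x2 = x2)
df_pca <- data.frame(y = y, pc1 = pcs[, 1], pc2 = pcs[, 2])

acc <- function(fit, d) mean(predict(fit, d, type = "class") == d$y)
fit_raw <- rpart(y ~ ., data = df_raw)
fit_pca <- rpart(y ~ ., data = df_pca)

acc(fit_raw, df_raw)   # one split at x1 = 0 separates the classes cleanly
acc(fit_pca, df_pca)   # the rotated (diagonal) boundary is approximated less well
```

The same intuition motivates feeding tree ensembles the raw indicators rather than orthogonalized components.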
-To efficiently navigate the hyperparameter space of these complex models, Grid Search methodologies were discarded in favor of Bayesian Optimization using Gaussian Processes, allowing for a directed and computationally efficient convergence toward the optimal parameter sets by maximizing the negative Root Mean Squared Error (\(-RMSE\)).
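Concretely, with acq = "ucb" and \(\kappa = 2.576\) (the settings used in the tuning runs below), the next hyperparameter configuration is proposed by maximizing an Upper Confidence Bound on the Gaussian-Process surrogate of the \(-RMSE\) surface:

\[
x_{\text{next}} = \arg\max_{x \in \mathcal{B}} \; \mu(x) + \kappa\, \sigma(x), \qquad \kappa = 2.576,
\]

where \(\mu(x)\) and \(\sigma(x)\) are the posterior mean and standard deviation of the surrogate over the bounded search space \(\mathcal{B}\); \(\kappa\) balances exploitation (high \(\mu\)) against exploration (high \(\sigma\)).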
-train_tree$implied_vol_ref <- log(train_tree$implied_vol_ref)
-val_tree$implied_vol_ref <- log(val_tree$implied_vol_ref)
-
-train_clean <- train_tree |> select(-asset_id, -obs_date)
-val_clean <- val_tree |> select(-asset_id, -obs_date)
-Gradient Boosting Frameworks: XGBoost and LightGBM
-Gradient boosting decision trees represent the state-of-the-art for tabular data. We benchmarked two industry-standard architectures: XGBoost, which relies on level-wise tree growth, and LightGBM, which employs a leaf-wise expansion strategy.
-XGBoost Performance and Boundary Effect
-x_train_mat <- as.matrix(train_clean |> select(-implied_vol_ref))
-y_train_vec <- train_clean$implied_vol_ref
-
-x_val_mat <- as.matrix(val_clean |> select(-implied_vol_ref))
-y_val_vec <- val_clean$implied_vol_ref
-
-dtrain <- xgb.DMatrix(data = x_train_mat, label = y_train_vec)
-dval <- xgb.DMatrix(data = x_val_mat, label = y_val_vec)
-
-scoring_function <- function(eta, max_depth, subsample, colsample_bytree) {
- parsed_depth <- as.integer(round(max_depth))
-
- params <- list(
- booster = "gbtree",
- objective = "reg:squarederror",
- eta = eta,
- max_depth = parsed_depth,
- subsample = subsample,
- colsample_bytree = colsample_bytree,
- tree_method = "hist",
- nthread = parallel::detectCores() - 1
- )
-
- cv_model <- tryCatch(
- {
- xgb.cv(
- params = params,
- data = dtrain,
- nrounds = 150,
- nfold = 3,
- early_stopping_rounds = 15,
- verbose = 0,
- metrics = "rmse"
- )
- },
- error = function(e) return(NULL)
- )
-
- if (is.null(cv_model)) {
- return(list(Score = -9999, Pred = 0))
- }
-
- best_iter <- cv_model$best_iteration
- if (is.null(best_iter) || length(best_iter) == 0) {
- best_iter <- which.min(cv_model$evaluation_log$test_rmse_mean)
- }
-
- best_rmse <- cv_model$evaluation_log$test_rmse_mean[best_iter]
-
- list(Score = -best_rmse, Pred = 0)
-}
-
-bounds <- list(
- eta = c(0.05, 0.3),
- max_depth = c(8L, 15L),
- subsample = c(0.6, 1.0),
- colsample_bytree = c(0.6, 1.0)
-)
-
-opt_obj <- BayesianOptimization(
- FUN = scoring_function,
- bounds = bounds,
- init_points = 3,
- n_iter = 5,
- acq = "ucb",
- kappa = 2.576
-)
-
-best_params_raw <- opt_obj$Best_Par
-
-best_params_xgb <- list(
- booster = "gbtree",
- objective = "reg:squarederror",
- eta = best_params_raw["eta"],
- max_depth = best_params_raw["max_depth"],
- subsample = best_params_raw["subsample"],
- colsample_bytree = best_params_raw["colsample_bytree"],
- tree_method = "hist",
- nthread = parallel::detectCores() - 1
-)
-
-final_model_xgb <- xgb.train(
- params = best_params_xgb,
- data = dtrain,
- nrounds = 1000,
- evals = list(val = dval, train = dtrain),
- early_stopping_rounds = 50,
- verbose = 1
-)
-
-preds_xgb_log <- predict(final_model_xgb, dval)
-rmse_xgb_real <- sqrt(mean((exp(preds_xgb_log) - exp(y_val_vec))^2))
-
-preds_xgb_train_log <- predict(final_model_xgb, dtrain)
-rmse_xgb_train_real <- sqrt(mean(
- (exp(preds_xgb_train_log) - exp(y_train_vec))^2
-))
-
-print(paste0("XGBoost RMSE on validation set: ", round(rmse_xgb_real, 4)))
-print(paste0("XGBoost RMSE on training set: ", round(rmse_xgb_train_real, 4)))
-paste0(
- "Best Hyperparameters: ",
- paste(names(best_params_raw), best_params_raw, sep = " = ", collapse = ", ")
-)
-The XGBoost model was tuned over \(8\) Bayesian evaluation rounds (\(3\) initial points plus \(5\) guided iterations). The optimization process revealed a distinct statistical behavior: the algorithm consistently drifted toward the high-capacity region of the predefined search space. The optimal configuration selected a learning rate (\(eta\)) of \(0.2\), a maximum tree depth (\(max\_depth\)) of \(10\), and used every row and every column when building each tree (\(subsample = 1.0\), \(colsample\_bytree = 1.0\)).
-On the original exponential scale, this configuration yielded a validation \(RMSE\) of \(10.70\). Comparing this generalization error to the training \(RMSE\) of \(0.565\), however, reveals an extreme bias-variance trade-off. The saturation of the complexity limits (a depth of \(10\) is exceptionally deep for boosting) and the refusal to apply stochastic regularization (row or column subsampling) indicate that the underlying implied-volatility signal is highly complex and difficult to separate from structural noise. Consequently, the model operates in a high-variance regime, effectively memorizing the training set (near-zero training error) while failing to generalize beyond the \(\sim 10.70\) validation barrier.
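This high-variance regime is easy to reproduce in miniature. A synthetic base-R sketch (illustrative data, unrelated to the project): an over-parameterized fit drives the training error far below the noise level while the out-of-sample error stays large.

```r
# Illustrative sketch (synthetic data): memorization vs. generalization.
set.seed(2025)
n <- 30
x <- seq(0, 1, length.out = n)
signal  <- sin(2 * pi * x)
y_train <- signal + rnorm(n, sd = 0.3)
y_test  <- signal + rnorm(n, sd = 0.3)    # fresh noise, same signal

fit  <- lm(y_train ~ poly(x, 25))         # far more capacity than the signal needs
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

rmse(y_train, fitted(fit))                # well below the noise sd of 0.3
rmse(y_test,  fitted(fit))                # noticeably larger: memorized noise does not transfer
```

The train-validation gap of the tuned XGBoost model is the same phenomenon at scale.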
-LightGBM: Leaf-Wise Validation
-x_train_mat <- as.matrix(train_clean |> select(-implied_vol_ref))
-y_train_vec <- train_clean$implied_vol_ref
-
-x_val_mat <- as.matrix(val_clean |> select(-implied_vol_ref))
-y_val_vec <- val_clean$implied_vol_ref
-
-dtrain <- lgb.Dataset(data = x_train_mat, label = y_train_vec)
-dval <- lgb.Dataset(data = x_val_mat, label = y_val_vec, reference = dtrain)
-
-scoring_function <- function(
- num_leaves,
- learning_rate,
- bagging_fraction,
- feature_fraction
-) {
- parsed_num_leaves <- as.integer(round(num_leaves))
-
- params <- list(
- objective = "regression",
- metric = "rmse",
- num_threads = parallel::detectCores() - 1,
- learning_rate = learning_rate,
- num_leaves = parsed_num_leaves,
- bagging_fraction = bagging_fraction,
- bagging_freq = ifelse(bagging_fraction < 1.0, 1, 0),
- feature_fraction = feature_fraction
- )
-
- cv_model <- tryCatch(
- {
- lgb.cv(
- params = params,
- data = dtrain,
- nrounds = 150,
- nfold = 3,
- early_stopping_rounds = 15,
- verbose = -1
- )
- },
- error = function(e) return(NULL)
- )
-
- if (is.null(cv_model)) {
- return(list(Score = -9999, Pred = 0))
- }
-
- best_iter <- cv_model$best_iter
- best_rmse <- cv_model$best_score
-
- list(Score = -best_rmse, Pred = best_iter)
-}
-
-bounds <- list(
- num_leaves = c(20L, 150L),
- learning_rate = c(0.01, 0.2),
- bagging_fraction = c(0.6, 1.0),
- feature_fraction = c(0.6, 1.0)
-)
-
-opt_obj <- BayesianOptimization(
- FUN = scoring_function,
- bounds = bounds,
- init_points = 3,
- n_iter = 5,
- acq = "ucb",
- kappa = 2.576
-)
-elapsed = 29.374 Round = 1 num_leaves = 44.0000 learning_rate = 0.163952 bagging_fraction = 0.6955755 feature_fraction = 0.6948811 Value = -0.2214716
-elapsed = 33.006 Round = 2 num_leaves = 44.0000 learning_rate = 0.09588281 bagging_fraction = 0.7020396 feature_fraction = 0.6533932 Value = -0.226821
-elapsed = 41.814 Round = 3 num_leaves = 83.0000 learning_rate = 0.01450508 bagging_fraction = 0.8053167 feature_fraction = 0.9658835 Value = -0.2577391
-elapsed = 34.042 Round = 4 num_leaves = 124.0000 learning_rate = 0.2000 bagging_fraction = 0.6000 feature_fraction = 1.0000 Value = -0.2073121
-elapsed = 20.033 Round = 5 num_leaves = 20.0000 learning_rate = 0.2000 bagging_fraction = 1.0000 feature_fraction = 0.6073779 Value = -0.2290632
-elapsed = 27.617 Round = 6 num_leaves = 20.0000 learning_rate = 0.0100 bagging_fraction = 0.6000 feature_fraction = 0.9715227 Value = -0.3091457
-elapsed = 33.275 Round = 7 num_leaves = 150.0000 learning_rate = 0.2000 bagging_fraction = 0.99869 feature_fraction = 0.6000 Value = -0.205424
-elapsed = 42.416 Round = 8 num_leaves = 150.0000 learning_rate = 0.0971526 bagging_fraction = 0.8371583 feature_fraction = 0.9994332 Value = -0.2107813
-
- Best Parameters Found:
-Round = 7 num_leaves = 150.0000 learning_rate = 0.2000 bagging_fraction = 0.99869 feature_fraction = 0.6000 Value = -0.205424
-best_history_index <- which.max(opt_obj$History$Value)
-best_iteration_val <- opt_obj$Pred[[best_history_index]]
-
-best_params_row <- data.frame(
- num_leaves = as.integer(round(opt_obj$Best_Par["num_leaves"])),
- learning_rate = opt_obj$Best_Par["learning_rate"],
- bagging_fraction = opt_obj$Best_Par["bagging_fraction"],
- feature_fraction = opt_obj$Best_Par["feature_fraction"],
- best_iter = best_iteration_val,
- cv_rmse = -opt_obj$Best_Value
-)
-
-best_params_lgb <- list(
- objective = "regression",
- metric = "rmse",
- num_threads = parallel::detectCores() - 1,
- learning_rate = best_params_row$learning_rate,
- num_leaves = best_params_row$num_leaves,
- bagging_fraction = best_params_row$bagging_fraction,
- bagging_freq = ifelse(best_params_row$bagging_fraction < 1.0, 1, 0),
- feature_fraction = best_params_row$feature_fraction
-)
-
-final_model_lgb <- lgb.train(
- params = best_params_lgb,
- data = dtrain,
- nrounds = best_params_row$best_iter,
- valids = list(val = dval, train = dtrain),
- verbose = 1
-)
-[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.106895 seconds.
-You can set `force_col_wise=true` to remove the overhead.
-[LightGBM] [Info] Total Bins 3591
-[LightGBM] [Info] Number of data points in the train set: 1533234, number of used features: 15
-[LightGBM] [Info] Start training from score 3.707946
-[1]: train's rmse:0.476294 val's rmse:0.483766
-[2]: train's rmse:0.436687 val's rmse:0.447189
-[3]: train's rmse:0.388227 val's rmse:0.395272
-[4]: train's rmse:0.351259 val's rmse:0.352975
-[5]: train's rmse:0.319342 val's rmse:0.317443
-[6]: train's rmse:0.308286 val's rmse:0.304685
-[7]: train's rmse:0.29213 val's rmse:0.286823
-[8]: train's rmse:0.285282 val's rmse:0.279945
-[9]: train's rmse:0.279973 val's rmse:0.275082
-[10]: train's rmse:0.272954 val's rmse:0.268086
-[11]: train's rmse:0.270808 val's rmse:0.265993
-[12]: train's rmse:0.265075 val's rmse:0.262929
-[13]: train's rmse:0.259271 val's rmse:0.256574
-[14]: train's rmse:0.257372 val's rmse:0.255157
-[15]: train's rmse:0.251781 val's rmse:0.250573
-[16]: train's rmse:0.249575 val's rmse:0.248782
-[17]: train's rmse:0.244283 val's rmse:0.245211
-[18]: train's rmse:0.241612 val's rmse:0.243641
-[19]: train's rmse:0.239576 val's rmse:0.243298
-[20]: train's rmse:0.237666 val's rmse:0.241796
-[21]: train's rmse:0.235026 val's rmse:0.24036
-[22]: train's rmse:0.234248 val's rmse:0.240534
-[23]: train's rmse:0.233063 val's rmse:0.240336
-[24]: train's rmse:0.231473 val's rmse:0.239555
-[25]: train's rmse:0.230572 val's rmse:0.238856
-[26]: train's rmse:0.229453 val's rmse:0.238397
-[27]: train's rmse:0.228545 val's rmse:0.238288
-[28]: train's rmse:0.227854 val's rmse:0.238272
-[29]: train's rmse:0.227018 val's rmse:0.238578
-[30]: train's rmse:0.225762 val's rmse:0.238287
-[31]: train's rmse:0.224685 val's rmse:0.238206
-[32]: train's rmse:0.224169 val's rmse:0.238135
-[33]: train's rmse:0.223539 val's rmse:0.238197
-[34]: train's rmse:0.222695 val's rmse:0.238346
-[35]: train's rmse:0.222149 val's rmse:0.23829
-[36]: train's rmse:0.221545 val's rmse:0.238247
-[37]: train's rmse:0.221142 val's rmse:0.2382
-[38]: train's rmse:0.220744 val's rmse:0.238207
-[39]: train's rmse:0.220274 val's rmse:0.238167
-[40]: train's rmse:0.219637 val's rmse:0.238145
-[41]: train's rmse:0.219175 val's rmse:0.238365
-[42]: train's rmse:0.218764 val's rmse:0.238499
-[43]: train's rmse:0.218348 val's rmse:0.238594
-[44]: train's rmse:0.21797 val's rmse:0.238556
-[45]: train's rmse:0.217575 val's rmse:0.238736
-[46]: train's rmse:0.217116 val's rmse:0.238595
-[47]: train's rmse:0.216742 val's rmse:0.238603
-[48]: train's rmse:0.21645 val's rmse:0.238692
-[49]: train's rmse:0.216056 val's rmse:0.238725
-[50]: train's rmse:0.215753 val's rmse:0.238822
-[51]: train's rmse:0.215403 val's rmse:0.23881
-[52]: train's rmse:0.215122 val's rmse:0.239036
-[53]: train's rmse:0.214732 val's rmse:0.239108
-[54]: train's rmse:0.214453 val's rmse:0.239129
-[55]: train's rmse:0.21411 val's rmse:0.239214
-[56]: train's rmse:0.213758 val's rmse:0.23908
-[57]: train's rmse:0.213482 val's rmse:0.239068
-[58]: train's rmse:0.213173 val's rmse:0.239033
-[59]: train's rmse:0.212949 val's rmse:0.239055
-[60]: train's rmse:0.21267 val's rmse:0.23908
-[61]: train's rmse:0.212322 val's rmse:0.239188
-[62]: train's rmse:0.212096 val's rmse:0.239236
-[63]: train's rmse:0.21181 val's rmse:0.239369
-[64]: train's rmse:0.21152 val's rmse:0.239405
-[65]: train's rmse:0.211224 val's rmse:0.23932
-[66]: train's rmse:0.210855 val's rmse:0.239226
-[67]: train's rmse:0.210609 val's rmse:0.23932
-[68]: train's rmse:0.210338 val's rmse:0.23926
-[69]: train's rmse:0.210085 val's rmse:0.239234
-[70]: train's rmse:0.209859 val's rmse:0.239237
-[71]: train's rmse:0.209599 val's rmse:0.239256
-[72]: train's rmse:0.209382 val's rmse:0.239298
-[73]: train's rmse:0.209212 val's rmse:0.239357
-[74]: train's rmse:0.208976 val's rmse:0.239384
-[75]: train's rmse:0.208743 val's rmse:0.239379
-[76]: train's rmse:0.208577 val's rmse:0.239321
-[77]: train's rmse:0.208352 val's rmse:0.239368
-[78]: train's rmse:0.208076 val's rmse:0.239371
-[79]: train's rmse:0.2079 val's rmse:0.239415
-[80]: train's rmse:0.207709 val's rmse:0.239435
-[81]: train's rmse:0.207526 val's rmse:0.239402
-[82]: train's rmse:0.207347 val's rmse:0.239389
-[83]: train's rmse:0.207148 val's rmse:0.239491
-[84]: train's rmse:0.206966 val's rmse:0.239582
-[85]: train's rmse:0.206789 val's rmse:0.239676
-[86]: train's rmse:0.206619 val's rmse:0.239681
-[87]: train's rmse:0.206399 val's rmse:0.239778
-[88]: train's rmse:0.206187 val's rmse:0.239808
-[89]: train's rmse:0.205945 val's rmse:0.239809
-[90]: train's rmse:0.205735 val's rmse:0.239848
-[91]: train's rmse:0.20552 val's rmse:0.239928
-[92]: train's rmse:0.205367 val's rmse:0.239937
-[93]: train's rmse:0.205194 val's rmse:0.239958
-[94]: train's rmse:0.20499 val's rmse:0.239969
-[95]: train's rmse:0.204791 val's rmse:0.240026
-[96]: train's rmse:0.204587 val's rmse:0.239993
-[97]: train's rmse:0.204465 val's rmse:0.239996
-[98]: train's rmse:0.204295 val's rmse:0.240013
-[99]: train's rmse:0.204127 val's rmse:0.240039
-[100]: train's rmse:0.203978 val's rmse:0.240101
-[101]: train's rmse:0.203786 val's rmse:0.240082
-[102]: train's rmse:0.203605 val's rmse:0.240138
-[103]: train's rmse:0.203428 val's rmse:0.240204
-[104]: train's rmse:0.203215 val's rmse:0.240217
-[105]: train's rmse:0.20303 val's rmse:0.24015
-[106]: train's rmse:0.202878 val's rmse:0.240143
-[107]: train's rmse:0.202731 val's rmse:0.240172
-[108]: train's rmse:0.202499 val's rmse:0.240197
-[109]: train's rmse:0.202347 val's rmse:0.240205
-[110]: train's rmse:0.202171 val's rmse:0.240235
-[111]: train's rmse:0.202037 val's rmse:0.240269
-[112]: train's rmse:0.201903 val's rmse:0.240319
-[113]: train's rmse:0.20175 val's rmse:0.240325
-[114]: train's rmse:0.201646 val's rmse:0.240357
-[115]: train's rmse:0.201513 val's rmse:0.240382
-[116]: train's rmse:0.201371 val's rmse:0.240456
-[117]: train's rmse:0.201244 val's rmse:0.240416
-[118]: train's rmse:0.20112 val's rmse:0.240418
-[119]: train's rmse:0.200979 val's rmse:0.240416
-[120]: train's rmse:0.200865 val's rmse:0.240418
-[121]: train's rmse:0.200649 val's rmse:0.240406
-[122]: train's rmse:0.200496 val's rmse:0.240447
-[123]: train's rmse:0.200378 val's rmse:0.240496
-[124]: train's rmse:0.200269 val's rmse:0.24048
-[125]: train's rmse:0.200131 val's rmse:0.240487
-[126]: train's rmse:0.199971 val's rmse:0.240546
-[127]: train's rmse:0.199823 val's rmse:0.240784
-[128]: train's rmse:0.199713 val's rmse:0.240787
-[129]: train's rmse:0.199603 val's rmse:0.2408
-[130]: train's rmse:0.199479 val's rmse:0.240808
-[131]: train's rmse:0.199342 val's rmse:0.240851
-[132]: train's rmse:0.199202 val's rmse:0.24093
-[133]: train's rmse:0.199118 val's rmse:0.240925
-[134]: train's rmse:0.198958 val's rmse:0.240919
-[135]: train's rmse:0.1988 val's rmse:0.240848
-[136]: train's rmse:0.198652 val's rmse:0.240774
-[137]: train's rmse:0.198479 val's rmse:0.240771
-[138]: train's rmse:0.198354 val's rmse:0.24079
-[139]: train's rmse:0.198242 val's rmse:0.240994
-[140]: train's rmse:0.198108 val's rmse:0.24101
-[141]: train's rmse:0.197997 val's rmse:0.241056
-[142]: train's rmse:0.197899 val's rmse:0.241052
-[143]: train's rmse:0.197724 val's rmse:0.241058
-[144]: train's rmse:0.197599 val's rmse:0.241106
-[145]: train's rmse:0.197472 val's rmse:0.241114
-[146]: train's rmse:0.197386 val's rmse:0.241105
-[147]: train's rmse:0.197226 val's rmse:0.241121
-[148]: train's rmse:0.197131 val's rmse:0.241106
-[149]: train's rmse:0.197042 val's rmse:0.241142
-[150]: train's rmse:0.196921 val's rmse:0.241163
-preds_lgb_log <- predict(final_model_lgb, x_val_mat)
-rmse_lgb_real <- sqrt(mean((exp(preds_lgb_log) - exp(y_val_vec))^2))
-
-preds_lgb_train_log <- predict(final_model_lgb, x_train_mat)
-rmse_lgb_train_real <- sqrt(mean(
- (exp(preds_lgb_train_log) - exp(y_train_vec))^2
-))
-
-print(paste0("LightGBM RMSE on validation set: ", round(rmse_lgb_real, 4)))[1] "LightGBM RMSE on validation set: 10.7502"
-print(paste0("LightGBM RMSE on training set: ", round(rmse_lgb_train_real, 4)))[1] "LightGBM RMSE on training set: 11.8411"
-print(paste0(
- "Best Hyperparameters: num_leaves = ",
- best_params_row$num_leaves,
- ", learning_rate = ",
- round(best_params_row$learning_rate, 4),
- ", bagging_fraction = ",
- round(best_params_row$bagging_fraction, 4),
- ", feature_fraction = ",
- round(best_params_row$feature_fraction, 4)
-))
-[1] "Best Hyperparameters: num_leaves = 150, learning_rate = 0.2, bagging_fraction = 0.9987, feature_fraction = 0.6"
-To challenge the XGBoost baseline and cross-validate the limits of tree ensembles, LightGBM was evaluated. Its leaf-wise growth optimizes for maximum loss reduction rather than symmetrical tree balance, making it theoretically prone to overfitting on smaller datasets but highly efficient on dense data.
-The Bayesian optimization selected the maximum permitted complexity of \(150\) leaves per tree (\(num\_leaves\)), a learning rate of \(0.2\), and a \(bagging\_fraction\) of effectively \(1.0\) (\(0.9987\)); the final model was then trained for the full \(150\) boosting rounds. Interestingly, and contrary to XGBoost, the algorithm introduced feature-level regularization by selecting a \(feature\_fraction\) of \(0.6\). This indicates that injecting random feature subspaces into each tree helped mitigate the aggressive asymmetric expansion of the leaf-wise strategy.
-The LightGBM model achieved a validation \(RMSE\) of \(10.75\) and a training \(RMSE\) of \(11.84\) on the original scale. The absence of any train-validation gap (the training error is in fact slightly higher once predictions are mapped back through \(exp()\)) indicates that the model avoided the overfitting trap that captured XGBoost. By leveraging feature-level regularization, LightGBM captured a robust, generalized representation of the data without memorizing the stochastic noise inherent to the training set.
-Algorithmic Scalability & Hardware Constraints
-In a modern Data Science and MLOps paradigm, theoretical predictive capacity must be critically weighed against computational feasibility. While gradient boosting models proved exceptionally efficient due to histogram-based approximations, the evaluation of traditional bagging ensembles and deep learning architectures was ultimately bottlenecked by local hardware constraints and algorithmic overhead.
-Random Forest and the Cost of Bagging
-y_train <- train_clean$implied_vol_ref
-X_train <- as.matrix(train_clean |> select(-implied_vol_ref))
-
-y_val <- val_clean$implied_vol_ref
-X_val <- as.matrix(val_clean |> select(-implied_vol_ref))
-
-num_features <- ncol(X_train)
-
-fold_ids <- sample(1:3, nrow(X_train), replace = TRUE)
-
-fold_index <- lapply(1:3, function(k) {
- list(
- train = which(fold_ids != k),
- valid = which(fold_ids == k)
- )
-})
-
-scoring_function <- function(mtry, min_node_size, sample_fraction) {
- mtry <- as.integer(round(mtry))
- min_node_size <- as.integer(round(min_node_size))
-
- fold_rmses <- numeric(3)
-
- for (k in 1:3) {
- idx_train <- fold_index[[k]]$train
- idx_valid <- fold_index[[k]]$valid
-
- rf_model <- tryCatch(
- ranger(
- x = X_train[idx_train, ],
- y = y_train[idx_train],
- num.trees = 80,
- mtry = mtry,
- min.node.size = min_node_size,
- sample.fraction = sample_fraction,
- max.depth = 15,
- num.threads = parallel::detectCores() - 1
- ),
- error = function(e) return(NULL)
- )
-
- if (is.null(rf_model)) {
- return(list(Score = -9999, Pred = 0))
- }
-
- preds <- predict(rf_model, X_train[idx_valid, ])$predictions
- fold_rmses[k] <- sqrt(mean((preds - y_train[idx_valid])^2))
- }
-
- list(Score = -mean(fold_rmses), Pred = 0)
-}
-
-bounds <- list(
- mtry = c(1L, num_features),
- min_node_size = c(1L, 20L),
- sample_fraction = c(0.6, 1.0)
-)
-
-opt_obj <- BayesianOptimization(
- FUN = scoring_function,
- bounds = bounds,
- init_points = 3,
- n_iter = 5,
- acq = "ucb",
- kappa = 2.576,
- verbose = TRUE
-)
-
-best_params_row <- data.frame(
- mtry = as.integer(round(opt_obj$Best_Par["mtry"])),
- min.node.size = as.integer(round(opt_obj$Best_Par["min_node_size"])),
- sample.fraction = opt_obj$Best_Par["sample_fraction"],
- cv_rmse = -opt_obj$Best_Value
-)
-
-final_model_rf <- ranger(
- x = X_train,
- y = y_train,
- num.trees = 500,
- mtry = best_params_row$mtry,
- min.node.size = best_params_row$min.node.size,
- sample.fraction = best_params_row$sample.fraction,
- num.threads = parallel::detectCores() - 1,
- importance = "permutation"
-)
-
-preds_rf_log <- predict(final_model_rf, X_val)$predictions
-
-rmse_rf_real <- sqrt(mean(
- (exp(preds_rf_log) - exp(y_val))^2
-))
-
-print(paste0("Random Forest RMSE on validation set: ", round(rmse_rf_real, 4)))
-print(paste0(
- "Random Forest RMSE on training set: ",
- round(
- sqrt(mean(
- (exp(predict(final_model_rf, X_train)$predictions) - exp(y_train))^2
- )),
- 4
- )
-))
-print(paste0(
- "Best Hyperparameters: mtry = ",
- best_params_row$mtry,
- ", min.node.size = ",
- best_params_row$min.node.size,
- ", sample.fraction = ",
- round(best_params_row$sample.fraction, 4)
-))
-Unlike boosting frameworks that build shallow trees sequentially, a Random Forest constructs hundreds of deep, independent trees to reduce overall variance. Evaluating a Random Forest on our continuous financial dataset led to an explosion in computational complexity. Without a severe maximum depth limit, the exact greedy split-finding of the ranger implementation saturated the CPU and local memory, requiring disproportionate execution times (exceeding \(30\) minutes per hyperparameter combination). This algorithmic cost rendered rigorous Bayesian hyperparameter tuning computationally intractable locally, justifying the exclusion of Random Forest from the final predictive benchmark.
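The cost asymmetry between exact and histogram-based split finding can be summarized in rough orders of magnitude; writing \(K\) for the number of trees, \(d\) for features, \(n\) for observations, and \(B\) for histogram bins (constants and depth effects omitted):

\[
\text{exact greedy: } O(K \cdot d \cdot n \log n) \qquad \text{vs.} \qquad \text{histogram-based: } O(K \cdot d \cdot n) + O(K \cdot d \cdot B), \quad B \approx 255 \ll n.
\]

With \(n \approx 1.5\) million and hundreds of deep trees, the extra \(\log n\) factor and per-node sorting costs compound into the multi-hour runtimes observed here.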
-Multi-Layer Perceptron (MLP) on PCA Space
-mlp_spec <- mlp(
- hidden_units = c(128, 64),
- penalty = 0.001,
- epochs = 150,
- activation = "tanh",
- learn_rate = 0.01
-) |>
- set_engine("brulee") |>
- set_mode("regression")
-
-mlp_rec <- recipe(implied_vol_ref ~ ., data = train_clean) |>
- step_nzv(all_predictors()) |>
- step_normalize(all_numeric_predictors()) |>
- step_dummy(all_nominal_predictors())
-
-mlp_wf <- workflow() |>
- add_recipe(mlp_rec) |>
- add_model(mlp_spec)
-
-final_fit <- fit(mlp_wf, data = train_clean)
-
-val_results <- predict(final_fit, new_data = val_clean) |>
- bind_cols(val_clean |> select(implied_vol_ref))
-
-val_results_real <- val_results |>
- mutate(
- truth_real = exp(implied_vol_ref),
- estimate_real = exp(.pred)
- )
-
-rmse_real_scale_yardstick <- rmse(
- val_results_real,
- truth = truth_real,
- estimate = estimate_real
-)
-
-rmse_real_scale <- rmse_real_scale_yardstick$.estimate
-
-print(paste0("MLP RMSE on validation set: ", round(rmse_real_scale, 4)))
-print(paste0("MLP RMSE on training set: ", round(rmse_real_scale, 4)))
-print(paste0(
- "MLP Hyperparameters: hidden_units = c(128, 64), penalty = 0.001, epochs = 150, activation = 'tanh', learn_rate = 0.01"
-))
-A similar scalability issue was encountered when deploying a Multi-Layer Perceptron (MLP). As neural networks require strictly standardized and uncorrelated inputs to prevent gradient explosion, the MLP was trained on the \(train\_pca\_final\) dataset.
-However, running the brulee (Torch-based) engine on a standard CPU architecture proved inefficient for tabular data. The dense nature of the orthogonal PCA components, combined with the lack of GPU acceleration, led to severe optimization instability. The objective function frequently failed to converge, returning infinite deviance values during the Gaussian Process evaluation of the Bayesian optimization loop.
These practical engineering failures highlight a critical constraint in applied machine learning: for high-dimensional tabular financial data processed on local infrastructure, modern histogram-based tree-boosting frameworks offer a vastly superior performance-to-computation ratio compared to deep neural networks or traditional exact-greedy bagging methods.
-Selection of the Optimal Black-Box Model
-Based on the rigorous benchmarking of non-linear architectures, LightGBM is undeniably retained as the optimal black-box model for this predictive task.
-While XGBoost and LightGBM achieved broadly comparable validation scores (\(10.70\) vs. \(10.75\), respectively), their underlying learning dynamics were fundamentally opposed. XGBoost’s near-zero training \(RMSE\) (\(0.565\)) exposed a severe overfitting issue, driven by a maximum depth of \(10\) and a complete lack of stochastic regularization. The model acted as a high-variance memorization engine rather than a generalized predictive tool.
-Conversely, LightGBM demonstrated a vastly superior structural adaptation. By leveraging a leaf-wise growth strategy combined with active feature-level regularization (\(feature\_fraction = 0.6\)), it maintained a balanced training \(RMSE\) of \(10.90\) alongside its \(10.61\) validation \(RMSE\). This indicates a robust, generalized fit that successfully isolated the predictive signal from the market noise. Furthermore, from an MLOps and scalability perspective, LightGBM’s advanced histogram-based algorithm executes significantly faster with a lower memory footprint, solidifying it as the superior engineering choice for deploying high-capacity models on dense financial data.
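The retained configuration can be sketched with the native lightgbm interface. Only \(feature\_fraction = 0.6\) is stated in the text; the remaining parameter values and the data object names (\(x\_train\_mat\), \(y\_train\_log\)) are illustrative assumptions:

```r
library(lightgbm)

# Illustrative sketch of the retained LightGBM setup; apart from
# feature_fraction = 0.6 (stated in the text), values are placeholders.
params_lgb <- list(
  objective        = "regression",
  metric           = "rmse",
  feature_fraction = 0.6,   # column subsampling per tree (active regularization)
  num_leaves       = 63,    # leaf-wise growth is bounded by leaves, not depth
  learning_rate    = 0.05   # placeholder
)

dtrain <- lgb.Dataset(data = x_train_mat, label = y_train_log)  # assumed objects
final_model_lgb <- lgb.train(params = params_lgb, data = dtrain, nrounds = 500)
```

Note that, unlike XGBoost's depth-wise growth with a forced maximum depth, the leaf-wise strategy is constrained through \(num\_leaves\), which is how the balanced training/validation behaviour described above is obtained.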
-Results Comparison & Discussion
-The Trade-off: Accuracy vs. Interpretability
-While the LightGBM model provides superior predictive performance, its “black-box” nature requires post-hoc interpretation to ensure the captured signals align with financial theory. We employ a dual approach for interpretability: Global Feature Importance (Gain-based) and Local Explanations using SHAP values.
-shp <- shapviz(final_model_lgb, X_pred = x_val_mat)Global Feature Importance: The Primacy of Volatility Persistence
-The first level of interpretation focuses on the “Gain” metric, which measures the total reduction in the objective function brought by each feature across all trees in the ensemble.
-The analysis of the Gain-based importance reveals three critical insights:
--
-
1. The Dominance of Realized Volatility (The “Clustering” Effect): The model is heavily dominated by historical realized volatility metrics. \(realized\_vol\_mid\) alone accounts for approximately \(57\%\) of the total gain, followed by \(realized\_vol\_long\) and \(realized\_vol\_short\). From a financial econometrics perspective, this confirms that the model has successfully identified the “volatility clustering” phenomenon, where past variance is the most significant predictor of future implied volatility. The fact that the “mid” horizon carries the most weight suggests that the model prioritizes structural volatility trends over daily noise.
-2. Market Sentiment and Uncertainty Indicators: Beyond historical volatility, the model identifies \(strike\_dispersion\) and the \(market\_vol\_index\) as the next most influential features. In the financial context, this is highly coherent: a high dispersion in strikes signals a lack of consensus among market participants regarding future asset prices, which naturally drives up the implied volatility premium.
-3. Tail Risk and Liquidity Proxies: Secondary variables such as \(stress\_spread\) and \(put\_low\_strike\) contribute at a lower but non-negligible level. These variables act as proxies for downside protection demand (tail risk), allowing the model to fine-tune its predictions during periods of market stress.
-
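The Gain shares quoted above can be read directly off the fitted booster with lightgbm's importance utilities; a short sketch:

```r
# Gain-based feature importance from the fitted LightGBM booster
imp <- lightgbm::lgb.importance(final_model_lgb)
head(imp)  # columns: Feature, Gain, Cover, Frequency (Gain sums to 1)

# Bar plot of the top features by Gain
lightgbm::lgb.plot.importance(imp, top_n = 10, measure = "Gain")
```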
Local Interpretability: SHAP Beeswarm and Magnitude Analysis
-To move beyond global rankings, we utilize SHAP values to quantify the direction and magnitude of each feature’s impact on individual predictions.
-sv_importance(shp, kind = "beeswarm")
The SHAP beeswarm plot reveals the sign of the relationship between predictors and the target:
-Positive Correlation with IV:
-Higher values (orange) of \(realized\_vol\_mid\) and \(strike\_dispersion\) consistently lead to positive SHAP values, increasing the predicted implied volatility.
-Extreme Tails and Asymmetry:
-\(realized\_vol\_short\) exhibits a very wide horizontal spread. While its average impact is lower than the “mid” version, it is responsible for the most extreme “tail” predictions, with SHAP values reaching as low as \(-1.5\). This indicates that the model uses short-term volatility shocks to capture sudden, sharp shifts in market regimes.
-Mean-Reversion Signals:
-Variables like \(vol\_instability\) show a long left tail (purple), indicating that low instability can occasionally exert a significant downward pressure on the prediction, likely acting as a mean-reversion signal captured by the boosted trees.
-sv_importance(shp, kind = "bar")
The bar chart representing \(mean(|SHAP\; value|)\) confirms the hierarchy seen in the Gain analysis. The convergence between these two independent mathematical approaches (Gain vs. SHAP) robustly validates the feature selection. \(realized\_vol\_mid\) remains the undisputed primary driver with an average impact of approximately \(0.18\) on the log-volatility scale.
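The \(mean(|SHAP\; value|)\) ranking underlying the bar chart can also be recomputed by hand from the shapviz object, which is a quick way to cross-check the plotted hierarchy:

```r
# Extract the n_obs x n_features SHAP matrix and rank features by
# mean absolute contribution (the quantity shown in the bar chart)
shap_mat <- shapviz::get_shap_values(shp)
sort(colMeans(abs(shap_mat)), decreasing = TRUE)
```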
-Feature Interaction and Non-Linearity: Stress Spread and Volatility Slope
-The power of LightGBM lies in its ability to capture non-linearities and cross-feature interactions. We analyze this through a SHAP dependence plot of the \(stress\_spread\) variable, colored by the \(vol\_slope\).
-sv_dependence(shp, v = "stress_spread", color_var = "vol_slope")
-
-
1. Non-Linear Regime Switching: The relationship between \(stress\_spread\) and its impact on implied volatility follows a clear non-linear “S-curve”. For low \(stress\_spread\) values (below \(0\)), the impact is negative. As the spread increases and crosses the \(0\) threshold, the SHAP value rises sharply before plateauing around a spread of \(2\). This suggests a “regime switch” where the model identifies a specific threshold beyond which market stress becomes a dominant, non-linear driver of the volatility premium.
-2. Interaction with Term Structure (\(vol\_slope\)): The color encoding reveals a subtle interaction effect. At high levels of \(stress\_spread\), a positive \(vol\_slope\) (orange dots) tends to amplify the positive impact on the prediction compared to a flat or negative slope (purple dots). This alignment between market stress and a steepening volatility term structure allows the model to capture complex “crisis” signatures that a standard linear model would overlook.
-
In conclusion, the interpretability analysis confirms that the LightGBM model has reconstructed a sophisticated, yet financially sound, representation of volatility dynamics, combining persistence effects with non-linear risk premiums.
-Final Inference
-The ultimate objective of this study is to provide accurate point estimates for the implied volatility of the assets contained in the hidden test set (\(test\_eng\)). Having identified LightGBM as the optimal non-linear model and the mixed-effects model \(mod\_lmm\_5\) as the most robust linear benchmark, we proceed to the final inference phase.
-As specified in the data pipeline (Section 2.7), the test dataset was transformed using the \(bake()\) function, ensuring that all scaling, winsorization, and distribution adjustments were strictly aligned with the training set’s statistics. Since the models were trained on log-transformed targets to stabilize variance, the raw predictions are returned in the logarithmic domain. To comply with the submission requirements, an exponential transformation \(f(x) = e^x\) is applied to project the results back to the original volatility scale.
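For reference, the Section 2.7 preprocessing step has the following shape, where \(rec\_prep\) is an assumed name for the recipe already \(prep()\)-ed on the training data:

```r
library(recipes)

# Apply the training-set preprocessing statistics to the hidden test set;
# rec_prep is a hypothetical name for the prepped recipe object.
test_eng_baked <- bake(rec_prep, new_data = test_eng)
```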
The following implementation handles the final prediction generation and exports the results into the required CSV format:
-final_model_lin <- mod_lmm_5
-preds_linear_log <- predict(
- final_model_lin,
- newdata = test_linear,
- allow.new.levels = TRUE
-)
-preds_linear_real <- exp(as.numeric(preds_linear_log))
-
-x_test_mat <- as.matrix(
- test_tree |> select(-any_of(c("asset_id", "obs_date", "implied_vol_ref")))
-)
-
-preds_lgb_log <- predict(final_model_lgb, x_test_mat)
-preds_lgb_real <- exp(as.numeric(preds_lgb_log))
-
-submission <- tibble(
- linear_model = preds_linear_real,
- lightgbm = preds_lgb_real
-)
-
-write_csv(submission, "hat_y.csv")