Task:
Given 17 columns of training data (time, A-N, Y1, Y2) and 15 columns of test data (time, A-N):
- Goal: Train a model on the training data and predict values ( Y1, Y2 ) for the test data.
- Metric: Achieve the highest possible ( R^2 ), defined as:

\[
R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}, \qquad
SS_{\text{res}} = \sum_i \left( y_i - \hat{y}_i \right)^2, \qquad
SS_{\text{tot}} = \sum_i \left( y_i - \bar{y} \right)^2
\]
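As a sanity check on the metric, a minimal sketch of computing ( R^2 ) with NumPy (array names are illustrative; for a single target this matches `sklearn.metrics.r2_score`):

```python
import numpy as np

def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """R^2 = 1 - SS_res / SS_tot, per the definition above."""
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot
```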
- Tree-based methods: XGBoost, LightGBM, CatBoost.
Correlation:
- Y1 strongly correlated with predictors ( r \approx 0.8 ) → possible linear dependence.
- Y2 weaker correlation → less linear predictability.
ADF Test (Augmented Dickey–Fuller):
- Tests for a unit root; rejecting the null indicates stationarity (constant mean and variance, autocovariance depending only on lag).
- Both Y1 and Y2 found to be stationary.
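A minimal stationarity check with `statsmodels`, assuming the training data is loaded into a DataFrame with the `Y1`/`Y2` columns from the task (the file path is illustrative):

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

train = pd.read_csv("train.csv")  # illustrative path

for col in ["Y1", "Y2"]:
    stat, pvalue, *_ = adfuller(train[col].dropna())
    # p < 0.05 -> reject the unit-root null -> treat the series as stationary
    print(f"{col}: ADF statistic = {stat:.3f}, p-value = {pvalue:.4f}")
```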
ACF (Autocorrelation Function):
- Y1: no significant autocorrelation (noise-like).
- Y2: spikes above confidence bounds → some MA structure.
PACF (Partial Autocorrelation Function):
- Y1: no significant partial autocorrelation.
- Y2: significant spikes → some AR structure.
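The diagnostics above can be reproduced with the `statsmodels` plotting helpers, reusing the `train` DataFrame from the earlier sketch (the lag count is an arbitrary choice):

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2, 2, figsize=(10, 6))
for i, col in enumerate(["Y1", "Y2"]):
    # spikes outside the shaded confidence band indicate significant lags
    plot_acf(train[col].dropna(), lags=40, ax=axes[i, 0], title=f"ACF: {col}")
    plot_pacf(train[col].dropna(), lags=40, ax=axes[i, 1], title=f"PACF: {col}")
plt.tight_layout()
plt.show()
```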
Interpretation:
- Y1: mostly noise, no strong AR/MA structure.
- Y2: some autocorrelation, but not enough for a low-order AR/MA model.
- Linear models are likely to underfit → nonlinear models are more suitable.
- Large dataset with many columns → gradient boosting (e.g., XGBoost) can exploit feature interactions.
- Predictive performance is prioritized over interpretability.
- Tree ensembles handle the nonlinear structure: they can capture feature interactions and lag effects.
- Comparison of boosting frameworks (a baseline training sketch follows this list):
  - XGBoost: extreme gradient boosting, level-wise tree growth.
  - LightGBM: leaf-wise growth, faster on large datasets.
  - CatBoost: native categorical-feature handling, symmetric (oblivious) trees.
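A hedged baseline comparing the three frameworks on one target, reusing `train` from above; the single-letter feature names and all hyperparameters are assumptions for illustration:

```python
from sklearn.metrics import r2_score
import xgboost as xgb
import lightgbm as lgb
import catboost as cb

features = list("ABCDEFGHIJKLMN") + ["time"]  # assumed column names
split = int(len(train) * 0.8)                 # time-ordered holdout, no shuffling
X_tr, X_va = train[features][:split], train[features][split:]
y_tr, y_va = train["Y1"][:split], train["Y1"][split:]

models = {
    "XGBoost": xgb.XGBRegressor(n_estimators=500, learning_rate=0.05),
    "LightGBM": lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05),
    "CatBoost": cb.CatBoostRegressor(iterations=500, learning_rate=0.05, verbose=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: validation R^2 = {r2_score(y_va, model.predict(X_va)):.4f}")
```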
Rolling Window Testing:
- Train/test splits that respect time order.
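One way to implement this is scikit-learn's `TimeSeriesSplit`, which trains on an expanding past window and validates on the following block (reusing `train`, `features`, and the imports from the previous sketch):

```python
from sklearn.model_selection import TimeSeriesSplit

X, y = train[features], train["Y1"]
tscv = TimeSeriesSplit(n_splits=5)
for fold, (tr_idx, va_idx) in enumerate(tscv.split(X)):
    # every validation block lies strictly after its training window
    model = xgb.XGBRegressor(n_estimators=300)
    model.fit(X.iloc[tr_idx], y.iloc[tr_idx])
    score = r2_score(y.iloc[va_idx], model.predict(X.iloc[va_idx]))
    print(f"fold {fold}: R^2 = {score:.4f}")
```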
Hyperparameter Optimization:
- Cross-validation with time-series splits.
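A sketch of the search, combining `RandomizedSearchCV` with the time-series splitter so folds never leak future data (the search space is illustrative; `X`, `y`, and `TimeSeriesSplit` come from the previous sketch):

```python
from sklearn.model_selection import RandomizedSearchCV

param_dist = {  # illustrative search space
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [300, 500, 1000],
    "subsample": [0.7, 0.9, 1.0],
}
search = RandomizedSearchCV(
    xgb.XGBRegressor(),
    param_distributions=param_dist,
    n_iter=20,
    cv=TimeSeriesSplit(n_splits=5),  # respects time order
    scoring="r2",
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```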
Evaluation Metric:
- Compute ( R^2 ) on validation/test data.
Final Training:
- Retrain the best model on the full training set.
Prediction:
- Generate predictions for Y1 and Y2 on the test set.
- The final model was a blend of XGBoost and CatBoost.
- Ridge regression served as a meta-learner to determine the optimal weights for combining their predictions (a stacking sketch follows this list).
- This approach leveraged:
  - XGBoost: strong baseline nonlinear learner.
  - CatBoost: robust handling of categorical features.
  - Ridge regression: a stable, regularized way to combine both, preventing overfitting and balancing contributions.
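A hedged sketch of the whole final stage: out-of-fold base predictions feed a Ridge meta-learner, the base models are then retrained on the full training set, and blended predictions are produced for the test set. `test.csv`, the Ridge alpha, and all model settings are assumptions; `train`, `features`, and the imports come from the sketches above:

```python
import numpy as np
from sklearn.linear_model import Ridge

test = pd.read_csv("test.csv")  # illustrative path

for target in ["Y1", "Y2"]:
    y = train[target]
    base = {
        "xgb": xgb.XGBRegressor(n_estimators=500, learning_rate=0.05),
        "cat": cb.CatBoostRegressor(iterations=500, learning_rate=0.05, verbose=0),
    }

    # out-of-fold predictions so the meta-learner never sees in-sample fits
    oof = np.full((len(train), len(base)), np.nan)
    for j, model in enumerate(base.values()):
        for tr_idx, va_idx in TimeSeriesSplit(n_splits=5).split(train[features]):
            model.fit(train[features].iloc[tr_idx], y.iloc[tr_idx])
            oof[va_idx, j] = model.predict(train[features].iloc[va_idx])

    # fit the Ridge meta-learner only on rows that received out-of-fold predictions
    mask = ~np.isnan(oof).any(axis=1)
    meta = Ridge(alpha=1.0).fit(oof[mask], y[mask])

    # retrain base models on the full training set, then blend test predictions
    test_matrix = np.column_stack(
        [m.fit(train[features], y).predict(test[features]) for m in base.values()]
    )
    test[f"{target}_pred"] = meta.predict(test_matrix)
```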
- Implemented the Avellaneda–Stoikov market-making model; its parameters still need tuning in the testing environment (see the sketch after this list).
- Developed a late-game inefficiency capture strategy, targeting predictable pricing discrepancies near market close
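For reference, a minimal sketch of the Avellaneda–Stoikov (2008) closed-form quoting rule; gamma (risk aversion), sigma (volatility), and k (order-flow decay) are exactly the parameters that need tuning, and every value below is a placeholder:

```python
import math

def as_quotes(mid: float, inventory: float, t: float, T: float,
              gamma: float = 0.1, sigma: float = 2.0, k: float = 1.5):
    """Closed-form Avellaneda-Stoikov bid/ask quotes.

    reservation price: r = mid - q * gamma * sigma^2 * (T - t)
    optimal spread:    delta = gamma * sigma^2 * (T - t) + (2/gamma) * ln(1 + gamma/k)
    """
    tau = T - t
    r = mid - inventory * gamma * sigma ** 2 * tau          # skew quotes against inventory
    spread = gamma * sigma ** 2 * tau + (2.0 / gamma) * math.log(1.0 + gamma / k)
    return r - spread / 2.0, r + spread / 2.0               # (bid, ask)

bid, ask = as_quotes(mid=100.0, inventory=3.0, t=0.5, T=1.0)
print(f"bid={bid:.3f}, ask={ask:.3f}")
```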