A statistical analysis extension for DuckDB, providing regression analysis, diagnostics, and inference capabilities directly within your database.
Warning
BREAKING CHANGE: Function names have changed from anofox_statistics_* to anofox_stats_* to align with the unified naming convention across Anofox extensions. Aliases without the prefix (e.g., ols_fit) are also available for convenience. Update your code accordingly.
Important
This extension is in early development, so bugs and breaking changes are expected. Please use the issues page to report bugs or request features.
| Method | Function | Description |
|---|---|---|
| OLS | ols_fit, ols_fit_agg |
Ordinary Least Squares |
| Ridge | ridge_fit, ridge_fit_agg |
L2 regularization |
| Elastic Net | elasticnet_fit, elasticnet_fit_agg |
Combined L1+L2 regularization |
| WLS | wls_fit, wls_fit_agg |
Weighted Least Squares |
| RLS | rls_fit, rls_fit_agg |
Recursive Least Squares (online) |
| Poisson | poisson_fit_agg |
GLM for count data |
| ALM | alm_fit_agg |
24 error distributions |
| BLS/NNLS | bls_fit_agg, nnls_fit_agg |
Bounded/Non-negative LS |
| PLS | pls_fit, pls_fit_agg |
Partial Least Squares |
| Isotonic | isotonic_fit, isotonic_fit_agg |
Monotonic regression |
| Quantile | quantile_fit, quantile_fit_agg |
Quantile/median regression |
| Category | Function | Description |
|---|---|---|
| Normality | shapiro_wilk_agg, jarque_bera_agg, dagostino_k2_agg |
Normality tests |
| Parametric | t_test_agg, one_way_anova_agg, yuen_agg, brown_forsythe_agg |
Parametric tests |
| Nonparametric | mann_whitney_u_agg, kruskal_wallis_agg, wilcoxon_signed_rank_agg, brunner_munzel_agg, permutation_t_test_agg |
Nonparametric tests |
| Correlation | pearson_agg, spearman_agg, kendall_agg, distance_cor_agg, icc_agg |
Correlation tests |
| Categorical | chisq_test_agg, chisq_gof_agg, g_test_agg, fisher_exact_agg, mcnemar_agg |
Contingency table tests |
| Effect Size | cramers_v_agg, phi_coefficient_agg, contingency_coef_agg, cohen_kappa_agg |
Association measures |
| Proportion | prop_test_one_agg, prop_test_two_agg, binom_test_agg |
Proportion tests |
| Equivalence | tost_t_test_agg, tost_paired_agg, tost_correlation_agg |
TOST equivalence tests |
| Distribution | energy_distance_agg, mmd_agg |
Distribution comparison |
| Forecast | diebold_mariano_agg, clark_west_agg |
Forecast evaluation |
| Function | Description |
|---|---|
vif, vif_agg |
Variance Inflation Factor |
aic, bic |
Model selection criteria |
residuals_diagnostics_agg |
Residual analysis |
aid_agg, aid_anomaly_agg |
Demand pattern classification |
Table macros for easy per-group model fitting and prediction with a single function call:
| Macro | Description |
|---|---|
ols_fit_predict_by |
OLS per-group fit + predict |
ridge_fit_predict_by |
Ridge per-group fit + predict |
elasticnet_fit_predict_by |
ElasticNet per-group fit + predict |
wls_fit_predict_by |
WLS per-group fit + predict |
rls_fit_predict_by |
RLS per-group fit + predict |
bls_fit_predict_by |
Bounded LS per-group fit + predict |
alm_fit_predict_by |
ALM per-group fit + predict |
poisson_fit_predict_by |
Poisson GLM per-group fit + predict |
pls_fit_predict_by |
PLS per-group fit + predict |
isotonic_fit_predict_by |
Isotonic per-group fit + predict |
quantile_fit_predict_by |
Quantile per-group fit + predict |
- Aggregate Functions: All methods support
GROUP BYfor per-group analysis - Window Functions: Regression methods support
OVERclause for rolling/expanding windows - Prediction Intervals: Confidence and prediction intervals for forecasts
- Full Inference: t-statistics, p-values, confidence intervals for coefficients
- Table Macros:
*_fit_predict_bymacros simplify per-group workflows
Important: All functions in this extension use positional parameters, not named parameters (:= syntax). Parameters must be provided in the order shown in the examples below.
All functions follow the anofox_stats_* naming convention, with convenient aliases without the prefix:
- Primary:
anofox_stats_ols_fit(...) - Alias:
ols_fit(...)- Shorter and more convenient!
Both work identically - use whichever you prefer!
-- Load the extension
LOAD 'anofox_statistics';
-- Simple OLS regression (using alias - recommended)
SELECT * FROM ols_fit(
[1.0, 2.0, 3.0, 4.0, 5.0]::DOUBLE[], -- y: response variable
[[1.1], [2.1], [2.9], [4.2], [4.8]]::DOUBLE[][], -- x: feature matrix
MAP{'intercept': true} -- options
);
-- Or use the full name if you prefer
SELECT * FROM anofox_stats_ols_fit(
[1.0, 2.0, 3.0, 4.0, 5.0]::DOUBLE[], -- y: response variable
[[1.1], [2.1], [2.9], [4.2], [4.8]]::DOUBLE[][], -- x: feature matrix
MAP{'intercept': true} -- options
);
-- Multiple regression with 3 predictors
SELECT * FROM ols_fit(
[1.0, 2.0, 3.0, 4.0, 5.0]::DOUBLE[], -- y: response variable
[[1.1, 2.0, 3.0], [2.1, 3.0, 4.0], [2.9, 4.0, 5.0],
[4.2, 5.0, 6.0], [4.8, 6.0, 7.0]]::DOUBLE[][], -- x: feature matrix
MAP{'intercept': true} -- options
);
-- Get coefficient inference using fit with full_output
SELECT
coefficients,
intercept,
coefficient_p_values,
intercept_p_value,
f_statistic,
f_statistic_pvalue
FROM ols_fit(
[1.0, 2.0, 3.0, 4.0, 5.0]::DOUBLE[], -- y
[[1.0], [2.0], [3.0], [4.0], [5.0]]::DOUBLE[][], -- x (matrix format)
{'intercept': true, 'full_output': true, 'confidence_level': 0.95} -- options
);
-- Per-group regression with aggregate functions
SELECT
category,
result.coefficients[1] as price_effect,
result.intercept,
result.r2
FROM (
SELECT
category,
ols_fit_agg(
sales,
[price],
MAP{'intercept': true}
) as result
FROM sales_data
GROUP BY category
) sub;
-- EASIER: Per-group fit + predict using table macros
-- Returns one row per observation with predictions
SELECT * FROM ols_fit_predict_by(
'sales_data', -- table name (as string)
category, -- group column
sales, -- y column
[price, quantity] -- x columns as list
);
-- Returns: group_id, y, x, yhat, yhat_lower, yhat_upper, is_training
-- With options (e.g., confidence level)
SELECT * FROM ridge_fit_predict_by(
'sales_data',
category,
sales,
[price, quantity],
{'alpha': 0.5, 'confidence_level': 0.99} -- options
);INSTALL anofox_statistics FROM community;
LOAD anofox_statistics;- API Reference: Complete function reference and specifications
Guides are available in the guides/ directory:
- Quick Start Guide: Getting started
- Technical Guide: Architecture and implementation details
- Business Guide: Real-world business use cases
- Advanced Use Cases: Complex analytical workflows
- DuckDB: v1.4.4 or higher
- Rust: Stable toolchain (for building from source)
- anofox-regression: Rust regression library (via Cargo)
- faer: Linear algebra library (via Cargo, no default features)
- ICU Extension (Optional): Required for date/time operations in documentation examples
INSTALL icu; LOAD icu;
This extension collects anonymous usage telemetry to help improve the product. Telemetry is enabled by default and includes:
- Extension load events (extension name, version, platform)
- Function execution events (which functions are used)
- No personal data or query contents are collected
Environment Variable:
export DATAZOO_DISABLE_TELEMETRY=1DuckDB SQL Setting:
SET anofox_telemetry_enabled = false;For more information, see the posthog-telemetry repository.
This project is licensed under the Business Source License 1.1 (BSL 1.1).
- Usage Grant: Free to use, modify, and distribute for non-production purposes
- Production Use: Permitted after 4 years from release date, or under a commercial license
- Change Date: [Release Date + 4 years]
- Change License: Apache License 2.0
See LICENSE for full terms.
The BSL allows:
- Free use for development, testing, and research
- Open source collaboration and contributions
- Academic and educational use
- Small-scale production use
While ensuring:
- Sustainable development funding
- Protection for the project's long-term viability
- Future conversion to fully open source (Apache 2.0)
For commercial production use before the Change Date, please contact: [contact@datazoo.de]
Contributions are welcome. To get started:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Submit a pull request
For questions, open an issue on GitHub.
- Additional statistical tests (Durbin-Watson, Breusch-Pagan, etc.)
- Visualization helpers for diagnostics
- Documentation and examples
- Bug reports and fixes
- Performance optimizations
- Internationalization
- Documentation: guides/
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: contact@datazoo.de
If you use this extension in research, please cite:
@software{anofox_statistics,
title = {Anofox Statistics: Statistical Analysis Extension for DuckDB},
author = {DataZoo DE},
year = {2025},
url = {https://github.com/DataZooDE/anofox-statistics},
version = {1.0.0}
}The regression algorithms in this extension are powered by the anofox-regression Rust crate, which is validated against R's statistical functions. The test suite compares results with R's lm(), glmnet, and other standard statistical packages to ensure numerical accuracy.
- DuckDB Team: For the database and extension framework
- anofox-regression: Rust regression library with R-validated implementations
- Open Source Community: For contributions and feedback