Time Series benchmark

The [M5 Forecasting Competition](https://www.sciencedirect.com/science/article/pii/S0169207021001874) was held on Kaggle in 2020, and top solutions generally featured a lot of heavy feature engineering.

Doing that feature engineering in pandas was quite slow, so I'm benchmarking how much faster Polars would have been at that task.

I think this is good to benchmark, as:
- the competition was run on real-world Walmart data
- the operations we're benchmarking are from the winning solution, so evidently they were doing something right

I think this benchmark reflects the kinds of gains that people doing applied data science can expect from using Polars.

Here's a notebook with the queries + data: https://www.kaggle.com/code/marcogorelli/m5-forecasting-feature-engineering-benchmark/notebook

Run with `SMALL=True` for testing, then with `SMALL=False` to run on the original, full-size dataset.
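If you want to time your own variants of the queries, something like the following would do. It's only a sketch: `time_query` and `example_query` are hypothetical helpers, not functions from the notebook, and the `SMALL` flag here just switches the size of a dummy workload.

```python
import time

SMALL = True  # quick test run; set to False for the larger workload


def time_query(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of `fn`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def example_query():
    # Placeholder workload standing in for a real feature-engineering query.
    n = 10_000 if SMALL else 1_000_000
    return sum(i * i for i in range(n))


elapsed = time_query(example_query)
print(f"example_query: {elapsed:.4f}s")
```

Taking the minimum over a few repeats reduces noise from whatever else the machine is doing.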

---

Anyone fancy translating to SQL so we could check DuckDB too? My intuition is that this wouldn't be DuckDB's forte, which is fine: DuckDB is incredibly good at many other things. I think a friendly comparison involving this kind of benchmark would give a more complete picture than "DuckDB scales better than Polars because TPC-H!"
