Time Series benchmark #135

@MarcoGorelli

The M5 Forecasting Competition was held on Kaggle in 2020, and the top solutions generally relied on heavy feature engineering.

Doing that feature engineering in pandas was quite slow, so I'm benchmarking how much faster Polars would have been at the same task.

I think this is good to benchmark, as:

  • the competition was run on real-world Walmart data
  • the operations we're benchmarking are from the winning solution, so evidently they were doing something right

I think this reflects the kind of gains that people doing applied data science can expect from using Polars.

Here's a notebook with the queries + data: https://www.kaggle.com/code/marcogorelli/m5-forecasting-feature-engineering-benchmark/notebook

Run with SMALL=True for a quick test, then with SMALL=False to run on the full original dataset.


Anyone fancy translating this to SQL so we could check DuckDB too? My intuition is that this wouldn't be DuckDB's forte, which is fine: DuckDB is incredibly good at many other things. I think a friendly comparison involving this kind of benchmark would give a more complete picture than "DuckDB scales better than Polars because TPC-H!"
