diff --git a/content/blog/2025-09-29-datafusion-50.0.0.md b/content/blog/2025-09-29-datafusion-50.0.0.md new file mode 100644 index 00000000..e8548b75 --- /dev/null +++ b/content/blog/2025-09-29-datafusion-50.0.0.md @@ -0,0 +1,464 @@ +--- +layout: post +title: Apache DataFusion 50.0.0 Released +date: 2025-09-29 +author: pmc +categories: [release] +--- + + + +[TOC] + + + +## Introduction + +We are proud to announce the release of [DataFusion 50.0.0]. This blog post +highlights some of the major improvements since the release of [DataFusion +49.0.0]. The complete list of changes is available in the [changelog]. +Thanks to [numerous contributors] for making this release possible! + +[DataFusion 50.0.0]: https://crates.io/crates/datafusion/50.0.0 +[DataFusion 49.0.0]: https://datafusion.apache.org/blog/2025/07/28/datafusion-49.0.0/ +[changelog]: https://github.com/apache/datafusion/blob/branch-50/dev/changelog/50.0.0.md +[numerous contributors]: https://github.com/apache/datafusion/blob/branch-50/dev/changelog/50.0.0.md#credits + + +## Performance Improvements 🚀 + +DataFusion continues to focus on enhancing performance, as shown in ClickBench +and other benchmark results. + +ClickBench performance results over time for DataFusion + +**Figure 1**: Average and median normalized query execution times for ClickBench queries for each git revision. +Query times are normalized using the ClickBench definition. See the +[DataFusion Benchmarking Page](https://alamb.github.io/datafusion-benchmarking/) +for more details. + +Here are some noteworthy optimizations added since DataFusion 49: + +**Dynamic Filter Pushdown Improvements** + +The dynamic filter pushdown optimization, which allows runtime filters to cut +down on the amount of data read, has been extended to support **inner hash +joins**, dramatically improving performance when one relation is relatively +small or filtered by a highly selective predicate. 
More details can be found in +the [Dynamic Filter Pushdown for Hash Joins](#dynamic-filter-pushdown-for-hash-joins) section below. +The dynamic filters in the TopK operator have also been improved in DataFusion +50.0.0, further increasing the effectiveness and efficiency of the optimization. +More details can be found in this +[ticket](https://github.com/apache/datafusion/pull/16433). + +**Nested Loop Join Optimization** + +The nested loop join operator has been rewritten to reduce execution time and memory +usage by adopting a finer-grained approach. Specifically, we now limit the +intermediate data size to around a single `RecordBatch` for better memory +efficiency, and we have eliminated redundant conversions from the old +implementation to further improve execution speed. +When evaluating this new approach in a microbenchmark, we measured up to a 5x +improvement in execution time and a 99% reduction in memory usage. More details and +results can be found in this +[ticket](https://github.com/apache/datafusion/pull/16996). + +**Parquet Metadata Caching** + +DataFusion now automatically caches the metadata of Parquet files (statistics, +page indexes, etc.), to avoid unnecessary disk/network round-trips. This is +especially useful when querying the same table multiple times over relatively +slow networks, allowing us to achieve an order of magnitude faster execution +time when running many small reads over large files. More information can be +found in the [Parquet Metadata Cache](#parquet-metadata-cache) section. + +## Community Growth 📈 + +Between `49.0.0` and `50.0.0`, we continue to see our community grow: + +1. Qi Zhu ([zhuqi-lucas](https://github.com/zhuqi-lucas)) and Yoav Cohen + ([yoavcloud](https://github.com/yoavcloud)) became committers. See the + [mailing list] for more details. +2. 
In the [core DataFusion repo] alone, we reviewed and accepted 318 PRs + from 79 different committers, created over 235 issues, and closed 197 of them + 🚀. All changes are listed in the detailed [changelogs]. +3. DataFusion published several blogs, including *[Using External Indexes, Metadata Stores, Catalogs and + Caches to Accelerate Queries on Apache Parquet]*, *[Dynamic Filters: + Passing Information Between Operators During Execution for 25x Faster + Queries]*, and *[Implementing User Defined Types and Custom Metadata + in DataFusion]*. + + + + +[core DataFusion repo]: https://github.com/apache/arrow-datafusion +[changelogs]: https://github.com/apache/datafusion/tree/main/dev/changelog +[mailing list]: https://lists.apache.org/list.html?dev@datafusion.apache.org +[Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet]: https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/ +[Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/ +[Implementing User Defined Types and Custom Metadata in DataFusion]: https://datafusion.apache.org/blog/2025/09/21/custom-types-using-metadata/ + +## New Features ✨ + +### Improved Spilling Sorts for Larger-than-Memory Datasets + +DataFusion has long been able to sort datasets that do not fit entirely in memory, +but still struggled with particularly large inputs or highly memory-constrained +setups. Larger-than-memory sorts in DataFusion 50.0.0 have been improved with the recent introduction +of multi-level merge sorts (more details in the respective +[ticket](https://github.com/apache/datafusion/pull/15700)). It is now +possible to execute almost any sorting query that would have previously triggered *out-of-memory* +errors, by relying on disk spilling. 
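+
+For example, you can exercise the spilling path by capping DataFusion's memory
+budget. In `datafusion-cli`, the memory limit and pool type can be set at
+startup (`large.parquet` here is a hypothetical file whose sort state does not
+fit in the configured budget):
+
+```sql
+-- started with: datafusion-cli --memory-limit 1g --mem-pool-type fair
+-- a sort whose intermediate state exceeds the 1 GB budget now spills to
+-- disk instead of failing with a resources-exhausted error
+> SELECT * FROM 'large.parquet' ORDER BY k DESC;
+```
+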
Thanks to [Raz Luvaton], [Yongting You], and +[ding-young] for delivering this feature. + +[Raz Luvaton]: https://github.com/rluvaton +[Yongting You]: https://github.com/2010YOUY01 +[ding-young]: https://github.com/ding-young + +### Dynamic Filter Pushdown for Hash Joins + +The [dynamic filter pushdown +optimization](https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/) +has been extended to inner hash joins, dramatically reducing the amount of +scanned data in some workloads—a technique sometimes referred to as +[*Sideways Information Passing*](https://www.cs.cmu.edu/~15721-f24/papers/Sideways_Information_Passing.pdf). + +These filters are automatically applied to inner hash joins, while future work +will introduce them to other join types. + +For example, given a query that looks for a specific customer and +their orders, DataFusion can now filter the `orders` relation based on the +`c_custkey` of the target customer, reducing the amount of data +read from disk by orders of magnitude. + +```sql +-- retrieve the orders of the customer with c_phone = '25-989-741-2988' +SELECT * +FROM customer +JOIN orders ON c_custkey = o_custkey +WHERE c_phone = '25-989-741-2988'; +``` + +The following shows an execution plan in DataFusion 50.0.0 with this optimization: + +```sql +HashJoinExec + DataSourceExec: <-- read customer + predicate=c_phone@4 = 25-989-741-2988 + metrics=[output_rows=1, ...] + DataSourceExec: <-- read orders + -- dynamic filter is added here, filtering directly at scan time + predicate=DynamicFilterPhysicalExpr [ o_custkey@1 >= 1 AND o_custkey@1 <= 1 ] + -- the number of output rows is kept to a minimum + metrics=[output_rows=11, ...] +``` + +Because there is a single customer in this query, +almost all rows from `orders` are filtered out by the join. 
+In previous versions of DataFusion, the entire `orders` relation would be +scanned to join with the target customer, but now the dynamic filter pushdown can +filter it right at the source, minimizing the amount of data decoded. + +More information can be found in the respective +[ticket](https://github.com/apache/datafusion/pull/16445) and the next step will be to +[extend the dynamic filters to other types of joins](https://github.com/apache/datafusion/issues/16973), such as `LEFT` and +`RIGHT` outer joins. Thanks to [Adrian Garcia Badaracco], [Qi Zhu], [xudong963], [Daniël Heres], and [Lía Adriana] +for delivering this feature. + +[Adrian Garcia Badaracco]: https://github.com/adriangb +[Qi Zhu]: https://github.com/zhuqi-lucas +[xudong963]: https://github.com/xudong963 +[Daniël Heres]: https://github.com/Dandandan +[Lía Adriana]: https://github.com/LiaCastaneda + +### Parquet Metadata Cache + +The metadata of Parquet files (statistics, page indexes, etc.) is now +automatically cached when using the built-in [ListingTable], which reduces disk/network round-trips and repeated decoding +of the same information. With a simple microbenchmark that executes point reads +(e.g., `SELECT v FROM t WHERE k = x`) over large files, we measured a 12x +improvement in execution time (more details can be found in the respective +[ticket](https://github.com/apache/datafusion/pull/16971)). This optimization +is production ready and enabled by default (more details in the +[Epic](https://github.com/apache/datafusion/issues/17000)). +Thanks to [Nuno Faria], [Jonathan Chen], [Shehab Amin], [Oleks V], [Tim Saucer], and [Blake Orth] for delivering this feature. 
+ +[ListingTable]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html +[Nuno Faria]: https://github.com/nuno-faria +[Jonathan Chen]: https://github.com/jonathanc-n +[Shehab Amin]: https://github.com/shehabgamin +[Oleks V]: https://github.com/comphead +[Tim Saucer]: https://github.com/timsaucer +[Blake Orth]: https://github.com/BlakeOrth + +Here is an example of the metadata cache in action: + +```sql +-- disabling the metadata cache +> SET datafusion.runtime.metadata_cache_limit = '0M'; + +-- simple query (t.parquet: 100M rows, 3 cols) +> EXPLAIN ANALYZE SELECT * FROM 't.parquet' LIMIT 1; +DataSourceExec: ... metrics=[..., metadata_load_time=229.196422ms, ...] +Elapsed 0.246 seconds. + +-- enabling the metadata cache +> SET datafusion.runtime.metadata_cache_limit = '50M'; + +> EXPLAIN ANALYZE SELECT * FROM 't.parquet' LIMIT 1; +DataSourceExec: ... metrics=[..., metadata_load_time=228.612µs, ...] +Elapsed 0.003 seconds. -- 82x improvement in this specific query +``` + + +The cache can be configured with the following runtime parameter: + +```sql +datafusion.runtime.metadata_cache_limit +``` + +The default [`FileMetadataCache`](https://docs.rs/datafusion/latest/datafusion/execution/cache/cache_manager/trait.FileMetadataCache.html) uses a +least-recently-used eviction algorithm and up to 50MB of memory. +If the underlying file changes, the cache is automatically invalidated. +Setting the limit to 0 will disable any metadata caching. As with most APIs in +DataFusion, users can provide their own behavior using a custom +[`FileMetadataCache`](https://docs.rs/datafusion/50.0.0/datafusion/execution/cache/cache_manager/trait.FileMetadataCache.html) +implementation when setting up the [`RuntimeEnv`](https://docs.rs/datafusion/latest/datafusion/execution/runtime_env/struct.RuntimeEnv.html). 
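+
+To verify the limit currently in effect, the setting can be inspected like any
+other configuration value (assuming `information_schema` is enabled, as it is
+by default in `datafusion-cli`):
+
+```sql
+> SELECT value FROM information_schema.df_settings
+  WHERE name = 'datafusion.runtime.metadata_cache_limit';
+```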
+
+For users with a custom [`TableProvider`](https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html):
+
+* If the custom provider uses the
+[`ParquetFormat`](https://docs.rs/datafusion/latest/datafusion/datasource/file_format/parquet/struct.ParquetFormat.html), caching will work
+without any changes.
+
+* Otherwise, the
+[`CachedParquetFileReaderFactory`](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.CachedParquetFileReaderFactory.html)
+can be provided when creating a
+[`ParquetSource`](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.ParquetSource.html).
+
+Users can inspect the cache contents through the
+[`FileMetadataCache::list_entries`](https://docs.rs/datafusion/latest/datafusion/execution/cache/cache_manager/trait.FileMetadataCache.html#tymethod.list_entries)
+method, or with the
+[`metadata_cache()`](https://datafusion.apache.org/user-guide/cli/functions.html#metadata-cache)
+function in `datafusion-cli`:
+
+
+```sql
+> SELECT * FROM metadata_cache();
++---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
+| path          | file_modified           | file_size_bytes | e_tag                    | version | metadata_size_bytes | hits | extra           |
++---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
+| .../t.parquet | 2025-09-21T17:40:13.650 | 420827020       | 0-63f5331fb4458-19154f8c | NULL    | 44480534            | 27   | page_index=true |
++---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
+1 row(s) fetched.
+Elapsed 0.003 seconds.
+``` + + +### `QUALIFY` Clause + +DataFusion now supports the `QUALIFY` SQL clause +([#16933](https://github.com/apache/datafusion/pull/16933)), which simplifies +filtering window function output (similar to how `HAVING` filters +aggregation output). + +For example, filtering the output of the `rank()` function previously +required a query like this: + +```sql +SELECT a, b, c +FROM ( + SELECT a, b, c, rank() OVER(PARTITION BY a ORDER BY b) as rk + FROM t +) +WHERE rk = 1 +``` + +The same query can now be written like this: +```sql +SELECT a, b, c, rank() OVER(PARTITION BY a ORDER BY b) as rk +FROM t +QUALIFY rk = 1 +``` + +Although it is not part of the SQL standard (yet), it has been gaining +adoption in several SQL analytical systems such as DuckDB, Snowflake, and +BigQuery. Thanks to [Huaijin] and [Jonah Gao] for delivering this feature. + +[Huaijin]: https://github.com/haohuaijin +[Jonah Gao]: https://github.com/jonahgao + +### `FILTER` Support for Window Functions + +Continuing the theme, the `FILTER` clause has been extended to support +[aggregate window functions](https://github.com/apache/datafusion/pull/17378). +It allows these functions to apply to specific rows without having to +rely on `CASE` expressions, similar to what was already possible with regular +aggregate functions. + +For example, we can gather multiple distinct sets of values matching different +criteria with a single pass over the input: + +```sql +SELECT + ARRAY_AGG(c2) FILTER (WHERE c2 >= 2) OVER (...) -- e.g. [2, 3, 4] + ARRAY_AGG(CASE WHEN c2 >= 2 THEN c2 END) OVER (...) -- e.g. [NULL, NULL, 2, 3, 4] +... +FROM table +``` + +Thanks to [Geoffrey Claude] and [Jeffrey Vo] for delivering this feature. 
+ + +[Geoffrey Claude]: https://github.com/geoffreyclaude +[Jeffrey Vo]: https://github.com/Jefffrey + +### `ConfigOptions` Now Available to Functions + +DataFusion 50.0.0 now passes session configuration parameters to User-Defined +Functions (UDFs) via +[ScalarFunctionArgs](https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarFunctionArgs.html) +([#16970](https://github.com/apache/datafusion/pull/16970)). This allows +behavior that varies based on runtime state; for example, time UDFs can use the +session-specified time zone instead of just UTC. + +Thanks to [Bruce Ritchie], [Piotr Findeisen], [Oleks V], and [Andrew Lamb] for delivering this feature. + +[Bruce Ritchie]: https://github.com/Omega359 +[Piotr Findeisen]: https://github.com/findepi +[Oleks V]: https://github.com/comphead +[Andrew Lamb]: https://github.com/alamb + +### Additional Apache Spark Compatible Functions + +Finally, due to Apache Spark's impact on analytical processing, many DataFusion +users desire Spark compatibility in their workloads, so DataFusion provides a +set of Spark-compatible functions in the [datafusion-spark](https://crates.io/crates/datafusion-spark) crate. +You can read more about this project in the [announcement] and [epic]. 
+DataFusion 50.0.0 adds several new such functions: + +- [`array`](https://github.com/apache/datafusion/pull/16936) +- [`bit_get/bit_count`](https://github.com/apache/datafusion/pull/16942) +- [`bitmap_count`](https://github.com/apache/datafusion/pull/17179) +- [`crc32/sha1`](https://github.com/apache/datafusion/pull/17032) +- [`date_add/date_sub`](https://github.com/apache/datafusion/pull/17024) +- [`if`](https://github.com/apache/datafusion/pull/16946) +- [`last_day`](https://github.com/apache/datafusion/pull/16828) +- [`like/ilike`](https://github.com/apache/datafusion/pull/16962) +- [`luhn_check`](https://github.com/apache/datafusion/pull/16848) +- [`mod/pmod`](https://github.com/apache/datafusion/pull/16829) +- [`next_day`](https://github.com/apache/datafusion/pull/16780) +- [`parse_url`](https://github.com/apache/datafusion/pull/16937) +- [`rint`](https://github.com/apache/datafusion/pull/16924) +- [`width_bucket`](https://github.com/apache/datafusion/pull/17331) + +Thanks to [David López], [Chen Chongchen], [Alan Tang], [Peter Nguyen], and [Evgenii Glotov] for delivering these functions. We are looking for additional help +reviewing and implementing more functions; please reach out on the [epic] if you are interested. 
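+
+As a small taste of the new functions (assuming the `datafusion-spark` function
+set has been registered with the session, since it ships as a separate crate
+and is not part of the default function registry):
+
+```sql
+SELECT
+  next_day(DATE '2025-09-29', 'Friday'),  -- first Friday after the given date
+  date_add(DATE '2025-09-29', 7),         -- seven days later
+  luhn_check('4111111111111111');         -- validate a Luhn checksum
+```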
+ + +[David López]: https://github.com/davidlghellin +[Chen Chongchen]: https://github.com/chenkovsky +[Alan Tang]: https://github.com/Standing-Man +[Peter Nguyen]: https://github.com/petern48 +[Evgenii Glotov]: https://github.com/SparkApplicationMaster +[announcement]: https://datafusion.apache.org/blog/2025/07/16/datafusion-48.0.0/#new-datafusion-spark-crate +[epic]: https://github.com/apache/datafusion/issues/15914 + + +## Known Issues / Patchset + +As DataFusion continues to mature, we regularly release patch versions to fix issues +in major releases. Since the release of `50.0.0`, we have identified a few +issues, and expect to release `50.1.0` to address them. You can track progress +in this [ticket](https://github.com/apache/datafusion/issues/17594). + + +## Upgrade Guide and Changelog + +Upgrading to 50.0.0 should be straightforward for most users. Please review the +[Upgrade Guide](https://datafusion.apache.org/library-user-guide/upgrading.html) +for details on breaking changes and code snippets to help with the transition. +Recently, some users have reported success automatically upgrading DataFusion by +pairing AI tools with the upgrade guide. For a comprehensive list of all +changes, please refer to the [changelog]. + +## About DataFusion + +[Apache DataFusion] is an extensible query engine, written in [Rust], that uses +[Apache Arrow] as its in-memory format. DataFusion is used by developers to +create new, fast, data-centric systems such as databases, dataframe libraries, +and machine learning and streaming applications. While [DataFusion’s primary +design goal] is to accelerate the creation of other data-centric systems, it +provides a reasonable experience directly out of the box as a [dataframe +library], [Python library], and [command-line SQL tool]. 
+ +[apache datafusion]: https://datafusion.apache.org/ +[rust]: https://www.rust-lang.org/ +[apache arrow]: https://arrow.apache.org +[DataFusion’s primary design goal]: https://datafusion.apache.org/user-guide/introduction.html#project-goals +[dataframe library]: https://datafusion.apache.org/user-guide/dataframe.html +[python library]: https://datafusion.apache.org/python/ +[command-line SQL tool]: https://datafusion.apache.org/user-guide/cli/ + +DataFusion's core thesis is that, as a community, together we can build much +more advanced technology than any of us as individuals or companies could build +alone. Without DataFusion, highly performant vectorized query engines would +remain the domain of a few large companies and world-class research +institutions. With DataFusion, we can all build on top of a shared foundation +and focus on what makes our projects unique. + +## How to Get Involved + +DataFusion is not a project built or driven by a single person, company, or +foundation. Rather, our community of users and contributors works together to +build a shared technology that none of us could have built alone. + +If you are interested in joining us, we would love to have you. You can try out +DataFusion on some of your own data and projects and let us know how it goes, +contribute suggestions, documentation, bug reports, or a PR with documentation, +tests, or code. A list of open issues suitable for beginners is [here], and you +can find out how to reach us on the [communication doc]. 
+ +[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22 +[communication doc]: https://datafusion.apache.org/contributor-guide/communication.html diff --git a/content/images/datafusion-50.0.0/performance_over_time_clickbench.png b/content/images/datafusion-50.0.0/performance_over_time_clickbench.png new file mode 100644 index 00000000..e6a8f14c Binary files /dev/null and b/content/images/datafusion-50.0.0/performance_over_time_clickbench.png differ