Merged
516b520
Blog: Add blog post about DataFusion 50.0.0 release
nuno-faria Sep 23, 2025
bb7393c
Update content/blog/2025-09-24-datafusion-50.0.0.md
nuno-faria Sep 23, 2025
fc2eb2a
Add ref to future work of dynamic filter pushdown
nuno-faria Sep 23, 2025
0ce5b67
Update content/blog/2025-09-24-datafusion-50.0.0.md
nuno-faria Sep 24, 2025
5a2e15b
Add clarification about dynamic filters
nuno-faria Sep 24, 2025
ff3bed0
Adjust date
alamb Sep 25, 2025
0c96542
Add new committers and additional blog
alamb Sep 25, 2025
abf5ce4
Move dynamic predicate content into section
alamb Sep 25, 2025
39633d0
Improve spilling sorts section
alamb Sep 25, 2025
06bda1d
Update filter pushdown section
alamb Sep 25, 2025
c8013e2
Edit parquet metadata cache section
alamb Sep 25, 2025
6dcb94f
Merge pull request #1 from alamb/alamb/df50_suggestions
nuno-faria Sep 25, 2025
08fb67d
Update performance numbers
alamb Sep 25, 2025
a0f8cc3
Update content/blog/2025-09-29-datafusion-50.0.0.md
nuno-faria Sep 25, 2025
fe95e61
Update content/blog/2025-09-29-datafusion-50.0.0.md
nuno-faria Sep 25, 2025
e53bb50
Update content/blog/2025-09-29-datafusion-50.0.0.md
nuno-faria Sep 25, 2025
312e260
Apply suggestions, Minor fixes
nuno-faria Sep 25, 2025
4ce41a6
wordsmith and add some more links to spark functions
alamb Sep 26, 2025
a0495e2
Copyediting -- thanks to chatGPT
alamb Sep 26, 2025
ff9569e
more tweaks
alamb Sep 26, 2025
3714364
Apply suggestions
nuno-faria Sep 26, 2025
7f8369e
Add 'Known Issues' section
nuno-faria Sep 26, 2025
9cb7f8c
Clarify cache improvements
nuno-faria Sep 26, 2025
aa4b697
reword known issues section
alamb Sep 29, 2025
c38da61
Tighten up intro and figure caption
alamb Sep 29, 2025
9665a38
Add thanks for contributors
alamb Sep 29, 2025
39fd971
Add thanks for contributors for metadata cache
alamb Sep 29, 2025
9f624bd
Thanks for filter, qualify, and configs
alamb Sep 29, 2025
bc658a0
more thanks
alamb Sep 29, 2025
62fed22
fixups
alamb Sep 29, 2025
fe5b498
final touchups
alamb Sep 29, 2025
389 changes: 389 additions & 0 deletions content/blog/2025-09-24-datafusion-50.0.0.md
@@ -0,0 +1,389 @@
---
layout: post
title: Apache DataFusion 50.0.0 Released
date: 2025-09-24
author: pmc
categories: [release]
---

<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

[TOC]

<!-- see https://github.com/apache/datafusion/issues/16347 for details -->

## Introduction

We are proud to announce the release of [DataFusion 50.0.0]. This blog post
highlights some of the major improvements since the release of [DataFusion
49.0.0]. The complete list of changes is available in the [changelog].

[DataFusion 50.0.0]: https://crates.io/crates/datafusion/50.0.0
[DataFusion 49.0.0]: https://datafusion.apache.org/blog/2025/07/28/datafusion-49.0.0/
[changelog]: https://github.com/apache/datafusion/blob/branch-50/dev/changelog/50.0.0.md


## Performance Improvements 🚀

> **📝TODO** *Update chart*

DataFusion continues to focus on enhancing performance, as shown in the
ClickBench and other results.

<img src="/blog/images/datafusion-49.0.0/performance_over_time_clickbench.png"
width="100%" class="img-responsive" alt="ClickBench performance results over
time for DataFusion" />

**Figure 1**: ClickBench performance improvements over time. Average and median
normalized query execution times for ClickBench queries for each git revision.
Query times are normalized using the ClickBench definition. Data and definitions
are available on the [DataFusion Benchmarking
Page](https://alamb.github.io/datafusion-benchmarking/).

Here are some noteworthy optimizations added since DataFusion 49:

**Dynamic Filter Pushdown Improvements**

The dynamic filter pushdown optimization, which uses filters created at runtime
to cut down on the amount of data read, has been extended to support **inner
hash joins**. This optimization dramatically improves the performance of inner
joins when one of the relations is relatively small or is filtered by a highly
selective predicate. Consider the following example:

```sql
-- retrieve the orders of the customer with c_phone = '25-989-741-2988'
SELECT *
FROM customer
JOIN orders ON c_custkey = o_custkey
WHERE c_phone = '25-989-741-2988';
```

Previously, the entire `orders` relation was scanned to join with the target
customer. Now the dynamic filter is pushed down to the scan and filters `orders`
right at the source, keeping the amount of data loaded to a minimum. The result
is an order of magnitude faster execution time. This
[article](https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/) goes
into more detail about the dynamic filter pushdown optimization in DataFusion.

The dynamic filter pushdown optimization in the TopK operator has also been
improved in DataFusion 50.0.0, ensuring that the filters used are as selective
as possible. You can read more about it in this
[pull request](https://github.com/apache/datafusion/pull/16433).
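
For context, the TopK operator is what DataFusion uses for `ORDER BY ... LIMIT`
queries. The sketch below shows the kind of query that can benefit. The column
names are assumed from the TPC-H `orders` schema and do not appear elsewhere in
this post.

```sql
-- Hypothetical TopK query over the TPC-H `orders` table
-- (column names assumed from the TPC-H schema).
-- As the operator fills its top-10 set, the current threshold can be
-- pushed to the scan so that rows which cannot qualify are skipped.
SELECT o_orderkey, o_totalprice
FROM orders
ORDER BY o_totalprice DESC
LIMIT 10;
```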

The next step will be to [extend the dynamic filters to other types of
joins](https://github.com/apache/datafusion/issues/16973), such as left and
right joins.

**Nested Loop Join Optimization**

The nested loop join has been rewritten to reduce execution time and memory
usage by adopting a finer-grained approach. Specifically, we now limit the
intermediate data size to around a single `RecordBatch` for better memory
efficiency, and we have eliminated redundant conversions from the old
implementation to further improve execution speed.

When evaluating this new approach in a microbenchmark, we measured up to 5x
faster execution and 99% lower memory usage. More details and results can be
found in this
[pull request](https://github.com/apache/datafusion/pull/16996).
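
As a rough illustration, nested loop joins are typically planned for joins that
have no equality condition, since a hash join cannot be used for them. The
tables `t1` and `t2` below are hypothetical. Running the query with `EXPLAIN`
would be expected to show a `NestedLoopJoinExec` node in the physical plan.

```sql
-- Hypothetical tables t1(a) and t2(b).
-- The join condition is a pure inequality, so there is no key to hash on.
EXPLAIN
SELECT t1.a, t2.b
FROM t1
JOIN t2 ON t1.a < t2.b;
```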

**Parquet Metadata Caching**

The metadata of Parquet files, such as min/max statistics and page indexes, is
now cached to avoid unnecessary disk/network round-trips. This is especially
useful with multiple small reads over relatively large files, allowing us to
achieve an order of magnitude faster execution time. More information can be
found in the [Parquet Metadata Cache](#parquet-metadata-cache) section.

## Community Growth 📈

In the last month and a half, between `49.0.0` and `50.0.0`, we have seen our
community grow:

1. New PMC members and committers: **📝TODO** joined the PMC. **📝TODO** joined
as committers. See the [mailing list] for more details.
2. In the [core DataFusion repo] alone, we reviewed and accepted 318 PRs
from 79 different contributors, created 235 issues, and closed 197 of them
🚀. All changes are listed in the detailed [changelogs].
3. DataFusion published *[Using External Indexes, Metadata Stores, Catalogs and
Caches to Accelerate Queries on Apache Parquet]* and *[Dynamic Filters:
Passing Information Between Operators During Execution for 25x Faster
Queries]*, which detail several substantial performance optimizations.

<!--
# Unique committers
$ git shortlog -sn 49.0.0..50.0.0 . | wc -l
79
# commits
$ git log --pretty=oneline 49.0.0..50.0.0 . | wc -l
318

https://crates.io/crates/datafusion/49.0.0
DataFusion 49 released July 25, 2025

https://crates.io/crates/datafusion/50.0.0
DataFusion 50 released September 16, 2025

Issues created in this time: 117 open, 118 closed = 235 total
https://github.com/apache/datafusion/issues?q=is%3Aissue+created%3A2025-07-25..2025-09-16

Issues closed: 197
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2025-07-25..2025-09-16

PRs merged in this time 371
https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2025-07-25..2025-09-16
-->


[core DataFusion repo]: https://github.com/apache/arrow-datafusion
[changelogs]: https://github.com/apache/datafusion/tree/main/dev/changelog
[mailing list]: https://lists.apache.org/list.html?dev@datafusion.apache.org
[Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet]: https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/
[Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/

## New Features ✨

### Spilling Sorts

DataFusion 50.0.0 can now handle most larger-than-memory sorts, thanks to the
recent introduction of multi-level merge sorts (more details in the respective
[pull request](https://github.com/apache/datafusion/pull/15700)). By spilling to
disk, DataFusion can now execute more queries that would otherwise fail with
*out-of-memory* errors.
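
The sketch below shows one way to exercise spilling from `datafusion-cli`. It
assumes that the `datafusion.runtime.memory_limit` runtime setting is available
in your build. The file name, column name, and limit value are hypothetical and
chosen only for illustration.

```sql
-- Cap the memory available to queries (assumed runtime setting;
-- the value is illustrative only).
SET datafusion.runtime.memory_limit = '1G';

-- If the sorted data does not fit within the limit, the sort is
-- expected to spill sorted runs to disk and merge them in several passes.
SELECT *
FROM 'big_table.parquet'   -- hypothetical file
ORDER BY sort_key;         -- hypothetical column
```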

### Dynamic Filter Pushdown For Hash Joins

The [dynamic filter pushdown
optimization](https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/)
has been extended to inner hash joins, dramatically reducing the amount of
scanned data in some workloads. More information can be found in the respective
[pull request](https://github.com/apache/datafusion/pull/16445). This technique is
also sometimes referred to as [*Sideways information
passing*](https://www.cs.cmu.edu/~15721-f24/papers/Sideways_Information_Passing.pdf).

These filters are automatically applied to inner hash joins, and future work
will aim to extend them to other join types. They can be toggled with the
following setting:

```sql
datafusion.optimizer.enable_dynamic_filter_pushdown
```
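
For example, the setting can be changed per session with a `SET` statement,
which is handy for comparing plans with and without the optimization. This is a
minimal sketch using the setting name given above:

```sql
-- Disable dynamic filter pushdown for the current session
SET datafusion.optimizer.enable_dynamic_filter_pushdown = false;

-- Re-enable it (the filters are applied automatically by default)
SET datafusion.optimizer.enable_dynamic_filter_pushdown = true;
```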

The following example shows how execution plans look in DataFusion 50.0.0 with
this optimization:

```sql
EXPLAIN ANALYZE
SELECT *
FROM customer
JOIN orders ON c_custkey = o_custkey
WHERE c_phone = '25-989-741-2988';

-- plan excerpt
HashJoinExec
    DataSourceExec:
        predicate=c_phone@4 = 25-989-741-2988
        metrics=[output_rows=1, ...]
    DataSourceExec:
        -- dynamic filter is added here, filtering directly at scan time
        predicate=DynamicFilterPhysicalExpr [ o_custkey@1 >= 1 AND o_custkey@1 <= 1 ]
        -- the number of output rows is kept to a minimum
        metrics=[output_rows=11, ...]
```

### Parquet Metadata Cache

The metadata of Parquet files (statistics, page indexes, ...) is now
automatically cached to reduce disk/network round-trips and repeated decodings
of the same information. With a simple microbenchmark that executes point reads
(e.g., `SELECT v FROM t WHERE k = x`) over large files, we measured a 12x
improvement in execution time (more details can be found in the respective
[pull request](https://github.com/apache/datafusion/pull/16971/)). Further work
was done to make this optimization production-ready, such as making the cache
limit configurable. More details can be found in this
[Epic](https://github.com/apache/datafusion/issues/17000).

The cache can be configured with the following runtime parameter:

```sql
datafusion.runtime.metadata_cache_limit
```

By default, it uses up to 50MB of memory. Setting the limit to 0 will disable
any metadata caching. The default `FileMetadataCache` implementation uses a
*Least-recently-used* eviction algorithm. If necessary, we can provide a custom
[`FileMetadataCache`](https://docs.rs/datafusion/50.0.0/datafusion/execution/cache/cache_manager/trait.FileMetadataCache.html)
implementation when setting up the `RuntimeEnv`.

If the underlying file changes, the cache is automatically invalidated.

Here is the metadata caching in action:

```sql
-- disabling the metadata cache
> SET datafusion.runtime.metadata_cache_limit = '0M';

-- simple query (t.parquet: 100M rows, 3 cols)
> EXPLAIN ANALYZE SELECT * FROM 't.parquet' LIMIT 1;
DataSourceExec: ... metrics=[..., metadata_load_time=229.196422ms, ...]
Elapsed 0.246 seconds.

-- enabling the metadata cache
> SET datafusion.runtime.metadata_cache_limit = '50M';

> EXPLAIN ANALYZE SELECT * FROM 't.parquet' LIMIT 1;
DataSourceExec: ... metrics=[..., metadata_load_time=228.612µs, ...]
Elapsed 0.003 seconds. -- 82x improvement in this specific query
```

We can inspect the cache contents through the
`FileMetadataCache::list_entries` method. In `datafusion-cli`, we can also use
the
[`metadata_cache()`](https://datafusion.apache.org/user-guide/cli/functions.html#metadata-cache)
function:

```sql
> SELECT * FROM metadata_cache();
+---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
| path          | file_modified           | file_size_bytes | e_tag                    | version | metadata_size_bytes | hits | extra           |
+---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
| .../t.parquet | 2025-09-21T17:40:13.650 | 420827020       | 0-63f5331fb4458-19154f8c | NULL    | 44480534            | 27   | page_index=true |
+---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
1 row(s) fetched.
Elapsed 0.003 seconds.
```
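
Because `metadata_cache()` behaves like a table, its output can be filtered and
sorted with regular SQL. The sketch below uses only the column names shown in
the output above and assumes it is run from `datafusion-cli`:

```sql
-- Show which cached files hold the most metadata and how often each was reused
SELECT path, metadata_size_bytes, hits
FROM metadata_cache()
ORDER BY metadata_size_bytes DESC;
```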

### `QUALIFY` Clause

The `QUALIFY` clause is now available in DataFusion
([#16933](https://github.com/apache/datafusion/pull/16933)). It allows the
results of window functions to be filtered without requiring a subquery (similar
to what `HAVING` does for aggregations).

For example, this query:
```sql
SELECT a, b, c
FROM (
    SELECT a, b, c, rank() OVER(PARTITION BY a ORDER BY b) as rk
    FROM t
)
WHERE rk = 1
```

can now be written like this:
```sql
SELECT a, b, c, rank() OVER(PARTITION BY a ORDER BY b) as rk
FROM t
QUALIFY rk = 1
```

Although it is not a part of the SQL standard (yet), it has been gaining
adoption in several SQL analytical systems, such as DuckDB, Snowflake, and
BigQuery.
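
Note that the `QUALIFY` query above also returns the `rk` column, unlike the
subquery version. In systems such as Snowflake and DuckDB, the window function
can be placed directly in the `QUALIFY` clause to avoid that. The sketch below
assumes DataFusion accepts the same form:

```sql
-- Assumed variant: the window function appears directly in QUALIFY,
-- so `rk` does not need to be part of the output columns.
SELECT a, b, c
FROM t
QUALIFY rank() OVER (PARTITION BY a ORDER BY b) = 1;
```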

### `FILTER` Support for Window Functions

Keeping with the theme, the `FILTER` clause has been extended to support
[aggregate window functions](https://github.com/apache/datafusion/pull/17378).
This allows these functions to be applied to specific rows without having to
rely on `CASE` expressions, similar to what was already possible with regular
aggregate functions.

> **📝TODO** *Add a practical example?*
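
As a minimal sketch, the query below computes a per-customer running total that
only counts completed payments. The `payments` table and its columns are
hypothetical. The second expression is the `CASE` formulation that `FILTER`
replaces.

```sql
-- Hypothetical table: payments(customer, paid_at, amount, status)
SELECT
    customer,
    paid_at,
    amount,
    -- running total that only counts completed payments
    sum(amount) FILTER (WHERE status = 'completed')
        OVER (PARTITION BY customer ORDER BY paid_at) AS completed_total,
    -- the same result expressed with CASE instead of FILTER
    sum(CASE WHEN status = 'completed' THEN amount END)
        OVER (PARTITION BY customer ORDER BY paid_at) AS completed_total_case
FROM payments;
```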

### Behavior of User-Defined Functions

DataFusion 50.0.0 now allows User-Defined Functions (UDFs) to access the global
configuration parameters
([#16970](https://github.com/apache/datafusion/pull/16970)), allowing their
behavior to better suit users' workloads. As an example, time UDFs can now use
custom time zones instead of being limited to UTC.

### Added Several Spark Functions

Finally, due to Apache Spark's impact on analytical processing, many DataFusion
users seek to use its functions in their workloads. Therefore, the new release
of DataFusion has added many such functions, namely:

- [`array`](https://github.com/apache/datafusion/pull/16936)
- [`bit_get/bit_count`](https://github.com/apache/datafusion/pull/16942)
- [`bitmap_count`](https://github.com/apache/datafusion/pull/17179)
- [`crc32/sha1`](https://github.com/apache/datafusion/pull/17032)
- [`date_add/date_sub`](https://github.com/apache/datafusion/pull/17024)
- [`if`](https://github.com/apache/datafusion/pull/16946)
- [`last_day`](https://github.com/apache/datafusion/pull/16828)
- [`like/ilike`](https://github.com/apache/datafusion/pull/16962)
- [`luhn_check`](https://github.com/apache/datafusion/pull/16848)
- [`mod/pmod`](https://github.com/apache/datafusion/pull/16829)
- [`next_day`](https://github.com/apache/datafusion/pull/16780)
- [`parse_url`](https://github.com/apache/datafusion/pull/16937)
- [`rint`](https://github.com/apache/datafusion/pull/16924)
- [`width_bucket`](https://github.com/apache/datafusion/pull/17331)


## Upgrade Guide and Changelog

Upgrading to 50.0.0 should be straightforward for most users. Please review the
[Upgrade Guide](https://datafusion.apache.org/library-user-guide/upgrading.html)
for details on breaking changes and code snippets to help with the transition.
Recently, some users have reported success automatically upgrading DataFusion by
pairing AI tools with the upgrade guide. For a comprehensive list of all
changes, please refer to the [changelog].

## About DataFusion

[Apache DataFusion] is an extensible query engine, written in [Rust], that uses
[Apache Arrow] as its in-memory format. DataFusion is used by developers to
create new, fast, data-centric systems such as databases, dataframe libraries,
and machine learning and streaming applications. While [DataFusion’s primary
design goal] is to accelerate the creation of other data-centric systems, it
provides a reasonable experience directly out of the box as a [dataframe
library], [python library], and [command-line SQL tool].

[apache datafusion]: https://datafusion.apache.org/
[rust]: https://www.rust-lang.org/
[apache arrow]: https://arrow.apache.org
[DataFusion’s primary design goal]: https://datafusion.apache.org/user-guide/introduction.html#project-goals
[dataframe library]: https://datafusion.apache.org/user-guide/dataframe.html
[python library]: https://datafusion.apache.org/python/
[command-line SQL tool]: https://datafusion.apache.org/user-guide/cli/

DataFusion's core thesis is that, as a community, together we can build much
more advanced technology than any of us as individuals or companies could do
alone. Without DataFusion, highly performant vectorized query engines would
remain the domain of a few large companies and world-class research
institutions. With DataFusion, we can all build on top of a shared foundation
and focus on what makes our projects unique.


## How to Get Involved

DataFusion is not a project built or driven by a single person, company, or
foundation. Rather, our community of users and contributors works together to
build a shared technology that none of us could have built alone.

If you are interested in joining us, we would love to have you. You can try out
DataFusion on some of your own data and projects and let us know how it goes,
contribute suggestions, documentation, bug reports, or a PR with documentation,
tests, or code. A list of open issues suitable for beginners is [here], and you
can find out how to reach us on the [communication doc].

[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
[communication doc]: https://datafusion.apache.org/contributor-guide/communication.html