apache · alamb · Sep 29, 2025 · Sep 23, 2025 · Sep 23, 2025 · Sep 23, 2025
diff --git a/content/blog/2025-09-24-datafusion-50.0.0.md b/content/blog/2025-09-24-datafusion-50.0.0.md
@@ -0,0 +1,389 @@
+---
+layout: post
+title: Apache DataFusion 50.0.0 Released
+date: 2025-09-24
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+<!-- see https://github.com/apache/datafusion/issues/16347 for details -->
+
+## Introduction
+
+We are proud to announce the release of [DataFusion 50.0.0]. This blog post
+highlights some of the major improvements since the release of [DataFusion
+49.0.0]. The complete list of changes is available in the [changelog].
+
+[DataFusion 50.0.0]: https://crates.io/crates/datafusion/50.0.0
+[DataFusion 49.0.0]: https://datafusion.apache.org/blog/2025/07/28/datafusion-49.0.0/
+[changelog]: https://github.com/apache/datafusion/blob/branch-50/dev/changelog/50.0.0.md
+
+
+## Performance Improvements 🚀
+
+> **📝TODO** *Update chart*
+
+DataFusion continues to focus on enhancing performance, as shown in the
+ClickBench and other results.
+
+<img src="/blog/images/datafusion-49.0.0/performance_over_time_clickbench.png"
+  width="100%" class="img-responsive" alt="ClickBench performance results over
+  time for DataFusion" />
+
+**Figure 1**: ClickBench performance improvements over time. Average and median
+normalized query execution times for ClickBench queries for each git revision.
+Query times are normalized using the ClickBench definition. Data and definitions
+on the [DataFusion Benchmarking
+Page](https://alamb.github.io/datafusion-benchmarking/).
+
+Here are some noteworthy optimizations added since DataFusion 49:
+
+**Dynamic Filter Pushdown Improvements**
+
+The dynamic filter pushdown optimization, which allows runtime filters to cut
+down on the amount of data read, has been extended to support **inner hash
+joins**. This optimization dramatically improves the performance of inner joins
+when one of the relations is relatively small or filtered by a highly selective
+predicate. Consider the following example:
+
+```sql
+-- retrieve the orders of the customer with c_phone = '25-989-741-2988'
+SELECT *
+FROM customer
+JOIN orders on c_custkey = o_custkey
+WHERE c_phone = '25-989-741-2988';
+```
+
+Previously, the entire `orders` relation would be scanned to join with the
+target customer. Now, the dynamic filter pushdown can filter it right at the
+source, keeping the amount of data loaded to a minimum. The result is an order
+of magnitude faster execution time. This
+[article](https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/) goes
+into more detail about the dynamic filter pushdown optimization in DataFusion.
+
+The dynamic filter pushdown optimization in the TopK operator has also been
+improved in DataFusion 50.0.0, ensuring that the filters used are as selective
+as possible. You can read more about it in this
+[pull request](https://github.com/apache/datafusion/pull/16433).
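+
+As a rough illustration, the TopK operator is the one that evaluates
+`ORDER BY ... LIMIT` queries such as the one below. The sketch assumes the
+TPC-H `orders` table from the previous example and its `o_orderdate` column;
+the exact plan and filter reported by `EXPLAIN ANALYZE` will depend on the data
+and the configuration.
+
+```sql
+-- only the 10 most recent orders are needed, so once the TopK operator has
+-- seen 10 rows, its current threshold can act as a dynamic filter that lets
+-- the scan skip data that cannot make it into the result
+EXPLAIN ANALYZE
+SELECT *
+FROM orders
+ORDER BY o_orderdate DESC
+LIMIT 10;
+```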
+
+The next step will be to [extend the dynamic filters to other types of
+joins](https://github.com/apache/datafusion/issues/16973), such as left and
+right ones.
+
+**Nested Loop Optimization**
+
+The nested loop join has been rewritten to reduce execution time and memory
+usage by adopting a finer-grained approach. Specifically, we now limit the
+intermediate data size to around a single `RecordBatch` for better memory
+efficiency, and we have eliminated redundant conversions present in the old
+implementation to further improve execution speed.
+
+When evaluating this new approach in a microbenchmark, we measured up to 5x
+improvements in execution time and 99% less memory usage. More details and
+results can be found in this
+[pull request](https://github.com/apache/datafusion/pull/16996).
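+
+Nested loop joins are typically used for joins that have no equality condition.
+The following self-contained sketch is the kind of query we would expect to
+exercise this operator. It assumes the built-in `generate_series` table
+function, and the aliases `t1`, `t2`, `a`, and `b` are purely illustrative.
+Prefixing the query with `EXPLAIN` shows which join implementation was
+actually chosen.
+
+```sql
+-- there is no equality predicate between the two inputs, so a hash join
+-- does not apply and every pair of rows has to be compared
+SELECT count(*)
+FROM generate_series(1, 1000) AS t1(a)
+JOIN generate_series(1, 1000) AS t2(b)
+  ON t1.a < t2.b;
+```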
+
+**Parquet Metadata Caching**
+
+The metadata of Parquet files, such as min/max statistics and page indexes, is
+now cached to avoid unnecessary disk/network round-trips. This is especially
+useful with multiple small reads over relatively large files, allowing us to
+achieve an order of magnitude faster execution time. More information can be
+found in the [Parquet Metadata Cache](#parquet-metadata-cache) section.
+
+## Community Growth  📈
+
+In the last month and a half, between `49.0.0` and `50.0.0`, we have seen our
+community grow:
+
+1. New PMC members and committers: **📝TODO** joined the PMC. **📝TODO** joined
+   as committers. See the [mailing list] for more details.
+2. In the [core DataFusion repo] alone, we reviewed and accepted 318 PRs
+   from 79 different contributors, created 235 issues, and closed 197 of them
+   🚀. All changes are listed in the detailed [changelogs].
+3. DataFusion published *[Using External Indexes, Metadata Stores, Catalogs and
+   Caches to Accelerate Queries on Apache Parquet]* and *[Dynamic Filters:
+   Passing Information Between Operators During Execution for 25x Faster
+   Queries]*, which detail several substantial performance optimizations.
+
+<!--
+# Unique committers
+$ git shortlog -sn 49.0.0..50.0.0  . | wc -l
+    79
+# commits
+$ git log --pretty=oneline 49.0.0..50.0.0  . | wc -l
+    318
+
+https://crates.io/crates/datafusion/49.0.0
+DataFusion 49 released July 25, 2025
+
+https://crates.io/crates/datafusion/50.0.0
+DataFusion 50 released September 16, 2025
+
+Issues created in this time: 117 open, 118 closed = 235 total
+https://github.com/apache/datafusion/issues?q=is%3Aissue+created%3A2025-07-25..2025-09-16
+
+Issues closed: 197
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2025-07-25..2025-09-16
+
+PRs merged in this time 371
+https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2025-07-25..2025-09-16
+-->
+
+
+[core DataFusion repo]: https://github.com/apache/arrow-datafusion
+[changelogs]: https://github.com/apache/datafusion/tree/main/dev/changelog
+[mailing list]: https://lists.apache.org/[email protected]
+[Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet]: https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/
+[Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/
+
+## New Features ✨
+
+### Spilling Sorts
+
+Larger-than-memory sorts are now mostly solved in DataFusion 50.0.0, thanks to
+the recent introduction of multi-level merge sorts (more details in the
+respective [pull request](https://github.com/apache/datafusion/pull/15700)).
+By spilling to disk, DataFusion can now execute more queries that would
+otherwise fail with *out-of-memory* errors.
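+
+A minimal sketch of how this might be exercised from `datafusion-cli`. It
+assumes the `datafusion.runtime.memory_limit` runtime setting and a
+hypothetical `big.parquet` file whose sorted contents do not fit within the
+limit; the file name, the column `k`, and the limit value are all illustrative.
+
+```sql
+-- cap the memory available to queries (the value is illustrative)
+SET datafusion.runtime.memory_limit = '1G';
+
+-- `big.parquet` and `k` are hypothetical; with the limit above, the sort is
+-- expected to spill sorted runs to disk and merge them instead of failing
+SELECT * FROM 'big.parquet' ORDER BY k;
+```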
+
+### Dynamic Filter Pushdown For Hash Joins
+
+The [dynamic filter pushdown
+optimization](https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/)
+has been extended to inner hash joins, dramatically reducing the amount of
+scanned data in some workloads. More information can be found in the respective
+[pull request](https://github.com/apache/datafusion/pull/16445). This technique is
+also sometimes referred to as [*Sideways information
+passing*](https://www.cs.cmu.edu/~15721-f24/papers/Sideways_Information_Passing.pdf).
+
+These filters are automatically applied on inner hash joins, while future work
+will aim to introduce them to other types. They can be toggled with the
+following setting:
+
+```sql
+-- enabled by default; set to false to turn the optimization off
+SET datafusion.optimizer.enable_dynamic_filter_pushdown = true;
+```
+
+The following example shows how execution plans look in DataFusion 50.0.0 with
+this optimization:
+
+```sql
+EXPLAIN ANALYZE
+SELECT *
+FROM customer
+JOIN orders on c_custkey = o_custkey
+WHERE c_phone = '25-989-741-2988';
+
+-- plan excerpt
+HashJoinExec
+    DataSourceExec:
+      predicate=c_phone@4 = 25-989-741-2988
+      metrics=[output_rows=1, ...]
+    DataSourceExec:
+      -- dynamic filter is added here, filtering directly at scan time
+      predicate=DynamicFilterPhysicalExpr [ o_custkey@1 >= 1 AND o_custkey@1 <= 1 ]
+      -- the number of output rows is kept to a minimum
+      metrics=[output_rows=11, ...]
+```
+
+### Parquet Metadata Cache
+
+The metadata of Parquet files (statistics, page indexes, ...) is now
+automatically cached to reduce disk/network round-trips and repeated decodings
+of the same information. With a simple microbenchmark that executes point reads
+(e.g., `SELECT v FROM t WHERE k = x`) over large files, we measured a 12x
+improvement in execution time (more details can be found in the respective
+[pull request](https://github.com/apache/datafusion/pull/16971/)). Further work
+has been done to make this optimization production-ready, such as making the
+cache limit configurable. More details can be found in this
+[Epic](https://github.com/apache/datafusion/issues/17000).
+
+The cache can be configured with the following runtime parameter:
+
+```sql
+datafusion.runtime.metadata_cache_limit
+```
+
+By default, it uses up to 50MB of memory. Setting the limit to 0 will disable
+any metadata caching. The default `FileMetadataCache` implementation uses a
+*Least-recently-used* eviction algorithm. If necessary, we can provide a custom
+[`FileMetadataCache`](https://docs.rs/datafusion/50.0.0/datafusion/execution/cache/cache_manager/trait.FileMetadataCache.html)
+implementation when setting up the `RuntimeEnv`.
+
+If the underlying file changes, the cache is automatically invalidated.
+
+Here is the metadata caching in action:
+
+```sql
+-- disabling the metadata cache
+> SET datafusion.runtime.metadata_cache_limit = '0M';
+
+-- simple query (t.parquet: 100M rows, 3 cols)
+> EXPLAIN ANALYZE SELECT * FROM 't.parquet' LIMIT 1;
+DataSourceExec: ... metrics=[..., metadata_load_time=229.196422ms, ...]
+Elapsed 0.246 seconds.
+
+-- enabling the metadata cache
+> SET datafusion.runtime.metadata_cache_limit = '50M';
+
+> EXPLAIN ANALYZE SELECT * FROM 't.parquet' LIMIT 1;
+DataSourceExec: ... metrics=[..., metadata_load_time=228.612µs, ...]
+Elapsed 0.003 seconds. -- 82x improvement in this specific query
+```
+
+We can inspect the cache contents through the
+`FileMetadataCache::list_entries` method. In `datafusion-cli`, we can also use
+the
+[`metadata_cache()`](https://datafusion.apache.org/user-guide/cli/functions.html#metadata-cache)
+function:
+
+```sql
+> SELECT * FROM metadata_cache();
++---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
+| path          | file_modified           | file_size_bytes | e_tag                    | version | metadata_size_bytes | hits | extra           |
++---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
+| .../t.parquet | 2025-09-21T17:40:13.650 | 420827020       | 0-63f5331fb4458-19154f8c | NULL    | 44480534            | 27   | page_index=true |
++---------------+-------------------------+-----------------+--------------------------+---------+---------------------+------+-----------------+
+1 row(s) fetched.
+Elapsed 0.003 seconds.
+```
+
+### `QUALIFY` Clause
+
+The `QUALIFY` clause is now available in DataFusion
+([#16933](https://github.com/apache/datafusion/pull/16933)). It allows window
+function columns to be filtered without requiring a subquery (similarly to what
+`HAVING` does for aggregations).
+
+For example, this query:
+```sql
+SELECT a, b, c
+FROM (
+   SELECT a, b, c, rank() OVER(PARTITION BY a ORDER BY b) as rk
+   FROM t
+)
+WHERE rk = 1
+```
+
+can now be written like this:
+```sql
+SELECT a, b, c
+FROM t
+QUALIFY rank() OVER(PARTITION BY a ORDER BY b) = 1
+```
+
+Although it is not a part of the SQL standard (yet), it has been gaining
+adoption in several SQL analytical systems, such as DuckDB, Snowflake, and
+BigQuery.
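+
+A common use of `QUALIFY` is picking one row per group. The self-contained
+sketch below uses a made-up `readings` relation (the names and values are
+illustrative only) and keeps the most recent reading of each sensor, which
+should be the rows `('s1', 2, 21.0)` and `('s2', 3, 17.9)`, in no particular
+order.
+
+```sql
+SELECT sensor, reading_time, temperature
+FROM (VALUES
+  ('s1', 1, 20.5),
+  ('s1', 2, 21.0),
+  ('s2', 1, 18.2),
+  ('s2', 3, 17.9)
+) AS readings(sensor, reading_time, temperature)
+-- number the readings of each sensor from newest to oldest, keep the newest
+QUALIFY row_number() OVER (PARTITION BY sensor ORDER BY reading_time DESC) = 1;
+```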
+
+### `FILTER` Support for Window Functions
+
+In keeping with the theme, the `FILTER` clause has been extended to support
+[aggregate window functions](https://github.com/apache/datafusion/pull/17378).
+This allows these functions to be applied to specific rows without having to
+rely on `CASE` expressions, similar to what was already possible with regular
+aggregate functions.
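+
+As a sketch of what this looks like, the following self-contained query
+computes a running count of completed orders per customer. The `orders`
+relation and its values are made up for this example. Without `FILTER`, the
+same result would need something like
+`sum(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) OVER (...)`.
+
+```sql
+SELECT
+  customer,
+  order_day,
+  status,
+  -- only rows with status = 'completed' contribute to the running count
+  count(*) FILTER (WHERE status = 'completed')
+    OVER (PARTITION BY customer ORDER BY order_day) AS completed_so_far
+FROM (VALUES
+  ('alice', 1, 'completed'),
+  ('alice', 2, 'cancelled'),
+  ('alice', 3, 'completed'),
+  ('bob',   1, 'cancelled')
+) AS orders(customer, order_day, status);
+```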
+
+> **📝TODO** *Add a practical example?*
+
+### Behavior of User-Defined Functions
+
+DataFusion 50.0.0 now allows User-Defined Functions (UDFs) to access the global
+configuration parameters
+([#16970](https://github.com/apache/datafusion/pull/16970)), so that their
+behavior can better suit users' workloads. As an example, time UDFs can now use
+custom time zones instead of being limited to UTC.
+
+### Added Several Spark Functions
+
+Finally, due to Apache Spark's impact on analytical processing, many DataFusion
+users seek to use its functions in their workloads. Therefore, the new release
+of DataFusion has added many such functions, namely:
+
+- [`array`](https://github.com/apache/datafusion/pull/16936)
+- [`bit_get/bit_count`](https://github.com/apache/datafusion/pull/16942)
+- [`bitmap_count`](https://github.com/apache/datafusion/pull/17179)
+- [`crc32/sha1`](https://github.com/apache/datafusion/pull/17032)
+- [`date_add/date_sub`](https://github.com/apache/datafusion/pull/17024)
+- [`if`](https://github.com/apache/datafusion/pull/16946)
+- [`last_day`](https://github.com/apache/datafusion/pull/16828)
+- [`like/ilike`](https://github.com/apache/datafusion/pull/16962)
+- [`luhn_check`](https://github.com/apache/datafusion/pull/16848)
+- [`mod/pmod`](https://github.com/apache/datafusion/pull/16829)
+- [`next_day`](https://github.com/apache/datafusion/pull/16780)
+- [`parse_url`](https://github.com/apache/datafusion/pull/16937)
+- [`rint`](https://github.com/apache/datafusion/pull/16924)
+- [`width_bucket`](https://github.com/apache/datafusion/pull/17331)
+
+
+## Upgrade Guide and Changelog
+
+Upgrading to 50.0.0 should be straightforward for most users. Please review the
+[Upgrade Guide](https://datafusion.apache.org/library-user-guide/upgrading.html)
+for details on breaking changes and code snippets to help with the transition.
+Recently, some users have reported success automatically upgrading DataFusion by
+pairing AI tools with the upgrade guide. For a comprehensive list of all
+changes, please refer to the [changelog].
+
+## About DataFusion
+
+[Apache DataFusion] is an extensible query engine, written in [Rust], that uses
+[Apache Arrow] as its in-memory format. DataFusion is used by developers to
+create new, fast, data-centric systems such as databases, dataframe libraries,
+and machine learning and streaming applications. While [DataFusion’s primary
+design goal] is to accelerate the creation of other data-centric systems, it
+provides a reasonable experience directly out of the box as a [dataframe
+library], [Python library], and [command-line SQL tool].
+
+[apache datafusion]: https://datafusion.apache.org/
+[rust]: https://www.rust-lang.org/
+[apache arrow]: https://arrow.apache.org
+[DataFusion’s primary design goal]: https://datafusion.apache.org/user-guide/introduction.html#project-goals
+[dataframe library]: https://datafusion.apache.org/user-guide/dataframe.html
+[python library]: https://datafusion.apache.org/python/
+[command-line SQL tool]: https://datafusion.apache.org/user-guide/cli/
+
+DataFusion's core thesis is that, as a community, together we can build much
+more advanced technology than any of us as individuals or companies could do
+alone. Without DataFusion, highly performant vectorized query engines would
+remain the domain of a few large companies and world-class research
+institutions. With DataFusion, we can all build on top of a shared foundation
+and focus on what makes our projects unique.
+
+
+## How to Get Involved
+
+DataFusion is not a project built or driven by a single person, company, or
+foundation. Rather, our community of users and contributors works together to
+build a shared technology that none of us could have built alone.
+
+If you are interested in joining us, we would love to have you. You can try out
+DataFusion on some of your own data and projects and let us know how it goes,
+contribute suggestions, documentation, bug reports, or a PR with documentation,
+tests, or code. A list of open issues suitable for beginners is [here], and you
+can find out how to reach us on the [communication doc].
+
+[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
+[communication doc]: https://datafusion.apache.org/contributor-guide/communication.html