diff --git a/content/blog/2025-07-28-datafusion-49.0.0.md b/content/blog/2025-07-28-datafusion-49.0.0.md new file mode 100644 index 00000000..576a2f76 --- /dev/null +++ b/content/blog/2025-07-28-datafusion-49.0.0.md @@ -0,0 +1,424 @@ +--- +layout: post +title: Apache DataFusion 49.0.0 Released +date: 2025-07-28 +author: pmc +categories: [release] +--- + + + + + +## Introduction + +We are proud to announce the release of [DataFusion 49.0.0]. This blog post highlights some of +the major improvements since the release of [DataFusion 48.0.0]. The complete list of changes is available in the [changelog]. + +[DataFusion 49.0.0]: https://crates.io/crates/datafusion/49.0.0 +[DataFusion 48.0.0]: https://datafusion.apache.org/blog/2025/07/18/datafusion-48.0.0/ +[changelog]: https://github.com/apache/datafusion/blob/branch-49/dev/changelog/49.0.0.md + + +## Performance Improvements 🚀 + +DataFusion continues to focus on enhancing performance, as shown in the ClickBench and other results. + +ClickBench performance results over time for DataFusion + +**Figure 1**: ClickBench performance improvements over time +Average and median normalized query execution times for ClickBench queries for each git revision. +Query times are normalized using the ClickBench definition. Data and definitions on the +[DataFusion Benchmarking Page](https://alamb.github.io/datafusion-benchmarking/). + + + +Here are some noteworthy optimizations added since DataFusion 48: + +**Equivalence system upgrade:** The lower levels of the equivalence system, which is used to implement the + optimizations described in [Using Ordering for Better Plans], were rewritten, leading to + much faster planning times, especially for queries with a [large number of columns](https://github.com/apache/datafusion/pull/16217#pullrequestreview-2891941229). This change also prepares + the way for more sophisticated sort-based optimizations in the future. 
(PR [#16217](https://github.com/apache/datafusion/pull/16217) by [ozankabak](https://github.com/ozankabak)). + +[Using Ordering for Better Plans]: https://datafusion.apache.org/blog/2025/03/11/ordering-analysis + +**Dynamic Filters and TopK pushdown** + +DataFusion now supports dynamic filters, which are improved during query execution, +and physical filter pushdown. Together, these features improve the performance of +queries that use `LIMIT` and `ORDER BY` clauses, such as the following: + +```sql +SELECT * +FROM data +ORDER BY timestamp DESC +LIMIT 10 +``` + +While the query above is simple, without dynamic filtering or knowing that the data +is already sorted by `timestamp`, a query engine must decode *all* of the data to +find the top 10 values. With the dynamic filters system, DataFusion applies an +increasingly selective filter during query execution. It checks the **current** +top 10 values of the `timestamp` column **before** opening files or reading +Parquet Row Groups and Data Pages, which can skip older data very quickly. + +Dynamic predicates are a common feature of advanced engines such as [Dynamic +Filters in Starburst] and [Top-K Aggregation Optimization at Snowflake]. The +technique drastically improves query performance (we've seen over a 1.5x +improvement for some TPC-H-style queries), especially in combination with late +materialization and columnar file formats such as Parquet. We [plan to write a +blog post] explaining the details of this optimization in the future, and we expect to +use the same mechanism to implement additional optimizations such as [Sideways +Information Passing for joins] (Issue +[#15037](https://github.com/apache/datafusion/issues/15037) PR +[#15770](https://github.com/apache/datafusion/pull/15770) by +[adriangb](https://github.com/adriangb)). 
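
The idea behind the dynamic TopK filter can be sketched in a few lines of Rust. This is a simplified illustration under stated assumptions, not DataFusion's actual implementation: `RowGroup` and `top_k_desc` are hypothetical names, and the per-group `max_timestamp` stands in for the min/max statistics stored in Parquet metadata. The threshold tightens as better values are found, so later row groups can be skipped without decoding them.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for a Parquet row group: a max statistic for the
/// `timestamp` column plus the (expensive to decode) values themselves.
pub struct RowGroup {
    pub max_timestamp: i64,
    pub timestamps: Vec<i64>,
}

/// Returns the `k` largest timestamps in descending order, together with the
/// number of row groups that were skipped using only their statistics.
pub fn top_k_desc(groups: &[RowGroup], k: usize) -> (Vec<i64>, usize) {
    // Min-heap over the current top-k values: the smallest retained value is
    // the dynamic filter threshold (`timestamp > threshold`).
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::new();
    let mut skipped = 0;

    for group in groups {
        // Once k values are held, a group whose maximum cannot beat the
        // current threshold is skipped without reading its values.
        if heap.len() == k {
            if let Some(&Reverse(threshold)) = heap.peek() {
                if group.max_timestamp <= threshold {
                    skipped += 1;
                    continue;
                }
            }
        }
        for &ts in &group.timestamps {
            if heap.len() < k {
                heap.push(Reverse(ts));
            } else if let Some(&Reverse(threshold)) = heap.peek() {
                if ts > threshold {
                    // A better value raises the threshold for later groups.
                    heap.pop();
                    heap.push(Reverse(ts));
                }
            }
        }
    }

    let mut top: Vec<i64> = heap.into_iter().map(|Reverse(ts)| ts).collect();
    top.sort_unstable_by(|a, b| b.cmp(a));
    (top, skipped)
}
```

With `k = 2` and row groups whose maxima are 100, 70, and 95, the second group is skipped because the threshold is already 90 after the first group, while the third group is still read because its maximum exceeds the threshold.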
+ + +[Dynamic Filters in Starburst]: https://docs.starburst.io/latest/admin/dynamic-filtering.html +[Top-K Aggregation Optimization at Snowflake]: https://www.snowflake.com/en/engineering-blog/optimizing-top-k-aggregation-snowflake/ +[plan to write a blog post]: https://github.com/apache/datafusion/issues/15513 +[Sideways Information Passing for joins]: https://github.com/apache/datafusion/issues/7955 + + + +## Community Growth 📈 + +The last few months, between `46.0.0` and `49.0.0`, have seen our community grow: + +1. New PMC members and committers: [berkay], [xudong963] and [timsaucer] joined the PMC. + [blaginin], [milenkovicm], [adriangb] and [kosiew] joined as committers. See the [mailing list] for more details. +2. In the [core DataFusion repo] alone, we reviewed and accepted over 850 PRs from 172 different + committers, created over 669 issues, and closed 379 of them 🚀. All changes are listed in the detailed + [changelogs]. +3. DataFusion published a number of blog posts, including [User defined Window Functions], [Optimizing SQL (and DataFrames) + in DataFusion part 1], [part 2], [Using Rust async for Query Execution and Cancelling Long-Running Queries], and + [Embedding User-Defined Indexes in Apache Parquet Files]. 
+ + + + + +[core DataFusion repo]: https://github.com/apache/arrow-datafusion +[changelogs]: https://github.com/apache/datafusion/tree/main/dev/changelog +[mailing list]: https://lists.apache.org/list.html?dev@datafusion.apache.org +[berkay]: https://github.com/berkaysynnada +[xudong963]: https://github.com/xudong963 +[timsaucer]: https://github.com/timsaucer +[blaginin]: https://github.com/blaginin +[milenkovicm]: https://github.com/milenkovicm +[adriangb]: https://github.com/adriangb +[kosiew]: https://github.com/kosiew +[User defined Window Functions]: https://datafusion.apache.org/blog/2025/04/19/user-defined-window-functions +[Optimizing SQL (and DataFrames) in DataFusion part 1]: https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-one +[part 2]: https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two +[Using Rust async for Query Execution and Cancelling Long-Running Queries]: https://datafusion.apache.org/blog/2025/06/30/cancellation +[Embedding User-Defined Indexes in Apache Parquet Files]: https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/ + + +## New Features ✨ + +### Async User-Defined Functions + +It is now possible to write `async` User-Defined Functions +(UDFs) in DataFusion that perform asynchronous +operations, such as network requests or database queries, without blocking the +execution of the query. This enables new use cases, such as +integrating with large language models (LLMs) or other external services, and we can't +wait to see what the community builds with it. + +See the [documentation] for more details and the [async UDF example] for +working code. 
+ +[documentation]: https://datafusion.apache.org/library-user-guide/functions/adding-udfs.html +[async UDF example]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/async_udf.rs + +You could, for example, implement a function `ask_llm` that asks a large language model +(LLM) service a question based on the content of two columns. + +```sql +SELECT * +FROM animal a +WHERE ask_llm(a.name, 'Is this animal furry?') +``` + +The implementation of an async UDF is almost identical to a normal +UDF, except that it must implement the `AsyncScalarUDFImpl` trait in addition to `ScalarUDFImpl` and +provide an `async` implementation via `invoke_async_with_args`: + +```rust +#[derive(Debug)] +struct AskLLM { + signature: Signature, +} + +#[async_trait] +impl AsyncScalarUDFImpl for AskLLM { + /// The `invoke_async_with_args` method is similar to `invoke_with_args`, + /// but it returns a `Future` that resolves to the result. + /// + /// Since this signature is `async`, it can do any `async` operations, such + /// as network requests. + async fn invoke_async_with_args( + &self, + args: ScalarFunctionArgs, + options: &ConfigOptions, + ) -> Result<ArrayRef> { + // Converts the arguments to arrays for simplicity. + let args = ColumnarValue::values_to_arrays(&args.args)?; + let [column_of_interest, question] = take_function_args(self.name(), args)?; + let client = Client::new(); + + // Make a network request to a hypothetical LLM service + let res = client + .post(URI) + .headers(get_llm_headers(options)) + .json(&req) + .send() + .await? 
+ // `LlmResponse` is a hypothetical response type for this sketch + .json::<LlmResponse>() + .await?; + + let results = extract_results_from_llm_response(&res); + + Ok(Arc::new(results)) + } +} +``` + +(Issue [#6518](https://github.com/apache/datafusion/issues/6518), +[PR #14837](https://github.com/apache/datafusion/pull/14837) from +[goldmedal](https://github.com/goldmedal) 🏆) + + +### Better Cancellation for Certain Long-Running Queries + +In rare cases, it was previously not possible to cancel long-running queries, +leading to unresponsiveness. Other projects would likely have fixed this issue +by treating the symptom, but [pepijnve] and the DataFusion community worked together to +treat the root cause. The general solution required a deep understanding of the +DataFusion execution engine, Rust `Streams`, and the tokio cooperative +scheduling model. The [resulting PR] is a model of careful +community engineering and a great example of using Rust's `async` ecosystem +to implement complex functionality. It even resulted in a [contribution upstream to tokio] +(since accepted). See the [blog post] for more details. + +[resulting PR]: https://github.com/apache/datafusion/pull/16398 +[blog post]: https://datafusion.apache.org/blog/2025/06/30/cancellation +[contribution upstream to tokio]: https://github.com/tokio-rs/tokio/pull/7405 +[pepijnve]: https://github.com/pepijnve + +### Metadata for User Defined Types such as `Variant` and `Geometry` + +User-defined types have been [a long-requested feature], and this release provides +the low-level APIs to support them efficiently. + +1. Metadata handling in PRs [#15646](https://github.com/apache/datafusion/pull/15646) and [#16170](https://github.com/apache/datafusion/pull/16170) from [timsaucer] +2. 
Pushdown of filters and expressions (see "Dynamic Filters and TopK pushdown" section above) + +[a long-requested feature]: https://github.com/apache/datafusion/issues/12644 +[timsaucer]: https://github.com/timsaucer + +We still have some work to do to fully support user-defined types, specifically +in documentation and testing, and we would +love your help in this area. If you are interested in contributing, +please see [issue #12644](https://github.com/apache/datafusion/issues/12644). + +### Parquet Modular Encryption + +DataFusion now supports reading and writing encrypted [Apache Parquet] files with [modular +encryption]. This allows users to encrypt specific columns in a Parquet file +using different keys, while still being able to read data without needing to +decrypt the entire file. + +[Apache Parquet]: https://parquet.apache.org/ +[modular encryption]: https://parquet.apache.org/docs/file-format/data-pages/encryption/ + +Here is an example of how to configure DataFusion to read an encrypted Parquet +table with two columns, `double_field` and `float_field`, using modular +encryption: + +```sql +CREATE EXTERNAL TABLE encrypted_parquet_table +( +double_field double, +float_field float +) +STORED AS PARQUET LOCATION 'pq/' OPTIONS ( + -- encryption + 'format.crypto.file_encryption.encrypt_footer' 'true', + 'format.crypto.file_encryption.footer_key_as_hex' '30313233343536373839303132333435', -- b"0123456789012345" + 'format.crypto.file_encryption.column_key_as_hex::double_field' '31323334353637383930313233343530', -- b"1234567890123450" + 'format.crypto.file_encryption.column_key_as_hex::float_field' '31323334353637383930313233343531', -- b"1234567890123451" + -- decryption + 'format.crypto.file_decryption.footer_key_as_hex' '30313233343536373839303132333435', -- b"0123456789012345" + 'format.crypto.file_decryption.column_key_as_hex::double_field' '31323334353637383930313233343530', -- b"1234567890123450" + 
+ 'format.crypto.file_decryption.column_key_as_hex::float_field' '31323334353637383930313233343531', -- b"1234567890123451" +); +``` + +([Issue #15216](https://github.com/apache/datafusion/issues/15216), +[PR #16351](https://github.com/apache/datafusion/pull/16351) +from [corwinjoy](https://github.com/corwinjoy) and [adamreeve](https://github.com/adamreeve)) + + +### Support for `WITHIN GROUP` for Ordered-Set Aggregate Functions + +DataFusion now supports the `WITHIN GROUP` clause for [ordered-set aggregate +functions] such as `approx_percentile_cont`, `percentile_cont`, and +`percentile_disc`, which allows users to specify the precise order. + +For example, the following query computes the 50th percentile (median) of the `temperature` column +in the `city_data` table: + +```sql +SELECT + percentile_disc(0.5) WITHIN GROUP (ORDER BY temperature) AS median_temperature +FROM city_data; +``` + +[ordered-set aggregate functions]: https://www.postgresql.org/docs/9.4/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE + +(Issue [#11732](https://github.com/apache/datafusion/issues/11732), +PR [#13511](https://github.com/apache/datafusion/pull/13511), +by [Garamda](https://github.com/Garamda)) + +### Compressed Spill Files + +DataFusion now supports compressing the files written to disk when spilling +larger-than-memory datasets while sorting and grouping. Using compression +can significantly reduce the +size of the intermediate files and improve performance when reading them back into memory. + +(Issue [#16130](https://github.com/apache/datafusion/issues/16130), +PR [#16268](https://github.com/apache/datafusion/pull/16268) +by [ding-young](https://github.com/ding-young)) + +### Support for `REGEXP_INSTR` function + +DataFusion now supports the [`REGEXP_INSTR` function], which returns the position of a +regular expression match within a string. 
+ +For example, to find the position of the first match of the regular expression +`C(.)(..)` in the string `ABCDEF`, you can use: + +```sql +> SELECT regexp_instr('ABCDEF', 'C(.)(..)'); ++---------------------------------------------------------------+ +| regexp_instr(Utf8("ABCDEF"),Utf8("C(.)(..)")) | ++---------------------------------------------------------------+ +| 3 | ++---------------------------------------------------------------+ +``` + +[`REGEXP_INSTR` function]: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#regexp-instr +([Issue #13009](https://github.com/apache/datafusion/issues/13009), +[PR #15928](https://github.com/apache/datafusion/pull/15928) +by [nirnayroy](https://github.com/nirnayroy)) + +## Upgrade Guide and Changelog + +Upgrading to 49.0.0 should be straightforward for most users. Please review the +[Upgrade Guide](https://datafusion.apache.org/library-user-guide/upgrading.html) +for details on breaking changes and code snippets to help with the transition. +Recently, some users have reported success automatically upgrading DataFusion by +pairing AI tools with the upgrade guide. For a comprehensive list of all changes, +please refer to the [changelog]. + +## About DataFusion + +[Apache DataFusion] is an extensible query engine, written in [Rust], that +uses [Apache Arrow] as its in-memory format. DataFusion is used by developers to +create new, fast, data-centric systems such as databases, dataframe libraries, +and machine learning and streaming applications. While [DataFusion’s primary design +goal] is to accelerate the creation of other data-centric systems, it provides a +reasonable experience directly out of the box as a [dataframe library], +[python library], and [command-line SQL tool]. 
+ +[apache datafusion]: https://datafusion.apache.org/ +[rust]: https://www.rust-lang.org/ +[apache arrow]: https://arrow.apache.org +[DataFusion’s primary design goal]: https://datafusion.apache.org/user-guide/introduction.html#project-goals +[dataframe library]: https://datafusion.apache.org/user-guide/dataframe.html +[python library]: https://datafusion.apache.org/python/ +[command-line SQL tool]: https://datafusion.apache.org/user-guide/cli/ + +DataFusion's core thesis is that as a community, together we can build much more +advanced technology than any of us as individuals or companies could do alone. +Without DataFusion, highly performant vectorized query engines would remain +the domain of a few large companies and world-class research institutions. +With DataFusion, we can all build on top of a shared foundation and focus on +what makes our projects unique. + + +## How to Get Involved + +DataFusion is not a project built or driven by a single person, company, or +foundation. Rather, our community of users and contributors works together to +build a shared technology that none of us could have built alone. + +If you are interested in joining us, we would love to have you. You can try out +DataFusion on some of your own data and projects and let us know how it goes, +contribute suggestions, documentation, bug reports, or a PR with documentation, +tests, or code. A list of open issues suitable for beginners is [here], and you +can find out how to reach us on the [communication doc]. 
+ +[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22 +[communication doc]: https://datafusion.apache.org/contributor-guide/communication.html \ No newline at end of file diff --git a/content/images/datafusion-49.0.0/performance_over_time_clickbench.png b/content/images/datafusion-49.0.0/performance_over_time_clickbench.png new file mode 100644 index 00000000..adb9003b Binary files /dev/null and b/content/images/datafusion-49.0.0/performance_over_time_clickbench.png differ diff --git a/content/images/datafusion-49.0.0/performance_over_time_planning.png b/content/images/datafusion-49.0.0/performance_over_time_planning.png new file mode 100644 index 00000000..50cda905 Binary files /dev/null and b/content/images/datafusion-49.0.0/performance_over_time_planning.png differ