Support Schema Field Metadata in User-Defined Aggregate Functions (UDAFs) #17085

kosiew · 2025-08-08T10:33:02Z

Which issue does this PR close?

Closes AccumulatorArgs.schema is empty when passing in scalar input #16997

Rationale for this change

Previously, aggregate UDFs could not easily access field-level metadata (e.g., Arrow extension types) when invoked with literal arguments only, because the AccumulatorArgs.schema was always derived from the physical schema — which is empty for literal-only inputs.
This change ensures that in such cases, a schema is synthesized from the literal expressions, preserving metadata and enabling richer accumulator behavior. It also clarifies API documentation for AccumulatorArgs and AggregateUDFImpl.

What changes are included in this PR?

Added args_schema() helper to AggregateFunctionExpr to return either the physical input schema or a synthesized schema from literals when the physical schema is empty.
Updated create_accumulator, create_sliding_accumulator, groups_accumulator_supported, and create_groups_accumulator to use the new schema logic via make_acc_args().
Enhanced AccumulatorArgs and AggregateUDFImpl documentation to explain how to access input field metadata and when synthesized schemas are used.
Introduced SchemaBasedAggregateUdf test in user_defined_aggregates.rs to validate metadata handling in literal-only aggregates.
Added unit test in aggregate.rs to verify correct schema behavior for both literal-only and physical-schema-present cases.

Are these changes tested?

Yes.

New integration test test_schema_based_aggregate_udf_metadata ensures that metadata from literals is accessible in the accumulator.
New unit test in aggregate.rs validates that args_schema() returns an owned schema for literal-only inputs and a borrowed schema for non-empty physical schemas.

Are there any user-facing changes?

Yes:

Aggregate UDF implementations can now reliably access input field metadata in AccumulatorArgs.schema for literal-only inputs.
No breaking API changes; only additional guarantees and improved documentation.

…ent handling

- Improve documentation for AccumulatorArgs.schema and exprs:
  - Add example showing how to retrieve field metadata and return field.
  - Explain synthesized schema behavior for literal-only inputs.
  - Clarify precedence when inputs are mixed (physical schema metadata wins; synthesized metadata used only when physical schema is empty).
- Update AggregateFunctionExpr::args_schema docs:
  - Explain field order guarantees, synthesized schema usage, and that std::borrow::Cow is used to avoid allocations when possible.
- Add a TODO to factor AccumulatorArgs construction into a private helper.

Documentation-only changes; no behavioral changes.

… in AggregateFunctionExpr

…behavior

…egate.rs

…gs documentation

… Hash traits

- Updated the `DummyUdf::new` function to accept a `Signature` parameter for initialization.
- Modified test cases to reflect the new initialization method for `DummyUdf`.
- Improved the `args_schema` method to better manage when to borrow the existing schema or create a new one, ensuring correct correspondence between expressions and schema fields.
- Added comments and examples for improved clarity on the schema handling behavior.

- Introduced a `build_acc_args` method to encapsulate the logic for building `AccumulatorArgs` and executing a closure with them.
- Updated the `create_accumulator`, `create_sliding_accumulator`, `groups_accumulator_supported`, and `create_groups_accumulator` methods to utilize the new method for better readability and maintainability.

…ests

Jefffrey

I must admit I am quite confused at the fix proposed by this PR; it seems there are cases where for literal only inputs to aggregates, schema is empty and this PR fixes it by having a fallback behaviour in the AggregateFunctionExpr to "synthesize" the schema from the input arguments?

If my understanding is correct, my questions/thoughts are:

Is this a band-aid fix? Is there a root cause we should be looking for instead?
There's a heavy emphasis on the word "synthesize" throughout this PR but I don't know what it means to "synthesize" a schema from literal expressions 🤔

datafusion/physical-expr/src/aggregate.rs

Jefffrey · 2025-09-30T08:14:23Z

datafusion/expr/src/udaf.rs

+    /// Example: retrieving metadata and return field for input `i`:
+    /// ```ignore
+    /// let metadata = acc_args.schema.field(i).metadata();
+    /// let field = acc_args.exprs[i].return_field(&acc_args.schema)?;
+    /// ```


I'm having some trouble understanding this example; I can understand the part for getting the metadata of a field given the context of the PR, but why do we also include an example for getting the return field?

The snippet is meant to illustrate the sentence immediately above it: you pair acc_args.exprs with acc_args.schema to recover the full FieldRef for argument i.

Pulling the metadata out of schema.field(i) is one common use case, and the follow-up line shows how you would then obtain the complete FieldRef (name, type, metadata) via:

exprs[i].return_field(&acc_args.schema)

...using the same pairing.

I'll tweak the wording.

> The snippet is meant to illustrate the sentence immediately above it: you pair acc_args.exprs with acc_args.schema to recover the full FieldRef for argument i.

This may be a silly question, but what's the difference between acc_args.exprs[i].return_field(&acc_args.schema) and acc_args.schema.field(i)?

Not at all 😄

acc_args.schema.field(i) — returns the raw Arrow Field from the (physical) input schema at position i (name, type, nullability, metadata exactly as in that schema).

acc_args.exprs[i].return_field(&acc_args.schema)? — asks the expression for the effective FieldRef for argument i given the full schema. It incorporates expression semantics (casts, literals, computed types, extension metadata, nullability changes, etc.) and returns an owned/Arc FieldRef (and can fail), not just a borrowed &Field.
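To make the distinction concrete, here is a minimal, self-contained sketch that uses toy stand-ins rather than the real Arrow/DataFusion types. `Field`, `Schema`, and `Expr` below are hypothetical simplifications, and the way the computed field is named (`a + b`) is an assumption made purely for illustration.

```rust
use std::collections::HashMap;

/// Toy stand-in for an Arrow field (hypothetical, not the arrow-rs type).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: &'static str,
    nullable: bool,
    metadata: HashMap<String, String>,
}

/// Toy stand-in for a schema: an ordered list of fields.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<Field>,
}

impl Schema {
    /// Returns the field stored at position `i`, exactly as it is in the schema.
    fn field(&self, i: usize) -> &Field {
        &self.fields[i]
    }
}

/// Toy stand-in for a physical expression.
enum Expr {
    Column(usize),
    Literal { data_type: &'static str, metadata: HashMap<String, String> },
    Add(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// Asks the expression which field it produces given `schema`.
    /// Unlike `Schema::field`, this can fail and can describe a computed value.
    fn return_field(&self, schema: &Schema) -> Result<Field, String> {
        match self {
            // A column simply forwards the schema's field.
            Expr::Column(i) => schema
                .fields
                .get(*i)
                .cloned()
                .ok_or_else(|| format!("column {i} not in schema")),
            // A literal carries its own type and metadata; the schema is not consulted.
            Expr::Literal { data_type, metadata } => Ok(Field {
                name: "lit".to_string(),
                data_type: *data_type,
                nullable: false,
                metadata: metadata.clone(),
            }),
            // A computed expression derives a new field from its children.
            Expr::Add(lhs, rhs) => {
                let l = lhs.return_field(schema)?;
                let r = rhs.return_field(schema)?;
                Ok(Field {
                    name: format!("{} + {}", l.name, r.name),
                    data_type: l.data_type,
                    nullable: l.nullable || r.nullable,
                    metadata: HashMap::new(),
                })
            }
        }
    }
}
```

In this toy model, with a schema holding `a` and `b`, `schema.field(0)` is just `a`, while `Expr::Add(Column(0), Column(1)).return_field(&schema)` describes the computed argument. A literal's `return_field` succeeds even against an empty schema, which is the literal-only situation from #16997.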

datafusion/physical-expr/src/aggregate.rs

kosiew · 2025-10-01T14:06:56Z

> Is this a band-aid fix? Is there a root cause we should be looking for instead?
> There's a heavy emphasis on the word "synthesize" throughout this PR but I don't know what it means to "synthesize" a schema from literal expressions 🤔

AggregateExprBuilder already captures a FieldRef for every argument (including literals) by calling each physical expression’s return_field during construction, so we retain the full Arrow metadata for those inputs in input_fields.

The new args_schema helper detects when the physical input schema is empty—something that legitimately happens when an aggregate is invoked with literals only because the child plan has no columns—and in that case reconstitutes a Schema from the stored input_fields so the accumulator can still see that metadata.

We then hand that schema to every AccumulatorArgs we build, so UDAFs observe the same field information whether their inputs were columns or literals. In other words, “synthesize” means “wrap the already-computed argument fields in a temporary Schema when the physical schema is empty”; there isn’t another layer hiding the real root cause.
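As a rough illustration of that fallback, the sketch below models the borrow-or-synthesize decision with `std::borrow::Cow`. The `Field`, `Schema`, and `AggregateFunctionExpr` types here are hypothetical toy stand-ins, not the real DataFusion definitions, and the `physical_schema` field name is an assumption; only the names `args_schema` and `input_fields` come from the PR.

```rust
use std::borrow::Cow;

/// Toy stand-in for an Arrow field (hypothetical).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    metadata: Vec<(String, String)>,
}

/// Toy stand-in for a schema (hypothetical).
#[derive(Debug, Clone, PartialEq, Default)]
struct Schema {
    fields: Vec<Field>,
}

/// Toy stand-in for the aggregate expression (hypothetical layout).
struct AggregateFunctionExpr {
    /// Schema of the child plan; empty when every argument is a literal.
    physical_schema: Schema,
    /// One field per argument, captured when the expression was built.
    input_fields: Vec<Field>,
}

impl AggregateFunctionExpr {
    /// Borrow the physical schema when it has columns; otherwise wrap the
    /// already-computed argument fields in a temporary, owned schema.
    fn args_schema(&self) -> Cow<'_, Schema> {
        if self.physical_schema.fields.is_empty() {
            Cow::Owned(Schema { fields: self.input_fields.clone() })
        } else {
            Cow::Borrowed(&self.physical_schema)
        }
    }
}
```

The `Cow` return type means the common case (a non-empty physical schema) costs no allocation, while the literal-only case pays for one clone of the stored fields.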

…ateUDFImpl

Jefffrey · 2025-10-08T04:26:17Z

> Not at all 😄
>
> * `acc_args.schema.field(i)`: returns the raw Arrow `Field` from the (physical) input schema at position `i` (name, type, nullability, metadata exactly as in that schema).
>
> * `acc_args.exprs[i].return_field(&acc_args.schema)?`: asks the expression for the effective `FieldRef` for argument `i` given the full schema. It incorporates expression semantics (casts, literals, computed types, extension metadata, nullability changes, etc.) and returns an owned/Arc `FieldRef` (and can fail), not just a borrowed `&Field`.

Thank you for the explanations, I'm still trying to wrap my head around all the parts involved here 😅

So I think my main confusion lies around the difference between the physical input schema, and the effective FieldRef argument; is there a reason we provide the ability to access both? This fix only synthesizes a schema if the physical schema is missing as you mentioned, but would it be incorrect behaviour to instead always synthesize the schema from the physical schema (whether present or not)?

If we look at scalar & window functions, I don't see them having equivalent logic around providing direct access to the physical schema, instead they provide methods to directly access Fields:

I'm trying to understand why AccumulatorArgs seems to be the odd one out here; I'm sure there's some historical reason but the limited existing documentation on AccumulatorArgs makes it hard for me to reason that this fix is the correct approach 🤔

kosiew · 2025-10-10T09:47:47Z

@Jefffrey

Why `AccumulatorArgs` Differs from Other Function Args

Your review raises an excellent question about API consistency. The answer lies in when and how these functions resolve their input fields.

Scalar & Window Functions: Pre-computed Fields

Both scalar and window functions receive pre-computed FieldRefs at creation time:

// From create_udwf_window_expr (windows/mod.rs:178-180)
let input_fields: Vec<_> = args
    .iter()
    .map(|arg| arg.return_field(input_schema)) // Computed once at planning
    .collect::<Result<_>>()?;

These are then passed to ExpressionArgs, PartitionEvaluatorArgs, or ScalarFunctionArgs.
The schema itself is never exposed because the fields have already been computed.

Aggregate Functions: Runtime Schema Access

Aggregates, by contrast, receive the physical schema and compute fields on demand:

// From AggregateFunctionExpr::create_accumulator (aggregate.rs:429-432)
let schema = self.args_schema();
let acc_args = self.make_acc_args(schema.as_ref()); // Schema passed here
self.fun.accumulator(acc_args)

Why Can't Aggregates Use Pre-computed Fields?

The expressions need schema access at runtime. Consider:

SUM(a + b)

Expression and Physical Expression Resolution

The expression a + b references two columns from the physical schema, but produces one logical argument for the accumulator.

The PhysicalExpr for a + b needs to:

1. Resolve columns a and b from the physical schema
2. Evaluate the addition
3. Pass the result to the accumulator

If we only passed pre-computed fields (like scalar/window functions do), the PhysicalExpr couldn’t resolve its column references — it needs the full Schema.

What the Patch Actually Changes

The patch doesn't change runtime behavior — it:

1. Clarifies documentation: Makes explicit that schema is the physical input schema
2. Adds convenience methods: input_field() and input_fields() wrap the common pattern of calling expr.return_field(&schema)
3. Aligns ergonomics: Matches the convenience of ScalarFunctionArgs.arg_fields and PartitionEvaluatorArgs.input_fields(), without losing the schema access that aggregate expressions require

Comparison Table

| Aspect | Scalar / Window | Accumulator (Before) | Accumulator (After) |
| --- | --- | --- | --- |
| Fields | Pre-computed at planning | Computed on-demand | Computed on-demand |
| Schema | Hidden (not needed) | Exposed (confusing) | Exposed (documented) |
| Field access | Direct: `args.arg_fields[i]` | Manual: `exprs[i].return_field(schema)?` | Helper: `input_field(i)?` |
| Why different? | Simple 1:1 args | Expressions need schema for column resolution | Same (clarified) |

This PR clarifies intent and aligns ergonomics where possible, while preserving the functionality that makes aggregates work correctly.

…d input fields

Jefffrey · 2025-10-10T12:14:59Z

Sorry, this doesn't clear it up for me; the example of SUM(a + b) needing the physical schema because it must resolve both a and b doesn't make sense to me, given that scalar and window functions can similarly accept multiple columns like that (e.g. SIN(a + b)). I'll see if I can dig deeper into the related code to try to bring my own understanding into this 🤔

Expand documentation to explain the relationship between schemas, unevaluated argument expressions, and their differences from scalar and window function argument handling. Address the specific case of SUM(a + b) vs SIN(a + b) for better understanding.

kosiew · 2025-10-17T03:52:29Z

Closing this.
#18100 has a better approach.

…ields (#18100)

## Which issue does this PR close?

- Closes #16997
- Part of #11725
- Supersedes #17085

## Rationale for this change

When reviewing #17085 I was very confused by the fix suggested, and tried to understand why `AccumulatorArgs` didn't have easy access to `Field`s of its input expressions, as compared to scalar/window functions which do. Introducing this new field should make it easier for users to grab datatype, metadata, nullability of their input expressions for aggregate functions.

## What changes are included in this PR?

Add a slice of `FieldRef` to `AccumulatorArgs` so users don't need to compute the input expression fields themselves via the schema.

This addresses #16997 as it was confusing to have only the schema available as there are valid (?) cases where the schema is empty (such as literal only input).

This fix differs from #17085 in that it doesn't special case for when there is literal only input; it leaves the physical `schema` provided to `AccumulatorArgs` untouched but provides a more ergonomic (and less confusing) API for users to retrieve `Field`s of their input arguments.

- I'm still not sure if the schema being empty for literal only inputs is correct or not, so this might be considered a side step. If we could remove `schema` entirely from `AccumulatorArgs` maybe we wouldn't need to worry about this, but see my comment for why that wasn't done in this PR

## Are these changes tested?

Existing unit tests.

## Are there any user-facing changes?

Yes, new field to `AccumulatorArgs` which is publicly exposed (with all its fields).
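A minimal sketch of the idea behind #18100, using toy types rather than the real DataFusion API: the physical `schema` is passed through untouched (and may be empty), while a slice of pre-computed argument fields travels alongside it. The field name `expr_fields` is taken from the #18100 title; `Field`, `Schema`, the struct layout, and the `arg_metadata` helper are hypothetical and exist only for illustration.

```rust
/// Toy stand-in for an Arrow field (hypothetical).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    metadata: Vec<(String, String)>,
}

/// Toy stand-in for a schema (hypothetical).
#[derive(Debug, Default)]
struct Schema {
    fields: Vec<Field>,
}

/// Toy stand-in for the accumulator arguments (hypothetical layout).
struct AccumulatorArgs<'a> {
    /// Physical input schema, left as-is; empty for literal-only input.
    schema: &'a Schema,
    /// One pre-computed field per argument expression.
    expr_fields: &'a [Field],
}

/// Hypothetical helper: look up a metadata value for argument `i`
/// without touching the schema at all.
fn arg_metadata<'a>(args: &AccumulatorArgs<'a>, i: usize, key: &str) -> Option<&'a str> {
    args.expr_fields
        .get(i)?
        .metadata
        .iter()
        .find(|(k, _)| k == key)
        .map(|(_, v)| v.as_str())
}
```

Because the lookup goes through `expr_fields`, it behaves the same whether the arguments were columns or literals, so no special case for an empty schema is needed in this model.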


Merge branch 'main' into udaf-schema-16997

1d31dfe

github-actions bot added the logical-expr, physical-expr, and core labels on Aug 8, 2025

kosiew added 7 commits August 12, 2025 12:39

Merge branch 'main' into udaf-schema-16997

66ee0c5

feat: Implement SchemaBasedAggregateUdf and enhance accumulator argument handling

15e89aa

refactor: Rename test function for schema-based aggregate UDF metadata

fea8c1c

refactor: Extract AccumulatorArgs construction into a separate method in AggregateFunctionExpr

c8678e5

test: Add unit tests for AggregateUDF implementation and args_schema behavior

25df8c6

refactor: Consolidate use statements for improved readability in aggregate.rs

2991acc

kosiew force-pushed the udaf-schema-16997 branch from fb6c9a8 to 2991acc on August 12, 2025 07:50

kosiew marked this pull request as ready for review August 12, 2025 07:54

kosiew added 14 commits August 12, 2025 16:17

refactor(tests): Simplify argument passing in AggregateExprBuilder tests

664367b

docs: Mark code examples as ignored in AggregateUDF and AccumulatorArgs documentation

20d7a93

Merge branch 'main' into udaf-schema-16997

99824bd

Enhance DummyUdf struct by deriving PartialEq, Eq, and Hash traits

8692423

Enhance SchemaBasedAggregateUdf struct by deriving PartialEq, Eq, and Hash traits

f2a2d51

Merge branch 'main' into udaf-schema-16997

f4484be

Merge branch 'main' into udaf-schema-16997

35014e0

refactor(tests): reorganize and enhance DummyUdf implementation and tests

03579a1

Merge branch 'main' into udaf-schema-16997

7175121

Revert to last good point

4dd8bb2

Merge branch 'main' into udaf-schema-16997

fee3fe9

Merge branch 'main' into udaf-schema-16997

b5a931b

Jefffrey reviewed Sep 30, 2025

docs(udaf): improve documentation for AccumulatorArgs usage in AggregateUDFImpl

f6bb7ec

kosiew added 2 commits October 1, 2025 22:12

refactor(tests): move tests to bottom

f5a36ee

Merge branch 'main' into udaf-schema-16997

07f6fab

kosiew force-pushed the udaf-schema-16997 branch from 1583191 to 07f6fab on October 1, 2025 14:14

docs(udaf, accumulator): enhance documentation for AccumulatorArgs and input fields

34b2d83

Jefffrey mentioned this pull request Oct 16, 2025

Introduce expr_fields to AccumulatorArgs to hold input argument fields #18100

Merged

kosiew closed this Oct 17, 2025
