Skip to content

Conversation

@kosiew
Copy link
Contributor

@kosiew kosiew commented Aug 8, 2025

Which issue does this PR close?

Rationale for this change

Previously, aggregate UDFs could not easily access field-level metadata (e.g., Arrow extension types) when invoked with literal arguments only, because the AccumulatorArgs.schema was always derived from the physical schema — which is empty for literal-only inputs.
This change ensures that in such cases, a schema is synthesized from the literal expressions, preserving metadata and enabling richer accumulator behavior. It also clarifies API documentation for AccumulatorArgs and AggregateUDFImpl.

What changes are included in this PR?

  • Added args_schema() helper to AggregateFunctionExpr to return either the physical input schema or a synthesized schema from literals when the physical schema is empty.
  • Updated create_accumulator, create_sliding_accumulator, groups_accumulator_supported, and create_groups_accumulator to use the new schema logic via make_acc_args().
  • Enhanced AccumulatorArgs and AggregateUDFImpl documentation to explain how to access input field metadata and when synthesized schemas are used.
  • Introduced SchemaBasedAggregateUdf test in user_defined_aggregates.rs to validate metadata handling in literal-only aggregates.
  • Added unit test in aggregate.rs to verify correct schema behavior for both literal-only and physical-schema-present cases.

Are these changes tested?

Yes.

  • New integration test test_schema_based_aggregate_udf_metadata ensures that metadata from literals is accessible in the accumulator.
  • New unit test in aggregate.rs validates that args_schema() returns an owned schema for literal-only inputs and a borrowed schema for non-empty physical schemas.

Are there any user-facing changes?

Yes:

  • Aggregate UDF implementations can now reliably access input field metadata in AccumulatorArgs.schema for literal-only inputs.
  • No breaking API changes; only additional guarantees and improved documentation.

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates core Core DataFusion crate labels Aug 8, 2025
kosiew added 7 commits August 12, 2025 12:39
- Improve documentation for AccumulatorArgs.schema and exprs:
  - Add example showing how to retrieve field metadata and return field.
  - Explain synthesized schema behavior for literal-only inputs.
  - Clarify precedence when inputs are mixed (physical schema metadata wins; synthesized metadata used only when physical schema is empty).
- Update AggregateFunctionExpr::args_schema docs:
  - Explain field order guarantees, synthesized schema usage, and that std::borrow::Cow is used to avoid allocations when possible.
- Add a TODO to factor AccumulatorArgs construction into a private helper.

Documentation-only changes; no behavioral changes.
@kosiew kosiew force-pushed the udaf-schema-16997 branch from fb6c9a8 to 2991acc Compare August 12, 2025 07:50
@kosiew kosiew marked this pull request as ready for review August 12, 2025 07:54
kosiew added 14 commits August 12, 2025 16:17
- Updated the `DummyUdf::new` function to accept a `Signature` parameter for initialization.
- Modified test cases to reflect the new initialization method for `DummyUdf`.
- Improved the `args_schema` method to better manage when to borrow the existing schema or create a new one, ensuring correct correspondence between expressions and schema fields.
- Added comments and examples for improved clarity on the schema handling behavior.
- Introduced a `build_acc_args` method to encapsulate the logic for building `AccumulatorArgs` and executing a closure with them.
- Updated the `create_accumulator`, `create_sliding_accumulator`, `groups_accumulator_supported`, and `create_groups_accumulator` methods to utilize the new method for better readability and maintainability.
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must admit I am quite confused at the fix proposed by this PR; it seems there are cases where for literal only inputs to aggregates, schema is empty and this PR fixes it by having a fallback behaviour in the AggregateFunctionExpr to "synthesize" the schema from the input arguments?

If my understanding is correct, my questions/thoughts are:

  • Is this a band-aid fix? Is there a root cause we should be looking for instead?
  • There's a heavy emphasis on the word "synthesize" throughout this PR but I don't know what it means to "synthesize" a schema from literal expressions 🤔

Comment on lines 768 to 772
/// Example: retrieving metadata and return field for input `i`:
/// ```ignore
/// let metadata = acc_args.schema.field(i).metadata();
/// let field = acc_args.exprs[i].return_field(&acc_args.schema)?;
/// ```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having some trouble understanding this example; I can understand the part for getting the metadata of a field given the context of the PR, but why do we also include an example for getting the return field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The snippet is meant to illustrate the sentence immediately above it: you pair acc_args.exprs with acc_args.schema to recover the full FieldRef for argument i.

Pulling the metadata out of schema.field(i) is one common use case, and the follow-up line shows how you would then obtain the complete FieldRef (name, type, metadata) via:

exprs[i].return_field(&acc_args.schema)

...using the same pairing.

I'll tweak the wording.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The snippet is meant to illustrate the sentence immediately above it: you pair acc_args.exprs with acc_args.schema to recover the full FieldRef for argument i.

This may be a silly question, but what's the difference between acc_args.exprs[i].return_field(&acc_args.schema) and acc_args.schema.field(i)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at all 😄

  • acc_args.schema.field(i) — returns the raw Arrow Field from the (physical) input schema at position i (name, type, nullability, metadata exactly as in that schema).

  • acc_args.exprs[i].return_field(&acc_args.schema)? — asks the expression for the effective FieldRef for argument i given the full schema. It incorporates expression semantics (casts, literals, computed types, extension metadata, nullability changes, etc.) and returns an owned/Arc FieldRef (and can fail), not just a borrowed &Field.

@kosiew
Copy link
Contributor Author

kosiew commented Oct 1, 2025

Is this a band-aid fix? Is there a root cause we should be looking for instead?
There's a heavy emphasis on the word "synthesize" throughout this PR but I don't know what it means to "synthesize" a schema from literal expressions 🤔

AggregateExprBuilder already captures a FieldRef for every argument (including literals) by calling each physical expression’s return_field during construction, so we retain the full Arrow metadata for those inputs in input_fields.

The new args_schema helper detects when the physical input schema is empty—something that legitimately happens when an aggregate is invoked with literals only because the child plan has no columns—and in that case reconstitutes a Schema from the stored input_fields so the accumulator can still see that metadata.

We then hand that schema to every AccumulatorArgs we build, so UDAFs observe the same field information whether their inputs were columns or literals. In other words, “synthesize” means “wrap the already-computed argument fields in a temporary Schema when the physical schema is empty”; there isn’t another layer hiding the real root cause.

@kosiew kosiew force-pushed the udaf-schema-16997 branch from 1583191 to 07f6fab Compare October 1, 2025 14:14
@Jefffrey
Copy link
Contributor

Jefffrey commented Oct 8, 2025

Not at all 😄

* `acc_args.schema.field(i)` — returns the raw Arrow `Field` from the (physical) input schema at position `i` (name, type, nullability, metadata exactly as in that schema).

* `acc_args.exprs[i].return_field(&acc_args.schema)?` — asks the expression for the effective `FieldRef` for argument `i` given the full schema. It incorporates expression semantics (casts, literals, computed types, extension metadata, nullability changes, etc.) and returns an owned/Arc `FieldRef` (and can fail), not just a borrowed `&Field`.

Thank you for the explanations, I'm still trying to wrap my head around all the parts involved here 😅

So I think my main confusion lies around the difference between the physical input schema, and the effective FieldRef argument; is there a reason we provide the ability to access both? This fix only synthesizes a schema if the physical schema is missing as you mentioned, but would it be incorrect behaviour to instead always synthesize the schema from the physical schema (whether present or not)?

If we look at scalar & window functions, I don't see them having equivalent logic around providing direct access to the physical schema, instead they provide methods to directly access Fields:

I'm trying to understand why AccumulatorArgs seems to be the odd one out here; I'm sure there's some historical reason but the limited existing documentation on AccumulatorArgs makes it hard for me to reason that this fix is the correct approach 🤔

@kosiew
Copy link
Contributor Author

kosiew commented Oct 10, 2025

@Jefffrey

Why AccumulatorArgs Differs from Other Function Args

Your review raises an excellent question about API consistency. The answer lies in when and how these functions resolve their input fields.


Scalar & Window Functions: Pre-computed Fields

Both scalar and window functions receive pre-computed FieldRefs at creation time:

// From create_udwf_window_expr (windows/mod.rs:178-180)
let input_fields: Vec<_> = args
    .iter()
    .map(|arg| arg.return_field(input_schema)) // Computed once at planning
    .collect::<Result<_>>()?;

These are then passed to ExpressionArgs, PartitionEvaluatorArgs, or ScalarFunctionArgs.
The schema itself is never exposed because the fields have already been computed.

Aggregate Functions: Runtime Schema Access

Aggregates, by contrast, receive the physical schema and compute fields on demand:

// From AggregateFunctionExpr::create_accumulator (aggregate.rs:429-432)
let schema = self.args_schema();
let acc_args = self.make_acc_args(schema.as_ref()); // Schema passed here
self.fun.accumulator(acc_args)

Why Can't Aggregates Use Pre-computed Fields?

The expressions need schema access at runtime. Consider:

SUM(a + b)

Expression and Physical Expression Resolution

The expression a + b references two columns from the physical schema, but produces one logical argument for the accumulator.

The PhysicalExpr for a + b needs to:

  1. Resolve columns a and b from the physical schema
  2. Evaluate the addition
  3. Pass the result to the accumulator

If we only passed pre-computed fields (like scalar/window functions do), the PhysicalExpr couldn’t resolve its column references — it needs the full Schema.


What the Patch Actually Changes

The patch doesn't change runtime behavior — it:

  1. Clarifies documentation: Makes explicit that schema is the physical input schema
  2. Adds convenience methods: input_field() and input_fields() wrap the common pattern of calling expr.return_field(&schema)
  3. Aligns ergonomics: Matches the convenience of ScalarFunctionArgs.arg_fields and PartitionEvaluatorArgs.input_fields(), without losing the schema access that aggregate expressions require

Comparison Table

Aspect Scalar / Window Accumulator (Before) Accumulator (After)
Fields Pre-computed at planning Computed on-demand Computed on-demand
Schema Hidden (not needed) Exposed (confusing) Exposed (documented)
Field access Direct: args.arg_fields[i] Manual: exprs[i].return_field(schema) ? Helper: input_field(i) ?
Why different? Simple 1:1 args Expressions need schema for column resolution Same (clarified)

This PR clarifies intent and align ergonomics where possible, while preserving the functionality that makes aggregates work correctly.

@Jefffrey
Copy link
Contributor

Sorry this doesn't clear it up for me; the example of SUM(a + b) needing the physical schema because it must resolve both a and b doesn't make sense to me given scalar and window functions can similarly accept multiple columns like that (e.g. SIN(a + b)). I'll see if I can dig deeper into the related code to try bring my own understanding into this 🤔

Expand documentation to explain the relationship between schemas,
unevaluated argument expressions, and their differences from
scalar and window function argument handling. Address the
specific case of SUM(a + b) vs SIN(a + b) for better
understanding.
@kosiew
Copy link
Contributor Author

kosiew commented Oct 17, 2025

Closing this.
#18100 has a better approach.

@kosiew kosiew closed this Oct 17, 2025
github-merge-queue bot pushed a commit that referenced this pull request Oct 30, 2025
…ields (#18100)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes #123` indicates that this PR will close issue #123.
-->

- Closes #16997
- Part of #11725
- Supersedes #17085

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

When reviewing #17085 I was very confused by the fix suggested, and
tried to understand why `AccumulatorArgs` didn't have easy access to
`Field`s of its input expressions, as compared to scalar/window
functions which do. Introducing this new field should make it easier for
users to grab datatype, metadata, nullability of their input expressions
for aggregate functions.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Add a slice of `FieldRef` to `AccumulatorArgs` so users don't need to
compute the input expression fields themselves via using schema. This
addresses #16997 as it was confusing to have only the schema available
as there are valid (?) cases where the schema is empty (such as literal
only input).

This fix differs from #17085 in that it doesn't special case for when
there is literal only input; it leaves the physical `schema` provided to
`AccumulatorArgs` untouched but provides a more ergonomic (and less
confusing) API for users to retrieve `Field`s of their input arguments.

- I'm still not sure if the schema being empty for literal only inputs
is correct or not, so this might be considered a side step. If we could
remove `schema` entirely from `AccumulatorArgs` maybe we wouldn't need
to worry about this, but see my comment for why that wasn't done in this
PR

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Existing unit tests.

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

Yes, new field to `AccumulatorArgs` which is publicly exposed (with all
it's fields).

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
tobixdev pushed a commit to tobixdev/datafusion that referenced this pull request Nov 2, 2025
…ields (apache#18100)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Closes apache#16997
- Part of apache#11725
- Supersedes apache#17085

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

When reviewing apache#17085 I was very confused by the fix suggested, and
tried to understand why `AccumulatorArgs` didn't have easy access to
`Field`s of its input expressions, as compared to scalar/window
functions which do. Introducing this new field should make it easier for
users to grab datatype, metadata, nullability of their input expressions
for aggregate functions.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Add a slice of `FieldRef` to `AccumulatorArgs` so users don't need to
compute the input expression fields themselves via using schema. This
addresses apache#16997 as it was confusing to have only the schema available
as there are valid (?) cases where the schema is empty (such as literal
only input).

This fix differs from apache#17085 in that it doesn't special case for when
there is literal only input; it leaves the physical `schema` provided to
`AccumulatorArgs` untouched but provides a more ergonomic (and less
confusing) API for users to retrieve `Field`s of their input arguments.

- I'm still not sure if the schema being empty for literal only inputs
is correct or not, so this might be considered a side step. If we could
remove `schema` entirely from `AccumulatorArgs` maybe we wouldn't need
to worry about this, but see my comment for why that wasn't done in this
PR

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Existing unit tests.

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

Yes, new field to `AccumulatorArgs` which is publicly exposed (with all
it's fields).

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
codetyri0n pushed a commit to codetyri0n/datafusion that referenced this pull request Nov 11, 2025
…ields (apache#18100)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Closes apache#16997
- Part of apache#11725
- Supersedes apache#17085

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

When reviewing apache#17085 I was very confused by the fix suggested, and
tried to understand why `AccumulatorArgs` didn't have easy access to
`Field`s of its input expressions, as compared to scalar/window
functions which do. Introducing this new field should make it easier for
users to grab datatype, metadata, nullability of their input expressions
for aggregate functions.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Add a slice of `FieldRef` to `AccumulatorArgs` so users don't need to
compute the input expression fields themselves via using schema. This
addresses apache#16997 as it was confusing to have only the schema available
as there are valid (?) cases where the schema is empty (such as literal
only input).

This fix differs from apache#17085 in that it doesn't special case for when
there is literal only input; it leaves the physical `schema` provided to
`AccumulatorArgs` untouched but provides a more ergonomic (and less
confusing) API for users to retrieve `Field`s of their input arguments.

- I'm still not sure if the schema being empty for literal only inputs
is correct or not, so this might be considered a side step. If we could
remove `schema` entirely from `AccumulatorArgs` maybe we wouldn't need
to worry about this, but see my comment for why that wasn't done in this
PR

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Existing unit tests.

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

Yes, new field to `AccumulatorArgs` which is publicly exposed (with all
it's fields).

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AccumulatorArgs.schema is empty when passing in scalar input

2 participants