Skip to content

Conversation

@adriangb
Copy link
Contributor

@adriangb adriangb commented Nov 14, 2025

Summary

This PR enhances the physical-expr projection handling with several improvements needed for better projection management in datasources.

Changes

  1. Add trait implementations:

    • Added PartialEq and Eq for ProjectionExpr
    • Added PartialEq and Eq for ProjectionExprs
  2. Add project_batch() method:

    • Efficiently projects RecordBatch with pre-computed schema
    • Handles empty projections correctly
    • Reduces schema projection overhead for repeated calls
  3. Fix update_expr() bug:

    • Bug: Previously returned None for literal expressions (no column references)
    • Fix: Now returns Some(expr) for both Unchanged and RewrittenValid states
    • Impact: Critical for queries like SELECT 1 FROM table where no file columns are needed
  4. Change from_indices() signature:

    • Changed from &SchemaRef to &Schema for consistency
  5. Add comprehensive tests:

    • test_merge_empty_projection_with_literal() - Reproduces roundtrip issue
    • test_update_expr_with_literal() - Tests literal handling
    • test_update_expr_with_complex_literal_expr() - Tests mixed expressions

Part of

This PR is part of #18627 - a larger effort to refactor projection handling in DataFusion.

Testing

All tests pass:

  • ✅ New projection tests
  • ✅ Existing physical-expr test suite
  • ✅ Doc tests

AI use

I asked Claude to extract this change from #18627

@github-actions github-actions bot added the physical-expr Changes to the physical-expr crates label Nov 14, 2025
@adriangb adriangb changed the title Enhance physical-expr projection handling Misc improvements to ProjectionExprs Nov 14, 2025
@adriangb adriangb force-pushed the projection-source-physexpr branch from 6100daf to 3f0dd4c Compare November 15, 2025 00:46
This PR adds trait implementations, a project_batch() method, and fixes
a bug in update_expr() for literal expressions. Also adds comprehensive tests.

Part of apache#18627
///
/// This function accepts a pre-computed output schema instead of calling [`ProjectionExprs::project_schema`]
/// so that repeated calls do not have schema projection overhead.
pub fn project_batch(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a better API than this would be make_projector() -> Projector where Projector holds a reference the output schema.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the method be used in #18627?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! I'm just thinking it would be less error prone to package it up in a struct. I'll push the change here then we can rebase #18627 to use the better version once this is merged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is that users don't have to track output_schema and pass it in, they can just keep track of a Projector

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented my idea in f6afd71 and was able to use it to simplify ProjectionExec a bit so it's already used 😄

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Nov 17, 2025
@adriangb adriangb removed the request for review from Jefffrey November 18, 2025 09:26
@adriangb adriangb added this pull request to the merge queue Nov 18, 2025
Merged via the queue into apache:main with commit b3ff6d8 Nov 18, 2025
32 checks passed
@adriangb adriangb deleted the projection-source-physexpr branch November 18, 2025 13:03
logan-keede pushed a commit to logan-keede/datafusion that referenced this pull request Nov 23, 2025
## Summary

This PR enhances the physical-expr projection handling with several
improvements needed for better projection management in datasources.

## Changes

1. **Add trait implementations**:
   - Added `PartialEq` and `Eq` for `ProjectionExpr`
   - Added `PartialEq` and `Eq` for `ProjectionExprs`

2. **Add `project_batch()` method**:
   - Efficiently projects `RecordBatch` with pre-computed schema
   - Handles empty projections correctly
   - Reduces schema projection overhead for repeated calls

3. **Fix `update_expr()` bug**:
- **Bug**: Previously returned `None` for literal expressions (no column
references)
- **Fix**: Now returns `Some(expr)` for both `Unchanged` and
`RewrittenValid` states
- **Impact**: Critical for queries like `SELECT 1 FROM table` where no
file columns are needed

4. **Change `from_indices()` signature**:
   - Changed from `&SchemaRef` to `&Schema` for consistency

5. **Add comprehensive tests**:
- `test_merge_empty_projection_with_literal()` - Reproduces roundtrip
issue
   - `test_update_expr_with_literal()` - Tests literal handling
- `test_update_expr_with_complex_literal_expr()` - Tests mixed
expressions

## Part of

This PR is part of apache#18627 - a larger effort to refactor projection
handling in DataFusion.

## Testing

All tests pass:
- ✅ New projection tests
- ✅ Existing physical-expr test suite
- ✅ Doc tests

## AI use

I asked Claude to extract this change from apache#18627

---------

Co-authored-by: Jeffrey Vo <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-expr Changes to the physical-expr crates physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants