
Conversation

@adriangb (Contributor) commented Nov 11, 2025

This moves ownership of projections from FileScanConfig into FileSource.
Notably, we do not do anything special with this in Parquet just yet: I leave it for a follow-up to actually use the projection expressions instead of column indices, e.g. to generate the Parquet ProjectionMask directly from expressions (in particular to select leaves instead of roots for struct and variant access).
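
As a rough sketch of where this lands (the method names are taken from the review discussion below; signatures, the `ProjectionExprs` argument type, and import paths are my approximation, not the exact merged API):

```rust
use std::sync::Arc;
use datafusion_common::Result;
use datafusion_physical_expr::projection::ProjectionExprs;

// Approximate shape only: projection() and try_pushdown_projection() are
// referenced later in this thread; exact signatures may differ.
pub trait FileSource {
    /// The projection this source currently owns, if any.
    fn projection(&self) -> Option<&ProjectionExprs>;

    /// Try to absorb a projection into this source. Ok(None) means the
    /// source declined and the projection must stay outside the scan.
    fn try_pushdown_projection(
        &self,
        projection: &ProjectionExprs,
    ) -> Result<Option<Arc<dyn FileSource>>>;

    // ... opener creation, metrics, etc. unchanged
}
```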

github-actions bot added the physical-expr, core, substrait, proto, and datasource labels Nov 11, 2025
Comment on lines +70 to +73
let scan_config =
    FileScanConfigBuilder::new(ObjectStoreUrl::local_filesystem(), source)
Contributor

❤️

Contributor

Yeah, I agree it is really nice to not have this strange circular dependency between the source and the config.

@adriangb (Contributor Author) commented Nov 11, 2025

This is the tracer bullet I've been using to guide this change:

use arrow::util::pretty::pretty_format_batches;
use arrow_schema::DataType;
use datafusion::{
    common::Result,
    prelude::{ParquetReadOptions, SessionContext},
};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    let df = ctx
        .read_parquet(
            "data/",
            ParquetReadOptions::default()
                .table_partition_cols(vec![("b".to_string(), DataType::UInt32)]),
        )
        .await?;
    let res = df
        .select_exprs(&["b", "a", "a = b", "a * 2"])?
        .collect()
        .await?;

    println!("Result:\n{}", pretty_format_batches(&res)?);

    Ok(())
}

I plan to do a review and cleanup, handle the protobuf stuff (I know that needs to be changed) and then fix all the rest of the tests. It looks like we have 36 failing tests, not too bad, and a lot of them look related.

@adriangb (Contributor Author)

Down to 4 failing tests! All proto-related.

Comment on lines 122 to 123
// No projection - use the full file schema
Arc::clone(self.table_schema.file_schema())
Contributor Author

TODO: do we need to dynamically generate a "full" projection that includes table partition columns? Or should that be the default that each FileSource initializes itself with (then we don't even need to check here!)?

adriangb marked this pull request as ready for review November 12, 2025 23:27
@adriangb (Contributor Author) commented Nov 12, 2025

I'm marking this as ready for review. I expect more changes and cleanup will be needed (I already have some ideas), but I'd like some initial feedback. cc @AdamGS @XiangpengHao @waynexia since you've all expressed interest.

github-actions bot added the sqllogictest label Nov 13, 2025
adriangb added a commit to pydantic/datafusion that referenced this pull request Nov 14, 2025
This PR adds trait implementations, a project_batch() method, and fixes
a bug in update_expr() for literal expressions. Also adds comprehensive tests.

Part of apache#18627
adriangb added a commit to pydantic/datafusion that referenced this pull request Nov 14, 2025
This commit consolidates the separate ArrowFileSource and ArrowStreamFileSource
implementations into a unified ArrowSource with an ArrowFormat enum.

Key changes:
- Removed ArrowFileSource and ArrowStreamFileSource structs
- Added ArrowFormat enum (File, Stream) to distinguish between formats
- Created unified ArrowSource struct that uses ArrowFormat to dispatch
- Kept separate ArrowFileOpener and ArrowStreamFileOpener implementations
- Consolidated all FileSource trait implementations in ArrowSource
- Format-specific behavior in repartitioned() method (Stream returns None)

This consolidation reduces code duplication while maintaining clear separation
of concerns between the file and stream format handling.

Part of apache#18627

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
adriangb added a commit to pydantic/datafusion that referenced this pull request Nov 14, 2025
This commit moves statistics handling from individual FileSource implementations
into FileScanConfig, simplifying the FileSource trait.

Changes:
- Remove statistics() and with_statistics() methods from FileSource trait
- Remove with_projection() method from FileSource trait (statistics PR only)
- Add statistics field to FileScanConfig struct
- Add statistics() method to FileScanConfig to retrieve statistics
- Update FileScanConfigBuilder to properly handle statistics
- Remove projected_statistics field from all FileSource implementations:
  - ParquetSource
  - CsvSource
  - JsonSource
  - AvroSource
  - ArrowFileSource and ArrowStreamFileSource
  - MockSource (test utility)
- Update test utilities and assertions to use config.statistics() instead of file_source.statistics()
- Update proto serialization to use config.statistics()

Part of apache#18627

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
github-actions bot added the physical-expr label Nov 19, 2025
github-merge-queue bot pushed a commit that referenced this pull request Nov 20, 2025
## Summary

This PR moves statistics handling from individual `FileSource`
implementations into `FileScanConfig`, simplifying the `FileSource`
trait interface.
The `FileSource`s were all acting as a container for the statistics but
never actually using them.
Since `FileScanConfig` deals with file-level concerns (which the
statistics are), it is better equipped to own them.

### Changes

- **FileSource trait simplification**: Removed `statistics()`,
`with_statistics()`, and `with_projection()` methods
- **FileScanConfig enhancement**: Added `statistics` field and
`statistics()` method
- **FileSource implementations updated**: Removed `projected_statistics`
field from all implementations:
  - ParquetSource
  - CsvSource  
  - JsonSource
  - AvroSource
  - ArrowFileSource and ArrowStreamFileSource
  - MockSource (test utility)
- **Test utilities**: Updated assertions to use `config.statistics()`
instead of `file_source.statistics()`
- **Proto serialization**: Updated to use `config.statistics()`

### Benefits

1. **Simpler trait interface**: `FileSource` implementations no longer
need to manage statistics
2. **Centralized statistics**: All statistics are now managed
consistently in `FileScanConfig`
3. **Cleaner API**: Statistics lifecycle is clearer and less error-prone
4. **Reduced code duplication**: Removes ~140 lines of boilerplate
across implementations

### Related

This is part of the projection refactoring work in #18627. This PR
extracts just the statistics-related changes to make review easier. The
full projection refactoring will come in subsequent PRs.

## Test plan

- [x] All modified file source implementations compile
- [x] Test utilities updated and compile 
- [x] CI tests pass (will verify after PR creation)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Martin Grigorov <[email protected]>
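
In caller terms the change is small; a minimal sketch (`scan_statistics` is a hypothetical helper, and the return type of `statistics()` is approximated, hence the defensive `clone()`):

```rust
use datafusion::datasource::physical_plan::FileScanConfig;
use datafusion_common::Statistics;

fn scan_statistics(config: &FileScanConfig) -> Statistics {
    // Before this PR: statistics lived on the source,
    //   config.file_source().statistics()
    // After: the scan config owns them directly. Clone to stay agnostic
    // about whether statistics() hands back a reference or a value.
    config.statistics().clone()
}
```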
github-actions bot added the catalog label Nov 21, 2025
assert_snapshot!(
    plan_string,
    @"AggregateExec: mode=Partial, gby=[id@0 as id, 1 + id@0 as expr], aggr=[COUNT(c)]"
    @"AggregateExec: mode=Partial, gby=[id@0 as id, 1 + id@0 as expr], aggr=[COUNT(c)], ordering_mode=Sorted"
Contributor Author

I'm actually not sure where this is coming from... we need to double-check that it's correct / an improvement

Contributor

The table is created with ORDER BY id, so I think this plan is correct:
https://github.com/apache/datafusion/blob/3c21b546a9acf9922229220d3ceca91a945cbf46/datafusion/core/tests/physical_optimizer/partition_statistics.rs#L89-L88

(I don't really know why it started appearing either)

Comment on lines +219 to +224
// Preserve projection from the original file source
if let Some(projection) = conf.file_source.projection() {
    if let Some(new_source) = source.try_pushdown_projection(projection)? {
        source = new_source;
    }
}
Contributor Author

I want to double-check this bit of code. Do we also need to preserve the filters? What is the original source, and why are we recreating it?

Contributor

I think @corasaurus-hex was recently working in this area, perhaps they can help review this change

Comment on lines 54 to 61
/// `FileSource` for Arrow IPC file format. Supports range-based parallel reading.
#[derive(Clone)]
pub(crate) struct ArrowFileSource {
    table_schema: TableSchema,
    metrics: ExecutionPlanMetricsSet,
    projected_statistics: Option<Statistics>,
    schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>,
}
Contributor Author

The idea here was that instead of having a dedicated FileSource for each of Arrow File / Arrow Stream, we can have a unified FileSource and two different openers. It's less code and complexity.
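
Roughly, the consolidated shape looks like this (an abridged sketch; the field list and the `supports_repartition` helper are illustrative, not the actual code):

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
pub(crate) enum ArrowFormat {
    /// Arrow IPC file format: supports range-based parallel reading.
    File,
    /// Arrow IPC stream format: must be read sequentially.
    Stream,
}

#[derive(Clone)]
pub(crate) struct ArrowSource {
    format: ArrowFormat,
    // ... shared fields (table schema, metrics, schema adapter factory)
}

impl ArrowSource {
    /// Illustrative helper: repartitioned() declines for the Stream format
    /// (it cannot be split into byte ranges) and falls back to the default
    /// range-splitting logic for File.
    fn supports_repartition(&self) -> bool {
        matches!(self.format, ArrowFormat::File)
    }
}
```

The two openers (ArrowFileOpener / ArrowStreamFileOpener) stay separate because their implementations differ significantly; only the FileSource plumbing is unified.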

Comment on lines 152 to 158
async fn create_physical_plan(
    &self,
    _state: &dyn Session,
    conf: FileScanConfig,
) -> Result<Arc<dyn ExecutionPlan>> {
    let file_schema = Arc::clone(conf.file_schema());
    let config = FileScanConfigBuilder::from(conf)
        .with_source(Arc::new(AvroSource::new(file_schema)))
        .build();
    Ok(DataSourceExec::from_data_source(config))
    Ok(DataSourceExec::from_data_source(conf))
}
Contributor Author

I'm quite perplexed about the purpose of this method. Should it always just return Ok(DataSourceExec::from_data_source(conf))? We already have a FileScanConfig...

Comment on lines 69 to 72
/// Initialize new instance with projection information
fn with_projection(&self, config: &FileScanConfig) -> Arc<dyn FileSource>;
/// Initialize new instance with projected statistics
fn with_statistics(&self, statistics: Statistics) -> Arc<dyn FileSource>;
Contributor Author

with_projection becomes try_pushdown_projection, and statistics handling gets moved into FileScanConfig

    metrics: &ExecutionPlanMetricsSet,
) -> Result<Self> {
    let projected_schema = config.projected_schema();
    let pc_projector = PartitionColumnProjector::new(
Contributor Author

We completely nuke partition value handling here in favor of having the projection handle it.

///
/// # Errors
/// Returns an error if projection pushdown fails or if schema operations fail.
pub fn build(self) -> Result<FileScanConfig> {
Contributor Author

Converting this to return Result to properly handle the case where a FileSource cannot accept any projections. Previously this would have been a silent bug, but I think it was just never hit because all of our FileSources accepted projections.
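
A minimal sketch of the call-site impact, mirroring the builder usage quoted near the top of this thread (the `?` is the new part):

```rust
// build() is now fallible: a FileSource that cannot accept the configured
// projection surfaces an error here instead of it being silently dropped.
let scan_config =
    FileScanConfigBuilder::new(ObjectStoreUrl::local_filesystem(), source)
        .build()?;
```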

}

fn with_projection(&self, _config: &FileScanConfig) -> Arc<dyn FileSource> {
    Arc::new(Self { ..self.clone() })
Contributor Author

It's unclear to me if this contains bugs or not. It seems like the projection was silently dropped on the floor...

logan-keede pushed a commit to logan-keede/datafusion that referenced this pull request Nov 23, 2025
## Summary

This PR consolidates the separate `ArrowFileSource` and
`ArrowStreamFileSource` implementations into a unified `ArrowSource`
with an `ArrowFormat` enum.

This is part of the larger projection refactoring effort tracked in
apache#18627.

## Key Changes

- **Removed separate structs**: Eliminated duplicate `ArrowFileSource`
and `ArrowStreamFileSource` implementations
- **Added `ArrowFormat` enum**: Simple enum with `File` and `Stream`
variants to distinguish between Arrow IPC formats
- **Unified `ArrowSource` struct**: Single struct that uses
`ArrowFormat` to dispatch to appropriate opener
- **Kept separate openers**: `ArrowFileOpener` and
`ArrowStreamFileOpener` remain distinct as their implementations differ
significantly
- **Format-specific behavior**: `repartitioned()` method returns `None`
for Stream format (doesn't support parallel reading) and delegates to
default logic for File format

## Benefits

- **Reduced code duplication**: ~144 net lines removed
- **Clearer architecture**: Single source of truth for Arrow file
handling
- **Maintained separation**: Format-specific logic remains in separate
openers
- **No behavior changes**: All existing tests pass without modification

## Testing

- All existing tests pass
- No changes to test files needed
- Both file and stream formats work correctly

## Related Work

This PR is independent and can be merged before or after:
- PR 1: Move Statistics Handling (if created)
- PR 3: Enhance Physical-Expr Projection Handling (if created)

Part of apache#18627

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude <[email protected]>
logan-keede pushed a commit to logan-keede/datafusion that referenced this pull request Nov 23, 2025
## Summary

This PR enhances the physical-expr projection handling with several
improvements needed for better projection management in datasources.

## Changes

1. **Add trait implementations**:
   - Added `PartialEq` and `Eq` for `ProjectionExpr`
   - Added `PartialEq` and `Eq` for `ProjectionExprs`

2. **Add `project_batch()` method**:
   - Efficiently projects `RecordBatch` with pre-computed schema
   - Handles empty projections correctly
   - Reduces schema projection overhead for repeated calls

3. **Fix `update_expr()` bug**:
- **Bug**: Previously returned `None` for literal expressions (no column
references)
- **Fix**: Now returns `Some(expr)` for both `Unchanged` and
`RewrittenValid` states
- **Impact**: Critical for queries like `SELECT 1 FROM table` where no
file columns are needed

4. **Change `from_indices()` signature**:
   - Changed from `&SchemaRef` to `&Schema` for consistency

5. **Add comprehensive tests**:
- `test_merge_empty_projection_with_literal()` - Reproduces roundtrip
issue
   - `test_update_expr_with_literal()` - Tests literal handling
- `test_update_expr_with_complex_literal_expr()` - Tests mixed
expressions

## Part of

This PR is part of apache#18627 - a larger effort to refactor projection
handling in DataFusion.

## Testing

All tests pass:
- ✅ New projection tests
- ✅ Existing physical-expr test suite
- ✅ Doc tests

## AI use

I asked Claude to extract this change from apache#18627

---------

Co-authored-by: Jeffrey Vo <[email protected]>
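
For context on the `update_expr()` fix above, here is a self-contained paraphrase (the variant names `Unchanged` and `RewrittenValid` come from the PR description; the enum, the `RewrittenInvalid` variant, and `keep_expr` are illustrative, not the actual DataFusion code):

```rust
#[derive(Debug, PartialEq)]
enum RewriteState {
    /// No column references were touched (e.g. a pure literal like `1`).
    Unchanged,
    /// All column references were remapped successfully.
    RewrittenValid,
    /// A referenced column is not available after the projection.
    RewrittenInvalid,
}

fn keep_expr<T>(state: RewriteState, expr: T) -> Option<T> {
    match state {
        // The fix: Unchanged now also yields Some(expr), so a query like
        // `SELECT 1 FROM table` (no file columns needed) keeps its literal
        // projection instead of getting None back.
        RewriteState::Unchanged | RewriteState::RewrittenValid => Some(expr),
        RewriteState::RewrittenInvalid => None,
    }
}
```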
@AdamGS (Contributor) commented Nov 25, 2025

Looks really awesome! I hope to read through it this week or early next week.

alamb self-assigned this Nov 25, 2025
alamb added the api change label Nov 25, 2025
@alamb (Contributor) left a comment

Thank you @adriangb -- I went through this PR carefully and I think it could be merged as is. It is large enough that I am not sure I would be able to catch all potential issues, but I did read all the changes and they made sense.

I marked this PR as an API change

I think the code looks really nice to me, though I left some small suggestions

Overall, the biggest thing I think it needs is 🥁 more documentation! Specifically:

  1. A note in upgrading.md to help people upgrade -- specifically that with_projection_indices now returns a Result, and any guidance on how people will have to adjust their custom file sources
  2. Documentation on FileSource::try_pushdown_projection with expectations (e.g. that the file source needs to handle it internally)
  3. Maybe try to clarify / document more where the partition column handling is happening now

I think it would also be very nice to wait for @AdamGS 's comments too


let opener = config.create_file_opener(object_store, &scan_config, 0);
let opener =
    scan_config
        .file_source()
Contributor

This one is still kind of weird -- looks like we might be able to replace it entirely with:

fn open(
    &self,
    partition: usize,
    context: Arc<TaskContext>,
) -> Result<SendableRecordBatchStream> {
    let object_store = context.runtime_env().object_store(&self.object_store_url)?;
    let batch_size = self
        .batch_size
        .unwrap_or_else(|| context.session_config().batch_size());
    let source = self.file_source.with_batch_size(batch_size);
    let opener = source.create_file_opener(object_store, self, partition)?;
    let stream = FileStream::new(self, partition, opener, source.metrics())?;
    Ok(Box::pin(cooperative(stream)))
}

But then we would have to change the example to set up the runtime env

Contributor Author

Yeah I'll leave it untouched for now.


ProjectionExec: expr=[b@1 as new_b, c@2 + e@4 as binary, b@1 as newest_b]
DataSourceExec: file_groups={1 group: [[x]]}, projection=[a, b, c, d, e], file_type=csv, has_header=false
"
@"DataSourceExec: file_groups={1 group: [[x]]}, projection=[b@1 as new_b, c@2 + e@4 as binary, b@1 as newest_b], file_type=csv, has_header=false"
Contributor

it is pretty neat to see the expressions pushed down as projections here


/// dictionaries. Indeed, the partition columns are constant, so the dictionaries that represent them
/// have all their keys equal to 0. This enables us to re-use the same "all-zero" buffer across batches,
/// which makes the space consumption of the partition columns O(batch_size) instead of O(record_count).
pub struct PartitionColumnProjector {
Contributor

This might be a good thing to document in the upgrade guide -- what to do in this case

Contributor

This code also seems to have a bunch of optimizations -- should we move them to the schema adapter code perhaps?

Contributor Author

We mostly handled the optimizations globally already: #16789
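
To illustrate the trick described in the quoted doc comment (the partition value and row count below are made up, and `constant_partition_column` is a hypothetical helper, not DataFusion API):

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, DictionaryArray, StringArray, UInt16Array};
use arrow::datatypes::UInt16Type;

fn constant_partition_column(
    value: &str,
    num_rows: usize,
) -> DictionaryArray<UInt16Type> {
    // All keys are 0 and point at the single partition value, so the key
    // buffer is O(batch_size) and can be shared across batches instead of
    // materializing the value O(record_count) times.
    let keys = UInt16Array::from(vec![0u16; num_rows]);
    let values: ArrayRef = Arc::new(StringArray::from(vec![value]));
    DictionaryArray::try_new(keys, values).expect("keys are in bounds")
}
```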


[dependencies]
arrow = { workspace = true }
arrow-schema = { workspace = true }
Contributor

Since we already have arrow as a dependency, the arrow-schema dependency seems unnecessary

    _state: &dyn Session,
    conf: FileScanConfig,
) -> Result<Arc<dyn ExecutionPlan>> {
    let table_schema = TableSchema::new(
Contributor

I believe you had some open questions about this -- why was it re-creating an opener?

But given all the tests pass I am going to assume it was some leftover unnecessary code

    .as_any()
    .downcast_ref::<ParquetSource>()
    .cloned()
    .expect("should be a parquet source");
Contributor

it seems like this would be better if it were an internal error rather than a panic

The idea of using the existing source rather than creating a new one makes a lot of sense though

/// 4. Assign final indices: file columns → [0..n), partition columns → [n..)
/// 5. Transform expressions once to remap all column references
///
/// This replaces the previous three-pass approach:
Contributor

I don't think the comment here about old vs new approach is helpful / adds value.

Could you also update this comment to clarify:

  1. What is contained in the file_schema
  2. What assumptions are made

Specifically, I am confused, as this code seems to assume any columns referenced in the projection with indices beyond the file_schema are partition columns. But what about columns that are in the table schema but not in the file schema (aka "schema evolution" columns that will be filled in as nulls)?

Contributor Author

I've added docstrings and updated the parameter name to be logical_file_schema.

physical_plan
01)ProjectionExec: expr=[NULL as log(NULL,aggregate_simple.c2)]
02)--DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/aggregate_simple.csv]]}, file_type=csv, has_header=true
physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/aggregate_simple.csv]]}, projection=[NULL as log(NULL,aggregate_simple.c2)], file_type=csv, has_header=true
Contributor

and there you can see the projection expressions moved into DataSourceExec per the plan 🚀

github-actions bot added the documentation and physical-plan labels Nov 26, 2025
@adriangb (Contributor Author)

@alamb thank you for the review! I think I've handled the most important bits of feedback. I'll leave this open for another day or so for review, and if there are no big-ticket items I'll go for a merge.

@adriangb (Contributor Author)

I'm going to merge this now so we can continue with the next steps. If any reviewers have feedback, feel free to drop it here and we can address it in a follow-up PR.

adriangb added this pull request to the merge queue Nov 27, 2025
Merged via the queue into apache:main with commit 9f725d9 Nov 27, 2025
35 checks passed