@corasaurus-hex
Contributor

Which issue does this PR close?

Rationale for this change

Currently DataFusion can only read Arrow files if they're in the File format, not the Stream format. I work with a bunch of Stream format files and wanted native support.

What changes are included in this PR?

To accomplish the above, this PR splits the Arrow datasource into two separate implementations (ArrowStream* and ArrowFile*) with a facade on top to differentiate between the formats at query planning time.
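
As a rough illustration of how the two formats can be told apart, here is a minimal sketch that relies only on the fact that an Arrow IPC file begins with the magic bytes `ARROW1`. The names `ArrowIpcFormat` and `detect_format` are hypothetical and are not taken from this PR. Treating every input without the magic as a stream is an assumption of the sketch, not a description of what the facade actually does.

```rust
use std::io::{self, Read};

/// The two Arrow IPC layouts discussed in this PR (hypothetical enum).
#[derive(Debug, PartialEq, Eq)]
enum ArrowIpcFormat {
    File,
    Stream,
}

/// Magic bytes found at the very start of an Arrow IPC *file*.
const ARROW_FILE_MAGIC: &[u8; 6] = b"ARROW1";

/// Peek at the leading bytes of `reader` and guess the format.
/// Assumption: anything that does not start with the file magic is a stream.
fn detect_format<R: Read>(mut reader: R) -> io::Result<ArrowIpcFormat> {
    let mut prefix = [0u8; 6];
    let mut filled = 0;
    // `read` may return fewer bytes than requested, so loop until the
    // prefix is full or the input ends.
    while filled < prefix.len() {
        let n = reader.read(&mut prefix[filled..])?;
        if n == 0 {
            break;
        }
        filled += n;
    }
    if filled == prefix.len() && &prefix == ARROW_FILE_MAGIC {
        Ok(ArrowIpcFormat::File)
    } else {
        Ok(ArrowIpcFormat::Stream)
    }
}
```

Any `Read` implementation works as input, for example `detect_format(std::fs::File::open(path)?)` for a local file.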

Are these changes tested?

Yes, there are end-to-end sqllogictests along with tests for the changes within datasource-arrow.

Are there any user-facing changes?

Technically yes, in that we support a new format now. I'm not sure which documentation would need to be updated?

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) datasource Changes to the datasource crate labels Nov 3, 2025
// correct offset which is a lot of duplicate I/O. We're opting to avoid
// that entirely by only acting on a single partition and reading sequentially.
Ok(None)
}
Contributor Author

This is perhaps the weightiest decision in this PR. If we want to repartition a file in the IPC stream format, then we need to either read from the beginning of the file for each partition, or find another way to create an ad-hoc equivalent of the IPC file format footer so we can minimize duplicate reads (likely by reading the entire file through once and caching the result in memory for the execution plan to use for each partition).

I'd argue that while this problem is worth solving, doing so is tangential to this change.
I'd like to see this solved, but I see no reason why we couldn't solve it in a follow-on.

Probably worth documenting the practical consequences of leaving it in this state though -- correct me if I'm wrong here, but I think this means that we end up hydrating the entire file into memory for certain operations, right? That's probably not a good long-term state.

Contributor Author

I can't imagine this would mean I need to read the entire file into memory and keep it there? In my previous message I meant we would need to read all the record batch and dictionary locations and keep them in memory in much the same way that the arrow file format footer does. So it would mean a single pass through to record all of that and then multiple threads can seek to different parts of the file and process it.

That's my understanding of the effect of this, that it means we can't parallelize queries against this file format.

If you believe that the resulting behavior would be pathological to the extreme then we should absolutely document that. Thoughts on how we can reliably test that it is? Or who might be aware of the implications of this? And where to document it?

Contributor

I think partitioning is doable, but it would be better done in a follow-up, once someone has a real use case.

In order to repartition, this function has to scan the file once, record the dictionary and batch positions, then split the work evenly among parallel partitioned workers. This scan can run at close to full disk bandwidth (about 5 GB/s on recent MacBooks).
Decoding the batches from an Arrow IPC Stream file into in-memory Arrow RecordBatches is a different matter: if dictionary encoding and heavyweight compression such as zstd are applied, the throughput can be much lower (several hundred MB/s).
So a full scan up front is still worth it to make the overall processing faster through partitioning, though I don't know whether querying large IPC Stream files is a common requirement.
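
The "split the work evenly" step described above can be sketched roughly as follows. This assumes the up-front scan has already produced a list of batch byte ranges; `BatchLocation` and `split_evenly` are hypothetical names, not part of this PR or of the arrow crates. Dictionary positions are left out of the sketch entirely, so a real implementation would also have to make sure each worker sees the dictionaries its batches depend on.

```rust
/// Byte range of one encoded record batch inside a stream file
/// (hypothetical type; the offsets would come from the single scan).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BatchLocation {
    offset: u64,
    len: u64,
}

/// Group batches into at most `target` contiguous partitions of roughly
/// equal encoded size, preserving the original batch order.
fn split_evenly(batches: &[BatchLocation], target: usize) -> Vec<Vec<BatchLocation>> {
    if target == 0 || batches.is_empty() {
        return Vec::new();
    }
    let total: u64 = batches.iter().map(|b| b.len).sum();
    let t = target as u64;
    // Per-partition byte budget, rounded up.
    let budget = (total + t - 1) / t;

    let mut partitions: Vec<Vec<BatchLocation>> = Vec::new();
    let mut current: Vec<BatchLocation> = Vec::new();
    let mut current_bytes = 0u64;

    for batch in batches {
        current.push(*batch);
        current_bytes += batch.len;
        // Close the partition once it reaches the budget, but keep the
        // final slot open so no more than `target` partitions are produced.
        if current_bytes >= budget && partitions.len() < target - 1 {
            partitions.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
    }
    if !current.is_empty() {
        partitions.push(current);
    }
    partitions
}
```

Keeping each partition contiguous means every worker can seek once to its first batch and then read sequentially, which is the property the footer of the file format provides for free.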

@jdcasale left a comment

I think this is basically right. Couple of nits, one question.

"Unexpected end of byte stream for Arrow IPC file".to_string(),
))?;
)
.into());

Why?

Contributor Author

@corasaurus-hex Nov 7, 2025

return Err(...)? is redundant; you really only need either a bare Err(...)? or a return Err(...). A bare Err(...)? looks funny to me, and we still need to convert the ArrowError into a DataFusionError (which ? does for us automatically), so we end up with return Err(...).into()

Contributor Author

@corasaurus-hex Nov 7, 2025

err, return Err(err.into()) in this case
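
For readers following the exchange, here is a small self-contained sketch of the two equivalent spellings. `LowLevelError` and `HighLevelError` are hypothetical stand-ins for ArrowError and DataFusionError, and the `From` impl plays the role of the conversion that `?` applies automatically. It is not the code from this PR.

```rust
/// Stand-in for ArrowError (hypothetical).
#[derive(Debug, PartialEq)]
struct LowLevelError(String);

/// Stand-in for DataFusionError (hypothetical).
#[derive(Debug, PartialEq)]
enum HighLevelError {
    LowLevel(String),
}

impl From<LowLevelError> for HighLevelError {
    fn from(err: LowLevelError) -> Self {
        HighLevelError::LowLevel(err.0)
    }
}

/// Explicit form: convert with `.into()` and return.
fn explicit_return(len: usize) -> Result<usize, HighLevelError> {
    if len == 0 {
        let err = LowLevelError("unexpected end of byte stream".to_string());
        return Err(err.into());
    }
    Ok(len)
}

/// Bare `Err(...)?` form: `?` on an `Err` always returns early and applies
/// the `From` conversion. The turbofish only pins the unused `Ok` type.
fn question_mark(len: usize) -> Result<usize, HighLevelError> {
    if len == 0 {
        Err::<(), _>(LowLevelError("unexpected end of byte stream".to_string()))?;
    }
    Ok(len)
}
```

Both functions return the same converted error for `len == 0`; the explicit form makes the early return and the conversion visible at the call site.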

Development

Successfully merging this pull request may close these issues.

Add support for registering files in the Arrow IPC stream format as tables using register_arrow or similar

5 participants