Feat: Add fetch to CoalescePartitionsExec #14499
| Original file line number | Diff line number | Diff line change |
|---|---|---|
@@ -5032,18 +5032,17 @@ logical_plan | |
| 03)----Aggregate: groupBy=[[aggregate_test_100.c3]], aggr=[[min(aggregate_test_100.c1)]] | ||
| 04)------TableScan: aggregate_test_100 projection=[c1, c3] | ||
| physical_plan | ||
| 01)GlobalLimitExec: skip=0, fetch=5 | ||
Member
It seems that SKIP is missing.
Contributor
It looks like SKIP is not supported by every operator that implements with_fetch. For example, SortPreservingMergeExec also supports fetch but does not support setting skip; its skip defaults to 0.
Contributor
We default to skip=0 for SortPreservingMergeExec:

1 => match self.fetch {
    Some(fetch) => {
        let stream = self.input.execute(0, context)?;
        debug!("Done getting stream for SortPreservingMergeExec::execute with 1 input with {fetch}");
        Ok(Box::pin(LimitStream::new(
            stream,
            0,
            Some(fetch),
            BaselineMetrics::new(&self.metrics, partition),
        )))
    }

impl LimitStream {
    pub fn new(
        input: SendableRecordBatchStream,
        skip: usize,
        fetch: Option<usize>,
        baseline_metrics: BaselineMetrics,
    ) -> Self {
        let schema = input.schema();
        Self {
            skip,
            fetch: fetch.unwrap_or(usize::MAX),
            input: Some(input),
            schema,
            baseline_metrics,
        }
    }
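To make the skip/fetch semantics in the quoted constructor concrete, here is a minimal, self-contained sketch. It is not DataFusion code: `Batch` and `apply_limit` are hypothetical names, and a batch is modeled as a plain vector of rows. It only illustrates how `skip` rows are dropped first, how at most `fetch` rows are then emitted, and how `None` is treated as "no limit" via `unwrap_or(usize::MAX)`.

```rust
/// Simplified stand-in for a record batch (assumption: just a vector of row values).
type Batch = Vec<i32>;

/// Drop the first `skip` rows across all batches, then emit at most `fetch` rows.
/// Hypothetical helper for illustration; not part of DataFusion.
fn apply_limit(batches: Vec<Batch>, skip: usize, fetch: Option<usize>) -> Vec<Batch> {
    let mut to_skip = skip;
    // `None` means "no limit", mirroring `fetch.unwrap_or(usize::MAX)` above.
    let mut remaining = fetch.unwrap_or(usize::MAX);
    let mut out = Vec::new();

    for batch in batches {
        if remaining == 0 {
            // Limit reached: stop pulling from the input.
            break;
        }
        if to_skip >= batch.len() {
            // The whole batch falls inside the skipped prefix.
            to_skip -= batch.len();
            continue;
        }
        // Skip the leading rows of this batch, then take what the limit allows.
        let rows = &batch[to_skip..];
        to_skip = 0;
        let take = rows.len().min(remaining);
        remaining -= take;
        out.push(rows[..take].to_vec());
    }
    out
}
```

With `skip = 0`, as in the SortPreservingMergeExec path quoted above, the function reduces to a plain row cap; a non-zero `skip` is the part that, per this discussion, only the Limit operators handle.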
Author
Yes, as @zhuqi-lucas mentioned, only Limit operators support skip, and limit_pushdown adds a Limit operator when skip is present, so this change only affects plans without skip. Here's the query result with skip:
| 02)--CoalescePartitionsExec | ||
| 03)----AggregateExec: mode=FinalPartitioned, gby=[c3@0 as c3, min(aggregate_test_100.c1)@1 as min(aggregate_test_100.c1)], aggr=[], lim=[5] | ||
| 04)------CoalesceBatchesExec: target_batch_size=8192 | ||
| 05)--------RepartitionExec: partitioning=Hash([c3@0, min(aggregate_test_100.c1)@1], 4), input_partitions=4 | ||
| 06)----------AggregateExec: mode=Partial, gby=[c3@0 as c3, min(aggregate_test_100.c1)@1 as min(aggregate_test_100.c1)], aggr=[], lim=[5] | ||
| 07)------------AggregateExec: mode=FinalPartitioned, gby=[c3@0 as c3], aggr=[min(aggregate_test_100.c1)] | ||
| 08)--------------CoalesceBatchesExec: target_batch_size=8192 | ||
| 09)----------------RepartitionExec: partitioning=Hash([c3@0], 4), input_partitions=4 | ||
| 10)------------------AggregateExec: mode=Partial, gby=[c3@1 as c3], aggr=[min(aggregate_test_100.c1)] | ||
| 11)--------------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1 | ||
| 12)----------------------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1, c3], has_header=true | ||
| 01)CoalescePartitionsExec: fetch=5 | ||
| 02)--AggregateExec: mode=FinalPartitioned, gby=[c3@0 as c3, min(aggregate_test_100.c1)@1 as min(aggregate_test_100.c1)], aggr=[], lim=[5] | ||
| 03)----CoalesceBatchesExec: target_batch_size=8192 | ||
| 04)------RepartitionExec: partitioning=Hash([c3@0, min(aggregate_test_100.c1)@1], 4), input_partitions=4 | ||
| 05)--------AggregateExec: mode=Partial, gby=[c3@0 as c3, min(aggregate_test_100.c1)@1 as min(aggregate_test_100.c1)], aggr=[], lim=[5] | ||
| 06)----------AggregateExec: mode=FinalPartitioned, gby=[c3@0 as c3], aggr=[min(aggregate_test_100.c1)] | ||
| 07)------------CoalesceBatchesExec: target_batch_size=8192 | ||
| 08)--------------RepartitionExec: partitioning=Hash([c3@0], 4), input_partitions=4 | ||
| 09)----------------AggregateExec: mode=Partial, gby=[c3@1 as c3], aggr=[min(aggregate_test_100.c1)] | ||
| 10)------------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1 | ||
| 11)--------------------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1, c3], has_header=true | ||
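The rule described in the review comments (a fetch can be absorbed by an operator that supports it, while a non-zero skip still needs an explicit Limit operator) can be sketched roughly as follows. This is a simplified model under stated assumptions, not the actual limit_pushdown implementation: the `Plan` enum and `push_down_limit` are hypothetical, only one fetch-capable operator is modeled, and the input is left untouched whenever skip is non-zero.

```rust
/// Hypothetical, simplified plan nodes; these are not DataFusion's actual types.
#[derive(Debug, PartialEq)]
enum Plan {
    /// Merges partitions; `fetch` optionally caps the number of rows emitted.
    CoalescePartitions { fetch: Option<usize> },
    /// Explicit limit operator, the only node in this model that understands `skip`.
    GlobalLimit { skip: usize, fetch: Option<usize>, input: Box<Plan> },
}

/// Decide where a `LIMIT fetch OFFSET skip` ends up in this toy model.
fn push_down_limit(skip: usize, fetch: usize, input: Plan) -> Plan {
    match (skip, input) {
        // No skip: the operator absorbs the fetch, keeping the tighter of the
        // existing and the new limit, so no separate limit node is needed.
        (0, Plan::CoalescePartitions { fetch: existing }) => Plan::CoalescePartitions {
            fetch: Some(existing.map_or(fetch, |f| f.min(fetch))),
        },
        // Otherwise keep an explicit limit node on top (simplification: the
        // input is not modified in this sketch).
        (skip, input) => Plan::GlobalLimit {
            skip,
            fetch: Some(fetch),
            input: Box::new(input),
        },
    }
}
```

In this model, `push_down_limit(0, 5, ...)` on a coalescing node yields a single node with `fetch: Some(5)`, which corresponds to the `CoalescePartitionsExec: fetch=5` line replacing `GlobalLimitExec: skip=0, fetch=5` in the plan above.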
Nit:
Maybe we can move the following up and only call limit_reached when