-
Notifications
You must be signed in to change notification settings - Fork 1.8k
fix: Incorrect memory accounting in array_agg function
#16519
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -341,12 +341,20 @@ impl Accumulator for ArrayAggAccumulator { | |||||||||||||||||
| Some(values) => { | ||||||||||||||||||
| // Make sure we don't insert empty lists | ||||||||||||||||||
| if !values.is_empty() { | ||||||||||||||||||
| self.values.push(values); | ||||||||||||||||||
| // The ArrayRef might be holding a reference to its original input buffer, so | ||||||||||||||||||
| // storing it here directly copied/compacted avoids over accounting memory | ||||||||||||||||||
| // not used here. | ||||||||||||||||||
| self.values | ||||||||||||||||||
| .push(make_array(copy_array_data(&values.to_data()))); | ||||||||||||||||||
| } | ||||||||||||||||||
| } | ||||||||||||||||||
| None => { | ||||||||||||||||||
| for arr in list_arr.iter().flatten() { | ||||||||||||||||||
| self.values.push(arr); | ||||||||||||||||||
| // The ArrayRef might be holding a reference to its original input buffer, so | ||||||||||||||||||
| // storing it here directly copied/compacted avoids over accounting memory | ||||||||||||||||||
| // not used here. | ||||||||||||||||||
| self.values | ||||||||||||||||||
| .push(make_array(copy_array_data(&arr.to_data()))); | ||||||||||||||||||
| } | ||||||||||||||||||
| } | ||||||||||||||||||
| } | ||||||||||||||||||
|
|
@@ -728,7 +736,7 @@ impl Accumulator for OrderSensitiveArrayAggAccumulator { | |||||||||||||||||
| mod tests { | ||||||||||||||||||
| use super::*; | ||||||||||||||||||
| use arrow::array::{ListBuilder, StringBuilder}; | ||||||||||||||||||
| use arrow::datatypes::{FieldRef, Schema}; | ||||||||||||||||||
| use arrow::datatypes::{FieldRef, Schema, UInt64Type}; | ||||||||||||||||||
| use datafusion_common::cast::as_generic_string_array; | ||||||||||||||||||
| use datafusion_common::internal_err; | ||||||||||||||||||
| use datafusion_physical_expr::expressions::Column; | ||||||||||||||||||
|
|
@@ -994,6 +1002,34 @@ mod tests { | |||||||||||||||||
| Ok(()) | ||||||||||||||||||
| } | ||||||||||||||||||
|
|
||||||||||||||||||
| #[test] | ||||||||||||||||||
| fn does_not_over_account_memory_for_merge() -> Result<()> { | ||||||||||||||||||
| let (mut acc1, mut acc2) = ArrayAggAccumulatorBuilder::string().build_two()?; | ||||||||||||||||||
|
|
||||||||||||||||||
| let a1 = ListArray::from_iter_primitive::<UInt64Type, _, _>(vec![ | ||||||||||||||||||
| Some(vec![Some(0), Some(1), Some(2)]), | ||||||||||||||||||
| Some(vec![Some(3)]), | ||||||||||||||||||
| None, | ||||||||||||||||||
| Some(vec![Some(4)]), | ||||||||||||||||||
| ]); | ||||||||||||||||||
| let a2 = ListArray::from_iter_primitive::<UInt64Type, _, _>(vec![ | ||||||||||||||||||
| Some(vec![Some(0), Some(1), Some(2)]), | ||||||||||||||||||
| Some(vec![Some(3)]), | ||||||||||||||||||
| None, | ||||||||||||||||||
| Some(vec![Some(4)]), | ||||||||||||||||||
| ]); | ||||||||||||||||||
|
|
||||||||||||||||||
| acc1.merge_batch(&[Arc::new(a1.slice(0, 1))])?; | ||||||||||||||||||
| acc2.merge_batch(&[Arc::new(a2.slice(0, 1))])?; | ||||||||||||||||||
|
|
||||||||||||||||||
| acc1 = merge(acc1, acc2)?; | ||||||||||||||||||
|
Comment on lines
+1022
to
+1025
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The
Suggested change
With this, you would notice that the test result is the same regardless of the changes in this PR
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||||||||||||||||||
|
|
||||||||||||||||||
| // without compaction, the size is 16812. | ||||||||||||||||||
| assert_eq!(acc1.size(), 556); | ||||||||||||||||||
|
|
||||||||||||||||||
| Ok(()) | ||||||||||||||||||
| } | ||||||||||||||||||
|
|
||||||||||||||||||
| #[test] | ||||||||||||||||||
| fn does_not_over_account_memory() -> Result<()> { | ||||||||||||||||||
| let (mut acc1, mut acc2) = ArrayAggAccumulatorBuilder::string().build_two()?; | ||||||||||||||||||
|
|
||||||||||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 I'm not sure if this will solve the issue. Keep in mind that the
merge_batchmethod argument receives the states of other accumulators, which already hold "compacted" data, so I'd expect this compaction here to be unnecessary.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#16519 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should add a special case to copy_array_data to avoid copying the data when it already is only a single row / has no offset 🤔
Right now it seems to copy the data unconditionally which is a non trivial overhead on each row 🤔
datafusion/datafusion/common/src/scalar/mod.rs
Lines 3564 to 3567 in a87d6f2
Perhaps we can do that as a follow of PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something along those lines ? (potentially handling all the primitive types and not just int64)
The problem with this is that we will probably have to do it for each of the internal arrow array types (primitives + strings + ??).
I tried to find a way to access the size of the topmost parent array regardless of the array type but couldn't find one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe to clarify, I did try this approach and the benchmarks shows on-par performance but I'm wondering if maintenability wise it will be alright given the multitude of arrow types we would need to support (I only did it for
int64arraysince this is what the benchmarks are using):