feat: Implement bloom_filter_agg #987

mbutrovich · 2024-09-30T19:39:26Z

Which issue does this PR close?

Closes #846.

Rationale for this change

What changes are included in this PR?

Native implementation (bloom_filter_agg.rs) that uses DataFusion's Accumulator trait. We do not have a GroupsAccumulator implementation and leave it as a possible future optimization.
Serde logic (planner.rs, QueryPlanSerde.scala)
Serialization and merging logic for underlying data structures (spark_bloom_filter.rs, spark_bit_array.rs)

How are these changes tested?

New test in CometExecSuite
Spark tests in CI exercise this aggregation
Scala benchmark to compare against Spark code path
Native benchmark for partial and final aggregation modes
Native tests for new bit array logic spark_bit_array.rs

…w. Added spark_bit_array_tests.

… `cargo clippy`.

alamb · 2024-10-01T19:14:07Z

native/core/src/execution/datafusion/expressions/bloom_filter_agg.rs

+    fn state(&mut self) -> Result<Vec<ScalarValue>> {
+        // TODO(Matt): There might be a more efficient way to do this by transmuting since calling
+        // state() on an Accumulator is considered destructive.
+        let state_sv = ScalarValue::Binary(Some(self.state_as_bytes()));


One way to avoid the copy, which may be too ugly , would be to store bloom filter data as an Option<>

So instead of

pub struct SparkBloomFilter { bits: SparkBitArray, num_hash_functions: u32, }

Something like

pub struct SparkBloomFilter { bits: Option<SparkBitArray> num_hash_functions: u32, }

And then you could basically use Option::take to take the value and leave a None in its place

let Some(bits) = self.bits.take() else { return Err(invalid state) }; // do whatever you want now you have the owned `bits`

codecov-commenter · 2024-10-02T01:50:40Z

Codecov Report

Attention: Patch coverage is 76.47059% with 4 lines in your changes missing coverage. Please review.

Project coverage is 34.41%. Comparing base (c3023c5) to head (bf22902).
Report is 19 commits behind head on main.

Files with missing lines	Patch %	Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala	76.47%	2 Missing and 2 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main     #987      +/-   ##
============================================
+ Coverage     34.03%   34.41%   +0.38%     
- Complexity      875      889      +14     
============================================
  Files           112      112              
  Lines         43289    43428     +139     
  Branches       9572     9627      +55     
============================================
+ Hits          14734    14947     +213     
+ Misses        25521    25437      -84     
- Partials       3034     3044      +10

Flag	Coverage Δ
	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

mbutrovich · 2024-10-02T14:19:45Z

Results from the benchmark I just added:

BloomFilterAggregate Exec (cardinality 100):       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark (BloomFilterAgg)                         117            136          18         89.4          11.2       1.0X
SQL Parquet - Comet (Scan) (BloomFilterAgg)                  117            134          18         89.4          11.2       1.0X
SQL Parquet - Comet (Scan, Exec) (BloomFilterAgg)             71             78           9        148.3           6.7       1.7X

BloomFilterAggregate Exec (cardinality 1024):      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark (BloomFilterAgg)                         111            128          11         94.7          10.6       1.0X
SQL Parquet - Comet (Scan) (BloomFilterAgg)                  110            135          15         95.7          10.4       1.0X
SQL Parquet - Comet (Scan, Exec) (BloomFilterAgg)             69             78          12        152.9           6.5       1.6X

BloomFilterAggregate Exec (cardinality 1048576):   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark (BloomFilterAgg)                         165            183          14         63.6          15.7       1.0X
SQL Parquet - Comet (Scan) (BloomFilterAgg)                  169            184          11         62.0          16.1       1.0X
SQL Parquet - Comet (Scan, Exec) (BloomFilterAgg)            117            126           9         89.2          11.2       1.4X

alamb · 2024-10-02T16:51:52Z

Results from the benchmark I just added:

Looks like a 40% improvement.

kazuyukitanimura

Still looking

native/core/src/execution/datafusion/util/spark_bloom_filter.rs

native/core/src/execution/datafusion/util/spark_bit_array.rs

native/core/src/execution/datafusion/expressions/bloom_filter_agg.rs

mbutrovich · 2024-10-03T20:35:54Z

Just putting notes for the test failure. It's failing one Spark test in InjectRuntimeFilterSuite. The test is "Merge runtime bloom filters". It's failing in a check in CometArrayImporter when it's bringing Arrow data from Native back over to JVM.

The plan is a bit of a monster, but I'll provide it below:

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [c1#45920, b1#45919], [c2#45926, b2#45925], Inner
   :- Sort [c1#45920 ASC NULLS FIRST, b1#45919 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(c1#45920, b1#45919, 5), ENSURE_REQUIREMENTS, [plan_id=739]
   :     +- Filter (((isnotnull(c1#45920) AND isnotnull(b1#45919)) AND might_contain(Subquery subquery#45946, [id=#610].bloomFilter, xxhash64(c1#45920, 42))) AND might_contain(Subquery subquery#45949, [id=#678].bloomFilter, xxhash64(b1#45919, 42)))
   :        :  :- Subquery subquery#45946, [id=#610]
   :        :  :  +- AdaptiveSparkPlan isFinalPlan=false
   :        :  :     +- CometProject [mergedValue#45952], [named_struct(bloomFilter, bloomFilter#45945, bloomFilter, bloomFilter#45948) AS mergedValue#45952]
   :        :  :        +- !CometHashAggregate [buf#45954, buf#45955], Final, [bloom_filter_agg(xxhash64(c2#45926, 42), 1000000, 8388608, 0, 0), bloom_filter_agg(xxhash64(b2#45925, 42), 1000000, 8388608, 0, 0)]
   :        :  :           +- CometExchange SinglePartition, ENSURE_REQUIREMENTS, CometNativeShuffle, [plan_id=605]
   :        :  :              +- !CometHashAggregate [c2#45926, b2#45925], Partial, [partial_bloom_filter_agg(xxhash64(c2#45926, 42), 1000000, 8388608, 0, 0), partial_bloom_filter_agg(xxhash64(b2#45925, 42), 1000000, 8388608, 0, 0)]
   :        :  :                 +- CometProject [c2#45926, b2#45925], [c2#45926, b2#45925]
   :        :  :                    +- CometFilter [a2#45924, b2#45925, c2#45926], (((isnotnull(a2#45924) AND (a2#45924 = 62)) AND isnotnull(c2#45926)) AND isnotnull(b2#45925))
   :        :  :                       +- CometScan parquet spark_catalog.default.bf2[a2#45924,b2#45925,c2#45926] Batched: true, DataFilters: [isnotnull(a2#45924), (a2#45924 = 62), isnotnull(c2#45926), isnotnull(b2#45925)], Format: CometParquet, Location: ... PartitionFilters: [], PushedFilters: [IsNotNull(a2), EqualTo(a2,62), IsNotNull(c2), IsNotNull(b2)], ReadSchema: struct<a2:int,b2:int,c2:int>
   :        :  +- Subquery subquery#45949, [id=#678]
   :        :     +- AdaptiveSparkPlan isFinalPlan=false
   :        :        +- CometProject [mergedValue#45952], [named_struct(bloomFilter, bloomFilter#45945, bloomFilter, bloomFilter#45948) AS mergedValue#45952]
   :        :           +- !CometHashAggregate [buf#45954, buf#45955], Final, [bloom_filter_agg(xxhash64(c2#45926, 42), 1000000, 8388608, 0, 0), bloom_filter_agg(xxhash64(b2#45925, 42), 1000000, 8388608, 0, 0)]
   :        :              +- CometExchange SinglePartition, ENSURE_REQUIREMENTS, CometNativeShuffle, [plan_id=673]
   :        :                 +- !CometHashAggregate [c2#45926, b2#45925], Partial, [partial_bloom_filter_agg(xxhash64(c2#45926, 42), 1000000, 8388608, 0, 0), partial_bloom_filter_agg(xxhash64(b2#45925, 42), 1000000, 8388608, 0, 0)]
   :        :                    +- CometProject [c2#45926, b2#45925], [c2#45926, b2#45925]
   :        :                       +- CometFilter [a2#45924, b2#45925, c2#45926], (((isnotnull(a2#45924) AND (a2#45924 = 62)) AND isnotnull(c2#45926)) AND isnotnull(b2#45925))
   :        :                          +- CometScan parquet spark_catalog.default.bf2[a2#45924,b2#45925,c2#45926] Batched: true, DataFilters: [isnotnull(a2#45924), (a2#45924 = 62), isnotnull(c2#45926), isnotnull(b2#45925)], Format: CometParquet, Location: ... PartitionFilters: [], PushedFilters: [IsNotNull(a2), EqualTo(a2,62), IsNotNull(c2), IsNotNull(b2)], ReadSchema: struct<a2:int,b2:int,c2:int>
   :        +- CometScan parquet spark_catalog.default.bf1[a1#45918,b1#45919,c1#45920,d1#45921,e1#45922,f1#45923] Batched: true, DataFilters: [isnotnull(c1#45920), isnotnull(b1#45919)], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/Users/matt/git/spark/spark-warehouse/org.apache.spark.sql.Inject..., PartitionFilters: [], PushedFilters: [IsNotNull(c1), IsNotNull(b1)], ReadSchema: struct<a1:int,b1:int,c1:int,d1:int,e1:int,f1:int>
   +- CometSort [a2#45924, b2#45925, c2#45926, d2#45927, e2#45928, f2#45929], [c2#45926 ASC NULLS FIRST, b2#45925 ASC NULLS FIRST]
      +- CometExchange hashpartitioning(c2#45926, b2#45925, 5), ENSURE_REQUIREMENTS, CometNativeShuffle, [plan_id=742]
         +- CometFilter [a2#45924, b2#45925, c2#45926, d2#45927, e2#45928, f2#45929], (((isnotnull(a2#45924) AND (a2#45924 = 62)) AND isnotnull(c2#45926)) AND isnotnull(b2#45925))
            +- CometScan parquet spark_catalog.default.bf2[a2#45924,b2#45925,c2#45926,d2#45927,e2#45928,f2#45929] Batched: true, DataFilters: [isnotnull(a2#45924), (a2#45924 = 62), isnotnull(c2#45926), isnotnull(b2#45925)], Format: CometParquet, Location: ... PartitionFilters: [], PushedFilters: [IsNotNull(a2), EqualTo(a2,62), IsNotNull(c2), IsNotNull(b2)], ReadSchema: struct<a2:int,b2:int,c2:int,d2:int,e2:int,f2:int>

This is what it looks like on the main branch:

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [c1#45920, b1#45919], [c2#45926, b2#45925], Inner
   :- Sort [c1#45920 ASC NULLS FIRST, b1#45919 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(c1#45920, b1#45919, 5), ENSURE_REQUIREMENTS, [plan_id=729]
   :     +- Filter (((isnotnull(c1#45920) AND isnotnull(b1#45919)) AND might_contain(Subquery subquery#45946, [id=#605].bloomFilter, xxhash64(c1#45920, 42))) AND might_contain(Subquery subquery#45949, [id=#668].bloomFilter, xxhash64(b1#45919, 42)))
   :        :  :- Subquery subquery#45946, [id=#605]
   :        :  :  +- AdaptiveSparkPlan isFinalPlan=false
   :        :  :     +- Project [named_struct(bloomFilter, bloomFilter#45945, bloomFilter, bloomFilter#45948) AS mergedValue#45952]
   :        :  :        +- ObjectHashAggregate(keys=[], functions=[bloom_filter_agg(xxhash64(c2#45926, 42), 1000000, 8388608, 0, 0), bloom_filter_agg(xxhash64(b2#45925, 42), 1000000, 8388608, 0, 0)])
   :        :  :           +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=601]
   :        :  :              +- ObjectHashAggregate(keys=[], functions=[partial_bloom_filter_agg(xxhash64(c2#45926, 42), 1000000, 8388608, 0, 0), partial_bloom_filter_agg(xxhash64(b2#45925, 42), 1000000, 8388608, 0, 0)])
   :        :  :                 +- CometProject [c2#45926, b2#45925], [c2#45926, b2#45925]
   :        :  :                    +- CometFilter [a2#45924, b2#45925, c2#45926], (((isnotnull(a2#45924) AND (a2#45924 = 62)) AND isnotnull(c2#45926)) AND isnotnull(b2#45925))
   :        :  :                       +- CometScan parquet spark_catalog.default.bf2[a2#45924,b2#45925,c2#45926] Batched: true, DataFilters: [isnotnull(a2#45924), (a2#45924 = 62), isnotnull(c2#45926), isnotnull(b2#45925)], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/Users/matt/git/spark/spark-warehouse/org.apache.spark.sql.Inject..., PartitionFilters: [], PushedFilters: [IsNotNull(a2), EqualTo(a2,62), IsNotNull(c2), IsNotNull(b2)], ReadSchema: struct<a2:int,b2:int,c2:int>
   :        :  +- Subquery subquery#45949, [id=#668]
   :        :     +- AdaptiveSparkPlan isFinalPlan=false
   :        :        +- Project [named_struct(bloomFilter, bloomFilter#45945, bloomFilter, bloomFilter#45948) AS mergedValue#45952]
   :        :           +- ObjectHashAggregate(keys=[], functions=[bloom_filter_agg(xxhash64(c2#45926, 42), 1000000, 8388608, 0, 0), bloom_filter_agg(xxhash64(b2#45925, 42), 1000000, 8388608, 0, 0)])
   :        :              +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=664]
   :        :                 +- ObjectHashAggregate(keys=[], functions=[partial_bloom_filter_agg(xxhash64(c2#45926, 42), 1000000, 8388608, 0, 0), partial_bloom_filter_agg(xxhash64(b2#45925, 42), 1000000, 8388608, 0, 0)])
   :        :                    +- CometProject [c2#45926, b2#45925], [c2#45926, b2#45925]
   :        :                       +- CometFilter [a2#45924, b2#45925, c2#45926], (((isnotnull(a2#45924) AND (a2#45924 = 62)) AND isnotnull(c2#45926)) AND isnotnull(b2#45925))
   :        :                          +- CometScan parquet spark_catalog.default.bf2[a2#45924,b2#45925,c2#45926] Batched: true, DataFilters: [isnotnull(a2#45924), (a2#45924 = 62), isnotnull(c2#45926), isnotnull(b2#45925)], Format: CometParquet, Location: ... PartitionFilters: [], PushedFilters: [IsNotNull(a2), EqualTo(a2,62), IsNotNull(c2), IsNotNull(b2)], ReadSchema: struct<a2:int,b2:int,c2:int>
   :        +- CometScan parquet spark_catalog.default.bf1[a1#45918,b1#45919,c1#45920,d1#45921,e1#45922,f1#45923] Batched: true, DataFilters: [isnotnull(c1#45920), isnotnull(b1#45919)], Format: CometParquet, Location: ... PartitionFilters: [], PushedFilters: [IsNotNull(c1), IsNotNull(b1)], ReadSchema: struct<a1:int,b1:int,c1:int,d1:int,e1:int,f1:int>
   +- CometSort [a2#45924, b2#45925, c2#45926, d2#45927, e2#45928, f2#45929], [c2#45926 ASC NULLS FIRST, b2#45925 ASC NULLS FIRST]
      +- CometExchange hashpartitioning(c2#45926, b2#45925, 5), ENSURE_REQUIREMENTS, CometNativeShuffle, [plan_id=732]
         +- CometFilter [a2#45924, b2#45925, c2#45926, d2#45927, e2#45928, f2#45929], (((isnotnull(a2#45924) AND (a2#45924 = 62)) AND isnotnull(c2#45926)) AND isnotnull(b2#45925))
            +- CometScan parquet spark_catalog.default.bf2[a2#45924,b2#45925,c2#45926,d2#45927,e2#45928,f2#45929] Batched: true, DataFilters: [isnotnull(a2#45924), (a2#45924 = 62), isnotnull(c2#45926), isnotnull(b2#45925)], Format: CometParquet, Location: ... PartitionFilters: [], PushedFilters: [IsNotNull(a2), EqualTo(a2,62), IsNotNull(c2), IsNotNull(b2)], ReadSchema: struct<a2:int,b2:int,c2:int,d2:int,e2:int,f2:int>

kazuyukitanimura

I feel we can merge this as soon as we pass all the tests. We can work on optimizations separately if necessary.

mbutrovich · 2024-10-04T13:46:18Z

Debugger output from the failing state in CometArrayImporter.

this = {CometArrayImporter@18451} 
snapshot = {ArrowArray$Snapshot@18448} 
 length = 1
 null_count = 0
 offset = 0
 n_buffers = 1
 n_children = 2
 buffers = 105553139081104
 children = 105553139081088
 dictionary = 0
 release = 6002611972
 private_data = 105553130135808
children = {long[2]@18449} [105553174820048, 105553174820208]
 0 = 105553174820048
 1 = 105553174820208
childVectors = {ArrayList@18450}  size = 1
 0 = {VarBinaryVector@22881} "[]"
vector = {StructVector@18452} "[]"
 reader = {NullableStructReaderImpl@22883} 
 writer = {NullableStructWriter@22884} "org.apache.comet.shaded.arrow.vector.complex.impl.NullableStructWriter@8a493b[index = 0]"
 validityBuffer = {ArrowBuf@22885} "ArrowBuf[1], address:0, capacity:0"
 validityAllocationSizeInBytes = 497
 NonNullableStructVector.reader = {SingleStructReaderImpl@22886} 
 field = {Field@22887} "Struct not null"
 valueCount = 0
 ephPair = null
 vectors = {PromotableMultiMapWithOrdinal@22888} 
 allowConflictPolicyChanges = true
 conflictPolicy = {AbstractStructVector$ConflictPolicy@22889} "CONFLICT_REPLACE"
 name = null
 allocator = {RootAllocator@18453} "Allocator(ROOT) 0/2176/4196904/9223372036854775807 (res/actual/peak/limit)\n"
 callBack = null
children.length = 2

The entire subquery runs in native code now, so my guess is that the output from that projection, which looks like it should be a struct with two binary values in it, is wrong. I'm not sure if it's a bug in the projection, or something further downstream.

viirya · 2024-10-09T03:43:44Z

I do not have time to look at this error yet. I may take a look after the conference.

mbutrovich · 2024-10-09T21:52:37Z

Can't say I see a huge different in TPC-H or TPC-DS locally, but the plans I looked at were typically building filters over very small relations.

kazuyukitanimura · 2024-10-10T22:02:36Z

@mbutrovich Is it possible to trace back where children and childVectors are populated?

viirya · 2024-10-12T08:20:32Z

The Spark SQL test failure can be fixed by #1016.

viirya · 2024-10-13T07:14:52Z

I merged the fix. You can rebase and re-trigger CI now.

mbutrovich · 2024-10-13T15:44:43Z

Merged in updated main, thanks for the fix!

native/core/src/execution/datafusion/expressions/bloom_filter_agg.rs

spark/src/test/scala/org/apache/spark/sql/benchmark/CometExecBenchmark.scala

viirya · 2024-10-16T21:22:42Z

native/core/src/execution/datafusion/expressions/bloom_filter_agg.rs

+        (0..arr.len()).try_for_each(|index| {
+            let v = ScalarValue::try_from_array(arr, index)?;
+
+            if let ScalarValue::Int64(Some(value)) = v {


It only supports Int64? Spark BloomFilterAggregate supports Byte, Short, Int, Long and String. If Comet BloomFilterAggregate only support Int64 for now. We need to fallback to Spark for other cases in QueryPlanSerde.

I think I was going off of their docs which say it only supports Long.

In their implementation, however, if looks like they can cast the fixed width types directly to Long
https://github.com/apache/spark/blob/b078c0d6e2adf7eb0ee7d4742a6c52864440226e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala#L238

and for strings their bloom filter implementation has a putBinary method that we don't currently support. The casts should be easy. I'll look at what putBinary on our bloom filter implementation will take.

Ah I see what happened. 3.4 only supports Long, which was the Spark source I was working off of. 3.5 added support for other types.

I modified it to only generate a native BloomFilterAgg if the child has LongType. I'll open an issue to support more types in the future.

viirya · 2024-10-16T21:27:20Z

native/core/src/execution/datafusion/util/spark_bloom_filter.rs

+        // Does it make sense to do a std::mem::take of filter_state here? Unclear to me if a deep
+        // copy of filter_state as a Vec<u64> to a Vec<u8> is happening here.


You mean if std::mem::take also does copy?

filter_state isn't needed after this function call, so ideally I'd be able to move its contents out instead of the byte slice, but because the underlying type of filter_state is u64 vector and I'm stuffing it into a u8 vector I think the alignment requirements won't be the same and I won't get the clean buffer transfer that I want.

u64 vector alignment should be acceptable for u8 vector alignment. At least locally I saw their alignments are same.

andygrove · 2024-10-18T15:23:43Z

spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala

+        if (childExpr.isDefined &&
+          child.dataType
+            .isInstanceOf[LongType] && // Spark 3.4 only supports Long, 3.5+ adds more types.
+          numItemsExpr.isDefined &&
+          numBitsExpr.isDefined &&
+          dataType.isDefined) {


one minor nit (and no need to address for this PR, because this is a general issue that we already have) is that we are testing for a multiple preconditions here, and we fall back if any of them are false (which is correct) but we do not let the user know the specific reason for the fallback, which makes debugging more difficult.

I would eventually like to explore refactoring how we approach this and see if we can add some utilities to make it easier to report fallback reasons, but this is much lower priority than the performance & stability work for now.

andygrove · 2024-10-18T20:34:47Z

I tested with TPC-H q5 and see that we are now running the bloom filter agg natively

andygrove

Thanks @mbutrovich

## Which issue does this PR close?  Closes #. ## Rationale for this change  ## What changes are included in this PR?  ``` cb3e977 perf: Add experimental feature to replace SortMergeJoin with ShuffledHashJoin (apache#1007) 3df9d5c fix: Make comet-git-info.properties optional (apache#1027) 4033687 chore: Reserve memory for native shuffle writer per partition (apache#1022) bd541d6 (public/main) remove hard-coded version number from Dockerfile (apache#1025) e3ac6cf feat: Implement bloom_filter_agg (apache#987) 8d097d5 (origin/main) chore: Revert "chore: Reserve memory for native shuffle writer per partition (apache#988)" (apache#1020) 591f45a chore: Bump arrow-rs to 53.1.0 and datafusion (apache#1001) e146cfa chore: Reserve memory for native shuffle writer per partition (apache#988) abd9f85 fix: Fallback to Spark if named_struct contains duplicate field names (apache#1016) 22613e9 remove legacy comet-spark-shell (apache#1013) d40c802 clarify that Maven central only has jars for Linux (apache#1009) 837c256 docs: Various documentation improvements (apache#1005) 0667c60 chore: Make parquet reader options Comet options instead of Hadoop options (apache#968) 0028f1e fix: Fallback to Spark if scan has meta columns (apache#997) b131cc3 feat: Support `GetArrayStructFields` expression (apache#993) 3413397 docs: Update tuning guide (apache#995) afd28b9 Quality of life fixes for easier hacking (apache#982) 18150fb chore: Don't transform the HashAggregate to CometHashAggregate if Comet shuffle is disabled (apache#991) a1599e2 chore: Update for 0.3.0 release, prepare for 0.4.0 development (apache#970) ``` ## How are these changes tested?

mbutrovich and others added 8 commits September 25, 2024 15:36

Add test that invokes bloom_filter_agg.

e662814

QueryPlanSerde support for BloomFilterAgg.

20f6e67

Add bloom_filter_agg based on sample UDAF. planner instantiates it no…

1ec31a2

…w. Added spark_bit_array_tests.

Partial work on Accumulator. Need to finish merge_batch and state.

3965dc4

BloomFilterAgg state, merge_state, and evaluate. Need more tests.

62e656c

Matches Spark behavior. Need to clean up the code quite a bit, and do…

33ef47d

… `cargo clippy`.

Merge branch 'apache:main' into bloom_field_agg

2040c76

Remove old comment.

599a8f9

mbutrovich marked this pull request as draft September 30, 2024 19:39

mbutrovich added 3 commits September 30, 2024 16:02

Clippy. Increase bloom filter size back to Spark's default.

a2a8cf3

API cleanup.

22aedd9

API cleanup.

bf22902

alamb reviewed Oct 1, 2024

View reviewed changes

mbutrovich mentioned this pull request Oct 1, 2024

chore: Don't transform the HashAggregate to CometHashAggregate if Comet shuffle is disabled #991

Merged

mbutrovich and others added 2 commits October 2, 2024 08:16

Merge branch 'apache:main' into bloom_field_agg

4b7000c

Add BloomFilterAgg benchmark to CometExecBenchmark

88adc75

mbutrovich added 5 commits October 2, 2024 10:31

Docs.

a21e0e3

API cleanup, fix merge_bits to update cardinality.

5c5d0f9

Refactor merge_bits to update bit_count with the bit merging.

cd107e3

Remove benchmark results file.

4f06098

Docs.

79f6468

mbutrovich marked this pull request as ready for review October 2, 2024 15:49

kazuyukitanimura reviewed Oct 2, 2024

View reviewed changes

mbutrovich and others added 4 commits October 2, 2024 17:21

Add native side benchmarks.

57fe742

Adjust benchmark parameters to match Spark defaults.

ec64e4c

Address review feedback.

7a81f35

Merge branch 'apache:main' into bloom_field_agg

013513e

kazuyukitanimura reviewed Oct 3, 2024

View reviewed changes

Add assertion to merge_batch.

3347923

Merge branch 'apache:main' into bloom_field_agg

5c82f24

Merge branch 'apache:main' into bloom_field_agg

c39ff1d

kazuyukitanimura approved these changes Oct 14, 2024

View reviewed changes

viirya reviewed Oct 16, 2024

View reviewed changes

native/core/src/execution/datafusion/expressions/bloom_filter_agg.rs Show resolved Hide resolved

viirya reviewed Oct 16, 2024

View reviewed changes

spark/src/test/scala/org/apache/spark/sql/benchmark/CometExecBenchmark.scala Outdated Show resolved Hide resolved

viirya reviewed Oct 16, 2024

View reviewed changes

mbutrovich added 2 commits October 17, 2024 12:48

Address some review feedback.

1ed99e3

Only generate native BloomFilterAgg if child has LongType.

d41a9d2

mbutrovich mentioned this pull request Oct 18, 2024

Add more types to BloomFilterAgg #1023

Closed

Add TODO with GitHub issue link.

6d13890

mbutrovich requested a review from viirya October 18, 2024 15:09

andygrove reviewed Oct 18, 2024

View reviewed changes

andygrove approved these changes Oct 18, 2024

View reviewed changes

andygrove merged commit e3ac6cf into apache:main Oct 18, 2024

mbutrovich deleted the bloom_field_agg branch October 23, 2024 13:36

mbutrovich mentioned this pull request Dec 7, 2024

fix: Enable scenarios accidentally commented out in CometExecBenchmark #1151

Merged

		// Does it make sense to do a std::mem::take of filter_state here? Unclear to me if a deep
		// copy of filter_state as a Vec<u64> to a Vec<u8> is happening here.

feat: Implement bloom_filter_agg #987

feat: Implement bloom_filter_agg #987

Uh oh!

Conversation

mbutrovich commented Sep 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented Oct 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

mbutrovich commented Oct 2, 2024

Uh oh!

alamb commented Oct 2, 2024

Uh oh!

kazuyukitanimura left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mbutrovich commented Oct 3, 2024

Uh oh!

kazuyukitanimura left a comment

Choose a reason for hiding this comment

Uh oh!

mbutrovich commented Oct 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

viirya commented Oct 9, 2024

Uh oh!

mbutrovich commented Oct 9, 2024

Uh oh!

kazuyukitanimura commented Oct 10, 2024

Uh oh!

viirya commented Oct 12, 2024

Uh oh!

viirya commented Oct 13, 2024

Uh oh!

mbutrovich commented Oct 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mbutrovich Oct 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

andygrove commented Oct 18, 2024

Uh oh!

andygrove left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

mbutrovich commented Sep 30, 2024 •

edited

Loading

codecov-commenter commented Oct 2, 2024 •

edited

Loading

mbutrovich commented Oct 4, 2024 •

edited

Loading

mbutrovich commented Oct 13, 2024 •

edited

Loading

mbutrovich Oct 17, 2024 •

edited

Loading