Add bloom_filter_agg Spark aggregate function #4028
Conversation
```cpp
    100, [](vector_size_t row) { return row % 9; })})};

auto expected = {makeRowVector({makeFlatVector<StringView>(
    1, [](vector_size_t row) { return "\u0004"; })})};
```
From my understanding:
- The expected result is at least 36 bytes, not just "\u0004".
- You can use `return StringView(pointer_to_data, data_size);` instead of `return "\u0004";` to avoid the risk of truncation. See Enhance BloomFilter to serialize and memory track #3861 (comment).
- Please reconfirm the correctness of the expected result.
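A minimal sketch of that suggestion, assuming a hypothetical helper `serializedBloomFilter()` that returns the expected serialized bytes (the real value would come from the filter's serialize path):

```cpp
// Hypothetical: the expected serialization, which may contain embedded
// '\0' bytes, so a C-string literal would silently truncate.
std::string data = serializedBloomFilter();

// Construct the StringView from an explicit pointer and size.
auto expected = {makeRowVector({makeFlatVector<StringView>(
    1, [&](vector_size_t /*row*/) {
      return StringView(data.data(), data.size());
    })})};
```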
Thanks, I tried to fix it as you suggested, but it failed; see #4028 (comment).
Can you give me more suggestions?
Force-pushed from e8aae21 to d1a0001.
Force-pushed from 714ded6 to 82a36cf.
Force-pushed from 82a36cf to c844aa9.
Force-pushed from c844aa9 to bc9d750.
The Spark fuzzer test will raise an exception; can you help fix this? @duanmeng

Can you help review this one? @mbasmanova Thanks!
mbasmanova
left a comment
#4633 is fixed. Would you re-enable the fuzzer test?
Curious how this function is used in Spark to reduce the amount of shuffle data. Is there something I can read about this?
```cmake
set_property(TARGET velox_functions_spark PROPERTY JOB_POOL_COMPILE
             high_memory_pool)

if(${VELOX_ENABLE_AGGREGATES})
```
Why is this change needed? Let's remove it. Feel free to open a separate PR with this change alone if necessary.
mbasmanova
left a comment
@jinchengchenghh Some comments.
```cpp
  bloomFilter_.insert(folly::hasher<int64_t>()(value));
}

BloomFilter<StlAllocator<uint64_t>> bloomFilter_;
```
naming: struct members do not have a trailing underscore
```cpp
explicit BloomFilterAccumulator(HashStringAllocator* allocator)
    : bloomFilter_{StlAllocator<uint64_t>(allocator)} {}

int32_t serializedSize() {
```
make this method const
```cpp
  return bloomFilter_.serializedSize();
}

void serialize(char* output) {
```
make this method const
```cpp
    const std::vector<VectorPtr>& args,
    bool /*mayPushdown*/) override {
  decodeArguments(rows, args);
  VELOX_USER_CHECK(!decodedRaw_.mayHaveNulls());
```
Why is this check needed? Would you add a test case where some of the input data is null?
Spark's bloom filter aggregate test only tests empty input: https://github.com/apache/spark/blob/branch-3.3/sql/core/src/test/scala/org/apache/spark/sql/BloomFilterAggregateQuerySuite.scala#L196

I added this test:

```scala
test("Test that bloom_filter_agg errors null") {
  spark.sql(
    """
      |SELECT bloom_filter_agg(null)"""
      .stripMargin)
}
```
It throws this exception:

```
[DATATYPE_MISMATCH.BLOOM_FILTER_WRONG_TYPE] Cannot resolve "bloom_filter_agg(NULL, 1000000, 8388608)" due to data type mismatch: Input to function `bloom_filter_agg` should have been "BINARY" followed by value with "BIGINT", but it's ["VOID", "BIGINT", "BIGINT"].; line 2 pos 7;
```
In Spark, its first argument is xxhash64(table_col), so it won't be null.
Velox's BloomFilter accepts uint64_t while xxhash64() returns int64_t, so we need to use folly to hash the value a second time.
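A minimal sketch of that second hash, following the `bloomFilter_.insert(folly::hasher<int64_t>()(value))` line in this PR's diff; the surrounding types are taken from the same diff:

```cpp
#include <folly/hash/Hash.h>

// 'value' is already an xxhash64 result (int64_t); folly::hasher maps it
// into the uint64_t domain that Velox's BloomFilter expects.
void addHashedValue(
    BloomFilter<StlAllocator<uint64_t>>& filter, int64_t value) {
  filter.insert(folly::hasher<int64_t>()(value));
}
```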
> In Spark, its first argument is xxhash64(table_col), so it won't be null.

I think it is totally possible that table_col is null for some or all rows.

> SELECT bloom_filter_agg(null)

Try changing this to something like:

```sql
SELECT bloom_filter_agg(cast(null as varbinary))
```
I changed the test to:

```scala
test("Test that bloom_filter_agg errors null") {
  spark.sql(
    """
      |SELECT bloom_filter_agg(cast(null as binary))"""
      .stripMargin)
}
```

It throws a different exception:

```
[DATATYPE_MISMATCH.BLOOM_FILTER_WRONG_TYPE] Cannot resolve "bloom_filter_agg(CAST(NULL AS BINARY), 1000000, 8388608)" due to data type mismatch: Input to function `bloom_filter_agg` should have been "BINARY" followed by value with "BIGINT", but it's ["BINARY", "BIGINT", "BIGINT"].; line 2 pos 7;
'Aggregate [unresolvedalias(bloom_filter_agg(cast(null as binary), 1000000, 8388608, 0, 0), None)]
+- OneRowRelation
```
This is a Spark internal function; it is invoked by the planner. For the case where table_col is null, it becomes bloom_filter_agg(xxhash64(null)), and xxhash64(null) is 42.
> This is a Spark internal function; it is invoked by the planner. For the case where table_col is null, it becomes bloom_filter_agg(xxhash64(null)), and xxhash64(null) is 42.

Interesting. So the input to bloom_filter_agg is not a value, but a hash of the value. Let's clarify this in the documentation. What's the type of the input? Is it VARBINARY or BIGINT?
```scala
val rowCount = filterCreationSidePlan.stats.rowCount
val bloomFilterAgg = new BloomFilterAggregate(
  new XxHash64(Seq(filterCreationSideExp)), rowCount.get.longValue)
```

It is BIGINT.
```cpp
  accumulator->serialize(buffer.data());
  serialized = StringView::makeInline(buffer);
} else {
  Buffer* buffer = flatResult->getBufferWithSpace(size);
```
It would be more efficient to compute total bytes needed for the whole result and call getBufferWithSpace once.
```cpp
}

private:
  const int64_t DEFAULT_ESPECTED_NUM_ITEMS = 1000000;
```
naming: kDefaultExpe...
Use 1'000'000 for readability
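A sketch of both suggestions applied (the `constexpr` qualifier is my addition; the PR's later diff keeps plain `const`):

```cpp
static constexpr int64_t kDefaultExpectedNumItems = 1'000'000;
static constexpr int64_t kMaxNumItems = 4'000'000;
```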
```cpp
private:
  const int64_t DEFAULT_ESPECTED_NUM_ITEMS = 1000000;
  const int64_t MAX_NUM_ITEMS = 4000000;
  // Spark MAX_NUM_BITS is 67108864, but velox has memory limit sizeClassSizes
```
Should the documentation match?
@mbasmanova Spark Runtime Filters: apache/spark#35789, https://docs.google.com/document/d/16IEuyLeQlubQkH8YuVuXWKo2-grVIoDJqQpHZrE7q04/edit#heading=h.4v65wq7vzy4q

@mbasmanova In short, when a big table A joins a small table B with a filter, Spark generates the bloom filter for table B after the filter, broadcasts this bloom filter to the A side, and uses it to filter A's rows, which reduces the amount of shuffle data.
Force-pushed from 17d662f to 36f399d.
Spark aggregate fuzzer tests passed.

I received a core dump, but I don't think it is caused by my PR.

Looks like this is tracked in #4652.
mbasmanova
left a comment
@jinchengchenghh Some follow-up comments.
```cpp
    return;
  }
  rows.applyToSelected([&](vector_size_t row) {
    accumulator->init(capacity_);
```
nit: this can be done once before the loop
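A sketch of the hoisted version; the per-row body is assumed from the surrounding diff (decodedRaw_ holds the already-hashed input, and `insert` stands in for the accumulator's add path):

```cpp
// Initialize the accumulator once, not once per selected row.
accumulator->init(capacity_);
rows.applyToSelected([&](vector_size_t row) {
  accumulator->insert(decodedRaw_.valueAt<int64_t>(row));
});
```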
```cpp
}

auto size = accumulator->serializedSize();
StringView serialized;
```
This variable is not used. Let's remove.
It is used in the following `flatResult->setNoCopy(i, serialized);`.
```cpp
for (vector_size_t i = 0; i < numGroups; ++i) {
  auto group = groups[i];
  auto accumulator = value<BloomFilterAccumulator>(group);
  if (UNLIKELY(!accumulator->bloomFilter.isSet())) {
```
Any particular reason this cannot be part of accumulator->serializedSize()? It could return zero in this case.
accumulator->serializedSize() never returns 0; it is:

```cpp
uint32_t serializedSize() const {
  return 1 /* version */
      + 4 /* number of bits */
      + bits_.size() * 8;
}
```
```cpp
  capacity_ = numBits_ / 16;
}

int32_t getPreAllocatedBufferSize(char** groups, int32_t numGroups) const {
```
nit: perhaps, getPreAllocatedBufferSize -> getTotalSize
```cpp
for (vector_size_t i = 0; i < numGroups; ++i) {
  auto group = groups[i];
  auto accumulator = value<BloomFilterAccumulator>(group);
  if (UNLIKELY(!accumulator->bloomFilter.isSet())) {
```
nit: should we add a method for this for consistency?
accumulator->initialized()
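A sketch of that helper, assuming the accumulator keeps the `isSet()` member shown in the diff above:

```cpp
// Wraps the underlying bloom filter's state check behind a named helper.
bool initialized() const {
  return bloomFilter.isSet();
}
```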
```cpp
}

static void
setConstantArgument(const char* name, int64_t& val, int64_t newVal) {
```
do not abbreviate: currentValue and newValue
```cpp
}

static void
setConstantArgument(const char* name, int64_t& val, int64_t newVal) {
```
This function is small and used only once. Consider folding this logic into the caller for readability.
```cpp
// Reusable instance of DecodedVector for decoding input vectors.
DecodedVector decodedRaw_;
DecodedVector decodedIntermediate_;
int64_t originalEstimatedNumItems_ = kMissingArgument;
```
Do we need both originalEstimatedNumItems_ and estimatedNumItems_ member variables? Looks like just one would be sufficient.
Yes. The constant originalEstimatedNumItems_ is the value from the input vector; it is compared with the max value to derive estimatedNumItems_, which is the one used by the function.
Are you saying that estimatedNumItems_ can be lower than originalEstimatedNumItems_ if the input value is too large?
Yes
```cpp
auto vectors = {makeRowVector({makeAllNullFlatVector<int64_t>(2)})};
auto expectedFake = {makeRowVector(
    {makeNullableFlatVector<StringView>({std::nullopt}, VARBINARY())})};
EXPECT_THROW(
```
Use VELOX_ASSERT_THROW
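A sketch of the suggested macro in place of gtest's EXPECT_THROW; the call under test and the message are assumptions pieced together from this thread:

```cpp
// Unlike EXPECT_THROW, VELOX_ASSERT_THROW also verifies the error message.
VELOX_ASSERT_THROW(
    testAggregations(vectors, {}, {"bloom_filter_agg(c0)"}, expectedFake),
    "First argument of bloom_filter_agg cannot be null");
```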
```rst
.. spark:function:: bloom_filter_agg(x, estimatedNumItems, numBits) -> varbinary

   Insert ``x`` into BloomFilter, and returns the serialized BloomFilter.
```
nit: perhaps,

Creates bloom filter from values of 'x' and returns it serialized into VARBINARY.
``estimatedNumItems`` provides an estimate of the number of unique values of ``x``. Value is capped at 716,800.
``numBits`` specifies the max capacity of the bloom filter, which allows trading accuracy for memory.
I changed "the number of unique values" to "the number of values"; this is Spark's intended meaning.
@jinchengchenghh I'm not sure I understand why we would want to specify an estimate of the total number of input values. Would you clarify?
```java
new BloomFilterImpl(optimalNumOfHashFunctions(expectedNumItems, numBits), numBits);

/**
 * Computes the optimal k (number of hashes per item inserted in Bloom filter), given the
 * expected insertions and total number of bits in the Bloom filter.
 *
 * See http://en.wikipedia.org/wiki/File:Bloom_filter_fp_probability.svg for the formula.
 *
 * @param n expected insertions (must be positive)
 * @param m total number of bits in Bloom filter (must be positive)
 */
private static int optimalNumOfHashFunctions(long n, long m) {
  // (m / n) * log(2), but avoid truncation due to division!
  return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}
```

The Spark BloomFilter implementation uses this value to compute the number of hash functions; there is an optimal value according to the theory. The Velox implementation does not use a variable number of hash functions; it uses a constant 4.
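For reference, the optimum that the Java comment points to, written out:

$$k = \max\!\left(1,\ \mathrm{round}\!\left(\frac{m}{n}\,\ln 2\right)\right)$$

where $n$ is the expected number of insertions and $m$ is the total number of bits in the filter.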
@jinchengchenghh Are you saying Velox's implementation doesn't use the estimatedNumItems argument? If so, should we remove it? Otherwise, should we document that this argument is not used and remove the logic for capping its value and initializing the estimatedNumItems_ member variable?
For the final solution, I may implement a native Spark BloomFilter in Velox; it will have better performance, and we can switch to it then.
And if numBits is not specified, we use estimatedNumItems to estimate numBits. The BloomFilter implementation doesn't use the estimatedNumItems argument, but bloom_filter_agg does.
Got it. Let's clarify all this in the documentation. It is not obvious.
Do we need to also document the differences from Spark, or just clarify the usage in Velox?
Let's document the difference with Spark as well. By default, the assumption is that Velox functions match semantics of the original engine.
Ok, updated
```rst
.. spark:function:: bloom_filter_agg(x, estimatedNumItems, numBits) -> varbinary

   Creates bloom filter from values of hashed value 'x' and returns it serialized into VARBINARY.
```
> from values of hashed value 'x'

It is a bit cryptic. Perhaps rename x to hash and say something like:

.. spark:function:: bloom_filter_agg(hash, estimatedNumItems, numBits) -> varbinary

   Creates bloom filter from input hashes and returns it serialized into VARBINARY. The caller is expected to apply the xxhash64 function to input data before calling bloom_filter_agg. For example,

   bloom_filter_agg(xxhash64(x), 1000000, 1024)
```rst
.. spark:function:: bloom_filter_agg(x, estimatedNumItems, numBits) -> varbinary

   Creates bloom filter from values of hashed value 'x' and returns it serialized into VARBINARY.
   ``estimatedNumItems`` and ``numBits`` decides the number of hash functions and bloom filter capacity in Spark.
```
typos:

In the Spark implementation, ``estimatedNumItems`` and ``numBits`` are used to decide the number of hash functions and the bloom filter capacity.
In the Velox implementation, ``estimatedNumItems`` is not used.
```rst
   Current bloom filter implementation is different with Spark, if specified ``numBits``, ``estimatedNumItems``
   will not be used.

   ``x`` should be xxhash64(``y``).
```
This can be removed.
```rst
   will not be used.

   ``x`` should be xxhash64(``y``).
   ``estimatedNumItems`` provides an estimate of the number of values of ``y``, which takes no effect here.
```
Let's remove since it is not used.
```rst
   ``x`` should be xxhash64(``y``).
   ``estimatedNumItems`` provides an estimate of the number of values of ``y``, which takes no effect here.
   ``numBits`` specifies max capacity of the bloom filter, which allows to trade accuracy for memory.
   Value of numBits in Spark is capped at 67,108,864, actually is capped at 716,800 in case of class memory limit .
```
typos:

In Spark, the value of ``numBits`` is automatically capped at 67,108,864.
In Velox, the value of ``numBits`` is automatically capped at 716,800.
```rst
   Returns the bitwise XOR of all non-null input values, or null if none.

.. spark:function:: bloom_filter_agg(x, estimatedNumItems, numBits) -> varbinary
```
Let's also mention that x / hash cannot be null.
```cpp
    bool /*mayPushdown*/) override {
  decodeArguments(rows, args);
  VELOX_USER_CHECK(
      !decodedRaw_.mayHaveNulls(), "First argument value should not be null");
```
> First argument value

Users may not understand what this refers to. How about:

First argument of bloom_filter_agg cannot be null

However, !decodedRaw_.mayHaveNulls() is a very strong check: mayHaveNulls() may return true even if there are no nulls, so the check can fail spuriously.
I'm a bit confused about "mayHaveNulls() may return true even if there are no nulls".
In my understanding, it returns false only when there are definitely no nulls, and may return false or true when nulls are present, so a false result guarantees there are no nulls.
Other code follows this rule:
https://github.com/facebookincubator/velox/blob/main/velox/connectors/hive/HiveConnector.cpp#L633
Scalar function documentation:
https://facebookincubator.github.io/velox/develop/scalar-functions.html

> bool mayHaveNulls(): constant-time check on the underlying vector nullity. When it returns false, there are definitely no nulls; a true does not guarantee null existence.
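A small sketch of that documented contract in use ('decoded' and 'rows' are illustrative names):

```cpp
if (!decoded.mayHaveNulls()) {
  // Definitely no nulls; the fast path is safe.
} else {
  // Nulls are possible but not guaranteed; check per row.
  rows.applyToSelected([&](vector_size_t row) {
    if (decoded.isNullAt(row)) {
      // Handle the null value.
    }
  });
}
```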
```cpp
auto size = accumulator->serializedSize();
StringView serialized;
if (StringView::isInline(size)) {
  std::string buffer(size, '\0');
```
Can you check this comment?
```cpp
} else {
  char* ptr = buffer->asMutable<char>() + buffer->size();
  accumulator->serialize(ptr);
  buffer->setSize(buffer->size() + size);
```
Same.
```cpp
private:
  const int64_t kDefaultExpectedNumItems = 1'000'000;
  const int64_t kMaxNumItems = 4'000'000;
  // Spark kMaxNumBits is 67108864, but velox has memory limit sizeClassSizes
```
Same.
Force-pushed from 53c6763 to f9041c8.
mbasmanova
left a comment
@jinchengchenghh Thank you for iterating on this PR. There are a lot of tricky details to get right. Some follow-up comments.
```rst
   Creates bloom filter from input hashes and returns it serialized into VARBINARY.
   The caller is expected to apply xxhash64 function to input data before calling bloom_filter_agg.
   For example,
   bloom_filter_agg(xxhash64(x), 100, 1024)
```
Would you generate the docs locally and verify they are formatted nicely? It seems to me that we need some new lines or something around the example.
Sorry about the document; I didn't know how to verify it before.
Now I know how to convert it to HTML and preview it in VS Code, so I will check the HTML formatting. Thanks for your kind review.
```rst
   ``hash`` cannot be null.
   ``numBits`` specifies max capacity of the bloom filter, which allows to trade accuracy for memory.
   In Spark, the value of``numBits`` is automatically capped at 67,108,864.
   In Velxo, the value of``numBits`` is automatically capped at 716,800.
```
typo: Velxo -> Velox

> is automatically capped at 716,800.

Let's update the PR description to explain where this limitation comes from. CC: @xiaoxmeng
```rst
   In Spark, the value of``numBits`` is automatically capped at 67,108,864.
   In Velxo, the value of``numBits`` is automatically capped at 716,800.

   ``x``, ``estimatedNumItems`` and ``numBits`` must be ``BIGINT``.
```
x -> hash
```rst
.. spark:function:: bloom_filter_agg(hash, estimatedNumItems) -> varbinary

   As ``bloom_filter_agg``.
```
This description needs to be revised:

A version of ``bloom_filter_agg`` that uses ``numBits`` computed as ``estimatedNumItems`` * 8.
``estimatedNumItems`` provides an estimate of the number of values of <TBD: fill in; y seems wrong>.
Value of ``estimatedNumItems`` is capped at 4,000,000.

Does the 4M cap come from Spark? If so, let's clarify:

Value of ``estimatedNumItems`` is capped at 4,000,000 to match Spark's implementation.
This is the Spark implementation: https://github.com/apache/spark/blob/branch-3.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala#L58
It matches Spark.
It looks like defaults are configurable in Spark. Should these be configurable in Velox as well?

```scala
val RUNTIME_BLOOM_FILTER_EXPECTED_NUM_ITEMS =
  buildConf("spark.sql.optimizer.runtime.bloomFilter.expectedNumItems")
    .doc("The default number of expected items for the runtime bloomfilter")
    .version("3.3.0")
    .longConf
    .createWithDefault(1000000L)

val RUNTIME_BLOOM_FILTER_MAX_NUM_ITEMS =
  buildConf("spark.sql.optimizer.runtime.bloomFilter.maxNumItems")
    .doc("The max allowed number of expected items for the runtime bloom filter")
    .version("3.3.0")
    .longConf
    .createWithDefault(4000000L)

val RUNTIME_BLOOM_FILTER_NUM_BITS =
  buildConf("spark.sql.optimizer.runtime.bloomFilter.numBits")
    .doc("The default number of bits to use for the runtime bloom filter")
    .version("3.3.0")
    .longConf
    .createWithDefault(8388608L)

val RUNTIME_BLOOM_FILTER_MAX_NUM_BITS =
  buildConf("spark.sql.optimizer.runtime.bloomFilter.maxNumBits")
    .doc("The max number of bits to use for the runtime bloom filter")
    .version("3.3.0")
    .longConf
    .createWithDefault(67108864L)
```
Right now we cannot access configuration when implementing aggregate functions, because we cannot get the query context config there.
If we need this feature, we should store the config in GroupingSet when we initialize it in HashAggregation.cpp.
Currently only Spiller::Config exists in GroupingSet.
We would also need to add a new config or context argument to functions such as addRawInput, which would change the signatures of all aggregate functions.
Alternatively, we could store the config in Aggregate, but I don't suggest this; it changes less code, but Aggregate is created from the FunctionRegistry factory, so we cannot use a constructor to create it with a config.
If you think it is needed, I can help to implement it in another PR.
Thank you for pointing this out. Let's create a GitHub issue to explain this use case and discuss how best to implement it. For this PR, let's just mention in the documentation that Spark allows for changing the defaults, but Velox does not.
```rst
.. spark:function:: bloom_filter_agg(hash) -> varbinary

   As ``bloom_filter_agg``.
```
Same.
A version of ``bloom_filter_agg`` that uses 8,000,000 as ``numBits``.
Would you confirm that this matches Spark's implementation?
```cpp
flatResult->resize(numGroups);

int32_t totalSize = getTotalSize(groups, numGroups);
Buffer* buffer = flatResult->getBufferWithSpace(totalSize);
```
This works, but it is quite a bit of code and easy to get wrong, for example by forgetting to call buffer->setSize or forgetting to add buffer_.size() when initializing bufferPtr.
Consider introducing a new method:

char* rawBuffer = flatResult->getRawStringBufferWithSpace(totalSize);

This method would return the pointer to the first "writable" byte and update the size of the 'buffer' to include totalSize.
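A sketch of extraction built on that method; the helper names (getTotalSize, serializedSize, serialize, setNoCopy) come from this thread, while the loop body itself is an assumption:

```cpp
int32_t totalSize = getTotalSize(groups, numGroups);
// One allocation up front; returns the first writable byte and grows the
// string buffer's size by totalSize.
char* rawBuffer = flatResult->getRawStringBufferWithSpace(totalSize);
for (vector_size_t i = 0; i < numGroups; ++i) {
  auto* accumulator = value<BloomFilterAccumulator>(groups[i]);
  auto size = accumulator->serializedSize();
  accumulator->serialize(rawBuffer);
  flatResult->setNoCopy(i, StringView(rawBuffer, size));
  rawBuffer += size;
}
```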
```cpp
if (args.size() > 1) {
  DecodedVector decodedEstimatedNumItems(*args[1], rows);
  setConstantArgument(
      "originalEstimatedNumItems",
```
Please, check this comment.
```cpp
VELOX_CHECK_EQ(args.size(), 3);
DecodedVector decodedNumBits(*args[2], rows);
setConstantArgument(
    "originalNumBits", originalNumBits_, decodedNumBits);
```
Same.
```cpp
for (vector_size_t i = 0; i < numGroups; ++i) {
  auto group = groups[i];
  auto accumulator = value<BloomFilterAccumulator>(group);
  if (UNLIKELY(!accumulator->initialized())) {
```
Would this happen if we run a masked aggregation and all rows for a given group are masked out? Would you add a test case to verify this code path?
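A hedged sketch of such a test, assuming PlanBuilder's aggregation-mask argument and the vector-making helpers used elsewhere in this PR:

```cpp
// c1 is a boolean mask that is false for every row, so no row reaches the
// accumulator and the bloom filter is never initialized for the group.
auto data = makeRowVector(
    {makeFlatVector<int64_t>({1, 2, 3}), // already-hashed input
     makeConstant(false, 3)});           // aggregation mask

auto plan = PlanBuilder()
                .values({data})
                .singleAggregation({}, {"bloom_filter_agg(c0)"}, {"c1"})
                .planNode();
```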
Can you explain a bit more? About BloomFilterAggAggregateTest.emptyInput: the current tests cover this path, and the Gluten unit tests run into this path too:

```
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from BloomFilterAggAggregateTest
[ RUN      ] BloomFilterAggAggregateTest.basic
not init
[       OK ] BloomFilterAggAggregateTest.basic (40 ms)
[ RUN      ] BloomFilterAggAggregateTest.bloomFilterAggArgument
not init
not init
[       OK ] BloomFilterAggAggregateTest.bloomFilterAggArgument (160 ms)
[ RUN      ] BloomFilterAggAggregateTest.emptyInput
not init
not init
not init
not init
not init
not init
not init
not init
not init
not init
[       OK ] BloomFilterAggAggregateTest.emptyInput (30 ms)
```
```cpp
static void setConstantArgument(
    const char* name,
    int64_t& currentValue,
    const DecodedVector& vec) {
```
do not abbreviate: vec -> vector or decoded
Force-pushed from efc1cd1 to f69a977.
This failure happened again; can you help check it? @mbasmanova

Do you have further comments? @mbasmanova

@jinchengchenghh The CI is red. Would you rebase the PR and make sure CI is green?
Force-pushed from 38e947d to 2e642ce.
The CI passed. @mbasmanova

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
velox/vector/FlatVector.h (outdated)

```cpp
char* getRawStringBufferWithSpace(vector_size_t /* unused */) {
  return nullptr;
};
```
The linter pointed out that this semicolon is not needed; let's remove it.
I'm also seeing that this method lacks documentation and tests. Would you submit a separate PR to introduce this method, document it clearly, and add a test?
Force-pushed from 11cfb31 to 36b31e5.
Fixed the linter warning. Can it be imported? @mbasmanova

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mbasmanova merged this pull request in 86137eb.

Conbench analyzed the 1 benchmark run on this commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.



This function is used in Spark Runtime Filters: apache/spark#35789,
https://docs.google.com/document/d/16IEuyLeQlubQkH8YuVuXWKo2-grVIoDJqQpHZrE7q04/edit#heading=h.4v65wq7vzy4q

The BloomFilter implementation in Velox is different from Spark's; hence, the serialized BloomFilter is different.
Velox has a memory limit for contiguous memory buffers; hence, the BloomFilter capacity is lower than in Spark when numBits is large. See #4713 (comment).
Spark allows changing the defaults while Velox does not.

See also #3342
Fixes #3694