@jbrooks-stripe jbrooks-stripe commented Jul 2, 2024

Summary

Adds a BOUNDED_UNIQUE_COUNT aggregation. It computes exact unique/distinct counts but caps the result at a given value, which keeps memory usage bounded.
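
For readers skimming the thread, here is a minimal sketch of what a capped exact distinct count can look like. It is an illustration under stated assumptions, not the code in this PR: the class name `BoundedUniqueCountSketch`, the `prepare` and `finalizeCount` method names, and the use of a plain `java.util.HashSet` are all hypothetical, and `k` is assumed to be positive.

```scala
import java.util

// Hypothetical sketch, not the PR's implementation: an exact distinct count capped at k.
final class BoundedUniqueCountSketch[T](k: Int) {
  // Start from an empty intermediate result (IR).
  def prepare(): util.Set[T] = new util.HashSet[T]()

  // Add one value unless the cap has already been reached.
  def update(ir: util.Set[T], input: T): util.Set[T] = {
    if (ir.size() < k) ir.add(input)
    ir
  }

  // Fold ir2 into ir1, stopping as soon as ir1 holds k values.
  def merge(ir1: util.Set[T], ir2: util.Set[T]): util.Set[T] = {
    val it = ir2.iterator()
    while (ir1.size() < k && it.hasNext) ir1.add(it.next())
    ir1
  }

  // The result is the number of distinct values retained, which is at most k.
  def finalizeCount(ir: util.Set[T]): Long = ir.size().toLong
}
```

Because the set never grows past `k` entries, the memory held per key is bounded by `k` values regardless of how many rows are aggregated.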

Why / Goal

We have use cases where we'd prefer an exact solution over the approximate equivalents, but we want protections in place so that memory doesn't become an issue.

Test Plan

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested

Checklist

  • Documentation update

Reviewers

@jbrooks-stripe jbrooks-stripe force-pushed the jbrooks-bounded-unique-count branch 6 times, most recently from 685e0b9 to 3c368e8 Compare July 2, 2024 21:46
@jbrooks-stripe jbrooks-stripe marked this pull request as ready for review July 3, 2024 19:31

@pengyu-hou pengyu-hou left a comment

@jbrooks-stripe the change looks good. Could you please also update the operations in group_by.py: https://github.com/airbnb/chronon/blob/main/api/py/ai/chronon/group_by.py#L56

Could you follow the same style as HISTOGRAM and HISTOGRAM_K? Thanks!!

@pengyu-hou pengyu-hou force-pushed the jbrooks-bounded-unique-count branch from 3c160e0 to d5746da Compare July 4, 2025 00:37
override def irType: DataType = ListType(StringType)

override def merge(ir1: util.Set[String], ir2: util.Set[String]): util.Set[String] = {
  ir2.asScala.foreach(v =>

Suggested change
- ir2.asScala.foreach(v =>
+ ir2.iterator().asScala.foreach(v =>

otherwise it will create intermediate collections

Good call out!

done

@nikhil-zlai nikhil-zlai left a comment

Thanks for putting this together. I have some minor perf-related comments.

}

private def md5Hex(bytes: Array[Byte]): String =
  MessageDigest.getInstance("MD5").digest(bytes).map("%02x".format(_)).mkString

Let's say I want to unique count a bunch of user / merchant ids (long values). Won't this be less efficient than simply keeping the set of longs?

Made the code change to keep numeric types as is.
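
One way to picture the trade-off discussed here is the sketch below. It is only an assumption about how such a change could look: the object name `UniqueKeySketch` and the `toKey` helper are hypothetical, and which input types the PR actually leaves unhashed is not shown in this thread. The idea is that a boxed long is much smaller than a 32-character hex digest, so numeric inputs are stored unchanged, while strings and byte arrays are reduced to a fixed-size MD5 hex string.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object UniqueKeySketch {
  // Same hex-encoding approach as the md5Hex helper quoted in the diff above.
  def md5Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("MD5").digest(bytes).map("%02x".format(_)).mkString

  // Hypothetical helper: numeric inputs are kept unchanged, everything else is
  // replaced by a fixed-size MD5 hex digest before it goes into the set.
  def toKey(input: Any): Any = input match {
    case n: java.lang.Number => n
    case s: String           => md5Hex(s.getBytes(StandardCharsets.UTF_8))
    case b: Array[Byte]      => md5Hex(b)
    case other               => md5Hex(other.toString.getBytes(StandardCharsets.UTF_8))
  }
}
```

With this shape, `toKey(42L)` returns the long itself, while `toKey("abc")` returns a 32-character digest, so variable-length strings cost a fixed amount per entry.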


override def update(ir: util.Set[String], input: T): util.Set[String] = {
  if (ir.size() >= k) {
    return ir

Memory optimization: we can use a sentinel set once k is reached.

Suggested change
- return ir
+ if(ir == Constants.SentinelSet || ir.size() >= k) return Constants.SentinelSet

Hm, I don't think we have a sentinel set in the OSS branch yet.

  ir1
}

override def finalize(ir: util.Set[String]): Long = ir.size()

Suggested change
- override def finalize(ir: util.Set[String]): Long = ir.size()
+ override def finalize(ir: util.Set[String]): Long = if(ir == Constants.SentinelSet) k else ir.size()
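
The sentinel idea in the two suggestions above can be sketched as follows. This is a hedged illustration, not the merged code: `SentinelSketch`, `isSentinel`, and `finalizeCount` are hypothetical names, the sentinel is assumed to be a single shared empty set, and `k` is assumed to be positive. One detail worth noting is that Scala's `==` on two `java.util.Set` values calls `equals`, which compares contents, so an empty sentinel would compare equal to any empty set. The sketch uses `eq` (reference equality) for that reason.

```scala
import java.util

object SentinelSketch {
  // Hypothetical shared marker meaning "the cap k was reached". It stays empty.
  val SentinelSet: util.Set[Any] = new util.HashSet[Any]()

  // Reference equality: only the shared instance itself counts as the sentinel.
  def isSentinel(ir: util.Set[Any]): Boolean = ir eq SentinelSet

  def update(ir: util.Set[Any], input: Any, k: Int): util.Set[Any] =
    if (isSentinel(ir)) ir
    else {
      ir.add(input)
      // Once k distinct values are held, drop the full set and return the marker.
      if (ir.size() >= k) SentinelSet else ir
    }

  // A sentinel IR reports the cap; any other IR reports its own size.
  def finalizeCount(ir: util.Set[Any], k: Int): Long =
    if (isSentinel(ir)) k.toLong else ir.size().toLong
}
```

After the cap is hit, every capped key shares one empty set instead of each holding `k` values, which is where the memory saving comes from.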

Comment on lines +691 to +696
if (ir == BoundedUniqueCount.SentinelSet) {
  val list = new util.ArrayList[Any]()
  list.add(BoundedUniqueCount.SentinelMarker)
  list
} else {
  new util.ArrayList[Any](ir)

Aren't both of these the same? Can we just do:

Suggested change
- if (ir == BoundedUniqueCount.SentinelSet) {
-   val list = new util.ArrayList[Any]()
-   list.add(BoundedUniqueCount.SentinelMarker)
-   list
- } else {
-   new util.ArrayList[Any](ir)
+ new util.ArrayList[Any](ir)

Hi @nikhil-zlai, this is used to differentiate the empty sentinel set from an actual empty set.
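
To illustrate why the two branches are not equivalent, here is a small round-trip sketch. It rests on assumptions: `SentinelCodecSketch`, `normalize`, and `denormalize` are hypothetical names, and the marker value `"__SENTINEL__"` is invented for the example, since the actual value of `BoundedUniqueCount.SentinelMarker` is not shown in this thread. If the sentinel is an empty set, copying it into a list yields an empty list, which is indistinguishable from a key that has simply seen no values yet.

```scala
import java.util

object SentinelCodecSketch {
  // Hypothetical shared marker set, assumed to be empty as in the discussion above.
  val SentinelSet: util.Set[Any] = new util.HashSet[Any]()
  // Hypothetical marker value, chosen only for this example.
  val SentinelMarker: String = "__SENTINEL__"

  // Serialize the IR. The sentinel becomes a one-element list holding the marker,
  // so it does not collapse into the same empty list as a real empty set.
  def normalize(ir: util.Set[Any]): util.ArrayList[Any] =
    if (ir eq SentinelSet) {
      val list = new util.ArrayList[Any]()
      list.add(SentinelMarker)
      list
    } else new util.ArrayList[Any](ir)

  // Deserialize. A list holding only the marker maps back to the shared sentinel.
  def denormalize(list: util.List[Any]): util.Set[Any] =
    if (list.size() == 1 && list.get(0) == SentinelMarker) SentinelSet
    else new util.HashSet[Any](list)
}
```

A limitation of this particular sketch is that a genuine one-element set containing the marker string would be read back as the sentinel, so a real marker needs to be a value that cannot occur in the data.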

@nikhil-zlai nikhil-zlai left a comment

This looks great! thanks pengyu!

@pengyu-hou pengyu-hou merged commit 5da9c32 into main Jul 11, 2025
9 checks passed
@pengyu-hou pengyu-hou deleted the jbrooks-bounded-unique-count branch July 11, 2025 22:55

6 participants