Add bounded unique count aggregation #781

jbrooks-stripe · 2024-07-02T21:17:24Z

Summary

Adds a BOUNDED_UNIQUE_COUNT aggregation. It produces exact unique (distinct) counts up to a configured cap, so memory usage stays bounded.
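
To make the capping behaviour concrete, here is a minimal sketch, not the code from this PR. The object name `BoundedUniqueCountSketch` and the helpers `update` and `count` are hypothetical; the only idea taken from the summary is that the set stops growing once it holds `k` distinct values.

```scala
import java.util

// Hypothetical, simplified stand-in for the aggregation described in the summary.
object BoundedUniqueCountSketch {
  // Add `input` only while the set holds fewer than k distinct values.
  def update[T](ir: util.Set[T], input: T, k: Int): util.Set[T] = {
    if (ir.size() < k) ir.add(input)
    ir
  }

  // Exact below the cap, and never larger than k.
  def count[T](ir: util.Set[T]): Long = ir.size().toLong

  def main(args: Array[String]): Unit = {
    val ir = new util.HashSet[String]()
    // Four distinct values (a, b, c, d), capped at k = 3.
    Seq("a", "b", "a", "c", "d").foreach(v => update(ir, v, 3))
    println(count(ir)) // prints 3
  }
}
```

With a cap of 3 the result is exact for inputs with up to three distinct values and reads 3 for anything larger, so the set never holds more than `k` entries.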

Why / Goal

We have use cases where we'd prefer an exact solution over the approximate equivalents, but want protections in place so that memory doesn't become an issue.

Test Plan

Added Unit Tests
Covered by existing CI
Integration tested

Checklist

Documentation update

Reviewers

pengyu-hou

@jbrooks-stripe the change looks good. Could you please also update the operation in group_by.py at https://github.com/airbnb/chronon/blob/main/api/py/ai/chronon/group_by.py#L56?

Could you follow the same style as HISTOGRAM and HISTOGRAM_K? Thanks!!

nikhil-zlai · 2025-07-09T21:23:55Z

aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala

+  override def irType: DataType = ListType(StringType)
+
+  override def merge(ir1: util.Set[String], ir2: util.Set[String]): util.Set[String] = {
+    ir2.asScala.foreach(v =>


Suggested change

ir2.asScala.foreach(v =>

ir2.iterator().asScala.foreach(v =>

otherwise it will create intermediate collections

Good call out!
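
For reference, a self-contained sketch of a merge written in the suggested iterator form. The diff context above truncates the body of `merge`, so the cap check inside the loop and the explicit `k` parameter are assumptions, and `MergeSketch` is a hypothetical name.

```scala
import java.util
// JavaConverters is available on Scala 2.11 to 2.13;
// on 2.13+ scala.jdk.CollectionConverters._ provides the same .asScala.
import scala.collection.JavaConverters._

object MergeSketch {
  // Fold ir2 into ir1 by walking ir2's Java iterator directly.
  // Assumption: additions stop once ir1 already holds k values.
  def merge(ir1: util.Set[String], ir2: util.Set[String], k: Int): util.Set[String] = {
    ir2.iterator().asScala.foreach { v =>
      if (ir1.size() < k) ir1.add(v)
    }
    ir1
  }
}
```

Like the diff context above, the sketch mutates and returns `ir1` rather than allocating a new set.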

nikhil-zlai

thanks for putting this together. have some minor perf related comments.

nikhil-zlai · 2025-07-09T21:24:10Z

aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala

+  override def irType: DataType = ListType(StringType)
+
+  override def merge(ir1: util.Set[String], ir2: util.Set[String]): util.Set[String] = {
+    ir2.asScala.foreach(v =>


otherwise it will create intermediate collections

nikhil-zlai · 2025-07-09T21:27:45Z

aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala

+  }
+
+  private def md5Hex(bytes: Array[Byte]): String =
+    MessageDigest.getInstance("MD5").digest(bytes).map("%02x".format(_)).mkString


Let's say I want to unique-count a bunch of user / merchant IDs (long values). Won't this be less efficient than simply keeping the set of longs?

made the code change to keep the numeric type as is
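
One way to picture "keep the numeric type as is" is the sketch below. The helper `toKey` and the object `KeySketch` are hypothetical and not taken from the PR; only `md5Hex` mirrors the diff context above. Numeric inputs are stored unchanged, and everything else is reduced to an MD5 hex digest of its string form.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object KeySketch {
  // Same shape as the md5Hex helper in the diff context: 16 digest bytes as 32 hex characters.
  private def md5Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("MD5").digest(bytes).map("%02x".format(_)).mkString

  // Hypothetical helper: boxed numerics (Long, Int, ...) pass through untouched,
  // any other value is replaced by the digest of its UTF-8 string form.
  def toKey(input: Any): Any = input match {
    case n: java.lang.Number => n
    case other               => md5Hex(other.toString.getBytes(StandardCharsets.UTF_8))
  }
}
```

For example, `KeySketch.toKey(42L)` stays `42L`, while `KeySketch.toKey("abc")` becomes a 32-character hex string. A boxed `Long` is smaller than a 32-character `String`, which is the efficiency point raised in the comment.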

nikhil-zlai · 2025-07-09T21:30:19Z

aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala

+
+  override def update(ir: util.Set[String], input: T): util.Set[String] = {
+    if (ir.size() >= k) {
+      return ir


memory optimization: we can use a sentinel set when k is reached.

Suggested change

return ir

if(ir == Constants.SentinelSet || ir.size() >= k) return Constants.SentinelSet

Hm, I don't think we have a sentinel set yet in the OSS branch.

nikhil-zlai · 2025-07-09T21:30:52Z

aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala

+    ir1
+  }
+
+  override def finalize(ir: util.Set[String]): Long = ir.size()


Suggested change

override def finalize(ir: util.Set[String]): Long = ir.size()

override def finalize(ir: util.Set[String]): Long = if(ir == Constants.SentinelSet) k else ir.size()
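
Putting the two suggestions together, a minimal sketch of the sentinel pattern follows. Since `Constants.SentinelSet` is not available in the OSS branch, the sketch defines its own hypothetical `SentinelSet`; `finalizeCount` and the explicit `k` parameter are likewise assumptions made to keep the example self-contained.

```scala
import java.util

object SentinelSketch {
  // Hypothetical sentinel: one shared instance that stands for "cap reached".
  val SentinelSet: util.Set[Any] = new util.HashSet[Any]()

  // Once the cap is hit, drop the accumulated set and hand back the sentinel.
  def update(ir: util.Set[Any], input: Any, k: Int): util.Set[Any] =
    if ((ir eq SentinelSet) || ir.size() >= k) SentinelSet
    else {
      ir.add(input)
      ir
    }

  // The sentinel reports the cap itself; any other set reports its exact size.
  def finalizeCount(ir: util.Set[Any], k: Int): Long =
    if (ir eq SentinelSet) k.toLong else ir.size().toLong
}
```

The sketch compares by reference (`eq`) because `==` on a `java.util.Set` compares contents, so an empty sentinel would also equal any freshly created empty set. After the switch to the sentinel, the `k` stored values are no longer referenced by the intermediate result and can be garbage collected.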

nikhil-zlai · 2025-07-11T21:27:21Z

aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala

+    if (ir == BoundedUniqueCount.SentinelSet) {
+      val list = new util.ArrayList[Any]()
+      list.add(BoundedUniqueCount.SentinelMarker)
+      list
+    } else {
+      new util.ArrayList[Any](ir)


aren't both of these same - can we just do

Suggested change

if (ir == BoundedUniqueCount.SentinelSet) {
  val list = new util.ArrayList[Any]()
  list.add(BoundedUniqueCount.SentinelMarker)
  list
} else {
  new util.ArrayList[Any](ir)

new util.ArrayList[Any](ir)

Hi @nikhil-zlai, this is used to distinguish the sentinel set, which is itself empty, from an actual empty set.
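
A sketch of why the marker matters when the intermediate result is turned into a list and back. The names mirror the diff context, but the values of `SentinelSet` and `SentinelMarker` here are hypothetical, and `denormalize` is an assumed inverse that does not appear in the thread.

```scala
import java.util

object NormalizeSketch {
  val SentinelSet: util.Set[Any] = new util.HashSet[Any]() // hypothetical: an empty shared instance
  val SentinelMarker: String = "__SENTINEL__"              // hypothetical marker value

  // The sentinel is written out as a one-element list holding the marker,
  // so it cannot be confused with the empty list produced by an empty set.
  def normalize(ir: util.Set[Any]): util.ArrayList[Any] =
    if (ir eq SentinelSet) {
      val list = new util.ArrayList[Any]()
      list.add(SentinelMarker)
      list
    } else new util.ArrayList[Any](ir)

  // Assumed inverse: a list containing only the marker maps back to the sentinel.
  def denormalize(list: util.ArrayList[Any]): util.Set[Any] =
    if (list.size() == 1 && list.get(0) == SentinelMarker) SentinelSet
    else new util.HashSet[Any](list)
}
```

Without the marker, both the sentinel and a genuinely empty set would be written as an empty list, and a capped result would read back as a count of zero. One caveat of this sketch is that an input value equal to the marker string would be misread as the sentinel.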

nikhil-zlai

This looks great! thanks pengyu!

jbrooks-stripe force-pushed the jbrooks-bounded-unique-count branch 6 times, most recently from 685e0b9 to 3c368e8 Compare July 2, 2024 21:46

jbrooks-stripe marked this pull request as ready for review July 3, 2024 19:31

piyushn-stripe approved these changes Jul 5, 2024

pengyu-hou reviewed Aug 23, 2024

varant-zlai mentioned this pull request Jan 29, 2025

Add bounded unique count aggregation #781 zipline-ai/chronon#299

Closed

4 tasks

hzding621 approved these changes Jul 3, 2025

jbrooks-stripe and others added 2 commits July 3, 2025 15:36

Add bounded unique count aggregation

dac9f91

use hash value instead

d5746da

pengyu-hou force-pushed the jbrooks-bounded-unique-count branch from 3c160e0 to d5746da Compare July 4, 2025 00:37

pengyu-hou added 2 commits July 3, 2025 17:58

scala fmt

03a3a3b

fix ir type

fd27c7a

nikhil-zlai reviewed Jul 9, 2025

pengyu-hou added 4 commits July 9, 2025 15:14

use iterator

0ca15cd

scala fmt

5f1ff09

Optimize BoundedUniqueCount for numeric types with sentinel set pattern

996d050

Merge branch 'main' into jbrooks-bounded-unique-count

941a060

nikhil-zlai reviewed Jul 11, 2025

nikhil-zlai approved these changes Jul 11, 2025

pengyu-hou merged commit 5da9c32 into main Jul 11, 2025
9 checks passed

pengyu-hou deleted the jbrooks-bounded-unique-count branch July 11, 2025 22:55
