feat: add a backlog-queue to the snuba consumers #7856
Conversation
A debug target for the consumer in VS Code; it runs maturin to build the Rust crate as a pre-step.
This is my new arroyo strategy for the BLQ; see factory_v2 for usage.
Here is where I borrow the DLQ's configuration for the BLQ: if a DLQ is configured, its producer config and topic are reused for the BLQ.
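The DLQ-to-BLQ handoff can be sketched as a minimal, self-contained example with hypothetical simplified types (the real code passes the consumer's `dlq_producer_config` and `dlq_topic` through as `blq_producer_config` and `blq_topic`):

```rust
// Minimal sketch, with hypothetical simplified types, of how the DLQ
// settings double as the BLQ settings: if no DLQ is configured, both
// fields stay `None` and the BLQ step is skipped.
#[derive(Clone, Debug, PartialEq)]
struct KafkaConfig {
    bootstrap_servers: String,
}

fn blq_settings(
    dlq_producer_config: Option<KafkaConfig>,
    dlq_topic: Option<String>,
) -> (Option<KafkaConfig>, Option<String>) {
    // Reuse the DLQ producer config and topic verbatim for the backlog queue.
    (dlq_producer_config, dlq_topic)
}

fn main() {
    let cfg = Some(KafkaConfig { bootstrap_servers: "localhost:9092".into() });
    let topic = Some("dead-letters".to_string());
    let (blq_cfg, blq_topic) = blq_settings(cfg.clone(), topic.clone());
    assert_eq!(blq_cfg, cfg);
    assert_eq!(blq_topic, topic);
    // Without a DLQ there is no BLQ either.
    assert_eq!(blq_settings(None, None), (None, None));
}
```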
Here I wire the BLQ step into the beginning of our existing consumer pipeline.
and here is my feature flag, thanks @kenzoengineer
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Autofix Details
Bugbot Autofix prepared fixes for both issues found in the latest run.
- ✅ Fixed: `ConcurrencyConfig` runtime dropped before strategy uses it
  - Moved BLQ concurrency to a long-lived `blq_concurrency` field on `ConsumerStrategyFactoryV2` and used it when constructing `Produce`, so the runtime outlives the strategy chain.
- ✅ Fixed: Duplicated BLQ router setup across two methods
  - Extracted the repeated BLQ feature-flag/producer/router block into a shared `wrap_with_blq_router` helper used by both pipeline constructors.
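The lifetime issue behind the first fix can be illustrated with a minimal, self-contained sketch (hypothetical simplified types): a value created as a local inside a builder method cannot outlive that method, while a field on the long-lived factory can be borrowed by anything the factory builds.

```rust
// Hypothetical simplified types illustrating the autofix: keep the
// concurrency config on the long-lived factory instead of a method-local.
struct ConcurrencyConfig {
    workers: usize,
}

struct Factory {
    // Lives as long as the factory, so strategies borrowing it stay valid.
    blq_concurrency: ConcurrencyConfig,
}

struct Strategy<'a> {
    concurrency: &'a ConcurrencyConfig,
}

impl Factory {
    fn build(&self) -> Strategy<'_> {
        // Borrowing the field is fine; borrowing a local created here
        // would be rejected because it drops when `build` returns.
        Strategy {
            concurrency: &self.blq_concurrency,
        }
    }
}

fn main() {
    let factory = Factory {
        blq_concurrency: ConcurrencyConfig { workers: 10 },
    };
    let strategy = factory.build();
    assert_eq!(strategy.concurrency.workers, 10);
}
```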
Or push these changes by commenting:
@cursor push 58d1d747f9
Preview (58d1d747f9)
diff --git a/rust_snuba/benches/processors.rs b/rust_snuba/benches/processors.rs
--- a/rust_snuba/benches/processors.rs
+++ b/rust_snuba/benches/processors.rs
@@ -103,6 +103,7 @@
join_timeout_ms: None,
health_check: "arroyo".to_string(),
use_row_binary: false,
+ blq_concurrency: ConcurrencyConfig::with_runtime(concurrency, RUNTIME.handle().to_owned()),
blq_producer_config: None,
blq_topic: None,
};
diff --git a/rust_snuba/src/consumer.rs b/rust_snuba/src/consumer.rs
--- a/rust_snuba/src/consumer.rs
+++ b/rust_snuba/src/consumer.rs
@@ -285,6 +285,7 @@
join_timeout_ms,
health_check: health_check.to_string(),
use_row_binary,
+ blq_concurrency: ConcurrencyConfig::new(10),
blq_producer_config: dlq_producer_config.clone(),
blq_topic: dlq_topic,
};
diff --git a/rust_snuba/src/factory_v2.rs b/rust_snuba/src/factory_v2.rs
--- a/rust_snuba/src/factory_v2.rs
+++ b/rust_snuba/src/factory_v2.rs
@@ -69,6 +69,7 @@
pub join_timeout_ms: Option<u64>,
pub health_check: String,
pub use_row_binary: bool,
+ pub blq_concurrency: ConcurrencyConfig,
pub blq_producer_config: Option<KafkaConfig>,
pub blq_topic: Option<Topic>,
}
@@ -267,43 +268,7 @@
Some(Duration::from_millis(self.join_timeout_ms.unwrap_or(0))),
);
- let blq_enabled_flag = options("snuba")
- .ok()
- .and_then(|o| o.get("consumer.blq_enabled").ok())
- .and_then(|v| v.as_bool())
- .unwrap_or(false);
- let next_step: Box<dyn ProcessingStrategy<KafkaPayload>> =
- if let (true, Some(blq_producer_config), Some(blq_topic)) =
- (blq_enabled_flag, &self.blq_producer_config, self.blq_topic)
- {
- let stale_threshold = TimeDelta::minutes(30);
- let static_friction = TimeDelta::minutes(2);
- tracing::info!(
- "Routing all messages older than {:?} to the topic {:?} with static_friction {:?}",
- stale_threshold,
- self.blq_topic,
- static_friction
- );
- let concurrency = ConcurrencyConfig::new(10);
- let blq_producer = Produce::new(
- CommitOffsets::new(Duration::from_millis(250)),
- KafkaProducer::new(blq_producer_config.clone()),
- &concurrency,
- TopicOrPartition::Topic(blq_topic),
- );
- Box::new(
- BLQRouter::new(
- next_step,
- blq_producer,
- stale_threshold,
- Some(static_friction),
- )
- .expect("invalid BLQRouter config"),
- )
- } else {
- tracing::info!("Not using a backlog-queue",);
- Box::new(next_step)
- };
+ let next_step = self.wrap_with_blq_router(Box::new(next_step));
if let Some(path) = &self.health_check_file {
{
@@ -324,6 +289,49 @@
}
impl ConsumerStrategyFactoryV2 {
+ fn wrap_with_blq_router(
+ &self,
+ next_step: Box<dyn ProcessingStrategy<KafkaPayload>>,
+ ) -> Box<dyn ProcessingStrategy<KafkaPayload>> {
+ let blq_enabled_flag = options("snuba")
+ .ok()
+ .and_then(|o| o.get("consumer.blq_enabled").ok())
+ .and_then(|v| v.as_bool())
+ .unwrap_or(false);
+
+ if let (true, Some(blq_producer_config), Some(blq_topic)) =
+ (blq_enabled_flag, &self.blq_producer_config, self.blq_topic)
+ {
+ let stale_threshold = TimeDelta::minutes(30);
+ let static_friction = TimeDelta::minutes(2);
+ tracing::info!(
+ "Routing all messages older than {:?} to the topic {:?} with static_friction {:?}",
+ stale_threshold,
+ self.blq_topic,
+ static_friction
+ );
+
+ let blq_producer = Produce::new(
+ CommitOffsets::new(Duration::from_millis(250)),
+ KafkaProducer::new(blq_producer_config.clone()),
+ &self.blq_concurrency,
+ TopicOrPartition::Topic(blq_topic),
+ );
+ Box::new(
+ BLQRouter::new(
+ next_step,
+ blq_producer,
+ stale_threshold,
+ Some(static_friction),
+ )
+ .expect("invalid BLQRouter config"),
+ )
+ } else {
+ tracing::info!("Not using a backlog-queue",);
+ next_step
+ }
+ }
+
fn create_row_binary_pipeline<
T: clickhouse::Row + serde::Serialize + Clone + Send + Sync + 'static,
>(
@@ -420,43 +428,7 @@
Some(Duration::from_millis(self.join_timeout_ms.unwrap_or(0))),
);
- let blq_enabled_flag = options("snuba")
- .ok()
- .and_then(|o| o.get("consumer.blq_enabled").ok())
- .and_then(|v| v.as_bool())
- .unwrap_or(false);
- let next_step: Box<dyn ProcessingStrategy<KafkaPayload>> =
- if let (true, Some(blq_producer_config), Some(blq_topic)) =
- (blq_enabled_flag, &self.blq_producer_config, self.blq_topic)
- {
- let stale_threshold = TimeDelta::minutes(30);
- let static_friction = TimeDelta::minutes(2);
- tracing::info!(
- "Routing all messages older than {:?} to the topic {:?} with static_friction {:?}",
- stale_threshold,
- self.blq_topic,
- static_friction
- );
- let concurrency = ConcurrencyConfig::new(10);
- let blq_producer = Produce::new(
- CommitOffsets::new(Duration::from_millis(250)),
- KafkaProducer::new(blq_producer_config.clone()),
- &concurrency,
- TopicOrPartition::Topic(blq_topic),
- );
- Box::new(
- BLQRouter::new(
- next_step,
- blq_producer,
- stale_threshold,
- Some(static_friction),
- )
- .expect("invalid BLQRouter config"),
- )
- } else {
- tracing::info!("Not using a backlog-queue",);
- Box::new(next_step)
- };
+ let next_step = self.wrap_with_blq_router(Box::new(next_step));
if let Some(path) = &self.health_check_file {
if self.health_check == "snuba" {
kenzoengineer left a comment
- I'd like to see tests that validate behaviour when the option value is changed (using the thread-local override guard)
- Make sure that the codepath always calls `init_with_schemas` first
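For the first point, a thread-local override guard could look roughly like this self-contained sketch (hypothetical names; snuba's actual guard API may differ):

```rust
use std::cell::Cell;

// Hypothetical sketch of a thread-local override guard for the
// `consumer.blq_enabled` option; snuba's real guard API may differ.
thread_local! {
    static BLQ_ENABLED_OVERRIDE: Cell<Option<bool>> = Cell::new(None);
}

struct OverrideGuard {
    prev: Option<bool>,
}

impl OverrideGuard {
    fn set(value: bool) -> OverrideGuard {
        // Remember the previous override so nested guards restore correctly.
        let prev = BLQ_ENABLED_OVERRIDE.with(|c| c.replace(Some(value)));
        OverrideGuard { prev }
    }
}

impl Drop for OverrideGuard {
    fn drop(&mut self) {
        BLQ_ENABLED_OVERRIDE.with(|c| c.set(self.prev));
    }
}

fn blq_enabled() -> bool {
    // Fall back to the production default (off) when no override is set.
    BLQ_ENABLED_OVERRIDE.with(|c| c.get()).unwrap_or(false)
}

fn main() {
    assert!(!blq_enabled()); // off by default
    {
        let _guard = OverrideGuard::set(true);
        assert!(blq_enabled()); // overridden on
    }
    assert!(!blq_enabled()); // restored once the guard drops
}
```

Because the override is thread-local and restored on drop, tests can flip the flag without racing against other test threads.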
Force-pushed from 829be7f to 62e9ff7
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
// i know i shouldnt be blocking in submit but there was no better way to do it
// the pipeline cant make progress until this completes anyways so it should be fine
let flush_results = self.producer.join(Some(Duration::from_secs(5))).unwrap();
Hardcoded 5s join timeout with unwrap risks crash loops
Medium Severity
When transitioning from RoutingStale to Flushing, self.producer.join(Some(Duration::from_secs(5))) is called with .unwrap(). If the Kafka broker is slow or unreachable, the 5-second timeout may expire, causing join() to return a StrategyError which .unwrap() converts into a panic with a generic error message. Under sustained broker latency this creates a tight crash loop: stale messages arrive → transition → timeout → panic → restart → repeat.
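A non-panicking alternative might look like this minimal sketch (hypothetical simplified stand-ins for the real producer and arroyo's `StrategyError`): log the timeout and report "not done yet" so the caller can retry on the next poll instead of unwrapping into a panic.

```rust
use std::time::Duration;

// Hypothetical simplified stand-ins for the real producer strategy and
// arroyo's StrategyError; the point is the error handling, not the types.
#[derive(Debug)]
struct StrategyError;

struct Producer {
    healthy: bool,
}

impl Producer {
    fn join(&self, _timeout: Option<Duration>) -> Result<(), StrategyError> {
        if self.healthy { Ok(()) } else { Err(StrategyError) }
    }
}

/// Returns true once the flush completed; on timeout it logs and returns
/// false so the caller can retry on the next poll instead of panicking.
fn try_flush(producer: &Producer) -> bool {
    match producer.join(Some(Duration::from_secs(5))) {
        Ok(()) => true,
        Err(e) => {
            eprintln!("BLQ flush timed out: {:?}; will retry next poll", e);
            false
        }
    }
}

fn main() {
    assert!(!try_flush(&Producer { healthy: false }));
    assert!(try_flush(&Producer { healthy: true }));
}
```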



https://www.notion.so/sentry/project-snuba-backlog-queue-BLQ-3148b10e4b5d809ba443ed6b79606bc6
This adds a `BLQRouter` step to snuba consumers: if a message is fresh it gets passed along as usual; if a message is stale it is routed onto the backlog queue. We are using the DLQ as the backlog queue. It is guarded behind a feature flag that is off by default.

todo: the following are todos I will ensure get done as follow-ups before enabling
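The routing rule above (fresh messages pass through, messages older than the 30-minute stale threshold go to the backlog) can be sketched in a self-contained way; the real `BLQRouter` also applies a 2-minute static friction when switching modes, which is omitted here for brevity.

```rust
use std::time::{Duration, SystemTime};

// Self-contained sketch of the BLQ routing rule; the real BLQRouter also
// applies a 2-minute static friction, omitted here for brevity.
const STALE_THRESHOLD: Duration = Duration::from_secs(30 * 60);

#[derive(Debug, PartialEq)]
enum Route {
    Main,    // fresh: pass along to the normal pipeline
    Backlog, // stale: produce onto the backlog queue (the DLQ topic)
}

fn route(message_ts: SystemTime, now: SystemTime) -> Route {
    match now.duration_since(message_ts) {
        Ok(age) if age > STALE_THRESHOLD => Route::Backlog,
        // Fresh message, or clock skew put the timestamp in the future.
        _ => Route::Main,
    }
}

fn main() {
    let now = SystemTime::now();
    assert_eq!(route(now, now), Route::Main);
    assert_eq!(route(now - Duration::from_secs(3600), now), Route::Backlog);
}
```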