
Conversation


@Logiraptor Logiraptor commented Oct 8, 2025

This PR builds on open-telemetry#43201 (pending) to parallelize the decision ticker. It also fixes the open issue in open-telemetry#41656 by moving all state management to a single goroutine (per shard).

In practice at Grafana Labs, we've seen that the single-threaded nature of the decision ticker is the primary bottleneck when running the tail sampling processor at high scale. For example, we have one large cluster performing around 12M policy evaluations per second while maintaining a p99 tick latency of around 500ms. To reach this scale, we need to run ~200 replicas just to get enough parallel threads of execution. By adding support for parallel decision ticks, we hope to run fewer, larger replicas, which should reduce the fragmentation caused by trace ID load balancing and better balance spiky load across replicas.


Another benefit here is to reduce contention on the idToTrace map and generally simplify reasoning about concurrency by giving each trace a clear "owner" goroutine. Previously we'd try to do some work inside the calling goroutine of ConsumeTraces. Instead, new work is passed via channels to its dedicated shard. By default we run one shard per available core. Shared state is kept to the minimum required to implement TSP semantics:

  1. The trace limiter is shared, which allows the overall memory consumption to be reliably capped even in case of uneven shard utilization.
  2. Both decision caches are shared, which again allows overall memory consumption to be capped even with uneven shard utilization.
  3. Policy evaluators are shared, since they may have internal state which is meant to be global to the processor.

In practice, shard utilization should be very even, but it's still nice to have hard guarantees on the total number of traces in memory. Sharing this state also allows us to avoid changing any tests in this PR, which greatly increases my confidence in the code despite the size of the change. Ideally the switch to parallel execution is invisible to end users (aside from the much higher throughput).
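
To make the per-shard ownership model concrete, here is a minimal Go sketch, assuming hypothetical names (`shard`, `newTraceWork`, `run`) that do not necessarily match the identifiers in the actual code:

```go
package main

import (
	"runtime"
	"time"
)

type newTraceWork struct {
	traceID [16]byte
	// spans, resource attributes, etc. would travel with the work item
}

type shard struct {
	work      chan newTraceWork // new work handed off from ConsumeTraces
	idToTrace map[[16]byte]any  // owned exclusively by this shard's goroutine
}

// run is the single goroutine that owns all mutable state for this shard.
func (s *shard) run(tickInterval time.Duration) {
	ticker := time.NewTicker(tickInterval)
	defer ticker.Stop()
	for {
		select {
		case w := <-s.work:
			// No locks needed: only this goroutine touches idToTrace.
			s.idToTrace[w.traceID] = w
		case <-ticker.C:
			// Decision tick: evaluate policies for this shard's traces.
			for id := range s.idToTrace {
				_ = id // decide, release spans, or keep waiting
			}
		}
	}
}

func main() {
	shards := make([]*shard, runtime.GOMAXPROCS(0)) // one shard per available core
	for i := range shards {
		shards[i] = &shard{
			work:      make(chan newTraceWork, 128),
			idToTrace: make(map[[16]byte]any),
		}
		go shards[i].run(time.Second)
	}
}
```

Because only the shard's own goroutine reads or writes its `idToTrace`, no per-trace locking is needed; only the shared limiter, decision caches, and policy evaluators require synchronization.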

Sharding makes use of the maphash package with a unique `Seed`, which should provide even shard distribution even if traceIDs have previously been sampled downstream. This is better than, e.g., a simple modulo on the traceID, since traceIDs are not necessarily uniformly distributed after head sampling.
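
As a rough illustration (hypothetical names, not the processor's actual code), the shard-selection step with a per-process `maphash.Seed` could look like this:

```go
package main

import (
	"fmt"
	"hash/maphash"
)

// seed is created once per process, so shard assignment is stable within a
// replica but re-mixed relative to any upstream hashing of the trace ID.
var seed = maphash.MakeSeed()

func shardFor(traceID [16]byte, numShards int) int {
	return int(maphash.Bytes(seed, traceID[:]) % uint64(numShards))
}

func main() {
	id := [16]byte{0x3c, 0x73, 0x81, 0xc1} // truncated example trace ID
	fmt.Println(shardFor(id, 8))
}
```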

@Logiraptor Logiraptor force-pushed the parallel-tsp branch 6 times, most recently from d442d12 to 19887c6 on October 13, 2025 at 14:16
Logiraptor pushed a commit that referenced this pull request Oct 13, 2025
…b.uid (open-telemetry#42641)

#### Description

This PR adds support to expose `k8s.cronjob.uid` as resource metadata
when a `Job` is owned by a `CronJob`.

#### Link to tracking issue

Fixes open-telemetry#42557

#### Testing

Tested locally with `telemetrygen` and it is working as expected.

```
[pod/k8sevents-receiver-opentelemetry-collector-6fd9966559-brlb6/opentelemetry-collector] {"level":"debug","ts":"2025-09-11T16:29:11.588Z","caller":"[email protected]/processor.go:159","msg":"getting the pod","resource":{"service.instance.id":"9631e38b-aec3-439f-8178-d96fc8368e1e","service.name":"otelcontribcol","service.version":"0.135.0-dev"},"otelcol.component.id":"k8sattributes","otelcol.component.kind":"processor","otelcol.pipeline.id":"traces","otelcol.signal":"traces","pod":{"Name":"otel-log-cronjob-29293469-lw97x","Address":"10.244.0.70","PodUID":"7960681c-5a24-4287-8bea-e2cf506500ee","Attributes":{"k8s.cronjob.name":"otel-log-cronjob","k8s.cronjob.uid":"082b1c42-e393-46bc-9d51-b20a3700d1ab","k8s.job.name":"otel-log-cronjob-29293469","k8s.job.uid":"fbd853b8-7f63-44d8-ace1-8b48c89e3041"},"StartTime":"2025-09-11T16:29:00Z","Ignore":false,"Namespace":"default","NodeName":"","DeploymentUID":"","StatefulSetUID":"","DaemonSetUID":"","JobUID":"fbd853b8-7f63-44d8-ace1-8b48c89e3041","HostNetwork":false,"Containers":{"ByID":null,"ByName":null},"DeletedAt":"0001-01-01T00:00:00Z"}}
[pod/k8sevents-receiver-opentelemetry-collector-6fd9966559-brlb6/opentelemetry-collector] {"level":"info","ts":"2025-09-11T16:29:11.588Z","msg":"Traces","resource":{"service.instance.id":"9631e38b-aec3-439f-8178-d96fc8368e1e","service.name":"otelcontribcol","service.version":"0.135.0-dev"},"otelcol.component.id":"debug","otelcol.component.kind":"exporter","otelcol.signal":"traces","resource spans":1,"spans":2}
[pod/k8sevents-receiver-opentelemetry-collector-6fd9966559-brlb6/opentelemetry-collector] {"level":"info","ts":"2025-09-11T16:29:11.588Z","msg":"ResourceSpans #0\nResource SchemaURL: https://opentelemetry.io/schemas/1.4.0\nResource attributes:\n     -> k8s.container.name: Str(telemetrygen)\n     -> service.name: Str(telemetrygen)\n     -> k8s.pod.ip: Str(10.244.0.70)\n     -> k8s.cronjob.name: Str(otel-log-cronjob)\n     -> k8s.cronjob.uid: Str(082b1c42-e393-46bc-9d51-b20a3700d1ab)\n     -> k8s.job.uid: Str(fbd853b8-7f63-44d8-ace1-8b48c89e3041)\n     -> k8s.job.name: Str(otel-log-cronjob-29293469)\nScopeSpans #0\nScopeSpans SchemaURL: \nInstrumentationScope telemetrygen \nSpan #0\n    Trace ID       : 3c7381c14a37814676b00a7d961cb219\n    Parent ID      : 4f8780d5148a9c1c\n    ID             : 17e9da9533dc93ca\n    Name           : okey-dokey-0\n    Kind           : Server\n    Start time     : 2025-09-11 16:29:09.583785469 +0000 UTC\n    End time       : 2025-09-11 16:29:09.583908469 +0000 UTC\n    Status code    : Unset\n    Status message : \nAttributes:\n     -> net.peer.ip: Str(1.2.3.4)\n     -> peer.service: Str(telemetrygen-client)\nSpan #1\n    Trace ID       : 3c7381c14a37814676b00a7d961cb219\n    Parent ID      : \n    ID             : 4f8780d5148a9c1c\n    Name           : lets-go\n    Kind           : Client\n    Start time     : 2025-09-11 16:29:09.583785469 +0000 UTC\n    End time       : 2025-09-11 16:29:09.583908469 +0000 UTC\n    Status code    : Unset\n    Status message : \nAttributes:\n     -> net.peer.ip: Str(1.2.3.4)\n     -> peer.service: Str(telemetrygen-server)\n","resource":{"service.instance.id":"9631e38b-aec3-439f-8178-d96fc8368e1e","service.name":"otelcontribcol","service.version":"0.135.0-dev"},"otelcol.component.id":"debug","otelcol.component.kind":"exporter","otelcol.signal":"traces"}
```

Tests were also added to verify the functionality.

---------

Signed-off-by: Paulo Dias <[email protected]>
@github-actions

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Oct 29, 2025
@github-actions

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Nov 12, 2025
atoulme pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this pull request Nov 19, 2025
#### Description
This PR changes the tailsampling processor to do all work (append traces
to a batch, decide on which traces from a batch are sampled, forward to
the next processor) in a single goroutine. At Grafana Labs we have found
that there is significant contention between the various mutexes such
that little performance is gained by allowing traces to be updated while
a policy evaluation is occurring. In addition, the locking adds a lot of
complexity which makes future improvements more challenging, and
introduces subtle concurrency issues such as seen in the tracking issue
linked below.

We (mostly @Logiraptor) also experimented with using multiple workers, each with its own goroutine to process traces, but we found little improvement because, once again, various contention issues showed up. See
Logiraptor#1 (or
the first commit in this PR) for more details.

If additional performance is necessary for slow decision tick latencies
I would suggest adding parallelism just to the call to `makeDecision`,
which would allow additional CPUs to be used without adding contention
on `idToTrace` or other locations.
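
As a rough sketch of that suggestion (illustrative names only; `decide` stands in for a `makeDecision`-style call and is not the processor's actual API), the fan-out could look like this:

```go
package main

import (
	"runtime"
	"sync"
)

type decision int

// decide stands in for a makeDecision-style policy evaluation for one trace.
func decide(traceID [16]byte) decision {
	return 0
}

// decideAll fans policy evaluation out across a bounded number of goroutines.
// Only the CPU-bound evaluation runs in parallel; the caller (the single
// state-owning goroutine) still applies the results, so idToTrace is never
// touched concurrently.
func decideAll(ids [][16]byte) []decision {
	out := make([]decision, len(ids))
	sem := make(chan struct{}, runtime.GOMAXPROCS(0)) // cap parallelism at core count
	var wg sync.WaitGroup
	for i, id := range ids {
		wg.Add(1)
		sem <- struct{}{}
		go func(i int, id [16]byte) {
			defer wg.Done()
			defer func() { <-sem }()
			out[i] = decide(id)
		}(i, id)
	}
	wg.Wait()
	return out
}

func main() {
	_ = decideAll([][16]byte{{1}, {2}, {3}})
}
```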

#### Link to tracking issue
Fixes #41656 
Working towards
#43876

#### Testing
Tests were updated to make sure the tick behavior remains correct and
consistent. In addition, the syncBatcher needed to be updated to
properly implement `Stop`.

---------

Co-authored-by: Patrick Oyarzun <[email protected]>