
Conversation


@Logiraptor Logiraptor commented Oct 8, 2025

This PR builds on open-telemetry#43201 (pending) to parallelize the decision ticker. It also fixes the open issue in open-telemetry#41656 by moving all state management to a single goroutine (per shard).

In practice at Grafana Labs, we've seen that the single-threaded nature of the decision ticker is the primary bottleneck when running the tail sampling processor at high scale. For example, we have one large cluster performing around 12M policy evaluations per second while maintaining a p99 tick latency of around 500ms. To reach this scale, we need to run ~200 replicas just to get enough parallel threads of execution. By adding support for parallel decision ticks, we hope to run fewer, larger replicas, which should reduce the fragmentation caused by trace ID load balancing and better balance spiky load across replicas.


Another benefit here is to reduce contention on the idToTrace map and generally simplify reasoning about concurrency by giving each trace a clear "owner" goroutine. Previously we'd try to do some work inside the calling goroutine of ConsumeTraces. Instead, new work is passed via channels to its dedicated shard. By default we run one shard per available core. Shared state is kept to the minimum required to implement TSP semantics:

  1. The trace limiter is shared, which allows the overall memory consumption to be reliably capped even in case of uneven shard utilization.
  2. Both decision caches are shared, which again allows overall memory consumption to be capped even with uneven shard utilization.
  3. Policy evaluators are shared, since they may have internal state which is meant to be global to the processor.

In practice, shard utilization should be very even, but it's still nice to have hard guarantees on the total number of traces in memory. Sharing this state also allows us to avoid changing any tests in this PR, which greatly increases my confidence in the code despite the size of the change. Ideally the switch to parallel execution is invisible to end users (aside from the much higher throughput).
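
To make the per-shard ownership model concrete, here is a minimal Go sketch, assuming hypothetical names (`shard`, `newTraceWork`, `run`) that do not necessarily match the identifiers in the actual code:

```go
package main

import (
	"runtime"
	"time"
)

type newTraceWork struct {
	traceID [16]byte
	// spans, resource attributes, etc. would travel with the work item
}

type shard struct {
	work      chan newTraceWork // new work handed off from ConsumeTraces
	idToTrace map[[16]byte]any  // owned exclusively by this shard's goroutine
}

// run is the single goroutine that owns all mutable state for this shard.
func (s *shard) run(tickInterval time.Duration) {
	ticker := time.NewTicker(tickInterval)
	defer ticker.Stop()
	for {
		select {
		case w := <-s.work:
			// No locks needed: only this goroutine touches idToTrace.
			s.idToTrace[w.traceID] = w
		case <-ticker.C:
			// Decision tick: evaluate policies for this shard's traces.
			for id := range s.idToTrace {
				_ = id // decide, release spans, or keep waiting
			}
		}
	}
}

func main() {
	shards := make([]*shard, runtime.GOMAXPROCS(0)) // one shard per available core
	for i := range shards {
		shards[i] = &shard{
			work:      make(chan newTraceWork, 128),
			idToTrace: make(map[[16]byte]any),
		}
		go shards[i].run(time.Second)
	}
}
```

Because only the shard's own goroutine reads or writes its `idToTrace`, no per-trace locking is needed; only the shared limiter, decision caches, and policy evaluators require synchronization.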

Sharding makes use of the maphash package with a unique `Seed`, which should provide even shard distribution even if traceIDs have previously been sampled downstream. This is better than, e.g., a simple modulo on the traceID, since traceIDs are not necessarily uniformly distributed after head sampling.
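
As a rough illustration (hypothetical names, not the processor's actual code), the shard-selection step with a per-process `maphash.Seed` could look like this:

```go
package main

import (
	"fmt"
	"hash/maphash"
)

// seed is created once per process, so shard assignment is stable within a
// replica but re-mixed relative to any upstream hashing of the trace ID.
var seed = maphash.MakeSeed()

func shardFor(traceID [16]byte, numShards int) int {
	return int(maphash.Bytes(seed, traceID[:]) % uint64(numShards))
}

func main() {
	id := [16]byte{0x3c, 0x73, 0x81, 0xc1} // truncated example trace ID
	fmt.Println(shardFor(id, 8))
}
```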

@Logiraptor Logiraptor force-pushed the parallel-tsp branch 6 times, most recently from d442d12 to 19887c6 on October 13, 2025 at 14:16
Logiraptor pushed a commit that referenced this pull request Oct 13, 2025
…b.uid (open-telemetry#42641)

#### Description

This PR adds support to expose `k8s.cronjob.uid` as resource metadata
when a `Job` is owned by a `CronJob`.

#### Link to tracking issue

Fixes open-telemetry#42557

#### Testing

Tested locally with `telemetrygen` and it is working as expected.

```
[pod/k8sevents-receiver-opentelemetry-collector-6fd9966559-brlb6/opentelemetry-collector] {"level":"debug","ts":"2025-09-11T16:29:11.588Z","caller":"[email protected]/processor.go:159","msg":"getting the pod","resource":{"service.instance.id":"9631e38b-aec3-439f-8178-d96fc8368e1e","service.name":"otelcontribcol","service.version":"0.135.0-dev"},"otelcol.component.id":"k8sattributes","otelcol.component.kind":"processor","otelcol.pipeline.id":"traces","otelcol.signal":"traces","pod":{"Name":"otel-log-cronjob-29293469-lw97x","Address":"10.244.0.70","PodUID":"7960681c-5a24-4287-8bea-e2cf506500ee","Attributes":{"k8s.cronjob.name":"otel-log-cronjob","k8s.cronjob.uid":"082b1c42-e393-46bc-9d51-b20a3700d1ab","k8s.job.name":"otel-log-cronjob-29293469","k8s.job.uid":"fbd853b8-7f63-44d8-ace1-8b48c89e3041"},"StartTime":"2025-09-11T16:29:00Z","Ignore":false,"Namespace":"default","NodeName":"","DeploymentUID":"","StatefulSetUID":"","DaemonSetUID":"","JobUID":"fbd853b8-7f63-44d8-ace1-8b48c89e3041","HostNetwork":false,"Containers":{"ByID":null,"ByName":null},"DeletedAt":"0001-01-01T00:00:00Z"}}
[pod/k8sevents-receiver-opentelemetry-collector-6fd9966559-brlb6/opentelemetry-collector] {"level":"info","ts":"2025-09-11T16:29:11.588Z","msg":"Traces","resource":{"service.instance.id":"9631e38b-aec3-439f-8178-d96fc8368e1e","service.name":"otelcontribcol","service.version":"0.135.0-dev"},"otelcol.component.id":"debug","otelcol.component.kind":"exporter","otelcol.signal":"traces","resource spans":1,"spans":2}
[pod/k8sevents-receiver-opentelemetry-collector-6fd9966559-brlb6/opentelemetry-collector] {"level":"info","ts":"2025-09-11T16:29:11.588Z","msg":"ResourceSpans #0\nResource SchemaURL: https://opentelemetry.io/schemas/1.4.0\nResource attributes:\n     -> k8s.container.name: Str(telemetrygen)\n     -> service.name: Str(telemetrygen)\n     -> k8s.pod.ip: Str(10.244.0.70)\n     -> k8s.cronjob.name: Str(otel-log-cronjob)\n     -> k8s.cronjob.uid: Str(082b1c42-e393-46bc-9d51-b20a3700d1ab)\n     -> k8s.job.uid: Str(fbd853b8-7f63-44d8-ace1-8b48c89e3041)\n     -> k8s.job.name: Str(otel-log-cronjob-29293469)\nScopeSpans #0\nScopeSpans SchemaURL: \nInstrumentationScope telemetrygen \nSpan #0\n    Trace ID       : 3c7381c14a37814676b00a7d961cb219\n    Parent ID      : 4f8780d5148a9c1c\n    ID             : 17e9da9533dc93ca\n    Name           : okey-dokey-0\n    Kind           : Server\n    Start time     : 2025-09-11 16:29:09.583785469 +0000 UTC\n    End time       : 2025-09-11 16:29:09.583908469 +0000 UTC\n    Status code    : Unset\n    Status message : \nAttributes:\n     -> net.peer.ip: Str(1.2.3.4)\n     -> peer.service: Str(telemetrygen-client)\nSpan #1\n    Trace ID       : 3c7381c14a37814676b00a7d961cb219\n    Parent ID      : \n    ID             : 4f8780d5148a9c1c\n    Name           : lets-go\n    Kind           : Client\n    Start time     : 2025-09-11 16:29:09.583785469 +0000 UTC\n    End time       : 2025-09-11 16:29:09.583908469 +0000 UTC\n    Status code    : Unset\n    Status message : \nAttributes:\n     -> net.peer.ip: Str(1.2.3.4)\n     -> peer.service: Str(telemetrygen-server)\n","resource":{"service.instance.id":"9631e38b-aec3-439f-8178-d96fc8368e1e","service.name":"otelcontribcol","service.version":"0.135.0-dev"},"otelcol.component.id":"debug","otelcol.component.kind":"exporter","otelcol.signal":"traces"}
```

Tests were also added to verify the functionality.

---------

Signed-off-by: Paulo Dias <[email protected]>
@github-actions

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Oct 29, 2025
@github-actions

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Nov 12, 2025
atoulme pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this pull request Nov 19, 2025
#### Description
This PR changes the tailsampling processor to do all work (append traces
to a batch, decide on which traces from a batch are sampled, forward to
the next processor) in a single goroutine. At Grafana Labs we have found
that there is significant contention between the various mutexes such
that little performance is gained by allowing traces to be updated while
a policy evaluation is occurring. In addition, the locking adds a lot of
complexity which makes future improvements more challenging, and
introduces subtle concurrency issues such as seen in the tracking issue
linked below.

We (mostly @Logiraptor) also experimented with using multiple workers, each with its own goroutine to process traces, but we found little improvement because, once again, various contention issues showed up. See
Logiraptor#1 (or
the first commit in this PR) for more details.

If additional performance is necessary for slow decision tick latencies
I would suggest adding parallelism just to the call to `makeDecision`,
which would allow additional CPUs to be used without adding contention
on `idToTrace` or other locations.
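
As a rough sketch of that suggestion (illustrative names only; `decide` stands in for a `makeDecision`-style call and is not the processor's actual API), the fan-out could look like this:

```go
package main

import (
	"runtime"
	"sync"
)

type decision int

// decide stands in for a makeDecision-style policy evaluation for one trace.
func decide(traceID [16]byte) decision {
	return 0
}

// decideAll fans policy evaluation out across a bounded number of goroutines.
// Only the CPU-bound evaluation runs in parallel; the caller (the single
// state-owning goroutine) still applies the results, so idToTrace is never
// touched concurrently.
func decideAll(ids [][16]byte) []decision {
	out := make([]decision, len(ids))
	sem := make(chan struct{}, runtime.GOMAXPROCS(0)) // cap parallelism at core count
	var wg sync.WaitGroup
	for i, id := range ids {
		wg.Add(1)
		sem <- struct{}{}
		go func(i int, id [16]byte) {
			defer wg.Done()
			defer func() { <-sem }()
			out[i] = decide(id)
		}(i, id)
	}
	wg.Wait()
	return out
}

func main() {
	_ = decideAll([][16]byte{{1}, {2}, {3}})
}
```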

#### Link to tracking issue
Fixes #41656 
Working towards
#43876

#### Testing
Tests were updated to make sure the tick behavior remains correct and
consistent. In addition, the syncBatcher needed to be updated to
properly implement `Stop`.

---------

Co-authored-by: Patrick Oyarzun <[email protected]>