forked from open-telemetry/opentelemetry-collector-contrib
[tailsamplingprocessor] Parallel decision ticker #1
Closed
Force-pushed from d442d12 to 19887c6.
Logiraptor pushed a commit that referenced this pull request on Oct 13, 2025:
…b.uid (open-telemetry#42641)

#### Description
This PR adds support for exposing `k8s.cronjob.uid` as resource metadata when a `Job` is owned by a `CronJob`.

#### Link to tracking issue
Fixes open-telemetry#42557

#### Testing
Tested locally with `telemetrygen`; working as expected:

```
[pod/k8sevents-receiver-opentelemetry-collector-6fd9966559-brlb6/opentelemetry-collector] {"level":"debug","ts":"2025-09-11T16:29:11.588Z","caller":"[email protected]/processor.go:159","msg":"getting the pod","resource":{"service.instance.id":"9631e38b-aec3-439f-8178-d96fc8368e1e","service.name":"otelcontribcol","service.version":"0.135.0-dev"},"otelcol.component.id":"k8sattributes","otelcol.component.kind":"processor","otelcol.pipeline.id":"traces","otelcol.signal":"traces","pod":{"Name":"otel-log-cronjob-29293469-lw97x","Address":"10.244.0.70","PodUID":"7960681c-5a24-4287-8bea-e2cf506500ee","Attributes":{"k8s.cronjob.name":"otel-log-cronjob","k8s.cronjob.uid":"082b1c42-e393-46bc-9d51-b20a3700d1ab","k8s.job.name":"otel-log-cronjob-29293469","k8s.job.uid":"fbd853b8-7f63-44d8-ace1-8b48c89e3041"},"StartTime":"2025-09-11T16:29:00Z","Ignore":false,"Namespace":"default","NodeName":"","DeploymentUID":"","StatefulSetUID":"","DaemonSetUID":"","JobUID":"fbd853b8-7f63-44d8-ace1-8b48c89e3041","HostNetwork":false,"Containers":{"ByID":null,"ByName":null},"DeletedAt":"0001-01-01T00:00:00Z"}}
[pod/k8sevents-receiver-opentelemetry-collector-6fd9966559-brlb6/opentelemetry-collector] {"level":"info","ts":"2025-09-11T16:29:11.588Z","msg":"Traces","resource":{"service.instance.id":"9631e38b-aec3-439f-8178-d96fc8368e1e","service.name":"otelcontribcol","service.version":"0.135.0-dev"},"otelcol.component.id":"debug","otelcol.component.kind":"exporter","otelcol.signal":"traces","resource spans":1,"spans":2}
[pod/k8sevents-receiver-opentelemetry-collector-6fd9966559-brlb6/opentelemetry-collector] {"level":"info","ts":"2025-09-11T16:29:11.588Z","msg":"ResourceSpans #0\nResource SchemaURL: https://opentelemetry.io/schemas/1.4.0\nResource attributes:\n -> k8s.container.name: Str(telemetrygen)\n -> service.name: Str(telemetrygen)\n -> k8s.pod.ip: Str(10.244.0.70)\n -> k8s.cronjob.name: Str(otel-log-cronjob)\n -> k8s.cronjob.uid: Str(082b1c42-e393-46bc-9d51-b20a3700d1ab)\n -> k8s.job.uid: Str(fbd853b8-7f63-44d8-ace1-8b48c89e3041)\n -> k8s.job.name: Str(otel-log-cronjob-29293469)\nScopeSpans #0\nScopeSpans SchemaURL: \nInstrumentationScope telemetrygen \nSpan #0\n Trace ID : 3c7381c14a37814676b00a7d961cb219\n Parent ID : 4f8780d5148a9c1c\n ID : 17e9da9533dc93ca\n Name : okey-dokey-0\n Kind : Server\n Start time : 2025-09-11 16:29:09.583785469 +0000 UTC\n End time : 2025-09-11 16:29:09.583908469 +0000 UTC\n Status code : Unset\n Status message : \nAttributes:\n -> net.peer.ip: Str(1.2.3.4)\n -> peer.service: Str(telemetrygen-client)\nSpan #1\n Trace ID : 3c7381c14a37814676b00a7d961cb219\n Parent ID : \n ID : 4f8780d5148a9c1c\n Name : lets-go\n Kind : Client\n Start time : 2025-09-11 16:29:09.583785469 +0000 UTC\n End time : 2025-09-11 16:29:09.583908469 +0000 UTC\n Status code : Unset\n Status message : \nAttributes:\n -> net.peer.ip: Str(1.2.3.4)\n -> peer.service: Str(telemetrygen-server)\n","resource":{"service.instance.id":"9631e38b-aec3-439f-8178-d96fc8368e1e","service.name":"otelcontribcol","service.version":"0.135.0-dev"},"otelcol.component.id":"debug","otelcol.component.kind":"exporter","otelcol.signal":"traces"}
```

Tests were also added to guarantee the proper functionality.

---------

Signed-off-by: Paulo Dias <[email protected]>
Force-pushed from d0fc836 to 7e9122d.
This PR builds on open-telemetry#43201 (pending) to parallelize the decision ticker. It also fixes the open issue in open-telemetry#41656 by moving all state management to a single goroutine (per shard).

In practice at Grafana Labs, we've seen that the single-threaded nature of the decision ticker is the primary bottleneck when running the tail sampling processor at high scale. For example, we have one large cluster performing around 12M policy evaluations per second while maintaining a p99 tick latency of around 500ms. To reach this scale, we need to run ~200 replicas just to get enough parallel threads of execution. By adding support for parallel decision ticks, we hope to run fewer, larger replicas, which should reduce the fragmentation caused by trace ID load balancing and better balance spiky load across replicas.

Another benefit is reduced contention on the idToTrace map; giving each trace a clear "owner" goroutine also simplifies reasoning about concurrency. Previously we'd try to do some work inside the calling goroutine of ConsumeTraces. Instead, new work is passed via channels to its dedicated shard. By default we run one shard per available core. Shared state is kept to the minimum required to implement TSP semantics:

1. The trace limiter is shared, which allows overall memory consumption to be reliably capped even with uneven shard utilization.
2. Both decision caches are shared, which again allows overall memory consumption to be capped even with uneven shard utilization.

In practice, shard utilization should be very even, but it's still nice to have hard guarantees on the total traces in memory, and this also allows us to avoid changing any tests in this PR, which greatly increases my confidence in the code despite the large change. Ideally the change to parallel execution is invisible to end users (aside from the much higher throughput).
Sharding makes use of the maphash package with a unique `Seed`, which should provide an even shard distribution even if trace IDs have previously been sampled downstream. This is better than e.g. a simple modulo on the trace ID, since trace IDs are not necessarily uniformly distributed after head sampling.
Force-pushed from 8c93275 to 2c402c1.
This PR was marked stale due to lack of activity. It will be closed in 14 days.

Closed as inactive. Feel free to reopen if this PR is still being worked on.
atoulme pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this pull request on Nov 19, 2025:
#### Description
This PR changes the tail sampling processor to do all work (append traces to a batch, decide which traces from a batch are sampled, forward to the next processor) in a single goroutine. At Grafana Labs we have found that there is significant contention between the various mutexes, such that little performance is gained by allowing traces to be updated while a policy evaluation is occurring. In addition, the locking adds a lot of complexity, which makes future improvements more challenging and introduces subtle concurrency issues such as the one in the tracking issue linked below.

We (mostly @Logiraptor) also experimented with multiple workers, each with its own goroutine to process traces, but found little improvement, as the same contention issues showed up again. See Logiraptor#1 (or the first commit in this PR) for more details. If additional performance is needed for slow decision tick latencies, I would suggest adding parallelism just to the call to `makeDecision`, which would allow additional CPUs to be used without adding contention on `idToTrace` or other locations.

#### Link to tracking issue
Fixes #41656
Working towards #43876

#### Testing
Tests were updated to make sure the tick behavior remains correct and consistent. In addition, the syncBatcher needed to be updated to properly implement `Stop`.

---------

Co-authored-by: Patrick Oyarzun <[email protected]>