@Logiraptor Logiraptor commented Oct 7, 2025

While working on a fix for #41656, I noticed that many tests are strongly coupled to the tsp internals. This PR consolidates all such coupling into a small number of test helpers, sketched below, instead of duplicating it across every test. Those helpers are:

  • `withPolicies` - allows directly setting the policy evaluators, which is useful for fine-grained decision logic tests
  • `withTickerLatency` - allows speeding up the decision ticker, which keeps tests fast
  • `withTestController` - makes the decision tick block until triggered, which allows deterministic tests of otherwise non-deterministic behavior
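
As a rough illustration, these helpers follow Go's functional-options pattern. Below is a minimal sketch assuming hypothetical internal fields; it is not the processor's actual code:

```go
package tailsamplingprocessor

import "time"

// Hypothetical stand-ins for the processor's internals; the real
// fields live in the tail sampling processor package.
type policy struct{ name string }

type tailSamplingSpanProcessor struct {
	tickerFrequency time.Duration
	policies        []*policy
}

// Option mutates the processor under test, keeping all knowledge of
// internal fields inside the option funcs themselves.
type Option func(*tailSamplingSpanProcessor)

// withTickerFrequency overrides how often the decision tick fires.
func withTickerFrequency(d time.Duration) Option {
	return func(tsp *tailSamplingSpanProcessor) { tsp.tickerFrequency = d }
}

// withPolicies sets the policy evaluators directly.
func withPolicies(policies []*policy) Option {
	return func(tsp *tailSamplingSpanProcessor) { tsp.policies = policies }
}
```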

A few tests were largely rewritten so that they test externally observable behavior instead. For example:

  • TestLateArrivingSpanUsesDecisionCache - used to inspect `idToTrace` state to ensure traces are cleared from memory. Now the tsp is destroyed and recreated with the same cache, so the cache is the only state available.
  • TestSequentialTraceMapSize et al. - these used to inspect `idToTrace` state to ensure trace data is properly tracked. Now they trigger sampling decisions and assert that the output is complete.
  • TestSetSamplingPolicy et al. - these used to inspect `policies` state to ensure policies are reloaded. Now they observe sampled traces only.

### Instead of invoking the TSP tick loop directly

Instead of calling the tsp tick loop directly, I've added code that wraps the tick loop and makes it block until triggered. This still involves some coupling, but since it's encapsulated in one place, it will be easier to change when refactoring the tsp in #41656. I also took the opportunity to consolidate the sync batcher into this controller: in every case with a manual tick loop, we also want a synchronous batcher, for the same reason. Again, this makes it simpler to update later without modifying every test at once.
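
A minimal sketch of the idea, assuming a hypothetical controller that gates the tick loop on a channel (the helper in the PR may be structured differently):

```go
// testController gates the decision tick so tests can trigger it
// deterministically; a sketch, not the PR's actual helper.
type testController struct {
	trigger chan struct{} // a test sends here to allow one tick
	done    chan struct{} // the controller replies when the tick finishes
}

func newTestController() *testController {
	return &testController{
		trigger: make(chan struct{}),
		done:    make(chan struct{}),
	}
}

// wrapTickLoop runs in the processor's tick goroutine: it blocks
// before every tick until triggered, then signals completion, so each
// tick is synchronous from the test's point of view.
func (c *testController) wrapTickLoop(tick func()) {
	for range c.trigger {
		tick()
		c.done <- struct{}{}
	}
}

// tickOnce runs exactly one decision tick and waits for it to finish.
func (c *testController) tickOnce() {
	c.trigger <- struct{}{}
	<-c.done
}
```

A test then writes spans, calls `tickOnce`, and asserts on the output, with no sleeping or polling.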

### Instead of accessing the idToTrace map

In every case where we were directly inspecting the `idToTrace` map, we can instead let the tsp sample those traces and inspect the output.
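
For instance, a test in this style might use the collector's `consumertest.TracesSink` and assert on what was actually forwarded; `newTestTSP`, `ctrl`, and `makeTrace` below are hypothetical stand-ins for this PR's helpers:

```go
import (
	"context"
	"testing"

	"github.com/stretchr/testify/require"
	"go.opentelemetry.io/collector/consumer/consumertest"
)

func TestSamplingOutput(t *testing.T) {
	sink := new(consumertest.TracesSink)
	// newTestTSP, ctrl, and makeTrace are hypothetical helpers in the
	// spirit of this PR; the real names differ.
	tsp, ctrl := newTestTSP(t, sink)

	require.NoError(t, tsp.ConsumeTraces(context.Background(), makeTrace()))
	ctrl.tickOnce() // deterministically force a sampling decision

	// Assert on externally observable output, not the idToTrace map.
	require.Len(t, sink.AllTraces(), 1)
}
```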

### Instead of setting tickerFrequency directly

I've moved these tests to use `withTickerFrequency`. This is still a coupling, but since it's delegated to the option func, there's a central place to change how it works.

### other

I also included a final commit, 7bfab91, which just removes an unused field.

I've decided to remove two tests:

  • TestConcurrentTraceMapSize - this one writes traces concurrently, then asserts on the size of the `idToTrace` map. We have overlapping tests which write concurrently and then assert on the complete sampled output, which IMO is a better test.
  • TestDecisionPolicyMetrics - this one calls the internal `makeDecision` method and asserts on the returned metrics, which are only used in debug logging. This feels like over-testing to me and makes the tests overly coupled.

@Logiraptor Logiraptor requested a review from a team as a code owner October 7, 2025 18:51
@Logiraptor Logiraptor requested a review from axw October 7, 2025 18:51
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Oct 7, 2025
@github-actions github-actions bot requested a review from portertech October 7, 2025 18:51

axw commented Oct 8, 2025

@Logiraptor I like the sound of this, but it needs to wait until the minimum Go version is bumped to 1.25. That will happen when Go 1.26.0 comes out in February 2026.

@Logiraptor Logiraptor (Author) commented:

@axw Yeah, I just realized that 😓 I'll mark this as a draft while I figure out if there's an easy way around it.

@Logiraptor Logiraptor marked this pull request as draft October 8, 2025 01:40
@Logiraptor Logiraptor marked this pull request as ready for review October 8, 2025 17:26
@Logiraptor Logiraptor (Author) commented:

@axw Figured out an alternative approach. It's not quite as magical, but it accomplishes the goal of reducing how much needs to change when the tsp internals are refactored.

@Logiraptor Logiraptor changed the title from "[tailsamplingprocessor] [chore] Use synctest instead of tsp internals" to "[tailsamplingprocessor] [chore] Remove test coupling on component internal fields" Oct 8, 2025
Logiraptor added a commit to Logiraptor/opentelemetry-collector-contrib that referenced this pull request Oct 8, 2025
This PR builds on open-telemetry#43201 (pending) to parallelize the decision ticker.
It also fixes the open issue in open-telemetry#41656 by moving all state management to
a single goroutine (per shard).

In practice at Grafana Labs, we've seen that the single threaded nature
of the decision ticker is the primary bottleneck when running the tail
sampling processor at high scales. For example, we have one large
cluster performing around 12M policy evaluations per second while
maintaining p99 tick latency of around 500ms. In order to reach this
scale, we need to run ~200 replicas just to get enough parallel threads
of execution. By adding support for parallel decision ticks, we hope to
run fewer, larger replicas which should reduce the fragmentation
caused by trace id load balancing.

Another benefit here is to reduce contention on the idToTrace map and generally
simplify reasoning about concurrency by giving each trace a clear
"owner" goroutine. Previously we'd try to do some work inside the calling
goroutine of ConsumeTraces. Instead, new work is passed via channels to
its dedicated shard. By default we run one shard per available core.
Shared state is kept to the minimum required to implement TSP semantics:

1. The trace limiter is shared, which allows the overall memory
   consumption to be reliably capped even in case of uneven shard
   utilization.
2. Both decision caches are shared, which again allows overall memory
   consumption to be capped even with uneven shard utilization.

In practice, shard utilization should be very even, but it's still nice
to have hard guarantees on the total traces in memory, and this allows
us to avoid changing any tests in this PR, which greatly increases my
confidence in the code despite the large change. Ideally the change to
parallel execution is invisible to end users (aside from the much higher throughput).

Sharding makes use of the maphash package with a unique `Seed`, which
should provide even shard distribution even if traceIDs have previously
been sampled downstream. This is better than e.g. a simple modulo on
traceID, since traceIDs are not necessarily uniformly distributed after
head sampling.
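
As a sketch of the shard selection this commit message describes (type and method names here are illustrative, not from the actual PR):

```go
package main

import (
	"fmt"
	"hash/maphash"
)

// shardRouter picks an owner shard for each trace ID. A per-process
// random Seed decorrelates shard choice from any upstream sampling
// that was itself keyed on the trace ID.
type shardRouter struct {
	seed   maphash.Seed
	shards uint64
}

func newShardRouter(shards int) *shardRouter {
	return &shardRouter{seed: maphash.MakeSeed(), shards: uint64(shards)}
}

func (r *shardRouter) shardFor(traceID [16]byte) int {
	return int(maphash.Bytes(r.seed, traceID[:]) % r.shards)
}

func main() {
	r := newShardRouter(8)
	id := [16]byte{0x01, 0x02, 0x03}
	fmt.Println("trace routed to shard", r.shardFor(id))
}
```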
Logiraptor added a commit to Logiraptor/opentelemetry-collector-contrib that referenced this pull request Oct 9, 2025
@carsonip carsonip (Contributor) left a comment:

Thanks! Leaving a comment as I'm also interested in improving the tail sampling processor.

PR lgtm at a high level. Alternatively, have you considered keeping `policyTicker` as a `TTicker`, and creating a `testTicker` implementing `TTicker` that ticks as fast as possible, instead of every 1ms?
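
Such a `testTicker` might look roughly like the sketch below; the Start/Stop interface is an assumption, since the processor's real `TTicker` interface may differ:

```go
import "time"

// TTicker here is assumed to be a minimal start/stop abstraction over
// the decision tick; the processor's real interface may differ.
type TTicker interface {
	Start(d time.Duration)
	Stop()
}

// testTicker ignores the requested interval and fires onTick in a
// tight loop, so tests run as fast as possible without a 1ms timer.
type testTicker struct {
	onTick func()
	stop   chan struct{}
}

func newTestTicker(onTick func()) *testTicker {
	return &testTicker{onTick: onTick, stop: make(chan struct{})}
}

func (t *testTicker) Start(time.Duration) {
	go func() {
		for {
			select {
			case <-t.stop:
				return
			default:
				t.onTick()
			}
		}
	}()
}

func (t *testTicker) Stop() { close(t.stop) }
```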

@Logiraptor Logiraptor (Author) commented:

@carsonip I hadn't tried that, no. FWIW, I have another PR in draft which will remove the ticker abstraction in order to consolidate state manipulation into a single select statement: Logiraptor#1 (staged in my fork while I test, but will open the real PR here). In that PR we no longer need a 1ms ticker and can directly control the tick via a side channel.

More context on that PR in case you're interested:
In that PR, I'm hoping to simplify how state is managed in general, so we can mostly avoid locks / channels / `sync.Map`. In doing so, it also becomes easier to run parallel workers in the TSP, so we can finally overcome the single-threaded bottleneck. In testing, I'm finding that most of the policy ticker latency comes from synchronization overhead caused by having so much shared, concurrency-aware state, which also means running those parallel workers doesn't gain as much throughput as expected. So I have a few side quests like this one and #43510 to reduce contention on shared state.
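
A rough sketch of that shape, with a side channel for manual ticks next to the normal timer (all names and fields are illustrative):

```go
import "time"

// processor is a toy stand-in for the tsp; all names are illustrative.
type processor struct {
	incoming     chan any           // trace batches from ConsumeTraces
	tickRequests chan chan struct{} // side channel for manual ticks
	shutdown     chan struct{}
	tickInterval time.Duration
	pending      []any
}

// run is the single owner of all processor state: every mutation
// happens inside this one select loop, so no locks are needed.
func (p *processor) run() {
	ticker := time.NewTicker(p.tickInterval)
	defer ticker.Stop()
	for {
		select {
		case td := <-p.incoming:
			p.pending = append(p.pending, td) // state touched only here
		case <-ticker.C:
			p.samplePending()
		case done := <-p.tickRequests:
			p.samplePending() // manual tick requested by a test
			close(done)       // lets the test observe completion
		case <-p.shutdown:
			return
		}
	}
}

func (p *processor) samplePending() {
	// evaluate policies over p.pending and emit decisions (elided)
	p.pending = p.pending[:0]
}
```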

@carsonip carsonip (Contributor) left a comment:

lgtm, thanks! Tests are much cleaner without coupling to the implementation.

Logiraptor added a commit to Logiraptor/opentelemetry-collector-contrib that referenced this pull request Oct 14, 2025
Inline review comment on this hunk:

    set:          set,
    telemetry:    telemetry,
    nextConsumer: nextConsumer,
    maxNumTraces: cfg.NumTraces,

Comment: Was this really a noop?

Reply: Looks like the processor refers to cfg.NumTraces everywhere else.

@axw axw added the ready to merge Code review completed; ready to merge by maintainers label Oct 14, 2025
@songy23 songy23 merged commit 740ed94 into open-telemetry:main Oct 15, 2025
202 of 203 checks passed
@github-actions github-actions bot added this to the next release milestone Oct 15, 2025
ChrsMark pushed a commit to ChrsMark/opentelemetry-collector-contrib that referenced this pull request Oct 20, 2025
[tailsamplingprocessor] [chore] Remove test coupling on component internal fields (open-telemetry#43201)

csmarchbanks pushed a commit to csmarchbanks/opentelemetry-collector-contrib that referenced this pull request Oct 21, 2025
csmarchbanks pushed a commit to csmarchbanks/opentelemetry-collector-contrib that referenced this pull request Oct 21, 2025