sdk/trace/x: Add experimental ProbabilitySampler#8123
sdk/trace/x: Add experimental ProbabilitySampler#8123yuanyuanzhao3 wants to merge 9 commits intoopen-telemetry:mainfrom
ProbabilitySampler#8123Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #8123 +/- ##
======================================
Coverage 82.3% 82.4%
======================================
Files 310 312 +2
Lines 24260 24375 +115
======================================
+ Hits 19982 20093 +111
- Misses 3901 3904 +3
- Partials 377 378 +1
🚀 New features to boost your workflow:
|
dashpole
left a comment
There was a problem hiding this comment.
I'm a bit confused by this still. I see we already have a TraceIDRatioBased sampler in the SDK. I also see that the sampler is marked deprecated in the spec: https://opentelemetry.io/docs/specs/otel/trace/sdk/#traceidratiobased, so i'm not sure why we would be adding new features to it. The changes here seem to align with the ProbabilitySampler specification, not the TraceIDRatioBased sampler specification, so i'm not sure why this is named TraceIDRatioBased, and links to that specification.
@jmacd maybe you can clarify? Is the intent that SDKs introduce a ProbabilitySampler artifact along-side the TraceIDRatioBased sampler? Or is the intent that we only keep TraceIDRatioBased, but update it to follow the ProbabilitySampler spec (is that breaking)?
Not @jmacd, but here's my understanding: This is intended to be a new API-level compatible replacement for the original TraceIDRatioBased sampler. The difference lies in the underlying sampling algorithm and the information it passes through The new spec: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/tracestate-probability-sampling.md is what this PR is about and it is a kind of probability sampling. I do not know why the place https://opentelemetry.io/docs/specs/otel/trace/sdk/#traceidratiobased is marking TraceIdRatioBased as deprecated. |
| hbits = 4 | ||
| ) | ||
| if fraction > probabilityOneThreshold { | ||
| return sdktrace.AlwaysSample() |
There was a problem hiding this comment.
This doesn't propagate th properly, right? It also doesn't return the correct description.
Same comment for NeverSample below.
There was a problem hiding this comment.
I compared with https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/208b7c0565fe51033fa8d8b3d96a3c3dcba79a3f/pkg/sampling/probability.go#L33, which is an ancestor of this. It returns an error in these cases.
I realize this may have happened because in an earlier implementation of this, I had assumed that AlwaysSample() would return a sampler that sets ot=th:0. @yuanyuanzhao3 sadly note that https://opentelemetry.io/docs/specs/otel/trace/sdk/#alwayson does not dictate to set th:0, and I'm not sure myself whether this counts as oversight or just the path of least resistance. It suggests we should have ProbabilitySampler with fraction=1 fix this, later we can discuss modifying the OTel SDK (which is I think what we want).
There was a problem hiding this comment.
I think it is probably a bit more nuanced for AlwaysSample to export th:0. Without risking too philosophical, it depends on what we think th:0 conveys. That is, besides that it allows every trace to pass, does it also convey the kind of sampling algorithm used? Would there be another sampling algorithm that is also deemed as "OTel-native", is not based on threshold but can be configured to pass all traces. If such cases exist, how many tracestate encodings should AlwaysSample output?
So, in short, let's defer the AlwaysSample change.
I have updated the implementation to use ProbabilitySampler{0}.
The sdk/trace package (and experimental features in sdk/trace/x) needs to follow the specification in https://opentelemetry.io/docs/specs/otel/trace/sdk. The only place that tracestate-probability-sampling.md is referenced from in the SDK spec is from https://opentelemetry.io/docs/specs/otel/trace/sdk/#probabilitysampler. So if you are implementing https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/tracestate-probability-sampling.md, I assume it should follow the ProbabilitySampler specification in the SDK. If my understanding is correct, please make sure it follows the ProbabilitySampler specification. Other than naming this ProbabilitySampler, i don't see the compatibility warning described here: https://opentelemetry.io/docs/specs/otel/trace/sdk/#compatibility-warnings-for-probabilitysampler |
|
I can probably explain some confusion -- there was a late-entering change in the sampling specification as it moved from OTEP 235 to the SDK specification, where we decided on the name @yuanyuanzhao3 see open-telemetry/opentelemetry-specification#4627 for details. Sorry for the confusion! Thank you @dashpole for your guidance. |
Add a threshold-based TraceIDRatioBased sampler that conforms to the OpenTelemetry specification. It uses the least significant 56 bits of the trace ID for deterministic sampling decisions and propagates the sampling threshold via the `th` sub-key in the W3C `ot` tracestate vendor key. When an explicit `rv` (randomness value) is present in the tracestate, it is used instead of the trace ID.
Align naming with the trace/x API and update changelog and tests. Made-with: Cursor
6c0d82c to
64eb6d3
Compare
ProbabilitySampler
MrAlias
left a comment
There was a problem hiding this comment.
Thanks for pushing this forward. I reviewed it against the current SDK sampling spec, especially the newer ProbabilitySampler behavior. There are still a few issues to address.
| var newOtts string | ||
| // Only insert/update th when randomness is available (either from | ||
| // explicit rv value or trace ID with the random flag). Otherwise, | ||
| // erase any existing th to signal the span is not guaranteed to be | ||
| // statistically representative. | ||
| // See https://opentelemetry.io/docs/specs/otel/trace/tracestate-probability-sampling/#general-requirements | ||
| if hasRandomness || psc.TraceFlags().IsRandom() { | ||
| newOtts = InsertOrUpdateTraceStateThKeyValue(existingOtts, ps.thkv) | ||
| } else { | ||
| newOtts = eraseTraceStateThKeyValue(existingOtts) | ||
| } |
There was a problem hiding this comment.
This looks stricter than the ProbabilitySampler spec. For sampled spans, th should be propagated based on the threshold decision itself, and samplers should presume TraceID randomness unless explicit rv says otherwise.
With the current condition, sampled root spans never emit ot=th:..., and sampled child spans will also lose th after a Trace Context v1 hop when the Random flag is not present. That breaks downstream consistent-probability behavior.
The current SDK spec for ProbabilitySampler says sampled spans should include th:T, and the sampling requirements say samplers should presume TraceID randomness unless rv is present:
- https://opentelemetry.io/docs/specs/otel/trace/sdk/#probabilitysampler
- https://opentelemetry.io/docs/specs/otel/trace/sdk/#sampling-requirements
I think this should insert/update th whenever the span is sampled, rather than gating on the parent Random flag.
| var newOtts string | |
| // Only insert/update th when randomness is available (either from | |
| // explicit rv value or trace ID with the random flag). Otherwise, | |
| // erase any existing th to signal the span is not guaranteed to be | |
| // statistically representative. | |
| // See https://opentelemetry.io/docs/specs/otel/trace/tracestate-probability-sampling/#general-requirements | |
| if hasRandomness || psc.TraceFlags().IsRandom() { | |
| newOtts = InsertOrUpdateTraceStateThKeyValue(existingOtts, ps.thkv) | |
| } else { | |
| newOtts = eraseTraceStateThKeyValue(existingOtts) | |
| } | |
| newOtts := InsertOrUpdateTraceStateThKeyValue(existingOtts, ps.thkv) |
There was a problem hiding this comment.
You are right. We specifically discussed this in the sampling SIG meeting on March 26.
The issue with non-random trace ID is that they skew span metrics extrapolation. The spanmetricsconnector treats spans with th value and without th differently. The former is used for extrapolation as is marked with an attribute sampling_method=extrapolated. The latter is counted as one (representing just itself) and the metric is marked with an attribute sampling_method=counted.
Should we clarify this in the spec? @jmacd FYI
There was a problem hiding this comment.
The issue with ignoring the absence of randomness indicator is that, it skews the statistics and there is no way for the downstream/customers to know about it.
There was a problem hiding this comment.
I'm cool with whatever way we want to go here. Just what is implemented here needs to be defined by the specification.
There was a problem hiding this comment.
Thanks for pointing this out @MrAlias. I believe @yuanyuanzhao3 has this the way it is described, just not in the SDK specification. The property we are after is included in https://opentelemetry.io/docs/specs/otel/trace/tracestate-probability-sampling/#parent-based-threshold. In the meeting you remember, it was @PeterF778 who led the group to this requirement, and he confirmed it again.
To resolve this, @yuanyuanzhao3 will you (a) file an issue at https://github.com/open-telemetry/opentelemetry-specification/issues describing how this stipulation about erasing threshold when it is inconsistent was omitted from the specification language, (b) suggest a change in tracing/sdk.md we can make to the experimental specification?
There was a problem hiding this comment.
Created this issue: open-telemetry/opentelemetry-specification#5035
Please feel free to assign it to me.
There was a problem hiding this comment.
I must retract what I said above. See open-telemetry/opentelemetry-specification#5035 (comment)
There was a problem hiding this comment.
We discussed this in today's sampling SIG meeting. The resolution is to keep the spec as it is and conform to the spec. @MrAlias thanks for keeping us adhering to the spec.
There was a problem hiding this comment.
The spec says:
the SDK SHOULD issue a warning statement in its log with a compatibility warning
This is bound to be very chatty as the absence of randomness flag would not be a rare instance once it occurs. Opinions?
…izes a key with `th` partially in the suffix.
|
FYI, the |
always put `th` value in `tracestate` even though randomness indication is missing. This conforms to the presumption of traceID randomness which supports extrapolating span metrics even though only the current service uses W3C tracecontext level 2. Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Adds an experimental threshold-based probabilistic sampler (ProbabilitySampler) under go.opentelemetry.io/otel/sdk/trace/x, along with tracestate helpers and tests, aiming to align with the OpenTelemetry spec’s deterministic sampling algorithm and ot tracestate propagation.
Changes:
- Introduce
sdk/trace/xexperimental package withProbabilitySamplerimplementation. - Add
ottracestate helpers forth(threshold) andrv(randomness) handling plus unit tests. - Document the experimental feature and record it in the changelog.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/trace/x/sampler.go | Implements ProbabilitySampler and tracestate mutation during sampling. |
| sdk/trace/x/sampler_test.go | Adds behavioral tests for probability thresholds, inclusivity, and tracestate updates. |
| sdk/trace/x/tracestate.go | Adds helpers for parsing rv and inserting/updating/erasing th within ot tracestate value. |
| sdk/trace/x/tracestate_test.go | Tests rv parsing and th insert/erase behavior. |
| sdk/trace/x/README.md | Documents the experimental package and ProbabilitySampler usage. |
| CHANGELOG.md | Adds an entry for the new experimental sampler. |
| var randomness uint64 | ||
| var hasRandomness bool | ||
| if existingOtts != "" { | ||
| randomness, hasRandomness = tracestateRandomness(existingOtts) | ||
| } | ||
|
|
||
| if !hasRandomness { | ||
| randomness = binary.BigEndian.Uint64(p.TraceID[8:16]) & randomnessMask | ||
| } |
| if len(otts) < start+14 || (len(otts) > start+14 && otts[start+14] != ';') { | ||
| otel.Handle(fmt.Errorf("could not parse tracestate randomness: %s", otts)) | ||
| return 0, false | ||
| } | ||
|
|
||
| rv, err := strconv.ParseUint(otts[start:start+14], 16, 56) | ||
| if err != nil { | ||
| otel.Handle(fmt.Errorf("could not parse tracestate randomness: %s", otts)) | ||
| return 0, false |
| for range numTraces { | ||
| traceID := trace.TraceID{} | ||
| _, _ = rand.Read(traceID[:]) | ||
| params := sdktrace.SamplingParameters{ |
| func (ps *probabilitySampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult { | ||
| psc := trace.SpanContextFromContext(p.ParentContext) | ||
| state := psc.TraceState() | ||
|
|
||
| existingOtts := state.Get("ot") | ||
|
|
||
| var randomness uint64 | ||
| var hasRandomness bool | ||
| if existingOtts != "" { | ||
| randomness, hasRandomness = tracestateRandomness(existingOtts) | ||
| } | ||
|
|
||
| if !hasRandomness { | ||
| randomness = binary.BigEndian.Uint64(p.TraceID[8:16]) & randomnessMask | ||
| } | ||
|
|
||
| if ps.threshold > randomness { | ||
| return sdktrace.SamplingResult{ | ||
| Decision: sdktrace.Drop, | ||
| Tracestate: state, | ||
| } | ||
| } | ||
|
|
||
| newOtts := InsertOrUpdateTraceStateThKeyValue(existingOtts, ps.thkv) | ||
|
|
||
| if newOtts == "" { | ||
| state = state.Delete("ot") | ||
| return sdktrace.SamplingResult{Decision: sdktrace.RecordAndSample, Tracestate: state} | ||
| } | ||
| combined, err := state.Insert("ot", newOtts) |
Description
Adds an experimental
ProbabilitySampleringo.opentelemetry.io/otel/sdk/trace/xthat conforms to the OpenTelemetry specification's threshold-based sampling algorithm.Features
thhandling: Encodes and propagates the sampling threshold in the W3Cottracestate vendor key for consistent downstream samplingTraceFlags.IsRandom()andWithRandom()(trace: add Random Trace ID Flag #8012) for proper indication ofthvalue for extrapolated metrics supportFiles
sdk/trace/x/sampler.go— Core sampler implementation (ProbabilitySampler)sdk/trace/x/sampler_test.go— Tests for sampler behaviorsdk/trace/x/tracestate.go— Tracestateth/rvkey helperssdk/trace/x/tracestate_test.go— Tests for tracestate helperssdk/trace/x/README.md— Documentation for the experimental featureRelated
Co-Author
Joshua MacDonald jmacd@users.noreply.github.com