Conversation
… to determine otlp mode
…nfig registry to reflect this
…ut ddspan -> otlpspan conversion
… span_to_otlp_test.go
… service name differs from global service name
…me (#4576) ### What does this PR do? Moves the `strconv.ParseInt` call for `_dd.p.dm` (decision maker) out of the v1 payload flush path and into the write path. A `dm uint32` field is added to the `trace` struct. It is kept in sync with `propagatingTags[keyDecisionMaker]` at all mutation sites (`setPropagatingTagLocked`, `unsetPropagatingTagLocked`, `replacePropagatingTags`). A shared `unsetPropagatingTagLocked` helper is introduced so the delete-and-clear logic is not duplicated between `unsetPropagatingTag` and `setSamplingPriorityLockedWithForce`. `payloadV1.push` now reads the pre-computed value via `trace.decisionMaker()` instead of fetching the string from the propagating tags map and parsing it on every flush. ### Motivation `payloadV1.push` is called once per finished trace chunk. On every call it was acquiring a read lock, doing a map lookup, and then running `strconv.ParseInt` — all to read a value that was set at most once or twice during the lifetime of the trace. Moving the parse to write-time eliminates the per-flush map lookup and parse, replacing them with a single scalar read behind an `RLock`. ### Reviewer's Checklist - [x] Changed code has unit tests for its functionality at or near 100% coverage. - [x] New code is free of linting errors. You can check this by running `make lint` locally. - [x] New code doesn't break existing tests. You can check this by running `make test` locally. - [x] Add an appropriate team label so this PR gets put in the right place for the release notes. - [x] All generated files are up to date. You can check this by running `make generate` locally. Co-authored-by: dario.castane <dario.castane@datadoghq.com>
…ED` once, use a global `bool` (#4548) Stop checking environment variable `DD_TRACE_128_BIT_TRACEID_GENERATION_ENABLED` on span start. Checks it once, use a global `bool`. It'd be better to load it on `newConfig` and make the value available, but this one happens in `newSpanContext`, which modification will cause a lot of changes in multiple files, including tests. Complete #1905 work using the current benchmarking platform. - [x] Changed code has unit tests for its functionality at or near 100% coverage. - [x] There is a benchmark for any new code, or changes to existing code. - [x] New code is free of linting errors. You can check this by running `make lint` locally. - [x] New code doesn't break existing tests. You can check this by running `make test` locally. - [x] Add an appropriate team label so this PR gets put in the right place for the release notes. Unsure? Have a question? Request a review! Co-authored-by: kakkoyun <kakkoyun@users.noreply.github.com> Co-authored-by: kemal.akkoyun <kemal.akkoyun@datadoghq.com>
Signed-off-by: Moe Zein <moe.zein@datadoghq.com>
Co-authored-by: Benjamin De Bernardi <debernardi.benjamin@gmail.com>
Co-authored-by: Dario Castañé <d@rio.hn>
… ParentId from proto when 0
|
|
||
| // +checklocksignore — Post-finish: reads finished span fields during payload encoding. | ||
| func convertSpan(s *Span, defaultServiceName string) *otlptrace.Span { | ||
| if p, ok := s.context.SamplingPriority(); ok && p < ext.PriorityAutoKeep { |
There was a problem hiding this comment.
I don't think there's ever a case where we get ok = false (right?), but what is the intended behavior if we get ok=false? Should the span be kept by default (as it is currently)?
There was a problem hiding this comment.
In practice, no we never expect to get ok = false. But theoretically if we were to, this would mean the span has no sampling (or trace-level) information on it, and we would drop the span. Given that we would consider this span would corrupt, dropping it -- as we do here -- makes sense.
Also: This particular implementation is dropping spans that are not marked keep, as is the otel spec for client side sampling.
…he same TCP connection instead of creating a new one for every request. Also, add tests.
| needsFlush := w.buffSize > payloadSizeLimit | ||
| w.mu.Unlock() | ||
| if needsFlush { | ||
| w.flush() | ||
| } |
There was a problem hiding this comment.
What do you think is the possibility of w getting changed between the unlock and the flush() call? I wonder if it's worth holding the lock until after the flush is done, and having a separate flushWithLock() function? What do you think?
There was a problem hiding this comment.
Hmm..
needsFlush captures a snapshot of w in a moment that can serve as a trigger. flush() acquires the lock itself, so it is safe. If, say some other goroutine concurrently accessed flush while we were on line 78-79, we would then:
- enter flush with a lock
- see
len(w.spans) == 0, and return immediately
So there wouldn't be any duplicate sending or race condition here.
| // resolveTraceTransport returns the trace URL and headers for the transport | ||
| // based on whether OTLP export mode is active. | ||
| // resolveTraceTransport returns the trace URL and headers for the Datadog | ||
| // agent transport. In OTLP export mode the ddTransport is not used for trace |
| } | ||
|
|
||
| // send posts a protobuf-encoded payload to the configured OTLP endpoint. | ||
| func (t *otlpTransport) send(data []byte) error { |
There was a problem hiding this comment.
This method is super close to io.Writer I wonder if it would make sense to match them 🤔 I'm thinking out loud here.
There was a problem hiding this comment.
I believe this is what the msgpack implementation is based on. But, in the otlp implementation the "send" logic is quite different - we are storing everything in memory up until the point that we are ready to flush, and only at that point do we encode the data. The encoding happens just once per payload - at flush time - not iteratively.
| readySpans := w.reset() | ||
| w.mu.Unlock() | ||
|
|
||
| w.climit <- struct{}{} |
There was a problem hiding this comment.
We can just use https://pkg.go.dev/golang.org/x/sync/errgroup instead of implementing this on our own.
There was a problem hiding this comment.
How do you envision this?
otlpTraceWriter.flush was modeled after agentTraceWriter.flush; changing one but not the other would break consistency, and I'd rather not modify agentTraceWriter in this PR.
Also, as I undestand it, errgroup would clean up the concurrency-limiting boilerplate, but it wants to cancel on the first error, whereas we just log and continue.
What does this PR do?
Implements the OTLP trace export pipeline (See: RFC) for dd-trace-go: when
OTEL_TRACES_EXPORTER=otlpis set, the tracer converts Datadog spans into OTLP protobuf format and sends them directly to an OpenTelemetry Collector instead of the Datadog agent.This adds three new components:
The PR also renames the existing transport interface and config.transport field to ddTransport to make it clear that it handles Datadog-specific traffic (msgpack traces, stats), and ensures ddTransport always points at the Datadog agent even when OTLP mode is active.
Motivation
We are adding native OTLP export support to dd-trace-go so users can send traces directly to any OTLP-compatible collector. A prior PR added the configuration layer that resolves
OTEL_TRACES_EXPORTER,OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, andOTEL_EXPORTER_OTLP_TRACES_HEADERS. This PR builds the runtime pipeline on top of that configuration.IMPORTANT NOTE!
The existing
agentTraceWriterandddTransportwas too tightly coupled to Datadog conventions — msgpack encoding, agent sampling rate feedback, stats collection — to extend it for the otlp path, so a separate writer and transport were introduced instead. TheotlpTraceWriterowns its ownotlpTransportdirectly (rather than reaching throughconfig.ddTransport), keeping OTLP concerns isolated from the Datadog agent path. A follow-up PR may removeddTransportfrom the config struct entirely, instead passing the transport directly to each writer and consumer at construction time. This would make all writer implementations structurally consistent — each owning its transport as a field — and eliminate the implicit coupling through config.An additional PR to disable stats collection in otlp mode is in progress here.
Reviewer's Checklist
make lintlocally.make testlocally.make generatelocally.make fix-moduleslocally.Unsure? Have a question? Request a review!