
feat(tracer): Implement OTLP Traces Export #4600

Open
mtoffl01 wants to merge 63 commits into main from mtoff/otlp-export-traces

Conversation


@mtoffl01 mtoffl01 commented Mar 25, 2026

What does this PR do?

Implements the OTLP trace export pipeline (See: RFC) for dd-trace-go: when OTEL_TRACES_EXPORTER=otlp is set, the tracer converts Datadog spans into OTLP protobuf format and sends them directly to an OpenTelemetry Collector instead of the Datadog agent.

This adds three new components:

  1. span_to_otlp.go — converts internal Datadog spans to OTLP TracesData protobuf, including resource attributes, span kind/status, events, links, 128-bit trace IDs, and sampling filtering (only sampled spans are exported).
  2. otlp_writer.go — a traceWriter implementation that buffers converted OTLP spans, marshals them to protobuf, and flushes with retry logic.
  3. otlp_transport.go — a lightweight HTTP transport for sending protobuf payloads to an OTLP endpoint, separate from the existing Datadog-protocol transport.

The PR also renames the existing transport interface and config.transport field to ddTransport to make it clear that it handles Datadog-specific traffic (msgpack traces, stats), and ensures ddTransport always points at the Datadog agent even when OTLP mode is active.

Motivation

We are adding native OTLP export support to dd-trace-go so users can send traces directly to any OTLP-compatible collector. A prior PR added the configuration layer that resolves OTEL_TRACES_EXPORTER, OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, and OTEL_EXPORTER_OTLP_TRACES_HEADERS. This PR builds the runtime pipeline on top of that configuration.

IMPORTANT NOTE!

The existing agentTraceWriter and ddTransport were too tightly coupled to Datadog conventions (msgpack encoding, agent sampling-rate feedback, stats collection) to be extended for the OTLP path, so a separate writer and transport were introduced instead. The otlpTraceWriter owns its own otlpTransport directly (rather than reaching through config.ddTransport), keeping OTLP concerns isolated from the Datadog agent path. A follow-up PR may remove ddTransport from the config struct entirely, instead passing the transport directly to each writer and consumer at construction time. This would make all writer implementations structurally consistent, each owning its transport as a field, and eliminate the implicit coupling through config.

An additional PR to disable stats collection in otlp mode is in progress here.

Reviewer's Checklist

  • Changed code has unit tests for its functionality at or near 100% coverage.
  • System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
  • There is a benchmark for any new code, or changes to existing code.
  • If this interacts with the agent in a new way, a system test has been added.
  • New code is free of linting errors. You can check this by running make lint locally.
  • New code doesn't break existing tests. You can check this by running make test locally.
  • Add an appropriate team label so this PR gets put in the right place for the release notes.
  • All generated files are up to date. You can check this by running make generate locally.
  • Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild. Make sure all nested modules are up to date by running make fix-modules locally.

Unsure? Have a question? Request a review!

mtoffl01 and others added 30 commits March 23, 2026 09:30
… service name differs from global service name
…me (#4576)

### What does this PR do?

Moves the `strconv.ParseInt` call for `_dd.p.dm` (decision maker) out of the v1 payload flush path and into the write path.

A `dm uint32` field is added to the `trace` struct. It is kept in sync with `propagatingTags[keyDecisionMaker]` at all mutation sites (`setPropagatingTagLocked`, `unsetPropagatingTagLocked`, `replacePropagatingTags`). A shared `unsetPropagatingTagLocked` helper is introduced so the delete-and-clear logic is not duplicated between `unsetPropagatingTag` and `setSamplingPriorityLockedWithForce`.

`payloadV1.push` now reads the pre-computed value via `trace.decisionMaker()` instead of fetching the string from the propagating tags map and parsing it on every flush.

### Motivation

`payloadV1.push` is called once per finished trace chunk. On every call it was acquiring a read lock, doing a map lookup, and then running `strconv.ParseInt` — all to read a value that was set at most once or twice during the lifetime of the trace. Moving the parse to write-time eliminates the per-flush map lookup and parse, replacing them with a single scalar read behind an `RLock`.

### Reviewer's Checklist

- [x] Changed code has unit tests for its functionality at or near 100% coverage.
- [x] New code is free of linting errors. You can check this by running `make lint` locally.
- [x] New code doesn't break existing tests. You can check this by running `make test` locally.
- [x] Add an appropriate team label so this PR gets put in the right place for the release notes.
- [x] All generated files are up to date. You can check this by running `make generate` locally.


Co-authored-by: dario.castane <dario.castane@datadoghq.com>
…ED` once, use a global `bool` (#4548)

Stop checking the environment variable `DD_TRACE_128_BIT_TRACEID_GENERATION_ENABLED` on every span start. Check it once and use a global `bool`.

It'd be better to load it in `newConfig` and make the value available from there, but this check happens in `newSpanContext`, and modifying that would require changes across multiple files, including tests.

Complete #1905 work using the current benchmarking platform.

- [x] Changed code has unit tests for its functionality at or near 100% coverage.
- [x] There is a benchmark for any new code, or changes to existing code.
- [x] New code is free of linting errors. You can check this by running `make lint` locally.
- [x] New code doesn't break existing tests. You can check this by running `make test` locally.
- [x] Add an appropriate team label so this PR gets put in the right place for the release notes.

Unsure? Have a question? Request a review!

Co-authored-by: kakkoyun <kakkoyun@users.noreply.github.com>
Co-authored-by: kemal.akkoyun <kemal.akkoyun@datadoghq.com>
Signed-off-by: Moe Zein <moe.zein@datadoghq.com>
Co-authored-by: Benjamin De Bernardi <debernardi.benjamin@gmail.com>
@darccio darccio requested review from darccio and removed request for darccio March 27, 2026 17:46
@mtoffl01 mtoffl01 requested a review from zacharycmontoya April 6, 2026 14:22

// +checklocksignore — Post-finish: reads finished span fields during payload encoding.
func convertSpan(s *Span, defaultServiceName string) *otlptrace.Span {
if p, ok := s.context.SamplingPriority(); ok && p < ext.PriorityAutoKeep {
Contributor

I don't think there's ever a case where we get ok = false (right?), but what is the intended behavior if we get ok=false? Should the span be kept by default (as it is currently)?

Contributor Author

In practice, no, we never expect to get ok = false. But theoretically, if we did, it would mean the span has no sampling (or trace-level) information on it, and we would drop it. Since we would consider such a span corrupt, dropping it, as we do here, makes sense.

Also: this particular implementation drops spans that are not marked keep, per the OTel spec for client-side sampling.

@mtoffl01 mtoffl01 requested review from a team as code owners April 7, 2026 18:20
mtoffl01 and others added 2 commits April 7, 2026 14:21
…he same TCP connection instead of creating a new one for every request. Also, add tests.
@mtoffl01 mtoffl01 requested a review from hannahkm April 7, 2026 19:39
@hannahkm hannahkm left a comment

one last question!

Comment on lines +76 to +80
needsFlush := w.buffSize > payloadSizeLimit
w.mu.Unlock()
if needsFlush {
w.flush()
}
Contributor

What do you think is the possibility of w getting changed between the unlock and the flush() call? I wonder if it's worth holding the lock until after the flush is done, and having a separate flushWithLock() function? What do you think?

Contributor Author

Hmm..
needsFlush captures a snapshot of w at a moment that can serve as a trigger. flush() acquires the lock itself, so it is safe. If, say, some other goroutine called flush concurrently while we were between lines 78-79, we would then:

  1. enter flush with a lock
  2. see len(w.spans) == 0, and return immediately

So there wouldn't be any duplicate sending or race condition here.

// resolveTraceTransport returns the trace URL and headers for the transport
// based on whether OTLP export mode is active.
// resolveTraceTransport returns the trace URL and headers for the Datadog
// agent transport. In OTLP export mode the ddTransport is not used for trace
Member

👍

}

// send posts a protobuf-encoded payload to the configured OTLP endpoint.
func (t *otlpTransport) send(data []byte) error {
Member

This method is super close to io.Writer; I wonder if it would make sense to match them 🤔 I'm thinking out loud here.

Contributor Author

I believe this is what the msgpack implementation is based on. But in the OTLP implementation the "send" logic is quite different: we store everything in memory until we're ready to flush, and only then do we encode the data. The encoding happens just once per payload, at flush time, not iteratively.

readySpans := w.reset()
w.mu.Unlock()

w.climit <- struct{}{}
Member

We can just use https://pkg.go.dev/golang.org/x/sync/errgroup instead of implementing this on our own.

Contributor Author

How do you envision this?

otlpTraceWriter.flush was modeled after agentTraceWriter.flush; changing one but not the other would break consistency, and I'd rather not modify agentTraceWriter in this PR.

Also, as I understand it, errgroup would clean up the concurrency-limiting boilerplate, but it wants to cancel on the first error, whereas we just log and continue.

@mtoffl01 mtoffl01 requested review from hannahkm and kakkoyun April 13, 2026 17:57