
Conversation

@mcpherrinm (Contributor) commented Mar 16, 2023

Add a new shared config stanza which all Boulder components can use to configure their OpenTelemetry tracing. This allows components to specify where their traces should be sent, what their sampling ratio should be, and whether or not they should respect their parent's sampling decisions (so that web front-ends can ignore sampling information coming from outside our infrastructure). It's likely we'll need to evolve this configuration over time, but this is a good starting point.
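
As a rough illustration, the stanza looks something like the following Go sketch; the field names here are placeholders rather than the exact ones in this change:

// OpenTelemetryConfig is a sketch of the shared tracing config stanza.
type OpenTelemetryConfig struct {
	// Endpoint is where traces are exported (an OTLP gRPC collector address,
	// e.g. "localhost:4317"). If left empty, tracing is effectively disabled.
	Endpoint string

	// SampleRatio is the fraction of traces to head-sample, in [0, 1].
	SampleRatio float64

	// DisableParentSampler, when true, ignores sampling decisions carried in
	// incoming trace context, so web front-ends don't honor sampling flags
	// set outside our infrastructure.
	DisableParentSampler bool
}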

Add basic OpenTelemetry setup to our existing cmd.StatsAndLogging helper, so that it gets initialized at the same time as our other observability helpers. This sets certain default resource attributes on all traces/spans generated by the service; currently these include the service name, the service version, and information about the telemetry SDK itself. In the future we'll likely augment this with information about the host and process.
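
Roughly, that resource setup looks like the following (a simplified sketch; the semconv import version is an assumption):

import (
	"go.opentelemetry.io/otel/sdk/resource"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

// buildResource merges the SDK defaults (telemetry SDK info, env-derived
// attributes) with our own service identity attributes.
func buildResource(serviceName, version string) (*resource.Resource, error) {
	return resource.Merge(
		resource.Default(),
		resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String(serviceName),
			semconv.ServiceVersionKey.String(version),
		),
	)
}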

Finally, add instrumentation for the HTTP servers and gRPC clients/servers. This gives us a starting point for monitoring Boulder, but is fairly minimal, as this PR is already somewhat unwieldy: it's really only enough to confirm that everything is wired up properly in the configuration. In subsequent work we'll enrich those spans with more data, and add more spans for things not automatically traced here.
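
As a rough sketch of that wiring using the otelhttp and otelgrpc contrib packages (simplified; not the exact code in this change):

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// wrapHTTP wraps an HTTP handler so every incoming request gets a server span.
func wrapHTTP(h http.Handler, operation string) http.Handler {
	return otelhttp.NewHandler(h, operation)
}

// newGRPCServer returns a gRPC server whose unary handlers are traced.
func newGRPCServer() *grpc.Server {
	return grpc.NewServer(grpc.UnaryInterceptor(otelgrpc.UnaryServerInterceptor()))
}

// dialGRPC opens a client connection whose unary calls get client spans and
// propagate trace context to the server. (Plaintext credentials here are only
// to keep the sketch short.)
func dialGRPC(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithUnaryInterceptor(otelgrpc.UnaryClientInterceptor()),
	)
}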

Fixes #6361

@mcpherrinm (Contributor, Author) commented:

Here are some questions to think about in review:

What amount of dependency vetting should we do here? This is adding a fair number of new dependencies.

Is the configuration format reasonable? I've basically just picked a few settings that I want for the manual testing I've done so far, plus some guesses at what we might want to start with in dev/staging.

There's not much in the way of testing here. I'm going to add an integration test in a follow-up, which is probably the best way to validate this works largely end-to-end, but is there anything we should be doing before that? Also, how should we track follow-up work? One big issue or a constellation of work items?

@mcpherrinm marked this pull request as ready for review March 16, 2023 12:38
@mcpherrinm requested a review from a team as a code owner March 16, 2023 12:38
@mcpherrinm requested a review from pgporada March 16, 2023 12:38
@aarongable (Contributor) left a comment

This will need to be updated to resolve conflicts with the "remove beeline" PR.

Comments below are largely questions and minor nits. I have not done any dependency vetting yet.

Regarding your specific questions:

  • More vetting than I've done so far (none). Not sure how much we want to do. Probably won't know until I start.
  • I think the configuration format is reasonable. For beeline we specifically wanted a "mute" option to be able to turn it entirely off, but it looks like that happens here if both Endpoint and StdoutExporter are their zero values. I think we'll eventually want to have more nuanced sampling (like higher rates for 500s than 200s). That may be something we'll want to tune with configs, but I think it can wait for now.
  • I don't think there's any other testing in particular to do here. Maybe the one big thing would be a simple unit test for newOpenTelemetry that verifies that the ServiceName and ServiceVersion get set correctly, that the stdout output looks reasonable, or whatever else can be easily verified.
  • I lean towards follow-up being a bunch of small tickets. It's always easier for one person to take three tickets than it is to split one ticket across two people working in parallel. Also closing tickets just feels good.

@mcpherrinm (Contributor, Author) commented Mar 24, 2023

> I think we'll eventually want to have more nuanced sampling (like higher rates for 500s than 200s).

I think we want this, but we probably won't want Boulder to do it: the sampler here does head sampling, choosing at the start of a request whether or not to sample it and send traces for that request at all.

For sampling based on whether or not a request was an error, we want tail sampling: sampling after all the spans have been received by the OpenTelemetry collector.

We may not want to do sampling in Boulder at all, and entirely rely on collector sidecars or a central receiver to do it.
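
For reference, head sampling with the Go SDK looks roughly like this (a simplified sketch, not the exact code in this PR): the ratio is applied when a trace starts, and the ParentBased wrapper is what makes us honor (or, if dropped, ignore) an upstream sampling decision.

import sdktrace "go.opentelemetry.io/otel/sdk/trace"

// newTracerProvider head-samples roughly the given fraction of new traces.
// With ParentBased, an incoming sampled/not-sampled flag is respected; remove
// the wrapper (or adjust its options) to ignore decisions made outside our
// infrastructure.
func newTracerProvider(ratio float64) *sdktrace.TracerProvider {
	sampler := sdktrace.ParentBased(sdktrace.TraceIDRatioBased(ratio))
	return sdktrace.NewTracerProvider(sdktrace.WithSampler(sampler))
}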

@aarongable (Contributor) left a comment

LGTM, aside from figuring out what to do about all of the defers not working the way we'd like. I'll get back to that discussion shortly.

@mcpherrinm (Contributor, Author) commented Apr 7, 2023

What's left to do after this PR? I see several major tasks:

  1. Add spans around outgoing requests: MariaDB, Redis, AWS, CT, DNS, HTTP and ALPN challenges, and HSM signing operations. For some of those we may want trace propagation too.
  2. Ensure we appropriately catch and record errors in spans
  3. Add the current audit log objects to spans
  4. Add any other data to enrich spans
  5. Add a file or stdout exporter to save traces without any additional software
  6. Add unit and integration testing to validate traces have expected data using (5)
  7. Decide whether we want Jaeger (or something similar) in the standard development configuration for interactive use
  8. Figure out whether (and how) we want log lines to include a trace identifier, for correlating traces and logs
  9. Configuration validation

@aarongable (Contributor) left a comment

I think this looks good to me, just a few tiny comments at this point.

Of the remaining work items you mentioned, I think we should file bugs for: 1, 2 (can this be done automatically at gRPC response time?), 3 (does OTel support 0-time spans?), and 5-and-6 together. The others should be discussed, and even if we do want to do them (which we probably do!) they will be lower priority than those first few.

@mcpherrinm (Contributor, Author) commented Apr 15, 2023

Oof, I merged main into this branch and then tried to reorder the merge with a rebase before a committed suggestion above (since it relied on the new cmd.WaitForSignal), but that definitely didn't work out.

I force-pushed a rebased PR to address that. Sorry for the mess!

mcpherrinm and others added 18 commits April 18, 2023 18:18
We deliberately drop the context to keep timeouts, but we should
propagate the same span through still.

Name the spans too.
This moves where the select {} run-forever loop is, and adds a signal
handler.
Co-authored-by: Aaron Gable <[email protected]>
Co-authored-by: Aaron Gable <[email protected]>
Even if it doesn't do anything now, it makes it more consistent
For now, don't include this.  I intend to reintroduce it when setting up integration testing in a subsequent PR.
mcpherrinm and others added 5 commits April 18, 2023 18:19
Adds a context to the shutdown callback, which can be cancelled,
for use with a timeout.
@mcpherrinm requested a review from aarongable April 18, 2023 23:09
@aarongable (Contributor) left a comment

  • Your code LGTM, with one comment that should probably be addressed as a follow-up / side ticket.
  • I've reviewed all of the non-go.opentelemetry.io transitive dependencies so far. There's some hairy parsing code in the grpc-ecosystem/grpc-gateway dep that I don't fully understand. There's also a bunch of syscalls in x/sys/windows/registry, but that's no surprise, we're not running that code, and it's basically stdlib, so I'm not worried about it.
  • I plan to read through the otel deps themselves today.

cmd/shell.go Outdated
	resource.Default(),
	resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceNameKey.String(serviceName),
Contributor:

It's worth noting that, in our logging library, we extract the service name directly as path.Base(os.Args[0]).

I don't love that, especially given how we've gone through a few different versions of symlinks vs subcommands for starting boulder components. But just noting that it wouldn't be out of place for this to simply use path.Base(os.Args[0]) instead of plumbing a hard-coded serviceName through from the callsite in main().

Alternatively, we should file a bug to plumb this same serviceName argument from StatsAndLogging into the logging library too.

@mcpherrinm (Contributor, Author):

We've even got a command() helper for that in this file, which I hadn't noticed before.

Let's use that, which guarantees we're consistent with other setup here.
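
(For context, that helper is essentially a one-liner along these lines; a sketch, not the exact code in cmd/shell.go:)

import (
	"os"
	"path"
)

// command returns the base name of the currently-running binary,
// e.g. "boulder-wfe2".
func command() string {
	return path.Base(os.Args[0])
}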

@mcpherrinm (Contributor, Author):

I've also opened #6833 to consistently use that function everywhere instead of sprinkling path.Base(os.Args[0]) around the codebase.

We already have a `command()` function, use that.
This ensures any log messages coming from inside OpenTelemetry itself
will go through our logging, instead of being printed to standard error.
@mcpherrinm (Contributor, Author) commented Apr 19, 2023

I've realized another thing we're missing: we weren't setting the otel logger/error handler, which meant errors were just being printed to stderr by default. I've added a commit to pipe them through to a blog.Logger. It adds yet another wrapper for a function from log.Logger; I could instead reuse (and rename) the promLogger type, which implements a different method of log.Logger.

aarongable previously approved these changes Apr 19, 2023
@aarongable (Contributor) left a comment

Ok, I've read through all of the rest of the new dependency files. They inspired a few additional comments, but nothing big. LGTM!

Comment on lines +463 to +465
// AddEvent adds an event with the provided name and options. If this span is
// not being recorded than this method does nothing.
func (s *recordingSpan) AddEvent(name string, o ...trace.EventOption) {
Contributor:

Excited to try adding Events to the current span each time we emit an Audit Log.

	return BrowserLanguageKey.String(val)
}

// A cloud environment (e.g. GCP, Azure, AWS)
Contributor:

Note for the future: We should use some of these standardized cloud provider keys when further instrumenting RVA spans.

Comment on lines +612 to +613
	// Examples: 'staging', 'production'
	DeploymentEnvironmentKey = attribute.Key("deployment.environment")
Contributor:

This one will be useful to set on all traces too.

otel.SetErrorHandler(otel.ErrorHandlerFunc(func(err error) { logger.Errf("OpenTelemetry error: %v", err) }))

r, err := resource.Merge(
	resource.Default(),
Contributor:

It appears that Default() calls resource.Detect() with the defaultServiceNameDetector{}, fromEnv{}, and telemetrySDK{} detectors. This is a great start, but these don't set very many attributes. In the dependencies, I've called out some other values that I think we'll want to set in the future.

An easy one to add right now would be .WithHost(). Others that may be valuable in the future are .WithOS() and .WithProcess().
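
For reference, a sketch of what pulling in those detectors could look like, using the sdk/resource options named above:

import (
	"context"

	"go.opentelemetry.io/otel/sdk/resource"
)

// extraResource gathers host, OS, and process attributes in addition to the
// SDK defaults; which of these we actually want is still an open question.
func extraResource(ctx context.Context) (*resource.Resource, error) {
	return resource.New(ctx,
		resource.WithTelemetrySDK(), // already covered by resource.Default()
		resource.WithHost(),         // host.name
		resource.WithOS(),           // os.type, os.description
		resource.WithProcess(),      // process.pid, process.command_args, etc.
	)
}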

@mcpherrinm (Contributor, Author) commented Apr 20, 2023

I think we definitely want more attributes in our traces, but it's easy to add them later, as this PR is already pretty sprawling. And it'll be easier to see what else we want once we're running Boulder and getting traces out of it.

	return FaaSMaxMemoryKey.Int(val)
}

// A host is defined as a general computing instance.
Contributor:

It looks like .WithHost() will add the HostNameKey, but not most of these other host-based attributes. We should consider which (if any) of these would be useful to us.

@aarongable requested a review from a team April 19, 2023 22:37
@beautifulentropy (Member) left a comment

Looks great. One small consistency nit. Feel free to pass on it or not.


-	scope, logger := cmd.StatsAndLogging(c.Syslog, apc.DebugAddr)
+	scope, logger, shutdown := cmd.StatsAndLogging(c.Syslog, c.OpenTelemetry, apc.DebugAddr)
+	defer shutdown(context.Background())
Member:

Consistency nit; regardless of whether this is deferred immediately we should name the otel shutdown var the same. I'm partial to your use of oTelShutdown elsewhere.

@mcpherrinm (Contributor, Author):

I had shutdown everywhere initially, but changed it to oTelShutdown where there are long-distance jumps between declaration and use (the two cases where we've got timeouts).

I agree being consistent is worth it, though, as it's clearer no matter what.

@aarongable merged commit 0060e69 into letsencrypt:main Apr 21, 2023
@mcpherrinm deleted the mattm-introduce-otel branch April 21, 2023 19:10
pgporada pushed a commit that referenced this pull request Apr 21, 2023
This moves command() from cmd/ into core.Command() and uses it from the
log and main packages, ensuring we have a single implementation of
path.Base(os.Args[0]) instead of scattering that expression around. I put it
in core/util.go since similar functions, like the BuildID and revision info
helpers, already live there, though I am not entirely sure it's the
right place.

This came up in @aarongable's review of #6750, where he pointed out I
was manually hardcoding commands instead of using command() or similar.
