
Conversation

@mcpherrinm (Contributor) commented Mar 16, 2023

Add a new shared config stanza which all Boulder components can use to configure their OpenTelemetry tracing. This allows components to specify where their traces should be sent, what their sampling ratio should be, and whether or not they should respect their parent's sampling decisions (so that web front-ends can ignore sampling information coming from outside our infrastructure). It's likely we'll need to evolve this configuration over time, but this is a good starting point.
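
As a rough illustration, the stanza looks something like the following Go sketch; the field names here are placeholders rather than the exact ones in this change:

// OpenTelemetryConfig is a sketch of the shared tracing config stanza.
type OpenTelemetryConfig struct {
	// Endpoint is where traces are exported (an OTLP gRPC collector address,
	// e.g. "localhost:4317"). If left empty, tracing is effectively disabled.
	Endpoint string

	// SampleRatio is the fraction of traces to head-sample, in [0, 1].
	SampleRatio float64

	// DisableParentSampler, when true, ignores sampling decisions carried in
	// incoming trace context, so web front-ends don't honor sampling flags
	// set outside our infrastructure.
	DisableParentSampler bool
}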

Add basic OpenTelemetry setup to our existing cmd.StatsAndLogging helper, so that it gets initialized at the same time as our other observability helpers. This sets certain default resource attributes on all traces/spans generated by the service; currently these include the service name, the service version, and information about the telemetry SDK itself. In the future we'll likely augment this with information about the host and process.
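
Roughly, that resource setup looks like the following (a simplified sketch; the semconv import version is an assumption):

import (
	"go.opentelemetry.io/otel/sdk/resource"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

// buildResource merges the SDK defaults (telemetry SDK info, env-derived
// attributes) with our own service identity attributes.
func buildResource(serviceName, version string) (*resource.Resource, error) {
	return resource.Merge(
		resource.Default(),
		resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String(serviceName),
			semconv.ServiceVersionKey.String(version),
		),
	)
}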

Finally, add instrumentation for the HTTP servers and gRPC clients/servers. This gives us a starting point for monitoring Boulder, but is fairly minimal, as this PR is already somewhat unwieldy: it's really only enough to confirm that everything is wired up properly in the configuration. In subsequent work we'll enrich those spans with more data, and add more spans for things not automatically traced here.
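
As a rough sketch of that wiring using the otelhttp and otelgrpc contrib packages (simplified; not the exact code in this change):

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// wrapHTTP wraps an HTTP handler so every incoming request gets a server span.
func wrapHTTP(h http.Handler, operation string) http.Handler {
	return otelhttp.NewHandler(h, operation)
}

// newGRPCServer returns a gRPC server whose unary handlers are traced.
func newGRPCServer() *grpc.Server {
	return grpc.NewServer(grpc.UnaryInterceptor(otelgrpc.UnaryServerInterceptor()))
}

// dialGRPC opens a client connection whose unary calls get client spans and
// propagate trace context to the server. (Plaintext credentials here are only
// to keep the sketch short.)
func dialGRPC(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithUnaryInterceptor(otelgrpc.UnaryClientInterceptor()),
	)
}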

Fixes #6361

@mcpherrinm (Contributor, Author) commented:

Here are some questions to think about in review:

What amount of dependency vetting should we do here? This is adding a fair number of new dependencies.

Is the configuration format reasonable? I've basically just picked a few settings that I want for the manual testing I've done so far, plus some guesses at what we might want to start with in dev/staging.

There's not much in the way of testing here. I'm going to add an integration test in a follow-up, which is probably the best way to validate this works largely end-to-end, but is there anything we should be doing before that? Also, how should we track follow-up work? One big issue or a constellation of work items?

@mcpherrinm marked this pull request as ready for review March 16, 2023 12:38
@mcpherrinm requested a review from a team as a code owner March 16, 2023 12:38
@mcpherrinm requested a review from pgporada March 16, 2023 12:38
@aarongable (Contributor) left a comment

This will need to be updated to resolve conflicts with the "remove beeline" PR.

Comments below are largely questions and minor nits. I have not done any dependency vetting yet.

Regarding your specific questions:

  • More vetting than I've done so far (none). Not sure how much we want to do. Probably won't know until I start.
  • I think the configuration format is reasonable. For beeline we specifically wanted a "mute" option to be able to turn it entirely off, but it looks like that happens here if both Endpoint and StdoutExporter are their zero values. I think we'll eventually want to have more nuanced sampling (like higher rates for 500s than 200s). That may be something we'll want to tune with configs, but I think it can wait for now.
  • I don't think there's any other testing in particular to do here. Maybe the one big thing would be a simple unit test for newOpenTelemetry that verifies that the ServiceName and ServiceVersion get set correctly, that the stdout output looks reasonable, or whatever else can be easily verified.
  • I lean towards follow-up being a bunch of small tickets. It's always easier for one person to take three tickets than it is to split one ticket across two people working in parallel. Also closing tickets just feels good.

@mcpherrinm (Contributor, Author) commented Mar 24, 2023

> I think we'll eventually want to have more nuanced sampling (like higher rates for 500s than 200s).

I think we want this, but we probably won't want Boulder to do it: the sampler here does head sampling, choosing at the start of a request whether or not to sample it and send traces for that request at all.

For sampling based on whether or not a request was an error, we want tail sampling: sampling after all the spans have been received by the OpenTelemetry collector.

We may not want to do sampling in Boulder at all, and entirely rely on collector sidecars or a central receiver to do it.
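
For reference, head sampling with the Go SDK looks roughly like this (a simplified sketch, not the exact code in this PR): the ratio is applied when a trace starts, and the ParentBased wrapper is what makes us honor (or, if dropped, ignore) an upstream sampling decision.

import sdktrace "go.opentelemetry.io/otel/sdk/trace"

// newTracerProvider head-samples roughly the given fraction of new traces.
// With ParentBased, an incoming sampled/not-sampled flag is respected; remove
// the wrapper (or adjust its options) to ignore decisions made outside our
// infrastructure.
func newTracerProvider(ratio float64) *sdktrace.TracerProvider {
	sampler := sdktrace.ParentBased(sdktrace.TraceIDRatioBased(ratio))
	return sdktrace.NewTracerProvider(sdktrace.WithSampler(sampler))
}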

@aarongable (Contributor) left a comment

LGTM, aside from figuring out what to do about all of the defers not working the way we'd like. I'll get back to that discussion shortly.

@mcpherrinm (Contributor, Author) commented Apr 7, 2023

What's left to do after this PR? I see several major tasks:

  1. Add spans around outgoing requests: MariaDB, Redis, AWS, CT, DNS, HTTP and ALPN challenges, and HSM signing operations. For some of those we may want trace propagation too.
  2. Ensure we appropriately catch and record errors in spans
  3. Add the current audit log objects to spans
  4. Add any other data to enrich spans
  5. Add a file or stdout exporter to save traces without any additional software
  6. Add unit and integration testing to validate traces have expected data using (5)
  7. Decide whether we want Jaeger (or something similar) in the standard development configuration for interactive use
  8. Figure out whether (and how) we want log lines to include a trace identifier, for correlating traces and logs
  9. Configuration validation

@aarongable (Contributor) left a comment

I think this looks good to me, just a few tiny comments at this point.

Of the remaining work items you mentioned, I think we should file bugs for: 1, 2 (can this be done automatically at gRPC response time?), 3 (does OTel support 0-time spans?), and 5-and-6 together. The others should be discussed, and even if we do want to do them (which we probably do!) they will be lower priority than those first few.

@mcpherrinm (Contributor, Author) commented Apr 15, 2023

Oof, I merged main into this branch and then tried to reorder the merge with a rebase before a committed suggestion above (since it relied on the new cmd.WaitForSignal), but that definitely didn't work out.

I force-pushed a rebased PR to address that. Sorry for the mess!

mcpherrinm and others added 18 commits April 18, 2023 18:18
We deliberately drop the context to keep timeouts, but we should
propagate the same span through still.

Name the spans too.
This moves where the select {} run-forever loop is, and adds a signal
handler.
Co-authored-by: Aaron Gable <[email protected]>
Co-authored-by: Aaron Gable <[email protected]>
Even if it doesn't do anything now, it makes it more consistent
For now, don't include this.  I intend to reintroduce it when setting up integration testing in a subsequent PR.
mcpherrinm and others added 5 commits April 18, 2023 18:19
Adds a context to the shutdown callback, which can be cancelled,
for use with a timeout.
@mcpherrinm requested a review from aarongable April 18, 2023 23:09
@aarongable (Contributor) left a comment

  • Your code LGTM, with one comment that should probably be addressed as a follow-up / side ticket.
  • I've reviewed all of the non-go.opentelemetry.io transitive dependencies so far. There's some hairy parsing code in the grpc-ecosystem/grpc-gateway dep that I don't fully understand. There's also a bunch of syscalls in x/sys/windows/registry, but that's no surprise, we're not running that code, and it's basically stdlib, so I'm not worried about it.
  • I plan to read through the otel deps themselves today.

cmd/shell.go Outdated
	resource.Default(),
	resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceNameKey.String(serviceName),
Contributor:

It's worth noting that, in our logging library, we extract the service name directly as path.Base(os.Args[0]).

I don't love that, especially given how we've gone through a few different versions of symlinks vs subcommands for starting boulder components. But just noting that it wouldn't be out of place for this to simply use path.Base(os.Args[0]) instead of plumbing a hard-coded serviceName through from the callsite in main().

Alternatively, we should file a bug to plumb this same serviceName argument from StatsAndLogging into the logging library too.

@mcpherrinm (Contributor, Author):

We've even got a command() helper for that in this file, which I hadn't noticed before.

Let's use that, which guarantees we're consistent with other setup here.
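
(For context, that helper is essentially a one-liner along these lines; a sketch, not the exact code in cmd/shell.go:)

import (
	"os"
	"path"
)

// command returns the base name of the currently-running binary,
// e.g. "boulder-wfe2".
func command() string {
	return path.Base(os.Args[0])
}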

@mcpherrinm (Contributor, Author):

I've also opened #6833 to consistently use that function everywhere instead of sprinkling path.Base(os.Args[0]) around the codebase.

We already have a `command()` function, use that.
This ensures any log messages coming from inside OpenTelemetry itself
will go through our logging, instead of being printed to standard error.
@mcpherrinm (Contributor, Author) commented Apr 19, 2023

I've realized another thing we're missing: we weren't setting the otel logger/error handler, which meant errors were just being printed to stderr by default. I've added a commit to pipe them through to a blog.Logger. It adds yet another wrapper for a function from log.Logger; I could instead reuse (and rename) the promLogger type, which implements a different method of log.Logger.

aarongable previously approved these changes Apr 19, 2023
@aarongable (Contributor) left a comment

Ok, I've read through all of the rest of the new dependency files. They inspired a few additional comments, but nothing big. LGTM!

Comment on lines +463 to +465
// AddEvent adds an event with the provided name and options. If this span is
// not being recorded than this method does nothing.
func (s *recordingSpan) AddEvent(name string, o ...trace.EventOption) {
Contributor:

Excited to try adding Events to the current span each time we emit an Audit Log.

	return BrowserLanguageKey.String(val)
}

// A cloud environment (e.g. GCP, Azure, AWS)
Contributor:

Note for the future: We should use some of these standardized cloud provider keys when further instrumenting RVA spans.

Comment on lines +612 to +613
	// Examples: 'staging', 'production'
	DeploymentEnvironmentKey = attribute.Key("deployment.environment")
Contributor:

This one will be useful to set on all traces too.

otel.SetErrorHandler(otel.ErrorHandlerFunc(func(err error) { logger.Errf("OpenTelemetry error: %v", err) }))

r, err := resource.Merge(
	resource.Default(),
Contributor:

It appears that Default() calls resource.Detect() with the defaultServiceNameDetector{}, fromEnv{}, and telemetrySDK{} detectors. This is a great start, but these don't set very many attributes. In the dependencies, I've called out some other values that I think we'll want to set in the future.

An easy one to add right now would be .WithHost(). Others that may be valuable in the future are .WithOS() and .WithProcess().
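
For reference, a sketch of what pulling in those detectors could look like, using the sdk/resource options named above:

import (
	"context"

	"go.opentelemetry.io/otel/sdk/resource"
)

// extraResource gathers host, OS, and process attributes in addition to the
// SDK defaults; which of these we actually want is still an open question.
func extraResource(ctx context.Context) (*resource.Resource, error) {
	return resource.New(ctx,
		resource.WithTelemetrySDK(), // already covered by resource.Default()
		resource.WithHost(),         // host.name
		resource.WithOS(),           // os.type, os.description
		resource.WithProcess(),      // process.pid, process.command_args, etc.
	)
}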

@mcpherrinm (Contributor, Author) commented Apr 20, 2023

I think we definitely want more attributes in our traces, but it's easy to add them later, as this PR is already pretty sprawling. And it'll be easier to see what else we want once we're running Boulder and getting traces out of it.

	return FaaSMaxMemoryKey.Int(val)
}

// A host is defined as a general computing instance.
Contributor:

It looks like .WithHost() will add the HostNameKey, but not most of these other host-based attributes. We should consider which (if any) of these would be useful to us.

@aarongable requested a review from a team April 19, 2023 22:37
@beautifulentropy (Member) left a comment

Looks great. One small consistency nit. Feel free to pass on it or not.


-	scope, logger := cmd.StatsAndLogging(c.Syslog, apc.DebugAddr)
+	scope, logger, shutdown := cmd.StatsAndLogging(c.Syslog, c.OpenTelemetry, apc.DebugAddr)
+	defer shutdown(context.Background())
Member:

Consistency nit; regardless of whether this is deferred immediately we should name the otel shutdown var the same. I'm partial to your use of oTelShutdown elsewhere.

@mcpherrinm (Contributor, Author):

I had shutdown everywhere initially, but changed it to oTelShutdown where there are long-distance jumps between declaration and use (the two cases where we've got timeouts).

I agree being consistent is worth it, though, as it's clearer no matter what.

@aarongable merged commit 0060e69 into letsencrypt:main Apr 21, 2023
@mcpherrinm deleted the mattm-introduce-otel branch April 21, 2023 19:10
pgporada pushed a commit that referenced this pull request Apr 21, 2023
This moves command() from cmd/ into core.Command() and uses it from the
log and main packages, ensuring we have a single implementation of
path.Base(os.Args[0]) instead of scattering that expression around. I put it
in core/util.go since similar functions, like the BuildID and revision info
helpers, already live there, though I am not entirely sure it's the
right place.

This came up in @aarongable's review of #6750, where he pointed out I
was manually hardcoding commands instead of using command() or similar.
