Introduce OpenTelemetry Tracing #6750
Conversation
Here are some questions to think about in review:
- What amount of dependency vetting should we do here? This is adding a fair number of new dependencies.
- Is the configuration format reasonable? I've basically just picked a few settings that I want for the manual testing I've done so far, plus some imagining of what we might want to start with in dev/stg.
- There's not much in the way of testing here. I'm going to add an integration test in a follow-up, which is probably the best way to validate this works largely end-to-end, but is there anything we should be doing before that?
- How should we track follow-up work? One big issue or a constellation of work items?
aarongable
left a comment
This will need to be updated to resolve conflicts with the "remove beeline" PR.
Comments below are largely questions and minor nits. I have not done any dependency vetting yet.
Regarding your specific questions:
- More vetting than I've done so far (none). Not sure how much we want to do. Probably won't know until I start.
- I think the configuration format is reasonable. For beeline we specifically wanted a "mute" option to be able to turn it entirely off, but it looks like that happens here if both Endpoint and StdoutExporter are their zero values. I think we'll eventually want to have more nuanced sampling (like higher rates for 500s than 200s). That may be something we'll want to tune with configs, but I think it can wait for now.
- I don't think there's particularly any other testing to do here. Maybe the one big thing would be a simple unit test for `newOpenTelemetry` that verifies that the ServiceName and ServiceVersion get set correctly, that the stdout looks reasonable, or whatever else can be easily verified.
- I lean towards follow-up being a bunch of small tickets. It's always easier for one person to take three tickets than it is to split one ticket across two people working in parallel. Also, closing tickets just feels good.
I think we want this, but we probably won't want Boulder to do it: The sampler here is head sampling, choosing whether or not to sample at the start of the request and whether to send traces for that request at all. For sampling based on whether or not a request was an error, we want tail sampling: sampling after all the spans have been received by the OpenTelemetry collector. We may not want to do sampling in Boulder at all, and instead rely entirely on collector sidecars or a central receiver to do it.
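For reference, a minimal sketch of what head sampling looks like with the Go SDK; the helper name and ratio here are illustrative, not what this PR configures:

```go
package example

import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newHeadSamplingProvider is a sketch of head sampling: the decision is made
// when the root span starts. ParentBased defers to an upstream decision when
// one exists; the ratio only applies to new (root) traces.
func newHeadSamplingProvider(ratio float64) *sdktrace.TracerProvider {
	return sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(ratio))),
	)
}
```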
aarongable
left a comment
LGTM, aside from figuring out what to do about all of the defers not working the way we'd like. I'll get back to that discussion shortly.
What's left to do after this PR? I see several major tasks:
aarongable
left a comment
I think this looks good to me, just a few tiny comments at this point.
Of the remaining work items you mentioned, I think we should file bugs for: 1, 2 (can this be done automatically at gRPC response time?), 3 (does OTel support 0-time spans?), and 5-and-6 together. The others should be discussed, and even if we do want to do them (which we probably do!) they will be lower priority than those first few.
Ooof, I merged main into this branch and then tried to reorder the merge with a rebase before the committed suggestion above (since it relied on the new cmd.WaitForSignal), but that definitely didn't work out. I force-pushed a rebased PR to address that. Sorry for the mess!
We deliberately drop the context to keep timeouts, but we should propagate the same span through still. Name the spans too.
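A rough sketch of that pattern with the OTel API, assuming we copy only the span into a fresh context; the function name is illustrative:

```go
package example

import (
	"context"

	"go.opentelemetry.io/otel/trace"
)

// detachedWithSpan drops the caller's context, so its cancellation and
// deadline no longer apply and our own timeouts govern, while carrying the
// same span forward so the work stays in the same trace. Sketch only.
func detachedWithSpan(ctx context.Context) context.Context {
	return trace.ContextWithSpan(context.Background(), trace.SpanFromContext(ctx))
}
```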
This moves where the select {} run-forever loop is, and adds a signal
handler.
Co-authored-by: Aaron Gable <[email protected]>
Even if it doesn't do anything now, it makes it more consistent
For now, don't include this. I intend to reintroduce it when setting up integration testing in a subsequent PR.
Co-authored-by: Aaron Gable <[email protected]>
Adds a context to the shutdown callback, which can be cancelled, for use with a timeout.
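A minimal sketch of how a caller might use that cancellable shutdown callback with a timeout; the helper name and the 5s value are illustrative:

```go
package example

import (
	"context"
	"time"
)

// shutdownWithTimeout gives the exporter a bounded amount of time to flush
// any buffered spans during shutdown, then gives up.
func shutdownWithTimeout(shutdown func(ctx context.Context)) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	shutdown(ctx)
}
```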
aarongable
left a comment
- Your code LGTM, with one comment that should probably be addressed as a follow-up / side ticket.
- I've reviewed all of the non-go.opentelemetry.io transitive dependencies so far. There's some hairy parsing code in the grpc-ecosystem/grpc-gateway dep that I don't fully understand. There's also a bunch of syscalls in x/sys/windows/registry, but that's no surprise, we're not running that code, and it's basically stdlib, so I'm not worried about it.
- I plan to read through the otel deps themselves today.
cmd/shell.go
resource.Default(),
resource.NewWithAttributes(
	semconv.SchemaURL,
	semconv.ServiceNameKey.String(serviceName),
It's worth noting that, in our logging library, we extract the service name directly as path.Base(os.Args[0]).
I don't love that, especially given how we've gone through a few different versions of symlinks vs subcommands for starting boulder components. But just noting that it wouldn't be out of place for this to simply use path.Base(os.Args[0]) instead of plumbing a hard-coded serviceName through from the callsite in main().
Alternatively, we should file a bug to plumb this same serviceName argument from StatsAndLogging into the logging library too.
We've even got a command() helper for that in this file, which I hadn't noticed before.
Let's use that, which guarantees we're consistent with other setup here.
I've also opened #6833 to consistently use that function everywhere instead of sprinkling path.Base(os.Args[0]) around the codebase.
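For context, the helper being discussed is essentially the following; this is a sketch, and the exact implementation in cmd/shell.go (and later core.Command()) may differ slightly:

```go
package example

import (
	"os"
	"path"
)

// command returns the base name of the running binary (e.g. "boulder-wfe2"),
// which is used as the service name.
func command() string {
	return path.Base(os.Args[0])
}
```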
We already have a `command()` function, use that.
This ensures any log messages coming from inside OpenTelemetry itself will go through our logging, instead of being printed to standard error.
I've realized another thing we're missing: We weren't setting the otel logger/error handler, which means they were just being printed to stderr by default. I've added a commit to pipe them through to a blog.Logger. It adds yet another wrapper for a function from
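Roughly, the wiring looks like this; the logr/funcr adapter is my own illustration of one way to bridge into a blog.Logger, not necessarily what the commit does:

```go
package example

import (
	"github.com/go-logr/logr/funcr"
	"go.opentelemetry.io/otel"

	blog "github.com/letsencrypt/boulder/log"
)

// setOTelLogging routes OpenTelemetry's internal diagnostics and errors
// through our logger instead of stderr. The funcr adapter is an assumption
// about how the bridge might be built.
func setOTelLogging(logger blog.Logger) {
	otel.SetLogger(funcr.New(func(prefix, args string) {
		logger.Infof("OpenTelemetry: %s %s", prefix, args)
	}, funcr.Options{}))
	otel.SetErrorHandler(otel.ErrorHandlerFunc(func(err error) {
		logger.Errf("OpenTelemetry error: %v", err)
	}))
}
```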
aarongable
left a comment
Ok, I've read through all of the rest of the new dependency files. They inspired a few additional comments, but nothing big. LGTM!
// AddEvent adds an event with the provided name and options. If this span is
// not being recorded than this method does nothing.
func (s *recordingSpan) AddEvent(name string, o ...trace.EventOption) {
Excited to try adding Events to the current span each time we emit an Audit Log.
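A sketch of what that could look like; the event name and attribute key are hypothetical:

```go
package example

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// auditEvent attaches an audit log line to the current span as an event.
func auditEvent(ctx context.Context, msg string) {
	span := trace.SpanFromContext(ctx)
	span.AddEvent("audit", trace.WithAttributes(attribute.String("log.message", msg)))
}
```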
	return BrowserLanguageKey.String(val)
}

// A cloud environment (e.g. GCP, Azure, AWS)
Note for the future: We should use some of these standardized cloud provider keys when further instrumenting RVA spans.
// Examples: 'staging', 'production'
DeploymentEnvironmentKey = attribute.Key("deployment.environment")
This one will be useful to set on all traces too.
otel.SetErrorHandler(otel.ErrorHandlerFunc(func(err error) { logger.Errf("OpenTelemetry error: %v", err) }))

r, err := resource.Merge(
	resource.Default(),
It appears that Default() calls resource.Detect() with the defaultServiceNameDetector{}, fromEnv{}, and telemetrySDK{} detectors. This is a great start, but these don't set very many attributes. In the dependencies, I've called out some other values that I think we'll want to set in the future.
An easy one to add right now would be .WithHost(). Ones that may be valuable in the future seem like .WithOS() and .WithProcess().
I think we definitely want more attributes in our traces, but it's easy to add them later, as this PR is already pretty sprawling. And it'll be easier to see what else we want once we're running Boulder and getting traces out of it.
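For later reference, a sketch of what pulling in those extra detectors might look like; the helper name is illustrative:

```go
package example

import (
	"context"

	"go.opentelemetry.io/otel/sdk/resource"
)

// extraResource shows the detectors discussed above: WithHost adds host.name,
// WithOS adds os.* attributes, and WithProcess adds process.* attributes.
func extraResource(ctx context.Context) (*resource.Resource, error) {
	return resource.New(ctx,
		resource.WithHost(),
		resource.WithOS(),
		resource.WithProcess(),
	)
}
```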
	return FaaSMaxMemoryKey.Int(val)
}

// A host is defined as a general computing instance.
It looks like .WithHost() will add the HostNameKey, but not most of these other host-based attributes. We should consider which (if any) of these would be useful to us.
beautifulentropy
left a comment
Looks great. One small consistency nit. Feel free to pass on it or not.
cmd/akamai-purger/main.go
scope, logger := cmd.StatsAndLogging(c.Syslog, apc.DebugAddr)
scope, logger, shutdown := cmd.StatsAndLogging(c.Syslog, c.OpenTelemetry, apc.DebugAddr)
defer shutdown(context.Background())
Consistency nit; regardless of whether this is deferred immediately we should name the otel shutdown var the same. I'm partial to your use of oTelShutdown elsewhere.
I had `shutdown` everywhere initially but changed it to `oTelShutdown` where there are long-distance jumps between declaration and use (the two cases where we've got timeouts).
I agree being consistent is worth it, though, as it's clearer no matter what.
ca7cb18
This moves command() from cmd/ into core.Command() and uses it from the log and main package, ensuring we have a single implementation of path.Base(os.Args[0]) instead of scattering that method around. I put it in core/util.go as similar enough functions like the BuildID and revision info already live there, though I am not entirely sure it's the right place. This came up in @aarongable's review of #6750, where he pointed out I was manually hardcoding commands instead of using command() or similar.
Add a new shared config stanza which all Boulder components can use to configure their OpenTelemetry tracing. This allows components to specify where their traces should be sent, what their sampling ratio should be, and whether or not they should respect their parent's sampling decisions (so that web front-ends can ignore sampling info coming from outside our infrastructure). It's likely we'll need to evolve this configuration over time, but this is a good starting point.
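A hypothetical sketch of such a stanza as a Go struct; the field names and types here are guesses based on this discussion, not the actual configuration:

```go
package example

// OpenTelemetryConfig is a sketch of the shared stanza described above.
type OpenTelemetryConfig struct {
	// Endpoint is the OTLP endpoint (host:port) traces are exported to.
	// Leaving this (and any stdout exporter) at its zero value effectively
	// disables tracing.
	Endpoint string

	// SampleRatio is the head-sampling ratio applied to new traces, 0.0-1.0.
	SampleRatio float64

	// DisableParentSampler, when true, ignores sampling decisions carried in
	// incoming trace headers, so web front-ends don't honor sampling info
	// from outside our infrastructure.
	DisableParentSampler bool
}
```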
Add basic OpenTelemetry setup to our existing cmd.StatsAndLogging helper, so that it gets initialized at the same time as our other observability helpers. This sets certain default fields on all traces/spans generated by the service. Currently these include the service name, the service version, and information about the telemetry SDK itself. In the future we'll likely augment this with information about the host and process.
Finally, add instrumentation for the HTTP servers and gRPC clients/servers. This gives us a starting point for monitoring Boulder, but is fairly minimal as this PR is already somewhat unwieldy; it's really only enough to confirm that everything is wired up properly in the configuration. In subsequent work we'll enhance those spans with more data, and add more spans for things not automatically traced here.
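For reference, the automatic instrumentation being wired up is roughly the contrib otelhttp/otelgrpc packages used like this; the handler name and option choices are illustrative and may differ from what the PR does:

```go
package example

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"google.golang.org/grpc"
)

// instrument shows the three integration points: wrap the HTTP mux so every
// request gets a server span, and add gRPC interceptors so every RPC gets a
// client or server span.
func instrument(mux *http.ServeMux) (http.Handler, *grpc.Server, []grpc.DialOption) {
	httpHandler := otelhttp.NewHandler(mux, "web")

	grpcServer := grpc.NewServer(
		grpc.UnaryInterceptor(otelgrpc.UnaryServerInterceptor()),
	)

	dialOpts := []grpc.DialOption{
		grpc.WithUnaryInterceptor(otelgrpc.UnaryClientInterceptor()),
	}
	return httpHandler, grpcServer, dialOpts
}
```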
Fixes #6361