Description
Hello,
I am trying to make aws-sdk SQS context propagation work between applications written in Java (Scala, actually) and Node.js / Ruby.
In order for it to work, both the sender and the receiver must agree on how to inject and extract the context on the messages.
Unfortunately, Java and Node.js / Ruby behave differently, which creates broken traces for end users who send messages across those systems.
Node.js / Ruby
These implementations use the "OpenTelemetry" approach. They use the propagators registered in the otel API (W3C / B3 / custom / etc.) to inject and extract context from the message attributes.
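To make the message-attribute approach concrete, here is a minimal sketch in Python. It does not use the real instrumentation code: `make_traceparent`, `inject` and `extract` are hypothetical helper names, and the only things taken as given are the W3C `traceparent` layout (`version-traceid-spanid-flags`) and the SQS message attribute shape (`DataType` / `StringValue`).

```python
def make_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Format a W3C Trace Context `traceparent` value (version 00)."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def inject(headers: dict, message_attributes: dict) -> dict:
    """Return a copy of the message attributes with propagation headers added.

    Hypothetical helper: each header becomes one String message attribute.
    """
    out = dict(message_attributes)
    for key, value in headers.items():
        out[key] = {"DataType": "String", "StringValue": value}
    return out


def extract(message_attributes: dict,
            keys=("traceparent", "tracestate", "baggage")) -> dict:
    """Read the propagation headers back out on the receiving side."""
    return {
        key: message_attributes[key]["StringValue"]
        for key in keys
        if key in message_attributes
    }


# Sender side: user attributes plus one injected propagation attribute.
user_attributes = {"order_id": {"DataType": "String", "StringValue": "42"}}
traceparent = make_traceparent(
    "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", sampled=True
)
sent = inject({"traceparent": traceparent}, user_attributes)

# Receiver side: only the propagation keys are extracted.
received = extract(sent)
```

Both sides only interoperate if they agree on the attribute names and value format, which is exactly where the Java implementation differs.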
Cons
Context propagation consumes SQS message quotas (https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/quotas-messages.html), which means:
- The user is charged for the payload, which includes the message attributes. This is not expected to be expensive, since "Each 64 KB chunk of a payload is billed as 1 request", but the context still adds a few dozen billed bytes.
- If the message is just below the 256 KB hard limit, adding the context data can push it over the limit and cause the message to be rejected.
- Message attributes are limited to 10 per message, including the user's own attributes. If the user has already used all of them, the instrumentation has no room left to inject propagation headers.
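The two hard limits above can be checked before injecting. The following is a rough sketch, not actual instrumentation code: `can_inject` and `attribute_size` are hypothetical names, the limits are the ones cited in this issue, only string-valued attributes are handled, and the size accounting (name + data type + value, plus the body) is a simplification.

```python
MAX_ATTRIBUTES = 10             # attribute limit cited above
MAX_MESSAGE_BYTES = 256 * 1024  # hard size limit cited above


def attribute_size(name: str, attribute: dict) -> int:
    """Approximate bytes used by one string attribute (name, type, value)."""
    return (
        len(name.encode("utf-8"))
        + len(attribute["DataType"].encode("utf-8"))
        + len(attribute["StringValue"].encode("utf-8"))
    )


def can_inject(body: str, attributes: dict, headers: dict,
               max_attributes: int = MAX_ATTRIBUTES,
               max_bytes: int = MAX_MESSAGE_BYTES) -> bool:
    """Return True if adding the propagation headers stays within both limits."""
    merged = dict(attributes)
    for key, value in headers.items():
        merged[key] = {"DataType": "String", "StringValue": value}
    if len(merged) > max_attributes:
        return False  # no free attribute slot left
    total = len(body.encode("utf-8")) + sum(
        attribute_size(name, attr) for name, attr in merged.items()
    )
    return total <= max_bytes


headers = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}

# A small message with no user attributes has room for the context.
fits = can_inject("hello", {}, headers)

# A message that already uses all 10 attributes does not.
full = {f"k{i}": {"DataType": "String", "StringValue": "v"} for i in range(10)}
no_slot = can_inject("hello", full, headers)

# A body just under the size limit is pushed over it by the extra attribute.
too_big = can_inject("x" * (256 * 1024 - 10), {}, headers)
```

An instrumentation could use a check like this to skip injection instead of causing the send to fail, at the cost of a broken trace for that message.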
Java
The Java implementation uses the X-Amzn-Trace-Id header, which does not consume quotas like the Node.js / Ruby implementation does, but has the following cons:
Cons
- If X-Ray is enabled on a service, it might inject additional spans into the trace as the request passes through the X-Ray-enabled AWS services. These spans are only exported to X-Ray. This means that otel users who are not using X-Ray will have missing spans in the trace, so the goal of async messaging visibility is lost.
- For some services, if X-Ray is disabled, the service will still create a new trace, flag it as non-sampled, and propagate this information downstream. If the application is configured with a parent-based sampler, X-Ray effectively turns off tracing for the application. See this issue in Node.js for reference and more info.
- The X-Ray propagator does not support baggage, which means baggage values are not propagated via AWS services even if the user registers the W3C Baggage propagator.
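The parent-based sampler problem can be illustrated with a small sketch. The header layout (`Root=...;Parent=...;Sampled=...`) is the documented X-Ray tracing header format; `parse_xray_header` and `parent_based_should_sample` are hypothetical helpers, and the sampling logic is a simplified stand-in for a parent-based sampler, not the SDK's implementation.

```python
from typing import Optional


def parse_xray_header(header: str) -> dict:
    """Split an X-Ray tracing header into its key/value fields."""
    fields = {}
    for part in header.split(";"):
        if "=" in part:
            key, value = part.strip().split("=", 1)
            fields[key] = value
    return fields


def parent_based_should_sample(header: Optional[str],
                               root_decision: bool = True) -> bool:
    """Simplified parent-based sampling.

    With a valid parent (Root and Parent present), follow the parent's
    Sampled flag. Without one, fall back to the root sampler's decision.
    """
    if not header:
        return root_decision
    fields = parse_xray_header(header)
    if "Root" not in fields or "Parent" not in fields:
        return root_decision
    return fields.get("Sampled") == "1"


# Header created by a service where X-Ray is disabled: a new, non-sampled trace.
not_sampled = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=0"

drops_span = not parent_based_should_sample(not_sampled)   # tracing is switched off
samples_without_parent = parent_based_should_sample(None)  # root sampler decides
```

In this sketch the application would have sampled the span on its own, but the non-sampled parent received from the AWS service overrides that decision.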
Action Items
Considering the above pros and cons, I want to suggest adding a second propagation style to the aws-sdk instrumentations that is compatible with Node.js / Ruby and allows the user to bypass the X-Ray propagator, so that the above compatibility issues do not affect their application.
I'll be very happy to get more insights and ideas on this issue. Do you think it makes sense to add this new propagation style? Maybe AWS has plans to solve the compatibility problems in the near future?
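As a rough illustration of what a selectable propagation style could look like on the receiving side, here is a sketch. Everything in it is hypothetical: the `style` parameter, its values (`"xray"`, `"message-attributes"`) and the `extract_parent` helper do not exist in any instrumentation today. The only assumptions about SQS are that a received message carries user attributes under `MessageAttributes` and the X-Ray header under the `AWSTraceHeader` system attribute in `Attributes`.

```python
def extract_parent(message: dict, style: str = "xray"):
    """Pick the parent context source based on a hypothetical style setting.

    Returns a (format, value) tuple, or None when no context is present.
    """
    if style == "message-attributes":
        # Node.js / Ruby compatible: context travels in user message attributes.
        attribute = message.get("MessageAttributes", {}).get("traceparent")
        return ("traceparent", attribute["StringValue"]) if attribute else None
    if style == "xray":
        # Current Java behavior: context travels in the X-Ray trace header.
        header = message.get("Attributes", {}).get("AWSTraceHeader")
        return ("xray", header) if header else None
    raise ValueError(f"unknown propagation style: {style}")


# A message carrying both forms of context, as a Java consumer might receive it
# from a Node.js producer running behind an X-Ray-aware service.
message = {
    "Body": "hello",
    "Attributes": {
        "AWSTraceHeader": "Root=1-5759e988-bd862e3fe1be46a994272793;Sampled=0"
    },
    "MessageAttributes": {
        "traceparent": {
            "DataType": "String",
            "StringValue": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
        }
    },
}

default_parent = extract_parent(message)
compatible_parent = extract_parent(message, style="message-attributes")
```

With the message-attributes style selected, the non-sampled X-Ray header is ignored and the trace started by the Node.js producer continues.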