-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Unify zipkin v1 and v2 annotation/tag parsing logic #1002
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ad533e1 to
d0c5225
Compare
Codecov Report
@@ Coverage Diff @@
## master #1002 +/- ##
==========================================
+ Coverage 86.46% 87.09% +0.62%
==========================================
Files 198 199 +1
Lines 14144 14151 +7
==========================================
+ Hits 12230 12325 +95
+ Misses 1463 1384 -79
+ Partials 451 442 -9
Continue to review full report at Codecov.
|
| return nil | ||
| } | ||
|
|
||
| const statusCodeUnknown = 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this values is not going to be being used anymore.
It means that if we get an invalid value in error attribute that is out of the scope of canonicalCodesMap, we don't use that to identify an error in the span anymore. I'm not sure this is what we need. Why do you think we should change the behavior this way?
And since we have types conversion now, I would consider setting the UnknownError status for true boolean value.
IFAIK another case when this value was set in zipkinV2 translation is when opencensus.status_description attribute value is provided but error attribute is empty. After this change we will be setting SpanStatus = 0 instead of 2 for those spans. That is acceptable change I think.
@bogdandrutu @tigrannajaryan thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It means that if we get an invalid value in error attribute that is out of the scope of canonicalCodesMap, we don't use that to identify an error in the span anymore. I'm not sure this is what we need. Why do you think we should change the behavior this way?
I pushed a commit to add this back, its not super clean, I'm open to suggestions on how to improve it.
And since we have types conversion now, I would consider setting the UnknownError status for true boolean value
For the error attribute? Which piece of the mapper should I set that on? fromCensus/fromStatus/fromHTTP How should that interact with the presence of the other attributes? I've added some test cases, perhaps listing a few ones to add would help clarify for me
IFAIK another case when this value was set in zipkinV2 translation is when opencensus.status_description attribute value is provided but error attribute is empty. After this change we will be setting SpanStatus = 0 instead of 2 for those spans. That is acceptable change I think.
I actually undid this when I added support for unknown back. Let me know what you think the best path forward is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the error attribute? Which piece of the mapper should I set that on? fromCensus/fromStatus/fromHTTP How should that interact with the presence of the other attributes? I've added some test cases, perhaps listing a few ones to add would help clarify for me
Probably to add another one like fromErrorTag. We should set 2 (Unknown) error code if "error" tag has a value (like "true" or anything else that is not an actual error code), and others sources (fromCensus/fromStatus/fromHTTP) doesn't provide any details to infer an error code.
I actually undid this when I added support for unknown back. Let me know what you think the best path forward is
ZipkinV1 receiver was defaulting to 0 code if there is message but not error code identified. I think we should use that approach in both v1 and v2 receivers. Setting Unknown error code only based on a message doesn't sound right to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've rebased and added a commit for these changes
|
Please sign the CLA |
|
I signed it |
|
Please rebase and try to add tests :) |
Added some tests for the status mapper, there are still some open questions regarding intended behavior I think (see #1002 (comment) ) Let me know if there are other areas I should add tests for I just pushed another commit, I'll wait to rebase until after dmitryax gets a chance to look at it |
|
Actually I see dmitryax made some conflicting changes, merging for now |
|
@chris-smith-zocdoc could you check the failing test please? |
|
The failing test is also failing on the master branch right now, so I don't think its related to my change |
Filed a bug #1069 |
|
@chris-smith-zocdoc please rebase, tests should pass now. |
use same tag parsing logic for both zipkin v1 and v2
oc description without error should be OK add more test cases for type conversions on error tag
7670027 to
45c7df8
Compare
| s = m.fromHTTP | ||
| } | ||
|
|
||
| // If no codePtr was provided, fallback to the first source with a message |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
applying this default caused this test to fail
opentelemetry-collector/translator/trace/zipkin/zipkinv1_thrift_to_protospan_test.go
Lines 166 to 175 in 665d64e
| // only status.message tag | |
| { | |
| haveTags: []*zipkincore.BinaryAnnotation{{ | |
| Key: "status.message", | |
| Value: []byte("Forbidden"), | |
| AnnotationType: zipkincore.AnnotationType_STRING, | |
| }}, | |
| wantAttributes: nil, | |
| wantStatus: nil, | |
| }, |
attributes: {
"status.message": "Forbidden",
}
expects tracepb.Status to be nil
The jaeger behavior matches the previous zipkin behavior https://github.com/open-telemetry/opentelemetry-collector/blob/master/translator/trace/jaeger/jaegerthrift_to_protospan.go#L262-L265
Do you think I should update the tests or modify the behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change behavior . We should not set any error status based on "status.message" only
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this only apply to that specific open census attribute? I ask because the existing code will return a tracepb.Status when either a code or message is set if s.codePtr != nil || s.message != "" . Because of the way it is currently implemented, this would only ever apply to fromHTTP though.
On master, the current behavior is
given
attributes: {
"http.status_message": "something"
}
returns
tracepb.Status{
Code: 0,
Message: "something",
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. we should not set error status code based only on an error message attribute such as http.status_message or status.message (hopefully I don't miss any other message attribute)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chris-smith-zocdoc could you apply this approach ^ and make sure tests pass?
| s = m.fromHTTP | ||
| } | ||
|
|
||
| s.codePtr = m.fromErrorTag.codePtr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be under condition if (s.codePtr == nil) so we don't override code that was already set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add it to make it more clear, but this would always be true currently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why this always be true? any of the assignments above can change it not null, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the first switch we know that
m.fromCensus.codePtr == nil
m.fromStatus.codePtr == nil
and line 62 gives
m.fromHTTP.codePtr == nil
So we've verified all three are nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I still don't get it. s can be set in any case above, so any of them can set s.codePtr to be something but not nil, right? If we don't have if (s.codePtr == nil) condition here, we override any s.codePtr with m.fromErrorTag.codePtr all the time
check for nil before overwriting
|
Did something change with the CLA process? Not sure why its complaining about that last commit |
| name: "error only: false", | ||
| expected: &tracepb.Status{Code: 2}, | ||
| attributes: map[string]string{ | ||
| "error": "false", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case is still weird to me, I looked through our data and we do have spans with this being reported 🙄
How do you feel about treating a boolean attribute with a value false as OK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the instrumentation on your end producing this kind of spans? It looks like OpenTracing convention, but should not be reported by zipkin instrumentation. If the instrumentation and spans are valid we probably should handle this edge case and make it OK code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into this more, I don't think we need to make any changes.
Originally I was looking at our exported data in honeycomb where this was accidentally reported as a boolean instead of a string, causing some type conversion problems.
| }, | ||
| }, | ||
| }, | ||
| Description: &tracepb.TruncatableString{Value: currAnnotation.Value}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dmitryax found another difference with the v1 parsing on Time Events, converted it to the v2 behavior
opentelemetry-collector/receiver/zipkinreceiver/trace_receiver.go
Lines 562 to 574 in 8aa2731
| func zipkinAnnotationToProtoAnnotation(zas zipkinmodel.Annotation) *tracepb.Span_TimeEvent { | |
| if zas == blankAnnotation { | |
| return nil | |
| } | |
| return &tracepb.Span_TimeEvent{ | |
| Time: internal.TimeToTimestamp(zas.Timestamp), | |
| Value: &tracepb.Span_TimeEvent_Annotation_{ | |
| Annotation: &tracepb.Span_TimeEvent_Annotation{ | |
| Description: &tracepb.TruncatableString{Value: zas.Value}, | |
| }, | |
| }, | |
| } | |
| } |
without this change, the exporter would emit time events without a name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it still work as expected for v1 spans?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously it didn't work, it should now
The time event (annotation in zipkin terms) looks like this
annotations:
[ {
"timestamp": 1591814867,
"value": "my custom annotation",
"endpoint": {
"serviceName": "my-service"
}
} ]
Previously this was being converted into this proto structure
time_events:
[ {
time: 1591814867,
attributes: {
"my custom annotation": "my-service"
}
} ]
Now it is
time_events:
[ {
time: 1591814867,
description: "my custom annotation"
} ]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. looks good
|
@chris-smith-zocdoc could you please increase the diff coverage to hit 95% requirement? |
# Conflicts: # receiver/zipkinreceiver/trace_receiver.go # receiver/zipkinreceiver/trace_receiver_test.go
33b9944 to
0ea765f
Compare
|
Unf it's caused by moving around parts that were not covered before. You can see what is not covered here: https://codecov.io/gh/open-telemetry/opentelemetry-collector/compare/1ad767e62f3dff6f62f32c7360b6fefe0fbf32ff...0ea765f5586230d49756fc380585341c6333d4a0/diff |
|
wooo its green! |
|
Thanks @chris-smith-zocdoc ! It looks good overall, but I'd like to give another thorough review tomorrow since it's a pretty risky change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chris-smith-zocdoc overall it looks good. I just found the last issue which I think we should solve in this PR. Please let me know what do you think.
| } | ||
| code, set := canonicalCodesMap[canonicalCodeStr] | ||
| if set { | ||
| return &code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we need to break down this function into two:
If this condition worked out and we can get a clear error status from "error" tag by doing the case * tracepb.AttributeValue_StringValue -> canonicalCodesMap[canonicalCodeStr] check, the result must be considered as higher priority. And we should not use HTTP status in that case. We also should drop the "error" tag, since it doesn't provide any additional value (return true in fromAttribute).
Otherwise, if there is something in the "error" that we cannot identify, we should set unknown status as lower priority, and use HTTP status instead. Probably we need to use another statusMapper like fromErrorTagUnknown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made this change and was able to simplify this logic a bit. Please let me know what you think
|
|
||
| if s.codePtr == nil { | ||
| s.codePtr = m.fromErrorTag.codePtr | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the result of the proposed change we would have here:
if s.codePtr == nil {
switch {
case m.fromCensus.message != "":
s = m.fromCensus
case m.fromStatus.message != "":
s = m.fromStatus
case m.fromErrorTag.codePtr != nil:
s = m.fromErrorTag
case m.fromHTTP.message != "":
s = m.fromHTTP
default:
s = m.fromErrorTagUnknown
}
ea5ba4f to
d08da47
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
) Fixes open-telemetry#975 Please look at the individual commits, most of this is just moving code around. Moves zipkin v2 trace conversion code into translator/trace/zipkin, previously it was in the receiver Use the same tag parsing logic for both zipkin v1 and v2
…/jaeger (open-telemetry#1002) * Bump google.golang.org/grpc in /exporters/trace/jaeger Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.30.0 to 1.31.0. - [Release notes](https://github.com/grpc/grpc-go/releases) - [Commits](grpc/grpc-go@v1.30.0...v1.31.0) Signed-off-by: dependabot[bot] <[email protected]> * Auto-fix go.sum changes in dependent modules Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dependabot[bot] <dependabot[bot]@users.noreply.github.com> Co-authored-by: Tyler Yahn <[email protected]>
…y#1002) Bumps [boto3](https://github.com/boto/boto3) from 1.20.16 to 1.20.17. - [Release notes](https://github.com/boto/boto3/releases) - [Changelog](https://github.com/boto/boto3/blob/develop/CHANGELOG.rst) - [Commits](boto/boto3@1.20.16...1.20.17) --- updated-dependencies: - dependency-name: boto3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* demo: add missing namespace on grafana dashboard fixes: open-telemetry#1001 Signed-off-by: hsinhoyeh <[email protected]> * add rendered yamls Signed-off-by: hsinhoyeh <[email protected]> --------- Signed-off-by: hsinhoyeh <[email protected]>



Fixes #975
Please look at the individual commits, most of this is just moving code around.
Moves zipkin v2 trace conversion code into translator/trace/zipkin, previously it was in the receiver
Use the same tag parsing logic for both zipkin v1 and v2