[exporter/loadbalancingexporter] batch before exporting in load balancer #18273
Conversation
I'm adding this to my review queue, but perhaps @pkositsyn would want to review this one as well?

@jpkrohling I actually did much the same thing a couple of years ago at my company. In fact, I even replaced the Split function with some custom logic. History aside, I'm ready to review this PR. Hope to get to it tomorrow.
pkositsyn left a comment:
Overall looks good, just a couple questions about excessive copying
```go
for i := 0; i < batches[batch].ResourceSpans().Len(); i++ {
	batches[batch].ResourceSpans().At(i).CopyTo(endpointToTraceData[endpoint].ResourceSpans().AppendEmpty())
}
```
Actually, here we could also benefit from moving, after first checking that routingIDs has a length of 1.
Why would we check routingIDs has length of 1? Shouldn't we just copy the trace to each backend it's supposed to be routed to, even if that's multiple backends?
Well, I mean that in the case where there is only 1 routing backend, why would we need to copy? In the case of >= 2, each backend would share the same traces structure, so we should copy for safety.
Sure, we can always copy, but there is room for optimization; that is my point.
I see what you mean, but I think we should check for only one endpoint outside of that for loop and outside routingIDs. The point of the optimization is that different batches, and different calls that create routingIDs, may share endpoints, and those should be sent in the same batched call. I added a function to check whether there is just one endpoint.
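For illustration, here is a minimal sketch of that single-endpoint check. The names are assumptions: hasMultipleEndpoints stands in for the PR's HasMultipleEndpoints helper (whose actual signature may differ), and endpointFor stands in for however the load balancer resolves a routing ID to a backend.

```go
package lbsketch

// hasMultipleEndpoints reports whether the given routing IDs resolve to more
// than one backend endpoint. Hypothetical sketch only; the exporter's real
// helper and data structures may differ.
func hasMultipleEndpoints(routingIDs map[string]bool, endpointFor func(string) string) bool {
	first := true
	var seen string
	for id := range routingIDs {
		endpoint := endpointFor(id)
		if first {
			seen, first = endpoint, false
			continue
		}
		if endpoint != seen {
			return true
		}
	}
	return false
}
```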
Good call on this optimization, it could indeed save some RPC calls (not actually what I meant :)
But I have one concern: due to this code we could send one pdata.Traces containing spans with different TraceIDs. This could be a breaking change.
@jpkrohling Does this exporter guarantee that in one sent pdata.Traces there is only one TraceID? Or could there be spans from different traces belonging to that backend?
About my previous point: I meant the situation where there is, for example, only one trace, which is routed to multiple backends. In that case it should be fully copied for each of these backends. And if we see that there is only one routingID for a batch, then I think we could safely move. Maybe we could create a test case for this to make clear what I'm trying to say?
> I meant the situation where there is, for example, only one trace, which is routed to multiple backends
Right now, one trace can be sent to only one backend.
> Right now, one trace can be sent to only one backend.
Is that so? Why is routingIDs a slice then?
One pdata.Traces may contain multiple traces (represented as a ResourceSpan), whose routing keys (service names or trace IDs) are mapped to backends.
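To make that concrete, here is a sketch of the routing-key idea under assumed names; the exporter's real routing code differs, but the shape is the same: each ResourceSpans yields a key (trace ID or service name) that is later mapped to a backend.

```go
package lbsketch

import (
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// routingKeys collects one routing key per ResourceSpans in the incoming
// pdata.Traces: the service.name resource attribute when routing by service,
// otherwise the trace ID of the first span. Illustrative names only, not the
// exporter's actual API.
func routingKeys(td ptrace.Traces, byService bool) map[string]bool {
	keys := map[string]bool{}
	rss := td.ResourceSpans()
	for i := 0; i < rss.Len(); i++ {
		rs := rss.At(i)
		if byService {
			if svc, ok := rs.Resource().Attributes().Get("service.name"); ok {
				keys[svc.Str()] = true
			}
			continue
		}
		// Default routing key: the trace ID of the first span in this ResourceSpans.
		scopes := rs.ScopeSpans()
		if scopes.Len() > 0 && scopes.At(0).Spans().Len() > 0 {
			keys[scopes.At(0).Spans().At(0).TraceID().String()] = true
		}
	}
	return keys
}
```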
But we get several routing IDs for one batch, which already contains one trace. With a custom routing key (for example, by the service of some span in the trace) we can actually route one trace to several backends.
Returning to my main point: I believe we can move traces only if len(routingIDs) is 1. For the simplicity of this PR, let's just copy everything, as it was in the first place.
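For reference, a minimal sketch of the move-vs-copy optimization discussed in this thread, under assumed names (batch, endpoints, endpointToTraceData) and assuming the destination map already holds initialized ptrace.Traces per endpoint. This is not the PR's final code, which copies in all cases for simplicity.

```go
package lbsketch

import "go.opentelemetry.io/collector/pdata/ptrace"

// appendBatch adds one batch to the per-endpoint accumulators. If the batch
// routes to a single endpoint it is not shared and can be moved; if it routes
// to several endpoints, each one needs its own copy for safety.
func appendBatch(batch ptrace.Traces, endpoints []string, endpointToTraceData map[string]ptrace.Traces) {
	if len(endpoints) == 1 {
		// Single destination: move instead of copy.
		batch.ResourceSpans().MoveAndAppendTo(endpointToTraceData[endpoints[0]].ResourceSpans())
		return
	}
	for _, endpoint := range endpoints {
		dest := endpointToTraceData[endpoint].ResourceSpans()
		for i := 0; i < batch.ResourceSpans().Len(); i++ {
			batch.ResourceSpans().At(i).CopyTo(dest.AppendEmpty())
		}
	}
}
```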
And as a side note: this change could possibly lead to a gRPC max message size exceeded error (default 4 MB limit) in some cases with really big traces. By grouping, we make messages bigger, which only contributes to that error, although it's definitely worth it for performance.
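To make that limit concrete: gRPC servers reject messages above 4 MiB by default, and below is a hedged sketch of the plain grpc-go knob on the receiving side. In a real collector deployment the equivalent limit would be raised through the backend receiver's configuration rather than in code like this; the 16 MiB value is arbitrary.

```go
package lbsketch

import "google.golang.org/grpc"

// newBackendServer builds a gRPC server with a larger receive limit than the
// 4 MiB default, which is the limit that oversized batched exports can hit.
func newBackendServer() *grpc.Server {
	const maxMsgSize = 16 * 1024 * 1024 // 16 MiB, chosen arbitrarily for the example
	return grpc.NewServer(
		grpc.MaxRecvMsgSize(maxMsgSize), // raise the 4 MiB default receive limit
	)
}
```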
Foresight Summary

⭕ build-and-test-windows workflow has finished in 7 seconds (43 minutes 13 seconds less than main branch avg.).
✅ check-links workflow has finished in 42 seconds (1 minute 47 seconds less than main branch avg.) and finished at 2nd Mar, 2023.
✅ telemetrygen workflow has finished in 1 minute 7 seconds (1 minute 58 seconds less than main branch avg.) and finished at 2nd Mar, 2023.
✅ changelog workflow has finished in 2 minutes 23 seconds and finished at 2nd Mar, 2023.
✅ build-and-test workflow has finished in 43 minutes 34 seconds (27 minutes 26 seconds less than main branch avg.) and finished at 2nd Mar, 2023.
✅ prometheus-compliance-tests workflow has finished in 11 minutes 17 seconds (⚠️ 2 minutes 20 seconds more than main branch avg.) and finished at 2nd Mar, 2023.
✅ load-tests workflow has finished in 19 minutes 36 seconds (⚠️ 2 minutes 27 seconds more than main branch avg.) and finished at 2nd Mar, 2023.
✅ e2e-tests workflow has finished in 16 minutes 32 seconds and finished at 2nd Mar, 2023.

No job in any of these workflows reported failed steps or failing tests.
It looks like it is failing the lint tests; I ran `golangci-lint run --fix`. Is there any other linting that is done for this project? I can't seem to pinpoint what issue it has.
You can run
I built the collector with this fix. I've been using it since last week, and it works fine.

Does it prove correctness?

Yes.

It's happened to me. I adjusted the
@pkositsyn, are you happy with this change, or do you have concerns about the correctness of the final result?
I'm merging this, as I think it's worth having. The one thing I wish this had, which could be part of a follow-up PR, is a warning about gRPC messages exceeding the max message size, along with concrete instructions on how to avoid it (based on @janssenlima's comment, perhaps).
jpkrohling left a comment:
I like this PR, but there are a few things making me uncomfortable:
- nit: HasMultipleEndpoints could have been marked as private
- the problem with the gRPC message size highlights a bigger design problem: we should rely on the common batching features of the collector instead. Perhaps we could use connectors here, and batch things on the second pipeline? @kovrus, what do you think?
- I'm still not 100% sure we have correctness, as @pkositsyn mentioned. Perhaps that's evidence that this could use more tests?
This PR was marked stale due to lack of activity. It will be closed in 14 days.

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@jessberg have you moved on? Should someone else pick up this work? I've been running this for a few months now and would like the issue resolved.

I believe this issue is referencing the same problem, but I might be wrong: #13826

Hi, sorry, yes, I have moved on and no longer have the cycles for this. Sorry about that; hopefully this partial version will work for some people.

Hi, will this work be continued? I'm experiencing the same problem with a load close to 300k rps using the load balancer, where each request has big traces.

This (merged) PR appears to fix this issue.
Description:
Some users (including myself) have had issues with the loadbalancingexporter failing under high load. This is because the current code calls the underlying exporter for each span in the case of traces, and for each unit in the case of logs. This results in a lot of calls to the underlying exporter, even though many of those calls could be consolidated because they go to the same backend. This PR batches the data by endpoint before calling any underlying exporter, in order to reduce the overall overhead of each call.
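For readers unfamiliar with the exporter, here is a rough sketch of the batching idea under assumed helper signatures; endpointFor and exporterFor are illustrative, not the exporter's real API, and the copy-for-multiple-backends case discussed in the review is omitted for brevity.

```go
package lbsketch

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// exportBatched groups everything destined for the same backend into one
// pdata.Traces and makes one export call per backend instead of one per
// incoming batch. Hypothetical sketch of the idea in this PR.
func exportBatched(
	ctx context.Context,
	batches []ptrace.Traces,
	endpointFor func(ptrace.Traces) string,
	exporterFor func(string) consumer.Traces,
) error {
	perEndpoint := map[string]ptrace.Traces{}
	for _, batch := range batches {
		endpoint := endpointFor(batch)
		dest, ok := perEndpoint[endpoint]
		if !ok {
			dest = ptrace.NewTraces()
			perEndpoint[endpoint] = dest
		}
		// Accumulate this batch into the per-endpoint payload.
		batch.ResourceSpans().MoveAndAppendTo(dest.ResourceSpans())
	}
	for endpoint, td := range perEndpoint {
		if err := exporterFor(endpoint).ConsumeTraces(ctx, td); err != nil {
			return err
		}
	}
	return nil
}
```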
Link to tracking Issue: 17173
Testing:
I changed a few of the tests so that the expected result is not many pdata.Traces or plogs.Logs going to the backend, but rather one large pdata.Traces or plogs.Logs per backend.
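As an illustration of that assertion shape, here is a hedged test sketch built on the hypothetical exportBatched helper from the sketch above, with a consumertest.TracesSink standing in for the backend; the PR's real tests exercise the exporter itself.

```go
package lbsketch

import (
	"context"
	"testing"

	"github.com/stretchr/testify/assert"
	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/consumer/consumertest"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// TestBatchedExportProducesSinglePayload checks that two batches routed to the
// same backend arrive as one consolidated pdata.Traces rather than two calls.
func TestBatchedExportProducesSinglePayload(t *testing.T) {
	sink := new(consumertest.TracesSink)

	// Two batches that route to the same backend.
	batch1 := ptrace.NewTraces()
	batch1.ResourceSpans().AppendEmpty()
	batch2 := ptrace.NewTraces()
	batch2.ResourceSpans().AppendEmpty()

	err := exportBatched(
		context.Background(),
		[]ptrace.Traces{batch1, batch2},
		func(ptrace.Traces) string { return "backend-1:4317" },
		func(string) consumer.Traces { return sink },
	)
	assert.NoError(t, err)

	// One payload containing both ResourceSpans, instead of two payloads.
	assert.Len(t, sink.AllTraces(), 1)
	assert.Equal(t, 2, sink.AllTraces()[0].ResourceSpans().Len())
}
```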
Documentation:
No documentation changes.