
Conversation

@jessberg

@jessberg jessberg commented Feb 3, 2023

Description:
Some users (including myself) have had issues with the loadbalancingexporter failing under high load. This is because the current code calls the underlying exporter for each span in the case of traces, and for each unit of log data in the case of logs. This results in many calls to the underlying exporter, even though many of them could be consolidated because they are sending to the same backend. This PR batches the data by endpoint before calling any underlying exporter in order to reduce the overall call overhead.
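
To make the approach concrete, here is a minimal sketch of the idea in Go (not the exact code in this PR): group ResourceSpans by their resolved endpoint first, then call each backend exporter once per endpoint. The resolveEndpoint callback and the exporters map are hypothetical placeholders for the exporter's routing and backend lookup.

package sketch

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// consumeTraces groups ResourceSpans by their resolved endpoint and calls each
// backend exporter once per endpoint instead of once per span/ResourceSpans.
func consumeTraces(
	ctx context.Context,
	td ptrace.Traces,
	resolveEndpoint func(ptrace.ResourceSpans) string,
	exporters map[string]consumer.Traces,
) error {
	batches := map[string]ptrace.Traces{} // endpoint -> merged payload
	for i := 0; i < td.ResourceSpans().Len(); i++ {
		rs := td.ResourceSpans().At(i)
		endpoint := resolveEndpoint(rs)
		merged, ok := batches[endpoint]
		if !ok {
			merged = ptrace.NewTraces()
			batches[endpoint] = merged
		}
		rs.CopyTo(merged.ResourceSpans().AppendEmpty())
	}
	// One export call per endpoint rather than one per ResourceSpans.
	for endpoint, merged := range batches {
		if err := exporters[endpoint].ConsumeTraces(ctx, merged); err != nil {
			return err
		}
	}
	return nil
}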

Link to tracking Issue: 17173

Testing:
I changed a few of the tests so that the expected results are not many pdata.Traces or plog.Logs going to the backend, but rather one large pdata.Traces or plog.Logs.

Documentation:
No documentation changes.

@jessberg jessberg requested a review from a team February 3, 2023 19:36
@jessberg jessberg requested a review from jpkrohling as a code owner February 3, 2023 19:36
@linux-foundation-easycla

linux-foundation-easycla bot commented Feb 3, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@jpkrohling
Member

I'm adding this to my review queue, but perhaps @pkositsyn would want to review this one as well?

@pkositsyn
Contributor

@jpkrohling I actually did much the same thing a couple of years ago at my company; I even replaced the Split function with some custom logic.

History aside, I'm ready to review this PR. I hope to get to it tomorrow.

Contributor

@pkositsyn pkositsyn left a comment

Overall this looks good; just a couple of questions about excessive copying.

Comment on lines 103 to 105
for i := 0; i < batches[batch].ResourceSpans().Len(); i++ {
batches[batch].ResourceSpans().At(i).CopyTo(endpointToTraceData[endpoint].ResourceSpans().AppendEmpty())
}
Contributor

Actually, here we could also benefit from moving instead of copying, after first checking that routingIDs has a length of 1.

Author

Why would we check that routingIDs has a length of 1? Shouldn't we just copy the trace to each backend it's supposed to be routed to, even if that's multiple backends?

Contributor

Well, I mean that when there is only one routing backend, why would we need to copy at all? With two or more, each backend would share the same traces structure, so in that case we should copy for safety.

Sure, we can always copy, but there is room for optimization here; that is my point.

Author

I see what you mean. I think we should check for a single endpoint outside of that for loop and outside of routingIDs, though. The point of the optimization is that there may be shared endpoints across different batches, and across different calls that create routingIDs, and those should be sent with the same batched call. I made a function that checks whether there is just one endpoint there.
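
For illustration only, a minimal sketch of what such a check could look like (the PR's actual helper, referred to later as HasMultipleEndpoints, may differ): given the endpoint chosen for each routing ID across all batches, it reports whether everything targets the same backend, in which case the data could be forwarded without copying.

// hasSingleEndpoint reports whether every routing ID resolved to the same
// backend endpoint. endpointByRoutingID is a hypothetical map built while
// computing routingIDs for all batches.
func hasSingleEndpoint(endpointByRoutingID map[string]string) bool {
	first := ""
	for _, endpoint := range endpointByRoutingID {
		if first == "" {
			first = endpoint
			continue
		}
		if endpoint != first {
			return false
		}
	}
	return true
}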

Contributor

Good call on this optimization; this could indeed save some RPC calls (not actually what I meant :)
But I have one concern: with this code we could send one pdata.Traces containing spans with different TraceIDs. This could be a breaking change.
@jpkrohling Does this exporter guarantee that one sent pdata.Traces contains only one TraceID? Or could there be spans from different traces belonging to that backend?

As for my previous point, I meant the situation where there is, for example, only one trace that is routed to multiple backends. In that case it should be fully copied for each of those backends. And if we see that there is only one routingID for a batch, then I think we could safely move. Maybe we could create a test case for this to make clear what I'm trying to say?

Member

I meant the situation where there is, for example, only one trace that is routed to multiple backends

Right now, one trace can be sent to only one backend.

Contributor

Right now, one trace can be sent to only one backend.
Is that so? Why is routingIDs a slice then?

Member

One pdata.Traces may contain multiple traces (represented as a ResourceSpan), whose routing keys (service names or trace IDs) are mapped to backends.
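
As an illustration of that mapping (a sketch under assumptions, not the exporter's actual code), a routing key per ResourceSpans might be the service.name resource attribute when routing by service, or the first span's trace ID otherwise; the key is then mapped onto a backend.

package sketch

import (
	"encoding/hex"

	"go.opentelemetry.io/collector/pdata/ptrace"
)

// routingKey derives a key for one ResourceSpans entry; the key is what gets
// mapped to a backend endpoint.
func routingKey(rs ptrace.ResourceSpans, routeByService bool) string {
	if routeByService {
		if svc, ok := rs.Resource().Attributes().Get("service.name"); ok {
			return svc.AsString()
		}
		return ""
	}
	scopes := rs.ScopeSpans()
	if scopes.Len() == 0 || scopes.At(0).Spans().Len() == 0 {
		return ""
	}
	tid := scopes.At(0).Spans().At(0).TraceID()
	return hex.EncodeToString(tid[:])
}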

Contributor

But we get several routing IDs for one batch that already contains one trace. With a custom routing key (for example, by the service of some span in the trace), we actually can route one trace to several backends.

Contributor

Returning to my main point: I believe we can move traces only if len(routingIDs) is 1. For the simplicity of this PR, let's just copy everything as it was in the first place.
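
For clarity, an illustrative sketch of the copy-vs-move trade-off being discussed (not part of this PR, which keeps the simple always-copy behaviour): MoveTo empties the source ResourceSpans, so it is only safe when a batch is routed to a single backend.

package sketch

import "go.opentelemetry.io/collector/pdata/ptrace"

// appendBatch appends one batch into the per-endpoint payload. routingIDs is
// a hypothetical stand-in for the batch's routing IDs.
func appendBatch(batch, dest ptrace.Traces, routingIDs []string) {
	for i := 0; i < batch.ResourceSpans().Len(); i++ {
		src := batch.ResourceSpans().At(i)
		if len(routingIDs) == 1 {
			// Only one backend will see this batch: moving avoids a deep copy.
			src.MoveTo(dest.ResourceSpans().AppendEmpty())
		} else {
			// Several backends share the same batch: each needs its own copy.
			src.CopyTo(dest.ResourceSpans().AppendEmpty())
		}
	}
}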

@pkositsyn
Contributor

And as a side note: this change could possibly lead to a gRPC max message size exceeded error (default 4 MiB limit) in some cases with really big traces. By grouping, we make messages bigger in size, which only contributes to that error, although it's definitely worth it for the performance.

@runforesight

runforesight bot commented Feb 8, 2023

Foresight Summary

Major Impacts

build-and-test duration (43 minutes 34 seconds) has decreased 27 minutes 26 seconds compared to the main branch average (1 hour 11 minutes).

⭕  build-and-test-windows workflow finished in 7 seconds on 2nd Mar, 2023 (43 minutes 13 seconds less than the main branch avg.). Jobs windows-unittest-matrix and windows-unittest produced no test results.

✅  check-links workflow finished in 42 seconds on 2nd Mar, 2023 (1 minute 47 seconds less than the main branch avg.). Jobs: changed files, check-links.

✅  telemetrygen workflow finished in 1 minute 7 seconds on 2nd Mar, 2023 (1 minute 58 seconds less than the main branch avg.). Jobs: build-dev, publish-stable, publish-latest.

✅  changelog workflow finished in 2 minutes 23 seconds on 2nd Mar, 2023. Job: changelog.

✅  build-and-test workflow finished in 43 minutes 34 seconds on 2nd Mar, 2023 (27 minutes 26 seconds less than the main branch avg.). All test jobs passed with no failures or skips:

Job                                          Tests passed
correctness-metrics                          2
correctness-traces                           17
integration-tests                            55
unittest-matrix (1.19 and 1.20, internal)    561 each
unittest-matrix (1.19 and 1.20, extension)   537 each
unittest-matrix (1.19 and 1.20, processor)   1511 each
unittest-matrix (1.19 and 1.20, receiver-0)  2577 each
unittest-matrix (1.19 and 1.20, receiver-1)  1932 each
unittest-matrix (1.19 and 1.20, exporter)    2457 each
unittest-matrix (1.19 and 1.20, other)       4772 each

The remaining jobs (setup-environment, checks, check-codeowners, check-collector-module-version, build-examples, lint, the lint-matrix and unittest jobs, cross-compile, build-package, windows-msi, and the publish jobs) completed without reported test results.

✅  prometheus-compliance-tests workflow finished in 11 minutes 17 seconds on 2nd Mar, 2023 (⚠️ 2 minutes 20 seconds more than the main branch avg.). prometheus-compliance-tests: 21 tests passed.

✅  load-tests workflow finished in 19 minutes 36 seconds on 2nd Mar, 2023 (⚠️ 2 minutes 27 seconds more than the main branch avg.). All loadtest jobs passed: TestTraceAttributesProcessor (3), TestIdleMode (1), TestMetric10kDPS|TestMetricsFromFile (6), TestTraceNoBackend10kSPS|TestTrace1kSPSWithAttrs (8), TestMetricResourceProcessor|TestTrace10kSPS (12), TestTraceBallast1kSPSWithAttrs|TestTraceBallast1kSPSAddAttrs (10), TestBallastMemory|TestLog10kDPS (18).

✅  e2e-tests workflow finished in 16 minutes 32 seconds on 2nd Mar, 2023. kubernetes-test ran against v1.26.0, v1.25.3, v1.24.7, and v1.23.13.

@jessberg jessberg marked this pull request as draft February 10, 2023 15:51
@jessberg jessberg marked this pull request as ready for review February 13, 2023 16:47
@jessberg
Author

It looks like it is failing the lint tests; I ran golangci-lint run --fix. Is there any other linting that is done for this project? I can't seem to pinpoint what issue it has.

@jpkrohling
Member

You can run make lint in the component's root directory, which should behave the same way as the CI.

@janssenlima

I built the collector with this fix. I've been using it since last week, and it works fine.

@pkositsyn
Contributor

I built the collector with this fix. I've been using it since last week, and it works fine.

Does it prove correctness?

@janssenlima

I built the collector with this fix. I've been using it since last week, and it works fine.

Does it prove correctness?

Yes.
I didn't find any problems.

@janssenlima

And as a side note: this change could possibly lead to a gRPC max message size exceeded error (default 4 MiB limit) in some cases with really big traces. By grouping, we make messages bigger in size, which only contributes to that error, although it's definitely worth it for the performance.

It's happened to me. I adjusted the max_recv_msg_size_mib in the collector. Additionally, I've recommended using gRPC compression in the microservice.
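
For readers hitting the same limit, an illustrative configuration sketch (not part of this PR; field placement may vary by collector version), assuming the backends are collectors with an OTLP gRPC receiver and that compression is enabled on whichever side is sending the batched payloads:

# On each backend collector: raise the OTLP receiver's gRPC receive limit
# above the 4 MiB default.
receivers:
  otlp:
    protocols:
      grpc:
        max_recv_msg_size_mib: 16   # example value; size to your largest batches

# On the sending side (here, the load-balancing collector's exporter):
# gRPC compression keeps the larger batched payloads smaller on the wire.
exporters:
  loadbalancing:
    protocol:
      otlp:
        compression: gzip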

@jpkrohling
Member

@pkositsyn, are you happy with this change, or do you have concerns about the correctness of the final result?

@jpkrohling
Member

I'm merging this, as I think it's worth having. The one thing I wish this had, which could be part of a follow-up PR, is a warning about gRPC messages exceeding the maximum size, along with concrete instructions on how to avoid it (based on @janssenlima's comment, perhaps).

Member

@jpkrohling jpkrohling left a comment

I like this PR, but there are a few things making me uncomfortable:

  • nit: HasMultipleEndpoints could have been marked as private
  • the problem with the gRPC message size highlights a bigger design problem: we should rely on the collector's common batching features instead. Perhaps we could use connectors here and batch things in the second pipeline? @kovrus, what do you think?
  • I'm still not 100% sure we have correctness, as @pkositsyn mentioned. Perhaps it's evidence that this could have more tests?

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label May 26, 2023
@github-actions
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Jun 10, 2023
@forestsword

@jessberg have you moved on? Should someone else pick up this work? I've been running this for a few months now and would like the issue resolved.

@forestsword

I believe this issue is referencing the same problem but I might be wrong: #13826

@jessberg
Author

Hi, sorry, yes, I have moved on and no longer have the cycles for this. Sorry about that; hopefully this partial version will work for some people.

@nicacioliveira

nicacioliveira commented Oct 26, 2023

Hi, will this work be continued? I'm experiencing the same problem with a load close to 300k RPS using the load balancer, where each request has big traces.

@rhnasc

rhnasc commented Apr 11, 2024

This (merged) PR appears to fix this issue.
#30141
