
Conversation

@jessberg

@jessberg jessberg commented Feb 3, 2023

Description:
Some users (including myself) have had issues with the loadbalancingexporter failing under high load. This is because the current code calls the underlying exporter for each span in the case of traces, and for each unit of log data in the case of logs. This results in many calls to the underlying exporter, even though many of them could be consolidated because they are sending to the same backend. This PR batches the data by endpoint before calling any underlying exporter in order to reduce the overall call overhead.
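
To make the approach concrete, here is a minimal sketch of the idea in Go (not the exact code in this PR): group ResourceSpans by their resolved endpoint first, then call each backend exporter once per endpoint. The resolveEndpoint callback and the exporters map are hypothetical placeholders for the exporter's routing and backend lookup.

package sketch

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// consumeTraces groups ResourceSpans by their resolved endpoint and calls each
// backend exporter once per endpoint instead of once per span/ResourceSpans.
func consumeTraces(
	ctx context.Context,
	td ptrace.Traces,
	resolveEndpoint func(ptrace.ResourceSpans) string,
	exporters map[string]consumer.Traces,
) error {
	batches := map[string]ptrace.Traces{} // endpoint -> merged payload
	for i := 0; i < td.ResourceSpans().Len(); i++ {
		rs := td.ResourceSpans().At(i)
		endpoint := resolveEndpoint(rs)
		merged, ok := batches[endpoint]
		if !ok {
			merged = ptrace.NewTraces()
			batches[endpoint] = merged
		}
		rs.CopyTo(merged.ResourceSpans().AppendEmpty())
	}
	// One export call per endpoint rather than one per ResourceSpans.
	for endpoint, merged := range batches {
		if err := exporters[endpoint].ConsumeTraces(ctx, merged); err != nil {
			return err
		}
	}
	return nil
}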

Link to tracking Issue: 17173

Testing:
I changed a few of the tests so that the expected results are not many pdata.Traces or plog.Logs going to the backend, but rather one large pdata.Traces or plog.Logs.

Documentation:
No documentation changes.

@jessberg jessberg requested a review from a team February 3, 2023 19:36
@jessberg jessberg requested a review from jpkrohling as a code owner February 3, 2023 19:36
@linux-foundation-easycla

linux-foundation-easycla bot commented Feb 3, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@jpkrohling
Member

I'm adding this to my review queue, but perhaps @pkositsyn would want to review this one as well?

@pkositsyn
Contributor

@jpkrohling I actually did much the same thing a couple of years ago at my company; I even replaced the Split function with some custom logic.

History aside, I'm ready to review this PR. I hope to get to it tomorrow.

Contributor

@pkositsyn pkositsyn left a comment

Overall this looks good; just a couple of questions about excessive copying.

Comment on lines 103 to 105
for i := 0; i < batches[batch].ResourceSpans().Len(); i++ {
batches[batch].ResourceSpans().At(i).CopyTo(endpointToTraceData[endpoint].ResourceSpans().AppendEmpty())
}
Contributor

Actually, here we could also benefit from moving instead of copying, after first checking that routingIDs has a length of 1.

Author

Why would we check that routingIDs has a length of 1? Shouldn't we just copy the trace to each backend it's supposed to be routed to, even if that's multiple backends?

Contributor

Well, I mean that when there is only one routing backend, why would we need to copy at all? With two or more, each backend would share the same traces structure, so in that case we should copy for safety.

Sure, we can always copy, but there is room for optimization here; that is my point.

Author

I see what you mean. I think we should check for a single endpoint outside of that for loop and outside of routingIDs, though. The point of the optimization is that there may be shared endpoints across different batches, and across different calls that create routingIDs, and those should be sent with the same batched call. I made a function that checks whether there is just one endpoint there.
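
For illustration only, a minimal sketch of what such a check could look like (the PR's actual helper, referred to later as HasMultipleEndpoints, may differ): given the endpoint chosen for each routing ID across all batches, it reports whether everything targets the same backend, in which case the data could be forwarded without copying.

// hasSingleEndpoint reports whether every routing ID resolved to the same
// backend endpoint. endpointByRoutingID is a hypothetical map built while
// computing routingIDs for all batches.
func hasSingleEndpoint(endpointByRoutingID map[string]string) bool {
	first := ""
	for _, endpoint := range endpointByRoutingID {
		if first == "" {
			first = endpoint
			continue
		}
		if endpoint != first {
			return false
		}
	}
	return true
}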

Contributor

Good call on this optimization; this could indeed save some RPC calls (not actually what I meant :)
But I have one concern: with this code we could send one pdata.Traces containing spans with different TraceIDs. This could be a breaking change.
@jpkrohling Does this exporter guarantee that one sent pdata.Traces contains only one TraceID? Or could there be spans from different traces belonging to that backend?

As for my previous point, I meant the situation where there is, for example, only one trace that is routed to multiple backends. In that case it should be fully copied for each of those backends. And if we see that there is only one routingID for a batch, then I think we could safely move. Maybe we could create a test case for this to make clear what I'm trying to say?

Member

I meant the situation where there is, for example, only one trace that is routed to multiple backends

Right now, one trace can be sent to only one backend.

Contributor

Right now, one trace can be sent to only one backend.
Is that so? Why is routingIDs a slice then?

Member

One pdata.Traces may contain multiple traces (represented as a ResourceSpan), whose routing keys (service names or trace IDs) are mapped to backends.
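
As an illustration of that mapping (a sketch under assumptions, not the exporter's actual code), a routing key per ResourceSpans might be the service.name resource attribute when routing by service, or the first span's trace ID otherwise; the key is then mapped onto a backend.

package sketch

import (
	"encoding/hex"

	"go.opentelemetry.io/collector/pdata/ptrace"
)

// routingKey derives a key for one ResourceSpans entry; the key is what gets
// mapped to a backend endpoint.
func routingKey(rs ptrace.ResourceSpans, routeByService bool) string {
	if routeByService {
		if svc, ok := rs.Resource().Attributes().Get("service.name"); ok {
			return svc.AsString()
		}
		return ""
	}
	scopes := rs.ScopeSpans()
	if scopes.Len() == 0 || scopes.At(0).Spans().Len() == 0 {
		return ""
	}
	tid := scopes.At(0).Spans().At(0).TraceID()
	return hex.EncodeToString(tid[:])
}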

Contributor

But we get several routing IDs for one batch that already contains one trace. With a custom routing key (for example, by the service of some span in the trace), we actually can route one trace to several backends.

Contributor

Returning to my main point: I believe we can move traces only if len(routingIDs) is 1. For the simplicity of this PR, let's just copy everything as it was in the first place.
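
For clarity, an illustrative sketch of the copy-vs-move trade-off being discussed (not part of this PR, which keeps the simple always-copy behaviour): MoveTo empties the source ResourceSpans, so it is only safe when a batch is routed to a single backend.

package sketch

import "go.opentelemetry.io/collector/pdata/ptrace"

// appendBatch appends one batch into the per-endpoint payload. routingIDs is
// a hypothetical stand-in for the batch's routing IDs.
func appendBatch(batch, dest ptrace.Traces, routingIDs []string) {
	for i := 0; i < batch.ResourceSpans().Len(); i++ {
		src := batch.ResourceSpans().At(i)
		if len(routingIDs) == 1 {
			// Only one backend will see this batch: moving avoids a deep copy.
			src.MoveTo(dest.ResourceSpans().AppendEmpty())
		} else {
			// Several backends share the same batch: each needs its own copy.
			src.CopyTo(dest.ResourceSpans().AppendEmpty())
		}
	}
}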

@pkositsyn
Contributor

And as a side note: this change could possibly lead to a gRPC max message size exceeded error (default 4 MiB limit) in some cases with really big traces. By grouping, we make messages bigger in size, which only contributes to that error, although it's definitely worth it for the performance.

@runforesight

runforesight bot commented Feb 8, 2023

Foresight Summary

Major Impacts

build-and-test duration (43 minutes 34 seconds) has decreased 27 minutes 26 seconds compared to the main branch average (1 hour 11 minutes).

⭕  build-and-test-windows workflow finished in 7 seconds on 2nd Mar, 2023 (43 minutes 13 seconds less than the main branch avg.). Jobs windows-unittest-matrix and windows-unittest produced no test results.

✅  check-links workflow finished in 42 seconds on 2nd Mar, 2023 (1 minute 47 seconds less than the main branch avg.). Jobs: changed files, check-links.

✅  telemetrygen workflow finished in 1 minute 7 seconds on 2nd Mar, 2023 (1 minute 58 seconds less than the main branch avg.). Jobs: build-dev, publish-stable, publish-latest.

✅  changelog workflow finished in 2 minutes 23 seconds on 2nd Mar, 2023. Job: changelog.

✅  build-and-test workflow finished in 43 minutes 34 seconds on 2nd Mar, 2023 (27 minutes 26 seconds less than the main branch avg.). All test jobs passed with no failures or skips:

Job                                          Tests passed
correctness-metrics                          2
correctness-traces                           17
integration-tests                            55
unittest-matrix (1.19 and 1.20, internal)    561 each
unittest-matrix (1.19 and 1.20, extension)   537 each
unittest-matrix (1.19 and 1.20, processor)   1511 each
unittest-matrix (1.19 and 1.20, receiver-0)  2577 each
unittest-matrix (1.19 and 1.20, receiver-1)  1932 each
unittest-matrix (1.19 and 1.20, exporter)    2457 each
unittest-matrix (1.19 and 1.20, other)       4772 each

The remaining jobs (setup-environment, checks, check-codeowners, check-collector-module-version, build-examples, lint, the lint-matrix and unittest jobs, cross-compile, build-package, windows-msi, and the publish jobs) completed without reported test results.

✅  prometheus-compliance-tests workflow finished in 11 minutes 17 seconds on 2nd Mar, 2023 (⚠️ 2 minutes 20 seconds more than the main branch avg.). prometheus-compliance-tests: 21 tests passed.

✅  load-tests workflow finished in 19 minutes 36 seconds on 2nd Mar, 2023 (⚠️ 2 minutes 27 seconds more than the main branch avg.). All loadtest jobs passed: TestTraceAttributesProcessor (3), TestIdleMode (1), TestMetric10kDPS|TestMetricsFromFile (6), TestTraceNoBackend10kSPS|TestTrace1kSPSWithAttrs (8), TestMetricResourceProcessor|TestTrace10kSPS (12), TestTraceBallast1kSPSWithAttrs|TestTraceBallast1kSPSAddAttrs (10), TestBallastMemory|TestLog10kDPS (18).

✅  e2e-tests workflow finished in 16 minutes 32 seconds on 2nd Mar, 2023. kubernetes-test ran against v1.26.0, v1.25.3, v1.24.7, and v1.23.13.

@jessberg jessberg marked this pull request as draft February 10, 2023 15:51
@jessberg jessberg marked this pull request as ready for review February 13, 2023 16:47
@jessberg
Author

It looks like it is failing the lint tests; I ran golangci-lint run --fix. Is there any other linting that is done for this project? I can't seem to pinpoint what issue it has.

@jpkrohling
Member

You can run make lint in the component's root directory, which should behave the same way as the CI.

@janssenlima

I built the collector with this fix. I've been using it since last week, and it works fine.

@pkositsyn
Contributor

I built the collector with this fix. I've been using it since last week, and it works fine.

Does it prove correctness?

@janssenlima

I built the collector with this fix. I've been using it since last week, and it works fine.

Does it prove correctness?

Yes.
I didn't find any problems.

@janssenlima

And as a side note: this change could possibly lead to a gRPC max message size exceeded error (default 4 MiB limit) in some cases with really big traces. By grouping, we make messages bigger in size, which only contributes to that error, although it's definitely worth it for the performance.

It's happened to me. I adjusted the max_recv_msg_size_mib in the collector. Additionally, I've recommended using gRPC compression in the microservice.
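
For readers hitting the same limit, an illustrative configuration sketch (not part of this PR; field placement may vary by collector version), assuming the backends are collectors with an OTLP gRPC receiver and that compression is enabled on whichever side is sending the batched payloads:

# On each backend collector: raise the OTLP receiver's gRPC receive limit
# above the 4 MiB default.
receivers:
  otlp:
    protocols:
      grpc:
        max_recv_msg_size_mib: 16   # example value; size to your largest batches

# On the sending side (here, the load-balancing collector's exporter):
# gRPC compression keeps the larger batched payloads smaller on the wire.
exporters:
  loadbalancing:
    protocol:
      otlp:
        compression: gzip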

@jpkrohling
Member

@pkositsyn, are you happy with this change, or do you have concerns about the correctness of the final result?

@jpkrohling
Member

I'm merging this, as I think it's worth having. The one thing I wish this had, which could be part of a follow-up PR, is a warning about gRPC messages exceeding the maximum size, along with concrete instructions on how to avoid it (based on @janssenlima's comment, perhaps).

Member

@jpkrohling jpkrohling left a comment

I like this PR, but there are a few things making me uncomfortable:

  • nit: HasMultipleEndpoints could have been marked as private
  • the problem with the gRPC message size highlights a bigger design problem: we should rely on the collector's common batching features instead. Perhaps we could use connectors here and batch things in the second pipeline? @kovrus, what do you think?
  • I'm still not 100% sure we have correctness, as @pkositsyn mentioned. Perhaps it's evidence that this could have more tests?

@github-actions
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label May 26, 2023
@github-actions
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Jun 10, 2023
@forestsword

@jessberg have you moved on? Should someone else pick up this work? I've been running this for a few months now and would like the issue resolved.

@forestsword

I believe this issue is referencing the same problem but I might be wrong: #13826

@jessberg
Author

Hi, sorry, yes, I have moved on and no longer have the cycles for this. Sorry about that; hopefully this partial version will work for some people.

@nicacioliveira

nicacioliveira commented Oct 26, 2023

Hi, will this work be continued? I'm experiencing the same problem with a load close to 300k RPS using the load balancer, where each request has big traces.

@rhnasc

rhnasc commented Apr 11, 2024

This (merged) PR appears to fix this issue.
#30141
