
Conversation

@fearful-symmetry (Contributor) commented Dec 14, 2023

closes #37258

This adds an output latency metric to the last 30s metrics under monitoring.metrics.libbeat.output.write.latency:

sudo cat logs/metricbeat-20231214.ndjson | grep "last 30s" | jq -c .monitoring.metrics.libbeat.output.write

{"bytes":34869,"latency":{"histogram":{"count":2,"max":260,"mean":196,"median":196,"min":132,"p75":260,"p95":260,"p99":260,"p999":260,"stddev":64}}}
{"bytes":44541,"latency":{"histogram":{"count":5,"max":276,"mean":231.8,"median":251,"min":132,"p75":268,"p95":276,"p99":276,"p999":276,"stddev":51.27338490874188}}}
{"bytes":27483,"latency":{"histogram":{"count":7,"max":536,"mean":276.7142857142857,"median":251,"min":132,"p75":276,"p95":536,"p99":536,"p999":536,"stddev":114.43365424773341}}}
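
For reference, the histogram fields in the output above (count, min, max, mean, median, p75 through p999, stddev) are the ones a sampled histogram from the go-metrics library (github.com/rcrowley/go-metrics) reports. A minimal sketch of feeding write durations into such a histogram, assuming that library (the sample type and size below are illustrative, not the PR's actual wiring), could look like:

package main

import (
    "fmt"
    "time"

    metrics "github.com/rcrowley/go-metrics"
)

func main() {
    // Sampled histogram standing in for libbeat.output.write.latency (sketch only).
    hist := metrics.NewHistogram(metrics.NewUniformSample(1024))

    // Record two hypothetical write durations, in milliseconds.
    for _, d := range []time.Duration{132 * time.Millisecond, 260 * time.Millisecond} {
        hist.Update(d.Milliseconds())
    }

    ps := hist.Percentiles([]float64{0.75, 0.95, 0.99, 0.999})
    fmt.Printf("count=%d min=%d max=%d mean=%.1f p75=%.0f p95=%.0f p99=%.0f p999=%.0f\n",
        hist.Count(), hist.Min(), hist.Max(), hist.Mean(), ps[0], ps[1], ps[2], ps[3])
}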

There are a few caveats with support for this:

  • Kafka isn't supported, since we're using a third-party async library and I don't see a good way to hook into it or to measure sync writes to the output.
  • Ditto for the async Logstash output.
  • This is currently measured in nanoseconds; although that seems like an excessive level of granularity, I discovered that on my Linux machine running from an M.2 SSD, writes with the file output are too fast for millisecond-level granularity.

I'm also mildly annoyed by the precision with which the floats are printed:

"mean":5630.393145161291,"median":4065.5,"min":1420,"p75":4667.5,"p95":10702.99999999999

I'm currently looking at a way to fix that, but it'll probably require some refactoring of the metrics logger.
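
One possible direction, purely as an illustration rather than what this PR ends up doing, is to round the snapshot values before they reach the metrics logger:

package main

import (
    "fmt"
    "math"
)

// roundTo2 trims long float tails (e.g. 10702.99999999999) to two decimal
// places; a hypothetical helper for illustration, not part of this PR.
func roundTo2(v float64) float64 {
    return math.Round(v*100) / 100
}

func main() {
    fmt.Println(roundTo2(10702.99999999999)) // 10703
    fmt.Println(roundTo2(5630.393145161291)) // 5630.39
}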

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@fearful-symmetry fearful-symmetry added the Team:Elastic-Agent (Label for the Agent team) label Dec 14, 2023
@fearful-symmetry fearful-symmetry self-assigned this Dec 14, 2023
@fearful-symmetry fearful-symmetry requested a review from a team as a code owner December 14, 2023 17:43
@botelastic botelastic bot added and then removed the needs_team (Indicates that the issue/PR needs a Team:* label) label Dec 14, 2023
@fearful-symmetry fearful-symmetry changed the title from "add metrics for output latency" to "Add histogram and output metrics for output latency" Dec 14, 2023
@mergify (bot) commented Dec 14, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏
To do so, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, add the backport labels for the needed branches, for example:

  • backport-v8.x.0 is the label to automatically backport to the 8.x branch, where x is the minor version digit

@elasticmachine (Contributor)

❕ Build Aborted

Either there was a build timeout or someone aborted the build.

Build stats

  • Duration: 50 min 32 sec

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@cmacknz (Member) commented Dec 14, 2023

This is currently measured in nanoseconds; although that seems like an excessive level of granularity, I discovered that on my linux machine running from an M.2 SSD, writes with the file output are too fast for millisecond-level granularity

I wouldn't optimize for the file output; make sure it is easy to read and cheap to store when using the outputs we will see in reality.

If output latency is some small value like 50ns, it can effectively be zero; that is so fast that it is not the problem, and we don't need to measure it with any more precision.

@fearful-symmetry (Contributor, Author)

@cmacknz yah, using milliseconds produces slightly less annoying output for the ES output, at least.
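
A minimal sketch of millisecond-granularity timing around a send, assuming a go-metrics histogram as the sink (sendBatch and the wiring here are hypothetical stand-ins, not the PR's actual output code):

package main

import (
    "fmt"
    "time"

    metrics "github.com/rcrowley/go-metrics"
)

// sendBatch stands in for an output's blocking network write.
func sendBatch() error {
    time.Sleep(20 * time.Millisecond)
    return nil
}

func main() {
    latency := metrics.NewHistogram(metrics.NewUniformSample(1024))

    // Time just the write and record it in milliseconds; nanosecond precision
    // isn't needed for the outputs we expect to see in practice.
    begin := time.Now()
    err := sendBatch()
    latency.Update(time.Since(begin).Milliseconds())

    fmt.Println("err:", err, "count:", latency.Count(), "max(ms):", latency.Max())
}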

@elasticmachine (Contributor)

❕ Build Aborted

There is a new build on-going so the previous on-going builds have been aborted.

Build stats

  • Start Time: 2023-12-14T18:16:22.649+0000

  • Duration: 56 min 28 sec

Test stats 🧪

Test results: 0 failed, 27592 passed, 1933 skipped, 29525 total

Steps errors: 1

Error signal
  • Took 0 min 0 sec
  • Description: Error 'org.jenkinsci.plugins.workflow.steps.FlowInterruptedException'

@elasticmachine (Contributor)

💚 Build Succeeded

Build stats

  • Start Time: 2023-12-14T19:04:44.570+0000

  • Duration: 133 min 51 sec

Test stats 🧪

Test results: 0 failed, 28732 passed, 2015 skipped, 30747 total

💚 Flaky test report

Tests succeeded.

}

for len(events) > 0 {
    begin := time.Now()
Contributor

I think this is too large of a span; we have reconnects and windowed events in here. I think we need to re-think this part. We really just want the time span to start right before the data is sent over the network and end right when we know whether the data was written (ACK); even if there is an error, we want to "bill" the time spent communicating with Logstash.

FYI, it's OK if we want to pull Logstash out of this PR and make a new issue to track that going forward.

Contributor Author

Yah, I actually kinda chose that span deliberately; having the reconnect be part of the count actually seemed fairly accurate? It's a blocking operation, so a reconnect would influence the latency?

Contributor

It definitely increases the overall latency for sending the event, but that isn't what this metric is focused on. This is just trying to get a sense of how long it takes for the output to receive the data and then get back to us with an acknowledgment that the data is written.

An example might help: if events/sec drops but this metric goes up, we have a good indicator that we should be looking at the output (Logstash/Elasticsearch/Kafka) for issues, not the Beat.

Contributor Author

Alright, that makes sense. Tinkered with it a bit.
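
A sketch of the narrower span being discussed, with hypothetical stand-ins for the Logstash client's internals (event, conn, latencyHist, and publishEvents below are illustrative, not the PR's actual code): the clock starts right before the bytes go on the wire and stops once the ACK (or error) comes back, so reconnects and window bookkeeping stay outside the measurement.

package main

import (
    "time"

    metrics "github.com/rcrowley/go-metrics"
)

type event struct{}

type conn struct{}

// Send blocks until the remote end ACKs (or errors) and reports how many
// events were consumed.
func (c *conn) Send(events []event) (int, error) {
    time.Sleep(5 * time.Millisecond)
    return len(events), nil
}

type client struct {
    conn        *conn
    latencyHist metrics.Histogram
}

func (c *client) publishEvents(events []event) error {
    for len(events) > 0 {
        // Reconnects, window sizing, and other bookkeeping would happen out
        // here and are not billed to output latency.

        begin := time.Now()           // start just before the bytes go on the wire
        n, err := c.conn.Send(events) // returns once the data is ACKed (or fails)
        c.latencyHist.Update(time.Since(begin).Milliseconds())

        if err != nil {
            return err // a failed attempt's wire time is still recorded above
        }
        events = events[n:]
    }
    return nil
}

func main() {
    c := &client{
        conn:        &conn{},
        latencyHist: metrics.NewHistogram(metrics.NewUniformSample(1024)),
    }
    _ = c.publishEvents(make([]event, 3))
}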

@elasticmachine (Contributor)

💚 Build Succeeded

Build stats

  • Start Time: 2023-12-18T17:42:18.958+0000

  • Duration: 133 min 41 sec

Test stats 🧪

Test results: 0 failed, 28732 passed, 2015 skipped, 30747 total

💚 Flaky test report

Tests succeeded.

@elasticmachine (Contributor)

💚 Build Succeeded

Build stats

  • Start Time: 2023-12-18T22:36:27.018+0000

  • Duration: 132 min 9 sec

Test stats 🧪

Test results: 0 failed, 28732 passed, 2015 skipped, 30747 total

💚 Flaky test report

Tests succeeded.

@leehinman leehinman left a comment

LGTM

@fearful-symmetry fearful-symmetry merged commit c9f1426 into elastic:main Jan 3, 2024
Scholar-Li pushed a commit to Scholar-Li/beats that referenced this pull request Feb 5, 2024
* add metrics for output latency

* plz linter

* use milliseconds

* move around timekeeping statements

* move logstash latency window

Labels

Team:Elastic-Agent Label for the Agent team

Development

Successfully merging this pull request may close these issues.

Add latency metric to libbeat.output metrics
