
Conversation

@fearful-symmetry (Contributor) commented Dec 14, 2023

closes #37258

This adds an output latency metric to the last 30s metrics under monitoring.metrics.libbeat.output.write.latency:

sudo cat logs/metricbeat-20231214.ndjson | grep "last 30s" | jq -c .monitoring.metrics.libbeat.output.write

{"bytes":34869,"latency":{"histogram":{"count":2,"max":260,"mean":196,"median":196,"min":132,"p75":260,"p95":260,"p99":260,"p999":260,"stddev":64}}}
{"bytes":44541,"latency":{"histogram":{"count":5,"max":276,"mean":231.8,"median":251,"min":132,"p75":268,"p95":276,"p99":276,"p999":276,"stddev":51.27338490874188}}}
{"bytes":27483,"latency":{"histogram":{"count":7,"max":536,"mean":276.7142857142857,"median":251,"min":132,"p75":276,"p95":536,"p99":536,"p999":536,"stddev":114.43365424773341}}}
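
For reference, the histogram fields in the output above (count, min, max, mean, median, p75 through p999, stddev) are the ones a sampled histogram from the go-metrics library (github.com/rcrowley/go-metrics) reports. A minimal sketch of feeding write durations into such a histogram, assuming that library (the sample type and size below are illustrative, not the PR's actual wiring), could look like:

package main

import (
    "fmt"
    "time"

    metrics "github.com/rcrowley/go-metrics"
)

func main() {
    // Sampled histogram standing in for libbeat.output.write.latency (sketch only).
    hist := metrics.NewHistogram(metrics.NewUniformSample(1024))

    // Record two hypothetical write durations, in milliseconds.
    for _, d := range []time.Duration{132 * time.Millisecond, 260 * time.Millisecond} {
        hist.Update(d.Milliseconds())
    }

    ps := hist.Percentiles([]float64{0.75, 0.95, 0.99, 0.999})
    fmt.Printf("count=%d min=%d max=%d mean=%.1f p75=%.0f p95=%.0f p99=%.0f p999=%.0f\n",
        hist.Count(), hist.Min(), hist.Max(), hist.Mean(), ps[0], ps[1], ps[2], ps[3])
}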

There are a few caveats with support for this:

  • Kafka isn't supported, since we're using a third-party async library and I don't see a good way to hook into it or to measure sync writes to the output.
  • Ditto for the async Logstash output.
  • This is currently measured in nanoseconds; although that seems like an excessive level of granularity, I discovered that on my Linux machine running from an M.2 SSD, writes with the file output are too fast for millisecond-level granularity.

I'm also mildly annoyed by the precision with which the floats are printed:

"mean":5630.393145161291,"median":4065.5,"min":1420,"p75":4667.5,"p95":10702.99999999999

I'm currently looking at a way to fix that, but it'll probably require some refactoring of the metrics logger.
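
One possible direction, purely as an illustration rather than what this PR ends up doing, is to round the snapshot values before they reach the metrics logger:

package main

import (
    "fmt"
    "math"
)

// roundTo2 trims long float tails (e.g. 10702.99999999999) to two decimal
// places; a hypothetical helper for illustration, not part of this PR.
func roundTo2(v float64) float64 {
    return math.Round(v*100) / 100
}

func main() {
    fmt.Println(roundTo2(10702.99999999999)) // 10703
    fmt.Println(roundTo2(5630.393145161291)) // 5630.39
}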

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@fearful-symmetry fearful-symmetry added the Team:Elastic-Agent (Label for the Agent team) label Dec 14, 2023
@fearful-symmetry fearful-symmetry self-assigned this Dec 14, 2023
@fearful-symmetry fearful-symmetry requested a review from a team as a code owner December 14, 2023 17:43
@botelastic botelastic bot added and then removed the needs_team (Indicates that the issue/PR needs a Team:* label) label Dec 14, 2023
@fearful-symmetry fearful-symmetry changed the title from "add metrics for output latency" to "Add histogram and output metrics for output latency" Dec 14, 2023
@mergify (bot) commented Dec 14, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏
To do so, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, add the backport labels for the needed branches, for example:

  • backport-v8.x.0 is the label to automatically backport to the 8.x branch, where x is the minor version digit

@elasticmachine (Contributor)

❕ Build Aborted

Either there was a build timeout or someone aborted the build.

Build stats

  • Duration: 50 min 32 sec

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@cmacknz (Member) commented Dec 14, 2023

This is currently measured in nanoseconds; although that seems like an excessive level of granularity, I discovered that on my linux machine running from an M.2 SSD, writes with the file output are too fast for millisecond-level granularity

I wouldn't optimize for the file output; make sure it is easy to read and cheap to store when using the outputs we will see in reality.

If output latency is some small value like 50ns, it can effectively be zero; that is so fast that it is not the problem, and we don't need to measure it with any more precision.

@fearful-symmetry (Contributor, Author)

@cmacknz yah, using milliseconds produces slightly less annoying output for the ES output, at least.
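
A minimal sketch of millisecond-granularity timing around a send, assuming a go-metrics histogram as the sink (sendBatch and the wiring here are hypothetical stand-ins, not the PR's actual output code):

package main

import (
    "fmt"
    "time"

    metrics "github.com/rcrowley/go-metrics"
)

// sendBatch stands in for an output's blocking network write.
func sendBatch() error {
    time.Sleep(20 * time.Millisecond)
    return nil
}

func main() {
    latency := metrics.NewHistogram(metrics.NewUniformSample(1024))

    // Time just the write and record it in milliseconds; nanosecond precision
    // isn't needed for the outputs we expect to see in practice.
    begin := time.Now()
    err := sendBatch()
    latency.Update(time.Since(begin).Milliseconds())

    fmt.Println("err:", err, "count:", latency.Count(), "max(ms):", latency.Max())
}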

@elasticmachine (Contributor)

❕ Build Aborted

There is a new build on-going so the previous on-going builds have been aborted.

Build stats

  • Start Time: 2023-12-14T18:16:22.649+0000

  • Duration: 56 min 28 sec

Test stats 🧪

Test results: 0 failed, 27592 passed, 1933 skipped, 29525 total

Steps errors: 1

Error signal
  • Took 0 min 0 sec
  • Description: Error 'org.jenkinsci.plugins.workflow.steps.FlowInterruptedException'

@elasticmachine (Contributor)

💚 Build Succeeded

Build stats

  • Start Time: 2023-12-14T19:04:44.570+0000

  • Duration: 133 min 51 sec

Test stats 🧪

Test results: 0 failed, 28732 passed, 2015 skipped, 30747 total

💚 Flaky test report

Tests succeeded.

}

for len(events) > 0 {
    begin := time.Now()
Contributor

I think this is too large of a span; we have reconnects and windowed events in here. I think we need to re-think this part. We really just want the time span to start right before the data is sent over the network and end right when we know whether the data was written (ACK); even if there is an error, we want to "bill" the time spent communicating with Logstash.

FYI, it's OK if we want to pull Logstash out of this PR and make a new issue to track that going forward.

Contributor Author

Yah, I actually kinda chose that span deliberately; having the reconnect be part of the count actually seemed fairly accurate? It's a blocking operation, so a reconnect would influence the latency?

Contributor

It definitely increases the overall latency for sending the event, but that isn't what this metric is focused on. This is just trying to get a sense of how long it takes for the output to receive the data and then get back to us with an acknowledgment that the data is written.

An example might help: if events/sec drops but this metric goes up, we have a good indicator that we should be looking at the output (Logstash/Elasticsearch/Kafka) for issues, not the Beat.

Contributor Author

Alright, that makes sense. Tinkered with it a bit.
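
A sketch of the narrower span being discussed, with hypothetical stand-ins for the Logstash client's internals (event, conn, latencyHist, and publishEvents below are illustrative, not the PR's actual code): the clock starts right before the bytes go on the wire and stops once the ACK (or error) comes back, so reconnects and window bookkeeping stay outside the measurement.

package main

import (
    "time"

    metrics "github.com/rcrowley/go-metrics"
)

type event struct{}

type conn struct{}

// Send blocks until the remote end ACKs (or errors) and reports how many
// events were consumed.
func (c *conn) Send(events []event) (int, error) {
    time.Sleep(5 * time.Millisecond)
    return len(events), nil
}

type client struct {
    conn        *conn
    latencyHist metrics.Histogram
}

func (c *client) publishEvents(events []event) error {
    for len(events) > 0 {
        // Reconnects, window sizing, and other bookkeeping would happen out
        // here and are not billed to output latency.

        begin := time.Now()           // start just before the bytes go on the wire
        n, err := c.conn.Send(events) // returns once the data is ACKed (or fails)
        c.latencyHist.Update(time.Since(begin).Milliseconds())

        if err != nil {
            return err // a failed attempt's wire time is still recorded above
        }
        events = events[n:]
    }
    return nil
}

func main() {
    c := &client{
        conn:        &conn{},
        latencyHist: metrics.NewHistogram(metrics.NewUniformSample(1024)),
    }
    _ = c.publishEvents(make([]event, 3))
}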

@elasticmachine (Contributor)

💚 Build Succeeded

Build stats

  • Start Time: 2023-12-18T17:42:18.958+0000

  • Duration: 133 min 41 sec

Test stats 🧪

Test results: 0 failed, 28732 passed, 2015 skipped, 30747 total

💚 Flaky test report

Tests succeeded.

@elasticmachine (Contributor)

💚 Build Succeeded

Build stats

  • Start Time: 2023-12-18T22:36:27.018+0000

  • Duration: 132 min 9 sec

Test stats 🧪

Test results: 0 failed, 28732 passed, 2015 skipped, 30747 total

💚 Flaky test report

Tests succeeded.

@leehinman leehinman left a comment

LGTM

@fearful-symmetry fearful-symmetry merged commit c9f1426 into elastic:main Jan 3, 2024
Scholar-Li pushed a commit to Scholar-Li/beats that referenced this pull request Feb 5, 2024
* add metrics for output latency

* plz linter

* use milliseconds

* move around timekeeping statements

* move logstash latency window

Labels

Team:Elastic-Agent Label for the Agent team

Development

Successfully merging this pull request may close these issues.

Add latency metric to libbeat.output metrics
