Add histogram and output metrics for output latency #37445
Conversation
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed branches.
I wouldn't optimize for the file output; make sure it is easy to read and cheap to store with the outputs we will see in reality. If the output latency is some small value like 50ns, it can effectively be zero: anything that fast is not the problem, and we don't need to measure it with any more precision.
@cmacknz yah,
❕ Build Aborted
libbeat/outputs/logstash/sync.go (Outdated)

	for len(events) > 0 {
		begin := time.Now()
I think this is too large of a span; we have reconnect and windowed events in here, so I think we need to rethink this part. We really just want the time span to start right before the data is sent over the network, and end right when we know the data is written (ACK). Even if there is an error, we want to "bill" the time spent communicating with Logstash.
FYI, it's OK if we want to pull Logstash out of this PR and make a new issue to track that going forward.
Yah. I actually kinda chose that span deliberately; having reconnect be part of the count actually seemed fairly accurate? It's a blocking operation, so a reconnect would influence the latency?
It definitely increases the overall latency for sending the event, but that isn't what this metric is focused on. This is just trying to get a sense of how long it takes for the output to receive the data and then get back to us with acknowledgment that the data is written.
An example might help: if events/sec drop but this metric goes up, we have a good indicator that we should be looking at the output (Logstash/Elasticsearch/Kafka) for issues, not the Beat.
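That triage rule can be sketched as a tiny helper (hypothetical; `suspectOutput` and its delta parameters are illustrations, not real libbeat code): throughput dropping while output latency rises points at the output rather than the Beat.

```go
package main

import "fmt"

// suspectOutput encodes the reasoning above: if throughput fell
// (negative events/sec delta) while output write latency rose
// (positive latency delta), the likely culprit is the downstream
// output (Logstash/Elasticsearch/Kafka), not the Beat itself.
func suspectOutput(eventsPerSecDelta, latencyDelta float64) bool {
	return eventsPerSecDelta < 0 && latencyDelta > 0
}

func main() {
	fmt.Println(suspectOutput(-120, 35)) // throughput down, latency up: check the output
	fmt.Println(suspectOutput(-120, -5)) // both down: look elsewhere (inputs, the Beat)
}
```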
Alright, that makes sense. Tinkered with it a bit.
💚 Build Succeeded
Flaky test report: tests succeeded.
leehinman left a comment:
LGTM
* add metrics for output latency
* plz linter
* use milliseconds
* move around timekeeping statements
* move logstash latency window
closes #37258
This adds an output latency metric to the last-30s metrics, under `monitoring.metrics.libbeat.output.write.latency`. There are a few caveats with support for this:
This is currently measured in nanoseconds; although that seems like an excessive level of granularity, I discovered that on my Linux machine running from an M.2 SSD, writes with the `file` output are too fast for millisecond-level granularity. I'm also mildly annoyed with the resolution of how floats are printed.
Currently looking at a way to fix that, but it'll probably require some refactoring of the metrics logger.
Checklist
- `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.