-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Backport "HBASE-27155 Improvements to low level scanner tracing" to branch-2.5 #4601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Backport "HBASE-27155 Improvements to low level scanner tracing" to branch-2.5 #4601
Conversation
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
@ndimiduk It does make sense to me that we would decorate the spans with events, and not try to maintain counters, as did the original patch. I only did that because aggregate counts at the end of the RPC were sufficient for my needs then. If I had needed a log of events I would have had to do something more like this. It is nice to have a timeline of events recorded into the span(s) instead, in typical form for otel.
Agreed, converting events into metrics by logging and then counting occurrences over the logs is quite typical here. The events and the times at which they occur are the raw data that can serve several different approaches to analyzing the timeline. I would approve this draft fwiw |
|
Okay, after too long of a delay, I have collected some data in which I have some amount confidence. The raw data and summaries are in this Google Sheet for your examination. The charts are pasted here for your reference. Test MethodologyI tested three different builds: a baseline (ecf758b), that baseline (ecf758b) + HBASE-27153, and that baseline (ecf758b) + HBASE-27153 + HBASE-27155. In all cases, the test is run with tracing disabled -- all that's measured here is the impact of the code changes made to facilitate manual instrumentation by each patch. The test run was a YCSB workload that I happened to have handy, with 20% writes and 80% random reads. The test runs for a little over 30 minutes. The data collected is the total test runtime, and the read latencies reported by YCSB at p95 and p99. The test methodology was to first prepare a dataset by populating a pre-split table using the YCSB load feature, flush and major compact the table, snapshot the table. Each test iteration involved dropping the table, cloning the snapshot back into place, and then applying the test workload. I kept an eye on cluster metrics as things ran. Client and server generally agree on number of requests served/sec. Each test, it took about 20 minutes for the block cache hit rate to climb from a starting point of 50% to the steady state of around 70%. I made an effort to exclude as much as possible the impact of compactions on the test results -- ASYNC_WAL was used, and dropping the table after each test run dropped the pending compaction work that accumulated via the write portion of the workload. My Interpretation of resultsThe changes introduced with HBASE-27153 appear to have an overall positive impact on read throughput and latency, although in latency in particular, there is a disturbingly large amount of internal variance. HBASE-27155 appears to undo all of that improvement and then do additional harm. It is my opinion that the regression of 2ms at p95 and 1ms at p99 is too expensive to accept for the inclusion of HBASE-27155. Next StepsWe can conclude analysis here and decide to commit one or both patches. Or, we can attempt further analysis. The next analysis step I would suggest is collection and comparison of flame graphs at several points during the test period. Please advise. |
|
Thanks for doing this analysis @ndimiduk . I will note this on HBASE-27155 and resolve that issue. So I would propose the next steps here are:
|
|
I resolved HBASE-27155 as WontFix so will close this PR too. All please feel free to reopen/unresolve if you'd like to pursue this at some later time. |
|
@apurtell Just for my own curiosity, and because I already have the test rig and I earlier did the work of forward porting your counter-based patch, I'm running it through the same cycles. And since I'm here, I'll also run |
|
@apurtell Results from the counter-based patch and |



This branch starts with #4574 and then adds a translation of ndimiduk/hbase@93a6fed into otel span events.
How I think you might use this, @apurtell , is to set up tracing with the Logging Exporter, run your test bench, and then parse the logs to summon them into metrics.