HBASE-28385 make Scan estimates more realistic #5713
Conversation
💔 -1 overall
This message was automatically generated.
hbase-server/src/main/java/org/apache/hadoop/hbase/quotas/DefaultOperationQuota.java
🎊 +1 overall
This message was automatically generated.
I believe this test failure is unrelated.
We've removed the minimum wait interval logic. I'm going to open a separate Jira regarding tiny backoffs and how we can improve them.
Force-pushed c2578ef to a2e58b4
🤔 that's new. Just pushed a rebase
hbase-server/src/main/java/org/apache/hadoop/hbase/quotas/DefaultOperationQuota.java
// scan with a small maxBlockBytesScanned would not prefer to pull down
// a larger payload. So we should estimate with the assumption that long-running scans
// are appropriately configured to approach their maxScannerResultSize per RPC call
estimate =
I feel like we should only double if the scan reached the previous limit.
Would you mind elaborating a bit? I don't know exactly what you mean
So every time a next() comes in, we double the previous BBS as the new limit. Let's say someone is doing a long scan with a caching of 1: each call is doing only 1 block of IO, but we are still doubling the new estimate. It becomes harder to get a quota if we are always estimating 2x more than we really fetch. The doubling is good for reaching a plateau quickly, but then it should stop.
For the 1 block/1 caching case, it might only be a 64KB scan vs a 128KB estimate. Not a huge deal. But let's say we have a scan that's doing 25MB per next(). Setting the estimate to 50MB each time is quite a difference. Ideally, once the scan itself plateaus in BBS, that should be the limit going forward. So for the 25MB example, it does 64KB, 128KB, 256KB, etc. up to 25MB. On the next request the estimate is 50MB, but the resulting BBS is only 25MB. Ideally after that we just use 25MB as the estimate going forward.
So every time a next() comes in, we double the previous BBS as the new limit
Is this your proposal for what we should do, or your read on what the current estimation logic will do? If we're worried about overestimation, then I think this logic is actually worse than what you've described, because we don't double the previous BBS; rather, we multiply the max BBS by the next call sequence. So for the 1 block/1 caching case, we'd quickly arrive at a likely much higher estimate of Math.min(maxScannerResultSize, maxScanEstimate). This is more of a feature than a bug, though, because I think we'd prefer to overestimate against a saturated quota. For example, if we only manage to scan a few KB per next call because we're riding against the quota limit, but the scan would really prefer to pull down 100MB at a time, then without the aforementioned estimate progression we'd continue to use tiny estimations.
So, rather than risk falling into a cycle of quota saturation -> long-running scans with tiny estimates -> tons of RPC calls -> quota saturation ..., I think we'd rather make the assumption described in the comment here:
So we should estimate with the assumption that long-running scans
are appropriately configured to approach their maxScannerResultSize per RPC call
at the expense of potentially overestimating some scans. We also mitigate the risk of overestimation with the introduction of maxScanEstimate.
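As a rough illustration of the progression described above, here is a hypothetical sketch in Java. The method name, parameters, and the exact capping behavior are assumptions inferred from this thread, not the actual DefaultOperationQuota code:

```java
// Hypothetical sketch: grow the estimate by multiplying the max observed
// blockBytesScanned by the next-call sequence number, capped by both
// maxScannerResultSize and a configurable maxScanEstimate.
public class ScanEstimateSketch {
  static long estimate(long maxBlockBytesScanned, long nextCallSeq,
      long maxScannerResultSize, long maxScanEstimate) {
    if (nextCallSeq == 0) {
      return 0; // first call: no history yet, so estimate nothing
    }
    long progressive = maxBlockBytesScanned * nextCallSeq;
    return Math.min(progressive, Math.min(maxScannerResultSize, maxScanEstimate));
  }

  public static void main(String[] args) {
    long kb = 1024, mb = 1024 * kb;
    // a scan observing 64KB blocks ramps quickly toward the cap
    System.out.println(estimate(64 * kb, 1, 100 * mb, 16 * mb));    // 65536
    System.out.println(estimate(64 * kb, 4, 100 * mb, 16 * mb));    // 262144
    System.out.println(estimate(64 * kb, 1000, 100 * mb, 16 * mb)); // 16777216 (capped)
  }
}
```

Note how the estimate keeps climbing with the call sequence even if the scan's actual IO has leveled off, which is exactly the overestimation being debated here.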
I was describing what I thought the current implementation does. It seems like every new scan will increase the estimate until we hit the server-configured max. But I'd like to stop increasing when we stop exceeding the old limit. Imagine the following 4 next calls:
Estimate = 64k; Scanned = 64k
Estimate = 128k; Scanned = 128k
Estimate = 256k; Scanned = 128k
Estimate = 384k; Scanned = 128k
So the actual scanned amount has leveled out, but the estimate keeps increasing. Ideally it'd be like this instead:
Estimate = 64k; Scanned = 64k
Estimate = 128k; Scanned = 128k
Estimate = 256k; Scanned = 128k
Estimate = 128k; Scanned = 128k <- we didn't hit our last estimate, so stop overestimating.
Or something else that doesn't keep increasing. Of course, the impact here is not huge yet, but for larger values it seems to matter a bit more. Unless I'm missing something.
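The suggestion above could be sketched roughly as follows. This is a hypothetical illustration of the "stop growing once the scan misses the estimate" rule, not the PR's implementation; the doubling rule and the 64KB starting point are assumptions:

```java
// Hypothetical sketch: keep doubling the estimate only while the scan actually
// reaches it; once the scanned amount levels out, reuse the observed plateau
// instead of continuing to overestimate.
public class PlateauEstimateSketch {
  private long lastEstimate = 64 * 1024; // illustrative starting point

  long nextEstimate(long lastScanned) {
    if (lastScanned >= lastEstimate) {
      lastEstimate = lastScanned * 2; // still hitting the limit: keep doubling
    } else {
      lastEstimate = lastScanned; // plateaued: trust the observed amount
    }
    return lastEstimate;
  }

  public static void main(String[] args) {
    PlateauEstimateSketch s = new PlateauEstimateSketch();
    // replaying the sequence from the comment above
    System.out.println(s.nextEstimate(64 * 1024));  // 131072 (128k)
    System.out.println(s.nextEstimate(128 * 1024)); // 262144 (256k)
    System.out.println(s.nextEstimate(128 * 1024)); // 131072 (back to 128k)
  }
}
```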
Force-pushed 9b9e165 to 0274500
I believe the two test failures on the latest build are unrelated.
…ed (#5713) Signed-off-by: Bryan Beaudreault <[email protected]>
Background
https://issues.apache.org/jira/browse/HBASE-28385
Let's say you're running a table scan with a throttle of 100MB/sec per RegionServer. Ideally your scans are going to pull down large results, often containing hundreds or thousands of blocks.
The server will estimate each scan as costing a single block of read capacity, and if your quota is already exhausted then it will evaluate the backoff required for your estimated consumption (1 block) to become available. This will often be ~1ms, causing your retries to be basically immediate.
Obviously it will routinely take much longer than 1ms for 100MB of IO to become available in the given configuration, so your retries are destined to fail. At worst this can saturate your server's RPC layer, and at best it causes erroneous exhaustion of the client's retries.
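The mismatch above can be shown with simple arithmetic. This is an illustrative sketch, not HBase code; the 64KB block size and the linear refill model are assumptions:

```java
// Illustrative arithmetic: the wait needed for an estimate to become available
// under a throttle is roughly estimate / refillRate. A one-block estimate
// produces a near-zero backoff even when the scan really needs 100MB of IO.
public class BackoffSketch {
  static long waitMs(long estimateBytes, long refillBytesPerSec) {
    return estimateBytes * 1000 / refillBytesPerSec;
  }

  public static void main(String[] args) {
    long refillPerSec = 100L * 1024 * 1024; // 100MB/sec read throttle
    long oneBlock = 64 * 1024;              // the naive single-block estimate
    long realScan = 100L * 1024 * 1024;     // what the scan will actually read
    System.out.println(waitMs(oneBlock, refillPerSec)); // 0 (ms): retry is immediate
    System.out.println(waitMs(realScan, refillPerSec)); // 1000 (ms): a meaningful backoff
  }
}
```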
Proposal
This makes two major changes:
- We track maxBlockBytesScanned, prevBlockBytesScanned, and prevBlockBytesScannedDiff for each scanner in the RegionScannerHolder.
- We use these fields to build a more realistic estimate for each Scanner#next call, beginning with 0.
Testing
I've deployed this to a test cluster and confirmed that large scans quickly produced useful estimates and, consequently, meaningful waitInterval backoffs. I can also try to write a unit test which confirms this behavior.
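Returning to the proposal's tracking fields, a minimal sketch of the bookkeeping might look like this. The class name, field semantics, and update logic are assumptions for illustration, not the actual RegionScannerHolder code:

```java
// Hypothetical sketch of per-scanner tracking: after each Scanner#next RPC we
// record the max bytes ever scanned, the previous call's bytes, and the delta
// between consecutive calls.
public class ScannerStatsSketch {
  long maxBlockBytesScanned = 0;
  long prevBlockBytesScanned = 0;
  long prevBlockBytesScannedDiff = 0;

  // called after each Scanner#next call with the bytes scanned by that call
  void update(long blockBytesScanned) {
    maxBlockBytesScanned = Math.max(maxBlockBytesScanned, blockBytesScanned);
    prevBlockBytesScannedDiff = blockBytesScanned - prevBlockBytesScanned;
    prevBlockBytesScanned = blockBytesScanned;
  }

  public static void main(String[] args) {
    ScannerStatsSketch stats = new ScannerStatsSketch();
    stats.update(64 * 1024);  // first next(): one 64KB block
    stats.update(128 * 1024); // second next(): scan ramps up
    System.out.println(stats.maxBlockBytesScanned);     // 131072
    System.out.println(stats.prevBlockBytesScannedDiff); // 65536
  }
}
```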
cc @hgromer @eab148 @bozzkar