-
Notifications
You must be signed in to change notification settings - Fork 3.4k
HBASE-28804: Implement asynchronous retrieval of bucket-cache data from persistence #6182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/CombinedBlockCache.java
Outdated
Show resolved
Hide resolved
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFilePreadReader.java
Outdated
Show resolved
Hide resolved
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
7853e54 to
8377eb9
Compare
| assertTrue(reader.prefetchStarted() || reader.prefetchComplete()); | ||
|
|
||
| // Added some delay as we have started the timer a bit late. | ||
| Thread.sleep(500); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this extra sleep after the loop above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was present earlier along with the comment at line 378.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try to remove this one and check if the test passes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but I assume now that we have an extra wait of 10000, we don't need this one anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
| while (!reader.prefetchStarted() && !reader.prefetchComplete()) { | ||
| // Wait until the prefetch is triggered. | ||
| Thread.sleep(500); | ||
| assertTrue("Prefetch should start post configured delay", | ||
| getElapsedTime(startTime) > PrefetchExecutor.getPrefetchDelay()); | ||
| if (timeout <= 0) break; | ||
| timeout -= 500; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use Waiter.waitFor(Configuration conf, long timeout, long interval, Predicate<E> predicate)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
| long timeout = 120000; | ||
| try { | ||
| while (!cache.backingMap.containsKey(cacheKey) || cache.ramCache.containsKey(cacheKey)) { | ||
| Thread.sleep(100); | ||
| if (timeout <= 0) { | ||
| break; | ||
| } | ||
| timeout -= 100; | ||
| } | ||
| } finally { | ||
| Thread.sleep(1000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use Waiter.waitFor(Configuration conf, long timeout, long interval, Predicate<E> predicate)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @wchevreuil , the config object is not available with us here. Probably I can leave out the change this. I had tried to change it to avoid infinite wait.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just create one HBaseConfiguration.create() and pass along? Waiter only uses it to check for the hbase.test.wait.for.ratio property anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
| long timeout = 120000; | ||
| while (!cache.backingMap.containsKey(cacheKey) || cache.ramCache.containsKey(cacheKey)) { | ||
| Thread.sleep(100); | ||
| if (timeout <= 0) { | ||
| break; | ||
| } | ||
| timeout -= 100; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use Waiter.waitFor(Configuration conf, long timeout, long interval, Predicate<E> predicate)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lot more changes needed to get config to this location. I will leave out this change for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add UTs for checking the behaviour of all methods changed as part of this async initialisation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could validate all public methods. I am not sure if we can do it for private methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Public methods are enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCache.java
Outdated
Show resolved
Hide resolved
hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/CacheTestUtils.java
Outdated
Show resolved
Hide resolved
hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestPrefetch.java
Outdated
Show resolved
Hide resolved
9ba3087 to
d79f091
Compare
wchevreuil
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, +1.
|
💔 -1 overall
This message was automatically generated. |
d79f091 to
52ee5e5
Compare
|
🎊 +1 overall
This message was automatically generated. |
52ee5e5 to
69295b8
Compare
|
💔 -1 overall
This message was automatically generated. |
69295b8 to
64b2a5e
Compare
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
retest |
…om persistence During the retrieval of data from bucket cache persistence file, a transient structure that stores the blocks ordered by filename is constructed from the backing map entries. The population of this transient structure is done during the server start-up. This process increases the region-server startup time, if the bucketcache has large number of blocks. This population happens inline with the server restart and blocks the server for several minutes. This makes the server restart inconvenient for the external users. Restarts during upgrade can run into timeout issues due to this delay in the server startup. Hence, the recommendation in this Jira is to make the cache-retrieval asynchronous to the server startup. During a server startup, a new thread is spawn that reads the persistence file and creates the required structures from persistence file. The server continues with the restart and does not wait for the bucket-cache initialisation to complete. Note that the bucket cache is not available immediately for usage and will only be ready to use after the data is repopulated from persistence into memory. The prefetch thread that may start before the bucket-cache is initialized is modified to wait until the bucket cache is initialized. Change-Id: I2c136d7b1d884f74642d29923172a1ad4ada36e4
Change-Id: I239c357a135a058650a20c5707c7c7303a248c85
Change-Id: Ie160b249f6a2bff18fd8a577ae32a263b7de25ea
…on state. Change-Id: Ib54f565152c391da727be7413dd775fb507daea7
Change-Id: I446475dc84d52403f67762b878af874b9cbe1937
Change-Id: I49d8e39cafff6cbe103f300235bec4aacd42934d
Change-Id: I8db2162c8237c05fd98d3c771f50e5a92c9b78b5
Change-Id: I391e5e453e077b7e5733cd997130467b356a3e66
Change-Id: Ieb6b5841cb7b23d919fbebbe420c0083167d6d0b
64b2a5e to
2c8438d
Compare
|
🎊 +1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
Hi @wchevreuil, I checked that the failing test is: `(base) janardhan.hungund@MacBook-Pro-4 hbase % git log
Hence, the failure does not seem to be related to this change. Thanks, |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
Hi @wchevreuil, The failing unit test, org.apache.hadoop.hbase.master.janitor.TestCatalogJanitor.testAlreadyRunningStatus does not seem to be related to this change. The local execution of the test passed on the same branch.
|
…om persistence (apache#6182) Signed-off-by: Wellington Chevreuil <[email protected]> Change-Id: I515d81fdc484cd49308d3164067a2287b8901520
…om persistence (apache#6182) Signed-off-by: Wellington Chevreuil <[email protected]> Change-Id: I515d81fdc484cd49308d3164067a2287b8901520
…om persistence (#6182) Signed-off-by: Wellington Chevreuil <[email protected]>
…om persistence (#6182) (#6219) Signed-off-by: Wellington Chevreuil <[email protected]>
… bucket-cache data from persistence (apache#6182) (apache#6219) Signed-off-by: Wellington Chevreuil <[email protected]>
… bucket-cache data from persistence (apache#6182) (apache#6219) Signed-off-by: Wellington Chevreuil <[email protected]>
…om persistence (apache#6182) (apache#6219) The cherry-pick of HBASE-29727 into branch-2.6 accidently brought most of HBASE-28804 into branch-2, this is to backport the remaining original change in HBASE-28804. Signed-off-by: Wellington Chevreuil <[email protected]> Change-Id: I816141ccebf91191aaa4cb6afa04343b64663eb0
…om persistence (apache#6182) (apache#6219) The cherry-pick of HBASE-29727 into branch-2.6 accidently brought most of HBASE-28804 into branch-2, this is to backport the remaining original change in HBASE-28804. Signed-off-by: Wellington Chevreuil <[email protected]> Change-Id: I816141ccebf91191aaa4cb6afa04343b64663eb0
…om persistence (apache#6182) (apache#6219) The cherry-pick of HBASE-29727 into branch-2.6 accidently brought most of HBASE-28804 into branch-2, this is to backport the remaining original change in HBASE-28804. Signed-off-by: Wellington Chevreuil <[email protected]> Change-Id: I816141ccebf91191aaa4cb6afa04343b64663eb0
During the retrieval of data from bucket cache persistence file, a transient structure that stores the blocks ordered by filename is constructed from the backing map entries. The population of this transient structure is done during the server start-up. This process increases the region-server startup time, if the bucketcache has large number of blocks.
This population happens inline with the server restart and blocks the server for several minutes. This makes the server restart inconvenient for the external users. Restarts during upgrade can run into timeout issues due to this delay in the server startup.
Hence, the recommendation in this Jira is to make the cache-retrieval asynchronous to the server startup. During a server startup, a new thread is spawn that reads the persistence file and creates the required structures from persistence file. The server continues with the restart and does not wait for the bucket-cache initialisation to complete.
Note that the bucket cache is not available immediately for usage and will only be ready to use after the data is repopulated from persistence into memory.
The prefetch thread that may start before the bucket-cache is initialized is modified to wait until the bucket cache is initialized.
Change-Id: I2c136d7b1d884f74642d29923172a1ad4ada36e4