HBASE-29292 Revise TestRecreateCluster #6981
Is this still a problem? The meta location is now stored in the master region, not in ZooKeeper, so on startup we can load the location from the master region, find out that the region server is gone, and schedule an SCP for it.
And in HBASE-26245, we also store the region server list in the master local region, so even if the WAL directories are gone, we can still find the region server list. So what are we trying to fix here?
@Apache9 Hi Duo, thanks for your comment. Let me try to explain below why this patch is still needed; please see if that makes sense.
Are you saying that we need to use the hbck tool to schedule an SCP for it? Or did I miss new logic in HMaster that automatically schedules SCPs for unknown servers during the startup routine? Based on the tests added in this PR, specifically This PR aligns with the suggestion made by @petersomogyi in this comment: we could introduce an optional feature flag to automatically schedule SCPs when the configuration knob Here the new test
That patch solves the region server list problem, but those servers are unknown servers: the cluster hangs and does not move further until an SCP is scheduled manually via HBCK. I also changed the title of this JIRA to make the goal clearer: fewer manual operations where possible, through an opt-in feature that is compatible with the existing logic. This also differs from the original attempt in PR #2113 in that we do not remove or delete the meta table; we only schedule SCPs for those unknown servers.
Yeah, IIRC, we only trigger an SCP for a given server if we find a WAL for that exact server.
Definitely useful to have such an auto-recovery feature.
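To make the rule in the comment above concrete, here is a toy model in plain Java. It is not HBase code: the class name, method names, and server names are hypothetical, and it only encodes the rule as recalled in the comment (an SCP is scheduled for a server only when a WAL directory for that exact server is found). It deliberately ignores the region server list kept in the master local region.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy model only; names are illustrative and not taken from HBase.
public class UnknownServerSketch {

    /** Dead servers that would get an SCP: a WAL directory exists and the server is not live. */
    static Set<String> scpCandidates(Set<String> walDirs, Set<String> live) {
        Set<String> result = new TreeSet<>(walDirs);
        result.removeAll(live);
        return result;
    }

    /** Servers still referenced by region assignments that are neither live nor covered by an SCP. */
    static Set<String> unknownServers(Set<String> referenced, Set<String> walDirs, Set<String> live) {
        Set<String> result = new TreeSet<>(referenced);
        result.removeAll(live);
        result.removeAll(walDirs);
        return result;
    }

    public static void main(String[] args) {
        Set<String> referenced = Set.of("rs1", "rs2"); // servers named in old assignments
        Set<String> live = Set.of("rs3");              // servers of the recreated cluster
        Set<String> walDirs = Set.of();                // WAL root was deleted, so no WAL dirs remain

        System.out.println("SCP candidates: " + scpCandidates(walDirs, live));
        System.out.println("Unknown servers: " + unknownServers(referenced, walDirs, live));
    }
}
```

Under this simplified rule, deleting the WAL root leaves every previously assigned server without an SCP, which is the situation the opt-in flag was meant to address.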
I checked the test; the problem is in the test itself. Before restarting the cluster, it does not flush the master, and when cleaning up WALs, it deletes both the MasterRegion's store dir and WAL dir, not only the WAL dir... The correct way to simulate what we actually do on cloud is to use a different rootDir and walRootDir, and delete the walRootDir and zk data, right? Just modify the code like this You can find out that all the 4 tests can pass...
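The difference between the two cleanup strategies can be sketched with the local filesystem alone. This is only an illustration, not the actual change to TestRecreateCluster: the class name, the temporary directory layout, and the file names (`hfile-0`, `wal-0`) are assumptions, and a real test would run against the mini cluster's filesystem and flush the master region before shutdown.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative sketch; directory and file names are assumptions, not HBase's real layout.
public class WalRootCleanupSketch {

    /** Recursively deletes a directory tree, children before parents. */
    static void deleteTree(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    /**
     * Creates separate root and WAL root directories, deletes only the WAL root,
     * and reports whether the flushed master store file is still present.
     */
    static boolean masterStoreSurvivesWalCleanup() {
        try {
            Path base = Files.createTempDirectory("recreate-cluster-sketch");
            try {
                Path rootDir = base.resolve("root");
                Path walRootDir = base.resolve("wal-root");
                Path flushedStoreFile = rootDir.resolve("MasterData/data/hfile-0");
                Path masterWal = walRootDir.resolve("MasterData/WALs/wal-0");
                Files.createDirectories(flushedStoreFile.getParent());
                Files.createDirectories(masterWal.getParent());
                Files.createFile(flushedStoreFile);
                Files.createFile(masterWal);

                // Simulate losing only the WAL storage; the root directory is untouched.
                deleteTree(walRootDir);

                return Files.exists(flushedStoreFile) && !Files.exists(walRootDir);
            } finally {
                deleteTree(base);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("master store survives: " + masterStoreSurvivesWalCleanup());
    }
}
```

Keeping the two roots separate means the cleanup step cannot touch the flushed master store by accident, which is what the original test did when both lived under one directory.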
I think the only missing thing here is that we need to expose an admin API to flush the master region.
Ah good, we do have an admin API for it, flushMasterStore, which was added in HBASE-27028.
Co-authored-by: Josh Elser <[email protected]> Co-authored-by: Sergey Soldatov <[email protected]>
6de1b54 to a150e5e
Thanks Duo for your time. You're right, and thanks for pointing out that the test had a logic issue: it cleaned up the master data in the rootDir. After fixing that, the tests pass with the right simulation, which uses a fresh walRootDir and fresh zk data. (We're not trying to solve the unflushed master region problem; we expect the cluster to shut down normally and the master region to be flushed as HFiles to the root directory.) This simplifies the scope of this JIRA, and I changed it to
@Apache9 One more question: how unlikely is it that the master region would not be flushed to MASTER_STORE_DIR automatically (if we don't call the Without my original proposal to handle unknown servers, the assumption is that the cluster is shut down normally, so that the MasterRegion is not empty and has no gaps from an unflushed list of region servers. (I added a new boolean flag to delete the Would you think that in this situation my original proposal to handle unknown servers would help? I revisited the code, and it's not that likely, because every
🎊 +1 overall
This message was automatically generated. |
If the MasterRegion loses data, we could get UnknownServers or other critical problems. But I never suggest adding the fix logic to the normal code path; we should add it to hbck or other operational tools, and users need to use it with caution, as it may cause new damage to the system...
Signed-off-by: Wellington Chevreuil <[email protected]> (cherry picked from commit 902067d)
TestRecreateCluster has a logic error and would clean up the master region in the root directory. This is not a case we want to support on cloud storage, where we expect the HBase cluster to shut down gracefully and the master region to be flushed to permanent storage as HFiles.