Skip to content

Conversation

@taklwu
Copy link
Contributor

@taklwu taklwu commented May 9, 2025

TestRecreateCluster has a logic error and and would cleanup the master region in the root directory, this is not the right case we would like to support in the cloud storage that we expect the hbase cluster to shutdown gracefully and master region to be flushed to permanent storage as HFiles.

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@Apache9
Copy link
Contributor

Apache9 commented May 10, 2025

Is this still a problem? Now we will store meta location in master region, not zookeeper, so when starting up, we can load the location in master region and find out that the region server is gone and schedule a SCP for it.

@Apache9
Copy link
Contributor

Apache9 commented May 10, 2025

And in HBASE-26245, we will also store the region server list in master local region, so even if the WAL Directories are gone, we could still find the region server list.

So what are we trying to fix here?

@taklwu
Copy link
Contributor Author

taklwu commented May 12, 2025

@Apache9 Hi Duo, thanks for your comment, and let me try to explain below why this patch is still needed. please see if that makes senses.

we can load the location in master region and find out that the region server is gone and schedule a SCP for it

are you referring we need to use hbck tool to schedule SCP for it? or did I miss anything about new logic in HMaster that would schedule SCP for unknown servers automatically in the startup routine?

Based on the tests added in this PR, specifically TestRecreateCluster#testRecreateCluster_UserTableEnabled_CleanupWALAndZNodes , we've confirmed that the current master branch still does not automatically schedule SCPs for unknown servers. However, manually triggering SCPs via hbck does resolve the issue.

This PR aligns with the suggestion made by @petersomogyi in this comment: we could introduce an optional feature flag to automatically schedule SCPs when the configuration knob hbase.master.assign.regions.on.unknown.servers is enabled.

Here the new test testRecreateCluster_UserTableEnabled_CleanupWALAndZNodes_WithRecoverUnknownServer confirmed that only when we enable this feature, the master and cluster could start without any further manually operations.

And in HBASE-26245, we will also store the region server list in master local region, so even if the WAL Directories are gone, we could still find the region server list.

that patch solves the problem of region server list, but those are unknown servers, the cluster would be hanging and till a manual SCP via HBCK, it won't move further.

I also changed the title of this JIRA to make it more clear


the goal to have less manual operations if possible with a opt-in feature (that would be compatible with existing logic), also, this is different from the original attempt in PR #2113 that we don't remove/delete meta table, and only schedule SCP those unknown servers.

@taklwu taklwu requested a review from Apache9 May 12, 2025 20:20
@taklwu taklwu changed the title HBASE-29292 Automatically reassign regions when WALs and ZK go missing HBASE-29292 Optional feature to automatically schedule SCP when regions are on unknown servers during master startup May 12, 2025
@wchevreuil
Copy link
Contributor

wchevreuil commented May 13, 2025

Here the new test testRecreateCluster_UserTableEnabled_CleanupWALAndZNodes_WithRecoverUnknownServer confirmed that only when we enable this feature, the master and cluster could start without any further manually operations.

Yeah, IIRC, we would only trigger an SCP for a given server if we find a WAL for that exact server.

the goal to have less manual operations if possible with a opt-in feature (that would be compatible with existing logic),

Definitely useful to have such auto recovery feature.

@Apache9
Copy link
Contributor

Apache9 commented May 13, 2025

I checked the test, the problem is because of the test itself. Before restarting the cluster, it does not flush the master, and when cleaning up WALs, it deletes both MasterRegion's store dir and wal dir, not only the wal dir...

The correct way to simulate what we actually do on cloud, is to use a differtent rootDir and walRootDir, and delete the walRootDir and zk data, right? Just modify the code like this

Apache9@b430565

You can find out that all the 4 tests can pass...

@Apache9
Copy link
Contributor

Apache9 commented May 13, 2025

I think the only missing thing here is that, we need to expose an admin API to flush the master region.

@Apache9
Copy link
Contributor

Apache9 commented May 13, 2025

Ah good, we do have an admin API for it, flushMasterStore...

Which was added in HBASE-27028.

Co-authored-by: Josh Elser <[email protected]>
Co-authored-by: Sergey Soldatov <[email protected]>
@taklwu taklwu force-pushed the HBASE-29292-master branch from 6de1b54 to a150e5e Compare May 13, 2025 19:07
@taklwu taklwu changed the title HBASE-29292 Optional feature to automatically schedule SCP when regions are on unknown servers during master startup HBASE-29292 Revise TestRecreateCluster May 13, 2025
@taklwu
Copy link
Contributor Author

taklwu commented May 13, 2025

Before restarting the cluster, it does not flush the master, and when cleaning up WALs, it deletes both MasterRegion's store dir and wal dir, not only the wal dir...
The correct way to simulate what we actually do on cloud, is to use a differtent rootDir and walRootDir, and delete the walRootDir and zk data, right?

Thanks Duo for your time, you're right, and thanks for pointing out that the test has a logic issue that cleanup the master data in the rootDir, after updating it, the tests pass with the right simulation that using fresh walRootDir and zk data. (We're not trying to solve the unflushed master region problem, and we expected cluster shutdown normally and master region also flushed as HFiles to the root directory.)

this simplify the scope of this jira and I change it to Revise TestRecreateCluster

@taklwu taklwu requested a review from wchevreuil May 13, 2025 19:12
@taklwu
Copy link
Contributor Author

taklwu commented May 13, 2025

@Apache9 asking one more question, how unlikely the master region would not be flushed to MASTER_STORE_DIR automatically (if we don't call the flushMasterStore via shell)?

Without my original proposal to handle unknown servers, the assumption is that cluster should be shutdown normally such that MasterRegion should not be empty or has gaps for unflushed list of region servers. (I added a new boolean flag to delete the MASTER_STORE_DIR on rootDir to simulate if the master data are empty)

would you think in this situation, my original proposal to handle unknown servers would help ? I revisit the code, and it's not that likely, because every getMaster or MasterCallable would trigger the flush to the master region, just if things happened, we would need to run SCP via hbck

@Apache-HBase

This comment has been minimized.

@Apache-HBase

This comment has been minimized.

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 29s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
_ master Compile Tests _
+1 💚 mvninstall 3m 6s master passed
+1 💚 compile 3m 5s master passed
+1 💚 checkstyle 0m 36s master passed
+1 💚 spotbugs 1m 34s master passed
+1 💚 spotless 0m 44s branch has no errors when running spotless:check.
_ Patch Compile Tests _
+1 💚 mvninstall 3m 5s the patch passed
+1 💚 compile 3m 9s the patch passed
+1 💚 javac 3m 9s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 36s the patch passed
+1 💚 spotbugs 1m 39s the patch passed
+1 💚 hadoopcheck 11m 52s Patch does not cause any errors with Hadoop 3.3.6 3.4.0.
+1 💚 spotless 0m 44s patch has no errors when running spotless:check.
_ Other Tests _
+1 💚 asflicense 0m 9s The patch does not generate ASF License warnings.
38m 22s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6981/3/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #6981
Optional Tests dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless
uname Linux ed24c77cd8be 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / c220764
Default Java Eclipse Adoptium-17.0.11+9
Max. process+thread count 84 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6981/3/console
versions git=2.34.1 maven=3.9.8 spotbugs=4.7.3
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase
Copy link

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 30s Docker mode activated.
-0 ⚠️ yetus 0m 3s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
_ Prechecks _
_ master Compile Tests _
+1 💚 mvninstall 3m 25s master passed
+1 💚 compile 0m 58s master passed
+1 💚 javadoc 0m 28s master passed
+1 💚 shadedjars 6m 1s branch has no errors when building our shaded downstream artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 3m 5s the patch passed
+1 💚 compile 0m 57s the patch passed
+1 💚 javac 0m 57s the patch passed
+1 💚 javadoc 0m 27s the patch passed
+1 💚 shadedjars 6m 1s patch has no errors when building our shaded downstream artifacts.
_ Other Tests _
+1 💚 unit 211m 34s hbase-server in the patch passed.
237m 24s
Subsystem Report/Notes
Docker ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6981/3/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
GITHUB PR #6981
Optional Tests javac javadoc unit compile shadedjars
uname Linux 6cba84f2e2c6 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision master / c220764
Default Java Eclipse Adoptium-17.0.11+9
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6981/3/testReport/
Max. process+thread count 4988 (vs. ulimit of 30000)
modules C: hbase-server U: hbase-server
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6981/3/console
versions git=2.34.1 maven=3.9.8
Powered by Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

@Apache9
Copy link
Contributor

Apache9 commented May 15, 2025

@Apache9 asking one more question, how unlikely the master region would not be flushed to MASTER_STORE_DIR automatically (if we don't call the flushMasterStore via shell)?

Without my original proposal to handle unknown servers, the assumption is that cluster should be shutdown normally such that MasterRegion should not be empty or has gaps for unflushed list of region servers. (I added a new boolean flag to delete the MASTER_STORE_DIR on rootDir to simulate if the master data are empty)

would you think in this situation, my original proposal to handle unknown servers would help ? I revisit the code, and it's not that likely, because every getMaster or MasterCallable would trigger the flush to the master region, just if things happened, we would need to run SCP via hbck

If MasterRegion losses data, we could get UnknownServers or other critical problems. But for me, I always do not suggest adding the fix logic in normal code path, we should add it in hbck or other operation tools, and users need to use it with caution, as it may cause new damages to the system...

@taklwu taklwu merged commit 902067d into apache:master May 15, 2025
1 check passed
@taklwu taklwu deleted the HBASE-29292-master branch May 15, 2025 17:42
taklwu added a commit that referenced this pull request May 15, 2025
Signed-off-by: Wellington Chevreuil <[email protected]>

(cherry picked from commit 902067d)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants