HBASE-29292 Revise TestRecreateCluster #6981

taklwu · 2025-05-09T18:10:56Z

TestRecreateCluster has a logic error and and would cleanup the master region in the root directory, this is not the right case we would like to support in the cloud storage that we expect the hbase cluster to shutdown gracefully and master region to be flushed to permanent storage as HFiles.

Apache9 · 2025-05-10T07:34:09Z

Is this still a problem? Now we will store meta location in master region, not zookeeper, so when starting up, we can load the location in master region and find out that the region server is gone and schedule a SCP for it.

Apache9 · 2025-05-10T07:38:40Z

And in HBASE-26245, we will also store the region server list in master local region, so even if the WAL Directories are gone, we could still find the region server list.

So what are we trying to fix here?

taklwu · 2025-05-12T20:17:36Z

@Apache9 Hi Duo, thanks for your comment, and let me try to explain below why this patch is still needed. please see if that makes senses.

we can load the location in master region and find out that the region server is gone and schedule a SCP for it

are you referring we need to use hbck tool to schedule SCP for it? or did I miss anything about new logic in HMaster that would schedule SCP for unknown servers automatically in the startup routine?

Based on the tests added in this PR, specifically TestRecreateCluster#testRecreateCluster_UserTableEnabled_CleanupWALAndZNodes , we've confirmed that the current master branch still does not automatically schedule SCPs for unknown servers. However, manually triggering SCPs via hbck does resolve the issue.

This PR aligns with the suggestion made by @petersomogyi in this comment: we could introduce an optional feature flag to automatically schedule SCPs when the configuration knob hbase.master.assign.regions.on.unknown.servers is enabled.

Here the new test testRecreateCluster_UserTableEnabled_CleanupWALAndZNodes_WithRecoverUnknownServer confirmed that only when we enable this feature, the master and cluster could start without any further manually operations.

And in HBASE-26245, we will also store the region server list in master local region, so even if the WAL Directories are gone, we could still find the region server list.

that patch solves the problem of region server list, but those are unknown servers, the cluster would be hanging and till a manual SCP via HBCK, it won't move further.

I also changed the title of this JIRA to make it more clear

the goal to have less manual operations if possible with a opt-in feature (that would be compatible with existing logic), also, this is different from the original attempt in PR #2113 that we don't remove/delete meta table, and only schedule SCP those unknown servers.

wchevreuil · 2025-05-13T09:49:17Z

Here the new test testRecreateCluster_UserTableEnabled_CleanupWALAndZNodes_WithRecoverUnknownServer confirmed that only when we enable this feature, the master and cluster could start without any further manually operations.

Yeah, IIRC, we would only trigger an SCP for a given server if we find a WAL for that exact server.

the goal to have less manual operations if possible with a opt-in feature (that would be compatible with existing logic),

Definitely useful to have such auto recovery feature.

Apache9 · 2025-05-13T11:32:51Z

I checked the test, the problem is because of the test itself. Before restarting the cluster, it does not flush the master, and when cleaning up WALs, it deletes both MasterRegion's store dir and wal dir, not only the wal dir...

The correct way to simulate what we actually do on cloud, is to use a differtent rootDir and walRootDir, and delete the walRootDir and zk data, right? Just modify the code like this

Apache9@b430565

You can find out that all the 4 tests can pass...

Apache9 · 2025-05-13T11:33:39Z

I think the only missing thing here is that, we need to expose an admin API to flush the master region.

Apache9 · 2025-05-13T11:35:46Z

Ah good, we do have an admin API for it, flushMasterStore...

Which was added in HBASE-27028.

Co-authored-by: Josh Elser <[email protected]> Co-authored-by: Sergey Soldatov <[email protected]>

taklwu · 2025-05-13T19:10:38Z

Before restarting the cluster, it does not flush the master, and when cleaning up WALs, it deletes both MasterRegion's store dir and wal dir, not only the wal dir...
The correct way to simulate what we actually do on cloud, is to use a differtent rootDir and walRootDir, and delete the walRootDir and zk data, right?

Thanks Duo for your time, you're right, and thanks for pointing out that the test has a logic issue that cleanup the master data in the rootDir, after updating it, the tests pass with the right simulation that using fresh walRootDir and zk data. (We're not trying to solve the unflushed master region problem, and we expected cluster shutdown normally and master region also flushed as HFiles to the root directory.)

this simplify the scope of this jira and I change it to Revise TestRecreateCluster

taklwu · 2025-05-13T20:16:18Z

@Apache9 asking one more question, how unlikely the master region would not be flushed to MASTER_STORE_DIR automatically (if we don't call the flushMasterStore via shell)?

Without my original proposal to handle unknown servers, the assumption is that cluster should be shutdown normally such that MasterRegion should not be empty or has gaps for unflushed list of region servers. (I added a new boolean flag to delete the MASTER_STORE_DIR on rootDir to simulate if the master data are empty)

would you think in this situation, my original proposal to handle unknown servers would help ? I revisit the code, and it's not that likely, because every getMaster or MasterCallable would trigger the flush to the master region, just if things happened, we would need to run SCP via hbck

Apache-HBase · 2025-05-14T01:29:17Z

🎊 +1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 29s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 0s		codespell was not available.
+0 🆗	detsecrets	0m 0s		detect-secrets was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
+1 💚	hbaseanti	0m 0s		Patch does not have any anti-patterns.
			_ master Compile Tests _
+1 💚	mvninstall	3m 6s		master passed
+1 💚	compile	3m 5s		master passed
+1 💚	checkstyle	0m 36s		master passed
+1 💚	spotbugs	1m 34s		master passed
+1 💚	spotless	0m 44s		branch has no errors when running spotless:check.
			_ Patch Compile Tests _
+1 💚	mvninstall	3m 5s		the patch passed
+1 💚	compile	3m 9s		the patch passed
+1 💚	javac	3m 9s		the patch passed
+1 💚	blanks	0m 0s		The patch has no blanks issues.
+1 💚	checkstyle	0m 36s		the patch passed
+1 💚	spotbugs	1m 39s		the patch passed
+1 💚	hadoopcheck	11m 52s		Patch does not cause any errors with Hadoop 3.3.6 3.4.0.
+1 💚	spotless	0m 44s		patch has no errors when running spotless:check.
			_ Other Tests _
+1 💚	asflicense	0m 9s		The patch does not generate ASF License warnings.
		38m 22s

Subsystem	Report/Notes
Docker	ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6981/3/artifact/yetus-general-check/output/Dockerfile
GITHUB PR	#6981
Optional Tests	dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless
uname	Linux ed24c77cd8be 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/hbase-personality.sh
git revision	master / `c220764`
Default Java	Eclipse Adoptium-17.0.11+9
Max. process+thread count	84 (vs. ulimit of 30000)
modules	C: hbase-server U: hbase-server
Console output	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6981/3/console
versions	git=2.34.1 maven=3.9.8 spotbugs=4.7.3
Powered by	Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

Apache-HBase · 2025-05-14T04:48:17Z

🎊 +1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 30s		Docker mode activated.
-0 ⚠️	yetus	0m 3s		Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck
			_ Prechecks _
			_ master Compile Tests _
+1 💚	mvninstall	3m 25s		master passed
+1 💚	compile	0m 58s		master passed
+1 💚	javadoc	0m 28s		master passed
+1 💚	shadedjars	6m 1s		branch has no errors when building our shaded downstream artifacts.
			_ Patch Compile Tests _
+1 💚	mvninstall	3m 5s		the patch passed
+1 💚	compile	0m 57s		the patch passed
+1 💚	javac	0m 57s		the patch passed
+1 💚	javadoc	0m 27s		the patch passed
+1 💚	shadedjars	6m 1s		patch has no errors when building our shaded downstream artifacts.
			_ Other Tests _
+1 💚	unit	211m 34s		hbase-server in the patch passed.
		237m 24s

Subsystem	Report/Notes
Docker	ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6981/3/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile
GITHUB PR	#6981
Optional Tests	javac javadoc unit compile shadedjars
uname	Linux 6cba84f2e2c6 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/hbase-personality.sh
git revision	master / `c220764`
Default Java	Eclipse Adoptium-17.0.11+9
Test Results	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6981/3/testReport/
Max. process+thread count	4988 (vs. ulimit of 30000)
modules	C: hbase-server U: hbase-server
Console output	https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-6981/3/console
versions	git=2.34.1 maven=3.9.8
Powered by	Apache Yetus 0.15.0 https://yetus.apache.org

This message was automatically generated.

Apache9 · 2025-05-15T14:28:53Z

@Apache9 asking one more question, how unlikely the master region would not be flushed to MASTER_STORE_DIR automatically (if we don't call the flushMasterStore via shell)?

Without my original proposal to handle unknown servers, the assumption is that cluster should be shutdown normally such that MasterRegion should not be empty or has gaps for unflushed list of region servers. (I added a new boolean flag to delete the MASTER_STORE_DIR on rootDir to simulate if the master data are empty)

would you think in this situation, my original proposal to handle unknown servers would help ? I revisit the code, and it's not that likely, because every getMaster or MasterCallable would trigger the flush to the master region, just if things happened, we would need to run SCP via hbck

If MasterRegion losses data, we could get UnknownServers or other critical problems. But for me, I always do not suggest adding the fix logic in normal code path, we should add it in hbck or other operation tools, and users need to use it with caution, as it may cause new damages to the system...

Signed-off-by: Wellington Chevreuil <[email protected]> (cherry picked from commit 902067d)

taklwu mentioned this pull request May 9, 2025

HBASE-29292 Automatically reassign regions when WALs and ZK go missing #6968

Closed

This comment has been minimized.

Sign in to view

taklwu requested review from petersomogyi, stoty and wchevreuil May 9, 2025 23:35

taklwu requested a review from Apache9 May 12, 2025 20:20

taklwu changed the title ~~HBASE-29292 Automatically reassign regions when WALs and ZK go missing~~ HBASE-29292 Optional feature to automatically schedule SCP when regions are on unknown servers during master startup May 12, 2025

wchevreuil approved these changes May 13, 2025

View reviewed changes

HBASE-29292 Revise TestRecreateCluster

a150e5e

Co-authored-by: Josh Elser <[email protected]> Co-authored-by: Sergey Soldatov <[email protected]>

taklwu force-pushed the HBASE-29292-master branch from 6de1b54 to a150e5e Compare May 13, 2025 19:07

taklwu changed the title ~~HBASE-29292 Optional feature to automatically schedule SCP when regions are on unknown servers during master startup~~ HBASE-29292 Revise TestRecreateCluster May 13, 2025

taklwu requested a review from wchevreuil May 13, 2025 19:12

spotless fix

c220764

This comment has been minimized.

Sign in to view

taklwu merged commit 902067d into apache:master May 15, 2025
1 check passed

taklwu deleted the HBASE-29292-master branch May 15, 2025 17:42

taklwu added a commit that referenced this pull request May 15, 2025

HBASE-29292 Revise TestRecreateCluster (#6981)

338d2b3

Signed-off-by: Wellington Chevreuil <[email protected]> (cherry picked from commit 902067d)

HBASE-29292 Revise TestRecreateCluster #6981

HBASE-29292 Revise TestRecreateCluster #6981

Uh oh!

Conversation

taklwu commented May 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment has been minimized.

This comment has been minimized.

Apache9 commented May 10, 2025

Uh oh!

Apache9 commented May 10, 2025

Uh oh!

taklwu commented May 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wchevreuil commented May 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Apache9 commented May 13, 2025

Uh oh!

Apache9 commented May 13, 2025

Uh oh!

Apache9 commented May 13, 2025

Uh oh!

taklwu commented May 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

taklwu commented May 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment has been minimized.

This comment has been minimized.

Apache-HBase commented May 14, 2025

Uh oh!

Apache-HBase commented May 14, 2025

Uh oh!

Apache9 commented May 15, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

taklwu commented May 9, 2025 •

edited

Loading

taklwu commented May 12, 2025 •

edited

Loading

wchevreuil commented May 13, 2025 •

edited

Loading

taklwu commented May 13, 2025 •

edited

Loading

taklwu commented May 13, 2025 •

edited

Loading