@Apache9 Apache9 commented Jul 25, 2025

No description provided.

@Apache9 Apache9 self-assigned this Jul 25, 2025
Apache9 commented Jul 25, 2025

I've added a test to reproduce the problem.

Let me think about how to fix it properly; it may break our abstraction for WAL filtering...

@Apache9 Apache9 changed the title from "HBASE-29463 add tests" to "HBASE-29463 Bidirectional serial replication will block if a region’s last edit before rs crashed was from the peer cluster" Jul 26, 2025
@Apache9 Apache9 requested a review from Copilot July 26, 2025 11:16
Copilot AI left a comment

Pull Request Overview

This PR fixes a bug in bidirectional serial replication where replication can get stuck if a region's last edit before a RegionServer crash was from the peer cluster. The fix ensures that sequence IDs are properly recorded for entries that have global replication scope, even when they are filtered out during replication.

Key changes:

  • Modified the serial replication WAL reader to track sequence IDs for entries with global scope before filtering
  • Enhanced the base test class to support bidirectional replication testing with utility methods
  • Added a test case to reproduce and verify the fix for the stuck replication scenario
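
As a rough illustration of the first key change, here is a minimal sketch of recording a region's sequence id before the entry is filtered. `SketchEntry`, `SketchSerialReader`, and the `globalScope`/`fromPeer` flags are hypothetical stand-ins invented for this example; they are not the real HBase classes or the code in this PR.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a WAL entry; not the real HBase WAL.Entry class.
final class SketchEntry {
  final String encodedRegionName;
  final long sequenceId;
  final boolean globalScope; // assumed flag: some column family has global replication scope
  final boolean fromPeer;    // assumed flag: the edit originally came from the peer cluster

  SketchEntry(String encodedRegionName, long sequenceId, boolean globalScope, boolean fromPeer) {
    this.encodedRegionName = encodedRegionName;
    this.sequenceId = sequenceId;
    this.globalScope = globalScope;
    this.fromPeer = fromPeer;
  }
}

// Hypothetical reader that notes progress before the entry is filtered.
final class SketchSerialReader {
  private final Map<String, Long> lastSeenSeqId = new HashMap<>();

  // Returns the entry to ship, or null when the entry is filtered out.
  SketchEntry read(SketchEntry entry) {
    if (entry.globalScope) {
      // Record the sequence id first, so a filtered entry still counts as progress.
      lastSeenSeqId.merge(entry.encodedRegionName, entry.sequenceId, Long::max);
    }
    // Edits that came from the peer are dropped so they are not sent back.
    return entry.fromPeer ? null : entry;
  }

  // Returns -1 when no global-scope entry has been seen for the region.
  long lastSeen(String encodedRegionName) {
    return lastSeenSeqId.getOrDefault(encodedRegionName, -1L);
  }
}
```

In this sketch, an edit that came from the peer is still dropped, but its sequence id is remembered, which is the property the overview describes.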

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| TestReplicationBase.java | Refactored peer management methods to support testing with different source/target clusters |
| TestBidirectionSerialReplicationStuck.java | New test case that reproduces the bidirectional replication stuck scenario |
| SerialReplicationSourceWALReader.java | Core fix to record sequence IDs for global scope entries before filtering |
| ScopeWALEntryFilter.java | Made hasGlobalScope method public static and restored null/empty scope filtering |


private String encodedRegionName;

private long sequenceId = HConstants.NO_SEQNUM;
Copilot AI Jul 26, 2025

The sequenceId field is not reset to HConstants.NO_SEQNUM after being used. This could lead to stale sequence IDs being recorded if subsequent entries don't have global scope. Consider resetting it in the removeEntryFromStream method.

Contributor Author

We only use encodedRegionName to determine whether to record last pushed sequence id so resetting it is enough.
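
A minimal sketch of that reasoning, assuming a holder with the two fields quoted above. `PendingSeqIdSketch` and its methods are hypothetical, and the local `NO_SEQNUM` constant is only a stand-in for `HConstants.NO_SEQNUM`.

```java
// Hypothetical holder mirroring the two fields quoted above; not the actual HBase class.
final class PendingSeqIdSketch {
  static final long NO_SEQNUM = -1L; // stand-in for HConstants.NO_SEQNUM

  private String encodedRegionName;
  private long sequenceId = NO_SEQNUM;

  void remember(String encodedRegionName, long sequenceId) {
    this.encodedRegionName = encodedRegionName;
    this.sequenceId = sequenceId;
  }

  // Only the region name is cleared. A stale sequenceId is harmless here
  // because it is never read while encodedRegionName is null.
  void reset() {
    encodedRegionName = null;
  }

  // Returns the sequence id to record, or NO_SEQNUM when nothing is pending.
  long sequenceIdToRecord() {
    return encodedRegionName != null ? sequenceId : NO_SEQNUM;
  }
}
```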

Comment on lines 43 to 44
if (scopes == null || scopes.isEmpty()) {
return null;
Copilot AI Jul 26, 2025

This change reverts the filtering behavior for entries with null or empty scopes, which contradicts the comment that was removed about not filtering entire entries for serial replication. This could break serial replication marker handling.

Suggested change
- if (scopes == null || scopes.isEmpty()) {
-   return null;
+ // Do not filter out the entire entry if scopes are null or empty to preserve serial replication markers
+ if (scopes == null || scopes.isEmpty()) {
+   return entry;

Contributor Author

The old code was also meant to solve the problem where we filter out an entire Entry, which causes serial replication to get stuck. Since we now store the necessary information before filtering, it is OK to restore the old logic.

Apache9 commented Aug 8, 2025

OK, there are still some problems...

It seems we still need to know why an entry is skipped before we can determine whether we should update the last pushed sequence id. For example, when sending WAL entries for a new range, the first WAL entry is the opening marker, which we always filter out because it does not have a replication scope. In this case, if we just treat it as 'can push', we will update the last pushed sequence id even if the previous range is not finished.

Let me think how to deal with this scenario...
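
To make the failure mode concrete, here is a simplified sketch. `BarrierSketch` and its "everything below the barrier must be pushed first" rule are assumptions made for illustration; the real check in HBase is more involved.

```java
// Simplified, hypothetical model of per-region push progress; not the HBase implementation.
final class BarrierSketch {
  private long lastPushedSeqId;

  BarrierSketch(long lastPushedSeqId) {
    this.lastPushedSeqId = lastPushedSeqId;
  }

  // Assumed rule: a range starting at `barrier` may be pushed only after every
  // sequence id below the barrier has been pushed.
  boolean canPush(long barrier) {
    return lastPushedSeqId >= barrier - 1;
  }

  // Naive handling: a filtered entry always advances progress.
  void skipNaively(long seqId) {
    lastPushedSeqId = Math.max(lastPushedSeqId, seqId);
  }

  // Careful handling: a filtered entry advances progress only if a real
  // entry at the same position would have been allowed through.
  boolean skipIfAllowed(long seqId, long barrier) {
    if (!canPush(barrier)) {
      return false;
    }
    lastPushedSeqId = Math.max(lastPushedSeqId, seqId);
    return true;
  }

  long lastPushed() {
    return lastPushedSeqId;
  }
}
```

With the previous range pushed only up to sequence id 5 and a new range starting at 10, `skipNaively(10)` makes `canPush(10)` return true even though ids 6 to 9 were never pushed, while `skipIfAllowed(10, 10)` leaves the progress untouched.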

… last edit before rs crashed was from the peer cluster
Apache9 commented Aug 9, 2025

So the solution is to go back to the old way: tell the filter that we are a serial replication peer, so the filter does not return null, since we still need the WAL entry for recording progress.
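
A minimal sketch of that approach, assuming a filter that knows whether its peer is serial. `ScopedEntrySketch` and `ScopeFilterSketch` are hypothetical names, and the `hasScope` flag stands in for the real scope check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a WAL entry with its cells; not the real HBase types.
final class ScopedEntrySketch {
  final long sequenceId;
  final boolean hasScope; // assumed flag: the entry carries a replication scope
  final List<String> cells;

  ScopedEntrySketch(long sequenceId, boolean hasScope, String... cells) {
    this.sequenceId = sequenceId;
    this.hasScope = hasScope;
    this.cells = new ArrayList<>(Arrays.asList(cells));
  }
}

// Hypothetical scope filter that is told whether the peer is serial.
final class ScopeFilterSketch {
  private final boolean serial;

  ScopeFilterSketch(boolean serial) {
    this.serial = serial;
  }

  ScopedEntrySketch filter(ScopedEntrySketch entry) {
    if (entry.hasScope) {
      return entry; // replicate as usual
    }
    if (serial) {
      // Keep the entry so its sequence id can still be used to record
      // progress, but drop the cells so nothing is replicated.
      entry.cells.clear();
      return entry;
    }
    return null; // non-serial peers can skip the entry entirely
  }
}
```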

@Apache-HBase
🎊 +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
| --- | --- | --- | --- | --- |
| +0 🆗 | reexec | 0m 30s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 💚 | hbaseanti | 0m 0s | | Patch does not have any anti-patterns. |
| | | | | _ master Compile Tests _ |
| +1 💚 | mvninstall | 3m 43s | | master passed |
| +1 💚 | compile | 3m 23s | | master passed |
| +1 💚 | checkstyle | 0m 37s | | master passed |
| +1 💚 | spotbugs | 1m 38s | | master passed |
| +1 💚 | spotless | 0m 49s | | branch has no errors when running spotless:check. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 3m 9s | | the patch passed |
| +1 💚 | compile | 3m 21s | | the patch passed |
| +1 💚 | javac | 3m 21s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 💚 | checkstyle | 0m 37s | | the patch passed |
| +1 💚 | spotbugs | 1m 40s | | the patch passed |
| +1 💚 | hadoopcheck | 12m 3s | | Patch does not cause any errors with Hadoop 3.3.6 3.4.0. |
| +1 💚 | spotless | 0m 45s | | patch has no errors when running spotless:check. |
| | | | | _ Other Tests _ |
| +1 💚 | asflicense | 0m 10s | | The patch does not generate ASF License warnings. |
| | | 40m 10s | | |

| Subsystem | Report/Notes |
| --- | --- |
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7172/7/artifact/yetus-general-check/output/Dockerfile |
| GITHUB PR | #7172 |
| Optional Tests | dupname asflicense javac spotbugs checkstyle codespell detsecrets compile hadoopcheck hbaseanti spotless |
| uname | Linux 0971da7eceb0 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / b1d1f55 |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Max. process+thread count | 85 (vs. ulimit of 30000) |
| modules | C: hbase-server U: hbase-server |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7172/7/console |
| versions | git=2.34.1 maven=3.9.8 spotbugs=4.7.3 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |

This message was automatically generated.

@Apache-HBase
🎊 +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
| --- | --- | --- | --- | --- |
| +0 🆗 | reexec | 0m 31s | | Docker mode activated. |
| -0 ⚠️ | yetus | 0m 3s | | Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --author-ignore-list --blanks-eol-ignore-file --blanks-tabs-ignore-file --quick-hadoopcheck |
| | | | | _ Prechecks _ |
| | | | | _ master Compile Tests _ |
| +1 💚 | mvninstall | 3m 14s | | master passed |
| +1 💚 | compile | 0m 57s | | master passed |
| +1 💚 | javadoc | 0m 29s | | master passed |
| +1 💚 | shadedjars | 6m 10s | | branch has no errors when building our shaded downstream artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 3m 0s | | the patch passed |
| +1 💚 | compile | 0m 57s | | the patch passed |
| +1 💚 | javac | 0m 57s | | the patch passed |
| +1 💚 | javadoc | 0m 28s | | the patch passed |
| +1 💚 | shadedjars | 6m 7s | | patch has no errors when building our shaded downstream artifacts. |
| | | | | _ Other Tests _ |
| +1 💚 | unit | 218m 3s | | hbase-server in the patch passed. |
| | | 244m 59s | | |

| Subsystem | Report/Notes |
| --- | --- |
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7172/7/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile |
| GITHUB PR | #7172 |
| Optional Tests | javac javadoc unit compile shadedjars |
| uname | Linux da8421298c1d 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / b1d1f55 |
| Default Java | Eclipse Adoptium-17.0.11+9 |
| Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7172/7/testReport/ |
| Max. process+thread count | 5159 (vs. ulimit of 30000) |
| modules | C: hbase-server U: hbase-server |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-7172/7/console |
| versions | git=2.34.1 maven=3.9.8 |
| Powered by | Apache Yetus 0.15.0 https://yetus.apache.org |

This message was automatically generated.

@Apache9 Apache9 requested review from NihalJain and ndimiduk August 12, 2025 14:51
Apache9 commented Aug 12, 2025

I know this is not the area you two are most familiar with, @ndimiduk @NihalJain, but please help review this so we can get it fixed. I think we do have some end users in the community who have started to use the serial replication feature, so we should take this chance to make it more stable.

Thanks.

entry.getEdit().getCells().clear();
return entry;
} else {
return null;
Member

Will normal replication correctly handle the case where the entry is non-null but the cells list is empty? I wonder if we can avoid having two contracts for the two types of replication. I also think that it's safer to avoid returning null objects when possible.

Contributor Author

The actual replication code for serial and non-serial replication is the same: we will not replicate anything when there are no cells in the WALEdit.
Returning null is by design here; it means we can skip this entry. We would need to change a bunch of code if we wanted to change this...
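
Roughly, the contract looks like the sketch below. `ShipSketch.cellsToShip` is a hypothetical helper written for this illustration and does not correspond to an actual HBase method.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shipping step shared by serial and non-serial replication.
final class ShipSketch {
  // Each inner list holds the cells of one filtered entry;
  // a null element means the filter asked us to skip that entry.
  static List<String> cellsToShip(List<List<String>> filteredEntries) {
    List<String> out = new ArrayList<>();
    for (List<String> cells : filteredEntries) {
      if (cells == null || cells.isEmpty()) {
        continue; // skipped entry or empty edit: nothing to replicate
      }
      out.addAll(cells);
    }
    return out;
  }
}
```

Under this contract, a null entry and an entry with no cells both result in nothing being sent; the only difference is that the empty entry is still visible to the caller.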

Member

Bummer. Okay, fair enough.

@Apache9 Apache9 merged commit bea4272 into apache:master Aug 14, 2025
1 check passed
Apache9 added a commit that referenced this pull request Aug 14, 2025
… last edit before rs crashed was from the peer cluster (#7172)

Signed-off-by: Nick Dimiduk <[email protected]>
(cherry picked from commit bea4272)
Apache9 added a commit to Apache9/hbase that referenced this pull request Aug 14, 2025
… last edit before rs crashed was from the peer cluster (apache#7172)

Signed-off-by: Nick Dimiduk <[email protected]>
(cherry picked from commit bea4272)
Apache9 added a commit that referenced this pull request Aug 14, 2025
… last edit before rs crashed was from the peer cluster (#7172) (#7225)

(cherry picked from commit bea4272)

Signed-off-by: Nick Dimiduk <[email protected]>
sanjeet006py pushed a commit to sanjeet006py/hbase that referenced this pull request Sep 26, 2025
… last edit before rs crashed was from the peer cluster (apache#7172) (apache#7225)

(cherry picked from commit bea4272)

Signed-off-by: Nick Dimiduk <[email protected]>