Skip to content

Conversation

@hpatro
Copy link
Collaborator

@hpatro hpatro commented Jun 4, 2025

Fixes #2171

Handle divergent shard-id across primary and replica from nodes.conf and reconcile all the nodes in the shard to the primary node's shard-id.

@hpatro hpatro requested a review from PingXie June 4, 2025 20:25
@codecov
Copy link

codecov bot commented Jun 4, 2025

Codecov Report

Attention: Patch coverage is 0% with 8 lines in your changes missing coverage. Please review.

Project coverage is 73.57%. Comparing base (3789b29) to head (8f658fe).
Report is 14 commits behind head on unstable.

Files with missing lines Patch % Lines
src/cluster_legacy.c 0.00% 8 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2174      +/-   ##
============================================
+ Coverage     71.43%   73.57%   +2.13%     
============================================
  Files           122      122              
  Lines         66210    73777    +7567     
============================================
+ Hits          47300    54281    +6981     
- Misses        18910    19496     +586     
Files with missing lines Coverage Δ
src/cluster_legacy.c 89.50% <0.00%> (+2.77%) ⬆️

... and 32 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@hpatro hpatro force-pushed the handle-divergent-shard-id-nodes-conf branch from 5c52a9d to 8f658fe Compare June 10, 2025 20:27
@hpatro
Copy link
Collaborator Author

hpatro commented Jun 11, 2025

@martinrvisser I tried few scenarios manually, they seem to run fine. Could you try running this patch and see if it helps ?

@hpatro
Copy link
Collaborator Author

hpatro commented Jun 11, 2025

Test scenarios:

43ee1cacd6948ee96bb367eb8795e62e8d153f05 10.125.153.181:0@28007,,tls-port=18007,shard-id=f91532eb722943e035f34292cf586f3f750d65bd myself,master - 0 1749488968682 13 connected 5461-16383
8d89f4d4e7c57a2819277732f86213241c3ec0d3 10.124.45.182:0@28007,,tls-port=18007,shard-id=a91532eb722943e035f34292cf586f3f750d65bd slave 43ee1cacd6948ee96bb367eb8795e62e8d153f05 0 1749488968682 13 connected
vars currentEpoch 13 lastVoteEpoch 9
8d89f4d4e7c57a2819277732f86213241c3ec0d3 10.124.45.182:0@28007,,tls-port=18007,shard-id=a91532eb722943e035f34292cf586f3f750d65bd slave 43ee1cacd6948ee96bb367eb8795e62e8d153f05 0 1749488968682 13 connected
43ee1cacd6948ee96bb367eb8795e62e8d153f05 10.125.153.181:0@28007,,tls-port=18007,shard-id=f91532eb722943e035f34292cf586f3f750d65bd myself,master - 0 1749488968682 13 connected 5461-16383
vars currentEpoch 13 lastVoteEpoch 9
43ee1cacd6948ee96bb367eb8795e62e8d153f05 10.125.153.181:0@28007,,tls-port=18007,shard-id=f91532eb722943e035f34292cf586f3f750d65bd master - 0 1749488968682 13 connected 5461-16383
8d89f4d4e7c57a2819277732f86213241c3ec0d3 10.124.45.182:0@28007,,tls-port=18007,shard-id=a91532eb722943e035f34292cf586f3f750d65bd myself,slave 43ee1cacd6948ee96bb367eb8795e62e8d153f05 0 1749488968682 13 connected
vars currentEpoch 13 lastVoteEpoch 9
8d89f4d4e7c57a2819277732f86213241c3ec0d3 10.124.45.182:0@28007,,tls-port=18007,shard-id=a91532eb722943e035f34292cf586f3f750d65bd myself,slave 43ee1cacd6948ee96bb367eb8795e62e8d153f05 0 1749488968682 13 connected
43ee1cacd6948ee96bb367eb8795e62e8d153f05 10.125.153.181:0@28007,,tls-port=18007,shard-id=f91532eb722943e035f34292cf586f3f750d65bd master - 0 1749488968682 13 connected 5461-16383
vars currentEpoch 13 lastVoteEpoch 9

All of these load fine.

@martinrvisser
Copy link

@martinrvisser I tried few scenarios manually, they seem to run fine. Could you try running this patch and see if it helps ?

@hpatro fine in my tests also, nice work
2804808:M 13 Jun 2025 12:13:25.599 * Node 386bac3e5337540f2bce4181ff8980f33a989667 has a different shard id (4009ad6c27a3fe91b709a6502bd111b5d58eada0) than its primary's shard id bb85c3d913bb6f9e3d02e68ae89b5c5e61111fb4 (f713a7d0f3079e30fcc47348260a09de0c9b8b72). Updating replica's shard id to match primary's shard id.
2804808:M 13 Jun 2025 12:13:25.599 * Node configuration loaded, I'm 386bac3e5337540f2bce4181ff8980f33a989667

Signed-off-by: Harkrishn Patro <[email protected]>
Signed-off-by: Harkrishn Patro <[email protected]>
Copy link
Member

@enjoy-binbin enjoy-binbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This look correct, please also update the top comment to desc it. So it is somehow like a corrupted (or a outupdated) nodes.conf cause the trouble?

@hpatro
Copy link
Collaborator Author

hpatro commented Jun 18, 2025

This look correct, please also update the top comment to desc it. So it is somehow like a corrupted (or a outupdated) nodes.conf cause the trouble?

Yes. Updated.

Copy link
Member

@enjoy-binbin enjoy-binbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hpatro hpatro added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jun 20, 2025
@hpatro hpatro merged commit 4a3e713 into valkey-io:unstable Jun 20, 2025
81 checks passed
stockholmux added a commit to stockholmux/valkey that referenced this pull request Aug 13, 2025
Signed-off-by: Kyle J. Davis <[email protected]>
@hpatro hpatro added the release-notes This issue should get a line item in the release notes label Aug 13, 2025
@zuiderkwast zuiderkwast moved this to 8.0.5 (to be released) in Valkey 8.0 Aug 21, 2025
@zuiderkwast zuiderkwast moved this to To be backported in Valkey 8.1 Aug 21, 2025
@zuiderkwast zuiderkwast moved this to To be backported in Valkey 7.2 Aug 21, 2025
zuiderkwast pushed a commit to vitarb/valkey that referenced this pull request Aug 21, 2025
…d id (valkey-io#2174)

Fixes valkey-io#2171

Handle divergent shard-id across primary and replica from nodes.conf and
reconcile all the nodes in the shard to the primary node's shard-id.

---------

Signed-off-by: Harkrishn Patro <[email protected]>
Signed-off-by: Viktor Söderqvist <[email protected]>
zuiderkwast pushed a commit that referenced this pull request Aug 22, 2025
…d id (#2174)

Fixes #2171

Handle divergent shard-id across primary and replica from nodes.conf and
reconcile all the nodes in the shard to the primary node's shard-id.

---------

Signed-off-by: Harkrishn Patro <[email protected]>
Signed-off-by: Viktor Söderqvist <[email protected]>
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request Sep 16, 2025
…d id (valkey-io#2174)

Fixes valkey-io#2171

Handle divergent shard-id across primary and replica from nodes.conf and
reconcile all the nodes in the shard to the primary node's shard-id.

---------

Signed-off-by: Harkrishn Patro <[email protected]>
Signed-off-by: Viktor Söderqvist <[email protected]>
@ranshid ranshid moved this from To be backported to In Progress in Valkey 8.1 Sep 30, 2025
ranshid pushed a commit to ranshid/valkey that referenced this pull request Sep 30, 2025
…d id (valkey-io#2174)

Fixes valkey-io#2171

Handle divergent shard-id across primary and replica from nodes.conf and
reconcile all the nodes in the shard to the primary node's shard-id.

---------

Signed-off-by: Harkrishn Patro <[email protected]>
@ranshid ranshid moved this from In Progress to To be backported in Valkey 8.1 Sep 30, 2025
ranshid pushed a commit to ranshid/valkey that referenced this pull request Sep 30, 2025
…d id (valkey-io#2174)

Fixes valkey-io#2171

Handle divergent shard-id across primary and replica from nodes.conf and
reconcile all the nodes in the shard to the primary node's shard-id.

---------

Signed-off-by: Harkrishn Patro <[email protected]>
ranshid pushed a commit to ranshid/valkey that referenced this pull request Sep 30, 2025
…d id (valkey-io#2174)

Fixes valkey-io#2171

Handle divergent shard-id across primary and replica from nodes.conf and
reconcile all the nodes in the shard to the primary node's shard-id.

---------

Signed-off-by: Harkrishn Patro <[email protected]>
@zuiderkwast
Copy link
Contributor

I tried backporting this one to 7.2 but I couldn't get the test case to pass, so I'm skipping it for now. If anyone wants to help backporting it, feel free to open a PR targeting the 7.2 branch. Some difficulties to note:

  • src/cluster_legacy.c doesn't exist in 7.2. It's still called src/cluster.c here.
  • Variables for primaries and replicas are using the old names masters and slaves.

zuiderkwast pushed a commit that referenced this pull request Oct 1, 2025
…d id (#2174)

Fixes #2171

Handle divergent shard-id across primary and replica from nodes.conf and
reconcile all the nodes in the shard to the primary node's shard-id.

---------

Signed-off-by: Harkrishn Patro <[email protected]>
@zuiderkwast zuiderkwast moved this from To be backported to 8.1.4 in Valkey 8.1 Oct 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes This issue should get a line item in the release notes run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

Status: To be backported
Status: 8.0.5
Status: 8.1.4

Development

Successfully merging this pull request may close these issues.

[BUG] "Unrecoverable error: corrupted cluster config file" due to incorrect shard-id

6 participants