@skolosov-snap
Contributor

@skolosov-snap skolosov-snap commented Feb 5, 2025

Currently, Valkey does not allow detaching a replica from its primary node. So, if you want to change the cluster topology, the only way to do it is to reset the node (CLUSTER RESET command). However, this removes the node from the cluster, which affects clients: they keep sending traffic to this node (and get inaccurate responses) until they refresh their topology.

This change adds support for a new argument to the CLUSTER REPLICATE command: CLUSTER REPLICATE NO ONE. When this command is called, the node is converted from a replica to an empty primary node but stays in the cluster. Thus, all traffic coming from clients to this node can be redirected to the correct node.
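
To make the difference concrete, here is a minimal toy model in Python of the two behaviors described above. It is only a sketch and not the Valkey implementation: the `Node` class, its fields, and both function names are hypothetical, and the model assumes that a reset replica forgets its peers while a detached replica keeps them.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Node:
    """Hypothetical, simplified view of one cluster node."""
    name: str
    primary: Optional[str] = None          # None means the node is a primary
    known_nodes: Set[str] = field(default_factory=set)
    keys: Dict[str, str] = field(default_factory=dict)


def cluster_reset(node: Node) -> None:
    """Toy CLUSTER RESET on a replica: it becomes an empty primary
    and forgets every other node, so it leaves the cluster."""
    node.primary = None
    node.keys.clear()
    node.known_nodes.clear()


def cluster_replicate_no_one(node: Node) -> None:
    """Toy CLUSTER REPLICATE NO ONE: the replica becomes an empty primary
    but keeps its view of the cluster, so it can still redirect clients."""
    node.primary = None
    node.keys.clear()
    # known_nodes is intentionally left untouched.
```

In this model, a node that still knows its peers can point clients at the right owner of a key, while a reset node has nothing to redirect to.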

@skolosov-snap skolosov-snap force-pushed the skolosov/replicate-no-one branch from 91589d1 to ff96c0f Compare February 5, 2025 22:51
@codecov

codecov bot commented Feb 6, 2025

Codecov Report

Attention: Patch coverage is 81.81818% with 4 lines in your changes missing coverage. Please review.

Project coverage is 71.08%. Comparing base (09f9630) to head (ff4dc90).
Report is 11 commits behind head on unstable.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/cluster_legacy.c | 81.81% | 4 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1674      +/-   ##
============================================
+ Coverage     71.00%   71.08%   +0.08%     
============================================
  Files           123      123              
  Lines         65675    65704      +29     
============================================
+ Hits          46631    46706      +75     
+ Misses        19044    18998      -46     
| Files with missing lines | Coverage Δ |
|---|---|
| src/commands.def | 100.00% <ø> (ø) |
| src/cluster_legacy.c | 85.85% <81.81%> (-0.24%) ⬇️ |

... and 18 files with indirect coverage changes

Collaborator

@hpatro hpatro left a comment

The cluster-replicate.json file should be updated; commands.def then gets regenerated as part of the build. Or, if it was accidentally not staged, please add it.

Also, could you run clang-format on your end to fix some of the formatting issues?

@skolosov-snap skolosov-snap force-pushed the skolosov/replicate-no-one branch 2 times, most recently from fde1ab6 to 4abbae8 Compare February 7, 2025 23:46
@skolosov-snap
Contributor Author

The cluster-replicate.json file should be updated; commands.def then gets regenerated as part of the build. Or, if it was accidentally not staged, please add it.

Also, could you run clang-format on your end to fix some of the formatting issues?

Updated.

@zuiderkwast zuiderkwast added the major-decision-pending Major decision pending by TSC team label Feb 10, 2025
Contributor

@zuiderkwast zuiderkwast left a comment

This feature makes sense to me.

@valkey-io/core-team New arguments = major decision. Please approve or vote if you agree.

@skolosov-snap skolosov-snap force-pushed the skolosov/replicate-no-one branch from 4abbae8 to 85238e6 Compare February 10, 2025 16:12
@zuiderkwast
Contributor

The CI job "DCO" is failing. You need to use git commit -s. See the Details link next to the DCO job.

Why do we need it? See here: https://github.com/valkey-io/valkey/blob/unstable/CONTRIBUTING.md#developer-certificate-of-origin thanks!

@skolosov-snap skolosov-snap force-pushed the skolosov/replicate-no-one branch from 85238e6 to 3789227 Compare February 10, 2025 17:24
@skolosov-snap skolosov-snap force-pushed the skolosov/replicate-no-one branch from 3789227 to e4e8b24 Compare February 10, 2025 17:36
@skolosov-snap
Contributor Author

The CI job "DCO" is failing. You need to use git commit -s. See the Details link next to the DCO job.

Why do we need it? See here: https://github.com/valkey-io/valkey/blob/unstable/CONTRIBUTING.md#developer-certificate-of-origin thanks!

Done

@skolosov-snap skolosov-snap force-pushed the skolosov/replicate-no-one branch from e4e8b24 to bed392b Compare February 11, 2025 16:31
@skolosov-snap
Contributor Author

Any objections to merging it?

@zuiderkwast
Contributor

We're busy making the 8.1.0 release candidate just now. This one will need to wait and get merged after that.

@madolson
Member

Any objections to merging it?

We should also have some tests validating this new behavior works as intended. Have a cluster, disconnect the replica, make sure slots/shards and all are still consistent and the rest of the cluster agrees on the state.
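
One way such a consistency check could look is sketched below in Python. It parses the text returned by CLUSTER NODES from several nodes and verifies that every view lists the detached node as a primary with no primary of its own and no slots. The helper names are hypothetical, the check assumes the usual space-separated CLUSTER NODES line layout (ID, address, flags, primary ID, ping sent, pong received, config epoch, link state, then slots), and it is not the test that ended up in the PR.

```python
def parse_cluster_nodes(text):
    """Return {node_id: (role, primary_id, slots)} from CLUSTER NODES output.

    Assumes the standard layout: fields 0-7 are fixed, slots start at field 8.
    """
    nodes = {}
    for line in text.strip().splitlines():
        parts = line.split()
        node_id = parts[0]
        flags = parts[2].split(",")
        primary_id = None if parts[3] == "-" else parts[3]
        role = "primary" if "master" in flags else "replica"
        slots = tuple(parts[8:])
        nodes[node_id] = (role, primary_id, slots)
    return nodes


def detached_consistently(views, node_id):
    """True if every node's view shows node_id as an empty, standalone primary."""
    expected = ("primary", None, ())
    return all(parse_cluster_nodes(view).get(node_id) == expected for view in views)
```

In a real test, `views` would hold the CLUSTER NODES reply collected from each node after the replica has been detached and the cluster has had time to converge.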

@PingXie
Member

PingXie commented Feb 17, 2025

if you want to change the cluster topology, the only way to do it is to reset the node (CLUSTER RESET command). However, this removes the node from the cluster, which affects clients.

Can we introduce a new mode so it doesn't forget all the nodes in the cluster? I think conceptually we are discussing a form of reset still so it seems to me that the solution is too tactical. Maybe CLUSTER RESET SOFT?

BTW, I just noticed that the forget path is not always working. The reset node joined back to the cluster quickly.

This change adds support for a new argument to the CLUSTER REPLICATE command: CLUSTER REPLICATE NO ONE. When this command is called, the node is converted from a replica to an empty primary node but stays in the cluster. Thus, all traffic coming from clients to this node can be redirected to the correct node.

I don't see the implementation moving the node to a new shard. This would leave two primaries (one real and one empty) in the original shard, which will confuse the client.

@hpatro
Collaborator

hpatro commented Feb 17, 2025

I don't see the implementation moving the node to a new shard. This would leave two primaries (one real and one empty) in the original shard, which will confuse the client.

Good catch!

@skolosov-snap
Contributor Author

skolosov-snap commented Feb 19, 2025

if you want to change the cluster topology, the only way to do it is to reset the node (CLUSTER RESET command). However, this removes the node from the cluster, which affects clients.

Can we introduce a new mode so it doesn't forget all the nodes in the cluster? I think conceptually we are discussing a form of reset still so it seems to me that the solution is too tactical. Maybe CLUSTER RESET SOFT?

IMHO that is just a question of syntax. Whatever command name we come up with, its behavior would be the same: turn a replica into a primary while leaving it in the cluster. If you think the name CLUSTER RESET SOFT is better, I can implement it under that name. My personal opinion is that CLUSTER RESET is not the best command for this feature, because the main thing CLUSTER RESET does is exclude the node from the cluster, and that is exactly what we want to avoid. On the other hand, CLUSTER REPLICATE NO ONE is consistent with the similar non-cluster command REPLICAOF NO ONE, which should not confuse clients and should even feel familiar.

BTW, I just noticed that the forget path is not always working.

What do you mean by "forget path is not always working"? What forget path? We are not doing any forgetting here.

The reset node joined back to the cluster quickly.

AFAIU, if a node is reset, it is not added back to the cluster automatically, only when somebody does it explicitly. So if you want to remove a replica manually (e.g. the node is scheduled for maintenance), the only option is to reset it, which affects all clients (they start seeing errors).

This change adds support for a new argument to the CLUSTER REPLICATE command: CLUSTER REPLICATE NO ONE. When this command is called, the node is converted from a replica to an empty primary node but stays in the cluster. Thus, all traffic coming from clients to this node can be redirected to the correct node.

I don't see the implementation moving the node to a new shard. This would leave two primaries (one real and one empty) in the original shard, which will confuse the client.

What is a shard? Would you please shed some light on this term? Is it something specific to Valkey? AFAIU, a shard is a primary with the replicas attached to it. So, when we switch the role of the node from replica to empty primary, doesn't that mean we moved it to a separate new shard?

@skolosov-snap skolosov-snap force-pushed the skolosov/replicate-no-one branch from bed392b to b419cfb Compare February 19, 2025 23:04
@skolosov-snap
Contributor Author

I believe I figured it out. There is an internal shards dict that needs to be updated.
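
The bookkeeping being discussed can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the code in src/cluster_legacy.c: `shards`, `node_shard`, and `detach_to_new_shard` are hypothetical names, and the random hex string merely stands in for a freshly generated shard ID.

```python
import secrets


def detach_to_new_shard(shards, node_shard, node):
    """Move `node` out of its current shard into a brand-new one.

    shards:     {shard_id: [node names]}   (hypothetical stand-in for the shards dict)
    node_shard: {node name: shard_id}
    Returns the new shard ID.
    """
    old_shard = node_shard[node]
    shards[old_shard].remove(node)      # the old shard keeps only its real members

    new_shard = secrets.token_hex(20)   # placeholder for a newly generated shard ID
    node_shard[node] = new_shard
    shards[new_shard] = [node]          # the detached node is alone in its shard
    return new_shard
```

Skipping the update is exactly the problem raised in review: the detached node would still be listed next to its former primary, so the original shard would appear to have two primaries.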

@skolosov-snap skolosov-snap force-pushed the skolosov/replicate-no-one branch 2 times, most recently from b9ad4fd to 95c0b42 Compare February 20, 2025 00:06
@madolson madolson added major-decision-approved Major decision approved by TSC team and removed major-decision-pending Major decision pending by TSC team labels Mar 10, 2025
@skolosov-snap
Contributor Author

What is the next step?

@hwware
Member

hwware commented Mar 11, 2025

What is the next step?

Wait for the merge and the release in Valkey 9.

@skolosov-snap
Contributor Author

skolosov-snap commented Mar 11, 2025

What is the next step?

Wait for the merge and the release in Valkey 9.

Actually, my question was about the merging process. How should it be merged? Is there something I should do to make it happen? The merge button is still not available to me.

@hwware
Member

hwware commented Mar 12, 2025

When it is time to merge, a Valkey maintainer will merge it for you. Do not worry. @valkey-io/core-team

@zuiderkwast zuiderkwast moved this from Optional for next release candidate to In Progress in Valkey 9.0 Mar 16, 2025
@skolosov-snap skolosov-snap force-pushed the skolosov/replicate-no-one branch from a801680 to 7ce7224 Compare March 18, 2025 23:19
@skolosov-snap skolosov-snap force-pushed the skolosov/replicate-no-one branch 2 times, most recently from 64da23f to dca695e Compare April 1, 2025 14:42
Contributor

@zuiderkwast zuiderkwast left a comment

LGTM.

We just need to update the version numbers to 9.0.0

@zuiderkwast
Contributor

@PingXie @madolson Do you still want to review this or can we merge it?

@skolosov-snap skolosov-snap force-pushed the skolosov/replicate-no-one branch from 8dcbbed to 0cf9db8 Compare April 2, 2025 15:12
@hpatro
Collaborator

hpatro commented Apr 2, 2025

I think it's safe to merge @zuiderkwast and address any further comments in the future (if any).

Shall I go ahead?

@zuiderkwast
Contributor

@hpatro I've been accused of YOLO-merging PRs before, when people were still reviewing. 😆

Member

@PingXie PingXie left a comment

LGTM. Just a few minor suggestions. Thanks @skolosov-snap!

@skolosov-snap skolosov-snap force-pushed the skolosov/replicate-no-one branch from 3c4b996 to ff4dc90 Compare April 7, 2025 15:48
@madolson
Member

madolson commented Apr 7, 2025

I've been accused of YOLO-merging PRs before, when people were still reviewing. 😆

TBF, I'm fairly sure we've all done that. When I was at Kube-con I heard so many endless drama stories about dysfunctional projects where everyone was always mad at each other about code quality, sneaking changes without reviews, etc. By comparison people seem to think we are a highly functional group.

@PingXie
Member

PingXie commented Apr 7, 2025

By comparison people seem to think we are a highly functional group.

Small correction 😀

@hpatro hpatro merged commit 7407520 into valkey-io:unstable Apr 7, 2025
52 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Valkey 9.0 Apr 7, 2025
@hpatro
Collaborator

hpatro commented Apr 7, 2025

@skolosov-snap Thanks for the PR. It's merged now 🥳. Could you update the docs as well?

The repository for docs: https://github.com/valkey-io/valkey-doc/

@skolosov-snap skolosov-snap deleted the skolosov/replicate-no-one branch April 8, 2025 02:47
murphyjacob4 pushed a commit to enjoy-binbin/valkey that referenced this pull request Apr 13, 2025
Currently, Valkey does not allow detaching a replica from its primary
node. So, if you want to change the cluster topology, the only way to
do it is to reset the node (`CLUSTER RESET` command). However, this
removes the node from the cluster, which affects clients: they keep
sending traffic to this node (and get inaccurate responses) until they
refresh their topology.

This change adds support for a new argument to the CLUSTER REPLICATE
command: `CLUSTER REPLICATE NO ONE`. When this command is called, the
node is converted from a replica to an empty primary node but stays in
the cluster. Thus, all traffic coming from clients to this node can be
redirected to the correct node.

Signed-off-by: Sergey Kolosov <[email protected]>

Labels

major-decision-approved: Major decision approved by TSC team
needs-doc-pr: This change needs to update a documentation page. Remove label once doc PR is open.
release-notes: This issue should get a line item in the release notes

Projects

Status: Done
