Skip to content

Conversation

@pieturin
Copy link
Contributor

@pieturin pieturin commented Nov 14, 2024

In some cases, when meeting a new node, if the handshake times out, we can end up with an inconsistent view of the cluster where the new node knows about all the nodes in the cluster, but the cluster does not know about this new node (or vice versa).
To detect this inconsistency, we now check if a node has an outbound link but no inbound link, in this case it probably means this node does not know us. In this case we (re-)send a MEET packet to this node to do a new handshake with it.
If we receive a MEET packet from a known node, we disconnect the outbound link to force a reconnect and sending of a PING packet so that the other node recognizes the link as belonging to us. This prevents cases where a node could send MEET packets in a loop because it thinks the other node does not have an inbound link.

This fixes the bug described in #1251.

@pieturin pieturin force-pushed the cluster-handshake-fix branch from 1920952 to d65423e Compare November 14, 2024 20:16
@codecov
Copy link

codecov bot commented Nov 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.90%. Comparing base (fd58f8d) to head (8086a07).
Report is 285 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1307      +/-   ##
============================================
+ Coverage     70.62%   70.90%   +0.28%     
============================================
  Files           117      119       +2     
  Lines         63324    64631    +1307     
============================================
+ Hits          44722    45828    +1106     
- Misses        18602    18803     +201     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.83% <100.00%> (+0.14%) ⬆️

... and 63 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Collaborator

@hpatro hpatro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we disconnect the outbound link if inbound link is not available? I think it will lead to the same reconnection flow. Would it help with having simpler code and one unified flow. I'm not sure if it will perform the MEET operation though.

@pieturin
Copy link
Contributor Author

pieturin commented Nov 14, 2024

What if we disconnect the outbound link if inbound link is not available?

In this case we would just re-open an outbound connection, which the other node will accept, but it won't force the other node to recognize us as being part of the cluster if it doesn't trust us yet. The only way to force the other node to add us to its cluster view is for us to send a MEET packet.

@hpatro
Copy link
Collaborator

hpatro commented Nov 14, 2024

What if we disconnect the outbound link if inbound link is not available?

In this case we would just re-opened an outbound connection, which the other node will accept, but it won't force the other node to recognize us as being part of the cluster if it doesn't trust us yet. The only way to force the other node to add us to its cluster view is for us to send a MEET packet.

CLUSTER MEET is an admin operation but I guess we are fine with the case of reinitiating it if the operation wasn't successful in first place and retry it.

Copy link
Member

@enjoy-binbin enjoy-binbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do something like #461? only clear the CLUSTER_NODE_MEET flag when myself receive a "ack" (not the plain PONG but something with a strong ack, ack that sender has already meet myself?) I haven't thought about it carefully, but i feel it is more reliable?

@madolson
Copy link
Member

only clear the CLUSTER_NODE_MEET flag when myself receive a "ack" (not the plain PONG but something with a strong ack, ack that sender has already meet myself?

Do you mean by like adding a new flag? I think the concern is we could still end up in the inverse state, where the the node that received the "strong" ack will put the other node online but then might go offline.

My original thought was that as long as one node believes the other is part of the cluster, is should try to have the other node join. It's sort of like an "enhanced" version of how we built up the mesh when two disjoin clusters meet each other.

@pieturin
Copy link
Contributor Author

can we do something like #461? only clear the CLUSTER_NODE_MEET flag when myself receive a "ack" (not the plain PONG but something with a strong ack, ack that sender has already meet myself?) I haven't thought about it carefully, but i feel it is more reliable?

We could do a 3-way handshake to strengthen the handshake reliability, instead of the current -> MEET/PONG, <- PING/PONG. We could add an extra back and forth between the two nodes before clearing the flags. But we can always end up in a situation where one node thinks the handshake is done and the other node times out the handshake because the last packet got delayed or dropped.

With this solution, if the handshake has succeeded on one side and not the other, we ensure both sides will eventually know each other.

@pieturin
Copy link
Contributor Author

With this fix, we can sometime get in a (potentially) infinite loop where a node keeps sending a MEET packet to the other node, but both nodes know each other. This sometimes (although rarely) happens in the second test Handshake eventually succeeds after node handshake timeout on one side with inconsistent view of the cluster.

The following sequence of events can trigger this issue:

  1. Nodes 1 & 2 know each other, but don't know node 0.
  2. We make node 0 meet node 1. But the handshake times-out on node 1's side, but succeed on node 0's side.
  3. When node 1 marks the handshake as timed out, it will close both connections with node 0.
  4. From node 0 perspective, both connections to node 1 are closed. But since it knows node 1, it will re-open an outbound connection to it.
  5. Node 1 will accept the inbound connection from node 0, but it doesn't know this node, so it doesn't register this connection as belonging to any known node (ie: link->node stays NULL).
  6. With the change from this PR, node 0 will detect that node 2 doesn't have any inbound connection to it. So node 0 will send a MEET packet to node 2. (for this bug to happen, node 2 should be met first)
  7. Handshake with node 0 and 2 succeeds.
  8. Now node 2 gossips about node 0 to node 1. So node 1 will add node 0 to its list of known nodes.
  9. Node 1 opens a connection to node 0. At this point node 0 has both inbound and outbound connections to node 1. But from node 1's perspective, it only has an outbound connection to node 0. The inbound connection is not attached to any node (still has link->node set to NULL).
  10. So node 1 sends a MEET packet to node 0, since it doesn't think it has an inbound connection for it.
  11. The handshake completes successfully since node 0 responds to the MEET packet. But still no inbound connection.
  12. So node 1 keeps sending MEET packets to node 0 until node 0 sends a PING packet to node 1. When node 1 receives a PING packet from the node 0, it will set the node's inbound connection (here).

Node 0 should eventually send a PING packet to node 1, but there is no guarantee as to when that can happen. When I reproduce the issue, node 0 never gets a chance to send a PING to node 1 because node 0 overrides node->pong_received for node 1 when node 2 gossips about node 1 with a higher pong_received value. And node 2 always has a lower pong_received value compared to node 1 when trying to select a node to send a PING to here:

if (min_pong_node == NULL || min_pong > this->pong_received) {

I think there are various ways to mitigate this issue:

  1. Do nothing, the MEET packet loop should eventually stop. (but I have to update the test so that it's not flaky)
  2. Change the pinging decision logic to force a PING to every nodes at least every X amount of time, even if we know that another node was able to ping it recently. X can be set to something like 2 * (server.cluster_ping_interval ? server.cluster_ping_interval : server.cluster_node_timeout / 2). This will incur an increase in cluster bus traffic on large clusters.
  3. If I receive a MEET packet and I already have an outbound link for that node, then I should free my existing outbound link to it, and re-open a new one.

I think option 3 should work best without making too many changes to the current logic. But I'm open to suggestions.

@pieturin pieturin force-pushed the cluster-handshake-fix branch from 68324ec to f6eae88 Compare November 27, 2024 22:53
@pieturin
Copy link
Contributor Author

tests/unit/expire.tcl is currently failing in unstable too. I'll rebase once there is a fix.

@enjoy-binbin
Copy link
Member

#1368 the expire test is fixed

In some cases, when meeting a new node, if the handshake times out, we
can end up with an inconsistent view of the cluster where the new node
knows about all the nodes in the cluster, but the cluster does not know
about this new node (or vice versa).
To detect this inconsistency, we now check if a node has an outbound
link but no inbound link, in this case it probably means this node does
not know us. In this case we (re-)send a MEET packet to this node to do
a new handshake with it.

Signed-off-by: Pierre Turin <[email protected]>
Signed-off-by: Pierre Turin <[email protected]>
Update test to check node IDs instead of relying on number of words.
Rename nodeIsMeeting() to nodeInMeetState().
Introduce nodeInNormalState() macro.

Signed-off-by: Pierre Turin <[email protected]>
If we receive a MEET packet from a known node, we disconnect the
outbound link to force a reconnect and sending of a PING packet so that
the other node recognizes the link as belonging to us.

Also deflaked one of the tests. And improved testing code following PR
comment.

Signed-off-by: Pierre Turin <[email protected]>
Signed-off-by: Pierre Turin <[email protected]>
@pieturin pieturin force-pushed the cluster-handshake-fix branch from 2e8c6f1 to 7ae3b05 Compare November 28, 2024 19:48
@enjoy-binbin enjoy-binbin requested a review from PingXie November 30, 2024 09:02
Sometimes the outbound link from node 0 to node 1 can be disconnected.
Assert that node 0 know node 1 without expecting the node to be marked
as connected.

Signed-off-by: Pierre Turin <[email protected]>
@pieturin pieturin force-pushed the cluster-handshake-fix branch from 3b5aa41 to 6a017b0 Compare December 2, 2024 20:08
@hpatro
Copy link
Collaborator

hpatro commented Dec 3, 2024

@pieturin Could you update the top comment as well about the exact behavior change?

Signed-off-by: Pierre Turin <[email protected]>
Copy link
Collaborator

@hpatro hpatro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. But I would like someone else as well to look at it before merging this in.

@pieturin
Copy link
Contributor Author

pieturin commented Dec 3, 2024

Updated the PR description.

@pieturin
Copy link
Contributor Author

pieturin commented Dec 3, 2024

@PingXie, @madolson, @enjoy-binbin, any comments on this PR?

Copy link
Member

@enjoy-binbin enjoy-binbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Top comment and code LGTM, did not review the tests.

@enjoy-binbin enjoy-binbin changed the title [cluster-bus] Send a MEET packet to a node if there is no inbound link Send MEET packet to node if there is no inbound link to fix inconsistency when handshake timedout Dec 5, 2024
Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the comment isn't really a blocker for me, it's just about documentation.

@madolson madolson merged commit 5f7fe9e into valkey-io:unstable Dec 12, 2024
48 checks passed
Copy link
Member

@PingXie PingXie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to send out my (partial) review :)

Also I don't think I fully understand the problem that we are trying to solve and why the fix works.

@PingXie
Copy link
Member

PingXie commented Dec 12, 2024

Also I don't think I fully understand the problem that we are trying to solve

Do I understand correctly that we are trying to solve two problems here?

  1. inconsistent cluster topology within the existing node.

More specifically, a new node that hasn't gone through the full handshake process can still get gossiped around the cluster and then picked up by existing nodes that are not involved in the handshake process?

  • existing nodes in a cluster: a, b, c
  • new node: A
  • a is instructed to meet A.
  • the handshake process never completes but A's PONG response to a's MEET should make it to a
  • a then gossips A's id to b and c, which then add A to their cluster view - because the server doesn't check A's MEET state (CLUSTER_STATE_MEET) today.
  • now a thinks A is in the MEET state but b and c think A is normal and A sees no other node.
  1. cluster meet reliability

If A fails to reply PONG to a's MEET, both a and A should eventually remove each other from its cluster view respectively and A will never make to b and c. In this sense, the existing cluster, made up of a, b, and c, is still consistent.

and why the fix works.

The fix for the first problem makes sense to me: resending MEET after the handshake timeout.

I am not sure how/if the second problem is addressed by this PR? if both a and A remove each other from their cluster view respectively, how can the handshake process be restarted?

@pieturin
Copy link
Contributor Author

I am not sure how/if the second problem is addressed by this PR? if both a and A remove each other from their cluster view respectively, how can the handshake process be restarted?

If the handshake times out on both side of the meet, then, this fix won't do anything. This fix is only meant to make sure we never end up with an inconsistent view of the cluster (where one side knows the other nodes, but not the other side). This makes the meet handshake more reliable but not fool proof.
We could revisit the proposal to make MEET synchronous or sticky (see: redis/redis#11095), which would solve the second problem.

The inconsistent view of the cluster can happen in two different ways (that I could see):

  • The handshake succeeds on one node but times out on the other node. This is tested by the test Handshake eventually succeeds after node handshake timeout on one side with inconsistent view of the cluster.
            # Node 0 -- MEET -> Node 1
            # Node 0 <- PONG -- Node 1
            # Node 0 <- PING -- Node 1 [Node 0 will mark the handshake as successful]
            # Node 0 -- PONG -> Node 1 [we drop this message, so node 1 will eventually mark the handshake as timed out]
  • The handshake times out on both sides, but the new node learns about other nodes of the existing cluster through the gossip section in the MEET packet. In this case, even if the handshake times out, the new node will know some of the nodes from the existing cluster, but they will not know this new node. This is tested by the test Handshake eventually succeeds after node handshake timeout on both sides with inconsistent view of the cluster.
            # Node 1 -- MEET -> Node 0 [Node 0 might learn about Node 2 from the gossip section of the msg]
            # Node 1 <- PONG -- Node 0 [we drop this message, so Node 1 will eventually mark the handshake as timed out]
            # Node 1 <- PING -- Node 0 [we drop this message, so Node 1 will never send a PONG and Node 0 will eventually mark the handshake as timed out]
  • a then gossips A's id to b and c, which then add A to their cluster view - because the server doesn't check A's MEET state (CLUSTER_STATE_MEET) today.

We don't gossip a node if it's in HANDSHAKE state. See:

valkey/src/cluster_legacy.c

Lines 4039 to 4048 in 5f7fe9e

/* In the gossip section don't include:
* 1) Nodes in HANDSHAKE state.
* 3) Nodes with the NOADDR flag set.
* 4) Disconnected nodes if they don't have configured slots.
*/
if (this->flags & (CLUSTER_NODE_HANDSHAKE | CLUSTER_NODE_NOADDR) ||
(this->link == NULL && this->numslots == 0)) {
freshnodes--; /* Technically not correct, but saves CPU. */
continue;
}

I'll prepare a follow-up PR to address your comments @PingXie, thanks for the review!

enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Dec 13, 2024
After valkey-io#1307 got merged, we notice there is a assert happen in setClusterNodeToInboundClusterLink:
```
=== ASSERTION FAILED ===
==> '!link->node' is not true
```

In valkey-io#778, we will call setClusterNodeToInboundClusterLink to attach the node to the link
during the MEET processing, so if we receive a another MEET packet in a short time, the
node is still in handshake state, we will meet this assert and crash the server.

If the link is bound to a node and the node is in the handshake state, and we receive
a MEET packet, it may be that the sender sent multiple MEET packets when reconnecting,
and in here we are dropping the MEET. Note that in getNodeFromLinkAndMsg, the node in
the handshake state has a random name and not truly "known", so we don't know the sender.
Dropping the MEET packet can prevent us from creating a random node, avoid incorrect
link binding, and avoid duplicate MEET packet eliminate the handshake state.

Signed-off-by: Binbin <[email protected]>
@pieturin pieturin deleted the cluster-handshake-fix branch December 13, 2024 17:56
vudiep411 pushed a commit to Autxmaton/valkey that referenced this pull request Dec 15, 2024
…ency when handshake timedout (valkey-io#1307)

In some cases, when meeting a new node, if the handshake times out, we
can end up with an inconsistent view of the cluster where the new node
knows about all the nodes in the cluster, but the cluster does not know
about this new node (or vice versa).
To detect this inconsistency, we now check if a node has an outbound
link but no inbound link, in this case it probably means this node does
not know us. In this case we (re-)send a MEET packet to this node to do
a new handshake with it.
If we receive a MEET packet from a known node, we disconnect the
outbound link to force a reconnect and sending of a PING packet so that
the other node recognizes the link as belonging to us. This prevents
cases where a node could send MEET packets in a loop because it thinks
the other node does not have an inbound link.

This fixes the bug described in valkey-io#1251.

---------

Signed-off-by: Pierre Turin <[email protected]>
enjoy-binbin added a commit that referenced this pull request Dec 16, 2024
After #1307 got merged, we notice there is a assert happen in setClusterNodeToInboundClusterLink:
```
=== ASSERTION FAILED ===
==> '!link->node' is not true
```

In #778, we will call setClusterNodeToInboundClusterLink to attach the node to the link
during the MEET processing, so if we receive a another MEET packet in a short time, the
node is still in handshake state, we will meet this assert and crash the server.

If the link is bound to a node and the node is in the handshake state, and we receive
a MEET packet, it may be that the sender sent multiple MEET packets so in here we are
dropping the MEET to avoid the assert in setClusterNodeToInboundClusterLink. The assert
will happen if the other sends a MEET packet because it detects that there is no inbound
link, this node creates a new node in HANDSHAKE state (with a random node name), and
respond with a PONG. The other node receives the PONG and removes the CLUSTER_NODE_MEET
flag. This node is supposed to open an outbound connection to the other node in the next
cron cycle, but before this happens, the other node re-sends a MEET on the same link
because it still detects no inbound connection.

Note that in getNodeFromLinkAndMsg, the node in the handshake state has a random name
and not truly "known", so we don't know the sender. Dropping the MEET packet can prevent
us from creating a random node, avoid incorrect link binding, and avoid duplicate MEET
packet eliminate the handshake state.

Signed-off-by: Binbin <[email protected]>
@PingXie
Copy link
Member

PingXie commented Dec 16, 2024

If the handshake times out on both side of the meet, then, this fix won't do anything.

make sense

We don't gossip a node if it's in HANDSHAKE state.

Right but in this case A is in the MEET state from a's perspective so it gets picked/gossiped.

I'll prepare a follow-up PR to address your comments

Thanks. Will TAL.

@pieturin
Copy link
Contributor Author

Right but in this case A is in the MEET state from a's perspective so it gets picked/gossiped.

Ah yes, you're right, this is what should happen:

            # Node a -- MEET -> Node A [Node A has flags HANDSHAKE|MEET]
            # Node a <- PONG -- Node A [After receiving this packet, Node A has flag MEET]
            # Node a <- PING -- Node A [After receiving this packer, we clear the MEET flag for Node A]
            # Node a -- PONG -> Node A

now a thinks A is in the MEET state but b and c think A is normal and A sees no other node.

This is correct that a thinks A is in MEET state, and b and c can get gossip information about it. But A might know about b and c, from the gossip section of the MEET packet from a.

hwware pushed a commit that referenced this pull request Dec 30, 2024
Add `meet_sent` field in `clusterNode` indicating the last time we sent
a MEET packet. Use this field to only (re-)send a MEET packet once every
handshake timeout period when detecting a node without an inbound link.

When receiving multiple MEET packets on the same link while the node is
in handshake state, instead of dropping the packet, we now simply
prevent the creation of a new node. This way we still process the MEET
packet's gossip and reply with a PONG as any other packets.

Improve some logging messages to include `human_nodename`. Add
`nodeExceedsHandshakeTimeout()` function.

This is a follow-up to this previous PR:
#1307
And a partial fix to the crash described in:
#1436

---------

Signed-off-by: Pierre Turin <[email protected]>
kronwerk pushed a commit to kronwerk/valkey that referenced this pull request Jan 27, 2025
…ency when handshake timedout (valkey-io#1307)

In some cases, when meeting a new node, if the handshake times out, we
can end up with an inconsistent view of the cluster where the new node
knows about all the nodes in the cluster, but the cluster does not know
about this new node (or vice versa).
To detect this inconsistency, we now check if a node has an outbound
link but no inbound link, in this case it probably means this node does
not know us. In this case we (re-)send a MEET packet to this node to do
a new handshake with it.
If we receive a MEET packet from a known node, we disconnect the
outbound link to force a reconnect and sending of a PING packet so that
the other node recognizes the link as belonging to us. This prevents
cases where a node could send MEET packets in a loop because it thinks
the other node does not have an inbound link.

This fixes the bug described in valkey-io#1251.

---------

Signed-off-by: Pierre Turin <[email protected]>
kronwerk pushed a commit to kronwerk/valkey that referenced this pull request Jan 27, 2025
…o#1436)

After valkey-io#1307 got merged, we notice there is a assert happen in setClusterNodeToInboundClusterLink:
```
=== ASSERTION FAILED ===
==> '!link->node' is not true
```

In valkey-io#778, we will call setClusterNodeToInboundClusterLink to attach the node to the link
during the MEET processing, so if we receive a another MEET packet in a short time, the
node is still in handshake state, we will meet this assert and crash the server.

If the link is bound to a node and the node is in the handshake state, and we receive
a MEET packet, it may be that the sender sent multiple MEET packets so in here we are
dropping the MEET to avoid the assert in setClusterNodeToInboundClusterLink. The assert
will happen if the other sends a MEET packet because it detects that there is no inbound
link, this node creates a new node in HANDSHAKE state (with a random node name), and
respond with a PONG. The other node receives the PONG and removes the CLUSTER_NODE_MEET
flag. This node is supposed to open an outbound connection to the other node in the next
cron cycle, but before this happens, the other node re-sends a MEET on the same link
because it still detects no inbound connection.

Note that in getNodeFromLinkAndMsg, the node in the handshake state has a random name
and not truly "known", so we don't know the sender. Dropping the MEET packet can prevent
us from creating a random node, avoid incorrect link binding, and avoid duplicate MEET
packet eliminate the handshake state.

Signed-off-by: Binbin <[email protected]>
kronwerk pushed a commit to kronwerk/valkey that referenced this pull request Jan 27, 2025
…y-io#1441)

Add `meet_sent` field in `clusterNode` indicating the last time we sent
a MEET packet. Use this field to only (re-)send a MEET packet once every
handshake timeout period when detecting a node without an inbound link.

When receiving multiple MEET packets on the same link while the node is
in handshake state, instead of dropping the packet, we now simply
prevent the creation of a new node. This way we still process the MEET
packet's gossip and reply with a PONG as any other packets.

Improve some logging messages to include `human_nodename`. Add
`nodeExceedsHandshakeTimeout()` function.

This is a follow-up to this previous PR:
valkey-io#1307
And a partial fix to the crash described in:
valkey-io#1436

---------

Signed-off-by: Pierre Turin <[email protected]>
@madolson madolson added the release-notes This issue should get a line item in the release notes label Apr 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes This issue should get a line item in the release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Clusters can become inconsistent if one side times out the handshake during cluster meets

5 participants