-
Notifications
You must be signed in to change notification settings - Fork 955
Save config file and brocast the PONG when configEpoch changed #1813
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Save config file and brocast the PONG when configEpoch changed #1813
Conversation
This is somehow related with valkey-io#974 and valkey-io#1777. When the epoch changes, we should save the configuration file and broadcast a PONG as much as possible. For example, if a primary down after bumping the epoch, its replicas may initiate a failover, but the other primaries may refuse to vote because the epoch of the replica has not been updated. Or for example, for some reasons we bump the epoch, if the epoch is not updated in time in the cluster, it may affect the judgment of message staleness. Signed-off-by: Binbin <[email protected]>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #1813 +/- ##
============================================
+ Coverage 70.97% 71.03% +0.05%
============================================
Files 123 123
Lines 65651 65665 +14
============================================
+ Hits 46593 46642 +49
+ Misses 19058 19023 -35
🚀 New features to boost your workflow:
|
madolson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These broadcasts are expensive in large clusters, but none of these seem high frequency. So this seems OK to me. Minor suggestion.
Signed-off-by: Binbin <[email protected]>
This is somehow related with #974 and #1777. When the epoch changes, we should save the configuration file and broadcast a PONG as much as possible. For example, if a primary down after bumping the epoch, its replicas may initiate a failover, but the other primaries may refuse to vote because the epoch of the replica has not been updated. Or for example, for some reasons we bump the epoch, if the epoch is not updated in time in the cluster, it may affect the judgment of message staleness. These broadcasts are expensive in large clusters, but none of these seem high frequency so it should be fine. --------- Signed-off-by: Binbin <[email protected]>
…y-io#1813) This is somehow related with valkey-io#974 and valkey-io#1777. When the epoch changes, we should save the configuration file and broadcast a PONG as much as possible. For example, if a primary down after bumping the epoch, its replicas may initiate a failover, but the other primaries may refuse to vote because the epoch of the replica has not been updated. Or for example, for some reasons we bump the epoch, if the epoch is not updated in time in the cluster, it may affect the judgment of message staleness. These broadcasts are expensive in large clusters, but none of these seem high frequency so it should be fine. --------- Signed-off-by: Binbin <[email protected]>
…y-io#1813) This is somehow related with valkey-io#974 and valkey-io#1777. When the epoch changes, we should save the configuration file and broadcast a PONG as much as possible. For example, if a primary down after bumping the epoch, its replicas may initiate a failover, but the other primaries may refuse to vote because the epoch of the replica has not been updated. Or for example, for some reasons we bump the epoch, if the epoch is not updated in time in the cluster, it may affect the judgment of message staleness. These broadcasts are expensive in large clusters, but none of these seem high frequency so it should be fine. --------- Signed-off-by: Binbin <[email protected]>
…y-io#1813) This is somehow related with valkey-io#974 and valkey-io#1777. When the epoch changes, we should save the configuration file and broadcast a PONG as much as possible. For example, if a primary down after bumping the epoch, its replicas may initiate a failover, but the other primaries may refuse to vote because the epoch of the replica has not been updated. Or for example, for some reasons we bump the epoch, if the epoch is not updated in time in the cluster, it may affect the judgment of message staleness. These broadcasts are expensive in large clusters, but none of these seem high frequency so it should be fine. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Shai Zarka <[email protected]>
…y-io#1813) This is somehow related with valkey-io#974 and valkey-io#1777. When the epoch changes, we should save the configuration file and broadcast a PONG as much as possible. For example, if a primary down after bumping the epoch, its replicas may initiate a failover, but the other primaries may refuse to vote because the epoch of the replica has not been updated. Or for example, for some reasons we bump the epoch, if the epoch is not updated in time in the cluster, it may affect the judgment of message staleness. These broadcasts are expensive in large clusters, but none of these seem high frequency so it should be fine. --------- Signed-off-by: Binbin <[email protected]>
When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. Signed-off-by: Binbin <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]>
When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see #1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes #2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]>
…ey-io#2178) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]>
When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see #1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes #2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]>
…ey-io#2178) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> Signed-off-by: chzhoo <[email protected]>
…ey-io#2178) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> (cherry picked from commit 476671b)
…ey-io#2178) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> (cherry picked from commit 476671b)
…ey-io#2178) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> Signed-off-by: shanwan1 <[email protected]>
…ated (valkey-io#2178) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. Signed-off-by: Ran Shidlansik <[email protected]>
…dated (#2178) to 7.2 (#2232) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see #1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes #2169. --------- Signed-off-by: Ran Shidlansik <[email protected]>
…ey-io#2178) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> (cherry picked from commit 476671b)
…ey-io#2178) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> (cherry picked from commit 476671b) Signed-off-by: Viktor Söderqvist <[email protected]>
…ey-io#2178) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> (cherry picked from commit 476671b) Signed-off-by: Viktor Söderqvist <[email protected]>
When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see #1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes #2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> (cherry picked from commit 476671b) Signed-off-by: Viktor Söderqvist <[email protected]>
…ey-io#2178) When the primary changes the config epoch and then down immediately, the replica may not update the config epoch in time. Although we will broadcast the change in cluster (see valkey-io#1813), there may be a race in the network or in the code. In this case, the replica will never finish the failover since other primaries will refuse to vote because the replica's slot config epoch is old. We need a way to allow the replica can finish the failover in this case. When the primary refuses to vote because the replica's config epoch is less than the dead primary's config epoch, it can send an UPDATE packet to the replica to inform the replica about the dead primary. The UPDATE message contains information about the dead primary's config epoch and owned slots. The failover will time out, but later the replica can try again with the updated config epoch and it can succeed. Fixes valkey-io#2169. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Harkrishn Patro <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Harkrishn Patro <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> (cherry picked from commit 476671b) Signed-off-by: Viktor Söderqvist <[email protected]>
This is somehow related with #974 and #1777. When the epoch changes,
we should save the configuration file and broadcast a PONG as much
as possible.
For example, if a primary down after bumping the epoch, its replicas
may initiate a failover, but the other primaries may refuse to vote
because the epoch of the replica has not been updated.
Or for example, for some reasons we bump the epoch, if the epoch
is not updated in time in the cluster, it may affect the judgment
of message staleness.
These broadcasts are expensive in large clusters, but none of these
seem high frequency so it should be fine.