Commit 2ca0dd8
Make cluster failover delay relative to node timeout (#2449)
In clusters with a very short node timeout, such as 2-3 seconds, the
extra failover delay of 500-1000 milliseconds (500 + random value 0-500;
750 on average) before initiating a failover adds significant extra
downtime to the cluster. This PR makes this delay relative to the node
timeout, using a shorter failover delay for a smaller configured node
timeout. The formula is `fixed_delay = min(500, node_timeout / 30)`.
| Node timeout (ms) | Fixed failover delay (ms) |
|-------------------|---------------------------|
| 15000 or more     | 500 (same as before)      |
| 7500              | 250                       |
| 3000              | 100                       |
| 1500              | 50                        |
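The formula and table above can be reproduced with a minimal sketch. The function name `fixed_failover_delay_ms` is hypothetical and only for illustration, and the use of integer division is an assumption; the actual change lives in the cluster's C code.

```python
def fixed_failover_delay_ms(node_timeout_ms: int) -> int:
    """Fixed part of the failover delay in milliseconds (illustrative only)."""
    # fixed_delay = min(500, node_timeout / 30), as stated in the commit message.
    # Integer division is an assumption made for this sketch.
    return min(500, node_timeout_ms // 30)


# Print the rows of the table from the commit message.
for timeout in (15000, 7500, 3000, 1500):
    print(timeout, fixed_failover_delay_ms(timeout))
```

Any node timeout of 15000 ms or more hits the 500 ms cap, so such clusters keep the previous fixed delay.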
Additional change: Add an extra 500 ms delay for new replicas that may
not yet know about the other replicas. This avoids the scenario where a
new replica with no data wins the failover. This change turned out to be
needed for the stability of some test cases.
The purposes of the failover delay are:
1. Allow FAIL to propagate to the voting primaries in the cluster.
2. Allow replicas to exchange their offsets, so they will have a correct
   view of their own rank.
A third (undocumented) purpose of this delay is to allow newly added
replicas to discover other replicas in the cluster via gossip and to
compute their rank, so they can realize they are not the best replica.
This case is mitigated by adding another 500 ms delay to new replicas,
i.e. replicas that have replication offset 0.
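Putting the pieces described so far together, the delay before a replica starts a failover could be sketched as follows. This is only an illustration under stated assumptions: the names `failover_delay_ms`, `jitter_max_ms`, and `NEW_REPLICA_EXTRA_MS` are hypothetical, and the random range defaults to the previous 0-500 ms because the commit message does not say whether the random part is scaled as well.

```python
import random

NEW_REPLICA_EXTRA_MS = 500  # extra delay for replicas with replication offset 0


def failover_delay_ms(node_timeout_ms, repl_offset, jitter_max_ms=500, rng=random):
    """Illustrative total delay before a replica initiates a failover."""
    # Fixed part, relative to node timeout: min(500, node_timeout / 30).
    fixed = min(500, node_timeout_ms // 30)
    # Random part; the range is an assumption kept as a parameter.
    jitter = rng.randint(0, jitter_max_ms)
    # A new replica (offset 0) waits longer so a replica with data can win.
    extra = NEW_REPLICA_EXTRA_MS if repl_offset == 0 else 0
    return fixed + jitter + extra


# With a 3000 ms node timeout, a replica with data waits 100-600 ms,
# while a brand-new replica waits 600-1100 ms.
print(failover_delay_ms(3000, repl_offset=1234))
print(failover_delay_ms(3000, repl_offset=0))
```

The sketch leaves out any rank-based delay, since the commit message does not describe how rank enters the calculation.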
A low node timeout only makes sense in fast networks, so we can assume
that the above needs less time than in a cluster with a higher node
timeout.
These delays don't affect the correctness of the algorithm. They are
just there to increase the probability that a failover will succeed. The
fixed part makes sure that the FAIL message has enough time to propagate
in the cluster, and the random part reduces the probability that two
replicas initiate the failover at the same time.
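To give a feel for what the random part buys, the following sketch computes how likely two replicas are to pick start times close to each other. It rests on modeling assumptions that are not part of the commit: each replica draws an independent, uniformly distributed integer jitter in `0..jitter_max_ms`, and a "collision" means the two draws differ by at most `window_ms`. The function name is hypothetical.

```python
from fractions import Fraction


def collision_probability(jitter_max_ms: int, window_ms: int = 0) -> Fraction:
    """Probability that two independent jitters differ by at most window_ms."""
    n = jitter_max_ms + 1  # number of possible integer jitter values
    close_pairs = n  # ordered pairs with identical jitter
    # For each difference d >= 1 there are 2 * (n - d) ordered pairs.
    for d in range(1, min(window_ms, jitter_max_ms) + 1):
        close_pairs += 2 * (n - d)
    return Fraction(close_pairs, n * n)


print(collision_probability(500))      # identical millisecond
print(collision_probability(500, 10))  # within 10 ms of each other
print(collision_probability(0))        # no random part at all
```

Under this simple model, removing the random part entirely makes a simultaneous start certain, which matches the role the commit message gives to the jitter.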
The typical use case is when data consistency matters and writes can't
be skipped. For example, in some applications, we buffer writes in the
application during node failures to be able to apply them when the
failover is completed. The application can't buffer them for a very long
time, so we need the cluster to be up again within e.g. 5 seconds from
the time a node starts to fail.
I hope this PR can be considered safer than #2227, although the two
changes are orthogonal.
Part of issue #2023.
---------
Signed-off-by: Viktor Söderqvist <[email protected]>

1 parent 7a5c0d0 commit 2ca0dd8
1 file changed, +19 -8 lines changed