[NEW] Faster cluster failover #2023

@zuiderkwast

On very fast networks, we don't need the hard-coded 500 ms delay. Can we make these hard-coded numbers relative to the configured node timeout?

This is to have less downtime during an automatic failover.

        server.cluster->failover_auth_time = now +
                                             500 +           /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
                                             random() % 500; /* Random delay between 0 and 500 milliseconds. */
        /* We add another delay that is proportional to the replica rank.
         * Specifically 1 second * rank. This way replicas that have a probably
         * less updated replication offset, are penalized. */
        server.cluster->failover_auth_time += server.cluster->failover_auth_rank * 1000;
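One way the proposal could look, as a rough sketch rather than an actual patch: derive all three delays from the node timeout and cap them at today's values. The names `failover_delays`, `failover_delays_for` and `failover_auth_time_for`, the divisor 30, and the lower bound of 1 ms are assumptions made for illustration. The divisor is chosen so that the default node timeout of 15000 ms reproduces the current 500 ms / 500 ms / 1000 ms delays.

```c
/* Hypothetical helper types and functions, not part of the server source. */
typedef struct {
    long long fixed_ms;      /* Fixed delay, lets the FAIL message propagate. */
    long long random_max_ms; /* Exclusive upper bound of the random delay. */
    long long per_rank_ms;   /* Extra delay per replica rank. */
} failover_delays;

/* Scale the delays with the node timeout, never exceeding today's values.
 * Assumption: node_timeout_ms / 30 maps the default 15000 ms to 500 ms. */
static failover_delays failover_delays_for(long long node_timeout_ms) {
    failover_delays d;
    long long base = node_timeout_ms / 30;
    if (base > 500) base = 500; /* Keep the current behavior as the ceiling. */
    if (base < 1) base = 1;     /* Avoid a zero modulus for tiny timeouts. */
    d.fixed_ms = base;
    d.random_max_ms = base;
    d.per_rank_ms = base * 2;   /* Preserves the current 1000:500 ratio. */
    return d;
}

/* rnd is a non-negative random number, e.g. the result of random(). It is a
 * parameter here only so the calculation is deterministic and testable. */
static long long failover_auth_time_for(long long now_ms,
                                        long long node_timeout_ms,
                                        int rank,
                                        long long rnd) {
    failover_delays d = failover_delays_for(node_timeout_ms);
    return now_ms + d.fixed_ms + (rnd % d.random_max_ms) +
           (long long)rank * d.per_rank_ms;
}
```

With a node timeout of 3000 ms this gives a fixed delay of 100 ms, a random delay below 100 ms, and 200 ms per rank, so a rank-0 replica would request votes after at most 200 ms instead of up to 1 second. Whether the FAIL message reliably propagates within such a short window is exactly the open question.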

@madolson @enjoy-binbin @hpatro Am I missing anything?
