Hi etcd community:
One of our production 3-member etcd clusters had two leaders at the same time. During this event, the droplet (the physical host on which the EC2 instances run) behind the old leader (A) was unavailable, and the associated EBS volume failed to serve WAL fsync.
Ideally, the MsgCheckQuorum raft message should be raised once the election timeout elapses, and RecentActive should be false for every peer from which no recent heartbeat response or MsgAppResp has reached the leader, in which case the isolated leader steps down to follower.
Lines 1000 to 1020 in 161bf7e
    case pb.MsgCheckQuorum:
        // The leader should always see itself as active. As a precaution, handle
        // the case in which the leader isn't in the configuration any more (for
        // example if it just removed itself).
        //
        // TODO(tbg): I added a TODO in removeNode, it doesn't seem that the
        // leader steps down when removing itself. I might be missing something.
        if pr := r.prs.Progress[r.id]; pr != nil {
            pr.RecentActive = true
        }
        if !r.prs.QuorumActive() {
            r.logger.Warningf("%x stepped down to follower since quorum is not active", r.id)
            r.becomeFollower(r.Term, None)
        }
        // Mark everyone (but ourselves) as inactive in preparation for the next
        // CheckQuorum.
        r.prs.Visit(func(id uint64, pr *tracker.Progress) {
            if id != r.id {
                pr.RecentActive = false
            }
        })
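To make the expected behavior concrete, here is a minimal sketch of the check above as a toy model. The `leader` and `progress` types and the `newLeader` helper are hypothetical stand-ins, not the etcd raft API; the sketch only assumes the majority rule and the reset of the activity flags shown in the snippet.

```go
package main

import "fmt"

// progress is a toy stand-in for tracker.Progress; only the field used here.
type progress struct{ recentActive bool }

// leader is a hypothetical, simplified model of a leader's view of its peers.
type leader struct {
	id    uint64
	peers map[uint64]*progress
}

// quorumActive reports whether a majority of members were recently active.
func (l *leader) quorumActive() bool {
	active := 0
	for _, pr := range l.peers {
		if pr.recentActive {
			active++
		}
	}
	return active >= len(l.peers)/2+1
}

// checkQuorum mirrors the MsgCheckQuorum case: it returns false when the
// leader should step down, then resets the flags for the next round.
func (l *leader) checkQuorum() bool {
	if pr := l.peers[l.id]; pr != nil {
		pr.recentActive = true // the leader always counts itself
	}
	ok := l.quorumActive()
	for id, pr := range l.peers {
		if id != l.id {
			pr.recentActive = false
		}
	}
	return ok
}

// newLeader builds a 3-member cluster led by member 1, every peer active.
func newLeader() *leader {
	return &leader{id: 1, peers: map[uint64]*progress{
		1: {recentActive: true}, 2: {recentActive: true}, 3: {recentActive: true},
	}}
}

func main() {
	l := newLeader()
	fmt.Println(l.checkQuorum())   // true: everyone answered in the last window
	l.peers[2].recentActive = true // one follower responds: 2 of 3 is a quorum
	fmt.Println(l.checkQuorum())   // true
	fmt.Println(l.checkQuorum())   // false: nobody responded since the reset
}
```

In this model an isolated leader fails the first check that runs after its peers stop responding, but only if the check actually runs.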
However, the tick is never processed, because the disk write stalls inside the raft Ready handling logic, which shares a select loop with the ticker:
Lines 170 to 173 in 161bf7e
    select {
    case <-r.ticker.C:
        r.tick()
    case rd := <-r.Ready():
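The starvation can be sketched with a toy loop of the same shape. Everything below is a hypothetical model, not etcd code: `node`, `fire`, `step`, and `stalledTicks` are made-up names, the capacity-1 tick channel is an assumption that approximates how `time.Ticker` drops ticks for slow receivers, and the ticker firing during the stall is simulated from inside the `save` callback so that the result is deterministic.

```go
package main

import "fmt"

// node is a toy stand-in for the raft node loop; it is not etcd's raftNode.
type node struct {
	tickC   chan struct{} // capacity 1: extra ticks are dropped (assumption)
	readyC  chan struct{}
	elapsed int    // ticks actually processed by the loop
	save    func() // stand-in for the WAL save done while handling Ready
}

// fire mimics the ticker delivering a tick with a non-blocking send.
func (n *node) fire() bool {
	select {
	case n.tickC <- struct{}{}:
		return true
	default:
		return false // the loop is busy, so this tick is lost
	}
}

// step runs one iteration of the select loop.
func (n *node) step() {
	select {
	case <-n.tickC:
		n.elapsed++
	case <-n.readyC:
		n.save() // blocks the loop for as long as the disk stalls
	}
}

// stalledTicks reports how many of `fired` ticks are processed and how many
// are dropped when all of them fire during a single stalled save.
func stalledTicks(fired int) (processed, dropped int) {
	n := &node{tickC: make(chan struct{}, 1), readyC: make(chan struct{}, 1)}
	n.save = func() {
		for i := 0; i < fired; i++ {
			if !n.fire() {
				dropped++
			}
		}
	}
	n.readyC <- struct{}{}
	n.step() // handles Ready; the ticker fires `fired` times meanwhile
	for len(n.tickC) > 0 {
		n.step() // drain whatever tick survived the stall
	}
	return n.elapsed, dropped
}

func main() {
	processed, dropped := stalledTicks(10)
	fmt.Printf("processed=%d dropped=%d\n", processed, dropped) // processed=1 dropped=9
}
```

In this model, ten election ticks elapse in wall-clock terms while the loop observes only one, so the elapsed counter that would trigger the quorum check barely advances during the stall.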
This can be easily reproduced as follows.
Isolated old Leader A is still the leader from its point of view
# run A with the failpoint enabled
[root@ip-10-0-61-148 bin]# etcdctl -w table endpoint status
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | a320313566418492 | 3.4.18 | 20 kB | true | false | 35 | 1741005 | 1741005 | |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@ip-10-0-61-148 bin]# curl http://127.0.0.1:1234/go.etcd.io/etcd/etcdserver/raftAfterSave -XPUT -d'sleep(600000)'
[root@ip-10-0-61-148 bin]# iptables -A INPUT -s 10.0.123.82 -j DROP && iptables -A INPUT -s 10.0.171.218 -j DROP
[root@ip-10-0-61-148 bin]# curl -sL http://localhost:2379/metrics | grep "is_leader"
# HELP etcd_server_is_leader Whether or not this member is a leader. 1 if is, 0 otherwise.
# TYPE etcd_server_is_leader gauge
etcd_server_is_leader 1
[root@ip-10-0-61-148 bin]# etcdctl -w table endpoint status
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | a320313566418492 | 3.4.18 | 20 kB | true | false | 35 | 1741050 | 1741050 | |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Follower B (after the network partition is injected, it becomes the leader)
[root@ip-10-0-171-218 bin]# etcdctl -w table endpoint status
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 6e05b88806758f58 | 3.4.17 | 20 kB | false | false | 35 | 1741005 | 1741005 | |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@ip-10-0-171-218 bin]# iptables -A INPUT -s 10.0.61.148 -j DROP
[root@ip-10-0-171-218 bin]# etcdctl -w table endpoint status
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 6e05b88806758f58 | 3.4.17 | 20 kB | true | false | 36 | 1741062 | 1741062 | |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Follower C
etcdctl -w table endpoint status
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 70ddedef0fd6218 | 3.4.17 | 20 kB | false | false | 35 | 1741005 | 1741005 | |
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@ip-10-0-123-82 bin]# iptables -A INPUT -s 10.0.61.148 -j DROP
[root@ip-10-0-123-82 bin]# etcdctl -w table endpoint status
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 70ddedef0fd6218 | 3.4.17 | 20 kB | false | false | 36 | 1741058 | 1741058 | |
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Questions:
- Is it required that r.tick() and rd := <-r.Ready() be executed mutually exclusively? Otherwise, we could move rd := <-r.Ready() into a separate infinite loop running as a background goroutine.
- Is the above behavior expected from a raft design perspective?
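To illustrate the first question, here is a rough sketch of the separation, assuming ticks and Ready batches are consumed by two independent goroutines. `splitLoop` and its channels are hypothetical and only model the scheduling; the sketch says nothing about whether such a split is safe for raft, which is exactly what is being asked.

```go
package main

import (
	"fmt"
	"sync"
)

// splitLoop delivers `ticks` ticks while a Ready handler is stuck in a
// simulated disk stall, and returns how many ticks were processed.
// It is a hypothetical model, not etcd code.
func splitLoop(ticks int) (processed int) {
	tickC := make(chan struct{})
	readyC := make(chan struct{})
	release := make(chan struct{})     // closed when the simulated stall ends
	saveStarted := make(chan struct{}) // closed once the handler is stuck
	var wg sync.WaitGroup

	wg.Add(2)
	// Background goroutine that only handles Ready batches.
	go func() {
		defer wg.Done()
		for range readyC {
			close(saveStarted)
			<-release // simulated stalled WAL fsync
		}
	}()
	// Goroutine that only handles ticks.
	go func() {
		defer wg.Done()
		for range tickC {
			processed++
		}
	}()

	readyC <- struct{}{}
	<-saveStarted // the Ready handler is now blocked on the "disk"
	for i := 0; i < ticks; i++ {
		tickC <- struct{}{} // still received while the save is stalled
	}
	close(tickC)
	close(release)
	close(readyC)
	wg.Wait()
	return processed
}

func main() {
	fmt.Println(splitLoop(10)) // 10
}
```

Here every tick is processed even though the save never finishes in the meantime, so in this model the quorum check would still get a chance to run on the isolated leader.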
@gyuho @ptabor @serathius @hexfusion @wilsonwang371 PTAL, thx!