Skip to content

Commit 4cdedce

Browse files
chore: lint
1 parent 528efc0 commit 4cdedce

3 files changed

Lines changed: 252 additions & 1 deletion

File tree

Lines changed: 190 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,190 @@
1+
# Benchlist EWMA Thread: `not a validator` Investigation
2+
3+
PR: <https://github.com/ava-labs/avalanchego/pull/4982>
4+
Date investigated: 2026-02-19
5+
6+
## 1. Exact CI run where the quoted failure appears
7+
8+
The exact log line from the Slack thread is in:
9+
10+
- Run: <https://github.com/ava-labs/avalanchego/actions/runs/22150593840>
11+
- Job: `e2e_existing_network`
12+
<https://github.com/ava-labs/avalanchego/actions/runs/22150593840/job/64040542894>
13+
- Artifact: `e2e-existing-network-tmpnet-data`
14+
<https://github.com/ava-labs/avalanchego/actions/runs/22150593840/artifacts/5559966692>
15+
- Artifact zip API URL (requires GitHub auth):
16+
<https://api.github.com/repos/ava-labs/avalanchego/actions/artifacts/5559966692/zip>
17+
18+
Inside that artifact:
19+
20+
- `networks/20260218-173951.147867-avalanchego-e2e/NodeID-BVk3Gf3Rwr5XS7WN7v4ZfjYrkjj7BcMTS/logs/C.log:3220`
21+
- `networks/20260218-173951.147867-avalanchego-e2e/NodeID-BVk3Gf3Rwr5XS7WN7v4ZfjYrkjj7BcMTS/logs/C.log:3221`
22+
23+
Those are the exact `handlerID: 1` + `app_error ... "not a validator"` lines discussed in thread.
24+
25+
## 2. Other run context for PR #4982
26+
27+
Tests workflow runs on `aaronbuchwald/benchlist-ewma`:
28+
29+
- <https://github.com/ava-labs/avalanchego/actions/runs/22187381614> (failed)
30+
- <https://github.com/ava-labs/avalanchego/actions/runs/22153325396> (passed)
31+
- <https://github.com/ava-labs/avalanchego/actions/runs/22150593840> (failed)
32+
- <https://github.com/ava-labs/avalanchego/actions/runs/22117974408> (failed)
33+
- <https://github.com/ava-labs/avalanchego/actions/runs/22117044316> (failed)
34+
- <https://github.com/ava-labs/avalanchego/actions/runs/22116636267> (cancelled)
35+
- <https://github.com/ava-labs/avalanchego/actions/runs/22115556838> (failed)
36+
- <https://github.com/ava-labs/avalanchego/actions/runs/22114576701> (failed)
37+
- <https://github.com/ava-labs/avalanchego/actions/runs/22114209216> (cancelled)
38+
- <https://github.com/ava-labs/avalanchego/actions/runs/22106616407> (failed)
39+
40+
## 3. Log evidence and what it shows
41+
42+
From the same `C.log` around the failure:
43+
44+
- `.../C.log:3101` shows peer added (`adding new peer ... NodeID-B4F...`)
45+
- `.../C.log:3106` shows peer disconnected (`disconnecting peer ... NodeID-B4F...`)
46+
- `.../C.log:3220` shows `handlerID: 1` request rejected as `not a validator`
47+
- `.../C.log:3221` emits the `app_error` response
48+
49+
In this log, all observed `handlerID: 1` + `not a validator` failures happened while the most recent connection state for that peer was `disconnected`.
50+
51+
Within the same run (`22150593840`), the same signature appears in multiple uploaded artifacts:
52+
53+
- `e2e-existing-network-tmpnet-data` (110 `app_error: "not a validator"` lines)
54+
<https://github.com/ava-labs/avalanchego/actions/runs/22150593840/artifacts/5559966692>
55+
- `e2e-post-latest-tmpnet-data` (206 lines)
56+
<https://github.com/ava-labs/avalanchego/actions/runs/22150593840/artifacts/5560098491>
57+
- `e2e-ubuntu-latest-tmpnet-data` (168 lines)
58+
<https://github.com/ava-labs/avalanchego/actions/runs/22150593840/artifacts/5560110597>
59+
- `e2e-custom-arm64-noble-tmpnet-data` (189 lines)
60+
<https://github.com/ava-labs/avalanchego/actions/runs/22150593840/artifacts/5560136359>
61+
62+
## 4. Code paths (GitHub permalinks)
63+
64+
The relevant behavior on PR head commit (`bad89c8d15bed10c780819a3dee8e27618a11c17`):
65+
66+
- Atomic gossip uses handler ID `1`:
67+
- <https://github.com/ava-labs/avalanchego/blob/bad89c8d15bed10c780819a3dee8e27618a11c17/graft/coreth/plugin/evm/atomic/vm/vm.go#L195>
68+
- <https://github.com/ava-labs/avalanchego/blob/bad89c8d15bed10c780819a3dee8e27618a11c17/graft/coreth/plugin/evm/atomic/vm/vm.go#L289-L290>
69+
- Pull gossip is wrapped in a validator-only gossiper:
70+
- <https://github.com/ava-labs/avalanchego/blob/bad89c8d15bed10c780819a3dee8e27618a11c17/network/p2p/gossip/system.go#L163-L167>
71+
- <https://github.com/ava-labs/avalanchego/blob/bad89c8d15bed10c780819a3dee8e27618a11c17/network/p2p/gossip/gossip.go#L196-L201>
72+
- Inbound request rejection happens here:
73+
- <https://github.com/ava-labs/avalanchego/blob/bad89c8d15bed10c780819a3dee8e27618a11c17/network/p2p/handler.go#L210-L213>
74+
- At that commit, `Has` checked connected validators:
75+
- <https://github.com/ava-labs/avalanchego/blob/bad89c8d15bed10c780819a3dee8e27618a11c17/network/p2p/validators.go#L197-L205>
76+
77+
## 5. Root cause
78+
79+
Root cause is connection-churn/asymmetric connection state, not atomic gossip handler misconfiguration:
80+
81+
- Sender can still have a peer selected for request while receiver has already marked sender disconnected.
82+
- Receiver-side validator gate (`ValidatorHandler.AppRequest`) checks `Has(...)` and returns `ErrNotValidator`.
83+
- Because `Has(...)` was tied to **connected** validator state, transient disconnects produce `not a validator` app errors for validator peers.
84+
85+
## 6. Proposed fix + deterministic test
86+
87+
Local patch prepared:
88+
89+
- `network/p2p/validators.go:197`
90+
`Has(...)` changed to validator membership (`validatorSet.Contains`) rather than connected subset.
91+
- `network/p2p/validators_test.go:375`
92+
Added deterministic test `TestValidatorsHasUsesValidatorMembership` (fails on old behavior, passes with fix).
93+
- `network/p2p/network_test.go:972`
94+
Updated expectations to match new `Has` semantics.
95+
96+
## 7. Validation run
97+
98+
Executed locally:
99+
100+
- `GOCACHE=/tmp/go-build go test ./network/p2p/...`
101+
- `GOCACHE=/tmp/go-build go test ./network/p2p -run 'TestValidatorsHasUsesValidatorMembership|TestNetworkValidators_ConnectAndDisconnect' -count=1`
102+
- `GOCACHE=/tmp/go-build go test ./graft/coreth/plugin/evm/atomic/vm -run TestAtomicTxGossip -count=1`
103+
104+
All passed.
105+
106+
## 8. Historical introduction analysis
107+
108+
### 8.0 Timeline (quick reference)
109+
110+
| Date (UTC-4) | Type | Ref | What changed / observed |
111+
|---|---|---|---|
112+
| 2023-08-23 | Commit | [`a3c46496c2`](https://github.com/ava-labs/avalanchego/commit/a3c46496c2bf7a17d019b96c82be24994da4efa9) | Introduced SDK validator sampling; `Validators.Has` used validator membership semantics. |
113+
| 2023-08-28 | Commit | [`21e7a656a0`](https://github.com/ava-labs/avalanchego/commit/21e7a656a0bb4235839fef964e7c81d7be35168b) | Added validator-only inbound filter (`ValidatorHandler`) that gates on `Has(...)`. |
114+
| 2023-11-30 | Commit | [`de3b16cac8`](https://github.com/ava-labs/avalanchego/commit/de3b16cac8a7f61c92b7c695599071de482d709f) | **Semantic regression introduced**: `Has(...)` changed from validator membership to connected-validator semantics (`peers.has(...) && validatorIDs.Contains(...)`). |
115+
| 2024-03-12 | Commit | [`12cd5ec53a`](https://github.com/ava-labs/avalanchego/commit/12cd5ec53a6fc6226256565c8048b79d3aae76eb) | Reworked validator ranking/sampling for stake-weighted gossip; preserved connected-validator `Has(...)` semantics. |
116+
| 2024-07-04 | Commit | [`45a30f9868`](https://github.com/ava-labs/avalanchego/commit/45a30f9868714d0afea39d3d948ec1dbbfadfc1c) | Switched SDK to send AppErrors, making these rejections clearly visible as `app_error` logs. |
117+
| 2025-08-25 | Commit | [`cc3242f48a`](https://github.com/ava-labs/avalanchego/commit/cc3242f48a6262c52057e98a85d6cf8184dfaca3) | Refactor preserved connected-validator `Has(...)` semantics (`connectedValidators.Contains`). |
118+
| 2026-02-17 | CI run | [`22101996337`](https://github.com/ava-labs/avalanchego/actions/runs/22101996337) | Pre-EWMA base branch run sampled in this analysis: no `not a validator` app_error lines found. |
119+
| 2026-02-17 | Commit | [`be487be128`](https://github.com/ava-labs/avalanchego/commit/be487be12844338e316f54aff9399325d6a771c0) | EWMA benching logic on PR branch (`Add back EWMA benching logic`). |
120+
| 2026-02-17 | CI run | [`22106616407`](https://github.com/ava-labs/avalanchego/actions/runs/22106616407) | First EWMA run sampled: `not a validator` app_error pattern present. |
121+
| 2026-02-18 | CI run | [`22150593840`](https://github.com/ava-labs/avalanchego/actions/runs/22150593840) | Thread’s exact atomic gossip (`handlerID: 1`) instance captured and linked above. |
122+
123+
### 8.1 Underlying code regression (introduced historically)
124+
125+
The semantic change that introduced this bug pattern is in:
126+
127+
- Commit: <https://github.com/ava-labs/avalanchego/commit/de3b16cac8a7f61c92b7c695599071de482d709f>
128+
(`Add p2p.Network component (#2283)`, 2023-11-30)
129+
130+
Before that change, `Validators.Has` checked validator membership only:
131+
132+
- <https://github.com/ava-labs/avalanchego/blob/de3b16cac8a7f61c92b7c695599071de482d709f^/network/p2p/validators.go#L83-L90>
133+
134+
After that change, `Validators.Has` was changed to require both:
135+
- peer is currently connected
136+
- and in validator set
137+
138+
- <https://github.com/ava-labs/avalanchego/blob/de3b16cac8a7f61c92b7c695599071de482d709f/network/p2p/validators.go#L103-L110>
139+
140+
This change appears intentional for validator sampling semantics (comments in the same commit explicitly describe connected-validator behavior), but it had an unintended coupling effect on inbound request validation.
141+
142+
- `Sample returns a random sample of connected validators`:
143+
<https://github.com/ava-labs/avalanchego/blob/de3b16cac8a7f61c92b7c695599071de482d709f/network/p2p/validators.go#L82-L100>
144+
145+
`ValidatorHandler` was already using `Has(...)` for inbound request filtering:
146+
147+
- <https://github.com/ava-labs/avalanchego/blob/21e7a656a0bb4235839fef964e7c81d7be35168b/network/p2p/handler.go#L81-L84>
148+
149+
So once `Has(...)` became connectivity-dependent, transient disconnect windows could return `ErrNotValidator` for validator peers.
150+
151+
This behavior was later refactored but preserved (still connected-validator semantics):
152+
153+
- <https://github.com/ava-labs/avalanchego/commit/cc3242f48a6262c52057e98a85d6cf8184dfaca3>
154+
155+
### 8.2 When CI started showing this on the benchlist EWMA work
156+
157+
Using retained CI artifacts:
158+
159+
- Base branch run before EWMA (`benchlist-response-handling`):
160+
<https://github.com/ava-labs/avalanchego/actions/runs/22101996337>
161+
Local artifact scan: **0** `errorMessage: "not a validator"` lines.
162+
163+
- First EWMA branch run (`benchlist-ewma`):
164+
<https://github.com/ava-labs/avalanchego/actions/runs/22106616407>
165+
Head SHA: `be487be12844338e316f54aff9399325d6a771c0` (`Add back EWMA benching logic`)
166+
Local artifact scan: **3** `errorMessage: "not a validator"` lines.
167+
168+
- Later EWMA runs show the same pattern with higher frequency, including handlerID `1` (atomic gossip) in run `22150593840`:
169+
<https://github.com/ava-labs/avalanchego/actions/runs/22150593840>
170+
171+
Interpretation:
172+
- The **bug was introduced historically** in 2023 (`de3b16...`) by changing `Has` semantics.
173+
- EWMA benching/unbenching churn (starting with `be487be...`) appears to **surface/amplify** it in e2e CI.
174+
175+
### 8.3 Why `12cd5ec...` happened (and whether this bug was intentional)
176+
177+
`12cd5ec53a` itself was motivated by stake-weighted push gossip (not by inbound validator gating):
178+
179+
- PR: <https://github.com/ava-labs/avalanchego/pull/2835>
180+
- PR motivation text says block production is stake-weighted and API nodes may want to send tx gossip to peers by stake.
181+
182+
The semantics at issue (`Has` means *connected validator*) were already in place from `de3b16...` in PR #2283:
183+
184+
- PR: <https://github.com/ava-labs/avalanchego/pull/2283>
185+
- In that commit, `Validators` was wired to peer connectivity and both code comments and implementation describe connected-validator behavior.
186+
187+
Conclusion:
188+
189+
- The `Has` semantic shift appears **intentional for sampling/peer-selection behavior**.
190+
- The resulting `ValidatorHandler` false negative (`not a validator` during disconnect churn) appears **accidental side effect** of interface coupling (`ValidatorHandler` reused the same `Has` API introduced earlier for validator-membership checks).

benchlist-personal

Lines changed: 61 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,61 @@
1+
Alright! To start, I found the exact run from what you copied above (so you can verify):
2+
3+
- Run: <https://github.com/ava-labs/avalanchego/actions/runs/22150593840>
4+
- Job: `e2e_existing_network`
5+
<https://github.com/ava-labs/avalanchego/actions/runs/22150593840/job/64040542894>
6+
- Artifact: `e2e-existing-network-tmpnet-data`
7+
<https://github.com/ava-labs/avalanchego/actions/runs/22150593840/artifacts/5559966692>
8+
9+
You're also correct that you can find this in a bunch of runs -- if you check the PR there are a lot of runs with these logs.
10+
11+
Anyway, from the same `C.log` around the failure, we see:
12+
13+
- `.../C.log:3101` shows peer added (`adding new peer ... NodeID-B4F...`)
14+
- `.../C.log:3106` shows peer disconnected (`disconnecting peer ... NodeID-B4F...`)
15+
- `.../C.log:3220` shows `handlerID: 1` request rejected as `not a validator`
16+
- `.../C.log:3221` emits the `app_error` response
17+
18+
In the quoted sequence, all the `handlerID: 1` + `not a validator` failures happened while the most recent connection state for that peer was `disconnected`. More generally, the same signature appears in multiple uploaded artifacts:
19+
20+
- `e2e-existing-network-tmpnet-data` (110 `app_error: "not a validator"` lines)
21+
<https://github.com/ava-labs/avalanchego/actions/runs/22150593840/artifacts/5559966692>
22+
- `e2e-post-latest-tmpnet-data` (206 lines)
23+
<https://github.com/ava-labs/avalanchego/actions/runs/22150593840/artifacts/5560098491>
24+
- `e2e-ubuntu-latest-tmpnet-data` (168 lines)
25+
<https://github.com/ava-labs/avalanchego/actions/runs/22150593840/artifacts/5560110597>
26+
- `e2e-custom-arm64-noble-tmpnet-data` (189 lines)
27+
<https://github.com/ava-labs/avalanchego/actions/runs/22150593840/artifacts/5560136359>
28+
29+
You can trace the path (not sure if there's a better way than linking you the code)
30+
31+
- Atomic gossip uses handler ID `1`:
32+
- <https://github.com/ava-labs/avalanchego/blob/bad89c8d15bed10c780819a3dee8e27618a11c17/graft/coreth/plugin/evm/atomic/vm/vm.go#L195>
33+
- <https://github.com/ava-labs/avalanchego/blob/bad89c8d15bed10c780819a3dee8e27618a11c17/graft/coreth/plugin/evm/atomic/vm/vm.go#L289-L290>
34+
- Pull gossip is wrapped in a validator-only gossiper:
35+
- <https://github.com/ava-labs/avalanchego/blob/bad89c8d15bed10c780819a3dee8e27618a11c17/network/p2p/gossip/system.go#L163-L167>
36+
- <https://github.com/ava-labs/avalanchego/blob/bad89c8d15bed10c780819a3dee8e27618a11c17/network/p2p/gossip/gossip.go#L196-L201>
37+
- Inbound request rejection happens here:
38+
- <https://github.com/ava-labs/avalanchego/blob/bad89c8d15bed10c780819a3dee8e27618a11c17/network/p2p/handler.go#L210-L213>
39+
- At that commit, `Has` checked connected validators:
40+
- <https://github.com/ava-labs/avalanchego/blob/bad89c8d15bed10c780819a3dee8e27618a11c17/network/p2p/validators.go#L197-L205>
41+
42+
The root cause of this issue doesn't seem to be any misconfiguration for atomic gossip, but churn in connection state; EWMA benching/unbenching seems to surface/amplify it in e2e CI.
43+
44+
In short:
45+
46+
- Sender can still have a peer selected for request while receiver has already marked sender disconnected.
47+
- Receiver-side validator gate (`ValidatorHandler.AppRequest`) checks `Has(...)` and returns `ErrNotValidator`.
48+
- Because `Has(...)` was tied to connected validator state, any transient disconnects produce `not a validator` app errors for validator peers.
49+
50+
This behavior was actually introduced earlier in [`Add p2p.Network component (#2283)`](https://github.com/ava-labs/avalanchego/commit/de3b16cac8a7f61c92b7c695599071de482d709f), in Nov 2023. Before that change, `Validators.Has` [checked validator membership only](https://github.com/ava-labs/avalanchego/blob/a3c46496c2bf7a17d019b96c82be24994da4efa9/network/p2p/validators.go#L82). In `de3b16...`, `Validators.Has` was changed to require my both that a peer is currently connected and in the validator set.
51+
52+
[`Allow configuring push gossip to send txs to validators by stake (#2835)`](https://github.com/ava-labs/avalanchego/commit/12cd5ec53a6fc6226256565c8048b79d3aae76eb) (March 2024) later refactored validator sampling/ranking and preserved those `Has(...)` semantics. The behavior was [later refactored again](https://github.com/ava-labs/avalanchego/commit/cc3242f48a6262c52057e98a85d6cf8184dfaca3), but semantics remained unchanged. (I believe this commit history is correct -- I used Claude + Git Lens to get a history)
53+
54+
I believe this was intentional for sampling/peer-selection behavior (connected validators only), but had an accidental side effect for inbound filtering because `ValidatorHandler` (added earlier in [21e7a656a0](https://github.com/ava-labs/avalanchego/commit/21e7a656a0bb4235839fef964e7c81d7be35168b)) already used `Has(...)` as a validator gate.
55+
56+
I proposed a fix in https://github.com/ava-labs/avalanchego/pull/5009.
57+
- `Has(...)` changed to validator membership (`validatorSet.Contains`) rather than connected subset.
58+
- Added test `TestValidatorsHasUsesValidatorMembership` (fails on old behavior, passes with fix).
59+
- `network/p2p/network_test.go`: Updated expectations to match new `Has` semantics.
60+
61+
(I also restarted CI on your PR Aaron -- hope that was okay).

network/p2p/validators_test.go

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -385,7 +385,7 @@ func TestValidatorsHasUsesValidatorMembership(t *testing.T) {
385385
mockValidators.EXPECT().GetCurrentHeight(gomock.Any()).Return(uint64(1), nil).Times(1)
386386
mockValidators.EXPECT().GetValidatorSet(gomock.Any(), uint64(1), subnetID).Return(
387387
map[ids.NodeID]*validators.GetValidatorOutput{
388-
validatorNodeID: &validators.GetValidatorOutput{
388+
validatorNodeID: {
389389
NodeID: validatorNodeID,
390390
Weight: 1,
391391
},

0 commit comments

Comments
 (0)