|
| 1 | +# Benchlist EWMA Thread: `not a validator` Investigation |
| 2 | + |
| 3 | +PR: <https://github.com/ava-labs/avalanchego/pull/4982> |
| 4 | +Date investigated: 2026-02-19 |
| 5 | + |
| 6 | +## 1. Exact CI run where the quoted failure appears |
| 7 | + |
| 8 | +The exact log line from the Slack thread is in: |
| 9 | + |
| 10 | +- Run: <https://github.com/ava-labs/avalanchego/actions/runs/22150593840> |
| 11 | +- Job: `e2e_existing_network` |
| 12 | + <https://github.com/ava-labs/avalanchego/actions/runs/22150593840/job/64040542894> |
| 13 | +- Artifact: `e2e-existing-network-tmpnet-data` |
| 14 | + <https://github.com/ava-labs/avalanchego/actions/runs/22150593840/artifacts/5559966692> |
| 15 | +- Artifact zip API URL (requires GitHub auth): |
| 16 | + <https://api.github.com/repos/ava-labs/avalanchego/actions/artifacts/5559966692/zip> |
| 17 | + |
| 18 | +Inside that artifact: |
| 19 | + |
| 20 | +- `networks/20260218-173951.147867-avalanchego-e2e/NodeID-BVk3Gf3Rwr5XS7WN7v4ZfjYrkjj7BcMTS/logs/C.log:3220` |
| 21 | +- `networks/20260218-173951.147867-avalanchego-e2e/NodeID-BVk3Gf3Rwr5XS7WN7v4ZfjYrkjj7BcMTS/logs/C.log:3221` |
| 22 | + |
| 23 | +Those are the exact `handlerID: 1` + `app_error ... "not a validator"` lines discussed in thread. |
| 24 | + |
| 25 | +## 2. Other run context for PR #4982 |
| 26 | + |
| 27 | +Tests workflow runs on `aaronbuchwald/benchlist-ewma`: |
| 28 | + |
| 29 | +- <https://github.com/ava-labs/avalanchego/actions/runs/22187381614> (failed) |
| 30 | +- <https://github.com/ava-labs/avalanchego/actions/runs/22153325396> (passed) |
| 31 | +- <https://github.com/ava-labs/avalanchego/actions/runs/22150593840> (failed) |
| 32 | +- <https://github.com/ava-labs/avalanchego/actions/runs/22117974408> (failed) |
| 33 | +- <https://github.com/ava-labs/avalanchego/actions/runs/22117044316> (failed) |
| 34 | +- <https://github.com/ava-labs/avalanchego/actions/runs/22116636267> (cancelled) |
| 35 | +- <https://github.com/ava-labs/avalanchego/actions/runs/22115556838> (failed) |
| 36 | +- <https://github.com/ava-labs/avalanchego/actions/runs/22114576701> (failed) |
| 37 | +- <https://github.com/ava-labs/avalanchego/actions/runs/22114209216> (cancelled) |
| 38 | +- <https://github.com/ava-labs/avalanchego/actions/runs/22106616407> (failed) |
| 39 | + |
| 40 | +## 3. Log evidence and what it shows |
| 41 | + |
| 42 | +From the same `C.log` around the failure: |
| 43 | + |
| 44 | +- `.../C.log:3101` shows peer added (`adding new peer ... NodeID-B4F...`) |
| 45 | +- `.../C.log:3106` shows peer disconnected (`disconnecting peer ... NodeID-B4F...`) |
| 46 | +- `.../C.log:3220` shows `handlerID: 1` request rejected as `not a validator` |
| 47 | +- `.../C.log:3221` emits the `app_error` response |
| 48 | + |
| 49 | +In this log, all observed `handlerID: 1` + `not a validator` failures happened while the most recent connection state for that peer was `disconnected`. |
| 50 | + |
| 51 | +Within the same run (`22150593840`), the same signature appears in multiple uploaded artifacts: |
| 52 | + |
| 53 | +- `e2e-existing-network-tmpnet-data` (110 `app_error: "not a validator"` lines) |
| 54 | + <https://github.com/ava-labs/avalanchego/actions/runs/22150593840/artifacts/5559966692> |
| 55 | +- `e2e-post-latest-tmpnet-data` (206 lines) |
| 56 | + <https://github.com/ava-labs/avalanchego/actions/runs/22150593840/artifacts/5560098491> |
| 57 | +- `e2e-ubuntu-latest-tmpnet-data` (168 lines) |
| 58 | + <https://github.com/ava-labs/avalanchego/actions/runs/22150593840/artifacts/5560110597> |
| 59 | +- `e2e-custom-arm64-noble-tmpnet-data` (189 lines) |
| 60 | + <https://github.com/ava-labs/avalanchego/actions/runs/22150593840/artifacts/5560136359> |
| 61 | + |
| 62 | +## 4. Code paths (GitHub permalinks) |
| 63 | + |
| 64 | +The relevant behavior on PR head commit (`bad89c8d15bed10c780819a3dee8e27618a11c17`): |
| 65 | + |
| 66 | +- Atomic gossip uses handler ID `1`: |
| 67 | + - <https://github.com/ava-labs/avalanchego/blob/bad89c8d15bed10c780819a3dee8e27618a11c17/graft/coreth/plugin/evm/atomic/vm/vm.go#L195> |
| 68 | + - <https://github.com/ava-labs/avalanchego/blob/bad89c8d15bed10c780819a3dee8e27618a11c17/graft/coreth/plugin/evm/atomic/vm/vm.go#L289-L290> |
| 69 | +- Pull gossip is wrapped in a validator-only gossiper: |
| 70 | + - <https://github.com/ava-labs/avalanchego/blob/bad89c8d15bed10c780819a3dee8e27618a11c17/network/p2p/gossip/system.go#L163-L167> |
| 71 | + - <https://github.com/ava-labs/avalanchego/blob/bad89c8d15bed10c780819a3dee8e27618a11c17/network/p2p/gossip/gossip.go#L196-L201> |
| 72 | +- Inbound request rejection happens here: |
| 73 | + - <https://github.com/ava-labs/avalanchego/blob/bad89c8d15bed10c780819a3dee8e27618a11c17/network/p2p/handler.go#L210-L213> |
| 74 | +- At that commit, `Has` checked connected validators: |
| 75 | + - <https://github.com/ava-labs/avalanchego/blob/bad89c8d15bed10c780819a3dee8e27618a11c17/network/p2p/validators.go#L197-L205> |
| 76 | + |
| 77 | +## 5. Root cause |
| 78 | + |
| 79 | +Root cause is connection-churn/asymmetric connection state, not atomic gossip handler misconfiguration: |
| 80 | + |
| 81 | +- Sender can still have a peer selected for request while receiver has already marked sender disconnected. |
| 82 | +- Receiver-side validator gate (`ValidatorHandler.AppRequest`) checks `Has(...)` and returns `ErrNotValidator`. |
| 83 | +- Because `Has(...)` was tied to **connected** validator state, transient disconnects produce `not a validator` app errors for validator peers. |
| 84 | + |
| 85 | +## 6. Proposed fix + deterministic test |
| 86 | + |
| 87 | +Local patch prepared: |
| 88 | + |
| 89 | +- `network/p2p/validators.go:197` |
| 90 | + `Has(...)` changed to validator membership (`validatorSet.Contains`) rather than connected subset. |
| 91 | +- `network/p2p/validators_test.go:375` |
| 92 | + Added deterministic test `TestValidatorsHasUsesValidatorMembership` (fails on old behavior, passes with fix). |
| 93 | +- `network/p2p/network_test.go:972` |
| 94 | + Updated expectations to match new `Has` semantics. |
| 95 | + |
| 96 | +## 7. Validation run |
| 97 | + |
| 98 | +Executed locally: |
| 99 | + |
| 100 | +- `GOCACHE=/tmp/go-build go test ./network/p2p/...` |
| 101 | +- `GOCACHE=/tmp/go-build go test ./network/p2p -run 'TestValidatorsHasUsesValidatorMembership|TestNetworkValidators_ConnectAndDisconnect' -count=1` |
| 102 | +- `GOCACHE=/tmp/go-build go test ./graft/coreth/plugin/evm/atomic/vm -run TestAtomicTxGossip -count=1` |
| 103 | + |
| 104 | +All passed. |
| 105 | + |
| 106 | +## 8. Historical introduction analysis |
| 107 | + |
| 108 | +### 8.0 Timeline (quick reference) |
| 109 | + |
| 110 | +| Date (UTC-4) | Type | Ref | What changed / observed | |
| 111 | +|---|---|---|---| |
| 112 | +| 2023-08-23 | Commit | [`a3c46496c2`](https://github.com/ava-labs/avalanchego/commit/a3c46496c2bf7a17d019b96c82be24994da4efa9) | Introduced SDK validator sampling; `Validators.Has` used validator membership semantics. | |
| 113 | +| 2023-08-28 | Commit | [`21e7a656a0`](https://github.com/ava-labs/avalanchego/commit/21e7a656a0bb4235839fef964e7c81d7be35168b) | Added validator-only inbound filter (`ValidatorHandler`) that gates on `Has(...)`. | |
| 114 | +| 2023-11-30 | Commit | [`de3b16cac8`](https://github.com/ava-labs/avalanchego/commit/de3b16cac8a7f61c92b7c695599071de482d709f) | **Semantic regression introduced**: `Has(...)` changed from validator membership to connected-validator semantics (`peers.has(...) && validatorIDs.Contains(...)`). | |
| 115 | +| 2024-03-12 | Commit | [`12cd5ec53a`](https://github.com/ava-labs/avalanchego/commit/12cd5ec53a6fc6226256565c8048b79d3aae76eb) | Reworked validator ranking/sampling for stake-weighted gossip; preserved connected-validator `Has(...)` semantics. | |
| 116 | +| 2024-07-04 | Commit | [`45a30f9868`](https://github.com/ava-labs/avalanchego/commit/45a30f9868714d0afea39d3d948ec1dbbfadfc1c) | Switched SDK to send AppErrors, making these rejections clearly visible as `app_error` logs. | |
| 117 | +| 2025-08-25 | Commit | [`cc3242f48a`](https://github.com/ava-labs/avalanchego/commit/cc3242f48a6262c52057e98a85d6cf8184dfaca3) | Refactor preserved connected-validator `Has(...)` semantics (`connectedValidators.Contains`). | |
| 118 | +| 2026-02-17 | CI run | [`22101996337`](https://github.com/ava-labs/avalanchego/actions/runs/22101996337) | Pre-EWMA base branch run sampled in this analysis: no `not a validator` app_error lines found. | |
| 119 | +| 2026-02-17 | Commit | [`be487be128`](https://github.com/ava-labs/avalanchego/commit/be487be12844338e316f54aff9399325d6a771c0) | EWMA benching logic on PR branch (`Add back EWMA benching logic`). | |
| 120 | +| 2026-02-17 | CI run | [`22106616407`](https://github.com/ava-labs/avalanchego/actions/runs/22106616407) | First EWMA run sampled: `not a validator` app_error pattern present. | |
| 121 | +| 2026-02-18 | CI run | [`22150593840`](https://github.com/ava-labs/avalanchego/actions/runs/22150593840) | Thread’s exact atomic gossip (`handlerID: 1`) instance captured and linked above. | |
| 122 | + |
| 123 | +### 8.1 Underlying code regression (introduced historically) |
| 124 | + |
| 125 | +The semantic change that introduced this bug pattern is in: |
| 126 | + |
| 127 | +- Commit: <https://github.com/ava-labs/avalanchego/commit/de3b16cac8a7f61c92b7c695599071de482d709f> |
| 128 | + (`Add p2p.Network component (#2283)`, 2023-11-30) |
| 129 | + |
| 130 | +Before that change, `Validators.Has` checked validator membership only: |
| 131 | + |
| 132 | +- <https://github.com/ava-labs/avalanchego/blob/de3b16cac8a7f61c92b7c695599071de482d709f^/network/p2p/validators.go#L83-L90> |
| 133 | + |
| 134 | +After that change, `Validators.Has` was changed to require both: |
| 135 | +- peer is currently connected |
| 136 | +- and in validator set |
| 137 | + |
| 138 | +- <https://github.com/ava-labs/avalanchego/blob/de3b16cac8a7f61c92b7c695599071de482d709f/network/p2p/validators.go#L103-L110> |
| 139 | + |
| 140 | +This change appears intentional for validator sampling semantics (comments in the same commit explicitly describe connected-validator behavior), but it had an unintended coupling effect on inbound request validation. |
| 141 | + |
| 142 | +- `Sample returns a random sample of connected validators`: |
| 143 | + <https://github.com/ava-labs/avalanchego/blob/de3b16cac8a7f61c92b7c695599071de482d709f/network/p2p/validators.go#L82-L100> |
| 144 | + |
| 145 | +`ValidatorHandler` was already using `Has(...)` for inbound request filtering: |
| 146 | + |
| 147 | +- <https://github.com/ava-labs/avalanchego/blob/21e7a656a0bb4235839fef964e7c81d7be35168b/network/p2p/handler.go#L81-L84> |
| 148 | + |
| 149 | +So once `Has(...)` became connectivity-dependent, transient disconnect windows could return `ErrNotValidator` for validator peers. |
| 150 | + |
| 151 | +This behavior was later refactored but preserved (still connected-validator semantics): |
| 152 | + |
| 153 | +- <https://github.com/ava-labs/avalanchego/commit/cc3242f48a6262c52057e98a85d6cf8184dfaca3> |
| 154 | + |
| 155 | +### 8.2 When CI started showing this on the benchlist EWMA work |
| 156 | + |
| 157 | +Using retained CI artifacts: |
| 158 | + |
| 159 | +- Base branch run before EWMA (`benchlist-response-handling`): |
| 160 | + <https://github.com/ava-labs/avalanchego/actions/runs/22101996337> |
| 161 | + Local artifact scan: **0** `errorMessage: "not a validator"` lines. |
| 162 | + |
| 163 | +- First EWMA branch run (`benchlist-ewma`): |
| 164 | + <https://github.com/ava-labs/avalanchego/actions/runs/22106616407> |
| 165 | + Head SHA: `be487be12844338e316f54aff9399325d6a771c0` (`Add back EWMA benching logic`) |
| 166 | + Local artifact scan: **3** `errorMessage: "not a validator"` lines. |
| 167 | + |
| 168 | +- Later EWMA runs show the same pattern with higher frequency, including handlerID `1` (atomic gossip) in run `22150593840`: |
| 169 | + <https://github.com/ava-labs/avalanchego/actions/runs/22150593840> |
| 170 | + |
| 171 | +Interpretation: |
| 172 | +- The **bug was introduced historically** in 2023 (`de3b16...`) by changing `Has` semantics. |
| 173 | +- EWMA benching/unbenching churn (starting with `be487be...`) appears to **surface/amplify** it in e2e CI. |
| 174 | + |
| 175 | +### 8.3 Why `12cd5ec...` happened (and whether this bug was intentional) |
| 176 | + |
| 177 | +`12cd5ec53a` itself was motivated by stake-weighted push gossip (not by inbound validator gating): |
| 178 | + |
| 179 | +- PR: <https://github.com/ava-labs/avalanchego/pull/2835> |
| 180 | +- PR motivation text says block production is stake-weighted and API nodes may want to send tx gossip to peers by stake. |
| 181 | + |
| 182 | +The semantics at issue (`Has` means *connected validator*) were already in place from `de3b16...` in PR #2283: |
| 183 | + |
| 184 | +- PR: <https://github.com/ava-labs/avalanchego/pull/2283> |
| 185 | +- In that commit, `Validators` was wired to peer connectivity and both code comments and implementation describe connected-validator behavior. |
| 186 | + |
| 187 | +Conclusion: |
| 188 | + |
| 189 | +- The `Has` semantic shift appears **intentional for sampling/peer-selection behavior**. |
| 190 | +- The resulting `ValidatorHandler` false negative (`not a validator` during disconnect churn) appears **accidental side effect** of interface coupling (`ValidatorHandler` reused the same `Has` API introduced earlier for validator-membership checks). |
0 commit comments