PoA quorum and HA failover fixes (cherry-pick e1588b6) by MitchTurner · Pull Request #3237 · FuelLabs/fuel-core

MitchTurner · 2026-03-24T21:06:09Z

Squash of 32 commits from poa-quorum-fixes branch. Key changes:

Redis leader lease adapter:

Fix silent read failures by clearing cached Redis connection on stream read errors
Improve lock coverage to eliminate gaps during block production
Release lease on error and use block_time delay for faster failover
Add Prometheus metrics (lease acquisitions, renewals, losses, errors)

Fork prevention (write_block.lua):

Add HEIGHT_EXISTS check that rejects all writes at heights already present in the Redis stream, regardless of epoch
Prevents duplicate-height forks via pigeonhole principle

Sub-quorum block repair:

During reconciliation, detect heights with entries below quorum and repropose the highest-epoch block to all nodes
Nodes that already hold the block respond with HEIGHT_EXISTS; the others accept it as a fresh write

Chaos test harness (bin/chaos-test/):

Standalone binary for HA leader lock failover testing under fault injection (Redis kill/restart, proxy faults, node restarts)
Fork detection with Redis stream state dumping
Stall detection (warnings only, not failures — stalls self-recover)
AOF persistence for Redis servers to survive kill/restart cycles
CI workflow for automated chaos testing

Formal verification:

FizzBee models (v1 and v2) for sequencer HA correctness
Adversarial model with nondeterministic reads and tractable state space
Post-mortem analysis of issues found

Off-chain worker fix:

Use StateRewindPolicy::RewindFullRange to prevent NoHistoryForRequestedHeight node bricking

Documentation:

Failover spec updates (HEIGHT_EXISTS, sub-quorum repair)
Interactive scenario diagrams for PoA Redis fencing
HA failover issues tracker and chaos test findings

Co-authored-by: Mitchell Turner <james.mitchell.turner@gmail.com>
Co-authored-by: Green Baneling <XgreenX9999@gmail.com>

cursor · 2026-03-24T21:06:17Z

PR Summary

Medium Risk
Medium risk because it changes CI infrastructure (Redis install/start behavior, additional tests, and Rust version for Docker builds), which can cause unexpected pipeline failures or build differences despite no runtime code changes.

Overview
CI coverage for leader-lock/Redis is expanded and made more deterministic. The leader-lock job now builds Redis from source (pinned REDIS_VERSION=8.6.0), starts it with explicit flags and a readiness loop, and also runs fuel-core’s leader_lock unit tests in addition to the existing integration tests.

Adds new tooling/automation around reliability testing and security auditing. Introduces a manual Leader Lock Chaos Tests workflow to run fuel-core-chaos-test across a seed range with log upload, adds a cargo audit job (non-blocking), updates .cargo/audit.toml formatting, and adds changelog entries describing the PoA/HA fixes. Also updates docker-images.yml to use RUST_VERSION: 1.90.0 (from 1.93.0).

Written by Cursor Bugbot for commit 5e99a52.

+    runs-on: ubuntu-latest
+    outputs:
+      matrix: ${{ steps.matrix.outputs.matrix }}
+    steps:
+      - id: matrix
+        run: |
+          RANGE="${{ inputs.seeds }}"
+          START="${RANGE%-*}"
+          END="${RANGE#*-}"
+          BATCH_SIZE=${{ inputs.parallelism }}
+          BATCHES="["
+          FIRST=true
+          for ((i=START; i<=END; i+=BATCH_SIZE)); do
+            BATCH_END=$((i + BATCH_SIZE - 1))
+            if [ $BATCH_END -gt $END ]; then BATCH_END=$END; fi
+            if [ "$FIRST" = true ]; then FIRST=false; else BATCHES+=","; fi
+            BATCHES+="{\"start\":$i,\"end\":$BATCH_END}"
+          done
+          BATCHES+="]"
+          echo "matrix={\"batch\":$BATCHES}" >> "$GITHUB_OUTPUT"
+
+  chaos-test:
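As a worked example of the batching step in the prepare job, the sketch below wraps the same arithmetic in a function so it can be run outside GitHub Actions. The function name `make_matrix` and its two positional arguments are illustrative assumptions; they stand in for `inputs.seeds` and `inputs.parallelism` and are not part of the workflow.

```shell
#!/usr/bin/env bash
# Sketch of the seed-batching logic from the prepare job.
# make_matrix is a hypothetical helper, not part of the PR.
# $1: seed range such as "1-10"; $2: batch size (inputs.parallelism).
make_matrix() {
  local range="$1" batch_size="$2"
  local start="${range%-*}"   # text before the dash
  local end="${range#*-}"     # text after the dash
  local batches="" i batch_end
  i=$start
  while [ "$i" -le "$end" ]; do
    batch_end=$((i + batch_size - 1))
    # Clamp the last batch to the end of the range.
    if [ "$batch_end" -gt "$end" ]; then batch_end=$end; fi
    if [ -n "$batches" ]; then batches="${batches},"; fi
    batches="${batches}{\"start\":$i,\"end\":$batch_end}"
    i=$((i + batch_size))
  done
  echo "{\"batch\":[${batches}]}"
}

make_matrix 1-10 4
# prints {"batch":[{"start":1,"end":4},{"start":5,"end":8},{"start":9,"end":10}]}
```

Each object in the `batch` array becomes one chaos-test matrix job, so a range of 10 seeds with a batch size of 4 yields three jobs, the last covering only seeds 9 and 10.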


In general, the fix is to add an explicit permissions: block that restricts the GITHUB_TOKEN to the least privileges necessary. This can be defined either at the workflow root (applies to all jobs) or per job. Since both prepare and chaos-test only need to read repository contents (for actions/checkout) and do not push commits, manage issues/PRs, or modify settings, contents: read is sufficient. actions/cache and actions/upload-artifact do not require additional repository-scoped write permissions; they use dedicated cache/artifact infrastructure.

The best minimal fix without changing functionality is to add a workflow-level permissions: block right after the name: line (before on:). This block should set contents: read, which is the recommended baseline for read-only workflows. No other scopes appear needed given the provided steps. This will satisfy CodeQL’s requirement, keep the token as least-privilege, and apply consistently to all jobs in this workflow.

Concretely:

Edit .github/workflows/chaos-test.yml.

After line 1: name: Leader Lock Chaos Tests, insert:
permissions:
  contents: read

No imports or additional methods are needed; this is pure workflow configuration.

+    needs: prepare
+    runs-on: ubuntu-latest
+    timeout-minutes: 180
+    strategy:
+      fail-fast: false
+      matrix: ${{ fromJson(needs.prepare.outputs.matrix) }}
+    steps:
+      - uses: actions/checkout@v6
+
+      - name: Install Rust toolchain
+        uses: dtolnay/rust-toolchain@stable
+
+      - name: Install Redis
+        run: sudo apt-get update && sudo apt-get install -y redis-server
+
+      - name: Cache cargo
+        uses: actions/cache@v4
+        with:
+          path: |
+            ~/.cargo/registry
+            ~/.cargo/git
+            target
+          key: chaos-test-${{ hashFiles('**/Cargo.lock') }}
+
+      - name: Build chaos test
+        run: cargo build --release -p fuel-core-chaos-test
+
+      - name: Run chaos tests (seeds ${{ matrix.batch.start }}-${{ matrix.batch.end }})
+        run: |
+          FAILED=0
+          for seed in $(seq ${{ matrix.batch.start }} ${{ matrix.batch.end }}); do
+            echo "=== Seed $seed ==="
+            LOG="chaos_seed${seed}.log"
+            cargo run --release -p fuel-core-chaos-test -- \
+              --seed $seed \
+              --duration ${{ inputs.duration }} \
+              --block-time ${{ inputs.block_time }} \
+              --fault-interval ${{ inputs.fault_interval }} \
+              --stall-threshold ${{ inputs.stall_threshold }} \
+              > "$LOG" 2>&1
+            RC=$?
+            if [ $RC -ne 0 ]; then
+              echo "SEED $seed: FAIL"
+              grep -E "FORK|RESULT" "$LOG" | tail -3
+              if grep -q "FORK" "$LOG"; then
+                echo "::error::FORK detected at seed $seed"
+              fi
+              FAILED=$((FAILED + 1))
+            else
+              echo "SEED $seed: PASS"
+            fi
+          done
+          if [ $FAILED -gt 0 ]; then
+            echo "::error::$FAILED seed(s) failed"
+            exit 1
+          fi
+
+      - name: Upload logs
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: chaos-logs-${{ matrix.batch.start }}-${{ matrix.batch.end }}
+          path: chaos_seed*.log
+          retention-days: 14


In general, the fix is to add an explicit permissions block to the workflow so that the GITHUB_TOKEN has only the minimal required scopes. Since this workflow only checks out code, caches build artifacts, builds, runs tests, and uploads artifacts, it only needs read access to repository contents; no write permissions or special scopes (issues, pull-requests, etc.) are required.

The best fix without changing functionality is to add a top-level permissions block (applies to all jobs) right after the name: or on: section in .github/workflows/chaos-test.yml. Set contents: read, which is sufficient for actions/checkout and does not interfere with actions/cache or actions/upload-artifact, as those operate within the workflow’s already-granted scopes. No job-specific permissions overrides are needed since neither prepare nor chaos-test require more than read access.

Concretely, edit .github/workflows/chaos-test.yml to insert:

permissions:
  contents: read

after the name: Leader Lock Chaos Tests line (line 1) and before on: (line 3). No other code, steps, or dependencies need to change.

cursor · 2026-03-24T21:20:55Z

  AWS_ECR_ORG: fuellabs
  CARGO_TERM_COLOR: always
-  RUST_VERSION: 1.93.0
+  RUST_VERSION: 1.90.0


Cherry-pick downgrades RUST_VERSION from 1.93.0 to 1.90.0

Medium Severity

The cherry-pick introduces a RUST_VERSION regression from 1.93.0 to 1.90.0 in docker-images.yml. Every other location in the repo uses 1.93.0: rust-toolchain.toml, the Dockerfile (rust:1.93.0-bookworm), and ci.yml. RUST_VERSION isn't currently interpolated via ${{ env.RUST_VERSION }} in this workflow's steps, but it is exported as an environment variable to all of them, so any tool or script that reads it would get a stale value.
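One way to catch this kind of drift is a small consistency check. The sketch below is not part of the PR. It assumes that rust-toolchain.toml contains a line of the form `channel = "1.93.0"` and that the workflow declares `RUST_VERSION: <version>` on a single line. The helper names `toolchain_version`, `workflow_version`, and `check_rust_version` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: fail when a workflow's RUST_VERSION differs from rust-toolchain.toml.
# Assumed formats (not confirmed by the PR):
#   rust-toolchain.toml:  channel = "1.93.0"
#   workflow file:        RUST_VERSION: 1.93.0

# Print the version from the first `channel = "..."` line of file $1.
toolchain_version() {
  sed -n 's/^ *channel *= *"\([0-9.]*\)".*/\1/p' "$1" | head -n 1
}

# Print the version from the first `RUST_VERSION: ...` line of file $1.
workflow_version() {
  sed -n 's/^ *RUST_VERSION: *\([0-9.]*\).*/\1/p' "$1" | head -n 1
}

# $1: toolchain file, $2: workflow file.
# Returns 0 when both name the same version, 1 otherwise.
check_rust_version() {
  local expected actual
  expected=$(toolchain_version "$1")
  actual=$(workflow_version "$2")
  if [ -z "$expected" ] || [ "$expected" != "$actual" ]; then
    echo "RUST_VERSION mismatch: toolchain=$expected workflow=$actual" >&2
    return 1
  fi
}
```

It would be called as `check_rust_version rust-toolchain.toml path/to/docker-images.yml`, with the second argument pointing at wherever the workflow file lives in the repository.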

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


cursor · 2026-03-24T22:00:07Z

+          if [ $FAILED -gt 0 ]; then
+            echo "::error::$FAILED seed(s) failed"
+            exit 1
+          fi


Chaos test loop exits on first seed failure

Medium Severity

GitHub Actions runs bash `run:` steps with errexit enabled (`bash -e` by default, `-eo pipefail` when `shell: bash` is set explicitly). When cargo run exits with a non-zero code on a failing seed, the shell terminates immediately, before reaching RC=$?. The loop therefore stops at the first failing seed, and the FAILED counter and the end-of-loop summary are effectively dead code. This defeats the intent to run all seeds and collect failures.
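A minimal sketch of one possible fix, assuming the goal is to keep errexit enabled: capture the exit code with `|| rc=$?`, which makes the command part of an OR list so that a failure no longer terminates the shell. The `run_seed` function is a hypothetical stand-in for the `cargo run -p fuel-core-chaos-test` invocation, and `count_failures` is an illustrative name.

```shell
#!/usr/bin/env bash
set -e  # errexit, the option that causes the early exit described above

# Hypothetical stand-in for the chaos-test binary:
# even seeds fail, odd seeds pass.
run_seed() {
  local seed="$1"
  if [ $((seed % 2)) -eq 0 ]; then
    echo "RESULT: FAIL (seed $seed)"
    return 1
  fi
  echo "RESULT: PASS (seed $seed)"
}

# Run every seed in [$1, $2] and print the number of failing seeds.
count_failures() {
  local start="$1" end="$2" failed=0 seed rc
  for seed in $(seq "$start" "$end"); do
    rc=0
    # A failing command on the left of `||` does not trigger errexit,
    # so rc records the exit code and the loop moves on to the next seed.
    run_seed "$seed" > /dev/null 2>&1 || rc=$?
    if [ "$rc" -ne 0 ]; then
      failed=$((failed + 1))
    fi
  done
  echo "$failed"
}

count_failures 1 5   # seeds 2 and 4 fail, so this prints 2
```

In the workflow step, the same pattern would replace the bare `RC=$?` line: initialise `RC=0` before the cargo command and append `|| RC=$?` after the output redirection. An alternative is to disable errexit around the loop with `set +e`, at the cost of losing it for the other commands in the step.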

MitchTurner changed the title ~~PoA quorum and HA failover fixes (to release/v0.47.3) (#3225)~~ PoA quorum and HA failover fixes (cherry-pick e1588b6) Mar 24, 2026

github-advanced-security AI found potential problems Mar 24, 2026

update changelog

3b25913

cursor Bot reviewed Mar 24, 2026

Comment thread .github/workflows/ci.yml Outdated

fix rust version

19732d4

MitchTurner marked this pull request as ready for review March 24, 2026 21:18

MitchTurner requested review from a team, Dentosal and xgreenx as code owners March 24, 2026 21:18

MitchTurner self-assigned this Mar 24, 2026

MitchTurner requested a review from Voxelot March 24, 2026 21:18

cursor Bot reviewed Mar 24, 2026

fix check

5e99a52

cursor Bot reviewed Mar 24, 2026

MitchTurner closed this Mar 26, 2026

MitchTurner wants to merge 4 commits into master from chore/cherry-pick-e1588b6

@@ -1,4 +1,6 @@
             name: Leader Lock Chaos Tests
+            permissions:
+              contents: read
             on:
               workflow_dispatch:
