PoA quorum and HA failover fixes (cherry-pick e1588b6) #3237

Closed
MitchTurner wants to merge 4 commits into master from chore/cherry-pick-e1588b6

@MitchTurner

Squash of 32 commits from poa-quorum-fixes branch. Key changes:

Redis leader lease adapter:

  • Fix silent read failures by clearing cached Redis connection on stream read errors
  • Improve lock coverage to eliminate gaps during block production
  • Release lease on error and use block_time delay for faster failover
  • Add Prometheus metrics (lease acquisitions, renewals, losses, errors)

Fork prevention (write_block.lua):

  • Add HEIGHT_EXISTS check that rejects all writes at heights already present in the Redis stream, regardless of epoch
  • Prevents duplicate-height forks via pigeonhole principle
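The pigeonhole argument behind HEIGHT_EXISTS can be spelled out as a short derivation. This is a sketch under stated assumptions, not the formal model from the PR: it assumes $N$ Redis nodes and a quorum size $q$ that is a strict majority ($2q > N$). The symbols $N$, $q$, $S$ and $S'$ are introduced here for illustration only.

```latex
% Assumption: HEIGHT_EXISTS lets each node hold at most one entry per height h.
% Suppose two distinct blocks B and B' both reached quorum at height h, and let
% S and S' be the sets of nodes holding B and B' respectively.
\begin{align*}
  |S| \ge q, \quad |S'| \ge q
    &\qquad \text{(both blocks reached quorum)} \\
  S \cap S' = \emptyset
    &\qquad \text{(one entry per node per height)} \\
  |S \cup S'| = |S| + |S'| \ge 2q > N
    &\qquad \text{(contradiction: only } N \text{ nodes exist)}
\end{align*}
% Hence at most one block can reach quorum at any given height.
```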

Sub-quorum block repair:

  • During reconciliation, detect heights with entries below quorum and repropose the highest-epoch block to all nodes
  • Nodes that already have the block return HEIGHT_EXISTS; the other nodes accept a fresh write

Chaos test harness (bin/chaos-test/):

  • Standalone binary for HA leader lock failover testing under fault injection (Redis kill/restart, proxy faults, node restarts)
  • Fork detection with Redis stream state dumping
  • Stall detection (warnings only, not failures — stalls self-recover)
  • AOF persistence for Redis servers to survive kill/restart cycles
  • CI workflow for automated chaos testing

Formal verification:

  • FizzBee models (v1 and v2) for sequencer HA correctness
  • Adversarial model with nondeterministic reads and tractable state space
  • Post-mortem analysis of issues found

Off-chain worker fix:

  • Use StateRewindPolicy::RewindFullRange to prevent NoHistoryForRequestedHeight node bricking

Documentation:

  • Failover spec updates (HEIGHT_EXISTS, sub-quorum repair)
  • Interactive scenario diagrams for PoA Redis fencing
  • HA failover issues tracker and chaos test findings

---------

Co-authored-by: Mitchell Turner <james.mitchell.turner@gmail.com>
Co-authored-by: Green Baneling <XgreenX9999@gmail.com>
@cursor

cursor Bot commented Mar 24, 2026

PR Summary

Medium Risk
Medium risk because it changes CI infrastructure (Redis install/start behavior, additional tests, and Rust version for Docker builds), which can cause unexpected pipeline failures or build differences despite no runtime code changes.

Overview
CI coverage for leader-lock/Redis is expanded and made more deterministic. The leader-lock job now builds Redis from source (pinned REDIS_VERSION=8.6.0), starts it with explicit flags and a readiness loop, and also runs fuel-core’s leader_lock unit tests in addition to the existing integration tests.
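The readiness loop mentioned in this summary is not shown on this page. The general pattern might look like the sketch below, which retries a probe command until it succeeds or an attempt budget runs out. The function name `wait_until_ready` and its arguments are hypothetical, and the actual workflow step may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a probe command until it succeeds.
# Usage: wait_until_ready <attempts> <delay-seconds> <command...>
wait_until_ready() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    # Discard probe output; only its exit status matters.
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Example (assumes redis-cli is installed and a server is starting locally):
#   wait_until_ready 30 1 redis-cli ping
```

Bounding the number of attempts keeps a Redis server that never comes up from hanging the job until the workflow timeout.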

Adds new tooling/automation around reliability testing and security auditing. Introduces a manual Leader Lock Chaos Tests workflow to run fuel-core-chaos-test across a seed range with log upload, adds a cargo audit job (non-blocking), updates .cargo/audit.toml formatting, and adds changelog entries describing the PoA/HA fixes. Also updates docker-images.yml to use RUST_VERSION: 1.90.0 (from 1.93.0).

Written by Cursor Bugbot for commit 5e99a52.

@MitchTurner MitchTurner changed the title from "PoA quorum and HA failover fixes (to release/v0.47.3) (#3225)" to "PoA quorum and HA failover fixes (cherry-pick e1588b6)" on Mar 24, 2026
Comment on lines +36 to +57
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.matrix.outputs.matrix }}
    steps:
      - id: matrix
        run: |
          RANGE="${{ inputs.seeds }}"
          START="${RANGE%-*}"
          END="${RANGE#*-}"
          BATCH_SIZE=${{ inputs.parallelism }}
          BATCHES="["
          FIRST=true
          for ((i=START; i<=END; i+=BATCH_SIZE)); do
            BATCH_END=$((i + BATCH_SIZE - 1))
            if [ $BATCH_END -gt $END ]; then BATCH_END=$END; fi
            if [ "$FIRST" = true ]; then FIRST=false; else BATCHES+=","; fi
            BATCHES+="{\"start\":$i,\"end\":$BATCH_END}"
          done
          BATCHES+="]"
          echo "matrix={\"batch\":$BATCHES}" >> "$GITHUB_OUTPUT"

  chaos-test:
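
The batching logic in the prepare step above can be tried outside of GitHub Actions. The sketch below wraps the same arithmetic in a function so the resulting matrix JSON can be inspected locally. The function name `build_matrix` is hypothetical, and the two arguments stand in for the `inputs.seeds` and `inputs.parallelism` workflow inputs.

```shell
#!/usr/bin/env bash
# Hypothetical local version of the matrix step: split a seed range such as
# "1-10" into batches of at most batch_size seeds and print the matrix JSON.
build_matrix() {
  local range="$1" batch_size="$2"
  local start="${range%-*}"   # text before the last '-'
  local end="${range#*-}"     # text after the first '-'
  local batches="[" first=true i batch_end
  for ((i = start; i <= end; i += batch_size)); do
    batch_end=$((i + batch_size - 1))
    # Clamp the final batch to the end of the range.
    if [ "$batch_end" -gt "$end" ]; then batch_end=$end; fi
    if [ "$first" = true ]; then first=false; else batches+=","; fi
    batches+="{\"start\":$i,\"end\":$batch_end}"
  done
  batches+="]"
  echo "{\"batch\":$batches}"
}

build_matrix "1-10" 4
# prints {"batch":[{"start":1,"end":4},{"start":5,"end":8},{"start":9,"end":10}]}
```

Each object in the `batch` array becomes one `chaos-test` job, so the second argument sets the number of seeds per job rather than the number of parallel jobs.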

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {}

Copilot Autofix

In general, the fix is to add an explicit permissions: block that restricts the GITHUB_TOKEN to the least privileges necessary. This can be defined either at the workflow root (applies to all jobs) or per job. Since both prepare and chaos-test only need to read repository contents (for actions/checkout) and do not push commits, manage issues/PRs, or modify settings, contents: read is sufficient. actions/cache and actions/upload-artifact do not require additional repository-scoped write permissions; they use dedicated cache/artifact infrastructure.

The best minimal fix without changing functionality is to add a workflow-level permissions: block right after the name: line (before on:). This block should set contents: read, which is the recommended baseline for read-only workflows. No other scopes appear needed given the provided steps. This will satisfy CodeQL’s requirement, keep the token as least-privilege, and apply consistently to all jobs in this workflow.

Concretely:

  • Edit .github/workflows/chaos-test.yml.
  • After line 1: name: Leader Lock Chaos Tests, insert:
    permissions:
      contents: read
  • No imports or additional methods are needed; this is pure workflow configuration.
Suggested changeset 1
.github/workflows/chaos-test.yml

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/chaos-test.yml b/.github/workflows/chaos-test.yml
--- a/.github/workflows/chaos-test.yml
+++ b/.github/workflows/chaos-test.yml
@@ -1,4 +1,6 @@
 name: Leader Lock Chaos Tests
+permissions:
+  contents: read
 
 on:
   workflow_dispatch:
EOF
Comment on lines +58 to +121
    needs: prepare
    runs-on: ubuntu-latest
    timeout-minutes: 180
    strategy:
      fail-fast: false
      matrix: ${{ fromJson(needs.prepare.outputs.matrix) }}
    steps:
      - uses: actions/checkout@v6

      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable

      - name: Install Redis
        run: sudo apt-get update && sudo apt-get install -y redis-server

      - name: Cache cargo
        uses: actions/cache@v4
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            target
          key: chaos-test-${{ hashFiles('**/Cargo.lock') }}

      - name: Build chaos test
        run: cargo build --release -p fuel-core-chaos-test

      - name: Run chaos tests (seeds ${{ matrix.batch.start }}-${{ matrix.batch.end }})
        run: |
          FAILED=0
          for seed in $(seq ${{ matrix.batch.start }} ${{ matrix.batch.end }}); do
            echo "=== Seed $seed ==="
            LOG="chaos_seed${seed}.log"
            cargo run --release -p fuel-core-chaos-test -- \
              --seed $seed \
              --duration ${{ inputs.duration }} \
              --block-time ${{ inputs.block_time }} \
              --fault-interval ${{ inputs.fault_interval }} \
              --stall-threshold ${{ inputs.stall_threshold }} \
              > "$LOG" 2>&1
            RC=$?
            if [ $RC -ne 0 ]; then
              echo "SEED $seed: FAIL"
              grep -E "FORK|RESULT" "$LOG" | tail -3
              if grep -q "FORK" "$LOG"; then
                echo "::error::FORK detected at seed $seed"
              fi
              FAILED=$((FAILED + 1))
            else
              echo "SEED $seed: PASS"
            fi
          done
          if [ $FAILED -gt 0 ]; then
            echo "::error::$FAILED seed(s) failed"
            exit 1
          fi

      - name: Upload logs
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: chaos-logs-${{ matrix.batch.start }}-${{ matrix.batch.end }}
          path: chaos_seed*.log
          retention-days: 14

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}

Copilot Autofix

In general, the fix is to add an explicit permissions block to the workflow so that the GITHUB_TOKEN has only the minimal required scopes. Since this workflow only checks out code, caches build artifacts, builds, runs tests, and uploads artifacts, it only needs read access to repository contents; no write permissions or special scopes (issues, pull-requests, etc.) are required.

The best fix without changing functionality is to add a top-level permissions block (applies to all jobs) right after the name: or on: section in .github/workflows/chaos-test.yml. Set contents: read, which is sufficient for actions/checkout and does not interfere with actions/cache or actions/upload-artifact, as those operate within the workflow’s already-granted scopes. No job-specific permissions overrides are needed since neither prepare nor chaos-test require more than read access.

Concretely, edit .github/workflows/chaos-test.yml to insert:

permissions:
  contents: read

after the name: Leader Lock Chaos Tests line (line 1) and before on: (line 3). No other code, steps, or dependencies need to change.

Suggested changeset 1
.github/workflows/chaos-test.yml

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/chaos-test.yml b/.github/workflows/chaos-test.yml
--- a/.github/workflows/chaos-test.yml
+++ b/.github/workflows/chaos-test.yml
@@ -1,5 +1,8 @@
 name: Leader Lock Chaos Tests
 
+permissions:
+  contents: read
+
 on:
   workflow_dispatch:
     inputs:
EOF
Comment thread .github/workflows/ci.yml Outdated
@MitchTurner MitchTurner marked this pull request as ready for review March 24, 2026 21:18
@MitchTurner MitchTurner requested review from a team, Dentosal and xgreenx as code owners March 24, 2026 21:18
@MitchTurner MitchTurner self-assigned this Mar 24, 2026
@MitchTurner MitchTurner requested a review from Voxelot March 24, 2026 21:18
  AWS_ECR_ORG: fuellabs
  CARGO_TERM_COLOR: always
- RUST_VERSION: 1.93.0
+ RUST_VERSION: 1.90.0

Cherry-pick downgrades RUST_VERSION from 1.93.0 to 1.90.0

Medium Severity

The cherry-pick introduces a RUST_VERSION regression from 1.93.0 to 1.90.0 in docker-images.yml. Every other location in the repo uses 1.93.0: rust-toolchain.toml, the Dockerfile (rust:1.93.0-bookworm), and ci.yml. While RUST_VERSION isn't currently interpolated via ${{ env.RUST_VERSION }} in this workflow's steps, it's exported as an environment variable available to all steps and any tool or script that reads it would get an incorrect, stale value.
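
A drift like this could be caught mechanically. The sketch below is one possible check, not something this PR contains: it collects every `RUST_VERSION: x.y.z` value from the files it is given and fails when they disagree. The function name `check_rust_version` is hypothetical, and the file paths in the usage comment are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical consistency check: print the distinct RUST_VERSION values found
# in the given files and return non-zero unless exactly one value is present.
check_rust_version() {
  local versions
  versions="$(grep -hoE 'RUST_VERSION: *[0-9]+\.[0-9]+\.[0-9]+' "$@" \
    | sed -E 's/^RUST_VERSION: *//' \
    | sort -u)"
  if [ -z "$versions" ]; then
    echo "no RUST_VERSION entries found" >&2
    return 2
  fi
  echo "$versions"
  # Exactly one distinct version means all files agree.
  [ "$(printf '%s\n' "$versions" | wc -l | tr -d ' ')" -eq 1 ]
}

# Example (illustrative paths):
#   check_rust_version .github/workflows/ci.yml .github/workflows/docker-images.yml
```

This only covers the `RUST_VERSION:` key in YAML files. The pins in rust-toolchain.toml and the Dockerfile use different syntax and would need their own patterns.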

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

          if [ $FAILED -gt 0 ]; then
            echo "::error::$FAILED seed(s) failed"
            exit 1
          fi

Chaos test loop exits on first seed failure

Medium Severity

GitHub Actions runs bash steps with errexit enabled by default: bash -e {0} when no shell is specified, and bash --noprofile --norc -eo pipefail {0} when shell: bash is set explicitly. When cargo run exits with a non-zero code on a failing seed, the shell terminates immediately, before reaching RC=$?. The loop therefore stops at the first failing seed, and the FAILED counter logic and the summary at the end are effectively dead code. The intent to run all seeds and collect failures is defeated.
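
One common way to keep such a loop running under errexit is to capture the exit status with `|| rc=$?`, because a command on the left-hand side of `||` does not trigger `set -e`. The sketch below illustrates the pattern and is not the fix adopted in this PR. `run_seed` is a hypothetical stand-in for the `cargo run` invocation, and it fails on even seeds purely for demonstration.

```shell
#!/usr/bin/env bash
set -eo pipefail  # mimic the errexit behaviour of a GitHub Actions run step

# Hypothetical stand-in for the chaos test binary: even seeds "fail".
run_seed() { [ $(( $1 % 2 )) -eq 1 ]; }

run_all_seeds() {
  local failed=0 seed rc
  for seed in "$@"; do
    rc=0
    # '|| rc=$?' records the failure without letting errexit abort the loop.
    run_seed "$seed" > /dev/null 2>&1 || rc=$?
    if [ "$rc" -ne 0 ]; then
      echo "SEED $seed: FAIL"
      failed=$((failed + 1))
    else
      echo "SEED $seed: PASS"
    fi
  done
  echo "failed=$failed"
}

run_all_seeds 1 2 3 4
# prints PASS for seeds 1 and 3, FAIL for seeds 2 and 4, then failed=2
```

Disabling errexit around the call with `set +e` and `set -e` would have the same effect. The `||` form keeps the change local to the one command whose failure is expected.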
