PoA quorum and HA failover fixes (cherry-pick e1588b6) #3237
MitchTurner wants to merge 4 commits
Squash of 32 commits from poa-quorum-fixes branch. Key changes:

Redis leader lease adapter:
- Fix silent read failures by clearing cached Redis connection on stream read errors
- Improve lock coverage to eliminate gaps during block production
- Release lease on error and use block_time delay for faster failover
- Add Prometheus metrics (lease acquisitions, renewals, losses, errors)

Fork prevention (write_block.lua):
- Add HEIGHT_EXISTS check that rejects all writes at heights already present in the Redis stream, regardless of epoch
- Prevents duplicate-height forks via pigeonhole principle

Sub-quorum block repair:
- During reconciliation, detect heights with entries below quorum and repropose the highest-epoch block to all nodes
- HEIGHT_EXISTS on nodes that already have it, fresh write on others

Chaos test harness (bin/chaos-test/):
- Standalone binary for HA leader lock failover testing under fault injection (Redis kill/restart, proxy faults, node restarts)
- Fork detection with Redis stream state dumping
- Stall detection (warnings only, not failures; stalls self-recover)
- AOF persistence for Redis servers to survive kill/restart cycles
- CI workflow for automated chaos testing

Formal verification:
- FizzBee models (v1 and v2) for sequencer HA correctness
- Adversarial model with nondeterministic reads and tractable state space
- Post-mortem analysis of issues found

Off-chain worker fix:
- Use StateRewindPolicy::RewindFullRange to prevent NoHistoryForRequestedHeight node bricking

Documentation:
- Failover spec updates (HEIGHT_EXISTS, sub-quorum repair)
- Interactive scenario diagrams for PoA Redis fencing
- HA failover issues tracker and chaos test findings

Co-authored-by: Mitchell Turner <james.mitchell.turner@gmail.com>
Co-authored-by: Green Baneling <XgreenX9999@gmail.com>
PR Summary
Medium Risk
Overview: Adds new tooling/automation around reliability testing and security auditing. Introduces a manual ...
Written by Cursor Bugbot for commit 5e99a52.
        runs-on: ubuntu-latest
        outputs:
          matrix: ${{ steps.matrix.outputs.matrix }}
        steps:
          - id: matrix
            run: |
              RANGE="${{ inputs.seeds }}"
              START="${RANGE%-*}"
              END="${RANGE#*-}"
              BATCH_SIZE=${{ inputs.parallelism }}
              BATCHES="["
              FIRST=true
              for ((i=START; i<=END; i+=BATCH_SIZE)); do
                BATCH_END=$((i + BATCH_SIZE - 1))
                if [ $BATCH_END -gt $END ]; then BATCH_END=$END; fi
                if [ "$FIRST" = true ]; then FIRST=false; else BATCHES+=","; fi
                BATCHES+="{\"start\":$i,\"end\":$BATCH_END}"
              done
              BATCHES+="]"
              echo "matrix={\"batch\":$BATCHES}" >> "$GITHUB_OUTPUT"

      chaos-test:
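To check what the matrix step above emits for a given seed range, the same loop can be run locally. This is a sketch only: `make_batches` is a hypothetical wrapper name, and the range and batch size are passed as arguments instead of coming from the workflow inputs.

```shell
#!/usr/bin/env bash
# make_batches RANGE BATCH_SIZE
# Hypothetical local wrapper around the loop from the matrix step: splits an
# inclusive "START-END" seed range into batches of at most BATCH_SIZE seeds
# and prints the JSON object that the step writes to its "matrix" output.
make_batches() {
  local range=$1 batch_size=$2
  local start="${range%-*}"   # text before the '-'
  local end="${range#*-}"     # text after the '-'
  local batches="[" first=true i batch_end
  for ((i = start; i <= end; i += batch_size)); do
    batch_end=$((i + batch_size - 1))
    # The last batch is clamped so it never runs past the end of the range.
    if [ "$batch_end" -gt "$end" ]; then batch_end=$end; fi
    if [ "$first" = true ]; then first=false; else batches+=","; fi
    batches+="{\"start\":$i,\"end\":$batch_end}"
  done
  batches+="]"
  echo "{\"batch\":$batches}"
}

make_batches "1-10" 4
# prints {"batch":[{"start":1,"end":4},{"start":5,"end":8},{"start":9,"end":10}]}
```

Each element of `batch` becomes one matrix job, so a range of 10 seeds with a batch size of 4 yields three jobs. The sketch does not validate its arguments; the batch size must be a positive integer or the loop never advances.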
Check warning
Code scanning / CodeQL
Workflow does not contain permissions Medium
Copilot Autofix
AI about 2 months ago
In general, the fix is to add an explicit permissions: block that restricts the GITHUB_TOKEN to the least privileges necessary. This can be defined either at the workflow root (applies to all jobs) or per job. Since both prepare and chaos-test only need to read repository contents (for actions/checkout) and do not push commits, manage issues/PRs, or modify settings, contents: read is sufficient. actions/cache and actions/upload-artifact do not require additional repository-scoped write permissions; they use dedicated cache/artifact infrastructure.
The best minimal fix without changing functionality is to add a workflow-level permissions: block right after the name: line (before on:). This block should set contents: read, which is the recommended baseline for read-only workflows. No other scopes appear needed given the provided steps. This will satisfy CodeQL’s requirement, keep the token as least-privilege, and apply consistently to all jobs in this workflow.
Concretely:
- Edit .github/workflows/chaos-test.yml.
- After line 1 (name: Leader Lock Chaos Tests), insert:

      permissions:
        contents: read

- No imports or additional methods are needed; this is pure workflow configuration.
    @@ -1,4 +1,6 @@
     name: Leader Lock Chaos Tests
    +permissions:
    +  contents: read

     on:
       workflow_dispatch:
        needs: prepare
        runs-on: ubuntu-latest
        timeout-minutes: 180
        strategy:
          fail-fast: false
          matrix: ${{ fromJson(needs.prepare.outputs.matrix) }}
        steps:
          - uses: actions/checkout@v6

          - name: Install Rust toolchain
            uses: dtolnay/rust-toolchain@stable

          - name: Install Redis
            run: sudo apt-get update && sudo apt-get install -y redis-server

          - name: Cache cargo
            uses: actions/cache@v4
            with:
              path: |
                ~/.cargo/registry
                ~/.cargo/git
                target
              key: chaos-test-${{ hashFiles('**/Cargo.lock') }}

          - name: Build chaos test
            run: cargo build --release -p fuel-core-chaos-test

          - name: Run chaos tests (seeds ${{ matrix.batch.start }}-${{ matrix.batch.end }})
            run: |
              FAILED=0
              for seed in $(seq ${{ matrix.batch.start }} ${{ matrix.batch.end }}); do
                echo "=== Seed $seed ==="
                LOG="chaos_seed${seed}.log"
                cargo run --release -p fuel-core-chaos-test -- \
                  --seed $seed \
                  --duration ${{ inputs.duration }} \
                  --block-time ${{ inputs.block_time }} \
                  --fault-interval ${{ inputs.fault_interval }} \
                  --stall-threshold ${{ inputs.stall_threshold }} \
                  > "$LOG" 2>&1
                RC=$?
                if [ $RC -ne 0 ]; then
                  echo "SEED $seed: FAIL"
                  grep -E "FORK|RESULT" "$LOG" | tail -3
                  if grep -q "FORK" "$LOG"; then
                    echo "::error::FORK detected at seed $seed"
                  fi
                  FAILED=$((FAILED + 1))
                else
                  echo "SEED $seed: PASS"
                fi
              done
              if [ $FAILED -gt 0 ]; then
                echo "::error::$FAILED seed(s) failed"
                exit 1
              fi

          - name: Upload logs
            if: always()
            uses: actions/upload-artifact@v4
            with:
              name: chaos-logs-${{ matrix.batch.start }}-${{ matrix.batch.end }}
              path: chaos_seed*.log
              retention-days: 14
Check warning
Code scanning / CodeQL
Workflow does not contain permissions Medium
Copilot Autofix
AI about 2 months ago
In general, the fix is to add an explicit permissions block to the workflow so that the GITHUB_TOKEN has only the minimal required scopes. Since this workflow only checks out code, caches build artifacts, builds, runs tests, and uploads artifacts, it only needs read access to repository contents; no write permissions or special scopes (issues, pull-requests, etc.) are required.
The best fix without changing functionality is to add a top-level permissions block (applies to all jobs) right after the name: or on: section in .github/workflows/chaos-test.yml. Set contents: read, which is sufficient for actions/checkout and does not interfere with actions/cache or actions/upload-artifact, as those operate within the workflow’s already-granted scopes. No job-specific permissions overrides are needed since neither prepare nor chaos-test require more than read access.
Concretely, edit .github/workflows/chaos-test.yml to insert:

    permissions:
      contents: read

after the name: Leader Lock Chaos Tests line (line 1) and before on: (line 3). No other code, steps, or dependencies need to change.
    @@ -1,5 +1,8 @@
     name: Leader Lock Chaos Tests

    +permissions:
    +  contents: read
    +
     on:
       workflow_dispatch:
         inputs:
       AWS_ECR_ORG: fuellabs
       CARGO_TERM_COLOR: always
    -  RUST_VERSION: 1.93.0
    +  RUST_VERSION: 1.90.0
Cherry-pick downgrades RUST_VERSION from 1.93.0 to 1.90.0
Medium Severity
The cherry-pick introduces a RUST_VERSION regression from 1.93.0 to 1.90.0 in docker-images.yml. Every other location in the repo uses 1.93.0: rust-toolchain.toml, the Dockerfile (rust:1.93.0-bookworm), and ci.yml. While RUST_VERSION isn't currently interpolated via ${{ env.RUST_VERSION }} in this workflow's steps, it's exported as an environment variable available to all steps and any tool or script that reads it would get an incorrect, stale value.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
              if [ $FAILED -gt 0 ]; then
                echo "::error::$FAILED seed(s) failed"
                exit 1
              fi
Chaos test loop exits on first seed failure
Medium Severity
On Linux runners, GitHub Actions runs each run: step with bash -e by default (and with bash --noprofile --norc -eo pipefail when shell: bash is set explicitly). When cargo run exits with a non-zero code on a failing seed, the shell terminates immediately, before reaching RC=$?. The loop therefore only processes seeds until the first failure, and the FAILED counter logic and the summary at the end are effectively dead code. The intent to run all seeds and collect failures is defeated.
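The behaviour described in this comment can be reproduced outside CI. The following is a minimal sketch of one common workaround, not the fix adopted in this PR: `run_seed` is a hypothetical stand-in for the `cargo run` invocation (it fails on even seeds), and `count_failures` is a hypothetical name. Appending `|| rc=$?` makes the call part of an OR list, which errexit does not treat as fatal, so the loop keeps going and the counter is reached.

```shell
#!/usr/bin/env bash
set -e  # errexit, as in the workflow's run: step

# Hypothetical stand-in for the chaos-test binary: even seeds "fail".
run_seed() {
  [ $(( $1 % 2 )) -ne 0 ]
}

# count_failures START END
# Runs every seed in START..END and prints how many of them failed.
count_failures() {
  local failed=0 seed rc
  for seed in $(seq "$1" "$2"); do
    rc=0
    # Without '|| rc=$?', the first failing seed would end the script here.
    run_seed "$seed" || rc=$?
    if [ "$rc" -ne 0 ]; then
      failed=$((failed + 1))
    fi
  done
  echo "$failed"
}

count_failures 1 6
# prints 3 (seeds 2, 4 and 6 fail, and all six seeds are still visited)
```

Applied to the workflow step, the equivalent change would be to set RC=0 before the cargo run command and end that command with `|| RC=$?`, leaving the rest of the loop unchanged.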

