OpenTrident Failover + Restore Drill

Decision: this is the next expensive proof move.
Goal: prove OpenTrident can survive leader loss and then cold-restore from the signed snapshot chain.
Standard: no fake green. A drill only passes if Telegram continuity, failover state, and restored state all verify live.

Scope

This runbook proves three things:

a second node can come up as a follower from the current bootstrap manifest
the follower can take over after leader loss
a third cold node can restore from the signed snapshot chain and boot cleanly

This is not a migration demo. It is a persistence drill.

Current Live Inputs

primary VPS: 49.12.7.18
primary runtime repo: /opt/opentrident
primary identity repo: /opt/OpenTrident
primary state dir: /opt/opentrident-data/config
live snapshot head: snap-2026041711-7171a6e8
current bootstrap manifest path: /opt/opentrident-data/config/bootstrap.json
snapshot referenced by the current bootstrap manifest's snapshot URL: snap-2026041709-fcfdc8e3

Load-bearing note: the bootstrap manifest is currently stale relative to snapshot-head.
Before the drill starts, regenerate bootstrap.json from the latest snapshot head.

Second load-bearing note: the current manifest uses dockerImage: opentrident:latest, which is a local tag, not a published registry image.
That means nodes B and C must be pre-staged with the image from node A before bootstrap starts.

Success Criteria

The drill only passes if all of these are true:

follower node boots from the refreshed bootstrap manifest
follower stays in follower mode while the leader is healthy
leader is intentionally stopped
follower takes over and becomes the active leader
Telegram still responds through the new leader
/healthz and /readyz stay green on the new leader
a third blank node restores from the signed snapshot chain
restored node comes up cleanly with the expected snapshot head
failover + restore evidence is captured in one log bundle

If any one of these fails, the drill is a failure.

Timebox

prep: 30 min
follower bootstrap: 45 min
forced leader-loss + takeover: 30 min
cold restore on third node: 45 min
evidence collection + teardown: 30 min

Total: about 3 hours.

Infra Budget

Use cheap transient nodes for the drill:

node A: existing primary 49.12.7.18
node B: temporary follower, Hetzner cpx11
node C: temporary cold-restore node, Hetzner cpx11

Delete B and C after the drill unless one becomes the new primary.

Preconditions

Do not start unless these are already true:

MacBook, GitHub, and VPS checkouts are in sync, with clean working trees
primary node is healthy:
- /healthz
- /readyz
snapshot-head exists and is non-none
GitHub snapshot release for the latest snapshot head exists
bootstrap.json exists
Telegram bot is clean:
- no 409 conflict churn
Docker footprint is bounded on the primary node (no runaway images, volumes, or logs; docker system df shows current usage)

If any precondition fails, fix that first.
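
The file and endpoint preconditions above can be checked mechanically. A minimal bash sketch, assuming the primary serves /healthz and /readyz on 127.0.0.1:18889 (the port this runbook uses on nodes B and C) and that snapshot-head is a plain-text file. The function names are illustrative and not part of OpenTrident:

```shell
# Succeeds only if the snapshot-head file exists, is non-empty, and is not "none".
snapshot_head_ok() {
  local file="$1" head
  [ -s "$file" ] || return 1
  head="$(tr -d '[:space:]' < "$file")"
  [ -n "$head" ] && [ "$head" != "none" ]
}

# Succeeds only if the endpoint answers without an HTTP error within 5 seconds.
endpoint_ok() {
  curl -sf --max-time 5 -o /dev/null "$1"
}

# Runs every check and reports all failures instead of stopping at the first.
preflight() {
  local cfg="$1" base="$2" failed=0
  snapshot_head_ok "$cfg/snapshot-head" || { echo "FAIL snapshot-head"; failed=1; }
  [ -s "$cfg/bootstrap.json" ] || { echo "FAIL bootstrap.json"; failed=1; }
  endpoint_ok "$base/healthz" || { echo "FAIL /healthz"; failed=1; }
  endpoint_ok "$base/readyz" || { echo "FAIL /readyz"; failed=1; }
  return "$failed"
}
```

On the primary this would be run as: preflight /opt/opentrident-data/config http://127.0.0.1:18889. The remaining preconditions (repo sync, GitHub release, Telegram 409 churn, Docker footprint) still need a manual check.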

Phase 0 — Regenerate Bootstrap Manifest

Why: current bootstrap.json points at snap-2026041709-fcfdc8e3, while live state is snap-2026041711-7171a6e8.

Run on primary:

ssh -i ~/.ssh/binance_futures_tool root@49.12.7.18
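# fail the pipeline if the manifest command fails, so a broken manifest is never copied into place
set -o pipefail
docker exec opentrident-gateway node /app/dist/index.js manifest bootstrap --json | tee /tmp/bootstrap.json \
  && cp /tmp/bootstrap.json /opt/opentrident-data/config/bootstrap.json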
cat /opt/opentrident-data/config/bootstrap.json
cat /opt/opentrident-data/config/snapshot-head

Pass condition: the snapshot URL inside bootstrap.json references the current snapshot-head.
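
This pass condition can be checked without reading the JSON by eye. A minimal bash sketch, assuming snapshot-head holds the snapshot ID as plain text and that the ID appears verbatim somewhere in bootstrap.json. The manifest's field layout is not specified here, so the check is a plain substring match, and manifest_matches_head is an illustrative name:

```shell
# Succeeds only if the ID stored in the snapshot-head file occurs in the manifest.
manifest_matches_head() {
  local manifest="$1" head_file="$2" head
  head="$(tr -d '[:space:]' < "$head_file")" || return 1
  [ -n "$head" ] || return 1
  grep -qF -- "$head" "$manifest"
}
```

Usage on the primary: manifest_matches_head /opt/opentrident-data/config/bootstrap.json /opt/opentrident-data/config/snapshot-head && echo PASS.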

Phase 1 — Provision Follower Node

Provision node B in Hetzner:

type: cpx11
image: Ubuntu 24.04
SSH key: existing MacBook Air M5 key
name: opentrident-follower-drill

Record:

public IP
private/Tailscale IP if used
server id

On node B:

ssh -i ~/.ssh/binance_futures_tool root@<NODE_B_IP>
apt-get update
apt-get install -y curl git rsync docker.io docker-compose-v2 nodejs npm
corepack enable || npm install -g pnpm
mkdir -p /opt/OpenTrident /opt/opentrident /opt/opentrident-data/config

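Copy only what is needed. These commands run on node B and pull from node A, so node B must be able to authenticate to node A with the key referenced below (or via SSH agent forwarding):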

rsync -avz -e "ssh -i ~/.ssh/binance_futures_tool" \
  root@49.12.7.18:/opt/OpenTrident/ /opt/OpenTrident/

rsync -avz -e "ssh -i ~/.ssh/binance_futures_tool" \
  root@49.12.7.18:/opt/opentrident/ /opt/opentrident/

rsync -avz -e "ssh -i ~/.ssh/binance_futures_tool" \
  root@49.12.7.18:/opt/opentrident-data/config/bootstrap.json /opt/opentrident-data/config/

Do not copy the full live state tree to the follower for this drill. The point is follower bootstrap from the signed path.

Install runtime deps so the host CLI can execute:

cd /opt/opentrident
pnpm install --prod

Pre-stage the live image from node A because the manifest references a local tag:

ssh -i ~/.ssh/binance_futures_tool root@49.12.7.18 \
  'docker save opentrident:latest | gzip -1' \
  | docker load

Phase 2 — Bootstrap Follower From Manifest

On node B:

cd /opt/opentrident
cat /opt/opentrident-data/config/bootstrap.json
python3 -m http.server 18890 --bind 127.0.0.1 --directory /opt/opentrident-data/config

In a second shell, run the cold bootstrap command against that local URL:

cd /opt/opentrident
node dist/index.js bootstrap --from http://127.0.0.1:18890/bootstrap.json

Follower env rules:

set unique OPENTRIDENT_INSTANCE_ID
set follower mode explicitly if supported
keep the same Telegram token only if the runtime follower path does not start polling while in follower mode

Pass condition: node B starts, reports follower state, and does not create a Telegram conflict while node A is still leader.

Evidence to capture:

curl -sf http://127.0.0.1:18889/api/dashboard-data
docker logs opentrident-gateway --tail 200

You want to see:

follower mode
observed leader present
no takeover attempts
no Telegram 409 churn
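
The last item can be checked against the captured logs. A bash sketch, assuming the gateway logs Telegram's conflict error with both the status code 409 and the word "Conflict" on the same line, as in Telegram's own error text. The gateway's actual log format is not documented here, so the pattern may need adjusting:

```shell
# Counts log lines on stdin that look like Telegram 409 conflicts.
# grep -c exits non-zero when the count is 0, so the status is normalised.
count_telegram_conflicts() {
  grep -ciE '409.*conflict|conflict.*409' || true
}
```

Usage on node B: docker logs opentrident-gateway --tail 200 2>&1 | count_telegram_conflicts. While node A is still leader, the expected count is 0.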

Phase 3 — Force Leader Loss
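
Phase 6 expects a pre-failure dashboard capture and a shutdown timestamp from node A, so both need to be recorded before the leader is stopped. A bash sketch, assuming node A exposes the dashboard on the same local port (18889) used on nodes B and C. record_event, capture_dashboard, and the timestamps.txt file are illustrative additions, not part of the required bundle:

```shell
# Appends "<UTC timestamp> <label>" to timestamps.txt in the drill directory.
record_event() {
  local dir="$1" label="$2"
  mkdir -p "$dir" || return 1
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$label" >> "$dir/timestamps.txt"
}

# Saves the dashboard JSON from the local gateway into the drill directory.
capture_dashboard() {
  local dir="$1" name="$2"
  mkdir -p "$dir" || return 1
  curl -sf --max-time 10 -o "$dir/$name" http://127.0.0.1:18889/api/dashboard-data
}
```

On node A, immediately before the stop, with DRILL_DIR set to the evidence directory from Phase 6: capture_dashboard "$DRILL_DIR" primary-pre.json && record_event "$DRILL_DIR" node-a-shutdown. The same helpers can record the takeover on node B and the restore on node C.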

On node A:

docker stop opentrident-gateway opentrident-cli

This is intentional leader loss. Do not delete state yet.

Now watch node B:

docker logs -f opentrident-gateway
curl -sf http://127.0.0.1:18889/api/dashboard-data
curl -sf http://127.0.0.1:18889/healthz
curl -sf http://127.0.0.1:18889/readyz

Pass condition:

failover state changes from follower to leader
takeover attempts increment
last cycle status becomes live again
Telegram responses now come from node B
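
The health side of this pass condition can be polled instead of re-running curl by hand. A bash sketch; the default of 120 attempts is an arbitrary assumption, not a documented takeover bound:

```shell
# Retries a URL until it answers without an HTTP error, sleeping 1 second
# between attempts. Returns 0 on success, 1 after the given number of attempts.
wait_for_ok() {
  local url="$1" attempts="${2:-120}" n=0
  until curl -sf --max-time 5 -o /dev/null "$url"; do
    n=$((n + 1))
    [ "$n" -ge "$attempts" ] && return 1
    sleep 1
  done
  return 0
}
```

Usage on node B: wait_for_ok http://127.0.0.1:18889/healthz && wait_for_ok http://127.0.0.1:18889/readyz. Green health alone does not prove Telegram continuity, so the manual Telegram proof is still required.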

Manual Telegram proof:

send a simple message to the bot
confirm response arrives
confirm node A is still down
confirm node B logs the response path

Phase 4 — Cold Restore On Third Node

Provision node C:

type: cpx11
image: Ubuntu 24.04
name: opentrident-restore-drill

On node C:

ssh -i ~/.ssh/binance_futures_tool root@<NODE_C_IP>
apt-get update
apt-get install -y curl git rsync docker.io docker-compose-v2 nodejs npm
corepack enable || npm install -g pnpm
mkdir -p /opt/OpenTrident /opt/opentrident /opt/opentrident-data/config

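Copy only the repos and the bootstrap manifest. These commands run on node C and pull from node A, so node C must be able to authenticate to node A: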

rsync -avz -e "ssh -i ~/.ssh/binance_futures_tool" \
  root@49.12.7.18:/opt/OpenTrident/ /opt/OpenTrident/

rsync -avz -e "ssh -i ~/.ssh/binance_futures_tool" \
  root@49.12.7.18:/opt/opentrident/ /opt/opentrident/

rsync -avz -e "ssh -i ~/.ssh/binance_futures_tool" \
  root@49.12.7.18:/opt/opentrident-data/config/bootstrap.json /opt/opentrident-data/config/

Install runtime deps and pre-stage the image:

cd /opt/opentrident
pnpm install --prod

ssh -i ~/.ssh/binance_futures_tool root@49.12.7.18 \
  'docker save opentrident:latest | gzip -1' \
  | docker load

Serve the manifest locally and restore from the signed snapshot chain only:

cd /opt/opentrident
python3 -m http.server 18890 --bind 127.0.0.1 --directory /opt/opentrident-data/config

Then in a second shell:

cd /opt/opentrident
node dist/index.js bootstrap --from http://127.0.0.1:18890/bootstrap.json

Then verify:

cat /opt/opentrident-data/config/snapshot-head
curl -sf http://127.0.0.1:18889/healthz
curl -sf http://127.0.0.1:18889/readyz
curl -sf http://127.0.0.1:18889/api/dashboard-data

Pass condition:

restored node boots
health is green
snapshot head matches expected chain head or the manifest target
signed snapshot verification passed as part of restore

Phase 5 — Byte-Level State Checks

Compare load-bearing files between the active leader and the restored node (run on both, then compare the output):

sha256sum /opt/opentrident-data/config/trust-telemetry-v1.json
sha256sum /opt/opentrident-data/config/bootstrap.json
sha256sum /opt/opentrident-data/config/snapshot-head
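
To make the hashes directly comparable across nodes, it helps to emit them with paths relative to the state directory and diff the two outputs. A bash sketch using GNU sha256sum; hash_state is an illustrative helper:

```shell
# Prints sha256 hashes of the named files, relative to the state directory,
# so that output captured on two nodes can be compared with diff.
hash_state() {
  local dir="$1"
  shift
  ( cd "$dir" && sha256sum -- "$@" )
}
```

Usage on each node: hash_state /opt/opentrident-data/config trust-telemetry-v1.json bootstrap.json snapshot-head > sha256.txt, then copy one file across and diff the two.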

Also inspect:

planner-v1.json
memory-v1.json
doctrine-v1.json
playbooks/playbook-store.json

Not every file must match if the live leader advanced during the drill.
What matters is:

restore is internally coherent
snapshot chain verifies
no corrupted/missing state files
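
The "no corrupted/missing state files" check can be approximated by confirming that each inspected file exists and parses as JSON. A bash sketch using python3 -m json.tool (python3 is already used in this runbook). It assumes all four files are JSON, as their extensions suggest, and it says nothing about whether the contents are semantically coherent:

```shell
# Reports each state file that is missing, empty, or not valid JSON.
# Returns non-zero if any file fails.
check_state_files() {
  local dir="$1" f failed=0
  for f in planner-v1.json memory-v1.json doctrine-v1.json playbooks/playbook-store.json; do
    if [ ! -s "$dir/$f" ]; then
      echo "MISSING $f"; failed=1
    elif ! python3 -m json.tool "$dir/$f" > /dev/null 2>&1; then
      echo "INVALID $f"; failed=1
    fi
  done
  return "$failed"
}
```

Usage on the restored node: check_state_files /opt/opentrident-data/config.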

Phase 6 — Evidence Bundle

Collect:

node A shutdown timestamp
node B takeover timestamp
Telegram response proof
node C restore timestamp
health checks from B and C
dashboard JSON from A pre-failure, B post-takeover, C post-restore
exact snapshot head used
exact bootstrap manifest used

Save to:

/opt/opentrident-data/config/drills/failover-restore-YYYY-MM-DD/

Minimum files:

primary-pre.json
follower-post.json
restore-post.json
bootstrap.json
snapshot-head.txt
telegram-proof.txt
sha256.txt
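
Since a missing file means missing evidence, the bundle can be checked for completeness before teardown. A bash sketch over the minimum file list above; bundle_missing is an illustrative name:

```shell
# Prints every required evidence file that is absent or empty.
# Returns non-zero if anything is missing.
bundle_missing() {
  local dir="$1" f missing=0
  for f in primary-pre.json follower-post.json restore-post.json \
           bootstrap.json snapshot-head.txt telegram-proof.txt sha256.txt; do
    [ -s "$dir/$f" ] || { echo "$f"; missing=1; }
  done
  return "$missing"
}
```

Run it against the drill directory before destroying nodes B and C; empty output and a zero exit status mean the minimum bundle is present.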

Rollback

If follower takeover fails:

restart node A
confirm Telegram returns to A
destroy node B
inspect follower logs and failover state

If cold restore fails:

keep node B as active leader if takeover already succeeded
destroy node C
fix restore path before retrying

Do not leave three half-configured nodes alive.

Failure Modes To Watch

bootstrap manifest points at stale snapshot head
follower accidentally starts Telegram polling before takeover
leader/follower lock does not flip cleanly
restore path downloads bundle but fails signature verification
restored node boots with missing env or missing compose/runtime image
dashboard looks healthy but Telegram still points at the dead node

Any one of these makes the drill fail.

AAA Pass Standard

This drill reaches AAA only if:

node B takes over with no human code changes mid-drill
Telegram continuity is proven live
node C restores from signed snapshots without ad-hoc file surgery
evidence bundle is written
both temporary nodes are destroyed or intentionally promoted after proof

Until then, persistence is promising, not proven.
