Decision: this is the next expensive proof move.
Goal: prove OpenTrident can survive leader loss and then cold-restore from the signed snapshot chain.
Standard: no fake green. A drill only passes if Telegram continuity, failover state, and restored state all verify live.
This runbook proves three things:
- a second node can come up as a follower from the current bootstrap manifest
- the follower can take over after leader loss
- a third cold node can restore from the signed snapshot chain and boot cleanly
This is not a migration demo. It is a persistence drill.
- primary VPS: 49.12.7.18
- primary runtime repo: /opt/opentrident
- primary identity repo: /opt/OpenTrident
- primary state dir: /opt/opentrident-data/config
- live snapshot head: snap-2026041711-7171a6e8
- current bootstrap manifest path: /opt/opentrident-data/config/bootstrap.json
- current bootstrap manifest snapshot URL: snap-2026041709-fcfdc8e3
Load-bearing note: the bootstrap manifest is currently stale relative to snapshot-head.
Before the drill starts, regenerate bootstrap.json from the latest snapshot head.
Second load-bearing note: the current manifest uses dockerImage: opentrident:latest, which is a local tag, not a published registry image.
That means nodes B and C must be pre-staged with the image from node A before bootstrap starts.
The drill only passes if all of these are true:
- follower node boots from the refreshed bootstrap manifest
- follower stays in follower mode while the leader is healthy
- leader is intentionally stopped
- follower takes over and becomes the active leader
- Telegram still responds through the new leader
- /healthz and /readyz stay green on the new leader
- a third blank node restores from the signed snapshot chain
- restored node comes up cleanly with the expected snapshot head
- failover + restore evidence is captured in one log bundle
If any one of these fails, the drill is a failure.
- prep: 30 min
- follower bootstrap: 45 min
- forced leader-loss + takeover: 30 min
- cold restore on third node: 45 min
- evidence collection + teardown: 30 min
Total: about 3 hours.
Use cheap transient nodes for the drill:
- node A: existing primary, 49.12.7.18
- node B: temporary follower, Hetzner cpx11
- node C: temporary cold-restore node, Hetzner cpx11
Delete B and C after the drill unless one becomes the new primary.
Do not start unless these are already true:
- MacBook, GitHub, and VPS are synced and clean
- primary node is healthy: /healthz and /readyz green
- snapshot-head exists and is non-none
- GitHub snapshot release for the latest snapshot head exists
- bootstrap.json exists
- Telegram bot is clean: no 409 conflict churn
- Docker footprint is bounded on the primary node
If any precondition fails, fix that first.
Why: current bootstrap.json points at snap-2026041709-fcfdc8e3, while live state is snap-2026041711-7171a6e8.
Run on primary:
ssh -i ~/.ssh/binance_futures_tool root@49.12.7.18
docker exec opentrident-gateway node /app/dist/index.js manifest bootstrap --json | tee /tmp/bootstrap.json
cp /tmp/bootstrap.json /opt/opentrident-data/config/bootstrap.json
cat /opt/opentrident-data/config/bootstrap.json
cat /opt/opentrident-data/config/snapshot-head

Pass condition: the snapshot URL inside bootstrap.json references the current snapshot-head.
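This pass condition can be checked mechanically instead of by eye. A minimal sketch, assuming only that the snapshot id embedded in bootstrap.json follows the snap-<timestamp>-<hash> format shown in this runbook; `check_manifest_fresh` is a hypothetical helper, not part of the CLI:

```shell
# Hypothetical helper: compare the id in snapshot-head with the first
# snap-<ts>-<hash> token found inside bootstrap.json. The id format
# (10-digit timestamp, 8 hex chars) is inferred from the ids in this runbook.
check_manifest_fresh() {
  state_dir="$1"
  head_id=$(cat "$state_dir/snapshot-head")
  manifest_id=$(grep -o 'snap-[0-9]\{10\}-[0-9a-f]\{8\}' "$state_dir/bootstrap.json" | head -n1)
  if [ -n "$head_id" ] && [ "$head_id" = "$manifest_id" ]; then
    echo "fresh: $head_id"
  else
    echo "STALE: manifest=$manifest_id head=$head_id" >&2
    return 1
  fi
}
```

Run it against /opt/opentrident-data/config before and after regenerating the manifest; a non-zero exit means the drill must not start.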
Provision node B in Hetzner:
- type: cpx11
- image: Ubuntu 24.04
- SSH key: existing MacBook Air M5 key
- name: opentrident-follower-drill
Record:
- public IP
- private/Tailscale IP if used
- server id
On node B:
ssh -i ~/.ssh/binance_futures_tool root@<NODE_B_IP>
apt-get update
apt-get install -y curl git rsync docker.io docker-compose-plugin nodejs npm
corepack enable || true
mkdir -p /opt/OpenTrident /opt/opentrident /opt/opentrident-data/config

Copy only what is needed:
rsync -avz -e "ssh -i ~/.ssh/binance_futures_tool" \
root@49.12.7.18:/opt/OpenTrident/ /opt/OpenTrident/
rsync -avz -e "ssh -i ~/.ssh/binance_futures_tool" \
root@49.12.7.18:/opt/opentrident/ /opt/opentrident/
rsync -avz -e "ssh -i ~/.ssh/binance_futures_tool" \
root@49.12.7.18:/opt/opentrident-data/config/bootstrap.json /opt/opentrident-data/config/

Do not copy the full live state tree to the follower for this drill. The point is follower bootstrap from the signed path.
Install runtime deps so the host CLI can execute:
cd /opt/opentrident
pnpm install --prod

Pre-stage the live image from node A because the manifest references a local tag:
ssh -i ~/.ssh/binance_futures_tool root@49.12.7.18 \
'docker save opentrident:latest | gzip -1' \
| docker load

On node B:
cd /opt/opentrident
cat /opt/opentrident-data/config/bootstrap.json
python3 -m http.server 18890 --directory /opt/opentrident-data/config

In a second shell, run the cold bootstrap command against that local URL:
cd /opt/opentrident
node dist/index.js bootstrap --from http://127.0.0.1:18890/bootstrap.json

Follower env rules:
- set a unique OPENTRIDENT_INSTANCE_ID
- set follower mode explicitly if supported
- keep the same Telegram token only if the runtime follower path does not start polling while in follower mode
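As an env fragment, the rules above might look like this. Only OPENTRIDENT_INSTANCE_ID appears in this runbook; OPENTRIDENT_ROLE is a placeholder name for whatever variable the runtime actually uses to pin follower mode, so verify it before relying on it:

```shell
# Hypothetical follower env for node B. OPENTRIDENT_INSTANCE_ID comes from the
# rules above; OPENTRIDENT_ROLE is a placeholder, check the runtime's real config.
OPENTRIDENT_INSTANCE_ID=opentrident-follower-drill-b
OPENTRIDENT_ROLE=follower
```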
Pass condition: node B starts, reports follower state, and does not create a Telegram conflict while node A is still leader.
Evidence to capture:
curl -sf http://127.0.0.1:18889/api/dashboard-data
docker logs opentrident-gateway --tail 200

You want to see:
- follower mode
- observed leader present
- no takeover attempts
- no Telegram 409 churn
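The four expectations above can be pre-screened from a saved log capture. A sketch, assuming only that follower-mode lines mention "follower" and Telegram conflicts surface as "409" in the log text; both patterns are assumptions about the runtime's log format, so tune them to the real messages:

```shell
# Hypothetical checker for a saved `docker logs` capture: require at least one
# follower-mode line and zero Telegram 409 conflict lines. The grep patterns
# are assumptions about the log format, not confirmed strings.
check_follower_evidence() {
  log="$1"
  grep -qi 'follower' "$log" || { echo "FAIL: no follower-mode line" >&2; return 1; }
  if grep -q '409' "$log"; then
    echo "FAIL: Telegram 409 churn in log" >&2
    return 1
  fi
  echo "PASS: follower evidence present, no 409 churn"
}
```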
On node A:
docker stop opentrident-gateway opentrident-cli

This is intentional leader loss. Do not delete state yet.
Now watch node B:
docker logs -f opentrident-gateway
curl -sf http://127.0.0.1:18889/api/dashboard-data
curl -sf http://127.0.0.1:18889/healthz
curl -sf http://127.0.0.1:18889/readyz

Pass condition:
- failover state changes from follower to leader
- takeover attempts increment
- last cycle status becomes live again
- Telegram responses now come from node B
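Takeover is not instant, so it is easier to poll than to eyeball the curls above. A small generic poller; `wait_until` is a hypothetical helper, and the endpoint in the example is the runbook's own /readyz:

```shell
# Hypothetical poller: rerun a probe command every 2s until it succeeds or the
# deadline (first argument, in seconds) passes.
wait_until() {
  deadline=$(( $(date +%s) + $1 )); shift
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timeout waiting for: $*" >&2
      return 1
    fi
    sleep 2
  done
  echo "ok: $*"
}

# example during the takeover watch:
# wait_until 120 curl -sf http://127.0.0.1:18889/readyz
```

Record the timestamp when the poll flips to ok; that is the takeover timestamp the evidence bundle needs.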
Manual Telegram proof:
- send a simple message to the bot
- confirm response arrives
- confirm node A is still down
- confirm node B logs the response path
Provision node C:
- type: cpx11
- image: Ubuntu 24.04
- name: opentrident-restore-drill
On node C:
ssh -i ~/.ssh/binance_futures_tool root@<NODE_C_IP>
apt-get update
apt-get install -y curl git rsync docker.io docker-compose-plugin nodejs npm
corepack enable || true
mkdir -p /opt/OpenTrident /opt/opentrident /opt/opentrident-data/config

Copy only:
rsync -avz -e "ssh -i ~/.ssh/binance_futures_tool" \
root@49.12.7.18:/opt/OpenTrident/ /opt/OpenTrident/
rsync -avz -e "ssh -i ~/.ssh/binance_futures_tool" \
root@49.12.7.18:/opt/opentrident/ /opt/opentrident/
rsync -avz -e "ssh -i ~/.ssh/binance_futures_tool" \
root@49.12.7.18:/opt/opentrident-data/config/bootstrap.json /opt/opentrident-data/config/

Install runtime deps and pre-stage the image:
cd /opt/opentrident
pnpm install --prod
ssh -i ~/.ssh/binance_futures_tool root@49.12.7.18 \
'docker save opentrident:latest | gzip -1' \
| docker load

Serve the manifest locally and restore from the signed snapshot chain only:
cd /opt/opentrident
python3 -m http.server 18890 --directory /opt/opentrident-data/config

Then in a second shell:
cd /opt/opentrident
node dist/index.js bootstrap --from http://127.0.0.1:18890/bootstrap.json

Then verify:
cat /opt/opentrident-data/config/snapshot-head
curl -sf http://127.0.0.1:18889/healthz
curl -sf http://127.0.0.1:18889/readyz
curl -sf http://127.0.0.1:18889/api/dashboard-data

Pass condition:
- restored node boots
- health is green
- snapshot head matches expected chain head or the manifest target
- signed snapshot verification passed as part of restore
Compare load-bearing files between leader and restored node:
sha256sum /opt/opentrident-data/config/trust-telemetry-v1.json
sha256sum /opt/opentrident-data/config/bootstrap.json
sha256sum /opt/opentrident-data/config/snapshot-head

Also inspect:
- planner-v1.json
- memory-v1.json
- doctrine-v1.json
- playbooks/playbook-store.json
Not every file must match if the live leader advanced during the drill.
What matters is:
- restore is internally coherent
- snapshot chain verifies
- no corrupted/missing state files
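To avoid copy-pasting hashes between terminals, one option is to rsync the restored node's config dir next to the leader's and diff hashes locally. `diff_state` is a hypothetical helper; sha256sum is standard on Ubuntu:

```shell
# Hypothetical helper: compare named files between two state dirs and report
# which ones differ. DIFFER lines are not automatic failures, since the live
# leader may have advanced during the drill; missing files ARE failures.
diff_state() {
  a="$1"; b="$2"; shift 2
  rc=0
  for f in "$@"; do
    ha=$(sha256sum "$a/$f" 2>/dev/null | cut -d' ' -f1)
    hb=$(sha256sum "$b/$f" 2>/dev/null | cut -d' ' -f1)
    if [ -n "$ha" ] && [ "$ha" = "$hb" ]; then
      echo "match  $f"
    else
      echo "DIFFER $f"
      rc=1
    fi
  done
  return $rc
}

# example: diff_state /tmp/leader-config /tmp/restore-config bootstrap.json snapshot-head
```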
Collect:
- node A shutdown timestamp
- node B takeover timestamp
- Telegram response proof
- node C restore timestamp
- health checks from B and C
- dashboard JSON from A pre-failure, B post-takeover, C post-restore
- exact snapshot head used
- exact bootstrap manifest used
Save to:
/opt/opentrident-data/config/drills/failover-restore-YYYY-MM-DD/
Minimum files:
- primary-pre.json
- follower-post.json
- restore-post.json
- bootstrap.json
- snapshot-head.txt
- telegram-proof.txt
- sha256.txt
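Once the files are collected, sha256.txt can be generated in one step. A sketch; `make_bundle_sums` is a hypothetical helper that hashes whatever is already in the bundle dir and refuses an empty bundle (it assumes filenames without spaces, which holds for the minimum list above):

```shell
# Hypothetical helper: write sha256.txt over every file already present in the
# drill bundle dir. Runs in a subshell so it does not change the caller's cwd.
make_bundle_sums() (
  cd "$1" || exit 1
  files=$(ls -1 | grep -v '^sha256\.txt$' || true)
  [ -n "$files" ] || { echo "empty bundle: $1" >&2; exit 1; }
  # word-splitting on $files is intentional; bundle filenames contain no spaces
  sha256sum $files > sha256.txt
  echo "wrote sha256.txt ($(wc -l < sha256.txt | tr -d ' ') entries)"
)
```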
If follower takeover fails:
- restart node A
- confirm Telegram returns to A
- destroy node B
- inspect follower logs and failover state
If cold restore fails:
- keep node B as active leader if takeover already succeeded
- destroy node C
- fix restore path before retrying
Do not leave three half-configured nodes alive.
- bootstrap manifest points at stale snapshot head
- follower accidentally starts Telegram polling before takeover
- leader/follower lock does not flip cleanly
- restore path downloads bundle but fails signature verification
- restored node boots with missing env or missing compose/runtime image
- dashboard looks healthy but Telegram still points at the dead node
Any one of these makes the drill fail.
This drill reaches AAA only if:
- node B takes over with no human code changes mid-drill
- Telegram continuity is proven live
- node C restores from signed snapshots without ad-hoc file surgery
- evidence bundle is written
- both temporary nodes are destroyed or intentionally promoted after proof
Until then, persistence is promising, not proven.