Gap
dora cluster subcommands — install, up, down, status, restart, upgrade, uninstall — cover the SSH-backed multi-machine cluster lifecycle. No end-to-end test exercises the full path.
Per target/debug/dora cluster --help:
up Bring up a multi-machine cluster from a cluster.yml file.
status Show the current status of the cluster.
down Tear down the cluster (coordinator and all daemons).
install Install dora-daemon as a systemd service on each machine.
uninstall Uninstall dora-daemon systemd services from each machine.
upgrade Rolling upgrade: SCP the local dora binary to each machine and restart daemons.
restart Restart a running dataflow by name or UUID.
We don't have automated coverage for any of:
- cluster install actually writing a systemd unit file on the remote machine.
- cluster up successfully SSH-launching a remote daemon and registering it with the coordinator.
- Coordinator ↔ remote daemon WS handshake over the network.
- A dataflow with deploy: machine: <name> actually landing the node on the right remote.
- cluster down cleanly stopping everything.
- Capability gaps (no sudo, no systemctl, non-Linux remote) producing useful errors.
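To make the last item concrete, here is a minimal sketch of what "useful errors" for capability gaps could look like as something a test can assert on. Every name in it (RemoteProbe, CapabilityGap, capability_gaps) is hypothetical and not taken from the dora codebase; how the probe is actually populated over SSH is left out.

```rust
use std::fmt;

/// Hypothetical result of probing a remote machine before `cluster install`.
/// Field names are illustrative, not dora's actual types.
#[derive(Debug, Clone, Copy)]
struct RemoteProbe {
    is_linux: bool,
    has_systemctl: bool,
    has_sudo: bool,
}

/// The capability gaps named in the list above.
#[derive(Debug, PartialEq, Eq)]
enum CapabilityGap {
    NonLinuxRemote,
    MissingSystemctl,
    MissingSudo,
}

impl fmt::Display for CapabilityGap {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Illustrative wording only; the real messages are up to the CLI.
        let msg = match self {
            CapabilityGap::NonLinuxRemote => {
                "remote is not Linux; systemd-based install is not supported"
            }
            CapabilityGap::MissingSystemctl => {
                "systemctl not found on remote; cannot manage the dora-daemon service"
            }
            CapabilityGap::MissingSudo => {
                "sudo not available on remote; cannot write the systemd unit file"
            }
        };
        f.write_str(msg)
    }
}

/// Map a probe to the list of gaps a user should be told about.
fn capability_gaps(probe: RemoteProbe) -> Vec<CapabilityGap> {
    let mut gaps = Vec::new();
    if !probe.is_linux {
        // systemctl/sudo checks are moot on a non-Linux remote.
        gaps.push(CapabilityGap::NonLinuxRemote);
        return gaps;
    }
    if !probe.has_systemctl {
        gaps.push(CapabilityGap::MissingSystemctl);
    }
    if !probe.has_sudo {
        gaps.push(CapabilityGap::MissingSudo);
    }
    gaps
}
```

A test for this path would then feed in each failing probe and check both the gap variant and that the rendered message names the missing capability.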
Surfaced by @haixuanTao in #1624, 2026-04-21.
Why this is hard
E2E testing SSH + systemd + multi-machine lifecycle requires either:
1. Container-based harness: a compose/k8s/podman setup spinning up 2+ containers with SSH daemons, coordinator on one, daemon on another. High setup cost but fully reproducible in CI.
2. Documented manual runbook: a step-by-step runbook the release-test engineer walks through per RC. Cheaper to build, but it requires human cycles and cannot be automated.
Both are defensible outcomes. Option 2 is realistic for RC → GA; option 1 is the right long-term investment but expensive.
Expected scope
At minimum:
- A manual runbook in docs/qa-runbook.md or a new docs/cluster-release-check.md documenting the exact commands to run against a test cluster, with expected output at each step. Treat it as a checklist the release manager signs off on before tagging GA.
- A dora cluster status smoke test that can run without real SSH by pointing at a local-only cluster.yml.
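The smoke test could be shaped roughly like the sketch below, using only std::process. This is an assumption-laden outline: the helper names are hypothetical, and the arguments that point dora cluster status at a local-only cluster.yml are deliberately left to the caller because neither those arguments nor the cluster.yml schema are specified here.

```rust
use std::io;
use std::process::{Command, Output};

/// Run `<dora_bin> cluster status <extra_args...>` and capture its output.
/// `extra_args` is a placeholder for whatever selects the local-only
/// cluster.yml; the exact arguments are not specified in this issue.
fn run_cluster_status(dora_bin: &str, extra_args: &[&str]) -> io::Result<Output> {
    Command::new(dora_bin)
        .args(["cluster", "status"])
        .args(extra_args)
        .output()
}

/// Smoke-level check only: the command exited successfully and printed
/// something to stdout. It does not validate the content of the status.
fn status_smoke_ok(output: &Output) -> bool {
    output.status.success() && !output.stdout.is_empty()
}
```

In a real test, dora_bin would be the freshly built binary (for example target/debug/dora), and a failed spawn would surface as an io::Error rather than a silent pass.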
Stretch:
- A dockerized 2-node harness at tests/cluster-e2e/ that brings up coordinator + daemon in separate containers over SSH and runs a small dataflow across them. Opt-in via a feature flag so it doesn't slow every PR.
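One way to keep the harness opt-in is a runtime gate that the e2e tests consult before doing any work. The sketch below models the flag as an environment variable rather than a cargo feature, which is a substitution made for illustration; the variable name DORA_CLUSTER_E2E and both function names are hypothetical.

```rust
use std::env;

/// Hypothetical environment variable used to opt in to the cluster e2e suite.
const E2E_ENV_VAR: &str = "DORA_CLUSTER_E2E";

/// Interpret the raw flag value. Only explicit truthy values opt in;
/// an unset or unrecognized value keeps the suite off.
fn parse_opt_in(value: Option<&str>) -> bool {
    match value.map(|v| v.trim().to_ascii_lowercase()) {
        Some(v) => matches!(v.as_str(), "1" | "true" | "yes"),
        None => false,
    }
}

/// Read the flag from the process environment.
fn cluster_e2e_enabled() -> bool {
    parse_opt_in(env::var(E2E_ENV_VAR).ok().as_deref())
}
```

Each e2e test would return early when cluster_e2e_enabled() is false, so regular PR runs skip the container setup while a dedicated CI job sets the variable.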
Impact
dora cluster is the distributed-deployment story — a headline 1.0 feature. Shipping GA with zero automated coverage of the SSH/systemd path means the first user to try it is the test. For a 1.0 release, that's a bar miss.
Affected files
- binaries/cli/src/command/cluster/ (all subcommands)
- docs/distributed-deployment.md
- No existing tests — this would be greenfield