Skip to content

E2E coverage gap: dora cluster install/up/down (SSH + systemd lifecycle) #1704

@heyong4725

Description

@heyong4725

Gap

dora cluster subcommands — install, up, down, status, restart, upgrade, uninstall — cover the SSH-backed multi-machine cluster lifecycle. No end-to-end test exercises the full path.

Per target/debug/dora cluster --help:

up         Bring up a multi-machine cluster from a cluster.yml file.
status     Show the current status of the cluster.
down       Tear down the cluster (coordinator and all daemons).
install    Install dora-daemon as a systemd service on each machine.
uninstall  Uninstall dora-daemon systemd services from each machine.
upgrade    Rolling upgrade: SCP the local dora binary to each machine and restart daemons.
restart    Restart a running dataflow by name or UUID.

We don't have automated coverage for any of:

  • cluster install actually writing a systemd unit file on the remote machine.
  • cluster up successfully SSH-launching a remote daemon and registering it with the coordinator.
  • Coordinator ↔ remote daemon WS handshake over the network.
  • A dataflow with deploy: machine: <name> actually landing the node on the right remote.
  • cluster down cleanly stopping everything.
  • Capability gaps (no sudo, no systemctl, non-Linux remote) producing useful errors.

Surfaced by

@haixuanTao in #1624, 2026-04-21.

Why this is hard

E2E testing SSH + systemd + multi-machine lifecycle requires either:

  1. Container-based harness — compose/k8s/podman setup spinning up 2+ containers with SSH daemons, coordinator on one, daemon on another. High setup cost but fully reproducible in CI.
  2. Documented manual runbook — a step-by-step human runbook the release-test engineer walks through per RC. Cheaper to build but requires human cycles, not automatable.

Both are defensible outcomes. Option 2 is realistic for RC → GA; option 1 is the right long-term investment but expensive.

Expected scope

At minimum:

  • A manual runbook in docs/qa-runbook.md or a new docs/cluster-release-check.md documenting the exact commands to run against a test cluster, with expected output at each step. Treat it as a checklist the release manager signs off on before tagging GA.
  • A dora cluster status smoke test that can run without real SSH by pointing at a local-only cluster.yml.

Stretch:

  • A dockerized 2-node harness at tests/cluster-e2e/ that brings up coordinator + daemon in separate containers over SSH and runs a small dataflow across them. Opt-in via a feature flag so it doesn't slow every PR.

Impact

dora cluster is the distributed-deployment story — a headline 1.0 feature. Shipping GA with zero automated coverage of the SSH/systemd path means the first user to try it is the test. For a 1.0 release, that's a bar miss.

Affected files

  • binaries/cli/src/command/cluster/ (all subcommands)
  • docs/distributed-deployment.md
  • No existing tests — this would be greenfield

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions