feat(uv): per-node managed Python venvs at build + runtime (rescue of #1515)#1820
Conversation
|
😎 Merged successfully - details. |
…d race Pre-landing review of #1820 caught a critical race condition in the original PR design (#1515) that the rescue inherited. Builder.base_working_dir is shared across all nodes (the CLI clones the Builder per task in binaries/cli/src/command/build/local.rs:63). When multiple Python nodes share a working_dir -- which they do whenever: - the dataflow has 2+ runtime nodes with Python operators, OR - the dataflow has 2+ custom Python nodes without git sources ...managed_python_env_dir computed the same `<base>/.dora/python-env` for all of them. Under parallel builds (the --mode parallel flow at local.rs:83), two concurrent prepare_managed_python_env calls would both call `uv venv --clear` against the SAME directory, silently clobbering each other's envs and producing nondeterministic build artifacts. Fix: include node.id in the path. Each node now gets `<working-dir>/.dora/python-envs/<node-id>/` regardless of whether the working_dir is shared. The spawn-side already re-derives the path through the same function, so spawn-time interpreter lookup stays consistent automatically. Added a regression test (env_dir_includes_node_id_so_parallel_builds_do_not_race) that asserts two nodes sharing a working_dir get distinct env dirs. Docs updated (docs/cli.md, guide/src/operations/cli.md) to show the new path layout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…1515) Today's --uv is a half-measure: it just swaps pip -> uv pip in build commands. It buys uv speed but none of uv hermetic-env semantics -- Python nodes still run against ambient site-packages, which means: - Build-time deps and runtime Python can diverge (system Python updates, user activates a different venv, etc.) - Two dataflows installing conflicting versions of numpy/tensorflow/pytorch overwrite each other site-packages - Two Python nodes in the same dataflow with incompatible deps fight in a single site-packages This adds per-node managed envs to the --uv flow: * dora build --uv now creates <working-dir>/.dora/python-env/ per Python node via uv venv --clear, installs the Dora Python API into it (preferring an in-tree workspace package over PyPI), then runs the build commands inside that venv with VIRTUAL_ENV set and PATH prepended. * dora start / dora run automatically reuse the same interpreter when spawning the node, so build-time deps == runtime deps deterministically. * Failures are fail-closed: if the managed interpreter is missing, we error out rather than silently picking ambient Python. * Custom Python nodes without a build: block keep using ambient uv (script-only nodes do not need isolation). * BuildInfo gains a python_env_dirs map (with #[serde(default)] so existing build-info YAMLs still parse). Plumbing: * libraries/core/src/build/mod.rs: new managed_python_env_dir, managed_python_bin_dir, managed_python_interpreter helpers (pub so the daemon spawn path can reuse them); Builder::build_node_inner prepares the env when uv=true. * libraries/core/src/build/build_command.rs: new prepare_managed_python_env + ensure_managed_python_runtime + apply_managed_python_env / managed PATH composition; run_build_command gains python_env_dir arg. * binaries/cli/src/command/build/local.rs + binaries/daemon/src/lib.rs: capture python_env_dirs per node alongside node_working_dirs. * binaries/daemon/src/spawn/spawner.rs + command.rs: spawn-side uses the managed interpreter directly when one was prepared; falls back to uv run python for unbuilt nodes. * Node restart path re-derives python_env_dir from node_working_dir deterministically so restarts pick up the same venv. dora-core now requires the tokio io-util and macros features for the build-command output piping. Rescue of #1515 (Harsh-Sahu43) -- reimplemented against current main (7+ weeks of build/daemon drift since the PR was opened) using the original design as reference. Author marked tests + docs as TODO; this adds both. Co-Authored-By: Harsh-Sahu43 <noreply@github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d race Pre-landing review of #1820 caught a critical race condition in the original PR design (#1515) that the rescue inherited. Builder.base_working_dir is shared across all nodes (the CLI clones the Builder per task in binaries/cli/src/command/build/local.rs:63). When multiple Python nodes share a working_dir -- which they do whenever: - the dataflow has 2+ runtime nodes with Python operators, OR - the dataflow has 2+ custom Python nodes without git sources ...managed_python_env_dir computed the same `<base>/.dora/python-env` for all of them. Under parallel builds (the --mode parallel flow at local.rs:83), two concurrent prepare_managed_python_env calls would both call `uv venv --clear` against the SAME directory, silently clobbering each other's envs and producing nondeterministic build artifacts. Fix: include node.id in the path. Each node now gets `<working-dir>/.dora/python-envs/<node-id>/` regardless of whether the working_dir is shared. The spawn-side already re-derives the path through the same function, so spawn-time interpreter lookup stays consistent automatically. Added a regression test (env_dir_includes_node_id_so_parallel_builds_do_not_race) that asserts two nodes sharing a working_dir get distinct env dirs. Docs updated (docs/cli.md, guide/src/operations/cli.md) to show the new path layout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`dora build --uv` and the managed Python env flow (#1820, `<working-dir>/.dora/python-envs/<node-id>/`) depend on the `uv` binary being on PATH. Without `uv`, the build fails with `failed to spawn `uv venv ...`` which doesn't point users at the install command. Add a `check_uv()` to `dora doctor` that runs `uv --version`: - PASS: `uv: uv 0.10.8 (...)` with the version string - WARN: not installed; advises the install command - WARN: present but `--version` exited non-zero WARN, not FAIL, because pure Rust/C++ users do not need `uv` and a working dora install without `uv` is still healthy -- just can't run Python dataflows under `--uv`. This mirrors the WARN-not-FAIL choice for the /dev/shm check (#1818). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b174b86 to
0e30942
Compare
Two findings from pre-landing review of #1820: Finding 1: silent fallback to ambient uv when build artifacts lack a managed env for a node that needs one. This happens with a stale session file, a prior non---uv build, or `dora start --uv` without rebuilding. The spawn code at lib.rs:2413 passed None to spawn_node and the spawn-side fell back to `uv run python` against the ambient environment. For .py nodes with `build:` and runtime Python operators this silently loses the isolation the user explicitly asked for by passing --uv. Fix: at lib.rs spawn dispatch, when uv=true and python_env_dirs has no entry for the node, compute the expected managed env dir. If the node would need one, bail with a clear error pointing the user at `dora build --uv` (or omit --uv). Finding 2: spawn-side ran the managed interpreter directly but did not set VIRTUAL_ENV or prepend the venv's bin/ to PATH. Python imports resolved correctly (the interpreter was the managed one) but subprocesses (subprocess.run([..]), python, pip, console scripts) and `python -m pip` inside the node still resolved from the ambient environment. Not actually hermetic. Fix: new apply_managed_python_runtime_env helper in spawner.rs sets VIRTUAL_ENV and composes PATH (managed bin first, then node.env's PATH if any, then ambient). Called from both Custom Python spawn (Custom + node.build.is_some()) and Runtime Python spawn paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Both findings addressed in 54f126d. Reviewer was right on both — silent degradation in the missing-env case + the runtime not actually being hermetic. Finding 1 — silent fallback to ambient uv
Error message points the user at the right next step:
Non-Python nodes are unaffected (the predicate function returns Finding 2 — runtime not hermetic New
Called from both spawn paths after the existing env injection but before stdio setup:
Why both Local validation: Caveat I'd flag separately: the runtime hermeticity fix is exercised end-to-end only by actually running |
Pre-landing review of #1820 found that the new fail-closed guard in binaries/daemon/src/lib.rs rejected ALL .py custom nodes under --uv when python_env_dirs has no entry. But: - Build records managed envs only for Python custom nodes WITH `build:` (libraries/core/src/build/mod.rs Builder::build_node_inner filtered on `n.build.is_some()`). - Spawn intentionally falls back to ambient `uv run python` for script- only Python custom nodes (binaries/daemon/src/spawn/command.rs:84, `python_env_dir.filter(|_| node.build.is_some())`). - managed_python_env_dir returned Some for EVERY .py custom node, so the new guard at lib.rs:2419 over-fired: it would bail on script-only Python under `dora start --uv` / `dora run --uv`, breaking dataflows that worked before this PR. Fix: pull the `build.is_some()` requirement into node_requires_managed_python_env itself. managed_python_env_dir now returns None for script-only Python custom nodes — matching the build record AND the spawn-side fallback. The new guard, the build flow, and the spawn flow now all use the same predicate; no callers need a local `.filter(|_| n.build.is_some())` anymore. Tests updated to cover both shapes: - assigns_managed_env_dir_to_build_backed_python_custom_nodes (renamed + node yaml now includes `build:`) - skips_managed_env_dir_for_script_only_python_custom_nodes (new regression test) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Right, the fail-closed guard was over-fired. Fix in 40f1a7d. Root cause — three call sites had drifted to different predicates:
So Fix — pull fn node_requires_managed_python_env(node: &ResolvedNode) -> bool {
match &node.kind {
CoreNodeKind::Custom(custom) => {
is_python_custom_node(custom) && custom.build.is_some()
}
CoreNodeKind::Runtime(runtime) => runtime
.operators
.iter()
.any(|op| matches!(op.config.source, OperatorSource::Python(_))),
}
}
Behavior matrix after the fix:
Tests added/updated:
Local validation: |
Audited every doc file that mentions `--uv`, Python isolation, or the relevant CLI commands. The initial doc updates only covered the CLI build flag in `docs/cli.md` and the mirror in `guide/src/operations/cli.md`. Five gaps remained: - CHANGELOG.md: no entry for the new feature. Added under Unreleased > Added, with separate bullets for the managed env work and the new `dora doctor` uv check. - docs/yaml-spec.md + guide/src/concepts/dataflow-yaml.md: the `build:` field row said "pip/pip3 lines use uv when --uv is passed" -- accurate before but understates what `--uv` does now. Expanded to mention managed venvs + the on-disk layout, with a link to the CLI ref. - docs/debugging.md: the `dora doctor` "Checks performed" list missed the new `uv` check. Added as item #2 between SHM and coordinator (matches the order checks run in). - docs/python-guide.md: had no mention of `--uv` semantics at all. Added a "Reproducible Dependencies with `--uv`" section before "Next Steps" that lays out the four guarantees (build-time deps == runtime deps, no cross-node/cross-dataflow contamination, hermetic subprocesses), notes the script-only-Python carve-out, and cross-references the CLI ref and `dora doctor`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pulls in the CHANGELOG entries that the previous docs commit dropped because the file is tracked as 'Changelog.md' in the git index while the path on disk is 'CHANGELOG.md' (case mismatch on a case-insensitive filesystem). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…solation `examples/python-operator-dataflow/` has `dataflow_conda.yml` that demonstrates per-operator Python isolation via `conda_env:`. After PR #1820 lands, dora has a second way to do the same thing: pass `--uv` and dora creates one managed venv per operator from its `build:` line. Add an "Alternative: managed venvs with --uv" subsection under Installation in the example's README that: - Shows `dora build --uv dataflow.yml && dora run --uv dataflow.yml` - Notes the env layout (`<dataflow-dir>/.dora/python-envs/<node-id>/`) - Explains when to pick which (`conda_env:` for pre-existing conda envs with explicit names; `--uv` for dora to provision from `pip install` build steps automatically) - Cross-references `dora doctor` for the `uv` availability check The example's existing dataflow.yml already has the per-operator `build: pip install -r requirements.txt` lines that `--uv` consumes; no YAML change is needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
python-dataflow has two variants: - dataflow.yml -- script-only Python nodes, no build: blocks - dataflow_dynamic.yml -- two nodes with DIFFERENT git+ pip installs per node (opencv-video-capture vs opencv-plot) The existing README claimed "Or use uv to manage the Python environment automatically" for dataflow.yml, which overstated what --uv does in that case: for script-only nodes the spawn just routes through `uv run python` against the ambient uv env, not a dora-managed env. Two updates: 1. Variants table now flags each file's build profile -- dataflow.yml as "script-only / runs against your ambient Python" and dataflow_dynamic.yml as "the natural fit for the --uv managed-env flow" (since each node has its own build: line with different deps). 2. New section "Per-node managed environments (--uv + build:)" uses dataflow_dynamic.yml as the demo: shows the build/run commands, the `.dora/python-envs/<node-id>/` layout, and what `dora doctor` guarantees. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-landing review of #1820 found that managed_python_env_dir returned Some for any Runtime node with a Python operator, regardless of whether that operator pins conda_env. Under --uv this caused two problems: - Build wastefully created a uv venv at .dora/python-envs/<node-id>/ for a node whose Python operator was going to run under `conda run` anyway. - Spawn-side, my new apply_managed_python_runtime_env helper then overlaid VIRTUAL_ENV and prepended the uv venv's bin dir to PATH on top of the conda-run command. The conda interpreter would run, but any subprocess (subprocess.run, python -m pip, console scripts) would resolve from .dora/python-envs/... instead of the conda env the user asked for. Mixing two environment managers. Fix in node_requires_managed_python_env: require the matched Python operator to have conda_env.is_none(). This: - Avoids creating wasted venvs at build time - Returns None at spawn time -> apply_managed_python_runtime_env skips injection (its condition is `python_env_dir.is_some()`) - Fail-closed guard in lib.rs sees expected = None -> does not bail - Restart path re-derives None -> spawn falls through to conda path Two regression tests added: - skips_managed_env_dir_for_runtime_python_operators_with_conda_env - assigns_managed_env_dir_for_runtime_python_operators_without_conda_env (confirms we didn't break the no-conda path) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Good catch — the conda_env interaction was a real env-manager-collision bug. Fix in c0986b3. Root cause:
Fix: require the matched Python operator's CoreNodeKind::Runtime(runtime) => runtime.operators.iter().any(|operator| {
matches!(
&operator.config.source,
OperatorSource::Python(py) if py.conda_env.is_none()
)
}),Applying the fix at the predicate level fixes every downstream site in one shot:
Regression tests added in
Local validation: The pattern emerging across the last few reviews — these all came from |
|
/trunk merge |
|
I reviewed the diff rather than the PR description. I see one compatibility risk and one testing gap worth considering. Potential regression for Python operators without
|
The cli-python job's "Test Python template project" step has been failing on every main commit since #1820 ("feat(uv): per-node managed Python venvs at build + runtime", commit 6ae0fa9) -- 5+ merged commits all red, blocking the Trunk merge queue alongside the ros2-bridge regression that #1839 just resolved. Root cause ========== #1820 changed `dora build --uv` to install each node package into its own managed venv at `.dora/python-envs/<node>/` (good for deterministic per-node deps + no cross-node contamination). But the `dora new --lang python` templates generated test stubs at `tests/test_<node>.py` do: from <node>.main import main The CI step pytest invocation is `uv run pytest` from the test_python_project root, which resolves against the ambient venv (VIRTUAL_ENV set by the earlier setup-python-uv step to the repo-level .venv/). That venv has dora-rs + pyarrow + pytest + ruff but no node packages -- those are in the per-node venvs. Pre-#1820, `dora build` installed nodes into whatever venv was active, so the imports worked. Actual failure from CircleCI v1.1 API log on main HEAD (job 5882, step "Test Python template project"): FAILED listener-1/tests/test_listener_1.py::test_import_main > from listener_1.main import main E ModuleNotFoundError: No module named 'listener_1' FAILED talker-1/tests/test_talker_1.py::test_import_main > from talker_1.main import main E ModuleNotFoundError: No module named 'talker_1' FAILED talker-2/tests/test_talker_2.py::test_import_main > from talker_2.main import main E ModuleNotFoundError: No module named 'talker_2' 3 failed, 3 passed Fix === After `dora build --uv` (which sets up the per-node venvs), also install the three node packages into the active venv so `uv run pytest` at the project root can resolve their imports: uv pip install -e listener-1 -e talker-1 -e talker-2 This preserves the per-node-venv isolation that #1820 introduced (those venvs still drive `dora build` and `dora run`) while giving the test pass a venv that sees all node packages. The root venv is ephemeral and only used by this CI step, so no production contamination. Scope notes =========== * CI-only fix; the `dora new` template itself is unchanged. Users running `pytest` in their own `dora new` projects will hit the same issue, but fixing that requires a template-level change (e.g. root pyproject.toml that path-depends on each node, or a documented "for testing, also pip install -e <node>/" step in the README). That is a separate follow-up; this PR keeps scope to "unblock the merge queue." * The hardcoded node names listener-1, talker-1, talker-2 match the `dora new --lang python` template (see binaries/cli/src/template/python/mod.rs:126-130). If that template grows or shrinks the node set, this list needs to follow. A loop over `*/pyproject.toml` would be more robust but also more clever-shell; keeping the explicit list for clarity. Validation ========== Local repro is awkward (the CI step builds the whole dora CLI from source first), but the failure mode is mechanically obvious from the log and the fix is a one-line `uv pip install -e` of the same packages the build step references. CircleCI will validate on this PR. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rescues @Harsh-Sahu43's #1515 design (per-node uv-managed Python envs at build + runtime). Reimplemented against current
mainafter 7+ weeks of build/daemon drift made a direct merge impossible (5 conflicts, including the post-#1716 CI rebalance).Why this matters
Today's
--uvis a half-measure: it just swapspip→uv pipin build commands. It buys uv's speed but none of uv's hermetic-env semantics. Python nodes still run against ambientsite-packages, which means:dora buildwins, no per-dataflow isolationsite-packagesdora buildtoday,dora run3 weeks later, transitive deps drift silentlyIndustry baseline: Python isolation is table stakes for any production-grade Python framework (ROS2 underlay/overlay, Ray runtime_env, Bazel hermetic toolchains, Modal per-function envs).
What this PR does
For Python nodes (
.pycustom nodes with abuild:block, and runtime nodes with Python operators) when--uvis set:dora build --uvcreates<working-dir>/.dora/python-envs/<node-id>/per node viauv venv --clearimport doraprobe)VIRTUAL_ENVset and the env'sbin/(orScripts/on Windows) prepended toPATHdora start/dora runautomatically reuse the same interpreter at spawn timeCustom Python nodes without a
build:block keep using ambientuv run python— script-only nodes don't need isolation. Restart path re-derives the env dir deterministically so node restarts pick up the same venv.Plus a
dora doctoruvavailability check so users withoutuvget a clear hint before they hit a cryptic build error.Pieces
libraries/core/src/build/mod.rsmanaged_python_env_dir,managed_python_bin_dir,managed_python_interpreterhelpers;Builder::build_node_innerprepares the env whenuv=true; newprepare_python_envorchestratorlibraries/core/src/build/build_command.rsprepare_managed_python_env,ensure_managed_python_runtime,apply_managed_python_env, managed-PATH composition;run_build_commandgainspython_env_dirargbinaries/cli/src/command/build/local.rs+binaries/daemon/src/lib.rspython_env_dirsper node alongsidenode_working_dirsbinaries/daemon/src/spawn/spawner.rs+command.rsuv run pythonfor unbuilt nodesbinaries/cli/src/command/doctor.rscheck_uv()step indora doctorlibraries/core/Cargo.tomlio-utilandmacrosfeatures for build-command output pipingdocs/cli.md+guide/src/operations/cli.md--uvsemanticsPer-node env path (race fix from pre-landing review)
Initially used
<working-dir>/.dora/python-env(matching the original PR design). Pre-landing review caught thatBuilder.base_working_diris shared across all nodes — under parallel builds (--mode parallel), two Python nodes sharing a working_dir would race onuv venv --clearagainst the same directory and silently clobber each other. Fixed to<working-dir>/.dora/python-envs/<node-id>/. Regression test added (env_dir_includes_node_id_so_parallel_builds_do_not_race).BuildInfo backward compat
#[serde(default)]means existing build-info YAMLs still parse — verified bybuild_info_deserializes_without_python_env_dirs.Tests added
In
libraries/core/src/build/mod.rs::tests:assigns_managed_env_dir_to_python_custom_nodes— Python.pynode gets<working-dir>/.dora/python-envs/<id>skips_managed_env_dir_for_non_python_custom_nodes— Rust/C++ nodes getNoneenv_dir_includes_node_id_so_parallel_builds_do_not_race— regression test for the race fixbuild_info_deserializes_without_python_env_dirs— old build-info YAML still parsesmanaged_python_helpers_use_platform_layout—bin/pythonon Unix,Scripts/python.exeon WindowsTests use YAML deserialization (not struct literals) to avoid coupling to the exact
ResolvedNodefield list.Local validation
cargo fmt --all -- --check— cleancargo clippy -p dora-core --features build -p dora-cli -p dora-daemon -- -D warnings— cleancargo test -p dora-core --features build— 157 passing (5 new)cargo test -p dora-cli— 145 passingcargo test -p dora-daemon— 32 passingdora doctorsmoke-run on the host — confirmedPASS uv: uv 0.10.8 (...)uvon PATH + a real Python dataflow. Reviewer withuvinstalled can verify:dora build --uv examples/python-dataflow/dataflow.yml --local ls examples/python-dataflow/.dora/python-envs/ # expect one dir per Python node dora run examples/python-dataflow/dataflow.yml --uv --stop-after 5sdoraspawns the managedbin/pythoninterpreter rather thanuv run python.Follow-ups (not in this PR)
forward_build_output'stokio::select!pattern stops draining one stream when the other closes first, dropping output. The Prototype Dora-managed Python isolation for build and runtime #1515 author had already fixed this withtokio_stream::StreamExt::merge, but porting requires addingtokio-streamtodora-core's Cargo.toml and is tangential to Python env management. Filed as fix(build): build output silently truncated when one of stdout/stderr closes first #1821..github/workflows/ci.ymlchange — targeted the pre-ci: rebalance PR CI -> fast Linux-only gate, move breadth to nightly #1716 CI layout. Not portable; new nightly CI already exercises Python examples. Explicit--uvenv-isolation CI coverage can be a follow-up.Closes
Closes #1515.
Co-Authored-By: Harsh-Sahu43
🤖 Generated with Claude Code