This document describes how pipelines and projects are represented in BioVault, how they are executed, how Nextflow templates are wired, and how multi-party sharing works. It also covers how the app imports/runs projects and closes with two rewrite recommendations focused on flexible multiparty workflows and reusable sub-units.
BioVault treats a project as a self-contained unit of execution with a declared input/output contract, and a pipeline as a higher-level DAG that wires multiple projects together. Multiparty execution is handled at the pipeline runner level: a given datasite executes only the steps that target it, while shared outputs are published into SyftBox shared folders with syft.pub.yaml permissions.
Core entities:
- Datasite: a SyftBox identity + local data root (e.g., `SYFTBOX_EMAIL` and `SYFTBOX_DATA_DIR`).
- Project: a versioned workflow with inputs/outputs, typically `dynamic-nextflow` or `shell`.
- Pipeline: a list of project steps with explicit bindings between step outputs and step inputs.
- Submission: a shared copy of a project (and its assets) in `shared/biovault/submissions/...` plus `syft.pub.yaml` and a project message.
- Message: the inbox payload that tells a recipient where a submission lives and how to run it.
A project is a folder containing:
- `project.yaml` (spec)
- `workflow.nf` (Nextflow workflow, if `dynamic-nextflow`); this is an entry point, which could also be a shell script if the template is `shell` or something else, so it gives us flexibility to upgrade in the future
- `assets/` (scripts/data shared with the project)
Something missing here is a version string for these files (and a schema name), as in Kubernetes, so we can easily version and add features.
Key fields:
- `name`, `author`, `version`
- `template`: `dynamic-nextflow` or `shell`
- `workflow`: `workflow.nf` or `workflow.sh`
- `assets`: list of asset file paths to include in submissions
- `inputs` and `outputs`
- `parameters` (user-configurable runtime settings)
- `datasites` (optional recipient list for submissions)
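As a sketch, a minimal `project.yaml` using these fields might look like the following. The field names come from the list above; the concrete values, the nesting of `inputs`/`outputs`, and the type shorthand are illustrative assumptions and should be checked against `project_spec.rs`:

```yaml
name: allele-freq            # illustrative project name
author: client1@sandbox.local
version: 0.1.0
template: dynamic-nextflow
workflow: workflow.nf
assets:
  - assets/compute.py
inputs:
  genotypes:
    type: File
outputs:
  allele_freq:
    type: File
parameters:
  min_quality: 30            # user-configurable runtime setting
datasites:
  - client2@sandbox.local    # optional recipient list for submissions
```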
The type system is defined in biovault/cli/src/project_spec.rs and is shared between project inputs and pipeline inputs. Supported primitives and composites:
- `String`, `Bool`, `File`, `Directory`
- `ParticipantSheet`, `GenotypeRecord`, `BiovaultContext`
- `List[T]`, `Map[String, T]`, `Record{field: Type}`
- `?` suffix for optional (e.g., `File?`)
Type compatibility is validated in the pipeline runner before execution. The pipeline runner checks that step bindings match the expected input types and can resolve literals like File(path).
Runtime path (as implemented in biovault/cli/src/cli/commands/run_dynamic.rs):
- `project.yaml` is loaded and validated.
- A `template.nf` is loaded from `~/.biovault/env/dynamic-nextflow/template.nf`.
- Inputs and parameters are converted into JSON payloads (`inputs.json` and `params.json`).
- BioVault injects `assets_dir` and `results_dir` into params.
- Nextflow is launched with the template + workflow.
Importantly, we use command-line arguments and environment variables here so that any external tool can be invoked at a step in the pipeline; in this case it is Nextflow, but in other cases it could be a syqure binary, a bash command, a Docker command, etc.
User workflow contracts:
- The workflow is named `USER`.
- The first argument to `take:` is always `context` (the `BiovaultContext`).
- The remaining `take:` parameters correspond to `inputs` from `project.yaml`.
- Outputs are emitted by name, matching `project.yaml` outputs.
Type mapping (from docs/yaml-types-reference.md):
- `File` -> Nextflow `path`
- `Directory`, `String`, `Bool` -> Nextflow `val`
- Optional `File?` can be empty
The generated context provides:
- `context.params` (parameters + injected `assets_dir` and `results_dir`)
- `context.inputs` (input metadata)
Shell projects run a workflow.sh script. The runner injects environment variables, including:
- `BV_PROJECT_DIR`, `BV_RESULTS_DIR`, `BV_ASSETS_DIR`
- `BV_INPUT_<INPUT_NAME>` and `BV_OUTPUT_<OUTPUT_NAME>`
- `BV_DATASITES`, `BV_CURRENT_DATASITE`, `BV_DATASITE_INDEX`
- `BV_SYFTBOX_DATA_DIR`, `BV_DATASITES_ROOT`
- `BV_BIN` (path to `bv` if set by the caller)
Shell support was only added today, so don't take any of this as settled; I think it mainly highlights the kinds of metadata that downstream processes may need in order to do their work flexibly.
The shell runner resolves templates inside input/output paths using:
`{current_datasite}`, `{datasites.index}`, `{datasite.index}`, `{datasites}`
Core fields (see biovault/cli/src/pipeline_spec.rs and docs/pipeline-system-guide.md):
- `name`
- `inputs` (optional pipeline-level inputs, with optional defaults)
- `steps` (ordered list of project runs)
Step fields:
- `id`: step identifier
- `uses`: project path or registered project name
- `with`: input bindings
- `publish`: output aliases
- `store`: database storage (SQL)
- `runs_on`: datasites to run on (new)
- `foreach`: legacy datasite list (still supported)
- `order`: optional (currently `parallel` just warns and runs sequentially)
- `share`: publish outputs into shared datasite storage (new)
Bindings support:
- `inputs.<name>` -> pipeline input
- `step.<id>.outputs.<name>` -> upstream step output
- `step.<id>.outputs.<name>.manifest` -> manifest file containing per-datasite outputs
- Literals: `File(path)`, `Directory(path)`, `String(value)`, etc.
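Putting the step fields and binding forms together, a hypothetical step might look like this. The step and project names are illustrative; the binding syntax follows the forms listed above:

```yaml
steps:
  - id: aggregate
    uses: ./projects/aggregate          # path-based project reference
    runs_on: [aggregator@sandbox.local] # only this datasite executes the step
    with:
      threshold: String(0.05)           # literal binding
      cohort: inputs.cohort             # pipeline-level input
      freqs: step.allele_freq.outputs.allele_freq.manifest  # per-datasite manifest
    publish:
      combined: combined_freqs          # output alias
```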
When a step output is referenced across multiple datasites, the pipeline runner writes a manifest file under:
<results>/manifests/<step>/<output>_paths.txt
Each line is datasite<TAB>path.
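For example, a manifest for a hypothetical `allele_freq` output produced on two datasites might contain (columns separated by a tab):

```text
client1@sandbox.local	/runs/123/client1/allele_freq.tsv
client2@sandbox.local	/runs/123/client2/allele_freq.tsv
```

The datasite names and paths here are illustrative only.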
The pipeline runner (in biovault/cli/src/cli/commands/pipeline.rs) executes one datasite at a time:
- It resolves the current datasite from `BIOVAULT_DATASITE_OVERRIDE`, config email, `SYFTBOX_EMAIL`, or `BIOVAULT_DATASITE`.
- If a step has `runs_on`/`foreach`, only the matching datasite executes it.
- To force a single process to run all targets, set `BIOVAULT_PIPELINE_RUN_ALL=1`.
- Note: I wouldn't use env variables for all of this up front. I think we can define it in the YAML spec first and leave the implementation details to the supported runtime/template. (The purpose of the template, if you check the Nextflow ones, is to provide execution context and guard rails for running other people's code by controlling what goes in and where it can run.)
For each step run, the runner sets:
- `BIOVAULT_DATASITE_OVERRIDE` to the target datasite.
- `BIOVAULT_DATASITES_OVERRIDE` to the full list of step targets.
BIOVAULT_DATASITES_OVERRIDE is used by the shell runner to render {datasites.index} and set BV_DATASITES consistently even when the project’s own datasites list is empty.
The pipeline share block defines file sharing at the pipeline level so projects don’t need to manually write syft.pub.yaml:
share:
allele_freq_shared:
source: allele_freq
path: shared/biovault/shares/{run_id}/{current_datasite}/allele_freq.tsv
read: [client1@sandbox.local, client2@sandbox.local]
write: [client1@sandbox.local, client2@sandbox.local]
  admin: [aggregator@sandbox.local]

This is an important part of the system: we need a built-in concept of sharing files between different users and then waiting on them. The code that copies the file and handles the syft.pub.yaml permission files, as well as the downstream code that waits on it, should be baked in so these things can be expressed really simply.
Behavior:
- `path` can be a `syft://` URL or a path under the current datasite root.
- The runner writes `syft.pub.yaml` in the parent directory of `path`.
- The shared output is recorded as a `syft://...` URL in step outputs.
- When downstream steps bind to that output, the pipeline runner resolves the `syft://` to a local filesystem path based on `SYFTBOX_DATA_DIR`.
Available template variables for share.path:
- `{current_datasite}`
- `{datasites.index}` / `{datasite.index}`
- `{datasites}`
- `{run_id}` (pipeline run ID)
Submitting a project does all of the following (see biovault/cli/src/cli/commands/submit.rs):
- Copies `project.yaml`, workflow, and `assets` into a shared folder under the sender datasite.
- Encrypts assets using SyftBox storage and the recipient list (`project.datasites` or the target recipient).
- Writes `syft.pub.yaml` with read/write permissions for recipients.
- Sends a project message containing:
  - `project_location` (a `syft://...` URL)
  - metadata (assets, participants, human text, sender/receiver paths)
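As a hedged sketch of what that message might carry conceptually (the real schema lives in `messages.rs` and may differ; all names and values below are illustrative):

```yaml
type: project
project_location: syft://client1@sandbox.local/shared/biovault/submissions/allele-freq-0.1.0
metadata:
  assets: [assets/compute.py]
  participants: [client1@sandbox.local, client2@sandbox.local]
  text: "Please run this allele frequency project on your cohort."
```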
syft://{datasite}/path URLs are important because they circumvent the need for DNS/IPs of services that are not online while retaining identity; they prevent path-traversal hacks by normalizing everything to a datasites/ root; and they make it easy to think about where you are reading and writing from in a networked context rather than as a local filesystem implementation detail.
Processing a project message (see biovault/cli/src/cli/commands/messages.rs) does:
- Resolves the `syft://` URL to a local path under `SYFTBOX_DATA_DIR`.
- Copies the submitted project into a local run directory.
- Executes the project (dynamic-nextflow or shell) in a `results-test` or `results-real` dir.
- Copies results back into the submission folder and optionally approves (shares results).
This is the primary mechanism by which the app imports and runs projects. Pipelines can be invoked indirectly via a shell project that runs `bv run <pipeline.yaml>`.
Current datasite is resolved using:
- the YAML has a template to self-reference who you are in the computation
- `Config::load().email`
- `SYFTBOX_EMAIL`
- `BIOVAULT_DATASITE`
- `BV_INPUT_<NAME>`, `BV_OUTPUT_<NAME>` # inputs/outputs
- `BV_PROJECT_DIR`, `BV_RESULTS_DIR`, `BV_ASSETS_DIR` # where things are executing
- `BV_DATASITES`, `BV_CURRENT_DATASITE`, `BV_DATASITE_INDEX` # who is executing them now
- `BV_SYFTBOX_DATA_DIR`, `BV_DATASITES_ROOT`, `BV_BIN` # where things are on the system
This scenario exercises:
- A step that runs on all datasites and shares a file.
- A step that runs on client datasites only and shares a derived file.
- An aggregator step that consumes a manifest of shared files.
- A rebroadcast step that shares a combined output back to clients.
See:
- `tests/scenarios/share-kitchen-sink/assets/pipeline.yaml`
- `tests/scenarios/share-kitchen-sink/assets/share-*/project.yaml`
Current reuse model:
- Local: `bv project import` registers a project by name; pipelines can `uses: <name>`.
- Path-based: `uses: ./relative/path`.
- Submissions: `bv submit` shares a project to another datasite (with assets and permissions).
Missing pieces that affect reuse today:
- No built-in version pinning in `uses:`.
- No standard lockfile to record the exact commit or hash of the project spec.
- Sharing is local to a datasite (via SyftBox), not a global registry.
1. Introduce a versioned registry + lockfile for reusable sub-units.
   - Allow `uses: ://project@1.2.3` and `uses: git+https://...@sha`. (Need to think about how to reference these.)
   - Store a resolved lockfile in pipelines to freeze input/output schemas, runner type, and asset digests.
   - Make the pipeline runner resolve and cache projects before execution, so users can mix local and remote steps without modifying upstream specs.
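A hypothetical lockfile for such a registry might freeze the resolved ref, spec schema, runner type, and asset digests per step. Everything below (the file name, schema string, field names, and placeholder digests) is an illustrative assumption, not an existing format:

```yaml
# pipeline.lock.yaml -- hypothetical format
schema: biovault.dev/lock/v1alpha1
steps:
  allele_freq:
    uses: git+https://example.org/allele-freq@1.2.3
    resolved: <commit-sha>                     # exact commit the ref resolved to
    spec_digest: sha256:<digest-of-project.yaml>
    runner: dynamic-nextflow
    assets:
      assets/compute.py: sha256:<digest>       # pins asset contents
```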
2. Make data sharing and multiparty routing first-class in the execution DAG.
   - Model `share`, `collect`, and `broadcast` as built-in step types with typed inputs/outputs and implicit SyftBox permissions.
   - Push these into the pipeline engine rather than shell scripts, so projects can stay single-party and reusable.
   - Expose a minimal runner interface (`run(inputs) -> outputs`) and let runtimes be pluggable (Nextflow, shell, Python, container) with uniform metadata and data movement semantics.
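If `share`/`collect`/`broadcast` became built-in step types, a flow fragment might read like this. The `type:`, `from:`, and `to:` keys are hypothetical syntax sketched for discussion, not an implemented spec:

```yaml
steps:
  - id: local_stats
    uses: stats-module
    runs_on: all
  - id: gather
    type: collect                  # built-in: wait for each datasite's output
    from: step.local_stats.outputs.freqs
    to: aggregator@sandbox.local   # permissions written implicitly
  - id: combined
    uses: combine-module
    runs_on: [aggregator@sandbox.local]
    with:
      freqs: step.gather.outputs.collected
  - id: fanout
    type: broadcast                # built-in: share result back to all participants
    from: step.combined.outputs.merged
    to: all
```

This keeps the single-party modules (`stats-module`, `combine-module`, both illustrative names) free of any routing logic.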
These two changes keep multiparty logic in the pipeline layer, while allowing projects to stay small, reusable, and ETL-focused without requiring downstream edits.
Okay, so I want to create some good YAML specifications, including schema names and versions, that cover all of this so we can do an upgrade first to make sure things work and are reusable.
I think we can introduce hashes for reusable components so that they can be resolved locally (as well as making them less strict if you just want to use the dev version locally).
I want native support for referencing datasites, doing parallel work in different locations (or just one location, or some subset), and the ability to do round-robin by, say, referencing next_datasite or something.
Here's a reduce operation prototype I made a long time ago that shows how you could model that:

```yaml
author: "madhava@openmined.org"
project: "add"
language: "python"
description: "Add two numbers"
code:
  - functions.py

shared_inputs:
  data: &data FilePipe("{datasite}/data/data.txt")
  output: &output FilePipe("{datasite}/fedreduce/{project}/data/{step}/result.txt")

shared_outputs:
  result: &result FilePipe("{author}/fedreduce/{project}/data/result/result.txt")

workflow:
  datasites: &datasites []

  steps:
    - first:
        inputs:
          - a: StaticPipe(0)  # Override input for the first step
    - last:
        output:
          path: *result
          permissions:
            read:
              - *datasites
    - foreach: *datasites
      run: "{datasite}"
      function: "add"
      inputs:
        - a: FilePipe("{prev_datasite}/fedreduce/{project}/data/{prev_step}/result.txt")
        - b: *data
      output:
        path: *output
        permissions:
          read:
            - "{next_datasite}"

complete:
  exists: *result
```
We need start and stop conditions, waiting, all that stuff. We need some things that can be static and others that are dependent on the local runtime environment.
I want your best thinking and expertise and lets start with a good initial spec that doesnt back us into a corner.
The idea that you can build a small ETL between steps for your own uses means things could be really re-usable.
First question:
- what should we call the outer unit and what should the composable units be called
Proposed naming:
- Flow = the outer unit (what we call a pipeline today)
- Module = the reusable composable unit (what we call a project today)
- Step = an instantiated module inside a flow
This keeps the meaning aligned with multi‑party routing while preserving a simple mental model for ETL‑style composition.
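Under this naming, a minimal versioned Flow might look like the following. The schema name/version follow the Kubernetes-style convention proposed above; the group name (`biovault.dev`), field layout, and module names are illustrative assumptions:

```yaml
apiVersion: biovault.dev/v1alpha1   # schema name + version for upgrade safety
kind: Flow
metadata:
  name: allele-freq-flow
spec:
  module_paths:
    - ./modules                     # allowlisted local roots for name resolution
  inputs:
    cohort:
      type: String
      default: "test"
  steps:
    - id: freq
      module: hello                 # resolved via module_paths to ./modules/hello
      runs_on: all
```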
Notes for next spec iteration (from requirements):
- Create a new YAML spec in a `flow-spec-guide/spec/` folder with real examples modeled after existing pipeline/project YAMLs.
- Support both single-file flows (all modules inline) and split files (flow references module files).
- Allow modules to be folders (auto-discover `module.yaml`/`module.yml`) and define `module_paths` so flows can resolve local modules by name without hardcoding absolute paths.
- Add a schema name/version (like Kubernetes) at the top for upgrade safety and forward compatibility.
- Add optional manifests (hashes) so reusable components can be resolved locally and pinned, while still allowing a looser dev mode.
- Provide native support for datasite targeting (all, subset, single), parallelism, and round-robin (e.g., `next_datasite`).
- Model start/stop conditions, waiting, and conditional completion for multi-party steps.
- Allow static inputs and runtime-dependent inputs (local environment bindings).
- Make ETL sub-steps reusable without editing upstream steps.
- Extend typing to include primitives and collections (lists, maps, sets) and file-like structures (CSV, JSON with key paths, mappings).
- Make formats explicit so we can support singular vs batch processing (e.g., CSV row mapping, JSON key paths).
- Expose execution locations/paths (`{run}`, `{work}`, `{results}`) and support patterns like `files*.tsv` for lists.
- Protect against asset changes via hashes or manifests; allow anchors for reusable vars in YAML.
- Allow module folders with auto-discovery of `module.yaml`/`module.yml`, plus `module_paths` for safe local lookup by name (explicit allowlist only; no global search).
We will use .local. overlays for on-machine overrides, similar to Kubernetes-style patching. The overlay is a separate YAML file (sidecar) that applies JSON Patch (RFC 6902) or a small strategic-merge subset. This keeps the base Flow immutable while enabling local edits.
Recommended apply order (low -> high precedence):
- Base flow (`flow.yaml`)
- Sidecar overlay (`flow.local.overlay.yaml`) if present
- Runtime overlays passed in CLI order (`--overlay ...`), last wins
Example CLI:
bv flow run flow.yaml \
--overlay flow.staging.overlay.yaml \
--overlay flow.dev.overlay.yaml
This makes .local. a clear convention for local-only patches, while still allowing explicit overlays for staging/production. Runtime overlays can be inserted after the sidecar by default; if we need a different priority later, we can add a --overlay-order flag.
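A sidecar overlay using JSON Patch (RFC 6902) operations might look like this. The `patches:` wrapper key and the target paths are illustrative; only the `op`/`path`/`value` shape comes from RFC 6902:

```yaml
# flow.local.overlay.yaml -- hypothetical sidecar overlay
patches:
  - op: replace
    path: /spec/steps/0/runs_on
    value: [client1@sandbox.local]   # run only locally on this machine
  - op: add
    path: /spec/inputs/cohort/default
    value: "dev"
```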
Recommended resolution rules for local modules:
- `source.path` may be a file or a directory. If a directory, resolve `module.yaml`/`module.yml` at its root.
- `spec.module_paths` is an explicit allowlist of local search roots. A short name like `hello` can resolve to `./modules/hello` when listed in `module_paths`.
- Resolution should be disabled unless the module ref explicitly allows local usage (e.g., `policy.allow_local: true`), to prevent accidental loading of local code.
- Avoid global recursive search; only search the configured roots and explicit paths.
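Combining these rules, a module reference that opts into local resolution might be declared like this (hypothetical syntax; the `module:` object shape is an assumption for discussion):

```yaml
spec:
  module_paths:
    - ./modules              # explicit allowlist of local search roots
  steps:
    - id: hello_step
      module:
        name: hello          # resolves to ./modules/hello/module.yaml
        policy:
          allow_local: true  # required before local code may be loaded
```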