Commit df2a159

Add tutorial for launching workers on separate machines (#213)
1 parent 675fc86 commit df2a159

3 files changed: +73 -6 lines changed

.github/workflows/examples-calc-x.yml

Lines changed: 25 additions & 5 deletions
@@ -156,16 +156,16 @@ jobs:
 
       - name: Calc-X training with external store
         run: |
-          set -ex
+          set -euo pipefail
           source .venv/bin/activate
           cd examples/calc_x
           ../../scripts/restart_ray.sh
 
           agl store --port 4747 &
           sleep 5
-          AGL_MANAGED_STORE=0 AGL_CURRENT_ROLE=runner python train_calc_agent.py --external-store-address http://localhost:4747 --val-file data/test_mini.parquet --ci &
+          AGL_MANAGED_STORE=0 AGL_CURRENT_ROLE=runner python train_calc_agent.py --external-store-address http://localhost:4747 --val-file data/test_mini.parquet --ci-fast &
           sleep 5
-          AGL_MANAGED_STORE=0 AGL_CURRENT_ROLE=algorithm python train_calc_agent.py --external-store-address http://localhost:4747 --val-file data/test_mini.parquet --ci
+          AGL_MANAGED_STORE=0 AGL_CURRENT_ROLE=algorithm python train_calc_agent.py --external-store-address http://localhost:4747 --val-file data/test_mini.parquet --ci-fast
 
           pkill -f agl && echo "SIGTERM sent to agl" || echo "No agl process found"
           while pgrep -f agl; do
@@ -178,10 +178,30 @@ jobs:
             sleep 5
           done
           echo "train_calc_agent.py has finished."
-
-          sleep 10
         shell: bash
         env:
           WANDB_BASE_URL: ${{ secrets.MSR_WANDB_BASE_URL }}
           WANDB_API_KEY: ${{ secrets.MSR_WANDB_API_KEY }}
         id: calc_x_train_external_store
+
+      - name: Calc-X training with role-based environment variables
+        run: |
+          set -euo pipefail
+          source .venv/bin/activate
+          cd examples/calc_x
+          ../../scripts/restart_ray.sh
+
+          PYTHONUNBUFFERED=1 AGL_SERVER_HOST=127.0.0.1 AGL_SERVER_PORT=5858 AGL_CURRENT_ROLE=runner python train_calc_agent.py --val-file data/test_mini.parquet --ci-fast &
+          sleep 5
+          PYTHONUNBUFFERED=1 AGL_SERVER_HOST=0.0.0.0 AGL_SERVER_PORT=5858 AGL_CURRENT_ROLE=algorithm python train_calc_agent.py --val-file data/test_mini.parquet --ci-fast
+
+          pkill -f train_calc_agent.py && echo "SIGTERM sent to train_calc_agent.py" || echo "No train_calc_agent.py process found"
+          while pgrep -f train_calc_agent.py; do
+            echo "Waiting for train_calc_agent.py to finish..."
+            sleep 5
+          done
+          echo "train_calc_agent.py has finished."
+        shell: bash
+        env:
+          WANDB_BASE_URL: ${{ secrets.MSR_WANDB_BASE_URL }}
+          WANDB_API_KEY: ${{ secrets.MSR_WANDB_API_KEY }}
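The new role-based step above also works as a template for a quick local experiment outside CI. A minimal sketch, assuming the same setup the workflow uses (repository root, a `.venv` environment, and Ray started via `scripts/restart_ray.sh`), with port 5858 free:

```bash
# Launch the runner role in the background, then the algorithm role in the foreground.
# Both processes share AGL_SERVER_HOST/AGL_SERVER_PORT, so they attach to the same store.
source .venv/bin/activate
cd examples/calc_x
../../scripts/restart_ray.sh

AGL_SERVER_HOST=127.0.0.1 AGL_SERVER_PORT=5858 AGL_CURRENT_ROLE=runner \
  python train_calc_agent.py --val-file data/test_mini.parquet --ci-fast &
sleep 5
AGL_SERVER_HOST=0.0.0.0 AGL_SERVER_PORT=5858 AGL_CURRENT_ROLE=algorithm \
  python train_calc_agent.py --val-file data/test_mini.parquet --ci-fast
```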

docs/tutorials/parallelize.md

Lines changed: 33 additions & 0 deletions
@@ -167,6 +167,39 @@ Set `AGL_SERVER_HOST` and `AGL_SERVER_PORT` if you prefer environment-based conf
 
 Algorithms sometimes require heterogeneous computation resources, such as GPU accelerators, while runners sometimes require a specific environment to run because many agent frameworks are fragile in their dependencies. A role-based launch pattern helps you place the algorithm on a dedicated machine with more GPU memory, while runners can live on another machine with more flexible dependencies. This is possible via `AGL_CURRENT_ROLE="algorithm"` or `AGL_CURRENT_ROLE="runner"` environment variables. When running on different machines, you also need to set `AGL_SERVER_HOST` and `AGL_SERVER_PORT` to the IP address and port of the algorithm machine. You might recognize that this convention is very similar to `MASTER_ADDR` and `MASTER_PORT` in [PyTorch distributed training](https://docs.pytorch.org/docs/stable/notes/ddp.html).
 
+### Launching Algorithm and Runner Roles on Separate Machines
+
+When you want to stretch the algorithm onto a GPU-rich machine and keep rollout workers close to the data source (or on machines with a more permissive dependency stack), launch the same training script in different terminals with role-specific environment variables. The client–server strategy will route each process to the right side of the queue as long as they share the same `AGL_SERVER_HOST`/`AGL_SERVER_PORT` pair.
+
+**1. Pick an address and port for the store.** Decide which machine will host the algorithm. Choose a TCP port that can be reached by the runner machines (for example, open it in your firewall configuration). In this example we will use `10.0.0.4:4747`.
+
+**2. Start the algorithm process.** On the machine that should run the algorithm, expose the store by binding to all network interfaces and mark the role as `algorithm`.
+
+```bash
+export AGL_SERVER_HOST=0.0.0.0
+export AGL_SERVER_PORT=4747
+export AGL_CURRENT_ROLE=algorithm
+
+python train_calc_agent.py
+```
+
+Leaving `AGL_MANAGED_STORE` unset (or setting it to `1`) lets the strategy create the [`LightningStoreServer`][agentlightning.LightningStoreServer] for you. Otherwise, you can use the method in the previous section to create a store on your own.
+
+**3. Start rollout workers on remote machines.** Every runner machine should point to the algorithm host and declare itself as the `runner` role. You can start multiple processes per machine or repeat the command on additional hosts.
+
+```bash
+export AGL_SERVER_HOST=10.0.0.4
+export AGL_SERVER_PORT=4747
+export AGL_CURRENT_ROLE=runner
+python train_calc_agent.py --n-runners 4
+```
+
+The runner process automatically connects via [`LightningStoreClient`][agentlightning.LightningStoreClient]. Adjust `--n-runners` to spawn the desired number of worker processes on that machine.
+
+**4. Scale out as needed.** Repeat step 3 on as many machines as you need. When you are done, stop the algorithm process. However, since the runners are on different machines, the strategy WILL NOT send a cooperative stop signal to the connected runners. So you need to kill the runners on your own.
+
+This role-based launch mirrors what [`Trainer.fit`][agentlightning.Trainer.fit] does inside a single machine while letting you spread work across a fleet. Because every process shares the same training script, you keep a single source of truth for dataset loading, adapters, and tracers, but you can tune compute resources independently for the algorithm and rollout workers.
+
 ### Shared-memory Strategy
 
 [`SharedMemoryExecutionStrategy`][agentlightning.SharedMemoryExecutionStrategy] keeps everything inside one process. The runner runs on the main thread (by default) while the algorithm lives on a Python thread guarded by [`LightningStoreThreaded`][agentlightning.LightningStoreThreaded].
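Step 4 of the new tutorial notes that the strategy will not stop remote runners for you. A hedged sketch of cleaning them up by hand, assuming hypothetical runner hosts `runner-1` and `runner-2` reachable over SSH and workers started as `train_calc_agent.py` processes (the same pattern the CI workflow uses locally with `pkill`):

```bash
# Hypothetical hostnames; replace with your actual runner machines.
for host in runner-1 runner-2; do
  # Ask each runner machine to terminate its training worker processes.
  ssh "$host" 'pkill -f train_calc_agent.py || echo "no runner processes on $(hostname)"'
done
```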

examples/calc_x/train_calc_agent.py

Lines changed: 15 additions & 1 deletion
@@ -105,6 +105,7 @@ def train(
     model: Optional[str],
     llm_proxy: bool,
     ci: bool,
+    ci_fast: bool,
     n_runners: int,
     external_store_address: str,
 ):
@@ -117,6 +118,7 @@ def train(
         llm_proxy: Whether to enable LLM Proxy tracing/adapter.
         ci: Whether to run a minimal CI-style training loop.
         n_runners: The number of runners for the Trainer.
+        ci_fast: Whether to cap the training loop at a single step (implies CI toggles).
         external_store_address: Connects to an external store instead of creating a new one in memory.
     """
     # Load datasets (respect CLI file paths)
@@ -134,7 +136,7 @@ def train(
         config["actor_rollout_ref"]["model"]["path"] = model
 
     # CI toggle keeps everything else the same but you can tweak the lightweight bits here if desired
-    if ci:
+    if ci or ci_fast:
         # Config the experiment name and project name so that they are available to CI
         timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
         EXPERIMENT_NAME = f"calc_x_{timestamp}"
@@ -161,6 +163,11 @@ def train(
         config["trainer"]["project_name"] = PROJECT_NAME
         config["trainer"].pop("save_freq", None)
 
+    if ci_fast:
+        # Extra fast CI toggle for testing purposes.
+        config["trainer"]["total_training_steps"] = 1
+        config["trainer"]["test_freq"] = 1
+
     algorithm = agl.VERL(config)
 
     if external_store_address:
@@ -185,6 +192,9 @@ def main():
     parser.add_argument("--model", type=str, default=None, help="HF model id or path (optional)")
     parser.add_argument("--llm-proxy", action="store_true", help="Enable LLM Proxy tracing/adapter")
     parser.add_argument("--ci", action="store_true", help="Run a minimal CI-style training loop")
+    parser.add_argument(
+        "--ci-fast", action="store_true", help="Limit the training loop to a single step (implies --ci)"
+    )
     parser.add_argument("--n-runners", type=int, default=10, help="Number of runners for Trainer")
     parser.add_argument(
         "--external-store-address",
@@ -203,12 +213,16 @@ def main():
             "Otherwise the trainer will still try to manage the store lifecycle for you!"
         )
 
+    if args.ci_fast:
+        args.ci = True
+
     train(
         train_file=args.train_file,
         val_file=args.val_file,
         model=args.model,
         llm_proxy=args.llm_proxy,
         ci=args.ci,
+        ci_fast=args.ci_fast,
         n_runners=args.n_runners,
         external_store_address=args.external_store_address,
     )
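The `--ci-fast` flag added here caps training at a single step and implies `--ci`, which is what lets the new workflow steps above finish quickly. A minimal sketch of a local smoke test, assuming the bundled `examples/calc_x` data files:

```bash
# Single short run: --ci-fast implies --ci and limits the trainer to one training step.
cd examples/calc_x
python train_calc_agent.py --val-file data/test_mini.parquet --ci-fast
```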
