@philippnormann (Contributor) commented Aug 12, 2025

What does this PR do?

Fixes the LangGraph agent recipe so it runs out-of-the-box across different environments. The original example had undefined variables and brittle error handling that caused failures. This PR makes it portable, robust, and self-contained. No breaking API changes.

Checklist Before Starting

  • ✅ Search for similar PRs: https://github.com/search?q=repo%3Avolcengine%2Fverl+langgraph++&type=pullrequests&state=open
  • ✅ Format PR title as [recipe] fix: make LangGraph agent example runnable out-of-the-box
    • {modules}: recipe
    • {type}: fix
    • No breaking API changes

Test

✅ End-to-end validation:

# 1. Generate dataset (parameterized)
python recipe/langgraph_agent/example/create_dataset.py --train_size 1000 --test_size 100

# 2. Run training (no modifications needed)
bash recipe/langgraph_agent/example/run_qwen2.5_3b.sh

# 3. SLURM submission (headers included)
sbatch recipe/langgraph_agent/example/run_qwen2.5_3b.sh
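
("Headers included" above refers to SBATCH directives at the top of run_qwen2.5_3b.sh so the script can be submitted directly. The lines below are only an illustration of what such headers typically look like; the actual directives and values in the script may differ.

# Illustrative SLURM directives (values are assumptions, not taken from the script)
#SBATCH --nodes=1
#SBATCH --gpus-per-node=2
#SBATCH --job-name=langgraph_agent_example
)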

Note on GPUS_PER_NODE and NNODES:

  • GPUS_PER_NODE: GPUs per node.
    Detection order: SLURM_GPUS_ON_NODE (if set) → GPUS_PER_NODE → 2.
  • NNODES: number of nodes.
    Detection order: SLURM_JOB_NUM_NODES (if set) → NNODES → 1.
  • Total GPUs = GPUS_PER_NODE × NNODES (must be ≥ 2).
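
A minimal bash sketch of the detection order described above (the exact handling in run_qwen2.5_3b.sh may differ slightly):

# Fall back from SLURM variables to user overrides to hard defaults
GPUS_PER_NODE=${SLURM_GPUS_ON_NODE:-${GPUS_PER_NODE:-2}}
NNODES=${SLURM_JOB_NUM_NODES:-${NNODES:-1}}
# Require at least 2 GPUs in total
if [ $((GPUS_PER_NODE * NNODES)) -lt 2 ]; then
  echo "Error: need at least 2 GPUs in total" >&2
  exit 1
fi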

Local override (no SLURM_* set):

GPUS_PER_NODE=4 NNODES=2 bash recipe/langgraph_agent/example/run_qwen2.5_3b.sh

Results:

  • Model converged to 100% validation accuracy (val-core/lighteval/MATH/reward/mean@4: 1.0)
  • Stable metrics: policy loss, entropy, critic scores all normal
  • No crashes or hangs during run
  • Robust handling of malformed tool-call JSON (logs warnings)
  • Model path fallback works when local model missing
  • SLURM detection + fallbacks confirmed
[Screenshot: math_expression_tool – Weights & Biases]

API and Usage Example

No breaking API changes. Dataset generator now has a CLI interface:

# Defaults: 5000 train, 500 test → data/math_expression_tool/
python recipe/langgraph_agent/example/create_dataset.py

# Custom sizes & output dir
python recipe/langgraph_agent/example/create_dataset.py \
  --train_size 10000 \
  --test_size 1000 \
  --output_dir my_custom_path

# Training
bash recipe/langgraph_agent/example/run_qwen2.5_3b.sh

# SLURM
sbatch recipe/langgraph_agent/example/run_qwen2.5_3b.sh

Design & Code Changes

Core runnability fixes:

  • run_qwen2.5_3b.sh:

    • Replace undefined ARNOLD_* vars with SLURM detection + fallbacks
    • Fix dataset paths
    • Add HF hub model fallback (see the sketch after this list)
    • Apply performance tuning from GSPO recipe
  • chat_model.py: Harden tool-call parsing for malformed JSON

  • create_dataset.py: Add CLI args (--train_size, --test_size, --output_dir) with defaults
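
A minimal sketch of the HF hub model fallback mentioned above (the variable name, local path, and Hub model id here are illustrative assumptions; see run_qwen2.5_3b.sh for the actual logic):

# Prefer a local checkpoint if present; otherwise fall back to a Hugging Face Hub model id
MODEL_PATH=${MODEL_PATH:-"$HOME/models/Qwen2.5-3B-Instruct"}  # hypothetical local path
if [ ! -d "$MODEL_PATH" ]; then
  MODEL_PATH="Qwen/Qwen2.5-3B-Instruct"  # resolved from the Hub at load time
fi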

Docs & polish:

  • Update README.md with CLI params and SLURM example
  • Sort imports to satisfy ruff linting

Impact: Example now works out-of-the-box in local and cluster environments without edits.

Checklist Before Submitting

  • ✅ Read the Contribute Guide: https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md
  • ✅ Pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
  • ✅ Documentation updated (README.md)
  • ✅ Manual end-to-end test with convergence results
  • ✅ CI request to be sent in Slack once PR is opened

CLAassistant commented Aug 12, 2025

CLA assistant check
All committers have signed the CLA.

@gemini-code-assist bot left a comment

Code Review

This pull request significantly improves the LangGraph agent example by making it runnable out-of-the-box. The changes, including adding CLI arguments, SLURM support, and more robust error handling, are well-executed. My review focuses on enhancing portability and maintainability. I've identified a critical portability issue in the run script due to a hardcoded network configuration and a high-severity maintainability issue in the chat model from duplicated error-handling code. Addressing these points will further solidify the example's robustness and ease of use across different environments.

In recipe/langgraph_agent/example/run_qwen2.5_3b.sh:

export NCCL_IBEXT_DISABLE=1
export NCCL_NVLS_ENABLE=1
export NCCL_IB_HCA=mlx5
export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1

Severity: critical

The UCX_NET_DEVICES variable is hardcoded to a specific configuration with 8 Mellanox network interfaces. This will cause the script to fail on most systems that do not match this exact hardware setup, which undermines the goal of making the example runnable out-of-the-box. To ensure portability, this line should be removed to allow UCX to use default settings, or the device list should be generated dynamically based on the available hardware.
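
For illustration only (not part of this PR), one way to build the device list dynamically, along the lines suggested above, would be to derive it from the HCAs actually present and otherwise leave UCX defaults untouched:

# Untested sketch: pin UCX only to Mellanox HCAs that actually exist on the node
if ls /sys/class/infiniband/mlx5_* >/dev/null 2>&1; then
  export UCX_NET_DEVICES=$(ls -d /sys/class/infiniband/mlx5_* | xargs -n1 basename | sed 's/$/:1/' | paste -sd, -)
fi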

@philippnormann (Contributor, Author) replied:

This setting is copied from the GSPO recipe (see test_gspo_3b_math.sh:L26) and has worked across multiple hardware configs in my tests. I’m not a UCX expert and open to feedback if there’s a more portable approach.

@chenhaiq requested a review from wuxibin89 August 13, 2025 02:02
@wuxibin89 merged commit 83cfc76 into volcengine:main Aug 13, 2025
8 checks passed
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Aug 15, 2025
techkang pushed a commit to techkang/verl that referenced this pull request Aug 15, 2025
ChangyiYang pushed a commit to SwordFaith/verl that referenced this pull request Aug 16, 2025
whatadayG pushed a commit to whatadayG/verl that referenced this pull request Sep 5, 2025
WncFht pushed a commit to WncFht/verl that referenced this pull request Oct 10, 2025
techkang pushed a commit to techkang/verl that referenced this pull request Oct 31, 2025