- Add grail/trainer/distributed/ module (9 files): FSDP2 parallelism, distributed checkpointing, gradient sync, TP-aware logprobs, launcher for both torchrun and multiprocessing.Process paths
- Integrate into TrainerNeuron via GRAIL_DIST_NPROC env var dispatch
- Add SnapshotManager.adopt_snapshot_atomic() for FSDP2 checkpoint adoption
- Refactor shared/constants into protocol/constants (immutable) and shared/config (deployment-configurable); migrate all imports
- Add FA4 attention handler, Liger kernel integration, torch.compile support, token-budget batching, sequence packing, MFU tracking
- Add pluggable advantage estimators (grpo, dr_grpo, dapo)
- Delete test_grail_sampling_shape_check.py and test_grail_termination_check.py: both were entirely commented out and referenced the deleted grail.grail.Verifier APIs
- Add DDP (static_graph=False), DILOCO (bare model, CPU global params, Nesterov outer SGD), and PULSE-DiLoCo (BF16-gated sparse all-gather with FP32 residual buffer) alongside existing FSDP2
- Snapshot gating: snapshots/latest is only updated after the DILOCO outer sync, to prevent exposing non-consensus weights to the upload worker
- PULSE memory guard: falls back to per-parameter dense all-reduce when a sparse all-gather would exceed the 4 GB GPU budget
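The DILOCO outer step named above (CPU global params, Nesterov outer SGD) can be sketched as follows. This is an illustrative reconstruction of the published DiLoCo recipe, not grail's actual API; the function name and hyperparameter values are assumptions.

```python
import numpy as np

def diloco_outer_step(global_params, local_params, velocity,
                      outer_lr=0.5, momentum=0.9):
    """One DILOCO outer step, sketched: the drift (global - local) acts as a
    pseudo-gradient, and Nesterov-momentum SGD updates the CPU-resident
    global params. Names and hyperparameters are illustrative only."""
    new_global, new_velocity = [], []
    for p, q, v in zip(global_params, local_params, velocity):
        g = p - q                    # pseudo-gradient: outer drift
        v = momentum * v + g         # momentum buffer
        step = g + momentum * v      # Nesterov lookahead
        new_global.append(p - outer_lr * step)
        new_velocity.append(v)
    return new_global, new_velocity
```

The outer step runs once every H inner optimizer steps; between outer steps each worker trains on its local shard, so only the drift crosses the network.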
- 57 tests covering config parsing, context unwrap, DILOCO outer step math, PULSE BF16 gating, residual conservation, snapshot gating, memory guard thresholds, resume checkpoint roundtrip
- Session-scope Qwen model loading in conftest with deepcopy per test (avoids reloading the 1.5B model for each test function)
- Remove unused session-scoped mutable CopycatTracker fixture
- Collapse 9 near-identical Qwen batch equivalence tests into 1 regression + 1 parametrized exploratory test
- Rewrite test_engine.py to test actual behavior (gen_params propagation, idempotent shutdown) instead of private-attribute identity checks
- Add docstrings to test_advantages.py and test_basics.py
- Add pytest-xdist>=3.5.0 to dev dependencies
- Register a 'serial' marker for tests that touch global singletons
- Mark TestServiceIntegration as serial (mutates the COPYCAT_TRACKER global)
- Split CI into parallel (-n auto) and serial test runs
- Add filelocks around HF model downloads to prevent cache corruption
- Clean server-specific HF_HOME/HF_HUB_CACHE env vars in test fixtures
- Mark throughput-sensitive codec tests as serial
…nt sketch
- Sort top-k indices by position to make the sketch invariant to GEMM tile ordering
- PROOF_TOPK 32 -> 16, PROOF_SKETCH_TOLERANCE_BASE 30 -> 6000, growth 3.0 -> 5.0
- Validators accept v4/v5 (drop v1/v2); forgery probability 10^-167 at K=32
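The position-sorted top-k selection above can be sketched as below. This is a minimal illustration of the idea (canonicalize the top-k set by token position so the sketch does not depend on the order a GEMM happened to emit ties-free maxima in); the helper name is hypothetical and not grail's API.

```python
import numpy as np

def position_sorted_topk(logits, k=16):
    """Pick the k entries with the largest magnitude, then return them
    sorted by position. argpartition gives an arbitrary internal order,
    which is exactly what the position sort canonicalizes away."""
    idx = np.argpartition(-np.abs(logits), k - 1)[:k]  # top-k, arbitrary order
    idx.sort()                                         # canonical: by position
    return idx, logits[idx]
```

Because the output depends only on the top-k *set* (assuming distinct magnitudes), any tile ordering of the underlying matmul yields an identical sketch.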
- get_model() raises when device=None and CUDA is unavailable; tests must pass device="cpu" explicitly
- The validator CLI runs a startup GPU probe before any subtensor or R2 work
- Docker compose adds NVIDIA_VISIBLE_DEVICES, NVIDIA_DRIVER_CAPABILITIES, and shm_size
- Remove the top_k/repetition_penalty extra_body to avoid 400 errors with older openai clients
- Log the error response body (truncated to 200 chars) on retryable failures
- Add reload_with_new_checkpoint() to SGLangServerManager: stops the server, updates the model path, and restarts with the new weights
- v6: BF16 gate as a mask, transmit raw FP32 s to eliminate quantization overshoot
- Force an FP32 outer optimizer with an explicit precision check; DILOCO H now counts optimizer steps
- Tune glibc MALLOC_MMAP_THRESHOLD_, call malloc_trim after sync, drop transient clones in the save path
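The malloc_trim-after-sync trick above relies on a glibc-specific call that Python cannot reach without ctypes. A minimal, best-effort sketch (the function name is ours; the MALLOC_MMAP_THRESHOLD_ tuning mentioned above is a separate env-var knob not shown here):

```python
import ctypes
import ctypes.util

def trim_glibc_heap():
    """Best-effort glibc heap trim: ask malloc to return free arenas to
    the OS after a large transient allocation (e.g. an outer sync).
    Returns False as a no-op on platforms without glibc's malloc_trim."""
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
        libc.malloc_trim(0)  # 0 = trim as much as possible from the heap top
        return True
    except (OSError, AttributeError):
        return False  # non-glibc libc (e.g. macOS, musl): nothing to do
```

Calling this right after dropping multi-GB CPU buffers is what lets RSS actually fall; without it, glibc keeps the freed arenas mapped for reuse.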
- Replace torch.inference_mode() with torch.no_grad() so recomputed logprobs stay autograd-compatible
- Pass use_cache=False to skip the unused KV-cache path during logprob extraction
- Preserve the caller's training/eval state via a was_training try/finally
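The was_training try/finally pattern above generalizes to a small context manager. This is a sketch of the pattern only (the helper name is ours); it works with any torch-like object exposing .training, .train(), and .eval():

```python
import contextlib

@contextlib.contextmanager
def eval_for_logprobs(model):
    """Switch the model to eval mode for a logprob recompute, and restore
    the caller's training/eval state even if the forward pass raises."""
    was_training = model.training
    model.eval()
    try:
        yield model
    finally:
        if was_training:
            model.train()
```

Restoring the state in finally is the important part: a raised exception mid-recompute must not leave the trainer silently stuck in eval mode.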
- Add SGLangServerManager as an alternative to VLLMServerManager
- Patch shared constants (GRPO_VARIANT, ADV_ESTIMATOR, etc.) from the Hydra config
- Restore training mode and gradient checkpointing after rollout generation
- Pass WandB project/entity/run_name/notes from the yaml config; require Python >=3.12
- Follow upstream TrainingConfig API rename from distributed training
- requires-python ">=3.12"; black/mypy/ruff/pyright targets, CI matrix, Dockerfile, .python-version
- Auto-fix UP017 (datetime.UTC) and UP041 (TimeoutError); drop the dead sys.version_info gate in execution.py
- Regenerate uv.lock + research subproject locks; drop the async-timeout, exceptiongroup, and tomli backports
- validate_metadata() flags missing temperature/top_p/top_k/repetition_penalty in generation_params
- Add per-key missing-field tests and an all-missing-but-max_tokens regression case
…penalty
- SGLang otherwise falls back to the model's generation_config defaults (e.g. Qwen3 top_k=20)
- A mismatch with checkpoint generation_params causes systematic min_prob=0.0 failures
- Verified all 5 sampling params are accepted via the openai client's extra_body on SGLang 0.5.9/0.5.10
- SGLang 0.5.10 hits cudaErrorIllegalAddress during piecewise graph warmup for Qwen3 on B200
- The workaround is the one recommended by the SGLang error message itself
…en probs
- Replays repetition_penalty + temperature on cached logits to match miner sampling
- top_k/top_p deliberately omitted: hard masks suffer ~30-75% bf16 drift mismatches
- Trusts validate_metadata() guarantees on generation_params keys; raises KeyError if missing
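The math being replayed above can be sketched in a few lines. This is a numpy stand-in for what the HF logits processors do (the function name and the exact application scope of the penalty are our assumptions): HF-style repetition penalty divides positive logits and multiplies negative ones for tokens already seen, then temperature scales everything; top_k/top_p are deliberately skipped, mirroring the validator.

```python
import numpy as np

def replay_sampling_transform(logits, seen_token_ids, temperature, penalty):
    """Replay repetition penalty then temperature on one position's cached
    logits and return the probability distribution the miner sampled from."""
    out = logits.astype(np.float64).copy()
    for tok in set(seen_token_ids):
        # HF convention: shrink positive logits, push negative ones lower
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    out /= temperature
    out -= out.max()            # numerically stable softmax
    probs = np.exp(out)
    return probs / probs.sum()
```

Reading the chosen-token probability from this replayed distribution, instead of from the raw logits, is what removes the systematic min_prob=0.0 failures on penalized rollouts.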
…gprob check
- Replaces elementwise abs/rel tolerances with median(exp(|Δlp|) - 1) <= LOGPROB_IS_EPS=0.10
- Calibrated on 430k honest trials (0% FP, max honest median dev 0.066, ~50% headroom)
- Fail-closes on length mismatch and unresolved positions; adds median robustness tests
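The median criterion above fits in a few lines. A sketch under the stated threshold (the function name is ours, not grail's API):

```python
import math
import statistics

LOGPROB_IS_EPS = 0.10  # pass threshold on the median deviation

def logprob_identity_check(model_lps, miner_lps):
    """dev_i = exp(|model_lp_i - miner_lp_i|) - 1; pass iff the median
    deviation over the challenge positions is <= LOGPROB_IS_EPS.
    Fail-closed: a length mismatch is an automatic reject."""
    if len(model_lps) != len(miner_lps):
        return False
    devs = [math.exp(abs(a - b)) - 1.0 for a, b in zip(model_lps, miner_lps)]
    return statistics.median(devs) <= LOGPROB_IS_EPS
```

The median is what buys robustness: a handful of bf16-drifted outlier positions cannot fail an honest miner, while a systematically wrong model shifts the median itself.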
…ublishes
- Replace SHA-256 + tobytes() with xxh3-128 + memoryview (~10x faster, deterministic since xxhash 0.8.0)
- CheckpointPublisher stamps weights_hash on every FULL publish (live, async-snapshot, anchor background)
- The anchor background path logs and ships without a hash on staging-load failure; synchronous paths raise
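The zero-copy part of the change above can be sketched as follows. grail uses xxh3-128 from the third-party xxhash package; hashlib.sha256 stands in here so the sketch has no external dependency — the memoryview-instead-of-tobytes() pattern is the part being illustrated, and the function name is ours:

```python
import hashlib

def hash_weights(named_buffers):
    """Hash a dict of name -> raw weight buffer deterministically.
    Feeding each buffer through memoryview() avoids the full per-tensor
    copy that tensor.tobytes() would make before hashing."""
    h = hashlib.sha256()  # grail: xxhash.xxh3_128() for ~10x throughput
    for name, buf in sorted(named_buffers.items()):  # deterministic order
        h.update(name.encode())
        h.update(memoryview(buf))  # zero-copy view into the buffer
    return h.hexdigest()
```

Sorting by parameter name is what makes the digest independent of dict insertion order, so the same weights always stamp the same weights_hash.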
- Dockerfile.miner (new): SGLang 0.5.10 on top of the validator base; two-pass install with a forced reinstall of grail-pinned transformers/torch
- docker-compose.validator.yml: GRAIL_VALIDATOR_IMAGE/TEST_MODE/GPU_IDS overrides, pid: host for cross-process partial-cleanup, --no-sync command
- Tiny new module replacing bare assert on the data path (python -O strips asserts)
- Carries the offending values in the message so failures are debuggable from logs alone
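The pattern above — a loud, -O-proof invariant that carries its own evidence — is a few lines. The `require` helper name is hypothetical; the exception name matches the one described in this PR:

```python
class ProtocolViolationError(RuntimeError):
    """Raised instead of a bare `assert` on the data path: it survives
    python -O and carries the offending values in its message."""


def require(condition, message, **values):
    """Sketch of the invariant check: on failure, embed the named values
    so a log line alone is enough to debug the violation."""
    if not condition:
        detail = ", ".join(f"{k}={v!r}" for k, v in sorted(values.items()))
        raise ProtocolViolationError(f"{message} ({detail})" if detail else message)
```

A call site like `require(len(ids) == n, "sequence length mismatch", got=len(ids), want=n)` then fails with both numbers in the message, unlike `assert`, which vanishes under -O.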
…CAP and bump caps
- The rename clarifies it as the immutable network cap, not a default; the trainer's max_tokens is the operating limit
- Bump UNIQUE_ROLLOUTS_CAP 1500 -> 5000 and GRAIL_BURN_PERCENTAGE 80% -> 90% to match basilica production policy
- Update episode.py, the rollout schema, the schema/termination validators, and two unit tests to the renamed symbol
…undary
- GenerationParams.from_checkpoint_metadata raises ProtocolViolationError on missing/malformed fields
- Caps max_tokens at MAX_NEW_TOKENS_PROTOCOL_CAP; normalises top_k=0/None so backends use the server default
- New integration test drives the trainer-payload -> validator chain end-to-end (regression for the 8192/2048 bug)
- Send prompt token IDs directly via httpx and read exact output_ids back; no text re-tokenization, no stop-token stripping
- Add max_model_len + a binary _cap_max_tokens clamp (return max_new or 0); overlong prompts return an empty completion instead of a partial one
- top_k/repetition_penalty pass through as top-level fields; a new unit test covers the decision boundary
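The binary clamp above is intentionally all-or-nothing. A sketch with illustrative names (the real method is private and takes the manager's own config):

```python
def cap_max_tokens(prompt_len, max_new, max_model_len):
    """Binary clamp: either the full requested generation budget fits in
    the context window, or we generate nothing at all. A partially
    clamped completion would look like a truncation the miner chose,
    which the validator's termination check would then reject."""
    return max_new if prompt_len + max_new <= max_model_len else 0
```

Returning 0 (empty completion) for an overlong prompt keeps the failure mode unambiguous on both sides of the protocol.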
- gpu_logprobs=True keeps logits on the GPU; only chosen-token logprobs (~16 KB/seq) cross PCIe
- Per-seq peak bounded by LOGPROB_CHUNK*vocab*4B (~312 MB); the CPU path is retained as an OOM fallback
- Replace bare asserts in compute_proofs and ProofWorker with ProtocolViolationError
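The chunked log-softmax + gather described above can be sketched in numpy (a stand-in for the torch GPU path; names are illustrative). The point is that only one float per position ever leaves the function, never the full [seq, vocab] matrix, and the peak temporary is bounded by the chunk size:

```python
import numpy as np

def chosen_token_logprobs(logits, token_ids, chunk=256):
    """Compute log-softmax over the vocab axis in chunks of `chunk`
    positions and gather only the chosen-token logprobs. Peak temporary
    memory is chunk * vocab floats instead of seq * vocab."""
    out = np.empty(len(token_ids), dtype=np.float64)
    for start in range(0, len(token_ids), chunk):
        block = logits[start:start + chunk].astype(np.float64)
        block -= block.max(axis=1, keepdims=True)      # stable log-softmax
        lp = block - np.log(np.exp(block).sum(axis=1, keepdims=True))
        ids = token_ids[start:start + chunk]
        out[start:start + chunk] = lp[np.arange(len(ids)), ids]
    return out
```

On the GPU the same structure means ~16 KB per sequence crosses PCIe instead of the full logit tensor, which is the 1.2 GB → ~16 KB reduction this PR reports.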
- Remove the legacy AgentEnvLoop path and the GRAIL_PIPELINE_ENABLED flag; the default backend flips vllm -> sglang
- Thread per-checkpoint gen_params from metadata to the backend; raise ProtocolViolationError at the trust boundary
- Validate the GPU layout at startup; the child SGLang/vLLM env inherits PATH; pipeline init failure exits to the supervisor
- get_default_generation_params returns max_tokens=2048 (was 8192) to match production policy
- The protocol cap MAX_NEW_TOKENS_PROTOCOL_CAP=8192 still applies; this is the per-checkpoint default
- Update test_dynamic_env_config.py assertions that hard-coded 8192
- .env.example + miner.md + miner-debugging.md: document mandatory pipeline mode and the sglang-default backend
- validator.md + incentive-mechanism.md: update burn (80% -> 90%) and rollout cap (2,500 -> 5,000) to match constants
- Reflect the MAX_NEW_TOKENS_PROTOCOL_CAP rename and the checkpoint-driven sampling policy throughout
…ting field
- CheckpointMetadata.thinking_mode required ("native"|"instructed"); validated in validate_metadata
- thinking.get_thinking_config default flips to "instructed"; document that the miner/validator override it at runtime
…eed bug
- Lift checkpoint env_id/env_params/generation_params/thinking_mode into a WindowEnvConfig resolved once per window
- MissingCheckpointMetadataError now skips the entire window (was: silent fallback per miner)
- Use the miner-published rollout_group int directly for seed derivation; the previous file-encounter order drifted on dropped groups
- The default fixture now includes thinking_mode="instructed" so existing happy-path tests stay green
- New tests cover missing/empty/invalid thinking_mode and the two valid values
…ing metadata
- Lift resolve_window_env_config into ValidationService.process_window so the abort fires BEFORE the availability/selection/inference counters advance
- Pass the resolved WindowEnvConfig down into WindowProcessor instead of refetching
- Bump validation/window_skipped_missing_metadata on the abort path
- Format miner_validator.py, service.py, window_processor.py
- Format tests/unit/validation/test_window_env_config.py
Lands the work from feat/distributed-training: a complete FSDP2 + DILOCO + PULSE-DiLoCo distributed trainer, a fully migrated SGLang pipelined reference miner (the legacy single-GPU AgentEnvLoop is removed from the codebase), the GRAIL proof v5 with order-invariant sketches, and the protocol/economics retune that pairs with them. 40 commits ahead of origin/main, ~150 files, calibrated against 545M proof positions across 6 models / 3 GPU types and 430k honest cross-GPU/cross-attn/cross-batch logprob trials. Note: miners are free to fork and run any architecture they want; the reference build is just what the team maintains and ships.
- grail/mining/engine.py is the only generation path in the shipped codebase: GRAIL_PIPELINE_ENABLED is deleted, the default backend flips vllm → sglang, the GPU layout is validated at startup, and pipeline init failure exits to the supervisor instead of silently dropping back to single-GPU mode. See .claude/optimization-runs/FINAL_ARCHITECTURE.md.
- Pipelined miner layout: SGLang generation (mem_fraction=0.88, max_running_requests=512, context length 4096); GPUs 4-7 run 4× HF proof subprocess workers that persist across windows (grail/mining/proof_worker.py); the wallet stays in the parent and signs returned commitment hashes; the engine loop pipelines the next iteration's generation with the previous iteration's proof collection via an asyncio.to_thread background submit. The bottleneck is now SGLang generation at 12.44 r/s; proof has ~70% slack, so the headline number scales with bigger SGLang boxes.
- log_softmax + gather on the proof GPU: per-sequence PCIe traffic drops 1.2 GB → ~16 KB (5× speedup per proof GPU, the change that unblocked multi-worker scaling). A numpy-packed ProofJob (sequences as int32[n_seqs, max_len]) gives ~50× faster pickling than list[list[int]].
- docker/Dockerfile.miner builds on the validator base via ARG GRAIL_BASE_IMAGE and runs a two-pass uv pip install (sglang[all]==0.5.10 first, then a forced reinstall of grail's transformers==4.57.1 / tokenizers==0.22.1 / torch==2.9.1) so SGLang's compiled bits stay loadable.
- The /generate endpoint is driven with input_ids directly via httpx: no BPE re-tokenization, no stop-token stripping, exact output_ids returned. A binary _cap_max_tokens clamp on overlong prompts ensures the validator's termination_valid check never sees a partially-clamped completion.
- tobytes() → xxh3-128 over a zero-copy memoryview (~10× faster on a 7B model, ~1-2 s on the publish path), which directly clawed back the per-window checkpoint blackout. Stamped on every FULL publish (live, async-snapshot, anchor-background); the anchor path degrades to no-hash on staging-load failure so the network keeps moving.
- grail/trainer/distributed/ module with four selectable strategies (FSDP2 default, DDP, DILOCO, PULSE-DiLoCo v6) dispatched via GRAIL_DIST_STRATEGY / GRAIL_DIST_NPROC. 2.18× speedup measured on 2× B200 / Qwen3-8B (192 s vs 419 s per epoch, 31.9K vs 14.6K tok/s, 34.79% vs 15.97% MFU, 109 GB vs 182 GB peak), confirmed at 250 s/epoch on real chain data. 75 new tests across the strategy + integration suites all pass on CPU without NCCL.
- PULSE-DiLoCo calls malloc_trim(0) after each sync to reclaim 10-50 GB of glibc heap, and uses a BF16-gated sparse all-gather with an FP32 CPU residual buffer plus a 4 GB memory guard that falls back to dense per-parameter all-reduce.
- Snapshot gating: snapshots/latest/ is only updated after a DILOCO outer sync, so non-consensus weights are never published. The SGLang inference server now hot-reloads from a checkpoint without a process restart.
- Proof v5: PROOF_TOPK 32 → 16, PROOF_SKETCH_TOLERANCE_BASE 30 → 6000, growth 3.0 → 5.0. Multi-model benchmark over 6 models (Qwen2.5/Qwen3/Llama-3.2, 0.6B to 8B) and 3 GPUs (B200, A100 PCIe, L40): 18 pairwise comparisons, 545M total positions, 27% headroom on the worst-case passing pair, ~10⁻¹⁶⁷ forgery probability at K=32. Validators accept v4+v5 and reject v1/v2. Proof constants live in grail/protocol/constants.py (immutable, no env override on the proof path).
- The validator/miner Dockerfile pulls the prebuilt cu12+torch2.9+cp312 flash-attn 2.8.3 wheel from Dao-AILab (SHA256-pinned, ~10 s install vs a 30-60 min source build) and runs after the final uv sync so it is not pruned.
- ProtocolViolationError(RuntimeError) replaces bare assert invariants on the data path so failures stay loud under python -O. Used by GenerationParams.from_checkpoint_metadata, the GPU layout startup validator, and the proof and proof-worker code paths.
- GenerationParams.from_checkpoint_metadata parses the trainer-published dict at the trust boundary and rejects missing/malformed max_tokens, temperature, top_p, top_k, repetition_penalty. The trainer's per-checkpoint default max_tokens is lowered 8192 → 2048 to match production policy.
- CheckpointMetadata adds a required thinking_mode field ("native" or "instructed"); validate_metadata() rejects checkpoints missing it. The validator overrides the process-wide GRAIL_THINKING_MODE from the resolved checkpoint so all downstream get_thinking_config() callers see the trainer-published value.
- WindowEnvConfig resolves the per-window env config (env_id, env_params, generation_params, thinking_mode) once per window in WindowProcessor.process_window; on MissingCheckpointMetadataError the entire window is skipped (was: a silent per-miner fallback that scored every miner against the wrong baseline).
- Seed-drift fix in _validate_rollouts: when a miner dropped a rollout group, the validator's reconstructed seeds drifted by one slot from the miner's, hard-failing every prompt-validation check for that miner. It now uses rollout_group_raw directly, with a hard fail on non-int values.
- MAX_NEW_TOKENS renamed to MAX_NEW_TOKENS_PROTOCOL_CAP to make explicit that this is the immutable network bound, not a default.
- Logprob identity check: dev_i = exp(|model_lp − miner_lp|) − 1; pass iff median(dev_i over K=32) ≤ 0.10. Calibrated on 430,650 honest trials across A100/B200/L40, batch sizes 1-16, sdpa+fa2, three model sizes: 0 false positives, max honest median dev = 0.066 (50% headroom). Fail-closes on length mismatch and any unresolved challenge position.
- RepetitionPenaltyLogitsProcessor + TemperatureLogitsWarper are replayed on cached logits before reading chosen-token probs, fixing systematic min_prob=0.0 failures on rollouts that used non-default sampling params (top_k/top_p deliberately not applied: hard masks suffer ~30-75% bf16 prefill-vs-decode drift; RFC-0018 tracks the proper fix).
- get_model() raises when device=None and CUDA is unavailable instead of silently saturating ~60 vCPUs and ~150 GB RAM (a real production incident); the validator CLI runs a startup GPU probe before any subtensor or R2 work; docker-compose.validator.yml adds NVIDIA_VISIBLE_DEVICES=all, NVIDIA_DRIVER_CAPABILITIES=compute,utility, shm_size: 2g, pid: host, GRAIL_VALIDATOR_GPU_IDS, and GRAIL_VALIDATOR_IMAGE.
- UNIQUE_ROLLOUTS_CAP 1,500 → 5,000 (period cap 60k over 12 windows), GRAIL_BURN_PERCENTAGE 80% → 90%. Miner share at 100% cap is unchanged, but underperforming miners now leak more emission to burn, and the new ceiling matches what the SGLang miner can actually produce.
- Trainer performance: lm_head computed in seq_len // 256 chunks for ~10 GB → ~312 MB peak memory on the logit/log-prob tensors (default-on via GRAIL_TRAINER_CHUNKED_LOGITS=1), opt-in FA4 attention for Blackwell, opt-in Liger kernels (RMSNorm/RoPE/SwiGLU), torch.compile on the base transformer with fixed-shape padding, token-budget micro-batching via greedy LPT bin-packing (<0.1% imbalance, with rank padding via dummy fwd/bwd to keep FSDP2 collectives aligned), MFU tracking, and pluggable advantage estimators (grpo, dr_grpo, dapo, with 12 unit tests).
- SnapshotManager.adopt_snapshot_atomic() writes metadata then renames, so snapshots are never partially written.
- grail/shared/constants.py (387 lines) deleted and split into grail/protocol/constants.py (immutable, zero os.getenv calls) plus grail/shared/config.py (deployment-tunable), enforcing the protocol-constants-are-immutable rule across the ~80 import sites that moved.
- CI: a -m serial job for global-singleton tests, filelocks around HF model downloads, a session-scope Qwen2.5-1.5B fixture with per-test deepcopy, dead test files removed.
- uv.lock regenerated, dropping the async-timeout / backports-asyncio-runner / exceptiongroup / tomli backports.
- The trainer logprob recompute moves from inference_mode() to no_grad() (autograd-compatible) and renames batch_size → micro_batch_size.
- Breaking changes: GRAIL_PIPELINE_ENABLED deleted, the grail/shared/constants.py import path gone, LogprobValidator metadata renamed (logprob_mismatches → logprob_median_dev + logprob_max_dev), checkpoints missing any of the 5 generation_params or thinking_mode rejected, get_model(device=None) raises when CUDA is unavailable, and the validator compose file requires an update.