- Add grail/trainer/distributed/ module (9 files): FSDP2 parallelism, distributed checkpointing, gradient sync, TP-aware logprobs, launcher for both torchrun and multiprocessing.Process paths
- Integrate into TrainerNeuron via GRAIL_DIST_NPROC env var dispatch
- Add SnapshotManager.adopt_snapshot_atomic() for FSDP2 checkpoint adoption
- Refactor shared/constants into protocol/constants (immutable) and shared/config (deployment-configurable); migrate all imports
- Add FA4 attention handler, Liger kernel integration, torch.compile support, token-budget batching, sequence packing, MFU tracking
- Add pluggable advantage estimators (grpo, dr_grpo, dapo)
- Delete test_grail_sampling_shape_check.py and test_grail_termination_check.py: both were entirely commented out and referenced the deleted grail.grail.Verifier APIs
- Add DDP (static_graph=False), DILOCO (bare model, CPU global params, Nesterov outer SGD), and PULSE-DiLoCo (BF16-gated sparse all-gather with FP32 residual buffer) alongside existing FSDP2
- Snapshot gating: snapshots/latest is only updated after the DILOCO outer sync, to prevent exposing non-consensus weights to the upload worker
- PULSE memory guard: falls back to per-parameter dense all-reduce when a sparse all-gather would exceed the 4 GB GPU budget
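The DILOCO outer step named above (CPU global params, Nesterov outer SGD) can be sketched as follows. This is an illustrative reconstruction of the published DiLoCo recipe, not grail's actual API; the function name and hyperparameter values are assumptions.

```python
import numpy as np

def diloco_outer_step(global_params, local_params, velocity,
                      outer_lr=0.5, momentum=0.9):
    """One DILOCO outer step, sketched: the drift (global - local) acts as a
    pseudo-gradient, and Nesterov-momentum SGD updates the CPU-resident
    global params. Names and hyperparameters are illustrative only."""
    new_global, new_velocity = [], []
    for p, q, v in zip(global_params, local_params, velocity):
        g = p - q                    # pseudo-gradient: outer drift
        v = momentum * v + g         # momentum buffer
        step = g + momentum * v      # Nesterov lookahead
        new_global.append(p - outer_lr * step)
        new_velocity.append(v)
    return new_global, new_velocity
```

The outer step runs once every H inner optimizer steps; between outer steps each worker trains on its local shard, so only the drift crosses the network.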
- 57 tests covering config parsing, context unwrap, DILOCO outer step math, PULSE BF16 gating, residual conservation, snapshot gating, memory guard thresholds, resume checkpoint roundtrip
- Session-scope Qwen model loading in conftest with deepcopy per test (avoids reloading the 1.5B model for each test function)
- Remove unused session-scoped mutable CopycatTracker fixture
- Collapse 9 near-identical Qwen batch equivalence tests into 1 regression + 1 parametrized exploratory test
- Rewrite test_engine.py to test actual behavior (gen_params propagation, idempotent shutdown) instead of private-attribute identity checks
- Add docstrings to test_advantages.py and test_basics.py
- Add pytest-xdist>=3.5.0 to dev dependencies
- Register a 'serial' marker for tests that touch global singletons
- Mark TestServiceIntegration as serial (mutates the COPYCAT_TRACKER global)
- Split CI into parallel (-n auto) and serial test runs
- Add filelocks around HF model downloads to prevent cache corruption
- Clean server-specific HF_HOME/HF_HUB_CACHE env vars in test fixtures
- Mark throughput-sensitive codec tests as serial
…nt sketch
- Sort top-k indices by position to make the sketch invariant to GEMM tile ordering
- PROOF_TOPK 32 -> 16, PROOF_SKETCH_TOLERANCE_BASE 30 -> 6000, growth 3.0 -> 5.0
- Validators accept v4/v5 (drop v1/v2); forgery probability 10^-167 at K=32
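The position-sorted top-k selection above can be sketched as below. This is a minimal illustration of the idea (canonicalize the top-k set by token position so the sketch does not depend on the order a GEMM happened to emit ties-free maxima in); the helper name is hypothetical and not grail's API.

```python
import numpy as np

def position_sorted_topk(logits, k=16):
    """Pick the k entries with the largest magnitude, then return them
    sorted by position. argpartition gives an arbitrary internal order,
    which is exactly what the position sort canonicalizes away."""
    idx = np.argpartition(-np.abs(logits), k - 1)[:k]  # top-k, arbitrary order
    idx.sort()                                         # canonical: by position
    return idx, logits[idx]
```

Because the output depends only on the top-k *set* (assuming distinct magnitudes), any tile ordering of the underlying matmul yields an identical sketch.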
- get_model() raises when device=None and CUDA is unavailable; tests must pass device="cpu" explicitly
- The validator CLI runs a startup GPU probe before any subtensor or R2 work
- Docker compose adds NVIDIA_VISIBLE_DEVICES, NVIDIA_DRIVER_CAPABILITIES, and shm_size
- Remove the top_k/repetition_penalty extra_body to avoid 400 errors with older openai clients
- Log the error response body (truncated to 200 chars) on retryable failures
- Add reload_with_new_checkpoint() to SGLangServerManager: stops the server, updates the model path, and restarts with the new weights
- v6: BF16 gate as a mask, transmit raw FP32 s to eliminate quantization overshoot
- Force an FP32 outer optimizer with an explicit precision check; DILOCO H now counts optimizer steps
- Tune glibc MALLOC_MMAP_THRESHOLD_, call malloc_trim after sync, drop transient clones in the save path
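The malloc_trim-after-sync trick above relies on a glibc-specific call that Python cannot reach without ctypes. A minimal, best-effort sketch (the function name is ours; the MALLOC_MMAP_THRESHOLD_ tuning mentioned above is a separate env-var knob not shown here):

```python
import ctypes
import ctypes.util

def trim_glibc_heap():
    """Best-effort glibc heap trim: ask malloc to return free arenas to
    the OS after a large transient allocation (e.g. an outer sync).
    Returns False as a no-op on platforms without glibc's malloc_trim."""
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
        libc.malloc_trim(0)  # 0 = trim as much as possible from the heap top
        return True
    except (OSError, AttributeError):
        return False  # non-glibc libc (e.g. macOS, musl): nothing to do
```

Calling this right after dropping multi-GB CPU buffers is what lets RSS actually fall; without it, glibc keeps the freed arenas mapped for reuse.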
- Replace torch.inference_mode() with torch.no_grad() so recomputed logprobs stay autograd-compatible
- Pass use_cache=False to skip the unused KV-cache path during logprob extraction
- Preserve the caller's training/eval state via a was_training try/finally
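The was_training try/finally pattern above generalizes to a small context manager. This is a sketch of the pattern only (the helper name is ours); it works with any torch-like object exposing .training, .train(), and .eval():

```python
import contextlib

@contextlib.contextmanager
def eval_for_logprobs(model):
    """Switch the model to eval mode for a logprob recompute, and restore
    the caller's training/eval state even if the forward pass raises."""
    was_training = model.training
    model.eval()
    try:
        yield model
    finally:
        if was_training:
            model.train()
```

Restoring the state in finally is the important part: a raised exception mid-recompute must not leave the trainer silently stuck in eval mode.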
- Add SGLangServerManager as an alternative to VLLMServerManager
- Patch shared constants (GRPO_VARIANT, ADV_ESTIMATOR, etc.) from the Hydra config
- Restore training mode and gradient checkpointing after rollout generation
- Pass WandB project/entity/run_name/notes from the yaml config; require Python >=3.12
- Follow upstream TrainingConfig API rename from distributed training
- requires-python ">=3.12"; black/mypy/ruff/pyright targets, CI matrix, Dockerfile, .python-version
- Auto-fix UP017 (datetime.UTC) and UP041 (TimeoutError); drop the dead sys.version_info gate in execution.py
- Regenerate uv.lock + research subproject locks; drop the async-timeout, exceptiongroup, and tomli backports
- validate_metadata() flags missing temperature/top_p/top_k/repetition_penalty in generation_params
- Add per-key missing-field tests and an all-missing-but-max_tokens regression case
…penalty
- SGLang otherwise falls back to the model's generation_config defaults (e.g. Qwen3 top_k=20)
- A mismatch with checkpoint generation_params causes systematic min_prob=0.0 failures
- Verified all 5 sampling params are accepted via the openai client's extra_body on SGLang 0.5.9/0.5.10
- SGLang 0.5.10 hits cudaErrorIllegalAddress during piecewise graph warmup for Qwen3 on B200
- The workaround is the one recommended by the SGLang error message itself
…en probs
- Replays repetition_penalty + temperature on cached logits to match miner sampling
- top_k/top_p deliberately omitted: hard masks suffer ~30-75% bf16 drift mismatches
- Trusts validate_metadata() guarantees on generation_params keys; raises KeyError if missing
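The math being replayed above can be sketched in a few lines. This is a numpy stand-in for what the HF logits processors do (the function name and the exact application scope of the penalty are our assumptions): HF-style repetition penalty divides positive logits and multiplies negative ones for tokens already seen, then temperature scales everything; top_k/top_p are deliberately skipped, mirroring the validator.

```python
import numpy as np

def replay_sampling_transform(logits, seen_token_ids, temperature, penalty):
    """Replay repetition penalty then temperature on one position's cached
    logits and return the probability distribution the miner sampled from."""
    out = logits.astype(np.float64).copy()
    for tok in set(seen_token_ids):
        # HF convention: shrink positive logits, push negative ones lower
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    out /= temperature
    out -= out.max()            # numerically stable softmax
    probs = np.exp(out)
    return probs / probs.sum()
```

Reading the chosen-token probability from this replayed distribution, instead of from the raw logits, is what removes the systematic min_prob=0.0 failures on penalized rollouts.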
…gprob check
- Replaces elementwise abs/rel tolerances with median(exp(|Δlp|) - 1) <= LOGPROB_IS_EPS=0.10
- Calibrated on 430k honest trials (0% FP, max honest median dev 0.066, ~50% headroom)
- Fail-closes on length mismatch and unresolved positions; adds median robustness tests
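The median criterion above fits in a few lines. A sketch under the stated threshold (the function name is ours, not grail's API):

```python
import math
import statistics

LOGPROB_IS_EPS = 0.10  # pass threshold on the median deviation

def logprob_identity_check(model_lps, miner_lps):
    """dev_i = exp(|model_lp_i - miner_lp_i|) - 1; pass iff the median
    deviation over the challenge positions is <= LOGPROB_IS_EPS.
    Fail-closed: a length mismatch is an automatic reject."""
    if len(model_lps) != len(miner_lps):
        return False
    devs = [math.exp(abs(a - b)) - 1.0 for a, b in zip(model_lps, miner_lps)]
    return statistics.median(devs) <= LOGPROB_IS_EPS
```

The median is what buys robustness: a handful of bf16-drifted outlier positions cannot fail an honest miner, while a systematically wrong model shifts the median itself.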
…ublishes
- Replace SHA-256 + tobytes() with xxh3-128 + memoryview (~10x faster, deterministic since xxhash 0.8.0)
- CheckpointPublisher stamps weights_hash on every FULL publish (live, async-snapshot, anchor background)
- The anchor background path logs and ships without a hash on staging-load failure; synchronous paths raise
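The zero-copy part of the change above can be sketched as follows. grail uses xxh3-128 from the third-party xxhash package; hashlib.sha256 stands in here so the sketch has no external dependency — the memoryview-instead-of-tobytes() pattern is the part being illustrated, and the function name is ours:

```python
import hashlib

def hash_weights(named_buffers):
    """Hash a dict of name -> raw weight buffer deterministically.
    Feeding each buffer through memoryview() avoids the full per-tensor
    copy that tensor.tobytes() would make before hashing."""
    h = hashlib.sha256()  # grail: xxhash.xxh3_128() for ~10x throughput
    for name, buf in sorted(named_buffers.items()):  # deterministic order
        h.update(name.encode())
        h.update(memoryview(buf))  # zero-copy view into the buffer
    return h.hexdigest()
```

Sorting by parameter name is what makes the digest independent of dict insertion order, so the same weights always stamp the same weights_hash.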
- Dockerfile.miner (new): SGLang 0.5.10 on top of the validator base; two-pass install with a forced reinstall of grail-pinned transformers/torch
- docker-compose.validator.yml: GRAIL_VALIDATOR_IMAGE/TEST_MODE/GPU_IDS overrides, pid: host for cross-process partial-cleanup, --no-sync command
- Tiny new module replacing bare assert on the data path (python -O strips asserts)
- Carries the offending values in the message so failures are debuggable from logs alone
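The pattern above — a loud, -O-proof invariant that carries its own evidence — is a few lines. The `require` helper name is hypothetical; the exception name matches the one described in this PR:

```python
class ProtocolViolationError(RuntimeError):
    """Raised instead of a bare `assert` on the data path: it survives
    python -O and carries the offending values in its message."""


def require(condition, message, **values):
    """Sketch of the invariant check: on failure, embed the named values
    so a log line alone is enough to debug the violation."""
    if not condition:
        detail = ", ".join(f"{k}={v!r}" for k, v in sorted(values.items()))
        raise ProtocolViolationError(f"{message} ({detail})" if detail else message)
```

A call site like `require(len(ids) == n, "sequence length mismatch", got=len(ids), want=n)` then fails with both numbers in the message, unlike `assert`, which vanishes under -O.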
…CAP and bump caps
- The rename clarifies it as the immutable network cap, not a default; the trainer's max_tokens is the operating limit
- Bump UNIQUE_ROLLOUTS_CAP 1500 -> 5000 and GRAIL_BURN_PERCENTAGE 80% -> 90% to match basilica production policy
- Update episode.py, the rollout schema, the schema/termination validators, and two unit tests to the renamed symbol
…undary
- GenerationParams.from_checkpoint_metadata raises ProtocolViolationError on missing/malformed fields
- Caps max_tokens at MAX_NEW_TOKENS_PROTOCOL_CAP; normalises top_k=0/None so backends use the server default
- New integration test drives the trainer-payload -> validator chain end-to-end (regression for the 8192/2048 bug)
- Send prompt token IDs directly via httpx and read exact output_ids back; no text re-tokenization, no stop-token stripping
- Add max_model_len + a binary _cap_max_tokens clamp (return max_new or 0); overlong prompts return an empty completion instead of a partial one
- top_k/repetition_penalty pass through as top-level fields; a new unit test covers the decision boundary
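The binary clamp above is intentionally all-or-nothing. A sketch with illustrative names (the real method is private and takes the manager's own config):

```python
def cap_max_tokens(prompt_len, max_new, max_model_len):
    """Binary clamp: either the full requested generation budget fits in
    the context window, or we generate nothing at all. A partially
    clamped completion would look like a truncation the miner chose,
    which the validator's termination check would then reject."""
    return max_new if prompt_len + max_new <= max_model_len else 0
```

Returning 0 (empty completion) for an overlong prompt keeps the failure mode unambiguous on both sides of the protocol.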
- gpu_logprobs=True keeps logits on the GPU; only chosen-token logprobs (~16 KB/seq) cross PCIe
- Per-seq peak bounded by LOGPROB_CHUNK*vocab*4B (~312 MB); the CPU path is retained as an OOM fallback
- Replace bare asserts in compute_proofs and ProofWorker with ProtocolViolationError
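The chunked log-softmax + gather described above can be sketched in numpy (a stand-in for the torch GPU path; names are illustrative). The point is that only one float per position ever leaves the function, never the full [seq, vocab] matrix, and the peak temporary is bounded by the chunk size:

```python
import numpy as np

def chosen_token_logprobs(logits, token_ids, chunk=256):
    """Compute log-softmax over the vocab axis in chunks of `chunk`
    positions and gather only the chosen-token logprobs. Peak temporary
    memory is chunk * vocab floats instead of seq * vocab."""
    out = np.empty(len(token_ids), dtype=np.float64)
    for start in range(0, len(token_ids), chunk):
        block = logits[start:start + chunk].astype(np.float64)
        block -= block.max(axis=1, keepdims=True)      # stable log-softmax
        lp = block - np.log(np.exp(block).sum(axis=1, keepdims=True))
        ids = token_ids[start:start + chunk]
        out[start:start + chunk] = lp[np.arange(len(ids)), ids]
    return out
```

On the GPU the same structure means ~16 KB per sequence crosses PCIe instead of the full logit tensor, which is the 1.2 GB → ~16 KB reduction this PR reports.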
- Remove the legacy AgentEnvLoop path and the GRAIL_PIPELINE_ENABLED flag; the default backend flips vllm -> sglang
- Thread per-checkpoint gen_params from metadata to the backend; raise ProtocolViolationError at the trust boundary
- Validate the GPU layout at startup; the child SGLang/vLLM env inherits PATH; pipeline init failure exits to the supervisor
- get_default_generation_params returns max_tokens=2048 (was 8192) to match production policy
- The protocol cap MAX_NEW_TOKENS_PROTOCOL_CAP=8192 still applies; this is the per-checkpoint default
- Update test_dynamic_env_config.py assertions that hard-coded 8192
- .env.example + miner.md + miner-debugging.md: document mandatory pipeline mode and the sglang-default backend
- validator.md + incentive-mechanism.md: update burn (80% -> 90%) and rollout cap (2,500 -> 5,000) to match constants
- Reflect the MAX_NEW_TOKENS_PROTOCOL_CAP rename and the checkpoint-driven sampling policy throughout
…ting field
- CheckpointMetadata.thinking_mode required ("native"|"instructed"); validated in validate_metadata
- thinking.get_thinking_config default flips to "instructed"; document that the miner/validator override it at runtime
…eed bug
- Lift checkpoint env_id/env_params/generation_params/thinking_mode into a WindowEnvConfig resolved once per window
- MissingCheckpointMetadataError now skips the entire window (was: silent fallback per miner)
- Use the miner-published rollout_group int directly for seed derivation; the previous file-encounter order drifted on dropped groups
- The default fixture now includes thinking_mode="instructed" so existing happy-path tests stay green
- New tests cover missing/empty/invalid thinking_mode and the two valid values
…ing metadata
- Lift resolve_window_env_config into ValidationService.process_window so the abort fires BEFORE the availability/selection/inference counters advance
- Pass the resolved WindowEnvConfig down into WindowProcessor instead of refetching
- Bump validation/window_skipped_missing_metadata on the abort path
- Format miner_validator.py, service.py, window_processor.py
- Format tests/unit/validation/test_window_env_config.py
Lands the work from feat/distributed-training: a complete FSDP2 + DILOCO + PULSE-DiLoCo distributed trainer, a fully migrated SGLang pipelined reference miner (the legacy single-GPU AgentEnvLoop is removed from the codebase), the GRAIL proof v5 with order-invariant sketches, and the protocol/economics retune that pairs with them. 40 commits ahead of origin/main, ~150 files, calibrated against 545M proof positions across 6 models / 3 GPU types and 430k honest cross-GPU/cross-attn/cross-batch logprob trials. Note: miners are free to fork and run any architecture they want; the reference build is just what the team maintains and ships.
- grail/mining/engine.py is the only generation path in the shipped codebase: GRAIL_PIPELINE_ENABLED is deleted, the default backend flips vllm → sglang, the GPU layout is validated at startup, and pipeline init failure exits to the supervisor instead of silently dropping back to single-GPU mode. See .claude/optimization-runs/FINAL_ARCHITECTURE.md.
- Pipelined miner layout: SGLang generation (mem_fraction=0.88, max_running_requests=512, context length 4096); GPUs 4-7 run 4× HF proof subprocess workers that persist across windows (grail/mining/proof_worker.py); the wallet stays in the parent and signs returned commitment hashes; the engine loop pipelines the next iteration's generation with the previous iteration's proof collection via an asyncio.to_thread background submit. The bottleneck is now SGLang generation at 12.44 r/s; proof has ~70% slack, so the headline number scales with bigger SGLang boxes.
- log_softmax + gather on the proof GPU: per-sequence PCIe traffic drops 1.2 GB → ~16 KB (5× speedup per proof GPU, the change that unblocked multi-worker scaling). A numpy-packed ProofJob (sequences as int32[n_seqs, max_len]) gives ~50× faster pickling than list[list[int]].
- docker/Dockerfile.miner builds on the validator base via ARG GRAIL_BASE_IMAGE and runs a two-pass uv pip install (sglang[all]==0.5.10 first, then a forced reinstall of grail's transformers==4.57.1 / tokenizers==0.22.1 / torch==2.9.1) so SGLang's compiled bits stay loadable.
- The /generate endpoint is driven with input_ids directly via httpx: no BPE re-tokenization, no stop-token stripping, exact output_ids returned. A binary _cap_max_tokens clamp on overlong prompts ensures the validator's termination_valid check never sees a partially-clamped completion.
- tobytes() → xxh3-128 over a zero-copy memoryview (~10× faster on a 7B model, ~1-2 s on the publish path), which directly clawed back the per-window checkpoint blackout. Stamped on every FULL publish (live, async-snapshot, anchor-background); the anchor path degrades to no-hash on staging-load failure so the network keeps moving.
- grail/trainer/distributed/ module with four selectable strategies (FSDP2 default, DDP, DILOCO, PULSE-DiLoCo v6) dispatched via GRAIL_DIST_STRATEGY / GRAIL_DIST_NPROC. 2.18× speedup measured on 2× B200 / Qwen3-8B (192 s vs 419 s per epoch, 31.9K vs 14.6K tok/s, 34.79% vs 15.97% MFU, 109 GB vs 182 GB peak), confirmed at 250 s/epoch on real chain data. 75 new tests across the strategy + integration suites all pass on CPU without NCCL.
- PULSE-DiLoCo calls malloc_trim(0) after each sync to reclaim 10-50 GB of glibc heap, and uses a BF16-gated sparse all-gather with an FP32 CPU residual buffer plus a 4 GB memory guard that falls back to dense per-parameter all-reduce.
- Snapshot gating: snapshots/latest/ is only updated after a DILOCO outer sync, so non-consensus weights are never published. The SGLang inference server now hot-reloads from a checkpoint without a process restart.
- Proof v5: PROOF_TOPK 32 → 16, PROOF_SKETCH_TOLERANCE_BASE 30 → 6000, growth 3.0 → 5.0. Multi-model benchmark over 6 models (Qwen2.5/Qwen3/Llama-3.2, 0.6B to 8B) and 3 GPUs (B200, A100 PCIe, L40): 18 pairwise comparisons, 545M total positions, 27% headroom on the worst-case passing pair, ~10⁻¹⁶⁷ forgery probability at K=32. Validators accept v4+v5 and reject v1/v2. Proof constants live in grail/protocol/constants.py (immutable, no env override on the proof path).
- The validator/miner Dockerfile pulls the prebuilt cu12+torch2.9+cp312 flash-attn 2.8.3 wheel from Dao-AILab (SHA256-pinned, ~10 s install vs a 30-60 min source build) and runs after the final uv sync so it is not pruned.
- ProtocolViolationError(RuntimeError) replaces bare assert invariants on the data path so failures stay loud under python -O. Used by GenerationParams.from_checkpoint_metadata, the GPU layout startup validator, and the proof and proof-worker code paths.
- GenerationParams.from_checkpoint_metadata parses the trainer-published dict at the trust boundary and rejects missing/malformed max_tokens, temperature, top_p, top_k, repetition_penalty. The trainer's per-checkpoint default max_tokens is lowered 8192 → 2048 to match production policy.
- CheckpointMetadata adds a required thinking_mode field ("native" or "instructed"); validate_metadata() rejects checkpoints missing it. The validator overrides the process-wide GRAIL_THINKING_MODE from the resolved checkpoint so all downstream get_thinking_config() callers see the trainer-published value.
- WindowEnvConfig resolves the per-window env config (env_id, env_params, generation_params, thinking_mode) once per window in WindowProcessor.process_window; on MissingCheckpointMetadataError the entire window is skipped (was: a silent per-miner fallback that scored every miner against the wrong baseline).
- Seed-drift fix in _validate_rollouts: when a miner dropped a rollout group, the validator's reconstructed seeds drifted by one slot from the miner's, hard-failing every prompt-validation check for that miner. It now uses rollout_group_raw directly, with a hard fail on non-int values.
- MAX_NEW_TOKENS renamed to MAX_NEW_TOKENS_PROTOCOL_CAP to make explicit that this is the immutable network bound, not a default.
- Logprob identity check: dev_i = exp(|model_lp − miner_lp|) − 1; pass iff median(dev_i over K=32) ≤ 0.10. Calibrated on 430,650 honest trials across A100/B200/L40, batch sizes 1-16, sdpa+fa2, three model sizes: 0 false positives, max honest median dev = 0.066 (50% headroom). Fail-closes on length mismatch and any unresolved challenge position.
- RepetitionPenaltyLogitsProcessor + TemperatureLogitsWarper are replayed on cached logits before reading chosen-token probs, fixing systematic min_prob=0.0 failures on rollouts that used non-default sampling params (top_k/top_p deliberately not applied: hard masks suffer ~30-75% bf16 prefill-vs-decode drift; RFC-0018 tracks the proper fix).
- get_model() raises when device=None and CUDA is unavailable instead of silently saturating ~60 vCPUs and ~150 GB RAM (a real production incident); the validator CLI runs a startup GPU probe before any subtensor or R2 work; docker-compose.validator.yml adds NVIDIA_VISIBLE_DEVICES=all, NVIDIA_DRIVER_CAPABILITIES=compute,utility, shm_size: 2g, pid: host, GRAIL_VALIDATOR_GPU_IDS, and GRAIL_VALIDATOR_IMAGE.
- UNIQUE_ROLLOUTS_CAP 1,500 → 5,000 (period cap 60k over 12 windows), GRAIL_BURN_PERCENTAGE 80% → 90%. Miner share at 100% cap is unchanged, but underperforming miners now leak more emission to burn, and the new ceiling matches what the SGLang miner can actually produce.
- Trainer performance: lm_head computed in seq_len // 256 chunks for ~10 GB → ~312 MB peak memory on the logit/log-prob tensors (default-on via GRAIL_TRAINER_CHUNKED_LOGITS=1), opt-in FA4 attention for Blackwell, opt-in Liger kernels (RMSNorm/RoPE/SwiGLU), torch.compile on the base transformer with fixed-shape padding, token-budget micro-batching via greedy LPT bin-packing (<0.1% imbalance, with rank padding via dummy fwd/bwd to keep FSDP2 collectives aligned), MFU tracking, and pluggable advantage estimators (grpo, dr_grpo, dapo, with 12 unit tests).
- SnapshotManager.adopt_snapshot_atomic() writes metadata then renames, so snapshots are never partially written.
- grail/shared/constants.py (387 lines) deleted and split into grail/protocol/constants.py (immutable, zero os.getenv calls) plus grail/shared/config.py (deployment-tunable), enforcing the protocol-constants-are-immutable rule across the ~80 import sites that moved.
- CI: a -m serial job for global-singleton tests, filelocks around HF model downloads, a session-scope Qwen2.5-1.5B fixture with per-test deepcopy, dead test files removed.
- uv.lock regenerated, dropping the async-timeout / backports-asyncio-runner / exceptiongroup / tomli backports.
- The trainer logprob recompute moves from inference_mode() to no_grad() (autograd-compatible) and renames batch_size → micro_batch_size.
- Breaking changes: GRAIL_PIPELINE_ENABLED deleted, the grail/shared/constants.py import path gone, LogprobValidator metadata renamed (logprob_mismatches → logprob_median_dev + logprob_max_dev), checkpoints missing any of the 5 generation_params or thinking_mode rejected, get_model(device=None) raises when CUDA is unavailable, and the validator compose file requires an update.