
v2.1.19 #670

Merged

joellidin merged 15 commits into main from dev on Jan 5, 2026

Conversation

@joellidin (Collaborator) commented Jan 5, 2026

  • Networking optimization & Typo fix
  • Update net-tuning.sh
  • feat: implement Tensor Parallelism (TP) support
  • (hparams) Add anneal mode configuration
  • (neurons) Add anneal mode dataset support
  • (neurons) Add anneal mode scheduler support
  • (docs) Update dataset path to anneal
  • (dataset) Add anneal mode file naming support
  • (neurons) Revert debug dict sampling indices
  • (hparams) Update bootstrap version
  • Bump run version
  • (docker) Add DP/TP env vars for 4-GPU setup

Description

Related Issue(s)

  • Closes #[issue number]

Type of Change

  • Feature (adding new functionality)
  • Fix (resolving a bug or issue)
  • Docs (documentation updates)
  • Refactor (code changes that don't affect functionality)
  • Maintenance (dependency updates or other maintenance)
  • Tests (adding or improving tests)
  • Breaking change (fix or feature with incompatible API changes)
  • Other: _____

Branch Naming

  • My branch follows the project's naming convention (e.g., feature/add-new-capability)

Commit Messages

  • My commits are small, atomic, and have proper commit messages
  • Commit messages are in imperative mood with a capitalized summary under 50 chars

Code Quality

  • I've performed a self-review of my code
  • I've added appropriate docstrings following the project's conventions
  • I've added proper logging where necessary (without trailing periods)
  • I've applied linting and formatting with Ruff
  • My code generates no new warnings

Testing

  • I've added tests for new functionality or bug fixes
  • All tests pass locally with my changes
  • Test coverage has not decreased

Documentation

  • I've updated documentation to reflect my changes
  • I've updated comments in hard-to-understand areas

If this is a breaking change

Screenshots/Examples

Additional Notes

Summary by CodeRabbit

Release Notes

  • New Features

    • Added tensor parallelism support for distributed training
    • Introduced annealing mode for enhanced learning rate scheduling
    • Added network tuning utilities for high-throughput configuration
  • Improvements

    • Enhanced distributed training performance with optimized reduction operations and memory management
    • Improved gradient validation and reconstruction across distributed ranks
    • Updated dataset paths and structure
  • Configuration

    • New environment variables for distributed training control: TP_DEGREE, DP_SHARD
    • New annealing configuration parameters
  • Chores

    • Version bumped to 2.1.19


Xayaan and others added 15 commits December 14, 2025 15:03
Add comprehensive Tensor Parallelism implementation enabling large models
to be efficiently distributed across multiple GPUs.

Key Features:
- DTensor-based model parallelization with configurable TP_DEGREE
- Automatic gradient accumulation adjustment for TP topology
- Mixed TP + FSDP configuration support
- Parallel dimensions: (dp_replicate, dp_shard, tp, pp, cp)

Performance:
- 4x faster throughput (8,000 vs 2,000 tok/s)
- 3.2x faster window time (138s vs 444s)
- Equal learning quality to FSDP

Critical Fixes:
- DTensor collective deadlock prevention in neurons.py
- Validator distributed barrier for multi-rank TP setups
- Miner TP ownership calculation (dp_rank = rank // tp_degree)
- Gradient statistics logging with proper error handling

Implementation:
- model_factory.py: DTensor parallelization and config validation
- sharded_sampler.py: TP-aware gradient accumulation calculation
- neurons.py: DTensor deadlock fixes and sync improvements
- hparams.py: TP_DEGREE/DP_SHARD environment variable support
- distributed.py: Enhanced distributed communication helpers
- miner.py/validator.py/trainer.py: TP ownership and barrier logic

Configuration:
- Simple env var config: TP_DEGREE=2 CUDA_VISIBLE_DEVICES=0,1

Testing:
- Validated with 2B model on 3 miners (2 TP, 1 FSDP)
- No regressions in existing FSDP training
- All gradient aggregation tests passing
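The ownership rule above (dp_rank = rank // tp_degree) can be sketched in a few lines. The helper name and bare-int interface are illustrative, not the miner's actual API; the fallback behavior for non-tiling layouts is an assumption based on the fixes described in this PR:

```python
def tp_ownership(rank: int, world_size: int, tp_degree: int) -> tuple[int, int]:
    """Map a global rank to its data-parallel (DP) coordinates.

    All ranks within one TP group share the same dp_rank, so they own the
    same slice of parameters, while different DP groups own disjoint slices.
    Falls back to pure DP when the TP layout does not tile world_size.
    """
    if tp_degree <= 1 or world_size % tp_degree != 0:
        return rank, world_size  # TP effectively disabled
    return rank // tp_degree, world_size // tp_degree

# Example: 4 GPUs with TP_DEGREE=2 gives two DP groups of two ranks each.
assert tp_ownership(0, 4, 2) == (0, 2)
assert tp_ownership(1, 4, 2) == (0, 2)  # same TP group as rank 0
assert tp_ownership(2, 4, 2) == (1, 2)
assert tp_ownership(3, 4, 2) == (1, 2)
```

Because every rank in a TP group maps to the same dp_rank, gradient gathering can treat a whole TP group as one logical data-parallel worker.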
Add new anneal_mode configuration block to support learning rate
annealing with cosine warmup and decay schedules.

- Enable annealing starting at global step 6100
- Configure 100 warmup steps and 550 decay steps
- Set peak LR at 0.5x and minimum at 0.0x
- Use "anneal" file prefix for checkpoint naming
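Pulling the numbers above together, the anneal_mode block presumably looks something like the following Python-dict rendering. The exact key names in hparams.json may differ; those shown are taken from the parameter names used elsewhere in this PR (warmup_inner_steps, decay_outer_steps, peak_lr_factor, eta_min_factor, start_global_step, file_prefix):

```python
# Hypothetical shape of the anneal_mode hparams block; key names are
# inferred from this PR's commit messages, not copied from hparams.json.
anneal_mode = {
    "enabled": True,
    "start_global_step": 6100,  # annealing begins at this global step
    "warmup_inner_steps": 100,  # linear warmup length
    "decay_outer_steps": 550,   # linear decay length
    "peak_lr_factor": 0.5,      # peak LR as a fraction of base LR
    "eta_min_factor": 0.0,      # floor LR as a fraction of base LR
    "file_prefix": "anneal",    # checkpoint/shard file naming
}
```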
Implement anneal mode configuration in miner and validator to enable
single-shard training without automatic dataset cycling. When anneal
mode is enabled, nodes use the anneal file prefix and skip all shard
swapping logic.

- Read anneal_mode config from hparams (enabled flag and file_prefix)
- Pass file_prefix and anneal_mode to ShardedDatasetManager
- Skip shard switching logic when anneal mode is enabled
- Update ShardedDatasetManager to support anneal mode behavior
- Lock to single shard with no cycling in anneal mode
- Add logging to indicate anneal mode activation
Implement alternative learning rate scheduling for anneal mode that uses
linear warmup to a peak factor followed by linear decay to a minimum,
replacing the standard warmup + cosine annealing pattern.

- Add anneal mode detection in scheduler initialization
- Implement linear warmup to peak_lr_factor over warmup_inner_steps
- Add linear decay over decay_outer_steps to eta_min_factor
- Skip manual warmup scaling when anneal mode is enabled
- Add checkpoint recovery logic for anneal mode scheduler replay
- Calculate anneal progress from start_global_step, not from zero
- Replay only anneal scheduler steps during checkpoint recovery
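The warmup-then-decay shape described above can be sketched as a pure function of the anneal-relative step. This is an illustrative reconstruction of the schedule, not the scheduler code from the PR:

```python
def anneal_lr_factor(step: int, warmup_inner_steps: int, decay_outer_steps: int,
                     peak_lr_factor: float, eta_min_factor: float) -> float:
    """LR multiplier for anneal mode: linear warmup to the peak factor,
    then linear decay to the minimum factor. `step` counts from the anneal
    start (start_global_step), not from zero, matching the recovery logic."""
    if step < warmup_inner_steps:
        return peak_lr_factor * (step + 1) / warmup_inner_steps
    decay_step = min(step - warmup_inner_steps, decay_outer_steps)
    frac = 1.0 - decay_step / decay_outer_steps
    return eta_min_factor + (peak_lr_factor - eta_min_factor) * frac

# With this PR's values (100 warmup steps, 550 decay steps, peak 0.5, min 0.0):
assert anneal_lr_factor(99, 100, 550, 0.5, 0.0) == 0.5       # end of warmup
assert anneal_lr_factor(100 + 550, 100, 550, 0.5, 0.0) == 0.0  # fully decayed
```

Checkpoint recovery then only needs to replay anneal-relative steps through this function rather than the full training history.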
Update documentation to reflect the new anneal dataset format and
correct file naming conventions across miner, validator, and shared
dataset documentation.

- Change DATASET_BINS_PATH from "tokenized/" to "anneal/"
- Update rclone commands to use anneal/ directory paths
- Fix shard file naming (anneal_000000.npy,
  sample_ids_anneal_000000.npy)
- Add DATASET_BINS_PATH environment variable to validator examples
- Update partial migration example to use single shard for simplicity
Update SharedShardedDataset.locate_shards to handle anneal mode file
naming convention, using sample_ids_anneal_ prefix when
file_prefix="anneal". Update download script to detect anneal mode from
output path and pass correct file_prefix parameter.

- Add anneal mode detection in download_dataset.py based on path
- Pass file_prefix parameter to locate_shards for proper naming
- Handle sample_ids_anneal_ prefix in locate_shards function
- Change default path from "remote/tokenized/" to "anneal/"

This fixes the download script to correctly locate anneal dataset files
named anneal_000000.npy and sample_ids_anneal_000000.npy.
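The naming convention above can be captured in a small helper; this is an illustrative sketch of the scheme described in the commit message, not the actual locate_shards or download_dataset.py code:

```python
def anneal_file_names(shard_index: int, file_prefix: str) -> tuple[str, str]:
    """Shard and sample-ids file names for a given prefix.

    With file_prefix="anneal" this yields anneal_000000.npy and
    sample_ids_anneal_000000.npy, matching the convention in this PR.
    """
    shard = f"{file_prefix}_{shard_index:06d}.npy"
    return shard, f"sample_ids_{shard}"

def detect_anneal_mode(output_path: str) -> bool:
    # Heuristic per the commit message: the download script infers anneal
    # mode when the output path contains an anneal/ directory component.
    return "anneal" in output_path.strip("/").split("/")

assert anneal_file_names(0, "anneal") == (
    "anneal_000000.npy", "sample_ids_anneal_000000.npy"
)
assert detect_anneal_mode("anneal/") is True
```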
Restore debug dict sampling indices to match main branch behavior before
the TP PR changes. The TP PR changed sampling from fixed ranges to
[-2:], which caused checkpoint verification L2 norm warnings (0.5030 >
0.1 threshold) due to index mismatches between aggregator-saved values
and local node comparisons.

- Revert miner debug dict sampling from [-2:] to [10:12]
- Revert validator debug dict sampling from [-2:] to [:2]
- Revert param state tracking from [-2:] to slice(10, 12)
- Add index_range=(10, 12) to sync score comparison call
- Add index_range parameter to compare_model_with_debug_dict function
  with default (0, 2)

This ensures consistent sampling indices across the system and
eliminates checkpoint verification warnings caused by mismatched
comparison ranges.
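Why matching indices matter can be seen in a toy version of the sampling helper. Plain lists stand in for flattened tensors here; the real compare_model_with_debug_dict operates on model parameters:

```python
def sample_debug_values(flat_values, index_range=(0, 2)):
    """Sample a fixed slice of a flattened parameter for the debug dict.

    Sketch of the fixed-index sampling this commit restores: aggregator and
    local nodes must slice the same indices, otherwise the L2-norm check in
    compare_model_with_debug_dict raises false warnings.
    """
    start, end = index_range
    return list(flat_values[start:end])

flat = list(range(20))
assert sample_debug_values(flat, (10, 12)) == [10, 11]  # miner-side slice
assert sample_debug_values(flat) == [0, 1]              # validator default (0, 2)
```

If one side samples [-2:] while the other compares against (0, 2), the L2 norm of the difference reflects unrelated elements, which is exactly the 0.5030 > 0.1 warning described above.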
Add DP_SHARD and TP_DEGREE environment variables to compose.yml to
override hparams.json defaults for validator configuration.

- DP_SHARD defaults to 4 (4-way data parallelism)
- TP_DEGREE defaults to 1 (no tensor parallelism)
- Enables flexible parallelism strategy without modifying hparams
@coderabbitai bot commented Jan 5, 2026

Walkthrough

This change introduces tensor parallelism (TP) support across the training pipeline, implements anneal-mode dataset management and scheduler behavior, updates hyperparameters with new configuration fields, and adds supporting infrastructure for distributed training and script utilities. Multiple core components are modified to handle TP-aware operations, gradient reconstruction, and anneal-driven behavior.

Changes

Cohort / File(s): Summary

  • Configuration & Environment (.gitignore, docker/compose.yml): Added ignore patterns for validator state, tokenized data, and checkpoints. Added DP_SHARD and TP_DEGREE environment variables to the Docker compose configuration.
  • Hyperparameters & Version (hparams/hparams.json, src/tplr/__init__.py): Added new anneal_mode configuration block with warmup/decay parameters; updated checkpoint initialization version to 2.1.18 and window to 62834. Bumped package version from 2.1.18 to 2.1.19.
  • Documentation (docs/miner.md, docs/shared_sharded_dataset.md, docs/validator.md): Updated all dataset path references from tokenized/ to anneal/, including environment variable examples, migration commands, and shard paths.
  • Core Neurons (neurons/evaluator.py, neurons/miner.py, neurons/trainer.py, neurons/validator.py): Introduced TP degree support with distributed reduction scaling; implemented anneal-mode dataset and scheduler handling; added gradient reconstruction for TP shards; implemented per-parameter error feedback with TP-aware logic; extended gradient merging for FSDP/TP hybrid modes; added timing instrumentation for forward/backward passes; introduced memory and configuration logging.
  • Scripts (scripts/abort_multipart_uploads.py, scripts/cleanup_bucket.py, scripts/download_dataset.py, scripts/net-tuning.sh): Added an async multipart upload abort utility; expanded object cleanup to include checkpoint, gradient, debug, aggregator, and peers files; updated dataset download to support the anneal file prefix; added a network tuning configuration script.
  • Core Library: Distributed & Compression (src/tplr/distributed.py, src/tplr/compress.py, src/tplr/comms.py): Added batched_all_reduce for efficient grouped reductions; introduced shape caching and dynamic basis matrix precomputation for new shard sizes in compression; added DTensor-safe serialization and TMPDIR-respecting temporary directory handling.
  • Core Library: Hyperparameters & Model (src/tplr/hparams.py, src/tplr/model_factory.py): Implemented deep_merge for recursive dictionary merging of layered configs; added centralized parallel configuration validation with environment-driven TP/DP overrides and auto-calculation of dp_shard.
  • Core Library: Dataset & Sampling (src/tplr/sharded_dataset.py, src/tplr/sharded_sampler.py): Added an anneal_mode parameter to disable shard switching; introduced file_prefix for sample_ids path customization; added TP degree support with DP rank/world-size computation for data sharding; updated sampler constructors to propagate tp_degree.
  • Core Library: Neurons & Chain (src/tplr/neurons.py, src/tplr/chain.py): Implemented DTensor shard metadata extraction and gradient gathering; added cross-TP shape compatibility via metadata propagation; enhanced gradient decompression with param_xshape/param_totalk from metadata; added anneal-mode scheduler replay; improved debug-dict broadcast and validation. Changed chain fallback behavior to use gather peers when active peers are unavailable.
  • Tests (tests/test_comms.py, tests/test_evaluator.py, tests/test_model_comparison.py, tests/test_prepare_gradient_dict.py, tests/unit/test_model_factory.py, tests/unit/test_neurons.py): Added a skip marker for the cross-device test; set default tp_degree=1 in evaluator mocks; refactored debug sampling to use a last-2-element strategy instead of index_range; updated metadata assertions to validate structural keys; aligned test expectations with new validation messages; removed the check_uid_index_overlap export and the index_range parameter from compare_model_with_debug_dict.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • v2.1.18 #662: Shares dataset/shard and scheduler code paths (sharded_dataset locate_shards, shard state logic, hparams scheduler settings, neurons shard management).
  • feat: Implement Tensor Parallelism (TP) support #621: Implements matching tensor parallelism support across neurons/miner/trainer/validator, sharded_sampler, model_factory, distributed reductions.
  • feat/anneal #669: Introduces identical anneal_mode features and matching changes to hparams, ShardedDatasetManager, sample_ids handling, and neuron updates.

Suggested reviewers

  • shivam-MBZUAI
  • amiiir-sarfi

Poem

🐰 A hop, skip, and tensor-parallel jump,
With anneal modes and shards in a lump,
The gradients dance, the samplers align,
And distributed dreams? Now they truly shine! 🌟
Convergence awaits this training ascend!

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
  • Description check: ⚠️ Warning. The PR description provides a bullet-point summary of changes and objectives but does not follow the repository's template structure; required sections (Description, Related Issues, Type of Change, Branch Naming, Commit Messages, Code Quality, Testing, Documentation) are missing, incomplete, or left unchecked. Resolution: fill out the Description, Type of Change, Code Quality, Testing, and Documentation sections to give reviewers structured context.
  • Docstring coverage: ⚠️ Warning. Docstring coverage is 60.26%, below the required 80.00% threshold. Run @coderabbitai generate docstrings to improve coverage.
  • Title check: ❓ Inconclusive. The title "v2.1.19" is a version-bump label that does not convey the main technical changes (Tensor Parallelism support, anneal mode, networking optimization) in this substantial pull request. Resolution: use a more descriptive title, e.g. "Add Tensor Parallelism and anneal mode support for v2.1.19", so reviewers can quickly grasp the scope.

@codecov bot commented Jan 5, 2026

Codecov Report

❌ Patch coverage is 42.95082% with 174 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/tplr/neurons.py 34.71% 79 Missing ⚠️
src/tplr/model_factory.py 66.66% 29 Missing ⚠️
src/tplr/distributed.py 4.76% 20 Missing ⚠️
src/tplr/compress.py 19.04% 17 Missing ⚠️
src/tplr/sharded_dataset.py 0.00% 12 Missing ⚠️
src/tplr/sharded_sampler.py 64.70% 6 Missing ⚠️
src/tplr/chain.py 0.00% 5 Missing ⚠️
src/tplr/comms.py 55.55% 4 Missing ⚠️
src/tplr/hparams.py 81.81% 2 Missing ⚠️

❌ Your patch status has failed because the patch coverage (42.95%) is below the target coverage (85.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project status has failed because the head coverage (56.45%) is below the target coverage (85.00%). You can increase the head coverage or adjust the target coverage.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #670      +/-   ##
==========================================
- Coverage   57.99%   56.45%   -1.54%     
==========================================
  Files          27       27              
  Lines        4930     5165     +235     
==========================================
+ Hits         2859     2916      +57     
- Misses       2071     2249     +178     
Files with missing lines Coverage Δ
src/tplr/__init__.py 100.00% <100.00%> (ø)
src/tplr/hparams.py 62.19% <81.81%> (+1.66%) ⬆️
src/tplr/comms.py 63.17% <55.55%> (-1.92%) ⬇️
src/tplr/chain.py 35.36% <0.00%> (-0.66%) ⬇️
src/tplr/sharded_sampler.py 86.58% <64.70%> (-5.96%) ⬇️
src/tplr/sharded_dataset.py 22.43% <0.00%> (-1.71%) ⬇️
src/tplr/compress.py 77.56% <19.04%> (-3.40%) ⬇️
src/tplr/distributed.py 16.52% <4.76%> (-1.12%) ⬇️
src/tplr/model_factory.py 59.86% <66.66%> (-1.23%) ⬇️
src/tplr/neurons.py 70.80% <34.71%> (-7.12%) ⬇️

@coderabbitai bot left a comment

Actionable comments posted: 11

Fix all issues with AI Agents 🤖
In @neurons/miner.py:
- Around line 695-737: When assembling gradients, normalize the local tp_degree and guard the TP branches to avoid dropping shards. Compute tp_degree = self.tp_degree if self.tp_degree > 1 else 1; if tp_degree > self.world_size or self.world_size % tp_degree != 0, fall back to tp_degree = 1 (emitting a warning via self.logger or processLogger) so the code takes the safe FSDP merge. Ensure the branches that use _reconstruct_tp_gradients(gathered) and the hybrid path that computes num_fsdp_groups = self.world_size // tp_degree treat num_fsdp_groups == 0 as a fallback to the pure FSDP merge, freeing processed gathered entries only for valid slices rather than silently ignoring remainder ranks.

In @neurons/trainer.py:
- Around line 55-62: The TP degree parsing can return None or raise on invalid env values. Update the block that sets self.tp_degree (where tt = getattr(self.hparams, "torchtitan", None) and base_tp = getattr(tt, "tp_degree", None)) to mirror Validator.__init__: default base_tp to 1 if missing or None; read env_val = os.environ.get("TP_DEGREE"); if env_val is not None, parse int(env_val) in a try/except and fall back to base_tp on failure; then coerce with max(1, parsed_value). Apply the same robust parsing to the other occurrence around lines 72-73 so both Trainer tp_degree assignments use the same safe logic.

In @scripts/abort_multipart_uploads.py:
- Around line 75-78: The abort_multipart_upload call is passing the wrong value
for Bucket (using account_id); change Bucket=account_id to use the actual S3
bucket variable used elsewhere in the script (e.g., Bucket=bucket or
Bucket=bucket_name) so
client.abort_multipart_upload(Bucket=<actual_bucket_variable>, Key=key,
UploadId=upload_id) supplies the real bucket name rather than the account_id.
- Around line 57-59: The script is using account_id as the S3 bucket name; validate and use the R2_GRADIENTS_BUCKET_NAME environment variable instead: add retrieval and validation of R2_GRADIENTS_BUCKET_NAME at script startup, error out if missing, then replace all occurrences of Bucket=account_id
startup, error out if missing, then replace all occurrences of Bucket=account_id
(e.g., in client.list_multipart_uploads and the subsequent abort/delete calls
around the same block) with Bucket=bucket_name (or whatever local variable you
create from R2_GRADIENTS_BUCKET_NAME) so bucket operations target the actual
gradients bucket rather than the account ID.
- Around line 20-22: The script incorrectly appends its own directory to
sys.path when trying to import tplr; change it to add the repository root (the
parent of the scripts directory) instead and use sys.path.insert(0, ...) so the
project package is resolved first. Replace the current
sys.path.append(os.path.dirname(os.path.abspath(__file__))) logic used before
the import tplr with code that computes the repo root (the parent directory of
the scripts folder) and inserts that root at the front of sys.path, and update
the comment to say "Add repository root to path to import tplr" to match other
scripts.

In @scripts/net-tuning.sh:
- Line 30: Update the comment string "# === CONNECTION BEHAVIOUR ===" to
American spelling by replacing "BEHAVIOUR" with "BEHAVIOR" (i.e. change it to "#
=== CONNECTION BEHAVIOR ==="); also scan the script for any other occurrences of
"BEHAVIOUR" and normalize them to "BEHAVIOR" to keep spelling consistent across
the codebase.
- Line 33: The script sets "net.ipv4.tcp_congestion_control = bbr"
unconditionally which will fail on kernels lacking BBR; either document the
kernel 4.9+ and tcp_bbr requirement near the top of the script (next to the "set
-euo pipefail" line) or make the BBR setting conditional: check for kernel
support (e.g., existence/readability of
/proc/sys/net/ipv4/tcp_congestion_control or tcp_bbr in /sys/modules) before
applying the sysctl, or if you prefer to keep it non-fatal suppress sysctl
errors for this call by running the system-wide apply with error suppression
(e.g., sysctl --system 2>/dev/null || true) so the script won't exit on "Invalid
argument".

In @src/tplr/neurons.py:
- Around line 393-401: outer_step currently uses rank-specific continue paths
after _bcast_flag(has_update) which causes collective divergence; replace those
continues with a rank-aligned skip flag. Specifically, where you check on_src
and compare src_tensor.shape != p.shape (and similarly full_grad_src.shape !=
p.shape for replicated params), do not continue on the source only; instead set
a local skip_param = 1 when the source detects a mismatch/unreconstructable
tensor (leave skip_param = 0 otherwise), then call skip_param =
_bcast_flag(skip_param) so all ranks receive the same decision; if skip_param is
true then (a) on_src free any full_grad_src and/or src_tensor as needed and (b)
continue — otherwise proceed to call distribute_tensor() / dist.broadcast() and
the other collectives in lockstep. Ensure this pattern is applied around the
checks that currently reference src_tensor, full_grad_src, p,
distribute_tensor(), and dist.broadcast() so all ranks execute the same
collectives.
- Around line 34-41: Reverse the DTensor import order so the public API is tried
first: attempt to import DTensor from torch.distributed.tensor and assign it to
_DTENSOR_TYPE, and only on ImportError fall back to the legacy/private
torch.distributed._tensor.DTensor; update the try/except around the DTensor
import (affecting the _DTENSOR_TYPE binding) so isinstance checks use the public
DTensor when available.
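The rank-aligned skip pattern requested in the first neurons.py comment can be sketched as follows. _bcast_flag is stubbed for a single process here; in the real code it would wrap a dist.broadcast of the flag from the source rank, and "broadcast" stands for the distribute_tensor()/dist.broadcast() collectives:

```python
def outer_step_param(p_shape, src_shape, on_src, bcast_flag):
    """One parameter's slice of outer_step, with the rank-aligned skip.

    Only the source rank can see whether the reconstructed tensor is
    usable, so its decision is broadcast and every rank takes the same
    branch, keeping subsequent collectives in lockstep.
    """
    skip_param = 0
    if on_src and src_shape != p_shape:
        skip_param = 1               # source detected a shape mismatch
    skip_param = bcast_flag(skip_param)  # all ranks agree on the decision
    if skip_param:
        return "skipped"             # every rank skips together
    return "broadcast"               # every rank enters the collectives

# Single-process stand-in for broadcasting a flag:
identity = lambda flag: flag
assert outer_step_param((4, 4), (2, 4), on_src=True, bcast_flag=identity) == "skipped"
assert outer_step_param((4, 4), (4, 4), on_src=True, bcast_flag=identity) == "broadcast"
```

The original code's rank-local `continue` meant the source rank left the loop iteration while other ranks still entered dist.broadcast(), which is the collective-divergence deadlock the comment describes.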

In @tests/test_model_comparison.py:
- Line 160: test_custom_index_range builds debug_dict using param_flat[-2:] but
calls compare_model_with_debug_dict without index_range, so the default (0,2)
compares first 2 elements to the last 2; fix by either passing
index_range=(-2,0) to compare_model_with_debug_dict in test_custom_index_range
so it samples the last two elements as intended (reference:
compare_model_with_debug_dict call and param_flat[-2:]), or change the test
fixture to use non-uniform parameter values so that comparing first vs last
elements would fail and reveal the bug.
- Around line 197-228: The test is unreliable because setup_model uses uniform
parameter values so param_flat[:2] == param_flat[-2:], and it also builds
debug_dict from param_flat[-2:] while compare_model_with_debug_dict defaults to
index_range=(0, 2), causing a false pass; fix by either (A) making setup_model
initialize non-uniform parameters so first vs last elements differ (e.g., vary
values across tensor) to validate index selection, or (B) align the test slices
by changing debug_dict to use param_flat[:2] or by calling
compare_model_with_debug_dict(model, debug_dict, learning_rate,
index_range=(len-2, len)) /explicitly passing an index_range matching the
debug_dict; update the creation of debug_dict, mismatched_debug_dict, or the
compare_model_with_debug_dict call accordingly so the sampled indices actually
match or intentionally mismatch.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
scripts/cleanup_bucket.py (1)

33-33: Update the docstring to reflect expanded scope.

The docstring mentions only 'checkpoint', 'gradient', and 'start_window', but the implementation now filters six prefixes including 'aggregator', 'debug', and 'peers'.

🔎 Proposed fix
-    """Delete objects in the R2 bucket that start with 'checkpoint', 'gradient', or 'start_window'"""
+    """Delete objects in the R2 bucket that start with 'checkpoint', 'gradient', 'start_window', 'aggregator', 'debug', or 'peers'"""
src/tplr/sharded_sampler.py (1)

47-88: Harden tp_degree env parsing to match Miner and avoid ValueError on bad TP_DEGREE

The TP wiring here looks sound (effective DP calculation and DP-only sharding), but the env handling is brittle:

  • tp_degree = int(os.environ.get("TP_DEGREE", 1)) will raise immediately if TP_DEGREE is set to a non‑integer string.
  • In neurons/miner.Miner.__init__ you already do a safer parse (try/except, clamp to ≥1, log and fall back to hparams).

To keep behavior consistent and avoid surprising crashes from mis-set env vars, consider mirroring the Miner logic here, e.g.:

Suggested robust parsing for tp_degree
-        # With TP, only count DP dimension for effective batch size
-        # If tp_degree not provided, fall back to environment variable
-        if tp_degree is None:
-            tp_degree = int(os.environ.get("TP_DEGREE", 1))
-        self.tp_degree = tp_degree
+        # With TP, only count DP dimension for effective batch size
+        # If tp_degree not provided, fall back to environment variable / hparams-style default
+        if tp_degree is None:
+            raw_tp = os.environ.get("TP_DEGREE")
+            if raw_tp is not None:
+                try:
+                    tp_degree = max(1, int(raw_tp))
+                except (ValueError, TypeError):
+                    tplr.logger.warning(
+                        f"[TP Sampler] Invalid TP_DEGREE='{raw_tp}', falling back to TP=1"
+                    )
+                    tp_degree = 1
+            else:
+                tp_degree = 1
+        self.tp_degree = tp_degree

The rest of the TP sharding logic in set_window_uid (DP rank computation and fallback with a warning when world_size % tp_degree != 0) looks correct and aligns with the Miner’s DP/TP split.

src/tplr/neurons.py (1)

118-129: DTensor does not expose .device; use to_local().device for robust device access

The DTensor/TP path in prepare_gradient_dict is well thought out:

  • You avoid full_tensor() by working on g.to_local() shards, which is key to staying deadlock-safe with DTensor.
  • shard_metadata captures exactly what the miner needs to later reconstruct TP parameters (is_tp_sharded, local_shape, global_shape, shard_dim, tp_rank, tp_world_size).
  • Gradient compression uses the local shard for TP params but still writes full-parameter xshapes/totalks into gradient["metadata"], which the miner then overrides for reconstructed TP params when needed.

However, there is a device attribute issue: PyTorch 2.7.x's DTensor does not expose a .device property. Update the following locations to use to_local().device:

  1. Line 65: grad_local = g.to_local().to(p.device) → use p.to_local().device
  2. Line 88: grad_to_compress = g.to(p.device) → use p.to_local().device
  3. Error-feedback preload loop: miner.error_feedback[n].to(param.device, ...) → use param.to_local().device

Also applies to: 135-205, 271-304

neurons/trainer.py (1)

3604-3607: DTensor gradient update path has a double-del bug and should skip non-finite grads safely

In update_model_with_gradient, the DTensor branch has a real bug:

  • Inside the if not torch.isfinite(local_view).all(): block you del new_grad, local_view and then later, unconditionally, you do del new_grad, local_view again.
  • On the non‑finite path this second del will raise an UnboundLocalError / NameError and break evaluation.

Also, the comment says “Skip this parameter update but don't break the loop”, but there’s no continue to skip the later unconditional del.

A minimal fix is:

  • Keep the early del ... + torch.cuda.empty_cache() on the non‑finite path.
  • Immediately continue to the next parameter so you don’t touch new_grad again.
  • Only del new_grad, local_view in the finite path.
Suggested fix for DTensor gradient handling
             local_view = new_grad.to_local()
             if not torch.isfinite(local_view).all():
-                    del new_grad, local_view
-                    torch.cuda.empty_cache()
-                    # Don't continue here - let the gradient be zero or handle it properly
-                    # This prevents rank-specific skipping which causes deadlocks
-                    tplr.log_with_context(
-                        level="warning",
-                        message=f"Non-finite gradient detected for {n}, setting to zero",
-                        sync_window=self.sync_window,
-                        current_window=self.current_window,
-                        eval_uid=eval_uid,
-                    )
-                    # Skip this parameter update but don't break the loop
-                else:
-                    # Apply update directly to data
-                    p.data.sub_(
-                        new_grad,
-                        alpha=self.lr * self.hparams.eval_lr_factor,
-                    )
-                del new_grad, local_view
+                del new_grad, local_view
+                torch.cuda.empty_cache()
+                tplr.log_with_context(
+                    level="warning",
+                    message=f"Non-finite gradient detected for {n}, skipping update for this parameter",
+                    sync_window=self.sync_window,
+                    current_window=self.current_window,
+                    eval_uid=eval_uid,
+                )
+                # Skip this parameter update but continue evaluating others
+                continue
+            else:
+                # Apply update directly to data
+                p.data.sub_(
+                    new_grad,
+                    alpha=self.lr * self.hparams.eval_lr_factor,
+                )
+            del new_grad, local_view

The earlier changes to use validator‑side xshapes/totalks and ref = torch.empty(param_xshape, ...) look correct and will work as intended once this control‑flow bug is fixed.

Also applies to: 3658-3665, 3671-3673, 3725-3777

🧹 Nitpick comments (19)
scripts/cleanup_bucket.py (1)

21-23: Remove duplicate import.

dotenv is already imported at line 9, making this import redundant.

🔎 Proposed fix
 import sys
 import asyncio
-from dotenv import load_dotenv
 from aiobotocore.session import get_session
 from tplr import logger
src/tplr/chain.py (1)

394-402: Consider using a distinct variable name for fallback peers to improve clarity.

The fallback to self.peers enables evaluation when no active peers are discovered, which aligns with the PR's distributed training enhancements. However, the semantic shift of the active_peers variable warrants attention:

  • Variable reassignment: active_peers is reassigned from "currently active peers" (line 380, used for inactive tracking at line 386) to "fallback gather peers" (line 399) when no active peers exist. While this logic is intentional and correct, the semantic shift could confuse future readers. Consider using a distinct variable name like peers_for_eval to clarify intent.

Type safety and validation concerns are not issues here: self.peers is explicitly typed as list[int] (comms.py:142), so the int(uid) conversion is safe, and peer validation against uid_to_stake is already performed at line 411 during eval peer filtering.

tests/test_model_comparison.py (1)

218-220: Potential logic issue: conditional check without else branch.

The code checks if param_flat.numel() >= 2: at line 219 but then unconditionally includes the parameter name in mismatched_debug_dict at line 220. If any parameter has fewer than 2 elements, this creates an inconsistent debug dict where some parameters that passed the condition are included with [0.999, 0.999], but the logic doesn't explicitly handle parameters that fail the condition.

While this is unlikely to cause issues in practice (standard Linear layers have many parameters), consider making the intent explicit:

🔎 Suggested refactor for clarity
     mismatched_debug_dict = {}
     for name, param in model.named_parameters():
         param_flat = param.flatten()
         # Create mismatched values (different from actual last 2 elements)
         if param_flat.numel() >= 2:
             mismatched_debug_dict[name + "_debug"] = [0.999, 0.999]
+        # else: skip parameters with fewer than 2 elements
neurons/miner.py (2)

252-314: TP-aware ownership and error-feedback buffers look correct; consider logging invalid TP layouts

The DP/TP ownership logic and DTensor-aware buffers look good:

  • dp_rank = rank // tp_degree, dp_world_size = world_size // tp_degree ensures all ranks in a TP group own the same parameter set, and each DP group owns a disjoint slice of parameters.
  • Using DT.to_local() and allocating CPU error-feedback buffers based on the local shard shape avoids full-parameter allocations and matches the later offload path in tplr.neurons.prepare_gradient_dict.

One small improvement: when self.tp_degree > 1 but world_size < self.tp_degree or world_size % self.tp_degree != 0, you silently fall back to dp_rank = rank, dp_world_size = world_size. For debugging misconfigured runs it would help to log a warning similar to the sampler’s TP warning so operators see that TP was effectively disabled for ownership.
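🔎 Illustrative sketch of the rank mapping and the suggested warning. `resolve_dp_layout` is a hypothetical helper, not an existing function; it mirrors the fallback described above.

```python
import logging

logger = logging.getLogger(__name__)

def resolve_dp_layout(rank: int, world_size: int, tp_degree: int) -> tuple[int, int]:
    """Map a global rank to (dp_rank, dp_world_size) under tensor parallelism."""
    if tp_degree > 1 and world_size >= tp_degree and world_size % tp_degree == 0:
        # All ranks in the same TP group share a dp_rank, so they own
        # the same disjoint slice of parameters.
        return rank // tp_degree, world_size // tp_degree
    if tp_degree > 1:
        # Suggested addition: surface the otherwise-silent fallback.
        logger.warning(
            "TP degree %d incompatible with world size %d; "
            "falling back to pure DP ownership", tp_degree, world_size,
        )
    return rank, world_size
```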


402-417: Anneal mode dataset behavior is coherent; guard against non-mapping anneal_mode configs

The anneal-mode wiring is consistent:

  • anneal_config = getattr(self.hparams, "anneal_mode", {}) drives file_prefix and anneal_mode into ShardedDatasetManager, so anneal_ datasets are used only when enabled.
  • In run(), you pin current_shard and current_shard_epoch to 0 and completely bypass shard swapping when anneal_mode is enabled, which matches the “single-shard anneal” behavior described in the PR.

One thing to watch: this assumes hparams.anneal_mode is a dict-like object with .get. If it ever becomes a SimpleNamespace or other structured type, this will raise. If that’s a realistic evolution of your config stack, it’s safer to normalize:

anneal_cfg_obj = getattr(self.hparams, "anneal_mode", {})
if isinstance(anneal_cfg_obj, dict):
    anneal_config = anneal_cfg_obj
else:
    # e.g. SimpleNamespace → dict
    anneal_config = vars(anneal_cfg_obj)

Then reuse anneal_config everywhere you’re calling .get.

Also applies to: 492-501, 557-588

src/tplr/neurons.py (1)

846-899: Anneal-mode catchup logic is reasonable; ensure anneal_mode schema is stable

The anneal-mode branch in handle_checkpoint_catchup does what you describe:

  • Computes progress relative to anneal_start_global_step.
  • Derives anneal_inner_steps = anneal_outer_steps * inner_steps and replays exactly those steps on inner_scheduler.
  • Sets warmup_inner_steps / warmup_steps_taken so warmup tracking matches the anneal phase.
  • Still runs catchup via catchup_with_aggregation_server to align weights with the chain.

As with Miner, this assumes instance.hparams.anneal_mode is a dict-like object. If there’s any chance it becomes a structured config object, it would be safer to normalize it (e.g. via vars() when it’s a SimpleNamespace) before using .get.
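🔎 The replay arithmetic above can be sketched as a small pure function; `anneal_replay_steps` is a hypothetical name for illustration only.

```python
def anneal_replay_steps(global_step: int, anneal_start_global_step: int,
                        inner_steps: int) -> int:
    """Inner-scheduler steps to replay during anneal catch-up."""
    # Progress relative to the anneal start, clamped at zero.
    anneal_outer_steps = max(0, global_step - anneal_start_global_step)
    # Each outer step corresponds to inner_steps scheduler steps.
    return anneal_outer_steps * inner_steps
```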

tests/test_comms.py (1)

154-155: CI skip for test_put_local is understandable; consider making it conditional

Marking test_put_local as unconditionally skipped avoids the /dev/shm/tmp cross-device link issue in CI, but it also removes coverage for the local storage path everywhere.

If feasible, consider:

  • Using @pytest.mark.skipif keyed on CI environment / platform (e.g., only skip when running under your CI image), or
  • Adding a TODO in the reason to revisit once the underlying filesystem issue is resolved.

That way the test can still run locally or in fixed environments.
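🔎 One possible shape for the conditional skip (the `CI` environment variable is an assumption about your runner; adjust the condition to match your CI image):

```python
import os
import pytest

@pytest.mark.skipif(
    os.environ.get("CI") == "true",
    reason="/dev/shm/tmp cross-device link fails in CI; TODO revisit once fixed",
)
async def test_put_local():
    ...
```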

tests/unit/test_model_factory.py (1)

243-245: Consider testing tie_embeddings validation.

The addition of tie_embeddings=False is necessary for the test to pass with tp_degree=2. However, consider adding test cases to verify:

  • What happens when tie_embeddings=True with tp_degree > 1
  • What happens when tie_embeddings is omitted with tp_degree > 1

This would ensure the validation logic in _validate_parallel_config (mentioned in the relevant code snippet) is properly tested.

🔎 Suggested additional test case
def test_miner_role_custom_tp_with_tie_embeddings(self):
    """Test that tie_embeddings=True is rejected with tp_degree > 1."""
    hparams = SimpleNamespace(
        torchtitan=SimpleNamespace(tp_degree=2, tie_embeddings=True)
    )
    with self.assertRaisesRegex(
        ValueError, "tie_embeddings.*cannot be used with tp_degree > 1"
    ):
        create_parallel_dims(world_size=2, hparams=hparams, role="miner")
scripts/download_dataset.py (1)

42-42: Anneal mode detection is functional but substring-based

Using file_prefix = "anneal" if "anneal" in str(output_path) else "train" works, but it will also trigger on any path that merely contains "anneal" as a substring (e.g., /data/annealing_runs). That’s probably fine for current usage, but if you want this to be stricter and less surprising, consider checking the final path component instead:

Optional refinement
-    file_prefix = "anneal" if "anneal" in str(output_path) else "train"
+    output_path_str = str(output_path)
+    last_component = os.path.basename(output_path_str.rstrip("/"))
+    file_prefix = "anneal" if last_component == "anneal" else "train"

Passing file_prefix through to SharedShardedDataset.locate_shards and defaulting --path/DATASET_BINS_PATH to anneal/ are otherwise aligned with the new anneal shard naming.

Also applies to: 107-109, 114-115, 244-245, 269-271

src/tplr/comms.py (1)

50-56: DTensor-safe torch.load and TMPDIR-aware temp dirs look good; consider unifying temp handling

Registering Replicate, Shard, and _StridedShard with torch.serialization.add_safe_globals behind a try/except ImportError is the right pattern to make torch.load(..., weights_only=True) DTensor-safe without introducing a hard dependency.

Respecting TMPDIR both for self.temp_dir and the per-UID temp directory in put() is also a nice operational improvement (easier to redirect scratch to larger or faster disks).

If you want to tighten things further, consider reusing self.temp_dir for put() as well (e.g., os.path.join(self.temp_dir, str(uid))) so all temporary artifacts for a Comms instance live under a single, configurable root instead of mixing self.temp_dir and a separate base_tmp/uid tree.

Also applies to: 107-110, 323-327
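🔎 Sketch of the unified layout suggested above. The `Comms` shell and `uid_temp_dir` helper here are hypothetical; only the TMPDIR-rooted directory scheme is the point.

```python
import os
import tempfile

class Comms:
    def __init__(self):
        # Single configurable root for all scratch artifacts.
        root = os.environ.get("TMPDIR", tempfile.gettempdir())
        self.temp_dir = os.path.join(root, "tplr_comms")
        os.makedirs(self.temp_dir, exist_ok=True)

    def uid_temp_dir(self, uid: int) -> str:
        # Per-UID scratch lives under the same root instead of a
        # separate base_tmp/uid tree.
        path = os.path.join(self.temp_dir, str(uid))
        os.makedirs(path, exist_ok=True)
        return path
```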

neurons/evaluator.py (1)

39-40: Clarify tp_degree source and semantics in custom eval aggregation

You now:

  • Derive self.tp_degree from TP_DEGREE / hparams.torchtitan.tp_degree, and
  • Divide the globally all‑reduced loss, tokens, and bytes by tp_degree when tp_degree > 1.

Because dist_helper.ddp_reduce reduces over the full world group, scaling all three aggregates by the same factor leaves average_loss, perplexity, and bpb numerically unchanged; only the reported total_tokens/total_bytes change. For the evaluator role, create_parallel_dims still forces tp=1 and uses DP only, so self.tp_degree can differ from the actual TP degree used at eval time.

Two suggestions:

  • Consider deriving tp_degree for logging/normalization from the actual ParallelDims (or dist_helper) instead of from hparams/env to avoid drift between train-time and eval-time configs.
  • If the goal is to avoid double-counting across TP shards when TP is genuinely active at eval time, a more robust long-term approach would be to have ddp_reduce accept a data-parallel process group, removing the need for manual division.

Behavior is acceptable as-is, but it would be good to double-check this against your intended TP/DP topology.

Also applies to: 236-240, 803-825
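🔎 The invariance claim above can be checked with a toy normalization; the function below is illustrative, assuming TP ranks see identical batches so a global SUM double-counts by exactly tp_degree.

```python
def normalize_eval_aggregates(loss_sum: float, token_sum: float,
                              byte_sum: float, tp_degree: int):
    """Rescale globally all-reduced aggregates to DP-only quantities."""
    if tp_degree > 1:
        loss_sum /= tp_degree
        token_sum /= tp_degree
        byte_sum /= tp_degree
    # Ratios such as average loss are unchanged by the common scaling.
    avg_loss = loss_sum / token_sum
    return loss_sum, token_sum, byte_sum, avg_loss
```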

tests/unit/test_neurons.py (1)

6-7: Test updates correctly match new compare_model_with_debug_dict behavior; minor test-style nit

  • Swapping in a real torch.nn.Parameter for param1 is a good fix: it more closely matches what compare_model_with_debug_dict will see in production and should avoid attribute surprises.
  • The updated calls compare_model_with_debug_dict(self.model, debug_dict, 0.01) align with the new, simplified signature.
  • Marking the mismatch test with @pytest.mark.skip is reasonable while you track down the NaN behavior.

One small nit: since this module is still structured around unittest.TestCase and can be run via unittest.main(), you might prefer @unittest.skip("reason") for the skipped test so it behaves the same under both pytest and pure unittest runners. Not a blocker.

The revised mock_broadcast.call_count expectations in the catchup tests also look consistent with the comment annotations (“two per successful window + one failed window”, etc.).

Also applies to: 48-55, 57-75, 481-483, 527-529, 575-577

src/tplr/sharded_dataset.py (1)

75-76: Anneal-mode dataset handling and naming look consistent; consider a small safety guard

  • locate_shards now maps file_prefix=="anneal" to anneal_{idx:06d}.npy and sample_ids_anneal_{idx:06d}.npy, matching the updated docs and scripts/download_dataset.py behavior.
  • ShardedDatasetManager’s anneal_mode flag correctly:
    • Keeps file_prefix configurable (so you can opt into anneal explicitly),
    • Skips preparing/switching to a “next” shard, and
    • Avoids deleting old shard files in swap_datasets.

This achieves the intended “single-shard, no cycling” semantics.

If you want to make misconfiguration more obvious, you could add a light assertion in initialize_datasets when anneal_mode is True to ensure current_shard_index is 0 (or log a warning when it isn’t), since the log message suggests we always stay on shard 0.

Also applies to: 124-132, 245-254, 271-285, 399-403, 421-423

src/tplr/model_factory.py (1)

280-387: Validator/miner DP/TP env overrides are well-structured; align logging with effective ParallelDims

The create_parallel_dims changes for validator/miner:

  • Safely parse TP_DEGREE / DP_SHARD from the environment (with clear error messages),
  • Fall back to hparams.torchtitan when env vars are absent,
  • Auto-compute dp_shard from world_size when it’s not explicitly set, and
  • Run everything through _validate_parallel_config (including the new tie_embeddings constraint),

which is a nice, explicit configuration flow.

One small mismatch: in initialize_torchtitan_model, the miner logging still reads tp_degree/dp_shard from hparams.torchtitan, so the log message can drift from the actual ParallelDims when env overrides are used. It would be more accurate to log from pdims (or re-use the resolved values from create_parallel_dims) so operational logs reflect the true runtime parallel layout.

Functionally, the current implementation should behave correctly; this is mainly about observability/operability.

Also applies to: 398-465, 612-620

neurons/trainer.py (3)

173-241: Anneal scheduler branch looks consistent; consider config-shape assumptions

The anneal branch (linear warmup then linear decay) is wired correctly: warmup_inner_steps, decay_outer_steps, peak_lr_factor, and eta_min_factor combine to give LR scaling of learning_rate * peak_lr_factor at peak and learning_rate * eta_min_factor at the end of decay.

Two points to keep in mind:

  • anneal_mode is assumed to be a mapping (supporting .get). If older hparams ever define it as a boolean or something else, this will raise. If backward compatibility matters, consider normalizing it (e.g., defaulting non-dicts to {}).
  • You’re logging effective_lr only in the non‑anneal branch; in anneal mode the actual eta_min is implicit. That’s fine, just be aware metrics consumers might see different logging detail per mode.

No functional blockers here; just configuration-shape consistency to verify against existing hparams.
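🔎 A pure-Python sketch of the schedule shape described above (linear warmup to peak_lr_factor, then linear decay to eta_min_factor). Names mirror the hparams keys; the function itself is illustrative, not the trainer's implementation.

```python
def anneal_lr_factor(step: int, warmup_inner_steps: int,
                     decay_inner_steps: int,
                     peak_lr_factor: float = 1.0,
                     eta_min_factor: float = 0.0) -> float:
    """LR multiplier applied to learning_rate at a given inner step."""
    if step < warmup_inner_steps:
        # Linear ramp up to the peak factor.
        return peak_lr_factor * (step + 1) / warmup_inner_steps
    # Linear decay from peak down to the minimum factor.
    progress = min(1.0, (step - warmup_inner_steps) / decay_inner_steps)
    return peak_lr_factor + (eta_min_factor - peak_lr_factor) * progress
```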


1105-1122: Final norms batched reduction: TP scaling only for batch_count

The final reduction of grad_norm_sq, weight_norm_sq, and batch_count via batched_all_reduce is coherent:

  • Norms are summed across ranks and square‑rooted into global L2 norms; no TP scaling is required because you’re conceptually aggregating over all shards.
  • Only batch_count is divided by self.tp_degree (both in the distributed and non‑distributed branches), under the same assumption that TP ranks see identical batches.

One caveat: if someone sets TP_DEGREE>1 in a single‑GPU / non‑distributed run, the non‑distributed branch will divide batch_count by tp_degree, which will undercount batches. That’s primarily a misconfiguration issue, but you might want to guard with if self.world_size > 1 around the TP normalization to make this more robust.
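🔎 The suggested guard, sketched with a hypothetical helper: only apply TP normalization when the run is actually distributed, so a single-GPU run with TP_DEGREE>1 in the environment does not undercount batches.

```python
def normalized_batch_count(batch_count: int, tp_degree: int,
                           world_size: int) -> float:
    if world_size > 1 and tp_degree > 1:
        # TP ranks see identical batches, so divide out the duplication.
        return batch_count / tp_degree
    return batch_count
```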


4082-4091: TP_DEGREE parsing in _training_pool_digest should align with trainer/validator

The digest recreation now safe‑parses TP_DEGREE and passes tp_degree through to MinerSampler. That’s helpful, but note:

  • Here the fallback is getattr(self.hparams, "tp_degree", 1), whereas elsewhere you prefer hparams.torchtitan.tp_degree.
  • The comment claims “tp_degree doesn't affect _global_indices()”, which is fine if MinerSampler._global_indices() truly ignores TP, but if that ever changes, mismatched tp_degree here vs on miners will cause false digest mismatches and spurious slashing.

I’d suggest:

  • Reusing the same tp_degree resolution logic as in set_dataloader / Validator.__init__ to keep everything in sync.
  • Optionally adding a small assertion or log if tp_degree * dp_degree != world_size (if you track dp degree), to catch misconfigurations early.

Functionality is otherwise correct; this is about future‑proofing and config consistency.

Also applies to: 4093-4105

neurons/validator.py (2)

337-351: TP degree init here is robust; consider sharing logic with Trainer

This TP init is nicely hardened:

  • Env override (TP_DEGREE) with try/except and a warning on invalid values.
  • Fallback to hparams.torchtitan.tp_degree with a safe default of 1.

Given you’re doing similar work in Trainer.set_dataloader and _training_pool_digest, it would be worth unifying into a small helper (e.g., tplr.neurons.resolve_tp_degree(hparams)) to avoid future drift between components and keep TP semantics consistent across miner/validator/trainer.
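🔎 A possible shape for that shared helper. `resolve_tp_degree` is a proposed name, not an existing API; it combines the env override, invalid-value warning, and torchtitan fallback described above.

```python
import os
import logging

logger = logging.getLogger(__name__)

def resolve_tp_degree(hparams) -> int:
    """Resolve TP degree: TP_DEGREE env var, else hparams.torchtitan.tp_degree, else 1."""
    env_val = os.environ.get("TP_DEGREE")
    if env_val is not None:
        try:
            return int(env_val)
        except ValueError:
            logger.warning("Invalid TP_DEGREE=%r; falling back to hparams", env_val)
    torchtitan = getattr(hparams, "torchtitan", None)
    return getattr(torchtitan, "tp_degree", 1) if torchtitan else 1
```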


4078-4091: Align _training_pool_digest TP_DEGREE handling with the rest of the system

The validator’s _training_pool_digest now:

  • Parses TP_DEGREE from the environment with a safe fallback to hparams.tp_degree.
  • Passes tp_degree and world_size into MinerSampler, with a comment that TP doesn’t affect _global_indices().

To avoid subtle mismatches with miners:

  • Consider using the same torchtitan.tp_degree‑aware resolution as in Validator.__init__ / Trainer.set_dataloader.
  • If _global_indices() is truly TP‑agnostic today, maybe assert or log on first use if it ever starts using tp_degree, so digest semantics don’t silently change.

Functionally fine now, but centralizing TP degree resolution would make future changes safer.

Also applies to: 4093-4105

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f7a67c5 and 2506663.

📒 Files selected for processing (30)
  • .gitignore
  • docker/compose.yml
  • docs/miner.md
  • docs/shared_sharded_dataset.md
  • docs/validator.md
  • hparams/hparams.json
  • neurons/evaluator.py
  • neurons/miner.py
  • neurons/trainer.py
  • neurons/validator.py
  • scripts/abort_multipart_uploads.py
  • scripts/cleanup_bucket.py
  • scripts/download_dataset.py
  • scripts/net-tuning.sh
  • src/tplr/__init__.py
  • src/tplr/chain.py
  • src/tplr/comms.py
  • src/tplr/compress.py
  • src/tplr/distributed.py
  • src/tplr/hparams.py
  • src/tplr/model_factory.py
  • src/tplr/neurons.py
  • src/tplr/sharded_dataset.py
  • src/tplr/sharded_sampler.py
  • tests/test_comms.py
  • tests/test_evaluator.py
  • tests/test_model_comparison.py
  • tests/test_prepare_gradient_dict.py
  • tests/unit/test_model_factory.py
  • tests/unit/test_neurons.py
🧰 Additional context used
🧬 Code graph analysis (10)
tests/test_model_comparison.py (1)
src/tplr/neurons.py (1)
  • compare_model_with_debug_dict (1363-1463)
src/tplr/compress.py (1)
tests/test_comms.py (1)
  • _get_smaller_split (1102-1111)
scripts/download_dataset.py (1)
src/tplr/sharded_dataset.py (2)
  • SharedShardedDataset (58-239)
  • locate_shards (100-133)
src/tplr/neurons.py (2)
tests/conftest.py (1)
  • totalks (141-144)
src/tplr/compress.py (2)
  • batch_decompress (522-618)
  • decode (298-330)
neurons/validator.py (3)
src/tplr/comms.py (1)
  • get (1398-1513)
neurons/trainer.py (1)
  • set_dataloader (52-98)
src/tplr/compress.py (1)
  • decompress (464-519)
tests/unit/test_model_factory.py (1)
src/tplr/model_factory.py (1)
  • create_parallel_dims (280-475)
neurons/trainer.py (1)
src/tplr/distributed.py (2)
  • ddp_reduce (97-131)
  • batched_all_reduce (351-406)
neurons/evaluator.py (1)
src/tplr/distributed.py (1)
  • ddp_reduce (97-131)
neurons/miner.py (3)
src/tplr/sharded_dataset.py (3)
  • ShardedDatasetManager (242-446)
  • initialize_datasets (385-412)
  • swap_datasets (414-446)
src/tplr/neurons.py (1)
  • outer_step (309-630)
src/tplr/compress.py (6)
  • decompress (464-519)
  • decode (298-330)
  • encode (249-295)
  • compress (411-415)
  • compress (417-421)
  • compress (424-461)
tests/unit/test_neurons.py (1)
src/tplr/neurons.py (1)
  • compare_model_with_debug_dict (1363-1463)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: lint-and-format (3.11)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.12)
🔇 Additional comments (33)
.gitignore (1)

84-86: These additions look good.

The new patterns are reasonable and appropriate for supporting the tensor parallelism and anneal-mode features. The validator-state-*.pt pattern aligns well with the existing .npz pattern, and the tokenized and checkpoints/ entries support the expanded distributed training and dataset pipeline.

src/tplr/__init__.py (1)

23-23: LGTM! Version bump is correct.

The version increment from 2.1.18 to 2.1.19 is syntactically correct and aligns with the PR title.

scripts/cleanup_bucket.py (1)

75-88: Verify that the expanded deletion scope is intentional.

The filtering has been expanded from 3 prefixes ('checkpoint', 'gradient', 'start_window') to 6 prefixes (adding 'aggregator', 'debug', 'peers'). This significantly broadens the scope of data deletion. Ensure that:

  • The new prefixes are intentional and align with the cleanup requirements
  • The script has been tested to confirm no critical data is inadvertently deleted
  • Operators are aware of the expanded scope before running this script
scripts/abort_multipart_uploads.py (3)

25-43: LGTM!

Environment variable validation is correctly implemented with clear error messaging before accessing credentials.


88-90: LGTM!

Top-level exception handling with proper exit code for script failure.


93-95: LGTM!

Standard async entry point pattern.

src/tplr/distributed.py (1)

351-406: LGTM! Efficient batched reduction implementation.

The batched all-reduce implementation is well-designed:

  • Groups values by operation type to minimize collective calls
  • Handles non-distributed and empty-dict fast paths
  • Uses float32 for consistent cross-rank communication
  • Properly unpacks results back to per-name dictionary

This should significantly improve performance when reducing multiple scalar metrics in distributed training scenarios.
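🔎 Single-process illustration of the batching idea (not the real batched_all_reduce): pack scalars by reduce op, run one collective per op, then unpack per name. `all_reduce` is a stand-in for dist.all_reduce on a float32 tensor.

```python
from collections import defaultdict

def batched_reduce(values: dict, ops: dict,
                   all_reduce=lambda buf, op: buf) -> dict:
    """Group scalars by op so one collective handles many metrics."""
    grouped = defaultdict(list)
    for name, val in values.items():
        grouped[ops.get(name, "sum")].append((name, float(val)))
    out = {}
    for op, items in grouped.items():
        buf = [v for _, v in items]   # would be a packed float32 tensor
        buf = all_reduce(buf, op)     # one collective call per op type
        for (name, _), v in zip(items, buf):
            out[name] = v             # unpack back to per-name dict
    return out
```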

src/tplr/compress.py (2)

224-246: LGTM! Dynamic shape handling for TP scenarios.

The _ensure_shape_in_dict helper correctly handles the case where TP introduces new local shard shapes not present in the initial dictionary. The implementation properly:

  • Computes the appropriate chunk size using _get_smaller_split
  • Generates DCT basis matrices on-demand
  • Stores them in the appropriate dictionaries for future reuse

261-270: Fast/slow path pattern is well-designed with accurate comments.

The try-except optimization for shape lookups is sound. The fast path assumes shapes are pre-loaded (during initialization for preset shapes, or after first TP step for dynamic shapes), while the slow path dynamically registers new shapes via _ensure_shape_in_dict. The _ensure_shape_in_dict method guards against re-adding shapes with if s not in self.shape_dict, ensuring the slow path only triggers once per new shard size, exactly as the comments describe.

The pattern works correctly for both 2D weights (lines 261-270) and 1D weights (lines 281-287).
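🔎 Minimal sketch of the fast/slow lookup pattern (the `ShapeRegistry` class and string "basis" are stand-ins for the real compressor state and DCT basis matrices):

```python
class ShapeRegistry:
    def __init__(self):
        self.shape_dict = {}

    def _ensure_shape(self, s: int) -> None:
        # Guard against re-adding, mirroring _ensure_shape_in_dict.
        if s not in self.shape_dict:
            self.shape_dict[s] = f"basis_for_{s}"  # stand-in for DCT basis

    def lookup(self, s: int):
        try:
            return self.shape_dict[s]   # fast path: shape already known
        except KeyError:
            self._ensure_shape(s)       # slow path: once per new shard size
            return self.shape_dict[s]
```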

hparams/hparams.json (2)

56-57: LGTM! Version and window updates.

The checkpoint initialization version bump to "2.1.18" and window update to 62834 align with the PR objectives for bootstrap/run version updates.


3-11: The eta_min_factor: 0.0 setting in anneal_mode is intentional. The codebase defines two distinct scheduler modes: anneal_mode explicitly defaults to eta_min_factor: 0.0 (neurons/trainer.py line 182) to decay the learning rate completely to zero, while normal training mode defaults to eta_min_factor: 0.1 (line 218) as its minimum. The distinction is deliberate: anneal_mode implements a complete decay schedule, whereas normal training preserves a small learning-rate floor. The implementation and logging confirm this behavior.

src/tplr/hparams.py (2)

66-84: LGTM! Deep merge implementation is correct.

The deep_merge function properly handles recursive dictionary merging:

  • Uses base.copy() to avoid mutating the original
  • Recursively merges nested dictionaries
  • Override values take precedence for leaf nodes

The shallow copy on line 78 is safe because nested dictionaries are handled recursively on line 81, creating new dictionary objects rather than sharing references.
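🔎 A sketch matching the merge behavior reviewed above (illustrative, not the exact source): copy the base, recurse into dict-valued keys, and let override values win at the leaves.

```python
def deep_merge(base: dict, override: dict) -> dict:
    merged = base.copy()  # shallow copy; nested dicts handled below
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recursion creates a new dict, so base's nested dicts
            # are never mutated or shared.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value  # override wins at leaf nodes
    return merged
```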


98-98: Proper hierarchical config merging.

The replacement of shallow update() with deep_merge ensures nested configuration blocks (like optimizer, torchtitan, anneal_mode) are properly merged across the configuration hierarchy rather than being completely replaced. This is essential for the new anneal_mode feature and TP configuration.

Also applies to: 173-173, 199-199, 214-214

docs/miner.md (1)

173-180: Anneal dataset path in docs matches code-level anneal mode

Updating DATASET_BINS_PATH to "anneal/" in the miner docs aligns with the anneal-mode dataset naming used elsewhere (e.g., file_prefix="anneal" in ShardedDatasetManager). This keeps operator docs consistent with the new default behavior.

neurons/miner.py (1)

1181-1414: TP gradient reconstruction path is logically consistent; watch memory pressure and quantization types

The _reconstruct_tp_gradients pipeline matches the compression stack:

  • Decompress → transform.decode → concatenate on shard_dim → transform.encode → compress is consistent with how prepare_gradient_dict encodes shards.
  • Using shard_metadata (is_tp_sharded, local_shape, global_shape, shard_dim) to decide which params need TP reconstruction vs simple copy-through is sane.
  • Re-populating metadata["xshapes"] / ["totalks"] from reconstruction metadata gives validators the correct shapes even under TP/DP config mismatches.

Two non-blocking considerations:

  • This path fully materializes each TP-sharded gradient on GPU before re-compressing. For very large layers and high tp_degree, this can be a noticeable memory spike. If you do hit OOMs here, you may want to gate reconstruction per-parameter (or per-module) and/or temporarily lower topk_compression only for reconstructed tensors.
  • quant_params can be dict/tuple/list; you already normalize them to the right device before decompress, which should match the compress/decompress contract. Just ensure any future changes to TopKCompressor.compress keep returning a single quant_params object per param, not per-block sequences, or this reconstruction logic will need adjusting.
tests/test_evaluator.py (1)

70-70: LGTM! Test fixture properly updated for TP support.

The addition of tp_degree = 1 to the test fixture correctly initializes the new attribute with a sensible default (no tensor parallelism). This aligns with the broader PR changes introducing TP-degree awareness in the Evaluator.

docs/validator.md (1)

148-148: LGTM! Documentation updated for anneal dataset path.

The dataset path updates to "anneal/" are consistent across all configuration examples (Docker .env, runtime exports, and environment variables section). This aligns with the broader PR changes introducing anneal mode support.

Also applies to: 246-246, 330-330

tests/test_prepare_gradient_dict.py (2)

107-111: Improved test flexibility with per-key validation.

The change from exact dictionary equality to per-key assertions is a good practice that:

  • Allows for future metadata extensions without breaking tests
  • Provides clearer assertion messages when validation fails
  • Validates the required structure (window, xshapes, totalks keys)

136-144: Consistent per-key metadata validation.

The updated assertions properly validate each expected metadata key with clear error messages. This matches the pattern established in the earlier test and maintains consistency.

tests/unit/test_model_factory.py (2)

256-256: Improved error message assertions with regex patterns.

The updated regex patterns make the tests more resilient to minor wording changes in error messages while still validating the essential content. This is a good maintainability improvement.

Also applies to: 265-265, 273-273, 283-283


278-280: Consistent tie_embeddings handling in TP tests.

The addition of tie_embeddings=False is consistent with the earlier custom TP test at lines 243-245, ensuring proper validation when tp_degree > 1.

docker/compose.yml (1)

31-32: DP_SHARD and TP_DEGREE defaults are correctly configured for 4-GPU setup.

The defaults DP_SHARD=4 and TP_DEGREE=1 are appropriate and properly validated:

  • World size calculation (DP_SHARD × TP_DEGREE = 4 × 1 = 4) matches the 4 allocated GPUs (device_ids 0-3)
  • Environment variables are validated as integers in the codebase and can be overridden at runtime
  • Configuration applies consistently to all node types (miner, validator, etc.)
  • Backward compatibility and runtime validation logic are present in src/tplr/model_factory.py and src/tplr/sharded_sampler.py
docs/shared_sharded_dataset.md (1)

49-49: Anneal path and filename docs now consistent with code

The updated DATASET_BINS_PATH="anneal/" and the rclone examples using anneal/anneal_000000.npy and sample_ids_anneal_000000.npy match the new anneal-specific resolution logic in SharedShardedDataset.locate_shards and scripts/download_dataset.py. No functional issues from the docs side.

Also applies to: 141-154

src/tplr/model_factory.py (1)

24-27: Centralized parallel config validation is solid; tie_embeddings constraint for TP>1 is important

The new _validate_parallel_config does a good job of pushing all the invariants into one place:

  • Rejects zero/negative degrees early.
  • Prevents simultaneous dp_replicate > 1 and dp_shard > 1.
  • Enforces dp_replicate > 1 only when tp/pp/cp == 1.
  • Ensures world_size is divisible by the combined parallel degree.
  • Requires torchtitan.tie_embeddings to be False when tp_degree > 1, which is necessary for sharded embeddings.

Combined with get_titan_model_args reading tie_embeddings from hparams.torchtitan, this should catch most misconfigurations up front instead of failing deeper inside TorchTitan.

Just be aware that some existing configs that previously “worked by accident” (e.g., non‑1 dp_replicate and dp_shard at the same time, or leaving tie_embeddings=True while turning on TP) will now fail fast with explicit errors, which is likely desirable but worth calling out in release notes.

Also applies to: 61-83, 216-278

neurons/trainer.py (4)

488-494: TP-aware averaging in evaluate_model depends on TP/DP layout

Dividing the reduced total_loss and n_batches by self.tp_degree makes sense if:

  • The reduction runs over all ranks in all TP groups (global process group), and
  • TP ranks see the same data while DP groups see distinct data.

In that case the all‑reduce sum is duplicated tp_degree‑times and your division recovers a DP‑only aggregate.

If there’s any configuration where:

  • The reduction group is already limited to DP ranks, or
  • TP groups don’t actually process identical batches,

then this additional division would under‑scale metrics.

Worth double‑checking how dist_helper.ddp_reduce is configured (group vs global world), especially for mixed TP+FSDP setups, to ensure the TP normalization here is always correct.


611-647: Batched optimizer metric reductions look correct and more efficient

The switch to dist_helper.batched_all_reduce for Adam metrics is clean:

  • All scalar sums/maxes are grouped into a single batched call.
  • grad_norm_max and update_norm_max correctly use ReduceOp.MAX, everything else defaults to SUM.
  • The results are unpacked and used exactly as before.

Semantics (summing squared norms across all ranks, then **0.5) remain unchanged apart from the more efficient communication pattern.
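🔎 Quick math check of that reduction: summing per-rank squared norms and taking the square root yields the global L2 norm, since the shards partition the elements.

```python
import math

def global_l2(per_rank_sq_norms: list) -> float:
    """Combine per-rank squared L2 norms into the global L2 norm."""
    return math.sqrt(sum(per_rank_sq_norms))
```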


805-808: Per-step fwd/bwd timing and grad-norm logging are sound

The added timing instrumentation in inner_steps:

  • Measures forward and backward durations per micro‑batch and aggregates them per optimizer step (step_forward_time, step_backward_time).
  • Logs the accumulated fwd/bwd times once per inner step and resets the counters afterwards.
  • Safely handles grad_norm_before_clip whether it’s a Tensor (including DTensor) or plain float via the .item() guard.

This is a low‑overhead, thread‑safe way to get step‑level timings and grad norms and doesn’t affect training behavior.

Also applies to: 868-878, 882-901, 1019-1024, 1029-1036


927-955: TP-aware token/loss reductions and anneal warmup gating look consistent

Within the optimizer‑step block:

  • Using a single batched_all_reduce for tokens_sum, loss_sum, and an AVG loss_avg is a good consolidation of collectives.
  • Dividing tokens_sum and loss_sum by self.tp_degree aligns with the idea that TP ranks process the same data and should not be over-counted; loss_avg is already an AVG across ranks, so leaving it untouched is correct.
  • global_tokens and global_loss_sum are accumulated from the TP‑normalized step values, so downstream window metrics see DP‑normalized quantities.
  • Manual warmup scaling is correctly disabled when anneal_mode.enabled is true, leaving warmup purely to the anneal scheduler while still respecting flatten windows.

Behaviorally this matches the TP story used in evaluate_model and keeps the anneal and non‑anneal modes separate.

Also applies to: 959-998

neurons/validator.py (5)

32-36: CUDA allocator env default is reasonable; just be aware of process-wide scope

Setting PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" by default (when unset) before importing torch is a good way to reduce fragmentation for large FSDP/TP models.

Just keep in mind:

  • This affects the whole process, including any other components importing torch in the same interpreter.
  • If users want a different allocator behavior, they must explicitly set PYTORCH_CUDA_ALLOC_CONF before launching the validator, which your guard correctly honors.

No code change needed; this is more of a behavioral note.


549-555: Anneal-mode dataset prefix/mode wiring matches ShardedDatasetManager contract

The anneal dataset wiring looks correct:

  • anneal_config = getattr(self.hparams, "anneal_mode", {})
  • anneal_enabled = anneal_config.get("enabled", False)
  • anneal_prefix = anneal_config.get("file_prefix", "anneal") if anneal_enabled else "train"
  • Passed into ShardedDatasetManager(file_prefix=anneal_prefix, anneal_mode=anneal_enabled), matching its constructor.

This aligns with the manager’s behavior: locking to a single shard and skipping shard swaps when anneal_mode=True, and using anneal_-prefixed files.

Just ensure all anneal-mode runs use the same file_prefix convention as the preprocessing pipeline and docs (anneal_*.npy etc.), which it sounds like you’ve already updated elsewhere in the PR.

Also applies to: 556-564
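The prefix resolution above can be condensed into one function (a sketch mirroring the bullets; `resolve_dataset_config` is a hypothetical name, and a `SimpleNamespace` stands in for `hparams`):

```python
from types import SimpleNamespace

def resolve_dataset_config(hparams):
    # Prefix falls back to "train" whenever anneal mode is disabled or
    # unconfigured; "anneal" is only the default when enabled is true.
    anneal_config = getattr(hparams, "anneal_mode", {}) or {}
    enabled = anneal_config.get("enabled", False)
    prefix = anneal_config.get("file_prefix", "anneal") if enabled else "train"
    return prefix, enabled

on = SimpleNamespace(anneal_mode={"enabled": True})
off = SimpleNamespace()  # no anneal_mode attribute at all
```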


1287-1291: Anneal-mode shard pinning and swap skipping are coherent with dataset manager

Two related changes:

  • Right after computing shard state, you force current_shard = 0 and shard_epoch = 0 if self.dataset_manager.anneal_mode is true, so initial dataset init always uses shard 0.
  • In the main loop, you re-check anneal_mode and completely skip shard re‑compute and swap_datasets() when it’s enabled.

Combined with ShardedDatasetManager.anneal_mode (which already prevents preparing/swapping next shards), this cleanly enforces single‑shard anneal behavior without double‑swapping or index drift.

Looks good and consistent with the anneal-mode design.

Also applies to: 1333-1363
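The two decisions — pin to shard 0 and skip swaps — reduce to a pair of pure functions (hypothetical names, sketching the control flow described above):

```python
def resolve_shard(computed_shard, computed_epoch, anneal_mode):
    # Anneal mode pins training to shard 0 regardless of computed state.
    if anneal_mode:
        return 0, 0
    return computed_shard, computed_epoch

def should_swap_shard(current_shard, target_shard, anneal_mode):
    # Swap only outside anneal mode and only when the target differs.
    return (not anneal_mode) and target_shard != current_shard
```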


1698-1701: Pre-slashing barrier is a good safeguard against TP/DDP deadlocks

The new pre‑slashing barrier:

if self.world_size > 1 and dist.is_initialized():
    dist_helper.safe_barrier("validator_pre_slashing", self.local_rank)

before slash_for_poor_sync() and slash_for_missing_gradients() is a sensible addition:

  • Ensures all ranks reach a common point before master runs potentially long async work that might call collectives.
  • Uses safe_barrier, which handles device‑specific barriers and errors gracefully.

This should help avoid subtle deadlocks in mixed TP+FSDP setups where some ranks might otherwise advance early.


2048-2086: Gradient-apply consensus and slashing ordering are correct

The new handling around gradient application failures is well thought out:

  • Capture a local exception_msg only when update_model_with_gradient fails on this rank.
  • Run dist_helper.all_ok(...) so all ranks participate in the same collective before any slashing or restore behavior.
  • Only after all_ok finishes, have master call slash_for_invalid_gradient once, using either the captured message (if any) or a generic one.

This avoids rank‑specific early exits that previously could have led to mismatched barrier participation and deadlocks, while still ensuring invalid gradients are penalized.
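The consensus pattern can be simulated without a process group (in torch this would be an all_reduce MIN over a 0/1 flag; `all_ok_consensus` is a hypothetical stand-in for `dist_helper.all_ok`):

```python
def all_ok_consensus(per_rank_errors):
    """Simulate the all_ok collective: every rank contributes a success
    flag so all ranks agree on the outcome before anyone slashes or
    restores. A rank's entry is None on success, else its error message."""
    ok = all(err is None for err in per_rank_errors)  # ~ all_reduce(MIN)
    if ok:
        return True, None
    # Master uses the first captured message, or a generic one.
    msg = next((e for e in per_rank_errors if e is not None), None)
    return False, msg or "invalid gradient"

ok, msg = all_ok_consensus([None, "shape mismatch on rank 1", None])
```

The key property is that every rank evaluates the same collective before any rank-specific action, which is exactly what prevents the mismatched-barrier deadlocks described above.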

Comment on lines 695 to 737
            # ------------------------------------------------------------
            # rank-0 merges & uploads the full gradient
            # Handles three cases: Pure FSDP, Pure TP, and TP+FSDP hybrid
            # ------------------------------------------------------------
            gradient = {}
            processed_state_dict = {}
            processed_state_dict = {}  # Initialize here for proper scope
            tp_degree = self.tp_degree if self.tp_degree > 1 else 1

            if self.is_master:
                assert gathered is not None
                for i, shard in enumerate(gathered):
                    if shard is not None:
                        gradient.update(shard)
                        gathered[i] = None  # Free memory immediately after using shard

                if tp_degree == 1:
                    # Case 1: Pure FSDP - each rank owns different parameters
                    # Simple merge with no conflicts
                    for i, shard in enumerate(gathered):
                        if shard:
                            gradient.update(shard)
                            gathered[i] = None  # Free memory immediately

                elif self.world_size == tp_degree:
                    # Case 2: Pure TP - all ranks have same parameters sharded
                    # Need to reconstruct full parameters from shards
                    gradient = self._reconstruct_tp_gradients(gathered)

                else:
                    # Case 3: TP + FSDP hybrid
                    # Process each FSDP group's TP shards separately
                    num_fsdp_groups = self.world_size // tp_degree

                    for fsdp_id in range(num_fsdp_groups):
                        # Extract TP shards for this FSDP group
                        start_idx = fsdp_id * tp_degree
                        end_idx = start_idx + tp_degree
                        tp_shards = gathered[start_idx:end_idx]

                        # Reconstruct this FSDP group's parameters
                        reconstructed = self._reconstruct_tp_gradients(tp_shards)
                        gradient.update(reconstructed)

                        # Free memory for processed shards
                        for idx in range(start_idx, end_idx):
                            gathered[idx] = None


⚠️ Potential issue | 🟠 Major

Guard TP+FSDP gradient merge against invalid world_size/tp_degree combinations

The three-way split for gradient assembly (pure FSDP, pure TP, TP+FSDP hybrid) is a good structure, but the TP branches currently assume a valid layout:

  • For hybrid mode you use num_fsdp_groups = self.world_size // tp_degree and slice gathered[start_idx:end_idx] with tp_degree shards per group.
  • If self.tp_degree > 1 but world_size < tp_degree, you get num_fsdp_groups == 0, so gradient stays empty, yet you still proceed and upload.
  • If world_size > tp_degree but world_size % tp_degree != 0 you’ll silently ignore any ranks in the “remainder” group (e.g. world_size=6, tp_degree=4 processes only shards 0–3 and drops 4–5).

Given you already treat invalid TP layouts as “no-TP” elsewhere, I’d strongly recommend normalizing tp_degree here as well:

Suggested guard for invalid TP layout in gradient merge
-            tp_degree = self.tp_degree if self.tp_degree > 1 else 1
+            # Normalise TP degree: fall back to 1 (pure FSDP) if layout is invalid
+            raw_tp = self.tp_degree if self.tp_degree > 1 else 1
+            if (
+                raw_tp > 1
+                and (self.world_size < raw_tp or self.world_size % raw_tp != 0)
+            ):
+                tplr.logger.warning(
+                    f"[TP Merge] Invalid TP layout: world_size={self.world_size}, "
+                    f"tp_degree={raw_tp}. Falling back to FSDP-style merge."
+                )
+                tp_degree = 1
+            else:
+                tp_degree = raw_tp

This way invalid TP_DEGREE / world_size pairings don’t silently drop shards or produce an empty gradient; you fall back to the safe FSDP merge path with an explicit warning.

🤖 Prompt for AI Agents
In @neurons/miner.py around lines 695-737, When assembling gradients, normalize
the local tp_degree and guard the TP branches to avoid dropping shards: check
the computed tp_degree = self.tp_degree if self.tp_degree > 1 else 1 and if
tp_degree > self.world_size or self.world_size % tp_degree != 0, fall back to
tp_degree = 1 (and emit a warning via self.logger or processLogger) so the code
uses the safe FSDP merge; ensure branches that use
_reconstruct_tp_gradients(gathered) and the hybrid path that computes
num_fsdp_groups = self.world_size // tp_degree treat num_fsdp_groups == 0 as
fallback to the pure FSDP merge, freeing processed gathered entries only for
valid slices and not silently ignoring remainder ranks.
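The normalization the suggestion asks for is a small pure function (hypothetical name, distilled from the suggested guard above):

```python
def normalize_tp_degree(tp_degree, world_size):
    # Fall back to the safe FSDP merge (tp=1) whenever the requested TP
    # layout cannot tile world_size evenly.
    tp = tp_degree if tp_degree > 1 else 1
    if tp > 1 and (world_size < tp or world_size % tp != 0):
        return 1
    return tp
```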

Comment on lines +55 to +62
# Effective TP degree with env override; fallback to hparams (prefer torchtitan.tp_degree)
tt = getattr(self.hparams, "torchtitan", None)
base_tp = (
    getattr(tt, "tp_degree", None)
    if tt
    else getattr(self.hparams, "tp_degree", 1)
)
self.tp_degree = max(1, int(os.environ.get("TP_DEGREE", base_tp)))

⚠️ Potential issue | 🟠 Major

Harden tp_degree parsing to avoid None/invalid env causing crashes

The new tp_degree logic is good, but there’s a corner case:

  • tt = getattr(self.hparams, "torchtitan", None)
  • base_tp = getattr(tt, "tp_degree", None) when tt exists but has no tp_degree
  • If TP_DEGREE is unset, os.environ.get("TP_DEGREE", base_tp) may return None
  • int(None) will raise a TypeError

You also don’t guard against non‑integer TP_DEGREE here, unlike the more robust handling in Validator.__init__.

I’d recommend mirroring the validator logic and ensuring there’s always a sane integer fallback:

Suggested fix for robust TP degree parsing
-        tt = getattr(self.hparams, "torchtitan", None)
-        base_tp = (
-            getattr(tt, "tp_degree", None)
-            if tt
-            else getattr(self.hparams, "tp_degree", 1)
-        )
-        self.tp_degree = max(1, int(os.environ.get("TP_DEGREE", base_tp)))
+        tt = getattr(self.hparams, "torchtitan", None)
+        # Prefer torchtitan.tp_degree, then hparams.tp_degree, then 1
+        default_tp = 1
+        if tt is not None:
+            default_tp = getattr(tt, "tp_degree", getattr(self.hparams, "tp_degree", 1))
+        else:
+            default_tp = getattr(self.hparams, "tp_degree", 1)
+
+        raw_tp = os.environ.get("TP_DEGREE")
+        if raw_tp is not None:
+            try:
+                self.tp_degree = max(1, int(raw_tp))
+            except (ValueError, TypeError):
+                tplr.logger.warning(
+                    f"Invalid TP_DEGREE='{raw_tp}', falling back to default tp_degree={default_tp}"
+                )
+                self.tp_degree = max(1, int(default_tp))
+        else:
+            self.tp_degree = max(1, int(default_tp))

This also keeps TP_DEGREE parsing consistent between trainer and validator and avoids surprising crashes from config/env drift.

Also applies to: 72-73

🤖 Prompt for AI Agents
In @neurons/trainer.py around lines 55-62, The TP degree parsing can return None
or raise on invalid env values; update the block that sets self.tp_degree (where
tt = getattr(self.hparams, "torchtitan", None) and base_tp = getattr(tt,
"tp_degree", None)) to mirror Validator.__init__: ensure base_tp defaults to 1
if missing/None, read env_val = os.environ.get("TP_DEGREE"), if env_val is not
None try to parse int(env_val) in a try/except and fall back to base_tp on
failure, then coerce to max(1, parsed_value); apply the same robust parsing to
the other occurrence around lines 72-73 so both Trainer tp_degree assignments
use the same safe logic.

Comment on lines +20 to +22
# Add parent directory to path to import tplr
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
import tplr

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if tplr exists as a directory in the repo
find . -name "tplr" -type d 2>/dev/null | head -20
find . -name "tplr.py" -type f 2>/dev/null | head -20

# List files in scripts directory to understand context
ls -la scripts/ 2>/dev/null | head -20

# Show the structure at repo root
ls -la ./ | grep -E "^d|tplr"

# Check the actual abort_multipart_uploads.py file around lines 20-22
echo "=== Content around lines 20-22 ==="
sed -n '15,25p' scripts/abort_multipart_uploads.py

Repository: one-covenant/templar

Length of output: 2371


🏁 Script executed:

#!/bin/bash
# Check the structure of src/tplr
echo "=== Content of src/tplr ==="
ls -la src/tplr/ | head -20

# Check if there's a setup.py or pyproject.toml at repo root
echo "=== Python setup files in repo root ==="
ls -la | grep -E "setup\.(py|cfg)|pyproject\.toml|__init__"

# Check if there's an __init__.py in src or src/tplr
echo "=== Check for __init__.py files ==="
find src -name "__init__.py" -type f

# Check sys.path imports in other scripts to see the pattern
echo "=== Check how other scripts import tplr ==="
grep -r "sys.path" scripts/*.py 2>/dev/null
grep -r "import tplr" scripts/*.py 2>/dev/null

Repository: one-covenant/templar

Length of output: 2663


🏁 Script executed:

#!/bin/bash
# Check pyproject.toml for package configuration
cat pyproject.toml

Repository: one-covenant/templar

Length of output: 2386


Fix incorrect sys.path manipulation for tplr import.

The code adds only the script's own directory (scripts/) to sys.path, but tplr is located at src/tplr/. The comment "Add parent directory" is misleading—the code should add the repository root instead. This pattern is already used correctly in other scripts in the same directory.

Proposed fix
 # Add parent directory to path to import tplr
-sys.path.append(os.path.dirname(os.path.abspath(__file__)))
+sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 import tplr
🤖 Prompt for AI Agents
In @scripts/abort_multipart_uploads.py around lines 20-22, The script
incorrectly appends its own directory to sys.path when trying to import tplr;
change it to add the repository root (the parent of the scripts directory)
instead and use sys.path.insert(0, ...) so the project package is resolved
first. Replace the current
sys.path.append(os.path.dirname(os.path.abspath(__file__))) logic used before
the import tplr with code that computes the repo root (the parent directory of
the scripts folder) and inserts that root at the front of sys.path, and update
the comment to say "Add repository root to path to import tplr" to match other
scripts.

Comment on lines +57 to +59
try:
    # List all multipart uploads
    response = await client.list_multipart_uploads(Bucket=account_id)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, examine the full context of the file around lines 45-59
echo "=== Full context of abort_multipart_uploads.py (lines 40-70) ==="
cat -n scripts/abort_multipart_uploads.py | sed -n '40,70p'

Repository: one-covenant/templar

Length of output: 1412


🏁 Script executed:

#!/bin/bash
# Search for R2_GRADIENTS references
echo "=== R2_GRADIENTS references in Python files ==="
rg -n 'R2_GRADIENTS' --type py -C3

echo ""
echo "=== All Bucket= references in Python files ==="
rg -n 'Bucket=' --type py -C2

Repository: one-covenant/templar

Length of output: 44116


🏁 Script executed:

#!/bin/bash
# Check for bucket configuration patterns in the codebase
echo "=== Search for bucket configuration patterns ==="
rg -n 'bucket' --type py -i | head -40

Repository: one-covenant/templar

Length of output: 3109


Add missing R2_GRADIENTS_BUCKET_NAME environment variable and use it instead of account_id for bucket operations.

Lines 59 and 77 incorrectly use Bucket=account_id. The account ID (used in the endpoint URL) is different from the bucket name. The repository defines R2_GRADIENTS_BUCKET_NAME as a required environment variable (see config.py and test fixtures), but this script doesn't validate or use it. Add the missing environment variable validation and replace all occurrences of Bucket=account_id with Bucket=bucket_name.

Example fix
     required_vars = [
         "R2_GRADIENTS_ACCOUNT_ID",
+        "R2_GRADIENTS_BUCKET_NAME",
         "R2_GRADIENTS_WRITE_ACCESS_KEY_ID",
         "R2_GRADIENTS_WRITE_SECRET_ACCESS_KEY",
     ]

     # Get credentials from environment
     account_id = os.environ["R2_GRADIENTS_ACCOUNT_ID"]
+    bucket_name = os.environ["R2_GRADIENTS_BUCKET_NAME"]
     access_key_id = os.environ["R2_GRADIENTS_WRITE_ACCESS_KEY_ID"]
     secret_access_key = os.environ["R2_GRADIENTS_WRITE_SECRET_ACCESS_KEY"]

-            response = await client.list_multipart_uploads(Bucket=account_id)
+            response = await client.list_multipart_uploads(Bucket=bucket_name)

-                    await client.abort_multipart_upload(Bucket=account_id, Key=key, UploadId=upload_id)
+                    await client.abort_multipart_upload(Bucket=bucket_name, Key=key, UploadId=upload_id)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In @scripts/abort_multipart_uploads.py around lines 57-59, The script is using
account_id as the S3 bucket name; instead validate and use the
R2_GRADIENTS_BUCKET_NAME environment variable instead: add retrieval and
validation of R2_GRADIENTS_BUCKET_NAME at script startup, error out if missing,
then replace all occurrences of Bucket=account_id (e.g., in
client.list_multipart_uploads and the subsequent abort/delete calls around the
same block) with Bucket=bucket_name (or whatever local variable you create from
R2_GRADIENTS_BUCKET_NAME) so bucket operations target the actual gradients
bucket rather than the account ID.

Comment on lines +75 to +78
try:
    await client.abort_multipart_upload(
        Bucket=account_id, Key=key, UploadId=upload_id
    )

⚠️ Potential issue | 🔴 Critical

Same bucket name issue in abort call.

Consistent with the earlier issue, Bucket=account_id should use the actual bucket name.

🔎 Proposed fix
                     await client.abort_multipart_upload(
-                        Bucket=account_id, Key=key, UploadId=upload_id
+                        Bucket=bucket_name, Key=key, UploadId=upload_id
                     )
🤖 Prompt for AI Agents
In @scripts/abort_multipart_uploads.py around lines 75-78, The
abort_multipart_upload call is passing the wrong value for Bucket (using
account_id); change Bucket=account_id to use the actual S3 bucket variable used
elsewhere in the script (e.g., Bucket=bucket or Bucket=bucket_name) so
client.abort_multipart_upload(Bucket=<actual_bucket_variable>, Key=key,
UploadId=upload_id) supplies the real bucket name rather than the account_id.

# === CONNECTION BEHAVIOUR ===
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_congestion_control = bbr

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n scripts/net-tuning.sh

Repository: one-covenant/templar

Length of output: 1919


🏁 Script executed:

# Check if BBR is available in the system kernel
grep -r "tcp_bbr" /lib/modules/ 2>/dev/null | head -5

Repository: one-covenant/templar

Length of output: 46


🏁 Script executed:

# Test what happens when sysctl tries to set an unavailable congestion control
# First, let's see what congestion controls are available on this system
sysctl net.ipv4.tcp_available_congestion_control 2>/dev/null || echo "Not available in this sandbox"

Repository: one-covenant/templar

Length of output: 93


🏁 Script executed:

# Check kernel documentation and common behavior of tcp_congestion_control
# Look for any error handling or documentation in the script
head -50 scripts/net-tuning.sh

Repository: one-covenant/templar

Length of output: 1450


🌐 Web query:

BBR tcp congestion control kernel requirements Linux availability

💡 Result:

Short answer

  • BBR (BBRv1) is in mainline Linux since kernel 4.9 (CONFIG_TCP_CONG_BBR → module tcp_bbr). It’s available in most modern distro kernels; enable it by loading tcp_bbr, setting fq qdisc and net.ipv4.tcp_congestion_control=bbr. [1][2][3]

BBRv2 status

  • BBRv2 started as an out‑of‑tree Google project and required patched kernels; upstreaming has been worked on but distribution support is limited/gradual (some kernels/stable trees may enable CONFIG_TCP_CONG_BBR2 but it is not universally present in all distro LTS kernels as of recent reports). Check your distro/kernel for CONFIG_TCP_CONG_BBR2 if you need BBRv2. [4][5]

Practical checks & enable (commands)

  • Check kernel version:
    uname -r
  • Check available/active congestion control:
    cat /proc/sys/net/ipv4/tcp_available_congestion_control
    sysctl net.ipv4.tcp_congestion_control
  • Load & enable BBR (example):
    sudo modprobe tcp_bbr
    sudo sysctl -w net.core.default_qdisc=fq
    sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
  • Verify module:
    lsmod | grep tcp_bbr
    ss -tin (look for bbr in use). [2][3]

References

  1. LKDDb / kernel config info for CONFIG_TCP_CONG_BBR (shows presence since 4.9). [6]
  2. How‑to guides showing kernel requirement >= 4.9 and enable steps (examples). [2][3]
  3. Tutorials / writeups about enabling/verifying BBR on Linux. [3][5]
  4. BBRv2 development/upstream discussion and deployment notes (BBRv2 was out‑of‑tree and upstreaming has been discussed; distro enablement varies). [4][7]

If you want, tell me your distribution and uname -r and I’ll check whether your kernel likely provides BBR and whether BBRv2 is available.


🌐 Web query:

sysctl --system behavior unavailable parameter error handling

💡 Result:

Short summary

  • sysctl (procps) when run as "sysctl --system" tries to apply every key from /etc/sysctl.d/*.d, /run/sysctl.d and /etc/sysctl.conf; it prints an error line for each key it cannot set (Invalid argument, Permission denied, No such file or directory) and will return a non‑zero exit status when one or more entries fail. [1][2]

  • systemd’s service (systemd-sysctl) which applies the same files at boot behaves similarly: if applying one or more settings fails the service exits with failure (systemd-sysctl.service shows Result: exit-code). This commonly surfaces in containers or during kernel/module timing where a sysctl key doesn’t exist yet or /proc/sys is read‑only. [3][4]

Common causes

  • Key not present because kernel/config/module not loaded (Invalid argument / No such file).
  • /proc/sys read‑only (containers, early boot) -> Permission or read‑only errors.
  • Different implementations/wildcard support between procps' sysctl and systemd-sysctl can change visible behavior. [1][3]

Practical fixes / workarounds

  • Remove or guard problematic keys: only set keys that your kernel supports, or move distribution/device-specific keys out of the global sysctl files. (Best fix.)
  • Delay or ensure module is loaded before applying the key (e.g., load module or use a systemd unit that runs after the module). [2]
  • Ignore sysctl failures when running interactively or in scripts:
    • run sysctl --system 2>/dev/null || true
    • or redirect only known error messages: sysctl --system 2>&1 | grep -v 'Invalid argument'
  • For container installs, skip applying sysctls or ensure /proc/sys is writable, or use container-aware packaging that avoids failing on sysctl errors. [4][5]

If you want, tell me your distro and whether this happens at boot or when you run sysctl manually and I’ll give the exact file(s) to edit and the safest command to use.

Sources
[1] systemd/systemd issue discussing differences and errors from sysctl --system.
[2] Debian bug discussion about systemd-sysctl failing to apply config at boot.
[3] EndeavourOS forum thread showing systemd-sysctl service failing with exit-code when apply fails.
[4] GitLab/GitHub issues reporting failures in container/read‑only /sys when sysctl is applied.


Document BBR kernel requirement or make it optional for broader compatibility.

BBR requires Linux kernel 4.9+ and the tcp_bbr module. On systems without support, sysctl --system will fail with an "Invalid argument" error due to the set -euo pipefail at the top of the script. Rather than adding an unreliable modprobe check, consider:

  • Adding a comment documenting the kernel 4.9+ requirement
  • Making BBR optional by wrapping it in a conditional or removing it if broader compatibility is needed
  • Allowing sysctl errors to be suppressed if BBR is not critical: sysctl --system 2>/dev/null || true
🤖 Prompt for AI Agents
In @scripts/net-tuning.sh around line 33, The script sets
"net.ipv4.tcp_congestion_control = bbr" unconditionally which will fail on
kernels lacking BBR; either document the kernel 4.9+ and tcp_bbr requirement
near the top of the script (next to the "set -euo pipefail" line) or make the
BBR setting conditional: check for kernel support (e.g., existence/readability
of /proc/sys/net/ipv4/tcp_congestion_control or tcp_bbr in /sys/modules) before
applying the sysctl, or if you prefer to keep it non-fatal suppress sysctl
errors for this call by running the system-wide apply with error suppression
(e.g., sysctl --system 2>/dev/null || true) so the script won't exit on "Invalid
argument".

Comment on lines +34 to 41
# Cache DTensor type for fast isinstance checks
try:
    from torch.distributed._tensor import DTensor as _DTensor

    _DTENSOR_TYPE = _DTensor
except ImportError:
    _DTENSOR_TYPE = DT
from torch.optim.lr_scheduler import LRScheduler

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

PyTorch 2.7.x torch.distributed._tensor.DTensor import compatibility

💡 Result:

Short answer: In PyTorch 2.7.x DTensor lives under torch.distributed.tensor (the public namespace). Code that imports torch.distributed._tensor.DTensor (the older internal path) should switch to torch.distributed.tensor.DTensor or use a small compatibility try/except fallback.

Recommended compatibility snippet:

try:
    from torch.distributed.tensor import DTensor  # PyTorch ≥ ~2.5 (public API)
except Exception:
    from torch.distributed._tensor import DTensor  # older internal path (fallback)

Sources: PyTorch distributed.tensor docs (DTensor) and upstream repo/docs showing the new torch.distributed.tensor namespace. [1][2]


🌐 Web query:

PyTorch DTensor public API torch.distributed.tensor vs private _tensor module

💡 Result:

Short answer

  • torch.distributed._tensor was the older, private import path for DTensor (used in PyTorch ≤2.4). Use of it in libraries caused breakage when the DTensor import path changed. [1][2]
  • torch.distributed.tensor is the public DTensor API (current documented entrypoint; includes DTensor, distribute_tensor, DeviceMesh, placements, factory functions, etc.). Prefer this public module for new code. [1]
  • If you need to support older PyTorch releases, detect torch.version at runtime and fall back to importing from torch.distributed._tensor when torch.distributed.tensor is unavailable. Several projects (Transformers, etc.) hit this exact compatibility issue. [2][3]

References

  • PyTorch DTensor (public) docs: torch.distributed.tensor. [1]
  • HuggingFace issues showing import-path changes / compatibility breakage (DTensor moved from torch.distributed._tensor → torch.distributed.tensor). [2]
  • Related GitHub discussion / compatibility reports. [3]

🌐 Web query:

PyTorch 2.7 DTensor isinstance checks compatibility

💡 Result:

Short answer: Many projects saw breakage around PyTorch 2.5+ because DTensor's import/location and runtime type changed, so isinstance(obj, DTensor) can fail across PyTorch versions. To be robust on PyTorch 2.4→2.7 you should either (A) import whichever DTensor symbol exists at runtime and use that for isinstance, or (B) avoid strict isinstance checks and use a small duck‑typing test (e.g. check for attributes like device_mesh/placements/local_tensor) instead. [1][2][3][4]

How to implement (concise):

  • Safe import that covers versions:
    try:
        from torch.distributed.tensor import DTensor
    except Exception:
        from torch.distributed._tensor import DTensor

    if isinstance(x, DTensor): ...

    (This covers the PyTorch change introduced around 2.5 and used in later docs/2.7.) [2][4]

  • Duck‑typing fallback (recommended when interacting with third‑party code that may mix DTensor/Tensor):
    if hasattr(x, "device_mesh") and hasattr(x, "placements"):
        # treat as DTensor-like

Why: DTensor was relocated/renamed in the PyTorch codebase (breaking imports and type identity for some code paths), and there are active bug reports where isinstance checks on DTensor return False in mixed contexts. Using a safe import or duck‑typing avoids these compatibility problems. [2][3][4]

Sources:

  • PyTorch DTensor docs (2.7). [1]
  • Transformer / ecosystem issues discussing DTensor import/path changes since PyTorch 2.5. [2][5]
  • PyTorch GitHub issues showing runtime isinstance / mixed Tensor vs DTensor problems. [3][4]

🏁 Script executed:

grep -n "^DT\s*=" src/tplr/neurons.py | head -5

Repository: one-covenant/templar

Length of output: 46


🏁 Script executed:

rg "^\s*(from|import).*\bDT\b" src/tplr/neurons.py -A 2 -B 2

Repository: one-covenant/templar

Length of output: 258


🏁 Script executed:

rg "DT\s*=" src/tplr/neurons.py

Repository: one-covenant/templar

Length of output: 46


Update DTensor import to prioritize the public API for PyTorch 2.7.x compatibility

The current code tries torch.distributed._tensor.DTensor (private, legacy path from PyTorch ≤2.4) before falling back to the public torch.distributed.tensor.DTensor alias. In PyTorch 2.5+, the private path is deprecated in favor of the public torch.distributed.tensor namespace.

Reverse the import order to try the public API first:

Suggested import structure
try:
    from torch.distributed.tensor import DTensor as _DTensor  # public API (PyTorch >= 2.5)
    _DTENSOR_TYPE = _DTensor
except ImportError:
    from torch.distributed._tensor import DTensor as _DTensor  # legacy private path
    _DTENSOR_TYPE = _DTensor

This ensures compatibility across PyTorch 2.4 (private path) through 2.7+ (public path) and reduces the risk of isinstance check failures that can occur with mixed import paths in some contexts.

🤖 Prompt for AI Agents
In @src/tplr/neurons.py around lines 34-41, Reverse the DTensor import order so
the public API is tried first: attempt to import DTensor from
torch.distributed.tensor and assign it to _DTENSOR_TYPE, and only on ImportError
fall back to the legacy/private torch.distributed._tensor.DTensor; update the
try/except around the DTensor import (affecting the _DTENSOR_TYPE binding) so
isinstance checks use the public DTensor when available.

Comment on lines +393 to +401
# Extract miner's xshapes/totalks from metadata if available (for cross-TP-degree compatibility)
miner_xshapes = None
miner_totalks = None
if on_src and src_sd is not None:
    metadata = src_sd.get("metadata", {})
    if isinstance(metadata, dict):
        miner_xshapes = metadata.get("xshapes")
        miner_totalks = metadata.get("totalks")


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the file structure and the flagged lines
git ls-files src/tplr/neurons.py

Repository: one-covenant/templar

Length of output: 83


🏁 Script executed:

# Check the size of the file
wc -l src/tplr/neurons.py

Repository: one-covenant/templar

Length of output: 88


🏁 Script executed:

# Read the relevant sections: lines 393-401 (snippet context) and the flagged ranges
sed -n '393,401p' src/tplr/neurons.py

Repository: one-covenant/templar

Length of output: 442


🏁 Script executed:

# Read the broader context around the critical lines mentioned
sed -n '467,500p' src/tplr/neurons.py

Repository: one-covenant/templar

Length of output: 1992


🏁 Script executed:

# Also check the other flagged ranges to understand the pattern
sed -n '521,577p' src/tplr/neurons.py

Repository: one-covenant/templar

Length of output: 2300


🌐 Web query:

PyTorch DTensor distributed collective operations synchronization deadlock rank divergence

💡 Result:

Short explanation

  • Deadlocks in DTensor collectives happen when ranks diverge in the sequence, arguments, or device selection for collective ops (e.g., one rank calls all_gather while another calls nothing or a different collective), or when communicators/device choices differ (NCCL choosing different GPU indices). DTensor redistribution (redistribute(), full_tensor(), etc.) implicitly issues collectives (all_gather, all_to_all, all_reduce, reduce_scatter), so any rank-level divergence during a redistribute will deadlock. [1][2][3]

Common root causes

  • Rank divergence: different control‑flow or code paths cause some ranks to skip or reorder collectives. [1][4]
  • Mismatched device selection / NCCL communicator creation: e.g., first collective picks different CUDA device per rank causing one rank to create a communicator while another does not. [3]
  • Forgetting to wait on async ops or incorrectly mixing synchronous/asynchronous semantics across ranks. [1][7]

Quick debugging steps (use in this order)

  1. Insert a host-side barrier or monitored_barrier before the DTensor collective to detect which ranks are desynchronized (monitored_barrier reports non-responding ranks). Use torch.distributed.monitored_barrier for debugging. [1]
  2. Ensure all ranks execute the same sequence of collectives with identical group arguments and placements. Audit control flow and any early exits. [1][7]
  3. Check DTensor placements/DeviceMesh are identical on all ranks before calling redistribute() or full_tensor(). DTensor redistributes map to specific collectives—any mismatch causes hang. [1]
  4. If using CUDA/NCCL, ensure consistent device choice across ranks (use local_rank mapping, set CUDA_VISIBLE_DEVICES, or pass device_ids explicitly to barrier/collectives). Avoid relying on global rank % num_gpus heuristics. [3]
  5. If using async_op=True, always capture the returned Work and call wait() / block_current_stream() at appropriate points; otherwise ranks may proceed inconsistently. [1]
  6. Reproduce with a minimal script and run with 2–4 ranks locally (makes debugging easier). Use logging to show which collectives each rank reaches. [4]

Preventive best practices

  • Make collectives deterministic in code path: avoid rank-specific branches around communication.
  • Warm up process group with a simple all_reduce/all_gather on a known device after init_process_group to stabilize device/communicator choices. [3]
  • Use monitored_barrier before complex redistributes in development. [1]
  • Prefer explicit device arguments (device_ids or using local_rank) when calling collectives/barrier on CUDA to avoid implicit different-device selection. [3]

References

  • PyTorch distributed docs (barrier, monitored_barrier, Work semantics). [1]
  • PyTorch DTensor API / redistribute behavior (which collectives are used). [2]
  • GitHub issue on NCCL/barrier hang and device-selection problem (examples & workarounds like device_ids / CUDA_VISIBLE_DEVICES). [3]
  • Forum posts and community debugging examples about broadcast/all_gather deadlocks. [4]

If you want, I can: (a) analyze a minimal repro you have and point to the exact divergence, or (b) produce a minimal test script that demonstrates common DTensor redistribute deadlocks and how to fix them.
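The hang mechanism described above can be illustrated outside torch with a toy single-process simulation: a `threading.Barrier` stands in for a 2-rank collective (this is not real NCCL/DTensor behavior, just the control-flow shape of the bug).

```python
import threading

# Toy stand-in for a 2-rank collective: a Barrier that needs both parties.
barrier = threading.Barrier(parties=2)
results = {}

def rank_fn(rank: int) -> None:
    if rank == 0:
        # Rank-divergent control flow: rank 0 takes an early-exit path
        # and never reaches the "collective".
        results[rank] = "skipped"
        return
    try:
        # Rank 1 waits for a partner that never arrives. A real NCCL
        # collective would hang forever; here we time out instead.
        barrier.wait(timeout=0.5)
        results[rank] = "joined"
    except threading.BrokenBarrierError:
        results[rank] = "deadlocked"

threads = [threading.Thread(target=rank_fn, args=(r,)) for r in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert results == {0: "skipped", 1: "deadlocked"}
```

The timeout converts the indefinite hang into a `BrokenBarrierError`, which is exactly what `monitored_barrier` gives you in real `torch.distributed` debugging.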


Avoid rank-divergent continue paths in outer_step to prevent collective deadlocks

outer_step carries shard_metadata and uses miner/validator xshapes/totalks correctly, but error paths after _bcast_flag(has_update) cause rank divergence in collective calls:

  • Lines 531–542: if on_src and src_tensor.shape != p.shape for a DTensor, you continue without calling distribute_tensor().
  • Lines 569–576: if on_src and full_grad_src.shape != p.shape for a replicated param, you continue without calling dist.broadcast().

On non-source ranks, on_src is False, so they skip these conditions and proceed directly to the collective operation. This leaves the source rank having already skipped while the non-source ranks attempt to join a collective that will never complete, causing NCCL/DTensor deadlocks.

All ranks must execute the same sequence of collective calls. Replace rank-specific continue statements with a broadcasted skip_param flag that all ranks evaluate uniformly:

Safe rank-aligned skip pattern
skip_param = 0
if on_src:
    # ... build full_grad_src, check validity ...
    if shape_mismatch_or_unreconstructed:
        skip_param = 1

skip_param = _bcast_flag(skip_param)
if skip_param:
    # All ranks skip together
    if on_src and full_grad_src is not None:
        del full_grad_src
    continue

# Now all ranks know payload is valid; execute collectives in lockstep
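The pattern above can be exercised in a toy single-process simulation, with threads as "ranks", a shared dict standing in for the `_bcast_flag` broadcast, and a `Barrier` standing in for the collective (a sketch of the control flow only, not real torch.distributed code).

```python
import threading

NUM_RANKS = 4
SRC = 0
collective = threading.Barrier(parties=NUM_RANKS)  # stand-in for dist.broadcast
flag_ready = threading.Event()
shared = {"skip_param": 0}  # stand-in for the broadcast skip flag
collective_calls = {r: 0 for r in range(NUM_RANKS)}

def outer_step(rank: int, src_sees_mismatch: bool) -> None:
    # Only the source rank inspects the payload; its verdict is then
    # "broadcast" so every rank takes the same branch.
    if rank == SRC:
        shared["skip_param"] = 1 if src_sees_mismatch else 0
        flag_ready.set()
    flag_ready.wait()  # every rank receives the same decision
    if shared["skip_param"]:
        return  # all ranks skip together: no one is left waiting
    collective.wait(timeout=1.0)  # all ranks reach the collective in lockstep
    collective_calls[rank] += 1

threads = [
    threading.Thread(target=outer_step, args=(r, True)) for r in range(NUM_RANKS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The source saw a mismatch, so every rank skipped and nothing deadlocked.
assert all(n == 0 for n in collective_calls.values())
```

Because the skip decision is made once and shared, no rank can reach the collective while another has bailed out, which is the invariant the review asks for.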
🤖 Prompt for AI Agents
In @src/tplr/neurons.py around lines 393-401, outer_step currently uses
rank-specific continue paths after _bcast_flag(has_update) which causes
collective divergence; replace those continues with a rank-aligned skip flag.
Specifically, where you check on_src and compare src_tensor.shape != p.shape
(and similarly full_grad_src.shape != p.shape for replicated params), do not
continue on the source only; instead set a local skip_param = 1 when the source
detects a mismatch/unreconstructable tensor (leave skip_param = 0 otherwise),
then call skip_param = _bcast_flag(skip_param) so all ranks receive the same
decision; if skip_param is true then (a) on_src free any full_grad_src and/or
src_tensor as needed and (b) continue — otherwise proceed to call
distribute_tensor() / dist.broadcast() and the other collectives in lockstep.
Ensure this pattern is applied around the checks that currently reference
src_tensor, full_grad_src, p, distribute_tensor(), and dist.broadcast() so all
ranks execute the same collectives.



@pytest.mark.asyncio
@pytest.mark.skip(reason="Flaky test: CUDA model comparison has precision issues")

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "*.py" | xargs grep -l "def compare_model_with_debug_dict" | head -5

Repository: one-covenant/templar

Length of output: 85


🏁 Script executed:

# Check the implementation of compare_model_with_debug_dict
rg -n -B 2 -A 15 'def compare_model_with_debug_dict' --type py

Repository: one-covenant/templar

Length of output: 1014


🏁 Script executed:

# Check the test file context around lines 160-228
cat -n tests/test_model_comparison.py | sed -n '155,230p'

Repository: one-covenant/templar

Length of output: 3410


🏁 Script executed:

# Check the setup_model fixture to confirm the uniform fill values
cat -n tests/test_model_comparison.py | sed -n '25,50p'

Repository: one-covenant/templar

Length of output: 1025


🏁 Script executed:

# Check if the index_range parameter is used in compare_model_with_debug_dict
rg -n -A 5 'index_range' src/tplr/neurons.py | head -30

Repository: one-covenant/templar

Length of output: 623


The skip reason is accurate; verify the test logic in test_custom_index_range instead.

The skip marker at line 160 is correct—test_different_devices uses [:2] (first 2 elements) at line 172, which matches the implementation's default index_range=(0, 2). No index mismatch exists.

However, test_custom_index_range (lines 196–228) has a critical flaw: it creates debug_dict with the last 2 elements (param_flat[-2:] at line 206) but calls compare_model_with_debug_dict without specifying an index_range parameter (line 209). The function defaults to index_range=(0, 2), causing it to compare the first 2 elements against the last 2 elements in the debug dict. The test passes only because the fixture uses uniform parameter values (all weights filled with a single constant per parameter), making the first 2 and last 2 elements identical. This defeats the test's stated purpose: to verify "last 2 elements" sampling. Either pass index_range=(-2, 0) explicitly at line 209, or use non-uniform parameter values in the fixture to expose the mismatch.
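The false pass can be seen with plain list slicing (illustrative values, not the actual fixture):

```python
# Uniform fixture values: the first-2 and last-2 slices are identical, so the
# default index_range=(0, 2) accidentally matches a debug dict built from
# param_flat[-2:].
param_flat = [0.1, 0.1, 0.1, 0.1]
debug_vals = param_flat[-2:]   # what the test stores
start, end = 0, 2              # the function's default index_range
assert param_flat[start:end] == debug_vals  # false pass

# Non-uniform values expose the index mismatch immediately.
param_flat = [0.1, 0.2, 0.3, 0.4]
debug_vals = param_flat[-2:]
assert param_flat[start:end] != debug_vals  # the comparison would now fail
```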

🤖 Prompt for AI Agents
In @tests/test_model_comparison.py around line 160, test_custom_index_range
builds debug_dict using param_flat[-2:] but calls compare_model_with_debug_dict
without index_range, so the default (0,2) compares first 2 elements to the last
2; fix by either passing index_range=(-2,0) to compare_model_with_debug_dict in
test_custom_index_range so it samples the last two elements as intended
(reference: compare_model_with_debug_dict call and param_flat[-2:]), or change
the test fixture to use non-uniform parameter values so that comparing first vs
last elements would fail and reveal the bug.

Comment on lines +197 to 228
    """Test that the function samples from the last 2 elements of parameters (TP-compatible)."""
    model = setup_model
    learning_rate = 0.01

-   # Modify specific indices of model parameters to have different values
-   with torch.no_grad():
-       # Check if parameter is large enough for our test
-       param = model.linear1.weight.data.flatten()
-       if param.numel() > 7:
-           # Set indices 0-2 to 0.1 (default)
-           param[0:2] = 0.1
-           # Set indices 5-7 to 0.5 (different value)
-           param[5:7] = 0.5

-   # Create debug dict with exact matches for both default and custom indices
+   # Create debug dict with last 2 elements (matches to_local() behavior)
    debug_dict = {}
    for name, param in model.named_parameters():
        param_flat = param.flatten()
-       num_elements = param_flat.numel()
-       debug_dict[name + "_debug"] = param_flat[:2].detach().cpu().tolist()
-
-   # Test with default indices (0, 2) - should be an exact match
-   default_result = await compare_model_with_debug_dict(
-       model, debug_dict, learning_rate
-   )
-
-   # Should show exact match because we're checking default indices
-   assert default_result["avg_steps_behind"] == pytest.approx(0.0, abs=1e-6)
-
-   # Skip further tests if the parameters aren't large enough
-   first_param = next(iter(model.parameters()))
-   if first_param.numel() <= 7:
-       pytest.skip("Model parameters too small for advanced index testing")
-
-   # Create a new debug dict for the custom indices test
-   custom_debug_dict = {}
-   for name, param in model.named_parameters():
-       param_flat = param.flatten()
-       num_elements = param_flat.numel()
-
-       # Only include parameters with enough elements for the custom range
-       if num_elements >= 7:
-           # Include values that will match the custom range too
-           values = param_flat[5:7].detach().cpu().tolist()
-           custom_debug_dict[name + "_debug"] = values
+       # Sample last 2 elements to match the new to_local() implementation
+       debug_dict[name + "_debug"] = param_flat[-2:].detach().cpu().tolist()

-   # Test with a custom range that should match
-   matching_result = await compare_model_with_debug_dict(
-       model, custom_debug_dict, learning_rate, index_range=(5, 7)
-   )
+   # Test that comparison matches when using last 2 elements
+   result = await compare_model_with_debug_dict(model, debug_dict, learning_rate)

-   # Should be an exact match
-   assert matching_result["param_count"] > 0  # Ensure we actually compared something
-   assert matching_result["avg_steps_behind"] == pytest.approx(0.0, abs=1e-6)
+   # Should show exact match because we're using last 2 elements consistently
+   assert result["avg_steps_behind"] == pytest.approx(0.0, abs=1e-6)

    # Test with mismatched values to ensure comparison detects differences
    mismatched_debug_dict = {}
    for name, param in model.named_parameters():
-       if name == "linear1.weight":
-           param_flat = param.flatten()
-           if param_flat.numel() >= 7:
-               # mismatch values
-               mismatched_debug_dict[name + "_debug"] = [0.3, 0.3]
+       param_flat = param.flatten()
+       # Create mismatched values (different from actual last 2 elements)
+       if param_flat.numel() >= 2:
+           mismatched_debug_dict[name + "_debug"] = [0.999, 0.999]

-   # Test with custom indices (5, 7) - should detect the difference
    mismatched_result = await compare_model_with_debug_dict(
-       model, mismatched_debug_dict, learning_rate, index_range=(5, 7)
+       model, mismatched_debug_dict, learning_rate
    )

-   # Verify we actually compared parameters
+   # Should detect the mismatch
    assert mismatched_result["param_count"] > 0

-   # Calculate expected difference and verify it matches actual difference
-   expected_diff = (
-       0.5 - 0.3
-   ) / learning_rate  # Difference of 0.2, normalized by learning rate

    # The average steps behind should be greater than 0 (indicating a difference)
    assert mismatched_result["avg_steps_behind"] > 0

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's find the test file and see its full context
find . -name "test_model_comparison.py" -type f

Repository: one-covenant/templar

Length of output: 96


🏁 Script executed:

# Also find the main source files to understand the setup_model and compare_model_with_debug_dict
find . -name "*.py" -type f | grep -E "(test_|src/|conftest)" | head -20

Repository: one-covenant/templar

Length of output: 1114


🏁 Script executed:

# Read the test file to see the full context
wc -l ./tests/test_model_comparison.py

Repository: one-covenant/templar

Length of output: 100


🏁 Script executed:

# Read the test file
cat -n ./tests/test_model_comparison.py

Repository: one-covenant/templar

Length of output: 13228


🏁 Script executed:

# Find the compare_model_with_debug_dict function
find . -name "*.py" -type f -exec grep -l "def compare_model_with_debug_dict" {} \;

Repository: one-covenant/templar

Length of output: 85


🏁 Script executed:

# Look for the implementation in the neurons module
find . -name "neurons.py" -type f

Repository: one-covenant/templar

Length of output: 85


🏁 Script executed:

# Read the neurons.py file
wc -l ./src/tplr/neurons.py

Repository: one-covenant/templar

Length of output: 90


🏁 Script executed:

# Read the neurons.py file with line numbers
cat -n ./src/tplr/neurons.py

Repository: one-covenant/templar

Length of output: 50376


🏁 Script executed:

# Search for compare_model_with_debug_dict function
rg -n "async def compare_model_with_debug_dict" ./src/tplr/neurons.py

Repository: one-covenant/templar

Length of output: 109


🏁 Script executed:

# Get the function definition and surrounding context
sed -n '1200,1400p' ./src/tplr/neurons.py | cat -n

Repository: one-covenant/templar

Length of output: 9944


🏁 Script executed:

# Continue reading the compare_model_with_debug_dict function
sed -n '1363,1470p' ./src/tplr/neurons.py | cat -n

Repository: one-covenant/templar

Length of output: 4691


Test cannot verify its claim due to uniform parameter values.

This test claims to verify "last 2 elements" sampling behavior, but the setup_model fixture initializes all parameters with uniform values (e.g., all weights in linear1 are 0.1, all biases are 0.01). When values are uniform, param_flat[:2] == param_flat[-2:], so the test cannot distinguish which elements are actually being sampled.

Additionally, line 206 creates debug_dict with the last 2 elements (param_flat[-2:]), but compare_model_with_debug_dict defaults to index_range=(0, 2) (first 2 elements). These should mismatch, but the test expects an exact match at line 212. This only "works" because all parameter values are uniform, making the test unreliable.

🔎 Recommended fix

Use non-uniform parameter values to properly test the index behavior:

@pytest.mark.asyncio
async def test_custom_index_range(setup_model):
    """Test that the function samples from the last 2 elements of parameters (TP-compatible)."""
-   model = setup_model
+   # Create model with non-uniform values to properly test index behavior
+   model = SimpleModel()
+   with torch.no_grad():
+       # Set non-uniform values so first != last elements
+       model.linear1.weight.flatten()[0] = 0.1
+       model.linear1.weight.flatten()[1] = 0.2
+       model.linear1.weight.flatten()[-2] = 0.3
+       model.linear1.weight.flatten()[-1] = 0.4
+       # Repeat for other parameters...

Or verify that the implementation is being called with the correct index_range parameter to match the debug dict's slice.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In @tests/test_model_comparison.py around lines 197-228, The test is unreliable
because setup_model uses uniform parameter values so param_flat[:2] ==
param_flat[-2:], and it also builds debug_dict from param_flat[-2:] while
compare_model_with_debug_dict defaults to index_range=(0, 2), causing a false
pass; fix by either (A) making setup_model initialize non-uniform parameters so
first vs last elements differ (e.g., vary values across tensor) to validate
index selection, or (B) align the test slices by changing debug_dict to use
param_flat[:2] or by calling compare_model_with_debug_dict(model, debug_dict,
learning_rate, index_range=(len-2, len)) or explicitly passing an index_range
matching the debug_dict; update the creation of debug_dict,
mismatched_debug_dict, or the compare_model_with_debug_dict call accordingly so
the sampled indices actually match or intentionally mismatch.

@joellidin joellidin merged commit bc7e48e into main Jan 5, 2026
6 of 8 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Jan 5, 2026
21 tasks
