
feat/new anneal shard #679

Merged
joellidin merged 3 commits into dev from feat/new-anneal-shard
Jan 11, 2026

Conversation

@joellidin (Collaborator)

@joellidin joellidin commented Jan 11, 2026

  • (hparams) Lower anneal peak LR to 0.25
  • (neurons) Switch anneal mode to shard 2
  • Bump run version

Description

Related Issue(s)

  • Closes #[issue number]

Type of Change

  • Feature (adding new functionality)
  • Fix (resolving a bug or issue)
  • Docs (documentation updates)
  • Refactor (code changes that don't affect functionality)
  • Maintenance (dependency updates or other maintenance)
  • Tests (adding or improving tests)
  • Breaking change (fix or feature with incompatible API changes)
  • Other: _____

Branch Naming

  • My branch follows the project's naming convention (e.g., feature/add-new-capability)

Commit Messages

  • My commits are small, atomic, and have proper commit messages
  • Commit messages are in imperative mood with a capitalized summary under 50 chars

Code Quality

  • I've performed a self-review of my code
  • I've added appropriate docstrings following the project's conventions
  • I've added proper logging where necessary (without trailing periods)
  • I've applied linting and formatting with Ruff
  • My code generates no new warnings

Testing

  • I've added tests for new functionality or bug fixes
  • All tests pass locally with my changes
  • Test coverage has not decreased

Documentation

  • I've updated documentation to reflect my changes
  • I've updated comments in hard-to-understand areas

If this is a breaking change

Screenshots/Examples

Additional Notes

Summary by CodeRabbit

Release Notes

  • Chores

    • Version updated to 2.1.24
  • Configuration

    • Peak learning rate factor adjusted from 0.3 to 0.25
  • Updates

    • Modified dataset shard configuration for anneal mode
    • Updated documentation to reflect current dataset shard references


Reduce peak_lr_factor from 0.3 to 0.25 for improved training stability.
@coderabbitai

coderabbitai bot commented Jan 11, 2026

Walkthrough

This PR updates the anneal mode shard selection from shard 0 to shard 2 across the codebase, adjusts the peak learning rate factor in hyperparameters, and bumps the package version to 2.1.24.

Changes

Cohort / File(s) / Summary

  • Anneal Shard Selection Logic (neurons/miner.py, neurons/validator.py):
    Hard-coded shard selection in anneal mode changed from shard 0 to shard 2; corresponding comments updated to reflect new shard usage.
  • Hyperparameter Configuration (hparams/hparams.json):
    Peak learning rate factor in anneal_mode updated from 0.3 to 0.25, adjusting the peak LR value used by the schedule.
  • Documentation (docs/shared_sharded_dataset.md):
    Updated guidance and code examples to reference anneal shard 2 instead of shard 0; changed from "Copy first training shard" to "Copy anneal shard 2".
  • Version & Comments (src/tplr/__init__.py, src/tplr/sharded_dataset.py):
    Package version bumped to 2.1.24; comment in initialize_datasets generalized from "we stay on shard 0" to "we stay on one shard".

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • feat/anneal lower lr #677: Both PRs modify the same anneal_mode hyperparameter (peak_lr_factor) in hparams.json.
  • v2.1.18 #662: Both PRs update shard-selection logic in neurons/miner.py and neurons/validator.py for anneal mode.
  • feat/anneal #669: Both PRs directly modify anneal-mode shard-selection behavior introduced in the anneal feature.

Suggested reviewers

  • shivam-MBZUAI
  • amiiir-sarfi

Poem

🐰 A shard shifts from zero to two, hop-hop-hooray!
Learning rates adjust, the anneal finds its way,
Docs and code align in harmony divine,
Version bumped with care—this change is mine! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning):
    Docstring coverage is 68.75%, below the required threshold of 80.00%.
    Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check (❓ Inconclusive):
    The PR description gives a high-level summary of changes but leaves the template incomplete, with missing or unchecked sections (Related Issues, the Type of Change checklist, and other verification items).
    Resolution: complete the template by selecting the appropriate Type of Change checkboxes, linking a Related Issue if applicable, and verifying the Branch Naming, Commit Messages, Code Quality, Testing, and Documentation items.

✅ Passed checks (1 passed)

  • Title check (✅ Passed):
    The title 'feat/new anneal shard' accurately captures the main change: switching anneal mode to use shard 2 instead of shard 0.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @neurons/evaluator.py:
- Around lines 1421-1440: The help text for the CLI argument defined via
parser.add_argument("--anneal_eval_path", ...) is misleading. Update the help
string to state that the argument accepts either a direct .npy shard file or a
directory, and that it falls back to the configured custom evaluation path when
unset, so operators understand the supported input forms. Modify only the help
parameter of the --anneal_eval_path argument.
- Around lines 737-780: _prepare_eval_dataset checks for files named
"*_000000.npy" but constructs tplr.SharedShardedDataset with shard_index=2 and
returns a "*_000000.npy" name, a mismatch (SharedShardedDataset will attempt
to load "*_000002.npy"). Make shard discovery, instantiation, and the returned
shard_name consistent: when loading shard 2, check for existence of
f"{base_name}_000002.npy" (and the fallback prefixes similarly), instantiate
SharedShardedDataset with shard_index=2, and return the matching
"*_000002.npy" name (or, for shard 0 behavior, change shard_index to 0 and
keep the *_000000.npy checks/returns). Update all occurrences in
_prepare_eval_dataset (the hinted_shard/direct_shard checks, the fallback
loop, the file_prefix logic, and the final return) to use the same
shard index convention.
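
The consistency fix above amounts to deriving every shard filename from a single shard_index. A minimal hypothetical helper (the function name and signature are illustrative assumptions, not the repo's actual API):

```python
# Hypothetical helper: build shard filenames from one shard_index so
# discovery, dataset construction, and the returned shard_name cannot
# drift apart. Assumes the "<prefix>_NNNNNN.npy" convention shown in
# the review comment above.
def shard_filename(file_prefix: str, shard_index: int) -> str:
    """Build a shard's .npy name, e.g. anneal_000002.npy for index 2."""
    return f"{file_prefix}_{shard_index:06d}.npy"
```

Routing every check and return through one such helper makes a shard-0/shard-2 mismatch impossible by construction.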

In @neurons/miner.py:
- Around lines 437-440: In anneal mode the miner sets current_shard_epoch
while the validator expects shard_epoch, a naming mismatch. In the anneal
branch (where current_shard = 2), use the same variable name as the
validator: either rename current_shard_epoch to shard_epoch, or set both
variables (shard_epoch = 0 and current_shard = 2) so miner and validator
state names match exactly.
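
A hypothetical sketch of that fix; the variable and key names come from the review comment above, not from the verified repo code:

```python
# Hypothetical sketch: in the miner's anneal branch, set both names so the
# state the miner publishes matches what the validator reads.
ANNEAL_SHARD = 2

def miner_anneal_state() -> dict:
    shard_epoch = 0               # key the validator expects
    current_shard = ANNEAL_SHARD  # shard forced during anneal mode
    return {"shard_epoch": shard_epoch, "current_shard": current_shard}
```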
🧹 Nitpick comments (3)
neurons/validator.py (1)

1266-1270: Avoid a “magic number” for anneal shard (even if it’s intentionally forced).

Hard-coding current_shard = 2 is consistent with the PR objective, but it’s easy for miner/validator/evaluator/docs to drift. Consider lifting 2 into a single constant (or an anneal_mode.shard_index hparam with default 2) so the policy is centralized.
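
One way to centralize the policy, sketched under the assumption of an anneal_mode.shard_index hparam (the key name is hypothetical, not the repo's actual schema):

```python
# Hypothetical sketch: one default constant, overridable through an
# anneal_mode.shard_index hparam, so miner/validator/evaluator/docs all
# read the same value instead of each hard-coding 2.
ANNEAL_SHARD_INDEX = 2  # single source of truth for the anneal shard

def resolve_anneal_shard(hparams: dict) -> int:
    """Read anneal_mode.shard_index from hparams, defaulting to 2."""
    return hparams.get("anneal_mode", {}).get("shard_index", ANNEAL_SHARD_INDEX)
```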

neurons/evaluator.py (2)

145-182: _NpySequenceDataset: avoid silent dtype conversion that can allocate large RAM.

arr.astype(np.uint32, copy=False) can still allocate if the on-disk dtype isn’t uint32, which defeats mmap and can spike memory on large shards. For eval shards, it may be safer to hard-fail on unexpected dtype (or at least gate it behind an explicit “allow_convert” flag).
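
A minimal sketch of the hard-fail behavior suggested above (the loader name and flag are illustrative assumptions):

```python
# Hypothetical sketch: memory-map the shard and refuse unexpected dtypes
# unless explicitly allowed, since a silent astype() materializes the
# whole array in RAM and defeats the mmap.
import numpy as np

def load_shard(path: str, allow_convert: bool = False) -> np.ndarray:
    arr = np.load(path, mmap_mode="r")  # data stays on disk
    if arr.dtype != np.uint32:
        if not allow_convert:
            raise TypeError(f"expected uint32 shard, got {arr.dtype}")
        arr = arr.astype(np.uint32)     # explicit opt-in: allocates a full copy
    return arr
```

Failing loudly here turns a potential OOM on a large shard into an immediate, explainable error.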


965-1031: Public bpb evaluators look clean; consider making max_samples configurable for anneal.

Hard-coding max_samples=1024 is reasonable, but if you expect different shard sizes or want stable cost control, a CLI/hparam knob would make this easier to tune without code changes.
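
Such a knob could look like the following sketch; the flag name --anneal_max_samples is an assumption, not the evaluator's real interface:

```python
# Hypothetical CLI knob: expose the currently hard-coded max_samples=1024
# as an argparse option so eval cost can be tuned without code changes.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--anneal_max_samples",
    type=int,
    default=1024,  # the currently hard-coded value becomes the default
    help="Maximum samples drawn from the anneal shard during bpb evaluation",
)
args = parser.parse_args([])  # pass e.g. ["--anneal_max_samples", "256"] to override
```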

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a948591 and c0afef1.

📒 Files selected for processing (7)
  • docs/shared_sharded_dataset.md
  • hparams/hparams.json
  • neurons/evaluator.py
  • neurons/miner.py
  • neurons/validator.py
  • src/tplr/__init__.py
  • src/tplr/sharded_dataset.py
🧰 Additional context used
🧬 Code graph analysis (1)
neurons/evaluator.py (1)
src/tplr/sharded_dataset.py (1)
  • SharedShardedDataset (58-239)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.12)
🔇 Additional comments (7)
src/tplr/sharded_dataset.py (1)

399-399: LGTM: Comment clarification aligns with shard flexibility.

The updated comment accurately reflects that anneal mode stays on a single shard without hardcoding a specific shard number. This aligns with the PR's shift to shard 2 for anneal mode.

src/tplr/__init__.py (1)

23-23: LGTM: Version bump to 2.1.24.

Standard version increment for this release.

hparams/hparams.json (1)

8-8: LGTM: Peak LR reduction for improved stability.

Reducing peak_lr_factor from 0.3 to 0.25 (17% decrease) should improve training stability during the annealing phase, as stated in the PR objectives. This may slightly slow convergence but typically leads to better final model quality.

docs/shared_sharded_dataset.md (1)

149-154: LGTM: Documentation updated to reflect anneal shard 2.

The documentation correctly reflects the switch to anneal shard 2 (anneal_000002.npy and sample_ids_anneal_000002.npy). The updated rclone commands will guide users to copy the correct shard for testing.

neurons/evaluator.py (3)

42-43: Confirm numpy is a declared runtime dependency for evaluator.
You added import numpy as np; make sure the evaluator’s runtime environment/image explicitly includes numpy (even if other components already do).


781-894: Verify torchtitan forward signature used in bpb eval (logits vs loss output).

logits = self.model(input_ids, labels) assumes the model returns logits compatible with cross_entropy_loss(logits, labels). If torchtitan’s forward returns (loss, logits) or a dict/obj, this will miscompute or crash. Please confirm against the exact torchtitan version you’re running in CI/runtime.

Also applies to: 895-964
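
A defensive normalization for this concern could look like the sketch below. The set of possible return shapes is an assumption about common model-forward variants, not torchtitan's documented contract; verify against the version actually deployed:

```python
# Hypothetical sketch: accept a raw tensor, a (loss, logits) tuple, or an
# object with a .logits attribute, and always hand back the logits value.
def extract_logits(out):
    """Normalize a model forward() result to the logits value."""
    if isinstance(out, (tuple, list)):
        return out[-1]        # e.g. (loss, logits)
    if hasattr(out, "logits"):
        return out.logits     # HF-style output object
    return out                # assume it's already the logits tensor
```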


1091-1095: Nice: evaluation orchestration is now centralized via _run_bpb_eval.
This makes it much easier to add/remove eval modes consistently.

Update all neurons to use shard 2 instead of shard 0 for anneal mode.

- Change anneal shard in miner and validator
- Clarify sharded_dataset.py comment
- Update docs example to use shard 2
@joellidin joellidin force-pushed the feat/new-anneal-shard branch from c0afef1 to 8de0c27 on January 11, 2026 at 01:04
@codecov

codecov bot commented Jan 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

❌ Your project status has failed because the head coverage (57.74%) is below the target coverage (85.00%). You can increase the head coverage or adjust the target coverage.

Impacted file tree graph

@@           Coverage Diff           @@
##              dev     #679   +/-   ##
=======================================
  Coverage   57.74%   57.74%           
=======================================
  Files          27       27           
  Lines        4977     4977           
=======================================
  Hits         2874     2874           
  Misses       2103     2103           
Files with missing lines (Coverage Δ):
  • src/tplr/__init__.py: 100.00% <100.00%> (ø)
  • src/tplr/sharded_dataset.py: 22.43% <ø> (ø)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
neurons/miner.py (1)

437-440: LGTM - Miner anneal shard matches validator configuration.

The change correctly mirrors the validator's anneal shard selection (shard 2), which is critical for proper synchronization between miners and validators during the anneal phase. The logic is consistent and correct.

Note: The same configuration flexibility suggestion from validator.py applies here - consider making the anneal shard number configurable rather than hardcoded to facilitate future changes.

neurons/validator.py (1)

1261-1269: Verify shard 2 dataset files are available in your dataset directory before deployment.

With file_prefix: "anneal" and shard 2 configured, ensure anneal_000002.npy and sample_ids_anneal_000002.npy are available and distributed to all nodes. This is a pre-deployment operational requirement, not a code check.
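
That operational check can be scripted as a small pre-deployment guard; the file names follow the review comment above, while the directory-layout assumption is hypothetical:

```python
# Hypothetical pre-deployment check: confirm both shard-2 files are
# present in the dataset directory before enabling anneal mode.
from pathlib import Path

def missing_anneal_files(dataset_dir: str, shard_index: int = 2) -> list:
    """Return the anneal shard files that are absent from dataset_dir."""
    needed = [
        f"anneal_{shard_index:06d}.npy",
        f"sample_ids_anneal_{shard_index:06d}.npy",
    ]
    root = Path(dataset_dir)
    return [name for name in needed if not (root / name).exists()]
```

Running this on each node before the rollout makes the "files distributed everywhere" requirement verifiable rather than assumed.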

Make the anneal shard number configurable via hparams.

The shard is currently hardcoded to 2 in both neurons/miner.py and neurons/validator.py, while other anneal parameters (enabled, file_prefix, warmup settings) are already configurable through hparams.anneal_mode. Consider adding an anneal_shard field to hparams.anneal_mode for consistency and flexibility.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c0afef1 and 8de0c27.

📒 Files selected for processing (5)
  • docs/shared_sharded_dataset.md
  • neurons/miner.py
  • neurons/validator.py
  • src/tplr/__init__.py
  • src/tplr/sharded_dataset.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/tplr/sharded_dataset.py
  • docs/shared_sharded_dataset.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.12)
🔇 Additional comments (1)
src/tplr/__init__.py (1)

23-23: LGTM - Version bump aligns with anneal shard changes.

The version increment from 2.1.23 to 2.1.24 is appropriate for the feature changes in this PR (switching anneal shard and adjusting hyperparameters).

@joellidin joellidin merged commit 1fd65b4 into dev Jan 11, 2026
7 of 8 checks passed
@joellidin joellidin deleted the feat/new-anneal-shard branch January 11, 2026 15:27
This was referenced Jan 11, 2026

1 participant