Conversation
Reduce peak_lr_factor from 0.3 to 0.25 for improved training stability.
Walkthrough
This PR updates the anneal mode shard selection from shard 0 to shard 2 across the codebase, adjusts the peak learning rate factor in hyperparameters, and bumps the package version to 2.1.24.

Changes
Estimated code review effort
🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @neurons/evaluator.py:
- Around line 1421-1440: The help text for the CLI argument defined via
parser.add_argument("--anneal_eval_path", ...) is misleading: update its help
string to state that it accepts either a direct .npy shard file or a directory
(and will fall back to the configured custom evaluation path if unset) so
operators understand the supported input forms; modify only the help parameter
of the --anneal_eval_path argument to clearly list both accepted forms and the
fallback behavior.
- Around line 737-780: The _prepare_eval_dataset function checks for files named
"*_000000.npy" but constructs tplr.SharedShardedDataset with shard_index=2 and
returns a "*_000000.npy" name, causing a mismatch (SharedShardedDataset will
attempt to load "*_000002.npy"). Fix by making shard discovery, instantiation,
and returned shard_name consistent: when you intend to load shard 2, check for
existence of f"{base_name}_000002.npy" (and the fallback prefixes similarly),
instantiate SharedShardedDataset with shard_index=2, and return the matching
"*_000002.npy" name (or if you want shard 0 behavior, change shard_index to 0
and keep checks/returns as *_000000.npy); update all occurrences in
_prepare_eval_dataset (including the hinted_shard/direct_shard checks, the
fallback loop, the file_prefix logic, and the final return) to use the same
shard index convention.
In @neurons/miner.py:
- Around line 437-440: In anneal mode the miner sets current_shard_epoch while
the validator expects shard_epoch, causing a naming mismatch; update the miner
implementation (around the anneal branch where current_shard = 2) to use the
same variable name as the validator (shard_epoch) — either rename
current_shard_epoch to shard_epoch or ensure both variables are set (shard_epoch
= 0 and current_shard = 2) so miner and validator state names match exactly.
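The second option above can be sketched as follows; the surrounding class is hypothetical, but the variable names `shard_epoch` and `current_shard` come from the review:

```python
# Sketch of the naming alignment between miner and validator state. In anneal
# mode both sides must publish identical state keys, so the miner sets
# `shard_epoch` (the key the validator reads) rather than `current_shard_epoch`.
class AnnealState:
    def enter_anneal_mode(self) -> None:
        self.current_shard = 2  # anneal mode pins shard 2
        self.shard_epoch = 0    # same key name the validator expects
```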
🧹 Nitpick comments (3)
neurons/validator.py (1)
1266-1270: Avoid a "magic number" for anneal shard (even if it's intentionally forced). Hard-coding `current_shard = 2` is consistent with the PR objective, but it's easy for miner/validator/evaluator/docs to drift. Consider lifting `2` into a single constant (or an `anneal_mode.shard_index` hparam with default `2`) so the policy is centralized.

neurons/evaluator.py (2)
145-182: `_NpySequenceDataset`: avoid silent dtype conversion that can allocate large RAM. `arr.astype(np.uint32, copy=False)` can still allocate if the on-disk dtype isn't `uint32`, which defeats mmap and can spike memory on large shards. For eval shards, it may be safer to hard-fail on unexpected dtype (or at least gate it behind an explicit `allow_convert` flag).
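The stricter handling suggested here could look like the sketch below. The function name and `allow_convert` flag are illustrative, not the evaluator's actual interface:

```python
import numpy as np

# Hypothetical loader that keeps the memory-map intact by refusing
# unexpected dtypes unless the caller explicitly opts into a copy.
def load_token_shard(path: str, allow_convert: bool = False) -> np.ndarray:
    arr = np.load(path, mmap_mode="r")
    if arr.dtype != np.uint32:
        if not allow_convert:
            raise TypeError(
                f"Expected uint32 tokens in {path}, got {arr.dtype}; "
                "pass allow_convert=True to copy-convert (breaks mmap)."
            )
        arr = arr.astype(np.uint32)  # allocates a full in-RAM copy
    return arr
```

Hard-failing by default turns a silent memory spike into an immediate, debuggable error on malformed shards.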
965-1031: Public bpb evaluators look clean; consider making `max_samples` configurable for anneal. Hard-coding `max_samples=1024` is reasonable, but if you expect different shard sizes or want stable cost control, a CLI/hparam knob would make this easier to tune without code changes.
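A minimal version of such a knob, assuming a hypothetical flag name (the evaluator's real CLI may differ):

```python
import argparse

# Hypothetical CLI knob for the hard-coded max_samples=1024; the flag name
# and placement are assumptions, not the evaluator's actual interface.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--bpb_max_samples",
    type=int,
    default=1024,  # the current hard-coded value becomes the default
    help="Maximum number of samples per bpb evaluation run (cost control).",
)
```

Callers keep the current behavior by default and can cap cost with, e.g., `--bpb_max_samples 256`.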
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- docs/shared_sharded_dataset.md
- hparams/hparams.json
- neurons/evaluator.py
- neurons/miner.py
- neurons/validator.py
- src/tplr/__init__.py
- src/tplr/sharded_dataset.py
🧰 Additional context used
🧬 Code graph analysis (1)
neurons/evaluator.py (1)
src/tplr/sharded_dataset.py (1)
SharedShardedDataset (58-239)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: test (3.11)
- GitHub Check: test (3.12)
🔇 Additional comments (7)
src/tplr/sharded_dataset.py (1)
399-399: LGTM: Comment clarification aligns with shard flexibility. The updated comment accurately reflects that anneal mode stays on a single shard without hardcoding a specific shard number. This aligns with the PR's shift to shard 2 for anneal mode.
src/tplr/__init__.py (1)
23-23: LGTM: Version bump to 2.1.24. Standard version increment for this release.
hparams/hparams.json (1)
8-8: LGTM: Peak LR reduction for improved stability. Reducing `peak_lr_factor` from 0.3 to 0.25 (a roughly 17% decrease) should improve training stability during the annealing phase, as stated in the PR objectives. This may slightly slow convergence but typically leads to better final model quality.
149-154: LGTM: Documentation updated to reflect anneal shard 2. The documentation correctly reflects the switch to anneal shard 2 (`anneal_000002.npy` and `sample_ids_anneal_000002.npy`). The updated rclone commands will guide users to copy the correct shard for testing.
42-43: Confirm `numpy` is a declared runtime dependency for the evaluator. You added `import numpy as np`; make sure the evaluator's runtime environment/image explicitly includes numpy (even if other components already do).
781-894: Verify the torchtitan forward signature used in bpb eval (logits vs loss output). `logits = self.model(input_ids, labels)` assumes the model returns logits compatible with `cross_entropy_loss(logits, labels)`. If torchtitan's forward returns `(loss, logits)` or a dict/object, this will miscompute or crash. Please confirm against the exact torchtitan version you're running in CI/runtime.

Also applies to: 895-964
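One defensive way to handle the ambiguity is to normalize the forward's return value before computing loss. The branches below cover common conventions (`(loss, logits)` tuple, dict, HF-style object with `.logits`, raw tensor); they are illustrative, not torchtitan's confirmed API, so the pinned version should still be checked:

```python
# Hypothetical normalizer for a model forward's return value. Each branch
# handles a convention seen in the wild; none is confirmed for torchtitan.
def extract_logits(output):
    if isinstance(output, (tuple, list)) and len(output) == 2:
        return output[1]            # assume (loss, logits)
    if isinstance(output, dict) and "logits" in output:
        return output["logits"]
    if hasattr(output, "logits"):
        return output.logits        # e.g. a ModelOutput-style object
    return output                   # assume a raw logits tensor
```

Silently accepting the wrong shape is the failure mode the review warns about, so in practice it may be better to assert the resolved tensor's shape against `(batch, seq_len, vocab)` before use.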
1091-1095: Nice: evaluation orchestration is now centralized via `_run_bpb_eval`. This makes it much easier to add/remove eval modes consistently.
Update all neurons to use shard 2 instead of shard 0 for anneal mode. - Change anneal shard in miner and validator - Clarify sharded_dataset.py comment - Update docs example to use shard 2
Force-pushed from c0afef1 to 8de0c27
Codecov Report
✅ All modified and coverable lines are covered by tests.
❌ Your project status has failed because the head coverage (57.74%) is below the target coverage (85.00%). You can increase the head coverage or adjust the target coverage.

@@ Coverage Diff @@
## dev #679 +/- ##
=======================================
Coverage 57.74% 57.74%
=======================================
Files 27 27
Lines 4977 4977
=======================================
Hits 2874 2874
Misses 2103 2103
Actionable comments posted: 0
🧹 Nitpick comments (2)
neurons/miner.py (1)
437-440: LGTM - Miner anneal shard matches validator configuration. The change correctly mirrors the validator's anneal shard selection (shard 2), which is critical for proper synchronization between miners and validators during the anneal phase. The logic is consistent and correct.
Note: the same configuration flexibility suggestion from validator.py applies here - consider making the anneal shard number configurable rather than hardcoded to facilitate future changes.
neurons/validator.py (1)
1261-1269: Verify shard 2 dataset files are available in your dataset directory before deployment. With `file_prefix: "anneal"` and shard 2 configured, ensure `anneal_000002.npy` and `sample_ids_anneal_000002.npy` are available and distributed to all nodes. This is a pre-deployment operational requirement, not a code check.

Make the anneal shard number configurable via hparams. The shard is currently hardcoded to 2 in both `neurons/miner.py` and `neurons/validator.py`, while other anneal parameters (`enabled`, `file_prefix`, warmup settings) are already configurable through `hparams.anneal_mode`. Consider adding an `anneal_shard` field to `hparams.anneal_mode` for consistency and flexibility.
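The suggested hparams change could look like the sketch below. The `anneal_shard` key is the review's proposal; `enabled` and `file_prefix` appear in the review, while the exact shape of the config dict and the accessor are assumptions:

```python
# Hypothetical hparams.anneal_mode section with the proposed field.
anneal_mode_hparams = {
    "enabled": True,
    "file_prefix": "anneal",
    "anneal_shard": 2,  # replaces the hardcoded `current_shard = 2`
}

def get_anneal_shard(hparams: dict, default: int = 2) -> int:
    # Both miner and validator would read the shard from here, so a future
    # shard change touches one config value instead of two source files.
    return int(hparams.get("anneal_mode", {}).get("anneal_shard", default))
```

Keeping the current shard as the default preserves behavior for deployments that haven't updated their hparams.json yet.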
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- docs/shared_sharded_dataset.md
- neurons/miner.py
- neurons/validator.py
- src/tplr/__init__.py
- src/tplr/sharded_dataset.py
🚧 Files skipped from review as they are similar to previous changes (2)
- src/tplr/sharded_dataset.py
- docs/shared_sharded_dataset.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: test (3.11)
- GitHub Check: test (3.12)
🔇 Additional comments (1)
src/tplr/__init__.py (1)
23-23: LGTM - Version bump aligns with anneal shard changes. The version increment from 2.1.23 to 2.1.24 is appropriate for the feature changes in this PR (switching the anneal shard and adjusting hyperparameters).
Description
Related Issue(s)
Type of Change
Branch Naming
Commit Messages
Code Quality
Testing
Documentation
If this is a breaking change
Screenshots/Examples
Additional Notes
Summary by CodeRabbit
Release Notes
Chores
Configuration
Updates
✏️ Tip: You can customize this high-level summary in your review settings.