Conversation
Reduce peak_lr_factor from 0.3 to 0.25 for improved training stability.
Walkthrough
This PR updates the anneal mode shard selection from shard 0 to shard 2 across the codebase, adjusts the peak learning rate factor in hyperparameters, and bumps the package version to 2.1.24.

Changes
Estimated code review effort
🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @neurons/evaluator.py:
- Around line 1421-1440: The help text for the CLI argument defined via
parser.add_argument("--anneal_eval_path", ...) is misleading: update its help
string to state that it accepts either a direct .npy shard file or a directory
(and will fall back to the configured custom evaluation path if unset) so
operators understand the supported input forms; modify only the help parameter
of the --anneal_eval_path argument to clearly list both accepted forms and the
fallback behavior.
- Around line 737-780: The _prepare_eval_dataset function checks for files named
"*_000000.npy" but constructs tplr.SharedShardedDataset with shard_index=2 and
returns a "*_000000.npy" name, causing a mismatch (SharedShardedDataset will
attempt to load "*_000002.npy"). Fix by making shard discovery, instantiation,
and returned shard_name consistent: when you intend to load shard 2, check for
existence of f"{base_name}_000002.npy" (and the fallback prefixes similarly),
instantiate SharedShardedDataset with shard_index=2, and return the matching
"*_000002.npy" name (or if you want shard 0 behavior, change shard_index to 0
and keep checks/returns as *_000000.npy); update all occurrences in
_prepare_eval_dataset (including the hinted_shard/direct_shard checks, the
fallback loop, the file_prefix logic, and the final return) to use the same
shard index convention.
In @neurons/miner.py:
- Around line 437-440: In anneal mode the miner sets current_shard_epoch while
the validator expects shard_epoch, causing a naming mismatch; update the miner
implementation (around the anneal branch where current_shard = 2) to use the
same variable name as the validator (shard_epoch) — either rename
current_shard_epoch to shard_epoch or ensure both variables are set (shard_epoch
= 0 and current_shard = 2) so miner and validator state names match exactly.
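The second option above can be sketched as follows; the surrounding class is hypothetical, but the variable names `shard_epoch` and `current_shard` come from the review:

```python
# Sketch of the naming alignment between miner and validator state. In anneal
# mode both sides must publish identical state keys, so the miner sets
# `shard_epoch` (the key the validator reads) rather than `current_shard_epoch`.
class AnnealState:
    def enter_anneal_mode(self) -> None:
        self.current_shard = 2  # anneal mode pins shard 2
        self.shard_epoch = 0    # same key name the validator expects
```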
🧹 Nitpick comments (3)
neurons/validator.py (1)
1266-1270: Avoid a "magic number" for anneal shard (even if it's intentionally forced). Hard-coding `current_shard = 2` is consistent with the PR objective, but it's easy for miner/validator/evaluator/docs to drift. Consider lifting `2` into a single constant (or an `anneal_mode.shard_index` hparam with default `2`) so the policy is centralized.

neurons/evaluator.py (2)
145-182: `_NpySequenceDataset`: avoid silent dtype conversion that can allocate large RAM. `arr.astype(np.uint32, copy=False)` can still allocate if the on-disk dtype isn't `uint32`, which defeats mmap and can spike memory on large shards. For eval shards, it may be safer to hard-fail on unexpected dtype (or at least gate it behind an explicit `allow_convert` flag).
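The stricter handling suggested here could look like the sketch below. The function name and `allow_convert` flag are illustrative, not the evaluator's actual interface:

```python
import numpy as np

# Hypothetical loader that keeps the memory-map intact by refusing
# unexpected dtypes unless the caller explicitly opts into a copy.
def load_token_shard(path: str, allow_convert: bool = False) -> np.ndarray:
    arr = np.load(path, mmap_mode="r")
    if arr.dtype != np.uint32:
        if not allow_convert:
            raise TypeError(
                f"Expected uint32 tokens in {path}, got {arr.dtype}; "
                "pass allow_convert=True to copy-convert (breaks mmap)."
            )
        arr = arr.astype(np.uint32)  # allocates a full in-RAM copy
    return arr
```

Hard-failing by default turns a silent memory spike into an immediate, debuggable error on malformed shards.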
965-1031: Public bpb evaluators look clean; consider making `max_samples` configurable for anneal. Hard-coding `max_samples=1024` is reasonable, but if you expect different shard sizes or want stable cost control, a CLI/hparam knob would make this easier to tune without code changes.
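A minimal version of such a knob, assuming a hypothetical flag name (the evaluator's real CLI may differ):

```python
import argparse

# Hypothetical CLI knob for the hard-coded max_samples=1024; the flag name
# and placement are assumptions, not the evaluator's actual interface.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--bpb_max_samples",
    type=int,
    default=1024,  # the current hard-coded value becomes the default
    help="Maximum number of samples per bpb evaluation run (cost control).",
)
```

Callers keep the current behavior by default and can cap cost with, e.g., `--bpb_max_samples 256`.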
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- docs/shared_sharded_dataset.md
- hparams/hparams.json
- neurons/evaluator.py
- neurons/miner.py
- neurons/validator.py
- src/tplr/__init__.py
- src/tplr/sharded_dataset.py
🧰 Additional context used
🧬 Code graph analysis (1)
neurons/evaluator.py (1)
src/tplr/sharded_dataset.py (1)
SharedShardedDataset (58-239)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: test (3.11)
- GitHub Check: test (3.12)
🔇 Additional comments (7)
src/tplr/sharded_dataset.py (1)
399-399: LGTM: Comment clarification aligns with shard flexibility. The updated comment accurately reflects that anneal mode stays on a single shard without hardcoding a specific shard number. This aligns with the PR's shift to shard 2 for anneal mode.
src/tplr/__init__.py (1)
23-23: LGTM: Version bump to 2.1.24. Standard version increment for this release.
hparams/hparams.json (1)
8-8: LGTM: Peak LR reduction for improved stability. Reducing `peak_lr_factor` from 0.3 to 0.25 (a roughly 17% decrease) should improve training stability during the annealing phase, as stated in the PR objectives. This may slightly slow convergence but typically leads to better final model quality.
149-154: LGTM: Documentation updated to reflect anneal shard 2. The documentation correctly reflects the switch to anneal shard 2 (`anneal_000002.npy` and `sample_ids_anneal_000002.npy`). The updated rclone commands will guide users to copy the correct shard for testing.
42-43: Confirm `numpy` is a declared runtime dependency for the evaluator. You added `import numpy as np`; make sure the evaluator's runtime environment/image explicitly includes numpy (even if other components already do).
781-894: Verify the torchtitan forward signature used in bpb eval (logits vs loss output). `logits = self.model(input_ids, labels)` assumes the model returns logits compatible with `cross_entropy_loss(logits, labels)`. If torchtitan's forward returns `(loss, logits)` or a dict/object, this will miscompute or crash. Please confirm against the exact torchtitan version you're running in CI/runtime.

Also applies to: 895-964
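One defensive way to handle the ambiguity is to normalize the forward's return value before computing loss. The branches below cover common conventions (`(loss, logits)` tuple, dict, HF-style object with `.logits`, raw tensor); they are illustrative, not torchtitan's confirmed API, so the pinned version should still be checked:

```python
# Hypothetical normalizer for a model forward's return value. Each branch
# handles a convention seen in the wild; none is confirmed for torchtitan.
def extract_logits(output):
    if isinstance(output, (tuple, list)) and len(output) == 2:
        return output[1]            # assume (loss, logits)
    if isinstance(output, dict) and "logits" in output:
        return output["logits"]
    if hasattr(output, "logits"):
        return output.logits        # e.g. a ModelOutput-style object
    return output                   # assume a raw logits tensor
```

Silently accepting the wrong shape is the failure mode the review warns about, so in practice it may be better to assert the resolved tensor's shape against `(batch, seq_len, vocab)` before use.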
1091-1095: Nice: evaluation orchestration is now centralized via `_run_bpb_eval`. This makes it much easier to add/remove eval modes consistently.
Update all neurons to use shard 2 instead of shard 0 for anneal mode. - Change anneal shard in miner and validator - Clarify sharded_dataset.py comment - Update docs example to use shard 2
Force-pushed from c0afef1 to 8de0c27
Codecov Report
✅ All modified and coverable lines are covered by tests.
❌ Your project status has failed because the head coverage (57.74%) is below the target coverage (85.00%). You can increase the head coverage or adjust the target coverage.

@@ Coverage Diff @@
## dev #679 +/- ##
=======================================
Coverage 57.74% 57.74%
=======================================
Files 27 27
Lines 4977 4977
=======================================
Hits 2874 2874
Misses 2103 2103
Actionable comments posted: 0
🧹 Nitpick comments (2)
neurons/miner.py (1)
437-440: LGTM - Miner anneal shard matches validator configuration. The change correctly mirrors the validator's anneal shard selection (shard 2), which is critical for proper synchronization between miners and validators during the anneal phase. The logic is consistent and correct.
Note: the same configuration flexibility suggestion from validator.py applies here - consider making the anneal shard number configurable rather than hardcoded to facilitate future changes.
neurons/validator.py (1)
1261-1269: Verify shard 2 dataset files are available in your dataset directory before deployment. With `file_prefix: "anneal"` and shard 2 configured, ensure `anneal_000002.npy` and `sample_ids_anneal_000002.npy` are available and distributed to all nodes. This is a pre-deployment operational requirement, not a code check.

Make the anneal shard number configurable via hparams. The shard is currently hardcoded to 2 in both `neurons/miner.py` and `neurons/validator.py`, while other anneal parameters (`enabled`, `file_prefix`, warmup settings) are already configurable through `hparams.anneal_mode`. Consider adding an `anneal_shard` field to `hparams.anneal_mode` for consistency and flexibility.
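The suggested hparams change could look like the sketch below. The `anneal_shard` key is the review's proposal; `enabled` and `file_prefix` appear in the review, while the exact shape of the config dict and the accessor are assumptions:

```python
# Hypothetical hparams.anneal_mode section with the proposed field.
anneal_mode_hparams = {
    "enabled": True,
    "file_prefix": "anneal",
    "anneal_shard": 2,  # replaces the hardcoded `current_shard = 2`
}

def get_anneal_shard(hparams: dict, default: int = 2) -> int:
    # Both miner and validator would read the shard from here, so a future
    # shard change touches one config value instead of two source files.
    return int(hparams.get("anneal_mode", {}).get("anneal_shard", default))
```

Keeping the current shard as the default preserves behavior for deployments that haven't updated their hparams.json yet.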
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- docs/shared_sharded_dataset.md
- neurons/miner.py
- neurons/validator.py
- src/tplr/__init__.py
- src/tplr/sharded_dataset.py
🚧 Files skipped from review as they are similar to previous changes (2)
- src/tplr/sharded_dataset.py
- docs/shared_sharded_dataset.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: test (3.11)
- GitHub Check: test (3.12)
🔇 Additional comments (1)
src/tplr/__init__.py (1)
23-23: LGTM - Version bump aligns with anneal shard changes. The version increment from 2.1.23 to 2.1.24 is appropriate for the feature changes in this PR (switching the anneal shard and adjusting hyperparameters).
Description
Related Issue(s)
Type of Change
Branch Naming
Commit Messages
Code Quality
Testing
Documentation
If this is a breaking change
Screenshots/Examples
Additional Notes
Summary by CodeRabbit
Release Notes
Chores
Configuration
Updates
✏️ Tip: You can customize this high-level summary in your review settings.