Reduce peak_lr_factor from 0.3 to 0.25 for improved training stability.
Update all neurons to use shard 2 instead of shard 0 for anneal mode.
- Change anneal shard in miner and validator
- Clarify sharded_dataset.py comment
- Update docs example to use shard 2
Walkthrough

The PR modifies anneal mode to initialize with shard 2 instead of shard 0 across miner and validator components, updates the documentation example to reflect this change, adjusts the anneal-mode hyperparameter peak_lr_factor from 0.3 to 0.25, and increments the package version.
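The peak_lr_factor change is easy to sanity-check numerically. A minimal sketch, assuming the effective anneal peak is simply outer_learning_rate * peak_lr_factor (with outer_learning_rate = 0.4, as noted in the review comments):

```python
# Effective anneal peak LR, assuming peak = outer_learning_rate * peak_lr_factor.
outer_learning_rate = 0.4

old_peak = outer_learning_rate * 0.30  # peak_lr_factor before this PR
new_peak = outer_learning_rate * 0.25  # peak_lr_factor after this PR

print(f"old peak: {old_peak:.3f}")  # 0.120
print(f"new peak: {new_peak:.3f}")  # 0.100
```

So the change lowers the effective anneal peak from 0.12 to 0.10, a roughly 17% reduction.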
🚥 Pre-merge checks: ❌ 3 failed (1 warning, 2 inconclusive)
Codecov Report

✅ All modified and coverable lines are covered by tests.
❌ Your project status has failed because the head coverage (57.74%) is below the target coverage (85.00%). You can increase the head coverage or adjust the target coverage.

```text
@@ Coverage Diff @@
##             main     #680   +/-   ##
=======================================
  Coverage   57.74%   57.74%
=======================================
  Files          27       27
  Lines        4977     4977
=======================================
  Hits         2874     2874
  Misses       2103     2103
```
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
neurons/validator.py (1)
1261-1273: Verify shard-2 artifacts are deployed; hardcoding is acceptable if configuration isn't needed.

Shard-2 artifacts (anneal_000002.npy / sample_ids_anneal_000002.npy) are properly documented in the dataset setup guide, and no stale references to shard 0 in anneal mode exist in the codebase. If these files are guaranteed to be available in all deployment environments, the hardcoding is safe for this PR.

Optional: make the anneal shard configurable, consistent with the existing anneal_config pattern. The codebase already uses anneal_config.get(key, default) extensively; consider adding shard_index as a configuration option:

```diff
--- a/neurons/validator.py
+++ b/neurons/validator.py
@@ -1266,7 +1266,7 @@ class Validator:
         # In anneal mode, always use shard 2
         if self.dataset_manager.anneal_mode:
-            current_shard = 2
+            current_shard = anneal_config.get("shard_index", 2)
             shard_epoch = 0
```
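A standalone sketch of that fallback pattern (the shard_index key is hypothetical and not yet in hparams.json; the anneal_config dict here stands in for the one the neurons already load):

```python
# Hypothetical anneal config as loaded from hparams.json;
# "shard_index" is a suggested key, not one that exists today.
anneal_config = {"peak_lr_factor": 0.25}

# Falls back to the currently hardcoded shard 2 when the key is absent.
current_shard = int(anneal_config.get("shard_index", 2))
print(current_shard)  # 2

# Once "shard_index" is added to the config, it takes precedence.
anneal_config["shard_index"] = 3
print(int(anneal_config.get("shard_index", 2)))  # 3
```

The int() cast guards against a string value sneaking in from hand-edited JSON.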
🧹 Nitpick comments (1)
neurons/miner.py (1)
432-447: Miner/validator shard selection is now consistent (anneal shard 2).

The main thing to double-check is that shard 2 is universally present/accessible for anneal-mode datasets (otherwise miners will fail early on startup).

Optional: match the validator and read the shard index from hparams:

```diff
--- a/neurons/miner.py
+++ b/neurons/miner.py
@@
-        # In anneal mode, always use shard 2
+        # In anneal mode, lock to a single shard (default: 2)
         if self.dataset_manager.anneal_mode:
-            current_shard = 2
+            current_shard = int(anneal_config.get("shard_index", 2))
             current_shard_epoch = 0
```
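To fail fast on the availability concern above, a startup check along these lines could help. A sketch only: check_anneal_shard and data_dir are hypothetical names, and the file names follow the anneal_000002.npy pattern mentioned in the review.

```python
import os

def check_anneal_shard(data_dir: str, shard: int = 2) -> bool:
    """Return True if both anneal shard files for `shard` exist in data_dir.

    Hypothetical helper; file names follow the anneal_000002.npy /
    sample_ids_anneal_000002.npy pattern from the review.
    """
    names = (f"anneal_{shard:06d}.npy", f"sample_ids_anneal_{shard:06d}.npy")
    return all(os.path.exists(os.path.join(data_dir, n)) for n in names)

# Example: abort at startup instead of failing mid-run.
# if not check_anneal_shard("/data/shards"):
#     raise RuntimeError("anneal shard 2 files missing; cannot start anneal mode")
```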
📜 Review details: configuration from Organization UI, review profile CHILL, plan Pro.
📒 Files selected for processing (6)
- docs/shared_sharded_dataset.md
- hparams/hparams.json
- neurons/miner.py
- neurons/validator.py
- src/tplr/__init__.py
- src/tplr/sharded_dataset.py
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: test (3.12)
- GitHub Check: test (3.11)
🔇 Additional comments (4)
hparams/hparams.json (1)
3-11: Anneal LR peak reduction looks fine; verify effective LR peak + stability.

Given outer_learning_rate: 0.4, confirm the new anneal peak (outer_lr * peak_lr_factor) is the intended magnitude and doesn't under-train during anneal runs.

src/tplr/sharded_dataset.py (1)
395-402: Comment update matches new anneal behavior.

src/tplr/__init__.py (1)

23-23: Version bump to 2.1.24 is consistent with the PR title.

docs/shared_sharded_dataset.md (1)
149-154: Documentation correctly reflects shard 2 migration.The updated section accurately documents the testing workflow with shard 2, aligning with the code changes that now initialize anneal mode using shard 2 instead of shard 0. The rclone commands and file references (anneal_000002.npy) are correct.