Skip to content

fix/bump run#681

Merged
joellidin merged 3 commits intodevfrom
fix/bump-run
Jan 13, 2026
Merged

fix/bump run#681
joellidin merged 3 commits intodevfrom
fix/bump-run

Conversation

@joellidin
Copy link
Copy Markdown
Collaborator

@joellidin joellidin commented Jan 13, 2026

  • (hparams) Update checkpoint init to v2.1.24
  • (comms) Add timeout and rate limit handling
  • Bump run version

Description

Related Issue(s)

  • Closes #[issue number]

Type of Change

  • Feature (adding new functionality)
  • Fix (resolving a bug or issue)
  • Docs (documentation updates)
  • Refactor (code changes that don't affect functionality)
  • Maintenance (dependency updates or other maintenance)
  • Tests (adding or improving tests)
  • Breaking change (fix or feature with incompatible API changes)
  • Other: _____

Branch Naming

  • My branch follows the project's naming convention (e.g., feature/add-new-capability)

Commit Messages

  • My commits are small, atomic, and have proper commit messages
  • Commit messages are in imperative mood with a capitalized summary under 50 chars

Code Quality

  • I've performed a self-review of my code
  • I've added appropriate docstrings following the project's conventions
  • I've added proper logging where necessary (without trailing periods)
  • I've applied linting and formatting with Ruff
  • My code generates no new warnings

Testing

  • I've added tests for new functionality or bug fixes
  • All tests pass locally with my changes
  • Test coverage has not decreased

Documentation

  • I've updated documentation to reflect my changes
  • I've updated comments in hard-to-understand areas

If this is a breaking change

Screenshots/Examples

Additional Notes

Summary by CodeRabbit

  • Version Updates

    • Package version bumped to 2.1.25
    • Updated model checkpoint initialization parameters
  • Bug Fixes

    • Enhanced timeout detection and handling for download operations
    • Implemented rate-limit detection with specific logging for throttled requests
    • Enforced maximum timeout boundaries on batch download operations
    • Improved error handling consistency across download workflows

✏️ Tip: You can customize this high-level summary in your review settings.

Bump checkpoint_init_version from 2.1.18 to 2.1.24 and update
checkpoint_init_window from 62834 to 63571 to use the latest stable
checkpoint for evaluation initialization.

The changes detected were:
- checkpoint_init_version: "2.1.18" → "2.1.24"
- checkpoint_init_window: 62834 → 63571
Improve error handling for chunk downloads to better handle rate
limiting and prevent indefinite hangs.

- Add separate asyncio.TimeoutError handler with clearer logging
- Detect rate limiting by checking for 429, throttle, 503, etc.
- Add 10-minute overall timeout for gather downloads
- Prevent indefinite hangs from rate limiting or network issues
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Jan 13, 2026

Walkthrough

This PR updates hyperparameters in hparams.json (checkpoint_init_version and checkpoint_init_window values), bumps the package version from 2.1.24 to 2.1.25, and enhances download error handling in comms.py with explicit asyncio timeout management and rate-limit detection logic.

Changes

Cohort / File(s) Summary
Version & Configuration Updates
hparams/hparams.json, src/tplr/__init__.py
Updated checkpoint_init_version from "2.1.18" to "2.1.24" and checkpoint_init_window from 62834 to 63571; bumped package version to 2.1.25.
Download Error Handling
src/tplr/comms.py
Added explicit asyncio.TimeoutError handling for per-chunk downloads; introduced rate-limit detection via error message inspection (e.g., "429", "throttl", "rate limit", "503"); enforced maximum total timeout (600 seconds) using asyncio.wait_for around batch gather operations; mirrored timeout and rate-limit refinements across both per-chunk and gather-based download workflows.

Possibly related PRs

  • PR #643: Modifies src/tplr/comms.py to add explicit timeout handling for downloads — overlapping work on timeout enforcement.
  • PR #648: Updates the same files (hparams/hparams.json checkpoint parameters and src/tplr/__init__.py version) — direct configuration and version overlap.
  • PR #595: Makes overlapping edits to src/tplr/comms.py download flow and updates hparams/version metadata — related to both error handling and version management.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

Whiskers twitching with timeout delight,
Rate limits caught in the network's sight!
Asyncio waits with a 600-second bound,
Per-chunk refinement—no drops to be found! 🐰⚡

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description provides a bulleted summary of changes but leaves most template sections unchecked and blank, including Type of Change, Branch Naming, Commit Messages, Code Quality, Testing, and Documentation—critical for review clarity. Complete the required template sections: check the relevant Type of Change boxes, confirm branch naming convention compliance, and verify code quality, testing, and documentation requirements were met.
Title check ❓ Inconclusive The title is vague and generic—'fix/bump run' uses non-descriptive terms that don't convey what was actually changed (hparams update, timeout/rate-limit handling, version bump). Revise the title to be more specific and descriptive of the main change, e.g., 'Add timeout and rate-limit handling to comms with hparams and version updates' or focus on the primary fix.
✅ Passed checks (1 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

🧹 Recent nitpick comments
src/tplr/comms.py (2)

1098-1130: Improved error handling for chunk downloads.

The separation of asyncio.TimeoutError handling and rate-limit detection improves observability. A few observations:

  1. The rate-limit detection at line 1121 logs at ERROR level, but since this is an expected transient condition that triggers retry logic, consider using WARNING level instead to reduce noise in error monitoring.

  2. The string-based rate-limit detection is pragmatic but could miss some indicators or match false positives. This is acceptable given the fallback to generic error handling.

Consider using WARNING level for rate-limit detection
                             if any(
                                 indicator in error_str
                                 for indicator in [
                                     "429",
                                     "throttl",
                                     "slow down",
                                     "rate limit",
                                     "503",
                                     "service unavailable",
                                     "too many",
                                 ]
                             ):
-                                tplr.logger.error(
+                                tplr.logger.warning(
                                     f"RATE LIMITING DETECTED downloading chunk {chunk_number}: {e}"
                                 )

1729-1742: Good defensive timeout for gather operations.

The 10-minute overall timeout prevents indefinite hangs from rate limiting or network issues, which is a solid reliability improvement. The implementation correctly wraps the gather call and returns None on timeout, which is consistent with other failure modes.

Optional consideration: The 600-second timeout is hardcoded. For flexibility across different deployment scenarios (e.g., slower networks, larger peer counts), consider making this configurable via environment variable or hparams, similar to how DOWNLOAD_MAX_WORKERS is handled elsewhere in this file.

Optional: Make gather timeout configurable
-                # Overall timeout for all downloads: 10 minutes max
-                # This prevents indefinite hangs from rate limiting or network issues
-                gather_timeout = 600  # 10 minutes
+                # Overall timeout for all downloads: configurable, default 10 minutes
+                # This prevents indefinite hangs from rate limiting or network issues
+                gather_timeout = int(os.getenv("TPLR_GATHER_TIMEOUT", "600"))
📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3dfeec1 and d59ee56.

📒 Files selected for processing (3)
  • hparams/hparams.json
  • src/tplr/__init__.py
  • src/tplr/comms.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/tplr/comms.py (1)
tests/test_dcp_checkpoint.py (2)
  • warning (117-118)
  • error (123-124)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.11)
🔇 Additional comments (2)
hparams/hparams.json (1)

56-57: LGTM! Checkpoint initialization parameters updated.

The checkpoint configuration is updated to use a newer stable checkpoint (version 2.1.24 at window 63571). This aligns with the PR objective of updating checkpoint initialization.

src/tplr/__init__.py (1)

23-23: LGTM! Version bump is appropriate.

The version increment from 2.1.24 to 2.1.25 correctly reflects the behavioral changes introduced in this PR (timeout and rate-limit handling improvements in comms.py).


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov bot commented Jan 13, 2026

Codecov Report

❌ Patch coverage is 70.58824% with 5 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/tplr/comms.py 68.75% 5 Missing ⚠️

❌ Your patch status has failed because the patch coverage (70.58%) is below the target coverage (85.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project status has failed because the head coverage (57.69%) is below the target coverage (85.00%). You can increase the head coverage or adjust the target coverage.

Impacted file tree graph

@@            Coverage Diff             @@
##              dev     #681      +/-   ##
==========================================
- Coverage   57.74%   57.69%   -0.06%     
==========================================
  Files          27       27              
  Lines        4977     4990      +13     
==========================================
+ Hits         2874     2879       +5     
- Misses       2103     2111       +8     
Files with missing lines Coverage Δ
src/tplr/__init__.py 100.00% <100.00%> (ø)
src/tplr/comms.py 65.15% <68.75%> (-0.31%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@joellidin joellidin merged commit 4b59214 into dev Jan 13, 2026
6 of 8 checks passed
@joellidin joellidin deleted the fix/bump-run branch January 13, 2026 23:19
@coderabbitai coderabbitai bot mentioned this pull request Jan 13, 2026
21 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant