Add optional HuggingFace Hub checkpoint uploading #95
forklady42 wants to merge 11 commits into main
Supports deferred uploads for clusters without internet (e.g. Della) via a JSON manifest and `electrai hf-push` CLI command. Immediate upload mode available for nodes with connectivity. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Upload last.ckpt as last_epoch{N}.ckpt so previous versions are
preserved across epochs. Switch from on_train_epoch_end to
on_validation_end so checkpoints exist when the upload runs.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
@claude review this PR
Claude finished @forklady42's task in 1m 33s
PR Review: Add optional HuggingFace Hub checkpoint uploading
Overall this is a clean, well-structured addition. The deferred-upload-via-manifest design is well suited to HPC clusters without compute node internet access. A few issues worth addressing before merging:
Bugs
If the `huggingface_hub` import fails inside `_upload_single`, `path` is never assigned, so the `except` handler itself raises a `NameError`:

    def _upload_single(entry: dict) -> None:
        try:
            from huggingface_hub import upload_file
            path = Path(entry["path"])  # ← not reached on ImportError
            ...
        except Exception:
            logger.warning(
                "HF upload failed for %s ...",
                path.name,  # ← NameError if import failed
                exc_info=True,
            )

Fix: move the `path` assignment above the `try`.
Feature gap:
- Fix potential NameError in _upload_single by moving path assignment above try
- Remove unused `private` config key (repo must already exist)
- Replace epoch=-1 magic value with None
- Only call _save_manifest() in on_train_end when immediate uploads occurred
- Raise SystemExit instead of silent logger.error in hf_push for missing manifest
Co-Authored-By: Claude Opus 4.6 <[email protected]>
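The first fix in this commit can be illustrated with a minimal sketch. It is not the PR's actual code: `upload_single` and its injected `upload_fn` parameter are hypothetical names chosen so the example runs without `huggingface_hub` installed. The keyword arguments mirror those of `huggingface_hub.upload_file`, which is what a real caller would presumably pass in.

```python
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)


def upload_single(entry: dict, upload_fn: Callable[..., object]) -> bool:
    """Upload one manifest entry; return True on success.

    `upload_fn` is a hypothetical injection point standing in for
    `huggingface_hub.upload_file`.
    """
    # Assigned before the try block, so the except handler can always
    # reference it, whatever fails inside the try.
    path = Path(entry["path"])
    try:
        upload_fn(
            path_or_fileobj=str(path),
            path_in_repo=entry["path_in_repo"],
            repo_id=entry["repo_id"],
        )
    except Exception:
        logger.warning("HF upload failed for %s", path.name, exc_info=True)
        return False
    entry["uploaded"] = True
    return True
```

Because `path` is bound before the `try`, the warning can name the file even when the failure happens on the very first statement inside the block.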
@claude review again
Claude finished @forklady42's task in 1m 58s
PR Review: Add optional HuggingFace Hub checkpoint uploading
Good progress: the three commits since the initial review have addressed all the bugs. Here's the updated status:
Fixed since last review ✅
Remaining issues
Hardcoded filename

    last_ckpt = self.ckpt_path / "last.ckpt"

This happens to work because
Unreachable `else` branch

    else:
        raise ValueError(f"Unknown command: {args.command}")

Missing tests
No tests cover
Nit
Overall the implementation is solid and the bugs from the previous review are resolved. The remaining items are minor: the unreachable branch is a cleanup, the hardcoded filename just needs a comment, and tests are the main gap before merging.
Pull request overview
Adds optional HuggingFace Hub integration for checkpoint uploads, supporting both deferred uploads (via a manifest) and a CLI workflow for offline compute nodes.
Changes:
- Add `HuggingFaceCallback` to queue checkpoints into a JSON manifest and optionally upload immediately.
- Add `electrai hf-push` CLI subcommand to upload queued checkpoints from a login node.
- Add optional `huggingface-hub` dependency via extras and document HF config in sample YAMLs.
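The queue-then-push flow summarised above could look roughly like the following sketch. It is an illustration under stated assumptions, not the PR's implementation: `queue_checkpoint`, `push_pending`, the injected `upload_fn`, and the manifest filename value are all hypothetical; only the entry fields (`path`, `path_in_repo`, `repo_id`, `uploaded`) come from the diff shown in this PR.

```python
import json
from pathlib import Path
from typing import Callable

MANIFEST_FILENAME = "hf_manifest.json"  # hypothetical value for illustration


def queue_checkpoint(ckpt_dir: Path, ckpt_file: Path, repo_id: str) -> None:
    """Append a pending entry to the manifest (safe on an offline node)."""
    manifest_path = ckpt_dir / MANIFEST_FILENAME
    manifest = []
    if manifest_path.exists():
        with manifest_path.open(encoding="utf-8") as f:
            manifest = json.load(f)
    manifest.append(
        {
            "path": str(ckpt_file),
            "path_in_repo": ckpt_file.name,
            "repo_id": repo_id,
            "uploaded": False,
        }
    )
    with manifest_path.open("w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)


def push_pending(ckpt_dir: Path, upload_fn: Callable[[dict], None]) -> int:
    """Upload every pending entry; return how many uploads succeeded."""
    manifest_path = ckpt_dir / MANIFEST_FILENAME
    with manifest_path.open(encoding="utf-8") as f:
        manifest = json.load(f)
    uploaded = 0
    for entry in manifest:
        if entry["uploaded"]:
            continue
        try:
            upload_fn(entry)
        except Exception:
            continue  # leave the entry pending so a later run can retry
        entry["uploaded"] = True
        uploaded += 1
    # Persist the updated flags so a second run skips finished entries.
    with manifest_path.open("w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)
    return uploaded
```

Keeping the `uploaded` flag in the manifest makes the push step idempotent: rerunning it from a login node only retries entries that have not gone through yet.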
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/electrai/entrypoints/train.py | Wires an optional HF callback into Lightning callbacks based on config. |
| src/electrai/entrypoints/main.py | Adds hf-push CLI command for uploading from a checkpoint directory. |
| src/electrai/configs/MP/config_resunet.yaml | Documents optional HF config block in an example config. |
| src/electrai/configs/MP/config_resnet.yaml | Documents optional HF config block in an example config. |
| src/electrai/callbacks/hf_upload.py | Implements manifest-based queuing and upload logic plus CLI entry function. |
| src/electrai/callbacks/__init__.py | Initializes callbacks package (future annotations). |
| pyproject.toml | Adds hf extra for huggingface-hub dependency. |
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
- Use 1-indexed epoch in uploaded filename to match upload cadence
- Copy last.ckpt to stable epoch-specific file at queue time so deferred hf-push uploads the correct snapshot after last.ckpt is overwritten
Co-Authored-By: Claude Opus 4.6 <[email protected]>
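A small sketch of the snapshot step described in this commit, assuming the naming pattern `last_epoch{epoch + 1:03d}.ckpt` that appears in the diff. The function names `stable_checkpoint_name` and `snapshot_last_checkpoint` are hypothetical and exist only for this illustration.

```python
import shutil
from pathlib import Path
from typing import Optional


def stable_checkpoint_name(epoch: int) -> str:
    """Map a 0-indexed epoch to a 1-indexed, zero-padded filename."""
    return f"last_epoch{epoch + 1:03d}.ckpt"


def snapshot_last_checkpoint(ckpt_dir: Path, epoch: int) -> Optional[Path]:
    """Copy last.ckpt to an epoch-specific file; return None if it is missing."""
    last_ckpt = ckpt_dir / "last.ckpt"
    if not last_ckpt.exists():
        return None
    stable_path = ckpt_dir / stable_checkpoint_name(epoch)
    # copy2 also preserves file metadata such as modification time.
    shutil.copy2(last_ckpt, stable_path)
    return stable_path
```

The copy is what lets a deferred push upload the state from the epoch that was queued, even though `last.ckpt` keeps being overwritten by later epochs.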
- Remove unreachable else branch in main.py (argparse handles unknown commands)
- Re-export HuggingFaceCallback from callbacks __init__ for ergonomic imports
Co-Authored-By: Claude Opus 4.6 <[email protected]>
@claude review once more. Do not worry about tests at this point
Claude finished @forklady42's task in 2m 0s
PR Review: Add optional HuggingFace Hub checkpoint uploading
Good progress since the last review. All previously reported bugs are resolved, and several of the Copilot review points are also addressed. Here's the current state:
Fixed since last review ✅
Remaining issues
The standalone `hf_push` function opens the manifest with `Path.open(manifest_path)` and no explicit encoding, which is atypical and inconsistent with the rest of the file:

    # Current (lines 149, 161): atypical and inconsistent with rest of file
    with Path.open(manifest_path) as f:
        ...
    with Path.open(manifest_path, "w") as f:
        ...
    # Should be (matching _load_existing_manifest / _save_manifest)
    with manifest_path.open(encoding="utf-8") as f:
        ...
    with manifest_path.open("w", encoding="utf-8") as f:
        ...

Stable copies accumulate on disk and are never cleaned up.
Overall
The implementation is solid and all the substantive bugs from previous reviews are resolved. The three remaining items above are the blockers I'd want addressed before merging.
- Delete stable copies after successful immediate upload
- Add --clean flag to hf-push to delete local copies after deferred upload
- Add logging.basicConfig to main() for CLI log visibility
- Fix Path.open() style and add encoding="utf-8" in hf_push
Co-Authored-By: Claude Opus 4.6 <[email protected]>
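The cleanup behaviour added here might be sketched as below. This is an assumption about how such a step could work, not the PR's code: `clean_uploaded` is a hypothetical helper, and it relies only on the `path` and `uploaded` manifest fields shown in the diff.

```python
from pathlib import Path
from typing import List


def clean_uploaded(manifest: List[dict]) -> List[Path]:
    """Delete local files whose manifest entry is marked uploaded.

    Entries that are still pending are left untouched, so a failed
    upload never costs the only local copy.
    """
    removed = []
    for entry in manifest:
        path = Path(entry["path"])
        if entry["uploaded"] and path.exists():
            path.unlink()
            removed.append(path)
    return removed
```

Gating deletion on the `uploaded` flag keeps the operation opt-in and safe to repeat: a second call finds nothing left to remove.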
@claude review again
Claude finished @forklady42's task in 1m 47s
PR Review: Add optional HuggingFace Hub checkpoint uploading
All three issues from the last review are resolved. This PR is in good shape.
Fixed since last review ✅
One remaining nit
The help text reads "Delete local copies after successful upload", which is accurate but may surprise users. This is a nit; the behaviour is intentional and opt-in.
Overall
The implementation is solid, and all substantive bugs and consistency issues from previous reviews are resolved. The design includes manifest-based deferred upload and stable copies for snapshot safety.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
        train(args)
    elif args.command == "test":
        test(args)
    else:

Removed because argparse with required=True subparsers already handles unknown commands
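The argparse behaviour referred to here can be checked with a short sketch. The parser below is a hypothetical reconstruction: the subcommand names and the `--ckpt-path` and `--clean` options come from this PR, but the real `train` and `test` subcommands presumably take arguments that are not shown.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser whose subcommand is mandatory."""
    parser = argparse.ArgumentParser(prog="electrai")
    # required=True makes argparse exit with an error when the subcommand
    # is missing; an unrecognised name is rejected as an invalid choice.
    subparsers = parser.add_subparsers(dest="command", required=True)
    subparsers.add_parser("train")
    subparsers.add_parser("test")
    hf_push = subparsers.add_parser("hf-push")
    hf_push.add_argument("--ckpt-path", required=True)
    hf_push.add_argument("--clean", action="store_true")
    return parser
```

Since argparse exits before any dispatch code runs, an `if`/`elif` chain over `args.command` only ever sees one of the registered names, which is why a trailing `else: raise ValueError(...)` cannot be reached.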
The bot's nits are becoming tiny, so I'm calling it and marking this for human review.
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.
ryan-williams left a comment
I worked with Claude on the @rbw-bot replies above, seems like at least one impt change requested, lmk what you think!
# Conflicts: # uv.lock
- Switch from `on_validation_end` to `on_train_epoch_end` so `last.ckpt` is current (Lightning reorders `ModelCheckpoint` to run last in `on_validation_end`, making the copy stale).
- Separate `ImportError` from upload failures in `_upload_single` with a clear "install huggingface-hub" message.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.
    def on_train_epoch_end(self, trainer, pl_module) -> None:  # noqa: ARG002
        if trainer.sanity_checking:
            return
        epoch = trainer.current_epoch
        if (epoch + 1) % self.every_n_epochs != 0:
            return
        if trainer.global_rank != 0:
            return

        last_ckpt = self.ckpt_path / "last.ckpt"
        if not last_ckpt.exists():
            return

        # Copy to a stable filename so later hf-push uploads the correct
        # snapshot even after last.ckpt is overwritten by subsequent epochs.
        stable_name = f"last_epoch{epoch + 1:03d}.ckpt"
        stable_path = self.ckpt_path / stable_name
        shutil.copy2(last_ckpt, stable_path)

        self._queue_checkpoint(stable_path, epoch, path_in_repo=stable_name)

    def _queue_checkpoint(
        self, ckpt_file: Path, epoch: int | None, *, path_in_repo: str | None = None
    ) -> None:
        entry = {
            "path": str(ckpt_file),
            "path_in_repo": path_in_repo or ckpt_file.name,
            "epoch": epoch,
            "repo_id": self.repo_id,
            "uploaded": False,
        }
        self._manifest.append(entry)
        self._save_manifest()
        logger.info("Queued checkpoint for HF upload: %s", ckpt_file.name)

    def hf_push(ckpt_path: str, *, clean: bool = False) -> None:
        """Upload pending checkpoints from a manifest file.

        Run this from a login node or machine with internet access.
        """
        ckpt_dir = Path(ckpt_path)
        manifest_path = ckpt_dir / MANIFEST_FILENAME
        if not manifest_path.exists():
            raise SystemExit(f"No manifest found at {manifest_path}")

        with manifest_path.open(encoding="utf-8") as f:
            manifest = json.load(f)

        pending = [e for e in manifest if not e["uploaded"]]
        if not pending:
            logger.info("All checkpoints already uploaded.")
            return

        logger.info("Uploading %d pending checkpoint(s)...", len(pending))
        for entry in pending:
            _upload_single(entry)

Summary
- `HuggingFaceCallback` that queues saved checkpoints for upload to HuggingFace Hub via a JSON manifest
- `electrai hf-push` CLI command to upload pending checkpoints from a login node (for slurm clusters where compute nodes lack internet)
- Immediate upload mode (`upload_immediate: true`) for nodes with connectivity
- `huggingface-hub` added as an optional dependency (`uv sync --extra hf`)
Config
Test plan
- `hf` config section (no-op)
- `hf.repo_id` is set
- `electrai hf-push --ckpt-path <path>` uploads from manifest
- `upload_immediate: true` on a node with internet access
🤖 Generated with Claude Code