Skip to content

Add optional HuggingFace Hub checkpoint uploading#95

Open
forklady42 wants to merge 11 commits intomainfrom
betsy/huggingface
Open

Add optional HuggingFace Hub checkpoint uploading#95
forklady42 wants to merge 11 commits intomainfrom
betsy/huggingface

Conversation

@forklady42
Copy link
Collaborator

@forklady42 forklady42 commented Mar 5, 2026

Summary

  • Adds a HuggingFaceCallback that queues saved checkpoints for upload to HuggingFace Hub via a JSON manifest
  • Adds electrai hf-push CLI command to upload pending checkpoints from a login node (for slurm clusters where compute nodes lack internet)
  • Supports immediate upload mode (upload_immediate: true) for nodes with connectivity
  • huggingface-hub added as an optional dependency (uv sync --extra hf)

Config

hf:
  repo_id: your-username/your-repo  # must already exist
  upload_every_n_epochs: 5
  upload_immediate: false

Test plan

  • Verify training works without hf config section (no-op)
  • Verify manifest is written when hf.repo_id is set
  • Test electrai hf-push --ckpt-path <path> uploads from manifest
  • Test upload_immediate: true on a node with internet access
  • Confirm upload failures during training are logged but don't crash

🤖 Generated with Claude Code

forklady42 and others added 2 commits March 5, 2026 10:25
Supports deferred uploads for clusters without internet (e.g. Della)
via a JSON manifest and `electrai hf-push` CLI command. Immediate
upload mode available for nodes with connectivity.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Upload last.ckpt as last_epoch{N}.ckpt so previous versions are
preserved across epochs. Switch from on_train_epoch_end to
on_validation_end so checkpoints exist when the upload runs.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@forklady42
Copy link
Collaborator Author

@claude review this PR

@claude
Copy link

claude bot commented Mar 5, 2026

Claude finished @forklady42's task in 1m 33s —— View job


PR Review: Add optional HuggingFace Hub checkpoint uploading

  • Gather context and read changed files
  • Review callback implementation
  • Review CLI entrypoints
  • Review configuration changes
  • Review dependency changes
  • Post findings

Overall this is a clean, well-structured addition. The deferred-upload-via-manifest design is well suited to HPC clusters without compute node internet access. A few issues worth addressing before merging:

Bugs

path is potentially undefined in the except block (hf_upload.py:114-118)

If the from huggingface_hub import upload_file import raises (e.g. package not installed), path has never been assigned, so the except block raises a secondary NameError before it can log the original error:

def _upload_single(entry: dict) -> None:
    try:
        from huggingface_hub import upload_file
        path = Path(entry["path"])   # ← not reached on ImportError
        ...
    except Exception:
        logger.warning(
            "HF upload failed for %s ...",
            path.name,  # ← NameError if import failed
            exc_info=True,
        )

Fix: move path = Path(entry["path"]) above the try block. Fix this →


Feature gap: private is silently ignored

self.private is read from config (hf_upload.py:33) but never passed to upload_file — which wouldn't accept it anyway, since private is a create_repo parameter. Users who set private: true expecting it to enforce privacy will be silently misled.

Options:

  1. Remove the private config key entirely (the PR states the repo must already exist).
  2. If you want to support create_repo, call it with private=self.private and exist_ok=True before the first upload.

Fix this →


Minor issues

Hardcoded last.ckpt filename (hf_upload.py:67)

on_train_epoch_end looks for self.ckpt_path / "last.ckpt", which is the default for ModelCheckpoint. If a user configures ModelCheckpoint(filename="last_{epoch}"), the callback silently skips every epoch. Consider reading the filename from the trainer's checkpoint callback, or at least documenting this assumption.

epoch=-1 magic value (hf_upload.py:84)

The sentinel epoch=-1 used for best-checkpoint entries found at on_train_end is not documented. A comment or a more explicit sentinel (e.g. None) would make intent clearer.

on_train_end calls _save_manifest() unconditionally (hf_upload.py:87)

_save_manifest() is already called inside _queue_checkpoint. The extra call at the end of the loop is needed to persist uploaded: True from immediate uploads, but it also fires when no checkpoints were queued (harmless but wasteful). Consider only calling it when the manifest was actually modified.


Missing tests

No tests are added for HuggingFaceCallback or hf_push. Even basic smoke tests (e.g., manifest is created, uploaded flag is flipped, no crash when huggingface_hub is absent) would be valuable given the complexity of the deferred-upload logic.


Nits

  • callbacks/__init__.py is empty except for the from __future__ import — fine, but nothing is exported. Consider re-exporting HuggingFaceCallback here for discoverability.
  • hf_push uses logger.error for the missing-manifest case but returns without raising. For a CLI command, raising a SystemExit or at least printing to stderr would be more user-friendly.
  • The else: raise ValueError(f"Unknown command: {args.command}") branch in main.py:52 is unreachable because argparse with required=True subparsers already handles unknown commands.

- Fix potential NameError in _upload_single by moving path assignment above try
- Remove unused `private` config key (repo must already exist)
- Replace epoch=-1 magic value with None
- Only call _save_manifest() in on_train_end when immediate uploads occurred
- Raise SystemExit instead of silent logger.error in hf_push for missing manifest

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@forklady42
Copy link
Collaborator Author

@claude review again

@claude
Copy link

claude bot commented Mar 5, 2026

Claude finished @forklady42's task in 1m 58s —— View job


PR Review: Add optional HuggingFace Hub checkpoint uploading

  • Gather context
  • Read changed files
  • Analyze and post review

Good progress — the three commits since the initial review have addressed all the bugs. Here's the updated status:

Fixed since last review ✅

Issue Status
path undefined in except block Fixed — path is now assigned before the try block (hf_upload.py:111)
private field silently ignored Fixed — option removed entirely from code and config examples
epoch=-1 magic sentinel Fixed — now uses epoch=None
_save_manifest() called unconditionally in on_train_end Fixed — now only called when had_immediate is True
hf_push using logger.error without raising Fixed — now uses raise SystemExit(...)

Remaining issues

Hardcoded last.ckpt filename (hf_upload.py:71)

last_ckpt = self.ckpt_path / "last.ckpt"

This happens to work because train.py passes save_last=True to ModelCheckpoint with its default filename, but the coupling is implicit and fragile. Worth a comment at minimum. Fix this →

Unreachable else branch (main.py:52-53)

else:
    raise ValueError(f"Unknown command: {args.command}")

argparse with required=True subparsers already handles unknown commands and exits before reaching this point. Safe to remove. Fix this →

Missing tests

No tests cover HuggingFaceCallback or hf_push. The deferred-upload logic (manifest creation, uploaded flag flipping, graceful handling when huggingface_hub is absent) would benefit from at least basic smoke tests alongside the existing tests in tests/electrai/. Fix this →


Nit

callbacks/__init__.py is currently empty (just the from __future__ import). Re-exporting HuggingFaceCallback there would make imports more ergonomic: from electrai.callbacks import HuggingFaceCallback.


Overall the implementation is solid and the bugs from the previous review are resolved. The remaining items are minor — the unreachable branch is a cleanup, the hardcoded filename just needs a comment, and tests are the main gap before merging.

@ryan-williams ryan-williams requested a review from Copilot March 5, 2026 20:41
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds optional HuggingFace Hub integration for checkpoint uploads, supporting both deferred uploads (via a manifest) and a CLI workflow for offline compute nodes.

Changes:

  • Add HuggingFaceCallback to queue checkpoints into a JSON manifest and optionally upload immediately.
  • Add electrai hf-push CLI subcommand to upload queued checkpoints from a login node.
  • Add optional huggingface-hub dependency via extras and document HF config in sample YAMLs.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/electrai/entrypoints/train.py Wires an optional HF callback into Lightning callbacks based on config.
src/electrai/entrypoints/main.py Adds hf-push CLI command for uploading from a checkpoint directory.
src/electrai/configs/MP/config_resunet.yaml Documents optional HF config block in an example config.
src/electrai/configs/MP/config_resnet.yaml Documents optional HF config block in an example config.
src/electrai/callbacks/hf_upload.py Implements manifest-based queuing and upload logic plus CLI entry function.
src/electrai/callbacks/init.py Initializes callbacks package (future annotations).
pyproject.toml Adds hf extra for huggingface-hub dependency.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

forklady42 and others added 4 commits March 5, 2026 15:44
- Use 1-indexed epoch in uploaded filename to match upload cadence
- Copy last.ckpt to stable epoch-specific file at queue time so deferred
  hf-push uploads the correct snapshot after last.ckpt is overwritten

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Remove unreachable else branch in main.py (argparse handles unknown commands)
- Re-export HuggingFaceCallback from callbacks __init__ for ergonomic imports

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@forklady42
Copy link
Collaborator Author

@claude review once more. Do not worry about tests at this point

@claude
Copy link

claude bot commented Mar 9, 2026

Claude finished @forklady42's task in 2m 0s —— View job


PR Review: Add optional HuggingFace Hub checkpoint uploading

  • Gather context
  • Read changed files
  • Analyze and post review

Good progress since the last review. All previously reported bugs are resolved and several of the Copilot review points are also addressed. Here's the current state:

Fixed since last review ✅

Issue Status
path undefined in except block Fixed
private field silently ignored Fixed
epoch=-1 magic sentinel Fixed
_save_manifest() called unconditionally in on_train_end Fixed
hf_push using logger.error without raising Fixed
Unreachable else branch in main.py Fixed — removed entirely
Docstring config key mismatch (hf_upload_immediatehf.upload_immediate) Fixed
Epoch numbering inconsistency (gating uses epoch+1, filename did not) Fixed — filename now uses epoch + 1
Unsafe deferred upload of last.ckpt (overwritten each epoch) Fixed — copies to last_epoch{N:03d}.ckpt before queuing
callbacks/__init__.py exports nothing Fixed — now re-exports HuggingFaceCallback

Remaining issues

Path.open() class method style in hf_push (hf_upload.py:149,161)

The standalone hf_push function uses the unusual class-method form, and is also missing encoding="utf-8" that the rest of the file correctly specifies:

# Current (lines 149, 161) — atypical and inconsistent with rest of file
with Path.open(manifest_path) as f:
    ...
with Path.open(manifest_path, "w") as f:
    ...

# Should be (matching _load_existing_manifest / _save_manifest)
with manifest_path.open(encoding="utf-8") as f:
    ...
with manifest_path.open("w", encoding="utf-8") as f:
    ...

Fix this →


hf_push CLI output is silent unless logging is configured (hf_upload.py:154-168)

hf_push communicates entirely via logger.info / logger.warning, but the electrai CLI sets up no logging handlers. In practice, users running electrai hf-push will see no output at all — not even confirmation that uploads succeeded. Consider using print() or configuring a basic logging handler (e.g. logging.basicConfig(level=logging.INFO)) at the top of main().

Fix this →


Stable copies accumulate on disk and are never cleaned up (hf_upload.py:78-86)

Every upload_every_n_epochs interval, last.ckpt is copied to last_epoch{N:03d}.ckpt. For a 500-epoch run with upload_every_n_epochs: 5, that's 100 copies of the last checkpoint accumulating on disk. When upload_immediate: true, these copies have served their purpose after the upload but are never removed. Worth at least a note in the docstring or config comment so users are aware of the disk usage, or optionally delete the stable copy after a successful immediate upload.


Overall

The implementation is solid and all the substantive bugs from previous reviews are resolved. The three remaining items above are the blockers I'd want addressed before merging: the Path.open style inconsistency is a straightforward fix, the silent CLI output is a usability issue, and the disk accumulation is at minimum worth documenting.
|

- Delete stable copies after successful immediate upload
- Add --clean flag to hf-push to delete local copies after deferred upload
- Add logging.basicConfig to main() for CLI log visibility
- Fix Path.open() style and add encoding="utf-8" in hf_push

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@forklady42
Copy link
Collaborator Author

@claude review again

@claude
Copy link

claude bot commented Mar 9, 2026

Claude finished @forklady42's task in 1m 47s —— View job


PR Review: Add optional HuggingFace Hub checkpoint uploading

  • Gather context
  • Read changed files
  • Analyze and post review

All three issues from the last review are resolved. This PR is in good shape.

Fixed since last review ✅

Issue Status
Path.open() class method style in hf_push Fixed — hf_push now uses manifest_path.open(encoding="utf-8") and manifest_path.open("w", encoding="utf-8") consistently (hf_upload.py:151,165)
Silent CLI output (logger.info with no handlers) Fixed — logging.basicConfig(level=logging.INFO) added to main() (main.py:27)
Stable last_epoch*.ckpt copies accumulate indefinitely Fixed — stable copy is removed after a successful immediate upload (hf_upload.py:86-87)

One remaining nit

--clean deletes best checkpoints without calling this out explicitly (main.py:43-46, hf_upload.py:162-163)

The help text reads "Delete local copies after successful upload", which is accurate, but it may surprise users that --clean also deletes the ckpt_*.ckpt best-model checkpoints queued in on_train_end — not just the last_epoch*.ckpt stable copies. Consider adding a brief note, e.g.:

--clean  Delete local checkpoint files after successful upload
         (includes best-model checkpoints — use with care)

This is a nit; the behaviour is intentional and opt-in.


Overall

The implementation is solid, all substantive bugs and consistency issues from previous reviews are resolved, and the design (manifest-based deferred upload, stable copies for snapshot safety, logging.basicConfig for usable CLI output) is well thought-out. Good to merge once the --clean help text is updated, or even without it if you're comfortable with the current wording.

train(args)
elif args.command == "test":
test(args)
else:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed because argparse with required=True subparsers already handles unknown commands

@forklady42
Copy link
Collaborator Author

The bot's nits are becoming tiny, so I'm calling it and marking this for human review.

@forklady42 forklady42 marked this pull request as ready for review March 9, 2026 16:04
@ryan-williams ryan-williams requested a review from Copilot March 9, 2026 19:55
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Collaborator

@ryan-williams ryan-williams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worked with Claude on the @rbw-bot replies above, seems like at least one impt change requested, lmk what you think!

ryan-williams and others added 2 commits March 18, 2026 10:29
- Switch from `on_validation_end` to `on_train_epoch_end` so
  `last.ckpt` is current (Lightning reorders `ModelCheckpoint` to
  run last in `on_validation_end`, making the copy stale).
- Separate `ImportError` from upload failures in `_upload_single`
  with a clear "install huggingface-hub" message.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +63 to +82
def on_train_epoch_end(self, trainer, pl_module) -> None: # noqa: ARG002
if trainer.sanity_checking:
return
epoch = trainer.current_epoch
if (epoch + 1) % self.every_n_epochs != 0:
return
if trainer.global_rank != 0:
return

last_ckpt = self.ckpt_path / "last.ckpt"
if not last_ckpt.exists():
return

# Copy to a stable filename so later hf-push uploads the correct
# snapshot even after last.ckpt is overwritten by subsequent epochs.
stable_name = f"last_epoch{epoch + 1:03d}.ckpt"
stable_path = self.ckpt_path / stable_name
shutil.copy2(last_ckpt, stable_path)

self._queue_checkpoint(stable_path, epoch, path_in_repo=stable_name)
Comment on lines +49 to +61
def _queue_checkpoint(
self, ckpt_file: Path, epoch: int | None, *, path_in_repo: str | None = None
) -> None:
entry = {
"path": str(ckpt_file),
"path_in_repo": path_in_repo or ckpt_file.name,
"epoch": epoch,
"repo_id": self.repo_id,
"uploaded": False,
}
self._manifest.append(entry)
self._save_manifest()
logger.info("Queued checkpoint for HF upload: %s", ckpt_file.name)
Comment on lines +147 to +167
def hf_push(ckpt_path: str, *, clean: bool = False) -> None:
"""Upload pending checkpoints from a manifest file.

Run this from a login node or machine with internet access.
"""
ckpt_dir = Path(ckpt_path)
manifest_path = ckpt_dir / MANIFEST_FILENAME
if not manifest_path.exists():
raise SystemExit(f"No manifest found at {manifest_path}")

with manifest_path.open(encoding="utf-8") as f:
manifest = json.load(f)

pending = [e for e in manifest if not e["uploaded"]]
if not pending:
logger.info("All checkpoints already uploaded.")
return

logger.info("Uploading %d pending checkpoint(s)...", len(pending))
for entry in pending:
_upload_single(entry)
stable_path = self.ckpt_path / stable_name
shutil.copy2(last_ckpt, stable_path)

self._queue_checkpoint(stable_path, epoch, path_in_repo=stable_name)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants