@trvachov trvachov commented Jul 12, 2025

What does this PR do?

The current export_ckpt code first attempts a ModelOpt checkpoint load and, if that fails, falls back to the usual checkpoint load. It assumes that a failed ModelOpt load returns None, but that isn't always the case: sometimes the load fails by raising an exception. This PR wraps the attempt in a try/except block so that an exception also triggers the fallback.

This solution is not ideal. The PR is meant to get the discussion started and to unblock BioNeMo, which is currently stuck on this.
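
To make the intended behavior concrete, here is a minimal sketch of the fallback pattern described above. It is not the actual export_ckpt code: `load_with_fallback` and the stand-in loader functions are hypothetical names used only for illustration. The sketch assumes the first load attempt can signal failure either by returning None or by raising RuntimeError, and that any other exception type should still propagate.

```python
from typing import Any, Callable, Optional

# Hypothetical loader signatures, for illustration only (not NeMo APIs).
PrimaryLoader = Callable[[str], Optional[Any]]
FallbackLoader = Callable[[str], Any]


def load_with_fallback(path: str, primary_load: PrimaryLoader, fallback_load: FallbackLoader) -> Any:
    """Try the primary loader; use the fallback if it returns None or raises RuntimeError."""
    try:
        result = primary_load(path)
    except RuntimeError:
        # Treat a RuntimeError the same way as a None result.
        result = None
    if result is None:
        result = fallback_load(path)
    return result


# Stand-in loaders that mimic the failure modes discussed in this PR.
def primary_returns_none(path: str) -> Optional[Any]:
    return None


def primary_raises(path: str) -> Optional[Any]:
    raise RuntimeError("checkpoint is not in the expected format")


def primary_succeeds(path: str) -> Optional[Any]:
    return {"source": "primary", "path": path}


def standard_load(path: str) -> Any:
    return {"source": "standard", "path": path}


if __name__ == "__main__":
    for primary in (primary_returns_none, primary_raises, primary_succeeds):
        loaded = load_with_fallback("some/ckpt", primary, standard_load)
        print(primary.__name__, "->", loaded["source"])
```

Catching only RuntimeError keeps unrelated failures (for example a ValueError from a malformed argument) visible instead of silently routing them to the fallback.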

Collection: All

Changelog

  • Allow an exception in the HF checkpoint load attempt before falling back to the standard load strategy.
  • Make the caught exception specific to RuntimeError.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

@jenchen13 jenchen13 self-requested a review July 14, 2025 14:11
@trvachov

@ericharper @ko3n1g can you help me understand the codecov issue?

@maanug-nv maanug-nv force-pushed the trvachov/cpkt_load_fix branch from f18ccbd to 20f7296 Compare July 16, 2025 22:51
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Jul 16, 2025
@ericharper ericharper enabled auto-merge (squash) July 17, 2025 00:11
@ericharper ericharper merged commit 164d12b into NVIDIA-NeMo:main Jul 17, 2025
224 checks passed
github-merge-queue bot pushed a commit to NVIDIA/bionemo-framework that referenced this pull request Jul 18, 2025
### Description
Point NeMo back to the `main` branch now that the PR fixing the BioNeMo
checkpoint load has been merged (NVIDIA-NeMo/NeMo#14214).

### Type of changes
<!-- Mark the relevant option with an [x] -->

- [ ]  Bug fix (non-breaking change which fixes an issue)
- [ ]  New feature (non-breaking change which adds functionality)
- [x]  Refactor
- [ ]  Documentation update
- [ ]  Other (please describe):

### CI Pipeline Configuration
Configure CI behavior by applying the relevant labels:

- [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
- [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing

> [!NOTE]
> By default, the notebooks validation tests are skipped unless explicitly enabled.

#### Authorizing CI Runs

We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI runs on NVIDIA's compute resources.

* If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a `pull-request/`-prefixed branch in the source repository (e.g. `pull-request/123`).
* If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.

### Usage
<!--- How does a user interact with the changed code -->
```python
# TODO: Add code snippet
```

### Pre-submit Checklist
<!--- Ensure all items are completed before submitting -->

 - [ ] I have tested these changes locally
 - [ ] I have updated the documentation accordingly
 - [ ] I have added/updated tests as needed
 - [ ] All existing tests pass successfully

Signed-off-by: Timur Rvachov <[email protected]>
AmirHussein96 pushed a commit to AmirHussein96/NeMo that referenced this pull request Jul 23, 2025
NVIDIA-NeMo#14214)

* Allow exception in hf ckpt load attempt before fallback to standard load strategy

Signed-off-by: Timur Rvachov <[email protected]>

* Make exception RuntimeError specific

Signed-off-by: Timur Rvachov <[email protected]>

---------

Signed-off-by: Timur Rvachov <[email protected]>
Signed-off-by: Amir Hussein <[email protected]>
monica-sekoyan pushed a commit that referenced this pull request Aug 4, 2025
nasretdinovr pushed a commit to nasretdinovr/NeMo that referenced this pull request Aug 8, 2025
guyueh1 pushed a commit to guyueh1/NeMo that referenced this pull request Aug 25, 2025
Signed-off-by: Guyue Huang <[email protected]>
5 participants