-
Notifications
You must be signed in to change notification settings - Fork 98
[WIP] Update base dependencies #1194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Timur Rvachov <[email protected]>
WalkthroughThe PR removes the bionemo-geometric sub-package (code, tests, config, docs, and ownership), updates project configs to drop references, and revises the Dockerfile to new base image and build flows for causal-conv1d, Mamba, and bitsandbytes. It also refreshes the secret baseline timestamp and removes a legacy TE patch. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Dev as Docker Build
participant Img as Base Image (nvcr.io/nvidia/pytorch:25.08-py3)
participant Git as Git Repos
participant Pip as pip/uv
participant Py as Python Build
Dev->>Img: Start FROM 25.08-py3
Note over Img: TE patching removed
Dev->>Git: Clone causal-conv1d@<commit>
Dev->>Py: python setup.py build_ext --inplace
Dev->>Py: Build wheel
Py-->>Pip: wheel file
Dev->>Pip: pip install wheel (no deps)
Dev->>Git: Clone trvachov/[email protected]
Dev->>Py: python setup.py build_ext --inplace
Dev->>Pip: pip install .
Note over Pip: bitsandbytes: uninstall via uv, then pip install wheel (no-deps)
Note over Dev: bionemo-geometric pre-install removed
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
Pre-merge checks and finishing touches❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/ok to test f23c0f3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
Dockerfile (5)
56-57: Fix invalid apt command:apt-get upgradewith package name is wrongThis will fail or be ignored. Use install with only-upgrade.
Apply:
-apt-get upgrade -qyy \ - rsync +apt-get install -qy --only-upgrade rsync
112-113:wgetis used but never installed (ARM path will fail)Either install wget here or use curl which is already present.
Example fix (using curl):
- wget https://github.com/TileDB-Inc/TileDB/releases/download/2.27.2/tiledb-linux-arm64-2.27.2-1757013.tar.gz -O tiledb.tar.gz && \ + curl -sSL https://github.com/TileDB-Inc/TileDB/releases/download/2.27.2/tiledb-linux-arm64-2.27.2-1757013.tar.gz -o tiledb.tar.gz && \
115-116: Potential build break:xargsmay runapt-get removewith no argsIf no matching packages exist,
xargscan invokeapt-get remove -ywith no args and error. Usexargs -r(GNU) or guard the call.Apply:
- dpkg -l | awk '/libfmt/ {print $2}' | xargs apt-get remove -y && \ - dpkg -l | awk '/spdlog/ {print $2}' | xargs apt-get remove -y && \ + dpkg -l | awk '/libfmt/ {print $2}' | xargs -r apt-get remove -y && \ + dpkg -l | awk '/spdlog/ {print $2}' | xargs -r apt-get remove -y && \If non-GNU xargs is possible, replace with conditional checks.
217-225:uv pip uninstalllacks-yand can hang in non-interactive buildsThese uninstalls will prompt for confirmation and stall the build.
Apply:
-uv pip uninstall bitsandbytes && uv pip install bitsandbytes==0.46.1 +uv pip uninstall -y bitsandbytes || true +uv pip install bitsandbytes==0.46.1 @@ -uv pip uninstall sqlitedict zstandard +uv pip uninstall -y sqlitedict zstandard || true
39-59: Makeset -o pipefailportable by switching shell to bash for RUNsSeveral heredoc RUNs use
set -eo pipefail;/bin/sh(dash) doesn’t supportpipefail. Use bash shell per stage.Apply after each relevant FROM:
FROM ${BASE_IMAGE} AS bionemo2-base +SHELL ["/bin/bash", "-lc"] @@ FROM ${BASE_IMAGE} AS dev +SHELL ["/bin/bash", "-lc"] @@ FROM dev AS development +SHELL ["/bin/bash", "-lc"] @@ FROM bionemo2-base AS release +SHELL ["/bin/bash", "-lc"]
🧹 Nitpick comments (3)
Dockerfile (3)
200-202: Pinnvidia-resiliency-extto a commit/tag for reproducible buildsUnpinned clones introduce nondeterminism.
Example:
-git clone https://github.com/NVIDIA/nvidia-resiliency-ext +git clone --depth=1 --branch <tag-or-commit> https://github.com/NVIDIA/nvidia-resiliency-extReplace
<tag-or-commit>with a known good ref.
119-122: Clean up TileDB-SOMA repo after install to keep image smallThe cloned repo persists.
Apply:
- cd TileDB-SOMA/apis/python && \ - pip install .; \ + cd TileDB-SOMA/apis/python && \ + pip install . && \ + cd / && rm -rf TileDB-SOMA; \
15-21: Update stale comment about Transformer EngineThe comment states “with apex and transformer engine,” but TE patch/integration was removed here.
Refresh the comment to reflect the current contents.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
Dockerfile(1 hunks)docker_build_patches/te.patch(0 hunks)
💤 Files with no reviewable changes (1)
- docker_build_patches/te.patch
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (4)
Dockerfile (4)
167-168: Confirm safe removal of bundled ONNX sourcesRemoving
/opt/pytorch/pytorch/third_party/onnxmay affect torch.onnx export or tooling expecting these sources.Please confirm no ONNX export paths in tests/pipelines rely on this directory.
252-266: Verify Python path hardcode (3.12) matches base imageIf base moves to a different Python minor, these paths break.
Consider deriving with
python -c 'import sys;import sysconfig;print(sysconfig.get_paths()[\"purelib\"])'during build, or keep in sync with BASE_IMAGE updates.
24-33: Rust toolchain layering is inconsistent; validate image tag and needBase stage uses
rust:1.86.0but sets default to 1.82.0. Ensurerust:1.86.0exists and this split is intended.If only 1.82.0 is required, switch base to
rust:1.82to reduce size/variance.
108-122: TILEDB_PATH export is ephemeral; ensure build finds libs
export TILEDB_PATH=...only applies to this RUN. Confirm the SOMA build locates TileDB without needing a persistent env var.If required, set
ENV TILEDB_PATH=/usr/lib/tiledbbefore the build or pass it inline to the pip build step.
|
/ok to test 5b79da9 |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1194 +/- ##
==========================================
- Coverage 79.93% 79.76% -0.18%
==========================================
Files 160 156 -4
Lines 11858 11617 -241
==========================================
- Hits 9479 9266 -213
+ Misses 2379 2351 -28 |
803412c to
c245759
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
Dockerfile(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
c245759 to
6534410
Compare
209ad56 to
bc49d1e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
Dockerfile (1)
226-229: Gate the bitsandbytes install by architecture.This always pulls the manylinux_2_24_x86_64 wheel, so an ARM64 build (
TARGETARCH=arm64) explodes while unpacking the x86_64 artifact. Restore an architecture check and keep/restore the ARM path (e.g., skip or install an ARM-capable build).-uv pip uninstall bitsandbytes && pip install --no-deps --no-build-isolation https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_x86_64.whl +if [ "$TARGETARCH" = "amd64" ] || [ "$TARGETARCH" = "x86_64" ]; then + uv pip uninstall bitsandbytes + pip install --no-deps --no-build-isolation \ + https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_x86_64.whl +elif [ "$TARGETARCH" = "arm64" ]; then + uv pip uninstall bitsandbytes || true + uv pip install bitsandbytes==0.45.2 # or another ARM-compatible build +fi
🧹 Nitpick comments (1)
Dockerfile (1)
83-89: Fix the repo cleanup path.The clone lives in
causal-conv1d, but the cleanup step removescausal_conv1d, so the source tree is left behind in the image. Please delete the actual directory you cloned.-rm -rf causal_conv1d +rm -rf causal-conv1d
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
Dockerfile(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
|
/ok to test bc49d1e |
bc49d1e to
c00ca7e
Compare
|
/ok to test c00ca7e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
Dockerfile (1)
83-90: Fix cleanup path after causal-conv1d installWe clone
causal-conv1d, but the cleanup removescausal_conv1d, so the repo stays in the image. Switch therm -rftarget back tocausal-conv1dto keep the layer tidy.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
Dockerfile(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
| # CUDA 13.0 supported bitsandbytes requires a nightly build, which doesn't seem to work | ||
| # with uv because of some trivial package/version name check, so we use uv for uninstall | ||
| # and then pip for reinstall | ||
| uv pip uninstall bitsandbytes && pip install --no-deps --no-build-isolation https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_x86_64.whl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Restore ARM build compatibility for bitsandbytes install
pip install … manylinux_2_24_x86_64.whl runs unconditionally. On TARGETARCH=arm64 builds the wheel is rejected as incompatible, so the Docker build aborts and we lose our ARM images. Wrap this block in an if [ "$TARGETARCH" = "amd64" ] … fi (with an appropriate else branch—either skip, or keep the previous ARM flow) so non-x86 builds don’t try to consume the x86_64 artifact.
🤖 Prompt for AI Agents
In Dockerfile around lines 226 to 229, the pip install of the x86_64
bitsandbytes wheel runs unconditionally and fails on TARGETARCH=arm64; wrap this
block in a shell conditional that checks if "$TARGETARCH" = "amd64" and only
runs the uv uninstall and pip install for that case, and add an else branch that
either skips the x86 wheel (no-op) or performs the previous ARM-compatible
install flow so ARM builds don't attempt to install the incompatible
manylinux_2_24_x86_64.whl.
Description
Update pytorch, megatron, nemo, and test.
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
Unit tests marked as
@pytest.mark.multi_gpuor@pytest.mark.distributedare not run in the PR pipeline.For more details, see CONTRIBUTING
Note
By default, only basic unit tests are run. Add appropriate labels to enable an additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.
automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
/ok to testcomment on the pull request to trigger CI. This will need to be done for each new commit.Pre-submit Checklist
Summary by CodeRabbit
New Changes
Documentation
Chores
Tests