[megatron] chore: add a docker image with mcore0.15 and TE2.7 #3540
New Dockerfile added by this PR (39 lines):

```dockerfile
# Start from the verl base image
# Dockerfile.base
FROM iseekyan/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.7.4-h100

# Define environments
ENV MAX_JOBS=32
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
ENV PIP_ROOT_USER_ACTION=ignore
ENV HF_HUB_ENABLE_HF_TRANSFER="1"

# Install torch-2.7.1+cu126 + vllm-0.10.0
RUN pip install --resume-retries 999 --no-cache-dir vllm==0.10.0

# Fix packages
# transformers 4.54.0 is still not supported
RUN pip install --no-cache-dir "tensordict==0.6.2" "transformers[hf_xet]>=4.55.4" accelerate datasets peft hf-transfer \
    "numpy<2.0.0" "pyarrow>=19.0.1" pandas \
    ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
    pytest py-spy pyext pre-commit ruff

RUN pip uninstall -y pynvml nvidia-ml-py && \
    pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"

RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87

# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.7
RUN pip install onnxscript

# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.15.0rc4

# Install mbridge
RUN pip3 install --no-cache-dir mbridge==v0.15.0

# Fix qwen vl
RUN pip3 install --no-cache-dir --no-deps trl
```
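Building an image from a Dockerfile like this follows the usual `docker build` workflow. A minimal sketch is below; the `-f` path and the `-t` tag are illustrative placeholders, not names published by this PR:

```shell
# Build from the repository root.
# NOTE: the Dockerfile path and image tag below are hypothetical examples.
docker build \
    -f docker/Dockerfile.app.mcore0.15.te2.7 \
    -t verl:app-verl0.5-vllm0.10.0-mcore0.15.0-te2.7 \
    .
```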
Documentation change (hunk ``@@ -79,6 +79,8 @@``), adding an entry after the existing SGLang one:

```rst
For latest SGLang with FSDP, please refer to `hebiaobuaa/verl <https://hub.docker.com/r/hebiaobuaa/verl>`_ repository and the latest version is ``hebiaobuaa/verl:app-verl0.5-sglang0.4.9.post6-mcore0.12.2-te2.2`` which is provided by SGLang RL Group.

For latest vLLM with Megatron, please refer to `iseekyan/verl:app-verl0.5-transformers4.55.4-vllm0.10.0-mcore0.15.0-te2.7`
```
Contributor comment (on the new documentation line):

The new line added to the documentation is inconsistent with the surrounding entries. It should include a link to the Docker Hub repository, as the surrounding entries do.
Unchanged context following the new line:

```rst
See files under ``docker/`` for NGC-based image or if you want to build your own.

Note that for AWS instances with EFA net interface (Sagemaker AI Pod),
```
Reviewer comment (on the Dockerfile):

The Dockerfile has multiple `RUN pip install` commands, which creates many layers in the Docker image. This increases the image size and can slow down builds and pulls. It is a best practice to combine these into as few `RUN` commands as is logical. Here is a refactored version of the package installation steps that groups installations to reduce layers, while keeping build-intensive git installations separate to leverage caching.
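The reviewer's suggested refactor itself is not reproduced in the page extract. A sketch of what such a consolidation could look like, assuming the same package set and versions as the original Dockerfile, might be:

```dockerfile
FROM iseekyan/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.7.4-h100

# One layer for all plain PyPI installs: vllm, the package fixes, cudnn,
# and the pynvml/nvidia-ml-py swap.
RUN pip install --resume-retries 999 --no-cache-dir vllm==0.10.0 \
        "tensordict==0.6.2" "transformers[hf_xet]>=4.55.4" accelerate datasets peft hf-transfer \
        "numpy<2.0.0" "pyarrow>=19.0.1" pandas \
        ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 \
        liger-kernel mathruler blobfile xgrammar pytest py-spy pyext pre-commit ruff \
        onnxscript nvidia-cudnn-cu12==9.8.0.87 && \
    pip uninstall -y pynvml nvidia-ml-py && \
    pip install --resume-retries 999 --no-cache-dir --upgrade \
        "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" \
        "pydantic>=2.9" "grpcio>=1.62.1"

# Keep build-intensive git installs in their own layers so a change to one
# does not invalidate the cache for the others.
RUN export NVTE_FRAMEWORK=pytorch && \
    pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation \
        git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.7
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation \
    git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.15.0rc4

# Small final layer: mbridge and the qwen-vl fix
RUN pip3 install --no-cache-dir mbridge==v0.15.0 && \
    pip3 install --no-cache-dir --no-deps trl
```

The trade-off is the one the reviewer names: fewer layers shrink the image and speed up pulls, while separate layers for the TransformerEngine and Megatron-LM source builds preserve build caching for the slowest steps.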