[megatron] chore: add a docker image with mcore0.15 and TE2.7 #3540
Conversation
Code Review
This pull request adds a new Docker image for verl with megatron-core v0.15 and TransformerEngine v2.7. The changes include the new Dockerfile and updates to the documentation to include the new image.
My review focuses on improving the new Dockerfile for faster builds and a smaller image by consolidating the pip install commands. I have also pointed out an inconsistency in the documentation update and suggested a fix to align it with the existing format. These changes will improve maintainability and the user experience.
```dockerfile
RUN pip install --resume-retries 999 --no-cache-dir vllm==0.10.0

# Fix packages
# transformers 4.54.0 still not support
RUN pip install --no-cache-dir "tensordict==0.6.2" "transformers[hf_xet]>=4.55.4" accelerate datasets peft hf-transfer \
    "numpy<2.0.0" "pyarrow>=19.0.1" pandas \
    ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
    pytest py-spy pyext pre-commit ruff

RUN pip uninstall -y pynvml nvidia-ml-py && \
    pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"

RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87

# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.7
RUN pip install onnxscript

# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.15.0rc4

# Install mbridge
RUN pip3 install --no-cache-dir mbridge==v0.15.0

# Fix qwen vl
RUN pip3 install --no-cache-dir --no-deps trl
```
The Dockerfile has many separate RUN pip install commands, each of which creates a layer in the Docker image. This increases the image size and can slow down builds and pulls. It is a best practice to combine these into as few RUN commands as is logical. Here is a refactored version of the package installation steps that groups installations to reduce layers, while keeping the build-intensive git installations in separate commands to preserve build caching.
```dockerfile
RUN pip install --resume-retries 999 --no-cache-dir \
    vllm==0.10.0 \
    "tensordict==0.6.2" "transformers[hf_xet]>=4.55.4" accelerate datasets peft hf-transfer \
    "numpy<2.0.0" "pyarrow>=19.0.1" pandas \
    ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
    pytest py-spy pyext pre-commit ruff
RUN pip uninstall -y pynvml nvidia-ml-py && \
    pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87 onnxscript && \
    pip3 install --no-cache-dir mbridge==v0.15.0 && \
    pip3 install --no-cache-dir --no-deps trl
# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.7
# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.15.0rc4
```
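Independent of how the RUN commands are grouped, the version pins can be sanity-checked outside Docker. A minimal sketch using the `packaging` library (the same specifier logic pip uses); the candidate versions below are illustrative examples, not versions read from the image:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# Specifiers copied from the Dockerfile; candidate versions are hypothetical
pins = {
    "numpy": ("<2.0.0", "1.26.4"),
    "pyarrow": (">=19.0.1", "19.0.1"),
    "nvidia-ml-py": (">=12.560.30", "12.560.30"),
}

for name, (spec, candidate) in pins.items():
    # Version membership in a SpecifierSet applies PEP 440 comparison rules
    ok = Version(candidate) in SpecifierSet(spec)
    print(f"{name}{spec}: {candidate} {'satisfies' if ok else 'violates'} the pin")
```

Running such a check in CI against the image's actual `pip freeze` output would catch pin regressions before the image is published.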
For latest SGLang with FSDP, please refer to `hebiaobuaa/verl <https://hub.docker.com/r/hebiaobuaa/verl>`_ repository and the latest version is ``hebiaobuaa/verl:app-verl0.5-sglang0.4.9.post6-mcore0.12.2-te2.2`` which is provided by SGLang RL Group.
For latest vLLM with Megatron, please refer to `iseekyan/verl:app-verl0.5-transformers4.55.4-vllm0.10.0-mcore0.15.0-te2.7`
The new line added to the documentation is inconsistent with the surrounding entries. It should link to the Docker Hub repository for `iseekyan/verl`, giving users more context and an easy way to find the image, just like the entries for `hiyouga/verl` and `hebiaobuaa/verl` do.
```diff
-For latest vLLM with Megatron, please refer to `iseekyan/verl:app-verl0.5-transformers4.55.4-vllm0.10.0-mcore0.15.0-te2.7`
+For latest vLLM with Megatron, please refer to `iseekyan/verl <https://hub.docker.com/r/iseekyan/verl>`_ repository and the latest version is ``iseekyan/verl:app-verl0.5-transformers4.55.4-vllm0.10.0-mcore0.15.0-te2.7``
```
No description provided.