[megatron] chore: add a docker image with mcore0.15 and TE2.7 #3540
Conversation
Code Review
This pull request adds a new Docker image for verl with megatron-core v0.15 and TransformerEngine v2.7. The changes include the new Dockerfile and updates to the documentation to include the new image.
My review focuses on improving the new Dockerfile for faster builds and a smaller image by consolidating the pip install commands. I have also pointed out an inconsistency in the documentation update and suggested a fix to align it with the existing format. These changes will improve maintainability and the user experience.
```dockerfile
RUN pip install --resume-retries 999 --no-cache-dir vllm==0.10.0

# Fix packages
# transformers 4.54.0 still not support
RUN pip install --no-cache-dir "tensordict==0.6.2" "transformers[hf_xet]>=4.55.4" accelerate datasets peft hf-transfer \
    "numpy<2.0.0" "pyarrow>=19.0.1" pandas \
    ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
    pytest py-spy pyext pre-commit ruff

RUN pip uninstall -y pynvml nvidia-ml-py && \
    pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"

RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87

# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.7
RUN pip install onnxscript

# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.15.0rc4

# Install mbridge
RUN pip3 install --no-cache-dir mbridge==v0.15.0

# Fix qwen vl
RUN pip3 install --no-cache-dir --no-deps trl
```
The Dockerfile has many separate RUN pip install commands, each of which creates a layer in the Docker image. This increases the image size and can slow down builds and pulls. It is a best practice to combine these into as few RUN commands as is logical. Here is a refactored version of the package installation steps that groups installations to reduce layers, while keeping the build-intensive git installations in separate commands to preserve build caching.
```dockerfile
RUN pip install --resume-retries 999 --no-cache-dir \
    vllm==0.10.0 \
    "tensordict==0.6.2" "transformers[hf_xet]>=4.55.4" accelerate datasets peft hf-transfer \
    "numpy<2.0.0" "pyarrow>=19.0.1" pandas \
    ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
    pytest py-spy pyext pre-commit ruff
RUN pip uninstall -y pynvml nvidia-ml-py && \
    pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"
RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87 onnxscript && \
    pip3 install --no-cache-dir mbridge==v0.15.0 && \
    pip3 install --no-cache-dir --no-deps trl
# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.7
# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.15.0rc4
```
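Independent of how the RUN commands are grouped, the version pins can be sanity-checked outside Docker. A minimal sketch using the `packaging` library (the same specifier logic pip uses); the candidate versions below are illustrative examples, not versions read from the image:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# Specifiers copied from the Dockerfile; candidate versions are hypothetical
pins = {
    "numpy": ("<2.0.0", "1.26.4"),
    "pyarrow": (">=19.0.1", "19.0.1"),
    "nvidia-ml-py": (">=12.560.30", "12.560.30"),
}

for name, (spec, candidate) in pins.items():
    # Version membership in a SpecifierSet applies PEP 440 comparison rules
    ok = Version(candidate) in SpecifierSet(spec)
    print(f"{name}{spec}: {candidate} {'satisfies' if ok else 'violates'} the pin")
```

Running such a check in CI against the image's actual `pip freeze` output would catch pin regressions before the image is published.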
For latest SGLang with FSDP, please refer to `hebiaobuaa/verl <https://hub.docker.com/r/hebiaobuaa/verl>`_ repository and the latest version is ``hebiaobuaa/verl:app-verl0.5-sglang0.4.9.post6-mcore0.12.2-te2.2`` which is provided by SGLang RL Group.
For latest vLLM with Megatron, please refer to `iseekyan/verl:app-verl0.5-transformers4.55.4-vllm0.10.0-mcore0.15.0-te2.7`
The new line added to the documentation is inconsistent with the surrounding entries. It should link to the Docker Hub repository for `iseekyan/verl`, giving users more context and an easy way to find the image, just like the entries for `hiyouga/verl` and `hebiaobuaa/verl` do.
```diff
-For latest vLLM with Megatron, please refer to `iseekyan/verl:app-verl0.5-transformers4.55.4-vllm0.10.0-mcore0.15.0-te2.7`
+For latest vLLM with Megatron, please refer to `iseekyan/verl <https://hub.docker.com/r/iseekyan/verl>`_ repository and the latest version is ``iseekyan/verl:app-verl0.5-transformers4.55.4-vllm0.10.0-mcore0.15.0-te2.7``
```
No description provided.