
Commit ae6b1d9

ISEEKYAN authored and techkang committed
[megatron] chore: add a docker image with mcore0.15 and TE2.7 (volcengine#3540)
1 parent d576e73 commit ae6b1d9

File tree

3 files changed: +42 −0 lines changed

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
# Start from the verl base image
# Dockerfile.base
FROM iseekyan/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.1-fa2.7.4-h100

# Define environments
ENV MAX_JOBS=32
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
ENV DEBIAN_FRONTEND=noninteractive
ENV NODE_OPTIONS=""
ENV PIP_ROOT_USER_ACTION=ignore
ENV HF_HUB_ENABLE_HF_TRANSFER="1"

# Install torch-2.7.1+cu126 + vllm-0.10.0
RUN pip install --resume-retries 999 --no-cache-dir vllm==0.10.0

# Fix packages
# transformers 4.54.0 is still not supported, so pin transformers>=4.55.4
RUN pip install --no-cache-dir "tensordict==0.6.2" "transformers[hf_xet]>=4.55.4" accelerate datasets peft hf-transfer \
    "numpy<2.0.0" "pyarrow>=19.0.1" pandas \
    ray[default] codetiming hydra-core pylatexenc qwen-vl-utils wandb dill pybind11 liger-kernel mathruler blobfile xgrammar \
    pytest py-spy pyext pre-commit ruff

RUN pip uninstall -y pynvml nvidia-ml-py && \
    pip install --resume-retries 999 --no-cache-dir --upgrade "nvidia-ml-py>=12.560.30" "fastapi[standard]>=0.115.0" "optree>=0.13.0" "pydantic>=2.9" "grpcio>=1.62.1"

RUN pip install --resume-retries 999 --no-cache-dir nvidia-cudnn-cu12==9.8.0.87

# Install TransformerEngine
RUN export NVTE_FRAMEWORK=pytorch && pip3 install --resume-retries 999 --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.7
RUN pip install onnxscript

# Install Megatron-LM
RUN pip3 install --no-deps --no-cache-dir --no-build-isolation git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.15.0rc4

# Install mbridge
RUN pip3 install --no-cache-dir mbridge==v0.15.0

# Fix qwen vl
RUN pip3 install --no-cache-dir --no-deps trl
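
A minimal build sketch for this new file, assuming it is saved as Dockerfile in the current directory; the -t tag below is a hypothetical local name, not an image published by this commit.

# Build the image locally (file name and tag are placeholders, adjust as needed)
docker build -f Dockerfile -t local/verl-app-mcore0.15-te2.7 .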

docker/verl0.5-cu126-torch2.7-fa2.7.4/README.md

Lines changed: 1 addition & 0 deletions
@@ -24,3 +24,4 @@ megatron.core==core_r0.13.0
- App image:
- `verlai/verl:app-verl0.5-transformers4.55.4-vllm0.10.0-mcore0.13.0-te2.2`
- `verlai/verl:app-verl0.5-transformers4.55.4-sglang0.4.10.post2-mcore0.13.0-te2.2`
+- `iseekyan/verl:app-verl0.5-transformers4.55.4-vllm0.10.0-mcore0.15.0-te2.7`
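
A hedged usage sketch for the app image added above, assuming the tag is published on Docker Hub exactly as listed; the --shm-size value is an arbitrary example rather than a documented requirement.

# Pull the mcore0.15 / TE2.7 app image and open an interactive shell with GPU access
docker pull iseekyan/verl:app-verl0.5-transformers4.55.4-vllm0.10.0-mcore0.15.0-te2.7
docker run --rm -it --gpus all --shm-size=10g \
    iseekyan/verl:app-verl0.5-transformers4.55.4-vllm0.10.0-mcore0.15.0-te2.7 bash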

docs/start/install.rst

Lines changed: 2 additions & 0 deletions
@@ -79,6 +79,8 @@ For latest vLLM with FSDP, please refer to `hiyouga/verl <https://hub.docker.com

For latest SGLang with FSDP, please refer to `hebiaobuaa/verl <https://hub.docker.com/r/hebiaobuaa/verl>`_ repository and the latest version is ``hebiaobuaa/verl:app-verl0.5-sglang0.4.9.post6-mcore0.12.2-te2.2`` which is provided by SGLang RL Group.

+For latest vLLM with Megatron, please refer to the image ``iseekyan/verl:app-verl0.5-transformers4.55.4-vllm0.10.0-mcore0.15.0-te2.7``.
+
See files under ``docker/`` for NGC-based image or if you want to build your own.

Note that For aws instances with EFA net interface (Sagemaker AI Pod),
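
Once inside a container started from that image, a quick spot check (a sketch, not part of the docs change) that the main pins from the Dockerfile above took effect; roughly, torch 2.7.1+cu126, vllm 0.10.0, and transformers >= 4.55.4 are expected:

# Print the versions of the main pinned packages
python -c "import torch, vllm, transformers; print(torch.__version__, vllm.__version__, transformers.__version__)"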
