
OOM midway through GRPO training of Qwen3-VL-2B-Instruct with ms-swift on Ascend hardware, python==3.11.13 #7989

@LuckyDenKy

Checklist

  • I have searched existing issues, and this is a new bug report.

Bug Description

npu-smi info partway through training

+------------------------------------------------------------------------------------------------+
| npu-smi 25.2.0                   Version: 25.2.0                                               |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B3               | OK            | 111.7       39                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          32484/ 65536         |
+===========================+===============+====================================================+
| 1     910B3               | OK            | 317.6       51                0    / 0             |
| 0                         | 0000:C2:00.0  | 100         0    / 0          49656/ 65536         |
+===========================+===============+====================================================+
| 2     910B3               | OK            | 364.4       52                0    / 0             |
| 0                         | 0000:81:00.0  | 100         0    / 0          20660/ 65536         |
+===========================+===============+====================================================+
| 3     910B3               | OK            | 102.5       37                0    / 0             |
| 0                         | 0000:82:00.0  | 0           0    / 0          30420/ 65536         |
+===========================+===============+====================================================+
| 4     910B3               | OK            | 99.8        42                0    / 0             |
| 0                         | 0000:01:00.0  | 0           0    / 0          50114/ 65536         |
+===========================+===============+====================================================+
| 5     910B3               | OK            | 98.9        45                0    / 0             |
| 0                         | 0000:02:00.0  | 0           0    / 0          15783/ 65536         |
+===========================+===============+====================================================+
| 6     910B3               | OK            | 113.0       47                0    / 0             |
| 0                         | 0000:41:00.0  | 8           0    / 0          61145/ 65536         |
+===========================+===============+====================================================+
| 7     910B3               | OK            | 107.9       44                0    / 0             |
| 0                         | 0000:42:00.0  | 7           0    / 0          61183/ 65536         |
+===========================+===============+====================================================+

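Symptom: NPUs 6 and 7 serve the vLLM rollout and NPUs 0~5 run training. Early in training, memory usage on NPUs 0~5 is roughly 10000~20000 MB. Midway through, usage on some NPUs climbs, and as training continues the job fails with an OOM error.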
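
To follow the growth described above over time, it may help to poll `npu-smi info` periodically and extract the HBM figures per NPU. The snippet below is a minimal sketch that parses the row layout shown in this report; the function names `parse_hbm_usage` and `near_capacity` and the 0.9 threshold are illustrative assumptions, and other npu-smi versions may lay out rows differently.

```python
import re

# Row carrying the NPU id and name, e.g. "| 6 910B3 | OK | 113.0 47 0 / 0 |"
NPU_ROW = re.compile(r"^\|\s*(\d+)\s+\S+\s*\|\s*\w+\s*\|")
# Row carrying chip id, bus id, AICore %, Memory-Usage and HBM-Usage,
# e.g. "| 0 | 0000:41:00.0 | 8 0 / 0 61145/ 65536 |"
CHIP_ROW = re.compile(
    r"^\|\s*\d+\s*\|\s*[0-9A-Fa-f:.]+\s*\|"
    r"\s*\d+\s+\d+\s*/\s*\d+\s+(\d+)\s*/\s*(\d+)\s*\|"
)


def parse_hbm_usage(text):
    """Map NPU id -> (hbm_used_mb, hbm_total_mb) from `npu-smi info` text."""
    usage = {}
    current_npu = None
    for raw in text.splitlines():
        line = raw.strip()
        chip = CHIP_ROW.match(line)
        if chip and current_npu is not None:
            usage[current_npu] = (int(chip.group(1)), int(chip.group(2)))
            current_npu = None
            continue
        npu = NPU_ROW.match(line)
        if npu:
            current_npu = int(npu.group(1))
    return usage


def near_capacity(usage, threshold=0.9):
    """Return sorted ids of NPUs whose HBM usage fraction exceeds `threshold`.

    The default threshold is an arbitrary illustrative value.
    """
    return sorted(
        npu for npu, (used, total) in usage.items()
        if total and used / total > threshold
    )


# Two NPUs copied from the snapshot in this report.
SAMPLE = """\
| 5 910B3 | OK | 98.9 45 0 / 0 |
| 0 | 0000:02:00.0 | 0 0 / 0 15783/ 65536 |
| 6 910B3 | OK | 113.0 47 0 / 0 |
| 0 | 0000:41:00.0 | 8 0 / 0 61145/ 65536 |
"""

if __name__ == "__main__":
    usage = parse_hbm_usage(SAMPLE)
    print(usage)
    print(near_capacity(usage))
```

Logging the parsed values once per training step (or on a timer) would show which of the training NPUs grows and how quickly before the OOM.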

How to Reproduce

Rollout script

ASCEND_RT_VISIBLE_DEVICES=6,7 \
swift rollout \
    --model /workspace/model/private/xxxxxx/xxxxxx \
    --vllm_data_parallel_size 2

Training script

cd /workspace/algorithm/xxxxxx
export PYTHONPATH=$PYTHONPATH:/workspace/algorithm/xxxxxx/Megatron-LM
export MEGATRON_LM_PATH=/workspace/algorithm/xxxxxx/Megatron-LM
export CUDA_DEVICE_MAX_CONNECTIONS=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_DETERMINISTIC=true
export HCCL_CONNECT_TIMEOUT=7200

# Basic configuration
MODEL_NAME_OR_DIR="/workspace/model/private/xxxxxx"
DATASET=/workspace/algorithm/xxxxxx/data/aug_part0_nochange_2w.jsonl
DATASET_SPLIT=0.  # fraction of the training set held out for validation when no val dataset is specified; default 0
SYSTEM_PROMPT="./prompt.txt"  # path to the system prompt
PLUGIN_PY="/workspace/algorithm/xxxxxx/scripts/myPlugin/plugin.py"  # path to the reward-function plugin script
TRAIN_TYPE=full

# Training configuration 1
## batch_size = num_process * per_device_train_batch_size * gradient_accumulation_steps
per_device_train_batch_size=2
gradient_accumulation_steps=1
## steps = (num_train_epochs * len(datasets) * num_generations) / batch_size
num_generations=3
num_train_epochs=1
## maxSteps = num_iterations * steps
num_iterations=1

# Training configuration 2
max_completion_length=1024
learning_rate=1e-6
warmup_ratio=0.05
beta=0.001
deepspeed_zero=zero3

# Evaluation and checkpointing
per_device_eval_batch_size=8
save_strategy='steps'
eval_strategy='steps'
eval_steps=500
save_steps=50
save_total_limit=4
output_dir="output/GRPO_GEOQA/aug_part0_nochange_2w"

# Start training
## Set the NPU devices, reward functions, etc. to match your environment
IMAGE_MAX_TOKEN_NUM=16384 \
NPROC_PER_NODE=6 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5 \
swift rlhf \
    --rlhf_type grpo \
    --reward_funcs text_quad_iou \
    --load_from_cache_file true \
    --model ${MODEL_NAME_OR_DIR} \
    --dataset ${DATASET} \
    --split_dataset_ratio ${DATASET_SPLIT} \
    --dataloader_num_workers 4 \
    --external_plugins ${PLUGIN_PY} \
    --use_vllm true \
    --vllm_mode server \
    --vllm_server_host 0.0.0.0 \
    --vllm_server_port 8001 \
    --train_type ${TRAIN_TYPE} \
    --torch_dtype bfloat16 \
    --num_iterations ${num_iterations} \
    --num_generations ${num_generations} \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --num_train_epochs ${num_train_epochs} \
    --max_completion_length ${max_completion_length} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --learning_rate ${learning_rate} \
    --warmup_ratio ${warmup_ratio} \
    --temperature 1.0 \
    --repetition_penalty 1.1 \
    --beta ${beta} \
    --max_grad_norm 0.5 \
    --save_strategy ${save_strategy} \
    --eval_strategy ${eval_strategy} \
    --eval_steps ${eval_steps} \
    --save_steps ${save_steps} \
    --save_total_limit ${save_total_limit} \
    --logging_steps 1 \
    --output_dir ${output_dir} \
    --system ${SYSTEM_PROMPT} \
    --deepspeed ${deepspeed_zero} \
    --log_completions true \
    --async_generate true
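
The batch-size comments in the training script can be checked with a short calculation. The sketch below applies those formulas exactly as the comments state them; they are not verified against the ms-swift source. The helper name `grpo_schedule` is hypothetical, and the dataset size of 20000 is only a guess based on the `2w` in the dataset filename.

```python
def grpo_schedule(num_processes, per_device_train_batch_size,
                  gradient_accumulation_steps, num_generations,
                  num_train_epochs, num_iterations, dataset_size):
    """Apply the formulas from the training script's comments.

    batch_size = num_process * per_device_train_batch_size * gradient_accumulation_steps
    steps      = (num_train_epochs * len(datasets) * num_generations) / batch_size
    maxSteps   = num_iterations * steps
    """
    batch_size = (num_processes * per_device_train_batch_size
                  * gradient_accumulation_steps)
    total_completions = num_train_epochs * dataset_size * num_generations
    steps = total_completions // batch_size
    return {
        "batch_size": batch_size,
        # Derived from the comments: if a batch counts completions, this is
        # the number of distinct prompts it covers.
        "prompts_per_step": batch_size // num_generations,
        "steps": steps,
        "max_steps": num_iterations * steps,
    }


if __name__ == "__main__":
    # Values from the script: NPROC_PER_NODE=6, per-device batch 2,
    # accumulation 1, 3 generations, 1 epoch, 1 iteration.
    # dataset_size=20000 is an ASSUMPTION, not stated in the report.
    print(grpo_schedule(6, 2, 1, 3, 1, 1, dataset_size=20000))
```

Under these assumptions the global batch is 12 completions, which divides evenly by `num_generations=3`.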

Additional Information

Environment information

Package                           Version           Editable project location
--------------------------------- ----------------- --------------------------------------
absl-py                           2.3.1
accelerate                        1.12.0
addict                            2.4.0
aiofiles                          24.1.0
aiohappyeyeballs                  2.6.1
aiohttp                           3.13.2
aiosignal                         1.4.0
aliyun-python-sdk-core            2.16.0
aliyun-python-sdk-kms             2.16.5
annotated-doc                     0.0.4
annotated-types                   0.7.0
anthropic                         0.71.0
antlr4-python3-runtime            4.9.3
anyio                             4.12.0
asc_opc_tool                      0.1.0
astor                             0.8.1
attrdict                          2.0.1
attrs                             25.4.0
audioread                         3.1.0
auto_tune                         0.1.0
av                                16.1.0
binpacking                        2.0.0
blake3                            1.0.8
blinker                           1.9.0
brotli                            1.2.0
cachetools                        6.2.4
cbor2                             5.7.1
certifi                           2025.11.12
cffi                              2.0.0
charset-normalizer                3.4.4
click                             8.3.1
cloudpickle                       3.1.2
cmake                             4.2.1
compressed-tensors                0.12.2
contourpy                         1.3.3
cpm-kernels                       1.0.11
crcmod                            1.7
cryptography                      46.0.3
cycler                            0.12.1
Cython                            3.2.1
dacite                            1.9.2
dataflow                          0.0.1
datasets                          3.6.0
decorator                         5.2.1
deepspeed                         0.18.5
depyf                             0.20.0
dill                              0.3.8
diskcache                         5.6.3
distro                            1.9.0
dnspython                         2.8.0
docstring_parser                  0.17.0
einops                            0.8.1
email-validator                   2.3.0
fastapi                           0.123.10
fastapi-cli                       0.0.20
fastapi-cloud-cli                 0.8.0
fastar                            0.8.0
ffmpy                             1.0.0
filelock                          3.20.1
Flask                             3.1.2
fonttools                         4.61.1
frozenlist                        1.8.0
fsspec                            2025.3.0
gguf                              0.17.1
gpytorch                          1.15.1
gradio                            5.50.0
gradio_client                     1.14.0
greenlet                          3.3.0
groovy                            0.1.2
grpcio                            1.76.0
h11                               0.16.0
h2                                4.3.0
hccl                              0.1.0
hccl_parser                       0.1
hf-xet                            1.2.0
hjson                             3.1.0
hpack                             4.1.0
httpcore                          1.0.9
httptools                         0.7.1
httpx                             0.28.1
httpx-sse                         0.4.3
huggingface-hub                   0.36.0
Hypercorn                         0.18.0
hyperframe                        6.1.0
idna                              3.11
ijson                             3.4.0.post0
importlib_metadata                8.7.1
iniconfig                         2.3.0
interegular                       0.3.3
itsdangerous                      2.2.0
jaxtyping                         0.3.5
jieba                             0.42.1
Jinja2                            3.1.6
jiter                             0.12.0
jmespath                          0.10.0
joblib                            1.5.3
json_repair                       0.55.0
jsonschema                        4.25.1
jsonschema-specifications         2025.9.1
kiwisolver                        1.4.9
lark                              1.2.2
lazy_loader                       0.4
librosa                           0.11.0
linear-operator                   0.6
llguidance                        1.3.0
llm_datadist                      0.0.1
llm_datadist_v1                   0.0.1
llvmlite                          0.46.0
lm-format-enforcer                0.11.3
loguru                            0.7.3
Markdown                          3.10
markdown-it-py                    4.0.0
MarkupSafe                        3.0.3
matplotlib                        3.10.8
mcp                               1.25.0
mdurl                             0.1.2
mindspeed                         0.12.1            /workspace/algorithm/MindSpeed_r0_12_1
mistral_common                    1.8.8
model-hosting-container-standards 0.1.12
modelscope                        1.33.0
mpmath                            1.3.0
ms_swift                          3.12.1
msgpack                           1.1.2
msgspec                           0.20.0
msobjdump                         0.1.0
multidict                         6.7.0
multiprocess                      0.70.16
networkx                          3.6.1
ninja                             1.13.0
nltk                              3.9.2
numba                             0.63.1
numpy                             1.26.0
omegaconf                         2.3.0
op_compile_tool                   0.1.0
op_gen                            0.1
op_test_frame                     0.1
opc_tool                          0.1.0
openai                            2.14.0
openai-harmony                    0.0.8
opencv-python-headless            4.11.0.86
orjson                            3.11.5
oss2                              2.19.1
outlines_core                     0.2.11
packaging                         25.0
pandas                            2.3.3
pandas-stubs                      2.3.3.251219
partial-json-parser               0.2.1.1.post7
pathlib2                          2.3.7.post1
peft                              0.18.1
pillow                            11.3.0
pip                               25.3
platformdirs                      4.5.1
pluggy                            1.6.0
pooch                             1.8.2
priority                          2.0.0
prometheus_client                 0.23.1
prometheus-fastapi-instrumentator 7.1.0
propcache                         0.4.1
protobuf                          6.33.2
psutil                            7.1.3
py-cpuinfo                        9.0.0
pyarrow                           23.0.0
pybase64                          1.4.3
pybind11                          3.0.1
pycountry                         24.6.1
pycparser                         2.23
pycryptodome                      3.23.0
pydantic                          2.12.3
pydantic_core                     2.41.4
pydantic-extra-types              2.10.6
pydantic-settings                 2.12.0
pydub                             0.25.1
Pygments                          2.19.2
PyJWT                             2.10.1
pyparsing                         3.3.1
pytest                            9.0.2
pytest-mock                       3.15.1
python-dateutil                   2.9.0.post0
python-dotenv                     1.2.1
python-json-logger                4.0.0
python-multipart                  0.0.21
pytz                              2025.2
PyYAML                            6.0.3
pyzmq                             27.1.0
Quart                             0.20.0
qwen-vl-utils                     0.0.14
ray                               2.48.0
referencing                       0.37.0
regex                             2025.11.3
requests                          2.32.5
rich                              14.2.0
rich-toolkit                      0.17.1
rignore                           0.7.6
rouge                             1.0.1
rpds-py                           0.30.0
ruff                              0.14.13
safehttpx                         0.1.7
safetensors                       0.7.0
schedule_search                   0.0.1
scikit-learn                      1.8.0
scipy                             1.15.3
semantic-version                  2.10.0
sentencepiece                     0.2.1
sentry-sdk                        2.48.0
setproctitle                      1.3.7
setuptools                        65.5.0
setuptools-scm                    9.2.2
shellingham                       1.5.4
show_kernel_debug_data            0.1.0
simplejson                        3.20.2
six                               1.17.0
sniffio                           1.3.1
sortedcontainers                  2.4.0
soundfile                         0.13.1
soxr                              1.0.0
SQLAlchemy                        2.0.45
sse-starlette                     3.1.1
starlette                         0.50.0
supervisor                        4.3.0
sympy                             1.14.0
te                                0.4.0
tensorboard                       2.20.0
tensorboard-data-server           0.7.2
threadpoolctl                     3.6.0
tiktoken                          0.12.0
tokenizers                        0.22.1
tomlkit                           0.13.3
torch                             2.8.0+cpu
torch_npu                         2.8.0
torchvision                       0.23.0
tqdm                              4.67.1
transformers                      4.57.3
transformers-stream-generator     0.0.5
trl                               0.24.0
typer                             0.21.0
types-pytz                        2025.2.0.20251108
typing_extensions                 4.15.0
typing-inspection                 0.4.2
tzdata                            2025.3
urllib3                           2.5.0
uvicorn                           0.40.0
uvloop                            0.22.1
vllm                              0.13.0+empty      /vllm-workspace/vllm
vllm_ascend                       0.13.0rc1         /vllm-workspace/vllm-ascend
wadler_lindig                     0.1.7
watchfiles                        1.1.1
websockets                        15.0.1
Werkzeug                          3.1.4
wheel                             0.45.1
wsproto                           1.3.2
xgrammar                          0.1.27
xxhash                            3.6.0
yarl                              1.22.0
zipp                              3.23.0
zstandard                         0.25.0
