Description
Checklist
- I have searched existing issues, and this is a new bug report.
Bug Description
npu-smi info output at mid-training:
+------------------------------------------------------------------------------------------------+
| npu-smi 25.2.0                   Version: 25.2.0                                               |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B3               | OK            | 111.7       39                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          32484/ 65536         |
+===========================+===============+====================================================+
| 1     910B3               | OK            | 317.6       51                0    / 0             |
| 0                         | 0000:C2:00.0  | 100         0    / 0          49656/ 65536         |
+===========================+===============+====================================================+
| 2     910B3               | OK            | 364.4       52                0    / 0             |
| 0                         | 0000:81:00.0  | 100         0    / 0          20660/ 65536         |
+===========================+===============+====================================================+
| 3     910B3               | OK            | 102.5       37                0    / 0             |
| 0                         | 0000:82:00.0  | 0           0    / 0          30420/ 65536         |
+===========================+===============+====================================================+
| 4     910B3               | OK            | 99.8        42                0    / 0             |
| 0                         | 0000:01:00.0  | 0           0    / 0          50114/ 65536         |
+===========================+===============+====================================================+
| 5     910B3               | OK            | 98.9        45                0    / 0             |
| 0                         | 0000:02:00.0  | 0           0    / 0          15783/ 65536         |
+===========================+===============+====================================================+
| 6     910B3               | OK            | 113.0       47                0    / 0             |
| 0                         | 0000:41:00.0  | 8           0    / 0          61145/ 65536         |
+===========================+===============+====================================================+
| 7     910B3               | OK            | 107.9       44                0    / 0             |
| 0                         | 0000:42:00.0  | 7           0    / 0          61183/ 65536         |
+===========================+===============+====================================================+
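Symptom: NPUs 6 and 7 run the vLLM rollout, and NPUs 0-5 run training. Early in training, HBM usage on NPUs 0-5 is roughly 10000~20000 MB. By mid-training it has risen on some NPUs (see the output above), and as training continues the job fails with an OOM error.
How to Reproduce
Rollout script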
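To make the growth easier to quantify, here is a minimal sketch that extracts per-device HBM usage from `npu-smi info` text and diffs two snapshots. It assumes the row layout shown in the output above (npu-smi 25.2.0); the function names `parse_hbm_usage`, `hbm_growth`, and `snapshot` are hypothetical helpers, not part of ms-swift or any Ascend tooling.

```python
import re
import subprocess

# Matches the second row of each device block, e.g.
# "| 0 | 0000:C1:00.0 | 0 0 / 0 32484/ 65536 |"
# Assumption: the last "used/ total" pair on that row is the HBM usage in MB.
ROW = re.compile(
    r"\|\s*\d+\s*\|\s*"
    r"(?P<bus>[0-9A-Fa-f]{4}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.\d)\s*\|"
    r".*?(?P<used>\d+)\s*/\s*(?P<total>\d+)\s*\|\s*$"
)


def parse_hbm_usage(text):
    """Return [(bus_id, used_mb, total_mb), ...] in the order rows appear."""
    rows = []
    for line in text.splitlines():
        m = ROW.search(line)
        if m:
            rows.append((m["bus"], int(m["used"]), int(m["total"])))
    return rows


def hbm_growth(before, after):
    """Per-device change in used HBM (MB) between two parsed snapshots."""
    start = {bus: used for bus, used, _ in before}
    return {bus: used - start[bus] for bus, used, _ in after if bus in start}


def snapshot():
    """Run `npu-smi info` and parse it (requires an Ascend host)."""
    out = subprocess.run(
        ["npu-smi", "info"], capture_output=True, text=True, check=True
    ).stdout
    return parse_hbm_usage(out)
```

Calling `snapshot()` once early in training and again every few hundred steps, then passing the two results to `hbm_growth`, would show which of the training NPUs are climbing and by how much before the OOM occurs.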
ASCEND_RT_VISIBLE_DEVICES=6,7 \
swift rollout \
--model /workspace/model/private/xxxxxx/xxxxxx \
--vllm_data_parallel_size 2
Training script
cd /workspace/algorithm/xxxxxx
export PYTHONPATH=$PYTHONPATH:/workspace/algorithm/xxxxxx/Megatron-LM
export MEGATRON_LM_PATH=/workspace/algorithm/xxxxxx/Megatron-LM
export CUDA_DEVICE_MAX_CONNECTIONS=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_DETERMINISTIC=true
export HCCL_CONNECT_TIMEOUT=7200
# Basic configuration
MODEL_NAME_OR_DIR="/workspace/model/private/xxxxxx"
DATASET=/workspace/algorithm/xxxxxx/data/aug_part0_nochange_2w.jsonl
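DATASET_SPLIT=0. # When no validation dataset is specified, the fraction of the training set split off for validation; defaults to 0
SYSTEM_PROMPT="./prompt.txt" # Path to the system prompt
PLUGIN_PY="/workspace/algorithm/xxxxxx/scripts/myPlugin/plugin.py" # Path to the reward function plugin script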
TRAIN_TYPE=full
# Training configuration 1
## batch_size = num_process * per_device_train_batch_size * gradient_accumulation_steps
per_device_train_batch_size=2
gradient_accumulation_steps=1
## steps = (num_train_epochs * len(datasets) * num_generations) / batch_size
num_generations=3
num_train_epochs=1
## maxSteps = num_iterations * steps
num_iterations=1
# Training configuration 2
max_completion_length=1024
learning_rate=1e-6
warmup_ratio=0.05
beta=0.001
deepspeed_zero=zero3
# Evaluation and saving
per_device_eval_batch_size=8
save_strategy='steps'
eval_strategy='steps'
eval_steps=500
save_steps=50
save_total_limit=4
output_dir="output/GRPO_GEOQA/aug_part0_nochange_2w"
# Start training
## Set the NPU devices, reward function, etc. to match your environment
IMAGE_MAX_TOKEN_NUM=16384 \
NPROC_PER_NODE=6 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5 \
swift rlhf \
--rlhf_type grpo \
--reward_funcs text_quad_iou \
--load_from_cache_file true \
--model ${MODEL_NAME_OR_DIR} \
--dataset ${DATASET} \
--split_dataset_ratio ${DATASET_SPLIT} \
--dataloader_num_workers 4 \
--external_plugins ${PLUGIN_PY} \
--use_vllm true \
--vllm_mode server \
--vllm_server_host 0.0.0.0 \
--vllm_server_port 8001 \
--train_type ${TRAIN_TYPE} \
--torch_dtype bfloat16 \
--num_iterations ${num_iterations} \
--num_generations ${num_generations} \
--gradient_accumulation_steps ${gradient_accumulation_steps} \
--per_device_train_batch_size ${per_device_train_batch_size} \
--num_train_epochs ${num_train_epochs} \
--max_completion_length ${max_completion_length} \
--per_device_eval_batch_size ${per_device_eval_batch_size} \
--learning_rate ${learning_rate} \
--warmup_ratio ${warmup_ratio} \
--temperature 1.0 \
--repetition_penalty 1.1 \
--beta ${beta} \
--max_grad_norm 0.5 \
--save_strategy ${save_strategy} \
--eval_strategy ${eval_strategy} \
--eval_steps ${eval_steps} \
--save_steps ${save_steps} \
--save_total_limit ${save_total_limit} \
--logging_steps 1 \
--output_dir ${output_dir} \
--system ${SYSTEM_PROMPT} \
--deepspeed ${deepspeed_zero} \
--log_completions true \
--async_generate true
Additional Information
Environment information
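For reference, the formulas in the script's comments can be worked through with the values set above. This is only a sketch of the arithmetic as the comments state it, not of how ms-swift computes steps internally. The dataset size of 20,000 is an assumption (the file name ends in `2w`, but the issue does not state the row count), and `grpo_step_counts` is a hypothetical helper name.

```python
def grpo_step_counts(num_processes, per_device_train_batch_size,
                     gradient_accumulation_steps, num_generations,
                     num_train_epochs, num_iterations, dataset_size):
    """Apply the three formulas from the training script's comments."""
    # batch_size = num_process * per_device_train_batch_size * gradient_accumulation_steps
    batch_size = (num_processes * per_device_train_batch_size
                  * gradient_accumulation_steps)
    # steps = (num_train_epochs * len(datasets) * num_generations) / batch_size
    steps = (num_train_epochs * dataset_size * num_generations) / batch_size
    # maxSteps = num_iterations * steps
    max_steps = num_iterations * steps
    return batch_size, steps, max_steps


batch_size, steps, max_steps = grpo_step_counts(
    num_processes=6,                 # NPROC_PER_NODE=6
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    num_generations=3,
    num_train_epochs=1,
    num_iterations=1,
    dataset_size=20_000,             # ASSUMPTION: not stated in the issue
)
print(batch_size, steps, max_steps)  # 12 5000.0 5000.0
```

Under that assumption the run would take about 5000 optimizer steps, so with `save_steps=50` and `logging_steps 1` the step at which memory starts climbing should be easy to locate in the logs.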
Package Version Editable project location
--------------------------------- ----------------- --------------------------------------
absl-py 2.3.1
accelerate 1.12.0
addict 2.4.0
aiofiles 24.1.0
aiohappyeyeballs 2.6.1
aiohttp 3.13.2
aiosignal 1.4.0
aliyun-python-sdk-core 2.16.0
aliyun-python-sdk-kms 2.16.5
annotated-doc 0.0.4
annotated-types 0.7.0
anthropic 0.71.0
antlr4-python3-runtime 4.9.3
anyio 4.12.0
asc_opc_tool 0.1.0
astor 0.8.1
attrdict 2.0.1
attrs 25.4.0
audioread 3.1.0
auto_tune 0.1.0
av 16.1.0
binpacking 2.0.0
blake3 1.0.8
blinker 1.9.0
brotli 1.2.0
cachetools 6.2.4
cbor2 5.7.1
certifi 2025.11.12
cffi 2.0.0
charset-normalizer 3.4.4
click 8.3.1
cloudpickle 3.1.2
cmake 4.2.1
compressed-tensors 0.12.2
contourpy 1.3.3
cpm-kernels 1.0.11
crcmod 1.7
cryptography 46.0.3
cycler 0.12.1
Cython 3.2.1
dacite 1.9.2
dataflow 0.0.1
datasets 3.6.0
decorator 5.2.1
deepspeed 0.18.5
depyf 0.20.0
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
dnspython 2.8.0
docstring_parser 0.17.0
einops 0.8.1
email-validator 2.3.0
fastapi 0.123.10
fastapi-cli 0.0.20
fastapi-cloud-cli 0.8.0
fastar 0.8.0
ffmpy 1.0.0
filelock 3.20.1
Flask 3.1.2
fonttools 4.61.1
frozenlist 1.8.0
fsspec 2025.3.0
gguf 0.17.1
gpytorch 1.15.1
gradio 5.50.0
gradio_client 1.14.0
greenlet 3.3.0
groovy 0.1.2
grpcio 1.76.0
h11 0.16.0
h2 4.3.0
hccl 0.1.0
hccl_parser 0.1
hf-xet 1.2.0
hjson 3.1.0
hpack 4.1.0
httpcore 1.0.9
httptools 0.7.1
httpx 0.28.1
httpx-sse 0.4.3
huggingface-hub 0.36.0
Hypercorn 0.18.0
hyperframe 6.1.0
idna 3.11
ijson 3.4.0.post0
importlib_metadata 8.7.1
iniconfig 2.3.0
interegular 0.3.3
itsdangerous 2.2.0
jaxtyping 0.3.5
jieba 0.42.1
Jinja2 3.1.6
jiter 0.12.0
jmespath 0.10.0
joblib 1.5.3
json_repair 0.55.0
jsonschema 4.25.1
jsonschema-specifications 2025.9.1
kiwisolver 1.4.9
lark 1.2.2
lazy_loader 0.4
librosa 0.11.0
linear-operator 0.6
llguidance 1.3.0
llm_datadist 0.0.1
llm_datadist_v1 0.0.1
llvmlite 0.46.0
lm-format-enforcer 0.11.3
loguru 0.7.3
Markdown 3.10
markdown-it-py 4.0.0
MarkupSafe 3.0.3
matplotlib 3.10.8
mcp 1.25.0
mdurl 0.1.2
mindspeed 0.12.1 /workspace/algorithm/MindSpeed_r0_12_1
mistral_common 1.8.8
model-hosting-container-standards 0.1.12
modelscope 1.33.0
mpmath 1.3.0
ms_swift 3.12.1
msgpack 1.1.2
msgspec 0.20.0
msobjdump 0.1.0
multidict 6.7.0
multiprocess 0.70.16
networkx 3.6.1
ninja 1.13.0
nltk 3.9.2
numba 0.63.1
numpy 1.26.0
omegaconf 2.3.0
op_compile_tool 0.1.0
op_gen 0.1
op_test_frame 0.1
opc_tool 0.1.0
openai 2.14.0
openai-harmony 0.0.8
opencv-python-headless 4.11.0.86
orjson 3.11.5
oss2 2.19.1
outlines_core 0.2.11
packaging 25.0
pandas 2.3.3
pandas-stubs 2.3.3.251219
partial-json-parser 0.2.1.1.post7
pathlib2 2.3.7.post1
peft 0.18.1
pillow 11.3.0
pip 25.3
platformdirs 4.5.1
pluggy 1.6.0
pooch 1.8.2
priority 2.0.0
prometheus_client 0.23.1
prometheus-fastapi-instrumentator 7.1.0
propcache 0.4.1
protobuf 6.33.2
psutil 7.1.3
py-cpuinfo 9.0.0
pyarrow 23.0.0
pybase64 1.4.3
pybind11 3.0.1
pycountry 24.6.1
pycparser 2.23
pycryptodome 3.23.0
pydantic 2.12.3
pydantic_core 2.41.4
pydantic-extra-types 2.10.6
pydantic-settings 2.12.0
pydub 0.25.1
Pygments 2.19.2
PyJWT 2.10.1
pyparsing 3.3.1
pytest 9.0.2
pytest-mock 3.15.1
python-dateutil 2.9.0.post0
python-dotenv 1.2.1
python-json-logger 4.0.0
python-multipart 0.0.21
pytz 2025.2
PyYAML 6.0.3
pyzmq 27.1.0
Quart 0.20.0
qwen-vl-utils 0.0.14
ray 2.48.0
referencing 0.37.0
regex 2025.11.3
requests 2.32.5
rich 14.2.0
rich-toolkit 0.17.1
rignore 0.7.6
rouge 1.0.1
rpds-py 0.30.0
ruff 0.14.13
safehttpx 0.1.7
safetensors 0.7.0
schedule_search 0.0.1
scikit-learn 1.8.0
scipy 1.15.3
semantic-version 2.10.0
sentencepiece 0.2.1
sentry-sdk 2.48.0
setproctitle 1.3.7
setuptools 65.5.0
setuptools-scm 9.2.2
shellingham 1.5.4
show_kernel_debug_data 0.1.0
simplejson 3.20.2
six 1.17.0
sniffio 1.3.1
sortedcontainers 2.4.0
soundfile 0.13.1
soxr 1.0.0
SQLAlchemy 2.0.45
sse-starlette 3.1.1
starlette 0.50.0
supervisor 4.3.0
sympy 1.14.0
te 0.4.0
tensorboard 2.20.0
tensorboard-data-server 0.7.2
threadpoolctl 3.6.0
tiktoken 0.12.0
tokenizers 0.22.1
tomlkit 0.13.3
torch 2.8.0+cpu
torch_npu 2.8.0
torchvision 0.23.0
tqdm 4.67.1
transformers 4.57.3
transformers-stream-generator 0.0.5
trl 0.24.0
typer 0.21.0
types-pytz 2025.2.0.20251108
typing_extensions 4.15.0
typing-inspection 0.4.2
tzdata 2025.3
urllib3 2.5.0
uvicorn 0.40.0
uvloop 0.22.1
vllm 0.13.0+empty /vllm-workspace/vllm
vllm_ascend 0.13.0rc1 /vllm-workspace/vllm-ascend
wadler_lindig 0.1.7
watchfiles 1.1.1
websockets 15.0.1
Werkzeug 3.1.4
wheel 0.45.1
wsproto 1.3.2
xgrammar 0.1.27
xxhash 3.6.0
yarl 1.22.0
zipp 3.23.0
zstandard 0.25.0