[INTEL_HPU] supported ERNIE-4.5-21B-A3B-Thinking model #5891
Conversation
Thanks for your contribution!
Codecov Report
❌ Patch coverage is
@@           Coverage Diff            @@
##           develop    #5891   +/-   ##
==========================================
  Coverage         ?   66.59%
==========================================
  Files            ?      347
  Lines            ?    44467
  Branches         ?     6835
==========================================
  Hits             ?    29614
  Misses           ?    12668
  Partials         ?     2185
☔ View full report in Codecov by Sentry.
@LeoZhao-Intel @yanfeich @JianyuLi01 @feiwan1 @fmiao2372 could you help to give a review?
fastdeploy/model_executor/utils.py (Outdated)
    or current_platform.is_maca()
    or current_platform.is_intel_hpu()
):
    _err_msg("v1loader currently only support backends gpu, xpu, iluvatar and maca")
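For context, a minimal sketch of how the full check might read is shown below. Only the is_maca()/is_intel_hpu() lines and the _err_msg call are visible in the diff above, so the remaining platform branches and the enclosing condition are assumptions inferred from the error message.

# Sketch only: current_platform and _err_msg are assumed to be in scope as in
# fastdeploy/model_executor/utils.py; the cuda/xpu/iluvatar branches are inferred
# from the error message and are not part of the visible diff.
if not (
    current_platform.is_cuda()
    or current_platform.is_xpu()
    or current_platform.is_iluvatar()
    or current_platform.is_maca()
    or current_platform.is_intel_hpu()
):
    _err_msg("v1loader currently only support backends gpu, xpu, iluvatar and maca")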
Suggest adding "intel hpu" to the _err_msg.
Thanks for the review; updated and added.
ERNIE-4.5-21B-A3B-Thinking needs to use the DefaultModelLoaderV1 mode.
Reference command line:
ENABLE_V1_KVCACHE_SCHEDULER=1 FD_ENC_DEC_BLOCK_NUM=8 HPU_PERF_BREAKDOWN_SYNC_MODE=1 \
HPU_WARMUP_BUCKET=0 MAX_PREFILL_NUM=1 FD_ATTENTION_BACKEND=HPU_ATTN \
python -m fastdeploy.entrypoints.openai.api_server --model \
./models--baidu--ERNIE-4.5-21B-A3B-Thinking/snapshots/4341bb42644d5422859509fa25d41544c57181f8/ \
--port 8388 --engine-worker-queue-port 8302 --metrics-port 8301 \
--cache-queue-port 8303 --max-model-len 16384 --tensor-parallel-size 1 \
--load-choices "default_v1" --num-gpu-blocks-override 5000 --kv-cache-ratio 0.5 \
--max-num-seqs 128 --block-size 64 --no-enable-prefix-caching \
--graph-optimization-config '{"use_cudagraph":false}'
Signed-off-by: Luo, Focus <[email protected]>
Force-pushed from fd36ec3 to c42a0b0.
zoooo0820 left a comment:
LGTM
ERNIE-4.5-21B-A3B-Thinking needs to use the DefaultModelLoaderV1 mode; the reference command line is given above.
Accuracy benchmark command:
python bench_gsm8k.py --data-path ./test.jsonl --port 8388 --num-shots 5 --num-questions 1319 --parallel 1 --result-file test_accuracy_parallel_1.json
Motivation
Support the ERNIE-4.5-21B-A3B-Thinking model.
Modifications
Enabled DefaultModelLoaderV1 on Intel HPU.
Usage or Command
ENABLE_V1_KVCACHE_SCHEDULER=1 FD_ENC_DEC_BLOCK_NUM=8 HPU_PERF_BREAKDOWN_SYNC_MODE=1 \
HPU_WARMUP_BUCKET=0 MAX_PREFILL_NUM=1 FD_ATTENTION_BACKEND=HPU_ATTN \
python -m fastdeploy.entrypoints.openai.api_server --model \
/mnt/disk3/ernie_opensource/hub/models--baidu--ERNIE-4.5-21B-A3B-Thinking/snapshots/4341bb42644d5422859509fa25d41544c57181f8/ \
--port 8388 --engine-worker-queue-port 8302 --metrics-port 8301 \
--cache-queue-port 8303 --max-model-len 16384 --tensor-parallel-size 1 \
--load-choices "default_v1" --num-gpu-blocks-override 5000 --kv-cache-ratio 0.5 \
--max-num-seqs 128 --block-size 64 --no-enable-prefix-caching \
--graph-optimization-config '{"use_cudagraph":false}'
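Once the server is up, a request can be sent to the OpenAI-compatible endpoint. A minimal sketch using only the Python standard library; the /v1/chat/completions route and the request/response fields follow the usual OpenAI-compatible API and are assumptions, not something confirmed in this PR.

import json
import urllib.request

# Assumption: the api_server started above exposes an OpenAI-compatible
# /v1/chat/completions route on the --port given in the command (8388).
payload = {
    "messages": [{"role": "user", "content": "What is 12 * 7?"}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8388/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # Assumption: response follows the OpenAI chat-completions schema.
    print(json.loads(resp.read())["choices"][0]["message"]["content"])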
Accuracy Tests
On Intel HPU
{"task": "gsm8k", "backend": "paddlepaddle", "num_gpus": 1, "latency": 21223.108, "accuracy": 0.774, "num_requests": 1319, "other": {"num_questions": 1319, "parallel": 1}}
{"task": "gsm8k", "backend": "paddlepaddle", "num_gpus": 1, "latency": 3019.105, "accuracy": 0.754, "num_requests": 1319, "other": {"num_questions": 1319, "parallel": 16}}
On NV H20
ENABLE_V1_KVCACHE_SCHEDULER=1 python -m fastdeploy.entrypoints.openai.api_server --model \
./models--baidu--ERNIE-4.5-21B-A3B-Thinking/snapshots/4341bb42644d5422859509fa25d41544c57181f8/ \
--port 8388 --engine-worker-queue-port 8302 --metrics-port 8301 \
--cache-queue-port 8303 --tensor-parallel-size 1 --max-model-len 16384 \
--max-num-seqs 128 --block-size 64 --kv-cache-ratio 0.5 --num-gpu-blocks-override 6000 \
--graph-optimization-config '{"use_cudagraph":false}' --no-enable-prefix-caching \
--load-choices "default_v1"
{"task": "gsm8k", "backend": "paddlepaddle", "num_gpus": 1, "latency": 18805.898, "accuracy": 0.763, "num_requests": 1319, "other": {"num_questions": 1319, "parallel": 1}}
Checklist
- PR tag (at least one): [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- If the PR targets a release branch, make sure it has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.