[Model loaded on CPU] Embedding performance of Xinference's vLLM backend is lower than native vllm serve at concurrency 1 #4418

### System Info

xinference 1.15.0
vllm 0.11
ubuntu 20.04

### Running Xinference with Docker?

- [ ] docker
- [ ] pip install
- [ ] installation from source

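### Version info

As of 1.15.0, the vLLM embedding backend is known to support aggregating requests into batches and asynchronous processing optimizations. For details, see:
https://github.com/xorbitsai/inference/pull/4197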
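
To make "aggregating requests into batches" concrete, here is a generic micro-batching sketch with `asyncio`. It is not the code from the PR above: `MicroBatcher`, `embed_batch`, `max_batch`, and `window` are hypothetical names, and whether Xinference uses a time window at all is an assumption. In this pattern, a lone request can wait up to `window` seconds before it is processed, which is one kind of per-request overhead worth ruling out when testing at concurrency 1.

```python
import asyncio


class MicroBatcher:
    """Hypothetical batcher: groups requests for up to `window` seconds or `max_batch` items."""

    def __init__(self, embed_batch, max_batch=8, window=0.01):
        self.embed_batch = embed_batch  # callable: list[str] -> list of vectors
        self.max_batch = max_batch
        self.window = window
        self.queue = asyncio.Queue()

    async def submit(self, text):
        """Enqueue one text and wait for its embedding."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((text, fut))
        return await fut

    async def run(self):
        """Worker loop: collect a batch, embed it in one call, resolve the futures."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request arrives
            deadline = loop.time() + self.window
            while len(batch) < self.max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break  # window elapsed; process what we have
            results = self.embed_batch([text for text, _ in batch])
            for (_, fut), vec in zip(batch, results):
                fut.set_result(vec)


async def demo():
    calls = []  # records the size of each batch passed to the fake model

    def fake_embed(texts):
        calls.append(len(texts))
        return [[float(len(t))] for t in texts]  # stand-in for a real embedding

    batcher = MicroBatcher(fake_embed)  # created inside the running event loop
    worker = asyncio.create_task(batcher.run())
    vectors = await asyncio.gather(*(batcher.submit(t) for t in ["a", "bb", "ccc"]))
    worker.cancel()
    return vectors, calls
```

Running `asyncio.run(demo())` returns the three fake vectors together with the batch sizes seen by `fake_embed`. A single `submit` call with no concurrent traffic would only complete after the `window` timeout expires, since `max_batch` is never reached.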

### The command used to start Xinference

<img width="1661" height="962" alt="Image" src="https://github.com/user-attachments/assets/b3ecf1e9-63ba-4db9-8642-d3f1e710bded" />
Xinference launches Qwen3-Embedding-0.6B on CPU with tensor parallel size 1 (see the screenshot above).

For comparison, the native vLLM launch command:
vllm serve /weights/Qwen3-Embed-0.6B --trust-remote-code --tensor-parallel-size 1 --max-model-len 6553 --enforce-eager --served-model-name qwen3_emb --port 8008 --task embed --enable-log-requests


### Reproduction

import requests
import time

for i in range(50000):
    # Start timing this request
    start_time = time.time()
    response = requests.post(
        "http://localhost:8008/v1/embeddings",  # embeddings endpoint
        json={
            "model": "qwen3_emb",  # must match --served-model-name (or your Xinference model UID, e.g. 'bge-m3')
            "input": "A man is eating pasta."  # accepts a string or a list of strings
        }
    )
    # Stop timing
    end_time = time.time()
    total_time = end_time - start_time

    print(f"{i} request latency: {total_time:.3f} s")
    # print("Embedding result:", response.json())
The test script above measures the latency of each request.
With native vllm serve, each request takes about 0.078 s.
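
Per-request prints are hard to compare across two servers, so a small helper that summarizes the latencies may be useful. The following is a minimal sketch, assuming the same request payload as the script above. `summarize` and `measure` are hypothetical helper names, and the Xinference URL (port 9997) and model UID in the usage comment are assumptions that should be replaced with the real values. Note that `measure` reuses one HTTP connection and discards warm-up requests, so its numbers are not directly comparable to those from the original script.

```python
import math
import statistics
import time


def summarize(latencies):
    """Return mean, median, and nearest-rank p99 (seconds) of per-request latencies."""
    ordered = sorted(latencies)
    rank = max(1, math.ceil(0.99 * len(ordered)))  # nearest-rank percentile
    return {
        "mean": statistics.fmean(ordered),
        "p50": statistics.median(ordered),
        "p99": ordered[rank - 1],
    }


def measure(url, model, n=200, warmup=10):
    """Send warmup + n sequential requests (concurrency 1); return the last n latencies."""
    import requests  # third-party, imported lazily so summarize() works without it

    payload = {"model": model, "input": "A man is eating pasta."}
    latencies = []
    with requests.Session() as session:  # reuse one connection for all requests
        for i in range(warmup + n):
            start = time.perf_counter()
            session.post(url, json=payload, timeout=60).raise_for_status()
            elapsed = time.perf_counter() - start
            if i >= warmup:  # skip warm-up requests
                latencies.append(elapsed)
    return latencies


# Usage (both servers must be running; the second URL and model UID are assumptions):
# print(summarize(measure("http://localhost:8008/v1/embeddings", "qwen3_emb")))
# print(summarize(measure("http://localhost:9997/v1/embeddings", "Qwen3-Embedding-0.6B")))
```

Comparing the median and p99 from both endpoints on the same machine gives a steadier picture than individual request timings.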

<img width="551" height="397" alt="Image" src="https://github.com/user-attachments/assets/2de86874-fbab-42d0-b67b-5c54619a5e3e" />

### Expected behavior

Latency should be close on both sides, with Xinference performance as near to native vllm serve as possible.
