Description
Motivation.
Currently, the vllm-omni test suite consists primarily of model-specific end-to-end (E2E) offline inference tests. As Omni supports a growing number of model architectures, this testing approach has revealed several limitations:
- Lack of Automated Accuracy Comparison: Existing tests only verify that the process is runnable, lacking automated numerical consistency checks against HuggingFace (HF) reference implementations.
- Insufficient Cache Logic Coverage: There is a lack of standardized "Cache Consistency Tests." In multimodal scenarios, placeholders (e.g., <image_pad>) are expanded into a large number of Vision/Audio tokens. If the expansion logic differs between "Cache ON" and "Cache OFF" states, it leads to unpredictable precision drift.
- Poor Extensibility and High Maintenance Burden: Adding new models requires writing redundant boilerplate code. There is no way to leverage declarative configurations to automatically obtain test coverage upon model registration.
Therefore, it is necessary to introduce a generalized multimodal comparison testing framework, similar to the one in the main vllm repository, to improve the engineering quality and development efficiency of the Omni-mode inference engine.
This framework defines the testing strategy required to implement the CI system described in #400.
Proposed Change.
This proposal suggests migrating vllm's common testing logic to vllm-omni, with deep adaptations for Omni-mode.
2.1 Directory Structure
A three-layer architecture is recommended under tests/models/multimodal/:
tests/models/multimodal/
├── vlm_utils/ # Core Utilities: Type definitions and automated parameter allocation
├── generation/ # Generation Alignment: Extracting Hidden States and comparing with HF
│ ├── test_common.py # Common test entry point
│ └── runners/ # Execution engines supporting multi-stage outputs (VllmRunner, HfRunner)
└── processing/ # Preprocessing Tests: Validating input pipeline and cache consistency
    └── test_common.py # Cache consistency validation logic
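As a rough illustration of what the cache consistency check in processing/test_common.py could assert, here is a minimal, self-contained sketch. It does not use any vllm or vllm-omni API: `ToyProcessor`, `expand_placeholders`, and `check_cache_consistency` are hypothetical stand-ins, and the real processor would expand placeholders according to the model's own rules. The idea is only that a cache miss, a cache hit, and an uncached run must all yield the same expanded token sequence.

```python
from dataclasses import dataclass, field

# Every name here is hypothetical and exists only for illustration.
PLACEHOLDER = "<image_pad>"


def expand_placeholders(tokens, num_mm_tokens):
    """Replace each placeholder with num_mm_tokens copies of itself."""
    out = []
    for tok in tokens:
        if tok == PLACEHOLDER:
            out.extend([PLACEHOLDER] * num_mm_tokens)
        else:
            out.append(tok)
    return out


@dataclass
class ToyProcessor:
    """Toy input processor with an optional result cache."""
    use_cache: bool
    cache: dict = field(default_factory=dict, init=False)

    def __call__(self, tokens, num_mm_tokens):
        key = (tuple(tokens), num_mm_tokens)
        if self.use_cache and key in self.cache:
            return self.cache[key]          # cache hit
        result = expand_placeholders(tokens, num_mm_tokens)
        if self.use_cache:
            self.cache[key] = result        # populate on miss
        return result


def check_cache_consistency(prompts, num_mm_tokens):
    """Assert that Cache ON (miss and hit) matches Cache OFF for every prompt."""
    cached = ToyProcessor(use_cache=True)
    uncached = ToyProcessor(use_cache=False)
    for tokens in prompts:
        baseline = uncached(tokens, num_mm_tokens)
        first = cached(tokens, num_mm_tokens)    # cache miss path
        second = cached(tokens, num_mm_tokens)   # cache hit path
        assert first == baseline, "cache miss diverged from uncached output"
        assert second == baseline, "cache hit diverged from uncached output"
    return True
```

In a real test the toy processor would be replaced by the model's actual input processor, and the comparison would likely cover the processed multimodal inputs as well as the token IDs.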
2.2 Key Components
- Common Test Suites: Migrate and adapt test_common.py to establish standard entry points for end-to-end accuracy comparison and cache consistency checks.
- Multimodal Execution Engines (Runners): Refactor VllmRunner and HfRunner to focus on extracting Hidden States (latent representations) for comparison. By verifying numerical consistency of core representations, model alignment can be efficiently validated without full decoding into video/audio files.
- Parametrization Engine: Migrate case_filtering.py logic to automatically build comprehensive test matrices based on parameter spaces.
- Configuration Registry: Migrate the VLM_TEST_SETTINGS pattern to establish a standardized data input framework for models.
- Process-level Resource Isolation: Introduce the @create_new_process_for_each_test decorator to enforce GPU resource reclamation via sub-process lifecycle management, resolving memory fragmentation issues caused by running large models sequentially.
Feedback Period.
No response
CC List.
@hsliuustc0106 @princepride @congw729
Any Other Things.
No response