Support Paraformer-zh in optimum-intel#1642
Support Paraformer-zh in optimum-intel#1642padatta wants to merge 8 commits intohuggingface:mainfrom
Conversation
rkazants
left a comment
There was a problem hiding this comment.
Please add tests,
openvino.py, main.py should not be changed. Please check what files are changed when adding new models support in other PRs.
Thanks
There was a problem hiding this comment.
Pull request overview
This PR introduces Paraformer (funasr/paraformer-zh) support in optimum-intel’s OpenVINO workflow, covering model auto-detection, export to OpenVINO IR, optional INT8 quantization, and a new OpenVINO inference wrapper class.
Changes:
- Add Paraformer auto-detection and pipeline registration for ASR.
- Implement Paraformer OpenVINO export path (including INT8 quantization flow).
- Add a new OpenVINO inference implementation
OVParaformerForSpeechSeq2Seq.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
optimum/intel/utils/modeling_utils.py |
Detect Paraformer models via presence of am.mvn. |
optimum/intel/utils/dummy_openvino_objects.py |
Add dummy OVParaformerForSpeechSeq2Seq for missing OpenVINO backend. |
optimum/intel/pipelines/accelerator_utils.py |
Register Paraformer class under ASR task mapping. |
optimum/intel/openvino/utils.py |
Add aishell-1 dataset marker and adjust predefined dataset metadata. |
optimum/intel/openvino/modeling_speech2text.py |
New Paraformer OpenVINO inference wrapper and component classes. |
optimum/intel/openvino/__init__.py |
Export OVParaformerForSpeechSeq2Seq from OpenVINO package. |
optimum/intel/__init__.py |
Export OVParaformerForSpeechSeq2Seq from top-level package. |
optimum/exporters/openvino/modeling_paraformer.py |
New Paraformer Torch/TorchScript export implementation used by OpenVINO conversion. |
optimum/exporters/openvino/__main__.py |
Add Paraformer-specific export + quantization handling into main_export and task inference. |
optimum/commands/export/openvino.py |
Skip _main_quantize for Paraformer since quantization is handled during export. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if ov_config is not None and ov_config.quantization_config is not None: | ||
| import nncf | ||
| import numpy as np | ||
| import librosa |
There was a problem hiding this comment.
Paraformer INT8 quantization imports librosa unconditionally inside the quantization branch. librosa is not a declared runtime dependency, so --quant-mode int8 --dataset aishell-1 will crash with ModuleNotFoundError in standard installs. Either switch to the existing datasets-based audio loading used elsewhere in the repo, or wrap the import in a try/except and raise a targeted error telling users which extra to install.
| import librosa | |
| try: | |
| import librosa | |
| except ModuleNotFoundError as exc: | |
| if exc.name == "librosa": | |
| raise RuntimeError( | |
| "INT8 quantization for Paraformer requires the optional 'librosa' dependency for " | |
| "audio feature extraction. Please install it with `pip install librosa` or install " | |
| "the appropriate audio extra for this package." | |
| ) from exc | |
| raise |
| def __int__(self, d_model=80, dropout_rate=0.1): | ||
| pass | ||
|
|
||
| # Copied from https://github.com/modelscope/FunASR/blob/main/funasr/models/transformer/embedding.py#L383 (Apache 2.0) | ||
| class SinusoidalPositionEncoder(torch.nn.Module): | ||
| """ """ | ||
|
|
||
| def __int__(self, d_model=80, dropout_rate=0.1): | ||
| pass |
There was a problem hiding this comment.
Both StreamSinusoidalPositionEncoder and SinusoidalPositionEncoder define __int__ instead of __init__, so their constructors will never run. If these classes are expected to initialize any state (e.g., dropout), it will silently be skipped. Rename __int__ to __init__ (or remove the method entirely if no initialization is needed).
| def __int__(self, d_model=80, dropout_rate=0.1): | |
| pass | |
| # Copied from https://github.com/modelscope/FunASR/blob/main/funasr/models/transformer/embedding.py#L383 (Apache 2.0) | |
| class SinusoidalPositionEncoder(torch.nn.Module): | |
| """ """ | |
| def __int__(self, d_model=80, dropout_rate=0.1): | |
| pass | |
| def __init__(self, d_model=80, dropout_rate=0.1): | |
| super().__init__() | |
| # Copied from https://github.com/modelscope/FunASR/blob/main/funasr/models/transformer/embedding.py#L383 (Apache 2.0) | |
| class SinusoidalPositionEncoder(torch.nn.Module): | |
| """ """ | |
| def __init__(self, d_model=80, dropout_rate=0.1): | |
| super().__init__() |
| if lora_list is not None: | ||
| if "o" in lora_list: | ||
| self.linear_out = lora.Linear( | ||
| n_feat, n_feat, r=lora_rank, lora_alpha=lora_alpha, lora_dropout=lora_dropout | ||
| ) | ||
| else: | ||
| self.linear_out = nn.Linear(n_feat, n_feat) | ||
| lora_qkv_list = ["q" in lora_list, "k" in lora_list, "v" in lora_list] | ||
| if lora_qkv_list == [False, False, False]: | ||
| self.linear_q_k_v = nn.Linear(in_feat, n_feat * 3) | ||
| else: | ||
| self.linear_q_k_v = lora.MergedLinear( | ||
| in_feat, | ||
| n_feat * 3, | ||
| r=lora_rank, | ||
| lora_alpha=lora_alpha, | ||
| lora_dropout=lora_dropout, | ||
| enable_lora=lora_qkv_list, | ||
| ) |
There was a problem hiding this comment.
lora.Linear / lora.MergedLinear is referenced when lora_list is not None, but lora is never imported/defined in this module. This will raise NameError for configs that enable LoRA. Add the missing import (or gate the LoRA branch behind an availability check with a clear error).
|
|
||
| inputs = { | ||
| "encoder_out": self._prepare_input(encoder_out), | ||
| "encoder_out_lens": self._prepare_input(mask), | ||
| } | ||
|
|
There was a problem hiding this comment.
OVParaformerPredictor.forward() builds an attention mask but feeds it into the input named "encoder_out_lens", while also hard-coding input names ("encoder_out", "encoder_out_lens"). This is very likely to mismatch the exported predictor IR inputs (which typically follow the TorchScript arg names, e.g. hidden/mask) and will break component-based inference. Use the discovered self.input_names (or positional indices) to map encoder_out and the computed mask to the actual OpenVINO input names.
| inputs = { | |
| "encoder_out": self._prepare_input(encoder_out), | |
| "encoder_out_lens": self._prepare_input(mask), | |
| } | |
| # Map encoder_out and mask to actual OV input names using discovered input_names | |
| inputs = {} | |
| if len(self.input_names) > 0: | |
| inputs[self.input_names[0]] = self._prepare_input(encoder_out) | |
| if len(self.input_names) > 1: | |
| inputs[self.input_names[1]] = self._prepare_input(mask) | |
| PREDEFINED_SD_DATASETS = { | ||
| "conceptual_captions": { | ||
| "id": "google-research-datasets/conceptual_captions", | ||
| "split": "train", | ||
| "prompt_column_name": "caption", | ||
| "streaming": True, | ||
| }, | ||
| "conceptual_captions": {"split": "train", "prompt_column_name": "caption", "streaming": True}, | ||
| "laion/220k-GPT4Vision-captions-from-LIVIS": { | ||
| "id": "laion/220k-GPT4Vision-captions-from-LIVIS", | ||
| "split": "train", | ||
| "prompt_column_name": "caption", | ||
| "streaming": True, | ||
| }, | ||
| "laion/filtered-wit": { | ||
| "id": "laion/filtered-wit", | ||
| "split": "train", | ||
| "prompt_column_name": "caption", | ||
| "streaming": True, | ||
| }, | ||
| "laion/filtered-wit": {"split": "train", "prompt_column_name": "caption", "streaming": True}, | ||
| } |
There was a problem hiding this comment.
PREDEFINED_SD_DATASETS entries no longer include an id, but the quantization pipeline expects dataset_metadata["id"] (see optimum/intel/openvino/quantization.py). This will raise a KeyError when quantizing diffusion pipelines. Add the missing id back for each predefined SD dataset (or update the quantization code to handle missing ids consistently).
| ) | ||
| raise ValueError( | ||
| f"Asked to export a {model_type} model for the task {task}{autodetected_message}, but the Optimum OpenVINO exporter only supports the tasks {', '.join(model_tasks.keys())} for {model_type}. Please use a supported task. Please open an issue at https://github.com/huggingface/optimum-intel/issues if you would like the task {task} to be supported in the OpenVINO export for {model_type}." | ||
| f"Asked to export a {model_type} model for the task {task}{autodetected_message}, but the Optimum OpenVINO exporter only supports the tasks {', '.join(model_tasks.keys())} for {model_type}. Please use a supported task. Please open an issue at https://github.com/huggingface/optimum/issues if you would like the task {task} to be supported in the ONNX export for {model_type}." |
There was a problem hiding this comment.
The raised ValueError message mentions the optimum repo and "ONNX export", but this codepath is in the OpenVINO exporter and other messages in this file point users to huggingface/optimum-intel and OpenVINO export. Update this message to consistently reference the correct repository and exporter to avoid confusing users.
| f"Asked to export a {model_type} model for the task {task}{autodetected_message}, but the Optimum OpenVINO exporter only supports the tasks {', '.join(model_tasks.keys())} for {model_type}. Please use a supported task. Please open an issue at https://github.com/huggingface/optimum/issues if you would like the task {task} to be supported in the ONNX export for {model_type}." | |
| f"Asked to export a {model_type} model for the task {task}{autodetected_message}, but the Optimum OpenVINO exporter only supports the tasks {', '.join(model_tasks.keys())} for {model_type}. Please use a supported task. Please open an issue at https://github.com/huggingface/optimum-intel/issues if you would like the task {task} to be supported in the OpenVINO export for {model_type}." |
| import os | ||
| import json | ||
| import copy | ||
| from omegaconf import OmegaConf, DictConfig, ListConfig |
There was a problem hiding this comment.
This module hard-depends on omegaconf, but omegaconf is not listed in install_requires/extras. As-is, Paraformer export will fail at import time with ModuleNotFoundError. Either add omegaconf to the appropriate extra (and ensure the exporter only imports this module when that extra is installed), or remove the dependency by using standard YAML/JSON parsing.
| from omegaconf import OmegaConf, DictConfig, ListConfig | |
| try: | |
| from omegaconf import OmegaConf, DictConfig, ListConfig | |
| _OMEGACONF_AVAILABLE = True | |
| except ImportError: # pragma: no cover - optional dependency | |
| OmegaConf = None # type: ignore[assignment] | |
| DictConfig = None # type: ignore[assignment] | |
| ListConfig = None # type: ignore[assignment] | |
| _OMEGACONF_AVAILABLE = False |
| from openvino import CompiledModel, Core, Model | ||
| import torch | ||
| from huggingface_hub import hf_hub_download | ||
| from huggingface_hub.constants import HUGGINGFACE_HUB_CACHE | ||
| from transformers import AutoConfig, PretrainedConfig, GenerationConfig |
There was a problem hiding this comment.
There are unused imports (hf_hub_download, GenerationConfig, and Model), which will trigger Ruff F401 failures and increase import time. Remove them or use them as intended (e.g., implement Hub downloading if planned).
| from openvino import CompiledModel, Core, Model | |
| import torch | |
| from huggingface_hub import hf_hub_download | |
| from huggingface_hub.constants import HUGGINGFACE_HUB_CACHE | |
| from transformers import AutoConfig, PretrainedConfig, GenerationConfig | |
| from openvino import CompiledModel, Core | |
| import torch | |
| from huggingface_hub.constants import HUGGINGFACE_HUB_CACHE | |
| from transformers import AutoConfig, PretrainedConfig |
| "audio-classification": (OVModelForAudioClassification,), | ||
| "audio-frame-classification": (OVModelForAudioFrameClassification,), | ||
| "audio-xvector": (OVModelForAudioXVector,), | ||
| "automatic-speech-recognition": (OVModelForCTC, OVModelForSpeechSeq2Seq), | ||
| "automatic-speech-recognition": (OVModelForCTC, OVModelForSpeechSeq2Seq, OVParaformerForSpeechSeq2Seq), | ||
| "feature-extraction": (OVModelForFeatureExtraction,), |
There was a problem hiding this comment.
OV_TASKS_MAPPING["automatic-speech-recognition"] now includes OVParaformerForSpeechSeq2Seq, but get_openvino_model_class() only ever returns index 0 (CTC) or index 1 (seq2seq). As a result, Paraformer will never be auto-selected by pipelines for ASR. Either update the selection logic to detect Paraformer models (e.g., via library_name/characteristic files) and return the Paraformer class, or remove it from the mapping to avoid a misleading entry.
81b9c35 to
d528b3c
Compare
This adds comprehensive support for Alibaba's Paraformer automatic speech recognition model in optimum-intel without modifying core export files. Inference Support: - Add OVParaformerForSpeechSeq2Seq class for inference with OpenVINO - Support for single model and component-based architectures - CPU/GPU support with dynamic device switching - FP32/FP16/INT8 model support with automatic format detection - Includes encoder, predictor, and decoder components Export Support: - Add modeling_paraformer.py for Paraformer model export to OpenVINO - Add standalone export_paraformer.py script (independent of main export pipeline) - Support torchscript conversion for model export - Copy model parameter files (am.mvn, config.yaml, tokens.json, etc.) - Filter streaming-specific encoder parameters for compatibility - Conditional import to avoid omegaconf dependency for non-paraformer use - Auto-detect paraformer library from model files (am.mvn, config.yaml, tokens.json) Pipeline Integration: - Add OVParaformerForSpeechSeq2Seq to accelerator_utils.py - Add to OV_TASKS_MAPPING for automatic-speech-recognition task - Add detection logic in get_openvino_model_class for Paraformer models Testing: - Add comprehensive test suite with 10 test cases (all passing) - Tests cover model loading, inference, batch processing, save/load - Tests for numpy input, generate API, and model properties Note: This implementation does NOT modify __main__.py or openvino.py. Export is available via the standalone export_paraformer.py script: python -m optimum.exporters.openvino.export_paraformer --model <path> --output <dir>
d528b3c to
30f81d9
Compare
This commit introduces a plugin-based approach for Paraformer model export that does NOT require modifications to __main__.py or openvino.py. Key changes: - Added paraformer_plugin.py with: - ParaformerConfig: Transformers-compatible config class - ParaformerForASR: Model wrapper for FunASR models - ParaformerOnnxConfig: Export configuration - TasksManager registration for 'paraformer' library - Automatic monkey-patching of main_export to detect Paraformer models - Modified model_configs.py to import the plugin at startup Usage: optimum-cli export openvino --model funasr/paraformer-zh --weight-format fp16 output_dir optimum-cli export openvino --model funasr/paraformer-zh --weight-format int8 output_dir Both FP16 and INT8 exports tested successfully with inference verification.
- Add AISHELL-1 to PREDEFINED_SPEECH_TO_TEXT_DATASETS in utils.py - Add patch_main_quantize to skip Paraformer in _main_quantize step - Enable INT8 weight compression export via optimum-cli - Add debug logging to paraformer_plugin - Tested INT8 export and GPU inference successfully
- Add Model to openvino imports to fix NameError - Fixes: NameError: name 'Model' is not defined
- Add both AISHELL-1 and aishell-1 to support case variations - Allows users to use --dataset aishell-1 (lowercase)
- Implement full INT8 quantization (weights + activations) using nncf.quantize() - Support --quant-mode int8 --dataset aishell-1 for calibration-based quantization - Use per-tensor quantization for activations (supports dynamic shapes) - Generate calibration samples from example audio with noise augmentation - Save model to ov_models/ subdirectory (matching optimum-intel structure) - Use correct tensor name 'speech.1' for calibration data - Pass ov_config to export function for quantization settings - Achieves same performance as direct __main__.py implementation
- Add ParaformerModelPatcher in model_patcher.py following ModelPatcher pattern - Add ParaformerDummyAudioInputGenerator for speech/speech_lengths inputs - Add ParaformerOpenVINOConfig with @register_in_tasks_manager decorator - Add transformers-compatible wrappers in modeling_paraformer.py: - ParaformerConfig: transformers-compatible configuration - ParaformerForASR: transformers-compatible model wrapper - _load_paraformer_model: TasksManager loader function - Register paraformer library with TasksManager for non-standard library support - Keep paraformer_plugin import for main_export hooking (required for FunASR library) Tested: - FP16 export: Working (824MB model) - INT8 export with AISHELL-1 dataset: Working (210MB model) - INT8 latency on Intel Arc iGPU: ~38.7ms median
I’ve added tests for the Paraformer model and refactored the export logic. There are no changes to main.py or openvino.py, and the implementation follows the same conventions used for the other models. |
- Add paraformer model entry to MODEL_NAMES in utils_tests.py (using funasr/paraformer-zh) - Add paraformer INT8 quantization expectations (268 quantized nodes) - Add OVParaformerForSpeechSeq2Seq import to test_export.py - Add paraformer to SUPPORTED_ARCHITECTURES in test_export.py - Add OVParaformerForSpeechSeq2Seq import to test_exporters_cli.py - Add automatic-speech-recognition task for paraformer in test_exporters_cli.py Verified: - Export via optimum-cli works correctly - Model loading with OVParaformerForSpeechSeq2Seq succeeds - Inference produces expected output shapes
Summary
This PR adds support for exporting Alibaba's Paraformer ASR model (
funasr/paraformer-zh) to OpenVINO IR format with comprehensive INT8 quantization capabilities and full inference support.Key Features
torch.whereandindex_fill_am.mvn,config.yaml,tokens.json)OVParaformerForSpeechSeq2Seq) following the SpeechT5TTS patternFiles Changed
Export Implementation
optimum/exporters/openvino/modeling_paraformer.pyoptimum/exporters/openvino/__main__.pyoptimum/commands/export/openvino.pyoptimum/intel/openvino/utils.pyaishell-1to predefined speech-to-text datasetsoptimum/intel/utils/modeling_utils.pyInference Implementation
optimum/intel/openvino/modeling_speech2text.pyoptimum/intel/__init__.pyoptimum/intel/openvino/__init__.pyoptimum/intel/pipelines/accelerator_utils.pyoptimum/intel/utils/dummy_openvino_objects.pyUsage
Export Models
Inference with OVParaformerForSpeechSeq2Seq
from optimum.intel.openvino import OVParaformerForSpeechSeq2Seq
import torch
Load model
model = OVParaformerForSpeechSeq2Seq.from_pretrained(
"ov_paraformer_int8/ov_models",
device="GPU"
)
Prepare inputs
speech = torch.randn(1, 100, 560) # [batch, time, features]
speech_lengths = torch.tensor([100], dtype=torch.int32)
Run inference
output = model(speech, speech_lengths)
Get results
token_ids = output.token_ids # Decoded token IDs
token_num = output.token_num # Number of valid tokens
logits = output.logits # Raw logits