Description
The Error
When cuda_use_flash_attention2=True is set in AcceleratorOptions, CodeFormulaV2 loads successfully but every
inference batch fails with:
FlashAttention only support fp16 and bf16 data type
The model loads in fp32 (the HuggingFace default when dtype=None), but the attention implementation is set to
flash_attention_2, which requires fp16/bf16 tensors.
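The failure mode can be reproduced with a small stand-alone sketch (illustrative only; the real check lives inside the FlashAttention CUDA kernels, and the helper name below is made up):

```python
def check_flash_attn_dtype(torch_dtype):
    """Illustrative stand-in for FlashAttention's runtime dtype check.

    HuggingFace loads weights in float32 when no dtype is configured,
    and the FA2 kernels then reject the fp32 tensors at inference time.
    """
    effective = torch_dtype or "float32"  # HF default when dtype=None
    if effective not in ("float16", "bfloat16"):
        raise RuntimeError("FlashAttention only support fp16 and bf16 data type")
    return effective

check_flash_attn_dtype("bfloat16")  # OK: this is what smoldocling configures
# check_flash_attn_dtype(None)     # raises: this is the CodeFormulaV2 path
```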
Issue 1: Preset definition omits torch_dtype
File: docling/datamodel/stage_model_specs.py, lines 1108-1129
CODE_FORMULA_CODEFORMULAV2 = StageModelPreset(
preset_id="codeformulav2",
...
model_spec=VlmModelSpec(
...
engine_overrides={
VlmEngineType.TRANSFORMERS: EngineModelConfig(
extra_config={
"transformers_model_type": TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT,
"extra_generation_config": {"skip_special_tokens": False},
}
                # NOTE: no torch_dtype specified here
),
},
),
...
default_engine_type=VlmEngineType.AUTO_INLINE,
)
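One possible fix for Issue 1 is to declare the dtype in the preset override, the same way the smoldocling and smolvlm presets do (sketch; assumes the engine override's torch_dtype field is actually propagated, which Issue 2 below shows it currently is not):

```python
# Sketch of a fixed CodeFormulaV2 preset override (config fragment).
engine_overrides={
    VlmEngineType.TRANSFORMERS: EngineModelConfig(
        torch_dtype="bfloat16",  # FA2-compatible, matching smoldocling/smolvlm
        extra_config={
            "transformers_model_type": TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT,
            "extra_generation_config": {"skip_special_tokens": False},
        },
    ),
},
```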
Issue 2: get_engine_config() drops torch_dtype
File: docling/datamodel/stage_model_specs.py, lines 216-241
Even if torch_dtype were added to the CodeFormulaV2 preset, it would be silently discarded by
VlmModelSpec.get_engine_config():
def get_engine_config(self, engine_type: VlmEngineType) -> EngineModelConfig:
repo_id = self.get_repo_id(engine_type)
revision = self.get_revision(engine_type)
extra_config = {}
if engine_type in self.engine_overrides:
extra_config = self.engine_overrides[engine_type].extra_config.copy()
return EngineModelConfig(
repo_id=repo_id,
revision=revision,
extra_config=extra_config,
        # NOTE: torch_dtype is NOT extracted from self.engine_overrides[engine_type]
)
This method constructs a new EngineModelConfig with only repo_id, revision, and extra_config. The torch_dtype
field from the engine override is never read.
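A minimal sketch of the fix, using stand-in dataclasses rather than the real docling types (field names are taken from the snippet above; the free-function signature is simplified for illustration):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EngineModelConfig:  # stand-in; the real class lives in docling
    repo_id: str = ""
    revision: Optional[str] = None
    extra_config: dict = field(default_factory=dict)
    torch_dtype: Optional[str] = None

def get_engine_config(overrides, engine_type, repo_id, revision):
    """Patched version: copies torch_dtype from the override instead of dropping it."""
    override = overrides.get(engine_type)
    return EngineModelConfig(
        repo_id=repo_id,
        revision=revision,
        extra_config=dict(override.extra_config) if override else {},
        torch_dtype=override.torch_dtype if override else None,  # the fix
    )

cfg = get_engine_config(
    {"transformers": EngineModelConfig(torch_dtype="bfloat16")},
    "transformers", "some/repo", "main",
)
print(cfg.torch_dtype)  # bfloat16
```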
Issue 3: AutoInlineVlmEngine creates engine options with default torch_dtype=None
File: docling/models/inference_engines/vlm/auto_inline_engine.py, lines 196-207
When AutoInlineVlmEngine (CodeFormulaV2’s default engine) selects the Transformers backend, it creates
TransformersVlmEngineOptions() with no arguments:
else: ### TRANSFORMERS
transformers_options = TransformersVlmEngineOptions() # torch_dtype=None
self.actual_engine = TransformersVlmEngine(
options=transformers_options,
accelerator_options=self.accelerator_options,
artifacts_path=self.artifacts_path,
model_config=model_config, # model_config.torch_dtype is also None (Issue 2)
)
TransformersVlmEngineOptions defaults torch_dtype to None (line 70 in vlm_engine_options.py):
class TransformersVlmEngineOptions(BaseVlmEngineOptions):
torch_dtype: Optional[str] = Field(
default=None, description="PyTorch dtype (e.g., 'float16', 'bfloat16')"
)
And even though model_config is passed, it also has torch_dtype=None due to Issue 2.
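Putting Issues 2 and 3 together, the effective resolution order can be sketched as follows (illustrative; the real engine delegates the final fallback to HuggingFace's from_pretrained):

```python
def resolve_torch_dtype(options_dtype, model_config_dtype):
    # In the CodeFormulaV2 path both sources are None, so from_pretrained
    # falls back to its float32 default and FA2 fails later at inference.
    return options_dtype or model_config_dtype or "float32"

print(resolve_torch_dtype(None, None))        # float32  <- the bug
print(resolve_torch_dtype(None, "bfloat16"))  # bfloat16 <- once Issue 2 is fixed
```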
Suggested Fixes
Ideal fix: Automatic dtype/attention compatibility
When cuda_use_flash_attention2=True and torch_dtype=None, the engine should automatically cast to bf16 rather
than loading in fp32 and failing at inference time. This mirrors how HuggingFace's from_pretrained handles
attn_implementation="flash_attention_2": when torch_dtype=torch.float32 is explicitly passed, it raises an
error, but when the dtype is unset, it could default to bf16 for compatibility.
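The suggested behavior could look like this hypothetical shim (names are illustrative, not the docling API):

```python
def effective_dtype(torch_dtype, cuda_use_flash_attention2):
    """Pick a dtype compatible with the requested attention implementation.

    If FA2 was requested but no dtype was configured, default to bfloat16
    instead of letting from_pretrained load the model in float32.
    """
    if cuda_use_flash_attention2 and torch_dtype is None:
        return "bfloat16"
    if cuda_use_flash_attention2 and torch_dtype not in ("float16", "bfloat16"):
        # Fail fast at load time instead of at the first inference batch.
        raise ValueError(
            f"cuda_use_flash_attention2=True is incompatible with torch_dtype={torch_dtype!r}"
        )
    return torch_dtype

print(effective_dtype(None, True))       # bfloat16
print(effective_dtype("float16", True))  # float16
print(effective_dtype(None, False))      # None (keep the HF default)
```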
Affected Presets

| Preset | Stage | torch_dtype in TRANSFORMERS override | FA2 compatible? |
|---|---|---|---|
| smoldocling | VLM_CONVERT | "bfloat16" | Yes |
| smolvlm | PICTURE_DESC | "bfloat16" | Yes |
| codeformulav2 | CODE_FORMULA | Not set (None) | No |
| granite_docling | CODE_FORMULA | Not set (None) | No |
| granite_vision | PICTURE_DESC | Not set (None) | No (but typically API-based) |