
CodeFormulaV2 Preset Missing torch_dtype Causes Flash Attention 2 Incompatibility  #3026

@raven16180

Description


The Error

When `cuda_use_flash_attention2=True` is set in `AcceleratorOptions`, CodeFormulaV2 loads successfully but every inference batch fails with:

```
FlashAttention only support fp16 and bf16 data type
```

The model loads in fp32 (the HuggingFace default when `dtype=None`), but the attention implementation is set to
`flash_attention_2`, which requires fp16/bf16 tensors.

Issue 1: Preset definition omits torch_dtype

File: docling/datamodel/stage_model_specs.py, lines 1108-1129
```python
CODE_FORMULA_CODEFORMULAV2 = StageModelPreset(
    preset_id="codeformulav2",
    ...
    model_spec=VlmModelSpec(
        ...
        engine_overrides={
            VlmEngineType.TRANSFORMERS: EngineModelConfig(
                extra_config={
                    "transformers_model_type": TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT,
                    "extra_generation_config": {"skip_special_tokens": False},
                }
                # NOTE: no torch_dtype specified here
            ),
        },
    ),
    ...
    default_engine_type=VlmEngineType.AUTO_INLINE,
)
```
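A sketch of the straightforward preset-level fix (assuming `EngineModelConfig` accepts a `torch_dtype` field, as the `smoldocling` and `smolvlm` presets suggest) would be to add the dtype to the override:

```python
engine_overrides={
    VlmEngineType.TRANSFORMERS: EngineModelConfig(
        torch_dtype="bfloat16",  # matches smoldocling/smolvlm; required by FA2
        extra_config={
            "transformers_model_type": TransformersModelType.AUTOMODEL_IMAGETEXTTOTEXT,
            "extra_generation_config": {"skip_special_tokens": False},
        },
    ),
},
```

Note that, per Issue 2 below, this alone is not sufficient: `get_engine_config()` must also propagate the field.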

Issue 2: get_engine_config() drops torch_dtype

File: docling/datamodel/stage_model_specs.py, lines 216-241
Even if torch_dtype were added to the CodeFormulaV2 preset, it would be silently discarded by
VlmModelSpec.get_engine_config():
```python
def get_engine_config(self, engine_type: VlmEngineType) -> EngineModelConfig:
    repo_id = self.get_repo_id(engine_type)
    revision = self.get_revision(engine_type)

    extra_config = {}
    if engine_type in self.engine_overrides:
        extra_config = self.engine_overrides[engine_type].extra_config.copy()

    return EngineModelConfig(
        repo_id=repo_id,
        revision=revision,
        extra_config=extra_config,
        # NOTE: torch_dtype is NOT extracted from self.engine_overrides[engine_type]
    )
```

This method constructs a new EngineModelConfig with only repo_id, revision, and extra_config. The torch_dtype
field from the engine override is never read.
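A minimal, self-contained sketch of how the method could propagate `torch_dtype` from the override. The dataclasses below are illustrative stand-ins for docling's real types, and the repo id is a placeholder:

```python
from dataclasses import dataclass, field
from typing import Optional


# Illustrative stand-in for docling's EngineModelConfig.
@dataclass
class EngineModelConfig:
    repo_id: str = ""
    revision: Optional[str] = None
    torch_dtype: Optional[str] = None
    extra_config: dict = field(default_factory=dict)


def get_engine_config(overrides: dict, engine_type: str,
                      repo_id: str, revision: str) -> EngineModelConfig:
    extra_config = {}
    torch_dtype = None
    if engine_type in overrides:
        override = overrides[engine_type]
        extra_config = override.extra_config.copy()
        # The fix: carry the override's torch_dtype into the new config.
        torch_dtype = override.torch_dtype
    return EngineModelConfig(
        repo_id=repo_id,
        revision=revision,
        torch_dtype=torch_dtype,
        extra_config=extra_config,
    )


override = EngineModelConfig(torch_dtype="bfloat16",
                             extra_config={"skip_special_tokens": False})
cfg = get_engine_config({"transformers": override}, "transformers",
                        "example/repo", "main")
print(cfg.torch_dtype)  # bfloat16
```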

Issue 3: AutoInlineVlmEngine creates engine options with default torch_dtype=None

File: docling/models/inference_engines/vlm/auto_inline_engine.py, lines 196-207

When AutoInlineVlmEngine (CodeFormulaV2’s default engine) selects the Transformers backend, it creates
TransformersVlmEngineOptions() with no arguments:
```python
else:  # TRANSFORMERS
    transformers_options = TransformersVlmEngineOptions()  # torch_dtype=None
    self.actual_engine = TransformersVlmEngine(
        options=transformers_options,
        accelerator_options=self.accelerator_options,
        artifacts_path=self.artifacts_path,
        model_config=model_config,  # model_config.torch_dtype is also None (Issue 2)
    )
```

TransformersVlmEngineOptions defaults torch_dtype to None (line 70 in vlm_engine_options.py):
```python
class TransformersVlmEngineOptions(BaseVlmEngineOptions):
    torch_dtype: Optional[str] = Field(
        default=None, description="PyTorch dtype (e.g., 'float16', 'bfloat16')"
    )
```

And even though model_config is passed, it also has torch_dtype=None due to Issue 2.
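One possible patch at this layer, sketched with illustrative stand-in classes (the helper name is hypothetical, not docling's API), is to forward the model config's dtype when constructing the options:

```python
from dataclasses import dataclass
from typing import Optional


# Illustrative stand-ins for docling's option/config classes.
@dataclass
class TransformersVlmEngineOptions:
    torch_dtype: Optional[str] = None


@dataclass
class ModelConfig:
    torch_dtype: Optional[str] = None


def build_transformers_options(model_config: ModelConfig) -> TransformersVlmEngineOptions:
    # Forward the preset's dtype instead of always using the default None.
    return TransformersVlmEngineOptions(torch_dtype=model_config.torch_dtype)


opts = build_transformers_options(ModelConfig(torch_dtype="bfloat16"))
print(opts.torch_dtype)  # bfloat16
```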

Suggested Fixes

Ideal fix: Automatic dtype/attention compatibility
When `cuda_use_flash_attention2=True` and `torch_dtype=None`, the engine should automatically cast to bf16 rather
than load in fp32 and fail at inference time. HuggingFace's `from_pretrained` already raises an error when
`attn_implementation="flash_attention_2"` is combined with an explicitly passed `torch_dtype=torch.float32`; when
the dtype is left unset, docling could default to bf16 for compatibility.
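The proposed resolution logic could be sketched as a small helper (the function name and placement are illustrative, not existing docling API):

```python
from typing import Optional


def resolve_torch_dtype(torch_dtype: Optional[str],
                        use_flash_attention2: bool) -> Optional[str]:
    """Pick a dtype compatible with the requested attention implementation.

    FlashAttention 2 only supports fp16/bf16, so an unset dtype is promoted
    to bfloat16 and an explicit fp32 is rejected up front, instead of every
    inference batch failing later.
    """
    if not use_flash_attention2:
        return torch_dtype
    if torch_dtype is None:
        return "bfloat16"  # safe default for FA2
    if torch_dtype in ("float16", "bfloat16"):
        return torch_dtype
    raise ValueError(
        f"torch_dtype={torch_dtype!r} is incompatible with flash_attention_2; "
        "use 'float16' or 'bfloat16'."
    )


print(resolve_torch_dtype(None, True))       # bfloat16
print(resolve_torch_dtype("float16", True))  # float16
print(resolve_torch_dtype(None, False))      # None
```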
Affected Presets

| Preset | Stage | torch_dtype in TRANSFORMERS override | FA2 compatible? |
|---|---|---|---|
| smoldocling | VLM_CONVERT | "bfloat16" | Yes |
| smolvlm | PICTURE_DESC | "bfloat16" | Yes |
| codeformulav2 | CODE_FORMULA | Not set (None) | No |
| granite_docling | CODE_FORMULA | Not set (None) | No |
| granite_vision | PICTURE_DESC | Not set (None) | No (but typically API-based) |

Labels: enhancement (New feature or request)