mtmd: plan to add video input support #18389

@ngxson

Ref: discussions in #16910 and #17660.

Some considerations:

API design

Do we need to allow streaming frames?

The main issue is that streaming only benefits the encoding pass, not the decoding pass. We can probably limit this to async frame feeding, whose purpose is to avoid storing the whole decoded video in memory.

In pseudocode, this could look like:

chunks = mtmd_tokenize(...)
for chunk in chunks:
    mtmd_tokenize_lazy(chunk, frame_data, frame_size) # will also call preprocessing
    mtmd_helper_eval_chunk_single(chunk, ...) # will do the encode+decode pass, then free the chunk
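
To make the memory argument concrete, here is a minimal runnable mock of that loop. None of the names below are mtmd APIs: `Chunk`, `tokenize`, `eval_chunk` and `run` are hypothetical stand-ins, and the "encode+decode" step only counts bytes. The sketch just shows that, when each frame is fed right before its chunk is evaluated and freed right after, at most one decoded frame is held at a time.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class Chunk:
    """Hypothetical stand-in for one tokenized chunk; `frame` is empty until fed."""
    index: int
    frame: Optional[bytes] = None


def tokenize(n_frames: int) -> List[Chunk]:
    # Stand-in for mtmd_tokenize(...): placeholder chunks, no pixel data yet.
    return [Chunk(i) for i in range(n_frames)]


def eval_chunk(chunk: Chunk) -> int:
    # Stand-in for the encode+decode pass: consume the frame, then free it.
    assert chunk.frame is not None, "frame must be fed before evaluation"
    consumed = len(chunk.frame)
    chunk.frame = None
    return consumed


def run(frames: Iterator[bytes], n_frames: int) -> Tuple[int, int]:
    """Return (total bytes processed, peak number of frames held in memory)."""
    chunks = tokenize(n_frames)
    total, peak = 0, 0
    for chunk in chunks:
        chunk.frame = next(frames)  # stand-in for mtmd_tokenize_lazy(...)
        held = sum(1 for c in chunks if c.frame is not None)
        peak = max(peak, held)
        total += eval_chunk(chunk)
    return total, peak
```

Because `frames` is an iterator, the producer (a decoder, a camera, a pipe) can generate each frame on demand instead of materializing the whole video first.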

Audio + image interleave

TODO: investigate how Qwen-Omni is handling this

What are other APIs that need to be added?

TODO: maybe expose the FPS for models that use a fixed FPS?

Video codec

Linking against libavcodec sounds like the cleaner approach, but it will lead to bad UX in practice: some systems may not have libavcodec, and users on those systems would be forced to use a different build of libmtmd.so/.dll.

A temporary solution for the first iteration would be to check whether the ffmpeg command exists. If it does, we launch it via subprocess.h and ask it to decode the video.
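
As an illustration of the ffmpeg approach, here is a minimal Python sketch (the real implementation would be C/C++ using subprocess.h). The helper names `find_ffmpeg`, `build_decode_cmd` and `iter_frames` are hypothetical, and the choice of raw RGB24 output at a fixed size and FPS is an assumption made for the example, not something decided in this issue.

```python
import shutil
import subprocess
from typing import Iterator, List, Optional


def find_ffmpeg() -> Optional[str]:
    # Returns the full path of the ffmpeg executable, or None if it is not on PATH.
    return shutil.which("ffmpeg")


def build_decode_cmd(ffmpeg: str, video_path: str,
                     width: int, height: int, fps: float) -> List[str]:
    # Ask ffmpeg to resample to `fps`, scale, and write raw RGB24 frames to stdout.
    return [ffmpeg, "-v", "error", "-nostdin", "-i", video_path,
            "-vf", f"fps={fps},scale={width}:{height}",
            "-f", "rawvideo", "-pix_fmt", "rgb24", "pipe:1"]


def iter_frames(video_path: str, width: int, height: int,
                fps: float) -> Iterator[bytes]:
    ffmpeg = find_ffmpeg()
    if ffmpeg is None:
        raise RuntimeError("ffmpeg not found on PATH; video input is unavailable")
    frame_size = width * height * 3  # 3 bytes per pixel for rgb24
    proc = subprocess.Popen(build_decode_cmd(ffmpeg, video_path, width, height, fps),
                            stdout=subprocess.PIPE)
    try:
        while True:
            buf = proc.stdout.read(frame_size)
            if len(buf) < frame_size:  # EOF or truncated last frame
                break
            yield buf
    finally:
        if proc.poll() is None:
            proc.kill()  # consumer stopped early; do not leave ffmpeg running
        proc.stdout.close()
        proc.wait()
```

Reading one frame at a time from the pipe fits the async frame feeding idea: the decoded video never has to be fully resident in memory.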

A second solution is to use dlopen manually to search for libavcodec at runtime, and reject video input if it is not found. This avoids linking libmtmd.so/.dll against libavcodec at build time.
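
The runtime-loading idea can be sketched in Python with `ctypes`, which uses dlopen under the hood on POSIX systems. This is only an analogy for what the C/C++ code would do; `try_load` and `avcodec_version_tuple` are hypothetical helper names. `avcodec_version()` is a real libavcodec function that returns the version packed as `major << 16 | minor << 8 | micro`.

```python
import ctypes
import ctypes.util
from typing import Optional, Tuple


def try_load(name: str) -> Optional[ctypes.CDLL]:
    """Try to load a shared library by short name; return None if unavailable."""
    path = ctypes.util.find_library(name)
    if path is None:
        return None
    try:
        return ctypes.CDLL(path)
    except OSError:
        # The library was located but could not be loaded (e.g. missing deps).
        return None


def avcodec_version_tuple(lib: ctypes.CDLL) -> Tuple[int, int, int]:
    # unsigned avcodec_version(void) returns the packed library version.
    lib.avcodec_version.restype = ctypes.c_uint
    lib.avcodec_version.argtypes = []
    v = lib.avcodec_version()
    return (v >> 16, (v >> 8) & 0xFF, v & 0xFF)


if __name__ == "__main__":
    lib = try_load("avcodec")
    if lib is None:
        print("libavcodec not found: video input disabled")
    else:
        print("libavcodec version:", avcodec_version_tuple(lib))
```

One thing to keep in mind with this approach is that the libavcodec ABI differs between major versions, so a version check after loading is probably needed before calling anything else.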

Internal design

As mentioned in the API design section above, we will need to implement a "lazy" image tokenizer.

For models that "fuse" frames, like Qwen (via 3D conv), we need to allow multiple images to be processed in the same batch. clip_image_f32 will need to be extended with an nz dimension.
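
A rough sketch of what the extra dimension could mean for the buffer layout, written in Python for brevity. `ImageF32` and `stack_frames` are hypothetical stand-ins, not the real `clip_image_f32`; the frame-major layout and the 3 channels per pixel are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

N_CHANNELS = 3  # assumption: RGB floats per pixel


@dataclass
class ImageF32:
    """Hypothetical stand-in for clip_image_f32 with an added nz (frame) dimension."""
    nx: int
    ny: int
    nz: int = 1
    buf: List[float] = field(default_factory=list)  # nz * ny * nx * N_CHANNELS floats


def stack_frames(frames: List[ImageF32]) -> ImageF32:
    """Concatenate same-sized single frames into one image with nz = len(frames)."""
    if not frames:
        raise ValueError("need at least one frame")
    nx, ny = frames[0].nx, frames[0].ny
    out = ImageF32(nx=nx, ny=ny, nz=0)
    for f in frames:
        if (f.nx, f.ny, f.nz) != (nx, ny, 1):
            raise ValueError("all frames must be single images of the same size")
        if len(f.buf) != nx * ny * N_CHANNELS:
            raise ValueError("frame buffer has unexpected length")
        out.buf.extend(f.buf)  # frame-major: frame 0 first, then frame 1, ...
        out.nz += 1
    return out
```

With `nz = 1` as the default, existing single-image code paths would keep working unchanged, while a model using a 3D conv could consume `nz` consecutive frames from one buffer.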

Labels: Review Complexity: High, enhancement, hot, mtmd