mtmd: plan to add video input support #18389
Description
Ref discussions from #16910 and #17660
Some considerations:
API design
Do we need to allow streaming frames?
The main issue is that streaming only benefits the encoding pass, not the decoding pass. We can probably allow only async frame feeding, which avoids having to keep the whole decoded video in memory.
A pseudo code looks like this:

```py
chunks = mtmd_tokenize(...)
for chunk in chunks:
    mtmd_tokenize_lazy(chunk, frame_data, frame_size)  # will also run preprocessing
    mtmd_helper_eval_chunk_single(chunk, ...)  # will do the encode+decode pass, then free the chunk
```

Audio + image interleave
TODO: investigate how Qwen-Omni is handling this
What are other APIs that need to be added?
TODO: maybe expose FPS for models that use a fixed FPS?
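To illustrate why exposing FPS matters: a fixed-FPS model needs the caller to sample frames at specific timestamps before feeding them in. The helper below is a hypothetical sketch (the function name and signature are assumptions, not an existing mtmd API):

```python
def sample_timestamps(duration_s: float, fps: float) -> list[float]:
    """Timestamps (in seconds) at which frames should be extracted
    for a model expecting a fixed input FPS. Hypothetical helper,
    not part of the current mtmd API."""
    step = 1.0 / fps
    n_frames = int(duration_s * fps)
    return [i * step for i in range(n_frames)]

# e.g. a 2-second clip sampled at 2 FPS yields 4 frames:
# sample_timestamps(2.0, 2.0) -> [0.0, 0.5, 1.0, 1.5]
```

If the library reported the model's expected FPS, callers could pass it straight into a sampler like this instead of hardcoding it.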
Video codec
Linking against libavcodec sounds like a cleaner approach, but would lead to a bad UX in practice: some systems may not have libavcodec installed, and users on those systems would be forced to use another build of libmtmd.so/.dll
A temporary solution for the first iteration would be to check whether the ffmpeg command exists; if it does, we launch it via subprocess.h and ask it to decode the video.
A second solution is to manually use dlopen to search for libavcodec, and reject video input if it is not found. This avoids linking libmtmd.so/.dll against libavcodec at build time.
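The runtime-lookup idea can be sketched as follows (in Python via ctypes, which calls dlopen under the hood on POSIX; the real code would call dlopen/LoadLibrary directly from C++, and the helper names here are hypothetical):

```python
import ctypes
import ctypes.util

def try_dlopen(short_name: str):
    """Locate a shared library by its short name (e.g. 'avcodec' for
    libavcodec.so / avcodec.dll) and load it at runtime.
    Returns a handle, or None if the library is not installed.
    No link-time dependency on the library is created."""
    path = ctypes.util.find_library(short_name)
    if path is None:
        return None
    try:
        return ctypes.CDLL(path)  # dlopen on POSIX, LoadLibrary on Windows
    except OSError:
        return None

def avcodec_available() -> bool:
    """Policy from the text: probe for libavcodec, reject video if absent."""
    return try_dlopen("avcodec") is not None
```

The symbols actually needed (decoder lookup, frame decoding, etc.) would then be resolved individually with dlsym, so a build of libmtmd works identically on systems with and without libavcodec.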
Internal design
As mentioned in the API design section above, we will need to implement a "lazy" image tokenizer.
For models doing "fused" frames like Qwen (via 3D conv), we need to allow multiple images to be processed in the same batch. clip_image_f32 will need to be extended with an nz dimension.
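The extension can be illustrated with a simplified stand-in (the actual clip_image_f32 layout lives in the clip implementation; the class and field names below are illustrative assumptions, not the real struct):

```python
from dataclasses import dataclass

@dataclass
class ImageF32Stack:
    """Simplified stand-in for a clip_image_f32 extended with an nz
    (frame/depth) dimension: nz frames of nx*ny RGB float pixels
    stored in one contiguous buffer."""
    nx: int
    ny: int
    nz: int  # number of stacked frames (1 = plain image, >1 = video clip)
    buf: list = None

    def __post_init__(self):
        if self.buf is None:
            self.buf = [0.0] * (self.nx * self.ny * self.nz * 3)

    def idx(self, z: int, y: int, x: int, c: int) -> int:
        # frame-major layout: all pixels of frame z are contiguous, so a
        # 3D conv can read a temporal window of frames as one flat slab
        return ((z * self.ny + y) * self.nx + x) * 3 + c
```

With nz > 1, a batch of consecutive frames becomes a single tensor, which is what a 3D conv patch-embedding (as in Qwen's fused-frame scheme) consumes in one pass.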