mtmd: plan to add video input support #18389
Description
Ref discussions from #16910 and #17660
Some considerations:
API design
Do we need to allow streaming frames?
The main issue is that streaming only benefits the encoding pass, not the decoding pass. We can probably allow only async frame feeding, which avoids having to keep the whole decoded video in memory.
A pseudo code looks like this:

```py
chunks = mtmd_tokenize(...)
for chunk in chunks:
    mtmd_tokenize_lazy(chunk, frame_data, frame_size)  # will also run preprocessing
    mtmd_helper_eval_chunk_single(chunk, ...)  # will do the encode+decode pass, then free the chunk
```

Audio + image interleave
TODO: investigate how Qwen-Omni is handling this
What are other APIs that need to be added?
TODO: maybe expose FPS for models that use a fixed FPS?
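To illustrate why exposing FPS matters: a fixed-FPS model needs the caller to sample frames at specific timestamps before feeding them in. The helper below is a hypothetical sketch (the function name and signature are assumptions, not an existing mtmd API):

```python
def sample_timestamps(duration_s: float, fps: float) -> list[float]:
    """Timestamps (in seconds) at which frames should be extracted
    for a model expecting a fixed input FPS. Hypothetical helper,
    not part of the current mtmd API."""
    step = 1.0 / fps
    n_frames = int(duration_s * fps)
    return [i * step for i in range(n_frames)]

# e.g. a 2-second clip sampled at 2 FPS yields 4 frames:
# sample_timestamps(2.0, 2.0) -> [0.0, 0.5, 1.0, 1.5]
```

If the library reported the model's expected FPS, callers could pass it straight into a sampler like this instead of hardcoding it.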
Video codec
Linking against libavcodec sounds like a cleaner approach, but would lead to a bad UX in practice: some systems may not have libavcodec installed, and users on those systems would be forced to use another build of libmtmd.so/.dll
A temporary solution for the first iteration would be to check whether the ffmpeg command exists; if it does, we launch it via subprocess.h and ask it to decode the video.
A second solution is to manually use dlopen to search for libavcodec, and reject video input if it is not found. This avoids linking libmtmd.so/.dll against libavcodec at build time.
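The runtime-lookup idea can be sketched as follows (in Python via ctypes, which calls dlopen under the hood on POSIX; the real code would call dlopen/LoadLibrary directly from C++, and the helper names here are hypothetical):

```python
import ctypes
import ctypes.util

def try_dlopen(short_name: str):
    """Locate a shared library by its short name (e.g. 'avcodec' for
    libavcodec.so / avcodec.dll) and load it at runtime.
    Returns a handle, or None if the library is not installed.
    No link-time dependency on the library is created."""
    path = ctypes.util.find_library(short_name)
    if path is None:
        return None
    try:
        return ctypes.CDLL(path)  # dlopen on POSIX, LoadLibrary on Windows
    except OSError:
        return None

def avcodec_available() -> bool:
    """Policy from the text: probe for libavcodec, reject video if absent."""
    return try_dlopen("avcodec") is not None
```

The symbols actually needed (decoder lookup, frame decoding, etc.) would then be resolved individually with dlsym, so a build of libmtmd works identically on systems with and without libavcodec.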
Internal design
As mentioned in the API design section above, we will need to implement a "lazy" image tokenizer.
For models doing "fused" frames like Qwen (via 3D conv), we need to allow multiple images to be processed in the same batch. clip_image_f32 will need to be extended with an nz dimension.
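The extension can be illustrated with a simplified stand-in (the actual clip_image_f32 layout lives in the clip implementation; the class and field names below are illustrative assumptions, not the real struct):

```python
from dataclasses import dataclass

@dataclass
class ImageF32Stack:
    """Simplified stand-in for a clip_image_f32 extended with an nz
    (frame/depth) dimension: nz frames of nx*ny RGB float pixels
    stored in one contiguous buffer."""
    nx: int
    ny: int
    nz: int  # number of stacked frames (1 = plain image, >1 = video clip)
    buf: list = None

    def __post_init__(self):
        if self.buf is None:
            self.buf = [0.0] * (self.nx * self.ny * self.nz * 3)

    def idx(self, z: int, y: int, x: int, c: int) -> int:
        # frame-major layout: all pixels of frame z are contiguous, so a
        # 3D conv can read a temporal window of frames as one flat slab
        return ((z * self.ny + y) * self.nx + x) * 3 + c
```

With nz > 1, a batch of consecutive frames becomes a single tensor, which is what a 3D conv patch-embedding (as in Qwen's fused-frame scheme) consumes in one pass.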