[video processors] decode only sampled videos -> less RAM and faster processing (#39600)
* draft update two models for now
* batch update all VLMs first
* update some more image processors
* update
* fix a few tests
* just make CI green for now
* fix copies
* update once more
* update
* unskip the test
* fix these two
* fix torchcodec audio loading
* maybe
* yay, i fixed torchcodec installation and now can actually test it
* fix copies deepseek
* make sure the metadata is returned when users request it
* add docs
* update
* fixup
* Update src/transformers/audio_utils.py
Co-authored-by: Pavel Iakubovskii <[email protected]>
* Update src/transformers/models/glm4v/video_processing_glm4v.py
Co-authored-by: Pavel Iakubovskii <[email protected]>
* update
* what if we set some metadata attr to `None`
* fix CI
* fix one test
* fix 4 channel test
* fix glm timestamps
* rebase gone wrong
* raise warning once
* fixup
* typo
* fix copies
* fix smolvlm test
* this is why torch's official benchmark was faster, set threads to `0`
* Apply style fixes
---------
Co-authored-by: Pavel Iakubovskii <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
`docs/source/en/main_classes/image_processor.md` (+1 −2)
```diff
@@ -16,8 +16,7 @@ rendered properly in your Markdown viewer.
 
 # Image Processor
 
-An image processor is in charge of preparing input features for vision models and post processing their outputs. This includes transformations such as resizing, normalization, and conversion to Numpy and PyTorch tensors. It may also include model specific post-processing such as converting logits to segmentation masks.
-
+An image processor is in charge of loading images (optionally), preparing input features for vision models and post processing their outputs. This includes transformations such as resizing, normalization, and conversion to PyTorch and Numpy tensors. It may also include model specific post-processing such as converting logits to segmentation masks.
 Fast image processors are available for a few models and more will be added in the future. They are based on the [torchvision](https://pytorch.org/vision/stable/index.html) library and provide a significant speed-up, especially when processing on GPU.
 They have the same API as the base image processors and can be used as drop-in replacements.
 To use a fast image processor, you need to install the `torchvision` library, and set the `use_fast` argument to `True` when instantiating the image processor:
```
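For reference, a minimal sketch of the `use_fast` flag the excerpt above describes. The checkpoint name is only an illustration (any model with a torchvision-backed fast image processor works the same way), and the image URL is the COCO sample commonly used in the transformers docs:

```python
import requests
from PIL import Image
from transformers import AutoImageProcessor

# Illustrative checkpoint; requires `torchvision` for the fast processor.
processor = AutoImageProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf", use_fast=True)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Same API as the base image processor, used as a drop-in replacement.
inputs = processor(images=image, return_tensors="pt")
```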
`docs/source/en/main_classes/video_processor.md` (+42 −2)
```diff
@@ -14,10 +14,9 @@ rendered properly in your Markdown viewer.
 
 -->
 
-
 # Video Processor
 
-A **Video Processor** is a utility responsible for preparing input features for video models, as well as handling the post-processing of their outputs. It provides transformations such as resizing, normalization, and conversion into PyTorch.
+A **Video Processor** is a utility responsible for preparing input features for video models, as well as handling the post-processing of their outputs. It provides transformations such as resizing, normalization, and conversion into PyTorch. Along with transformations, the `VideoProcessor` class handles video decoding from local paths or URLs (requires [`torchcodec`](https://pypi.org/project/torchcodec/)) and frame sampling according to model-specific strategies.
 
 The video processor extends the functionality of image processors by allowing Vision Large Language Models (VLMs) to handle videos with a distinct set of arguments compared to images. It serves as the bridge between raw video data and the model, ensuring that input features are optimized for the VLM.
 
```
```diff
@@ ... @@
+The video processor can also sample video frames using the technique best suited for the given model. Sampling behavior is controlled with the `do_sample_frames` argument and can be configured through model-specific parameters such as `num_frames` or `fps` (the rate at which the video will be sampled). If the input video is given as a local path or URL (`str`), the processor will decode it automatically. To obtain metadata about the decoded video, such as sampled frame indices, original dimensions, duration, and fps, pass `return_metadata=True` to the processor.
+
+<Tip warning={false}>
+
+- Specifying `num_frames` does not guarantee the output will contain exactly that number of frames. Depending on the model, the sampler may enforce minimum or maximum frame limits.
+
+- The default decoder is [`torchcodec`](https://pypi.org/project/torchcodec/), which must be installed.
```
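A rough usage sketch of the sampling arguments this hunk documents. The checkpoint name and video path are placeholders, `torchcodec` is assumed to be installed, and the exact output keys vary by model:

```python
from transformers import AutoVideoProcessor

# Placeholder checkpoint; any VLM with a video processor works the same way.
processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")

# A local path or URL is decoded automatically, and with frame sampling
# enabled only the sampled frames are read into memory.
out = processor(
    videos="path/to/video.mp4",  # placeholder path
    do_sample_frames=True,
    num_frames=16,
    return_metadata=True,
    return_tensors="pt",
)

# Inspect the returned keys rather than assuming them: processed frames are
# commonly under "pixel_values_videos", and `return_metadata=True` adds the
# decoded video's metadata to the batch.
print(out.keys())
```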
````diff
@@ ... @@
+If you pass an already decoded video array but still want to enable model-specific frame sampling, it is strongly recommended to provide `video_metadata`. This allows the sampler to know the original video's duration and FPS. You can pass metadata as a `VideoMetadata` object or as a plain dict.
+
+```python
+from transformers import AutoVideoProcessor
+from transformers.video_utils import VideoMetadata
````
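The capture cuts off after the imports. For completeness, a hedged sketch of how the metadata might be constructed and passed for a pre-decoded array; the `VideoMetadata` field names and the `video_metadata` argument are my best reading of the API described above, not a verbatim continuation of the PR's example:

```python
import numpy as np
from transformers import AutoVideoProcessor
from transformers.video_utils import VideoMetadata

# Placeholder checkpoint, as above.
processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")

# A pre-decoded video as an array of frames: (num_frames, height, width, channels).
video = np.random.randint(0, 256, size=(250, 360, 640, 3), dtype=np.uint8)

# Without metadata the sampler cannot recover the source FPS or duration.
# Field names are assumed from `transformers.video_utils.VideoMetadata`;
# a plain dict with the same keys should also be accepted.
metadata = VideoMetadata(total_num_frames=250, fps=25, duration=10.0, video_backend="opencv")

out = processor(
    videos=video,
    video_metadata=metadata,
    do_sample_frames=True,
    num_frames=16,
    return_tensors="pt",
)
```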