This repository provides a brief introduction to and case study of the paper "Training-Free Semantic Video Composition via Pre-trained Diffusion Model" (ICME 2024, Oral).
The video composition task aims to integrate specified foregrounds and backgrounds from different videos into a harmonious composite. Current approaches, predominantly trained on videos with adjusted foreground color and lighting, struggle with semantic disparities that go beyond such superficial adjustments, e.g., domain gaps. We therefore propose a training-free pipeline that employs a pre-trained diffusion model imbued with semantic prior knowledge, enabling it to process composite videos with broader semantic disparities.
Using the pre-trained Stable Diffusion v2-1 as our backbone, we leverage its robust semantic understanding to build a training-free video compositing pipeline.
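The README does not spell out the pipeline's implementation, but the core idea, harmonizing a pasted-together composite with a frozen pre-trained diffusion model, can be sketched with an SDEdit-style image-to-image pass in Hugging Face `diffusers`. This is a minimal illustration, not the paper's actual method: the `harmonize_frame` helper, frame paths, prompt, and `strength` value are assumptions, and the sketch omits the temporal handling a real video pipeline needs. Only the `stabilityai/stable-diffusion-2-1` checkpoint id is the publicly known one.

```python
# Minimal sketch (not the paper's full pipeline): harmonize composite video
# frames with the frozen Stable Diffusion v2-1 backbone via an SDEdit-style
# image-to-image pass, relying on the model's semantic prior to blend
# foreground and background. Paths, prompt, and strength are illustrative.
import os

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",  # public SD v2-1 checkpoint
    torch_dtype=torch.float16,
).to("cuda")


def harmonize_frame(frame: Image.Image, prompt: str) -> Image.Image:
    """Partially noise the composite frame, then denoise it, so the frozen
    model's semantic prior reconciles foreground and background."""
    frame = frame.resize((768, 768))  # SD v2-1's native resolution
    return pipe(
        prompt=prompt,
        image=frame,
        strength=0.5,        # fraction of the diffusion trajectory to re-run
        guidance_scale=7.5,
    ).images[0]


# Hypothetical per-frame loop over a pre-assembled composite clip; the
# actual pipeline also enforces temporal consistency, omitted here.
os.makedirs("harmonized", exist_ok=True)
frames = [Image.open(f"composite/{i:04d}.png").convert("RGB") for i in range(16)]
for i, frame in enumerate(frames):
    harmonize_frame(frame, "a photo of a dog on a beach").save(
        f"harmonized/{i:04d}.png"
    )
```

Frame-by-frame img2img like this would flicker in practice; it is only meant to show why a pre-trained diffusion prior can harmonize composites without any task-specific training.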
| Input | TF-ICON | Ours |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |