Jiahua Dong*, Hui Yin*, Wenqi Liang, Hanbin Zhao, Henghui Ding, Nicu Sebe, Salman Khan, Fahad Shahbaz Khan (*equal contribution)
[arXiv]
Video instance segmentation (VIS) has gained significant attention for its capability in segmenting and tracking object instances across video frames. However, most existing VIS methods unrealistically assume that the categories of object instances remain fixed over time. Moreover, they experience catastrophic forgetting of old classes when required to continuously learn object instances belonging to new classes. To address the above challenges, we develop a novel Hierarchical Visual Prompt Learning (HVPL) model, which alleviates catastrophic forgetting of old classes from both frame-level and video-level perspectives. Specifically, to mitigate forgetting at the frame level, we devise a task-specific frame prompt and an orthogonal gradient correction (OGC) module. The OGC module helps the frame prompt encode task-specific global instance information for new classes in each individual frame by projecting its gradients onto the orthogonal feature space of old classes. Furthermore, to address forgetting at the video level, we design a task-specific video prompt and a video context decoder. This decoder first embeds structural inter-class relationships across frames into the frame prompt features, and then propagates task-specific global video contexts from the frame prompt features to the video prompt. The effectiveness of our HVPL model is demonstrated through extensive experiments, in which it outperforms baseline methods.
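For intuition, the frame-level OGC step can be sketched as standard gradient projection: remove the component of the prompt gradient that lies in the subspace spanned by old-class features, so updates for new classes interfere less with old ones. The snippet below is a minimal PyTorch sketch of that general technique, not the released HVPL code; the function names, the SVD-based basis construction, and the rank `k` are all assumptions.

```python
import torch

def build_old_basis(old_feats: torch.Tensor, k: int) -> torch.Tensor:
    """Orthonormal basis (d, k) for the old-class feature subspace,
    taken here as the top-k right singular vectors of stored features.

    old_feats: (m, d) matrix of features collected from old classes.
    """
    _, _, vh = torch.linalg.svd(old_feats, full_matrices=False)
    return vh[:k].T  # columns span the old-class feature subspace

def ogc_project(grad: torch.Tensor, old_basis: torch.Tensor) -> torch.Tensor:
    """Project a gradient onto the orthogonal complement of the
    old-class subspace.

    grad: (..., d) gradient of the frame prompt parameters.
    """
    inside = grad @ old_basis @ old_basis.T  # component inside the old-class subspace
    return grad - inside                     # keep only the orthogonal component

# Hypothetical use inside a training step, after loss.backward():
# frame_prompt.grad = ogc_project(frame_prompt.grad, old_basis)
```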
- Aug 21, 2025: 🎉🎉🎉 Code and pretrained weights are now available! Thanks for your patience :)
- Jun 26, 2025: 🎉🎉🎉 HVPL is accepted to ICCV 2025!
See installation instructions.
We provide a script, train_net_hvpl.py, which trains all the configs provided in HVPL.
To train a model with "train_net_hvpl.py" on VIS, first set up the corresponding datasets following Preparing Datasets for HVPL.
- Step t=0: Train the model on the base classes (you can skip this step if you use pre-trained weights).
- Step t≥1: Train the model on the novel classes with HVPL, using the scripts below.
| Scenario | Script |
|---|---|
| YouTubeVIS 2019 20-2 | `bash youtube_2019_20_2.sh` |
| YouTubeVIS 2019 20-5 | `bash youtube_2019_20_5.sh` |
| YouTubeVIS 2021 20-4 | `bash yvis_2021_20_4.sh` |
| YouTubeVIS 2021 30-10 | `bash yvis_2021_30_10.sh` |
| OVIS 15-5 | `bash OVIS_15_5.sh` |
| OVIS 15-10 | `bash OVIS_15_10.sh` |
Note that evaluation is performed during training, generating a corresponding result.json file as well as a .txt file that stores the evaluation metrics (AP, AP50, AP75, AR1).
To compute the forgetting (F) metrics (FAP, FAR1), please run dataset_eval.py.
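The authoritative computation is the one in dataset_eval.py. For reference, forgetting metrics in continual learning are commonly defined as the drop from a class's best earlier score to its final score; the minimal sketch below assumes that common definition, and its input format is hypothetical.

```python
def forgetting(score_history: dict[str, list[float]]) -> float:
    """Average forgetting over old classes (common definition; see
    dataset_eval.py for the authoritative FAP/FAR1 computation).

    score_history: class name -> per-step scores (e.g., AP or AR1),
                   one entry per step after the class was learned.
    """
    drops = []
    for scores in score_history.values():
        if len(scores) < 2:
            continue  # learned only at the final step: nothing to forget yet
        drops.append(max(scores[:-1]) - scores[-1])  # best earlier - final
    return sum(drops) / len(drops) if drops else 0.0

# e.g. forgetting({"person": [52.1, 48.3], "dog": [40.0, 37.5]}) -> 3.15
```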
If you use HVPL in your research or wish to refer to the baseline results published in the Model Zoo, please use the following BibTeX entry.
@InProceedings{Dong2025HVPL,
title={Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation},
author={Jiahua Dong and Hui Yin and Wenqi Liang and Hanbin Zhao and Henghui Ding and Nicu Sebe and Salman Khan and Fahad Shahbaz Khan},
year={2025},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month={October},
}

Our code is largely based on VITA, ECLIPSE, Detectron2, Mask2Former, and Deformable DETR. We are truly grateful for their excellent work.


