Jinlong Li · Cristiano Saltori · Fabio Poiesi · Nicu Sebe
This repository contains the official PyTorch implementation of the paper "CUA-O3D: Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding" (CVPR 2025). The paper is available on arXiv. The project page is online at CUA-O3D.
The lack of a large-scale 3D-text corpus has led recent works to distill open-vocabulary knowledge from vision-language models (VLMs). However, these methods typically rely on a single VLM to align the feature space of 3D models within a common language space, which limits the potential of 3D models to leverage the diverse spatial and semantic capabilities encapsulated in various foundation models. In this paper, we propose Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding, dubbed CUA-O3D, the first model to integrate multiple foundation models—such as CLIP, DINOv2, and Stable Diffusion—into 3D scene understanding. We further introduce a deterministic uncertainty estimation to adaptively distill and harmonize the heterogeneous 2D feature embeddings from these models. Our method addresses two key challenges: (1) incorporating semantic priors from VLMs alongside the geometric knowledge of spatially-aware vision foundation models, and (2) using a novel deterministic uncertainty estimation to capture model-specific uncertainties across diverse semantic and geometric sensitivities, helping to reconcile heterogeneous representations during training. Extensive experiments on ScanNetV2 and Matterport3D show that our method not only advances open-vocabulary segmentation but also achieves robust cross-domain alignment and competitive spatial perception, delivering state-of-the-art performance in tasks such as:
- Zero-shot 3D semantic segmentation
- Cross-modal zero-shot segmentation
- Linear probing segmentation
Visit the CUA-O3D website to explore more details about the project, methodology, and results.
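As a rough illustration of the uncertainty-aware agglomeration idea, the sketch below weights each teacher's distillation loss by a log-variance term, in the spirit of heteroscedastic loss weighting. This is a minimal numpy sketch under our own assumptions, not the paper's implementation, and all names are illustrative.

```python
import numpy as np

def uncertainty_weighted_distill_loss(residuals, log_vars):
    """Aggregate per-teacher distillation residuals with
    uncertainty-based weighting: exp(-s) * loss + s per teacher.

    residuals: list of arrays, one per teacher (student - teacher features)
    log_vars:  list of scalar log-variances, one per teacher
    """
    total = 0.0
    for r, s in zip(residuals, log_vars):
        # exp(-s) down-weights noisy teachers; the +s term
        # penalizes declaring everything uncertain
        total += np.exp(-s) * np.mean(r ** 2) + s
    return total
```

With log-variance 0 for every teacher this reduces to a plain sum of per-teacher losses; a larger log-variance down-weights a noisier teacher at the cost of the additive penalty.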
- 2D feature extraction release
- distillation training release
- linear probing training release
- data release - link
Requirements
- Python 3.x
- PyTorch 1.7.1
- CUDA 11.x or higher
The following installation assumes python=3.8, pytorch=1.7.1, and cuda=11.x.
Create a conda virtual environment
conda create -n CUA_O3D python=3.8
conda activate CUA_O3D
Clone the repository
git clone https://github.com/TyroneLi/CUA_O3D
Install the dependencies
pip install -r requirements.txt
Install PyTorch 1.7.1
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 -f https://download.pytorch.org/whl/torch_stable.html
Install MinkowskiEngine from source
conda install openblas-devel -c anaconda
git clone https://github.com/NVIDIA/MinkowskiEngine.git
cd MinkowskiEngine
python setup.py install --blas_include_dirs=${CONDA_PREFIX}/include --blas=openblas
-
Download the ScanNet v2 dataset.
Place the downloaded scans and scans_test folders as follows.
CUA_O3D
├── data
│ ├── scannet
│ │ ├── scans
│ │ ├── scans_test
Pre-process the ScanNet data from 2D multi-view images using the LSeg, DINOv2, and Stable Diffusion models.
cd 2D_feature_extraction/
(1) For LSeg feature extraction and projection
CUDA_VISIBLE_DEVICES=0 python 2D_feature_extraction/embedding_projection/fusion_scannet_lseg.py \
--data_dir <ScanNetV2_save_path> \
--output_dir <save_path_for_lseg_projection_embeddings> \
--save_aligned False \
--split train \
--process_id_range 0,1600
We recommend first extracting the 2D multi-view feature embeddings from the LSeg model without specifying a previously projected point mask index; for the subsequent models, specify the first model's projected mask index so that the point mask indices remain consistent across models during training. See here.
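To make the mask-index bookkeeping concrete, here is a hedged numpy sketch; the function name and the NaN-for-unseen convention are our own, not the repository's API. The first extraction computes which points are visible and returns that mask; later extractions reuse it so all three feature sets cover the same points in the same order.

```python
import numpy as np

def fuse_to_points(view_feats, vis_mask=None):
    """view_feats: (V, N, C) per-view point features, NaN where a
    point is not seen by that view. Returns fused (M, C) features
    and the (N,) visibility mask used to select points."""
    if vis_mask is None:
        # a point is kept if at least one view observed it
        vis_mask = np.any(~np.isnan(view_feats[..., 0]), axis=0)
    # average each kept point's features over the views that saw it
    fused = np.nanmean(view_feats[:, vis_mask], axis=0)
    return fused, vis_mask
```

Run the first model without `vis_mask`, save the returned mask, and pass it to the other two extractions so the point ordering matches at training time.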
(2) For DINOv2 feature extraction and projection
CUDA_VISIBLE_DEVICES=0 python 2D_feature_extraction/embedding_projection/fusion_scannet_dinov2.py \
--data_dir <ScanNetV2_save_path> \
--output_dir <save_path_for_DINOv2_projection_embeddings> \
--save_aligned False \
--split train \
--process_id_range 0,1600
(3) For Stable Diffusion (SD) feature extraction and projection
CUDA_VISIBLE_DEVICES=0 python 2D_feature_extraction/embedding_projection/fusion_scannet_sd.py \
--data_dir <ScanNetV2_save_path> \
--output_dir <save_path_for_SD_projection_embeddings> \
--save_aligned False \
--split train \
--process_id_range 0,1600
After that, set the corresponding 2D projection embedding paths in the config: data_root_2d_fused_feature, data_root_2d_fused_feature_dinov2, and data_root_2d_fused_feature_sd.
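For instance, in the ScanNet YAML config these keys would point at the output directories from the three extraction runs above (the angle-bracket paths are placeholders to replace with your own save locations):

```yaml
data_root_2d_fused_feature: <save_path_for_lseg_projection_embeddings>
data_root_2d_fused_feature_dinov2: <save_path_for_DINOv2_projection_embeddings>
data_root_2d_fused_feature_sd: <save_path_for_SD_projection_embeddings>
```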
Perform Distillation Training
bash run/distill_with_dinov2_sd_adaptiveWeightLoss_demean.sh \
training_testing_logs/CUA_O3D_LSeg_DINOv2_SD \
config_CUA_O3D/scannet/ours_lseg_ep50_lsegCosine_dinov2L1_SDCosine.yaml
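The config name (lsegCosine_dinov2L1_SDCosine) suggests a cosine distillation loss for the LSeg and SD targets and an L1 loss for DINOv2. Below is a hedged numpy sketch of such a combined objective, not the repository's actual training code; the function names and equal weighting are our own assumptions.

```python
import numpy as np

def cosine_loss(a, b):
    # 1 - cosine similarity, averaged over points
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    return np.mean(1.0 - num / den)

def l1_loss(a, b):
    return np.mean(np.abs(a - b))

def distill_loss(pred_lseg, tgt_lseg, pred_dino, tgt_dino, pred_sd, tgt_sd):
    # cosine for LSeg and SD targets, L1 for DINOv2, per the config name
    return (cosine_loss(pred_lseg, tgt_lseg)
            + l1_loss(pred_dino, tgt_dino)
            + cosine_loss(pred_sd, tgt_sd))
```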
(1) Perform 2D Fusion Evaluation
sh run/eval_with_dinov2_sd.sh \
training_testing_logs/CUA_O3D_LSeg_DINOv2_SD \
config_CUA_O3D/scannet/ours_lseg_ep50_lsegCosine_dinov2L1_SDCosine.yaml \
fusion
(2) Perform 2D Distillation Evaluation
sh run/eval_with_dinov2_sd.sh \
training_testing_logs/CUA_O3D_LSeg_DINOv2_SD \
config_CUA_O3D/scannet/ours_lseg_ep50_lsegCosine_dinov2L1_SDCosine.yaml \
distill
(3) Perform 2D Ensemble Evaluation
sh run/eval_with_dinov2_sd.sh \
training_testing_logs/CUA_O3D_LSeg_DINOv2_SD \
config_CUA_O3D/scannet/ours_lseg_ep50_lsegCosine_dinov2L1_SDCosine.yaml \
ensemble
Perform cross-domain ensemble evaluation on Matterport3D with 21, 40, 80, and 160 classes:
sh run/eval_with_dinov2_sd.sh \
training_testing_logs/CUA_O3D_LSeg_DINOv2_SD \
config_CUA_O3D/matterport/test_21classes.yaml \
ensemble
sh run/eval_with_dinov2_sd.sh \
training_testing_logs/CUA_O3D_LSeg_DINOv2_SD \
config_CUA_O3D/matterport/test_40classes.yaml \
ensemble
sh run/eval_with_dinov2_sd.sh \
training_testing_logs/CUA_O3D_LSeg_DINOv2_SD \
config_CUA_O3D/matterport/test_80classes.yaml \
ensemble
sh run/eval_with_dinov2_sd.sh \
training_testing_logs/CUA_O3D_LSeg_DINOv2_SD \
config_CUA_O3D/matterport/test_160classes.yaml \
ensemble
(1) Concatenate LSeg, DINOv2 and SD features to perform linear probing
sh run/distill_cat_prob_seg_all.sh \
config_CUA_O3D/scannet/ours_lseg_ep20_seg.yaml \
<best_model_saved_path_from_distillation_training>
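Conceptually, this concatenates the frozen features from the three distillation heads and trains only a linear classifier on top. A minimal numpy sketch follows, using ridge-regularized least squares on one-hot labels for simplicity; this is not the repository's probing code, which presumably trains a linear layer with gradient descent.

```python
import numpy as np

def linear_probe(feats_list, labels, num_classes):
    """Concatenate frozen per-head features and fit a linear
    classifier via ridge-regularized least squares on one-hot labels."""
    X = np.concatenate(feats_list, axis=1)   # (N, C1+C2+C3)
    Y = np.eye(num_classes)[labels]          # (N, K) one-hot targets
    # closed-form ridge solution: (X^T X + lambda I) W = X^T Y
    W = np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]), X.T @ Y)
    return W

def predict(W, feats_list):
    X = np.concatenate(feats_list, axis=1)
    return np.argmax(X @ W, axis=1)
```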
(2) LSeg head to perform linear probing
sh run/distill_sep_prob_seg_Lseg.sh \
config_CUA_O3D/scannet/ours_lseg_ep20_seg.yaml \
<best_model_saved_path_from_distillation_training>
(3) DINOv2 head to perform linear probing
sh run/distill_sep_prob_seg_DINOv2.sh \
config_CUA_O3D/scannet/ours_lseg_ep20_seg.yaml \
<best_model_saved_path_from_distillation_training>
(4) SD head to perform linear probing
sh run/distill_sep_prob_seg_SD.sh \
config_CUA_O3D/scannet/ours_lseg_ep20_seg.yaml \
<best_model_saved_path_from_distillation_training>
If you use our work in your research, please cite our publication:
@inproceedings{li2025cross,
title={Cross-modal and uncertainty-aware agglomeration for open-vocabulary 3d scene understanding},
author={Li, Jinlong and Saltori, Cristiano and Poiesi, Fabio and Sebe, Nicu},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={19390--19400},
year={2025}
}

We extend our gratitude to all contributors and supporters of the CUA-O3D project. Your valuable insights and contributions drive innovation and progress in the field of 3D and language-based AI systems.
This project is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.
For more information, visit the Creative Commons License page.