RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

🌐 Homepage | 📖 arXiv | 🛠️ Data Gen | 📂 Benchmark

✨ CVPR 2025 (Oral) ✨

Authors: Chan Hee Song¹, Valts Blukis², Jonathan Tremblay², Stephen Tyree², Yu Su¹, Stan Birchfield²

¹The Ohio State University, ²NVIDIA

Introduction

This repository provides evaluation tools for RoboSpatial-Home, a spatial reasoning benchmark designed for robotics, augmented reality (AR), and related applications.

If you are looking for the data generation code, please check out our repository here.

Evaluation Guidelines

We provide detailed instructions for evaluating your model on RoboSpatial-Home. You can either run a model through our interface or evaluate pre-generated results. See the Usage section below for both workflows.

Requirements

pip install numpy tqdm pyyaml

Download & Preprocess Dataset

You’ll need to download the dataset before running the evaluation. We provide a script to make this easy, especially for debugging or if you’re not using the Hugging Face datasets library.

python download_benchmark.py [OUTPUT_FOLDER_PATH]

If OUTPUT_FOLDER_PATH is provided, the dataset is downloaded and saved there. If not, it is saved in a folder named RoboSpatial-Home in the current directory.
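As an optional sanity check, you can verify the download from Python. The layout assumed here (top-level JSON annotation files plus an images/ folder) follows the data_dir comment in config.yaml below; adjust the path if you passed a different OUTPUT_FOLDER_PATH.

from pathlib import Path

# Point this at your download location (default: ./RoboSpatial-Home).
root = Path("RoboSpatial-Home")
annotations = sorted(root.glob("*.json"))
images_dir = root / "images"
images = sorted(images_dir.iterdir()) if images_dir.exists() else []
print(f"{len(annotations)} annotation file(s), {len(images)} image(s)")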

Set Dataset Path

Edit config.yaml to point to your local dataset directory and desired output folder:

# Dataset paths
datasets:
  robospatial_home:
    data_dir: "/path/to/robospatial-home"  # Root directory containing JSON files and images/ folder

# Output configuration
output:
  output_dir: "./results"  # Full path to where results will be stored
  # If not specified, a 'results' folder will be created in the current directory
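As a rough sketch of how these keys might be consumed (it mirrors the structure above, not the evaluator's exact code), the config can be read with PyYAML:

import yaml

# Load the evaluation config; key names mirror the YAML above.
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

data_dir = cfg["datasets"]["robospatial_home"]["data_dir"]
# Fall back to ./results when output_dir is not specified, per the comment above.
output_dir = (cfg.get("output") or {}).get("output_dir", "./results")
print(data_dir, output_dir)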


Usage

There are two ways to use this tool:

1. Run and evaluate a model

python main.py <MODEL_NAME> [MODEL_PATH] --config CONFIG_PATH [--dry-run]

2. Evaluate pre-generated results

python main.py --results RESULTS_FILE --config CONFIG_PATH [--dry-run]

Required arguments:

  • --config CONFIG_PATH: Path to YAML config file with dataset paths

For running a model:

  • MODEL_NAME: Name of the model to use (See Supported Models for valid options.)
  • MODEL_PATH (optional): Path to model weights if not using default

For evaluating pre-generated results:

  • --results RESULTS_FILE: Path to a JSON file containing pre-generated model responses

Optional arguments:

  • --dry-run: Only evaluate the first 3 examples from each JSON file

Example Commands:

# Run LLaVA-Next with default model weights
python main.py llava_next --config config.yaml

# Run RoboPoint with a custom model checkpoint
python main.py robopoint /path/to/my/model --config config.yaml

# Run SpatialVLM in dry-run mode (only 3 samples)
python main.py spatialvlm --config config.yaml --dry-run

# Evaluate pre-generated results from JSON
python main.py --results /path/to/results.json --config config.yaml

Pre-generated Results Format

The pre-generated results file should be a JSON file containing a list of QAs with the same structure as the RoboSpatial-Home dataset:

[
  {
    "img": "images/img_context_0.png", 
    "category": "context",
    "question": "In the image, there is a bowl. Pinpoint several points within the vacant space situated to the left of the bowl...",
    "answer": "[(0.383, 0.873), (0.390, 0.990), ...]"
  },
  ...
]

🚨 Important: The evaluation script matches each entry in your results file to the ground truth using a combination of the question and img fields. These two fields must exactly match the corresponding example in the dataset.
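For example, one way to guarantee that match is to build the results list directly from the dataset entries. This is only a sketch: run_model stands in for your own inference code, and the annotation filename is a placeholder.

import json

def run_model(image_path, question):
    # Placeholder: call your own model here and return its raw text answer.
    raise NotImplementedError

with open("robospatial_home.json") as f:  # placeholder annotation filename
    dataset = json.load(f)

results = []
for ex in dataset:
    results.append({
        "img": ex["img"],            # copied verbatim so the evaluator can match it
        "category": ex["category"],
        "question": ex["question"],  # copied verbatim
        "answer": run_model(ex["img"], ex["question"]),
    })

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)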

⚠️ Note on point formatting: The evaluation code attempts to handle common variations in point representations, including cases where model responses contain a mix of text and coordinates. If the answer field includes additional text, the code uses regular expressions to extract only the coordinate points. That said, for best results, format your predictions as clean, Python-style tuples—just like in the benchmark annotations.
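For illustration only (this is not the evaluator's exact code), a regular expression along these lines recovers (x, y) pairs from a free-form answer:

import re

# Matches "(0.383, 0.873)"-style coordinate pairs anywhere in an answer string.
POINT_RE = re.compile(r"\(\s*([-+]?\d*\.?\d+)\s*,\s*([-+]?\d*\.?\d+)\s*\)")

def extract_points(answer):
    return [(float(x), float(y)) for x, y in POINT_RE.findall(answer)]

print(extract_points("The free space is near [(0.383, 0.873), (0.390, 0.990)]."))
# -> [(0.383, 0.873), (0.39, 0.99)]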

Output

Results are saved in the following locations:

  • Prediction results: <output_dir>/<annotation_file_name>_<model_name>_results.json
  • Evaluation summary: <output_dir>/aggregate_robospatial_home_<model_name>.json

For pre-generated results evaluation, <model_name> is replaced with custom in the output filenames.

For dry runs, all output files are prefixed with dry_run_.
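For instance, evaluating llava_next with the default ./results directory would produce paths like the ones below; the annotation file stem here is a placeholder.

from pathlib import Path

# Illustrative reconstruction of the naming scheme described above.
output_dir = Path("./results")
annotation_stem = "robospatial_home"  # placeholder annotation file stem
model_name = "llava_next"
prefix = ""  # becomes "dry_run_" when --dry-run is used

print(output_dir / f"{prefix}{annotation_stem}_{model_name}_results.json")
print(output_dir / f"{prefix}aggregate_robospatial_home_{model_name}.json")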

Supported Models

🚨 Important: For all models, please create a separate Python environment following each model's setup instructions. During our evaluation, we switch between these environments when evaluating different models; the inference code for each model is containerized to allow for isolated evaluation.

LLaVA-Next

  • lmms-lab/llama3-llava-next-8b

SpatialVLM

  • remyxai/SpaceMantis

RoboPoint

  • wentao-yuan/robopoint-v1-vicuna-v1.5-13b

Qwen2-VL

  • Qwen/Qwen2-VL-7B-Instruct

Molmo

  • allenai/Molmo-7B-D-0924

GPT-4o

  • export OPENAI_API_KEY=<Your API Key>

Contact

Citation

BibTeX:

@inproceedings{song2025robospatial,
  author    = {Song, Chan Hee and Blukis, Valts and Tremblay, Jonathan and Tyree, Stephen and Su, Yu and Birchfield, Stan},
  title     = {{RoboSpatial}: Teaching Spatial Understanding to {2D} and {3D} Vision-Language Models for Robotics},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025},
  note      = {To appear},
}
