```bash
conda create -n openspatial python=3.10 -y
conda activate openspatial
cd OpenSpatial

# Core dependencies
pip install -r requirements.txt

# PyTorch (CUDA 12.6)
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
    --index-url https://download.pytorch.org/whl/cu126

# spaCy model
python -m spacy download en_core_web_sm

# Foundation models
pip install 'git+https://github.com/facebookresearch/sam2.git' --quiet
pip install -q -U flash-attn --no-build-isolation
```
### 1.3 Verify Installation
```bash
python -c "import torch; print(f'PyTorch {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
python -m pytest tests/ -x -q
```

**World coordinate system:** The scene mesh must be gravity-aligned — the ground plane is parallel to the XY plane, and the Z-axis points upward.
```
      Z (up)
      |
      |
      |_______ Y
     /
    /
   X

Ground plane = XY
```
**Camera coordinate system (OpenCV convention):** X-axis points right, Y-axis points down, Z-axis points forward (into the scene along the optical axis). The depth value of a pixel equals its Z coordinate in the camera frame.
```
        Z (forward / optical axis)
       /
      /
     /_______ X (right)
     |
     |
     Y (down)
```
Image plane is perpendicular to Z.
Pixel (u, v) back-projects to:
```
X_cam = (u - cx) * depth / fx
Y_cam = (v - cy) * depth / fy
Z_cam = depth
```
The `pose` field stores the 4x4 camera-to-world extrinsic matrix. To convert from world to camera frame: `P_cam = inv(pose) @ P_world`.
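A minimal NumPy sketch of these two conversions is shown below. The helper names are illustrative, not part of the OpenSpatial API; `fx`, `fy`, `cx`, `cy` follow the intrinsic convention above (under the standard pinhole layout of a 4x4 intrinsic matrix they would typically be `K[0][0]`, `K[1][1]`, `K[0][2]`, `K[1][2]`).

```python
import numpy as np

def backproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into the camera frame."""
    x_cam = (u - cx) * depth / fx
    y_cam = (v - cy) * depth / fy
    return np.array([x_cam, y_cam, depth])

def world_to_camera(p_world, pose):
    """Map a world-frame point into the camera frame.

    `pose` is the 4x4 camera-to-world extrinsic, so its inverse takes
    world coordinates back into the camera frame.
    """
    p_h = np.append(p_world, 1.0)           # homogeneous coordinates
    return (np.linalg.inv(pose) @ p_h)[:3]

def camera_to_world(p_cam, pose):
    """Map a camera-frame point into the world frame."""
    p_h = np.append(p_cam, 1.0)
    return (pose @ p_h)[:3]
```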
**3D Oriented Bounding Box (OBB):** Each object's 3D bounding box is represented as a 9-element vector in world coordinates:

```
[cx, cy, cz, xl, yl, zl, roll, pitch, yaw]
```

| Element | Description |
|---|---|
| `cx, cy, cz` | Center position in the world frame |
| `xl, yl, zl` | Extents (full lengths) along the box's local X / Y / Z axes; `zl` corresponds to the object's height |
| `roll, pitch, yaw` | Euler angles in `zxy` intrinsic order (radians), encoding the rotation from the world frame to the box's local frame |
The pipeline converts world-frame OBBs to camera frame via `inv(pose) @ T_world`. The resulting camera-frame OBB uses the same 9-element representation.
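As an illustration, the sketch below turns a 9-element OBB into its 8 corner points with NumPy and SciPy. The angle-to-axis mapping used here (roll about X, pitch about Y, yaw about Z, composed in intrinsic z-x-y order) is an assumption made for this example; consult the pipeline code for the authoritative convention.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def obb_rotation(obb):
    """Build the 3x3 rotation matrix encoded by an OBB's roll/pitch/yaw.

    Assumption for illustration: roll/pitch/yaw rotate about X/Y/Z and are
    composed in intrinsic z-x-y order ("ZXY" in SciPy terms).
    """
    roll, pitch, yaw = obb[6:9]
    return Rotation.from_euler("ZXY", [yaw, roll, pitch]).as_matrix()

def obb_corners(obb):
    """Return the 8 corner points of the box in the world frame."""
    center, extents = np.asarray(obb[:3]), np.asarray(obb[3:6])
    # Treating R as the world-to-local rotation, its rows are the box's local
    # axes expressed in world coordinates, so a local offset maps to world
    # coordinates as offset @ R.
    R = obb_rotation(obb)
    signs = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    return (signs * extents / 2.0) @ R + center
```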
OpenSpatial uses Parquet files as the unified data format. Each row represents a scene sample.
| Field | Singleview | Multiview | Source | Description |
|---|---|---|---|---|
| `image` | `str` | `list[str]` | Input | RGB image path |
| `depth_map` | `str` | `list[str]` | Input | Depth map path |
| `pose` | `str` | `list[str]` | Input | Camera extrinsic (4x4 txt) |
| `intrinsic` | `str` | `list[str]` | Input | Camera intrinsic (4x4 txt) |
| `obj_tags` | `list[str]` | `list[list[str]]` | Input | Object tags (per-view in multiview) |
| `bboxes_3d_world_coords` | `list[list[float]]` | `list[list[list[float]]]` | Input | 3D OBB `[cx,cy,cz,xl,yl,zl,roll,pitch,yaw]` |
| `depth_scale` | `int` | `int` | Input | Depth scale factor (e.g. 1000) |
| `is_metric_depth` | `bool` | `bool` | Input | Whether depth is metric |
| `masks` | `list[str]` | `list[list[str]]` | Pipeline | Mask PNG paths per object |
| `bboxes_2d` | `list[list[int]]` | `list[list[list[int]]]` | Pipeline | 2D bbox `[x1,y1,x2,y2]` per object |
| `pointclouds` | `list[str]` | `list[list[str]]` | Pipeline | Point cloud `.pcd` paths per object |
Note: Not all fields are required at the outset. Input fields come from data preprocessing (Section 2.3); Pipeline fields (`masks`, `bboxes_2d`, `pointclouds`) are intermediate results generated by pipeline stages such as `Localizer`, `Sam2Refiner`, and `DepthBackProjecter`. Different annotation tasks depend on different subsets of these fields.
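Since every sample is simply a Parquet row, the fields above can be inspected directly with pandas. The sketch below uses a placeholder path; the columns you actually see depend on how far through the pipeline the file has progressed.

```python
import pandas as pd

# Placeholder path; point this at any Parquet file produced by preprocessing
# or by an intermediate pipeline stage.
df = pd.read_parquet("/path/to/input.parquet")

print(df.columns.tolist())                  # available fields for this file
row = df.iloc[0]                            # one scene sample
print(row["image"])                         # str (singleview) or list[str] (multiview)
print(len(row["bboxes_3d_world_coords"]))   # number of annotated objects
```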
OpenSpatial supports two types of input data:
For public datasets with existing 3D annotations (e.g. ScanNet++, Hypersim, EmbodiedScan), we provide preprocessing scripts to convert them into the standard Parquet format.
**ScanNet++:**

```bash
python data_preprocessing/scannetpp/prepare_scannetpp.py \
    --input_root /path/to/scannetpp/data \
    --output_dir /path/to/output/parquet \
    --selected_tags_file data_preprocessing/scannetpp/scannet-labels.combined.tsv \
    --chunk_size 100 \
    --max_workers 32
```

**Hypersim:**
```bash
python data_preprocessing/hypersim/prepare_hypersim.py \
    --input_root /path/to/Hypersim \
    --output_dir /path/to/output/parquet \
    --camera_params_csv data_preprocessing/hypersim/metadata_camera_parameters.csv \
    --labels_tsv data_preprocessing/scannetpp/scannet-labels.combined.tsv \
    --chunk_size 1000 \
    --max_workers 32
```

**EmbodiedScan (ScanNet, 3RScan, Matterport3D, ARKitScenes):**
```bash
# Install the preprocessing package first
cd data_preprocessing/embodiedscan && pip install -e . && cd ../..

# Extract per-image data
python -m embodiedscan_data extract \
    --dataset all \
    --data-root /path/to/EmbodiedScan/data \
    --output /path/to/output \
    --workers 24

# Merge into per-scene records
python -m embodiedscan_data merge --input /path/to/output

# Export to Parquet
python -m embodiedscan_data export --input /path/to/output --format both
```

See `data_preprocessing/embodiedscan/README.md` for detailed data directory structure and per-dataset notes.
Coming soon (expected May 2025).
```bash
python run.py --config <config.yaml> --output_dir <output_directory>
```

Config files are in `config/` and define the dataset source, pipeline executor, and processing stages. Below is a full end-to-end pipeline config with annotations:
```yaml
# ── Dataset ──────────────────────────────────────────────────
dataset:
  modality: image                    # input modality: "image"
  dataset_name: image_base           # dataset loader class (dataset/image_base.py)
  data_dir: /path/to/input.parquet   # path to Parquet file(s), str or list[str]

# ── Pipeline ─────────────────────────────────────────────────
pipeline:
  file_name: base_pipeline           # pipeline module (pipeline/base_pipeline.py)
  class_name: BasePipeline           # pipeline class

  stages:
    # ── Stage 1: Filter ────────────────────────────────────
    # Validate 3D bounding boxes via 2D projection & point cloud
    filter_stage:
      -
        file_name: 3dbox_filter
        method: ThreeDBoxFilter
        filter_tags: ["ceiling", "floor", "wall", "object"]
        output_dir:

    # ── Stage 2: Localization ──────────────────────────────
    # Refine masks with SAM2 box prompts
    localization_stage:
      -
        file_name: sam2_refiner
        method: Sam2Refiner
        update_keys: ["obj_tags", "bboxes_3d_world_coords"]
        depends_on: filter_stage/3dbox_filter
        output_dir:

    # ── Stage 3: Scene Fusion ──────────────────────────────
    # Back-project depth to per-object 3D point clouds
    scene_fusion:
      -
        file_name: depth_back_projection
        method: DepthBackProjecter
        depends_on: localization_stage/sam2_refiner
        output_dir:

    # ── Stage 4: Annotation ────────────────────────────────
    # Generate spatial QA pairs
    annotation_stage:
      -
        file_name: distance          # task module name under task/<stage>/
        method: AnnotationGenerator  # task class name
        depends_on: scene_fusion/depth_back_projection

        # Common parameters (BaseTask)
        use_multi_processing: false  # enable multi-process execution
        num_workers: 8               # number of parallel workers

        # Annotation parameters (BaseAnnotationTask)
        scaling_factor: 1            # coordinate scaling factor
        filter_tags: ["ceiling", "floor"]  # tags to exclude from QA generation
        sub_tasks:                   # sub-task names → count per sample
          absolute_distance: 1
          relative_distance: 1

        # Multiview parameters (BaseMultiviewTask, multiview tasks only)
        # max_num_views: 400         # max views to consider per scene
        # min_rot_angle: 15.0        # min rotation angle (deg) between views
        # min_translation: 0.0       # min camera center distance between views

        # Task-specific parameters vary by task (see demo configs for full list)
        output_dir:
```

Each stage is a list of tasks (the `-` items). A stage can contain multiple tasks that run sequentially. Stages execute in dependency order defined by `depends_on`. For annotation-only runs, a single `annotation_stage` is sufficient (see demo configs).
Note: This list is continuously updated. Check the `config/` directory for the latest available tasks.
**Annotation (singleview):**

| Config | Task | Description |
|---|---|---|
| `config/annotation/demo_distance.yaml` | Distance | Absolute & relative distance QA |
| `config/annotation/demo_depth.yaml` | Depth | Depth ordering & comparison QA |
| `config/annotation/demo_size.yaml` | Size | Absolute & relative size QA |
| `config/annotation/demo_position.yaml` | Position | Height comparison & proximity QA |
| `config/annotation/demo_counting.yaml` | Counting | Object counting QA |
| `config/annotation/demo_3d_grounding.yaml` | 3D Grounding | 3D bounding box grounding QA |
**Annotation (multiview):**

| Config | Task | Description |
|---|---|---|
| `config/annotation/demo_multiview_distance.yaml` | Multiview Distance | Cross-view distance QA |
| `config/annotation/demo_multiview_size.yaml` | Multiview Size | Cross-view size QA |
| `config/annotation/demo_multiview_correspondence.yaml` | Correspondence | Cross-view object matching QA |
| `config/annotation/demo_multiview_distance_obj_cam.yaml` | Object-Camera Distance | Object to camera distance QA |
| `config/annotation/demo_multiview_object_position.yaml` | Object Position | Cross-view position QA |
**Processing stages (used in full pipeline):**

| Stage | Task | Description |
|---|---|---|
| `filter_stage` | `ThreeDBoxFilter` | Validate 3D boxes via 2D projection & point cloud |
| `localization_stage` | `Sam2Refiner` | Refine masks with SAM2 box prompts |
| `scene_fusion` | `DepthBackProjecter` | Back-project depth to per-object point clouds |
Input data falls into two modes depending on the Parquet structure. The preprocessing pipeline differs slightly between them:
**Singleview input** (per-image Parquet, e.g. Hypersim, EmbodiedScan `per_image`):

Each row is one image. The pipeline runs filter → localize → scene_fusion directly, then groups into multiview if needed.

```
filter_stage ──> localization_stage ──> scene_fusion_stage ──> group_stage
(3dbox_filter)   (sam2_refiner)         (depth_back_projection) (SampleGrouper)
```

```bash
python run.py --config config/preprocessing/demo_preprocessing_hypersim.yaml --output_dir output/
```

**Multiview input** (per-scene Parquet, e.g. ScanNet++, EmbodiedScan `per_scene`):
Each row is a scene with list-valued fields. The pipeline first flattens to per-image, then follows the same stages. Datasets like ScanNet++ fall into this mode because their 3D box annotations are provided at the scene level (shared across all views), rather than per-image.
```
flatten_stage ──> filter_stage ──> localization_stage ──> scene_fusion_stage ──> group_stage
(SampleFlattener) (3dbox_filter)   (sam2_refiner)         (depth_back_projection) (SampleGrouper)
```

```bash
python run.py --config config/preprocessing/demo_preprocessing_scannetpp.yaml --output_dir output/
```

Note: The output of `scene_fusion_stage` is per-image Parquet, used as input for singleview annotation tasks. The output of `group_stage` aggregates per-image records back into per-scene Parquet, used as input for multiview annotation tasks.
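Conceptually, the grouping step turns per-image rows that belong to the same scene into a single per-scene row with list-valued fields. The pandas sketch below only illustrates that transformation; the scene key name (`scene_id`), the file paths, and the actual `SampleGrouper` implementation details are assumptions.

```python
import pandas as pd

# Illustrative only: the scene key ("scene_id") and paths are hypothetical.
per_image = pd.read_parquet("/path/to/per_image.parquet")

list_fields = ["image", "depth_map", "pose", "intrinsic", "obj_tags",
               "bboxes_3d_world_coords", "masks", "bboxes_2d", "pointclouds"]

# Aggregate each per-image field into a list per scene, yielding the
# per-scene layout expected by multiview annotation tasks.
per_scene = per_image.groupby("scene_id")[list_fields].agg(list).reset_index()
per_scene.to_parquet("/path/to/per_scene.parquet")
```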
```bash
# Singleview annotations
for cfg in demo_distance demo_depth demo_size demo_position demo_counting demo_3d_grounding; do
    python run.py --config config/annotation/${cfg}.yaml --output_dir output/demo
done

# Multiview annotations
for cfg in demo_multiview_distance demo_multiview_size demo_multiview_correspondence \
           demo_multiview_distance_obj_cam demo_multiview_object_position; do
    python run.py --config config/annotation/${cfg}.yaml --output_dir output/demo
done
```

Launch the built-in visualization server to browse annotation results:

```bash
python visualize_server.py --data_dir output/demo --port 8888
```

Then open `http://<host>:8888` in a browser. Features:
- Dropdown to switch between task outputs
- QA pairs with rendered images
- Lightbox for full-size image viewing
- Keyboard navigation (left/right arrows)
For extending OpenSpatial with new annotation tasks, pipeline stages, prompt templates, or dataset preprocessors, see the Development Guide.
It covers:
- Class hierarchy (`BaseTask` → `BaseAnnotationTask` → `BaseMultiviewAnnotationTask`)
- Step-by-step: adding singleview / multiview annotation tasks
- Prompt template system and placeholder conventions
- SceneGraph data model
- Adding new pipeline stages and dataset preprocessors
- Testing and common patterns