OpenSpatial Quick Start

1. Environment Setup

1.1 Create Conda Environment

conda create -n openspatial python=3.10 -y
conda activate openspatial

1.2 Install Dependencies

cd OpenSpatial

# Core dependencies
pip install -r requirements.txt

# PyTorch (CUDA 12.6)
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
    --index-url https://download.pytorch.org/whl/cu126

# spaCy model
python -m spacy download en_core_web_sm

# Foundation models
pip install 'git+https://github.com/facebookresearch/sam2.git' --quiet
pip install -q -U flash-attn --no-build-isolation

1.3 Verify Installation

python -c "import torch; print(f'PyTorch {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
python -m pytest tests/ -x -q

2. Data Preparation

2.1 Coordinate System & 3D Box Convention

World coordinate system: The scene mesh must be gravity-aligned — the ground plane is parallel to the XY plane, and the Z-axis points upward.

        Z (up)
        |
        |
        |_______ Y
       /
      /
     X

  Ground plane = XY

Camera coordinate system (OpenCV convention): X-axis points right, Y-axis points down, Z-axis points forward (into the scene along the optical axis). The depth value of a pixel equals its Z coordinate in camera frame.

              Z (forward / optical axis)
             /
            /
           /_______ X (right)
           |
           |
           Y (down)

  Image plane is perpendicular to Z.
  Pixel (u, v) back-projects to:
    X_cam = (u - cx) * depth / fx
    Y_cam = (v - cy) * depth / fy
    Z_cam = depth

The pose field stores the 4x4 camera-to-world extrinsic matrix. To convert from world to camera frame: P_cam = inv(pose) @ P_world.
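
To make the two conventions above concrete, here is a minimal sketch that back-projects a single pixel into the camera frame and then into the world frame using the camera-to-world pose. It assumes the intrinsic and pose matrices are loaded as NumPy arrays; the helper name and the depth_scale handling are illustrative, not part of the OpenSpatial API.

import numpy as np

# Illustrative helper (not part of the OpenSpatial API): back-project one pixel
# (u, v) with raw depth value d_raw into camera and world coordinates.
def back_project_pixel(u, v, d_raw, intrinsic, pose, depth_scale=1000):
    # Recover metric depth, assuming the depth map stores scaled integer values
    # (see depth_scale / is_metric_depth in Section 2.2).
    depth = d_raw / depth_scale

    fx, fy = intrinsic[0, 0], intrinsic[1, 1]
    cx, cy = intrinsic[0, 2], intrinsic[1, 2]

    # OpenCV camera frame: X right, Y down, Z forward.
    p_cam = np.array([(u - cx) * depth / fx,
                      (v - cy) * depth / fy,
                      depth,
                      1.0])

    # pose is camera-to-world, so it maps camera-frame points to world frame;
    # inv(pose) maps the other way (world -> camera).
    p_world = pose @ p_cam
    return p_cam[:3], p_world[:3]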

3D Oriented Bounding Box (OBB): Each object's 3D bounding box is represented as a 9-element vector in world coordinates:

[cx, cy, cz, xl, yl, zl, roll, pitch, yaw]
| Element | Description |
| --- | --- |
| cx, cy, cz | Center position in world frame |
| xl, yl, zl | Extents (full lengths) along the box's local X / Y / Z axes; zl corresponds to the object's height |
| roll, pitch, yaw | Euler angles in zxy intrinsic order (radians), encoding the rotation from the world frame to the box's local frame |

The pipeline converts world-frame OBBs to camera frame via inv(pose) @ T_world. The resulting camera-frame OBB uses the same 9-element representation.
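
The sketch below illustrates this conversion with scipy.spatial.transform.Rotation, where the "zxy" intrinsic order maps to SciPy's upper-case "ZXY" sequence. It is a hedged illustration, not the pipeline's actual implementation: the exact pairing of the stored roll/pitch/yaw values to rotation axes should be confirmed against the pipeline code.

import numpy as np
from scipy.spatial.transform import Rotation as R

def obb_world_to_cam(obb9, pose):
    """Illustrative conversion of a world-frame 9-DoF OBB to camera frame.

    obb9: [cx, cy, cz, xl, yl, zl, roll, pitch, yaw] (angles in radians)
    pose: 4x4 camera-to-world extrinsic matrix
    """
    obb9 = np.asarray(obb9, dtype=float)
    center, extents, angles = obb9[:3], obb9[3:6], obb9[6:9]

    # Assumption: the three stored angles are supplied in the order expected by
    # SciPy's intrinsic "ZXY" sequence; verify against the pipeline code.
    R_world_box = R.from_euler("ZXY", angles).as_matrix()

    # Box-to-world transform, then move it into the camera frame via inv(pose).
    T_world_box = np.eye(4)
    T_world_box[:3, :3] = R_world_box
    T_world_box[:3, 3] = center

    T_cam_box = np.linalg.inv(pose) @ T_world_box
    center_cam = T_cam_box[:3, 3]
    angles_cam = R.from_matrix(T_cam_box[:3, :3]).as_euler("ZXY")

    # Extents are unchanged; only center and orientation are re-expressed.
    return np.concatenate([center_cam, extents, angles_cam])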

2.2 Data Format

OpenSpatial uses Parquet files as the unified data format. Each row represents a scene sample.

| Field | Singleview | Multiview | Source | Description |
| --- | --- | --- | --- | --- |
| image | str | list[str] | Input | RGB image path |
| depth_map | str | list[str] | Input | Depth map path |
| pose | str | list[str] | Input | Camera extrinsic (4x4 txt) |
| intrinsic | str | list[str] | Input | Camera intrinsic (4x4 txt) |
| obj_tags | list[str] | list[list[str]] | Input | Object tags (per-view in multiview) |
| bboxes_3d_world_coords | list[list[float]] | list[list[list[float]]] | Input | 3D OBB [cx,cy,cz,xl,yl,zl,roll,pitch,yaw] |
| depth_scale | int | int | Input | Depth scale factor (e.g. 1000) |
| is_metric_depth | bool | bool | Input | Whether depth is metric |
| masks | list[str] | list[list[str]] | Pipeline | Mask PNG paths per object |
| bboxes_2d | list[list[int]] | list[list[list[int]]] | Pipeline | 2D bbox [x1,y1,x2,y2] per object |
| pointclouds | list[str] | list[list[str]] | Pipeline | Point cloud .pcd paths per object |

Note: Not all fields are required at the outset. Input fields come from data preprocessing (Section 2.3); Pipeline fields (masks, bboxes_2d, pointclouds) are intermediate results generated by pipeline stages such as Localizer, Sam2Refiner, and DepthBackProjecter. Different annotation tasks depend on different subsets of these fields.
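
To sanity-check a prepared Parquet file before running the pipeline, a short inspection along these lines can help (a minimal sketch using pandas; the column names follow the table above, and the file path is a placeholder):

import pandas as pd

# Load a prepared singleview Parquet file and inspect one row.
df = pd.read_parquet("/path/to/input.parquet")

print(df.columns.tolist())
row = df.iloc[0]
print("image:", row["image"])
print("num objects:", len(row["obj_tags"]))
print("first 3D box:", row["bboxes_3d_world_coords"][0])   # 9 floats
print("metric depth:", row["is_metric_depth"], "scale:", row["depth_scale"])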

2.3 Data Preprocessing

OpenSpatial supports two types of input data:

Public 3D-Annotated Datasets

For public datasets with existing 3D annotations (e.g. ScanNet++, Hypersim, EmbodiedScan), we provide preprocessing scripts to convert them into the standard Parquet format.

ScanNet++:

python data_preprocessing/scannetpp/prepare_scannetpp.py \
    --input_root /path/to/scannetpp/data \
    --output_dir /path/to/output/parquet \
    --selected_tags_file data_preprocessing/scannetpp/scannet-labels.combined.tsv \
    --chunk_size 100 \
    --max_workers 32

Hypersim:

python data_preprocessing/hypersim/prepare_hypersim.py \
    --input_root /path/to/Hypersim \
    --output_dir /path/to/output/parquet \
    --camera_params_csv data_preprocessing/hypersim/metadata_camera_parameters.csv \
    --labels_tsv data_preprocessing/scannetpp/scannet-labels.combined.tsv \
    --chunk_size 1000 \
    --max_workers 32

EmbodiedScan (ScanNet, 3RScan, Matterport3D, ARKitScenes):

# Install the preprocessing package first
cd data_preprocessing/embodiedscan && pip install -e . && cd ../..

# Extract per-image data
python -m embodiedscan_data extract \
    --dataset all \
    --data-root /path/to/EmbodiedScan/data \
    --output /path/to/output \
    --workers 24

# Merge into per-scene records
python -m embodiedscan_data merge --input /path/to/output

# Export to Parquet
python -m embodiedscan_data export --input /path/to/output --format both

See data_preprocessing/embodiedscan/README.md for detailed data directory structure and per-dataset notes.

3D Lifting for Unannotated Web Data

Coming soon (expected May 2025).


3. Running the Pipeline

3.1 Basic Usage

python run.py --config <config.yaml> --output_dir <output_directory>

3.2 Config File Structure

Config files are in config/ and define the dataset source, pipeline executor, and processing stages. Below is a full end-to-end pipeline config with annotations:

# ── Dataset ──────────────────────────────────────────────────
dataset:
  modality: image                          # input modality: "image"
  dataset_name: image_base                 # dataset loader class (dataset/image_base.py)
  data_dir: /path/to/input.parquet         # path to Parquet file(s), str or list[str]

# ── Pipeline ─────────────────────────────────────────────────
pipeline:
  file_name: base_pipeline                 # pipeline module (pipeline/base_pipeline.py)
  class_name: BasePipeline                 # pipeline class

  stages:
    # ── Stage 1: Filter ────────────────────────────────────
    # Validate 3D bounding boxes via 2D projection & point cloud
    filter_stage:
      -
        file_name: 3dbox_filter
        method: ThreeDBoxFilter
        filter_tags: ["ceiling", "floor", "wall", "object"]
        output_dir:

    # ── Stage 2: Localization ──────────────────────────────
    # Refine masks with SAM2 box prompts
    localization_stage:
      -
        file_name: sam2_refiner
        method: Sam2Refiner
        update_keys: ["obj_tags", "bboxes_3d_world_coords"]
        depends_on: filter_stage/3dbox_filter
        output_dir:

    # ── Stage 3: Scene Fusion ──────────────────────────────
    # Back-project depth to per-object 3D point clouds
    scene_fusion:
      -
        file_name: depth_back_projection
        method: DepthBackProjecter
        depends_on: localization_stage/sam2_refiner
        output_dir:

    # ── Stage 4: Annotation ────────────────────────────────
    # Generate spatial QA pairs
    annotation_stage:
      -
        file_name: distance                # task module name under task/<stage>/
        method: AnnotationGenerator        # task class name
        depends_on: scene_fusion/depth_back_projection

        # Common parameters (BaseTask)
        use_multi_processing: false        # enable multi-process execution
        num_workers: 8                     # number of parallel workers

        # Annotation parameters (BaseAnnotationTask)
        scaling_factor: 1                  # coordinate scaling factor
        filter_tags: ["ceiling", "floor"]  # tags to exclude from QA generation
        sub_tasks:                         # sub-task names → count per sample
          absolute_distance: 1
          relative_distance: 1

        # Multiview parameters (BaseMultiviewTask, multiview tasks only)
        # max_num_views: 400               # max views to consider per scene
        # min_rot_angle: 15.0              # min rotation angle (deg) between views
        # min_translation: 0.0             # min camera center distance between views

        # Task-specific parameters vary by task (see demo configs for full list)

        output_dir:

Each stage is a list of tasks (the - items). A stage can contain multiple tasks that run sequentially. Stages execute in dependency order defined by depends_on. For annotation-only runs, a single annotation_stage is sufficient (see demo configs).
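
For example, an annotation-only run can use a config as small as the sketch below (field values are illustrative; see the demo configs in config/annotation/ for the authoritative versions):

dataset:
  modality: image
  dataset_name: image_base
  data_dir: /path/to/preprocessed.parquet

pipeline:
  file_name: base_pipeline
  class_name: BasePipeline
  stages:
    annotation_stage:
      -
        file_name: distance
        method: AnnotationGenerator
        scaling_factor: 1
        filter_tags: ["ceiling", "floor"]
        sub_tasks:
          absolute_distance: 1
          relative_distance: 1
        output_dir: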

3.3 Available Tasks

Note: This list is continuously updated. Check the config/ directory for the latest available tasks.

Annotation (singleview):

| Config | Task | Description |
| --- | --- | --- |
| config/annotation/demo_distance.yaml | Distance | Absolute & relative distance QA |
| config/annotation/demo_depth.yaml | Depth | Depth ordering & comparison QA |
| config/annotation/demo_size.yaml | Size | Absolute & relative size QA |
| config/annotation/demo_position.yaml | Position | Height comparison & proximity QA |
| config/annotation/demo_counting.yaml | Counting | Object counting QA |
| config/annotation/demo_3d_grounding.yaml | 3D Grounding | 3D bounding box grounding QA |

Annotation (multiview):

| Config | Task | Description |
| --- | --- | --- |
| config/annotation/demo_multiview_distance.yaml | Multiview Distance | Cross-view distance QA |
| config/annotation/demo_multiview_size.yaml | Multiview Size | Cross-view size QA |
| config/annotation/demo_multiview_correspondence.yaml | Correspondence | Cross-view object matching QA |
| config/annotation/demo_multiview_distance_obj_cam.yaml | Object-Camera Distance | Object to camera distance QA |
| config/annotation/demo_multiview_object_position.yaml | Object Position | Cross-view position QA |

Processing stages (used in full pipeline):

| Stage | Task | Description |
| --- | --- | --- |
| filter_stage | ThreeDBoxFilter | Validate 3D boxes via 2D projection & point cloud |
| localization_stage | Sam2Refiner | Refine masks with SAM2 box prompts |
| scene_fusion | DepthBackProjecter | Back-project depth to per-object point clouds |

3.4 Annotation Pipeline by Data Mode

Input data falls into two modes depending on the Parquet structure. The preprocessing pipeline differs slightly between them:

Singleview input (per-image Parquet, e.g. Hypersim, EmbodiedScan per_image):

Each row is one image. The pipeline runs filter → localize → scene_fusion directly, then groups into multiview if needed.

filter_stage ──> localization_stage ──> scene_fusion_stage ──> group_stage
(3dbox_filter)   (sam2_refiner)         (depth_back_projection) (SampleGrouper)

python run.py --config config/preprocessing/demo_preprocessing_hypersim.yaml --output_dir output/

Multiview input (per-scene Parquet, e.g. ScanNet++, EmbodiedScan per_scene):

Each row is a scene with list-valued fields. The pipeline first flattens to per-image, then follows the same stages. Datasets like ScanNet++ fall into this mode because their 3D box annotations are provided at the scene level (shared across all views), rather than per-image.

flatten_stage ──> filter_stage ──> localization_stage ──> scene_fusion_stage ──> group_stage
(SampleFlattener) (3dbox_filter)   (sam2_refiner)         (depth_back_projection) (SampleGrouper)

python run.py --config config/preprocessing/demo_preprocessing_scannetpp.yaml --output_dir output/

Note: The output of scene_fusion_stage is per-image Parquet, used as input for singleview annotation tasks. The output of group_stage aggregates per-image records back into per-scene Parquet, used as input for multiview annotation tasks.

# Singleview annotations
for cfg in demo_distance demo_depth demo_size demo_position demo_counting demo_3d_grounding; do
    python run.py --config config/annotation/${cfg}.yaml --output_dir output/demo
done

# Multiview annotations
for cfg in demo_multiview_distance demo_multiview_size demo_multiview_correspondence \
           demo_multiview_distance_obj_cam demo_multiview_object_position; do
    python run.py --config config/annotation/${cfg}.yaml --output_dir output/demo
done

4. Visualization

Launch the built-in visualization server to browse annotation results:

python visualize_server.py --data_dir output/demo --port 8888

Then open http://<host>:8888 in a browser. Features:

  • Dropdown to switch between task outputs
  • QA pairs with rendered images
  • Lightbox for full-size image viewing
  • Keyboard navigation (left/right arrows)

5. Development Guide

For extending OpenSpatial with new annotation tasks, pipeline stages, prompt templates, or dataset preprocessors, see the Development Guide.

It covers:

  • Class hierarchy (BaseTask → BaseAnnotationTask → BaseMultiviewAnnotationTask)
  • Step-by-step: adding singleview / multiview annotation tasks
  • Prompt template system and placeholder conventions
  • SceneGraph data model
  • Adding new pipeline stages and dataset preprocessors
  • Testing and common patterns