```bash
conda create -n openspatial python=3.10 -y
conda activate openspatial
cd OpenSpatial

# Core dependencies
pip install -r requirements.txt

# PyTorch (CUDA 12.6)
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
    --index-url https://download.pytorch.org/whl/cu126

# spaCy model
python -m spacy download en_core_web_sm

# Foundation models
pip install 'git+https://github.com/facebookresearch/sam2.git' --quiet
pip install -q -U flash-attn --no-build-isolation
```
### 1.3 Verify Installation
```bash
python -c "import torch; print(f'PyTorch {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
python -m pytest tests/ -x -q
```

**World coordinate system:** The scene mesh must be gravity-aligned — the ground plane is parallel to the XY plane, and the Z-axis points upward.
```
      Z (up)
      |
      |
      |_______ Y
     /
    /
   X

Ground plane = XY
```
**Camera coordinate system (OpenCV convention):** X-axis points right, Y-axis points down, Z-axis points forward (into the scene along the optical axis). The depth value of a pixel equals its Z coordinate in the camera frame.
```
        Z (forward / optical axis)
       /
      /
     /_______ X (right)
     |
     |
     Y (down)
```
Image plane is perpendicular to Z.
Pixel (u, v) back-projects to:
```
X_cam = (u - cx) * depth / fx
Y_cam = (v - cy) * depth / fy
Z_cam = depth
```
The `pose` field stores the 4x4 camera-to-world extrinsic matrix. To convert from world to camera frame: `P_cam = inv(pose) @ P_world`.
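A minimal NumPy sketch of these two conversions is shown below. The helper names are illustrative, not part of the OpenSpatial API; `fx`, `fy`, `cx`, `cy` follow the intrinsic convention above (under the standard pinhole layout of a 4x4 intrinsic matrix they would typically be `K[0][0]`, `K[1][1]`, `K[0][2]`, `K[1][2]`).

```python
import numpy as np

def backproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into the camera frame."""
    x_cam = (u - cx) * depth / fx
    y_cam = (v - cy) * depth / fy
    return np.array([x_cam, y_cam, depth])

def world_to_camera(p_world, pose):
    """Map a world-frame point into the camera frame.

    `pose` is the 4x4 camera-to-world extrinsic, so its inverse takes
    world coordinates back into the camera frame.
    """
    p_h = np.append(p_world, 1.0)           # homogeneous coordinates
    return (np.linalg.inv(pose) @ p_h)[:3]

def camera_to_world(p_cam, pose):
    """Map a camera-frame point into the world frame."""
    p_h = np.append(p_cam, 1.0)
    return (pose @ p_h)[:3]
```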
**3D Oriented Bounding Box (OBB):** Each object's 3D bounding box is represented as a 9-element vector in world coordinates:

```
[cx, cy, cz, xl, yl, zl, roll, pitch, yaw]
```

| Element | Description |
|---|---|
| `cx, cy, cz` | Center position in the world frame |
| `xl, yl, zl` | Extents (full lengths) along the box's local X / Y / Z axes; `zl` corresponds to the object's height |
| `roll, pitch, yaw` | Euler angles in `zxy` intrinsic order (radians), encoding the rotation from the world frame to the box's local frame |
The pipeline converts world-frame OBBs to camera frame via `inv(pose) @ T_world`. The resulting camera-frame OBB uses the same 9-element representation.
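As an illustration, the sketch below turns a 9-element OBB into its 8 corner points with NumPy and SciPy. The angle-to-axis mapping used here (roll about X, pitch about Y, yaw about Z, composed in intrinsic z-x-y order) is an assumption made for this example; consult the pipeline code for the authoritative convention.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def obb_rotation(obb):
    """Build the 3x3 rotation matrix encoded by an OBB's roll/pitch/yaw.

    Assumption for illustration: roll/pitch/yaw rotate about X/Y/Z and are
    composed in intrinsic z-x-y order ("ZXY" in SciPy terms).
    """
    roll, pitch, yaw = obb[6:9]
    return Rotation.from_euler("ZXY", [yaw, roll, pitch]).as_matrix()

def obb_corners(obb):
    """Return the 8 corner points of the box in the world frame."""
    center, extents = np.asarray(obb[:3]), np.asarray(obb[3:6])
    # Treating R as the world-to-local rotation, its rows are the box's local
    # axes expressed in world coordinates, so a local offset maps to world
    # coordinates as offset @ R.
    R = obb_rotation(obb)
    signs = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    return (signs * extents / 2.0) @ R + center
```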
OpenSpatial uses Parquet files as the unified data format. Each row represents a scene sample.
| Field | Singleview | Multiview | Source | Description |
|---|---|---|---|---|
| `image` | `str` | `list[str]` | Input | RGB image path |
| `depth_map` | `str` | `list[str]` | Input | Depth map path |
| `pose` | `str` | `list[str]` | Input | Camera extrinsic (4x4 txt) |
| `intrinsic` | `str` | `list[str]` | Input | Camera intrinsic (4x4 txt) |
| `obj_tags` | `list[str]` | `list[list[str]]` | Input | Object tags (per-view in multiview) |
| `bboxes_3d_world_coords` | `list[list[float]]` | `list[list[list[float]]]` | Input | 3D OBB `[cx,cy,cz,xl,yl,zl,roll,pitch,yaw]` |
| `depth_scale` | `int` | `int` | Input | Depth scale factor (e.g. 1000) |
| `is_metric_depth` | `bool` | `bool` | Input | Whether depth is metric |
| `masks` | `list[str]` | `list[list[str]]` | Pipeline | Mask PNG paths per object |
| `bboxes_2d` | `list[list[int]]` | `list[list[list[int]]]` | Pipeline | 2D bbox `[x1,y1,x2,y2]` per object |
| `pointclouds` | `list[str]` | `list[list[str]]` | Pipeline | Point cloud `.pcd` paths per object |
Note: Not all fields are required at the outset. Input fields come from data preprocessing (Section 2.3); Pipeline fields (`masks`, `bboxes_2d`, `pointclouds`) are intermediate results generated by pipeline stages such as `Localizer`, `Sam2Refiner`, and `DepthBackProjecter`. Different annotation tasks depend on different subsets of these fields.
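Since every sample is simply a Parquet row, the fields above can be inspected directly with pandas. The sketch below uses a placeholder path; the columns you actually see depend on how far through the pipeline the file has progressed.

```python
import pandas as pd

# Placeholder path; point this at any Parquet file produced by preprocessing
# or by an intermediate pipeline stage.
df = pd.read_parquet("/path/to/input.parquet")

print(df.columns.tolist())                  # available fields for this file
row = df.iloc[0]                            # one scene sample
print(row["image"])                         # str (singleview) or list[str] (multiview)
print(len(row["bboxes_3d_world_coords"]))   # number of annotated objects
```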
OpenSpatial supports two types of input data:
For public datasets with existing 3D annotations (e.g. ScanNet++, Hypersim, EmbodiedScan), we provide preprocessing scripts to convert them into the standard Parquet format.
**ScanNet++:**

```bash
python data_preprocessing/scannetpp/prepare_scannetpp.py \
    --input_root /path/to/scannetpp/data \
    --output_dir /path/to/output/parquet \
    --selected_tags_file data_preprocessing/scannetpp/scannet-labels.combined.tsv \
    --chunk_size 100 \
    --max_workers 32
```

**Hypersim:**
```bash
python data_preprocessing/hypersim/prepare_hypersim.py \
    --input_root /path/to/Hypersim \
    --output_dir /path/to/output/parquet \
    --camera_params_csv data_preprocessing/hypersim/metadata_camera_parameters.csv \
    --labels_tsv data_preprocessing/scannetpp/scannet-labels.combined.tsv \
    --chunk_size 1000 \
    --max_workers 32
```

**EmbodiedScan (ScanNet, 3RScan, Matterport3D, ARKitScenes):**
```bash
# Install the preprocessing package first
cd data_preprocessing/embodiedscan && pip install -e . && cd ../..

# Extract per-image data
python -m embodiedscan_data extract \
    --dataset all \
    --data-root /path/to/EmbodiedScan/data \
    --output /path/to/output \
    --workers 24

# Merge into per-scene records
python -m embodiedscan_data merge --input /path/to/output

# Export to Parquet
python -m embodiedscan_data export --input /path/to/output --format both
```

See `data_preprocessing/embodiedscan/README.md` for detailed data directory structure and per-dataset notes.
Coming soon (expected May 2025).
```bash
python run.py --config <config.yaml> --output_dir <output_directory>
```

Config files are in `config/` and define the dataset source, pipeline executor, and processing stages. Below is a full end-to-end pipeline config with annotations:
```yaml
# ── Dataset ──────────────────────────────────────────────────
dataset:
  modality: image                    # input modality: "image"
  dataset_name: image_base           # dataset loader class (dataset/image_base.py)
  data_dir: /path/to/input.parquet   # path to Parquet file(s), str or list[str]

# ── Pipeline ─────────────────────────────────────────────────
pipeline:
  file_name: base_pipeline           # pipeline module (pipeline/base_pipeline.py)
  class_name: BasePipeline           # pipeline class

  stages:
    # ── Stage 1: Filter ────────────────────────────────────
    # Validate 3D bounding boxes via 2D projection & point cloud
    filter_stage:
      -
        file_name: 3dbox_filter
        method: ThreeDBoxFilter
        filter_tags: ["ceiling", "floor", "wall", "object"]
        output_dir:

    # ── Stage 2: Localization ──────────────────────────────
    # Refine masks with SAM2 box prompts
    localization_stage:
      -
        file_name: sam2_refiner
        method: Sam2Refiner
        update_keys: ["obj_tags", "bboxes_3d_world_coords"]
        depends_on: filter_stage/3dbox_filter
        output_dir:

    # ── Stage 3: Scene Fusion ──────────────────────────────
    # Back-project depth to per-object 3D point clouds
    scene_fusion:
      -
        file_name: depth_back_projection
        method: DepthBackProjecter
        depends_on: localization_stage/sam2_refiner
        output_dir:

    # ── Stage 4: Annotation ────────────────────────────────
    # Generate spatial QA pairs
    annotation_stage:
      -
        file_name: distance          # task module name under task/<stage>/
        method: AnnotationGenerator  # task class name
        depends_on: scene_fusion/depth_back_projection

        # Common parameters (BaseTask)
        use_multi_processing: false  # enable multi-process execution
        num_workers: 8               # number of parallel workers

        # Annotation parameters (BaseAnnotationTask)
        scaling_factor: 1            # coordinate scaling factor
        filter_tags: ["ceiling", "floor"]  # tags to exclude from QA generation
        sub_tasks:                   # sub-task names → count per sample
          absolute_distance: 1
          relative_distance: 1

        # Multiview parameters (BaseMultiviewTask, multiview tasks only)
        # max_num_views: 400         # max views to consider per scene
        # min_rot_angle: 15.0        # min rotation angle (deg) between views
        # min_translation: 0.0       # min camera center distance between views

        # Task-specific parameters vary by task (see demo configs for full list)
        output_dir:
```

Each stage is a list of tasks (the `-` items). A stage can contain multiple tasks that run sequentially. Stages execute in dependency order defined by `depends_on`. For annotation-only runs, a single `annotation_stage` is sufficient (see demo configs).
Note: This list is continuously updated. Check the `config/` directory for the latest available tasks.
**Annotation (singleview):**

| Config | Task | Description |
|---|---|---|
| `config/annotation/demo_distance.yaml` | Distance | Absolute & relative distance QA |
| `config/annotation/demo_depth.yaml` | Depth | Depth ordering & comparison QA |
| `config/annotation/demo_size.yaml` | Size | Absolute & relative size QA |
| `config/annotation/demo_position.yaml` | Position | Height comparison & proximity QA |
| `config/annotation/demo_counting.yaml` | Counting | Object counting QA |
| `config/annotation/demo_3d_grounding.yaml` | 3D Grounding | 3D bounding box grounding QA |
**Annotation (multiview):**

| Config | Task | Description |
|---|---|---|
| `config/annotation/demo_multiview_distance.yaml` | Multiview Distance | Cross-view distance QA |
| `config/annotation/demo_multiview_size.yaml` | Multiview Size | Cross-view size QA |
| `config/annotation/demo_multiview_correspondence.yaml` | Correspondence | Cross-view object matching QA |
| `config/annotation/demo_multiview_distance_obj_cam.yaml` | Object-Camera Distance | Object to camera distance QA |
| `config/annotation/demo_multiview_object_position.yaml` | Object Position | Cross-view position QA |
**Processing stages (used in full pipeline):**

| Stage | Task | Description |
|---|---|---|
| `filter_stage` | `ThreeDBoxFilter` | Validate 3D boxes via 2D projection & point cloud |
| `localization_stage` | `Sam2Refiner` | Refine masks with SAM2 box prompts |
| `scene_fusion` | `DepthBackProjecter` | Back-project depth to per-object point clouds |
Input data falls into two modes depending on the Parquet structure. The preprocessing pipeline differs slightly between them:
**Singleview input** (per-image Parquet, e.g. Hypersim, EmbodiedScan `per_image`):

Each row is one image. The pipeline runs filter → localize → scene_fusion directly, then groups into multiview if needed.

```
filter_stage ──> localization_stage ──> scene_fusion_stage ──> group_stage
(3dbox_filter)   (sam2_refiner)         (depth_back_projection) (SampleGrouper)
```

```bash
python run.py --config config/preprocessing/demo_preprocessing_hypersim.yaml --output_dir output/
```

**Multiview input** (per-scene Parquet, e.g. ScanNet++, EmbodiedScan `per_scene`):
Each row is a scene with list-valued fields. The pipeline first flattens to per-image, then follows the same stages. Datasets like ScanNet++ fall into this mode because their 3D box annotations are provided at the scene level (shared across all views), rather than per-image.
```
flatten_stage ──> filter_stage ──> localization_stage ──> scene_fusion_stage ──> group_stage
(SampleFlattener) (3dbox_filter)   (sam2_refiner)         (depth_back_projection) (SampleGrouper)
```

```bash
python run.py --config config/preprocessing/demo_preprocessing_scannetpp.yaml --output_dir output/
```

Note: The output of `scene_fusion_stage` is per-image Parquet, used as input for singleview annotation tasks. The output of `group_stage` aggregates per-image records back into per-scene Parquet, used as input for multiview annotation tasks.
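Conceptually, the grouping step turns per-image rows that belong to the same scene into a single per-scene row with list-valued fields. The pandas sketch below only illustrates that transformation; the scene key name (`scene_id`), the file paths, and the actual `SampleGrouper` implementation details are assumptions.

```python
import pandas as pd

# Illustrative only: the scene key ("scene_id") and paths are hypothetical.
per_image = pd.read_parquet("/path/to/per_image.parquet")

list_fields = ["image", "depth_map", "pose", "intrinsic", "obj_tags",
               "bboxes_3d_world_coords", "masks", "bboxes_2d", "pointclouds"]

# Aggregate each per-image field into a list per scene, yielding the
# per-scene layout expected by multiview annotation tasks.
per_scene = per_image.groupby("scene_id")[list_fields].agg(list).reset_index()
per_scene.to_parquet("/path/to/per_scene.parquet")
```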
```bash
# Singleview annotations
for cfg in demo_distance demo_depth demo_size demo_position demo_counting demo_3d_grounding; do
    python run.py --config config/annotation/${cfg}.yaml --output_dir output/demo
done

# Multiview annotations
for cfg in demo_multiview_distance demo_multiview_size demo_multiview_correspondence \
           demo_multiview_distance_obj_cam demo_multiview_object_position; do
    python run.py --config config/annotation/${cfg}.yaml --output_dir output/demo
done
```

Launch the built-in visualization server to browse annotation results:

```bash
python visualize_server.py --data_dir output/demo --port 8888
```

Then open `http://<host>:8888` in a browser. Features:
- Dropdown to switch between task outputs
- QA pairs with rendered images
- Lightbox for full-size image viewing
- Keyboard navigation (left/right arrows)
For extending OpenSpatial with new annotation tasks, pipeline stages, prompt templates, or dataset preprocessors, see the Development Guide.
It covers:
- Class hierarchy (`BaseTask` → `BaseAnnotationTask` → `BaseMultiviewAnnotationTask`)
- Step-by-step: adding singleview / multiview annotation tasks
- Prompt template system and placeholder conventions
- SceneGraph data model
- Adding new pipeline stages and dataset preprocessors
- Testing and common patterns