
akaiHuang/monocular-3d-reconstruction


Monocular 3D Reconstruction

Single-View to 3D Scene in Under a Second -- Powered by Apple's SHARP Research

About

Monocular 3D Reconstruction implements a fast single-image-to-3D-scene pipeline, converting 2D photographs into renderable 3D representations based on Gaussian Splatting. It is useful for research replication, rapid prototyping of real-time 3D content generation, and evaluating the feasibility of monocular reconstruction for AR/VR and visualization applications.

📋 Quick Summary

  📸 From a single photo to a 3D scene in under a second! This project implements Apple's SHARP research: a single feedforward neural network converts any 2D photograph into a high-quality 3D Gaussian Splatting representation.
  🧠 Core components include an image encoder, a multi-resolution decoder, monocular depth estimation, and Gaussian parameter prediction; one inference pass produces a metric-scale 3D model with absolute depth.
  🎮 A built-in SuperSplat-based interactive web viewer (Next.js) lets users upload a photo and immediately explore the 3D scene 360 degrees in the browser.
  🎬 Supports CUDA-accelerated camera trajectory video rendering with .mp4 output.
  ⚡ Inference runs on CPU, CUDA, and Apple MPS (Metal).
  📊 Relative to prior state of the art, LPIPS drops by 25-34% and DISTS by 21-43%, demonstrating strong zero-shot generalization.
  🏗️ Suited to computer vision researchers, 3D content creators, and developers who need to generate 3D assets from photos quickly.


🤔 Why This Exists

Generating a full 3D scene from a single photograph has long been one of the hardest problems in computer vision. This project implements Apple's SHARP approach (Sharp Monocular View Synthesis), which produces photorealistic 3D Gaussian representations from a single 2D image in one feedforward pass -- no multi-view capture, no scanning, no waiting. It pairs the ML inference pipeline with a custom web-based 3D viewer built on SuperSplat for immediate interactive exploration of generated scenes.

🏗️ Architecture

Single 2D Image (any photograph)
        |
        v
+--------------------------------------------------+
|  SHARP Neural Network                             |
|                                                   |
|  Image Encoder --> Multi-Resolution Decoder       |
|       --> Monocular Depth Estimation              |
|       --> Gaussian Parameter Prediction (NDC)     |
|       --> Unproject to Metric 3D Space            |
|                                                   |
|  Single feedforward pass, < 1 second on GPU       |
+--------------------------------------------------+
        |
        v
3D Gaussian Splat (.ply) -- metric scale, absolute depth
        |
        +---> Web Viewer (Next.js + SuperSplat)
        |         - Interactive 3D exploration in browser
        |         - Upload image, view result immediately
        |
        +---> Video Rendering (gsplat, CUDA)
                  - Camera trajectory animation
                  - .mp4 output
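
The final step of the pipeline above unprojects per-pixel depth predictions from image space into metric 3D. A minimal pinhole-camera sketch of that operation (the intrinsics, values, and function name here are illustrative, not the project's actual code):

```python
# Unproject a pixel (u, v) with predicted metric depth d into 3D camera
# coordinates, given pinhole intrinsics K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
def unproject(u, v, depth, fx, fy, cx, cy):
    """Return the 3D point (X, Y, Z) in camera space, in metres."""
    x = (u - cx) / fx * depth   # back-project along the x ray
    y = (v - cy) / fy * depth   # back-project along the y ray
    return (x, y, depth)

# The principal-point pixel always unprojects onto the optical axis.
point = unproject(u=320.0, v=240.0, depth=2.5, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Because SHARP predicts absolute (metric) depth, the resulting points live at real-world scale, which is what makes physically plausible camera movements possible downstream.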

Key Capabilities

  • Sub-second 3D generation from any single photograph via a single neural network forward pass
  • Metric-scale output with absolute depth -- enables real-world camera movements
  • 3DGS-compatible (.ply format) works with any Gaussian Splat renderer
  • Interactive web viewer for immediate 3D scene exploration in the browser
  • Multi-device inference -- runs on CPU, CUDA, and Apple MPS (Metal)
  • Zero-shot generalization across datasets, reducing LPIPS by 25-34% and DISTS by 21-43% vs. prior state of the art
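
Because the output is a standard 3DGS-style .ply, any splat renderer can consume it. A stdlib-only sketch of the binary header and per-vertex layout such files use (property names follow the common 3DGS convention; real files carry many more per-splat fields such as f_dc_*, scale_*, and rot_*):

```python
import struct

# A minimal binary-little-endian PLY with a few common per-splat fields.
PROPS = ["x", "y", "z", "opacity"]

def write_ply(splats):
    """Serialize a list of (x, y, z, opacity) tuples to PLY bytes."""
    header = ["ply", "format binary_little_endian 1.0",
              f"element vertex {len(splats)}"]
    header += [f"property float {p}" for p in PROPS]
    header.append("end_header")
    body = b"".join(struct.pack("<4f", *s) for s in splats)
    return ("\n".join(header) + "\n").encode("ascii") + body

def read_ply(data):
    """Parse the vertex count from the header, then unpack one record per vertex."""
    head, _, body = data.partition(b"end_header\n")
    n = int(next(line for line in head.decode().splitlines()
                 if line.startswith("element vertex")).split()[-1])
    return [struct.unpack_from("<4f", body, i * 16) for i in range(n)]

splats = [(0.0, 0.5, 2.0, 0.875), (1.0, -0.25, 3.5, 0.5)]
data = write_ply(splats)
```

In practice a library such as plyfile handles this parsing; the point is only that the format is plain and renderer-agnostic.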

🛠️ Tech Stack

Layer              Technology
ML Framework       PyTorch
Model              Encoder-Decoder with Gaussian Head (SHARP)
3D Representation  3D Gaussian Splatting
CLI                Click
Web Viewer         Next.js (App Router), SuperSplat, Three.js
Video Rendering    gsplat (CUDA only)
Package Manager    pip, pyproject.toml
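
Multi-device support (CPU / CUDA / Apple MPS) usually reduces to a small device-selection helper. A sketch using PyTorch's standard availability checks (this is not the project's actual code):

```python
def pick_device() -> str:
    """Prefer CUDA, then Apple MPS (Metal), then fall back to CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed: nothing to accelerate
    if torch.cuda.is_available():
        return "cuda"
    # Guarded lookup: torch.backends.mps only exists on builds with MPS support
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()
```

Note that video rendering via gsplat remains CUDA-only; CPU and MPS cover prediction but not the trajectory renderer.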

🏁 Quick Start

ML Pipeline

# Create environment
conda create -n sharp python=3.13
conda activate sharp

# Install dependencies
pip install -r requirements.txt

# Run prediction (model downloads automatically on first run)
sharp predict -i /path/to/image.jpg -o /path/to/output/

# Render camera trajectory video (CUDA GPU required)
sharp predict -i /path/to/image.jpg -o /path/to/output/ --render
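
The --render path animates a camera along a trajectory around the scene. A back-of-envelope sketch of generating a circular orbit of camera positions (illustrative only; the project's actual trajectory code may differ):

```python
import math

def orbit_positions(radius: float, height: float, n_frames: int):
    """Camera centres evenly spaced on a horizontal circle of the given radius."""
    positions = []
    for i in range(n_frames):
        theta = 2.0 * math.pi * i / n_frames
        positions.append((radius * math.cos(theta), height, radius * math.sin(theta)))
    return positions

# 60 positions -> e.g. a 2-second orbit at 30 fps once each frame is rendered
poses = orbit_positions(radius=2.0, height=0.5, n_frames=60)
```

Because the splat is metric-scale, the orbit radius is in real-world metres, so trajectories behave like physical camera moves.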

Web Viewer

cd viewer
npm install
npm run dev
# Open http://localhost:3000

📁 Project Structure

monocular-3d-reconstruction/
  src/
    sharp/
      cli/
        predict.py               # Main prediction CLI
        render.py                # Camera trajectory video rendering
      models/
        encoders/                # Image feature encoders
        decoders/                # Multi-resolution convolutional decoders (UNet, etc.)
        gaussian_decoder.py      # 3D Gaussian parameter prediction
        predictor.py             # End-to-end RGB Gaussian predictor
        monodepth.py             # Monocular depth estimation module
        composer.py              # Model composition
        heads.py, blocks.py      # Neural network building blocks
      utils/                     # I/O, Gaussian ops, logging
  viewer/
    src/
      app/                       # Next.js App Router
        api/generate/route.ts    # Image-to-3D API endpoint
        api/ply/route.ts         # PLY file serving
      components/
        SuperSplatViewer.tsx      # SuperSplat-based 3D viewer
        GaussianSplatViewer.tsx   # Custom Gaussian Splat viewer
        EmbeddedViewer.tsx       # Embedded viewer wrapper
    public/supersplat/           # SuperSplat viewer assets
  supersplat-viewer-source/      # SuperSplat viewer source code
  data/                          # Sample images and teaser assets

Research Reference

Based on: Sharp Monocular View Synthesis in Less Than a Second -- Mescheder, Dong, Li, Bai, Santos, Hu, Lecouat, Zhen, Delaunoy, Fang, Tsin, Richter, Koltun (Apple, 2025).

arXiv:2512.10685 | Project Page


Built by Huang Akai (Kai) -- Creative Technologist, Founder @ Universal FAW Labs
