
Multi-Method Image Segmentation & Object Detection

A practical, hands-on exploration of four state-of-the-art deep learning models for computer vision — implemented, compared, and demonstrated on diverse real-world images. This project highlights strengths, trade-offs, and unique capabilities such as zero-shot detection and segmentation, serving as a showcase of my technical skills for portfolios and CVs.


🎯 Project Goal

In computer vision, two key challenges are:

  • Detection — identifying what objects are present.
  • Segmentation — outlining their exact boundaries.

Different models excel in different areas: some prioritize speed, others accuracy, and some offer flexible, zero-shot capabilities. This project’s goal was to:

  1. Implement and run four leading models: DeepLabV3, YOLOv8, Segment Anything Model (SAM), and GroundingDINO + SAM.

  2. Process the same input images across all models, saving:

    • Visual outputs
    • Structured data (class labels, bounding boxes, confidence scores); a sample record is sketched after this list
  3. Compare architectures, performance, and outputs, emphasizing the strengths of semantic segmentation, real-time detection, and prompt-based zero-shot segmentation.
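
For illustration, a structured record for a single detection might look like the following sketch (field names and values are hypothetical; the actual schema is defined in the notebook):

record = {
    "image": "example.jpg",            # hypothetical input filename
    "model": "YOLOv8",
    "class_label": "person",
    "confidence": 0.91,                # hypothetical score
    "bbox_xyxy": [124, 56, 388, 512],  # hypothetical pixel coordinates
}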


💡 Solution Approach

The project is implemented as a comparative Jupyter Notebook, with each section dedicated to one model. The same curated image set is processed through all pipelines for side-by-side evaluation.

1️⃣ Semantic Segmentation — DeepLabV3

  • Model: deeplabv3_resnet101 (TorchVision weights pre-trained on a subset of COCO using the Pascal VOC classes)
  • Task: Dense, pixel-level classification with predefined classes.
  • Output: High-precision masks and segmentation maps.
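
A minimal inference sketch (the exact code lives in the notebook; the input path here is a hypothetical example):

import torch
from torchvision import models, transforms
from PIL import Image

# Load pre-trained DeepLabV3 with a ResNet-101 backbone.
# (torchvision >= 0.13; older versions use pretrained=True instead.)
model = models.segmentation.deeplabv3_resnet101(weights="DEFAULT").eval()

# Standard ImageNet normalization expected by the backbone.
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("images/input/example.jpg").convert("RGB")  # hypothetical path
batch = preprocess(img).unsqueeze(0)

with torch.no_grad():
    logits = model(batch)["out"][0]  # (num_classes, H, W)
seg_map = logits.argmax(0)           # per-pixel class indices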

2️⃣ Panoptic Segmentation — SAM + CLIP

  • Model: vit-h variant of SAM.
  • Enhancement: Integrated OpenAI CLIP for zero-shot semantic labeling of SAM’s class-agnostic masks.
  • Result: Fully automated panoptic segmentation with meaningful class names.
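
The labeling step can be sketched as follows; the checkpoint path and candidate-label list are assumptions, and the notebook defines the actual values:

import numpy as np
import torch
import clip  # OpenAI CLIP package
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# SAM proposes class-agnostic masks for the whole image (CPU here, as in the notebook).
sam = sam_model_registry["vit_h"](checkpoint="download_model/sam_vit_h.pth")  # hypothetical path
mask_generator = SamAutomaticMaskGenerator(sam)

image = np.array(Image.open("images/input/example.jpg").convert("RGB"))  # hypothetical path
masks = mask_generator.generate(image)  # dicts with 'segmentation', 'bbox' (XYWH), ...

# CLIP scores each mask crop against a hypothetical list of candidate labels.
clip_model, clip_preprocess = clip.load("ViT-B/32", device="cpu")
labels = ["person", "horse", "dog", "tree"]
text = clip.tokenize(labels)

for m in masks:
    x, y, w, h = map(int, m["bbox"])
    crop = Image.fromarray(image[y:y + h, x:x + w])
    with torch.no_grad():
        logits_per_image, _ = clip_model(clip_preprocess(crop).unsqueeze(0), text)
    m["label"] = labels[logits_per_image.argmax().item()]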

3️⃣ Object Detection — YOLOv8

  • Strength: Real-time detection with bounding boxes & class labels.
  • Use Case: Benchmarked against semantic and panoptic methods.
  • Extra: Applied to both original and SAM-segmented images.
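
A minimal detection sketch with the ultralytics API (the input and output paths are hypothetical; weights download automatically on first use):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")                   # nano variant; other sizes also work
results = model("images/input/example.jpg")  # hypothetical path

for r in results:
    for box in r.boxes:
        name = model.names[int(box.cls)]       # class label
        conf = float(box.conf)                 # confidence score
        x1, y1, x2, y2 = box.xyxy[0].tolist()  # bounding box in pixels
        print(f"{name} {conf:.2f} ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
    r.save("images/yolo/example.jpg")          # annotated image (recent ultralytics;
                                               # on older versions use r.plot())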

4️⃣ Zero-Shot Detection & Segmentation — GroundingDINO + SAM

  • Pipeline:

    1. GroundingDINO: Text-prompt-based detection (e.g., “a person on a horse”).
    2. SAM: Precise segmentation masks for detected boxes.
  • Benefit: Promptable, zero-shot segmentation without retraining.
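
A sketch of the two-stage pipeline, assuming hypothetical config and checkpoint paths:

import torch
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

# Stage 1: text-prompted detection with GroundingDINO.
dino = load_model("GroundingDINO_SwinT_OGC.py",                  # hypothetical config path
                  "download_model/groundingdino_swint_ogc.pth",  # hypothetical checkpoint
                  device="cpu")
image_source, image = load_image("images/input/example.jpg")     # hypothetical path
boxes, logits, phrases = predict(
    model=dino, image=image,
    caption="a person on a horse",
    box_threshold=0.35, text_threshold=0.25, device="cpu",
)

# Stage 2: SAM turns each detected box into a precise mask.
sam = sam_model_registry["vit_h"](checkpoint="download_model/sam_vit_h.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_source)

# GroundingDINO returns normalized cxcywh boxes; SAM expects absolute xyxy.
h, w = image_source.shape[:2]
xyxy = boxes * torch.tensor([w, h, w, h])
xyxy[:, :2] -= xyxy[:, 2:] / 2
xyxy[:, 2:] += xyxy[:, :2]

for box in xyxy:
    masks, scores, _ = predictor.predict(box=box.numpy(), multimask_output=False)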


🛠️ Technologies & Libraries

| Category | Tools / Frameworks |
| --- | --- |
| Core Frameworks | PyTorch, TorchVision |
| Models & Architectures | YOLOv8 (ultralytics), SAM, DeepLabV3, GroundingDINO, CLIP, Transformers |
| Utilities & Processing | Pillow, OpenCV, NumPy, Matplotlib, requests, supervision |
| Environment | Jupyter Notebook, pip |

🖼️ Dataset

  • Type: Curated collection of 7 diverse real-world images.
  • Location: images/input/
  • Scenes: Group gatherings, action shots, wildlife, landscapes.
  • Purpose: Test generalization and robustness across varied, non-benchmark examples.

⚙️ Installation & Execution

1. Clone Repository

git clone <repository-url>
cd <repository-name>

2. Install Dependencies

pip install torch torchvision pillow matplotlib
pip install git+https://github.com/facebookresearch/segment-anything.git
pip install ultralytics
pip install groundingdino-py supervision

3. Prepare Environment

  • Place images in images/input/
  • The first run auto-downloads the pretrained model weights into download_model/

4. Run Notebook

Open and execute segmention_yolo_deepleb.ipynb cell by cell.

5. View Results

Outputs are saved in:

  • images/deeplab_segmented/ — DeepLabV3
  • images/segmented/ — SAM
  • images/yolo/ — YOLOv8
  • images/groundingdino_sam/ — GroundingDINO + SAM

📊 Performance Highlights

| Model | Notable Strengths | Example Performance |
| --- | --- | --- |
| DeepLabV3 | Strong pixel-level segmentation | 7 images in 4.63 s on GPU (0.66 s/image) |
| SAM | Extremely high-quality masks | 88 masks in 53.48 s on CPU (heavy model) |
| YOLOv8 | Real-time detection | Depends on hardware |
| GroundingDINO + SAM | Flexible, prompt-based segmentation | Accurate zero-shot results |

🏙️ Sample Outputs

Sample output images for SAM, YOLOv8, DeepLabV3, and GroundingDINO + SAM are included in the output folders listed above (e.g., detection results for festive-holiday-gathering-with-friends).


🧠 Key Learnings & Reflections

  • Learned practical trade-offs between semantic, panoptic, and real-time approaches.
  • Composing GroundingDINO + SAM revealed the power of multi-model pipelines.
  • SAM + CLIP integration turned a class-agnostic model into a zero-shot labeling tool.
  • Running heavy models like SAM (vit-h) on CPU highlighted the importance of balancing accuracy and compute resources.

👤 Author

Mehran Asgari 📧 [email protected] 🌐 GitHub Profile


📄 License

Licensed under the Apache 2.0 License — see LICENSE for details.

