A practical, hands-on exploration of four state-of-the-art deep learning models for computer vision — implemented, compared, and demonstrated on diverse real-world images. This project highlights strengths, trade-offs, and unique capabilities such as zero-shot detection and segmentation, serving as a showcase of my technical skills for portfolios and CVs.
In computer vision, two key challenges are:
- Detection — identifying what objects are present.
- Segmentation — outlining their exact boundaries.
Different models excel in different areas: some prioritize speed, others accuracy, and some offer flexible, zero-shot capabilities. This project’s goal was to:
- Implement and run four leading models: DeepLabV3, YOLOv8, Segment Anything Model (SAM), and GroundingDINO + SAM.
- Process the same input images across all models, saving:
  - Visual outputs
  - Structured data (class labels, bounding boxes, confidence scores); a sketch of this record format follows the list
- Compare architectures, performance, and outputs, emphasizing the strengths of semantic segmentation, real-time detection, and prompt-based zero-shot segmentation.
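As a rough illustration of the structured data each run can store next to its visual output, the snippet below writes one detection record to JSON. The field names, example values, and output path are hypothetical, not the notebook's exact schema.

```python
# Hypothetical per-image record (labels, boxes, scores); schema and values are illustrative only.
import json

record = {
    "image": "images/input/example.jpg",
    "model": "yolov8",
    "detections": [
        {"label": "person", "confidence": 0.91, "bbox_xyxy": [34, 50, 210, 480]},
    ],
}

with open("images/yolo/example.json", "w") as f:
    json.dump(record, f, indent=2)
```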
The project is implemented as a comparative Jupyter Notebook, with each section dedicated to one model. The same curated image set is processed through all pipelines for side-by-side evaluation.
- Model: `deeplabv3_resnet101` (pre-trained on COCO).
- Task: Dense, pixel-level classification with predefined classes.
- Output: High-precision masks and segmentation maps.
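A minimal sketch of what this DeepLabV3 inference step can look like with TorchVision; the weights enum assumes torchvision >= 0.13 and the image path is a placeholder, so treat it as an illustration rather than the notebook's exact code.

```python
# Sketch: semantic segmentation with TorchVision's pretrained DeepLabV3 (ResNet-101 backbone).
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet101, DeepLabV3_ResNet101_Weights

model = deeplabv3_resnet101(weights=DeepLabV3_ResNet101_Weights.DEFAULT).eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("images/input/example.jpg").convert("RGB")   # placeholder path
batch = preprocess(image).unsqueeze(0)                          # shape (1, 3, H, W)

with torch.no_grad():
    logits = model(batch)["out"][0]                             # (num_classes, H, W)
segmentation_map = logits.argmax(dim=0)                         # per-pixel class indices
```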
- Model: `vit-h` variant of SAM.
- Enhancement: Integrated OpenAI CLIP for zero-shot semantic labeling of SAM’s class-agnostic masks.
- Result: Fully automated panoptic segmentation with meaningful class names.
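A minimal sketch of this SAM + CLIP combination, assuming the official `segment-anything` and OpenAI `clip` packages; the checkpoint path, candidate labels, and crop-based labeling strategy below are assumptions, not necessarily the notebook's exact approach.

```python
# Sketch: class-agnostic SAM masks + zero-shot CLIP labels.
# Checkpoint path, image path, and candidate labels are assumptions.
import numpy as np
import torch
import clip
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="download_model/sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = np.array(Image.open("images/input/example.jpg").convert("RGB"))
masks = mask_generator.generate(image)   # list of dicts: 'segmentation', 'bbox' (XYWH), 'area', ...

# Label the largest mask by cropping its bounding box and scoring it against text prompts.
clip_model, clip_preprocess = clip.load("ViT-B/32", device="cpu")
candidates = ["a person", "a horse", "a dog", "a tree", "the sky"]   # hypothetical label set
largest = max(masks, key=lambda m: m["area"])
x, y, w, h = (int(v) for v in largest["bbox"])
crop = Image.fromarray(image[y:y + h, x:x + w])

with torch.no_grad():
    img_feat = clip_model.encode_image(clip_preprocess(crop).unsqueeze(0))
    txt_feat = clip_model.encode_text(clip.tokenize(candidates))
img_feat /= img_feat.norm(dim=-1, keepdim=True)
txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
print(candidates[(img_feat @ txt_feat.T).argmax().item()])   # best-matching zero-shot label
```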
- Model: YOLOv8 (Ultralytics).
- Strength: Real-time detection with bounding boxes & class labels.
- Use Case: Benchmarked against semantic and panoptic methods.
- Extra: Applied to both original and SAM-segmented images.
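A minimal sketch of YOLOv8 inference via the `ultralytics` API; the nano checkpoint `yolov8n.pt` and the image path are placeholders.

```python
# Sketch: YOLOv8 detection with the ultralytics API.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                             # weights auto-download on first use
results = model("images/input/example.jpg")            # one Results object per image

for box in results[0].boxes:
    label = results[0].names[int(box.cls)]             # class name
    conf = float(box.conf)                             # confidence score
    x1, y1, x2, y2 = box.xyxy[0].tolist()              # bounding box in pixel coordinates
    print(label, round(conf, 2), [round(v, 1) for v in (x1, y1, x2, y2)])
```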
- Pipeline:
  - GroundingDINO: Text-prompt-based detection (e.g., “a person on a horse”).
  - SAM: Precise segmentation masks for the detected boxes.
- Benefit: Promptable, zero-shot segmentation without retraining.
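A minimal sketch of how the two stages can be chained; the config/checkpoint paths, thresholds, and CPU device are assumptions (the prompt reuses the example above), not the notebook's exact settings.

```python
# Sketch: GroundingDINO text-prompted detection -> SAM box-prompted segmentation.
# Paths under download_model/ and the thresholds are assumptions.
import torch
from torchvision.ops import box_convert
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor

dino = load_model("download_model/GroundingDINO_SwinT_OGC.py",
                  "download_model/groundingdino_swint_ogc.pth", device="cpu")
sam = sam_model_registry["vit_h"](checkpoint="download_model/sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Stage 1: text-prompted detection.
image_source, image = load_image("images/input/example.jpg")   # (numpy RGB, normalized tensor)
boxes, logits, phrases = predict(model=dino, image=image,
                                 caption="a person on a horse",
                                 box_threshold=0.35, text_threshold=0.25,
                                 device="cpu")

# Stage 2: segment each detected box with SAM (normalized cxcywh -> absolute xyxy).
h, w, _ = image_source.shape
boxes_xyxy = box_convert(boxes * torch.tensor([w, h, w, h]),
                         in_fmt="cxcywh", out_fmt="xyxy")
predictor.set_image(image_source)
masks, scores, _ = predictor.predict(box=boxes_xyxy[0].numpy(), multimask_output=False)
```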
| Category | Tools / Frameworks |
|---|---|
| Core Frameworks | PyTorch, TorchVision |
| Models & Architectures | YOLOv8 (ultralytics), SAM, DeepLabV3, GroundingDINO, CLIP, Transformers |
| Utilities & Processing | Pillow, OpenCV, NumPy, Matplotlib, requests, supervision |
| Environment | Jupyter Notebook, pip |
- Type: Curated collection of 7 diverse real-world images.
- Location: `images/input/`
- Scenes: Group gatherings, action shots, wildlife, landscapes.
- Purpose: Test generalization and robustness across varied, non-benchmark examples.
git clone <repository-url>
cd <repository-name>

pip install torch torchvision pillow matplotlib
pip install git+https://github.com/facebookresearch/segment-anything.git
pip install ultralytics
pip install groundingdino-py supervision

- Place images in `images/input/`
- First run will auto-download pretrained model weights into `download_model/`
Open and execute `segmention_yolo_deepleb.ipynb` cell by cell.
Outputs are saved in:
- `images/deeplab_segmented/` — DeepLabV3
- `images/segmented/` — SAM
- `images/yolo/` — YOLOv8
- `images/groundingdino_sam/` — GroundingDINO + SAM
| Model | Notable Strengths | Example Performance |
|---|---|---|
| DeepLabV3 | Strong pixel-level segmentation | 7 images in 4.63s on GPU (0.66s/image) |
| SAM | Extremely high-quality masks | 88 masks in 53.48s on CPU (heavy model) |
| YOLOv8 | Real-time detection | Performance depends on hardware |
| GroundingDINO + SAM | Flexible, prompt-based segmentation | Accurate zero-shot results |
- Learned practical trade-offs between semantic, panoptic, and real-time approaches.
- Composing GroundingDINO + SAM revealed the power of multi-model pipelines.
- SAM + CLIP integration turned a class-agnostic model into a zero-shot labeling tool.
- Running heavy models like SAM (`vit-h`) on CPU highlighted the importance of balancing accuracy and compute resources.
Mehran Asgari 📧 [email protected] 🌐 GitHub Profile
Licensed under the Apache 2.0 License — see LICENSE for details.