A modular, extensible framework for language‑guided embodied navigation that uses a Vision‑Language Model (VLM), tested with Qwen2.5‑VL, to translate raw simulator observations into high‑level actions.
- 🚀 Overview
- 🖼 Architecture
- ✨ Features
- 🚦 Quick Start
- 🛠 Configuration
- 🤝 Contributing
- 📄 License
- 📚 Citation
This project implements a plug‑and‑play navigation loop composed of four interchangeable modules:
- Visual Interpreter: extracts scene representations (objects, depth) from raw simulator observations.
- VLM Agent (Qwen2.5‑VL): receives visual inputs plus historical context and generates natural‑language‑grounded navigation actions.
- Action Interpreter: converts high‑level action tokens into simulator API calls.
- Simulator Wrapper: provides a unified interface to the 3D environment.
A History Manager persistently stores timestep metadata.
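As a rough illustration of what persistent timestep storage could look like, here is a minimal sketch that appends one JSON line per step and reloads earlier steps on startup. The names `StepRecord`, `HistoryManager`, `recent`, and `as_prompt`, the record fields, and the JSON Lines file format are all assumptions made for this example; the project's actual History Manager may be organized differently.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List


@dataclass
class StepRecord:
    """Hypothetical per-timestep metadata; the real fields may differ."""
    step: int
    observation_summary: str
    action: str


class HistoryManager:
    """Illustrative history store backed by a JSON Lines file."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        self.records: List[StepRecord] = []
        # Reload steps written by an earlier run so the history persists.
        if self.path.exists():
            with self.path.open(encoding="utf-8") as f:
                for line in f:
                    if line.strip():
                        self.records.append(StepRecord(**json.loads(line)))

    def append(self, record: StepRecord) -> None:
        """Keep the record in memory and append it to the file."""
        self.records.append(record)
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(record)) + "\n")

    def recent(self, n: int) -> List[StepRecord]:
        """Return the last n records (empty list for n <= 0)."""
        return self.records[-n:] if n > 0 else []

    def as_prompt(self, n: int = 5) -> str:
        """Render the last n steps as plain text for a VLM prompt."""
        return "\n".join(
            f"step {r.step}: saw {r.observation_summary}; did {r.action}"
            for r in self.recent(n)
        )
```

In this sketch, the agent would call `as_prompt` before each decision and `append` after it, which is one simple way to give the VLM a short window of past steps.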
- Simulator → Visual Interpreter
- Visual Interpreter → VLM Agent
- VLM Agent → Action Interpreter
- Action Interpreter → Simulator
The History Manager maintains bidirectional context with the VLM Agent.
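The data flow above can be sketched as a single loop. This is a toy illustration only: `ToySimulator`, `interpret_visual`, `stub_vlm_agent`, `interpret_action`, `run_episode`, and the action strings are hypothetical stand-ins, and the rule-based "agent" replaces the real Qwen2.5‑VL call so the example runs without a model or Habitat-Lab.

```python
from typing import Dict, List

# Hypothetical action vocabulary; the real project may use different tokens.
ACTION_TO_COMMAND = {"move forward": "MOVE_FORWARD", "stop": "STOP"}


class ToySimulator:
    """Stand-in for the Simulator Wrapper: a 1-D corridor with a goal cell."""

    def __init__(self, goal: int) -> None:
        self.position = 0
        self.goal = goal

    def observe(self) -> Dict[str, int]:
        return {"position": self.position, "goal": self.goal}

    def step(self, command: str) -> bool:
        """Apply a command; return True when the episode has ended."""
        if command == "MOVE_FORWARD":
            self.position += 1
            return False
        return True  # any other command (here: STOP) ends the episode


def interpret_visual(obs: Dict[str, int]) -> Dict[str, int]:
    """Visual Interpreter stand-in: reduce raw observations to a scene summary."""
    return {"distance_to_goal": obs["goal"] - obs["position"]}


def stub_vlm_agent(scene: Dict[str, int], history: List[dict]) -> str:
    """VLM Agent stand-in: a fixed rule instead of a model call."""
    return "stop" if scene["distance_to_goal"] <= 0 else "move forward"


def interpret_action(action: str) -> str:
    """Action Interpreter stand-in: map an action phrase to a simulator command."""
    return ACTION_TO_COMMAND[action]


def run_episode(sim: ToySimulator, max_steps: int = 20) -> List[dict]:
    """Simulator -> Visual Interpreter -> VLM Agent -> Action Interpreter -> Simulator."""
    history: List[dict] = []
    for t in range(max_steps):
        scene = interpret_visual(sim.observe())
        action = stub_vlm_agent(scene, history)  # the agent reads the history
        history.append({"step": t, "scene": scene, "action": action})  # and extends it
        if sim.step(interpret_action(action)):
            break
    return history
```

Because each stage is an ordinary function or object with a narrow interface, swapping one module (for example, replacing `stub_vlm_agent` with a real model call) leaves the rest of the loop untouched.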
- 🔄 Modular design: swap any vision model, simulator, or planner in or out.
- 📊 Persistent memory: builds a step history for memory and reflection.
- 💬 Natural‑language actions: driven by a state‑of‑the‑art VLM (Qwen2.5‑VL).
- ⚙️ Simulator‑agnostic interface, currently wrapping Habitat-Lab.
Create a .local.yaml file first, then run the evaluation:
python run.py
All hyperparameters live in .local.yaml. Key sections:
mp3d_habitat_scene_dataset_path: "<your path>/mp3d/"
r2r_dataset_path: "<your path>/R2R_VLNCE_v1-3/val_unseen/val_unseen.json.gz"
eval_config: 'r2r_eval.yaml'
success_distance: 3
- Fork → Clone → Create feature branch
- Add tests for new modules
- Submit PR → Review → Merge
This project is MIT Licensed. See LICENSE for details.
If you find our work helpful, please cite it:
@misc{oobvlm,
title = {Plug‑and‑Play Navigation Framework using Vision‑Language Model},
url = {https://github.com/YichengDuan/oobvlm},
author = {Yicheng Duan and Kaiyu Tang},
month = {April},
year = {2025}
}
