🔌 Plug‑and‑Play Navigation Framework using Vision‑Language Model

A modular, extensible framework for language‑guided embodied navigation that leverages a Vision‑Language Model (VLM) (tested with Qwen2.5‑VL) to translate raw simulator observations into high‑level actions.

📖 Table of Contents

🚀 Overview
🖼 Architecture
✨ Features
🚦 Quick Start
🛠 Configuration
🤝 Contributing
📄 License
📚 Citation

🚀 Overview

This project implements a plug‑and‑play navigation loop composed of four interchangeable modules:

Visual Interpreter — Extracts scene representations (objects, depth) from raw simulator observations.
VLM Agent (Qwen2.5‑VL) — Receives visual inputs plus historical context and generates natural‑language‑grounded navigation actions.
Action Interpreter — Converts high‑level action tokens into simulator API calls.
Simulator Wrapper — Provides a unified interface to the 3D environment.

A History Manager persistently stores timestep metadata.
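As a rough illustration of what storing timestep metadata could look like, here is a minimal in-memory sketch. The names `StepRecord`, `HistoryManager`, `add`, `recent`, and `as_prompt_context` are hypothetical and are not taken from this repository; writing the records to disk is left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class StepRecord:
    """Metadata for one timestep (hypothetical fields, not the project's schema)."""
    step: int
    observation_summary: str
    action: str
    extra: dict[str, Any] = field(default_factory=dict)


class HistoryManager:
    """Append-only, in-memory store of step records (illustrative only)."""

    def __init__(self) -> None:
        self._records: list[StepRecord] = []

    def add(self, observation_summary: str, action: str, **extra: Any) -> StepRecord:
        # The step index is simply the number of records stored so far.
        record = StepRecord(len(self._records), observation_summary, action, dict(extra))
        self._records.append(record)
        return record

    def recent(self, n: int) -> list[StepRecord]:
        """Return the last n records, oldest first."""
        return self._records[-n:] if n > 0 else []

    def as_prompt_context(self, n: int = 5) -> str:
        """Render the last n steps as plain text that could be fed to a VLM prompt."""
        return "\n".join(
            f"step {r.step}: saw {r.observation_summary}; did {r.action}"
            for r in self.recent(n)
        )
```

Capping the context to the last few steps is one simple way to keep the prompt short; a real implementation might summarize older steps instead.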

🖼 Architecture

Simulator → Visual Interpreter
Visual Interpreter → VLM Agent
VLM Agent → Action Interpreter
Action Interpreter → Simulator

History Manager maintains bidirectional context with VLM Agent.
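The data flow above can be sketched as a single loop. This is a toy illustration under stated assumptions, not the project's code: `Simulator`, `run_episode`, `ToyCorridor`, and the action names are all hypothetical, and a rule-based function stands in for the VLM.

```python
from __future__ import annotations

from typing import Callable, Protocol


class Simulator(Protocol):
    """Hypothetical wrapper interface; the real one may differ."""
    def observe(self) -> dict: ...
    def step(self, action: str) -> bool: ...  # True when the episode ends


def run_episode(
    sim: Simulator,
    interpret: Callable[[dict], str],         # Visual Interpreter
    decide: Callable[[str, list[str]], str],  # VLM Agent (sees scene + history)
    to_sim_action: Callable[[str], str],      # Action Interpreter
    max_steps: int = 50,
) -> list[str]:
    history: list[str] = []                   # stands in for the History Manager
    for _ in range(max_steps):
        scene = interpret(sim.observe())
        token = decide(scene, history)
        action = to_sim_action(token)
        history.append(f"{scene} -> {action}")
        if sim.step(action):
            break
    return history


class ToyCorridor:
    """Stand-in simulator: a 1D corridor with the goal 3 steps ahead."""
    def __init__(self) -> None:
        self.pos = 0

    def observe(self) -> dict:
        return {"distance_to_goal": 3 - self.pos}

    def step(self, action: str) -> bool:
        if action == "move_forward":
            self.pos += 1
        return action == "stop"


def interpret(obs: dict) -> str:
    return f"goal {obs['distance_to_goal']}m ahead"


def decide(scene: str, history: list[str]) -> str:
    # A real agent would query the VLM here; this rule is a placeholder.
    return "STOP" if scene.startswith("goal 0m") else "FORWARD"


def to_sim_action(token: str) -> str:
    return {"FORWARD": "move_forward", "STOP": "stop"}[token]


if __name__ == "__main__":
    for line in run_episode(ToyCorridor(), interpret, decide, to_sim_action):
        print(line)
```

Because each stage is passed in as a plain callable, any one of them can be replaced without touching the loop, which is the property the modular design aims for.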

✨ Features

🔄 Modular design — swap any vision model, simulator, or planner in or out.
📊 Persistent memory — builds a step history for memory and reflection.
💬 Natural‑language actions — driven by a state‑of‑the‑art VLM (Qwen2.5‑VL).
⚙️ Simulator‑agnostic — a unified wrapper interface, currently used with Habitat‑Lab.

🚦 Quick Start

To run the evaluation:

python run.py

🛠 Configuration

Create a .local.yaml file first; all hyperparameters live in it. Key sections:

mp3d_habitat_scene_dataset_path: "<your path>/mp3d/"
r2r_dataset_path: "<your path>/R2R_VLNCE_v1-3/val_unseen/val_unseen.json.gz"
eval_config: 'r2r_eval.yaml'
success_distance: 3
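Once the YAML has been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`), the keys above could be validated as in the sketch below. `EvalSettings`, `from_mapping`, and `is_success` are hypothetical names, and reading `success_distance` as a goal-distance threshold (an episode succeeds if the agent stops within that distance of the goal) is an assumption rather than something this README states.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Mapping

REQUIRED_KEYS = (
    "mp3d_habitat_scene_dataset_path",
    "r2r_dataset_path",
    "eval_config",
    "success_distance",
)


@dataclass(frozen=True)
class EvalSettings:
    """Typed view of the four keys shown above (hypothetical helper)."""
    mp3d_habitat_scene_dataset_path: str
    r2r_dataset_path: str
    eval_config: str
    success_distance: float

    @classmethod
    def from_mapping(cls, raw: Mapping[str, Any]) -> "EvalSettings":
        # Fail early with a clear message if a key is missing.
        missing = [k for k in REQUIRED_KEYS if k not in raw]
        if missing:
            raise KeyError(f"missing config keys: {', '.join(missing)}")
        distance = float(raw["success_distance"])
        if distance <= 0:
            raise ValueError("success_distance must be positive")
        return cls(
            str(raw["mp3d_habitat_scene_dataset_path"]),
            str(raw["r2r_dataset_path"]),
            str(raw["eval_config"]),
            distance,
        )

    def is_success(self, final_distance_to_goal: float) -> bool:
        # Assumed semantics: success means stopping within the threshold.
        return final_distance_to_goal <= self.success_distance


if __name__ == "__main__":
    settings = EvalSettings.from_mapping({
        "mp3d_habitat_scene_dataset_path": "<your path>/mp3d/",
        "r2r_dataset_path": "<your path>/R2R_VLNCE_v1-3/val_unseen/val_unseen.json.gz",
        "eval_config": "r2r_eval.yaml",
        "success_distance": 3,
    })
    print(settings.is_success(2.4), settings.is_success(3.5))
```

Validating up front means a missing or misspelled key is reported before any scene data is loaded.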

🤝 Contributing

Fork → Clone → Create feature branch
Add tests for new modules
Submit PR → Review → Merge

📄 License

This project is MIT Licensed. See LICENSE for details.

📚 Citation

If you find our work helpful, please consider citing it:

@misc{oobvlm,
    title = {Plug‑and‑Play Navigation Framework using Vision‑Language Model},
    url = {https://github.com/YichengDuan/oobvlm},
    author = {Yicheng Duan and Kaiyu Tang},
    month = {April},
    year = {2025}
}
