This repository contains the official implementation of the paper Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM.
📄 Paper: [arXiv:2505.17726](https://arxiv.org/abs/2505.17726)
We provide a Conda configuration file to easily set up the environment:
```bash
conda env create -f slot_mllm.yaml
conda activate slot_mllm
```

Pretrained weights are available for download:

- Slot Q-Former Weights: KU-AGI/Slot_Q-Former
- Slot-MLLM Weights: KU-AGI/Slot-MLLM-7B-instruct | KU-AGI/Slot-MLLM-14B-instruct
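
The repository IDs above appear to be Hugging Face Hub model repositories. If so, the checkpoints can be fetched ahead of time with `huggingface_hub`; the sketch below is only an illustration, and the `checkpoints/` target directory is an arbitrary choice, not something the inference scripts require.

```python
# Minimal sketch: pre-download the released checkpoints from the Hugging Face Hub.
# Assumes the repo IDs listed above are Hub model repositories; adjust local_dir
# to wherever you want the weights to live.
from huggingface_hub import snapshot_download

for repo_id in ["KU-AGI/Slot_Q-Former", "KU-AGI/Slot-MLLM-7B-instruct"]:
    local_path = snapshot_download(
        repo_id=repo_id,
        local_dir=f"checkpoints/{repo_id.split('/')[-1]}",
    )
    print(f"Downloaded {repo_id} to {local_path}")
```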
Run the following command for tokenizer inference:

```bash
python inference_tokenizer.py
```

Run the following command to perform each task:
```bash
# Image Captioning
python inference_mllm.py --image_path=sample_data/understanding_input_img.jpg [--is_14b]

# Visual Question Answering
python inference_mllm.py --image_path=sample_data/understanding_input_img.jpg --prompt="What color is the small animal?" [--is_14b]

# Text-to-Image Generation
python inference_mllm.py --prompt="A red bicycle against a blue wall." --generation [--is_14b]

# Image Editing
python inference_mllm.py --image_path=sample_data/edit_input_img.png --prompt="leave only one cherry on top." --generation [--is_14b]
```
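
The same CLI can also be driven from Python for batch runs. Below is a minimal sketch using `subprocess` with the flags shown above; how and where each script saves its outputs is not documented here, so treat that part as an assumption and check the scripts themselves.

```python
# Minimal sketch: run inference_mllm.py for each task programmatically.
# Flag names come from the commands above; output handling is an assumption.
import subprocess

TASKS = [
    # (task name, CLI arguments)
    ("captioning", ["--image_path=sample_data/understanding_input_img.jpg"]),
    ("vqa", ["--image_path=sample_data/understanding_input_img.jpg",
             "--prompt=What color is the small animal?"]),
    ("text-to-image", ["--prompt=A red bicycle against a blue wall.", "--generation"]),
    ("editing", ["--image_path=sample_data/edit_input_img.png",
                 "--prompt=leave only one cherry on top.", "--generation"]),
]

for name, args in TASKS:
    cmd = ["python", "inference_mllm.py", *args]  # append "--is_14b" to use the 14B model
    print(f"[{name}] running: {' '.join(cmd)}")
    subprocess.run(cmd, check=True)
```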
Slot-MLLM is designed to perform multimodal understanding and image generation tasks. To ensure responsible use, users are advised to adhere to the following:

- Ethical Use: Use Slot-MLLM only for ethical applications, clearly disclose generated content, and avoid biased or inappropriate data.
- Validation: Always validate and manually inspect generated outputs, particularly in sensitive or public-facing contexts.
- Transparency: Clearly communicate when outputs are AI-generated.