muktac5/Visual-Goal-Guidance

Visual Goal-Guidance: Step Inference for Instruction Retrieval

Description: Given a text-based goal and a set of images, the idea is to retrieve all the images that correspond to steps leading up to the goal, and then to arrange them in order so that each image is followed by the next best action.
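The two stages described above (retrieval, then ordering) can be sketched roughly as follows. This is a minimal illustration and not the code from the notebooks: the function names, the 0.5 threshold, the greedy ordering strategy, and all scores are hypothetical, and in practice the scores would come from a model rather than hand-written dictionaries.

```python
from typing import Dict, List, Tuple


def retrieve_steps(relevance: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Keep every image whose goal-relevance score meets the (assumed) threshold."""
    return [image for image, score in relevance.items() if score >= threshold]


def order_steps(
    images: List[str],
    start: Dict[str, float],
    follows: Dict[Tuple[str, str], float],
) -> List[str]:
    """Greedy ordering: pick the most likely first step, then repeatedly
    append the remaining image with the highest 'next best action' score."""
    remaining = list(images)
    if not remaining:
        return []
    current = max(remaining, key=lambda image: start.get(image, 0.0))
    ordered = [current]
    remaining.remove(current)
    while remaining:
        best = max(remaining, key=lambda image: follows.get((current, image), 0.0))
        ordered.append(best)
        remaining.remove(best)
        current = best
    return ordered


# Hypothetical scores for the goal "make a salad".
relevance = {"toss.jpg": 0.7, "car.jpg": 0.1, "wash.jpg": 0.9, "chop.jpg": 0.8}
start = {"wash.jpg": 0.9, "chop.jpg": 0.3, "toss.jpg": 0.1}
follows = {
    ("wash.jpg", "chop.jpg"): 0.8,
    ("wash.jpg", "toss.jpg"): 0.2,
    ("chop.jpg", "toss.jpg"): 0.9,
    ("chop.jpg", "wash.jpg"): 0.1,
}

steps = retrieve_steps(relevance)
ordered = order_steps(steps, start, follows)
print(ordered)
```

Greedy selection is only one possible ordering strategy; it is used here because it mirrors the "next best action" wording of the description.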

Existing Implementation: Given a textual goal and 4 images, identify the 1 image that corresponds to a step leading up to the goal. Link: https://aclanthology.org/2021.emnlp-main.165.pdf
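A 4-way choice of this kind is often framed as picking the candidate whose embedding is most similar to the goal embedding. The sketch below illustrates that framing only; it is not taken from the linked paper or from this repository, and the helper names and toy vectors are assumptions standing in for real text and image encoders.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_step_image(goal_vec: Sequence[float], image_vecs: List[Sequence[float]]) -> int:
    """Return the index of the candidate image most similar to the goal."""
    scores = [cosine(goal_vec, vec) for vec in image_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings (hypothetical); a real system would obtain these from encoders.
goal = [1.0, 0.0, 1.0]
candidates = [
    [0.0, 1.0, 0.0],   # unrelated image
    [1.0, 0.0, 0.9],   # image showing a step toward the goal
    [-1.0, 0.0, 0.0],  # unrelated image
    [0.2, 1.0, 0.1],   # unrelated image
]

choice = pick_step_image(goal, candidates)
print(choice)
```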

Our primary implementation lives mainly in the following notebooks: "Goal_step_relevance_LLava.ipynb", "Intent_data_prep_for_step_ordering", and "Step_ordering.ipynb".

We first explored several models, including ViT and CLIP. We then moved to an end-to-end trained large multimodal model that connects a vision encoder with Vicuna for general visual and language understanding.

Notably, this approach performed well without requiring prior training with a smaller model, in comparison with our custom integration of an LLM and CLIP using Microsoft GIT. Intent detection from images remains challenging, however, for example when distinguishing between visually similar objects such as celery, scallion, and asparagus.
