Hi,
I am currently working on applying the LRV-Instruction mitigation method to the LLaVA model on the POPE benchmark. The questions in the POPE benchmark are all binary (Yes/No), for example:
{"question_id": 1, "image": "COCO_val2014_000000016631.jpg", "text": "Is there a person in the image?", "label": "yes"}
{"question_id": 2, "image": "COCO_val2014_000000016631.jpg", "text": "Is there a refrigerator in the image?", "label": "no"}
Is it possible to use LRV-Instruction to further improve LLaVA's performance on these types of binary questions?
If so, could you provide guidance on how to implement it?
Below is my current implementation, which uses LLaVA to generate an answer for each image/question pair:
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)
processor = AutoProcessor.from_pretrained(model_id)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": question_text},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to(0, torch.float16)
output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
# Decode only the newly generated tokens so the prompt is not included in the answer.
answer_text = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
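For scoring on POPE, I currently map the free-form generated text to a binary label with a small helper like the one below. This is just an illustrative sketch I wrote (the function name `to_pope_label` is my own; it is not part of LRV-Instruction or the POPE toolkit), assuming the model's answer starts with or contains "yes"/"no":

```python
import re

def to_pope_label(answer_text: str) -> str:
    """Map a free-form model answer to a binary POPE label ("yes"/"no").

    Illustrative helper, not part of LRV-Instruction or POPE itself.
    """
    # Take the first word of the answer, lowercased.
    words = re.split(r"\W+", answer_text.strip().lower(), maxsplit=1)
    first_word = words[0] if words else ""
    if first_word in ("yes", "no"):
        return first_word
    # Fall back to scanning the whole answer for an affirmative.
    return "yes" if "yes" in answer_text.lower() else "no"
```

For example, `to_pope_label("Yes, there is a person in the image.")` returns `"yes"`, which can then be compared against the `label` field in the POPE JSON.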