Multimodal large language models for zero-shot real-world classification tasks: benchmark, taxonomy of prompting methods, and application to human-object interaction recognition and detection

Rabadessa Alcaide, Oriol

dc.contributor

Universitat Politècnica de Catalunya. Universitat de Barcelona

dc.contributor

Escalera Guerrero, Sergio

dc.contributor.author

Rabadessa Alcaide, Oriol

dc.date.issued

2025-05-14

dc.identifier

https://hdl.handle.net/2117/430265

dc.identifier

192092

dc.description.abstract

Multimodal Large Language Models (MLLMs) excel as zero-shot reasoners across diverse domains. However, their application to real-world classification tasks, particularly in direct comparison with specialized models, remains underexplored. This work explores how MLLMs can be leveraged for zero-shot Human-Object Interaction (HOI) recognition and detection using token probability outputs. We first benchmark lightweight MLLMs, identifying Qwen2-VL and MiniCPM-V as the most effective families for HOI. We then perform a comprehensive comparison of zero-shot strategies applicable to this task and propose a taxonomy of zero-shot approaches that integrates textual and visual prompting strategies. Our analysis on the HICO dataset reveals that Objects as Context boosts performance for multi-image-capable MLLMs, while ensembling text prompts enhances robustness. On the HICO-DET and V-COCO datasets, Objects as Context, Black Other Objects, and Blur the Background emerge as the strongest visual prompting methods for localization. Our approach achieves 53.50 mAP on HICO and 23.69 mAP on HICO-DET, outperforming prior zero-shot methods and remaining competitive with current state-of-the-art supervised models. Our code is made publicly available.

dc.format

application/pdf

dc.language

eng

dc.publisher

Universitat Politècnica de Catalunya

dc.rights

Open Access

dc.subject

Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial::Llenguatge natural

dc.subject

Natural language processing (Computer science)

dc.subject

Benchmarking (Management)

dc.subject

Computer software -- Verification

dc.subject

Models Massius de Llenguatge Multimodals

dc.subject

Detecció de Interaccions Persona-Objecte

dc.subject

Reconeixement de Interaccions Persona-Objecte

dc.subject

Enginyeria d'Instruccions

dc.subject

Aprenentatge Zero-shot

dc.subject

Zero-shot Learning

dc.subject

Multimodal Large Language Models

dc.subject

Human-Object Interaction Recognition

dc.subject

Human-Object Interaction Detection

dc.subject

Prompt Engineering

dc.subject

Tractament del llenguatge natural (Informàtica)

dc.subject

Referenciació (Economia)

dc.subject

Programari--Verificació

dc.title

Multimodal large language models for zero-shot real-world classification tasks: benchmark, taxonomy of prompting methods, and application to human-object interaction recognition and detection

dc.type

Master thesis

This item appears in the following collection(s)

Treballs acadèmics [82549]