Multimodal large language models for zero-shot real-world classification tasks: benchmark, taxonomy of prompting methods, and application to human-object interaction recognition and detection

Other authors

Universitat Politècnica de Catalunya. Universitat de Barcelona

Escalera Guerrero, Sergio

Publication date

2025-05-14

Abstract

Multimodal Large Language Models (MLLMs) excel as zero-shot reasoners across diverse domains. However, their application to real-world classification tasks, particularly in direct comparison with specialized models, remains underexplored. This work explores how MLLMs can be leveraged for zero-shot Human-Object Interaction (HOI) recognition and detection using token probability outputs. We first benchmark lightweight MLLMs, identifying Qwen2-VL and MiniCPM-V as the most effective families for HOI. We then comprehensively compare zero-shot strategies applicable to this task and propose a taxonomy of zero-shot approaches that integrates textual and visual prompting. Our analysis on the HICO dataset reveals that Objects as Context boosts performance for multi-image-capable MLLMs, while ensembling text prompts enhances robustness. On the HICO-DET and V-COCO datasets, Objects as Context, Black Other Objects, and Blur the Background emerge as the strongest visual prompting methods for localization. Our approach achieves 53.50 mAP on HICO and 23.69 mAP on HICO-DET, outperforming prior zero-shot methods and remaining competitive with current state-of-the-art supervised models. Our code is publicly available.
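The abstract's core mechanism, scoring candidate interaction labels by the token probabilities an MLLM assigns to them, can be illustrated with a minimal sketch. This is not the thesis code: the per-token log-probabilities below are hypothetical stand-ins for what a model such as Qwen2-VL would return for each candidate label, and the length normalization and label softmax are common choices, assumed here for illustration.

```python
import math

def rank_labels(token_logprobs: dict) -> list:
    """Rank candidate labels by length-normalised token log-probability.

    token_logprobs maps each candidate label to the list of log-probs the
    model assigned to that label's tokens (hypothetical values here).
    Returns (label, probability) pairs, most probable first.
    """
    # Length-normalise so multi-token labels are not penalised.
    scores = {label: sum(lps) / len(lps) for label, lps in token_logprobs.items()}
    # Softmax over label scores to obtain a distribution over candidates.
    m = max(scores.values())
    exps = {label: math.exp(s - m) for label, s in scores.items()}
    z = sum(exps.values())
    return sorted(((label, e / z) for label, e in exps.items()),
                  key=lambda kv: kv[1], reverse=True)

# Hypothetical log-probs for three HOI candidates on one image.
fake = {
    "riding a bicycle":    [-0.2, -0.4, -0.3],
    "repairing a bicycle": [-1.5, -2.0, -1.1],
    "holding a bicycle":   [-1.0, -1.3, -0.9],
}
ranking = rank_labels(fake)
print(ranking[0][0])  # most probable interaction for this image
```

In practice the log-probabilities would come from the MLLM's output scores for each candidate label's tokens; the ranking step itself is model-agnostic.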

Document type

Master thesis

Language

English

Published by

Universitat Politècnica de Catalunya

Rights

Open Access