Multimodal large language models for zero-shot real-world classification tasks: benchmark, taxonomy of prompting methods, and application to human-object interaction recognition and detection

dc.contributor
Universitat Politècnica de Catalunya. Universitat de Barcelona
dc.contributor
Escalera Guerrero, Sergio
dc.contributor.author
Rabadessa Alcaide, Oriol
dc.date.issued
2025-05-14
dc.identifier
https://hdl.handle.net/2117/430265
dc.identifier
192092
dc.description.abstract
Multimodal Large Language Models (MLLMs) excel as zero-shot reasoners across diverse domains. However, their application to real-world classification tasks, particularly in direct comparison with specialized models, remains underexplored. This work explores how MLLMs can be leveraged for zero-shot Human-Object Interaction (HOI) recognition and detection using token probability outputs. We first benchmark lightweight MLLMs, identifying Qwen2-VL and MiniCPM-V as the most effective families for HOI. We then perform a comprehensive comparison of zero-shot strategies applicable to this task and propose a taxonomy of zero-shot approaches that integrates textual and visual prompting strategies. Our analysis on the HICO dataset reveals that Objects as Context boosts performance for multi-image-capable MLLMs, while ensembling text prompts enhances robustness. On the HICO-DET and V-COCO datasets, Objects as Context, Black Other Objects, and Blur the Background emerge as the superior visual prompting methods for localization. Our approach achieves 53.50 mAP on HICO and 23.69 mAP on HICO-DET, outperforming prior zero-shot methods and remaining competitive with current state-of-the-art supervised models. Our code is publicly available.
dc.format
application/pdf
dc.language
eng
dc.publisher
Universitat Politècnica de Catalunya
dc.rights
Open Access
dc.subject
UPC subject areas::Computer science::Artificial intelligence::Natural language
dc.subject
Natural language processing (Computer science)
dc.subject
Benchmarking (Management)
dc.subject
Computer software -- Verification
dc.subject
Zero-shot Learning
dc.subject
Multimodal Large Language Models
dc.subject
Human-Object Interaction Recognition
dc.subject
Human-Object Interaction Detection
dc.subject
Prompt Engineering
dc.title
Multimodal large language models for zero-shot real-world classification tasks: benchmark, taxonomy of prompting methods, and application to human-object interaction recognition and detection
dc.type
Master thesis