Multimodal knowledge representation for long-term agents

Omedes de Ribot, Joan

To access the full-text documents, please follow this link: https://hdl.handle.net/2117/445758

Author

Omedes de Ribot, Joan

Other authors

Universitat Politècnica de Catalunya. Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial

Fundació Eurecat

Dalmau Moreno, Magí

Rosell Gratacòs, Jan

Publication date

2025-07

Abstract

This project presents a knowledge representation framework for a robotic agent operating in a simulated, previously unmapped and unseen environment. The main objective is to enable the robot to perceive, understand, and reason about its surroundings by dynamically building and maintaining a Knowledge Graph from multimodal input that combines visual frames of the environment with object metadata obtained from the simulator’s perception layer. To achieve this, the framework uses a multi-agent architecture whose agents are built on state-of-the-art Large Language Models (LLMs) and can extract structured relational data from both text and images. This structured information is used to construct a semantic representation of the environment that evolves as the robot explores new areas. The robot then uses this accumulated knowledge to answer users' natural language queries. The entire system is deployed in a simulated environment using AI2-THOR and ROS2, with custom modules for perception, knowledge graph generation, key frame detection, and user interaction. Results show that the proposed approach enables effective knowledge accumulation and contextual reasoning in previously unknown environments. However, there is a trade-off between computation time and the accuracy of the knowledge representation, depending on how much autonomy the agents are given in constructing that knowledge. When the process is left entirely to a single LLM that ingests all the information at once, knowledge acquisition is very fast, but the resulting answers tend to be more generic, less precise, and more prone to hallucinations. In contrast, a multi-agent framework that builds a structured knowledge graph by explicitly extracting and processing semantic relations produces a much more accurate representation. Although this process takes more time, it leads to higher-quality answers from the robot and provides a much more scalable long-term solution for growing and maintaining knowledge.
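
The abstract describes a knowledge graph that grows as extracted relations arrive from new observations. As a rough illustration only, and not the thesis's implementation, the sketch below stores relations as (subject, relation, object) triples and merges each new batch while skipping facts already known. The class name, method names, and the example relations (`on_top_of`, `located_in`, `has_color`) are all assumptions made for this example; in the described system the triples would come from LLM-based agents rather than being written by hand.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One (subject, relation, object) fact. Field names are hypothetical."""
    subject: str
    relation: str
    obj: str


class KnowledgeGraph:
    """Toy in-memory graph that accumulates triples across observations."""

    def __init__(self):
        self._triples = set()                 # every known fact, deduplicated
        self._by_subject = defaultdict(set)   # index: subject -> its facts

    def add(self, triples):
        """Merge a batch of triples; return how many were not already known."""
        added = 0
        for raw in triples:
            triple = Triple(*raw)
            if triple not in self._triples:
                self._triples.add(triple)
                self._by_subject[triple.subject].add(triple)
                added += 1
        return added

    def about(self, subject):
        """Return all known facts whose subject matches, in sorted order."""
        return sorted(self._by_subject.get(subject, ()))

    def __len__(self):
        return len(self._triples)


# Two hand-written batches standing in for relations extracted from two frames.
kg = KnowledgeGraph()
first_batch = kg.add([("mug", "on_top_of", "table"),
                      ("table", "located_in", "kitchen")])
second_batch = kg.add([("mug", "on_top_of", "table"),   # already known
                       ("mug", "has_color", "red")])    # new fact
mug_facts = kg.about("mug")
```

Deduplicating on merge is one simple way to let repeated sightings of the same object leave the graph unchanged; handling facts that change over time (for example, a moved object) would need additional logic that this sketch does not attempt.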

Document type

Master thesis

Language

English

Subjects and keywords

Àrees temàtiques de la UPC::Informàtica; Intelligent agents (Computer software); Robotics; Knowledge representation (Information theory); Agents intel·ligents (Programari); Robòtica; Representació del coneixement (Teoria de la informació)

Published by

Universitat Politècnica de Catalunya

Rights

Open Access

This item appears in the following collection(s)

Treballs acadèmics [82541]
