Multimodal knowledge representation for long-term agents

dc.contributor
Universitat Politècnica de Catalunya. Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial
dc.contributor
Fundació Eurecat
dc.contributor
Dalmau Moreno, Magí
dc.contributor
Rosell Gratacòs, Jan
dc.contributor.author
Omedes de Ribot, Joan
dc.date.accessioned
2025-11-08T11:05:02Z
dc.date.available
2025-11-08T11:05:02Z
dc.date.issued
2025-07
dc.identifier
https://hdl.handle.net/2117/445758
dc.identifier
PRISMA-197395
dc.identifier.uri
https://hdl.handle.net/2117/445758
dc.description.abstract
This project presents a knowledge representation framework for a robotic agent operating in a simulated, previously unmapped and unseen environment. The main objective is to enable the robot to perceive, understand, and reason about its surroundings by dynamically building and maintaining a Knowledge Graph from multimodal input, combining visual frames from the environment with object metadata obtained from the simulator's perception layer. To achieve this, the framework uses a multi-agent architecture composed of agents built on state-of-the-art Large Language Models (LLMs) that extract structured relational data from both text and images. This structured information is used to construct a semantic representation of the environment that evolves as the robot explores new areas. The robot then uses this knowledge to interact with users, answering natural language queries from the information it has accumulated. The entire system is deployed in a simulated environment using AI2-THOR and ROS 2, with custom modules for perception, knowledge graph generation, key-frame detection, and user interaction. Results show that the proposed approach enables effective knowledge accumulation and contextual reasoning in previously unknown environments. However, there is a trade-off between computational time and the accuracy of the knowledge representation, depending on how much autonomy the agents are given in constructing it. When the process is left entirely to a single LLM that ingests all the information at once, knowledge acquisition is very fast, but the resulting answers tend to be more generic, less precise, and more prone to hallucination. In contrast, when a multi-agent framework builds a structured knowledge graph by explicitly extracting and processing semantic relations, the resulting representation is much more accurate. Although this process takes more time, it yields higher-quality answers from the robot and a far more scalable solution for growing and maintaining knowledge in the long term.
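A minimal sketch of the idea the abstract describes: relation triples extracted by agents from key frames accumulate in a knowledge graph that later grounds natural language answers. This is not the thesis's actual code; the triple format, relation names, and the `KnowledgeGraph` class are all illustrative assumptions, and the LLM extraction step is replaced by hard-coded example triples.

```python
# Illustrative sketch only: triples an extraction agent might emit are
# accumulated in a simple pattern-matchable knowledge graph.

class KnowledgeGraph:
    def __init__(self):
        self.triples = set()  # each entry is a (subject, relation, object) tuple

    def add(self, subject, relation, obj):
        self.triples.add((subject, relation, obj))

    def query(self, subject=None, relation=None, obj=None):
        # Return every triple matching all non-None fields (simple pattern match)
        return [t for t in self.triples
                if (subject is None or t[0] == subject)
                and (relation is None or t[1] == relation)
                and (obj is None or t[2] == obj)]

# Triples an agent might extract from one key frame (hypothetical values)
frame_1_triples = [
    ("mug", "is_on", "kitchen_counter"),
    ("kitchen_counter", "is_in", "kitchen"),
]

kg = KnowledgeGraph()
for s, r, o in frame_1_triples:
    kg.add(s, r, o)

# A later frame refines the graph as the robot keeps exploring
kg.add("mug", "has_color", "blue")

# Grounding a user query such as "where is the mug?" in the graph
print(kg.query(subject="mug", relation="is_on"))
```

The trade-off the abstract reports maps onto this structure: a single LLM answering directly from raw frames skips the explicit triple store (fast but hallucination-prone), whereas the multi-agent pipeline pays the extraction cost up front so that answers can be checked against stored relations.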
dc.format
application/pdf
dc.language
eng
dc.publisher
Universitat Politècnica de Catalunya
dc.rights
Open Access
dc.subject
Àrees temàtiques de la UPC::Informàtica
dc.subject
Intelligent agents (Computer software)
dc.subject
Robotics
dc.subject
Knowledge representation (Information theory)
dc.subject
Agents intel·ligents (Programari)
dc.subject
Robòtica
dc.subject
Representació del coneixement (Teoria de la informació)
dc.title
Multimodal knowledge representation for long-term agents
dc.type
Master thesis