Multimodal knowledge representation for long-term agents

dc.contributor
Universitat Politècnica de Catalunya. Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial
dc.contributor
Fundació Eurecat
dc.contributor
Dalmau Moreno, Magí
dc.contributor
Rosell Gratacòs, Jan
dc.contributor.author
Omedes de Ribot, Joan
dc.date.accessioned
2025-11-08T11:05:02Z
dc.date.available
2025-11-08T11:05:02Z
dc.date.issued
2025-07
dc.identifier
https://hdl.handle.net/2117/445758
dc.identifier
PRISMA-197395
dc.identifier.uri
https://hdl.handle.net/2117/445758
dc.description.abstract
This project presents a knowledge representation framework for a robotic agent operating in a simulated, previously unmapped and unseen environment. The main objective is to enable the robot to perceive, understand, and reason about its surroundings by dynamically building and maintaining a Knowledge Graph from multimodal input, combining visual frames from the environment with object metadata obtained from the simulator's perception layer. To achieve this, the framework uses a multi-agent architecture composed of agents built on state-of-the-art Large Language Models (LLMs) that extract structured relational data from both text and images. This structured information is used to construct a semantic representation of the environment that evolves as the robot explores new areas. The robot then uses this knowledge to interact with users, answering natural language queries from the information it has accumulated. The entire system is deployed in a simulated environment using AI2-THOR and ROS 2, with custom modules for perception, knowledge graph generation, key-frame detection, and user interaction. Results show that the proposed approach enables effective knowledge accumulation and contextual reasoning in previously unknown environments. However, there is a trade-off between computational time and the accuracy of the knowledge representation, depending on how much autonomy the agents are given in constructing it. When the process is left entirely to a single LLM that ingests all the information at once, knowledge acquisition is very fast, but the resulting answers tend to be more generic, less precise, and more prone to hallucination. In contrast, when a multi-agent framework builds a structured knowledge graph by explicitly extracting and processing semantic relations, the resulting representation is much more accurate. Although this process takes more time, it yields higher-quality answers from the robot and a far more scalable solution for growing and maintaining knowledge in the long term.
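A minimal sketch of the idea the abstract describes: relation triples extracted by agents from key frames accumulate in a knowledge graph that later grounds natural language answers. This is not the thesis's actual code; the triple format, relation names, and the `KnowledgeGraph` class are all illustrative assumptions, and the LLM extraction step is replaced by hard-coded example triples.

```python
# Illustrative sketch only: triples an extraction agent might emit are
# accumulated in a simple pattern-matchable knowledge graph.

class KnowledgeGraph:
    def __init__(self):
        self.triples = set()  # each entry is a (subject, relation, object) tuple

    def add(self, subject, relation, obj):
        self.triples.add((subject, relation, obj))

    def query(self, subject=None, relation=None, obj=None):
        # Return every triple matching all non-None fields (simple pattern match)
        return [t for t in self.triples
                if (subject is None or t[0] == subject)
                and (relation is None or t[1] == relation)
                and (obj is None or t[2] == obj)]

# Triples an agent might extract from one key frame (hypothetical values)
frame_1_triples = [
    ("mug", "is_on", "kitchen_counter"),
    ("kitchen_counter", "is_in", "kitchen"),
]

kg = KnowledgeGraph()
for s, r, o in frame_1_triples:
    kg.add(s, r, o)

# A later frame refines the graph as the robot keeps exploring
kg.add("mug", "has_color", "blue")

# Grounding a user query such as "where is the mug?" in the graph
print(kg.query(subject="mug", relation="is_on"))
```

The trade-off the abstract reports maps onto this structure: a single LLM answering directly from raw frames skips the explicit triple store (fast but hallucination-prone), whereas the multi-agent pipeline pays the extraction cost up front so that answers can be checked against stored relations.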
dc.format
application/pdf
dc.language
eng
dc.publisher
Universitat Politècnica de Catalunya
dc.rights
Open Access
dc.subject
Àrees temàtiques de la UPC::Informàtica
dc.subject
Intelligent agents (Computer software)
dc.subject
Robotics
dc.subject
Knowledge representation (Information theory)
dc.subject
Agents intel·ligents (Programari)
dc.subject
Robòtica
dc.subject
Representació del coneixement (Teoria de la informació)
dc.title
Multimodal knowledge representation for long-term agents
dc.type
Master thesis