Clustering and topic modeling for biomedical text mining

Other authors

Universitat de Barcelona. Departament de Genètica, Microbiologia i Estadística

Reverter Comes, Ferran

Vegas Lozano, Esteban

Publication date

2021-06

Abstract

In this work, we study the problem of characterizing an unlabelled corpus of biomedical documents in an unsupervised manner. After a review of the literature on the subject, we propose an integrative approach to the problem. The integration is twofold. On the one hand, we use multiview learning to integrate different text representations derived from a traditional bag-of-words model, Latent Dirichlet Allocation, and a recurrent neural autoencoder. On the other hand, we integrate topic modeling outputs, clustering outputs, and biomedical word embeddings to generate an intuitive and comprehensive characterization of the corpus. We also propose a semantic graph that provides a summary visualization of the relationships between topics, clusters, and any other biomedical concept, based on semantic similarity. An application to the CORD-19 dataset, a collection of articles on COVID-19, shows that our methodology produces a coherent, meaningful, and informative characterization of the corpus.

Document Type

Master thesis

Language

English

Publisher

Universitat Politècnica de Catalunya

Universitat de Barcelona

Rights

Restricted access - author's decision
