Universitat de Barcelona. Departament de Genètica, Microbiologia i Estadística
Reverter Comes, Ferran
Vegas Lozano, Esteban
2021-06
In this work, we study the problem of characterizing an unlabelled corpus of biomedical documents in an unsupervised manner. After reviewing the literature on the subject, we propose an integrative approach to the problem. The integration is twofold. On the one hand, we use multiview learning to integrate different text representations derived from a traditional bag-of-words model, Latent Dirichlet Allocation, and a recurrent neural autoencoder. On the other hand, we integrate topic modeling outputs, clustering outputs, and biomedical word embeddings to generate an intuitive and comprehensive characterization of the corpus. We also propose a semantic graph that provides a synthetic visualization of the relationships, based on semantic similarity, between topics, clusters, and any other biomedical concept. An application to the CORD-19 dataset, a collection of articles on COVID-19, shows that our methodology produces a coherent, meaningful, and informative characterization of the corpus.
Master's thesis
English
UPC subject areas::Mathematics and statistics::Mathematical statistics; Statistical Mathematics -- Applications; Text mining; Document clustering; Topic modeling; Word embeddings; Biomedical text mining; AMS classification::62 Statistics::62P Applications
Universitat Politècnica de Catalunya
Universitat de Barcelona
Restricted access - author's decision