Title:
|
Deduplication of Universitat de Lleida scholarly data
|
Author:
|
Berga Gatius, Albert
|
Other authors:
|
García González, Roberto; Universitat de Lleida. Escola Politècnica Superior |
Notes:
|
In this project we used data science tools and techniques to detect duplicated data in the GREC repository, which contains information about the articles published by University of Lleida staff. We used locality-sensitive hashing (LSH) to group articles so that those most likely to be duplicates are assigned to the same group. We then compared the articles within each group pairwise to determine which pairs refer to the same article. |
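The two-stage approach described in the notes (LSH grouping, then pairwise comparison within each group) can be illustrated with a minimal sketch. It is not the thesis implementation: it uses plain Python rather than Spark, assumes MinHash over character trigrams of article titles as the LSH family, and every function name, parameter, and threshold below is hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a case/whitespace-normalized title."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(1, len(t) - k + 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash(tokens, num_perm=32):
    """MinHash signature: the minimum hash of the token set under each seed."""
    return [min(_hash(seed, tok) for tok in tokens) for seed in range(num_perm)]


def candidate_pairs(records, num_perm=32, bands=16):
    """Stage 1 (assumed banding scheme): records that share a band land in one bucket."""
    rows = num_perm // bands
    buckets = defaultdict(set)
    for rec_id, title in records.items():
        sig = minhash(shingles(title), num_perm)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(rec_id)
    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def find_duplicates(records, threshold=0.8):
    """Stage 2: compare only the candidate pairs and keep those above the threshold."""
    dups = []
    for a, b in sorted(candidate_pairs(records)):
        if jaccard(shingles(records[a]), shingles(records[b])) >= threshold:
            dups.append((a, b))
    return dups
```

Calling `find_duplicates` on a dictionary that maps record identifiers to titles returns the identifier pairs judged to be duplicates. Only records sharing at least one bucket are compared, which avoids comparing every record against every other; the 0.8 threshold is an illustrative value, not one taken from the thesis.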
Subject(s):
|
-Spark -Big data -Data mining -Data science -Macrodades -Mineria de dades |
Rights:
|
cc-by-nc-nd
http://creativecommons.org/licenses/by-nc-nd/4.0/
|
Document type:
|
masterThesis |