DS-Prox : dataset proximity mining for governing the data lake

dc.contributor
Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació
dc.contributor
Universitat Politècnica de Catalunya. inSSIDE - integrated Software, Service, Information and Data Engineering
dc.contributor
Universitat Politècnica de Catalunya. IMP - Information Modeling and Processing
dc.contributor.author
Al-serafi, Ayman Mounir Mohamed
dc.contributor.author
Calders, Toon
dc.contributor.author
Abelló Gamazo, Alberto
dc.contributor.author
Romero Moral, Óscar
dc.date.issued
2017
dc.identifier
Al-serafi, A., Calders, T., Abello, A., Romero, O. DS-Prox : dataset proximity mining for governing the data lake. In: The International Conference on Similarity Search and Applications. "Similarity Search and Applications: 10th International Conference, SISAP 2017: Munich, Germany, October 4-6, 2017: proceedings". Berlin: Springer, 2017, p. 284-299.
dc.identifier
978-3-319-68474-1
dc.identifier
https://hdl.handle.net/2117/117036
dc.identifier
10.1007/978-3-319-68474-1_20
dc.description.abstract
With the arrival of Data Lakes (DL), there is an increasing need for efficient dataset classification to support data analysis and information retrieval. Our goal is to use meta-features describing datasets to detect whether they are similar. We utilise a novel proximity mining approach to assess the similarity of datasets. The proximity scores are used as an efficient first step, where pairs of datasets with high proximity are selected for further time-consuming schema matching and deduplication. The proposed approach helps prune unnecessary computations early, thus improving the efficiency of similar-schema search. We evaluate our approach in experiments using the OpenML online DL, which show significant efficiency gains above 25% compared to matching without early pruning, and recall rates above 90% under certain scenarios.
dc.description.abstract
Peer Reviewed
dc.description.abstract
Postprint (author's final draft)
dc.format
16 p.
dc.format
application/pdf
dc.language
eng
dc.publisher
Springer
dc.relation
https://link.springer.com/chapter/10.1007/978-3-319-68474-1_20
dc.rights
Open Access
dc.subject
UPC subject areas::Computer science::Software engineering
dc.subject
Data mining
dc.subject
Proximity Mining
dc.subject
Data Lakes
dc.subject
Data Governance
dc.subject
Similarity Search
dc.subject
Mineria de dades
dc.title
DS-Prox : dataset proximity mining for governing the data lake
dc.type
Conference report


This item appears in the following Collection(s)

E-prints [72987]