DS-Prox : dataset proximity mining for governing the data lake

Al-serafi, Ayman Mounir Mohamed; Calders, Toon; Abelló Gamazo, Alberto; Romero Moral, Óscar

Author

Al-serafi, Ayman Mounir Mohamed

Calders, Toon

Abelló Gamazo, Alberto

Romero Moral, Óscar

Other authors

Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació

Universitat Politècnica de Catalunya. inSSIDE - integrated Software, Service, Information and Data Engineering

Universitat Politècnica de Catalunya. IMP - Information Modeling and Processing

Publication date

2017

Abstract

With the arrival of Data Lakes (DL) there is an increasing need for efficient dataset classification to support data analysis and information retrieval. Our goal is to use meta-features describing datasets to detect whether they are similar. We utilise a novel proximity mining approach to assess the similarity of datasets. The proximity scores are used as an efficient first step, where pairs of datasets with high proximity are selected for further time-consuming schema matching and deduplication. The proposed approach helps in early pruning of unnecessary computations, thus improving the efficiency of similar-schema search. We evaluate our approach in experiments using the OpenML online DL, which show significant efficiency gains above 25% compared to matching without early pruning, and recall rates higher than 90% under certain scenarios.
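
The early-pruning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the two meta-features (attribute count and fraction of numeric attributes), the averaged similarity used as a proximity score, the 0.8 threshold, and all function and dataset names are assumptions chosen only for illustration. The sketch shows how a cheap proximity score over meta-features might select the dataset pairs that are passed on to the expensive schema matching step.

```python
from itertools import combinations


def meta_features(dataset):
    """Hypothetical meta-features: attribute count and fraction of numeric attributes."""
    types = list(dataset["attributes"].values())
    n_attrs = len(types)
    n_numeric = sum(1 for t in types if t == "numeric")
    return n_attrs, (n_numeric / n_attrs if n_attrs else 0.0)


def proximity(a, b):
    """Assumed proximity score in [0, 1]: mean of two simple meta-feature similarities."""
    count_a, numeric_a = meta_features(a)
    count_b, numeric_b = meta_features(b)
    largest = max(count_a, count_b)
    # Ratio of attribute counts: 1.0 when both datasets have the same number of attributes.
    count_sim = min(count_a, count_b) / largest if largest else 1.0
    # Closeness of the numeric-attribute fractions.
    type_sim = 1.0 - abs(numeric_a - numeric_b)
    return (count_sim + type_sim) / 2


def candidate_pairs(datasets, threshold=0.8):
    """Keep only pairs whose proximity reaches the threshold; all other pairs are pruned early."""
    return [
        (a["name"], b["name"])
        for a, b in combinations(datasets, 2)
        if proximity(a, b) >= threshold
    ]


# Toy example with made-up dataset descriptions.
example = [
    {"name": "iris", "attributes": {"sepal_len": "numeric", "sepal_wid": "numeric",
                                    "petal_len": "numeric", "petal_wid": "numeric",
                                    "species": "nominal"}},
    {"name": "iris_copy", "attributes": {"sl": "numeric", "sw": "numeric",
                                         "pl": "numeric", "pw": "numeric",
                                         "class": "nominal"}},
    {"name": "survey", "attributes": {"answer": "nominal", "region": "nominal"}},
]
pairs_to_match = candidate_pairs(example)
```

In this toy setting only the pair `iris` / `iris_copy` would proceed to schema matching; the two pairs involving `survey` are pruned without any attribute-level comparison.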

Peer Reviewed

Postprint (author's final draft)

Document Type

Conference report

Language

English

Subjects and keywords

Àrees temàtiques de la UPC::Informàtica::Enginyeria del software; Data mining; Proximity Mining; Data Lakes; Data Governance; Similarity Search; Mineria de dades

Publisher

Springer

Related items

https://link.springer.com/chapter/10.1007/978-3-319-68474-1_20

Rights

Open Access

This item appears in the following Collection(s)

E-prints [72987]
