Effective and scalable data discovery with NextiaJD

Other authors

Universitat Politècnica de Catalunya. Doctorat en Computació

Universitat Politècnica de Catalunya. Doctorat Erasmus Mundus en Tecnologies de la Informació per a la Intel·ligència Empresarial

Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació

Universitat Politècnica de Catalunya. inSSIDE - integrated Software, Service, Information and Data Engineering

Publication date

2021

Abstract

We present NextiaJD, a data discovery system with high predictive performance and computational efficiency. NextiaJD helps data scientists discover datasets that can be joined. To that end, it ranks candidate attribute pairs according to their join quality, which is based on a novel similarity measure that considers both the containment and the cardinality proportion between candidate attributes. NextiaJD adopts a learning approach that relies on profiles: succinct and informative representations of the schemata and data values of datasets that capture their underlying characteristics. NextiaJD is fully integrated into Apache Spark and uses it to parallelize the profiling and discovery processes. The on-site demonstration will showcase how NextiaJD can effectively support large-scale data discovery tasks, using a large collection of datasets that the audience will be able to explore.
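To make the two ingredients of the join quality concrete, the following is a minimal sketch, not NextiaJD's implementation or API. It assumes that containment is the fraction of distinct values of one attribute that also appear in the other, and that the cardinality proportion is the ratio of the smaller to the larger number of distinct values; the abstract does not give the exact definitions or how they are combined. The function names and the ranking rule (containment first, cardinality proportion as tie-breaker) are hypothetical. The sketch also computes both quantities exactly from the data, whereas the abstract describes predicting join quality from profiles.

```python
def containment(a, b):
    """Fraction of distinct values of `a` that also appear in `b` (assumed definition)."""
    set_a, set_b = set(a), set(b)
    if not set_a:
        return 0.0
    return len(set_a & set_b) / len(set_a)


def cardinality_proportion(a, b):
    """Ratio of the smaller to the larger distinct-value count (assumed definition)."""
    n_a, n_b = len(set(a)), len(set(b))
    if n_a == 0 or n_b == 0:
        return 0.0
    return min(n_a, n_b) / max(n_a, n_b)


def rank_pairs(columns):
    """Rank ordered attribute pairs, best candidates first.

    `columns` maps an attribute name to its list of values. The sort key
    (containment, then cardinality proportion) is an illustrative choice,
    not the measure used by NextiaJD.
    """
    scored = []
    for name_a, values_a in columns.items():
        for name_b, values_b in columns.items():
            if name_a == name_b:
                continue
            scored.append((
                name_a,
                name_b,
                containment(values_a, values_b),
                cardinality_proportion(values_a, values_b),
            ))
    return sorted(scored, key=lambda row: (row[2], row[3]), reverse=True)
```

Calling `rank_pairs({"city": [...], "capital": [...]})` returns tuples of the form `(attribute_a, attribute_b, containment, cardinality_proportion)`. Containment is asymmetric, so each pair is scored in both directions.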


This work is partly supported by Barcelona’s City Council under grant agreement 20S08704. Javier Flores is supported by contract 2020-DI-027 of the Industrial Doctorate Program of the Government of Catalonia and Consejo Nacional de Ciencia y Tecnología (CONACYT, Mexico).


Peer Reviewed


Postprint (published version)

Document type

Conference lecture

Language

English

Published by

OpenProceedings

Related documents

https://doi.org/10.5441/002/edbt.2021.85

info:eu-repo/grantAgreement/Ajuntament de Barcelona/20S08704

info:eu-repo/grantAgreement/AGAUR/V PRI/2020 DI 027

Rights

https://creativecommons.org/licenses/by-nc-nd/4.0/

Open Access

Attribution-NonCommercial-NoDerivatives 4.0 International
