Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneous Heritage Collections

Ricard de la Vega, Natalia Torres and Albert Martínez; Consorci de Serveis Universitaris de Catalunya (CSUC); Barcelona, Catalonia/Spain

Abstract
Empowering Communities with a Heritage Open Ecosystem (ECHOES) [1] is a project that aims to provide an open-source and modular architecture to gather different digital contents related to European heritage. When a large amount of heterogeneous data from different collections, disciplines and countries is brought together, several kinds of technical challenges must be considered. This article details these challenges and the approaches that have been used, with greater or lesser success, to address them.

Introduction
ECHOES provides open, easy and innovative access to digital cultural assets from different nations and languages. Within a single, integrated platform, users have access to a wide range of cultural heritage items that can be explored according to different criteria. The platform works as a digital ecosystem formed by a wide range of user communities and allows active participation, with the ability to enrich the digital collections it contains.
The project is carried out by Erfgoed Leiden en Omstreken (ELO), Tresoar, the Department of Culture of the Generalitat de Catalunya and the Diputació de Barcelona. The Consorci de Serveis Universitaris de Catalunya (CSUC) participates in the project as a technological partner. With an initial duration of two years (2016-2018), it has been decided to extend the initial development for at least one more year. All the code developed in the project is open source under an MIT license and is available from GitHub [2]. An open-source community has also been created to manage the maintenance and further development of the software.
ECHOES makes it possible to link the content of heterogeneous collections from different institutions or even countries. For example, the work of the Catalan architect Antoni Gaudí is mostly located in Catalonia, but he also participated in projects elsewhere in Spain, such as El Capricho in Cantabria, Casa de los Botines in León or the Palacio Episcopal in Astorga. His work is therefore spread across collections managed by the culture departments of each region, and each must be accessed separately. If all these collections are transformed into the same standard and loaded together in one place, all the works of the artist can be found and searched by architectural style (neo-Gothic, Modernist) or by type of element used (mitral arches, gussets, etc.). Adding new data from other collections located around Europe could reveal further interesting information, such as the works of his collaborators and even works influenced by his style, for example Het Schip in Amsterdam by Michel de Klerk. The theory is sound, but there are many challenges to tackle, and they are detailed in this article. These challenges, and the approaches that have been chosen, form the first part of the paper. The second part covers the technical architecture, explains the development, describes some lessons learned in the first two years of the project and, finally, details the current and future development.

Challenges and Approaches
The objective of the project is cooperation across heritage disciplines, institutions and borders in response to the European challenge of the fragmentation of existing cultural heritage (silos).
This is a project of interoperability between different data collections. Integrating data is not just about putting it together in a repository, but also about facilitating access to it so that it can be properly exploited by the public. ECHOES also aims to provide an easy way for smaller cultural heritage owners to transform their collections into linked data and share their collection data with Europeana.
After reviewing similar big data projects, the choice was made that all the records inserted into the system should have the same structure and format. This was chosen to simplify the re-use of the data by the public. Transforming the data into the same structure and format means that the quality of the outputs is determined by the quality of the input: if garbage comes in, then garbage comes out. The cleaning and homogenization of large, diverse datasets being integrated is an ongoing research topic [4], meaning that this is not an easy task and requires a lot of effort. Linked data suffers from quality problems such as inconsistency, inaccuracy, outdatedness and incompleteness; it is therefore important to assess the quality of the datasets used in linked data applications before using them [5].
There are two ways to ensure the coherence and consistency of the data: a priori or a posteriori. A priori data integration means that new data is cleaned and transformed to some standard before it is added to the final database. A posteriori integration means that data is added to the system as it is, and is cleaned and transformed to some standard in real time, at the moment it is used. Due to the complexity, and especially to the volume, of the data to be integrated in the ECHOES project, it was decided to process the data at the time it is inserted into the dataset (the a priori approach).
Below are the details of eight challenges the project identified. The first four refer to the input of data, the next two refer to the output, and the last two are not easy to classify as input or output. After explaining the details of each challenge, the approach taken by the ECHOES project to solve it is detailed.
1. Different input collections can have different input metadata schemas, such as Dublin Core, A2A, EAD, etc. It was necessary to have one standard as the basis to which the rest are mapped. The standard chosen was the Europeana Data Model (EDM) [6], the most widely used standard for heritage content. This choice also facilitates the export of the contents of ECHOES hubs to the Europeana portal. In a first approach, the inputs were mapped directly to EDM in the data sources module and then passed directly to the data lake (the data repository). The initial mappings were closely tied to the first test collections, and this caused chaos due to the quality of the source data. It was therefore necessary to define which data formats are accepted, and the metadata-to-metadata mapping from these formats to EDM, to ensure that only consistent data can be published into the data lake. The transformation to EDM of the following formats has now been developed or validated: Dublin Core (DC), A2A, EAD, the custom metadata schema from Memorix, a custom Catalan metadata schema, Topx and CARARE. This approach is easily extensible: if someone needs a format that is not on this list, they can create their own EDM mapping and contribute it to the community. One subpoint within this challenge arises when a standard has an official conversion, such as CARARE, that is not yet completed. In this case, if a collection with this schema is to be incorporated into the data lake, there are two options: to wait for the CARARE working group to complete the official conversion, or to complete the mapping oneself, running the risk of not coinciding with the official version and having to adapt it again later to match the standard. Neither option is desirable.
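As an illustration of the kind of metadata-to-metadata mapping performed in the data sources module, the following Python sketch maps a flat Dublin Core record to a few EDM properties using rdflib. The sample record, the base URIs and the subset of properties are hypothetical simplifications; the real ECHOES transformation tools cover full EDM records.

```python
# A minimal, hypothetical sketch of a Dublin Core -> EDM mapping.
# The sample record and the subset of EDM classes/properties shown here are
# illustrative only; the actual ECHOES transformation covers full EDM.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, RDF

EDM = Namespace("http://www.europeana.eu/schemas/edm/")
ORE = Namespace("http://www.openarchives.org/ore/terms/")

# Hypothetical source record, already parsed from an XML or OAI-PMH export.
dc_record = {
    "identifier": "casa-botines-001",
    "title": "Casa de los Botines",
    "creator": "Antoni Gaudí",
    "coverage": "León",
}

def dc_to_edm(record: dict) -> Graph:
    """Map a flat Dublin Core dict to an edm:ProvidedCHO plus its ore:Aggregation."""
    g = Graph()
    cho = URIRef(f"http://example.org/item/{record['identifier']}")         # hypothetical URI base
    agg = URIRef(f"http://example.org/aggregation/{record['identifier']}")  # hypothetical URI base

    g.add((cho, RDF.type, EDM.ProvidedCHO))
    g.add((cho, DC.title, Literal(record["title"])))
    g.add((cho, DC.creator, Literal(record["creator"])))
    g.add((cho, DC.coverage, Literal(record["coverage"])))

    g.add((agg, RDF.type, ORE.Aggregation))
    g.add((agg, EDM.aggregatedCHO, cho))
    return g

if __name__ == "__main__":
    print(dc_to_edm(dc_record).serialize(format="turtle"))
```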
2. With the transformation of the first heterogeneous collections from multiple collection owners within the project, done without homogenization or validation processes, it became apparent that poor data quality limits the exploitation of the data. For example, a single field containing references to locations has values at different geolocation levels: there are references to a municipality such as Bussum, a city such as Chicago or a country such as China. The same happens with dates: some records refer to a specific day and time, others to a whole year, and others to temporal spans such as centuries. Another quality problem is misspelling, as we found examples such as "Leide4n", "Leideb" and "Leidedn" that can be assumed to be "Leiden". Two modules have been developed to improve the quality of the data. One is focused on data profiling, analyzing the inputs and reporting information about the data types, the number of instances, blank cells, etc. The other is focused on validating the input data. This second module, called "Quality Assurance", reviews each item and, based on defined rules, decides whether it can be loaded into the data lake. This module consists of three primary functions: the first reviews the EDM schema (tags, mandatory fields, etc.), the second is a semantic validation based on the EDM Schematron, and the last is a validation of the content, based on a pre-defined and configurable list of rules. An input collection can only be loaded into the data lake when it passes the three validations. In any case, an input collection that passes all the validations is not a complete guarantee that the data quality is of an acceptable level. We found examples of records that use the same metadata field, "coverage", either for information about places or for information about dates. There is no common rule that can be added to the validation to counter these human errors in the input collection.
3. The same data appearing in different collections should be stored once and kept updated. An important task when working with data from different sources is deduplication. As shown in Figure 1, different collections can include the same object and store different, complementary metadata about it. The deduplication process needs to detect this and create a unique item that includes all the metadata. This process can be performed easily if the objects have shared identifiers, but that is not usual in heterogeneous collections from different sources. In this case, similarity and distance metrics such as Levenshtein, Jaro-Winkler and others implemented in the Duke tool [7] can be used to find duplicates. The degree of similarity can be adjusted and, to reduce the risk of false positives, comparisons can be made across multiple metadata fields, as sketched after Figure 1.
Figure 1. Inputs deduplication challenge
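As a rough illustration of the similarity-based matching described above (ECHOES itself relies on the Duke tool), the following Python sketch compares two records field by field with a normalized Levenshtein distance and flags them as probable duplicates when a weighted score exceeds a configurable threshold. The field names, weights, sample records and threshold are hypothetical.

```python
# Hypothetical sketch of duplicate detection across multiple metadata fields.
# ECHOES uses the Duke tool; this only illustrates the idea of combining
# per-field similarities with an adjustable threshold.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            replace_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, replace_cost))
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1], case-insensitive."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / max(len(a), len(b))

def is_duplicate(rec1: dict, rec2: dict, weights: dict, threshold: float = 0.85) -> bool:
    """Weighted average of per-field similarities, compared against a threshold."""
    total = sum(weights.values())
    score = sum(w * similarity(rec1.get(f, ""), rec2.get(f, "")) for f, w in weights.items())
    return score / total >= threshold

# Hypothetical records: the same object described by two different collections.
r1 = {"title": "Casa de los Botines", "creator": "Antoni Gaudí", "place": "León"}
r2 = {"title": "Casa Botines", "creator": "Antoni Gaudi", "place": "Leon"}

# Lowering the threshold trades more matches for a higher risk of false positives.
print(is_duplicate(r1, r2, weights={"title": 0.5, "creator": 0.3, "place": 0.2}, threshold=0.7))
```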
4. One of the most important challenges of the project is content enrichment. There are two types of enrichments: those suggested by users and those that can be done automatically. Automatic enrichments make use of external information systems, usually in the form of linked open data (LOD), that can usefully complement the original data. For example, possible sources of automatic enrichment for places are TGN, GeoNames or Pleiades, and for agents, VIAF or Wikidata. To integrate automatic enrichments, it is necessary to decide which metadata fields are candidates for enrichment, whether the new metadata is incorporated as a pre- or post-process, which data sources to use and, finally, how to store the result, reusing existing fields or creating new metadata.
Some tests have been performed with places. For example, the A2A collections do not have geolocation coordinates, and these have been added from GeoNames based on texts such as "Leiden"; the coordinates are necessary in order to be able to locate the data on maps. Another test added information from DBpedia, also based on the location. It is worth remembering that the enrichments depend on the data quality, a challenge that has been detailed previously.
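A minimal sketch of this kind of place enrichment is shown below, using the public GeoNames search web service. The username, the sample records and the choice of returned fields are placeholders; the ECHOES enrichment module is more elaborate than this.

```python
# Hypothetical sketch of a place enrichment against the public GeoNames search API.
# The username and the sample records are placeholders.
import requests

GEONAMES_SEARCH = "http://api.geonames.org/searchJSON"
GEONAMES_USER = "demo_user"  # placeholder: GeoNames requires a registered username

def geolocate(place_name: str):
    """Return basic geodata for a free-text place name, or None if nothing is found."""
    params = {"q": place_name, "maxRows": 1, "username": GEONAMES_USER}
    response = requests.get(GEONAMES_SEARCH, params=params, timeout=10)
    response.raise_for_status()
    hits = response.json().get("geonames", [])
    if not hits:
        return None
    top = hits[0]
    return {"name": top["name"], "lat": top["lat"], "lng": top["lng"],
            "country": top.get("countryName")}

# Enrich records whose "coverage" field only holds a place name (hypothetical data).
records = [{"title": "Birth register 1880", "coverage": "Leiden"}]
for record in records:
    coords = geolocate(record["coverage"])
    if coords is not None:
        record["geo"] = coords  # new metadata added by the enrichment step
print(records)
```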
Once the data inserted in the data lake has the necessary quality, the challenges for its exploitation arise; they are detailed below. Exploiting integrated systems such as ECHOES in a visual way is a very difficult task, since the heterogeneous nature of the data, as well as its volume, makes it very hard to represent. Despite being technically complicated, searching through the whole data lake can be interesting in order to find related information from several collections that could not be found before this aggregation. On the other hand, for a lot (perhaps most) of the data it does not make sense to consult it in an aggregated way.
Figure 2. Data visualization of a 455,000-node graph
One alternative for tackling the problem of visualizing integrated data is to define domain-specific visualizations [4]. That is, instead of targeting the exploration of the whole database, very specific solutions can be built to solve specific tasks in specific domains. Following this approach, visualization tools have been developed for the different types of information to be displayed. Instead of focusing on a single option for the exploitation of the data, several alternatives have been studied and tested, which are detailed below. Additionally, a key aspect of the project has always been to offer open access to the data so that, via an API or by publishing it as linked open data, the information can be consulted and exploited by other systems.
5. In an attempt to visualize data in the form of a graph, four random collections of old registers were used, two from ELO and two from Tresoar, with a total of approximately 167 thousand items. It became clear how challenging too much data can be. These data produced a graph of approximately 455 thousand nodes, generated with Gephi. In theory it was navigable, but in practice, as can be seen in Figure 2, it was impossible to use and did not provide any kind of value. Even worse, generating it required high-performance computing (HPC) facilities. One way to facilitate the navigation through this content of linked resources is to provide an explorable representation of the underlying graph of data. For this challenge, a divide-and-conquer strategy can be used: the main point is to pick a particular data point as a focus for analysis, and the system then computes and displays an "optimal" relevant context given the user's current interests [8]. It all starts with a query that generates the first node, with a set of connections to other relevant nodes that allow users to interact with this initial subgraph and expand it in any direction. This approach allows navigation through any kind of RDF-based system. The project is still far from achieving this holistic navigation approach for all the data: it is difficult not only to decide on the entry point, but also how to calculate the relevance of the nodes. A final approach for this challenge has not yet been decided; it is a topic to work on during 2019.
Cultural objects are shown as a graph that relates places, people, dates and concepts (see Figure 5); by clicking on each of them the user can browse through the data and explore the content. Place searches can be done using a text query box or by selecting an area on a map, and the results are also shown on a map using their geolocation (see Figure 7). Dates and periods are represented using timelines where the user can optionally add temporal periods to help contextualize the date results; for example, the industrial revolution in the Netherlands and in Catalonia fall in different year ranges (see Figure 9). Another approach offered for dates is the use of heatmaps, because they help users focus on dates when many events occur; for example, a peak of deaths could help the user relate them to an epidemic (see Figure 8). Graphs are used to show relationships between people (see Figure 6).
6. All the data lake contents are accessible in RDF format through a linked open data endpoint, and a user-friendly interface called YASGUI is integrated to access this endpoint. One factor that currently limits the success of linked data repositories is the requirement of knowing the SPARQL query language and the exact structure of the underlying database. For all the users (i.e., most) who are not accustomed to writing queries in SPARQL, there is the possibility of developing a visual query system that assists in generating SPARQL queries effortlessly. The tool Visual SPARQL Builder was tested with the ECHOES data (see Figure 4). The system allows any of the elements of the database to be dragged onto an infinite canvas, showing the metadata of each element in boxes in which the details of the query can be specified; relations between boxes can then be created in order to combine the information from different elements.
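To illustrate how the SPARQL endpoint mentioned in challenge 6 can also be used programmatically (outside YASGUI or a visual builder), the following Python sketch runs a query with SPARQLWrapper. The endpoint URL is a placeholder, and the query assumes records are modelled as edm:ProvidedCHO with Dublin Core properties, as described earlier.

```python
# Hypothetical sketch of querying an ECHOES SPARQL endpoint with SPARQLWrapper.
# The endpoint URL is a placeholder; the vocabulary (edm:ProvidedCHO, dc:*) follows EDM.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://echoes.example.org/sparql"  # placeholder endpoint

QUERY = """
PREFIX edm: <http://www.europeana.eu/schemas/edm/>
PREFIX dc:  <http://purl.org/dc/elements/1.1/>
SELECT ?cho ?title WHERE {
  ?cho a edm:ProvidedCHO ;
       dc:title ?title ;
       dc:creator ?creator .
  FILTER (CONTAINS(LCASE(STR(?creator)), "gaud"))
}
LIMIT 10
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

# Each binding holds the URI of a cultural heritage object and its title.
for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["cho"]["value"], "-", binding["title"]["value"])
```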
The project challenges are not only related to inputs and outputs; there are two more challenges that are difficult to classify.
7. The software developed in the project must be useful for institutions of very different scope, from small ones to large national or international hubs. The technology developed is scalable so that it covers many different scenarios, from small collections to instances including collections from many institutions. Small instances may not use all the developed software modules; on the other hand, it is necessary to size the large instances well and to perform performance tests.
8. It has already been said that one of the objectives of the project is the user enrichment of the contents. This challenge has all the difficulties mentioned above for automatic enrichment. Although some initiatives such as Zooniverse have been analyzed, priority has been given to providing mechanisms for the processing of input data and their subsequent exploitation, as a basis for building this functionality afterwards. It is planned to start on this issue in the second half of 2019.

Technical architecture
The architecture of ECHOES follows a modular approach, with four main pieces (see Figure 3): the mapping and transformation data sources module, the data lake, the data retrieval and visualization module, and the enrichments. The core module is the data lake, where the data from heterogeneous collections (inputs) are introduced after a process of cleaning and transformation. Once the "normalized" data is available in the data lake, it can be searched and accessed in different ways (outputs). Finally, the data in the data lake can be enriched, automatically or with user collaboration.
Figure 3. ECHOES technical architecture
The data sources module is formed by four pieces that transform the input collections into data of sufficient quality to be introduced into the data lake:
1. The Analyze optional submodule enables data profiling. It generates reports that summarize the contents of the data source.
2. The Transformation optional submodule is the tool for data standardization, preparing data sources for the EDM standard. At present, the following metadata schemas can be transformed: Dublin Core, A2A, EAD, the custom schema from Memorix, the custom Catalan metadata schema, Topx and CARARE (see Figure 10).
3. The Quality Assurance submodule, as its name implies, reviews each item and, based on defined rules, decides whether it can be loaded into the data lake. There are three kinds of data validations: the syntactic EDM schema, the semantic EDM schema with Schematron, and a content validation.
4. The Publish submodule oversees the data deduplication and loads datasets into the data lake.
The data lake module contains a large amount of data from different sources in EDM. It is built on an open-source graph database called Blazegraph.
The data retrieval and visualization module is composed of components that exploit the data in the data lake in different ways:
1. A SPARQL endpoint, with the YASGUI user interface, that opens the collections and links them to the world as linked open data (LOD).
2. A web portal builder with a modular and extensible architecture. At present it is built with the WordPress CMS and some additional custom plugins. Some of the data visualizations detailed before are implemented on it, such as the map, timespan, heatmap and graph visualizations.
3. Mechanisms to export the data lake contents via a REST API or with the OAI-PMH protocol, as sketched after the figure list below.
Only some tests have been done with the last part of the architecture, the enrichment module, which includes automatic and manual enrichments.
Figure 4. Visual SPARQL Builder
Figure 5. ECHOES graph data visualization of concepts
Figure 6. ECHOES graph data visualization of people relations
Figure 7. ECHOES map data visualization
Figure 8. ECHOES heatmap data visualization
Figure 9. ECHOES timespan data visualization
Figure 10. ECHOES GUI transformation module
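As a sketch of the export mechanisms listed above, the snippet below harvests records from an OAI-PMH endpoint with the Sickle client. The endpoint URL is a placeholder, and oai_dc is used as the metadata prefix because every OAI-PMH provider must support it; an ECHOES hub may expose richer formats as well.

```python
# Hypothetical sketch of harvesting an ECHOES hub over OAI-PMH with the Sickle client.
# The endpoint URL is a placeholder; oai_dc is the mandatory baseline metadata format.
from sickle import Sickle

harvester = Sickle("https://echoes.example.org/oai")   # placeholder endpoint
records = harvester.ListRecords(metadataPrefix="oai_dc")

for i, record in enumerate(records):
    # record.metadata is a dict mapping Dublin Core elements to lists of values
    title = record.metadata.get("title", ["(no title)"])[0]
    print(record.header.identifier, "-", title)
    if i >= 4:  # only print the first few harvested records in this sketch
        break
```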
Lessons learned (for now)
Some methodologies have been useful during the first years of the project and, likewise, there are some lessons learned that we would like to share. Agile methodologies are the best fit for this type of project, since we do not know in advance what will happen with the challenges raised. They have allowed us to adapt, evolve and change the necessary parts or technologies thanks to their flexibility. Do not be afraid to make changes to proposed options and to look for alternatives. It is also necessary to have the collaboration and involvement of the whole team in each iteration, so that everyone advances in the same direction. A multidisciplinary team brings different points of view to solving a challenge; the contributions of both functional and technical viewpoints have improved the final solution, and the different perspectives on the management of cultural heritage in different countries also enrich it. Even though the enrichments are a core piece of the project, they have been relegated to the end of it: without a set of data to work with, they do not make any sense, so the focus of the project at the beginning must be on the data. Much of the project has consisted of research and development to find the best solution given the large volume of data being worked with. Learning by doing: the best way to know if something works is to test it and, after that, to test it again using different data sources.

Results and future development
After two years and 19 one-month iteration sprints, we have delivered 7 releases and one stable minimum viable product (MVP). The developed tools allow heterogeneous data to be analyzed, cleaned and transformed to the EDM standard, then validated and published to a normalized data lake that can be exploited as linked open data and through different data visualizations. The GitHub platform is used to publish software releases under an MIT open-source license, manage user requests, process contributions and publish the related documentation, and a new software community based on the Benevolent Dictator For Life (BDFL) model is awaiting user feedback. The current focus of the development is to improve the data source mapping and transformation tools and to work on the enrichment module. In addition, more users of the platform are expected to help grow the community.

References
[1] Ariela Netiv & Walther Hasselo, "ECHOES: cooperation across heritage disciplines, institutes and borders" (IS&T, Washington, 2018), pp. 70-74.
[2] All the code (version 1.4), specifications and documentation are available on the GitHub page of the project: https://github.com/CSUC/ECHOES-Tools.
[3] Lluís M. Anglada & Sandra Reoyo & Ramon Ros & Ricard de la Vega, "Doing it together: spreading ORCID among Catalan universities and researchers" (ORCID-CASRAI Joint Conference, Barcelona, 2015).
[4] Víctor Pasqual, "The navigation system of ECHOES", report, 2018.
[5] Anisa Rula & Andrea Maurino & Carlo Batini, "Data Quality Issues in Linked Open Data" (part of the Data-Centric Systems and Applications book series, DCSA, 2016).
[6] Europeana Data Model (EDM), https://pro.europeana.eu/resources/standardization-tools/edm-documentation.
[7] Duke, a tool to find duplicates, https://github.com/larsga/Duke.
[8] Frank van Ham & Adam Perer, ""Search, Show Context, Expand on Demand": Supporting Large Graph Exploration with Degree-of-Interest", http://perer.org/papers/adamPerer-DOIGraphsInfoVis2009.pdf.

Acknowledgements
We wish to acknowledge the great work provided by Gerard Suades and David Fernandez, colleagues who have been part of the CSUC development team. It is also important to emphasize the involvement and often unseen work of Walther Hasselo, Olav Kwakman and Anna Busom, especially each month at the "cookie" meeting. Finally, we would like to thank the data visualization expert Víctor Pasqual, of the OneTandem company, for his help.

Authors Biography
Ricard de la Vega is the Computing and Applications Department Manager at the Consorci de Serveis Universitaris de Catalunya (CSUC).
He received a bachelor's degree in Software Engineering from the Polytechnic University of Catalonia (UPC), a master's degree in Computer Science from the Open University of Catalonia (UOC), a master's degree in Business Innovation and Entrepreneurship from Pompeu Fabra University (UPF) and a postgraduate qualification in Big Data and Analytics from the UPC. He is interested in data-related topics (big, small, open, FAIR, LOD, interoperability, searching, machine learning, visualization, preservation, etc.).
Natalia Torres is the project expert leader of the Computing and Applications Department at the Consorci de Serveis Universitaris de Catalunya (CSUC). She received a bachelor's degree in Systems Engineering and a master's degree in Computer Science from the Polytechnic University of Catalonia (UPC). She is interested in interoperability, visualization and preservation.
Albert Martínez is a software engineer in the Computing and Applications Department at the Consorci de Serveis Universitaris de Catalunya (CSUC). He received a bachelor's degree in Computer Science from the Autonomous University of Barcelona (UAB). He is interested in big data, visualization and machine learning.