“This document is the Accepted Manuscript version of a Published Work that appeared in final form in “Managing the Computational Chemistry Big Data problem: the ioChem-BD platform”, copyright © American Chemical Society after peer review and technical editing by the publisher. To access the final edited and published work see http://pubs.acs.org/doi/abs/10.1021/ci500593j.”
Managing the Computational Chemistry Big Data problem: the ioChem-BD platform
M. Álvarez-Moreno,1,2,* C. de Graaf,2,4 N. López,1 F. Maseras,1,3 J. M. Poblet,2 C. Bo1,2,*
1 Institute of Chemical Research of Catalonia, ICIQ, Av. Països Catalans 16, 43007, Tarragona, Catalonia, Spain
2 Department of Physical and Inorganic Chemistry, Universitat Rovira i Virgili, C/ Marcel·lí Domingo s/n, 43007, Tarragona, Catalonia, Spain
3 Department of Chemistry, Universitat Autònoma de Barcelona, 08193, Bellaterra, Catalonia, Spain
4 Catalan Institution for Research and Advanced Studies, ICREA, Passeig Lluis Companys 23, 08010, Barcelona, Catalonia, Spain
KEYWORDS: Big Data, Chemical Markup Language, XML, XSLT, HTML5, semantic data, digital repository, Density Functional Theory, simulations, catalysis.
ABSTRACT: We present the ioChem-BD platform as a multi-headed tool aimed at managing large volumes of quantum chemistry results from a diverse group of already common simulation packages. The platform has an extensible structure whose key modules handle the main tasks: (i) uploading output files from common computational chemistry packages, (ii) extracting meaningful data from the results, (iii) generating output summaries in user-friendly formats. ioChem-BD makes heavy use of the Chemical Markup Language (CML) in its intermediate files. From these files, using XSL techniques, we manipulate and transform chemical datasets to fulfill researchers’ needs in the form of HTML5 reports, supporting information and other research media.
1. INTRODUCTION
Intensive, high performance computing is one of the pillars of accelerated materials discovery and development in many fields of science and engineering, most prominently Chemistry, Physics and related areas. The volume of information generated daily, coming from the results of scientific calculations, is increasing exponentially. For instance, the scientists in our lab (a group of about 10) generate 1.3 TB weekly. Preserving this information on physical media is currently favored by the low price of storage per bit of information as well as by the growing bandwidth of the available telecommunications infrastructure. As a result, public and private centers generate and store more and more terabytes of information; in our particular case this information corresponds to the outcome of calculations. At present, storage of computational simulations has not been identified as a bottleneck in most data and supercomputing centers. However, the main global players in the Internet business have already introduced the "Big Data"1 concept to start looking for solutions that keep physical data storage systems sustainable and provide convenient access to them. Solutions based on what is called "the cloud", along with concepts derived from the "social networks", are leading to a reformulation of how and in what physical space the information shall be stored. As a result, there is a growing demand for tools that organize the storage, enable the analysis and simplify the presentation of large and growing volumes of data in an amenable, transparent, and reliable manner.2 Only a very small percentage of the information currently stored in data centers is hierarchically indexed.3 This simply means that the information available is impossible to process: the information bits are hardly usable by any person other than their creator.
Tim Berners-Lee, one of the original developers of the World Wide Web, identified the need to transform numerical data into "raw data".4 This raw data is the desirable state where information is meaningful because it is enriched with labels that contextualize it, the labels being the "metadata". Once contextualized, searching the ocean of information is more efficient. Moreover, the search process becomes a process of knowledge creation, as it is possible to establish new connections, in what is already called "linked data".5 The application of these concepts to the field of computational chemistry is hindered by the heterogeneity of the program packages used in the atomistic simulations of molecules and materials. As a result, the outcomes of atomistic simulations that address chemical or physical problems and are based on the application of the Schrödinger equation are presented in a dispersed manner, with sparse data showing only some of the key aspects, such as geometries, energies, and chemical and/or physical properties. This high degree of diversity in the data formats requires the definition of standards.6 The ultimate consequence is that the data published in the scientific journals of Chemistry, Physics, Nanoscience, Biochemistry and related areas are not homogeneous, are often incomplete, can hardly be consulted in bulk and are rarely reused. There have been technological initiatives7 like the Quixote project that implement solutions to multiple aspects of this problem, such as data format unification and data management.8 Alternatively, purpose-dedicated databases have been generated by groups at MIT, Berkeley9 and Stanford10.
Scheme 1. ioChem-BD system overview. The web nature of the CREATE and BROWSE modules is highlighted. Both modules share functionalities like searching and browsing chemical datasets, exporting to third-party formats and publishing results, but without losing sight of the original private-public sense of each module.
In this paper, we present an alternative platform, ioChem-BD, encompassing a variety of aspects in the definition of standards for treatment, hierarchical storage and retrieval of data. Our platform automates the extraction of relevant data and its conversion into fully tagged information in a distributed database. It provides tools for the researcher to validate, enrich, publish and share information, and tools in the cloud to access it and view it.
2. BASE TECHNOLOGIES
The keystone in the definition of the project is that it employs highly reliable software technologies that are widely used in the Internet world and extends them to cover our particular area of interest. We chose eXtensible Markup Language (XML)11 as the container for all information because of its reliability, format neutrality and ease of validation (using XSD verification tools)12. More specifically, we chose the Chemical Markup Language (CML) implementation, because it contains all the semantics necessary to describe most chemistry.13 With calculations in CML format, querying their content for specific information with XPath queries is easy and efficient.14 Working with XML also provides a wide range of conversion operations from CML files into any other existing or future format using eXtensible Stylesheet Language Transformations (XSLT).15 As access to information is becoming more universal, the data in the system is reachable through the Internet from any digital device on the market, following the latest web communication standards.11,16 Users have at their disposal the latest search,17,18 display,19 and data labeling tools13, and communication channels exist for proposing new features. In terms of data storage, information is distributed among the content generators, creating a mesh topology in which the service is always available and accessible on the network.
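As a hedged illustration of the XPath querying mentioned above, the following Python sketch runs a query against a tiny hand-written CML fragment. The fragment itself, the `cc:scfenergy` dictionary reference, the units label and the helper name `scalar_property` are assumptions made for this example and are not actual ioChem-BD output. Python's standard `xml.etree.ElementTree` module supports only a subset of XPath 1.0, which is sufficient here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written CML fragment (not real ioChem-BD output).
CML = """<module xmlns="http://www.xml-cml.org/schema">
  <molecule id="final">
    <atomArray>
      <atom id="a1" elementType="O" x3="0.000" y3="0.000" z3="0.117"/>
      <atom id="a2" elementType="H" x3="0.000" y3="0.757" z3="-0.469"/>
      <atom id="a3" elementType="H" x3="0.000" y3="-0.757" z3="-0.469"/>
    </atomArray>
  </molecule>
  <property dictRef="cc:scfenergy">
    <scalar dataType="xsd:double" units="nonsi:hartree">-76.4089</scalar>
  </property>
</module>"""

# Prefix-to-URI map used to resolve the "cml:" prefix in the queries below.
NS = {"cml": "http://www.xml-cml.org/schema"}


def scalar_property(root, dict_ref):
    """Return the scalar value of the first property tagged with dict_ref, or None."""
    node = root.find(f".//cml:property[@dictRef='{dict_ref}']/cml:scalar", NS)
    return float(node.text) if node is not None else None


root = ET.fromstring(CML)
energy = scalar_property(root, "cc:scfenergy")
elements = [atom.get("elementType") for atom in root.iterfind(".//cml:atom", NS)]
print(energy, elements)
```

Because every value is addressed through its tag and `dictRef` label rather than through its position in a text file, the same query keeps working regardless of which program produced the original output.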
As a cloud system, it supports the data definition standards20-22 needed to connect to other digital repositories and external Web services, building a network of interconnected semantic data that enriches the user experience. Finally, to enhance industrial implementation, the platform provides secure23,24 and reliable channels for communication and collaboration between users, groups and/or centers, while keeping the highest privacy standards toward third parties. All information has configurable levels of access and licensing, so that it can be adapted to the specific legal needs of each entity.
ARCHITECTURE
The ioChem-BD platform is composed of two main modules that work independently, labeled CREATE and BROWSE. Both of them are executed as Java web services. The CREATE module is designed to extract the information from the output files generated by the computational chemistry packages and store it in an organized way. It currently manages output files from the programs Gaussian,25 ADF26 and VASP.27 The list of supported codes is expected to be expanded in the medium term to codes such as SIESTA,28 TURBOMOLE,29 MOLCAS,30 and ORCA.31 The CREATE module is intended for the scientists who run the simulations (the creators). The BROWSE module is designed as a tool to explore and use the data contained in the database and resides in the cloud. The BROWSE module has a much broader scope, and is useful to researchers interested in accessing the computational Big Data. The combination of both modules gives the scientific community storage of and access to all the Chemistry Big Data, as shown schematically in Scheme 1.
CREATE MODULE
The CREATE module contains two different subunits that allow: (i) the extraction and structuring of the relevant output data, and (ii) the publication of such data and its derivatives into BROWSE.
CREATE MODULE: DATA EXTRACTION FROM OUTPUT FILES.
The system initially works with input and output files from the computational chemistry packages described above. In Gaussian and ADF the results are stored in a single file that needs to be extracted. However, for other computational codes the CREATE module needs a group of files. This is the particular case of VASP calculations, where relevant data are split into multiple files from inputs like POTCAR, or output files OUTCAR (summary), CONTCAR (geometries), XDATCAR (trajectory), and vasprun.xml. ioChem-BD platform outputs a single CML file from these sources. The upload of these files to the CREATE module is done inside a layered process where we simultaneously parse and tag all relevant data; we infer its metadata and capture the molecular geometry. A scriptable shell upload utility is used to do file conversion and upload calculations straight from HPC clusters to CREATE module, there is also an alternative mechanism to upload content from user web browser. A detection algorithm decides which format extraction templates to use based on calculation file content. Once the file format is elucidated, the appropriate templates for such format are selected and the first conversion is performed: from plain text to CML using a modified version of the JUMBOconverters library.32 This is followed by a second conversion to reorder CML tags so they comply with CompChem convention.33 This process can be used on individual or multiple output files as it is depicted in Figure 1 and Figure 2. As an additional feature, the module is able to attach directly other supporting files such as calculation input files, graphics, text and all needed associated gray literature. These additional files are not processed, so the future user of the database should process them him/herself. Such files will be paired with the calculation CML file during its existence inside ioChem-BD system to provide further information of the calculation. Figure 1. 
Conversion workflow for individual output files. This two-step process is the default behavior for file conversion inside JUMBOconverters library, from output files into CML elements, and then to compliant CML CompChem. Figure 2. Conversion workflow for multiple output files. Our customized JUMBOconverters library accepts multiple output files for its unification into a single CML file. Uploaded files can be a mixture of plain text and XML files. Figure 3. CREATE main panel view. The hierarchical tree on upper section allows browsing all uploaded content. Selecting an element from it will fill lower panels with more detailed information and its available display actions. Once the CompChem CML file is generated, a second data flow is triggered to extract the corresponding metadata fields. By using XSLT style sheets we infer fields such as: type of calculation, methods used, basis set, charge, multiplicity, and several others. From CREATE database we will also retrieve additional information such as structural (which files are involved in this upload process) and administrative (how this files were generated). Figure 4 depicts this process that ends building a METS compliant file of administrative, descriptive and structural metadata containing all aspects of the upload. Prior to the data storage by the CREATE module there is a final step aimed to extract the final geometry, a key point to repeat the calculation in case of need. This particular point sets our computational database close to other structural databases like the CSD (molecules)34 and for the structures of compounds in crystallography COD.35 In the case of geometry optimizations a large number of geometries can appear in the same file. Again we rely on XSLT templates to contain the necessary logic to retrieve final value of such field. 
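The final-geometry step can be illustrated with a minimal sketch. ioChem-BD performs it with XSLT templates; the Python below is only a stand-in for the same idea, under the assumption that each optimization step appears as a `molecule` element in document order, so that the last one holds the final geometry. The two-step CML fragment and the function names `final_geometry` and `to_xyz` are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"cml": "http://www.xml-cml.org/schema"}

# Hypothetical two-step optimization; real CompChem files are far richer.
CML = """<module xmlns="http://www.xml-cml.org/schema">
  <module id="step1">
    <molecule id="m1"><atomArray>
      <atom id="a1" elementType="H" x3="0.0" y3="0.0" z3="0.00"/>
      <atom id="a2" elementType="H" x3="0.0" y3="0.0" z3="0.80"/>
    </atomArray></molecule>
  </module>
  <module id="step2">
    <molecule id="m2"><atomArray>
      <atom id="a1" elementType="H" x3="0.0" y3="0.0" z3="0.00"/>
      <atom id="a2" elementType="H" x3="0.0" y3="0.0" z3="0.74"/>
    </atomArray></molecule>
  </module>
</module>"""


def final_geometry(root):
    """Return [(element, x, y, z), ...] for the last molecule in document order."""
    molecules = root.findall(".//cml:molecule", NS)
    if not molecules:
        raise ValueError("no molecule element found")
    atoms = molecules[-1].findall("./cml:atomArray/cml:atom", NS)
    return [
        (a.get("elementType"), float(a.get("x3")), float(a.get("y3")), float(a.get("z3")))
        for a in atoms
    ]


def to_xyz(atoms, comment=""):
    """Format atoms as XYZ text: atom count, comment line, one line per atom."""
    lines = [str(len(atoms)), comment]
    lines += [f"{el} {x:.6f} {y:.6f} {z:.6f}" for el, x, y, z in atoms]
    return "\n".join(lines)


geometry = final_geometry(ET.fromstring(CML))
print(to_xyz(geometry, "final geometry"))
```

The XYZ text produced by `to_xyz` is the kind of plain geometry listing that a structure index or a Supporting Information document can consume.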
The geometry is then indexed with ChemAxon JChemBase software for future substructure searches.36 When all these processes are completed, newly uploaded calculations are accessible in the CREATE module via tree browsing or search (Figure 3). At this point, users can browse their uploaded content. Selecting a calculation opens an auxiliary window in the lower right corner with all available actions: visualize the molecule in the JSmol viewer,37 view an HTML5 summary of relevant calculation data (or other attached files), and download and visualize CML and attached files. To keep the system extensible, all actions applicable to content are implementations of an abstract Action class, which are managed via an ActionManager object. This manager class also acts as a class loader. This allows upgrading the system with new calculation operations without the need to update its code, just by dropping a new Action implementation class package into the web server class path. Once uploaded, the stored data can be searched.
Figure 4. METS file generation workflow. Using XSLT stylesheets we can extract calculation descriptive metadata fields. Together with these fields we will append structural and administrative information to compose a METS file that fully describes our new uploaded result.
Figure 5. CREATE search panel allows users to define multiple search criteria using boolean logic. Such queries range from administrative metadata, chemical related terms and chemical substructure.
Figure 6. Search output can be narrowed by the definition of a molecular substructure that will refine its results. A visual HTML5 molecular editor is displayed on user’s browser to sketch part or the entire molecule.
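The Action and ActionManager extension mechanism can be sketched as a small plugin registry. The platform itself is written in Java and loads new Action classes from the web server class path; the Python below only mirrors the design idea (an abstract base class plus a manager that dispatches by name) and does not reproduce Java class loading. Apart from the names `Action` and `ActionManager`, everything here (the `name` attribute, `register`, `available`, and the two example actions) is hypothetical.

```python
from abc import ABC, abstractmethod


class Action(ABC):
    """Base class for operations applicable to an uploaded calculation."""

    name = ""  # hypothetical identifier used to look the action up

    @abstractmethod
    def run(self, calculation):
        """Apply the action to a calculation (here just a dict) and return a result."""


class ActionManager:
    """Keeps a registry of Action subclasses and dispatches to them by name."""

    def __init__(self):
        self._actions = {}

    def register(self, action_cls):
        if not issubclass(action_cls, Action):
            raise TypeError("only Action subclasses can be registered")
        self._actions[action_cls.name] = action_cls
        return action_cls  # returning the class allows use as a decorator

    def available(self):
        return sorted(self._actions)

    def run(self, name, calculation):
        return self._actions[name]().run(calculation)


manager = ActionManager()


@manager.register
class ListAtoms(Action):  # hypothetical example action
    name = "list-atoms"

    def run(self, calculation):
        return [atom[0] for atom in calculation["geometry"]]


@manager.register
class CountAtoms(Action):  # added later without touching ActionManager
    name = "count-atoms"

    def run(self, calculation):
        return len(calculation["geometry"])


calc = {"geometry": [("O", 0.0, 0.0, 0.1), ("H", 0.0, 0.8, -0.5), ("H", 0.0, -0.8, -0.5)]}
print(manager.available(), manager.run("count-atoms", calc))
```

The point of the pattern is that `CountAtoms` becomes available to users without any change to `ActionManager` or to existing actions.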
The search functionality relies on standard database queries in conjunction with JChemBase search engine to filter its content.36 As seen on Figure 5 and 6, users can query administrative and descriptive metadata fields and use a molecular editor to sketch substructures that will be used as a search filter. Results vary depending on the privileges that the user possesses towards CREATE calculations. They are defined by fine grained access rules set at user, group and others level, like UNIX system file rights. Next to JSmol visualization, another remarkable action is the HTML resume (see Figure 7). Using XSLT style sheets, ioChem-BD is able to generate a fully compliant HTML5 resume that implements features such as: one page presentation, all datasets are exportable to other formats, compact drop-down content, device-responsive, and its most valuable feature: fully customizable with new data fields without the need to upgrade the platform. CREATE PUBLICATION MECHANISM Communication between both ioChem-BD modules is currently unidirectional, from CREATE to BROWSE modules through a process called "Content publication". Publishing allows importing single calculations or groups of them to the BROWSE module, to generate assets like reports. To complete this step, it is only necessary to name calculations and mark them for publication. The remaining process, REST API communications, is invisible to the user. Because both modules are written as Java web services, publication mechanism is done via servlets and published files are bundled in DSpace METS SIP41 format during its ingestion in BROWSE module. From this step onwards published calculations will be called ‘items’. As a result of the publication process, a group of URL handles referring published items are presented. 
These links point to public HTML pages in the BROWSE module with the following content: (i) visualization of the final calculation geometry with JSmol, (ii) an expandable summary of the item’s metadata, (iii) a summary of the most relevant data in HTML5 format, (iv) a list of downloadable content such as input files, (v) support files and gray literature associated with calculations. Most of these sections can be mapped to CREATE Actions, as they share the same conversion style sheets. Therefore, results are coherent across both modules.
Figure 7. Every uploaded calculation has a group of actions associated with it. One of them is an HTML summary that displays its most remarkable fields. Such summary can be customized to fulfill researchers’ needs and to adapt to future requirements.
Another feature delivered in HTML5 reports is the visual representation of data. A reference to Highcharts (a JavaScript charting library)38 has been included in all generated reports. This inclusion eases the conversion of plain data into interactive visual elements using (among others) line, scatter or column charts. The same approach can be replicated with the many third-party plugins that exist today in the chemistry field. In addition, this report file can contain other rich content objects such as third-party plugins, navigable data tables and interactive graphics, among others. An example of ioChem-BD’s pluggability with external tools is the integration of the JCAMP-MOL IR Spectrum Viewer applet38 into the HTML5 report generation engine. During the development of this engine, there was a need to include an IR viewer so that calculated vibrational frequencies could be displayed as an additional visual field inside the summary. To do so, a Java servlet was created to convert CML calculations to Jcamp-DX40 compliant output text by means of XSLT transformations.
Now calling this servlet with a calculation ID will return its vibrational information in Jcamp-DX format, so appending the applet tag calling this servlet inside our report did the job and no major code development was needed. BROWSE MODULE The BROWSE module consists of a heavily modified version of DSpace digital repository.17 It has been adapted to fulfill our requirements, mainly in quantum chemistry data representation and in external services communication. Some workflows have been copied from the CREATE module to have a similar behavior between them. One of the main features in BROWSE (DSpace) instances is that they can communicate between them using OAI-PMH protocol20 to share item metadata. This allows building a public distributed network of theoretical chemical repositories, which will become a great advance in term of information socialization. The module works by default with Dublin Core metadata schema,42 which is good on capturing the most basic bibliographic information about any digital asset, but cannot hold the description of quantum chemistry documents. However, this module is versatile enough to expand its metadata schemas with new ones, so we have created a schema focused on computational chemistry field. Among other interesting features, the BROWSE module accepts browsing and searching content, such content can be embargoed, exported or syndicated depending on users’ needs. A notable aspect of the BROWSE module is its ability to display Supporting Information and other derived chemical reports build on CREATE module. As a brief overview, Supporting Information documents are normally composed of one or several chemical structures (normally on XYZ format) from a series of related calculations. It can also contain extra information fields such as final energies, vibrational frequencies, spin angular momentum, etc. 
Supporting Information documents are normally stored in heterogeneous locations such as public FTP servers, private web servers, cloud storage services, etc., depending on the data publication policy of each research center. These documents are later referenced by journal papers as additional information related to the research. Usually they are generated manually in a tedious, time-consuming and error-prone process that sometimes results in non-portable digital documents (one, maybe two stars out of five in the Open Data scheme).43 We try to remove this ineffective procedure with the Supporting Information generator that is integrated in the CREATE module and whose results are displayed in BROWSE (Figure 8). It uses XSL-FO, an open formatting-object definition language, as a bridge between raw data and multi-format output. The report engine is fed with CML calculations from a user selection in the CREATE main panel tree. After setting them in session, the user chooses to create a new report from them. In this case, we choose Supporting Information as the report type. A fast XPath query returns additional fields (like final energies) that exist among these calculations and that enable or disable additional report generation options. Using a similar process, ioChem-BD is able to build a growing set of reports. In this case, we can opt to generate two types of output: a ready-to-download multi-format XSL-FO document (similar to the Supporting Information report), or an HTML5 web page that opens in a new tab. This last option is extremely versatile because it opens the door to adding third-party plugins and other dynamic content to the report, a more powerful way to display results. As an example of this functionality, we describe the generation of a reaction energy profile report. CREATE users need to select a group of calculations and define a set of formulas that constitute the energy steps.
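To make the notion of energy-step formulas concrete, here is a minimal sketch of how such a profile could be computed. It assumes that each formula is a linear combination of the final energies of the selected calculations and that the profile is reported relative to the first step; the energies, labels and function names are invented for illustration and do not describe the internal format used by ioChem-BD.

```python
HARTREE_TO_KCAL_MOL = 627.5095  # hartree to kcal/mol conversion factor

# Hypothetical final energies (hartree) keyed by calculation name.
energies = {"A": -100.000, "B": -50.000, "TS": -149.970, "P": -150.020}

# Each step is a "formula": calculation name mapped to a stoichiometric coefficient.
steps = [
    ("reactants", {"A": 1, "B": 1}),
    ("transition state", {"TS": 1}),
    ("product", {"P": 1}),
]


def step_energy(formula, energies):
    """Evaluate one formula as a weighted sum of calculation energies."""
    return sum(coeff * energies[name] for name, coeff in formula.items())


def energy_profile(steps, energies):
    """Return [(label, relative energy in kcal/mol)] with the first step as zero."""
    absolute = [(label, step_energy(formula, energies)) for label, formula in steps]
    reference = absolute[0][1]
    return [(label, (e - reference) * HARTREE_TO_KCAL_MOL) for label, e in absolute]


profile = energy_profile(steps, energies)
for label, rel in profile:
    print(f"{label:>16s} {rel:8.2f}")
```

The resulting list of (label, relative energy) pairs is the kind of series a charting library can plot directly as a reaction profile.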
The report engine will build a dynamic, device-responsive HTML5 report in our browser displaying an energy profile chart for such calculations (Figure 9). Figure 9. An example of dynamically generated report. Based on user calculation selection and the definition of multiple energy reaction formulas, our platform is able to build and output reaction energy profile reports. In terms of programming code, there is an abstract class defined for Reports, so new classes can implement its functions and the ReportManager class will load them appending new report types dynamically with no need to alter our existing code. Extensions to more complex outputs like R language code snippets44 or Jmol scripts45 are envisaged. Figure 8. Supporting Information report generation workflow. Starting from a user selection of calculations, the module extracts its molecular geometry (among other fields like final energies) to bundle them into a single XML file. Following iterations will convert it to XSL-FO format and then to user’s desired output format. After setting up report fields, the engine will extract the final geometries and other fields from chosen calculations. Then they are joined into a single CML file. Next step in report generation is to convert CML to XSL-FO document; with some more XSLT work we obtain a XSL-FO document ready to be converted based on users choice to any kind of digital document such as PDF, TXT, CSV, etc. Inside ioChem-BD all content derived from calculations is built under demand and then streamed to the user’s browser. There is a minimal performance loss using this dynamic generation approach but we enormously reduce disk space requirements and increase in data veracity avoiding the massive storage of formatted content that over time can become outdated or partial. Current developments in ioChem-BD are focused on the publication in the BROWSE module of calculation reports. 
At present, reports can only be generated in the CREATE module, but in the near future it will be possible to generate a public handle inside BROWSE that points to a report generation page that, depending on its URL parameters, outputs its results in multiple formats.
SYSTEM ADAPTABILITY AND SAFETY CONSIDERATIONS
Dynamic data definition and capture is a requirement of today’s computational chemistry sector. Quantum chemistry software vendors periodically release new versions of their products with new functionalities, bug fixes, data representation changes, new chemical properties, calculation methods or atomic basis sets; on the other side, the chemists who use this software demand more analysis tools and higher levels of data representation. This constant flow of structural and representational data changes defines a list of requirements that our software tries to fulfill with loosely coupled data management rules. With our customized JUMBOconverters library we can expand our data capture rules just by expanding its XML template definitions. We can also modify metadata capture and the presentation of data to the final user by modifying the internal XSLT style sheets. Mastering the skills necessary to modify and expand these rules involves a gentle learning curve, since they are based on open and well-documented standards. Therefore, every research group can adapt its ioChem-BD instance to its requirements without the need for an external programmer. The user authentication mechanism has been implemented with Jasig CAS SSO Server.24 Its session management service allows us to append new independent web services in a modular fashion without the need to implement user credential management inside our modules. In ioChem-BD, data processing documentation has the same relevance as the processes it tries to describe. Outdated documentation on a system as dynamic as the ioChem-BD environment will unavoidably lead to confusion.
Users cannot track down recent changes, and the reimplementation of already existing extraction rules becomes hard to avoid. In addition, such rules are defined in XML, a cryptic language that is hard to read unless it is converted to a user-friendly format. These new requirements led us to develop a toolkit that manipulates JUMBOconverters capture templates to build an SGML/XML DocBook fileset.46 We use it as a neutral bridge format for later conversion into a hierarchical group of web pages in WebHelp format. The documentation generation process is triggered on every template modification, and its output becomes instantly accessible to all CREATE users for reference. This effectively prevents the documentation from becoming outdated. All content managed inside ioChem-BD is under access control, even published items. In the CREATE module, calculation content is restricted at the user/group/others level. In the BROWSE module, content can define fine-grained access rules and also set content embargoes depending on third-party publication requirements. Splitting the system into two separate modules that should be installed on separate web servers increases the overall security of the system. The CREATE module holds internal research data and should be deployed on internal web servers with few open ports, used to receive uploaded calculations from HPC clusters and for the publication mechanism. The BROWSE module can be moved to a public web server, where published items reside and are referenced by their handles. The whole system relies on the HTTPS protocol for communication among users and modules, to ensure that data is always encrypted when transferred. There is an additional CAS module in charge of user validation that uses tokens for single sign-on / single sign-off session management, which greatly simplifies the session management code and detaches it from our modules.
CONCLUSIONS
The massive use of simulation techniques in chemical research generates huge amounts of information, a situation that is starting to be known as “the Big Data problem”. The main obstacle to managing enormous volumes of information is storing it in a way that facilitates data mining as a strategy to optimize the processes that allow scientists to face the challenges of sustainability, knowledge, and the rational use of existing resources. We created ioChem-BD as a group of services in the cloud to manage computational chemistry input and output files. Like other database-related projects, our platform relies on well-defined standards, and it implements treatment, hierarchical storage and data recovery tools to facilitate data mining. This software implements new methodological strategies that promote an optimal re-use of results and accumulated knowledge, and that improve researchers’ daily productivity. It automates the extraction of relevant data and transforms numerical data into tagged data inside its database. This platform provides tools for the researcher to validate, enrich, publish and share information, and tools for accessing and visualizing data. Other modules allow the automatic creation of both reaction energy profile plots (by combining data of a set of molecular entities) and Supporting Information files. Besides, this platform is capable of performing kinetic analysis from reaction energy profiles and QSSR analysis, or of building data sets for screening, for instance. The final goal is to build a new reference tool in computational chemistry research, to fill the gap between the generation of results and the publication of manuscripts, embedded in bibliography management and services to third parties. Future implementations will include integration with a semantic database by taking advantage of XSLT transformations to create data triples of every uploaded calculation.
With such information we will be in a position to connect our semantic data with other external data sources and to develop a REST API to open bridges between the BROWSE module and third-party data services.43
ASSOCIATED CONTENT
A list of current working instances of ioChem-BD software and a demo server are accessible at www.iochem-bd.org.
AUTHOR INFORMATION
Corresponding Author
moises.alvarez@urv.cat; cbo@iciq.cat
Author Contributions
The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.
ACKNOWLEDGMENTS
Financial support for this work from the AGAUR (ref. 2009 SGR 25, 2014 SGR 199 and 2014 SGR 409) of Generalitat de Catalunya is gratefully acknowledged. We also thank the Spanish Ministry of Science and Innovation (projects CTQ2011-29054-C02-01/BQU; CTQ2011-29054-C02-02/BQU; CTQ2011-27033/BQU; CTQ2012-3382/BQU; CTQ2011-23140) and MINECO for support through Severo Ochoa Excellence Accreditation 2014-2018 (SEV-2013-0319). COST Action CM1203 “Polyoxometalate Chemistry for Molecular Nanoscience (PoCheMoN)” and COST Action "ECOSTBio CM1305" are also gratefully acknowledged.
REFERENCES
(1) Lynch, C. Big data: How do your data grow? Nature 2008, 455, 28-29.
(2) Harvey, M. J.; Mason, N. J.; Rzepa, H. S. Digital Data Repositories in Chemistry and Their Integration with Journals and Electronic Notebooks. J. Chem. Inf. Model. 2014. DOI: 10.1021/ci500302p (accessed Sept 17, 2014).
(3) Ramakrishnan, R.; Dral, P. O.; Rupp, M.; von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 2014. DOI: 10.1038/sdata.2014.22 (accessed Sept 22, 2014).
(4) Berners-Lee, T. The next web. Ted Conference. http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html (accessed Sept 17, 2014).
(5) Frey, J. G.; Bird, C. L. Cheminformatics and the semantic web: adding value with linked data and enhanced provenance. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2013, 3, 465-481.
(6) Phadungsukanan, W.; Kraft, M.; Townsend, J.; Murray-Rust, P. The semantics of Chemical Markup Language (CML) for computational chemistry: CompChem. J. Cheminf. 2012, 4, 15.
(7) Chen, M.; Stott, A. C.; Li, S.; Dixon, D. A. Construction of a robust, large-scale, collaborative database for raw data in computational chemistry: The Collaborative Chemistry Database Tool (CCDBT). J. Mol. Graphics Modell. 2012, 34, 67-75.
(8) Adams, S.; de Castro, P.; Echenique, P.; Estrada, J.; Hanwell, M. D.; Murray-Rust, P.; Sherwood, P.; Thomas, J.; Townsend, J. The Quixote project: Collaborative and Open Quantum Chemistry data management in the Internet age. J. Cheminf. 2011, 3, 38.
(9) The Materials Project Home Page. https://www.materialsproject.org (accessed Sept 22, 2014).
(10) Hummelshøj, J. S.; Abild-Pedersen, F.; Studt, F.; Bligaard, T.; Nørskov, J. K. CatApp: A Web Application for Surface Chemistry and Heterogeneous Catalysis. Angew. Chem. Int. Ed. 2012, 51, 272-274.
(11) World Wide Web Consortium. Extensible Markup Language (XML) 1.0 (Third Edition) specification. http://www.w3.org/TR/REC-xml (accessed Sept 17, 2014).
(12) Java schema validation class, javadoc definition. http://docs.oracle.com/javase/7/docs/api/javax/xml/validation/Validator.html (accessed Sept 17, 2014).
(13) Adams, N.; Cannon, E. O.; Murray-Rust, P. ChemAxiom – an ontological framework for chemistry in science. Nature Precedings, 2009. DOI: 10.1038/npre.2009.3714.1 (accessed Sept 22, 2014).
(14) World Wide Web Consortium. XML Path Language - Version 1.0. http://www.w3.org/TR/xpath (accessed Sept 17, 2014).
(15) World Wide Web Consortium. XSL Transformations (XSLT) - Version 1.0 - W3C Recommendation 16 November 1999. http://www.w3.org/TR/xslt (accessed Sept 17, 2014).
(16) HTML5 - A vocabulary and associated APIs for HTML and XHTML. http://www.w3.org/TR/2012/CR-html5-20121217/ (accessed Sept 17, 2014).
(17) Smith, M.; Barton, M.; Bass, M.; Branschofsky, M.; McClellan, G.; Stuve, D.; Walker, J. H. DSpace: An open source dynamic digital repository. D-Lib Magazine, Jan 2003, 9.
(18) Apache Lucene. A high-performance, full-featured text search engine library. http://lucene.apache.org (accessed Sept 17, 2014).
(19) Jmol Home Page. http://jmol.sourceforge.net/ (accessed Sept 17, 2014).
(20) Lagoze, C.; Van de Sompel, H. In The open archives initiative: building a low-barrier interoperability framework. Proceedings of the 1st ACM/IEEE-CS joint conference on Digital libraries, New York, NY, USA, ACM, 2001.
(21) Gartner, R. METS: Metadata Encoding and Transmission Standard. JISC Techwatch report TSW, Oct 2002, 2-5.
(22) Allinson, J.; François, S.; Lewis, S. Sword: Simple webservice offering repository deposit. Ariadne, Jan 2008, 54.
(23) HTTP over TLS description. https://tools.ietf.org/html/rfc2818/ (accessed Sept 17, 2014).
(24) Addison, M. S.; Battaglia, S.; Petro, A. Jasig CAS Documentation. http://jasig.github.io/cas/4.0.0/index.html (accessed Sept 17, 2014).
(25) Gaussian Home Page. http://www.gaussian.com (accessed Sept 17, 2014).
(26) ADF Home Page. http://www.scm.com/ADF (accessed Sept 17, 2014).
(27) VASP Home Page. http://www.vasp.at (accessed Sept 17, 2014).
(28) SIESTA Home Page. http://departments.icmab.es/leem/siesta (accessed Sept 17, 2014).
(29) Turbomole Home Page. http://www.turbomole.com (accessed Sept 17, 2014).
(30) Molcas Home Page. http://www.molcas.org (accessed Sept 17, 2014).
(31) Orca Home Page. http://cec.mpg.de/forum (accessed Sept 17, 2014).
(32) JUMBOconverters Main project page. https://bitbucket.org/wwmm/jumbo-converters (accessed Sept 17, 2014).
(33) Murray-Rust, P.; Townsend, J.; Adams, S. E.; Phadungsukanan, W.; Thomas, J. The semantics of Chemical Markup Language (CML): dictionaries and conventions. J. Cheminf. 2011, 3, 43.
(34) Cambridge Structural Database Home Page. http://www.ccdc.cam.ac.uk/Solutions/CSDSystem/Pages/CSD.aspx (accessed Sept 17, 2014).
(35) Crystallography Open Database Home Page. http://www.crystallography.net/ (accessed Sept 17, 2014).
(36) JChem Base - Chemical interface to relational database engines. http://www.chemaxon.com/products/jchem-base (accessed Sept 17, 2014).
(37) JSmol, sourceforge project. http://sourceforge.net/projects/jsmol/ (accessed Sept 17, 2014).
(38) Highcharts Home Page. http://www.highcharts.com (accessed Sept 17, 2014).
(39) Hanson, R. M.; Lancashire, R. J. In JCAMP-MOL: A JCAMP-DX extension to allow interactive model/spectrum exploration using Jmol and JSpecView. The ACS 2013 symposium on exchangeable data formats, Sept 11, Indiana, IN, USA, Am. Chem. Soc. 2013.
(40) IUPAC CPEP Subcommittee on Electronic Data Standards Home Page. http://www.jcamp-dx.org (accessed Sept 17, 2014).
(41) DSpace METS Document Profile for Submission Information Packages (SIP). https://wiki.duraspace.org/display/DSPACE/DSpaceMETSSIPProfile (accessed Sept 17, 2014).
(42) DCMI Metadata Terms definition page. http://dublincore.org/documents/dcmi-terms/ (accessed Sept 22, 2014).
(43) Five star open data Home Page. http://5stardata.info (accessed Sept 17, 2014).
(44) The R Project for Statistical Computing. http://www.r-project.org (accessed Sept 17, 2014).
(45) Jmol/JSmol interactive scripting documentation. http://chemapps.stolaf.edu/jmol/docs (accessed Sept 17, 2014).
(46) Ortiz, I. M.; Moreno, P.; Sierra, J. L.; Manjón, B. F. Using DocBook and XML Technologies to Create Adaptive Learning Content in Technical Domains. Int. J. Comput. Sci., App. 2006, 3, 91-108.