Report - Semagrow
ICT Seventh Framework Programme (ICT FP7)
Grant Agreement No: 318497
Data Intensive Techniques to Boost the Real-Time Performance of Global Agricultural Data Infrastructures

D5.1.2 Semantic Store Infrastructure

Deliverable Form
Project Reference No.: ICT FP7 318497
Deliverable No.: D5.1.2
Relevant Workpackage: WP5: Semantic Infrastructure
Nature: P
Dissemination Level: PU
Document version: V2.0
Date: 21/11/2014
Authors: IPB, UAH, NCSR-D
Document description: This document describes the development, integration, and deployment effort required to establish the computational and software infrastructure needed to test and pilot SemaGrow technologies.

Document History
Version | Date | Author (Partner) | Remarks
ToC v0.1 | 20/09/2013 | IPB | Draft version of the ToC.
ToC v0.2 | 01/10/2013 | NCSR-D, UAH | Final version of the ToC.
Draft v0.3 | 15/10/2013 | IPB | Contribution from IPB.
Draft v0.4 | 31/10/2013 | NCSR-D | Contribution from NCSR-D and UAH.
Draft v0.9 | 1/11/2013 | NCSR-D, SWC | Internal review.
Final v1.0 | 11/11/2013 | IPB | Delivered as D5.1.1.
Draft v1.1 | 3/11/2014 | IPB, FAO | Added description of the FAO Web Crawler Database population and hosting at IPB (Sect. 4.3).
Draft v1.2 | 7/11/2014 | NCSR-D, UAH | Added description of the rdfCDF Toolkit and its connection to Repository Integration (Sect. 4.2).
Draft v1.3 | 14/11/2014 | UAH, IPB, NCSR-D | Added descriptions of various infrastructure tools deployed at NCSR-D and IPB (Chap. 4).
Draft v1.9 | 19/11/2014 | NCSR-D, SWC | Internal review.
Final v2.0 | 21/11/2014 | IPB | Delivered as D5.1.2.

EXECUTIVE SUMMARY

This document describes the development, integration, and deployment effort required to establish the computational and software infrastructure needed to test and pilot SemaGrow technologies.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
LIST OF TERMS AND ABBREVIATIONS
1. INTRODUCTION
   1.1 Purpose and Scope
   1.2 Approach
   1.3 Relation to other Workpackages and Deliverables
   1.4 Big Data Aspects
2. COMPUTATIONAL INFRASTRUCTURE
   2.1 PARADOX III cluster
   2.2 PARADOX IV cluster
   2.3 4store cluster
3. ACCESS TO THE COMPUTATIONAL INFRASTRUCTURE
   3.1 PARADOX batch system
   3.2 EMI-based Grid layer
   3.3 gUSE/WS-PGRADE portal layer
   3.4 RESTful interface
4. SOFTWARE INFRASTRUCTURE
   4.1 Toolkit For Repository Integration
   4.2 rdfCDF Toolkit
   4.3 Crawler Database Population and Hosting
   4.4 Clean AGRIS Date/Time Service
   4.5 SemaGrowREST
5. REFERENCES
LIST OF FIGURES

Figure 1: PARADOX III cluster.
Figure 2: PARADOX IV installation.
Figure 3: WS-PGRADE workflow example.
Figure 4: Architecture of the toolkit for repository integration.

LIST OF TABLES

Table 1: List of SemaGrow gLite services.
Table 2: Characteristic examples of date cleaning.
LIST OF TERMS AND ABBREVIATIONS

API – Application Programming Interface
BDII – Berkeley Database Information Index
CA – Certification Authority
CE – Computing Element
CLI – Command Line Interface
DB – Database
DC – Dublin Core
DCI – Distributed Computing Infrastructure
DPM – Disk Pool Manager
EMI – European Middleware Initiative
GP-GPU – General-purpose Computing on GPU
GPU – Graphics Processing Unit
GT – Globus Toolkit, a software framework for implementing computational grids
GUI – Graphical User Interface
HTTP – Hypertext Transfer Protocol
IEEE – Institute of Electrical and Electronics Engineers
JAR – Java Archive
JSON – JavaScript Object Notation
LFC – Logical File Catalogue
LOM – Learning Object Metadata
MCPD – Multi-crop Passport Descriptors
MPI – Message Passing Interface
MPICH – A high-performance and widely portable implementation of the Message Passing Interface
OS – Operating System
PBS – Portable Batch System
RAM – Random-access memory
RDF – Resource Description Framework
REST – Representational State Transfer, an architectural style for invoking web services over HTTP
SDMX – Statistical Data and Metadata Exchange
SE – Storage Element
SG – Science Gateway
SOA – Service Oriented Architecture
SPARQL – SPARQL Protocol and RDF Query Language
URL – Uniform Resource Locator
VO – Virtual Organisation
VOMS – Virtual Organization Membership Service
WMS – Workload Management System
XML – Extensible Markup Language
XSL – Extensible Stylesheet Language
XSLT – Extensible Stylesheet Language Transformations

1. INTRODUCTION

1.1 Purpose and Scope

The aim of this document is to present the semantic store infrastructure deployed for the project at the Institute of Physics Belgrade (IPB) data center, together with all developed and integrated components. The document describes the deployed large-scale computational infrastructure used for the project's experiments on distributed semantic stores, as well as the software infrastructure developed or integrated to support the SemaGrow experiments and pilots.

1.2 Approach

The deployed large-scale computational infrastructure used for the project's experiments is described in Chapter 2 and Chapter 3. Chapter 2 gives details on the technical characteristics of the infrastructure, while Chapter 3 lists the available end-user interfaces that enable access to the infrastructure. Chapter 4 documents software developed in order to establish the infrastructure required for deploying SemaGrow and carrying out experiments and pilots.

1.3 Relation to other Workpackages and Deliverables

The development, integration, and deployment effort documented in this deliverable is carried out within Task 5.1 of workpackage WP5. This work is needed for pilot deployment (Deliverable 6.2.1). The relationship between the software developed in this task and the software developed in Task 6.2 Pilot Deployment is as follows:

The software developed in Task 5.1 and documented in Chapter 4 of this deliverable is part of the SemaGrow ecosystem, developed by the technical partners, but is not a core research prototype. Its development was necessary in order to be able to run realistic pilots, but the development itself was not part of piloting.
The software developed in Task 6.2 and documented in Deliverable 6.2.1 is client-side software developed by the use case partners. Its development was part of the piloting effort, in the sense that it was used to evaluate the impact of SemaGrow technologies on developing clients that consume distributed data.

1.4 Big Data Aspects

Chapter 2 documents the computational infrastructure used to prepare and execute large-scale experiments in SemaGrow. Specifically:
- PARADOX III (Section 2.1) has been used for the large-scale triplification of NetCDF and XML data, using the triplifiers developed in WP2 (cf. D2.2.2 and also Sect. 4.2) and the software infrastructure developed in this task (Chapter 3).
- PARADOX IV (Section 2.2) has been used for crawler database population.

During the third year:
- PARADOX IV will be used for ISI-MIP hosting, as well as for the creation of super-large-scale datasets using the data generator.
- The 4store cluster (Section 2.3) will be used to populate and host the Crawler Database (Section 4.3) service for the whole duration of the final project year.

Furthermore, Section 4.2 documents the development of the software infrastructure needed to expose large-scale NetCDF datasets without triplification. This will be used to compare the performance of serving triplified NetCDF datasets against serving over the raw data. This is particularly important for the large-scale NetCDF datasets used in the Heterogeneous Data Collections and Streams use cases, since their size makes their duplication in RDF stores a major concern, and experimentation is needed in order to understand and measure the performance gained by indexing in RDF stores.

2. COMPUTATIONAL INFRASTRUCTURE

The Institute of Physics Belgrade (IPB) provides the SemaGrow large-scale computational infrastructure for the project's experiments on distributed semantic stores. The same infrastructure is used for heterogeneous repository integration by the toolkit described in Chapter 4, as well as by the AgroTagger component provided by FAO. The infrastructure is organized in two clusters, PARADOX III and PARADOX IV, whose technical characteristics are given in this chapter. Besides the clustered resources, IPB provides additional hardware resources, services that support various access channels to the clusters, as well as SemaGrow-specific services (Triplestore server, 4store cluster, RESTful interface server). The technical characteristics of these additional hardware resources used for the installation of various services are given in Chapter 3.

2.1 PARADOX III cluster

The PARADOX III cluster (Figure 1) consists of 88 computing nodes, each with two quad-core Intel Xeon E5345 processors at 2.33 GHz, totalling 704 processor cores. Each node contains 8 GB of RAM and 100 GB of local disk space. In addition to the scratch space (local disk space), PARADOX III provides up to 50 TB of disk storage that is shared between the machines. Nodes are interconnected in a star-topology Gigabit Ethernet network through three stacked high-throughput Layer 3 switches, each node being connected to the switch by two Gigabit Ethernet cables arranged in a channel bonding configuration. PARADOX III resources are available and can be accessed through various layers and gateways (Chapter 3), which are installed on 15 additional Xeon-based service nodes. In addition to standard applications developed using the serial programming approach, the cluster is optimized and heavily tested for parallel processing using the MPICH [2], MPICH-2, and OpenMPI [3] frameworks.

Figure 1: PARADOX III cluster.
2.2 PARADOX IV cluster

The fourth major upgrade of the PARADOX installation (PARADOX IV, shown in Figure 2) consists of 106 working nodes and 3 service nodes. Working nodes (HP ProLiant SL250s Gen8, 2U height) are configured with two Intel Xeon E5-2670 8-core Sandy Bridge processors at a frequency of 2.6 GHz and 32 GB of RAM (2 GB per CPU core). The total number of new processor cores in the cluster is 1696, in addition to the PARADOX III resources. Each working node contains an additional GP-GPU card (NVIDIA Tesla M2090) with 6 GB of RAM. With a total of 106 NVIDIA Tesla M2090 graphics cards, PARADOX IV is a premier computing resource in the wider region, providing access to a large production GPU cluster and new technology. The peak computing power of PARADOX IV is 105 TFlops, which is about 18 times more than the previous PARADOX III installation.

One service node (HP DL380p Gen8), equipped with a 10 Gbps uplink, is dedicated to cluster management and user access (gateway machine). All cluster nodes are interconnected via InfiniBand QDR technology, through a non-blocking 144-port Mellanox QDR InfiniBand switch. The communication speed of all nodes is 40 Gbps in both directions, which is a qualitative step forward over the previous (Gigabit Ethernet) PARADOX installation. The administration of the cluster is enabled by an independent network connection through the iLO (Integrated Lights-Out) interface integrated on the motherboards of all nodes.

PARADOX IV also provides a data storage system, which consists of two service nodes (HP DL380p Gen8) and 5 additional disk enclosures. One disk enclosure is configured with 12 SAS drives of 300 GB (3.6 TB in total), while the other four disk enclosures are each configured with 12 SATA drives of 2 TB (96 TB in total), so that the cluster provides around 100 TB of storage space. Storage space is distributed via the Lustre high-performance parallel file system, which uses InfiniBand technology and is available both on working and service nodes.

The new PARADOX IV cluster is installed in four water-cooled racks, while an additional 3 racks contain the PARADOX III equipment. The cooling system consists of 7 cooling modules (one within each rack), which are connected via a system of pipes to a large industrial chiller and configured so as to minimize power consumption.

Figure 2: PARADOX IV installation.

2.3 4store cluster

4store (Garlik) is a scalable and stable RDF database that stores RDF triples in quad format. The 4store source code has been made available under the GNU General Public Licence version 3 and, together with the documentation, can be obtained from http://4store.org/. For SemaGrow purposes, a dedicated 4store cluster has been deployed within the IPB data center. The cluster consists of 8 quad-core Intel Xeon E5345 processors at 2.33 GHz, with 4 GB of RAM and 1 TB of disk space per processor. Nodes are interconnected in a star-topology Gigabit Ethernet network through a Layer 3 switch. The cluster is accessible via a SPARQL HTTP protocol server, which can answer SPARQL queries using the standard SPARQL HTTP query protocol.
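To illustrate how a client interacts with such an endpoint, the following is a minimal sketch of issuing a SPARQL query over the standard SPARQL HTTP query protocol from Java. The endpoint URL and the query are placeholders; the actual host name and graph contents of the IPB 4store cluster are not specified in this document.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SparqlHttpClientExample {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; substitute the actual 4store SPARQL server address.
        String endpoint = "http://4store.example.ipb.ac.rs/sparql/";
        String query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10";

        // The SPARQL HTTP query protocol accepts the query as the 'query' parameter.
        URL url = new URL(endpoint + "?query=" + URLEncoder.encode(query, "UTF-8"));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // Ask for SPARQL results in XML; JSON could be requested instead.
        conn.setRequestProperty("Accept", "application/sparql-results+xml");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // print the raw result document
            }
        }
    }
}
```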
3. ACCESS TO THE COMPUTATIONAL INFRASTRUCTURE

The hardware resources organized in the PARADOX III and PARADOX IV clusters and described in Chapter 2 are available through three general-purpose layers (the batch system, the Grid layer, and the gUSE/WS-PGRADE portal layer), as well as through a SemaGrow-dedicated RESTful interface.

3.1 PARADOX batch system

The PARADOX batch system uses the open-source Torque resource manager [4], also known as PBS or OpenPBS [5], and the Maui batch scheduler [6]. The batch system provides commands for job management that allow users to submit, monitor, and delete jobs, and has the following main components:
- pbs_server (job server), providing the basic batch services such as receiving/creating a batch job, modifying the job, protecting the job against system crashes, and running the job;
- pbs_mom (job executor), a daemon that places the job into execution when it receives a copy of the job from the pbs_server; pbs_mom creates a new session with an environment identical to a user login session and returns the job's output to the user;
- Maui (job scheduler), a daemon that contains the site's policies controlling job priorities, where and when jobs will be executed, etc. The Maui scheduler communicates with the various pbs_mom instances to learn about the state of the system's resources, and with the pbs_server to learn about the availability of jobs to execute.

In order to use the PARADOX batch system, a user has to access the PARADOX gateway machine (ui.ipb.ac.rs), create a job script (quite similar to a standard shell script), submit the job script file to the dedicated semagrow queue, and monitor the job. All technical details, recommendations, and PARADOX batch system usage instructions are described in the PARADOX User Guide [7].

3.2 EMI-based Grid layer

The gLite middleware [8] is a product of a number of current and past Grid projects, such as DataGrid, DataTag, Globus, EGEE, WLCG, and EMI. Through an almost decade-long development process, gLite has become one of the most popular frameworks for building applications that tap into geographically distributed computing and storage resources. The gLite middleware is distributed as a set of software components providing services to discover, access, allocate, and monitor shared resources in a secure way and according to well-defined policies. These services form an intermediate layer (middleware) between the physical resources and the applications. Its architecture follows the Service Oriented Architecture (SOA) paradigm, simplifying interoperability among different, heterogeneous Grid services and allowing easier compliance with upcoming standards. The complex structure of the gLite technology can generally be divided into four main groups:

- Security services concern authentication, authorization, and auditing. An important role in the context of authorization is played by the gLite Virtual Organization Membership Service (VOMS), which allows fine-grained access control. Using a short-lived proxy to minimize the risk of identity compromise, the user is authenticated on the Grid infrastructure, while for long-running jobs the MyProxy (PX) service provides a proxy renewal mechanism that keeps the job proxy valid for as long as needed.
- Job management services concern the execution and control of computational jobs for their whole lifetime throughout the Grid infrastructure. In gLite terminology, the Computing Element (CE) provides an interface to access and manage a computing resource, typically consisting of a batch queue of a cluster farm. The Workload Management System (WMS) provides a meta-scheduler that dispatches jobs to the available CEs best suited to run the user's job, according to the job's requirements and well-defined VO-level and resource-level policies. Job status tracking during the job's lifetime and after its end is performed by the Logging and Bookkeeping (LB) service.
- Data management services concern the access, transfer, and cataloguing of data. The granularity of data access control in gLite is at the file level. The Storage Element (SE) provides an interface to a storage resource, ranging from a simple disk server to complex hierarchical tape storage systems. The gLite LCG File Catalogue (LFC) service keeps track of the location of files (as well as the relevant metadata) and replicas distributed in the Grid.
- Information and monitoring services provide mechanisms to collect and publish information about the dynamic state of Grid services and resources, as well as to discover them. gLite has adopted two information systems: the Berkeley DB Information Index (BDII) and the Relational Grid Monitoring Architecture (R-GMA).

The SemaGrow gLite services are located at IPB, and their technical characteristics are given in Table 1. All services are installed on dual-core Intel Xeon machines with 4 or more GB of RAM per machine. The machines run the Scientific Linux Operating System [9] and the EMI-3 middleware, which is regularly upgraded to the latest version.

Table 1: List of SemaGrow gLite services.
gLite service | Service endpoint | Technical characteristics
VOMS | voms.ipb.ac.rs | Dual-core Xeon 3060 @ 2.40 GHz, 4 GB RAM
PX | myproxy.ipb.ac.rs | Dual-core Xeon 3060 @ 2.40 GHz, 4 GB RAM
APEL | apel.ipb.ac.rs | Dual-core Xeon 3060 @ 2.40 GHz, 6 GB RAM
CREAM CE | ce64.ipb.ac.rs | Dual-core Xeon 3060 @ 2.40 GHz, 4 GB RAM
CREAM CE | cream.ipb.ac.rs | Dual-core Xeon E3110 @ 3.00 GHz, 4 GB RAM
WMS/LB | wms.ipb.ac.rs | Dual-core Xeon E3110 @ 3.00 GHz, 8 GB RAM
WMS/LB | wms-aegis.ipb.ac.rs | Dual-core Xeon 3060 @ 2.40 GHz, 8 GB RAM
DPM SE | dpm.ipb.ac.rs | Dual-core Xeon 3060 @ 2.40 GHz, 4 GB RAM
LFC | lfc.ipb.ac.rs | Dual-core Xeon 3060 @ 2.40 GHz, 4 GB RAM
BDII | bdii.ipb.ac.rs | Dual-core Xeon 3060 @ 2.40 GHz, 4 GB RAM

In order to use the PARADOX resources, also known as the AEGIS01-IPB-SCL Grid site, through the Grid layer, a user has to be authenticated with a personal X.509 Grid certificate obtained from the corresponding national Certification Authority [10], and authorized at the SemaGrow-dedicated (vo.semagrow.rs) VOMS service [11]. The usage of the gLite interface is described in detail in the gLite User Guide [12].

3.3 gUSE/WS-PGRADE portal layer

The gUSE/WS-PGRADE portal [13] includes a set of high-level Grid services (workflow manager, storage, broker, grid submitters for various types of Grids, etc.) and a graphical portal service based on Liferay technology [14]. gUSE is implemented as a set of web services that bind together in flexible ways, on demand, to deliver user services in Grid and/or Web services environments. User interfaces for gUSE services are provided by the WS-PGRADE web application. WS-PGRADE uses its own XML-based workflow language with a number of features: advanced parameter study features through special workflow entities (generator and collector jobs, parametric files), diverse distributed computing infrastructure (DCI) support, condition-dependent workflow execution, and workflow embedding support.
The structure of WS-PGRADE workflows can be represented by a directed acyclic graph (DAG), as illustrated in Figure 3. Big yellow boxes represent the job nodes of the workflow, whereas the smaller grey and green boxes attached to them represent the input and output file connectors (ports) of the given node. Directed edges of the graph represent data dependencies (and the corresponding file transfers) among the workflow nodes. The execution of a workflow instance is data driven and forced by the graph structure: a node is activated (the associated job submitted) when the required input data elements (usually a file, or a set of files) become available at each input port of the node. This node execution is represented as an instance of the created job. One node can be activated with several input sets (for example, in the case of a parameter sweep node) and each activation results in a new job instance. The job instances also contain status information and, in the case of successful termination, the results of the calculation are represented as data entities associated with the output ports of the corresponding node.

Figure 3: WS-PGRADE workflow example.

The SemaGrow-dedicated gUSE/WS-PGRADE portal is provided by IPB. It is hosted on a 2 x quad-core Xeon E5345 @ 2.33 GHz machine with 8 GB of RAM, and is available at http://scibus.ipb.ac.rs/. The WS-PGRADE user interface is described in the WS-PGRADE Cookbook [15].

3.4 RESTful interface

Besides the previously described general-purpose layers, within the SemaGrow project the IPB team has developed a dedicated RESTful interface to the toolkit for large-scale integration of heterogeneous repositories. This interface is developed on top of the CouchDB [16] REST API, which is extended by an additional layer, the CouchDB Proxy. This layer enables authentication with X.509 Grid certificates and X.509 RFC 3820-compliant proxy certificates, as well as tracking of document changes. It also significantly simplifies the extension of CouchDB documents. In parallel with CouchDB, several daemons track changes in the CouchDB documents and respond with corresponding actions (job submission, data management operations, etc.) on the Grid infrastructure. Since each action on the Grid requires authentication and authorization, the daemons are supplied with a robot certificate and are in this way certified on the Grid side.
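The change-tracking daemons are not specified in further detail in this document; the following is a minimal sketch of the general pattern they follow, assuming a daemon that long-polls the standard CouchDB _changes feed and reacts to updated documents. The database name and host are placeholders, the reaction is only logged, and the sketch talks to plain CouchDB rather than through the CouchDB Proxy and its X.509 authentication.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ChangesDaemonSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder CouchDB location and database name.
        String db = "http://couchdb.example.ipb.ac.rs:5984/datasets";
        String since = "0";  // last sequence number already processed

        while (true) {
            // Long-poll the _changes feed; the call returns when a document changes.
            URL url = new URL(db + "/_changes?feed=longpoll&since=" + since);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");

            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line);
                }
            }

            // A real daemon would parse the JSON response, extract the changed document
            // ids and the "last_seq" value (advancing 'since' so that changes are not
            // reprocessed), and trigger the appropriate Grid action: job submission,
            // data management operation, etc.
            System.out.println("Change notification: " + body);
            Thread.sleep(5000);  // crude pacing for this sketch
        }
    }
}
```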
4. SOFTWARE INFRASTRUCTURE

This chapter documents software developed in order to establish the infrastructure required for deploying SemaGrow and carrying out the pilots. Specifically, this chapter documents:
- The Toolkit for Repository Integration, comprising tools for using the large-scale distributed computational infrastructure described above to prepare RDF data stores from non-RDF datasets and to expose the resulting RDF datasets on SPARQL endpoints. This toolkit was used to triplify NetCDF data for the Heterogeneous Data Collections and Streams pilots and XML data for the Reactive Data Discovery pilots.
- The rdfCDF Toolkit, comprising tools for converting and serving NetCDF data, as well as for preparing NetCDF files by consuming data exposed on SPARQL endpoints. rdfCDF integrates the NetCDF triplifier developed in WP2 (UAH, NCSR-D) with the NetCDF Creator and the NetCDF Endpoint developed within this task (UAH, NCSR-D). This toolkit was used to triplify and serve NetCDF data for the Heterogeneous Data Collections and Streams pilots.
- The FAO Web Crawler Database, populated by crawling and semantic annotation software provided by FAO. The SemaGrow effort (IPB) pertains to deploying the software at IPB and exposing the database contents as one of the endpoints to be federated for the Reactive Data Analysis pilots.
- cleanDT, an endpoint that serves a structured and well-formed publication date for AGRIS bibliography entries. The dataset is automatically constructed by applying heuristics (UAH) to the (often) informal values used in the AGRIS publication date field. These dates can be more accurately joined against temporal specifications in other datasets used in the Reactive Data Analysis pilots.
- SemaGrowREST (NCSR-D), a REST API that wraps the Semagrow Stack WebApp (D5.4.3) in a NoSQL-style querying endpoint. This layer reduces the flexibility of what can be queried, but simplifies client development for the cases that it does support in the Reactive Data Discovery pilots.

In the remainder of this chapter we describe the software outlined above and explain its role in the pilots. Please cf. Deliverable 6.2.1 Pilot Deployment for more details about the full data and software suite used in each pilot.

4.1 Toolkit For Repository Integration

The project offers the SemaGrow SPARQL endpoint, which federates SPARQL endpoints over heterogeneous and diverse data sources. However, in several cases data are provided as non-RDF data. Furthermore, due to the heterogeneity of data provider backgrounds and the presence of several self-developed systems, data are made available using different metadata standards. In order to allow the integration of such non-SPARQL endpoints into the framework, during the first year of the project we developed a toolkit for large-scale integration of heterogeneous repositories.

Figure 4: Architecture of the toolkit for repository integration.

The data collections involved in the SemaGrow use cases are expressed in different formats, such as XML and NetCDF [1]. We designate as a dataset a set of such files for each collection, written using the same format and compressed into a single tarball (compressed tar archive). The architecture of the toolkit is illustrated in Figure 4, and is organized into three main blocks:
- a RESTful interface that enables the upload of datasets and keeps technical metadata related to the datasets;
- gridified applications for the conversion of the uploaded datasets to RDF datasets (Converters) and for the upload of the produced RDF to the triplestore (TripleStoreBuilder);
- a triplestore with a SPARQL endpoint.

Dataset providers use the RESTful interface of the toolkit to upload datasets. The upload is usually done in two steps: we first specify an HTTP location from where the dataset can be retrieved and stored on the Grid storage system, and afterwards send this information to the CouchDB via an HTTP POST request. Once the information is stored in CouchDB, the upload of the dataset is performed by the Dataset Uploader component of the RESTful interface. This process is followed up by the Grid Job Submitter component, which triggers the submission of a job to the Grid infrastructure. The aim of the job is to perform the conversion of the original files within the dataset to the corresponding RDF files.
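As an illustration of the first step of this workflow, the following is a minimal sketch of registering a dataset by POSTing its retrieval location to CouchDB from Java. The host, database name, and document fields are placeholders; the actual document schema used by the toolkit is not specified in this document, and a real client would additionally authenticate through the CouchDB Proxy with an X.509 (proxy) certificate.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class DatasetRegistrationSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder location of the toolkit's CouchDB database.
        URL url = new URL("http://couchdb.example.ipb.ac.rs:5984/datasets");

        // Placeholder document: where the tarball can be fetched from and its format.
        String doc = "{"
                + "\"name\": \"agris-xml-2014-10\","
                + "\"format\": \"XML\","
                + "\"source_url\": \"http://data.example.org/agris-xml-2014-10.tar.gz\""
                + "}";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(doc.getBytes(StandardCharsets.UTF_8));
        }

        // CouchDB answers with the id and revision of the newly created document;
        // the Dataset Uploader and Grid Job Submitter then react to it.
        System.out.println("HTTP status: " + conn.getResponseCode());
    }
}
```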
The triplification process is executed by a triplification module specific to the format and schema of each collection. For example, when the original data are expressed in XML and follow a certain metadata schema (e.g. LOM), the triplification module uses an XSL Transformation to produce the RDF triples from the original XML files. After the conversion, the produced RDF is published to the triplestore by another gridified application, designated as the TripleStoreBuilder. Both the original datasets (provided manually) and the produced RDF datasets are stored permanently on the Grid storage system.

Each step in the presented workflow is followed by a modification of the corresponding dataset's CouchDB document. Each modification introduces new technical metadata used for dataset tracking, such as: the location of the original and RDF datasets in the Grid storage system, the number of files within the dataset, relevant timestamps, problems, etc.

The applications ported to the Grid environment (the Converters and the TripleStoreBuilder) are written in the Java programming language. Java is designed to be platform-independent, which significantly simplified their porting. However, in order to avoid problems related to the compatibility of the applications with different Java versions, we have deployed the Java environment to the vo.semagrow.rs Grid software stack together with the applications. For more details on the architecture of the Converters and the TripleStoreBuilder, please cf. Deliverable 2.2.2, Data Streams and Collections.

4.2 rdfCDF Toolkit

The rdfCDF Toolkit serves as the intermediary between datasets in the NetCDF [1] format and the Semantic Web representation technologies and formats used by the SemaGrow Stack. Specifically, the toolkit comprises the following:
- The NetCDF Converter, which performs the off-line triplification of NetCDF data into RDF, following an RDF schema developed in SemaGrow by appropriately extending the Data Cube schema.
- The NetCDF Endpoint, a SPARQL endpoint that operates over unconverted NetCDF files. The NetCDF Endpoint performs on-line, implicit triplification into the same schema as the NetCDF Converter and serves the results via a Sesame SPARQL endpoint. This allows us to establish SPARQL endpoints that serve the original NetCDF data without requiring duplication into RDF stores.
- The NetCDF Creator, a Semagrow Stack client that queries the triplified NetCDF data and uses query results to construct NetCDF files. This allows us to dynamically select sub-datasets according to user-specified restrictions and to combine measurements (datapoints) from different NetCDF files.

The point of this ability to round-trip from data originally stored in NetCDF files back to NetCDF files is that it allows us to offer end-users results collected by combining and filtering the contents of the original NetCDF files, results that could not simply be found in the original collection but need to be dynamically generated. Creating new NetCDF files out of these results is important for the users, since NetCDF files constitute appropriate input for their modelling experiments.
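To make the direction of the conversion concrete, the following is a minimal sketch of the kind of mapping a NetCDF-to-RDF converter performs, reading a NetCDF file with the Unidata netcdf-java library and printing N-Triples. The input file name, resource URIs, and predicate names are placeholders; the actual RDF schema used in SemaGrow extends the Data Cube vocabulary and is specified in D2.2.2.

```java
import java.util.List;
import ucar.ma2.Array;
import ucar.ma2.DataType;
import ucar.nc2.NetcdfFile;
import ucar.nc2.Variable;

public class NetcdfTriplificationSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder input file and base URI for the generated resources.
        String base = "http://semagrow.example.org/netcdf/";
        try (NetcdfFile nc = NetcdfFile.open("precipitation_2010.nc")) {
            List<Variable> variables = nc.getVariables();
            for (Variable v : variables) {
                String varUri = "<" + base + v.getFullName() + ">";
                // Describe the variable itself (placeholder predicate, not the real schema).
                System.out.println(varUri + " <" + base + "hasDimensions> \""
                        + v.getDimensionsString() + "\" .");
                // Emit a few placeholder observation triples for numeric variables.
                if (v.getDataType() == DataType.DOUBLE || v.getDataType() == DataType.FLOAT) {
                    Array data = v.read();
                    for (int i = 0; i < data.getSize() && i < 5; i++) {
                        System.out.println(varUri + " <" + base + "hasValue> \""
                                + data.getDouble(i) + "\" .");
                    }
                }
            }
        }
    }
}
```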
4.3 Crawler Database Population and Hosting

The Crawler Database allows AGRIS Web Portal users to discover Web documents based on their relevance to a given bibliographic entry from the AGRIS database. This functionality exploits semantic annotations of the crawled Web pages, which allow their semantic similarity to the bibliographic entry to be estimated. More specifically, the workflow of the application consists of:
- Crawling: the Apache Nutch Web Crawler (http://nutch.apache.org), customized by FAO, crawls the Web starting from a list of Web sites configured by FAO to focus on the agricultural domain.
- Parsing and annotating: the AgroTagger semantically annotates the crawled documents with AGROVOC (http://aims.fao.org/standards/agrovoc/about) terms. The AgroTagger (https://github.com/agrisfao/agrotagger), developed by FAO, is based on the MAUI Indexer (https://code.google.com/p/maui-indexer). This Java application produces RDF metadata that uses SKOS to link all the documents it receives as input to AGROVOC terms.
- Publishing: the produced RDF is loaded into the IPB 4store cluster (Section 2.3) and exposed via a SPARQL endpoint.

This SPARQL endpoint is the Crawler Database, one of the data sources used in the AGRIS Demonstrator (cf. D6.2.1 Pilot Deployment).

4.4 Clean AGRIS Date/Time Service

The AGRIS dataset follows the BIBO schema and uses the bibo:issued property (http://purl.org/ontology/bibo/issued) to provide publication dates. As a sub-property of dct:date (http://purl.org/dc/terms/date), this property inherits a lack of specificity with respect to its range, which can be any RDF literal. This is appropriate for human consumption and allows the flexibility to leave publication dates under-specified, but it makes it considerably harder to develop queries that, for example, restrict a search within a given month or year. This includes regular expression filtering, since month names are often provided in different languages.

In order to offer the ability to safely author queries that join on dates, the project has added an endpoint that links AGRIS bibliographical entries with cleaned xsd:dateTime values. These are automatically gleaned from the original AGRIS data by cleanDT, a Java application that implements several regular expression heuristics to transform the AGRIS dates into the YYYY-MM-DD format. In the case of date ranges or under-specified dates that do not map to YYYY-MM-DD, the application responds with the most specific right-truncated under-specification of YYYY-MM-DD that includes the whole date range (formally, the formats used in XML Schema to lexically represent xsd:date, xsd:gYearMonth and xsd:gYear; cf. http://www.w3.org/TR/xmlschema-2/#isoformats). Table 2 lists some characteristic examples. The cleanDT source code is publicly available at http://bitbucket.org/bigopendata/cleandt.

Table 2: Characteristic examples of date cleaning.
AGRIS Value | Intention | cleanDT Value
8abr1995 | On 8 April 1995 | 1995-04-08
8-15dec1990 | Between 8 and 15 December 1990 | 1990-12
sum1985 | Summer 1985 | 1985
0000 | Unknown |
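The cleanDT heuristics themselves are maintained in the repository cited above; the following is a minimal sketch, assuming a deliberately simplified rule set, of the kind of regular-expression mapping that produces the values in Table 2. The month abbreviations and patterns shown are illustrative placeholders, not the actual rules.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CleanDtSketch {
    // A tiny illustrative month table; the real heuristics cover month names
    // and abbreviations in several languages.
    private static final Map<String, String> MONTHS = new HashMap<>();
    static {
        MONTHS.put("abr", "04");   // Spanish/Portuguese abbreviation for April
        MONTHS.put("dec", "12");
    }

    // "8abr1995"    -> "1995-04-08"
    // "8-15dec1990" -> "1990-12"  (a range is truncated to the common year-month)
    // "sum1985"     -> "1985"     (season only: truncated to the year)
    // "0000"        -> ""         (unknown)
    static String clean(String raw) {
        Matcher day = Pattern.compile("^(\\d{1,2})([a-z]{3})(\\d{4})$").matcher(raw);
        if (day.matches() && MONTHS.containsKey(day.group(2))) {
            return day.group(3) + "-" + MONTHS.get(day.group(2))
                    + "-" + String.format("%02d", Integer.parseInt(day.group(1)));
        }
        Matcher range = Pattern.compile("^\\d{1,2}-\\d{1,2}([a-z]{3})(\\d{4})$").matcher(raw);
        if (range.matches() && MONTHS.containsKey(range.group(1))) {
            return range.group(2) + "-" + MONTHS.get(range.group(1));
        }
        Matcher season = Pattern.compile("^(win|spr|sum|aut)(\\d{4})$").matcher(raw);
        if (season.matches()) {
            return season.group(2);
        }
        return "";  // unknown or unparsable dates yield no value
    }

    public static void main(String[] args) {
        for (String s : new String[] {"8abr1995", "8-15dec1990", "sum1985", "0000"}) {
            System.out.println(s + " -> " + clean(s));
        }
    }
}
```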
4.5 SemaGrowREST

SemaGrowREST is a REST API for the Semagrow Stack WebApp API. It provides the option to search a dataset's subjects, objects, or predicates using the q parameter, and to limit the number of results using the page_size parameter. The SemaGrowREST source code is publicly available at https://bitbucket.org/bigopendata/semagrowrest.

5. REFERENCES

[1] Network Common Data Form (NetCDF), http://www.unidata.ucar.edu/software/netcdf
[2] MPICH Home Page, http://www.mpich.org/
[3] Open MPI: Open Source High Performance Computing, http://www.open-mpi.org/
[4] Garrick Staples, "TORQUE resource manager", in Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC'06), 2006.
[5] OpenPBS patches, tools, and information, http://www.mcs.anl.gov/research/projects/openpbs/
[6] Maui Cluster Scheduler, http://www.adaptivecomputing.com/products/open-source/maui/
[7] PARADOX User Guide, http://www.scl.rs/paradox/PARADOX_UG-v1.pdf
[8] EMI Products, http://www.eu-emi.eu/products
[9] Scientific Linux Operating System, https://www.scientificlinux.org/
[10] EUGridPMA Clickable Map of Authorities, https://www.eugridpma.org/members/worldmap/
[11] SemaGrow VOMS service, https://voms.ipb.ac.rs:8443/voms/vo.semagrow.eu
[12] gLite User Guide, https://edms.cern.ch/file/722398/1.4/gLite-3-UserGuide.pdf
[13] Akos Balasko, Zoltan Farkas, Peter Kacsuk, "Building science gateways by utilizing the generic WS-PGRADE/gUSE workflow system", Computer Science 14(2), 2013.
[14] Liferay technology, http://www.liferay.com/
[15] WS-PGRADE Cookbook, http://sourceforge.net/projects/guse/files/WS-PGRADECookbook.pdf
[16] Apache CouchDB, http://couchdb.apache.org/