Luzzu – A Framework for Linked Data Quality Assessment

Jeremy Debattista, Christoph Lange, Sören Auer
University of Bonn & Fraunhofer IAIS
{debattis,langec,auer}@cs.uni-bonn.de

(This work is supported by the European Commission under the Seventh Framework Programme FP7 grant 601043, http://diachron-fp7.eu.)

Abstract. With the increasing adoption and growth of the Linked Open Data cloud [9], with RDFa, Microformats and other ways of embedding data into ordinary Web pages, and with initiatives such as schema.org, the Web is currently being complemented with a Web of Data. The Web of Data thus shares many characteristics with the original Web of Documents, including its varying quality. This heterogeneity makes it challenging to determine the quality of the data published on the Web and to subsequently make this information explicit to data consumers. The main contribution of this article is Luzzu, a quality assessment framework for Linked Open Data. Apart from providing quality metadata and quality problem reports that can be used for data cleaning, Luzzu is extensible: third-party metrics can easily be plugged into the framework. The framework does not rely on SPARQL endpoints and is thus free of the problems that come with them, such as query timeouts. Another advantage over SPARQL-based quality assessment frameworks is that metrics implemented in Luzzu can have more complex functionality than triple matching. Using the framework, we performed a quality assessment of a number of statistical linked datasets that are available on the LOD cloud. For this evaluation, 25 metrics from ten different dimensions were implemented.

Keywords: Data Quality, Assessment Framework, Quality Metadata, Quality Metrics

1 Introduction

With the increasing adoption of Linked Open Data [9], with RDFa, Microformats and other ways of embedding data into ordinary Web pages (cf. Web Data Commons, http://webdatacommons.org), and with initiatives such as schema.org, the Web is currently being complemented with a Web of Data. The Web of Data shares many characteristics with the original Web of Documents; as on the Web of Documents, quality varies. Quality on the Web of Documents is usually measured indirectly using the page rank of Web documents, because the quality of a document is only subjectively assessable; an indirect measure, such as the number of links created by others to a certain Web page, gives a good approximation of its quality. On the Web of Data the situation can be deemed both simpler and more complex at the same time. There is a large variety of dimensions and measures of data quality [17] that can be computed automatically, so we do not have to rely on indirect indicators alone. On the other hand, the assessment of quality in terms of fitness for use [10] with respect to a certain use case is more challenging. Even datasets with quality problems might be useful for certain applications, as long as the quality is in the required range. For example, data extracted from semi-structured sources, such as DBpedia [12, 15], often contains inconsistencies as well as misrepresented and incomplete information.
However, in the case of DBpedia the data quality is perfectly sufficient for enriching Web search with facts or suggestions about general information, such as entertainment topics. In such a scenario, DBpedia can be used to show related movies and personal information when a user searches for an actor. In this case it is rather negligible if, in relatively few cases, a related movie or some personal facts are missing. For developing a medical application, on the other hand, the quality of DBpedia is probably insufficient, as shown in [18], since the data is extracted from a semi-structured source created in a crowdsourcing effort (i.e. Wikipedia). It should be noted that even the traditional document-oriented Web has content of varying quality and is still commonly perceived to be extremely useful. Consequently, a key challenge is to determine the quality of datasets published on the Web and to make this quality information explicit. Moreover, quality assessment cannot be tackled in isolation, but should be considered holistically, involving the other stages of the data management lifecycle. Quality assessment on its own cannot improve the quality of a dataset, but as part of a lifecycle it can be used to continuously monitor and curate datasets as they evolve.

Figure 1. The stages of the data quality lifecycle.

The stages in the lifecycle depicted in Figure 1 are:
1. Metric Identification and Definition – choosing the right metrics for a dataset and the task at hand;
2. Assessment – dataset assessment based on the metrics chosen;
3. Data Repairing and Cleaning – ensuring that, following a quality assessment, a dataset is curated in order to improve its quality;
4. Storage, Cataloguing and Archiving – updating the improved dataset on the cloud whilst making the quality metadata available to the public;
5. Exploration and Ranking – finally, data consumers can explore cleaned datasets according to their quality metadata.

This article focuses on the Assessment stage of the data lifecycle. The Luzzu Quality Assessment framework for Linked Data (cf. Section 2) is proposed and discussed as a means of addressing the above-mentioned challenge. The framework and the experiment on its performance (cf. Section 3) are the main contributions of this paper. In order to show the framework's applicability, a number of statistical linked open datasets (available at http://270a.info) are evaluated against a number of quality metrics. The quality evaluation results are discussed in Section 4. Luzzu is compared with other similar frameworks in Section 5, whilst concluding remarks are given in Section 6.

2 The Framework

The Luzzu Quality Assessment Framework (referred to simply as Luzzu or 'the framework'; GitHub: https://github.com/EIS-Bonn/Luzzu, website: http://eis-bonn.github.io/Luzzu/) provides support for the various stages of the data quality lifecycle. This section provides a comprehensive outline of the framework, describing the main contribution of this paper. The rationale of Luzzu is to provide an integrated platform that (1) assesses Linked Data quality using a library of generic and user-provided domain-specific quality metrics in a scalable manner; (2) provides queryable quality metadata on the assessed datasets; and (3) assembles detailed quality reports on assessed datasets.
Furthermore, we aim to create an infrastructure that:
– can be easily extended by users through custom, domain-specific pluggable metrics, created either with a novel declarative quality metric specification language or as conventional imperative plugins;
– employs a comprehensive ontology stack for representing and exchanging all quality-related information in the assessment workflow.

The framework follows a three-step workflow, starting with the metric initialisation process (Step 1). In this step, declarative metrics defined in the Luzzu Quality Metric Language (LQML [4]) are compiled and initialised together with metrics implemented in Java. The quality assessment process then commences (Step 2) by sequentially streaming statements of the candidate dataset into the initialised quality metrics. Once this process is completed, the annotation step (Step 3) generates quality metadata and compiles a comprehensive quality report. This tool can be easily integrated into collections such as the Linked Data Stack [1] to enhance them with a quality assessment workflow.

Figure 2 illustrates the high-level architecture of Luzzu. The framework comprises three layers: the Communication Layer, the Assessment Layer and the Knowledge Layer. The Communication Layer exposes the framework's interfaces as a REST service. The latter two layers are described in the remainder of this section.

Figure 2. High-level architecture of the Luzzu quality framework.
Figure 3. The Luzzu Ontology Stack.

2.1 Knowledge Layer

The Knowledge Layer is composed of three units, namely the Semantic Schema Layer, the Annotation Unit and the Operations Unit. These units assist in the provision of quality metadata and in further operations upon that metadata.

Semantic Schema Layer – Luzzu is based on semantic technologies and thus has an underlying semantic schema layer. It consists of two levels: the representation level (split into generic and specific sub-levels) and the operational level. Figure 3 shows the framework's schema stack, where the lower level comprises generic ontologies which form the foundations of the quality assessment framework, and the upper level represents the specific ontologies required for the various quality assessment tasks. Figure 4 depicts the relationships between the ontologies.

Figure 4. Key relationships in the data quality ontologies.

Generic Representation Level – The generic representation level is domain independent and can easily be reused in similar frameworks for assessing quality. The two vocabularies described on this level are the Dataset Quality Ontology and the Quality Problem Report Ontology.
These vocabularies contribute to two objectives of Luzzu, namely providing queryable metadata and assembling detailed quality reports.

The rationale behind the Dataset Quality Ontology (daQ) [5] is to provide a core vocabulary that defines how quality metadata should be represented at an abstract level. (All ontologies defined in Luzzu have the namespace http://purl.org/eis/vocab/{prefix}, where {prefix} is the relevant ontology prefix, e.g. daq; any updates to an ontology are automatically reflected in the ontology's webspace.) daQ is used to attach the results of quality benchmarking of a Linked Open Dataset to the dataset itself. It is the core vocabulary of this schema layer, and any ontology describing quality metrics added to the framework (in the specific representation level) should extend it. The benefit of having an extensible schema is that quality metrics can be added to the vocabulary without major changes, as the representation of new metrics follows those previously defined.

The Quality Problem Report Ontology (QPRO) enables the fine-grained description of quality problems found while assessing a dataset. It comprises the two classes qpro:QualityReport and qpro:QualityProblem. The former represents a report on the problems detected during the quality assessment of a dataset, whilst the latter represents the individual quality problems contained in that report.

Specific Representation Level – Complementing Luzzu, a number of Linked Data quality metrics are implemented, some of which were identified from the survey presented in [17]. These metrics are used in assessing the quality of the 270a statistical datasets [3] (cf. Section 4). The Data Quality Metric (dqm) ontology represents these metrics in a semantic fashion based on the daQ vocabulary.

Operational Level – The highest level in the schema stack. This level comprises a number of ontologies that assist the framework in performing certain tasks, such as the Luzzu Metric Implementation ontology (LMI), a vocabulary that enables the metrics specified in terms of daQ to be connected to their Java implementations.

Annotation Unit – The Semantic Annotation Unit generates quality metadata and quality reports for each metric assessed. The annotation unit is based on the two core ontologies, daQ and QPRO. With regard to quality problem reports, the unit compiles a set of triples describing the problematic triples found during the metric assessment of a dataset. The user can then feed the generated quality problem report into a data cleaning tool that supports the Quality Problem Report Ontology in order to clean the dataset.
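To make the metadata produced by the annotation unit more concrete, the following sketch builds a minimal daQ/QPRO-style graph with the Jena API: one metric, one observation attached to it, and one quality problem referring back to the metric. The class and property names (daq:Metric, daq:Observation, daq:hasObservation, daq:metric, qpro:QualityProblem, qpro:isDescribedBy) are taken from the text and figures above, but the exact namespace IRIs, the example resource URIs and the precise shape of the graph are illustrative assumptions rather than the authoritative daQ and QPRO definitions.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

public class QualityMetadataSketch {
    // Namespace IRIs assumed from the http://purl.org/eis/vocab/{prefix} pattern.
    static final String DAQ = "http://purl.org/eis/vocab/daq#";
    static final String QPRO = "http://purl.org/eis/vocab/qpro#";

    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("daq", DAQ);
        m.setNsPrefix("qpro", QPRO);

        Property hasObservation = m.createProperty(DAQ, "hasObservation");
        Property metricProperty = m.createProperty(DAQ, "metric");
        Property isDescribedBy = m.createProperty(QPRO, "isDescribedBy");

        // A metric instance with one observation attached to it.
        // The real daQ schema also records the computed value and a timestamp
        // for each observation; these are omitted here.
        Resource metric = m.createResource("urn:example:extensionalConciseness")
                .addProperty(RDF.type, m.createResource(DAQ + "Metric"));
        Resource observation = m.createResource()
                .addProperty(RDF.type, m.createResource(DAQ + "Observation"))
                .addProperty(metricProperty, metric);
        metric.addProperty(hasObservation, observation);

        // A quality problem entry referring back to the metric that detected it.
        m.createResource("urn:example:problem1")
                .addProperty(RDF.type, m.createResource(QPRO + "QualityProblem"))
                .addProperty(isDescribedBy, metric);

        m.write(System.out, "TURTLE");
    }
}
```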
Operations Unit – The Operations Unit does not interact with the quality assessment directly. Currently, this unit provides algorithms for the day-to-day use of the quality metadata, including an algorithm for quality-driven ranking of datasets. In the spirit of "fitness for use", the proposed user-driven ranking algorithm enables users to assign weights to the categories, dimensions or metrics that they deem suitable for the task at hand. The user-driven ranking is formalised as follows. Let $v$ be the function that yields the value of a metric; this value lies in $\mathbb{R} \cup \mathbb{N} \cup \mathbb{B} \cup \ldots$, i.e. it is most commonly a real number, but can also be an integer, a boolean, or any other simple type.

Ranking by Metric – Ranking datasets by individual metrics requires computing a weighted sum of the values of all metrics chosen by the user. Let $m_i$ be a metric, $v(m_i)$ its value, and $\theta_i$ its weight ($i = 1, \ldots, n$); then the weighted value $v(m_i, \theta_i)$ is given by:

Definition 1 (Weighted metric value). $v(m_i, \theta_i) := \theta_i \cdot v(m_i)$

The sum $\sum_{i=1}^{n} v(m_i, \theta_i)$ of these weighted values, with the same weights applied to all datasets, determines the ranking of the datasets.

Ranking by Dimension – When users want to rank datasets in a less fine-grained manner, they assign weights to dimensions. (Zaveri et al. define, for example, the dimension of licensing, which comprises the metrics "[existence of a] machine-readable indication of a license" and "human-readable indication of a license" [17].) The weighted value of a dimension $D$ is computed by evenly applying the weight $\theta$ to each metric $m$ in the dimension:

Definition 2 (Weighted dimension value). $v(D, \theta) := \frac{\sum_{m \in D} v(m, \theta)}{\#D} = \theta \cdot \frac{\sum_{m \in D} v(m)}{\#D}$

Ranking by Category – A category $C$ is defined to have one or more dimensions $D$. (In the classification of Zaveri et al., "Licensing" is a dimension within the category of "accessibility dimensions" [17].) Thus, similarly to the previous case, ranking on the level of categories requires distributing the weight chosen for a category over the dimensions in this category and then applying Definition 2:

Definition 3 (Weighted category value). $v(C, \theta) := \frac{\sum_{D \in C} v(D, \theta)}{\#C}$

2.2 Assessment Layer

The Assessment Layer is composed of three units, namely the Processing Unit, the LQML Compilation Unit and the Quality Assessment Unit. These units handle the operations related to the quality assessment of a dataset.

Figure 5. Quality Assessment Process.
Figure 6. Streaming Quads using the Jena Stream Process (left) and Spark Stream Process (right).

The Quality Assessment Unit – The Quality Assessment Unit is the most important unit of the framework. Third parties can extend the framework by creating custom metrics and plugging them into it. Figure 5 depicts the quality assessment process. A user selects a dataset as well as the metrics required for the quality assessment. This information is passed to the framework via the communication layer, which then invokes the assessment process. The processing unit is initialised by (1) creating the necessary objects in memory and (2) initialising the chosen metrics. In Figure 5, Metric 1 is shaded out to illustrate that it was not chosen by the user for this particular process. Once the objects have been created, the stream processor fetches the dataset and streams triples or quads one by one to all initialised metrics.
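To illustrate what plugging a metric into this process might look like, the sketch below shows a hypothetical Java metric that consumes the streamed statements one by one. The QualityMetric interface and its method names are assumptions made for illustration (the actual Luzzu interface differs in its details), and the example metric, which merely computes the ratio of rdf:type statements, is not one of the metrics shipped with the framework.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.jena.sparql.core.Quad;
import org.apache.jena.vocabulary.RDF;

/** A simplified, hypothetical view of a pluggable quality metric. */
interface QualityMetric {
    void compute(Quad quad);        // invoked once per streamed statement
    double metricValue();           // final value requested by the annotation unit
    List<Quad> problematicQuads();  // input for the quality problem report
}

/** Illustrative metric: the fraction of statements that are rdf:type assertions. */
class TypedStatementRatio implements QualityMetric {
    private long total = 0;
    private long typed = 0;
    private final List<Quad> problems = new ArrayList<>();

    @Override
    public void compute(Quad quad) {
        total++;
        if (RDF.type.asNode().equals(quad.getPredicate())) {
            typed++;
        }
    }

    @Override
    public double metricValue() {
        return total == 0 ? 0.0 : (double) typed / total;
    }

    @Override
    public List<Quad> problematicQuads() {
        return problems; // this toy metric never flags problems
    }
}
```

In the workflow described above, the annotation unit would request the final value and the problematic statements once the whole dataset has been streamed.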
Since the framework has no control over the metrics' performance, the assessment of datasets is optimised by parallelising metric computation across different threads, ensuring that the different metrics are computed at the same time. After all statements have been assessed, the annotation unit requests the results for each metric and creates (or updates) the quality metadata graph. The annotation unit also requests the problematic triples and prepares a quality report that is passed back to the user via the communication layer. This marks the end of a successful quality assessment process.

Processing Unit – The Processing Unit controls the whole execution of the quality assessment of a chosen dataset. In Luzzu, two stream processing units are implemented: one based on the Jena RDF API (http://jena.apache.org/) and the other on the Spark big data processing framework (http://spark.apache.org/). Streaming ensures scalability (since we are not limited by main memory) and parallelisability (since the parsing of a dataset can be split into several streams to be processed on different threads, cores or machines). Each data processor in the framework operates in three stages: (i) processor initialisation; (ii) processing; and (iii) memory clean-up. Typically, an invoked processor has three inputs: the base URI, the dataset URI, and a metric configuration file defining the metrics to be computed on the dataset. In the first stage (processor initialisation), the processor creates the necessary objects in memory to process data and loads the metrics defined in the configuration file. Once initialisation is complete, the processing is performed by passing the streamed triples to the metrics. A final memory clean-up ensures that unused objects do not consume unnecessary computational resources.

Sequential Streaming of Datasets – Jena provides the possibility of streaming triples sequentially in a separate thread, implementing a producer-consumer queue. A dataset, which can be serialised in any of the typical RDF formats (RDF/XML, N-Triples, N-Quads etc.), is read directly from disk storage. Triple statements are read as string tokens, which the Jena API then transforms into triple or quad objects. Another approach for sequential stream processing is to use the Hadoop MapReduce technology (http://hadoop.apache.org/) or its in-memory equivalent, Spark. The idea is to map the processing of large datasets onto multiple clusters, creating triples in the process. A simple reduce function then takes the results of the map to populate a queue that feeds the metrics. Metric computations are not 'associative' in general, which is one of the main requirements for implementing a MapReduce job; therefore, instantiated metrics are split into different threads in the master node. Figure 6 shows how the two different processors stream statements to the initialised metrics.

LQML Compilation Unit – One foreseeable obstacle is that Java experts are needed to create metrics as traditional imperative classes following a defined interface. Therefore, the framework also provides the Luzzu Quality Metric Language (LQML), enabling knowledge engineers without Java expertise to create quality metrics in a declarative manner. Since LQML is not one of the main contributions of this paper, the reader is referred to [4] for more information.
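As a rough illustration of the producer-consumer streaming described above (and not of Luzzu's actual processing unit), the following sketch uses Jena's RIOT API, where PipedRDFIterator and PipedTriplesStream implement the queue between the parser thread and the consumer; the file name is a placeholder, and the consumer loop stands in for the step that broadcasts each statement to the initialised metrics.

```java
import org.apache.jena.graph.Triple;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.lang.PipedRDFIterator;
import org.apache.jena.riot.lang.PipedRDFStream;
import org.apache.jena.riot.lang.PipedTriplesStream;

public class SequentialStreamSketch {
    public static void main(String[] args) {
        String dataset = "dataset.nt"; // placeholder; any serialisation Jena can parse

        // Producer-consumer queue: the parser (producer) pushes triples into
        // the iterator, which is consumed on the current thread.
        PipedRDFIterator<Triple> iterator = new PipedRDFIterator<>();
        PipedRDFStream<Triple> sink = new PipedTriplesStream(iterator);

        // The parser runs on its own thread so that parsing and metric
        // computation can proceed concurrently.
        Thread parser = new Thread(() -> RDFDataMgr.parse(sink, dataset));
        parser.start();

        long count = 0;
        while (iterator.hasNext()) {
            Triple triple = iterator.next();
            // Here Luzzu would broadcast the statement to all initialised metrics.
            count++;
        }
        System.out.println("Streamed " + count + " triples");
    }
}
```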
3 Performance Experiment

The aim of this experiment is to determine the scalability properties of the stream processors described in Section 2.2 with regard to dataset size. Runtime is measured for both processors against datasets ranging from 10,000 to 100,000,000 triples and against different numbers of metrics (up to 9). Since the main goal of this experiment is to measure the stream processors' performance, datasets exhibiting different quality problems are not the primary focus at this stage. We generated datasets of different sizes using the Berlin SPARQL Benchmark (BSBM) V3.1 data generator (http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/). BSBM is primarily used as a benchmark to measure the performance of SPARQL queries against large datasets. We generated datasets with scale factors of 24, 56, 128, 199, 256, 666, 1369, 2089, 2785, 28453, 70812 and 284826, which translate into approximately 10K, 25K, 50K, 75K, 100K, 250K, 500K, 750K, 1M, 10M, 25M, 50M and 100M triples respectively. The results of the performance evaluation meet our expectation of a linearly scalable processor, whilst metrics affect the runtime of the stream processor. The results also confirm the assumption that big data technologies such as Spark are not beneficial for smaller datasets. All tests were run on the Google Cloud platform, with three worker clusters set up for the Spark stream processors.

Figure 7. Time vs. dataset size in triples – no metrics initialised.
Figure 8. Time vs. dataset size with different metric initialisations (0, 4 and 9 metrics).

Results – Figure 7 and Figure 8 show the time taken (in ms) to process datasets of different sizes (more performance evaluation results can be found at http://eis-bonn.github.io/Luzzu/evaluation.html). Figure 7 shows the time taken to process the datasets without computing any metrics, whilst Figure 8 shows the time taken to process and assess the datasets with metrics initialised (note the logarithmic scale). As expected, both processors scale linearly with regard to the number of triples. This is also observed when metrics are added to the processor. From these results, we also conclude that for up to 100M triples the (Jena) stream processor performs better than the Spark processor. A cause for this difference is that the Spark processor has to deal with the extra overhead of enqueuing and dequeuing triples on an external queue; however, as the number of triples increases, the performance of both processors converges. Figure 8 shows how the stream processor fares with regard to execution time when initialised with 0, 4 and 9 metrics. The processor still maintains linear behaviour, with execution time growing as the number of initialised metrics increases.

4 Linked Data Quality Evaluation

To showcase Luzzu in a real-world scenario, a Linked Data quality evaluation was performed on the 270a statistical linked data space.
The datasets in question are (all 270a dataset URIs follow the pattern http://{acronym}.270a.info):
– European Central Bank (acronym: ecb) – statistical data published by the European Central Bank;
– Food and Agriculture Organization of the United Nations (fao) – statistical data from the FAO official website, related to food and agriculture statistics for global monitoring;
– Swiss Federal Statistics Office (bfs) – Swiss statistical data related to population, national economy, health and sustainable development, amongst others;
– UNESCO Institute for Statistics (uis) – data related to education, science and technology, culture and communication, comparing over 200 countries;
– International Monetary Fund (imf) – data related to the IMF financial indicators;
– World Bank (worldbank) – data from the official World Bank API endpoint, related to social and financial projections;
– Transparency International Linked Data (transparency) – data from Excel sheets and PDF documents available on the Transparency International website, containing information about the corruption perceptions index;
– Organisation for Economic Co-operation and Development (oecd) – a variety of data related to the economic growth of countries;
– Federal Reserve Board (frb) – data published by the United States central bank.

These datasets are published using the Linked SDMX principles [3]. Each of them has a SPARQL endpoint, against which a query was executed to obtain the datasets (void:Dataset) and the corresponding data dumps (void:dataDump). All datasets were then assessed in Luzzu over a number of metrics, and quality metadata was generated (the quality metadata for each dataset can be found at http://eis-bonn.github.io/Luzzu/eval270a.html).

Chosen Metrics – 25 metrics from ten different dimensions were implemented for the applicability evaluation (these metrics, together with their configuration, can be downloaded from http://eis-bonn.github.io/Luzzu/downloads.html). Whilst most of these metrics were identified from [17], two provenance-related metrics were additionally defined for this assessment. The W3C Data on the Web Best Practices working group has defined a number of best practices in order to support a self-sustained ecosystem of the Data Web. With regard to provenance, the working group argues that provenance information should be accessible to data consumers, who should also be able to track the origin or history of the published data [13]. In light of this statement, we defined the Basic Provenance and Extended Provenance metrics. The Basic Provenance metric checks whether a dataset, usually of type void:Dataset or dcat:Dataset, has the most basic provenance information, that is, information related to the creator or publisher given with the dc:creator or dc:publisher properties. The Extended Provenance metric, on the other hand, checks whether a dataset has the information required to enable a consumer to track down its origin. Each dataset should have an entity containing the identification of a prov:Agent and of one or more prov:Activity instances. Each identified activity should in turn have an agent associated with it, apart from having a data source attached to it.
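A much-simplified sketch of the Basic Provenance check is shown below. It loads a dataset's VoID description into an in-memory Jena model (unlike Luzzu's streaming implementation), finds the void:Dataset resources and checks each for a dcterms:creator or dcterms:publisher statement; the file name is a placeholder, and the choice of the DCMI Terms properties (rather than, say, DC 1.1) is an assumption.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ResIterator;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.vocabulary.DCTerms;
import org.apache.jena.vocabulary.RDF;

public class BasicProvenanceSketch {
    /** Fraction of void:Dataset resources that declare a creator or a publisher. */
    static double assess(String voidFile) {
        Model model = RDFDataMgr.loadModel(voidFile);
        Resource voidDataset = model.createResource("http://rdfs.org/ns/void#Dataset");

        long datasets = 0;
        long withProvenance = 0;
        ResIterator it = model.listSubjectsWithProperty(RDF.type, voidDataset);
        while (it.hasNext()) {
            Resource dataset = it.next();
            datasets++;
            if (dataset.hasProperty(DCTerms.creator) || dataset.hasProperty(DCTerms.publisher)) {
                withProvenance++;
            }
        }
        return datasets == 0 ? 0.0 : (double) withProvenance / datasets;
    }

    public static void main(String[] args) {
        System.out.println(assess("void.ttl")); // placeholder file name
    }
}
```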
Together with the dataset quality evaluation results, Table 1 shows the dimensions and metrics (marked with a reference from [17]) used on the 270a linked datasets. The metrics were implemented as described by their respective authors, except for the Misreported Content Types and No Entities as Members of Disjoint Classes metrics, where approximation techniques, more specifically reservoir sampling, were used to improve scalability [6].

Table 1. Quality Evaluation Results for 270a Datasets. The table reports, for each of the nine data spaces (BFS, ECB, FAO, FRB, IMF, OECD, Transparency, UIS, World Bank), the values of the following dimensions and metrics: Availability – Estimated No Misreported Content Types (A4), End Point Availability (A1), RDF Availability (A2); Performance – Scalability of Data Source (P4), Low Latency (P2), High Throughput (P3), Correct URI Usage (P1); Licensing – Machine Readable License (L1), Human Readable License (L2); Interoperability – Reuse of Existing Terms (IO1), Reuse of Existing Vocabularies (IO2); Provenance – Basic Provenance, Extended Provenance; Representational Conciseness – No Prolix RDF (RC2), Short URIs (RC1); Versatility – Different Serialisations (V1), Multiple Language Usage (V2); Interpretability – Low Blank Node Usage (IN4), Defined Classes and Properties (IN3); Timeliness – Timeliness of Resource (TI2), Freshness of Dataset (TI1), Currency of Dataset (TI2); Consistency – Misplaced Classes or Properties (CS2), Estimated No Entities As Members of Disjoint Classes (CS1), Correct Usage of OWL Datatype Or Object Properties (CS3).

Results and Discussion – In general, the datasets adhere to high quality when it comes to the consistency, representational conciseness and performance dimensions. The performance dimension relates to the efficiency of the system hosting the dataset, whilst the representational conciseness dimension deals with data clarity and compactness. In other words, the latter dimension measures whether (1) a dataset's resource URIs are short, that is, at most 80 characters (a best practice mentioned at http://www.w3.org/TR/chips/#uri) and free of parameters, and (2) a dataset is free of RDF collections and lists.
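As an illustration of the first of these checks, a minimal sketch of a Short URIs (RC1) test is shown below. The 80-character threshold and the 'no parameters' condition come from the description above; treating any '?' in a URI as a parameter indicator is a simplifying assumption.

```java
public class ShortUriCheck {
    /** True if the URI is at most 80 characters long and carries no query parameters. */
    static boolean isShort(String uri) {
        return uri.length() <= 80 && !uri.contains("?");
    }

    public static void main(String[] args) {
        System.out.println(isShort("http://bfs.270a.info/dataset/bfs"));          // true
        System.out.println(isShort("http://example.org/resource?id=42&lang=en")); // false: has parameters
    }
}
```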
The consistency metrics relate to how free the dataset is of logical and formal errors and contradictions. According to the CoolURIs guidelines (http://www.w3.org/TR/cooluris/), hash URIs should be used in small and stable datasets, since they provide stability with regard to retrieval performance: no extra 303 redirects are required. Slash URIs, on the other hand, should be used in large datasets. Following [7], the minimum number of triples for a dataset to be considered large is set to 500K. The assessed Transparency data space has around 43K triples, of which only 30 URIs are hash URIs.

Whilst all data spaces provide basic provenance information, tracking the origin of the data source (extended provenance metric) is not possible in all datasets. For example, in the Transparency data space no such provenance information is recorded. Furthermore, although provenance was recorded in the other assessed data spaces, entities may lack some of the required attributes mentioned in the previous section. This was confirmed by a SPARQL query on the ECB dataset: of 460 entities, only 248 were attributable to an agent, whilst each activity contained pointers to the data source used and the agent it is associated with.

With regard to the timeliness metrics, it was noticed that the datasets were missing the information required for the assessment. For example, the Timeliness of Resource metric requires the last modification time of the original data source and the last modification time of the semantic resource. Similarly, for the Freshness of Dataset and Currency of Dataset (recency) metrics, the datasets had missing meta-information, except for the Transparency and World Bank datasets for the latter metric. These still produced a relatively low quality score, and a manual investigation revealed that the datasets have not been updated for over a year.

In order to measure the interoperability metrics, some pre-processing was required. All nine datasets were tagged with a number of domains (such as Statistical, Government and Financial) and a number of vocabularies that were deemed suitable. For each domain, the Linked Open Vocabularies API (LOV, http://lov.okfn.org/dataset/lov/) was queried, and resolvable vocabularies were downloaded into Luzzu. Although one does not expect datasets to reuse all existing and available vocabularies, the reuse of vocabularies is encouraged. A SPARQL query retrieving all vocabularies referred to with the void:vocabulary property was executed against the BFS data space endpoint. It returned nine vocabularies, five of which were removed since they were the common OWL, RDF, RDFS, SKOS and DCMI Terms vocabularies. The remaining vocabularies were http://bfs.270a.info/property/, http://purl.org/linked-data/sdmx/2009/concept, http://purl.org/linked-data/xkos and http://www.w3.org/ns/prov. The LOV API suggested 14 vocabularies when given the terms assigned to this dataset, none of which appeared in the list of vocabularies used. One possible problem with this metric is that, since these are qualitative metrics that measure quality from a particular perspective, the dataset might have been tagged wrongly. The metric also relies on an external service, LOV, which might (1) not have all vocabularies in its system, or (2) not return a vocabulary because the search term is not included in the vocabulary's description or label.
With all data dumps and SPARQL endpoint URIs present in the VoID file of every dataset, the 270a data space publishers are well aware of the importance of having the data available and obtainable (availability dimension). The datasets also fare well with regard to having a relatively low number of misreported content types for the resource URIs they use. The percentage loss in this metric is attributable to the fact that it assesses all URI resources (local, i.e. having a base URI of 270a.info, and external) that return a 200 OK HTTP status code. In fact, most of the problematic URIs found were attributable to the usage of the following three vocabularies: http://purl.org/linked-data/xkos, http://purl.org/linked-data/cube and http://purl.org/linked-data/sdmx/2009/dimension. A curl command shows that the content type returned by these is text/html; charset=iso-8859-1, even when enforcing content negotiation for semantic content types such as application/rdf+xml. When assessing the World Bank dataset, there was also one instance (more specifically http://worldbank.270a.info/property/implementing-agency) that the metric incorrectly reported as a problematic resource URI. The mentioned resource took an unusually long time to resolve, which might have resulted in a timeout.

All data spaces have a machine-readable license (licensing dimension) in their VoID file, more specifically for those datasets (void:Dataset) that contain observation data. The datasets that describe metadata, such as http://bfs.270a.info/dataset/meta, have no machine-readable license attributed to them, and such cases decreased the quality result for this metric. On the other hand, the datasets had no human-readable licenses in their descriptions. This metric verifies whether a human-readable text stating the licensing model of the resource has been provided as part of the dataset. In contrast to the machine-readable metric, literal objects (attached to properties such as rdfs:label, dcterms:description and rdfs:comment) are analysed for terms related to licensing.

The versatility dimension refers to the different representations of data for both human and machine consumption. With regard to human consumption (multiple language usage), the BFS dataset supports four languages on average, the highest of all assessed datasets. This metric checks all literals that have a language tag. The data spaces also offer different serialisations (attributed via the void:feature property – two formats for each data space), but only one dataset in each data space (e.g. http://bfs.270a.info/dataset/bfs) has this property as part of its VoID description. If, for example, we take the OECD data space, there are 140 defined VoID datasets but only one of them has the void:feature property defined.

The interpretability dimension refers to the notation of data, that is, whether a machine can process it. A dataset should therefore have low blank node usage, since blank nodes cannot be resolved and referenced externally. In this regard, all datasets have low usage; blank nodes are generally used to describe void:classPartition in the VoID metadata. The datasets also use classes and properties that are defined in ontologies and vocabularies, with occasional usage of the typical http://example.org namespace, such as http://example.org/XStats#value, found in all data spaces.
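The content negotiation check underlying the Misreported Content Types metric discussed earlier in this section can be sketched as follows. This is an illustration rather than Luzzu's implementation: it requests a URI with an Accept header listing a few RDF media types and inspects the returned Content-Type; the list of accepted types, the timeout values and the example URI are assumptions.

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class ContentTypeCheck {
    /** True if the URI dereferences to an RDF content type when one is requested. */
    static boolean returnsRdfContentType(String uri) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(uri).openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("Accept",
                    "application/rdf+xml, text/turtle, application/n-triples");
            conn.setInstanceFollowRedirects(true);
            conn.setConnectTimeout(10_000); // slow resources may otherwise hang the check
            conn.setReadTimeout(10_000);
            conn.connect();
            String contentType = conn.getContentType(); // e.g. "text/html; charset=iso-8859-1"
            conn.disconnect();
            return contentType != null
                    && (contentType.startsWith("application/rdf+xml")
                        || contentType.startsWith("text/turtle")
                        || contentType.startsWith("application/n-triples"));
        } catch (Exception e) {
            // Timeouts and unresolvable hosts are treated as failures here;
            // a real metric would report them separately.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(returnsRdfContentType("http://purl.org/linked-data/cube"));
    }
}
```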
5 Related Work

In this section we introduce the reader to similar tools that assess the quality of Linked Data. Table 2 summarises the tools discussed in this section. As can be seen in this table, Luzzu fills a gap in the space of related work with its focus on scalability, extensibility and quality metadata representation.

Table 2. Functional comparison of Linked Data quality tools (Flemming, LinkQA, Sieve, RDFUnit, LiQuate, Triple Check Mate and Luzzu), compared along scalability, extensibility, quality metadata, quality report, collaboration, cleaning support and last release.

Flemming [7] provides a simple web user interface and a walkthrough guide that helps a user assess the data quality of a resource using a set of defined metrics. Metric weights can be customised by user assignment, but no new metrics can be added to the system. Luzzu enables users to create custom metrics, including complex quality metrics that can be customised further through configuration, for example by providing a list of trustworthy providers. In contrast to Luzzu, Flemming's tool outputs the results and problems as unstructured text.

LinkQA [8] is an assessment tool for measuring the quality (and changes in quality) of a dataset using network analysis measures. The authors provide five network measures, namely degree, clustering coefficient, centrality, sameAs chains, and descriptive richness through sameAs. Similarly to Luzzu, new metrics can be integrated into LinkQA, but these metrics are related to topological measures. LinkQA provides HTML reports to represent the assessment findings. Although structured, such reports cannot be attached to LOD datasets in a way that would allow assessment values to be queried later on.

In Sieve [14], metadata about named graphs is used to assess data quality, where assessment metrics are declaratively defined by users through an XML configuration. In such configurations, scoring functions (which can also be extended or customised) can be applied to one metric or an aggregate of metrics. Whereas Luzzu provides a quality problem report which can be imported into third-party cleaning software, Sieve provides a data cleaning process in which data is cleaned based on a user configuration. The quality assessment tool is part of the LDIF Linked Data Integration Framework, which supports Hadoop.

Similarly to Flemming's tool, LiQuate [16] provides a guided walkthrough to view pre-computed datasets. LiQuate is a quality assessment tool based on Bayesian networks, which analyses the quality of datasets in the LOD cloud whilst identifying potential quality problems and ambiguities. This probabilistic model is used in LiQuate to explore the assessed datasets for completeness, redundancies and inconsistencies. Data experts are required to identify rules for the Bayesian network.

Triple Check Mate [18] is mainly a crowdsourcing tool for quality assessment, supported by a semi-automatic verification of quality metrics.
With the crowdsourcing approach, certain quality problems (such as semantic errors) might be detected more easily by human evaluators than by computational algorithms. On the other hand, the proposed semi-automated approach makes use of reasoners and machine learning to learn characteristics of a knowledge base.

RDFUnit [11] provides test-driven quality assessment for Linked Data. In RDFUnit, users define quality test patterns based on SPARQL query templates. Similarly to Luzzu and Sieve, this gives users the opportunity to adapt the quality framework to their needs. The focus of RDFUnit is on checking integrity constraints expressed as SPARQL patterns; thus, users have to understand the different SPARQL patterns that represent these constraints. Quality is assessed by executing the custom SPARQL queries against dataset endpoints. In contrast, Luzzu does not rely on SPARQL querying to assess a dataset, and can therefore compute more complex processes (such as checking the dereferenceability of resources) on the dataset triples themselves. Test case results from an RDFUnit execution, both quality values and quality problems, are stored and represented as Linked Data and visualised as HTML. The main difference between Luzzu and RDFUnit in quality metadata representation is that the daQ ontology enables a more fine-grained and detailed quality metric representation; for example, representing quality metadata with daQ makes it possible to represent the change of a metric value over time.

6 Conclusions

Data quality assessment is crucial for the wider deployment and use of Linked Data. In this paper, we presented an approach for a scalable and easy-to-use Linked Data quality assessment framework. The performance experiment showed that, although there is margin for improvement, Luzzu has very attractive performance characteristics. In particular, quality assessment with the proposed framework scales linearly with dataset size, and adding further (domain-specific) metrics adds a relatively small overhead. Thus, Luzzu effectively supports Big Data applications. Beyer and Laney coined the definition of Big Data as High Volume, High Velocity and High Variety [2]. Volume means large amounts of data; velocity addresses how much information is handled in real time; variety addresses data diversity. The implemented framework currently scales for both Volume and Variety. With regard to Volume, the processor runtime grows linearly with the number of triples. We also cater for Variety, since in Luzzu the results are not affected by data diversity: as we support the analysis of all kinds of data represented as RDF, any data schema and even various data models are supported as long as they can be mapped to or encoded in RDF (e.g. relational data with R2RML mappings). Velocity completes the Big Data definition. Currently we employ Luzzu for quality assessment at well-defined checkpoints rather than in real time; however, due to its streaming nature, Luzzu can easily assess the quality of data streams as well, thus catering for Velocity.

We see Luzzu as the first step of a long-term research agenda aiming to shed light on the quality of data published on the Web. With regard to the future of Luzzu, we are currently investigating a number of techniques that would enable us to better support the framework in big data architectures.
We are also conducting a thorough investigation of the quality of the Linked Datasets available in the LOD cloud.

References

1. Auer, S. et al. Managing the Life-cycle of Linked Data with the LOD2 Stack. In: Proceedings of the International Semantic Web Conference (ISWC 2012). 2012.
2. Beyer, M. A., Laney, D. The Importance of 'Big Data': A Definition. Gartner, 2012.
3. Capadisli, S., Auer, S., Ngonga Ngomo, A.-C. Linked SDMX Data: Path to High Fidelity Statistical Linked Data. In: Semantic Web 6(2) (2015), pp. 105–112.
4. Debattista, J., Lange, C., Auer, S. Luzzu Quality Metric Language – A DSL for Linked Data Quality Assessment. In: CoRR (2015).
5. Debattista, J., Lange, C., Auer, S. Representing Dataset Quality Metadata using Multi-Dimensional Views. In: Proceedings of the 10th International Conference on Semantic Systems (SEMANTiCS 2014). ACM, 2014, pp. 92–99.
6. Debattista, J. et al. Quality Assessment of Linked Datasets using Probabilistic Approximation. In: CoRR (2015). To appear in the ESWC 2015 proceedings.
7. Flemming, A. Quality Characteristics of Linked Data Publishing Datasources. MA thesis. Humboldt-Universität zu Berlin, Institut für Informatik, 2011.
8. Guéret, C. et al. Assessing Linked Data Mappings Using Network Measures. In: ESWC 2012.
9. Jentzsch, A., Cyganiak, R., Bizer, C. State of the LOD Cloud. Version 0.3. 19 Sept. 2011. http://lod-cloud.net/state/ (visited on 2014-08-06).
10. Juran, J. M. Juran's Quality Control Handbook. 4th ed. McGraw-Hill, 1974.
11. Kontokostas, D. et al. Test-driven Evaluation of Linked Data Quality. In: Proceedings of the 23rd International Conference on World Wide Web (WWW 2014). 2014.
12. Lehmann, J. et al. DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. In: Semantic Web Journal (2014).
13. Lóscio, B. F., Burle, C., Calegari, N. Data on the Web Best Practices – W3C First Public Working Draft. Tech. rep. W3C, 2015. http://www.w3.org/TR/dwbp/.
14. Mendes, P. N., Mühleisen, H., Bizer, C. Sieve: Linked Data Quality Assessment and Fusion. In: 2012 Joint EDBT/ICDT Workshops. ACM, 2012, pp. 116–123.
15. Morsey, M. et al. DBpedia and the Live Extraction of Structured Data from Wikipedia. In: Program: Electronic Library and Information Systems 46 (2012), p. 27.
16. Ruckhaus, E., Baldizán, O., Vidal, M.-E. Analyzing Linked Data Quality with LiQuate. In: OTM 2013 Workshops. 2013.
17. Zaveri, A. et al. Quality Assessment Methodologies for Linked Open Data. In: Semantic Web Journal (2014).
18. Zaveri, A. et al. User-driven Quality Evaluation of DBpedia. In: Proceedings of the 9th International Conference on Semantic Systems (I-SEMANTICS '13). ACM, 2013, pp. 97–104.