
5.1 Quality Assessment Metrics

The three main concepts Sieve uses in its quality assessment configuration are quality indicators, scoring functions and assessment metrics. A data quality indicator is an aspect of a data item or dataset that may indicate the suitability of the data for some intended use. The information used as quality indicators ranges from metadata about the circumstances in which the information was created, to information about the information provider, to data source ratings. A scoring function produces a numerical representation of the suitability of the data, based on some quality indicator. Each indicator may be associated with several scoring functions; for example, max or average functions can be used with the data source rating indicator. Assessment metrics are procedures for measuring information quality based on a set of quality indicators and scoring functions. Additionally, assessment metrics can be aggregated through the average, sum, max, min or threshold functions.
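The aggregation step can be sketched in a few lines of Python. This is an illustration only, not Sieve's implementation: the function name aggregate, its parameters and the example scores are hypothetical, and the meaning of the threshold aggregation (pass only if every score reaches the threshold) is an assumption, since the text does not define it.

```python
from statistics import mean


def aggregate(scores, how="average", threshold=0.5):
    """Combine several metric scores for one graph into a single score."""
    if not scores:
        raise ValueError("at least one score is required")
    if how == "average":
        return mean(scores)
    if how == "sum":
        return sum(scores)
    if how == "max":
        return max(scores)
    if how == "min":
        return min(scores)
    if how == "threshold":
        # Assumption: the result is 1.0 only if every score reaches the threshold.
        return 1.0 if all(s >= threshold for s in scores) else 0.0
    raise ValueError(f"unknown aggregation: {how}")


# Made-up scores for one graph, e.g. a recency score and a source rating score.
scores = [0.5, 1.0]
combined = aggregate(scores, how="average")
```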

For an example, see Listing 3, where the recency assessment metric uses the last update timestamp of a dataset or a single fact as its quality indicator. The TimeCloseness scoring function transforms this timestamp into a numeric score, using a range parameter (in days) to normalize the scores. Other scoring functions available in Sieve normalize the value of a quality indicator, or calculate the score based on whether the indicator value belongs to some interval or exceeds a given threshold. The complete list of supported scoring functions is available at the Sieve webpage; users can also define their own scoring functions in Scala, following the guidelines provided there.
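To make the scoring functions described above concrete, the following Python sketch shows one plausible reading of them. It is not Sieve's code (Sieve scoring functions are written in Scala); the function names and signatures are hypothetical, and the linear decay used for time closeness is an assumption about how the range parameter normalizes the score.

```python
from datetime import datetime


def time_closeness(last_update, now, range_days):
    """Score in [0, 1]: 1.0 for a fresh timestamp, 0.0 once it is range_days old.

    Assumption: the score decays linearly with the age of the timestamp.
    """
    age_days = (now - last_update).total_seconds() / 86400
    return max(0.0, min(1.0, 1.0 - age_days / range_days))


def interval_membership(value, low, high):
    """Score 1.0 if the indicator value lies in [low, high], else 0.0."""
    return 1.0 if low <= value <= high else 0.0


def exceeds_threshold(value, minimum):
    """Score 1.0 if the indicator value exceeds the given threshold, else 0.0."""
    return 1.0 if value > minimum else 0.0


# A fact updated 10 days ago, scored with a range of 20 days.
score = time_closeness(datetime(2024, 1, 1), datetime(2024, 1, 11), range_days=20)
```

Under the linear-decay assumption, a timestamp older than the range always scores 0.0, so the range parameter acts as the cut-off beyond which data is treated as stale.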

The output of the quality assessment module is a set of quads, where the calculated scores are associated with each graph. A graph can contain a whole dataset (e.g. Dutch DBpedia), a subset of it (e.g. all properties of Berlin in Freebase) or a single fact. The scores represent the user-configured interpretation of quality and are then used by the Data Fusion module.
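As a rough illustration of what such output could look like, the sketch below builds one score quad and serializes it as an N-Quads line. All IRIs are hypothetical placeholders under example.org, and the layout (graph IRI as subject, metric as predicate, score as a typed literal, a separate metadata graph as fourth component) is an assumption, not Sieve's documented vocabulary.

```python
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def score_quad(graph_iri, metric_iri, score, metadata_graph_iri):
    """Build a quad stating that graph_iri has the given score for a metric."""
    return (graph_iri, metric_iri, score, metadata_graph_iri)


def to_nquads(quad):
    """Serialize a score quad as one N-Quads line with an xsd:double literal."""
    s, p, o, g = quad
    return f'<{s}> <{p}> "{o}"^^<{XSD_DOUBLE}> <{g}> .'


# Hypothetical IRIs, for illustration only.
quad = score_quad(
    "http://example.org/graph/berlin",
    "http://example.org/metric/recency",
    0.5,
    "http://example.org/graph/quality",
)
line = to_nquads(quad)
```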

Listing 3. Data Fusion with Sieve: specification

5.2 Fusion Functions

In the context of data integration, Data Fusion is defined as the “process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation” [5]. Data Fusion is commonly seen as a third step following schema mapping and identity resolution, as a way to deal with conflicts that either already existed in the original sources or were generated by integrating them.

The Sieve Data Fusion module is inspired by [4], a framework for data fusion in the context of relational databases that includes three major categories of conflict handling strategies:

Conflict-ignoring strategies, which defer conflict resolution to the user. For instance, the PassItOn strategy simply relays conflicts to the user or the applications consuming the integrated data.

Conflict-avoiding strategies, which apply a unique decision to all data. For instance, the TrustYourFriends strategy prefers data from specific data sources.

Conflict-resolution strategies, which decide between existing data (e.g. KeepUpToDate, which takes the most recent value), or mediate the creation of a new value from the existing ones (e.g. Average).
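The three categories can be contrasted with a small Python sketch. It only mirrors the behaviour described above; the Claim type, the function signatures, the source names and the population figures are hypothetical and are not taken from [4] or from Sieve.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Claim:
    value: float   # the conflicting value
    source: str    # the data source that asserted it
    updated: date  # when the source last updated it


def pass_it_on(claims):
    """Conflict-ignoring: relay every value to the consumer."""
    return [c.value for c in claims]


def trust_your_friends(claims, preferred_source):
    """Conflict-avoiding: keep only values from a preferred source."""
    return [c.value for c in claims if c.source == preferred_source]


def keep_up_to_date(claims):
    """Conflict resolution (deciding): take the most recent value."""
    return max(claims, key=lambda c: c.updated).value


def average(claims):
    """Conflict resolution (mediating): create a new value from all values."""
    return sum(c.value for c in claims) / len(claims)


# Made-up conflicting population values from two hypothetical sources.
claims = [
    Claim(3_400_000, "source-a", date(2010, 1, 1)),
    Claim(3_500_000, "source-b", date(2012, 6, 30)),
]
```

Note that only the mediating strategy returns a value that no source actually asserted; the others select among the existing values.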

In Sieve, fusion functions are of two types. Filter functions remove some or all values from the input according to some quality metric, for example by keeping the value with the highest score for a given metric (e.g. recency or trust), or by voting to select the most frequent value. Transform functions operate over each value in the input, generating a new list of values built from the initially provided ones, e.g. by computing the average of the numeric values. The complete list of supported fusion functions is available at the Sieve webpage, and users can implement their own functions.
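The difference between the two types shows up in what they return: a filter yields a subset of its input, while a transform yields newly computed values. The Python sketch below illustrates this under assumed, hypothetical signatures; it is not the interface Sieve defines.

```python
from collections import Counter


def keep_best(scored_values):
    """Filter: keep only the value with the highest quality score.

    scored_values is a list of (value, score) pairs.
    """
    best_value, _ = max(scored_values, key=lambda pair: pair[1])
    return [best_value]


def vote(values):
    """Filter: keep the most frequent value."""
    return [Counter(values).most_common(1)[0][0]]


def average_values(values):
    """Transform: build a new value from all numeric input values."""
    return [sum(values) / len(values)]


filtered = keep_best([(10, 0.2), (12, 0.9)])
voted = vote([10, 12, 12])
transformed = average_values([10, 12])
```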

The specification in Listing 3 illustrates how fusion for the population property of a populated place is configured to use the KeepFirst fusion function (i.e. keep the value with the highest score) applied to the recency quality assessment metric.

The output of the data fusion module is a set of quads, each representing a fused value of a subject-property pair, with the fourth component of the quad identifying the named graph from which the value has been taken. An extension of Sieve for automatically learning an optimal conflict resolution policy is presented in [6].
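The fusion step for a single property can be sketched as follows: for every subject-property pair, keep the quad whose named graph has the highest score for the chosen metric, and leave the graph component untouched so the origin of the value stays visible. The function below is a hypothetical Python illustration of that idea, not Sieve's KeepFirst implementation; the short identifiers, the values and the default score of 0.0 for unscored graphs are assumptions.

```python
def keep_first(quads, graph_scores):
    """For each (subject, property) pair keep the quad from the best-scored graph.

    quads is a list of (subject, property, value, graph) tuples and
    graph_scores maps a graph identifier to its quality score.
    """
    best = {}
    for s, p, o, g in quads:
        # Assumption: a graph without a score counts as 0.0.
        score = graph_scores.get(g, 0.0)
        key = (s, p)
        if key not in best or score > best[key][0]:
            best[key] = (score, (s, p, o, g))
    return [quad for _, quad in best.values()]


# Made-up input: two graphs disagree on the population of the same place.
quads = [
    ("ex:Berlin", "ex:population", 3_400_000, "ex:graphA"),
    ("ex:Berlin", "ex:population", 3_500_000, "ex:graphB"),
]
recency_scores = {"ex:graphA": 0.2, "ex:graphB": 0.9}
fused = keep_first(quads, recency_scores)
```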
