5.4 Provenance

Provenance describes the origin of entities and links. It classifies each relationship as intra-dataset or inter-dataset, depending on whether the linked entities belong to a single dataset or to several datasets. The two views can coexist: a link can be classified as internal to a dataset because it was published within it, and also as inter-dataset because the relationship contains an entity outside of the publishing context. Fig. 13b presents a view of the Web Linkage Validator showing the links internal to a dataset.

Fig. 12. A three-layer representation of our Web Data model. On the node collection layer, nodes labelled with a star * represent blank collections.

Direction. Inter-dataset triples are classified as incoming or outgoing depending on the direction of the link, relative to its origin (the subject entity) and to its destination (the object entity), based on its perspective (the context or publishing domain of the triple). For example, a triple published on the knowledge base (or domain) “” that has a subject entity from “” and object entity from “” would be classified as incoming.

Authority. Similar to provenance and direction, the authority is based on the datasets and entities linked in a relationship. A link is classified as authoritative if at least one entity of the link originates from the publishing domain. For example, if a triple was published on “” and the subject was from “” and the object was from “”, then this link would be considered authoritative because “” is asserting it. However, if the domain in which this triple was published was changed to “”, then it would become a non-authoritative link.

Fig. 13. Views on a dataset provided by the Web Linkage Validator application.

Third-Party Links. With regard to validating datasets, the authority classification helps knowledge base owners to distinguish another important aspect: third-party links. These are non-authoritative links where both the subject and the object of the link are defined in a dataset other than the publishing one. Reviewing them helps an owner discover links that are incorrect or that specify relationships the owner does not explicitly agree with. In some cases, these links are comparable to e-mail spam. Fig. 13c presents the view of the Web Linkage Validator that provides information on the links classified as non-authoritative.
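The classifications described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the Web Linkage Validator's implementation. It assumes that a dataset is identified by the host part of a URI, that a link counts as intra-dataset only when both entities belong to the publishing dataset, and that direction is read from the publishing domain's point of view (outgoing when only the subject is local, incoming when only the object is local). The names `Link`, `dataset_of` and `classify`, and the `*.example` hosts, are hypothetical.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class Link(NamedTuple):
    subject: str  # URI of the subject entity
    obj: str      # URI of the object entity
    context: str  # URI of the document that published the triple


def dataset_of(uri: str) -> str:
    """Assumption: a dataset is identified by the host part of a URI."""
    return urlparse(uri).netloc.lower()


def classify(link: Link) -> dict:
    publisher = dataset_of(link.context)
    subject_local = dataset_of(link.subject) == publisher
    object_local = dataset_of(link.obj) == publisher

    # Provenance (assumed rule): intra-dataset only if both entities
    # belong to the publishing dataset.
    scope = "intra-dataset" if subject_local and object_local else "inter-dataset"

    # Direction (assumed rule), seen from the publishing domain.
    if subject_local and not object_local:
        direction = "outgoing"
    elif object_local and not subject_local:
        direction = "incoming"
    else:
        direction = None  # intra-dataset, or both entities are external

    # Authority: at least one entity originates from the publishing domain.
    authoritative = subject_local or object_local

    return {
        "scope": scope,
        "direction": direction,
        "authoritative": authoritative,
        # Third-party: both entities are defined outside the publishing dataset.
        "third_party": not authoritative,
    }


own = classify(Link("http://a.example/alice", "http://b.example/bob",
                    "http://a.example/doc"))
foreign = classify(Link("http://b.example/x", "http://c.example/y",
                        "http://a.example/doc"))
```

Here `own` describes an authoritative, outgoing inter-dataset link, while `foreign` is flagged as a third-party link because neither entity belongs to the publishing host.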

5.5 How to Improve Your Dataset with the Web Linkage Validator

In this section we show how the results of the Web Linkage Validator can be used as suggestions for improving a dataset. By inspecting the classes and properties of the dataset, the dataset owner gains a detailed understanding of its structure and can determine whether the dataset graph looks as planned. For example, let us assume the dataset contains the “foaf:Person” class which has, among others, the “foaf:name” and “foaf:homepage” properties. From the number of occurrences of these properties, the dataset owner can decide whether the dataset is as intended: if most of the people in the dataset should have a homepage, then this should be reflected in similar numbers of occurrences of the “foaf:name” and “foaf:homepage” properties.
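The comparison of property occurrences can be illustrated with a small sketch. The sample triples and the helper `property_counts` are hypothetical and only stand in for the statistics that the Web Linkage Validator reports; triples are represented as plain `(subject, predicate, object)` tuples with prefixed names.

```python
from collections import Counter

# Hypothetical sample data: (subject, predicate, object) triples.
triples = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:homepage", "http://alice.example/"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "foaf:name", "Bob"),
]


def property_counts(triples, cls):
    """Count how often each property is used on instances of `cls`."""
    instances = {s for s, p, o in triples if p == "rdf:type" and o == cls}
    return Counter(
        p for s, p, o in triples if s in instances and p != "rdf:type"
    )


counts = property_counts(triples, "foaf:Person")
# A large gap suggests that many people lack a homepage.
gap = counts["foaf:name"] - counts["foaf:homepage"]
```

In this sample, `foaf:name` occurs twice and `foaf:homepage` once, so the gap of one points to a person without a homepage.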

The dataset owner can also identify possible mistakes such as typos in class or property names. For example, “foaf:name” is a property of the FOAF vocabulary, but “foaf:naem” is not.
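One way to spot such typos automatically is to compare the terms used in a dataset against a list of known vocabulary terms. The sketch below uses fuzzy string matching from Python's standard `difflib` module; this is an illustrative technique, not necessarily the one used by the Web Linkage Validator. `KNOWN_TERMS` is a small hand-picked subset of FOAF, and `suggest_corrections` and the `cutoff` value are assumptions.

```python
import difflib

# Hand-picked subset of FOAF terms, for illustration only.
KNOWN_TERMS = ["foaf:Person", "foaf:name", "foaf:homepage", "foaf:knows", "foaf:mbox"]


def suggest_corrections(used_terms, known_terms=KNOWN_TERMS, cutoff=0.8):
    """Map each unknown term to its closest known term, or None if none is close."""
    suggestions = {}
    for term in used_terms:
        if term in known_terms:
            continue  # term is part of the vocabulary, nothing to report
        matches = difflib.get_close_matches(term, known_terms, n=1, cutoff=cutoff)
        suggestions[term] = matches[0] if matches else None
    return suggestions


suggestions = suggest_corrections(["foaf:name", "foaf:naem"])
```

With these inputs, `foaf:naem` is reported with `foaf:name` as the suggested correction, while the valid `foaf:name` is not reported at all.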

Moreover, having access to the number of links to and from other datasets, the dataset owner can determine whether the dataset really is part of the LOD cloud. If the number of links to or from other datasets is small, or if there are no such links at all, the Web Linkage Validator supports the dataset owner in improving the dataset by suggesting similar datasets to link to. Starting from the most similar datasets, the dataset owner can identify concepts in a recommended dataset that are similar to the ones already used and link them.
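The link counts mentioned above can be approximated as follows. This is a rough sketch under two assumptions: a dataset is identified by the host part of a URI, and each link is given as a pair of subject and object URIs. The function `link_counts` and the `*.example` hosts are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def host(uri: str) -> str:
    """Assumption: a dataset is identified by the host part of a URI."""
    return urlparse(uri).netloc.lower()


def link_counts(links, own_host):
    """Count links from `own_host` to other hosts, and from other hosts to it."""
    outgoing, incoming = Counter(), Counter()
    for subject, obj in links:
        s, o = host(subject), host(obj)
        if s == own_host and o != own_host:
            outgoing[o] += 1
        elif o == own_host and s != own_host:
            incoming[s] += 1
    return outgoing, incoming


links = [
    ("http://a.example/alice", "http://b.example/bob"),
    ("http://a.example/alice", "http://b.example/carol"),
    ("http://c.example/dave", "http://a.example/alice"),
    ("http://a.example/alice", "http://a.example/erin"),  # internal, ignored
]
outgoing, incoming = link_counts(links, "a.example")
```

Empty counters for a dataset would indicate that it is not yet connected to any other dataset.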

Once the changes have been made, the dataset owner updates the website or the dataset dump. The infrastructure on which the Web Linkage Validator is based then recomputes the data graph summary for the resubmitted dataset, so that the improvements are visible the next time the owner consults the validator.

