
5 Emergent Schemas

In this section, we describe solutions for deriving an emergent relational schema from RDF triples, which one could liken to a UML class diagram. These solutions have been implemented in the RDF parser of the open-source research column store MonetDB; we call the result MonetDB/RDF. A more extensive description of this work can be found in [8].

Our problem description is as follows. Given a (very) large set of RDF triples, we are looking for an emergent schema that describes this RDF data. The schema consists of classes with their attributes and their literal types, plus the relationships between classes for URI objects, subject to three requirements:

(a) the schema should be compact: the number of classes, attributes and relationships should be as small as possible, so that it is easily understood by humans and data does not get scattered over too many small tables.

(b) the schema should have high coverage: the great majority of the triples in the dataset should represent an attribute value or relationship of a class. Some triples may not be represented by the schema (we call these the “non-regular” triples), but we try to keep this loss of coverage small, e.g. <10%.

(c) the schema should be precise: the number of missing properties for any subject that is a member of a recognized class should be minimized.

Our solution is based on finding Characteristic Sets (CS) of properties that co-occur with the same subject. We obtain a more compact schema than [10] by using the TF/IDF (Term Frequency/Inverse Document Frequency) measure from information retrieval [16] to detect discriminative properties, and by using semantic information to merge similar CS's. Further, a schema graph of CS's is created by analyzing the co-reference relationship statistics between CS's.

Given our intention to provide users with an easy-to-understand emergent schema, our second challenge is to determine logical and short labels for the classes, attributes and relationships. For this we use ontology labels and class hierarchy information, if present, as well as CS co-reference statistics, to obtain class, attribute and relationship labels.

5.1 Step 1: Basic CS Discovery

Exploring CS's. We first identify the basic set of CS's by making one pass through all triples in the SPO (Subject, Predicate, Object) table created after bulk-loading all RDF triples. In a second step, these basic CS's are split further into combinations of (property, literal-type) for properties whose object is a literal value. Thus, for each basic CS found, we may have multiple CS variants, one for each combination of occurring literal types. We need the information on literal types because our end objective is RDF storage in relational tables, which allow only a single type per column.
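The discovery step above can be illustrated with a minimal Python sketch. It is not the MonetDB/RDF implementation: the 4-tuple row layout, the use of None as the literal type of URI objects, and the names TRIPLES and discover_basic_cs are assumptions made for this example. It groups the rows per subject in memory instead of relying on the sort order of the SPO table.

```python
from collections import Counter, defaultdict

# Toy SPO rows: (subject, property, object, literal_type).
# literal_type is None when the object is a URI (hypothetical encoding).
TRIPLES = [
    ("s1", "name", "Alice", "string"),
    ("s1", "age", "30", "int"),
    ("s2", "name", "Bob", "string"),
    ("s2", "age", "unknown", "string"),
    ("s3", "name", "Carol", "string"),
    ("s3", "knows", "s1", None),
]

def discover_basic_cs(triples):
    """Group subjects by their property set, then by (property, literal type)."""
    props = defaultdict(set)   # subject -> {property}
    typed = defaultdict(set)   # subject -> {(property, literal_type)}
    for s, p, _o, lit_type in triples:
        props[s].add(p)
        typed[s].add((p, lit_type))

    basic = Counter()                # basic CS -> number of subjects
    variants = defaultdict(Counter)  # basic CS -> typed variant -> number of subjects
    for s, pset in props.items():
        cs = frozenset(pset)
        basic[cs] += 1
        variants[cs][frozenset(typed[s])] += 1
    return basic, variants

basic, variants = discover_basic_cs(TRIPLES)
```

In this toy data, s1 and s2 share the basic CS {name, age} but fall into two variants, because age is an int for s1 and a string for s2. A relational column for age could hold only one of those types, which is why the variants are tracked separately.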

Exploring CS Relationships. A foreign key (FK) relationship between two CS's exists when a URI property of one CS typically refers in the object field to members of one other CS (object-subject references). Therefore, we make a second pass over all triples with a non-literal object, look up which basic CS the reference points to, and count the frequencies of the various destination CS's.
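The second pass can be sketched as follows, again as an illustration rather than the actual implementation. The CS labels ("Person", "City", "Org"), the function names, and in particular the min_share threshold used to decide when a property "typically" refers to one CS are assumptions for this example; the text above does not state a concrete cut-off.

```python
from collections import Counter, defaultdict

# Hypothetical result of the first pass: subject -> identifier of its basic CS.
SUBJECT_CS = {"p1": "Person", "p2": "Person", "c1": "City", "c2": "City", "o1": "Org"}

# Triples whose object is a URI (non-literal): (subject, property, object).
URI_TRIPLES = [
    ("p1", "livesIn", "c1"),
    ("p2", "livesIn", "c2"),
    ("p1", "worksFor", "o1"),
    ("p2", "worksFor", "c1"),
    ("p1", "homepage", "http://example.org/p1"),  # object is not a known subject
]

def count_references(uri_triples, subject_cs):
    """Count, per (source CS, property), how often each destination CS is referenced."""
    counts = defaultdict(Counter)
    for s, p, o in uri_triples:
        src, dst = subject_cs.get(s), subject_cs.get(o)
        if src is None or dst is None:
            continue  # reference does not point to a member of any CS
        counts[(src, p)][dst] += 1
    return counts

def fk_candidates(counts, min_share=0.8):
    """Keep (source CS, property) pairs dominated by one destination CS.

    min_share is an assumed threshold, not a value taken from the text.
    """
    result = {}
    for key, dests in counts.items():
        dst, n = dests.most_common(1)[0]
        if n / sum(dests.values()) >= min_share:
            result[key] = dst
    return result

counts = count_references(URI_TRIPLES, SUBJECT_CS)
fks = fk_candidates(counts)
```

Here livesIn always points from Person to City and is kept as an FK candidate, while worksFor is split evenly between Org and City and is dropped under the assumed threshold.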
