
5.2 Step2: Dimension Tables Detection

There tends to be a long tail of infrequently occurring CS's, and since we want a compact schema, the infrequent CS's should be pruned. However, a low-frequency CS that is referred to many times by high-frequency CS's in fact represents important information in the dataset and should be part of the schema. This is similar to a dimension table in a relational data warehouse, which may be small itself but may be referred to, over a foreign key, by many millions of tuples in large fact tables. Detecting dimension tables should not, however, be based only on the number of direct relationship references. The relational analogy here is the snowflake schema, where a finer-grained dimension table like NATION refers to an even smaller, coarser-grained dimension table CONTINENT. To find the transitive relationships and their relative importance, we use the recursive PageRank [14] algorithm on the graph formed by all CS's (vertices) and relationships (edges). As a final result, we mark low-frequency CS's with a high rank as “dimension” tables, which protects them from being pruned later.
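As a rough illustration of this step, the following is a minimal sketch that runs a plain power-iteration PageRank over a small CS graph and marks low-frequency, high-rank CS's as dimension tables. The function names, the damping factor, the thresholds `min_freq` and `min_rank`, and the example data are assumptions made for this sketch; the text does not specify how "low-frequency" and "high rank" are cut off.

```python
def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a directed graph."""
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    n_nodes = len(nodes)
    rank = {n: 1.0 / n_nodes for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out[n])
        base = (1.0 - damping) / n_nodes + damping * dangling / n_nodes
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out[src]:
                new_rank[dst] += damping * rank[src] / len(out[src])
        rank = new_rank
    return rank


def dimension_tables(freq, edges, min_freq, min_rank):
    """Return CS's that are infrequent but highly ranked (assumed cut-offs)."""
    rank = pagerank(list(freq), edges)
    return {cs for cs in freq if freq[cs] < min_freq and rank[cs] >= min_rank}


# Hypothetical CS frequencies and relationships (referring CS -> referred CS).
freq = {"Order": 1_000_000, "Customer": 200_000,
        "Nation": 25, "Continent": 5, "Rare": 3}
edges = [("Order", "Customer"), ("Customer", "Nation"), ("Nation", "Continent")]

dims = dimension_tables(freq, edges, min_freq=100, min_rank=0.2)
print(sorted(dims))
```

In this toy graph, Nation and Continent are retained because rank flows to them transitively from the large CS's, while the equally small but unreferenced Rare stays a pruning candidate.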

5.3 Step3: Human-Friendly Labels

When presenting humans with a UML or relational schema, short labels should be used as aliases for machine-readable and unique URIs for naming classes, attributes and relationships. For assigning labels to CS's, we exploit both structural and semantic information (ontologies).

Type Properties. Certain specific properties (e.g., rdf:type) explicitly specify the class or concept a subject belongs to. By analyzing the frequency distribution of the different RDF type property values in the triples that belong to a CS, we can find a class label for the CS. As ontologies usually contain hierarchies, we create a hierarchy-aware histogram of type property values per CS. The type property value that describes most of the subjects in the CS, but is also as specific as possible, is chosen as the URI of the class. If an ontology class URI is found, we can use its label as the CS's label. In Fig. 3, the value “Ship” is chosen.
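One way to read "describes most of the subjects, but is as specific as possible" is sketched below: each subject counts towards its declared types and all of their superclasses, and among the classes covering at least a given share of subjects, the deepest one in the hierarchy wins. The coverage threshold `min_coverage`, the tie-breaking rule, and the example hierarchy and counts are assumptions for illustration only, not values taken from the text or from Fig. 3.

```python
def ancestors(cls, parent):
    """Yield cls followed by its superclasses, most specific first.

    Assumes the hierarchy in `parent` is acyclic.
    """
    while cls is not None:
        yield cls
        cls = parent.get(cls)


def depth(cls, parent):
    """Number of classes on the path from cls up to the root."""
    return sum(1 for _ in ancestors(cls, parent))


def choose_type(subject_types, parent, min_coverage=0.8):
    """Pick the most specific class that still covers enough subjects."""
    total = len(subject_types)
    covered = {}
    for types in subject_types:
        # A subject typed as X also counts for every superclass of X.
        closure = {a for t in types for a in ancestors(t, parent)}
        for cls in closure:
            covered[cls] = covered.get(cls, 0) + 1
    candidates = [c for c, k in covered.items() if k / total >= min_coverage]
    if not candidates:
        return None
    # Prefer the deepest class, then the one with the best coverage.
    return max(candidates, key=lambda c: (depth(c, parent), covered[c]))


# Hypothetical hierarchy (child -> parent) and type values per subject.
parent = {"Ship": "MeanOfTransportation", "MeanOfTransportation": "Thing"}
subject_types = [["Ship"]] * 9 + [["MeanOfTransportation"]]

chosen = choose_type(subject_types, parent)
print(chosen)
```

With these made-up counts, "Thing" and "MeanOfTransportation" cover all subjects, but "Ship" still covers 90% and is more specific, so it is selected.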

Ontologies. Even if no type property is present in the CS, we can still try to match a CS to an ontology class. We compare the property set of the CS with the property sets of ontology classes using the TF/IDF similarity score [16]. This method relies on identifying “discriminative” properties, which appear in only a few ontology classes and whose occurrence in the triple data thus gives a strong hint of membership in a specific class. An example is shown in Fig. 2.
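The matching idea can be sketched as follows, treating each ontology class as a "document" whose terms are its properties. This sketch assumes binary term frequencies, a smoothed inverse document frequency, and cosine similarity; the exact TF/IDF variant used in [16] may differ, and the ontology and property names below are hypothetical.

```python
import math


def idf_weights(ontology):
    """Smoothed IDF per property: rarer properties get larger weights."""
    n_classes = len(ontology)
    doc_freq = {}
    for props in ontology.values():
        for p in props:
            doc_freq[p] = doc_freq.get(p, 0) + 1
    return {p: math.log(1 + n_classes / k) for p, k in doc_freq.items()}


def best_class(cs_props, ontology):
    """Return (class, score) with the highest IDF-weighted cosine similarity."""
    idf = idf_weights(ontology)

    def norm(props):
        # Properties unknown to the ontology get weight 0 and are ignored.
        return math.sqrt(sum(idf.get(p, 0.0) ** 2 for p in props))

    best, best_score = None, 0.0
    for cls, props in ontology.items():
        denom = norm(cs_props) * norm(props)
        shared = cs_props & props
        score = sum(idf[p] ** 2 for p in shared) / denom if denom else 0.0
        if score > best_score:
            best, best_score = cls, score
    return best, best_score


# Hypothetical ontology classes and a CS whose subjects carry no rdf:type.
ontology = {
    "Ship": {"name", "shipClass", "launchDate"},
    "Person": {"name", "birthDate", "birthPlace"},
    "City": {"name", "population"},
}
cs_props = {"name", "shipClass", "launchDate", "homepage"}

match, score = best_class(cs_props, ontology)
print(match, round(score, 3))
```

Here "name" occurs in every class and contributes little, whereas "shipClass" and "launchDate" occur in one class only and dominate the score, which is the discriminative effect described above.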

The ontology class correspondence of a CS, if found, is also used to find labels for properties of the CS (both for relationships and literal properties).

Relationships between CS's. If the previous approaches do not apply, we can look at which other CS's refer to a CS, and then use the URI of the referring property to derive a label. For example, a CS that is referred to as <author> indicates that this CS represents instances of an <Author> class. We use the most frequent relationship to provide a CS label. Figure 4 shows an example of such “foreign key” names.
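A minimal sketch of this fall-back is shown below. It assumes the incoming references have already been collected as a list of shortened property names, one entry per referring triple; the function name and the capitalization rule are illustrative choices, not part of the text.

```python
from collections import Counter


def label_from_references(incoming_properties):
    """Derive a CS label from the most frequent referring property."""
    if not incoming_properties:
        return None
    prop, _count = Counter(incoming_properties).most_common(1)[0]
    # Capitalize the first letter so that "author" yields "Author".
    return prop[:1].upper() + prop[1:]


# Hypothetical references to one CS from other CS's.
refs = ["author", "author", "editor", "author", "reviewer"]
label = label_from_references(refs)
print(label)
```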

Fig. 2. Example CS vs. ontology class

Fig. 3. CS type property values

Fig. 4. References to a CS

URI shortening. If the above solutions cannot provide a link to ontology information, we resort to a practical fall-back for attribute and relationship labels, based on the observation that property URI values often do convey a hint of the semantics. That is, to find labels for CS properties we shorten URIs (e.g., a full property URI becomes offers) by removing the ontology prefix, or by simply using the part after the last slash, as suggested by [11].
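The shortening rule can be sketched in a few lines. The `example.org` URIs and the `prefixes` argument are hypothetical placeholders chosen for this sketch; they are not the URIs used in the original example.

```python
def shorten_uri(uri, prefixes=()):
    """Strip a known ontology prefix, else keep the part after the last slash."""
    # Try longer prefixes first so the most specific one wins.
    for prefix in sorted(prefixes, key=len, reverse=True):
        if uri.startswith(prefix) and len(uri) > len(prefix):
            return uri[len(prefix):]
    # Fall back to the last path segment, ignoring a trailing slash.
    return uri.rstrip("/").rsplit("/", 1)[-1]


known_prefixes = ["http://example.org/vocab#"]  # hypothetical ontology prefix

with_prefix = shorten_uri("http://example.org/vocab#offers", known_prefixes)
last_segment = shorten_uri("http://example.org/vocab/offers")
print(with_prefix, last_segment)
```

Note that without a known prefix, a hash-style URI such as `http://example.org/vocab#offers` is shortened only to `vocab#offers`, which is why removing the ontology prefix is tried first.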
