5.6 Final Schema Evaluation

To evaluate the quality of the final schema, we conducted extensive experiments over a wide range of real-world and synthetic datasets (DBpedia[1], PubMed[2], DBLP[3], MusicBrainz[4], EuroStat[5], BSBM[6], SP2B[7], LUBM[8] and WebDataCommons[9]). The experimental results in Table 7 show that we can derive a compact schema with a relatively small number of tables from each dataset. The synthetic RDF benchmark data (BSBM, SP2B, LUBM) is fully relational, and all datasets with non-RDF roots (PubMed, MusicBrainz, EuroStat) reach more than 99 % coverage. Most surprisingly, the RDFa data that dominates WebDataCommons, and even DBpedia, are more than 90 % regular.

Table 7. Number of tables and coverage percentage after merging & filtering steps
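To make the two reported quantities concrete, the following is a minimal sketch of how a table count and a coverage percentage could be computed. It assumes that coverage means the share of triples whose subject ends up in one of the retained tables, and it stands in for the merging and filtering steps with a single, hypothetical `min_subjects` threshold on identical property sets. The function names and the toy data are illustrative only and are not taken from the system described here.

```python
from collections import defaultdict

def property_sets(triples):
    """Map each subject to the frozenset of properties it occurs with."""
    props = defaultdict(set)
    for s, p, _ in triples:
        props[s].add(p)
    return {s: frozenset(ps) for s, ps in props.items()}

def tables_and_coverage(triples, min_subjects=2):
    """Return (number of retained tables, fraction of triples covered).

    Assumption: one candidate table per distinct property set; a table is
    retained only if at least `min_subjects` subjects share that set.
    """
    ps_of = property_sets(triples)
    support = defaultdict(int)
    for ps in ps_of.values():
        support[ps] += 1
    kept = {ps for ps, n in support.items() if n >= min_subjects}
    covered = sum(1 for s, _, _ in triples if ps_of[s] in kept)
    return len(kept), covered / len(triples)

# Toy data: three regular "person" subjects and one irregular subject.
triples = [
    ("a1", "name", "Ann"), ("a1", "age", 31),
    ("a2", "name", "Bob"), ("a2", "age", 45),
    ("a3", "name", "Cy"),  ("a3", "age", 27),
    ("x1", "homepage", "http://example.org"),
]
n_tables, cov = tables_and_coverage(triples)
print(n_tables, round(100 * cov, 1))
```

In this toy input one table is retained and six of the seven triples fall into it; the single irregular triple would stay in a residual triple store.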

Labeling Evaluation. We evaluated the quality of the labels in the final schema by showing the schemas of DBpedia and WebDataCommons (complex and possibly “dirty” datasets) to 19 human raters. The survey asked them to rate label quality on a 5-point Likert scale from 1 (bad) to 5 (excellent); 78 % (WebDataCommons) and 90 % (DBpedia) of the labels were rated 4 points (“good”) or better.

Computational cost & Compression. Our experiments also show that the time for detecting the emergent schema is negligible compared to the bulk-loading time for building a single SPO table, so schema detection can be integrated into the bulk-loading process without any noticeable delay. Additionally, a database stored as relational tables can be 2x smaller than the same data in a single SPO triple table, since in the relational representation the S and P columns are effectively compressed away and only the O columns remain.
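The intuition behind the size reduction can be sketched by counting stored values. The example below assumes that a subject is identified by its row position and a property by its column, so only object values are materialized; the helper names are hypothetical. It counts cells, not bytes, so it illustrates why S and P disappear but does not reproduce the 2x figure, which depends on encodings and on how regular the data is.

```python
def spo_cells(triples):
    """A triple table stores a subject, a property and an object per triple."""
    return 3 * len(triples)

def to_columns(triples):
    """Pivot triples into one column per property, one row per subject.

    Assumption: each (subject, property) pair has at most one object.
    Missing values are stored as None.
    """
    subjects = sorted({s for s, _, _ in triples})
    row_of = {s: i for i, s in enumerate(subjects)}
    columns = {p: [None] * len(subjects) for _, p, _ in triples}
    for s, p, o in triples:
        columns[p][row_of[s]] = o
    return columns

def table_cells(columns):
    """Only object values (including None padding) are stored."""
    return sum(len(col) for col in columns.values())

triples = [
    ("a1", "name", "Ann"), ("a1", "age", 31),
    ("a2", "name", "Bob"), ("a2", "age", 45),
    ("a3", "name", "Cy"),  ("a3", "age", 27),
]
columns = to_columns(triples)
print(spo_cells(triples), table_cells(columns))
```

For irregular data the padding with `None` grows, which is one reason the benefit depends on how much of the dataset the schema covers.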

Final words. We think the emergent schema detection approach we developed and evaluated is promising. The fact that all tested RDF datasets turned out to be highly regular, and that good labels could be found for them, already provides immediate value: MonetDB/RDF can now simply be used to load RDF data into a SQL system, so existing SQL applications can be used on RDF data without change. We expect that all systems that can store both RDF and relational data (besides Virtuoso, this includes the RDF solutions by Oracle and IBM) could incorporate the possibility to load RDF data and query it from both SQL and SPARQL.

Future research will verify the approach on more RDF datasets and further tune the recognition algorithms. The natural second step is to make the SPARQL engine aware of the emergent schema, so that its query optimization becomes more reliable and query execution can reduce the join effort in evaluating so-called SPARQL star-patterns. On benchmarks like LUBM and BSBM our results show that SPARQL systems could become just as fast as SQL systems, and even on “real” RDF datasets like DBpedia, 90 % of the join effort can likely be accelerated. Work is underway to verify this in both MonetDB and Virtuoso.
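As an illustration of why star-patterns benefit from an emergent schema, the sketch below evaluates the same pattern (all subjects that have both a name and an age) in two ways: once against a plain triple list, which needs one pass per property followed by a join on the subject, and once against a row-per-subject table, where a single scan suffices. This is a toy model under the assumption of single-valued properties; the function names are hypothetical and do not describe the MonetDB or Virtuoso implementation.

```python
def star_on_triples(triples, properties):
    """One selection per property, joined on subject (properties non-empty)."""
    result = None
    for p in properties:
        matches = {s: o for s, pp, o in triples if pp == p}
        if result is None:
            result = {s: [o] for s, o in matches.items()}
        else:
            # Join step: keep only subjects that also match this property.
            result = {s: vals + [matches[s]]
                      for s, vals in result.items() if s in matches}
    return {s: tuple(vals) for s, vals in result.items()}

def star_on_table(rows, properties):
    """Single scan over rows of the form {subject: {property: value}}."""
    return {s: tuple(row[p] for p in properties)
            for s, row in rows.items()
            if all(row.get(p) is not None for p in properties)}

triples = [
    ("a1", "name", "Ann"), ("a1", "age", 31),
    ("a2", "name", "Bob"), ("a2", "age", 45),
    ("a3", "name", "Cy"),
]
# Build the row-per-subject representation from the same triples.
rows = {}
for s, p, o in triples:
    rows.setdefault(s, {})[p] = o

via_triples = star_on_triples(triples, ["name", "age"])
via_table = star_on_table(rows, ["name", "age"])
print(via_triples == via_table, via_table)
```

Both evaluations return the same bindings; the table variant simply avoids the per-property join, which is the join effort the text refers to.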

  • [1] dbpedia.org (we used v3.9)
  • [2] ncbi.nlm.nih.gov/pubmed
  • [3] gaia.infor.uva.es/hdt/dblp-2012-11-28.hdt.gz
  • [4] linkedbrainz.c4dmpresents.org/data/musicbrainz ngs dump.rdf.ttl.gz
  • [5] eurostat.linked-statistics.org
  • [6] wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/
  • [7] dbis.informatik.uni-freiburg.de/forschung/projekte/SP2B/
  • [8] swat.cse.lehigh.edu/projects/lubm/
  • [9] A 100M triple file of webdatacommons.org
 