
2.4 State of the RDF Tax

We refer to the performance difference between a relational and an RDF implementation of a workload as the RDF tax. It has been measured precisely with the Star Schema Benchmark (SSB) [12], a simplified derivative of TPC-H. While Virtuoso runs TPC-H in SQL [13] on a par with the best, the RDF translation of all the TPC-H query optimization logic is not yet complete, hence we look at SSB here.

SSB has one large fact table (lineorder) and several smaller dimension tables (part, supplier, dwdate, nation and region). The schema is denormalized into a simple star shape. Its RDF translation is trivial: each primary key of each table becomes a URI, each column a property, and each foreign key a reference to the corresponding URI.
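The row-to-triples mapping just described can be sketched in a few lines. The table name, key name and the `_fk` foreign-key naming convention below are illustrative assumptions for the sketch, not Virtuoso's actual mapping machinery:

```python
def row_to_triples(table, pk, row):
    """Map one relational row to triples: the primary key becomes the
    subject URI, each column a property; foreign keys get URI objects."""
    subj = f"<{table}/{row[pk]}>"
    triples = []
    for col, val in row.items():
        if col == pk:
            continue
        # assumption for the sketch: foreign keys carry a _fk suffix and
        # point at the table named by the prefix
        obj = f"<{col[:-3]}/{val}>" if col.endswith("_fk") else repr(val)
        triples.append((subj, f"<{table}#{col}>", obj))
    return triples

triples = row_to_triples("lineorder", "lo_key",
                         {"lo_key": 1, "lo_quantity": 17, "part_fk": 42})
```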

SSB was run at 30 GB scale on a single server with Virtuoso, MonetDB and MySQL. In SQL, Virtuoso beats MonetDB by a factor of 2 and MySQL by a factor of 300 (see Table 1). In SPARQL, Virtuoso comes 10–20% behind MonetDB but is still 100x ahead of MySQL. These results place the RDF tax at about 2.5x in query execution time. Thanks to Virtuoso's excellent query performance, SPARQL in Virtuoso will outperform all but the best RDBMSs in analytics, even when these are running SQL.

All plans consist of a scan of the fact table with selective hash joins against dimension tables, followed by a simple aggregation or a group by with relatively few groups, e.g. YEAR and NATION. In the RDF variant, the fact table scan becomes a scan of a property from start to end, with the object, usually a foreign key, used for probing a hash table built from a dimension table. The next operation is typically a lookup on another property where the subject (S) is given by the first and the object (O) must again satisfy a condition, such as being in a hash table.
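The plan shape described above can be sketched as a selective hash join: build a hash from the small, filtered dimension, then make one pass over the large fact side and probe. The table and column names below are illustrative, not SSB's exact schema:

```python
def hash_join(fact_rows, fk, dim_rows, dim_key, dim_pred):
    """Selective hash join: small filtered dimension on the build side,
    large fact table on the probe side."""
    # build side: only dimension rows that pass the filter enter the hash
    build = {r[dim_key] for r in dim_rows if dim_pred(r)}
    # probe side: a single pass over the fact table
    return [r for r in fact_rows if r[fk] in build]

dims = [{"s_key": 1, "region": "ASIA"}, {"s_key": 2, "region": "EUROPE"}]
facts = [{"lo_key": 10, "supp_fk": 1}, {"lo_key": 11, "supp_fk": 2}]
asia = hash_join(facts, "supp_fk", dims, "s_key",
                 lambda d: d["region"] == "ASIA")
```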

The RDF tax here consists of the fact that the second column must be looked up via a self-join instead of lying on the same row as the previous column. This is

Table 1. Star Schema Benchmark at 30 GB and 300 GB scale (times in seconds)

         ------------------- 30 GB --------------------   ------- 300 GB --------
Query    Virtuoso   Virtuoso   RDF    MonetDB   MySQL     Virtuoso  Virtuoso  RDF
         SQL        SPARQL     tax                        SQL       SPARQL    tax
Q1       0.413      1.101      2.67   1.659     82.477    2.285     7.767     3.40
Q2       0.282      0.416      1.48   0.5       74.436    1.53      3.535     2.31
Q3       0.253      0.295      1.17   0.494     75.411    1.237     1.457     1.18
Q4       0.828      2.484      3.00   0.958     226.604   3.459     6.978     2.02
Q5       0.837      1.915      2.29   0.648     222.782   3.065     8.71      2.84
Q6       0.419      1.813      4.33   0.541     219.656   2.901     8.454     2.91
Q7       1.062      2.33       2.19   5.658     237.73    5.733     15.939    2.78
Q8       0.617      2.182      3.54   0.521     194.918   2.267     6.759     2.98
Q9       0.547      1.29       2.36   0.381     186.112   1.773     4.217     2.38
Q10      0.499      0.639      1.28   0.37      186.123   1.44      4.342     3.02
Q11      1.132      2.142      1.89   2.76      241.045   5.031     12.608    2.51
Q12      0.863      3.77       4.37   2.127     241.439   4.464     15.497    3.47
Q13      0.653      1.612      2.47   1.005     202.817   2.807     4.467     1.59
Total    8.405      21.989     2.62   17.622    2391.55   37.992    100.73    2.65

the best case for the RDF tax, as execution is identical in all other respects. There are some string comparisons, e.g. the brand column contains a string, but these are placed on the build side of a hash join and thus do not run over much data.

In a broader context, the RDF tax has the following components:

Self-joins. If there are conditions on more than one column, each subsequent one must be fetched via a join. This join is usually local and ordered, but still costs more than reading another column of the same row. In a column store, predicates on a scan can be dynamically reordered based on their observed performance. In RDF this is not readily feasible, as it would alter the join order.
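The cost structure of the extra self-join is easy to see in miniature. The sketch below uses SQLite as a stand-in engine, with made-up table names and data, to contrast a row layout with a triple-table layout of the same facts:

```python
import sqlite3

# Illustrative data: the same two facts stored both ways.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lineorder(id INTEGER PRIMARY KEY, quantity INT, discount INT);
INSERT INTO lineorder VALUES (1, 5, 2), (2, 30, 1);
-- the same facts flattened into a triple table
CREATE TABLE quad(s INT, p TEXT, o INT);
INSERT INTO quad VALUES (1,'quantity',5),(1,'discount',2),
                        (2,'quantity',30),(2,'discount',1);
""")
# relational: both predicates sit on one row, a single scan suffices
rel = con.execute(
    "SELECT id FROM lineorder WHERE quantity < 25 AND discount >= 2").fetchall()
# triples: the second predicate needs a self-join on the subject
rdf = con.execute("""
    SELECT q1.s FROM quad q1 JOIN quad q2 ON q1.s = q2.s
    WHERE q1.p = 'quantity' AND q1.o < 25
      AND q2.p = 'discount' AND q2.o >= 2""").fetchall()
```

Both queries return the same subject; the triple version simply pays for one extra join per additional condition.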

Cardinality estimation. In a multi-column table one can sample several predicates' worth of data in one pass; in RDF the same estimate requires doing joins inside the cost model and is harder. Errors in cost estimation compound over many joins. An accurate choice between hash and index based joins requires reliable counts on either side. In SQL analytics, indices are often not even present, hence the join type decision is self-evident.
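A toy illustration of the estimation gap, with synthetic and deliberately correlated columns: a one-pass multi-column sample sees the correlation, while multiplying per-predicate selectivities, as a triple store may be forced to do, misses it:

```python
# synthetic, correlated data: low quantity always implies high discount
rows = [{"quantity": q, "discount": 9 if q < 25 else 0} for q in range(50)]

# combined selectivity observed on a multi-column sample (one pass)
sel_both = sum(r["quantity"] < 25 and r["discount"] >= 5
               for r in rows) / len(rows)

# independence assumption: multiply per-predicate selectivities
sel_q = sum(r["quantity"] < 25 for r in rows) / len(rows)
sel_d = sum(r["discount"] >= 5 for r in rows) / len(rows)
sel_indep = sel_q * sel_d   # underestimates by 2x on this data
```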

Optimization search space. A usage pattern of queries with tens of triple patterns that actually hit only a few thousand triples leads to compilation dominating execution time. A full exploration of all join orders is infeasible, as their number is on the order of the factorial of the table count, and there can easily be 30 or 40 tables. Reuse of plans that differ only in literals is a possibility and has been tried; it is beneficial in some cases, but one still needs to revalidate whether the cardinality estimates hold under the new literal values. Exploring plans with many joins pushed to the build side of a hash join further expands the search space.
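The factorial growth mentioned above, in concrete numbers:

```python
import math

# join orders for n tables grow as n!; exhaustive enumeration is hopeless
# long before the 30-40 joins that large SPARQL queries can reach
orders = {n: math.factorial(n) for n in (5, 10, 30)}
# 5 tables: 120 orders; 10 tables: ~3.6 million; 30 tables: ~2.7e32
```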

String operations. Since RDF is indexed many ways and arbitrary strings are allowed everywhere, implementations store unique strings in a dictionary and a specially tagged reference to the dictionary in the index. Going to the dictionary makes a scan with a LIKE condition extremely slow, especially if each string is distinct. Use of a full-text index is therefore common.
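A minimal sketch of the dictionary encoding described above: the index holds compact integer IDs, so a substring (LIKE-style) scan must dereference every ID back to its string. Names and data are illustrative:

```python
class StringDict:
    """Interns strings to small integer IDs, as an RDF store's literal
    dictionary does; the index then stores only the IDs."""
    def __init__(self):
        self.by_str, self.by_id = {}, []

    def intern(self, s):
        if s not in self.by_str:
            self.by_str[s] = len(self.by_id)
            self.by_id.append(s)
        return self.by_str[s]

d = StringDict()
column = [d.intern(s) for s in ["MFGR#12", "MFGR#13", "ACME"]]
# a LIKE-style scan: one dictionary dereference per row, every row
hits = [i for i in column if "MFGR" in d.by_id[i]]
```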

URIs. For applications that do lookups, as most RDF applications do, translating identifiers to their external, usually very long, string form is an RDF-only penalty. It can be alleviated by performing the translation as late as possible, but string conditions on URIs in particular are disastrous for performance.
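A sketch of the late-translation mitigation, with hypothetical IDs and URIs: the plan runs on compact internal IDs throughout and expands only the surviving rows, whereas a condition on the URI string forces translating every candidate:

```python
# hypothetical internal-ID-to-URI dictionary
id_to_uri = {7: "http://example.org/supplier/7",
             8: "http://example.org/supplier/8"}

surviving_ids = [8]                             # output of the query pipeline
result = [id_to_uri[i] for i in surviving_ids]  # translate as late as possible

# a condition on the URI *string* instead touches the dictionary per row
matches = sorted(i for i in id_to_uri
                 if id_to_uri[i].startswith("http://example.org/"))
```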

Indexing everything. Since there is usually an indexed access path for everything, space and time are consumed maintaining it. TPC-H 100G loads in SQL in 15 min with no indices other than the primary keys; the 1:1 RDF translation takes 12 h. This is the worst case of the RDF tax, but it is limited to bulk load. Update-intensive OLTP applications, where this would be worse still, are generally not done in RDF. Of course nobody forces one to index everything, but leaving indices out adds complexity to query optimization for cases where the predicate is not known at compile time.

Runtime data typing. This is a relatively light penalty, since with vectored execution it is possible to detect a homogeneous vector at runtime and use a typed data format. If a property is all integers, these can be summed by an integer-specific function. This usually works, since RDF often comes from relational sources. DBpedia is perhaps an exception with very dirty data, but then it is not large, hence the penalty stays small.
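The homogeneous-vector fast path described above can be sketched as follows (pure Python, purely illustrative of the control flow, not of a real vectored executor):

```python
def vector_sum(values):
    """Sum a vector of O values: a typed fast path when the vector turns
    out to be all integers, a generic coercing path otherwise."""
    if all(type(v) is int for v in values):   # runtime homogeneity check
        return sum(values)                    # typed, branch-free path
    total = 0
    for v in values:                          # generic path with coercion
        total += int(v)
    return total
```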

Lack of schema. There is usually no schema, or the data does not comply with it. Therefore optimizations like late projection that are safe in SQL are not readily applicable. If you take the 10 youngest people and return their birth date, name and address, you cannot defer fetching the name and address until after the top 10 are known, since there might be people with many names, or none at all. These special cases complicate the matter, but optimizations having to do with top-k order remain possible. Similarly, dependencies between grouping columns in a group by cannot be exploited, because one does not know that these are in fact functional even if the schema claims so.
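The late-projection hazard can be shown on a toy dataset (hypothetical people and properties): the top-k is computed over subjects, but a multi-valued name property makes the final row count differ from k, so the projection cannot simply be bolted on afterwards:

```python
# illustrative data: subject "b" has two names, as RDF permits
people = {"a": {"born": 2001, "name": ["Ann"]},
          "b": {"born": 1999, "name": ["Bo", "Robert"]}}

k = 2
# top-k subjects by birth year, computed on the sort key alone
top = sorted(people, key=lambda s: people[s]["born"], reverse=True)[:k]
# "late" projection of the names: one result row per (subject, name) pair
rows = [(s, people[s]["born"], n) for s in top for n in people[s]["name"]]
# the top-2 subjects yield 3 result rows, not 2
```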

Many of these penalties fall away when leaving the triple table format and actually making physical tables with columns for single-valued properties. The exceptions may still be stored as triples/quads, so this does not represent a return to schema-first. Physical design, such as storing the same data in multiple orders, now becomes possible, since data that are alike occupy their own table. Also, n:m relationships with attributes can be stored efficiently in a table with a multi-part key while still looking like triples.
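The hybrid layout can be sketched with illustrative names: single-valued properties live in a wide per-subject row, multi-valued leftovers stay as triples, and one access function makes both look like triples:

```python
# wide table: one row per subject for single-valued properties
wide = {1: {"quantity": 5, "discount": 2}}
# exception store: multi-valued properties remain as (s, p, o) triples
quads = [(1, "comment", "rush order"), (1, "comment", "fragile")]

def values(s, p):
    """Return all O values for (s, p), whichever store holds them."""
    row = wide.get(s, {})
    if p in row:
        return [row[p]]
    return [o for (s2, p2, o) in quads if s2 == s and p2 == p]
```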

This is further analyzed in a later section of this chapter. Implementation in Virtuoso is foreseen in the near future.

 