Virtuoso: The Prometheus of RDF-based Relational Data Management By Orri Erling Virtuoso Program Manager OpenLink Software
Linked Data at Dawn The Promise and the Practice The Science of Speed The Structure which Is Ongoing Research License CC-BY-SA 4.0 (International).
Linked Data Promises: RDF is a generic, minimalistic model for describing things. RDF has global identifiers, and data is self-describing. URIs may be dereferenceable. RDF is flexible to query and does not force a single hierarchical view like XML.
Linked Data Scenarios: RDF is used because of
global identifiers. Inference, if present, is usually trivial.
Where Triples Come From: Relational extracts or web content is converted to and stored as triples. NLP extraction. New applications with RDF as primary data model. Doing SPARQL against data in RDBs is possible but is rare and does not deliver the flexibility.
Linked Data Verticals and Patterns. Publishing: tagging & annotations, evolving vocabularies. Archives: self-description, long-term identifiers, many versions of schema. Semantic search: structured, semi-structured, and full text, all in one. Business intelligence: many sources, ease of adding sources, no 6-month DW schema change cycle. E-science, often in life sciences: common interchange format, nano-publications, NLP extracts, different users cook their data differently, provenance.
The Hopes and Perceptions: The age of ad hoc. Find insight in any data, when you need it, from any source, any format. No data warehouse planning cycles; make your own from the pieces you need, when you need it. Still, data integration remains hard work; quality and coverage of sources vary. Flexibility may be there, but are performance and scalability on the level?
Yes, But ... Web and Big Data: Everybody reinvents the triple. Self-description, long-term identifiers, key-value pairs in many non-RDF use cases. SPARQL and RDF would be the natural, standards-compliant choice if they beat SQL, information retrieval, custom big data, key-value, and map-reduce solutions. Is this intrinsic to linked data or is this lack of engineering? Linked data has unique advantages in breadth of coverage and expressivity, but performance must not lag behind.
What is the RDF Tax? 90% of bad performance comes from non-optimal query plans. Some comes from indexing too much (e.g., SQL bulk load with no indices is 50x faster than the equivalent in RDF with all indexed). Some comes from string ops on URIs and literals. Some comes from having a join for every attribute. Vectoring and the right plans help, though.
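The "join for every attribute" cost can be illustrated with a toy comparison. This is a minimal sketch using Python's built-in sqlite3 module, not Virtuoso; the `person` table, the `triples` table, and the `ex:` identifiers are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Relational layout: one row per person, one column per attribute.
cur.execute(
    "CREATE TABLE person (id TEXT PRIMARY KEY, name TEXT, city TEXT, age INTEGER)"
)
cur.execute("INSERT INTO person VALUES ('ex:alice', 'Alice', 'Boston', 34)")

# Triple layout: one row per (subject, predicate, object) statement.
cur.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
cur.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:alice", "ex:name", "Alice"),
        ("ex:alice", "ex:city", "Boston"),
        ("ex:alice", "ex:age", "34"),
    ],
)

# Table: all three attributes come back from a single row lookup.
table_row = cur.execute(
    "SELECT name, city, age FROM person WHERE id = 'ex:alice'"
).fetchone()

# Triples: every additional attribute is another self-join on the subject.
triple_row = cur.execute(
    """
    SELECT n.o, c.o, CAST(a.o AS INTEGER)
    FROM triples AS n
    JOIN triples AS c ON c.s = n.s AND c.p = 'ex:city'
    JOIN triples AS a ON a.s = n.s AND a.p = 'ex:age'
    WHERE n.s = 'ex:alice' AND n.p = 'ex:name'
    """
).fetchone()

print(table_row, triple_row)
```

Both queries return the same answer, but the triple form gives the optimizer two extra joins to order and cost. With more attributes the number of candidate plans grows quickly, which is where the plan-quality risk on this slide comes from.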
The Bane of the Triple. When data is stored as triples: There is still structure, but it is harder to exploit. Schema re-emerges as correlations. More joins make for more possible query plans and bigger errors in plan cost estimation. More joining reduces locality. Lack of schema causes needless indexing; data takes more space. A URI for everything takes space and time. For the same workload, Virtuoso SQL can also be 2–20x faster than Virtuoso SPARQL.
The Question is Raised. LOD2 FP7, now ending: RDF performance parity with relational? SQL is the senior science. Who ignores history is bound to repeat it. Integral mastery of RDB science is a prerequisite, but do not forget the subtle twists of schema-less-ness.
Virtuoso RDF Relational DBMS Leadership. 2000–2006, v1.x–4.x: SQL row store with SQL federation and XML. 2007–2008, v5.x–6.x: SPARQL, adapted for RDF quads with more compression, bitmap indices, special data types, RDF awareness in query optimization. 2009, v6.x: Scale-out cluster-capable. 2010–2013, v7.x: Column store, vectored execution, 3x more space efficient, 10+x more speed. 2013: Star Schema benchmark with SPARQL, 100x MySQL SQL, 0.8x MonetDB SQL. 2014: Top-of-the-line SQL analytics, 500 Gtriples, Structure Awareness.
Triples Done Right, so? Column-store techniques are a good fit; index-based triple storage does not get much better. RAM-only pointer-based techniques can be faster but cost 10–100x more to scale up. To take RDF to SQL parity, Virtuoso must first be on the level with the best in SQL. TPC-H is the checklist for mastery of DW and query optimization; who survives shall not fear. Parity is achieved when running with triples, just like with tables.
Structure is Everywhere. CWI in LOD2: 90% of triples in Common Crawl fall into 20 tables. All relational extractions are 100% tables. Even DBpedia is 90% covered by 500 tables, but is unusually heterogeneous, albeit not very large.
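A coverage figure of this kind can be approximated by grouping subjects by the set of properties they use and asking how many triples the most common sets account for. The sketch below is a toy illustration under that assumption, not the CWI analysis; the function name `coverage_by_top_shapes` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def coverage_by_top_shapes(triples, k):
    """Fraction of triples whose subject has one of the k most common property sets."""
    props = defaultdict(set)   # subject -> set of predicates it uses
    count = Counter()          # subject -> number of triples about it
    for s, p, _o in triples:
        props[s].add(p)
        count[s] += 1

    # Weight each distinct property set ("shape") by the triples it accounts for.
    shape_weight = Counter()
    for s, pset in props.items():
        shape_weight[frozenset(pset)] += count[s]

    covered = sum(weight for _shape, weight in shape_weight.most_common(k))
    return covered / sum(count.values())  # assumes at least one triple


triples = [
    ("ex:a", "name", "A"), ("ex:a", "city", "X"),
    ("ex:b", "name", "B"), ("ex:b", "city", "Y"),
    ("ex:c", "name", "C"), ("ex:c", "city", "Z"),
    ("ex:d", "title", "D"),
]

print(coverage_by_top_shapes(triples, 1))  # share covered by the single most common shape
print(coverage_by_top_shapes(triples, 2))
```

In the sample, six of the seven triples belong to subjects shaped like (name, city), so one "table" already covers most of the data. Real data would need tolerance for near-matching shapes, which this exact-match grouping does not attempt.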
The Glorious Dawn: Structure is the Servant, not the Tyrant. A set of subjects with all the same single-valued properties is in fact a table. So, store it as a table. Allow exceptions, e.g., sometimes multiple values, different values in different graphs, extra properties, etc. If it is big, it has repeating structure. All RDF semantics are preserved; any triple is possible, but the common ones are SQL compact and SQL fast. With tables, query optimization returns to SQL complexity and is much more reliable. So, more tricks from the SQL analytics bag become safe and applicable.
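The idea of storing the regular part as a table while still allowing any triple can be sketched as follows. This is a simplified illustration, not Virtuoso's implementation: the function `split_table_and_exceptions`, the declared `columns`, and the sample data are assumptions, and named graphs are ignored.

```python
from collections import defaultdict


def split_table_and_exceptions(triples, columns):
    """Pivot triples into rows over `columns`; whatever does not fit stays a triple."""
    values = defaultdict(lambda: defaultdict(list))  # subject -> predicate -> objects
    for s, p, o in triples:
        values[s][p].append(o)

    rows, exceptions = {}, []
    for s, by_pred in values.items():
        # A subject gets a row only if it has a value for every declared column.
        fits = all(col in by_pred for col in columns)
        if fits:
            rows[s] = tuple(by_pred[col][0] for col in columns)
        for p, objs in by_pred.items():
            in_row = fits and p in columns
            # The first value of a declared column lives in the row; the rest overflow.
            for o in (objs[1:] if in_row else objs):
                exceptions.append((s, p, o))
    return rows, exceptions


triples = [
    ("ex:a", "name", "A"), ("ex:a", "city", "X"),
    ("ex:b", "name", "B"), ("ex:b", "city", "Y"), ("ex:b", "city", "Z"),  # extra value
    ("ex:c", "name", "C"), ("ex:c", "city", "W"), ("ex:c", "nick", "cee"),  # extra property
    ("ex:d", "title", "D"),  # different shape altogether
]

rows, exceptions = split_table_and_exceptions(triples, ("name", "city"))
print(rows)
print(exceptions)
```

No triple is lost: each one ends up either as a cell in a row or in the exception list, so the original set can be rebuilt from the two parts. The common case gets the compact row form, and the irregular remainder keeps the generic triple form.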
Gains from Structure Awareness: 3+x load speed. 2x more space efficiency. SPARQL queries against regular data within 10–20% of SQL speeds. Just declare which properties tend to occur together; no strict schema-first like with SQL. Later, self-configuration.
The Cycle of Adventure. Rebels: SQL not cool, too rigid, drop ACID, go key-value, map-reduce, the triple is all there is, semantic web. Pioneers: Life on the frontier is hard, infrastructure missing or bad. Same everyday problems also in Utopia. Recognizing the objective values, e.g., schema freedom and identifiers, no AI. Do the job, forget dogma. Reconciliation: schema-first and schema-last converge in structure awareness.
Present FP7 Research. LDBC: Transparency and Relevance for Graph DB, RDF performance. GeoKnow: GeoData is everywhere, how to carry the planet in your pocket. LOD2: Where no triple has gone before (and come back). Open PHACTs: A Data Platform for Drug Discovery.
LDBC - Linked Data Benchmark Council. Rebels: SQL not cool, too rigid, drop ACID, go key-value, map-reduce, the triple is all there is, semantic web. Pioneers: Life on the frontier is hard, infrastructure missing or bad. Same everyday problems also in Utopia. Recognizing the objective values, e.g., schema freedom and identifiers, no AI. Do the job, forget dogma. Reconciliation: Some of the rebel thinking becomes mainstream, e.g., schema-first and schema-last converge in structure awareness.
LDBC, Independent Industry Forum for Benchmarking. The TPC for the frontiers of database. Bootstrapped in the LDBC FP7, continues as independent industry association. OpenLink, Ontotext, Neo Technologies, Sparsity as founding members. IBM, Oracle Labs, Systap, SPARQL City already joined. DB superstars Peter Boncz and Thomas Neumann as founders and scientific lead.
LDBC Benchmarks. Social Network: Online (lookups, updates, analysis of social environment); Business Intelligence (spotting trends, key players, big query); Graph analytics (community detection, PageRank, graph metrics). Semantic Publishing: modeled after the BBC linked data portal; online lookups, drill-downs, and updates.
GeoKnow - The Planet in your Pocket. Ms. Globe and Mr. Cube have a thing going on: Mr. Cube: Desiloization ... integrated metadata ... Explicit semantics. Ms. Globe: I can feel it ... but are you man enough? ... you need to show me.
Planet Scale Roadmap. Jan 2014: Virtuoso SPARQL outperforms PostGIS in map lookups with planet-wide Open Street Map. Virtuoso SQL adds 5x more power.
Next: Jan 2015. Parity between SPARQL and SQL via structure awareness. Geospatial data clustering. Graph analytics close to the data: Pregel, Giraph, etc., in the DB itself. Adding fine-grained geo dimension to LDBC social network benchmark.
The LOD2 scaling adventures. Experiments at CWI’s Scilens cluster. Jan 2013: 150 Gtriples (8 x 256GB RAM). Aug 2014: 500 Gtriples (12 x 256GB RAM). Some trillion-triple claims exist, but do not detail any query workload. BSBM explore and BI workloads. 10x speed gains for BI queries between 2013 and 2014. Bulk load at 6M triples/s. All done in triples; structure awareness will go further still.
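For a sense of scale, the figures on this slide can be turned into back-of-the-envelope ratios. The sketch below assumes "GB" means GiB and that the 6M triples/s rate could be sustained for a whole load; neither is stated in the slides, and aggregate RAM per triple is not the same as the storage footprint.

```python
GIB = 2**30  # assumption: the slide's "GB" is read as GiB


def ram_bytes_per_triple(nodes, gib_per_node, triples):
    """Aggregate cluster RAM divided by the number of triples held."""
    return nodes * gib_per_node * GIB / triples


jan_2013 = ram_bytes_per_triple(8, 256, 150e9)    # roughly 14.7 bytes of RAM per triple
aug_2014 = ram_bytes_per_triple(12, 256, 500e9)   # roughly 6.6 bytes of RAM per triple

# Hypothetical: time to load 500 Gtriples if 6M triples/s were sustained throughout.
load_hours = 500e9 / 6e6 / 3600                   # roughly 23 hours

print(round(jan_2013, 1), round(aug_2014, 1), round(load_hours, 1))
```

Under these assumptions the 2014 configuration holds more than twice as many triples per byte of cluster RAM as the 2013 one.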
Open PHACTs Partners:
Virtuoso Now. Snapshot of RDF Linked Data customers in the Enterprise: Data.Gov (U.S. Govt. Open Linked Data initiative), Bank of America, Booz Allen Hamilton, Northrop Grumman, Elsevier, French National Library, Samsung, Globo, Daimler Benz, Johnson & Johnson, Bayer, St Jude's Medical, Fujitsu, Syngenta, and many more.
Virtuoso Availability: Most capabilities as open source. Commercial adds:
Replication (SQL & RDF)
Advanced RDF security; ABAC & RBAC (ACLs)
and more.
Up-to-the-minute tech previews via v7fasttrack on GitHub, e.g., superfast TPC-H implementation.
Virtuoso Future: Preview of structure-aware RDF store in fall 2014 via v7fasttrack. Integrated graph analytics framework. Embed complex graph algorithms, e.g., community detection, shortest path, inside SPARQL/SQL. Comparison of SQL and SPARQL for big data analytics.
Linked Data Now: Adoption across major industries. Superior flexibility and time to solution. Dramatic performance gains in the last 5 years. Benchmarking will continue to drive progress, to the benefit of users and vendors alike. Run circles around most open source SQL in SPARQL: Virtuoso SPARQL beats MySQL in SSB by 100x. With structure awareness, SPARQL to match the best in SQL for data warehousing, OLTP. Linked Data no longer a long shot but a technology that makes sense.
About OpenLink Software. OpenLink Software is a privately-held company founded in 1992 by its President & CEO, Kingsley Idehen. The company is an industry-acclaimed technology innovator in the following areas: ODBC, JDBC, ADO.NET, and OLE DB compliant Data Access Drivers for Oracle, Microsoft SQL Server, Informix, Ingres, Sybase, Progress, MySQL, and PostgreSQL; High-Performance & Scalable Multi-Model (Relational & Graph) Database Technology; Data Integration Middleware (Data Virtualization Technology across a wide variety of Protocols & Formats); Socially-enhanced Distributed Collaborative Applications Platforms (Weblogs, Wikis, Feed Aggregation and Syndication, Web File Systems, Discussion Forums, etc.); Web Application Server Technology; Linked Data Deployment & Management; Identity Management.
Office Locations
USA: OpenLink Software, Inc, 10 Burlington Mall Road, Suite 265, Burlington, MA 01803. Tel.: +1 781 273 0900. Fax: +1 781 229 8030.
UK: OpenLink Software Ltd., Airport House, Purley Way, Croydon, Surrey CR0 0XZ. Tel.: +44 (0)20 8681 7701. Fax: +44 (0)20 8681 7702.