1 © 2012 OpenLink Software, All rights reserved. Virtuoso - Column Store, Adaptive Techniques for RDF Orri Erling Program Manager, Virtuoso Openlink Software.

Presentation transcript:

1. Virtuoso - Column Store, Adaptive Techniques for RDF
Orri Erling, Program Manager, Virtuoso, OpenLink Software

2. Flexible Big Data
- Data grows in volume and heterogeneity
- Schema last is great, if the price is right
- RDF and graphs promise powerful querying with the flexibility and scale of NoSQL key-value stores
- Inference may be good for integration, if it can express the right things, beyond OWL
- RDF technology must learn the lessons of databases; everything applies

3. Virtuoso Column Store Edition
- SQL and SPARQL
- Compressed column store, vectored execution
- Shared-nothing scale-out
- Powerful procedure language with parallel, distributed control structures
- Full-text and geospatial indexes

4. Storage
- Freely mix column-wise and row-wise indices
- All SQL and RDF data types natively supported; single execution engine for SQL/SPARQL
- Column compression is 3x more space efficient than row-wise compression for RDF
- Column stores are not only for big scans: random access surpasses rows as soon as there is some locality
- 9 B/quad with DBpedia, 7 B/quad with BSBM or RDF-H, 14 B/quad with web crawls (indices PSOG, POSG, SP, OP, GS; excluding literals)
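The slides do not say which encodings the column store uses. As a rough illustration of why a sorted, column-wise quad index can compress well, the sketch below run-length encodes the columns of a tiny PSOG-ordered index. The quad IDs and the choice of run-length encoding are assumptions made for the example, not a description of Virtuoso's actual format.

```python
from itertools import groupby

def run_length_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

# Hypothetical quads as (P, S, O, G) tuples of small integer IDs.
quads = [
    (1, 10, 100, 7),
    (2, 10, 200, 7),
    (1, 11, 101, 7),
    (1, 12, 100, 7),
    (2, 11, 201, 7),
]

# Sorting in index order (P, S, O, G) places equal predicates next to each other.
index_rows = sorted(quads)

# Column-wise layout: one list per column instead of one tuple per row.
p_col, s_col, o_col, g_col = (list(col) for col in zip(*index_rows))

p_runs = run_length_encode(p_col)  # five P values collapse into two runs
g_runs = run_length_encode(g_col)  # a single graph collapses into one run
```

In a row-wise layout the predicate and graph IDs would be repeated in every row; stored as columns in index order, they shrink to a handful of runs.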

5. Execution Engine
- Vectoring is not only for column stores
- Vectoring turns random access into a linear merge join if there is any locality: always a win, though mileage depends on run-time factors
- Vectoring eliminates interpretation overhead and makes CPU-friendly code possible
- Even with run-time data typing, vectoring allows use of type-specific operators on homogeneous data, e.g. arithmetic
- Vector size is adjusted dynamically: a larger vector may not fit in cache but gets better locality for random access
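One way to picture how vectoring turns random access into a merge join: instead of probing the index once per key, collect a vector of keys, sort it, and walk the sorted index with a single forward-moving cursor. The sketch below assumes the index is an in-memory sorted list of `(key, value)` pairs; the function name and data are hypothetical.

```python
def vectored_lookup(sorted_index, keys):
    """Look up a batch of keys in a sorted (key, value) list in one forward pass.

    Results come back in the order of the input keys; missing keys give None.
    """
    # Sort the batch, remembering each key's original position.
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    results = [None] * len(keys)
    pos = 0  # cursor into sorted_index; it only ever moves forward
    for i in order:
        key = keys[i]
        while pos < len(sorted_index) and sorted_index[pos][0] < key:
            pos += 1
        if pos < len(sorted_index) and sorted_index[pos][0] == key:
            results[i] = sorted_index[pos][1]
    return results

index = [(1, "a"), (3, "c"), (5, "e"), (8, "h")]
batch = vectored_lookup(index, [5, 1, 4, 5, 8])
```

The denser the batch is relative to the index, the more the access pattern resembles a sequential merge, which is where the locality benefit comes from.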

6. Graph Operations
- Run-time computation plus caching instead of materialization
- SPARQL/SQL extension for arbitrary transitive subqueries: flexible options for returning shortest paths, all paths, all/distinct reachable, attributes of steps on paths, etc.
- Efficient execution: searches the graph from both ends when looking for a path with both ends given
- Query operators for RDF hierarchy traversal
- Special query operator for OWL sameAs and IFP-based identity
- OWL sameAs / IFP identity is taken into account for DISTINCT / GROUP BY
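Searching from both ends can be illustrated with a bidirectional breadth-first search. This is a generic textbook sketch over an in-memory edge list, not Virtuoso's transitive operator; the function name and edge data are made up for the example.

```python
from collections import deque

def shortest_path(edges, start, goal):
    """Bidirectional BFS over directed edges; returns a shortest path or None."""
    if start == goal:
        return [start]
    forward, backward = {}, {}
    for src, dst in edges:
        forward.setdefault(src, []).append(dst)
        backward.setdefault(dst, []).append(src)

    # The parent maps double as the visited sets of each search direction.
    parents_f, parents_b = {start: None}, {goal: None}
    frontier_f, frontier_b = deque([start]), deque([goal])

    def expand(frontier, adjacency, parents, other_parents):
        """Expand one BFS level; return the node where the searches meet, if any."""
        for _ in range(len(frontier)):
            node = frontier.popleft()
            for nxt in adjacency.get(node, []):
                if nxt not in parents:
                    parents[nxt] = node
                    if nxt in other_parents:
                        return nxt
                    frontier.append(nxt)
        return None

    while frontier_f and frontier_b:
        # Always grow the smaller frontier to keep both searches shallow.
        if len(frontier_f) <= len(frontier_b):
            meet = expand(frontier_f, forward, parents_f, parents_b)
        else:
            meet = expand(frontier_b, backward, parents_b, parents_f)
        if meet is not None:
            path, node = [], meet
            while node is not None:          # walk back to start
                path.append(node)
                node = parents_f[node]
            path.reverse()
            node = parents_b[meet]
            while node is not None:          # walk on to goal
                path.append(node)
                node = parents_b[node]
            return path
    return None

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e"), ("e", "d"), ("d", "f")]
path = shortest_path(edges, "a", "f")
```

Each side only has to explore roughly half the path depth, which matters when fanout is high.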

7. Query Optimization Challenges
- Typical SQL stats do not help
- Need to measure data cardinalities starting from constants in the query
- Need to sample fanout predicate by predicate, as needed
- Predicate and class hierarchies are easy to handle in sampling
- sameAs or IFP inference voids all guesses
- Is hash join worthwhile? High setup cost means that one must be sure of cardinalities first

8. Deep Sampling
- Everything is a join, so sampling must also do joins
- As the candidate plan grows, the cost model executes all the ops on a sample of the data
- Actual cardinality and locality are known, even when search conditions are correlated
- With high confidence in the cost model, hash join plans become safe and attractive
- Even though there is an indexed access path for everything, a scan can be better because it produces results in order; need to be sure of selectivity before taking the risk
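The idea of running the join itself on a sample can be sketched for a two-pattern plan such as `?s <first_pred> ?o . ?o <second_pred> ?o2`. The code below is a simplified illustration under assumptions: triples live in a Python list, the sample is every n-th match rather than a random draw (to keep the example deterministic), and a dictionary stands in for an index lookup. All names are hypothetical.

```python
def estimate_join_cardinality(triples, first_pred, second_pred, sample_every=2):
    """Estimate the result count of joining first_pred's objects to second_pred's subjects.

    Executes the real join on every `sample_every`-th match of the first
    pattern and scales the observed fanout up to the full match count.
    """
    # Stand-in for an index on (predicate, subject); a real engine would probe its index.
    by_subject = {}
    for s, p, o in triples:
        if p == second_pred:
            by_subject.setdefault(s, []).append(o)

    # Start from the constant in the query: the first predicate.
    first_matches = [(s, o) for s, p, o in triples if p == first_pred]
    sample = first_matches[::sample_every]
    if not sample:
        return 0.0

    joined = sum(len(by_subject.get(o, [])) for _, o in sample)
    fanout = joined / len(sample)        # measured, not guessed
    return fanout * len(first_matches)

triples = [
    ("a", "knows", "b"), ("a", "knows", "c"),
    ("b", "knows", "c"), ("c", "knows", "d"),
    ("b", "name", "B"), ("c", "name", "C"),
    ("c", "name", "Cee"), ("d", "name", "D"),
]
estimate = estimate_join_cardinality(triples, "knows", "name")
```

Because the fanout is measured on actual data, correlations between the two patterns show up in the estimate instead of being assumed away.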

9. Elastic Cluster
- Data is partitioned by key; different indices may have different partition keys
- Partitions may split and migrate between servers
- Partitions may be kept in duplicate for fault tolerance/load balancing
- Actual access stats drive partition split and placement
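To make "different indices may have different partition keys" concrete, the sketch below places one quad in two indices, each hashed on a different column. The choice of partition columns (subject for PSOG, object for POSG), the hash function, and the partition count are all assumptions for illustration; the slides do not specify them.

```python
import zlib

def partition_of(value, n_partitions):
    """Stable hash partitioning; crc32 is an arbitrary choice for this sketch."""
    return zlib.crc32(str(value).encode("utf-8")) % n_partitions

def place_quad(quad, n_partitions):
    """Return {index_name: partition_number} for one (p, s, o, g) quad."""
    p, s, o, g = quad
    return {
        "PSOG": partition_of(s, n_partitions),  # assumed: partitioned on subject
        "POSG": partition_of(o, n_partitions),  # assumed: partitioned on object
    }

placement = place_quad(("knows", "alice", "bob", "g1"), n_partitions=4)
```

The same quad can therefore live on different servers depending on which index is consulted, which is why a lookup has to be routed per index rather than per quad.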

10. Optimizing for Cluster
- Vectored execution is natural in a cluster since single-tuple messages are not an option
- Keep max ops in flight at all times, always send long messages
- Fully distributed query coordination:
  - Any node can service a client request. Correlated subqueries and stored procedures may execute anywhere, with arbitrary parallelism and recursion between partitions
  - On a single shared-memory box, cluster is approximately even with single-process multithreading; low overhead
  - Distributed stored procedures: send the proc to the data, as in map-reduce, except that there are no limits on cross-partition calling/recursion
  - Choice of transactional and auto-commit update semantics; can have atomic ops without a global transaction
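The preference for long messages over single-tuple messages can be sketched as grouping a vector of lookup keys by destination partition, so that each partition receives one batched request. The hash function, partition count, and function names below are assumptions for the example, not Virtuoso's wire protocol.

```python
import zlib
from collections import defaultdict

def partition_of(key, n_partitions):
    """Stable hash partitioning; crc32 is an arbitrary choice for this sketch."""
    return zlib.crc32(str(key).encode("utf-8")) % n_partitions

def batch_by_partition(keys, n_partitions):
    """Group a vector of keys into one message (list) per destination partition."""
    messages = defaultdict(list)
    for key in keys:
        messages[partition_of(key, n_partitions)].append(key)
    return dict(messages)

keys = list(range(1000))
messages = batch_by_partition(keys, n_partitions=4)
```

With a vector of 1000 keys and 4 partitions, at most 4 messages go out instead of 1000, and each receiving partition can itself process its batch in vectored fashion.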

11. LOD Cache
- 55 billion triples in LOD cache, only 384 GB of RAM, 2 TB disk
- Most of Linked Open Data and web crawls

12. Future Work
- Complete deep sampling: no more bad query plans
- Caching and recycling of intermediate results, especially inference and partial plans
- Automatic cluster sizing and load redistribution
- Automatic balancing of storage between disk and SSD
- Run TPC-H and TPC-DS in SQL and their 1:1 translation in SPARQL, demonstrating SPARQL performance as near to SQL as possible

© 2012 OpenLink Software, All rights reserved. Making Technology Work For You openlinksw.com/virtuoso