Introduction to Column-Oriented Databases Seminar: Columnar Databases, Nov 2012, Univ. Helsinki.

Slides:



Advertisements
Similar presentations
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Slide
Advertisements

C-Store: Self-Organizing Tuple Reconstruction Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Apr. 17, 2009.
CS 540 Database Management Systems
Clydesdale: Structured Data Processing on MapReduce Jackie.
Advanced Database Systems September 2013 Dr. Fatemeh Ahmadi-Abkenari 1.
C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009.
6.814/6.830 Lecture 8 Memory Management. Column Representation Reduces Scan Time Idea: Store each column in a separate file GM AAPL.
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
Query Execution Professor: Dr T.Y. Lin Prepared by, Mudra Patel Class id: 113.
Query Execution Professor: Dr T.Y. Lin Prepared by, Mudra Patel Class id: 113.
1  Simple Nested Loops Join:  Block Nested Loops Join  Index Nested Loops Join  Sort Merge Join  Hash Join  Hybrid Hash Join Evaluation of Relational.
Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.
Query Execution Professor: Dr T.Y. Lin Prepared by, Mudra Patel Class id: 113.
Inspector Joins IC-65 Advances in Data Management Systems 1 Inspector Joins By Shimin Chen, Anastassia Ailamaki, Phillip, and Todd C. Mowry VLDB 2005 Rammohan.
1 External Sorting for Query Processing Yanlei Diao UMass Amherst Feb 27, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
PARALLEL DBMS VS MAP REDUCE “MapReduce and parallel DBMSs: friends or foes?” Stonebraker, Daniel Abadi, David J Dewitt et al.
Lecture 6 Indexing Part 2 Column Stores. Indexes Recap Heap FileBitmapHash FileB+Tree InsertO(1) O( log B n ) DeleteO(P)O(1) O( log B n ) Range Scan O(P)--
Cloud Computing Lecture Column Store – alternative organization for big relational data.
C-Store: Column Stores over Solid State Drives Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Jun 19, 2009.
C-Store: A Column-oriented DBMS Speaker: Zhu Xinjie Supervisor: Ben Kao.
Practical Database Design and Tuning. Outline  Practical Database Design and Tuning Physical Database Design in Relational Databases An Overview of Database.
CSC271 Database Systems Lecture # 30.
1 © Prentice Hall, 2002 Physical Database Design Dr. Bijoy Bordoloi.
© Stavros Harizopoulos 2006 Performance Tradeoffs in Read-Optimized Databases Stavros Harizopoulos MIT CSAIL joint work with: Velen Liang, Daniel Abadi,
CS 345: Topics in Data Warehousing Tuesday, October 19, 2004.
C-Store: Column-Oriented Data Warehousing Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May 17, 2010.
DANIEL J. ABADI, ADAM MARCUS, SAMUEL R. MADDEN, AND KATE HOLLENBACH THE VLDB JOURNAL. SW-Store: a vertically partitioned DBMS for Semantic Web data.
Performance Tradeoffs in Read-Optimized Databases Stavros Harizopoulos * MIT CSAIL joint work with: Velen Liang, Daniel Abadi, and Sam Madden massachusetts.
© Stavros Harizopoulos 2006 Performance Tradeoffs in Read- Optimized Databases: from a Data Layout Perspective Stavros Harizopoulos MIT CSAIL Modified.
Chapter 6 1 © Prentice Hall, 2002 The Physical Design Stage of SDLC (figures 2.4, 2.5 revisited) Project Identification and Selection Project Initiation.
March 16 & 21, Csci 2111: Data and File Structures Week 9, Lectures 1 & 2 Indexed Sequential File Access and Prefix B+ Trees.
Column-Stores vs. Row-Stores How Different are they Really? Daniel J. Abadi, Samuel Madden, and Nabil Hachem, SIGMOD 2008 Presented By, Paresh Modak( )
Daniel J. Abadi · Adam Marcus · Samuel R. Madden ·Kate Hollenbach Presenter: Vishnu Prathish Date: Oct 1 st 2013 CS 848 – Information Integration on the.
1 C-Store: A Column-oriented DBMS By New England Database Group.
C-Store: How Different are Column-Stores and Row-Stores? Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 8, 2009.
C-Store: Concurrency Control and Recovery Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Jun. 5, 2009.
Column Oriented Database Vs Row Oriented Databases By Rakesh Venkat.
C-Store: Tuple Reconstruction Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 27, 2009.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13: Ramakrishnan & Gherke and Chapter 2.3: Garcia-Molina et al.
C-Store: Data Model and Data Organization Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May 17, 2010.
C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009.
C-Store: RDF Data Management Using Column Stores Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Apr. 24, 2009.
Building a Distributed Full-Text Index for the Web by Sergey Melnik, Sriram Raghavan, Beverly Yang and Hector Garcia-Molina from Stanford University Presented.
Buffer-pool aware Query Optimization Ravishankar Ramamurthy David DeWitt University of Wisconsin, Madison.
CS4432: Database Systems II Query Processing- Part 2.
Variant Indexes. Specialized Indexes? Data warehouses are large databases with data integrated from many independent sources. Queries are often complex.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Chapter 4 Logical & Physical Database Design
M.Kersten MonetDB, Cracking and recycling Martin Kersten CWI Amsterdam.
Chapter 5 Index and Clustering
Query Optimization CMPE 226 Database Systems By, Arjun Gangisetty
Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data can be stored.
CS 440 Database Management Systems Lecture 5: Query Processing 1.
Query Processing – Implementing Set Operations and Joins Chap. 19.
CS 540 Database Management Systems
Query Execution Query compiler Execution engine Index/record mgr. Buffer manager Storage manager storage User/ Application Query update Query execution.
Column Oriented Database By: Deepak Sood Garima Chhikara Neha Rani Vijayita Gumber.
Column-Stores vs. Row-Stores How Different are they Really? Daniel J. Abadi, Samuel Madden, and Nabil Hachem, SIGMOD 2008: Talk by Karthik Ramachandra,
Practical Database Design and Tuning
CS 540 Database Management Systems
CS 440 Database Management Systems
CPT-S 415 Big Data Yinghui Wu EME B45 1.
Chapter 12: Query Processing
Chapter 15 QUERY EXECUTION.
Physical Database Design
Selected Topics: External Sorting, Join Algorithms, …
(A Research Proposal for Optimizing DBMS on CMP)
Column-Stores vs. Row-Stores: How Different Are They Really?
Four Rules For Columnstore Query Performance
OLAP Query Performance in Column-Oriented Databases
Presentation transcript:

Introduction to Column-Oriented Databases Seminar: Columnar Databases, Nov 2012, Univ. Helsinki

Content Introduction Row-Oriented Execution Column-Oriented Execution Experiments Conclusion 2

Introduction Column-oriented database: Each column is stored contiguously on a separate location on a disk. Column-stores ideas begins in late 70’s. MonetDB[1] and C-store[2] has been intoduced in 2000’s. Star Schema Benchmark (SSBM)[3] has been implemented with column-stores as possible. 3

Row-Oriented Execution The implementation of column-stores in a row-stored based system. Three techniques are being introduced: 1.Vertical Partitioning 2.Index-Only Plans 3.Materizalized Views 4

Vertical Partitioning Process: Full vertical partitioning of each relation Each column = 1 physical table This can be achieved by adding integer position column to every table Join on Position for multi column fetch Problems: ‘Position’ - Space and disk bandwidth Header for every tuple – further space waste 5

Index-Only Plans 6 Process: Add B+Tree index for every table.column Build list of (record-id, value) pairs satisfying predicates on each table Merge the lists in memory when there are multiple predicates on the same table Problem: Separate indices may require full index scan, which is slower Eg: SELECT AVG(salary) FROM emp WHERE age > 40 Composite index with (age, salary) key helps

Materialized views Process: Create ‘optimal’ set of MVs for given query workload Objective: Provide just the required data Avoid overheads Performs better Expected to perform better than other two approaches Problem: Practical only in limited situation Require knowledge of query workloads in advance 7

Column-Oriented Execution Optimization approaches to improve the performance of column-stores. Four techniques are being introduced: 1.Compression 2.Late Materizalization 3.Block Iteration 4.Invisible Join[4] (a new technique) 8

Compression Low information entropy (high data value locality) leads to high compression ratio. Advantages Disk space is saved. Less I/O from disk to memory (or from memory to CPU) Performance can be further improved if we can perform operation directly on compressed data. Light weight compression schemes do better. 9

Late Materialization Most query results entity-at-a-time not column-at-a-time At some point of time multiple column must be combined Performance can be improved by using late-materialization Keep the data in columns until much later in the query plan, operating directly on these columns. Eg: SELECT R.a FROM R WHERE R.c = 1 AND R.b = 7 Output of each predicate is a bit string Perform Bitwise AND Use final position list to extract R.a Advantages: Unnecessary construction of tuple is avoided, direct operation on compressed data 10

Block Iteration Operators operate on blocks of tuples at once. Iterate over blocks rather than tuples Like batch processing Block of values from the same columns are sent to an operator in a single function call. If column is fixed width, it can be operated as an array. Minimizes per-tuple overhead Exploits potential for parallelism 11

Invisible Join - SSBM tables 12

Invisible Join- A query Find total revenue from Asian customers who purchase a product supplied by an Asian supplier between 1992 and 1997 grouped by nation of the customer, supplier and year of transaction 13

Invisible Join – Cont’d Traditional plan for this type of query is to pipeline join in order of predicate selectivity Alternate plan is late materialized join technique But both have disadvantages Traditional plan lacks all the advantages described previously of late materialization In the late materialized join technique group by columns need to be extracted in out-of-position order 14

Invisible Join – Cont’d Invisible join is a late materialized join but minimize the values that need to be extracted out of order Invisible join Rewrite joins into predicates on the foreign key columns in the fact table. These predicates can be evaluated by hash-lookup. 15

Invisible Join - The first phase of Invisible Join 16

Invisible Join- The second phase of Invisible Join 17

Invisible Join- The third phase of Invisible Join 18

Experiments [4] Column-store simulation in a row-store to find out whether it is possible for a row-store to obtain the benefits of column-oriented design. Column-store performance after removing the optimizations to find out which optimizations are most significant. 19

Experiments - Environment 2.8 GHz Dual Core Pentium(R) workstation 3 GB RAM RHEL 5 4 disk array mapped as a single logical volume Reported numbers are average of several runs Warm buffer (30% improvement for both systems) Star Schema Benchmark (SSBM) with the fact table (17 columns, 60,000,000 rows) and 4 dimension tables (largest one: 80,000 rows) 20

RS = System X (row-store)CS = C-store (column-store) RS (MV) = System X materialized viewCS (Row-MV) = Row-MV in C-Store Baseline performance of C-store and System X 21

Baseline performance Results From the graph we can see C-Store performs better than System X by a Factor of six in the base case Factor of three when System X use materialized view However CS (Row-MV) (row-oriented materialized view inside C-Store) performs worse than RS (MV) System X provide advance performance feature C-Store has multiple known performance bottlenecks C-Store doesn't support partitioning, multithreading 22

Column-store simulation in a row-store Five different configurations: 1.Traditional row-oriented representation with bitmap 2.Traditional (bitmap): Biased to use bitmaps; might be inferior sometimes 3.Vertical Partitioning: Each column is a relation 4.Index-Only: B+Tree on each columns 5.Materialized Views: Optimal set of views for every query 23

Average performance across all the queries T = Traditional T(B) = Traditional Bitmap MV = Materialized View VP = Vertical Partitioning AI = All indexes 24

Column-store simulation in a row-store - Results MV performs best since they read minimal amount of data needed by a query. Index only plans are the worst: Expensive column joins on fact table Unable to use merge join Vertical partitioning: Tuple overheads and reconstruction LineOrder Table – 60 million tuples, 17 columns Compressed data 25

Column-store performance Column-store performs better than the best case of row-store (4.0sec sec) Approach: Start with column-store (C-Store) Remove column-store-specific performance optimizations which are compression, block processing, late materialization, invisible join. End up with column-store having a row-oriented query executer 26

Column-store performance - Average performance numbers for C-Store across all queries while various optimizations removed T= tuple-at-a-time processing t= block processing I=invisible join enabled i= disabled C= compression enabled c= disabled L= late materialization enabled l= disabled 27

Column-store performance - Results Block processing improves the performance by a factor of 5% to 50% depending on the compression. Compression improves the performance by almost a factor of 2. Late materialization improves performance by almost a factor of 3 because of the selective predicates. Invisible join improves the performance by 50-75%. The most significant optimizations are compression and late materialization. After all the optimizations are removed, the column store acts just like a row store. Column-stores are faster than simulated-column stores. 28

Conclusion To emulate column-stores in row-stores, techniques like Vertical portioning Index only plan does not yield good performance for the reasons of Tuple reconstruction costs Per tuple overheads. Reasons why column-stores have good performance Late materialization Compression Block iteration Invisible join 29

Open Question Building a complete row-store that can transform into a column-store on workloads where column-stores perform well. 30

References [1] S. Idreos, F. Groffen, N. Nes, S. Manegold, S. Mullender, M. Kersten. MonetDB: Two Decades of Research in Column-oriented Database Artitectures [2] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. R. Madden, E. J. O’Neil, P. E. O’Neil, A. Rasin, N. Tran, S. B. Zdonik. C-Store: A Column-Oriented DBMS. In VLDB, pages 553–564, 2005 [3] E. O’Neil, X. Chen, E. J. O’Neil. Adjoined Dimension Column Index (ADC Index) to Improve Star Schema Query Performance. In ICDE, 2008 and [4] D.J. Abadi, S.R. Madden, N. Hachem. Column-stores vs. row-stores: how different are they really? In Proc. SIGMOD,

Thanks for your attention, Any comments, questions? Seminar: Columnar Databases, Nov 2012, Univ. Helsinki