C-Store: How Different are Column-Stores and Row-Stores? Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 8, 2009.



Row and Column Stores

Limitation of Current Comparisons
- Comparisons usually assume that a row store does not use any column-oriented physical design.
- Columns of the same logical tuple are traditionally stored together.
- In fact, it is possible to simulate a column store in a row store.

Simulating a Column Store in a Row Store
- Vertical Partitioning
- Indexing Every Column
- Materialized Views
- C-Tables: Vertical Partitioning + Run-Length Encoding

Simulation 1: Vertical Partitioning
- Each column becomes its own table: N columns result in N tables, and each table stores (tuple-id, attr) pairs.
- Tuples then need to be reconstructed.
  - In C-Store, columns are sorted in the same order.
  - In MonetDB, base columns are kept in insertion order; updates only happen to cracker columns.
  - In this "simulation" scenario, an integer "tuple-id" column is associated with each column.
- A hash join on the "tuple-id" column is used to reconstruct tuples.
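
The partitioning and reconstruction steps above can be sketched in a few lines of Python. This is only an illustration under assumed names: the (name, age, salary) schema and the functions `vertically_partition` and `hash_join_on_tuple_id` are hypothetical, and the tuple-id is simply the row's position in the input.

```python
from typing import Any, Dict, List, Sequence, Tuple

def vertically_partition(rows: Sequence[tuple],
                         column_names: Sequence[str]) -> Dict[str, List[Tuple[int, Any]]]:
    """Split rows into one (tuple_id, value) table per column."""
    tables: Dict[str, List[Tuple[int, Any]]] = {name: [] for name in column_names}
    for tuple_id, row in enumerate(rows):
        for name, value in zip(column_names, row):
            tables[name].append((tuple_id, value))
    return tables

def hash_join_on_tuple_id(left: List[Tuple[int, Any]],
                          right: List[Tuple[int, Any]]) -> List[Tuple[int, Any, Any]]:
    """Join two single-column tables on tuple_id using a hash table."""
    build = {tuple_id: value for tuple_id, value in right}   # build side
    return [(tuple_id, value, build[tuple_id])               # probe side
            for tuple_id, value in left if tuple_id in build]

# Hypothetical data: (name, age, salary)
rows = [("ann", 45, 100), ("bob", 30, 80), ("cy", 52, 120)]
tables = vertically_partition(rows, ["name", "age", "salary"])

# Filter the age table first, then reconstruct (tuple_id, age, salary).
older = [(tid, age) for tid, age in tables["age"] if age > 40]
joined = hash_join_on_tuple_id(older, tables["salary"])
```

Every extra column a query touches costs one more join of this kind, which is the reconstruction overhead the slide refers to.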

Two Problems of Vertical Partitioning
1. It requires the "tuple-id" column to be explicitly stored with each column, which wastes space and disk bandwidth.
2. Most row stores store a relatively large header on every tuple, which wastes further space.

Tuple Header
- A tuple header provides metadata about the tuple.
- For example, in Postgres, each tuple contains a 27-byte header that includes information such as:
  - Insert transaction timestamp
  - Number of attributes in the tuple
  - NULL flags
  - Length of the tuple header
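
To get a feel for the two sources of overhead, here is a back-of-the-envelope calculation. The 27-byte header is the figure quoted on the slide; the 4-byte tuple-id and 4-byte attribute value are assumptions made only for this example, and alignment padding is ignored.

```python
HEADER_BYTES = 27     # per-tuple header size quoted on the slide for Postgres
TUPLE_ID_BYTES = 4    # assumption: 4-byte integer tuple-id
VALUE_BYTES = 4       # assumption: 4-byte integer attribute value

def stored_bytes_per_value(header: int = HEADER_BYTES,
                           tuple_id: int = TUPLE_ID_BYTES,
                           value: int = VALUE_BYTES) -> int:
    """Bytes stored for one attribute value in a vertically partitioned table."""
    return header + tuple_id + value

stored = stored_bytes_per_value()   # 27 + 4 + 4 bytes per (tuple-id, attr) row
blowup = stored / VALUE_BYTES       # stored bytes per byte of actual data
```

Under these assumptions each 4-byte value occupies 35 bytes on disk, so the row store reads several times more data than a column store that keeps only the values.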

Tuple Header in a Column Store
- A column store puts tuple headers in separate columns.
- Some information in the tuple header can be removed.
  - For example, in MonetDB, each logical column stores (key, attr) pairs, so the number of attributes is always 2.
  - There is also no need for NULL flags: a pair whose attribute value is NULL can simply be left out.

Simulation 2: Indexing Every Column
- Base relations are stored using a standard, row-oriented physical design.
- An additional secondary B-Tree index is built on every column of every table.
- Queries are answered by reading values directly from the indexes, and columns are joined on tuple-id.

One Problem of Indexing Every Column
- If a column has no predicate on it, this approach requires its whole index to be scanned to extract the needed values.
- One optimization is to create indexes with composite keys.
  - Example: SELECT AVG(salary) FROM emp WHERE age > 40;
  - With a composite index on the (age, salary) key, this query can be answered directly from the index.
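
A minimal sketch of why a composite (age, salary) index is enough for the example query. A sorted Python list stands in for the B-Tree leaf level; the `emp` rows and the function name are hypothetical, and a real index would also carry tuple-ids.

```python
import bisect
from typing import List, Optional, Tuple

# Hypothetical emp rows: (name, age, salary)
emp = [("ann", 45, 100.0), ("bob", 30, 80.0), ("cy", 52, 120.0), ("di", 41, 90.0)]

# Stand-in for the composite index: keys kept in (age, salary) order.
index: List[Tuple[int, float]] = sorted((age, salary) for _, age, salary in emp)

def avg_salary_where_age_gt(index: List[Tuple[int, float]],
                            min_age: int) -> Optional[float]:
    """Answer SELECT AVG(salary) FROM emp WHERE age > min_age from the index alone."""
    # Seek to the first key whose age is strictly greater than min_age.
    start = bisect.bisect_right(index, (min_age, float("inf")))
    salaries = [salary for _, salary in index[start:]]   # range scan of the index
    return sum(salaries) / len(salaries) if salaries else None

result = avg_salary_where_age_gt(index, 40)
```

The predicate column leads the key, so the scan touches only qualifying entries, and the salary is read from the same entry without visiting the base table.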

Simulation 3: Materialized Views
- For every query in the benchmark workload, create an optimal set of materialized views.
  - Each view involves only the columns needed to answer the query.
- Columns from different tables are not pre-joined in these materialized views.
- Problem of materialized views: the query workload must be known in advance.

Comparison Results by Abadi et al. (SIGMOD 2008)
- The experiments use a simplified version of TPC-H called the Star Schema Benchmark (SSBM).
- Major results:
  - None of the three attempts to simulate a column store in a row store is particularly effective.
  - The Materialized Views approach is the best.
  - The Vertical Partitioning approach is in the middle.
  - The Indexing Every Column approach is the worst.

Relevant Factors for Successful Simulation
To simulate a column store in a row store successfully, we may have to:
- Store tuple headers separately.
- Use virtual tuple-ids to do tuple reconstruction (i.e., joins).
- Maintain heap files (for base relations) in guaranteed position order.

Simulation 4: The C-Table Approach (Bruno, CIDR 2009)
- The main idea is to extend the Vertical Partitioning approach to explicitly represent the Run-Length Encoding (RLE) of tuple values.
- A C-Table is a sequence of (f, v, c) triples:
  - f is the starting tuple-id (position).
  - v is the data value.
  - c is the count (run length).
- In the example, the table is sorted first by column a, then by b, and finally by c (here c is a column name, not the count).
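
A small sketch of how (f, v, c) triples could be produced for a table sorted on (a, b, c). It rests on two assumptions that are not stated on the slide: positions start at 1, and a run in a deeper column is cut whenever the values of the earlier sort columns change, so that runs in different c-tables nest cleanly. The function name and sample rows are hypothetical.

```python
from itertools import groupby
from typing import Any, List, Sequence, Tuple

Triple = Tuple[int, Any, int]   # (f: start position, v: value, c: run length)

def build_c_tables(rows: Sequence[tuple], num_columns: int) -> List[List[Triple]]:
    """Return one run-length-encoded c-table per column, in sort order."""
    ordered = sorted(rows)                  # sort by first column, then second, ...
    c_tables: List[List[Triple]] = []
    for depth in range(num_columns):
        triples: List[Triple] = []
        position = 1                        # assumption: positions are 1-based
        # A run is a maximal group of rows sharing the same sort-key prefix.
        for prefix, run in groupby(ordered, key=lambda r: r[:depth + 1]):
            count = sum(1 for _ in run)
            triples.append((position, prefix[depth], count))
            position += count
        c_tables.append(triples)
    return c_tables

# Hypothetical rows with columns (a, b, c)
rows = [(2, "x", 7), (1, "y", 7), (1, "x", 8), (1, "x", 7), (2, "x", 7)]
c_a, c_b, c_c = build_c_tables(rows, 3)
```

For these five rows, the c-table for column a holds two triples, while the c-table for the last sort column holds four, because its runs are shorter.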

An Interesting Property of C-Tables
- For any pair of tuples t1 and t2, possibly from different c-tables, the ranges [f1, f1 + c1 - 1] and [f2, f2 + c2 - 1] do not partially overlap.
- That is, they are either disjoint, or the range associated with the column deeper in the sort order is contained in the other.
- This property allows specific query rewriting to combine information from different c-tables to answer queries.
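
The property can be checked mechanically, and it is what makes a containment join between c-tables well defined. The sketch below uses two hand-written c-tables (hypothetical data, 1-based positions) and is only an illustration of the idea; it is not the SQL rewriting used by Bruno.

```python
from typing import Any, List, Tuple

Triple = Tuple[int, Any, int]   # (f, v, c)

def overlap_kind(t1: Triple, t2: Triple) -> str:
    """Classify the ranges [f1, f1+c1-1] and [f2, f2+c2-1]."""
    lo1, hi1 = t1[0], t1[0] + t1[2] - 1
    lo2, hi2 = t2[0], t2[0] + t2[2] - 1
    if hi1 < lo2 or hi2 < lo1:
        return "disjoint"
    if (lo1 <= lo2 and hi2 <= hi1) or (lo2 <= lo1 and hi1 <= hi2):
        return "nested"
    return "partial"

def combine(outer: List[Triple], inner: List[Triple]) -> List[Tuple[Any, Any, int]]:
    """For each inner run, find the outer run that contains it.

    Returns (outer value, inner value, count) without expanding the runs.
    """
    result = []
    for f2, v2, c2 in inner:
        for f1, v1, c1 in outer:
            if f1 <= f2 and f2 + c2 - 1 <= f1 + c1 - 1:
                result.append((v1, v2, c2))
                break
    return result

# Hypothetical c-tables for a first sort column and a deeper sort column.
c_outer = [(1, 1, 3), (4, 2, 2)]
c_inner = [(1, 7, 1), (2, 8, 1), (3, 7, 1), (4, 7, 2)]

kinds = {overlap_kind(t1, t2) for t1 in c_outer for t2 in c_inner}
pairs = combine(c_outer, c_inner)
```

Because no pair of ranges partially overlaps, every run of the deeper column falls inside exactly one run of the shallower column, so the combined result stays run-length encoded.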

Query Rewriting for C-Tables
- The schema D1: (lineitem | l_shipdate, l_suppkey)

Comparison Results by Bruno (CIDR 2009)
- The experiments use TPC-H.
- The baseline for comparison is a loose lower bound on any column-store implementation:
  - Manually compute how many disk pages need to be read by any column-store execution plan.
  - Measure the time taken just to read those pages.
- Result: the C-Table approach can compete with column stores.
- Vision: row stores should be able to incorporate column-specific optimizations.

Column-Specific Optimizations
- Late materialization: construct tuples as late as possible.
- Block iteration: process multiple tuples at a time.
- Column-specific compression: for example, Run-Length Encoding.
- Invisible join on star schemas (new): similar to a semi-join in a distributed DBMS.
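
As an illustration of the first item, late materialization can be sketched as operating on position lists and fetching other columns only for the positions that survive all predicates. The column contents and helper names below are hypothetical.

```python
from typing import Any, Callable, List

# Hypothetical table stored column by column; position i is the i-th tuple.
name = ["ann", "bob", "cy", "di"]
age = [45, 30, 52, 41]
salary = [100, 80, 120, 90]

def select_positions(column: List[Any], predicate: Callable[[Any], bool]) -> List[int]:
    """Scan one column and return the positions whose value satisfies the predicate."""
    return [i for i, value in enumerate(column) if predicate(value)]

def fetch(column: List[Any], positions: List[int]) -> List[Any]:
    """Read a column only at the given positions."""
    return [column[i] for i in positions]

# WHERE age > 40 AND salary >= 100, evaluated one column at a time.
positions = select_positions(age, lambda v: v > 40)
positions = [p for p in positions if salary[p] >= 100]

# Tuples are constructed only now, and only for qualifying positions.
result = list(zip(fetch(name, positions), fetch(salary, positions)))
```

An early-materialization plan would instead build full (name, age, salary) tuples before filtering, paying the construction cost for tuples that are later discarded.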

Schema of the SSBM Benchmark

An Example Query for Illustrating Invisible Join

Invisible Join: First Phase
- Each predicate is applied to the appropriate dimension table to extract a list of dimension table keys that satisfy the predicate.
- These keys are used to build a hash table for that dimension.

Invisible Join: Second Phase
- Each hash table is used to extract the positions of tuples in the fact table that satisfy the corresponding predicate.

Invisible Join: Third Phase
- The third phase uses the list of satisfying positions P in the fact table to look up foreign key values and, through them, the needed data values in the corresponding dimension tables.
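
The three phases can be put together in a short sketch. The tiny star schema, its column names, and the query (region = 'ASIA' and year >= 1997) are made up for illustration, and Python sets stand in for the hash tables and position lists.

```python
from typing import Any, Callable, Dict, List, Set

# Hypothetical dimension tables: key -> attribute value.
customer_region: Dict[int, str] = {1: "ASIA", 2: "EUROPE", 3: "ASIA"}
date_year: Dict[int, int] = {10: 1996, 11: 1997, 12: 1998}

# Hypothetical fact table, stored column by column.
fact_custkey = [1, 2, 3, 1, 2]
fact_datekey = [10, 11, 12, 12, 11]
fact_revenue = [5, 6, 7, 8, 9]

def matching_keys(dimension: Dict[int, Any], predicate: Callable[[Any], bool]) -> Set[int]:
    """Phase 1: hash set of dimension keys that satisfy the predicate."""
    return {key for key, value in dimension.items() if predicate(value)}

def satisfying_positions(fk_column: List[int], keys: Set[int]) -> Set[int]:
    """Phase 2: fact-table positions whose foreign key is in the hash set."""
    return {pos for pos, key in enumerate(fk_column) if key in keys}

# Phase 1: one hash set per dimension predicate.
asia_keys = matching_keys(customer_region, lambda region: region == "ASIA")
late_keys = matching_keys(date_year, lambda year: year >= 1997)

# Phase 2: probe each foreign key column, then intersect the position sets.
P = satisfying_positions(fact_custkey, asia_keys) & satisfying_positions(fact_datekey, late_keys)

# Phase 3: use P to read foreign keys and look up dimension values.
result = [(customer_region[fact_custkey[p]], date_year[fact_datekey[p]], fact_revenue[p])
          for p in sorted(P)]
```

The fact table's foreign key columns are filtered before any dimension attribute is fetched, which is why the slides compare the technique to a semi-join.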

Which Column-Specific Optimization Is Most Significant?
- Late materialization: improves performance by a factor of 3.
- Block iteration: improves performance by 5% - 50%.
- Column-specific compression: improves performance by a factor of 2 on average, and by a factor of 10 on queries that access sorted data.
- Invisible join on star schemas: improves performance by 50% - 70%.

Open Question
- Building a complete row store that can transform into a column store on workloads where column stores perform well.

References
- Daniel J. Abadi, Samuel R. Madden, and Nabil Hachem. Column-Stores vs. Row-Stores: How Different Are They Really? In SIGMOD, 2008.
- Allison L. Holloway and David J. DeWitt. Read-Optimized Databases, In Depth. In VLDB.
- N. Bruno. Teaching an Old Elephant New Tricks. In CIDR, 2009.