EECS 262a Advanced Topics in Computer Systems Lecture 16 C-Store / DB Cracking October 28 th, 2013 John Kubiatowicz and Anthony D. Joseph Electrical Engineering.

Slides:



Advertisements
Similar presentations
C-Store: Self-Organizing Tuple Reconstruction Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Apr. 17, 2009.
Advertisements

CS 540 Database Management Systems
CS 245Notes 31 (1) Insertion/Deletion (2) Buffer Management (3) Comparison of Schemes Other Topics.
C-Store: Class Overview Spring, 2009 Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Feb 27, 2009.
C-Store: Introduction to TPC-H Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009.
C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009.
6.814/6.830 Lecture 8 Memory Management. Column Representation Reduces Scan Time Idea: Store each column in a separate file GM AAPL.
IS 4420 Database Fundamentals Chapter 6: Physical Database Design and Performance Leon Chen.
Recap of Feb 27: Disk-Block Access and Buffer Management Major concepts in Disk-Block Access covered: –Disk-arm Scheduling –Non-volatile write buffers.
Database Systems: Design, Implementation, and Management Eighth Edition Chapter 11 Database Performance Tuning and Query Optimization.
1 External Sorting for Query Processing Yanlei Diao UMass Amherst Feb 27, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
1.1 CAS CS 460/660 Introduction to Database Systems File Organization Slides from UC Berkeley.
Lecture 6 Indexing Part 2 Column Stores. Indexes Recap Heap FileBitmapHash FileB+Tree InsertO(1) O( log B n ) DeleteO(P)O(1) O( log B n ) Range Scan O(P)--
Locking Key Ranges with Unbundled Transaction Services 1 David Lomet Microsoft Research Mohamed Mokbel University of Minnesota.
Introduction to Column-Oriented Databases Seminar: Columnar Databases, Nov 2012, Univ. Helsinki.
CS 345: Topics in Data Warehousing Thursday, October 28, 2004.
CHP - 9 File Structures. INTRODUCTION In some of the previous chapters, we have discussed representations of and operations on data structures. These.
Cloud Computing Lecture Column Store – alternative organization for big relational data.
C-Store: Column Stores over Solid State Drives Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Jun 19, 2009.
C-Store: A Column-oriented DBMS Speaker: Zhu Xinjie Supervisor: Ben Kao.
Database Systems: Design, Implementation, and Management Tenth Edition Chapter 11 Database Performance Tuning and Query Optimization.
Database Systems: Design, Implementation, and Management Eighth Edition Chapter 10 Database Performance Tuning and Query Optimization.
IT The Relational DBMS Section 06. Relational Database Theory Physical Database Design.
1 C-Store: A Column-oriented DBMS New England Database Group (Stonebraker, et al. Brandeis/Brown/MIT/UMass-Boston) Extended for Big Data Reading Group.
© Stavros Harizopoulos 2006 Performance Tradeoffs in Read-Optimized Databases Stavros Harizopoulos MIT CSAIL joint work with: Velen Liang, Daniel Abadi,
CS 345: Topics in Data Warehousing Tuesday, October 19, 2004.
Physical Database Design & Performance. Optimizing for Query Performance For DBs with high retrieval traffic as compared to maintenance traffic, optimizing.
C-Store: Column-Oriented Data Warehousing Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May 17, 2010.
DANIEL J. ABADI, ADAM MARCUS, SAMUEL R. MADDEN, AND KATE HOLLENBACH THE VLDB JOURNAL. SW-Store: a vertically partitioned DBMS for Semantic Web data.
MonetDB/X100 hyper-pipelining query execution Peter Boncz, Marcin Zukowski, Niels Nes.
Performance Tradeoffs in Read-Optimized Databases Stavros Harizopoulos * MIT CSAIL joint work with: Velen Liang, Daniel Abadi, and Sam Madden massachusetts.
© Stavros Harizopoulos 2006 Performance Tradeoffs in Read- Optimized Databases: from a Data Layout Perspective Stavros Harizopoulos MIT CSAIL Modified.
Chapter 6 1 © Prentice Hall, 2002 The Physical Design Stage of SDLC (figures 2.4, 2.5 revisited) Project Identification and Selection Project Initiation.
MIT DB GROUP. People Sam Madden Daniel Abadi (Yale)Daniel Abadi Magdalena Balazinska (U. Wash.)Magdalena Balazinska.
1 Data Warehouses BUAD/American University Data Warehouses.
Daniel J. Abadi · Adam Marcus · Samuel R. Madden ·Kate Hollenbach Presenter: Vishnu Prathish Date: Oct 1 st 2013 CS 848 – Information Integration on the.
1 C-Store: A Column-oriented DBMS By New England Database Group.
C-Store: How Different are Column-Stores and Row-Stores? Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 8, 2009.
M.Kersten Dec 31, Cracking the database store The far side of the Moon Martin Kersten, Stefan Manegold Centre for Mathematics and Computer Science.
C-Store: Concurrency Control and Recovery Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Jun. 5, 2009.
1 CS 430 Database Theory Winter 2005 Lecture 16: Inside a DBMS.
Database Management COP4540, SCS, FIU Physical Database Design (ch. 16 & ch. 3)
C-Store: Tuple Reconstruction Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 27, 2009.
C-Store: Data Model and Data Organization Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May 17, 2010.
EECS 262a Advanced Topics in Computer Systems Lecture 16 C-Store / DB Cracking October 22 nd, 2012 John Kubiatowicz and Anthony D. Joseph Electrical Engineering.
C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009.
C-Store: RDF Data Management Using Column Stores Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Apr. 24, 2009.
File Structures. 2 Chapter - Objectives Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files Hashed Files Dynamic and.
University of Sunderland COM 220 Lecture Ten Slide 1 Database Performance.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Chapter 8 Physical Database Design. Outline Overview of Physical Database Design Inputs of Physical Database Design File Structures Query Optimization.
M.Kersten MonetDB, Cracking and recycling Martin Kersten CWI Amsterdam.
CS 440 Database Management Systems Lecture 6: Data storage & access methods 1.
CS 540 Database Management Systems
Chapter 5 Record Storage and Primary File Organizations
CS4432: Database Systems II
October 15-18, 2013 Charlotte, NC Accelerating Database Performance Using Compression Joseph D’Antoni, Solutions Architect Anexinet.
Column Oriented Database By: Deepak Sood Garima Chhikara Neha Rani Vijayita Gumber.
Oracle Announced New In- Memory Database G1 Emre Eftelioglu, Fen Liu [09/27/13] 1 [1]
Database cracking Stratos Idreos, Martin Kersten and Stefan Manegold
SQL IMPLEMENTATION & ADMINISTRATION Indexing & Views.
Module 11: File Structure
CHP - 9 File Structures.
Physical Database Design and Performance
Database Management Systems (CS 564)
CSTORE E0261 Jayant Haritsa Computer Science and Automation
ICOM 5016 – Introduction to Database Systems
Self-organizing Tuple Reconstruction in Column-stores
John Kubiatowicz Electrical Engineering and Computer Sciences
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #03 Row/Column Stores, Heap Files, Buffer Manager, Catalogs Instructor: Chen Li.
Presentation transcript:

EECS 262a Advanced Topics in Computer Systems Lecture 16 C-Store / DB Cracking October 28 th, 2013 John Kubiatowicz and Anthony D. Joseph Electrical Engineering and Computer Sciences University of California, Berkeley

10/28/20132cs262a-S13 Lecture-16 Today’s Papers C-Store: A Column-oriented DBMS * Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O’Neil, Pat O’Neil, Alex Rasin, Nga Tran, Stan Zdonik. Appears in Proceedings of the ACM Conference on Very Large Databases(VLDB), 2005C-Store: A Column-oriented DBMS Database Cracking + Stratos Idreos, Martin L. Kersten, and Stefan Manegold. Appears in the 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 7-10, 2007Database Cracking Wednesday: Comparison of PDBMS, CS, MR Thoughts? * Some slides based on slides from Jianlin Feng, School of Software, Sun Yat-Sen University + Some slides based on slides from Stratos Idreos, CWI Amsterdam, The Netherlands

10/28/20133cs262a-S13 Lecture-16 Relational Data: Logical View NameAgeDepartmentSalary Bob25Math10K Bill27EECS50K Jill24Biology80K

10/28/20134cs262a-S13 Lecture-16 C-Store: A Column-oriented DBMS Physical layout: Row store or Column store Record 1 Record 2 Column 1 Column 2 Record 3 Column 3 Relation or Tables

10/28/20135cs262a-S13 Lecture-16 Row Stores On-Line Transaction Processing (OLTP) –ATM, POS in supermarkets Characteristics of OLTP applications: –Transactions that involve small numbers of records (or tuples) –Frequent updates (including queries) –Many users –Fast response times OLTP needs a write-optimized row store –Insert and delete a record in one physical write Easy to add new record, but might read unnecessary data (wasted memory and I/O bandwidth)

10/28/20136cs262a-S13 Lecture-16 Row Store: Columns Stored Together Record id = Page i Rid = (i,N) Rid = (i,2) Rid = (i,1) Pointer to start of free space SLOT DIRECTORY N N # slots Slot Array Data

10/28/20137cs262a-S13 Lecture-16 Current DBMS Gold Standard Store columns from one record contiguously on disk Use B-tree indexing Use small (e.g. 4K) disk blocks Align fields on byte or word boundaries Conventional (row-oriented) query optimizer and executor (technology from 1979) Aries-style transactions

10/28/20138cs262a-S13 Lecture-16 OLAP and Data Warehouses On-Line Analytical Processing (OLAP) –Flexible reporting for Business Intelligence Characteristics of OLAP applications –Transactions that involve large numbers of records –Frequent Ad-hoc queries and infrequent updates –A few decision-making users –Fast response times Data Warehouses facilitate reporting and analysis –Read-Mostly Other read-mostly applications –Customer Relationship Management (Siebel, Oracle) –Catalog Search in E-Commerce (Amazon.com, Bestbuy.com)

10/28/20139cs262a-S13 Lecture-16 Column Stores Logical data model: Relational Model Key Intuition: Only read relevant columns –Example: Ad-hoc queries read 2 columns out of 20 Multiple prior column store implementations –Sybase IQ (early ’90s, bitmap index) –Addamark (i.e., SenSage, for Event Log data warehouse) –KDB (Column-stores for financial services companies) –MonetDB (Hyper-Pipelining Query Execution, CIDR’05) Only read necessary data, but might need multiple seeks

10/28/201310cs262a-S13 Lecture-16 Row and Column Stores

10/28/201311cs262a-S13 Lecture-16 Read-Optimized Databases Effect of column-orientation on performance? –Read-less, seek more so depends on prefetching, query selectivity, tuple width, competing query traffic 45 … 37 Joe … Sue 1 … 2 column stores 1 Joe 45 … … … 2 Sue 37 row stores Sybase IQ MonetDB KDB C-Store SQL Server DB2 Oracle

10/28/201312cs262a-S13 Lecture-16 Rows versus Columns 1 Joe 45 2 Sue 37 … … … single file project Joe 45 12…12… Joe Sue … … 3 files Joe 45 reconstruct Joe 45 seek column storesrow stores

10/28/201313cs262a-S13 Lecture-16 C-Store Technical Ideas Column Store with some “novel” ideas (below) Only materialized views on each relation (perhaps many) Active data compression Column-oriented query executor and optimizer Shared-nothing architecture Replication-based concurrency control and recovery Writeable Store (WS) Read-optimized Store (RS) Tuple Mover Read-optimized Store (RS)

10/28/201314cs262a-S13 Lecture-16 Architecture of Vertica C-Store

10/28/201315cs262a-S13 Lecture-16 Basic Concepts A logical table is physically represented as a set of projections Each projection consists of a set of columns –Columns are stored separately, along with a common sort order defined by SORT KEY Each column appears in at least one projection A column can have different sort orders if it is stored in multiple projections

10/28/201316cs262a-S13 Lecture-16 Example C-Store Projection LINEITEM (shipdate, quantity, retflag, suppkey | shipdate, quantity, retflag) –First sorted by shipdate –Second sorted by quantity –Third sorted by retflag Sorting increases locality of data –Favors compression techniques such as Run- Length Encoding (see Elephant paper)

10/28/201317cs262a-S13 Lecture-16 C-Store Operators Selection –Produce bitmaps that can be efficiently combined Mask –Materialize a set of values from a column and a bitmap Permute –Reorder a column using a join index Projection –Free operation! –Two columns in the same order can be concatenated for free Join –Produces positions rather than values

10/28/201318cs262a-S13 Lecture-16 Example: Join over Two Columns

10/28/201319cs262a-S13 Lecture-16 Column Store has High Compressibility Each attribute is stored in a separate column –Related values are compressible (versus values of separate attributes) Compression benefits –Reduces the data sizes –Improves disk (and memory) I/O performance by: »reducing seek times (related data stored nearer together) »reducing transfer times (less data to read/write) »increasing buffer hit rate (buffer can hold larger fraction of data)

10/28/201320cs262a-S13 Lecture-16 Compression Methods Dictionary Bit-pack –Pack several attributes inside a 4-byte word –Use as many bits as max-value Delta –Base value per page –Arithmetic differences No Run-Length Encoding (unlike Elephant paper) … ‘low’ … … ‘high’ … … ‘low’ … … ‘normal’ … … 00 … … 10 … … 00 … … 01 …

10/28/201321cs262a-S13 Lecture-16 C-Store Use of Snapshot Isolation Snapshot Isolation for Fast OLAP/Data Warehousing –Allows very fast transactions without locks –Can read large consistent snapshot of database Divide into RS and WS stores –Read Store is Optimized for Fast Read-Only Transactions –Write Store is Optimized for Transactional Updates Low Water Mark (LWM) –Represents earliest epoch at which read-only transactions can run –RS contains tuples added before LWM High Water Mark (HWM) –Represents latest epoch at which read-only transactions can run

10/28/201322cs262a-S13 Lecture-16 Other Ideas K-safety: Can handle up to K-1 failures –Every piece of data replicated K times –Different projections sorted in different ways Join Tables –Construct original tuples given covering projects –Vertica gave up on Join Tables – too expensive, require super-projections

10/28/201323cs262a-S13 Lecture-16 Evaluation? Series of 7 queries against C-Store vs two commercial DBMS –C-Store faster In all cases, sometimes significantly Why so much faster? –Column Representation – avoid extra reads –Overlapping projections – multiple orderings of column as appropriate –Better data compression –Query operators operating on compressed data

10/28/201324cs262a-S13 Lecture-16 Summary Columns outperform rows in listed workloads –Reasons: »Column representation avoids reads of unused attributes »Query operators operate on compressed representation, mitigating storage barrier problem »Avoids memory-bandwidth bottleneck in rows Storing overlapping projections, rather than the whole table allows storage of multiple orderings of a column as appropriate Better compression of data allows more orderings in the same space Results from other papers: –Prefetching is key to columns outperforming rows –Systems with competing traffic favor columns

10/28/201325cs262a-S13 Lecture-16 Is this a good paper? What were the authors’ goals? What about the evaluation/metrics? Did they convince you that this was a good system/approach? Were there any red-flags? What mistakes did they make? Does the system/approach meet the “Test of Time” challenge? How would you review this paper today?

BREAK

10/28/201327cs262a-S13 Lecture-16 DB Physical Organization Problem Many DBMS perform well and efficiently only after being tuned by a DBA DBA decides: –Which indexes to build? –On which data parts? –and when to build them?

10/28/201328cs262a-S13 Lecture-16 Timeline Sample workload Analyze performance Prepare estimated physical design Queries Very complex and time consuming process! What about: Dynamic, changing workloads? Very Large Databases?

10/28/201329cs262a-S13 Lecture-16 Database Cracking Solve challenges of dynamic environments: –Remove all tuning and physical design steps, but still get similar performance as a fully tuned system How? Design new auto-tuning kernels –DBA with cracking (operators, plans, structures, etc.)

10/28/201330cs262a-S13 Lecture-16 Database Cracking No monitoring No preparation No external tools No full indexes No human involvement Continuous on-the-fly physical reorganization –Partial, incremental, adaptive indexing Designed for modern column-stores

10/28/201331cs262a-S13 Lecture-16 Cracking Example Each query is treated as an advice on how data should be stored –Triggers physical re-organization of the database Q1: select * from R where R.A > 10 and R.A < 14 Column A

10/28/201332cs262a-S13 Lecture-16 Cracking Design The first time a range query is posed on an attribute A, a cracking DBMS makes a copy of column A, called the cracker column of A A cracker column is continuously physically re- organized based on queries that need to touch attribute such as the result is in a contiguous space For each cracker column, there is a cracker index

10/28/201333cs262a-S13 Lecture-16 Cracking Algorithms Two types of cracking algorithms based on select’s where clause where X < # where # < X < # Split a piece into Split a piece into two new pieces three new pieces

10/28/201334cs262a-S13 Lecture-16 Cracker Select Operator Traditional select operator –Scans the column –Returns a new column that contains the qualifying values The cracker select operator –Searches the cracker index –Physically re-organizes the pieces found –Updates the cracker index –Returns a slice of the cracker column as the result More steps but faster because analyzes less data

10/28/201335cs262a-S13 Lecture-16 Cracker Column A Cracking Example Each query is treated as advice on how data should be stored –Physically reorganize based on the selection predicate Q1: select * from R where R.A > 10 and R.A < 14 Column A Piece 1: A ≤ 10 Piece 2: 10 < A < 14 Piece 3: 14 ≤ A

10/28/201336cs262a-S13 Lecture-16 Cracker Column A Cracking Example Each query is treated as advice on how data should be stored –Physically reorganize based on the selection predicate Q1: select * from R where R.A > 10 and R.A < 14 Column A Piece 1: A ≤ 10 Piece 2: 10 < A < 14 Piece 3: 14 ≤ A

10/28/201337cs262a-S13 Lecture-16 Cracker Column A Cracking Example Each query is treated as advice on how data should be stored –Physically reorganize based on the selection predicate Q1: select * from R where R.A > 10 and R.A < 14 Column A Piece 1: A ≤ 10 Piece 2: 10 < A < 14 Piece 3: 14 ≤ A Gain knowledge on how to organize data on-the-fly within the select-operator Improve data access for future queries

10/28/201338cs262a-S13 Lecture-16 Cracker Column A Cracking Example Each query is treated as advice on how data should be stored –Physically reorganize based on the selection predicate Q1: select * from R where R.A > 10 and R.A < 14 Column A Piece 1: A ≤ 10 Piece 2: 10 < A < 14 Piece 3: 14 ≤ A Q1 Q2: select * from R where R.A > 7 and R.A ≤16

10/28/201339cs262a-S13 Lecture-16 Cracker Column A Cracking Example Each query is treated as advice on how data should be stored –Physically reorganize based on the selection predicate Q1: select * from R where R.A > 10 and R.A < 14 Column A Piece 1: A ≤ 10 Piece 2: 10 < A < 14 Piece 3: 14 ≤ A Q1 Q2: select * from R where R.A > 7 and R.A ≤16 Cracker Column A Piece 1: A ≤ 7 Piece 3: 10<A<14 Piece 4: 14≤A≤16 Q2 Piece 2: 7 < A ≤10 Piece 5: 16 < A

10/28/201340cs262a-S13 Lecture-16 Cracker Column A Cracking Example Each query is treated as advice on how data should be stored –Physically reorganize based on the selection predicate Q1: select * from R where R.A > 10 and R.A < 14 Column A Piece 1: A ≤ 10 Piece 2: 10 < A < 14 Piece 3: 14 ≤ A Q1 Q2: select * from R where R.A > 7 and R.A ≤16 Cracker Column A Piece 1: A ≤ 7 Piece 3: 10<A<14 Piece 4: 14≤A≤16 Q2 Piece 2: 7 < A ≤10 Piece 5: 16 < A

10/28/201341cs262a-S13 Lecture-16 Cracker Column A Cracking Example Each query is treated as advice on how data should be stored –Physically reorganize based on the selection predicate Q1: select * from R where R.A > 10 and R.A < 14 Column A Piece 1: A ≤ 10 Piece 2: 10 < A < 14 Piece 3: 14 ≤ A Q1 Q2: select * from R where R.A > 7 and R.A ≤16 Cracker Column A Piece 1: A ≤ 7 Piece 3: 10<A<14 Piece 4: 14≤A≤16 Q2 Piece 2: 7 < A ≤10 Piece 5: 16 < A Result tuples The more cracking, the more learned

10/28/201342cs262a-S13 Lecture-16 Self-Organizing Behavior ( Count(*) range query )

10/28/201343cs262a-S13 Lecture-16 Self-Organizing Behavior (TPC-H Query 6) TPC-H is an ad-hoc, decision support benchmark –Business oriented ad-hoc queries, concurrent data modifications Example: –Tell me the amount of revenue increase that would have resulted from eliminating certain company-wide discounts in a given percentage range in a given year Workload: –Database load –Execution of 22 read-only queries in both single and multi-user mode –Execution of 2 refresh functions

10/28/201344cs262a-S13 Lecture-16 Is this a good paper? What were the authors’ goals? What about the evaluation/metrics? Did they convince you that this was a good system/approach? Were there any red-flags? What mistakes did they make? Does the system/approach meet the “Test of Time” challenge? How would you review this paper today?