CPT-S 580-06 Advanced Databases 11 Yinghui Wu EME 49.

Slides:



Advertisements
Similar presentations
Tuning: overview Rewrite SQL (Leccotech)Leccotech Create Index Redefine Main memory structures (SGA in Oracle) Change the Block Size Materialized Views,
Advertisements

C-Store: Self-Organizing Tuple Reconstruction Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Apr. 17, 2009.
Outline What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation Further development of data.
Joins on Encoded and Partitioned Data Jae-Gil Lee 2* Gopi Attaluri 3 Ronald Barber 1 Naresh Chainani 3 Oliver Draese 3 Frederick Ho 5 Stratos Idreos 4*
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part C Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.
Presented by Vigneshwar Raghuram
1 HYRISE – A Main Memory Hybrid Storage Engine By: Martin Grund, Jens Krüger, Hasso Plattner, Alexander Zeier, Philippe Cudre-Mauroux, Samuel Madden, VLDB.
C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009.
6.814/6.830 Lecture 8 Memory Management. Column Representation Reduces Scan Time Idea: Store each column in a separate file GM AAPL.
Last Time –Main memory indexing (T trees) and a real system. –Optimize for CPU, space, and logging. But things have changed drastically! Hardware trend:
Columnar Database Systems
Dutch-Belgium DataBase Day University of Antwerp, MonetDB/x100 Peter Boncz, Marcin Zukowski, Niels Nes.
Chapter 17 Methodology – Physical Database Design for Relational Databases Transparencies © Pearson Education Limited 1995, 2005.
PARALLEL DBMS VS MAP REDUCE “MapReduce and parallel DBMSs: friends or foes?” Stonebraker, Daniel Abadi, David J Dewitt et al.
Team Dosen UMN Physical DB Design Connolly Book Chapter 18.
Introduction to Column-Oriented Databases Seminar: Columnar Databases, Nov 2012, Univ. Helsinki.
CS 345: Topics in Data Warehousing Thursday, October 28, 2004.
Database System Concepts and Architecture Lecture # 3 22 June 2012 National University of Computer and Emerging Sciences.
Cloud Computing Lecture Column Store – alternative organization for big relational data.
C-Store: Column Stores over Solid State Drives Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Jun 19, 2009.
Lecture 9 Methodology – Physical Database Design for Relational Databases.
A Hybrid Row-column OLTP Database Architecture for Operational Reporting Jan Schaffner, Anja Bog, Jens Krüger, Alexander Zeier.
Physical Database Design & Performance. Optimizing for Query Performance For DBs with high retrieval traffic as compared to maintenance traffic, optimizing.
Physical Database Design Chapter 6. Physical Design and implementation 1.Translate global logical data model for target DBMS  1.1Design base relations.
C-Store: Column-Oriented Data Warehousing Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May 17, 2010.
DANIEL J. ABADI, ADAM MARCUS, SAMUEL R. MADDEN, AND KATE HOLLENBACH THE VLDB JOURNAL. SW-Store: a vertically partitioned DBMS for Semantic Web data.
MonetDB/X100 hyper-pipelining query execution Peter Boncz, Marcin Zukowski, Niels Nes.
Efficient and Flexible Information Retrieval Using MonetDB/X100 Sándor Héman CWI, Amsterdam Marcin Zukowski, Arjen de Vries, Peter Boncz January 08, 2007.
© Stavros Harizopoulos 2006 Performance Tradeoffs in Read- Optimized Databases: from a Data Layout Perspective Stavros Harizopoulos MIT CSAIL Modified.
Chapter 6 1 © Prentice Hall, 2002 The Physical Design Stage of SDLC (figures 2.4, 2.5 revisited) Project Identification and Selection Project Initiation.
MIT DB GROUP. People Sam Madden Daniel Abadi (Yale)Daniel Abadi Magdalena Balazinska (U. Wash.)Magdalena Balazinska.
Bitmap Indices for Data Warehouse Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY.
Column-Stores vs. Row-Stores How Different are they Really? Daniel J. Abadi, Samuel Madden, and Nabil Hachem, SIGMOD 2008 Presented By, Paresh Modak( )
Daniel J. Abadi · Adam Marcus · Samuel R. Madden ·Kate Hollenbach Presenter: Vishnu Prathish Date: Oct 1 st 2013 CS 848 – Information Integration on the.
C-Store: How Different are Column-Stores and Row-Stores? Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 8, 2009.
C-Store: Tuple Reconstruction Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 27, 2009.
C-Store: Data Model and Data Organization Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May 17, 2010.
C-Store: Integrating Compression and Execution Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY Mar 20, 2009.
1 Distributed Databases Chapter 21, Part B. 2 Introduction v Data is stored at several sites, each managed by a DBMS that can run independently. v Distributed.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Chapter 4 Logical & Physical Database Design
CMU SCS Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications C. Faloutsos – A. Pavlo Lecture#25: Column Stores.
Comp 335 File Structures Data Compression. Why Study Data Compression? Conserves storage space Files can be transmitted faster because there are less.
March, 2002 Efficient Bitmap Indexing Techniques for Very Large Datasets Kesheng John Wu Ekow Otoo Arie Shoshani.
Last Updated : 27 th April 2004 Center of Excellence Data Warehousing Group Teradata Performance Optimization.
CS4432: Database Systems II
October 15-18, 2013 Charlotte, NC Accelerating Database Performance Using Compression Joseph D’Antoni, Solutions Architect Anexinet.
Column Oriented Database By: Deepak Sood Garima Chhikara Neha Rani Vijayita Gumber.
Oracle Announced New In- Memory Database G1 Emre Eftelioglu, Fen Liu [09/27/13] 1 [1]
Column-Stores vs. Row-Stores How Different are they Really? Daniel J. Abadi, Samuel Madden, and Nabil Hachem, SIGMOD 2008: Talk by Karthik Ramachandra,
CSCI5570 Large Scale Data Processing Systems
Module 11: File Structure
Indexing Structures for Files and Physical Database Design
Parallel Databases.
CPT-S 415 Big Data Yinghui Wu EME B45 1.
Data Warehouse.
COMP 430 Intro. to Database Systems
CHAPTER 5: PHYSICAL DATABASE DESIGN AND PERFORMANCE
Paritosh Aggarwal Rushi Nadimpally
DB storage architectures: Rows, Columns, LSM trees
Faloutsos/Pavlo C. Faloutsos – A. Pavlo Lecture#25: Column Stores
The Physical Design Stage of SDLC (figures 2.4, 2.5 revisited)
Column-Stores vs. Row-Stores: How Different Are They Really?
Chapter 13: Data Storage Structures
OLAP Query Performance in Column-Oriented Databases
CSTORE E0261 Jayant Haritsa Computer Science and Automation
DB storage architectures: Rows and Columns
Chapter 13: Data Storage Structures
Chapter 13: Data Storage Structures
Presentation transcript:

CPT-S Advanced Databases 11 Yinghui Wu EME 49

In-memory databases 2 MMDBS for OLAP Slides modified from Sebastian Dorok

Main-Memory DBMSs for OLAP Read-mostly → Data is typically not modified; new data is just inserted → Append new data via bulk loading Analysis mostly on few columns and over all rows –→ Use column-oriented data layout –→ Yields better data cache utilization Query response time is important –→ Bulk processing to improve instruction cache effectiveness –→ Reduce amount of data to process by lightweight compression techniques optimizing query processing

Main-Memory DBMSs for OLAP Main memory as primary storage + Column-oriented data layout + Lightweight compression + Optimized query processing ~ Main-Memory DBMS for efficient OLAP

Data Compression - Motivation Make Big data small → Reduced costs for storage as we need less storage space to store the same amount of data → More data can be stored using the same amount of storage space → Better utilization of memory bandwidth

Data Compression - Requirements Lossless compression → otherwise we generate data errors Lightweight (de-)compression → otherwise (de-)compression overhead would outweigh our possible performance potentials Enable processing of compressed values → no additional overhead for decompression

Classification of compression techniques General idea: Replace data by representation that needs less bits than original data Granularity: Attribute values, tuples, tables, pages Index structures Code length: Fixed code length: All values are encoded with same number of bits Variable code length: Number of bits differs (e.g., correlate number of used bits with value frequency; Huffman Encoding)

Dictionary Encoding Use a dictionary that contains data values and their surrogates Surrogate can be derived from values’ dictionary position Applicable to row- and column-oriented data layouts

Bit Packing Surrogate values do not have to be multiples of one byte Example: 16 distinct values can be effectively stored using 4 bit per surrogate → 2 values per byte ~ Processing of compressed values is not straight forward

Run Length Encoding Reduce size of sequences of same value Store the value and an indicator about the sequence length Applicable to column-oriented data layouts Sorting can further improve compression effectiveness

Common Value Suppression A common value is scattered across a column (e.g., null ) Use a data structure that indicates whether a common value is stored at a given row index or not Yes: Common value is stored here No: Lookup value in the dictionary (using prefix sum) Applicable to column-oriented data layouts

Bit-Vector Encoding Suitable for columns that have low number of distinct values Use bit string for every column value that indicates whether the value is present at a given row index or not Length of bit string equals number of tuples Used in Bitmap-Indexes

Delta Coding Store difference to precedent value instead of the original value Applicable to column-oriented data layouts Sorting can further improve compression effectiveness

Frequency Compression Idea: Exploit data skew Principle: More frequent values are encoded using fewer bits Less frequent values are encoded using more bits Use prefix codes (e.g., Huffman Encoding)

Frequency Compression - Column Store Keep track where encoded values start and end → Use fixed code length for a partition rather than unique code for every value (i.e., Huffman Encoding) Picture taken from [Raman et al., 2013]

Frequency Compression - Row Store Apply frequency compression on each column and perform additional delta coding → Introduces overhead for reading nth column value Picture taken from [Raman and Swart, 2006]

Frequency Partitioning Developed for IBM’s BLINK project [Raman et al., 2008] Similar to frequency compression of tuples in a row- oriented data layout But, partitioning tuples regarding column values → Overhead is reduced as within one partition code length is fixed

18 Frequency Partitioning: Principle Picture taken from [Raman et al., 2008]

Data Compression - Summary General idea: Replace data by representation that needs less bits than original data Discussed approaches: Fixed code length: Dictionary Encoding, RLE, Common Value Suppression Variable code length: Delta Coding, Frequency Compression, Frequency Partitioning Improvements: bit packing, partitioning Benefits for main-memory DBMSs: Reduced storage requirements Better memory bandwidth utilization

Query Processing - Motivation For analytical applications (OLAP), most of the time a small set of columns is needed to answer a query → Read only columns that are needed to answer a query → Column-oriented data layout is highly suitable Beware for queries accessing all columns of a table → high materialization costs

Materialization Strategies Recall: In a column-oriented data layout each column is stored separately Consider, e.g., a selection query: SELECT l_shipdate, l_linenumber FROM lineitem WHERE l_shipdate < " " AND l_linenumber < "10000" → Query accesses and returns values of two columns → Materialization (tuple reconstruction) during query processing necessary

Early Materialization Reconstruct tuples as soon as possible Picture taken from [Abadi et al., 2007] SPC... Scan, Predicate, Construct

Late Materialization Postpone tuple reconstruction to the latest possible time Picture taken from [Abadi et al., 2007] DS... Data Sources

Advantages of Early and Late Materialization Early Materialization (EM): Reduces access cost if one column has to be accessed multiple times during query processing [Abadi et al., 2007] Late Materialization (LM): Reduces amount of tuples to reconstruct LM allows processing of columns as long as possible → Processing of compressed data → LM improves cache effectiveness [Abadi et al., 2008]

Example: Invisible Join [Abadi et al., 2008] SELECT c.nation, s.nation, d.year, sum(lo.revenue) as revenue FROM customer AS c, lineorder AS lo, supplier AS s, dwdate AS d WHERE lo.custkey = c.custkey AND lo.suppkey = s.suppkey AND lo.orderdate = d.datekey AND c.region = ASIA AND s.region = ASIA AND d.year >= 1992 AND d.year <= 1997 GROUP BY c.nation, s.nation, d.year ORDER BY d.year asc, revenue desc;

Example: Invisible Join [Abadi et al., 2008] SELECT c.nation, s.nation, d.year, sum(lo.revenue) as revenue FROM customer AS c, lineorder AS lo, supplier AS s, dwdate AS d WHERE lo.custkey = c.custkey AND lo.suppkey = s.suppkey AND lo.orderdate = d.datekey AND c.region = ASIA AND s.region = ASIA AND d.year >= 1992 AND d.year <= 1997 GROUP BY c.nation, s.nation, d.year ORDER BY d.year asc, revenue desc;

27 Picture ta k en from [Abadi et al., 2 008] Example: Invisible Join - Hash [Abadi et al., 2008]

Example: Invisible Join [Abadi et al., 2008] SELECT c.nation, s.nation, d.year, sum(lo.revenue) as revenue FROM customer AS c, lineorder AS lo, supplier AS s, dwdate AS d WHERE lo.custkey = c.custkey AND lo.suppkey = s.suppkey AND lo.orderdate = d.datekey AND c.region = ASIA AND s.region = ASIA AND d.year >= 1992 AND d.year <= 1997 GROUP BY c.nation, s.nation, d.year ORDER BY d.year asc, revenue desc;

29 Example: Invisible Join - Probe [Abadi et al., 2008] Picture ta k en from [Abadi et al., 2008]

Example: Invisible Join [Abadi et al., 2008] SELECT c.nation, s.nation, d.year, sum(lo.revenue) as revenue FROM customer AS c, lineorder AS lo, supplier AS s, dwdate AS d WHERE lo.custkey = c.custkey AND lo.suppkey = s.suppkey AND lo.orderdate = d.datekey AND c.region = ASIA AND s.region = ASIA AND d.year >= 1992 AND d.year <= 1997 GROUP BY c.nation, s.nation, d.year ORDER BY d.year asc, revenue desc;

31 Example: Invisible Join - Materialize [Abadi et al., 2008] Picture ta k en from [Abadi et al., 2008]

Star-Schema-Benchmark Performance Figure taken from [Abadi et al., 2008] T=tuple-at-a-time processing, t=block processing I=invisible join enabled, i=disabled C=compression enabled, c=disabled L=late materialization enabled, l=disabled

References I Abadi, D. J., Madden, S., and Hachem, N. (2008). Column-stores vs. row-stores: how different are they really? In SIGMOD, pages 967–980. Abadi, D. J., Myers, D. S., DeWitt, D. J., and Madden, S. (2007). Materialization Strategies in a Column-Oriented DBMS. In ICDE, pages 466–475. Boncz, P. A., Zukowski, M., and Nes, N. (2005). Monetdb/x100: Hyper-pipelining query execution. In CIDR, pages 225–237. Copeland, G. P. and Khoshafian, S. N. (1985). A decomposition storage model. In SIGMOD, pages 268–279. Drepper, U. (2007). What Every Programmer Should Know About Memory. Graefe, G. (1990). Encapsulation of Parallelism in the Volcano Query Processing System. In SIGMOD, pages 102–111.

References II Harizopoulos, S., Abadi, D. J., Madden, S., and Stonebraker, M. (2008). OLTP through the looking glass, and what we found there. In SIGMOD, pages 981–992. Hennessy, J. L. and Patterson, D. A. (1996). Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2 edition. Hennessy, J. L. and Patterson, D. A. (2006). Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 4 edition. Hennessy, J. L. and Patterson, D. A. (2012). Computer Architecture - A Quantitative Approach. Morgan Kaufmann, 5 edition. Kallman, R., Kimura, H., Natkins, J., Pavlo, A., Rasin, A., Zdonik, S., Jones, E. P. C., Madden, S., Stonebraker, M., Zhang, Y., Hugg, J., and Abadi, D. J. (2008). H-store: A High-performance, Distributed Main Memory Transaction Processing System. PVLDB, 1(2):1496–1499. Kemper, A. and Neumann, T. (2011). HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In ICDE, pages 195–206.

References III Lau, E. and Madden, S. (2006). An Integrated Approach to Recovery and High Availability in an Updatable, Distributed Data Warehouse. In VLDB, pages 703–714. Manegold, S., Boncz, P. A., and Kersten, M. L. (2000). Optimizing Database Architecture for the New Bottleneck: Memory Access. The VLDB Journal, 9(3):231–246. Manegold, S., Kersten, M. L., and Boncz, P. (2009). chttp:// PVLDB, 2(2):1648–1653. Raman, V., Attaluri, G., Barber, R., Chainani, N., Kalmuk, D., KulandaiSamy, V., Leenstra, J., Lightstone, S., Liu, S., Lohman, G. M., Malkemus, T., Mueller, R., Pandis, I., Schiefer, B., Sharpe, D., Sidle, R., Storm, A., and Zhang, L. (2013). DB2 with BLU Acceleration: So Much More Than Just a Column Store. PVLDB, 6(11):1080–1091. Raman, V. and Swart, G. (2006). How to Wring a Table Dry: Entropy Compression of Relations and Querying of Compressed Relations. In VLDB, pages 858–869. Raman, V., Swart, G., Qiao, L., Reiss, F., Dialani, V., Kossmann, D., Narang, I., and Sidle, R. (2008). Constant-Time Query Processing. In ICDE, pages 60–69.

References IV R¨osch, P., Dannecker, L., Hackenbroich, G., and Faerber, F. (2012). A Storage Advisor for Hybrid-Store Databases. PVLDB, 5(12):1748–1758. Sikka, V., F¨arber, F., Lehner, W., Cha, S. K., Peh, T., and Bornh¨ovd, C. (2012). Efficient transaction processing in SAP HANA database: the end of a column store myth. In SIGMOD, pages 731–742. Stonebraker, M., Madden, S., Abadi, D. J., Harizopoulos, S., Hachem, N., and Helland, P. (2007). The end of an architectural era: (it’s time for a complete rewrite). In VLDB, pages 1150–1160. Zukowski, M. (2009). Balancing Vectorized Query Execution with Bandwidth-Optimized Storage. PhD thesis, CWI Amsterdam.