1 C-Store: A Column-oriented DBMS. New England Database Group (Stonebraker et al., Brandeis/Brown/MIT/UMass-Boston). Extended for Big Data Reading Group presentation by Shimin Chen.
2 M.I.T Relational Database. [Figure: a table with rows Record 1, Record 2, Record 3 and columns Attribute1, Attribute2, Attribute3.] E.g. Customer(cid, name, address, discount); Product(pid, name, manufacturer, price, quantity); Order(oid, cid, pid, quantity).
3 Current DBMS -- “Row Store”. [Figure: whole records stored one after another: Record 2, Record 4, Record 1, Record 3.] E.g. DB2, Oracle, Sybase, SQLServer, …
4 Row Stores are Write Optimized (use white board): Store fields in one record contiguously on disk. Use small (e.g. 4K) disk blocks. Use B-tree indexing. Align fields on byte or word boundaries. Assume shifting data values is costly. Transactions: write-ahead logging.
5 Row Stores are Write Optimized: Can insert and delete a record in one physical write. Good for on-line transaction processing (OLTP). But not for read-mostly applications: data warehouses, Customer Relationship Management (CRM), electronic library card catalogs, …
6 Column Stores
7 At 100K Feet…: Read-optimized: periodically a bulk load of new data, then a long period of ad-hoc queries. Benefit: when ad-hoc queries read 2 columns out of 20, a column store reads 10% of what a row store reads. Previous pioneering work: Sybase IQ (early ’90s); Monet (see CIDR ’05 for the most recent description).
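The 10% figure is simple arithmetic. A minimal sketch, assuming all 20 columns have the same width and no compression (both assumptions, not stated on the slide); `fraction_read` is a hypothetical helper name:

```python
def fraction_read(columns_needed: int, total_columns: int) -> float:
    """Fraction of the table a column store scans, assuming equal-width columns."""
    return columns_needed / total_columns

# A row store scans every field of every record, so its fraction is 1.0.
row_store_fraction = 1.0
column_store_fraction = fraction_read(2, 20)
ratio = column_store_fraction / row_store_fraction
```

With compression or columns of unequal width the ratio would differ; the sketch only restates the slide's back-of-the-envelope estimate.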
8 C-Store Technical Ideas: Data storage: only materialized views (perhaps many). Compress the columns to save space. No alignment. Big disk blocks. Innovative redundancy. Optimize for grid (cluster) computing. Focus on sorting, not indexing. Automatic physical DBMS design. Column optimizer and executor.
9 How to Evaluate This Paper…: None of the ideas in isolation merit publication. Judge the complete system by its (hopefully intelligent) choice of a small collection of inter-related powerful ideas that together put performance in a new sandbox.
10 Outline: Overview; Read-optimized column store; Query execution and optimization; Handling transactional updates; Performance; Summary.
11 Data Model: Projection (materialized view): some number of columns from a fact table, plus columns in a dimension table, with a 1-n join between the fact and dimension tables (conceptually); no duplicate elimination. Stored in order of the sort key(s). Note: the base table is not stored anywhere.
12 Example: Logical base tables: EMP (name, age, salary, dept); DEPT (dname, floor). Example projections, written as (columns | sort key): EMP1 (name, age | age); EMP2 (dept, age, DEPT.floor | DEPT.floor); EMP3 (name, salary | salary); DEPT1 (dname, floor | floor).
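The example projections can be sketched in a few lines. This is an illustration only: the sample rows, the floor numbers, and the `project` helper are made up, and the column `DEPT.floor` is shortened to `floor`.

```python
# Hypothetical sample data for the logical base tables.
EMP = [
    {"name": "Bob", "age": 25, "salary": 10_000, "dept": "Math"},
    {"name": "Bill", "age": 27, "salary": 50_000, "dept": "EECS"},
    {"name": "Jill", "age": 24, "salary": 80_000, "dept": "Biology"},
]
DEPT = {"Math": 1, "EECS": 2, "Biology": 3}  # dname -> floor

def project(rows, columns, sort_key):
    """Build a projection: one list per column, all stored in sort_key order."""
    ordered = sorted(rows, key=lambda r: r[sort_key])
    return {c: [r[c] for r in ordered] for c in columns}

# EMP1 (name, age | age)
EMP1 = project(EMP, ["name", "age"], "age")

# EMP2 (dept, age, DEPT.floor | DEPT.floor): floor comes from DEPT via the dept key.
joined = [dict(r, floor=DEPT[r["dept"]]) for r in EMP]
EMP2 = project(joined, ["dept", "age", "floor"], "floor")
```

Each projection keeps its columns as separate lists that share one ordering, which is what lets a query touch only the columns it needs.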
13 Optimize for Grid Computing: I.e. shared-nothing. Horizontal partitioning and intra-query parallelism as in Gamma. The paper talks about “Grid computers … may have tens to hundreds of nodes …”
14 Projection Detail #1: Each projection is horizontally partitioned into “segments”: each has a segment identifier; a segment is the unit of distribution and parallelism; partitioning is value-based, on key ranges of the sort key(s). Column-wise store inside each segment. Storage key: ordinal record number in the segment, calculated as needed.
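A minimal sketch of value-based partitioning and implicit storage keys. The boundary values, the function names, and the choice of 1-based storage keys are assumptions made for illustration.

```python
from bisect import bisect_right

def partition(sorted_keys, boundaries):
    """Split values already in sort-key order into segments by key range.

    boundaries[i] is the smallest key that belongs to segment i + 1.
    """
    segments = [[] for _ in range(len(boundaries) + 1)]
    for key in sorted_keys:
        segments[bisect_right(boundaries, key)].append(key)
    return segments

def locate(segments, segment_id, storage_key):
    """Fetch a value by (segment id, storage key).

    The storage key is just the ordinal position inside the segment
    (1-based here), so it is computed from the position and never stored.
    """
    return segments[segment_id][storage_key - 1]

segments = partition([21, 24, 25, 27, 33, 41], boundaries=[25, 40])
```

Here `segments` holds three key ranges, and `locate(segments, 1, 2)` returns the second record of the middle segment.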
15 Projection Detail #2: Different encoding schemes for different columns, depending on ordering and value distribution. Self-order, few distinct values: (value, position, num_entries). Foreign-order, few distinct values: (value, bitmap), where the bitmap is run-length encoded. Self-order, many distinct values: block-oriented, delta value encoding. Foreign-order, many distinct values: gzip.
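The two self-order encodings can be sketched as follows. This is an illustration under assumptions, not the C-Store implementation: positions are taken to be 1-based, and the block-oriented layout of the delta scheme is reduced to a single list.

```python
def rle_encode(sorted_column):
    """Self-order, few distinct values: (value, position, num_entries) triples.

    position is the 1-based index of the first occurrence of value.
    """
    triples = []
    for pos, v in enumerate(sorted_column, start=1):
        if triples and triples[-1][0] == v:
            value, first, n = triples[-1]
            triples[-1] = (value, first, n + 1)  # extend the current run
        else:
            triples.append((v, pos, 1))          # start a new run
    return triples

def delta_encode(block):
    """Self-order, many distinct values: first value, then successive differences."""
    return block[:1] + [b - a for a, b in zip(block, block[1:])]

def delta_decode(encoded):
    """Invert delta_encode by accumulating the differences."""
    out = []
    for d in encoded:
        out.append(d if not out else out[-1] + d)
    return out
```

For example, `rle_encode([4, 4, 4, 7, 7, 9])` yields one triple per run, and `delta_encode([1, 4, 7, 7, 8, 12])` stores small differences instead of the full values.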
16 Different Indexing
| | Few values | Many values |
| Sequential (self-order) | RLE encoded; conventional B-tree at the value level | Delta encoded; conventional B-tree at the block level |
| Non sequential (foreign-order) | Bitmap per value; conventional | Gzip; conventional B-tree at the block level |
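For the foreign-order, few-values case, a bitmap per value with run-length encoded bits might look like the sketch below. The function names and the (bit, run_length) pair format are assumptions chosen for illustration.

```python
from itertools import groupby

def bitmaps(column):
    """One bitmap per distinct value; bit i is 1 where column[i] equals the value."""
    return {v: [int(x == v) for x in column] for v in sorted(set(column))}

def rle_bits(bits):
    """Run-length encode a bitmap as (bit, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]

column = [0, 0, 1, 1, 2, 1, 0, 2, 1]
encoded = {v: rle_bits(b) for v, b in bitmaps(column).items()}
```

With few distinct values the bitmaps have long runs of zeros, which is where run-length encoding pays off.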
17 Big Disk Blocks: Tunable. Big (minimum size is 64K).
18 Reconstructing Base Table from Projections: Join index: projection T1 has M segments and projection T2 has N segments; T1 and T2 are on the same base table; the join index consists of M tables, one per T1 segment; each entry holds the segment ID and storage key of the corresponding record in T2. In general, multiple join indices are needed to reconstruct a base table. A join index is costly to store and maintain, so each column is expected to be in multiple projections, which reduces the number of join indices.
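A join index can be sketched as a list of (segment ID, storage key) pairs, one per T1 record. The sketch covers a single T1 segment and uses employee names as stand-in record identifiers; the names, the sample data, and the 1-based numbering are assumptions.

```python
def build_join_index(t1_order, t2_segments):
    """For each T1 record, in T1 order, give (segment_id, storage_key) of the
    same logical record in T2. Both numbers are 1-based in this sketch."""
    where = {
        rid: (sid, sk)
        for sid, segment in enumerate(t2_segments, start=1)
        for sk, rid in enumerate(segment, start=1)
    }
    return [where[rid] for rid in t1_order]

def permute(t2_column_segments, join_index):
    """Reorder a T2 column into T1 order by following the join index."""
    return [t2_column_segments[sid - 1][sk - 1] for sid, sk in join_index]

emp3_order = ["Bob", "Bill", "Jill"]       # EMP3 records, sorted by salary
emp1_segments = [["Jill", "Bob", "Bill"]]  # EMP1 records, sorted by age, one segment
emp1_age = [[24, 25, 27]]                  # EMP1.age, stored in EMP1 order

join_index = build_join_index(emp3_order, emp1_segments)
ages_in_emp3_order = permute(emp1_age, join_index)
```

Following the index turns EMP1's age column into EMP3's order, so the two projections can be stitched back into rows of the base table.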
19 Innovative Redundancy: Hardly any warehouse is recovered by redo from the log; it takes too long! Store enough projections to ensure K-safety. A column can be in K different projections. Rebuild dead objects from elsewhere in the network.
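One way to picture K-safety is as a check that every column is still reachable after any K sites fail. The sketch below is a deliberate simplification: it ignores segments and join indices, and the `placement` layout and function name are made up.

```python
from itertools import combinations

def is_k_safe(placement, columns, k):
    """placement maps a site name to the set of columns stored there.

    Returns True if every column survives the failure of any k sites.
    """
    sites = list(placement)
    for failed in combinations(sites, k):
        alive = set().union(*(placement[s] for s in sites if s not in failed))
        if not columns <= alive:
            return False
    return True

placement = {
    "site1": {"name", "age"},
    "site2": {"name", "salary"},
    "site3": {"age", "salary"},
}
columns = {"name", "age", "salary"}
```

In this layout each column lives on two sites, so one failure is tolerated and two are not.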
20 Automatic Physical DBMS Design: Accept a “training set” of queries and a space budget. Choose the projections and join indices auto-magically. Re-optimize periodically based on a log of the interactions.
21 Outline: Overview; Read-optimized column store; Query execution and optimization; Handling transactional updates; Performance; Summary.
22 Operators: Decompress. Select: generate a bitstring. Mask: bitstring + projection -> selected rows. Project: choose a subset of columns. Concat: combine multiple projections that are sorted in the same order. Sort. Permute: reorder according to a join index. Join. Aggregation operators. Bitstring operators.
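The Select, Mask, and bitstring operators compose naturally. A minimal sketch, assuming bitstrings are plain lists of 0/1 and a projection is a dict of column lists (both representations are assumptions):

```python
def select(column, predicate):
    """Select: produce a bitstring with 1 where the predicate holds."""
    return [1 if predicate(v) else 0 for v in column]

def bit_and(a, b):
    """Bitstring operator: combine two selections."""
    return [x & y for x, y in zip(a, b)]

def mask(bitstring, projection):
    """Mask: keep only the rows of the projection whose bit is 1."""
    return {
        name: [v for bit, v in zip(bitstring, col) if bit]
        for name, col in projection.items()
    }

emp1 = {"name": ["Jill", "Bob", "Bill"], "age": [24, 25, 27]}
older = select(emp1["age"], lambda a: a > 24)
not_bob = select(emp1["name"], lambda n: n != "Bob")
result = mask(bit_and(older, not_bob), emp1)
```

Working on bitstrings lets several predicates be evaluated column by column and combined cheaply before any rows are materialized.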
23 Execution: Query plan: a tree of operators (data flow). Leaf: accesses the data storage. Internal: calls “get_next”. Operators return 64KB blocks.
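The pull-based plan can be sketched with two operators. The class names and the `drain` helper are hypothetical, and a block is modeled as up to four values rather than 64KB so the behavior is easy to see.

```python
BLOCK_VALUES = 4  # assumed stand-in for a 64KB block

class Scan:
    """Leaf operator: reads a stored column block by block."""
    def __init__(self, column):
        self.column = column
        self.pos = 0

    def get_next(self):
        if self.pos >= len(self.column):
            return None  # end of data
        block = self.column[self.pos:self.pos + BLOCK_VALUES]
        self.pos += len(block)
        return block

class Filter:
    """Internal operator: pulls blocks from its child via get_next."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def get_next(self):
        while True:
            block = self.child.get_next()
            if block is None:
                return None
            kept = [v for v in block if self.predicate(v)]
            if kept:  # skip blocks that filter down to nothing
                return kept

def drain(op):
    """Pull blocks from the root of a plan until it is exhausted."""
    out = []
    while True:
        block = op.get_next()
        if block is None:
            return out
        out.append(block)
```

A plan is built by nesting operators, for example `drain(Filter(Scan(list(range(10))), lambda v: v % 2 == 0))`. The filter here returns partially filled blocks, which a real executor would likely repack.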
24 Column Optimizer (discussion): Cost-based estimation for query plan construction. Chooses projections on which to run the query. Cost model includes compression types. When to perform the “mask” operator. Build in snowflake schemas, which are simple to optimize without exhaustive search. Looking at extensions.
25 Outline: Overview; Read-optimized column store; Query execution and optimization; Handling transactional updates; Performance; Summary.
26 Online Updates Are Necessary: Transactional updates are necessary even in a read-mostly environment. Online updates for error corrections. Real-time data warehouses: reduce the delay between the OLTP system and the warehouse towards zero.
27 Solution – a Hybrid Store: Write-optimized column store -> tuple mover (batch rebuilder) -> read-optimized column store (what we have been talking about so far).
28 Write Store: Column store. Horizontally partitioned in the same way as the read store: 1:1 mapping between RS segments and WS segments. Storage keys are explicitly stored: a B-tree maps sort key -> storage key. No compression (the data size is small).
29 Handling Updates: Optimize read-only queries: do not hold locks. Snapshot isolation: the query is run on a snapshot of the data; ensure transactions related to this snapshot have already committed. Each WS site keeps an insertion vector (with timestamps) and a deletion vector (updates become insertions and deletions). Maintain a high water mark and a low water mark of WS sites. HWM: all transactions before HWM have committed. LWM: no records in the read store are inserted after LWM. Queries can specify a time before HWM.
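The visibility rule behind snapshot reads can be sketched as below. This is an illustration under assumptions: timestamps are integer epochs, a deletion epoch of 0 means "not deleted", and the record layout and function names are made up.

```python
def visible(insert_epoch, delete_epoch, query_epoch):
    """A record is in the snapshot if it was inserted at or before the query
    epoch and is either not deleted (0) or deleted only afterwards."""
    return insert_epoch <= query_epoch and (
        delete_epoch == 0 or delete_epoch > query_epoch
    )

def snapshot(records, query_epoch, hwm):
    """records: list of (value, insert_epoch, delete_epoch) tuples.

    Queries may only ask for a time that is not past the high water mark.
    """
    if query_epoch > hwm:
        raise ValueError("query epoch is past the high water mark")
    return [v for v, ins, dele in records if visible(ins, dele, query_epoch)]

# An update at epoch 5 becomes a deletion plus an insertion.
records = [
    ("Bob, age 25", 1, 5),  # old version, deleted at epoch 5
    ("Bob, age 26", 5, 0),  # new version, inserted at epoch 5
    ("Jill, age 24", 2, 0),
]
```

A query at epoch 4 sees the old version of Bob's record and a query at epoch 5 sees the new one, with no locks taken in either case.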
30 HWM and epochs: TA: the time authority updates the coarse timer (epochs).
31 Transactions: Undo from a log (that does not need to be persistent). Redo by rebuild from elsewhere in the network.
32 Tuple-Mover: Read the RS segment. Combine the WS segment into a new version of the RS segment; do not update in place. Record the last move time for this segment in WS (T_last_move LWM). The time authority periodically sends out a new LWM epoch number.
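A merge-out step along these lines might be sketched as follows. It is a simplification under stated assumptions: records are reduced to their sort-key values, a deletion epoch of 0 means "not deleted", and deletions that happen after the LWM are not carried over into the new segment.

```python
import heapq

def move_tuples(rs_segment, ws_segment, lwm):
    """rs_segment: sort-key values in sorted order.
    ws_segment: list of (key, insert_epoch, delete_epoch) tuples.

    Returns (new_rs, remaining_ws); the old RS segment is left untouched.
    """
    movable = [r for r in ws_segment if r[1] <= lwm]
    remaining = [r for r in ws_segment if r[1] > lwm]
    # Records already deleted at or before the LWM are simply dropped.
    survivors = sorted(
        key for key, _, deleted in movable if deleted == 0 or deleted > lwm
    )
    new_rs = list(heapq.merge(rs_segment, survivors))
    return new_rs, remaining

old_rs = [10, 20, 30]
ws = [(15, 1, 0), (25, 2, 2), (35, 4, 0), (5, 3, 0)]
new_rs, remaining_ws = move_tuples(old_rs, ws, lwm=3)
```

Building a new segment instead of updating in place means readers of the old segment are never disturbed; the system can switch over once the new version is complete.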
33 Current Performance: Varying storage: 100X a popular row store in 40% of the space; 10X a popular column store in 70% of the space; 7X a popular row store in 1/6th of the space. Code available with BSD license.
34 Summary: Column store is optimized for read queries. Cluster parallelism. Interesting data organization. Handling write queries.