Cloud Computing Lecture Column Store – alternative organization for big relational data
C-store
- C-store is read-optimized, for OLAP-type applications
- Traditional DBMS: write-optimized (optimized for online transactions)
  - Based on records (rows)
C-Store
- What are the major cost factors in query processing?
  - Size of database
  - Index or not
  - Join
- Current hardware configuration and what a DBMS can do…
  - Cheap storage – allows distributed, redundant data stores
  - Fast CPUs – compression/decompression
  - Limited disk bandwidth – reduce I/O
C-store
- Supporting OLAP (online analytical processing) operations
  - Optimized read operations
  - Balanced write performance
- Address the conflict between writes and reads
  - Fast write – append records
  - Fast read – indexed, compressed
- Think: if data is organized in columns, what are the unique challenges (compared with row organization)?
C-store’s features
- Column-based store saves space
  - Compression is possible
  - Index size is smaller
- Multiple projections
  - Allow multiple indices
  - Parallel processing on the same attributes
  - Materialized join results
- Separation of writeable store and read-optimized store
  - Both writes and reads are optimized
  - Transactions are not blocked by write locks
Data model
- Same as relational data model
  - Tables, rows, columns
  - Primary keys and foreign keys
- Projections
  - From a single table
  - From multiple joined tables
- Example
  - Normal relational model: EMP(name, age, dept, salary), DEPT(dname, floor)
  - Possible C-store model: EMP1(name, age), EMP2(dept, age, DEPT.floor), EMP3(name, salary), DEPT1(dname, floor)
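The example above can be sketched in a few lines of Python. This is only an illustration: the sample rows, the choice of sort keys, and the helper `project` are assumptions made for the sketch, not part of the lecture or of C-store itself.

```python
# Hypothetical sample rows for EMP(name, age, dept, salary) and DEPT(dname, floor).
EMP = [
    {"name": "Ann", "age": 34, "dept": "Sales", "salary": 50000},
    {"name": "Bob", "age": 25, "dept": "R&D", "salary": 60000},
    {"name": "Cat", "age": 41, "dept": "Sales", "salary": 70000},
]
DEPT = [{"dname": "Sales", "floor": 1}, {"dname": "R&D", "floor": 2}]


def project(rows, columns, sort_key):
    """Build a column-wise projection: one list per column, rows ordered by sort_key."""
    ordered = sorted(rows, key=lambda r: r[sort_key])
    return {c: [r[c] for r in ordered] for c in columns}


# EMP1(name, age): projection of a single table, sorted on age (assumed sort key).
emp1 = project(EMP, ["name", "age"], sort_key="age")

# EMP2(dept, age, DEPT.floor): projection over EMP joined with DEPT,
# sorted on DEPT.floor (assumed sort key).
floor_of = {d["dname"]: d["floor"] for d in DEPT}
joined = [{**r, "DEPT.floor": floor_of[r["dept"]]} for r in EMP]
emp2 = project(joined, ["dept", "age", "DEPT.floor"], sort_key="DEPT.floor")

print(emp1)  # {'name': ['Bob', 'Ann', 'Cat'], 'age': [25, 34, 41]}
print(emp2)
```

Each projection stores only the columns it needs, and the same attribute (here `age`) can appear in several projections under different orderings.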
Physical projection organization
- Sort key
  - Each projection has one
  - Rows are ordered by sort key
  - Partitioned by key range
- Linking columns in the same projection
  - Storage key – (segment id, key), where the key is the offset in the segment
- Linking projections
  - To reconstruct a table
  - Join index
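Conceptual organization
[Figure: Projection 1 and Projection 2, each made of columns including a sort key column and split into segments by sort key range; a join index of (seg id, offset) entries links the two projections.]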
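A join index between two projections can be sketched as follows. The sample rows, the single segment with id 0, and the function `reconstruct` are assumptions made for illustration; only the idea of storing a (segment id, offset) entry per row comes from the slides.

```python
# Hypothetical rows of a logical table (name, age, salary).
rows = [("Ann", 34, 70000), ("Bob", 25, 60000), ("Cat", 41, 50000)]

# Row ids in the physical order of each projection.
order1 = sorted(range(len(rows)), key=lambda i: rows[i][1])  # EMP1 sorted by age
order3 = sorted(range(len(rows)), key=lambda i: rows[i][2])  # EMP3 sorted by salary

emp1 = {"name": [rows[i][0] for i in order1], "age": [rows[i][1] for i in order1]}
emp3 = {"name": [rows[i][0] for i in order3], "salary": [rows[i][2] for i in order3]}

# Join index from EMP1 to EMP3: entry k gives the (segment id, offset) in EMP3
# of the row stored at offset k in EMP1. Everything lives in segment 0 here.
offset_in_emp3 = {row_id: off for off, row_id in enumerate(order3)}
join_index = [(0, offset_in_emp3[row_id]) for row_id in order1]


def reconstruct():
    """Rebuild (name, age, salary) rows in EMP1 order by following the join index."""
    out = []
    for k, (_segment, offset) in enumerate(join_index):
        out.append((emp1["name"][k], emp1["age"][k], emp3["salary"][offset]))
    return out


print(join_index)     # [(0, 1), (0, 2), (0, 0)]
print(reconstruct())  # [('Bob', 25, 60000), ('Ann', 34, 70000), ('Cat', 41, 50000)]
```

Within one projection no such index is needed: columns share the same order, so the offset alone links the values of a row.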
Architectural consideration between writes and reads
- Reads often need indices for speedup
- Writes are often index-unfriendly: indices must be updated frequently
- Use a “read store” and a “write store”
Read store: column encoding
- Use compression schemes and indices
- Self-order (key), few distinct values
  - (value, first position, number of items)
  - Indexed by clustered B-tree
- Foreign-order (non-key), few distinct values
  - (value, bitmap index)
  - B-tree index: maps positions to values
- Self-order, many distinct values
  - Delta from the previous value
  - B-tree index
- Foreign-order, many distinct values
  - Unencoded
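The three encodings listed above can be sketched as small Python functions. This is a minimal illustration, not C-store's storage format: the function names are made up, positions are 0-based, bitmaps are plain lists of 0/1, and the B-tree indices are omitted.

```python
from itertools import accumulate


def rle_encode(col):
    """Self-order, few distinct values: runs of (value, first position, count)."""
    runs = []
    for pos, v in enumerate(col):
        if runs and runs[-1][0] == v:
            value, first, count = runs[-1]
            runs[-1] = (value, first, count + 1)
        else:
            runs.append((v, pos, 1))
    return runs


def bitmap_encode(col):
    """Foreign-order, few distinct values: one bitmap per distinct value."""
    bitmaps = {}
    for pos, v in enumerate(col):
        bitmaps.setdefault(v, [0] * len(col))[pos] = 1
    return bitmaps


def delta_encode(col):
    """Self-order, many distinct values: first value, then differences."""
    if not col:
        return []
    return [col[0]] + [b - a for a, b in zip(col, col[1:])]


def delta_decode(deltas):
    """Inverse of delta_encode: running sum of the stored differences."""
    return list(accumulate(deltas))


print(rle_encode([1, 1, 1, 2, 2, 5]))     # [(1, 0, 3), (2, 3, 2), (5, 5, 1)]
print(bitmap_encode(["a", "b", "a"]))     # {'a': [1, 0, 1], 'b': [0, 1, 0]}
print(delta_encode([100, 104, 105, 110])) # [100, 4, 1, 5]
```

Run-length and delta encoding pay off only when the column is stored in its own sort order, which is why foreign-order columns fall back to bitmaps or no encoding.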
Write store
- Same structure, but explicitly uses (segment, key) to identify records
  - Easier to maintain the mapping
  - Only concerns the inserted records
- Tuple mover
  - Copies batches of records to the read store (RS)
- Deleting a record
  - Mark it in RS
  - Purged by the tuple mover
Tuple mover
- Moves records from the write store (WS) to the read store (RS)
- Happens between read-only transactions
- Uses a merge-out process
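The merge-out step can be pictured as merging a sorted read-store segment with a batch of write-store records while dropping records marked as deleted. The sketch below assumes rows are tuples whose first field is the sort key and that deletions are tracked as a set of keys; both are simplifications chosen for illustration, and `merge_out` is a hypothetical name.

```python
import heapq


def merge_out(rs_rows, ws_rows, deleted_keys):
    """Merge sorted RS rows with WS rows into a new RS segment.

    Rows whose sort key (first field) is in deleted_keys are purged.
    """
    merged = heapq.merge(rs_rows, sorted(ws_rows))  # RS is already sorted
    return [row for row in merged if row[0] not in deleted_keys]


rs = [(10, "Ann"), (20, "Bob"), (40, "Cat")]   # read store, sorted by key
ws = [(30, "Dan"), (15, "Eve")]                # write store, insertion order
new_rs = merge_out(rs, ws, deleted_keys={20})  # key 20 was marked deleted
ws = []                                        # moved records leave the WS

print(new_rs)  # [(10, 'Ann'), (15, 'Eve'), (30, 'Dan'), (40, 'Cat')]
```

Writing a fresh segment instead of updating the old one in place keeps the read store sorted and compressed without per-record index maintenance.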
How to solve read/write conflict
- Situation: one transaction updates the record X, while another transaction reads X
- Use snapshot isolation
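One way to picture snapshot isolation is to tag every record version with the epoch at which it was inserted and, if applicable, deleted; a reader then sees only the versions that were live at its snapshot epoch. The sketch below is a simplified illustration under that assumption. The `Version` class, the integer epochs, and the helper names are hypothetical and not taken from the lecture.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Version:
    value: str
    inserted_at: int                  # epoch when this version was written
    deleted_at: Optional[int] = None  # epoch when it was deleted; None if live


def visible(v, snapshot):
    """A version is visible if inserted at or before the snapshot and not yet deleted."""
    return v.inserted_at <= snapshot and (v.deleted_at is None or v.deleted_at > snapshot)


def read(versions, snapshot):
    """Return the values a reader sees as of the given snapshot epoch."""
    return [v.value for v in versions if visible(v, snapshot)]


# An update of X at epoch 5 is modeled as: delete the old version, insert a new one.
x_versions = [Version("x-old", inserted_at=1, deleted_at=5),
              Version("x-new", inserted_at=5)]

print(read(x_versions, snapshot=4))  # ['x-old']  reader started before the update
print(read(x_versions, snapshot=5))  # ['x-new']  reader started after the update
```

The reader never waits for the writer: it simply ignores versions outside its snapshot, so no read lock is needed.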
Benefits in query processing
- Selection – has more indices to use
- Projection – some “projections” are already defined
- Join – some projections are materialized joins
- Aggregation – works on required columns only
Evaluation
- Uses TPC-H – decision support queries
- Storage
Query performance
Row store uses materialized views
Summary: the performance gain
- Column representation – avoids reads of unused attributes
- Storing overlapping projections – multiple orderings of a column, more choices for query optimization
- Compression of data – more orderings of a column in the same amount of space
- Query operators operate on compressed representation
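The last point can be made concrete with a run-length-encoded column: aggregates are computed directly from the runs, without expanding them back into individual values. The run layout (value, first position, count) and the function names below are assumptions made for this sketch.

```python
# A hypothetical column [10, 10, 10, 20, 20, 50] stored as run-length runs.
runs = [(10, 0, 3), (20, 3, 2), (50, 5, 1)]  # (value, first position, count)


def rle_sum(runs):
    """SUM over the column: one multiplication per run instead of one add per row."""
    return sum(value * count for value, _first, count in runs)


def rle_count_where(runs, predicate):
    """COUNT(*) with a filter: the predicate is tested once per run."""
    return sum(count for value, _first, count in runs if predicate(value))


print(rle_sum(runs))                             # 120
print(rle_count_where(runs, lambda v: v >= 20))  # 3
```

The work is proportional to the number of runs rather than the number of rows, so better compression also means less CPU time per query.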