
1 CPT-S 580-06 Advanced Databases 11 Yinghui Wu EME 49

2 In-memory databases

3 We will talk about: the past and the present, cache awareness, processing models, storage models, and the design of main-memory DBMSs for database applications.

4 The Past - Computer Architecture Data taken from [Hennessy and Patterson, 1996]


7 The Past - Database Systems Main-memory capacity is limited to several megabytes → Only a small fraction of the database fits in main memory. And disk storage is "huge" → Traditional database systems use disk as primary storage. But disk latency is high → Parallel query processing to hide disk latencies → Choose a proper buffer replacement strategy to reduce I/O → Architectural properties inherited from System R, the first "real" relational DBMS, from the 1970s...

8 The Present - Computer Architecture Data taken from [Hennessy and Patterson, 2012]


10 The Present - Database Systems Hundreds to thousands of gigabytes of main memory are available → Up to 10^6 times more capacity! → A complete database of less than a terabyte can be kept in main memory → Use main memory as primary storage for the database and remove disk access as the main performance bottleneck. But the architecture of traditional DBMSs is designed for disk-oriented database systems → "30 years of Moore's law have antiquated the disk-oriented relational architecture for OLTP applications." [Stonebraker et al., 2007]

11 Disk-based vs. Main-Memory DBMS

12 Disk-based vs. Main-Memory DBMS (2) ATTENTION: Main-memory storage != No Durability → ACID properties still have to be guaranteed → However, there are new ways of guaranteeing them, such as a second machine in hot standby

13 Disk-based vs. Main-Memory DBMS (3) Having the database in main memory allows us to remove the buffer manager and paging → Removes a level of indirection → Results in better performance

14 Disk-based vs. Main-Memory DBMS (4) Disk bottleneck is removed as database is kept in main memory → Access to main memory becomes new bottleneck

15 The New Bottleneck: Memory Access Picture taken from [Manegold et al., 2000] Accessing main memory is much more expensive than accessing CPU registers. → Is main memory the new disk?

16 Rethink the Architecture of DBMSs Even if the complete database fits in main memory, there are significant overheads in traditional, System R-like DBMSs: many function calls → stack manipulation overhead (which can be reduced by function inlining) and instruction-cache misses; adverse memory access → data-cache misses. → Be aware of the caches!

17 Cache awareness

18 A Motivating Example (Memory Access) Task: sum up all entries in a two-dimensional array.
Alternative 1 (row-wise traversal):
    for (r = 0; r < rows; r++)
        for (c = 0; c < cols; c++)
            sum += src[r * cols + c];
Alternative 2 (column-wise traversal):
    for (c = 0; c < cols; c++)
        for (r = 0; r < rows; r++)
            sum += src[r * cols + c];
Both alternatives touch the same data, but in a different order.
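A minimal, self-contained C sketch of the two alternatives, in case you want to time them yourself; the matrix dimensions, the initialization, and the use of clock() are illustrative assumptions rather than part of the slides:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Sum a rows x cols matrix stored in row-major order, either row by row
       (sequential, cache-friendly) or column by column (strided, cache-hostile). */
    static long long sum_rowwise(const int *src, size_t rows, size_t cols) {
        long long sum = 0;
        for (size_t r = 0; r < rows; r++)
            for (size_t c = 0; c < cols; c++)
                sum += src[r * cols + c];   /* consecutive addresses */
        return sum;
    }

    static long long sum_colwise(const int *src, size_t rows, size_t cols) {
        long long sum = 0;
        for (size_t c = 0; c < cols; c++)
            for (size_t r = 0; r < rows; r++)
                sum += src[r * cols + c];   /* jumps cols * sizeof(int) bytes per access */
        return sum;
    }

    int main(void) {
        size_t rows = 10000, cols = 10000;   /* about 400 MB; shrink if memory is tight */
        int *src = malloc(rows * cols * sizeof *src);
        if (!src) return 1;
        for (size_t i = 0; i < rows * cols; i++) src[i] = 1;

        clock_t t0 = clock();
        long long s1 = sum_rowwise(src, rows, cols);
        clock_t t1 = clock();
        long long s2 = sum_colwise(src, rows, cols);
        clock_t t2 = clock();

        printf("row-wise:    sum=%lld  %.2f s\n", s1, (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("column-wise: sum=%lld  %.2f s\n", s2, (double)(t2 - t1) / CLOCKS_PER_SEC);
        free(src);
        return 0;
    }

On typical hardware the column-wise variant runs several times slower, because almost every access misses the cache.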

19 A Motivating Example (Memory Access) [Figure: total execution time as a function of rows and cols, each ranging from 10^0 to 10^9; the contour labels range from about 1 s to more than 100 s depending on the traversal order.]

20 Memory Wall Data taken from [Hennessy and Patterson, 2012] [Figure: normalized performance of processors vs. DRAM memory, 1980 to 2010, on a log scale from 1 to 100,000; the gap between the two curves keeps widening.]

21 Hardware Trends There is an increasing gap between CPU and memory speeds. Also called the memory wall. CPUs spend much of their time waiting for memory. How can we break the memory wall and better utilize the CPU?

22 Memory Hierarchy
    level:       CPU       | L1 cache  | L2 cache  | main memory | disk
    technology:            | SRAM      | SRAM      | DRAM        |
    capacity:    bytes     | kilobytes | megabytes | gigabytes   |
    latency:     < 1 ns    | ≈ 1 ns    | < 10 ns   | 70–100 ns   |
Some systems also use a 3rd-level cache. → Caches resemble the buffer manager but are controlled by hardware.

23 Principle of Locality Caches take advantage of the principle of locality: the hot set of data often fits into caches, and 90% of execution time is spent in 10% of the code. Spatial locality: related data is often spatially close; code often contains loops. Temporal locality: programs tend to re-use data frequently; code may call a function repeatedly, even if it is not spatially close.

24 CPU Cache Internals To guarantee speed, the overhead of caching must be kept reasonable. Organize the cache in cache lines. Only load/evict full cache lines. Typical cache line size: 64 bytes. The organization in cache lines is consistent with the principle of (spatial) locality.

25 Memory Access On every memory access, the CPU checks if the respective cache line is already cached. Cache Hit: Read data directly from the cache. No need to access lower-level memory. Cache Miss: Read full cache line from lower-level memory. Evict some cached block and replace it by the newly read cache line. CPU stalls until data becomes available. Modern CPUs support out-of-order execution and several in-flight cache misses.

26 Example: AMD Opteron Data taken from [Hennessy and Patterson, 2006] Example: AMD Opteron, 2.8 GHz, PC3200 DDR SDRAM L1 cache: separate data and instruction caches, each 64 kB, 64 B cache lines L2 cache: shared cache, 1 MB, 64 B cache lines L1 hit latency: 2 cycles (≈ 1 ns) L2 hit latency: 7 cycles (≈ 3.5 ns) L2 miss latency: 160–180 cycles (≈ 60 ns)

27 Block Placement: Direct-Mapped Cache In a direct-mapped cache, a block has only one place it can appear in the cache. Much simpler to implement. Easier to make fast. Increases the chance of conflicts. Example: place memory block 12 in cache line 4 (4 = 12 mod 8).
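A tiny C sketch of the placement rule; the helper name is made up for illustration (real hardware does this with bit masking, not an explicit function):

    #include <stdio.h>

    /* Direct-mapped placement: a block can live in exactly one line,
       namely block number mod number of cache lines. */
    static unsigned direct_mapped_line(unsigned block, unsigned num_lines) {
        return block % num_lines;
    }

    int main(void) {
        /* The slide's example: 8 cache lines, memory block 12 maps to line 4. */
        printf("block 12 -> line %u\n", direct_mapped_line(12, 8));
        return 0;
    }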

28 Block Placement: Fully Associative Cache In a fully associative cache, a block can be loaded into any cache line. Provides freedom to the block replacement strategy. Does not scale to large caches → 4 MB cache with 64 B lines: 65,536 cache lines.

29 Block Placement: Set-Associative Cache Set-associative caches are a compromise: cache lines are grouped into sets, each memory block maps to one set, and a block can be placed anywhere within its set. Most processor caches today are set-associative. Example: place block 12 anywhere in set 0 (0 = 12 mod 4).

30 Example: Intel Q6700 (Core 2 Quad) Total cache size: 4 MB (per 2 cores). Cache line size: 64 bytes → 6-bit offset (2^6 = 64) → there are 65,536 cache lines in total (4 MB ÷ 64 bytes). Associativity: 16-way set-associative → there are 4,096 sets (65,536 ÷ 16 = 4,096) → 12-bit set index (2^12 = 4,096). Maximum physical address space: 64 GB → 36 address bits are enough (2^36 bytes = 64 GB) → 18-bit tags (36 − 12 − 6 = 18). Address layout: 18-bit tag | 12-bit set index | 6-bit offset.
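A short C sketch that splits a physical address into tag, set index, and offset using the Q6700 numbers above; the sample address is hypothetical, and the hardware of course does this wiring implicitly:

    #include <stdio.h>
    #include <stdint.h>

    #define OFFSET_BITS 6    /* 64 B cache lines */
    #define SET_BITS    12   /* 4,096 sets       */

    int main(void) {
        uint64_t addr   = 0x3ABCDE123ULL;  /* hypothetical 36-bit physical address */
        uint64_t offset =  addr                 & ((1ULL << OFFSET_BITS) - 1);
        uint64_t set    = (addr >> OFFSET_BITS) & ((1ULL << SET_BITS) - 1);
        uint64_t tag    =  addr >> (OFFSET_BITS + SET_BITS);
        printf("addr=0x%llx  tag=0x%llx  set=%llu  offset=%llu\n",
               (unsigned long long)addr, (unsigned long long)tag,
               (unsigned long long)set, (unsigned long long)offset);
        return 0;
    }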

31 Block Replacement When bringing in new cache lines, an existing entry has to be evicted: Least Recently Used (LRU): evict the cache line whose last access is longest ago → least likely to be needed any time soon. First In First Out (FIFO): often behaves similarly to LRU, but is easier to implement. Random: pick a random cache line to evict; very simple to implement in hardware. Replacement has to be decided in hardware and fast.
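For intuition, a minimal sketch of LRU victim selection for one set of a 4-way cache, tracked with logical timestamps; this is purely illustrative, since real hardware uses much cheaper approximations:

    #include <stdio.h>
    #include <stddef.h>

    #define WAYS 4

    typedef struct {
        unsigned long tag[WAYS];
        unsigned long last_used[WAYS];  /* logical time of the last access per way */
    } cache_set_t;

    /* Return the way whose last access lies furthest in the past. */
    static size_t lru_victim(const cache_set_t *s) {
        size_t victim = 0;
        for (size_t w = 1; w < WAYS; w++)
            if (s->last_used[w] < s->last_used[victim])
                victim = w;
        return victim;
    }

    int main(void) {
        cache_set_t set = { { 0x10, 0x20, 0x30, 0x40 }, { 7, 3, 9, 5 } };
        size_t v = lru_victim(&set);
        /* Way 1 was touched at time 3, longest ago, so it becomes the victim. */
        printf("evict way %zu (tag 0x%lx)\n", v, set.tag[v]);
        return 0;
    }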

32 What Happens on a Write? To implement memory writes, CPU makers have two options: Write through: data is directly written to lower-level memory (and to the cache) → writes will stall the CPU → greatly simplifies data coherency. Write back: data is only written into the cache; a dirty flag marks modified cache lines (remember the status field) → may reduce traffic to lower-level memory → need to write back on eviction of dirty cache lines. Modern processors usually implement write back.

33 Putting it all Together To compensate for slow memory, systems use caches. DRAM provides high capacity, but long latency. SRAM has better latency, but low capacity. Typically multiple levels of caching (memory hierarchy). Caches are organized into cache lines. Set associativity: A memory block can only go into a small number of cache lines (most caches are set-associative). Systems will benefit from locality of data and code.

34 Why do DBMSs show such poor cache behavior? Poor code locality: polymorphic functions (e.g., resolving attribute types for each processed tuple individually); the Volcano iterator model (pipelining), where each tuple is passed through a query plan composed of many operators.

35 Why do DBMSs show such poor cache behavior? Poor data locality: database systems are designed to navigate through large data volumes quickly; navigating an index tree, for example, by design means to "randomly" visit any of the (many) child nodes.

36 Storage models: row-stores and column-stores

37 Row-stores a.k.a. row-wise storage or n-ary storage model (NSM): the tuples (a1, b1, c1, d1), (a2, b2, c2, d2), (a3, b3, c3, d3), (a4, b4, c4, d4) are stored one after the other; page 0 holds a1 b1 c1 d1 a2 b2 c2 d2 a3 b3 c3 d3, page 1 holds a4 b4 c4 d4.

38 Column-stores a.k.a. column-wise storage or decomposition storage model (DSM): the values of each column are stored contiguously; page 0 holds a1 a2 a3 a4, page 1 holds b1 b2 b3 b4, and so on.
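The difference is easy to express in C; this is just an illustrative sketch of the two layouts for the four attributes a to d from the slides, not code from any particular system:

    #include <stdio.h>

    #define N 4

    /* Row-wise (NSM): one record per tuple, all attributes of a tuple adjacent. */
    struct row { int a, b, c, d; };
    static struct row nsm[N];

    /* Column-wise (DSM): one array per attribute, all values of a column adjacent. */
    static struct { int a[N], b[N], c[N], d[N]; } dsm;

    int main(void) {
        for (int i = 0; i < N; i++) {
            nsm[i].a = i; nsm[i].b = 10 + i; nsm[i].c = 20 + i; nsm[i].d = 30 + i;
            dsm.a[i] = i; dsm.b[i] = 10 + i; dsm.c[i] = 20 + i; dsm.d[i] = 30 + i;
        }
        /* Scanning attribute a strides over whole tuples in NSM,
           but walks one dense array in DSM. */
        long sum_nsm = 0, sum_dsm = 0;
        for (int i = 0; i < N; i++) { sum_nsm += nsm[i].a; sum_dsm += dsm.a[i]; }
        printf("%ld %ld\n", sum_nsm, sum_dsm);
        return 0;
    }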

39 The effect on query processing Consider, e.g., a selection query: SELECT COUNT(*) FROM lineitem WHERE l_shipdate = '2016-01-25' This query typically involves a full table scan.


42 A full table scan in a row-store In a row-store, all rows of a table are stored sequentially on a database page. With every access to an l_shipdate field, we load a large amount of irrelevant information into the cache.


44 A "full table scan" on a column-store In a column-store, all values of one column are stored sequentially on a database page. All data loaded into caches by an "l_shipdate scan" is now actually relevant for the query.

45 Column-store advantages All data loaded into caches by an "l_shipdate scan" is now actually relevant for the query. → Less data has to be fetched from memory. → The cost of a fetch is amortized over more tuples. → If we're really lucky, the full l_shipdate column might now even fit into caches. The same arguments hold, by the way, for disk-based systems as well. Additional benefit: data compression might work better.

46 Column-store trade-offs Tuple recombination can cause considerable cost. Need to perform many joins. Workload-dependent trade-off. Source: [Copeland and Khoshafian, 1985]

47 An example: Binary Association Tables in MonetDB MonetDB makes this explicit in its data model. All tables in MonetDB have two columns ("head" and "tail"). Example relation: NAME, AGE, SEX = (John, 34, m), (Angelina, 31, f), (Scott, 35, m), (Nancy, 33, f).

48 An example: Binary Association Tables in MonetDB Each column yields one binary association table (BAT): (oid, NAME) = (o1, John), (o2, Angelina), (o3, Scott), (o4, Nancy); (oid, AGE) = (o1, 34), (o2, 31), (o3, 35), (o4, 33); (oid, SEX) = (o1, m), (o2, f), (o3, m), (o4, f). Object identifiers (oids) identify matching entries (BUNs). Often, oids can be implemented as virtual oids (voids) → not explicitly materialized in memory.
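A small C sketch of what virtual oids buy: the head column never exists in memory, because an entry's oid is simply its position in the tail array. The arrays and the lookup mirror the slide's NAME/AGE/SEX relation; none of this is MonetDB's actual internal code:

    #include <stdio.h>

    #define N 4

    /* Three BATs with void heads: only the tail values are materialized. */
    static const char *name[N] = { "John", "Angelina", "Scott", "Nancy" };
    static const int   age[N]  = { 34, 31, 35, 33 };
    static const char  sex[N]  = { 'm', 'f', 'm', 'f' };

    int main(void) {
        /* Reconstructing tuple o3 is a positional lookup in each BAT. */
        int pos = 2;   /* o3 is the third BUN, i.e. position 2 */
        printf("o%d: %s, %d, %c\n", pos + 1, name[pos], age[pos], sex[pos]);
        return 0;
    }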

49 Processing models

50 Processing Models There are basically two alternative processing models used in modern DBMSs: tuple-at-a-time (Volcano model) [Graefe, 1990]: an operator requests the next tuple, processes it, and passes it to the next operator; operator-at-a-time (bulk processing) [Manegold et al., 2009]: an operator consumes its entire input and materializes its output.

51 Tuple-At-A-Time Processing Most systems implement the Volcano iterator model: operators request tuples from their input using next(); data is processed one tuple at a time; each operator keeps its own state.
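A hedged C sketch of this interface: a scan operator feeding a selection operator, one tuple per next() call. The tuple layout, the operator names, and the hard-coded input are illustrative assumptions:

    #include <stdio.h>

    typedef struct { int l_quantity; int shipdate; } tuple_t;

    typedef struct {            /* scan keeps its own state: a cursor */
        const tuple_t *input;
        int pos, len;
    } scan_t;

    static const tuple_t *scan_next(scan_t *s) {
        return (s->pos < s->len) ? &s->input[s->pos++] : NULL;
    }

    typedef struct {            /* selection keeps its own state: child + constant */
        scan_t *child;
        int max_shipdate;
    } select_t;

    static const tuple_t *select_next(select_t *op) {
        const tuple_t *t;
        while ((t = scan_next(op->child)) != NULL)   /* one function call per tuple */
            if (t->shipdate <= op->max_shipdate)
                return t;
        return NULL;
    }

    int main(void) {
        tuple_t data[] = { {5, 100}, {7, 300}, {3, 150} };
        scan_t scan = { data, 0, 3 };
        select_t sel = { &scan, 200 };
        long sum = 0;
        const tuple_t *t;
        while ((t = select_next(&sel)) != NULL)      /* the root pulls tuple by tuple */
            sum += t->l_quantity;
        printf("sum_qty = %ld\n", sum);
        return 0;
    }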

52 Tuple-At-A-Time Processing - Consequences Pipeline parallelism → data processing can start although the data does not fully reside in main memory → small intermediate results. All operators in a plan run tightly interleaved → their combined instruction footprint may be large → instruction cache misses. Operators constantly call each other's functionality → large function call overhead. The combined state may be too large to fit into caches (e.g., hash tables, cursors, partial aggregates) → data cache misses.

53 Example: TPC-H Query Q1 on MySQL
    SELECT l_returnflag, l_linestatus,
           SUM(l_quantity) AS sum_qty,
           SUM(l_extendedprice) AS sum_base_price,
           SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
           SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
           AVG(l_quantity) AS avg_qty,
           AVG(l_extendedprice) AS avg_price,
           AVG(l_discount) AS avg_disc,
           COUNT(*) AS count_order
    FROM lineitem
    WHERE l_shipdate <= DATE '2016-01-25'
    GROUP BY l_returnflag, l_linestatus
A scan query with arithmetics and a bit of aggregation. Source: MonetDB/X100: Hyper-Pipelining Query Execution. [Boncz et al., 2005]

54
    time [sec] | calls | instr./call | IPC  | function name
    11.9 | 846M  | 6   | 0.64 | ut_fold_ulint_pair
    8.5  | 0.15M | 27K | 0.71 | ut_fold_binary
    5.8  | 77M   | 37  | 0.85 | memcpy
    3.1  | 23M   | 64  | 0.88 | Item_sum_sum::update_field
    3.0  | 6M    | 247 | 0.83 | row_search_for_mysql
    2.9  | 17M   | 79  | 0.70 | Item_sum_avg::update_field
    2.6  | 108M  | 11  | 0.60 | rec_get_bit_field_1
    2.5  | 6M    | 213 | 0.61 | row_sel_store_mysql_rec
    2.4  | 48M   | 25  | 0.52 | rec_get_nth_field
    2.4  | 60    | 19M | 0.69 | ha_print_info
    2.4  | 5.9M  | 195 | 1.08 | end_update
    2.1  | 11M   | 89  | 0.98 | field_conv
    2.0  | 5.9M  | 16  | 0.77 | Field_float::val_real
    1.8  | 5.9M  | 14  | 1.07 | Item_field::val
    1.5  | 42M   | 17  | 0.51 | row_sel_field_store_in_mysql
    1.4  | 36M   | 18  | 0.76 | buf_frame_align
    1.3  | 17M   | 38  | 0.80 | Item_func_mul::val
    1.4  | 25M   | 25  | 0.62 | pthread_mutex_unlock
    1.2  | 206M  | 2   | 0.75 | hash_get_nth_cell
    1.2  | 25M   | 21  | 0.65 | mutex_test_and_set
    1.0  | 102M  | 4   | 0.62 | rec_get_1byte_offs_flag
    1.0  | 53M   | 9   | 0.58 | rec_1_get_field_start_offs
    0.9  | 42M   | 11  | 0.65 | rec_get_nth_field_extern_bit
    1.0  | 11M   | 38  | 0.80 | Item_func_minus::val
    0.5  | 5.9M  | 38  | 0.80 | Item_func_plus::val

55 Observations Only a single tuple is processed per call; millions of calls. Only 10% of the time is spent on the actual query task. Low instructions-per-cycle (IPC) ratio. Much time is spent on field access. Polymorphic operators. Single-tuple functions are hard for the compiler to optimize → low instructions-per-cycle ratio → vector instructions (SIMD) are hardly applicable. Function call overhead: 38 instructions = 48 cycles (about 0.8 instructions per cycle) vs. 3 instructions for the bare load/add/store assembly; the exact numbers depend on the underlying hardware.

56 Operator-At-A-Time Processing Operators consume and produce full tables. Each (sub-)result is fully materialized (in memory). No pipelining (rather a sequence of statements). Each operator runs exactly once. [Figure: database → operator 1 → tuples → operator 2 → tuples → operator 3 → result.]
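A hedged C sketch of the same selection-plus-sum query in operator-at-a-time style: each operator is one tight loop over whole columns and materializes its complete result before the next operator runs. Column contents and function names are illustrative:

    #include <stdio.h>
    #include <stdlib.h>

    /* Selection: materialize the positions of all qualifying rows. */
    static int *select_le(const int *shipdate, int n, int max, int *out_n) {
        int *sel = malloc(n * sizeof *sel);
        int k = 0;
        if (!sel) { *out_n = 0; return NULL; }
        for (int i = 0; i < n; i++)        /* tight loop, runs exactly once */
            if (shipdate[i] <= max)
                sel[k++] = i;
        *out_n = k;
        return sel;
    }

    /* Aggregation: second tight loop over the materialized selection vector. */
    static long sum_at(const int *vals, const int *sel, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++)
            sum += vals[sel[i]];
        return sum;
    }

    int main(void) {
        int shipdate[] = { 100, 300, 150, 250 };
        int quantity[] = {   5,   7,   3,   9 };
        int n;
        int *sel = select_le(shipdate, 4, 200, &n);
        if (!sel) return 1;
        printf("sum_qty = %ld\n", sum_at(quantity, sel, n));
        free(sel);
        return 0;
    }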

57 Operator-At-A-Time Consequences Parallelism: inter-operator and intra-operator. Function call overhead is replaced by extremely tight loops that conveniently fit into instruction caches, can be optimized effectively by modern compilers (loop unrolling, vectorization with SIMD instructions), and can leverage modern CPU features such as hardware prefetching. Function calls are now out of the critical code path. No per-tuple field extraction or type resolution. Operator specialization, e.g., for every possible type, implemented using macro expansion. Possible due to column-based storage.

58 Source: MonetDB/X100: Hyper-Pipelining Query Execution. [Boncz et al., 2005]
    result size | time [ms] | bandwidth [MB/s] | MIL statement
    5.9M | 127 134 352 505 | s0 := select(l_shipdate, ...).mark();
    5.9M | 134 | 506 | s2 := join(s0, l_linestatus);
    5.9M | 235 | 483 | s3 := join(s0, l_extprice);
    5.9M | 233 | 488 | s4 := join(s0, l_discount);
    5.9M | 232 | 489 | s5 := join(s0, l_tax);
    5.9M | 134 | 507 | s6 := join(s0, l_quantity);
    5.9M | 290 | 155 | s7 := group(s1);
    5.9M | 329 | 136 | s8 := group(s7, s2);
    400  |     |     | s9 := unique(s8.mirror());
    5.9M | 206 | 440 | r0 := [+](1.0, s5);
    5.9M | 210 | 432 | r1 := [-](1.0, s4);
    5.9M | 274 | 498 | r2 := [*](s3, r1);
    5.9M | 274 | 499 | r3 := [*](s12, r0);
    4    | 165 | 271 | r4 := {sum}(r3, s8, s9);
    4    | 165 | 271 | r5 := {sum}(r2, s8, s9);
    4    | 163 | 275 | r6 := {sum}(s3, s8, s9);
    4    | 163 | 275 | r7 := {sum}(s4, s8, s9);
    4    | 144 | 151 | r8 := {sum}(s6, s8, s9);
    4    | 112 | 196 | r9 := {count}(s7, s8, s9);
         | 3,724 | 365 |

59 Tuple-At-A-Time vs. Operator-At-A-Time The operator-at-a-time model is a double-edged sword: + cache-efficient with respect to code and operator state; + tight loops, optimizable code; - data won't fully fit into the cache → repeated scans will fetch data from memory over and over → the strategy falls apart when intermediate results no longer fit into main memory. Can we aim for the middle ground between the two extremes? Vectorized execution sits between tuple-at-a-time and operator-at-a-time.

60 Vectorized Execution Model Idea: use Volcano-style iteration, but for each next() call return a large number of tuples, a so-called "vector". Choose the vector size large enough to compensate for the iteration overhead (function calls, instruction cache misses, ...), but small enough not to thrash the data caches.
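A hedged C sketch of the vectorized interface: next() still drives the pipeline, but each call fills a buffer of up to VECTOR_SIZE values. The vector size of 1000 follows the rule of thumb on the next slide; everything else is illustrative:

    #include <stdio.h>

    #define VECTOR_SIZE 1000

    typedef struct { const int *input; int pos, len; } scan_t;

    /* Copy up to VECTOR_SIZE values into out; return how many were produced. */
    static int scan_next_vector(scan_t *s, int *out) {
        int n = 0;
        while (n < VECTOR_SIZE && s->pos < s->len)
            out[n++] = s->input[s->pos++];
        return n;
    }

    int main(void) {
        static int data[2500];
        for (int i = 0; i < 2500; i++) data[i] = 1;
        scan_t scan = { data, 0, 2500 };

        int vec[VECTOR_SIZE];
        long sum = 0;
        int n;
        while ((n = scan_next_vector(&scan, vec)) > 0)   /* few next() calls... */
            for (int i = 0; i < n; i++)                  /* ...each amortized over a tight loop */
                sum += vec[i];
        printf("sum = %ld (expected 2500)\n", sum);
        return 0;
    }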

61 Vector Size ↔ Instruction Cache Effectiveness [Figure: instructions executed and instruction-cache misses for queries Q1, Q1', Q1'' as a function of the vector size (1 to 1M tuples). Source: [Zukowski, 2009]] 1000 tuples should conveniently fit into caches.

62 Comparison of Execution Models Source: [Zukowski, 2009]
    execution model          | tuple          | operator       | vector
    instr. cache utilization | poor           | extremely good | very good
    function calls           | many           | extremely few  | very few
    attribute access         | complex        | direct         | direct
    most time spent on       | interpretation | processing     | processing
    CPU utilization          | poor           | good           | very good
    compiler optimizations   | limited        | applicable     | applicable
    materialization overhead | very cheap     | expensive      | cheap
    scalability              | good           | limited        | good

63 Conclusion Row-stores store complete tuples sequentially on a database page. Column-stores store all values of one column sequentially on a database page. Depending on the workload, column-stores or row-stores are more advantageous. Tuple reconstruction is overhead in column-stores. Analytical workloads that process few columns at a time benefit from column-stores. → One data storage approach is not optimal for serving all possible workloads.

