CPT-S 580-06 Advanced Databases 11. Yinghui Wu, EME 49

In-memory databases

We will talk about:
The Past and the Present
Cache Awareness
Processing Models
Storage Models
Design of Main-Memory DBMSs for Database Applications

The Past - Computer Architecture
(Figure: data taken from [Hennessy and Patterson, 1996].)

The Past - Database Systems
Main-memory capacity is limited to several megabytes
→ Only a small fraction of the database fits in main memory
And disk storage is "huge"
→ Traditional database systems use disk as primary storage
But disk latency is high
→ Parallel query processing to hide disk latencies
→ Choose a proper buffer replacement strategy to reduce I/O
→ Architectural properties inherited from System R, the first "real" relational DBMS
→ From the 1970s...

The Present - Computer Architecture
(Figure: data taken from [Hennessy and Patterson, 2012].)

The Present - Database Systems
Hundreds to thousands of gigabytes of main memory are available
→ Up to 10⁶ times more capacity!
→ A complete database of less than a TB can be kept in main memory
→ Use main memory as primary storage for the database and remove disk access as the main performance bottleneck
But the architecture of traditional DBMSs is designed for disk-oriented database systems
→ "30 years of Moore's law have antiquated the disk-oriented relational architecture for OLTP applications." [Stonebraker et al., 2007]

Disk-based vs. Main-Memory DBMS

Disk-based vs. Main-Memory DBMS (2)
ATTENTION: main-memory storage != no durability
→ ACID properties still have to be guaranteed
→ However, there are new ways of guaranteeing them, such as a second machine in hot standby

Disk-based vs. Main-Memory DBMS (3)
Having the database in main memory allows us to remove the buffer manager and paging
→ Removes a level of indirection
→ Results in better performance

Disk-based vs. Main-Memory DBMS (4)
The disk bottleneck is removed as the database is kept in main memory
→ Access to main memory becomes the new bottleneck

The New Bottleneck: Memory Access
(Figure taken from [Manegold et al., 2000].)
Accessing main memory is much more expensive than accessing CPU registers.
→ Is main memory the new disk?

Rethink the Architecture of DBMSs
Even if the complete database fits in main memory, traditional, System R-like DBMSs incur significant overheads:
Many function calls → stack manipulation overhead¹ + instruction-cache misses
Adverse memory access → data-cache misses
→ Be aware of the caches!
¹ Can be reduced by function inlining.

Cache awareness

A Motivating Example (Memory Access)
Task: sum up all entries in a two-dimensional array.
Alternative 1:
  for (r = 0; r < rows; r++)
    for (c = 0; c < cols; c++)
      sum += src[r * cols + c];
Alternative 2:
  for (c = 0; c < cols; c++)
    for (r = 0; r < rows; r++)
      sum += src[r * cols + c];
Both alternatives touch the same data, but in a different order.
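
As a concrete illustration, here is a minimal, self-contained C sketch of the two alternatives (the array dimensions and the timing harness are our own choices, not part of the slides). On typical hardware, Alternative 1 is much faster because it walks memory sequentially and exploits spatial locality:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ROWS 4096
#define COLS 4096

int main(void) {
    int *src = malloc((size_t)ROWS * COLS * sizeof *src);
    long long sum = 0;
    clock_t t;

    for (size_t i = 0; i < (size_t)ROWS * COLS; i++)
        src[i] = 1;

    /* Alternative 1: row-wise traversal, sequential memory access. */
    t = clock();
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += src[r * COLS + c];
    printf("row-wise:    sum=%lld, %.2f s\n", sum, (double)(clock() - t) / CLOCKS_PER_SEC);

    /* Alternative 2: column-wise traversal, strides of COLS ints, poor spatial locality. */
    sum = 0;
    t = clock();
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += src[r * COLS + c];
    printf("column-wise: sum=%lld, %.2f s\n", sum, (double)(clock() - t) / CLOCKS_PER_SEC);

    free(src);
    return 0;
}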

A Motivating Example (Memory Access)
(Chart: total execution time of the two alternatives as a function of rows and cols.)

Memory Wall
(Chart: normalized performance per year for processors vs. DRAM memory; data taken from [Hennessy and Patterson, 2012].)

Hardware Trends There is an increasing gap between CPU and memory speeds. Also called the memory wall. CPUs spend much of their time waiting for memory. How can we break the memory wall and better utilize the CPU?

Memory Hierarchy
CPU registers: bytes, latency < 1 ns
L1 cache (SRAM): kilobytes, latency ≈ 1 ns
L2 cache (SRAM): megabytes, latency < 10 ns
Main memory (DRAM): gigabytes, latency 70–100 ns
Disk
Some systems also use a 3rd-level cache.
→ Caches resemble the buffer manager but are controlled by hardware.

Principle of Locality
Caches take advantage of the principle of locality:
The hot set of data often fits into caches; 90 % of execution time is spent in 10 % of the code.
Spatial locality: related data is often spatially close; code often contains loops.
Temporal locality: programs tend to re-use data frequently; code may call a function repeatedly, even if it is not spatially close.

CPU Cache Internals
To guarantee speed, the overhead of caching must be kept reasonable:
Organize the cache in cache lines.
Only load/evict full cache lines.
Typical cache line size: 64 bytes.
The organization in cache lines is consistent with the principle of (spatial) locality.

Memory Access
On every memory access, the CPU checks if the respective cache line is already cached.
Cache hit: read the data directly from the cache; no need to access lower-level memory.
Cache miss: read the full cache line from lower-level memory, evict some cached block, and replace it by the newly read cache line; the CPU stalls until the data becomes available.
Modern CPUs support out-of-order execution and several in-flight cache misses.

Example: AMD Opteron
Data taken from [Hennessy and Patterson, 2006].
AMD Opteron, 2.8 GHz, PC3200 DDR SDRAM
L1 cache: separate data and instruction caches, each 64 kB, 64 B cache lines
L2 cache: shared cache, 1 MB, 64 B cache lines
L1 hit latency: 2 cycles (≈ 1 ns)
L2 hit latency: 7 cycles (≈ 3.5 ns)
L2 miss latency: 160–180 cycles (≈ 60 ns)

Block Placement: Direct-Mapped Cache
In a direct-mapped cache, a block has only one place it can appear in the cache.
Much simpler to implement; easier to make fast.
Increases the chance of conflicts.
Example: place block 12 in cache line 4 (4 = 12 mod 8).

Block Placement: Fully Associative Cache
In a fully associative cache, a block can be loaded into any cache line.
Provides freedom to the block replacement strategy.
Does not scale to large caches → a 4 MB cache with 64 B lines has 65,536 cache lines to search.

Block Placement: Set-Associative Cache
Set-associative caches are a compromise:
Group cache lines into sets.
Each memory block maps to one set; the block can be placed anywhere within its set.
Most processor caches today are set-associative.
Example: place block 12 anywhere in set 0 (0 = 12 mod 4).

Example: Intel Q6700 (Core 2 Quad)
Total cache size: 4 MB (per 2 cores).
Cache line size: 64 bytes → 6-bit offset (2⁶ = 64) → there are 65,536 cache lines in total (4 MB ÷ 64 bytes).
Associativity: 16-way set-associative → there are 4,096 sets (65,536 ÷ 16 = 4,096) → 12-bit set index (2¹² = 4,096).
Maximum physical address space: 64 GB → 36 address bits are enough (2³⁶ bytes = 64 GB) → 18-bit tags (36 − 12 − 6 = 18).
Address layout: 18-bit tag | 12-bit set index | 6-bit offset.
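
A small C sketch of how these address bits are split, using the cache parameters from this slide (the example address is made up purely for illustration):

#include <stdint.h>
#include <stdio.h>

/* Intel Q6700 L2 parameters from the slide:
   64-byte lines (6 offset bits), 4,096 sets (12 index bits),
   36-bit physical addresses -> 18-bit tags. */
#define OFFSET_BITS 6
#define INDEX_BITS  12

int main(void) {
    uint64_t addr   = 0x3ABCDE123ULL;   /* hypothetical 36-bit physical address */
    uint64_t offset = addr & ((1ULL << OFFSET_BITS) - 1);
    uint64_t set    = (addr >> OFFSET_BITS) & ((1ULL << INDEX_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("addr=0x%09llx -> tag=0x%05llx set=%llu offset=%llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)set, (unsigned long long)offset);
    return 0;
}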

Block Replacement
When bringing in new cache lines, an existing entry has to be evicted:
Least Recently Used (LRU): evict the cache line whose last access is longest ago → least likely to be needed any time soon.
First In First Out (FIFO): often behaves similarly to LRU, but is easier to implement.
Random: pick a random cache line to evict; very simple to implement in hardware.
Replacement has to be decided in hardware and fast.

What Happens on a Write?
To implement memory writes, CPU makers have two options:
Write through: data is directly written to lower-level memory (and to the cache) → writes will stall the CPU → greatly simplifies data coherency.
Write back: data is only written into the cache; a dirty flag marks modified cache lines (remember the status field) → may reduce traffic to lower-level memory → dirty cache lines must be written back on eviction.
Modern processors usually implement write back.

Putting it all Together
To compensate for slow memory, systems use caches.
DRAM provides high capacity, but long latency; SRAM has better latency, but low capacity.
Typically there are multiple levels of caching (memory hierarchy).
Caches are organized into cache lines.
Set associativity: a memory block can only go into a small number of cache lines (most caches are set-associative).
Systems benefit from locality of data and code.

Why do DBMSs show such poor cache behavior?
Poor code locality:
Polymorphic functions (e.g., resolving attribute types for each processed tuple individually).
Volcano iterator model (pipelining): each tuple is passed through a query plan composed of many operators.

Why do DBMSs show such poor cache behavior?
Poor data locality:
Database systems are designed to navigate through large data volumes quickly.
Navigating an index tree, e.g., by design means "randomly" visiting any of the (many) child nodes.

Storage models
Row-stores
Column-stores

Row-stores
a.k.a. row-wise storage or n-ary storage model (NSM):
The tuples (a1, b1, c1, d1) through (a4, b4, c4, d4) are stored one after another, field by field: page 0 holds a1 b1 c1 d1, a2 b2 c2 d2, a3 b3 c3 d3; page 1 holds a4 b4 c4 d4.

Column-stores
a.k.a. column-wise storage or decomposition storage model (DSM):
For the same tuples, each column is stored contiguously: page 0 holds a1 a2 a3 a4, page 1 holds b1 b2 b3 b4, and so on for the c and d columns.
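
To make the difference concrete, here is a hedged C sketch of the two layouts for the a/b/c/d example above (the values and the summation over column a are invented for illustration). Scanning one attribute in the NSM layout drags the other fields through the cache, while the DSM scan touches only the values it needs:

#include <stdio.h>

#define N 4

/* Row-store (NSM): one record per tuple, fields interleaved in memory. */
struct row { int a, b, c, d; };
struct row nsm[N] = {{1,10,100,1000},{2,20,200,2000},{3,30,300,3000},{4,40,400,4000}};

/* Column-store (DSM): one contiguous array per attribute. */
int dsm_a[N] = {1, 2, 3, 4};
int dsm_b[N] = {10, 20, 30, 40};
int dsm_c[N] = {100, 200, 300, 400};
int dsm_d[N] = {1000, 2000, 3000, 4000};

int main(void) {
    long sum_nsm = 0, sum_dsm = 0;

    /* Scanning only attribute a still pulls b, c, d into the cache lines ... */
    for (int i = 0; i < N; i++)
        sum_nsm += nsm[i].a;

    /* ... whereas the column scan reads nothing but a's values. */
    for (int i = 0; i < N; i++)
        sum_dsm += dsm_a[i];

    printf("sum(a): NSM=%ld DSM=%ld\n", sum_nsm, sum_dsm);
    return 0;
}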

The effect on query processing
Consider, e.g., a selection query:
SELECT COUNT(*)
FROM lineitem
WHERE l_shipdate = " "
This query typically involves a full table scan.

A full table scan in a row-store
In a row-store, all rows of a table are stored sequentially on a database page.
(Figure: tuples with their l_shipdate fields, overlaid with cache block boundaries.)
With every access to an l_shipdate field, we load a large amount of irrelevant information into the cache.

A "full table scan" on a column-store
In a column-store, all values of one column are stored sequentially on a database page.
(Figure: the l_shipdate values laid out contiguously, overlaid with cache block boundaries.)
All data loaded into caches by an "l_shipdate scan" is now actually relevant for the query.

Column-store advantages
All data loaded into caches by an "l_shipdate scan" is now actually relevant for the query.
→ Less data has to be fetched from memory.
→ The cost of each fetch is amortized over more tuples.
→ If we're really lucky, the full (l_shipdate) data might now even fit into the caches.
The same arguments, by the way, also hold for disk-based systems.
Additional benefit: data compression might work better.

Column-store trade-offs
Tuple recombination can cause considerable cost: many joins need to be performed.
Workload-dependent trade-off.
Source: [Copeland and Khoshafian, 1985]

An example: Binary Association Tables in MonetDB
MonetDB makes this explicit in its data model: all tables in MonetDB have two columns ("head" and "tail").
The relation
oid NAME AGE SEX
o1 John 34 m
o2 Angelina 31 f
o3 Scott 35 m
o4 Nancy 33 f
is decomposed column by column:
NAME BAT: o1→John, o2→Angelina, o3→Scott, o4→Nancy
AGE BAT: o1→34, o2→31, o3→35, o4→33
SEX BAT: o1→m, o2→f, o3→m, o4→f
Each column yields one binary association table (BAT).
Object identifiers (oids) identify matching entries (BUNs).
Often, oids can be implemented as virtual oids (voids) → not explicitly materialized in memory.
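
A toy C sketch of a BAT whose head column consists of virtual oids (the struct and field names are ours, not MonetDB's actual implementation); because the oids are dense and ordered, a lookup is just array indexing:

#include <stdio.h>

/* A BAT with virtual oids: the head (oid) column is not materialized;
   the oid of the i-th BUN is simply seqbase + i. */
struct bat_int {
    unsigned seqbase;   /* first virtual oid */
    int      tail[4];   /* tail column values */
    int      count;
};

int main(void) {
    struct bat_int age = { .seqbase = 1, .tail = {34, 31, 35, 33}, .count = 4 };

    /* Look up AGE for oid 3 (Scott in the slide's example). */
    unsigned oid = 3;
    printf("AGE[oid %u] = %d\n", oid, age.tail[oid - age.seqbase]);
    return 0;
}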

Processing models

Processing Models
There are basically two alternative processing models used in modern DBMSs:
Tuple-at-a-time Volcano model [Graefe, 1990]: an operator requests the next tuple, processes it, and passes it to the next operator.
Operator-at-a-time bulk processing [Manegold et al., 2009]: an operator consumes its whole input and materializes its output.

Tuple-At-A-Time Processing
Most systems implement the Volcano iterator model:
Operators request tuples from their input using next().
Data is processed one tuple at a time.
Each operator keeps its own state.
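
The following C sketch mimics the Volcano interface for a scan feeding a selection (the operator structs, the int-typed "tuples", and the predicate are simplifications of ours): every tuple that reaches the root costs at least one next() call per operator, which is exactly the overhead discussed on the next slides.

#include <stdio.h>
#include <stddef.h>

/* Toy Volcano-style iterators: every operator exposes next(), which
   produces one tuple (here: one int) at a time, or 0 when exhausted. */
typedef struct op op_t;
struct op {
    int  (*next)(op_t *self, int *out);   /* returns 1 if a tuple was produced */
    op_t *child;
    void *state;
};

/* Scan operator: iterates over an in-memory array. */
struct scan_state { const int *data; int n, pos; };
static int scan_next(op_t *self, int *out) {
    struct scan_state *s = self->state;
    if (s->pos >= s->n) return 0;
    *out = s->data[s->pos++];
    return 1;
}

/* Selection operator: passes through tuples > 10, one call per tuple. */
static int select_next(op_t *self, int *out) {
    int v;
    while (self->child->next(self->child, &v))   /* per-tuple function calls */
        if (v > 10) { *out = v; return 1; }
    return 0;
}

int main(void) {
    int data[] = {3, 14, 15, 9, 26};
    struct scan_state ss = { data, 5, 0 };
    op_t scan = { scan_next, NULL, &ss };
    op_t sel  = { select_next, &scan, NULL };

    int v;
    while (sel.next(&sel, &v))    /* the root pulls tuples one at a time */
        printf("%d\n", v);
    return 0;
}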

Tuple-At-A-Time Processing - Consequences
Pipeline parallelism
→ Data processing can start although the data does not fully reside in main memory
→ Small intermediate results
All operators in a plan run tightly interleaved
→ Their combined instruction footprint may be large → instruction cache misses
Operators constantly call each other's functionality
→ Large function call overhead
The combined state (e.g., hash tables, cursors, partial aggregates) may be too large to fit into caches
→ Data cache misses

Example: TPC-H Query Q1 on MySQL
SELECT l_returnflag, l_linestatus,
       SUM(l_quantity) AS sum_qty,
       SUM(l_extendedprice) AS sum_base_price,
       SUM(l_extendedprice*(1-l_discount)) AS sum_disc_price,
       SUM(l_extendedprice*(1-l_discount)*(1+l_tax)) AS sum_charge,
       AVG(l_quantity) AS avg_qty,
       AVG(l_extendedprice) AS avg_price,
       AVG(l_discount) AS avg_disc,
       COUNT(*) AS count_order
FROM lineitem
WHERE l_shipdate <= DATE ' '
GROUP BY l_returnflag, l_linestatus
A scan query with arithmetics and a bit of aggregation.
Source: MonetDB/X100: Hyper-Pipelining Query Execution [Boncz et al., 2005]

(Profile of TPC-H Q1 on MySQL. Columns of the original table: time [sec], calls, instructions/call, IPC, function name. The time is dominated by millions of calls to low-level helpers such as ut_fold_ulint_pair, ut_fold_binary, memcpy, Item_sum_sum::update_field, row_search_for_mysql, Item_sum_avg::update_field, rec_get_nth_field, row_sel_store_mysql_rec, field_conv, Field_float::val_real, Item_field::val, buf_frame_align, Item_func_mul::val, pthread_mutex_unlock, Item_func_minus::val, and Item_func_plus::val, most of them executing only a few dozen instructions per call at a low IPC.)

Observations
Only a single tuple is processed in each call; millions of calls.
Only 10 % of the time is spent on the actual query task.
Low instructions-per-cycle (IPC) ratio.
Much time is spent on field access.
Polymorphic operators; single-tuple functions are hard to optimize (by the compiler).
→ Low instructions-per-cycle ratio.
→ Vector instructions (SIMD) hardly applicable.
Function call overhead: 38 instructions ≈ 48 cycles⁴, vs. 3 instructions for a load/add/store assembly sequence.
⁴ Depends on the underlying hardware.

Operator-At-A-Time Processing
Operators consume and produce full tables.
Each (sub-)result is fully materialized (in memory).
No pipelining (rather a sequence of statements).
Each operator runs exactly once.
(Figure: Database → Operator 1 → tuples → Operator 2 → tuples → Operator 3 → ... → Result.)

Operator-At-A-Time Consequences
Parallelism: inter-operator and intra-operator.
Function call overhead is now replaced by extremely tight loops that
→ conveniently fit into instruction caches,
→ can be optimized effectively by modern compilers (loop unrolling, vectorization / use of SIMD instructions), and
→ can leverage modern CPU features (hardware prefetching).
Function calls are now out of the critical code path.
No per-tuple field extraction or type resolution.
Operator specialization, e.g., for every possible type (implemented using macro expansion); possible due to column-based storage.
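
For contrast with the Volcano sketch above, here is a minimal C sketch in the operator-at-a-time spirit (the function names and toy columns are ours): each operator is a single tight loop over whole columns and materializes its entire result before the next operator starts.

#include <stdio.h>
#include <stdlib.h>

/* Operator-at-a-time: each operator consumes whole columns and
   materializes its full result before the next operator runs. */

/* Selection: materialize the positions of all values > threshold. */
static int *select_gt(const int *col, int n, int threshold, int *out_n) {
    int *pos = malloc(n * sizeof *pos);
    int k = 0;
    for (int i = 0; i < n; i++)          /* one tight, compiler-friendly loop */
        if (col[i] > threshold)
            pos[k++] = i;
    *out_n = k;
    return pos;
}

/* Projection: fetch another column at the selected positions. */
static int *fetch(const int *col, const int *pos, int k) {
    int *out = malloc(k * sizeof *out);
    for (int i = 0; i < k; i++)
        out[i] = col[pos[i]];
    return out;
}

int main(void) {
    int a[] = {3, 14, 15, 9, 26};
    int b[] = {30, 140, 150, 90, 260};
    int k;
    int *pos = select_gt(a, 5, 10, &k);   /* operator 1 runs to completion */
    int *bv  = fetch(b, pos, k);          /* operator 2 runs on materialized input */
    for (int i = 0; i < k; i++)
        printf("%d\n", bv[i]);
    free(pos);
    free(bv);
    return 0;
}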

Source: MonetDB/X100: Hyper-Pipelining Query Execution [Boncz et al., 2005]
(Table: the MIL plan for TPC-H Q1 with result size, time [ms], and bandwidth [MB/s] per statement; most intermediate results have 5.9M entries.)
s0 := select(l_shipdate, ...).mark();
s2 := join(s0, l_linestatus);
s3 := join(s0, l_extprice);
s4 := join(s0, l_discount);
s5 := join(s0, l_tax);
s6 := join(s0, l_quantity);
s7 := group(s1);
s8 := group(s7, s2);
s9 := unique(s8.mirror());
r0 := [+](1.0, s5);
r1 := [-](1.0, s4);
r2 := [*](s3, r1);
r3 := [*](s12, r0);
r4 := {sum}(r3, s8, s9);
r5 := {sum}(r2, s8, s9);
r6 := {sum}(s3, s8, s9);
r7 := {sum}(s4, s8, s9);
r8 := {sum}(s6, s8, s9);
r9 := {count}(s7, s8, s9);

Tuple-At-A-Time vs. Operator-At-A-Time
The operator-at-a-time model is a double-edged sword:
+ Cache-efficient with respect to code and operator state.
+ Tight loops, optimizable code.
- Data won't fully fit into the cache.
→ Repeated scans will fetch data from memory over and over.
→ The strategy falls apart when intermediate results no longer fit into main memory.
Can we aim for the middle ground between the two extremes?
(Figure: a spectrum from tuple-at-a-time to operator-at-a-time, with vectorized execution in between.)

Vectorized Execution Model
Idea: use Volcano-style iteration, but have each next() call return a large number of tuples: a so-called "vector".
Choose the vector size large enough to compensate for the iteration overhead (function calls, instruction cache misses, ...), but small enough not to thrash the data caches.
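
A compact C sketch of the vectorized idea (the vector size of 1000 tuples follows the suggestion two slides below; the scan function and driver loop are our own simplification): next() fills a cache-resident vector, and the consumer processes it in a tight loop.

#include <stdio.h>

#define VECTOR_SIZE 1000   /* the "vector": large enough to amortize call
                              overhead, small enough to stay cache-resident */

/* next() now fills a whole vector of values instead of returning one tuple. */
static int scan_next_vector(const int *col, int n, int *pos, int *out) {
    int k = 0;
    while (*pos < n && k < VECTOR_SIZE)
        out[k++] = col[(*pos)++];
    return k;               /* number of tuples in this vector, 0 at end */
}

int main(void) {
    enum { N = 5000 };
    static int col[N];
    for (int i = 0; i < N; i++)
        col[i] = i;

    int vec[VECTOR_SIZE], pos = 0, k;
    long sum = 0;
    while ((k = scan_next_vector(col, N, &pos, vec)) > 0)
        for (int i = 0; i < k; i++)   /* tight per-vector loop inside the operator */
            sum += vec[i];
    printf("sum = %ld\n", sum);
    return 0;
}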

Vector Size ↔ Instruction Cache Effectiveness
(Chart: instructions executed and instruction-cache misses for Q1, Q1', and Q1'' as a function of vector size in tuples; source: [Zukowski, 2009].)
1000 tuples should conveniently fit into caches.

Comparison of Execution Models (source: [Zukowski, 2009])
instruction cache utilization: tuple poor / operator extremely good / vector very good
function calls: tuple many / operator extremely few / vector very few
attribute access: tuple complex / operator direct / vector direct
most time spent on: tuple interpretation / operator processing / vector processing
CPU utilization: tuple poor / operator good / vector very good
compiler optimizations: tuple limited / operator applicable / vector applicable
materialization overhead: tuple very cheap / operator expensive / vector cheap
scalability: tuple good / operator limited / vector good

Conclusion
Row-stores store complete tuples sequentially on a database page.
Column-stores store all values of one column sequentially on a database page.
Depending on the workload, either column-stores or row-stores are more advantageous.
Tuple reconstruction is overhead in column-stores.
Analytical workloads that process few columns at a time benefit from column-stores.
→ No single data storage approach is optimal for all possible workloads.