Building an Elastic Main-Memory Database: E-Store


Building an Elastic Main-Memory Database: E-Store Aaron J. Elmore aelmore@cs.uchicago.edu

Collaboration Between Many Rebecca Taft, Vaibhav Arora, Essam Mansour, Marco Serafini, Jennie Duggan, Aaron J. Elmore, Ashraf Aboulnaga, Andy Pavlo, Amr El Abbadi, Divy Agrawal, Michael Stonebraker E-Store @ VLDB 2015, Squall @ SIGMOD 2015, E-Store++ @ ????

Databases are Great Developer ease via ACID. Turing Award-winning great.

But they are Rigid and Complex

Growth… Rapid growth of some web services led to the design of new “web-scale” databases…

Rise of NoSQL Scaling is needed, so chisel away at functionality: no transactions, no secondary indexes, minimal recovery, mixed consistency. Not always suitable…

Workloads Fluctuate (figure: demand fluctuating over time against provisioned resources) Slide Credits: Berkeley RAD Lab

Peak Provisioning (figure: capacity provisioned at peak demand, leaving unused resources most of the time) Slide Credits: Berkeley RAD Lab

Peak Provisioning isn’t Perfect (figure: resources vs. time) Slide Credits: Berkeley RAD Lab

Growth is not always sustained http://www.statista.com/statistics/273569/monthly-active-users-of-zynga-games/

Need Elasticity Elasticity > Scalability

The Promise of Elasticity (figure: capacity tracking demand over time, minimizing unused resources) Slide Credits: Berkeley RAD Lab

Primary use-cases for elasticity Database-as-a-Service with elastic placement of non-correlated tenants, often low utilization per tenant. High-throughput transactional systems (OLTP)

No Need to Weaken the Database!

High Throughput = Main Memory Cost per GB for RAM is dropping. Network memory is faster than local disk. Much faster than disk-based DBs.

Approaches for “NewSQL” main-memory*: highly concurrent, latch-free data structures; partitioning into single-threaded executors. *Excuse the generalization

(figure: a client application invokes a stored procedure by passing the procedure name and input parameters) In case some of you aren’t familiar with H-Store, I’ll give a brief overview of the architecture. As I mentioned, it’s a main-memory, shared-nothing database. Data is horizontally partitioned across the cluster, and each node is responsible for some number of data partitions, typically the number of cores on that machine. Each partition is uniquely accessed by a single-threaded execution engine, which means no locking or latching is required for single-partition transactions. Another feature is that H-Store is optimized to execute stored procedures, which are basically parameterized SQL queries mixed with control code written in Java. To execute a transaction, the client just passes the name of a stored procedure and the input parameters. Slide Credits: Andy Pavlo

Database Partitioning (figure: the TPC-C schema as a schema tree rooted at WAREHOUSE, with DISTRICT, STOCK, CUSTOMER, ORDERS, and ORDER_ITEM below it; ITEM is replicated) Slide Credits: Andy Pavlo

Database Partitioning (figure: the schema tree split into partitions P1-P5; every table under WAREHOUSE is co-partitioned with its warehouse, while the replicated ITEM table is present on all partitions) Slide Credits: Andy Pavlo
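
To make the tree-schema co-partitioning concrete, here is a minimal Java sketch (class and method names are hypothetical, not H-Store's routing code) that sends every table keyed directly or transitively by a warehouse ID to the same partition as its warehouse, and treats a read-only table such as ITEM as replicated:

```java
// Minimal sketch (hypothetical names, not H-Store's actual code): co-partition a
// tree schema by its root key, so a warehouse and all of its descendant rows
// (DISTRICT, STOCK, CUSTOMER, ORDERS, ORDER_ITEM) land on the same partition,
// while a read-only table such as ITEM is replicated everywhere.
public class TreeSchemaPartitioner {
    private final int numPartitions;

    public TreeSchemaPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    /** All tables keyed (directly or transitively) by warehouse id map the same way. */
    public int partitionForWarehouse(int warehouseId) {
        return Math.floorMod(warehouseId, numPartitions);
    }

    /** A CUSTOMER row is routed by its ancestor warehouse, not by its own id. */
    public int partitionForCustomer(int warehouseId, int districtId, int customerId) {
        return partitionForWarehouse(warehouseId);
    }

    /** Replicated tables (e.g. ITEM) exist on every partition. */
    public boolean isReplicated(String tableName) {
        return "ITEM".equals(tableName);
    }
}
```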

The Problem: Workload Skew Many OLTP applications suffer from variable load and high skew. Extreme Skew: 40-60% of NYSE trading volume is on 40 individual stocks. Time Variation: Load “follows the sun”. Seasonal Variation: Ski resorts have high load in the winter months. Load Spikes: The first and last 10 minutes of the trading day have 10X the average volume. Hockey Stick Effect: A new application goes “viral”. The problem we are trying to solve is workload skew, which is extremely common in OLTP workloads; these are just a few examples. Ok, so skew exists, but does it really matter?

The Problem: Workload Skew In order to test this, we looked at three different workload distributions: No Skew (uniform data access), Low Skew (2/3 of queries access the top 1/3 of data), and High Skew (a few very hot items).

The Problem: Workload Skew High skew increases latency by 10X and decreases throughput by 4X Partitioned shared-nothing systems are especially susceptible And we found that skew does matter a lot!

The Problem: Workload Skew Possible solutions: Provision resources for peak load (Very expensive and brittle!) Limit load on system (Poor performance!) Enable system to elastically scale in or out to dynamically adapt to changes in load

Elastic Scaling (figure: keys 1-12 spread across three partitions; after scaling out to a fourth partition, the keys of the last partition are split between two partitions)

Load Balancing (figure: with the same three partitions, individual keys are moved from the overloaded partition to the other two to even out the load)

Two-Tiered Partitioning What if only a few specific tuples are very hot? Deal with them separately! Two tiers: individual hot tuples, mapped explicitly to partitions; and large blocks of colder tuples, hash- or range-partitioned at coarse granularity. Possible implementations: fine-grained range partitioning, consistent hashing with virtual nodes, or a lookup table combined with any standard partitioning scheme. Existing systems are “one-tiered” and partition data only at coarse granularity, so they are unable to handle cases of extreme skew. This brings us to the question of what to do if only a few tuples are hot. As alluded to in the previous slide, it’s often best to deal with them separately.
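
The lookup-table implementation mentioned above can be sketched in a few lines of Java. This is an illustrative sketch, not E-Store's code; the class name and key type are assumptions. A small explicit map for hot tuples is consulted first, and everything else falls back to coarse range partitioning:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Minimal sketch of two-tiered partitioning (hypothetical class, not E-Store's code):
// a small explicit lookup table for individual hot tuples sits in front of a
// coarse-grained range partitioning scheme for everything else.
public class TwoTierRouter {
    // Tier 1: explicit placement of individual hot keys.
    private final Map<Long, Integer> hotTupleMap = new HashMap<>();
    // Tier 2: coarse ranges, keyed by the lower bound of each range.
    private final NavigableMap<Long, Integer> rangeMap = new TreeMap<>();

    public void mapHotTuple(long key, int partition) {
        hotTupleMap.put(key, partition);
    }

    public void mapRange(long lowerBoundInclusive, int partition) {
        rangeMap.put(lowerBoundInclusive, partition);
    }

    /** Hot tuples win; otherwise fall back to the enclosing coarse range. */
    public int partitionFor(long key) {
        Integer hot = hotTupleMap.get(key);
        if (hot != null) {
            return hot;
        }
        Map.Entry<Long, Integer> range = rangeMap.floorEntry(key);
        if (range == null) {
            throw new IllegalArgumentException("key not covered by any range: " + key);
        }
        return range.getValue();
    }
}
```

Consulting the hot-tuple map first is what lets a handful of extremely hot keys be placed independently of the block they would normally hash or range into.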

E-Store End-to-end system which extends H-Store (a distributed, shared-nothing, main memory DBMS) with automatic, adaptive, two-tiered elastic partitioning

E-Store (figure: the E-Store control loop) Normal operation uses only high-level monitoring. When a load imbalance is detected, tuple-level monitoring (E-Monitor) produces hot tuples and partition-level access counts, tuple placement planning (E-Planner) turns them into a new partition plan, and online reconfiguration (Squall) applies it. When reconfiguration is complete, the system returns to normal operation.

E-Monitor: High-Level Monitoring High-level system statistics collected every ~1 minute. CPU indicates system load and is used to determine whether to add or remove nodes, or re-shuffle the data. Accurate in H-Store since partition executors are pinned to specific cores. Cheap to collect. When a load imbalance (or overload/underload) is detected, detailed monitoring is triggered.
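
As a rough illustration of the trigger, here is a minimal Java sketch (the watermark values and names are hypothetical, not E-Monitor's code) that flags either an overloaded partition or an underutilized cluster from the collected CPU statistics:

```java
// Minimal sketch (hypothetical thresholds, not E-Monitor's actual code) of the
// high-level check: compare per-partition CPU utilization against configured
// high/low watermarks and decide whether detailed, tuple-level monitoring
// should be triggered.
public class HighLevelMonitor {
    private final double highWatermark; // e.g. 0.90 = a partition is overloaded
    private final double lowWatermark;  // e.g. 0.50 = the cluster is underutilized

    public HighLevelMonitor(double highWatermark, double lowWatermark) {
        this.highWatermark = highWatermark;
        this.lowWatermark = lowWatermark;
    }

    /** Returns true if any partition is overloaded or the whole cluster is underloaded. */
    public boolean shouldTriggerTupleLevelMonitoring(double[] cpuPerPartition) {
        double sum = 0.0;
        for (double cpu : cpuPerPartition) {
            if (cpu > highWatermark) {
                return true;           // hot partition: rebalance or scale out
            }
            sum += cpu;
        }
        double average = sum / cpuPerPartition.length;
        return average < lowWatermark; // idle cluster: candidate for scale in
    }
}
```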

E-Monitor: Tuple-Level Monitoring Tuple-level statistics collected in case of load imbalance. Finds the top 1% of tuples accessed per partition (read or written) during a 10-second window. Finds the total access count per block of cold tuples. Can be used to determine the workload distribution, using tuple access count as a proxy for system load (a reasonable assumption for a main-memory DBMS with an OLTP workload). Minor performance degradation during collection.
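
A hedged sketch of the per-partition collection is shown below in Java; the class, the 1% cutoff parameterization, and the block-size handling are illustrative assumptions rather than E-Monitor's implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch (hypothetical class, not E-Monitor's implementation) of
// tuple-level monitoring for one partition: count accesses per key during the
// monitoring window, then report the top 1% of keys as "hot" and aggregate the
// rest into coarse block counts.
public class TupleLevelMonitor {
    private final Map<Long, Long> accessCounts = new HashMap<>();
    private final long blockSize; // e.g. 1,000 keys per cold block

    public TupleLevelMonitor(long blockSize) {
        this.blockSize = blockSize;
    }

    /** Called on every read or write during the monitoring window. */
    public void recordAccess(long key) {
        accessCounts.merge(key, 1L, Long::sum);
    }

    /** Top 1% most-accessed keys in this partition. */
    public List<Long> hotTuples() {
        int k = Math.max(1, accessCounts.size() / 100);
        List<Map.Entry<Long, Long>> entries = new ArrayList<>(accessCounts.entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        List<Long> hot = new ArrayList<>();
        for (int i = 0; i < k && i < entries.size(); i++) {
            hot.add(entries.get(i).getKey());
        }
        return hot;
    }

    /** Total access count per block of cold tuples (hot keys included here too, for simplicity). */
    public Map<Long, Long> blockCounts() {
        Map<Long, Long> blocks = new HashMap<>();
        accessCounts.forEach((key, count) ->
                blocks.merge((key / blockSize) * blockSize, count, Long::sum));
        return blocks;
    }
}
```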

E-Monitor: Tuple-Level Monitoring (figure: sample output of hot tuples and block-level access counts)

E-Planner Given the current partitioning of the data, system statistics, and the hot tuples/partitions from E-Monitor, E-Planner determines: whether to add or remove nodes, and how to balance load. Optimization problem: minimize data movement (migration is not free) while balancing system load. We tested five different data placement algorithms: One-tiered bin packing (ILP – computationally intensive!), Two-tiered bin packing (ILP – computationally intensive!), First Fit (global repartitioning to balance load), Greedy (only move hot tuples), Greedy Extended (move hot tuples first, then cold blocks until load is balanced).

E-Planner: Greedy Extended Algorithm
Hot tuples (accesses): key 0 = 20,000; key 1 = 12,000; key 2 = 5,000
Cold ranges (accesses): 3-1000 = 5,000; 1000-2000 = 3,000; 2000-3000 = 2,000; …
Current YCSB partition plan: "usertable": { 0: [0-100000), 1: [100000-200000), 2: [200000-300000) }
New YCSB partition plan: "usertable": { 0: [1000-100000), 1: [1-2),[100000-200000), 2: [200000-300000),[0-1),[2-1000) }

E-Planner: Greedy Extended Algorithm (walkthrough, using the hot tuples and cold ranges above; target cost per partition: 35,000 tuple accesses)

Initial costs: partition 0 [0-100000) = 77,000; partition 1 [100000-200000) = 23,000; partition 2 [200000-300000) = 5,000

Step 1: move hot tuple 0 (20,000 accesses) to the coldest partition: partition 0 [1-100000) = 57,000; partition 1 [100000-200000) = 23,000; partition 2 [200000-300000), [0-1) = 25,000

Step 2: move hot tuple 1 (12,000 accesses): partition 0 [2-100000) = 45,000; partition 1 [100000-200000), [1-2) = 35,000; partition 2 [200000-300000), [0-1) = 25,000

Step 3: move hot tuple 2 (5,000 accesses): partition 0 [3-100000) = 40,000; partition 1 [100000-200000), [1-2) = 35,000; partition 2 [200000-300000), [0-1), [2-3) = 30,000

Step 4: all hot tuples are placed, so move the cold range 3-1000 (5,000 accesses): partition 0 [1000-100000) = 35,000; partition 1 [100000-200000), [1-2) = 35,000; partition 2 [200000-300000), [0-1), [2-1000) = 35,000. Every partition is now at the target cost, matching the new plan above.
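
The walkthrough above can be distilled into a short greedy loop. The following Java sketch is my own simplification (the types and tie-breaking are assumptions, not E-Planner's code): hot tuples are considered before cold blocks, and each is moved from an overloaded partition to the currently coldest partition until every partition is at or below the target:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal sketch of the Greedy Extended placement idea (hypothetical types, not
// E-Planner's code): take the hottest unplaced item (hot tuples first, then cold
// blocks) off an overloaded partition and move it to the least loaded partition,
// until every partition is at or below the target load.
public class GreedyExtendedPlanner {

    public static class Item {        // a hot tuple or a cold block
        final String id;
        final long accesses;
        int partition;
        public Item(String id, long accesses, int partition) {
            this.id = id; this.accesses = accesses; this.partition = partition;
        }
    }

    /** Mutates item.partition to produce the new plan; returns the items that moved. */
    public static List<Item> plan(List<Item> hotTuples, List<Item> coldBlocks, int numPartitions) {
        long[] load = new long[numPartitions];
        for (Item it : hotTuples) load[it.partition] += it.accesses;
        for (Item it : coldBlocks) load[it.partition] += it.accesses;
        long total = 0;
        for (long l : load) total += l;
        long target = total / numPartitions;

        // Consider hot tuples before cold blocks, hottest first within each group.
        List<Item> candidates = new ArrayList<>(hotTuples);
        candidates.sort(Comparator.comparingLong((Item i) -> i.accesses).reversed());
        List<Item> cold = new ArrayList<>(coldBlocks);
        cold.sort(Comparator.comparingLong((Item i) -> i.accesses).reversed());
        candidates.addAll(cold);

        List<Item> moved = new ArrayList<>();
        for (Item it : candidates) {
            if (load[it.partition] <= target) continue;       // its partition is already fine
            int coldest = 0;
            for (int p = 1; p < numPartitions; p++) if (load[p] < load[coldest]) coldest = p;
            if (coldest == it.partition) continue;
            load[it.partition] -= it.accesses;                // move item to the coldest partition
            load[coldest] += it.accesses;
            it.partition = coldest;
            moved.add(it);
        }
        return moved;
    }
}
```

Run on the example above, this sketch produces the same four moves (hot tuples 0, 1, and 2, then the cold range 3-1000).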

E-Planner: Other Heuristic algorithms Greedy Like Greedy Extended, but the algorithm stops after all hot tuples have been moved If there are not many hot tuples (e.g. low skew), may not sufficiently balance the workload First Fit First packs hot tuples onto partitions, filling one partition at a time Then packs blocks of cold tuples, filling the remaining partitions one at a time Results in a balanced workload, but does not attempt to limit the amount of data movement

E-Planner: Optimal Algorithms Two-Tiered Bin Packer: uses Integer Linear Programming (ILP) to optimally pack hot tuples and cold blocks onto partitions. Constraints: each tuple/block must be assigned to exactly one partition, and each partition must have a total load less than the average + 5%. Optimization goal: minimize the amount of data moved in order to satisfy the constraints. One-Tiered Bin Packer: like the Two-Tiered Bin Packer, but can only pack blocks of tuples, not individual tuples. Both are computationally intensive, but they show the one- and two-tiered approaches in the best light.
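
The two-tiered bin-packing ILP described above can be written down roughly as follows. This is a hedged reconstruction in my own notation (the variable names and the exact form of the movement cost are assumptions, not copied from the E-Store paper): x_{ip} in {0,1} assigns item i (a hot tuple or a cold block) to partition p, a_{ip} = 1 if item i currently lives on p, r_i is its access load, s_i its size, and L-bar the average partition load.

```latex
\begin{align}
\text{minimize}\quad & \sum_{i}\sum_{p} s_i \, x_{ip} \,(1 - a_{ip})
    && \text{(total data moved off its current partition)}\\
\text{subject to}\quad & \sum_{p} x_{ip} = 1 && \forall i
    \quad\text{(each tuple/block assigned to exactly one partition)}\\
& \sum_{i} r_i \, x_{ip} \le 1.05\,\bar{L} && \forall p
    \quad\text{(no partition exceeds the average load by more than 5\%)}\\
& x_{ip} \in \{0,1\} && \forall i, p
\end{align}
```

The one-tiered variant is the same program with only cold blocks as items, which is why it cannot isolate individual hot tuples.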

Squall Given plan from E-Planner, Squall physically moves the data while the system is live For immediate benefit, moves data from hottest partitions to coldest partitions first More on this in a bit…

Results – Two-Tiered vs. One-Tiered (figures: YCSB high skew and YCSB low skew) Showing that the two-tiered approach is good.

Results – Heuristic Planners (figures: YCSB high skew and YCSB low skew) Under high skew, First Fit and Greedy Extended do well and Greedy is okay. Under low skew, First Fit moves too much data and Greedy doesn’t move enough.

Results (figures: YCSB high skew and YCSB low skew) Greedy Extended has low latency and high throughput in all cases.

But What About… Distributed Transactions??? Current E-Store does not take them into account when planning data movement Ok when most transactions access a single partitioning key – tends to be the case for “tree schemas” such as YCSB, Voter, and TPC-C E-Store++ will address the general case More later…

Squall Fine-Grained Live Reconfiguration for Partitioned Main Memory Databases

The Problem Need to migrate tuples between partitions to reflect the updated partitioning. Would like to do this without bringing the system offline: Live Reconfiguration Similar to live migration of an entire database between servers.

Existing Solutions are Not Ideal Predicated on disk-based solutions with traditional concurrency and recovery. Zephyr: Relies on concurrency (2PL) and disk pages. ProRea: Relies on concurrency (SI and OCC) and disk pages. Albatross: Relies on replication and shared disk storage; also introduces strain on the source. Slacker: Based on replication middleware.

Not Your Parent’s Migration More than a single source and destination. Want lightweight coordination. Single-threaded execution model: either doing work or migrating. Presence of distributed transactions and replication. (figure: migrating 2 warehouses in TPC-C in E-Store with a Zephyr-like migration)

Squall Given plan from E-Planner, Squall physically moves the data while the system is live Conforms to H-Store single-threaded execution model While data is moving, transactions are blocked To avoid performance degradation, Squall moves small chunks of data at a time, interleaved with regular transaction execution

Squall Steps 1. Identify migrating data 2. Live reactive pulls for required data 3. Periodic lazy/async pulls for large chunks (figure: a client issues a reconfiguration with a new plan and leader ID; partitions 1-4 of a TPC-C database (warehouse, district, customer, stock) exchange tuples via pulls such as "Pull W_ID=2" and "Pull W_ID>5", each partition tracking its incoming and outgoing warehouses)

Keys to Performance Redirect or pull only if needed. Properly size the reconfiguration granule. Split large reconfigurations to limit demands on a single partition. Tune what gets pulled; sometimes pull a little extra. (figure: migrating 2 warehouses in TPC-C in E-Store with a Zephyr-like migration) (Skipping the overview of how we filter and eagerly pull small requests.)

Redirect and Pull Only When Needed

Data Migration A query arrives and must be trapped to check if its data is potentially moving. Check the key map, then the ranges list. If either the source or destination partition is local, check their map and keep the transaction local if possible. If neither partition is local, forward to the destination. If the data is not moving, process the transaction normally.
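
The check order described above (key map first, then the ranges list, then decide where to run) might look roughly like the following Java sketch; the types and the decision enum are hypothetical, not Squall's actual structures:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch (hypothetical structures, not Squall's code) of the routing
// trap performed while a reconfiguration is in flight: consult the per-key map
// for individually migrated hot tuples first, then the migrating-range list,
// and decide whether to execute locally, pull the data, or forward the request.
public class MigrationRouter {

    public enum Decision { EXECUTE_LOCALLY, PULL_THEN_EXECUTE, FORWARD_TO_DESTINATION }

    static class MigratingRange {
        final long lo, hi;   // [lo, hi)
        final int newOwner;
        MigratingRange(long lo, long hi, int newOwner) {
            this.lo = lo; this.hi = hi; this.newOwner = newOwner;
        }
    }

    private final int localPartition;
    private final Map<Long, Integer> migratedKeyMap = new HashMap<>(); // hot tuple -> new owner
    private final List<MigratingRange> migratingRanges = new ArrayList<>();

    public MigrationRouter(int localPartition) {
        this.localPartition = localPartition;
    }

    public Decision route(long key, int oldOwner) {
        Integer newOwner = migratedKeyMap.get(key);           // 1. check the key map
        if (newOwner == null) {
            for (MigratingRange r : migratingRanges) {        // 2. then the ranges list
                if (key >= r.lo && key < r.hi) { newOwner = r.newOwner; break; }
            }
        }
        if (newOwner == null) {                               // data is not moving
            return oldOwner == localPartition ? Decision.EXECUTE_LOCALLY
                                              : Decision.FORWARD_TO_DESTINATION;
        }
        if (newOwner == localPartition) {                     // we are the destination
            return Decision.PULL_THEN_EXECUTE;                // may need a reactive pull first
        }
        return Decision.FORWARD_TO_DESTINATION;               // someone else owns it now
    }
}
```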

Trap for Data Movement If a txn requires incoming data, block execution and schedule a data pull. Only the dependent nodes in the query plan can be blocked. Upon receipt, mark and dirty the tracking structures, and unblock. If a txn requires data that has already migrated away, restart it as a distributed transaction or forward the request.

Data Pull Requests Live data pulls are scheduled at destination as high priority transactions. Current transaction finishes before extraction. Timeout detection is needed.

Chunk Data for Asynchronous Pulls

Why Chunk? Unknown amount of data when not partitioned by a clustered index (e.g., customers by W_ID in TPC-C). Time spent extracting is time not spent on transactions. Want a mechanism to support partial extraction while maintaining consistency.

Async Pulls Periodically pull chunks of cold data. These pulls are answered lazily. Execution is interwoven with extracting and sending data (but dirty the range!).

Mitigating Async Pulls (figure: an async data pull between Partition 1 and Partition 2, scheduled against their transaction queues while the partitions are idle)

New Transactions Take Precedence (figure: newly queued transactions on Partition 1 and Partition 2 run ahead of the pending async pull)

Extract up to the Chunk Limit (figure: Partition 1 and Partition 2 with their transaction queues) Important to note that the data is only partially migrated!

Repeat Until Complete (figure: Partition 1 and Partition 2 with their transaction queues) Repeat chunking until complete. New transactions still take precedence.
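
Putting the last few slides together, a minimal Java sketch of the chunked async pull loop is shown below. The interfaces, chunk accounting, and the sleep-based spacing are illustrative assumptions, not Squall's implementation; the point is only the ordering: queued transactions run first, then at most one chunk is extracted, and the loop repeats until the range is fully migrated:

```java
import java.util.List;
import java.util.Queue;

// Minimal sketch (hypothetical interfaces, not Squall's implementation) of the
// chunked asynchronous pull loop: drain any queued transactions first, then
// extract at most one chunk of migrating tuples, and repeat until the range
// has been fully handed off to the destination.
public class AsyncChunkPuller {

    interface SourcePartition {
        Queue<Runnable> txnQueue();                    // pending transactions take precedence
        List<byte[]> extractNextChunk(long maxBytes);  // empty list when the range is exhausted
        void markRangeDirty();                         // migrating tuples must be tracked
    }

    interface DestinationPartition {
        void receiveChunk(List<byte[]> tuples);
    }

    private final long chunkSizeBytes;
    private final long delayBetweenPullsMillis;        // spacing between async pulls

    public AsyncChunkPuller(long chunkSizeBytes, long delayBetweenPullsMillis) {
        this.chunkSizeBytes = chunkSizeBytes;
        this.delayBetweenPullsMillis = delayBetweenPullsMillis;
    }

    public void migrate(SourcePartition source, DestinationPartition destination)
            throws InterruptedException {
        source.markRangeDirty();
        while (true) {
            // New transactions still take precedence over migration work.
            Runnable txn;
            while ((txn = source.txnQueue().poll()) != null) {
                txn.run();
            }
            List<byte[]> chunk = source.extractNextChunk(chunkSizeBytes);
            if (chunk.isEmpty()) {
                return;                                // this range is fully migrated
            }
            destination.receiveChunk(chunk);
            Thread.sleep(delayBetweenPullsMillis);     // space out pulls (tuned below)
        }
    }
}
```

The chunk size and the delay between pulls correspond to the two knobs discussed on the next slides.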

Sizing Chunks Static analysis is used to set chunk sizes; future work will set sizing and scheduling dynamically. (figure: impact of chunk size on a 10% reconfiguration during a YCSB workload)

Space Async Pulls Introduce a delay at the destination between new async pull requests. (figure: impact of the pull spacing on a 10% reconfiguration during a YCSB workload with an 8 MB chunk size)

Splitting Reconfigurations Split by pairs of source and destination. Example: if partition 1 is migrating W_ID 2 and 3 to partitions 3 and 7, execute it as two reconfigurations. If migrating large objects, split them and use distributed transactions.

Splitting into Sub-Plans Set a cap on the number of sub-plan splits, and split based on source/destination pairs and the ability to decompose migrating objects.
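
As a rough illustration of the splitting rule, the Java sketch below (hypothetical types, not Squall's planner) groups a plan's moves by (source, destination) pair and merges groups back together if the cap on sub-plans would be exceeded:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch (hypothetical types, not Squall's planner) of splitting one
// reconfiguration plan into sub-plans, one per (source, destination) partition
// pair, with a cap on how many sub-plans are produced in total.
public class ReconfigurationSplitter {

    static class Move {                 // one migrating object, e.g. a warehouse
        final int source, destination;
        final String objectId;
        Move(int source, int destination, String objectId) {
            this.source = source; this.destination = destination; this.objectId = objectId;
        }
    }

    /** Groups moves by (source, destination) pair; each group becomes its own sub-plan. */
    public static List<List<Move>> split(List<Move> plan, int maxSubPlans) {
        Map<String, List<Move>> byPair = new HashMap<>();
        for (Move m : plan) {
            byPair.computeIfAbsent(m.source + "->" + m.destination, k -> new ArrayList<>()).add(m);
        }
        List<List<Move>> subPlans = new ArrayList<>(byPair.values());
        // Respect the cap on sub-plan splits by merging the smallest groups back together.
        while (maxSubPlans > 0 && subPlans.size() > maxSubPlans) {
            subPlans.sort((a, b) -> Integer.compare(a.size(), b.size()));
            List<Move> merged = new ArrayList<>(subPlans.remove(0));
            merged.addAll(subPlans.remove(0));
            subPlans.add(merged);
        }
        return subPlans;
    }
}
```

Each sub-plan is then executed as an independent reconfiguration, which limits how much any single partition has to send or receive at once.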

All about trade-offs Trading off time to complete migration and performance degradation. Future work to consider automating this trade-off based on service level objectives.

Results Highlight (figure: TPC-C load balancing of hotspot warehouses)

YCSB Latency (figures: a 10% pairwise YCSB data shuffle, and a YCSB cluster consolidation from 4 to 3 nodes)

I Fell Asleep… What Happened? Skew happens; two-tiered partitioning for greedy load balancing and elastic growth helps. If you have to do work while migrating, be careful to break up the migrations and don’t be too needy on any one partition. We are thinking hard about skewed workloads that aren’t trivial to partition. Questions?