Presentation transcript:

The Gamma Database Machine. DeWitt, Ghandeharizadeh, Schneider, Bricker, Hsiao, Rasmussen. Presented by Deepak Bastakoty (with slide material sourced from Ghandeharizadeh and DeWitt). EECS 584, Fall 2011.

Outline: Motivation; Physical System Designs; Data Clustering; Failure Management; Query Processing; Evaluation and Results; Bonus: Current Progress, Hadoop, Clustera.

Motivation: What? –Parallelizing databases: how to set up the system and hardware? how to distribute the data? how to tailor the software? –Where does this rabbit hole lead?

Motivation: Why?

Motivation: Why? –Obtain faster response time –Increase query throughput –Improve robustness to failure –Reduce system cost –Enable massive scalability (faster, safer, cheaper, and more of it!)

Outline: Motivation; Physical System Designs; Data Clustering; Failure Management; Query Processing; Evaluation and Results; Bonus: Current Progress, Hadoop, Clustera.

System Design Schemes: –Shared Memory (DeWitt's DIRECT, etc.) –Shared Disk (storage networks) –Shared Nothing (Gamma, and everything else: Pig Latin, MapReduce setups, BigTable, GFS)

System Design Schemes: Shared Memory –poor scalability (in the number of nodes) –custom hardware / systems

System Design Schemes: Shared Disk –storage networks –used to be popular. Why?

Shared Disk advantages: high availability, data backup, and data sharing. –Many clients share storage and data. –Redundancy is implemented in one place, protecting all clients from disk failure. –Centralized backup: the administrator does not need to care how many clients on the network share the storage.

Storage Area Network (SAN): –block-level access –writes to storage are immediate –specialized hardware, including switches, host bus adapters, disk chassis, battery-backed caches, etc. –expensive –supports transaction processing systems. Network Attached Storage (NAS): –file-level access –writes to storage might be delayed –generic hardware –inexpensive –not appropriate for transaction processing systems.

Shared Nothing: each node has its own processor(s), memory, and disk(s). (Diagram: nodes 1 through M, each with its own CPUs, DRAM, and disks, connected only by an interconnection network.)

Shared Nothing: Why? –Low-bandwidth parallel data processing –Commodity hardware (cheap!) –Minimal sharing = minimal interference –Hence: scalability (keep adding nodes!) –DeWitt says even writing code is simpler.

Outline: Motivation; Physical System Designs; Data Clustering; Failure Management; Query Processing; Evaluation and Results; Bonus: Current Progress, Hadoop, Clustera.

Declustering (no shared data). Logical view of the Emp relation:
name    age  salary
Bob     20   10K
Shideh  18   35K
Ted     50   60K
Kevin   62   120K
Angela  55   140K
Mike    45   90K
Physical view: the rows are spread across the nodes and their disks.

Spreading data between disks: –Attribute-less partitioning: Random, Round-Robin –Single-attribute schemes: Hash declustering, Range declustering –Multi-attribute schemes are possible (MAGIC, BERD, etc.)

Hash Declustering: salary is the partitioning attribute, and each row is assigned to a node by hashing it (salary % 3). One node receives Ted (50, 60K) and Kevin (62, 120K); another receives Bob (20, 10K) and Mike (45, 90K); the third receives Shideh (18, 35K) and Angela (55, 140K).

Selections with equality predicates referencing the partitioning attribute are directed to a single node, e.g. retrieve Emp where salary = 60K (SELECT * FROM Emp WHERE salary=60K). Equality predicates referencing a non-partitioning attribute, and all range predicates, are directed to every node, e.g. retrieve Emp where age = 20, or retrieve Emp where salary < 20K (SELECT * FROM Emp WHERE salary<20K).
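Below is a minimal sketch of hash declustering and predicate routing; the three-node count, the (name, age, salary) record layout, and the use of Python's built-in hash() in place of Gamma's hash function are illustrative assumptions, not Gamma's actual code.

```python
# Hash declustering sketch: three hypothetical nodes, salary is the partitioning attribute.
NODES = 3

def home_node(salary):
    # Gamma hashes the partitioning attribute; Python's built-in hash() stands in here.
    return hash(salary) % NODES

emp = [("Bob", 20, 10_000), ("Shideh", 18, 35_000), ("Ted", 50, 60_000),
       ("Kevin", 62, 120_000), ("Angela", 55, 140_000), ("Mike", 45, 90_000)]

partitions = {n: [] for n in range(NODES)}
for name, age, salary in emp:
    partitions[home_node(salary)].append((name, age, salary))

# Equality predicate on the partitioning attribute: route to exactly one node.
target = home_node(60_000)
print("salary = 60K -> node", target,
      [r for r in partitions[target] if r[2] == 60_000])

# Predicate on a non-partitioning attribute (or any range predicate): ask every node.
print("age = 20 -> all nodes:",
      [r for part in partitions.values() for r in part if r[1] == 20])
```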

Range Declustering: salary is the partitioning attribute, with ranges 0-50K, 51K-100K, and 101K and up. The 0-50K node receives Bob (20, 10K) and Shideh (18, 35K); the 51K-100K node receives Ted (50, 60K) and Mike (45, 90K); the 101K-and-up node receives Kevin (62, 120K) and Angela (55, 140K).

Equality and range predicates referencing the partitioning attribute are directed to a subset of the nodes, e.g. retrieve Emp where salary = 60K, or retrieve Emp where salary < 20K; in this example each of those two queries is directed to a single node. Predicates referencing a non-partitioning attribute are directed to all nodes.
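A matching sketch for range declustering, using the slide's split points; the rows and helper names are the same illustrative assumptions as in the hash sketch above. It shows why predicates on the partitioning attribute touch only the nodes whose ranges overlap the predicate.

```python
import bisect

# Range declustering sketch using the slide's split points: 0-50K, 51K-100K, 101K and up.
BOUNDS = [50_000, 100_000]       # upper bounds of node 0's and node 1's ranges

def node_for(salary):
    return bisect.bisect_left(BOUNDS, salary)

emp = [("Bob", 20, 10_000), ("Shideh", 18, 35_000), ("Ted", 50, 60_000),
       ("Kevin", 62, 120_000), ("Angela", 55, 140_000), ("Mike", 45, 90_000)]

partitions = {0: [], 1: [], 2: []}
for row in emp:
    partitions[node_for(row[2])].append(row)

# Equality predicate on the partitioning attribute: exactly one node.
print("salary = 60K -> node", node_for(60_000))

# Range predicate on the partitioning attribute: only the nodes whose ranges
# overlap the predicate (here a single node, matching the slide's remark).
lo, hi = 0, 19_999               # salary < 20K
print("salary < 20K -> nodes", list(range(node_for(lo), node_for(hi) + 1)))
```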

Declustering Tradeoffs. (Figure: throughput in queries/second versus multiprogramming level, comparing Range against Hash/Random/Round-robin, for a range selection predicate using a clustered B+-tree with 0.01% selectivity, i.e. 10 records.)

Declustering Tradeoffs. (Figure: throughput in queries/second versus multiprogramming level, comparing Range against Hash/Random/Round-robin, for a range selection predicate using a clustered B+-tree with 1% selectivity, i.e. 1000 records.)

Declustering Tradeoffs: why the difference?

Declustering Tradeoffs: why the difference? When the selection was small, Range declustering sent each query to a single node, spreading the overall load across the nodes, and was ideal. When the selection grew, Range concentrated the workload on one or a few nodes, while Hash spread the load across all of them.

Outline: Motivation; Physical System Designs; Data Clustering; Failure Management; Query Processing; Evaluation and Results; Bonus: Current Progress, Hadoop, Clustera.

Failure Management: key questions –Robustness (how much damage is recoverable?) –Availability (how likely is the data to remain available? is hot recovery possible?) –MTTR (mean time to recovery). Consider two declustering schemes: –Interleaved Declustering (Teradata) –Chained Declustering (Gamma).

Interleaved Declustering: a partitioned table has a primary and a backup copy. The primary copy is constructed using one of the partitioning techniques. The backup copy is constructed by –dividing the nodes into clusters (cluster size 4 here), and –partitioning each primary fragment (e.g. R0) across the remaining nodes of its cluster (1, 2, and 3), yielding subfragments r0.0, r0.1, and r0.2.

Interleaved Declustering: on a failure, the query load is redirected to the backup nodes in the cluster. MTTR involves –replacing the failed node, –reconstructing the failed primary from its backups, and –reconstructing the backups that were stored on the failed node. A second failure before this completes can cause data unavailability. A larger cluster size improves load balancing after a failure but increases the risk of data becoming unavailable.
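A small placement sketch of interleaved declustering: the cluster size of 4 and the fragment naming follow the slide's example, while the round-robin assignment of subfragments to nodes is my own assumption for illustration.

```python
CLUSTER_SIZE = 4   # assumption: clusters of 4 nodes, as in the slide's example

def backup_nodes(owner):
    """Which nodes of the owner's cluster hold the subfragments of R<owner>'s backup copy."""
    cluster_start = (owner // CLUSTER_SIZE) * CLUSTER_SIZE
    members = [cluster_start + i for i in range(CLUSTER_SIZE)]
    offset = owner - cluster_start
    # R<owner> is split into CLUSTER_SIZE - 1 subfragments r<owner>.0 .. r<owner>.2,
    # one on each of the other nodes of the cluster, in order.
    return {f"r{owner}.{j}": members[(offset + 1 + j) % CLUSTER_SIZE]
            for j in range(CLUSTER_SIZE - 1)}

for owner in range(4):                       # cluster {0, 1, 2, 3}
    print(f"R{owner} backup subfragments:", backup_nodes(owner))
```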

Chained Declustering (Gamma): nodes are divided into disjoint groups called relation clusters. A relation is assigned to one relation cluster, and its records are declustered across the nodes of that relation cluster using a partitioning strategy (range, hash). Given a primary fragment Ri, its backup copy is assigned to node (i+1) mod M, where M is the number of nodes in the relation cluster.
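A sketch of that placement rule; M = 8 is an assumed relation-cluster size, the rule itself is the one stated above.

```python
M = 8   # assumed relation-cluster size; the rule works for any M

def chained_placement(i):
    """Nodes holding primary fragment R_i and its full backup copy r_i."""
    return {"primary": i, "backup": (i + 1) % M}

for i in range(M):
    p = chained_placement(i)
    print(f"R{i}: primary on node {p['primary']}, backup r{i} on node {p['backup']}")
```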

Chained Declustering (Gamma), during normal operation: –read requests are directed to the fragments of the primary copy, –write requests update both the primary and the backup copies.

Chained Declustering (Gamma), in the presence of a failure: –both primary and backup fragments are used for read operations (objective: balance the load and avoid bottlenecks), –write requests update both primary and backup copies. Note: the read load of R1 (on the failed node 1) is pushed to node 2 in its entirety; a fraction of the read requests normally served by each surviving node is then pushed along the chain to the next node, so node 1's failure costs each node only about a 1/8 load increase in this example.

Chained Declustering (Gamma): MTTR involves –replacing node 1 with a new node, –reconstructing R1 on node 1 (from its backup r1 on node 2), and –reconstructing the backup copy of R0 (i.e. r0) on node 1. Note: once node 1 becomes operational again, primary copies are used to process read requests.

Chained Declustering (Gamma): the failure of any two non-adjacent nodes in a relation cluster does not result in data unavailability; two adjacent nodes must fail for data to become unavailable.

Outline: Motivation; Physical System Designs; Data Clustering; Failure Management; Query Processing; Evaluation and Results; Bonus: Current Progress, Hadoop, Clustera.

Query Processing: the general idea is divide and conquer.

Parallelizing the Hash Join. A hash join of tables A and B on attribute j (A.j = B.j) consists of two phases: 1. Build phase: build a main-memory hash table on table A using the join attribute j, e.g. build a hash table on the Dept table using dno as the key. 2. Probe phase: scan table B one record at a time and use its attribute j to probe the hash table built on table A, e.g. probe the hash table with the rows of the Emp table.
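A minimal single-node sketch of those two phases; the row layouts are hypothetical and simplified, and the values echo the example used on the following slides.

```python
from collections import defaultdict

# Hypothetical row layouts: Dept(dno, dname, floor, mgrss#), Emp(ss#, name, dno).
dept = [(1, "Toy", 1, 5), (2, "Shoe", 2, 1)]                      # inner (build) table
emp  = [(1, "Joe", 2), (2, "Mary", 1), (3, "Bob", 1),
        (4, "Kathy", 2), (5, "Shideh", 1)]                         # outer (probe) table

# Build phase: main-memory hash table on Dept, keyed by the join attribute dno.
hash_table = defaultdict(list)
for row in dept:
    hash_table[row[0]].append(row)

# Probe phase: scan Emp once and probe the hash table with each row's dno.
for e in emp:
    for d in hash_table.get(e[2], []):
        print(e + d)
```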

Parallelism and Hash-Join. (Figure: R join S, where R is the inner table.)

Example: Emp join Dept using dno. Emp(SS#, Name, Age, Salary, dno) has rows for Joe, Mary, Bob, Kathy, and Shideh; Dept(dno, dname, floor, mgrss#) has rows (1, Toy, 1, 5) and (2, Shoe, 2, 1). The join pairs each Emp row with its matching Dept row: Joe and Kathy (dno 2) join with Shoe, while Mary, Bob, and Shideh (dno 1) join with Toy.

Hash-Join, build phase: read the rows of the Dept table one at a time and place them in a main-memory hash table, hashing on dno % 7. The buckets now hold (1, Toy, 1, 5) and (2, Shoe, 2, 1).

Hash-Join, probe phase: read the rows of the Emp table and probe the hash table, starting with the first row (Joe).

Hash-Join, probe phase: probe the hash table with each Emp row and produce a result row whenever a match is found, e.g. Joe (dno 2) matches the Shoe department. Repeat until all rows are processed.

Hash-Join when the build table does not fit in memory (slide for exam review). Problem: the table used to build the hash table is larger than memory. A divide-and-conquer approach: –Use the inner table (Dept) to construct n memory buckets, where each bucket is a hash table. –Every time memory is exhausted, spill a fixed number of buckets to disk. –The build phase terminates with a set of in-memory buckets and a set of disk-resident buckets. –Read the outer relation (Emp) and probe the in-memory buckets for joining records; records that map onto disk-resident buckets are streamed and stored to disk. –Discard the in-memory buckets to free memory. –While disk-resident buckets of the inner relation exist: read as many (say i) of them into memory as possible, read the corresponding buckets of the outer relation (Emp) to probe them for joining records, discard the in-memory buckets to free memory, and delete the i processed buckets of the inner and outer relations.
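A simplified sketch of the divide-and-conquer idea, assuming a Grace-style variant in which every bucket is partitioned up front and the corresponding buckets are joined pair by pair; the slide's hybrid version additionally keeps some buckets memory-resident, and the in-memory dictionaries below stand in for disk files.

```python
from collections import defaultdict

N_BUCKETS = 2   # assumed bucket count; real systems size this from available memory

def partition(table, key_index):
    """Hash-partition a table into N_BUCKETS buckets on its join attribute."""
    buckets = defaultdict(list)
    for row in table:
        buckets[hash(row[key_index]) % N_BUCKETS].append(row)
    return buckets

dept = [(1, "Toy", 1, 5), (2, "Shoe", 2, 1)]                      # inner table
emp  = [(1, "Joe", 2), (2, "Mary", 1), (3, "Bob", 1),
        (4, "Kathy", 2), (5, "Shideh", 1)]                         # outer table

inner_buckets = partition(dept, 0)   # Dept.dno is column 0
outer_buckets = partition(emp, 2)    # Emp.dno is column 2

# Join the corresponding buckets one pair at a time, so only one inner bucket
# ever has to fit in memory.
for b in range(N_BUCKETS):
    table = defaultdict(list)
    for row in inner_buckets.get(b, []):
        table[row[0]].append(row)
    for e in outer_buckets.get(b, []):
        for d in table.get(e[2], []):
            print(e + d)
```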

Hash-Join, build phase with two buckets of the Dept table (hashing on dno % 7): the bucket holding (1, Toy, 1, 5) stays in memory, while the bucket holding (2, Shoe, 2, 1) is disk-resident.

Hash-Join, probe phase: read the Emp table; rows with dno=1 probe the in-memory hash table for joining records, while rows with dno=2 are streamed to disk. (Emp rows: 1 Joe, 2 Mary, 3 Bob, 4 Kathy, 5 Shideh.)

Hash-Join, probe phase: the Emp rows with dno=1 probed the in-memory hash table and produced three joining records; the rows with dno=2 (Joe and Kathy) were written to the disk-resident Emp bucket.

Hash-Join, while loop: read the disk-resident bucket of Dept, i.e. (2, Shoe, 2, 1), into memory and probe it with the disk-resident bucket of Emp (Joe and Kathy) to produce the remaining two joining records.

Parallelism and Hash-Join. Each node may perform the hash join independently when: –the join attribute is the declustering attribute of the tables participating in the join, and –the participating tables are declustered across the same number of nodes using the same declustering strategy (the system may still re-partition the tables if the aggregate memory of a larger set of nodes exceeds that of the nodes the tables are declustered across). Otherwise, the data must be re-partitioned to perform the join operation correctly, as sketched below.
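A sketch of that re-partitioning step; the node count, row layouts, and initial fragment placement are illustrative assumptions. Every node hashes its local rows of both tables on the join attribute and routes each row to the join node the hash selects, after which each join node performs an ordinary local hash join on the rows it received.

```python
from collections import defaultdict

N_JOIN_NODES = 3   # assumed number of nodes chosen to execute the join

def repartition(fragments, key_index):
    """Send every row of every storage node's fragment to hash(join attribute) % N_JOIN_NODES."""
    incoming = defaultdict(list)
    for fragment in fragments:
        for row in fragment:
            incoming[hash(row[key_index]) % N_JOIN_NODES].append(row)
    return incoming

# Hypothetical initial declustering (one list per storage node) of Dept and Emp.
dept_fragments = [[(1, "Toy", 1, 5)], [(2, "Shoe", 2, 1)]]
emp_fragments  = [[(1, "Joe", 2), (2, "Mary", 1)],
                  [(3, "Bob", 1), (4, "Kathy", 2), (5, "Shideh", 1)]]

dept_at = repartition(dept_fragments, 0)   # Dept.dno is column 0
emp_at  = repartition(emp_fragments, 2)    # Emp.dno is column 2

# After re-partitioning, matching dno values are guaranteed to be on the same
# join node, so each node runs an ordinary local hash join independently.
for node in range(N_JOIN_NODES):
    table = defaultdict(list)
    for row in dept_at.get(node, []):
        table[row[0]].append(row)
    for e in emp_at.get(node, []):
        for d in table.get(e[2], []):
            print("node", node, "->", e + d)
```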

Parallelism and Hash-Join. (Figure: R join S, where R is the inner table, executed across the nodes.)

Outline: Motivation; Physical System Designs; Data Clustering; Failure Management; Query Processing; Evaluation and Results; Bonus: Current Progress, Hadoop, Clustera.

System Evaluation: two key metrics. –Speedup: given a system with 1 node, does adding n nodes make the same job run n times faster? –Scaleup: given a system with 1 node, does the response time stay the same when both the number of nodes and the size of the problem grow n-fold?
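Stated as formulas (the notation is mine, not the paper's; T(n, s) denotes the response time of an n-node system on a problem of size s):

\[ \mathrm{speedup}(n) = \frac{T(1, s)}{T(n, s)}, \qquad \mathrm{scaleup}(n) = \frac{T(1, s)}{T(n, n \cdot s)} \]

Linear speedup means speedup(n) = n; linear scaleup means scaleup(n) stays at 1 as n grows.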

Ideal Parallelism

Speedup (Selection)

Scaleup (Selection)

Speedup (Join)

Scaleup (Join)

Outline: Motivation; Physical System Designs; Data Clustering; Failure Management; Query Processing; Overview of Results; Bonus: Current Progress, Hadoop, Clustera.

Are modern systems Gamma too? Ghandeharizadeh's 2009 summary: all of the systems below, like Gamma, are shared-nothing and run on thousands of nodes. –Google File System (GFS) –Google's Bigtable data model –Google's MapReduce framework –Yahoo's Pig Latin –Hadoop.

Old is the new new? DeWitt's Clustera: –a DBMS-centric cluster management system –evolved from Gamma –like MapReduce, but with full DBMS support and SQL optimization –outperforms Hadoop. (Figure: Clustera positioned against Condor, parallel SQL, and MapReduce along axes of data complexity and job complexity.)

Old is the New New?

Outline: Motivation; Physical System Designs; Data Clustering; Failure Management; Query Processing; Overview of Results; Bonus: Current Progress, Hadoop, Clustera. Questions?