Advanced Database Systems: DBS CB, 2nd Edition Advanced Topics of Interest: “MapReduce and SQL” & “SSD and DB”

Based on the text by Jimmy Lin and Chris Dyer, and on the Yahoo tutorial on MapReduce at index.html


2 Outline MapReduce and SQL SSD and SQL

3 MapReduce and SQL

Introduction It is all about divide and conquer 4 [Diagram: a piece of “Work” is partitioned among workers w1, w2, w3; their partial results r1, r2, r3 are combined into the “Result”.]

Introduction Different workers:  Different threads in the same core  Different cores in the same CPU  Different CPUs in a multi-processor system  Different machines in a distributed system Parallelization Problems:  How do we assign work units to workers?  What if we have more work units than workers?  What if workers need to share partial results?  How do we aggregate partial results?  How do we know all the workers have finished?  What if workers die? 5

Introduction General Themes:  Parallelization problems arise from: Communication between workers Access to shared resources (e.g., data)  Thus, we need a synchronization system!  This is tricky: Finding bugs is hard Solving bugs is even harder 6

Introduction Patterns for Parallelism:  Master/Workers  Producer/Consumer Flow  Work Queues 7 [Diagrams: a master dispatching to workers; producers P feeding consumers C; producers and consumers sharing a common queue.]

Introduction: Evolution Functional Programming MapReduce Google File System (GFS) 8

Introduction Functional Programming:  MapReduce = functional programming meets distributed processing on steroids Not a new idea… dates back to the 50’s (or even 30’s)  What is functional programming? Computation as application of functions Theoretical foundation provided by lambda calculus  How is it different? Traditional notions of “data” and “instructions” are not applicable Data flows are implicit in program Different orders of execution are possible  Exemplified by LISP and ML 9

Introduction: Lisp  MapReduce? What does this have to do with MapReduce? After all, Lisp is about processing lists Two important concepts in functional programming  Map: do something to everything in a list  Fold: combine results of a list in some way 10

Introduction: Map Map is a higher-order function How map works:  Function is applied to every element in a list  Result is a new list 11 [Diagram: a function f applied independently to each element of the list.]

Introduction: Fold Fold is also a higher-order function How fold works:  Accumulator set to initial value  Function applied to list element and the accumulator  Result stored in the accumulator  Repeated for every item in the list  Result is the final value in the accumulator 12 [Diagram: f combines each element with the accumulator, carrying it from the initial value to the final value.]
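In a modern language the two primitives above look like the following Python sketch (a minimal illustration, not tied to any MapReduce framework), using the built-in map and functools.reduce as the fold:

```python
from functools import reduce

nums = [1, 2, 3, 4, 5]

# Map: apply a function to every element, yielding a new list
squares = list(map(lambda x: x * x, nums))

# Fold: combine list elements into one value through an accumulator,
# starting from an initial value of 0
total = reduce(lambda acc, x: acc + x, squares, 0)

# squares is [1, 4, 9, 16, 25]; total is 55
```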

Lisp  MapReduce Let’s assume a long list of records: imagine if...  We can distribute the execution of map operations to multiple nodes  We have a mechanism for bringing map results back together in the fold operation That’s MapReduce! (and Hadoop) Implicit parallelism:  We can parallelize execution of map operations since they are isolated  We can reorder folding if the fold function is commutative and associative 13

Typical Problem Iterate over a large number of records Map: extract something of interest from each Shuffle and sort intermediate results Reduce: aggregate intermediate results Generate final output Key idea: provide an abstraction at the point of these two operations 14

MapReduce Programmers specify two functions: map (k, v) → * reduce (k’, v’) → *  All v’ with the same k’ are reduced together Usually, programmers also specify: partition (k’, number of partitions ) → partition for k’  Often a simple hash of the key, e.g. hash(k’) mod n  Allows reduce operations for different keys in parallel 15
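A default partition function of the kind described above can be sketched in Python (the name `partition` and the use of Python's built-in hash are our choices for illustration, not a fixed MapReduce API):

```python
def partition(key, num_partitions):
    # Hash the key and take it mod the number of reduce tasks, so
    # every occurrence of the same key routes to the same reducer.
    return hash(key) % num_partitions
```

Because the same key always hashes to the same bucket, all values for one key meet at one reducer, while different keys can be reduced in parallel.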

It’s just divide and conquer! 16 [Diagram: a data store feeds initial key-value pairs to parallel map tasks; each map emits (k1, values…), (k2, values…), (k3, values…); a barrier aggregates values by key; reduce tasks then produce the final values for k1, k2, and k3.]

Recall these problems? How do we assign work units to workers? What if we have more work units than workers? What if workers need to share partial results? How do we aggregate partial results? How do we know all the workers have finished? What if workers die? 17

MapReduce Runtime Handles data distribution  Gets initial data to map workers  Shuffles intermediate key-value pairs to reduce workers  Optimizes for locality whenever possible Handles scheduling  Assigns workers to map and reduce tasks Handles faults  Detects worker failures and restarts tasks Everything happens on top of GFS (later) 18

“Hello World”: Word Count 19

Map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

Reduce(String key, Iterator intermediate_values):
  // key: a word, same for input and output
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
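The pseudocode above can be imitated end-to-end in a single-process Python sketch; the helper names (map_fn, shuffle, reduce_fn) are ours, and the in-memory shuffle stands in for the framework's distributed group-by-key:

```python
from collections import defaultdict

def map_fn(doc_name, doc_contents):
    # Emit an intermediate (word, 1) pair for every word
    for word in doc_contents.split():
        yield (word, 1)

def shuffle(pairs):
    # Group all intermediate values by key, as the runtime does
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_fn(word, counts):
    # Sum the counts for one word
    return (word, sum(counts))

docs = {"d1": "the quick brown fox", "d2": "the lazy dog the end"}
pairs = [kv for name, text in docs.items() for kv in map_fn(name, text)]
counts = dict(reduce_fn(k, vs) for k, vs in shuffle(pairs).items())
# counts["the"] is 3
```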

Behind the scenes… 20

Bandwidth Optimizations Take advantage of locality  Move the process to where the data is! Use “Combiner” functions  Executed on same machine as mapper  Results in a “mini-reduce” right after the map phase  Reduces key-value pairs to save bandwidth When can you use combiners? 21
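Continuing in Python, a combiner is just the reducer's logic run locally on one mapper's output before anything is shuffled (the function name combine is ours); this is legal for word count because addition is commutative and associative:

```python
from collections import Counter

def combine(mapper_output):
    # Mini-reduce on a single mapper's (word, count) pairs: pre-sum
    # counts locally so fewer pairs cross the network in the shuffle.
    local = Counter()
    for word, count in mapper_output:
        local[word] += count
    return list(local.items())

one_mapper = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
combined = combine(one_mapper)
# 4 pairs shrink to 2: ("a", 3) and ("b", 1)
```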

Skew Problem Issue: reduce is only as fast as the slowest map Solution: redundantly execute map operations, use results of first to finish  Addresses hardware problems...  But not issues related to inherent distribution of data Data, Data, More Data  All of this depends on a storage system for managing all the data…  That’s where GFS (Google File System), and by extension HDFS in Hadoop, comes in 22

Assumptions High component failure rates  Inexpensive commodity components fail all the time “Modest” number of HUGE files  Just a few million (!!!)  Each is 100MB or larger; multi-GB files typical Files are write-once, mostly appended to  Perhaps concurrently Large streaming reads High sustained throughput favoured over low latency 23

GFS Design Decisions Files stored as chunks  Fixed size (64MB) Reliability through replication  Each chunk replicated across 3+ chunk servers Single master to coordinate access, keep metadata  Simple centralized management No data caching  Little benefit due to large data sets, streaming reads Familiar interface, but customize the API  Simplify the problem; focus on Google apps  Add snapshot and record append operations 24

GFS Architecture Single master Multiple chunk servers Can anyone see a potential weakness in this design? 25

Single master From distributed systems we know this is a  Single point of failure  Scalability bottleneck GFS solutions:  Shadow masters  Minimize master involvement Never move data through it, use only for metadata (and cache metadata at clients) Large chunk size Master delegates authority to primary replicas in data mutations (chunk leases) Simple, and good enough! 26

Metadata Global metadata is stored on the master  File and chunk namespaces  Mapping from files to chunks  Locations of each chunk’s replicas All in memory (64 bytes / chunk)  Fast  Easily accessible Master has an operation log for persistent logging of critical metadata updates  Persistent on local disk  Replicated  Checkpoints for faster recovery 27

Mutations Mutation = write or append  Must be done for all replicas Goal: minimize master involvement Lease mechanism:  Master picks one replica as primary; gives it a “lease” for mutations  Primary defines a serial order of mutations  All replicas follow this order  Data flow decoupled from control flow 28

Relaxed Consistency Model “Consistent” = all replicas have the same value “Defined” = replica reflects the mutation, consistent Some properties:  Concurrent writes leave region consistent, but possibly undefined  Failed writes leave the region inconsistent Some work has moved into the applications:  E.g., self-validating, self-identifying records  Google apps can live with it  What about other apps? 29

Master’s Responsibilities (1/2) Metadata storage Namespace management/locking Periodic communication with chunk servers  Give instructions, collect state, track cluster health Chunk creation, re-replication, rebalancing  Balance space utilization and access speed  Spread replicas across racks to reduce correlated failures  Re-replicate data if redundancy falls below threshold  Rebalance data to smooth out storage and request load 30

Master’s Responsibilities (2/2) Garbage Collection  Simpler, more reliable than traditional file delete  Master logs the deletion, renames the file to a hidden name  Lazily garbage collects hidden files Stale replica deletion  Detect “stale” replicas using chunk version numbers 31

Fault Tolerance High availability  Fast recovery: master and chunk servers restartable in a few seconds  Chunk replication: default 3 replicas  Shadow masters Data integrity  Checksum every 64KB block in each chunk 32

Parallelization Problems How do we assign work units to workers? What if we have more work units than workers? What if workers need to share partial results? How do we aggregate partial results? How do we know all the workers have finished? What if workers die? 33

Managing Dependencies Remember: Mappers run in isolation  You have no idea in what order the mappers run  You have no idea on what node the mappers run  You have no idea when each mapper finishes Question: what if your computation is a non-commutative operation on mapper results? Answer: Cleverly “hide” dependencies in the reduce stage  The reducer can hold state across multiple map operations  Careful choice of partition function  Careful choice of sorting function Example: computing conditional probabilities 34

Other things to beware of… Object creation overhead Reading in external resources is tricky  Possibility of creating hotspots in underlying file system 35

M/R Application: Cost Measures for Algorithms 1. Communication cost = total I/O of all processes. 2. Elapsed communication cost = max of I/O along any path. 3. (Elapsed) computation costs are analogous, but count only the running time of processes. 36

M/R Application: Example: Cost Measures For a map-reduce algorithm:  Communication cost = input file size + 2 × (sum of the sizes of all files passed from Map processes to Reduce processes) + the sum of the output sizes of the Reduce processes  Elapsed communication cost is the sum of the largest input + output for any map process, plus the same for any reduce process 37

M/R Application: What Cost Measures Mean Either the I/O (communication) or processing (computation) cost dominates.  Ignore one or the other. Total costs tell what you pay in rent from your friendly neighborhood cloud. Elapsed costs are wall-clock time using parallelism 38

Join By Map-Reduce Our first example of an algorithm in this framework is a map-reduce example Compute the natural join R(A,B) ⋈ S(B,C) R and S each are stored in files Tuples are pairs (a,b) or (b,c) Use a hash function h from B-values to 1..k. A Map process turns input tuple R(a,b) into key-value pair (b,(a,R)) and each input tuple S(b,c) into (b,(c,S)) 39

Map-Reduce Join – (2) Map processes send each key-value pair with key b to Reduce process h(b)  Hadoop does this automatically; just tell it what k is Each Reduce process matches all the pairs (b,(a,R)) with all (b,(c,S)) and outputs (a,b,c) 40
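The two slides above can be condensed into a single-process Python sketch; grouping by the key b stands in for routing key b to Reduce process h(b), and the relation tags "R"/"S" mirror the (b,(a,R)) and (b,(c,S)) pairs:

```python
from collections import defaultdict

R = [(1, 10), (2, 20)]                   # R(A,B) tuples (a, b)
S = [(10, "x"), (20, "y"), (30, "z")]    # S(B,C) tuples (b, c)

# Map phase: key every tuple by its B-value, tagged with its relation
pairs = [(b, ("R", a)) for a, b in R] + [(b, ("S", c)) for b, c in S]

# Shuffle: group by b (the framework sends key b to reducer h(b))
groups = defaultdict(list)
for b, tagged in pairs:
    groups[b].append(tagged)

# Reduce phase: pair each R-value with each S-value sharing that b
joined = [(a, b, c)
          for b, vals in groups.items()
          for tag1, a in vals if tag1 == "R"
          for tag2, c in vals if tag2 == "S"]
# b = 30 appears only in S, so it contributes nothing to the join
```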

Cost of Map-Reduce Join Total communication cost = O(|R|+|S|+|R ⋈ S|) Elapsed communication cost = O(s)  We’re going to pick k and the number of Map processes so I/O limit s is respected With proper indexes, computation cost is linear in the input + output size  So computation costs are like comm. costs 41

Three-Way Join We shall consider a simple join of three relations, the natural join R(A,B) ⋈ S(B,C) ⋈ T(C,D) One way: cascade of two 2-way joins, each implemented by map-reduce Fine, unless the 2-way joins produce large intermediate relations 42

Example: 3-Way Join Reduce processes use hash values of entire S(B,C) tuples as key Choose a hash function h that maps B- and C-values to k buckets There are k^2 Reduce processes, one for each (B-bucket, C-bucket) pair 43

Mapping for 3-Way Join We map each tuple S(b,c) to ((h(b), h(c)), (S, b, c)) We map each R(a,b) tuple to ((h(b), y), (R, a, b)) for all y = 1, 2,…,k We map each T(c,d) tuple to ((x, h(c)), (T, c, d)) for all x = 1, 2,…,k Aside: even normal map-reduce allows inputs to map to several key-value pairs. 44
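The map phase for the three-way join can be sketched in Python (k, h, and the map_* names are our illustrative choices); note that S emits one pair while R and T each emit k replicated pairs:

```python
k = 3  # buckets per attribute; there are k * k reducers in total

def h(v):
    # Toy hash from an attribute value into buckets 0..k-1
    return v % k

def map_S(b, c):
    # S(b,c) goes to exactly one reducer: cell (h(b), h(c))
    return [((h(b), h(c)), ("S", b, c))]

def map_R(a, b):
    # R(a,b) is replicated to every reducer in row h(b)
    return [((h(b), y), ("R", a, b)) for y in range(k)]

def map_T(c, d):
    # T(c,d) is replicated to every reducer in column h(c)
    return [((x, h(c)), ("T", c, d)) for x in range(k)]
```

Each reducer (i, j) then holds every R tuple with h(b)=i, every T tuple with h(c)=j, and the S tuples hashing to exactly (i, j), which is precisely what the reducers need to form the join.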

Assigning Tuples to Reducers 45 [Diagram: a k × k grid of reducers indexed by (h(b), h(c)). The tuple S(b,c) with h(b)=1, h(c)=2 goes to the single cell (1,2); R(a,b) with h(b)=2 is replicated across the whole row h(b)=2; T(c,d) with h(c)=3 is replicated down the whole column h(c)=3.]

Job of the Reducers Each reducer gets, for certain B-values b and C-values c: 1. All tuples from R with B = b, 2. All tuples from T with C = c, and 3. The tuple S(b,c) if it exists Thus it can create every tuple of the form (a, b, c, d) in the join 46

Semijoin Option A possible solution: first semijoin – find all the C-values in S(B,C) Feed these to the Map processes for R(A,B), so they produce only keys (b, y) such that y is in π_C(S) Similarly, compute π_B(S), and have the Map processes for T(C,D) produce only keys (x, c) such that x is in π_B(S) 47

Semijoin Option – (2) Problem: while this approach works, it is not a map-reduce process Rather, it requires three layers of processes: 1. Map S to π_B(S), π_C(S), and S itself (for join) 2. Map R and π_B(S) to key-value pairs and do the same for T and π_C(S) 3. Reduce (join) the mapped R, S, and T tuples 48

49 SSD and SQL

Problem Definition SSD has the potential of displacing disk; experiments by Oracle/HP and Teradata indicate great performance potential But placing the database on SSD instead of disk today is still a very expensive proposition (storage is ~60% of BI system cost, and Enterprise SSD cost = 10-15X disk, so the cost of a BI system based on SSD = 7-8X the BI cost using disk)! Wanted to experiment:  Is it possible to explore using SSD instead of disk in the DW internals in specific areas where we can get maximum ROI with minimal $$ increase to the overall system?  Assuming that SSD will take over in a few years, what are the implications on the DW architecture, especially with indices? 50

Areas of Interest Explored multiple SSD vendors and chose Fusion-io Explored the impact of using Fusion-io on the following DW areas:  Overflow handling  Secondary indices  Audit trail  As a DB secondary cache Possible SSD connectivity:  Dedicated SSD card per DB server  SSD pool on IB using the SRP protocol, shared among blades within an enclosure 51

Overflow Handling SORT> select [last 0] * from lineitem where L_ORDERKEY < 15,000,000 order by L_ORDERKEY; The lineitem table is 60 million rows We ran 1, 2, 4, and 8 streams using SSD or disk for overflow Measurements indicate a 5-8X performance gain over disk-based SQL overflow 52

Indexes Secondary Index> select [last 0] x.l_linenumber, y.l_linenumber from lineitem x, lineitem y where x.l_linenumber = y.l_linenumber and x.l_orderkey = y.l_orderkey and x.l_orderkey < 30,000,000; The lineitem table is 60 million rows, with a primary key and a secondary index Without reducing the outer table rows in the nested join, it takes forever (60 million * 60 million rows); we reduced it to 30,000,000 rows We ran the above query where the base table and main index are on disk and all secondary indexes are either on SSD or on disk Measurements indicate a few-hundred-percent performance gain over indexes on disk 53

Audit Trail Six tables, single partition and 200,000 rows each. Single Audit Trail process for the whole configuration. Each row is 1012 bytes. Each of the 2 blades has 3 tables Run Audit Trail as both disk- and SSD-based (direct attached, IB/IPC, IB/disk) as follows:  M updaters each update (900 bytes each) every record in a single partition in a single transaction, 50 times – 50 transactions, where M = 1, 2, 4, 6 Measurements indicate an order-of-magnitude improvement over disk-based audit trail 54

DW Cache Run the TPC-H SF10 benchmark (due to limited SSD) Run both database and index on SSD with TSE cache equal to 700 MB and 2 MB – overflow is not relevant with TPC-H Run both database and index on disk with TSE cache equal to 700 MB and 2 MB Using SSD for the DBMS with a very small TSE cache vs. disk for the database with a large buffer cache = 7X 55

Summary Clearly SSD is a “game-changing” technology Instead of putting the whole database on SSD, you can leverage SSD in specific subsystems of the DB, i.e., until the price comes down 56

57 END

M/R: Overview of Lisp Lisp ≠ Lost In Silly Parentheses We’ll focus on a particular dialect: “Scheme” Lists are primitive data types Functions written in prefix notation 58

'( )
'((a 1) (b 2) (c 3))
(+ 1 2) → 3
(* 3 4) → 12
(sqrt (+ (* 3 3) (* 4 4))) → 5
(define x 3) → x
(* x 5) → 15

M/R: Overview of Lisp Functions are defined by binding lambda expressions to variables 59

(define foo
  (lambda (x y)
    (sqrt (+ (* x x) (* y y)))))

Syntactic sugar for defining functions; the expression above is equivalent to:

(define (foo x y)
  (sqrt (+ (* x x) (* y y))))

Once defined, the function can be applied:

(foo 3 4) → 5

Recursion Simple factorial example 60

(define (factorial n)
  (if (= n 1)
      1
      (* n (factorial (- n 1)))))
(factorial 6) → 720

Even iteration is written with recursive calls!

(define (factorial-iter n)
  (define (aux n top product)
    (if (= n top)
        (* n product)
        (aux (+ n 1) top (* n product))))
  (aux 1 n 1))
(factorial-iter 6) → 720