Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters
Hung-chih Yang (1), Ali Dasdan (1), Ruey-Lung Hsiao (2), D. Stott Parker (2)
(1) Yahoo!  (2) Computer Science Department, UCLA
SIGMOD 2007, Beijing, China
Presented by Jongheum Yeon, 2009. 08. 13.
Outline
- Introduction
- Map-Reduce
- Map-Reduce-Merge
- Conclusions
Introduction
- New data-processing systems should consider alternatives to big, traditional databases
- Map-Reduce does a good job, in a limited context, with extraordinary simplicity
- Map-Reduce-Merge tries to extend that applicability without giving up too much simplicity
Introduction (cont’d)
[Figure: software stacks by layer, one column per system family]
Application: SQL | Sawzall | ≈SQL | LINQ, SQL
Language:    Parallel Databases | Sawzall | Pig, Hive | DryadLINQ, Scope
Execution:   Map-Reduce | Hadoop | Dryad
Storage:     GFS, BigTable | HDFS, S3 | Cosmos, Azure, SQL Server
Map-Reduce : Motivation
- Many special-purpose tasks operate on and produce large amounts of data
  - Inputs: crawled documents, web requests, etc.
  - Outputs: inverted indices, summaries, other kinds of derived data
- The work needs to be distributed across a large number of machines to finish in a reasonable time
  - Parallelize the computation
  - Distribute the data
- These extra concerns obscure the original computation
Map-Reduce : Benefits
- Automatic parallelization and distribution
- Reduced complexity and size of user code
- Transparent fault tolerance
- I/O scheduling
- Fine-grained partitioning of tasks
- Tasks dynamically scheduled on available workers
- Status and monitoring
Map-Reduce : Programming Model
- Input and output: each a set of key/value pairs
- The programmer specifies two functions:
  - map (in_key, in_value) -> list (out_key, intermediate_value)
    Processes an input key/value pair and produces a set of intermediate pairs
  - reduce (out_key, list(intermediate_value)) -> list (out_value)
    Produces a set of merged output values (usually just one)
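The two signatures above can be tried out in a single process. The following is a minimal sketch, not the distributed implementation: `run_map_reduce`, `wc_map`, and `wc_reduce` are hypothetical names, and word counting is chosen here only as a familiar illustration.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to every input pair, group by intermediate key, then reduce."""
    groups = defaultdict(list)
    for in_key, in_value in records:
        for out_key, intermediate_value in map_fn(in_key, in_value):
            groups[out_key].append(intermediate_value)
    # One reduce call per distinct intermediate key.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


def wc_map(doc_name, text):
    # map (in_key, in_value) -> list (out_key, intermediate_value)
    return [(word, 1) for word in text.split()]


def wc_reduce(word, counts):
    # reduce (out_key, list(intermediate_value)) -> list (out_value)
    return [sum(counts)]


word_counts = run_map_reduce([("d1", "a b a"), ("d2", "b c")], wc_map, wc_reduce)
```

Each reduce call returns a list to match the `list (out_value)` signature, even though it holds a single value here.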
Map-Reduce : Data Flow
[Figure: Data -> Map -> Reduce]
Map-Reduce : Data Flow
- Map: generates a new key and its value
- Reduce: integrates the values that share the same key
- Example: Map turns (Key1, Value1) into (KeyA, ValueX), (KeyB, ValueY), (KeyB, ValueZ); Reduce outputs A=X and B=Y,Z
Map-Reduce : Architecture
[Figure: one Master coordinating several Workers that run Map and Reduce tasks, with data stored in GFS]
Map-Reduce : Architecture
- Master
  - Assigns each map/reduce task and maintains its state
  - Propagates the locations of intermediate files to reduce tasks
- Worker
  - Executes Map or Reduce tasks at the request of the Master
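Map-Reduce : Distributed Processing
- The input file is split into M pieces (Input 1 ... Input M), each processed by one Map task
- Each Map task writes an intermediate file divided into R partitions (1 ... R)
- Shuffle delivers the matching partition of every intermediate file to each Reduce task
- The R Reduce tasks write the output file as Output 1 ... Output R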
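How a key ends up in one of the R partitions is not stated on the slide. A common choice is to hash the key modulo R, which is the assumption in the sketch below; `partition` and `shuffle` are hypothetical helper names, and `zlib.crc32` is used only because it is stable across runs.

```python
import zlib


def partition(key, R):
    """Assumed partition function: stable hash of the key, modulo R."""
    return zlib.crc32(str(key).encode("utf-8")) % R


def shuffle(map_outputs, R):
    """Group the pairs emitted by all map tasks into R reduce-side buckets."""
    buckets = [dict() for _ in range(R)]
    for pairs in map_outputs:          # one list of pairs per map task
        for key, value in pairs:
            buckets[partition(key, R)].setdefault(key, []).append(value)
    return buckets


# Output of two map tasks (M = 2), shuffled to three reduce tasks (R = 3).
map_outputs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
buckets = shuffle(map_outputs, R=3)
```

Because the partition depends only on the key, all values for one key reach the same reduce task regardless of which map task emitted them.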
Map-Reduce : Example
Inverted Index
Documents:
  DocID=1: IDS lab's page
  DocID=2: IDB lab's page
Word dictionary:
  Word   wordID
  lab    101
  's     201
  page   203
  IDS    301
  IDB    302
Inverted index columns: wordID, docID, Location
Map-Reduce : Example (cont’d)
Input data to Map:
  Key (docID)   Value (Text)
  1             IDS lab's page
  2             IDB lab's page
Output of Map for docID 1:
  Key (wordID)  Value (docID:Location)
  301           1:0
  101           1:1
  201           1:2
  203           1:3
Output of Map for docID 2:
  Key (wordID)  Value (docID:Location)
  302           2:0
  101           2:1
  201           2:2
  203           2:3
Map-Reduce : Example (cont’d)
Shuffle: collects the values that share a key and conveys them to Reduce
  Key (wordID)  Value (docID:Location)
  101           1:1 2:1
  201           1:2 2:2
  203           1:3 2:3
  301           1:0
  302           2:0
Reduce writes the final result:
  101=1:1, 2:1
  201=1:2, 2:2
  203=1:3, 2:3
  301=1:0
  302=2:0
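The inverted-index walkthrough can be reproduced in a few lines. This is a sketch under stated assumptions: the two example pages are rendered in English and tokenized as `IDS`, `lab`, `'s`, `page`, the word IDs are copied from the example dictionary, and `index_map` and `index_reduce` are hypothetical names.

```python
from collections import defaultdict

# Word dictionary from the example (English rendering of the tokens).
WORD_IDS = {"lab": 101, "'s": 201, "page": 203, "IDS": 301, "IDB": 302}


def index_map(doc_id, text):
    """Emit (wordID, (docID, location)) for every token in the document."""
    return [(WORD_IDS[word], (doc_id, loc)) for loc, word in enumerate(text.split())]


def index_reduce(word_id, postings):
    """Collect all postings for one wordID in docID order."""
    return sorted(postings)


docs = [(1, "IDS lab 's page"), (2, "IDB lab 's page")]

groups = defaultdict(list)              # shuffle: group postings by wordID
for doc_id, text in docs:
    for word_id, posting in index_map(doc_id, text):
        groups[word_id].append(posting)

inverted_index = {w: index_reduce(w, p) for w, p in sorted(groups.items())}
```

Each posting `(docID, location)` corresponds to one `docID:Location` value in the tables above.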
Map-Reduce : Example (cont’d)
Other Examples
- Distributed Grep
- Count of URL Access Frequency: Map emits <URL, 1>; Reduce emits <URL, total count>
- Reverse Web-Link Graph: Map emits <target, source>; Reduce emits <target, list(source)>
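As an illustration of the reverse web-link graph, the sketch below emits `<target, source>` pairs in the map step and collects `<target, list(source)>` in the reduce step. The page names and the helper names `links_map` and `links_reduce` are made up for this example.

```python
from collections import defaultdict


def links_map(source, targets):
    """For each outgoing link, emit <target, source>."""
    return [(target, source) for target in targets]


def links_reduce(target, sources):
    """Emit the list of pages that link to target."""
    return sorted(sources)


# Hypothetical input: page -> pages it links to.
pages = [("a.html", ["b.html", "c.html"]), ("b.html", ["c.html"])]

groups = defaultdict(list)
for source, targets in pages:
    for target, src in links_map(source, targets):
        groups[target].append(src)

reverse_links = {t: links_reduce(t, s) for t, s in groups.items()}
```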
Map-Reduce-Merge
- Map-Reduce is an extremely simple model, but it applies in a limited context
- Map-Reduce mainly handles homogeneous datasets
- Relational operators, especially joins, are hard to implement with Map-Reduce
- Map-Reduce-Merge tries to keep the simplicity of Map-Reduce while extending it to be more complete
Map-Reduce-Merge
- Adds a merge phase to the Map-Reduce algorithm
- Allows processing of multiple heterogeneous datasets
- Like Map and Reduce, the Merge phase is implemented by the developer
- Example:
  - Two datasets: department and employee
  - Goal: compute each employee's bonus based on individual rewards and the department bonus adjustment
Map-Reduce-Merge Example Match keys on dept_id in tables
Map-Reduce-Merge: Extending Map-Reduce
The reduce phase changes and a merge phase is added.
Map-Reduce phases:
  1. Map: (k1, v1) → [(k2, v2)]
  2. Reduce: (k2, [v2]) → [v3]
Map-Reduce-Merge phases:
  1. Map: (k1, v1) → [(k2, v2)]
  2. Reduce: (k2, [v2]) → (k2, [v3])
  3. Merge: ((k2, [v3]), (k3, [v4])) → (k4, [v5])
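To make the three signatures concrete, here is a single-process sketch of the employee bonus example. The record layouts, the sample values, the multiplicative adjustment, and the function names (`reduce_employees`, `reduce_departments`, `merge`) are all assumptions made for illustration; the slides only say that the two datasets are matched on dept_id.

```python
from collections import defaultdict


def reduce_employees(rows):
    """Map+Reduce on employee: sum rewards per employee, keyed by dept_id.

    rows are (emp_id, dept_id, reward); result is {dept_id: [(emp_id, total)]}.
    """
    totals = defaultdict(float)
    for emp_id, dept_id, reward in rows:
        totals[(dept_id, emp_id)] += reward
    by_dept = defaultdict(list)
    for (dept_id, emp_id), total in sorted(totals.items()):
        by_dept[dept_id].append((emp_id, total))
    return dict(by_dept)


def reduce_departments(rows):
    """Map+Reduce on department: {dept_id: [bonus_adjustment]}."""
    return {dept_id: [adjustment] for dept_id, adjustment in rows}


def merge(emp_out, dept_out):
    """Merge: match the two reduced outputs on dept_id and emit {emp_id: bonus}."""
    bonuses = {}
    for dept_id, employees in emp_out.items():
        if dept_id not in dept_out:
            continue                    # no adjustment known for this department
        adjustment = dept_out[dept_id][0]
        for emp_id, total in employees:
            bonuses[emp_id] = total * adjustment  # assumed: adjustment is a multiplier
    return bonuses


employees = [(1, "B", 50.0), (1, "B", 30.0), (2, "A", 100.0)]
departments = [("A", 2.0), ("B", 1.5)]
bonuses = merge(reduce_employees(employees), reduce_departments(departments))
```

Here both reduce outputs are keyed by dept_id (k2 and k3), while the merged output is keyed by emp_id (k4), which is why the merge signature allows the key to change.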
Map-Reduce-Merge
Additional user-definable operations
- Merger: analogous to the map and reduce definitions; defines the logic of the merge operation
- Processor: processes the data from an individual source
- Partition selector: decides which data should go to which merger
- Configurable iterator: defines how to step through each of the lists as the merge is done
Map-Reduce-Merge : Relational Data Processing
Relational operators can be implemented using the Map-Reduce-Merge model. These include:
- Projection
- Aggregation
- Generalized selection
- Joins
- Set union
- Set intersection
- Set difference
- Etc.
Map-Reduce-Merge : Example, Set Union
- Each of the two Map-Reduces emits a sorted list of unique elements
- The Merge combines the two lists by iterating as follows:
  - Store the smaller of the two current values and advance its iterator by one
  - If the two values are equal, store one of them and advance both iterators
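The iteration rule for set union can be sketched as a two-pointer merge. `merge_union` is a hypothetical name, and copying the leftover tail once one list is exhausted is an added detail that the slide does not spell out.

```python
def merge_union(a, b):
    """Union of two sorted lists of unique elements."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])    # a holds the smaller value: store it, advance a
            i += 1
        elif b[j] < a[i]:
            out.append(b[j])    # b holds the smaller value: store it, advance b
            j += 1
        else:
            out.append(a[i])    # equal: store one copy, advance both
            i += 1
            j += 1
    out.extend(a[i:])           # whatever remains is already sorted and unique
    out.extend(b[j:])
    return out


union = merge_union([1, 3, 5], [2, 3, 6, 7])
```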
Map-Reduce-Merge : Example, Set Difference
- Given two sets, A and B, we want to compute A-B
- Each of the two Map-Reduces emits a sorted list of unique elements
- The Merge iterates over the two lists simultaneously:
  - If A's value is less than B's, store A's value and advance A's iterator
  - If B's value is smaller, advance B's iterator
  - If the two are equal, advance both iterators
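The same two-pointer pattern covers set difference. In this sketch `merge_difference` is a hypothetical name; keeping the rest of A once B is exhausted is an added detail not stated on the slide.

```python
def merge_difference(a, b):
    """Elements of sorted list a that do not occur in sorted list b (A - B)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])    # a's value cannot appear later in b: keep it
            i += 1
        elif b[j] < a[i]:
            j += 1              # b's value is irrelevant to the result: skip it
        else:
            i += 1              # value is in both sets: drop it
            j += 1
    out.extend(a[i:])           # b is exhausted, so the rest of a survives
    return out


difference = merge_difference([1, 3, 5, 8], [3, 4, 5])
```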
Map-Reduce-Merge : Example, Sort-Merge Join
- Map: partitions records into mutually exclusive buckets by key range; each key range is assigned to a reducer
- Reduce: sorts the data in each bucket into a sorted set
- Merge: the merger joins the sorted data for each key range
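The three steps of the sort-merge join can be sketched in one process. The function names, the `boundaries` parameter used for range partitioning, and the sample records are assumptions made for this example; a real deployment would run each key range on a separate reducer and merger.

```python
import bisect


def range_partition(records, boundaries):
    """Map: place (key, value) records into mutually exclusive key-range buckets."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        buckets[bisect.bisect_right(boundaries, key)].append((key, value))
    return buckets


def merge_join(left, right):
    """Merge: join two lists of (key, value) pairs that are sorted by key."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif rk < lk:
            j += 1
        else:
            j_end = j                   # find the run of right rows with this key
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                for _, right_value in right[j:j_end]:
                    out.append((lk, left[i][1], right_value))
                i += 1
            j = j_end
    return out


def sort_merge_join(left, right, boundaries):
    joined = []
    left_buckets = range_partition(left, boundaries)
    right_buckets = range_partition(right, boundaries)
    for lb, rb in zip(left_buckets, right_buckets):
        # Reduce: sort each bucket; Merge: join the two sorted buckets.
        joined.extend(merge_join(sorted(lb), sorted(rb)))
    return joined


left = [(3, "c"), (1, "a"), (7, "g"), (3, "cc")]
right = [(3, "X"), (7, "Y"), (5, "Z"), (3, "W")]
joined = sort_merge_join(left, right, boundaries=[4])
```

Because both inputs use the same boundaries, matching keys always land in the same key range, so each range can be joined independently.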
Map-Reduce-Merge : Optimizations
- Map-Reduce already optimizes using locality and backup tasks
- Optimize the number of connections between the outputs of the reduce phase and the inputs of the merge phase (example: set intersection)
- Combine two phases into one (example: ReduceMerge)
Conclusions
- Map-Reduce-Merge allows us to work on heterogeneous datasets
- Map-Reduce-Merge supports joins, which Map-Reduce does not directly support
- Next step: develop an SQL-like interface and an optimizer that simplify the development of a Map-Reduce-Merge workflow