”Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters” Published In SIGMOD '07 By Yahoo! Senthil Nathan N IIT Bombay
Overview MapReduce Hadoop & MapReduce HadoopDB & MapReduce MapReduceMerge Implementations of Relational Operators
Overview MapReduce Hadoop & MapReduce HadoopDB & MapReduce MapReduceMerge Implementations of Relational Operators
MapReduce 100 lines of the given document Word starting with A to H Word starting with I to M Partition Reducer 1 Reducer 2 Reducer 3 split 2 split 1 split 4 split 3 Mapper 1 Mapper 2 Word Count Word starting with N to Z
MapReduce – Example (map) map(“/path/wikipedia.org”, “to be or as”) Output of map function is (potentially many) key/value pairs. In our case, output (word, “1”) once per word in the document “to”, “1” “be”, “1” “or”, “1” “to”, “1” “be”, “1” “or”, “1”
MapReduce – Example (Reduce) Each reducer read key/value pair from one partition, then sort and group the key/value pair In our case, compute the sum “be,1”, “as 1” “be 1” “be 1” “as 1” in partition 1 “or,1”, “to 1” “or 1” “to 1” “or 1” in partition 2 Reducer 1 – “be , [1,1]” “as, [1,1]” → “[be 2, as 2]” Reducer 2 – “or , [1,1]” “to, [1,1]” → “[or 2, to 2]”
MapReduce Schematic
Overview Hadoop & MapReduce MapReduce HadoopDB & MapReduce MapReduceMerge Implementations of Relational Operators
Input Format Implementation Hadoop & MapReduce Map Reduce Job Hadoop core HDFS Map Reduce Framework Name Node Job Tracker Master Node Input Format Implementation Worker Node Task Tracker Task Tracker Task Tracker Task Tracker Data Node Data Node Data Node Data Node
Input Format Implementation Hadoop & MapReduce Sql Query Map Reduce Job Hive Hadoop core HDFS Map Reduce Framework Name Node Job Tracker Master Node Input Format Implementation Worker Node Task Tracker Task Tracker Task Tracker Task Tracker Data Node Data Node Data Node Data Node
Relational Data Processing using MapReduce map(key emp_id, value bonus) reduce(key emp_id, value bonus_list)
Overview HadoopDB & MapReduce MapReduce Hadoop & MapReduce MapReduceMerge Implementations of Relational Operators
HadoopDB & MapReduce
Overview MapReduceMerge MapReduce Hadoop & MapReduce HadoopDB & MapReduce MapReduceMerge Implementations of Relational Operators
Why to extend MapReduce Well suitable for un-structured data but not for structured data (RDBS) Join is inefficient in MapReduce Strict 2-Phase overflow Map - first phase Reduce – second phase
Extension to MapReduce Add a merge phase to map reduce algorithm Allow efficient processing of multiple dataset (RDBMS) Merge is made up of several components Merge Function Processor Function Partition Selector Configurable Iterator
Components of Merge Merger → same principle as map and reduce Processor →processes data from one source Partition Selector →select the data that should go to the merger Configurable Iterator →how to iterate through each list as the merging is done
Implementation Overview
MapReduceMerge Example Join employee and department table and compute bonuses Employee Department
Example : Map(employee dataset) 1 B $150
Example : Map(department dataset) 1 B $150
Example : Reduce(employee dataset) 1 B $150
Example: Reduce(department dataset)
Example: Merge(employee, department) 1 B $150
Implementing Relational Algebra Operations Projection Aggregation Selection Set perations: Union, Intersection, Difference Cartesian Product Rename Join
Projection All we have to do is emit a subset of the data passed in. Just a mapper can do this map(key emp_id, value emp_info){ emit(emp_id); }
Aggregation By choosing appropriate keys, can implement “group by” and aggregate SQL operators in MapReduce. Reducer sort and group the intermediate key/value pair based on the key
Aggregation
Selection If selection condition involves only the attributes of one data source, can implement in mappers. map(key emp_id, value emp_info){ if(emp_info.dept_id = A) emit(emp_id, emp_info); }
Selection If it’s on aggregates or a group of values contained in one data source, can implement in reducers. If it involves attributes or aggregates from both data sources, implement in mergers
Set Union Let each of the two MapReduces emit a sorted list of unique elements Merges just iterate simultaneously over the lists: store the lesser value and increment its iterator, if there is a lesser value if the two are equal, store one of the two, and increment both iterators
Set Intersection Let each of the two MapReduces emit a sorted list of unique elements Merges just iterate simultaneously over the lists: if there is a lesser value, increment its iterator if the two are equal, store one of the two, and increment both iterators
Sort-Merge Join Map: partition records into key ranges according to the values of the attributes on which you’re sorting, aiming for even distribution of values to mappers. Reduce: sort the data. Merge: join the sorted data for each key range.
Conclusion Map-Reduce-Merge adds the ability to execute arbitrary relational algebra queries Next steps: develop SQL-like interface and a query optimizer
References Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters, OSDI '04 MapReduce: Simplified Data Processing on Large Clusters, SIGMOD '07 HadoopDB :An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. VLDB '09 Hive : A Warehousing Solution over a MapReduce Framework, VLDB 09