Download presentation
Presentation is loading. Please wait.
Published byFrederica Byrd Modified over 9 years ago
1
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA) SIGMOD 2007 (Industrial) Presented by Kisung Kim 2010. 7. 14
2
Contents Introduction Map-Reduce Map-Reduce-Merge Applications to Relational Data Processing Case Study Conclusion
3
Introduction New challenges of data processing –A vast amount of data collected from the entire WWW Solutions of search engine companies –Customized parallel data processing systems –Use large clusters of shared-nothing commodity nodes –Ex) Google’s GFS, BigTable, MapReduce Ask.com’s Neptune Microsoft’s Dryad Yahoo!’s Hadoop
4
Introduction Properties of data-intensive systems –Simple Adopt only a selected subset of database principles –Sufficiently generic and effective –Parallel data processing system deployed on large clusters of shared-nothing commodity nodes –Refactoring of data processing into two primitives: Map function Reduce function Map-Reduce allow users not to worry about the nuisance details of: –Coordinating parallel sub-tasks –Maintaining distributed file storage\ This abstraction can greatly increase user productivity
5
Introduction Map-Reduce framework is best at handling homogeneous datasets –Ex) Joining multiple heterogeneous datasets does not quite fit into the Map-Reduce framework Extending Map-Reduce to process heterogeneous datasets simultaneously –Processing data relationships is ubiquitous –Join-enabled Map-Reduce system can provide a highly parallel yet cost effective alternative –Include relational algebra in the subset of the database principles Relational operators can be modeled using various combinations of the three primitives: Map, Reduce, and Merge
6
Map-Reduce Input dataset is stored in GFS Mapper –Read splits of the input dataset –Apply map function to the input records –Produce intermediate key/value sets –Partition the intermediate sets into # of reducers sets Reducer –Read their part of intermediate sets from mappers –Apply reduce function to the values of a same key –Output final results split mapper reducer map: (k1, v1) [(k2, v2)] reduce: (k2, [v2]) [v3] map: (k1, v1) [(k2, v2)] reduce: (k2, [v2]) [v3] Signatures of Map, Reduce Function Input Intermediate Sets Final Results
7
Join using Map-Reduce: Use homogenization procedure –Apply one map/reduce task on each dataset –Insert a data-source tag into every value –Extract a key attribute common for all heterogeneous datasets –Transformed datasets now have two common attributes Key and data-source Problems –Take lots of extra disk space and incur excessive map-reduce communications –Limited only to queries that can be rendered as equi-joins
8
Join using Map-Reduce: Homogenization KeyValue 101, “Value1” 851, “Value2” 3201, “Value3” KeyValue 102, “Value4” 542, “Value5” 3202, “Value6” map reduce map reduce map reduce Dataset 1 Dataset 2 Collect records with same key
9
Map-Reduce-Merge Signatures –α, β, γ represent dataset lineages –Reduce function produces a key/value list instead of just values –Merge function reads data from both lineages These three primitives can be used to implement the parallel version of several join algorithm map: (k1, v1) [(k2, v2)] reduce: (k2, [v2]) [v3] map: (k1, v1) [(k2, v2)] reduce: (k2, [v2]) [v3] Map-Reduce
10
Merge Modules Merge function –Process two pairs of key/values Processor function –Process data from one source only –Users can define two processor functions Partition selector –Determine from which reducers this merger retrieves its input data based on the merger number Configurable iterator –A merger has two logical iterators –Control their relative movement against each others
11
Merge Modules Partition Selector Processor Iterator Merge Reducer Output Reducers for 1 st DatasetReducers for 2 nd Dataset Reducer Output
12
Applications to Relational Data Processing Map-Reduce-Merge can be used to implement primitive and some derived relational operators –Projection –Aggregation –Generalized selection –Joins –Set union –Set intersection –Set difference –Cartesian product –Rename Map-Reduce-Merge is relationally complete, while being load- balanced, scalable and parallel
13
Example: Hash Join split mapper reducer merger split mapper reducer Use a hash partitioner Read from every mapper for one designated partition Read from two sets of reducer outputs that share the same hashing buckets One is used as a build set and the other probe
14
Case Study: TPC-H Query 2 Involves 5 tables, 1 nested query, 1 aggregate and group by clause, and 1 order by
15
Case Study: TPC-H Query 2 Map-Reduce-Merge workflow 13 passes of Map-Reduce-Merge 10 mappers, 10 reducers, and 4 mergers 6 passes of Map-Reduce-Merge 5 mappers, 4 reduce-merge-mappers, 1 reduce-mapper and 1 reducer Combining phases
16
Conclusion Map-Reduce-Merge programming model –Retain Map-Reduce’s many great features –Add relational algebra to the list of database principles it upholds –Contains several configurable components that enable many data- processing patterns Next step –Develop an SQL-like interface and an optimizer to simplify the process of developing a Map-Reduce-Merge workflow –This work can readily reuse well-studied RDBMS techniques
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.