Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters H.Yang, A. Dasdan (Yahoo!), R. Hsiao, D.S.Parker (UCLA) Shimin Chen Big Data.

Presentation transcript:

Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters. H. Yang, A. Dasdan (Yahoo!), R. Hsiao, D. S. Parker (UCLA). Presented by Shimin Chen, Big Data Reading Group.

Motivation. The Map-Reduce framework is “simplified” compared to a relational DBMS and was designed for data processing in search engines. Problem: joining multiple heterogeneous datasets does not quite fit into map-reduce. Ad-hoc solutions run map-reduce on one dataset while reading data from the other dataset on the fly.

Contribution. Goal: support relational algebra primitives without sacrificing the existing generality and simplicity of map-reduce. Proposal: map-reduce-merge.

Outline: Introduction; Map-Reduce; Map-Reduce-Merge; Applications to Relational Data Processing; Optimizations and Enhancements; Case Studies; Conclusion.

Let’s Refresh Our Memory. Map-reduce follows a functional programming model.
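As a refresher on that functional model, here is a minimal single-process sketch of map-reduce using word count. The names `map_fn`, `reduce_fn`, and `map_reduce` are illustrative assumptions, not part of any framework discussed in the talk; a real system would distribute the map, shuffle, and reduce steps across machines.

```python
from collections import defaultdict
from itertools import chain

def map_fn(doc_id, text):
    # map: (k1, v1) -> list of (k2, v2); here one (word, 1) pair per word
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # reduce: (k2, [v2]) -> (k2, v3); here the total count for the word
    return (word, sum(counts))

def map_reduce(records, mapper, reducer):
    # Shuffle step: group every intermediate value by its key,
    # which the framework would normally do between map and reduce.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(k, v) for k, v in records):
        groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]

docs = [("d1", "map reduce merge"), ("d2", "map reduce")]
word_counts = map_reduce(docs, map_fn, reduce_fn)
```

Both user functions are pure, which is what lets a coordinator re-run a failed task without side effects.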

Comments. Map-reduce runs on low-cost, unreliable commodity hardware. Failures often occur during map/reduce tasks, and the coordinator re-runs the failed mapper or reducer. Homogenization (for equi-joins): transform each dataset into (join key, payload) pairs, then apply map-reduce to merge entries from different datasets. Problems: this supports only equi-joins, may take a lot of extra disk space, and may incur excessive communication.
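The homogenization workaround can be sketched in a few lines. This is a single-process illustration under assumed record layouts; the source tags "L" and "R" and the helper names are hypothetical. Note how every record is copied into a tagged (join key, payload) form, which is where the extra space and communication come from.

```python
from collections import defaultdict

def homogenize(tag, records, key_index):
    # Rewrite every record as (join key, (source tag, payload)).
    return [(rec[key_index], (tag, rec)) for rec in records]

def join_reducer(key, tagged):
    # Split the group by source tag and pair up the two sides.
    left = [payload for tag, payload in tagged if tag == "L"]
    right = [payload for tag, payload in tagged if tag == "R"]
    return [(key, l, r) for l in left for r in right]

def homogenized_equi_join(left, left_key, right, right_key):
    # Shuffle: both homogenized datasets are grouped by the join key.
    groups = defaultdict(list)
    tagged_records = homogenize("L", left, left_key) + homogenize("R", right, right_key)
    for key, tagged in tagged_records:
        groups[key].append(tagged)
    out = []
    for key in sorted(groups):
        out.extend(join_reducer(key, groups[key]))
    return out

employees = [("e1", "d1"), ("e2", "d2"), ("e3", "d1")]   # (emp-id, dept-id)
departments = [("d1", "Search"), ("d2", "Ads")]          # (dept-id, name)
joined = homogenized_equi_join(employees, 1, departments, 0)
```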

Map-Reduce-Merge Primitives [figure not preserved; surviving labels: key, join]
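The slide's figure did not survive in the transcript. For orientation, the primitive signatures as stated in the Map-Reduce-Merge paper can be written as below, where α, β, and γ denote dataset lineages, k a key, and v a value. This restates the paper's notation and is not a recovery of what the slide showed.

```latex
\begin{align*}
\text{map}    &: (k_1, v_1)_\alpha \to [(k_2, v_2)]_\alpha \\
\text{reduce} &: (k_2, [v_2])_\alpha \to (k_2, [v_3])_\alpha \\
\text{merge}  &: \big((k_2, [v_3])_\alpha,\ (k_3, [v_4])_\beta\big) \to [(k_4, v_5)]_\gamma
\end{align*}
```

Map and reduce stay within one lineage; merge is the only primitive that combines two lineages into a third.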

Focusing on Merge. The merge phase takes two sets of inputs, each generated by multiple reducers. Four components answer four questions. Partition Selector: which α reducers and β reducers match? Iterator: how to get the next key-value pair? Processor: is there customized preprocessing for the inputs? Merger: what is the merging algorithm? All of these are customizable.

Example: Emp & Dept. Two tables: Employee and Department.

Partition Selector. LHS: reduce key is (dept-id, emp-id), partition key is dept-id. RHS: reduce key is dept-id, partition key is dept-id. Assuming the number of reducers is the same on both sides, LHS reducer K matches RHS reducer K.
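A small sketch of why reducer K on one side lines up with reducer K on the other: if both sides partition on dept-id with the same function and the same reducer count, matching records land on reducers with the same number. The `partition` function and its checksum are assumptions for illustration only; the talk does not specify the partitioner.

```python
def partition(partition_key, num_reducers):
    # Hypothetical partitioner: route a record to a reducer by its partition key.
    # A byte checksum is used because Python's built-in hash() is salted
    # per process for strings and would not be reproducible.
    return sum(partition_key.encode()) % num_reducers

def partition_selector(left_reducer, right_reducer):
    # Same partition key (dept-id) and same reducer count on both sides:
    # a merger only needs LHS reducer K together with RHS reducer K.
    return left_reducer == right_reducer

NUM_REDUCERS = 4
emp_reducer = partition("d42", NUM_REDUCERS)    # employee record keyed by dept-id
dept_reducer = partition("d42", NUM_REDUCERS)   # department record with the same dept-id
```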

Processor. Pre-processing for each input, e.g. building a hash table for a hash join. This example uses sort-merge, so the processor is empty.

Iterator for sort-merge

Merger
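The code for the sort-merge iterator and the merger is not preserved in the transcript. The following is a minimal sketch of the technique, assuming both inputs arrive sorted by dept-id. The function names and the merger's output shape are hypothetical, not the paper's API.

```python
def merger(dept_id, employee, department):
    # Hypothetical merger: combine one matching pair into an output record.
    return (employee, department)

def sort_merge(left, right):
    # left, right: lists of (key, value) pairs, each sorted by key.
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1                      # advance the side with the smaller key
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with
            # every left record that carries the same key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            while i < len(left) and left[i][0] == left_key:
                for _, right_value in right[j:j_end]:
                    out.append(merger(left_key, left[i][1], right_value))
                i += 1
            j = j_end
    return out

emps = [("d1", "alice"), ("d1", "bob"), ("d3", "carol")]
depts = [("d1", "Search"), ("d2", "Ads"), ("d3", "Mail")]
merged = sort_merge(emps, depts)
```

Each input is scanned once, which is why no processor step is needed in this case.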

Other Iterators. Nested-loop: for each (k,v) of the first input, read all of the second input; then rewind the second input and process the next (k,v) of the first input. Hash join: read all of one input, then read all of the other input.
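The two access patterns can be sketched as follows. This is an in-memory illustration with assumed function names; the "rewind" of the nested-loop iterator is modeled by restarting the inner `for` loop, and the hash table stands in for what a processor would build.

```python
from collections import defaultdict

def nested_loop_join(left, right, match):
    out = []
    for left_key, left_value in left:
        # Restarting this loop is the "rewind" of the second input.
        for right_key, right_value in right:
            if match(left_key, right_key):
                out.append((left_value, right_value))
    return out

def hash_join(left, right):
    # Build: read all of one input into a hash table.
    table = defaultdict(list)
    for left_key, left_value in left:
        table[left_key].append(left_value)
    # Probe: then read all of the other input.
    out = []
    for right_key, right_value in right:
        for left_value in table.get(right_key, []):
            out.append((left_value, right_value))
    return out

left = [(1, "a"), (2, "b")]
right = [(2, "x"), (3, "y"), (1, "z")]
equi_nested = nested_loop_join(left, right, lambda lk, rk: lk == rk)
equi_hash = hash_join(left, right)
less_than = nested_loop_join(left, right, lambda lk, rk: lk < rk)
```

The nested-loop form accepts an arbitrary predicate, so it also covers non-equi joins such as `less_than`, while the hash form is limited to equality.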

Relation. A relation R has an attribute set A. A is broken down into a key part K and a value part V.

Relational Operators. Generalized selection: choosing a subset of records; filtering can be done in the mapper, reducer, or merger. Projection: choosing a subset of attributes with a user-defined mapper (k,v) → (k’,v’). Aggregation: group-by is performed before reduce, so aggregation is easy to implement in the reducer. Joins (also set union, intersection, difference, cartesian product): sort-merge, hash join, nested-loop. Rename.
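As an illustration of how the first three operators map onto user functions, here is a small sketch over assumed employee rows of the form (emp-id, (dept-id, salary)). The function names and the data are hypothetical.

```python
from collections import defaultdict

def select_mapper(k, v):
    # Selection: filter in the mapper, keeping rows with salary above 50.
    return [(k, v)] if v[1] > 50 else []

def project_mapper(k, v):
    # Projection: (k, v) -> (k', v'), keeping dept-id as key and salary as value.
    dept_id, salary = v
    return [(dept_id, salary)]

def sum_reducer(k, values):
    # Aggregation: the framework has already grouped values by key.
    return (k, sum(values))

rows = [("e1", ("d1", 60)), ("e2", ("d1", 40)), ("e3", ("d2", 70)), ("e4", ("d1", 80))]

selected = [kv for k, v in rows for kv in select_mapper(k, v)]
projected = [kv for k, v in selected for kv in project_mapper(k, v)]

groups = defaultdict(list)          # group-by performed before reduce
for dept_id, salary in projected:
    groups[dept_id].append(salary)
totals = [sum_reducer(k, groups[k]) for k in sorted(groups)]
```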

Partition Selector. In general, the LHS has R1 reducers, the RHS has R2 reducers, and the operator is cartesian-product-like. Suppose R1 ≤ R2; use R1 mergers, where merger j selects input from LHS reducer j and input from all RHS reducers. Remote reads: R1*(1+R2) = R1 + R1*R2. Natural equi-join case: let R1 == R2 == R and use R mergers, where merger j selects LHS reducer j and RHS reducer j. Remote reads: 2*R.
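The remote-read counts are easy to check numerically. The sketch below uses hypothetical function names and simply evaluates the two formulas, counting one remote read per reducer output that a merger fetches.

```python
def remote_reads_cartesian(num_mergers, other_side_reducers):
    # Each merger reads its own reducer on one side (1 read)
    # plus every reducer on the other side.
    return num_mergers * (1 + other_side_reducers)

def remote_reads_equi_join(r):
    # Merger j reads only LHS reducer j and RHS reducer j.
    return 2 * r

r1, r2 = 4, 8
reads_with_r1_mergers = remote_reads_cartesian(r1, r2)   # R1 + R1*R2
reads_with_r2_mergers = remote_reads_cartesian(r2, r1)   # R2 + R1*R2
equi_join_reads = remote_reads_equi_join(4)
```

With R1 = 4 and R2 = 8 the two choices give 36 and 40 reads, so running one merger per reducer on the smaller side is the cheaper option, while the equi-join case needs only 8 reads for R = 4.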

Combining Phases. An entire workflow consists of multiple map-reduce-merge steps. To avoid remote copying: ReduceMap and MergeMap co-locate the next mapper with the previous reducer or merger; ReduceMerger co-locates the merger with one of the reducers; ReduceMergeMap.

Map-Reduce-Merge Library. Put common merge implementations into a library: joins, common iterators, etc.

Configuration API for building a customized workflow: Map/reduce; Map/reduce/merge; multiple Map/reduce/merge.

Webgraphs. Each row: (URL, in-links, out-links). The number of links is potentially large, but only a few are needed for many operations. Store each column of the table in a separate file and reconstruct the table by a join. E.g. compute the intersection of in-links and out-links.
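A toy version of that reconstruction, assuming each column file is a list of (URL, links) pairs: the merge joins the two columns on URL and intersects the link sets. The data and the function name are made up for illustration.

```python
def intersect_link_columns(in_links, out_links):
    # Join the two column files on URL, then intersect the link sets.
    out_by_url = dict(out_links)
    result = []
    for url, ins in in_links:
        common = set(ins) & set(out_by_url.get(url, []))
        result.append((url, sorted(common)))
    return result

in_links_column = [("a.com", ["b.com", "c.com"]), ("b.com", ["a.com"])]
out_links_column = [("a.com", ["c.com", "d.com"]), ("b.com", ["c.com"])]
mutual = intersect_link_columns(in_links_column, out_links_column)
```

Only the two link columns are read; any other columns of the table stay untouched on disk.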

TPC-H Query 2

After Combining Phases

Conclusion. Map-reduce-merge extends map-reduce to support relational operators. However, the merge step seems quite complicated.