MapReduce and Data Management Based on slides from Jimmy Lin’s lecture slides (http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/index.html) (licensed.

Slides:



Advertisements
Similar presentations
Database and MapReduce
Advertisements

Two-Pass Algorithms Based on Sorting
Relational Algebra, Join and QBE Yong Choi School of Business CSUB, Bakersfield.
CS 540 Database Management Systems
Database and MapReduce Based on slides from Jimmy Lin’s lecture slides ( d-2010-Spring/index.html) (licensed under.
Tutorial at WWW 2011, Distributed reasoning: because size matters Andreas Harth, Aidan Hogan, Spyros Kotoulas,
Database Management Systems 3ed, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 14, Part B.
Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
Dr. Kalpakis CMSC 661, Principles of Database Systems Query Execution [15]
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
Cloud Computing Lecture #3 More MapReduce Jimmy Lin The iSchool University of Maryland Wednesday, September 10, 2008 This work is licensed under a Creative.
CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 52 Database Systems I Relational Algebra.
MapReduce and databases Data-Intensive Information Processing Applications ― Session #7 Jimmy Lin University of Maryland Tuesday, March 23, 2010 This work.
1 Chapter 10 Query Processing: The Basics. 2 External Sorting Sorting is used in implementing many relational operations Problem: –Relations are typically.
Google’s Map Reduce. Commodity Clusters Web data sets can be very large – Tens to hundreds of terabytes Cannot mine on a single server Standard architecture.
Jimmy Lin The iSchool University of Maryland Wednesday, April 15, 2009
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland Tuesday, June 29, 2010 This work is licensed.
Google’s Map Reduce. Commodity Clusters Web data sets can be very large – Tens to hundreds of terabytes Standard architecture emerging: – Cluster of commodity.
1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
1 Query Processing: The Basics Chapter Topics How does DBMS compute the result of a SQL queries? The most often executed operations: –Sort –Projection,
MapReduce and Data Management Based on slides from Jimmy Lin’s lecture slides ( (licensed.
1 Relational Operators. 2 Outline Logical/physical operators Cost parameters and sorting One-pass algorithms Nested-loop joins Two-pass algorithms.
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland MLG, January, 2014 Jaehwan Lee.
CS506/606: Problem Solving with Large Clusters Zak Shafran, Richard Sproat Spring 2011 Introduction URL:
Introduction to Hadoop and HDFS
Chapter 2 Adapted from Silberschatz, et al. CHECK SLIDE 16.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters H.Yang, A. Dasdan (Yahoo!), R. Hsiao, D.S.Parker (UCLA) Shimin Chen Big Data.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)
CS 338Query Evaluation7-1 Query Evaluation Lecture Topics Query interpretation Basic operations Costs of basic operations Examples Textbook Chapter 12.
Relational Operator Evaluation. Overview Index Nested Loops Join If there is an index on the join column of one relation (say S), can make it the inner.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Computing & Information Sciences Kansas State University Tuesday, 03 Apr 2007CIS 560: Database System Concepts Lecture 29 of 42 Tuesday, 03 April 2007.
Database System Concepts, 6 th Ed. ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com Chapter 12: Query Processing.
Chapter 12 Query Processing. Query Processing n Selection Operation n Sorting n Join Operation n Other Operations n Evaluation of Expressions 2.
Hung-chih Yang 1, Ali Dasdan 1 Ruey-Lung Hsiao 2, D. Stott Parker 2
MapReduce Programming Model Data type: key-value records Map function: (K in, V in )  list(K inter, V inter ) Reduce function: (K inter, list(V inter.
CS411 Database Systems Kazuhiro Minami 11: Query Execution.
CS 257 Chapter – 15.9 Summary of Query Execution Database Systems: The Complete Book Krishna Vellanki 124.
MapReduce Algorithm Design Based on Jimmy Lin’s slides
CS4432: Database Systems II Query Processing- Part 2.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
Big Data,Map-Reduce, Hadoop. Presentation Overview What is Big Data? What is map-reduce? input/output data types why is it useful and where is it used?
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
CSCE Database Systems Chapter 15: Query Execution 1.
Query Processing CS 405G Introduction to Database Systems.
Chapter 12 Query Processing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li.
Query Processing – Implementing Set Operations and Joins Chap. 19.
Relational Operator Evaluation. overview Projection Two steps –Remove unwanted attributes –Eliminate any duplicate tuples The expensive part is removing.
Big Data Infrastructure Week 2: MapReduce Algorithm Design (2/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
Tallahassee, Florida, 2016 COP5725 Advanced Database Systems MapReduce Spring 2016.
Part II NoSQL Database (MapReduce) Yuan Xue
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
”Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters” Published In SIGMOD '07 By Yahoo! Senthil Nathan N IIT Bombay.
MapReduce and Data Management
Database Management System
Evaluation of Relational Operations: Other Operations
File Processing : Query Processing
MapReduce and Data Management
MapReduce Algorithm Design Adapted from Jimmy Lin’s slides.
Lecture 2- Query Processing (continued)
Evaluation of Relational Operations: Other Techniques
Lecture 22: Query Execution
COS 518: Distributed Systems Lecture 11 Mike Freedman
Evaluation of Relational Operations: Other Techniques
Lecture 20: Query Execution
Presentation transcript:

MapReduce and Data Management Based on slides from Jimmy Lin’s lecture slides ( (licensed under Creation Commons Attribution 3.0 License)

Mapreduce and Databases

Relational Algebra Primitives –Projection (  ) –Selection (  ) –Cartesian product (  ) –Set union (  ) –Set difference (  ) –Rename (  ) Other operations –Join ( ⋈ ) –Group by… aggregation –…

Projection R1R1 R2R2 R3R3 R4R4 R5R5 R1R1 R2R2 R3R3 R4R4 R5R5

Projection in MapReduce Easy! –Map over tuples, emit new tuples with appropriate attributes –Reduce: take tuples that appear many times and emit only one version (duplicate elimination) Tuple t in R: Map(t, t) -> (t’,t’) Reduce (t’, [t’, …,t’]) -> [t’,t’] Basically limited by HDFS streaming speeds –Speed of encoding/decoding tuples becomes important –Relational databases take advantage of compression –Semistructured data? No problem!

Selection R1R1 R2R2 R3R3 R4R4 R5R5 R1R1 R3R3

Selection in MapReduce Easy! –Map over tuples, emit only tuples that meet criteria –No reducers, unless for regrouping or resorting tuples (reducers are the identity function) –Alternatively: perform in reducer, after some other processing But very expensive!!! Has to scan the database –Better approaches?

Union, Set Intersection and Set Difference Similar ideas: each map outputs the tuple pair (t,t). For union, we output it once, for intersection only when in the reduce we have (t, [t,t]) For Set difference?

Set Difference -Map Function: For a tuple t in R, produce key- value pair (t, R), and for a tuple t in S, produce key-value pair (t, S). -Reduce Function: For each key t, do the following. 1. If the associated value list is [R], then produce (t, t). 2. If the associated value list is anything else, which could only be [R, S], [S, R], or [S], produce (t, NULL).

Group by… Aggregation Example: What is the average time spent per URL? In SQL: –SELECT url, AVG(time) FROM visits GROUP BY url In MapReduce: –Map over tuples, emit time, keyed by url –Framework automatically groups values by keys –Compute average in reducer –Optimize with combiners

Relational Joins R1R1 R2R2 R3R3 R4R4 S1S1 S2S2 S3S3 S4S4 R1R1 S2S2 R2R2 S4S4 R3R3 S1S1 R4R4 S3S3

Join Algorithms in MapReduce Reduce-side join Map-side join In-memory join –Striped variant –Memcached variant

Reduce-side Join Basic idea: group by join key –Map over both sets of tuples –Emit tuple as value with join key as the intermediate key –Execution framework brings together tuples sharing the same key –Perform actual join in reducer –Similar to a “sort-merge join” in database terminology

Map-side Join: Parallel Scans If datasets are sorted by join key, join can be accomplished by a scan over both datasets How can we accomplish this in parallel? –Partition and sort both datasets in the same manner In MapReduce: –Map over one dataset, read from other corresponding partition –No reducers necessary (unless to repartition or resort) Consistently partitioned datasets: realistic to expect?

In-Memory Join Basic idea: load one dataset into memory, stream over other dataset –Works if R << S and R fits into memory –Called a “hash join” in database terminology MapReduce implementation –Distribute R to all nodes –Map over S, each mapper loads R in memory, hashed by join key –For every tuple in S, look up join key in R –No reducers, unless for regrouping or resorting tuples

In-Memory Join: Variants Striped variant: –R too big to fit into memory? –Divide R into R 1, R 2, R 3, … s.t. each R n fits into memory –Perform in-memory join:  n, R n ⋈ S –Take the union of all join results Memcached join: –Load R into memcached –Replace in-memory hash lookup with memcached lookup

Memcached Join Memcached join: –Load R into memcached –Replace in-memory hash lookup with memcached lookup Capacity and scalability? –Memcached capacity >> RAM of individual node –Memcached scales out with cluster Latency? –Memcached is fast (basically, speed of network) –Batch requests to amortize latency costs Source: See tech report by Lin et al. (2009)

Which join to use? In-memory join > map-side join > reduce- side join –Why? Limitations of each? –In-memory join: memory –Map-side join: sort order and partitioning –Reduce-side join: general purpose

Processing Relational Data: Summary MapReduce algorithms for processing relational data: –Group by, sorting, partitioning are handled automatically by shuffle/sort in MapReduce –Selection, projection, and other computations (e.g., aggregation), are performed either in mapper or reducer –Multiple strategies for relational joins Complex operations require multiple MapReduce jobs –Example: top ten URLs in terms of average time spent –Opportunities for automatic optimization

Map-Reduce-Merge Map-Reduce-Merge can form a hierarchical workflow which is similar to, but much more general than a DBMS query execution plan. –No query operators, but arbitrary programming logic specified by the developers –More general than relational query plans –More general than Map-Reduce

From the paper

MRM MR: map: (k1, v1) -> [(k2, v2)] reduce: (k2, [v2]) -> [v3] MRM: map: (k1, v1) -> [(k2, v2)] reduce: (k2, [v2]) -> [(k2, v3)] merge((k2, [v2]) a, (k3, [v3]) b ) -> [(k4, v5)]

Additional components Merge function: user-defined data processing logic for the merger of two pairs of key/values, each coming from a different source. Processor function: user-defined function that processes data from one source only. Partition selector: user-definable module that shows I/O relationship btw reducers and mergers. Configurable iterator: user-configurable module that shows how to iteratethrougheachinput data as the mergingisdone.

Sort-Merge Join Algorithm Map: Partition records intobucketswhichare mutuallyexclusive and eachkeyrange isassignedto a Reducer. Reduce: Data in the sets are mergedintoa sortedset (sort the data). Merge: The mergerjoins the sorteddata for eachkeyrange.

Other Join Algorithms MRM allows for implementation of other join algorithms like Hash Join and Nested Loop Join.

MRM join paper: Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, Douglas Stott Parker Jr.: Map-reduce-merge: simplified relational data processing on large clusters. SIGMOD Conference 2007: