1
Introduction to Map-Reduce and Join Processing
IBM Research © 2007 IBM Corporation
2
Hadoop – A Very Brief Introduction
A framework for building distributed applications that process huge amounts of data, offering scalability, fault tolerance, and ease of programming.
Two main components:
- HDFS – the Hadoop Distributed File System
- Map-Reduce
How is data organized on HDFS, and how is it processed using Map-Reduce?
3
HDFS
Stores files in blocks (default size 64 MB) across many nodes in a cluster and replicates the blocks across nodes for durability.
Master/slave architecture:
- HDFS master (NameNode): runs on a single node as a master process; directs client access to files in HDFS.
- HDFS slave (DataNode): runs on all nodes in the cluster; handles block creation, replication, and deletion; takes orders from the NameNode.
4
HDFS – Example
A file with rows R1–R15 and columns A, B, C:

Row  A  B  C
R1   1  2  3
R2   2  3  5
R3   2  4  6
R4   6  4  2
R5   1  3  6
R6   8  9  1
R7   2  3  1
R8   9  9  2
R9   1  7  4
R10  1  2  2
R11  2  3  4
R12  4  5  6
R13  6  7  8
R14  9  8  3
R15  3  2  1

The file is split into 64 MB blocks; with a replication factor of 3, all of these blocks are distributed across the cluster.
5
HDFS – Writing a File
[Figure: putting File1.txt into HDFS. The client contacts the NameNode, which assigns each block to a set of DataNodes (nodes 1–6 in the figure); e.g., one block is stored on DataNodes 1, 4, 5, another on 2, 5, 6, and another on 2, 3, 4.]
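As a concrete illustration, here is a minimal sketch of the client side of this put operation using Hadoop's FileSystem API; the paths and file names are hypothetical, and the NameNode lookup and DataNode writes happen inside the library.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up fs.defaultFS from the cluster config
        FileSystem fs = FileSystem.get(conf);      // client handle; metadata calls go to the NameNode
        // Copy a local file into HDFS; the library splits it into blocks and
        // streams each block to the DataNodes chosen by the NameNode.
        fs.copyFromLocalFile(new Path("File1.txt"), new Path("/user/demo/File1.txt"));
        fs.close();
    }
}
```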
6
HDFS – Reading a File
Because a file's blocks are read from many DataNodes in parallel, the aggregate read bandwidth is roughly the per-node transfer rate × the number of machines.
[Figure: reading a file; the client obtains block locations from the NameNode and reads the blocks directly from the DataNodes that hold them.]
7
HDFS
- Fault-tolerant: handles node failures.
- Self-healing: rebalances files across the cluster; when a node fails, its blocks are automatically re-copied from the remaining replicas (with replication factor 3, from the remaining two copies).
- Scalable: capacity grows just by adding new nodes.
[Figure: reading a file after a DataNode failure; the lost blocks are re-replicated onto the surviving nodes.]
8
Map-Reduce
- Logical functions: mappers and reducers.
- Developers write map and reduce functions, then submit a JAR to the Hadoop cluster.
- Hadoop handles distributing the map and reduce tasks across the cluster.
9
Map-Reduce
- A map task is started for each split / 64 MB block.
- Each map task generates some intermediate data.
- Hadoop collects the output of all map tasks, reorganizes it, and passes the reorganized data to the reduce tasks.
- Reduce tasks process this reorganized data and generate the final output.
Flow:
1. HDFS block to map task
2. Map task output to the Hadoop engine
3. Hadoop shuffles and sorts the map output
4. Hadoop engine to reduce tasks and reduce processing
10
HDFS to Map Tasks
- Records are read one by one from each block and passed to map for processing. The component that does this is the InputFormat / RecordReader.
- A record is passed as a key-value pair: the key is an offset and the value is the record. The offset is usually ignored by the map.
Example (one map task per block):
MAP-1: (0, R1 1 2 3) (10, R2 2 3 5) (20, R3 2 4 6) (30, R4 6 4 2) (40, R5 1 3 6)
MAP-2: (50, R6 8 9 1) (60, R7 2 3 1) (70, R8 9 9 2) (80, R9 1 7 4) (90, R10 1 2 2)
MAP-3: (100, R11 2 3 4) (110, R12 4 5 6) (120, R13 6 7 8) (130, R14 9 8 3) (140, R15 3 2 1)
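To make the interface concrete, here is a minimal, hypothetical mapper skeleton showing the types involved: with the default TextInputFormat, the RecordReader hands each record to the map as a (byte offset, line) pair. Keying the output by the row id is just an illustrative choice, not something from the deck.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // 'offset' is the record's byte position within the input split;
        // as the slide notes, most map functions ignore it and use only 'record'.
        context.write(new Text(record.toString().split("\\s+")[0]), record); // e.g. key by row id
    }
}
```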
11
Map Task
Takes in a key-value pair and transforms it into a set of key-value pairs: {K1, V1} ==> [{K2, V2}]
Example: if the second column is an odd number, emit nothing. If the second column is an even number, generate as many pairs as there are even divisors of its value: the key is the divisor and the value is the value in the third column (see the sketch below).
Input (offsets are per split):
MAP-1: (0, R1 1 2 3) (10, R2 2 3 5) (20, R3 2 4 6) (30, R4 6 4 2) (40, R5 1 3 6)
MAP-2: (0, R6 8 9 1) (10, R7 2 3 1) (20, R8 9 9 2) (30, R9 1 7 4) (40, R10 1 2 2)
MAP-3: (0, R11 2 3 4) (10, R12 4 5 6) (20, R13 6 7 8) (30, R14 9 8 3) (40, R15 3 2 1)
Output (across the three maps): (2, 3) (2, 4) (2, 4) (6, 4) (2, 9) (4, 9) (8, 9) (2, 3) (2, 5) (4, 5) (2, 7)
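Here is one way the divisor example could be written, assuming records of the form "id A B C" as on the previous slide; this is a sketch, not the deck's actual code.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// For a record "id A B C": if B is even, emit one (divisor, C) pair per
// even divisor of B; if B is odd, emit nothing.
public class EvenDivisorMapper
        extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String[] f = record.toString().split("\\s+"); // [id, A, B, C]
        int b = Integer.parseInt(f[2]);
        int c = Integer.parseInt(f[3]);
        if (b % 2 != 0) return;           // odd second column: emit nothing
        for (int d = 2; d <= b; d += 2) { // even candidates only
            if (b % d == 0) {
                context.write(new IntWritable(d), new IntWritable(c));
            }
        }
    }
}
```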
12
Hadoop Sorting and Shuffling
- Hadoop processes the key-value pairs output by the maps so that the values of all pairs with the same key are grouped together.
- These groups are then passed to reducers for processing.
Map output: (2, 3) (2, 4) (2, 4) (6, 4) (2, 9) (4, 9) (8, 9) (2, 3) (2, 5) (4, 5) (2, 7)
After the shuffle: (2, [3, 3, 3, 4, 4, 5, 7, 9]) (4, [5, 9]) (6, [4]) (8, [9])
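During the shuffle, which reducer receives a given key's group is decided by a partitioner. Hadoop's default HashPartitioner is essentially the following; treat this as an illustrative reproduction of the library class rather than its exact source.

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Essentially Hadoop's default HashPartitioner: the key's hash code, made
// non-negative, taken modulo the number of reduce tasks.
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```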
13
Hadoop Engine to Reduce Tasks and Reduce Processing
- Let the number of distinct keys (groups) be m and the number of reduce tasks be k. The m groups are distributed across the k reduce tasks using a hash function.
- Each reduce task processes its groups and generates the output.
Example – summing all the values in each group (see the sketch below):
REDUCER 1: (2, [3, 4, 4, 9, 3, 3, 5, 7]) (6, [4]) → (2, 38) (6, 4)
REDUCER 2: (4, [9, 5]) (8, [9]) → (4, 14) (8, 9)
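A sketch of the summing reducer from this slide; the types assume integer keys and values, as in the running example.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Sum all the values in each group, e.g. key 2: 3+4+4+9+3+3+5+7 = 38.
public class SumReducer
        extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values,
                          Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}
```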
14
Word-Count
Input:
Hadoop Uses Map-Reduce
There is a Map Phase
There is a Reduce Phase
Map output: (Hadoop, 1) (Uses, 1) (Map, 1) (Reduce, 1) (There, 1) (is, 1) (a, 1) (Map, 1) (Phase, 1) (There, 1) (is, 1) (a, 1) (Reduce, 1) (Phase, 1)
After the shuffle, with keys partitioned across three reducers (A–I, J–Q, R–Z): (a, [1,1]) (hadoop, 1) (is, [1,1]) (map, [1,1]) (phase, [1,1]) (reduce, [1,1]) (there, [1,1]) (uses, 1)
Reduce output: (a, 2) (hadoop, 1) (is, 2) (map, 2) (phase, 2) (reduce, 2) (there, 2) (uses, 1)
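The word-count job can be written as below; this is a sketch along the lines of Hadoop's standard example, lower-casing and splitting on non-alphanumerics so that "Map-Reduce" yields the tokens "map" and "reduce" as in the slide. The driver at the end is the part that gets packaged into the JAR submitted to the cluster.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Lower-case and split on anything that is not a letter or digit.
            for (String token : line.toString().toLowerCase().split("[^a-z0-9]+")) {
                if (token.isEmpty()) continue; // split() may yield a leading empty string
                word.set(token);
                context.write(word, ONE);      // emit (word, 1)
            }
        }
    }

    public static class CountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(word, new IntWritable(sum)); // (word, total occurrences)
        }
    }

    // Driver: wires mapper and reducer together and submits the job.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(CountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```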
15
Map-Reduce Example: Aggregation
Compute the average of B for each distinct value of A.

Row  A  B   C
R1   1  10  12
R2   2  20  34
R3   1  10  22
R4   1  30  56
R5   3  40  17
R6   2  10  49
R7   1  20  44

Map output: MAP 1: (1, 10) (2, 20) (1, 10) (1, 30); MAP 2: (3, 40) (2, 10) (1, 20)
After the shuffle: (1, [10, 10, 30, 20]) (2, [10, 20]) (3, [40])
Reduce output: (1, 17.5) (2, 15) (3, 40)
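A sketch of the averaging reducer, assuming the map has already emitted (A, B) pairs as shown above.

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

// For each distinct A, average the B values grouped under it,
// e.g. A = 1: (10 + 10 + 30 + 20) / 4 = 17.5.
public class AvgReducer
        extends Reducer<IntWritable, IntWritable, IntWritable, DoubleWritable> {
    @Override
    protected void reduce(IntWritable a, Iterable<IntWritable> bValues,
                          Context context) throws IOException, InterruptedException {
        long sum = 0, n = 0;
        for (IntWritable b : bValues) { sum += b.get(); n++; }
        context.write(a, new DoubleWritable((double) sum / n));
    }
}
```

Note that, unlike the sum example, this reducer cannot simply be reused as a combiner, since an average of per-map averages is not in general the overall average.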
16
Designing a Map-Reduce Algorithm
Thinking in terms of map and reduce:
- What data should be the key? What data should be the values?
Minimizing cost:
- Reading and map processing cost
- Communication cost
- Processing cost at the reducers
Load balancing:
- All reducers should receive a similar volume of traffic.
- It should not happen that only a few machines are busy while the others sit idle.
17
Join on Point Data
Select R.A, R.B, S.D where R.A == S.A

Relation R:           Relation S:
Row  A  B   C         Row  A  D   E
R1   1  10  12        S1   1  20  22
R2   2  20  34        S2   2  30  36
R3   1  10  22        S3   2  10  29
R4   1  30  56        S4   3  50  16
R5   3  40  17        S5   3  40  37

Map output:
MAP 1 (R): (1, [R, 10]) (2, [R, 20]) (1, [R, 10]) (1, [R, 30]) (3, [R, 40])
MAP 2 (S): (1, [S, 20]) (2, [S, 30]) (2, [S, 10]) (3, [S, 50]) (3, [S, 40])
After the shuffle (groups distributed across Reducer 1 and Reducer 2):
(1, [(R, 10), (R, 10), (R, 30), (S, 20)])
(2, [(R, 20), (S, 30), (S, 10)])
(3, [(R, 40), (S, 50), (S, 40)])
Join output: (1, 10, 20) (1, 30, 20) (2, 20, 30) (2, 20, 10) (3, 40, 50) (3, 40, 40)
18
Join on Point Data
Select R.A, R.B, S.D where R.A == S.A
- The range of attribute A is divided into k parts: a hash function maps each value of A to one of the buckets 1, …, k.
- A reducer is defined for each of the k parts.
- A tuple from R or S is communicated to reducer j if its value of A hashes to bucket j.
- Each reducer computes its part of the join output (a sketch of this reduce-side join follows).
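A minimal sketch of this reduce-side (repartition) join. It assumes, as in the slides' toy data, that row ids start with "R" or "S", so the map can tag each tuple with its source relation; in a real job one would tag based on the input path instead. Hadoop's hash partitioner plays the role of the hash function h.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PointJoin {
    // Map: tag every tuple with its source relation so the reducer can tell
    // R tuples from S tuples after they are grouped on the join key A.
    public static class TagMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text record, Context ctx)
                throws IOException, InterruptedException {
            String[] f = record.toString().split("\\s+"); // R: [id, A, B, C]; S: [id, A, D, E]
            String tag = f[0].startsWith("R") ? "R" : "S";
            // key = A; value = tag plus the payload column (B for R, D for S)
            ctx.write(new IntWritable(Integer.parseInt(f[1])), new Text(tag + "," + f[2]));
        }
    }

    // Reduce: for each value of A, pair every R tuple with every S tuple.
    public static class JoinReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable a, Iterable<Text> tagged, Context ctx)
                throws IOException, InterruptedException {
            List<String> rSide = new ArrayList<>(), sSide = new ArrayList<>();
            for (Text t : tagged) {
                String[] p = t.toString().split(",");
                (p[0].equals("R") ? rSide : sSide).add(p[1]);
            }
            for (String b : rSide)
                for (String d : sSide)
                    ctx.write(a, new Text(b + "\t" + d)); // output (A, R.B, S.D)
        }
    }
}
```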
19
Join on Point Data
Assume k = 3, with h(1) = 0, h(2) = 1, h(3) = 2, and relations R and S as before.
Bucket 0: R1 (1, 10, 12), R3 (1, 10, 22), R4 (1, 30, 56), S1 (1, 20, 22) → output pairs R1–S1, R3–S1, R4–S1
Bucket 1: R2 (2, 20, 34), S2 (2, 30, 36), S3 (2, 10, 29) → output pairs R2–S2, R2–S3
Bucket 2: R5 (3, 40, 17), S4 (3, 50, 16), S5 (3, 40, 37) → output pairs R5–S4, R5–S5
20
Map-Reduce Example: Inequality Join
Select R.A, R.B, S.D where R.A <= S.A. Consider a 3-node cluster (reducers r1, r2, r3) and relations R and S as before. Each S tuple is sent only to the reducer for its own bucket of A, while each R tuple is replicated to its own bucket and every higher one (see the sketch below).
Map output:
MAP 1 (R, replicated): (r1, [R, 1, 10]) (r2, [R, 1, 10]) (r3, [R, 1, 10]) (r2, [R, 2, 20]) (r3, [R, 2, 20]) … (r3, [R, 3, 40])
MAP 2 (S): (r1, [S, 1, 20]) (r2, [S, 2, 30]) (r2, [S, 2, 10]) (r3, [S, 3, 50]) (r3, [S, 3, 40])
After the shuffle:
Reducer 1: (r1, [[R, 1, 10], [R, 1, 10], [R, 1, 30], [S, 1, 20]])
Reducer 2: …
Reducer 3: (r3, [[R, 1, 10], [R, 2, 20], [R, 1, 10], [R, 1, 30], [R, 3, 40], [S, 3, 50], [S, 3, 40]])
Join output: (1, 10, 20) (1, 30, 20) (1, 10, 50) (1, 10, 40) (2, 20, 50) (2, 20, 40) (1, 10, 50) (1, 10, 40) (1, 30, 50) (1, 30, 40) (3, 40, 50) (3, 40, 40)
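A sketch of the replicating map for this inequality join, inferred from the slide's example. The bucket function here is a toy assumption matching h(1)=0, h(2)=1, h(3)=2 from the previous slide; in general any range partitioning of A that preserves order would do. Each reducer then joins the R tuples it received against its S tuples, keeping pairs with R.A <= S.A; the replication of R is exactly the extra communication the next slide warns about.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// An S tuple is needed only by the bucket its A value falls in, while an
// R tuple can join with any S tuple in an equal-or-higher bucket, so R
// tuples are replicated to every bucket from their own up to k-1.
public class InequalityJoinMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private static final int K = 3;                               // number of reducers/buckets
    private int bucket(int a) { return Math.min(a - 1, K - 1); }  // toy range split, h(a)

    @Override
    protected void map(LongWritable offset, Text record, Context ctx)
            throws IOException, InterruptedException {
        String[] f = record.toString().split("\\s+");
        int a = Integer.parseInt(f[1]);
        if (f[0].startsWith("S")) {              // S tuple: single bucket
            ctx.write(new IntWritable(bucket(a)), new Text("S," + a + "," + f[2]));
        } else {                                 // R tuple: replicate upward
            for (int r = bucket(a); r < K; r++) {
                ctx.write(new IntWritable(r), new Text("R," + a + "," + f[2]));
            }
        }
    }
}
```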
21
Why Is Join on Map-Reduce a Complex Task?
- Data for multiple relations is distributed across different machines, while Map-Reduce is inherently designed for processing a single dataset.
- An output tuple can be generated only when all of its input tuples are collected at a common machine. Ensuring this for all output tuples is non-trivial: a priori, we do not know which tuples will join to form an output tuple – that is precisely the join problem.
- Ensuring it may therefore involve a lot of replication, and hence a lot of communication: tuples from every candidate combination need to be collected at reducers, where the join predicates are checked.