Vyassa Baratham, Stony Brook University
cSplash 2013, April 20, 2013, 1:05-2:05pm
Developed at Google 1999-2000, published by Google in 2004
Used to build and maintain Google's WWW index
Open-source implementation by the Apache Software Foundation: Hadoop
◦ "Spinoffs", e.g. HBase (used by Facebook)
Amazon's Elastic MapReduce (EMR) service
◦ Uses the Hadoop implementation of MapReduce
Various wrapper libraries, e.g. mrjob
Split the data for distributed processing
But some data may depend on other data to be processed correctly
MapReduce maps which data need to be processed together
Then reduces (processes) the data
Input is split into different chunks
[Diagram: the input divided into chunks, Input 1 through Input 9]
Each chunk is sent to one of several computers running the same map() function
[Diagram: Input 1 through Input 9 distributed across Mapper 1, Mapper 2, and Mapper 3]
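On a single machine, this splitting step can be sketched as below; the chunk_input helper is hypothetical (Hadoop splits input files for you) and simply divides a list of lines into roughly equal pieces, one per mapper.

def chunk_input(lines, n_chunks):
    # Divide the input into n_chunks roughly equal pieces,
    # one piece per mapper (Hadoop performs this split on real files)
    size = max(1, (len(lines) + n_chunks - 1) // n_chunks)
    return [lines[i:i + size] for i in range(0, len(lines), size)]

# Example: 9 inputs split across 3 mappers
chunks = chunk_input(["input %d" % i for i in range(1, 10)], 3)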
Each map() function outputs several (key, value) pairs
[Diagram: each mapper emits pairs such as (k1, v1), (k2, v4), (k3, v2), ...]
The map() outputs are collected and sorted by key
[Diagram: all (key, value) pairs flow from the mappers to the master node, which collects and sorts them]
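What this collect-and-sort (shuffle) step accomplishes can be sketched in plain Python, assuming the mapper outputs are ordinary (key, value) tuples gathered into one list; the shuffle function name here is illustrative, not a Hadoop API.

from itertools import groupby
from operator import itemgetter

def shuffle(pairs):
    # Sort all mapper outputs by key, then group them so that every value
    # for a given key ends up together and can be sent to a single reducer
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [value for _, value in group]

# Example:
# list(shuffle([("k1", 1), ("k3", 2), ("k1", 5)])) -> [("k1", [1, 5]), ("k3", [2])]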
Several computers running the same reduce() function receive the (key, value) pairs
[Diagram: the sorted pairs are distributed from the master node to Reducer 1, Reducer 2, and Reducer 3]
All the records for a given key will be sent to the same reducer; this is why we sort
[Diagram: every pair sharing a key goes to the same reducer]
Each reducer outputs a final value (maybe with a key)
[Diagram: Reducer 1, Reducer 2, and Reducer 3 produce Output 1, Output 2, and Output 3]
The reducer outputs are aggregated and become the final output
[Diagram: Output 1, Output 2, and Output 3 are combined into the final output]
Problem: given a large body of text, count how many times each word occurs
How can we parallelize?
◦ Mapper key = word
◦ Mapper value = # occurrences in this mapper's input
◦ Reducer key = word
◦ Reducer value = sum of # occurrences over all mappers
def map(input):
    # Count occurrences of each word in this mapper's chunk
    counts = {}
    for word in input:
        counts[word] = counts.get(word, 0) + 1
    # Emit one (word, count) pair per distinct word
    for word in counts:
        yield (word, counts[word])
def reduce(key, values):
    # Sum the per-mapper counts for this word
    total = 0
    for val in values:
        total += val
    yield (key, total)
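As a usage example, a minimal single-process driver can wire the map() and reduce() functions above together, with an in-memory sort/group standing in for Hadoop's shuffle; run_word_count is an illustrative helper, not part of Hadoop.

from itertools import groupby
from operator import itemgetter

def run_word_count(chunks):
    # Map phase: run the map() function above on each chunk and collect all pairs
    pairs = [pair for chunk in chunks for pair in map(chunk)]
    # Shuffle phase: sort by key and group values, as the master node would
    grouped = groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))
    # Reduce phase: run the reduce() function above once per key
    results = {}
    for key, group in grouped:
        for word, total in reduce(key, [v for _, v in group]):
            results[word] = total
    return results

# Example:
# run_word_count([["the", "cat"], ["the", "dog"]]) -> {"the": 2, "cat": 1, "dog": 1}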
I need 3 volunteer slave nodes
I'll be the master node
Hadoop takes care of distribution, but only as efficiently as you allow
Input must be split evenly
Values should be spread evenly over keys
◦ If not, the reduce() step will not be well distributed: imagine all values get mapped to the same key; then the reduce() step is not parallelized at all!
Several keys should be used
◦ If you have few keys, then few computers can be used as reducers
By the same token, more/smaller input chunks are good
You need to know the data you're processing!
I/O is often the bottleneck, so use compression!
Some compression formats are not splittable
◦ Entire input files (large!) will be sent to single mappers, destroying hopes of distribution
Consider using a combiner ("pre-reducer"); see the sketch after this list
EMR considerations:
◦ Input from S3 is fast
◦ Nodes are virtual machines
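A combiner runs on each mapper's output before anything crosses the network; for word count, the same summing logic as the reducer works. A minimal sketch, assuming the pairs are plain Python tuples (combine_word_counts is an illustrative name, not a Hadoop API):

from collections import defaultdict

def combine_word_counts(pairs):
    # Pre-aggregate one mapper's (word, count) pairs locally, so fewer,
    # larger records are shipped to the reducers
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    for word, total in totals.items():
        yield (word, total)

# Example: [("the", 1), ("the", 1), ("cat", 1)] -> [("the", 2), ("cat", 1)]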
Hadoop in its original form uses Java
Hadoop Streaming allows programmers to avoid direct interaction with Java by instead using Unix STDIN/STDOUT (sketched below)
Requires serialization of keys and values
◦ Potential problem: keys and values are separated by "\t", but what if a serialized key or value contains a "\t"?
Beware of stray "print" statements
◦ Safer to print to STDERR
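To make the Streaming model concrete, here is a rough sketch (not the exact scripts from the talk) of a word-count mapper and reducer that read from STDIN and write tab-separated key/value pairs to STDOUT, relying on Streaming sorting the mapper output by key before the reducer sees it; debugging output goes to STDERR so it does not corrupt the data stream.

# mapper.py -- reads raw text lines from STDIN, emits one "word<TAB>1" line per word
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
print("mapper done", file=sys.stderr)  # debug output goes to STDERR, never STDOUT

# reducer.py -- reads "word<TAB>count" lines, already sorted by word, from STDIN
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, total))
        current_word, total = word, 0
    total += int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, total))

# Typical invocation (jar path and input/output paths vary by installation):
# hadoop jar hadoop-streaming.jar -input in/ -output out/ \
#     -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py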
[Diagram: Java Hadoop passes serialized input to the user's script over STDIN and reads serialized output back over STDOUT]
Thanks for your attention
Please provide feedback, comments, questions, etc.: vyassa.baratham@stonybrook.edu
Interested in physics? Want to learn about Monte Carlo simulation?