Vyassa Baratham, Stony Brook University
cSplash 2013, April 20, 2013, 1:05-2:05pm
Developed at Google 1999-2000, published by Google in 2004
Used to build and maintain Google's WWW index
Open-source implementation by the Apache Software Foundation: Hadoop
◦ "Spinoffs", e.g. HBase (used by Facebook)
Amazon's Elastic MapReduce (EMR) service
◦ Uses the Hadoop implementation of MapReduce
Various wrapper libraries, e.g. mrjob
Split the data for distributed processing
But some data may depend on other data to be processed correctly
MapReduce maps which data need to be processed together
Then reduces (processes) the data
Input is split into different chunks
[Diagram: the input divided into chunks, Input 1 through Input 9]
Each chunk is sent to one of several computers running the same map() function
[Diagram: Input 1 through Input 9 distributed across Mapper 1, Mapper 2, and Mapper 3]
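On a single machine, this splitting step can be sketched as below; the chunk_input helper is hypothetical (Hadoop splits input files for you) and simply divides a list of lines into roughly equal pieces, one per mapper.

def chunk_input(lines, n_chunks):
    # Divide the input into n_chunks roughly equal pieces,
    # one piece per mapper (Hadoop performs this split on real files)
    size = max(1, (len(lines) + n_chunks - 1) // n_chunks)
    return [lines[i:i + size] for i in range(0, len(lines), size)]

# Example: 9 inputs split across 3 mappers
chunks = chunk_input(["input %d" % i for i in range(1, 10)], 3)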
Each map() function outputs several (key, value) pairs
[Diagram: each mapper emits pairs such as (k1, v1), (k2, v4), (k3, v2), ...]
The map() outputs are collected and sorted by key
[Diagram: all (key, value) pairs flow from the mappers to the master node, which collects and sorts them]
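What this collect-and-sort (shuffle) step accomplishes can be sketched in plain Python, assuming the mapper outputs are ordinary (key, value) tuples gathered into one list; the shuffle function name here is illustrative, not a Hadoop API.

from itertools import groupby
from operator import itemgetter

def shuffle(pairs):
    # Sort all mapper outputs by key, then group them so that every value
    # for a given key ends up together and can be sent to a single reducer
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [value for _, value in group]

# Example:
# list(shuffle([("k1", 1), ("k3", 2), ("k1", 5)])) -> [("k1", [1, 5]), ("k3", [2])]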
Several computers running the same reduce() function receive the (key, value) pairs
[Diagram: the sorted pairs are distributed from the master node to Reducer 1, Reducer 2, and Reducer 3]
All the records for a given key will be sent to the same reducer; this is why we sort
[Diagram: every pair sharing a key goes to the same reducer]
Each reducer outputs a final value (maybe with a key)
[Diagram: Reducer 1, Reducer 2, and Reducer 3 produce Output 1, Output 2, and Output 3]
The reducer outputs are aggregated and become the final output
[Diagram: Output 1, Output 2, and Output 3 are combined into the final output]
Problem: given a large body of text, count how many times each word occurs
How can we parallelize?
◦ Mapper key = word
◦ Mapper value = # occurrences in this mapper's input
◦ Reducer key = word
◦ Reducer value = sum of # occurrences over all mappers
def map(input):
    # Count occurrences of each word in this mapper's chunk
    counts = {}
    for word in input:
        counts[word] = counts.get(word, 0) + 1
    # Emit one (word, count) pair per distinct word
    for word in counts:
        yield (word, counts[word])
def reduce(key, values):
    # Sum the per-mapper counts for this word
    total = 0
    for val in values:
        total += val
    yield (key, total)
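As a usage example, a minimal single-process driver can wire the map() and reduce() functions above together, with an in-memory sort/group standing in for Hadoop's shuffle; run_word_count is an illustrative helper, not part of Hadoop.

from itertools import groupby
from operator import itemgetter

def run_word_count(chunks):
    # Map phase: run the map() function above on each chunk and collect all pairs
    pairs = [pair for chunk in chunks for pair in map(chunk)]
    # Shuffle phase: sort by key and group values, as the master node would
    grouped = groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))
    # Reduce phase: run the reduce() function above once per key
    results = {}
    for key, group in grouped:
        for word, total in reduce(key, [v for _, v in group]):
            results[word] = total
    return results

# Example:
# run_word_count([["the", "cat"], ["the", "dog"]]) -> {"the": 2, "cat": 1, "dog": 1}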
I need 3 volunteer slave nodes
I'll be the master node
Hadoop takes care of distribution, but only as efficiently as you allow
Input must be split evenly
Values should be spread evenly over keys
◦ If not, the reduce() step will not be well distributed: imagine all values get mapped to the same key; then the reduce() step is not parallelized at all!
Several keys should be used
◦ If you have few keys, then few computers can be used as reducers
By the same token, more/smaller input chunks are good
You need to know the data you're processing!
I/O is often the bottleneck, so use compression!
Some compression formats are not splittable
◦ Entire input files (large!) will be sent to single mappers, destroying hopes of distribution
Consider using a combiner ("pre-reducer"); see the sketch after this list
EMR considerations:
◦ Input from S3 is fast
◦ Nodes are virtual machines
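A combiner runs on each mapper's output before anything crosses the network; for word count, the same summing logic as the reducer works. A minimal sketch, assuming the pairs are plain Python tuples (combine_word_counts is an illustrative name, not a Hadoop API):

from collections import defaultdict

def combine_word_counts(pairs):
    # Pre-aggregate one mapper's (word, count) pairs locally, so fewer,
    # larger records are shipped to the reducers
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    for word, total in totals.items():
        yield (word, total)

# Example: [("the", 1), ("the", 1), ("cat", 1)] -> [("the", 2), ("cat", 1)]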
Hadoop in its original form uses Java
Hadoop Streaming allows programmers to avoid direct interaction with Java by instead using Unix STDIN/STDOUT (sketched below)
Requires serialization of keys and values
◦ Potential problem: keys and values are separated by "\t", but what if a serialized key or value contains a "\t"?
Beware of stray "print" statements
◦ Safer to print to STDERR
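To make the Streaming model concrete, here is a rough sketch (not the exact scripts from the talk) of a word-count mapper and reducer that read from STDIN and write tab-separated key/value pairs to STDOUT, relying on Streaming sorting the mapper output by key before the reducer sees it; debugging output goes to STDERR so it does not corrupt the data stream.

# mapper.py -- reads raw text lines from STDIN, emits one "word<TAB>1" line per word
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
print("mapper done", file=sys.stderr)  # debug output goes to STDERR, never STDOUT

# reducer.py -- reads "word<TAB>count" lines, already sorted by word, from STDIN
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, total))
        current_word, total = word, 0
    total += int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, total))

# Typical invocation (jar path and input/output paths vary by installation):
# hadoop jar hadoop-streaming.jar -input in/ -output out/ \
#     -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py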
[Diagram: Java Hadoop passes serialized input to the user's script over STDIN and reads serialized output back over STDOUT]
Thanks for your attention
Please provide feedback, comments, questions, etc.: vyassa.baratham@stonybrook.edu
Interested in physics? Want to learn about Monte Carlo simulation?