1
Big ML Hadoop
2
Large-vocabulary Naïve Bayes with stream-and-sort
Example messages:
  …
  Y=business ^ X=zynga += 1
  Y=sports ^ X=hat += 1
  Y=sports ^ X=hockey += 1

Generate the event-counter update "messages":
For each example id, y, x1,…,xd in train:
  Print "Y=ANY += 1"
  Print "Y=y += 1"
  For j in 1..d:
    Print "Y=y ^ X=xj += 1"

Sort the event-counter update "messages".

Scan and add the sorted messages and output the final counter values:
previousKey = Null
sumForPreviousKey = 0
For each (event, delta) in input:
  If event == previousKey
    sumForPreviousKey += delta
  Else
    OutputPreviousKey()
    previousKey = event
    sumForPreviousKey = delta
OutputPreviousKey()

define OutputPreviousKey():
  If previousKey != Null
    print previousKey, sumForPreviousKey

The whole pipeline: python MyTrainer.py train | sort | python MyCountAdder.py > model
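As a concrete sketch, MyCountAdder.py below implements the scan-and-add step in Python; it assumes the sorted messages arrive on stdin with the event and the delta separated by a tab (the exact message format emitted by MyTrainer.py is an assumption here, not something fixed by the slides):

# MyCountAdder.py: sum sorted "event <tab> delta" messages (a sketch; format is assumed)
import sys

prev_key = None
sum_for_prev = 0
for line in sys.stdin:
    event, delta = line.rstrip("\n").rsplit("\t", 1)
    if event == prev_key:
        sum_for_prev += int(delta)
    else:
        if prev_key is not None:
            print("%s\t%d" % (prev_key, sum_for_prev))
        prev_key, sum_for_prev = event, int(delta)
if prev_key is not None:
    print("%s\t%d" % (prev_key, sum_for_prev))

It would be run exactly as in the pipeline above, as the last stage after sort.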
3
More Stream-And-Sort Examples
4
Some other stream and sort tasks
A task: classify Wikipedia pages
Features:
- words on page: src y w1 w2 …
- or, outlinks from page: src y dst1 dst2 …
How about inlinks to the page?
5
Some other stream and sort tasks
Outlinks from page: src dst1 dst2 …
Algorithm:
For each input line src dst1 dst2 … dstn, print out:
  dst1 inlinks.= src
  dst2 inlinks.= src
  …
  dstn inlinks.= src
Sort this output.
Collect the messages and group them to get:
  dst src1 src2 … srcn
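A minimal Python sketch of the message-emitting pass, assuming each input line is whitespace-separated as "src dst1 dst2 … dstn"; it prints one tab-separated (dst, src) message per outlink, which is the "dst inlinks.= src" message above in a form that sort can group on:

# emit one "dst <tab> src" message per outlink (sketch; the input format is assumed)
import sys

for line in sys.stdin:
    tokens = line.split()
    if len(tokens) < 2:
        continue
    src = tokens[0]
    for dst in tokens[1:]:
        print("%s\t%s" % (dst, src))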
6
Some other stream and sort tasks
Summing counts (as before):
prevKey = Null
sumForPrevKey = 0
For each (event, delta) in input:
  If event == prevKey
    sumForPrevKey += delta
  Else
    OutputPrevKey()
    prevKey = event
    sumForPrevKey = delta
OutputPrevKey()

define OutputPrevKey():
  If prevKey != Null
    print prevKey, sumForPrevKey

Collecting inlinks (group the sources for each destination):
prevKey = Null
docsForPrevKey = [ ]
For each (dst, src) in input:
  If dst == prevKey
    docsForPrevKey.append(src)
  Else
    OutputPrevKey()
    prevKey = dst
    docsForPrevKey = [src]
OutputPrevKey()

define OutputPrevKey():
  If prevKey != Null
    print prevKey, docsForPrevKey
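A runnable version of the collecting pass on the right, as a sketch that assumes its sorted input is the tab-separated (dst, src) messages from the previous step:

import sys

prev_key = None
docs_for_prev = []
for line in sys.stdin:
    dst, src = line.rstrip("\n").split("\t")
    if dst == prev_key:
        docs_for_prev.append(src)
    else:
        if prev_key is not None:
            print("%s\t%s" % (prev_key, " ".join(docs_for_prev)))
        prev_key, docs_for_prev = dst, [src]
if prev_key is not None:
    print("%s\t%s" % (prev_key, " ".join(docs_for_prev)))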
7
Some other stream and sort tasks
What if we run this same program on the words on a page?
Features:
- words on page: src w1 w2 …
- outlinks from page: src dst1 dst2 …
Out2In.java then produces:
  w1 src1,1 src1,2 src1,3 …
  w2 src2,1 …
  …
that is, an inverted index for the documents.
8
Some other stream and sort tasks
Finding unambiguous geographical names
GeoNames.org: for each place in its database, it stores
- several alternative names
- latitude/longitude
- …
Problem: you need to soft-match names, and if you allow approximate matches, many names are ambiguous.
9
[Diagram: soft matching of place names; "Carnegie Mellon" approximately matches both "Carnegie Mellon University" and "Carnegie Mellon School", and "Point Park College" overlaps with "Point Park University".]
10
Some other stream and sort tasks
Finding almost-unambiguous geographical names
GeoNames.org: for each place in the database, print all plausible soft-match substrings in each alternative name, paired with the lat/long, e.g.:
  Carnegie Mellon University at lat1,lon1
  Carnegie Mellon at lat1,lon1
  Mellon University at lat1,lon1
  Carnegie Mellon School at lat2,lon2
  Carnegie Mellon at lat2,lon2
  Mellon School at lat2,lon2
  …
Sort and collect… and filter.
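A sketch of the emitting step, under two assumptions of mine: each GeoNames record arrives as a tab-separated line "name lat lon", and "plausible soft-match substring" means any contiguous multi-word piece of the name (any other definition slots in the same way):

import sys

for line in sys.stdin:
    name, lat, lon = line.rstrip("\n").split("\t")
    words = name.split()
    # every contiguous substring of at least two words, including the full name
    for i in range(len(words)):
        for j in range(i + 2, len(words) + 1):
            print("%s\tat %s,%s" % (" ".join(words[i:j]), lat, lon))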
11
Some other stream and sort tasks
Summing counts (as before):
prevKey = Null
sumForPrevKey = 0
For each (event += delta) in input:
  If event == prevKey
    sumForPrevKey += delta
  Else
    OutputPrevKey()
    prevKey = event
    sumForPrevKey = delta
OutputPrevKey()

define OutputPrevKey():
  If prevKey != Null
    print prevKey, sumForPrevKey

Collecting locations (fit a Gaussian per place name):
prevKey = Null
locOfPrevKey = Gaussian()
For each (place at lat,lon) in input:
  If place == prevKey
    locOfPrevKey.observe(lat, lon)
  Else
    OutputPrevKey()
    prevKey = place
    locOfPrevKey = Gaussian()
    locOfPrevKey.observe(lat, lon)
OutputPrevKey()

define OutputPrevKey():
  If prevKey != Null and locOfPrevKey.stdDev() < 1 mile
    print prevKey, locOfPrevKey.avg()
12
Some other stream and sort tasks
Collecting locations (fit a Gaussian per place name):
prevKey = Null
locOfPrevKey = Gaussian()
For each (place at lat,lon) in input:
  If place == prevKey
    locOfPrevKey.observe(lat, lon)
  Else
    OutputPrevKey()
    prevKey = place
    locOfPrevKey = Gaussian()
    locOfPrevKey.observe(lat, lon)
OutputPrevKey()

define OutputPrevKey():
  If prevKey != Null and locOfPrevKey.stdDev() < 1 mile
    print prevKey, locOfPrevKey.avg()

You can incrementally maintain estimates for the mean and variance as you add points to a Gaussian distribution.
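One standard way to do that incremental update is Welford's algorithm. Here is a one-dimensional sketch; the class name and methods mirror the pseudocode above, but this particular implementation is mine, not the slides' (for lat/lon you would keep one instance per coordinate or extend it to 2-D):

import math

class Gaussian(object):
    # running mean and variance via Welford's online update (1-D sketch)
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # sum of squared deviations from the running mean
    def observe(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
    def avg(self):
        return self.mean
    def stdDev(self):
        return math.sqrt(self.m2 / self.n) if self.n > 0 else float("inf")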
13
Parallelizing Stream And Sort
14
Stream and Sort Counting → Distributed Counting
[Diagram: examples (example 1, example 2, example 3, …) feed counting logic (process A), which emits "C[x] += D" messages; a sort (process B) groups the messages (C[x1] += D1, C[x1] += D2, …); logic to combine counter updates (process C) produces the final counts.]
15
Stream and Sort Counting → Distributed Counting
[Diagram: the same pipeline with standardized message-routing logic between the stages; the counting logic runs on machines A1,…, the sort on machines B1,…, and the combine-counter-updates logic on machines C1,…; each stage can be parallelized.]
16
Stream and Sort Counting → Distributed Counting
[Diagram: counting logic (machines A1,…) emits "C[x] += D" messages, sorts them, and writes sorted spill files (spill 1, spill 2, spill 3, …); the spill files are merged and the counter updates are combined on machines C1,…]
17
Stream and Sort Counting → Distributed Counting
[Diagram: a counter machine runs the counting logic and sort and writes spill files; a combiner machine merges the spill files; a reducer machine combines the counter updates, with the sort key determining where each message goes.]
Observation: you can "reduce" in parallel (correctly) as long as no sort key is split across multiple machines.
18
Stream and Sort Counting → Distributed Counting
[Diagram: one counter machine spills sorted "C[x] += D" messages; reducer machine 1 merges spill files and combines the updates for keys like "al", while reducer machine 2 combines the updates for keys like "bob" and "joe".]
Observation: you can "reduce" in parallel (correctly) as long as no sort key is split across multiple machines.
19
Stream and Sort Counting → Distributed Counting
[Diagram: the same counter machine and two reducer machines, now with updates for "al" and "bob" shown on reducer machine 1 and "bob" and "joe" on reducer machine 2.]
Observation: you can "reduce" in parallel (correctly) as long as no sort key is split across multiple machines.
20
Same holds for counter machines: you can count in parallel, as long as no sort key is split across multiple reducer machines.
[Diagram: two counter machines, each running counting logic and a partition/sort over its own examples and writing spill files (spill 1, …, spill n); reducer machine 1 merges spill files and combines updates for keys like x1, and reducer machine 2 does the same for keys like x2.]
21
Stream and Sort Counting → Distributed Counting
[Diagram: the same two-counter-machine, two-reducer-machine pipeline as on the previous slide.]
22
Stream and Sort Counting → Distributed Counting
Mapper/counter machines run the "Map phase" of map-reduce:
- Each reads a different subset of the total input.
- Each outputs (sort key, value) pairs.
- The (key, value) pairs are partitioned based on the key (a sketch of one simple partitioner follows below).
- Pairs from each partition will be sent to a different reducer machine.
[Diagram: two mapper machines, each running counting logic and a partition/sort over its own examples and writing spill files.]
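A sketch of the partitioning step: assuming the map output is tab-separated "key value" lines on stdin and there are M reducer machines, hashing the key picks the partition; a stable hash (crc32 here, rather than Python's per-process-salted hash()) guarantees every pair for a given key lands in the same partition file. The file names and the value of M are just for illustration:

import sys
from zlib import crc32

M = 4  # number of reducer machines / partitions (an assumption for this sketch)
parts = [open("part-%d.txt" % j, "w") for j in range(M)]

for line in sys.stdin:
    key = line.split("\t", 1)[0]
    parts[crc32(key.encode("utf8")) % M].write(line)

for f in parts:
    f.close()

Because a key never straddles two partitions, each reducer can sort and reduce its own partition independently, which is exactly the observation on the earlier slides.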
23
Stream and Sort Counting → Distributed Counting
The shuffle/sort phase: the (key, value) pairs from each partition are sent to the right reducer, and the reducer sorts the pairs so that all pairs with the same key are together.
The reduce phase: each reducer scans through its sorted key-value pairs.
[Diagram: spill files (spill 1, …, spill n) from the mappers are merged on the two reducer machines; reducer machine 1 combines counter updates for keys like x1, and reducer machine 2 for keys like x2.]
24
Stream and Sort Counting → Distributed Counting
[Diagram: the complete pipeline once more: two counter machines (counting logic, partition/sort, spill files) feeding two reducer machines (merge spill files, combine counter updates).]
25
Distributed Stream-and-Sort: Map, Shuffle-Sort, Reduce
[Diagram: map process 1 and map process 2 each run counting logic and partition/sort over their examples and write spill files; a distributed shuffle-sort routes and merges the spills; reducer 1 combines counter updates for keys like x1, and reducer 2 for keys like x2.]
26
Hadoop as Parallel Stream-and-sort
27
Hadoop: Distributed Stream-and-Sort
Local implementation:
    cat input.txt | MAP | sort | REDUCE > output.txt
How could you parallelize this?
28
Hadoop: Distributed Stream-and-Sort
Local implementation:
    cat input.txt | MAP | sort | REDUCE > output.txt
In parallel:
- Run N mappers, producing mout1.txt, mout2.txt, …
- Partition each mouti.txt into M pieces, one per reducer machine: part1.1.txt, part1.2.txt, …, partN.M.txt
- Send each partition to the right reducer
29
Hadoop: Distributed Stream-and-Sort
Local implementation:
    cat input.txt | MAP | sort | REDUCE > output.txt
In parallel: Map, Partition, Send. Then, in parallel on each reducer machine j:
- Accept the N partition files part1.j.txt, …, partN.j.txt
- Sort/merge them together into rinj.txt
- Reduce rinj.txt to get the final result (the reduce output for each key in partition j)
If necessary, concatenate the reducer outputs into a single file.
30
Hadoop: Distributed Stream-and-Sort
Local implementation:
    cat input.txt | MAP | sort | REDUCE > output.txt
In parallel: Map, Partition, Send. Then, in parallel: Accept, Merge, Reduce.
You could do this as a class project ... but for really big data …
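A rough sketch of the class-project version of the map and partition steps, in Python: it runs one mapper process per input shard and splits that mapper's output into M partition files named as on the slides. The shard file names, the choice of streamNB.py as the mapper, and the values of N and M are all assumptions for illustration:

import subprocess
from zlib import crc32

N, M = 2, 2                             # number of map tasks and reduce partitions (assumed)
MAP_CMD = ["python", "streamNB.py"]     # any streaming mapper works here

for i in range(1, N + 1):
    with open("shard%d.txt" % i) as shard:          # shard file names are assumed
        mapped = subprocess.run(MAP_CMD, stdin=shard,
                                capture_output=True, text=True).stdout
    parts = [open("part%d.%d.txt" % (i, j), "w") for j in range(1, M + 1)]
    for line in mapped.splitlines(True):            # keep line endings
        key = line.split("\t", 1)[0]
        parts[crc32(key.encode("utf8")) % M].write(line)
    for f in parts:
        f.close()
# "Send" would copy part<i>.<j>.txt to reducer machine j, which then
# sort-merges its N files and runs the reducer over them.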
31
Hadoop: Distributed Stream-and-Sort
Robustness: with 10,000 machines, machines and disk drives fail all the time. How do you detect and recover?
Efficiency: with many jobs and programmers, how do you balance the loads? How do you limit network traffic and file I/O?
Usability: can programmers keep track of their data and jobs?
(The parallel pipeline, again: Map, Partition, Send; then Accept, Merge, Reduce.)
32
Hadoop: Intro pilfered from: Alona Fyshe, Jimmy Lin, Google, Cloudera
33
Surprise, you mapreduced!
Mapreduce has three main phases:
- Map (send each input record to a key)
- Sort (put all of one key in the same place; handled behind the scenes)
- Reduce (operate on each key and its set of values)
The terms come from functional programming:
  map(lambda x: x.upper(), ["william","w","cohen"])  returns  ['WILLIAM', 'W', 'COHEN']
  reduce(lambda x, y: x+"-"+y, ["william","w","cohen"])  returns  "william-w-cohen"
34
Mapreduce overview: Map → Shuffle/Sort → Reduce
39
Issue: reliability
Questions:
- How will you know when each machine is done? (communication overhead)
- How will you know if a machine is dead? Remember: we want to use a huge pile of cheap machines, so failures will be common!
- Is it dead or just really slow?
40
Issue: reliability What’s the difference between slow and dead?
Who cares? Start a backup process.
- If the process is slow because of machine issues, the backup may finish first.
- If it's slow because you partitioned your data poorly... waiting is your punishment.
41
Issue: reliability If a disk fails you can lose some intermediate output Ignoring the missing data could give you wrong answers Who cares? if I’m going to run backup processes I might as well have backup copies of the intermediate data also
43
HDFS: The Hadoop File System
Distributes data across the cluster:
- a distributed file looks like a directory with shards as files inside it
- makes an effort to run processes locally with the data
Replicates data:
- by default, 3 copies of each file
Optimized for streaming:
- really, really big "blocks"
44
$ hadoop fs -ls rcv1/small/sharded
Found 10 items
-rw-r--r … :28 /user/wcohen/rcv1/small/sharded/part-00000
-rw-r--r … :28 /user/wcohen/rcv1/small/sharded/part-00001
-rw-r--r … :28 /user/wcohen/rcv1/small/sharded/part-00002
-rw-r--r … :28 /user/wcohen/rcv1/small/sharded/part-00003
-rw-r--r … :28 /user/wcohen/rcv1/small/sharded/part-00004
-rw-r--r … :28 /user/wcohen/rcv1/small/sharded/part-00005
-rw-r--r … :28 /user/wcohen/rcv1/small/sharded/part-00006
-rw-r--r … :28 /user/wcohen/rcv1/small/sharded/part-00007
-rw-r--r … :28 /user/wcohen/rcv1/small/sharded/part-00008
-rw-r--r … :28 /user/wcohen/rcv1/small/sharded/part-00009

$ hadoop fs -tail rcv1/small/sharded/part-00005
weak as the arrival of arbitraged cargoes from the West has put the local market under pressure… M14,M143,MCAT
The Brent crude market on the Singapore International …
51
This would be pretty systems-y (remote copy files, waiting for remote processes, …)
It would take work to make it run for 500 jobs:
- Reliability: replication, restarts, monitoring jobs, …
- Efficiency: load-balancing, reducing file/network I/O, optimizing file/network I/O, …
- Usability: stream defined datatypes, simple reduce functions, …
52
Map reduce with Hadoop streaming
53
Breaking this down…
Our imaginary assignment uses key-value pairs. What's the data structure for that? How do you interface with Hadoop?
One very simple way: Hadoop's streaming interface:
- The mapper outputs key-value pairs as one pair per line, with the key and value tab-separated.
- The reducer reads in data in the same format.
- Lines are sorted so that lines with the same key are adjacent.
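For example, a streaming wordcount mapper is just a few lines (the script name is hypothetical; the matching reducer would look exactly like the scan-and-add count adder sketched earlier):

# wc_map.py (hypothetical name): emit "word <tab> 1" for every token on stdin
import sys

for line in sys.stdin:
    for w in line.split():
        print("%s\t1" % w)

Locally you could test it with something like: cat input.txt | python wc_map.py | sort | python streamSumReducer.py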
54
An example (in Java, sorry):
SmallStreamNB.java and StreamSumReducer.java: the code you just wrote.
55
To run locally:
56
To run locally, use the same stream-and-sort pipeline as before, with these two scripts as the mapper and the reducer:
    python streamNB.py … | sort | python streamSumReducer.py
57
To train with streaming Hadoop you do this:
But first you need to get your code and data to the “Hadoop file system”
59
To train with streaming Hadoop:
First, you need to prepare the corpus by splitting it into shards … and distributing the shards to different machines:
61
To train with streaming Hadoop:
One way to shard text:
- hadoop fs -put LocalFileName HDFSName
- then run a streaming job with 'cat' as the mapper and reducer, and specify the number of shards you want with the option -numReduceTasks
62
To train with streaming Hadoop:
Next, prepare your code for upload and distribution to the machines in the cluster.
63
Now you can run streaming Hadoop:
small-events-hs:
    hadoop fs -rmr rcv1/small/events
    hadoop jar $(STRJAR) \
        -input rcv1/small/sharded -output rcv1/small/events \
        -mapper 'python streamNB.py' \
        -reducer 'python streamSumReducer.py' \
        -file streamNB.py -file streamSumReducer.py
64
Map reduce without Hadoop streaming
65
“Real” Hadoop Streaming is simple but
- There's no typechecking of inputs/outputs
- You need to parse strings a lot
- You can't use compact binary encodings
- … basically, you have limited control over the messages you're sending
- I/O cost = O(message size) often dominates
66
Other input formats: KeyValueInputFormat, SequenceFileInputFormat
69
“Real” Hadoop vs Streaming Hadoop
Tradeoff: simplicity vs. control.
In general:
- if you want to really optimize, you need to get down to the actual Hadoop layers
- often it's better to work with abstractions "higher up"
70
Debugging Map-Reduce
71
Some common pitfalls
- You have no control over the order in which reduces are performed.
- You have no* control over the order in which you encounter reduce values. (*by default, anyway)
- The only ordering you should assume is that reducers always start after mappers.
72
Some common pitfalls
- You should assume your Maps and Reduces will be taking place on different machines with different memory spaces.
- Don't make a static variable and assume that other processes can read it.
  - They can't. It may appear that they can when you run locally, but they can't.
  - No really, don't do this.
73
Some common pitfalls
- Do not communicate between mappers or between reducers:
  - the overhead is high
  - you don't know which mappers/reducers are actually running at any given point
  - there's no easy way to find out what machine they're running on
  - because you shouldn't be looking for them anyway
74
When mapreduce doesn’t fit
The beauty of mapreduce is its separability and independence. If you find yourself trying to communicate between processes:
- you're doing it wrong, or
- what you're doing is not a mapreduce
75
When mapreduce doesn’t fit
Not everything is a mapreduce. Sometimes you need more communication. We'll talk about other programming paradigms later.
76
What’s so tricky about MapReduce?
Really, nothing. It's easy. What's often tricky is figuring out how to write an algorithm as a series of map-reduce substeps:
- How and when do you parallelize?
- When should you even try to do this?
- When should you use a different model?
77
Combiners in Hadoop
78
Some of this is wasteful
Remember: moving data around and writing to / reading from disk are very expensive operations.
No reducer can start until:
- all mappers are done
- the data in its partition has been sorted
79
How much does buffering help?
BUFFER_SIZE    Time    Message Size
none                   1.7M words
100            47s     1.2M
1,000          42s     1.0M
10,000         30s     0.7M
100,000        16s     0.24M
1,000,000      13s     0.16M
limit                  0.05M

Recall the idea here: in stream-and-sort, use a buffer to accumulate counts in messages for common words before the sort, so the sort input is smaller.
80
Combiners
- Sit between the map and the shuffle:
  - do some of the reducing while you're waiting for other stuff to happen
  - avoid moving all of that data over the network
- E.g., for wordcount: instead of sending (word, 1), send (word, n), where n is a partial count over the data seen by that mapper (sketched below).
- The reducer still just sums the counts.
- Only applicable when the order of reduce values doesn't matter (since the order is undetermined) and the effect is cumulative:
  - Reduce must be commutative and associative.
  - Input and output formats must match.
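A sketch of that "send (word, n) instead of (word, 1)" idea, written as an in-mapper partial count in Python. Hadoop's real combiner runs between map and shuffle on spilled map output, so this is a simplification, and it assumes the distinct words seen by one mapper fit in memory:

import sys
from collections import Counter

partial = Counter()
for line in sys.stdin:
    partial.update(line.split())        # accumulate partial counts locally

# emit one (word, n) pair per distinct word seen by this mapper
for word, n in partial.items():
    print("%s\t%d" % (word, n))

The downstream reducer still just sums; that is only correct because addition is commutative and associative, which is exactly the requirement in the list above.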
82
job.setCombinerClass(Reduce.class);
83
Combiners in streaming Hadoop
small-events-hs:
    hadoop fs -rmr rcv1/small/events
    hadoop jar $(STRJAR) \
        -input rcv1/small/sharded -output rcv1/small/events \
        -mapper 'python streamNB.py' \
        -reducer 'python streamSumReducer.py' \
        -combiner 'python streamSumReducer.py'
84
Deja vu: Combiner = Reducer
Often the combiner is the reducer:
- like for word count
- but not always
Remember: you have no control over when, or whether, the combiner is applied.