The Longest Common Substring Problem (a.k.a. Long Repeat) by Donnie Demuth
Sections
1. MapReduce and Hadoop
2. Map and Reduce
3. Mappers and Reducers
4. Using Tools (Amazon)
5. Conclusions
1. MapReduce and Hadoop
What is it? And how do I get it?
Google MapReduce
– Circa 2003
– Based on Map and Reduce (go figure) – and functional programming!
– Proprietary
Apache Hadoop
– Circa 2006, released 2009
– Named after an elephant toy
– Takes seconds, maybe a minute, to install
Installing Hadoop on OS X
– Single-cluster setup is a piece of cake
– Download the archive (tar.gz)
– Modify conf/hadoop-env.sh:
    # export JAVA_HOME=/usr/lib/j2sdk1.6-sun
    export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/
– Modify bin/hadoop:
    JAVA=$JAVA_HOME/bin/java becomes JAVA=$JAVA_HOME/Commands/java
– Just run bin/hadoop with arguments
STOP! Actually, installing Hadoop wasn't necessary. We can write parallel code without it.
2. Map and Reduce
What is it? – a quick primer on functional programming
– Higher-order functions
– Alonzo Church (lambda calculus)
– Haskell Curry (spicy food)
How do I use it?
(x ↦ (y ↦ x*x + y*y))(5)(2)
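As a quick illustration (not in the original slides), the curried expression above can be written directly in Python; evaluating it gives 5*5 + 2*2 = 29:

>>> f = lambda x: lambda y: x*x + y*y
>>> f(5)(2)
29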
Code w/ Side-Effects
>>> thing = {'name': 'Donald'}
>>> def change_name(obj):
...     obj['name'] = 'Donnie'  # mutates the caller's dict in place
...
>>> change_name(thing)
>>> thing
{'name': 'Donnie'}
Pure Code, Side-Effect Free
>>> thing = {'name': 'Donald'}
>>> def change_name(obj):
...     new_obj = dict(obj)  # copy any other values
...     new_obj['name'] = 'Donnie'
...     return new_obj
...
>>> thing = change_name(thing)
>>> thing
{'name': 'Donnie'}
Benefits of Pure Code / FP
It's easy to understand
– Local vars = easy
– Global vars + side-effects = hard
It's easy to parallelize
– We only care about what we know RIGHT NOW
Map
(diagram: f(x) applied to each element of the list [1, 2, 3], producing a new list of results)
Map in Python
Use the map(<function>, <iterable>) built-in
>>> map(lambda x: x*x, range(1, 100))
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 900, 961, 1024, 1089, 1156, 1225, 1296, 1369, 1444, 1521, 1600, 1681, 1764, 1849, 1936, 2025, 2116, 2209, 2304, 2401, 2500, 2601, 2704, 2809, 2916, 3025, 3136, 3249, 3364, 3481, 3600, 3721, 3844, 3969, 4096, 4225, 4356, 4489, 4624, 4761, 4900, 5041, 5184, 5329, 5476, 5625, 5776, 5929, 6084, 6241, 6400, 6561, 6724, 6889, 7056, 7225, 7396, 7569, 7744, 7921, 8100, 8281, 8464, 8649, 8836, 9025, 9216, 9409, 9604, 9801]
Reduce
(diagram: f(x, y) folded over the list [1, 2, 3] with initial value 0, yielding 6)
Reduce in Python
Use the reduce(<function>, <iterable>, <initializer>) built-in
>>> reduce(lambda x, y: x+y, [1, 2, 3], 0)
6
>>> reduce(lambda x, y: x+y, map(lambda x: x*x, range(1, 100)), 0)
328350
3. Mappers and Reducers
How do I write them?
– Word Count (the Hello World of distributed computing)
– Longest Repeat
Show me how to pipe them
Mappers
Pseudo-code:
– Take some input
– Process it
– And emit a key-value pair
Word Count Mapper
For some input:
– Donald Demuth Donald Draper
The output should be (the mapper lowercases words):
– donald 1
– demuth 1
– donald 1
– draper 1
Word Count Mapper Code
wordcount/mapper.py

#!/usr/bin/env python
import sys, re

word_re = re.compile('[a-zA-Z]+')

for line in sys.stdin:
    line = line.strip().lower()
    for word in word_re.findall(line):
        # emit "word<TAB>1" for every word seen
        print '%s\t%s' % (word, 1)
Reducers
Dependent on the mapper's emissions
Pseudo-code for word count:
– Read an emission from the mapper
– Find the key and the value
– Store the key in a dictionary with its value
– But if the key already exists, add the value to the pre-existing value!
– Emit the dictionary
Word Count Reducer Code
wordcount/reducer.py

#!/usr/bin/env python
import sys

counts = {}
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    count = int(count)
    # accumulate counts per word
    counts[word] = counts.get(word, 0) + count

for word, count in counts.items():
    print '%s\t%s' % (word, count)
Unix Pipes
Does this really work??

$ cat books/*.txt | wordcount/mapper.py | wordcount/reducer.py | sort | head
a	10526
ab	3
aback	1
abaft	2
abaht	1
abandon	2
abandoned	10
abandonment	1
abasement	1
abash	1
Longest Repeat (LCS)
– Many problems can be solved with a series of Maps and Reduces
– However, Hadoop Streaming is a single Map and Reduce step
– After much trial and error, my solution involves a pre-processing step
Pre-processing
(diagram) ecoli.fasta → fasta_to_line.py → ecoli.fasta.line → gen_suffixes.py → ecoli.fasta.line.0, ecoli.fasta.line.100000, ecoli.fasta.line.200000, … (file sizes shrink with offset: 4.6, 4.5, 4.4, 4.3 megs)
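A minimal sketch of what gen_suffixes.py might look like (hypothetical reconstruction; the original code isn't shown, and the "index<TAB>suffix" line format is an assumption based on the mapper pseudo-code below):

#!/usr/bin/env python
# Hypothetical sketch of gen_suffixes.py: split the one-line genome into
# suffix files, one per 100,000-character offset. Each file is a single
# "index<TAB>suffix" line so a mapper can recover absolute positions.
import sys

CHUNK = 100000  # offset step, inferred from the file names above

def gen_suffixes(path):
    text = open(path).read().strip()
    for start in range(0, len(text), CHUNK):
        out = open('%s.%d' % (path, start), 'w')
        out.write('%d\t%s\n' % (start, text[start:]))
        out.close()

if __name__ == '__main__':
    gen_suffixes(sys.argv[1])  # e.g. python gen_suffixes.py ecoli.fasta.line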
LCS Mapper
Pseudo-code:
– Read a line from a suffix file
– Determine the index (the first chars of the line)
– Cycle through the first 100,000 positions
– Cycle through possible lengths (10 to 3,000)
– Emit the length (key) and the position (value)
– Emit (-1) and (-1) to STAY ALIVE
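One way the pseudo-code might translate into a streaming mapper (a hypothetical sketch, not the author's actual code; it assumes the "index<TAB>suffix" format from the pre-processing sketch and that ecoli.fasta.line is readable by every mapper):

#!/usr/bin/env python
# Hypothetical sketch of lcs/mapper.py, brute force as described above:
# for each position in this chunk, grow a candidate substring (10 to
# 3,000 chars) while it still occurs elsewhere in the full text, then
# emit "length<TAB>position".
import sys

text = open('ecoli.fasta.line').read().strip()  # the whole sequence

for line in sys.stdin:
    index, suffix = line.strip().split('\t', 1)
    index = int(index)
    for i in range(min(100000, len(suffix))):
        pos = index + i
        best = 0
        for length in range(10, 3001):
            seq = text[pos:pos + length]
            if len(seq) < length:
                break  # ran off the end of the text
            # does seq occur anywhere other than at pos?
            repeats = text.find(seq) < pos or text.find(seq, pos + 1) != -1
            if not repeats:
                break
            best = length
        if best:
            print '%d\t%d' % (best, pos)
        else:
            print '-1\t-1'  # heartbeat so Hadoop doesn't kill a quiet task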
LCS Reducer
Pseudo-code – simple:
– Find the largest KEY emitted by any mapper
– Display it
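A matching sketch of the reducer (again hypothetical; the output format is taken from the murmur.txt run on the next slide):

#!/usr/bin/env python
# Hypothetical sketch of lcs/reducer.py: keep the largest length (key)
# seen on stdin; the (-1, -1) heartbeats lose to any real emission.
import sys

best_len, best_pos = -1, -1
for line in sys.stdin:
    length, pos = map(int, line.strip().split('\t', 1))
    if length > best_len:
        best_len, best_pos = length, pos

print 'length(%d)pos(%d)' % (best_len, best_pos)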
LCS w/ murmur.txt

$ cat murmur.txt.line.0 | lcs/mapper.py | lcs/reducer.py
length(63)pos(128)
$ python
>>> text = open('murmur.txt.line').read()
>>> text[128:128+63]
'Dance the cha chaOr the can canShake your pom pomTo Duran Duran'
>>> seq = text[128:128+63]
>>> text.index(seq)
128
>>> text[129:].index(seq) + 129
1777
>>> text[128:128+63] == text[1777:1777+63]
True
>>> text[1777:1777+63]
'Dance the cha chaOr the can canShake your pom pomTo Duran Duran'
4. Using Tools (Amazon)
Harness the power of many machines (e.g., 20) at once – easy to use
Need to sign up for:
– Amazon Elastic MapReduce (EMR)
– Amazon Elastic Compute Cloud (EC2)
– Amazon Simple Storage Service (S3)
– Amazon SimpleDB
Deploying Data/Code
First you'll need to upload it to S3:
– Create a new bucket (a global folder) named ecoli-lcs
– Create a new path named input: ecoli-lcs/input
– Upload all of the generated suffixes to the input folder
– Upload mapper.py and reducer.py to ecoli-lcs
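The deck does this through the S3 web console; for reference, the same layout could be created with the modern AWS CLI (not part of the original presentation – the local directory name suffixes/ is an assumption, and bucket names must be globally unique):

$ aws s3 mb s3://ecoli-lcs                               # create the bucket
$ aws s3 cp suffixes/ s3://ecoli-lcs/input/ --recursive  # upload the generated suffixes
$ aws s3 cp lcs/mapper.py s3://ecoli-lcs/
$ aws s3 cp lcs/reducer.py s3://ecoli-lcs/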
Creating a Job Flow (screenshot)
Creating a Job Flow, continued (screenshot)
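Under the hood, a streaming job flow boils down to a single Hadoop Streaming invocation. A local equivalent might look like this (a sketch – the streaming jar's path and name vary by Hadoop version):

$ hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input input/ \
    -output output/ \
    -mapper mapper.py \
    -reducer reducer.py \
    -file lcs/mapper.py \
    -file lcs/reducer.py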
RESULTS!
Need to download the output

$ cd output
$ cat * | sort
(...)
length(2815)pos(4166641)
$ python
>>> text = open('ecoli.fasta.line').read()
>>> seq = text[4166641:4166641+2815]
>>> text.index(seq)
4166641
>>> text[4166642:].index(seq) + 4166642
4208043
>>> text[4166641:4166641+2815] == text[4208043:4208043+2815]
5. Conclusions
Costs
– It's about 3 cents an hour for a "medium" VM
– One run took 840 instance hours (20+ hours of actual wall-clock time): 840 × $0.03 ≈ $25
– Used about 2,000 instance hours in total
Hadoop Streaming is EASY
– Though it requires many (easy) tools
– But costly if you have "bugs"
A Better Solution?
Jeff Parker's program used the following approach (a single-machine sketch follows below):
– Cycle through the sequence and find all repeats of a given size
– Emit the locations
– Increase the size and use the previously known locations to find larger matches
Looks good for MapReduce (Core)
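A minimal single-machine sketch of that iterative idea (my reconstruction, not Jeff Parker's actual code; the function name longest_repeat and the starting size of 10 are assumptions):

def longest_repeat(text, start=10):
    # Grow the repeat size k one step at a time. Only positions whose
    # k-length substring repeats can host a (k+1)-length repeat, so each
    # round shrinks the candidate set instead of rescanning everything.
    k = start
    candidates = range(len(text) - k + 1)
    best = None
    while candidates:
        seen = {}
        for pos in candidates:
            seen.setdefault(text[pos:pos + k], []).append(pos)
        survivors = [p for group in seen.values() if len(group) > 1
                     for p in group]
        if not survivors:
            break
        best = (k, min(survivors))
        # keep survivors whose window still fits at the next size
        candidates = [p for p in survivors if p + k < len(text)]
        k += 1
    return best  # (length, position) of a longest repeat, or None

Because the candidate set only ever shrinks, the expensive full scan happens once at the smallest size – which is what makes this approach look good for a MapReduce decomposition.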