The Longest Common Substring Problem a.k.a Long Repeat by Donnie Demuth.


Sections
1. MapReduce and Hadoop
2. Map and Reduce
3. Mappers and Reducers
4. Using Tools (Amazon)
5. Conclusions

1. MapReduce and Hadoop
What is it? And how do I get it?

Google MapReduce
Circa 2003
Based on Map and Reduce (go figure)
– and Functional Programming!
Proprietary

Apache Hadoop
Circa 2006, released 2009
Named after a toy elephant
Seconds, maybe a minute, to install

Installing Hadoop on OSX
Single-node cluster setup is a piece of cake
Download the archive (tar.gz)
Modify conf/hadoop-env.sh:
– # export JAVA_HOME=/usr/lib/j2sdk1.6-sun
– export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/
Modify bin/hadoop:
– JAVA=$JAVA_HOME/bin/java
– JAVA=$JAVA_HOME/Commands/java
Just run bin/hadoop with arguments

STOP!
Actually, installing Hadoop wasn’t necessary
We can write parallel code without it

2. Map and Reduce
What is it?
– Quick Primer to Functional Programming
Higher-Order Functions
Alonzo Church (Lambda Calculus)
Haskell Curry (Spicy Food)
How do I use it?
(x ↦ (y ↦ x*x + y*y))(5)(2)
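The curried expression on this slide translates almost symbol for symbol into Python. A minimal sketch, written for Python 3 (the slides themselves use Python 2); the names `sum_of_squares` and `add_to_25` are illustrative and not from the talk.

```python
# Curried form of (x, y) -> x*x + y*y: each function takes exactly one
# argument and returns another function. Names here are illustrative only.
sum_of_squares = lambda x: (lambda y: x * x + y * y)

# Supplying only x gives back a function that is still waiting for y.
add_to_25 = sum_of_squares(5)

# Supplying both arguments, one at a time, mirrors (…)(5)(2) on the slide.
result = sum_of_squares(5)(2)
print(result)  # 29
```

Because `sum_of_squares` both returns a function and could be passed to one, it is a higher-order function in the sense listed above.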

Code w/ Side-Effects
>>> thing = {'name':'Donald'}
>>> def change_name(object):
...     object['name'] = 'Donnie'
...
>>> change_name(thing)
>>> thing
{'name': 'Donnie'}

Pure Code, Side-effect Free
>>> thing = {'name':'Donald'}
>>> def change_name(object):
...     new_obj = {'name': 'Donnie'}
...     # copy any other values
...     return new_obj
...
>>> thing = change_name(thing)
>>> thing
{'name': 'Donnie'}
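The "copy any other values" comment is the part that makes the pure version work for real objects. A small sketch of one way to do it in Python 3, assuming a shallow copy is enough; the extra `role` key and the `new_name` parameter are made up for the example.

```python
def change_name(obj, new_name):
    """Return a new dict with 'name' replaced; `obj` itself is never modified."""
    new_obj = dict(obj)          # shallow copy carries over every other key
    new_obj['name'] = new_name
    return new_obj

# 'role' is a hypothetical extra field, only here to show that it survives.
thing = {'name': 'Donald', 'role': 'speaker'}
renamed = change_name(thing, 'Donnie')

print(thing)    # the original still says Donald
print(renamed)  # the copy says Donnie and keeps 'role'
```

Since the caller's dict is untouched, two workers could call `change_name` on the same input at the same time without interfering with each other.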

Benefits of Pure Code / FP
It’s easy to understand
– Local vars = easy
– Global vars + side-effects = hard
It’s easy to parallelize
– We only care about what we know RIGHT NOW

Map f(x)

Map in Python
Use the map(function, sequence) built-in
>>> map(lambda x: x*x, range(1,100))
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 900, 961, 1024, 1089, 1156, 1225, 1296, 1369, 1444, 1521, 1600, 1681, 1764, 1849, 1936, 2025, 2116, 2209, 2304, 2401, 2500, 2601, 2704, 2809, 2916, 3025, 3136, 3249, 3364, 3481, 3600, 3721, 3844, 3969, 4096, 4225, 4356, 4489, 4624, 4761, 4900, 5041, 5184, 5329, 5476, 5625, 5776, 5929, 6084, 6241, 6400, 6561, 6724, 6889, 7056, 7225, 7396, 7569, 7744, 7921, 8100, 8281, 8464, 8649, 8836, 9025, 9216, 9409, 9604, 9801]

Reduce f(x, y) f(x, y) = 6

Reduce in Python
Use the reduce(function, sequence, initial) built-in
>>> reduce(lambda x, y: x+y, [1,2,3], 0)
6
>>> reduce(lambda x, y: x+y, map(lambda x: x*x, range(1,100)), 0)
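The REPL snippets on these slides are Python 2. For readers on Python 3, here is a sketch of the same map-then-reduce chain; the only differences are that `reduce` has to be imported from `functools` and `map` returns a lazy iterator rather than a list.

```python
from functools import reduce  # reduce is not a built-in in Python 3

# map() is lazy in Python 3, so wrap it in list() to see the values.
squares = list(map(lambda x: x * x, range(1, 100)))

# Fold the squares into a single number, starting from 0.
total = reduce(lambda x, y: x + y, squares, 0)

print(squares[:5])  # [1, 4, 9, 16, 25]
print(total)        # 328350
```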

3. Mappers and Reducers
How do I write them?
– Word Count (Hello World for Distrib. Comp.)
– Longest Repeat
Show me how to pipe them

Mappers
Pseudo-Code
– Take some input
– Process it
– And emit a Key – Value pair

Word Count Mapper
For some input:
– Donald Demuth Donald Draper
The output should be:
– Donald 1
– Demuth 1
– Donald 1
– Draper 1

Word Count Mapper Code
wordcount/mapper.py

#!/usr/bin/env python
import re
import sys

# A word is any run of ASCII letters.
word_re = re.compile('[a-zA-Z]+')

for line in sys.stdin:
    line = line.strip().lower()
    for word in word_re.findall(line):
        # Emit "word<TAB>1" for every occurrence.
        print('%s\t%s' % (word, 1))

Reducers
Dependent on the Mapper’s emissions
Pseudo-Code for word count
– Read an emission from the mapper
– Find the key and the value
– Store the key in a dictionary with its value
  But if the key already exists, add the value to the pre-existing value!
– Emit the dictionary

Word Count Reducer Code
wordcount/reducer.py

#!/usr/bin/env python
import sys

counts = {}

for line in sys.stdin:
    line = line.strip()
    # Each mapper emission is "word<TAB>count".
    word, count = line.split('\t', 1)
    count = int(count)
    # Add to the running total, starting from 0 for a new word.
    counts[word] = counts.get(word, 0) + count

for word, count in counts.items():
    print('%s\t%s' % (word, count))

Unix Pipes
Does this really work??
$ cat books/*.txt | wordcount/mapper.py | wordcount/reducer.py | sort | head
a	10526
ab	3
aback	1
abaft	2
abaht	1
abandon	2
abandoned	10
abandonment	1
abasement	1
abash	1
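The pipe above feeds the mapper straight into the reducer, which works because the reducer keeps every word in a dictionary. Hadoop Streaming sorts the mapper output by key before the reducer sees it, so a reducer can instead sum each run of equal keys. The sketch below simulates that map, sort, reduce sequence in one Python 3 process; the function names are invented for the example and this is not the code from the slides.

```python
import re
from itertools import groupby
from operator import itemgetter

WORD_RE = re.compile('[a-zA-Z]+')

def map_words(lines):
    """Mapper: yield a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in WORD_RE.findall(line.strip().lower()):
            yield word, 1

def reduce_counts(sorted_pairs):
    """Reducer: sum the counts of each run of identical keys.
    Only correct when the pairs arrive sorted by key."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

def word_count(lines):
    """Stand-in for `mapper | sort | reducer` in a single process."""
    return list(reduce_counts(sorted(map_words(lines))))

print(word_count(["Donald Demuth", "Donald Draper"]))
# [('demuth', 1), ('donald', 2), ('draper', 1)]
```

Summing runs keeps only one key in memory at a time, which matters once the vocabulary no longer fits in a single dictionary.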

Longest Repeat (LCS)
Many problems can be solved with a series of Maps and Reduces
However, Hadoop Streaming is a single Map and Reduce step
After much trial and error my solution involves a pre-processing step

Pre-processing
[Diagram: ecoli.fasta → fasta_to_line.py → ecoli.fasta.line → gen_suffixes.py → ecoli.fasta.line.0 and further suffix files; file sizes shown include 4.5 megs, 4.4 megs, and 4.3 megs]
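The two pre-processing scripts are named on this slide but their code is not shown. The sketch below is a guess at what they do, based only on the file names and the shrinking file sizes: the first flattens a FASTA file into one line of sequence, the second writes a series of suffixes of that line. Both function bodies and the `step` parameter are assumptions, not the author's code.

```python
def fasta_to_line(fasta_lines):
    """Join FASTA sequence lines into one string, skipping '>' header lines.
    Assumed behaviour of fasta_to_line.py; the real script is not shown."""
    return ''.join(line.strip() for line in fasta_lines
                   if line.strip() and not line.startswith('>'))

def gen_suffixes(text, step):
    """Yield (start_index, suffix) pairs, one suffix every `step` characters.
    `step` is a hypothetical parameter; the real spacing is not stated."""
    for start in range(0, len(text), step):
        yield start, text[start:]

# Tiny made-up FASTA record, just to exercise the two functions.
fasta = [">seq1 demo\n", "ACGT\n", "TTGA\n"]
line = fasta_to_line(fasta)
print(line)
for start, suffix in gen_suffixes(line, 3):
    print(start, suffix)
```

Under this reading, each suffix file can be handed to a separate mapper, and the start index tells the mapper where its suffix sits in the full sequence.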

LCS Mapper
Pseudo-code
– Read a line from a suffix file
– Determine the index (first chars)
– Cycle through the first 100,000 positions
  Cycle through possible lengths (10 → 3000)
– Emit the Length (Key) and the Position (Val)
  Emit (-1) and (-1) to STAY ALIVE
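The pseudo-code above can be sketched as a brute-force generator. This is an illustration under assumptions, not the mapper from the talk: it treats "repeat" as "the same substring starts again at some later position", it stops growing a length as soon as no repeat is found, and it leaves out the (-1, -1) keep-alive emission. The parameter names are invented.

```python
def repeats_in_window(text, window, min_len, max_len):
    """Yield (length, position) for each substring that starts within the
    first `window` positions and occurs again later in `text`."""
    for pos in range(min(window, len(text))):
        for length in range(min_len, max_len + 1):
            if pos + length > len(text):
                break  # substring would run past the end of the text
            # Search for another occurrence starting anywhere after pos.
            if text.find(text[pos:pos + length], pos + 1) == -1:
                break  # if this length has no repeat, longer ones cannot
            yield length, pos

# Toy input: "abc" occurs at positions 0 and 4.
pairs = list(repeats_in_window("abcXabcY", window=8, min_len=2, max_len=5))
print(pairs)       # [(2, 0), (3, 0), (2, 1)]
print(max(pairs))  # (3, 0)
```

With the slide's settings (a 100,000-position window and lengths from 10 to 3000) this loop is expensive, which is consistent with the instance-hour figures reported in the conclusions.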

LCS Reducer
Pseudo-Code
– Simple
– Find the largest KEY emitted by any mapper
– Display it
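A reducer along these lines only has to remember the best pair seen so far. The sketch below assumes the mapper emits `length<TAB>position` lines in the same style as the word count example; that wire format and the function name are assumptions. The (-1, -1) keep-alive lines need no special handling because they never beat a real length.

```python
def largest_key(lines):
    """Return the (length, position) pair with the largest length.
    Assumes each line is 'length<TAB>position'."""
    best = (-1, -1)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        length, pos = (int(field) for field in line.split('\t', 1))
        if length > best[0]:
            best = (length, pos)
    return best

# Made-up emissions, including one keep-alive pair.
emitted = ["10\t40", "-1\t-1", "63\t128", "12\t7"]
length, pos = largest_key(emitted)
print('length(%d)pos(%d)' % (length, pos))  # length(63)pos(128)
```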

LCS w/ Murmur.txt
$ cat murmur.txt.line.0 | lcs/mapper.py | lcs/reducer.py
length(63)pos(128)
$ python
>>> text = open('murmur.txt.line').read()
>>> text[128:128+63]
'Dance the cha chaOr the can canShake your pom pomTo Duran Duran'
>>> seq = text[128:128+63]
>>> text.index(seq)
128
>>> text[129:].index(seq)
>>> text[128:128+63] == text[1777: ]
True
>>> text[1777: ]
'Dance the cha chaOr the can canShake your pom pomTo Duran Duran'

4. Using Tools, Amazon
Harness the power of many machines at once
– Easy to use 20
Need to sign up for:
– Amazon Elastic MapReduce (EMR)
– Amazon Elastic Compute Cloud (EC2)
– Amazon Simple Storage Service (S3)
– Amazon SimpleDB

Deploying Data/Code
First you’ll need to upload it to S3
– Create a new bucket (or global folder) named ecoli-lcs
– Create a new path named input, ecoli-lcs/input
– Upload all of the generated suffixes to the input folder
– Upload mapper.py and reducer.py to ecoli-lcs
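The talk appears to do these uploads by hand. For anyone who would rather script them, here is a hedged sketch that maps local files onto the bucket layout described above. The helper names are invented; the only AWS call is the `upload_file(Filename, Bucket, Key)` method of a boto3 S3 client, and the client is passed in rather than created here so the sketch runs without credentials.

```python
import os

def plan_uploads(suffix_files, code_files, input_prefix='input'):
    """Return (local_path, s3_key) pairs: suffix files go under input/,
    mapper.py and reducer.py go to the top level of the bucket."""
    plan = [(path, '%s/%s' % (input_prefix, os.path.basename(path)))
            for path in suffix_files]
    plan += [(path, os.path.basename(path)) for path in code_files]
    return plan

def upload_all(s3_client, bucket, plan):
    """Upload every planned file. `s3_client` is expected to provide
    boto3's upload_file(Filename, Bucket, Key) method."""
    for local_path, key in plan:
        s3_client.upload_file(local_path, bucket, key)

plan = plan_uploads(['ecoli.fasta.line.0'],
                    ['lcs/mapper.py', 'lcs/reducer.py'])
print(plan)

# Assumed invocation, once AWS credentials are configured and the bucket exists:
#   import boto3
#   upload_all(boto3.client('s3'), 'ecoli-lcs', plan)
```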

Creating a Job (Flow)

Creating a Job Flow (…)

RESULTS!
Need to download the output
$ cd output
$ cat * | sort
(...)
length(2815)pos( )
$ python
>>> text = open('ecoli.fasta.line').read()
>>> seq = text[ : ]
>>> text.index(seq)
>>> text[ :].index(seq)
>>> text[ : ] == text[ : ]

5. Conclusions
Costs
– It’s about 3 cents an hour for a “medium” VM
– One run took 840 instance hours (20+ actual)
  Approx. $25
– Used about 2000 instance hours in total
Hadoop Streaming is EASY
– Though requires many (easy) tools
– But costly if you have “bugs”
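The cost figures follow from the hourly rate. A quick check in Python 3, taking the slide's "about 3 cents an hour" as exact; the total for 2000 instance hours is derived here and is not a number stated in the talk.

```python
CENTS_PER_INSTANCE_HOUR = 3  # "about 3 cents an hour", taken as exact

def cost_dollars(instance_hours):
    """Cost in dollars for a given number of instance hours."""
    return instance_hours * CENTS_PER_INSTANCE_HOUR / 100

print(cost_dollars(840))   # 25.2, matching "Approx. $25" for one run
print(cost_dollars(2000))  # 60.0 for the roughly 2000 hours used in total
```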

A Better Solution?
Jeff Parker’s program used the following approach:
– Cycle through the sequence and find all repeats of a given size
– Emit the location
– Increase the size and use the previously known locations to find larger matches
Looks good for MapReduce (Core)
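The approach described above can be sketched as a seed-and-extend loop. This is one reading of the three bullet points, written for Python 3, and not Jeff Parker's actual program: positions are grouped by the short substring found there, and each round splits the surviving groups by their next character so that only positions that still repeat are carried forward. The `seed_len` parameter is hypothetical.

```python
from collections import defaultdict

def longest_repeat(text, seed_len=2):
    """Return (length, position) of a longest repeated substring,
    or (0, -1) if nothing of at least `seed_len` characters repeats."""
    # Find all repeats of the starting size and remember their locations.
    groups = defaultdict(list)
    for i in range(len(text) - seed_len + 1):
        groups[text[i:i + seed_len]].append(i)
    candidates = [g for g in groups.values() if len(g) > 1]

    best = (0, -1)
    length = seed_len
    while candidates:
        best = (length, candidates[0][0])
        # Increase the size: split each group by the character that follows.
        next_candidates = []
        for positions in candidates:
            by_next = defaultdict(list)
            for p in positions:
                if p + length < len(text):
                    by_next[text[p + length]].append(p)
            next_candidates.extend(g for g in by_next.values() if len(g) > 1)
        candidates = next_candidates
        length += 1
    return best

print(longest_repeat("abcXabcY"))  # (3, 0)
```

Each round touches only positions that already matched at the previous size, so the work shrinks quickly as the length grows, in contrast to re-scanning every position for every length.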