Algorithms for Regulatory Motif Discovery Xiaohui Xie University of California, Irvine.

Slides:



Advertisements
Similar presentations
Designing Algorithms Csci 107 Lecture 4. Outline Last time Computing 1+2+…+n Adding 2 n-digit numbers Today: More algorithms Sequential search Variations.
Advertisements

Bar Ilan University And Georgia Tech Artistic Consultant: Aviya Amir.
String Searching Algorithms Problem Description Given two strings P and T over the same alphabet , determine whether P occurs as a substring in T (or.
Data Structures and Algorithms (AT70.02) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: CLRS “Intro.
1 Prof. Dr. Th. Ottmann Theory I Algorithm Design and Analysis (12 - Text search, part 1)
Chapter 2: Fundamentals of the Analysis of Algorithm Efficiency
UMass Lowell Computer Science Analysis of Algorithms Prof. Karen Daniels Fall, 2006 Wednesday, 12/6/06 String Matching Algorithms Chapter 32.
Efficiency of Algorithms
6-1 String Matching Learning Outcomes Students are able to: Explain naïve, Rabin-Karp, Knuth-Morris- Pratt algorithms Analyse the complexity of these algorithms.
. Hidden Markov Model Lecture #6 Background Readings: Chapters 3.1, 3.2 in the text book, Biological Sequence Analysis, Durbin et al., 2001.
UMass Lowell Computer Science Analysis of Algorithms Prof. Karen Daniels Fall, 2001 Lecture 8 Tuesday, 11/13/01 String Matching Algorithms Chapter.
. Hidden Markov Model Lecture #6 Background Readings: Chapters 3.1, 3.2 in the text book, Biological Sequence Analysis, Durbin et al., 2001.
String Matching COMP171 Fall String matching 2 Pattern Matching * Given a text string T[0..n-1] and a pattern P[0..m-1], find all occurrences of.
Quick Search Algorithm A very fast substring search algorithm, SUNDAY D.M., Communications of the ACM. 33(8),1990, pp Adviser: R. C. T. Lee Speaker:
Designing Algorithms Csci 107 Lecture 4.
Pattern Matching COMP171 Spring Pattern Matching / Slide 2 Pattern Matching * Given a text string T[0..n-1] and a pattern P[0..m-1], find all occurrences.
1 CSE 417: Algorithms and Computational Complexity Winter 2001 Lecture 5 Instructor: Paul Beame TA: Gidon Shavit.
CSCI 347 / CS 4206: Data Mining Module 04: Algorithms Topic 06: Regression.
Motif finding: Lecture 1 CS 498 CXZ. From DNA to Protein: In words 1.DNA = nucleotide sequence Alphabet size = 4 (A,C,G,T) 2.DNA  mRNA (single stranded)
Case Study. DNA Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions used in the development and functioning of all known.
String Matching. Problem is to find if a pattern P[1..m] occurs within text T[1..n] Simple solution: Naïve String Matching –Match each position in the.
String Matching Using the Rabin-Karp Algorithm Katey Cruz CSC 252: Algorithms Smith College
Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.
1 Lab Session-III CSIT-120 Fall 2000 Revising Previous session Data input and output While loop Exercise Limits and Bounds Session III-B (starts on slide.
Spring 2015 Lecture 6: Hash Tables
Pairwise Sequence Alignment (I) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 22, 2005 ChengXiang Zhai Department of Computer Science University.
Advanced Algorithm Design and Analysis (Lecture 3) SW5 fall 2004 Simonas Šaltenis E1-215b
Semi-Numerical String Matching. All the methods we’ve seen so far have been based on comparisons. We propose alternative methods of computation such as:
Chapter 3 Sec 3.3 With Question/Answer Animations 1.
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
20/10/2015Applied Algorithmics - week31 String Processing  Typical applications: pattern matching/recognition molecular biology, comparative genomics,
Sampling Approaches to Pattern Extraction
MCS 101: Algorithms Instructor Neelima Gupta
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 1: Exact String Matching.
Hashing COMP171. Hashing 2 Hashing … * Again, a (dynamic) set of elements in which we do ‘search’, ‘insert’, and ‘delete’ n Linear ones: lists, stacks,
Strings and Pattern Matching Algorithms Pattern P[0..m-1] Text T[0..n-1] Brute Force Pattern Matching Algorithm BruteForceMatch(T,P): Input: Strings T.
Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
1 HEINZ NIXDORF INSTITUTE University of Paderborn Algorithms and Complexity Christian Schindelhauer Search Algorithms Winter Semester 2004/ Oct.
Fast Approximate Point Set Matching for Information Retrieval Raphaël Clifford and Benjamin Sach
String Matching String Matching Problem We introduce a general framework which is suitable to capture an essence of compressed pattern matching according.
06/12/2015Applied Algorithmics - week41 Non-periodicity and witnesses  Periodicity - continued If string w=w[0..n-1] has periodicity p if w[i]=w[i+p],
1 String Matching Algorithms Topics  Basics of Strings  Brute-force String Matcher  Rabin-Karp String Matching Algorithm  KMP Algorithm.
CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms.
Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven
String Algorithms David Kauchak cs302 Spring 2012.
Computer Science Background for Biologists CSC 487/687 Computing for Bioinformatics Fall 2005.
String-Matching Problem COSC Advanced Algorithm Analysis and Design
1 UNIT-I BRUTE FORCE ANALYSIS AND DESIGN OF ALGORITHMS CHAPTER 3:
1/39 COMP170 Tutorial 13: Pattern Matching T: P:.
A new matching algorithm based on prime numbers N. D. Atreas and C. Karanikas Department of Informatics Aristotle University of Thessaloniki.
Rabin & Karp Algorithm. Rabin-Karp – the idea Compare a string's hash values, rather than the strings themselves. For efficiency, the hash value of the.
1 String Matching Algorithms Mohd. Fahim Lecturer Department of Computer Engineering Faculty of Engineering and Technology Jamia Millia Islamia New Delhi,
Probabilistic Algorithms
Advanced Algorithms Analysis and Design
Advanced Algorithms Analysis and Design
Advanced Algorithm Design and Analysis (Lecture 12)
Learning Sequence Motif Models Using Expectation Maximization (EM)
Hash functions Open addressing
Rabin & Karp Algorithm.
Space-for-time tradeoffs
Tuesday, 12/3/02 String Matching Algorithms Chapter 32
String matching.
String-Matching Algorithms (UNIT-5)
Chapter 7 Space and Time Tradeoffs
KMP String Matching Donald Knuth Jim H. Morris Vaughan Pratt 1997.
Space-for-time tradeoffs
Hashing Sections 10.2 – 10.3 Lecture 26 CS302 Data Structures
Data Structures and Algorithms (AT70. 02) Comp. Sc. and Inf. Mgmt
Space-for-time tradeoffs
Presentation transcript:

Algorithms for Regulatory Motif Discovery Xiaohui Xie University of California, Irvine

Lambda cI/cro binding sites A: C: G: T: Positional weight matrix representation Sequence Logo

Probabilistic model for motif discovery Input: a set of sequence: {S 1, S 2 …,S N }, each of which is of length w, i.e. |S i |=w for all i. Two models: –Motif model: P(y|  ) –Background model: P(y|  ) Hidden variables: {z 1, z 2 …,z N }, where z i =1 if S i is a motif site and 0 otherwise. –P(S i |z i =1) = P(S i |  ) and P(S i |z i =0) = P(S i |  ) –P(z i =1|S i ) = P(S i |z i =1)P(z i =1)/[P(S i |z i =1)P(z i =1)+P(S i |z i =0)P(z i =0)]

Probabilistic model for motif discovery Input: a set of sequences: {S 1, S 2 …,S N }, each of which is of length w, i.e. |S i |=w for all i. Parameter estimation problem: –Find the  and  that maximize P(S 1, S 2 …,S N | ,  ) If z i =1 for all i, then  ij = c ij /N where c ij is the number of letter j occurring at position i in the set of sequences.

String matching

Exact String Matching Input: Two strings T[1…n] and P[1…m], containing symbols from alphabet . Example:  = {A,C,G,T} T[1…12] = “CAGTACATCGAT” P[1..3] = “AGT” Goal: find all “shifts” 0≤s ≤n-m such that T[s+1…s+m] = P

Simple Algorithm for s ← 0 to n-m Match ← 1 for j ← 1 to m if T[s+j]≠P[j] then Match ← 0 exit loop if Match=1 then output s

Analysis Running time of the simple algorithm: –Worst-case: O(nm) –Average-case (random text): O(n) (expectation) Ts = time spend on checking shift s (the number of comparisons until 1 st mismatch) E[T s ] < 2 (why) E[  s T s ] =  s E[T s ] = O(n)

Worst-case Is it possible to achieve O(n) for any input ? –Knuth-Morris-Pratt’77: deterministic –Karp-Rabin’81: randomized Digression: what is the difference between –Algorithm that is fast on a random input (as seen on the previous slide) –Randomized algorithm (as in the rest of this lecture)

Karp-Rabin Algorithm Idea: semi-numerical approach: –Consider all m-mers: –T[1…m], T[2…m+1], …, T[m-n+1…n] –Map each T[s+1…s+m] into a number ts –Map the pattern P[1…m] into a number p –Report the m-mers that map to the same value as p Problem: how to map all m-mers in O(n) time ?

Implementation Attempt I: –Assume  ={0,1} (for {A,C,G,T} convert: A:00, C:01, G:10, T:11) –Think about each T[s+1…s+m] as a number in binary representation, i.e., t s =T[s+1]2 m-1 +T[s+2]2 m-2 +…+T[s+m]2 0 –Find a fast way of computing t s+1 given t s –Output all s such that t s is equal to the number p represented by P

Magic formula How to transform t s =T[s+1]2 m-1 +T[s+2]2 m-2 +…+T[s+m]2 0 into t s+1 =T[s+2]2 m-1 +T[s+3]2 m-2 +…+T[s+m+1]2 0 ? Three steps: –Subtract T[s+1]2 m-1 –Multiply by 2 (i.e., shift the bits by one position) –Add T[s+m+1]2 0 Therefore: t s+1 = (t s - T[s+1]2 m-1 )*2 + T[s+m+1]2 0

Algorithm t s+1 = (t s - T[s+1]2 m-1 )*2 + T[s+m+1]2 0 Can compute t s+1 from t s using 3 arithmetic operations Therefore, we can compute all t 0,t 1,…,t n-m using O(n) arithmetic operations We can compute a number corresponding to P using O(m) arithmetic operations Are we done ?

Problem To get O(n) time, we would need to perform each arithmetic operation in O(1) time However, the arguments are m-bit long ! If m large, it is unreasonable to assume that operations on such big numbers can be done in O(1) time We need to reduce the number range to something more managable

Hashing We will instead compute t’ s =T[s+1]2 m-1 +T[s+2]2 m-2 +…+T[s+m]2 0 mod q where q is an “appropriate” prime number One can still compute t’ s+1 from t’ s : t’ s+1 = (t’ s - T[s+1]2 m-1 )*2+T[s+m+1]2 0 mod q If q is not large, we can compute all t’ s (and p’) in O(n) time

Problem Unfortunately, we can have false positives, i.e., T[s+1…s+m]  P but t s mod q = p mod q Our approach: –Use a random q –Show that the probability of a false positive is small → randomized algorithm

Aligning two sequences Longest common substring (LCS) problem (no gaps): –Input: Two strings T[1…n] and P[1…m], n≥m –Goal: Largest k such that T[i+1…i+k] = P[j+1…j+k] for some i,j How can we solve this problem efficiently ? –Hint: How can we check if LCS has length k ?

Checking for a common m-mer