Designing Spaced Seeds (Vineet Bafna, March 2006)

Presentation transcript:

Designing Spaced Seeds

Project/Exam deadlines
May 2 – Send email to me with a title of your project.
May 9 – Each student/group gives a 10 min. presentation on their proposed project.
– Show preliminary computations. What is the test plan? What is the data like, and how much is there?
Last week of classes:
– A 20 min. presentation from each group
– A written report on the project
– A take-home exam, due electronically on the date of the final exam

Accuracy
Consider a 64 bp sequence that is 70% similar to the query.
– Pr(an 11-mer matches) = 0.3
– Pr(a spaced seed matches) = [value missing]
This non-intuitive result leads to the selection of spaced words that are an order of magnitude faster for identical specificity and sensitivity.
Implemented in PATTERNHUNTER.

How to compute a spaced seed
No good algorithm is known. Iterate over all (M choose W) seeds, where M is the seed length and W is its number of 1s.
– Use a computation to decide Pr(match).
– Choose the seed that maximizes the probability.
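The exhaustive search described above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: `all_seeds`, `hits`, `hit_probability`, and `best_seed` are hypothetical names, and `hit_probability` is a brute-force stand-in for the "computation to decide Pr(match)" that sums over all 2^L match strings, so it is usable only for very small L.

```python
from itertools import combinations, product

def all_seeds(M, W):
    """Yield every binary seed of length M with exactly W ones."""
    for ones in combinations(range(M), W):
        yield "".join("1" if i in ones else "0" for i in range(M))

def hits(seed, s):
    """True if the seed matches the 0/1 match string s at some offset."""
    M = len(seed)
    return any(
        all(s[k + j] == "1" for j in range(M) if seed[j] == "1")
        for k in range(len(s) - M + 1)
    )

def hit_probability(seed, L, p):
    """Brute-force Pr(hit): sum over all 2^L match strings (tiny L only)."""
    total = 0.0
    for bits in product("01", repeat=L):
        s = "".join(bits)
        if hits(seed, s):
            ones = s.count("1")
            total += p ** ones * (1 - p) ** (L - ones)
    return total

def best_seed(M, W, L, p):
    """Return a seed of length M and weight W with the largest Pr(hit)."""
    return max(all_seeds(M, W), key=lambda seed: hit_probability(seed, L, p))
```

The point of the sketch is the outer loop over (M choose W) candidates; any faster probability computation can be swapped in for `hit_probability` without changing `best_seed`.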

Prob. Computation for Spaced Seeds
Given a specific seed $Q(M,W)$, compute the probability of a hit in a sequence of length $L$. We can assume that each position matches with probability $p$. The match/mismatch string is then a random binary string of length $L$ in which each position is 1 with probability $p$.

Prob. Computation for Spaced Seeds
Given a specific seed $Q(M,W)$, compute the probability of a hit in a sequence of length $L$.
– $Q$ is a binary string of length $M$, with $W$ 1s.
We try to match $Q$ against the binary 'match string' $S$, which is a random binary string of length $L$ with probability $p$ of success at each position.
[Figure: seed $Q$ of length $M$ aligned against the match string $S$ of length $L$]
$P_Q$ = Prob.($Q$ matches random $S$ at some location)
How can we compute $P_Q$?

Computing f(i,b)
For a specific string $b$, define $f(i,b)$ = Prob.($Q$ matches a random string $S$ of length $i$, given that $S$ ends in $b$).
Why is it sufficient to compute $f(i,b)$ for all $i$, $b$?
– $P_Q = f(L, \epsilon)$, where $\epsilon$ is the empty string.

Computing f(i,b)
Define $B_1$ as the set of all strings $b$ that match a suffix of $Q$. We have two possibilities:
– $b \in B_1$: $b$ is consistent with a suffix of $Q$.
– $b \in B_0 = B - B_1$: every other string $b$.

Computing f(i,b)
Case $b \in B_0$:
– $f(i,b) = f(i-1,\, b \gg 1)$, where $b \gg 1$ is $b$ with its last bit dropped
Case $b \in B_1$ and $|b| = M$:
– $f(i,b) = 1$

Computing f(i,b)
Case $b \in B_1$, $|b| < M$:
$f(i,b) = p\,f(i,1b) + (1-p)\,f(i,0b)$
(prepending a bit to $b$ does not change the length $i$ of $S$)
– Note that if $b \in B_1$, then $1b \in B_1$.
– However, it is possible that $0b \notin B_1$.
We want to iterate only over $b \in B_1$. Find the smallest $j$ s.t. $0b \gg j \in B_1$:
$f(i,b) = p\,f(i,1b) + (1-p)\,f(i-j,\, 0b \gg j)$
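The three cases of the recurrence can be put together in a short memoized sketch. It assumes $f(i,b)$ is the probability conditioned on $S$ ending in $b$, so extending $b$ by one bit on the left leaves $i$ unchanged and only the shift in the $B_0$ case reduces $i$. The function names are hypothetical, and the "smallest $j$" shortcut is replaced by repeated single shifts, which gives the same value but does not reproduce the stated running time.

```python
from functools import lru_cache

def hit_probability(Q, L, p):
    """P_Q = f(L, ''), with f(i, b) = Pr(Q hits S[1..i] | S[1..i] ends in b)."""
    M = len(Q)

    def in_B1(b):
        # b is consistent with the suffix of Q of the same length:
        # wherever that suffix has a 1, b has a 1 as well.
        suffix = Q[M - len(b):]
        return all(x == "1" for x, q in zip(b, suffix) if q == "1")

    @lru_cache(maxsize=None)
    def f(i, b):
        if i < M:                  # S is too short to contain any hit
            return 0.0
        if not in_B1(b):           # b in B_0: no hit can end at position i
            return f(i - 1, b[:-1])
        if len(b) == M:            # b in B_1 and |b| = M: a hit ends at i
            return 1.0
        # b in B_1 and |b| < M: condition on the bit just before b
        return p * f(i, "1" + b) + (1 - p) * f(i, "0" + b)

    return f(L, "")
```

For example, `hit_probability("11", 3, 0.5)` covers the three strings 110, 011, and 111 out of eight, each with probability 1/8.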

Efficiency
$|B_1| = O(M\,2^{M-W})$
The iteration proceeds for all $i$ and all $b \in B_1$, and each comparison needs $O(M)$ steps:
$O(M\,2^{M-W} \cdot L \cdot M) = O(M^2\,2^{M-W}\,L)$

More efficient algorithm for spaced seed design
Due to Buhler, Keich, and Sun.
Consider a seed $\pi$ (weight $w$, span $s$). Let $Q_\pi$ be the set of all $2^{s-w}$ strings matching $\pi$.
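As a small illustration of the set $Q_\pi$, the sketch below enumerates every binary string that has a 1 wherever the seed has a 1. The function name `matching_strings` is hypothetical.

```python
from itertools import product

def matching_strings(seed):
    """All binary strings of the seed's span that have a 1 wherever the seed does."""
    free = [i for i, c in enumerate(seed) if c == "0"]   # "don't care" positions
    out = []
    for bits in product("01", repeat=len(free)):
        s = list(seed)
        for i, bit in zip(free, bits):
            s[i] = bit
        out.append("".join(s))
    return out

print(matching_strings("1001"))  # ['1001', '1011', '1101', '1111']
```

For the seed 1001 (weight 2, span 4) this yields 2^(4-2) = 4 strings.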

Trie construction
Our goal is to make an automaton that accepts all strings which contain a string from $Q_\pi$.
– Make a trie $T_\pi$ from $Q_\pi$. $T_\pi$ is a DFA that accepts precisely $Q_\pi$.
– Can we convert $T_\pi$ to a DFA that accepts all strings that have a string from $Q_\pi$ as a suffix?
Ex: $\pi = 1001$

Failure links
The use of failure links allows us to traverse any string until a string from $Q_\pi$ is reached.
Note: the DFA has special structure. Does it help?
– No failure links when outgoing edge is 0.
– Therefore, we fail only when we see a …
Ex: $\pi = 1001$

Substring automaton
– We started with a trie that only accepts $Q_\pi$.
– Next, we use failure links to accept any string with a suffix from $Q_\pi$.
– Finally, make every accepting state an absorbing one, to accept all strings containing a string from $Q_\pi$ as a substring.
Ex: $\pi = 1001$
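The three construction steps (trie, failure links, absorbing accepting states) can be sketched as below. This assumes standard Aho-Corasick-style failure links computed by breadth-first search, which the slides do not spell out; `substring_automaton` and `accepts` are hypothetical names, and states are plain integers with 0 as the begin node.

```python
from collections import deque

def substring_automaton(seed):
    """Return (delta, accepting) for the DFA over {0,1} that accepts every
    string containing a match to the seed."""
    # Step 1: trie of all strings matching the seed, built level by level.
    goto = [{}]                       # goto[node][bit] -> child node
    frontier = [0]
    for c in seed:
        bits = ("1",) if c == "1" else ("0", "1")
        nxt = []
        for u in frontier:
            for bit in bits:
                goto.append({})
                goto[u][bit] = len(goto) - 1
                nxt.append(len(goto) - 1)
        frontier = nxt
    accepting = set(frontier)         # leaves spell the strings that match

    # Step 2: failure links by BFS; missing edges are filled in via the links.
    fail = [0] * len(goto)
    delta = [dict() for _ in goto]
    queue = deque()
    for bit in "01":
        if bit in goto[0]:
            delta[0][bit] = goto[0][bit]
            queue.append(goto[0][bit])
        else:
            delta[0][bit] = 0
    while queue:
        u = queue.popleft()
        for bit in "01":
            if bit in goto[u]:
                v = goto[u][bit]
                fail[v] = delta[fail[u]][bit]
                delta[u][bit] = v
                queue.append(v)
            else:
                delta[u][bit] = delta[fail[u]][bit]

    # Step 3: make every accepting state absorbing.
    for q in accepting:
        delta[q] = {"0": q, "1": q}
    return delta, accepting

def accepts(seed, text):
    """Run the automaton on a 0/1 string and report whether it accepts."""
    delta, accepting = substring_automaton(seed)
    q = 0
    for bit in text:
        q = delta[q][bit]
    return q in accepting
```

Because all strings matching a seed have the same length, only the leaves of the trie are accepting, so no extra output links are needed in this sketch.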

Computing sensitivity of $\pi$
Compute the probability that a 'random' string of length $l$ will match $\pi$. Equivalently: what is the probability that a random string of length $l$, read starting from the begin node, ends in an accepting state of $A_\pi$?
Case 1: Each bit of $S$ is 1 with probability $p$.
$P(q,t)$ = probability that we reach state $q$ after reading the first $t$ bits.
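The table $P(q,t)$ can be filled in one column at a time. The sketch below is an illustration under stated assumptions rather than the Mandala implementation: each automaton state is represented by the string it spells in the trie, a failure step is simulated by dropping leading bits until a trie prefix remains, and the names `sensitivity` and `step` are hypothetical.

```python
from itertools import product

def sensitivity(seed, length, p):
    """Pr(a random 0/1 string of the given length, each bit 1 with
    probability p, contains a match to the seed)."""
    s = len(seed)
    zeros = [i for i, c in enumerate(seed) if c == "0"]
    Q = set()                                  # strings matching the seed
    for bits in product("01", repeat=len(zeros)):
        w = list(seed)
        for i, b in zip(zeros, bits):
            w[i] = b
        Q.add("".join(w))
    prefixes = {w[:k] for w in Q for k in range(s + 1)}   # automaton states

    def step(q, bit):
        if q in Q:                             # accepting states are absorbing
            return q
        t = q + bit
        while t not in prefixes:               # fail: drop leading bits
            t = t[1:]
        return t

    # P[q] = probability of being in state q after reading t bits
    P = {"": 1.0}
    for _ in range(length):
        nxt = {}
        for q, pr in P.items():
            for bit, w in (("1", p), ("0", 1 - p)):
                r = step(q, bit)
                nxt[r] = nxt.get(r, 0.0) + pr * w
        P = nxt
    return sum(pr for q, pr in P.items() if q in Q)
```

A call such as `sensitivity("1" * 11, 64, 0.7)` evaluates the ungapped 11-mer setting and can be checked against the value quoted on the Accuracy slide.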

Complexity
Size of the automaton: $W\,2^{M-W}$. What is the in-degree?
Claimed complexity = $O(W\,2^{M-W}\,L)$
– $O(M^2/W)$ faster than the previous algorithm

Generalizing the match string
The match string may have a different distribution.
– Errors do not fall independently at random.
Instead of independent Bernoulli trials, we can have a higher-order Markov process generating the match string. The algorithm of Keich et al. cannot deal with this extension, but it is natural in Mandala.

Experimental Results with Mandala
428 human/mouse genomic aligned regions. Repeat-mask the alignments and separate them into coding/non-coding regions. A total of [number missing] similarities (alignments) were pulled. These are used to check the sensitivity (accuracy) of filters.

Effect of Span
Solid line: 0-th order model. Dashed line: 5-th order model.
W=11 throughout: a larger span implies more gaps; span=11 implies an ungapped (BLASTN) seed.

Accuracy of different seeds
[Figure panels: Non-coding, Coding]

Model order
Non-coding: solid line. Coding: dashed line.

What about multiple keywords
All of the analysis is for ungapped alignments. With indels, multiple words might be more sensitive. Mandala works for multiple keywords also.
Can we make the algorithm more efficient? In particular, there is an explosion of states in making a deterministic automaton. Can we match against a non-deterministic automaton?

Regular Expressions
Concise representation of a set of strings over an alphabet $\Sigma$.
Described by a string over …
R is a r.e. if and only if …

Regular Expression
Q: Let $\Sigma = \{A, C, E\}$.
– Is (A+C)*EEC* a regular expression?
– *(A+C)?
– AC*..E?
Q: When is a string s in a regular expression?
– R = (A+C)*EEC*
– Is CEEC in R?
– AEC?
– ACEE?
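The membership questions can be checked mechanically. The sketch below uses Python's `re` module, writing the union operator `+` of the slide notation as `|`; the helper name `in_language` is an illustrative choice.

```python
import re

# R = (A+C)*EEC* in the slide notation; '+' (union) becomes '|' in Python.
R = re.compile(r"(A|C)*EEC*")

def in_language(s):
    """True if the whole string s belongs to the language of R."""
    return R.fullmatch(s) is not None

for s in ["CEEC", "AEC", "ACEE"]:
    print(s, in_language(s))
# CEEC True
# AEC False
# ACEE True
```

`fullmatch` is used rather than `search` because the question is whether the entire string is in R, not whether some substring is.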

Regular Expression & Automata
Every R.E. can be expressed by an automaton (a directed graph) with the following properties:
– The automaton has a start and an end node.
– Each edge is labeled with a symbol from $\Sigma$, or $\epsilon$.
Suppose R is described by automaton A. Then $s \in R$ if and only if there is a path from start to end in A, labeled with s.

Examples: Regular Expression & Automata
(A+C)*EEC*
[Figure: automaton for (A+C)*EEC*, from start to end, with edges labeled A, C, E, E, C]

Constructing automata from R.E.
– $R = \{\epsilon\}$
– $R = \{\sigma\}$, $\sigma \in \Sigma$
– $R = R_1 + R_2$
– $R = R_1 \cdot R_2$
– $R = R_1^*$
[Figure: automaton fragments for each case, joined by $\epsilon$ edges]

Regular Expression Matching
Given a database D and a regular expression R, is a substring of D in R?
– Is there a string D[l..c] that is accepted by the automaton of R?
– Simpler Q: Is D[1..c] accepted by the automaton of R?

Alg. for matching R.E.
If D[1..c] is accepted by the automaton $R_A$:
– There is a path labeled D[1]…D[c] that goes from START to END in $R_A$.
[Figure: path labeled D[1], D[2], …, D[c]]

Alg. for matching R.E.
If D[1..c] is accepted by the automaton $R_A$:
– There is a path labeled D[1]…D[c] that goes from START to END in $R_A$.
– There is a path labeled D[1]..D[c-1] from START to a node u, and a path labeled D[c] from u to END.
[Figure: path labeled D[1]..D[c-1] to node u, then D[c]]

D.P. to match regular expression
Define:
– $A[u,\sigma]$ = automaton node reached from $u$ after reading $\sigma$
– Eps($u$): set of all nodes reachable from node $u$ using epsilon transitions
– $N[c]$ = subset of nodes reachable from the START node after reading D[1..c]
– Q: When is $v \in N[c]$?

D.P. to match regular expression
Q: When is $v \in N[c]$?
A: If for some $u \in N[c-1]$ and $w = A[u, D[c]]$, we have $v \in \{w\} \cup \mathrm{Eps}(w)$.
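The rule for $N[c]$ translates directly into a loop over the database. The sketch below is a minimal illustration with several assumptions: the automaton for (A+C)*EEC* is hand-built with hypothetical node numbers (0 = START, 3 = END), `step` plays the role of $A[u,\sigma]$, `eps` lists the epsilon edges, and the base case $N[0]$ = {START} plus Eps(START) is not stated on the slides.

```python
def eps_closure(u, eps):
    """Eps(u): all nodes reachable from u using epsilon edges only."""
    seen, stack = set(), [u]
    while stack:
        x = stack.pop()
        for y in eps.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen

def reachable_sets(D, step, eps, start):
    """Return the list N, where N[c] holds the nodes reachable from START
    after reading D[1..c]."""
    N = [{start} | eps_closure(start, eps)]       # assumed base case N[0]
    for symbol in D:
        cur = set()
        for u in N[-1]:
            w = step.get((u, symbol))             # w = A[u, D[c]]
            if w is not None:
                cur |= {w} | eps_closure(w, eps)  # v in {w} + Eps(w)
        N.append(cur)
    return N

# Hand-built automaton for (A+C)*EEC*: node 0 = START, node 3 = END.
STEP = {(0, "A"): 0, (0, "C"): 0, (1, "E"): 2, (2, "E"): 3, (3, "C"): 3}
EPS = {0: [1]}
START, END = 0, 3

N = reachable_sets("CEEC", STEP, EPS, START)
print(END in N[4])  # True: CEEC is accepted
```

Each step touches every node of the current set once, so the work per database symbol is bounded by the size of the automaton.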

Algorithm

The final step
We have answered the question:
– Is D[1..c] accepted by R?
– Yes, if $\mathrm{END} \in N[c]$.
We need to answer:
– Is D[l..c] (for some l, and some c) accepted by R?
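One common way to handle an arbitrary start position l, which the slides leave as a question, is to add START and Eps(START) back into the node set before every symbol, so that a match may begin anywhere. The sketch below assumes that approach; the automaton for (A+C)*EEC* and all names (`match_ends`, `STEP`, `EPS`) are hypothetical.

```python
def eps_closure(u, eps):
    """Eps(u): all nodes reachable from u using epsilon edges only."""
    seen, stack = set(), [u]
    while stack:
        x = stack.pop()
        for y in eps.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen

def match_ends(D, step, eps, start, end):
    """Return the 1-based positions c such that some substring D[l..c]
    is accepted by the automaton."""
    start_set = {start} | eps_closure(start, eps)
    cur = set(start_set)
    ends = []
    for c, symbol in enumerate(D, start=1):
        nxt = set()
        for u in cur:
            w = step.get((u, symbol))
            if w is not None:
                nxt |= {w} | eps_closure(w, eps)
        if end in nxt:
            ends.append(c)
        cur = nxt | start_set        # a match may also begin at position c+1
    return ends

# Hand-built automaton for (A+C)*EEC*: node 0 = START, node 3 = END.
STEP = {(0, "A"): 0, (0, "C"): 0, (1, "E"): 2, (2, "E"): 3, (3, "C"): 3}
EPS = {0: [1]}

print(match_ends("EAEEC", STEP, EPS, 0, 3))  # [4, 5]
```

In the example, the substrings EE and AEE end at position 4, and EEC and AEEC end at position 5. This sketch reports only non-empty matches.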
