Finding approximate occurrences of a pattern that contains gaps
Inbok Lee, Costas S. Iliopoulos, Alberto Apostolico, Kunsoo Park

Presentation transcript:

Contents
The exact/approximate gapped pattern matching problem
Previous approaches
Our contributions

Exact gapped pattern matching problem
Definition: find the occurrences in the text of a pattern that contains gaps.
Pattern P = AA *(2,3) GC *(1,3) TT
Subpatterns: P1 = AA, P2 = GC, P3 = TT
*(2,3): any string whose length is between 2 and 3
*(1,3): any string whose length is between 1 and 3
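To make the notation concrete, here is a minimal sketch that splits a gapped pattern written in the slide's notation into its subpatterns and gap ranges. The function name `parse_gapped` and the tolerance for spaces around `*` and inside the parentheses are assumptions made for illustration, not part of the original talk.

```python
import re

# Matches a gap such as "*(2,3)" or "* (2,3)" and captures the two bounds.
GAP = re.compile(r"\*\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)")

def parse_gapped(pattern):
    """Split a gapped pattern into (subpatterns, gaps)."""
    # With two capture groups, re.split returns:
    # sub, lo, hi, sub, lo, hi, ..., sub
    pieces = GAP.split(pattern)
    subs = [p.strip() for p in pieces[0::3]]
    gaps = [(int(lo), int(hi)) for lo, hi in zip(pieces[1::3], pieces[2::3])]
    return subs, gaps

print(parse_gapped("AA * (2,3) GC * (1,3) TT"))
# (['AA', 'GC', 'TT'], [(2, 3), (1, 3)])
```

The result always has one more subpattern than gaps, which matches the shape P1 gap P2 gap P3 used on the slide.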

Example – Exact matching
Pattern P = AA *(2,3) GC *(1,3) TT
Text T = GCAATTGCACTTC
In T, AA is followed by TT (a gap of length 2), then GC, then AC (a gap of length 2), then TT.

Approximate gapped pattern matching problem
Definition: find all substrings of the text that match the pattern when each subpattern P_i may differ from its occurrence by at most k_i insertions, deletions, and substitutions.
Pattern P = AA *(2,3) GC *(1,3) TT
P1 = AA with k1 = 0, P2 = GC with k2 = 1, P3 = TT with k3 = 0
*(2,3): any string whose length is between 2 and 3
*(1,3): any string whose length is between 1 and 3

Example – Approximate matching
Pattern P = AA *(2,3) GC *(1,3) TT, k1 = k3 = 0, k2 = 1
Text T = GCAATTGTACTTC
GT in the text matches P2 = GC with 1 substitution.

Class of characters
Allow a set of two or more characters at one position of the pattern.
Pattern P = AA *(2,3) G[CT] *(1,3) TT, where [CT] matches C or T
P1 = AA with k1 = 0, P2 = G[CT] with k2 = 1, P3 = TT with k3 = 0
*(2,3): any string whose length is between 2 and 3
*(1,3): any string whose length is between 1 and 3

Example – Class of characters
Pattern P = AA *(2,3) G[CT] *(1,3) TT
Text T = GCAATTGTACTTC
GT in the text matches G[CT], because T belongs to the class [CT].

Applications of gapped pattern matching
Information retrieval
Data mining
Computational biology, especially finding motifs in a sequence

Motifs
Motifs are biologically important common regions.
[Figure: Sequences 1 to 4 with a shared motif]
Sometimes overall sequence alignment does not show the relation between biologically related sequences.

PROSITE database
Database of protein families, domains and motifs.
Motifs are represented as gapped patterns over the alphabet of 20 amino acids.
Prion protein (Creutzfeldt-Jakob disease): E *(1,1) [ED] *(1,1) K[LIVM][LIVM] *(1,1) [KR][LIVM][LIVM] *(1,1) [QE]MC *(2,2) QY
Ribosomal protein L1: [IM] *(2,2) [LIVA] *(2,3) [LIVM][GA] *(2,2) [LMS][GSNH][PTKR][KRAV]G *(1,1) [LIMF]P[DENSTKQ]

Finding hidden motifs
Given a set of sequences, how do we find unknown motifs?

Finding motifs in a sequence
Given a known motif and a new sequence, find the occurrences of the motif in the sequence. (Our topic)
As biological sequences may contain errors, we should consider approximate occurrences.

Previous approaches
Regular expression approaches: exact matching
Navarro and Raffinot's approach [RECOMB 2002]: exact and approximate matching
Akutsu's approach [IEICE Trans. Information and Systems 1996]: approximate matching

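Regular expression approach
Pattern P = AA *(2,3) GC *(1,3) TT
Regular expression: AA**(*|ε)GC*(*|ε)(*|ε)TT, where * stands for any single character and ε for the empty string
[Figure: the corresponding automaton, with ε-transitions for the optional gap characters]
The expression is run as a nondeterministic finite automaton (NFA) or its equivalent deterministic finite automaton (DFA).
Too general!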
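As a rough illustration of the regular expression view, the sketch below rewrites a gapped pattern as a Python `re` pattern, using `.{a,b}` for a gap of length a to b. The helper name `gapped_to_regex` and the list-based input format are assumptions. Note that Python's `re` module is a backtracking matcher, so this shows the translation only, not the NFA/DFA execution the slide refers to.

```python
import re

def gapped_to_regex(parts):
    """parts alternates subpattern strings and (min_gap, max_gap) tuples."""
    pieces = []
    for part in parts:
        if isinstance(part, tuple):
            lo, hi = part
            pieces.append(".{%d,%d}" % (lo, hi))   # any string of length lo..hi
        else:
            # Assumes subpatterns use only letters and [..] classes,
            # which are already valid regular expression syntax.
            pieces.append(part)
    return "".join(pieces)

pattern = gapped_to_regex(["AA", (2, 3), "GC", (1, 3), "TT"])
match = re.search(pattern, "GCAATTGCACTTC")
print(pattern)        # AA.{2,3}GC.{1,3}TT
print(match.span())   # (2, 12)
```

The span is given as 0-based, end-exclusive text positions.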

Navarro and Raffinot's approach
[Figure: NFA for P = AA *(2,3) GC *(1,3) TT]
An NFA is not easy to run directly, and the equivalent DFA can be large.
Simulate the NFA with a bit vector, using the bit-parallelism technique (all bits of a machine word are read and written in a single operation).
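To show what bit-parallel NFA simulation means in the simplest setting, here is a sketch of the classic Shift-And algorithm for exact matching of a plain (ungapped) pattern. It is a simpler stand-in, not Navarro and Raffinot's algorithm for gapped patterns: each bit of `state` is one NFA state, and one shift plus one AND advances all states at once. Python integers are unbounded, so the word size w plays no role here.

```python
def shift_and(pattern, text):
    """Return 0-based start positions of exact occurrences of pattern in text."""
    m = len(pattern)
    masks = {}
    for i, ch in enumerate(pattern):
        # Bit i of masks[ch] is set when pattern[i] == ch.
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)   # bit of the final NFA state
    state = 0
    hits = []
    for j, ch in enumerate(text):
        # Advance every active state by one character in a single step.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits

print(shift_and("GC", "GCAATTGCACTTC"))  # [0, 6]
```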

Navarro and Raffinot's approach
Allow k errors for the whole pattern.
[Figure: copies of the NFA for 0 errors and 1 error]
Works for small patterns and a small number of errors.
O(km'n / w) time algorithm (m' is the total length of the pattern, n is the length of the text, w is the word size).

Akutsu's approach
Combination of dynamic programming and a balanced search tree.
O(mn log n) time
[Figure: dynamic programming table of the text against P1 *(a1,b1) P2 *(a2,b2) P3; the tree is used to compute the smallest values in the gap range]

Drawbacks of the previous approaches
[Figure: example alignments marked with X and O]
k = 3 for the whole pattern
k1 = 1, k2 = 1, k3 = 1: more sensitive and desirable
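The difference between one error bound for the whole pattern and one bound per subpattern can be sketched with a small check. The error counts below are hypothetical values chosen for illustration, and the function names are assumptions, not taken from the talk.

```python
def accepts_global(errors, k):
    """One bound k on the total number of errors in the whole pattern."""
    return sum(errors) <= k

def accepts_per_subpattern(errors, limits):
    """A separate bound k_i on the errors of each subpattern P_i."""
    return all(e <= limit for e, limit in zip(errors, limits))

clustered = (3, 0, 0)   # hypothetical: all three errors fall in P1
spread = (1, 1, 1)      # hypothetical: one error in each subpattern

print(accepts_global(clustered, 3), accepts_global(spread, 3))   # True True
print(accepts_per_subpattern(clustered, (1, 1, 1)))              # False
print(accepts_per_subpattern(spread, (1, 1, 1)))                 # True
```

A global bound cannot tell the two cases apart, while per-subpattern bounds reject the candidate whose first subpattern is badly damaged.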

Our contributions
O(ln + m) time algorithm for the exact gapped pattern matching problem (l: number of subpatterns, n: length of the text, m: length of the pattern).
O(mn) time algorithm for the approximate gapped pattern matching problem.

Graph Modeling
1. Create a node wherever a subpattern appears (exactly or approximately) in the text.
2. Link two nodes with an edge if they represent two consecutive subpatterns and satisfy the gap condition.
3. If there is a path P1 - P2 - ... - Pl in the graph (l is the number of subpatterns), there is an occurrence of the pattern in the text.

Exact matching
P = AA *(2,3) GC *(1,3) TT, with P1 = AA, P2 = GC, P3 = TT
Text T = GCAATTGCACTTC
Step 1. Create nodes: one node for P1, two for P2, and two for P3.

Exact matching
P = AA *(2,3) GC *(1,3) TT, with P1 = AA, P2 = GC, P3 = TT
Text T = GCAATTGCACTTC
Step 2. Connect the nodes with edges.

Exact matching
P = AA *(2,3) GC *(1,3) TT, with P1 = AA, P2 = GC, P3 = TT
Text T = GCAATTGCACTTC
Step 3. Find the path by depth-first search.
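The three steps can be sketched as follows, using the text T = GCAATTGCACTTC from the exact matching example. The function names, the naive occurrence search, and the convention that a gap is measured from the end of one subpattern to the start of the next are assumptions made for this illustration.

```python
def find_all(sub, text):
    """0-based start positions of exact occurrences of sub (naive search)."""
    return [i for i in range(len(text) - len(sub) + 1) if text.startswith(sub, i)]

def build_graph(subs, gaps, text):
    # Step 1: nodes[i] holds the start positions of subpattern i.
    nodes = [find_all(s, text) for s in subs]
    # Step 2: edge from (i, s) to (i + 1, t) when the gap length is allowed.
    edges = {}
    for i in range(len(subs) - 1):
        lo, hi = gaps[i]
        for s in nodes[i]:
            end = s + len(subs[i])
            edges[(i, s)] = [(i + 1, t) for t in nodes[i + 1] if lo <= t - end <= hi]
    return nodes, edges

def paths(nodes, edges, last):
    # Step 3: depth-first search from every node of the first subpattern.
    found = []
    def dfs(node, path):
        if node[0] == last:
            found.append(path)
            return
        for nxt in edges.get(node, []):
            dfs(nxt, path + [nxt[1]])
    for s in nodes[0]:
        dfs((0, s), [s])
    return found

subs, gaps = ["AA", "GC", "TT"], [(2, 3), (1, 3)]
nodes, edges = build_graph(subs, gaps, "GCAATTGCACTTC")
print(nodes)                               # [[2], [0, 6], [4, 10]]
print(paths(nodes, edges, len(subs) - 1))  # [[2, 6, 10]]
```

The node lists agree with the slide: one node for P1 and two each for P2 and P3, with a single path through them.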

A better idea
No need to build the graph explicitly.
P = AA *(2,3) GC *(1,3) TT
Text T = GCAATTGCACTTC
Step 1. Find P1 = AA and compute the candidate range for P2.

A better idea
P = AA *(2,3) GC *(1,3) TT
Text T = GCAATTGCACTTC
Step 2. Find P2 = GC within the candidate range and compute the candidate range for P3.

A better idea
P = AA *(2,3) GC *(1,3) TT
Text T = GCAATTGCACTTC
Step 3. After finding P3 = TT within the candidate range, we have found an occurrence of P.
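A minimal sketch of the candidate range idea, under stated assumptions: the allowed start positions for the next subpattern are kept in a boolean array that is rebuilt with a difference array after each subpattern, and subpatterns are located with a naive `startswith` scan rather than a linear-time matcher. It illustrates the idea only and does not claim the O(ln + m) bound of the talk; the function name is hypothetical.

```python
def gapped_end_positions(subs, gaps, text):
    """End positions (0-based, exclusive) of occurrences of the gapped pattern."""
    n = len(text)
    allowed = [True] * (n + 1)            # P1 may start anywhere
    for i, sub in enumerate(subs):
        # Occurrences of the current subpattern that start in the candidate range.
        ends = [s + len(sub) for s in range(n - len(sub) + 1)
                if allowed[s] and text.startswith(sub, s)]
        if i == len(subs) - 1:
            return ends
        lo, hi = gaps[i]
        diff = [0] * (n + 2)              # difference array over start positions
        for e in ends:
            if e + lo <= n:
                diff[e + lo] += 1         # candidate range opens at e + lo
                diff[min(e + hi, n) + 1] -= 1   # and closes after e + hi
        allowed, running = [], 0
        for p in range(n + 1):
            running += diff[p]
            allowed.append(running > 0)
    return []

print(gapped_end_positions(["AA", "GC", "TT"], [(2, 3), (1, 3)], "GCAATTGCACTTC"))
# [12]
```

No graph is stored: after each subpattern only the union of candidate ranges survives, which is all the next step needs.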

Approximate matching
Almost the same idea as the exact matching case: find the approximate occurrences of the subpatterns instead of the exact ones.
Text T = GCAATTGCACTTC
P1 = AA with k1 = 0, followed by the gap *(2,3), gives the candidate range for P2 (k2 = 1).

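Approximate matching
P2 = GC with k2 = 1, followed by the gap *(1,3), gives the candidate range for P3 (k3 = 0).
Dynamic programming table for P2 against the text GCAC (∞: infinity, no alignment can start from here):

         G  C  A  C
     0   0  ∞  ∞  ∞
 G   1   0  1  2  3
 C   2   1  0  1  2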

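Approximate matching
P3 = TT with k3 = 0
Dynamic programming table for P3 against the text TTC:

         T  T  C
     0   0  0  ∞
 T   1   0  0  1
 T   2   1  0  1

The 0 in the last row marks an approximate occurrence of the pattern.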
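The tables on these two slides are consistent with the standard edit distance recurrence in which row 0 is 0 at the positions where an alignment may start and infinity elsewhere. The sketch below recomputes them under that assumption; the function name `dp_table` and the `start_ok` argument are introduced here for illustration only.

```python
INF = float("inf")

def dp_table(sub, window, start_ok):
    """Edit distance table of sub against window.

    start_ok[j] says whether an alignment may start after j window characters,
    so it has len(window) + 1 entries.
    """
    rows = [[0 if ok else INF for ok in start_ok]]
    for pc in sub:
        prev = rows[-1]
        cur = [prev[0] + 1]
        for j, tc in enumerate(window, start=1):
            cur.append(min(prev[j - 1] + (pc != tc),   # match or substitution
                           prev[j] + 1,                # pattern character left unmatched
                           cur[j - 1] + 1))            # extra text character
        rows.append(cur)
    return rows

# P2 = GC against GCAC, alignments may start at the first two positions.
for row in dp_table("GC", "GCAC", [True, True, False, False, False]):
    print(row)
# [0, 0, inf, inf, inf]
# [1, 0, 1, 2, 3]
# [2, 1, 0, 1, 2]

# P3 = TT against TTC.
print(dp_table("TT", "TTC", [True, True, True, False])[-1])   # [2, 1, 0, 1]
```

Entries of the last row that are at most k_i mark end positions of approximate occurrences of the subpattern, and these in turn determine the zero entries in row 0 of the next table.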

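Handling class of characters
Represent characters as bit masks, with one bit per character in the order A, G, T, C. For example, [GC] = 0101.
A text character matches a pattern class if the bitwise AND of their masks is nonzero: G & [GC] is nonzero (match), T & [GC] is zero (mismatch).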
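A small sketch of the bit mask test, assuming the bit order A, G, T, C shown on the slide. The names `BIT`, `class_mask`, and `matches` are chosen for illustration.

```python
# One bit per character, in the order A G T C used on the slide.
BIT = {"A": 0b1000, "G": 0b0100, "T": 0b0010, "C": 0b0001}

def class_mask(chars):
    """Mask of a character class such as [GC]."""
    mask = 0
    for ch in chars:
        mask |= BIT[ch]
    return mask

def matches(text_char, chars):
    """A text character matches the class when the AND of the masks is nonzero."""
    return (BIT[text_char] & class_mask(chars)) != 0

print(format(class_mask("GC"), "04b"))  # 0101
print(matches("G", "GC"))               # True
print(matches("T", "GC"))               # False
```

A plain pattern character is the special case of a class with a single bit set, so the same test covers both.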

Time Complexity
O(mn) (m is the length of the pattern, n is the length of the text), but faster in practice.
[Figure: the text with P1, P2, and P3]

Conclusion
O(ln + m) time algorithm for the exact gapped pattern matching problem.
O(mn) time algorithm for the approximate gapped pattern matching problem.
Open problem: time complexity in the average case?