Advanced Data Structure: Bioinformatics


Advanced Data Structure: Bioinformatics (24/02/15)
First week: Algorithms for exact string matching.
Second week: Approximate search and alignment of short sequences.
Third week: Dealing with long sequences.

Advanced Data Structure: bibliography
Bioinformatics: Sequence and Genome Analysis. David W. Mount
Flexible Pattern Matching in Strings (2002). Gonzalo Navarro and Mathieu Raffinot
http://www-igm.univ-mlv.fr/~lecroq/string/index.html
http://www.ncbi.nlm.nih.gov/

First week: algorithms for exact string matching
One pattern: the algorithm depends on |p| and |Σ| (the pattern length and the alphabet size).
k patterns: the algorithm depends on k, |p| and |Σ|.
Second week: approximate search and alignment of short sequences.
Third week: dealing with long sequences.

Exact string matching for one pattern
How do string matching algorithms perform the search?
For instance, given the sequence CTACTACTACGTCTATACTGATCGTAGCTACTACATGC, search for the pattern ACTGA and for the pattern TACTACGGTATGACTAA.
As you have seen this morning ....

Exact string matching: Brute force algorithm
Example: given the pattern ATGTA, the search is
G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
    A T G T A
      A T G T A
        A T G T A
          A T G T A

Exact string matching: Brute force algorithm
How is the comparison made? From left to right: a prefix of the pattern is matched against the window.
Which is the next position of the window? The window is shifted only one cell.
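As a minimal sketch of the brute force search described on this slide, assuming 0-based positions and a function name (`brute_force_search`) chosen only for illustration:

```python
def brute_force_search(text: str, pattern: str) -> list[int]:
    """Return every 0-based position where pattern occurs in text."""
    n, m = len(text), len(pattern)
    occurrences = []
    for pos in range(n - m + 1):          # the window is shifted only one cell
        j = 0
        # Compare from left to right: grow the matched prefix of the pattern.
        while j < m and text[pos + j] == pattern[j]:
            j += 1
        if j == m:                        # the whole pattern matched the window
            occurrences.append(pos)
    return occurrences


text = "GTACTAGAGGACGTATGTACTG"
print(brute_force_search(text, "ATGTA"))  # [14]
```

The worst case compares every pattern character at every window, which is what the algorithms on the following slides try to avoid with longer shifts.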

Exact string matching: one pattern
How do matching algorithms perform the search?
A window slides along the text, and the pattern is compared against it. At each step the comparison is made and the window is shifted to the right.
What differentiates the algorithms?
How the comparison is made.
The length of the shift.

Exact string matching for one pattern
Experimental efficiency (Navarro & Raffinot)
[Figure: map of the most efficient algorithm (Horspool, BOM or BNDM) by alphabet size |Σ| (vertical axis, 2 to 64) and pattern length (horizontal axis, 2 to 256, with w marked).]
BNDM: Backward Nondeterministic Dawg Matching
BOM: Backward Oracle Matching

Horspool algorithm
How is the comparison made? Suffix search: the pattern is compared with the window from right to left.
Which is the next position of the window? Let a be the last character of the window in the text. Shift until the next occurrence of "a" in the pattern is aligned with it.
We need a preprocessing phase to construct the shift table.

Horspool algorithm: example
Given the pattern ATGTA, the shift table is:
A 4
C 5
G 2
T 1
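The preprocessing phase can be sketched as follows. The function name `horspool_shift_table` and the explicit `alphabet` argument are assumptions made for this illustration; the rule itself is the usual Horspool one (distance from the rightmost occurrence of each character, excluding the last pattern position, to the end of the pattern).

```python
def horspool_shift_table(pattern: str, alphabet: str) -> dict[str, int]:
    """Build the Horspool shift table for pattern over the given alphabet."""
    m = len(pattern)
    # A character that does not occur in the pattern allows a full shift of m.
    shift = {c: m for c in alphabet}
    # The last pattern character is excluded; later (rightmost) occurrences
    # overwrite earlier ones, so each entry ends up with the smallest shift.
    for i, c in enumerate(pattern[:-1]):
        shift[c] = m - 1 - i
    return shift


print(horspool_shift_table("ATGTA", "ACGT"))  # {'A': 4, 'C': 5, 'G': 2, 'T': 1}
```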

Horspool algorithm: example
Given the pattern ATGTA, the shift table is: A 4, C 5, G 2, T 1
The searching phase:
G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
          A T G T A
              A T G T A
                        A T G T A
                            A T G T A   (match)
                                    A T G T A
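The searching phase can be sketched as below, assuming the shift table is passed in ready-made (here the one from the slide). The function name `horspool_search` and the choice to return the visited window positions are illustrative only.

```python
def horspool_search(text: str, pattern: str, shift: dict[str, int]):
    """Return (window positions visited, occurrence positions), 0-based."""
    n, m = len(text), len(pattern)
    windows, occurrences = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        j = m - 1
        # Suffix search: compare the pattern with the window right to left.
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            occurrences.append(pos)
        # Shift by the table entry of the last character of the window;
        # characters missing from the table get the full shift m.
        pos += shift.get(text[pos + m - 1], m)
    return windows, occurrences


shift = {"A": 4, "C": 5, "G": 2, "T": 1}   # shift table of ATGTA from the slide
windows, occurrences = horspool_search("GTACTAGAGGACGTATGTACTG", "ATGTA", shift)
print(windows)      # [0, 1, 5, 7, 12, 14]
print(occurrences)  # [14]
```

The slide shows one more window because its text continues after the 22 characters used here.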

Some questions about the Horspool algorithm
Given the pattern ATGTA, the shift table is A 4, C 5, G 2, T 1.
Given a random text over an equally likely probability distribution (EPD):
1.- Determine the expected shift of the window. What if the PD is not equally likely?
2.- Determine the expected number of shifts assuming a text of length n.
3.- Determine the expected number of comparisons in the suffix search phase.
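For question 1, one way to reason is that the shift depends only on the last character of the window, so the expected shift is the probability-weighted average of the table entries. The sketch below works this out; the non-uniform distribution is a made-up example, not data from the slides.

```python
from fractions import Fraction


def expected_shift(shift: dict[str, int], probs: dict[str, Fraction]) -> Fraction:
    """Expected Horspool shift when the last window character follows probs."""
    return sum(probs[c] * shift[c] for c in probs)


shift = {"A": 4, "C": 5, "G": 2, "T": 1}

# Equally likely distribution: every character has probability 1/4.
uniform = {c: Fraction(1, 4) for c in "ACGT"}
print(expected_shift(shift, uniform))  # 3

# Hypothetical non-uniform distribution (an assumption for illustration).
skewed = {"A": Fraction(3, 10), "C": Fraction(2, 10),
          "G": Fraction(2, 10), "T": Fraction(3, 10)}
print(expected_shift(shift, skewed))   # 29/10
```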

BNDM algorithm
How is the comparison made? Search for suffixes of the text window T that are factors of the pattern P.
The set of matching positions is denoted as a bit vector, for instance D2 = 1 0 0 0 1 0 0.
Once the next character x is read, D3 = (D2 << 1) & B(x), where B(x) is the mask of x in the pattern P.
For instance, if B(x) = ( 0 0 1 1 0 0 0 ), then D3 = ( 0 0 0 1 0 0 0 ) & ( 0 0 1 1 0 0 0 ) = ( 0 0 0 1 0 0 0 ).
Which is the next position of the window? It depends on the value of the leftmost bit of D.
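The update rule on this slide can be checked with Python integers, assuming the leftmost cell of D is stored as the most significant of m bits. The helper name `bndm_step` is hypothetical.

```python
def bndm_step(d: int, mask: int, m: int) -> int:
    """One BNDM update: shift D left by one cell, keep m bits, AND with B(x)."""
    full = (1 << m) - 1          # m ones, used to drop the bit shifted out
    return ((d << 1) & full) & mask


d2 = 0b1000100                   # D2 from the slide
b_x = 0b0011000                  # B(x) from the slide
d3 = bndm_step(d2, b_x, 7)
print(format(d3, "07b"))         # 0001000
```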

BNDM algorithm: example
Given the pattern ATGTA, the masks of the characters are:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
The searching phase:
G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
          A T G T A
                    A T G T A
                            A T G T A
First window:
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
Second window:
D1 = ( 0 0 1 0 0 )
D2 = ( 0 1 0 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 0 0 0 )
Third window:
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
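The character masks can be computed with a few lines of Python. This is a sketch that assumes the leftmost cell of each mask corresponds to the first pattern character, as in the slide; `bndm_masks` and `bits` are names invented for the example.

```python
def bndm_masks(pattern: str, alphabet: str) -> dict[str, int]:
    """B(c) has a 1 in cell i (counted from the left) whenever pattern[i] == c."""
    m = len(pattern)
    masks = {c: 0 for c in alphabet}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - i))
    return masks


def bits(value: int, m: int) -> str:
    """Render an m-bit vector the way the slide writes it."""
    return "( " + " ".join(format(value, f"0{m}b")) + " )"


for c, mask in bndm_masks("ATGTA", "ACGT").items():
    print(f"B({c}) = {bits(mask, 5)}")
```

For ATGTA this prints the same four masks as the slide.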

BNDM algorithm: example of window shift
Given the pattern ATGTA, the masks of the characters are:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
The searching phase:
G T A C T A G A G G A C G T A T G T A C T G ...
                            A T G T A
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 )
D5 = ( 1 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D6 = ( 0 0 0 0 0 ) & ( * * * * * ) = ( 0 0 0 0 0 )
Found

BNDM algorithm: example
Given the pattern ATGTA, the masks of the characters are:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
How is the shift determined?
The searching phase:
G T A C T A G A A T A C G T A T G T A C T G ...
A T G T A
          A T G T A
                A T G T A
First window:
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
Second window:
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D3 = ( 0 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 0 0 0 0 0 )
In the second window the leftmost bit of D2 is 1 after reading two characters, so the suffix AT of the window is a prefix of the pattern and the window is shifted 5 - 2 = 3 cells. In the first window the leftmost bit is never 1, so the window is shifted 5 cells.
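Putting the masks, the update rule and the shift rule together gives the following sketch of a BNDM search. It assumes a non-empty pattern and stores D in a Python integer with the leftmost cell as the most significant bit; the name `bndm_search` and the returned list of window positions are illustrative choices.

```python
def bndm_search(text: str, pattern: str):
    """Return (window positions visited, occurrence positions), 0-based."""
    n, m = len(text), len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    full = (1 << m) - 1              # m ones
    top = 1 << (m - 1)               # the leftmost cell of D
    masks: dict[str, int] = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - i))

    windows, occurrences = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        j, last, d = m, m, full
        while d != 0:
            d &= masks.get(text[pos + j - 1], 0)   # read the window right to left
            j -= 1
            if d & top:              # the suffix read so far is a pattern prefix
                if j > 0:
                    last = j         # remember it: it bounds the next shift
                else:
                    occurrences.append(pos)
                    break
            d = (d << 1) & full
        pos += last
    return windows, occurrences


print(bndm_search("GTACTAGAATACGTATGTACTG", "ATGTA"))
# ([0, 5, 8, 13, 14], [14])
```

The first three windows (0, 5 and 8) are the ones drawn on the slide.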

Extended string matching
Classes of characters: some DNA files or patterns contain extra characters such as N or R, where N = {A,C,G,T} and R = {G,A}.
Bounded length gaps: patterns such as ATx(2,3)TA, where x(2,3) means any 2 or 3 characters.
Optional characters: patterns such as AC?ACT?T?A, where C? means that C may or may not appear in the text.
Wild cards: patterns such as AT*TA, where * means an arbitrarily long string.
Repeatable characters: patterns such as AT[TA]*AT, where [TA]* means that TA can appear zero or more times.
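To make the five kinds of extended pattern concrete, the sketch below rewrites each slide example as a regular expression for Python's standard `re` module. This is a swapped-in technique for illustration, not the bit-parallel algorithms of the course, and regex syntax differs from the slide notation (in a regex, `T*` would mean repeated T). The dictionary `CLASSES` and the function `class_pattern` are names assumed for the example.

```python
import re

# Only the two character classes named on the slide.
CLASSES = {"N": "[ACGT]", "R": "[GA]"}


def class_pattern(pattern: str) -> re.Pattern:
    """Translate a DNA pattern with class characters into a regex."""
    return re.compile("".join(CLASSES.get(c, c) for c in pattern))


GAP = re.compile(r"AT.{2,3}TA")         # ATx(2,3)TA: any 2 or 3 characters
OPTIONAL = re.compile(r"AC?ACT?T?A")    # AC?ACT?T?A: C and both T are optional
WILDCARD = re.compile(r"AT.*TA")        # AT*TA: an arbitrarily long string
REPEAT = re.compile(r"AT(?:TA)*AT")     # AT[TA]*AT: TA zero or more times

text = "GTACTAGAGGACGTATGTACTG"
print([m.start() for m in class_pattern("ATRTA").finditer(text)])  # [14]
print(bool(GAP.fullmatch("ATGGTA")), bool(GAP.fullmatch("ATGTA")))  # True False
```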

Factor Oracle automaton: properties
Factor Oracle of the word GTATGTA (figure).
All states are accepting states.
It recognizes all factors … but more. Which ones?
If a word is rejected, it isn't a factor, then ...

BOM algorithm (Backward Oracle Matching)
How is the comparison made? With an automaton, the Factor Oracle, which checks the window from right to left.
How many cells are shifted?
If there is no transition for the character a in the automaton, the window is shifted just past a.
If we reach the last state of the automaton with the character a, the whole window has been read and the pattern has been found.

BOM algorithm: example
How is the comparison made? The automaton of the reversed pattern is built: given the pattern ATGTATG, this is the Factor Oracle of GTATGTA.
And the search is:
G T A C T A G A A T G T G T A G A C A T G T A T G G G A...
A T G T A T G
            A T G T A T G
                A T G T A T G
                      A T G T A T G
                                    A T G T A T G   (match)
                                      A T G T A T G
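The search in this example can be reproduced with the sketch below. It includes a compact construction of the Factor Oracle of the reversed pattern (using supply links, the usual on-line construction) so that the block runs on its own; the name `bom_search` and the returned list of window positions are assumptions for illustration.

```python
def bom_search(text: str, pattern: str):
    """Return (window positions visited, occurrence positions), 0-based."""
    n, m = len(text), len(pattern)

    # Factor Oracle of the reversed pattern: trans[state][char] -> state.
    trans = [{} for _ in range(m + 1)]
    supply = [-1] * (m + 1)
    for i, c in enumerate(pattern[::-1], start=1):
        trans[i - 1][c] = i
        k = supply[i - 1]
        while k > -1 and c not in trans[k]:
            trans[k][c] = i          # extra transition into the new state
            k = supply[k]
        supply[i] = 0 if k == -1 else trans[k][c]

    windows, occurrences = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        state, j = 0, m
        # Read the window from right to left while the oracle accepts.
        while j > 0 and state is not None:
            state = trans[state].get(text[pos + j - 1])
            j -= 1
        if state is not None:        # all m characters were accepted: a match
            occurrences.append(pos)
        pos += j + 1                 # restart just after the rejected character
    return windows, occurrences


text = "GTACTAGAATGTGTAGACATGTATGGGA"
print(bom_search(text, "ATGTATG"))
# ([0, 6, 8, 11, 18, 19], [18])
```

The six window positions match the six alignments drawn on the slide. Reporting a match without further verification relies on the known property that the only word of length m accepted by the Factor Oracle of a word of length m is the word itself.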

Factor Oracle automaton
Given the pattern GTATA, in which state is each factor accepted?
After reading GTA, the accepted factors are G, GT, GTA, T, TA and A.
When the new T is read, 4 factors should be accepted (the suffixes of GTAT): GTAT, TAT, AT and T. How can they be reached?
When the new A is read, 5 factors should be accepted (the suffixes of GTATA): GTATA, TATA, ATA, TA and A. How can they be reached?

Factor Oracle automaton
[Figure: oracle states labelled with the factors G, GT, GTA, then GTAT TAT AT T, then GTATA TATA ATA TA A.]
When the new G is read, 6 factors should be accepted: GTATAG, TATAG, ATAG, TAG, AG and G.

Factor Oracle automaton: linear algorithm?

Factor Oracle automaton: algorithm
If there is a T transition ...

Factor Oracle automaton: algorithm
But if there isn't a T transition ...
… and recursively continue ...
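The two cases on these slides (a transition already exists, or it does not and the process continues recursively) correspond to the loop over supply links in the usual on-line construction of the Factor Oracle. The sketch below is one way to write it; `build_factor_oracle`, `accepts` and the `supply` array are names chosen for this example.

```python
def build_factor_oracle(word: str) -> list[dict[str, int]]:
    """Return trans, where trans[state][char] is the target state."""
    trans = [{} for _ in range(len(word) + 1)]
    supply = [-1] * (len(word) + 1)
    for i, c in enumerate(word, start=1):
        trans[i - 1][c] = i                  # transition along the word itself
        k = supply[i - 1]
        # While there isn't a c transition, add one and continue recursively.
        while k > -1 and c not in trans[k]:
            trans[k][c] = i
            k = supply[k]
        # If there is a c transition, follow it to set the new supply link.
        supply[i] = 0 if k == -1 else trans[k][c]
    return trans


def accepts(trans: list[dict[str, int]], word: str) -> bool:
    """All states are accepting: a word is rejected only if a transition is missing."""
    state = 0
    for c in word:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True


oracle = build_factor_oracle("GTATGTA")
print(accepts(oracle, "ATGT"))    # True: a factor of GTATGTA
print(accepts(oracle, "GTGTA"))   # True, although GTGTA is not a factor
print(accepts(oracle, "GG"))      # False
```

The second call shows that the oracle recognizes more than the factors, while the third shows that a rejected word is certainly not a factor.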