Factor Oracle, Suffix Oracle

Outline
Factor oracle definition
Construction methods
Suffix oracle
Factor oracle for a set of words
Applications in string matching

Data Structures that Represent the Factors of a String
Suffix trie: a tree representing all the suffixes of the string.
Suffix automaton (DAWG): the minimal automaton recognizing all the suffixes of the string.
Both the suffix automaton and the factor oracle can be obtained from the suffix trie.
Figure 1. Suffix trie of the string abbc.
Figure 2. Suffix automaton of the string abbc.
Figure 3. Factor oracle for the string abbc.

Factor Oracle: Basic Ideas
The factor oracle is a data structure for indexing all the factors of a given word. It is an automaton built on a string p that acts like an oracle on the factors of p. Every factor of p is accepted, but an accepted string is only possibly a factor of p. This is called weak factor recognition.

Factor Oracle: Example
Factor oracle for the string abbbaab. All states are considered final. The word abba is accepted although it is not a factor of abbbaab.
Figure 4. Factor oracle for abbbaab. Annotation: ba is a factor of baab, so a transition from 2 to 5 by a is added.
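To make weak factor recognition concrete, here is a minimal Python sketch that walks the oracle for abbbaab. The transition table is written out by hand from the sequential construction applied to abbbaab. That it matches Figure 4 is an assumption, since the figure is not reproduced in this transcript. The names ORACLE_ABBBAAB and accepts are illustrative only.

```python
# Transitions of the factor oracle for "abbbaab", written out by hand
# (assumed to match Figure 4). Key: state, value: {letter: target state}.
ORACLE_ABBBAAB = {
    0: {"a": 1, "b": 2},
    1: {"b": 2, "a": 6},
    2: {"b": 3, "a": 5},
    3: {"b": 4, "a": 5},
    4: {"a": 5},
    5: {"a": 6},
    6: {"b": 7},
    7: {},
}


def accepts(oracle, word):
    """Return True if word labels a path from state 0 (all states are final)."""
    state = 0
    for letter in word:
        state = oracle[state].get(letter)
        if state is None:
            return False
    return True


print(accepts(ORACLE_ABBBAAB, "bbaa"))  # a real factor of abbbaab
print(accepts(ORACLE_ABBBAAB, "abba"))  # accepted, yet not a factor
print("abba" in "abbbaab")              # exact substring check for comparison
```

The second and third calls disagree, which is exactly the false positive the slide describes.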

Factor Oracle: Formal Definition
Definition 1. The factor oracle of a string p is the automaton built by the algorithm Build_Oracle, where all the states are terminal.
Figure 5. High-level construction algorithm of Oracle(p). This high-level algorithm has quadratic time complexity.

Factor Oracle: Properties
1. It is an acyclic, homogeneous, deterministic automaton (homogeneous: all transitions entering a given state carry the same letter).
2. It recognizes at least the factors of p, the string it was built for.
3. It has the fewest states possible (for a string p of length m there are exactly m+1 states).
4. It has a linear number of transitions (the total number ranges between m and 2m-1).

Factor Oracle: Construction
In the sequential construction the letters of the word are read from left to right and the automaton is updated at each step. For the prefix of p of length i, we denote by repet(i) the longest suffix of that prefix that appears at least twice in it. We define a function S on the states of the automaton, called the supply function, that maps each state i > 0 of Oracle(p) to the state j where the reading of repet(i) ends.

Factor Oracle: Construction Algorithm
Build_Oracle_Sequential(p = p1 p2 ... pm)
  create initial state 0, set S(0) = -1
  for i = 1 to m do
    create new state i
    add a new transition from i-1 to i by pi
    set k = S(i-1)
    while k > -1 and there is no transition from k by pi do
      add new transition from k to i by pi
      set k = S(k)
    endwhile
    if k = -1 then set S(i) = 0
    else set S(i) = the state reached from k by pi
  endfor
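The sequential construction can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation. The function name build_oracle and the list-of-dicts representation are choices made for this sketch, and the supply function is stored as a plain list with the value -1 for state 0.

```python
def build_oracle(p):
    """Build the factor oracle of p; return (transitions, supply)."""
    trans = [{}]    # trans[i] maps a letter to a target state
    supply = [-1]   # supply function S, with S(0) = -1
    for i, letter in enumerate(p, start=1):
        trans.append({})              # create new state i
        trans[i - 1][letter] = i      # transition from i-1 to i
        k = supply[i - 1]
        # Follow the supply links, adding the missing transitions to i.
        while k > -1 and letter not in trans[k]:
            trans[k][letter] = i
            k = supply[k]
        supply.append(0 if k == -1 else trans[k][letter])
    return trans, supply


trans, supply = build_oracle("abbbaab")
print(supply)                                   # supply values for states 0..7
print(len(trans), sum(len(t) for t in trans))   # number of states, transitions
```

Each pass of the while loop adds one transition, so the running time is bounded by the number of letters plus the number of transitions created, and the latter is linear in m by the properties listed earlier.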

Construction of the Factor Oracle for the string abbbaab
Add a new transition from 0 to 2 by b.
No new transition is needed.

Construction of the Factor Oracle for the string abbbaab (continued)
Add new transitions from 3 and 2 to 5 by a.
Add a new transition from 1 to 6 by a.
No new transition is needed.

Suffix Oracle: Definition
We mark some states of the factor oracle for the string p as final in order to recognize the suffixes of p. The new structure is called the suffix oracle. A state q of the suffix oracle is terminal if and only if there is a path labeled by a suffix of p from the initial state leading to q. Terminal states are determined by following the supply function from state m of Oracle(p).
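Marking the terminal states amounts to a short walk along the supply links. The sketch below assumes the supply values for abbbaab, worked out by hand from the sequential construction (they are not listed in the slides). The names SUPPLY_ABBBAAB and terminal_states are hypothetical.

```python
# Supply values S(0..7) for p = "abbbaab", worked out by hand from the
# sequential construction (an assumption of this sketch, not from the slides).
SUPPLY_ABBBAAB = [-1, 0, 0, 2, 3, 1, 1, 2]


def terminal_states(supply):
    """Collect the states visited by following S from the last state m."""
    terminals = set()
    state = len(supply) - 1   # state m
    while state != -1:
        terminals.add(state)
        state = supply[state]
    return terminals


print(sorted(terminal_states(SUPPLY_ABBBAAB)))
```

Under these assumptions the walk visits states 7, 2 and 0, which are the states reached by reading the suffixes of abbbaab (including the empty suffix) from the initial state.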

Suffix Oracle: Example
The suffix oracle is slightly more complicated to implement than the factor oracle, and it requires more memory.
Figure 6. Suffix oracle for the string abbbaab. Double-circled states are terminal.

Factor Oracle for a Set of Words
The factor oracle can be extended to a set of words so that it recognizes at least all the factors of the words in the set. We fix an order on the words of the set so that the resulting oracle is uniquely defined. The oracle is built on a trie of all the words, which is updated in the same way as the factor oracle for one word. The supply function maps each state i of the oracle to the state j where the reading of the longest repeated suffix that appears in one of the words ends.

Factor Oracle for a Set of Words: Example
Figure 7. Trie for the set {abbba, baaa}.
Figure 8. Intermediate phase in the construction of the factor oracle for the set {abbba, baaa}.
Figure 9. Factor oracle for the set {abbba, baaa}.

Backward Oracle Matching Algorithm
BOM is a version of the BDM (Backward DAWG Matching) algorithm that uses the factor oracle instead of the suffix automaton.
It is fast in practice for very long patterns and small alphabets.
The preprocessing phase is linear in both time and space.
It is conjectured to be optimal on average.

BOM: Main Idea
The search uses the oracle of the reversed pattern. The current window of the text is read from right to left, and the reading stops when the word read is no longer recognized by the oracle (which shows that it is certainly not a factor of the reversed pattern). The search window is then shifted beyond the point where the search failed (safe shift).
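The search loop can be sketched in Python as below. This is a minimal sketch assuming a non-empty pattern. The names build_oracle and bom_search are illustrative, and the oracle is rebuilt here only so that the block is self-contained. Reporting a match when all m letters have been read relies on the property that the only word of length m accepted by the oracle of a string of length m is that string itself.

```python
def build_oracle(p):
    """Sequential factor oracle construction; returns the transition table."""
    trans, supply = [{}], [-1]
    for i, letter in enumerate(p, start=1):
        trans.append({})
        trans[i - 1][letter] = i
        k = supply[i - 1]
        while k > -1 and letter not in trans[k]:
            trans[k][letter] = i
            k = supply[k]
        supply.append(0 if k == -1 else trans[k][letter])
    return trans


def bom_search(pattern, text):
    """Return the start positions of pattern in text (pattern non-empty)."""
    m, n = len(pattern), len(text)
    oracle = build_oracle(pattern[::-1])   # oracle of the reversed pattern
    matches, pos = [], 0
    while pos <= n - m:
        state, j = 0, m
        # Read the window from right to left until the oracle rejects.
        while j > 0 and state is not None:
            state = oracle[state].get(text[pos + j - 1])
            j -= 1
        if state is not None:
            matches.append(pos)            # whole window read: occurrence
        pos += j + 1                       # shift past the failing letter
    return matches


print(bom_search("aab", "abbbaabaab"))
```

On a mismatch, j + 1 moves the window start just beyond the letter on which the oracle failed. After a full match j is 0, so the window advances by one position.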

BOM: Facts
The suffix oracle of the reversed pattern can be used instead of the factor oracle. The shifts are then longer, but more operations are needed.
The worst-case complexity of BOM is O(mn), where m is the length of the pattern and n the total length of the text.
Because the factor oracle accepts some words that are not factors of the pattern, in some cases the total number of inspections is greater than in BDM.
TurboBOM combines BOM with KMP to obtain an algorithm that is linear in the worst case.

Factor Oracle: Applications
Finding the repeats in a string
Data compression
Bioinformatics
Machine improvisation

Factor Oracle: Open Problems
What is an automaton-independent characterization of the language recognized by the oracle?
Figure 10. The factor oracle for the string abbb accepts exactly the factors of the string.
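For short strings, the language recognized by the oracle can be listed exhaustively and compared with the true set of factors. The Python sketch below does this for abbb and abbbaab. The helper names build_oracle, accepted_words and factors are illustrative, and the brute-force enumeration is only practical for small examples.

```python
def build_oracle(p):
    """Sequential factor oracle construction; returns the transition table."""
    trans, supply = [{}], [-1]
    for i, letter in enumerate(p, start=1):
        trans.append({})
        trans[i - 1][letter] = i
        k = supply[i - 1]
        while k > -1 and letter not in trans[k]:
            trans[k][letter] = i
            k = supply[k]
        supply.append(0 if k == -1 else trans[k][letter])
    return trans


def accepted_words(trans):
    """Enumerate every word labeling a path from state 0 (oracle is acyclic)."""
    words, stack = set(), [(0, "")]
    while stack:
        state, word = stack.pop()
        words.add(word)
        for letter, target in trans[state].items():
            stack.append((target, word + letter))
    return words


def factors(p):
    """All factors (substrings) of p, including the empty word."""
    return {p[i:j] for i in range(len(p) + 1) for j in range(i, len(p) + 1)}


print(accepted_words(build_oracle("abbb")) == factors("abbb"))
print(sorted(accepted_words(build_oracle("abbbaab")) - factors("abbbaab")))
```

The first line checks the claim of Figure 10 for abbb. The second lists the words accepted by the oracle of abbbaab that are not factors, among them abba.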

Factor Oracle: Open Problems (continued)
The factor oracle is not the minimal homogeneous automaton that recognizes at least the factors of the string.
Figure 11. The factor oracle for the string abcacdace has 8 extra transitions.
Figure 12. A similar automaton with 7 extra transitions.
