Pattern Matching Algorithms: An Overview Shoshana Neuburger The Graduate Center, CUNY 9/15/2009.

Presentation transcript:

Pattern Matching Algorithms: An Overview Shoshana Neuburger The Graduate Center, CUNY 9/15/2009

2 of 59 Overview Pattern Matching in 1D Dictionary Matching Pattern Matching in 2D Indexing – Suffix Tree – Suffix Array Research Directions

3 of 59 What is Pattern Matching? Given a pattern and text, find the pattern in the text.

4 of 59 What is Pattern Matching? Σ is an alphabet. Input: Text T = t_1 t_2 … t_n, Pattern P = p_1 p_2 … p_m. Output: All i such that t_i t_{i+1} … t_{i+m-1} = P.

5 of 59 Pattern Matching - Example Input: P = cagc, Σ = {a,g,c,t}, T = acagcatcagcagctagcat. Output: {2, 8, 11}

6 of 59 Pattern Matching Algorithms Naïve Approach – Compare pattern to text at each location. – O(mn) time. More efficient algorithms utilize information from previous comparisons.
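The naïve approach can be sketched in a few lines of Python. This is only an illustration; the function name `naive_match` is hypothetical, and positions are reported 1-based to agree with the example on the previous slide.

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Compare the pattern to the text at every location: O(mn) time."""
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):
        # The slice comparison costs up to m character comparisons.
        if text[i:i + m] == pattern:
            hits.append(i + 1)  # report 1-based positions, as on the slides
    return hits


print(naive_match("acagcatcagcagctagcat", "cagc"))  # [2, 8, 11]
```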

7 of 59 Pattern Matching Algorithms Linear time methods have two stages: 1. preprocess pattern in O(m) time and space; 2. scan text in O(n) time and space. Knuth, Morris, Pratt (1977): automata method. Boyer, Moore (1977): can be sublinear.

8 of 59 KMP Automaton P = ababcb
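The automaton drawn on this slide is not preserved in the transcript. As a rough stand-in, the sketch below uses the common failure-table formulation of KMP rather than an explicit automaton; the names `failure_table` and `kmp_search` are hypothetical, and a non-empty pattern is assumed.

```python
def failure_table(pattern: str) -> list[int]:
    """fail[i] = length of the longest proper border of pattern[:i+1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]          # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text: str, pattern: str) -> list[int]:
    """Scan the text once; return 1-based start positions of all matches."""
    fail = failure_table(pattern)
    hits, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - len(pattern) + 2)
            k = fail[k - 1]          # keep going to find overlapping matches
    return hits


print(failure_table("ababcb"))                       # [0, 0, 1, 2, 0, 0]
print(kmp_search("acagcatcagcagctagcat", "cagc"))    # [2, 8, 11]
```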

9 of 59 Dictionary Matching Σ is an alphabet. Input: Text T = t_1 t_2 … t_n, Dictionary of patterns D = {P_1, P_2, …, P_k}. All characters in patterns and text belong to Σ. Output: All i, j such that t_i t_{i+1} … t_{i+m_j-1} = P_j, where m_j = |P_j|.

10 of 59 Dictionary Matching Algorithms Naïve Approach: – Use an efficient pattern matching algorithm for each pattern in the dictionary. – O(kn) time. More efficient algorithms process text once.

11 of 59 AC Automaton Aho and Corasick extended the KMP automaton to dictionary matching Preprocessing time: O(d) Matching time: O(n log |Σ| +k). Independent of dictionary size!

12 of 59 AC Automaton D = {ab, ba, bab, babb, bb}
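For the dictionary on this slide, a small Aho-Corasick automaton might be built as follows. This is a sketch under simplifying assumptions: transitions are stored in Python dicts (so the alphabet-dependent branching cost is hidden in hashing), patterns are assumed non-empty, and the names `build_ac` and `ac_search` are hypothetical.

```python
from collections import deque


def build_ac(patterns):
    """Build goto, failure and output tables for a set of patterns."""
    goto, fail, out = [{}], [0], [[]]
    for idx, p in enumerate(patterns):
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(idx)
    # Breadth-first traversal so shallower failure targets are ready first.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, nxt in goto[s].items():
            queue.append(nxt)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def ac_search(text, patterns):
    """Scan the text once; return (1-based start, pattern) pairs."""
    goto, fail, out = build_ac(patterns)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for idx in out[s]:
            hits.append((i - len(patterns[idx]) + 2, patterns[idx]))
    return hits


D = ["ab", "ba", "bab", "babb", "bb"]
print(sorted(ac_search("babb", D)))
# [(1, 'ba'), (1, 'bab'), (1, 'babb'), (2, 'ab'), (3, 'bb')]
```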

13 of 59 Dictionary Matching KMP automaton does not depend on alphabet size while AC automaton does – branching. Dori, Landau (2006): AC automaton is built in linear time for integer alphabets. Breslauer (1995) eliminates log factor in text scanning stage.

14 of 59 Periodicity A crucial task in preprocessing stage of most pattern matching algorithms: computing periodicity. Many forms – failure table – witnesses

15 of 59 Periodicity A periodic pattern can be superimposed on itself without mismatch before its midpoint. Why is periodicity useful? Can quickly eliminate many candidates for pattern occurrence.

16 of 59 Periodicity Definition: S is periodic if S = and is a proper suffix of. S is periodic if its longest prefix that is also a suffix is at least half |S|. The shortest period corresponds to the longest border.

17 of 59 Periodicity - Example S = abcabcabcab, |S| = 11. Longest border of S: b = abcabcab; |b| = 8, so S is periodic. Shortest period of S: abc, of length 3, so S is periodic.
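The numbers in this example can be checked with a short script. It is a sketch that assumes the border-based definition from the previous slide (S is periodic when its longest border is at least half of |S|); the function names are hypothetical.

```python
def longest_border(s: str) -> int:
    """Length of the longest proper prefix of s that is also a suffix."""
    fail = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    return fail[-1] if s else 0


def shortest_period(s: str) -> int:
    """The shortest period corresponds to the longest border."""
    return len(s) - longest_border(s)


def is_periodic(s: str) -> bool:
    return 2 * longest_border(s) >= len(s)


S = "abcabcabcab"
print(longest_border(S), shortest_period(S), is_periodic(S))  # 8 3 True
```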

18 of 59 Witnesses Popular paradigm in pattern matching: 1. find consistent candidates; 2. verify candidates. Consistent candidates → verification is linear.

19 of 59 Witnesses Vishkin introduced the duel to choose between two candidates by checking the value of a witness. Alphabet-independent method.

20 of 59 Witnesses Preprocess pattern: Compute witness for each location of self-overlap. Size of witness table:, if P is periodic,, otherwise.

21 of 59 Witnesses WIT[i] = any k such that P[k] ≠ P[k-i+1]. WIT[i] = 0, if there is no such k. k is a witness against i being a period of P. (Example figure: a pattern and its witness table.)
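The definition above can be turned into a direct, quadratic-time computation. This sketch is not the linear-time preprocessing the talk refers to; it fills an entry for every i from 1 to m, picks the smallest witness, and uses a hypothetical function name and a sample pattern that does not come from the slides.

```python
def witness_table(p: str) -> list[int]:
    """wit[i] = smallest 1-based k with P[k] != P[k-i+1], or 0 if none.

    wit[0] is unused so that indices match the 1-based slide notation.
    """
    m = len(p)
    wit = [0] * (m + 1)
    for i in range(1, m + 1):
        for k in range(i, m + 1):
            # 1-based P[k] is p[k-1]; 1-based P[k-i+1] is p[k-i].
            if p[k - 1] != p[k - i]:
                wit[i] = k
                break
    return wit


print(witness_table("abab"))  # [0, 0, 2, 0, 4]
```

In this sample, WIT[3] = 0 because shifting "abab" by two positions produces no mismatch, while WIT[2] = 2 records that P[2] = b differs from P[1] = a.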

22 of 59 Witnesses Let j > i. Candidates i and j are consistent if (a) they are sufficiently far from each other, OR (b) WIT[j-i] = 0.

23 of 59 Duel Scan text: If pair of candidates is close and inconsistent, perform duel to eliminate one (or both). Sufficient to identify pairwise consistent candidates: transitivity of consistent positions. (Figure: pattern P aligned with text T at candidate positions i and j, with the witness position marked.)

24 of 59 2D Pattern Matching Σ is an alphabet. Input: Text T[1…n, 1…n], Pattern P[1…m, 1…m]. Output: All (i, j) such that T[i…i+m-1, j…j+m-1] = P. Example application: MRI.

25 of 59 2D Pattern Matching - Example Input: Σ = {A,B}
Pattern:
ABA
ABA
AAB
Text:
AABABAA
BABABAB
AABAABB
BAABAAA
ABABAAA
BBAABAB
BBBABAB
Output: {(1,4), (2,2), (4,3)}

26 of 59 Bird / Baker First linear-time 2D pattern matching algorithm. View each pattern row as a metacharacter to linearize problem. Convert 2D pattern matching to 1D.

27 of 59 Bird / Baker Preprocess pattern: Name rows of pattern using AC automaton. Using names, pattern has 1D representation. Construct KMP automaton of pattern. Identical rows receive identical names.

28 of 59 Bird / Baker Scan text: Name positions of text that match a row of pattern, using AC automaton within each row. Run KMP on named columns of text. Since the 1D names are unique, only one name can be given to a text location.

29 of 59 Bird / Baker - Example Preprocess pattern: Name rows of pattern using AC automaton. Using names, pattern has 1D representation. Construct KMP automaton of pattern. Pattern rows and their names: ABA → 1, ABA → 1, AAB → 2, so the 1D representation of the pattern is 1 1 2.

30 of 59 Bird / Baker - Example Scan text: Name positions of text that match a row of pattern, using AC automaton within each row. Run KMP on named columns of text.
Text:
AABABAA
BABABAB
AABAABB
BAABAAA
ABABAAA
BBAABAB
BBBABAB
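The row-naming idea can be illustrated on the example pattern and text. Note that this is a deliberately simplified sketch and not the linear-time algorithm: a dictionary lookup on each length-m substring stands in for the AC automaton, a naïve scan of each named column stands in for KMP, and names are attached to the position where a row starts rather than where it ends. The function name `match_2d` is hypothetical.

```python
def match_2d(text: list[str], pattern: list[str]) -> list[tuple[int, int]]:
    m = len(pattern)          # number of pattern rows
    w = len(pattern[0])       # pattern width
    # Name pattern rows; identical rows receive identical names.
    names: dict[str, int] = {}
    pat_names = [names.setdefault(row, len(names) + 1) for row in pattern]
    # Name each text position by the pattern row starting there (0 = none).
    named = [[names.get(row[c:c + w], 0) for c in range(len(row) - w + 1)]
             for row in text]
    hits = []
    for c in range(len(named[0])):
        column = [named_row[c] for named_row in named]
        # 1D matching of the named pattern against the named column.
        for r in range(len(column) - m + 1):
            if column[r:r + m] == pat_names:
                hits.append((r + 1, c + 1))   # 1-based (row, column)
    return sorted(hits)


pattern = ["ABA", "ABA", "AAB"]
text = ["AABABAA", "BABABAB", "AABAABB", "BAABAAA",
        "ABABAAA", "BBAABAB", "BBBABAB"]
print(match_2d(text, pattern))  # [(1, 4), (2, 2), (4, 3)]
```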

31 of 59 Bird / Baker Complexity of Bird / Baker algorithm: time and space. Alphabet-dependent. Real-time since scans text characters once. Can be used for dictionary matching: replace KMP with AC automaton.

32 of 59 2D Witnesses Amir et al. – 2D witness table can be used for linear time and space alphabet-independent 2D matching. The order of duels is significant. Duels are performed in 2 waves over text.

33 of 59 Indexing Index text – Suffix Tree – Suffix Array Find pattern in O(m) time Useful paradigm when text will be searched for several patterns.

34 of 59 Suffix Trie T = banana$
Suffixes: suf 1 = banana$, suf 2 = anana$, suf 3 = nana$, suf 4 = ana$, suf 5 = na$, suf 6 = a$, suf 7 = $
One leaf per suffix. An edge represents one character. Concatenation of edge-labels on the path from the root to leaf i spells the suffix that starts at position i.
(Figure: suffix trie of banana$ with leaves suf 1 to suf 7.)
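A suffix trie for T = banana$ can be sketched with nested dictionaries, one dictionary per node and one character per edge. The `"leaf"` key and the function names are illustrative choices, not part of the talk; construction here takes quadratic time and space.

```python
def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of text; each leaf records its 1-based start."""
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})   # one character per edge
        node["leaf"] = start + 1
    return root


def contains(trie: dict, pattern: str) -> bool:
    """A pattern occurs in the text iff it labels a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def count_leaves(node: dict) -> int:
    own = 1 if "leaf" in node else 0
    return own + sum(count_leaves(child)
                     for key, child in node.items() if key != "leaf")


trie = build_suffix_trie("banana$")
print(contains(trie, "nan"), contains(trie, "nab"), count_leaves(trie))
# True False 7
```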

35 of 59 Suffix Tree T = banana$
Suffixes: suf 1 = banana$, suf 2 = anana$, suf 3 = nana$, suf 4 = ana$, suf 5 = na$, suf 6 = a$, suf 7 = $
Compact representation of trie. A node with one child is merged with its parent. Up to n internal nodes. O(n) space by using indices to label edges, e.g. [1,7] = banana$, [2,2] = a, [3,4] = na, [5,7] = na$, [7,7] = $.
(Figure: suffix tree of banana$ with edges labeled by substrings and by index pairs.)

36 of 59 Suffix Tree Construction Naïve Approach: O(n^2) time. Linear-time algorithms:
Author | Date | Innovation | Scan Direction
Weiner | 1973 | First linear-time algorithm, alphabet-dependent suffix links | Right to left
McCreight | 1976 | Alphabet-independent suffix links, more efficient | Left to right
Ukkonen | 1995 | Online linear-time construction, represents current end | Left to right
Amir and Nor | 2008 | Real-time construction | Left to right

37 of 59 Suffix Tree Construction Linear-time suffix tree construction algorithms rely on suffix links to facilitate traversal of tree. A suffix link is a pointer from a node labeled xS to a node labeled S; x is a character and S a possibly empty substring. Alphabet-dependent suffix links point from a node labeled S to a node labeled xS, for each character x.

38 of 59 Index of Patterns Can answer Lowest Common Ancestor (LCA) queries in constant time if preprocess tree accordingly. In suffix tree, LCA corresponds to Longest Common Prefix (LCP) of strings represented by leaves.

39 of 59 Index of Patterns To index several patterns: (a) Concatenate patterns with unique characters separating them and build suffix tree. Problem: inserts meaningless suffixes that span several patterns. OR (b) Build generalized suffix tree: single structure for suffixes of individual patterns. Can be constructed with Ukkonen’s algorithm.

40 of 59 Suffix Array The Suffix Array stores lexicographic order of suffixes. More space efficient than suffix tree. Can locate all occurrences of a substring by binary search. With Longest Common Prefix (LCP) array can perform even more efficient searches. LCP array stores longest common prefix between two adjacent suffixes in suffix array.

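41 of 59 Suffix Array T = mississippi. Sort suffixes alphabetically:
Index | Suffix | Sorted index | Sorted suffix | LCP
1 | mississippi | 11 | i | 0
2 | ississippi | 8 | ippi | 1
3 | ssissippi | 5 | issippi | 1
4 | sissippi | 2 | ississippi | 4
5 | issippi | 1 | mississippi | 0
6 | ssippi | 10 | pi | 0
7 | sippi | 9 | ppi | 1
8 | ippi | 7 | sippi | 0
9 | ppi | 4 | sissippi | 2
10 | pi | 6 | ssippi | 1
11 | i | 3 | ssissippi | 3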
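The table above can be reproduced with a naïve construction: sort the suffixes directly, then compare adjacent suffixes character by character. This is a sketch for checking the example, not one of the efficient construction algorithms discussed later; the function names are hypothetical and indices are 1-based to match the slide.

```python
def suffix_array(text: str) -> list[int]:
    """1-based start positions of the suffixes in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def lcp_array(text: str, sa: list[int]) -> list[int]:
    """lcp[j] = longest common prefix of the suffixes at sa[j-1] and sa[j]."""
    def common_prefix(a: str, b: str) -> int:
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n

    return [0] + [common_prefix(text[sa[j - 1] - 1:], text[sa[j] - 1:])
                  for j in range(1, len(sa))]


T = "mississippi"
sa = suffix_array(T)
print(sa)                # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(lcp_array(T, sa))  # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
```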

42 of 59 Suffix array T = mississippi Index Suffix LCP

43 of 59 Search in Suffix Array O(m log n). Idea: two binary searches: (1) search for leftmost position of X; (2) search for rightmost position of X. In between are all suffixes that begin with X. With LCP array: O(m + log n) search.
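The two binary searches might look like the following sketch, which covers only the O(m log n) variant without the LCP array. The suffix array is built naïvely here so the example is self-contained; `sa_search` is a hypothetical name and positions are 1-based.

```python
def sa_search(text: str, sa: list[int], x: str) -> list[int]:
    """Return the sorted 1-based positions of all occurrences of x."""
    m = len(x)

    def prefix(j: int) -> str:
        # First m characters of the suffix ranked j in the suffix array.
        start = sa[j] - 1
        return text[start:start + m]

    # First binary search: leftmost suffix whose m-prefix is >= x.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < x:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    # Second binary search: first suffix whose m-prefix is > x.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= x:
            lo = mid + 1
        else:
            hi = mid
    # Everything in between begins with x.
    return sorted(sa[left:lo])


T = "mississippi"
sa = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])
print(sa_search(T, sa, "ssi"))   # [3, 6]
print(sa_search(T, sa, "issi"))  # [2, 5]
```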

44 of 59 Suffix Array Construction Naïve Approach: O(n^2) time. Indirect Construction: preorder traversal of suffix tree; LCA queries for LCP. Problem: does not achieve better space efficiency.

45 of 59 Suffix Array Construction Direct construction algorithms:
Author | Date | Complexity | Innovation
Manber, Myers | 1993 | O(n log n) | Sort and search, KMR renaming
Kärkkäinen and Sanders | 2003 | O(n) | Linear-time
Ko and Aluru | 2003 | O(n) | Linear-time
Kim et al. | 2003 | O(n) | Linear-time
LCP array construction: range-minima queries.

46 of 59 Compressed Indices Suffix Tree: O(n) words = O(n log n) bits Compressed suffix tree Grossi and Vitter (2000) – O(n) space. Sadakane (2007) – O(n log |Σ|) space. – Supports all suffix tree operations efficiently. – Slowdown of only polylog(n).

47 of 59 Compressed Indices Suffix array is an array of n indices, which is stored in: O(n) words = O(n log n) bits. Compressed Suffix Array (CSA): Grossi and Vitter (2000): O(n log |Σ|) bits; access time increased from O(1) to O(log^ε n). Sadakane (2003): pattern matching as efficient as in uncompressed SA; O(n log H_0) bits; compressed self-index.

48 of 59 Compressed Indices FM-index: Ferragina and Manzini (2005). Self-indexing data structure. First compressed suffix array that respects the high-order empirical entropy. Size relative to compressed text length. Improved by Navarro and Mäkinen (2007).

49 of 59 Dynamic Suffix Tree Choi and Lam (1997) Strings can be inserted or deleted efficiently. Update time proportional to string inserted/deleted. No edges labeled by a deleted string. Two-way pointer for each edge, which can be done in space linear in the size of the tree.

50 of 59 Dynamic Suffix Array Recent work by Salson et al. Can update suffix array after construction if text changes. More efficient than rebuilding suffix array. Open problems: – Worst case O(n log n). – No online algorithm yet.

51 of 59 Word-Based Index Text size n contains k distinct words Index a subset of positions that correspond to word beginnings With O(n) working space can index entire text and discard unnecessary positions. Desired complexity – O(k) space. – will always need O(n) time. Problem: missing suffix links.

52 of 59 Word-Based Suffix Tree Construction Algorithms:
Author | Date | Results
Kärkkäinen and Ukkonen | 1996 | O(n) time and O(n/j) space construction of sparse suffix tree (every j-th suffix)
Andersson et al. | 1999 | Expected linear-time and k-space construction of word-based suffix tree for k words
Inenaga and Takeda | 2006 | Online, O(n) time and k-space construction of word-based suffix tree for k words

53 of 59 Word-Based Suffix Array Ferragina and Fischer (2007) – word-based suffix array construction algorithm. Time and space optimal construction. Computation of word-based LCP array in O(n) time and O(k) space. Alternative algorithm for construction of word-based suffix tree. Searching as efficient as ordinary suffix array.

54 of 59 Research Directions Problems we are considering: Small space dictionary matching. Time-space optimal 2D compressed dictionary matching algorithm. Compressed parameterized matching. Self-indexing word-based data structure. Dynamic suffix array in O(n) construction time.

55 of 59 Small-Space Applications arise in which storage space is limited. Many innovative algorithms exist for single pattern matching using small additional space: – Galil and Seiferas (1981) developed first time-space optimal algorithm for pattern matching. – Rytter (2003) adapted the KMP algorithm to work in O(1) additional space, O(n) time.

56 of 59 Research Directions Fast dictionary matching algorithms exist for 1D and 2D. Achieve expected sublinear time. No deterministic dictionary matching method that works in linear time and small space. We believe that recent results in compressed self-indexing will facilitate the development of a solution to the small space dictionary matching problem.

57 of 59 Compressed Matching Data is compressed to save space. Lossless compression schemes can be reversed without loss of data. Pattern matching cannot be done in compressed text – pattern can span a compressed character. LZ78: data can be uncompressed in time and space proportional to the uncompressed data.
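For readers unfamiliar with LZ78, the sketch below shows the basic phrase-based scheme: each output pair is (index of a previously seen phrase, next character), and decompression rebuilds the phrases in one pass. It is a textbook-style illustration with hypothetical function names, not the compressed matching algorithm discussed in the talk.

```python
def lz78_compress(s: str) -> list[tuple[int, str]]:
    """Encode s as (phrase index, next character) pairs; index 0 is empty."""
    dictionary = {"": 0}
    out, w = [], ""
    for ch in s:
        if w + ch in dictionary:
            w += ch                      # keep extending the current phrase
        else:
            out.append((dictionary[w], ch))
            dictionary[w + ch] = len(dictionary)
            w = ""
    if w:
        # Leftover input equals an existing phrase; emit it as prefix + char.
        out.append((dictionary[w[:-1]], w[-1]))
    return out


def lz78_decompress(pairs: list[tuple[int, str]]) -> str:
    """Rebuild the text in time proportional to the uncompressed data."""
    phrases = [""]
    out = []
    for idx, ch in pairs:
        phrase = phrases[idx] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)


pairs = lz78_compress("abababa")
print(pairs)                   # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a')]
print(lz78_decompress(pairs))  # abababa
```

The third pair (1, 'b') stands for the phrase "ab", which shows how a single compressed unit can cover several text characters, so a pattern occurrence may begin or end inside one.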

58 of 59 Research Directions Amir et al. (2003) devised an algorithm for 2D LZ78 compressed matching. They define strongly inplace as a criterion for the algorithm: that the extra space is proportional to the optimal compression of all strings of the given length. We are seeking a time-space optimal solution to 2D compressed dictionary matching.

59 of 59 Thank you!