Suffix Trees, Suffix Arrays and Suffix Trays
Richard Cole, Tsvi Kopelowitz, Moshe Lewenstein


Indexing problem
Input: text T = t_1, …, t_n (preprocessed into a data structure DS).
Queries: pattern P = p_1, …, p_m (answered using DS).
[Figure: example text T]

Suffix Property
P appears at location i of T iff P is a prefix of the suffix T_i.
[Figure: example text T with the suffix T_14 marked]
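The suffix property can be checked directly with a brute-force scan. The following is a minimal sketch, not part of the original slides; the function name is hypothetical and positions are 1-based to match the T_i notation.

```python
def occurrences_by_suffix_property(text: str, pattern: str) -> list[int]:
    """Return every 1-based position i such that pattern is a prefix of T_i."""
    # text.startswith(pattern, i) tests whether the suffix starting at
    # 0-based offset i begins with pattern, without copying the suffix.
    return [i + 1 for i in range(len(text)) if text.startswith(pattern, i)]


positions = occurrences_by_suffix_property("mississippi", "ssi")
print(positions)
```

This takes O(|T| |P|) time per query, which is the cost the index structures in the following slides are designed to avoid.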

Suffix Tree
A suffix tree for a string S is a compressed trie of all suffixes of S.
Example: S = abab$, with suffixes { $, b$, ab$, bab$, abab$ }.
[Figure: suffix tree of abab$ with edge labels written out as strings]

Suffix Tree
The size of the suffix tree of S is O(|S|).
Example: S = abab$.
[Figure: the suffix tree of abab$ with each edge label replaced by an index interval into S, such as [2,4] or [4,4], so that every edge takes O(1) space]

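Indexing and Suffix Trees
Navigate from the root, matching P along the edge labels (use the suffix property); the occurrences of P are the leaves below the point where the match ends.
Example: P = ssi.
Time: O(|P| + occ) for a constant-size alphabet, and O(|P| log|Σ| + occ) in general, because choosing the outgoing edge at a node costs O(log|Σ|).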

Suffix Trees
Weiner 1973 (linear-time construction!)
McCreight 1976 (space efficient)
Ukkonen 1995 (online)
Farach 1997 (polynomial-range alphabets)

Suffix Array
All suffixes, by position (POS):
S_1 mississippi
S_2 ississippi
S_3 ssissippi
S_4 sissippi
S_5 issippi
S_6 ssippi
S_7 sippi
S_8 ippi
S_9 ppi
S_10 pi
S_11 i
Sorted suffixes:
S_11 i
S_8 ippi
S_5 issippi
S_2 ississippi
S_1 mississippi
S_10 pi
S_9 ppi
S_7 sippi
S_4 sissippi
S_6 ssippi
S_3 ssissippi
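The sorted order of the suffixes can be reproduced by sorting the start positions directly. This is a deliberately naive sketch (the function name is hypothetical); it materializes and compares whole suffixes, so it is far slower than the O(n log n) and O(n) constructions cited in the slides.

```python
def suffix_array(s: str) -> list[int]:
    """1-based start positions of the suffixes of s in lexicographic order."""
    # Each key is the suffix itself, so Python's string comparison
    # yields the lexicographic order of the suffixes.
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


sa = suffix_array("mississippi")
print(sa)
```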

Suffix Array
S = mississippi
SA(S) = 11 8 5 2 1 10 9 7 4 6 3
P = pi

Suffix Array
S = mississippi, P = pi. Binary search for P over SA(S); each step compares P against one suffix in O(|P|) time.
Time: O(|P| log |S|)
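The O(|P| log |S|) search can be sketched as two binary searches that bracket the block of suffixes starting with P. The helper names below are hypothetical, the suffix array is built naively for brevity, and this is the plain binary search, not the O(|P| + log |S|) refinement of Manber and Myers.

```python
def suffix_array(s: str) -> list[int]:
    """Naive construction: 1-based suffix start positions in sorted order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def sa_search(s: str, sa: list[int], p: str) -> list[int]:
    """Return the sorted 1-based positions where p occurs in s."""

    def first_index(strictly_greater: bool) -> int:
        # Smallest index whose suffix, truncated to len(p) characters,
        # is >= p (or > p when strictly_greater is True).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            start = sa[mid] - 1
            prefix = s[start:start + len(p)]  # O(|P|) work per step
            if prefix < p or (strictly_greater and prefix == p):
                lo = mid + 1
            else:
                hi = mid
        return lo

    left = first_index(False)
    right = first_index(True)
    return sorted(sa[left:right])


text = "mississippi"
sa = suffix_array(text)
print(sa_search(text, sa, "pi"))
print(sa_search(text, sa, "ssi"))
```

Truncating each suffix to |P| characters keeps the comparison order monotone across the array, which is what makes both binary searches valid.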

Suffix Array
Introduced by Manber and Myers (1993) and, as PAT arrays, by Gonnet, Baeza-Yates and Snider (1992).
Manber and Myers (1993): query time O(|P| + log |S|).

Suffix Array Construction
Manber and Myers (1993): O(n log n).
Kärkkäinen and Sanders (2003): O(n) for polynomial-range alphabets.
Two other papers as well.

End of Story?
No. Lots of questions:
1. Construction time of suffix trees.
2. Query time.
3. Compressed indexing structures.
4. Indexing with errors.
5. Real-time suffix tree construction.

Query Time for Large Alphabets
Suffix trees: O(|P| log|Σ|) (deterministic)
Suffix arrays: O(|P| + log |T|)
Suffix trays: O(|P| + log|Σ|) for alphabets {1, …, |Σ|}

Query Time for Large Alphabets
Actually, it is easy to answer queries in O(|P|) time: store an array of length |Σ| at every node of the suffix tree, indexed by the first character of each outgoing edge. Navigation at every node is then O(1).
However, the time and space for constructing the suffix tree become O(n|Σ|).
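The trade-off between O(1) navigation and |Σ| table slots per node can be illustrated on a small scale. The sketch below makes two simplifying assumptions that are not in the slides: it uses an uncompressed suffix trie rather than a suffix tree (so the node count is not O(n)), and symbols are coded 0, …, |Σ|-1 rather than 1, …, |Σ|. All names are hypothetical.

```python
class ArrayNode:
    """Trie node whose children live in a table with one slot per symbol."""

    def __init__(self, sigma: int) -> None:
        self.children = [None] * sigma


def build_suffix_trie(text: list[int], sigma: int) -> tuple[ArrayNode, int]:
    """Insert every suffix of text; return the root and the node count."""
    root = ArrayNode(sigma)
    node_count = 1
    for start in range(len(text)):
        node = root
        for symbol in text[start:]:
            if node.children[symbol] is None:
                node.children[symbol] = ArrayNode(sigma)
                node_count += 1
            node = node.children[symbol]
    return root, node_count


def contains(root: ArrayNode, pattern: list[int]) -> bool:
    """Follow one table lookup per pattern symbol: O(|P|) overall."""
    node = root
    for symbol in pattern:
        node = node.children[symbol]
        if node is None:
            return False
    return True


SIGMA = 2
text = [0, 1, 0, 1]  # "abab" with a -> 0, b -> 1
root, node_count = build_suffix_trie(text, SIGMA)
table_slots = node_count * SIGMA  # every node pays for a full table
print(node_count, table_slots, contains(root, [1, 0]), contains(root, [1, 1]))
```

Each step of `contains` is a single array index, which is the O(1) navigation; the price is that every node allocates |Σ| slots whether or not it uses them.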

Suffix Tree – Suffix Array connection
The left-to-right order of the leaves (suffixes) in the suffix tree, with the children of every node sorted by first character, is exactly the suffix array.

Suffix Array
All suffixes, by position (POS):
S_1 mississippi$
S_2 ississippi$
S_3 ssissippi$
S_4 sissippi$
S_5 issippi$
S_6 ssippi$
S_7 sippi$
S_8 ippi$
S_9 ppi$
S_10 pi$
S_11 i$
S_12 $
Sorted suffixes (with $ ordered after every letter):
S_8 ippi$
S_5 issippi$
S_2 ississippi$
S_11 i$
S_1 mississippi$
S_10 pi$
S_9 ppi$
S_7 sippi$
S_4 sissippi$
S_6 ssippi$
S_3 ssissippi$
S_12 $

Example: mississippi$
[Figure: suffix tree and suffix array SA of mississippi$]

Suffix Tree – Suffix Array connection
We use this connection as follows: every node in the suffix tree corresponds to an interval in the suffix array, namely the suffixes at the leaves of its subtree.
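The node-to-interval correspondence can be made concrete by splitting a suffix array interval according to the next character, which mirrors stepping from a suffix tree node to its children. This is an illustrative sketch with hypothetical names; it scans the interval linearly, uses 0-based half-open intervals, and omits the $ terminator, so a suffix that ends at the node is reported under the empty string.

```python
def suffix_array(s: str) -> list[int]:
    """Naive construction: 1-based suffix start positions in sorted order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def child_intervals(s: str, sa: list[int], lo: int, hi: int, depth: int) -> dict:
    """Split sa[lo:hi], whose suffixes agree on their first `depth`
    characters, into one half-open interval per next character."""
    intervals = {}
    for idx in range(lo, hi):
        pos = sa[idx] - 1 + depth
        ch = s[pos] if pos < len(s) else ""  # "" marks a suffix ending here
        begin, _ = intervals.get(ch, (idx, idx))
        # Equal next characters are adjacent because sa is sorted, so
        # extending the right end keeps each interval contiguous.
        intervals[ch] = (begin, idx + 1)
    return intervals


text = "mississippi"
sa = suffix_array(text)
root_children = child_intervals(text, sa, 0, len(sa), 0)
lo_i, hi_i = root_children["i"]
node_i_children = child_intervals(text, sa, lo_i, hi_i, 1)
print(root_children)
print(node_i_children)
```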

Suffix Tree – Suffix Array connection
Moreover, the time to search for P in the suffix array when restricted to an interval I is O(|P| + log |I|).

Suffix Tree – Suffix Array connection
Definition: a |Σ|-leaf is a node that (1) has at least |Σ| leaves in its subtree, and (2) has no child with at least |Σ| leaves in its subtree.
The number of leaves in the subtree of a |Σ|-leaf is O(|Σ|^2). Why? It has at most |Σ| children, each with fewer than |Σ| leaves in its subtree.
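The definition can be tried out on a small input. The sketch below is an assumption-laden illustration, not the construction from the talk: it builds an uncompressed suffix trie in quadratic time, requires the text to end in a unique terminator, takes the threshold `sigma` (assumed to be at least 2) as a parameter, and uses recursion that is only suitable for short strings. In the trie a node with a single child has the same leaf count as that child, so it can never satisfy both conditions; the nodes reported are therefore branching nodes, which are also nodes of the suffix tree.

```python
def build_suffix_trie(s: str) -> dict:
    """Nested-dict trie of all suffixes of s (s ends with a unique '$')."""
    root: dict = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root


def sigma_leaves(s: str, sigma: int) -> list[tuple[str, int]]:
    """Return (path label, leaf count) for every sigma-leaf."""
    found = []

    def count(node: dict, label: str) -> int:
        if not node:  # trie leaf: exactly one suffix ends here
            return 1
        child_counts = [count(child, label + ch) for ch, child in node.items()]
        total = sum(child_counts)
        # Condition (1): at least sigma leaves below this node.
        # Condition (2): no child reaches sigma leaves.
        if total >= sigma and all(c < sigma for c in child_counts):
            found.append((label, total))
        return total

    count(build_suffix_trie(s), "")
    return found


result = sigma_leaves("mississippi$", 4)
print(result)
```

For mississippi$ with threshold 4, the nodes reached by "i" and by "s" each have exactly 4 leaves below them and no child with 4 or more, so both qualify, and both stay within the |Σ|^2 = 16 bound.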

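Suffix Tree – Suffix Array connection
The number of leaves in the subtree of a |Σ|-leaf is O(|Σ|^2), so its suffix array interval I has |I| = O(|Σ|^2).
The time to search in the suffix array below a |Σ|-leaf is therefore O(|P| + log |Σ|^2) = O(|P| + log |Σ|).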

Suffix Tray
Idea outline: navigate in the suffix tree until a |Σ|-leaf is hit, then move to the suffix array (time in the suffix array: O(|P| + log |Σ|)).
Problem: navigation in the suffix tree takes O(|P| log |Σ|) time, but we promised O(|P| + log |Σ|).

Suffix Tray
Recall the idea: store an array of length |Σ| at every node of the suffix tree, so that navigation at every node is O(1).
Too expensive overall: O(n|Σ|). But it is fine for O(n/|Σ|) nodes, where the arrays take O(n) space in total.

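Suffix Tray
Idea: truncate the suffix tree at the |Σ|-leaves to obtain the Σ-tree.
It would be nice if the size of the Σ-tree were O(n/|Σ|). However, this is not the case.
[Figure: suffix tree with three kinds of nodes marked: nodes with fewer than |Σ| leaves, |Σ|-leaves, and the rest]

[Figure: suffix tree of S = ababababa$ with three kinds of nodes marked: nodes with fewer than |Σ| leaves, |Σ|-leaves, and the rest]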

Suffix Tray
Alternative idea: redefine the Σ-tree as the suffix tree with every node that has fewer than |Σ| leaves in its subtree removed.
Nodes in the Σ-tree:
1. Σ-leaf.
2. Branching-Σ-node: a node with at least 2 children in the Σ-tree.
3. Others: nodes with only one child in the Σ-tree.

Suffix Tray - Example
[Figure: example suffix tree with four kinds of nodes marked: nodes with fewer than |Σ| leaves, |Σ|-leaves, branching-|Σ|-nodes, and others]

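Suffix Tray
Observation: the number of Σ-leaves is O(n/|Σ|), since their subtrees are disjoint and each contains at least |Σ| of the n leaves.
Hence the number of branching-Σ-nodes is O(n/|Σ|) as well.
So we can afford to store a |Σ|-length navigation table at each branching-Σ-node.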
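The counting behind this observation can be spelled out. The symbols L (the set of Σ-leaves) and ℓ(v) (the number of suffix tree leaves below v) are introduced here only for illustration. No Σ-leaf is a proper descendant of another, because every proper descendant of a Σ-leaf lies below a child with fewer than |Σ| leaves; the subtrees of the Σ-leaves are therefore disjoint. The second line uses the fact that a rooted tree has fewer nodes with at least two children than it has leaves, applied to the Σ-tree, whose leaves are exactly the Σ-leaves.

```latex
\begin{align*}
|\Sigma| \cdot |L| \;\le\; \sum_{v \in L} \ell(v) \;\le\; n
  \quad&\Longrightarrow\quad |L| \;\le\; \frac{n}{|\Sigma|}, \\
\#\{\text{branching-}\Sigma\text{-nodes}\} \;\le\; |L| - 1
  \quad&\Longrightarrow\quad
  \#\{\text{branching-}\Sigma\text{-nodes}\} \;=\; O\!\left(\frac{n}{|\Sigma|}\right).
\end{align*}
```

With one table of length |Σ| per branching-Σ-node, the tables occupy O((n/|Σ|) · |Σ|) = O(n) space in total.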

Suffix Tray – What is Left?
[Figure: example suffix tree with four kinds of nodes marked: nodes with fewer than |Σ| leaves, |Σ|-leaves, branching-|Σ|-nodes, and others]

Suffix Tray
Nodes in the Σ-tree with only one child.
[Figure: a node with a single Σ-tree child; edge labels a, b, b, c, d, e; annotation: interval less than |Σ|^2]

Suffix Tray
Size of the suffix tray: O(n).
Navigation:
1. Σ-leaf: jump to the suffix array.
2. Branching-Σ-node: look up the next character in the Σ-array.
3. Others: look at one character to reach the Σ-tree child.
Time: O(|P| + log|Σ|)

End of Story?
No. Lots of questions:
1. Construction time of suffix trees.
2. Query time.
3. Compressed indexing structures.
4. Indexing with errors.
5. Real-time suffix tree construction.