Suffix Trees, Suffix Arrays and Suffix Trays. Richard Cole, Tsvi Kopelowitz, Moshe Lewenstein
Indexing problem Input: Text T = t_1,…,t_n (preprocess into a data structure, DS). Queries: Pattern P = p_1,…,p_m (answered using the DS).
Suffix Property P appears at location i of T iff P is a prefix of the suffix T_i, where T_i = t_i,…,t_n.
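The suffix property can be checked directly, one suffix at a time. The sketch below is a naive illustration only (it is not an index and takes O(nm) time); the function names `suffix` and `occurrences` are hypothetical, and positions are 1-based to match the slides.

```python
def suffix(text: str, i: int) -> str:
    """Return the suffix T_i of text, using 1-based positions."""
    return text[i - 1:]


def occurrences(text: str, pattern: str) -> list[int]:
    """All 1-based locations i such that pattern is a prefix of T_i."""
    return [i for i in range(1, len(text) + 1)
            if suffix(text, i).startswith(pattern)]


if __name__ == "__main__":
    print(occurrences("mississippi", "ssi"))
```

Every indexing structure in the following slides answers the same question, but without scanning all n suffixes per query.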
Suffix Tree A suffix tree for string S is a compressed trie of all suffixes of S. Example: s = abab$, with suffixes { $, b$, ab$, bab$, abab$ }. [Figure: suffix tree of abab$]
Suffix Tree The size of the suffix tree of S is O(|S|). Example: s = abab$, with suffixes { $, b$, ab$, bab$, abab$ }. [Figure: suffix tree of abab$ with the positions 0 to 4 of s marked]
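Suffix Tree The size of the suffix tree of S is O(|S|). Example: s = abab$, with suffixes { $, b$, ab$, bab$, abab$ }. [Figure: the same tree with each edge label written as a range of positions in s, such as [2,3], [2,4], [4,4], [1,1]; every edge then takes O(1) space]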
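Indexing and Suffix Trees Navigate from the root (using the suffix property). Example: P = ssi. Time: O(|P| + occ) for a constant-size alphabet, where occ is the number of occurrences of P.
Indexing and Suffix Trees Navigate from the root (using the suffix property). Example: P = ssi. Time: O(|P| log|Σ| + occ) for a general alphabet Σ, since choosing the child at each node costs O(log|Σ|).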
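Navigation from the root can be sketched on an uncompressed suffix trie. This is a simplification and an assumption of the example: a real suffix tree compresses unary paths and has O(n) size, while this trie can have O(n^2) nodes. The names `build_suffix_trie` and `find` are hypothetical, and Python dictionaries stand in for the child lookup (expected O(1) by hashing rather than the deterministic O(log|Σ|) of the slide).

```python
def build_suffix_trie(text: str) -> dict:
    """Uncompressed trie of all suffixes of text + '$' (quadratic size)."""
    text += "$"
    root: dict = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node["pos"] = i + 1  # leaf marker: 1-based start of this suffix
    return root


def find(root: dict, pattern: str) -> list[int]:
    """Walk down from the root along pattern, then report all leaves below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # pattern is not a prefix of any suffix
        node = node[ch]
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "pos":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


if __name__ == "__main__":
    print(find(build_suffix_trie("mississippi"), "ssi"))
```

The walk down costs one step per pattern character; in a suffix tree, collecting the leaves below the reached node costs O(occ) because every internal node branches.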
Suffix Trees Weiner 1973 (linear time construction!); McCreight 1976 (space efficient); Ukkonen 1995 (online); Farach 1997 (polynomial-range alphabets).
Suffix Array
POS    All suffixes    |  POS    Sorted suffixes
S_1    mississippi     |  S_11   i
S_2    ississippi      |  S_8    ippi
S_3    ssissippi       |  S_5    issippi
S_4    sissippi        |  S_2    ississippi
S_5    issippi         |  S_1    mississippi
S_6    ssippi          |  S_10   pi
S_7    sippi           |  S_9    ppi
S_8    ippi            |  S_7    sippi
S_9    ppi             |  S_4    sissippi
S_10   pi              |  S_6    ssippi
S_11   i               |  S_3    ssissippi
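The sorted column can be reproduced by sorting the suffix start positions directly. This is a naive sketch for illustration (comparison sorting of whole suffixes, O(n^2 log n) in the worst case), not one of the construction algorithms cited in these slides; the name `suffix_array` is hypothetical and positions are 1-based.

```python
def suffix_array(s: str) -> list[int]:
    """1-based start positions of the suffixes of s in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


if __name__ == "__main__":
    for pos in suffix_array("mississippi"):
        print(f"S_{pos}\t{'mississippi'[pos - 1:]}")
```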
Suffix Array S = m i s s i s s i p p i, SA(S) = 11 8 5 2 1 10 9 7 4 6 3, P = pi. [Figure: binary search for P over SA(S), one step per slide]
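Suffix Array S = m i s s i s s i p p i, SA(S) = 11 8 5 2 1 10 9 7 4 6 3, P = pi. Binary search over SA(S) takes O(log |S|) steps, each comparing P with one suffix in O(|P|) time. Time: O(|P| log |S|)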
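The O(|P| log |S|) search can be sketched as two binary searches that delimit the range of suffixes starting with P. The names `naive_suffix_array` and `sa_search` are hypothetical, and the suffix array is built by plain sorting only to keep the example self-contained.

```python
def naive_suffix_array(s: str) -> list[int]:
    """1-based suffix start positions in lexicographic order (naive build)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def sa_search(s: str, sa: list[int], p: str) -> list[int]:
    """All 1-based occurrences of p in s, via binary search over sa."""
    def prefix_at(k: int) -> str:
        # First |p| characters of the k-th smallest suffix.
        start = sa[k] - 1
        return s[start:start + len(p)]

    lo, hi = 0, len(sa)
    while lo < hi:  # first rank whose prefix is >= p
        mid = (lo + hi) // 2
        if prefix_at(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first rank whose prefix is > p
        mid = (lo + hi) // 2
        if prefix_at(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


if __name__ == "__main__":
    text = "mississippi"
    print(sa_search(text, naive_suffix_array(text), "pi"))
```

Each of the O(log |S|) steps compares at most |P| characters, which gives the bound on the slide.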
Suffix Array Introduced: Manber and Myers (1993). Gonnet, Baeza-Yates, Snider (1992) (PAT arrays). Manber and Myers (1993): Time - O(|P| + log |S|)
Suffix Array Construction Manber and Myers (1993): O(n log n). Kärkkäinen and Sanders (2003): O(n) (polynomial-range alphabets). Two other papers as well.
End of Story? No. Lots of questions. 1. Construction time of suffix trees. 2. Query time. 3. Compressed indexing structures. 4. Indexing with errors. 5. Real-time suffix tree construction.
Query Time for Large Alphabets Suffix Trees: O(|P|*log|Σ|) (deterministic) Suffix Arrays: O(|P| + log |T|) Suffix Trays: O(|P|+log|Σ|) for alphabets {1,…,|Σ|}
Query Time for Large Alphabets Actually, it is easy to answer queries in O(|P|) time: store an array of length |Σ| at every node of the suffix tree, indexed by the next character. Navigation at every node is then O(1). However, the time and space for constructing the suffix tree become O(n|Σ|).
Query Time for Large Alphabets Suffix Trees: O(|P|*log|Σ|) (deterministic) Suffix Arrays: O(|P| + log |S|) Suffix Trays: O(|P|+log|Σ|) for alphabets {1,…,|Σ|}
Suffix Tree – Suffix Array connection With the children of every node ordered by their first character, the left-to-right ordering of the suffixes (leaves) in the suffix tree is exactly the suffix array.
Suffix Array ($ is ordered after every letter here)
POS    All suffixes    |  POS    Sorted suffixes
S_1    mississippi$    |  S_8    ippi$
S_2    ississippi$     |  S_5    issippi$
S_3    ssissippi$      |  S_2    ississippi$
S_4    sissippi$       |  S_11   i$
S_5    issippi$        |  S_1    mississippi$
S_6    ssippi$         |  S_10   pi$
S_7    sippi$          |  S_9    ppi$
S_8    ippi$           |  S_7    sippi$
S_9    ppi$            |  S_4    sissippi$
S_10   pi$             |  S_6    ssippi$
S_11   i$              |  S_3    ssissippi$
S_12   $               |  S_12   $
Example: mississippi$ [Figure: suffix tree of mississippi$ shown together with its suffix array SA(mississippi$)]
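Suffix Tree – Suffix Array connection We utilize this connection as follows: every node in the suffix tree corresponds to an interval in the suffix array, because the suffixes at the leaves of its subtree share a prefix and are therefore contiguous in sorted order.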
Suffix Tree – Suffix Array connection Moreover, the time to search for P in the suffix array restricted to an interval I is O(|P| + log |I|).
Suffix Tree – Suffix Array connection Definition: a |Σ|-leaf is a node such that (1) it has at least |Σ| leaves in its subtree and (2) none of its children do. The number of leaves in the subtree of a |Σ|-leaf is O(|Σ|^2). Why? It has at most |Σ| children, each with fewer than |Σ| leaves in its subtree.
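The definition can be tried out on a small text. The sketch below is an illustration under stated assumptions, not the construction used by the authors: it walks the implicit suffix trie through suffix array intervals, the name `sigma_leaves` is hypothetical, the threshold `sigma` is passed explicitly, and the text must end with a unique terminator `$`. Python's default order places `$` before the letters, which does not affect which nodes qualify.

```python
from itertools import groupby


def sigma_leaves(s: str, sigma: int) -> list[tuple[str, int, int]]:
    """Return (path label, lo, hi) for every node that has at least `sigma`
    leaves below it while none of its children do; [lo, hi) are ranks in
    the suffix array of s. Assumes s ends with a unique '$'."""
    sa = sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])
    found = []

    def visit(label: str, lo: int, hi: int) -> None:
        depth = len(label)
        children, rank = [], lo
        for ch, group in groupby(sa[lo:hi],
                                 key=lambda i: s[i - 1 + depth:i + depth]):
            size = len(list(group))
            if ch:  # an empty key means the suffix has ended (a leaf)
                children.append((label + ch, rank, rank + size))
            rank += size
        big = [c for c in children if c[2] - c[1] >= sigma]
        if big:
            for child in big:  # a deeper node still has >= sigma leaves
                visit(*child)
        else:
            found.append((label, lo, hi))

    if len(sa) >= sigma:
        visit("", 0, len(sa))
    return found


if __name__ == "__main__":
    print(sigma_leaves("mississippi$", 4))
```

With threshold 4 on mississippi$, the qualifying nodes are the ones with path labels i and s, each covering four suffixes.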
Suffix Tree – Suffix Array connection The number of leaves in the subtree of a |Σ|-leaf is O(|Σ|^2). The time to search in the suffix array interval of a |Σ|-leaf is O(|P| + log |Σ|).
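The step from the leaf count to the search time can be written out as follows, combining the counting argument with the O(|P| + log |I|) bound for searching an interval I. The symbol v for a |Σ|-leaf and the notation leaves(v) are introduced here for illustration only.

```latex
\[
  \mathrm{leaves}(v) \;\le\; |\Sigma| \cdot (|\Sigma| - 1) \;<\; |\Sigma|^{2},
\]
\[
  O\bigl(|P| + \log |\Sigma|^{2}\bigr)
  \;=\; O\bigl(|P| + 2\log |\Sigma|\bigr)
  \;=\; O\bigl(|P| + \log |\Sigma|\bigr).
\]
```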
Suffix Tray Idea Outline: navigate in the suffix tree until a |Σ|-leaf is hit, then move to the suffix array (time in the suffix array: O(|P| + log |Σ|)). Problem: navigation in the suffix tree takes O(|P| log |Σ|) time, but we promised O(|P| + log |Σ|).
Suffix Tray Recall the idea: store an array of length |Σ| at every node of the suffix tree, so that navigation at every node is O(1). Too expensive overall: O(n|Σ|). But it is fine for O(n/|Σ|) nodes, since the arrays then take O(n) space in total.
Suffix Tray Idea: truncate the suffix tree at the |Σ|-leaves to obtain the Σ-tree. It would be nice if the size of the Σ-tree were O(n/|Σ|). However, this is not the case. [Figure: a suffix tree with edges labelled a and $; legend: nodes with fewer than |Σ| leaves, |Σ|-leaves, the rest]
[Figure: suffix tree of S = ababababa$; legend: nodes with fewer than |Σ| leaves, |Σ|-leaves, the rest]
Suffix Tray Alternative idea: redefine the Σ-tree as the suffix tree with all nodes that have fewer than |Σ| leaves in their subtrees removed. Nodes in the Σ-tree: 1. Σ-leaf. 2. Branching-Σ-node: a node with at least 2 children in the Σ-tree. 3. Others: nodes with only one child in the Σ-tree.
Suffix Tray - Example [Figure: the suffix tree of ababababa$ again; legend: nodes with fewer than |Σ| leaves, |Σ|-leaves, branching-|Σ|-nodes, others]
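Suffix Tray Observation: the number of Σ-leaves is O(n/|Σ|), because their subtrees are disjoint and each contains at least |Σ| of the n leaves. Hence the number of branching-Σ-nodes is also O(n/|Σ|), since a tree has fewer branching nodes than leaves. So we can afford to store a Σ-table for navigation at each of them, using O(n) space in total.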
Suffix Tray – What is Left? [Figure: the same example tree; legend: nodes with fewer than |Σ| leaves, |Σ|-leaves, branching-|Σ|-nodes, others]
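Suffix Tray Nodes in the Σ-tree with only one child. [Figure: such a node with outgoing edges labelled a, b, c, d, e; the suffix array interval covered by its children outside the Σ-tree has size less than |Σ|^2]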
Suffix Tray Size of the suffix tray: O(n). Navigation: 1. Σ-leaf: jump to the suffix array. 2. Branching-Σ-node: look up the next character in the Σ-array. 3. Others: compare one character against the edge leading to the single Σ-tree child. Time: O(|P| + log|Σ|)
End of Story? No. Lots of questions. 1. Construction time of suffix trees. 2. Query time. 3. Compressed indexing structures. 4. Indexing with errors. 5. Real-time suffix tree construction.