Succinct Data Structures Kunihiko Sadakane National Institute of Informatics
Suffix Trees [1,2]
  [Figure: suffix tree of the string T = ababac$ (positions 1234567). The figure labels the stored components: edge labels, depths of nodes, leaf indexes, pointers to children, suffix links, and the string T.]
Operations on Suffix Trees
  root(): returns the root node
  isleaf(v): returns Yes if v is a leaf
  child(v,c): returns the child w of v whose edge label from v to w begins with the letter c
  firstchild(v): returns the first child of v
  sibling(v): returns the immediate sibling of v
  parent(v): returns the parent of v
  edge(v,d): returns the d-th letter of the label of the edge leading to v
  depth(v): returns the string depth of v
  lca(v,w): returns the lowest common ancestor of v and w
  sl(v): returns the node pointed to by the suffix link of v
Components of Suffix Trees [3]
  String: n lg |A| bits
  Tree structure: O(n lg n) bits
  String depths of nodes: n lg n bits
  Edge labels: n lg n bits
  Suffix links: n lg n bits
Representation of Tree Structure
  Represent the tree by a BP (balanced parentheses) sequence
  Internal nodes: (...), at most n-1 of them
  Leaves: (), n of them
  At most 4n+o(n) bits
  Nodes are represented by the positions of their (
  Example (ababac$): P = (()((()())())(()())())
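As an illustration of the BP encoding, the following minimal sketch writes the suffix tree of ababac$ as nested Python lists and emits one ( on entering a node and one ) on leaving it. The nested-list layout and the names `bp`, `LEAF`, and `tree` are assumptions made for this example only.

```python
def bp(node):
    """Balanced-parentheses string of a tree given as nested lists of children."""
    return "(" + "".join(bp(child) for child in node) + ")"

LEAF = []  # a leaf has no children, so it encodes as "()"

# Suffix tree of "ababac$", children ordered by first edge letter ($ < a < b < c).
tree = [
    LEAF,                  # leaf 7: "$"
    [[LEAF, LEAF], LEAF],  # node "a": child node "aba" (leaves 1, 3) and leaf 5
    [LEAF, LEAF],          # node "ba": leaves 2 and 4
    LEAF,                  # leaf 6: "c$"
]

P = bp(tree)
print(P)       # (()((()())())(()())())
print(len(P))  # 22 = 2 * (7 leaves + 4 internal nodes)
```

Each node costs two bits, so a tree with n leaves and at most n-1 internal nodes needs at most 2(2n-1) bits before the o(n)-bit auxiliary indexes are added.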
Representation of Nodes
  v: position of ( in the BP sequence P
  j: preorder of the node, with \( j = \mathrm{rank}_{(}(P, v) \) and \( v = \mathrm{select}_{(}(P, j) \)
  i: inorder of the node
  Example: in P = (()((()())())(()())()) the preorders 1 to 11 are assigned to the ( symbols from left to right
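The preorder conversion can be tried with a naive linear-time sketch. A succinct implementation would answer rank and select in constant time with o(n)-bit indexes; the function names `rank_open` and `select_open` are chosen here for illustration. Positions are 1-based, as on the slides.

```python
P = "(()((()())())(()())())"

def rank_open(P, v):
    """Number of '(' in P[1..v] (1-based, inclusive)."""
    return P[:v].count("(")

def select_open(P, j):
    """1-based position of the j-th '(' in P."""
    seen = 0
    for pos, ch in enumerate(P, start=1):
        if ch == "(":
            seen += 1
            if seen == j:
                return pos
    raise ValueError("P has fewer than j opening parentheses")

print(rank_open(P, 14))   # 8: the node at position 14 has preorder 8
print(select_open(P, 8))  # 14
```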
Inorder of Nodes
  Defined only for internal nodes
  Number of internal nodes visited from below during the DFS traversal from the root to v
  An internal node may have more than one inorder (a node with degree k has exactly k-1 inorders)
Computation of inorder
  v and its smallest inorder i can be converted to each other in constant time:
  \( i = \mathrm{rank}_{()}(P, \mathrm{findclose}(P, v+1)) \)
  \( v = \mathrm{enclose}(P, \mathrm{select}_{)(}(P, i)+1) \)
Proof: \( i = \mathrm{rank}_{()}(P, \mathrm{findclose}(P, v+1)) \)
  v+1 is the position of the first child w of v.
  u = findclose(P,v+1) is the last position of the subtree rooted at w.
  An inorder is defined once on the path from a leaf to the next leaf, so there is a one-to-one correspondence between leaves and inorders.
  The value of the inorder is the number of leaves on the tour from the root to v.
  Thus, \( i = \mathrm{rank}_{()}(P, u) \).
Proof: \( v = \mathrm{enclose}(P, \mathrm{select}_{)(}(P, i)+1) \)
  i is the number of times during the DFS traversal that a node w is visited from below and a child of w is visited next.
  This action is represented by “)(” in P.
  \( x = \mathrm{select}_{)(}(P, i)+1 \) is the position of a child of v, and its parent is the answer.
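The two conversions between a node position v and its smallest inorder i can be checked on the example sequence with the sketch below. All primitives are naive linear scans standing in for the constant-time succinct versions, and the helper names (`rank_leaf`, `select_close_open`, `inorder`, `node_of_inorder`) are assumptions of this example.

```python
P = "(()((()())())(()())())"

def findclose(P, v):
    """Position of the ')' matching the '(' at 1-based position v."""
    depth = 0
    for pos in range(v, len(P) + 1):
        depth += 1 if P[pos - 1] == "(" else -1
        if depth == 0:
            return pos
    raise ValueError("unbalanced sequence")

def enclose(P, v):
    """Position of the '(' of the parent of the node at position v."""
    depth = 0
    for pos in range(v - 1, 0, -1):
        depth += 1 if P[pos - 1] == "(" else -1
        if depth == 1:
            return pos
    raise ValueError("the root has no parent")

def rank_leaf(P, u):
    """Number of occurrences of '()' that end at or before position u."""
    return sum(1 for k in range(2, u + 1) if P[k - 2:k] == "()")

def select_close_open(P, i):
    """Position of the ')' in the i-th occurrence of ')('."""
    seen = 0
    for pos in range(1, len(P)):
        if P[pos - 1:pos + 1] == ")(":
            seen += 1
            if seen == i:
                return pos
    raise ValueError("P has fewer than i occurrences of ')('")

def inorder(P, v):
    """Smallest inorder of the internal node at position v."""
    return rank_leaf(P, findclose(P, v + 1))

def node_of_inorder(P, i):
    """Internal node that owns inorder i."""
    return enclose(P, select_close_open(P, i) + 1)

print([node_of_inorder(P, i) for i in range(1, 7)])  # [1, 5, 4, 1, 14, 1]
print(inorder(P, 14))                                # 5
```

The root (position 1) has degree 4 and therefore owns three inorders (1, 4, and 6), which is why it appears three times in the printed list.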
String Depths of Nodes
  String depths are represented by the lengths of the common prefixes between two adjacent leaves.
  The Hgt array stores these lengths.
  Example (ababac$): leaves in lexicographic order 7 1 3 5 2 4 6, Hgt = 0 3 1 0 2 0 0
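Hgt Array
  Hgt[i] = lcp(SA[i], SA[i+1]), the length of the longest common prefix of the suffixes T[SA[i]..n] and T[SA[i+1]..n]
  Size: n lg n bits

  i   Hgt  SA  suffix
  1   0    7   $
  2   3    1   ababac$
  3   1    3   abac$
  4   0    5   ac$
  5   2    2   babac$
  6   0    4   bac$
  7   0    6   c$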
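The example values can be reproduced with a brute-force sketch that sorts the suffixes and compares neighbours letter by letter. This is only meant to make the definition concrete; it is not a linear-time construction. The function names are assumptions of this example, and the last entry is set to 0 because it has no successor.

```python
def suffix_array(T):
    """1-based starting positions of the suffixes of T in lexicographic order."""
    return sorted(range(1, len(T) + 1), key=lambda p: T[p - 1:])

def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    length = 0
    while length < min(len(a), len(b)) and a[length] == b[length]:
        length += 1
    return length

def hgt_array(T, SA):
    """Hgt[i] = lcp of the suffixes at SA[i] and SA[i+1]; the last entry is 0."""
    n = len(SA)
    hgt = [lcp(T[SA[i] - 1:], T[SA[i + 1] - 1:]) for i in range(n - 1)]
    return hgt + [0]

T = "ababac$"
SA = suffix_array(T)
print(SA)                # [7, 1, 3, 5, 2, 4, 6]
print(hgt_array(T, SA))  # [0, 3, 1, 0, 2, 0, 0]
```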
Hgt[i] is equal to the string depth of the node with inorder i
  There is a one-to-one correspondence between the inorders of internal nodes and the leaves.
  The inorder i can be computed in constant time:
  \( i = \mathrm{rank}_{()}(P, \mathrm{findclose}(P, v+1)) \)
  depth(v) = Hgt[i]
  Example (ababac$): P = (()((()())())(()())()), Hgt = 0 3 1 0 2 0 0
Computation of Edge Labels
  Let i be the inorder of node v.
  The i-th leaf is a descendant of v.
  The i-th leaf represents the suffix T[SA[i]..n].
  The label of the edge incoming to v is a substring of this suffix.
  Let d1 and d2 be the string depths of parent(v) and v. Then edge length = d2 - d1.
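A minimal sketch of the edge-label computation for a non-root internal node follows. It assumes that d1 and d2 are the string depths of parent(v) and v, so that the label occupies text positions SA[i]+d1 to SA[i]+d2-1. SA and Hgt are stored as plain Python lists, the BP primitives are naive scans, and the function names are chosen for this example.

```python
T = "ababac$"
SA = [7, 1, 3, 5, 2, 4, 6]    # SA[1..n], stored 0-based
HGT = [0, 3, 1, 0, 2, 0, 0]   # Hgt[1..n], stored 0-based
P = "(()((()())())(()())())"

def findclose(v):
    depth = 0
    for pos in range(v, len(P) + 1):
        depth += 1 if P[pos - 1] == "(" else -1
        if depth == 0:
            return pos
    raise ValueError("unbalanced sequence")

def enclose(v):
    depth = 0
    for pos in range(v - 1, 0, -1):
        depth += 1 if P[pos - 1] == "(" else -1
        if depth == 1:
            return pos
    raise ValueError("the root has no parent")

def rank_leaf(u):
    """Number of '()' occurrences ending at or before position u."""
    return sum(1 for k in range(2, u + 1) if P[k - 2:k] == "()")

def inorder(v):
    return rank_leaf(findclose(v + 1))

def depth(v):
    """String depth of the internal node at position v."""
    return HGT[inorder(v) - 1]

def edge_label(v):
    """Label of the edge from parent(v) to the non-root internal node v."""
    i = inorder(v)
    d1, d2 = depth(enclose(v)), depth(v)
    start = SA[i - 1] + d1                    # 1-based text position
    return T[start - 1:start - 1 + (d2 - d1)]

def edge(v, d):
    """d-th letter (1-based) of the label of the edge leading to v."""
    return edge_label(v)[d - 1]

print(edge_label(5))   # ba  (edge from node "a" to node "aba")
print(edge_label(4))   # a
print(edge(14, 2))     # a   (second letter of "ba")
```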
Computation of Hgt Array
  Given i and SA[i], Hgt[i] is computed in constant time using an index of 2n+o(n) bits
Permuting Hgt Array
  Hgt[i] = lcp(SA[i], SA[i+1])
  The values of SA+Hgt become non-decreasing if they are sorted with respect to the values of SA.

  In suffix array order:
    Hgt     0 3 1 0 2 0 0
    SA      7 1 3 5 2 4 6
    SA+Hgt  7 4 4 5 4 4 6

  Sorted by SA:
    SA      1 2 3 4 5 6 7
    SA+Hgt  4 4 4 4 5 6 7

  n non-decreasing numbers in [1,n] are represented in 2n bits: 00001 1 1 1 01 01 01
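The 2n-bit representation can be reproduced with the sketch below, which assumes a unary gap code: for each value, write as many 0s as the increase over the previous value, followed by a single 1. The variable names are chosen for this example.

```python
SA = [7, 1, 3, 5, 2, 4, 6]    # SA[1..n], stored 0-based
HGT = [0, 3, 1, 0, 2, 0, 0]   # Hgt[1..n], stored 0-based
n = len(SA)

# values[k-1] = Hgt[SA^{-1}[k]] + k, i.e. SA+Hgt sorted by SA value.
values = [0] * n
for i in range(n):
    k = SA[i]
    values[k - 1] = HGT[i] + k

# Unary gap code: (value - previous) zeros, then a one.
bits = ""
previous = 0
for value in values:
    bits += "0" * (value - previous) + "1"
    previous = value

print(values)     # [4, 4, 4, 4, 5, 6, 7]
print(bits)       # 00001111010101
print(len(bits))  # 14 = 2n
```

The sequence contains exactly n ones and as many zeros as the last value, which is at most n, so its length never exceeds 2n.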
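Lemma: Let SA[i] = p and SA[j] = p+1. Then \( \mathrm{Hgt}[j] \ge \mathrm{Hgt}[i] - 1 \).
  [Figure: if the suffixes at positions p and q are adjacent in SA and share a prefix of length d, then the suffixes at positions p+1 and q+1 share a prefix of length d-1. Example: p = ababac$, q = abac$, p+1 = babac$, q+1 = bac$.]
  Equivalently, \( \mathrm{Hgt}[\mathrm{SA}^{-1}[p+1]] \ge \mathrm{Hgt}[\mathrm{SA}^{-1}[p]] - 1 \).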
The values \( \mathrm{Hgt}[\mathrm{SA}^{-1}[k]] + k \) (k = 1,2,...,n) are monotone non-decreasing and lie in the range [1, n]
Computation of Hgt[i]
  Compute k = SA[i]
    constant time using the suffix array
    \( O(\log^{\epsilon} n) \) time using the compressed suffix array (\( 0 < \epsilon < 2 \))
  Decode the k-th element v in the monotone sequence
    constant time by select
  Hgt[i] = v - k
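The decoding steps can be sketched as follows, assuming the monotone sequence is stored as a unary gap code (zeros for each increase, then a one per element) so that the k-th value equals the number of zeros before the k-th one. The select here is a linear scan standing in for a constant-time succinct select, and the names `BITS`, `select_one`, and `hgt` are assumptions of this example.

```python
SA = [7, 1, 3, 5, 2, 4, 6]   # SA[1..n], stored 0-based
BITS = "00001111010101"       # gap code of 4 4 4 4 5 6 7 for T = ababac$

def select_one(bits, k):
    """1-based position of the k-th '1' in bits."""
    seen = 0
    for pos, bit in enumerate(bits, start=1):
        if bit == "1":
            seen += 1
            if seen == k:
                return pos
    raise ValueError("bits has fewer than k ones")

def hgt(i):
    """Hgt[i] for 1-based i."""
    k = SA[i - 1]                  # step 1: k = SA[i]
    v = select_one(BITS, k) - k    # step 2: zeros before the k-th one
    return v - k                   # step 3: Hgt[i] = v - k

print([hgt(i) for i in range(1, 8)])  # [0, 3, 1, 0, 2, 0, 0]
```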
Computation of lca
  lca = lowest common ancestor
  u = lca(v,w) is computed in constant time
Let \( E[i] = \mathrm{rank}_{(}(P,i) - \mathrm{rank}_{)}(P,i) \). Then \( u = \mathrm{parent}(\mathrm{RMQ}_E(v,w)+1) \),
  where \( m = \mathrm{RMQ}_E(v,w) \) is the index of the minimum value in E[v..w].
  P = (()((()())())(()())())
  E = 1212343432321232321210
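The lca formula can be tried on the example with the sketch below. The range minimum is found by a linear scan that returns the leftmost minimum, standing in for a constant-time RMQ structure. Handling v = w and v > w separately is an assumption added for this example.

```python
P = "(()((()())())(()())())"

# Excess sequence: E[i] = (number of '(') - (number of ')') in P[1..i].
E = []
level = 0
for ch in P:
    level += 1 if ch == "(" else -1
    E.append(level)

def enclose(v):
    """Position of the '(' of the parent of the node at position v."""
    depth = 0
    for pos in range(v - 1, 0, -1):
        depth += 1 if P[pos - 1] == "(" else -1
        if depth == 1:
            return pos
    raise ValueError("the root has no parent")

def lca(v, w):
    """Lowest common ancestor of the nodes at positions v and w."""
    if v == w:
        return v
    if v > w:
        v, w = w, v
    # Leftmost index of the minimum of E[v..w] (1-based).
    m = min(range(v, w + 1), key=lambda pos: E[pos - 1])
    return enclose(m + 1)

print("".join(map(str, E)))  # 1212343432321232321210
print(lca(6, 8))             # 5  (leaves 1 and 3 meet at node "aba")
print(lca(8, 17))            # 1  (leaves 3 and 4 meet at the root)
print(lca(4, 8))             # 4  (node "a" is an ancestor of leaf 3)
```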
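Representing Suffix Links
  \( \mathrm{sl}(\mathrm{node}(c\alpha)) = \mathrm{node}(\alpha) \), where c is a letter and \( \alpha \) is a string
  Use the \( \Psi \) function of the compressed suffix array
  [Figure: a node v, its suffix link sl(v), and the leaves x, y, x', y'.]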
Proof: Leaves are represented by () and appear in P in the lexicographic order of the suffixes. Therefore
  \( x = \mathrm{rank}_{()}(P, v-1)+1 \) is the smallest suffix in lex. order among the descendant leaves of v, and
  \( y = \mathrm{rank}_{()}(P, \mathrm{findclose}(P,v)) \) is the largest suffix in lex. order among the descendant leaves of v.
  x, y represent T[SA[x]..n], T[SA[y]..n].
  x', y' represent T[SA[x]+1..n], T[SA[y]+1..n].
x is the leftmost leaf and y is the rightmost leaf below v.
  Let l = lcp(x,y). Then l is identical to the string depth of v.
  It holds that lcp(x',y') = l-1.
  lca(x',y') represents a string one letter shorter than v, that is, sl(v).
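The suffix-link argument can be traced on the example with the following sketch. It assumes Psi[i] = SA^{-1}[SA[i]+1], computed here from a plain suffix array (a compressed suffix array would supply it directly), and it uses naive scans for every BP primitive. It is meant for non-root internal nodes only; the helper names are chosen for this example.

```python
SA = [7, 1, 3, 5, 2, 4, 6]   # SA[1..n] for T = ababac$, stored 0-based
P = "(()((()())())(()())())"
ISA = {SA[i - 1]: i for i in range(1, len(SA) + 1)}  # inverse suffix array

def psi(i):
    """Lex. rank of the suffix obtained by dropping the first letter of suffix i."""
    return ISA[SA[i - 1] + 1]

def findclose(v):
    depth = 0
    for pos in range(v, len(P) + 1):
        depth += 1 if P[pos - 1] == "(" else -1
        if depth == 0:
            return pos
    raise ValueError("unbalanced sequence")

def enclose(v):
    depth = 0
    for pos in range(v - 1, 0, -1):
        depth += 1 if P[pos - 1] == "(" else -1
        if depth == 1:
            return pos
    raise ValueError("the root has no parent")

def rank_leaf(u):
    """Number of '()' occurrences ending at or before position u."""
    return sum(1 for k in range(2, u + 1) if P[k - 2:k] == "()")

def select_leaf(i):
    """Position of the '(' of the i-th leaf '()'."""
    starts = [pos for pos in range(1, len(P)) if P[pos - 1:pos + 1] == "()"]
    return starts[i - 1]

E, level = [], 0
for ch in P:
    level += 1 if ch == "(" else -1
    E.append(level)

def lca(v, w):
    if v == w:
        return v
    if v > w:
        v, w = w, v
    m = min(range(v, w + 1), key=lambda pos: E[pos - 1])
    return enclose(m + 1)

def sl(v):
    """Suffix link of the non-root internal node at position v."""
    x = rank_leaf(v - 1) + 1        # leftmost leaf below v
    y = rank_leaf(findclose(v))     # rightmost leaf below v
    return lca(select_leaf(psi(x)), select_leaf(psi(y)))

print(sl(5))    # 14: sl("aba") = "ba"
print(sl(14))   # 4:  sl("ba") = "a"
print(sl(4))    # 1:  sl("a") = root
```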
Going to a Child Node
  w = child(v,c): the child w of v whose edge label starts with the letter c
  By enumerating the children of v
    enumerate each child u by firstchild and sibling
    find u such that edge(u,1) = c
  By binary search on the children of v
    use the operation that finds the i-th child of v
  By binary search on SA
    find the lex. orders l, r of the leftmost/rightmost leaves of v
    binary search on SA[l..r] according to the (d+1)-th letter of the suffixes (d = depth(v))
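The third strategy, binary search on SA, can be sketched as below for an internal node v. The sketch returns the lexicographic range of leaves under the requested child rather than the child's position in P. That return convention, the helper `lower_bound`, and the use of plain lists for SA and Hgt are assumptions of this example.

```python
T = "ababac$"
SA = [7, 1, 3, 5, 2, 4, 6]    # SA[1..n], stored 0-based
HGT = [0, 3, 1, 0, 2, 0, 0]   # Hgt[1..n], stored 0-based
P = "(()((()())())(()())())"

def findclose(v):
    depth = 0
    for pos in range(v, len(P) + 1):
        depth += 1 if P[pos - 1] == "(" else -1
        if depth == 0:
            return pos
    raise ValueError("unbalanced sequence")

def rank_leaf(u):
    """Number of '()' occurrences ending at or before position u."""
    return sum(1 for k in range(2, u + 1) if P[k - 2:k] == "()")

def lower_bound(lo, hi, pred):
    """Smallest i in [lo, hi] with pred(i) true, or hi + 1 if there is none."""
    while lo <= hi:
        mid = (lo + hi) // 2
        if pred(mid):
            hi = mid - 1
        else:
            lo = mid + 1
    return lo

def child_range(v, c):
    """Lex. range (l, r) of leaves below child(v, c), or None if no such child."""
    d = HGT[rank_leaf(findclose(v + 1)) - 1]   # d = depth(v)
    l = rank_leaf(v - 1) + 1                   # leftmost leaf below v
    r = rank_leaf(findclose(v))                # rightmost leaf below v

    def letter(i):
        return T[SA[i - 1] + d - 1]            # (d+1)-th letter of suffix i

    first = lower_bound(l, r, lambda i: letter(i) >= c)
    last = lower_bound(l, r, lambda i: letter(i) > c) - 1
    return (first, last) if first <= last else None

print(child_range(1, "a"))   # (2, 4)
print(child_range(4, "b"))   # (2, 3)
print(child_range(1, "d"))   # None
```

Each probe reads one suffix array entry, and the search over at most |A| distinct first letters is what the child bound of O(tSA log |A|) reflects when the search is restricted to the children of v.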
Data Structure of Compressed Suffix Trees
  It consists of the following components:
    Compressed suffix array: |CSA| bits
    BP sequence of the tree: 4n+o(n) bits
    Hgt array: 2n+o(n) bits
  The size of the compressed suffix tree is |CSA|+6n+o(n) bits
Time Complexities of Operations
  root, isleaf, firstchild, sibling, parent, lca: O(1) time
  depth, edge: \( O(t_{\mathrm{SA}}) \) time
  sl: \( O(t_{\Psi}) \) time
  child: \( O(t_{\mathrm{SA}} \log |A|) \) time
  \( t_{\mathrm{SA}} \): time to compute SA[i]
  \( t_{\Psi} \): time to compute \( \Psi[i] \)
References
  [1] P. Weiner. Linear Pattern Matching Algorithms. In Proceedings of the 14th IEEE Symposium on Switching and Automata Theory, pages 1–11, 1973.
  [2] E. M. McCreight. A Space-economical Suffix Tree Construction Algorithm. Journal of the ACM, 23(2):262–272, 1976.
  [3] K. Sadakane. Compressed Suffix Trees with Full Functionality. Theory of Computing Systems, 41(4):589–607, 2007.