Published by Esteban Barkus. Modified over 10 years ago.
1
Linear-time construction of CSA using o(n log n)-bit working space for large alphabets
Joong Chae Na
School of Computer Sci. & Eng., Seoul National University, Korea
2
Overview
- Background
  - Suffix arrays (SA)
  - Compressed suffix arrays (CSA)
- Problem definition
- Previous works
- Our contributions
- Description of our algorithm
- Conclusions
3
Background (1)
Given a string T of length n over an alphabet Σ, the suffix array (SA) of T [Manber & Myers ’93] is the lexicographically sorted list of the suffixes of T. It takes O(n log n) bits.

T = b a b a a b b a $

i   SA   suffix
1   9    $
2   8    a$
3   4    aabba$
4   2    abaabba$
5   5    abba$
6   7    ba$
7   3    baabba$
8   1    babaabba$
9   6    bba$
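As a small illustration of the definition above, the example table can be reproduced by sorting suffix start positions directly. This is a naive Python sketch for checking the example only, not one of the linear-time algorithms discussed in the talk; the function name `suffix_array` is ours.

```python
def suffix_array(text):
    """Return the 1-based suffix array of text by direct sorting.

    Naive: sorting whole suffixes is far from linear time and is only
    meant for small examples. '$' (ASCII 36) sorts before the letters.
    """
    n = len(text)
    # Sort the start positions 1..n by the suffix that begins there.
    return sorted(range(1, n + 1), key=lambda k: text[k - 1:])


sa = suffix_array("babaabba$")
print(sa)  # [9, 8, 4, 2, 5, 7, 3, 1, 6]
```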
4
Background (2)
- Compressed suffix array (CSA) [Grossi & Vitter ’00]
  - Compressed version of SA
  - Space requirement of O(n log|Σ|) bits
- FM-index [Ferragina & Manzini 2000]

T = b a b a a b b a $

i   SA   Ψ_T   suffix
1   9    8     $
2   8    1     a$
3   4    5     aabba$
4   2    7     abaabba$
5   5    9     abba$
6   7    2     ba$
7   3    3     baabba$
8   1    4     babaabba$
9   6    6     bba$

Ψ_T: O(n log|Σ|) bits
5
Problem definition
- Constructing SA, CSA and FM-index using o(n log n) time and o(n log n)-bit working space
- Working space
  - Temporary space required for executing an algorithm
  - Not including the space for the input and output
6
Related works
Constructing SA and CSA
※ O(n log n)-bit working space
- Manber & Myers [1993]: O(n log n) time
- Kim et al. [2003]: O(n) time
- Kärkkäinen & Sanders [2003]: O(n) time
- Ko & Aluru [2003]: O(n) time
※ O(n log|Σ|)-bit working space
- Lam et al. [COCOON 2002]: O(|Σ| n log n) time
- Hon et al. [ISAAC 2003]: O(n log n) time
None of these algorithms satisfies both the time and the space requirement of our problem.
7
Previous results
Hon et al. [FOCS 2003]
- An algorithm using O(n log log |Σ|) time and O(n log|Σ|)-bit working space
- The first algorithm using o(n log n) time and o(n log n)-bit working space
- Follows ½-recursion (the odd-even scheme)
8
Our contributions
- Another algorithm using o(n log n) time and o(n log n)-bit working space
  - O(n) time and O(n log|Σ| · (log_|Σ| n)^α)-bit working space, where α = log_3 2 ≈ 0.63
- The first alphabet-independent linear-time algorithm for constructing SA, CSA, and FM-index using o(n log n)-bit working space
- Follows ⅔-recursion (the skew scheme)
9
Hon et al. vs. our results

               Hon et al.          Our results
Time           O(n log log |Σ|)    O(n)
Space (bits)   O(n log|Σ|)         O(n log|Σ| · (log_|Σ| n)^α)
Scheme         ½-recursion         ⅔-recursion
(merging)      complex             simple
(encoding)*    implicit (both)

* The encoding step is the most complex and time-consuming step in ⅔-recursion. However, neither algorithm needs the encoding step.
10
Description of our algorithm
11
Overview
- Preliminaries
- Basic definitions and notations
- Main technique
- Outline of our algorithm
12
Preliminaries: Ψ function
Let T[k..n] be the lexicographically i-th smallest suffix of T, i.e. SA[i] = k. Then Ψ_T[i] is the position in SA where T[k+1..n] is stored.

position  1 2 3 4 5 6 7 8 9
T       = b a b a a b b a $

i   SA   Ψ_T   suffix
1   9    8     $
2   8    1     a$
3   4    5     aabba$
4   2    7     abaabba$
5   5    9     abba$
6   7    2     ba$
7   3    3     baabba$
8   1    4     babaabba$
9   6    6     bba$
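The definition can be checked against the example table with a short Python sketch. It assumes, as the table's first row (Ψ_T[1] = 8) suggests, that the successor of the last suffix T[n..n] wraps around to T[1..n]; the name `psi_from_sa` is ours, and computing Ψ from a full SA is only for illustration, since the point of the talk is to avoid storing SA.

```python
def psi_from_sa(sa):
    """Compute Psi (1-based values) from a 1-based suffix array.

    Psi[i] is the position in SA of the suffix that starts one character
    after the suffix stored at SA[i]. Assumption: the successor of the
    suffix starting at n wraps around to the suffix starting at 1.
    """
    n = len(sa)
    rank = [0] * (n + 1)          # rank[k] = i such that SA[i] = k
    for i, k in enumerate(sa, start=1):
        rank[k] = i
    return [rank[k % n + 1] for k in sa]


psi = psi_from_sa([9, 8, 4, 2, 5, 7, 3, 1, 6])
print(psi)  # [8, 1, 5, 7, 9, 2, 3, 4, 6]
```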
13
Preliminaries: Lemmas
- Text, Ψ → SA, CSA
  - O(n) time, O(n log|Σ|)-bit working space
- Text, Ψ → C array (BWT) → FM-index
  - O(n) time, O(n log|Σ|)-bit working space
Hon et al. [FOCS 2003]
Note: the goal is therefore Text → Ψ
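To see why Ψ alone determines SA, one can walk the Ψ chain starting from the position of T[1..n]. The Python sketch below shows only this direction of the first lemma; it stores SA as a plain list and therefore does not meet the lemma's O(n log|Σ|)-bit working-space bound. It assumes the text ends with a unique smallest terminator (so SA[1] = n) and that Ψ wraps from the last suffix to the first, as in the example table. The name `sa_from_psi` is ours.

```python
def sa_from_psi(psi):
    """Rebuild the 1-based suffix array from a 1-based Psi array.

    Assumptions: SA[1] = n (the suffix consisting of the terminator),
    and Psi[1] therefore points at the position of T[1..n].
    """
    n = len(psi)
    sa = [0] * n
    i = psi[0]                    # position in SA of the suffix T[1..n]
    for k in range(1, n + 1):     # visit suffixes T[1..n], T[2..n], ...
        sa[i - 1] = k
        i = psi[i - 1]            # move to the next suffix in text order
    return sa


print(sa_from_psi([8, 1, 5, 7, 9, 2, 3, 4, 6]))
# [9, 8, 4, 2, 5, 7, 3, 1, 6]
```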
14
Basic definitions and notations (1)
- Residue-1 suffixes of T: T[3i-2..n] for 1 ≤ i ≤ n/3, i.e. T[1..n], T[4..n], T[7..n], …
- Residue-2 suffixes of T: T[3i-1..n] for 1 ≤ i ≤ n/3, i.e. T[2..n], T[5..n], T[8..n], …
- Residue-3 suffixes of T: T[3i..n] for 1 ≤ i ≤ n/3, i.e. T[3..n], T[6..n], T[9..n], …

position  1 2 3 4 5 6 7 8 9
T[1..n] = b a b a a b b a $

Residue-1: babaabba$, aabba$, ba$
Residue-2: abaabba$, abba$, a$
Residue-3: baabba$, bba$, $
15
Basic definitions and notations (2)
- T_12[1..⅔n] = T[1..n] T[2..n] T[1]
  - length: ⅔n, alphabet: Σ^3
  - SA_12: suffix array of T_12
- T_3[1..⅓n] = T[3..n] T[1] T[2]
  - length: ⅓n, alphabet: Σ^3
  - SA_3: suffix array of T_3

position  1 2 3 4 5 6 7 8 9
T       = b a b a a b b a $     (alphabet Σ)

T_12 = [bab] [aab] [ba$] [aba] [abb] [a$b]
       covering T positions (1 2 3) (4 5 6) (7 8 9) (2 3 4) (5 6 7) (8 9 1)
T_3  = [baa] [bba] [$ba]
       covering T positions (3 4 5) (6 7 8) (9 1 2)
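The two derived strings can be built mechanically from the definitions above. The following Python sketch assumes n is a multiple of 3 (as in the running example) and represents each character of Σ^3 as a 3-character Python string; the name `build_t12_t3` is ours.

```python
def build_t12_t3(text):
    """Return (T_12, T_3) as lists of triples, following the slide's definitions.

    T_12 = T[1..n] T[2..n] T[1] and T_3 = T[3..n] T[1] T[2], each cut into
    blocks of three characters. Assumption: len(text) is a multiple of 3.
    """
    assert len(text) % 3 == 0

    def triples(s):
        return [s[j:j + 3] for j in range(0, len(s), 3)]

    t12 = triples(text) + triples(text[1:] + text[:1])
    t3 = triples(text[2:] + text[:2])
    return t12, t3


t12, t3 = build_t12_t3("babaabba$")
print(t12)  # ['bab', 'aab', 'ba$', 'aba', 'abb', 'a$b']
print(t3)   # ['baa', 'bba', '$ba']
```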
16
Main technique: Ψ’ function
- Ψ’ is just like Ψ, but it is defined on SA_12 and SA_3.
- Ψ’ points to the position in SA_12 or SA_3 where T[k+1..n] (the suffix following the current suffix T[k..n]) is stored.
※ Note that Ψ’ is not the Ψ function of T_12 or of T_3.
- The Ψ’ function consists of Ψ’_T12 and Ψ’_T3.
17
Ψ’ function (residue-1)
- Ψ’_T12 (residue-1 suffixes of T): let T[3k-2..n] be the suffix stored in SA_12[i]. Then Ψ’_T12[i] is the position in SA_12 where the next suffix T[3k-1..n] is stored.
- Ψ’_T12 (residue-2 suffixes of T): let T[3k-1..n] be the suffix stored in SA_12[i]. Then Ψ’_T12[i] is the position in SA_3 where the next suffix T[3k..n] is stored.
- Ψ’_T3 (residue-3 suffixes of T): let T[3k..n] be the suffix stored in SA_3[i]. Then Ψ’_T3[i] is the position in SA_12 where the next suffix T[3k+1..n] is stored.
18
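Ψ’ function (residue-1)

position  1 2 3 4 5 6 7 8 9
T       = b a b a a b b a $
T_12    = [bab] [aab] [ba$] [aba] [abb] [a$b]
T_3     = [baa] [bba] [$ba]

i   SA_12   Ψ’_T12   suffix
1   6       1        a$b
2   2       4        aab ba$
3   4       2        aba abb a$b
4   5       3        abb a$b
5   3       1        ba$
6   1       3        bab aab ba$

i   SA_3   Ψ’_T3   suffix
1   3      6       $ba
2   1      2       baa bba $ba
3   2      5       bba $ba

Residue-1 suffixes are the rows of SA_12 with SA_12[i] ∈ {1, 2, 3} (i = 6, 2, 5); their Ψ’_T12 values 3, 4, 1 are positions in SA_12.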
19
Ψ’ function (residue-2)
- Ψ’_T12 (residue-2 suffixes): let T[3k-1..n] be the suffix stored in SA_12[i]. Then Ψ’_T12[i] is the position in SA_3 where the next suffix T[3k..n] is stored.
20
Ψ’ function (residue-2)
Same example tables as on slide 18. Residue-2 suffixes are the rows of SA_12 with SA_12[i] ∈ {4, 5, 6} (i = 3, 4, 1); their Ψ’_T12 values 2, 3, 1 are positions in SA_3.
21
Ψ’ function (residue-3)
- Ψ’_T3 (residue-3 suffixes): let T[3k..n] be the suffix stored in SA_3[i]. Then Ψ’_T3[i] is the position in SA_12 where the next suffix T[3k+1..n] is stored.
22
Ψ’ function (residue-3)
Same example tables as on slide 18. Residue-3 suffixes are the rows of SA_3 (i = 1, 2, 3); their Ψ’_T3 values 6, 2, 5 are positions in SA_12.
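The three cases of Ψ’ can be reproduced for the running example with a short Python sketch. It sorts the suffixes of T_12 and T_3 naively, which the actual algorithm avoids, and it assumes that n is a multiple of 3, that T ends with a unique smallest terminator, and that the successor of the last residue-3 suffix wraps around to T[1..n] (as the Ψ’_T3 value 6 in the example suggests). The name `psi_prime` and the helper names are ours.

```python
def psi_prime(text):
    """Return (SA_12, Psi'_T12, SA_3, Psi'_T3), all 1-based, by naive sorting."""
    n = len(text)
    assert n % 3 == 0
    m = n // 3

    def triples(s):
        return [s[j:j + 3] for j in range(0, len(s), 3)]

    t12 = triples(text) + triples(text[1:] + text[:1])
    t3 = triples(text[2:] + text[:2])

    # Naive suffix sorting over the triple alphabet (lists compare lexicographically).
    sa12 = sorted(range(1, 2 * m + 1), key=lambda j: t12[j - 1:])
    sa3 = sorted(range(1, m + 1), key=lambda j: t3[j - 1:])
    rank12 = {j: i for i, j in enumerate(sa12, start=1)}
    rank3 = {j: i for i, j in enumerate(sa3, start=1)}

    psi12 = []
    for j in sa12:
        if j <= m:
            # Residue-1 suffix T[3j-2..n]: the next suffix is residue-2,
            # which starts at position j + m of T_12.
            psi12.append(rank12[j + m])
        else:
            # Residue-2 suffix T[3(j-m)-1..n]: the next suffix is residue-3,
            # which starts at position j - m of T_3.
            psi12.append(rank3[j - m])

    # Residue-3 suffix T[3j..n]: the next suffix is residue-1 at position
    # j + 1 of T_12 (wrapping to position 1 after the last one).
    psi3 = [rank12[j % m + 1] for j in sa3]
    return sa12, psi12, sa3, psi3


sa12, psi12, sa3, psi3 = psi_prime("babaabba$")
print(sa12, psi12)  # [6, 2, 4, 5, 3, 1] [1, 4, 2, 3, 1, 3]
print(sa3, psi3)    # [3, 1, 2] [6, 2, 5]
```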
23
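Framework: outline
How to construct the Ψ function of T: bottom-up approach
- step 0: T → Ψ_T
- step 1: T_12 → Ψ_T12
- …
- step h: use any linear-time construction algorithm, where h = log_3 log_|Σ| n
[Figure: recursion levels step 0, step 1, …, step h, with a table giving the length and alphabet of the string at step i]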
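The choice of h can be checked against the working-space bound stated under "Our contributions". The derivation below is a sketch under one assumption that is read off the definition of T_12 rather than the garbled figure: each step shrinks the length by a factor ⅔ and cubes the alphabet, so step i works on a string of length n_i over an alphabet Σ_i.

```latex
\begin{align*}
n_i &= \left(\tfrac{2}{3}\right)^{i} n, \qquad |\Sigma_i| = |\Sigma|^{3^{i}},\\
h = \log_3 \log_{|\Sigma|} n
  &\;\Longrightarrow\; 3^{h} = \log_{|\Sigma|} n, \qquad |\Sigma_h| = n,\\
n_h &= \frac{2^{h}}{3^{h}}\, n
     = n \,(\log_{|\Sigma|} n)^{\alpha - 1}, \qquad \alpha = \log_3 2,\\
n_h \log n &= n \,(\log_{|\Sigma|} n)^{\alpha - 1} \cdot \log|\Sigma| \cdot \log_{|\Sigma|} n
            = n \log|\Sigma| \cdot (\log_{|\Sigma|} n)^{\alpha}.
\end{align*}
```

Under this assumption, running a linear-time suffix array algorithm that needs O(n_h log n) bits at step h stays within O(n log|Σ| · (log_|Σ| n)^α) bits.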
24
Step i: outline
[Diagram: from the string S of step i, form S_12 and S_3. Ψ_S12 is obtained from step i+1. From it, Ψ’_S12 and Ψ’_S3 are computed, and merging Ψ’_S12 with Ψ’_S3 yields Ψ_S.]
25
Merging step
Inputs (running example): SA_12 = 6 2 4 5 3 1 with Ψ’_T12 = 1 4 2 3 1 3, and SA_3 = 3 1 2 with Ψ’_T3 = 6 2 5.
Output:

i   SA   Ψ_T   suffix
1   9    8     $
2   8    1     a$
3   4    5     aabba$
4   2    7     abaabba$
5   5    9     abba$
6   7    2     ba$
7   3    3     baabba$
8   1    4     babaabba$
9   6    6     bba$

* Compare the entries of SA_12 with the entries of SA_3 in order.
  - Two suffixes are compared by following the Ψ’ function at most twice.
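The "at most twice" rule can be illustrated with a Python sketch of the comparison and merge. It is a simplification: it outputs the suffix array of T as a plain list instead of building Ψ_T in compressed form, it reads first characters directly from T, and it assumes T ends with a unique smallest terminator so that equal characters always have a successor. All function names are ours.

```python
def merge_with_psi_prime(text, sa12, psi12, sa3, psi3):
    """Merge SA_12 and SA_3 (1-based, with their Psi' arrays) into the SA of text."""
    m = len(text) // 3

    def pos12(i):                 # text position of the suffix at SA_12[i]
        j = sa12[i - 1]
        return 3 * j - 2 if j <= m else 3 * (j - m) - 1

    def pos3(i):                  # text position of the suffix at SA_3[i]
        return 3 * sa3[i - 1]

    def less(i, i3):
        """True if the suffix at SA_12[i] is smaller than the one at SA_3[i3]."""
        a, b = text[pos12(i) - 1], text[pos3(i3) - 1]
        if a != b:
            return a < b
        ni, ni3 = psi12[i - 1], psi3[i3 - 1]      # follow Psi' once
        if sa12[i - 1] <= m:
            # Residue-1 vs residue-3: both successors live in SA_12.
            return ni < ni3
        # Residue-2 vs residue-3: successors are in SA_3 and SA_12,
        # so compare one more character and follow Psi' a second time.
        a, b = text[pos3(ni) - 1], text[pos12(ni3) - 1]
        if a != b:
            return a < b
        return psi3[ni - 1] < psi12[ni3 - 1]      # both now in SA_12

    sa, i, i3 = [], 1, 1
    while i <= 2 * m and i3 <= m:
        if less(i, i3):
            sa.append(pos12(i))
            i += 1
        else:
            sa.append(pos3(i3))
            i3 += 1
    sa.extend(pos12(x) for x in range(i, 2 * m + 1))
    sa.extend(pos3(x) for x in range(i3, m + 1))
    return sa


print(merge_with_psi_prime("babaabba$",
                           [6, 2, 4, 5, 3, 1], [1, 4, 2, 3, 1, 3],
                           [3, 1, 2], [6, 2, 5]))
# [9, 8, 4, 2, 5, 7, 3, 1, 6]
```

After one or two Ψ’ steps both successors lie in SA_12, so their relative order is just a comparison of positions; this is what keeps each comparison constant-time.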
26
Conclusions & future works
- We presented an alphabet-independent linear-time algorithm to construct SA, CSA, and FM-index using o(n log n)-bit working space.
- Future works
  - Construct SA, CSA, and FM-index optimally, i.e., using O(n) time and O(n log|Σ|)-bit working space