Parallel Suffix Array Construction by Accelerated Sampling Matthew Felice Pace University of Warwick Joint work with Alexander Tiskin University of Warwick.


1 Parallel Suffix Array Construction by Accelerated Sampling. Matthew Felice Pace, University of Warwick. Joint work with Alexander Tiskin, University of Warwick.

2 Outline: Introduction; Difference Covers; Sequential Suffix Array Construction; Bulk-Synchronous Parallel (BSP) Model; Suffix Array Construction in BSP; Conclusion.

3 Introduction What is a suffix array? A data structure, denoted SA_x, that holds the lexicographic order of all the suffixes of a given string x of size n. Suffix array construction is closely related to sorting. The naïve solution is to radix sort all the suffixes, in O(n^2) time. We assume that the given string of size n is over an integer alphabet or an indexed alphabet.
Example (the suffix array is filled in one entry at a time; the suffix consisting only of the sentinel $ is omitted):
index:        0 1 2 3 4 5 6 7 8 9
string:       c a b d a b b d a $
suffix array: 8 4 1 5 6 2 0 7 3
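As a minimal sketch of the naïve approach, the slide's example can be checked by sorting suffix start positions directly. Note the assumptions: the function name is illustrative, and the code uses Python's comparison sort on suffix slices instead of the radix sort mentioned on the slide, so it only demonstrates the definition, not an efficient construction.

```python
def naive_suffix_array(x: str) -> list[int]:
    """Return the start indices of all suffixes of x in lexicographic order.

    Illustrative only: every comparison may scan a whole suffix, so this is
    far from the linear-time algorithms discussed in the talk.
    """
    return sorted(range(len(x)), key=lambda i: x[i:])


text = "cabdabbda$"  # '$' sorts before the letters in ASCII
sa = naive_suffix_array(text)
print(sa)      # full suffix array, including the sentinel suffix at index 9
print(sa[1:])  # the entries shown on the slide, sentinel suffix omitted
```

The first entry is the suffix "$" itself; dropping it gives the nine entries listed in the slide's example.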

4 Introduction Manber and Myers [1990] presented the first suffix array construction algorithm (SACA), running in O(n log n) time. Kärkkäinen and Sanders [2003], Kim et al. [2003], and Ko and Aluru [2003] all developed SACAs running in O(n) time. Kärkkäinen et al. [2006] extended their algorithm to run on a p-processor BSP machine with optimal local computation and communication costs, requiring O(log^2 p) supersteps. We reduce the number of supersteps required to O(log log p) while preserving the optimal computation and communication costs.

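5 Introduction The idea behind the SACAs with linear worst-case running time is to use recursion:
1. Divide the indices of the input string into two nonempty disjoint sets.
2. Form two strings from the characters indexed by each set.
3. Recursively construct the suffix array of the first string.
4. Use it to construct the suffix array of the second string.
5. Merge the two suffix arrays to obtain the suffix array of the input string.
Example: cabdabbda$ (indices 0123456789) is split into cdb$ and ababda$.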
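The split in the example can be reproduced with a short sketch. It assumes, based on the example strings only, that the first set holds the indices divisible by 3 and the second set holds the rest; the function name and the parameter v are illustrative.

```python
def split_by_residue(x: str, v: int = 3) -> tuple[str, str]:
    """Split x into characters at indices i with i % v == 0 and all others."""
    first = "".join(x[i] for i in range(len(x)) if i % v == 0)
    second = "".join(x[i] for i in range(len(x)) if i % v != 0)
    return first, second


first, second = split_by_residue("cabdabbda$")
print(first)   # characters at indices 0, 3, 6, 9
print(second)  # characters at indices 1, 2, 4, 5, 7, 8
```

The slide shows the second string with a trailing $ appended; the sketch leaves that sentinel out.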

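6 Difference Covers Given a positive integer v, let Z_v denote the set of integers {0, 1, ..., v-1}. A set D ⊆ Z_v can be defined such that for any z in Z_v there exist a, b in D with z ≡ a - b (mod v). D is then known as a difference cover of Z_v. Let v = 4, i.e. Z_4 = {0, 1, 2, 3}; then we can have e.g. D = {1, 2, 3}, but not D = {0, 1}.
D = {1, 2, 3}: 0 ≡ 1 - 1 (mod 4); 1 ≡ 3 - 2 (mod 4); 2 ≡ 3 - 1 (mod 4); 3 ≡ 1 - 2 (mod 4).
D = {0, 1}: 1 ≡ 1 - 0 (mod 4); 3 ≡ 0 - 1 (mod 4); 2 cannot be written as a difference.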
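The definition can be checked by brute force. The following is a small sketch (the function name is an assumption for illustration) that enumerates all pairwise differences modulo v and compares them with Z_v.

```python
def is_difference_cover(d: set[int], v: int) -> bool:
    """Return True if every z in {0, ..., v-1} equals (a - b) mod v for some a, b in d."""
    differences = {(a - b) % v for a in d for b in d}
    return differences == set(range(v))


print(is_difference_cover({1, 2, 3}, 4))  # covers 0, 1, 2 and 3
print(is_difference_cover({0, 1}, 4))     # 2 is never produced
```

This quadratic check is only meant for small v; it does not construct a cover, which is what the Colbourn and Ling method is cited for.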

7 Difference Covers Colbourn and Ling [2000] give a method for computing a difference cover of Z_v, for any positive integer v, in O(√v) time. Lemma 1 [Kärkkäinen and Sanders 2003]: If D is a difference cover of Z_v, and i and j are integers, then there exists l in Z_v such that (i + l) mod v and (j + l) mod v are both in D. For example, let v = 4 and let D be a difference cover containing 1 and 2, e.g. D = {1, 2, 3}; then
i   j   l   (i + l) mod 4        (j + l) mod 4
30  35  3   (30 + 3) mod 4 = 1   (35 + 3) mod 4 = 2
20  35  2   (20 + 2) mod 4 = 2   (35 + 2) mod 4 = 1
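Lemma 1 can be illustrated by listing every shift that works for a given pair of positions. This sketch assumes v = 4 and D = {1, 2, 3} for the demonstration; the function name is hypothetical. The lemma only promises that at least one shift exists, so several may be valid.

```python
def valid_shifts(i: int, j: int, d: set[int], v: int) -> list[int]:
    """Return all l in {0, ..., v-1} with (i + l) % v and (j + l) % v both in d."""
    return [l for l in range(v) if (i + l) % v in d and (j + l) % v in d]


cover, v = {1, 2, 3}, 4
print(valid_shifts(30, 35, cover, v))  # contains the shift 3 used in the table
print(valid_shifts(20, 35, cover, v))  # contains the shift 2 used in the table

# Lemma 1: for a difference cover, some shift exists for every pair (i, j).
all_pairs_ok = all(valid_shifts(i, j, cover, v) for i in range(40) for j in range(40))
print(all_pairs_ok)
```

In the merging step this shift tells the algorithm how many characters to compare directly before two suffixes both land on sample positions whose relative order is already known.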

8 Sequential Suffix Array Construction Given a string x of size n and a positive integer v, we construct the suffix array as follows. Construct a difference cover D of Z_v (e.g. for v = 3, D = {1, 2}). Partition the set of indices into v sets according to i mod v. Denote every character x[i] such that i mod v is in D as a sample character, and for each such character define a super-character corresponding to x[i : i+v-1].
Figure: the string x[0] x[1] x[2] x[3] x[4] x[5] … x[n-1] with its sample characters and their super-characters highlighted.

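9 Sequential Suffix Array Construction Construct the string of super-characters, of size about |D|n/v. Construct a second string, identical to the first but with each super-character replaced by its rank in the sorted list of super-characters. Recursively call the algorithm on this ranked string, with its own sampling parameter. When the algorithm returns with the suffix array of the ranked string, fill the array rank with the rank of each sample suffix.
Super-characters (v = 3, D = {1, 2}): x[1:3] x[4:6] … x[n-2:n] x[2:4] x[5:7] … x[n-1:n+1]
Ranks (example values): 4 8 3 3 … 2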
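The sampling and ranking step can be sketched for the running example with v = 3 and D = {1, 2}. Several details here are assumptions made for illustration: the function name, padding past the end of the string with the sentinel $, 0-based ranks, and a comparison sort in place of the integer sorting used by the actual algorithm.

```python
def ranked_super_characters(x: str, v: int, d: set[int], pad: str = "$"):
    """Return sample positions, their super-characters, and the ranked string.

    Positions are grouped by residue class (all i with i % v == d0, then d1, ...),
    matching the order x[1:3] x[4:6] ... x[2:4] x[5:7] ... shown on the slide.
    """
    padded = x + pad * v  # assumption: pad so every super-character has length v
    positions = [i for r in sorted(d) for i in range(r, len(x), v)]
    supers = [padded[i:i + v] for i in positions]
    rank_of = {s: r for r, s in enumerate(sorted(set(supers)))}
    return positions, supers, [rank_of[s] for s in supers]


positions, supers, ranks = ranked_super_characters("cabdabbda$", 3, {1, 2})
print(positions)
print(supers)
print(ranks)
```

Equal super-characters receive equal ranks, which is why the ranked string generally still needs a recursive call to order the sample suffixes fully.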

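10 Sequential Suffix Array Construction For each residue k in Z_v, find an l such that (k + l) mod v is in D (e.g. for v = 3, D = {1, 2} and k = 0, we have l = 1). Then for each index i with i mod v = k, define the tuple (x[i : i+l-1], rank[i+l]), and sort the tuples separately for each k.
Example (k = 0): x[0] is paired with rank[1], x[3] with rank[4], …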

11 Sequential Suffix Array Construction Sort all the suffixes of x by their first v characters to get sets of suffixes having an identical prefix. Each set of suffixes with an identical prefix can be divided into v subsets of suffixes whose order within the subset has already been found. Merge the subsets of each set of suffixes with identical prefixes, using Lemma 1. The suffix array is obtained in O(vn) time.
Example (v = 3, one set with a common prefix): x[0:2], x[10:12], x[5:7], x[12:14], x[1:3], …

12 Sequential Suffix Array Construction (figure only)

13 BSP model The model was developed to allow rigorous parallel algorithm design over diverse physical systems. A BSP machine consists of p processors, each with local memory; a global communication environment; and barrier synchronisation.

14 BSP model A BSP machine is defined by three parameters: p, the number of processors; g, the inverse bandwidth of the network; and l, the network latency. Algorithms run in supersteps, each of which is measured by comp, the maximum computation over all processors, and comm, the maximum communication over all processors. The total cost of an algorithm having S supersteps is comp + comm · g + S · l, where comp and comm are summed over all supersteps.
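The standard BSP cost formula can be written as a small function. This is a sketch under the usual definition, where each superstep contributes its computation cost, its communication cost scaled by g, and one synchronisation of cost l; the function name and the sample numbers are illustrative, not taken from the talk.

```python
def bsp_cost(supersteps: list[tuple[float, float]], g: float, l: float) -> float:
    """Total BSP cost for supersteps given as (comp, comm) pairs.

    comp and comm are the per-superstep maxima over all processors.
    """
    comp = sum(w for w, _ in supersteps)
    comm = sum(h for _, h in supersteps)
    return comp + comm * g + len(supersteps) * l


# Two supersteps with made-up costs, g = 2 and l = 20.
print(bsp_cost([(100, 10), (50, 5)], g=2, l=20))
```

Because every superstep pays the latency l, reducing the number of supersteps S matters even when comp and comm are already optimal.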

15 Suffix Array Construction in BSP The sequential algorithm is divided into four steps: three integer sorting steps and a final merging step. Integer sorting in BSP requires O(1) supersteps with comp O(n/p) and comm O(n/p), using a technique called regular sampling [Chan and Dehne 1999]. We can perform the final merging step using the same technique. Therefore, we can perform each level of recursion in O(1) supersteps.

16 Suffix Array Construction in BSP With a fixed v, the size of the string decreases by the constant factor |D|/v in each level of recursion. This requires O(log p) levels of recursion, i.e. O(log p) supersteps.
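To see why a constant shrink factor leads to a logarithmic number of levels, the following sketch counts how many levels are needed before the string is at most a 1/p fraction of its original size. The stopping point n/p and the factor 2/3 (from v = 3, D = {1, 2}) are assumptions made for this illustration, as is the function name.

```python
def levels_with_fixed_factor(p: int, shrink: float = 2 / 3) -> int:
    """Count levels until the relative string size drops to at most 1/p."""
    size, levels = 1.0, 0
    while size > 1.0 / p:
        size *= shrink
        levels += 1
    return levels


for p in (2, 16, 1024):
    print(p, levels_with_fixed_factor(p))
```

The count grows roughly like log p divided by log(3/2), so with O(1) supersteps per level the total is O(log p) supersteps.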

17 Suffix Array Construction in BSP However, by decreasing the sampling frequency at each level of recursion, we can accelerate the rate at which the size of the input string decreases over successive levels. With a suitable choice of the sampling parameter at each level, the size of the input string decreases super-exponentially.
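The effect of accelerated sampling can be illustrated numerically. This sketch rests on assumptions that are not stated on the slide: the sampling parameter grows as v to the power 5/4 from one level to the next, the difference cover has an idealised size of √v (so each level shrinks the string by 1/√v), and recursion stops once the string is a 1/p fraction of its original size. All names are hypothetical.

```python
import math


def levels_until(p: int, shrink_factors) -> int:
    """Count recursion levels until the relative string size is at most 1/p."""
    size, levels = 1.0, 0
    for factor in shrink_factors:
        if size <= 1.0 / p:
            break
        size *= factor
        levels += 1
    return levels


def fixed_factors(v: int = 3, d: int = 2):
    """Constant shrink factor |D|/v at every level."""
    while True:
        yield d / v


def accelerated_factors(v0: float = 3.0, exponent: float = 1.25):
    """Assumed schedule: shrink by 1/sqrt(v), then raise v to the given exponent."""
    v = v0
    while True:
        yield 1.0 / math.sqrt(v)
        v = v ** exponent


print(levels_until(1024, fixed_factors()))
print(levels_until(1024, accelerated_factors()))
```

Under these assumptions the accelerated schedule needs far fewer levels than the fixed one for the same p, which is the behaviour the slide describes; the exact counts depend on the assumed constants.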

18 Suffix Array Construction in BSP

19 Suffix Array Construction in BSP Therefore, with accelerated sampling we require only O(log log p) supersteps to construct the suffix array of a given string.

20 Conclusion We presented an algorithm for constructing suffix arrays in parallel on a p-processor machine. The algorithm has optimal local computation and communication costs. We reduced the number of supersteps required to a near-optimal O(log log p). Open questions: Can we construct suffix arrays in O(1) supersteps? Can we apply the accelerated sampling technique to other algorithms?

21 Thank you!

