Optimizing multi-pattern searches for compressed suffix arrays
Kalle Karhu
Department of Computer Science and Engineering
Aalto University, School of Science, Finland
Outline of the presentation
Problem setting
Methods used in the work
  – Preprocessing of text
  – Preprocessing of patterns
  – Executing the search
Data
Runs
Results
Concluding remarks
Future work
Problem setting
We want to search for text patterns (queries) in text databases
Multiple sets, each containing a large number of long patterns
  – Here we handle a single set of 1000 patterns, each 1000 nucleotides long
Multiple instances of preprocessed text
  – Either a compressed suffix array (CSA) is used purely as an index alongside a plain copy of the text, or only the CSA is stored, since it is a self-index
In these experiments, both the patterns and the text were DNA
Problem setting
A single pattern set can be searched in multiple texts and vice versa, so preprocessing time does not limit the usefulness of the method
  – The time taken by preprocessing is amortized over a large number of searches
For the same reason, it makes sense to store the patterns in preprocessed form
This leads to searching a preprocessed set of patterns in a preprocessed text
Methods – Preprocessing the text
A compressed suffix array (CSA) was constructed from the text, using the package available on the Pizza & Chili website (P. Ferragina and G. Navarro)
The CSA has two main parameters:
  – Samplerate: the interval between two indices of the suffix array stored explicitly. The default value 16 was used
  – Samplepsi: the interval between two indices of the psi function stored explicitly. The default value 128 was used
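The role of the suffix array sampling can be illustrated with a toy sketch. This is not the Pizza & Chili package: the function names are hypothetical, the suffix array is built naively, psi is kept as a plain list (a real CSA stores it compressed and samples it as well, which is what Samplepsi controls), and samples are assumed to be taken at text positions that are multiples of the sample rate. An unsampled entry is recovered by following psi until a sampled entry is reached.

```python
def build_suffix_array(text):
    # Naive construction; adequate only for a toy example.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_psi(sa):
    # psi[rank] is the rank of the suffix that starts one text position later.
    n = len(sa)
    inverse = [0] * n
    for rank, pos in enumerate(sa):
        inverse[pos] = rank
    return [inverse[(sa[rank] + 1) % n] for rank in range(n)]


def sample_suffix_array(sa, sample_rate):
    # Assumption: keep only entries whose text position is a multiple of sample_rate.
    return {rank: pos for rank, pos in enumerate(sa) if pos % sample_rate == 0}


def lookup(rank, psi, samples, n):
    # Walk forward in the text via psi until a sampled entry is found,
    # then subtract the number of steps taken.
    steps = 0
    while rank not in samples:
        rank = psi[rank]
        steps += 1
    return (samples[rank] - steps) % n


text = "ACGTACGTTGCA$"  # "$" is a sentinel smaller than every nucleotide
sa = build_suffix_array(text)
psi = build_psi(sa)
samples = sample_suffix_array(sa, sample_rate=4)
recovered = [lookup(rank, psi, samples, len(text)) for rank in range(len(text))]
```

A larger sample rate stores fewer entries but makes each lookup walk further, which is the space-time tradeoff behind the parameter.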
Methods – Preprocessing patterns
A collection of subpatterns of the patterns was extracted with the compression tool Re-Pair
The principal idea is to find a set of subpatterns that occur in a large number of patterns but are rare in the text
Assuming that the letters of the text are independent and identically distributed, long patterns occur rarely
Conveniently, Re-Pair produces phrases, which are long subpatterns that occur more than once in its input
These phrases were simply scored by the number of times they occur in the pattern set
Additionally, the length of a subpattern was required to reach a set threshold
  – This limits the expected number of occurrences of the subpattern in the text
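The scoring and the length threshold can be sketched as follows. This is a minimal illustration, not the code used in the work: the function names are hypothetical, the phrases are assumed to be already extracted (Re-Pair itself is not reimplemented), the score is assumed to be the number of patterns that contain the phrase, and the expected-occurrence estimate assumes a uniform i.i.d. text over a four-letter alphabet.

```python
def expected_occurrences(text_length, subpattern_length, alphabet_size=4):
    # Under a uniform i.i.d. model, each of the possible start positions
    # matches with probability alphabet_size ** -subpattern_length.
    positions = max(text_length - subpattern_length + 1, 0)
    return positions / alphabet_size ** subpattern_length


def score_subpatterns(phrases, patterns, min_length):
    # Keep phrases that reach the length threshold and score each one
    # by the number of patterns it occurs in (an assumed scoring rule).
    scores = {}
    for phrase in set(phrases):
        if len(phrase) < min_length:
            continue
        scores[phrase] = sum(1 for pattern in patterns if phrase in pattern)
    # Highest score first; ties broken by longer phrase, then alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], -len(item[0]), item[0]))


patterns = ["ACGTACGT", "TTACGTAA", "GGGGACGT"]
phrases = ["ACGT", "AC", "TACG", "ACGT"]
ranked = score_subpatterns(phrases, patterns, min_length=3)
```

Because the expected count falls by a factor of four with every added nucleotide under this model, a modest increase in the threshold sharply reduces the number of candidate positions that later need verification.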
Methods – Doing the search
Search for the preprocessed subpatterns in the CSA using locate
  – locate runs in O(m log n + occ · log^ε n) time, where m is the query length, n the text length, occ the number of occurrences, and 0 < ε < 1 sets the space-time tradeoff
Extend the initial matches of these subpatterns with character-by-character comparison to check whether the full pattern matches exactly
  – This is done for each pattern that includes the subpattern
Stop once a set number of patterns has been handled with this approach
Finish the search by running locate on the remaining full patterns
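The search procedure can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used in the work: `locate` here is a naive stand-in that scans a plain string instead of querying a CSA, the function and parameter names are hypothetical, and each subpattern is assumed to come with a list of (pattern id, offset) pairs telling where it sits inside each pattern.

```python
def locate(text, query):
    # Naive stand-in for the CSA locate: all start positions of query in text.
    hits = []
    start = text.find(query)
    while start != -1:
        hits.append(start)
        start = text.find(query, start + 1)
    return hits


def multi_pattern_search(text, patterns, subpatterns, max_handled):
    # subpatterns: list of (subpattern, [(pattern_id, offset), ...]) in priority order.
    results = {}
    for subpattern, uses in subpatterns:
        if len(results) >= max_handled:
            break
        hits = locate(text, subpattern)
        for pattern_id, offset in uses:
            if pattern_id in results or len(results) >= max_handled:
                continue
            pattern = patterns[pattern_id]
            # Extend every subpattern hit to a candidate start of the full
            # pattern and verify it by direct comparison against the text.
            results[pattern_id] = [
                hit - offset
                for hit in hits
                if hit - offset >= 0 and text.startswith(pattern, hit - offset)
            ]
    # Remaining patterns are searched in full with locate.
    for pattern_id, pattern in enumerate(patterns):
        if pattern_id not in results:
            results[pattern_id] = locate(text, pattern)
    return results


text = "GATTACAGATTACATTGACA"
patterns = ["ATTACAG", "TTGACA", "CAGATT"]
subpatterns = [("TTAC", [(0, 1)])]  # "TTAC" starts at offset 1 of pattern 0
found = multi_pattern_search(text, patterns, subpatterns, max_handled=1)
```

A single offset per pattern is enough for correctness: every occurrence of the full pattern contains the subpattern at that offset, so it shows up among the subpattern hits.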
Data
The 50 MB DNA text was retrieved from the Pizza & Chili website
1000 patterns of length 1000 nucleotides were generated from this text at random
  – That is, substrings of the text were taken from random locations
It later became apparent that every pattern occurs only once in the text, which would not necessarily always be the case
Data
The patterns were searched in the text index as described in the methods section
Five different thresholds were used for the minimum length of a subpattern: 25, 28, 30, 33 and 35
For each threshold, the number of patterns handled by locating subpatterns was controlled by ending this phase after 100, 300 or 500 patterns had been handled
  – However, the subpatterns did not always occur in the full allowed number of patterns, so the number of patterns handled by locating and extending subpatterns was lower in some runs
The time taken by these runs was compared to searching for all of the patterns with the locate function of the CSA
Multi-pattern search on CSA
Results, set of 1000 patterns
Minimum subpattern length (Msl) = 30 → 14.0 % decrease in run time
Multi-pattern search on CSA
Results, level of individual patterns
Msl = 35 → time per pattern was 71.6 % less than with the traditional CSA
Results
With the new method, searching for the subpatterns generally took around 85% of the run time, while checking for exact matches took around 15%
Memory consumption is not notably different
  – Phrases and their pattern-related information have to be stored, but in practice this takes far less memory than the CSA
Total preprocessing time for the set of patterns was roughly 0.8 s
Concluding remarks
As the minimum subpattern length is increased, the average time taken per pattern decreases
Interestingly, the average time per pattern also decreases when more patterns are handled by the proposed method
  – This suggests that the subpatterns occurring most commonly in the pattern set are not the best choices
A more sophisticated method for choosing subpatterns shared by multiple patterns would be helpful
  – The proposed method would work on independent and identically distributed text, but DNA definitely does not have these properties
Future work
Consider the k-mer distributions of the subpatterns and compare them to the k-mer distribution of the text
  – If the k-mer distribution of the text is unknown, sampling or other methods could be used
  – Hopefully this would lead to better estimates of the probability that a subpattern occurs in the text
More work remains to be done on the sorting of the subpatterns
This approach could be implemented for searches using other index structures as well
  – Any index where the time taken by locate correlates strongly with the length of the query should work well
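One possible way to turn k-mer counts into an occurrence estimate is sketched below. This is an assumption about how the idea could be realized, not something evaluated in this work: it uses a Markov-chain estimator that multiplies the counts of the overlapping k-mers of a subpattern and divides by the counts of their (k-1)-mer overlaps. The function names are hypothetical.

```python
from collections import Counter


def kmer_counts(text, k):
    # Count every length-k substring of the text.
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def estimate_occurrences(text, subpattern, k):
    # Markov-chain estimate of how often subpattern occurs in text:
    # product of its k-mer counts divided by the counts of the
    # (k-1)-mers shared by consecutive k-mers.
    if k < 2 or len(subpattern) < k:
        raise ValueError("need 2 <= k <= len(subpattern)")
    kmers = kmer_counts(text, k)
    overlaps = kmer_counts(text, k - 1)
    estimate = 1.0
    for i in range(len(subpattern) - k + 1):
        estimate *= kmers[subpattern[i:i + k]]
        if i > 0:
            overlap = overlaps[subpattern[i:i + k - 1]]
            if overlap == 0:
                return 0.0
            estimate /= overlap
    return estimate


text = "ACGACGACG"
common = estimate_occurrences(text, "ACG", k=2)
absent = estimate_occurrences(text, "ACT", k=2)
```

Subpatterns with a low estimate would be preferred, since fewer hits in the text mean fewer candidate positions to verify. When the text is only sampled, the counts would need to be scaled accordingly.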