Download presentation
Presentation is loading. Please wait.
1
Blockwise Suffix Sorting for Space-Efficient Burrows-Wheeler Ben Langmead Based on work by Juha Kärkkäinen
2
Motivation Burrows-Wheeler Transformation (BWT) of a large text allows: –Fast exact matching –Compact representation (compared to suffix tree/array) –More readily compressible (basis of bzip ) The FM Index exploits an indexed and compressed BWT to allow: –Exact matching in time linear in the size of the pattern –Memory footprint as much as 50% smaller than original string FM Index and related techniques may allow us to “map reads” (match a large set of small patterns) in a single pass over the reads on a typical workstation without spilling onto the hard disk
3
Background Recall that BWT is derived from the Burrows-Wheeler matrix, which is related to the Suffix array a c a a c g $g c $ a a a c Suffix array Burrows Wheeler Matrix Last column BWT Text
4
Problem Memory footprint of building and storing suffix array is much larger than the BWT itself –Human genome: SA: ~12 GB, BWT: ~0.8 GB –Attempt to build BWT over whole human genome on a 32 GB server exhausts memory and crashes (I tried)
5
Solution Kärkkäinen: “Fast BWT in Small Space by Blockwise Suffix Sorting” –Theoretical Computer Science, 387 (3), pp. 249-257, Sept. 2007 Observation: –BWT[i] depends only on SA[i], not on any other element of SA Corollary: –No need to keep all of SA in memory at once! Solution: –Build SA and BWT a small “chunk” or “block” at a time –Greatly reduces the memory overhead By something like a factor of B, where B = # of blocks
6
Solution Typical suffix sort:
7
Solution Blockwise suffix sort:
8
Solution Calculate and sort a random sample of the suffixes
9
Solution Samples are used as “bookends” for “buckets” ? $ B1B1 B2B2 B3B3 B4B4
10
Solution In B linear-time passes over the text (B = # buckets), sort all suffixes into buckets, one bucket at a time, then sort the bucket $ B1B1 B2B2 B3B3 B4B4 Pass 1
11
Solution After a bucket has been sorted and turned into a BWT segment, it is discarded Pass B B1B1 B2B2 B3B3 B4B4 $
12
Solution Good time bounds in the presence of long repeats require use of a difference cover sample –Acts like an oracle that determines relative lexicographical order of two suffixes that share a prefix of some length v
13
Project Goals Basic goal: –Write a correct, usable library implementing blockwise SA sort and BWT building –Characterize performance and time/space tradeoffs Stretch goals: –Fine-tune for performance and memory usage –Implement difference cover sample Question: is this necessary for good performance on real-life inputs?
14
Concluding Remarks BWT is one application of Blockwise Suffix Sort, but any information derived locally from SA rows (e.g. LCP information) can be made more space-efficient this way
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.