1 Pattern Processing and Searching In RAM Michael Robinson Ph.D. candidate Advisor: Dr. Giri Narasimhan School of Computing and Information Sciences BioRG Bioinformatics Research Group Florida International University SW 8th Street Miami, FL {mrobi002, Presented by Michael Robinson January 15, 2008 At: Florida International University Global CyberBridges, National Science Foundation Program Award Id: OCI October 1, 2006-December 31, 2009
2 Agenda
- What are Suffix Trees – Suffix Arrays
- Suffix Trees – Importance in Bioinformatics
- Main Memory Bottleneck
- Sadakane’s Compressed Suffix Tree Implementation
- Compressed Suffix Tree Problem
- Engineering a Compressed Suffix Tree Implementation: Authors' Improvements and Algorithm Results
- Example Required File: LCP Array Solution
- Implementation Design – Software
- My Current Work
- Experimental Results
- My Future Work
- References
3 Suffix Tree
Sequence = BANANAS
Suffixes: BANANAS, ANANAS, NANAS, ANAS, NAS, AS, S
[Figure: suffix tree of BANANAS]
Suffix Trees inventor: Peter Weiner
4 Suffix Array Implementation
Simplified version of a Suffix Array: the suffixes of the text in lexicographic order.
Sequence = ABRACADABRA

Suffix        Index  |  Index  Sorted suffix
ABRACADABRA     0    |   10    A
BRACADABRA      1    |    7    ABRA
RACADABRA       2    |    0    ABRACADABRA
ACADABRA        3    |    3    ACADABRA
CADABRA         4    |    5    ADABRA
ADABRA          5    |    8    BRA
DABRA           6    |    1    BRACADABRA
ABRA            7    |    4    CADABRA
BRA             8    |    6    DABRA
RA              9    |    9    RA
A              10    |    2    RACADABRA

[1] Suffix Arrays inventors: Udi Manber, Gene Myers 1989
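The sorted index column of this slide can be reproduced with a few lines of Python. This is only a minimal sketch: it sorts the suffixes directly, which is far too slow for genomic sequences, and the function name build_suffix_array is made up for this illustration.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of text in lexicographic order."""
    # Naive construction: compare the suffixes themselves while sorting.
    return sorted(range(len(text)), key=lambda i: text[i:])


if __name__ == "__main__":
    sequence = "ABRACADABRA"
    for index in build_suffix_array(sequence):
        # Print the start position followed by the suffix it points to.
        print(f"{index:2d}  {sequence[index:]}")
```

Only the integer positions are kept; each suffix is read back from the original sequence when needed.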
5 Suffix Trees Importance in Bioinformatics Biological Data Type (A C G T) vs. Search Engines Data (inverted) Applying Suffix Trees to Real Genomic Sequences is Impractical
6 Main Memory Bottleneck
Storing every suffix explicitly, at one byte per character:

Suffix        Storage (characters)
ABRACADABRA   11
BRACADABRA    10
RACADABRA      9
ACADABRA       8
CADABRA        7
ADABRA         6
DABRA          5
ABRA           4
BRA            3
RA             2
A              1
Total         66 = n(n+1)/2 = 11(12)/2

PA01, 6 million bases: ~18 terabytes
Human Genome, 3,164,700,000 nucleotides: (3,164,700,000 * 3,164,700,001)/2 = 5,007,663,046,582,350,000 characters = 5,007,663 terabytes
Suffix Arrays inventors: Udi Manber, Gene Myers 1989
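The arithmetic on this slide can be checked with a short Python sketch. It assumes one byte per character and decimal terabytes (10^12 bytes), which is what the slide's figures imply; the function names are made up for this illustration.

```python
TERABYTE = 10**12  # decimal terabyte, assumed from the slide's figures


def explicit_suffix_storage(n: int) -> int:
    """Characters needed to store every suffix of an n-character sequence."""
    return n * (n + 1) // 2


def summed_suffix_lengths(text: str) -> int:
    """Same quantity, obtained by adding up the length of each suffix."""
    return sum(len(text) - start for start in range(len(text)))


if __name__ == "__main__":
    print(summed_suffix_lengths("ABRACADABRA"))
    human_genome = 3_164_700_000
    total = explicit_suffix_storage(human_genome)
    print(total, "characters, about", total // TERABYTE, "terabytes")
```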
7 Sadakane’s Compressed Suffix Implementation
A = 00, C = 01, G = 10, T = 11
Storage = n log2 |Σ| bits = 2n bits (Σ = {A, C, G, T}), ~20% of original space

Suffix        Index  Compressed storage (bits)
ABRACADABRA     0    22 = 2n = n log2 |Σ|
BRACADABRA      1    20
RACADABRA       2    18
ACADABRA        3    16
CADABRA         4    14
ADABRA          5    12
DABRA           6    10
ABRA            7     8
BRA             8     6
RA              9     4
A              10     2
Total               132 bits = (n(n+1)/2)*2 bits

Unfortunately it is not linear: 100 mg ~ 5 gig
[2] Kunihiko Sadakane
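The 2-bit code on this slide can be sketched in Python as follows. This only illustrates the A = 00, C = 01, G = 10, T = 11 mapping; it is not Sadakane's implementation, and the names pack and unpack are made up for this illustration.

```python
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # BASES[code] gives back the base for a 2-bit code


def pack(sequence: str) -> int:
    """Pack a DNA string into one integer, 2 bits per base, first base in the highest bits."""
    value = 0
    for base in sequence:
        value = (value << 2) | CODES[base]
    return value


def unpack(value: int, length: int) -> str:
    """Recover a DNA string of the given length from its packed form."""
    bases = []
    for shift in range(2 * (length - 1), -1, -2):
        bases.append(BASES[(value >> shift) & 0b11])
    return "".join(bases)


if __name__ == "__main__":
    packed = pack("GTCAAGTC")
    print(f"{packed:016b}", unpack(packed, 8))
```

The length has to be stored separately, because leading A bases pack to zero bits.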
8 Compressed Suffix Tree Problem
Unfortunately the Suffix Tree is not linear: 100 mg ~ 5 gig

The Sequence is linear (Σ = {A, C, G, T}):
ACGT     = 4 bases = 2n bits =  8 bits = n log2 |Σ|
GTCAAGTC = 8 bases = 2n bits = 16 bits = n log2 |Σ|

But the Suffix Array is not:
ACGT     = 4 bases = (4(5)/2)*2 = 20 bits = (n(n+1)/2)*2
GTCAAGTC = 8 bases = (8(9)/2)*2 = 72 bits = (n(n+1)/2)*2

In the first data structure the 2nd sequence takes twice the space of the first one, but in the second data structure it takes more than three times the space of the first one.
It is 30% slower than non-compressed trees.
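The two growth rates on this slide can be compared directly with a small Python sketch, using 2 bits per base as on the slide. The function names are made up for this illustration.

```python
BITS_PER_BASE = 2  # log2 of the alphabet size for A, C, G, T


def sequence_bits(n: int) -> int:
    """Bits for the sequence itself: linear in n."""
    return BITS_PER_BASE * n


def suffix_bits(n: int) -> int:
    """Bits for all suffixes stored explicitly: (n(n+1)/2) * 2, quadratic in n."""
    return n * (n + 1) // 2 * BITS_PER_BASE


if __name__ == "__main__":
    for n in (4, 8, 16, 2048):
        # Doubling n doubles the first column but roughly quadruples the second.
        print(n, sequence_bits(n), suffix_bits(n))
```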
9 Engineering a Compressed Suffix Tree Implementation
Authors' Improvements and Algorithm Results

Algorithm results (Σ = {A, G, C, T}):

Space                  Authors: |Σ| log2 n   Sadakane: n log2 |Σ| = 2n bits
GTCA                   4*2  =  8 bits        2*4    =    8 bits
GTCAAGTC               4*3  = 12 bits        2*8    =   16 bits
TACAAGTAGTCAAGTC       4*4  = 16 bits        2*16   =   32 bits
A 2048 base sequence   4*11 = 44 bits        2*2048 = 4096 bits

Space needed during construction: 1.4 times final space
Authors created an Abstract Suffix Array using: Succinct Suffix Array, based on Wavelet Tree (for sound), built on Burrows-Wheeler transform
[3] Niko Välimäki
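The Burrows-Wheeler transform named on this slide can be illustrated with a naive Python sketch. This is not the authors' construction: it sorts suffixes directly, and it assumes a sentinel character "$" that does not occur in the text and sorts before every other character (true for ASCII letters).

```python
def burrows_wheeler_transform(text: str, sentinel: str = "$") -> str:
    """Return the BWT of text, computed from a naively sorted suffix array."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    terminated = text + sentinel
    suffix_array = sorted(range(len(terminated)), key=lambda i: terminated[i:])
    # Each output character is the one preceding a sorted suffix;
    # index -1 wraps around to the sentinel for the suffix starting at 0.
    return "".join(terminated[i - 1] for i in suffix_array)


if __name__ == "__main__":
    print(burrows_wheeler_transform("ABRACADABRA"))
```

The transform tends to group equal characters into runs, which is what makes the result easier to compress than the original text.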
10 Example Required File: LCP Array Solution
A useful additional data structure: an array of the lengths of the Longest Common Prefixes between each suffix and its predecessor in the Suffix Array (the lexicographically ordered text).
Sequence = ABRACADABRA

Suffix        Index  |  Index  LCP  Sorted suffix
ABRACADABRA     0    |   10     0   A
BRACADABRA      1    |    7     1   ABRA
RACADABRA       2    |    0     4   ABRACADABRA
ACADABRA        3    |    3     1   ACADABRA
CADABRA         4    |    5     1   ADABRA
ADABRA          5    |    8     0   BRA
DABRA           6    |    1     3   BRACADABRA
ABRA            7    |    4     0   CADABRA
BRA             8    |    6     0   DABRA
RA              9    |    9     0   RA
A              10    |    2     2   RACADABRA

Suffix Arrays inventors: Udi Manber, Gene Myers 1989
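The LCP definition on this slide can be sketched in Python by comparing each sorted suffix with its predecessor character by character. This is a naive illustration, not a linear-time construction; the function names are made up, and the first entry is set to 0 by assumption because the first suffix has no predecessor.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of text in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_length(a: str, b: str) -> int:
    """Number of leading characters that a and b share."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def build_lcp_array(text: str, suffix_array: list[int]) -> list[int]:
    """LCP of each sorted suffix with its predecessor; entry 0 is set to 0."""
    lcp = [0] * len(suffix_array)
    for rank in range(1, len(suffix_array)):
        previous = text[suffix_array[rank - 1]:]
        current = text[suffix_array[rank]:]
        lcp[rank] = common_prefix_length(previous, current)
    return lcp


if __name__ == "__main__":
    sequence = "ABRACADABRA"
    sa = build_suffix_array(sequence)
    for index, lcp in zip(sa, build_lcp_array(sequence, sa)):
        print(f"{index:2d} {lcp:2d}  {sequence[index:]}")
```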
11 Implementation Design – Software
- C++, object oriented.
- Each data structure is its own class.
- Generic code, i.e. from Sadakane, to retrieve short sequences.
- New code for construction and for retrieving long sequences.
- Tailored code is as time/space efficient as generic code.
12 My Current Work - Approach
- Dissertation, not published yet.
- Suffix Arrays approach.
- Google’s construction approach.
- Construction time and space problems:
  Sequences from 11 bases to PA01 with 6.2 million bases.
  PA01: run on 7 different computers; fastest time 5 days.
- All files contain uncompressed information.
- Space required: sequence file = n; one index file = from n to ~8 times the PA01 sequence size.
- Loading time and RAM space problems.
- Solution: break the index file into 64, 1024 sub-indexes, improving loading and processing times and allowing processing of larger sequences.
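The slide does not say how the index file is split. One possible scheme, assumed here purely for illustration, groups suffix positions by the first k bases of each suffix: with the alphabet A, C, G, T, k = 3 gives at most 4^3 = 64 full-length groups and k = 5 gives at most 4^5 = 1024. The function name split_index_by_prefix is hypothetical.

```python
from collections import defaultdict


def split_index_by_prefix(sequence: str, k: int) -> dict[str, list[int]]:
    """Group suffix start positions by the first k bases of each suffix.

    Assumed scheme, not taken from the slides. Suffixes shorter than k
    (at the very end of the sequence) end up under their own shorter keys.
    """
    sub_indexes = defaultdict(list)
    for position in range(len(sequence)):
        sub_indexes[sequence[position:position + k]].append(position)
    # Keep each sub-index in suffix order, like a small suffix array.
    for positions in sub_indexes.values():
        positions.sort(key=lambda i: sequence[i:])
    return dict(sub_indexes)


if __name__ == "__main__":
    sub_indexes = split_index_by_prefix("acgttgacgttg", k=3)
    # A probe of at least k bases only needs the sub-index for its first k bases.
    print(sub_indexes["acg"])
```

Under this assumption a search loads one small sub-index instead of the whole index file.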
13 My Current Work - Applications
- Finding Patterns: how many times, and where, a probe appears in a given sequence
  acgttg ….. acgttg ….. acgttg ….. acgttg ….. acgttg ….. acgttg
- Finding Inverted Patterns: same as Finding Patterns, plus the inverted probe
  acgttg ….. gttgca ….. acgttg ….. gttgca ….. acgttg ….. gttgca
- Finding Inverted Reciprocal Patterns:
  acgttg ….. caacgt ….. acgttg ….. caacgt ….. acgttg ….. caacgt
- The above programs generate a text file report for further processing
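The three searches on this slide can be sketched in Python with a binary search over a suffix array. This is a minimal illustration, not the dissertation's implementation: the suffix array is built naively, the function names are made up, and "inverted reciprocal" is read as the reverse complement, which matches the slide's example (acgttg gives caacgt).

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def build_suffix_array(sequence: str) -> list[int]:
    """Start positions of all suffixes in lexicographic order (naive sort)."""
    return sorted(range(len(sequence)), key=lambda i: sequence[i:])


def find_pattern(sequence: str, suffix_array: list[int], probe: str) -> list[int]:
    """Return the sorted start positions where probe occurs in sequence."""
    m = len(probe)

    def prefix(rank: int) -> str:
        start = suffix_array[rank]
        return sequence[start:start + m]

    # First rank whose m-character prefix is >= probe.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < probe:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First rank whose m-character prefix is > probe.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= probe:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


def inverted(probe: str) -> str:
    """The probe read backwards."""
    return probe[::-1]


def inverted_reciprocal(probe: str) -> str:
    """The probe read backwards with each base complemented."""
    return probe[::-1].translate(COMPLEMENT)


if __name__ == "__main__":
    sequence = "acgttg" + "aa" + "gttgca" + "cc" + "caacgt" + "acgttg"
    sa = build_suffix_array(sequence)
    for probe in ("acgttg", inverted("acgttg"), inverted_reciprocal("acgttg")):
        hits = find_pattern(sequence, sa, probe)
        print(probe, len(hits), hits)
```

Each search reports both the count and the positions, which is the information the text file report needs.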
14 Sadakane’s Experimental Results
Using: one 2.4 GHz Pentium 4 computer with 1 GB RAM, Red Hat OS. Programs compiled using g++ (GCC)
15 Engineering a Compressed Suffix Tree Implementation Experimental Results
16 My Current Work - Results
Finding Patterns in PA01: 6,264,404 bases
[Table: Sequence, Bases, Seconds for the probes a, aa, aaa, aaaa, aaaaa, c, cc, ccc, cccc, ccccc, g, gg, ggg, gggg, ggggg, t, tt, ttt, tttt, ttttt]
17 My Future Work
Do Construction for: the Human Genome and all Pseudomonas aeruginosa bacteria.
Consensus Pattern Search: to solve the Bioinformatics Consensus Problem to n-1 of a given probe. At the present time there are applications that solve this problem to value 3.
For a probe with 50 bases, with alphabet A C G T, if we check for 3 mutations we need to do 4 * 4^(mutations-1) = 4 * 4^2 = 64 pattern searches for each group of 3 bases.
For a sequence of 3.6 billion bases, a probe of 1,000 bases, and a mutation rank of n-1, we need to do 4 * 4^(mutations-1) pattern searches, with mutations = n-1, on the full sequence. Excel can only calculate such powers of 4 up to the order of E+307.
The only way to do this work is with a Distributed System.
Solving this problem for proteins will require more time because proteins have an alphabet of length 20.
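The search counts on this slide can be reproduced with Python's arbitrary-precision integers, which do not overflow the way a spreadsheet's floating-point numbers do. This is a sketch of the slide's formula only; the function name is made up, and applying the same formula to a 20-letter protein alphabet is an assumption.

```python
def pattern_searches(mutations: int, alphabet_size: int = 4) -> int:
    """Searches needed per group: alphabet_size * alphabet_size^(mutations - 1)."""
    return alphabet_size * alphabet_size ** (mutations - 1)


if __name__ == "__main__":
    print(pattern_searches(3))
    # Mutation rank n-1 for a probe of n = 1,000 bases.
    huge = pattern_searches(999)
    print(len(str(huge)), "decimal digits")
    # Assumed extension of the same formula to proteins (alphabet of 20).
    print(pattern_searches(3, alphabet_size=20))
```

The count for a 1,000-base probe is far beyond what a double-precision float can hold, so the result has to be kept as an integer.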
18 Conclusions
Due to advances in computer hardware and falling prices, today a 1.5 terabyte hard disk costs around 400 US dollars. Recent implementations of Suffix Trees and Suffix Arrays concentrate on compressing the data, causing large delays in user processing. We believe the previous hard disk space bottleneck has been resolved; therefore compressing data on hard disk is no longer necessary, especially when user applications slow down by factors of 30 for Suffix Trees, and by an additional factor of log n for Suffix Arrays, when compared to uncompressed data. With operating systems now able to access 128 gigabytes of RAM in workstations, and with advances in Distributed (Grid) Computing, we believe that by using uncompressed data with new methods like our implementation, we can produce applications that were not possible before.
21 References
[1] Udi Manber and Gene Myers (1991). "Suffix arrays: a new method for on-line string searches". SIAM Journal on Computing, Volume 22, Issue 5 (October 1993), pp
[2] Kunihiko Sadakane, Department of Computer Science and Communication Engineering, Kyushu University, Hakozaki, Higashi-ku, Fukuoka, Japan
[3] Niko Välimäki (1), Wolfgang Gerlach (2), Kashyap Dixit (3), and Veli Mäkinen (1). "Engineering a Compressed Suffix Tree Implementation". (1) Department of Computer Science, University of Helsinki, Finland; (2) Technische Fakultät, Universität Bielefeld, Germany; (3) Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, India
22 Questions Thank you!! Presented by: Michael Robinson Florida International University January 15, 2007