SSAHA: A Fast Search Method For Large DNA Databases Zemin Ning, Anthony J. Cox and James C. Mullikin Seminar by: Gerry Kammerer © ETH Zürich | Gerry Kammerer
Human Genome
Outline
- Introduction
- DNA and DNA sequences
- The problem and some approaches
- The SSAHA-approach
- Conclusions
DNA
- Deoxyribonucleic acid
- Contains genetic instructions
- Double helix
- Long polymer of simple units (nucleotides)
- Backbone made of sugars and phosphate
- One of four types of bases is attached to each sugar
- The sequence of these four bases encodes information
DNA sequence
- Base pair: bases from each strand form bonds
- DNA sequence: succession of letters
  - Adenine (A), Cytosine (C), Guanine (G), Thymine (T)
- Length measured in gigabases (Gb) or giga base pairs (Gbp)
The Problem
- Sequence comparison (exact / approximate)
- Through comparison, draw conclusions about
  - Structure
  - Function
  - Cooperation of components
- Sequencing produces multiple megabytes of data per day
- The large volume of queries and data overstrains existing techniques
  - Results are not found in reasonable time, or are not exact enough
Approaches
- Dynamic programming (first approaches)
  - Needleman & Wunsch, 1970
  - Refinement: Smith & Waterman, 1981 (most popular)
- BLAST (Basic Local Alignment Search Tool)
  - Altschul et al., 1990
  - Faster, but less accurate
  - Family of programs
- Suffix tree algorithms
  - Need too much memory
SSAHA-approach
- Uses hash table structures
- Needs much memory (but nowadays we have more RAM!)
- Still significantly less memory than suffix tree methods
- Orders of magnitude faster than BLAST
Definitions
- Query Q = "GGATCCCCTG"
- DB = S1, S2, S3, S4, ... (DNA sequences)
- k-tuple: k consecutive bases, e.g. the 4-tuple "GGAT"
- A sequence S of length n has (n - k + 1) overlapping k-tuples
- (i, j) references a k-tuple
  - i is the index of the sequence
  - j is the offset within the sequence
- Example DB:
  S1 = "GGATCCCCTG"
  S2 = "TGCAACAT"
  S3 = "AACATCCTGGG"
- Example: the 2-tuple (2,3) is "AA" in S2 (offsets counted from 0)
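The definition of overlapping k-tuples can be sketched in a few lines of Python. This is only an illustration: the function name `k_tuples` is made up for this example, and offsets are assumed to start at 0.

```python
def k_tuples(seq, k):
    """Return (offset, k-tuple) pairs for all overlapping k-tuples of seq.

    Offsets are 0-based (an assumption of this sketch).
    """
    return [(j, seq[j:j + k]) for j in range(len(seq) - k + 1)]


s2 = "TGCAACAT"                 # S2 from the example DB
tuples = k_tuples(s2, 2)
print(len(tuples))              # n - k + 1 = 8 - 2 + 1 = 7
print(tuples[3])                # (3, 'AA'), i.e. the 2-tuple referenced by (2,3)
```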
Hash table construction
- k-tuples: only 4^k possible (as we have four bases)
- List of positions L: positions (i, j) of all k-tuples, sorted by k-tuple
- Array A: pointers into L (which positions in L belong to which k-tuple)
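Because there are only 4^k possible k-tuples, each one can be mapped to an integer in the range 0 to 4^k - 1 and used directly as an index into A. The slides do not spell out the encoding, so the sketch below makes two assumptions: the bases are ordered A < C < G < T, and a k-tuple is read as a base-4 number. The names `BASE_CODE` and `encode` are illustrative only.

```python
# Assumed ordering of the bases: A=0, C=1, G=2, T=3 (two bits per base).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(ktuple):
    """Map a k-tuple to an integer in [0, 4**k) by reading it as a base-4 number."""
    value = 0
    for base in ktuple:
        value = value * 4 + BASE_CODE[base]
    return value


print(encode("AA"))    # 0, the smallest 2-tuple
print(encode("TT"))    # 15, the largest 2-tuple (4**2 - 1)
print(encode("GGAT"))  # 163
```

With this ordering, sorting k-tuples by their code is the same as sorting them alphabetically, which matches a list L that is sorted by k-tuple.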
Hash table construction (ctd.)
Example DB (1-tuples):
  S1 = "GGATCC"   S2 = "TGCAAC"   S3 = "AATA"
Array A: A = [0, 6, 10, 13]
  A = 0   C = 6   G = 10   T = 13
List of positions L:
  A:  0:(1,2)   1:(2,3)   2:(2,4)   3:(3,0)   4:(3,1)   5:(3,3)
  C:  6:(1,4)   7:(1,5)   8:(2,2)   9:(2,5)
  G: 10:(1,0)  11:(1,1)  12:(2,1)
  T: 13:(1,3)  14:(2,0)  15:(3,2)
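The construction of L and A can be sketched as follows. This is a simplified illustration, not the implementation from the paper: it assumes sequences are numbered from 1, offsets from 0, that the input contains only the letters A, C, G and T, and that k-tuples are ordered alphabetically. The function name `build_hash_table` is hypothetical.

```python
from itertools import product


def build_hash_table(sequences, k):
    """Build the position list L and the pointer array A for all overlapping k-tuples.

    L holds (i, j) positions grouped by k-tuple in alphabetical order;
    A[n] is the index in L where the positions of the n-th k-tuple start.
    """
    # All 4**k possible k-tuples, in alphabetical (A < C < G < T) order.
    all_tuples = ["".join(p) for p in product("ACGT", repeat=k)]
    buckets = {t: [] for t in all_tuples}

    for i, seq in enumerate(sequences, start=1):   # sequence index starts at 1
        for j in range(len(seq) - k + 1):          # offset starts at 0
            buckets[seq[j:j + k]].append((i, j))

    L, A = [], []
    for t in all_tuples:
        A.append(len(L))        # pointer to the first position of this k-tuple
        L.extend(buckets[t])
    return L, A


L, A = build_hash_table(["GGATCC", "TGCAAC", "AATA"], k=1)
print(A)
for n, position in enumerate(L):
    print(n, position)
```

The positions of the n-th k-tuple are then the slice of L from A[n] up to A[n+1] (or to the end of L for the last k-tuple).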
Sequence Search
- Query Q = "GAAT..." (a DNA sequence)
- Process the k-tuples of Q base by base
  - E.g. with 2-tuples: "GA", "AA", "AT", ...
- Construct hits (i, k, j)
  - (i, j) is a position of the current k-tuple (from the hash table)
  - k = j - (offset of the current k-tuple in Q); in a hit, k denotes this shift, not the tuple length
- n entries in the hash table for the current k-tuple give n hits
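A minimal sketch of the hit construction, under some assumptions: a Python dictionary stands in for the arrays A and L, the tuple length is called `k` and the second component of a hit is called `shift` to avoid the name clash, and the function names are invented for this example.

```python
from collections import defaultdict


def build_index(sequences, k):
    """Map each k-tuple to its list of (i, j) positions (a stand-in for A and L)."""
    index = defaultdict(list)
    for i, seq in enumerate(sequences, start=1):   # sequence index starts at 1
        for j in range(len(seq) - k + 1):          # offset starts at 0
            index[seq[j:j + k]].append((i, j))
    return index


def find_hits(query, index, k):
    """Return hits (i, shift, j) where shift = j - (offset of the k-tuple in the query)."""
    hits = []
    for t in range(len(query) - k + 1):            # t = offset in the query
        for i, j in index.get(query[t:t + k], []):
            hits.append((i, j - t, j))
    return hits


index = build_index(["GGATCC", "TGCAAC", "AATA"], k=1)
for hit in find_hits("AT", index, k=1):
    print(hit)
```

Every position stored for a k-tuple produces exactly one hit, so very frequent k-tuples dominate the length of the hit list.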
Sequence Search (ctd.)
- Sort the hits (i, k, j): first by i, then by k, then by j
- Let us have a look at a small example!
- Query Q = "AT"
Remember
Example DB (1-tuples):
  S1 = "GGATCC"   S2 = "TGCAAC"   S3 = "AATA"
Array A: A = [0, 6, 10, 13]
  A = 0   C = 6   G = 10   T = 13
List of positions L:
  A:  0:(1,2)   1:(2,3)   2:(2,4)   3:(3,0)   4:(3,1)   5:(3,3)
  C:  6:(1,4)   7:(1,5)   8:(2,2)   9:(2,5)
  G: 10:(1,0)  11:(1,1)  12:(2,1)
  T: 13:(1,3)  14:(2,0)  15:(3,2)
Query Q = "AT"
Sequence Search Example
Example DB (1-tuples), query Q = "AT", processed base by base
First base "A" (query offset 0), positions from L:
  0:(1,2)   1:(2,3)   2:(2,4)   3:(3,0)   4:(3,1)   5:(3,3)
Hits:
  (1,2,2) (2,3,3) (2,4,4) (3,0,0) (3,1,1) (3,3,3)
Sequence Search Example (ctd.)
Example DB (1-tuples), query Q = "AT", processed base by base
Second base "T" (query offset 1), positions from L:
  13:(1,3)  14:(2,0)  15:(3,2)
Hits from "A": (1,2,2) (2,3,3) (2,4,4) (3,0,0) (3,1,1) (3,3,3)
Hits from "T": (1,2,3) (2,-1,0) (3,1,2)
Sequence Search Example (ctd.)
Example DB (1-tuples), query Q = "AT"
Sorted hits:
  (1,2,2) (1,2,3) (2,-1,0) (2,3,3) (2,4,4) (3,0,0) (3,1,1) (3,1,2) (3,3,3)
Sequence Search Example (ctd.)
Example DB (1-tuples):
  S1 = "GGATCC"   S2 = "TGCAAC"   S3 = "AATA"
Query Q = "AT"
Sorted hits:
  (1,2,2) (1,2,3) (2,-1,0) (2,3,3) (2,4,4) (3,0,0) (3,1,1) (3,1,2) (3,3,3)
Same i and k in consecutive hits: run of matching bases
  (1,2,2) (1,2,3): "AT" at offsets 2 to 3 of S1
  (3,1,1) (3,1,2): "AT" at offsets 1 to 2 of S3
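The sort-and-scan step of this example can be reproduced with a short Python sketch. To keep it self-contained, the hits are generated by a brute-force scan of the three example sequences rather than by a hash table lookup; that shortcut, the name `find_runs` and the `min_hits` parameter are assumptions of this illustration.

```python
from itertools import groupby

sequences = ["GGATCC", "TGCAAC", "AATA"]   # example DB, 1-tuples
query = "AT"

# Brute-force stand-in for the hash table lookup: one hit (i, shift, j)
# for every database base that equals the query base at offset t.
hits = [(i, j - t, j)
        for t, base in enumerate(query)
        for i, seq in enumerate(sequences, start=1)
        for j, b in enumerate(seq) if b == base]


def find_runs(hits, min_hits=2):
    """Sort hits by (i, shift, j) and group those that share i and shift."""
    runs = []
    for (i, shift), group in groupby(sorted(hits), key=lambda h: (h[0], h[1])):
        offsets = [j for _, _, j in group]
        if len(offsets) >= min_hits:
            runs.append((i, shift, offsets))
    return runs


for i, shift, offsets in find_runs(hits):
    print(f"S{i}: shift {shift}, offsets {offsets}")
```

Tuples sort lexicographically in Python, so `sorted(hits)` orders by i first, then by shift, then by j, which is the order the slides describe.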
Sequence Search Summary
- Run of matching bases: region of exact matches
- Gapped matches
- Only finds matches in the forward direction!
  - Use the reverse complement of the query to find matches in the reverse direction
- Example: 3-tuples, 9-base query
  Hits: (3,9,9) (3,9,12) (3,9,15) (5,3,3) (5,3,9)
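Searching the reverse direction is sketched below under the assumption that "reversing" the query means taking its reverse complement, which is the usual way to search the opposite DNA strand. The function name `reverse_complement` is illustrative.

```python
# Translation table for complementing bases: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence given in upper-case ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("GAAT"))        # ATTC
print(reverse_complement("GGATCCCCTG"))  # CAGGGGATCC
```

Running the same forward search once with the query and once with its reverse complement then covers both directions.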
Memory Requirements
- Array A: 4 * 4^k = 4^(k+1) bytes
  - 32-bit pointers, 4^k possible k-tuples
- List L: 8 * W bytes
  - W = number of k-tuples in the database
- Reducing memory usage
  - Only consider non-overlapping k-tuples
  - Discard highly frequent k-tuples
  - Loss of accuracy!
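The two memory formulas are easy to evaluate directly. The sketch below also counts W for overlapping and for non-overlapping k-tuples, using the sequence lengths 10, 8 and 11 of the example DB from the Definitions slide. The function names are hypothetical, and counting non-overlapping k-tuples as n // k per sequence is an assumption of this sketch.

```python
def table_memory_bytes(k, w):
    """Return (bytes for array A, bytes for list L) following the slide's formulas."""
    array_a = 4 * 4 ** k   # one 4-byte (32-bit) pointer for each of the 4^k k-tuples
    list_l = 8 * w         # 8 bytes for each of the W stored positions
    return array_a, list_l


def count_k_tuples(lengths, k, overlapping=True):
    """Number of k-tuples W for sequences of the given lengths."""
    if overlapping:
        return sum(max(n - k + 1, 0) for n in lengths)
    return sum(n // k for n in lengths)   # assumed: tuples start at offsets 0, k, 2k, ...


lengths = [10, 8, 11]                                      # S1, S2, S3
w_all = count_k_tuples(lengths, 2)                         # 26 overlapping 2-tuples
w_sparse = count_k_tuples(lengths, 2, overlapping=False)   # 14 non-overlapping 2-tuples
print(table_memory_bytes(2, w_all))      # (64, 208)
print(table_memory_bytes(2, w_sparse))   # (64, 112)
```

Array A grows exponentially with k while L grows linearly with W, so for a large database L dominates at small k and A dominates once k becomes large.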
Search speed
- Search speed depends on
  - T_hash: building the hash tables
  - T_search: processing a specific query
- T_hash does not matter much
  - Computed once per DB (save to disk, server usage)
Optimise Search speed
- Sorting algorithm
  - In practice, sorting time is close to linear with quicksort
- Parameters k and W (tradeoff with accuracy)
  - Increase k (loss of sensitivity)
  - Reduce W by cutting off very frequently occurring k-tuples
  - Strong effect! (There exist highly repetitive k-tuples)
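Cutting off very frequent k-tuples can be illustrated with a small sketch. The slides do not say how the cutoff is chosen, so the `max_count` threshold is a hypothetical parameter, and a Python dictionary again stands in for the arrays A and L.

```python
from collections import defaultdict


def build_index(sequences, k):
    """Map each k-tuple to its list of (i, j) positions (a stand-in for A and L)."""
    index = defaultdict(list)
    for i, seq in enumerate(sequences, start=1):
        for j in range(len(seq) - k + 1):
            index[seq[j:j + k]].append((i, j))
    return index


def drop_frequent(index, max_count):
    """Keep only k-tuples that occur at most max_count times (hypothetical cutoff)."""
    return {t: positions for t, positions in index.items() if len(positions) <= max_count}


index = build_index(["GGATCC", "TGCAAC", "AATA"], k=1)
filtered = drop_frequent(index, max_count=4)

w_before = sum(len(p) for p in index.values())
w_after = sum(len(p) for p in filtered.values())
print(sorted(filtered), w_before, w_after)   # ['C', 'G', 'T'] 16 10
```

Dropping the single most frequent tuple removes 6 of the 16 stored positions here, which shows why the cutoff shortens both L and the hit lists, at the price of missing matches that rely on the discarded tuples.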
Experimental results (from paper)
- 2.7 Gb of human genome DNA
  - 292,016 sequences
- 177 query sequences
  - Containing 104,755 bases
- Compaq EV6 500 MHz processor, 16 GB RAM
Experimental results (ctd.)
        90%                  95%                  100%
k       T_hash    T_search   T_hash    T_search   T_hash    T_search
...     ...       102.5s     842.4s    128.8s     868.5s    389.5s
...     ...       26.3s      810.5s    36.1s      808.8s    199.1s
...     ...       7.3s       969.9s    11.0s      961.2s    119.0s
...     ...       2.2s       ...       ...        851.4s    78.7s
...     ...       0.9s       932.0s    2.5s       927.1s    51.6s
...     ...       0.1s       1015.5s   1.7s       999.2s    35.4s
Reasons for the speed
- Hashing the database
  - Search time is nearly independent of database size
  - BLAST, for example, hashes the query and scans the DB
- The human genome is far from random
  - Discarding highly repetitive k-tuples has a big effect
Conclusions
- Computers improved quickly
  - Cheaper, more powerful
  - More RAM available
- Hash the database
Questions?