Published by Jayson Conley. Modified over 8 years ago.
1
05.04.2008
SSAHA: A Fast Search Method For Large DNA Databases
Zemin Ning, Anthony J. Cox and James C. Mullikin
Seminar by: Gerry Kammerer
© ETH Zürich | Gerry Kammerer
2
Human Genome
3
Outline
- Introduction
- DNA and DNA sequences
- The problem and some approaches
- The SSAHA-approach
- Conclusions
5
DNA
- Deoxyribonucleic acid
- Contains genetic instructions
- Double helix
- Long polymer of simple units (nucleotides)
- Backbone made of sugars and phosphate
- One of four types of molecules (bases) is attached to each sugar
- The sequence of these four bases encodes information
6
DNA sequence
- Base pair
  - Bases from each strand form bonds
- DNA sequence
  - Succession of letters
  - Adenine, Cytosine, Guanine, Thymine
- Measured in giga bases (Gb) or giga base pairs (Gbp)
7
The Problem
- Sequence comparison (exact / approximate)
- Through comparison, draw conclusions about
  - Structure
  - Function
  - Cooperation of components
- Sequencing produces multiple megabytes of data per day
- The large volume of queries and data overwhelms existing techniques
  - Results are not found in reasonable time, or are not exact enough
8
Approaches
- Dynamic programming (first approaches)
  - Needleman & Wunsch, 1970
  - Refinement: Smith & Waterman, 1981 (most popular)
- BLAST (Basic Local Alignment Search Tool)
  - Altschul et al., 1990
  - Faster / less accurate
  - Family of programs
- Suffix tree algorithms
  - Need too much memory
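To make the dynamic-programming approach concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. The scoring values (match = 2, mismatch = -1, linear gap = -1) and the function name are assumptions chosen for illustration. They are not taken from the slides or from any particular tool.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of a and b (linear gap penalty).

    Scoring parameters are illustrative assumptions. Only the previous row
    of the dynamic-programming table is kept, so memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman_score("GGATCC", "AATA"))
```

Every cell of the table is visited, so the running time grows with the product of the two sequence lengths. That is what makes this approach expensive for database-scale searches.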
10
SSAHA-approach
- Use hash table structures
- Need much memory (nowadays we have more RAM!)
  - But significantly less than suffix tree methods!
- 3 - 4 orders of magnitude faster than BLAST
11
Definitions
- Query Q = "GGATCCCCTG"
- DB = S1, S2, S3, S4, ... (DNA sequences)
- k-tuple: a string of k consecutive bases, e.g. the 4-tuple "GGAT"
- A sequence S of length n has (n - k + 1) overlapping k-tuples
- (i, j) references a k-tuple
  - i is the index of the sequence
  - j is the offset in the sequence
- Example DB:
  S1 = "GGATCCCCTG"
  S2 = "TGCAACAT"
  S3 = "AACATCCTGGG"
- Example: the 2-tuple at (2,3) is "AA" (offsets counted from 0)
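The definitions can be checked with a few lines of Python. This is only a sketch: the helper name `ktuples` is hypothetical, and offsets are assumed to start at 0, as in the hash table example on the later slides.

```python
def ktuples(seq, k):
    """Yield (offset, k-tuple) for every overlapping k-tuple of seq.

    Offsets are 0-based (an assumption consistent with the later examples).
    """
    for j in range(len(seq) - k + 1):
        yield j, seq[j:j + k]


# Example DB from the slide, keyed by sequence index i.
db = {1: "GGATCCCCTG", 2: "TGCAACAT", 3: "AACATCCTGGG"}

tuples_s2 = dict(ktuples(db[2], 2))
print(len(tuples_s2))  # number of 2-tuples in S2, i.e. n - k + 1
print(tuples_s2[3])    # the 2-tuple referenced by (i, j) = (2, 3)
```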
12
Hash table construction
- k-tuples
  - Only 4^k possible k-tuples (as we have four bases)
- List of positions L
  - Positions of k-tuples (sorted by k-tuple)
- Array A
  - Pointers into L (which positions in L belong to which k-tuple)
13
Hash table construction (ctd.)
Example DB (1-tuples): S1 = "GGATCC", S2 = "TGCAAC", S3 = "AATA"
Array A: A = [0,6,10,13] (A = 0, C = 6, G = 10, T = 13)
List of positions L:
  A:  0:(1,2)  1:(2,3)  2:(2,4)  3:(3,0)  4:(3,1)  5:(3,3)
  C:  6:(1,4)  7:(1,5)  8:(2,2)  9:(2,5)
  G: 10:(1,0) 11:(1,1) 12:(2,1)
  T: 13:(1,3) 14:(2,0) 15:(3,2)
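The construction of A and L can be sketched as follows. The function names, the base-4 encoding (A = 0, C = 1, G = 2, T = 3, following the A, C, G, T order of the slide) and the `step` parameter are assumptions of this sketch, not the authors' implementation.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed encoding, two bits per base


def encode(ktuple):
    """Map a k-tuple to an integer in the range 0 .. 4**k - 1."""
    value = 0
    for base in ktuple:
        value = value * 4 + CODE[base]
    return value


def build_hash_table(db, k, step=1):
    """Return (A, L): A[e] is the index in L of the first position of k-tuple e.

    step=1 indexes overlapping k-tuples; step=k would index only
    non-overlapping ones.
    """
    buckets = [[] for _ in range(4 ** k)]
    for i, seq in enumerate(db, start=1):            # sequence indices start at 1
        for j in range(0, len(seq) - k + 1, step):   # offsets start at 0
            buckets[encode(seq[j:j + k])].append((i, j))
    A, L = [], []
    for bucket in buckets:
        A.append(len(L))
        L.extend(bucket)
    return A, L


A, L = build_hash_table(["GGATCC", "TGCAAC", "AATA"], k=1)
print(A)
print(L)
```

The positions of the k-tuple with code e are then the slice `L[A[e]:A[e + 1]]` (or `L[A[e]:]` for the last code), which is why A only needs one pointer per possible k-tuple.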
14
Sequence Search
- Query Q = "GAAT..." (a DNA sequence)
- Process the k-tuples of Q base by base
  - E.g. with 2-tuples: "GA", "AA", "AT", ...
- Construct hits (i,k,j)
  - (i, j) is a position of the current k-tuple (from the hash table)
  - k = j - (offset of the current k-tuple in Q); inside a hit, k denotes this shift, not the tuple length
- A k-tuple with n entries in the DB produces n hits
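A minimal sketch of hit construction, using the example DB from the Definitions slide with overlapping 2-tuples. A plain Python dict stands in for the array A and list L, and the function names are hypothetical.

```python
def build_index(db, k):
    """Map each k-tuple to its list of (i, j) positions (stand-in for A and L)."""
    index = {}
    for i, seq in enumerate(db, start=1):
        for j in range(len(seq) - k + 1):
            index.setdefault(seq[j:j + k], []).append((i, j))
    return index


def find_hits(index, query, k):
    """Return hits (i, shift, j), where shift = j - (offset of the k-tuple in Q)."""
    hits = []
    for t in range(len(query) - k + 1):  # t is the offset in the query
        for i, j in index.get(query[t:t + k], []):
            hits.append((i, j - t, j))
    return hits


db = ["GGATCCCCTG", "TGCAACAT", "AACATCCTGGG"]
hits = find_hits(build_index(db, 2), "GAAT", 2)
print(hits)
```

Each query k-tuple contributes exactly as many hits as it has entries in the index, so very frequent k-tuples dominate the size of the hit list.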
15
Sequence Search (ctd.)
- Sort the hits (i,k,j): first by i, then by k, then by j
- Let us have a look at a small example!
- Query Q = "AT"
16
Remember
Example DB (1-tuples): S1 = "GGATCC", S2 = "TGCAAC", S3 = "AATA"
Array A: A = [0,6,10,13] (A = 0, C = 6, G = 10, T = 13)
List of positions L:
  A:  0:(1,2)  1:(2,3)  2:(2,4)  3:(3,0)  4:(3,1)  5:(3,3)
  C:  6:(1,4)  7:(1,5)  8:(2,2)  9:(2,5)
  G: 10:(1,0) 11:(1,1) 12:(2,1)
  T: 13:(1,3) 14:(2,0) 15:(3,2)
Query Q = "AT"
17
Sequence Search Example
- Example DB (1-tuples) and list of positions L as on slide 16
- Query Q = "AT", processed base by base
- Hits for "A" (offset 0 in Q): (1,2,2) (2,3,3) (2,4,4) (3,0,0) (3,1,1) (3,3,3)
18
Sequence Search Example (ctd.)
- Example DB (1-tuples) and list of positions L as on slide 16
- Query Q = "AT", processed base by base
- Hits for "A" (offset 0 in Q): (1,2,2) (2,3,3) (2,4,4) (3,0,0) (3,1,1) (3,3,3)
- Hits for "T" (offset 1 in Q): (1,2,3) (2,-1,0) (3,1,2)
19
Sequence Search Example (ctd.)
- Example DB (1-tuples) and list of positions L as on slide 16
- Query Q = "AT", processed base by base
- Sorted hits: (1,2,2) (1,2,3) (2,-1,0) (2,3,3) (2,4,4) (3,0,0) (3,1,1) (3,1,2) (3,3,3)
20
Sequence Search Example (ctd.)
- Example DB (1-tuples): S1 = "GGATCC", S2 = "TGCAAC", S3 = "AATA"
- Query Q = "AT", processed base by base
- Sorted hits: (1,2,2) (1,2,3) (2,-1,0) (2,3,3) (2,4,4) (3,0,0) (3,1,1) (3,1,2) (3,3,3)
- Same i and k in consecutive sorted hits: run of matching bases
  - (1,2,2) (1,2,3): "AT" at offset 2 in S1
  - (3,1,1) (3,1,2): "AT" at offset 1 in S3
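The whole example (collect hits, sort them, look for runs with equal i and equal shift) can be reproduced with the sketch below. It recomputes the hits directly from S1, S2 and S3. The function names and the `min_len` parameter are assumptions made for illustration.

```python
from itertools import groupby


def find_hits(db, query, k):
    """Return hits (i, shift, j) for all k-tuples of query against db."""
    index = {}
    for i, seq in enumerate(db, start=1):
        for j in range(len(seq) - k + 1):
            index.setdefault(seq[j:j + k], []).append((i, j))
    return [(i, j - t, j)
            for t in range(len(query) - k + 1)
            for i, j in index.get(query[t:t + k], [])]


def find_runs(hits, min_len=2):
    """Sort hits by (i, shift, j) and return runs sharing the same i and shift."""
    runs = []
    for (i, shift), group in groupby(sorted(hits), key=lambda h: h[:2]):
        offsets = [j for _, _, j in group]
        if len(offsets) >= min_len:
            runs.append((i, shift, offsets))
    return runs


db = ["GGATCC", "TGCAAC", "AATA"]
hits = find_hits(db, "AT", 1)
print(sorted(hits))
print(find_runs(hits))
```

Sorting tuples lexicographically gives exactly the order "first by i, then by shift, then by j", so hits belonging to one run end up adjacent and a single linear pass finds them.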
21
Sequence Search Summary
- Run of matching bases
  - Region of exact matches
  - Gapped matches
- Only finds matches in the forward direction!
  - Reverse the query to find matches in the reverse direction
- Example: 3-tuples, 9-base query
  - Hits: (3,9,9) (3,9,12) (3,9,15) and (5,3,3) (5,3,9)
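The slide only says to reverse the query. For double-stranded DNA, searching the reverse direction is commonly done with the reverse complement of the query, and the sketch below assumes that interpretation. The function name is hypothetical.

```python
# Complement table for the four bases (A pairs with T, C pairs with G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


query = "AACAT"
print(reverse_complement(query))
# A second search pass would use this string as the query.
```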
22
Memory Requirements
- Array A: 4 * 4^k = 4^(k+1) bytes
  - 32-bit pointers, 4^k possible k-tuples
- List L: 8 * W bytes
  - W = number of k-tuples in the database
- Reduce memory usage
  - Only consider non-overlapping k-tuples
  - Discard highly frequent k-tuples
  - Loss of accuracy!
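The two formulas can be turned into a quick estimate. This is a sketch only: the function name is hypothetical, and the example value of W assumes non-overlapping 12-tuples over roughly 2.7 Gb of sequence, which is an illustrative guess rather than a figure from the slides.

```python
def ssaha_memory_bytes(k, num_ktuples):
    """Estimate hash table size: 4^(k+1) bytes for A plus 8 * W bytes for L."""
    array_bytes = 4 * 4 ** k      # one 32-bit pointer per possible k-tuple
    list_bytes = 8 * num_ktuples  # 8 bytes per stored position
    return array_bytes + list_bytes


k = 12
w = 2_700_000_000 // k  # assumed: non-overlapping k-tuples over about 2.7 Gb
print(ssaha_memory_bytes(k, 0) / 2 ** 20, "MiB for array A alone")
print(ssaha_memory_bytes(k, w) / 2 ** 30, "GiB in total")
```

The array term grows by a factor of four with every increment of k, while the list term shrinks when fewer k-tuples are stored, which is the tradeoff the slide points at.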
23
Search speed
- Search speed depends on
  - T_hash: building the hash tables
  - T_search: processing a specific query
- T_hash does not matter much
  - Computed once for one DB (save to disk, server usage)
24
Optimise search speed
- Sorting algorithm
  - In practice: running time is close to linear with quicksort
- Parameters k and W (tradeoff with accuracy)
  - Increase k (loss of sensitivity)
  - Reduce W by cutting off very frequently occurring k-tuples
  - Strong effect! (There exist highly repetitive k-tuples)
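Cutting off frequent k-tuples can be sketched as a filter applied while building the index. The parameter name `max_occurrences` and the dict-based index are assumptions of this sketch. The slides do not say how the cutoff is chosen.

```python
def build_index(db, k, max_occurrences=None):
    """Map each k-tuple to its (i, j) positions, dropping overly frequent ones.

    max_occurrences is a hypothetical cutoff: k-tuples with more entries
    than this are discarded, which shrinks W at the cost of sensitivity.
    """
    index = {}
    for i, seq in enumerate(db, start=1):
        for j in range(len(seq) - k + 1):
            index.setdefault(seq[j:j + k], []).append((i, j))
    if max_occurrences is not None:
        index = {t: pos for t, pos in index.items() if len(pos) <= max_occurrences}
    return index


db = ["GGATCC", "TGCAAC", "AATA"]
full = build_index(db, 1)
trimmed = build_index(db, 1, max_occurrences=4)
print(sum(len(p) for p in full.values()))     # W without cutoff
print(sum(len(p) for p in trimmed.values()))  # W after dropping frequent tuples
```

In this toy database only "A" exceeds the cutoff, yet removing it alone drops more than a third of all stored positions, and every query k-tuple equal to "A" then produces no hits at all.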
25
Experimental results (from paper)
- 2.7 Gb of human genome DNA
  - 292,016 sequences
- 177 query sequences
  - Containing 104,755 bases
- Compaq EV6 500 MHz processor, 16 GB RAM
26
Experimental results (ctd.)

      90%                  95%                  100%
k     T_hash    T_search   T_hash    T_search   T_hash    T_search
10    824.0s    102.5s     842.4s    128.8s     868.5s    389.5s
11    798.3s    26.3s      810.5s    36.1s      808.8s    199.1s
12    952.2s    7.3s       969.9s    11.0s      961.2s    119.0s
13    850.8s    2.2s       859.1s    4.5s       851.4s    78.7s
14    914.1s    0.9s       932.0s    2.5s       927.1s    51.6s
15    996.0s    0.1s       1015.5s   1.7s       999.2s    35.4s
28
Reasons for the speed
- Hashing the database
  - Search time is nearly independent of database size
  - BLAST, for example, hashes the query and scans the DB
- The human genome is far from random
  - Discarding highly repetitive k-tuples has a big effect
29
Conclusions
- Computers improved quickly
  - Cheaper, more powerful
  - More RAM available
- Hash the database
30
Questions?