Indexing DNA sequences for local similarity search. Joint work of Angela, Dr. Mamoulis and Dr. Yiu, 17/5/2007.


1 Indexing DNA sequences for local similarity search. Joint work of Angela, Dr. Mamoulis and Dr. Yiu, 17/5/2007.

2 Outline: Introduction (DNA sequences, local similarity search); Related work (BLAST); Prefix-suffix hashing scheme; Experimental results; Conclusion; Future work.

3 DNA sequences: DNA exists in the chromosomes of organisms. A genome is all the DNA in an organism. DNA is composed of 4 nucleotides: A, C, G, T. Humans have 23 pairs of chromosomes that amount to 3 billion bp (base pairs). Public DNA databases contain the genomes of organisms and related information.

4 DNA similarity: DNA sequences contain special regions, e.g. genes and motifs. Some regions are conserved across species. Similar regions may imply similar functions and structures. Given a sequence being studied (the query), search for similar regions in the database sequences.

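5 Similarity measurement: The alphabet is $\Sigma = \{A, C, G, T\}$. Sequence alignment aligns sequences S and T by inserting spaces into S and T to form S' and T'. A scoring matrix $\sigma$ gives the match/mismatch scores. Let x and y be two aligned characters or spaces from the two sequences, $x, y \in \Sigma \cup \{\text{space}\}$. Then

$$\sigma(x, y) = \begin{cases} R & \text{if } x = y \text{ and } x \neq \text{space} \\ P & \text{if } x \neq y \\ -\infty & \text{if } x = y = \text{space} \end{cases}$$

where R (reward) is positive and P (penalty) is negative.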

6 Gap penalty: A gap is a maximal run of consecutive spaces in an alignment. The affine gap penalty is $W_g + qW_s$, where $W_g$ and $W_s$ are constants, $W_g \ge 0$, $W_s \ge 0$, and $q \ge 1$ is the gap length. The penalty of one length-q gap is less than the penalty of q separate deletions/insertions.

7 DNA sequence alignments
Global alignment: Needleman-Wunsch algorithm (1970). Example:
    A C – G T T C A
    A C C G – – G A
Local alignment: Smith-Waterman algorithm (1981). Example: A C C G T A G C A C G T – C C A T A – – A C G –
Both use dynamic programming and find the optimal solution. Time and space complexity are O(mn), where m and n are the lengths of the two sequences.

8 Global alignment: Input: two sequences S and T. Output: the alignment of S and T with the highest score. V(i, j) is the optimal score for aligning S[1..i] and T[1..j].
Basis: V(0, 0) = 0, V(i, 0) = V(i-1, 0) + σ(S[i], –), V(0, j) = V(0, j-1) + σ(–, T[j]).
Recurrence: V(i, j) = max { V(i-1, j-1) + σ(S[i], T[j]), V(i-1, j) + σ(S[i], –), V(i, j-1) + σ(–, T[j]) }.
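
The recurrence can be tried out with a minimal Python sketch. It is an illustration rather than the presenters' code: the function name, the default reward and penalty values, and the choice to score a character against a space as a mismatch (following the rule σ(x, y) = P for x ≠ y) are assumptions. The first row and column are filled by accumulating those space scores, which is also an assumption of this sketch.

```python
def global_alignment_score(s, t, reward=1, penalty=-1):
    """Optimal global alignment score of s and t (Needleman-Wunsch style)."""

    def sigma(x, y):
        # Match/mismatch scoring; a character aligned with a space ("-")
        # is treated as a mismatch because x != y (assumption of this sketch).
        return reward if x == y else penalty

    m, n = len(s), len(t)
    v = [[0] * (n + 1) for _ in range(m + 1)]

    # Basis: aligning a prefix with the empty string costs one space per character.
    for i in range(1, m + 1):
        v[i][0] = v[i - 1][0] + sigma(s[i - 1], "-")
    for j in range(1, n + 1):
        v[0][j] = v[0][j - 1] + sigma("-", t[j - 1])

    # Recurrence: take the best of match/mismatch, space in t, space in s.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            v[i][j] = max(
                v[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                v[i - 1][j] + sigma(s[i - 1], "-"),
                v[i][j - 1] + sigma("-", t[j - 1]),
            )
    return v[m][n]
```

The table has (m + 1)(n + 1) cells, each filled in constant time, which is where the O(mn) time and space bound comes from.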

9 Local alignment: Input: two sequences S and T. Output: a substring A of S and a substring B of T, together with the score of the optimal (global) alignment of A and B, chosen so that this score is as high as possible. V(i, j) is the optimal score for aligning a substring of S ending at i with a substring of T ending at j.
Basis: V(i, 0) = 0, V(0, j) = 0.
Recurrence: V(i, j) = max { 0, V(i-1, j-1) + σ(S[i], T[j]), V(i-1, j) + σ(S[i], –), V(i, j-1) + σ(–, T[j]) }.
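
A minimal Python sketch of the local recurrence follows. It is illustrative only: the function name, the default scores and the returned tuple are assumptions, and a character aligned with a space is again scored as a mismatch. The only changes from the global version are the zero basis, the extra 0 in the max, and tracking the best cell anywhere in the table.

```python
def local_alignment_score(s, t, reward=1, penalty=-1):
    """Best local alignment score of s and t (Smith-Waterman style).

    Returns (score, end_i, end_j), where end_i and end_j are the 1-based
    positions in s and t at which the best local alignment ends.
    """

    def sigma(x, y):
        # A character aligned with a space ("-") counts as a mismatch (assumption).
        return reward if x == y else penalty

    m, n = len(s), len(t)
    v = [[0] * (n + 1) for _ in range(m + 1)]  # basis: row 0 and column 0 are 0
    best = (0, 0, 0)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            v[i][j] = max(
                0,  # start a fresh alignment here
                v[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                v[i - 1][j] + sigma(s[i - 1], "-"),
                v[i][j - 1] + sigma("-", t[j - 1]),
            )
            if v[i][j] > best[0]:
                best = (v[i][j], i, j)
    return best
```

Tracing back from the best cell until a 0 is reached would recover the substrings A and B; the sketch stops at the score to stay short.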

10 Local similarity search: Input: two DNA sequences. Output: the alignments of the regions from the two sequences that score higher than a score threshold.

11 Database search: Input: a query sequence and a sequence database. Output: the local similarity search results between the query sequence and each database sequence. Objectives: perform local similarity search fast; maintain search sensitivity.

12 BLAST: Basic Local Alignment Search Tool, by NCBI (National Center for Biotechnology Information) of the US Government. Finds regions of local similarity between sequences (DNA, RNA or proteins). Applies heuristics, so it is fast. Applies statistical theory, so it is relatively accurate.

13 Sample BLAST result

    Score = 44.4 bits (27), Expect = 0.013
    Identities = 37/47 (78%)
    Strand = Plus / Minus

    Query: 6         caggggtccaggcccccagcccctctcctgggcccctcaccccgcgg 52
                     |||||||||  |||||||||||  ||||| ||| || |   ||||||
    Sbjct: 199635477 caggggtccccgcccccagcccagctcctcggcaccccgggccgcgg 199635431

    Score = 44.4 bits (27), Expect = 0.013
    Identities = 35/43 (81%)
    Strand = Plus / Minus

    Query: 333       ccccgtttctcggatggaaaaactgaggctccgaaagcagaag 375
                     |||| |||| | ||||||||||||||||| | || ||  ||||
    Sbjct: 505025625 ccccatttcacagatggaaaaactgaggcccagagagaggaag 505025583

14 Sample BLAST result (search statistics)

    Matrix: blastn matrix: 1 -1
    Gap Penalties: Existence: 5, Extension: 2
    Number of Sequences: 1
    Number of Hits to DB: 2,526,608
    Number of extensions: 138741
    Number of successful extensions: 27
    Number of sequences better than 1.0: 1
    Number of HSP's gapped: 27
    Number of HSP's successfully gapped: 27
    Length of query: 375
    Length of database: 880,975,758
    Length adjustment: 44
    Effective length of query: 331
    Effective length of database: 880,975,714
    Effective search space: 291602961334
    Effective search space used: 291602961334

15 How BLAST works: A search is split into phases: hit generation, ungapped extension, gapped extension, and traceback. Configurable parameters: word length W, match reward R, mismatch penalty P, cutoff score S, dropoff score X, E-value threshold E.

16 Hit generation: Word hits have length W (default = 11). Database sequences are compressed with A = 00, C = 01, G = 10, T = 11, giving a compression factor of 4. Build a lookup table on sliding windows of the query sequence (4-sliding windows of length 8). Scan the compressed database sequence for exact matches present in the lookup table. Extend the exact matches of length 8 to W.
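
The 2-bit encoding and the query lookup table can be sketched in Python as below. This is an illustration, not BLAST's implementation: the function names, the bit order within a byte (first base in the high bits) and the padding of a final partial byte with A are all assumptions.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq):
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray()
    for start in range(0, len(seq), 4):
        # Pad the last chunk with A so it fills a whole byte (assumption).
        chunk = seq[start:start + 4].ljust(4, "A")
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]  # first base ends up in the high bits
        out.append(byte)
    return bytes(out)


def query_lookup_table(query, w=8):
    """Map every length-w word of the query to the offsets where it occurs."""
    table = {}
    for pos in range(len(query) - w + 1):
        table.setdefault(query[pos:pos + w], []).append(pos)
    return table
```

Four bases per byte is what gives the compression factor of 4: an 840M bp sequence occupies about 210 MB once packed.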

17 Ungapped extension: Extend the word hits in both directions until the score drops by X or more. The extended hit qualifies if it scores higher than the cutoff score S.
Example: X = 2, S = 3
    Query:  A T A C G T A C G T A C G T
    DB seq: G C A C G T A C G C G T
    1 1 1 1 1 1             score = 6
    1 1 1 1 1 1 1           score = 7 (drop -1)
    -2 1 1 1 1 1 1 1        score = 5 (drop 2)
    -2 1 1 1 1 1 1 1 -2     score = 3 (drop 2)
    Extended hit = CACGTACGC
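
The extension rule can be sketched in Python as follows. This is a hedged illustration, not BLAST's code: the function names and the returned half-open interval are assumptions, and the sketch trims each side back to its best-scoring position, whereas the slide's example reports the extent including the final mismatching characters. The seed is assumed to be an exact match of length w.

```python
def extend_one_way(query, db, q, d, step, x_drop, reward=1, penalty=-2):
    """Extend from (q, d) in direction step (+1 or -1).

    Returns (best score gained, number of characters kept at that best).
    """
    score = best = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= d < len(db):
        score += reward if query[q] == db[d] else penalty
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x_drop:
            break  # score has dropped by X or more below the best seen
        q += step
        d += step
    return best, best_len


def ungapped_extend(query, db, q_start, d_start, w, x_drop, reward=1, penalty=-2):
    """Extend an exact word hit of length w; returns (score, db_start, db_end)."""
    right_gain, right_len = extend_one_way(
        query, db, q_start + w, d_start + w, +1, x_drop, reward, penalty)
    left_gain, left_len = extend_one_way(
        query, db, q_start - 1, d_start - 1, -1, x_drop, reward, penalty)
    score = w * reward + left_gain + right_gain
    return score, d_start - left_len, d_start + w + right_len
```

On the slide's sequences with a seed at 0-based offset 3 in both sequences (CGTACG), X = 2, reward 1 and penalty -2, this returns a score of 7 over the database interval [2, 9), i.e. ACGTACG, which matches the peak score of 7 in the example.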

18 Gapped extension + traceback: Extend the hits in both directions, allowing gaps. Perform restricted dynamic programming on the gapped extended hits.

19 E-value: Low-complexity regions: about half the human genome is easily recognized as repetitive. A hit is statistically significant if its score is higher than one obtained from two random sequences. The alignment score of two random sequences follows the extreme value distribution. The expected number of hits with score at least S is given by $E = Kmn\,e^{-\lambda S}$, where m and n are the lengths of the query and the database and K and $\lambda$ are statistical parameters of the scoring system. The significance of a hit is evaluated by its E-value: the smaller the E-value, the more statistically significant the hit.
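
As a worked check of the formula, the short Python sketch below computes E both from a raw score and from a bit score. The function names are assumptions made for illustration. The bit score is the standard normalisation S' = (λS − ln K) / ln 2, which turns the formula into E = mn · 2^(−S'). No values of K or λ are given in this presentation, so the check uses only the bit score and effective search space printed in the sample BLAST result.

```python
import math


def e_value_raw(raw_score, m, n, k, lam):
    """E = K * m * n * exp(-lambda * S) for a raw score S."""
    return k * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, k, lam):
    """Normalised score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits, search_space):
    """E = (m * n) * 2^(-S'), with search_space standing for m * n."""
    return search_space * 2.0 ** (-bits)


# Sample result from the presentation: 44.4 bits, search space 291602961334.
print(round(e_value_from_bits(44.4, 291602961334), 3))
```

The printed value agrees with the Expect = 0.013 reported for both sample hits.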

20 Extreme value distribution: Positively skewed tail. Higher probability of a high score than under the normal distribution. [Figure: extreme value distribution curve plotted against score s]

21 Prefix-Suffix Hashing Scheme: Goals: speed up hit generation and ungapped extension; reduce the number of hits so as to reduce the processing costs of the later phases. Design: build hashing indexes on the database sequences. The index stores the offsets of the words (length W) of the database sequence. During a search, for each sliding window of the query sequence, look up the index for the offsets of the hits in the database sequence.

22 Index structure: A word pattern has length W and is partitioned into a prefix and a suffix. The prefix and the suffix are each represented by a hash value $H(T) = \sum_{i=0}^{|T|-1} 4^i \cdot V(T[i])$, where V(A) = 0, V(C) = 1, V(G) = 2, V(T) = 3. For each possible prefix there is a lookup file and an entry file. Lookup file: for each possible suffix, pointers to the actual offsets of the word pattern and the total number N of offsets. Entry file: for each possible suffix, the N offsets.
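
The hash H(T) reads a pattern as a base-4 number with the first character as the least significant digit. A minimal Python sketch, with the function and table names chosen here for illustration:

```python
V = {"A": 0, "C": 1, "G": 2, "T": 3}


def pattern_hash(t):
    """H(T) = sum over i of 4**i * V(T[i]), with i starting at 0."""
    return sum(4 ** i * V[ch] for i, ch in enumerate(t))


print(pattern_hash("AAAAA"), pattern_hash("AC"), pattern_hash("TTTTT"))
```

Because every pattern of a given length maps to a distinct integer in [0, 4^|T| − 1], the hash can be used directly as an array index or file offset, with no collision handling.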

23 Index structure (diagram): For prefix AAAAA, the lookup file holds one row of (pointers, number of offsets) per suffix (AAAAAA, AAAAAC, …), and the entry file holds the corresponding lists of offsets. Prefix AAAAC has the same layout, and so on for each prefix. The lookup files are merged.

24 Build the index: For each sliding window of the database sequence, divide it into a prefix of length P and a suffix of length S, and store its offset with the prefix and suffix. Flush the offsets to disk if memory is full. Reorganise the offsets on disk into the corresponding lookup files and entry files. Merge the lookup files into one.
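
The first step of the build can be sketched in memory with Python. This is a simplification under stated assumptions: the function name and the nested-dictionary layout are hypothetical, and the flush-to-disk, reorganise and merge steps are left out entirely.

```python
from collections import defaultdict


def build_index(db, p=5, s=6):
    """Map prefix -> suffix -> list of offsets of each length-(p + s) word.

    In-memory stand-in for the lookup and entry files; nothing is written to disk.
    """
    index = defaultdict(lambda: defaultdict(list))
    w = p + s
    for offset in range(len(db) - w + 1):
        word = db[offset:offset + w]
        # Offsets are appended in increasing order because the scan is left to right.
        index[word[:p]][word[p:]].append(offset)
    return index
```

In this layout the outer key plays the role of a prefix's lookup file, and each inner list plays the role of that word's slice of the entry file.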

25 During a search: Divide the query sequence into sliding windows of length W. For each sliding window, compute the hash values of its prefix, $H_P$, and suffix, $H_S$. Sort the sliding windows by their $H_P$, then by their $H_S$. Access the lookup file for $H_S$ at the $H_P$ block. Access the entry file for the offsets of the hits of the word.
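
The search steps can be sketched in Python against an in-memory index. The function name and the dictionary layout (prefix -> suffix -> offsets) are assumptions for illustration, and the windows are sorted by their prefix and suffix strings rather than by the hash values, which groups lookups for the same prefix in the same way.

```python
def find_hits(query, index, p=5, s=6):
    """Return (query_offset, db_offset) pairs for every word hit.

    index is a mapping prefix -> suffix -> list of database offsets.
    """
    w = p + s
    windows = [
        (query[i:i + p], query[i + p:i + w], i)
        for i in range(len(query) - w + 1)
    ]
    # Sort by prefix, then suffix, so accesses to one prefix block are adjacent.
    windows.sort()

    hits = []
    for prefix, suffix, q_off in windows:
        suffixes = index.get(prefix)  # stands in for reading the lookup block
        if suffixes is None:
            continue
        for d_off in suffixes.get(suffix, ()):  # stands in for the entry-file read
            hits.append((q_off, d_off))
    return hits


index = {"ACGTA": {"CGTACG": [0, 4, 8]}}
print(find_hits("TTACGTACGTACGTT", index))
```

Sorting the windows first is what lets an on-disk version read each prefix block once and in order instead of seeking back and forth.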

26 Experiments: Database sequence: human chromosomes 1 – 4, 840M bp. Query sequences: randomly selected from human chromosomes. Parameters: W = 11, P = 5, S = 6. Tasks: compare the order of prefix and suffix; compare the hit generation time of the algorithms BLAST, PS-Hash (Prefix-Suffix Hashing Scheme), HashQuery (build a lookup table on the query sequence and scan the database sequence) and Sequential Scan; study the ungapped extension in BLAST.

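27 Experimental results: Two sets of index files were built: prefix as lookup (prefix->suffix) and suffix as lookup (suffix->prefix). Times are in seconds.

| Query length | Eff. len. | # of hits | prefix->suffix total | prefix->suffix lookup | prefix->suffix entry | suffix->prefix total | suffix->prefix lookup | suffix->prefix entry |
|---|---|---|---|---|---|---|---|---|
| 490 | | 484925 | 5.39364 | 0.454373 | 4.9388 | 5.66894 | 0.504918 | 5.152831 |
| 512 | 70 | 20752 | 1.076 | 0.293433 | 0.78247 | 1.20806 | 0.303494 | 0.904463 |
| 512 | | 336367 | 6.04708 | 0.477877 | 5.568728 | 6.08929 | 0.497289 | 5.591531 |
| 513 | | 580441 | 5.65264 | 0.475084 | 5.16985 | 6.0363 | 0.514839 | 5.520972 |
| 490 | 452 | 1288149 | 5.36993 | 0.463572 | 4.905935 | 5.51566 | 0.497818 | 5.006489 |

Eff. len. is the effective search length of the query sequence after filtering.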

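28 Experimental results

| Query length | Eff. len. | BLAST hits | BLAST time (s) | PS-Hash hits | PS-Hash time (s) | HashQuery time (s) | Sequential Scan time (s) |
|---|---|---|---|---|---|---|---|
| 490 | | 363141 | 7.0770 | 484925 | 5.39364 | 40.0346 | 4506.84 |
| 512 | 70 | 14949 | 7.6932 | 20752 | 1.076 | 32.0046 | 558.989 |
| 512 | | 194686 | 6.7870 | 336367 | 6.04708 | 41.6682 | 4721.28 |
| 513 | | 248233 | 6.7912 | 580441 | 5.65264 | 43.1642 | 4652.43 |
| 490 | 452 | 350078 | 13.3951 | 1288149 | 5.36993 | 40.7474 | 3868.43 |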

29 Analysis: Index files: Number of word patterns = $4^{11}$ = 4M. Number of prefix patterns = $4^5$ = 1K. Number of suffix patterns = $4^6$ = 4K. Total size of lookup file = $4^{11} \times (4 + 4)$ bytes = 32MB. Total size of entry files = 840M × 4 bytes = 3GB.
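
These index sizes can be reproduced with a few lines of Python. The variable names are illustrative, and reading (4 + 4) as a 4-byte pointer plus a 4-byte count per word pattern, and 4 bytes per stored offset, is an assumption about what the slide's factors mean.

```python
W, P, S = 11, 5, 6

word_patterns = 4 ** W    # one lookup entry per possible word
prefix_patterns = 4 ** P
suffix_patterns = 4 ** S

# Assumed layout: 4-byte pointer + 4-byte offset count per word pattern.
lookup_bytes = word_patterns * (4 + 4)

# Assumed layout: one 4-byte offset per position of the 840M bp database.
entry_bytes = 840_000_000 * 4

print(word_patterns, prefix_patterns, suffix_patterns)
print(lookup_bytes / 2**20, "MiB lookup;", entry_bytes / 1e9, "GB entries")
```

The lookup file comes to exactly 32 MiB, and the entry files to about 3.4 × 10^9 bytes, which the slide rounds to 3GB.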

30 Analysis: Number of bytes read: BLAST: compressed sequence file = 210MB. PS-Hash: (# of query sliding windows) × (4 + 4) + (# of hits) × 4 = 1.85MB. HashQuery: sequence file = 840MB. Sequential Scan: sequence file = 840MB. For the first query, PS-Hash reads only about 1/113 of the bytes that BLAST reads, yet its running time is not much faster and in some cases is even slower. Issue: disk locality.
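
The PS-Hash figure for the first query can be checked with a short Python calculation. The function name is hypothetical, and the sketch assumes the first query contributes 490 − 11 + 1 = 480 sliding windows and 484,925 hits. The 1/113 ratio is reproduced by dividing the slide's 210MB by the PS-Hash total expressed in units of 2^20 bytes, which is an assumption about how the slide's numbers were combined.

```python
def ps_hash_bytes_read(query_len, num_hits, w=11, lookup_entry_bytes=4 + 4, offset_bytes=4):
    """Bytes read by PS-Hash: one lookup entry per window plus one offset per hit."""
    windows = query_len - w + 1
    return windows * lookup_entry_bytes + num_hits * offset_bytes


first_query_bytes = ps_hash_bytes_read(490, 484_925)
ps_hash_mb = first_query_bytes / 2**20   # about 1.85
blast_mb = 840 / 4                       # 840M bases at 4 bases per byte = 210

print(first_query_bytes, round(ps_hash_mb, 2), round(blast_mb / ps_hash_mb))
```

Almost all of the PS-Hash total comes from the hit offsets; the lookup entries for 480 windows add under 4 KB.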

31 Experimental results: BLAST ungapped extension. Database sequence: 840M bp. Query: 512 bp. E-value: $10^{-15}$. Total number of word hits: 194,686.

32 Conclusion: Introduced local similarity search. Described BLAST. Proposed the Prefix-Suffix Hashing Scheme. Showed experimental results and comparisons.

33 Future work: Optimise the implementation of the Prefix-Suffix Hashing Scheme. Utilise the information on the number of word hits produced by each sliding window of the query sequence. Extend the index to store neighbour information about the word patterns. Derive a useful threshold to restrict the generation of hits for later-phase processing. Test on multiple sequences in the database.

34 References
BLAST website: http://www.ncbi.nlm.nih.gov/blast/
The Statistics of Sequence Similarity Scores: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.
Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proceedings of the National Academy of Sciences USA, 87(6):2264-2268.
Korf, I., Yandell, M. & Bedell, J. (2003) BLAST. Sebastopol, CA: O'Reilly & Associates.
WU-BLAST website: http://blast.wustl.edu/
FSA-BLAST website: http://www.fsa-blast.org/

