Indexed Alignment: Tricks of the Trade. Ross David Bayer, 18th October, 2005. Note: many diagrams taken from Serafim’s CS 262 class.


1 Indexed Alignment: Tricks of the Trade. Ross David Bayer, 18th October, 2005. Note: many diagrams taken from Serafim’s CS 262 class

2 Roadmap: Background Recap; Simple Tricks of the Trade (Wildcards, Multiple Words); State of the Art (Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds)

3 Status Check: Background Recap; Simple Tricks of the Trade (Wildcards, Multiple Words); State of the Art (Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds)

4 Motivation We have a newly discovered gene: Does it occur in other species? How fast does it evolve? We want to “find” this gene in other species But there will be mutations

5 Sequence Alignment
Sequences: AGGCTATCACCTGACCTCCAGGCCGATGCCC and TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Alignment:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

6 Global Alignment: Needleman-Wunsch (dynamic programming). Running time: O(MN), where M and N are the lengths of the two sequences, e.g. AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA (M) and AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC (N).

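7 Local Alignment: Smith-Waterman. Running time: O(MN), where M and N are the lengths of the two sequences, e.g. AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA (M) and AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC (N). Modifications relative to global alignment: store 0 instead of negative values; search the entire table for the maximum.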
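
The two modifications on this slide can be sketched in a few lines. This is a minimal illustration, not the deck's own code: the scoring values (match +1, mismatch -1, gap -1) and the function name `smith_waterman` are assumptions chosen for the example.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not taken from the slides.
    Runs in O(len(a) * len(b)) time and keeps only two rows of the table.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Modification 1: never store a negative value (clamp at 0).
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            # Modification 2: the answer is the maximum over the whole table.
            best = max(best, cur[j])
        prev = cur
    return best
```

Only the score is returned; recovering the aligned substrings would need the full table and a traceback from the best cell.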

8 Alignment Applications We have our newly discovered gene: Does it occur in other species? How fast does it evolve?

9 Complete genomes today About 300 complete genomes have been sequenced

10 GenBank Growth: exponential growth in total sequence data. Recently exceeded 100 Gbp (10^11 base pairs)

11 More DNA is coming …

12 Alignment Applications We have our newly discovered gene: Does it occur in other species? How fast does it evolve? Assume we try Smith-Waterman: the entire genomic database (10^11 bp) against our new gene (10^4 bp) gives 10^15 cells

13 Indexed Alignment (BLAST: Basic Local Alignment Search Tool) Main idea: 1. Construct a dictionary of all words in the query. 2. Initiate a local alignment for each word match between query and DB. Running time: O(MN) in the worst case; in practice, however, orders of magnitude faster than Smith-Waterman.

14 BLAST Step 1 (Basic): Construct dictionary of query words. The query is indexed by all words of size k (k = 3 in our examples; k ≈ 11 in BLAST).
Query: AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCG…
Words: AGG GGC GCT CTA TAT ATC TCA CAC ACC CCT CTG TGA GAC ACC CCT CTC TCC CCA CAG AGG GGC GCC CCG CGA GAT ATG TGC GCC CCC CCT CTA TAG AGC GCT CTA TAT ATC TCA CAC ACG CGA GAC ACC CCG CGC GCG
INDEX: AAA AAC AAG AAT ACA ... AGG ... CTA ... GCT ... GGC ...
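
A minimal sketch of the dictionary construction, assuming a plain hash map from each word to the list of query positions where it starts. The name `build_index` is hypothetical, and the query is the visible part of the slide's example (the slide truncates it with an ellipsis).

```python
from collections import defaultdict


def build_index(query, k):
    """Map every length-k word of the query to its start positions (0-based)."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


# Visible portion of the example query from the slide.
QUERY = "AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCG"
index = build_index(QUERY, 3)
```

With k = 3 the word AGG is recorded at positions 0 and 19, matching its two occurrences in the word list on the slide.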

15 BLAST Step 1 (Advanced): Relative Generation. For each query word, generate all relatives; a relative is a word with alignment score ≥ T against the query word. The index entries for all relatives point to the query word's location. Example: query word GGC, threshold T = 28. Candidate scores: GGC 30, AGC 28, GAC 28, AAC 26, GGT 25, GGA 24, ... so the relatives are GGC, AGC and GAC. Query: AGGCTATCACCTGACCTCCAGGCCG… INDEX: ... AGC ... GAC ... GGC ...

16 BLAST Step 2: Searching. Search through the database linearly, one word at a time, and initiate an alignment with every occurrence of that word in the query. Genomic database: AGCTAGCTGCTAGTCAGTCGATGCATGCTACTAGCTGCGATCGTCGTC… INDEX: AAA AAC AAG AAT ACA ... AGC ... CTA ... GCT ... TAG ... Query: AGC GCT AGCGCT

17 BLAST Alignment Extension. Example: the matching word GGT initiates an alignment. Extend to the left and right with no gaps until the alignment score falls a certain threshold S below the best score so far. [Figure: dot matrix with axes A C G A A G T A A G G T C C A G T and C C C T T C C T G G A T T G C G A] Output: GTAAGGTCC aligned with GTTAGGTCC
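
The extension rule can be sketched as follows. This is an illustration under stated assumptions, not BLAST's actual implementation: scoring is +1 per match and -1 per mismatch, the drop threshold S is 2, and the names `extend_ungapped` and `_extend` are hypothetical. The second example sequence is the slide's second axis read in reverse of the order it appears in this transcript, which is also an assumption.

```python
def _extend(query, db, qi, di, step, drop, match, mismatch):
    """Walk in one direction; return (best score gain, length at that best)."""
    score = best = best_len = n = 0
    while 0 <= qi < len(query) and 0 <= di < len(db):
        score += match if query[qi] == db[di] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score > drop:
            break  # fell more than `drop` below the best score so far
        qi += step
        di += step
    return best, best_len


def extend_ungapped(query, db, qpos, dpos, k, drop=2, match=1, mismatch=-1):
    """Extend a k-long exact word hit at (qpos, dpos) without gaps."""
    right, rlen = _extend(query, db, qpos + k, dpos + k, 1, drop, match, mismatch)
    left, llen = _extend(query, db, qpos - 1, dpos - 1, -1, drop, match, mismatch)
    score = k * match + left + right
    return (score,
            query[qpos - llen:qpos + k + rlen],
            db[dpos - llen:dpos + k + rlen])


seq1 = "ACGAAGTAAGGTCCAGT"
seq2 = "AGCGTTAGGTCCTTCCC"  # slide's second axis, reversed (assumption)
result = extend_ungapped(seq1, seq2, seq1.find("GGT"), seq2.find("GGT"), 3)
```

Under these assumptions the word GGT extends to GTAAGGTCC against GTTAGGTCC, the pair shown as the slide's output.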

18 BLAST Algorithm Variations BLAT- BLAST-Like Alignment Tool 1. Builds index (dictionary) for database, scans linearly through query 2. Alignment extensions allow for gaps as well

19 BLAT Gapped Extensions. Extensions with gaps in a band around the anchor. As before, extension continues to the left and right until the alignment score falls a certain threshold S below the best score so far, but gaps are now allowed within the band. [Figure: dot matrix with axes A C G A A G T A A G G T C C A G T and C T G A T C C T G G A T T G C G A] Output: GTAAGGTCC-AG aligned with GTTAGGTCCTAG

20 Perfect Match Results Perfect Match: no relatives generated

21 Perfect Match Results

22 Interpreting Results Word size k

23 Interpreting Results Conservation rate Conservation rate: 81% Mutation rate: 19%

24 Interpreting Results Sensitivity: the probability of a particular homologous area being identified. Larger k decreases this probability (an exact match is less likely). Straightforward mathematics.

25 Sensitivity Calculation. Homologous area between the database (genome) and the query; conservation rate: 81%, mutation rate: 19%. Suppose k = 7. Probability the whole word is conserved: 0.81^7 ≈ 23%

26 Sensitivity Calculation. Suppose k = 7 and the homologous area contains 10 words. Probability a particular word is conserved: 23%. Probability at least one word is conserved: 1 – 0.77^10 ≈ 93%
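
The arithmetic on these two slides can be checked with a short sketch. It follows the slides in treating the words as independent, which is a simplification; the function names are hypothetical.

```python
def word_conserved(conservation, k):
    """Probability that all k positions of one word are conserved."""
    return conservation ** k


def sensitivity(conservation, k, words):
    """Probability that at least one of `words` independent words is conserved."""
    p_word = word_conserved(conservation, k)
    return 1 - (1 - p_word) ** words


p_word = word_conserved(0.81, 7)      # about 0.23
p_found = sensitivity(0.81, 7, 10)    # about 0.93
```

Raising k shrinks `word_conserved` geometrically, which is why sensitivity falls as the word size grows.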

27 Interpreting Results Specificity Expected number of alignments initiated by chance Based on 500 bp query and 3 Gbp database This is essentially an indication of SPEED

28 Interpreting Results SPEED Expected number of alignments initiated by chance Based on 500 bp query and 3 Gbp database This is essentially an indication of SPEED

29 The Classic BLAST Tradeoff As we increase k … Sensitivity gets worse Speed gets better

30 Status Check: Background Recap; Simple Tricks of the Trade (Wildcards, Multiple Words); State of the Art (Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds)

31 Wildcards. Exact matches are unlikely for larger values of k, so include variants with one “wildcard” placed in each position: GTA gives *TA, G*A, GT*. This is relative generation with scores any match: 1, any mismatch: 0, and threshold T = k – 1.
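
One way to realise the wildcard idea is to index every single-wildcard variant of each query word and to look up the variants of each database word; two words share a variant exactly when they differ in at most one position. The sketch below assumes that design, and the function names are hypothetical.

```python
from collections import defaultdict


def wildcard_variants(word):
    """All variants of `word` with one position replaced by '*'."""
    return [word[:i] + "*" + word[i + 1:] for i in range(len(word))]


def build_wildcard_index(query, k):
    """Index each length-k query word under all of its wildcard variants."""
    index = defaultdict(set)
    for i in range(len(query) - k + 1):
        for variant in wildcard_variants(query[i:i + k]):
            index[variant].add(i)
    return index


def lookup(index, db_word):
    """Query positions whose word is within one mismatch of db_word."""
    hits = set()
    for variant in wildcard_variants(db_word):
        hits |= index.get(variant, set())
    return sorted(hits)
```

The index grows by a factor of k compared with exact words, which is the price paid for the extra sensitivity.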

32 Wildcard Results Better?

33 Wildcard Results Perfect match: Wildcards: For the same sensitivity, wildcard variant is about 440 times faster

34 Wildcard Results Perfect match: Wildcards: For the same sensitivity, wildcard variant is about 40 times faster

35 Wildcard Results Better Sensitivity/speed tradeoff consistently improved

36 Status Check: Background Recap; Simple Tricks of the Trade (Wildcards, Multiple Words); State of the Art (Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds)

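37 Multiple Words. Require N perfect matches with the same separation in query and database, and all separations less than distance W. Database: TGCTAGCTACGATCTGCAGTGCGTAATCT… Query: TCATTACATCGTGACTTGCAGTCGTCCAG… In the query, the words TAC and TGC are 12 bp apart. In the database, the TGC that lies 12 bp after TAC matches that separation: INITIATE ALIGNMENT. The TGC that lies 7 bp after TAC does not: NO INITIATION.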
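
For N = 2, "same separation in query and database" means both word hits lie on the same diagonal (database position minus query position). A minimal sketch under that reading is given below; the requirement that the two words do not overlap, and the name `two_hit_seeds`, are assumptions made for the example. The sequences are the visible parts of the slide's examples.

```python
from collections import defaultdict


def two_hit_seeds(query, db, k, window):
    """Return (query_pos, db_pos) of each second hit that lies on the same
    diagonal as an earlier hit, at least k and less than `window` bases later."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    last_on_diag = {}  # diagonal -> database position of the latest hit
    seeds = []
    for d in range(len(db) - k + 1):
        for q in index.get(db[d:d + k], ()):
            diag = d - q
            prev = last_on_diag.get(diag)
            if prev is not None and k <= d - prev < window:
                seeds.append((q, d))
            last_on_diag[diag] = d
    return seeds


DB = "TGCTAGCTACGATCTGCAGTGCGTAATCT"
QUERY = "TCATTACATCGTGACTTGCAGTCGTCCAG"
seeds = two_hit_seeds(QUERY, DB, 3, 13)
```

With k = 3 and W = 13, the TAC/TGC pair that is 12 bp apart in both sequences produces a seed at query position 16, database position 19.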

38 Intuition Behind Multiple Words. Homologous area between the database (genome) and the query; conservation rate: 81%, mutation rate: 19%. If we use a single word of size k = 16, the probability the whole word is conserved is 0.81^16 ≈ 3%

39 Intuition Behind Multiple Words. If we use a single word of size k = 16 and the homologous area contains 10 words: probability a particular word is conserved: 3%; probability at least one word is conserved: 1 – (1 – 0.81^16)^10 ≈ 29% (using the unrounded value 0.81^16 ≈ 0.034; the rounded 1 – 0.97^10 gives about 26%)

40 Intuition Behind Multiple Words. If we use a single word of size k = 16: probability of a match = 29%. If we use N = 2 words of size k = 8, the homologous area contains 20 words: probability a particular word is conserved: 0.81^8 ≈ 19%; probability at least two words are conserved: 1 – 0.81^20 – 20 × 0.19 × 0.81^19 ≈ 91%, where 0.81 = 1 – 0.19 is here the probability that a word is not conserved.
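
The "at least two of twenty" figure is one minus the first two terms of a binomial distribution. The sketch below generalises that to any N, again treating words as independent as the slides do; the function name is hypothetical.

```python
from math import comb


def at_least_n_words(conservation, k, words, n):
    """Probability that at least n of `words` independent length-k words
    are fully conserved."""
    p = conservation ** k
    below = sum(comb(words, i) * p ** i * (1 - p) ** (words - i)
                for i in range(n))
    return 1 - below


single = at_least_n_words(0.81, 16, 10, 1)   # one word of size 16
double = at_least_n_words(0.81, 8, 20, 2)    # two words of size 8
```

Both settings demand 16 conserved positions in total, yet the two-word variant is far more likely to fire on a homologous region.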

41 Intuition Behind Multiple Words. If we use a single word of size k = 16 (each word conserved with probability 3%): probability of a match = 29%. If we use N = 2 words of size k = 8 (each word conserved with probability 19%): probability of a match = 91%

42 Multiple Words Results

43 Single perfect match: Multiple perfect matches: For the same sensitivity, multiple words variant about 1,200 times faster

44 Multiple Words Results Single perfect match: Multiple perfect matches: For the same sensitivity, multiple words variant about 75,000 times faster

45 Multiple Words Results Much better than single matches Bigger improvement even than wildcards

46 Multiple Words Results Why not combine them: Multiple Wildcard Matches?

47 Status Check: Background Recap; Simple Tricks of the Trade (Wildcards, Multiple Words); State of the Art (Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds)

48 Seed Patterns
Contiguous word (k = 10): GTCAGTACGTCAGTCGTGCGTCGTCTAG sampled with ×××××××××× gives GTCAGTACGT
Seed pattern: GTCAGTACGTCAGTCGTGCGTCGTCTAG sampled with ××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙× gives GTATTAGGCG
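
Extracting the index key for a seed pattern only requires reading the sequence at the × positions. In the sketch below the pattern is written with 1 for × and 0 for ∙ (the 0/1 notation the deck uses later for π); the name `seed_key` is hypothetical.

```python
def seed_key(seq, pos, pattern):
    """Characters of seq at the '1' positions of pattern, starting at pos."""
    return "".join(seq[pos + i] for i, c in enumerate(pattern) if c == "1")


SEQ = "GTCAGTACGTCAGTCGTGCGTCGTCTAG"
CONTIGUOUS = "1" * 10                        # ××××××××××
SPACED = "1101010001010001010001000001"      # the slide's pattern, k = 10, s = 28

contiguous_key = seed_key(SEQ, 0, CONTIGUOUS)
spaced_key = seed_key(SEQ, 0, SPACED)
```

Both keys have ten characters, so the index is no larger per entry than for a contiguous word; only the positions that must match are spread over a longer span.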

49 Intuition Behind Seed Patterns. Patterns increase the likelihood of at least one match within a long conserved region. [Figure: positions shared by overlapping hits, consecutive positions vs. non-consecutive positions; labels: 3 common, 5 common, 7 common, 6 common]
On a 100-long 70% conserved region:
                          Consecutive   Non-consecutive
Expected # hits:          1.07          0.97
Prob[at least one hit]:   0.30          0.47

50 Advantage of Patterns 11 positions 10 positions

51 Status Check: Background Recap; Simple Tricks of the Trade (Wildcards, Multiple Words); State of the Art (Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds)

52 Designing Seeds. Is this a good seed pattern? ×∙×∙×∙×∙×∙×∙×∙×∙×∙× Not so much: against a copy of itself shifted by one position it shares 0 matches, and against a copy shifted by two positions it shares 9 matches.

53 Designing Seeds. Remember three-periodicity in exons? Higher mutation rate in the last position of a codon:
Human    CCT GTT  (Proline, Valine)
Mouse    CCA GTC  (Proline, Valine)
Rat      CCA GTC  (Proline, Valine)
Dog      CCG GTA  (Proline, Valine)
Chicken  CCC GTG  (Proline, Valine)
A decent pattern is thus π_110: ××∙××∙××∙××∙××∙× But isn’t regularity bad?

54 Optimizing Seeds. Hard problem: no efficient solution is known for fixed word size k and max span s. Example: ××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙× has k = 10 (the × positions, numbered 1 to 10) and s = 28.

55 Computing Detection Probabilities. How sensitive is a specific seed π? Construct a finite-state automaton Aπ that accepts strings containing it. Example: π_101 (×∙×) has regular expression (0 + 1)* 1 (0 + 1) 1 (0 + 1)*. A dynamic program converts it into a DFA. [Figure: DFA for π_101 with START and YES states]

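56 Computing Detection Probabilities. Compute the probability that a sequence of length L will be accepted by the DFA, under an n-th order Markov model M which dictates the conservation pattern. Dynamic program: θ(k L 2^(s – k + n)). [Figure: two-state model emitting a conservation string such as 101101110111 of length L; CDS state with P(1) = 0.9, P(0) = 0.1; NC state with P(1) = 0.8, P(0) = 0.2; transition labels 0.05 and 0.95]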
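
A simplified version of this computation is sketched below. It is not the automaton-based dynamic program from the slide: it assumes a 0th-order model in which every position is conserved independently with probability p, and it tracks the last s – 1 conservation bits directly, so its state space is 2^(s – 1) and it is only practical for short seeds. The name `hit_probability` is hypothetical; seeds are written with 1 for × and 0 for ∙.

```python
def hit_probability(seed, length, p):
    """Probability that `seed` hits somewhere in a random 0/1 conservation
    string of the given length, each bit being 1 with probability p."""
    s = len(seed)
    required = [i for i, c in enumerate(seed) if c == "1"]
    # Map: tuple of the most recent (up to s - 1) bits -> probability of
    # reaching that history without any hit so far.
    states = {(): 1.0}
    for _ in range(length):
        nxt = {}
        for hist, prob in states.items():
            for bit, p_bit in ((1, p), (0, 1 - p)):
                window = hist + (bit,)
                if len(window) == s and all(window[i] for i in required):
                    continue  # the seed hits here; drop this mass
                new_hist = window[-(s - 1):] if s > 1 else ()
                nxt[new_hist] = nxt.get(new_hist, 0.0) + prob * p_bit
        states = nxt
    return 1 - sum(states.values())
```

For π_101 on four positions with p = 0.5, seven of the sixteen possible strings contain a hit, so the function should return 7/16.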

57 Mandala Algorithm. Fast, practical seed design. Start with a random seed. Jump to the best neighbor (moving one × to an unused ∙). Keep jumping until there are no better neighbors (greedy hill climbing); this finds a local optimum. Use random restarts to try to find the globally optimal seed. [Figure: example seeds such as ×∙×∙××∙××∙∙∙× and ×∙∙∙××∙××∙∙×× linked by neighbor moves, ending at NO BETTER NEIGHBORS]
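
The search loop described on this slide can be sketched generically. This is not the Mandala implementation: the scoring function is passed in by the caller (a seed-sensitivity estimate would be the natural choice), the first and last positions are kept fixed at × so the span stays constant, and all names are hypothetical. Seeds are written with 1 for × and 0 for ∙.

```python
import random


def neighbors(seed):
    """Seeds obtained by moving one interior '1' onto an interior '0'."""
    out = []
    for i in range(1, len(seed) - 1):
        for j in range(1, len(seed) - 1):
            if seed[i] == "1" and seed[j] == "0":
                s = list(seed)
                s[i], s[j] = "0", "1"
                out.append("".join(s))
    return out


def hill_climb(seed, score):
    """Greedy ascent: jump to the best neighbor until none is better."""
    current, current_score = seed, score(seed)
    while True:
        best = max(neighbors(current), key=score, default=None)
        if best is None or score(best) <= current_score:
            return current, current_score  # local optimum
        current, current_score = best, score(best)


def random_seed(weight, span, rng):
    """Random seed with `weight` ones, `span` positions, ones at both ends."""
    inner = ["1"] * (weight - 2) + ["0"] * (span - weight)
    rng.shuffle(inner)
    return "1" + "".join(inner) + "1"


def mandala_like(weight, span, score, restarts=5, rng=None):
    """Best local optimum found over several random restarts."""
    rng = rng or random.Random(0)
    results = [hill_climb(random_seed(weight, span, rng), score)
               for _ in range(restarts)]
    return max(results, key=lambda r: r[1])
```

Each jump strictly improves the score, so the climb always terminates; the restarts only raise the chance that the best local optimum found is also the global one.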

58 Mandala Results. Very fast: 20 seconds on k = 11, s = 22. Within 1% of the true optimum in all trials. Mandala pattern for non-coding DNA (k = 11): ××××∙×××∙∙×∙××× Mandala pattern for coding DNA (k = 11): ×××∙∙∙∙∙××∙××∙××∙××

59 Seed Pattern Results Non-coding sequences Coding sequences

60 Motivation for Improvement Good CDS seeds ××∙××∙××∙××∙×× Good NC seeds ××××∙×××∙∙×∙××

61 Status Check: Background Recap; Simple Tricks of the Trade (Wildcards, Multiple Words); State of the Art (Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds)

62 Multiple Simultaneous Seeds
Sequence: GTCAGTACGTCAGTCGTGCGTCGTCTAG
Seed 1 (the single seed used so far): ××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙× gives GTATTAGGCG
Seed 2: ×××∙×∙∙×∙∙∙∙×∙∙∙∙∙∙∙×∙∙∙∙××× gives GTCGCGTTAG
Seed 3: ∙∙∙××∙××∙××∙××∙××∙∙∙∙∙∙∙∙∙∙∙ gives AGACTCGTGT
THREE INDEXES, one per seed, each with entries such as ...AGCCAGTCAG...GACAGTCCAG...GGCATCATCA...
How much longer is it taking us to index? How much longer is it taking us to search?

63 Mandala still works: just change the definition of a neighbor. Pick one seed, pick one ×, move it to a ∙. Slower convergence. [Figure: example seed set ×∙×∙××∙××∙∙∙× ×∙∙∙××∙××∙∙×× ∙××∙××∙××∙××∙ ×××∙×∙∙×∙××∙×]

64 Results Non-coding sequences Back to earlier results …

65 Future Extensions. Combine all of the above! Multiple Wildcard Matches of Multiple Patterns?
Database: ACGATGGCTCCTGACGAGCTAGCTGATGAGCCGTAGCAGACGTGTAACGGTCGCTCCTGCAGCGA
Query: ACGATGCATCGTAGCTAGCTAGCTGATCAGCTGTAGTAGTCGTCTACTGGATGCTGCTGCAGCGT
Example pattern keys: * GT∙G∙T∙∙∙T GTA∙C∙∙G∙∙ ∙∙∙GC∙AG∙T and * TC∙T∙T∙∙∙G TCG∙C∙∙C∙∙ ∙∙∙TC∙AC∙G

