1
Indexed Alignment: Tricks of the Trade
Ross David Bayer, 18th October, 2005
Note: many diagrams taken from Serafim’s CS 262 class
2
Roadmap
Background Recap
Simple Tricks of the Trade: Wildcards, Multiple Words
State of the Art: Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds
3
Status Check
Background Recap
Simple Tricks of the Trade: Wildcards, Multiple Words
State of the Art: Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds
4
Motivation
We have a newly discovered gene: Does it occur in other species? How fast does it evolve?
We want to “find” this gene in other species, but there will be mutations
5
Sequence Alignment
Aligned, with gaps:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Input sequences:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
6
Global Alignment
Needleman-Wunsch (Dynamic Programming)
Running Time: O(MN)
Sequence of length M: AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
Sequence of length N: AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
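The O(MN) table fill behind Needleman-Wunsch can be sketched as follows. This is a minimal illustration that returns only the score; the scoring values (match +1, mismatch -1, linear gap -1) and the function name are assumptions, not taken from the slides.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b (scoring values are assumed)."""
    n = len(b)
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(n + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * n
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[n]
```

Only two rows are kept because the score alone is needed; recovering the alignment itself would require the full M by N table or a traceback structure.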
7
Local Alignment
Smith-Waterman
Running Time: O(MN)
Sequence of length M: AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
Sequence of length N: AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Modifications to global alignment:
Store 0 instead of negative values
Search entire table for maximum
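The two modifications (clamp at 0, take the maximum over the whole table) can be sketched as below. The scoring values and the function name are illustrative assumptions.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of a and b (scoring values are assumed)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Modification 1: never store a negative value.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            # Modification 2: the answer is the maximum over all cells.
            best = max(best, cur[j])
        prev = cur
    return best
```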
8
Alignment Applications
We have our newly discovered gene: Does it occur in other species? How fast does it evolve?
9
Complete genomes today About 300 complete genomes have been sequenced
10
GenBank Growth
Exponential growth in total sequence data
Recently exceeded 100 Gbp ($10^{11}$ base pairs)
11
More DNA is coming …
12
Alignment Applications
We have our newly discovered gene: Does it occur in other species? How fast does it evolve?
Assume we try Smith-Waterman: the entire genomic database ($10^{11}$ bp) × our new gene ($10^{4}$ bp) = $10^{15}$ cells
13
Indexed Alignment (BLAST: Basic Local Alignment Search Tool)
Main idea:
1. Construct a dictionary of all words in the query
2. Initiate a local alignment for each word match between query and DB
Running Time: O(MN) in worst case
However, in practice orders of magnitude faster than Smith-Waterman
14
Step 1 (Basic): Construct dictionary of query words
Query indexed by all words of size k; k = 3 in our examples, k ≈ 11 (BLAST)
Query: AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCG…
INDEX: AAA AAC AAG AAT ACA ... AGG ... CTA ... GCT ... GGC ...
Query words: AGG GGC GCT CTA TAT ATC TCA CAC ACC CCT CTG TGA GAC ACC CCT CTC TCC CCA CAG AGG GGC GCC CCG CGA GAT ATG TGC GCC CCC CCT CTA TAG AGC GCT CTA TAT ATC TCA CAC ACG CGA GAC ACC CCG CGC GCG
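Building the dictionary amounts to sliding a window of size k over the query and recording where each word starts. A minimal sketch, with a hypothetical function name:

```python
def build_word_index(query, k=3):
    """Map every length-k word of the query to its list of start positions."""
    index = {}
    for pos in range(len(query) - k + 1):
        index.setdefault(query[pos:pos + k], []).append(pos)
    return index

# The first 25 bases of the example query from the slide.
index = build_word_index("AGGCTATCACCTGACCTCCAGGCCG", k=3)
```

Words that occur more than once, such as AGG and ACC here, simply map to several positions.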
15
Step 1 (Advanced): Relative Generation (BLAST)
For each query word, generate all relatives
A relative is a word with alignment score ≥ T against the query word
Index entries for all relatives are updated to point to the query word's location
Query word: GGC
Threshold: T = 28
Candidate words and scores: GGC 30, AGC 28, GAC 28, AAC 26, GGT 25, GGA 24, ...
Query: AGGCTATCACCTGACCTCCAGGCCG…
INDEX: ... AGC ... GAC ... GGC ...
16
Step 2: Searching (BLAST)
Search through database linearly, one word at a time
Initiate alignment with all occurrences of that word in query
Genomic database: AGCTAGCTGCTAGTCAGTCGATGCATGCTACTAGCTGCGATCGTCGTC…
INDEX: AAA AAC AAG AAT ACA ... AGC ... CTA ... GCT ... TAG ...
Query (figure labels): AGC, GCT, AGCGCT
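The linear scan can be sketched as follows: index the query once, then look up each database word and report every (query position, database position) pair as a candidate place to start an alignment. The function name and the toy sequences in the usage line are hypothetical.

```python
def find_word_hits(query, database, k=3):
    """Return (query_pos, db_pos) for every exact k-word shared by both."""
    index = {}
    for qpos in range(len(query) - k + 1):
        index.setdefault(query[qpos:qpos + k], []).append(qpos)
    hits = []
    # One pass over the database; each lookup is a dictionary access.
    for dpos in range(len(database) - k + 1):
        for qpos in index.get(database[dpos:dpos + k], ()):
            hits.append((qpos, dpos))
    return hits

hits = find_word_hits("AGCGCT", "TTAGCTT", k=3)
```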
17
Alignment Extension (BLAST)
Example: The matching word GGT initiates an alignment
Extension to the left and right with no gaps until the alignment score falls more than a certain threshold S below the best score so far
Figure (dot plot) axis sequences as extracted: ACGAAGTAAGGTCCAGT and CCCTTCCTGGATTGCGA
Output:
GTAAGGTCC
GTTAGGTCC
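A minimal sketch of ungapped extension with a drop-off threshold follows. The scoring (match +1, mismatch -1), the threshold value, and the function name are assumptions. The second sequence is written in the orientation that reproduces the slide's output, which is a guess about how the figure's axis was read.

```python
def extend_ungapped(a, b, a_pos, b_pos, k, x_drop=2, match=1, mismatch=-1):
    """Extend a k-word hit at a[a_pos], b[b_pos] in both directions, no gaps."""
    def walk(step):
        # step=+1 starts just after the word, step=-1 just before it.
        i = a_pos + k if step > 0 else a_pos - 1
        j = b_pos + k if step > 0 else b_pos - 1
        score = best = length = best_len = 0
        while 0 <= i < len(a) and 0 <= j < len(b):
            score += match if a[i] == b[j] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break  # fell too far below the best score seen so far
            i += step
            j += step
        return best_len  # keep only the extension up to the best score

    left, right = walk(-1), walk(+1)
    return (a[a_pos - left:a_pos + k + right],
            b[b_pos - left:b_pos + k + right])

pair = extend_ungapped("ACGAAGTAAGGTCCAGT", "AGCGTTAGGTCCTTCCC", 9, 7, k=3)
```

With these assumed parameters the call returns the pair GTAAGGTCC and GTTAGGTCC.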
18
BLAST Algorithm Variations
BLAT: BLAST-Like Alignment Tool
1. Builds index (dictionary) for database, scans linearly through query
2. Alignment extensions allow for gaps as well
19
Gapped Extensions (BLAT)
Extensions with gaps in a band around anchor
Extension to the left and right until the alignment score falls more than a certain threshold S below the best score so far
Figure (dot plot) axis sequences as extracted: ACGAAGTAAGGTCCAGT and CTGATCCTGGATTGCGA
Output:
GTAAGGTCC-AG
GTTAGGTCCTAG
20
Perfect Match Results Perfect Match: no relatives generated
21
Perfect Match Results
22
Interpreting Results Word size k
23
Interpreting Results Conservation rate Conservation rate: 81% Mutation rate: 19%
24
Interpreting Results: Sensitivity
Probability of a particular homologous area being identified
Larger k decreases probability (exact match less likely)
Straightforward mathematics
25
Sensitivity Calculation
Homologous area between database (genome) and query
Conservation rate: 81%
Mutation rate: 19%
Suppose k = 7:
Probability whole word is conserved: $0.81^{7} \approx 23\%$
26
Sensitivity Calculation
Homologous area between database (genome) and query
Suppose k = 7:
Probability a particular word is conserved: 23%
Words in homologous area: 10
Probability at least one word is conserved: $1 - 0.77^{10} \approx 93\%$
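The calculation above can be reproduced in a few lines. Like the slide, this sketch treats the words as independent, which is a simplification; the function name is hypothetical.

```python
def word_sensitivity(conservation, k, num_words):
    """Return (P[one word conserved], P[at least one of num_words conserved])."""
    p_word = conservation ** k
    return p_word, 1 - (1 - p_word) ** num_words

p_word, sensitivity = word_sensitivity(0.81, k=7, num_words=10)
```

For the slide's numbers this gives roughly 0.23 and 0.93.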
27
Interpreting Results: Specificity
Expected number of alignments initiated by chance
Based on 500 bp query and 3 Gbp database
This is essentially an indication of SPEED
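One rough way to estimate the number of chance initiations is to assume uniformly random, independent bases, so that any two k-words match with probability 4^(-k). Real genomes are not uniform, and the slides do not state which model they used, so this is only an illustrative sketch with a hypothetical function name.

```python
def expected_chance_hits(k, query_len=500, db_len=3_000_000_000, alphabet=4):
    """Expected exact k-word matches between random query and database."""
    word_pairs = (query_len - k + 1) * (db_len - k + 1)
    return word_pairs / alphabet ** k

for k in (7, 9, 11, 13):
    print(k, expected_chance_hits(k))
```

Each extra base in the word divides the estimate by about four, which is why larger k runs faster.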
28
Interpreting Results SPEED Expected number of alignments initiated by chance Based on 500 bp query and 3 Gbp database This is essentially an indication of SPEED
29
The Classic BLAST Tradeoff As we increase k … Sensitivity gets worse Speed gets better
30
Status Check
Background Recap
Simple Tricks of the Trade: Wildcards, Multiple Words
State of the Art: Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds
31
Wildcards
Exact matches unlikely for larger values of k
Include variants with one “wildcard” placed in each position: GTA gives *TA, G*A, GT*
As Relative Generation: any match scores 1, any mismatch scores 0, threshold T = k – 1
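A minimal sketch of the wildcard idea: index every one-wildcard variant of each query word, then look up the variants of each database word. Two words share a variant exactly when they differ in at most one position. Function names are hypothetical.

```python
def wildcard_variants(word):
    """All copies of word with exactly one position replaced by '*'."""
    return [word[:i] + "*" + word[i + 1:] for i in range(len(word))]

def wildcard_hits(query, database, k):
    """(query_pos, db_pos) pairs whose k-words differ in at most one base."""
    index = {}
    for qpos in range(len(query) - k + 1):
        for key in wildcard_variants(query[qpos:qpos + k]):
            index.setdefault(key, set()).add(qpos)
    hits = set()
    for dpos in range(len(database) - k + 1):
        for key in wildcard_variants(database[dpos:dpos + k]):
            for qpos in index.get(key, ()):
                hits.add((qpos, dpos))
    return sorted(hits)
```

The index grows by a factor of k, which is the price paid for tolerating one mismatch per word.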
32
Wildcard Results Better?
33
Wildcard Results Perfect match: Wildcards: For the same sensitivity, wildcard variant is about 440 times faster
34
Wildcard Results Perfect match: Wildcards: For the same sensitivity, wildcard variant is about 40 times faster
35
Wildcard Results Better Sensitivity/speed tradeoff consistently improved
36
Status Check
Background Recap
Simple Tricks of the Trade: Wildcards, Multiple Words
State of the Art: Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds
37
Multiple Words
N perfect matches
Same separation in query and database
All separations less than distance W
Database: TGCTAGCTACGATCTGCAGTGCGTAATCT…
Query: TCATTACATCGTGACTTGCAGTCGTCCAG…
Words TAC and TGC are 12 bp apart in the query. The database pair that is also 12 bp apart: INITIATE ALIGNMENT. The database pair that is 7 bp apart: NO INITIATION.
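"Same separation in query and database" means the two hits lie on the same diagonal (database position minus query position). A minimal sketch for N = 2 follows; the function name, the value of W in the usage line, and the requirement that the two words do not overlap are assumptions.

```python
def two_hit_seeds(query, database, k, max_sep):
    """Return (first_qpos, second_qpos, diagonal) for pairs of exact k-word
    hits on one diagonal whose separation is at least k and below max_sep."""
    index = {}
    for qpos in range(len(query) - k + 1):
        index.setdefault(query[qpos:qpos + k], []).append(qpos)
    by_diagonal = {}
    for dpos in range(len(database) - k + 1):
        for qpos in index.get(database[dpos:dpos + k], ()):
            by_diagonal.setdefault(dpos - qpos, []).append(qpos)
    seeds = []
    for diagonal, qs in by_diagonal.items():
        qs.sort()
        for first, second in zip(qs, qs[1:]):  # neighbouring hits only
            if k <= second - first < max_sep:
                seeds.append((first, second, diagonal))
    return sorted(seeds)

db = "TGCTAGCTACGATCTGCAGTGCGTAATCT"
qry = "TCATTACATCGTGACTTGCAGTCGTCCAG"
seeds = two_hit_seeds(qry, db, k=3, max_sep=20)
```

On the slide's sequences the TAC and TGC hits at query positions 4 and 16 share diagonal 3, so they form a seed; the other TGC in the database lies on a different diagonal and does not pair with TAC.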
38
Intuition Behind Multiple Words
Homologous area between database (genome) and query
Conservation rate: 81%
Mutation rate: 19%
If we use a single word of size k = 16:
Probability whole word is conserved: $0.81^{16} \approx 3\%$
39
Intuition Behind Multiple Words
Homologous area between database (genome) and query
If we use a single word of size k = 16:
Probability a particular word is conserved: 3%
Words in homologous area: 10
Probability at least one word is conserved: $1 - 0.97^{10} \approx 29\%$ (computed with the unrounded word probability $0.81^{16} \approx 3.4\%$)
40
Intuition Behind Multiple Words
If we use a single word of size k = 16: Probability of a match = 29%
If we use N = 2 words of size k = 8:
Probability a particular word is conserved: $0.81^{8} \approx 19\%$
Words in homologous area: 20
Probability at least two words are conserved: $1 - 0.81^{20} - 20 \times 0.19 \times 0.81^{19} \approx 91\%$ (here 0.81 = 1 – 0.19 is the probability that a word is not conserved)
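Both figures are binomial tail probabilities: one minus the chance of seeing fewer than N conserved words. The sketch below assumes independent words and ignores the separation constraint, as the slide's arithmetic does; the function name is hypothetical.

```python
from math import comb

def prob_at_least_n_words(conservation, k, num_words, n):
    """P[at least n of num_words independent k-words are fully conserved]."""
    p = conservation ** k
    fewer = sum(comb(num_words, i) * p ** i * (1 - p) ** (num_words - i)
                for i in range(n))
    return 1 - fewer

single = prob_at_least_n_words(0.81, k=16, num_words=10, n=1)
double = prob_at_least_n_words(0.81, k=8, num_words=20, n=2)
```

These evaluate to roughly 0.29 and 0.91, matching the slide.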
41
Intuition Behind Multiple Words
If we use a single word of size k = 16: Probability of a match = 29% (each word conserved with probability 3%)
If we use N = 2 words of size k = 8: Probability of a match = 91% (each word conserved with probability 19%)
42
Multiple Words Results
43
Single perfect match: Multiple perfect matches: For the same sensitivity, multiple words variant about 1,200 times faster
44
Multiple Words Results Single perfect match: Multiple perfect matches: For the same sensitivity, multiple words variant about 75,000 times faster
45
Multiple Words Results Much better than single matches Bigger improvement even than wildcards
46
Multiple Words Results Why not combine them: Multiple Wildcard Matches?
47
Status Check
Background Recap
Simple Tricks of the Trade: Wildcards, Multiple Words
State of the Art: Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds
48
Seed Patterns
Contiguous word (k = 10):
GTCAGTACGTCAGTCGTGCGTCGTCTAG
××××××××××
gives word GTCAGTACGT
Seed pattern:
GTCAGTACGTCAGTCGTGCGTCGTCTAG
××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙×
gives word GTATTAGGCG
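Applying a seed pattern means keeping only the bases under the × marks. In this sketch the pattern is written with '1' for × and '0' for ∙, a notational assumption made to keep the code ASCII; the function name is hypothetical.

```python
def apply_seed(sequence, pattern, start=0):
    """Return the word sampled from sequence by pattern, beginning at start."""
    window = sequence[start:start + len(pattern)]
    if len(window) < len(pattern):
        raise ValueError("pattern does not fit at this position")
    return "".join(base for base, mark in zip(window, pattern) if mark == "1")

seq = "GTCAGTACGTCAGTCGTGCGTCGTCTAG"
contiguous = apply_seed(seq, "1" * 10)
spaced = apply_seed(seq, "1101010001010001010001000001")
```

Both words have ten sampled bases, but the spaced one spans 28 positions of the sequence.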
49
Intuition Behind Seed Patterns
Patterns increase the likelihood of at least one match within a long conserved region
Figure: positions shared between shifted copies of a word, Consecutive Positions vs. Non-Consecutive Positions (labels: 3 common, 5 common, 7 common, 6 common)
On a 100-long 70% conserved region:
                          Consecutive   Non-consecutive
Expected # hits:          1.07          0.97
Prob[at least one hit]:   0.30          0.47
50
Advantage of Patterns 11 positions 10 positions
51
Status Check
Background Recap
Simple Tricks of the Trade: Wildcards, Multiple Words
State of the Art: Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds
52
Designing Seeds
Is this a good seed pattern? ×∙×∙×∙×∙×∙×∙×∙×∙×∙×
Compared with a copy shifted by one position: 0 matches!
Compared with a copy shifted by two positions: 9 matches
Not so much …
53
Designing Seeds
Remember three-periodicity in exons?
Higher mutation rate in last position of codon
Human    CCT GTT  (Proline, Valine)
Mouse    CCA GTC  (Proline, Valine)
Rat      CCA GTC  (Proline, Valine)
Dog      CCG GTA  (Proline, Valine)
Chicken  CCC GTG  (Proline, Valine)
A decent pattern is thus $\pi_{110}$: ××∙××∙××∙××∙××∙×
But isn’t regularity bad?
54
Optimizing Seeds
Hard problem
No efficient solution known for fixed word size k and max span s
Example: ××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙× has k = 10 (sampled positions numbered 1 to 10) and s = 28
55
Computing Detection Probabilities
How sensitive is a specific seed π?
Construct finite-state automaton $A_\pi$ that accepts strings containing it
Example: $\pi_{101}$ (×∙×)
Regular expression: (0 + 1)* 1 (0 + 1) 1 (0 + 1)*
Dynamic program to convert into DFA
Figure: DFA for $\pi_{101}$ with states START and YES
56
Computing Detection Probabilities
Compute probability a sequence of length L will be accepted by the DFA
Markov model M (nth order) which dictates conservation pattern
Dynamic program: $\Theta(k L 2^{s-k+n})$
Example model (figure): state CDS with P(1) = 0.9, P(0) = 0.1; state NC with P(1) = 0.8, P(0) = 0.2; transition probabilities 0.05 and 0.95
Example conservation string of length L: 101101110111
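A simplified version of this computation can be sketched for the easiest case: each position is conserved independently with the same probability (a 0th-order model, not the CDS/NC model in the figure). The sketch tracks the last s - 1 symbols as its state, so it costs on the order of L times 2^(s-1) rather than the bound quoted above. The pattern uses '1' for × and '0' for ∙; the function name is hypothetical.

```python
def seed_hit_probability(pattern, length, p_match):
    """P[seed matches at least once in a random 0/1 string of given length],
    where each symbol is 1 independently with probability p_match."""
    span = len(pattern)
    # Bit mask of required positions; the oldest symbol is the highest bit.
    need = sum(1 << (span - 1 - i) for i, c in enumerate(pattern) if c == "1")
    keep = (1 << (span - 1)) - 1
    no_hit = {0: 1.0}  # state (last span-1 symbols) -> P[state, no hit yet]
    for t in range(length):
        nxt = {}
        for state, prob in no_hit.items():
            for bit, pb in ((1, p_match), (0, 1.0 - p_match)):
                window = (state << 1) | bit
                if t >= span - 1 and window & need == need:
                    continue  # the seed hits here: leaves the no-hit set
                key = window & keep
                nxt[key] = nxt.get(key, 0.0) + prob * pb
        no_hit = nxt
    return 1.0 - sum(no_hit.values())
```

Calling it with a contiguous pattern and a spaced pattern of the same weight on the same region length gives a direct comparison of their sensitivities under this simple model.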
57
Mandala Algorithm
Fast, practical seed design
Start with random seed
Jump to best neighbor (moving one × to unused ∙)
Keep jumping until no better neighbors (greedy hill climbing)
Finds local optimum
Random restarts to try to find globally optimal seed
Figure: a search path through neighboring seeds (for example from ×∙×∙××∙××∙∙∙× to ×∙∙∙××∙××∙∙××), ending at a seed with NO BETTER NEIGHBORS
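The greedy search described above can be sketched as follows. This is not the Mandala implementation: the objective is a simple independent-position hit probability, both end positions are kept fixed so the span stays exactly s, and all names and parameters are assumptions. Patterns use '1' for × and '0' for ∙.

```python
import random

def seed_hit_probability(pattern, length, p_match):
    """P[seed hits at least once], positions conserved independently."""
    span = len(pattern)
    need = sum(1 << (span - 1 - i) for i, c in enumerate(pattern) if c == "1")
    keep = (1 << (span - 1)) - 1
    no_hit = {0: 1.0}
    for t in range(length):
        nxt = {}
        for state, prob in no_hit.items():
            for bit, pb in ((1, p_match), (0, 1.0 - p_match)):
                window = (state << 1) | bit
                if t >= span - 1 and window & need == need:
                    continue
                nxt[window & keep] = nxt.get(window & keep, 0.0) + prob * pb
        no_hit = nxt
    return 1.0 - sum(no_hit.values())

def neighbors(pattern):
    """Seeds reached by moving one interior '1' to an unused interior '0'."""
    inner = range(1, len(pattern) - 1)
    for i in (i for i in inner if pattern[i] == "1"):
        for j in (j for j in inner if pattern[j] == "0"):
            cells = list(pattern)
            cells[i], cells[j] = "0", "1"
            yield "".join(cells)

def hill_climb(weight, span, region_len, p_match, restarts=5, rng=None):
    """Greedy hill climbing with random restarts; returns (seed, score)."""
    rng = rng or random.Random(0)
    best_seed, best_score = None, -1.0
    for _ in range(restarts):
        chosen = set(rng.sample(range(1, span - 1), weight - 2)) | {0, span - 1}
        seed = "".join("1" if i in chosen else "0" for i in range(span))
        score = seed_hit_probability(seed, region_len, p_match)
        while True:
            scored = [(seed_hit_probability(n, region_len, p_match), n)
                      for n in neighbors(seed)]
            if not scored or max(scored)[0] <= score:
                break  # no better neighbor: local optimum
            score, seed = max(scored)
        if score > best_score:
            best_seed, best_score = seed, score
    return best_seed, best_score
```

Each restart stops at a local optimum, and the best one over all restarts is returned; nothing here guarantees the global optimum.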
58
Mandala Results
Very fast: 20 seconds on k = 11, s = 22
Within 1% of true optimum in all trials
Mandala pattern for non-coding DNA (k = 11): ××××∙×××∙∙×∙×××
Mandala pattern for coding DNA (k = 11): ×××∙∙∙∙∙××∙××∙××∙××
59
Seed Pattern Results Non-coding sequences Coding sequences
60
Motivation for Improvement
Good CDS seeds: ××∙××∙××∙××∙××
Good NC seeds: ××××∙×××∙∙×∙××
61
Status Check
Background Recap
Simple Tricks of the Trade: Wildcards, Multiple Words
State of the Art: Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds
62
Multiple Simultaneous Seeds
Sequence: GTCAGTACGTCAGTCGTGCGTCGTCTAG
Single seed:
××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙× gives word GTATTAGGCG
Two more seeds applied to the same sequence:
×××∙×∙∙×∙∙∙∙×∙∙∙∙∙∙∙×∙∙∙∙××× gives word GTCGCGTTAG
∙∙∙××∙××∙××∙××∙××∙∙∙∙∙∙∙∙∙∙∙ gives word AGACTCGTGT
How much longer is it taking us to index?
How much longer is it taking us to search?
THREE INDEXES, one per seed, each of the form: ...AGCCAGTCAG...GACAGTCCAG...GGCATCATCA...
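Using several seeds at once means building one index per seed. The sketch below writes the three seeds from this slide with '1' for × and '0' for ∙; the function name is hypothetical.

```python
def build_multi_seed_indexes(sequence, patterns):
    """Return one {word: [start positions]} dictionary per seed pattern."""
    indexes = []
    for pattern in patterns:
        sampled = [i for i, c in enumerate(pattern) if c == "1"]
        index = {}
        for start in range(len(sequence) - len(pattern) + 1):
            word = "".join(sequence[start + i] for i in sampled)
            index.setdefault(word, []).append(start)
        indexes.append(index)
    return indexes

seeds = [
    "1101010001010001010001000001",
    "1110100100001000000010000111",
    "00011011011011011" + "0" * 11,
]
indexes = build_multi_seed_indexes("GTCAGTACGTCAGTCGTGCGTCGTCTAG", seeds)
```

In this sketch both indexing and lookup work grow roughly linearly with the number of seeds, since every position is sampled once per seed.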
63
Mandala still works
Just change definition of neighbor: pick one seed, pick one ×, move it to a ∙
Slower convergence
Figure: example seed set ×∙×∙××∙××∙∙∙×, ×∙∙∙××∙××∙∙××, ∙××∙××∙××∙××∙, ×××∙×∙∙×∙××∙×
64
Results Non-coding sequences Back to earlier results …
65
Future Extensions
Combine all of the above!
Multiple Wildcard Matches of Multiple Patterns?
Database: ACGATGGCTCCTGACGAGCTAGCTGATGAGCCGTAGCAGACGTGTAACGGTCGCTCCTGCAGCGA
Query: ACGATGCATCGTAGCTAGCTAGCTGATCAGCTGTAGTAGTCGTCTACTGGATGCTGCTGCAGCGT
Figure (seed words with wildcards): * GT∙G∙T∙∙∙T GTA∙C∙∙G∙∙ ∙∙∙GC∙AG∙T * TC∙T∙T∙∙∙G TCG∙C∙∙C∙∙ ∙∙∙TC∙AC∙G