Indexed Alignment: Tricks of the Trade. Ross David Bayer, 18th October 2005. Note: many diagrams taken from Serafim’s CS 262 class.

Roadmap: Background (Recap); Simple Tricks of the Trade (Wildcards, Multiple Words); State of the Art (Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds)

Status Check: Background (Recap); Simple Tricks of the Trade (Wildcards, Multiple Words); State of the Art (Seed Patterns, Optimizing Seeds, Multiple Simultaneous Seeds)

Motivation We have a newly discovered gene: Does it occur in other species? How fast does it evolve? We want to “find” this gene in other species But there will be mutations

Sequence Alignment
Input sequences: AGGCTATCACCTGACCTCCAGGCCGATGCCC and TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Alignment:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Global Alignment: Needleman-Wunsch (dynamic programming). Running time: O(MN). Sequence of length M: AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA. Sequence of length N: AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC.

Local Alignment: Smith-Waterman. Running time: O(MN). Sequence of length M: AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA. Sequence of length N: AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC. Modifications: store 0 instead of negative values; search the entire table for the maximum.
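The two modifications named on this slide (store 0 instead of negative values, then take the maximum over the whole table) can be sketched as follows. This is a minimal illustration rather than the presenter's code: the function name and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for the example.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = table[i - 1][j] + gap
            left = table[i][j - 1] + gap
            # Modification 1: never store a negative value.
            table[i][j] = max(0, diag, up, left)
            # Modification 2: the answer is the maximum over the entire table.
            best = max(best, table[i][j])
    return best
```

The double loop visits every one of the M x N cells, which is where the O(MN) running time comes from.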

Alignment Applications We have our newly discovered gene: Does it occur in other species? How fast does it evolve?

Complete genomes today About 300 complete genomes have been sequenced

GenBank Growth Exponential growth in total sequence data Recently exceeded 100 Gbp (10^11 base pairs)

More DNA is coming …

Alignment Applications We have our newly discovered gene: Does it occur in other species? How fast does it evolve? Assume we try Smith-Waterman: our new gene (10^4 bp) against the entire genomic database (10^11 bp) means 10^15 cells.

Indexed Alignment (BLAST: Basic Local Alignment Search Tool) Main idea: 1. Construct a dictionary of all words in the query. 2. Initiate a local alignment for each word match between query and DB. Running time: O(MN) in the worst case; in practice, however, orders of magnitude faster than Smith-Waterman.

Step 1 (Basic): Construct dictionary of query words (BLAST). The query is indexed by all words of size k (k = 3 in our examples; k ≈ 11 in practice).
Query: AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCG…
Query words: AGG GGC GCT CTA TAT ATC TCA CAC ACC CCT CTG TGA GAC ACC CCT CTC TCC CCA CAG AGG GGC GCC CCG CGA GAT ATG TGC GCC CCC CCT CTA TAG AGC GCT CTA TAT ATC TCA CAC ACG CGA GAC ACC CCG CGC GCG
Index: AAA AAC AAG AAT ACA ... AGG ... CTA ... GCT ... GGC ...
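A dictionary of query words like the one on this slide can be sketched as a map from each word of size k to the positions where it starts. The function name `build_word_index` is hypothetical, and a plain in-memory dictionary is an assumption made for illustration; production tools use more compact tables.

```python
from collections import defaultdict


def build_word_index(sequence, k):
    """Map every length-k word of `sequence` to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index


# Usage with the slide's query prefix and k = 3.
query_index = build_word_index("AGGCTATCACCTGACCTCCAGGCC", 3)
```

Looking up a word is then a single dictionary access, which is what makes the later database scan fast.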

Step 1 (Advanced): Relative Generation (BLAST). For each query word, generate all relatives. A relative is a word with alignment score ≥ T. All relatives are updated to point to the new location.
Example: query word GGC, threshold T = 28. Candidate scores: GGC 30, AGC 28, GAC 28, AAC 26, GGT 25, GGA 24, ... so only GGC, AGC and GAC reach the threshold.
Query: AGGCTATCACCTGACCTCCAGGCCG…
Index: ... AGC ... GAC ... GGC ...

Step 2: Searching (BLAST). Search through the database linearly, one word at a time. Initiate an alignment with all occurrences of that word in the query.
Genomic database: AGCTAGCTGCTAGTCAGTCGATGCATGCTACTAGCTGCGATCGTCGTC…
Index: AAA AAC AAG AAT ACA ... AGC ... CTA ... GCT ... TAG ...
Query: AGCGCT (words shown: AGC, GCT)

Alignment Extension (BLAST). Example: the matching word GGT initiates an alignment. Extension to the left and right with no gaps until the alignment score falls a certain threshold S below the best score so far.
Dot-matrix axis labels as extracted: ACGAAGTAAGGTCCAGT and CCCTTCCTGGATTGCGA (the second reads right to left).
Output:
GTAAGGTCC
GTTAGGTCC
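The ungapped extension rule described here (keep extending in both directions until the score drops too far below the best seen) can be sketched like this. The function name, the match/mismatch scores of +1/-1, and the default threshold are assumptions for illustration; the stopping rule used is "stop once the score is more than `drop` below the best".

```python
def extend_ungapped(query, db, q_pos, d_pos, k, drop=2, match=1, mismatch=-1):
    """Extend a length-k word hit at query[q_pos] / db[d_pos] without gaps.

    Returns (q_start, q_end, score), with the query interval half-open.
    The database interval is the same one shifted by d_pos - q_pos.
    """
    def extend(step, qi, di):
        score = best = 0
        length = best_len = 0
        while 0 <= qi < len(query) and 0 <= di < len(db):
            score += match if query[qi] == db[di] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length  # remember the best prefix
            elif best - score > drop:
                break  # fell too far below the best score so far
            qi += step
            di += step
        return best, best_len

    right_score, right_len = extend(+1, q_pos + k, d_pos + k)
    left_score, left_len = extend(-1, q_pos - 1, d_pos - 1)
    total = k * match + left_score + right_score
    return q_pos - left_len, q_pos + k + right_len, total
```

Only the best-scoring prefix in each direction is kept, so trailing mismatches that triggered the stop are trimmed from the reported interval.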

BLAST Algorithm Variations: BLAT (BLAST-Like Alignment Tool) 1. Builds the index (dictionary) for the database and scans linearly through the query. 2. Alignment extensions allow for gaps as well.

Gapped Extensions (BLAT). Extensions with gaps in a band around the anchor. Extension to the left and right until the alignment score falls a certain threshold S below the best score so far.
Dot-matrix axis labels as extracted: ACGAAGTAAGGTCCAGT and CTGATCCTGGATTGCGA (the second reads right to left).
Output:
GTAAGGTCC-AG
GTTAGGTCCTAG

Perfect Match Results Perfect Match: no relatives generated

Perfect Match Results

Interpreting Results Word size k

Interpreting Results Conservation rate Conservation rate: 81% Mutation rate: 19%

Interpreting Results Sensitivity: the probability of a particular homologous area being identified. Larger k decreases this probability (an exact match is less likely). Straightforward mathematics.

Sensitivity Calculation. Homologous area between database (genome) and query. Suppose k = 7, conservation rate 81%, mutation rate 19%. Probability the whole word is conserved: 0.81^7 ≈ 23%

Sensitivity Calculation. Homologous area between database (genome) and query. Suppose k = 7 and the area contains 10 words. Probability a particular word is conserved: 23%. Probability at least one word is conserved: 1 - 0.77^10 ≈ 93%
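The arithmetic on these two slides can be reproduced with a short sketch. Like the slides, it treats the words in the homologous area as independent, which is a simplification; the function names are hypothetical.

```python
def word_conserved_prob(conservation, k):
    """Probability that all k positions of one word are conserved."""
    return conservation ** k


def at_least_one_word_prob(conservation, k, num_words):
    """Probability that at least one of num_words words is fully conserved,
    treating the words as independent (the same simplification as the slides)."""
    p_word = word_conserved_prob(conservation, k)
    return 1.0 - (1.0 - p_word) ** num_words


# Usage: the slide's setting of 81% conservation, k = 7, 10 words.
p_word = word_conserved_prob(0.81, 7)
p_area = at_least_one_word_prob(0.81, 7, 10)
```

Raising k makes `p_word` shrink geometrically, which is the loss of sensitivity the slides refer to.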

Interpreting Results Specificity Expected number of alignments initiated by chance Based on 500 bp query and 3 Gbp database This is essentially an indication of SPEED

Interpreting Results SPEED Expected number of alignments initiated by chance Based on 500 bp query and 3 Gbp database This is essentially an indication of SPEED
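One common way to estimate the number of alignments initiated by chance is to assume uniformly random, independent bases, so that any query word matches any database word with probability 4^-k. That model is an assumption made here for illustration; the figures behind the slide's table may have been computed differently. The function name is hypothetical.

```python
def expected_chance_hits(query_len, db_len, k, alphabet_size=4):
    """Expected number of exact k-word matches between two random sequences.

    Assumes uniform, independent bases (an illustrative model, not
    necessarily the one used for the slide's numbers).
    """
    query_words = max(query_len - k + 1, 0)
    db_words = max(db_len - k + 1, 0)
    return query_words * db_words / alphabet_size ** k


# Usage: the slide's setting of a 500 bp query and a 3 Gbp database.
for k in (7, 9, 11, 13, 15):
    print(k, expected_chance_hits(500, 3_000_000_000, k))
```

Each extra position in the word divides the expected chance hits by about 4, which is why larger k means less wasted extension work.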

The Classic BLAST Tradeoff As we increase k … Sensitivity gets worse Speed gets better

Wildcards: Exact matches are unlikely for larger values of k. Include variants with one “wildcard” placed in each position: GTA gives *TA, G*A, GT*. This is relative generation with any match: 1, any mismatch: 0, and threshold T = k - 1.
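The wildcard scheme can be sketched as generating one variant per position and checking the T = k - 1 threshold directly. Function names are hypothetical, and this is an illustration of the idea rather than how any particular tool stores its index.

```python
def wildcard_variants(word, wildcard="*"):
    """Return the word itself plus one variant per position with that position wildcarded."""
    return [word] + [word[:i] + wildcard + word[i + 1:] for i in range(len(word))]


def is_relative(word_a, word_b):
    """True if two equal-length words score at least T = k - 1
    under match = 1, mismatch = 0 (i.e. they differ in at most one position)."""
    if len(word_a) != len(word_b):
        return False
    score = sum(1 for x, y in zip(word_a, word_b) if x == y)
    return score >= len(word_a) - 1
```

Two words are relatives under this threshold exactly when they share at least one wildcard variant, so indexing the variants lets a single lookup find near matches.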

Wildcard Results Better?

Wildcard Results Perfect match: Wildcards: For the same sensitivity, wildcard variant is about 440 times faster

Wildcard Results Perfect match: Wildcards: For the same sensitivity, wildcard variant is about 40 times faster

Wildcard Results Better Sensitivity/speed tradeoff consistently improved

Multiple Words: Require N perfect matches with the same separation in query and database, and all separations less than distance W.
Database: TGCTAGCTACGATCTGCAGTGCGTAATCT…
Query: TCATTACATCGTGACTTGCAGTCGTCCAG…
Example with words TAC and TGC: equal separations (12 bp and 12 bp): INITIATE ALIGNMENT; unequal separations (12 bp and 7 bp): NO INITIATION.
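"Same separation in query and database" means the hits lie on the same diagonal (database position minus query position). A minimal sketch of the trigger rule follows, assuming N ≥ 2 and allowing overlapping words to count as separate hits; both are simplifications made here, and all names and defaults are hypothetical.

```python
from collections import defaultdict


def word_hits(query, db, k):
    """Yield (query_pos, db_pos) for every exact length-k word match."""
    index = defaultdict(list)
    for q in range(len(query) - k + 1):
        index[query[q:q + k]].append(q)
    for d in range(len(db) - k + 1):
        for q in index.get(db[d:d + k], ()):
            yield q, d


def multi_word_triggers(query, db, k, n_words=2, max_sep=40):
    """Return (diagonal, query_pos) pairs where n_words hits share a diagonal
    and consecutive hits are less than max_sep apart. Assumes n_words >= 2."""
    by_diagonal = defaultdict(list)
    for q, d in word_hits(query, db, k):
        by_diagonal[d - q].append(q)  # same diagonal == same separation
    triggers = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        run = 1
        for prev, cur in zip(positions, positions[1:]):
            run = run + 1 if cur - prev < max_sep else 1
            if run >= n_words:
                triggers.append((diagonal, cur))
                break  # one trigger per diagonal is enough for this sketch
    return triggers
```

A single chance hit no longer starts an extension, so shorter words can be used without flooding the search with spurious alignments.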

Intuition Behind Multiple Words. Homologous area between database (genome) and query, conservation rate 81%, mutation rate 19%. If we use a single word of size k = 16, the probability the whole word is conserved is 0.81^16 ≈ 3%

Intuition Behind Multiple Words. If we use a single word of size k = 16 and the homologous area contains 10 words: probability a particular word is conserved: 3%. Probability at least one word is conserved: 1 - 0.97^10 ≈ 29%

Intuition Behind Multiple Words. If we use a single word of size k = 16: probability of a match = 29%. If we use N = 2 words of size k = 8 and the homologous area contains 20 words: probability a particular word is conserved: 0.81^8 ≈ 19%. Probability at least two words are conserved: 1 - 0.81^20 - 20 × 0.19 × 0.81^19 ≈ 91% (here 0.81 = 1 - 0.19 is the probability that a word is not conserved, which happens to equal the conservation rate).
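The "at least two words" figure is a binomial tail: one minus the probability of zero conserved words, minus the probability of exactly one. The sketch below reproduces it under the slide's independence simplification; the function name is hypothetical.

```python
def at_least_two_words_prob(conservation, k, num_words):
    """Probability that at least two of num_words words are fully conserved,
    treating the words as independent (the slide's simplification)."""
    p = conservation ** k          # one word fully conserved
    q = 1.0 - p                    # one word not conserved
    none = q ** num_words
    exactly_one = num_words * p * q ** (num_words - 1)
    return 1.0 - none - exactly_one


# Usage: the slide's setting of 81% conservation, k = 8, 20 words.
p_two_hits = at_least_two_words_prob(0.81, 8, 20)
```

Halving the word size raises the per-word conservation probability enough that requiring two hits still beats a single long word.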

Intuition Behind Multiple Words. If we use a single word of size k = 16 (3% per word): probability of a match = 29%. If we use N = 2 words of size k = 8 (19% per word): probability of a match = 91%.

Multiple Words Results

Multiple Words Results Single perfect match: Multiple perfect matches: For the same sensitivity, multiple words variant about 1,200 times faster

Multiple Words Results Single perfect match: Multiple perfect matches: For the same sensitivity, multiple words variant about 75,000 times faster

Multiple Words Results Much better than single matches Bigger improvement even than wildcards

Multiple Words Results Why not combine them: Multiple Wildcard Matches?

Seed Patterns
Contiguous word (k = 10): sequence GTCAGTACGTCAGTCGTGCGTCGTCTAG, mask ××××××××××, word GTCAGTACGT
Seed pattern: sequence GTCAGTACGTCAGTCGTGCGTCGTCTAG, mask ××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙×, word GTATTAGGCG
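Extracting a word under a seed pattern just means reading the sequence at the × positions and skipping the rest. In the sketch below the mask is written as a string of 1s (×) and 0s (∙), a notational assumption; the function names are hypothetical.

```python
def seed_positions(pattern):
    """Offsets of the sampled ('1') positions inside a 0/1 seed pattern."""
    return [i for i, c in enumerate(pattern) if c == "1"]


def seeded_words(sequence, pattern):
    """Yield (start, word) for every placement of the pattern on the sequence."""
    offsets = seed_positions(pattern)
    span = len(pattern)
    for start in range(len(sequence) - span + 1):
        yield start, "".join(sequence[start + off] for off in offsets)


# Usage: the slide's sequence with its weight-10, span-28 pattern.
SEQ = "GTCAGTACGTCAGTCGTGCGTCGTCTAG"
SPACED = "1101010001010001010001000001"
words = list(seeded_words(SEQ, SPACED))
```

A contiguous word is the special case where the pattern is all 1s, so the same indexing code serves both.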

Intuition Behind Seed Patterns: Patterns increase the likelihood of at least one match within a long conserved region.
[Figure: overlap between shifted hits, panels "Consecutive Positions" and "Non-Consecutive Positions"; labels: 3 common, 5 common, 7 common, 6 common]
On a 100-long 70% conserved region:
                          Consecutive   Non-consecutive
Expected # hits:          1.07          0.97
Prob[at least one hit]:   0.30          0.47

Advantage of Patterns 11 positions 10 positions

Designing Seeds Is this a good seed pattern? ×∙×∙×∙×∙×∙×∙×∙×∙×∙× ×∙×∙×∙×∙×∙×∙×∙×∙×∙× 0 matches! ×∙×∙×∙×∙×∙×∙×∙×∙×∙× 9 matches Not so much …

Designing Seeds: Remember three-periodicity in exons? There is a higher mutation rate in the last position of a codon. A decent pattern is thus π_110: ××∙××∙××∙××∙××∙× But isn’t regularity bad?
Human    CCT GTT  (Proline, Valine)
Mouse    CCA GTC  (Proline, Valine)
Rat      CCA GTC  (Proline, Valine)
Dog      CCG GTA  (Proline, Valine)
Chicken  CCC GTG  (Proline, Valine)

Optimizing Seeds: Hard problem. No efficient solution is known for fixed word size k and max span s. Example: ××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙× has k = 10 (the × positions, numbered 1 to 10) and s = 28.

Computing Detection Probabilities: How sensitive is a specific seed π? Construct a finite-state automaton A_π that accepts strings containing it. Example for π_101 (×∙×): regular expression (0 + 1)* 1 (0 + 1) 1 (0 + 1)*. Dynamic program to convert into a DFA. [Figure: DFA for π_101 with states START and YES]

Computing Detection Probabilities: Compute the probability that a sequence of length L will be accepted by the DFA. A Markov model M (n-th order) dictates the conservation pattern. Dynamic program: Θ(k L 2^(s - k + n)). [Figure: two-state model with CDS: P(1) = 0.9, P(0) = 0.1 and NC: P(1) = 0.8, P(0) = 0.2; transition probabilities 0.05 and 0.95; example conservation string 101101110111 of length L]
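A simplified version of this computation can be sketched directly. The sketch makes two loudly stated assumptions that differ from the slide: each position is conserved independently with a fixed probability (a zeroth-order model rather than an n-th order Markov model), and the state is simply the last s - 1 conservation bits rather than a minimized DFA, giving roughly O(L 2^s) work. The function name is hypothetical, and the seed is written as a 0/1 string.

```python
from collections import defaultdict


def hit_probability(pattern, length, p_match):
    """P(seed `pattern` hits at least once) in `length` independent positions,
    each conserved (bit 1) with probability p_match.

    Simplified model: i.i.d. positions, state = last span-1 bits.
    """
    span = len(pattern)
    mask = int(pattern, 2)            # newest position is the lowest bit
    keep = (1 << (span - 1)) - 1      # retains the last span-1 bits
    no_hit = {0: 1.0}                 # state -> probability mass with no hit yet
    for pos in range(length):
        nxt = defaultdict(float)
        for state, prob in no_hit.items():
            for bit, p_bit in ((1, p_match), (0, 1.0 - p_match)):
                window = (state << 1) | bit
                # A hit needs a full window and 1s at every sampled position.
                if pos >= span - 1 and (window & mask) == mask:
                    continue          # this mass has hit; drop it from no_hit
                nxt[window & keep] += prob * p_bit
        no_hit = nxt
    return 1.0 - sum(no_hit.values())
```

Tracking only the no-hit mass keeps the recurrence simple: whatever probability is missing at the end must have produced at least one hit.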

Mandala Algorithm: Fast, practical seed design. Start with a random seed. Jump to the best neighbor (a neighbor is obtained by moving one × to an unused ∙). Keep jumping until there are no better neighbors (greedy hill climbing). This finds a local optimum. Use random restarts to try to find the globally optimal seed. [Figure: example hill-climbing moves between neighboring seeds, ending at a seed with NO BETTER NEIGHBORS]
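The greedy search described on this slide can be sketched as below. This is an illustration of hill climbing with random restarts, not the Mandala implementation: the sensitivity function assumes independent positions with a fixed conservation probability, seeds are 0/1 strings, and every name and default is hypothetical.

```python
import random
from collections import defaultdict


def hit_probability(pattern, length, p_match):
    """P(at least one seed hit) under i.i.d. conservation (simplified model)."""
    span, mask = len(pattern), int(pattern, 2)
    keep = (1 << (span - 1)) - 1
    no_hit = {0: 1.0}
    for pos in range(length):
        nxt = defaultdict(float)
        for state, prob in no_hit.items():
            for bit, p_bit in ((1, p_match), (0, 1.0 - p_match)):
                window = (state << 1) | bit
                if pos >= span - 1 and (window & mask) == mask:
                    continue
                nxt[window & keep] += prob * p_bit
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


def neighbors(seed):
    """All seeds obtained by moving one sampled position ('1') to an unused one ('0')."""
    ones = [i for i, c in enumerate(seed) if c == "1"]
    zeros = [i for i, c in enumerate(seed) if c == "0"]
    for src in ones:
        for dst in zeros:
            chars = list(seed)
            chars[src], chars[dst] = "0", "1"
            yield "".join(chars)


def hill_climb(weight, span, length, p_match, restarts=5, rng=None):
    """Greedy hill climbing with random restarts; returns (seed, sensitivity)."""
    rng = rng or random.Random(0)
    best_seed, best_score = None, -1.0
    for _ in range(restarts):
        care = set(rng.sample(range(span), weight))
        seed = "".join("1" if i in care else "0" for i in range(span))
        score = hit_probability(seed, length, p_match)
        while True:
            scored = [(hit_probability(n, length, p_match), n) for n in neighbors(seed)]
            if not scored:
                break
            cand_score, cand = max(scored)
            if cand_score <= score + 1e-12:
                break  # no better neighbors: local optimum
            seed, score = cand, cand_score
        if score > best_score:
            best_seed, best_score = seed, score
    return best_seed, best_score
```

Each accepted move strictly increases the score, so every restart terminates; the restarts only improve the odds that the best local optimum found is the global one.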

Mandala Results: Very fast: 20 seconds on k = 11, s = 22. Within 1% of the true optimum in all trials. Mandala pattern for non-coding DNA (k = 11): ××××∙×××∙∙×∙××× Mandala pattern for coding DNA (k = 11): ×××∙∙∙∙∙××∙××∙××∙××

Seed Pattern Results Non-coding sequences Coding sequences

Motivation for Improvement Good CDS seeds ××∙××∙××∙××∙×× Good NC seeds ××××∙×××∙∙×∙××

Multiple Simultaneous Seeds
Single seed: sequence GTCAGTACGTCAGTCGTGCGTCGTCTAG, pattern ××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙×
Multiple seeds on the same sequence, giving THREE INDEXES (one per pattern):
××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙×  word GTATTAGGCG
×××∙×∙∙×∙∙∙∙×∙∙∙∙∙∙∙×∙∙∙∙×××  word GTCGCGTTAG
∙∙∙××∙××∙××∙××∙××∙∙∙∙∙∙∙∙∙∙∙  word AGACTCGTGT
Each index: ... AGCCAGTCAG ... GACAGTCCAG ... GGCATCATCA ...
How much longer is it taking us to index? How much longer is it taking us to search?

Mandala still works: just change the definition of neighbor (pick one seed, pick one ×, move it to a ∙). Slower convergence. Example seed set: ×∙×∙××∙××∙∙∙× ×∙∙∙××∙××∙∙×× ∙××∙××∙××∙××∙ ×××∙×∙∙×∙××∙×

Results Non-coding sequences Back to earlier results …

Future Extensions Combine all of the above! Multiple Wildcard Matches of Multiple Patterns? * GT∙G∙T∙∙∙T GTA∙C∙∙G∙∙ ∙∙∙GC∙AG∙T * TC∙T∙T∙∙∙G TCG∙C∙∙C∙∙ ∙∙∙TC∙AC∙G ACGATGGCTCCTGACGAGCTAGCTGATGAGCCGTAGCAGACGTGTAACGGTCGCTCCTGCAGCGA ACGATGCATCGTAGCTAGCTAGCTGATCAGCTGTAGTAGTCGTCTACTGGATGCTGCTGCAGCGT Database Query
