Indexed Alignment Tricks of the Trade Ross David Bayer 18 th October, 2005 Note: many diagrams taken from Serafim’s CS 262 class.

Presentation transcript:


Roadmap Background Recap Simple Tricks of the Trade Wildcards Multiple Words State of the Art Seed Patterns Optimizing Seeds Multiple Simultaneous Seeds

Status Check Background Recap Simple Tricks of the Trade Wildcards Multiple Words State of the Art Seed Patterns Optimizing Seeds Multiple Simultaneous Seeds

Motivation We have a newly discovered gene: Does it occur in other species? How fast does it evolve? We want to “find” this gene in other species But there will be mutations

Sequence Alignment
Input:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Aligned:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Global Alignment: Needleman-Wunsch (Dynamic Programming). Running Time: O(MN).
Sequence of length M: AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
Sequence of length N: AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

Local Alignment: Smith-Waterman. Running Time: O(MN).
Sequence of length M: AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
Sequence of length N: AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Modifications relative to Needleman-Wunsch: store 0 instead of negative values; search the entire table for the maximum.
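
The two Smith-Waterman modifications can be sketched as follows. This is a minimal score-only version (no traceback); the scoring values `match`, `mismatch` and `gap` are illustrative assumptions, not values from the slides.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of a and b in O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Modification 1: clamp at 0 instead of storing negative values.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            # Modification 2: the answer is the maximum over the whole table.
            best = max(best, curr[j])
        prev = curr
    return best
```

Only two rows are kept because the sketch reports the score alone; recovering the alignment itself would need the full table or backpointers.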

Alignment Applications We have our newly discovered gene: Does it occur in other species? How fast does it evolve?

Complete genomes today About 300 complete genomes have been sequenced

GenBank Growth: Exponential growth in total sequence data. Recently exceeded 100 Gbp (10^11 base pairs).

More DNA is coming …

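Alignment Applications We have our newly discovered gene: Does it occur in other species? How fast does it evolve? Assume we try Smith-Waterman: the table needs one cell for every pairing of a query position with a database position, with our new gene (about 10^4 bp) on one axis and the entire genomic database on the other.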

Indexed Alignment (BLAST: Basic Local Alignment Search Tool)
Main idea:
1. Construct a dictionary of all words in the query
2. Initiate a local alignment for each word match between query and DB
Running Time: O(MN) in worst case. However, in practice it is orders of magnitude faster than Smith-Waterman.

Step 1 (Basic): Construct dictionary of query words (BLAST)
Query indexed by all words of size k (k = 3 in our examples; k ≈ 11 in practice).
Query: AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCG…
Query words: AGG GGC GCT CTA TAT ATC TCA CAC ACC CCT CTG TGA GAC ACC CCT CTC TCC CCA CAG AGG GGC GCC CCG CGA GAT ATG TGC GCC CCC CCT CTA TAG AGC GCT CTA TAT ATC TCA CAC ACG CGA GAC ACC CCG CGC GCG
INDEX: AAA AAC AAG AAT ACA ... AGG ... CTA ... GCT ... GGC ...
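
A minimal sketch of the dictionary construction, assuming the index is a plain hash map from each word to the list of query positions where it starts (the name `build_index` is hypothetical):

```python
from collections import defaultdict

def build_index(seq, k):
    """Map every length-k word of seq to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

query = "AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCG"
index = build_index(query, 3)
```

With k = 3 this reproduces the 46 query words listed on the slide; repeated words such as AGG simply accumulate several positions.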

Step 1 (Advanced): Relative Generation (BLAST)
For each query word, generate all relatives. A relative is a word with alignment score ≥ T against the query word. The index entry of every relative is updated to point to the query word's location.
Query word: GGC. Threshold: T = 28.
Scored candidates: GGC 30, AGC 28, GAC 28, AAC 26, GGT 25, GGA 24, ... Those scoring at least 28 (GGC, AGC, GAC) are relatives.
Query: AGGCTATCACCTGACCTCCAGGCCG…
INDEX: ... AGC ... GAC ... GGC ...

Step 2: Searching (BLAST)
Search through the database linearly, one word at a time. Initiate an alignment with all occurrences of that word in the query.
Genomic database: AGCTAGCTGCTAGTCAGTCGATGCATGCTACTAGCTGCGATCGTCGTC…
INDEX: AAA AAC AAG AAT ACA ... AGC ... CTA ... GCT ... TAG ...
Query: AGCGCT (words AGC, GCT)
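
The linear scan can be sketched as below. It is a simplified illustration that only reports seed hits as (query position, database position) pairs; `find_seed_hits` is a hypothetical name, and extension of each hit is left out.

```python
from collections import defaultdict

def find_seed_hits(query, database, k):
    """Return (query_pos, db_pos) for every exact k-word shared by both."""
    index = defaultdict(list)
    for q in range(len(query) - k + 1):
        index[query[q:q + k]].append(q)
    hits = []
    # One pass over the database, one dictionary lookup per word.
    for d in range(len(database) - k + 1):
        for q in index.get(database[d:d + k], ()):
            hits.append((q, d))
    return hits
```

For example, `find_seed_hits("AGCGCT", "TTAGCTT", 3)` reports the AGC and GCT matches and nothing else.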

Alignment Extension (BLAST)
Example: The matching word GGT initiates an alignment. Extension proceeds to the left and right with no gaps until the alignment score falls more than a certain threshold S below the best score so far.
Grid axes: ACGAAGTAAGGTCCAGT and CCCTTCCTGGATTGCGA
Output:
GTAAGGTCC
GTTAGGTCC
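
A sketch of ungapped extension with a drop-off threshold. The scoring (+1 match, -1 mismatch) and the threshold value `drop=2` are assumptions chosen for illustration. The second sequence is the slide's second grid axis read in reverse, which is also an assumption about how the grid was drawn.

```python
def extend_ungapped(query, db, q, d, k, match=1, mismatch=-1, drop=2):
    """Extend the seed query[q:q+k] == db[d:d+k] in both directions.

    Returns (query_start, query_end, db_start, score), end exclusive.
    """
    def walk(qi, di, step):
        score = best = best_len = n = 0
        while 0 <= qi < len(query) and 0 <= di < len(db):
            score += match if query[qi] == db[di] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            if best - score > drop:  # fell more than `drop` below the best
                break
            qi += step
            di += step
        return best, best_len

    right, right_len = walk(q + k, d + k, 1)
    left, left_len = walk(q - 1, d - 1, -1)
    return q - left_len, q + k + right_len, d - left_len, k * match + left + right

seq1 = "ACGAAGTAAGGTCCAGT"
seq2 = "AGCGTTAGGTCCTTCCC"
qs, qe, ds, score = extend_ungapped(seq1, seq2, 9, 7, 3)  # seed GGT
```

Under these assumptions `seq1[qs:qe]` and `seq2[ds:ds + qe - qs]` come out as the two output rows shown on the slide.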

BLAST Algorithm Variations
BLAT (BLAST-Like Alignment Tool):
1. Builds index (dictionary) for database, scans linearly through query
2. Alignment extensions allow for gaps as well

Gapped Extensions (BLAT)
Extensions with gaps in a band around the anchor. Extension proceeds to the left and right, now allowing gaps, until the alignment score falls more than a certain threshold S below the best score so far.
Grid axes: ACGAAGTAAGGTCCAGT and CTGATCCTGGATTGCGA
Output:
GTAAGGTCC-AG
GTTAGGTCCTAG

Perfect Match Results Perfect Match: no relatives generated

Perfect Match Results

Interpreting Results Word size k

Interpreting Results Conservation rate Conservation rate: 81% Mutation rate: 19%

Interpreting Results: Sensitivity. Probability of a particular homologous area being identified. Larger k decreases the probability (an exact match is less likely). Straightforward mathematics.

Sensitivity Calculation
[Figure: a homologous area shared by the database (genome) and the query]
Suppose k = 7. Conservation rate: 81%. Mutation rate: 19%.
Probability whole word is conserved: 0.81^7 ≈ 23%

Sensitivity Calculation
[Figure: a homologous area shared by the database (genome) and the query]
Suppose k = 7. Words: 10.
Probability a particular word is conserved: 23%
Probability at least one word is conserved: 1 – (1 – 0.23)^10 ≈ 93%
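
The arithmetic on these two slides can be checked with a few lines. Like the slides, this sketch treats the words as independent trials, which is a simplifying assumption (overlapping words are not truly independent).

```python
def word_sensitivity(conservation, k, n_words):
    """Return (P[one word conserved], P[at least one of n_words conserved])."""
    p_word = conservation ** k
    return p_word, 1 - (1 - p_word) ** n_words

p_word, p_any = word_sensitivity(0.81, 7, 10)
```

Here `p_word` rounds to 23% and `p_any` to 93%; calling it with k = 16 gives the much lower figures used later for the single long word.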

Interpreting Results Specificity Expected number of alignments initiated by chance Based on 500 bp query and 3 Gbp database This is essentially an indication of SPEED

Interpreting Results SPEED Expected number of alignments initiated by chance Based on 500 bp query and 3 Gbp database This is essentially an indication of SPEED
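
One rough way to estimate the number of chance initiations is sketched below. It assumes uniformly random, independent bases, so any two k-words match with probability 4^-k; that model is an assumption of this sketch and may not be the one behind the slide's figures.

```python
def expected_chance_hits(k, query_len=500, db_len=3_000_000_000):
    """Expected number of exact k-word matches between unrelated sequences."""
    query_words = query_len - k + 1
    db_words = db_len - k + 1
    return query_words * db_words / 4 ** k
```

Each extra base in the word cuts the estimate by about a factor of four, which is why larger k means fewer spurious extensions and a faster search.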

The Classic BLAST Tradeoff As we increase k … Sensitivity gets worse Speed gets better

Status Check Background Recap Simple Tricks of the Trade Wildcards Multiple Words State of the Art Seed Patterns Optimizing Seeds Multiple Simultaneous Seeds

Wildcards
Exact matches are unlikely for larger values of k. Include variants with one “wildcard” placed in each position: GTA gives *TA, G*A, GT*.
This is Relative Generation with any match: 1, any mismatch: 0, and threshold T = k – 1.
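
Generating the wildcard variants of a word is a one-liner; the helper names below are hypothetical.

```python
def wildcard_variants(word):
    """The word itself plus every copy with one position replaced by '*'."""
    return [word] + [word[:i] + "*" + word[i + 1:] for i in range(len(word))]

def matches(pattern, word):
    """True if word agrees with pattern everywhere except at '*' positions."""
    return len(pattern) == len(word) and all(
        p in ("*", w) for p, w in zip(pattern, word)
    )
```

Indexing every variant of every query word means a database word hits whenever it differs from a query word in at most one position, which is exactly the T = k – 1 threshold.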

Wildcard Results Better?

Wildcard Results Perfect match: Wildcards: For the same sensitivity, wildcard variant is about 440 times faster

Wildcard Results Perfect match: Wildcards: For the same sensitivity, wildcard variant is about 40 times faster

Wildcard Results Better Sensitivity/speed tradeoff consistently improved

Status Check Background Recap Simple Tricks of the Trade Wildcards Multiple Words State of the Art Seed Patterns Optimizing Seeds Multiple Simultaneous Seeds

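Multiple Words
Require N perfect matches with the same separation in query and database, and all separations less than distance W.
Database: TGCTAGCTACGATCTGCAGTGCGTAATCT…
Query: TCATTACATCGTGACTTGCAGTCGTCCAG…
Example: TAC and TGC are 12 bp apart in the query. The database TGC that lies 12 bp from TAC gives INITIATE ALIGNMENT; the database TGC that lies only 7 bp from TAC gives NO INITIATION.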
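
A sketch of the N = 2 case. "Same separation in query and database" is the same as saying both hits lie on one diagonal (database position minus query position), so hits are grouped by diagonal first. The function name and the choice W = 15 in the example are assumptions.

```python
from collections import defaultdict

def two_hit_triggers(hits, W):
    """hits: iterable of (query_pos, db_pos). Returns (q1, q2, diagonal)
    for neighbouring hits on one diagonal that are less than W apart."""
    by_diag = defaultdict(list)
    for q, d in hits:
        by_diag[d - q].append(q)
    triggers = []
    for diag, qs in by_diag.items():
        qs.sort()
        for a, b in zip(qs, qs[1:]):
            if 0 < b - a < W:
                triggers.append((a, b, diag))
    return triggers

# TAC (query 4, database 7); TGC (query 16, database 0, 14 and 19).
slide_hits = [(4, 7), (16, 0), (16, 14), (16, 19)]
result = two_hit_triggers(slide_hits, W=15)
```

Only the TGC at database position 19 shares a diagonal with the TAC hit, so it is the only pair that initiates an alignment.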

Intuition Behind Multiple Words
[Figure: a homologous area shared by the database (genome) and the query]
If we use a single word of size k = 16. Conservation rate: 81%. Mutation rate: 19%.
Probability whole word is conserved: 0.81^16 ≈ 3%

Intuition Behind Multiple Words
If we use a single word of size k = 16. Words: 10.
Probability a particular word is conserved: 3%
Probability at least one word is conserved: 1 – (1 – 0.81^16)^10 ≈ 29%

Intuition Behind Multiple Words
If we use a single word of size k = 16: probability of a match = 29%.
If we use N = 2 words of size k = 8. Words: 20.
Probability a particular word is conserved: 0.81^8 ≈ 19%
Probability at least two words are conserved: 1 – 0.81^20 – 20 × 0.19 × 0.81^19 ≈ 91% (here 0.81 = 1 – 0.19 is the probability that a word is not conserved)

Intuition Behind Multiple Words
If we use a single word of size k = 16: probability a word is conserved 3%, probability of a match = 29%.
If we use N = 2 words of size k = 8: probability a word is conserved 19%, probability of a match = 91%.
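
The "at least N words conserved" figure is a binomial tail, sketched below under the same independence assumption the slides make. The function name is hypothetical.

```python
from math import comb

def at_least_n_conserved(p_word, n_words, n_required):
    """P[at least n_required of n_words independent words are conserved]."""
    fewer = sum(
        comb(n_words, i) * p_word ** i * (1 - p_word) ** (n_words - i)
        for i in range(n_required)
    )
    return 1 - fewer

two_short = at_least_n_conserved(0.81 ** 8, 20, 2)    # N = 2, k = 8
one_long = at_least_n_conserved(0.81 ** 16, 10, 1)    # N = 1, k = 16
```

Both settings cover 16 matched bases in total, yet the two short words are found far more often than the single long one.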

Multiple Words Results

Single perfect match: Multiple perfect matches: For the same sensitivity, multiple words variant about 1,200 times faster

Multiple Words Results Single perfect match: Multiple perfect matches: For the same sensitivity, multiple words variant about 75,000 times faster

Multiple Words Results Much better than single matches Bigger improvement even than wildcards

Multiple Words Results Why not combine them: Multiple Wildcard Matches?

Status Check Background Recap Simple Tricks of the Trade Wildcards Multiple Words State of the Art Seed Patterns Optimizing Seeds Multiple Simultaneous Seeds

Seed Patterns
Contiguous word (k = 10):
GTCAGTACGTCAGTCGTGCGTCGTCTAG
××××××××××
Word: GTCAGTACGT
Seed pattern:
GTCAGTACGTCAGTCGTGCGTCGTCTAG
××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙×
Word: GTATTAGGCG
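
A sketch of indexing with a seed pattern. The pattern is written as a string of 1 (sampled position, × on the slide) and 0 (ignored position, ∙ on the slide); that encoding and the function names are choices made for this example.

```python
from collections import defaultdict

def seed_key(seq, pos, seed):
    """Letters of seq, starting at pos, at the positions where seed has '1'."""
    return "".join(seq[pos + i] for i, c in enumerate(seed) if c == "1")

def spaced_seed_hits(query, database, seed):
    """(query_pos, db_pos) pairs whose sampled letters agree under seed."""
    span = len(seed)
    index = defaultdict(list)
    for q in range(len(query) - span + 1):
        index[seed_key(query, q, seed)].append(q)
    return [
        (q, d)
        for d in range(len(database) - span + 1)
        for q in index.get(seed_key(database, d, seed), ())
    ]

SEED = "1101010001010001010001000001"   # the slide's pattern, k = 10, span 28
TEXT = "GTCAGTACGTCAGTCGTGCGTCGTCTAG"
```

`seed_key(TEXT, 0, SEED)` gives the word shown on the slide, and a seed of ten 1s reduces to the ordinary contiguous word.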

Intuition Behind Seed Patterns
Patterns increase the likelihood of at least one match within a long conserved region.
[Figure: shifted copies of a seed compared for Consecutive Positions and Non-Consecutive Positions; overlap labels: 3 common, 5 common, 6 common, 7 common]
[Table: on a 100-long 70% conserved region, Consecutive vs. Non-consecutive; rows: Expected # hits, Prob[at least one hit]]

Advantage of Patterns 11 positions 10 positions

Status Check Background Recap Simple Tricks of the Trade Wildcards Multiple Words State of the Art Seed Patterns Optimizing Seeds Multiple Simultaneous Seeds

Designing Seeds
Is this a good seed pattern? ×∙×∙×∙×∙×∙×∙×∙×∙×∙×
Against a copy of itself shifted by one position: 0 matches! Against a copy shifted by two positions: 9 matches.
Not so much …

Designing Seeds
Remember three-periodicity in exons? Higher mutation rate in last position of codon.
A decent pattern is thus π 110 : ××∙××∙××∙××∙××∙×
But isn’t regularity bad?
Human    CCT GTT  (Proline, Valine)
Mouse    CCA GTC  (Proline, Valine)
Rat      CCA GTC  (Proline, Valine)
Dog      CCG GTA  (Proline, Valine)
Chicken  CCC GTG  (Proline, Valine)
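
The codon table makes the point directly: ignoring every third base, all five species agree. A small check, with the pattern written as 1 for × and 0 for ∙ (an encoding chosen for this sketch):

```python
def seed_key(seq, seed):
    """Letters of seq at the positions where seed has '1'."""
    return "".join(base for base, s in zip(seq, seed) if s == "1")

codons = {
    "Human": "CCTGTT",
    "Mouse": "CCAGTC",
    "Rat": "CCAGTC",
    "Dog": "CCGGTA",
    "Chicken": "CCCGTG",
}
keys = {species: seed_key(seq, "110110") for species, seq in codons.items()}
```

A contiguous six-base word would match only the Mouse and Rat rows to each other, while the 110110 pattern yields the same key, CCGT, for every row.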

Optimizing Seeds
Hard problem: no efficient solution known for fixed word size k and max span s.
Example: ××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙× has k = 10, s = 28.

Computing Detection Probabilities
How sensitive is a specific seed π? Construct a finite-state automaton Aπ that accepts the strings containing a match to it.
Example: π = 101 (×∙×). Regular expression: (0 + 1)* 1 (0 + 1) 1 (0 + 1)*
A dynamic program converts this into a DFA.
[Figure: DFA for π = 101, running from START to the accepting state YES]

Computing Detection Probabilities
Compute the probability that a sequence of length L will be accepted by the DFA. A Markov model M (nth order) dictates the conservation pattern.
Dynamic program: Θ(k L 2^(s – k + n))
Example models: CDS: P(1) = 0.9, P(0) = 0.1. NC: P(1) = 0.8, P(0) = 0.2.
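
The idea can be illustrated with a much simpler dynamic program than the one on the slide. This sketch assumes a 0th-order model (each position is conserved independently with probability p) and tracks the last s – 1 bits explicitly instead of building a minimized DFA, so it costs on the order of L · 2^s and is only practical for short seeds.

```python
from collections import defaultdict

def hit_probability(seed, L, p):
    """P[seed matches somewhere in a random 0/1 string of length L],
    where each bit is 1 (conserved) independently with probability p."""
    s = len(seed)
    care = [i for i, c in enumerate(seed) if c == "1"]
    no_hit = {"": 1.0}  # recent bits (at most s - 1) -> P[no hit so far]
    for _ in range(L):
        nxt = defaultdict(float)
        for recent, prob in no_hit.items():
            for bit, pb in (("1", p), ("0", 1 - p)):
                window = recent + bit
                if len(window) == s and all(window[i] == "1" for i in care):
                    continue  # seed hit: this mass leaves the no-hit set
                nxt[window[-(s - 1):] if s > 1 else ""] += prob * pb
        no_hit = nxt
    return 1 - sum(no_hit.values())
```

For instance, `hit_probability("101", L, p)` gives the detection probability of the ×∙× seed from the previous slide under this simplified model.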

Mandala Algorithm
Fast, practical seed design:
Start with a random seed.
Jump to the best neighbor (moving one × to an unused ∙).
Keep jumping until there are no better neighbors (greedy hill climbing). This finds a local optimum.
Use random restarts to try to find the globally optimal seed.
[Figure: a chain of seeds, each one move away from the last, e.g. ×∙×∙××∙××∙∙∙× to ×∙∙∙××∙××∙∙×× to ×∙∙×∙×∙××∙∙××, ending at NO BETTER NEIGHBORS]
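
The hill-climbing loop described above can be sketched as follows. This is not the Mandala implementation: the objective is passed in as a `score` function, seeds are strings of 1 (×) and 0 (∙), and keeping both end positions fixed at 1 so the span stays s is an assumption of this sketch.

```python
import random

def neighbors(seed):
    """Seeds reachable by moving one interior 1 onto an interior 0."""
    inner = range(1, len(seed) - 1)
    ones = [i for i in inner if seed[i] == "1"]
    zeros = [i for i in inner if seed[i] == "0"]
    for i in ones:
        for j in zeros:
            s = list(seed)
            s[i], s[j] = "0", "1"
            yield "".join(s)

def hill_climb(seed, score):
    """Jump to the best neighbor until no neighbor is better."""
    best, best_score = seed, score(seed)
    while True:
        cand = max(neighbors(best), key=score, default=None)
        if cand is None or score(cand) <= best_score:
            return best, best_score
        best, best_score = cand, score(cand)

def random_seed(k, s, rng):
    inner = rng.sample(range(1, s - 1), k - 2)
    return "".join("1" if i in (0, s - 1) or i in inner else "0" for i in range(s))

def design_seed(k, s, score, restarts=5, rng=None):
    """Best local optimum found over several random restarts."""
    rng = rng or random.Random(0)
    runs = (hill_climb(random_seed(k, s, rng), score) for _ in range(restarts))
    return max(runs, key=lambda pair: pair[1])
```

In practice `score` would be a detection probability for the seed; any function from seed string to number works for trying out the loop.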

Mandala Results
Very fast: 20 seconds on k = 11, s = 22.
Within 1% of true optimum in all trials.
Mandala pattern for non-coding DNA (k = 11): ××××∙×××∙∙×∙×××
Mandala pattern for coding DNA (k = 11): ×××∙∙∙∙∙××∙××∙××∙××

Seed Pattern Results Non-coding sequences Coding sequences

Motivation for Improvement Good CDS seeds ××∙××∙××∙××∙×× Good NC seeds ××××∙×××∙∙×∙××

Status Check Background Recap Simple Tricks of the Trade Wildcards Multiple Words State of the Art Seed Patterns Optimizing Seeds Multiple Simultaneous Seeds

Multiple Simultaneous Seeds
Single seed:
GTCAGTACGTCAGTCGTGCGTCGTCTAG
××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙×
Multiple simultaneous seeds on the same sequence, with the word each one produces:
××∙×∙×∙∙∙×∙×∙∙∙×∙×∙∙∙×∙∙∙∙∙×  GTATTAGGCG
×××∙×∙∙×∙∙∙∙×∙∙∙∙∙∙∙×∙∙∙∙×××  GTCGCGTTAG
∙∙∙××∙××∙××∙××∙××∙∙∙∙∙∙∙∙∙∙∙  AGACTCGTGT
THREE INDEXES, one per seed, each of the form ...AGCCAGTCAG...GACAGTCCAG...GGCATCATCA...
How much longer is it taking us to index? How much longer is it taking us to search?
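
A sketch of the three-index arrangement, with seeds written as 1 for × and 0 for ∙. The function names are hypothetical, and the three seed strings are transcriptions of the patterns on the slide.

```python
from collections import defaultdict

SEEDS = [
    "1101010001010001010001000001",
    "1110100100001" + "0" * 7 + "1" + "0" * 4 + "111",
    "000" + "110" * 4 + "11" + "0" * 11,
]

def seed_key(seq, pos, seed):
    """Letters of seq, starting at pos, at the positions where seed has '1'."""
    return "".join(seq[pos + i] for i, c in enumerate(seed) if c == "1")

def build_indexes(seq, seeds):
    """One dictionary per seed: sampled word -> positions in seq."""
    indexes = []
    for seed in seeds:
        index = defaultdict(list)
        for pos in range(len(seq) - len(seed) + 1):
            index[seed_key(seq, pos, seed)].append(pos)
        indexes.append(index)
    return indexes

TEXT = "GTCAGTACGTCAGTCGTGCGTCGTCTAG"
words = [seed_key(TEXT, 0, seed) for seed in SEEDS]
indexes = build_indexes(TEXT, SEEDS)
```

Both indexing and searching repeat the single-seed work once per seed, so the cost grows roughly linearly with the number of seeds.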

Mandala still works. Just change the definition of neighbor: pick one seed, pick one ×, move it to a ∙. Slower convergence.
Example seed set: ×∙×∙××∙××∙∙∙×, ×∙∙∙××∙××∙∙××, ∙××∙××∙××∙××∙, ×××∙×∙∙×∙××∙×

Results Non-coding sequences Back to earlier results …

Future Extensions
Combine all of the above! Multiple Wildcard Matches of Multiple Patterns?
Database: ACGATGGCTCCTGACGAGCTAGCTGATGAGCCGTAGCAGACGTGTAACGGTCGCTCCTGCAGCGA
Query: ACGATGCATCGTAGCTAGCTAGCTGATCAGCTGTAGTAGTCGTCTACTGGATGCTGCTGCAGCGT
Pattern instances shown:
* GT∙G∙T∙∙∙T  GTA∙C∙∙G∙∙  ∙∙∙GC∙AG∙T
* TC∙T∙T∙∙∙G  TCG∙C∙∙C∙∙  ∙∙∙TC∙AC∙G