The Basic Local Alignment Search Tool (BLAST) Rapid database search tool (1990) Idea: (1) Search for high-scoring segment pairs
The Basic Local Alignment Search Tool (BLAST) A Y W T Y I V A L T – Q V R Q Y E A T S I L C I V M I Y S R A - Q Y R Y W R Y Most local alignments contain highly conserved sections without gaps
The Basic Local Alignment Search Tool (BLAST) A Y W T Y I V A L T – Q V R Q Y E A T S I L C I V M I Y S R A - Q Y R Y W R Y -> search for high-scoring segment pairs (HSP), i.e. gap-free local alignments
The Basic Local Alignment Search Tool (BLAST)
A Y W T Y I V A L T – Q V R Q Y E A T S I L C I V M I Y S R A - Q Y R Y W R Y Advantages: (a) speed (b) statistical theory about HSP exists.
The Basic Local Alignment Search Tool (BLAST) Rapid database search tool (1990) Idea: (1) Search for high-scoring segment pairs (2) Use word pairs as seeds
Pair-wise sequence alignment T W L M H C A Q Y I C I M X H X C X T H Y (1) Search word pairs of length 3 with score > T; use them as seeds.
Pair-wise sequence alignment Naïve algorithm would have a complexity of O(l1 * l2). Solution: Preprocess the query sequence: compile a list of all words that have a score > T when aligned to a word in the query. Complexity: O(l1). Organize words in an efficient data structure (tree) for fast look-up.
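The preprocessing step can be sketched as follows. This is a minimal illustration, not the BLAST implementation: `toy_score` is an assumed stand-in for a real substitution matrix, the threshold value is arbitrary, and a hash map (Python `dict`) is swapped in for the tree mentioned on the slide. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_score(a, b):
    # Assumption: toy scores instead of a real matrix such as BLOSUM62.
    return 5 if a == b else -1


def build_word_index(query, w=3, threshold=8, score=toy_score):
    """Map every word that scores > threshold against some query word
    to the list of query positions where it does so."""
    index = defaultdict(list)
    for pos in range(len(query) - w + 1):
        q_word = query[pos:pos + w]
        for cand in product(ALPHABET, repeat=w):
            s = sum(score(a, b) for a, b in zip(q_word, cand))
            if s > threshold:
                index["".join(cand)].append(pos)
    return index


def find_seeds(index, subject, w=3):
    """Scan a database sequence once and yield (query_pos, subject_pos) seeds."""
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            yield i, j


if __name__ == "__main__":
    idx = build_word_index("TWLMHCAQY")
    print(sorted(find_seeds(idx, "ICIMHCTHY")))
```

The index is built once per query, so each database sequence is scanned in a single pass with one look-up per word.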
The Basic Local Alignment Search Tool (BLAST) Rapid database search tool (1990) Idea: (1) Search for high-scoring segment pairs (2) Use word pairs as seeds (3) Extend seed alignments until score drops below threshold value
Pair-wise sequence alignment T W L M H C A Q Y I C I M X H X C X T H Y Extend seeds until score drops by X.
Pair-wise sequence alignment T W L M H C A Q Y I C I X M X H X C X T X H X Y Extend seeds until score drops by X.
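A gap-free seed extension with an X-drop stopping rule might look like the sketch below. The scoring function, the default `x_drop` value and all function names are assumptions made for illustration; real BLAST uses a substitution matrix and tuned parameters.

```python
def toy_score(a, b):
    # Assumption: toy scores instead of a real substitution matrix.
    return 5 if a == b else -4


def extend_one_direction(q, s, i, j, step, x_drop, score):
    """Walk along the diagonal from (i, j) in direction step (+1 or -1).
    Return the best score gain and the extension length that achieves it."""
    best = running = 0
    best_len = length = 0
    while 0 <= i < len(q) and 0 <= j < len(s):
        running += score(q[i], s[j])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running >= x_drop:
            break  # score dropped by X below the best value seen so far
        i += step
        j += step
    return best, best_len


def extend_seed(q, s, qi, sj, w=3, x_drop=10, score=toy_score):
    """Extend a word hit of length w at (qi, sj) to the left and right.
    Return (score, query_start, subject_start, length) of the segment pair."""
    seed = sum(score(q[qi + k], s[sj + k]) for k in range(w))
    right, rlen = extend_one_direction(q, s, qi + w, sj + w, +1, x_drop, score)
    left, llen = extend_one_direction(q, s, qi - 1, sj - 1, -1, x_drop, score)
    return seed + left + right, qi - llen, sj - llen, w + llen + rlen


if __name__ == "__main__":
    print(extend_seed("AAMHCKK", "GGMHCKK", 2, 2))
```

Because extension stops at the first drop of X, a better segment pair lying beyond a poorly scoring stretch can be missed, which is why the method is a heuristic.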
Pair-wise sequence alignment Algorithm not guaranteed to find best segment pair (Heuristic) But works well in practice!
The Basic Local Alignment Search Tool (BLAST) New BLAST version (1997) Two-hit strategy
Pair-wise sequence alignment W L M H C A Q Y A R V I M X H X C X T H W A X R X v X Search two word pairs on the same diagonal, use lower threshold T
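The two-hit criterion can be sketched as a pass over the word hits, grouped by diagonal. The window size, the rule of ignoring overlapping hits and the function name are assumptions for illustration only.

```python
def two_hit_pairs(seeds, w=3, window=40):
    """seeds: iterable of (query_pos, subject_pos) word hits.
    Return pairs of non-overlapping hits on the same diagonal whose
    subject positions are at most `window` apart."""
    last_on_diagonal = {}
    triggers = []
    for i, j in sorted(seeds, key=lambda h: (h[1], h[0])):
        diagonal = i - j
        prev = last_on_diagonal.get(diagonal)
        if prev is not None:
            distance = j - prev[1]
            if distance < w:
                continue  # overlaps the previous hit on this diagonal: ignore
            if distance <= window:
                triggers.append((prev, (i, j)))
        last_on_diagonal[diagonal] = (i, j)
    return triggers


if __name__ == "__main__":
    print(two_hit_pairs([(0, 0), (5, 5), (1, 9)]))
```

Only the returned pairs would be passed on to the more expensive extension step; isolated hits are discarded.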
The Basic Local Alignment Search Tool (BLAST) New BLAST version (1997) Two-hit strategy Gapped BLAST Position-Specific Iterative BLAST (PSI BLAST)
Multiple sequence alignment
1aboA  1 .NLFVALYDfvasgdntlsitkGEKLRVLgynhn gE
1ycsB  1 kGVIYALWDyepqnddelpmkeGDCMTIIhrede deiE
1pht   1 gYQYRALYDykkereedidlhlGDILTVNkgslvalgfsdgqearpeeiG
1ihvA  1 .NFRVYYRDsrd......pvwkGPAKLLWkg eG
1vie   1 .drvrkksga awqGQIVGWYctnlt peG

1aboA 36 WCEAQt..kngqGWVPSNYITPVN
1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP
1pht  51 WLNGYnettgerGDFPGTYVEYIGrkkisp
1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd
1vie  28 YAVESeahpgsvQIYPVAALERIN......
Multiple sequence alignment First question: how to score multiple alignments? Possible scoring scheme: Sum-of-pairs score
Multiple sequence alignment Multiple alignment implies pairwise alignments: 1aboA 36 WCEAQt..kngqGWVPSNYITPVN ycsB 39 WWWARl..ndkeGYVPRNLLGLYP pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp 1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd vie 28 YAVESeahpgsvQIYPVAALERIN......
Multiple sequence alignment Multiple alignment implies pairwise alignments: 1aboA 36 WCEAQt..kngqGWVPSNYITPVN ycsB 39 WWWARl..ndkeGYVPRNLLGLYP......
Multiple sequence alignment Multiple alignment implies pairwise alignments: 1aboA 36 WCEAQtkngqGWVPSNYITPVN 1ycsB 39 WWWARlndkeGYVPRNLLGLYP
Multiple sequence alignment Multiple alignment implies pairwise alignments: 1aboA 36 WCEAQt..kngqGWVPSNYITPVN pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp
Multiple sequence alignment Multiple alignment implies pairwise alignments: Use sum of scores of these p.a. 1aboA 36 WCEAQt..kngqGWVPSNYITPVN ycsB 39 WWWARl..ndkeGYVPRNLLGLYP pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp 1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd vie 28 YAVESeahpgsvQIYPVAALERIN......
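A sum-of-pairs score can be sketched as below. The match, mismatch and gap values are assumed toy parameters (a real scheme would use a substitution matrix and possibly affine gap costs); columns where both rows have a gap are skipped because they are not part of the implied pairwise alignment.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score the pairwise alignment implied by two rows of a multiple alignment."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # gap/gap column: not part of the implied pairwise alignment
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def sum_of_pairs(rows, **params):
    """Sum the scores of all implied pairwise alignments."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(pair_score(a, b, **params) for a, b in combinations(rows, 2))


if __name__ == "__main__":
    print(sum_of_pairs(["AC-T", "ACGT", "A--T"]))
```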
Multiple sequence alignment Goal: Find multi-alignment with maximum score !
Multiple sequence alignment Needleman-Wunsch scoring scheme can be generalized from pair-wise to multiple alignment. Multidimensional search space instead of two-dimensional matrix!
Multiple sequence alignment Complexity: For three sequences of lengths l1, l2, l3: O(l1 * l2 * l3). For n sequences (average length l): O(l^n). Exponential complexity!
Multiple sequence alignment Needleman-Wunsch scoring scheme can be generalized from pair-wise to multiple alignment. Optimal solution not feasible -> Heuristics necessary
Multiple sequence alignment (A) Carrillo and Lipman (MSA): Find sub-space in dynamic-programming matrix where optimal path can be found
Multiple sequence alignment (B) Stoye, Dress (DCA): Divide search space into small sub-spaces. Calculate optimal alignment for sub-spaces. Concatenate sub-alignments.
Multiple sequence alignment Most popular way of constructing multiple alignments: Progressive alignment. Carry out a series of pair-wise alignments.
Multiple sequence alignment WCEAQTKNGQGWVPSNYITPVN WWRLNDKEGYVPRNLLGLYP AVVIQDNSDIKVVPKAKIIRD YAVESEAHPGSFQPVAALERIN WLNYNETTGERGDFPGTYVEYIGRKKISP
Multiple sequence alignment WCEAQTKNGQGWVPSNYITPVN WWRLNDKEGYVPRNLLGLYP AVVIQDNSDIKVVPKAKIIRD YAVESEAHPGSFQPVAALERIN WLNYNETTGERGDFPGTYVEYIGRKKISP Align most similar sequences
WCEAQTKNGQGWVPSNYITPVN WW--RLNDKEGYVPRNLLGLYP- AVVIQDNSDIKVVP--KAKIIRD YAVESEASFQPVAALERIN WLNYNEERGDFPGTYVEYIGRKKISP
Multiple sequence alignment WCEAQTKNGQGWVPSNYITPVN WW--RLNDKEGYVPRNLLGLYP- AVVIQDNSDIKVVP--KAKIIRD YAVESEASVQ--PVAALERIN WLN-YNEERGDFPGTYVEYIGRKKISP
Multiple sequence alignment WCEAQTKNGQGWVPSNYITPVN WW--RLNDKEGYVPRNLLGLYP- AVVIQDNSDIKVVP--KAKIIRD YAVESEASVQ--PVAALERIN WLN-YNEERGDFPGTYVEYIGRKKISP Align sequence to alignment
Multiple sequence alignment WCEAQTKNGQGWVPSNYITPVN- WW--RLNDKEGYVPRNLLGLYP- AVVIQDNSDIKVVP--KAKIIRD YAVESEASVQ--PVAALERIN WLN-YNEERGDFPGTYVEYIGRKKISP Align alignment to alignment
Multiple sequence alignment WCEAQTKNGQGWVPSNYITPVN WW--RLNDKEGYVPRNLLGLYP AVVIQDNSDIKVVP--KAKIIRD YAVESEA---SVQ--PVAALERIN WLN-YNE---ERGDFPGTYVEYIGRKKISP
Multiple sequence alignment WCEAQTKNGQGWVPSNYITPVN WW--RLNDKEGYVPRNLLGLYP AVVIQDNSDIKVVP--KAKIIRD YAVESEA---SVQ--PVAALERIN WLN-YNE---ERGDFPGTYVEYIGRKKISP Rule: “once a gap - always a gap”
Multiple sequence alignment Order of pair-wise profile alignments determined by phylogenetic tree based on pair-wise similarity values (guide tree)
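How a guide tree fixes the order of alignment steps can be sketched with a simple agglomerative clustering. This is an assumed average-linkage scheme with made-up similarity values, not the exact tree-building method of any particular program; the function name is hypothetical.

```python
from itertools import combinations


def guide_tree_order(names, similarity):
    """Return the sequence of merges (cluster, cluster) chosen by repeatedly
    joining the two clusters with the highest average pairwise similarity.
    similarity maps frozenset({a, b}) to a similarity value."""
    clusters = [(name,) for name in names]
    merges = []

    def average_similarity(c1, c2):
        values = [similarity[frozenset((a, b))] for a in c1 for b in c2]
        return sum(values) / len(values)

    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: average_similarity(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


if __name__ == "__main__":
    sim = {frozenset(p): v for p, v in [
        (("A", "B"), 0.9), (("C", "D"), 0.8), (("A", "C"), 0.3),
        (("A", "D"), 0.2), (("B", "C"), 0.4), (("B", "D"), 0.1)]}
    for left, right in guide_tree_order("ABCD", sim):
        print("align", left, "with", right)
```

Each merge corresponds to one pair-wise step: sequence to sequence, sequence to alignment, or alignment to alignment.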
Multiple sequence alignment WCEAQTKNGQGWVPSNYITPVN WWRLNDKEGYVPRNLLGLYP AVVIQDNSDIKVVPKAKIIRD YAVESEAHPGSFQPVAALERIN WLNYNETTGERGDFPGTYVEYIGRKKISP
Multiple sequence alignment Problem: simple guide tree determines multiple alignment; multiple alignment determines phylogenetic analysis
Multiple sequence alignment Implementations: Clustal W, PileUp, MultAlin
Local multiple alignment [Figure: motif M marked in several sequences; one occurrence labelled M´]
Local multiple alignment Find motifs contained in all sequences in data set Problem: motifs often present in only sub-families
Neither local nor global methods applicable
Alignment possible if order conserved
The DIALIGN approach
Combination of local and global methods.
The DIALIGN approach Combination of local and global methods. Find local pair-wise similarities between input sequences (fragments)
The DIALIGN approach Combination of local and global methods. Find local pair-wise similarities between input sequences (fragments) Compose alignments from fragments
The DIALIGN approach Combination of local and global methods. Find local pair-wise similarities between input sequences (fragments) Compose alignments from fragments Ignore non-related parts of the sequences
The DIALIGN approach atctaatagttaaactcccccgtgcttagagatccaaac cagtgcgtgtattactaacggttcaatcgcgcacatccgc
The DIALIGN approach atctaatagttaaactcccccgtgcttagagatccaaac cagtgcgtgtattactaacggttcaatcgcgcacatccgc atctaatagttaaaccccctcgtgcttag agatccaaac cagtgcgtgtattactaac ggttcaatcgcgcacatccgc--
The DIALIGN approach atctaatagttaaactcccccgtgcttagagatccaaac cagtgcgtgtattactaacggttcaatcgcgcacatccgc atctaatagttaaaccccctcgtgcttag agatccaaac cagtgcgtgtattactaac ggttcaatcgcgcacatccgc atcTAATAGTTAaaccccctcgtGCTTag AGATCCaaac cagtgcgtgTATTACTAAc GGTTcaatcgcgcACATCCgc--
The DIALIGN approach Score of an alignment: Define score of fragment f:
l(f) = length of f
s(f) = sum of matches (similarity values)
P(f) = probability of finding a fragment of length l(f) with at least s(f) matches in random sequences that have the same length as the input sequences
Score w(f) = -ln P(f)
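The weight w(f) = -ln P(f) can be illustrated with a simplified probability model. The model below is an assumption, not DIALIGN's exact formula: it treats DNA with a match probability of 0.25 per position, takes s(f) as a plain match count, and approximates P(f) as the binomial tail multiplied by the number of possible start position pairs, capped at 1.

```python
import math


def binomial_tail(length, matches, p=0.25):
    """Probability of at least `matches` matches in `length` random positions."""
    return sum(math.comb(length, k) * p ** k * (1 - p) ** (length - k)
               for k in range(matches, length + 1))


def fragment_weight(length, matches, len1, len2, p=0.25):
    """Assumed approximation of w(f) = -ln P(f) for sequences of length len1, len2."""
    prob = min(1.0, len1 * len2 * binomial_tail(length, matches, p))
    return -math.log(prob)


if __name__ == "__main__":
    # A perfect 10-mer match is far less likely by chance than 8 of 10 matches.
    print(fragment_weight(10, 10, 40, 40), fragment_weight(10, 8, 40, 40))
```

Under this model, fragments that are likely to occur by chance get a weight near zero, so they contribute almost nothing to the alignment score.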
The DIALIGN approach Score of an alignment: Define score of alignment as sum of scores w(f) of its fragments No gap penalty is used! Optimization problem for pair-wise alignment: Find chain of fragments with maximal total score
The DIALIGN approach atctaatagttaaaccccctcgtgcttag agatccaaac cagtgcgtgtattactaac ggttcaatcgcgcacatccgc-- Fragment-chaining algorithm finds optimal chain of fragments.
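Finding a chain of fragments with maximal total score can be sketched with a plain quadratic dynamic program. This is a generic chaining routine under assumed conventions (fragments as `(start1, start2, length, weight)` tuples, length at least 1, one fragment may follow another only if it starts after the other ends in both sequences); it is not DIALIGN's own, more efficient implementation.

```python
def best_chain(fragments):
    """Return (total_weight, chain) for the highest-scoring chain of
    gap-free fragments that are in order in both sequences."""
    if not fragments:
        return 0.0, []
    # Sort by end position in sequence 1 so predecessors come first.
    frags = sorted(fragments, key=lambda f: (f[0] + f[2], f[1] + f[2]))
    best = [f[3] for f in frags]   # best chain weight ending in fragment k
    prev = [None] * len(frags)     # predecessor of k in that chain
    for k, (s1, s2, _, w) in enumerate(frags):
        for m in range(k):
            end1 = frags[m][0] + frags[m][2]
            end2 = frags[m][1] + frags[m][2]
            if end1 <= s1 and end2 <= s2 and best[m] + w > best[k]:
                best[k] = best[m] + w
                prev[k] = m
    k = max(range(len(frags)), key=best.__getitem__)
    total = best[k]
    chain = []
    while k is not None:
        chain.append(frags[k])
        k = prev[k]
    return total, chain[::-1]


if __name__ == "__main__":
    frags = [(0, 0, 5, 3.0), (6, 8, 4, 2.0), (3, 2, 6, 4.0), (10, 4, 3, 1.5)]
    print(best_chain(frags))
```

No gap penalty appears anywhere: regions between chained fragments are simply left unaligned.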
The DIALIGN approach Multiple fragment alignment atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa
The DIALIGN approach Multiple fragment alignment atc------taatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa
The DIALIGN approach Multiple fragment alignment atc------taatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaa--gagtatcacccctgaattgaataa
The DIALIGN approach Multiple fragment alignment atc------taatagttaaactcccccgtgcttag cagtgcgtgtattactaac ggttcaatcgcg caaa--gagtatcacc cctgaattgaataa
The DIALIGN approach Multiple fragment alignment atc------taatagttaaactcccccgtgc-ttag cagtgcgtgtattactaac gg-ttcaatcgcg caaa--gagtatcacc cctgaattgaataa
The DIALIGN approach Multiple fragment alignment atc------taatagttaaactcccccgtgc-ttag cagtgcgtgtattactaac gg-ttcaatcgcg caaa--gagtatcacc cctgaattgaataa Consistency: it is possible to introduce gaps such that all segment pairs are aligned.
The DIALIGN approach Multiple fragment alignment atc------TAATAGTTAaactccccCGTGC-TTag cagtgcGTGTATTACTAAc GG-TTCAATcgcg caaa--GAGTATCAcc CCTGaaTTGAATaa
Program evaluation Use biologically verified alignments (known 3D structure of proteins) Compare alignments produced by computer programs to “biologically correct” alignments.
Program evaluation (1) First evaluation of multiple alignment programs (McClure, Vasi, Fitch, 1994): 4 protein families used: globin, kinase, protease, ribonuclease H; all globally related -> global programs performed best
Program evaluation (2) The BAliBASE (Thompson et al., 1999) ~ 100 protein families with known 3D structure, some with large insertions/deletions.
Program evaluation
1aboA  1 .NLFVALYDfvasgdntlsitkGEKLRVLgynhn gE
1ycsB  1 kGVIYALWDyepqnddelpmkeGDCMTIIhrede deiE
1pht   1 gYQYRALYDykkereedidlhlGDILTVNkgslvalgfsdgqearpeeiG
1ihvA  1 .NFRVYYRDsrd......pvwkGPAKLLWkg eG
1vie   1 .drvrkksga awqGQIVGWYctnlt peG

1aboA 36 WCEAQt..kngqGWVPSNYITPVN
1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP
1pht  51 WLNGYnettgerGDFPGTYVEYIGrkkisp
1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd
1vie  28 YAVESeahpgsvQIYPVAALERIN

Key: alpha helix RED, beta strand GREEN, core blocks UNDERSCORE
Program evaluation Results: Four programs performed best, but no method was best in all test examples. ClustalW, SAGA and PRRP best for global alignment, DIALIGN best for sequences with large insertions or deletions.
Program evaluation (3) Lassmann and Sonnhammer (2002): Used BAliBASE plus artificial sequences for local alignment. Results: T-COFFEE best for closely related sequences, DIALIGN best for distantly related sequences.
Alignment of large genomic sequences Important tool for identifying functional sites (e.g. genes or regulatory elements)
Alignment of large genomic sequences Phylogenetic Footprinting: Functional sites more conserved during evolution => Sequence similarity indicates biological function
Alignment of large genomic sequences DIALIGN performs well in identifying local homologies, but is slow
Quadratic program running time
Solution: Anchored alignments
Find anchor points to reduce search space
Solution: Anchored alignments Use fast heuristic method to find anchor points: CHAOS, developed together with Mike Brudno. Brudno et al. (2003), BMC Bioinformatics 4:66
(3) Anchored alignments
First step to gene prediction: Exon discovery by genomic alignment
Evaluation of different alignment programs: Compare local sequence similarity identified by alignment programs to known exons Morgenstern et al. (2002), Bioinformatics 18:
DIALIGN alignment of human and murine genomic sequences
DIALIGN alignment of tomato and Arabidopsis thaliana genomic sequences
Evaluation of DIALIGN, PipMaker, WABA, BLASTN and TBLASTX on a set of 42 human and murine genomic sequences: Compare similarities to annotated exons. Apply cut-off parameter to resulting alignments. Measure sensitivity and specificity.
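Measuring sensitivity and specificity at the nucleotide level can be sketched as follows. The definitions used here are an assumption based on common usage in gene-prediction evaluation: sensitivity = TP / (TP + FN) and specificity = TP / (TP + FP); the slides do not state which definitions were applied. Intervals are half-open `(start, end)` pairs and the coordinates are made up.

```python
def covered_positions(intervals):
    """Set of sequence positions covered by half-open (start, end) intervals."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end))
    return covered


def sensitivity_specificity(predicted, annotated):
    """Nucleotide-level comparison of predicted similarity regions with
    annotated exons. Returns (sensitivity, specificity)."""
    pred = covered_positions(predicted)
    true = covered_positions(annotated)
    tp = len(pred & true)
    sensitivity = tp / len(true) if true else 0.0
    specificity = tp / len(pred) if pred else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    print(sensitivity_specificity([(10, 20), (50, 60)], [(15, 25), (50, 55)]))
```

Raising the cut-off parameter shrinks the predicted set, which typically trades sensitivity for specificity.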
Performance of long-range alignment programs for exon discovery (human - mouse comparison)
Performance of long-range alignment programs for exon discovery (A. thaliana - tomato comparison)
AGenDA: Alignment-based Gene Detection Algorithm. Bridge small gaps between DIALIGN fragments -> clusters of fragments. Search conserved splice sites and start/stop codons at cluster boundaries to identify candidate exons. Recursive algorithm finds biologically consistent chain of potential exons.
Identification of candidate exons Fragments in DIALIGN alignment
Identification of candidate exons Build cluster of fragments
Identification of candidate exons Identify conserved splice sites
Identification of candidate exons Candidate exons bounded by conserved splice sites
Construct gene models using candidate exons: Score of candidate exon (E) based on DIALIGN scores for fragments, score of splice junctions and penalty for shortening / extending. Find biologically consistent chain of candidate exons (starting with start codon, ending with stop codon, no internal stop codons …) with maximal total score.
Find optimal consistent chain of candidate exons
atggtaggtagtgaatgtga
Find optimal consistent chain of candidate exons atggtaggtagtgaatgtga G1G2
Find optimal consistent chain of candidate exons Recursive algorithm calculates optimal chain of candidate exons in N log N time
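An N log N chaining of scored, non-overlapping intervals can be sketched with sorting plus binary search. This is a generic weighted-interval dynamic program, not the AGenDA algorithm itself: it ignores the biological consistency conditions (start and stop codons, reading frame, splice sites), and the exon tuples `(start, end, score)` with exclusive `end` are an assumed representation.

```python
from bisect import bisect_right


def best_exon_chain(exons):
    """Return (total_score, chain) for the best set of non-overlapping
    candidate exons, in O(N log N) time."""
    exons = sorted(exons, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    best = [0.0] * (len(exons) + 1)  # best[k]: optimum over the first k exons
    take = [False] * len(exons)
    pred = [0] * len(exons)
    for k, (start, _end, score) in enumerate(exons):
        # Number of earlier exons that end at or before this exon starts.
        p = bisect_right(ends, start, 0, k)
        pred[k] = p
        if best[p] + score > best[k]:
            best[k + 1] = best[p] + score
            take[k] = True
        else:
            best[k + 1] = best[k]
    chain = []
    k = len(exons)
    while k > 0:
        if take[k - 1]:
            chain.append(exons[k - 1])
            k = pred[k - 1]
        else:
            k -= 1
    return best[-1], chain[::-1]


if __name__ == "__main__":
    candidates = [(0, 100, 5.0), (50, 150, 6.0), (120, 200, 4.0), (210, 300, 3.0)]
    print(best_exon_chain(candidates))
```

Sorting costs N log N and each exon needs one binary search, which gives the overall N log N bound.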
DIALIGN fragments
Candidate exons
Complete model
Results: 105 pairs of genomic sequences from human and mouse (Batzoglou et al., 2000)
AGenDA GenScan 64 % 12 % 17 %
Results: Quality of AGenDA-based gene models comparable to results from GenScan. Exons identified that have not been identified by GenScan. No statistical models derived from known genes (no training data necessary!). Method generally applicable.
AGenDA: Alignment-based Gene Detection Algorithm WWW server: Rinner, Taher, Goel, Sczyrba, Brudno, Batzoglou, Morgenstern, submitted