6/3/2015Burkhard Morgenstern, Tunis 2007 Multiple Alignment and Motif Searching Burkhard Morgenstern Universität Göttingen Institute of Microbiology and Genetics Department of Bioinformatics Tunis, March 2007
Multiple Alignment and Motif Searching
burkhard/teaching/tunis_07.php
Information flow in the cell
Idea: Sequence -> Structure -> Function
Gap between sequence data and structure/function data: lots of data are available at the sequence level, far fewer at the structure and function level.
Exponential growth of databases

Major goal of bioinformatics: close the gap between sequence information and structure/function information.
Most important tool for sequence analysis: sequence comparison.
Simple approach: dot plot. More advanced approach: sequence alignment.
The dot plot
Gibbs and McIntyre (1970)

Two sequences to be compared (first example): YQEWTYIVAREAQYE and CIVMREQY.
Comparison matrix: one sequence is written along the top, the other down the side.
Search for pairs of identical residues.
Dot plot: a dot (X) for every pair of identical residues.
Homologies appear as diagonal lines from top left to bottom right.
Inversions appear as diagonals from bottom left to top right.
Repeats appear as parallel diagonals (second example: YQEWTYQEVREYQEI and CIVMRYQE).

Advantages:
1. Various types of similarity are detectable (repeats, inversions).
2. Useful for large-scale analysis.
Use filtering for long sequences: dots represent matching segments instead of matching single residues.
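The dot-plot construction can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the function name `dot_plot` and the `word` parameter (a simple form of the filtering mentioned above, where a dot marks a matching segment of that length) are assumptions made here.

```python
def dot_plot(seq_a: str, seq_b: str, word: int = 1) -> list[str]:
    """Return one text row per start position in seq_b.

    A cell holds 'X' if the segments of length `word` starting at that
    position in seq_a and seq_b are identical, and '.' otherwise.
    word=1 gives the plain dot plot of identical residues.
    """
    rows = []
    for j in range(len(seq_b) - word + 1):
        row = ""
        for i in range(len(seq_a) - word + 1):
            row += "X" if seq_a[i:i + word] == seq_b[j:j + word] else "."
        rows.append(row)
    return rows


top = "YQEWTYIVAREAQYE"   # sequence written along the top
side = "CIVMREQY"         # sequence written down the side

print("  " + top)
for residue, row in zip(side, dot_plot(top, side)):
    print(residue + " " + row)
```

Calling `dot_plot(top, side, word=2)` keeps only dots where two consecutive residues match, which thins out the random background dots for longer sequences.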
Pair-wise sequence alignment
Evolutionarily or structurally related sequences: alignment possible.
Sequence homologies are represented by inserting gaps.

Two input sequences: TYIVAREQYE and CIVMREAQY.
Comparison matrix and dot plot for the two sequences: the similarities occur in the same relative order over the entire sequences, so a global alignment of the sequences is possible.
An alignment corresponds to a path through the comparison matrix.
[Figure: path through the comparison matrix with matches (red), mismatches (green) and gaps (blue).]
Pair-wise sequence alignment
(Global) alignment: write the sequences on top of each other; gaps are represented by dash symbols.

Input sequences:
T Y I V A R E Q Y E
C I V M R E A Q Y

Alignment of the input sequences:
T Y I V A R E - Q Y E
C - I V M R E A Q Y -

An alignment consists of matches (red), mismatches (green) and gaps (blue).
Basic task: find the 'best' alignment of two sequences, i.e. the alignment that reflects structural and evolutionary relations.
Questions:
1. What is a good alignment?
2. How to find the best alignment?
Idea: consider an alignment as a hypothesis about the evolution of the sequences. Gaps correspond to insertions/deletions; mismatches correspond to substitutions.
Pair-wise sequence alignment
Problem: astronomical number of possible alignments, for example:

T Y I V A R E - Q Y E
C - I V M R E A - Q Y

T Y I V A R E Q Y E
C I - V M R E A Q Y

T Y I V A R E - Q Y E
- C I V M R E A Q Y -

The stupid computer has to find out: which alignment is best?

General assumption: the sequences are not too distantly related. In this case mismatches (substitutions) and gaps (insertions/deletions) are unlikely.
Consequence: a good alignment should reduce gaps and mismatches.
First (simplified) rules:
1. Minimize the number of mismatches.
2. Maximize the number of matches.
Second (simplified) rule: minimize the number of gaps.
Compare the following two alignments; the second contains more gaps:

T Y I V A R E - Q Y E
C I - V M R E A Q Y -

T Y I V - A R E - Q Y E
C I - V M - R E A Q Y -

Parsimony principle: minimize the number of evolutionary events.
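To get a feeling for the "astronomical number" of alignments, one can count the paths through the comparison matrix. The sketch below assumes that every path from the top-left to the bottom-right corner made of diagonal, horizontal and vertical steps counts as one alignment (other counting conventions exist); the function name is chosen here for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignment_paths(m: int, n: int) -> int:
    """Number of paths through an m x n comparison matrix.

    Each step is diagonal (two residues aligned), horizontal or
    vertical (one residue aligned with a gap).
    """
    if m == 0 or n == 0:
        return 1  # only gaps remain: exactly one way to finish
    return (count_alignment_paths(m - 1, n - 1)
            + count_alignment_paths(m - 1, n)
            + count_alignment_paths(m, n - 1))


# The two example sequences have lengths 10 and 9.
print(count_alignment_paths(10, 9))
```

Even for these two short sequences the count is already above one million, and it grows exponentially with sequence length, so trying all alignments is not an option.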
Pair-wise sequence alignment
For protein sequences: amino acids differ in their degree of similarity, so simply counting matches and mismatches is too simplistic.

Protein sequences to be aligned: TYIV and TLV.

Possible alignment:
T Y I V
T L - V

Alternative alignment:
T Y I V
T - L V

Some amino acid residues are more similar to each other than others. Therefore, the similarity among amino acid residues has to be taken into account.
To assess the quality of protein alignments, use similarity scores for amino acids: s(a,b) is the similarity score for amino acids a and b.
Pair-wise sequence alignment
Similarity is measured by substitution matrices based on substitution probabilities.
Important substitution matrices:
PAM (M. Dayhoff)
BLOSUM (S. Henikoff / J. Henikoff)

The PAM matrix:
Consider the probability $p_{a,b}$ of a substitution a → b (or b → a) for amino acids a and b.
Define for amino acids a and b a similarity score S(a,b) based on the probability $p_{a,b}$.
First task: find $p_{a,b}$ for every pair of amino acids a, b.
Use closely related protein families: no alignment problem, no double substitutions.
Construct a phylogenetic tree with the parsimony method.
Count substitution frequencies/probabilities.
Normalize the substitution probabilities.
Extrapolate the probabilities to larger evolutionary distances.
Finally, define the similarity score
$S(a,b) = \log \frac{p_{a,b}}{q_a \, q_b}$
where $q_a$ is the (relative) frequency of amino acid a.
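The log-odds definition of the similarity score can be written down directly. The snippet is a minimal sketch: the natural logarithm is an assumption (the slides do not state the base), and the probabilities used in the example are made-up numbers, not PAM values.

```python
import math


def log_odds_score(p_ab: float, q_a: float, q_b: float) -> float:
    """S(a,b) = log(p_ab / (q_a * q_b)); natural log assumed here."""
    return math.log(p_ab / (q_a * q_b))


# Hypothetical numbers for illustration only.
q_a, q_b = 0.1, 0.1
print(log_odds_score(0.02, q_a, q_b))   # pair seen more often than by chance
print(log_odds_score(0.005, q_a, q_b))  # pair seen less often than by chance
```

The score is positive when a pair of residues is observed more often than expected from the single-residue frequencies and negative when it is observed less often.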
Pair-wise sequence alignment
T Y I V
T - L V
Given a similarity score s(a,b) for pairs of amino acids, define the quality score of an alignment as the sum of the similarity values s(a,b) of the aligned residues minus a gap penalty g for each residue aligned with a gap.
Example: Score = s(T,T) + s(I,L) + s(V,V) - g
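The scoring rule can be checked on the example alignment with a short function. The similarity function `toy_similarity` (2 for identical residues, -1 otherwise) and the gap penalty of 2 are placeholder values assumed here; a real application would use a PAM or BLOSUM matrix.

```python
def toy_similarity(a: str, b: str) -> int:
    """Hypothetical similarity score: 2 for identical residues, -1 otherwise."""
    return 2 if a == b else -1


def alignment_score(row_a: str, row_b: str, s, g: float) -> float:
    """Sum of s(a,b) over aligned residue pairs minus g per residue facing a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total -= g
        else:
            total += s(a, b)
    return total


# Score = s(T,T) - g + s(I,L) + s(V,V)
print(alignment_score("TYIV", "T-LV", toy_similarity, g=2))
```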
Pair-wise sequence alignment
Next question: find the alignment with the best score.
A dynamic-programming algorithm finds the alignment with the best score (Needleman and Wunsch, 1970).

T Y I V A R E A Q Y E
- C I V M R E - Q Y -
An alignment corresponds to a path through the comparison matrix.
Pair-wise sequence alignment
T W L V - R E A Q I
- C I V M R E - H Y
Score of the alignment: sum of the similarity values of the aligned residues minus the gap penalty.
Example: S = - g + s(W,C) + s(L,I) + s(V,V) - g + s(R,R) …
The alignment corresponds to a path through the comparison matrix.
Pair-wise sequence alignment
Dynamic programming: calculate the scores S(i,j) of the optimal alignment of the prefixes up to positions i and j (i indexes the sequence along the top of the matrix, j the sequence down the side).
S(i,j) can be calculated from its possible predecessors S(i-1,j-1), S(i,j-1) and S(i-1,j).
Score of the optimal path that comes from the top left = S(i-1,j-1) + s(R,R) (in the example, both prefixes end with R)
Score of the optimal path that comes from above = S(i,j-1) - g
Score of the optimal path that comes from the left = S(i-1,j) - g
Score of the optimal path = maximum of these three values.
Pair-wise sequence alignment
Recursion formula for global alignment of sequences x and y:
$S(i,j) = \max \{\, S(i-1,j-1) + s(x_i, y_j),\; S(i-1,j) - g,\; S(i,j-1) - g \,\}$
Pair-wise sequence alignment
Example: TWLVR (along the top) and CIVMREHY (down the side).
Fill the matrix from top left to bottom right.
Find the optimal alignment by a trace-back procedure.
Pair-wise sequence alignment
Initial matrix entries?
The entries S(i,j) are the scores of the optimal alignment of the prefixes up to positions i and j.
The entries S(i,0) are the scores of the optimal alignment of the prefix up to position i with the empty prefix: Score = - i * g (likewise S(0,j) = - j * g).
Example, g = 2: the initial entries next to C, I, V, M, R, E, H, Y are -2, -4, -6, -8, -10, -12, -14, -16.
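Initialization, matrix fill and trace-back can be put together in a compact sketch of global alignment with a linear gap penalty. It follows the recursion described above but is not the lecture's code; `toy_similarity` is a placeholder score assumed here, and ties in the trace-back are broken arbitrarily in favour of the diagonal step.

```python
def toy_similarity(a: str, b: str) -> int:
    """Hypothetical similarity score: 2 for identical residues, -1 otherwise."""
    return 2 if a == b else -1


def needleman_wunsch(x: str, y: str, s, g: int):
    """Global alignment; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    # S[i][j] = best score for aligning the prefixes x[:i] and y[:j]
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * g          # prefix of x against the empty prefix
    for j in range(1, m + 1):
        S[0][j] = -j * g          # prefix of y against the empty prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          S[i - 1][j] - g,
                          S[i][j - 1] - g)
    # Trace-back from the bottom-right corner to the top-left corner.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] - g:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return S[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = needleman_wunsch("TWLVR", "CIVMREHY", toy_similarity, g=2)
print(score)
print(top)
print(bottom)
```

The full matrix is kept in memory so that the trace-back can recompute which predecessor produced each entry; storing explicit back-pointers would be an equally valid design.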
Pair-wise global alignment
T W L V - R E A Q I
- C I V M R E - F Y
Computational complexity: how do program run time and memory depend on the size of the input data?
For sequences of lengths $l_1$ and $l_2$, computing time and memory are proportional to $l_1 \cdot l_2$.
Time and memory complexity = $O(l_1 \cdot l_2)$
Pair-wise sequence alignment
More realistic gap penalty: affine-linear instead of linear.
Penalty for a gap of length l: $c_0 + (l-1) \cdot c_1$
$c_0$ = 'gap-opening penalty'
$c_1$ = 'gap-extension penalty'
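As a small illustration of the affine-linear penalty, the following sketch sums the penalty over all gaps in one row of an alignment. The helper names and the values chosen for the gap-opening and gap-extension penalties (10 and 1) are assumptions for this example only.

```python
from itertools import groupby


def affine_gap_penalty(length: int, c0: float, c1: float) -> float:
    """Penalty for one gap of the given length: c0 + (length - 1) * c1."""
    if length < 1:
        raise ValueError("a gap has length at least 1")
    return c0 + (length - 1) * c1


def total_gap_penalty(aligned_row: str, c0: float, c1: float) -> float:
    """Sum of affine penalties over all runs of '-' in one aligned row."""
    return sum(affine_gap_penalty(len(list(run)), c0, c1)
               for symbol, run in groupby(aligned_row) if symbol == "-")


# One long gap is cheaper than several short gaps of the same total length.
print(total_gap_penalty("atc------taat", c0=10, c1=1))
print(total_gap_penalty("a-t-c-t-a-a-t", c0=10, c1=1))
```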
Pair-wise local alignment
So far, global alignment has been considered: sequences are aligned over their entire length.
But sequences often share only local sequence similarity (conserved genes or domains).
Most important application: database searching.
Problem: find the pair of segments with maximal alignment score (not necessarily part of the optimal global alignment!).
Equivalent: find a path starting and ending anywhere in the matrix.
Pair-wise local alignment
Recursion formula for global alignment:
$S(i,j) = \max \{\, S(i-1,j-1) + s(a_i, b_j),\; S(i-1,j) - g,\; S(i,j-1) - g \,\}$
Recursion formula for local alignment:
$S(i,j) = \max \{\, 0,\; S(i-1,j-1) + s(a_i, b_j),\; S(i-1,j) - g,\; S(i,j-1) - g \,\}$
Pair-wise local alignment
Initial matrix entries = 0.
Example: s(C,T) = -2, so the entry for the first residue pair (C,T) is max { 0, … } = 0.
Pair-wise sequence alignment
For the trace-back in local alignment: store the positions $i_{max}$ and $j_{max}$ for which $S(i_{max}, j_{max})$ is maximal.
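The local-alignment recursion differs from the global one only in the extra 0, the zero initialization and the starting point of the trace-back. A minimal sketch under the same assumptions as before (linear gap penalty, placeholder similarity function `toy_similarity`, only one best-scoring local alignment reported):

```python
def toy_similarity(a: str, b: str) -> int:
    """Hypothetical similarity score: 2 for identical residues, -1 otherwise."""
    return 2 if a == b else -1


def local_alignment(a: str, b: str, s, g: int):
    """Best local alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]   # initial entries = 0
    best, i_max, j_max = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(0,
                          S[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          S[i - 1][j] - g,
                          S[i][j - 1] - g)
            if S[i][j] > best:                   # remember the maximal entry
                best, i_max, j_max = S[i][j], i, j
    # Trace back from the maximal entry until an entry with score 0 is reached.
    ra, rb = [], []
    i, j = i_max, j_max
    while i > 0 and j > 0 and S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + s(a[i - 1], b[j - 1]):
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] - g:
            ra.append(a[i - 1]); rb.append("-"); i -= 1
        else:
            ra.append("-"); rb.append(b[j - 1]); j -= 1
    return best, "".join(reversed(ra)), "".join(reversed(rb))


print(local_alignment("TWLVREAQY", "CIVMREHY", toy_similarity, g=2))
```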
Pair-wise local alignment
Algorithm by Smith and Waterman (1981).
Implementation: e.g. BestFit in the GCG package.
Complexity: for sequences of lengths $l_1$ and $l_2$, computing time and memory are proportional to $l_1 \cdot l_2$.
Time and space complexity = $O(l_1 \cdot l_2)$
Too slow for database searching! Therefore tools like BLAST are necessary for database searching.
The Basic Local Alignment Search Tool (BLAST)
New BLAST version (1997):
Two-hit strategy
Gapped BLAST
Position-Specific Iterative BLAST (PSI-BLAST)

PSI-BLAST:
1. Search the database with standard BLAST.
2. Take the best hits and create a multiple alignment.
3. Calculate a profile from the multiple alignment.
4. Search the database again with the profile as query.

Profile for a sequence family or motif: table of amino acid/nucleotide frequencies at every position in the alignment.
Example (the profile table has one row for each of A, T, C, G and one column per alignment position):
seq1 A T T G - A T
seq2 C T T G T A G
seq3 A - - G T A T
seq4 A T G G T G T
seq5 A C T G T A C
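The profile for the five example sequences can be computed column by column. This is a sketch of the frequency table only, not of PSI-BLAST itself; dividing by the number of sequences (so that gaps lower the frequencies in a column) is an assumption made here, and other conventions, such as pseudocounts, are common.

```python
from collections import Counter


def profile(rows: list[str], alphabet: str = "ATCG") -> dict[str, list[float]]:
    """Relative frequency of each symbol at each alignment position.

    Frequencies are taken relative to the number of sequences, so a
    column containing gaps sums to less than 1.
    """
    n = len(rows)
    table = {symbol: [] for symbol in alphabet}
    for column in zip(*rows):
        counts = Counter(column)
        for symbol in alphabet:
            table[symbol].append(counts[symbol] / n)
    return table


alignment = ["ATTG-AT",
             "CTTGTAG",
             "A--GTAT",
             "ATGGTGT",
             "ACTGTAC"]

for symbol, frequencies in profile(alignment).items():
    print(symbol, frequencies)
```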
Tools for multiple sequence alignment
Input sequences:
s1 T Y I M R E A Q Y E S A Q
s2 T C I V M R E A Y E
s3 Y I M Q E V Q Q E R
s4 W R Y I A M R E Q Y E

Multiple alignment:
s1 - T Y I - M R E A Q Y E S A Q
s2 - T C I V M R E A - Y E - - -
s3 - - Y I - M Q E V Q Q E R - -
s4 W R Y I A M R E - Q Y E - - -

General information in a multiple alignment:
Functionally important regions are more conserved than non-functional regions.
Local sequence conservation indicates functionality!

For phylogeny reconstruction:
Estimate pairwise distances between sequences (distance-based methods for tree reconstruction).
Estimate evolutionary events (parsimony and maximum-likelihood methods).

Astronomical number of possible alignments! The computer has to decide: which one is best?
Tools for multiple sequence alignment
Questions in the development of multiple-alignment programs (as in pairwise alignment):
(1) What is a good alignment? → objective function ('score')
(2) How to find a good alignment? → optimization algorithm
The first question is far more important!

Traditional objective functions:
Define the score of an alignment as the sum of individual similarity scores S(a,b) minus gap penalties: the Needleman-Wunsch scoring system (1970).
They can be generalized to multiple alignment (e.g. sum-of-pairs score, tree alignment).
The Needleman-Wunsch algorithm can also be generalized to multiple alignment, but it is very time and memory consuming! -> Heuristics needed
Multiple sequence alignment
First question: how to score multiple alignments?
Possible scoring scheme: sum-of-pairs score.
A multiple alignment implies pairwise alignments; use the sum of the scores of these pairwise alignments.
1aboA 36 WCEAQt..kngqGWVPSNYITPVN
ycsB 39 WWWARl..ndkeGYVPRNLLGLYP
pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp
1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd
vie 28 YAVESeahpgsvQIYPVAALERIN......
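The sum-of-pairs idea can be sketched as follows: every pair of rows of the multiple alignment is scored like a pairwise alignment, and the scores are added. The scoring values are placeholders assumed for illustration, and skipping columns where both rows have a gap is one common convention, not something stated on the slides.

```python
from itertools import combinations


def toy_similarity(a: str, b: str) -> int:
    """Hypothetical similarity score: 2 for identical residues, -1 otherwise."""
    return 2 if a == b else -1


def pair_score(row_a: str, row_b: str, s, g: int) -> int:
    """Score of the pairwise alignment induced by two rows of a multiple alignment."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue              # column is empty in the induced pairwise alignment
        if a == "-" or b == "-":
            total -= g
        else:
            total += s(a, b)
    return total


def sum_of_pairs(rows: list[str], s, g: int) -> int:
    """Sum of the induced pairwise alignment scores over all pairs of rows."""
    return sum(pair_score(a, b, s, g) for a, b in combinations(rows, 2))


rows = ["E-YENS",
        "ERYENS",
        "ERYA-S"]
print(sum_of_pairs(rows, toy_similarity, g=2))
```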
Multiple sequence alignment
The Needleman-Wunsch scoring scheme can be generalized from pair-wise to multiple alignment.
Complexity:
For three sequences of lengths $l_1$, $l_2$, $l_3$: $O(l_1 \cdot l_2 \cdot l_3)$
For n sequences (average length l): $O(l^n)$
Exponential complexity!
The optimal solution is not feasible: -> Heuristics necessary
'Progressive' Alignment
Input sequences:
WCEAQTKNGQGWVPSNYITPVN
WWRLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVPKAKIIRD
YAVESEAHPGSFQPVAALERIN
WLNYNETTGERGDFPGTYVEYIGRKKISP

Guide tree. Idea: align closely related sequences first!
Profile alignment, "once a gap - always a gap":
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVP--KAKIIRD
YAVESEA---SVQ--PVAALERIN
WLN-YNE---ERGDFPGTYVEYIGRKKISP

"Greedy" algorithm:
Consider partial solutions of a bigger problem.
Search the best partial solution and fix it.
Search the second-best partial solution that is consistent with the first solution and fix it.
Search the third-best partial solution … etc.
E.g. the knapsack problem.
'Progressive' Alignment
Most important software program: CLUSTAL W.
J. Thompson, T. Gibson, D. Higgins (1994), CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment … Nuc. Acids. Res. 22, (~ citations in the literature)
Tools for multiple sequence alignment
Problems with the traditional approach:
Results depend on the gap penalty.
A heuristic guide tree determines the alignment, yet the alignment is in turn used for phylogeny reconstruction.
The algorithm produces global alignments. But many sequence families share only local similarity, e.g. the sequences share one conserved motif.
Local sequence alignment
Find a common motif in the sequences; ignore the rest.
EYENS
ERYENS
ERYAS
Local alignment:
E-YENS
ERYENS
ERYA-S
Important methods for local multiple alignment:
PIMA
MEME/MAST
Idea: expectation maximization.
Traditional alignment approaches: either global or local methods!

New question: sequence families with multiple local similarities.
Neither local nor global methods are applicable.
Alignment is possible if the order is conserved.
The DIALIGN approach
Morgenstern, Dress, Werner (1996), PNAS 93,
Combination of global and local methods.
Assemble the multiple alignment from gap-free local pair-wise alignments ("fragments").

Input sequences:
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa

Fragments are added to the alignment one after another. Consistency! Resulting alignment, with the aligned fragments in upper case:
atc------TAATAGTTAaactccccCGTGC-TTag
cagtgcGTGTATTACTAAc GG-TTCAATcgcg
caaa--GAGTATCAcc CCTGaaTTGAATaa
The DIALIGN approach Score of an alignment. Define the score of a fragment f:
l(f) = length of f
s(f) = sum of matches (similarity values)
P(f) = probability of finding a fragment of length l(f) with at least s(f) matches in random sequences of the same length as the input sequences
Score: w(f) = -ln P(f)
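The weight definition above can be illustrated with a small sketch. It rests on assumptions that are not part of the slides: DNA sequences with an independent match probability of 0.25 per position, a binomial tail for "at least s(f) matches", and the product of the two sequence lengths as a rough count of possible fragment start positions. The function names are hypothetical, and this is not claimed to be the exact formula used in the DIALIGN program.

```python
import math


def binomial_tail(length, matches, p_match=0.25):
    """Probability of at least `matches` matches among `length` independent positions."""
    return sum(
        math.comb(length, k) * p_match**k * (1 - p_match) ** (length - k)
        for k in range(matches, length + 1)
    )


def fragment_weight(length, matches, len1, len2, p_match=0.25):
    """w(f) = -ln P(f), with P(f) approximated for two random sequences.

    Assumption: P(f) is bounded by (number of start positions) * (tail
    probability at one position), capped at 1 so the weight is never negative.
    """
    p_single = binomial_tail(length, matches, p_match)
    p_any = min(1.0, len1 * len2 * p_single)
    return -math.log(p_any)


# A perfect 8-mer match in two sequences of length 30 gets a positive weight,
# while a weaker fragment of the same length is indistinguishable from chance.
strong = fragment_weight(8, 8, 30, 30)
weak = fragment_weight(8, 6, 30, 30)
```

Under these assumptions, short or weak fragments receive weight 0 because they are expected to occur by chance, which is the intended effect of scoring by probability rather than by raw match count.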
The DIALIGN approach Score of an alignment: sum of the scores w(f) of the fragments it contains. No gap penalty!
The DIALIGN approach Goal of the fragment-based alignment approach: find a consistent collection of fragments with maximum sum of weight scores.
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atctaatagttaaaccccctcgtgcttagagatccaaac cagtgcgtgtattactaacggttcaatcgcgcacatccgc Pair-wise alignment:
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atctaatagttaaaccccctcgtgcttagagatccaaac cagtgcgtgtattactaacggttcaatcgcgcacatccgc Pair-wise alignment: recursive algorithm finds optimal chain of fragments.
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atctaatagttaaaccccctcgtgcttag agatccaaac cagtgcgtgtattactaac ggttcaatcgcgcacatccgc-- Pair-wise alignment: recursive algorithm finds optimal chain of fragments.
The DIALIGN approach atctaatagttaaaccccctcgtgcttag agatccaaac cagtgcgtgtattactaac ggttcaatcgcgcacatccgc-- Optimal pairwise alignment: chain of fragments with maximum sum of weights, found by dynamic programming (standard fragment-chaining algorithm; space-efficient algorithm).
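As an illustration of fragment chaining by dynamic programming, here is a minimal quadratic-time sketch. The fragment representation `(start1, start2, length, weight)` and the function name are assumptions made for this example; it is not the space-efficient algorithm mentioned on the slide.

```python
def best_chain(fragments):
    """Return (total_weight, chain) for the best chain of pairwise fragments.

    Each fragment is (start1, start2, length, weight). Fragment f may precede
    fragment g in a chain if f ends before g starts in both sequences.
    """
    if not fragments:
        return 0.0, []
    # Any predecessor starts earlier in sequence 1, so process by start1.
    order = sorted(range(len(fragments)), key=lambda i: fragments[i][0])
    best = [0.0] * len(fragments)   # best chain weight ending in fragment i
    prev = [None] * len(fragments)  # predecessor of i in that chain
    for rank, i in enumerate(order):
        s1, s2, _, w = fragments[i]
        best[i] = w
        for j in order[:rank]:
            t1, t2, t_len, _ = fragments[j]
            if t1 + t_len <= s1 and t2 + t_len <= s2 and best[j] + w > best[i]:
                best[i] = best[j] + w
                prev[i] = j
    end = max(range(len(fragments)), key=lambda i: best[i])
    chain = []
    i = end
    while i is not None:
        chain.append(fragments[i])
        i = prev[i]
    return best[end], chain[::-1]


frags = [(0, 0, 3, 2.0), (1, 5, 2, 4.5), (4, 4, 3, 3.0), (5, 1, 2, 4.0)]
total, chain = best_chain(frags)
```

In this toy input the heaviest single fragment has weight 4.5, but the two compatible fragments on the main diagonal together reach 5.0 and are therefore chosen.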
The DIALIGN approach Multiple alignment: atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa (1) Calculate all optimal pair-wise alignments
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach Fragments from optimal pair-wise alignments might be inconsistent
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaa--gagtatcacccctgaattgaataa
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atc------taatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaa--gagtatcacccctgaattgaataa
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa
The DIALIGN approach Fragments from optimal pair-wise alignments might be inconsistent:
1. Sort fragments according to scores
2. Include them one-by-one into the growing multiple alignment, as long as they are consistent (greedy algorithm, comparable to knapsack problem)
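The two greedy steps can be sketched as follows. The consistency test used here is a deliberately simple one chosen for illustration: aligned residues are merged into columns, and the set of fragments is called consistent if the columns can still be ordered without a cycle. It recomputes everything for each candidate fragment, so it is far slower than the efficient consistency-bound algorithm referred to in these slides. Fragment tuples `(seq_a, start_a, seq_b, start_b, length)`, 0-based positions, and all function names are assumptions of this sketch.

```python
from collections import defaultdict


def is_consistent(seq_lengths, fragments):
    """True if all fragments fit into one alignment (no cyclic column order)."""
    parent = {}

    def find(site):
        parent.setdefault(site, site)
        while parent[site] != site:
            parent[site] = parent[parent[site]]
            site = parent[site]
        return site

    # Merge aligned residues into columns.
    for sa, pa, sb, pb, length in fragments:
        for k in range(length):
            parent[find((sa, pa + k))] = find((sb, pb + k))

    # Each residue's column must come before the next residue's column.
    succ, indeg, columns = defaultdict(set), defaultdict(int), set()
    for s, n in enumerate(seq_lengths):
        for p in range(n):
            columns.add(find((s, p)))
        for p in range(n - 1):
            u, v = find((s, p)), find((s, p + 1))
            if v not in succ[u]:
                succ[u].add(v)
                indeg[v] += 1

    # Kahn's algorithm: every column is reached iff the order has no cycle.
    ready = [c for c in columns if indeg[c] == 0]
    seen = 0
    while ready:
        u = ready.pop()
        seen += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return seen == len(columns)


def greedy_select(seq_lengths, weighted_fragments):
    """Accept fragments in order of decreasing weight while they stay consistent."""
    accepted = []
    for _, frag in sorted(weighted_fragments, key=lambda wf: -wf[0]):
        if is_consistent(seq_lengths, accepted + [frag]):
            accepted.append(frag)
    return accepted


a, b = (0, 0, 1, 0, 3), (1, 0, 2, 0, 3)
c, d = (0, 0, 2, 5, 3), (0, 5, 2, 6, 2)
chosen = greedy_select([10, 10, 10], [(5.0, a), (4.0, b), (3.0, c), (2.0, d)])
```

Fragment `c` is rejected because, together with `a` and `b`, it would place two residues of the third sequence in the same column; the lower-scoring fragment `d` is still accepted.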
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa Consistency problem
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa Upper and lower bounds for alignable positions
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atc------taatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaa--gagtatcacccctgaattgaataa Upper and lower bounds for alignable positions
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atc------taatagt taaactcccccgtgcttag Cagtgcgtgtattact aacggttcaatcgcg caaa--gagtatcacccctgaattgaataa Upper and lower bounds for alignable positions
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atc------taata-----gttaaactcccccgtgcttag Cagtgcgtgtatta-----ctaacggttcaatcgcg caaa--gagtatcacccctgaattgaataa Upper and lower bounds for alignable positions
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa Upper and lower bounds for alignable positions site x = [i,p] (sequence i, position p)
The DIALIGN approach atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa Upper and lower bounds for alignable positions: calculate lower bound b_l(x,i) and upper bound b_u(x,i) for each site x and sequence i
The DIALIGN approach atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa Upper and lower bounds for alignable positions: b_l(x,i) and b_u(x,i) are updated for each new fragment in the alignment
The DIALIGN approach Consistency bounds have to be updated for each new fragment that is included into the growing alignment. Efficient algorithm: Abdeddaim and Morgenstern (2002)
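To make the idea of consistency bounds concrete, the following sketch computes, for one site x and one sequence i, the inclusive range of positions that could still be aligned with x. It recomputes the bounds from scratch by graph search, so it is only a brute-force illustration and not the efficient incremental algorithm cited above. The fragment format `(seq_a, start_a, seq_b, start_b, length)`, 0-based positions, and the function name are assumptions; the accepted fragments are assumed to be consistent already, and x is assumed to lie outside sequence i.

```python
from collections import defaultdict, deque


def alignable_bounds(seq_lengths, fragments, x, i):
    """Return (lb, ub): positions p in sequence i with lb <= p <= ub can be aligned with x."""
    parent = {}

    def find(site):
        parent.setdefault(site, site)
        while parent[site] != site:
            parent[site] = parent[parent[site]]
            site = parent[site]
        return site

    # Columns of the current alignment.
    for sa, pa, sb, pb, length in fragments:
        for k in range(length):
            parent[find((sa, pa + k))] = find((sb, pb + k))

    # Order relation between columns, in both directions.
    succ, pred = defaultdict(set), defaultdict(set)
    for s, n in enumerate(seq_lengths):
        for p in range(n - 1):
            u, v = find((s, p)), find((s, p + 1))
            succ[u].add(v)
            pred[v].add(u)

    def reach(start, edges):
        seen, queue = set(), deque(edges[start])
        while queue:
            u = queue.popleft()
            if u not in seen:
                seen.add(u)
                queue.extend(edges[u])
        return seen

    col = find(x)
    after, before = reach(col, succ), reach(col, pred)
    n = seq_lengths[i]
    for p in range(n):
        if find((i, p)) == col:  # x is already aligned with position p
            return p, p
    lb = max((p + 1 for p in range(n) if find((i, p)) in before), default=0)
    ub = min((p - 1 for p in range(n) if find((i, p)) in after), default=n - 1)
    return lb, ub


accepted = [(0, 2, 1, 4, 3)]  # positions 2..4 of sequence 0 aligned with 4..6 of sequence 1
bounds_left = alignable_bounds([10, 10], accepted, (0, 1), 1)
bounds_inside = alignable_bounds([10, 10], accepted, (0, 3), 1)
bounds_right = alignable_bounds([10, 10], accepted, (0, 6), 1)
```

A result with lb greater than ub would mean that no position of sequence i can be aligned with x any more.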
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach Advantages of segment-based approach: Program can produce global and local alignments! Sequence families alignable that cannot be aligned with standard methods
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach DIALIGN is available Online at BiBiServ (Bielefeld Bioinformatics Server) Downloadable UNIX/LINUX executables at BiBiServ Source code ( to BM)
6/3/2015Burkhard Morgenstern, Tunis 2007 Program input Program usage: > dialign2-2 [options] = multi-sequence file in FASTA-format
6/3/2015Burkhard Morgenstern, Tunis 2007 Program output DIALIGN ************* Program code written by Burkhard Morgenstern and Said Abdeddaim contact: Published research assisted by DIALIGN 2 should cite: Burkhard Morgenstern (1999). DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15, For more information, please visit the DIALIGN home page at program call:./dialign2-2 -nt -anc s Aligned sequences: length: ================== ======= 1) dog_il ) bla 200 3) blu 200 Average seq. length: Please note that only upper-case letters are considered to be aligned.
6/3/2015Burkhard Morgenstern, Tunis 2007 Program output Alignment (DIALIGN format): =========================== dog_il4 1 cagg GTTTGA atctgataca ttgc bla 1 ctga GC CAAGTGGGAA blu 1 ttttgatatg agaaGTGTGA aacaagctat cctatattGC TAAGTGGCAG dog_il ATGGCACT GGGGTGAATG AGGCAGGCAG CAGAATGATC bla 17 ggtgtgaata catgggtttc cagtaccttc tgaggtccag agtacc---- blu 51 ccctggcttt ctATGTGCAC AGAATGGGAG GAAAGTGCCT GCTAGTGAGC dog_il4 63 GTACTGCAGC CCTGAGCTTC CACTGGCCCA TGTTGGTATC CTTGTATTTT bla TTTCCCA TGTGCTCCAT GGTGGAATGG blu 101 CAGGGACTCA GAGAGAATGG AGTATAGGGG TCAGGGCat dog_il4 113 TCCGCCCCTT CCCAGCACca gcattatcct ---GGGATTG GAGAAGGGGG bla 90 ACCACTCCTT CTCAGCACaa caaagcccaa gaaGGTGTTG CGTTCTAGAC blu GGGGTGG CCTTAGGCTC
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atc------taatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atc------taatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaa--gagtatcacccctgaattgaataa
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atc------taatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaa--gagtatcacc cctgaattgaataa
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atc------taatagttaaactcccccgtgcttag cagtgcgtgtattactaac ggttcaatcgcg caaa--gagtatcacc cctgaattgaataa
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atc------taatagttaaactcccccgtgc-ttag cagtgcgtgtattactaac gg-ttcaatcgcg caaa--gagtatcacc cctgaattgaataa
6/3/2015Burkhard Morgenstern, Tunis 2007 The DIALIGN approach atc------TAATAGTTAaactccccCGTGC-TTag cagtgcGTGTATTACTAAc GG-TTCAATcgcg caaa--GAGTATCAcc CCTGaaTTGAATaa--
6/3/2015Burkhard Morgenstern, Tunis 2007 T-COFFEE C. Notredame, D. Higgins, J. Heringa (2000), T-Coffee: A novel algorithm for multiple sequence alignment, J. Mol. Biol.
6/3/2015Burkhard Morgenstern, Tunis 2007 T-COFFEE Problem with “progressive” approaches: Strictly global alignments Use only pair-wise comparison
T-COFFEE Idea:
Start with local and global pair-wise alignments ("primary library" of alignments)
Construct "secondary library" of residues that are indirectly aligned by the primary library
Re-score residue pairs
Construct final alignment with "progressive" method
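The re-scoring step can be illustrated with a small sketch of library extension: a residue pair gains weight from every third residue that is aligned with both of its members in the primary library, with the contribution of each such triplet taken as the minimum of the two primary weights. The data layout (a dictionary from residue pairs to weights, residues as `(sequence, position)` tuples) and the function name are assumptions of this example, not the T-COFFEE file format or API.

```python
from collections import defaultdict


def extend_library(primary):
    """Return re-scored pair weights.

    `primary` maps frozenset({a, b}) -> weight, where a and b are residues
    (sequence, position) from different sequences.
    """
    partners = defaultdict(dict)
    for pair, weight in primary.items():
        a, b = tuple(pair)
        partners[a][b] = weight
        partners[b][a] = weight

    extended = dict(primary)
    # Every residue z that is aligned with both a and b supports the pair (a, b).
    for z, aligned in partners.items():
        items = list(aligned.items())
        for idx, (a, w_az) in enumerate(items):
            for b, w_bz in items[idx + 1:]:
                if a[0] == b[0]:
                    continue  # residues of the same sequence cannot be aligned
                key = frozenset((a, b))
                extended[key] = extended.get(key, 0) + min(w_az, w_bz)
    return extended


A, B, C = (0, 1), (1, 1), (2, 1)
library = {frozenset((A, B)): 88, frozenset((A, C)): 77, frozenset((B, C)): 100}
rescored = extend_library(library)
```

Pairs that are supported by many indirect alignments end up with higher weights, so the subsequent progressive step depends less on any single pair-wise alignment.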
T-COFFEE Advantage: combination of local and global approaches; less sensitive to mis-alignments in the progressive procedure
6/3/2015Burkhard Morgenstern, Tunis 2007 T-COFFEE T-COFFEE and DIALIGN: Less sensitive to spurious pairwise similarities Can handle local homologies better than CLUSTAL
6/3/2015Burkhard Morgenstern, Tunis 2007 Most multi-alignment approaches automated, i.e. based on algorithmic rules. Two components: Objective function: assess alignment quality Optimization algorithm: find optimal or near-optimal alignment
Fully automated alignment programs are necessary if no expert knowledge is available or if large amounts of data are to be analyzed. But: often no biologically reasonable results; often additional information about homologies etc. is available
6/3/2015Burkhard Morgenstern, Tunis 2007 Idea for improved alignment Use expert knowledge to influence alignment procedure DIALIGN with user-defined anchor points
6/3/2015Burkhard Morgenstern, Tunis 2007 Alignment of large genomic sequences Alignment of large genomic sequences to identify functional elements (phylogenetic footprinting) Göttgens et al., 2000, 2001, 2002, … Pollard et al., 2004 DIALIGN, MGA, PipMaker, LAGAN, AVID, Mummer, WABA, …
Alignment of large genomic sequences Gene-regulatory sites identified by multiple sequence alignment (phylogenetic footprinting)
6/3/2015Burkhard Morgenstern, Tunis 2007 Alignment of large genomic sequences
6/3/2015Burkhard Morgenstern, Tunis 2007 DIALIGN alignment of human and murine genomic sequences
DIALIGN alignment of tomato and Arabidopsis thaliana genomic sequences
Alignment of large genomic sequences DIALIGN used by tracker for phylogenetic footprinting (Prohaska et al., 2004). Alignment of Hox gene cluster: DIALIGN able to identify small regulatory elements, but entire genes totally mis-aligned. Reason for mis-alignment: duplications!
6/3/2015Burkhard Morgenstern, Tunis 2007 Alignment of large genomic sequences The Hox gene cluster: 4 Hox gene clusters in pufferfish. 14 genes, different genes in different clusters!
6/3/2015Burkhard Morgenstern, Tunis 2007 Alignment of large genomic sequences The Hox gene cluster: Complete mis-alignment of entire genes!
Alignment of sequence duplications (S1, S2) Conserved motifs; no similarity outside motifs
Alignment of sequence duplications (S1, S2) Duplication in two sequences
Alignment of sequence duplications (S1, S2) Mis-alignment would have lower score!
Alignment of sequence duplications (S1, S2) Duplication in one sequence
Alignment of sequence duplications (S1, S2) Duplication in one sequence: possible mis-alignment
Alignment of sequence duplications (S1, S2, S3) Duplication in one sequence
Alignment of sequence duplications (S1, S2, S3) Consistency problem
Alignment of sequence duplications (S1, S2, S3) More plausible alignment, with higher score
Alignment of sequence duplications (S1, S2, S3) Alternative alignment; probably biologically wrong; lower numerical score!
Anchored sequence alignment Biologically meaningful alignment not possible by automated approaches. Idea: use expert knowledge to guide the alignment procedure. User defines a set of anchor points that are to be "respected" by the alignment procedure
6/3/2015Burkhard Morgenstern, Tunis 2007 Anchored sequence alignment NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT
6/3/2015Burkhard Morgenstern, Tunis 2007 Anchored sequence alignment NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT Use known homology as anchor point
6/3/2015Burkhard Morgenstern, Tunis 2007 Anchored sequence alignment NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT Use known homology as anchor point Anchor point = anchored fragment (gap-free pair of segments) Remainder of sequences aligned automatically
Anchored sequence alignment NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT Alignment of anchored positions a and b is not enforced (a and b may remain un-aligned), but: a is the only residue that can be aligned to b; residues left of a can only be aligned with residues left of b, and likewise to the right
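The anchor rule can be written as a small predicate on candidate residue pairs. This is a sketch for the pair-wise case only; the function names and the representation of an anchored fragment as `(a, b, length)` are assumptions of this example.

```python
def allowed_by_anchor(p, q, a, b):
    """May position p (first sequence) be aligned with q (second sequence),
    given the anchored pair (a, b)?"""
    return (p < a and q < b) or (p == a and q == b) or (p > a and q > b)


def allowed_by_fragment(p, q, anchor):
    """Same test for an anchored gap-free fragment (a, b, length):
    every anchored residue pair (a + k, b + k) must be respected."""
    a, b, length = anchor
    return all(allowed_by_anchor(p, q, a + k, b + k) for k in range(length))


anchor = (5, 7, 3)  # anchors the pairs (5, 7), (6, 8), (7, 9)
checks = [
    allowed_by_fragment(6, 8, anchor),   # an anchored pair itself
    allowed_by_fragment(6, 9, anchor),   # crosses the anchored pair (6, 8)
    allowed_by_fragment(8, 10, anchor),  # right of the anchor in both sequences
    allowed_by_fragment(4, 7, anchor),   # position 7 is reserved for position 5
]
```

Note that the predicate only restricts which pairs may be aligned; it does not force the anchored pair itself into the final alignment, in line with the slide.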
6/3/2015Burkhard Morgenstern, Tunis 2007 Anchored sequence alignment NLF VALYDFVASG DNTLSITKGE klrvlgynhn iihredkGVI YALWDYEPQN DDELPMKEGD cmt Anchored alignment
6/3/2015Burkhard Morgenstern, Tunis 2007 Anchored sequence alignment NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS Anchor points in multiple alignment
6/3/2015Burkhard Morgenstern, Tunis 2007 Anchored sequence alignment NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQND DELPMKEGDCMT GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS Anchor points in multiple alignment
6/3/2015Burkhard Morgenstern, Tunis 2007 Anchored sequence alignment NLF V-ALYDFVAS GD NTLSITKGEk lrvLGYNhn iihredkGVI Y-ALWDYEPQ ND DELPMKEGDC MT GYQ YrALYDYKKE REedidlhlg DILTVNKGSL VA-LGFS-- Anchored multiple alignment
Algorithmic questions Goal: find optimal alignment (= consistent set of fragments) under constraints given by user-specified anchor points!
6/3/2015Burkhard Morgenstern, Tunis 2007 Additional input file with anchor points: Algorithmic questions
6/3/2015Burkhard Morgenstern, Tunis 2007 Algorithmic questions NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS
Algorithmic questions Additional input file with anchor points; each anchor point is specified by: sequences, start positions, length, score
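A reader for such an anchor file might look like the sketch below. The slides only name the fields; the concrete layout assumed here (one anchor per line, six whitespace-separated columns in the order sequence 1, sequence 2, start 1, start 2, length, score) and all identifiers are assumptions of this example.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    seq1: int      # number of the first sequence
    seq2: int      # number of the second sequence
    start1: int    # start position in the first sequence
    start2: int    # start position in the second sequence
    length: int    # length of the anchored gap-free fragment
    score: float   # user-defined priority score


def parse_anchors(text):
    """Parse anchor points from the contents of an anchor file."""
    anchors = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 6:
            raise ValueError(f"line {line_no}: expected 6 fields, got {len(fields)}")
        seq1, seq2, start1, start2, length = map(int, fields[:5])
        anchors.append(Anchor(seq1, seq2, start1, start2, length, float(fields[5])))
    return anchors


example = "1 2 5 7 3 4.5\n\n1 3 20 18 10 1.0\n"
anchors = parse_anchors(example)
```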
6/3/2015Burkhard Morgenstern, Tunis 2007 Algorithmic questions Requirements: Anchor points need to be consistent! – if necessary: select consistent subset from user-specified anchor points
6/3/2015Burkhard Morgenstern, Tunis 2007 Algorithmic questions atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa
6/3/2015Burkhard Morgenstern, Tunis 2007 Algorithmic questions atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa Inconsistent anchor points!
6/3/2015Burkhard Morgenstern, Tunis 2007 Algorithmic questions atctaat---agttaaactcccccgtgcttag Cagtgcgtgtattac-taacggttcaatcgcg caaagagtatcacccctgaattgaataa Inconsistent anchor points!
6/3/2015Burkhard Morgenstern, Tunis 2007 Algorithmic questions Requirements: Anchor points need to be consistent! – if necessary: select consistent subset from user-specified anchor points Find alignment under constraints given by anchor points!
6/3/2015Burkhard Morgenstern, Tunis 2007 Algorithmic questions Use data structures from multiple alignment
6/3/2015Burkhard Morgenstern, Tunis 2007 Algorithmic questions atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa
6/3/2015Burkhard Morgenstern, Tunis 2007 Algorithmic questions atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa Greedy procedure for multiple alignment
6/3/2015Burkhard Morgenstern, Tunis 2007 Algorithmic questions atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa Question: which positions are still alignable ?
Algorithmic questions atctaatagttaaactcccccgtgcttag S_i cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa x For each position x and each sequence S_i there exist an upper bound ub(x,i) and a lower bound lb(x,i) for the residues y in S_i that are alignable with x
6/3/2015Burkhard Morgenstern, Tunis 2007 Algorithmic questions atctaatagttaaactcccccgtgcttag S i cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa x ub(x,i) and lb(x,i) updated during greedy procedure
6/3/2015Burkhard Morgenstern, Tunis 2007 Algorithmic questions atctaatagttaaactcccccgtgcttag S i cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa x Initial values of lb(x,i), ub(x,i)
6/3/2015Burkhard Morgenstern, Tunis 2007 Algorithmic questions atctaatagttaaactcccccgtgcttag S i cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa x ub(x,i) and lb(x,i) updated during greedy procedure
Algorithmic questions Anchor points are treated like fragments in the greedy algorithm:
Sorted according to user-defined scores
Accepted if consistent with previously accepted anchors
ub(x,i) and lb(x,i) updated during greedy procedure
Resulting values of ub(x,i) and lb(x,i) used as initial values for the alignment procedure
6/3/2015Burkhard Morgenstern, Tunis 2007 Algorithmic questions atctaatagttaaactcccccgtgcttag S i cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa x Initial values of lb(x,i), ub(x,i)
6/3/2015Burkhard Morgenstern, Tunis 2007 Algorithmic questions atctaatagttaaactcccccgtgcttag S i cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa x Initial values of lb(x,i), ub(x,i) calculated using anchor points
Algorithmic questions Ranking of anchor points to prioritize them, e.g.:
anchor points from verified homologies: higher priority
automatically created anchor points (using CHAOS, BLAST, ...): lower priority
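For two sequences, selecting a consistent subset of ranked anchor points can be sketched as below. Priority is expressed purely through the user-defined score, so verified anchors are simply given higher scores than automatically created ones. The anchor format `(start1, start2, length, score)`, the example scores, and the function names are assumptions; the multiple-sequence case would need a general consistency test instead of `compatible`.

```python
def compatible(a, b):
    """Can two anchored fragments between the same two sequences coexist?

    They can if they lie on the same diagonal, or if one ends before the
    other starts in both sequences.
    """
    a1, a2, len_a, _ = a
    b1, b2, len_b, _ = b
    if a1 - a2 == b1 - b2:
        return True
    a_before_b = a1 + len_a <= b1 and a2 + len_a <= b2
    b_before_a = b1 + len_b <= a1 and b2 + len_b <= a2
    return a_before_b or b_before_a


def select_anchors(anchors):
    """Greedy selection: highest score first, keep what is consistent so far."""
    accepted = []
    for anchor in sorted(anchors, key=lambda x: x[3], reverse=True):
        if all(compatible(anchor, other) for other in accepted):
            accepted.append(anchor)
    return accepted


verified = [(10, 12, 5, 100.0)]                     # e.g. a known homology
automatic = [(20, 8, 4, 3.0), (30, 35, 6, 2.5)]     # e.g. hits from a fast local search
selected = select_anchors(verified + automatic)
```

The automatically created anchor at (20, 8) crosses the verified one and is dropped, while the lower-scoring anchor at (30, 35) is kept because it is consistent with it.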
6/3/2015Burkhard Morgenstern, Tunis 2007 Application: Hox gene cluster
6/3/2015Burkhard Morgenstern, Tunis 2007 Application: Hox gene cluster Use gene boundaries as anchor points
6/3/2015Burkhard Morgenstern, Tunis 2007 Application: Hox gene cluster Use gene boundaries as anchor points + CHAOS / BLAST hits
Application: Hox gene cluster Comparison of non-anchored and anchored alignment: number of aligned columns, score, CPU time (4:22 without anchoring, 0:19 with anchoring)
6/3/2015Burkhard Morgenstern, Tunis 2007 Application: Hox gene cluster Example: Teleost Hox gene cluster: Score of anchored alignment 15 % higher than score of non-anchored alignment ! Conclusion: Greedy optimization algorithm does a bad job!
Application: Improvement of alignment programs Two possible reasons for mis-alignments:
Wrong objective function: biologically correct alignment gets bad numerical score
Bad optimization algorithm: biologically correct alignment gets best numerical score, but algorithm fails to find this alignment
6/3/2015Burkhard Morgenstern, Tunis 2007 Application: Improvement of Alignment programs Two possible reasons for mis-alignments: Anchored alignments can help to decide
6/3/2015Burkhard Morgenstern, Tunis 2007 Application: RNA alignment
6/3/2015Burkhard Morgenstern, Tunis 2007 Application: RNA alignment aa----CCCC AGC---GUAa gucgcuaucc a cacucuCCCA AGC---GGAG Aac ccg----CCA AaagauGGCG Acuuga non-anchored alignment
6/3/2015Burkhard Morgenstern, Tunis 2007 Application: RNA alignment aa----CCCC AGC---GUAa gucgcuaucc a cacucuCCCA AGC---GGAG Aac ccg----CCA AaagauGGCG Acuuga structural motif mis-aligned
6/3/2015Burkhard Morgenstern, Tunis 2007 Application: RNA alignment aaCCCCAGCG UAAGUCGCUA UCca-- --CACUCUCC CAAGCGGAGA AC CCGCCA AAAGAUGGCG ACuuga 3 conserved nucleotides as anchor points
6/3/2015Burkhard Morgenstern, Tunis 2007 WWW interface at GOBICS (Göttingen Bioinformatics Compute Server)
6/3/2015Burkhard Morgenstern, Tunis 2007 Gene predictions for eukaryotes
6/3/2015Burkhard Morgenstern, Tunis 2007 Gene predictions for eukaryotes Goal: find location and structure of protein-coding genes in eukaryotic genome sequences.
6/3/2015Burkhard Morgenstern, Tunis 2007 Gene predictions for eukaryotes attgccagtacgtagctagctacacgtatgctattacggatctgtagcttagcgtatct gtatgctgttagctgtacgtacgtatttttctagagcttcgtagtctatggctagtcgt agtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttctagagctt cgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacg tacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagc atctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggc tagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatttttct aggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcttagtcgtgtagt cttgatctacgtacgtatttttctagagcttcgtagtctatggctagtcgtagtcgtag tcgttagcatctgtatgctgttagctgtacgtacgtatttttctagagcttcgtagtct atggctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtatt tttctaggggagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtat gctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgta gtcgtagtcgttagcatctgtatgtacgtacgtatttttctagagcttcgtagtctatg gctagtcgtagtcgtagtcgttagcatctgtatgctgttagctgtacgtacgtattttt ctagagcttcgtagtctatggctagtcgtagtcgtagtcgttagcatctgtatgctgtt agctgtacgtacgtatttttctaggggagcttcgtagtctatggctagtcgtagtcgta gtcgttagcatctgtatgctgttagctgtacgtacgtatttttctaggggagcttcgta gtctatggctagtcgtagtcgtagtcgttagcatctgtatggtcgtagtcgttagcatc tgtatgctgttagctgtacgtacgtatttttctaggggagcttcgtagtctatggctag
6/3/2015Burkhard Morgenstern, Tunis 2007 Gene predictions for eukaryotes
Gene predictions for eukaryotes Three different approaches to computational gene-finding:
Intrinsic: use statistical information about known genes (Hidden Markov Models)
Extrinsic: compare genomic sequence with known proteins / genes
Cross-species sequence comparison: search for similarities among genomes
Hidden-Markov-Models (HMM) for gene prediction Generative probabilistic model for a sequence of observations ("symbols"):
Finite set of states
States can emit symbols
Transitions between states possible
Sequence generated by path between states
6/3/2015Burkhard Morgenstern, Tunis 2007 Hidden-Markov-Models (HMM) for gene prediction Example: The occasionally dishonest casino F F U U U U U F F F F F F Possible states: fair (F); unfair (U); begin (B); end (E)
6/3/2015Burkhard Morgenstern, Tunis 2007 Hidden-Markov-Models (HMM) for gene prediction Assumptions: Emission probabilities known; depend only on current state. Transition probabilities known, depend only on current state
Hidden-Markov-Models (HMM) for gene prediction (state diagram with states B, F, U, E)
6/3/2015Burkhard Morgenstern, Tunis 2007 Hidden-Markov-Models (HMM) for gene prediction Parse φ for an observed sequence s: B F F U U U U U F F F F F F E. For sequence s and parse φ: P(φ) is the probability of φ; P(φ,s) = P(φ) * P(s|φ) is the joint probability of φ and s; P(φ|s) is the a-posteriori probability of φ.
6/3/2015Burkhard Morgenstern, Tunis 2007 Hidden-Markov-Models (HMM) for gene prediction B F F U U U U U F F F F F F E Goal: find the path φ with maximum a-posteriori probability P(φ|s). Idea: since P(φ|s) = P(φ,s) / P(s) and P(s) does not depend on φ, it suffices to find the path that maximizes the joint probability P(φ,s); this is done by dynamic programming (Viterbi algorithm).
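As an illustration of the Viterbi idea for the casino example, here is a minimal Python sketch. The start, transition and emission probabilities are hypothetical values chosen for the example (the slides do not give them), and the begin and end states are folded into a start distribution for brevity.

```python
import math

# Hypothetical parameters (not from the slides): a fair die, an unfair die
# that shows a six half of the time, and "sticky" transitions.
STATES = ("F", "U")
START = {"F": 0.5, "U": 0.5}
TRANS = {"F": {"F": 0.95, "U": 0.05}, "U": {"F": 0.10, "U": 0.90}}
EMIT = {
    "F": {r: 1 / 6 for r in "123456"},
    "U": {**{r: 0.1 for r in "12345"}, "6": 0.5},
}

def viterbi(rolls):
    """Return the state path phi that maximizes the joint probability P(phi, s)."""
    # score[k] = best log joint probability of any path that ends in state k
    score = {k: math.log(START[k]) + math.log(EMIT[k][rolls[0]]) for k in STATES}
    back = []  # one dict of back-pointers per position after the first
    for r in rolls[1:]:
        prev, score, pointers = score, {}, {}
        for k in STATES:
            best = max(STATES, key=lambda j: prev[j] + math.log(TRANS[j][k]))
            pointers[k] = best
            score[k] = prev[best] + math.log(TRANS[best][k]) + math.log(EMIT[k][r])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

Working in log space avoids numerical underflow on long sequences; the maximizing path is the same as for the raw probabilities.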
6/3/2015Burkhard Morgenstern, Tunis 2007 Hidden-Markov-Models (HMM) for gene prediction Application to gene prediction: s (DNA): A T A A T G C C T A G T C; φ (parse): Z Z Z E E E E E E I I I I. Introns, exons etc. are modeled as states in a GHMM ("generalized HMM"). Given a sequence s, find the parse that maximizes P(φ|s) (S. Karlin and C. Burge, 1997).
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS Basic model for GHMM-based intrinsic gene finding comparable to GenScan (M. Stanke)
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS Features of AUGUSTUS: intron length model; initial pattern for exons; similarity-based weighting for splice sites; interpolated HMM; internal 3’ content model.
6/3/2015Burkhard Morgenstern, Tunis 2007 Hidden-Markov-Models (HMM) for gene prediction A T A A T G C C T A G T C s (DNA) Z Z Z E E E E I I I I φ (parse) Explicit intron length model computationally expensive.
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS Intron length model: explicit length distribution for short introns; geometric tail for long introns. [State diagram: Exon, Intron (fixed), Intron (expl.), Intron (geo.)]
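A length distribution of this shape can be sketched as follows. This is only an illustration of combining an explicit table with a geometric tail; the function name, the parameter `q` and the way the tail mass is attached are assumptions, not the actual AUGUSTUS parameterization.

```python
def intron_length_pmf(explicit, q):
    """Length distribution: explicit table for short introns, geometric tail beyond.

    explicit: dict {length: probability} for short introns, total mass below 1
    q:        hypothetical per-base probability of staying in the geometric state
    """
    d_max = max(explicit)
    tail_mass = 1.0 - sum(explicit.values())  # mass left for long introns

    def pmf(length):
        if length in explicit:
            return explicit[length]
        if length > d_max:
            # geometric tail, normalized so that it sums to tail_mass
            return tail_mass * (1.0 - q) * q ** (length - d_max - 1)
        return 0.0

    return pmf
```

The explicit part costs time proportional to the longest explicitly modeled length in the dynamic programming, while the geometric part is handled by an ordinary self-transition, which is why only short introns are modeled explicitly.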
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS Extension of AUGUSTUS to include extrinsic information: protein sequences; EST sequences; syntenic genomic sequences; user-defined constraints.
6/3/2015Burkhard Morgenstern, Tunis 2007 Gene prediction by phylogenetic footprinting Comparison of genomic sequences (human and mouse)
6/3/2015Burkhard Morgenstern, Tunis 2007 Gene prediction by phylogenetic footprinting catcatatcttatcttacgttaactcccccgt cagtgcgtgatagcccatatccgg
6/3/2015Burkhard Morgenstern, Tunis 2007 Gene prediction by phylogenetic footprinting Standard score: consider the length of the fragment and the number of matches, and compute the probability of random occurrence. Example sequences: catcatatcttatcttacgttaactcccccgt cagtgcgtgatagcccatatccgg
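One simple way to turn "length and number of matches" into a probability of random occurrence is a binomial tail. The sketch below assumes independent, uniformly distributed nucleotides (match probability 0.25) and uses the negative logarithm as a weight; both are simplifying assumptions for illustration, and the exact DIALIGN computation may differ.

```python
from math import comb, log

def match_probability(length, matches, p=0.25):
    """P(at least `matches` identical positions in a random fragment of `length`)."""
    return sum(
        comb(length, k) * p**k * (1 - p) ** (length - k)
        for k in range(matches, length + 1)
    )

def fragment_weight(length, matches):
    """Smaller chance of random occurrence gives a larger weight."""
    return -log(match_probability(length, matches))
```

With this scoring, a fragment of length 12 with 12 matches receives a higher weight than one of the same length with 5 matches.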
6/3/2015Burkhard Morgenstern, Tunis 2007 Gene prediction by phylogenetic footprinting Translation option: catcatatcttatcttacgttaactcccccgt cagtgcgtgatagcccatatccgg
6/3/2015Burkhard Morgenstern, Tunis 2007 Gene prediction by phylogenetic footprinting Translation option: DNA segments are translated to peptide segments, and the fragment score is based on peptide similarity: calculate the probability of finding a fragment of the same length with (at least) the same sum of BLOSUM values. Example: in catcatatc tta tct tac gtt aactcccccgt the codons translate to L S Y V; in cagtgcgtg ata gcc cat atc cgg the codons translate to I A H I.
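The translation step and the sum of BLOSUM values for the example fragment can be sketched in Python. The codon table is the standard genetic code; the substitution scores are a four-entry excerpt of BLOSUM62 covering only the residue pairs in this example (the slide does not say which BLOSUM matrix is used, so BLOSUM62 is an assumption).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate a DNA segment codon by codon, starting at its first base."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

# Excerpt of BLOSUM62, restricted to the pairs needed below (assumed matrix).
BLOSUM62_EXCERPT = {
    frozenset("LI"): 2,
    frozenset("SA"): 1,
    frozenset("YH"): 2,
    frozenset("VI"): 3,
}

def peptide_score(dna_a, dna_b):
    """Sum of substitution scores over the aligned, translated codons."""
    pep_a, pep_b = translate(dna_a), translate(dna_b)
    return sum(BLOSUM62_EXCERPT[frozenset((x, y))] for x, y in zip(pep_a, pep_b))

print(translate("ttatcttacgtt"), translate("atagcccatatc"))  # LSYV IAHI
print(peptide_score("ttatcttacgtt", "atagcccatatc"))        # 8
```

The probability mentioned on the slide would then be the chance that a random fragment of the same length reaches at least this sum; that step is not shown here.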
6/3/2015Burkhard Morgenstern, Tunis 2007 Gene prediction by phylogenetic footprinting P-fragment (in both orientations): catcatatc tta tct tac gtt aactcccccgt (L S Y V) aligned with cagtgcgtg ata gcc cat atc cgg (I A H I). N-fragment: ttatcttacgtt aligned with atagcccatatc at the nucleotide level (5 identical positions). For each fragment f, three probability values are calculated; the score of f is based on the smallest P value.
6/3/2015Burkhard Morgenstern, Tunis 2007 Gene prediction by phylogenetic footprinting P-fragment (in both orientations) and N-fragment for the same sequence pair: P-fragments are associated with a strand and a reading frame!
6/3/2015Burkhard Morgenstern, Tunis 2007 Gene prediction by phylogenetic footprinting AGenDA: Alignment-based Gene Detection Algorithm
6/3/2015Burkhard Morgenstern, Tunis 2007 Gene prediction by phylogenetic footprinting Fragments in DIALIGN alignment
6/3/2015Burkhard Morgenstern, Tunis 2007 Gene prediction by phylogenetic footprinting Build cluster of fragments
6/3/2015Burkhard Morgenstern, Tunis 2007 Gene prediction by phylogenetic footprinting Identify conserved splice sites
6/3/2015Burkhard Morgenstern, Tunis 2007 Gene prediction by phylogenetic footprinting Candidate exons bounded by conserved splice sites Find optimal chain of candidate exons
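Finding an optimal chain of non-overlapping candidate exons is a classic dynamic-programming problem. The sketch below treats each candidate as a hypothetical `(start, end, score)` tuple with inclusive integer coordinates and maximizes the total score; it ignores reading-frame and splice-site consistency between neighboring exons, so it is a simplified illustration rather than the AGenDA algorithm itself.

```python
from bisect import bisect_right

def best_exon_chain(exons):
    """Return (total_score, chain) for the best set of non-overlapping exons."""
    exons = sorted(exons, key=lambda e: e[1])  # sort by end position
    ends = [e[1] for e in exons]
    # best[i] = optimum (score, chain) using only the first i exons
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end strictly before this one starts
        j = bisect_right(ends, start - 1, 0, i)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[i]
        best.append(max(skip, take, key=lambda t: t[0]))
    return best[-1]
```

For example, with candidates `(1, 10, 5)`, `(5, 20, 8)`, `(11, 18, 2)` and `(21, 30, 4)`, the best chain is `(5, 20, 8)` followed by `(21, 30, 4)` with total score 12.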
6/3/2015Burkhard Morgenstern, Tunis 2007 Gene prediction by phylogenetic footprinting [Figure: AGenDA compared with GenScan; values shown: 64 %, 12 %, 17 %]
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ Extended GHMM using extrinsic information. Additional input data: a collection h of "hints" about the possible gene structure φ for sequence s. Consider s, φ and h as the result of a random process and define the probability P(s,h,φ). Find the parse φ that maximizes P(φ|s,h) for given s and h.
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ Hints are created using: alignments to EST sequences; alignments to protein sequences; combined EST and protein alignments (EST alignments supported by protein alignments); alignments of genomic sequences; user-defined hints.
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ Alignment to EST: hint to (partial) exon EST G1
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ EST alignment supported by protein: hint to exon (part), start codon EST G1 Protein
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ Alignment to ESTs, Proteins: hints to introns, exons ESTs, Protein G1
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ Alignment of genomic sequences: hint to (partial) exon G2 G1
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ Types of hints: start, stop, dss (donor splice site), ass (acceptor splice site), exonpart, exon, introns. Each hint is associated with a position i in s (exons etc. are associated with their right end position). At most one hint of each type is allowed per position in s. Each hint is associated with a grade g that indicates its source or reliability.
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ h_{i,t} = information about a hint of type t at position i. h_{i,t} = $ if no hint of type t is available at i. h_{i,t} = [grade, strand, (length, reading frame)] if a hint is available (hints created by protein alignments or DIALIGN contain information about the reading frame).
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ Standard program version, without hints A T A A T G C C T A G T C s (sequence) Z Z Z E E E E E E I I I I φ (parse) Find parse that maximizes P(φ|s)
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ AUGUSTUS+ using hints. s (sequence): A T A A T G C C T A G T C; h (type 1): $ $ $ $ $ $ $ X $ $ $ $ $; h (type 2): $ $ $ $ $ $ $ $ $ $ $ $ $; h (type 3): $ $ $ $ X $ $ $ $ $ $ $ $; ...; φ (parse): Z Z Z E E E E E E I I I I. Find the parse that maximizes P(φ|s,h).
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ As in standard HMM theory: maximize the joint probability P(φ,s,h). How should P(φ,s,h) be defined?
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ General assumption: hints of different types t and at different positions i are independent of each other (for redundant hints, ignore the "weaker" types).
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ Assumption: P(h_{i,t}|φ,s) depends on the type t, the grade g, and whether h_{i,t} is compatible with φ or s. Example: let h_{i,t} be a hint to an exon E. Then h_{i,t} is compatible with the parse φ if E is part of φ, and h_{i,t} is compatible with the sequence s if start and stop codons exist according to E and E contains no internal stop codon.
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ For given g and t there are 3 possible values for P(h_{i,t}|φ,s): P(h_{i,t}|φ,s) = q^+(t,g) if h_{i,t} is compatible with φ; P(h_{i,t}|φ,s) = q^-(t,g) if h_{i,t} is compatible with s but not compatible with φ; P(h_{i,t}|φ,s) = 0 if h_{i,t} is not compatible with s. The values are learned from training data.
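The three-case rule, combined with the independence assumption, can be sketched in Python as follows. The `Hint` class, the function names and the lookup tables `q_plus` and `q_minus` are hypothetical, the compatibility checks are assumed to be done elsewhere, and the terms for positions without a hint ($) are left out for brevity.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Hint:
    position: int   # position i in s
    type: str       # e.g. "start", "stop", "dss", "ass", "exonpart", "exon"
    grade: str      # source or reliability class g

def hint_log_prob(hint, fits_parse, fits_sequence, q_plus, q_minus):
    """log P(h_{i,t} | phi, s) for one hint, following the three cases."""
    key = (hint.type, hint.grade)
    if fits_parse:
        return math.log(q_plus[key])
    if fits_sequence:
        return math.log(q_minus[key])
    return float("-inf")  # probability 0: hint contradicts the sequence

def log_joint(log_p_parse_and_seq, judged_hints, q_plus, q_minus):
    """log P(phi, s, h) = log P(phi, s) + sum of per-hint log probabilities.

    judged_hints: iterable of (hint, fits_parse, fits_sequence) triples
    """
    return log_p_parse_and_seq + sum(
        hint_log_prob(h, fp, fs, q_plus, q_minus) for h, fp, fs in judged_hints
    )
```

Because the hint terms simply add to the log probability of the parse, they can in principle be accumulated inside the same Viterbi recursion that scores the parse.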
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ Results: gene (sub-)structures supported by hints receive a bonus compared to non-supported structures; gene (sub-)structures not supported by hints receive a malus (M. Stanke et al. 2006, BMC Bioinformatics).
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ Let h, h' be collections of hints with h'_{i,t} = h_{i,t} for (i,t) ≠ (I,T) and h'_{I,T} ≠ h_{I,T} = $; let g be the grade of h'_{I,T}. Let φ^+, φ^- be gene structures on s such that h'_{I,T} is compatible with φ^+, but not with φ^-.
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ Result: the structure φ^+, which is compatible with the additional hint h'_{I,T}, receives a relative bonus.
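The step behind this result can be sketched as follows. This is a reconstruction, not the formula from the original slides. It assumes that hints are conditionally independent given φ and s, that the additional hint is compatible with s (so that q^- applies for φ^-), and that the no-hint probabilities P(h_{I,T} = $ | φ, s) are nonzero.

```latex
\begin{align*}
P(\varphi, s, h) &= P(\varphi, s) \prod_{(i,t)} P(h_{i,t} \mid \varphi, s) \\[4pt]
\frac{P(\varphi^{+} \mid s, h')}{P(\varphi^{-} \mid s, h')}
  &= \frac{P(\varphi^{+}, s, h')}{P(\varphi^{-}, s, h')} \\[4pt]
  &= \frac{P(\varphi^{+} \mid s, h)}{P(\varphi^{-} \mid s, h)}
     \cdot \frac{q^{+}(T,g)}{q^{-}(T,g)}
     \cdot \frac{P(h_{I,T} = \$ \mid \varphi^{-}, s)}{P(h_{I,T} = \$ \mid \varphi^{+}, s)}
\end{align*}
```

The last line follows because h and h' differ only at (I,T), so all other factors cancel. The factor q^+(T,g) / q^-(T,g) exceeds 1 whenever q^+(T,g) > q^-(T,g), which shifts the odds toward φ^+.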
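6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ Results (gene level) on data set sag178 [Table: % SN and % SP for Augustus, GenScan, GeneID, HMMGene, Aug. + EST, Aug. + prot, Aug. combined, Aug. all, GenomeScan, TwinScan. Only four values survive in this transcript: 20 and 7 after HMMGene, 20 and 25 after TwinScan.]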
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ Using hints from DIALIGN alignments: 1. Obtain large human/mouse sequence pairs (up to 50kb) from UCSC 2. Run CHAOS to find anchor points 3. Run DIALIGN using CHAOS anchor points 4. Create hints h from DIALIGN fragments 5. Run AUGUSTUS with hints
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ Hints from DIALIGN fragments: the segment covered by a peptide fragment, minus 33 bp at both ends, defines an exonpart hint on all 6 reading frames.
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ Hints from DIALIGN fragments: consider fragments with score ≥ 20; distinguish high scores (≥ 45) from low scores; consider the reading frame given by DIALIGN; consider the strand given by DIALIGN => 2*2*2 = 8 grades.
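The trimming and score thresholds for DIALIGN fragments can be sketched as a small Python function. The function name, the inclusive coordinate convention and the decision to drop fragments that are too short to survive trimming are assumptions made for this example; only the numbers 33, 20 and 45 come from the slides, and the reading-frame and strand components of the grade are not modeled.

```python
def exonpart_hint(frag_start, frag_end, score, trim=33, min_score=20, high_score=45):
    """Turn one peptide fragment into an exonpart hint, or None if it is unusable.

    Returns (start, end, score_class) with inclusive coordinates.
    """
    if score < min_score:
        return None  # fragments below the score threshold are ignored
    start, end = frag_start + trim, frag_end - trim
    if start > end:
        return None  # nothing left after trimming both ends (assumed behavior)
    return start, end, ("high" if score >= high_score else "low")
```

Trimming both ends makes the hint conservative: it covers only the core of the conserved region, not the region near the fragment boundaries.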
6/3/2015Burkhard Morgenstern, Tunis 2007 AUGUSTUS+ AUGUSTUS was the best ab initio method at EGASP.
6/3/2015Burkhard Morgenstern, Tunis 2007 EGASP test results
6/3/2015Burkhard Morgenstern, Tunis 2007 Ongoing projects: Brugia malayi (TIGR); Aedes aegypti (TIGR); Schistosoma mansoni (TIGR); Tetrahymena thermophila (TIGR); Galdieria sulphuraria (Michigan State Univ.); Coprinus cinereus (Univ. Göttingen); Tribolium castaneum (Univ. Göttingen)