Methods course Multiple sequence alignment and Reconstruction of phylogenetic trees Burkhard Morgenstern, Fabian Schreiber Göttingen, October/November 2007
Tools for multiple sequence alignment Multiple alignment is the basis of (almost) all methods for sequence analysis in bioinformatics.
Tools for multiple sequence alignment
T Y I M R E A Q Y E
T C I V M R E A Y E

T Y I - M R E A Q Y E
T C I V M R E A - Y E
Tools for multiple sequence alignment
T Y I M R E A Q Y E
T C I V M R E A Y E
Y I M Q E V Q Q E
Y I A M R E Q Y E

T Y I - M R E A Q Y E
T C I V M R E A - Y E
Y - I - M Q E V Q Q E
Y - I A M R E - Q Y E
Tools for multiple sequence alignment
T Y I - M R E A Q Y E
T C I V M R E A - Y E
- Y I - M Q E V Q Q E
Y - I A M R E - Q Y E

T Y I - M R E A Q Y E
T C I V - M R E A Y E
- Y I - M Q E V Q Q E
Y - I A M R E - Q Y E

Astronomical number of possible alignments! Which one is the best?
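To get a feeling for "astronomical", the number of ways to align just two sequences can be counted with a short recurrence. This is a minimal sketch, assuming that an alignment is a series of columns of the form residue/residue, residue/gap or gap/residue, and that alignments differing only in the order of adjacent gap columns count as different. The function name is made up for illustration.

```python
from functools import lru_cache


def count_pairwise_alignments(m: int, n: int) -> int:
    """Count the global alignments of two sequences of lengths m and n."""

    @lru_cache(maxsize=None)
    def count(i: int, j: int) -> int:
        if i == 0 or j == 0:
            return 1  # the rest of the other sequence faces gaps only
        # last column: residue/residue, residue/gap, or gap/residue
        return count(i - 1, j - 1) + count(i - 1, j) + count(i, j - 1)

    return count(m, n)


# The two example sequences above have 11 and 10 residues.
print(count_pairwise_alignments(11, 10))
```

The count grows exponentially with sequence length, and every additional sequence multiplies the possibilities again, so exhaustive enumeration is not an option.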
Tools for multiple sequence alignment Questions in development of alignment programs: (1) What is a good alignment? Objective function ("score"). (2) How to find a good alignment? Optimization algorithm.
Tools for multiple sequence alignment What is a biologically good alignment ??
Tools for multiple sequence alignment Criteria for alignment quality: 1. 3D-Structure: align residues at corresponding positions in 3D structure of protein! 2. Evolution: align residues with common ancestors!
Tools for multiple sequence alignment
T Y I - M R E A Q Y E
T C I V M - R E A Y E
- Y I - M Q E V Q Q E
- Y I A M R E - Q Y E

T Y I - M R E A Q Y E
T C I V - M R E A Y E
- Y I - M Q E V Q Q E
- Y I A M R E - Q Y E

An alignment is a hypothesis about sequence evolution. Search for the most plausible hypothesis!
Tools for multiple sequence alignment Compute for amino acids a and b: the probability p_{a,b} of a substitution a → b (or b → a) and the frequency q_a of a. Define a similarity score s(a,b) based on p_{a,b} and q_a. Result: similarity matrix (substitution matrix), e.g. PAM (Dayhoff matrix), BLOSUM, …
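The slide only says that s(a,b) is "based on" the substitution probability and the residue frequencies. A common construction, and the one behind BLOSUM-style matrices, is a log-odds ratio: how much more often is the pair (a,b) observed in aligned positions than expected by chance? The sketch below assumes that p_ab is the joint probability of seeing a and b aligned; the numbers are hypothetical and not taken from any published matrix.

```python
import math


def log_odds_score(p_ab: float, q_a: float, q_b: float) -> float:
    """Log-odds similarity: log2 of observed over chance-expected pair frequency."""
    return math.log2(p_ab / (q_a * q_b))


# Hypothetical values, for illustration only.
print(log_odds_score(p_ab=0.02, q_a=0.1, q_b=0.1))   # more frequent than chance: positive score
print(log_odds_score(p_ab=0.005, q_a=0.1, q_b=0.1))  # less frequent than chance: negative score
```

Published matrices additionally scale and round these values to integers.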
Tools for multiple sequence alignment Traditional objective functions: Define the score of an alignment as the sum of the individual similarity scores s(a,b) of the aligned amino acid residues, minus a gap penalty g for each gap position in the alignment. An optimal alignment can be calculated for two sequences, but in practice not for > 8 sequences.
Example:
T Y W I V
T - - L V
Score = s(T,T) + s(I,L) + s(V,V) - 2g
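The example score can be reproduced in a few lines. This is a sketch under stated assumptions: the gap penalty is charged once per gap position (as in the "2 g" above), and `toy_s` is an invented similarity function standing in for a real substitution matrix.

```python
def alignment_score(row_a: str, row_b: str, s, gap_penalty: float) -> float:
    """Sum s(a, b) over aligned residue pairs and subtract gap_penalty per gap position."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    score = 0.0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score -= gap_penalty
        else:
            score += s(a, b)
    return score


def toy_s(a: str, b: str) -> float:
    """Toy similarity (an assumption): +2 for identical residues, -1 otherwise."""
    return 2.0 if a == b else -1.0


# s(T,T) + s(I,L) + s(V,V) - 2g with g = 1.5
print(alignment_score("TYWIV", "T--LV", toy_s, gap_penalty=1.5))
```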
Tools for multiple sequence alignment Most commonly used heuristic for multiple alignment: progressive alignment (mid 1980s). Idea: calculate the multiple alignment as a series of pairwise alignments of sequences and profiles. Use a guide tree to determine the order of the pairwise alignments.
'Progressive' Alignment
WCEAQTKNGQGWVPSNYITPVN
WWRLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVPKAKIIRD
YAVESEAHPGSFQPVAALERIN
WLNYNETTGERGDFPGTYVEYIGRKKISP
Figure: guide tree
'Progressive' Alignment
Profile alignment: once a gap, always a gap

WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASFQPVAALERIN
WLNYNEERGDFPGTYVEYIGRKKISP

WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASVQ--PVAALERIN
WLN-YNEERGDFPGTYVEYIGRKKISP

WCEAQTKNGQGWVPSNYITPVN-
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASVQ--PVAALERIN
WLN-YNEERGDFPGTYVEYIGRKKISP

WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVP--KAKIIRD
YAVESEA---SVQ--PVAALERIN
WLN-YNE---ERGDFPGTYVEYIGRKKISP
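The guide tree only has to fix the order in which sequences and profiles are merged. A minimal sketch of that step, assuming pairwise distances are already available and using average-linkage clustering for brevity; this is not claimed to be the tree-building method of any particular program, and the labels and distances are invented.

```python
def guide_tree_order(names, dist):
    """Return the merge order of an average-linkage clustering (a simple guide tree).

    names: sequence labels; dist: dict mapping frozenset({a, b}) -> distance.
    """
    clusters = {name: (name,) for name in names}  # cluster label -> member labels
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # pick the closest pair of current clusters
        a, b = min(
            ((x, y) for x in clusters for y in clusters if x < y),
            key=lambda pair: d[frozenset(pair)],
        )
        merged = f"({a},{b})"
        members = clusters[a] + clusters[b]
        # average linkage: mean distance over all pairs of original members
        for c in clusters:
            if c not in (a, b):
                total = sum(dist[frozenset((m, k))] for m in members for k in clusters[c])
                d[frozenset((merged, c))] = total / (len(members) * len(clusters[c]))
        del clusters[a], clusters[b]
        clusters[merged] = members
        merges.append((a, b))
    return merges


# Hypothetical distances between four sequences A-D.
toy_dist = {
    frozenset("AB"): 0.2, frozenset("AC"): 0.6, frozenset("AD"): 0.7,
    frozenset("BC"): 0.5, frozenset("BD"): 0.8, frozenset("CD"): 0.3,
}
print(guide_tree_order(["A", "B", "C", "D"], toy_dist))
```

Each entry in the returned list corresponds to one pairwise alignment step: first two sequences, later sequence against profile or profile against profile.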
CLUSTAL W Most important software program: CLUSTAL W, by J. Thompson, T. Gibson, D. Higgins (1994, Nucl. Acids Res.); 22,327 citations in the literature as of Oct 2007!
Tools for multiple sequence alignment Problems with traditional approach: Results depend on the gap penalty. A heuristic guide tree determines the alignment, yet the alignment is then used for phylogeny reconstruction. The algorithm produces global alignments.
Tools for multiple sequence alignment Problems with traditional approach: But many sequence families share only local similarity, e.g. sequences that share one conserved motif.
Local sequence alignment
Find common motif in sequences; ignore the rest: local alignment

EYENS
ERYENS
ERYAS

E-YENS
ERYENS
ERYA-S
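For two sequences, "find the common motif and ignore the rest" is usually computed with the Smith-Waterman recurrence, which the slides do not name. The sketch below returns only the best local score; match, mismatch and gap values are arbitrary assumptions, and the multiple (three-sequence) case shown above is not covered.

```python
def local_alignment_score(x: str, y: str, match: int = 2, mismatch: int = -1, gap: int = 2) -> int:
    """Best local alignment score (Smith-Waterman recurrence, linear gap penalty)."""
    prev = [0] * (len(y) + 1)
    best = 0
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            # 0 restarts the alignment: poorly matching flanks are simply ignored
            cur[j] = max(0, prev[j - 1] + sub, prev[j] - gap, cur[j - 1] - gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(local_alignment_score("EYENS", "ERYENS"))
```

The `max(0, ...)` term is what makes the alignment local: a score can never drop below zero, so an alignment may start and end anywhere.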
Gibbs Motif Sampler Local multiple alignment without gaps, e.g. Gibbs sampling: C.E. Lawrence et al. (1993, Science)
Traditional alignment approaches: Either global or local methods!
New question: sequence families with multiple local similarities. Neither local nor global methods are applicable.
New question: sequence families with multiple local similarities. Alignment is possible if the order is conserved.
The DIALIGN approach Morgenstern, Dress, Werner (1996, Proc. Natl. Acad. Sci.) Combination of global and local methods. Assemble the multiple alignment from gap-free local pairwise alignments ("fragments").
The DIALIGN approach
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa

atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacccctgaattgaataa

atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacc cctgaattgaataa

atc------taatagttaaactcccccgtgc-ttag
cagtgcgtgtattactaac gg-ttcaatcgcg
caaa--gagtatcacc cctgaattgaataa
Consistency!

atc------TAATAGTTAaactccccCGTGC-TTag
cagtgcGTGTATTACTAAc GG-TTCAATcgcg
caaa--GAGTATCAcc CCTGaaTTGAATaa
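"Consistency" means that all selected fragments must fit into one and the same alignment. For two fragments between the same pair of sequences this reduces to a simple ordering test, sketched below. The `Fragment` type and function name are invented for illustration; the multi-sequence case, where conflicts can also arise indirectly through a third sequence, is not covered here.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    start_a: int  # 0-based start in sequence A
    start_b: int  # 0-based start in sequence B
    length: int   # number of gap-free aligned positions


def pairwise_consistent(f: Fragment, g: Fragment) -> bool:
    """True if both fragments fit into one alignment of sequences A and B."""
    if f.start_a - f.start_b == g.start_a - g.start_b:
        return True  # same diagonal: the fragments never contradict each other
    first, second = sorted((f, g))
    # different diagonals: the second fragment must start after the first one
    # ends in BOTH sequences (no overlap, no crossing)
    return (second.start_a >= first.start_a + first.length
            and second.start_b >= first.start_b + first.length)


print(pairwise_consistent(Fragment(0, 0, 3), Fragment(5, 7, 4)))  # ordered the same way in both sequences
print(pairwise_consistent(Fragment(0, 5, 3), Fragment(4, 0, 3)))  # the fragments cross
```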
The DIALIGN approach Advantages of segment-based approach: The program can produce global and local alignments! Sequence families can be aligned that cannot be aligned with standard methods.
T-COFFEE C. Notredame, D. Higgins, J. Heringa (2000, J. Mol. Biol.) Combination of global and local methods
T-COFFEE
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD THE FAT CAT
Mixing Heterogeneous Data With T-Coffee (figure; labels: Local Alignment, Global Alignment, Multiple Alignment, Structural, Specialist, Multiple Sequence Alignment)
T-COFFEE Idea: 1. Build a library of pairwise alignments. 2. An alignment of seq i, j and an alignment of seq j, k support the alignment of seq i, k.
T-COFFEE Less sensitive to spurious pairwise similarities. Can handle local homologies better than CLUSTAL.
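The support of i, k through j can be sketched as a "library extension": a residue pair gains weight whenever both residues are aligned to the same residue of a third sequence. The sketch below follows the usual description of this step (add the smaller of the two weights along the path); the data layout, names and weights are assumptions made for illustration.

```python
from collections import defaultdict


def extend_library(library):
    """Triplet extension: add support routed through a third sequence.

    library maps ((seq_x, pos_x), (seq_y, pos_y)) -> weight for residue pairs
    that are aligned in some pairwise alignment.
    """
    partners = defaultdict(dict)  # residue -> {aligned residue: weight}
    for (r1, r2), w in library.items():
        partners[r1][r2] = w
        partners[r2][r1] = w
    extended = defaultdict(float)
    for (r1, r2), w in library.items():
        extended[frozenset((r1, r2))] += w  # direct support
    for mid, aligned in partners.items():
        for r1, w1 in aligned.items():
            for r2, w2 in aligned.items():
                # path r1 - mid - r2 over three different sequences, counted once
                if r1 < r2 and len({r1[0], r2[0], mid[0]}) == 3:
                    extended[frozenset((r1, r2))] += min(w1, w2)
    return dict(extended)


# Hypothetical library: A3~B3 and B3~C2 are aligned, A3~C2 is not directly.
toy_library = {(("A", 3), ("B", 3)): 88, (("B", 3), ("C", 2)): 77}
print(extend_library(toy_library)[frozenset((("A", 3), ("C", 2)))])
```

A pair that appears in only one spurious pairwise alignment gets no such indirect support, which is one way to read "less sensitive to spurious pairwise similarities".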
Evaluation of multi-alignment methods Alignment evaluation by comparison to trusted benchmark alignments. The "true" alignment is known from information about structure or evolution.
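One common way to compare a program's output with a reference alignment is a sum-of-pairs score: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The sketch below is a plain version of that idea and does not reproduce the scoring programs shipped with any benchmark (for example, it does not restrict the comparison to core blocks).

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of residue pairs ((row, residue_index), (row, residue_index)) sharing a column."""
    counters = [0] * len(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        residues = []
        for row, seq in enumerate(alignment):
            if seq[col] != "-":
                residues.append((row, counters[row]))
                counters[row] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of residue pairs aligned in the reference that the test alignment recovers."""
    ref_pairs = aligned_pairs(reference)
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


reference = ["TYI-M", "TCIVM"]
print(sum_of_pairs_score(["TYIM-", "TCIVM"], reference))
```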
Evaluation of multi-alignment methods
BAliBASE reference alignments
1aboA 1.NLFVALYDfvasgdntlsitkGEKLRVLgynhn gE
1ycsB 1 kGVIYALWDyepqnddelpmkeGDCMTIIhrede deiE
1pht 1 gYQYRALYDykkereedidlhlGDILTVNkgslvalgfsdgqearpeeiG
1ihvA 1.NFRVYYRDsrd......pvwkGPAKLLWkg eG
1vie 1.drvrkksga awqGQIVGWYctnlt peG

1aboA 36 WCEAQt..kngqGWVPSNYITPVN
1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP
1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp
1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd
1vie 28 YAVESeahpgsvQIYPVAALERIN
Key: alpha helix RED, beta strand GREEN, core blocks UNDERSCORE
Result: DIALIGN best method for distantly related sequences, T-Coffee best for globally related proteins
Evaluation of multi-alignment methods Conclusion: no single best multi alignment program! Advice: try different methods!
Tools for phylogeny reconstruction Two approaches covered in this course: distance methods (e.g. Neighbour-Joining) and Maximum Likelihood. Other important methods (not covered in this course): maximum parsimony, Bayesian approaches.
Tools for phylogeny reconstruction Phylogenetic trees: rooted trees, unrooted trees. Many methods produce unrooted trees: find the root using an outgroup!
Phylogenetic Reconstruction: An Example Biological question: Are sponges mono-/paraphyletic? Organisms of interest: sponges
Build dataset: Query sequence (DNA/protein sequence from a sponge gene). Search for homologs using e.g. BLAST. Hits from search: putative homologs. Result: dataset.
Sequence alignment: Dataset (hits from search: putative homologs). Alignment tools: ClustalW, T-Coffee, DIALIGN, ... many more. Use them to relate the sequences to one another. Result: sequence alignment.
Estimate phylogeny: Alignment. Phylogeny methods: distance-based: NJ, UPGMA; parsimony: maximum parsimony (PHYLIP/PAUP); statistical: maximum likelihood (PhyML), Bayesian inference (MrBayes). Result: phylogenetic tree.
Interpret results Hypothesis: Sponges are monophyletic
Tools for phylogeny reconstruction Distance methods: For N sequences S_1, …, S_N: calculate the distance d(i,j) for any two sequences S_i and S_j. Goal: find the tree that represents all distances d(i,j) as closely as possible. To calculate the distances d(i,j): construct a multiple alignment of the input sequences and consider the substitutions implied by the alignment.
Matrix of pairwise distances d(i,j)
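A distance matrix can be filled from the rows of a multiple alignment by counting the substitutions each pair of rows implies. The sketch below assumes DNA sequences, skips columns where either row has a gap, and applies the Jukes-Cantor correction for multiple substitutions at one site; the slides do not prescribe a particular distance, so this choice is an assumption.

```python
import math


def p_distance(row_a: str, row_b: str) -> float:
    """Fraction of differing residues over columns where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance for DNA (only defined for p < 0.75)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(alignment):
    """Matrix of pairwise distances d(i, j) between all rows of the alignment."""
    n = len(alignment)
    return [[0.0 if i == j else jukes_cantor(p_distance(alignment[i], alignment[j]))
             for j in range(n)] for i in range(n)]


for row in distance_matrix(["ACGT-A", "ACCTTA", "ACGTTA"]):
    print(row)
```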
Find tree that corresponds to distances d(i,j)
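Neighbour-Joining builds the tree step by step; in each step it joins the pair of nodes that minimises the criterion Q(i,j) = (n - 2) d(i,j) - sum_k d(i,k) - sum_k d(j,k). The sketch below shows only this selection step, not the branch-length calculation or the matrix update that follow; the 5 x 5 matrix is a made-up example.

```python
def nj_pair_to_join(d):
    """Return the index pair (i, j) that minimises the neighbour-joining Q criterion."""
    n = len(d)
    row_sums = [sum(row) for row in d]
    best_pair, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair


# Made-up symmetric distance matrix for five taxa.
toy_matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pair_to_join(toy_matrix))
```

Note that the pair with the smallest raw distance (taxa 3 and 4, distance 3) is not the one chosen: the criterion also rewards pairs that are far from everything else.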
Tools for phylogeny reconstruction Maximum likelihood: Consider the evolution of sequences as a random process. A stochastic model assigns probabilities to substitutions. Consider a tree T as a hypothesis about the observed sequence data D. Search for the tree with the highest likelihood P(D|T).
Tools for phylogeny reconstruction Assumptions: Positions in the sequences (columns in the alignment) are independent of each other. Events on different branches of the tree are independent of each other. Result: probabilities can be multiplied.
Probability P(D|T) for given residues at internal nodes
Consider all possible residues for internal nodes
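For the smallest possible case, two leaves joined at one internal node, both steps fit in a few lines: sum over all residues at the internal node, then multiply over alignment columns. The sketch assumes ungapped DNA, the Jukes-Cantor substitution model and equal base frequencies of 1/4; real programs use richer models and larger trees.

```python
import math

BASES = "ACGT"


def jc_prob(x: str, y: str, t: float) -> float:
    """Jukes-Cantor probability that base x is replaced by base y along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def site_likelihood(a: str, b: str, t1: float, t2: float) -> float:
    """P(column | tree) for leaves a and b attached to one internal node."""
    # the internal residue is unknown: sum over all possibilities
    return sum(0.25 * jc_prob(x, a, t1) * jc_prob(x, b, t2) for x in BASES)


def alignment_likelihood(row_a: str, row_b: str, t1: float, t2: float) -> float:
    """Columns are assumed independent, so the per-column probabilities multiply."""
    return math.prod(site_likelihood(a, b, t1, t2) for a, b in zip(row_a, row_b))


print(alignment_likelihood("ACGT", "ACGA", t1=0.1, t2=0.3))
```

In practice the logarithm of the likelihood is used, since the product over many columns quickly becomes too small for floating-point numbers.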
Testing the reliability of a tree (or parts of it): the bootstrap approach. Bootstrap in general: repeat a statistical test after random re-sampling, i.e. by drawing new samples, with replacement, from the observed data. In phylogeny: 1. Randomly select columns from the alignment (with replacement) and repeat the tree reconstruction with the same method (e.g times). 2. Calculate for every branch: how often is it observed in the newly constructed trees?
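The two steps can be sketched as follows. Tree reconstruction itself is left open: `build_tree` and `splits_of` are hypothetical callables supplied by the user (any reconstruction method, and any function that returns the set of branches of a tree in a comparable form), and the number of replicates is a free parameter.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement; same number of columns as the input."""
    n_cols = len(alignment[0])
    chosen = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in chosen) for row in alignment]


def bootstrap_support(alignment, build_tree, splits_of, replicates, seed=0):
    """Fraction of replicate trees that contain each branch of the original tree."""
    rng = random.Random(seed)
    reference = splits_of(build_tree(alignment))
    counts = {split: 0 for split in reference}
    for _ in range(replicates):
        found = splits_of(build_tree(bootstrap_alignment(alignment, rng)))
        for split in reference:
            if split in found:
                counts[split] += 1
    return {split: counts[split] / replicates for split in reference}


toy_alignment = ["ACGT", "AGGT", "ACGA"]
print(bootstrap_alignment(toy_alignment, random.Random(1)))
```

Whole columns are resampled, so the residues of all sequences at one position always stay together.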