Methods course Multiple sequence alignment and Reconstruction of phylogenetic trees Burkhard Morgenstern, Fabian Schreiber Göttingen, October/November.


Methods course Multiple sequence alignment and Reconstruction of phylogenetic trees Burkhard Morgenstern, Fabian Schreiber Göttingen, October/November 2007

Tools for multiple sequence alignment
Multiple alignment is the basis of (almost) all methods for sequence analysis in bioinformatics.

Tools for multiple sequence alignment
T Y I M R E A Q Y E
T C I V M R E A Y E

Tools for multiple sequence alignment
T Y I - M R E A Q Y E
T C I V M R E A - Y E

Tools for multiple sequence alignment
T Y I M R E A Q Y E
T C I V M R E A Y E
Y I M Q E V Q Q E
Y I A M R E Q Y E

Tools for multiple sequence alignment
T Y I - M R E A Q Y E
T C I V M R E A - Y E
Y - I - M Q E V Q Q E
Y - I A M R E - Q Y E

Tools for multiple sequence alignment
T Y I - M R E A Q Y E
T C I V M R E A - Y E
- Y I - M Q E V Q Q E
Y - I A M R E - Q Y E
Astronomical number of possible alignments!

Tools for multiple sequence alignment
T Y I - M R E A Q Y E
T C I V - M R E A Y E
- Y I - M Q E V Q Q E
Y - I A M R E - Q Y E
Astronomical number of possible alignments!
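To make the "astronomical number" concrete, here is a minimal sketch that counts the global alignments of just two sequences. It assumes that every alignment column is either a residue pair or a single residue opposite a gap, and that alignments differing only in the order of adjacent gaps count as different. The function name `count_alignments` is illustrative and not part of the course material.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m: int, n: int) -> int:
    """Number of global pairwise alignments of sequences of lengths m and n."""
    if m == 0 or n == 0:
        return 1  # only one way left: put the remaining residues opposite gaps
    # The last column is (residue, residue), (residue, gap) or (gap, residue).
    return (count_alignments(m - 1, n - 1)
            + count_alignments(m - 1, n)
            + count_alignments(m, n - 1))


if __name__ == "__main__":
    for length in (2, 5, 10, 20):
        print(length, count_alignments(length, length))
```

For two sequences of ten residues each, as on the slides, this count is already about 8 million, and it grows much faster once more sequences are added.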

Tools for multiple sequence alignment
T Y I - M R E A Q Y E
T C I V M R E A - Y E
- Y I - M Q E V Q Q E
Y - I A M R E - Q Y E
Which one is the best?

Tools for multiple sequence alignment
Questions in development of alignment programs:
(1) What is a good alignment? Objective function ("score")
(2) How to find a good alignment? Optimization algorithm

Tools for multiple sequence alignment
What is a biologically good alignment?

Tools for multiple sequence alignment
Criteria for alignment quality:
1. 3D structure: align residues at corresponding positions in the 3D structure of the protein!
2. Evolution: align residues with common ancestors!

Tools for multiple sequence alignment
T Y I - M R E A Q Y E
T C I V M - R E A Y E
- Y I - M Q E V Q Q E
- Y I A M R E - Q Y E
An alignment is a hypothesis about sequence evolution. Search for the most plausible hypothesis!

Tools for multiple sequence alignment
T Y I - M R E A Q Y E
T C I V - M R E A Y E
- Y I - M Q E V Q Q E
- Y I A M R E - Q Y E
An alignment is a hypothesis about sequence evolution. Search for the most plausible hypothesis!

Tools for multiple sequence alignment
For amino acids a and b, compute:
- the probability p_{a,b} of substitution a → b (or b → a),
- the frequency q_a of a.
Define a similarity score s(a,b) based on p_{a,b} and q_a.
Result: similarity matrix (substitution matrix), e.g. PAM (Dayhoff matrix), BLOSUM, …
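The slide does not spell out how s(a,b) is derived from p_{a,b} and q_a. One common choice, assumed here, is a log-odds score that compares how often a pair is observed with how often it would occur by chance. The function name and the numbers below are illustrative only and are not taken from any real matrix.

```python
import math


def log_odds_score(p_ab: float, q_a: float, q_b: float) -> float:
    """Log-odds score: observed pair probability versus chance expectation."""
    return math.log2(p_ab / (q_a * q_b))


if __name__ == "__main__":
    # Toy numbers for illustration, not values from PAM or BLOSUM.
    print(log_odds_score(p_ab=0.02, q_a=0.1, q_b=0.1))   # pair seen more often than by chance
    print(log_odds_score(p_ab=0.005, q_a=0.1, q_b=0.1))  # pair seen less often than by chance
```

Under this assumption, pairs that occur more often than expected get a positive score and pairs that occur less often get a negative one.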

Tools for multiple sequence alignment

Traditional objective functions:
Define the score of an alignment as
- the sum of the individual similarity scores s(a,b) of aligned amino acid residues,
- minus a gap penalty g for each gap in the alignment.
An optimal alignment can be calculated for two sequences, but in practice not for more than 8 sequences.
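For two sequences, the optimal score under such an objective function can be found by dynamic programming (the Needleman-Wunsch scheme). The sketch below is a minimal version that returns only the score. It assumes a simple match/mismatch score in place of a full substitution matrix and a linear gap penalty; the function name and default values are illustrative.

```python
def nw_score(x: str, y: str, match: int = 1, mismatch: int = -1, gap: int = 1) -> int:
    """Optimal global alignment score of x and y with a linear gap penalty."""
    # prev[j]: best score for aligning the processed prefix of x with y[:j]
    prev = [-gap * j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [-gap * i]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s,     # align x[i-1] with y[j-1]
                            prev[j] - gap,       # x[i-1] opposite a gap
                            curr[j - 1] - gap))  # y[j-1] opposite a gap
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # The two sequences from the first alignment slides.
    print(nw_score("TYIMREAQYE", "TCIVMREAYE"))
```

The table has one cell per pair of prefixes, so time and memory grow with the product of the sequence lengths. Extending the same table to many sequences multiplies in one more dimension per sequence, which is why the exact approach becomes impractical.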

Example:
T Y W I V
T - - L V
Score = s(T,T) + s(I,L) + s(V,V) - 2g
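The example score can be reproduced with a few lines of code. This is a sketch under stated assumptions: the penalty g is charged once per gap character, and `toy_similarity` is a hypothetical stand-in for a real substitution matrix.

```python
def alignment_score(row_a: str, row_b: str, s, g: float) -> float:
    """Sum of s(a, b) over aligned residue pairs, minus g per gap character."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    score = 0.0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score -= g  # one penalty per gap character
        else:
            score += s(a, b)
    return score


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical scores for illustration only (not PAM or BLOSUM values)."""
    if a == b:
        return 2.0
    if {a, b} == {"I", "L"}:
        return 1.0
    return -1.0


if __name__ == "__main__":
    # s(T,T) + s(I,L) + s(V,V) - 2g with the toy values and g = 3
    print(alignment_score("TYWIV", "T--LV", toy_similarity, g=3.0))
```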

Tools for multiple sequence alignment
Most commonly used heuristic for multiple alignment: progressive alignment (mid 1980s)
Idea: calculate the multiple alignment as a series of pairwise alignments of sequences and profiles.
Use a guide tree to determine the order of the pairwise alignments.

"Progressive" alignment
WCEAQTKNGQGWVPSNYITPVN
WWRLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVPKAKIIRD
YAVESEAHPGSFQPVAALERIN
WLNYNETTGERGDFPGTYVEYIGRKKISP

"Progressive" alignment
WCEAQTKNGQGWVPSNYITPVN
WWRLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVPKAKIIRD
YAVESEAHPGSFQPVAALERIN
WLNYNETTGERGDFPGTYVEYIGRKKISP
Guide tree

"Progressive" alignment
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASFQPVAALERIN
WLNYNEERGDFPGTYVEYIGRKKISP
Profile alignment: once a gap, always a gap

"Progressive" alignment
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASVQ--PVAALERIN
WLN-YNEERGDFPGTYVEYIGRKKISP
Profile alignment: once a gap, always a gap

"Progressive" alignment
WCEAQTKNGQGWVPSNYITPVN-
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASVQ--PVAALERIN
WLN-YNEERGDFPGTYVEYIGRKKISP
Profile alignment: once a gap, always a gap

"Progressive" alignment
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVP--KAKIIRD
YAVESEA---SVQ--PVAALERIN
WLN-YNE---ERGDFPGTYVEYIGRKKISP
Profile alignment: once a gap, always a gap
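"Once a gap, always a gap" can be read as a rule about how profiles are edited: when two profiles are aligned, a new gap is inserted as a whole column into every sequence of a profile, and gaps introduced earlier are never removed. A minimal sketch of that single operation follows; the helper name `insert_gap_column` and the toy profile are hypothetical.

```python
def insert_gap_column(profile: list[str], pos: int) -> list[str]:
    """Insert a gap at column pos into every row of an aligned profile."""
    return [row[:pos] + "-" + row[pos:] for row in profile]


if __name__ == "__main__":
    profile = ["AC-GT", "ACTGT"]  # toy profile with one earlier gap
    for row in insert_gap_column(profile, 2):
        print(row)
```

Because earlier gaps stay where they are, a misplaced gap from an early pairwise step cannot be corrected later in the progressive procedure.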

CLUSTAL W
Most important software program: CLUSTAL W, by J. Thompson, T. Gibson, D. Higgins (1994, Nucleic Acids Res.) (22,327 citations in the literature as of Oct 2007!)

Tools for multiple sequence alignment
Problems with traditional approach:
- Results depend on gap penalty
- Heuristic guide tree determines alignment; alignment used for phylogeny reconstruction
- Algorithm produces global alignments

Tools for multiple sequence alignment
Problems with traditional approach:
But: many sequence families share only local similarity, e.g. sequences share one conserved motif

Local sequence alignment
Find common motif in sequences; ignore the rest
EYENS
ERYENS
ERYAS

Local sequence alignment
Find common motif in sequences; ignore the rest
E-YENS
ERYENS
ERYA-S

Local sequence alignment
Find common motif in sequences; ignore the rest: local alignment
E-YENS
ERYENS
ERYA-S

Gibbs Motif Sampler
Local multiple alignment without gaps, e.g. Gibbs sampling: C.E. Lawrence et al. (1993, Science)

Traditional alignment approaches: Either global or local methods!

New question: sequence families with multiple local similarities
Neither local nor global methods are applicable

New question: sequence families with multiple local similarities Alignment possible if order conserved

The DIALIGN approach
Morgenstern, Dress, Werner (1996, Proc. Natl. Acad. Sci.)
Combination of global and local methods
Assemble the multiple alignment from gap-free local pairwise alignments ("fragments")

The DIALIGN approach
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa


The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa

The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacccctgaattgaataa

The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacc cctgaattgaataa

The DIALIGN approach
atc------taatagttaaactcccccgtgc-ttag
cagtgcgtgtattactaac gg-ttcaatcgcg
caaa--gagtatcacc cctgaattgaataa

The DIALIGN approach
atc------taatagttaaactcccccgtgc-ttag
cagtgcgtgtattactaac gg-ttcaatcgcg
caaa--gagtatcacc cctgaattgaataa
Consistency!
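Consistency means that a set of fragments can be placed into one alignment without contradictions. The sketch below covers only the simplest case, two fragments between the same pair of sequences; checking consistency across many sequences is more involved and is not shown. The `Fragment` type and the function name are hypothetical and do not reflect how DIALIGN is implemented.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    start1: int  # start position in the first sequence (0-based)
    start2: int  # start position in the second sequence (0-based)
    length: int  # number of aligned residue pairs, no gaps


def consistent(f: Fragment, g: Fragment) -> bool:
    """True if both fragments fit into one alignment of the two sequences."""
    if f.start1 - f.start2 == g.start1 - g.start2:
        return True  # same diagonal: any overlapping parts agree
    if f.start1 > g.start1:
        f, g = g, f  # make f the fragment that starts first in sequence 1
    # f must end before g begins in both sequences (no crossing, no overlap)
    return f.start1 + f.length <= g.start1 and f.start2 + f.length <= g.start2


if __name__ == "__main__":
    print(consistent(Fragment(0, 0, 3), Fragment(5, 4, 3)))  # same order in both sequences
    print(consistent(Fragment(0, 5, 3), Fragment(5, 0, 3)))  # fragments cross each other
```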

The DIALIGN approach
atc------TAATAGTTAaactccccCGTGC-TTag
cagtgcGTGTATTACTAAc GG-TTCAATcgcg
caaa--GAGTATCAcc CCTGaaTTGAATaa

The DIALIGN approach
Advantages of segment-based approach:
- Program can produce global and local alignments!
- Sequence families can be aligned that cannot be aligned with standard methods

T-COFFEE C. Notredame, D. Higgins, J. Heringa (2000, J. Mol. Biol.) Combination of global and local methods

T-COFFEE
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD THE FAT CAT


T-COFFEE

Mixing Heterogeneous Data With T-Coffee
[Figure labels: Local Alignment, Global Alignment, Multiple Sequence Alignment, Multiple Alignment, Structural, Specialist]

T-COFFEE
Idea:
1. Build library of pairwise alignments
2. Alignment from seq i, j and seq j, k supports alignment from seq i, k.
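The second point can be sketched as follows. It assumes that the library stores a weight for each aligned residue pair, and that support through a third sequence B counts as the smaller of the two weights along the path A to B to C. The data layout (dictionaries keyed by position pairs) and all names are hypothetical simplifications, not T-Coffee's actual data structures.

```python
def extended_weight(x: int, z: int,
                    w_ac: dict, w_ab: dict, w_bc: dict, len_b: int) -> float:
    """Weight for aligning residue x of A with residue z of C.

    Direct library weight plus the support contributed through every
    residue y of the intermediate sequence B.
    """
    direct = w_ac.get((x, z), 0)
    via_b = sum(min(w_ab.get((x, y), 0), w_bc.get((y, z), 0))
                for y in range(len_b))
    return direct + via_b


if __name__ == "__main__":
    # Toy library: A[0]~B[0] with weight 88, B[0]~C[0] with weight 77.
    w_ab = {(0, 0): 88}
    w_bc = {(0, 0): 77}
    w_ac = {}  # A[0] and C[0] were never aligned directly
    print(extended_weight(0, 0, w_ac, w_ab, w_bc, len_b=1))
```

In this toy case the pair A[0], C[0] receives a positive weight although it never appears in a pairwise alignment of A and C, because both residues are aligned with the same residue of B.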

T-COFFEE
- Less sensitive to spurious pairwise similarities
- Can handle local homologies better than CLUSTAL

Evaluation of multi-alignment methods
Alignment evaluation by comparison to trusted benchmark alignments.
"True" alignment known from information about structure or evolution.
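The slides do not say which measure is used for the comparison. One widely used choice, assumed here, is a sum-of-pairs score: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The function names and the toy alignments are illustrative.

```python
from itertools import combinations


def aligned_pairs(alignment: list[str]) -> set:
    """All residue pairs sharing a column, as ((seq, pos), (seq, pos))."""
    pairs = set()
    counters = [0] * len(alignment)  # residues seen so far in each sequence
    for col in range(len(alignment[0])):
        column = []
        for seq_idx, row in enumerate(alignment):
            if row[col] != "-":
                column.append((seq_idx, counters[seq_idx]))
                counters[seq_idx] += 1
        pairs.update(combinations(column, 2))
    return pairs


def sum_of_pairs_score(test: list[str], reference: list[str]) -> float:
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref_pairs = aligned_pairs(reference)
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


if __name__ == "__main__":
    reference = ["AC-GT", "ACTGT"]
    test = ["ACG-T", "ACTGT"]
    print(sum_of_pairs_score(test, reference))
```

The sketch assumes that all rows of an alignment have the same length and that the reference contains at least one aligned residue pair.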

Evaluation of multi-alignment methods
BAliBASE reference alignments
1aboA  1 .NLFVALYDfvasgdntlsitkGEKLRVLgynhn gE
1ycsB  1 kGVIYALWDyepqnddelpmkeGDCMTIIhrede deiE
1pht   1 gYQYRALYDykkereedidlhlGDILTVNkgslvalgfsdgqearpeeiG
1ihvA  1 .NFRVYYRDsrd......pvwkGPAKLLWkg eG
1vie   1 .drvrkksga awqGQIVGWYctnlt peG
1aboA 36 WCEAQt..kngqGWVPSNYITPVN
1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP
1pht  51 WLNGYnettgerGDFPGTYVEYIGrkkisp
1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd
1vie  28 YAVESeahpgsvQIYPVAALERIN
Key: alpha helix RED, beta strand GREEN, core blocks UNDERSCORE

Result: DIALIGN best method for distantly related sequences, T-Coffee best for globally related proteins

Evaluation of multi-alignment methods Conclusion: no single best multi alignment program! Advice: try different methods!

Tools for phylogeny reconstruction
Two approaches covered in this course:
- Distance methods, e.g. Neighbour-Joining
- Maximum Likelihood
Other important methods (not covered in this course):
- Maximum parsimony
- Bayesian approaches

Tools for phylogeny reconstruction
Phylogenetic trees:
- rooted trees
- unrooted trees
Many methods produce unrooted trees: find root using outgroup!

Phylogenetic Reconstruction: An Example
Biological question: Are sponges mono-/paraphyletic?
Organisms of interest: Sponge

Build Dataset
Query sequence: DNA/protein sequence from sponge gene
Search for homologs using e.g. BLAST
Hits from search: putative homologs
Result: dataset

Sequence alignment
Dataset: hits from search (putative homologs)
Alignment tools, used to relate the sequences to each other:
- Clustalw
- T-Coffee
- Dialign
- ... many more
Result: sequence alignment

Estimate Phylogeny
From alignment to phylogenetic tree
Phylogeny methods:
Distance-based:
- NJ
- UPGMA
Parsimony:
- Max. Parsimony (Phylip/Paup)
Statistical:
- Max. Likelihood (Phyml)
- Bayesian Inf. (MrBayes)

Interpret results
Hypothesis: Sponges are monophyletic

Tools for phylogeny reconstruction
Distance methods:
For N sequences S_1, …, S_N:
- Calculate the distance d(i,j) for any two sequences S_i and S_j
- Goal: find the tree that represents all distances d(i,j) as closely as possible
To calculate the distances d(i,j): construct a multiple alignment of the input sequences and consider the substitutions implied by the alignment.
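A minimal sketch of how a distance d(i,j) might be read off two rows of a multiple alignment. The slides do not name a particular distance; the uncorrected fraction of differing columns (p-distance) and the Jukes-Cantor correction for DNA are assumed here as common examples. Function names are illustrative.

```python
import math


def p_distance(row_a: str, row_b: str) -> float:
    """Fraction of mismatching columns, ignoring columns that contain a gap."""
    compared = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not compared:
        raise ValueError("no gap-free columns to compare")
    return sum(a != b for a, b in compared) / len(compared)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance for DNA; defined only for p < 0.75."""
    if p >= 0.75:
        raise ValueError("distance undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    p = p_distance("ACGTACGT", "ACGAACGA")
    print(p, jukes_cantor(p))
```

The correction accounts for multiple substitutions at the same position, so the corrected distance is larger than the observed fraction of differences.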

Matrix of pairwise distances d(i,j)

Find tree that corresponds to distances d(i,j)
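As an illustration of how a tree can be derived from the distance matrix, the sketch below shows only the first step of Neighbour-Joining, the distance method named in this course: choosing which two sequences to join. The full algorithm then replaces the chosen pair by a new internal node, updates the distances, and repeats. The function name and the toy matrix are assumptions made for this example.

```python
def nj_pair_to_join(d: list[list[float]]) -> tuple[int, int]:
    """Return the pair (i, j) that minimises the neighbour-joining criterion Q."""
    n = len(d)
    row_sums = [sum(row) for row in d]
    best, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            # Small d[i][j] and large distances to all other taxa favour joining.
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best, best_q = (i, j), q
    return best


if __name__ == "__main__":
    # Toy symmetric distance matrix for five sequences.
    d = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    print(nj_pair_to_join(d))
```

In the toy matrix the pair with the smallest distance is (3, 4), yet the criterion selects (0, 1). This shows that Neighbour-Joining does not simply join the closest pair.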

Tools for phylogeny reconstruction
Maximum likelihood:
- Consider evolution of sequences as a random process. A stochastic model assigns probabilities to substitutions.
- Consider tree T as a hypothesis about the observed sequence data D
- Search for the tree with the highest likelihood P(D|T)

Tools for phylogeny reconstruction
Assumptions:
- Positions in sequences (columns in alignment) are independent of each other
- Events on different branches of the tree are independent of each other
Result: probabilities can be multiplied

Probability P(D|T) for given residues at internal nodes

Consider all possible residues for internal nodes
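The two steps, multiplying probabilities along independent branches and summing over all possible residues at an internal node, can be sketched for a deliberately small case. Everything specific here is an assumption for illustration: DNA sequences without gaps, the Jukes-Cantor substitution model, and a star-shaped tree with a single internal node and the same branch length t to every leaf.

```python
import math

BASES = "ACGT"


def jc69_prob(x: str, y: str, t: float) -> float:
    """Jukes-Cantor probability that base x becomes y along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def site_likelihood(column: str, t: float) -> float:
    """P(column | star tree): sum over the unknown base at the internal node."""
    total = 0.0
    for root in BASES:
        prob = 0.25  # equal base frequencies at the internal node
        for leaf in column:
            prob *= jc69_prob(root, leaf, t)  # branches are independent
        total += prob
    return total


def log_likelihood(alignment: list[str], t: float) -> float:
    """log P(D | T): columns are independent, so their log-likelihoods add up."""
    return sum(math.log(site_likelihood("".join(col), t))
               for col in zip(*alignment))


if __name__ == "__main__":
    alignment = ["ACGT", "ACGA", "ACTT"]  # toy data, three sequences
    for t in (0.05, 0.2, 1.0):
        print(t, log_likelihood(alignment, t))
```

A maximum likelihood search would compare such values across different trees and branch lengths and keep the combination with the highest likelihood. Real trees have several internal nodes, so the sum runs over all combinations of their residues.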

Testing the reliability of a tree (or parts of it): the bootstrap approach
Bootstrap in general: repeat a statistical test after random re-sampling, i.e. by drawing new samples from the observed data.
In phylogeny:
1. Randomly select columns from the alignment (with replacement) and repeat the tree reconstruction with the same method (e.g times)
2. Calculate for every branch: how often is it observed in the newly constructed trees?
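A minimal sketch of the two bootstrap steps. Tree reconstruction itself is left out. For the second step it is assumed that each replicate tree is represented as a set of branches, with each branch written as the frozenset of leaf names on one side of it. That representation and both function names are hypothetical.

```python
import random


def bootstrap_alignment(alignment: list[str], rng: random.Random) -> list[str]:
    """Resample alignment columns with replacement, keeping the original length."""
    n_cols = len(alignment[0])
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in cols) for row in alignment]


def branch_support(replicate_trees: list[set], branch: frozenset) -> float:
    """Fraction of replicate trees that contain the given branch."""
    return sum(branch in tree for tree in replicate_trees) / len(replicate_trees)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    alignment = ["ACGTAC", "ACGAAC", "TCGTAC"]
    for row in bootstrap_alignment(alignment, rng):
        print(row)
```

Each resampled alignment would be passed to the same tree reconstruction method as the original data, and `branch_support` then gives the value that is usually reported for each branch of the original tree.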