Burkhard Morgenstern, Institute of Microbiology and Genetics: Molecular Evolution and Reconstruction of Phylogenetic Trees, winter semester 2006/2007.



Goal: Phylogeny reconstruction based on molecular sequence data (DNA, RNA, protein sequences)

Multiple sequence alignment
Molecular phylogeny reconstruction relies on comparative nucleic acid and protein sequence analysis.
Alignment is the most important tool for sequence comparison.
A multiple alignment contains more information than a pair-wise alignment.

Tools for multiple sequence alignment
Y I M Q E V Q Q E R
A sequence is duplicated in the course of its history (e.g. by a speciation event).

Tools for multiple sequence alignment
Y I M Q E A Q Q E R
Y L M Q E V Q Q E R
Substitutions occur.

Tools for multiple sequence alignment
Y A I M Q E A Q Q E R
Y L M - - V Q Q E R V
Insertions/deletions (indels) occur.

Tools for multiple sequence alignment
Y A I M Q E A Q Q E R
Y L M V Q Q E R V
Because of insertions/deletions, sequence similarity is no longer immediately visible!

Tools for multiple sequence alignment
Y A I M Q E A Q Q E R -
Y - L M V - - Q Q E R V
Alignment brings together related parts of the sequences by inserting gaps into the sequences.
Mismatches correspond to substitutions.
Gaps correspond to indels.

Tools for multiple sequence alignment
Pairwise alignment: alignment of two sequences.
Multiple alignment: alignment of N > 2 sequences.

Tools for multiple sequence alignment
s1 R Y I M R E A Q Y E S A Q
s2 R C I V M R E A Y E
s3 Y I M Q E V Q Q E R
s4 W R Y I A M R E Q Y E
Assumption: the sequence family is related by common ancestry; similarity is due to common history.
Sequence similarity is not obvious (insertions and deletions may have happened).

Tools for multiple sequence alignment
s1 - R Y I - M R E A Q Y E S A Q
s2 - R C I V M R E A - Y E
s3 - - Y I - M Q E V Q Q E R - -
s4 W R Y I A M R E - Q Y E - - -
Multiple alignment = arrangement of sequences by introducing gaps.
Alignment reveals sequence similarities.

Tools for multiple sequence alignment
s1 - R Y I - M R E A Q Y E S A Q
s2 - R C I V M R E A - Y E
s3 - - Y I - M Q E V Q Q E R - -
s4 W R Y I A M R E - Q Y E
General information in a multiple alignment:
Functionally important regions are more conserved than non-functional regions.
Local sequence conservation indicates functionality!

Tools for multiple sequence alignment
s1 - R Y I - M R E A Q Y E S A Q
s2 - R C I V M R E A - Y E
s3 - - Y I - M Q E V Q Q E R - -
s4 W R Y I A M R E - Q Y E
Phylogeny reconstruction based on multiple alignment:
Estimate pairwise distances between sequences (distance-based methods for tree reconstruction).
Estimate evolutionary events (parsimony and maximum likelihood methods).

Tools for multiple sequence alignment
s1 - R Y I - M R E A Q Y E S A Q
s2 - R C I V M R E A - Y E
s3 - - Y I - M Q E V Q Q E R - -
s4 W R Y I A M R E - Q Y E
Task in bioinformatics: find the best multiple alignment for a given sequence set.

Tools for multiple sequence alignment
s1 - R Y I - M R E A Q Y E S A Q
s2 - R C I V M R E A - Y E
s3 - - Y I - M Q E V Q Q E R - -
s4 W R Y I A M R E - Q Y E
Astronomical number of possible alignments!

Tools for multiple sequence alignment
s1 - R Y I - M R E A Q Y E S A Q
s2 - R C I V M R E A Y E -
s3 Y I M Q E V Q Q E R - -
s4 W R Y I A M R E - Q Y E
Astronomical number of possible alignments!
The computer has to decide: which one is best?
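
To give a feeling for "astronomical", the following sketch counts the possible alignments of just two sequences. It assumes that every alignment column is either a residue pair or a gap in one of the two sequences, and that alignments differing only in the order of adjacent gaps count as different. The function name `count_pairwise_alignments` is chosen here for illustration and does not come from the slides.

```python
def count_pairwise_alignments(m: int, n: int) -> int:
    """Count global alignments of two sequences of lengths m and n.

    Assumption: each column is (residue, residue), (residue, gap) or
    (gap, residue), which gives the recurrence
    D(i, j) = D(i-1, j) + D(i, j-1) + D(i-1, j-1), D(i, 0) = D(0, j) = 1.
    """
    row = [1] * (n + 1)              # D(0, j) = 1 for all j
    for _ in range(m):
        new_row = [1]                # D(i, 0) = 1
        for j in range(1, n + 1):
            new_row.append(row[j] + new_row[j - 1] + row[j - 1])
        row = new_row
    return row[n]


if __name__ == "__main__":
    for length in (2, 3, 10, 100):
        print(length, count_pairwise_alignments(length, length))
```

Under these assumptions two sequences of length 3 already have 63 alignments, and the count grows exponentially with sequence length; adding more sequences makes it grow faster still.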

Tools for multiple sequence alignment
Questions in the development of alignment programs:
(1) What is a good alignment? → objective function ("score")
(2) How to find a good alignment? → optimization algorithm
The first question is far more important!

Tools for multiple sequence alignment
Before defining an objective function (scoring scheme): what is a biologically good alignment?

Tools for multiple sequence alignment
Criteria for alignment quality:
1. 3D structure: align residues at corresponding positions in the 3D structure of the protein!


Tools for multiple sequence alignment Species related by common history

Tools for multiple sequence alignment Genes / proteins related by common history

Tools for multiple sequence alignment
Criteria for alignment quality:
1. 3D structure: align residues at corresponding positions in the 3D structure of the protein!
2. Evolution: align residues with common ancestors!

Tools for multiple sequence alignment
s1 - R Y I - M R E A Q Y E S A Q
s2 - R C I V M R E A - Y E
s3 - - Y I - M Q E V Q Q E R - -
s4 W R Y I A M R E - Q Y E
An alignment is a hypothesis about sequence evolution.
Mismatches correspond to substitutions.
Gaps correspond to insertions/deletions.

Tools for multiple sequence alignment
s1 - R Y I - M R E A Q Y E S A Q
s2 - R C I V M R E A - Y E
s3 - - Y I - M Q E V Q Q E R - -
s4 W R Y I A M R E - Q Y E
An alignment is a hypothesis about sequence evolution.
Search for the most plausible scenario!
Estimate probabilities for individual evolutionary events: insertions/deletions, substitutions.

Tools for multiple sequence alignment
s1 - R Y I - M R E A Q Y E S A Q
s2 - R C I V M R E A - Y E
s3 - Y - I - M Q E V Q Q E R - -
s4 W R Y I A M R E - Q Y E
An alignment is a hypothesis about sequence evolution.
Search for the most plausible scenario!
Estimate probabilities for individual evolutionary events: insertions/deletions, substitutions.

Tools for multiple sequence alignment
Compute a score s(a,b) for the degree of similarity between amino acids a and b, based on the probability $p_{a,b}$ of the substitution a → b (or b → a).
(Extremely simplified!)


Reason for different substitution probabilities $p_{a,b}$:
Amino acids have different physical and chemical properties.
Amino acids with similar properties are more likely to be substituted for each other.

Tools for multiple sequence alignment
Use a penalty for gaps introduced into the alignment.
Simplest approach: linear gap costs, i.e. a penalty proportional to gap length.
Non-linear gap penalties are more realistic: a long gap can be caused by a single insertion/deletion.
Most frequently used: affine linear gap penalties, which are more realistic than linear costs but still efficient to calculate!
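
The two gap-cost models named on this slide can be written down in a few lines. This is only a sketch: the parameter names and default values (`g`, `gap_open`, `gap_extend`) are illustrative assumptions, and the affine convention used here (opening cost for the first gap position, extension cost for each further one) is one of several in use.

```python
def linear_gap_penalty(length: int, g: float = 1.0) -> float:
    """Linear gap costs: the penalty is proportional to the gap length."""
    return g * length


def affine_gap_penalty(length: int, gap_open: float = 10.0,
                       gap_extend: float = 0.5) -> float:
    """Affine gap costs (assumed convention): gap_open for the first gap
    position plus gap_extend for every additional position."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


if __name__ == "__main__":
    for length in (1, 2, 5, 20):
        print(length, linear_gap_penalty(length, g=10.0),
              affine_gap_penalty(length))
```

With an affine penalty, one long gap is much cheaper than many short gaps of the same total length, which matches the idea that a long gap can result from a single insertion/deletion event.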

Traditional objective functions:
Define the score of an alignment as the sum of individual similarity scores s(a,b) minus gap penalties.
Needleman-Wunsch scoring system for pairwise alignment (1970).

Pair-wise sequence alignment
T Y W I V
T - - L V
Example: $\text{Score} = s(T,T) + s(I,L) + s(V,V) - 2g$
Assumption: linear gap penalty with cost g per gap position!
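
The example score can be reproduced with a small function that walks over the alignment columns. The similarity values in `toy_scores` and the gap cost `g=3` are made-up numbers for illustration only; they are not taken from any real substitution matrix.

```python
def alignment_score(row_a: str, row_b: str, s: dict, g: float) -> float:
    """Sum of similarity scores minus a linear gap penalty g per gap position."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score -= g                      # gap column
        else:
            score += s[frozenset((a, b))]   # symmetric lookup: s(a,b) = s(b,a)
    return score


# Hypothetical similarity scores, keyed by unordered residue pair.
toy_scores = {
    frozenset("T"): 5,    # s(T,T)
    frozenset("V"): 4,    # s(V,V)
    frozenset("IL"): 2,   # s(I,L)
}

example = alignment_score("TYWIV", "T--LV", toy_scores, g=3)
print(example)  # s(T,T) + s(I,L) + s(V,V) - 2g = 5 + 2 + 4 - 6 = 5
```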

Pair-wise sequence alignment
T Y W I V
T - - L V
A dynamic-programming algorithm finds the alignment with the best score (Needleman and Wunsch, 1970).
Running time is proportional to the product of the sequence lengths: time complexity $O(l_1 \cdot l_2)$.
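
A minimal sketch of the dynamic-programming idea, assuming a linear gap penalty and a simple match/mismatch score in place of a real substitution matrix (the parameter values are illustrative). The two nested loops over the table make the $O(l_1 \cdot l_2)$ running time visible.

```python
def needleman_wunsch(x: str, y: str, match: int = 1, mismatch: int = -1,
                     gap: int = 1):
    """Global pairwise alignment with a linear gap penalty.

    Returns (best score, aligned x, aligned y). When several alignments
    share the best score, one of them is returned.
    """
    n, m = len(x), len(y)
    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -gap * i
    for j in range(1, m + 1):
        F[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,   # align x[i-1] with y[j-1]
                          F[i - 1][j] - gap,       # gap in y
                          F[i][j - 1] - gap)       # gap in x

    # Trace back from the bottom-right corner to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + sub:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - gap:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    score, row_x, row_y = needleman_wunsch("TYWIV", "TLV")
    print(score)
    print(row_x)
    print(row_y)
```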

Pair-wise sequence alignment
The algorithm for pairwise alignment can be generalized to multiple alignment of N sequences.
Time complexity: $O(l_1 \cdot l_2 \cdots l_N)$
Not feasible in practice (running time too long!)
A heuristic is necessary, i.e. a fast algorithm that does not necessarily produce the mathematically best alignment.

"Progressive" Alignment
Most popular approach to (global) multiple sequence alignment: progressive alignment.
Since the mid-eighties: Feng/Doolittle, Higgins/Sharp, Taylor, …

"Progressive" Alignment
WCEAQTKNGQGWVPSNYITPVN
WWRLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVPKAKIIRD
YAVESEAHPGSFQPVAALERIN
WLNYNETTGERGDFPGTYVEYIGRKKISP
Guide tree

"Progressive" Alignment
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASFQPVAALERIN
WLNYNEERGDFPGTYVEYIGRKKISP
Profile alignment, "once a gap - always a gap"

"Progressive" Alignment
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASVQ--PVAALERIN
WLN-YNEERGDFPGTYVEYIGRKKISP
Profile alignment, "once a gap - always a gap"

"Progressive" Alignment
WCEAQTKNGQGWVPSNYITPVN-
WW--RLNDKEGYVPRNLLGLYP-
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASVQ--PVAALERIN
WLN-YNEERGDFPGTYVEYIGRKKISP
Profile alignment, "once a gap - always a gap"

"Progressive" Alignment
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVP--KAKIIRD
YAVESEA---SVQ--PVAALERIN
WLN-YNE---ERGDFPGTYVEYIGRKKISP
Profile alignment, "once a gap - always a gap"

"Progressive" Alignment
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVP--KAKIIRD
YAVESEA---SVQ--PVAALERIN
WLN-YNE---ERGDFPGTYVEYIGRKKISP
Most important implementation: CLUSTAL W

"Progressive" Alignment
CLUSTAL W; Thompson et al., 1994 (~ citations)
Pairwise distances are calculated as 1 - percentage of identity.
An un-rooted tree is calculated with Neighbor Joining.
The root is defined as the central position in the tree.
Sequence weights are defined based on the tree.
Gap penalties are calculated based on various parameters.
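
The first step, a distance of 1 minus the fraction of identical positions, can be sketched as follows. This is not CLUSTAL W's actual code: it assumes the sequences are given as already aligned rows, counts identity only over columns where both rows have a residue, and uses function names invented for this example.

```python
def identity_distance(row_a: str, row_b: str) -> float:
    """1 - (identical positions / aligned positions), ignoring gap columns."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not pairs:
        return 1.0                      # nothing aligned: maximal distance
    identical = sum(1 for a, b in pairs if a == b)
    return 1.0 - identical / len(pairs)


def distance_matrix(rows):
    """Symmetric matrix of pairwise identity distances for aligned rows."""
    n = len(rows)
    return [[identity_distance(rows[i], rows[j]) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    for line in distance_matrix(["ACGT", "ACGA", "TTTT"]):
        print(line)
```

A matrix of this kind is the input that a distance-based tree method such as Neighbor Joining would then use to build the guide tree.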

Tools for multiple sequence alignment
Problems with the traditional approach:
Results depend on the gap penalty.
A heuristic guide tree determines the alignment; the alignment is then used for phylogeny reconstruction.
The algorithm produces global alignments.

Tools for multiple sequence alignment
Problems with the traditional approach:
But: many sequence families share only local similarity, e.g. the sequences share one conserved motif.

The DIALIGN approach
Morgenstern, Dress, Werner (1996), PNAS 93,
Combination of global and local methods.
Assemble the multiple alignment from gap-free local pair-wise alignments ("fragments").

The DIALIGN approach
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa

The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa

The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacccctgaattgaataa

The DIALIGN approach
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacc cctgaattgaataa

The DIALIGN approach
atc------taatagttaaactcccccgtgc-ttag
cagtgcgtgtattactaac gg-ttcaatcgcg
caaa--gagtatcacc cctgaattgaataa
Consistency!

The DIALIGN approach
atc------TAATAGTTAaactccccCGTGC-TTag
cagtgcGTGTATTACTAAc GG-TTCAATcgcg
caaa--GAGTATCAcc CCTGaaTTGAATaa
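
A gap-free local pairwise alignment ("fragment") is a pair of equal-length segments on one diagonal of the comparison matrix. The sketch below finds the single best-scoring fragment between two sequences. It uses a plain match/mismatch score as a stand-in for DIALIGN's own fragment weighting, which is not described on these slides, and it does not cover the later step of assembling a consistent set of fragments; the function name and parameters are assumptions.

```python
def best_fragment(x: str, y: str, match: int = 1, mismatch: int = -1):
    """Return (score, start in x, start in y, length) of the best gap-free
    segment pair, scanning every diagonal with a maximum-subarray pass."""
    best = (0, 0, 0, 0)
    for d in range(-(len(y) - 1), len(x)):
        # Diagonal d pairs x[i] with y[i - d].
        i, j = (d, 0) if d >= 0 else (0, -d)
        run_score, run_start = 0, (i, j)
        while i < len(x) and j < len(y):
            if run_score <= 0:
                # A non-positive prefix never helps: start a new fragment here.
                run_score, run_start = 0, (i, j)
            run_score += match if x[i] == y[j] else mismatch
            if run_score > best[0]:
                best = (run_score, run_start[0], run_start[1],
                        i - run_start[0] + 1)
            i += 1
            j += 1
    return best


if __name__ == "__main__":
    score, start_x, start_y, length = best_fragment("xxGATTACAyy", "zGATTACAz")
    print(score, "xxGATTACAyy"[start_x:start_x + length])
```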

More methods for multiple alignment: T-Coffee, PIMA, Muscle, Prrp, Mafft, ProbCons

Substitution matrices
Similarity score s(a,b) for amino acids a and b, based on the probability $p_{a,b}$ of the substitution a → b.
Idea: it is more reasonable to align amino acids that are often replaced by each other!

Substitution matrices
Assumptions:
$p_{a,b}$ does not depend on the sequence position.
Sequence positions are independent of each other.
$p_{a,b} = p_{b,a}$ (symmetry!)

Substitution matrices
Compute a score s(a,b) for the degree of similarity between amino acids a and b:
Probability $p_{a,b}$ of the substitution a → b (or b → a), frequency $q_a$ of a.
Define $s(a,b) = \log \frac{p_{a,b}}{q_a q_b}$
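
The definition translates directly into code. The probabilities used in the example calls are hypothetical, and the natural logarithm is an assumption, since the slide does not state the base.

```python
import math


def log_odds_score(p_ab: float, q_a: float, q_b: float) -> float:
    """s(a,b) = log(p_ab / (q_a * q_b)).

    Positive if a and b are paired more often than expected by chance
    (q_a * q_b), negative if they are paired less often.
    """
    return math.log(p_ab / (q_a * q_b))


if __name__ == "__main__":
    # Hypothetical values: both residues have background frequency 0.1.
    print(log_odds_score(0.02, 0.1, 0.1))    # paired twice as often as chance
    print(log_odds_score(0.005, 0.1, 0.1))   # paired half as often as chance
```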


Substitution matrices
To calculate $p_{a,b}$: consider alignments of related proteins and count substitutions a → b (or b → a).
ESWTS-RQWERYTIALMSDQRREVLYWIALY
ERWTSERQWERYTLALMS-QRREALYWIALY
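
Counting on the example alignment can be sketched like this. Pairs are counted without direction, in line with the symmetry assumption $p_{a,b} = p_{b,a}$, and columns containing a gap are skipped; the function name is chosen for this example.

```python
from collections import Counter


def count_aligned_pairs(row_a: str, row_b: str) -> Counter:
    """Count unordered residue pairs over all gap-free alignment columns."""
    counts = Counter()
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue                          # gaps are not substitutions
        counts[tuple(sorted((a, b)))] += 1    # (a,b) and (b,a) count together
    return counts


row_a = "ESWTS-RQWERYTIALMSDQRREVLYWIALY"
row_b = "ERWTSERQWERYTLALMS-QRREALYWIALY"

pair_counts = count_aligned_pairs(row_a, row_b)
substitutions = {pair: n for pair, n in pair_counts.items()
                 if pair[0] != pair[1]}
print(substitutions)
```

Dividing such counts by the total number of aligned pairs gives a first, very rough estimate of $p_{a,b}$; a single short alignment is of course far too little data for a real matrix.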

Substitution matrices
Problems involved:
1. The probability $p_{a,b}$ depends on the time t since the sequences separated in evolution: $p_{a,b} = p_{a,b}(t)$
2. Protein families contain multiple sequences: the phylogenetic tree must be known!
3. The alignment of the protein families must be known!
4. Multiple mutations can occur at one sequence position.

Substitution matrices
PAM matrices: M. Dayhoff et al., Atlas of Protein Sequence and Structure, 1978

Substitution matrices
Calculation of $p_{a,b}(t)$:
Consider multiple alignments of closely related protein families.
Count occurrences of a and b at corresponding positions in the alignments, using the phylogenetic tree.
Estimate $p_{a,b}(t)$ for small times t.
Calculate conditional probabilities p(a|b,t) for small t.
Normalize to distance 1 PAM (point accepted mutation; 1 PAM = one accepted mutation per 100 residues, i.e. 1 percent).
Calculate p(a|b,t) for larger evolutionary distances by matrix multiplication.
Calculate $p_{a,b}(t)$ for larger evolutionary distances.
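
The matrix-multiplication step can be illustrated with a toy example. The 1-PAM matrix `M1` below is invented for a two-letter alphabet (a real matrix has 20 x 20 entries estimated from data); it only shows how conditional probabilities for a larger distance follow from repeated multiplication.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(M, n):
    """M multiplied by itself n times (identity matrix for n = 0)."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Hypothetical 1-PAM matrix for a toy two-letter alphabet:
# M1[i][j] = probability that residue i is replaced by residue j
# within an evolutionary distance of 1 PAM. Each row sums to 1.
M1 = [[0.99, 0.01],
      [0.01, 0.99]]

M250 = mat_power(M1, 250)   # conditional probabilities at distance 250 PAM
print(M250)
```

Multiplying by the residue frequency then turns a conditional probability back into a joint probability $p_{a,b}(t)$, which is the last step listed on the slide.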


Alternative: BLOSUM matrices
S. Henikoff and J.G. Henikoff, PNAS, 1992
Basis: BLOCKS database, gap-free regions of multiple alignments.
Sequences are clustered if their percentage of identity is at least L.
Estimate $p_{a,b}(t)$ directly.
Default values: L = 62 (BLOSUM62), L = 50 (BLOSUM50)
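
The clustering step can be sketched as follows, assuming gap-free, non-empty block segments of equal length and single-linkage clustering (a segment joins a cluster if it reaches the threshold with at least one member). The function names are illustrative and this is not the Henikoffs' original code.

```python
def percent_identity(seg_a: str, seg_b: str) -> float:
    """Percentage of identical positions between two equal-length segments."""
    matches = sum(1 for x, y in zip(seg_a, seg_b) if x == y)
    return 100.0 * matches / len(seg_a)


def cluster_segments(segments, L: float = 62.0):
    """Group segments so that any two segments with identity >= L
    end up in the same cluster (single linkage)."""
    clusters = []
    for seg in segments:
        linked = [c for c in clusters
                  if any(percent_identity(seg, member) >= L for member in c)]
        merged = [seg]
        for c in linked:
            merged.extend(c)          # the new segment joins all linked clusters
        clusters = [c for c in clusters
                    if not any(c is old for old in linked)]
        clusters.append(merged)
    return clusters


if __name__ == "__main__":
    block = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC"]
    print(cluster_segments(block, L=62.0))
```

In the BLOSUM construction each cluster is then weighted like a single sequence when pairs are counted, so that many near-identical sequences do not dominate the estimate; that weighting is not shown here.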