Multiple Sequence Alignment

Sequence Families
Most sequences are members of large families, some with the same function and others with different functions.
– Members of different families are thought not to have a common ancestor.
The basic idea of multiple alignment is to line up a group of related proteins so that the amino acids in each column:
– Have the same position in the 3-dimensional structure
– Are derived from the same amino acid in their common ancestor
– Have the same function in the protein
These three goals are not completely compatible: structures change over evolutionary time, mutations compensate for each other, etc.
– An interesting hypothesis: there are only a few thousand different types of protein fold in existence, and all proteins are made up of one or more of these folds. X-ray crystallography seems to be confirming this idea, but it is far from being widely accepted.
Another point of view: a multiple alignment is an attempt to reproduce the sequence of evolutionary events that occurred in moving from the common ancestor to all of the present-day sequences being aligned.

Multiple Alignments
Most multiple alignments are global, not local.
– This means that the proteins being aligned must all have the same basic domain structure.
Aligning 2 sequences is easy and fairly precise (given a substitution matrix and a set of gap penalties).
– Most multiple alignment programs start with pairwise alignments.
But multiple alignment is still an active area of research, and the best alignments are still refined by hand.
– There are many multiple alignment programs available, and new ones are always coming out, not to mention algorithms that have not yet been developed into user-friendly web-based programs.
– Many heuristics and ad hoc decisions improve results without any underlying theory.
Closely related sequences can be aligned less ambiguously than distant sequences.
– Thus, most multiple alignment schemes start by joining the most similar sequences.
– This is a case of the general rule “do the easy ones first”.

Scoring a Multiple Alignment
How do you determine which alignment is best?
“Looks pretty good to me” (i.e. hand refinement) may work when experts in that specific protein are involved, but in general it is an invitation to introduce prejudices.
– There are several sets of test case data, with alignments refined by hand using 3-D structures. For example, BAliBASE.
“Star tree” scoring: score all other sequences relative to a single common (canonical) one.
– This is effectively what happens when one sequence is used as a BLAST or PSI-BLAST query and all other sequences are aligned to it.
– Not a good idea: the original query is not the center of evolutionary events.
We are trying to re-create evolutionary events, so the ideal method would be a phylogenetic scoring method that takes evolutionary descent into account.
– Detect and count each evolutionary event that distinguishes between sequences.
– Phylogeny is often inferred during the course of a multiple alignment, but it is not independent of the alignment itself, so it can’t legitimately be used for scoring.
– The best phylogenetic methods are quite slow.

More on Scoring
Most common scoring = “sum of pairs”. Once the multiple alignment is made, use a substitution matrix to score all possible pairs of sequences, then add up the total.
– Sum of pairs ignores the principle that closely related sequences are easier to align properly than distant ones.
– Sum of pairs also overcounts more distant evolutionary events. For example, the first divergence from the common ancestor is counted as if it occurred multiple independent times, not just once.
– Some alignment methods use weighting schemes to overcome this objection.
A more recent method (slower, unfortunately) is “consistency-based scoring”.
– The basic concept is that the final multiple alignment should match as many pairwise alignments as possible.
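As a rough illustration of sum-of-pairs scoring, the sketch below scores every pair of rows column by column and adds up the total. The match, mismatch, and gap values are toy assumptions standing in for a real substitution matrix such as BLOSUM62 and a real gap penalty scheme; the function names are made up for this example.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one aligned column for a pair of rows (toy scheme, not BLOSUM62)."""
    if a == "-" and b == "-":
        return 0  # a gap opposite a gap says nothing about this pair
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment, score=pair_score):
    """Add up the column scores over every pair of rows in the alignment."""
    total = 0
    for row_a, row_b in combinations(alignment, 2):
        total += sum(score(a, b) for a, b in zip(row_a, row_b))
    return total


# Hypothetical three-row alignment used only to exercise the function.
msa = ["THE-CAT", "THEFCAT", "THE-CAT"]
print(sum_of_pairs(msa))
```

Every pair of rows contributes equally here, which is exactly the overcounting objection raised above: nothing in the sum knows which rows are close relatives.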

Progressive Alignment
The first thought: why not just extend Smith-Waterman to multiple sequences, i.e. simultaneous alignment?
– Huge multi-dimensional matrices and exponential scaling (O(2^N), where N = number of sequences) make it impractical except for small numbers of sequences.
Progressive alignment: start by aligning the most similar pair, then add other sequences one by one.
– Alignments are most likely to be correct if the sequences are closely related.
The big problem: the order in which the sequences are added affects the final alignment significantly.
– Once 2 sequences are aligned, that alignment can’t be changed, so early mistakes in alignment are propagated. This property is called “greediness”.
– Modern alignment schemes try to re-test and re-align later additions. This is called “iterative alignment”, meaning that the multiple alignment goes through several cycles (iterations) of retesting and re-aligning.
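To see why simultaneous alignment is impractical, it helps to count the work in a full dynamic-programming matrix. The sketch below assumes N sequences that all have the same length L (a simplification): the matrix then has (L + 1)^N cells, and each cell has 2^N − 1 possible predecessor cells, one for each non-empty choice of which sequences contribute a residue rather than a gap.

```python
def simultaneous_dp_cost(num_seqs, length):
    """Return (cells, predecessors_per_cell) for a full N-dimensional DP matrix.

    Assumes every sequence has the same length, purely for illustration.
    """
    cells = (length + 1) ** num_seqs
    predecessors = 2 ** num_seqs - 1
    return cells, predecessors


for n in (2, 3, 5, 10):
    cells, preds = simultaneous_dp_cost(n, 100)
    print(f"N={n}: {cells:.3e} cells, {preds} predecessors per cell")
```

With two sequences of length 100 the matrix has about ten thousand cells; with ten sequences it has on the order of 10^20, which is why progressive methods are used instead.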

ClustalW
ClustalW is the most popular multiple alignment program used today.
– Despite the fact that no improvements have been made since 1994
– Despite the fact that it suffers horribly from the order-of-addition problem: it is a progressive method, not iterative.
On the web:
Basic algorithm:
– Do all possible pairwise alignments and score them.
– Using cluster analysis, create a relationship tree (a guide tree) based on the similarities between pairs.
– Combine alignments by introducing gaps. Some heuristics attempt to concentrate gaps into the same alignment columns.

Cluster Analysis
How to create the guide tree:
– Based on a matrix of distances between aligned pairs of sequences
– Several ways of converting this to a tree. We will look at a simple method called UPGMA.
– There are better, more accurate ways of creating a phylogenetic tree, but UPGMA is fast and so fits the needs of a multiple alignment program.

Distance Matrix
To create a UPGMA tree, we start with a distance matrix.
Start by scoring all pairs of aligned sequences with BLOSUM62 (for example).
Convert the scores to similarities, on a 0-1 scale.
Distances are just 1 – similarity (also a 0-1 scale).
The main diagonal is the distance between a sequence and itself, which is always 0.
The matrix is symmetrical: the distance between sequence A and sequence B is the same as between B and A.
[Table: distance matrix for sequences A, B, C, D, E; values not preserved in the transcript]
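A minimal sketch of the similarity-to-distance step described on this slide. It assumes the scores have already been converted to similarities on a 0-1 scale (the slide does not say how), and the similarity values below are invented for illustration.

```python
def distance_matrix(similarity):
    """Convert a symmetric 0-1 similarity matrix into distances (1 - similarity)."""
    n = len(similarity)
    for i in range(n):
        for j in range(n):
            if similarity[i][j] != similarity[j][i]:
                raise ValueError("similarity matrix must be symmetric")
    # The diagonal is forced to 0: a sequence is at distance 0 from itself.
    return [
        [0.0 if i == j else 1.0 - similarity[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical similarities for three sequences.
sim = [
    [1.0, 0.9, 0.6],
    [0.9, 1.0, 0.7],
    [0.6, 0.7, 1.0],
]
for row in distance_matrix(sim):
    print([round(x, 2) for x in row])
```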

UPGMA = Unweighted Pair Group Method with Arithmetic Mean.
– UPGMA is a simple and intuitive clustering method.
– It produces a rooted tree (dendrogram).
Algorithm:
– Start by finding the closest pair of sequences: A and B are 0.10 apart.
– Join them. The branch lengths are the distances.
– Combine their columns by averaging distances to all other sequences.
– Repeat until all sequences have been joined into a single tree.
To start: A and B are the closest: 0.10 apart.
[Tables: original distance matrix (A, B, C, D, E) and revised matrix after joining A and B (A+B, C, D, E); values not preserved in the transcript]

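More UPGMA
Next, D and E are the closest on the revised distance matrix.
The branch length is proportional to the distance: 0.1 for A-B and 0.2 for D-E.
[Tables: distance matrix (A+B, C, D, E) and revised matrix after joining D and E (A+B, C, D+E); values not preserved in the transcript]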

More UPGMA
Next, C is closest to A+B (0.22).
Note that to get the distances from A+B+C to D+E, we are using twice as much contribution from A+B as we are from C, because A+B represents 2 sequences.
[Table: distance matrix (A+B, C, D+E); values not preserved in the transcript]

        A+B+C   D+E
A+B+C   0       0.36
D+E     0.36    0

End of UPGMA
The last join is A+B+C with D+E.
When ClustalW uses this guide tree for alignments, it joins A and B, then separately joins D and E, then adds C to A+B, and finally joins the two groups A+B+C and D+E.
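The UPGMA steps above can be sketched as follows. The original distance matrix did not survive in the transcript, so the values below are hypothetical: they were chosen only so that the joins happen in the same order and at the same distances quoted in the slides (0.10, 0.20, 0.22, 0.36). The function name and the cluster labels are likewise made up for this example.

```python
def upgma(names, pair_dist):
    """Return the joins (cluster_a, cluster_b, distance) in the order UPGMA makes them.

    pair_dist maps a 2-tuple of names to a distance; every pair appears once.
    """
    size = {name: 1 for name in names}  # number of sequences in each cluster
    d = {frozenset(pair): value for pair, value in pair_dist.items()}
    joins = []
    while len(size) > 1:
        closest = min(d, key=d.get)  # pair of clusters with the smallest distance
        a, b = sorted(closest)
        merged = f"({a},{b})"
        joins.append((a, b, d.pop(closest)))
        for other in size:
            if other in (a, b):
                continue
            d_a = d.pop(frozenset((a, other)))
            d_b = d.pop(frozenset((b, other)))
            # Average weighted by cluster size, so every sequence counts equally.
            d[frozenset((merged, other))] = (
                size[a] * d_a + size[b] * d_b
            ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return joins


# Hypothetical distances, NOT the values from the original slides.
example = {
    ("A", "B"): 0.10, ("A", "C"): 0.22, ("A", "D"): 0.38, ("A", "E"): 0.38,
    ("B", "C"): 0.22, ("B", "D"): 0.38, ("B", "E"): 0.38,
    ("C", "D"): 0.32, ("C", "E"): 0.32, ("D", "E"): 0.20,
}
for a, b, dist in upgma("ABCDE", example):
    print(f"join {a} and {b} at distance {dist:.2f}")
```

In the last step the distance from A+B+C to D+E is (2 × 0.38 + 1 × 0.32) / 3 = 0.36: the A+B cluster contributes twice as much as C because it holds two sequences.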

MUSCLE
“MUSCLE” is probably an acronym for something, but I am not a fan of cutesy acronyms.
It gets high marks in recent surveys of multiple alignment methods for being fast, accurate, and capable of dealing with a wide range of sequences.
MUSCLE as a web-based service at EBI (European Bioinformatics Institute):
MUSCLE is an iterative method:
– It starts out doing a progressive alignment of sequences using a guide tree, starting with the most similar.
– Then, it attempts to test and re-align various sequences, trying to avoid freezing in early bad decisions.
This is an instance of a more general statistical technique called “jackknifing”. Jackknifing involves repeating an analysis many times on a subset of the data: leaving out random halves of the data, or leaving out single random observations, etc. It is used (along with bootstrapping) to estimate the robustness of one’s conclusions.
It also illustrates the general principle of “annealing”: slowly converging on a solution by allowing changes but gradually raising the score needed to accept a change.

MUSCLE Algorithm
After some preliminary work, MUSCLE estimates a tree using UPGMA and the Kimura method of estimating the evolutionary distance between sequences (which we will deal with later in the semester).
The iterative procedure:
– Cut the tree into 2 subtrees at random locations.
– Re-align each subtree.
– Align the subtrees together.
– Score the new alignment with sum-of-pairs. If the new score is better than the old one, discard the old tree and repeat the cutting and realignment procedure with the new tree. If the new score is worse than the old one, discard it and repeat the procedure with the old tree.
– Repeat until no further changes appear (convergence) or a user-set limit is reached.
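The accept-if-better loop in the iterative procedure can be sketched as below. This is not MUSCLE's actual code: the proposal step here is a toy stand-in (it slides one gap past a neighbouring residue) for the real "cut the tree, re-align the subtrees, merge" step, and the scoring is a simplified sum-of-pairs with assumed +1/−1 values. All function names and the `patience` parameter are hypothetical.

```python
import random
from itertools import combinations


def sp_score(rows):
    """Toy sum-of-pairs: +1 for identical residues, -1 otherwise, 0 for gap-gap."""
    total = 0
    for r1, r2 in combinations(rows, 2):
        for a, b in zip(r1, r2):
            if a == "-" and b == "-":
                continue
            total += 1 if a == b else -1
    return total


def propose(rows, rng):
    """Toy proposal: swap one gap with a neighbouring character in one row."""
    rows = list(rows)
    i = rng.randrange(len(rows))
    gaps = [k for k, ch in enumerate(rows[i]) if ch == "-"]
    if not gaps:
        return rows
    g = rng.choice(gaps)
    j = g + rng.choice((-1, 1))
    if 0 <= j < len(rows[i]):
        chars = list(rows[i])
        chars[g], chars[j] = chars[j], chars[g]
        rows[i] = "".join(chars)
    return rows


def refine(rows, max_iters=500, patience=100, seed=0):
    """Keep a proposal only if it scores better; stop at convergence or the limit."""
    rng = random.Random(seed)
    best, best_score = list(rows), sp_score(rows)
    stale = 0  # consecutive proposals that did not improve the score
    for _ in range(max_iters):
        candidate = propose(best, rng)
        score = sp_score(candidate)
        if score > best_score:
            best, best_score, stale = candidate, score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best, best_score


start = ["THEFASTCAT", "T-HEFATCAT"]  # hypothetical, deliberately poor alignment
final, final_score = refine(start)
print(sp_score(start), "->", final_score)
print("\n".join(final))
```

Because a worse candidate is always discarded, the score can never go down, but the loop can also get stuck in a local optimum, which is one reason real programs use more disruptive moves such as re-aligning whole subtrees.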

T-COFFEE
T-Coffee is also a cute acronym for something.
For our purposes, it illustrates a scoring method more refined than sum-of-pairs.
– It is probably more accurate than other methods, but considerably slower.
Web:
Consistency-based scoring. Algorithm:
1. Create a library of all pairwise alignments.
– You can generate them externally if you like, for instance using alignments of 3-dimensional structures.
– You can use several alignments for the same pair of sequences, even if they are not consistent.
– The information is stored residue-by-residue: it is a list of entries like “amino acid 23 in sequence X aligns with amino acid 27 in sequence Y”.
2. Weight alignments by the percentage of identical residues. If there is more than one alignment for a given pair of sequences, create a weighted average of them.

Library Extension
3. The next step, library extension, involves examining all triplets of sequences.
– Residues that pair up in all three have their weights increased by the lesser of the two pair scores.
– Residues that match differently in the two pairs being joined to form a triplet do not have their weights increased.
– Thus, residues that consistently match up end up with very high weights. This is what is meant by consistency-based scoring.
4. After this, the weighted set of aligned residues is subjected to a progressive alignment equivalent to ClustalW.
– Once a gap is created, it remains in all further alignments.
– Separate gap penalties aren’t needed: they are already incorporated into the library extension process.
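A minimal sketch of the library extension step, under an assumed data layout: the primary library maps a pair of sequence names to a dictionary of aligned residue positions and their weights. The layout, the function name, and the tiny library below are all invented for illustration (the weights 88, 77, and 100 simply echo the percentages used in the example that follows); this is not T-Coffee's own data structure.

```python
from collections import defaultdict


def extend_library(primary):
    """Add, for every third sequence z, min(W(x_i, z_k), W(z_k, y_j)) to W(x_i, y_j).

    primary: {(seq_a, seq_b): {(i, j): weight}}, meaning residue i of seq_a
    is aligned with residue j of seq_b.
    """
    # Symmetric lookup: w[a][b][i] -> {j: weight}
    w = defaultdict(lambda: defaultdict(dict))
    for (a, b), pairs in primary.items():
        for (i, j), weight in pairs.items():
            w[a][b].setdefault(i, {})[j] = weight
            w[b][a].setdefault(j, {})[i] = weight

    names = list(w)
    extended = {}
    for (x, y), pairs in primary.items():
        new = dict(pairs)
        for z in names:
            if z in (x, y):
                continue
            for i, via in w[x][z].items():          # x_i is aligned with z_k
                for k, w_xz in via.items():
                    for j, w_zy in w[z][y].get(k, {}).items():  # z_k with y_j
                        new[(i, j)] = new.get((i, j), 0) + min(w_xz, w_zy)
        extended[(x, y)] = new
    return extended


# Hypothetical primary library for three sequences x, v, y.
primary = {
    ("x", "y"): {(0, 0): 88, (1, 1): 88},
    ("x", "v"): {(0, 0): 77, (1, 1): 77},
    ("v", "y"): {(0, 0): 100, (1, 2): 100},
}
print(extend_library(primary)[("x", "y")])
```

Residue 0 of x pairs with residue 0 of y both directly and through v, so its weight rises from 88 to 88 + min(77, 100) = 165. Residue 1 of x pairs with y residue 1 directly but with y residue 2 through v, so neither match is reinforced: the direct one stays at 88 and the indirect one only gets 77.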

Library Extension Example
The scores are pairwise: the percentage of identical amino acids for all positions with an amino acid in both sequences.
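The pairwise score described here can be sketched as a small function: count identical residues, but only over columns where both sequences have a residue. The function name is made up for this example, and returning 0.0 when no such columns exist is an assumption of this sketch.

```python
def percent_identity(row_a, row_b):
    """Percentage of identical residues over columns with a residue in both rows."""
    both = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not both:
        return 0.0  # assumption: no shared columns means no measurable identity
    same = sum(a == b for a, b in both)
    return 100.0 * same / len(both)


# Rows v and y as they appear in the transcript: VERY sits opposite a gap,
# so those four columns are ignored and the rest are all identical.
print(percent_identity("THEVERYFASTCAT", "THE----FASTCAT"))

# A toy pair with one mismatch in four shared columns.
print(percent_identity("ACGT-", "ACCTA"))
```

Ignoring gapped columns is why two sequences can score 100% while being mostly gaps against each other, as with sequence z in the example.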

Example
Pairwise scores. Every aligned pair of amino acids gets an initial score equal to the percentage of identity between the aligned sequences.
– Sequences x and y aligned as a pair give 88% identity.
– Sequences x and v aligned as a pair give 77% identity (because LAST and VERY are aligned but not identical).
– Sequences v and y are 100% identical, because VERY in sequence v is a gap in sequence y.
– Sequences x and z, and y and z, are 100% identical, although mostly gaps.
Sequences x and y are then aligned through all other sequences: v and z in this case.
When the x-v-y triplet is assembled, every aligned pair in x and y gets its weight increased by the minimum of the two pair scores (77 here).
– This includes GARFIELD THE FAT CAT, but not VERY or the S in “FAST”, because they aren’t aligned.
Similarly, when x, z, and y are aligned together, all matching pairs in x and y are increased by 100 (the minimum of the x-z and y-z scores).
– This only increases the score for THE FAT CAT.
Note that LAST FAT in sequence x matches FAST CAT in y, but it matches VERY FAST in v, and ---- FAT in z. This results in two possible matches for these residues, and lower scores for both.
– These matches are thus less reliable than other parts of the x-y alignment.

Some Numbers
x-y score:
x  THELASTFA-TCAT
y  THEFASTCA-T---

New score:
x  THELASTFA-TCAT
v  THEVERYFASTCAT
y  THE----FASTCAT

Final score:
x  THELASTFA-TCAT
z  THE----FA-TCAT
y  THE----FASTCAT

Further... There are many more multiple alignment algorithms.