Multiple Sequence Alignment

Multiple Sequence Alignment H.C.Huang @ 2005/9/29 2005 Autumn / YM / Bioinformatics

Outline Pairwise alignment review Scoring matrix Substitution matrices Gap penalties Multiple sequence alignment

Reference D.W. Mount / Bioinformatics Ch.3 pp.94-112 Ch.5 pp.163-189 Slides from Prof. C.H.Chang UW / Genomic Informatics / W.S. Noble WFU / Bioinformatics / J. Burg

PA Review Scoring a pairwise alignment requires a substitution matrix and gap penalties. Dynamic programming is an efficient algorithm for finding the optimal alignment. Entry (i,j) in the DP matrix stores the score of the best-scoring alignment up to those positions. DP iteratively fills in the matrix using a simple mathematical rule.

PA Review Local alignment finds the best match between subsequences. Smith-Waterman local alignment algorithm: No score is negative. Trace back from the largest score in the matrix. Global alignment algorithm: Needleman-Wunsch. Local alignment algorithm: Smith-Waterman.

Dynamic Programming A method for solving recursive problems: Break a problem into smaller subproblems. Solve the subproblems optimally, recursively. Use these optimal solutions to construct an optimal solution for the original problem.

Global alignment DP Align sequence x and y. F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.
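The recurrence for this slide is not present in the transcript. A standard Needleman-Wunsch form consistent with the definitions above is F(i,j) = max{ F(i-1,j-1) + s(x_i, y_j), F(i-1,j) + d, F(i,j-1) + d }, with F(i,0) = i*d and F(0,j) = j*d. The sketch below assumes d is a negative score added per gap position (as in the later d = -5 example); the function names and the toy scoring function are illustrative, not taken from the slides.

```python
def global_alignment_score(x, y, s, d):
    """Needleman-Wunsch score of x vs. y with a linear gap score d (d <= 0)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, n + 1):
        F[i][0] = i * d
    for j in range(1, m + 1):
        F[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # align x_i with y_j
                F[i - 1][j] + d,                          # x_i against a gap
                F[i][j - 1] + d,                          # y_j against a gap
            )
    return F[n][m]


def toy_score(a, b):
    """Hypothetical scoring function: +2 for a match, -7 for a mismatch."""
    return 2 if a == b else -7


print(global_alignment_score("AAG", "AG", toy_score, -5))
```

Only the final score is returned here; recovering the alignment itself would require a traceback through F.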

Local alignment DP Align sequence x and y. F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.

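Local alignment Find the optimal local alignment of AAG and GAAGGC. Use a gap penalty of d=-5. [Figure: substitution matrix over A, C, G, T (visible entries: 2, -7, -5) and the partially filled DP matrix (visible entries: 2, 4, 6, 1).]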
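The exercise can be worked through with a short Smith-Waterman sketch. The full substitution matrix is not recoverable from the transcript, so the code assumes match = 2, transition = -5, transversion = -7; that assignment is a guess based on the surviving values, and the function names are illustrative.

```python
def transition_transversion_score(a, b):
    """Assumed matrix: match 2, transition -5, transversion -7."""
    if a == b:
        return 2
    purines = {"A", "G"}
    # A transition stays within purines (A, G) or within pyrimidines (C, T).
    same_class = (a in purines) == (b in purines)
    return -5 if same_class else -7


def local_alignment(x, y, s, d):
    """Smith-Waterman with linear gap score d; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                0,  # no score is negative: a local alignment may restart here
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                F[i - 1][j] + d,
                F[i][j - 1] + d,
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the largest score until a zero cell is reached.
    i, j = best_pos
    ax, ay = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(local_alignment("AAG", "GAAGGC", transition_transversion_score, -5))
# -> (6, 'AAG', 'AAG')
```

Under these assumed scores the best local alignment is the exact AAG match inside GAAGGC, with score 6.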

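Substitution matrices Find the optimal local alignment of AAG and GAAGGC. Use a gap penalty of d=-5. [Figure: substitution matrix over A, C, G, T (visible entries: 2, -7, -5) and the partially filled DP matrix (visible entries: 2, 4, 6, 1).] Where did this substitution matrix come from?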

Substitution Matrix (scoring matrix)

Why Sequence Alignment? To find sequence similarity

Origin of Sequence Similarity Evolution: similar sequences come from the same ancestral sequence, altered by mutations.

Substitution Matrices for Scoring Functions Also called “symbol comparison tables”. Used for scoring matches of amino acids or nucleic acids. Residues label the rows and columns of the matrix; scores for aligning them are given in the matrix. Can be used in the dynamic programming method of pair-wise sequence alignment.

Sub. Matrix: Basic idea Probability of substitution (mutation)

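Nucleic acid PAM matrices PAM = point accepted mutation. 1 PAM = 1% probability of mutation at each sequence position. A uniform PAM1 matrix:
     A        G        T        C
A    0.99     0.00333  0.00333  0.00333
G    0.00333  0.99     0.00333  0.00333
T    0.00333  0.00333  0.99     0.00333
C    0.00333  0.00333  0.00333  0.99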

Transitions and transversions Transitions (A ↔ G or C ↔ T) are more likely than transversions (A ↔ T or G ↔ C). Assume that transitions are three times as likely:
     A      G      T      C
A    0.99   0.006  0.002  0.002
G    0.006  0.99   0.002  0.002
T    0.002  0.002  0.99   0.006
C    0.002  0.002  0.006  0.99

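Distant relatives For more distant relatives, say 2 PAM (roughly a 2% chance of substitution per position), multiply the PAM 1 matrix by itself. In general, a PAM N matrix is computed by raising PAM 1 to the Nth power. PAM 2:
     A         G         T         C
A    0.98014   0.011888  0.003984  0.003984
G    0.011888  0.98014   0.003984  0.003984
T    0.003984  0.003984  0.98014   0.011888
C    0.003984  0.003984  0.011888  0.98014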
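The PAM 2 values can be checked by multiplying the transition/transversion PAM 1 matrix by itself. This is a minimal pure-Python sketch; the helper names are illustrative, and the row/column order A, G, T, C follows the slide.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Raise the PAM 1 matrix to the n-th power (n >= 1)."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Order A, G, T, C: transitions (A<->G, T<->C) 0.006, transversions 0.002.
PAM1 = [
    [0.99, 0.006, 0.002, 0.002],
    [0.006, 0.99, 0.002, 0.002],
    [0.002, 0.002, 0.99, 0.006],
    [0.002, 0.002, 0.006, 0.99],
]

PAM2 = pam_n(PAM1, 2)
for row in PAM2:
    print([round(v, 6) for v in row])
```

Each row of the result still sums to 1, as expected for a matrix of transition probabilities.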

Most Commonly-Used Amino Acid Substitution Matrices PAM (Percent Accepted Mutation, also called Dayhoff Amino Acid Substitution Matrix) BLOSUM (Blocks Amino Acid Substitution Matrix)

PAM Matrices A family of matrices (PAM-N) Based upon an evolutionary model The score for a pairing of amino acids is based on how much we expect that pairing to be observed after a certain length of evolutionary time The scores are derived by a Markov model – i.e., the probability that one amino acid will change to another is not affected by changes that occurred at an earlier stage of evolutionary history

PAM-N Matrices N is a measure of evolutionary distance PAM-1 is modeled on an estimate of how long in evolutionary time it would take one amino acid out of 100 to change. That length of time is called 1 PAM unit, roughly 10 million years (abbreviated my). Values in a PAM-1 matrix show the probability that an amino acid will change over 10 my. To get the PAM-N matrix for any N, multiply PAM-(N-1) by PAM-1.

How did they get the values for PAM-1? Look at 71 groups of protein sequences where the proteins in each group are at least 85% similar. (Why these groups? We want to observe changes in closely-related proteins, where the changes are “accepted mutations”.) Compute the relative mutability of each amino acid: its probability of change. From relative mutability, compute the mutability probability for each amino acid pair X, Y: the probability that X will change to Y over a certain evolutionary time. Normalize the mutability probability for each pair to a value between 0 and 1.

Computing Relative Mutability – A Measure of the Likelihood that an Amino Acid Will Mutate For each amino acid: changes = number of times the amino acid changed into something else; exposure to mutation = (percentage occurrence of the amino acid in the group of sequences being analyzed) * (frequency of amino acid changes in the group, based on the phylogenetic tree); relative mutability = (changes / exposure to mutation) / 100.

Computing Mutability Probability Between Amino Acid Pairs For each pair of amino acids X and Y: r = relative mutability of X; c = number of times X becomes Y or vice versa; p = number of changes involving X; mutability probability of X to Y = (r * c) / p.

Computing Relative Mutability of A: changes = number of times A changes into something else = 4; % occurrence of A in group = 10 / 63 = 0.159; frequency of all amino acid changes in group = 6 * 2 = 12 (Note: Count changes backwards and forwards.); exposure to mutation = (% occurrence of A in group) * (frequency of all amino acid changes in group) = 0.159 * 12; relative mutability = (changes / exposure to mutation) / 100 = (4 / (12 * 0.159)) / 100 = 2.09 / 100 = 0.0209. The division by 100 scales the value to PAM-1, where we’re modeling 1 substitution per 100 residues. Example from Fundamental Concepts of Bioinformatics by Krane and Raymer.
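The arithmetic on this slide can be reproduced in a few lines. The function and parameter names below are illustrative. Using the unrounded occurrence 10/63 gives 0.0210; the slide's 0.0209 presumably reflects rounding of intermediate values.

```python
def relative_mutability(changes, occurrence, total_changes):
    """(changes / exposure to mutation) / 100, as defined on the slide.

    changes       -- times the amino acid changed into something else
    occurrence    -- fraction of residues in the group that are this amino acid
    total_changes -- all changes in the tree, counted in both directions
    """
    exposure = occurrence * total_changes
    return (changes / exposure) / 100


# Alanine in the Krane and Raymer example: 4 changes, 10 of 63 residues,
# 6 changes in the tree counted forwards and backwards (6 * 2 = 12).
r_A = relative_mutability(changes=4, occurrence=10 / 63, total_changes=6 * 2)
print(round(r_A, 4))
```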

How can we understand relative mutability intuitively? relative mutability = changes / exposure to mutation = the number of times A changed in proportion to the probability that it COULD have changed. Exposure to mutation: there were 6 times when something changed in the tree. Each time, that change could have been A changing to something else, or something else changing to A, giving 12 chances for a change involving A. But A appears in a sequence only 0.159 of the time.

Computing Mutability Probability that A will change to G: r = relative mutability of A = 0.0209; c = number of times A becomes G or vice versa = 3; p = number of changes involving A = 4; mutability probability of A to G = (r * c) / p = (0.0209 * 3) / 4 = 0.0156.

Normalizing Mutability Probability, X to Y For each Y among all amino acids, compute mutability probability of X to Y as described above Get a total of these 20 probabilities. Divide them by a normalizing factor such that the probability that X will NOT change is 99% and the sum of probabilities that it will change to any other amino acid is 1% These are the numbers that go in the PAM-1 matrix! See Table 3.2, p. 96 in Bioinformatics by Mount.

Converting Mutability Probabilities to Log Odds Score for X to Y Compute the relative frequency of change for X to Y as follows: get the X to Y mutability probability; divide by the % frequency of X in the sequence data; convert to log base 10 and multiply by 10. In our example, we get log10(0.0156/0.1587) = log10(0.098). To compute log10(0.098), solve for x: 10^x = 0.098, so x = -1.01, since 10^(-1.01) = 1/10^(1.01) = 0.098. Compute the log odds score for Y to X in the same way, then take the average of these two values.
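The conversion can be sketched with the slide's own numbers. The function names are illustrative. The unrounded result differs slightly from the slide's -1.01 because the slide rounds the ratio to 0.098 before taking the logarithm.

```python
import math


def mutability_probability(r, c, p):
    """(r * c) / p: relative mutability times the share of changes that are X<->Y."""
    return (r * c) / p


def log_odds(prob_x_to_y, freq_x):
    """10 * log10 of the mutability probability relative to the frequency of X."""
    return 10 * math.log10(prob_x_to_y / freq_x)


p_AG = mutability_probability(r=0.0209, c=3, p=4)
score_AG = log_odds(0.0156, 0.1587)  # values as given on the slides
print(round(p_AG, 6), round(score_AG, 2))
```

The final matrix entry would average this score with the corresponding Y to X score, which needs the frequency and mutability of G and is not given in the transcript.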

Usefulness of Log Odds Scores A score of 0 indicates that the change from one amino acid to another occurs as often as expected by chance. A negative score means that the change is observed less often than expected by chance. A positive score means that the change is observed more often than expected by chance. Because the scores are in log form, they can be added (i.e., the chance that X will change to Y and then Y to Z). See Figure 3.14, page 98 of Bioinformatics by Mount.

Disadvantages of PAM Matrices A phylogenetic tree must be constructed first, implying some circularity in the analysis. The original PAM-1 matrix was based on a limited number of families, not necessarily representative of all protein families. The Markov model does not take into account that multi-step mutations should be treated differently from single-step ones.

BLOSUM Scoring Matrices Based on a larger set of protein families than PAM (about 500 families). The proteins in the families are known to be biochemically related. Focuses on blocks of conserved amino acid patterns in these families Designed to find conserved domains in protein families BLOSUM matrices with lower numbers are more useful for scoring matches in pairs that are expected to be less closely related through evolution – e.g., BLOSUM50 is used for more distantly-related proteins than BLOSUM62. (This is the opposite of the PAM matrices.)

BLOSUM BLOSUM (blocks amino acid substitution matrices) Blocks: ungapped amino acid patterns

Block alignment: example

BLOSUM50

Gap Penalty (Gap Scoring)

Better gap scoring Real gaps are often more than one letter long.
>gi|729942|sp|P40601|LIP1_PHOLU Lipase 1 precursor (Triacylglycerol lipase) Length = 645
Score = 33.5 bits (75), Expect = 5.9
Identities = 32/180 (17%), Positives = 70/180 (38%), Gaps = 9/180 (5%)
Query: 2038 IYSLYGLYNVPYENLFVEAIASYSDNKIRSKSRRVIATTLETVGYQTANGKYKSESYTGQ 2097
Sbjct:  441 VFTAYGLWRY-YDKGWISGDLHYLDMKYEDITRGIVLNDW----LRKENASTSGHQWGGR 495
Query: 2098 LMAGYTYMMPENINLTPLAGLRYSTIKDKGYKETGTTYQNLTVKGKNYNTFDGLLGAKVS 2157
Sbjct:  496 ITAGWDIPLTSAVTTSPIIQYAWDKSYVKGYRESGNNSTAMHFGEQRYDSQVGTLGWRLD 555
Query: 2158 SNINVNEIVLTPELYAMVDYAFKNKVSAIDARLQGMTAPLPTNSFKQSKTSFDVGVGVTA 2217
Sbjct:  556 TNFG----YFNPYAEVRFNHQFGDKRYQIRSAINSTQTSFVSESQKQDTHWREYTIGMNA 611

Affine gap penalty Example: aligning LETVGY with W----L, the four gap positions score -5 -1 -1 -1. Separate penalties for gap opening and gap extension. This requires modifying the DP algorithm to store three values in each box.
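The difference between linear and affine scoring for a single gap can be sketched as follows. The convention assumed here is that the opening penalty covers the first gap position and each further position pays the extension penalty, which matches the -5 -1 -1 -1 example; other conventions exist, and the function names are illustrative.

```python
def affine_gap_score(length, gap_open=-5, gap_extend=-1):
    """Score of one gap: open once, then extend for each additional position."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


def linear_gap_score(length, d=-5):
    """Score of one gap when every gap position costs the same."""
    return length * d


print(affine_gap_score(4))  # -5 -1 -1 -1 = -8
print(linear_gap_score(4))  # 4 * -5 = -20
```

A single four-residue gap is penalized far less under the affine scheme than under the linear one, which is the behavior wanted when real gaps tend to span several residues.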

Summary Substitution matrices represent the probability of mutations. PAM / BLOSUM(62) Affine gap penalties include a large gap opening penalty and small gap extension penalty.

Multiple Sequence Alignment

MSA Introduction Goal of protein sequence alignment: To discover “biological” (structural / functional) similarities. If sequence similarity is weak, pairwise alignment can fail to identify … Simultaneous comparison of many sequences often finds similarities that are invisible in PA.

Why do we care about sequence alignment? It can tell us something about the evolution of organisms. We can see which regions of a gene (or its derived protein) are susceptible to mutation and which can have one residue replaced by another without changing function. Homologous genes (genes that share an evolutionary origin) have similar sequences. Orthologs are genes that are evolutionarily related, have a similar function, but now appear in different species. Paralogs are evolutionarily related (share an origin) but no longer have the same function. You can uncover either orthologs or paralogs through sequence alignment.

Multiple Sequence Alignment Often applied to proteins. Proteins that are similar in sequence are often similar in structure and function. Sequence changes more rapidly in evolution than do structure and function.

Work with proteins! If at all possible — Twenty match symbols versus four, plus similarity! Way better signal to noise. Also guarantees no indels are placed within codons. So translate, then align. Nucleotide sequences will only reliably align if they are very similar to each other. And they will require extensive hand editing and careful consideration.

Overview of Methods Dynamic programming – too computationally expensive to do a complete search; uses heuristics. Progressive – starts with pair-wise alignment of the most similar sequences and adds to that. Iterative – makes an initial alignment of groups of sequences, then adds to these (e.g. genetic algorithms). Locally conserved patterns. Statistical and probabilistic methods. DP can align about 7 relatively short (200-300) protein sequences in a reasonable amount of time. Progressive alignment is also a heuristic (a greedy algorithm); in practice, it produces biologically meaningful results.

Dynamic Programming Computational complexity – even worse than for pair-wise alignment because we’re finding all the paths through an n-dimensional hyperspace (We can picture this in 2 or 3 dimensions.) Can align about 7 relatively short (200-300) protein sequences in a reasonable amount of time; not much beyond that
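A rough count shows why full dynamic programming stops scaling. For n sequences of length L, the DP lattice has (L+1)^n cells, and each interior cell has 2^n - 1 predecessor cells to consider (every non-empty choice of which sequences advance). The function name below is illustrative.

```python
def dp_lattice_size(n_sequences, length):
    """Return (number of cells, predecessors per interior cell) for full MSA DP."""
    cells = (length + 1) ** n_sequences
    predecessors = 2 ** n_sequences - 1
    return cells, predecessors


for n in (2, 3, 7):
    cells, preds = dp_lattice_size(n, 300)
    print(n, cells, preds)
```

Going from 2 to 7 sequences of length 300 raises the cell count from about 9 * 10^4 to more than 10^17, which is why the heuristics on the following slides restrict the search space.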

A Heuristic for Reducing the Search Space in Dynamic Programming Let’s picture this in 3 dimensions (pp. 174-180 in book). It generalizes to n. Consider the pair-wise alignments of each pair of sequences. Create a phylogenetic tree from these scores. Consider a multiple sequence alignment built from the phylogenetic tree. These alignments circumscribe a space in which to search for a good (but not necessarily optimal) alignment of all n sequences.

Phylogenetic Tree Dynamic programming uses a phylogenetic tree to build a “first-cut” msa. The tree shows how the proteins could have evolved from shared origins over evolutionary time. See page 180 in Bioinformatics by Mount. Chapter 7 goes into detail on this.

Dynamic Programming -- MSA Create a phylogenetic tree based on pair-wise alignments (Pairs of sequences that have the best scores are paired first in the tree.) Do a “first-cut” msa by incrementally doing pair-wise alignments in the order of “alikeness” of sequences as indicated by the tree. Most alike sequences aligned first. Use the pair-wise alignments and the “first-cut” msa to circumscribe a space within which to do a full msa that searches through this solution space. The score for a given alignment of all the sequences is the sum of the scores for each pair, where each of the pair-wise scores is multiplied by a weight є indicating how far the pair-wise score differs from the first-cut msa alignment score.
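The weighted sum-of-pairs score described above can be sketched as follows. The per-column scores (match 2, mismatch -7, gap -5) and the example weights are hypothetical placeholders, not values from the slides or from Mount; a real implementation would use a substitution matrix and weights derived from the first-cut msa.

```python
from itertools import combinations


def pair_score(a, b, match=2, mismatch=-7, gap=-5):
    """Hypothetical column score for one pair of aligned symbols."""
    if a == "-" and b == "-":
        return 0  # two gaps facing each other carry no information
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def weighted_sum_of_pairs(alignment, weights=None):
    """Sum the pairwise scores of all row pairs, each multiplied by its weight."""
    total = 0
    for i, j in combinations(range(len(alignment)), 2):
        w = 1.0 if weights is None else weights[(i, j)]
        pair_total = sum(pair_score(a, b)
                         for a, b in zip(alignment[i], alignment[j]))
        total += w * pair_total
    return total


msa = ["AAG", "A-G", "AAG"]
print(weighted_sum_of_pairs(msa))
print(weighted_sum_of_pairs(msa, {(0, 1): 0.5, (0, 2): 1.0, (1, 2): 0.5}))
```

With unit weights the three pairwise scores are -1, 6 and -1; down-weighting the pairs that involve the gapped row raises the total from 4 to 5.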

Heuristic Dynamic Programming Method for MSA Does not guarantee an optimal alignment of all the sequences in the group. Does get an optimal alignment within the space chosen.

Progressive Methods Similar to dynamic programming method in that it uses the first step (i.e., it creates a phylogenetic tree, aligns the most-alike pair, and incrementally adds sequences to the alignment in order of “alikeness” as indicated by the tree.) Differs from dynamic programming method for MSA in that it doesn’t refine the “first-cut” MSA by doing a full search through the reduced search space. (This is the computationally expensive part of DP MSA in that, even though we’ve cut down the search space, it’s still big when we have many sequences to align.)

Progressive Method Generally proceeds as follows: Choose a starting pair of sequences and align them. Align each next sequence to those already aligned, one at a time. Heuristic method – doesn’t guarantee an optimal alignment. Details vary in implementation: How to choose the first sequence to align? Align all subsequent sequences cumulatively or in subfamilies? How to score?

ClustalW Based on phylogenetic analysis. A phylogenetic tree is created using a pairwise distance matrix and the neighbor-joining algorithm. The most closely-related pairs of sequences are aligned using dynamic programming. Each of the alignments is analyzed and a profile of it is created. Alignment profiles are aligned progressively for a total alignment. W in ClustalW refers to a weighting of scores depending on how far a sequence is from the root on the phylogenetic tree. (See p. 180-182 of Bioinformatics by Mount.)

ClustalW Procedure

“Once a gap, always a gap”

Basic Steps in Progressive Alignment “Once a gap, always a gap”

Problems with Progressive Method Highly sensitive to the choice of initial pair to align. If they aren’t very similar, it throws everything off. It’s not trivial to come up with a suitable scoring matrix or gap penalties.

Summary Global multiple sequence alignment: progressive method (ClustalW), which uses pairwise alignment to iteratively add one sequence to a growing MSA. Local MSA → sequence pattern search.