Multiple Sequence Alignment

Slides:



Advertisements
Similar presentations
Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy Sep 10 th, 2013 BMI/CS 576.
Advertisements

Multiple Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct. 6, 2005 ChengXiang Zhai Department of Computer Science University.
Blast to Psi-Blast Blast makes use of Scoring Matrix derived from large number of proteins. What if you want to find homologs based upon a specific gene.
1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
Measuring the degree of similarity: PAM and blosum Matrix
Lecture 8 Alignment of pairs of sequence Local and global alignment
Multiple alignment: heuristics. Consider aligning the following 4 protein sequences S1 = AQPILLLV S2 = ALRLL S3 = AKILLL S4 = CPPVLILV Next consider the.
1 “INTRODUCTION TO BIOINFORMATICS” “SPRING 2005” “Dr. N AYDIN” Lecture 4 Multiple Sequence Alignment Doç. Dr. Nizamettin AYDIN
Optimatization of a New Score Function for the Detection of Remote Homologs Kann et al.
Heuristic alignment algorithms and cost matrices
. Class 5: Multiple Sequence Alignment. Multiple sequence alignment VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG.
Sequence analysis lecture 6 Sequence analysis course Lecture 6 Multiple sequence alignment 2 of 3 Multiple alignment methods.
Progressive MSA Do pair-wise alignment Develop an evolutionary tree Most closely related sequences are then aligned, then more distant are added. Genetic.
Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
Multiple sequence alignments and motif discovery Tutorial 5.
Multiple alignment: heuristics
Multiple sequence alignment
Similar Sequence Similar Function Charles Yan Spring 2006.
Multiple Sequence alignment Chitta Baral Arizona State University.
Sequence Alignment III CIS 667 February 10, 2004.
BNFO 602 Multiple sequence alignment Usman Roshan.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Practical multiple sequence algorithms Sushmita Roy BMI/CS 576 Sushmita Roy Sep 23rd, 2014.
Multiple Sequence Alignments
Multiple sequence alignment methods 1 Corné Hoogendoorn Denis Miretskiy.
CECS Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 3: Multiple Sequence Alignment Eric C. Rouchka,
CISC667, F05, Lec8, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Multiple Sequence Alignment Scoring Dynamic Programming algorithms Heuristic algorithms.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Chapter 5 Multiple Sequence Alignment.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Multiple Sequence Alignment BMI/CS 576 Colin Dewey Fall 2010.
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
Introduction to Profile Hidden Markov Models
Multiple Sequence Alignments and Phylogeny.  Within a protein sequence, some regions will be more conserved than others. As more conserved,
Practical multiple sequence algorithms Sushmita Roy BMI/CS 576 Sushmita Roy Sep 24th, 2013.
Multiple Sequence Alignment May 12, 2009 Announcements Quiz #2 return (average 30) Hand in homework #7 Learning objectives-Understand ClustalW Homework#8-Due.
Protein Sequence Alignment and Database Searching.
Amino Acid Scoring Matrices Jason Davis. Overview Protein synthesis/evolution Protein synthesis/evolution Computational sequence alignment Computational.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Multiple Sequence Alignments Craig A. Struble, Ph.D. Department of Mathematics, Statistics, and Computer Science Marquette University.
Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.
Multiple alignment: Feng- Doolittle algorithm. Why multiple alignments? Alignment of more than two sequences Usually gives better information about conserved.
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Sequence Alignment Only things that are homologous should be compared in a phylogenetic analysis Homologous – sharing a common ancestor This is true for.
Multiple Sequence Alignment. How to score a MSA? Very commonly: Sum of Pairs = SP Compute the pairwise score of all pairs of sequences and sum them. Gap.
Chapter 3 Computational Molecular Biology Michael Smith
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Multiple Sequence Alignment Colin Dewey BMI/CS 576 Fall 2015.
COT 6930 HPC and Bioinformatics Multiple Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Sequence Alignment Abhishek Niroula Department of Experimental Medical Science Lund University
Step 3: Tools Database Searching
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Probabilistic Approaches to Phylogenies BMI/CS 576 Sushmita Roy Oct 2 nd, 2014.
Multiple Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 13, 2004 ChengXiang Zhai Department of Computer Science University.
More on HMMs and Multiple Sequence Alignment BMI/CS 776 Mark Craven March 2002.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Multiple sequence alignment (msa)
The ideal approach is simultaneous alignment and tree estimation.
Hierarchical clustering approaches for high-throughput data
Sequence Alignment 11/24/2018.
Pairwise Sequence Alignment (cont.)
Alignment IV BLOSUM Matrices
Presentation transcript:

Multiple Sequence Alignment Sushmita Roy BMI/CS 576 www.biostat.wisc.edu/bmi576/ sroy@biostat.wisc.edu Sep 18th, 2014

Key concepts The Multiple Sequence Alignment Problem Scoring Multiple Sequence Alignments Scoring an alignment of a “profile” and a sequence Heuristic Algorithms for Multiple Sequence Alignment General strategies Progressive alignment Star alignment Tree-based alignment Iterative alignment Understanding of key steps in common algorithms Progressive: CLUSTAL Iterative: MUSCLE

Readings Durbin Chapter 6 6.1 6.2 6.3: ClustalW 6.4 CLUSTALW MUSCLE

What is multiple sequence alignment? The task of locating equivalent regions of more than two sequences so as to maximize their overall similarity.

Why multiple sequence alignment? Build phylogenetic trees (next module) Determine evolutionary relationships between sequences A multiple sequence alignment can represent a family of proteins with similar function Compare new sequence to a “family” of known proteins For example the BLOCKS database used for BLOSUM contains several ungapped alignments for known protein families Discover common signatures or protein domains among a group of proteins Identify genetic variation among individuals of a population

An example multiple sequence alignment

The tasks in Multiple Sequence Alignment Scoring an alignment Algorithms for creating an alignment

Some notation Let m denote a Multiple Sequence Alignment mi is the ith column of the alignment m mij is the ith column and jth row cia count of residue a in column i

Example using notation G A R F I E L D T H E F A T C A T G A R F I E L D T H E _ _ _ C A T G A R F I E L D T H A T _ _ C A T G A R R Y _ L I K E D A _ _ C A T

Scoring a Multiple Sequence Alignment (MSA) Key issue: how do we score a multiple sequence alignment? Usually, we assume that columns of an alignment are independent We will simplify score by assuming linear gap penalty for now as gap function score of ith column

Gap penalty (G) We will use a simple linear gap penalty function Penalty for a gap: g Let s(a,b) denote the cost of substituting a by b. Linear gap penalty can be incorporated into the substitution matrix s(a,_)=-g=s(_,a) s(_,_)=0

Two ways to score are used Entropy based scores Sum of pairs

Entropy of a distribution A measure of average uncertainty of an outcome For a discrete distribution P(X), where X takes k values x1, .. xk it is defined as Entropy is greatest when we are most uncertain, that is, for a uniform distribution Entropy is least when we are most certain, e.g. deterministic event

Score of a column: Entropy based Score of the ith column of alignment m is This has an entropy-based interpretation First, the probability of a column mi Second, taking log on both sides and putting a “-” sign gives us the entropy-based score

Scoring an alignment: Entropy based score Taking log on both sides and putting a “-” sign gives us the entropy-based score This looks similar to entropy of a random variable specifying the character a in this column. High entropy: More uniform distribution/more variability of characters Low entropy: Less uniform distribution/less variability of characters

Scoring of a column: Sum of Pairs Compute the sum of the pairwise scores Iterate over all pairs of rows in the column Substitution score from a substitution/match matrix such as BLOSUM or PAM

Algorithms for performing a Multiple Sequence Alignment Dynamic programming Not feasible Progressive alignment algorithms Star alignment Guide tree approach Iterative alignment algorithms

Dynamic Programming (DP) for multiple sequence alignment Assume columns are independent Score of alignment is sum of scores per column Generalization of methods for pairwise alignment consider k-dimensional matrix for k sequences (instead of 2-dimensional matrix) each matrix element represents alignment score for k subsequences (instead of 2 subsequences)

Notation for DP Assume we have k sequences i1 denotes where we are in the sequence 1 i2 denotes where we are in sequence 2 … ik denotes where we are in sequence k denotes the character at ik position of sequence xk F: k-dimensional matrix where denotes the score of the best partial alignment upto i1, i2.. ik parts of the sequences

Recall the DP for the pairwise alignment

DP for Multiple sequence alignment max score of alignment for the k subsequences How many items do we need to maximize over? 2k -1

DP algorithm is too expensive For k sequences each of length n Space complexity: O(nk) Time complexity: O(nk2k)

Heuristic algorithms to Multiple sequence alignment Progressive alignment Build the alignment of larger number of sequences from partial alignments of subsets of sequences Iterative alignment Possibly remove some of the aligned sequences and re-align to see if score improves

Progressive alignment Key heuristic: Align the “most similar” sequences first Rely on pre-computed pairwise similarity/distance Pairwise sequence alignments Algorithms differ in the extent to which the pairwise similarity influences the final alignments Two strategies Star alignment Tree alignments Simple (quick and dirty) tree At each time combine two, possibly singleton, sets of sequences

Star Alignment Approach Given: k sequences to be aligned pick one sequence as the “center” for each determine an optimal alignment between and Aggregate pairwise alignments Shift entire columns when incorporating gaps return: multiple alignment resulting from aggregate

Picking the center in star alignments Two possible approaches: try each sequence as the center, return the best multiple alignment compute all pairwise alignments and select the string that maximizes:

Aligning to an existing partial alignment Need to treat each “partial alignment” as a single entity Partial alignment should not be changed other than gap insertions Shift entire columns when incorporating gaps TGTTAAC -TGTTAAC -TGT-AAC -TGT--AC ATGT---C ATGT-GGC -TGT AAC -TGT -AC ATGT --C ATGT GGC

Star Alignment Example Given: ATGGCCATT ATTGCCATT ATTGCCATT ATGGCCATT ATCCAATTTT ATC-CAATTTT ATTGCCATT-- ATCTTCTT ATTGCCGATT ATTGCCATT ATTGCCGATT ATTGCC-ATT ATCTTC-TT ATTGCCATT

Star Alignment Example Aggegate pairwise alignments present pair Current multiple alignment ATGGCCATT ATTGCCATT ATTGCCATT ATGGCCATT 1. ATC-CAATTTT ATTGCCATT-- ATTGCCATT-- ATGGCCATT-- ATC-CAATTTT 2.

Star Alignment Example present pair Current multiple alignment ATCTTC-TT ATTGCCATT ATTGCCATT-- ATGGCCATT-- ATC-CAATTTT ATCTTC-TT-- 3. ATTGCCGATT ATTGCC-ATT ATTGCC- A TT-- ATGGCC- A TT-- ATC-CA- A TTTT ATCTTC- - TT-- ATTGCCG A TT-- 4. shift entire columns when incorporating a gap

Comments about Star alignment Conceptually simple Dependent only upon pairwise alignments Does not consider any position-specific information of the partial multiple sequence alignment while aligning a new sequence to it

Tree-based progressive alignments Align sequences according to a guide tree leaves represent sequences internal nodes represent alignments Determine alignments from bottom of tree upward return multiple alignment represented at the root of the tree One common variant: the CLUSTALW algorithm [Thompson et al. 1994]

Calculating the pairwise distance Distance from Feng & Doolittle’s algorithm Percent identity For two sequences with the following alignment AATAATA ATAA_TA Similarity S No. of identical bases/size of alignment 4/7 for the above example Distance=1-S

Tree-based progressive alignment Depending on the internal node in the tree, we may have to align a a sequence with a sequence a sequence with a partial alignment a partial alignment with a partial alignment In all cases we have the option of inserting gaps or substitutions For aligning alignments, we will use sum of pairs scoring To choose between options we will use an idea similar to the pairwise sequence alignment case

Tree alignment example Starting sequences Create a guide tree Using pairwise distances (we will cover this in subsequent lectures) Approach similar to but simpler than phylogenetic trees TGTTAAC TGTAAC TGTAC ATGTC ATGTGGC

Tree Alignment Example TGTAAC TGT-AC TGTTAAC TGTAAC TGTAC ATGTC ATGTGGC

Tree Alignment Example TGTAAC TGT-AC ATGT--C ATGTGGC TGTTAAC TGTAAC TGTAC ATGTC ATGTGGC

Tree Alignment Example Aligning two alignments -TGTAAC -TGT-AC ATGT--C ATGTGGC TGTAAC TGT-AC ATGT--C ATGTGGC TGTTAAC TGTAAC TGTAC ATGTC ATGTGGC

Tree Alignment Example -TGTTAAC -TGT-AAC -TGT--AC ATGT---C ATGT-GGC Aligning sequence to alignment -TGTAAC -TGT-AC ATGT--C ATGTGGC TGTAAC TGT-AC ATGT--C ATGTGGC TGTTAAC TGTAAC TGTAC ATGTC ATGTGGC

Scoring an alignment of partial alignments Recall the sum of pairs score for a column i Let 1 to n represent sequences from the first alignment Let n+1 to N represent sequences from the second alignment, N denotes total number of sequences Alignment at column i can be written as Within first alignment Within second alignment Between two alignments

Computing the sum of scores for two alignments Assume we have two alignments corresponding to intermediate nodes of the guide tree At each step we maximize over score from aligning column i in A1 to a column j in A2 aligning column i in A1 to gaps in A2 aligning column j in A2 to gaps in A1 ClustalW uses an average of all pairwise comparisons between two alignments Alignment A1 Alignment A2 AAAC _GAC AGC ACC

ClustalW scores for aligning columns from two alignments Assume a score of 1 for mismatch, 2 for match and 0 for gap Score of aligning column 3 from Alignment 1 and column 2 from alignment 2 AAAC _GAC Alignment 1 Alignment 2 AGC ACC

An example for aligning two alignments A A A C _ G A C 1 1.5 2 2.25 2.5 3.5 3.25 4.5 A G C A C C Max of three options A _ A _ _ Alignment 1 A A _ _ A A Alignment 2

Comments about tree-based progressive alignment Exploits partial alignment information But, greedy The tree might not be correct, that is, reflect an incorrect ordering of how sequences should be stacked up in the alignment Final results prone to errors in alignment Some positions might be misaligned (that is have a lower score than if a different ordering is used).

Additional notes about the ClustalW algorithm Tailored to handle very divergent sequences: 25-30% similarity Dynamically varies the gap penalties in a position and residue specific manner Weight different sequences differently Closely related sequences need to be down-weighted Divergent sequences are up-weighted Dynamically switch between substitution matrices depending upon the average similarity between sequences being aligned

Applying ClustalW to SH3 domain proteins Alignment blocks correspond to beta strand secondary structures Proteins share <12% sequence identity Thomson et al, 1994

Iterative refinement methods The order of selection of sequences can influence the alignment ClustalW overcomes some of these issues but has many heuristics and parameters How to avoid committing to a non-optimal pairwise decision? Revisit alignments This is the focus of iterative alignments

Summary Multiple sequence alignment is the problem of finding equivalent regions among more than two sequences Scoring function: Entropy based Sum of pairs Algorithms Progressive Star Dependent upon a center Keep adding all pairs of aligned sequences with the current alignment Tree Create an approximate guide tree Use tree to align the sequences Iterative Don’t commit to the fixed ordering, revisit the alignment until score does not change

Slide Errata/Update Slide 40 had a typo in the second equation Slide 43 now has back arrows

Ordering matters D G D _ G G D G D G G _ Consider aligning GG, DGG and DGD 1 2 D G D _ G G D G D G G _ Are as good. But when we include DGG 1 2 D G D _ G G D G G D G D G G _ D G G 1 is better than 2, assuming a match score of 2, mismatch score =1, gap penalty=-2