Multiple Sequence Alignments


Multiple Sequence Alignments: Profiles and Progressive Alignment

Profiles for families of sequences can be built from MSAs.
[Figure: a three-column nucleotide alignment and the profile derived from it, giving the frequency of A, C, G, T, and gap in each column.]
Note: While profiles can be used for any kind of sequence data, we’ll focus on protein sequences.

Profiles
A profile is a table that lists the frequency of each amino acid at each position of a protein sequence. The frequencies are calculated from an MSA containing a domain of interest. A profile allows us to identify a consensus sequence, and the scoring scheme derived from it allows us to align a new sequence to the profile. Profiles can be used in database searches to find new sequences that match the profile. Profiles are also used to compute multiple alignments heuristically (progressive alignment).

Profiles: Position-Specific Scoring Matrix (PSSM)
To compare a sequence to a profile, we need to assign a score to each amino acid at each position. The score of the profile for amino acid a at position p is
$$M(p,a) = \sum_{b} f(p,b)\, s(a,b)$$
where f(p,b) is the frequency of amino acid b at position p and s(a,b) is the score of the pair (a,b) (from, e.g., BLOSUM or PAM).
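
A minimal sketch of this frequency-weighted score, assuming a hypothetical three-letter alphabet (K, L, M) and made-up substitution scores; a real profile would take s(a,b) from BLOSUM or PAM. The function names and the toy alignment are illustrative only, and gap characters are simply skipped when summing.

```python
from collections import Counter

# Hypothetical toy substitution scores; keys are alphabetically sorted pairs.
TOY_SCORES = {
    ("K", "K"): 5, ("K", "L"): -2, ("K", "M"): -1,
    ("L", "L"): 4, ("L", "M"): 2, ("M", "M"): 5,
}

def s(a, b):
    """Symmetric lookup of the substitution score s(a, b)."""
    return TOY_SCORES[tuple(sorted((a, b)))]

def column_frequencies(msa):
    """Return one {residue: frequency} dict per alignment column.

    Frequencies are counts divided by the number of sequences; gaps are
    not scored (an assumption of this sketch).
    """
    n = len(msa)
    freqs = []
    for column in zip(*msa):
        counts = Counter(ch for ch in column if ch != "-")
        freqs.append({res: c / n for res, c in counts.items()})
    return freqs

def pssm(msa, alphabet="KLM"):
    """M[p][a] = sum over b of f(p, b) * s(a, b)."""
    return [
        {a: sum(f * s(a, b) for b, f in col.items()) for a in alphabet}
        for col in column_frequencies(msa)
    ]

# Toy four-sequence alignment with five columns.
msa = ["KLM-K", "KLKLK", "KMML-", "ML-LM"]
M = pssm(msa)
for p, row in enumerate(M, start=1):
    print(p, row)
```

Column 1 of the toy alignment holds K with frequency 0.75 and M with frequency 0.25, so M[0]["K"] is 0.75 s(K,K) + 0.25 s(K,M) under the toy scores.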

Profiles: PSSM
[Figure: example PSSM including insertion/deletion penalties. Source: Gribskov et al., PNAS 84(13): 4355 (1987).]

Profiles: Consensus Sequence
A consensus residue C(p) is generated at each position p of the profile to aid the display of alignments of target sequences with the profile. C(p) is the amino acid c with the highest score M(p,c) at position p. It is therefore the amino acid most mutationally similar to all the aligned residues of the probe sequences at p, rather than the most common one.
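
The difference between "highest-scoring" and "most common" can be seen in a small sketch. It assumes a hypothetical four-letter alphabet (D, I, L, V) with toy scores chosen to look like typical BLOSUM-style values; the column and the function name are made up for illustration.

```python
from collections import Counter

# Hypothetical toy scores; keys are alphabetically sorted pairs.
TOY_SCORES = {
    ("D", "D"): 6, ("D", "I"): -3, ("D", "L"): -4, ("D", "V"): -3,
    ("I", "I"): 4, ("I", "L"): 2, ("I", "V"): 3,
    ("L", "L"): 4, ("L", "V"): 1, ("V", "V"): 4,
}

def s(a, b):
    """Symmetric lookup of the substitution score s(a, b)."""
    return TOY_SCORES[tuple(sorted((a, b)))]

def consensus_residue(column, alphabet="DILV"):
    """Return the residue c maximizing M(p, c), plus all scores M(p, a)."""
    n = len(column)
    freqs = {b: count / n for b, count in Counter(column).items()}
    scores = {a: sum(f * s(a, b) for b, f in freqs.items()) for a in alphabet}
    return max(scores, key=scores.get), scores

column = "DDILV"  # one alignment column: D occurs twice, I, L, V once each
consensus, scores = consensus_residue(column)
most_common = Counter(column).most_common(1)[0][0]
print("consensus:", consensus, "most common:", most_common)
print(scores)
```

In this toy column D is the most frequent residue, but I scores highest because it is similar to L and V as well as to itself.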

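Aligning a sequence to a profile

MSA (columns 1 to 5):
    1 2 3 4 5
    K L M - K
    K L K L K
    K M M L -
    M L - L M

Profile (column frequencies computed from the MSA):
         1    2    3    4    5
    K   .75   0   .25   0   .50
    L    0   .75   0   .75   0
    M   .25  .25  .50   0   .25
    -    0    0   .25  .25  .25

New sequence: K K L L M

Aligned with the profile:
    profile column:  1 - 2 3 4 5
    MSA:             K - L M - K
                     K - L K L K
                     K - M M L -
                     M - L - L M
    new sequence:    K K L - L M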

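Scoring a sequence-to-profile alignment
Score each column separately according to the PSSM. Each character in the profile column contributes to the score, weighted by its frequency.
Example: the new sequence K K L - L M is aligned against profile columns 1 - 2 3 4 5. Profile column 1 contains K with frequency 0.75 and M with frequency 0.25, and it is aligned with K, so
Column 1 score: 0.75 s(K,K) + 0.25 s(K,M)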
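
A small sketch of column-by-column scoring for the K K L - L M example. The substitution scores and the flat gap penalty are hypothetical (the slides do not give numeric values), and the profile is the frequency table of the four-sequence toy alignment K L M - K / K L K L K / K M M L - / M L - L M.

```python
# Residue frequencies per profile column (gaps in the MSA just lower the totals).
PROFILE = [
    {"K": 0.75, "M": 0.25},
    {"L": 0.75, "M": 0.25},
    {"M": 0.50, "K": 0.25},
    {"L": 0.75},
    {"K": 0.50, "M": 0.25},
]

# Hypothetical toy scores; keys are alphabetically sorted pairs.
TOY_SCORES = {
    ("K", "K"): 5, ("K", "L"): -2, ("K", "M"): -1,
    ("L", "L"): 4, ("L", "M"): 2, ("M", "M"): 5,
}
GAP = -4.0  # hypothetical flat penalty for any column that involves a gap

def s(a, b):
    return TOY_SCORES[tuple(sorted((a, b)))]

def column_score(freqs, residue):
    """Sum of s(residue, b) over profile residues b, weighted by frequency."""
    return sum(f * s(residue, b) for b, f in freqs.items())

def alignment_score(profile, seq_row, col_row):
    """Score an alignment given as two parallel rows.

    seq_row: aligned sequence characters ('-' for a gap in the sequence).
    col_row: profile column index per position, or None where a new
             all-gap column was inserted into the profile.
    """
    total = 0.0
    for ch, col in zip(seq_row, col_row):
        if ch == "-" or col is None:
            total += GAP
        else:
            total += column_score(profile[col], ch)
    return total

seq_row = "KKL-LM"
col_row = [0, None, 1, 2, 3, 4]  # profile columns 1 - 2 3 4 5, zero-based

col1 = column_score(PROFILE[0], "K")  # 0.75*s(K,K) + 0.25*s(K,M)
total = alignment_score(PROFILE, seq_row, col_row)
print(col1, total)
```

How gap columns should be scored is a design choice; a flat penalty is used here only to keep the sketch short.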

Profile-to-sequence alignments
The optimum alignment can be found by dynamic programming, as an extension of Needleman-Wunsch. Spaces are only ever added to the MSA, never removed: once a gap, always a gap. The same approach can align profiles to profiles.
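
The dynamic program can be sketched as a Needleman-Wunsch recurrence in which the match term is the frequency-weighted column score. This is an illustration under stated assumptions, not a reference implementation: the toy scores, the linear gap penalty, and all names are hypothetical, and end gaps are penalized like any other gap.

```python
PROFILE = [
    {"K": 0.75, "M": 0.25},
    {"L": 0.75, "M": 0.25},
    {"M": 0.50, "K": 0.25},
    {"L": 0.75},
    {"K": 0.50, "M": 0.25},
]
# Hypothetical toy scores; keys are alphabetically sorted pairs.
TOY_SCORES = {
    ("K", "K"): 5, ("K", "L"): -2, ("K", "M"): -1,
    ("L", "L"): 4, ("L", "M"): 2, ("M", "M"): 5,
}
GAP = -4.0  # hypothetical linear gap penalty

def column_score(freqs, residue):
    return sum(f * TOY_SCORES[tuple(sorted((residue, b)))]
               for b, f in freqs.items())

def align_to_profile(profile, seq, gap=GAP):
    """Globally align seq to profile; return (score, seq_row, col_row)."""
    n, m = len(profile), len(seq)
    # F[i][j]: best score for the first i profile columns vs. first j residues.
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + column_score(profile[i - 1], seq[j - 1]),
                F[i - 1][j] + gap,  # profile column against a gap in seq
                F[i][j - 1] + gap,  # residue against a new gap column
            )
    # Trace back from the bottom-right corner.
    i, j, seq_row, col_row = n, m, [], []
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and F[i][j] ==
                F[i - 1][j - 1] + column_score(profile[i - 1], seq[j - 1])):
            seq_row.append(seq[j - 1]); col_row.append(i - 1)
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            seq_row.append("-"); col_row.append(i - 1)
            i -= 1
        else:
            seq_row.append(seq[j - 1]); col_row.append(None)
            j -= 1
    return F[n][m], "".join(reversed(seq_row)), col_row[::-1]

score, seq_row, col_row = align_to_profile(PROFILE, "KKLLM")
print(score, seq_row, col_row)
```

A `None` in `col_row` marks a position where a new all-gap column would be inserted into the MSA; existing columns are never removed, in line with "once a gap, always a gap". With these particular toy scores the optimum happens to be gap-free, so the result depends on the scoring scheme chosen.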

Evolutionary Profiles
The profiles just seen are called average profiles. They generally perform well but disregard some of the biology: how did each position evolve? Both the amount and the type of conservation vary from position to position. The alternative is evolutionary profiles (Gribskov, M. and Veretnik, S., Methods in Enzymology 266, 198-212, 1996).

Evolutionary Profiles
Idea: fit a different model at each position. For each position i and each possible ancestor b for position i, try various evolutionary distances x (assuming the PAM model) and choose the one that minimizes the cross entropy
$$H = -\sum_{a} f_a \log p_a$$
where f_a is the observed frequency of a and p_a is the predicted frequency of a, assuming b is the ancestor and x is the distance. This generates 20 distributions for position i, one per candidate ancestor.
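
The fitting loop can be illustrated on a toy scale. The sketch below assumes a hypothetical three-letter alphabet and a made-up one-step mutation matrix standing in for PAM1; distance x is modelled as the x-th power of that matrix. With 20 amino acids and a real PAM matrix the same loop would yield 20 fitted distributions per position.

```python
import math

ALPHABET = "KLM"
# Hypothetical one-step mutation matrix: STEP[b][a] = P(ancestor b -> residue a).
STEP = [
    [0.90, 0.04, 0.06],
    [0.04, 0.90, 0.06],
    [0.05, 0.05, 0.90],
]

def matmul(A, B):
    """Plain-Python matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def cross_entropy(f, p):
    """H = -sum_a f_a * log(p_a); terms with f_a == 0 contribute nothing."""
    return -sum(fa * math.log(pa) for fa, pa in zip(f, p) if fa > 0)

def fit_position(observed, max_distance=64):
    """For each ancestor b, find the distance x that minimizes H.

    Returns {b: (best_x, best_H, predicted_distribution)}.
    """
    size = len(ALPHABET)
    best = {b: (None, math.inf, None) for b in ALPHABET}
    power = [[float(i == j) for j in range(size)] for i in range(size)]
    for x in range(1, max_distance + 1):
        power = matmul(power, STEP)  # power is now STEP raised to x
        for b, row in zip(ALPHABET, power):
            h = cross_entropy(observed, row)
            if h < best[b][1]:
                best[b] = (x, h, list(row))
    return best

# Observed frequencies of K, L, M in one toy alignment column.
observed = [0.75, 0.0, 0.25]
fits = fit_position(observed)
best_ancestor = min(fits, key=lambda b: fits[b][1])
for b, (x, h, _) in fits.items():
    print(b, x, round(h, 4))
```

Here `fits` holds one fitted distribution per candidate ancestor, which is the per-position output the slide describes.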

Evolutionary Profiles
For each position i, compute a “mixture coefficient” W_{ai}, measuring the likelihood that residue a generated the observed distribution (see text). The profile is given by
$$\mathrm{Profile}_{ij} = \log \frac{\sum_{a} W_{ai}\, p_{aij}}{p^{\mathrm{random}}_{j}}$$
where p_{aij} is the frequency of residue j in the distribution for ancestral residue a at position i, and p^{random}_{j} is the frequency of residue j in the database.

Progressive multiple alignment
Feng & Doolittle 1987; Higgins and Sharp 1988.
Idea: the sequences to be aligned are phylogenetically related, and these relationships are used to guide the alignment.
Popular implementations: CLUSTALW, PILEUP, T-Coffee.

CLUSTALW
1. Perform pairwise alignments between all pairs of sequences (n(n-1)/2 pairs).
2. Generate a distance matrix. The distance between a pair is the number of mismatched positions in the alignment divided by the total number of matched positions.
3. Generate a Neighbor-Joining “guide tree” from the distance matrix.
4. Use the guide tree to progressively align sequences in pairs, from the tips to the root of the tree. In practice this aligns profiles, and “once a gap, always a gap” applies.
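
Steps 1 to 3 can be sketched as follows. This is not ClustalW's implementation: it assumes simple identity scoring (match +1, mismatch -1, gap -2) for the pairwise alignments, reads the distance as mismatches divided by aligned non-gap positions, and swaps in UPGMA for Neighbor-Joining to keep the example short. The sequences and all function names are made up.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment; returns the two aligned strings."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i - 1][j] + gap, F[i][j - 1] + gap)
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + sub:
                out_a.append(a[i - 1]); out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def pair_distance(a, b):
    """Mismatches / aligned non-gap columns (assumes at least one such column)."""
    x, y = needleman_wunsch(a, b)
    aligned = [(p, q) for p, q in zip(x, y) if p != "-" and q != "-"]
    return sum(p != q for p, q in aligned) / len(aligned)

def upgma(names, dist):
    """Build a guide tree as nested tuples by average-linkage clustering."""
    clusters = {i: (names[i], 1) for i in range(len(names))}  # id -> (tree, size)
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(d, key=d.get)  # closest pair of clusters
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        new = {}
        for k in clusters:
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            new[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new)
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    return next(iter(clusters.values()))[0]

seqs = {"s1": "ACCAACCA", "s2": "ACCAACCC", "s3": "GTTGGTTG", "s4": "GTTGGTGT"}
names = list(seqs)
dist = [[0.0 if a == b else pair_distance(seqs[a], seqs[b]) for b in names]
        for a in names]
tree = upgma(names, dist)
print(tree)
```

The resulting nested tuple gives the order in which step 4 would merge sequences and then profiles, from the tips toward the root; the progressive alignment itself is not shown.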

CLUSTALW Tree
[Figure: tree calculated from an alignment of more than 1100 RING finger domains, using ClustalW 1.83.]

CLUSTALW heuristics
Sequence weights: individual weights are assigned to each sequence in a partial alignment in order to down-weight similar sequences and up-weight highly divergent ones.
Substitution matrices: different matrices are used at different alignment stages according to sequence divergence.
Gaps: positions in early alignments where gaps have been opened receive locally reduced gap penalties. Residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than in regular secondary structure.

Progressive Alignment: Discussion
Strengths: speed; the progression is biologically sensible (it aligns using a tree).
Weaknesses: there is no objective function, so there is no way of quantifying whether or not the alignment is good.

Problems with CLUSTALW
Local minimum problem: the alignment depends on the order in which sequences are added. With each alignment some proportion of residues are misaligned, and this is worse for divergent sequences. Errors get “locked in” and propagate as sequences are added, which can result in arbitrary and incorrect alignments.
Clustal uses global alignment, which may not be accurate for all parts of the sequence. T-Coffee considers local as well as global similarity.

Iterative alignment
To avoid local minima, realign subgroups of sequences and then incorporate them into a growing multiple sequence alignment. This improves the overall alignment score. It may involve rebuilding the guide tree, and it may be randomized.
Programs: MultAlin, PRRP, DIALIGN.

Phylogenetic Alignment
Given a tree for a set of species S, find ancestral species such that the total distance is minimized.
[Figure: example tree with node sequences CTGG, GTGG, CTGG, CCGG, CTAA, GTAA, CTTC.]
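
A simplified instance of this problem can be sketched with Fitch's small-parsimony procedure. The sketch assumes the leaf sequences are already aligned, gap-free, and of equal length, and that distance is Hamming distance; the general problem with unaligned sequences is much harder. The tree topology and leaf names below are hypothetical and are not taken from the figure.

```python
def fitch(tree, leaves):
    """Label internal nodes so the total number of changes is minimal.

    tree:   nested 2-tuples of leaf names, e.g. (("a", "b"), ("c", "d")).
    leaves: leaf name -> aligned sequence.
    Returns (minimum number of changes, {node: assigned sequence}).
    """
    sets = {}  # node -> list of candidate-state sets, one per site
    cost = 0

    def up(node):
        """Bottom-up pass: intersect child sets, or take the union at a cost of 1."""
        nonlocal cost
        if isinstance(node, str):
            sets[node] = [{ch} for ch in leaves[node]]
        else:
            left, right = up(node[0]), up(node[1])
            merged = []
            for a, b in zip(left, right):
                if a & b:
                    merged.append(a & b)
                else:
                    merged.append(a | b)
                    cost += 1
            sets[node] = merged
        return sets[node]

    labels = {}

    def down(node, parent_seq):
        """Top-down pass: keep the parent's state when it is allowed."""
        seq = "".join(p if p in s else min(s)
                      for p, s in zip(parent_seq, sets[node]))
        labels[node] = seq
        if not isinstance(node, str):
            down(node[0], seq)
            down(node[1], seq)

    up(tree)
    root_seq = "".join(min(s) for s in sets[tree])  # arbitrary but deterministic
    down(tree, root_seq)
    return cost, labels

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def tree_length(node, labels):
    """Total Hamming distance summed over all edges below node."""
    if isinstance(node, str):
        return 0
    return sum(hamming(labels[node], labels[child]) + tree_length(child, labels)
               for child in node)

leaves = {"sp1": "CCGG", "sp2": "GTGG", "sp3": "GTAA", "sp4": "CTTC"}
tree = (("sp1", "sp2"), ("sp3", "sp4"))  # hypothetical topology
cost, labels = fitch(tree, leaves)
print(cost, labels[tree], labels[tree[0]], labels[tree[1]])
```

The ancestral labels are one of possibly several equally good assignments; ties are broken alphabetically here purely for determinism.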