Multiple Sequence Alignment (MSA) I519 Introduction to Bioinformatics, Fall 2012.

Slides:



Advertisements
Similar presentations
Multiple Sequence Alignment
Advertisements

COFFEE: an objective function for multiple sequence alignments
Multiple alignment: heuristics. Consider aligning the following 4 protein sequences S1 = AQPILLLV S2 = ALRLL S3 = AKILLL S4 = CPPVLILV Next consider the.
1 “INTRODUCTION TO BIOINFORMATICS” “SPRING 2005” “Dr. N AYDIN” Lecture 4 Multiple Sequence Alignment Doç. Dr. Nizamettin AYDIN
Heuristic alignment algorithms and cost matrices
Introduction to Bioinformatics Algorithms Multiple Alignment.
1 Protein Multiple Alignment by Konstantin Davydov.
Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.
Multiple Sequences Alignment Ka-Lok Ng Dept. of Bioinformatics Asia University.
CS262 Lecture 9, Win07, Batzoglou Multiple Sequence Alignments.
Introduction to Bioinformatics Algorithms Multiple Alignment.
Sequence Alignment Bioinformatics. Sequence Comparison Problem: Given two sequences S & T, are S and T similar? Need to establish some notion of similarity.
Multiple Alignment. Outline Problem definition Can we use Dynamic Programming to solve MSA? Progressive Alignment ClustalW Scoring Multiple Alignments.
Multiple Sequence Alignment. Sequence Families Most sequences are members of large families, some with the same function and others with different functions.
Multiple alignment: heuristics
Introduction to Bioinformatics Algorithms Sequence Alignment.
Practical multiple sequence algorithms Sushmita Roy BMI/CS 576 Sushmita Roy Sep 23rd, 2014.
Multiple Sequence Alignments
Multiple sequence alignment methods 1 Corné Hoogendoorn Denis Miretskiy.
CECS Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 3: Multiple Sequence Alignment Eric C. Rouchka,
Introduction to Bioinformatics Algorithms Multiple Alignment.
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Lecture 6 – 16/11/06 Multiple sequence alignment 1 Sequence analysis 2006 Multiple.
CISC667, F05, Lec8, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Multiple Sequence Alignment Scoring Dynamic Programming algorithms Heuristic algorithms.
Introduction to Bioinformatics From Pairwise to Multiple Alignment.
Introduction to Bioinformatics Algorithms Multiple Alignment.
Chapter 5 Multiple Sequence Alignment.
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
Multiple Sequence Alignment
Multiple Alignment Modified from Tolga Can’s lecture notes (METU)
Sequence Alignment and Phylogenetic Prediction using Map Reduce Programming Model in Hadoop DFS Presented by C. Geetha Jini (07MW03) D. Komagal Meenakshi.
Multiple sequence alignment
Biology 4900 Biocomputing.
Multiple Sequence Alignment
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
Practical multiple sequence algorithms Sushmita Roy BMI/CS 576 Sushmita Roy Sep 24th, 2013.
Multiple Sequence Alignment May 12, 2009 Announcements Quiz #2 return (average 30) Hand in homework #7 Learning objectives-Understand ClustalW Homework#8-Due.
Eric C. Rouchka, University of Louisville SATCHMO: sequence alignment and tree construction using hidden Markov models Edgar, R.C. and Sjolander, K. Bioinformatics.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
Using the T-Coffee Multiple Sequence Alignment Package I - Overview Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program.
Multiple Sequence Alignments Craig A. Struble, Ph.D. Department of Mathematics, Statistics, and Computer Science Marquette University.
Using Traveling Salesman Problem Algorithms to Determine Multiple Sequence Alignment Orders Weiwei Zhong.
Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.
Multiple alignment: Feng- Doolittle algorithm. Why multiple alignments? Alignment of more than two sequences Usually gives better information about conserved.
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Chapter 3 Computational Molecular Biology Michael Smith
Multiple sequence alignment
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
MUSCLE An Attractive MSA Application. Overview Some background on the MUSCLE software. The innovations and improvements of MUSCLE. The MUSCLE algorithm.
Cédric Notredame (07/11/2015) Recent Progress in Multiple Sequence Alignments: A Survey Cédric Notredame.
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
COT 6930 HPC and Bioinformatics Multiple Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.
Introduction to Bioinformatics Algorithms Multiple Alignment Lecture 20.
Aligning Sequences With T-Coffee Cédric Notredame Comparative Bioinformatics Group Bioinformatics and Genomics Program.
Sequence Alignment.
Burkhard Morgenstern Institut für Mikrobiologie und Genetik Molekulare Evolution und Rekonstruktion von phylogenetischen Bäumen WS 2006/2007.
Multiple Sequence Alignment
MGM workshop. 19 Oct 2010 Some frequently-used Bioinformatics Tools Konstantinos Mavrommatis Prokaryotic Superprogram.
Protein Sequence Alignment Multiple Sequence Alignment
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Multiple Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 13, 2004 ChengXiang Zhai Department of Computer Science University.
Multiple Sequence Alignments. The Global Alignment problem AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC x y z.
More on HMMs and Multiple Sequence Alignment BMI/CS 776 Mark Craven March 2002.
Introduction to Bioinformatics Algorithms Multiple Alignment.
Multiple sequence alignment (msa)
The ideal approach is simultaneous alignment and tree estimation.
Intro to Alignment Algorithms: Global and Local
Multiple Sequence Alignment (I)
Introduction to Bioinformatics
Presentation transcript:

Multiple Sequence Alignment (MSA) I519 Introduction to Bioinformatics, Fall 2012

Outline  Multiple sequence alignment (MSA)  Generalize DP to 3 sequence alignment –Impractical  Heuristic approaches to MSA –Progressive alignment – ClustalW (using substitution matrix based scoring function) –Consistency-based approach – T-Coffee (consistency-based scoring function) –MUSCLE (MUSCLE-fast, MUSCLE-prog): reduces time and space complexity

Alignment of 2 sequences is represented as a 2-row matrix In a similar way, we represent alignment of 3 sequences as a 3-row matrix A T _ G C G _ A _ C G T _ A A T C A C _ A Score: more conserved columns, better alignment From pairwise to multiple alignment

What’s multiple sequence alignment (MSA)  A model  Indicates relationship between residues of different sequences  Reveals similarity/disimilarity

Why we need MSA  MSA is central to many bioinformatics applications  Phylogenetic tree  Motifs  Patterns  Structure prediction (RNA, protein)

Aligning three sequences  Same strategy as aligning two sequences  Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align  For global alignments, go from source to sink source sink

2D vs 3D alignment grid V W 2D table 3D graph

DP recursion (3 edges vs 7) In 3-D, 7 edges in each unit cube Pairwise: 3 possible paths (match/mismatch, insertion, and deletion)

Architecture of 3D alignment cell (i-1,j-1,k-1) (i,j-1,k-1) (i,j-1,k) (i-1,j-1,k) (i-1,j,k) (i,j,k) (i-1,j,k-1) (i,j,k-1)

Multiple alignment: dynamic programming s i,j,k = max  (x, y, z) is an entry in the 3D scoring matrix s i-1,j-1,k-1 +  (v i, w j, u k ) s i-1,j-1,k +  (v i, w j, _ ) s i-1,j,k-1 +  (v i, _, u k ) s i,j-1,k-1 +  (_, w j, u k ) s i-1,j,k +  (v i, _, _) s i,j-1,k +  (_, w j, _) s i,j,k-1 +  (_, _, u k ) cube diagonal: no indels face diagonal: one indel edge diagonal: two indels

MSA: running time For 3 sequences of length n, the run time is 7n 3 ; O(n 3 ) For k sequences, build a k-dimensional Manhattan, with run time (2 k -1)(n k ); O(2 k n k ) Conclusion: dynamic programming approach for alignment between two sequences is easily extended to k sequences (simultaneous approach) but it is impractical due to exponential running time. Computing exact MSA is computationally almost impossible, and in practice heuristics are used (progressive alignment)

Profile representation of multiple alignment - A G G C T A T C A C C T G T A G – C T A C C A G C A G – C T A C C A G C A G – C T A T C A C – G G C A G – C T A T C G C – G G A C G T

Aligning alignments/profiles  Given two alignments, can we align them? x GGGCACTGCAT y GGTTACGTC-- Alignment 1 z GGGAACTGCAG w GGACGTACC-- Alignment 2 v GGACCT-----

Aligning alignments/profiles  Given two alignments, can we align them?  Hint: use alignment of corresponding profiles x GGGCACTGCAT y GGTTACGTC-- Combined Alignment z GGGAACTGCAG w GGACGTACC-- v GGACCT-----

Progressive alignment Progressive alignment uses guide tree Sequence weighting & scoring scheme and gap penalties Progressive alignment works well for close sequences, but deteriorates for distant sequences Gaps in consensus string are permanent Use profiles to compare sequences

ClustalW Popular multiple alignment tool today ‘W’ stands for ‘weighted’ (sequences are weighted differently). Three-step process 1.) Construct pairwise alignments 2.) Build guide tree 3.) Progressive alignment guided by the tree

Dynamic Programming Using A Substitution Matrix ClustalW algorithm

Step 1: Pairwise alignment Aligns each sequence again each other giving a similarity matrix Similarity = exact matches / sequence length (percent identity) v 1 v 2 v 3 v 4 v 1 - v v v (.17 means 17 % identical)

Step 2: Guide tree v1v3v4v2v1v3v4v2 Calculate: v 1,3 = alignment (v 1, v 3 ) v 1,3,4 = alignment((v 1,3 ),v 4 ) v 1,2,3,4 = alignment((v 1,3,4 ),v 2 ) v 1 v 2 v 3 v 4 v 1 - v v v ClustalW uses NJ to build guide tree; Guide tree roughly reflects evolutionary relations

Align ( Node N) { if ( N->left_child is a Node) A1=Align ( N->left_child) else if ( N->left_child is a Sequence) A1=N->left_child if (N->right_child is a node) A2=Align (N->right_child) else if ( N->right_child is a Sequence) A2=N->right_child Return dp_alignment (A1, A2) } AD E FG C B Step 3: Tree based recursion

Progressive alignment: Scoring scheme Scoring scheme is arguably the most influential component of the progressive algorithm Matrix-based algorithms ClustalW, MUSCLE, Kalign Use a substitution matrix to assess the cost of matching two symbols or two profiled columns Once a gap, always a gap Consistency-based schemes T-Coffee, Dialign Compile a collection of pairwise global and local alignments (primary library) and to use this collection as a position-specific substitution matrix

Substitution matrix based scoring  Sum of pairs (SP score)  Tree based scoring  Entropy score

Sum of pairs score (SP score) Seq Column-A -B 1 ……N……………N…… 2 ……N……………N…… 3 ……N……………N…… 4 ……N……………C…… 5 ……N……………C…… N NN NN N NC NC Score= 10 * S(N,N) = 10 * 6 = 60 Score= 3 * S(N,N) + 6 * S(N,C) + S(C,C) = 3 * * (-3) + 9 = 9 (BLOSUM62) Problem: over-estimation of the mutation costs (assuming each sequence is the ancestor of itself; requires a weighting scheme

Tree-based scoring N N N NN C C C “Real” tree: Cost = 1 But we do not know the tree! A A A C Star tree Cost=2 But the tree is wrong! C A

Entropy-based scoring

Entropy: Example Given a DNA sequence, what is its maximum entropy?

Alignment entropy  Define frequencies for the occurrence of each letter in each column of multiple alignment p A = 1, p T =p G =p C =0 (1 st column) p A = 0.75, p T = 0.25, p G =p C =0 (2 nd column) p A = 0.50, p T = 0.25, p C =0.25 p G =0 (3 rd column)  Compute entropy of each column AAA ACC ACG ACT An alignment with 3 columns Alignment entropy= 2.811

Consistency-based approaches  T-Coffee –M-Coffee & 3D-Coffee (Expresso)  Principle –Primary library –Library extension

T-Coffee: Primary library SeqA GARFIELD THE LAST FAT CAT SeqB GARFIELD THE FAST CAT SeqC GARFIELD THE VERY FAST CAT SeqD THE FAT CAT SeqA GARFIELD THE LAST FAT CAT SeqB GARFIELD THE FAST CAT SeqA GARFIELD THE LAST FA-T CAT SeqC GARFIELD THE VERY FAST CAT SeqA GARFIELD THE LAST FAT CAT SeqD THE ---- FAT CAT Input sequences Primary library: collection of global/local pairwise alignments SeqB GARFIELD THE ---- FAST CAT SeqC GARFIELD THE VERY FAST CAT SeqB GARFIELD THE FAST CAT SeqD THE FA-T CAT SeqC GARFIELD THE VERY FAST CAT SeqD THE ---- FA-T CAT

T-Coffee: Library extension SeqA GARFIELD THE LAST FAT CAT |||||||| ||| |||| ||| SeqB GARFIELD THE FAST CAT SeqA GARFIELD THE LAST FAT CAT |||||||| ||| |||| || SeqC GARFIELD THE VERY FAST CAT |||||||| ||| |||| ||| SeqB GARFIELD THE FAST CAT SeqA GARFIELD THE LAST FAT CAT ||| ||| ||| SeqD THE FAT CAT ||| || SeqB GARFIELD THE FAST CAT SeqA GARFIELD THE LAST FAT CAT SeqB GARFIELD THE FAST CAT SeqA GARFIELD THE LAST FA-T CAT SeqC GARFIELD THE VERY FAST CAT SeqA GARFIELD THE LAST FAT CAT SeqD THE ---- FAT CAT SeqB GARFIELD THE ---- FAST CAT SeqC GARFIELD THE VERY FAST CAT SeqB GARFIELD THE FAST CAT SeqD THE FA-T CAT SeqC GARFIELD THE VERY FAST CAT SeqD THE ---- FA-T CAT SeqA GARFIELD THE LAST FAT CAT |||||||| ||| SeqB GARFIELD THE FAST CAT SeqA GARFIELD THE LAST FA-T CAT SeqB GARFIELD THE ---- FAST CAT DP on the “consistency matrix” Triplets Extended library: new pairwise alignment (AB), (AC), (AD), (BC), (BD) and (CD) Different “weights”

T-Coffee uses progressive strategy to derive multiple alignment  Guide tree  First align the closest two sequences (DP using the weights derived from the extended library)  Align two “alignments” (using the weights from the extended library -- average over each column)  No additional parameters (gaps etc) –The substitution values (weights) are derived from extended library which already considered gaps –High scoring segments (consistent segments) enhanced by the data set to the point that they are insensitive to the gap penalties

MUSCLE: a tool for fast MSA  Initial progressive alignment followed by horizontal refinement (stochastic search for a maximum objective score –Step 1: draft progressive (using k-mer counting for fast computation of pairwise distance; tree building using UPGMA or NJ) –Step 2: Improved progressive to improve the tree and builds a new progressive alignment according to this tree (can be iterated). –Step 3: Refinement using tree-dependent restricted partitioning (each edge is deleted from the tree to divide the sequences into two disjoint subsets, from each a profile is built; the profile-profile alignment is computed, and if the score improves, retain the new alignment).  Ref: MUSCLE: a multiple sequence alignment method with reduced time and space complexity; BMC Bioinformatics 2004, 5:113

Multiple alignment: History 1975 Sankoff Formulated multiple alignment problem and gave DP solution 1988 Carrillo-Lipman Branch and Bound approach for MSA 1990 Feng-Doolittle Progressive alignment 1994 Thompson-Higgins-Gibson-ClustalW Most popular multiple alignment program 1998 DIALIGN (Segment-based multiple alignment) 2000 T-coffee (consensus-based) 2004 MUSCLE 2005 ProbCons (uses Bayesian consistency) 2006 M-Coffee (consensus meta-approach) 2006 Expresso (3D-Coffee; use structural template) 2007 PROMALS (profile-profile alignment)

Summary & references  “A majority of studies indicate that consistency- based methods are more accurate than their matrix- based counterparts, although they typically require an amount of CPU time N times higher than simpler methods (N being the number of sequences)”  bin/Tcoffee/tcoffee_cgi/index.cgi bin/Tcoffee/tcoffee_cgi/index.cgi  Recent evolutions of multiple sequence alignment algorithms. 2007, 3(8):e123  Issues in bioinformatics benchmarking: the case study of multiple sequence alignment. Nucleic Acids Res Jul 1  Chapter 6