Multiple alignment One of the most essential tools in molecular biology Finding highly conserved subregions or embedded patterns of a set of biological.

Slides:



Advertisements
Similar presentations
Multiple Alignment Anders Gorm Pedersen Molecular Evolution Group
Advertisements

A Hidden Markov Model for Progressive Multiple Alignment Ari Löytynoja and Michel C. Milinkovitch Appeared in BioInformatics, Vol 19, no.12, 2003 Presented.
1 “INTRODUCTION TO BIOINFORMATICS” “SPRING 2005” “Dr. N AYDIN” Lecture 4 Multiple Sequence Alignment Doç. Dr. Nizamettin AYDIN
. Class 5: Multiple Sequence Alignment. Multiple sequence alignment VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG.
Lecture 8: Multiple Sequence Alignment
Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.
1 CAP5510 – Bioinformatics Multiple Alignment Tamer Kahveci CISE Department University of Florida.
. Multiple Sequence Alignment Tutorial #4 © Ilan Gronau.
Multiple Alignment. Outline Problem definition Can we use Dynamic Programming to solve MSA? Progressive Alignment ClustalW Scoring Multiple Alignments.
. Multiple Sequence Alignment Tutorial #4 © Ilan Gronau.
Multiple Sequence Alignment Mult-Seq-Align allows to detect similarities which cannot be detected with Pairwise-Seq-Align methods. Detection of family.
Multiple sequence alignment
Multiple Sequence alignment Chitta Baral Arizona State University.
Multiple Sequence Alignment Mult-Seq-Align allows to detect similarities which cannot be detected with Pairwise-Seq-Align methods. Detection of family.
PAM250. M. Dayhoff Scoring Matrices Point Accepted Mutations or PAM matrices Proteins with 85% identity were used -> the function is not significantly.
Multiple Sequence Alignments
Multiple sequence alignment methods 1 Corné Hoogendoorn Denis Miretskiy.
Multiple Sequence Alignment
CECS Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 3: Multiple Sequence Alignment Eric C. Rouchka,
. Multiple Sequence Alignment Tutorial #4 © Ilan Gronau.
Phylogenetic Tree Construction and Related Problems Bioinformatics.
CISC667, F05, Lec8, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Multiple Sequence Alignment Scoring Dynamic Programming algorithms Heuristic algorithms.
Scoring a multiple alignment Sum of pairsStarTree A A C CA A A A A A A CC CC.
Chapter 5 Multiple Sequence Alignment.
Multiple Sequence Alignment S 1 = AGGTC S 2 = GTTCG S 3 = TGAAC Possible alignment A-TA-T GGGGGG G--G-- TTATTA -TA-TA CCCCCC -G--G- AG-AG- GTTGTT GTGGTG.
Multiple Sequence Alignment BMI/CS 576 Colin Dewey Fall 2010.
Multiple Alignment Modified from Tolga Can’s lecture notes (METU)
Sequence Alignment and Phylogenetic Prediction using Map Reduce Programming Model in Hadoop DFS Presented by C. Geetha Jini (07MW03) D. Komagal Meenakshi.
Multiple sequence alignment
Multiple Sequence Alignment
Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press.
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
Multiple Sequence Alignment May 12, 2009 Announcements Quiz #2 return (average 30) Hand in homework #7 Learning objectives-Understand ClustalW Homework#8-Due.
Multiple sequence alignment (MSA) Usean sekvenssin rinnastus Petri Törönen Help contributed by: Liisa Holm & Ari Löytynoja.
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Multiple Sequence Alignments Craig A. Struble, Ph.D. Department of Mathematics, Statistics, and Computer Science Marquette University.
Using Traveling Salesman Problem Algorithms to Determine Multiple Sequence Alignment Orders Weiwei Zhong.
Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.
Multiple alignment: Feng- Doolittle algorithm. Why multiple alignments? Alignment of more than two sequences Usually gives better information about conserved.
Chapter 3 Computational Molecular Biology Michael Smith
Using traveling salesman problem algorithms for evolutionary tree construction Chantal Korostensky and Gaston H. Gonnet Presentation by: Ben Snider.
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
Multiple sequence comparison (MSC) Reading: Setubal/Meidanis, 3.4 Gusfield, Algorithms on Strings, Trees and Sequences, chapter 14.
Intro to Alignment Algorithms: Global and Local Intro to Alignment Algorithms: Global and Local Algorithmic Functions of Computational Biology Professor.
COT 6930 HPC and Bioinformatics Multiple Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Burkhard Morgenstern Institut für Mikrobiologie und Genetik Molekulare Evolution und Rekonstruktion von phylogenetischen Bäumen WS 2006/2007.
Multiple Sequence Alignment
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Multiple Sequence Alignment Vasileios Hatzivassiloglou University of Texas at Dallas.
Finding Motifs Vasileios Hatzivassiloglou University of Texas at Dallas.
More on HMMs and Multiple Sequence Alignment BMI/CS 776 Mark Craven March 2002.
Multiple String Comparison – The Holy Grail. Why multiple string comparison? It is the most critical cutting-edge toοl for extracting and representing.
Pairwise alignment Now we know how to do it: How do we get a multiple alignment (three or more sequences)? Multiple alignment: much greater combinatorial.
Multiple sequence alignment (msa)
Bioinformatics Algorithms and Data Structures
Multiple Sequence Alignment
Sequence Alignment 11/24/2018.
Computational Biology Lecture #6: Matching and Alignment
Computational Biology Lecture #6: Matching and Alignment
Multiple Sequence Alignment
Intro to Alignment Algorithms: Global and Local
Sequence Based Analysis Tutorial
Multiple Sequence Alignment
Multiple Sequence Alignment (I)
Introduction to Bioinformatics
Computational Genomics Lecture #3a
Presentation transcript:

Multiple alignment One of the most essential tools in molecular biology Finding highly conserved subregions or embedded patterns of a set of biological sequences Conserved regions usually are key functional regions, prime targets for drug developments Estimation of evolutionary distance between sequences Prediction of protein secondary/tertiary structure Practically useful methods only since 1987 (D. Sankoff) Before 1987 they were constructed by hand Dynamic programming is expensive MSA gives stronger signal for sequence similarity than pairwise alignment – conserved pattern may be so dissimilar or dispersed that it cannot be detected when just two strings are aligned Build protein family – collection of proteins with similar structure, function and evolutionary history Consensus sequence – representative sequence constructed from MSA of members of protein family Used in databases Estimate evolutionary distance by assuming consensus sequence is ancestor Other uses: Sequence assembly, as in ARACHNE

. Alignment between globins (human beta globin, horse beta globin, An Alignment between globins produces by an alignment program Clustal. The proteins that appear in the alignment are human beta globin, horse beta globin, human alpha globin, horse alpha globin, cyanohaemoglobin, whale myoglobin and leghaemoglobin in that order. The boxes mark the seven alpha helices composing each globin. Alignment between globins (human beta globin, horse beta globin, human alpha globin, horse alpha globin, cyanohaemoglobin, whale myoglobin, leghaemoglobin) produced by Clustal. Boxes mark the seven alpha helices composing each globin. .

Definition Given strings x1, x2 … xk a multiple (global) alignment maps them to strings x’1, x’2 … x’k that may contain spaces where |x’1| = |x’2| = … = |xk’| The removal of all spaces from x’i leaves xi, for 1 i  k

Definitions Multiple Alignment Motif A rectangular arrangement, where each row consists of one protein sequence padded by gaps, such that the columns highlight similarity/conservation between positions Motif A conserved element of a protein sequence alignment that usually correlates with a particular function Motifs are generated from a local multiple protein sequence alignment corresponding to a region whose function or structure is known

Example of motif NAYCDEECK NAYCDKLC- -GYCN-ECT NDYC-RECR Motifs are conserved and hence predictive of any subsequent occurrence of such a structural/functional region in any other novel protein sequence

Scoring multiple alignments Ideally, a scoring scheme should Penalize variations in conserved positions higher Relate sequences by a phylogenetic tree Tree alignment Usually assume Independence of columns Quality computation Entropy-based scoring Compute the Shannon entropy of each column Minimize the total entropy Steiner string Sum-of-pairs (SP) score

Tree alignment Ideally: Find alignment that maximizes probability that sequences evolved from common ancestor x y ? z w v

Tree alignment Model the k sequences with a tree having k leaves (1 to 1 correspondence) Compute a weight for each edge, which is the similarity score Sum of all the weights is the score of the tree Assign sequences to internal nodes so that score is maximized

Tree alignment example Match +1, gap -1, mismatch 0 If x=CT and y=CG, score of 6 CAT CTG x y GT/CT = +1, CAT/CT=+1, CG/CG=+2, CTG/CG=+1,CT/CG=+1 CG GT

Analysis The tree alignment problem is NP-complete Holds even for the special case of star alignment “lifting alignment” gives a 2-approximate algorithm The generalized tree alignment problem (find the best tree) is also NP-complete Special cases for different kinds of scoring metrics Size of alphabet Triangle inequality

Consensus representations Relative frequencies of symbols in each column Adds up to 1 in each column Steiner string Minimize the consensus error May not belong to the set of input strings Consensus string for a given multiple alignment Choose optimal character in every column Consensus string is the concatenation of these characters Alignment error of a column is the distance-sum to the optimal character of all symbols in the column Alignment error of a consensus string is the sum of all column errors Optimal consensus string: optimize over all multiple alignments Signature representation Regular expression Helicase protein: [&H][&A]D[DE]xn[TSN]x4[QK]Gx7[&A] & is any amino acid in {I,L,V,M,F,Y,W}

Steiner string and consensus error metric Minimize Σ D(s,xi), over all possible strings s String smin is called the Steiner string May not belong to the set of inputs NP-complete Consensus error metric based on similarity to the steiner string center string provides an approximation factor of 2 i

Relating alignment error and consensus error Let s be the steiner string for a string set X = {xi} and c be the optimal consensus string For any multiple alignment M of X, Let xM be the consensus string Alignment error of xM = consensus error using xM ≥ consensus error using s Consider the star multiple alignment N using s Alignment error of N using s = consensus error using s Alignment error of N using s ≤ Alignment error of any multiple alignment N is the optimal multiple alignment and s (after removing gaps) is the consensus string Steiner string provides the optimal consensus string

Aligning to family representations Profile Apply dynamic programming Score depends on the profile Consensus string Signature representations Align to regular expressions / CFG/ …

Scoring Function: Sum of Pairs Definition: Induced pairwise alignment A pairwise alignment induced by the multiple alignment Example: x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG Induces: x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG

Sum of Pairs (cont’d) The sum-of-pairs (SP) score of a multiple alignment A is the sum of the scores of all induced pairwise alignments S(A) = i<j S(Aij) Aij is the induced alignment of xi, xj Drawback: no evolutionary characterization Every sequence derived from all others

Optimal solution for SP scores Multidimensional Dynamic Programming Generalization of pair-wise alignment For simplicity, assume k sequences of length n The dynamic programming array is k-dimensional hyperlattice of length n+1 (including initial gaps) The entry F(i1, …, ik) represents score of optimal alignment for s1[1..i1], … sk[1..ik] Initialize values on the faces of the hyperlattice

k=3 2k –1=7 A S V

Complexity Space complexity: O(nk) for k sequences each n long. Computing at a cell: O(2k). cost of computing δ. Time complexity: O(2knk). cost of computing δ. Finding the optimal solution is exponential in k Proven to be NP-complete for a number of cost functions

Algorithms Faster Dynamic Programming Star alignment Carrillo and Lipman 88 (CL) Pruning of hyperlattice in DP Practical for about 6 sequences of length about 200. Star alignment Progressive methods CLUSTALW PILEUP Iterative algorithms Hidden Markov Model (HMM) based methods

CL algorithm Find pairwise alignment Trial multiple alignment produced by a tree, cost = d This provides a limit to the volume within which optimal alignments are found Specifics Sequences x1,..,xr. Alignment A, score = s(A) Optimal alignment A* Aij = induced alignment on xi,..,xj on account of A D(xi,xj) = score of optimal pairwise alignment of xi,xj ≥ s(Aij )

CL algorithm d ≤ s(A*) = s(A*uv) + Σ Σ s(A*ij) ≤ s(A*uv) + Σ Σ D(xi,xj) s(A*uv) ≥ d - Σ Σ D(xi,xj) = B(u,v) Compute B(u,v) for each (u,v) pair Consider any cell f with projection (s,t) on u,v plane. If A* passes through f then A*uv passes through (s,t) beststuv = best pairwise alignment of xu,xv that passes through (s,t). beststuv = score of the prefixes up to (s,t) + cost(xsi,xsj) + score of suffixes after (s,t) i i < j (i,j) ≠ (u,v) i i < j (i,j) ≠ (u,v)

CL algorithm If beststuv < B(u,v), then A* cannot pass through cell f Discard such cells from computation of DP Can prune for all (u,v) pairs

Star alignment Heuristic method for multiple sequence alignments Select a sequence c as the center of the star For each sequence x1, …, xk such that index i  c, perform a Needleman-Wunsch global alignment Aggregate alignments with the principle “once a gap, always a gap.” Consider the case of distance (not scores) Find multiple alignment with minimum distance

Star alignment example MPE | | MKE MSKE | || M-KE S1: MPE S2: MKE S3: MSKE S4: SKE s1 s3 s2 SKE || MKE M-PE M-KE MSKE S-KE M-PE M-KE MSKE MPE MKE s4

Choosing a center Try them all and pick the one with the least distance Let D(xi,xj) be the optimal distance between sequences xi and xj. Given a multiple alignment A, let c(Aij) be the distance between xi and xj that is induced on account of A. Calculate all O(k2) alignments, and pick the sequence xi that minimizes the following as xc Σ D(xi,xj) The resulting multiple alignment A has the property that c(Aci) = D(xc,xi). j ≠ i

Analysis Assuming all sequences have length n O(k2n2) to calculate center Step i of iterative pairwise alignment takes O((i.n).n) time two strings of length n and i.n O(k2n2) overall cost Produces multiple sequence alignments whose SP values are at most twice that of the optimal solutions, provided triangle inequality holds.

Bound analysis Let M = Σ c(A1i) = Σ D(x1,xi), assume x1 is the center 2 c(A) = Σ Σ c(Aij) ≤ Σ Σ [c(A1i) + c(A1j)] = 2(k-1) Σ c(A1i) = 2(k-1) M 2 c(A*) = Σ Σ c(A*ij) ≥ Σ Σ D(xi,xj) ≥ k Σ c(A1i) = k M c(A)/c(A*) <= 2(k-1)/k <= 2 i = 2 i = 2 i j ≠ i i j ≠ i i = 2 i j ≠ i i j ≠ i i = 2

Consensus error Center string c also provides an approximation factor of 2 under consensus error (score) metric Assume triangle inequality Let E(x) denote the consensus error wrt string x. Let z be the Steiner string E(z) = Σ D(z,xi) i

Consensus error For any string y in the input set, E(y) = Σ D(y,xi) ≤ Σ [D(y,z) + D(z,xi)] = (k-2) D(y,z) + D(y,z) + Σ D(z,xi) = (k-2) D(y,z) + E(z) Pick y* from input set that is closest to z. E(z) = Σ D(z,xi) ≥ k D(y*,z) E(y*)/E(z) ≤ [(k-2) D(y*,z) +E(z)]/E(z) ≤ (k-2) D(y*,z) / [k D(y*,z)] + 1 ≤ 2-2/k <= 2 E(c) ≤ E(y*) i y ≠ xi y ≠ xi i

ClustalW Progressive alignment 3 steps: All pairs of sequences are aligned to produce a distance matrix (or a similarity matrix) A rooted guide tree is calculated from this matrix by the neighbor-joining (NJ) method Neighbor Joining – Saitou, 1987 The sequences are aligned progressively according to the branching order in the guide tree

ClustalW example S1 ALSK S2 TNSD S3 NASK S4 NTSD

ClustalW example S1 ALSK S2 TNSD S3 NASK S4 NTSD S1 S2 S3 S4 9 4 7 8 3 All pairwise alignments S1 S2 S3 S4 9 4 7 8 3 Distance Matrix

ClustalW example S1 ALSK S2 TNSD S3 NASK S4 NTSD S1 S2 S3 S4 9 4 7 8 3 All pairwise alignments S1 S2 S3 S4 9 4 7 8 3 S3 S1 S2 S4 Rooted Tree Neighbor Joining Distance Matrix

Multiple Alignment Steps ClustalW example S1 ALSK S2 TNSD S3 NASK S4 NTSD Multiple Alignment Steps Align S1 with S3 Align S2 with S4 Align (S1, S3) with (S2, S4) All pairwise alignments S1 S2 S3 S4 9 4 7 8 3 S3 S1 S2 S4 Rooted Tree Neighbor Joining Distance Matrix

Multiple Alignment Steps ClustalW example Multiple Alignment Steps -ALSK NA-SK S1 ALSK S2 TNSD S3 NASK S4 NTSD -ALSK -TNSD NA-SK NT-SD Align S1 with S3 Align S2 with S4 Align (S1, S3) with (S2, S4) -TNSD NT-SD All pairwise alignments Multiple Alignment S1 S2 S3 S4 9 4 7 8 3 Neighbor Joining Rooted Tree Distance Matrix

Other progressive approaches PILEUP Similar to CLUSTALW Uses UPGMA to produce tree

Problems with progressive alignments Depend on pairwise alignments If sequences are very distantly related, much higher likelihood of errors Care must be made in choosing scoring matrices and penalties

Iterative refinement in progressive alignment One problem of progressive alignment: Initial alignments are “frozen” even when new evidence comes Example: x: GAAGTT y: GAC-TT z: GAACTG w: GTACTG Frozen! Now clear that correct y = GA-CTT

Multiple alignment tools Clustal W (Thompson, 1994) Most popular PRRP (Gotoh, 1993) HMMT (Eddy, 1995) DIALIGN (Morgenstern, 1998) T-Coffee (Notredame, 2000) MUSCLE (Edgar, 2004) Align-m (Walle, 2004) PROBCONS (Do, 2004)

Evaluating multiple alignments Balibase benchmark (Thompson, 1999) De-facto standard for assessing the quality of a multiple alignment tool Manually refined multiple sequence alignments Quality measured by how good it matches the core blocks Clustal W performs the best Problems of Clustal W Once a gap, always a gap Order dependent

Computationally challenging problems Scalable multiple alignment Dynamic programming is exponential in number of sequences Practical for about 6 sequences of length about 200.

Quick Primer on NP completeness Polynomial-time Reductions If we could solve X in polynomial time, then we could also solve Y in polynomial time YP X Class NP Set of all problems for which there exists an efficient certifier P = NP? General transformation of checking a solution to finding a solution Can arbitrary instances of Y be solved using a polynomial number of standard computational steps plus a polynomial number of calls to a black box that solves X Efficient checking vs efficient solving

NP-completeness X is NP-complete if XNP For all YNP, YPX If X is NP-complete, X is solvable in polynomial time iff P=NP Satisfiability is NP-complete If Y is NP-complete and X is in NP with the property that YPX, then X is NP complete