Multiple Sequence Alignment

Presentation transcript:

Multiple Sequence Alignment
S1 = AGGTC
S2 = GTTCG
S3 = TGAAC
[Figure: two possible alignments of S1, S2 and S3; the column layout did not survive extraction.]

Multiple Sequence Alignment (cont)
Input: sequences S1, S2, …, Sk over the same alphabet.
Output: gapped sequences S’1, S’2, …, S’k of equal length:
  |S’1| = |S’2| = … = |S’k|
  Removing the spaces from S’i yields Si.
The sum-of-pairs (SP) score of a multiple global alignment is the sum of the scores of all pairwise alignments induced by it.

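Multiple Sequence Alignment Example
Consider the following alignment:
  AC-CDB-
  -C-ADBD
  A-BCDAD
Scoring scheme: match = 0, mismatch/indel = -1
SP score: (-3) + (-5) + (-4) = -12 (rows 1 and 2, rows 2 and 3, rows 1 and 3, respectively; a column holding a gap in both rows contributes 0)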
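The SP score of the example can be checked mechanically. The sketch below is illustrative only: the function names `pair_score` and `sp_score` are not from the slides, and the scoring parameters default to the scheme stated above.

```python
from itertools import combinations

def pair_score(row_a, row_b, match=0, mismatch=-1, indel=-1):
    """Score of the pairwise alignment that two rows of a multiple alignment induce."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue            # a gap-gap column vanishes from the induced alignment
        if a == "-" or b == "-":
            total += indel
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

def sp_score(rows, **scoring):
    """Sum-of-pairs score: add up the induced score of every pair of rows."""
    return sum(pair_score(a, b, **scoring) for a, b in combinations(rows, 2))

alignment = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print(sp_score(alignment))  # -12
```

The pairs are visited in the order (1,2), (1,3), (2,3), so the individual terms come out as -3, -4, -5.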

Multiple Sequence Alignment Complexity
Given k strings of length n, a generalization of the DP algorithm finds an optimal SP alignment:
  Instead of a 2-dimensional table we have a k-dimensional table.
  Each dimension is of length n+1.
  Each entry depends on $2^k - 1$ adjacent entries.
Complexity: $O(2^k n^k)$
This problem is known to be NP-hard (no polynomial-time algorithm is known, and none exists unless P = NP).
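To see where the two factors of the bound come from, the following sketch enumerates the entries a cell of the k-dimensional table depends on and counts the table size. The helper names `predecessors` and `table_entries` are made up for this illustration; no actual alignment is computed.

```python
from itertools import product

def predecessors(cell):
    """Cells that a DP entry depends on: step back by 0 or 1 in every
    dimension, excluding the all-zero step and cells outside the table."""
    result = []
    for step in product((0, 1), repeat=len(cell)):
        if not any(step):
            continue                      # the all-zero step is the cell itself
        prev = tuple(c - s for c, s in zip(cell, step))
        if min(prev) >= 0:
            result.append(prev)
    return result

def table_entries(n, k):
    """Number of entries in a k-dimensional table with n+1 cells per dimension."""
    return (n + 1) ** k

print(len(predecessors((3, 3, 3))))   # 7, i.e. 2**3 - 1
print(table_entries(100, 3))          # 1030301
```

Interior cells have exactly 2^k - 1 predecessors; cells on the boundary of the table have fewer.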

Multiple Sequence Alignment Approximation Algorithm
We use cost instead of score → find an alignment of minimal cost.
Assumption: the cost function δ is a distance function:
  δ(x,x) = 0
  δ(x,y) = δ(y,x) ≥ 0
  δ(x,y) + δ(y,z) ≥ δ(x,z) (triangle inequality; e.g. cost of a mismatch ≤ cost of two indels)
D(S,T) - cost of a minimum global alignment between S and T
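As a concrete instance of D(S,T), here is a minimal sketch of the standard pairwise DP under one cost function that satisfies the assumptions above: matches cost 0, mismatches and indels cost 1 by default. The name `alignment_cost` and the unit costs are assumptions for illustration, not taken from the slides.

```python
def alignment_cost(s, t, mismatch=1, indel=1):
    """D(s, t): cost of a minimum global alignment, in O(|s|*|t|) time,
    keeping only two rows of the DP table."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * indel]
        for j in range(1, len(t) + 1):
            substitute = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch)
            curr.append(min(substitute, prev[j] + indel, curr[j - 1] + indel))
        prev = curr
    return prev[-1]

print(alignment_cost("kitten", "sitting"))  # 3
```

With mismatch ≤ 2·indel, as here, D is symmetric and obeys the triangle inequality, which is what the approximation argument relies on.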

Multiple Sequence Alignment Approximation Algorithm
The ‘star’ algorithm:
Input: Γ - set of k strings S1, …, Sk.
1. Find the string S’ (center) that minimizes $\sum_{S \in \Gamma} D(S', S)$.
2. Denote S1 = S’ and the rest of the strings as S2, …, Sk.
3. Iteratively add S2, …, Sk to the alignment as follows:
   a. Suppose S1, …, Si-1 are already aligned as S’1, …, S’i-1.
   b. Align Si to S’1 to produce S’i and S’’1 aligned.
   c. Adjust S’2, …, S’i-1 by adding spaces where spaces were added to S’’1.
   d. Replace S’1 by S’’1.
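The steps of the star algorithm can be sketched as follows. This is an illustrative implementation under assumptions not stated on the slide: unit mismatch and indel costs, input strings that contain no gap characters, and the helper names `delta`, `align` and `star_alignment`. Spaces already present in the center are treated as ordinary symbols with δ(-,-) = 0.

```python
GAP = "-"

def delta(a, b, mismatch=1, indel=1):
    """Column cost; delta(x, x) = 0 for every symbol, including delta(-, -) = 0."""
    if a == b:
        return 0
    return indel if GAP in (a, b) else mismatch

def align(s, t):
    """Optimal global alignment of s (may contain gaps) and t.
    Returns (cost, columns); a column is a pair (i, j) of indices into s and t,
    with None marking a space on that side."""
    n, m = len(s), len(t)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + delta(s[i - 1], GAP)
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + delta(GAP, t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + delta(s[i - 1], t[j - 1]),
                             cost[i - 1][j] + delta(s[i - 1], GAP),
                             cost[i][j - 1] + delta(GAP, t[j - 1]))
    columns, i, j = [], n, m
    while i > 0 or j > 0:                 # trace back one optimal path
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + delta(s[i - 1], t[j - 1]):
            columns.append((i - 1, j - 1)); i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + delta(s[i - 1], GAP):
            columns.append((i - 1, None)); i -= 1
        else:
            columns.append((None, j - 1)); j -= 1
    columns.reverse()
    return cost[n][m], columns

def star_alignment(strings):
    """Rows of the star alignment; the center string is row 0."""
    k = len(strings)
    center = min(range(k), key=lambda a: sum(align(strings[a], s)[0] for s in strings))
    rows = [strings[center]]
    for idx, s in enumerate(strings):
        if idx == center:
            continue
        _, columns = align(rows[0], s)    # align Si to the gapped center S'1
        # Re-index every existing row by the new columns: new columns become spaces.
        rows = ["".join(GAP if i is None else row[i] for i, _ in columns) for row in rows]
        rows.append("".join(GAP if j is None else s[j] for _, j in columns))
    return rows

for row in star_alignment(["AGGTC", "GTTCG", "TGAAC"]):
    print(row)
```

Returning column index pairs from `align`, rather than two gapped strings, makes it unambiguous which columns are new, so the "add spaces where spaces were added" step becomes a simple re-indexing of the existing rows.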

Multiple Sequence Alignment Approximation Algorithm
Time analysis:
  Choosing S1: execute DP for all sequence pairs - $O(k^2 n^2)$
  Adding Si to the alignment: execute DP for Si, S’1 - $O(i \cdot n^2)$ (in the i-th stage the length of S’1 can be up to i·n)
  Total complexity: $O(k^2 n^2) + \sum_{i=2}^{k} O(i \cdot n^2) = O(k^2 n^2)$

Multiple Sequence Alignment Approximation Algorithm
Approximation ratio:
  M* - optimal alignment
  M - the alignment produced by this algorithm
  d(i,j) - the distance M induces on the pair Si, Sj
For all i: d(1,i) = D(S1,Si) (we perform an optimal alignment between S’1 and Si, and δ(-,-) = 0)

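Multiple Sequence Alignment Approximation Algorithm
Approximation ratio (D(M) is the sum-of-pairs cost of M; d*(i,j) is the distance M* induces on the pair Si, Sj):
  Triangle inequality:
    $2 D(M) = \sum_{i=1}^{k} \sum_{j \ne i} d(i,j) \le \sum_{i=1}^{k} \sum_{j \ne i} \left[ d(1,i) + d(1,j) \right] = 2(k-1) \sum_{i=2}^{k} D(S_1, S_i)$
  Definition of S1:
    $2 D(M^*) = \sum_{i=1}^{k} \sum_{j \ne i} d^*(i,j) \ge \sum_{i=1}^{k} \sum_{j \ne i} D(S_i, S_j) \ge k \sum_{i=2}^{k} D(S_1, S_i)$
  Hence:
    $\frac{D(M)}{D(M^*)} \le \frac{2(k-1)}{k} < 2$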

Multiple Sequence Alignment Reminder
The ‘star’ algorithm:
Input: Γ - set of k strings S1, …, Sk.
  Find the string S1 (center) that minimizes $\sum_{i=2}^{k} D(S_1, S_i)$.
  Iteratively add S2, …, Sk to the alignment.
Finds an MA costing at most twice the optimal cost!
Problem: conventional MA does not correctly model evolutionary relationships.

Tree Alignment
Input:
  X - set of sequences
  T - phylogenetic tree on X (leaves labeled by X)
Output: labels on the internal vertices of T such that the sum of the costs of all edges of T is minimal.
How do we label internal vertices?
  Sequences
  Profiles (multiple alignments)

Profile Alignment
A profile of an MA of length n over alphabet Σ is a (|Σ|+1) × n table. Column i holds the distribution of Σ (and gap) in that position.
[Figure: an example alignment over {A, T, G, C, -} and its profile table; the table values did not survive extraction.]
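A profile as defined above can be built directly from the rows of an alignment. The sketch below is illustrative: the function name `build_profile`, the dictionary layout and the three example rows are assumptions, since the slide's own example table is not recoverable.

```python
from collections import Counter

def build_profile(rows, alphabet="ACGT", gap="-"):
    """Return a (|alphabet|+1) x n table as a dict: symbol -> list of
    per-column relative frequencies."""
    n = len(rows[0])
    if any(len(row) != n for row in rows):
        raise ValueError("all rows of an alignment must have equal length")
    symbols = alphabet + gap
    profile = {sym: [0.0] * n for sym in symbols}
    for col in range(n):
        counts = Counter(row[col] for row in rows)
        for sym in symbols:
            profile[sym][col] = counts[sym] / len(rows)
    return profile

# Hypothetical alignment, not the one from the slide.
profile = build_profile(["AC-T", "A-GT", "ACGT"])
for sym, freqs in profile.items():
    print(sym, [round(f, 2) for f in freqs])
```

Each column of the table sums to 1, because every row contributes exactly one symbol (a letter or a gap) to every column.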

Profile Alignment
Aligning a sequence to a profile:
  Matching a letter to a position: weighted average of scores
  Indels: introducing new columns gets special consideration
(The same goes for aligning two profiles.)
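The "weighted average" for matching a letter to a profile position can be sketched as below. This is only one plausible reading: it uses unit costs rather than a score matrix, represents a profile column as a dict from symbol to frequency, and the names `letter_vs_column` and `column_vs_column` are invented for the example. The special handling of new columns mentioned on the slide is not modeled.

```python
GAP = "-"

def delta(a, b, mismatch=1, indel=1):
    """Assumed unit-cost function with delta(x, x) = 0, including delta(-, -) = 0."""
    if a == b:
        return 0
    return indel if GAP in (a, b) else mismatch

def letter_vs_column(letter, column):
    """Cost of placing `letter` against a profile column: the average of
    delta(letter, symbol), weighted by the column's symbol frequencies."""
    return sum(freq * delta(letter, sym) for sym, freq in column.items())

def column_vs_column(col_a, col_b):
    """Same idea for two profile columns: average over both distributions."""
    return sum(freq * letter_vs_column(sym, col_b) for sym, freq in col_a.items())

column = {"A": 0.5, "C": 0.25, GAP: 0.25}   # hypothetical profile column
print(letter_vs_column("A", column))        # 0.5
print(letter_vs_column(GAP, column))        # 0.75
```

Placing a gap against the column, as in the second call, is how an indel against an existing profile position would be charged under this reading.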

Clustal Algorithm
Iteratively constructs an MA for intermediate nodes.
At each point holds profiles for all leaves:
  Chooses the closest pair of neighbors
    neighbors - have a common father in T
    distance - cost of an optimal (pairwise) alignment
  Aligns the two profiles to get the ‘father-profile’
  Replaces the two leaves with their father
Analysis:
  Initialization - $O(k^2)$ alignments
  k-1 iterations
  Iteration i involves k-i-1 new pairwise alignments
ClustalW - a more advanced version in which sequences/profiles are weighted.

Lifted Tree Alignments
Each internal node is labeled by one of the labels of its daughters.
Internal nodes are sequences and not profiles.
Example: [Figure: example tree with nodes labeled S1, S2, S3, S4, S6, S5; the layout did not survive extraction.]
We’ll show:
  A DP algorithm for optimal lifted tree alignment
  An optimal lifted alignment is a 2-approximation of the optimal tree alignment

Lifted Tree Alignments Algorithm
Input:
  X - set of sequences
  T - phylogenetic tree on X (leaves labeled by X)
Output: lifted labels on the internal vertices of T such that the sum of the costs of all edges of T is minimal.
Basic principle: calculate, for every node v in T and every sequence S in X:
  d(v,S) - the optimal cost of v’s subtree when v is labeled by S
The cost of the optimal tree is $\min_{S \in X} d(r, S)$, where r is the root of T.

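Lifted Tree Alignments Algorithm
d(v,S) - the optimal cost of v’s subtree when v is labeled by S
Initialization: for a leaf v labeled Sv: $d(v, S_v) = 0$ and $d(v, S) = \infty$ for $S \ne S_v$.
Recurrence: for an internal node v with daughters u1, …, ul, where S is lifted from daughter uj:
  $d(v, S) = d(u_j, S) + \sum_{i \ne j} \min_{S'} \left[ D(S, S') + d(u_i, S') \right]$
  ($d(v, S) = \infty$ if S labels no leaf in v’s subtree)
Correctness: check the optimal-substructure property (an optimal labeling of v’s subtree restricts to optimal labelings of the daughters’ subtrees).
Complexity:
  $O(k^2)$ pairwise alignments - $O(n^2 k^2)$
  k-1 iterations; for an internal node v - $O(k_v^2)$ work, where $k_v$ is the number of leaves in v’s subtree
  Total: $O(k^2 (n^2 + \mathrm{depth}(T)))$, where $O(k^2 \cdot \mathrm{depth}(T)) = O(k^3)$ in the worst case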
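A bottom-up computation of d(v,S) might look like the sketch below. It is a reconstruction under stated assumptions rather than the slide's own pseudocode: the tree is a dict mapping each internal node to its list of daughters, D is unit-cost edit distance, leaf labels are distinct, and d(v,S) is stored only for sequences S that label a leaf below v (the only finite entries in a lifted labeling). All names are hypothetical.

```python
import math

def edit_distance(s, t):
    """Stand-in for D(S, T): unit-cost global alignment."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            curr.append(min(prev[j - 1] + (a != b), prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]

def lifted_tree_alignment(children, root, leaf_label, dist=edit_distance):
    """Return (optimal lifted cost, dict node -> label)."""
    d, choice = {}, {}

    def solve(v):
        kids = children.get(v, [])
        if not kids:                       # leaf: only its own label is allowed
            d[v] = {leaf_label[v]: 0}
            return
        for u in kids:
            solve(u)
        d[v], choice[v] = {}, {}
        for j, source in enumerate(kids):  # S is lifted from daughter `source`
            for S in d[source]:
                total, picks = d[source][S], []
                for i, u in enumerate(kids):
                    if i == j:
                        picks.append(S)    # the source daughter keeps label S
                        continue
                    best = min(d[u], key=lambda T: dist(S, T) + d[u][T])
                    total += dist(S, best) + d[u][best]
                    picks.append(best)
                if total < d[v].get(S, math.inf):
                    d[v][S], choice[v][S] = total, picks

    def assign(v, S, labels):              # recover the labels top-down
        labels[v] = S
        for u, T in zip(children.get(v, []), choice.get(v, {}).get(S, [])):
            assign(u, T, labels)

    solve(root)
    best_root = min(d[root], key=d[root].get)
    labels = {}
    assign(root, best_root, labels)
    return d[root][best_root], labels

tree = {"r": ["x", "c"], "x": ["a", "b"]}
leaves = {"a": "AA", "b": "AB", "c": "BB"}
print(lifted_tree_alignment(tree, "r", leaves))
```

For brevity the sketch recomputes `dist` on demand; caching the pairwise distances once up front would give the same result and correspond to the separate pairwise-alignment term in the complexity analysis.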

Lifted Tree Alignments Approximation analysis
Claim: an optimal LTA 2-approximates general tree alignments.
We’ll show the construction of an LTA which costs at most twice the optimal TA with sequence-labeled nodes (can this be generalized to profile-labeled nodes?).
Notations:
  T* - optimal TA labels
  Sv* - label of node v in T*
  TL - our constructed LTA
  SvL - label of node v in TL

Lifted Tree Alignments Approximation analysis
Construction: we label the nodes bottom-up. For a node v with daughters u1, …, ul, we choose the label (from Su1L, …, SulL) closest to Sv*.
We need to show: D(TL) ≤ 2D(T*)

Lifted Tree Alignments Approximation analysis
Some edges in TL have cost 0.
Observe an edge (v,u) of cost > 0:
  Si - label of the father (v)
  Sj - label of the daughter (u)
  P(v,u) - the path in T* from v to the leaf labeled by Sj
  D(Si,Sj) ≤ D(Si,Sv*) + D(Sj,Sv*)   (triangle inequality)
           ≤ 2D(Sj,Sv*)              (choice of i)
           ≤ 2D(P(v,u))              (triangle inequality)

Lifted Tree Alignments Approximation analysis
D(Si,Sj) ≤ 2D(P(v,u))
If (v,u) and (v’,u’) are two different edges with cost > 0 in TL, then P(v,u) and P(v’,u’) are edge-disjoint.
Summing over all edges of positive cost in TL therefore gives D(TL) ≤ 2D(T*). Q.E.D.
Final remarks:
  The lifted tree alignment TL is only conceptual (we don’t have T*).
  The optimal LTA cannot cost more than TL.
  In the case of profile-labeled nodes, the construction and analysis still hold as long as the cost is a distance function.