Multiple Sequence Alignment

Presentation transcript:

Multiple Sequence Alignment
$S_1$ = AGGTC
$S_2$ = GTTCG
$S_3$ = TGAAC
Possible alignment:
  AGGT-C-
  -G-TTCG
  TG-AAC-
Possible alignment:
  AGGT-C-
  GTT--CG
  -TGAAAC

Multiple Sequence Alignment (cont)
Input: sequences $S_1, S_2, \dots, S_k$ over the same alphabet
Output: gapped sequences $S'_1, S'_2, \dots, S'_k$ of equal length
1. $|S'_1| = |S'_2| = \dots = |S'_k|$
2. Removing the spaces from $S'_i$ yields $S_i$
The sum-of-pairs (SP) score of a multiple global alignment is the sum of the scores of all pairwise alignments induced by it.

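Multiple Sequence Alignment Example
Consider the following alignment:
  AC-CDB-
  -C-ADBD
  A-BCDAD
Scoring scheme: match = 0, mismatch/indel = -1 (a column pairing two spaces scores 0)
SP score: (-3) + (-4) + (-5) = -12, summing the induced pairwise scores of rows (1,2), (1,3) and (2,3)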
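The example above can be checked mechanically. The following is a minimal sketch, assuming the alignment is given as equal-length strings with `-` for a space; the function names `pair_score` and `sp_score` are illustrative and not part of the original slides.

```python
from itertools import combinations

def pair_score(a, b, match=0, mismatch=-1, indel=-1):
    """Score one column of an induced pairwise alignment."""
    if a == "-" and b == "-":
        return 0  # a column of two spaces does not count in the induced alignment
    if a == "-" or b == "-":
        return indel
    return match if a == b else mismatch

def sp_score(rows):
    """Sum-of-pairs score of a multiple alignment given as equal-length rows."""
    assert len({len(r) for r in rows}) == 1, "rows must have equal length"
    return sum(
        pair_score(a, b)
        for r1, r2 in combinations(rows, 2)   # every pair of rows
        for a, b in zip(r1, r2)               # every column of that pair
    )

alignment = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print(sp_score(alignment))  # -12
```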

Multiple Sequence Alignment Complexity
Given k strings of length n, there is a generalization of the DP algorithm that finds an optimal SP alignment:
- Instead of a 2-dimensional table we have a k-dimensional table
- Each dimension is of length n+1
- Each entry depends on $2^k - 1$ adjacent entries
Complexity: $O(2^k n^k)$
This problem is known to be NP-hard (no polynomial-time algorithm is known)
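To make the $2^k - 1$ figure concrete, here is a small sketch that enumerates the neighbouring entries a cell of the k-dimensional table reads. Each neighbour steps back by 0 or 1 in every dimension, excluding the all-zero step. The helper names are hypothetical.

```python
from itertools import product

def predecessor_offsets(k):
    """All nonzero 0/1 offset vectors: the cells one k-dimensional DP entry reads."""
    return [d for d in product((0, 1), repeat=k) if any(d)]

def table_cells(k, n):
    """Number of entries in a k-dimensional table with n + 1 cells per dimension."""
    return (n + 1) ** k

for k in (2, 3, 4):
    # number of strings, neighbours per entry, table size for n = 10
    print(k, len(predecessor_offsets(k)), table_cells(k, 10))
```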

Multiple Sequence Alignment Approximation Algorithm
We use cost instead of score: find an alignment of minimal cost.
Assumption: the cost function δ is a distance function:
- δ(x,x) = 0
- δ(x,y) = δ(y,x) ≥ 0
- δ(x,y) + δ(y,z) ≥ δ(x,z) (triangle inequality; e.g. cost of a mismatch ≤ cost of two indels)
D(S,T): the cost of a minimum-cost global alignment between S and T
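As an illustration of D(S,T), the sketch below computes the minimum global alignment cost with the usual two-row dynamic program. The unit-cost `delta` (0 for equal symbols, 1 otherwise, with `-` as the space) is an assumption chosen because it satisfies the three distance properties; the slides do not fix a particular δ.

```python
def delta(x, y):
    """Unit-cost distance on characters and the space '-' (an assumed example of δ)."""
    return 0 if x == y else 1

def align_cost(s, t, delta=delta):
    """D(S, T): cost of a minimum-cost global alignment of s and t."""
    # prev[j] holds the cost of aligning the current prefix of s with t[:j]
    prev = [0] * (len(t) + 1)
    for j in range(1, len(t) + 1):
        prev[j] = prev[j - 1] + delta("-", t[j - 1])
    for i in range(1, len(s) + 1):
        cur = [prev[0] + delta(s[i - 1], "-")] + [0] * len(t)
        for j in range(1, len(t) + 1):
            cur[j] = min(
                prev[j - 1] + delta(s[i - 1], t[j - 1]),  # match / mismatch
                prev[j] + delta(s[i - 1], "-"),           # s[i-1] against a space
                cur[j - 1] + delta("-", t[j - 1]),        # t[j-1] against a space
            )
        prev = cur
    return prev[-1]

print(align_cost("AGGTC", "GTTCG"))
```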

Multiple Sequence Alignment Approximation Algorithm
The ‘star’ algorithm:
Input: Γ, a set of k strings $S_1, \dots, S_k$.
1. Find the string S' (the center) that minimizes $\sum_{S \in \Gamma} D(S', S)$
2. Denote $S_1 = S'$ and the rest of the strings as $S_2, \dots, S_k$
3. Iteratively add $S_2, \dots, S_k$ to the alignment as follows:
   a. Suppose $S_1, \dots, S_{i-1}$ are already aligned as $S'_1, \dots, S'_{i-1}$
   b. Align $S_i$ to $S'_1$ to produce $S'_i$ and $S''_1$, aligned
   c. Adjust $S'_2, \dots, S'_{i-1}$ by adding spaces where spaces were added to $S''_1$
   d. Replace $S'_1$ by $S''_1$
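The steps above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: it uses a unit-cost δ with δ(-,-) = 0, represents a space as `-`, and the names `unit_cost`, `align` and `star_alignment` are hypothetical. `align` reports where new spaces were inserted into the center so that steps 3c and 3d can copy them into the rows already aligned.

```python
def unit_cost(x, y):
    """An assumed δ: unit-cost distance, so δ('-', '-') = 0."""
    return 0 if x == y else 1

def align(s, t, cost=unit_cost):
    """Optimal global alignment of s (may already contain '-') with t.

    Returns (total cost, columns); each column is (char of s, char of t),
    with None in place of s's char where a new space was inserted into s.
    """
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + cost(s[i - 1], "-")
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + cost("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + cost(s[i - 1], t[j - 1]),
                          d[i - 1][j] + cost(s[i - 1], "-"),
                          d[i][j - 1] + cost("-", t[j - 1]))
    cols, i, j = [], n, m
    while i > 0 or j > 0:  # trace one optimal path back to the origin
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + cost(s[i - 1], t[j - 1]):
            cols.append((s[i - 1], t[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + cost(s[i - 1], "-"):
            cols.append((s[i - 1], "-")); i -= 1
        else:
            cols.append((None, t[j - 1])); j -= 1
    return d[n][m], cols[::-1]

def star_alignment(strings, cost=unit_cost):
    """Center-star alignment of a list of strings; returns rows, center first."""
    k = len(strings)
    # Step 1: the center minimizes the sum of pairwise costs to all strings.
    c = min(range(k), key=lambda a: sum(align(strings[a], s, cost)[0] for s in strings))
    rows = [strings[c]]                       # rows[0] plays the role of S'_1
    for s in strings[:c] + strings[c + 1:]:
        _, cols = align(rows[0], s, cost)     # step 3b
        new_rows, new_s, pos = [[] for _ in rows], [], 0
        for center_char, x in cols:
            if center_char is None:           # new space in the center:
                for r in new_rows:            # add it to every row (steps 3c, 3d)
                    r.append("-")
            else:                             # existing column: copy it over
                for r, old in zip(new_rows, rows):
                    r.append(old[pos])
                pos += 1
            new_s.append(x)
        rows = ["".join(r) for r in new_rows] + ["".join(new_s)]
    return rows

for row in star_alignment(["AGGTC", "GTTCG", "TGAAC"]):
    print(row)
```

Ties between equally good centers or equally good pairwise alignments are broken arbitrarily here, so the printed alignment is one of several valid outputs.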

Multiple Sequence Alignment Approximation Algorithm
Time analysis:
- Choosing $S_1$: execute DP for all sequence pairs, $O(k^2 n^2)$
- Adding $S_i$ to the alignment: execute DP for $S_i$ and $S'_1$, $O(i \cdot n^2)$ (in the i-th stage the length of $S'_1$ can be up to $i \cdot n$)
Total complexity: $O(k^2 n^2) + \sum_{i=2}^{k} O(i \cdot n^2) = O(k^2 n^2)$

Multiple Sequence Alignment Approximation Algorithm
Approximation ratio:
- M*: the optimal alignment
- M: the alignment produced by this algorithm
- d(i,j): the distance M induces on the pair $S_i$, $S_j$
For all i: $d(1,i) = D(S_1, S_i)$ (we perform an optimal alignment between $S'_1$ and $S_i$, and δ(-,-) = 0)

Multiple Sequence Alignment Approximation Algorithm
Approximation ratio: [derivation not preserved; its steps are annotated "definition of $S_1$" and "triangle inequality"]
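For reference, the standard argument that uses exactly these two ingredients runs as follows. This is a reconstruction and may differ in form from the slide. The symbols $D(M)$ for the SP cost of an alignment and $d^*(i,j)$ for the distance $M^*$ induces on $S_i, S_j$ are notation assumed here; sums over $i \ne j$ range over ordered pairs.

```latex
\begin{align*}
2\,D(M) &= \sum_{i \ne j} d(i,j)
  \le \sum_{i \ne j} \bigl( d(i,1) + d(1,j) \bigr)
  && \text{(triangle inequality)} \\
  &= 2(k-1) \sum_{i=2}^{k} D(S_1, S_i)
  && \text{(since } d(1,i) = D(S_1,S_i)\text{)} \\[4pt]
2\,D(M^*) &= \sum_{i \ne j} d^*(i,j)
  \ge \sum_{i=1}^{k} \sum_{j=1}^{k} D(S_i, S_j)
  \ge k \sum_{j=2}^{k} D(S_1, S_j)
  && \text{(definition of } S_1\text{)} \\[4pt]
\frac{D(M)}{D(M^*)} &\le \frac{2(k-1)}{k} < 2
\end{align*}
```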

Multiple Sequence Alignment Reminder
$S_1$ = AGGTC
$S_2$ = GTTCG
$S_3$ = TGAAC
Possible alignment:
  AGGT-C-
  -G-TTCG
  TG-AAC-
Possible alignment:
  AGGT-C-
  GTT--CG
  -TGAAAC

Multiple Sequence Alignment Reminder
Input: sequences $S_1, S_2, \dots, S_k$ over the same alphabet
Output: gapped sequences $S'_1, S'_2, \dots, S'_k$ of equal length
1. $|S'_1| = |S'_2| = \dots = |S'_k|$
2. Removing the spaces from $S'_i$ yields $S_i$
The sum-of-pairs (SP) score of a multiple global alignment is the sum of the scores of all pairwise alignments induced by it.

Multiple Sequence Alignment Reminder
The ‘star’ algorithm:
Input: Γ, a set of k strings $S_1, \dots, S_k$.
1. Find the string $S_1$ (the center) that minimizes $\sum_{S \in \Gamma} D(S_1, S)$
2. Iteratively add $S_2, \dots, S_k$ to the alignment
Finds an MA costing at most twice the optimal cost!
Problem: conventional MA does not correctly model evolutionary relationships

Tree Alignment
Input:
- X: a set of sequences
- T: a phylogenetic tree on X (leaves labeled by X)
Output: labels on the internal vertices of T such that the sum of the costs of all edges of T is minimal.
How do we label internal vertices?
- Sequences
- Profiles (multiple alignments)

Profile Alignment
A profile of an MA of length n over alphabet Σ is a (|Σ|+1) × n table. Column i holds the distribution of Σ (and the gap) in that position.
Example alignment:
  AGGT-C-
  -G-TTCG
  TG-AAC-
[Table: the profile of this alignment, with rows A, T, G, C; entries not preserved]
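A profile can be computed directly from the rows of an alignment. The sketch below is illustrative only: it assumes the alignment is given as equal-length strings with `-` as the gap, and that the three rows are the first example alignment of $S_1, S_2, S_3$ read row by row. The function name `profile` is hypothetical.

```python
from collections import Counter

def profile(rows, alphabet="ATGC"):
    """Return {symbol: [frequency in each column]} for the alphabet plus '-'."""
    symbols = alphabet + "-"
    k = len(rows)
    columns = [Counter(col) for col in zip(*rows)]  # per-column symbol counts
    return {x: [c[x] / k for c in columns] for x in symbols}

rows = ["AGGT-C-", "-G-TTCG", "TG-AAC-"]
for symbol, freqs in profile(rows).items():
    print(symbol, " ".join(f"{f:.2f}" for f in freqs))
```

The result has |Σ|+1 = 5 rows and n = 7 columns, matching the table shape described above.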

Profile Alignment
Aligning a sequence to a profile:
- Matching a letter to a position: weighted average of scores
- Indels: introducing new columns gets special consideration
(The same goes for aligning two profiles.)
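The "weighted average of scores" for matching a letter to a profile position can be sketched as below. The column is assumed to be a mapping from each symbol (including `-`) to its frequency, and `unit_score` is an assumed scoring scheme (0 for a match, -1 otherwise), not one prescribed by the slides.

```python
def unit_score(x, y):
    """Assumed pairwise score: 0 for a match, -1 for a mismatch or indel."""
    return 0 if x == y else -1

def column_score(letter, column, score=unit_score):
    """Score of `letter` against one profile column.

    Each symbol's pairwise score is weighted by its frequency in the column.
    """
    return sum(freq * score(letter, symbol) for symbol, freq in column.items())

# A column holding T, T, A (frequencies out of three sequences).
column = {"A": 1 / 3, "T": 2 / 3, "G": 0.0, "C": 0.0, "-": 0.0}
print(column_score("T", column))
print(column_score("G", column))
```

The special handling of indels (new columns) mentioned above is not covered by this sketch.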

Clustal Algorithm
Iteratively constructs an MA for intermediate nodes:
- At each point holds profiles for all leaves
- Chooses the closest pair of neighbors
  - neighbors: have a common father in T
  - distance: cost of the optimal (pairwise) alignment
- Aligns the two profiles to get the ‘father-profile’
- Replaces the two leaves with their father
Analysis:
- Initialization: $O(k^2)$ alignments
- k-1 iterations
- Iteration i involves k-i-1 new pairwise alignments
ClustalW is a more advanced version: sequences/profiles are weighted.
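Adding up the counts in the analysis gives the total number of pairwise alignments. The worked sum below assumes that initialization aligns every one of the $\binom{k}{2}$ pairs exactly once, which the slide states only as $O(k^2)$.

```latex
\[
\binom{k}{2} + \sum_{i=1}^{k-1} (k - i - 1)
  = \frac{k(k-1)}{2} + \frac{(k-1)(k-2)}{2}
  = (k-1)^2 = O(k^2)
\]
```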

Lifted Tree Alignments
Lifted tree alignment: each internal node is labeled by one of the labels of its daughters. Internal nodes are sequences and not profiles.
Example: [Figure: a tree with leaves $S_1$, $S_2$, $S_3$, $S_4$, $S_6$, $S_5$ whose internal nodes carry the lifted labels $S_2$, $S_4$, $S_4$, $S_5$]
We'll show:
1. A DP algorithm for optimal lifted tree alignment
2. An optimal lifted alignment is a 2-approximation of the optimal tree alignment

Lifted Tree Alignments Algorithm
Input:
- X: a set of sequences
- T: a phylogenetic tree on X (leaves labeled by X)
Output: lifted labels on the internal vertices of T such that the sum of the costs of all edges of T is minimal.
Basic principle: calculate, for every node v in T and every sequence S in X, d(v,S): the optimal cost of v's subtree when v is labeled by S.
The cost of the optimal tree is $\min_{S \in X} d(r, S)$, where r is the root of T.

Lifted Tree Alignments Algorithm
d(v,S): the optimal cost of v's subtree when v is labeled by S
Initialization: for a leaf v labeled $S_v$: [formula not preserved]
Recurrence: for an internal node v with daughters $u_1, \dots, u_l$: [formula not preserved]
Correctness: check the optimal-substructure property
Complexity:
- $O(k^2)$ pairwise alignments: $O(n^2 k^2)$
- k-1 iterations; for an internal node v, $O(k_v^2)$ work: $O(k^2 \cdot depth(T)) = O(k^3)$
- Total: $O(k^2 (n^2 + depth(T)))$
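One way to realize this dynamic program is sketched below. It is a hedged illustration, not the formula from the slide: a leaf gets cost 0 for its own sequence; an internal node labeled S keeps S on the daughter it was lifted from (edge cost 0) and, for every other daughter, takes the cheapest label available in that daughter's subtree. The tree encoding (a leaf is a string, an internal node is a tuple of children), the names `lifted_costs` and `lifted_tree_cost`, and the unit-cost `edit_distance` standing in for D are all assumptions.

```python
def edit_distance(s, t):
    """Unit-cost global alignment cost, standing in for D(S, T)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (a != b), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]

def lifted_costs(node, D=edit_distance):
    """Return {S: d(node, S)} for every leaf sequence S below node."""
    if isinstance(node, str):           # leaf: labeled by its own sequence
        return {node: 0}
    child_tables = [lifted_costs(child, D) for child in node]
    table = {}
    for j, own in enumerate(child_tables):
        for S, cost_below in own.items():    # lift S from daughter j
            total = cost_below               # daughter j keeps S: edge cost 0
            for i, other in enumerate(child_tables):
                if i != j:                   # best label for every other daughter
                    total += min(D(S, T) + c for T, c in other.items())
            table[S] = min(total, table.get(S, float("inf")))
    return table

def lifted_tree_cost(tree, D=edit_distance):
    """Cost of an optimal lifted tree alignment: minimum over root labels."""
    return min(lifted_costs(tree, D).values())

tree = (("AA", "AB"), "BB")
print(lifted_costs(tree))
print(lifted_tree_cost(tree))
```

In practice the pairwise distances would be computed once up front (the $O(k^2)$ alignments in the complexity analysis) instead of being recomputed inside the recursion as this short sketch does.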

Lifted Tree Alignments Approximation analysis
Claim: an optimal LTA 2-approximates general tree alignments.
We'll show the construction of an LTA which costs at most twice the optimal TA with sequence-labeled nodes (? can be generalized for profile-labeled nodes ?)
Notations:
- $T^*$: optimal TA labels
- $S^*_v$: label of node v in $T^*$
- $T^L$: our constructed LTA
- $S^L_v$: label of node v in $T^L$

Lifted Tree Alignments Approximation analysis
Construction: we label the nodes bottom-up. For a node v with daughters $u_1, \dots, u_l$ we choose the label (from $S^L_{u_1}, \dots, S^L_{u_l}$) closest to $S^*_v$.
We need to show: $D(T^L) \le 2 D(T^*)$

Lifted Tree Alignments Approximation analysis
Analysis: some edges in $T^L$ have cost 0. Observe the edges (v,u) of cost > 0:
- $S_i$: label of the father (v)
- $S_j$: label of the daughter (u)
- P(v,u): the path in $T^*$ from v to the leaf labeled by $S_j$
$D(S_i, S_j) \le D(S_i, S^*_v) + D(S_j, S^*_v)$ (triangle inequality)
$\le 2 D(S_j, S^*_v)$ (choice of i)
$\le 2 D(P(v,u))$ (triangle inequality)

Lifted Tree Alignments Approximation analysis
$D(S_i, S_j) \le 2 D(P(v,u))$
If (v,u) and (v',u') are two different edges with cost > 0 in $T^L$, then P(v,u) and P(v',u') are edge-disjoint. Summing the bound over all such edges therefore gives $D(T^L) \le 2 D(T^*)$. Q.E.D.
Final remarks:
- The lifted tree alignment $T^L$ is only conceptual (we don't have $T^*$)
- An optimal LTA cannot cost more than $T^L$
- In the case of profile-labeled nodes: the construction and analysis are OK when the cost is still a distance function