1 Sequences comparison 1 Issues Similarity gives a measure of how similar the sequences are. Alignment is a way to make clear the correspondence between.

Slides:



Advertisements
Similar presentations
Longest Common Subsequence
Advertisements

Sources Page & Holmes Vladimir Likic presentation: 20show.pdf
Multiple Sequence Alignment
Chapter 7 Dynamic Programming.
Refining Edits and Alignments Υλικό βασισμένο στο κεφάλαιο 12 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University.
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
Sequence Alignment Tutorial #2
6/11/2015 © Bud Mishra, 2001 L7-1 Lecture #7: Local Alignment Computational Biology Lecture #7: Local Alignment Bud Mishra Professor of Computer Science.
Sequence Alignment Storing, retrieving and comparing DNA sequences in Databases. Comparing two or more sequences for similarities. Searching databases.
Lecture 6: Multiple sequence alignment BioE 480 Sept 9, 2004.
Space Efficient Alignment Algorithms and Affine Gap Penalties
Sequencing and Sequence Alignment
Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.
1 1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one.
Sequence Alignment Bioinformatics. Sequence Comparison Problem: Given two sequences S & T, are S and T similar? Need to establish some notion of similarity.
Sequence Alignment Variations Computing alignments using only O(m) space rather than O(mn) space. Computing alignments with bounded difference Exclusion.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2004.
Aligning Alignments Soni Mukherjee 11/11/04. Pairwise Alignment Given two sequences, find their optimal alignment Score = (#matches) * m - (#mismatches)
Dynamic Programming1. 2 Outline and Reading Matrix Chain-Product (§5.3.1) The General Technique (§5.3.2) 0-1 Knapsack Problem (§5.3.3)
7 -1 Chapter 7 Dynamic Programming Fibonacci Sequence Fibonacci sequence: 0, 1, 1, 2, 3, 5, 8, 13, 21, … F i = i if i  1 F i = F i-1 + F i-2 if.
Dynamic Programming and Biological Sequence Comparison Part I.
Sequence Alignment II CIS 667 Spring Optimal Alignments So we know how to compute the similarity between two sequences  How do we construct an.
Multiple Sequence alignment Chitta Baral Arizona State University.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
Sequence Alignment III CIS 667 February 10, 2004.
Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.
6/29/20151 Efficient Algorithms for Motif Search Sudha Balla Sanguthevar Rajasekaran University of Connecticut.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Dynamic Programming. Pairwise Alignment Needleman - Wunsch Global Alignment Smith - Waterman Local Alignment.
1 Efficient Discovery of Conserved Patterns Using a Pattern Graph Inge Jonassen Pattern Discovery Arwa Zabian 13/07/2015.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.
Phylogenetic Tree Construction and Related Problems Bioinformatics.
Pairwise alignment Computational Genomics and Proteomics.
Introduction to Bioinformatics Algorithms Multiple Alignment.
Class 2: Basic Sequence Alignment
TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,
Developing Pairwise Sequence Alignment Algorithms
Sequence Alignment.
Bioiformatics I Fall Dynamic programming algorithm: pairwise comparisons.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
Pairwise Sequence Alignment BMI/CS 776 Mark Craven January 2002.
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
Multiple Sequence Alignments Craig A. Struble, Ph.D. Department of Mathematics, Statistics, and Computer Science Marquette University.
Multiple alignment: Feng- Doolittle algorithm. Why multiple alignments? Alignment of more than two sequences Usually gives better information about conserved.
Chapter 3 Computational Molecular Biology Michael Smith
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Lectures on Greedy Algorithms and Dynamic Programming
Intro to Alignment Algorithms: Global and Local Intro to Alignment Algorithms: Global and Local Algorithmic Functions of Computational Biology Professor.
COT 6930 HPC and Bioinformatics Multiple Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.
A * Search A* (pronounced "A star") is a best first, graph search algorithm that finds the least-cost path from a given initial node to one goal node out.
Space Efficient Alignment Algorithms and Affine Gap Penalties Dr. Nancy Warter-Perez.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Multiple Sequence Alignment Vasileios Hatzivassiloglou University of Texas at Dallas.
Finding Motifs Vasileios Hatzivassiloglou University of Texas at Dallas.
Local Alignment Vasileios Hatzivassiloglou University of Texas at Dallas.
Sequence comparison and database search.
Multiple String Comparison – The Holy Grail. Why multiple string comparison? It is the most critical cutting-edge toοl for extracting and representing.
Learning to Align: a Statistical Approach
SPIRE Normalized Similarity of RNA Sequences
Sequence Alignment 11/24/2018.
SMA5422: Special Topics in Biotechnology
Intro to Alignment Algorithms: Global and Local
SPIRE Normalized Similarity of RNA Sequences
CSE 589 Applied Algorithms Spring 1999
Multiple Sequence Alignment
Dynamic Programming-- Longest Common Subsequence
Computational Genomics Lecture #3a
Presentation transcript:

1 Sequences comparison 1 Issues Similarity gives a measure of how similar the sequences are. Alignment is a way to make clear the correspondence between similar characters or sub-strings from the sequences. 1. Given two sequences (~10 4 ) almost equal (from the same gene) with a few isolated differences such as insertions, deletions or substitutions, find where are the differences. 2. Given two sequences (~k10 2 ), is a prefix of one similar to a suffix of the other (sequence assembly problem) ? Same problem with several sequences. 3. Are there two sub-strings one from each sequence that are similar (local similarities) ? Same problem but with several (thousand) sequences.

2 2. Basic Algorithm Sizes of the two sequences t and s may be different. Alignments produces two sequences issued from t and s whose sizes are equal by inserting spaces. It does not contain elementary alignment of spaces. Example: GA- CGGG-ATTA CATGCCGATTAG Score of an alignment: sum of the score of the elementary alignments. Example: +1 for a match, -1 for a mismatch, -2 for an indel (insertion-deletion). Similarity sim(s, t) is the maximal score possible between s and t.

3 Basic Algorithm (cont.1) Dynamic programming The problem is to determine sim(s, t) and a best alignment between s and t (such that its score value is sim(s, t)). Dynamic programming uses the so-called "composition principle": "the value (similarity) of the whole (couple of sequences) is a function of its parties". We use it to compute sim(s,t). This principle raises two questions (see Assignment1): 1. Into which parties have we to decompose a couple of sequences ? 2. Which is the function giving the similarity of a couple of sequence from the similarities of the parties ?

4 Basic Algorithm (cont.2) 1. An idea (see Assignment1) about the possible decompositions could be the three following ones (with suffixes of s and t being of sizes 1 or 0): - a couple (prefixs, prefixt) of prefixes of s and t such that the respective suffixes sufs and suft are letters (inducing an elementary alignment which is a match or a mismatch). - a couple (prefixs, prefixt) of prefixes of s and t such that the suffix sufs is the voïd sequence and suft a letter (inducing an elementary alignment which is an indel, insertion or deletion) - a couple (prefixs, prefixt) of prefixes of s and t such that the suffix sufs is a letter and suft the voïd sequence (inducing an elementary alignment which is an indel) These decompositions avoid elementary alignments of spaces.

5 Basic Algorithm (cont.3) 2. For each decomposition, the required function is simply addition. Let be prefs = s[1...i-1], preft= t[1...j-1], then sim(s[1...i-1], t[1...j-1]) + p(i, j) where p(i, j) is the cost of a comparison between s[i] and s[j], gives the best score if using the first way of decomposing. To take into account the three possible decompositions, the function is : sim(s[1...i], t[1...j]) = max(sim(s[1...i-1], t[1...j-1]) + p(i, j), sim(s[1...i], t[1...j-1]) + g, sim(s[1...i-1], t[1...j]) + g)) with sim([], []) = 0 (it follows that sim([], t[1...j]) = j*g).

6 Basic Algorithm (cont.4) The computation uses a matrix a such that: a[i, j] = sim(s[1...i], t[1...j]) with 1 ≤ i ≤ m and 1 ≤ j ≤ n if m is the size of s and n the size of t. All the elements of the matrix a can be easily computed as a[i, j] is computed from a[i-1, j], a[i, j] and a[i, j-1]. The result is sim(s, t) = a[m, n]. Best alignments are paths (see Assignment1) constructed from a[m, n] to a[0, 0] which take optimal transitions between nodes (couples of prefixes). Each preceding nodes of a[i, j] can be computed by using the definition of sim (any choice can lead to a[0, 0]). Complexity: O(m*n) in time and space to compute similarities, O(m+n) to compute a best alignment. Be careful, searching for alignments could be exponential…

7 Local Comparison A local alignment between s and t is an alignment between a sub- string of t and a sub-string of s. We search for highest scoring local alignment. A sub-string of s is modeled as a suffix of a prefix of s. So, we have first to express the highest score highestscorepref(s[1...i], t[1...j]) between suffixes of prefixes of s and t. Note: suffixes of prefixes of s and t can be both void. Surprisingly, the definition of highestscorepref(s[1...i], t[1...j]) is very close to that one of sim(s[1...i], t[1...j]).

8 Local Comparison (cont.1) highestscorepref(s[1...i], t[1...j]) = max( highestscorepref (s[1...i-1], t[1...j-1]) + p(i, j), highestscorepref (s[1...i], t[1...j-1]) + g, highestscorepref (s[1...i-1], t[1...j]) + g), highestscorepref ([],[]) ) with highestscorepref ([],[]) = 0 (it follows that highestscorepref ([], t[1...j]) = 0) An highest scoring local alignment is constructed from node a[i, j] having the maximal value, in a way similar to the global case. The construction stops when a node of score 0 is reached.

9 Semi-global Comparison One does not consider spaces at the beginning(or the end) of an alignment. Examples: Alignment1:CAGCA- CTTGGATTCTCGG(from t) - - -CAGCGTGG (from s) The score of alignment is-19. But if spaces at the beginning of s are not considered the score is –3. Another alignment has a better score (-12): Alignment2:CAGCACTTGGATTCTCGG(from t) CAGC G- T GG(from s) In the case where end spaces are ignored after the last character of s the similarity between s and t becomes: sim(s,t) = max j =1…n (a[m, j])

10 Semi-global Comparison (cont.1) In the case where end spaces are ignored after the last character of s, the similarity between s and t is the highest one between s and a prefix of t. Then : sim(s,t) = max j =1…n (a[m, j]) In the case where beginning spaces of s are ignored, a[i, j] must contain the highest similarity between s[1…i] and a suffix of t[1…j]. Then the usual definition of sim(t, s) holds except that this time a[0, j] = 0. Exercice : How to proceed if one does not want to charge both beginning and end spaces of s and t ?

11 3. Extensions to the basic algorithm 1. Saving space The quadratic complexity is unavoidable. With respect to space, it possible to improve complexityfrom quadratic to linear. First, computing similarity can be easily done in linear space. We use a vector which contains at each step the line a[i-1, 0…j-1] and the line a[i, 0…j-1]. From this vector a[i, j] is easily computed. Computing best alignments is more difficult. Let be optimal(s, t) an optimal alignment. Then the composition principle can be applied by splitting s into three parts: s[1…i-1], s[i], s[I+1…m].

12 Saving space This principle induces that it exists j such that : optimal(s, t) = optimal(s[1…i-1], t[1…j-1]). elementary_alignment(s[i], t[j]). optimal(s[i+1…m], t[j+1…n]) or such that (by inserting a space in t): optimal(s, t) = optimal(s[1…i-1], t[1…j]). elementary_aligment(s[i], -). optimal(s[i+1…m], t[j+1…n]) So optimal(s, t) could be computed provided we know j.

13 Saving space (cont.1) Let simpre ij (resp. simsuf ij ) be the similarity between s[1…i-1] (resp. s[i+1…m]) and the prefixes terminating at position j (resp. suffixes beginning at position j) of t. The index j is such that: simpre ij + score(s[i], t[j]) + simsufix ij+1 or simpre ij + score(s[i], -) + simsufix ij is maximum. These similarities can be computed in linear space. Exercise: Show that processing time roughly doubles. Hints: Let T(m, n) be the number of times a maximum is computed for having the similarities. T(m, n) is proportional to the total processing time. Show that T(m, n) < 2*m*n. Note that T(1, n)  2*n (no maximum computations will occur).

14 General and affine gap penalty functions General gap penalty functions are such that the penalty for a block of size n is -w(n). w(k) could be different of b*k where b is the score of a gap (generally less than b*k). It could not be additive (sum of similarities of the components is not the similarity of the whole). The complexity becomes cubic. Less general functions (sub-additive functions)which satisfies w(k)  kw(1) and w(k1 + …+kn)  w(k1) + …+ w(kn) are affine functions of the form w(k) = h +gk (h>0, g >0) with k  1 and w(0) =0. The first space costs h + g and each following space costs g. Then the complexity remains quadratic.

15 Comparing similar sequences We treat the case where the two sequences have the same length n. Spaces will be inserted in pairs (one in s one in t). The number of space pairs k is greater than or equal to the maximum departure from the main diagonal. A narrow band around the main diagonal suffices to compute the optimal score alignment and alignments. The algorithm runs in time o(kn) which is a big win over the usual o(n 2 ) if k is small compared to n.

16 Comparing similar sequences (cont.1) How to use K-band ? If a[n, n] is greater or equal to the best score that would be come from an alignment with k+1 pairs, an optimal alignment has been found with o(kn) steps. This best possible bs score is : M*(n-k-1) + 2(k+1)*g = bs where M (>0) is the score of a match and g (  0) is added for each space.

17 Comparing similar sequences (cont.2) Exercise. Suppose that each time a k [n,n] < bs then k doubles and the algorithm is run again. Find a bound of the complexity of K-band in function n and sim(s, t). Observe that the complexity becomes better as the similarity grows. Hints: Express the stop condition for k and the non stop condition for k/2. Express the complexity and its bound depending whether a k [n, n] = a k/2 [n, n] or a k [n, n] > a k/2 [n, n].

18 Comparing multiple sequences Multiple alignment : sequences are extended by inserting spaces in such a way as to make them all the same size. No column is made exclusively of spaces. The SP-measure (Sum of Pairs): To score a column we want a function with k arguments, where k is the number of sequences. One (impractical way) would be to implement this function as a k-dimensional array: if k =10 this would need 2 10 –1 > 1000 combinations. Reasonable properties for a manageable function could be: - result independent from the order of the arguments (same sore for (V, V, I, -) and (I, I, V, -)). - penalization of unrelated residues and spaces

19 The SP-measure A well-known function SP-score is such that: SP-score(I, -, I, V) = p(I, -) + p(I, I) + p(I, V) + p(-, I) + p(-, V) + p(I, V) = SP-score(V, V, I, -) Where p(a, b) is the pairwise score of symbols a and b. It is possible to have two spaces in the same column. The common practice is to set p(-,-) = 0. -it is consistent with the induced pairwise alignment: elementary 2-alignments composed of spaces are removed. - it is compatible with SP-score(  ) =  i<j score(  ij ) where  ij is the induced alignment on sequences i and j from alignment .

20 Dynamic programming approach Cost in space: o(n k ) for k sequences of length n. Cost in time: - 2 k –1 possible compositions to be examined at each step.( for 3 sequences). -SP-score method requires o(k 2 ) steps by column (ie k(k-1)/2 pairs) The total complexity is o(k 2 2 k n k) if S-score is used. Exercise: Heuristics based on SP-scores for saving time which limit the positions (of the k-dimensional array) to those which are relevant to optimal alignments. Hints: Suppose  optimal. It could be that the induced 2- alignments are not optimal. But an idea based on the following one can be used: a position is relevant only if it belongs to all deduced optimal 2-alignments.

21 Star Alignment A sequence sc is chosen: the center of the star. 1. The goal is to get a multiple alignment  such that all pairwise alignments  ij where i or j is sc are optimal. The cost of all optimal pairwise alignments is o(kn 2 ). 2. A multiple alignment is obtained by composing a multiple alignment (with sc) with a pairwise alignment (with sc) with the following approach: “one gap in sc, always a gap in sc”. Cost of a composition is o(kl) where l is the maximum length of the alignment. The total cost is o( k 2 l). Choice of sc: For example, such that  i sim(s i, sc) is maximized.

22 Tree Alignments Data: k sequences, a (evolutionary) tree with k leaves, a one to one correspondence between sequences and leaves. Tree alignment problem: We have to assign sequences to the interior nodes of the tree in such a way that the score of the tree is maximum. This score is the sum of the similarities between the sequences connected via an edge. The star alignment is a particular case of tree alignment.

23 Tree Alignments Exercise: Consider a tree T with two leaves GAT and GT connected to and edge x and two leaves CG and CTG connected to an edge y, such that x and y are connected. Suppose x = CT and y = GC. Compute the score of the tree T in term of distances (more or less a dual notion of similarity). Consider the distance p(a, b) such that : p(a, b) = if a = b then 1 else 0 p(a, -) = -1 Which tree alignment could you deduce ?