Longest Common Subsequence

Slides:



Advertisements
Similar presentations
Boosting Textual Compression in Optimal Linear Time.
Advertisements

Lecture 24 MAS 714 Hartmut Klauck
Embedding the Ulam metric into ℓ 1 (Ενκρεβάτωση του μετρικού χώρου Ulam στον ℓ 1 ) Για το μάθημα “Advanced Data Structures” Αντώνης Αχιλλέως.
DYNAMIC PROGRAMMING ALGORITHMS VINAY ABHISHEK MANCHIRAJU.
Linear Algebra Applications in Matlab ME 303. Special Characters and Matlab Functions.
 Distance Problems: › Post Office Problem › Nearest Neighbors and Closest Pair › Largest Empty and Smallest Enclosing Circle  Sub graphs of Delaunay.
Greedy Algorithms Greed is good. (Some of the time)
Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Introduction to Bioinformatics Algorithms Divide & Conquer Algorithms.
Introduction to Bioinformatics Algorithms Divide & Conquer Algorithms.
Bounds on Code Length Theorem: Let l ∗ 1, l ∗ 2,..., l ∗ m be optimal codeword lengths for a source distribution p and a D-ary alphabet, and let L ∗ be.
Global Flow Optimization (GFO) in Automatic Logic Design “ TCAD91 ” by C. Leonard Berman & Louise H. Trevillyan CAD Group Meeting Prepared by Ray Cheung.
Chapter 7 Dynamic Programming.
Deterministic Selection and Sorting Prepared by John Reif, Ph.D. Analysis of Algorithms.
Inexact Matching of Strings General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic.
Efficient Algorithms for Locating Maximum Average Consecutive Substrings Jie Zheng Department of Computer Science UC, Riverside.
Refining Edits and Alignments Υλικό βασισμένο στο κεφάλαιο 12 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University.
Chapter 4 Normal Forms for CFGs Chomsky Normal Form n Defn A CFG G = (V, , P, S) is in chomsky normal form if each rule in G has one of.
§ 8 Dynamic Programming Fibonacci sequence
Dynamic Programming Reading Material: Chapter 7..
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 11 sections4-7 Lecturer:
4 -1 Chapter 4 The Sequence Alignment Problem The Longest Common Subsequence (LCS) Problem A string : S 1 = “ TAGTCACG ” A subsequence of S 1 :
Distance Functions for Sequence Data and Time Series
7 -1 Chapter 7 Dynamic Programming Fibonacci Sequence Fibonacci sequence: 0, 1, 1, 2, 3, 5, 8, 13, 21, … F i = i if i  1 F i = F i-1 + F i-2 if.
Data Structures Review Session 1
Implementation of Planted Motif Search Algorithms PMS1 and PMS2 Clifford Locke BioGrid REU, Summer 2008 Department of Computer Science and Engineering.
Dynamic Programming Reading Material: Chapter 7 Sections and 6.
6/29/20151 Efficient Algorithms for Motif Search Sudha Balla Sanguthevar Rajasekaran University of Connecticut.
1 Efficient Discovery of Conserved Patterns Using a Pattern Graph Inge Jonassen Pattern Discovery Arwa Zabian 13/07/2015.
1 Theory I Algorithm Design and Analysis (11 - Edit distance and approximate string matching) Prof. Dr. Th. Ottmann.
Case Study. DNA Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions used in the development and functioning of all known.
Applied Discrete Mathematics Week 9: Relations
Chapter 3: The Fundamentals: Algorithms, the Integers, and Matrices
Algorithms for Enumerating All Spanning Trees of Undirected and Weighted Graphs Presented by R 李孟哲 R 陳翰霖 R 張仕明 Sanjiv Kapoor and.
Zvi Kohavi and Niraj K. Jha 1 Memory, Definiteness, and Information Losslessness of Finite Automata.
Introduction n – length of text, m – length of search pattern string Generally suffix tree construction takes O(n) time, O(n) space and searching takes.
Extending Alignments Υλικό βασισμένο στο κεφάλαιο 13 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press.
7 -1 Chapter 7 Dynamic Programming Fibonacci sequence Fibonacci sequence: 0, 1, 1, 2, 3, 5, 8, 13, 21, … F i = i if i  1 F i = F i-1 + F i-2 if.
Logic Circuits Chapter 2. Overview  Many important functions computed with straight-line programs No loops nor branches Conveniently described with circuits.
 2004 SDU Lecture 7- Minimum Spanning Tree-- Extension 1.Properties of Minimum Spanning Tree 2.Secondary Minimum Spanning Tree 3.Bottleneck.
Genome Rearrangements Unoriented Blocks. Quick Review Looking at evolutionary change through reversals Find the shortest possible series of reversals.
Alignment, Part I Vasileios Hatzivassiloglou University of Texas at Dallas.
Computably Enumerable Semigroups, Algebras, and Groups Bakhadyr Khoussainov The University of Auckland New Zealand Research is partially supported by Marsden.
Word : Let F be a field then the expression of the form a 1, a 2, …, a n where a i  F  i is called a word of length n over the field F. We denote the.
Algorithms 1.Notion of an algorithm 2.Properties of an algorithm 3.The GCD algorithm 4.Correctness of the GCD algorithm 5.Termination of the GCD algorithm.
UNIT 5.  The related activities of sorting, searching and merging are central to many computer applications.  Sorting and merging provide us with a.
Ravello, Settembre 2003Indexing Structures for Approximate String Matching Alessandra Gabriele Filippo Mignosi Antonio Restivo Marinella Sciortino.
1 String Processing CHP # 3. 2 Introduction Computer are frequently used for data processing, here we discuss primary application of computer today is.
Chapter 8 Maximum Flows: Additional Topics All-Pairs Minimum Value Cut Problem  Given an undirected network G, find minimum value cut for all.
CS38 Introduction to Algorithms Lecture 10 May 1, 2014.
Core String Edits, Alignments, and Dynamic Programming.
Sorting by placement and Shift Sergi Elizalde Peter Winkler By 資工四 B 周于荃.
Example 2 You are traveling by a canoe down a river and there are n trading posts along the way. Before starting your journey, you are given for each 1
Dr Nazir A. Zafar Advanced Algorithms Analysis and Design Advanced Algorithms Analysis and Design By Dr. Nazir Ahmad Zafar.
Applied Discrete Mathematics Week 2: Functions and Sequences
Modeling with Recurrence Relations
Advanced Algorithms Analysis and Design
Copyright © Cengage Learning. All rights reserved.
@#? Text Search g ~ A R B n f u j u q e ! 4 k ] { u "!"
Distance Functions for Sequence Data and Time Series
Chapter 5. Optimal Matchings
ICS 353: Design and Analysis of Algorithms
Enumerating Distances Using Spanners of Bounded Degree
Data Structures Review Session
3. Brute Force Selection sort Brute-Force string matching
Dynamic Programming-- Longest Common Subsequence
3. Brute Force Selection sort Brute-Force string matching
Discrete Mathematics CS 2610
Chapter5: Synchronous Sequential Logic – Part 3
3. Brute Force Selection sort Brute-Force string matching
Presentation transcript:

Longest Common Subsequence Definition: The longest common subsequence or LCS of two strings S1 and S2 is the longest subsequence common between two strings. S1 : A -- A T -- G G C C -- A T A n=10 S2: A T A T A A T T C T A T -- m=12 The LCS is AATCAT. The length of the LCS is 6. The solution is not unique for all pair of strings. Consider the pair (ATTA, ATAT). The solutions are ATT, ATA. In general, for arbitrary pair of strings, there may exist many solutions.

LCS Theorem The LCS can be found by dynamic programming formulation. One can easily show: Theorem: With a score of 1 for each match and a zero for each mismatch or space , the matched characters in an alignment of maximum value for a LCS. Since it is using the general dynamic programming algorithm its complexity is O(nm) . A longest substring problem, on the other hand has a O(n+m) solution. Subsequences are much more complex than substrings. Can we do better for the LCS problem? We will see …

S1 : A -- A T -- G G C C -- A T A n=10 S2: A T A T A A T T C T A T -- m=12 The optimal alignment is shown above. Note the alignment shows three insert (dark), one delete green) and three substitution or replacement operations (blue), which gives an edit distance of 7. But, the 3 replacement operations can be realized by 3 insert and 3 delete operations because a replacement is equivalent to first delete the character and then insert a character in its place like: G -- G -- C -- -- A -- T -- T

if we give a cost of 2 for replace operation and cost of 1 for both insert and delete operations, the minimum edit distance D can be computed in terms of the length L of LCS as: For the above example, n=10, m=12, L=6. So, D=10 ( 6 insert and 4 delete).

Direct Computation of LCS by Dynamic Programming More efficient although the asymptotic complexity remains the same, O(nm). Let L denote The equations are given below without proof (which is simple). Again, if we leave suitable back pointers in the matrix, trace(s) can be derived for the LCS.

Example 3: Edit Distance = 6 + 8 – 2*5 = 4 ↓ ↑A j i T 1 A 2 3 4 C 5 6 7 8 A 1 T 2 C 3 A 4 A 5 T 6

A Faster Algorithm for LCS An algorithm that is asymptotically better than O(nm) for determining LCS. Implies that for special cases of edit distance, there exist more efficient algorithm. Definition: Let π be a set of n integers, not necessarily distinct. An increasing subsequence(IS) of π is a subsequence of π whose values are strictly increasing from left to right. Example: π=(5,3,4,4, 9,6,2,1,8,7,10). IS=(3,4,6,8,10), (5,9,10)

Definition: Example: DS=(5,4,4,2,1). A longest increasing subsequence(LIS) of π is an IS π of maximum length. A decreasing subsequence (DS) of π is a non-increasing subsequence f π. Example: DS=(5,4,4,2,1).

Definition: A cover is a set of disjoint DS of π that covers or contains all elements of π. The size of the cover c equals the number of DS in the cover. Example: π=(5,3,4,9,6,2,1,8,7) Cover:{ (5,3,2,1),(4),(9,6),(8,7)}. C=#of DS=4. A smallest cover (SC) is a cover with a minimum value of c.

Determine LIS and SC simultaneously in O(nlogn) Lemma: If I is an IS of π with length equal to the size of a cover C of π, then I is a LIS of π and C is the smallest cover of size c.

Proof If I is an increasing sequence, it cannot contain more than one element from a decreasing sequence. This means that no increasing subsequence can have size more than the size of any cover C, that is, if a maximum of one element from each can participate in any increasing sequence. Thus, an IS derived from this decomposition can have a maximum length of |C |=c. Conversely, C must be the smallest. If not, let c’ be the length of a cover C’ such that |C’|=< c i.e., if we derive IS from C, it must contain more than one element from one of the decreasing sequence of C’, which is not possible. Hence C has to be of smallest size.

Construction of a cover Greedy algorithm to derive a cover: Starting from the left of π, examine each successive number in π. Append the current number at the left-most subsequence derived so far if it is possible do that maintaining the decreasing sequence property. If not start a new decreasing subsequence beginning with the current element. Proceed until π is exhausted.

Example π=(5,3,4,9,6,2,1,8,7,10) D1=(5,3,2,1), D2=(4), D3=(9,6), D4=(8,7), D4=(10) The algorithm has O (n2) complexity. We will present an O (n logn) algorithm.

An Efficient Algorithm for Constructing the Cover We use a data structure which is a list containing the last number of each of the decreasing sequence that is being constructed. The list is always sorted in increasing order. An identifier indicating which list the number belongs to also included. Procedure Decreasing Sequence Cover Input: π= , the list of input numbers. Output: the set of decreasing sequences Di constituting the cover.

O(n logn) Algorithm Initialize: i←1; Di=(x1); L=(x1, i) ; j←1; For i=2 to n do Search the x-fields of L to find the first x-value such that xi < x. ….takes O( logn) time. If such a value exists, then insert x at the end in the list Di and set xi←x in L… This step takes constant time. If such a value does not exist in L, then set j←j+1. insert in L a new element (x,j) and start a new decreasing sequence Dj=(x) End

Lemma: At any point in the execution of the algorithm the list L is sorted in increasing order with respect to x-values as well as with respect to identifier value. In fact two separate lists will be better from practical implementation point of view. Theorem: The greedy cover can be constructed taking O(nlogn) time. A longest increasing sequence and a smallest cover thus can be constructed using O(nlogn) time.

Example: π=(5,3,4,9,6,2,1,8,7,10) i=1 x1=5 L={(5,1)} D1=(5) 2 3 {(3,1)} (5,3) 3 4 {(3,1),(4,2)} (5,3) D2=(4) 4 9 {(3,1),(4,2),(9,3)} (5,3) (4) D3=(9) 5 6 {(3,1),(4,2),(6,3)} (5,3) (4) (9,6) 6 2 {(2,1),(4,2),(6,3) (5,3,2) (4) (9,6) 7 1 {(1,1),(4,2),(6,3)} (5,3,2,1) (4) (9,6) 8 8 {(1,1),(4,2),(6,3),(8,4)} (5,3,2,1) (4) (9,6) D4=(8) 9 7 {(1,1),(4,2),(6,3),(7,4)} (5,3,2,1) (4) (9,6) D4=(8,7) 10 10 {(1,1),(4,2),(6,3),(7,4),(10,5)} (5,3,2,1) (4) (9,6) D4=(8,7) D5=(10) The x-component of the list, if separated, will look like the following during execution: (5),(3),(3,4), (3,4,9), (3,4,6), (2,4,6),(1,4,6), (1,4,6,7),(1,4,6,7,10)

Reduction of LIS problem to LCS problem Definition: Given sequences S1 and S2, let ri be the number of occurrence of the ith character of S1 in S2. (position index in sequence S2: ) 1 2 3 4 5 6 Example:S1=a b a c x and S2= b a a b c a Then, r(1)=3, r(2)=2, r(3)=3, r(4)=1, r(5)=0 .

Definition: for each distinct character x in S1, define list(x) to be the positions of x in S2 in decreasing order. Example: list(a)= (6,3,2); list(b)=(4,1), list(c)=(5), list(x)=φ (empty sequence).

Definition: Let Π (S1,S2) be a sequence obtained by concatenating list(si) for i=1,2,…,n where n is the length of S1 and si is the ith symbol of S1. Example: Π (S1,S2)= (6,3,2,4,1,6,3,2,5).

Theorem: Every increasing sequence I of Π (S1,S2) specifies an equal length common subsequence of S1 and S2 and vice versa. Thus a longest common subsequence LCS of S1 and S2 corresponds to a longest increasing sequence of Π (S1,S2). Example: Π (S1,S2)= (6,3,2,4,1,6,3,2,5). The possible longest increasing sequences used as indices to access the characters in S2 yield the LCS as: (1,2,5)= b a c, (2,3,5)=a a c, (3,4,6)= a b a for S1=a b a c x and S2= b a a b c a.