UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 11: Core String Edits.

Presentation transcript:

UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 11: Core String Edits Lecturer: Dr. Rose Slides by: Dr. Rose January 30 & February 4, 2003

Core String Edits
This chapter introduces inexact matching.
– Inexact matching is used to compute similarity.
– Sequence similarity is a key concept.
– Sequence similarity implies structural similarity and functional similarity.
– We will consider a dynamic programming approach to inexact matching.

Edit Distance
One measure of similarity between two strings is their edit distance: the number of operations required to transform the first string into the second.
Single-character operations:
– Deletion of a character from the first string
– Insertion of a character into the first string
– Substitution of a character in the first string with a character from the second string
– Match of a character in the first string with a character of the second

Edit Distance
Example from textbook: transform vintner to writers
vintner   replace v with w   → wintner
wintner   insert r after w   → wrintner
wrintner  match i            → wrintner
wrintner  delete n           → writner
writner   match t            → writner
writner   delete n           → writer
writer    match e            → writer
writer    match r            → writer
writer    insert s           → writers

Edit Distance
Let Σ = {I, D, R, M} be the edit alphabet (insert, delete, replace, match).
Defn. An edit transcript of two strings is a string over Σ describing a transformation of one string into the other.
Defn. The edit distance between two strings is the minimum number of edit operations needed to transform the first into the second. Matches are not included in the count.
Edit distance is also called Levenshtein distance.

Edit Distance
Defn. An optimal transcript is an edit transcript with the minimal number of edit operations for transforming one string into another.
Note: optimal transcripts may not be unique.
Defn. The edit distance problem entails computing the edit distance between two strings along with an optimal transcript.
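
The definitions above can be made concrete with a minimal sketch. The function names `apply_transcript` and `transcript_cost` are hypothetical, not from the slides. The transcript RIMDMDMMI is the step-by-step vintner/writers example written over the edit alphabet.

```python
def apply_transcript(s1, s2, transcript):
    """Apply an edit transcript over {I, D, R, M} to s1.

    Inserted and replacement characters are taken from s2.
    Returns the string that the transcript produces.
    """
    out = []
    i = j = 0  # next unread position in s1 and in s2
    for op in transcript:
        if op == "M":      # match: copy the shared character
            if s1[i] != s2[j]:
                raise ValueError("M used on unequal characters")
            out.append(s1[i])
            i += 1
            j += 1
        elif op == "R":    # replace s1[i] with s2[j]
            out.append(s2[j])
            i += 1
            j += 1
        elif op == "I":    # insert s2[j]
            out.append(s2[j])
            j += 1
        elif op == "D":    # delete s1[i]
            i += 1
        else:
            raise ValueError(f"unknown operation {op!r}")
    return "".join(out)


def transcript_cost(transcript):
    """Number of edit operations; matches are not counted."""
    return sum(1 for op in transcript if op != "M")


print(apply_transcript("vintner", "writers", "RIMDMDMMI"))  # writers
print(transcript_cost("RIMDMDMMI"))                         # 5
```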

String Alignment
Defn. A global alignment of strings S1 and S2 is obtained by:
1. Inserting dashes/spaces into or at the ends of S1 and S2.
2. Aligning the two strings such that each character/space in either string is opposite a unique character/space in the other string.
Example 1: S1 = qacdbd, S2 = qawxb
q a c - d b d
q a w x - b -

String Alignment
Example 2: S1 = vintner, S2 = writers
v - i n t n e r -
w r i - t - e r s
Mathematically, string alignment and edit transcripts are equivalent.
From a modeling perspective they are not equivalent. Edit transcripts express the idea of mutational changes.
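
One way to see the mathematical equivalence is to convert a transcript into an alignment column by column. This is a sketch under assumptions: the name `transcript_to_alignment` is hypothetical, and a dash stands for a space.

```python
def transcript_to_alignment(s1, s2, transcript):
    """Turn an edit transcript over {I, D, R, M} into a two-row alignment."""
    top, bottom = [], []
    i = j = 0  # next unread position in s1 and in s2
    for op in transcript:
        if op in ("M", "R"):   # characters of s1 and s2 face each other
            top.append(s1[i])
            bottom.append(s2[j])
            i += 1
            j += 1
        elif op == "I":        # inserted character faces a dash in s1
            top.append("-")
            bottom.append(s2[j])
            j += 1
        elif op == "D":        # deleted character faces a dash in s2
            top.append(s1[i])
            bottom.append("-")
            i += 1
        else:
            raise ValueError(f"unknown operation {op!r}")
    return "".join(top), "".join(bottom)


row1, row2 = transcript_to_alignment("vintner", "writers", "RIMDMDMMI")
print(row1)  # v-intner-
print(row2)  # wri-t-ers
```

The output reproduces the alignment of Example 2 without the spaces between columns.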

Dynamic Programming
Observation 1: There are many possible ways to transform one string into another.
Observation 2: This is like the knapsack problem.
Recall: dynamic programming is used to solve knapsack-like problems.
Defn. Let D(i,j) denote the edit distance of S1[1..i] and S2[1..j].
– That is, D(i,j) is the minimum number of edit ops needed to transform the first i characters of S1 into the first j characters of S2.

Dynamic Programming
Notice that we can solve D(i,j) for all combinations of prefix lengths of S1 and S2.
Examples: D(0,0),.., D(0,j), D(1,0),..,D(1,j), … D(i,j)
Like divide and conquer, dynamic programming builds a solution from solutions to smaller subproblems; unlike divide and conquer, the subproblems overlap, so each one is solved once and its result is reused.
The three parts to dynamic programming are:
– The recurrence relation
– Tabular computation
– Traceback

Dynamic Programming
The recurrence relation expresses the recursive relation between a problem and smaller instances of the problem.
For any recursive relation, the base condition(s) must be specified.
Base conditions for D(i,j) are:
– D(i,0) = i
  Q: Why is this true? What does it mean in terms of edit ops?
– D(0,j) = j
  Q: Why is this true? What does it mean in terms of edit ops?

Dynamic Programming
The general recurrence is given by:
D(i,j) = min[ D(i - 1, j) + 1, D(i, j - 1) + 1, D(i - 1, j - 1) + t(i,j) ]
Here t(i,j) = 1 if S1(i) ≠ S2(j), otherwise t(i,j) = 0.
Proof of correctness: see the textbook.
Basic argument: D(i,j) must be one of:
1. D(i - 1, j) + 1
2. D(i, j - 1) + 1
3. D(i - 1, j - 1) + t(i,j)
There are NO other ways of creating S2[1..j] from S1[1..i].

Dynamic Programming
Q: How do we use the recurrence relation to efficiently compute D(i,j)?
Wrong Answer: simply use recursion.
Q: Why is this the wrong answer?
A: Recursion results in inefficient duplication of computations for subproblems.
Q: How much duplication?
A: Exponential duplication!
Example: Fibonacci numbers
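
To get a feel for the duplication, the sketch below applies the recurrence by plain recursion and counts the calls. The name `naive_edit_distance` and the call counter are illustrative assumptions, not part of the slides.

```python
def naive_edit_distance(s1, s2):
    """Evaluate D(n, m) by plain recursion; also return the number of calls."""
    calls = 0

    def D(i, j):
        nonlocal calls
        calls += 1
        if i == 0:          # base condition D(0, j) = j
            return j
        if j == 0:          # base condition D(i, 0) = i
            return i
        t = 0 if s1[i - 1] == s2[j - 1] else 1
        return min(D(i - 1, j) + 1, D(i, j - 1) + 1, D(i - 1, j - 1) + t)

    return D(len(s1), len(s2)), calls


distance, calls = naive_edit_distance("vintner", "writers")
distinct = (len("vintner") + 1) * (len("writers") + 1)
print(distance)          # 5
print(calls > distinct)  # True: far more calls than distinct (i, j) pairs
```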

Dynamic Programming
Example: Fibonacci numbers
f(n) = f(n - 1) + f(n - 2)
Base conditions: f(0) = 0, f(1) = 1
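
A minimal sketch of the top-down version, with a call counter added purely for illustration (the name `fib_recursive` is hypothetical):

```python
def fib_recursive(n):
    """Return (f(n), number of recursive calls made) using plain recursion."""
    if n < 2:                       # base conditions f(0) = 0, f(1) = 1
        return n, 1
    a, calls_a = fib_recursive(n - 1)
    b, calls_b = fib_recursive(n - 2)
    return a + b, calls_a + calls_b + 1


value, calls = fib_recursive(10)
print(value)  # 55
print(calls)  # 177 calls to obtain only 11 distinct values f(0)..f(10)
```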

Dynamic Programming
Note: In calculating D(n,m), there are only (n + 1) × (m + 1) unique combinations of i and j.
Clearly an exponential number of computations is NOT required.
Soln: instead of going top-down with recursion, go bottom-up. Compute each combination only once.
– Decide on a data structure to hold intermediate results.
– Start from base conditions. These are the smallest D(i,j) values and are already defined.
– Compute D(i,j) for larger values of i and j.

Dynamic Programming
Example: Fibonacci numbers
Decide on a data structure: simple array
Start from base conditions: f(0) = 0, f(1) = 1
Compute f(i) for larger values of i, from the bottom up.
Each f(i) is computed only once!
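
The three steps translate almost line for line into code. This is a sketch; the name `fib_bottom_up` is an assumption.

```python
def fib_bottom_up(n):
    """Compute f(n) bottom-up, filling a simple array once."""
    f = [0] * (n + 1)          # data structure: simple array, f[0] = 0
    if n >= 1:
        f[1] = 1               # base condition f(1) = 1
    for i in range(2, n + 1):  # larger values of i, each computed once
        f[i] = f[i - 1] + f[i - 2]
    return f[n]


print(fib_bottom_up(10))  # 55
```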

Dynamic Programming
Q: What kind of data structure should we use for edit distance?
1. Has to be a random access data structure.
2. Has to support the dimensionality of the problem. D(i,j) is two-dimensional: S1 and S2.
We will use a two-dimensional array, i.e., a table.

Dynamic Programming
Example: edit distance from vintner to writers.
Fill in the base condition values.
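
As a sketch of this first step, the code below builds the table for vintner (rows) and writers (columns) and fills only row 0 and column 0. The name `base_table` and the use of `None` for cells not yet computed are assumptions.

```python
def base_table(s1, s2):
    """Create the (n+1) x (m+1) table with only the base conditions filled."""
    n, m = len(s1), len(s2)
    D = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i   # D(i, 0) = i
    for j in range(m + 1):
        D[0][j] = j   # D(0, j) = j
    return D


D = base_table("vintner", "writers")
for row in D:
    # a dot marks a cell that has not been computed yet
    print(" ".join("." if v is None else str(v) for v in row))
```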

Dynamic Programming
Q: How do we fill in the other values?
A: Use the recurrence:
D(i,j) = min[ D(i - 1, j) + 1, D(i, j - 1) + 1, D(i - 1, j - 1) + t(i,j) ]
where t(i,j) = 1 if S1(i) ≠ S2(j), otherwise t(i,j) = 0.
We can first compute D(1,1) because we have D(0,0), D(0,1), and D(1,0).
– D(1,1) = min[ 1+1, 1+1, 0+1 ] = 1
Then we have all the values needed to compute in turn D(1,2), D(1,3),..,D(1,m).

Dynamic Programming
First compute D(1,1) because we have D(0,0), D(0,1), and D(1,0).
Then compute in turn D(1,2), D(1,3),..,D(1,m).

Dynamic Programming
Fill in subsequent values, row by row, from left to right.
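
A minimal sketch of the row-by-row fill for the vintner/writers example follows. The function name `edit_table` is hypothetical. Python strings are 0-indexed, so S1(i) is `s1[i - 1]`.

```python
def edit_table(s1, s2):
    """Fill the edit distance table row by row, left to right."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i   # base condition D(i, 0) = i
    for j in range(m + 1):
        D[0][j] = j   # base condition D(0, j) = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,       # cell above
                          D[i][j - 1] + 1,       # cell to the left
                          D[i - 1][j - 1] + t)   # diagonal cell
    return D


D = edit_table("vintner", "writers")
for row in D:
    print(" ".join(str(v) for v in row))
print(D[7][7])  # 5, the edit distance from vintner to writers
```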

Dynamic Programming
Alternatively, first compute D(1,1) from D(0,0), D(0,1), and D(1,0).
Then compute in turn D(2,1), D(3,1),..,D(n,1).

Dynamic Programming
Fill in subsequent values, column by column, from top to bottom.
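
Either order works because every cell is computed after the three cells it depends on. The sketch below fills the table in both orders and checks that the results agree. The name `fill_table` and the `by_columns` flag are assumptions.

```python
def fill_table(s1, s2, by_columns=False):
    """Fill the edit distance table row by row or column by column."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    if by_columns:
        # column by column, each column from top to bottom
        cells = [(i, j) for j in range(1, m + 1) for i in range(1, n + 1)]
    else:
        # row by row, each row from left to right
        cells = [(i, j) for i in range(1, n + 1) for j in range(1, m + 1)]
    for i, j in cells:
        t = 0 if s1[i - 1] == s2[j - 1] else 1
        D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)
    return D


rows_first = fill_table("vintner", "writers")
cols_first = fill_table("vintner", "writers", by_columns=True)
print(rows_first == cols_first)  # True
```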

Dynamic Programming
Filling each cell entails a constant number of operations.
– Cell (i,j) depends only on characters S1(i) and S2(j) and cells (i - 1, j - 1), (i, j - 1), and (i - 1, j).
There are O(nm) cells in the table.
Consequently, we can compute the edit distance D(n,m) in O(nm) time by computing the table in O(nm) time.

Dynamic Programming
Having computed the table we know the value of the optimal edit transcript.
Q: How do we extract the optimal edit transcript from the table?
A: One way would be to establish pointers from each cell to the predecessor cell(s) from which its value was derived, i.e.,
– If D(i,j) = D(i - 1, j) + 1, add a pointer from (i,j) to (i - 1, j)
– If D(i,j) = D(i, j - 1) + 1, add a pointer from (i,j) to (i, j - 1)
– If D(i,j) = D(i - 1, j - 1) + t(i,j), add a pointer from (i,j) to (i - 1, j - 1)
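
The three pointer rules can be recorded while the table is filled. In this sketch the name `table_with_pointers` and the labels "up", "left", and "diag" are assumptions. So is the boundary convention: cells in column 0 point up and cells in row 0 point left.

```python
def table_with_pointers(s1, s2):
    """Return the edit distance table and, per cell, its set of pointers."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[set() for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
        ptr[i][0].add("up")      # assumed boundary convention
    for j in range(1, m + 1):
        D[0][j] = j
        ptr[0][j].add("left")    # assumed boundary convention
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            up = D[i - 1][j] + 1
            left = D[i][j - 1] + 1
            diag = D[i - 1][j - 1] + t
            D[i][j] = min(up, left, diag)
            if D[i][j] == up:
                ptr[i][j].add("up")     # pointer to (i - 1, j)
            if D[i][j] == left:
                ptr[i][j].add("left")   # pointer to (i, j - 1)
            if D[i][j] == diag:
                ptr[i][j].add("diag")   # pointer to (i - 1, j - 1)
    return D, ptr


D, ptr = table_with_pointers("vintner", "writers")
print(sorted(ptr[7][7]))  # ['left']
print(sorted(ptr[3][3]))  # ['diag', 'up']
```

A cell with more than one pointer, such as (3,3) here, is where optimal paths branch.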

Dynamic Programming
We can recover an optimal edit sequence simply by following any path of pointers from (n,m) to (0,0).
The interpretation of the path links is:
– A horizontal link, (i,j) → (i,j-1), corresponds to an insertion of character S2(j) into S1.
– A vertical link, (i,j) → (i-1,j), corresponds to a deletion of S1(i) from S1.
– A diagonal link, (i,j) → (i-1,j-1), corresponds to a match if S1(i) = S2(j) or a substitution if S1(i) ≠ S2(j).
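
A traceback along these lines can be sketched without storing pointers, by re-testing the recurrence at each cell. The names `edit_table` and `one_optimal_transcript` are hypothetical. The tie-breaking order (diagonal, then vertical, then horizontal) is an arbitrary assumption, and the code writes R for a substitution.

```python
def edit_table(s1, s2):
    """Fill the edit distance table bottom-up."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)
    return D


def one_optimal_transcript(s1, s2):
    """Walk from (n, m) back to (0, 0) and return one optimal transcript."""
    D = edit_table(s1, s2)
    i, j = len(s1), len(s2)
    ops = []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:   # diagonal link
                ops.append("M" if t == 0 else "R")
                i -= 1
                j -= 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + 1:  # vertical link
            ops.append("D")
            i -= 1
        else:                                     # horizontal link
            ops.append("I")
            j -= 1
    return "".join(reversed(ops))  # links were collected end to start


print(one_optimal_transcript("vintner", "writers"))  # RRRMDMMI
```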

Dynamic Programming
An optimal edit path. What edit transcript does this path correspond to?
S,S,S,M,D,M,M,I (S denotes a substitution, written R in the edit alphabet)

Dynamic Programming
Another optimal edit path. What edit transcript does this path correspond to?
I,S,M,D,M,D,M,M,I

Dynamic Programming
The third possible optimal edit path. What edit transcript does this path correspond to?
S,I,M,D,M,D,M,M,I

Dynamic Programming
Alternatively, we can interpret any path of pointers from (n,m) to (0,0) as an alignment of S1 and S2.
The interpretation of the path links is:
– A horizontal link, (i,j) → (i,j-1), corresponds to an insertion of a space/dash into S1.
– A vertical link, (i,j) → (i-1,j), corresponds to an insertion of a space/dash into S2.
– A diagonal link, (i,j) → (i-1,j-1), corresponds to a match if S1(i) = S2(j) or a mismatch if S1(i) ≠ S2(j).
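
The same walk can emit alignment columns instead of edit operations. This is a sketch: the names `edit_table` and `one_optimal_alignment` and the tie-breaking order (diagonal, then vertical, then horizontal) are assumptions.

```python
def edit_table(s1, s2):
    """Fill the edit distance table bottom-up."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)
    return D


def one_optimal_alignment(s1, s2):
    """Walk from (n, m) to (0, 0) and return the rows for s1 and s2."""
    D = edit_table(s1, s2)
    i, j = len(s1), len(s2)
    row1, row2 = [], []
    while i > 0 or j > 0:
        t = 0 if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] else 1
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + t:
            row1.append(s1[i - 1])   # diagonal: match or mismatch column
            row2.append(s2[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            row1.append(s1[i - 1])   # vertical: dash inserted into s2
            row2.append("-")
            i -= 1
        else:
            row1.append("-")         # horizontal: dash inserted into s1
            row2.append(s2[j - 1])
            j -= 1
    return "".join(reversed(row1)), "".join(reversed(row2))


a1, a2 = one_optimal_alignment("vintner", "writers")
print(a1)  # vintner-
print(a2)  # writ-ers
```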

Dynamic Programming
Possible optimal path. What alignment does this optimal path correspond to?
S2: w r i t - e r s
S1: v i n t n e r -

Dynamic Programming
A second possible optimal path. What alignment does this optimal path correspond to?
S2: w r i - t - e r s
S1: v - i n t n e r -

Dynamic Programming
A third possible optimal path. What alignment does this optimal path correspond to?
S2: w r i - t - e r s
S1: - v i n t n e r -

Summary
Any path that follows the pointers from (n,m) to (0,0) corresponds to an optimal edit sequence and an optimal alignment.
We can recover all optimal edit sequences and alignments simply by extracting all such paths from (n,m) to (0,0).
The correspondence between paths and edit sequences is one-to-one.
The correspondence between paths and alignments is one-to-one.
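
Extracting all paths can be sketched as a recursive walk that branches wherever more than one predecessor explains a cell's value. The names `edit_table` and `all_optimal_transcripts` are hypothetical, and the code writes R for a substitution. The number of optimal paths can grow exponentially with string length, so this is meant for small examples only.

```python
def edit_table(s1, s2):
    """Fill the edit distance table bottom-up."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)
    return D


def all_optimal_transcripts(s1, s2):
    """Enumerate the transcript of every optimal path from (n, m) to (0, 0)."""
    D = edit_table(s1, s2)
    results = []

    def walk(i, j, suffix):
        if i == 0 and j == 0:
            results.append(suffix)
            return
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:       # diagonal link
                walk(i - 1, j - 1, ("M" if t == 0 else "R") + suffix)
        if i > 0 and D[i][j] == D[i - 1][j] + 1:      # vertical link
            walk(i - 1, j, "D" + suffix)
        if j > 0 and D[i][j] == D[i][j - 1] + 1:      # horizontal link
            walk(i, j - 1, "I" + suffix)

    walk(len(s1), len(s2), "")
    return results


for transcript in sorted(all_optimal_transcripts("vintner", "writers")):
    print(transcript)
# IRMDMDMMI
# RIMDMDMMI
# RRRMDMMI
```

For vintner and writers this yields exactly three transcripts, each with five non-match operations.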