COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been reproduced and communicated to you by or on behalf of Monash University pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. Do not remove this notice.

FIT2004 Algorithms & Data Structures
L5: Strings, Matching, and Dynamic Programming
Prepared by Bernd Meyer from lecture materials © 2004 Goodrich & Tamassia
February 2007

Sequence Matching Problems
Finding a word in a document
– web search etc.
– spell-checking (approximate matching)
Bioinformatics!!
– e.g. (DNA) sequence alignment

Strings (§ 11.1 Goodrich & Tamassia)
A string is a sequence of characters
Examples of strings:
– Java program
– HTML document
– DNA sequence
– Digitized image
An alphabet Σ is the set of possible characters for a family of strings
Examples of alphabets:
– ASCII, Unicode
– {0, 1}
– {A, C, G, T}
Let P be a string of size m
– A substring P[i..j] of P is the subsequence of P consisting of the characters with indices between i and j
– A prefix of P is a substring of the form P[0..i]
– A suffix of P is a substring of the form P[i..m−1]
Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P
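These index conventions map directly onto Python slicing; a small illustration (the variable names are ours):

    P = "abacab"        # a string over the alphabet {a, b, c}
    sub = P[1:4]        # substring P[1..3] = "bac"
    pre = P[:4]         # prefix P[0..3] = "abac"
    suf = P[2:]         # suffix P[2..m-1] = "acab"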

Brute-Force Pattern Matching
The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either
– a match is found, or
– all placements of the pattern have been tried
Brute-force pattern matching runs in time O(nm)
Example of worst case:
– T = aaa … ah
– P = aaah

Algorithm BruteForceMatch(T, P)
    Input: text T of size n and pattern P of size m
    Output: starting index of a substring of T equal to P, or −1 if no such substring exists
    for i ← 0 to n − m
        { test shift i of the pattern }
        j ← 0
        while j < m ∧ T[i + j] = P[j]
            j ← j + 1
        if j = m
            return i    { match at i }
        else
            break while loop    { mismatch }
    return −1    { no match anywhere }
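A direct Python transcription of the pseudocode above; a minimal sketch, with function and variable names of our own choosing:

    def brute_force_match(T: str, P: str) -> int:
        """Return the starting index of the first occurrence of P in T, or -1."""
        n, m = len(T), len(P)
        for i in range(n - m + 1):              # test shift i of the pattern
            j = 0
            while j < m and T[i + j] == P[j]:
                j += 1
            if j == m:
                return i                        # match at i
        return -1                               # no match anywhere

For example, brute_force_match("aaab", "ab") returns 2; the worst case above ("aaa…ah" against "aaah") forces about m comparisons at each of the n shifts.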

Boyer-Moore Heuristics
The Boyer-Moore pattern matching algorithm is based on two heuristics
Looking-glass heuristic: compare P with a subsequence of T moving backwards
Character-jump heuristic: when a mismatch occurs at T[i] = c
– if P contains c, shift P to align the last occurrence of c in P with T[i]
– else, shift P to align P[0] with T[i + 1]
(Example figure omitted in the transcript.)

Boyer-Moore Heuristics
Why is Boyer-Moore matching correct?

Last-Occurrence Function
Boyer-Moore's algorithm preprocesses the pattern P and the alphabet Σ to build the last-occurrence function L, mapping Σ to integers, where L(c) is defined as
– the largest index i such that P[i] = c, or
– −1 if no such index exists
Example:
– Σ = {a, b, c, d}
– P = abacab

    c       a   b   c   d
    L(c)    4   5   3   −1

The last-occurrence function can be represented by an array indexed by the numeric codes of the characters
The last-occurrence function can be computed in time O(m + s), where m is the size of P and s is the size of Σ

9 prepared from lecture material © 2004 Goodrich & Tamassia Case 1: j  1  l The Boyer-Moore Algorithm Algorithm BoyerMooreMatch(T, P,  ) L  lastOccurenceFunction(P,  ) i  m  1 j  m  1 repeat if T[i]  P[j] if j  0 return i { match at i } else i  i  1 j  j  1 else { character-jump } l  L[ T[i] ] i  i  m – min(j, 1  l) j  m  1 until i  n  1 return  1 { no match } Case 2: 1  l  j

Example
(Worked Boyer-Moore matching trace; figure omitted in the transcript.)

Analysis
Boyer-Moore's algorithm runs in time O(nm + s)
Example of worst case:
– T = aaa … a
– P = baaa
The worst case may occur in images and DNA sequences but is unlikely in English text
Boyer-Moore's algorithm is significantly faster than the brute-force algorithm on English text

The KMP Algorithm
Knuth-Morris-Pratt's algorithm compares the pattern to the text left-to-right, but shifts the pattern more intelligently than the brute-force algorithm.
When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?
Answer: the largest prefix of P[0..j] that is a suffix of P[1..j]
(The slide's figure illustrates this with the pattern abaaba: there is no need to repeat the comparisons already made against the matched part of the text, and comparing resumes just after the reusable prefix.)

KMP Failure Function
Knuth-Morris-Pratt's algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself
The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j]
Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i] we set j ← F(j − 1); that is, F(j − 1) is the index of the pattern at which we restart matching (left to right)

    j       0   1   2   3   4   5
    P[j]    a   b   a   a   b   a
    F(j)    0   0   1   1   2   3

The KMP Algorithm
The failure function can be represented by an array and can be computed in O(m) time
At each iteration of the while-loop, either
– i increases by one, or
– the shift amount i − j increases by at least one (observe that F(j − 1) < j)
Hence, there are no more than 2n iterations of the while-loop
Thus, KMP's algorithm runs in optimal time O(m + n)

Algorithm KMPMatch(T, P)
    F ← failureFunction(P)
    i ← 0
    j ← 0
    while i < n
        if T[i] = P[j]
            if j = m − 1
                return i − j    { match }
            else
                i ← i + 1
                j ← j + 1
        else
            if j > 0
                j ← F[j − 1]
            else
                i ← i + 1
    return −1    { no match }

Example

    j       0   1   2   3   4   5
    P[j]    a   b   a   c   a   b
    F(j)    0   0   1   0   1   2

(The slide traces KMP matching with this pattern; the trace figure is omitted in the transcript.)

Computing the Failure Function
The failure function can be represented by an array and can be computed in O(m) time
The construction is similar to the KMP algorithm itself
At each iteration of the while-loop, either
– i increases by one, or
– the shift amount i − j increases by at least one (observe that F(j − 1) < j)
Hence, there are no more than 2m iterations of the while-loop

Algorithm failureFunction(P)
    F[0] ← 0
    i ← 1
    j ← 0
    while i < m
        if P[i] = P[j]
            { we have matched j + 1 chars }
            F[i] ← j + 1
            i ← i + 1
            j ← j + 1
        else if j > 0 then
            { use failure function to shift P }
            j ← F[j − 1]
        else
            F[i] ← 0    { no match }
            i ← i + 1
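Putting the two pieces together, a self-contained Python sketch (function names are ours):

    def failure_function(P: str) -> list:
        """F[j] = size of the largest prefix of P[0..j] that is also a
        suffix of P[1..j]; computed in O(m) time."""
        m = len(P)
        F = [0] * m
        i, j = 1, 0
        while i < m:
            if P[i] == P[j]:        # we have matched j + 1 chars
                F[i] = j + 1
                i += 1
                j += 1
            elif j > 0:             # use failure function to shift P
                j = F[j - 1]
            else:                   # no match
                F[i] = 0
                i += 1
        return F

    def kmp_match(T: str, P: str) -> int:
        """Return the starting index of the first occurrence of P in T,
        or -1; runs in O(n + m) time."""
        n, m = len(T), len(P)
        if m == 0:
            return 0
        F = failure_function(P)
        i = j = 0
        while i < n:
            if T[i] == P[j]:
                if j == m - 1:
                    return i - j    # match
                i += 1
                j += 1
            elif j > 0:
                j = F[j - 1]
            else:
                i += 1
        return -1                   # no match

For instance, failure_function("abacab") returns [0, 0, 1, 0, 1, 2], matching the table on the Example slide above.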

Subsequences (LCS)
A subsequence of a character string x_0 x_1 x_2 … x_{n−1} is a string of the form x_{i_1} x_{i_2} … x_{i_k}, where i_j < i_{j+1}.
Not the same as substring!
Example
String: ABCDEFGHIJK
– Subsequence: ACEGIJK
– Subsequence: DFGHK
– Not a subsequence: DAGH
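A quick way to check the definition in Python; a small sketch using a shared iterator so that the characters must appear in order (the helper name is ours):

    def is_subsequence(S: str, X: str) -> bool:
        """True iff S is a subsequence of X."""
        it = iter(X)                      # 'c in it' consumes it, enforcing order
        return all(c in it for c in S)

    # is_subsequence("DFGHK", "ABCDEFGHIJK")  -> True
    # is_subsequence("DAGH", "ABCDEFGHIJK")   -> False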

LCS: Longest Common Subsequence
Given two strings X and Y, the longest common subsequence (LCS) problem is to find a longest subsequence common to both X and Y
Has applications to DNA similarity testing (the alphabet is {A, C, G, T})
Example: ABCDEFG and XZACKDFWGH have ACDFG as a longest common subsequence

A Poor Approach to the LCS Problem
A brute-force solution:
– Enumerate all subsequences of X
– Test which ones are also subsequences of Y
– Pick the longest one
Analysis:
– If X is of length n, then it has 2^n subsequences
– This is an exponential-time algorithm!

Dynamic Programming Technique
Applies to a problem that at first seems to require a lot of time (possibly exponential), provided we have:
– Simple subproblem definition: the subproblems can be defined in terms of a few variables, such as j, k, l, m, and so on
– Subproblem optimality: the global optimum value can be defined in terms of optimal subproblems
– Subproblem overlap: the subproblems are not independent, but instead they overlap (hence, they should be constructed bottom-up)
Unwind recursion into iteration using tables

Dynamic-Programming Approach to LCS
Define L[i, j] to be the length of the longest common subsequence of X[0..i] and Y[0..j].
Allow for −1 as an index, so L[−1, k] = 0 and L[k, −1] = 0, to indicate that the null part of X or Y has no match with the other.
Then we can define L[i, j] in the general case as follows:
– Case 1: if x_i = y_j, then L[i, j] = L[i−1, j−1] + 1 (we can add this match)
– Case 2: if x_i ≠ y_j, then L[i, j] = max{L[i−1, j], L[i, j−1]} (we have no match here)

An LCS Algorithm

Algorithm LCS(X, Y):
    Input: strings X and Y with n and m elements, respectively
    Output: for i = 0, …, n−1, j = 0, …, m−1, the length L[i, j] of a longest string that is a subsequence of both the string X[0..i] = x_0 x_1 x_2 … x_i and the string Y[0..j] = y_0 y_1 y_2 … y_j
    for i ← −1 to n−1 do
        L[i, −1] ← 0
    for j ← 0 to m−1 do
        L[−1, j] ← 0
    for i ← 0 to n−1 do
        for j ← 0 to m−1 do
            if x_i = y_j then
                L[i, j] ← L[i−1, j−1] + 1
            else
                L[i, j] ← max{L[i−1, j], L[i, j−1]}
    return array L
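A Python sketch of the same algorithm; rather than using −1 as an index, it shifts the table by one, so L[i][j] holds the LCS length of the prefixes X[:i] and Y[:j] (the function name is ours):

    def lcs_length(X: str, Y: str) -> list:
        """Build the DP table of LCS lengths for all prefix pairs."""
        n, m = len(X), len(Y)
        L = [[0] * (m + 1) for _ in range(n + 1)]   # row 0 / column 0: boundary zeros
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if X[i - 1] == Y[j - 1]:            # Case 1: we can add this match
                    L[i][j] = L[i - 1][j - 1] + 1
                else:                               # Case 2: no match here
                    L[i][j] = max(L[i - 1][j], L[i][j - 1])
        return L

    # lcs_length("ABCDEFG", "XZACKDFWGH")[7][10] == 5   (the length of "ACDFG")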

Visualizing the LCS Algorithm
(Figure omitted in the transcript.)

Analysis of LCS Algorithm
We have two nested loops
– The outer one iterates n times
– The inner one iterates m times
– A constant amount of work is done inside each iteration of the inner loop
– Thus, the total running time is O(nm)
The answer is contained in L[n−1, m−1] (and the subsequence itself can be recovered from the L table).
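One way to recover the subsequence from the table, as the slide suggests: walk back from the bottom-right entry, taking a diagonal step whenever the characters match. A sketch built on the lcs_length helper above (both names are ours):

    def lcs_traceback(X: str, Y: str) -> str:
        """Recover one longest common subsequence from the DP table."""
        L = lcs_length(X, Y)
        i, j = len(X), len(Y)
        out = []
        while i > 0 and j > 0:
            if X[i - 1] == Y[j - 1]:        # matched character: part of the LCS
                out.append(X[i - 1])
                i -= 1
                j -= 1
            elif L[i - 1][j] >= L[i][j - 1]:
                i -= 1                      # value came from the row above
            else:
                j -= 1                      # value came from the column to the left
        return "".join(reversed(out))

    # lcs_traceback("ABCDEFG", "XZACKDFWGH") == "ACDFG"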

Algorithm Design Techniques
Greedy:
– Solution is built step-by-step, taking the best locally possible step at that point
Divide-and-conquer:
– Problem is broken into several smaller and/or simpler independent subproblems
– Subproblems are solved independently, usually using recursion
– Solutions of subproblems are combined to yield the solution of the original problem
Dynamic programming:
– Problem is broken up into several smaller overlapping subproblems
– Subproblems are solved "bottom-up" (smallest first)
– Subproblem solutions are combined into solutions of increasingly complex subproblems

Origin of Dynamic Programming
Bellman pioneered the systematic study of dynamic programming in the 1950s.
The name "dynamic programming" was given to the technique to market it…
According to Bellman's autobiography, the Secretary of Defense (the source of funding!) at that time was hostile to mathematical research, and the name was chosen because
– "it's impossible to use dynamic in a pejorative sense"
– "something not even a Congressman could object to"

Dynamic Programming Applications
Some famous applications of dynamic programming algorithms:
– Unix diff for comparing two files
– Smith-Waterman for sequence alignment
– Bellman-Ford for shortest path routing in networks
– Cocke-Kasami-Younger for parsing context-free grammars