String Algorithms David Kauchak cs302 Spring 2012.

Slides:



Advertisements
Similar presentations
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Advertisements

Space-for-Time Tradeoffs
Bar Ilan University And Georgia Tech Artistic Consultant: Aviya Amir.
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Data Structures and Algorithms (AT70.02) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: CLRS “Intro.
Yangjun Chen 1 String Matching String matching problem - prefix - suffix - automata - String-matching automata - prefix function - Knuth-Morris-Pratt algorithm.
CS5371 Theory of Computation
UMass Lowell Computer Science Analysis of Algorithms Prof. Karen Daniels Fall, 2006 Wednesday, 12/6/06 String Matching Algorithms Chapter 32.
6-1 String Matching Learning Outcomes Students are able to: Explain naïve, Rabin-Karp, Knuth-Morris- Pratt algorithms Analyse the complexity of these algorithms.
UMass Lowell Computer Science Analysis of Algorithms Prof. Karen Daniels Fall, 2001 Lecture 8 Tuesday, 11/13/01 String Matching Algorithms Chapter.
Pattern Matching II COMP171 Fall Pattern matching 2 A Finite Automaton Approach * A directed graph that allows self-loop. * Each vertex denotes.
String Matching COMP171 Fall String matching 2 Pattern Matching * Given a text string T[0..n-1] and a pattern P[0..m-1], find all occurrences of.
Algorithms for Regulatory Motif Discovery Xiaohui Xie University of California, Irvine.
Pattern Matching COMP171 Spring Pattern Matching / Slide 2 Pattern Matching * Given a text string T[0..n-1] and a pattern P[0..m-1], find all occurrences.
Topics Automata Theory Grammars and Languages Complexities
Great Theoretical Ideas in Computer Science.
On the Use of Regular Expressions for Searching Text Charles L.A. Clarke and Gordon V. Cormack Fast Text Searching.
String Matching. Problem is to find if a pattern P[1..m] occurs within text T[1..n] Simple solution: Naïve String Matching –Match each position in the.
The Rabin-Karp Algorithm String Matching Jonathan M. Elchison 19 November 2004 CS-3410 Algorithms Dr. Shomper.
String Matching Using the Rabin-Karp Algorithm Katey Cruz CSC 252: Algorithms Smith College
1 HEINZ NIXDORF INSTITUTE University of Paderborn Algorithms and Complexity Christian Schindelhauer Search Algorithms Winter Semester 2004/ Oct.
Exact string matching Rhys Price Jones Anne Haake Week 2: Bioinformatics Computing I continued.
KMP String Matching Prepared By: Carlens Faustin.
Advanced Algorithm Design and Analysis (Lecture 3) SW5 fall 2004 Simonas Šaltenis E1-215b
Great Theoretical Ideas in Computer Science.
MCS 101: Algorithms Instructor Neelima Gupta
Application: String Matching By Rong Ge COSC3100
MCS 101: Algorithms Instructor Neelima Gupta
1 Hashing - Introduction Dictionary = a dynamic set that supports the operations INSERT, DELETE, SEARCH Dictionary = a dynamic set that supports the operations.
Rabin-Karp algorithm Robin Visser. What is Rabin-Karp?
String Matching String Matching Problem We introduce a general framework which is suitable to capture an essence of compressed pattern matching according.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
Fundamentals of Informatics
1 String Matching Algorithms Topics  Basics of Strings  Brute-force String Matcher  Rabin-Karp String Matching Algorithm  KMP Algorithm.
Ravello, Settembre 2003Indexing Structures for Approximate String Matching Alessandra Gabriele Filippo Mignosi Antonio Restivo Marinella Sciortino.
Deterministic Finite Automata COMPSCI 102 Lecture 2.
Hashtables David Kauchak cs302 Spring Administrative Midterm must take it by Friday at 6pm No assignment over the break.
Fundamental Data Structures and Algorithms
Machines That Can’t Count CS Lecture 15 b b a b a a a b a b.
String-Matching Problem COSC Advanced Algorithm Analysis and Design
Great Theoretical Ideas In Computer Science John LaffertyCS Fall 2005 Lecture 10Sept Carnegie Mellon University b b a b a a a b a b One.
Great Theoretical Ideas In Computer Science John LaffertyCS Fall 2006 Lecture 22 November 9, 2006Carnegie Mellon University b b a b a a a b a b.
Great Theoretical Ideas in Computer Science for Some.
Lab 6 Problem 1: DNA. DNA Given a string with length N, determine the number of occurrences of some given substrings (with length K) in that string. For.
1/39 COMP170 Tutorial 13: Pattern Matching T: P:.
Great Theoretical Ideas In Computer Science Steven RudichCS Spring 2005 Lecture 9Feb Carnegie Mellon University b b a b a a a b a b One Minute.
Rabin & Karp Algorithm. Rabin-Karp – the idea Compare a string's hash values, rather than the strings themselves. For efficiency, the hash value of the.
1 String Matching Algorithms Mohd. Fahim Lecturer Department of Computer Engineering Faculty of Engineering and Technology Jamia Millia Islamia New Delhi,
CSG523/ Desain dan Analisis Algoritma
Text Search ~ k A R B n f u j ! k e
Hashing & HashMaps CS-2851 Dr. Mark L. Hornick.
Advanced Algorithms Analysis and Design
The Rabin-Karp Algorithm
COMP108 Algorithmic Foundations Polynomial & Exponential Algorithms
Hash table CSC317 We have elements with key and satellite data
Advanced Algorithms Analysis and Design
@#? Text Search g ~ A R B n f u j u q e ! 4 k ] { u "!"
String Processing.
Strings: Tries, Suffix Trees
Rabin & Karp Algorithm.
Tuesday, 12/3/02 String Matching Algorithms Chapter 32
String-Matching Algorithms (UNIT-5)
Chapter 7 Space and Time Tradeoffs
Pattern Matching 12/8/ :21 PM Pattern Matching Pattern Matching
Data Structures and Algorithms (AT70. 02) Comp. Sc. and Inf. Mgmt
Strings: Tries, Suffix Trees
String Processing.
Pattern Matching Pattern Matching 5/1/2019 3:53 PM Spring 2007
Space-for-time tradeoffs
Space-for-time tradeoffs
Presentation transcript:

String Algorithms David Kauchak cs302 Spring 2012

Where did “dynamic programming” come from? Richard Bellman On the Birth of Dynamic Programming Stuart Dreyfus r50/ pdf

Strings Let Σ be an alphabet, e.g. Σ = (, a, b, c, …, z) A string is any member of Σ*, i.e. any sequence of 0 or more members of Σ ‘this is a string’  Σ* ‘this is also a string’  Σ* ‘1234’  Σ*

String operations Given strings s 1 of length n and s 2 of length m Equality: is s 1 = s 2 ? (case sensitive or insensitive) Running time O(n) where n is length of shortest string ‘this is a string’ = ‘this is a string’ ‘this is a string’ ≠ ‘this is another string’ ‘this is a string’ =? ‘THIS IS A STRING’

String operations Concatenate (append): create string s 1 s 2 Running time (assuming we generate a new string) Θ(n+m) ‘this is a’. ‘ string’ → ‘this is a string’

String operations Substitute: Exchange all occurrences of a particular character with another character Running time Θ(n) Substitute(‘this is a string’, ‘i’, ‘x’) → ‘thxs xs a strxng’ Substitute(‘banana’, ‘a’, ‘o’) → ‘bonono’

String operations Length: return the number of characters/symbols in the string Running time O(1) or Θ(n) depending on implementation Length(‘this is a string’) → 16 Length(‘this is another string’) → 24

String operations Prefix: Get the first j characters in the string Running time Θ(j) Suffix: Get the last j characters in the string Running time Θ(j) Prefix(‘this is a string’, 4) → ‘this’ Suffix(‘this is a string’, 6) → ‘string’

String operations Substring – Get the characters between i and j inclusive Running time Θ(j – i + 1) Prefix: Prefix(S, i) = Substring(S, 1, i) Suffix: Suffix(S, i) = Substring(S, i+1, length(n)) Substring(‘this is a string’, 4, 8) → ‘s is ’

Edit distance (aka Levenshtein distance) Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s 1 into string s 2 Insertion: ABACEDABACCEDDABACCED Insert ‘C’Insert ‘D’

Edit distance (aka Levenshtein distance) Deletion: ABACED Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s 1 into string s 2

Edit distance (aka Levenshtein distance) Deletion: ABACEDBACED Delete ‘A’ Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s 1 into string s 2

Edit distance (aka Levenshtein distance) Deletion: ABACEDBACEDBACE Delete ‘A’Delete ‘D’ Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s 1 into string s 2

Edit distance (aka Levenshtein distance) Substitution: ABACEDABADEDABADES Sub ‘D’ for ‘C’Sub ‘S’ for ‘D’ Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s 1 into string s 2

Edit distance examples Edit(Kitten, Mitten) =1 Operations: Sub ‘M’ for ‘K’Mitten

Edit distance examples Edit(Happy, Hilly) =3 Operations: Sub ‘a’ for ‘i’Hippy Sub ‘l’ for ‘p’Hilpy Sub ‘l’ for ‘p’Hilly

Edit distance examples Edit(Banana, Car) =5 Operations: Delete ‘B’anana Delete ‘a’nana Delete ‘n’naa Sub ‘C’ for ‘n’Caa Sub ‘a’ for ‘r’Car

Edit distance examples Edit(Simple, Apple) =3 Operations: Delete ‘S’imple Sub ‘A’ for ‘i’Ample Sub ‘m’ for ‘p’Apple

Edit distance Why might this be useful?

Is edit distance symmetric? that is, is Edit(s 1, s 2 ) = Edit(s 2, s 1 )? Why? sub ‘i’ for ‘j’ → sub ‘j’ for ‘i’ delete ‘i’ → insert ‘i’ insert ‘i’ → delete ‘i’ Edit(Simple, Apple) =? Edit(Apple, Simple)

Calculating edit distance X = A B C B D A B Y = B D C A B A Ideas?

Calculating edit distance X = A B C B D A ? Y = B D C A B ? After all of the operations, X needs to equal Y

Calculating edit distance X = A B C B D A ? Y = B D C A B ? Operations:Insert Delete Substitute

Insert X = A B C B D A ? Y = B D C A B ?

Insert X = A B C B D A ? Y = B D C A B ? Edit

Delete X = A B C B D A ? Y = B D C A B ?

Delete X = A B C B D A ? Y = B D C A B ? Edit

Substition X = A B C B D A ? Y = B D C A B ?

Substition X = A B C B D A ? Y = B D C A B ? Edit

Anything else? X = A B C B D A ? Y = B D C A B ?

Equal X = A B C B D A ? Y = B D C A B ?

Equal X = A B C B D A ? Y = B D C A B ? Edit

Combining results Insert: Delete: Substitute: Equal:

Combining results

Running time Θ(nm)

Variants Only include insertions and deletions What does this do to substitutions? Include swaps, i.e. swapping two adjacent characters counts as one edit Weight insertion, deletion and substitution differently Weight specific character insertion, deletion and substitutions differently Length normalize the edit distance

String matching Given a pattern string P of length m and a string S of length n, find all locations where P occurs in S P = ABA S = DCABABBABABA

String matching P = ABA S = DCABABBABABA Given a pattern string P of length m and a string S of length n, find all locations where P occurs in S

Uses grep/egrep search find java.lang.String.contains()

Naive implementation

Is it correct?

Running time? What is the cost of the equality check? Best case: O(1) Worst case: O(m)

Running time? Best case Θ(n) – when the first character of the pattern does not occur in the string Worst case O((n-m+1)m)

Worst case P = AAAA S = AAAAAAAAAAAAA

Worst case P = AAAA S = AAAAAAAAAAAAA

Worst case P = AAAA S = AAAAAAAAAAAAA

Worst case P = AAAA S = AAAAAAAAAAAAA repeated work!

Worst case P = AAAA S = AAAAAAAAAAAAA Ideally, after the first match, we’d know to just check the next character to see if it is an ‘A’

Patterns Which of these patterns will have that problem? P = ABAB P = ABDC P = BAA P = ABBCDDCAABB

Patterns Which of these patterns will have that problem? P = ABAB P = ABDC P = BAA P = ABBCDDCAABB If the pattern has a suffix that is also a prefix then we will have this problem

Finite State Automata (FSA) An FSA is defined by 5 components Q is the set of states q0q0 q1q1 q2q2 qnqn …

Finite State Automata (FSA) An FSA is defined by 5 components Q is the set of states q 0 is the start state A  Q, is the set of accepting states where |A| > 0 Σ is the alphabet (e.g. {A, B}  is the transition function from Q x Σ to Q q0q0 q1q1 q2q2 qnqn … Q Σ Q q 0 A q 1 q 0 B q 2 q 1 A q 1 … q0q0 q1q1 q2q2 A B A … q7q7

FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A An FSA starts at state q 0 and reads the characters of the input string one at a time. If the automaton is in state q and reads character a, then it transitions to state  (q,a). If the FSA reaches an accepting state (q  A), then the FSA has found a match.

FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A What pattern does this represent? P = ABA

FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA

FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA

FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA

FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA

FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA

FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA

FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA

FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA

FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA

FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA

FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA

FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA

Suffix function The suffix function σ(x,y) is the length of the longest suffix of x that is a prefix of y σ(abcdab, ababcd) = ?

Suffix function The suffix function σ(x,y) is the length of the longest suffix of x that is a prefix of y σ(abcdab, ababcd) = 2

Suffix function The suffix function σ(x,y) is the index of the longest suffix of x that is a prefix of y σ(daabac, abacac) = ?

Suffix function σ(daabac, abacac) = 4 The suffix function σ(x,y) is the length of the longest suffix of x that is a prefix of y

Suffix function σ(dabb, abacd) = ? The suffix function σ(x,y) is the length of the longest suffix of x that is a prefix of y

Suffix function σ(dabb, abacd) = 0 The suffix function σ(x,y) is the length of the longest suffix of x that is a prefix of y

Building a string matching automata Given a pattern P = p 1, p 2, …, p m, we’d like to build an FSA that recognizes P in strings P = ababaca Ideas?

Building a string matching automata Q = q 1, q 2, …, q m corresponding to each symbol, plus a q 0 starting state the set of accepting states, A = {q m } vocab Σ all symbols in P, plus one more representing all symbols not in P The transition function for q  Q and a  Σ is defined as:  (q, a) = σ(p 1…q a, P) P = ababaca

Transition function  (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 ?a q1q1 b q2q2 a q3q3 b q4q4 a q5q5 c q6q6 a q7q7 σ(a, ababaca)

Transition function  (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 1?a q1q1 b q2q2 a q3q3 b q4q4 a q5q5 c q6q6 a q7q7 σ(b, ababaca)

Transition function  (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 10?a q1q1 b q2q2 a q3q3 b q4q4 a q5q5 c q6q6 a q7q7 σ(b, ababaca)

Transition function  (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 100a q1q1 b q2q2 a q3q3 b q4q4 a q5q5 c q6q6 a q7q7 σ(b, ababaca)

Transition function  (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 100a q1q1 b q2q2 a q3q3 b q4q4 a q5q5 c q6q6 a q7q7 q0q0 q1q1 A B,C

Transition function  (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 100a q1q1 120b q2q2 300a q3q3 ?b q4q4 a q5q5 c q6q6 a q7q7 We’ve seen ‘aba’ so far σ(abaa, ababaca)

Transition function  (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 100a q1q1 120b q2q2 300a q3q3 1b q4q4 a q5q5 c q6q6 a q7q7 We’ve seen ‘aba’ so far σ(abaa, ababaca)

Transition function  (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 100a q1q1 120b q2q2 300a q3q3 140b q4q4 500a q5q5 1?c q6q6 a q7q7 We’ve seen ‘ababa’ so far

Transition function  (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 100a q1q1 120b q2q2 300a q3q3 140b q4q4 500a q5q5 1?c q6q6 a q7q7 We’ve seen ‘ababa’ so far σ(ababab, ababaca)

Transition function  (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 100a q1q1 120b q2q2 300a q3q3 140b q4q4 500a q5q5 14c q6q6 a q7q7 We’ve seen ‘ababa’ so far σ(ababab, ababaca)

Transition function  (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 100a q1q1 120b q2q2 300a q3q3 140b q4q4 500a q5q5 146c q6q6 700a q7q7 120

Matching runtime Once we’ve built the FSA, what is the runtime? Θ(n) - Each symbol causes a state transition and we only visit each character once What is the cost to build the FSA? How many entries in the table? Ω(m|Σ|) How long does it take to calculate the suffix function at each entry? Naïve: O(m) Overall naïve: O(m 2 |Σ|) Overall fast implementation O(m|Σ|)

Rabin-Karp algorithm P = ABA S = BABABBABABA - Use a function T to that computes a numerical representation of P - Calculate T for all m symbol sequences of S and compare

P = ABA S = BABABBABABA Hash P T(P) Rabin-Karp algorithm - Use a function T to that computes a numerical representation of P - Calculate T for all m symbol sequences of S and compare

P = ABA S = BABABBABABA Hash m symbol sequences and compare T(P) Rabin-Karp algorithm - Use a function T to that computes a numerical representation of P - Calculate T for all m symbol sequences of S and compare T(BAB) =

P = ABA S = BABABBABABA Hash m symbol sequences and compare T(P) match Rabin-Karp algorithm - Use a function T to that computes a numerical representation of P - Calculate T for all m symbol sequences of S and compare T(ABA) =

P = ABA S = BABABBABABA Hash m symbol sequences and compare T(P) Rabin-Karp algorithm - Use a function T to that computes a numerical representation of P - Calculate T for all m symbol sequences of S and compare T(BAB) =

P = ABA S = BABABBABABA Hash m symbol sequences and compare T(P) … Rabin-Karp algorithm - Use a function T to that computes a numerical representation of P - Calculate T for all m symbol sequences of S and compare T(BAB) =

Rabin-Karp algorithm P = ABA S = BABABBABABA For this to be useful/efficient, what needs to be true about T? T(P) … T(BAB) =

Rabin-Karp algorithm P = ABA S = BABABBABABA For this to be useful/efficient, what needs to be true about T? T(P) … Given T(s i…i+m-1 ) we must be able to efficiently calculate T(s i+1…i+m ) T(BAB) =

Calculating the hash function For simplicity, assume Σ = (0, 1, 2, …, 9). (in general we can use a base larger than 10). A string can then be viewed as a decimal number How do we efficiently calculate the numerical representation of a string? T(‘ ’) = ?

Horner’s rule * 10 = 90 (90 + 8)*10 = 980 ( )*10 = 9840 ( )*10 = … =

Horner’s rule * 10 = 90 (90 + 8)*10 = 980 ( )*10 = 9840 ( )*10 = … = Running time? Θ(m)

Calculating the hash on the string m = 4 Given T(s i…i+m-1 ) how can we efficiently calculate T(s i+1…i+m )? T(s i…i+m-1 )

Calculating the hash on the string m = 4 Given T(s i…i+m-1 ) how can we efficiently calculate T(s i+1…i+m )? T(s i…i+m-1 ) subtract highest order digit 801

Calculating the hash on the string m = 4 Given T(s i…i+m-1 ) how can we efficiently calculate T(s i+1…i+m )? T(s i…i+m-1 ) shift digits up 8010

Calculating the hash on the string m = 4 T(s i…i+m-1 ) add in the lowest digit 8015 Given T(s i…i+m-1 ) how can we efficiently calculate T(s i+1…i+m )?

Calculating the hash on the string m = 4 T(s i…i+m-1 ) Running time? - Θ(m) for s 1…m - O(1) for the rest Given T(s i…i+m-1 ) how can we efficiently calculate T(s i+1…i+m )?

Algorithm so far… Is it correct? Each string has a unique numerical value and we compare that with each value in the string Running time Preprocessing: Θ(m) Matching Θ(n-m+1) Is there any problem with this analysis?

Algorithm so far… Is it correct? Each string has a unique numerical value and we compare that with each value in the string Running time Preprocessing: Θ(m) Matching Θ(n-m+1) How long does the check T(P) = T(s i…i+m-1 ) take?

Modular arithmetics The run time assumptions we made were assuming arithmetic operations were constant time, which is not true for large numbers To keep the numbers small, we’ll use modular arithmetics, i.e. all operations are performed mod q a+b = (a+b) mod q a*b = (a*b) mod q …

Modular arithmetics If T(A) = T(B), then T(A) mod q = T(B) mod q In general, we can apply mods as many times as we want and we will not effect the result What is the downside to this modular approach? Spurious hits: if T(A) mod q = T(B) mod q that does not necessarily mean that T(A) = T(B) If we find a hit, we must check that the actual string matches the pattern

Runtime Preprocessing Θ(m) Running time Best case: Θ(n-m+1) – No matches and no spurious hits Worst case Θ((n-m+1)m)

Average case running time Assume v valid matches in the string What is the probability of a spurious hit? As with hashing, assume a uniform mapping onto values of q: What is the probability under this assumption? 1…q Σ*Σ*

Average case running time Assume v valid matches in the string What is the probability of a spurious hit? As with hashing, assume a uniform mapping onto values of q: What is the probability under this assumption? 1/q 1…q Σ*Σ*

Average case running time How many spurious hits? n/q Average case running time: O(n-m+1) + O(m(v+n/q)) iterate over the positions checking matches and spurious hits

Matching running times AlgorithmPreprocessing timeMatching time Naïve0O((n-m+1)m) FSAΘ(m|Σ|)Θ(n) Rabin-KarpΘ(m)O(n)+O(m(v+n/q)) Knuth-Morris-PrattΘ(m)Θ(n)