Chapter 9: Text Processing Pattern Matching Data Compression

Outline and Reading
- Strings (§9.1.1)
- Pattern matching algorithms
  - Brute-force algorithm (§9.1.2)
  - Knuth-Morris-Pratt algorithm (§9.1.4)
- Regular Expressions and Finite Automata
- Data Compression
  - Huffman Coding
  - Lempel-Ziv Compression

Motivation: Bioinformatics
The application of computer science techniques to genetic data (see the Gene-Finding notes).
Many interesting algorithmic problems. Many interesting ethical issues!

Strings
A string is a sequence of characters. Examples of strings: a Java program, an HTML document, a DNA sequence, a digitized image.
An alphabet Σ is the set of possible characters for a family of strings. Examples of alphabets: ASCII, Unicode, {0, 1}, {A, C, G, T}.
Let P be a string of size m:
- A substring P[i..j] of P is the subsequence of P consisting of the characters with ranks between i and j.
- A prefix of P is a substring of the form P[0..i].
- A suffix of P is a substring of the form P[i..m-1].
Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P.
Applications: text editors, regular expressions, search engines, biological research.
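As a concrete illustration of these index conventions (a small sketch in Python, which is assumed here; note that P[i..j] is inclusive of both ends, unlike Python's half-open slices):

```python
P = "abracadabra"          # a string over the alphabet of lowercase letters
m = len(P)

def substring(P, i, j):
    """P[i..j]: the characters of P with ranks i through j, inclusive."""
    return P[i:j + 1]

prefix = substring(P, 0, 3)        # P[0..3]   -> "abra", a prefix of P
suffix = substring(P, 7, m - 1)    # P[7..m-1] -> "abra", a suffix of P
print(prefix, suffix)
```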

Brute-Force Algorithm
The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either a match is found, or all placements of the pattern have been tried.
Brute-force pattern matching runs in time O(nm).
Example of worst case: T = aaa…ah, P = aaah. Such inputs may occur in images and DNA sequences, but are unlikely in English text.

Algorithm BruteForceMatch(T, P)
  Input: text T of size n and pattern P of size m
  Output: starting index of a substring of T equal to P, or -1 if no such substring exists
  for i ← 0 to n - m        { test shift i of the pattern }
    j ← 0
    while j < m and T[i + j] = P[j]
      j ← j + 1
    if j = m
      return i              { match at i }
    else
      break while loop      { mismatch }
  return -1                 { no match anywhere }
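A minimal runnable sketch of this brute-force matcher (Python is an assumption here; the slides themselves use pseudocode, and the test strings are illustrative):

```python
def brute_force_match(T, P):
    """Return the starting index of the first occurrence of P in T, or -1."""
    n, m = len(T), len(P)
    for i in range(n - m + 1):      # test shift i of the pattern
        j = 0
        while j < m and T[i + j] == P[j]:
            j += 1
        if j == m:
            return i                # match at shift i
    return -1                       # no match anywhere

print(brute_force_match("abacaabaccabacabaabb", "abacab"))  # 10
```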

The KMP Algorithm - Motivation
Knuth-Morris-Pratt's algorithm compares the pattern to the text left-to-right, but shifts the pattern more intelligently than the brute-force algorithm.
When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?
Answer: the largest prefix of P[0..j] that is a suffix of P[1..j].
[Figure: a mismatch at text character x while matching P = abaaba; after the shift there is no need to repeat the comparisons already made, and comparing resumes at the mismatch position.]

KMP Failure Function
Knuth-Morris-Pratt's algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j].
Knuth-Morris-Pratt's algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i], we set j ← F(j - 1).

  j      0  1  2  3  4  5
  P[j]   a  b  a  a  b  a
  F(j)   0  0  1  1  2  3

The KMP Algorithm
The failure function can be represented by an array and can be computed in O(m) time.
At each iteration of the while-loop, either i increases by one, or the shift amount i - j increases by at least one (observe that F(j - 1) < j). Hence, there are no more than 2n iterations of the while-loop.
Thus, KMP's algorithm runs in optimal time O(m + n).

Algorithm KMPMatch(T, P)
  F ← failureFunction(P)
  i ← 0
  j ← 0
  while i < n
    if T[i] = P[j]
      if j = m - 1
        return i - j        { match }
      else
        i ← i + 1
        j ← j + 1
    else
      if j > 0
        j ← F[j - 1]
      else
        i ← i + 1
  return -1                 { no match }

Computing the Failure Function
The failure function can be represented by an array and can be computed in O(m) time.
The construction is similar to the KMP algorithm itself.
At each iteration of the while-loop, either i increases by one, or the shift amount i - j increases by at least one (observe that F(j - 1) < j). Hence, there are no more than 2m iterations of the while-loop.

Algorithm failureFunction(P)
  F[0] ← 0
  i ← 1
  j ← 0
  while i < m
    if P[i] = P[j]
      { we have matched j + 1 chars }
      F[i] ← j + 1
      i ← i + 1
      j ← j + 1
    else if j > 0
      { use failure function to shift P }
      j ← F[j - 1]
    else
      F[i] ← 0              { no match }
      i ← i + 1
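The two routines above, translated into a runnable Python sketch (same logic, 0-based indices; the test strings are illustrative):

```python
def failure_function(P):
    """F[j] = size of the largest prefix of P[0..j] that is also a suffix of P[1..j]."""
    m = len(P)
    F = [0] * m
    i, j = 1, 0
    while i < m:
        if P[i] == P[j]:        # we have matched j + 1 characters
            F[i] = j + 1
            i += 1
            j += 1
        elif j > 0:             # use the failure function to shift P
            j = F[j - 1]
        else:                   # no match
            F[i] = 0
            i += 1
    return F

def kmp_match(T, P):
    """Return the starting index of the first occurrence of P in T, or -1."""
    n, m = len(T), len(P)
    F = failure_function(P)
    i = j = 0
    while i < n:
        if T[i] == P[j]:
            if j == m - 1:
                return i - j    # match
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]
        else:
            i += 1
    return -1                   # no match

print(failure_function("abacab"))                    # [0, 0, 1, 0, 1, 2], as in the example that follows
print(kmp_match("abacaabaccabacabaabb", "abacab"))   # 10
```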

Example

  j      0  1  2  3  4  5
  P[j]   a  b  a  c  a  b
  F(j)   0  0  1  0  1  2

More Complex Patterns
Suppose you want to find repeated ATs followed by a G in GAGATATATATCATATG. How do you express the pattern you want to find? How can you find it efficiently? What if the strings were billions of characters long?

Finite Automata and Regular Expressions
How do I match Perl-like regular expressions to text? Important topic: regular expressions and finite automata.
- To a theoretician: regular expressions are grammars that define regular languages.
- To a programmer: compact patterns for matching and replacing text.

Regular Expressions
A regular expression is one of:
- a literal character
- a (regular expression) in parentheses
- a concatenation of two REs
- the alternation ("or") of two REs, denoted + in formal notation
- the closure of an RE, denoted * (i.e. 0 or more occurrences)
- possibly additional syntactic sugar
Examples:
- abracadabra
- abra(cadabra)* = {abra, abracadabra, abracadabracadabra, …}
- (a*b + ac)d
- (a(a+b)b*)*
- t(w+o)?o   [? means 0 or 1 occurrence in Perl]
- aa+rdvark  [+ means 1 or more occurrences in Perl]
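A couple of the patterns above can be tried directly with Python's re module (a small sketch; Python, like Perl, writes alternation as | or a character class where the formal notation uses +):

```python
import re

# ? = 0 or 1 occurrence, + = 1 or more occurrences, as in Perl
print(re.findall(r"t[wo]?o", "to two too tomorrow"))   # ['to', 'two', 'too', 'to']
print(bool(re.fullmatch(r"aa+rdvark", "aaaardvark")))  # True
print(bool(re.fullmatch(r"(a*b|ac)d", "aaabd")))       # True: matches (a*b + ac)d
```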

Finite Automata
Regular language: any language defined by an RE.
Finite automata: machines that recognize regular languages.
Deterministic Finite Automaton (DFA):
- a set of states, including a start state and one or more accepting states
- a transition function: given the current state and input letter, what's the new state?
Non-deterministic Finite Automaton (NDFA): like a DFA, but with
- possibly more than one transition out of a state on the same letter (pick the right one non-deterministically, i.e. via lucky guess!)
- epsilon-transitions, i.e. optional transitions on no input letter

DFA for (AT)+C
Note that a DFA can be represented as a 2D array: DFA[state][inputLetter] → newState.

  state   letter    newState
  0       A         1
  0       T,C,G     0
  1       T         2
  1       A,C,G     0
  2       C         4  [accept]
  2       G,T       0
  2       A         3
  3       T         2
  3       A,G,C     0
  4       A,G,C,T   0
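The table can be transcribed into code and driven by the O(n) loop described under "Recognizing Regular Languages" below (a sketch; treating state 4 as the only accepting state follows the [accept] mark in the table):

```python
# DFA[state][letter] -> new state, transcribed from the table above
DFA = {
    0: {'A': 1, 'T': 0, 'C': 0, 'G': 0},
    1: {'T': 2, 'A': 0, 'C': 0, 'G': 0},
    2: {'C': 4, 'A': 3, 'G': 0, 'T': 0},
    3: {'T': 2, 'A': 0, 'G': 0, 'C': 0},
    4: {'A': 0, 'G': 0, 'C': 0, 'T': 0},
}
ACCEPTING = {4}

def run_dfa(s, start=0):
    """Feed each input letter through the table; accept if we end in an accepting state."""
    state = start
    for ch in s:
        state = DFA[state][ch]
    return state in ACCEPTING

print(run_dfa("ATATC"))   # True:  (AT)(AT)C
print(run_dfa("ATCC"))    # False
```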

RE → NDFA
Given a regular expression, how can I build a DFA? Work bottom up, building an automaton fragment for each sub-expression.
[Diagrams: NDFA fragments for a single letter, a concatenation, an "or" (alternation), and a closure.]

RE → NDFA Example
Construct an NDFA for the RE (A*B + AC)D, building it up from sub-expressions:
[Diagrams showing the NDFA after each step: A, then A*, then A*B, then A*B + AC, then (A*B + AC)D.]

NDFA → DFA
Keep track of the set of states you are in. On each new input letter, compute the new set of states you could be in.
The set of states for the DFA is the power set of the NDFA states, i.e. up to 2^n states, where there were n in the NDFA.
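A compact sketch of that subset construction (the representation is an assumption, not from the slides: the NDFA is a dict mapping (state, letter) to a set of successor states, with None as the letter for epsilon-transitions):

```python
def epsilon_closure(nfa, states):
    """All states reachable from `states` using only epsilon (None) transitions."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, None), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction(nfa, start, accepting, alphabet):
    """Build a DFA whose states are sets of NDFA states (the power-set idea above)."""
    dfa_start = epsilon_closure(nfa, {start})
    dfa_trans, dfa_accept = {}, set()
    todo, seen = [dfa_start], {dfa_start}
    while todo:
        S = todo.pop()
        if S & accepting:                       # a subset containing an accepting NDFA state accepts
            dfa_accept.add(S)
        for a in alphabet:
            moved = set().union(*[nfa.get((s, a), set()) for s in S])
            T = epsilon_closure(nfa, moved)
            dfa_trans[(S, a)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    return dfa_start, dfa_trans, dfa_accept

# Tiny example: an NDFA for A*B -- state 0 loops on A and moves to accepting state 1 on B
nfa = {(0, 'A'): {0}, (0, 'B'): {1}}
start, trans, accept = subset_construction(nfa, 0, {1}, {'A', 'B'})
print(start in accept, len(accept))   # False 1 -- only the subset containing state 1 accepts
```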

Recognizing Regular Languages
Suppose your language is given by a DFA. How do you recognize a string?
- Build a table with one row for every (state, input letter) pair, giving the resulting state.
- For each letter of the input string, compute the new state.
- When done, check whether the last state is an accepting state.
Runtime? O(n), where n is the number of input letters.
Another approach: use a C program to simulate the NDFA with backtracking. Less space, more time. (egrep vs. fgrep?)

Examples
Unix grep REs. Perl and other languages:
$input =~ s/t[wo]?o/2/;
$input =~ s| ]*>\s*||gs;
$input =~ s|\s*mso-[^>"]*"|"|gis;
$input =~ s/([^ ]+) +([^ ]+)/$2 $1/;
$input =~ m/^[0-9]+\.?[0-9]*|\.[0-9]+$/;
($word1,$word2,$rest) = ($foo =~ m/^ *([^ ]+) +([^ ]+) +(.*)$/);
$input =~ s| ]*>\s* ]*>\s*| |gis;

Data Compression: Intro
Suppose you have a text, abracadabra, and want to compress it. How many bits are required?
At 3 bits per letter: 33 bits. Can we do better?
How about variable-length codes? In order to be able to decode the file again, we would need a prefix code: no code is the prefix of another.
How do we make a prefix code that compresses the text?

Huffman Coding
Note: put the letters at the leaves of a binary tree, with left = 0 and right = 1. Voila! A prefix code.
Huffman coding gives an optimal prefix code.
Algorithm: use a priority queue.
- insert all letters according to their frequencies
- while there is more than one tree left: a = deleteMin(); b = deleteMin(); make a tree t out of a and b with weight a.weight() + b.weight(); insert(t)
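A runnable sketch of this priority-queue construction (Python's heapq stands in for the priority queue; the insertion-counter tie-break is an implementation detail, not something the slides specify):

```python
import heapq

def huffman_code(freqs):
    """Build an optimal prefix code from a dict of letter -> frequency."""
    heap = [(w, i, ch) for i, (ch, w) in enumerate(freqs.items())]   # the leaves
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        wa, _, a = heapq.heappop(heap)                    # a = deleteMin()
        wb, _, b = heapq.heappop(heap)                    # b = deleteMin()
        heapq.heappush(heap, (wa + wb, counter, (a, b)))  # tree t with weight wa + wb
        counter += 1
    _, _, tree = heap[0]

    codes = {}
    def assign(node, prefix):                             # left = 0, right = 1
        if isinstance(node, tuple):
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    assign(tree, "")
    return codes

freqs = {'a': 5, 'b': 2, 'c': 1, 'd': 1, 'r': 2}          # abracadabra
codes = huffman_code(freqs)
print(codes)
print(sum(freqs[ch] * len(codes[ch]) for ch in freqs))    # 23 bits total, as in the example below
```

The exact codewords can differ from the ones on the next slide depending on tie-breaking, but the total of 23 bits is the same for any optimal code on these frequencies.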

Huffman coding example
abracadabra. Frequencies: a: 5, b: 2, c: 1, d: 1, r: 2
Huffman code: a: 0, b: 100, c: 1010, d: 1011, r: 11
Bits: 5×1 + 2×3 + 1×4 + 1×4 + 2×2 = 23
Follow the tree to decode: Θ(n).
Time to encode?
- Compute frequencies: O(n)
- Build heap: O(1), assuming the alphabet has constant size
- Encode: O(n)

Huffman coding summary
Huffman coding is very frequently used. (You use it every time you watch HDTV or listen to an mp3, for example.)
Text files often compress to about 60% of their original size (depending on entropy).
In real life, Huffman coding is usually used in conjunction with a modeling algorithm…

Data compression overview
Two stages: modeling and entropy coding.
- Modeling: break up the input into tokens or chunks (the bigger, the better).
- Entropy coding: use shorter bit strings to represent more frequent tokens.
If P is the probability of a code element, the optimal number of bits is -lg(P).
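A quick back-of-the-envelope check of that formula on the abracadabra frequencies from the earlier example (my calculation, not from the slides):

```python
import math

freqs = {'a': 5, 'b': 2, 'c': 1, 'd': 1, 'r': 2}
n = sum(freqs.values())                                   # 11 letters
optimal = sum(f * -math.log2(f / n) for f in freqs.values())
print(round(optimal, 2))   # 22.44 -- the 23-bit Huffman code from the example is close to this bound
```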

Lempel-Ziv Modeling
Consider compressing text. Certain byte strings are more frequent than others: "the", "and", "tion", "es", etc. Model these with single tokens.
Build a dictionary of the byte strings you see; the second time you see a byte string, use its dictionary entry.

Lempel-Ziv Compression
Start with a dictionary of 256 entries for the first 256 characters.
At each step:
- Output the code of the longest dictionary match and delete those characters from the input.
- Add a new dictionary entry with code 256, 257, 258, … formed from the last two tokens (more precisely, the previous match plus the first character of the current match, as the example on the next slide and the sketch after it show).
Note that code lengths grow by one bit as the dictionary reaches size 512, 1024, 2048, etc.

Lempel-Ziv Example

  Output   Add to Dict
  #(C)     -
  #(O)     CO
  #(CO)    OC
  #(A)     COA
  #(_)     A_
  #(A)     _A
  #(N)     AN
  #(D)     ND
  #(_)     D_
  #(B)     _B
  #(AN)    BA
  #(AN)    ANA
  #(A)     ANA
  #(S)     AS
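A runnable sketch of the compressor, following the rule above (new entry = previous match + first character of the current match; single characters keep their ordinal values as codes). Concatenating the example's outputs gives the input COCOA_AND_BANANAS, and running the sketch on it reproduces the table's matches:

```python
def lzw_compress(text):
    """At each step, output the code of the longest dictionary match."""
    dictionary = {chr(i): i for i in range(256)}   # codes 0..255 for single characters
    next_code = 256
    output, matches = [], []
    i, prev = 0, None
    while i < len(text):
        # find the longest dictionary entry starting at position i
        j = i + 1
        while j < len(text) and text[i:j + 1] in dictionary:
            j += 1
        match = text[i:j]
        output.append(dictionary[match])
        matches.append(match)
        if prev is not None:
            # new entry: previous match + first character of the current match
            # (duplicates like ANA can occur; the Variations slide mentions suppressing them)
            dictionary[prev + match[0]] = next_code
            next_code += 1
        prev = match
        i += len(match)
    return output, matches

codes, matches = lzw_compress("COCOA_AND_BANANAS")
print(matches)   # ['C', 'O', 'CO', 'A', '_', 'A', 'N', 'D', '_', 'B', 'AN', 'AN', 'A', 'S']
```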

Lempel-Ziv Variations
Compression programs like zip and gzip all use variations on Lempel-Ziv. Possible variations:
- Fixed-length vs. variable-length codes, or adaptive Huffman or arithmetic coding
- Don't add duplicate entries to the dictionary
- Limit the number of codes, or switch to larger ones as needed
- Delete less frequent dictionary entries, or give frequent entries shorter codes

How about this approach:
Repeat:
  for each letter pair occurring in the text, try:
    - replace the pair with a single new token
    - measure the total entropy (Huffman-compressed size) of the file
    - if that letter pair resulted in the greatest reduction in entropy so far, remember it
  permanently substitute a new token for the pair that caused the greatest reduction in entropy
until no more reductions in entropy are possible.
Results: compression to about 25% for big books: better than gzip and zip. [But not as good as bzip!]
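A small sketch of that greedy pair-replacement loop. Two assumptions of mine: candidates are scored by Shannon entropy in bits (a stand-in for the Huffman-compressed size the slide measures), and the "file" is just a Python list of tokens:

```python
import math
from collections import Counter

def total_entropy_bits(tokens):
    """Approximate compressed size: sum over tokens of -lg(probability)."""
    counts = Counter(tokens)
    n = len(tokens)
    return sum(c * -math.log2(c / n) for c in counts.values())

def replace_pair(tokens, pair, new_token):
    """Replace every non-overlapping occurrence of the adjacent pair with new_token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def pair_compress(text):
    tokens = list(text)
    next_id = 0
    while True:
        best_pair, best_size = None, total_entropy_bits(tokens)
        for pair in set(zip(tokens, tokens[1:])):          # every adjacent token pair
            candidate = replace_pair(tokens, pair, ("T", next_id))
            size = total_entropy_bits(candidate)
            if size < best_size:                           # greatest reduction so far
                best_pair, best_size = pair, size
        if best_pair is None:                              # no pair reduces the entropy
            return tokens
        tokens = replace_pair(tokens, best_pair, ("T", next_id))  # substitute permanently
        next_id += 1

tokens = pair_compress("abracadabra abracadabra abracadabra")
print(len(tokens), round(total_entropy_bits(tokens), 1))
```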

Compressing other data
Modeling for audio? Modeling for images?

Modeling for Images?
[Image from Wikipedia]

JPEG, etc.
Modeling: convert to the frequency domain with the DCT.
- Throw away some high-frequency components
- Throw away imperceptible components
- Quantize coefficients
- Encode the remaining coefficients with Huffman coding
Results: up to 20:1 compression with good, still-recognizable results.
How the DCT changed the world…

Data compression results
The best algorithms compress text to about 25% of its original size, but humans can compress it to 10%.
Humans have far better modeling algorithms because they have better pattern recognition and higher-level patterns to recognize.
Intelligence ≈ pattern recognition ≈ data compression?
Going further: Data-Compression.com