13 Text Processing Hongfei Yan June 1, 2016.

Slides:



Advertisements
Similar presentations
© 2004 Goodrich, Tamassia Pattern Matching1. © 2004 Goodrich, Tamassia Pattern Matching2 Strings A string is a sequence of characters Examples of strings:
Advertisements

Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Greedy Algorithms Amihood Amir Bar-Ilan University.
Tries Search for ‘bell’ O(n) by KMP algorithm O(dm) in a trie Tries
Comp. Eng. Lab III (Software), Pattern Matching1 Pattern Matching Dr. Andrew Davison WiG Lab (teachers room), CoE ,
Compression & Huffman Codes
Pattern Matching1. 2 Outline and Reading Strings (§9.1.1) Pattern matching algorithms Brute-force algorithm (§9.1.2) Boyer-Moore algorithm (§9.1.3) Knuth-Morris-Pratt.
Data Structures Lecture 3 Fang Yu Department of Management Information Systems National Chengchi University Fall 2010.
Goodrich, Tamassia String Processing1 Pattern Matching.
© 2004 Goodrich, Tamassia Greedy Method and Compression1 The Greedy Method and Text Compression.
© 2004 Goodrich, Tamassia Greedy Method and Compression1 The Greedy Method and Text Compression.
Chapter 9: Greedy Algorithms The Design and Analysis of Algorithms.
A Data Compression Algorithm: Huffman Compression
CSC 213 Lecture 18: Tries. Announcements Quiz results are getting better Still not very good, however Average score on last quiz was 5.5 Every student.
6/26/2015 7:13 PMTries1. 6/26/2015 7:13 PMTries2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3) Huffman encoding.
CS 206 Introduction to Computer Science II 12 / 10 / 2008 Instructor: Michael Eckmann.
Pattern Matching 4/17/2017 7:14 AM Pattern Matching Pattern Matching.
1 prepared from lecture material © 2004 Goodrich & Tamassia COMMONWEALTH OF AUSTRALIA Copyright Regulations 1969 WARNING This material.
Pattern Matching1. 2 Outline Strings Pattern matching algorithms Brute-force algorithm Boyer-Moore algorithm Knuth-Morris-Pratt algorithm.
String Matching. Problem is to find if a pattern P[1..m] occurs within text T[1..n] Simple solution: Naïve String Matching –Match each position in the.
Chapter 9: Text Processing Pattern Matching Data Compression.
Text Processing 1 Last Update: July 31, Topics Notations & Terminology Pattern Matching – Brute Force – Boyer-Moore Algorithm – Knuth-Morris-Pratt.
Data Compression1 File Compression Huffman Tries ABRACADABRA
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
CSC401 – Analysis of Algorithms Chapter 9 Text Processing
20/10/2015Applied Algorithmics - week31 String Processing  Typical applications: pattern matching/recognition molecular biology, comparative genomics,
Introduction to Algorithms Chapter 16: Greedy Algorithms.
Strings and Pattern Matching Algorithms Pattern P[0..m-1] Text T[0..n-1] Brute Force Pattern Matching Algorithm BruteForceMatch(T,P): Input: Strings T.
Huffman Codes Juan A. Rodriguez CS 326 5/13/2003.
© Copyright 2012 by Pearson Education, Inc. All Rights Reserved. 1 Chapter 19 Binary Search Trees.
CSC 212 – Data Structures Lecture 36: Pattern Matching.
Greedy Algorithms Analysis of Algorithms.
ICS220 – Data Structures and Algorithms Analysis Lecture 14 Dr. Ken Cosh.
1/39 COMP170 Tutorial 13: Pattern Matching T: P:.
1 COMP9024: Data Structures and Algorithms Week Ten: Text Processing Hui Wu Session 1, 2016
1 String Matching Algorithms Mohd. Fahim Lecturer Department of Computer Engineering Faculty of Engineering and Technology Jamia Millia Islamia New Delhi,
CSG523/ Desain dan Analisis Algoritma
Advanced Sorting 7 2  9 4   2   4   7
Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
HUFFMAN CODES.
Tries 07/28/16 11:04 Text Compression
Tries 5/27/2018 3:08 AM Tries Tries.
Greedy Technique.
Sequences 6/17/2018 2:44 PM Pattern Matching.
Greedy Method 6/22/2018 6:57 PM Presentation for use with the textbook, Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015.
The Greedy Method and Text Compression
The Greedy Method and Text Compression
Tries 9/14/ :13 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
Pattern Matching 9/14/2018 3:36 AM
Optimal Merging Of Runs
String Processing.
Chapter 3 String Matching.
Analysis & Design of Algorithms (CSCE 321)
Optimal Merging Of Runs
Merge Sort 11/28/2018 2:21 AM The Greedy Method The Greedy Method.
Huffman Coding CSE 373 Data Structures.
Pattern Matching 12/8/ :21 PM Pattern Matching Pattern Matching
Pattern Matching 1/14/2019 8:30 AM Pattern Matching Pattern Matching.
KMP String Matching Donald Knuth Jim H. Morris Vaughan Pratt 1997.
Greedy Algorithms TOPICS Greedy Strategy Activity Selection
Pattern Matching 2/15/2019 6:17 PM Pattern Matching Pattern Matching.
Data Structure and Algorithms
Tries 2/23/2019 8:29 AM Tries 2/23/2019 8:29 AM Tries.
Suffix Trees String … any sequence of characters.
Chap 3 String Matching 3 -.
String Processing.
Pattern Matching Pattern Matching 5/1/2019 3:53 PM Spring 2007
Pattern Matching 4/27/2019 1:16 AM Pattern Matching Pattern Matching
Sequences 5/17/ :43 AM Pattern Matching.
Analysis of Algorithms CS 477/677
Presentation transcript:

13 Text Processing Hongfei Yan June 1, 2016

Contents 01 Python Primer (P2-51) 02 Object-Oriented Programming (P57-103) 03 Algorithm Analysis (P111-141) 04 Recursion (P150-180) 05 Array-Based Sequences (P184-224) 06 Stacks, Queues, and Deques (P229-250) 07 Linked Lists (P256-294) 08 Trees (P300-352) 09 Priority Queues (P363-395) 10 Maps, Hash Tables, and Skip Lists (P402-452) 11 Search Trees (P460-528) 12 Sorting and Selection (P537-574) 13 Text Processing (P582-613) 14 Graph Algorithms (P620-686) 15 Memory Management and B-Trees (P698-717) ISLR_Print6

Contents 13.1 Abundance of Digitized Text 13.2 Pattern-Matching 13.3 Dynamic Programming 13.4 Text Compression and the Greedy Method 13.4 Tries

Pattern Matching 9/14/2018 3:37 AM 13.2 Pattern Matching

Strings A string is a sequence of characters Examples of strings: Python program HTML document DNA sequence Digitized image An alphabet S is the set of possible characters for a family of strings Example of alphabets: ASCII Unicode {0, 1} {A, C, G, T} Let P be a string of size m A substring P[i .. j] of P is the subsequence of P consisting of the characters with ranks between i and j A prefix of P is a substring of the type P[0 .. i] A suffix of P is a substring of the type P[i ..m - 1] Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P Applications: Text editors Search engines Biological research

Brute-Force Pattern Matching Algorithm BruteForceMatch(T, P) Input text T of size n and pattern P of size m Output starting index of a substring of T equal to P or -1 if no such substring exists for i  0 to n - m { test shift i of the pattern } j  0 while j < m  T[i + j] = P[j] j  j + 1 if j = m return i {match at i} else break while loop {mismatch} return -1 {no match anywhere} The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either a match is found, or all placements of the pattern have been tried Brute-force pattern matching runs in time O(nm) Example of worst case: T = aaa … ah P = aaah may occur in images and DNA sequences unlikely in English text

Boyer-Moore Heuristics The Boyer-Moore’s pattern matching algorithm is based on two heuristics Looking-glass heuristic: Compare P with a subsequence of T moving backwards Character-jump heuristic: When a mismatch occurs at T[i] = c If P contains c, shift P to align the last occurrence of c in P with T[i] Else, shift P to align P[0] with T[i + 1] Example Pattern Matching

Last-Occurrence Function Boyer-Moore’s algorithm preprocesses the pattern P and the alphabet S to build the last-occurrence function L mapping S to integers, where L(c) is defined as the largest index i such that P[i] = c or -1 if no such index exists Example: S = {a, b, c, d} P = abacab The last-occurrence function can be represented by an array indexed by the numeric codes of the characters The last-occurrence function can be computed in time O(m + s), where m is the size of P and s is the size of S c a b d L(c) 4 5 3 -1

The Boyer-Moore Algorithm Case 1: j  1 + l Algorithm BoyerMooreMatch(T, P, S) L  lastOccurenceFunction(P, S ) i  m - 1 j  m - 1 repeat if T[i] = P[j] if j = 0 return i { match at i } else i  i - 1 j  j - 1 { character-jump } l  L[T[i]] i  i + m – min(j, 1 + l) until i > n - 1 return -1 { no match } Case 2: 1 + l  j In the case of Figure 13.3(b), we slide the pattern only one unit. It would be more productive to slide it rightward until finding another occurrence of mismatched character T[i] in the pattern, but we do not wish to take time to search for

Example

Analysis Boyer-Moore’s algorithm runs in time O(nm + s) Example of worst case: T = aaa … a P = baaa The worst case may occur in images and DNA sequences but is unlikely in English text Boyer-Moore’s algorithm is significantly faster than the brute-force algorithm on English text

Python Implementation

The KMP Algorithm Knuth-Morris-Pratt’s algorithm compares the pattern to the text in left-to-right, but shifts the pattern more intelligently than the brute- force algorithm. When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons? Answer: the largest prefix of P[0..j] that is a suffix of P[1..j] . . a b a a b x . . . . . a b a a b a j a b a a b a No need to repeat these comparisons Resume comparing here

KMP Failure Function Knuth-Morris-Pratt’s algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j] Knuth-Morris-Pratt’s algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j]  T[i] we set j  F(j - 1) j 1 2 3 4 5 P[j] a b F(j)

The KMP Algorithm The failure function can be represented by an array and can be computed in O(m) time At each iteration of the while-loop, either i increases by one, or the shift amount i - j increases by at least one (observe that F(j - 1) < j) Hence, there are no more than 2n iterations of the while-loop Thus, KMP’s algorithm runs in optimal time O(m + n) Algorithm KMPMatch(T, P) F  failureFunction(P) i  0 j  0 while i < n if T[i] = P[j] if j = m - 1 return i - j { match } else i  i + 1 j  j + 1 if j > 0 j  F[j - 1] return -1 { no match }

Computing the Failure Function The failure function can be represented by an array and can be computed in O(m) time The construction is similar to the KMP algorithm itself At each iteration of the while-loop, either i increases by one, or the shift amount i - j increases by at least one (observe that F(j - 1) < j) Hence, there are no more than 2m iterations of the while-loop Algorithm failureFunction(P) F[0]  0 i  1 j  0 while i < m if P[i] = P[j] {we have matched j + 1 chars} F[i]  j + 1 i  i + 1 j  j + 1 else if j > 0 then {use failure function to shift P} j  F[j - 1] else F[i]  0 { no match }

Example j 1 2 3 4 5 P[j] a b c F(j)

Python Implementation

The Greedy Method and Text Compression Merge Sort 9/14/2018 3:37 AM The Greedy Method and Text Compression

The Greedy Method Technique The greedy method is a general algorithm design paradigm, built on the following elements: configurations: different choices, collections, or values to find objective function: a score assigned to configurations, which we want to either maximize or minimize It works best when applied to problems with the greedy- choice property: a globally-optimal solution can always be found by a series of local improvements from a starting configuration.

Text Compression Given a string X, efficiently encode X into a smaller string Y Saves memory and/or bandwidth A good approach: Huffman encoding Compute frequency f(c) for each character c. Encode high-frequency characters with short code words No code word is a prefix for another code Use an optimal encoding tree to determine the code words Greedy Method

Encoding Tree Example A code is a mapping of each character of an alphabet to a binary code-word A prefix code is a binary code such that no code-word is the prefix of another code- word An encoding tree represents a prefix code Each external node stores a character The code word of a character is given by the path from the root to the external node storing the character (0 for a left child and 1 for a right child) a b c d e 00 010 011 10 11 a b c d e

Encoding Tree Optimization Given a text string X, we want to find a prefix code for the characters of X that yields a small encoding for X Frequent characters should have long code-words Rare characters should have short code-words Example X = abracadabra T1 encodes X into 29 bits T2 encodes X into 24 bits c a r d b T1 a c d b r T2 Greedy Method

Huffman’s Algorithm Given a string X, Huffman’s algorithm construct a prefix code the minimizes the size of the encoding of X It runs in time O(n + d log d), where n is the size of X and d is the number of distinct characters of X A heap-based priority queue is used as an auxiliary structure Greedy Method

Huffman’s Algorithm