Data Structures Lecture 3 Fang Yu Department of Management Information Systems National Chengchi University Fall 2010.

Slides:



Advertisements
Similar presentations
© 2004 Goodrich, Tamassia Pattern Matching1. © 2004 Goodrich, Tamassia Pattern Matching2 Strings A string is a sequence of characters Examples of strings:
Advertisements

Analysis of Algorithms
Designing Algorithms Csci 107 Lecture 4. Outline Last time Computing 1+2+…+n Adding 2 n-digit numbers Today: More algorithms Sequential search Variations.
Exact String Search Lecture 7: September 22, 2005 Algorithms in Biosequence Analysis Nathan Edwards - Fall, 2005.
Tries Standard Tries Compressed Tries Suffix Tries.
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Tries Search for ‘bell’ O(n) by KMP algorithm O(dm) in a trie Tries
Boyer Moore Algorithm String Matching Problem Algorithm 3 cases Searching Timing.
Comp. Eng. Lab III (Software), Pattern Matching1 Pattern Matching Dr. Andrew Davison WiG Lab (teachers room), CoE ,
1 A simple fast hybrid pattern- matching algorithm Department of Computer Science and Information Engineering National Cheng Kung University, Taiwan R.O.C.
1 Prof. Dr. Th. Ottmann Theory I Algorithm Design and Analysis (12 - Text search, part 1)
Pattern Matching1. 2 Outline and Reading Strings (§9.1.1) Pattern matching algorithms Brute-force algorithm (§9.1.2) Boyer-Moore algorithm (§9.1.3) Knuth-Morris-Pratt.
Data Structures Lecture 0 Fang Yu Department of Management Information Systems National Chengchi University Fall 2011.
Data Structures Lecture 10 Fang Yu Department of Management Information Systems National Chengchi University Fall 2010.
Goodrich, Tamassia String Processing1 Pattern Matching.
CSC 212 – Data Structures Lecture 34: Strings and Pattern Matching.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 2: Boyer-Moore Algorithm.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 2: KMP Algorithm Lecturer:
Boyer-Moore string search algorithm Book by Dan Gusfield: Algorithms on Strings, Trees and Sequences (1997) Original: Robert S. Boyer, J Strother Moore.
Boyer-Moore Algorithm 3 main ideas –right to left scan –bad character rule –good suffix rule.
1 A Fast Algorithm for Multi-Pattern Searching Sun Wu, Udi Manber Tech. Rep. TR94-17,Department of Computer Science, University of Arizona, May 1994.
String Matching COMP171 Fall String matching 2 Pattern Matching * Given a text string T[0..n-1] and a pattern P[0..m-1], find all occurrences of.
Quick Search Algorithm A very fast substring search algorithm, SUNDAY D.M., Communications of the ACM. 33(8),1990, pp Adviser: R. C. T. Lee Speaker:
Sequence Alignment II CIS 667 Spring Optimal Alignments So we know how to compute the similarity between two sequences  How do we construct an.
Pattern Matching 4/17/2017 7:14 AM Pattern Matching Pattern Matching.
1 prepared from lecture material © 2004 Goodrich & Tamassia COMMONWEALTH OF AUSTRALIA Copyright Regulations 1969 WARNING This material.
1 Boyer-Moore Charles Yan Exact Matching Boyer-Moore ( worst-case: linear time, Typical: sublinear time ) Aho-Corasik ( A set of pattern )
Pattern Matching1. 2 Outline Strings Pattern matching algorithms Brute-force algorithm Boyer-Moore algorithm Knuth-Morris-Pratt algorithm.
1 Exact Matching Charles Yan Na ï ve Method Input: P: pattern; T: Text Output: Occurrences of P in T Algorithm Naive Align P with the left end.
A Fast Algorithm for Multi-Pattern Searching Sun Wu, Udi Manber May 1994.
1 Exact Set Matching Charles Yan Exact Set Matching Goal: To find all occurrences in text T of any pattern in a set of patterns P={p 1,p 2,…,p.
String Matching. Problem is to find if a pattern P[1..m] occurs within text T[1..n] Simple solution: Naïve String Matching –Match each position in the.
Chapter 9: Text Processing Pattern Matching Data Compression.
Text Processing 1 Last Update: July 31, Topics Notations & Terminology Pattern Matching – Brute Force – Boyer-Moore Algorithm – Knuth-Morris-Pratt.
CSC401 – Analysis of Algorithms Chapter 9 Text Processing
Advanced Algorithm Design and Analysis (Lecture 3) SW5 fall 2004 Simonas Šaltenis E1-215b
MA/CSSE 473 Day 24 Student questions Quadratic probing proof
  ;  E       
20/10/2015Applied Algorithmics - week31 String Processing  Typical applications: pattern matching/recognition molecular biology, comparative genomics,
Application: String Matching By Rong Ge COSC3100
Abstract: Cryptology is a combination of the processes of keeping a message secret (cryptography) and trying to break the secrecy of that message (cryptoanalysis).
Strings and Pattern Matching Algorithms Pattern P[0..m-1] Text T[0..n-1] Brute Force Pattern Matching Algorithm BruteForceMatch(T,P): Input: Strings T.
Comp. Eng. Lab III (Software), Pattern Matching1 Pattern Matching Dr. Andrew Davison WiG Lab (teachers room), CoE ,
1 HEINZ NIXDORF INSTITUTE University of Paderborn Algorithms and Complexity Christian Schindelhauer Search Algorithms Winter Semester 2004/ Oct.
Book: Algorithms on strings, trees and sequences by Dan Gusfield Presented by: Amir Anter and Vladimir Zoubritsky.
String Searching CSCI 2720 Spring 2007 Eileen Kraemer.
String Matching String Matching Problem We introduce a general framework which is suitable to capture an essence of compressed pattern matching according.
Exact String Matching Algorithms Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU.
1 String Matching Algorithms Topics  Basics of Strings  Brute-force String Matcher  Rabin-Karp String Matching Algorithm  KMP Algorithm.
CSC 212 – Data Structures Lecture 36: Pattern Matching.
Contest Algorithms January 2016 Three types of string search: brute force, Knuth-Morris-Pratt (KMP) and Rabin-Karp 13. String Searching 1Contest Algorithms:
String Sorts Tries Substring Search: KMP, BM, RK
Fall 2008Simple Parallel Algorithms1. Fall 2008Simple Parallel Algorithms2 Scalar Product of Two Vectors Let a = (a 1, a 2, …, a n ); b = (b 1, b 2, …,
Lab 6 Problem 1: DNA. DNA Given a string with length N, determine the number of occurrences of some given substrings (with length K) in that string. For.
ICS220 – Data Structures and Algorithms Analysis Lecture 14 Dr. Ken Cosh.
Suffix Tree 6 Mar MinKoo Seo. Contents  Basic Text Searching  Introduction to Suffix Tree  Suffix Trees and Exact Matching  Longest Common Substring.
1/39 COMP170 Tutorial 13: Pattern Matching T: P:.
1 COMP9024: Data Structures and Algorithms Week Ten: Text Processing Hui Wu Session 1, 2016
1 String Matching Algorithms Mohd. Fahim Lecturer Department of Computer Engineering Faculty of Engineering and Technology Jamia Millia Islamia New Delhi,
CSG523/ Desain dan Analisis Algoritma
Pattern Matching 9/14/2018 3:36 AM
13 Text Processing Hongfei Yan June 1, 2016.
Pattern Matching 12/8/ :21 PM Pattern Matching Pattern Matching
Algorithm Discovery and Design
Pattern Matching 1/14/2019 8:30 AM Pattern Matching Pattern Matching.
KMP String Matching Donald Knuth Jim H. Morris Vaughan Pratt 1997.
Pattern Matching 2/15/2019 6:17 PM Pattern Matching Pattern Matching.
Pattern Matching Pattern Matching 5/1/2019 3:53 PM Spring 2007
Pattern Matching 4/27/2019 1:16 AM Pattern Matching Pattern Matching
Sequences 5/17/ :43 AM Pattern Matching.
Presentation transcript:

Data Structures Lecture 3 Fang Yu Department of Management Information Systems National Chengchi University Fall 2010

Announcement

Lab Classes We will use the MIS PC room ( 逸仙樓 5F) on  11/04  11/18  12/16  12/23 (Demo)

HWs Review  Calculate your BMI  Java Class Library  Generic Geometric Progression  Generics  Inheritance  Exceptions

Project Announcement  Team Project: 30%  1-4 students as a team  Send the team list (name and contact) to your TA before the end of this week  Develop your application use Eclipse with SVN  TAs will help you set up SVN  You will get extra points for constant code update

Smart Ranking  Goal: Mining data in the cyber world!  Stage 0 (HW3): Keyword Counting  Given a URL and a keyword  Return how many times the keyword appears in the contents of the URL  Stage 1 (60%+): Single-Page Ranking  Given a set of keywords and URLs  Rank the URLs based on their score  Define a score formula based on keyword appearances  For each URL (web page), return its rank, score, and the number of appearance of each keyword,

Smart Ranking  Stage 2 (80%+) Hierarchy Ranking  Multiple level keyword search  Given a set of Web sites (URLS) and Keywords  Rank a set of Web sites based on keyword appearances in all its sub URLs  Define a score formula based on keyword appearances in the URL and all its sub URLs  For each URL (web site), return its rank, score, and a tree structure for its sub URLs along with the number of appearance of each keyword in each node

Smart Ranking  Stage 3 (90%+) Increase your rank on google  Given a set of Keywords (No URLs)  Use search engines to find potential URLs  Do the same analysis on these Web sites as Stage 2  Stage 4 (100%+) Semantics Analysis  Do the same analysis on Stage 3  Derive relative keywords from the discovered Web sites

Important Dates Each team needs to 1.Submit a project proposal (4-8 pages) on Nov Give a Demo on Dec Upload the source code before Jan. 6

Text Processing Strings and Pattern matching

Text Processing  Due to internet, documents are everywhere  Text processing becomes one of the dominant functions of computers  HTML and XML  Primary text formats with added tags for multimedia content  Java Applet (embedded Java bytecode in the HTML)

Strings  A string is a sequence of characters  An alphabet  is the set of possible characters for a family of strings  Example of alphabets:  ASCII  Unicode  {0, 1}  {A, C, G, T}

Strings  Let P be a string of size m  A substring P[i.. j] of P is the subsequence of P consisting of the characters with ranks between i and j  A prefix of P is a substring of the type P[0.. i]  A suffix of P is a substring of the type P[i..m  1]

Java String Class String S;  Immutable strings: operations simply return information about strings (no modification) length()Return the length of S charAt(i)Return the ith character startsWith(Q)True if Q is a prefix of S endsWith(Q)True is Q is a suffix of S substring(i,j)Return the substring S[i,j] concat(Q)Return S+Q equals(Q)True is Q is equal to S indexOf(Q)If Q is a substring of S, returns the index of the beginning of the first occurrence of Q in S

Java String Class String a = “Hello World!”; OperationOutput a.length() a.charAt(1) a.startsWith(“Hell”) a.endsWith(“rld”) a.substring(1,2) a.concat(“rld”) a.substring(1,2).equals(“e”) indexOf(“rld”)

Java String Class String a = “Hello World!”; OperationOutput a.length()12 a.charAt(1)e a.startsWith(“Hell”)true a.endsWith(“rld”)false a.substring(1,2)e a.concat(“rld”)Hello World!rld a.substring(1,2).equals(“e”)true indexOf(“rld”)8

Java StringBuffer Class StringBuffer S;  Mutable strings: operations modify the strings append(Q)Replace S with S+Q. Return S. Insert(i,Q)Insert Q in S starting at index i. Return S reverse()Reverse S. Return S. setCharAt(i, ch)Set the character at index i in S to ch charAt(i)Return the character at index i in S toString()Return a String version of S

Java StringBuffer Class StringBuffer a = new StringBuffer(); Operationa a.append(“Hello World!”) a.reverse() a.insert(6,”Fang and the ”) a.setCharAt(4, ‘!’)

Java StringBuffer Class StringBuffer a = new StringBuffer(); Operationa a.append(“Hello World!”)Hello World! a.reverse()!dlroW olleH a.reverse()Hello World! a.insert(6,”Fang and the ”)Hello Fang and the World! a.setCharAt(4, ‘!’)Hell! Fang and the World!

Pattern Matching  Given a text string T of length n and a pattern string P of length m  Find whether P is a substring of T  If so, return the starting index in T of a substring matching P  The implementation of T.indexOf(P)  Applications:  Text editors, Search engines, Biological research

Brute-Force Pattern Matching The idea:  Compare the pattern P with the text T for each possible shift of P relative to T, until  either a match is found, or  all placements of the pattern have been tried

Algorithm BruteForceMatch(T, P) Input text T of size n and pattern P of size m Output starting index of a substring of T equal to P or  1 if no such substring exists for i  0 to n – m // test shift i of the pattern j  0 while j  m  T[i  j]  P[j] j  j  1 if j  m return i //match at i else break while loop //mismatch return -1 //no match anywhere

Brute-Force Pattern Matching  Time Complexity:  O(mn), where m is the length of T and n is the length of P  A worst case example:  T = aaaaaaaaaaaaaab  P = aab  Need 39 comparisons to find a match  may occur in images and DNA sequences  unlikely in English text

Can we do better? Here are two Heuristics. 1. Backward comparison  Compare T and P from the end of P and move backward to the front of P 2. Shift as far as you can  When there is a mismatch of P[j] and T[i]=c, if c does not appear in P, shift P[0] to T[i+1]

The Boyer-Moore Algorithm  The Boyer-Moore’s pattern matching algorithm is based on these two heuristics:  Looking-glass heuristic: Compare P with a subsequence of T moving backwards  Character-jump heuristic: When a mismatch occurs at T[i]  c  If P contains c, shift P to align the last occurrence of c in P with T[i]  Else, shift P to align P[0] with T[i  1]

An Example t appears in P. Shift to t e does not appears in P. align P[0] and T[i+1] T P

Last Occurrence Function  Boyer-Moore’s algorithm preprocesses the pattern P and the alphabet  to build the last- occurrence function L mapping  to integers  L(c) is defined as ( c is a character)  the largest index i such that P[i]  c or   1 if no such index exists  Example:   {a, b, c, d}  P  abacab cabcd L(c)L(c)453 11

Last Occurrence Function  The last-occurrence function can be represented by an array indexed by the numeric codes of the characters  The last-occurrence function can be computed in time O(m  s), where m is the size of P and s is the size of 

Algorithm BoyerMooreMatch(T, P,  ) L  lastOccurenceFunction(P,  ) i  m – 1 //backward j  m – 1 repeat if T[i]  P[j] if j  0 return i // match at i else i  i  1 j  j  1 else // character-jump l  L[T[i]] i  i  m – min(j, 1  l) j  m  1 until i  n  1 return  1 { no match } The Boyer-Moore Algorithm What does this do?

The Boyer-Moore Algorithm  i  i  m – min(j, 1  l)  Don’t shift back! Case 1: j  1  l Case 2: 1  l  j

Another Example Case 2 Case 1

Do we do better?  Boyer-Moore’s algorithm runs in time O(nm  s)  Example of worst case:  T  aaa … a  P  baaa  The worst case may occur in images and DNA sequences but is unlikely in English text  Boyer-Moore’s algorithm is significantly faster than the brute-force algorithm on English text

The Worst-case Example

HW3 (Due on 10/7) Count A Keyword in a Web Page!  Get a URL and a keyword from user inputs  Return how many times the keyword appears in the contents of the URL  For example:  Enter URL:  Enter Keyword: Fang  Output: Fang appears 2 times