
北海道大学 Hokkaido University 1 Lecture on Information knowledge network 2011/11/29 Lecture on Information Knowledge Network "Information retrieval and pattern matching" Laboratory of Information Knowledge Network, Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University Takuya KIDA

The 8th: Misc. topics of pattern matching. Method for multi-byte code texts. Toward an intelligent pattern matching: 1. Pattern matching for XML data; 2. Pattern matching on texts with arc annotation; 3. Pattern matching with taxonomy data. Appendix: Randomized algorithms.

北海道大学 Hokkaido University 3 Lecture on Information knowledge network 2011/11/29 Method for multi-byte code texts (Japanese texts). Synchronization problem of codewords: –False detection can occur when we do pattern matching on a Japanese text byte by byte (in units of ASCII characters). –As with Huffman codes, it is necessary to determine the boundaries of characters. Example (Japanese EUC-encoded text): Text T = "TFT液晶の時代", i.e. the byte sequence 54 46 54 B1 D5 BE BD A4 CE BB FE C2 E5; Pattern P = "修了" = BD A4 CE BB. An AC machine built over bytes for P = "BD A4 CE BB" (修了) reports a false occurrence, because the byte sequence BD A4 CE BB appears across character boundaries (… BE BD | A4 CE | BB FE …) even though 修了 does not occur in T.
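As a concrete illustration (a minimal Python sketch, not part of the original slides; it only assumes Python's euc_jp codec): a plain byte-level search finds the EUC-JP bytes of 修了 inside the EUC-JP bytes of TFT液晶の時代, even though the string 修了 never occurs.

# Minimal demo of the codeword synchronization problem.
text = "TFT液晶の時代".encode("euc_jp")      # 54 46 54 B1 D5 BE BD A4 CE BB FE C2 E5
pattern = "修了".encode("euc_jp")            # BD A4 CE BB

print(text.find(pattern))        # 6: the byte-level search reports a spurious occurrence
print("修了" in "TFT液晶の時代")   # False: the characters 修了 do not occur in the text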

北海道大学 Hokkaido University 4 Lecture on Information knowledge network 2011/11/29 Review: Solution by an automaton with synchronization. [Figure: a Huffman tree over {A, B, C, D, E}; pattern P = DEC and text T = ABECA… together with their Huffman encodings E(P) and E(T); an ordinary KMP automaton for E(P) compared with a KMP automaton with synchronization, which reports matches only on codeword boundaries.] M. Miyazaki, S. Fukamachi, M. Takeda: Speeding up the pattern matching machine for compressed texts (in Japanese), Trans. IPSJ, Vol. 39, No. 9, 1998.

北海道大学 Hokkaido University 5 Lecture on Information knowledge network 2011/11/29 PM on multi-byte code texts by an automaton with synchronization. M. Takeda, et al.: Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-Byte Character Texts, and Semi-Structured Texts, Proc. of SPIRE2002, LNCS 2476, 2002. Example: Text T = "TFT液晶の時代" (Japanese EUC-encoded, bytes 54 46 54 B1 D5 BE BD A4 CE BB FE C2 E5); Pattern P = "修了". [Figure: an AC machine with synchronization that correctly detects the EUC-encoded pattern P = BD A4 CE BB (修了). The part for synchronization is a code automaton accepting any EUC codeword: from the initial state, bytes in [00-8D, 90-9F] are one-byte codewords (half-width characters), a byte in [8E, A0-FF] is followed by one byte in [A0-FF] (two-byte codewords, full-width characters), and the byte 8F is followed by two bytes in [A0-FF] (three-byte codewords).]

北海道大学 Hokkaido University 6 Lecture on Information knowledge network 2011/11/29 Idea of the bit-parallel technique (Shift-And). The state vector is updated as R_i = ((R_{i-1} << 1) | 1) & M(T[i]), where the mask table M records, for each character, the bit positions at which it occurs in the pattern (the figure illustrates this for pattern P = ababb on a text over {a, b}). Each update is computed in O(1) time. ※ Only the correctly shifted bits survive, because we take the AND with the mask bits M(T[i]).
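A runnable sketch of the recurrence above (our own illustration; the text string below is chosen for the example and is not from the slide):

def shift_and(text, pattern):
    """Shift-And exact matching: returns the end positions of all occurrences."""
    m = len(pattern)
    mask = {}                                   # M[c]: bit i is set iff pattern[i] == c
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << i)
    r, goal, hits = 0, 1 << (m - 1), []
    for j, c in enumerate(text):
        r = ((r << 1) | 1) & mask.get(c, 0)     # R_i = ((R_{i-1} << 1) | 1) & M(T[i])
        if r & goal:                            # highest bit set -> the pattern ends here
            hits.append(j)
    return hits

print(shift_and("ababbababb", "ababb"))         # [4, 9]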

北海道大学 Hokkaido University 7 Lecture on Information knowledge network 2011/11/29 Bit-parallel method for multi-byte code texts. Basic idea: –We construct a pattern matching machine (a code automaton) that can determine the boundaries of codewords and recognize each multi-byte character appearing in the pattern. –The code automaton runs while reading the text byte by byte, and it outputs the mask bit sequence corresponding to each character of the pattern. –We simulate an arbitrary bit-parallel algorithm, e.g. R_i = ((R_{i-1} << 1) | 1) & M(T[i]), by using the output M(T[i]) of the code automaton instead of reading T[i] directly. [Figure: a code automaton that determines the boundaries of EUC codewords and recognizes 修 (BD A4) and 了 (CE BB), emitting M[修] = 01 and M[了] = 10; its transitions are labelled with the byte ranges [00-8D, 90-9F], [8E, A0-FF], [8F], and [A0-FF].] Heikki Hyyrö, Jun Takaba, Ayumi Shinohara, and Masayuki Takeda: On Bit-Parallel Processing of Multi-byte Strings, Proc. of Asia Information Retrieval Symposium, 2004.
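A minimal sketch of this idea, assuming the input is valid EUC-JP (the boundary rules follow the byte ranges on the slide; the helper names are ours, not from the paper): a small boundary automaton groups the bytes into codewords, and the ordinary Shift-And step is then driven by whole characters, so a pattern can no longer match across character boundaries.

def euc_codewords(data: bytes):
    """Split an EUC-JP byte string into codewords (boundary detection only,
    trail bytes are not validated): 8F starts a 3-byte codeword, 8E or A0-FF
    a 2-byte codeword, and 00-8D / 90-9F are 1-byte codewords."""
    i = 0
    while i < len(data):
        b = data[i]
        step = 3 if b == 0x8F else 2 if (b == 0x8E or b >= 0xA0) else 1
        yield data[i:i + step]
        i += step

def shift_and_euc(text: bytes, pattern: bytes):
    """Shift-And over whole EUC-JP characters instead of raw bytes."""
    pat = list(euc_codewords(pattern))
    mask = {}
    for i, ch in enumerate(pat):
        mask[ch] = mask.get(ch, 0) | (1 << i)
    r, goal, hits = 0, 1 << (len(pat) - 1), []
    for j, ch in enumerate(euc_codewords(text)):
        r = ((r << 1) | 1) & mask.get(ch, 0)
        if r & goal:
            hits.append(j)                       # character (not byte) position of the match end
    return hits

text = "TFT液晶の時代".encode("euc_jp")
print(shift_and_euc(text, "修了".encode("euc_jp")))   # []  : no spurious match
print(shift_and_euc(text, "液晶".encode("euc_jp")))   # [4] : ends at the 5th character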

北海道大学 Hokkaido University 8 Lecture on Information knowledge network 2011/11/29 Toward an intelligent pattern matching. Until now: –Text = just a sequence of characters (we have ignored the background knowledge about the text and the meaning of its sentences). –Fast! Fast! Fast! From now on: –Text = a sequence of sentences that have meanings and/or structures. –We need an intelligent pattern matching (of course, still at high speed!). Pattern matching in consideration of the structure of the text: –Pattern matching for XML texts. –Pattern matching for texts with arc annotation. –etc. Pattern matching in consideration of the meaning of the text (cooperating with ontology data): –Pattern matching in consideration of taxonomic information. –Thesaurus, inductive rules, etc.

北海道大学 Hokkaido University 9 Lecture on Information knowledge network 2011/11/29 Pattern matching for XML texts: previous approaches. [Figure: an XML document is either loaded into memory and accessed by the application program through the DOM API (e.g. person/name/last = Tanaka, person/name/first = Makiko), or stored in an RDB (a person table with first and last columns holding Makiko and Tanaka) and queried with SQL.]

北海道大学 Hokkaido University 10 Lecture on Information knowledge network 2011/11/29 Pattern matching for XML texts: our approach. [Figure: the application program runs a pattern matching algorithm directly on the XML document text (the same Makiko / Tanaka example), without building an in-memory tree or a database.] M. Takeda, et al.: Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-Byte Character Texts, and Semi-Structured Texts, Proc. of SPIRE2002, LNCS 2476, 2002.

北海道大学 Hokkaido University 11 Lecture on Information knowledge network 2011/11/29 Advantages of the pattern matching approach: it can batch-process a huge XML document, or a large number of documents, in a single pass; it can handle many queries at once; it is fast and uses little memory while still respecting the tree structure; and it fits various applications.

北海道大学 Hokkaido University 12 Lecture on Information knowledge network 2011/11/29 Problem with a simple pattern matching algorithm: it may match part of a tag name. Example: for the pattern set Π = {other, …}, the keyword "other" also matches inside a tag name such as &lt;mother&gt; (remove the m from "mother" and it becomes "other"), which is a wrong detection. We have to know whether the current position is inside or outside a tag.

北海道大学 Hokkaido University 13 Lecture on Information knowledge network 2011/11/29 A solution. [Figure: an ordinary AC machine for the keyword "other" compared with an AC machine that takes XML tags into account: reading '&lt;' moves to a tag-skipping state that loops on every character other than '&gt;', and reading '&gt;' returns to the initial state, so keyword matching resumes only outside of tags.]
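A minimal sketch of the same idea without an AC machine (single keyword, naive matching; the extra "inside a tag" flag mimics the '&lt;' / '&gt;' transitions added to the machine in the figure):

def find_outside_tags(text, keyword):
    """Report end positions of keyword occurrences that lie outside of tags."""
    hits, inside, buf = [], False, ""
    for i, c in enumerate(text):
        if inside:
            if c == '>':
                inside = False
            continue
        if c == '<':
            inside = True
            buf = ""                             # a tag also breaks a partial match
            continue
        buf = (buf + c)[-len(keyword):]          # keep the last |keyword| characters seen
        if buf == keyword:
            hits.append(i)
    return hits

doc = "<mother>the other day</mother>"
print(find_outside_tags(doc, "other"))           # [16]: only the text node matches, not the tag names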

北海道大学 Hokkaido University 14 Lecture on Information knowledge network 2011/11/29 Handling of attributes. [Figure: the same tag-skipping construction as on the previous slide, extended so that attribute values inside a tag are skipped as well: inside a tag the machine loops on every character other than '&gt;' (and inside an attribute value on every character other than its closing delimiter) before returning to the same tag states.]

北海道大学 Hokkaido University 15 Lecture on Information knowledge network 2011/11/29 Pattern matching in consideration of the XML path. Goal: I want to look for the persons whose family name is "Tanaka" (in XPath, the elements //person/name/last whose value equals "Tanaka"). [Figure: the tag-aware AC machine is combined with a stack: each start tag pushes its name (with a counter), each end tag pops, so the machine always knows the path of currently open elements; one keyword set holds the tag names of the path and the other holds {Tanaka}.]
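A conceptual sketch of the stack idea, using a streaming XML parser instead of the AC machine (the document, element names, and helper function are our own illustration):

import io
import xml.etree.ElementTree as ET

def find_path_matches(xml_bytes, path, value):
    """Keep a stack of currently open elements and, when an element closes,
    test whether the path ends with `path` and the text equals `value`."""
    hits, stack = [], []
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:                                    # "end": the element is complete
            if stack[-len(path):] == path and (elem.text or "").strip() == value:
                hits.append("/".join(stack))
            stack.pop()
    return hits

doc = b"<db><person><name><first>Makiko</first><last>Tanaka</last></name></person></db>"
print(find_path_matches(doc, ["person", "name", "last"], "Tanaka"))   # ['db/person/name/last']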

北海道大学 Hokkaido University 16 Lecture on Information knowledge network 2011/11/29 Processable subset of XPath. Limitation of the pattern matching approach: –We cannot specify predecessor nodes. –Complex filter specifications significantly decrease the processing speed.
LocationPath ::= '/' RelativeLocationPath
RelativeLocationPath ::= Step | RelativeLocationPath '/' Step
Step ::= AxisSpecifier NodeTest
AxisSpecifier ::= AxisName '::'
AxisName ::= 'attribute' | 'child' | 'descendant' | 'descendant-or-self' | 'following' | 'following-sibling' | 'self' | 'namespace'
NodeTest ::= QName | NodeType '(' ')'
NodeType ::= 'node' | 'text' | 'comment' | 'processing-instruction'
Example: /descendant::cars/child::car/attribute::node()

北海道大学 Hokkaido University 17 Lecture on Information knowledge network 2011/11/29 Speed comparison with Sgrep (J. Jaakkola and P. Kilpeläinen). Setting: text of 110 MB (English); CPU: Celeron 366 MHz; memory: 128 MB; OS: Kondara/MNU Linux 2.1 RC2. [Table: CPU time (sec.) of Sgrep vs. Takeda et al. [2002] for the patterns //text/"summers", //test//"summers", and /site/regions/africa/item/location/"United_States".]

北海道大学 Hokkaido University 18 Lecture on Information knowledge network 2011/11/29 Pattern matching for texts with arc annotation. Definition: An arc annotation A accompanying a sequence S is a set of pairs of integers from {1, 2, …, |S|}. Each element (iL, iR) ∈ A is called an arc. –S[iL] and S[iR] are called the left endpoint and the right endpoint, respectively. –For every arc we assume that iL < iR. –Moreover, no two arcs share the same integer; that is, no two arcs share an endpoint. An example of a text with an arc annotation: AGTCACGCCCGT (with arcs drawn between paired positions).

北海道大学 Hokkaido University 19 Lecture on Information knowledge network 2011/11/29 Example of a text with an arc annotation. [Figure: the two-dimensional (secondary) structure of tRNA (tRNA-Phe); the sequence … ACACCUAGCΨTGUGU … is drawn as a string with nested arcs connecting the base-paired positions.]

北海道大学 Hokkaido University 20 Lecture on Information knowledge network 2011/11/29 Arc-preserving subsequence (APS) problem. Given a text S1 = S1[1 : n] with arc annotation A1 and a pattern S2 = S2[1 : m] with arc annotation A2, the APS problem asks whether the following conditions can be satisfied: –S2 is a subsequence of S1, i.e. each position of S2 is mapped, in order, to a matching position of S1; –the embedding preserves the arcs: two positions of S2 are connected by an arc of A2 if and only if the corresponding positions of S1 are connected by an arc of A1. [Figure: text S1 = AGTCACGCCCGT and pattern S2 = ATGCT with their arcs; ○ marks base matches and × marks arc matches.]
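A small sketch (our own helper, not from the slides) that checks whether a given embedding of the pattern into the text is arc-preserving; arcs are given as sets of 1-based position pairs, and the arc sets in the example are illustrative, since the exact arcs of the figure are not recoverable.

def is_arc_preserving(S1, A1, S2, A2, emb):
    """emb[i] is the (1-based, strictly increasing) text position matched to S2[i+1]."""
    if len(emb) != len(S2) or any(a >= b for a, b in zip(emb, emb[1:])):
        return False
    if any(S1[p - 1] != S2[i] for i, p in enumerate(emb)):
        return False                                        # base mismatch
    # every pattern arc must map onto a text arc ...
    if any((emb[l - 1], emb[r - 1]) not in A1 for (l, r) in A2):
        return False
    # ... and every text arc between two matched positions must exist in the pattern
    pos = {p: i + 1 for i, p in enumerate(emb)}
    return all((pos[l], pos[r]) in A2
               for (l, r) in A1 if l in pos and r in pos)

S1, A1 = "AGTCACGCCCGT", {(1, 12), (2, 11), (5, 9)}
S2, A2 = "ATGCT", {(1, 5)}
print(is_arc_preserving(S1, A1, S2, A2, [1, 3, 7, 10, 12]))   # True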

北海道大学 Hokkaido University 21 Lecture on Information knowledge network 2011/11/29 APS(TYPE1, TYPE2). The difficulty of the APS problem depends on the structure of the arc annotations. In APS(TYPE1, TYPE2), TYPE1 is the arc structure of the text and TYPE2 is the arc structure of the pattern. Example: APS(nested, chain) means the arc structure of the text is "nested" and that of the pattern is "chain". [Figure: the arc-structure classes ordered from the strictest limitation to the loosest — plain, chain, nested, crossing — with the difficulty of the problem increasing from low to high along the same order.]

北海道大学 Hokkaido University 22 Lecture on Information knowledge network 2011/11/29 Result of Kida [2005]. Previous work on the APS problem: –J. Gramm, J. Guo, and R. Niedermeier: Pattern matching for arc-annotated sequences, Proc. 22nd FSTTCS, LNCS 2556, pp. 182–193, Springer, 2002 — APS(nested, nested) is solved in O(nm) time. The result of Kida [2005]: –Proposed an improved algorithm based on the GGN algorithm (the worst-case complexity, however, is the same as GGN). –Corrected an error in the Gramm-Guo-Niedermeier (GGN) algorithm (the original GGN algorithm contains an error). –Implemented and experimented: the proposed algorithm runs 2 to 5 times faster than GGN. Kida: Faster Pattern Matching Algorithm for Arc-Annotated Sequences, Proc. of Federation over the Web, LNAI (to appear).

北海道大学 Hokkaido University 23 Lecture on Information knowledge network 2011/11/29 Experimental results: varying the text length n. [Graph: running time as the text length n changes, with |A1| = 20% of n, m = 20, |A2| = 4.]

北海道大学 Hokkaido University 24 Lecture on Information knowledge network 2011/11/29 Experimental results: varying the pattern length m. [Graph: running time as the pattern length m changes, with |A2| = 20% of m, n = 1000, |A1| = 100.]

北海道大学 Hokkaido University 25 Lecture on Information knowledge network 2011/11/29 Take a breath. [Photo: a king penguin "flying" in water, at Asahiyama Zoo.] Summary to here: –Method for multi-byte code texts (Japanese texts): embedding the code automaton into the AC machine for synchronization; combining a code automaton that outputs mask bit sequences with bit-parallel methods. –Pattern matching in consideration of the structure of the text: pattern matching for XML texts; pattern matching for arc-annotated texts. ~ Trivia ~ How to compute min(x, y) without conditional branching, when the two integers x and y are represented as m-bit sequences: S ← ((x | 10^m) − y) & 10^m; S ← S − (S >> m); min(x, y) ← (~S & x) | (S & y). (Here 10^m denotes the bit pattern 1 followed by m zeros, i.e. 2^m, so we need m+1 bits for each value.)
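A quick check of the trick in code (a direct transcription of the three assignments; m is the bit width and the constant 1 << m plays the role of 10^m — in C one would simply use a type one bit wider than the operands):

def branchless_min(x, y, m=32):
    """min(x, y) for m-bit non-negative integers, without a conditional branch."""
    top = 1 << m                       # the bit pattern 1 0...0 (m zeros)
    s = ((x | top) - y) & top          # bit m survives iff x >= y (no borrow reaches it)
    s = s - (s >> m)                   # 0 if x < y, otherwise a mask of m one-bits
    return (~s & x) | (s & y)

print(branchless_min(7, 12), branchless_min(12, 7), branchless_min(5, 5))   # 7 7 5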

北海道大学 Hokkaido University 26 Lecture on Information knowledge network 2011/11/29 Example of pattern matching in consideration of taxonomic information (PMTX). [Figure: an excerpt of the Gene Ontology used as the taxonomy — cell, insoluble fraction, membrane fraction, vesicular fraction, microsome, cell surface, cell envelope, cell wall, molecular function, catalytic activity, lyase activity, hyaluronate — together with a text and a pattern. Pattern P = (cell) (receptor) (for) (catalytic activity). Text T: Pub:1: Cell. 1990 Jun 29;61(7). Title: CD44 is the principal cell surface receptor for hyaluronate. Authors: Aruffo A, Stamenkovic I, Melnick M, Underhill CB, Seed B. More specific terms in the text can match the more general concepts of the pattern through the taxonomy.]

北海道大学 Hokkaido University 27 Lecture on Information knowledge network 2011/11/29 Result of Kida & Arimura [2004]. In general: O(m + mh/w) time for preprocessing, O(m|∑|/w) space, and O(mn/w) time for scanning the text. When m < w, i.e. the pattern fits in one machine word, this becomes O(m + h) time for preprocessing, O(|∑|) space, and O(n) time for scanning the text — so the method works well when m < w. Here m is the length of the pattern P ∈ ∑*, n is the length of the text T ∈ ∑*, h is the size of the taxonomic information H, |∑| is the size of the set ∑ of concepts, and w is the machine word length (say, 32 or 64). T. Kida and H. Arimura: Pattern Matching with Taxonomic Information, Proc. of Asia Information Retrieval Symposium (AIRS2004), Oct. 2004.

北海道大学 Hokkaido University 28 Lecture on Information knowledge network 2011/11/29 Taxonomic information and sorted alphabet. Sorted alphabet (∑, ≼): ∑ is a finite alphabet (a set of concepts) and ≼ is a partial order relation on it. ※ Its diagram is also called a Hasse diagram. We assume that a pattern and a text are given as sequences of concepts, P ∈ ∑* and T ∈ ∑*, e.g. pattern P := A B E F and text T := A B C B D F C B. [Figure: an example DAG H over the concepts A–G representing (∑, ≼); the concept E corresponds to the character class [A, B, C, D, E].]

北海道大学 Hokkaido University 29 Lecture on Information knowledge network 2011/11/29 Examples of sorted alphabets. [Figure: (1) a flat alphabet A, B, C, D, …, Z with no order between the letters; (2) classes of characters, e.g. [0-9] above the digits 0, 1, 2, …, 9, [a-z] above the letters a, …, z, and ? above everything; (3) a letter-sets alphabet, i.e. the subsets of {a, b, c} ordered by inclusion: (abc) on top, then (ab), (ac), (bc), then (a), (b), (c), and φ at the bottom.]

北海道大学 Hokkaido University 30 Lecture on Information knowledge network 2011/11/29 We can utilize the Shift-And method! The recurrence is the same as before: R_i = ((R_{i-1} << 1) | 1) & M(T[i]). The only difference is the mask table M: for a pattern such as P = a b [ab] b b, the position of the character class [ab] is set in both M(a) and M(b). [Figure: the mask table for P = a b [ab] b b and a run of the algorithm on a text over {a, b}.]

北海道大学 Hokkaido University 31 Lecture on Information knowledge network 2011/11/29 Toward taxonomic information. With taxonomic information H over the concepts A–G, pattern P := A B E F, and text T := A B C B D F C B, we need an extended mask table M' over all of A–G: bit i must be set in M'(a) whenever the text concept a is subsumed by the pattern concept P[i]. [Figure: the DAG H and the mask table M' for the concepts A–G, annotated with "O(mh) ?" — how do we compute M' efficiently?]

北海道大学 Hokkaido University 32 Lecture on Information knowledge network 2011/11/29 Computation of M'(a). Lemma 1: Let (∑, ≼) be a sorted alphabet. Given a pattern P ∈ ∑*, for any a ∈ ∑ it holds that M'(a) = ⋃_{x ∈ Upb(a)} M(x), where Upb(a) is the set of upper bounds of a (including a itself). Lemma 2: Let (∑, ≼) be a sorted alphabet. Given a pattern P ∈ ∑*, for any a ∈ ∑ it holds that M'(a) = M(a) ∪ ⋃_{x ∈ Par(a)} M'(x), where Par(a) is the set of parents of a.

北海道大学 Hokkaido University 33 Lecture on Information knowledge network 2011/11/29 Pseudo code for computing M'(a).

Procedure Preprocess_M'(P = p_1 … p_m)   /* Assume H is a global variable */
  initialize M as follows: M(a) = {1 ≤ i ≤ m | P[i] = a}   /* O(m) time */
  for each a ∈ ∑ do
    CalculateM'(a)
  end of for

Function CalculateM'(a)
  if M'(a) has already been computed then return M'(a)
  else
    M'(a) = M(a)
    for each x ∈ Par(a) do
      M'(a) = M'(a) ∪ CalculateM'(x)   /* each edge of H is followed once (O(h) calls); each union costs O(m/w) */
    end of for
    return M'(a)

Total: O(m + mh/w) time.
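For concreteness, a small Python version of the same procedure (our own illustration: masks are integers as in Shift-And, the taxonomy is a parent-list dictionary, and memoization plays the role of "has already been computed"; the toy taxonomy below is not the DAG from the slide):

def preprocess_masks(pattern, parents):
    """Compute M' (as integer bit masks) for every concept.
    parents[a] is the list of direct parents of concept a in the taxonomy H."""
    m_table = {a: 0 for a in parents}
    for i, c in enumerate(pattern):
        m_table[c] |= 1 << i                     # M(a) = {i | P[i] = a}
    m_prime = {}

    def calc(a):                                 # Lemma 2: M'(a) = M(a) ∪ ⋃ M'(parent)
        if a not in m_prime:
            m_prime[a] = m_table[a]
            for x in parents[a]:
                m_prime[a] |= calc(x)
        return m_prime[a]

    for a in parents:
        calc(a)
    return m_prime

# Toy taxonomy: E generalizes A and B; F is unrelated.
parents = {"A": ["E"], "B": ["E"], "E": [], "F": []}
print(preprocess_masks(["A", "B", "E", "F"], parents))   # {'A': 5, 'B': 6, 'E': 4, 'F': 8}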

北海道大学 Hokkaido University 34 Lecture on Information knowledge network 2011/11/29 Overview of a retrieval system with the PMTX algorithm. We have to parse the text into a sequence of concepts. [Figure: the text DB is converted by a translator — a replace automaton (Arikawa and Shiraishi [1984], O(h + n) time) or, for natural language texts, a morphological parser such as ChaSen — into a concept sequence; the pattern matching machine then scans it with the pattern, consulting the taxonomic information DB, and reports the occurrences.]

北海道大学 Hokkaido University 35 Lecture on Information knowledge network 2011/11/29 The 7th summary. Method for multi-byte code texts (Japanese texts): –Embedding the code automaton into the AC machine for synchronization. –Combining a code automaton that outputs mask bit sequences with bit-parallel methods. Toward an intelligent pattern matching: –Pattern matching in consideration of the structure of the text: pattern matching for XML texts; pattern matching for arc-annotated texts. –Pattern matching in consideration of the meaning of the text (in cooperation with ontology data): pattern matching in consideration of taxonomic information. Prof. Arimura will take charge of this class from the next lecture: –Efficient data structures for information retrieval. –Data mining from the web, etc.

北海道大学 Hokkaido University 36 Lecture on Information knowledge network 2011/11/29 Karp-Rabin algorithm. It is a randomized algorithm using a hashing technique: it matches a string by regarding it as an integer! The worst case takes O(mn) time, but it runs in O(n + m) time on average, and the extra space needed is only O(1). KARP R.M., RABIN M.O.: Efficient randomized pattern-matching algorithms, IBM J. Res. Dev., 31(2), 1987. Example with ∑ = {0, 1, 2, …, 9} and modulus 13: the window hash is updated in O(1) time by dropping the highest digit of the previous step and appending the newly read lowest digit. Pattern hash: 31415 mod 13 = 7. When the text window slides from 31415 to 14152, the new hash is 14152 ≡ (31415 − 3×10000)×10 + 2 ≡ (7 − 3×3)×10 + 2 ≡ 8 (mod 13). Every window whose hash equals the pattern's hash is a candidate occurrence (some correct, some spurious) and must be verified character by character.
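The arithmetic above can be checked directly (a small verification snippet, nothing more):

# Rolling-hash update 31415 -> 14152 with d = 10, q = 13 (drop the leading 3, append 2).
q, d, m = 13, 10, 5
h = pow(d, m - 1, q)                         # d^(m-1) mod q = 3
t_old = 31415 % q                            # 7
t_new = ((t_old - 3 * h) * d + 2) % q        # ((7 - 3*3)*10 + 2) mod 13
print(t_old, t_new, 14152 % q)               # 7 8 8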

北海道大学 Hokkaido University 37 Lecture on Information knowledge network 2011/11/29 Pseudo code.

Karp-Rabin(P, T, d, q)
  m ← length[P]
  n ← length[T]
  h ← d^(m−1) mod q
  p ← 0; t_0 ← 0
  for i ← 1 to m do
    p ← (d·p + P[i]) mod q
    t_0 ← (d·t_0 + T[i]) mod q
  for s ← 0 to n − m do
    if p = t_s then
      if P[1…m] = T[s+1…s+m] then        /* check whether the candidate is a real occurrence */
        report an occurrence at shift s
    if s < n − m then
      t_{s+1} ← (d·(t_s − T[s+1]·h) + T[s+m+1]) mod q

北海道大学 Hokkaido University 38 Lecture on Information knowledge network 2011/11/29 Randomized approximate pattern matching using FFT. The Fast Fourier Transform (FFT) can be computed at high speed, even in hardware. These methods do (approximate) pattern matching by mapping the strings to numeric sequences and then computing the score vector — for each shift, the number of matching positions — at high speed with the FFT. K. Baba, A. Shinohara, M. Takeda, S. Inenaga, and S. Arikawa: A Note on Randomized Algorithm for String Matching with Mismatches, Nordic Journal of Computing, 10(1):2-12, 2003. K. Baba (Kyushu Univ.). [Figure: a text T = a c b a b b a c c b, a pattern P = a b b a c, and the resulting score vector.]
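A deterministic sketch of the score-vector computation via FFT correlation, using numpy (one convolution per distinct pattern character; the randomized algorithm of the paper instead maps characters to random values so that fewer convolutions suffice — the code below only illustrates the FFT part):

import numpy as np

def score_vector(text, pattern):
    """scores[s] = number of positions where pattern matches text at shift s."""
    n, m = len(text), len(pattern)
    size = 1 << (n + m - 1).bit_length()            # FFT length: power of two >= n+m-1
    scores = np.zeros(n - m + 1)
    for c in set(pattern):
        t_ind = np.array([1.0 if ch == c else 0.0 for ch in text])
        p_ind = np.array([1.0 if ch == c else 0.0 for ch in pattern])[::-1]   # reversed -> correlation
        conv = np.fft.irfft(np.fft.rfft(t_ind, size) * np.fft.rfft(p_ind, size), size)
        scores += conv[m - 1:n]                     # conv[s + m - 1] counts matches of c at shift s
    return np.rint(scores).astype(int)

print(score_vector("acbabbaccb", "abbac"))          # [3 1 1 5 2 0]: exact match at shift 3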