Lecture on Information Knowledge Network (2011/11/29)
"Information retrieval and pattern matching"
Laboratory of Information Knowledge Network, Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University
Takuya KIDA
The 8th: Misc. topics of pattern matching
Method for multi-bytecode texts
Toward an intelligent pattern matching:
1. Pattern matching for XML data
2. Pattern matching on texts with arc annotation
3. Pattern matching with taxonomy data
Appendix: Randomized algorithm
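Method for multi-bytecode texts (Japanese texts)
Synchronization problem of codewords:
–False detection occurs when we do pattern matching on a Japanese text in units of bytes (as we do for ASCII).
–It is necessary to determine the boundaries of characters, just as with Huffman codes.
Example (Japanese EUC encoded text):
–Text T = "TFT 液晶の時代" ("the age of TFT LCDs"); the bytes of 液晶の時代 are B1D5 BEBD A4CE BBFE C2E5
–Pattern P = "修了" ("completion") = BDA4 CEBB
–As a sequence of bytes, BD A4 CE BB occurs in T across the boundaries of 晶 (BEBD), の (A4CE), and 時 (BBFE), so 修了 is falsely detected.
[Figure: AC machine for the pattern P = "BD A4 CE BB" (修了), states 0 to 4, with a self-loop labeled ∑ - {BD} at the initial state]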
Review: Solution by automaton with synchronization
[Figure: Huffman tree over the symbols A, B, C, D, E]
–Pattern P = DEC; Huffman encoded pattern E(P) =
–Text T = ABECA・・・; Huffman encoded text E(T) = ・・・
[Figure: an ordinary KMP automaton (with a ∑ self-loop) compared with a KMP automaton with synchronization]
M. Miyazaki, S. Fukamachi, M. Takeda: Speeding up the pattern matching machine for compressed texts (in Japanese), Trans. IPSJ, Vol. 39, No. 9, pp , 1998.
PM on multi-bytecode texts by an automaton with synchronization
–Text T = "TFT 液晶の時代" (Japanese EUC encoded text); as a sequence of bytes: B1D5BEBDA4CEBB FEC2E5
–Pattern P = "修了" = BD A4 CE BB
[Figure: an AC machine with synchronization, which correctly detects the (EUC encoded) pattern P = "修了". States 0 to 4 follow BD, A4, CE, BB; the part for synchronization uses the extra states z and g with edge labels [8E, A0-FF] ∖ [BD], [A0-FF], [8F], [A0-FF], and [00-8D, 90-9F].]
[Figure: code automaton accepting any EUC code, with states 0, z, g and edge labels [00-8D, 90-9F], [8E, A0-FF], [8F], [A0-FF]; annotations {full-width char.} and {half-width char.}]
M. Takeda, et al.: Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-Byte Character Texts, and Semi-Structured Texts, Proc. of SPIRE2002, LNCS2476, pp , 2002.
Idea of bit-parallel technique
Text T: abababba
Pattern P: ababb
[Figure: mask table M with one bit row for each of the characters a and b, and the state bit vector R]
R_i = ((R_{i-1} << 1) | 1) & M(T[i])
–This can be calculated in O(1) time.
※ Only the correctly shifted bits are kept, by taking the AND with the mask bits M.
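The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `shift_and` and the convention that bit j stands for pattern position j (0-based) are assumptions made here.

```python
def shift_and(text, pattern):
    """Return the 0-based start positions of pattern in text (Shift-And)."""
    m = len(pattern)
    if m == 0:
        return []
    # Mask table M: bit j of masks[c] is set iff pattern[j] == c.
    masks = {}
    for j, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << j)
    accept = 1 << (m - 1)      # bit of the last pattern position
    state = 0                  # R_0: no prefix matched yet
    hits = []
    for i, c in enumerate(text):
        # R_i = ((R_{i-1} << 1) | 1) & M(T[i])
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            hits.append(i - m + 1)
    return hits
```

With the slide's example, `shift_and("abababba", "ababb")` reports the single occurrence starting at index 2. Python integers are unbounded, so the sketch does not show the word-size limit (m ≤ w) that the O(1) update relies on.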
Bit-parallel method for multi-bytecode texts
Basic idea:
–We construct a pattern matching machine (code automaton) that can determine the boundaries of codewords and recognize each multi-byte character in the input pattern.
–The code automaton runs while reading the text byte by byte, and it outputs the mask bit sequence corresponding to each character in the input pattern.
–We simulate an arbitrary bit-parallel algorithm by using the output M(T[i]) of the code automaton instead of reading T[i].
[Figure: a code automaton that can determine the boundaries of EUC codes and recognize "修" (BD A4) and "了" (CE BB), with outputs M[修]=01 and M[了]=10; states 0, z, g; edge labels [8E, A0-FF] / [BD,CE], [A0-FF], [8F], [A0-FF], [00-8D, 90-9F]. Its output feeds an arbitrary bit-parallel algorithm, e.g. R_i = ((R_{i-1} << 1) | 1) & M(T[i]).]
Heikki Hyyrö, Jun Takaba, Ayumi Shinohara, and Masayuki Takeda: On Bit-Parallel Processing of Multi-byte Strings, Proc. of Asia Information Retrieval Symposium, pp , 2004.
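The combination can be sketched as follows. This is a simplified illustration, not the construction from the cited paper: instead of an explicit automaton with states 0, z, g, the hypothetical helper `euc_chars` derives the codeword length from the lead byte (8F: three bytes; 8E or A0-FF: two bytes; otherwise one byte), which mirrors the edge labels on the slide. The names `euc_chars` and `search_euc` are assumptions made here.

```python
def euc_chars(data):
    """Yield (index of last byte, codeword bytes) for each EUC character."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead == 0x8F:
            length = 3                      # 8F followed by two bytes in A0-FF
        elif lead == 0x8E or lead >= 0xA0:
            length = 2                      # two-byte codeword
        else:
            length = 1                      # single-byte codeword
        yield i + length - 1, data[i:i + length]
        i += length

def search_euc(text, pattern):
    """Return byte offsets where the EUC encoded pattern starts in text."""
    pat_chars = [c for _, c in euc_chars(pattern)]
    m = len(pat_chars)
    if m == 0:
        return []
    # One mask per pattern character; bit j stands for the j-th character.
    masks = {}
    for j, c in enumerate(pat_chars):
        masks[c] = masks.get(c, 0) | (1 << j)
    accept = 1 << (m - 1)
    state = 0
    hits = []
    for end, c in euc_chars(text):
        # Shift-And step driven by the mask of a whole character.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            hits.append(end - len(pattern) + 1)
    return hits
```

For the text "TFT液晶の時代" a plain byte search finds BD A4 CE BB at byte offset 6, whereas `search_euc` reports no occurrence, because the state is updated only at character boundaries.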
Toward an intelligent pattern matching
Until now …
–Text = just a sequence of characters (we have ignored the background knowledge about the text and the meaning of sentences)
–Fast! Fast! Fast!
From now on …
–Text = a sequence of sentences that have meanings and/or structures
–We need an intelligent pattern matching (of course, at high speed!)
Pattern matching in consideration of the structure of the text
–Pattern matching for XML texts
–Pattern matching for texts with arc-annotation
–etc…
Pattern matching in consideration of the meaning of the text (cooperating with ontology data)
–Pattern matching in consideration of the taxonomic information
–Thesaurus, inductive rules, etc…
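Pattern matching for XML texts: previous ones
[Figure: the XML document is either loaded into memory and accessed by the application program through the DOM API, or stored in an RDB and queried with SQL. Path/value table in the figure: person/name/last = Tanaka, person/name/first = Makiko, person/name = "", person = "". RDB table "person" with columns name (first, last) = (Makiko, Tanaka).]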
Pattern matching for XML texts: our approach
[Figure: the application program runs a pattern matching algorithm over the XML document as is (example record: Makiko Tanaka)]
M. Takeda, et al.: Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-Byte Character Texts, and Semi-Structured Texts, Proc. of SPIRE2002, LNCS2476, pp , 2002.
Advantage of pattern matching approach
–It can process a huge XML document, or a large number of documents, in one batch.
–It can treat many queries at once.
–Fast processing
–In a little memory space
–Various applications
[Figure labels: XML document, tree structure]
Problem in a simple pattern matching algorithm
–It may match part of a tag name.
–Example text: That TVCM "mother" If we remove m, it becomes "other"
–Wrong detection: is the occurrence inside or outside of tags?
Pattern Π = {other, }
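A solution
[Figure: an ordinary AC machine for the keyword "other" (with a ∑ self-loop at the initial state), compared with an AC machine that takes XML tags into account. In the latter, the edge labels are '<', '>', and "other than '<'", so that the inside of a tag is handled separately from the text part.]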
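The effect of the tag-aware machine can be imitated with a much simpler scan. The sketch below is an assumption-laden stand-in, not the AC machine of the slide: it handles one keyword, uses a naive comparison instead of goto/failure transitions, and assumes that the keyword is non-empty and contains neither '<' nor '>'. The name `find_outside_tags` is hypothetical.

```python
def find_outside_tags(text, keyword):
    """Return start indices of keyword occurrences outside <...> tags."""
    hits = []
    in_tag = False
    for i, ch in enumerate(text):
        if ch == "<":
            in_tag = True           # entering a tag: stop reporting matches
        elif ch == ">":
            in_tag = False          # leaving the tag: back to the text part
        elif not in_tag and text.startswith(keyword, i):
            hits.append(i)
    return hits
```

On `<mother>If we remove m, it becomes other</mother>` a plain substring search sees "other" three times, while this scan reports only the occurrence in the text part.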
Handling of attributes
[Figure: the tag-aware AC machine extended for attributes. Edge labels: '<', '>', "characters other than '<'", "characters other than '>'", and ']'. Two tags in the figure are annotated as "the same tag".]
Pattern matching in consideration of XML path
–I want to look for the persons whose family name is "Tanaka" (in XPath expression, the element //person/name/last/ is equal to "Tanaka").
[Figure: the tag-aware AC machine combined with a stack of (tag, state) pairs, e.g. (,0) (,1) (,0) (,2). The tag pattern set ={,,,,,,…} and the keyword set ={Tanaka} are shown; the tag names are missing here.]
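The role of the stack can be sketched as below. This is only an illustration under strong assumptions: tokens are split with a regular expression rather than by the AC machine, the stack holds tag names only (the slide stores (tag, state) pairs), and the input has no self-closing tags, comments, or CDATA sections. The names `TOKEN` and `count_by_path` are hypothetical.

```python
import re

# A token is an opening tag, a closing tag, or a run of text between tags.
TOKEN = re.compile(r"<(/?)([^<>\s/]+)[^<>]*>|([^<]+)")

def count_by_path(xml, path, value):
    """Count text nodes equal to value whose element path ends with path."""
    stack = []                      # names of the currently open elements
    count = 0
    for closing, name, text in TOKEN.findall(xml):
        if name:
            if closing:
                if stack and stack[-1] == name:
                    stack.pop()     # leave the element
            else:
                stack.append(name)  # enter the element
        elif text.strip() == value and stack[-len(path):] == path:
            count += 1
    return count
```

For example, with `path = ["person", "name", "last"]` and `value = "Tanaka"`, a person whose first name is "Tanaka" is not counted, because the top of the stack is `first` rather than `last` when the text is read.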
Processible subset of XPath
Limitation of pattern matching approach
–We cannot specify the predecessor nodes (axes that look backward, such as parent, ancestor, and preceding, are not in the grammar below).
–Complex filter specifications remarkably decrease the processing speed.
LocationPath ::= '/' RelativeLocationPath
RelativeLocationPath ::= Step | RelativeLocationPath '/' Step
Step ::= AxisSpecifier NodeTest
AxisSpecifier ::= AxisName '::'
AxisName ::= 'attribute' | 'child' | 'descendant' | 'descendant-or-self' | 'following' | 'following-sibling' | 'self' | 'namespace'
NodeTest ::= QName | NodeType '(' ')'
NodeType ::= 'node' | 'text' | 'comment' | 'processing-instruction'
Example: /descendant::cars/child::car/attribute::node()
Speed comparison with Sgrep
Comparison with Sgrep (J. Jaakkola and P. Kilpeläinen)
–Text: 110 MB (English text)
–CPU: Celeron 366 MHz
–Memory: 128 MB
–OS: Kondara MNU/Linux 2.1 RC2
[Table: CPU time (sec.) of Sgrep and of Takeda et al. [2002] for the patterns //text/"summers", //test//"summers", and /site/regions/africa/item/location/"United_States"; the measured values are missing here]
Pattern matching for texts with arc-annotation
Definition: An arc annotation A that accompanies a sequence S is a set of pairs of integers from {1, 2, …, |S|}.
Each element (iL, iR) ∈ A is called an arc.
–S[iL] and S[iR] are called the left endpoint and the right endpoint, respectively.
–For an arbitrary arc, we assume that iL < iR holds.
–Moreover, no two arcs share the same integer; that is, no two arcs share an endpoint.
An example of a text with arc-annotation: AGTCACGCCCGT
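The conditions in the definition can be checked mechanically. The following is a small sketch with a hypothetical function name; arcs are given as 1-based pairs (iL, iR), as in the definition.

```python
def is_valid_arc_annotation(length, arcs):
    """Check that arcs is a valid arc annotation for a sequence of given length."""
    used = set()                    # positions already used as an endpoint
    for left, right in arcs:
        if not (1 <= left < right <= length):
            return False            # endpoints out of range, or iL >= iR
        if left in used or right in used:
            return False            # two arcs share an endpoint
        used.update((left, right))
    return True
```

For a sequence of length 12 such as AGTCACGCCCGT, the set {(1, 12), (2, 5)} passes the check, while {(1, 5), (5, 8)} does not, because position 5 would be an endpoint of two arcs.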
Example of text with arc annotation
An example: the two-dimensional structure of a tRNA (tRNAPhe)
・・・ ACACCUAGCΨTGUGU ・・・
[Figure: the string having nested arcs]
Arc-preserving subsequence (APS) problem
Given a text S1 = S1[1 : n] and a pattern S2 = S2[1 : m] with arc annotations A1 and A2, respectively, the APS problem asks whether both of the following conditions are satisfied.
–S2 is a subsequence of S1.
–The embedding preserves arcs: two positions of the pattern are joined by an arc in A2 if and only if the text positions they are mapped to are joined by an arc in A1.
Text: S1 := AGTCACGCCCGT
Pattern: S2 := ATGCT
[Figure: an alignment of the pattern to the text, annotated "○ base match" and "× arc match"]
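To make the two conditions concrete, here is a brute-force check that tries every embedding. It takes exponential time and is meant only as an executable statement of the problem, not as any of the algorithms discussed in the lecture; the function name is hypothetical and arcs are 1-based pairs.

```python
from itertools import combinations

def is_arc_preserving_subsequence(s1, arcs1, s2, arcs2):
    """Decide the APS problem by trying all increasing position tuples."""
    n, m = len(s1), len(s2)
    for positions in combinations(range(1, n + 1), m):
        # Condition 1: the chosen text positions spell the pattern.
        if any(s1[p - 1] != s2[k] for k, p in enumerate(positions)):
            continue
        # Condition 2: an arc in the pattern iff an arc between the images.
        if all(((k + 1, l + 1) in arcs2) == ((positions[k], positions[l]) in arcs1)
               for k in range(m) for l in range(k + 1, m)):
            return True
    return False
```

With the text "AGTCA" and the single arc (1, 5), the pattern "AA" with the arc (1, 2) is accepted, but "AA" without arcs is rejected: the bases match, yet the only two A's in the text are joined by an arc.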
APS(TYPE1, TYPE2)
The difficulty of the APS problem changes with the structure of the arc annotations.
APS(TYPE1, TYPE2)
–TYPE1: arc structure of the text
–TYPE2: arc structure of the pattern
Example: APS(nested, chain)
–The arc structure of the text is "nested"
–The arc structure of the pattern is "chain"
[Figure: arc structures ordered from the loosest limitation (highest difficulty) to the strictest limitation (lowest difficulty): Crossing, Nested, Chain, Plain]
Result of Kida[2005]
Previous work on the APS problem:
–J. Gramm, J. Guo, and R. Niedermeier. "Pattern matching for arc-annotated sequences." In Proc. 22nd FSTTCS, volume 2556 of LNCS, pages 182–193. Springer.
The results of Kida[2005]:
–Proposed an improved algorithm based on the Gramm-Guo-Niedermeier (GGN) algorithm (however, the worst-case complexity is the same as that of GGN).
–Corrected an error in the GGN algorithm (the original GGN algorithm includes an error).
–Implemented the algorithms and ran experiments (the proposed algorithm runs 2 to 5 times faster than GGN).
APS(nested, nested) is solved in O(nm) time.
Kida: Faster Pattern Matching Algorithm for Arc-Annotated Sequences, Proc. of Federation over the Web, LNAI (to appear)
[Figure: experimental results when the text length n changes; |A1| = 20% of n, m = 20, |A2| = 4]
[Figure: experimental results when the pattern length m changes; |A2| = 20% of m, n = 1000, |A1| = 100]
Summary to here
–Method for multi-bytecode texts (Japanese texts)
  Embedding the code automaton into the AC machine for synchronization
  Combining the code automaton that outputs mask bit sequences with bit-parallel methods
–Pattern matching in consideration of the structure of the text
  Pattern matching for XML texts
  Pattern matching for arc-annotated texts
~ Trivia ~
How to compute min(x,y) without conditional branching when two integers x and y are represented as m-bit sequences:
S ← ((x | 10^m) − y) & 10^m
S ← S − (S ≫ m)
min(x,y) ← (~S & x) | (S & y)
(Here 10^m is the bit string of a 1 followed by m zeros, so we need m+1 bits for each value.)
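The trivia can be checked directly. The sketch below transcribes the three assignments into Python (the function name is an assumption); Python integers are unbounded, so `~s` behaves like an infinitely long two's-complement word, which is harmless here because x and y are below 2^m.

```python
def branchless_min(x, y, m):
    """min(x, y) for 0 <= x, y < 2**m, computed without any comparison."""
    top = 1 << m                    # the bit string 1 0^m
    s = ((x | top) - y) & top       # top if x >= y, otherwise 0
    s = s - (s >> m)                # m ones if x >= y, otherwise 0
    return (~s & x) | (s & y)       # selects y if x >= y, otherwise x
```

The first step works because x | 10^m equals x + 2^m, so the subtraction keeps bit m set exactly when x ≥ y.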
Example of pattern matching in consideration of taxonomic information (PMTX)
[Figure: part of the Gene Ontology. Concepts shown: cell, insoluble fraction, membrane fraction, vesicular fraction, microsome, cell surface, cell envelope, cell wall; molecular function, catalytic activity, lyase activity, hyaluronate]
Pattern P: (cell) (receptor) (for) (catalytic activity)
Text T:
Pub:1: Cell. 1990 Jun 29;61(7):
Title: CD44 is the principal cell surface receptor for hyaluronate.
Authors: Aruffo A, Stamenkovic I, Melnick M, Underhill CB, Seed B.
Result of Kida&Arimura[2004]
In general:
–O(m+mh/w) time for preprocessing
–O(m|∑|/w) space
–O(mn/w) time for scanning the text
It works well when m < w:
–O(m+h) time for preprocessing
–O(|∑|) space
–O(n) time for scanning the text
Notation:
–m: the length of pattern P ∈ ∑*
–n: the length of text T ∈ ∑*
–h: the size of taxonomic information H
–|∑|: the size of the set ∑ of concepts
–w: the length of a machine word (say, 32 or 64)
T. Kida and H. Arimura: Pattern Matching with Taxonomic Information, Proc. of Asia Information Retrieval Symposium (AIRS2004), pp , Oct
Taxonomic information and sorted alphabet
Sorted alphabet (∑, ⪯)
–∑: a finite alphabet (a set of concepts)
–⪯: a partial order relation
We assume that a pattern and a text are given as sequences of concepts: P ∈ ∑* and T ∈ ∑*
[Figure: an example of a DAG H representing (∑, ⪯), over the concepts A, B, C, D, E, F, G]
※ This is also called a Hasse diagram.
Pattern P:= A B E F A B C B D F C B Text T:=
Concept E corresponds to the character class [A,B,C,D,E].
Examples of sorted alphabet
(1) flat alphabet [Figure labels: A B C D Z]
(2) class of characters [Figure labels: 0 1 2 9 a z, [0-9], [a-z], ?]
(3) letter-sets alphabet [Figure labels: (abc), (ab), (ac), (bc), (a), (b), (c), φ]
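We can utilize the Shift-And method!
Text T: ababbbba
Pattern P: ab[ab]bb
[Figure: mask table M with rows for the characters a and b and one column for each pattern position a, b, [ab], b, b]
–The difference is just in the mask table: the column of the class [ab] has its bit set in the masks of both a and b.
–The update is the same: R_i = ((R_{i-1} << 1) | 1) & M(T[i])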
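A sketch of this variant, assuming the pattern is given as a list in which every position is a string of allowed characters (so `"ab"` stands for the class [ab]). The function name and this input format are choices made for the illustration. Only the construction of the mask table changes compared with plain Shift-And.

```python
def shift_and_classes(text, pattern):
    """Shift-And where each pattern position is a set of allowed characters."""
    m = len(pattern)
    if m == 0:
        return []
    masks = {}
    for j, char_class in enumerate(pattern):
        for c in char_class:        # the only change: one bit per class member
            masks[c] = masks.get(c, 0) | (1 << j)
    accept = 1 << (m - 1)
    state = 0
    hits = []
    for i, c in enumerate(text):
        state = ((state << 1) | 1) & masks.get(c, 0)   # same update as before
        if state & accept:
            hits.append(i - m + 1)
    return hits
```

For the slide's example, `shift_and_classes("ababbbba", ["a", "b", "ab", "b", "b"])` finds occurrences starting at indices 0 (ababb) and 2 (abbbb).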
Toward taxonomic information
[Figure: taxonomic information H, the DAG over the concepts A to G, next to a mask table M' with one column for each concept A, B, C, D, E, F, G]
–Cost of building M': O(mh)?
Pattern P:= A B E F A B C B D F C B Text T:=
Computation of M'(a)
Lemma 1: Let (∑, ⪯) be a sorted alphabet. Given a pattern P ∈ ∑*, for any a ∈ ∑, it holds that
M'(a) = ∪_{x ∈ Upb(a)} M(x).
Lemma 2: Let (∑, ⪯) be a sorted alphabet. Given a pattern P ∈ ∑*, for any a ∈ ∑, it holds that
M'(a) = M(a) ∪ ∪_{x ∈ Par(a)} M'(x).
(Upb(a) is read here as the set of upper bounds of a, including a itself, and Par(a) as the set of parents of a in H.)
Pseudo code for computing M'(a)
Preprocess_M' (P = p_1…p_m)   /* Assume H is a global variable */
  initialize M(a) for every a ∈ ∑ as M(a) = {1 ≦ i ≦ m | P[i] = a};   /* O(m) */
  for each a ∈ ∑ do
    CalculateM'(a);
  end of for
Function CalculateM'(a)
  if M'(a) has been computed then return M'(a)
  else do
    M'(a) = M(a);
    for each x ∈ Par(a) do
      M'(a) = M'(a) ∪ (CalculateM'(x));   /* O(m/w) for each union */
    end of for
    return M'(a);
Cost: the unions are executed O(h) times, each in O(m/w) time, so the total is O(m+mh/w).
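The recursion of Lemma 2 can be sketched with memoization as below. The taxonomy in the usage note is hypothetical: it is chosen only so that E covers [A,B,C,D,E] as stated earlier, and it is not claimed to be the DAG drawn on the slides. Sets of positions are represented as bits of a Python integer (bit j for pattern position j, 0-based), and the function names are assumptions.

```python
def compute_extended_masks(pattern, parents):
    """Return M' as a dict: M'(a) = M(a) | union of M'(x) over parents x of a."""
    base = {}                                   # M(a): positions where P[i] == a
    for j, concept in enumerate(pattern):
        base[concept] = base.get(concept, 0) | (1 << j)
    extended = {}                               # memo table for M'

    def calc(a):
        if a not in extended:
            mask = base.get(a, 0)
            for x in parents.get(a, ()):
                mask |= calc(x)                 # one word-parallel union
            extended[a] = mask
        return extended[a]

    for a in set(parents) | set(pattern):
        calc(a)
    return extended

def pmtx_search(text, pattern, parents):
    """Shift-And scan over a sequence of concepts using the mask table M'."""
    m = len(pattern)
    if m == 0:
        return []
    masks = compute_extended_masks(pattern, parents)
    accept, state, hits = 1 << (m - 1), 0, []
    for i, concept in enumerate(text):
        state = ((state << 1) | 1) & masks.get(concept, 0)
        if state & accept:
            hits.append(i - m + 1)
    return hits
```

With the hypothetical taxonomy `{"A": ["C"], "B": ["C"], "C": ["E"], "D": ["E"], "E": ["G"], "F": ["G"], "G": []}` and the pattern A B E F, the text A B D F G A B A F contains two occurrences, at indices 0 and 5, because D and A both lie below E.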
Overview of retrieval system with PMTX algorithm
[Figure components: Taxonomic information DB, Text DB, Translator, Pattern, Pattern matching machine, Occurrences]
–We have to parse the text into a sequence of concepts.
–Translator: Replace Automaton (Arikawa and Shiraishi[1984]), O(h+n)
–Or use a morphological parser for natural language texts, such as ChaSen.
The 7th summary
Method for multi-bytecode texts (Japanese texts)
–Embedding the code automaton into the AC machine for synchronization
–Combining the code automaton that outputs mask bit sequences with bit-parallel methods
Toward an intelligent pattern matching:
–Pattern matching in consideration of the structure of the text
  Pattern matching for XML texts
  Pattern matching for arc-annotated texts
–Pattern matching in consideration of the meanings of the text (in cooperation with ontology data)
  Pattern matching in consideration of taxonomic information
Prof. Arimura will take charge of this class from the next lecture
–Efficient data structures for information retrieval
–Data mining from the web, etc.
Karp-Rabin algorithm
–It is a randomized algorithm using a hashing technique: it matches a string by regarding it as an integer!
–The worst case takes O(mn) time, but the average case takes O(n+m) time.
–The extra space we need is only O(1).
[Figure: a text and a pattern over ∑ = {0,1,2,…,9}, with the value of each window taken mod 13; a window with the same value as the pattern is marked either "Correct!" or "Wrong!"]
Sliding the window by one position removes the highest digit of the previous step and appends the lowest digit that is newly input:
14152 ≡ (31415 – 3×10000)×10 + 2 (mod 13)
≡ (7 – 3×3)×10 + 2 (mod 13)
≡ 8 (mod 13)
KARP R.M., RABIN M.O., Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2): ,
Pseudo code
Karp-Rabin (P, T, d, q)
  m ← length[P];
  n ← length[T];
  h ← d^(m–1) mod q;
  p ← 0;
  t_0 ← 0;
  for i ← 1 to m do
    p ← (d・p + P[i]) mod q;
    t_0 ← (d・t_0 + T[i]) mod q;
  for s ← 0 to n – m do
    if p = t_s then
      if P[1…m] = T[s+1…s+m] then   /* check if the candidate is an occurrence */
        report an occurrence at s;
    if s < n – m then   /* update the hash for every s, whether or not it was a candidate */
      t_{s+1} ← (d・(t_s – T[s+1]・h) + T[s+m+1]) mod q;
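A runnable Python sketch of the pseudo code follows. The default radix `d = 256` and modulus `q = 1000003` are arbitrary choices for the illustration, and characters are mapped to numbers with `ord`; none of this comes from the slides. Every candidate is verified by a direct comparison, so the result is exact whatever d and q are.

```python
def karp_rabin(pattern, text, d=256, q=1000003):
    """Return the 0-based shifts s at which pattern occurs in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)            # weight of the highest digit
    p = t = 0
    for i in range(m):              # hash of the pattern and of the first window
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    hits = []
    for s in range(n - m + 1):
        # Equal hashes give only a candidate; compare to rule out a false hit.
        if p == t and text[s:s + m] == pattern:
            hits.append(s)
        if s < n - m:
            # Drop the highest digit text[s], append the new digit text[s + m].
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return hits
```

Choosing a very small modulus, for example `q = 3`, forces many hash collisions and is a convenient way to exercise the verification step.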
Randomized approximate pattern matching using FFT
–The Fast Fourier Transform (FFT) can be computed at high speed on hardware.
–(Approximate) pattern matching is done by converting the strings into sequences of numbers and then computing the score vector at high speed with the FFT.
[Figure: the score vector for the text T = a c b a b b a c c b and the pattern P = a b b a c, indexed by the text position i]
K. Baba, A. Shinohara, M. Takeda, S. Inenaga, and S. Arikawa. A Note on Randomized Algorithm for String Matching with Mismatches. Nordic Journal of Computing, 10(1):2-12,
K. Baba (Kyushu Univ.)
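As a rough illustration of what a score vector is, the sketch below assumes that entry i counts the positions at which P agrees with T[i..i+m-1], considering only alignments where the pattern fits entirely inside the text. That definition, and the function name, are assumptions made here. The sums are computed directly, without the FFT and without the randomization of the cited paper; the code only shows that the score is a sum of one correlation per character, which is the part an FFT would speed up.

```python
def score_vector(text, pattern):
    """scores[i] = number of j with text[i + j] == pattern[j]."""
    n, m = len(text), len(pattern)
    scores = [0] * max(n - m + 1, 0)
    for c in set(pattern):
        # 0/1 indicator sequences of the character c in the text and the pattern.
        t_ind = [1 if x == c else 0 for x in text]
        p_ind = [1 if x == c else 0 for x in pattern]
        # Correlation of the two indicator sequences (an FFT could compute this).
        for i in range(len(scores)):
            scores[i] += sum(t_ind[i + j] * p_ind[j] for j in range(m))
    return scores
```

For T = acbabbaccb and P = abbac this gives [3, 1, 1, 5, 2, 0]; the entry 5 = m at index 3 marks the exact occurrence, and the other entries tell how many positions match at each alignment.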