Published by Ruby Francis. Modified over 9 years ago.
1
Pattern Matching Using n-gram Sampling of Cumulative Algebraic Signatures: Preliminary Results
Witold Litwin [1], Riad Mokadem [1], Philippe Rigaux [1] & Thomas Schwarz [2]
[1] Université Paris Dauphine
[2] Santa Clara University
2
n-gram Search
New pattern matching idea
Matches algebraic signatures
Preprocesses both the pattern and the string (record)
  – String preprocessing is a new idea, to the best of our knowledge
Provides incidental protection of stored data
  – Important for P2P & grid systems
Fast processing
  – Especially useful for DBs & longer patterns: ASCII, Unicode, DNA…
  – Should then often be faster than Boyer-Moore
  – Possibly the fastest known in this context
3
Algebraic Signature
Symbols of the alphabet are elements of a Galois Field
  – Usually GF(256)
We choose one primitive element $\alpha$ of that field
  – Usually $\alpha = 2$
The algebraic signature of the string of $i$ symbols $p_1 \dots p_i$ is the sum
  $p'_i = p_1 \alpha + \dots + p_i \alpha^i$
Here the addition and the multiplication are the operations in the GF.
4
Algebraic Signature
In our GF($2^f$), where $f = 8$ or $16$:
  $p + q = p - q = p \oplus q$ (XOR)
One method for multiplying is
  $p \cdot q = \mathrm{antilog}((\log p + \log q) \bmod (2^f - 1))$
The division is then
  $p / q = \mathrm{antilog}((\log p - \log q) \bmod (2^f - 1))$
The modulus $2^f - 1$ is 255 for $f = 8$.
The log and antilog values are stored in log and antilog tables with $2^f$ elements each.
  – Entry 0 is for element 0 of the GF and is by convention set to $2^f - 1$.
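The table-based arithmetic above can be sketched as follows. This is a minimal illustration, not the authors' code. The slides do not name the generating polynomial, so the choice of 0x11D (for which $\alpha = 2$ is primitive) is an assumption, as are the names `gf_add`, `gf_mul` and `gf_div`. For simplicity the antilog table here holds only the $2^f - 1$ powers of $\alpha$.

```python
# GF(2^8) arithmetic through log/antilog tables.
# Assumption (not from the slides): the field is generated by the polynomial
# x^8 + x^4 + x^3 + x^2 + 1 (0x11D), for which alpha = 2 is primitive.
F = 8
ORDER = (1 << F) - 1            # 255 nonzero field elements
POLY = 0x11D

antilog = [0] * ORDER           # antilog[i] = alpha^i
log = [ORDER] * (1 << F)        # log[0] stays 2^f - 1, the slide's convention

x = 1
for i in range(ORDER):
    antilog[i] = x
    log[x] = i
    x <<= 1                     # multiply by alpha = 2
    if x & (1 << F):            # reduce modulo the generating polynomial
        x ^= POLY


def gf_add(p, q):
    """Addition and subtraction are both XOR."""
    return p ^ q


def gf_mul(p, q):
    """p * q = antilog((log p + log q) mod (2^f - 1)); zero is handled apart."""
    if p == 0 or q == 0:
        return 0
    return antilog[(log[p] + log[q]) % ORDER]


def gf_div(p, q):
    """p / q = antilog((log p - log q) mod (2^f - 1)); q must be nonzero."""
    if q == 0:
        raise ZeroDivisionError("division by zero in GF(2^f)")
    if p == 0:
        return 0
    return antilog[(log[p] - log[q]) % ORDER]
```

The zero element has no logarithm, which is why both functions treat it separately instead of looking up the conventional entry `log[0]`.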
5
Cumulative Algebraic Signature
We encode every symbol $p_i$ of a string into the signature $p'_i$ of the prefix $p_1 \dots p_i$
The value of a CAS symbol now also encodes the values of all the previous symbols
Matching a single CAS symbol therefore means matching the whole prefix
6
Application of CASs
Incidental stored data protection
  – On P2P & grid servers especially
Numerous string matching algorithms on CAS-encoded strings
  – Prefix match with O(1) complexity
  – Pattern match by signature only: Karp-Rabin like, linear O(L) complexity
  – Longest common string search
  – Longest common prefix search
  – …
7
CAS Properties
O(K) encoding and decoding speed
  – For encoding, for instance: $p'_i = p'_{i-1} + p_i \alpha^i = \mathrm{CAS}(p_{i-1}) + p_i \alpha^i$
Fast n-gram signature calculus
  – For $S_{k,l} = p_k \dots p_l$ with $k > 1$ and $l - k + 1 = n$:
    $AS(S_{k,l}) = (p'_l \oplus p'_{k-1}) / \alpha^{k-1}$
    i.e. the signature of $p_k \dots p_l$ taken as a standalone string of $l - k + 1$ symbols
Logarithmic Algebraic Signature (LAS)
  $LAS(S_{k,l}) = \log_\alpha AS(S_{k,l}) = (\log_\alpha(p'_l \oplus p'_{k-1}) - (k-1)) \bmod (2^f - 1)$
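The recurrences above can be tried out with a short sketch. It assumes GF(256) generated by the polynomial 0x11D with $\alpha = 2$ (the slides do not name the polynomial), and it takes $p'_0 = 0$ so that the n-gram formulas also cover $k = 1$. The function names are illustrative only.

```python
# Assumptions (not stated in the slides): GF(2^8) generated by 0x11D with
# primitive element alpha = 2, and p'_0 = 0 so that k = 1 also works.
F = 8
ORDER = (1 << F) - 1
POLY = 0x11D

antilog = [0] * ORDER
log = [ORDER] * (1 << F)        # log[0] = 255 by convention
x = 1
for i in range(ORDER):
    antilog[i], log[x] = x, i
    x <<= 1
    if x & (1 << F):
        x ^= POLY


def gf_mul(p, q):
    return 0 if p == 0 or q == 0 else antilog[(log[p] + log[q]) % ORDER]


def gf_div(p, q):               # q is always a power of alpha here, never 0
    return 0 if p == 0 else antilog[(log[p] - log[q]) % ORDER]


def cas_encode(symbols):
    """p'_i = p'_{i-1} XOR p_i * alpha^i, with i counted from 1."""
    out, acc = [], 0
    for i, p in enumerate(symbols, start=1):
        acc ^= gf_mul(p, antilog[i % ORDER])
        out.append(acc)
    return out


def cas_decode(cas):
    """p_i = (p'_i XOR p'_{i-1}) / alpha^i."""
    out, prev = [], 0
    for i, c in enumerate(cas, start=1):
        out.append(gf_div(c ^ prev, antilog[i % ORDER]))
        prev = c
    return out


def ngram_as(cas, k, l):
    """AS of p_k..p_l (1-based, inclusive), computed from the CAS alone."""
    prev = cas[k - 2] if k > 1 else 0
    return gf_div(cas[l - 1] ^ prev, antilog[(k - 1) % ORDER])


def ngram_las(cas, k, l):
    """LAS of p_k..p_l: (log(p'_l XOR p'_{k-1}) - (k - 1)) mod (2^f - 1)."""
    d = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return ORDER if d == 0 else (log[d] - (k - 1)) % ORDER
```

For example, `ngram_as(cas_encode(list(b"Dauphine")), 3, 4)` yields the same value as the last CAS symbol of the standalone string `up`, without decoding any symbol.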
8
The n-gram Search: Key Ideas
Design a sublinear pattern match search
  – With speed about L / K (L: string length, K: pattern length)
Apply it to a CAS-encoded DB
  – New idea for a string search algorithm with preprocessing
  – Justified for a DB: store once, search many times
9
The n-gram Search: Key Ideas
Preprocess the pattern to create a jump table
  – As in Boyer-Moore
Use n-grams with n > 1 to increase the discriminative power of an attempt
  – Each attempt compares a sample from the pattern:
    a single symbol for BM
    the LAS of an n-gram for a CAS-encoded string
10
The n-gram Search: Key Ideas
If the alphabet uses m symbols, the probability that a symbol matches is 1/m
  – Assuming all symbols are equally likely
For usual ASCII pattern matching, m = 20-25
For DNA, m = 4
A single symbol may often match without the whole pattern matching
  – E.g., 1 time in 4 for DNA on average
This leads to small jumps
  – By m symbols on average
11
The n-gram Search: Key Ideas
The probability of an n-gram matching may be as low as $1 / \min(2^f, m^n)$
  – An LAS takes at most $2^f$ values and there are $m^n$ possible n-grams
  – In our examples it can reach 1/256
More discriminative sampling
Longer jumps
  – By almost K or 256 symbols in general
Useful for longer strings
  – DNA, text, images…
12
ASCII Example
Usual alphabet
2-grams => 5 jumps
1-gram => 6 jumps
13
DNA Example
4-letter alphabet
3 jumps
4 jumps
11 jumps
14
The n-gram Search: Preprocessing
Encode every record (string) into its CAS
  – Done anyhow in SDDS-2006 for incidental protection
Encode the terminal n-gram of the searched pattern $S_K$ into its LAS, stored in variable V
Fill up the jump table T for every other n-gram in $S_K$
  – Calculate every LAS
  – For each LAS, store in T its rightmost offset with respect to the end of $S_K$
15
The n-gram Search: Jump Table
For GF(256), every n-gram $S_{j,j+n-1}$ in the pattern and $i = LAS(S_{j,j+n-1})$:
  – T(i) = the offset of that n-gram
  – T(i) = K − n + 1 otherwise, i.e. for every i that is not the LAS of such an n-gram
Reminder: LAS(0) = 255
T can also be a hash table
  – See the paper
  – Slower to use but possibly more memory efficient
  – Probably more useful for a larger GF
16
ASCII Example
Pattern: Dauphine
V = ne''
Jump table T:
  i       T(i)
  0       7
  1       7
  …       …
  in''    1
  …       …
  au''    5
  …       …
  ph''    3
  …       …
  255     7
Notation: xy'' = LAS(xy)
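The jump table of this example can be rebuilt with a short sketch. It assumes GF(256) generated by the polynomial 0x11D with $\alpha = 2$, which the slides do not specify, so the concrete LAS values are specific to that choice. The names `las` and `build_jump_table` are illustrative.

```python
# Assumption (not from the slides): GF(2^8) generated by 0x11D, alpha = 2.
F = 8
ORDER = (1 << F) - 1
POLY = 0x11D

antilog = [0] * ORDER
log = [ORDER] * (1 << F)        # log[0] = 255, so LAS(0) = 255
x = 1
for i in range(ORDER):
    antilog[i], log[x] = x, i
    x <<= 1
    if x & (1 << F):
        x ^= POLY


def gf_mul(p, q):
    return 0 if p == 0 or q == 0 else antilog[(log[p] + log[q]) % ORDER]


def las(ngram):
    """LAS of a standalone n-gram: log_alpha of sum_j p_j * alpha^j, j = 1..n."""
    acc = 0
    for j, p in enumerate(ngram, start=1):
        acc ^= gf_mul(p, antilog[j % ORDER])
    return log[acc]


def build_jump_table(pattern, n):
    """Return (V, T) for a pattern given as bytes and an n-gram size n."""
    K = len(pattern)
    V = las(pattern[K - n:])                 # LAS of the terminal n-gram
    T = [K - n + 1] * (1 << F)               # default jump for absent n-grams
    for start in range(K - n):               # every non-terminal n-gram,
        offset = K - n - start               # scanned left to right so that
        T[las(pattern[start:start + n])] = offset   # the rightmost one wins
    return V, T


V, T = build_jump_table(b"Dauphine", 2)
```

With this field, the six non-terminal 2-grams of `Dauphine` have distinct LAS values, so `T[las(b"in")]`, `T[las(b"au")]` and `T[las(b"ph")]` are 1, 5 and 3 as in the table, and all other entries keep the default K − n + 1 = 7.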
17
The n-gram Search: Processing
Calculate the LAS of the current n-gram in the string
  – Start with the n-gram $S_{K-n+1,K}$
  – Continue depending on the jump calculus
Attempt to match V
  – If it matches, calculate the LAS of the entire current, possibly matching substring of length K ending with the current n-gram
  – If that LAS matches too, resolve the possible collision:
    either attempt to match all the K symbols,
    or match enough terminal n-grams or symbols to decrease the probability of collision to a very small value
18
The n-gram Search: Processing
Otherwise
  – Go to T using the LAS of the n-gram
  – Jump by the number of symbols found in T
  – Update the "current" position of the n-gram to attempt the match
  – Re-attempt the match as above, unless the n-gram to attempt is beyond the end of the string
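The preprocessing and processing steps can be put together in one sketch. It is one possible reading of the slides, not the SDDS-2006 implementation. It assumes GF(256) generated by 0x11D with $\alpha = 2$, resolves collisions by comparing all K decoded symbols, and also jumps by T after a successful attempt so that overlapping occurrences are reported. All function names are illustrative.

```python
# Assumption (not from the slides): GF(2^8) generated by 0x11D, alpha = 2.
F = 8
ORDER = (1 << F) - 1
POLY = 0x11D
antilog, log = [0] * ORDER, [ORDER] * (1 << F)
x = 1
for i in range(ORDER):
    antilog[i], log[x] = x, i
    x <<= 1
    if x & (1 << F):
        x ^= POLY


def cas_encode(symbols):
    """p'_i = p'_{i-1} XOR p_i * alpha^i, with i counted from 1."""
    out, acc = [], 0
    for i, p in enumerate(symbols, start=1):
        if p:
            acc ^= antilog[(log[p] + i) % ORDER]
        out.append(acc)
    return out


def las(cas, k, l):
    """LAS of symbols k..l (1-based, inclusive), read from the CAS."""
    d = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return ORDER if d == 0 else (log[d] - (k - 1)) % ORDER


def symbol(cas, i):
    """Decode the original symbol p_i (1-based) from the CAS."""
    d = cas[i - 1] ^ (cas[i - 2] if i > 1 else 0)
    return 0 if d == 0 else antilog[(log[d] - i) % ORDER]


def preprocess(pattern, n):
    """Return V, the jump table T and the LAS of the whole pattern."""
    K = len(pattern)
    if not 1 <= n <= K:
        raise ValueError("need 1 <= n <= len(pattern)")
    pc = cas_encode(pattern)
    T = [K - n + 1] * (1 << F)
    for start in range(1, K - n + 1):        # non-terminal n-grams, left to right
        T[las(pc, start, start + n - 1)] = K - (start + n - 1)
    return las(pc, K - n + 1, K), T, las(pc, 1, K)


def search(cas, pattern, n=2):
    """Return the 0-based start of every occurrence of pattern in the CAS."""
    K = len(pattern)
    V, T, whole = preprocess(pattern, n)
    hits, end = [], K                        # end: 1-based end of current n-gram
    while end <= len(cas):
        g = las(cas, end - n + 1, end)
        if g == V and las(cas, end - K + 1, end) == whole:
            # Resolve a possible collision by comparing all K symbols.
            if all(symbol(cas, end - K + j) == pattern[j - 1]
                   for j in range(1, K + 1)):
                hits.append(end - K)
        end += T[g]                          # T[g] >= 1, so the scan advances
    return hits


record = cas_encode(b"Universite Paris Dauphine, place Dauphine")
positions = search(record, b"Dauphine", n=2)
```

Because T stores the rightmost offset of each non-terminal n-gram, a jump of `T[g]` never skips over an occurrence, in the same way as the bad-character shift of a Boyer-Moore-Horspool search.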
19
ASCII Example Again
2-grams => 5 jumps
1-gram => 6 jumps
20
DNA Example Again
3 jumps
4 jumps
11 jumps
21
Related Work
Implemented in SDDS-2006
Applies best to
  – Longer patterns, where many jumps occur
  – Alphabets much smaller than the size of the GF used
Instead of jumps of size m on average, one reaches almost $\min(K, 2^f)$ per jump
  – Up to almost 256 for DNA or ASCII with GF(256)
  – Up to almost 64K for DNA or Unicode with GF(64K)
  – Instead of 4 or 25 respectively, for Boyer-Moore especially
22
n-grams / BM
Jumps with n-grams can typically be longer
Calculating an attempt and a jump is more expensive as well
  – About twice as long, as a first approximation
  – The precise analysis remains to be done
Rule of thumb: if jumps are more than 2 times longer, n-grams with n > 1 should be faster than BM
In both our examples, this should be the case for patterns longer than:
  – 50 symbols for ASCII
  – 8 symbols for DNA
23
Related Work
In SDDS-2006 & P2P or grid systems in general
Wish to hide what is searched for?
  – Use the signature-only based search
  – Usually slower, since it is only linear
24
Conclusion
A new pattern matching algorithm
  – Uses algebraic signatures
  – Preprocesses both the pattern and the string
Appears particularly efficient
  – For databases
  – For longer patterns
Possibly faster in this context than any other known algorithm
But all these are only preliminary results
25
Future Work
Performance analysis
  – Theoretical
    Jump length: median, average…
  – Experimental
    Actual text: non-uniform symbol distribution
    DNA: actual DNA strings
26
Future Work
Variants
  – Jump table
  – Partial signatures of n-grams
    Symbol $p_i$ encodes the signature of the n-gram $p_{i-n+1} \dots p_i$
    No more XORing & division to find this signature
    Faster unsuccessful attempt to match
  – Approximate match
    Tolerating match errors, e.g. in at most 1 symbol
27
Thank You for Your Attention
witold.litwin@dauphine.fr