Pattern Matching in the Streaming Model
Ely Porat, Google Inc. & Bar-Ilan University
Problem definition - Pattern Matching
Given a text T of length n and a pattern P of length m, find all substrings of T that are equal to P.
Problem definition - Online Pattern Matching
We get the text T character by character.
Motivation… Stock market
Motivation… Espionage: the rest we monitor
Motivation… Viruses and malware
Software solutions: Snort: 73.5Mb, ClamAV: 1.48Gb
Using TCAMs: Snort: 680Kb, ClamAV: 25Mb
Our solution (software): Snort: 51Kb, ClamAV: 216Kb
Motivation… Monitoring internet traffic
Streaming model
2 50 BPS
We can't store the whole input.
In our case we seek an algorithm that requires poly(log m) space.
Related work
–Karp-Rabin: randomized algorithm for exact pattern matching.
–Clifford, Porat, and Porat: a black-box algorithm for online approximate pattern matching. Almost any pattern matching algorithm can be converted to run online.
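Karp-Rabin Algorithm
Pattern: $p_0 p_1 p_2 p_3 \ldots p_{m-1}$
Text: $t_0 t_1 t_2 \ldots t_i t_{i+1} \ldots t_{i+m-1} t_{i+m} \ldots t_n$
Choose r at random.
Pattern signature: $p_0 r^{m-1} + p_1 r^{m-2} + p_2 r^{m-3} + \cdots + p_{m-1} \bmod q$
$S_i = t_i r^{m-1} + t_{i+1} r^{m-2} + \cdots + t_{i+m-1} \bmod q$
$S_{i+1} = t_{i+1} r^{m-1} + \cdots + t_{i+m-1} r + t_{i+m} \bmod q$
$S_{i+1} = S_i r + t_{i+m} - t_i r^m \bmod q$
Requires O(m) memory.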
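The rolling update on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the speaker's implementation: the modulus `Q`, the function name `karp_rabin`, and the use of `ord` to turn characters into numbers are assumptions made for the example. Note that the update reads `text[i]`, the character leaving the window, which is why the method needs O(m) memory in a streaming setting.

```python
import random

Q = (1 << 61) - 1  # assumed modulus (a Mersenne prime); the slide only says "mod q"

def karp_rabin(text, pattern, r=None, q=Q):
    """Return start positions whose window signature equals the pattern signature."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    if r is None:
        r = random.randrange(2, q)  # "choose r at random"
    r_m = pow(r, m, q)
    sig_p = s = 0
    for k in range(m):
        # Horner form of p_0 r^(m-1) + ... + p_(m-1), and the same for the first window
        sig_p = (sig_p * r + ord(pattern[k])) % q
        s = (s * r + ord(text[k])) % q
    hits = []
    for i in range(n - m + 1):
        if s == sig_p:  # equal signatures: a match with high probability
            hits.append(i)
        if i + m < n:
            # S_{i+1} = S_i * r + t_{i+m} - t_i * r^m  (mod q)
            s = (s * r + ord(text[i + m]) - ord(text[i]) * r_m) % q
    return hits
```

Calling `karp_rabin("abracadabra", "abra")` reports the two occurrences of the pattern; because only signatures are compared, a false positive is possible with small probability.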
The idea - Simple case
The pattern starts with z, and there are no more z's in the pattern.
Whenever we see z in T we start signing; m characters later we compare the signature with the pattern's signature.
Case 1
There is a prefix U such that U appears only once in the pattern, |U| ≤ m/2.
Seek U in recursion. Whenever U is found in T we start signing, and we compare signatures at the end of the window.
Case 2: No small U
Look at the first m/2 characters, W. They appear again somewhere in the pattern.
Option 1: P = v v v v v v v v v', where v' is a prefix of v.
Option 2: P = v v v v w, where w isn't a prefix of v and v isn't a prefix of w.
|v| ≤ m/2
Solving case 2
Option 2: P = v v v v w, |v| ≤ m/2
Search in recursion for v, and count how many times you found it.
Sign on w.
Solving case 2 - continue
Using O(log m) signatures and counters in the worst case.
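To make the decomposition P = v v … v w from Case 2 concrete, here is a small offline sketch. It is only one possible way to obtain v: it takes the shortest period of the first half of P (computed with the standard KMP prefix function) and then counts how often v repeats from the start. The function name `split_periodic` and this particular choice of v are assumptions for illustration, not something stated on the slides.

```python
def split_periodic(pattern):
    """Return (v, k, w) with pattern == v * k + w, where v is the shortest
    period of the first half of the pattern and k is maximal."""
    half = pattern[: max(1, len(pattern) // 2)]
    # KMP prefix function of the first half
    pi = [0] * len(half)
    for i in range(1, len(half)):
        j = pi[i - 1]
        while j > 0 and half[i] != half[j]:
            j = pi[j - 1]
        if half[i] == half[j]:
            j += 1
        pi[i] = j
    period = len(half) - pi[-1]      # shortest period of the first half
    v = pattern[:period]
    k = 0
    while pattern.startswith(v, k * period):
        k += 1                       # count consecutive copies of v from the start
    return v, k, pattern[k * period:]
```

For example, `split_periodic("abababc")` gives v = "ab" repeated 3 times followed by w = "c". In the streaming algorithm the same information is used online: search for v recursively, count consecutive occurrences, and sign on w.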
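Karp-Rabin Algorithm
Pattern: $p_0 p_1 p_2 p_3 \ldots p_{m-1}$
Text: $t_0 t_1 t_2 \ldots t_i t_{i+1} \ldots t_{i+m-1} t_{i+m} \ldots t_n$
Choose r at random.
Pattern signature: $p_0 r^{m-1} + p_1 r^{m-2} + p_2 r^{m-3} + \cdots + p_{m-1} \bmod q$
$S_i = t_i r^{m-1} + t_{i+1} r^{m-2} + \cdots + t_{i+m-1} \bmod q$
$S_{i+1} = t_{i+1} r^{m-1} + \cdots + t_{i+m-1} r + t_{i+m} \bmod q$
$S_{i+1} = S_i r + t_{i+m} - t_i r^m \bmod q$
Rothschild signature 07
Pattern: $p_0 p_1 p_2 p_3 \ldots p_{m-1}$
Text: $t_0 t_1 t_2 t_3 \ldots t_i$
Karp-Rabin: $p_0 r^{m-1} + p_1 r^{m-2} + p_2 r^{m-3} + \cdots + p_{m-1} \bmod q$
Rothschild: $p_0 + p_1 r + p_2 r^2 + \cdots + p_{m-1} r^{m-1} \bmod q$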
Forward signatures
There is a prefix U such that U appears only once in the pattern, |U| ≤ m/2. Seek U in recursion.
Calculate $X = S_i + \mathrm{Sig} \cdot r^{i+1}$.
Remember X for this position.
Later, check whether the running signature is equal to X.
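The identity behind X can be checked with a short sketch. It assumes the forward form $S_i = t_0 + t_1 r + \cdots + t_i r^i \bmod q$ and, to keep the example small, opens a candidate at every text position, so it keeps up to m pending values instead of the O(log m) of the full algorithm. The names `forward_sig` and `stream_match` are hypothetical.

```python
from collections import deque

def forward_sig(seq, r, q):
    """p_0 + p_1 r + p_2 r^2 + ... mod q."""
    s, power = 0, 1
    for c in seq:
        s = (s + c * power) % q
        power = (power * r) % q
    return s

def stream_match(text, pattern, r, q):
    """Report start positions of pattern in text, reading text once (pattern non-empty)."""
    m = len(pattern)
    sig_p = forward_sig(pattern, r, q)
    s, power = 0, 1          # running signature of the text read so far, and r^(chars read)
    pending = deque()        # (position where the candidate ends, expected signature X)
    matches = []
    for pos, c in enumerate(text):
        # Candidate starting at pos: X = S_{pos-1} + Sig(P) * r^pos
        pending.append((pos + m - 1, (s + sig_p * power) % q))
        s = (s + c * power) % q
        power = (power * r) % q
        if pending[0][0] == pos:
            _, x = pending.popleft()
            if x == s:       # running signature equals the remembered X
                matches.append(pos - m + 1)
    return matches
```

The only work per candidate is computing X once and comparing it once; the text characters inside the window are never stored. The space-efficient algorithm opens a candidate only when the recursive search reports the prefix U.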
Example: q=7, r=3
P: 0, 1, 1, 0, 1, 1, 1 (prefix: 0, 1, 1)
T: 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1
[Figure: signatures and powers of r maintained at Level 1, Level 2, and Level 3 while T is read]
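As a worked check of the example parameters q = 7 and r = 3, the signatures of the pattern 0, 1, 1, 0, 1, 1, 1 and of its prefix 0, 1, 1 can be computed by hand. The computation below assumes the forward form $\mathrm{Sig}(P) = \sum_k p_k r^k \bmod q$; the values are derived here and are not read off the slide.

```latex
\begin{align*}
3^k \bmod 7 \quad (k = 0, \dots, 6) &: \quad 1,\ 3,\ 2,\ 6,\ 4,\ 5,\ 1 \\
\mathrm{Sig}(0,1,1) &= 0 \cdot 1 + 1 \cdot 3 + 1 \cdot 2 = 5 \pmod 7 \\
\mathrm{Sig}(0,1,1,0,1,1,1) &= 3 + 2 + 4 + 5 + 1 = 15 \equiv 1 \pmod 7
\end{align*}
```

With such a small q, different windows collide often; the example is meant to show the mechanics, and a real run would use a much larger modulus.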
Worst case - time
Text: $t_0 t_1 t_2 t_3 \ldots t_i$, with remembered values $X_1, X_2, \ldots, X_{\log m}$
$X_1 = X_2 = \cdots = X_{\log m}$ ??? Check using a hash table.
We can work in a lazy approach without blowup in the memory.
Time: O(1) amortized, but what about the worst case?
Average / Random / Smooth case
P: prefix lengths $m$, $\log_{\Sigma} m$, $\log_{\Sigma} \log_{\Sigma} m$, ...
Total number of iterations is $O(\log^*_{\Sigma} m)$.
Worst case
P: prefix lengths $m$, $m/2$, $m/4$, ...
Total number of iterations is $O(\log m)$, giving $O(\log m \log \delta)$ space.
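The two growth rates can be compared with a small counting sketch. It lists the prefix lengths used at each recursion level: halving in the worst case, and repeated logarithms in the smooth case. Integer floor logarithms are used so the output is exact; the function names and the rounding are assumptions made for this illustration.

```python
def worst_case_levels(m):
    """Prefix lengths m, m/2, m/4, ... down to 1 (integer halving)."""
    levels = []
    while m >= 1:
        levels.append(m)
        m //= 2
    return levels

def ilog(x, base):
    """Floor of log_base(x) for integers x >= 1 and base >= 2."""
    k = 0
    while x >= base:
        x //= base
        k += 1
    return k

def smooth_case_levels(m, sigma):
    """Prefix lengths m, log_sigma m, log_sigma log_sigma m, ... (floored)."""
    levels = [m]
    while levels[-1] > 1:
        levels.append(ilog(levels[-1], sigma))
    return levels
```

For m = 65536 over a binary alphabet, the smooth case visits the lengths 65536, 16, 4, 2, 1, while halving needs 17 levels.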
Multi-Pattern search (dictionary matching)
Given a set of patterns $D = \{P_1, P_2, P_3, \ldots, P_d\}$
–The patterns can be of different lengths.
We want to report whenever one of the patterns appears.
Our algorithm requires $O(\sum_{i=1}^{d} \log |P_i|)$ memory and $O(\log d)$ time per text character.
Multi-Pattern search (dictionary matching)
Denote $M = \max_i |P_i|$.
Our algorithm has 2 cases:
–Case 1: d>M
–Case 2: d<M
Case 1: d>M
In this case we can allocate an array of size M+1.
Text: $t_0 t_1 t_2 t_3 \ldots t_{l-M} t_{l-M+1} \ldots t_l$
Window: $S_{l-M}, S_{l-M+1}, \ldots, S_l$
It is easy to maintain such a sliding window in O(1) time and O(M) memory.
Case 1: d>M - continue
For each $P_i$ in D ($P_i = a_0 a_1 a_2 \ldots a_{m_i-1}$):
    e = m_i
    while e != 0:
        find the largest j such that 2^j <= e
        e = e - 2^j
        if e != 0: HashTable(Sig(a_e a_{e+1} … a_{m_i-1}))
    HashTable(Sig(a_0 a_1 … a_{m_i-1}), match_i)
Example: $P_i = a_0 a_1 a_2 \ldots a_{38}$. We will store in the hash table:
    Sig(a_7 a_8 … a_38)
    Sig(a_3 a_4 … a_38)
    Sig(a_1 a_2 … a_38)
    Sig(a_0 a_1 … a_38), match_i
We will store at most $\log |P_i|$ points.
Case 1: d>M - continue
Stored lengths: $2^i$, $2^i + 2^j$, $2^i + 2^j + 2^l$, ...
At most $\log |P_i|$ levels.
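The construction of the hash table can be sketched as follows. The offsets e are obtained by repeatedly removing the largest power of two, which reproduces the slide's example (a pattern of length 39 yields the suffixes starting at a_7, a_3, a_1, and a_0). The signature function `sig`, the base `R`, the modulus `Q`, and the choice to store `None` for partial suffixes are assumptions made for this sketch.

```python
Q = (1 << 61) - 1   # assumed modulus
R = 256             # assumed fixed base for reproducibility; the slides choose r at random

def sig(s):
    """Polynomial signature of a string."""
    h = 0
    for c in s:
        h = (h * R + ord(c)) % Q
    return h

def checkpoint_offsets(m):
    """Start offsets e of the stored suffixes for a pattern of length m."""
    offsets = []
    e = m
    while e != 0:
        j = e.bit_length() - 1   # largest j with 2^j <= e
        e -= 1 << j
        offsets.append(e)
    return offsets

def build_table(patterns):
    """Map suffix signature -> pattern index for full patterns, None for partial suffixes."""
    table = {}
    for i, p in enumerate(patterns):
        for e in checkpoint_offsets(len(p)):
            if e != 0:
                table.setdefault(sig(p[e:]), None)   # keep an existing full-match entry
            else:
                table[sig(p)] = i                    # whole pattern: record match_i
    return table
```

`checkpoint_offsets(39)` returns `[7, 3, 1, 0]`, matching the four entries listed for the example pattern. The number of entries per pattern equals the number of 1 bits in its length.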
Case 1: d>M
In this case we can allocate an array of size M+1.
Text: $t_0 t_1 t_2 t_3 \ldots t_{l-M} t_{l-M+1} \ldots t_l$
Window: $S_{l-M}, S_{l-M+1}, \ldots, S_l$
Notice that it takes O(1) to calculate $\mathrm{Sig}(t_i t_{i+1} \ldots t_l)$.
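One way to get the O(1) computation of $\mathrm{Sig}(t_i \ldots t_l)$ is to keep the last M+1 prefix signatures in a ring buffer together with precomputed powers of r. The sketch below assumes Karp-Rabin style prefix signatures $H_k = H_{k-1} r + t_{k-1} \bmod q$; the class name `SlidingSignatures` and its methods are hypothetical.

```python
Q = (1 << 61) - 1  # assumed modulus

class SlidingSignatures:
    """Keeps the last M+1 prefix signatures so that the signature of any
    suffix of the window (length <= M) is available in O(1)."""

    def __init__(self, M, r, q=Q):
        self.M, self.r, self.q = M, r, q
        self.buf = [0] * (M + 1)      # ring buffer; slot count % (M+1) holds H_count
        self.count = 0                # number of characters read so far
        self.powers = [1] * (M + 1)   # powers[k] = r^k mod q
        for k in range(1, M + 1):
            self.powers[k] = self.powers[k - 1] * r % q

    def push(self, c):
        """Read one character (as an integer) in O(1)."""
        prev = self.buf[self.count % (self.M + 1)]
        self.count += 1
        self.buf[self.count % (self.M + 1)] = (prev * self.r + c) % self.q

    def suffix_sig(self, length):
        """Signature of the last `length` characters read."""
        if not 0 <= length <= min(self.M, self.count):
            raise ValueError("length outside the window")
        end = self.buf[self.count % (self.M + 1)]
        start = self.buf[(self.count - length) % (self.M + 1)]
        # H_end - H_start * r^length isolates the last `length` characters
        return (end - start * self.powers[length]) % self.q
```

After pushing the characters of a text, `suffix_sig(k)` equals the signature one would get by hashing the last k characters directly, which is what the hash-table lookups need.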
Case 1: d>M - continue
We do a binary search over the sliding window $S_{l-M}, S_{l-M+1}, \ldots, S_l$:
–Start at $l - 2^j$: is it in the HashTable? No.
–Start at $l - 2^{j-1}$: is it in the HashTable? Yes.
–Start at $l - 2^{j-1} - 2^{j-2}$: is it in the HashTable? ...
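The search over the window can be sketched end to end as below. Two things in this sketch go beyond what the slide states and should be read as assumptions: every table entry stores the longest pattern that is a suffix of the stored string (so a pattern is still reported when the search follows a longer, partially matching pattern), and the function reports only the longest pattern ending at each position. For brevity the sketch keeps all prefix signatures in a list; a real implementation would keep only the last M+1. The names `sig`, `build_table`, and `find_longest_matches` are hypothetical.

```python
Q = (1 << 61) - 1   # assumed modulus
R = 256             # assumed fixed base; the slides choose r at random

def sig(s):
    h = 0
    for c in s:
        h = (h * R + ord(c)) % Q
    return h

def build_table(patterns):
    """Signature of each stored suffix -> index of the longest pattern that is
    a suffix of it (None if there is no such pattern). Naive preprocessing."""
    table = {}
    for p in patterns:
        e = len(p)
        while e != 0:
            e -= 1 << (e.bit_length() - 1)   # drop the largest power of two
            suffix = p[e:]
            best = None
            for i, cand in enumerate(patterns):
                if suffix.endswith(cand) and (
                    best is None or len(cand) > len(patterns[best])
                ):
                    best = i
            table[sig(suffix)] = best
    return table

def find_longest_matches(text, patterns):
    """Return (end position, pattern index) for the longest pattern ending there."""
    M = max(len(p) for p in patterns)
    table = build_table(patterns)
    powers = [1]
    for _ in range(M):
        powers.append(powers[-1] * R % Q)
    prefix = [0]                              # prefix signatures of the text
    out = []
    for l, c in enumerate(text):
        prefix.append((prefix[-1] * R + ord(c)) % Q)
        length, best = 0, None
        for j in range(M.bit_length() - 1, -1, -1):
            trial = length + (1 << j)         # try to extend the suffix by 2^j
            if trial > min(M, l + 1):
                continue
            s = (prefix[-1] - prefix[-1 - trial] * powers[trial]) % Q
            if s in table:                    # "Is it in the HashTable? Yes"
                length = trial
                if table[s] is not None:
                    best = table[s]
        if best is not None:
            out.append((l, best))
    return out
```

For instance, `find_longest_matches("zabcxbcd", ["abc", "bc", "xbcd"])` reports pattern 0 ending at position 3, pattern 1 at position 6, and pattern 2 at position 7. Each text character costs one lookup per bit of M.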
Case 2: d<M
In this case we split our dictionary D into 2 dictionaries:
–$D_1$ – all the patterns shorter than d. On this dictionary we run case 1.
–$D_2$ – all the patterns longer than d. We need only to deal with this case.
Case 2: d<M - continue
For each $P_i$ in $D_2$: $P_i = a_0 a_1 a_2 \ldots a_{d-1} a_d \ldots a_m$
$SP_i = \mathrm{Sig}(a_0 a_1 \ldots a_{d-1})$
Store $SP_i$ in the hash table.
Case 2: d<M - continue
If $P_i$ contains a periodic prefix of length more than d: $P_i = u\,u\,u\,u\,u\,u\,v \ldots a_m$
We store as well the number of times we need to see $SP_i$. (Figure note: w.h.p. won't be $SP_i$.)
We start a process that seeks $P_i$ only after seeing enough $SP_i$.
Therefore the minimum number of characters we have to see between 2 processes of $P_i$ is at least d.
Case 2: d<M - continue
We run the algorithm from the beginning of the lecture.
Amortized, it takes O(1/d) per pattern per text character.
Overall it takes O(1) amortized time per text character.
By a lazy approach we get O(1) time in the worst case.
Open problems
Multi-pattern search: case 2 takes O(1) time, however case 1 takes O(log d).
–Improve case 1 to be O(1).
–With a heuristic, almost all of the dictionary takes O(1) time and O(1) space per pattern.
Lower bound
–We believe that the single pattern search lower bound is $\Omega(\log m \log \delta)$.
Find more clients
Find a place for sabbatical (~1/1/ /9/2013)
Important things:
Upcoming events:
–ICALP2011GT (July 3rd, one day before ICALP). We will have some support for students.
–Workshop on Sparsity and Computation, U. Mich., Aug 1--4. We will have some support for students.
–IMA: Group Testing Designs, Algorithms, and Applications to Biology, Feb
–Stringology 2012
Find a place for sabbatical (~1/1/ /9/2013)