Pattern Matching in the streaming model Ely Porat Google inc & Bar-Ilan University.


1 Pattern Matching in the streaming model Ely Porat Google inc & Bar-Ilan University

2 Problem definition - Pattern Matching. Given a text T of length n and a pattern P of length m, the problem is to find all substrings of T that are equal to P.

3 Problem definition - Online Pattern Matching. We get the text T character by character.

4 Motivation… Stock market

5

6 Motivation… Espionage. The rest we monitor.

7 Motivation… Viruses and malware
Software solutions: Snort: 73.5Mb, ClamAV: 1.48Gb
Using TCAMs: Snort: 680Kb, ClamAV: 25Mb
Our solution (software): Snort: 51Kb, ClamAV: 216Kb

8 Motivation… Monitoring internet traffic

9 Streaming model ($2^{50}$ BPS). We can't store the whole input. In our case we seek algorithms that require poly(log m) space.

10 Related work
Karp-Rabin: Randomized algorithm for exact pattern matching
Clifford, Porat, and Porat: A black box algorithm for online approximate pattern matching
– Almost any pattern matching algorithm can be converted to run online.

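11 Karp-Rabin Algorithm
Pattern: $p_0 p_1 p_2 p_3 \ldots p_{m-1}$
Text: $t_0 t_1 t_2 \ldots t_i t_{i+1} \ldots t_{i+m-1} t_{i+m} \ldots t_n$
Choose r at random.
Pattern signature: $p_0 r^{m-1} + p_1 r^{m-2} + p_2 r^{m-3} + \cdots + p_{m-1} \bmod q$
$S_i = t_i r^{m-1} + t_{i+1} r^{m-2} + \cdots + t_{i+m-1} \bmod q$
$S_{i+1} = t_{i+1} r^{m-1} + \cdots + t_{i+m-1} r + t_{i+m} \bmod q$
$S_{i+1} = S_i r + t_{i+m} - t_i r^{m} \bmod q$
Requires O(m) memory.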
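
The rolling update on this slide can be sketched in Python as follows. This is a minimal illustration, not the speaker's code: the function name `karp_rabin`, the default modulus, and the explicit string comparison that rules out false positives are all assumptions added here. Note that the update needs $t_i$ again when it leaves the window, which is why the window of m characters has to be kept.

```python
import random

def karp_rabin(text, pattern, q=(1 << 61) - 1, r=None):
    """Return the start indices of pattern in text using a rolling signature."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    if r is None:
        r = random.randrange(2, q - 1)   # random base r, as on the slide
    r_m = pow(r, m, q)                   # r^m mod q, used to drop t_i

    def sig(s):
        # Horner's rule: s_0 r^(m-1) + s_1 r^(m-2) + ... + s_(m-1) mod q
        h = 0
        for c in s:
            h = (h * r + ord(c)) % q
        return h

    target = sig(pattern)
    s_i = sig(text[:m])
    hits = []
    for i in range(n - m + 1):
        # The direct comparison is an added safeguard (assumption), so the
        # result is exact rather than correct with high probability.
        if s_i == target and text[i:i + m] == pattern:
            hits.append(i)
        if i + m < n:
            # S_{i+1} = S_i * r + t_{i+m} - t_i * r^m  (mod q)
            s_i = (s_i * r + ord(text[i + m]) - ord(text[i]) * r_m) % q
    return hits
```

For example, `karp_rabin("abracadabra", "abra")` reports the occurrences at positions 0 and 7.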

12 The idea - Simple case. The pattern starts with z, and there are no more z's in the pattern.
[Figure: P with its leading z; in the text T, each z is marked "Start signing", followed by "Signature".]

13 Case 1. There is a prefix U, with $|U| \le m/2$, such that U appears only once in the pattern. Seek U recursively.
[Figure: P with prefix U; in the text T, each U is marked "Start signing", followed by "Signature".]

14 Case 2: No small U. Look at the first m/2 characters of P, called W. They appear again somewhere in the pattern.
Option 1: P = v v v v v v v v followed by a prefix of v
Option 2: P = v v v v w, where w isn't a prefix of v and v isn't a prefix of w
$|v| \le m/2$

15 Solving case 2, Option 2: P = v v v v w, $|v| \le m/2$
Search recursively for v, and count how many times you found it.
Sign on w.
[Figure: text T with repeated occurrences of v, marked "Start signing" and "Signature".]

16 Solving case 2 - continued, Option 2: P = v v v v w, $|v| \le m/2$
Search recursively for v, and count how many times you found it.
Sign on w.
Using O(log m) signatures and counters in the worst case.
[Figure: text T with repeated occurrences of v, segments labeled ">m/2" and "<m/2", marked "Start signing" and "Signature".]

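17 Karp-Rabin Algorithm
Pattern: $p_0 p_1 p_2 p_3 \ldots p_{m-1}$
Text: $t_0 t_1 t_2 \ldots t_i t_{i+1} \ldots t_{i+m-1} t_{i+m} \ldots t_n$
Choose r at random.
Pattern signature: $p_0 r^{m-1} + p_1 r^{m-2} + p_2 r^{m-3} + \cdots + p_{m-1} \bmod q$
$S_i = t_i r^{m-1} + t_{i+1} r^{m-2} + \cdots + t_{i+m-1} \bmod q$
$S_{i+1} = t_{i+1} r^{m-1} + \cdots + t_{i+m-1} r + t_{i+m} \bmod q$
$S_{i+1} = S_i r + t_{i+m} - t_i r^{m} \bmod q$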

18 Rothschild signature 07
Pattern: $p_0 p_1 p_2 p_3 \ldots p_{m-1}$
Text: $t_0 t_1 t_2 t_3 \ldots t_i$
$p_0 r^{m-1} + p_1 r^{m-2} + p_2 r^{m-3} + \cdots + p_{m-1} \bmod q$ (as in Karp-Rabin)
$p_0 + p_1 r + p_2 r^2 + \cdots + p_{m-1} r^{m-1} \bmod q$

19 Forward signatures. There is a prefix U, with $|U| \le m/2$, such that U appears only once in the pattern. Seek U recursively.
Calculate $X = S_i + \mathrm{Sig} \cdot r^{i+1}$
Remember X for this position.
Check if the signature is equal to X.
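
One way to read the formula $X = S_i + \mathrm{Sig} \cdot r^{i+1}$ is sketched below in Python, under stated assumptions: $S_i$ is taken to be the forward signature of $t_0 \ldots t_i$, Sig the forward signature of the segment expected to follow, and X the value the running signature must reach if that segment really does follow. The names `forward_sig` and `expected_after`, and the constants `Q` and `R`, are illustrative choices and not from the slides, which pick r at random.

```python
Q = (1 << 61) - 1   # assumed prime modulus
R = 1_000_003       # assumed fixed base; the slides choose r at random

def forward_sig(s, r=R, q=Q):
    """Forward signature s_0 + s_1 r + s_2 r^2 + ... mod q."""
    h, power = 0, 1
    for c in s:
        h = (h + ord(c) * power) % q
        power = (power * r) % q
    return h

def expected_after(s_i, sig_u, i, r=R, q=Q):
    """X = S_i + Sig * r^(i+1) mod q: the prefix signature we expect to see
    once the segment with signature sig_u has been read after position i."""
    return (s_i + sig_u * pow(r, i + 1, q)) % q

text = "xxabcabd"
s_1 = forward_sig(text[:2])                      # signature of t_0..t_1
x = expected_after(s_1, forward_sig("abc"), 1)   # remembered for this position
print(x == forward_sig(text[:5]))                # "abc" does follow position 1
```

The point of the forward form is that the check only needs the running prefix signature and the remembered X, not the characters in between.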

20 Example
P: 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1
Prefixes of P shown beneath it: 0, 1, 1, 0, 1, 1, 1 and 0, 1, 1
T: 0 10110011011100000110110100110110110110110110111
[Figure: rows labeled Level 1, Level 2 and Level 3.]

21 Example: q=7, r=3
P: 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1
Prefixes of P shown beneath it: 0, 1, 1, 0, 1, 1, 1 and 0, 1, 1
T: 0 10110011011100000110110100110110110110110110111
[Figure: per-position values of $r^i \bmod q$ and the running signatures for the rows labeled Level 1, Level 2 and Level 3.]

22 Worst case - time
Text: $t_0 t_1 t_2 t_3 \ldots t_i$, with remembered values $X_1, X_2, \ldots, X_{\log m}$
Check using a hash table. What if $X_1 = X_2 = \cdots = X_{\log m}$?
We can work with a lazy approach without a blowup in memory.
Time: O(1). Amortized O(1), but what about the worst case?

23 Average / Random / Smooth case
P: lengths $m$, $\log_{\Sigma} m$, $\log_{\Sigma} \log_{\Sigma} m$, …
Total number of iterations is $O(\log^{*}_{\Sigma} m)$

24 Worst case
P: lengths $m$, $m/2$, $m/4$, …
Total number of iterations is $O(\log m)$, which gives $O(\log m \log \delta)$ space.

25 Multi-Pattern search (dictionary matching)
Given a set of patterns $D = \{P_1, P_2, P_3, \ldots, P_d\}$
– The patterns can be of different lengths
We want to report whenever one of the patterns appears.
Our algorithm requires $O(\sum_{i=1}^{d} \log |P_i|)$ memory and $O(\log d)$ time per text character.

26 Multi-Pattern search (dictionary matching)
Denote $M = \max_i |P_i|$
Our algorithm has 2 cases:
– Case 1: d>M
– Case 2: d<M

27 Case 1: d>M
In this case we can allocate an array of size M+1.
Text: $t_0 t_1 t_2 t_3 \ldots t_{l-M} t_{l-M+1} \ldots t_l$
Window: $S_{l-M}, S_{l-M+1}, \ldots, S_l$
It is easy to maintain such a sliding window in O(1) time and O(M) memory.

28 Case 1: d>M - continued
For each $P_i$ in D ($P_i = a_0 a_1 a_2 \ldots a_{m_i-1}$):
    e = $m_i$
    while e != 0:
        find the largest j such that $2^j \le e$
        e = e - $2^j$
        if e != 0: insert Sig($a_e a_{e+1} \ldots a_{m_i-1}$) into HashTable
    insert (Sig($a_0 a_1 \ldots a_{m_i-1}$), match$_i$) into HashTable
Example: $P_i = a_0 a_1 a_2 \ldots a_{38}$. We will store in the hash table:
Sig($a_7 a_8 \ldots a_{38}$)
Sig($a_3 a_4 \ldots a_{38}$)
Sig($a_1 a_2 \ldots a_{38}$)
Sig($a_0 a_1 \ldots a_{38}$), match$_i$
We will store at most $\log |P_i|$ points.
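
The insertion loop on this slide can be sketched in Python as follows. It assumes, based on the $a_0 \ldots a_{38}$ example, that j is the largest power of two not exceeding e. The helper names `suffix_offsets` and `build_table`, and passing the signature function in as a parameter `sig`, are illustrative choices rather than anything given in the slides.

```python
def suffix_offsets(length):
    """Start offsets e of the suffixes stored for a pattern of this length.

    Repeatedly strips the highest power of two from e, so the stored
    suffixes have lengths 2^i, 2^i + 2^j, 2^i + 2^j + 2^l, ...
    The final offset 0 is the whole pattern.
    """
    offsets = []
    e = length
    while e != 0:
        j = e.bit_length() - 1   # largest j with 2^j <= e
        e -= 1 << j
        offsets.append(e)
    return offsets

def build_table(patterns, sig):
    """Map suffix signatures to None, and full-pattern signatures to the
    pattern's index (standing in for match_i)."""
    table = {}
    for idx, p in enumerate(patterns):
        for e in suffix_offsets(len(p)):
            key = sig(p[e:])
            if e == 0:
                table[key] = idx           # full pattern: record the match
            else:
                table.setdefault(key, None)
    return table

print(suffix_offsets(39))   # the slide's example: a_0 ... a_38
```

The number of stored entries per pattern equals the number of 1 bits in its length, which is where the bound of at most $\log |P_i|$ points comes from.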

29 Case 1: d>M - continued
[Figure: points at $2^i$, $2^i + 2^j$, $2^i + 2^j + 2^l$.]
At most $\log |P_i|$ levels.

30 Case 1: d>M
In this case we can allocate an array of size M+1.
Text: $t_0 t_1 t_2 t_3 \ldots t_{l-M} t_{l-M+1} \ldots t_l$
Window: $S_{l-M}, S_{l-M+1}, \ldots, S_l$
Notice that it takes O(1) to calculate Sig($t_i t_{i+1} \ldots t_l$).
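
A possible way to obtain Sig($t_i t_{i+1} \ldots t_l$) from two stored prefix signatures is sketched below in Python. It assumes forward signatures, a prime modulus so that r has a modular inverse, and it keeps all prefix signatures rather than only the last M+1; none of these details are stated on the slide. The names `prefix_sigs` and `substring_sig` are hypothetical.

```python
Q = (1 << 61) - 1           # assumed prime modulus
R = 1_000_003               # assumed fixed base; the slides choose r at random
R_INV = pow(R, Q - 2, Q)    # modular inverse of R (Fermat, valid since Q is prime)

def prefix_sigs(text):
    """sigs[k] = t_0 + t_1 r + ... + t_(k-1) r^(k-1) mod Q, with sigs[0] = 0."""
    sigs, power = [0], 1
    for c in text:
        sigs.append((sigs[-1] + ord(c) * power) % Q)
        power = (power * R) % Q
    return sigs

def substring_sig(sigs, i, l):
    """Signature of t_i..t_l (inclusive), shifted so it starts at power 0.

    pow() costs O(log i) here; a streaming version would maintain the
    needed inverse powers incrementally to get O(1) per query.
    """
    return ((sigs[l + 1] - sigs[i]) * pow(R_INV, i, Q)) % Q

sigs = prefix_sigs("abcabc")
print(substring_sig(sigs, 0, 2) == substring_sig(sigs, 3, 5))   # both are "abc"
```

Because the result is normalised to start at power 0, it can be looked up directly against pattern-suffix signatures computed with the same r and q.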

31 Case 1: d>M - continued
We will do a binary search over the sliding window $S_{l-M}, S_{l-M+1}, \ldots, S_l$:
$l - 2^j$: is it in the HashTable? No.
$l - 2^{j-1}$: is it in the HashTable? Yes.
$l - 2^{j-1} - 2^{j-2}$: is it in the HashTable?

32 Case 2: d<M
In this case we will split our dictionary D into 2 dictionaries:
– $D_1$: all the patterns shorter than d. On this dictionary we will run case 1.
– $D_2$: all the patterns longer than d. We only need to deal with this case.

33 Case 2: d<M - continued
For each $P_i$ in $D_2$, with $P_i = a_0 a_1 a_2 \ldots a_{d-1} a_d \ldots a_m$:
$SP_i$ = Sig($a_0 a_1 \ldots a_{d-1}$)
Store $SP_i$ in the hash table.

34 Case 2: d<M - continued
If $P_i$ contains a periodic prefix of length more than d:
$P_i$ = u u u u u u v … $a_m$
We also store the number of times we need to see $SP_i$.
[Figure label on the part after the repeated u: "w.h.p. won't be $SP_i$".]
We will start a process that seeks $P_i$ only after seeing enough $SP_i$. Therefore the minimum number of characters we have to see between two processes of $P_i$ is at least d.

35 Case 2: d<M - continued
We run the algorithm from the beginning of the lecture.
Amortized, it takes O(1/d) time per pattern per text character.
Overall it takes O(1) amortized time per text character.
With the lazy approach we get O(1) time in the worst case.

36 Open problems
Multi-pattern search: case 2 takes O(1) time, however case 1 takes O(log d)
– Improve case 1 to be O(1)
– With a heuristic, almost all of the dictionary takes O(1) time and O(1) space per pattern.
Lower bound
– We believe that the single pattern search lower bound is $\Omega(\log m \log \delta)$
Find more clients
Find a place for sabbatical (~1/1/2012-30/9/2013)

37 Important things: upcoming events
– ICALP2011GT (July 3rd, one day before ICALP). We will have some support for students.
– Workshop on Sparsity and Computation, U. Mich., Aug 1-4. We will have some support for students.
– IMA: Group Testing Designs, Algorithms, and Applications to Biology, Feb 13-17
– Stringology 2012
Find a place for sabbatical (~1/1/2012-30/9/2013)

