Pattern Matching in the streaming model Ely Porat Google inc & Bar-Ilan University.

Presentation transcript:

Pattern Matching in the streaming model Ely Porat Google inc & Bar-Ilan University

Problem definition - Pattern Matching: Given a text T of length n and a pattern P of length m, the problem is to find all the substrings of T that are equal to P.

Problem definition - Online Pattern Matching: We get the text T character by character.

Motivation… Stock market

Motivation.. Espionage The rest we monitor

Motivation… Viruses and malware
–Software solutions: Snort: 73.5Mb, ClamAV: 1.48Gb
–Using TCAMs: Snort: 680Kb, ClamAV: 25Mb
–Our solution (software): Snort: 51Kb, ClamAV: 216Kb

Motivation… Monitoring internet traffic

Streaming model: $2^{50}$ BPS. We can't store the whole input. In our case we seek an algorithm which requires poly(log m) space.

Related work
–Karp-Rabin: Randomized algorithm for exact pattern matching
–Clifford, Porat, and Porat: A black box algorithm for online approximate pattern matching. Almost any pattern matching algorithm can be converted to run online.

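Karp-Rabin Algorithm
Pattern: $p_0 p_1 p_2 p_3 \ldots p_{m-1}$
Text: $t_0 t_1 t_2 \ldots t_i t_{i+1} \ldots t_{i+m-1} t_{i+m} \ldots t_n$
Pattern signature: $p_0 r^{m-1} + p_1 r^{m-2} + p_2 r^{m-3} + \cdots + p_{m-1} \bmod q$
$S_i = t_i r^{m-1} + t_{i+1} r^{m-2} + \cdots + t_{i+m-1} \bmod q$
$S_{i+1} = t_{i+1} r^{m-1} + \cdots + t_{i+m-1} r + t_{i+m} \bmod q$
$S_{i+1} = S_i r + t_{i+m} - t_i r^m \bmod q$
Choosing r randomly. Requires O(m) memory.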
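The rolling update on this slide can be sketched in Python as follows. This is a minimal illustration, not the speaker's code: the function name `karp_rabin`, the choice of the prime `q = 2^61 - 1`, and the use of `ord` to turn characters into numbers are all assumptions made for the example. Note that it reads `text[i]` when sliding the window, which is exactly why the slide says O(m) memory is required.

```python
import random

def karp_rabin(text, pattern, q=(1 << 61) - 1, r=None):
    """Return every index i where text[i:i+m] has the pattern's signature.

    True matches are always reported; a false match is possible only if two
    different strings collide modulo q (unlikely for a random r).
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    if r is None:
        r = random.randrange(2, q - 1)  # "choosing r randomly"
    r_m = pow(r, m, q)                  # r^m mod q, removes the oldest character

    def sig(s):
        # Horner evaluation of s_0 r^(k-1) + s_1 r^(k-2) + ... + s_(k-1) mod q
        h = 0
        for c in s:
            h = (h * r + ord(c)) % q
        return h

    target = sig(pattern)
    s_i = sig(text[:m])
    matches = []
    for i in range(n - m + 1):
        if s_i == target:
            matches.append(i)
        if i + m < n:
            # S_{i+1} = S_i * r + t_{i+m} - t_i * r^m  (mod q)
            s_i = (s_i * r + ord(text[i + m]) - ord(text[i]) * r_m) % q
    return matches
```

For instance, `karp_rabin("abracadabra", "abra")` should report the two occurrences starting at positions 0 and 7.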

The idea - Simple case: The pattern P starts with z, and there are no more z's in the pattern. Each time we see z in the text T we start signing, and when m characters have been signed we compare the signature with the signature of the pattern.
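A minimal sketch of the simple case, under stated assumptions: the name `stream_match_simple`, the modulus `q`, and the restart-on-every-z policy are my reading of the slide rather than something it spells out. Because z occurs only at the start of the pattern, a new z kills any signature in progress, so a single running signature and a length counter are enough.

```python
import random

def stream_match_simple(stream, pattern, q=(1 << 61) - 1, r=None):
    """Yield start positions of matches in a character stream.

    Assumes the first character of the pattern occurs nowhere else in it.
    Keeps only one signature and one counter, never the text itself.
    """
    m = len(pattern)
    z = pattern[0]
    if z in pattern[1:]:
        raise ValueError("simple case requires a unique first character")
    if r is None:
        r = random.randrange(2, q - 1)

    target = 0
    for c in pattern:
        target = (target * r + ord(c)) % q

    sig, length = 0, 0                 # length == 0 means "not signing"
    for pos, c in enumerate(stream):
        if c == z:
            sig, length = ord(c) % q, 1            # (re)start signing at z
        elif length:
            sig, length = (sig * r + ord(c)) % q, length + 1
        if length == m:
            if sig == target:
                yield pos - m + 1
            length = 0                 # next candidate must begin at a new z
```

The stream can be any iterable of characters, so the text never has to be held in memory.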

Case 1: There is a prefix U of P, with |U| <= m/2, s.t. U appears only once in the pattern. Seek U in recursion; each time U is found in the text T, start signing.

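Case 2: No small U. Look at the first m/2 characters of P (call them W). They appear again somewhere in the pattern, so the pattern starts with repetitions of a string v, |v| <= m/2.
–Option 1: P = v v v v v v v v followed by a prefix of v.
–Option 2: P = v v v v w, where w isn't a prefix of v and v isn't a prefix of w.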

Solving case 2, Option 2: P = v v v v w, |v| <= m/2. Search in recursion for v, and count how many times you found it. Then sign on w: start signing after each v found in T.

Solving case 2 - continue: The figure shows runs of v in T, some longer than m/2 and some shorter than m/2. Using O(log m) signatures and counters in the worst case.

Karp-Rabin Algorithm (recap)
Pattern signature: $p_0 r^{m-1} + p_1 r^{m-2} + p_2 r^{m-3} + \cdots + p_{m-1} \bmod q$
$S_i = t_i r^{m-1} + t_{i+1} r^{m-2} + \cdots + t_{i+m-1} \bmod q$
$S_{i+1} = t_{i+1} r^{m-1} + \cdots + t_{i+m-1} r + t_{i+m} \bmod q$
$S_{i+1} = S_i r + t_{i+m} - t_i r^m \bmod q$
Choosing r randomly.

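Rothschild signature 07
Pattern: $p_0 p_1 p_2 p_3 \ldots p_{m-1}$; text: $t_0 t_1 t_2 t_3 \ldots t_i$
Karp-Rabin: $p_0 r^{m-1} + p_1 r^{m-2} + p_2 r^{m-3} + \cdots + p_{m-1} \bmod q$
Rothschild: $p_0 + p_1 r + p_2 r^2 + \cdots + p_{m-1} r^{m-1} \bmod q$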

Forward signatures: There is a prefix U of P, |U| <= m/2, s.t. U appears only once in the pattern. Seek U in recursion. When U is found ending at text position i, calculate $X = S_i + Sig \cdot r^{i+1}$ and remember X for this position. Later, check if the signature of the text is equal to X.
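The identity behind X can be checked with a small Python sketch. The helper names `forward_sig` and `expected_after` are hypothetical, and I am assuming that $S_i$ is the forward signature of $t_0 \ldots t_i$ and that Sig is the forward signature of the part of the pattern that follows U; the slide does not state this explicitly.

```python
def forward_sig(s, r, q):
    """Forward signature s_0 + s_1 r + s_2 r^2 + ... mod q."""
    h, power = 0, 1
    for c in s:
        h = (h + ord(c) * power) % q
        power = (power * r) % q
    return h

def expected_after(s_i, i, rest_sig, r, q):
    """X = S_i + Sig * r^(i+1) mod q: the text signature we expect to see
    once the remaining part of the pattern has been read after position i."""
    return (s_i + rest_sig * pow(r, i + 1, q)) % q

# Illustrative values only (q is the Mersenne prime 2^61 - 1, r is arbitrary).
Q, R = (1 << 61) - 1, 12345
text = "xxabcdyy"          # U = "ab" ends at i = 3, the rest of P is "cd"
X = expected_after(forward_sig(text[:4], R, Q), 3, forward_sig("cd", R, Q), R, Q)
match = (X == forward_sig(text[:6], R, Q))   # text signature after reading "cd"
```

The point of the forward form is that the running text signature only ever grows by appending terms, so a remembered X can be compared against it later without storing the text in between.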

Example: q=7, r=3
T: 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1
P: 0, 1, 1, 0, 1, 1, 1
Prefixes of P used at Level 1, Level 2 and Level 3: 0; 0, 1, 1; 0, 1, 1, 0, 1, 1, 1
The figure also lists the powers $r^i$ and the signatures at each level.
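The numbers for this example can be recomputed with a few lines of Python. This is a sketch under assumptions: I take the signature to be the forward form $p_0 + p_1 r + \cdots \bmod q$ and the three level strings to be the prefixes of P of lengths 1, 3 and 7; the transcript does not preserve the slide's own table, so these values are not a reproduction of it.

```python
Q, R = 7, 3
P = [0, 1, 1, 0, 1, 1, 1]

def forward_sig_bits(bits, r=R, q=Q):
    """Forward signature of a list of 0/1 symbols."""
    return sum(b * pow(r, k, q) for k, b in enumerate(bits)) % q

powers = [pow(R, k, Q) for k in range(len(P))]         # r^0 .. r^6 mod 7
level_sigs = [forward_sig_bits(P[:k]) for k in (1, 3, 7)]

print(powers)       # [1, 3, 2, 6, 4, 5, 1]
print(level_sigs)   # [0, 5, 1]
```

With a modulus as small as 7, different strings collide often, which is why a real run would pick a much larger q.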

Worst case - time: At text position $t_i$ there may be up to log m remembered values $X_1, X_2, \ldots, X_{\log m}$ to check. Check using a hash table. What if $X_1 = X_2 = \cdots = X_{\log m}$? We can work in a lazy approach without a blowup in the memory. Time: amortized O(1), but what about the worst case?

Average / Random / Smooth case: The prefixes of P have lengths $m$, $\log_\Sigma m$, $\log_\Sigma \log_\Sigma m$, … The total number of iterations is $O(\log^*_\Sigma m)$.

Worst case: The prefixes of P have lengths m, m/2, m/4, … The total number of iterations is O(log m), which gives O(log m log δ) space.
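To get a feel for the two iteration counts, the level lengths can be listed directly. The sketch below is only an illustration of the sequences m, m/2, m/4, … and m, log_Σ m, log_Σ log_Σ m, …; the function names and the rounding-up convention in `ceil_log` are assumptions of mine.

```python
def ceil_log(m, sigma):
    """Smallest k with sigma^k >= m, computed with integers only."""
    k, p = 0, 1
    while p < m:
        p *= sigma
        k += 1
    return k

def halving_levels(m):
    """Worst case: prefix lengths m, m/2, m/4, ... down to 1."""
    lengths = []
    while m >= 1:
        lengths.append(m)
        m //= 2
    return lengths

def iterated_log_levels(m, sigma):
    """Random case: prefix lengths m, log_sigma m, log_sigma log_sigma m, ..."""
    lengths = []
    while m > 1:
        lengths.append(m)
        m = ceil_log(m, sigma)
    lengths.append(1)
    return lengths

print(halving_levels(64))             # [64, 32, 16, 8, 4, 2, 1]
print(iterated_log_levels(65536, 2))  # [65536, 16, 4, 2, 1]
```

The first list grows like log m, the second like log* m, matching the two bounds on the slides.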

Multi-Pattern search (dictionary matching): Given a set of patterns $D=\{P_1, P_2, P_3, \ldots, P_d\}$
–The patterns can be of different lengths.
We want to report whenever one of the patterns appears. Our algorithm will require $O(\sum_{i=1}^{d} \log|P_i|)$ memory, and O(log d) time per text character.

Multi-Pattern search (dictionary matching): Denote $M=\max_i |P_i|$. Our algorithm will have 2 cases:
–Case 1: d>M
–Case 2: d<M

Case 1: d>M. In this case we can allocate an array of size M+1. For the text $t_0 t_1 t_2 t_3 \ldots t_{l-M} t_{l-M+1} \ldots t_l$ we keep the signatures $S_{l-M}, S_{l-M+1}, \ldots, S_l$. It is easy to maintain such a sliding window in O(1) time and O(M) memory.

Case 1: d>M - continue
For each $P_i$ in D ($P_i = a_0 a_1 a_2 \ldots a_{m_i-1}$):
  e = $m_i$
  while e != 0:
    find the largest j s.t. $2^j \le e$
    e = e - $2^j$
    if e != 0: HashTable(Sig($a_e a_{e+1} \ldots a_{m_i-1}$))
  HashTable(Sig($a_0 a_1 \ldots a_{m_i-1}$), match_i)
Example: $P_i = a_0 a_1 a_2 \ldots a_{38}$. We will store in the hash table: Sig($a_7 a_8 \ldots a_{38}$), Sig($a_3 a_4 \ldots a_{38}$), Sig($a_1 a_2 \ldots a_{38}$), and Sig($a_0 a_1 \ldots a_{38}$) with match_i. We will store at most $\log|P_i|$ points.
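The preprocessing loop above can be sketched in Python. The names `checkpoint_starts` and `build_table` are hypothetical, and I read "find j s.t. 2^j = e" as "the largest j with 2^j <= e", which is the reading that reproduces the a_7, a_3, a_1, a_0 example. The signature function is passed in as a parameter so that the sketch does not commit to a particular hash.

```python
def checkpoint_starts(m):
    """Start indices e of the suffixes a_e .. a_{m-1} stored for a pattern
    of length m. One index per set bit of m, so at most about log2(m) + 1."""
    starts = []
    e = m
    while e != 0:
        j = e.bit_length() - 1      # largest j with 2^j <= e
        e -= 1 << j
        starts.append(e)
    return starts

def build_table(patterns, sig):
    """Map signature -> pattern index for full patterns, or None for the
    intermediate suffixes that only guide the search."""
    table = {}
    for idx, p in enumerate(patterns):
        for e in checkpoint_starts(len(p)):
            key = sig(p[e:])
            if e == 0:
                table[key] = idx             # Sig(a_0 .. a_{m-1}), match_i
            else:
                table.setdefault(key, None)  # intermediate checkpoint
    return table

print(checkpoint_starts(39))   # [7, 3, 1, 0]
```

Passing `sig=lambda s: s` stores the suffixes themselves, which is handy for inspecting the table; in the streaming setting `sig` would be a short fingerprint.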

Case 1: d>M - continue: The stored lengths are $2^i$, $2^i+2^j$, $2^i+2^j+2^l$, … At most $\log|P_i|$ levels.

Case 1: d>M. In this case we can allocate an array of size M+1. For the text $t_0 t_1 t_2 t_3 \ldots t_{l-M} t_{l-M+1} \ldots t_l$ we keep the signatures $S_{l-M}, S_{l-M+1}, \ldots, S_l$. Notice that it takes O(1) time to calculate Sig($t_i t_{i+1} \ldots t_l$).
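One way to realize such a window is a ring buffer of prefix signatures, sketched below. This is an assumption-laden illustration: the class name `SignatureWindow` is invented, and it uses Karp-Rabin style (Horner) prefix signatures so that the signature of the last `length` characters falls out of one subtraction and one multiplication by a precomputed power of r.

```python
class SignatureWindow:
    """Keeps the last M+1 prefix signatures of a stream in O(M) memory."""

    def __init__(self, M, r, q):
        self.M, self.r, self.q = M, r, q
        self.pow = [pow(r, k, q) for k in range(M + 1)]  # r^0 .. r^M mod q
        self.ring = [0] * (M + 1)   # slot k % (M+1) holds the signature H_k
        self.count = 0              # number of characters seen so far

    def push(self, c):
        """O(1): H_{k+1} = H_k * r + c mod q, overwriting the oldest slot."""
        prev = self.ring[self.count % (self.M + 1)]
        self.count += 1
        self.ring[self.count % (self.M + 1)] = (prev * self.r + ord(c)) % self.q

    def suffix_sig(self, length):
        """O(1): signature of the last `length` characters seen."""
        if not 0 <= length <= min(self.M, self.count):
            raise ValueError("length outside the window")
        newest = self.ring[self.count % (self.M + 1)]
        older = self.ring[(self.count - length) % (self.M + 1)]
        return (newest - older * self.pow[length]) % self.q
```

After pushing the characters of a stream one by one, `suffix_sig(k)` equals the Horner signature of the last k characters for any k up to M, which is what the dictionary lookup needs.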

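Case 1: d>M - continue: We will do a binary search over the sliding window $S_{l-M}, S_{l-M+1}, \ldots, S_l$. Start at $l-2^j$: is the signature of the text from there to $t_l$ in the HashTable? No. Try $l-2^{j-1}$: is it in the HashTable? Yes. Then try $l-2^{j-1}-2^{j-2}$: is it in the HashTable? And so on.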

Case 2: d<M. In this case we will split our dictionary D into 2 dictionaries:
–$D_1$: all the patterns shorter than d. On this dictionary we will run case 1.
–$D_2$: all the patterns longer than d. We need only to deal with this case.

Case 2: d<M - continue: For each $P_i$ in $D_2$, $P_i = a_0 a_1 a_2 \ldots a_{d-1} a_d \ldots a_m$, let $SP_i$ = Sig($a_0 a_1 \ldots a_{d-1}$). Store $SP_i$ in the hash table.

Case 2: d<M - continue: If $P_i$ contains a periodic prefix of length more than d, $P_i = u u u u u u v \ldots a_m$, we store as well the number of times we need to see $SP_i$ (where the periodicity ends, the signature w.h.p. won't be $SP_i$). We will start a process which seeks $P_i$ only after seeing $SP_i$ enough times. Therefore the number of characters we have to see between 2 processes of $P_i$ is at least d.

Case 2: d<M - continue: We run the algorithm from the beginning of the lecture. Amortized, it takes O(1/d) time per pattern per text character. Overall it takes O(1) amortized time per text character. By the lazy approach we get O(1) time in the worst case.

Open problems
Multi pattern search: case 2 takes O(1) time, however case 1 takes O(log d)
–Improve case 1 to be O(1)
–With a heuristic, almost all the dictionary takes O(1) time, and O(1) space per pattern.
Lower bound
–We believe that the single pattern search lower bound is Ω(log m log δ)
Find more clients
Find a place for sabbatical (~1/1/ /9/2013)

Important things: Upcoming events:
–ICALP2011GT (July 3rd, one day before ICALP). We will have some support for students.
–Workshop on Sparsity and Computation, U. Mich., Aug 1--4. We will have some support for students.
–IMA: Group Testing Designs, Algorithms, and Applications to Biology, Feb
–Stringology 2012