Fast Statistical Spam Filter by Approximate Classifications. Authors: Kang Li, Zhenyu Zhong (University of Georgia). Reader: Deke Guo.


1 Fast Statistical Spam Filter by Approximate Classifications. Authors: Kang Li, Zhenyu Zhong (University of Georgia). Reader: Deke Guo

2 Outline
- Motivations of this paper
- The concrete problems
- Basic idea and solutions
- Questions that need clarification

3 Motivations
1. Speed up the classification process so that spam is blocked quickly and system throughput improves.
2. Improve the scalability of statistical classification methods.
3. Keep classification accuracy high.

4 The background and concrete problem
- Background
  Statistical Bayesian filters and their variants are used to block spam. The statistical value of each individual token is stored in a dictionary. A decision is made by combining the values of many tokens.
- Problems to investigate
  How to improve the performance of the value-retrieval operation for each individual token (motivations 1 and 2). The solution should not have a large negative effect on classification accuracy (motivation 3).

5 Basic idea and solutions (1)
- A straightforward idea
  Use Bloom filters to store the values of tokens, and retrieve the value of any token on demand.
- The first obstacle
  A standard Bloom filter only answers membership queries. How can it be extended to store values?

6 [Figure: a standard Bloom filter. Labels: data set A (elements a, b, c, d, x, y), data set B, a hash function family, and a bit vector indexed from 0 to m-1.]

7 [Figure: the extended Bloom filter. Labels: token universe A (token1 to token4, x, y), test set B, a hash function family, a multi-bit vector whose first dimension is the hash location and whose second dimension runs from 0 to q-1, and a bit-wise AND that produces the output value.]

8 Basic idea and solutions (2)
- Replace the bit vector with a two-dimensional vector of size m × q.
  The first dimension gives the hash locations of each token in an m-bit vector, the same as in the standard Bloom filter. The second dimension at each hash location encodes the value of the token, with one bit for each distinct value.
- The second obstacle
  The value universe is usually large, even huge. It is impossible to allocate a bit in the second dimension for every element of the value universe.
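The two-dimensional layout on this slide can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: the class name `ExtendedBloomFilter`, the use of salted SHA-256 digests as the hash family, and the storage of each q-bit row as a Python integer are all assumptions made for the example.

```python
import hashlib


class ExtendedBloomFilter:
    """m hash locations, each holding q value bits (stored as an int bitmask)."""

    def __init__(self, m, q, h):
        self.m, self.q, self.h = m, q, h
        self.rows = [0] * m  # rows[i] is the q-bit entry at hash location i

    def _locations(self, token):
        # Hypothetical hash family: h salted SHA-256 digests reduced modulo m.
        for i in range(self.h):
            digest = hashlib.sha256(f"{i}:{token}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def put(self, token, level):
        """Record that `token` has encoded value `level` (0 <= level < q)."""
        if not 0 <= level < self.q:
            raise ValueError("level out of range")
        for loc in self._locations(token):
            self.rows[loc] |= 1 << level

    def query(self, token):
        """Bit-wise AND of the h entries selected by the token's hash locations."""
        out = (1 << self.q) - 1
        for loc in self._locations(token):
            out &= self.rows[loc]
        return out


f = ExtendedBloomFilter(m=1024, q=8, h=4)
f.put("viagra", 7)
f.put("meeting", 1)
entry = f.query("viagra")  # bit 7 is guaranteed to be set; other bits may be set by collisions
```

A stored token always finds its own value bit in the output entry. Collisions with other tokens can only add bits, which is the source of the two error types discussed later in the slides.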

9 Basic idea and solutions (3)
- Encode
  In this field, the value universe ranges from 0 to 1. The paper does not propose a new encoding method; it uses an algorithm taken from reference [20]. Choose and tune the parameter q, the number of possible values produced by the encoding algorithm.
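The slides do not describe the encoding algorithm of [20]. As a stand-in only, the sketch below uses plain uniform quantization to show what mapping a value in [0, 1] to one of q levels looks like; the function names `encode` and `decode` and the choice of uniform bins are assumptions, not the paper's method.

```python
def encode(value, q):
    """Map a value in [0, 1] to one of q levels (uniform bins; an assumed scheme)."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    return min(q - 1, int(value * q))


def decode(level, q):
    """Return the midpoint of the bin for `level`; the original value is not recoverable."""
    return (level + 0.5) / q


q = 10
level = encode(0.93, q)       # falls in the top bin
approx = decode(level, q)     # bin midpoint, differs from 0.93 by at most 1 / (2 * q)
```

With uniform bins the decoding deviation is bounded by 1 / (2q), so a larger q reduces the deviation at the cost of more bits per hash location.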

10 Why does the idea meet motivations 1 and 2?
- Space (for the set of (token, value) pairs)
  Storing the pairs in the extended Bloom filter needs less space than other structures. K bits for each token. For a given amount of memory, the solution can store more (token, value) pairs than others.
- Time
  The extended Bloom filter is small enough to be held in memory, so no further I/O operations are needed. The response delay of a query is constant for any input, no matter how many pairs have been stored. In the same time slot, the solution can retrieve the values of more tokens than previous solutions.

11 The negative effects on the classification accuracy (1)
- A query on the extended Bloom filter may produce two kinds of mistake.
  For a query with a token outside the test data set, the output entry may look valid (exactly one bit is set to 1). For a query with a token inside the test data set, the output entry may be conflicting (more than one bit is set to 1).
- For any token, the decoded result usually does not equal the real statistical value.

12 [Figure: the extended Bloom filter of slide 7, relabeled with token set A and token set B. Here the bit-wise AND produces an output entry with more than one bit set to 1.]

13 The negative effects on the classification accuracy (2)
- The misclassification
  The former error affects the combined value of a message and may influence the decision. For a multi-bit error, choose the smallest value. If the wrong value is chosen, the error only makes the message less likely to be classified as spam, and may result in a false negative. This can be tolerated.
- The decoding deviation
  It cannot be avoided. Design better algorithms and/or select the parameters carefully.
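The "choose the smallest value" rule for a multi-bit output entry can be sketched as a small helper. The function name `resolve` and the convention that bit index i stands for encoded level i (lower index meaning a lower spam value) are assumptions made for this example.

```python
def resolve(entry, q):
    """Return the smallest level whose bit is set in a q-bit output entry.

    Returns None when no bit is set, i.e. the token is not in the filter.
    """
    for level in range(q):
        if (entry >> level) & 1:
            return level
    return None


single = resolve(0b0100, 4)    # exactly one bit set: level 2
conflict = resolve(0b1010, 4)  # bits 1 and 3 set: the smaller level, 1, is chosen
missing = resolve(0b0000, 4)   # no bit set: None
```

Picking the smallest level biases conflicts toward "less spam-like", which matches the slide's argument that such errors lead to false negatives rather than false positives.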

14 Questions that need clarification (1)
- For a query output entry, the probability that a single bit of the output entry is zero is
  $$P_{m,n,h}(0) = 1 - P_{m,n,h}(\mathrm{fpos}) = 1 - \left(1 - (1 - 1/m)^{nh}\right)^{h}$$
- For a query output entry, the probability of the former case:
  $$P_{m,n,h,q}(\mathrm{fpos}) = 1 - \left(P_{m,n,h}(0)\right)^{q} \quad (6)$$
- The probability of the latter case:
  $$P_{m,n,h,q}(\mathrm{multi}) = 1 - \left(P_{m,n,h}(0)\right)^{q} - q \left(1 - P_{m,n,h}(0)\right)^{q-1} \quad (7)$$
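To get a feel for the magnitudes, the first two expressions on this slide can be evaluated numerically. The sketch below transcribes the per-bit zero probability and expression (6) as they appear on the slide; the function names and the sample parameters are illustrative assumptions. Note that expression (6), evaluated this way, is the probability that at least one of the q bits is 1 if the bits are treated as independent.

```python
def p_zero(m, n, h):
    """Probability that one bit of the output entry is 0, as given on the slide."""
    return 1.0 - (1.0 - (1.0 - 1.0 / m) ** (n * h)) ** h


def p_fpos(m, n, h, q):
    """Expression (6) as transcribed: 1 - P(0)^q."""
    return 1.0 - p_zero(m, n, h) ** q


# Illustrative parameters only: m locations, n stored tokens, h hashes, q levels.
light = p_fpos(m=1000, n=100, h=4, q=8)
heavy = p_fpos(m=1000, n=200, h=4, q=8)
```

As expected, loading more tokens into the same number of locations raises the value of expression (6).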

15 Questions that need clarification (1)
- Formulas 6 and 7 are wrong, or at least inconsistent with the error definitions.
- The probability of the event "exactly one bit of the output entry is set to 1" is: [formula missing in transcript]
- The probability of the event "more than one bit of the output entry is set to 1" is: one minus the probability that all bits are 0, minus the probability that exactly one bit is 1.
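One reading that is consistent with the verbal descriptions on this slide is the binomial form. It assumes that the q bit positions of an output entry are independent and that each is zero with probability $p_0 = P_{m,n,h}(0)$; both the independence assumption and the shorthand $p_0$ are introduced here and do not come from the slides.

```latex
\begin{align*}
P(\text{exactly one bit is 1}) &= q\,(1 - p_0)\,p_0^{\,q-1} \\
P(\text{more than one bit is 1}) &= 1 - p_0^{\,q} - q\,(1 - p_0)\,p_0^{\,q-1}
\end{align*}
```

Under this reading, the two probabilities and the all-zero probability $p_0^{\,q}$ sum to 1, which is the consistency check the reviewer's description implies.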

16 Questions that need clarification (2)
- Can this idea serve as a general way to extend the standard Bloom filter to store and retrieve values? Open issues:
  The size of the value universe. The multi-bit output error. The deletion of (key, value) pairs.

17 Questions and Answers

18 Beyond Bloom Filters: From Approximate Membership Checks to Approximate State Machines. Authors: Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy. SIGCOMM 2006. Reader: Deke Guo

19 Questions
- How can each network device track the simultaneous state of a large number of connections?
- The tracking structure should be small enough to fit in on-chip memory.

20 Solution (1)
- Uses standard Bloom filters to summarize the simultaneous state of a large number of connections.
- Looks up the state of each connection from this summary.
- Introduces a new error type, "don't know", in addition to false positives and false negatives.

21 Solution (1)
- Introduces a timing-based deletion mechanism to deal with ill-behaved or non-terminating connections.
- Operations:
  Put (id, state)
  Lookup (id) or Lookup (id, state)
  Delete (id, state)
  Update (id, old state, new state)
- Ill-behaved or attacking connections may result in false negative errors.
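The four operations can be sketched on top of a counting Bloom filter that stores (id, state) pairs. This is a minimal illustration under several assumptions: the class name `BloomStateTable`, the `DONT_KNOW` marker, the salted SHA-256 hash family, and the use of counters to support deletion are choices made for the example, and the timing-based deletion mechanism is left out.

```python
import hashlib

DONT_KNOW = "don't know"  # marker returned when several states match one id


class BloomStateTable:
    """Counting Bloom filter over (flow id, state) pairs."""

    def __init__(self, m, k, states):
        self.m, self.k = m, k
        self.states = list(states)
        self.counts = [0] * m

    def _cells(self, flow_id, state):
        # Hypothetical hash family: k salted SHA-256 digests of the pair.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{flow_id}:{state}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def put(self, flow_id, state):
        for c in self._cells(flow_id, state):
            self.counts[c] += 1

    def delete(self, flow_id, state):
        # Deleting a pair that was never inserted corrupts other entries
        # and is how false negatives can arise.
        for c in self._cells(flow_id, state):
            if self.counts[c] > 0:
                self.counts[c] -= 1

    def update(self, flow_id, old_state, new_state):
        self.delete(flow_id, old_state)
        self.put(flow_id, new_state)

    def lookup(self, flow_id):
        """Try every possible state; more than one match means 'don't know'."""
        hits = [s for s in self.states
                if all(self.counts[c] > 0 for c in self._cells(flow_id, s))]
        if not hits:
            return None
        return hits[0] if len(hits) == 1 else DONT_KNOW


table = BloomStateTable(m=4096, k=3, states=["SYN", "ESTABLISHED", "FIN"])
table.put("10.0.0.1:80", "SYN")
table.update("10.0.0.1:80", "SYN", "ESTABLISHED")
state = table.lookup("10.0.0.1:80")  # "ESTABLISHED", or DONT_KNOW if a false positive occurs
```

The lookup has to probe every candidate state, so this approach fits best when the number of states is small.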

22 [Figure: a standard Bloom filter for data set B, with hash functions h1 to hk applied to elements of data set A (a, b, c, d, x, y). Annotations: x does not belong to set B, yet its bits have all been set to 1. y does not belong to set B, and its bits are not all 1. a belongs to set B, and its bits are all 1.]

23 [Figure: the same Bloom filter after x has been falsely deleted. Annotation: a belongs to set B, but its bits are no longer all 1.] A false positive error may result in at most k false negatives.

24 Solution (2)
- Introduces the Stateful Bloom Filter approach.
  Replaces the bit vector used by standard Bloom filters with a cell vector. Its false positive rate is lower than that of standard Bloom filters. Note that the two filters do not use the same amount of storage, so a more careful comparison is needed.
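A cell vector of this kind can be sketched as follows, where each cell holds a state and a counter. The insert, lookup, and delete rules shown here are one plausible reading of the approach and are assumptions for illustration, as are the class name `StatefulBloomFilter`, the `DONT_KNOW` marker, and the hash family.

```python
import hashlib

DONT_KNOW = "DK"  # cell value used when flows with different states share a cell


class StatefulBloomFilter:
    """Each cell stores a state plus a counter of the flows hashed to it."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.state = [None] * m
        self.count = [0] * m

    def _cells(self, flow_id):
        # Hypothetical hash family; duplicates are merged so a cell is touched once.
        cells = set()
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{flow_id}".encode()).digest()
            cells.add(int.from_bytes(digest[:8], "big") % self.m)
        return sorted(cells)

    def insert(self, flow_id, state):
        for c in self._cells(flow_id):
            if self.count[c] == 0:
                self.state[c] = state
            elif self.state[c] != state:
                self.state[c] = DONT_KNOW  # conflicting states collapse to "don't know"
            self.count[c] += 1

    def lookup(self, flow_id):
        cells = self._cells(flow_id)
        if any(self.count[c] == 0 for c in cells):
            return None  # an empty cell proves the flow is absent
        values = {self.state[c] for c in cells} - {DONT_KNOW}
        if not values:
            return DONT_KNOW
        return values.pop() if len(values) == 1 else None

    def delete(self, flow_id):
        for c in self._cells(flow_id):
            if self.count[c] > 0:
                self.count[c] -= 1
                if self.count[c] == 0:
                    self.state[c] = None  # a cell is reset only when its counter hits 0


# A one-cell filter forces a collision, to show how "don't know" persists.
sbf = StatefulBloomFilter(m=1, k=3)
sbf.insert("a", "SYN")
sbf.insert("b", "ESTABLISHED")
sbf.delete("b")
result = sbf.lookup("a")  # still DONT_KNOW, although only "a" remains in the cell
```

The last line illustrates a limitation of this scheme: once a cell becomes "don't know", it keeps that value until its counter returns to 0.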

25 [Figure: a Stateful Bloom Filter (cell vector) for data set B, with hash functions h1 to hk applied to elements of data set A (a, b, c, d, x, y). Annotation: x does not belong to set B, and the lookup based on the filter also makes the right judgment.]

26 Solution (3)
- An approach using d-left hashing
  The authors did not explain, through formal comparison and analysis, why it is the best of the three solutions. The simulation tries to show this, but it is not strong enough, in particular because the solutions are not compared under the same space budget.
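The general shape of a d-left hash table that stores a fingerprint and a state per flow can be sketched as follows. This is a simplified illustration rather than the authors' construction: the class name `DLeftStateTable`, the unbounded bucket size, the per-subtable fingerprints derived from salted SHA-256 digests, and the `DONT_KNOW` marker are all assumptions.

```python
import hashlib

DONT_KNOW = "DK"  # returned when matching fingerprints disagree on the state


class DLeftStateTable:
    """d subtables; a flow goes to the least-loaded candidate bucket, ties to the left."""

    def __init__(self, d, buckets, fp_bits=16):
        self.d, self.buckets, self.fp_bits = d, buckets, fp_bits
        # tables[i][b] maps fingerprint -> state for bucket b of subtable i
        self.tables = [[{} for _ in range(buckets)] for _ in range(d)]

    def _slots(self, flow_id):
        # One candidate (subtable, bucket, fingerprint) triple per subtable.
        slots = []
        for i in range(self.d):
            digest = hashlib.sha256(f"{i}:{flow_id}".encode()).digest()
            bucket = int.from_bytes(digest[:8], "big") % self.buckets
            fp = int.from_bytes(digest[8:16], "big") % (1 << self.fp_bits)
            slots.append((i, bucket, fp))
        return slots

    def put(self, flow_id, state):
        slots = self._slots(flow_id)
        for i, b, fp in slots:
            if fp in self.tables[i][b]:
                self.tables[i][b][fp] = state  # already stored: overwrite the state
                return
        # min() returns the first minimum, so ties go to the leftmost subtable.
        i, b, fp = min(slots, key=lambda s: len(self.tables[s[0]][s[1]]))
        self.tables[i][b][fp] = state

    def lookup(self, flow_id):
        found = {self.tables[i][b][fp]
                 for i, b, fp in self._slots(flow_id) if fp in self.tables[i][b]}
        if not found:
            return None
        return found.pop() if len(found) == 1 else DONT_KNOW

    def delete(self, flow_id):
        for i, b, fp in self._slots(flow_id):
            if fp in self.tables[i][b]:
                del self.tables[i][b][fp]
                return


t = DLeftStateTable(d=4, buckets=64)
t.put("flow-1", "SYN")
t.put("flow-1", "ESTABLISHED")  # state transition overwrites the stored state
current = t.lookup("flow-1")
```

Because only a short fingerprint is stored, two flows can still be confused when their fingerprints collide in the same bucket, which is where false positives and "don't know" answers come from in this sketch.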

27 [Figure: illustration with data set A (elements a, b, c, d, x, y) and data set B stored in cell vectors.]

28 Questions that need analysis
- Analyze the relationship between false positives and false negatives, and try to give a formula.
- If the old value of a cell was "don't know", the cell keeps that value until its register becomes 0. Analyze the fraction of cells whose value is "don't know", and compute the rate of this error. If the register drops to 1 from a larger value, the value "don't know" should become a definite value, but the SBF cannot support this transformation.
- If we use the idea of the SBF to redesign standard Bloom filters, can we achieve some benefits, such as a lower false positive rate?

