Improving Digest-Based Collaborative Spam Detection


1 Improving Digest-Based Collaborative Spam Detection
Slavisa Sarafijanovic, Sabrina Perez, Jean-Yves Le Boudec
EPFL, Switzerland (MICS)
MIT Spam Conference, Mar 27-28, 2008, MIT, Cambridge

2 Talk content
- Digest-based filtering: global picture overview
- Understanding HOW digests work: the "Open Digest" paper [1] (very positive results/conclusions, widely cited and referred to)
- Understanding it better: our re-evaluation of the "Open Digest" paper results (different conclusions!)
- Our alternative digests: results improve a lot, and we understand WHY
- Understanding the "why" => further improvements possible (negative selection)
- Conclusions

[1] "An Open Digest-based Technique for Spam Detection", E. Damiani, S. De Capitani di Vimercati, S. Paraboschi, P. Samarati, in Proc. of the 2004 International Workshop on Security in Parallel and Distributed Systems, San Francisco, CA, USA, September 15-17, 2004.

3 Two main collaborative spam detection approaches
1) White-listing using social networks: users connected by trust relationships (example: PGP graph of certificates).
2) Bulky content detection using digests: users share digests of recent emails (examples: DCC, Vipul's Razor, Commtouch).
Implementations (in both cases): centralized or decentralized, open or proprietary.
This talk (paper): the digests approach for bulky content detection.

4 A Real Digest-Based System: DCC (Distributed Checksum Clearinghouse)
~250 DCC servers serve ~n mail servers and ~n million mail users. A mail server sends Query = digest of a received email; a DCC server sends Reply = counter (e.g., n=3), i.e., how many times that digest has been reported. A spammer sends in bulk, so bulky content accumulates high counters.
Strengths/drawbacks: fast response; not precise (false-positive problems); limited obfuscation resistance.
Reproducible evaluation of digest efficiency: the "Open Digest" paper.
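The query/reply exchange above can be sketched as a minimal clearinghouse. This is a toy model with hypothetical names; real DCC uses fuzzy checksums and its own wire protocol, while here an exact hash stands in for the digest:

```python
import hashlib
from collections import defaultdict

class Clearinghouse:
    """Toy DCC-style server: counts how often each digest is reported."""
    def __init__(self):
        self.counters = defaultdict(int)

    def query(self, digest):
        # Each query both reports the digest and returns its counter.
        self.counters[digest] += 1
        return self.counters[digest]

def digest(email_body):
    # Stand-in digest; real DCC uses fuzzy checksums, not exact hashes.
    return hashlib.sha256(email_body.encode()).hexdigest()

BULK_THRESHOLD = 3  # flag content reported by >= 3 recipients (illustrative)

server = Clearinghouse()
spam = "Cheapest vacation deals ever!"
counts = [server.query(digest(spam)) for _ in range(4)]
is_bulk = counts[-1] >= BULK_THRESHOLD
```

Because the same bulk content always yields the same digest here, the counter grows with every recipient's query, which is exactly what makes bulk detectable and what obfuscation tries to defeat.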

5 Producing Digests: Nilsimsa similarity hashing as explained in OD-paper
A sliding window of N=5 characters moves over the email text ("Cheapest vac... Best Regards, John"), which is L characters long. Each window position yields 8 trigrams (e.g., from "Cheap": 1: Che, 2: Cha, ..., 8: hea). Each trigram is hashed (30^3 -> 2^8) to select one of 256 accumulator buckets, which is incremented (+1). After L-N+1 steps, the digest is a binary string of 256 bits: bit i is 1 if accumulator bucket i exceeds its expected count.
Definition: the Nilsimsa Compare Value (NCV) between two digests is the number of bits at corresponding positions that are equal, minus 128. Identical emails => NCV=128; unrelated emails => NCV close to 0. More similar emails => more similar digests => higher NCV.
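A simplified, self-contained sketch of the scheme above. The trigram selection (all 10 combinations instead of Nilsimsa's 8 fixed ones) and the trigram hash are simplifications of real Nilsimsa:

```python
import hashlib
from itertools import combinations

def nilsimsa_like_digest(text, n=5):
    # Slide an n-character window over the text; hash trigrams drawn
    # from each window into one of 256 accumulator buckets.
    acc = [0] * 256
    total = 0
    for i in range(len(text) - n + 1):
        window = text[i:i + n]
        for combo in combinations(range(n), 3):  # real Nilsimsa uses 8 fixed combos
            tri = "".join(window[j] for j in combo)
            bucket = hashlib.md5(tri.encode()).digest()[0]  # trigram -> 0..255
            acc[bucket] += 1
            total += 1
    # Bit i is 1 if bucket i saw more than its expected share of trigrams.
    expected = total / 256
    return [1 if c > expected else 0 for c in acc]

def ncv(d1, d2):
    # Nilsimsa Compare Value: equal bits minus 128 (range -128..128).
    return sum(a == b for a, b in zip(d1, d2)) - 128
```

As the definition on this slide states, comparing a digest with itself gives NCV = 128, and digests of unrelated texts land near 0.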

6 “Open Digest” paper experiments and results
Evaluation <= experiment:
- spam bulk detection <= detection of similarity between two emails from the same spam bulk
- ham misdetection <= misdetection of similarity between unrelated emails
Bulk detection experiment (repeated many times, to get statistics): select a spam at random from the Spam Corpus, obfuscate two copies, compute their digests, compare (NCV value, an integer), evaluate similarity (matching indicator 0/1: NCV > Threshold=54).
OD-paper result for the "adding random text" obfuscation: the paper only evaluates (talks about) the average NCV.
OD-paper conclusion: average NCV > Threshold => bulk detection is resistant to strong obfuscation by spammers.
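The bulk-detection experiment can be sketched end to end. The digest and NCV functions are the compact stand-ins for the slide-5 scheme, and the "adding random text" obfuscation and its ratio are illustrative assumptions:

```python
import hashlib
import random
from itertools import combinations

def digest(text, n=5):
    # Compact Nilsimsa-style similarity digest (simplified; see slide 5).
    acc = [0] * 256
    total = 0
    for i in range(len(text) - n + 1):
        for combo in combinations(range(n), 3):
            tri = "".join(text[i + j] for j in combo)
            acc[hashlib.md5(tri.encode()).digest()[0]] += 1
            total += 1
    expected = total / 256
    return [1 if c > expected else 0 for c in acc]

def ncv(d1, d2):
    return sum(a == b for a, b in zip(d1, d2)) - 128

def obfuscate(text, ratio, rng):
    # "Adding random text" obfuscation: append ratio * len(text) random chars.
    extra = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz ")
                    for _ in range(int(ratio * len(text))))
    return text + extra

THRESHOLD = 54
rng = random.Random(1)
spam = "Cheapest vacation deals ever, call now to claim your free prize! " * 4
copy1, copy2 = obfuscate(spam, 0.5, rng), obfuscate(spam, 0.5, rng)
value = ncv(digest(copy1), digest(copy2))   # NCV value (integer)
matched = value > THRESHOLD                 # matching indicator (0/1)
```

Repeating this for many random spams and obfuscation ratios yields the statistics (average NCV, matching probability) that the experiments discuss.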

7 “Open Digest” paper experiments and results (cont.)
Ham misdetection experiment: for each pair of unrelated emails from the Ham and Spam Corpus (n1~2500, n2~2500 emails), compute the digests, compare (NCV values, integers), evaluate similarity (matching indicators 0/1: NCV > Threshold=54).
OD-paper result: no matching (misdetection) case is observed.
OD-paper conclusion: misdetection of good emails must be very low; approximating the misdetection probability by use of the Binomial distribution supports the observed result.
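The Binomial approximation mentioned above can be computed directly. If each of the 256 bit positions of two unrelated digests matches independently with probability 1/2, the misdetection probability is the upper tail of a Binomial(256, 1/2) above the matching threshold:

```python
from math import comb

def p_misdetection(threshold=54, bits=256):
    # NCV = matches - bits/2 > threshold  <=>  matches >= bits/2 + threshold + 1
    need = bits // 2 + threshold + 1
    # Exact Binomial(bits, 1/2) upper tail.
    return sum(comb(bits, k) for k in range(need, bits + 1)) / 2 ** bits
```

For Threshold=54 this tail is vanishingly small (well under 1e-10), which is consistent with observing no misdetection among a few million pairs, and explains why the OD-paper's model predicted very low ham misdetection.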

8 Extending OD-paper experiments: spam bulk detection
Bulk detection experiment, identical to the one in the OD-paper, but we test higher obfuscation ratios (repeated many times, to get statistics): select a spam at random, obfuscate two copies, compute digests, compare (NCV value, an integer), evaluate similarity (matching indicator 0/1: NCV > Threshold=54).
The OD-paper result is well recovered (blue dotted line), but the OD-paper conclusion does not hold: even an only slightly higher obfuscation ratio brings the average NCV below the threshold.

9 Understanding better what happens
"Compare X to Database" (generic experiment): X is selected at random EITHER from Ham Corpus 1/2 (ham to filter) OR from the Spam Corpus (obfuscation 1). Its digest is compared to each digest in a database DB of spam digests (Spam Corpus, obfuscation 2) and ham digests (Ham Corpus 2/2); DB represents "previous digest queries". Each comparison yields an NCV value (integer) and a matching indicator (NCV > Threshold=54).
We look at more metrics: probability of email-to-email matching, Max(NCV), average NCV, histogram.
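The generic experiment reduces to computing, for one query digest, the NCV against every digest in DB, then deriving the metrics. A sketch with random bit lists standing in for real digests (unrelated digests behave this way by the Binomial model above):

```python
import random

def ncv(d1, d2):
    # Equal bits minus 128 (range -128..128).
    return sum(a == b for a, b in zip(d1, d2)) - 128

THRESHOLD = 54
rng = random.Random(0)

def random_digest():
    # Stand-in for the digest of an unrelated email.
    return [rng.randint(0, 1) for _ in range(256)]

db = [random_digest() for _ in range(1000)]   # "previous digest queries"
query = random_digest()                       # digest of X

values = [ncv(query, d) for d in db]          # NCV values (integers)
indicators = [v > THRESHOLD for v in values]  # matching indicators (0/1)
matched = any(indicators)                     # email-to-email matching
max_ncv = max(values)                         # Max(NCV)
avg_ncv = sum(values) / len(values)           # average NCV
```

For unrelated digests all NCV values crowd near 0, so the histogram, Max(NCV), and matching probability tell much more than the average NCV alone.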

10 SPAM – DB experiment results:
The mean Max(NCV) value is not informative. The effect of obfuscation changes gracefully: a spammer may gain by additional obfuscation.

11 SPAM – DB, NCV histograms: effect of obfuscation
Small obfuscation: digests are still useful for bulk detection.

12 SPAM – DB, NCV histograms: effect of obfuscation
Stronger obfuscation: most of the digests are rendered useless!

13 HAM – DB experiment results:
The mean Max(NCV) value is not informative. The misdetection probability is still too high for practical use.

14 HAM – DB, NCV histograms: effect of obfuscation
Spam obfuscation does not impact misdetection of good emails. The shifted and wide histograms phenomenon => high false positives explained.

15 Alternative digests
Sampling strings: fixed length, random positions. Each sampled string is digested separately, so one email yields multiple digests (011010…11, 101110…11, 001010…10).
Email-to-email matching: max NCV over pairs of digests (find how similar the most similar parts are – e.g., spammy phrases).
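A sketch of the alternative digests, reusing the compact slide-5 digest; the sample length and number of samples are illustrative assumptions, not the paper's parameters:

```python
import hashlib
import random
from itertools import combinations

def similarity_digest(text, n=5):
    # Compact Nilsimsa-style digest of one string (simplified; see slide 5).
    acc = [0] * 256
    total = 0
    for i in range(len(text) - n + 1):
        for combo in combinations(range(n), 3):
            tri = "".join(text[i + j] for j in combo)
            acc[hashlib.md5(tri.encode()).digest()[0]] += 1
            total += 1
    expected = total / 256
    return [1 if c > expected else 0 for c in acc]

def ncv(d1, d2):
    return sum(a == b for a, b in zip(d1, d2)) - 128

def sampled_digests(text, sample_len=60, samples=4, seed=0):
    # Fixed-length strings sampled at random positions, each digested.
    rng = random.Random(seed)
    out = []
    for _ in range(samples):
        start = rng.randrange(max(1, len(text) - sample_len))
        out.append(similarity_digest(text[start:start + sample_len]))
    return out

def email_match_ncv(digests_a, digests_b):
    # Email-to-email matching: max NCV over all pairs of digests,
    # i.e. how similar the most similar parts are (e.g., a spammy phrase).
    return max(ncv(a, b) for a in digests_a for b in digests_b)
```

Because each digest covers only a short sample, added random text dilutes some samples but leaves others clean, which is why the max over pairs survives obfuscation that destroys a whole-email digest.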

16 SPAM – DB experiment results (alt. digests)
Spam bulk detection is no longer vulnerable to obfuscation...

17 SPAM – DB (alt. digests): effect of obfuscation
… and we can see why that is.

18 HAM – DB experiment results (alt. digests):
Misdetection probability is still too high.

19 HAM – DB (alt. digests): effect of obfuscation
What can be done to decrease ham miss-detection?

20 Alternative digests open new possibilities
Negative selection: the new digest(s) of an email are first compared against a database of good digests; only the digests that do not match are then compared to the collaborative database of digests (DB). That comparison part is the same as without negative selection.
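The negative-selection step above can be sketched as a filter over an email's digests before they are sent to the collaborative DB (digests abstracted to bit lists):

```python
import random

def ncv(d1, d2):
    # Equal bits minus 128 (range -128..128).
    return sum(a == b for a, b in zip(d1, d2)) - 128

def negative_selection(email_digests, ham_db, threshold=54):
    # Keep only the digests that do NOT match any known-good digest;
    # only the survivors are compared to the collaborative DB.
    return [d for d in email_digests
            if all(ncv(d, h) <= threshold for h in ham_db)]

rng = random.Random(0)
def random_digest():
    return [rng.randint(0, 1) for _ in range(256)]

ham_db = [random_digest() for _ in range(100)]  # database of good digests
ham_like = ham_db[0][:]   # a digest matching a known-good email
spammy = random_digest()  # an unrelated digest
survivors = negative_selection([ham_like, spammy], ham_db)
```

Digests of common ham phrases are discarded before they can match anything in the collaborative DB, which is how this step lowers ham misdetection.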

21 Effect of negative selection on misdetection of ham:

22 Conclusions
- Use of proper metrics is crucial for drawing proper conclusions from experiments.
- Alternative digests provide much better results, and by use of NCV histograms we understand why.
- Use of proper metrics is crucial for understanding what happens… and for understanding how to fix the problems.
