Improving Digest-Based Collaborative Spam Detection
Slavisa Sarafijanovic, Sabrina Perez, Jean-Yves Le Boudec
EPFL, Switzerland, MICS
MIT Spam Conference, Mar 27-28, 2008, MIT, Cambridge.
Talk content
- Digest-based filtering: global picture overview
- Understanding how digests work: the “Open Digest” paper [1] (very positive results and conclusions, widely cited)
- Understanding it better: our re-evaluation of the “Open Digest” paper results (different conclusions!)
- Our alternative digests: results improve a lot, and we understand why
- Understanding the “why” => further improvements possible (negative selection)
- Conclusions
[1] E. Damiani, S. De Capitani di Vimercati, S. Paraboschi, P. Samarati, “An Open Digest-based Technique for Spam Detection”, in Proc. of the 2004 International Workshop on Security in Parallel and Distributed Systems, San Francisco, CA, USA, September 15-17, 2004.
Two main collaborative spam detection approaches
1) White-listing using social networks. [Figure: users (User 1, User 2, ..., User n) connected by relationships.] Example: PGP graph of certificates.
2) Bulky content detection using digests. [Figure: users (User 1, User 2, User 3, ..., User n) sharing recent digests.] Examples: DCC, Vipul’s Razor, Commtouch.
Implementations (in both cases): centralized or decentralized, open or proprietary.
This talk (paper): the digest approach for bulky content detection.
A Real Digest-Based System: DCC (Distributed Checksum Clearinghouse)
[Figure: ~250 DCC servers (s); ~n * 10 000 mail servers (MS); ~n * millions of mail users (MC). Query = digest, Reply = counter (n=3). The spammer sends in bulk.]
Strengths/drawbacks:
- fast response
- not precise (FP problems)
- limited obfuscation resistance
Reproducible evaluation of digest efficiency: the “Open Digest” paper
Producing Digests: Nilsimsa similarity hashing as explained in OD-paper
[Figure: a sliding window of N=5 characters moves over the e-mail, which is L characters long (“Cheapest vac... Best Regards, John”). For the window “Cheap”, 8 trigrams are formed (1: Che, 2: Cha, …, 8: hea). Each trigram is hashed (Hash: 30^3 -> 2^8) and the 8-bit result b7 ... b0 (e.g. 00001111 = 15) selects one of the accumulator entries 0 ... 255, which is incremented by 1. After L-N+1 steps, the accumulator is converted into the digest.]
The digest is a binary string of 256 bits.
Definition: the Nilsimsa Compare Value (NCV) between two digests is the number of bits at corresponding positions that are equal, minus 128.
Identical emails: NCV=128. Unrelated emails: NCV close to 0.
More similar emails => more similar digests => higher NCV.
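The digest construction and the NCV definition above can be sketched in a few lines of Python. This is a toy illustration, not the Nilsimsa reference code: `trigram_bucket` is a hypothetical stand-in for Hash() (CRC32 reduced to 8 bits), the sketch uses all 10 ordered 3-character subsequences of each window instead of the 8 trigrams named on the slide, and it sets a digest bit when the accumulator entry exceeds the mean count, which is an assumption. Only `ncv` follows the slide's definition literally.

```python
import zlib
from itertools import combinations

WINDOW = 5          # N on the slide
DIGEST_BITS = 256   # digest length on the slide


def trigram_bucket(trigram: str) -> int:
    """Hypothetical stand-in for Hash(): maps a trigram to 0..255."""
    return zlib.crc32(trigram.encode("utf-8")) & 0xFF


def toy_digest(text: str) -> int:
    """Return a 256-bit digest packed into a Python int."""
    acc = [0] * DIGEST_BITS
    for start in range(len(text) - WINDOW + 1):       # L - N + 1 steps
        window = text[start:start + WINDOW]
        # Assumption: every ordered 3-character subsequence of the window.
        for i, j, k in combinations(range(WINDOW), 3):
            acc[trigram_bucket(window[i] + window[j] + window[k])] += 1
    threshold = sum(acc) / DIGEST_BITS                # assumption: mean count
    digest = 0
    for position, count in enumerate(acc):
        if count > threshold:
            digest |= 1 << position
    return digest


def ncv(a: int, b: int) -> int:
    """Nilsimsa Compare Value: number of equal bit positions, minus 128."""
    differing = bin(a ^ b).count("1")
    return (DIGEST_BITS - differing) - 128
```

Packing the digest into an integer keeps the comparison to one XOR and a bit count; `ncv(d, d)` is 128 for any digest `d`, matching the "identical emails" case.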
“Open Digest” paper experiments and results
Evaluation <= experiment:
- spam bulk detection <= detection of similarity between two emails from the same spam bulk
- ham misdetection <= misdetection of similarity between unrelated emails
Bulk detection experiment (repeated many times, to get statistics): spam corpus -> select at random -> obfuscate (2 copies) -> compute digests (010110…10, 011010…11) -> compare: NCV value (integer) -> evaluate similarity, NCV > Threshold=54: matching indicator (0/1).
[Figure: OD-paper result for the “adding random text” obfuscation.]
The OD-paper only evaluates (talks about) the average NCV.
OD-paper conclusion: average NCV > Threshold => bulk detection is resistant to strong obfuscation by the spammer.
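A minimal sketch of one way to script the bulk detection experiment. The digest and comparison functions are passed in as `digest_fn` and `ncv_fn`, so any implementation can be plugged in. The helper names, the definition of the obfuscation ratio (added length divided by original length) and the choice to append the random text at the end of the email are assumptions made for illustration; the slides do not specify them.

```python
import random
import string


def add_random_text(email: str, ratio: float, rng: random.Random) -> str:
    """Append random characters; ratio = added length / original length (assumed)."""
    extra = int(round(ratio * len(email)))
    alphabet = string.ascii_lowercase + " "
    return email + "".join(rng.choice(alphabet) for _ in range(extra))


def bulk_trial(corpus, ratio, digest_fn, ncv_fn, rng, threshold=54):
    """One repetition: returns (NCV value, matching indicator 0/1)."""
    email = rng.choice(corpus)                        # select at random
    copy_a = add_random_text(email, ratio, rng)       # obfuscate (2 copies)
    copy_b = add_random_text(email, ratio, rng)
    value = ncv_fn(digest_fn(copy_a), digest_fn(copy_b))
    return value, int(value > threshold)


def run_experiment(corpus, ratio, digest_fn, ncv_fn, repetitions=100, seed=0):
    """Repeat the trial; return (average NCV, fraction of matching trials)."""
    rng = random.Random(seed)
    results = [bulk_trial(corpus, ratio, digest_fn, ncv_fn, rng)
               for _ in range(repetitions)]
    average_ncv = sum(value for value, _ in results) / repetitions
    match_rate = sum(flag for _, flag in results) / repetitions
    return average_ncv, match_rate
```

Returning both the average NCV and the fraction of matching trials makes the difference between the two metrics visible: the average can stay above the threshold while a sizeable fraction of individual trials falls below it.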
“Open Digest” paper experiments and results (cont.)
Ham misdetection experiment: ham and spam corpus -> for each pair of unrelated emails -> compute digests (100110…10, 011100…11) -> compare: NCV values (integer) -> evaluate similarity, NCV > Threshold=54: matching indicators (0/1).
OD-paper result: with n1~2500 and n2~2500 emails, no matching (misdetection) case is observed.
OD-paper conclusion: misdetection of good emails must be very low; approximating the misdetection probability by use of the Binomial distribution supports the observed result.
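The Binomial approximation can be made concrete under one simplifying assumption that is ours, not necessarily the OD-paper's exact model: for two unrelated emails, each of the 256 digest bit positions agrees independently with probability 1/2. The number of equal bits is then Binomial(256, 1/2), and NCV > 54 means at least 183 equal bits. The function names below are hypothetical.

```python
from fractions import Fraction
from math import comb

DIGEST_BITS = 256


def prob_ncv_above(threshold: int, bits: int = DIGEST_BITS) -> Fraction:
    """P(NCV > threshold) when the equal-bit count is Binomial(bits, 1/2)."""
    # NCV = equal - bits/2, so NCV > threshold <=> equal >= threshold + bits/2 + 1
    min_equal = max(threshold + bits // 2 + 1, 0)
    favourable = sum(comb(bits, k) for k in range(min_equal, bits + 1))
    return Fraction(favourable, 2 ** bits)


def expected_matches(pair_probability: Fraction, pairs: int) -> Fraction:
    """Expected number of matching pairs among `pairs` comparisons."""
    return pair_probability * pairs
```

Calling `float(prob_ncv_above(54))` gives the per-pair misdetection probability under this idealized model. The later slides show that real ham digests behave differently (shifted and wide NCV histograms), which is why the model is too optimistic.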
Extending OD-paper experiments: spam bulk detection
Bulk detection experiment, identical to the one in the OD-paper (repeated many times, to get statistics), but we test higher obfuscation ratios: spam corpus -> select at random -> obfuscate (2 copies) -> compute digests (010110…10, 011010…11) -> compare: NCV value (integer) -> evaluate similarity, NCV > Threshold=54: matching indicator (0/1).
[Figure: the OD-paper result is well reproduced (blue dotted line).]
The OD-paper conclusion does not hold! Even a slightly higher obfuscation ratio brings the average NCV below the threshold.
Understanding better what happens
“Compare X to Database” (generic experiment):
- X: selected at random from EITHER Ham Corpus 1/2 (ham to filter) OR Spam Corpus (Obfuscation 1); compute its digest (010110…10).
- Database (DB) of spam and ham digests (represents “previous digest queries”): built from Spam Corpus (Obfuscation 2) and Ham Corpus 2/2 (sizes n1, n2).
- Compare the digest of X to each digest from the DB: NCV values (integer); NCV > Threshold=54: matching indicators (0/1).
We look at more metrics:
- probability of email-to-email matching
- Max(NCV) average
- NCV histogram
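The three metrics can be computed from the raw NCV values alone. The sketch below assumes that the NCV values of each query X against every DB digest are already available as lists of integers; the function names and the dictionary keys are hypothetical.

```python
from collections import Counter


def summarize_query(ncv_values, threshold=54):
    """Metrics for one query X compared to every digest in the DB."""
    indicators = [int(v > threshold) for v in ncv_values]
    return {
        "max_ncv": max(ncv_values),
        "match_fraction": sum(indicators) / len(indicators),
        "matches_db": int(any(indicators)),
        "histogram": Counter(ncv_values),
    }


def summarize_experiment(per_query_ncvs, threshold=54):
    """Aggregate the per-query metrics over many queries X."""
    summaries = [summarize_query(v, threshold) for v in per_query_ncvs]
    histogram = Counter()
    for summary in summaries:
        histogram.update(summary["histogram"])
    n = len(summaries)
    return {
        "mean_max_ncv": sum(s["max_ncv"] for s in summaries) / n,
        "email_to_email_match_prob": sum(s["match_fraction"] for s in summaries) / n,
        "histogram": histogram,
    }
```

For example, `summarize_query([10, 60, 54, -3])` reports a maximum NCV of 60 and a match fraction of 0.25, since only 60 exceeds the threshold of 54.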
SPAM – DB experiment results:
- The mean Max(NCV) value is not informative.
- The effect of obfuscation changes gracefully.
- The spammer may gain by additional obfuscation.
SPAM – DB, NCV histograms: effect of obfuscation
Small obfuscation: digests are still useful for bulk detection.
SPAM – DB, NCV histograms: effect of obfuscation
Stronger obfuscation: most of the digests are rendered useless!
HAM – DB experiment results:
- The mean Max(NCV) value is not informative.
- The misdetection probability is still too high for practical use.
HAM – DB, NCV histograms: effect of obfuscation
Spam obfuscation does not impact misdetection of good emails.
The shifted and wide histograms explain the high false positive rate.
Alternative digests
Sampling strings: fixed length, random positions; one digest per sampled string (011010…11, 101110…11, 001010…10).
Email-to-email matching: maximum NCV over pairs of digests (find how similar the most similar parts are, e.g. spammy phrases).
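A possible sketch of the alternative digests: cut several fixed-length strings at random positions, digest each one, and score two emails by the best-matching pair of digests. The sample length of 60 characters and the count of 5 samples are placeholder values, not the ones used in our experiments; `digest_fn` and `ncv_fn` are whatever digest and comparison functions are in use.

```python
import random


def sample_strings(text, length, count, rng):
    """Cut `count` substrings of fixed `length` at random positions."""
    if len(text) <= length:
        return [text]                     # email too short: use it whole
    starts = [rng.randrange(len(text) - length + 1) for _ in range(count)]
    return [text[s:s + length] for s in starts]


def alt_digests(text, digest_fn, length=60, count=5, seed=0):
    """One digest per sampled string (length and count are placeholders)."""
    rng = random.Random(seed)
    return [digest_fn(s) for s in sample_strings(text, length, count, rng)]


def email_to_email_ncv(digests_a, digests_b, ncv_fn):
    """Email-to-email score: maximum NCV over all pairs of digests."""
    return max(ncv_fn(a, b) for a in digests_a for b in digests_b)
```

Taking the maximum means that a single shared passage is enough to produce a high score, even when random text added elsewhere in the email dilutes a whole-email digest.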
SPAM – DB experiment results (alt. digests)
Spam bulk detection is no longer vulnerable to obfuscation...
SPAM – DB (alt. digests): effect of obfuscation
… and we can see why.
HAM – DB experiment results (alt. digests):
The misdetection probability is still too high.
HAM – DB (alt. digests): effect of obfuscation
What can be done to decrease ham misdetection?
Alternative digests open new possibilities
New email -> digest(s) -> negative selection against a database of good digests -> digests that do not match -> compare to the collaborative database of digests (DB).
The comparison to the collaborative DB is the same as without negative selection.
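The negative selection step can be sketched as a filter placed in front of the usual DB comparison. Reusing the threshold of 54 for the comparison against the good digests is an assumption made for illustration, and the function names are hypothetical.

```python
def negative_selection(digests, good_digests, ncv_fn, threshold=54):
    """Keep only the digests that do NOT match any digest in the good database."""
    return [d for d in digests
            if all(ncv_fn(d, g) <= threshold for g in good_digests)]


def is_bulky(digests, good_digests, collaborative_db, ncv_fn, threshold=54):
    """Negative selection first, then the usual comparison to the DB."""
    survivors = negative_selection(digests, good_digests, ncv_fn, threshold)
    return any(ncv_fn(d, other) > threshold
               for d in survivors for other in collaborative_db)
```

If every digest of an email matches the good database, nothing survives and the email cannot be flagged. This is the intended effect on ham, whose digests tend to resemble previously seen good content.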
Effect of negative selection on misdetection of ham:
Conclusions
- Use of proper metrics is crucial for drawing proper conclusions from experiments.
- Alternative digests provide much better results, and by use of NCV histograms we understand why.
- Use of proper metrics is crucial for understanding what happens…
- … and for understanding how to fix the problems.