Improving Digest-Based Collaborative Spam Detection


Improving Digest-Based Collaborative Spam Detection
Slavisa Sarafijanovic, Sabrina Perez, Jean-Yves Le Boudec
EPFL, Switzerland, MICS
MIT Spam Conference, Mar 27-28, 2008, MIT, Cambridge.

Talk content
- Digest-based filtering: global picture overview
- Understanding “HOW Digests WORK”: the “Open Digest” paper [1] (very positive results and conclusions, widely cited)
- Understanding it better: our re-evaluation of the “Open Digest” paper results (different conclusions!)
- Our alternative digests: results IMPROVE a lot, understanding “WHY”
- Understanding the “why” => further improvements possible (negative selection)
- Conclusions

[1] E. Damiani, S. De Capitani di Vimercati, S. Paraboschi, P. Samarati, “An Open Digest-based Technique for Spam Detection”, in Proc. of the 2004 International Workshop on Security in Parallel and Distributed Systems, San Francisco, CA, USA, September 15-17, 2004.

Two main collaborative spam detection approaches
1) White-listing using Social Networks
   - Diagram: users (User 1, User 2, ..., User n) linked by relationships
   - Example: PGP graph of certificates
2) Bulky Content Detection using Digests
   - Diagram: users (User 1, User 2, User 3, ..., User n) sharing recent digests
   - Examples: DCC, Vipul’s Razor, Commtouch
Implementations (in both cases): centralized or decentralized, open or proprietary
This talk (paper): digests approach for bulky content detection

A Real Digest-Based System: DCC (Distributed Checksum Clearinghouse)
Architecture (diagram):
- ~ 250 DCC servers
- ~ n * 10 000 mail servers (MS)
- ~ n * millions of mail users (MC)
- Query = digest, Reply = counter (n=3)
- Spammer (sends in bulk)
Strengths/drawbacks:
- fast response
- not precise (FP problems)
- limited obfuscation resistance
Reproducible evaluation of digests-efficiency: “Open Digest” Paper

Producing Digests: Nilsimsa similarity hashing as explained in OD-paper
- The e-mail is L characters long (example text: “Cheapest vac... Best Regards, John”).
- A sliding window of N=5 characters moves over the text (first window: “Cheap”).
- Each window yields 8 trigrams (1: Che, 2: Cha, …, 8: hea).
- Each trigram is hashed, Hash: 30^3 -> 2^8, giving an 8-bit value b7 ... b0 (e.g. 00001111 = 15).
- The hash value selects one of the 256 accumulator positions (0 ... 255), which is incremented by 1.
- After L-N+1 steps, the accumulator is converted to the digest, one bit per accumulator position.
- Digest is a binary string of 256 bits.
Definition: Nilsimsa Compare Value (NCV) between two digests is equal to the number of bits at corresponding positions that are equal, minus 128.
- Identical emails => NCV=128, unrelated emails => NCV close to 0.
- More similar emails => more similar digests => higher NCV
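The digest and NCV computation on this slide can be sketched in Python. This is a minimal illustration, not the Nilsimsa reference implementation. Three things are assumptions made for the example: the trigram hash (one byte of a CRC32), the use of all 10 order-preserving 3-character combinations per window (the slide lists 8), and the accumulator-to-bit rule (bit = 1 when the count exceeds the mean count).

```python
import zlib
from itertools import combinations

WINDOW = 5     # N on the slide
BUCKETS = 256  # one accumulator per digest bit


def trigram_bucket(trigram):
    """Map a trigram to an accumulator index 0..255 (stand-in hash, an assumption)."""
    return zlib.crc32(trigram.encode("utf-8")) & 0xFF


def nilsimsa_like_digest(text):
    """Return a list of 256 bits computed from sliding-window trigrams."""
    acc = [0] * BUCKETS
    # L - N + 1 window positions for a text of L characters
    for start in range(len(text) - WINDOW + 1):
        window = text[start:start + WINDOW]
        # Assumption: every order-preserving 3-character combination is a trigram
        for tri in combinations(window, 3):
            acc[trigram_bucket("".join(tri))] += 1
    mean = sum(acc) / BUCKETS
    # Assumption: a bit is set when its accumulator exceeds the mean count
    return [1 if count > mean else 0 for count in acc]


def ncv(digest_a, digest_b):
    """Nilsimsa Compare Value: number of equal bit positions, minus 128."""
    equal = sum(1 for a, b in zip(digest_a, digest_b) if a == b)
    return equal - 128
```

Comparing a digest with itself gives NCV = 128, matching the "identical emails" case on the slide.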

“Open Digest” paper experiments and results
Evaluation <= experiment:
- spam bulk detection <= detection of similarity between two emails from the same spam bulk
- ham miss-detection <= miss-detection of similarity between unrelated emails
Bulk detection experiment (repeated many times, to get statistics):
Spam Corpus -> select at random -> obfuscate (2 copies) -> compute digests (010110…10, 011010…11) -> compare -> NCV value (integer) -> evaluate similarity > Threshold=54 -> matching indicator (0/1)
OD-paper result for “adding random text” obfuscation (figure):
- OD-paper only evaluates (talks about) the average NCV
- OD-paper conclusion: Average NCV > Threshold => bulk detection resistant to strong obfuscation by spammer

“Open Digest” paper experiments and results (cont.)
Ham miss-detection experiment:
Ham and Spam Corpus (n1~2500, n2~2500 emails) -> for each pair of unrelated emails -> compute digests (100110…10, 011100…11) -> compare -> NCV values (integer) -> evaluate similarity > Threshold=54 -> matching indicators (0/1)
OD-paper result: no matching (miss-detection) case is observed
OD-paper conclusion: Miss-detection of good emails must be very low
- approximating miss-detection probability by use of Binomial distribution supports the observed result
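The Binomial approximation mentioned on this slide can be worked through as follows. The sketch assumes, as a simplification, that each of the 256 bit positions of two unrelated digests agrees independently with probability 1/2. A match requires NCV > 54, i.e. at least 183 equal bits. The function name is hypothetical.

```python
from math import comb


def prob_ncv_above(threshold, bits=256):
    """P(NCV > threshold) when equal-bit count ~ Binomial(bits, 1/2)."""
    # NCV = equal_bits - bits/2, so NCV > threshold means
    # equal_bits >= threshold + bits/2 + 1
    min_equal = max(threshold + bits // 2 + 1, 0)
    tail = sum(comb(bits, k) for k in range(min_equal, bits + 1))
    return tail / 2 ** bits


if __name__ == "__main__":
    print(prob_ncv_above(54))
```

Under this idealized model the tail probability is very small, which is consistent with observing no match in a few thousand emails. It says nothing about digests whose bits are correlated or biased.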

Extending OD-paper experiments: spam bulk detection
Bulk detection experiment, identical to the one in OD-paper, but we test higher obfuscation ratios (repeated many times, to get statistics):
Spam Corpus -> select at random -> obfuscate (2 copies) -> compute digests (010110…10, 011010…11) -> compare -> NCV value (integer) -> evaluate similarity > Threshold=54 -> matching indicator (0/1)
Results (figure):
- The OD-paper result is reproduced well (blue dotted line)
- OD-paper conclusion does not hold! Even a slightly higher obfuscation ratio brings the average NCV below the threshold
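The experiment loop on this slide can be sketched as below. The slides do not define the obfuscation ratio or where the random text is inserted, so the sketch assumes the ratio is the length of appended random text relative to the original length. The digest and NCV functions are passed in as parameters (`digest_fn`, `ncv_fn`, both hypothetical names) so that any digest scheme can be plugged in.

```python
import random
import string


def add_random_text(text, ratio, rng):
    """Append random characters; ratio = added length / original length (assumption)."""
    extra = int(len(text) * ratio)
    alphabet = string.ascii_lowercase + " "
    return text + "".join(rng.choice(alphabet) for _ in range(extra))


def bulk_detection_experiment(corpus, digest_fn, ncv_fn, ratio,
                              trials=100, threshold=54, seed=0):
    """Return (average NCV, fraction of trials with NCV > threshold)."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        original = rng.choice(corpus)                 # select at random
        copy1 = add_random_text(original, ratio, rng)  # obfuscate (2 copies)
        copy2 = add_random_text(original, ratio, rng)
        values.append(ncv_fn(digest_fn(copy1), digest_fn(copy2)))
    matches = [1 if v > threshold else 0 for v in values]
    return sum(values) / len(values), sum(matches) / len(matches)
```

Returning the match fraction next to the average makes it visible when an average NCV near the threshold hides a large share of unmatched pairs.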

Understanding better what happens
“Compare X to Database” (generic experiment):
- X: selected at random from EITHER Ham Corpus 1/2 (ham to filter) OR Spam Corpus (Obfuscation 1)
- Database (DB) of spam and ham digests (represents “previous digest queries”), built from Spam Corpus (Obfuscation 2) and Ham Corpus 2/2 (sizes n1, n2)
- X -> compute digest (010110…10) -> compare to each from DB -> NCV values (integer) -> > Threshold=54 -> matching indicators (0/1)
We look at more metrics:
- Probability of email-to-email matching
- Max(NCV) average
- NCV histogram
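The metrics listed on this slide can be computed for one query X as in the following sketch. Digests are assumed to be stored as 256-bit Python integers, and the matching probability is estimated as the fraction of DB entries whose NCV exceeds the threshold. Function and key names are illustrative.

```python
from collections import Counter

BITS = 256
THRESHOLD = 54


def ncv(a, b):
    """NCV of two 256-bit digests held as ints: equal bits minus 128."""
    differing = bin(a ^ b).count("1")
    return (BITS - differing) - BITS // 2


def compare_to_db(query, db, threshold=THRESHOLD):
    """Compare one digest to every digest in a non-empty DB and summarize."""
    values = [ncv(query, entry) for entry in db]
    matches = [1 if v > threshold else 0 for v in values]
    return {
        "ncv_values": values,                          # integers
        "matches": matches,                            # indicators (0/1)
        "max_ncv": max(values),                        # Max(NCV) for this query
        "match_fraction": sum(matches) / len(matches),
        "histogram": Counter(values),                  # NCV value -> count
    }
```

Averaging `max_ncv` over many queries gives the mean Max(NCV), and merging the per-query histograms gives the overall NCV histogram.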

SPAM – DB experiment results:
- Mean Max(NCV) value not informative
- Effect of obfuscation changes gracefully
- Spammer may gain by additional obfuscation.

SPAM – DB, NCV histograms: effect of obfuscation
Small obfuscation: digests are still useful for bulk detection

SPAM – DB, NCV histograms: effect of obfuscation
Stronger obfuscation: most of the digests are rendered useless!

HAM – DB experiment results:
- Mean Max(NCV) value not informative
- Miss-detection probability still too high for practical use

HAM – DB, NCV histograms: effect of obfuscation
- Spam obfuscation does not impact miss-detection of good emails.
- The histograms are shifted and wide => this explains the high false positives

Alternative digests
- Sampling strings: fixed length, random positions, one digest per sampled string (011010…11, 101110…11, 001010…10)
- Email-to-email matching: max NCV over pairs of digests (find how similar the most similar parts are, e.g. spammy phrases)
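A minimal sketch of the alternative digests follows, under stated assumptions: the sample length and the number of samples are not given on the slide and are placeholders here, digests are 256-bit integers, and `digest_fn` is a hypothetical parameter standing for whatever similarity hash is applied to each sampled string.

```python
import random

BITS = 256


def ncv(a, b):
    """NCV of two 256-bit digests held as ints: equal bits minus 128."""
    return (BITS - bin(a ^ b).count("1")) - BITS // 2


def sample_strings(text, length, count, rng):
    """Cut `count` substrings of fixed `length` at random positions."""
    if len(text) <= length:
        return [text]  # text too short to sample: use it whole
    last_start = len(text) - length
    starts = [rng.randint(0, last_start) for _ in range(count)]
    return [text[s:s + length] for s in starts]


def alt_digests(text, digest_fn, length=60, count=10, seed=0):
    """One digest per sampled string (length and count are placeholders)."""
    rng = random.Random(seed)
    return [digest_fn(s) for s in sample_strings(text, length, count, rng)]


def email_to_email_ncv(digests_a, digests_b):
    """Similarity of two emails: max NCV over all pairs of their digests."""
    return max(ncv(a, b) for a in digests_a for b in digests_b)
```

Taking the maximum means that one well-preserved fragment is enough to link two emails, however much random text surrounds it.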

SPAM – DB experiment results (alt. digests)
Spam bulk detection is no longer vulnerable to obfuscation...

SPAM – DB (alt. digests): effect of obfuscation … and we can see why it is like that.

HAM – DB experiment results (alt. digests): miss-detection probability still too high

HAM – DB (alt. digests) effect of obfuscation: What can be done to decrease ham miss-detection?

Alternative digests open new possibilities
Negative selection:
- New email -> digest(s) -> compare to database of good digests -> keep only the digests that do not match
- Compare the remaining digests to the collaborative database of digests (DB). This part is the same as without negative selection
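The negative selection step can be sketched as a filter placed in front of the collaborative lookup. The assumptions here are that the same threshold (54) is used against the database of good digests, that digests are 256-bit integers, and that an email is flagged when any surviving digest matches the collaborative DB. All function names are illustrative.

```python
BITS = 256
THRESHOLD = 54


def ncv(a, b):
    """NCV of two 256-bit digests held as ints: equal bits minus 128."""
    return (BITS - bin(a ^ b).count("1")) - BITS // 2


def matches_any(digest, database, threshold=THRESHOLD):
    """True if the digest is similar to at least one database entry."""
    return any(ncv(digest, entry) > threshold for entry in database)


def negative_selection(email_digests, good_digests, threshold=THRESHOLD):
    """Drop digests that match known-good content; keep the rest."""
    return [d for d in email_digests
            if not matches_any(d, good_digests, threshold)]


def looks_bulky(email_digests, good_digests, collaborative_db,
                threshold=THRESHOLD):
    """Check only the surviving digests against the collaborative DB."""
    survivors = negative_selection(email_digests, good_digests, threshold)
    return any(matches_any(d, collaborative_db, threshold) for d in survivors)
```

A digest that resembles known-good text never reaches the collaborative comparison, so it cannot cause a match there.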

Effect of negative selection on miss-detection of ham:

Conclusions
- Use of proper metrics is crucial for proper conclusions from experiments.
- Alternative digests provide much better results, and by use of NCV histograms we understand why.
- Use of proper metrics is crucial for understanding what happens… and for understanding how to fix the problems.