Improving Digest-Based Collaborative Spam Detection

Presentation transcript:

Improving Digest-Based Collaborative Spam Detection
Slavisa Sarafijanovic, Sabrina Perez, Jean-Yves Le Boudec
EPFL, Switzerland (MICS)
MIT Spam Conference, Mar 27-28, 2008, MIT, Cambridge.

Talk content
- Digest-based filtering: global picture overview
- Understanding “HOW digests WORK”: the “Open Digest” paper [1] (very positive results and conclusions, widely cited)
- Understanding it better: our re-evaluation of the “Open Digest” paper results (different conclusions!)
- Our alternative digests: results IMPROVE a lot, and we understand “WHY”
- Understanding the “why” => further improvements possible (negative selection)
- Conclusions

[1] E. Damiani, S. De Capitani di Vimercati, S. Paraboschi, P. Samarati, “An Open Digest-based Technique for Spam Detection”, in Proc. of the 2004 International Workshop on Security in Parallel and Distributed Systems, San Francisco, CA, USA, September 15-17, 2004.

Two main collaborative spam detection approaches
1) White-listing using social networks: users (User 1, User 2, ... User n) are linked by relationships. Example: PGP graph of certificates.
2) Bulky content detection using digests: users (User 1, User 2, User 3, ... User n) share recent digests. Examples: DCC, Vipul’s Razor, Commtouch.
Implementations (in both cases): centralized or decentralized, open or proprietary.
This talk (paper): the digest approach for bulky content detection.

A Real Digest-Based System: DCC (Distributed Checksum Clearinghouse)
[Diagram: ~250 DCC servers (s); ~n * 10 000 mail servers (MS); ~n * millions of mail users (MC); Query = digest, Reply = counter (n=3); a spammer sends in bulk.]
Strengths/drawbacks:
- fast response
- not precise (FP problems)
- limited obfuscation resistance
Reproducible evaluation of digest efficiency: the “Open Digest” paper.

Producing Digests: Nilsimsa similarity hashing as explained in OD-paper
- A sliding window of N=5 characters moves over the e-mail, which is L characters long (example text: “Cheapest vac... Best Regards, John”).
- At each window position (e.g. “Cheap”), 8 trigrams are taken from the window: 1: Che, 2: Cha, … 8: hea.
- Each trigram is hashed to 8 bits (Hash: 30^3 -> 2^8, bits b7 ... b0, e.g. 00001111 = 15), and the accumulator cell with that index (0 ... 255) is incremented by 1.
- After L-N+1 steps, the accumulator (cells 0 ... 255) is turned into the digest.
- The digest is a binary string of 256 bits.
Definition: the Nilsimsa Compare Value (NCV) between two digests is equal to the number of bits at corresponding positions that are equal, minus 128.
Identical emails => NCV=128, unrelated emails => NCV close to 0.
More similar emails => more similar digests => higher NCV.
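The digest construction and the NCV definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Nilsimsa reference code: CRC32 stands in for Nilsimsa's own trigram hash, all order-preserving 3-character picks from the window are used (the slide shows 8 trigrams per window without saying which ones), and a digest bit is set when its accumulator cell exceeds the mean count. The names `trigram_bucket` and `nilsimsa_like_digest` are hypothetical.

```python
import zlib
from itertools import combinations

WINDOW = 5         # N: sliding-window length from the slide
DIGEST_BITS = 256  # one digest bit per accumulator cell (2^8 cells)


def trigram_bucket(trigram):
    """Map a trigram to one of 256 accumulator cells.

    Assumption: CRC32 is a deterministic stand-in for Nilsimsa's own hash.
    """
    return zlib.crc32(trigram.encode("utf-8")) & 0xFF


def nilsimsa_like_digest(text):
    """Return the digest of `text` as a list of 256 bits (0 or 1)."""
    acc = [0] * DIGEST_BITS
    for start in range(len(text) - WINDOW + 1):  # L - N + 1 window positions
        window = text[start:start + WINDOW]
        # Assumption: every order-preserving pick of 3 window characters.
        for picked in combinations(window, 3):
            acc[trigram_bucket("".join(picked))] += 1
    mean = sum(acc) / DIGEST_BITS
    # Assumption: a bit is 1 when its cell count exceeds the mean count.
    return [1 if count > mean else 0 for count in acc]


def ncv(digest_a, digest_b):
    """Nilsimsa Compare Value: number of equal bit positions minus 128."""
    equal = sum(1 for a, b in zip(digest_a, digest_b) if a == b)
    return equal - DIGEST_BITS // 2
```

Comparing a digest with itself gives NCV = 128, as stated in the definition; what two unrelated emails give depends on the hash and on the thresholding rule.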

“Open Digest” paper experiments and results
Evaluation <= experiment:
- spam bulk detection <= detection of similarity between two emails from the same spam bulk
- ham misdetection <= misdetection of similarity between unrelated emails
Bulk detection experiment (repeated many times, to get statistics):
Spam corpus -> select at random -> obfuscate (2 copies) -> compute digests (010110…10, 011010…11) -> compare -> evaluate similarity: NCV value (integer) > Threshold=54 -> matching indicator (0/1)
OD-paper result for “adding random text” obfuscation: [plot]
The OD-paper only evaluates (talks about) the average NCV.
OD-paper conclusion: average NCV > Threshold => bulk detection is resistant to strong obfuscation by the spammer.

“Open Digest” paper experiments and results (cont.)
Ham misdetection experiment:
Ham and spam corpus (n1~2500, n2~2500 emails) -> for each pair of unrelated emails: compute digests (100110…10, 011100…11) -> compare -> evaluate similarity: NCV values (integer) > Threshold=54 -> matching indicators (0/1)
OD-paper result: no matching (misdetection) case is observed.
OD-paper conclusion: misdetection of good emails must be very low. Approximating the misdetection probability by use of the Binomial distribution supports the observed result.
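The Binomial approximation mentioned above can be illustrated as follows. The sketch assumes that the 256 bit positions of two unrelated digests agree independently, each with probability 1/2; this model is an assumption of the example (the slide does not give the exact model used in the OD-paper), and the function name is hypothetical.

```python
from math import comb

DIGEST_BITS = 256
THRESHOLD = 54


def misdetection_probability(threshold=THRESHOLD, bits=DIGEST_BITS, p_equal=0.5):
    """P(NCV > threshold) if bit pairs agree independently with p_equal."""
    # NCV = equal_bits - bits/2, so NCV > 54 means equal_bits >= 183.
    min_equal_bits = threshold + bits // 2 + 1
    return sum(
        comb(bits, k) * p_equal ** k * (1 - p_equal) ** (bits - k)
        for k in range(min_equal_bits, bits + 1)
    )
```

Under this model the number of equal bits has mean 128 and standard deviation 8, so the threshold sits almost 7 standard deviations above the mean and the tail probability is tiny. The result is very sensitive to `p_equal`: if unrelated digests agree on noticeably more than half of their bits, the probability grows by many orders of magnitude.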

Extending OD-paper experiments: spam bulk detection
Bulk detection experiment, identical to the one in the OD-paper, but we test higher obfuscation ratios (repeated many times, to get statistics):
Spam corpus -> select at random -> obfuscate (2 copies) -> compute digests (010110…10, 011010…11) -> compare -> evaluate similarity: NCV value (integer) > Threshold=54 -> matching indicator (0/1)
The OD-paper result is well recovered (blue dotted line).
The OD-paper conclusion does not hold! Even a slightly higher obfuscation ratio brings the average NCV below the threshold.
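A minimal sketch of this experiment loop is given below. It reports both the average NCV and the fraction of trials above the threshold, since the two can tell different stories. Assumptions of the sketch: the obfuscation ratio is taken to be the length of the appended random text divided by the original length, the random alphabet is lowercase letters plus space, and `digest_fn` and `ncv_fn` are caller-supplied functions (any digest and compare function with the semantics described earlier in the talk).

```python
import random
import string

THRESHOLD = 54


def add_random_text(email, ratio, rng):
    """Append random characters to `email`.

    Assumption: ratio = appended length / original length.
    """
    n_added = int(len(email) * ratio)
    alphabet = string.ascii_lowercase + " "
    return email + "".join(rng.choice(alphabet) for _ in range(n_added))


def bulk_detection_stats(spam_corpus, ratio, digest_fn, ncv_fn, trials=100, seed=0):
    """Return (average NCV, fraction of trials with NCV > THRESHOLD)."""
    rng = random.Random(seed)
    ncvs = []
    for _ in range(trials):
        email = rng.choice(spam_corpus)           # select at random
        copy_a = add_random_text(email, ratio, rng)  # obfuscate (2 copies)
        copy_b = add_random_text(email, ratio, rng)
        ncvs.append(ncv_fn(digest_fn(copy_a), digest_fn(copy_b)))
    average_ncv = sum(ncvs) / len(ncvs)
    match_fraction = sum(1 for v in ncvs if v > THRESHOLD) / len(ncvs)
    return average_ncv, match_fraction
```

Calling `bulk_detection_stats` for a range of `ratio` values gives one curve for the average NCV and one for the matching fraction, which can then be compared against the threshold.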

Understanding better what happens
“Compare X to Database” (generic experiment):
- X: EITHER Ham Corpus 1/2 (ham to filter) OR Spam Corpus (Obfuscation 1); select at random, compute digest (010110…10).
- Database: DB of spam and ham digests (represents “previous digest queries”), built from Spam Corpus (Obfuscation 2) and Ham Corpus 2/2 (sizes n1, n2).
- Compare the digest of X to each digest from the DB: NCV values (integer) > Threshold=54 -> matching indicators (0/1).
We look at more metrics:
- probability of email-to-email matching
- average Max(NCV)
- NCV histogram
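The three metrics can be computed from the raw NCV values as sketched below. For compactness the sketch stores each digest as a 256-bit Python integer and counts differing bits with XOR; the function names and the dictionary keys are illustrative choices, not taken from the paper.

```python
from collections import Counter

DIGEST_BITS = 256
THRESHOLD = 54


def ncv(digest_a, digest_b):
    """NCV for digests stored as 256-bit integers: equal bits minus 128."""
    differing = bin(digest_a ^ digest_b).count("1")
    return (DIGEST_BITS - differing) - DIGEST_BITS // 2


def compare_to_database(query_digests, db_digests, threshold=THRESHOLD):
    """Compare each query digest X to every DB digest and summarize."""
    all_ncvs = []
    max_ncvs = []
    for query in query_digests:
        row = [ncv(query, stored) for stored in db_digests]
        all_ncvs.extend(row)
        max_ncvs.append(max(row))
    return {
        # fraction of (X, DB entry) pairs whose NCV exceeds the threshold
        "match_probability": sum(1 for v in all_ncvs if v > threshold) / len(all_ncvs),
        # average over X of the best NCV found in the DB
        "mean_max_ncv": sum(max_ncvs) / len(max_ncvs),
        # how often each NCV value occurs over all pairs
        "histogram": Counter(all_ncvs),
    }
```

The histogram keeps the whole distribution, so effects such as a shifted or widened distribution stay visible even when the mean of Max(NCV) barely moves.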

SPAM – DB experiment results:
- Mean Max(NCV) value is not informative.
- The effect of obfuscation changes gracefully.
- The spammer may gain by additional obfuscation.

SPAM – DB, NCV histograms: effect of obfuscation. Small obfuscation: digests are still useful for bulk detection.

SPAM – DB, NCV histograms: effect of obfuscation. Stronger obfuscation: most of the digests are rendered useless!

HAM – DB experiment results:
- Mean Max(NCV) value is not informative.
- Misdetection probability is still too high for practical use.

HAM – DB, NCV histograms: effect of obfuscation.
- Spam obfuscation does not impact misdetection of good emails.
- Shifted and wide histograms => high false positives explained.

Alternative digests
Sampling strings: fixed length, random positions; one digest per sampled string (e.g. 011010…11, 101110…11, 001010…10).
Email-to-email matching: max NCV over pairs of digests (find how similar the most similar parts are, e.g. spammy phrases).
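A possible reading of the alternative digests in code: cut several fixed-length strings at random positions, digest each one, and score two emails by the best-matching pair of digests. The string length (60) and the number of samples (10) are placeholder values chosen for the example, since the slide does not state them; `digest_fn` and `ncv_fn` are caller-supplied, and all function names are hypothetical.

```python
import random


def sample_strings(email, length, count, rng):
    """Cut `count` substrings of fixed `length` at random positions."""
    if len(email) <= length:
        return [email]  # email too short to sample from: use it whole
    starts = [rng.randrange(len(email) - length + 1) for _ in range(count)]
    return [email[s:s + length] for s in starts]


def email_digests(email, digest_fn, length=60, count=10, seed=0):
    """One digest per sampled string (length and count are assumed values)."""
    rng = random.Random(seed)
    return [digest_fn(s) for s in sample_strings(email, length, count, rng)]


def email_to_email_ncv(digests_a, digests_b, ncv_fn):
    """Email-to-email score: maximum NCV over all pairs of digests."""
    return max(ncv_fn(a, b) for a in digests_a for b in digests_b)
```

Taking the maximum means that a single shared passage is enough for two emails to score high, even if the rest of each email is random filler.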

SPAM – DB experiment results (alt. digests): spam bulk detection is no longer vulnerable to obfuscation...

SPAM – DB (alt. digests): effect of obfuscation … and we can see why it is like that.

HAM – DB experiment results (alt. digests): misdetection probability still too high.

HAM – DB (alt. digests), effect of obfuscation: what can be done to decrease ham misdetection?

Alternative digests open new possibilities
New email -> digest(s) -> negative selection against a database of good digests -> digests that do not match -> compare to collaborative database of digests (DB).
The last part (comparison to the collaborative DB) is the same as without negative selection.
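The negative selection step can be sketched as a filter placed in front of the collaborative lookup. Assumptions of the sketch: the same threshold (54) is used for matching against the good digests, an email is flagged as soon as any surviving digest matches any DB entry (a real system such as DCC would use counters instead), and `ncv_fn` is caller-supplied. The function names are hypothetical.

```python
THRESHOLD = 54


def negative_selection(new_digests, good_digests, ncv_fn, threshold=THRESHOLD):
    """Keep only the digests that match no digest in the good-digest database."""
    return [
        digest
        for digest in new_digests
        if all(ncv_fn(digest, good) <= threshold for good in good_digests)
    ]


def is_bulky(new_digests, good_digests, collaborative_db, ncv_fn, threshold=THRESHOLD):
    """Compare only the surviving digests to the collaborative DB.

    Assumption: any single match above the threshold flags the email.
    """
    survivors = negative_selection(new_digests, good_digests, ncv_fn, threshold)
    return any(
        ncv_fn(digest, seen) > threshold
        for digest in survivors
        for seen in collaborative_db
    )
```

Digests that resemble known good content are dropped before the lookup, so they cannot produce a match against the collaborative DB.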

Effect of negative selection on misdetection of ham:

Conclusions
- Use of proper metrics is crucial for drawing proper conclusions from experiments.
- Alternative digests provide much better results, and by use of NCV histograms we understand why.
- Use of proper metrics is crucial for understanding what happens, and for understanding how to fix the problems.