Internet Level Spam Detection and SpamAssassin 2.50
Matt Sergeant, Senior Anti-Spam Technologist, MessageLabs

MessageLabs Details
- Scan mainly for companies but also personal
- Scanning >10m emails/day
- Originally anti-virus only; anti-spam is the main focus in the US
- Anti-virus scanning is a 100% solution (with guarantee)
- Anti-spam is about 95% accurate
- Different problem scale though: 1:200 emails vs 1:3 emails

How We Work
- MX records point to us
- Outgoing email points to us too
- 20+ processing racks worldwide
- Spam and viruses stopped before they enter your network

Technology
- Started off with SpamAssassin
  – It was the only decent spam scanner at the time
- Extended it, changed it, submitted patches, added custom code
- Became a lead SpamAssassin developer in the process
- Scared the pants off our business people in the process ;-)

SpamAssassin Intro
- SpamAssassin is a rules-based heuristic engine combined with a genetic algorithm blah blah blah...
- Reality: SpamAssassin is a framework for combining spam detection techniques
- Probably around 30m users

Spam Detection Stats

                 DNSBLs    Phrase Matching  Heuristics (SA)  Statistics
Accuracy         0 - 60%   80%              95%              99%+
False Positives  10%       2%               0.5%             0.1%

Why Not Just Use Statistics Then?
- 99% is only true for personal email
- Statistical techniques learn what your personal email looks like
- Doesn’t work quite as well when you have users with dissimilar inboxes
- Live data testing accuracy about %

Further Details
- Some users (bless them) like to receive stock reports, marketing reports, sales leaflets, offers, deals of the century, HTML, and every piece of junk you can imagine, all via email.
- Yes, these people do exist, and they are our customers!
- Their statistics db entries tilt the database, and they are often right to do so

What Can We Do?
- Statistics don’t consider all the details
  – Feature extraction is hard
- SpamAssassin examines a lot more of the email
  – Finer details of the headers
  – Regexps in the body text
  – Eval tests do things like “HTML tag percentage”
- So let’s combine the two
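To make the "eval test" idea concrete, here is a minimal sketch of an "HTML tag percentage" style feature. It is not SpamAssassin's actual implementation; the function name `html_tag_percentage` and the regular expression are assumptions chosen for illustration.

```python
import re

# Hypothetical tag matcher: anything between '<' and the next '>'.
TAG_RE = re.compile(r"<[^>]+>")


def html_tag_percentage(body: str) -> float:
    """Return the share of the body (0-100) that is made up of HTML tags."""
    if not body:
        return 0.0
    tag_chars = sum(len(match.group(0)) for match in TAG_RE.finditer(body))
    return 100.0 * tag_chars / len(body)


if __name__ == "__main__":
    print(html_tag_percentage("Hello, see you on Monday."))
    print(html_tag_percentage("<html><b>BUY NOW</b></html>"))
```

A rule built on this would fire when the percentage crosses some cutoff, which is the kind of detail a word-based statistical filter does not see directly.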

Aside: How Statistics Works
- Extract features from the email
- Look up how many times we’ve seen that feature before in spam and ham
- Create a probability for that feature
- Combine all the probabilities for all the features into an overall probability
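The four steps above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the SpamAssassin Bayes code: the tokenizer, the add-one smoothing, and the log-odds combination are illustrative choices, and the class name `TinyBayes` is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Step 1: extract features (here simply the set of lowercase words)."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


class TinyBayes:
    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        """Step 2: remember how often each feature was seen in spam and ham."""
        if is_spam:
            self.spam_counts.update(tokenize(text))
            self.n_spam += 1
        else:
            self.ham_counts.update(tokenize(text))
            self.n_ham += 1

    def feature_probability(self, token):
        """Step 3: probability that a message containing this token is spam."""
        # Add-one smoothing keeps the result strictly between 0 and 1.
        in_spam = (self.spam_counts[token] + 1) / (self.n_spam + 2)
        in_ham = (self.ham_counts[token] + 1) / (self.n_ham + 2)
        return in_spam / (in_spam + in_ham)

    def spam_probability(self, text):
        """Step 4: combine per-feature probabilities by summing log-odds."""
        log_odds = 0.0
        for token in tokenize(text):
            p = self.feature_probability(token)
            log_odds += math.log(p) - math.log(1.0 - p)
        log_odds = max(-700.0, min(700.0, log_odds))  # avoid exp() overflow
        return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    bayes = TinyBayes()
    bayes.train("buy cheap pills now", is_spam=True)
    bayes.train("cheap pills offer", is_spam=True)
    bayes.train("meeting agenda for monday", is_spam=False)
    bayes.train("lunch on monday", is_spam=False)
    print(bayes.spam_probability("cheap pills"))
    print(bayes.spam_probability("monday meeting"))
```

Because the counts come from whatever mail was used for training, the same token can end up with very different probabilities for users with dissimilar inboxes.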

Possible Method
- Store SpamAssassin results as features
  – e.g. P(ADVERT_CODE) = 0.95
- This works, but not very well compared to the current scoring mechanism
- Reason: SpamAssassin doesn’t have enough non-spam indicators to correctly weight against the spam features

Chosen Method
- Assign scores to probabilities
- Use GA to assign those scores
    score BAYES_
    score BAYES_
    score BAYES_
    score BAYES_
    score BAYES_
    score BAYES_
    score BAYES_
    score BAYES_
- Add this to the total along with everything else
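As a rough sketch of this scheme: the Bayes probability is mapped to a bucket, each bucket carries a score, and that score is added to the other rule scores. The bucket boundaries and score values below are invented for illustration (the slide's actual rule names and scores are not reproduced here); only the threshold of 7 comes from the talk.

```python
import bisect

# Hypothetical bucket boundaries: 7 edges give 8 probability ranges.
BUCKET_EDGES = [0.01, 0.10, 0.30, 0.50, 0.70, 0.90, 0.99]
# Hypothetical scores, one per bucket; in the talk a GA assigns these.
BUCKET_SCORES = [-4.0, -2.5, -1.0, -0.2, 0.2, 1.0, 2.5, 4.0]


def bayes_bucket_score(probability):
    """Map a spam probability in [0, 1] to the score of its bucket."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return BUCKET_SCORES[bisect.bisect_right(BUCKET_EDGES, probability)]


def total_score(rule_scores, probability):
    """Add the Bayes bucket score to the scores of all other rules that hit."""
    return sum(rule_scores) + bayes_bucket_score(probability)


def is_spam(rule_scores, probability, threshold=7.0):
    """Flag the message when the combined score reaches the threshold."""
    return total_score(rule_scores, probability) >= threshold


if __name__ == "__main__":
    other_rules = [2.0, 1.5]  # scores of two hypothetical rules that fired
    print(total_score(other_rules, 0.995), is_spam(other_rules, 0.995))
    print(total_score(other_rules, 0.20), is_spam(other_rules, 0.20))
```

Negative scores for the low-probability buckets give the scoring system the non-spam indicators that the previous slide says were missing.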

Results
- With threshold at 7 to reduce FPs
- Using customer live emails as training and test data (split half and half)
- Accuracy: 99%
- FPs: 0.1% (all in the “hard_ham” folder: newsletters and other HTML mail)
- Overall, better than we could ever expect with pure SpamAssassin (pre 2.50) or pure Bayes

Questions?

Extra Slides

Alternate Possible Schemes
- Decision Trees
  – Reduce number of rules run by a factor of 50
  – Speeds up SpamAssassin by an enormous amount
  – Not quite as accurate, though lots of work left to do in this area
- Boosting/ADABOOST
- Neural Nets
  – Each rule becomes a node in the network
  – Slow to learn (even compared to the GA)
  – Single layer perceptron may be comparable to the GA
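To illustrate the single layer perceptron idea, where each rule is one input and its learned weight plays the role of a score, here is a minimal sketch. The data layout (a tuple of rule indexes that hit, plus a spam flag), the learning rate, and the function names are assumptions, not anything taken from SpamAssassin.

```python
def classify(hits, weights, bias):
    """Flag as spam when the summed weights of the rules that hit exceed zero."""
    return bias + sum(weights[i] for i in hits) > 0


def train_perceptron(messages, n_rules, epochs=50, learning_rate=0.1):
    """Classic perceptron updates; each message is (rule_hits, is_spam)."""
    weights = [0.0] * n_rules
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for hits, is_spam in messages:
            if classify(hits, weights, bias) != is_spam:
                # Push the rules that fired towards the correct label.
                delta = learning_rate if is_spam else -learning_rate
                for i in hits:
                    weights[i] += delta
                bias += delta
                errors += 1
        if errors == 0:  # every training message is classified correctly
            break
    return weights, bias


if __name__ == "__main__":
    # Rule 0 is a hypothetical spam indicator, rule 1 a ham indicator,
    # rule 2 fires on both kinds of mail.
    corpus = [((0,), True), ((1,), False), ((0, 2), True), ((1, 2), False)]
    weights, bias = train_perceptron(corpus, n_rules=3)
    print(weights, bias)
    print(classify((0,), weights, bias), classify((1,), weights, bias))
```

The learned weights can be read like a score file: positive for spam indicators, negative for ham indicators, near zero for rules that do not discriminate.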

Future SpamAssassin Developments
- Auto-learn: trains the Bayes db continually on the email that gets scanned
- Spam Signatures
  – Some effort by Razor, but massively inaccurate
  – Other work by Brightmail is proprietary
  – Must work like anti-virus signatures - human element?
  – Is it possible to make it open source?

SpamAssassin Retraining
- Most people install SpamAssassin and forget about it
- This is why Bayes kicks butt for personal installations
- But… if you treat SpamAssassin like a Bayes system, i.e. train on your own email, it gets much more accurate
- The training uses a genetic algorithm
- Achieving >99% accuracy via the GA isn’t unheard of

SpamAssassin GA
Flowchart: Start -> Good Enough? -> (No) Evolve Scores -> back to Good Enough?; (Yes) -> Final Scores
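The loop in this flowchart can be sketched as a small genetic algorithm. This is an illustration under assumptions, not the SpamAssassin GA itself: the fitness function (false positives weighted 10x), the population size, and the crossover and mutation operators are hypothetical choices. The threshold of 5.0 matches the summary output shown on the next slide.

```python
import random


def evaluate(scores, messages, threshold=5.0, fp_weight=10.0):
    """Weighted error count; lower is better. Each message is (rule_hits, is_spam)."""
    cost = 0.0
    for hits, is_spam in messages:
        flagged = sum(scores[i] for i in hits) >= threshold
        if flagged and not is_spam:
            cost += fp_weight  # false positive: costs more than a miss
        elif not flagged and is_spam:
            cost += 1.0  # false negative
    return cost


def evolve_scores(messages, start_scores, pop_size=30, generations=200,
                  good_enough=0.0, seed=0):
    """Start -> Good Enough? -> Evolve Scores -> ... -> Final Scores."""
    rng = random.Random(seed)
    n_rules = len(start_scores)  # assumes at least two rules
    population = [list(start_scores)] + [
        [rng.uniform(-5.0, 5.0) for _ in range(n_rules)]
        for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=lambda s: evaluate(s, messages))
        if evaluate(population[0], messages) <= good_enough:
            break  # "Good Enough?" answered yes
        survivors = population[: pop_size // 2]  # the best half always survives
        children = []
        while len(survivors) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_rules)
            child = parent_a[:cut] + parent_b[cut:]  # one-point crossover
            child[rng.randrange(n_rules)] += rng.gauss(0.0, 1.0)  # mutation
            children.append(child)
        population = survivors + children
    return min(population, key=lambda s: evaluate(s, messages))


if __name__ == "__main__":
    corpus = [((0, 1), True), ((0,), True), ((2,), False), ((1, 2), False)]
    start = [0.0, 0.0, 0.0]
    final = evolve_scores(corpus, start)
    print(evaluate(start, corpus), evaluate(final, corpus))
```

Because the best half of the population is carried over unchanged, the final scores are never worse on the training corpus than the starting scores.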

GA In Action
Read test results for 3948 messages (7011 total).
Read scores for 1015 tests.
Iter # Field Value
1 Best e+03 Average e
Pop size, replacement:
Mutations (rate, good, bad, var, num):
Adapt (t, fneg, fneg_add, fpos, fpos_add):
Adapt (over, cross, repeat):
# SUMMARY for threshold 5.0:
# Correctly non-spam: % (99.60% of non-spam corpus)
# Correctly spam: % (90.64% of spam corpus)
# False positives: % (0.40% of nonspam, 756 weighted)
# False negatives: % (9.36% of spam, 1052 weighted)
# Average score for spam: 14.4 nonspam: -3.6
# Average for false-pos: 5.1 false-neg: 3.0
# TOTAL: %
300 Best e+02 Average e+02

Full DNS Results
OVERALL% SPAM% HAM% S/O RANK SCORE NAME
(all messages)
(all messages as %)
RCVD_IN_SBL DCC_CHECK X_OSIRU_SPAMWARE_SITE RCVD_IN_ORBS RCVD_IN_DSBL RAZOR2_CHECK RCVD_IN_OPM RAZOR2_CF_RANGE_01_ RAZOR2_CF_RANGE_11_ RAZOR2_CF_RANGE_21_ RAZOR2_CF_RANGE_31_ RAZOR2_CF_RANGE_41_ ROUND_THE_WORLD RAZOR2_CF_RANGE_71_ RAZOR2_CF_RANGE_81_ RAZOR2_CF_RANGE_51_ NO_MX_FOR_FROM RCVD_IN_RELAYS_ORDB_ORG RAZOR2_CF_RANGE_91_100 ? RCVD_IN_RFCI ? RCVD_IN_NJABL ? X_NJABL_OPEN_PROXY X RCVD_IN_UNCONFIRMED_DSBL X RCVD_IN_OSIRUSOFT_COM X X_OSIRU_DUL_FH X X_OSIRU_DUL X X_NJABL_DIALUP ok RAZOR2_CF_RANGE_61_70 ? RCVD_IN_BONDEDSENDER X RCVD_IN_MULTIHOP_DSBL ? BAD_HELO_WARNING