Filtron: A Learning-Based Anti-Spam Filter Eirinaios Michelakis Ion Androutsopoulos

Presentation transcript:

Filtron: A Learning-Based Anti-Spam Filter
Eirinaios Michelakis, Ion Androutsopoulos, George Paliouras, George Sakkis, Panagiotis Stamatopoulos
First Conference on Email and Anti-Spam (CEAS), Mountain View, CA, July 30th and 31st, 2004

Outline
• Spam Filtering: past, present and future
• Anti-spam filtering with Filtron
• In Vitro Evaluation
• In Vivo Evaluation
• Conclusions

Spam Filtering: past, present and future
Past:
• Black-lists and white-lists of addresses
• Handcrafted rules looking for suspicious keywords and patterns in headers
Present:
• Machine learning-based filters
  – Mostly using the Naïve Bayes classifier
  – Examples: Mozilla’s spam filter, POPFILE, K9
• Signature-based filtering (Vipul’s Razor)
Future:
• Combination of several techniques (SpamAssassin)

Filtron: An overview
A multi-platform learning-based anti-spam filter.
Features for the simple user:
• Personalized: based on her legitimate messages
• Automatically updating black/white lists
• Efficient: server-side filtering and interception rules
Features for the advanced user and the researcher:
• Customizable learning component
  – Through the Weka open-source machine learning platform
• Support for creating publicly available message collections
  – Privacy-preserving encoding of messages and user profiles
Portable: implemented in Java and Tcl/Tk
Currently supported under POSIX-compatible mail servers (MS Exchange Server port efforts under way)

Filtron’s Architecture
[Figure: architecture diagram. Inputs: legitimate folders, spam folders. Components: Preprocessor, Attribute Selector, Vectorizer, Learner. Intermediate outputs: attribute set, training vectors. User model: induced classifier, black list, white list.]

Preprocessing
1. Break down mailbox(es) into distinct messages
2. Remove from every message:
   • mail headers
   • html tags
   • attached files
3. Remove messages with no textual content
4. Store 5 messages per sender
   • Avoids bias towards regular correspondents
5. Remove duplicates
6. Encode messages (optional)
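The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration using Python's standard `email` package, not Filtron's actual code (which the slides say is written in Java and Tcl/Tk). The function names `body_text` and `preprocess`, the regular expression used to drop HTML tags, and the use of the `From` header to identify a sender are all assumptions made for the example; the optional encoding step is omitted.

```python
import re
from collections import defaultdict
from email import message_from_string
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")   # crude HTML tag matcher (assumption, not Filtron's)
MAX_PER_SENDER = 5                # cap taken from the slide


def body_text(raw):
    """Step 2: return (sender, text) without headers, HTML tags or attachments."""
    msg = message_from_string(raw)
    chunks = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        if part.get_content_disposition() == "attachment":
            continue
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True) or b""
        try:
            chunks.append(payload.decode(part.get_content_charset() or "utf-8",
                                         errors="replace"))
        except LookupError:       # unknown charset label
            chunks.append(payload.decode("utf-8", errors="replace"))
    text = unescape(TAG_RE.sub(" ", "\n".join(chunks)))
    return msg.get("From", "").strip().lower(), " ".join(text.split())


def preprocess(raw_messages):
    """Steps 2-5 applied to an iterable of raw message strings."""
    # Step 3: drop messages with no textual content.
    pairs = [p for p in map(body_text, raw_messages) if p[1]]
    # Step 4: keep at most MAX_PER_SENDER messages per sender.
    per_sender, capped = defaultdict(int), []
    for sender, text in pairs:
        if per_sender[sender] < MAX_PER_SENDER:
            per_sender[sender] += 1
            capped.append(text)
    # Step 5: remove exact duplicates, preserving order.
    seen, kept = set(), []
    for text in capped:
        if text not in seen:
            seen.add(text)
            kept.append(text)
    return kept
```

Exact-match duplicate removal is the simplest possible choice here; the slides do not say how Filtron decides that two messages are duplicates.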

Message Classification

In Vitro Evaluation
We investigated the effect of:
• Single-token versus multi-token attributes (n-grams for n=1,2,3)
• Number of attributes ( )
• Learning algorithm (Naïve Bayes, Flexible Bayes, SVMs, LogitBoost)
• Training corpus size (~10%-100% of full training corpus)
Cost-Sensitive Learning Formulation
• Misclassifying a legitimate message as spam (L → S) is λ times more serious an error than misclassifying a spam message as legitimate (S → L)
• Two usage scenarios (λ = 1, 9)
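The slides report a weighted accuracy (WAcc) figure but do not spell out its definition. A definition consistent with the cost formulation above, and the one commonly used in the authors' earlier spam-filtering papers, treats each legitimate message as if it were λ messages. The symbols $n_{L \to L}$, $n_{S \to S}$, $N_L$ and $N_S$ are notation introduced here for the counts of correctly classified legitimate messages, correctly classified spam messages, and the total numbers of legitimate and spam messages.

```latex
\mathrm{WAcc} = \frac{\lambda \cdot n_{L \to L} + n_{S \to S}}{\lambda \cdot N_L + N_S}
```

For λ = 1 this reduces to ordinary accuracy; for λ = 9 each blocked legitimate message costs as much as nine spam messages that slip through.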

In Vitro Evaluation (cont.)
Evaluation:
• Four message collections (PU1, PU2, PU3, PUA)
• Stratified 10-fold cross validation
Results:
• No clear winner among learning algorithms with respect to accuracy
  – Efficiency (or other criteria) more important for real usage
  – Nevertheless, SVMs consistently among the two best
• No substantial improvement with n-grams (for n>1)
Refer to the TR for more details:
• Learning to filter unsolicited commercial e-mail, TRN 2004/2, NCSR “Demokritos”
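Stratified 10-fold cross validation splits a collection into ten parts that each keep roughly the same legitimate-to-spam ratio as the whole, then trains on nine parts and tests on the tenth, rotating through all ten. A minimal sketch in plain Python follows; the function names and the round-robin assignment are choices made for the example, not a description of how the experiments were run (the slides mention Weka, which has its own implementation).

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign message indices to k folds, spreading each class evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def cross_validation_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) once per fold."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


labels = ["spam"] * 30 + ["legit"] * 70
splits = list(cross_validation_splits(labels))
```

With 30 spam and 70 legitimate labels, every test fold in `splits` holds 3 spam and 7 legitimate messages.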

Summary of in Vitro Evaluation
[Table: precision (Pr), recall (Re) and weighted accuracy (WAcc) for λ = 1 and λ = 9. Rows: 1-grams with Naive Bayes, Flexible Bayes, LogitBoost, SVM; 1/2/3-grams with Flexible Bayes, SVM.]

In Vivo Evaluation
Seven-month live evaluation by the third author
Training collection: PU3
• 2313 legitimate / 1826 spam
Learning algorithm: SVM
Cost scenario: λ = 1
Retained attributes: grams
• Numeric values (term frequency)
No black-list was used
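To make "n-gram attributes with term-frequency values" concrete, the sketch below turns a message into a numeric vector over a fixed attribute set. It is illustrative only: whitespace tokenization, lower-casing, raw counts as the term frequency, and the names `ngrams` and `vectorize` are all assumptions, and the attribute list in the usage lines is made up.

```python
from collections import Counter


def ngrams(tokens, max_n=1):
    """Yield all n-grams of the token list for n = 1 .. max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def vectorize(text, attributes, max_n=1):
    """Return one term-frequency value (raw count) per retained attribute."""
    tokens = text.lower().split()
    counts = Counter(ngrams(tokens, max_n))
    return [counts[attr] for attr in attributes]


# Hypothetical attribute sets, for illustration only.
unigram_vector = vectorize("Buy now buy cheap", ["buy", "cheap", "free"])
bigram_vector = vectorize("Buy now buy cheap", ["buy now", "cheap"], max_n=2)
```

In a real filter the attribute list would come from an attribute selection step over the training corpus rather than being written by hand.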

Summary of in Vivo Evaluation
Days used: …
Messages received: 6732 (… avg per day)
Spam messages received: 1623 (… avg per day)
Legitimate messages received: 5109 (… avg per day)
Legitimate-to-Spam Ratio: 3.15
Correctly classified legitimate messages (L → L): …
Incorrectly classified legitimate messages (L → S): 52 (… avg per week)
Correctly classified spam messages (S → S): …
Incorrectly classified spam messages (S → L): 173 (… avg per week)

            Precision   Recall   WAcc
In vivo     96.54%      89.34%   96.66%
PU3         96.43%      95.05%   96.22%
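The reported in vivo percentages can be checked from counts that appear in this deck: 5109 legitimate and 1623 spam messages received, and the 52 false positives and 173 false negatives given in the post-mortem slides. The sketch below assumes the usual definitions (precision and recall measured on the spam class, and weighted accuracy counting each legitimate message λ times); the correctly classified counts are derived by subtraction rather than read from the slides, and the function name is made up for the example.

```python
def spam_metrics(n_ll, n_ls, n_ss, n_sl, lam=1):
    """Spam precision, spam recall and weighted accuracy.

    n_ll: legitimate kept, n_ls: legitimate blocked (false positives),
    n_ss: spam blocked,    n_sl: spam let through (false negatives).
    """
    precision = n_ss / (n_ss + n_ls)
    recall = n_ss / (n_ss + n_sl)
    wacc = (lam * n_ll + n_ss) / (lam * (n_ll + n_ls) + (n_ss + n_sl))
    return precision, recall, wacc


legit, spam = 5109, 1623           # messages received, from the summary slide
false_pos, false_neg = 52, 173     # from the post-mortem slides
pr, re_, wacc = spam_metrics(legit - false_pos, false_pos,
                             spam - false_neg, false_neg, lam=1)
print(f"precision={pr:.2%} recall={re_:.2%} wacc={wacc:.2%}")
# prints "precision=96.54% recall=89.34% wacc=96.66%"
```

The three values match the percentages on the slide, which supports reading the derived counts as 5057 correctly classified legitimate messages and 1450 correctly classified spam messages.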

Post-Mortem Analysis: False Positives
52 false positives (out of 6732)
52%: Automatically generated messages
• subscription verifications, virus warnings, etc.
22%: Very short messages
• 3-5 words in message body
• Along with attachments and hyperlinks
26%: Short messages
• 1-2 lines
• Written in casual style, often exploited by spammers
• With no attachments or hyperlinks

Post-Mortem Analysis: False Negatives
173 false negatives (out of 6732)
30%: “Hard Spam”
• Little textual information, avoiding common suspicious word patterns
• Many images and hyperlinks
• Tricks to confuse tokenizers
8%: Advertisements of pornographic sites with very casual and well-chosen vocabulary
23%: Non-English messages
• Under-represented in the training corpus
30%: Encoded messages
• BASE64 format; Filtron could not process it at that time
6%: Hoax letters
• Long formal letters (“tremendous business opportunity!”)
• Many occurrences of the receiver’s full name
3%: Short messages with unusual content
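The BASE64 case is the most mechanical of these failure classes: the body has to be decoded before tokenization, or the filter only sees an opaque block of characters. A minimal sketch with Python's standard `email` package is shown below; it illustrates the idea and is not a description of how Filtron later handled such messages. The function name and the sample message are invented for the example.

```python
import base64
from email import message_from_string


def decoded_text_parts(raw):
    """Return the text parts of a message with transfer encodings undone."""
    msg = message_from_string(raw)
    texts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # decode=True honours Content-Transfer-Encoding (base64, quoted-printable).
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        texts.append(payload.decode(charset, errors="replace"))
    return texts


# Made-up sample: a plain-text body sent with base64 transfer encoding.
encoded = base64.b64encode(b"tremendous business opportunity").decode("ascii")
raw = ("From: sender@example.org\n"
       "Content-Type: text/plain; charset=utf-8\n"
       "Content-Transfer-Encoding: base64\n"
       "\n" + encoded + "\n")
parts = decoded_text_parts(raw)
```

After decoding, `parts` holds the readable body text, which can then go through the same tokenization as any other message.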

Conclusions
Signs of an arms race between spammers and content-based filters
Filtron’s performance deemed satisfactory, though it can be improved with:
• More elaborate preprocessing to tackle spammers’ usual countermeasures (misspellings, uncommon words, text on images)
• Regular retraining
Currently most promising approach: combination of different filtering approaches along with Machine Learning
• Collaborative filtering
• Filtering at the transport layer