Filtron: A Learning-Based Anti-Spam Filter
Eirinaios Michelakis (ernani@iit.demokritos.gr), Ion Androutsopoulos (ion@aueb.gr), George Paliouras (paliourg@iit.demokritos.gr), George Sakkis (gsakkis@rutgers.edu), Panagiotis Stamatopoulos (takis@di.uoa.gr)
First Conference on Email and Anti-Spam (CEAS), Mountain View, CA, July 30-31, 2004
Outline
- Spam Filtering: past, present and future
- Anti-spam filtering with Filtron
- In Vitro Evaluation
- In Vivo Evaluation
- Conclusions
Spam Filtering: past, present and future
- Past:
  - Black-lists and white-lists of e-mail addresses
  - Handcrafted rules looking for suspicious keywords and patterns in headers
- Present:
  - Machine learning-based filters, mostly using the Naïve Bayes classifier (examples: Mozilla's spam filter, POPFile, K9)
  - Signature-based filtering (Vipul's Razor)
- Future:
  - Combination of several techniques (SpamAssassin)
Filtron: An overview
- A multi-platform learning-based anti-spam filter.
- Features for the ordinary user:
  - Personalized: based on her legitimate messages
  - Automatically updating black/white lists
  - Efficient: server-side filtering and interception rules
- Features for the advanced user and the researcher:
  - Customizable learning component, through the Weka open-source machine learning platform
  - Support for creating publicly available message collections: privacy-preserving encoding of messages and user profiles (see the sketch below)
- Portable: implemented in Java and Tcl/Tk
- Currently supported under POSIX-compatible mail servers (MS Exchange Server port efforts under way)
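The slide does not detail the privacy-preserving encoding; below is a minimal sketch, assuming the scheme used for the publicly released PU corpora, where each distinct token is replaced by a persistent number so that word patterns remain learnable while the original text is unreadable. The class and method names are illustrative, not Filtron's actual code.

import java.util.HashMap;
import java.util.Map;

/** Sketch of a privacy-preserving encoder: each distinct token maps to a
 *  stable integer id (illustrative only, not Filtron's implementation). */
public class TokenEncoder {
    private final Map<String, Integer> dictionary = new HashMap<>();

    /** Returns the message body with every token replaced by its numeric id. */
    public String encode(String body) {
        StringBuilder out = new StringBuilder();
        for (String token : body.split("\\s+")) {
            // Assign the next unused id the first time a token is seen.
            int id = dictionary.computeIfAbsent(token, t -> dictionary.size() + 1);
            out.append(id).append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        TokenEncoder enc = new TokenEncoder();
        System.out.println(enc.encode("buy cheap pills buy now"));  // prints "1 2 3 1 4"
    }
}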
Filtron's Architecture (diagram): the legitimate and spam folders are fed to the Preprocessor; the Attribute Selector produces the attribute set, which the Vectorizer uses to turn messages into training vectors; the Learner induces the classifier from these vectors. The induced classifier, together with the black list and white list, forms the user model applied by Filtron.
Preprocessing
1. Break down the mailbox(es) into distinct messages.
2. Remove from every message: mail headers, HTML tags, attached files.
3. Remove messages with no textual content.
4. Keep at most 5 messages per sender (avoids bias towards regular correspondents).
5. Remove duplicates.
6. Encode the messages (optional).
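A minimal sketch of steps 2-5, assuming the mailbox has already been split into raw message strings with known senders; attachment stripping is omitted and all helper names are illustrative, not Filtron's actual code.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative preprocessing sketch (not Filtron's implementation). */
public class PreprocessSketch {

    /** Step 2 (headers): drop everything up to the first blank line. */
    static String stripHeaders(String rawMessage) {
        int split = rawMessage.indexOf("\n\n");
        return split >= 0 ? rawMessage.substring(split + 2) : rawMessage;
    }

    /** Step 2 (HTML): crude removal of tags. */
    static String stripHtml(String body) {
        return body.replaceAll("<[^>]+>", " ");
    }

    static List<String> preprocess(List<String> rawMessages, List<String> senders) {
        Map<String, Integer> perSender = new HashMap<>();
        Set<String> seenBodies = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < rawMessages.size(); i++) {
            String body = stripHtml(stripHeaders(rawMessages.get(i))).trim();
            if (body.isEmpty()) continue;                  // step 3: no textual content
            int count = perSender.getOrDefault(senders.get(i), 0);
            if (count >= 5) continue;                      // step 4: cap per sender
            if (!seenBodies.add(body)) continue;           // step 5: drop duplicates
            perSender.put(senders.get(i), count + 1);
            kept.add(body);
        }
        return kept;
    }
}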
Message Classification
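The diagram on this slide is not reproduced in the text; as a purely hypothetical illustration, classifying an incoming message with the components named on the architecture slide (white list, black list, induced classifier) could look roughly like the sketch below. All types and method names are assumptions, not Filtron's API, and the order of the list checks is likewise assumed.

import java.util.Set;

/** Hypothetical sketch of server-side classification of one incoming message. */
public class MessageClassifierSketch {
    enum Verdict { LEGITIMATE, SPAM }

    interface ClassifierStub { boolean predictsSpam(String body); }  // stands in for the induced model

    private final Set<String> whiteList;  // senders always accepted
    private final Set<String> blackList;  // senders always rejected
    private final ClassifierStub classifier;

    public MessageClassifierSketch(Set<String> white, Set<String> black, ClassifierStub cls) {
        this.whiteList = white;
        this.blackList = black;
        this.classifier = cls;
    }

    public Verdict classify(String sender, String body) {
        if (whiteList.contains(sender)) return Verdict.LEGITIMATE;  // white list checked first (assumed)
        if (blackList.contains(sender)) return Verdict.SPAM;        // then the black list
        return classifier.predictsSpam(body) ? Verdict.SPAM : Verdict.LEGITIMATE;
    }
}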
In Vitro Evaluation
We investigated the effect of:
- Single-token versus multi-token attributes (n-grams for n = 1, 2, 3)
- Number of attributes (40-3000)
- Learning algorithm (Naive Bayes, Flexible Bayes, SVMs, LogitBoost)
- Training corpus size (~10%-100% of the full training corpus)
Cost-sensitive learning formulation:
- Misclassifying a legitimate message as spam (L → S) is λ times more serious an error than misclassifying a spam message as legitimate (S → L); the resulting decision rule is sketched below
- Two usage scenarios (λ = 1, 9)
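The slide leaves the criterion implicit; for classifiers that return a spam probability, the formulation in the accompanying technical report amounts to the threshold rule and weighted-accuracy measure below. Notation is assumed here: n_{X→Y} is the number of messages of class X classified as Y, and N_L, N_S are the numbers of legitimate and spam messages.

\[
\text{classify } x \text{ as spam} \iff \frac{P(\mathrm{spam} \mid x)}{P(\mathrm{legit} \mid x)} > \lambda
\iff P(\mathrm{spam} \mid x) > t = \frac{\lambda}{1+\lambda},
\]
\[
\mathrm{WAcc} = \frac{\lambda \cdot n_{L \to L} + n_{S \to S}}{\lambda \cdot N_L + N_S}.
\]

For λ = 1 the threshold is t = 0.5 and WAcc reduces to plain accuracy; for λ = 9 the threshold is t = 0.9 and each legitimate message counts nine times as much as a spam message.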
In Vitro Evaluation (cont.)
Evaluation:
- Four message collections (PU1, PU2, PU3, PUA)
- Stratified 10-fold cross-validation (a Weka sketch of this setup follows below)
Results:
- No clear winner among the learning algorithms with respect to accuracy; efficiency (or other criteria) matters more for real usage. Nevertheless, SVMs were consistently among the two best.
- No substantial improvement with n-grams (for n > 1)
Refer to the TR for more details: Learning to filter unsolicited commercial e-mail, TRN 2004/2, NCSR “Demokritos” (http://www.iit.demokritos.gr/skel/i-config/)
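A minimal sketch of such a stratified 10-fold cross-validation using Weka (the machine learning platform Filtron builds on); the ARFF file name, class attribute setup, and label name are assumptions, and this is not the authors' actual experiment code.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Illustrative Weka sketch: stratified 10-fold CV with an SVM (SMO). */
public class CrossValidationSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("pu3.arff");           // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);           // assume the last attribute is the class

        SMO svm = new SMO();                                    // Weka's SVM implementation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(svm, data, 10, new Random(1));  // Weka stratifies the folds

        int spamIndex = data.classAttribute().indexOfValue("spam");  // assumes a "spam" class label
        System.out.printf("Spam precision: %.4f%n", eval.precision(spamIndex));
        System.out.printf("Spam recall:    %.4f%n", eval.recall(spamIndex));
    }
}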
Summary of in Vitro Evaluation (values in %; Pr = spam precision, Re = spam recall, WAcc = weighted accuracy)

                          λ = 1                      λ = 9
                    Pr      Re      WAcc       Pr      Re      WAcc
1-grams
  Naive Bayes      90.56   94.73   94.65      91.57   92.17   94.87
  Flexible Bayes   95.55   89.89   95.15      98.88   74.63   97.76
  LogitBoost       92.43   90.08   93.64      97.71   74.89   97.24
  SVM              94.95   91.43   95.42      98.12   78.33   97.60
1/2/3-grams
  Flexible Bayes   92.98   91.89   93.89      97.43   81.36   96.91
  SVM              94.73   91.70   95.05      98.70   76.40   97.67
In Vivo Evaluation
- Seven-month live evaluation by the third author
- Training collection: PU3 (2313 legitimate / 1826 spam messages)
- Learning algorithm: SVM
- Cost scenario: λ = 1
- Retained attributes: 520 1-grams, with numeric (term frequency) values
- No black list was used
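A rough Weka sketch of training a model with this kind of configuration: rank the attributes, keep the top 520, and train an SVM. The ranking criterion (information gain), file name, and code structure are assumptions for illustration, not the authors' actual setup.

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

/** Illustrative Weka sketch: keep 520 top-ranked attributes, then train an SVM. */
public class TrainingSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("pu3.arff");   // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        // Rank attributes (information gain assumed here) and keep the top 520.
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(520);
        AttributeSelection select = new AttributeSelection();
        select.setEvaluator(new InfoGainAttributeEval());
        select.setSearch(ranker);
        select.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, select);

        SMO svm = new SMO();
        svm.buildClassifier(reduced);  // the induced classifier deployed by the filter
        System.out.println("Trained on " + (reduced.numAttributes() - 1) + " attributes.");
    }
}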
Summary of in Vivo Evaluation

Days used:                                           212
Messages received:                                   6732 (avg. 31.75 per day)
Spam messages received:                              1623 (avg. 7.66 per day)
Legitimate messages received:                        5109 (avg. 24.10 per day)
Legitimate-to-spam ratio:                            3.15

Correctly classified legitimate messages (L → L):    5057
Incorrectly classified legitimate messages (L → S):  52 (avg. 1.72 per week)
Correctly classified spam messages (S → S):          1450
Incorrectly classified spam messages (S → L):        173 (avg. 5.71 per week)

Precision: 96.54% (PU3: 96.43%)
Recall:    89.34% (PU3: 95.05%)
WAcc:      96.66% (PU3: 96.22%)
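As a quick check, the reported figures follow directly from the confusion counts above (λ = 1, so WAcc is plain accuracy):

\[
\mathrm{Pr} = \frac{n_{S \to S}}{n_{S \to S} + n_{L \to S}} = \frac{1450}{1450 + 52} \approx 96.54\%, \qquad
\mathrm{Re} = \frac{n_{S \to S}}{n_{S \to S} + n_{S \to L}} = \frac{1450}{1450 + 173} \approx 89.34\%,
\]
\[
\mathrm{WAcc} = \frac{n_{L \to L} + n_{S \to S}}{6732} = \frac{5057 + 1450}{6732} \approx 96.66\%.
\]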
Post-Mortem Analysis: False Positives
52 false positives (out of 6732 messages received):
- 52%: automatically generated messages (subscription verifications, virus warnings, etc.)
- 22%: very short messages (3-5 words in the message body), along with attachments and hyperlinks
- 26%: short messages (1-2 lines), written in a casual style often exploited by spammers, with no attachments or hyperlinks
Post-Mortem Analysis: False Negatives
173 false negatives (out of 6732 messages received):
- 30%: "hard spam": little textual information that avoids common suspicious word patterns, many images and hyperlinks, tricks to confuse tokenizers
- 8%: advertisements of pornographic sites with a very casual and well-chosen vocabulary
- 23%: non-English messages, under-represented in the training corpus
- 30%: encoded messages (BASE64 format, which Filtron could not process at the time)
- 6%: hoax letters: long formal letters ("tremendous business opportunity!") with many occurrences of the receiver's full name
- 3%: short messages with unusual content
Conclusions
- Signs of an arms race between spammers and content-based filters
- Filtron's performance was deemed satisfactory, though it can be improved with:
  - More elaborate preprocessing to tackle the usual countermeasures of spammers (misspellings, uncommon words, text embedded in images)
  - Regular retraining
- Currently the most promising approach: a combination of different filtering approaches along with machine learning
  - Collaborative filtering
  - Filtering at the transport layer
  - …