1
6/1/2015 Email Spam Filtering - Muthiyalu Jothir
Email Spam Filtering
Computer Security Seminar
N. Muthiyalu Jothir – 271120
Media Informatics
2
Agenda
  What is Spam?
  Statistics
  Who Benefits from it?
  Spam Filtering Techniques
  Combining Filters
  Conclusion
3
What is Spam?
  Spam: unsolicited email
  Email that involves sending identical or nearly identical messages to thousands (or millions) of recipients.
  Caution! “SPAM - Spiced Ham” is also a popular American canned meat brand…
4
Problem
  With a tiny investment, a spammer can send over 100,000 bulk emails per hour.
  Junk mail wastes storage and transmission bandwidth.
    The ISP's investment
    A cost we absorb as the ISP's customers
  Spam is a problem because the cost is forced onto us, the recipients.
5
Statistics
  Email considered spam: 40% of all email
  Daily spam emails sent: 12.4 billion
  Daily spam received per person: 6
  Annual spam received per person: 2,200
  Spam cost to all non-corporate Internet users: $255 million
  Spam cost to all U.S. corporations in 2002: $8.9 billion
  Estimated spam increase by 2007: 63%
  Users who reply to spam email: 28%
  Users who purchased from spam email: 8%
  Wasted corporate time per spam email: 4-5 seconds
6
Who benefits from Spam?
  Financial firms, e.g. mortgage lead generators (gain 2% of loan value per customer data)
  Spammers (share the profit with lead generators)
  Recipient
  [Diagram labels: "Information about interested customers"; "Recipient replies here"]
7
Spam Control Techniques
  Fight-back techniques
    Reporting spam to ISP
    Fight-back filters
    Slow senders
    Law ???
    etc.
  Filtering techniques
    Challenge-response filtering
    Blacklists and white lists
    Content-based filters
      Rule based
      Bayesian filters
8
Reporting Spam To ISPs
  The original spam solution
  Legitimate ISPs respond to such complaints
  Spammers get kicked off
  Disadvantage
    Spammers disguise themselves.
    Naïve users cannot interpret the email headers.
9
Filters that Fight Back (FFB)
  The majority of spam contains links to web pages.
  Spam filters could automatically retrieve the URLs and crawl those pages, which would increase the load on the spammer's server.
  If all spam recipients do this at the same time, the server might crash, and so the cost of spamming increases.
  Caution! FFB usually works with blacklists (of malicious servers) in order to avoid attacking innocent servers.
10
Filtering Techniques
11
Spam Vs Ham
  Care to be taken in any spam filtering technique:
  “All the spam could be allowed to pass through; but not even a single legitimate mail should be filtered.”
  False positive: legitimate mail classified as spam.
  The lowest possible false-positive rate is desired…
  Caution: check your junk folder before deleting. Don’t blindly trust your spam filter.
12
Challenge-Response Filtering
  Emails from unknown senders receive an auto-reply message asking the senders to verify themselves.
  Senders are “challenged” to type in a word that is hidden within a graphic or a sound file.
  Mail is forwarded to the receiver’s inbox only after a successful “response”.
  This technique filters almost all spam: no spammer would be interested in taking the extra effort to prove themselves.
  Commercial product: “Spam Arrest”
  Disadvantage
    This technique is rude.
    Sometimes senders don’t reply, or forget to reply, to the challenge.
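The hold-and-release flow described on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `ChallengeResponseQueue` and its methods are hypothetical, and a random token stands in for the word hidden in a graphic or sound file.

```python
import secrets

class ChallengeResponseQueue:
    """Hypothetical sketch: hold mail from unknown senders until they answer a challenge."""

    def __init__(self, known_senders):
        self.known = set(known_senders)
        self.pending = {}   # challenge token -> (sender, message)
        self.inbox = []     # delivered (sender, message) pairs

    def receive(self, sender, message):
        """Deliver mail from known senders; otherwise hold it and return a challenge token."""
        if sender in self.known:
            self.inbox.append((sender, message))
            return None
        token = secrets.token_hex(8)  # assumption: stands in for the hidden word
        self.pending[token] = (sender, message)
        return token

    def respond(self, token):
        """Release the held mail if the response matches an outstanding challenge."""
        held = self.pending.pop(token, None)
        if held is None:
            return False
        self.known.add(held[0])  # the sender is trusted from now on
        self.inbox.append(held)
        return True
```

Mail whose challenge is never answered simply stays in `pending`, which mirrors the disadvantage noted on the slide.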
13
Blacklists and White lists
  Blacklists list misbehaving servers or known spammers and are collected by several sites.
  The sender ID in the email is compared with the blacklist.
  White lists are complementary to blacklists and contain addresses of trusted contacts.
  Use blacklists and white lists for first-level filtering (before applying content checks), not as the only tool for making a decision.
  Disadvantage
    Prone to misconfiguration: legitimate servers may be unable to get off a list on which they were incorrectly placed.
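A first-level list check might look like the sketch below. The function name, the sample addresses, and the choice to let the white list take precedence over the blacklist are all assumptions made for illustration; the slide does not specify them.

```python
def first_level_check(sender, whitelist, blacklist):
    """Return 'accept', 'reject', or 'unknown' for a sender address.

    Both lists may contain full addresses or bare domains (an assumption).
    """
    sender = sender.strip().lower()
    domain = sender.rsplit("@", 1)[-1]
    if sender in whitelist or domain in whitelist:
        return "accept"
    if sender in blacklist or domain in blacklist:
        return "reject"
    return "unknown"  # defer the decision to content-based filtering

# Hypothetical example lists
whitelist = {"alice@example.org"}
blacklist = {"bulk-mailer.example"}
```

Returning "unknown" rather than a final verdict keeps the lists as a first-level filter only, as the slide recommends.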
14
Content based filters
  It is not a good idea to filter mail based on blacklists alone.
  Wiser decision: consider the actual content of the email.
  Almost all successful spam filters use this technique.
  Major types: rule-based and Bayesian
15
Rule Based Filters
  Rule-based filters use a set of static rules to decide whether a mail is spam or not.
  Rules could be based on:
    words and phrases
    lots of uppercase characters
    exclamation points
    special characters
    Web links
    HTML messages
    background colors
    crazy Subject lines
    etc.
16
Rule based filters
  Rules are given scores, based on importance.
  Incoming mails are parsed and checked for known malicious patterns.
  A total score is calculated for the triggered rules.
  If final score > threshold, classify as spam. Otherwise, classify as legitimate mail.
  The threshold is decided by the user.
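The scoring procedure on this slide can be sketched as below. The rule names, patterns, scores, and default threshold are invented for illustration and are not taken from any real filter.

```python
import re

# Hypothetical rules: (name, compiled pattern, score). All values are illustrative.
RULES = [
    ("MANY_EXCLAMATIONS", re.compile(r"!{3,}"), 1.5),
    ("ALL_CAPS_WORD", re.compile(r"\b[A-Z]{6,}\b"), 1.0),
    ("CONTAINS_LINK", re.compile(r"https?://", re.IGNORECASE), 0.5),
    ("FREE_OFFER", re.compile(r"\bfree\b", re.IGNORECASE), 2.0),
]

def score_message(text):
    """Return the total score and the names of the rules that fired."""
    triggered = [(name, score) for name, pattern, score in RULES if pattern.search(text)]
    return sum(score for _, score in triggered), [name for name, _ in triggered]

def classify(text, threshold=3.0):
    """Classify as spam when the total score exceeds the user-chosen threshold."""
    total, _ = score_message(text)
    return "spam" if total > threshold else "ham"
```

Raising the threshold trades missed spam for fewer false positives, which is why the slide leaves the choice to the user.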
17
Rule Based Filters
  “SpamAssassin”, a popular spam filtering product, uses rule-based filtering.
  Perl regular expressions (regex) are used for pattern checking.
  Example rules:
    header __LOCAL_FROM_NEWS From =~ /news@example\.com/i
    body __LOCAL_SALES_FIGURES /\bMonthly Sales Figures\b/
    meta LOCAL_NEWS_SALES_FIGURES (__LOCAL_FROM_NEWS && __LOCAL_SALES_FIGURES)
    score LOCAL_NEWS_SALES_FIGURES 0.8
18
Rule Based Filters
  Advantage
    Easy to implement
    No training required
  Disadvantage
    Static rules are too general
    Spammers find new ways to deceive the rules
19
Bayesian Filters
  Bayesian filters are the latest in spam filtering technology and the most successful.
  Bayes classifiers have been used extensively in the field of pattern recognition.
  Given an unlabeled example, the classifier calculates the most likely classification with some degree of probability.
20
Bayesian Filters
  Steps in Bayes filtering
    Training
    Validation
    Implementation
  Training starts with two collections of mail: one of spam and one of legitimate mail.
  For every word in these emails, the filter calculates a spam probability based on the proportion of spam occurrences.
  Bayesian filters are quite accurate, and adapt automatically as spam evolves.
  False positives are minimized because Bayesian filters consider evidence of innocence as well as evidence of spam.
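The training step might be sketched as follows. This is a minimal illustration that assumes whitespace tokenization, counts each word at most once per mail, and requires both collections to be non-empty; real filters typically add smoothing and better tokenizers.

```python
from collections import Counter

def tokenize(text):
    """Very simple tokenizer (an assumption): lowercase, split on whitespace."""
    return set(text.lower().split())

def train(spam_mails, ham_mails):
    """Estimate, per word, the share of its occurrences that come from spam.

    Rates are normalized by collection size so unequal collections do not bias the result.
    """
    spam_counts = Counter(w for mail in spam_mails for w in tokenize(mail))
    ham_counts = Counter(w for mail in ham_mails for w in tokenize(mail))
    n_spam, n_ham = len(spam_mails), len(ham_mails)
    probs = {}
    for word in set(spam_counts) | set(ham_counts):
        spam_rate = spam_counts[word] / n_spam
        ham_rate = ham_counts[word] / n_ham
        probs[word] = spam_rate / (spam_rate + ham_rate)
    return probs
```

Words seen only in legitimate mail get a probability near 0, which is the "evidence of innocence" the slide refers to.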
21
Bayesian Filtering
  Bayes probability:

  $$\Pr(\text{spam} \mid \text{words}) = \frac{\Pr(\text{spam}) \cdot \Pr(\text{words} \mid \text{spam})}{\Pr(\text{words})}$$

  A probability closer to 1 is classified as spam and one closer to 0 is classified as ham. 0.5 is set as the threshold.
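One common way to apply Bayes' rule to a whole message is to combine per-word spam probabilities under a naive independence assumption with equal priors. The sketch below follows that approach; the clamping bounds, the neutral default for unseen words, and the function names are assumptions, not details from the slides.

```python
import math

def spam_probability(words, word_probs, default=0.5):
    """Combine per-word spam probabilities, assuming independence and equal priors."""
    log_spam = 0.0
    log_ham = 0.0
    for w in words:
        # Clamp so no single word can force the result to exactly 0 or 1.
        p = min(max(word_probs.get(w, default), 0.01), 0.99)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    diff = log_ham - log_spam
    if diff > 700:  # avoid overflow in exp for overwhelmingly ham-like input
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

def classify(words, word_probs, threshold=0.5):
    """Spam if the combined probability exceeds the threshold (0.5 on the slide)."""
    return "spam" if spam_probability(words, word_probs) > threshold else "ham"
```

Working in log space keeps the product of many small probabilities from underflowing on long messages.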
22
Neural Network for Training
  [Figure: Neural Network Structure]
23
Neural Networks for Training
  Neural networks are used to train the spam filter (rule-based or Bayesian); the network itself is not a filter.
  Input: words or rules etc.
  Trained over multiple samples of the user’s mails (both spam and ham).
  The weights of the links are altered until the desired output is obtained.
24
Supervised Learning
  Supervised learning: training with a “teacher” signal.
  Train the system until the weights of the edges are optimized and no longer change.
  Caution! Take care not to overtrain the network.
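The idea of adjusting weights against a teacher signal until they stop changing can be illustrated with a single perceptron. This is a deliberately simplified stand-in: the network structure from the slides is not recoverable from the transcript, and the learning rate, epoch limit, and feature encoding here are assumptions.

```python
def train_perceptron(samples, labels, epochs=20, lr=0.1):
    """Adjust weights until every training sample gets its target label (1 = spam)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            output = 1 if activation > 0 else 0
            delta = target - output  # the "teacher" signal
            if delta != 0:
                errors += 1
                weights = [w + lr * delta * xi for w, xi in zip(weights, x)]
                bias += lr * delta
        if errors == 0:  # weights no longer change: stop training
            break
    return weights, bias

def predict(x, weights, bias):
    """Return 1 (spam) or 0 (ham) for a feature vector."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
```

Capping the number of epochs is one simple guard against overtraining; checking accuracy on held-out mail would be another.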
25
Combining Spam Filters
  Goal: a combined filter aims to improve on the individual filters' performance.
  Combined Filter = Original Filter (OF) + Received Filter (RF)
  Max gain: the received filter contains some feature sets not found in the original filter.
  E.g. Original filter = {“Share Market”, “Higher Studies”}; Received filter = {“Share Market”, “Job Alerts”}
26
Challenges
  Decisions (spam / ham) are made by both filters individually.
    Decisions agree: no problem
    Disagreement: due to the difference in feature sets
  Challenges
    “How do we select the correct decision or filter?”
    “Who selects it?”
27
Filter Selector (FS)
  Training phase: FS predicts the unique features (e.g. words) of RF
    Parse the emails of the training set and extract the features
    ‘Bag’ of (predicted) features for RF
  Text similarity comparison between the current e-mail's features and the feature sets of the filters.
28
Algorithm Flowchart
  [Figure: flowchart with two stages: 1. Training Phase, 2. Final Verdict]
29
TF-IDF Similarity Measure
  Commonly used in Information Retrieval applications.
  More frequent words would be key to accurate classification of emails.
  The feature set predicted by FS is unique.
  “Query-Document” retrieval procedure:
    2 documents: the feature sets
    Query: the current email
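The query-document comparison could be sketched as below, treating the two filters' feature sets as the documents and the current email as the query. The smoothed IDF variant `log((1 + N) / (1 + df)) + 1`, the cosine similarity, and the tie-break in favour of the original filter are assumptions; the slides do not say which variants were used.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors for token lists, using a smoothed IDF (an assumption)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_filter(email_terms, of_features, rf_features):
    """Pick the filter whose feature set is most similar to the current email."""
    (of_vec, rf_vec), idf = tfidf_vectors([of_features, rf_features])
    query = {t: c * idf[t] for t, c in Counter(email_terms).items() if t in idf}
    sim_of, sim_rf = cosine(query, of_vec), cosine(query, rf_vec)
    return "OF" if sim_of >= sim_rf else "RF"  # ties go to the original filter
```

With feature sets loosely based on the earlier slide's example, an email about job alerts would be routed to the received filter, while one about higher studies would stay with the original filter.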
30
Experiments & Results
31
Conclusion
  We discussed the techniques to “kill” spam
  Comparison between various techniques
  So far, Bayesian filtering seems to be reliable
  Discussed a new approach to combine filters
  Future work:
    Learning techniques for the Filter Selector
    Better similarity measures
32
Thank You