Spam Detection. Kingsley Okeke, Nimrat Virk.

Presentation transcript:

1 Spam Detection Kingsley Okeke Nimrat Virk

2 Everyone hates spam! Spam emails, also known as junk emails, are unwanted messages sent to numerous recipients. They impede our ability to recognise normal emails. They can also be a threat to computer security.

3 But how do we filter spam out from normal emails?

4 Text Mining! What is text mining? Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. (Source: Wikipedia)

5 Applications. Marketing applications: used to improve predictive analytics models for customers, e.g. by analysing answers to open-ended survey questions. Online media applications: used by large media companies to give users a better search experience. Academic applications: publishers with large databases use text mining for easy information retrieval.

6 Using text mining, we can analyse patterns common in spam emails in order to distinguish them from ham (legitimate) emails.

7 Steps 1) Get some training data: a large collection of spam and normal emails, e.g. the SpamAssassin public corpus (http://www.spamassassin.org/publiccorpus/)
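
A minimal sketch of reading such a collection into (text, label) pairs, assuming the corpus has been unpacked into folders that each hold one raw message per file. The function names `email_text` and `load_folder` are hypothetical and not part of the presentation; only Python's standard `email` package is used.

```python
from email import message_from_bytes, policy
from pathlib import Path


def email_text(raw):
    """Parse one raw message and return its subject plus body as plain text."""
    msg = message_from_bytes(raw, policy=policy.default)
    # Prefer the plain-text part; fall back to HTML if that is all there is.
    body = msg.get_body(preferencelist=("plain", "html"))
    try:
        content = body.get_content() if body is not None else ""
    except (LookupError, UnicodeError):
        # Unknown or broken character set: keep the subject, skip the body.
        content = ""
    return f"{msg['subject'] or ''}\n{content}"


def load_folder(folder, label):
    """Return a list of (text, label) pairs, one per file in `folder`."""
    return [
        (email_text(path.read_bytes()), label)
        for path in sorted(Path(folder).iterdir())
        if path.is_file()
    ]
```

A training set could then be assembled by calling `load_folder` once per class, with folder paths depending on where the archives were extracted.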

8 Steps 2) Data Pre-processing a) Stop words: remove very common words that carry little meaning, e.g. for, when, to, a, be. Also remove domain-specific stop words, e.g. email, send.
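
Stop-word removal can be sketched as follows. The two word sets contain only the examples given on the slide, not a complete list, and the lowercase letters-only tokenizer is an assumption made for illustration.

```python
import re

# Example words from the slide; a real list would be much longer.
STOP_WORDS = {"for", "when", "to", "a", "be"}
DOMAIN_STOP_WORDS = {"email", "send"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS | DOMAIN_STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]
```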

9 Steps b) Stemming: reducing words to their stem/root by stripping suffixes, e.g. discussed, discussing → discuss. Porter stemming algorithm: one of the most widely used stemming algorithms, developed by Martin Porter. http://www.tartarus.org/~martin/PorterStemmer/
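
To illustrate the idea only, here is a toy suffix stripper. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; the function name `naive_stem` and its three rules are assumptions chosen to reproduce the slide's example.

```python
def naive_stem(word):
    """Strip a few common English suffixes (toy rules, not Porter's)."""
    for suffix in ("ing", "ed"):
        # Require at least three letters to remain after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    # Drop a plural "s", but leave words ending in "ss" (e.g. "discuss") alone.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word
```

With these rules, "discussed", "discussing" and "discuss" all map to the same stem, so they are counted as one feature.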

10 Steps c) Feature Selection. What are good and bad features? Good features: occur together with one particular category; do not co-occur with other categories. Bad features: are uniform across all categories; occur very infrequently.

11 Steps Information Gain: a common feature selection technique used in machine learning applications. The information gain of term t is defined as:
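
One commonly used formulation, stated here as an assumption rather than as the slide's own formula, is IG(t) = H(C) - [P(t) H(C | t) + P(not t) H(C | not t)], where H is the entropy of the class labels and t means "term t occurs in the document". A minimal sketch under that assumption, with documents represented as sets of tokens and hypothetical function names:

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Reduction in class entropy from knowing whether `term` is present.

    docs: non-empty list of token sets; labels: parallel list of classes.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional
```

A term that appears in every spam message and in no ham message gets the maximum gain, while a term that never appears (or appears uniformly across classes) gets a gain of zero, which matches the good/bad feature criteria from the feature selection step.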

12 Steps Feature Representation

        word1   word2   ...   class
doc1    0       2             c1
doc2    2       4             c2
doc3    2       1             c3
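
A document-by-term count matrix of this kind can be sketched as below. The function name `count_matrix` and the fixed `vocabulary` list are assumptions for illustration; the class column is kept separately as a list of labels.

```python
from collections import Counter


def count_matrix(docs, vocabulary):
    """Return one row per document with the count of each vocabulary word."""
    rows = []
    for tokens in docs:
        counts = Counter(tokens)
        # Counter returns 0 for words that do not occur in the document.
        rows.append([counts[word] for word in vocabulary])
    return rows
```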

13 Steps TF (Term Frequency). Definition: TF = t(i, j), the frequency of term i in document j. Purpose: makes the words that are frequent in a document more important for that document. TF-IDF (Term Frequency - Inverse Document Frequency) value of a term i in document j. Definition: TF-IDF = t(i, j) × log(N / n_i), where n_i is the number of documents containing term i and N is the total number of documents.
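
The TF-IDF definition above can be applied to a matrix of raw term counts as in this sketch. The natural logarithm is an assumption, since the slide does not state the base, and the function name `tf_idf` is hypothetical.

```python
import math


def tf_idf(count_rows):
    """Weight raw counts t(i, j) by log(N / n_i) for each term i."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0]) if count_rows else 0
    # n_i: number of documents in which term i occurs at least once.
    doc_freq = [sum(1 for row in count_rows if row[i] > 0) for i in range(n_terms)]
    return [
        [
            row[i] * math.log(n_docs / doc_freq[i]) if doc_freq[i] else 0.0
            for i in range(n_terms)
        ]
        for row in count_rows
    ]
```

A term that occurs in every document gets log(N / N) = 0, so it contributes nothing regardless of how often it appears, which is the same intuition as the "uniform across all categories" bad feature.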

14 Steps d) Text Classification with WEKA. Training data is used to build a classification model. This model is built from the pre-processed data.
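
The slides do not say which WEKA classifier was used. As a stand-in, the following sketch trains a small multinomial Naive Bayes model with add-one smoothing in plain Python; it is not WEKA's implementation, and the function names and the model dictionary layout are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocabulary": vocabulary}


def predict(model, tokens):
    """Return the class with the highest log-probability for the tokens."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # class prior
        for token in tokens:
            if token in model["vocabulary"]:  # ignore unseen words
                # Add-one (Laplace) smoothing avoids log(0).
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The input token lists would come from the pre-processing steps (stop-word removal, stemming, feature selection), so the model is built from the pre-processed data rather than from raw emails.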

15 END

