Spam Detection
Kingsley Okeke, Nimrat Virk


Everyone hates spam! Spam emails, also known as junk emails, are unwanted emails sent to numerous recipients. They impede our ability to recognise normal emails. They can also be a threat to computer security.

But how do we filter out spam from normal emails?

Text Mining! What is text mining?
Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features, the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. (Source: Wikipedia)

Applications
Marketing applications: used to improve predictive analytic models for customers, e.g. open-ended questions in surveys.
Online media applications: used by large media companies to provide users with a better search experience.
Academic applications: publishers with large databases use text mining for easy information retrieval.

Using text mining, we can analyse patterns common in spam emails in order to distinguish them from ham (normal) emails.

Steps
1) Get some training data
A large collection of spam and normal emails: the SpamAssassin public corpus (http://www.spamassassin.org/publiccorpus/)

Steps
2) Data pre-processing
a) Stop words: common words that carry little meaning and are removed, e.g. for, when, to, a, be.
Domain-specific stop words are removed as well, e.g. email, send.
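Stop-word removal can be sketched in a few lines of Python. This is only an illustration: the two word lists contain just the examples from the slide, and the `tokenize` and `remove_stop_words` helpers are hypothetical names introduced here, not part of any particular toolkit.

```python
import re

# Stop-word lists built only from the slide's examples (a real list would be longer).
GENERAL_STOP_WORDS = {"for", "when", "to", "a", "be"}
DOMAIN_STOP_WORDS = {"email", "send"}
STOP_WORDS = GENERAL_STOP_WORDS | DOMAIN_STOP_WORDS


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


tokens = tokenize("Send this email to a friend for free prizes")
filtered = remove_stop_words(tokens)
print(filtered)
```

Only the words that are not in either list remain as candidate features for the later steps.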

Steps
b) Stemming: reducing words to their stem/root by removing suffixes, e.g. discussed, discussing -> discuss.
Porter stemming algorithm: one of the most widely used stemming algorithms, developed by Martin Porter (http://www.tartarus.org/~martin/PorterStemmer/)
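To show the idea of stemming on the slide's own example, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the stem; the `naive_stem` function and its two-suffix list are assumptions made purely for illustration.

```python
def naive_stem(word, suffixes=("ing", "ed")):
    """Strip the first matching suffix if at least three letters remain.

    A toy stand-in for a real stemmer such as Porter's algorithm.
    """
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [naive_stem(w) for w in ["discussed", "discussing", "discuss"]]
print(stems)
```

All three forms map to the same stem, so they are counted as a single feature instead of three.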

Steps
c) Feature selection
What are good and bad features?
Good features: co-occur with a particular category; do not co-occur with other categories.
Bad features: uniform across all categories; very infrequent occurrence.

Steps
Information gain: a common feature selection technique used in machine learning applications. The information gain of term t is defined as:
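A commonly used form of this definition in the text categorisation literature is given below. It is stated here as an assumption, not as the authors' exact expression. The symbols are also assumptions: c_1, ..., c_m are the categories (here spam and ham), P(t) is the fraction of documents containing term t, and \bar{t} denotes the absence of t.

```latex
IG(t) = -\sum_{i=1}^{m} P(c_i)\,\log P(c_i)
        + P(t)\sum_{i=1}^{m} P(c_i \mid t)\,\log P(c_i \mid t)
        + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\,\log P(c_i \mid \bar{t})
```

The first sum is the entropy of the class distribution; the other two terms subtract the entropy that remains once we know whether t is present or absent. Terms with a high value tell us more about the category and are kept as features.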

Steps
Feature representation

        word1   word2   ...   class
doc1    0       2             c1
doc2    2       4             c2
doc3    2       1             c3
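A table like this can be built by counting, for each document, how often every vocabulary word occurs and appending the class label. The sketch below assumes documents are already tokenised; `build_count_matrix`, the sample documents, and the vocabulary are hypothetical and only serve the illustration.

```python
from collections import Counter


def build_count_matrix(docs, vocabulary):
    """Return one row per document: word counts in vocabulary order, then the label."""
    rows = []
    for tokens, label in docs:
        counts = Counter(tokens)  # words missing from the document count as 0
        rows.append([counts[word] for word in vocabulary] + [label])
    return rows


docs = [
    (["free", "prize", "free"], "spam"),
    (["meeting", "agenda"], "ham"),
]
vocabulary = ["free", "meeting", "prize"]
matrix = build_count_matrix(docs, vocabulary)
for row in matrix:
    print(row)
```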

Steps
TF: Term Frequency
Definition: TF = t(i,j), the frequency of term i in document j
Purpose: makes the words that are frequent in a document more important
TF-IDF (Term Frequency - Inverse Document Frequency) value of a term i in document j
Definition: TF×IDF = t(i,j) × log(N / n_i)
n_i: number of documents containing term i
N: total number of documents
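The TF×IDF definition above translates almost directly into code. The following is a minimal sketch, assuming raw counts for t(i,j) and the natural logarithm (the slide does not state the base); the `tf_idf` function and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)  # N: total number of documents
    doc_freq = Counter()  # n_i: number of documents containing term i
    for tokens in docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in docs:
        term_freq = Counter(tokens)  # t(i,j): frequency of term i in document j
        weights.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in term_freq.items()}
        )
    return weights


docs = [["free", "prize", "free"], ["free", "meeting"], ["meeting", "agenda"]]
weights = tf_idf(docs)
print(weights[0])
```

A term that appears in every document gets log(N/N) = 0, which matches the earlier point that features uniform across all categories are bad features.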

Steps
d) Text classification
WEKA: the training data is used to build a classification model. This model is built from the pre-processed data.
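The slide does not say which classifier is trained in WEKA. As a rough picture of what "building a model from pre-processed training data" means, here is a small multinomial Naive Bayes sketch in plain Python. The choice of Naive Bayes, the add-one smoothing, the function names, and the toy training set are all assumptions, and this is not WEKA's API.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (tokens, label). Returns (log priors, per-class word counts, vocabulary)."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    log_priors = {c: math.log(n / len(docs)) for c, n in label_counts.items()}
    return log_priors, word_counts, vocabulary


def classify(tokens, model):
    """Return the label with the highest log-probability for the given tokens."""
    log_priors, word_counts, vocabulary = model
    best_label, best_score = None, -math.inf
    for label, log_prior in log_priors.items():
        total = sum(word_counts[label].values())
        score = log_prior
        for t in tokens:
            if t in vocabulary:  # words never seen in training are ignored
                # add-one (Laplace) smoothing avoids log(0) for unseen class/word pairs
                score += math.log((word_counts[label][t] + 1) / (total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    (["free", "prize", "win"], "spam"),
    (["win", "free", "cash"], "spam"),
    (["meeting", "agenda", "monday"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
model = train_naive_bayes(training)
print(classify(["free", "cash", "prize"], model))
print(classify(["meeting", "notes"], model))
```

In practice the tokens passed in would already have gone through stop-word removal, stemming, and feature selection.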
