
Spam Detection Using Support Vector Machine, presented by Nan Mya Oo, University of Computer Studies, Taunggyi.


1 Email Spam Detection Using Support Vector Machine, presented by Nan Mya Oo, University of Computer Studies, Taunggyi

2 Outline: Abstract; Objectives; Introduction; Support Vector Machine; System Flow Diagram; Conclusions; References

3 Abstract Nowadays, e-mail provides many ways to send millions of advertisements at no cost to the sender. As a result, unsolicited bulk e-mail, also known as spam, has spread widely and become a serious threat not only to the Internet but also to society. For example, when a user receives a large amount of spam, the chance that the user overlooks a non-spam message increases. Many e-mail readers also have to spend time removing unwanted messages. Spam may also cost money for users with dial-up connections, waste bandwidth, and expose minors to unsuitable content.

4 Cont’d Several machine learning algorithms have been employed in e-mail spam filtering, including algorithms that are considered top performers in text classification, such as Boosting, Support Vector Machines (SVM), and Naïve Bayes. Among them, we intend to use the support vector machine (SVM) to filter e-mail as spam or not. An SVM can be a binary or a multiclass classifier. In this thesis, we intend to use a binary SVM classifier because there are only two classes: “SPAM” and “NOT-SPAM”. An SVM finds a separating line between the data of two classes. Training an SVM produces a model that assigns new, unseen objects to a particular category.

5 Objectives To classify messages as spam or legitimate mail. To alert users automatically to the increasing volume of unwanted mail. To help e-mail users avoid wasting time on large amounts of spam. To classify the technologies of spam filtering.

6 Introduction The Internet has become an indispensable means of communication because of its popularity, low cost, and fast delivery of messages. Along with the growth of the Internet and email, there has been a dramatic growth in spam in recent years. Spam can originate from any location across the globe where Internet access is available. Spamming is the abuse of electronic messaging systems to send unsolicited bulk messages, often to promote products or services, which are almost universally undesired.

7 Cont’d The problem of spam is of serious and escalating concern, and it is challenging to develop spam filters that can automatically and effectively eliminate the increasing volume of unwanted mail before it enters a user’s mailbox. One popular solution is automated email filtering using machine learning (ML). The support vector machine (SVM) is one of the ML techniques that is widely used to classify email as spam or not, and so to address the email spam problem.

8 Overall system process To filter email spam, the system generally performs the following tasks: (1) text preprocessing; (2) feature selection and extraction; (3) text classification with SVM; (4) reporting the result as spam or not spam.

9 Pre-processing steps Pre-processing the input text means data cleaning. It is essential for reducing the probability of wrong results, because some words have no influence on the classification: they can be associated neither with the spam class nor with the ham (non-spam) class. Other words can be normalized in order to group words with the same meaning and reduce redundancy. The pre-processing steps are therefore to remove stop words, numbers, special symbols, and URLs, and to apply stemming and lemmatization, which helps improve the results of the SVM algorithm.
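The cleaning steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical one, and `naive_stem` is a crude suffix stripper standing in for a real stemmer or lemmatizer (for example Porter's algorithm), not the method used in the presentation.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop-word list; a real system would use a
# much fuller one.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to", "you", "your"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # matches digits, punctuation, symbols


def naive_stem(word: str) -> str:
    """Crude suffix stripping; an assumption standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> List[str]:
    """Lowercase, strip URLs, numbers and symbols, drop stop words, then stem."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove URLs first, while they are intact
    text = NON_LETTER_RE.sub(" ", text)  # remove numbers and special symbols
    return [naive_stem(tok) for tok in text.split() if tok not in STOP_WORDS]
```

For instance, `preprocess("Click http://spam.example.com NOW to claim 1000 free prizes!!!")` keeps only the lowercased, stemmed content words.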

10 Feature selection and extraction Feature extraction is used to extract important and relevant features from the email body. It transforms each email into a numeric feature vector, so a set of emails becomes a 2D matrix with one row per email and one column per feature. The features are mapped from the vocabulary list. Each feature value is computed from the term frequency (TF) of a word in a document and the word's inverse document frequency (IDF), which are combined into the TF-IDF weight.
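One common textbook formulation of these quantities is given below. It is an assumption that the presentation uses this exact variant; libraries often add smoothing terms. Here f_{t,d} is the number of times term t occurs in document d, N is the number of documents, and n_t is the number of documents that contain t.

```latex
\[
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)
\]
```

A term that appears in every document has idf = log 1 = 0, so it contributes nothing to the feature vector.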

11 TF-IDF (Term Frequency – Inverse Document Frequency) TF-IDF provides a statistic that reflects how important a particular word is to a given document.
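A minimal Python sketch of this statistic follows, assuming the plain unsmoothed definition (relative term frequency times the log of the inverse document frequency). The function name `tf_idf` and the toy documents are illustrative only.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [["free", "prize", "free"], ["meeting", "tomorrow"], ["free", "meeting"]]
weights = tf_idf(docs)
```

In the toy corpus, "prize" occurs in only one document, so it receives a higher weight per occurrence than "free", which occurs in two.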

12 Support Vector Machine A support vector machine is an algorithm that determines the best decision boundary between vectors that belong to a given group (or category) and vectors that do not, where “best” means the boundary with the largest margin to the nearest training vectors of each class. It can be applied to any kind of vector that encodes any kind of data. This means that, to use SVM for text classification, texts have to be transformed into vectors. Vectors are (sometimes huge) lists of numbers that represent coordinates in some space. When an SVM determines the decision boundary, it decides where to draw the best “line” (or hyperplane) that divides the space into two subspaces: one for the vectors that belong to the given category and one for the vectors that do not. The fundamental goal of an SVM is to produce a model that predicts the class labels of the instances in the test set given only their features.
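To make the idea of learning a separating hyperplane concrete, here is a small pure-Python sketch that trains a linear SVM by per-sample subgradient descent on the regularized hinge loss. It is one simple way to fit such a model, not necessarily the solver used in this system; the hyperparameter values and the toy 2-D data are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def train_linear_svm(
    X: Sequence[Sequence[float]],
    y: Sequence[int],      # labels must be +1 (spam) or -1 (not spam)
    lam: float = 0.01,     # regularization strength (assumed value)
    lr: float = 0.1,       # learning rate (assumed value)
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Fit weights w and bias b so that sign(w . x + b) predicts the label."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The regularization term shrinks w on every step.
            w = [wj - lr * lam * wj for wj in w]
            if margin < 1:
                # The sample is misclassified or inside the margin:
                # move the hyperplane away from it.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy, linearly separable data: class +1 has a large first coordinate.
X_train = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
y_train = [1, 1, -1, -1]
w, b = train_linear_svm(X_train, y_train)
```

After training, `predict(w, b, x)` assigns a new, unseen vector to one of the two classes using only its features.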

13 SVM (Support Vector Machine) An SVM can act as either a binary or a multiclass classifier. It classifies data based on features. For example: (a) a linear SVM uses a hyperplane to classify the data; (b) in nonlinear classification, the data cannot be separated by a hyperplane in the original space, so a kernel function is used.
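The role of a kernel function can be illustrated with two standard kernels. The sketch below assumes 2-D inputs for the explicit feature map; it shows that the degree-2 polynomial kernel equals an ordinary dot product in a higher-dimensional space, which is why a linear boundary there can separate data that is not linearly separable in the original space. The function names and the `gamma` default are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple


def rbf_kernel(x: Sequence[float], z: Sequence[float], gamma: float = 0.5) -> float:
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def poly2_kernel(x: Sequence[float], z: Sequence[float]) -> float:
    """Homogeneous degree-2 polynomial kernel: (x . z)^2."""
    return sum(a * b for a, b in zip(x, z)) ** 2


def poly2_features(x: Sequence[float]) -> Tuple[float, float, float]:
    """Explicit feature map for 2-D input whose dot product equals poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))
```

On XOR-style points such as (1, 1), (-1, -1) versus (1, -1), (-1, 1), no straight line separates the two groups, but the middle coordinate of `poly2_features` is positive for the first group and negative for the second, so a hyperplane in the mapped space separates them.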

14 System Flow Diagram User Input (new email message or dataset) -> Text Preprocessing -> Feature Extraction -> Text Classification Using SVM -> Output (Spam or Non-Spam). The diagram also includes the SPAMBASE Dataset.

15 Datasets UCI machine learning repository Spambase dataset https://archive.ics.uci.edu/ml/datasets/spambase
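The Spambase data file is comma-separated, with 57 numeric attributes per row followed by a class label (1 for spam, 0 for non-spam). A minimal sketch for parsing such rows is shown below; the function name `parse_spambase` and the mapping of labels to +1/-1 are assumptions made for illustration.

```python
import csv
from typing import Iterable, List, Tuple

N_FEATURES = 57  # each Spambase row: 57 numeric attributes, then the label


def parse_spambase(lines: Iterable[str]) -> Tuple[List[List[float]], List[int]]:
    """Split comma-separated rows into feature vectors and +1/-1 labels."""
    X: List[List[float]] = []
    y: List[int] = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if len(row) != N_FEATURES + 1:
            raise ValueError(f"expected {N_FEATURES + 1} columns, got {len(row)}")
        X.append([float(v) for v in row[:N_FEATURES]])
        # The file stores 1 for spam and 0 for non-spam; map to +1 / -1.
        y.append(1 if int(row[-1]) == 1 else -1)
    return X, y
```

With a locally downloaded copy of the data file, the function can be called on the open file object, since a file yields its lines when iterated.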

16 Results evaluation
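The transcript does not say which measures are reported on this slide. Spam filters are commonly evaluated with accuracy, precision, recall, and F1 computed from the confusion matrix, so the sketch below assumes those four. The labels in the example are made up purely to exercise the function and are not results from this system.

```python
from typing import Dict, Sequence


def evaluate(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1, treating `positive` as spam."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels for illustration only: +1 is spam, -1 is non-spam.
scores = evaluate([1, 1, 1, -1, -1, -1, -1, -1], [1, 1, -1, 1, -1, -1, -1, -1])
```

Precision matters particularly for spam filtering, because a legitimate message wrongly marked as spam (a false positive) is usually more costly than a spam message that slips through.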

17 Conclusions E-mail spam filtering is an important issue in network security and machine learning. In this system, we plan to use mainly a support vector machine to classify emails as spam or not. We first clean the input email data and then extract relevant features for the spam classification process. We intend to use a well-known email spam dataset, downloaded from the UCI Machine Learning Repository, to implement this email spam classification process.

18 References Nurul Fitriah Rusland, Norfaradilla Wahid, Shahreen Kasim, Hanayanti Hat, Analysis of Naïve Bayes Algorithm for Email Spam Filtering across Multiple Datasets, International Research and Innovation Summit (IRIS2017)

