Spam Detection Using Support Vector Machine Presented by Nan Mya Oo, University of Computer Studies, Taunggyi
Outline Abstract, Objectives, Introduction, Support Vector Machine, System Flow Diagram, Conclusions, References
Abstract Nowadays, email provides many ways to send millions of advertisements at no cost to the sender. As a result, unsolicited bulk email, also known as spam, has spread widely and become a serious threat not only to the Internet but also to society. For example, when a user receives a large amount of spam, the chance that the user overlooks a non-spam message increases. Many readers also have to spend time removing unwanted messages. Email spam may also cost money for users with dial-up connections, waste bandwidth, and expose minors to unsuitable content.
Cont’d Several machine learning algorithms have been employed in anti-spam email filtering, including algorithms that are considered top performers in text classification, such as Boosting, Support Vector Machines (SVM), and Naïve Bayes. Among them, we intend to use the support vector machine (SVM) to classify emails as spam or not. An SVM can be a binary or a multiclass classifier. In this thesis, we intend to use a binary SVM classifier because there are only two classes: “SPAM” and “NOT-SPAM”. An SVM finds a separating line between the data of two classes; it trains a model that assigns new, unseen objects to a particular category.
Objectives To classify email messages as spam or legitimate mail. To flag the increasing volume of unwanted mail automatically. To help users avoid wasting time on large amounts of spam emails. To categorize the technologies used in email spam filtering.
Introduction Internet email has become an indispensable method of communicating with each other because of its popularity, low cost, and fast delivery of messages. Along with the growth of the Internet and email, there has been a dramatic growth in spam in recent years. Spam can originate from any location across the globe where Internet access is available. Spamming is the abuse of electronic messaging systems to send unsolicited bulk messages or to promote products or services, which are almost universally undesired.
Cont’d The problem of spam is of serious and escalating concern, and it is challenging to develop spam filters that can automatically eliminate the increasing volume of unwanted mail before it enters a user’s mailbox. One popular solution is automated email filtering using Machine Learning (ML). The support vector machine (SVM) is a very popular ML technique for classifying emails into two types, spam or not spam, to overcome the spam problem.
Overall system process To filter spam, the system generally performs the following tasks: text preprocessing; feature selection and extraction; text classification with SVM; reporting the result as spam or not spam.
Pre-processing steps Pre-processing the input text means data cleaning. It is essential for reducing the probability of wrong results, because some words have no influence on the classification: they can be associated with neither the spam class nor the ham (non-spam) class. In addition, some words can be normalized in order to group words with the same meaning and reduce redundancy. The pre-processing steps are therefore to remove stop words, numbers, special symbols, and URLs, and to apply stemming and lemmatization, which helps improve the results of the SVM algorithm.
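The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual code: the stop-word list, the regular expressions, and the helper names `simple_stem` and `preprocess` are all assumptions made for the example, and the crude suffix stripper only stands in for a real stemmer or lemmatizer.

```python
import re

# Illustrative stop-word list (assumption); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "on", "you", "your"}

URL_RE = re.compile(r"(https?://|www\.)\S+")   # http(s) links and www. addresses
NON_LETTER_RE = re.compile(r"[^a-z\s]")        # digits, punctuation, special symbols


def simple_stem(word):
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lower-case, drop URLs, numbers and symbols, remove stop words, then stem."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [simple_stem(t) for t in tokens]


print(preprocess("Click http://spam.example.com NOW to win 1000 free offers!!!"))
```

URLs are removed before the symbol filter so that fragments of a link do not survive as separate tokens.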
Feature selection and extraction Feature extraction is used to extract important and relevant features from the email body. It transforms the emails into a two-dimensional matrix in which each row is an email and each column is a numbered feature. These features are mapped from the vocabulary list. For each term in a document, the term frequency (TF) and the inverse document frequency (IDF) are computed. They are calculated as follows.
TF-IDF (Term Frequency – Inverse Document Frequency) TF-IDF provides a statistic of how important a particular word is to a given document.
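The formulas on the original slide did not survive extraction. As a sketch, assuming one common textbook variant, tf(t, d) is the count of term t in document d divided by the number of terms in d, idf(t) is ln(N / df(t)) where N is the number of documents and df(t) the number of documents containing t, and the weight is their product. The author's exact variant (for example, with smoothing) may differ, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns (vocabulary, list of TF-IDF vectors)."""
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for d in docs:
        counts = Counter(d)
        total = len(d) or 1  # avoid division by zero for empty documents
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


# Toy corpus (made up for illustration): one row per email, one column per term.
vocab, vectors = tf_idf([["free", "money", "free"], ["meeting", "money"]])
print(vocab)
print(vectors)
```

A term that occurs in every document, such as "money" in the toy corpus, gets an IDF of zero and therefore carries no weight, which matches the intuition that it cannot help tell spam from non-spam.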
Support Vector Machine The support vector machine is an algorithm that determines the best decision boundary between vectors that belong to a given group (or category) and vectors that do not. It can be applied to any kind of vectors that encode any kind of data. This means that, in order to leverage the power of SVM text classification, texts have to be transformed into vectors. Vectors are (sometimes huge) lists of numbers that represent a set of coordinates in some space. When an SVM determines the decision boundary, it decides where to draw the best “line” (or the best hyperplane), that is, the one with the largest margin to the nearest training vectors, dividing the space into two subspaces: one for the vectors that belong to the given category and one for the vectors that do not. The fundamental goal of an SVM is to create a model that predicts the class labels of the instances in the test set, given only their features.
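To make the idea of a learned hyperplane concrete, here is a minimal sketch of a linear SVM trained by sub-gradient descent on the regularised hinge loss. It is an illustration under stated assumptions rather than the training method used in this work: the function names, the learning rate `eta`, the regularisation strength `lam`, and the toy feature vectors are all hypothetical, and a production system would more likely rely on an established SVM library.

```python
def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=100):
    """Sub-gradient descent on the L2-regularised hinge loss.

    X: list of feature vectors, y: labels in {-1, +1}.
    Returns the weight vector w and bias b of the hyperplane w.x + b = 0.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The regularisation term shrinks w on every step.
            w = [wj * (1 - eta * lam) for wj in w]
            if margin < 1:
                # Point is inside the margin or misclassified: push the boundary away.
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b


def predict(w, b, x):
    """Return +1 (spam) or -1 (not spam) depending on the side of the hyperplane."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy data (hypothetical): two TF-IDF-like features per message.
X = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
y = [1, 1, -1, -1]  # +1 = spam, -1 = not spam
w, b = train_linear_svm(X, y)
print(w, b, [predict(w, b, x) for x in X])
```

Only points with a margin below 1 trigger an update, which is why the final boundary is determined by the training vectors closest to it (the support vectors).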
SVM (Support Vector Machine) An SVM can be used as either a binary or a multiclass classifier. It classifies data based on their features. For example: (a) for linear classification, a hyperplane is used to separate the data; (b) for nonlinear classification, the data cannot be separated by a straight line, so a kernel function is used.
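A kernel function lets an SVM work as if the data had been mapped into a higher-dimensional space where a separating hyperplane exists, without computing that mapping explicitly. The sketch below shows three commonly used kernels and checks the kernel identity for the degree-2 polynomial case on 2-D inputs. The function names and the default `gamma` are assumptions for illustration; the slides do not say which kernel is used.

```python
import math


def dot(x, z):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    """Linear kernel: the SVM separates the data with a hyperplane in input space."""
    return dot(x, z)


def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def phi(x):
    """Explicit feature map that corresponds to poly2_kernel for 2-D inputs."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]


x, z = [1.0, 2.0], [3.0, 4.0]
# Both values agree: the kernel computes the inner product in the mapped space.
print(poly2_kernel(x, z), dot(phi(x), phi(z)))
```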
System Flow Diagram User Input (new message or dataset) -> Text Preprocessing -> Feature Extraction -> Text Classification Using SVM -> Output (Spam or Non-Spam). Dataset: SPAMBASE.
Datasets The Spambase dataset from the UCI Machine Learning Repository.
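As a sketch of how the dataset could be read, the Spambase data file is a comma-separated file without a header in which each row holds the numeric features followed by a class label in the last column (1 for spam, 0 for non-spam). The function name `load_spambase` and the local file name in the usage comment are assumptions for illustration.

```python
import csv


def load_spambase(lines):
    """Parse Spambase-style CSV rows: numeric features, then a 0/1 label in the last column."""
    X, y = [], []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        *features, label = row
        X.append([float(v) for v in features])
        # Map 1 (spam) to +1 and 0 (non-spam) to -1, the label convention used by SVMs.
        y.append(1 if int(label) == 1 else -1)
    return X, y


# Usage with a local copy of the data file (the path is an assumption):
# with open("spambase.data", newline="") as f:
#     X, y = load_spambase(f)

# Shortened, made-up rows to show the expected shape of the result.
X, y = load_spambase(["0.1,0.0,3.5,1", "0.0,0.2,1.0,0"])
print(X, y)
```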
Results evaluation
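The slide does not list the evaluation measures. Assuming the measures usually reported for binary spam classification (accuracy, precision, recall, and F1), a minimal sketch of how they could be computed from true and predicted labels follows. The function name `evaluate` and the toy labels are hypothetical and are not results from this work.

```python
def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # spam caught
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # spam missed
    tn = sum(1 for t, p in pairs if t != positive and p != positive)  # ham passed through
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (made up): +1 = spam, -1 = not spam.
print(evaluate([1, 1, 1, -1, -1], [1, 1, -1, -1, 1]))
```

For a spam filter, precision matters in particular because a false positive means a legitimate message is hidden from the user.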
Conclusions Email spam filtering is an important issue in network security and an important application of machine learning techniques. In this system, we plan to use mainly the support vector machine to classify emails as spam or not. We first clean the input data and then extract the relevant features for the spam classification process. We intend to use the well-known Spambase dataset, downloaded from the UCI Machine Learning Repository, to implement this spam classification process.
References Nurul Fitriah Rusland, Norfaradilla Wahid, Shahreen Kasim, Hanayanti Hafit, "Analysis of Naïve Bayes Algorithm for Email Spam Filtering across Multiple Datasets," International Research and Innovation Summit (IRIS2017).