Spam Detection Using Support Vector Machine. Presented by Nan Mya Oo, University of Computer Studies, Taunggyi.

Outline: Abstract, Objectives, Introduction, Support Vector Machine, System Flow Diagram, Conclusions, References

Abstract: Nowadays, email provides many ways to send millions of advertisements at no cost to the sender. As a result, unsolicited bulk email, also known as spam, has spread widely and become a serious threat not only to the Internet but also to society. For example, when a user receives a large amount of spam, the chance that the user overlooks a non-spam message increases. Many readers also have to spend time removing unwanted messages. Spam may also cost money for users with dial-up connections, waste bandwidth, and expose minors to unsuitable content.

Cont’d: Several machine learning algorithms have been employed in anti-spam filtering, including algorithms considered top performers in text classification, such as Boosting, Support Vector Machines (SVM), and Naïve Bayes. Among them, we intend to use SVM to classify email as spam or not. An SVM classifier can be binary or multiclass. In this thesis, we intend to use a binary SVM classifier because there are only two classes: “SPAM” and “NOT-SPAM”. An SVM finds a separating line between the data of the two classes, and training produces a model that assigns new, unseen objects to one of the categories.

Objectives: To classify email messages as spam or legitimate mail. To flag the increasing volume of unwanted mail automatically. To help users avoid wasting time on large amounts of spam email. To classify the technologies used for spam filtering.

Introduction: Internet email has become an indispensable way to communicate with each other because of its popularity, low cost, and fast delivery of messages. Along with the growth of the Internet and email, there has been a dramatic growth in spam in recent years. Spam can originate from any location across the globe where Internet access is available. Spamming is the abuse of electronic messaging systems to send unsolicited bulk messages or to promote products or services, which are almost universally undesired.

Cont’d: The problem of spam is a serious and escalating concern, and it is challenging to develop spam filters that can effectively eliminate the increasing volume of unwanted mail automatically before it enters a user's mailbox. One popular solution is automated email filtering using machine learning (ML). The support vector machine (SVM) is one of the ML techniques; it is widely used to classify email as spam or not spam and so address the spam problem.

Overall system process: To filter spam, the system performs the following tasks: (1) text preprocessing, (2) feature selection and extraction, (3) text classification with SVM, (4) output of the result, spam or not spam.

Pre-processing steps: Pre-processing the input text means data cleaning. It is essential for reducing the probability of wrong results, because some words have no influence on the classification: they can be associated with neither the spam class nor the ham (non-spam) class. Other words can be normalized to group words with the same meaning and reduce redundancy. The pre-processing steps are therefore to remove stop words, numbers, special symbols, and URLs, and to apply stemming and lemmatization, which helps improve the results of the SVM algorithm.
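The cleaning steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the stop-word list is a tiny made-up sample, `crude_stem` is a hypothetical stand-in for a real stemmer, and lemmatization is left out because it needs a dictionary-backed tool.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "you", "your"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def crude_stem(word):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop URLs, numbers, symbols and stop words, then stem."""
    text = text.lower()
    text = URL_RE.sub(" ", text)          # remove URLs first, before symbols go
    text = NON_LETTER_RE.sub(" ", text)   # remove digits and special symbols
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]


print(preprocess("Click http://example.com NOW to claim your 3 free prizes!!!"))
```

A production system would more likely rely on an established NLP library for stop words, stemming, and lemmatization; the sketch only shows the order of the steps.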

Feature selection and extraction: Feature extraction is used to extract important and relevant features from the email body. It transforms each email into a numeric vector with one entry per feature, so the collection of emails forms a 2D matrix. These features are mapped from the vocabulary list. The term frequency (TF) and inverse document frequency (IDF) of the terms in a document are then computed. They are calculated as follows.
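The formulas themselves did not survive in this transcript. One commonly used definition is given here as an assumption, not necessarily the exact variant shown on the slide. In it, $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $n_t$ is the number of documents that contain $t$:

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{n_t}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```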

TF-IDF (Term Frequency – Inverse Document Frequency): TF-IDF provides a statistic of how important a particular word is to a given document.
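A minimal sketch of how such weights might be computed, assuming the common variant in which TF is the term count divided by the document length and IDF is the logarithm of the number of documents over the number of documents containing the term. The three token lists are hypothetical, already preprocessed messages.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses tf = count / document length and idf = log(N / document frequency),
    which is one common variant and not necessarily the one on the slide.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already preprocessed messages.
docs = [
    ["free", "prize", "click"],
    ["meeting", "agenda", "click"],
    ["free", "free", "offer"],
]
for weights in tf_idf(docs):
    print(weights)
```

A term that occurs in every document gets weight zero under this variant, which is why library implementations often add smoothing.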

Support Vector Machine: A support vector machine is an algorithm that determines the best decision boundary between vectors that belong to a given group (or category) and vectors that do not. It can be applied to any kind of vector encoding any kind of data. This means that, to leverage the power of SVM text classification, texts have to be transformed into vectors. Vectors are (sometimes huge) lists of numbers that represent a set of coordinates in some space. When the SVM determines the decision boundary, it decides where to draw the best “line” (or hyperplane) that divides the space into two subspaces: one for the vectors that belong to the given category and one for those that do not. The fundamental goal of an SVM is to build a model that predicts the class labels of the instances in the test set given only their features.
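To make the idea of a learned hyperplane concrete, here is a toy linear soft-margin SVM trained by stochastic subgradient descent on the hinge loss. It is a sketch only: the 2-D points are made up, and the regularization strength `lam`, step size `eta`, and epoch count are assumed values. A real system would more likely use an established SVM library.

```python
import random


def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=100, seed=0):
    """Fit w, b for the hyperplane w.x + b = 0 by subgradient descent.

    Minimizes lam/2 * |w|^2 + mean hinge loss; labels must be -1 or +1.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # Regularization step: shrink w toward zero.
            w = [(1.0 - eta * lam) * wj for wj in w]
            if margin < 1.0:
                # Hinge-loss step: the point is misclassified or inside the margin.
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b


def predict(w, b, x):
    """Return +1 (spam) or -1 (not spam) depending on the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Made-up 2-D feature vectors; +1 stands for spam, -1 for not spam.
X = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

The two subspaces mentioned above correspond to the two signs of `w.x + b`; new messages are classified by which side of the hyperplane their feature vector falls on.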

SVM (Support Vector Machine): An SVM can be used as a binary or a multiclass classifier. It classifies data based on features. For example: (a) a linear SVM uses a hyperplane to classify the data; (b) for nonlinear classification, no linear separation exists between the classes, so a kernel function is used.
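The role of the kernel function can be illustrated with a small made-up example. The 1-D points below cannot be split by a single threshold, but after the (assumed) feature map `phi(x) = (x, x^2)` they can. The matching kernel returns the same inner products without building the mapped vectors.

```python
def phi(x):
    """Explicit feature map from a 1-D input to a 2-D feature space."""
    return (x, x * x)


def kernel(x, z):
    """Kernel equal to the dot product phi(x) . phi(z), computed without phi."""
    return x * z + (x * z) ** 2


# Toy 1-D data (made up): the -1 class sits between the +1 points,
# so no single threshold on x separates the classes.
xs = [-2.0, -0.5, 0.5, 2.0]
ys = [1, -1, -1, 1]

# In feature space the second coordinate x^2 separates them: x^2 > 1 means +1.
predictions = [1 if phi(x)[1] > 1.0 else -1 for x in xs]
print(predictions == ys)

# The kernel gives the same inner products as the explicit map.
for x in xs:
    for z in xs:
        explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
        assert abs(kernel(x, z) - explicit) < 1e-12
```

Kernels used in practice, such as the polynomial or RBF kernel, follow the same principle with richer implicit feature spaces.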

System Flow Diagram: User Input (new message or dataset) → Text Preprocessing → Feature Extraction → Text Classification Using SVM → Output (Spam or Non-Spam). Dataset: SPAMBASE.

Datasets: Spambase dataset from the UCI Machine Learning Repository.
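As far as the repository documentation describes it, Spambase is distributed as a comma-separated file in which each row holds 57 precomputed numeric attributes followed by a class label (1 for spam, 0 for non-spam), rather than raw message text. A hedged sketch of a loader follows. The function name, the local path in the comment, and the three-feature demo rows are assumptions for illustration.

```python
import csv
import io


def load_spambase(lines):
    """Parse Spambase-style CSV rows into (features, labels).

    Every column but the last is read as a numeric feature; the last column
    is the class label, mapped from 1/0 to +1/-1 for use with an SVM.
    """
    features, labels = [], []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        *values, label = row
        features.append([float(v) for v in values])
        labels.append(1 if int(label) == 1 else -1)
    return features, labels


# Made-up rows with only three features, to show the expected shape.
demo = io.StringIO("0.1,0.0,3.5,1\n0.0,0.2,1.0,0\n")
X, y = load_spambase(demo)
print(X, y)

# With a downloaded copy (the path is an assumption):
# with open("spambase.data", newline="") as f:
#     X, y = load_spambase(f)
```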

Results evaluation
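The transcript gives no results for this slide. As a sketch of how a spam classifier is commonly evaluated, the function below computes accuracy, precision, recall, and F1 for the spam class. The choice of metrics is an assumption, and the label lists are made up for illustration; they are not results from this system.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: +1 is spam, -1 is not spam.
y_true = [1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, 1, -1, -1, -1, -1]
print(evaluate(y_true, y_pred))
```

For spam filtering, precision on the spam class matters in particular, because a legitimate message wrongly marked as spam is usually costlier than a spam message that slips through.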

Conclusions: Spam filtering is an important issue in network security and machine learning. In this system, we plan to mainly use a support vector machine to classify emails as spam or not. We first clean the input data and then extract relevant features for the spam classification process. We intend to use the well-known Spambase dataset, downloaded from the UCI Machine Learning Repository, to implement this spam classification process.

References: Nurul Fitriah Rusland, Norfaradilla Wahid, Shahreen Kasim, and Hanayanti Hafit, "Analysis of Naïve Bayes Algorithm for Email Spam Filtering across Multiple Datasets," International Research and Innovation Summit (IRIS2017).