Machine Learning for Spam Filtering
Sai Koushik Haddunoori

Problem:
 E-mail provides a perfect way to send millions of advertisements at no cost to the sender, and this unfortunate fact is nowadays extensively exploited by many organizations.
 Although incredibly cheap to send, spam causes a lot of trouble for the Internet community: large amounts of spam traffic between servers delay the delivery of legitimate e-mail, and people with dial-up Internet access have to spend bandwidth downloading junk mail.
 A fair amount of spam is pornographic and should not be exposed to children.
 Sorting out the unwanted messages takes time and introduces a risk of deleting normal mail by mistake.

 Many ways of fighting spam have been proposed. There are “social” methods such as legal measures (one example is the anti-spam law introduced in the US).
 There are “technological” methods such as blocking the spammer’s IP address, and, finally, there is filtering. Unfortunately, no universal and perfect way of eliminating spam exists yet, so the amount of junk mail keeps increasing.
 There are two general approaches to mail filtering: knowledge engineering (KE) and machine learning (ML).
 In the first approach, a set of rules is created according to which messages are categorized as spam or legitimate mail.
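The knowledge-engineering approach above can be sketched as a set of hand-written rules; the rules below are invented examples, not rules from any real filter.

```python
# Hedged sketch of the knowledge-engineering (KE) approach: spam is
# detected by rules a human wrote by hand. These two rules are toy
# illustrations only.
RULES = [
    lambda m: "free money" in m.lower(),  # a suspicious phrase
    lambda m: m.count("!") > 3,           # excessive punctuation
]

def ke_is_spam(message):
    """A message is spam if any hand-written rule fires."""
    return any(rule(message) for rule in RULES)
```

Once such rules are published, a spammer can simply rephrase the message to avoid every rule, which is exactly the weakness the next slide describes.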

 The major drawback of this method is that the set of rules must be constantly updated, and maintaining it is not convenient for most users.
 Another drawback is that when these rules are publicly available, the spammer can adjust the text of his message so that it passes through the filter.
 The machine learning approach does not require specifying any rules explicitly. Instead, a set of pre-classified documents (training samples) is needed. A specific algorithm is then used to “learn” the classification rules from this data.
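The contrast with knowledge engineering can be illustrated with a minimal sketch in which spam-indicative words are learned from a small pre-classified training set; the training messages here are invented, not taken from any corpus.

```python
# Minimal sketch of the machine-learning approach: no rules are written
# by hand; instead, spam indicators are derived from labeled examples.
def learn_spam_words(training):
    """Learn words that appear in spam but never in legitimate mail."""
    spam = {w for msg, lbl in training if lbl == "spam" for w in msg.split()}
    legit = {w for msg, lbl in training if lbl == "legit" for w in msg.split()}
    return spam - legit

training = [("cheap pills online", "spam"),
            ("cheap flight tickets", "legit"),
            ("meeting agenda attached", "legit")]
indicators = learn_spam_words(training)  # learned, not hand-written
```

Note that "cheap" is not learned as an indicator because it also occurs in legitimate mail; real algorithms weigh such evidence probabilistically rather than discarding it.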

Feature extraction:
 Most machine learning algorithms can only classify numerical objects (real numbers or vectors) or otherwise require some measure of similarity between objects.
 The objects we are trying to classify are text messages, i.e. strings. Strings are, unfortunately, not very convenient objects to handle. We have to convert all messages to vectors of numbers (feature vectors) and then classify these vectors.
 When we extract features we usually lose information, so the way we define our feature extractor is crucial for the performance of the filter.
 The test data used is the PU1 corpus of e-mail messages collected by Ion Androutsopoulos.
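The binary bag-of-words scheme used later in the deck can be sketched as follows; the two messages are toy stand-ins, not PU1 data.

```python
# Sketch of binary bag-of-words feature extraction: one attribute per
# distinct word in the corpus, set to 1 if the word occurs in a message.
def build_vocabulary(messages):
    """Collect every distinct word appearing anywhere in the corpus."""
    return sorted({word for msg in messages for word in msg.lower().split()})

def to_feature_vector(message, vocab):
    """Attribute i is 1 if vocab[i] occurs in the message, else 0."""
    words = set(message.lower().split())
    return [1 if w in words else 0 for w in vocab]

messages = ["buy cheap pills now", "meeting notes attached"]
vocab = build_vocabulary(messages)
vec = to_feature_vector("buy pills", vocab)  # binary vector over vocab
```

The vector length equals the vocabulary size, which is why the PU1 feature vectors have roughly as many attributes as there are distinct words in the corpus.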

Classifiers: For this problem, the classifiers used are:
 The naive Bayesian classifier
 The k-nearest-neighbors (k-NN) classifier
 Artificial neural networks
 Support vector machine (SVM) classification
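The first classifier in the list can be sketched in a few lines; this is a standard Bernoulli-style naive Bayes with Laplace smoothing, and the training samples are illustrative, not drawn from PU1.

```python
# Hedged sketch of a naive Bayesian spam classifier over word features.
import math
from collections import defaultdict

def train_nb(samples):
    """samples: list of (set_of_words, label) pairs, label 'spam'/'legit'."""
    doc_count = defaultdict(int)
    word_count = defaultdict(lambda: defaultdict(int))
    for words, label in samples:
        doc_count[label] += 1
        for w in words:
            word_count[label][w] += 1
    return doc_count, word_count

def classify_nb(words, doc_count, word_count):
    """Pick the label maximizing log P(label) + sum of log P(word|label)."""
    total = sum(doc_count.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in doc_count.items():
        score = math.log(n_docs / total)
        for w in words:
            # Laplace smoothing avoids zero probability for unseen words.
            score += math.log((word_count[label][w] + 1) / (n_docs + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

train = [({"cheap", "pills"}, "spam"),
         ({"free", "offer", "pills"}, "spam"),
         ({"meeting", "agenda"}, "legit"),
         ({"project", "meeting", "notes"}, "legit")]
dc, wc = train_nb(train)
label = classify_nb({"cheap", "offer"}, dc, wc)
```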

Application on test data:
 The PU1 corpus of e-mail messages collected by Ion Androutsopoulos was used for testing. The corpus consists of 1099 messages, of which 481 are spam.
 Every message was converted to a binary feature vector with one attribute per distinct word (so the number of attributes is approximately the number of different words in all the messages of the corpus). Attribute n was set to 1 if the corresponding word was present in a message, and to 0 otherwise.
 This feature extraction scheme was used for all the algorithms.
 For every algorithm we count the number N_S→L of spam messages incorrectly classified as legitimate mail (false negatives) and the number N_L→S of legitimate messages incorrectly classified as spam (false positives). Let N = 1099 denote the total number of messages, N_S = 481 the number of spam messages, and N_L = 618 the number of legitimate messages.

The quantities of interest are:
 Error rate: E = (N_S→L + N_L→S) / N
 Precision: P = 1 − E
 Legitimate mail fallout: F_L = N_L→S / N_L
 Spam fallout: F_S = N_S→L / N_S
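The four formulas above translate directly into code; the error counts passed in below are placeholders, not the results reported in the deck.

```python
# The four evaluation quantities, computed from the counts defined on
# the previous slide. N_spam = 481 and N_legit = 618 match the PU1
# corpus; the misclassification counts are illustrative placeholders.
def metrics(n_s_to_l, n_l_to_s, n_spam, n_legit):
    n = n_spam + n_legit
    error_rate = (n_s_to_l + n_l_to_s) / n   # E = (N_S→L + N_L→S) / N
    precision = 1 - error_rate               # P = 1 − E
    legit_fallout = n_l_to_s / n_legit       # F_L = N_L→S / N_L
    spam_fallout = n_s_to_l / n_spam         # F_S = N_S→L / N_S
    return error_rate, precision, legit_fallout, spam_fallout

e, p, fl, fs = metrics(n_s_to_l=24, n_l_to_s=6, n_spam=481, n_legit=618)
```

Note that F_L (legitimate mail flagged as spam) is the quantity users care about most, since a lost legitimate message is usually far more costly than a spam message that slips through.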

 The following table presents the results obtained in the way described above.

Eliminating false positives:
 The naive Bayesian classifier has the λ parameter, which we can increase. The k-NN classifier may be replaced with the l/k classifier; the number l may then be adjusted together with k. The perceptron cannot be tuned, so it leaves the competition at this stage. The hard-margin SVM classifier also cannot be improved, but its modification, the soft-margin classifier, can.
 Results after modification:
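The λ-threshold idea for the naive Bayesian classifier can be sketched as follows; the posterior probabilities here are stand-ins for the classifier's actual estimates.

```python
# Sketch of the λ parameter: a message is marked spam only if its spam
# posterior outweighs its legitimate posterior by a factor of lam, so
# raising lam trades missed spam for fewer false positives.
def decide(p_spam, p_legit, lam=1.0):
    """Classify as spam only when P(spam|x) > lam * P(legit|x)."""
    return "spam" if p_spam > lam * p_legit else "legit"

# With lam = 1 a borderline message is called spam...
a = decide(0.6, 0.4)
# ...but a stricter lam = 2 demands stronger evidence, suppressing
# false positives at the cost of letting more spam through.
b = decide(0.6, 0.4, lam=2.0)
```

The soft-margin SVM's C parameter plays an analogous role: it shifts the trade-off between the two error types rather than reducing both at once.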

Combining classifiers:
 Since we have zero false positives for every classifier, we may now combine any of the classifiers to get better precision.
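The reason the combination is safe can be made concrete: when each individual filter produces no false positives, flagging a message as spam whenever any filter flags it still produces no false positives, while catching at least as much spam. The filter functions below are illustrative stand-ins.

```python
# OR-combination of spam filters: spam if at least one filter says spam.
# If every filter has zero false positives, so does the combination.
def combine_any(filters, message):
    return any(f(message) for f in filters)

# Two toy filters standing in for the tuned classifiers of the deck.
looks_spammy_words = lambda m: "viagra" in m.lower()
looks_spammy_caps = lambda m: m.isupper()

filters = [looks_spammy_words, looks_spammy_caps]
r1 = combine_any(filters, "BUY NOW")
r2 = combine_any(filters, "meeting at noon")
```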

Conclusion:
 Machine learning makes spam filtering considerably easier than maintaining hand-written rules.

Thank You