Spam Email Detection Ethan Grefe December 13, 2013.


Motivation
- Spam email constantly clutters inboxes.
- It is commonly removed using rule-based filters.
- Spam messages often share very similar characteristics, which allows them to be detected using machine learning:
  - Naïve Bayes classifiers
  - Support vector machines (SVMs)

SVM Solution
- Used training data from the CSDMC2010 SPAM corpus: 4327 labeled emails, of which 2949 are non-spam messages (HAM) and 1378 are spam messages (SPAM).
- Extracted features from the subject and body of each email.
- Used the resulting feature vectors to train an SVM classifier in Matlab.
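The slides state that the classifier was trained in Matlab but include no code. As a rough illustration of what "training an SVM on feature vectors" involves, the Python sketch below fits a linear SVM by per-sample subgradient descent on the regularized hinge loss. This is not the author's implementation: the function names `train_linear_svm` and `predict`, and the parameters `lam`, `lr`, and `epochs`, are hypothetical and chosen only for this example.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100):
    """Fit weights w and bias b for a linear SVM.

    X is a list of feature vectors, y a list of labels in {+1, -1}
    (for example +1 = spam, -1 = ham). lam, lr and epochs are
    illustrative defaults, not values taken from the slides.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Margin is computed with the weights from before this step.
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The L2 regularizer always shrinks the weights slightly.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # Sample is misclassified or inside the margin:
                # take a hinge-loss subgradient step toward it.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1
```

A kernelized SVM, such as the RBF and quadratic variants reported in the results, needs a different solver; in practice a library routine would normally be used instead of a hand-written loop like this one.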

Email Features
- Features were determined by research and observation.
- The best results were obtained with the following features:
  - Percentage of letters that are capitalized
  - Types of punctuation used
  - Average word length
  - Amount of HTML in the email
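The slides list the features but not how they were computed. The Python sketch below shows one plausible way to derive all four from an email's subject and body. Everything specific here is an assumption made for illustration: the function name `extract_features`, the set of punctuation marks counted, and the use of "fraction of characters inside tags" as the measure of HTML content.

```python
import re
import string

TAG_PATTERN = re.compile(r"<[^>]+>")


def extract_features(subject, body):
    """Return a dict of simple stylistic features for one email."""
    text = subject + "\n" + body

    # Fraction of characters that sit inside HTML tags (a crude proxy
    # for "amount of HTML"; the slides do not define the measure).
    html_chars = sum(len(tag) for tag in TAG_PATTERN.findall(text))
    html_ratio = html_chars / len(text) if text else 0.0

    # Remaining features are computed on the text with tags removed.
    visible = TAG_PATTERN.sub(" ", text)

    # Share of alphabetic characters that are upper case.
    letters = [c for c in visible if c.isalpha()]
    capital_ratio = (
        sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    )

    # Count of each punctuation mark of interest (this set is an assumption).
    punctuation_counts = {mark: visible.count(mark) for mark in "!?$"}

    # Average length of whitespace-separated words, ignoring
    # punctuation attached to either end of a word.
    words = [w.strip(string.punctuation) for w in visible.split()]
    words = [w for w in words if w]
    avg_word_length = sum(len(w) for w in words) / len(words) if words else 0.0

    return {
        "capital_ratio": capital_ratio,
        "punctuation_counts": punctuation_counts,
        "avg_word_length": avg_word_length,
        "html_ratio": html_ratio,
    }
```

The resulting values could be flattened into a numeric vector (for instance, capital ratio, one count per punctuation mark, average word length, HTML ratio) before being passed to a classifier.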

Classifier Results
- Trained on a random 35% of the emails.
- Tested the SVM classifier on the remaining 65%.
- Trained the SVM using three different kernel functions:

Kernel Function   Spam Classification Rate   Ham Classification Rate   Total Classification Rate
RBF               80.06%                     92.33%                    86.20%
Linear            78.69%                     80.66%                    79.67%
Quadratic         82.75%                     84.85%                    83.80%
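The slides do not say how the split was drawn or how the "Total" rate was computed. The reported totals agree, to within rounding, with the unweighted mean of the spam and ham rates (86.195, 79.675, and 83.80), so the Python sketch below reports both that mean and the ordinary overall accuracy. The function names `split_indices` and `classification_rates`, the seed, and the `"spam"`/`"ham"` label strings are assumptions for illustration.

```python
import random


def split_indices(n, train_fraction=0.35, seed=0):
    """Randomly partition range(n) into training and test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]


def classification_rates(y_true, y_pred):
    """Per-class and overall fractions of correctly classified emails.

    Labels are the strings "spam" and "ham".
    """
    def rate(label):
        # Fraction of emails with this true label that were predicted correctly.
        idx = [i for i, t in enumerate(y_true) if t == label]
        if not idx:
            return 0.0
        return sum(y_pred[i] == label for i in idx) / len(idx)

    spam_rate = rate("spam")
    ham_rate = rate("ham")
    overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "spam": spam_rate,
        "ham": ham_rate,
        "overall": overall,                         # weighted by class size
        "class_mean": (spam_rate + ham_rate) / 2,   # unweighted mean of the two rates
    }
```

Because the corpus contains roughly twice as many ham as spam messages, the two summary figures can differ noticeably, which is why it helps to state which one a "total" refers to.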

Possible Improvements
- Use Naïve Bayes to classify emails using word frequency.
- Obtain a wider variety of input features.
- Test other types of learning algorithms.
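The first improvement is only named in the slides, not described. For orientation, the Python sketch below shows a minimal multinomial Naïve Bayes classifier over word counts with add-one (Laplace) smoothing. The function names `train_naive_bayes` and `classify`, the whitespace tokenization, and the choice to skip words never seen in training are all assumptions made for this example rather than anything proposed in the presentation.

```python
import math
from collections import Counter


def train_naive_bayes(documents, labels):
    """Count word occurrences per class and documents per class."""
    word_counts = {}
    class_counts = Counter(labels)
    vocabulary = set()
    for text, label in zip(documents, labels):
        tokens = text.lower().split()
        word_counts.setdefault(label, Counter()).update(tokens)
        vocabulary.update(tokens)
    return word_counts, class_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log posterior for the given text."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        counts = word_counts[label]
        total_words = sum(counts.values())
        # Log prior: share of training documents carrying this label.
        score = math.log(doc_count / total_docs)
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing keeps unseen class/word pairs from zeroing the score.
            score += math.log((counts[token] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together on long emails.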