Spam Detection
Jingrui He, 10/08/2007

Spam Types
- Email Spam: unsolicited commercial email
- Blog Spam: unwanted comments in blogs
- Splogs: fake blogs created to boost PageRank

From a Learning Point of View
- Spam Detection: a classification problem (ham vs. spam)
- Feature Extraction: "A Learning Approach to Spam Detection based on Social Networks", H.Y. Lam and D.Y. Yeung
- Fast Classifier: "Relaxed Online SVMs for Spam Filtering", D. Sculley and G.M. Wachman

A Learning Approach to Spam Detection based on Social Networks
H.Y. Lam and D.Y. Yeung, CEAS 2007

Problem Statement
- n email accounts
- A sender set and a receiver set drawn from these accounts
- A labeled sender set: a subset of senders whose spam/legitimate status is known
- Goal: assign each remaining (unlabeled) sender account a legitimacy score

System Flow Chart

Social Network from Email Logs
- Directed graph over accounts
- A directed edge from account i to account j if at least one email is sent from i to j
- Edge weight w(i, j) = the number of emails sent from i to j
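Below is a minimal sketch (not from the paper) of building such a weighted directed graph from (sender, receiver) log records, using plain Python dictionaries; the log format is an assumption.

```python
from collections import defaultdict

def build_email_graph(log_records):
    """Build a weighted directed graph from (sender, receiver) pairs.

    Each key of the returned dict is a sender; the value maps each
    receiver to the number of emails sent to it (the edge weight).
    """
    graph = defaultdict(lambda: defaultdict(int))
    for sender, receiver in log_records:
        graph[sender][receiver] += 1
    return graph

# Example: three log entries produce edge weights A->B = 2 and B->A = 1.
logs = [("A", "B"), ("A", "B"), ("B", "A")]
g = build_email_graph(logs)
print(dict(g["A"]))  # {'B': 2}
```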

System Flow Chart

Features from Social Networks  In-count / Out-count The sum of in-coming / out-going edge weights  In-degree / Out-degree The number of accounts that a node receives s from / sends s to

Features from Social Networks  Communication Reciprocity (CR) The percentage of interactive neighbors that a node has The set of accounts that received s from The set of accounts that sent s to

Features from Social Networks
- Communication Interaction Average (CIA): the level of interaction between a sender and each of the corresponding recipients

Features from Social Networks
- Clustering Coefficient (CC): captures the friends-of-friends relationship between accounts; computed from the number of neighbors of a node and the number of connections among those neighbors
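A sketch using the standard local clustering coefficient on an undirected view of the graph, which matches the quantities named on the slide (number of neighbors, connections between neighbors); whether the paper uses a directed variant is not stated here.

```python
def clustering_coefficient(graph, node):
    """Local clustering coefficient of `node`, treating edges as undirected.

    CC = 2 * (#links among neighbors) / (k * (k - 1)), where k = #neighbors.
    Standard definition; the paper may use a directed variant.
    """
    # Neighbors = accounts the node sent email to or received email from.
    neighbors = set(graph.get(node, {}))
    neighbors |= {src for src, nbrs in graph.items() if node in nbrs}
    neighbors.discard(node)
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = 0
    for u in neighbors:
        for v in neighbors:
            if u < v and (v in graph.get(u, {}) or u in graph.get(v, {})):
                links += 1
    return 2.0 * links / (k * (k - 1))
```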

System Flow Chart

Preprocessing
- Sender Feature Vector: the features above collected into one vector per sender
- Weighted Features: each feature is scaled by a weight before classification. Problematic?

System Flow Chart

Assigning Spam Score
- Similarity-weighted k-NN method:
  - Gaussian similarity between sender feature vectors
  - The score of an unlabeled sender is the similarity-weighted mean of the scores of its k nearest labeled neighbors
  - Score scaling of the result
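A minimal sketch of the similarity-weighted k-NN scoring step with a Gaussian similarity; the bandwidth sigma, the score encoding, and the omission of the final score-scaling step are assumptions.

```python
import numpy as np

def knn_spam_score(x, labeled_X, labeled_y, k=5, sigma=1.0):
    """Similarity-weighted k-NN score for one sender feature vector x.

    labeled_X: (n, d) array of labeled sender feature vectors.
    labeled_y: (n,) array of scores (e.g. 1 = legitimate, 0 = spam).
    Returns the similarity-weighted mean score of the k most similar
    labeled senders under a Gaussian similarity; sigma is an assumed
    bandwidth, and the paper's scaling step is omitted here.
    """
    dists = np.sum((labeled_X - x) ** 2, axis=1)
    sims = np.exp(-dists / (2.0 * sigma ** 2))
    top = np.argsort(-sims)[:k]                      # indices of the k most similar
    return float(np.dot(sims[top], labeled_y[top]) / (np.sum(sims[top]) + 1e-12))
```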

Experiments
- Enron Dataset: 9150 senders
- Legitimate accounts: Enron senders with email transactions within the Enron domain
- 5000 generated spam accounts
- 120 senders sampled from each class
- Results averaged over 100 runs

Number of Nearest Neighbors

Feature Weights (CC)

Feature Weights (CIA)

Feature Weights (CR)

Feature Weights
- In/Out-count and In/Out-degree: the smaller the weight, the better
- Final weights: In/Out-count and In/Out-degree: 1; CR: 1; CIA: 10; CC: 15
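A small sketch of applying these final weights to a sender's feature dictionary before classification; the feature names and ordering are assumptions.

```python
import numpy as np

# Final weights reported on the slide (one per feature).
FEATURE_WEIGHTS = {
    "in_count": 1.0, "out_count": 1.0, "in_degree": 1.0, "out_degree": 1.0,
    "cr": 1.0, "cia": 10.0, "cc": 15.0,
}

def weighted_feature_vector(features):
    """features: dict with the keys above -> weighted numpy feature vector."""
    return np.array([features[name] * w for name, w in FEATURE_WEIGHTS.items()])
```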

Conclusion
- Legitimacy score requires no email content
- Can be combined with content-based filters
- More sophisticated classifiers: SVM, boosting, etc.
- Classifiers using combined features

Relaxed Online SVMs for Spam Filtering
D. Sculley and G.M. Wachman, SIGIR 2007

Anti-Spam Controversy
- Support Vector Machines (SVMs)
- Academic researchers: statistically robust, state-of-the-art performance
- Practitioners: training is quadratic in the number of training examples, so impractical
- Solution: Relaxed Online SVMs (ROSVM)

Background: SVMs
- Data set: labeled examples (x_i, y_i)
- Class label y_i: +1 for spam, -1 for ham
- Classifier: a linear decision function f(x) = w · x + b
- Training finds w and b by minimizing an objective subject to constraints with slack variables: one term maximizes the margin, the other minimizes the loss, and a tradeoff parameter C balances them
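For reference, the standard soft-margin primal problem, which matches the quantities named on this slide (margin term, slack variables, tradeoff parameter C); this is the textbook form rather than the paper's exact notation.

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \quad \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\qquad \text{s.t.} \quad y_i \left( \mathbf{w} \cdot \mathbf{x}_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, n
```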

Online SVMs
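A minimal sketch of the plain online SVM loop as the talk describes it: classify each incoming message, then add it to the training set and re-train. scikit-learn's SVC stands in for the paper's SMO-based solver, and the stream format is an assumption.

```python
from sklearn.svm import SVC

def online_svm(stream, C=100.0):
    """Classify each incoming message, then retrain on everything seen so far.

    stream yields (feature_vector, label) pairs with label in {+1, -1}.
    Re-fitting from scratch on every example is what makes the plain
    online SVM (at least) quadratic in the number of training examples.
    """
    seen_X, seen_y, predictions = [], [], []
    model = None
    for x, y in stream:
        pred = model.predict([x])[0] if model is not None else -1  # default: ham
        predictions.append(pred)
        seen_X.append(x)
        seen_y.append(y)
        if len(set(seen_y)) > 1:               # SVC needs both classes present
            model = SVC(kernel="linear", C=C).fit(seen_X, seen_y)
    return predictions
```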

Tuning the Tradeoff Parameter C
- SpamAssassin data set: 6034 examples
- Finding: large C preferred
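A sketch of one way to run such a sweep over C with scikit-learn; the placeholder random data stands in for SpamAssassin features, so the numbers it prints are meaningless.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Placeholder data: in practice X would be bag-of-words (or similar)
# features for the SpamAssassin messages and y their ham/spam labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = rng.integers(0, 2, size=500)

param_grid = {"C": [0.01, 0.1, 1, 10, 100, 1000]}
search = GridSearchCV(LinearSVC(), param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```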

Email Spam and SVMs
- TREC05P-1 message corpus
- TREC06P message corpus

Blog Comment Spam and SVMs
- Leave-one-out cross-validation
- 50 blog posts; 1024 comments
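A sketch of leave-one-out evaluation of a linear SVM with scikit-learn; the random placeholder data stands in for the blog-comment features.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

# Placeholder features/labels standing in for the blog-comment data.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100)

scores = cross_val_score(SVC(kernel="linear", C=100.0), X, y, cv=LeaveOneOut())
print("LOO accuracy:", scores.mean())
```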

Splogs and SVMs
- Leave-one-out cross-validation
- 1380 examples

Computational Cost
- Online SVMs: quadratic training time

Relaxed Online SVMs (ROSVM)
- Objective function of SVMs: margin term plus C times the training-error (slack) term
- Large C preferred: minimizing training error matters more than maximizing the margin
- ROSVM: full margin maximization is not necessary, so relax this requirement

Three Ways to Relax SVMs (1)
- Only optimize over the most recent p examples
- Work with the dual form of SVMs, enforcing the constraints only for the p most recent examples
- Older examples keep the last value found for their dual variables

Three Ways to Relax SVMs (2)
- Only update on actual errors
- Original online SVMs: update whenever a new example violates the margin (y_i f(x_i) < 1)
- ROSVM: update only when y_i f(x_i) < m, for a threshold m < 1
- m = 0 gives mistake-driven online SVMs
- No significant degradation in performance, but a significant reduction in cost

Three Ways to Relax SVMs (3)
- Reduce the number of iterations in iterative SVM solvers
- SMO makes repeated passes over the training set to minimize the objective function
- Parameter T: the maximum number of such passes
- T = 1 has little impact on performance
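A hedged sketch tying the three relaxations (window size p, update threshold m, iteration cap T) together in one loop; scikit-learn's SVC is a stand-in for the authors' SMO implementation, and mapping T onto the solver's max_iter is only a rough analogue.

```python
from sklearn.svm import SVC

def rosvm(stream, p=1000, m=0.0, T=1000, C=100.0):
    """Sketch of the Relaxed Online SVM loop from the talk (not the authors' code).

    p: keep only the p most recent examples (relaxation 1).
    m: re-train only when the new example scores below margin m (relaxation 2);
       m = 0 means update only on outright mistakes.
    T: rough analogue of the cap on SMO passes (relaxation 3); here it simply
       bounds the stand-in solver's internal iterations via max_iter.
    """
    window_X, window_y, model = [], [], None
    for x, y in stream:                        # (feature vector, label in {+1, -1})
        score = model.decision_function([x])[0] if model is not None else 0.0
        window_X, window_y = (window_X + [x])[-p:], (window_y + [y])[-p:]
        if (y * score < m or model is None) and len(set(window_y)) > 1:
            model = SVC(kernel="linear", C=C, max_iter=T).fit(window_X, window_y)
    return model
```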

Testing Reduced Size

Testing Reduced Iterations

Testing Reduced Updates

Online SVMs and ROSVM
- ROSVM compared against Online SVMs on the email spam, blog comment spam, and splog data sets