Countering Spam Using Classification Techniques Steve Webb Data Mining Guest Lecture February 21, 2008

Overview
- Introduction
- Countering Spam: Problem Description, Classification History, Ongoing Research
- Countering Web Spam: Problem Description, Classification History, Ongoing Research
- Conclusions

Introduction
The Internet has spawned numerous information-rich environments:
- Email Systems
- World Wide Web
- Social Networking Communities
Openness facilitates information sharing, but it also makes these environments vulnerable…

Denial of Information (DoI) Attacks
The deliberate insertion of low quality information (or noise) into information-rich environments; the information analog of Denial of Service (DoS) attacks.
Two goals:
- Promotion of ideals by means of deception
- Denial of access to high quality information
Spam is currently the most prominent example of a DoI attack.

Overview

Countering Spam
Close to 200 billion (yes, billion) emails are sent each day. Spam accounts for around 90% of that traffic, or roughly 2 million spam messages every second.

Old Spam Examples

Problem Description
Email spam detection can be modeled as a binary text classification problem with two classes: spam and legitimate (non-spam).
It is an example of supervised learning: build a model (classifier) based on training data to approximate the target function. Formally, construct a function f : M → {spam, legitimate} such that it overlaps the target function Φ : M → {spam, legitimate} as much as possible.

Problem Description (cont.) How do we represent a message? How do we generate features? How do we process features? How do we evaluate performance?

How do we represent a message?
Classification algorithms require a consistent input format. Salton's vector space model ("bag of words") is the most popular representation: each message m is represented as a feature vector of n features, f(m) = (f1, f2, …, fn).

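To make the bag-of-words representation concrete, here is a minimal sketch in Python. scikit-learn and the toy messages are illustrative assumptions, not tools or data from the lecture:

```python
# Minimal bag-of-words sketch: each message becomes a sparse vector whose
# entries count how often each vocabulary token appears in that message.
from sklearn.feature_extraction.text import CountVectorizer

messages = [
    "Buy cheap meds now, limited offer",          # spam-like example
    "Meeting moved to 3pm, see agenda attached",  # legitimate-like example
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)      # shape: (n_messages, n_features)

print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                          # each row is the feature vector for one message
```
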
How do we generate features? Sources of information  SMTP connections Network properties  headers Social networks  body Textual parts URLs Attachments

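As a rough illustration of these sources, the sketch below pulls header fields, body text, and embedded URLs out of a message using only the Python standard library; the sample message is made up:

```python
# Extract raw material for features from an email: header fields,
# the textual body, and any embedded URLs.
import email
import re

raw = """From: promo@example.com
To: victim@example.org
Subject: You won a prize!

Claim it now at http://example.com/claim before midnight.
"""

msg = email.message_from_string(raw)

header_features = {
    "from": msg.get("From", ""),
    "subject": msg.get("Subject", ""),
}

body = msg.get_payload()                  # textual part of the message
urls = re.findall(r"https?://\S+", body)  # embedded URLs

print(header_features)
print(urls)
```
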
How do we process features?
- Feature Tokenization: alphanumeric tokens, n-grams, phrases
- Feature Scrubbing: stemming, stop word removal
- Feature Selection: simple feature removal, information-theoretic algorithms

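A minimal sketch of these processing steps, assuming scikit-learn: tokenization into unigram and bigram features, stop-word removal, and feature selection on a toy labeled set (chi-squared is used here as a simple stand-in for the information-theoretic criteria named above):

```python
# Tokenize (unigrams + bigrams), scrub stop words, then keep only the
# k features most associated with the spam/legitimate label.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

messages = ["win free cash now", "project status meeting notes",
            "free prize claim now", "notes from the status meeting"]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = legitimate

vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(messages)

selector = SelectKBest(chi2, k=5)
X_selected = selector.fit_transform(X, labels)
print(X_selected.shape)   # (4 messages, 5 selected features)
```
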
How do we evaluate performance?
- Traditional IR metrics: precision vs. recall
- False positives vs. false negatives: imbalanced error costs
- ROC curves

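A sketch of how these metrics are computed in practice, again assuming scikit-learn; the labels and scores below are made-up numbers:

```python
# Precision, recall, the false positive / false negative counts, and ROC AUC
# for a hypothetical spam classifier's output.
from sklearn.metrics import (precision_score, recall_score,
                             confusion_matrix, roc_auc_score)

y_true   = [1, 1, 1, 0, 0, 0, 0, 1]                    # 1 = spam, 0 = legitimate
y_pred   = [1, 1, 0, 0, 0, 1, 0, 1]                    # hard decisions
y_scores = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.3, 0.7]    # classifier confidences

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("false positives (legitimate flagged as spam):", fp)
print("false negatives (spam that got through):     ", fn)

print("ROC AUC:", roc_auc_score(y_true, y_scores))   # area under the ROC curve
```
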
Classification History
- Sahami et al. (1998): used a Naïve Bayes classifier; were the first to apply text classification research to the spam problem
- Pantel and Lin (1998): also used a Naïve Bayes classifier; found that Naïve Bayes outperforms RIPPER

Classification History (cont.)
- Drucker et al. (1999): evaluated Support Vector Machines as a solution to spam; found that SVM is more effective than RIPPER and Rocchio
- Hidalgo and Lopez (2000): found that decision trees (C4.5) outperform Naïve Bayes and k-NN

Classification History (cont.)
Up to this point, private corpora were used exclusively in spam research.
- Androutsopoulos et al. (2000a): created the first publicly available spam corpus (Ling-spam); performed various feature set size, training set size, stemming, and stop-list experiments with a Naïve Bayes classifier

Classification History (cont.)
- Androutsopoulos et al. (2000b): created another publicly available spam corpus (PU1); confirmed previous research that Naïve Bayes outperforms a keyword-based filter
- Carreras and Marquez (2001): used PU1 to show that AdaBoost is more effective than decision trees and Naïve Bayes

Classification History (cont.)
- Androutsopoulos et al. (2004): created 3 more publicly available corpora (PU2, PU3, and PUA); compared Naïve Bayes, Flexible Bayes, Support Vector Machines, and LogitBoost (FB, SVM, and LB outperform NB)
- Zhang et al. (2004): used Ling-spam, PU1, and the SpamAssassin corpora; compared Naïve Bayes, Support Vector Machines, and AdaBoost (SVM and AB outperform NB)

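In the spirit of these comparisons, here is a toy sketch that trains a Naïve Bayes classifier and a linear SVM on the same bag-of-words features and reports cross-validated accuracy. The eight hard-coded messages stand in for the real corpora (Ling-spam, PU1, SpamAssassin, etc.), and scikit-learn is an assumed tool, not one named in the lecture:

```python
# Compare Naive Bayes and a linear SVM on identical bag-of-words features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

messages = [
    "cheap meds limited offer click now", "you won a free prize claim today",
    "lowest mortgage rates act now",      "earn money fast from home",
    "meeting agenda for tomorrow",        "please review the attached report",
    "lunch on friday with the team",      "notes from yesterday's call",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = spam, 0 = legitimate

X = CountVectorizer().fit_transform(messages)

for name, clf in [("Naive Bayes", MultinomialNB()), ("Linear SVM", LinearSVC())]:
    scores = cross_val_score(clf, X, labels, cv=2)
    print(name, "mean accuracy:", scores.mean())
```
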
Classification History (cont.)
- CEAS (2004 – present): focuses solely on email and anti-spam research; generates a significant amount of academic and industry anti-spam research
- Klimt and Yang (2004): published the Enron Corpus, the first large-scale corpus of legitimate email messages
- TREC Spam Track (2005 – present): produces new corpora every year; provides a standardized platform to evaluate classification algorithms

Ongoing Research: Concept Drift, New Classification Approaches, Adversarial Classification, Image Spam

Concept Drift
Spam content is extremely dynamic:
- Topic drift (e.g., specific scams)
- Technique drift (e.g., obfuscations)
How do we keep up with the Joneses? Batch vs. online learning.

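One common way to cope with drift is online (incremental) learning: update the model on each new labeled batch instead of retraining from scratch. A minimal sketch, assuming scikit-learn's partial_fit interface; a hashing vectorizer keeps the feature space fixed across batches, and all data here is illustrative:

```python
# Incrementally update a Naive Bayes spam model as new labeled batches arrive.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = HashingVectorizer(n_features=2**16, alternate_sign=False)
clf = MultinomialNB()

batches = [
    (["cheap meds now", "team meeting at noon"], [1, 0]),
    (["new crypto giveaway", "quarterly report attached"], [1, 0]),  # drifted topics
]

for texts, labels in batches:
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=[0, 1])   # incremental update, no full retrain

print(clf.predict(vectorizer.transform(["free crypto offer"])))
```
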
New Classification Approaches: Filter Fusion, Compression-based Filtering, Network Behavioral Clustering

Adversarial Classification
Classifiers assume a clear distinction between spam and legitimate features. Camouflaged messages:
- Mask spam content with legitimate content
- Disrupt decision boundaries for classifiers

Camouflage Attacks
- Baseline performance: accuracies consistently higher than 98%
- Classifiers under attack: accuracies degrade to between 50% and 70%
- Retrained classifiers: accuracies climb back to between 91% and 99%

Camouflage Attacks (cont.) Retraining postpones the problem, but it doesn’t solve it We can identify features that are less susceptible to attack, but that’s simply another stalling technique

Image Spam
What happens when an email does not contain textual features? OCR is easily defeated; classification must rely on image properties instead.

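A sketch of the kind of image-property features such a classifier might use (dimensions, aspect ratio, file size, number of distinct colors), assuming the Pillow library; the file name is a hypothetical placeholder:

```python
# Compute simple image properties that can feed a spam classifier.
import os
from PIL import Image

def image_features(path):
    img = Image.open(path).convert("RGB")
    width, height = img.size
    colors = img.getcolors(maxcolors=width * height) or []
    return {
        "width": width,
        "height": height,
        "aspect_ratio": width / height,
        "file_size_bytes": os.path.getsize(path),
        "distinct_colors": len(colors),
    }

print(image_features("suspect_attachment.png"))   # placeholder path
```
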
Overview

Countering Web Spam
What is web spam?
- Traditional definition
- Our definition
Web spam accounts for between 13.8% and 22.1% of all web pages.

Ad Farms Only contain advertising links (usually ad listings) Elaborate entry pages used to deceive visitors

Ad Farms (cont.) Clicking on an entry page link leads to an ad listing Ad syndicators provide the content Web spammers create the HTML structures

Parked Domains Domain parking services  Provide place holders for newly registered domains  Allow ad listings to be used as place holders to monetize a domain Inevitably, web spammers abused these services

Parked Domains (cont.)
Functionally equivalent to Ad Farms:
- Both rely on ad syndicators for content
- Both provide little to no value to their visitors
Unique characteristics:
- Reliance on domain parking services (e.g., apps5.oingo.com, searchportal.information.com, etc.)
- Typically for sale by owner (“Offer To Buy This Domain”)

Parked Domains (cont.)

Advertisements Pages advertising specific products or services Examples of the kinds of pages being advertised in Ad Farms and Parked Domains

Problem Description Web spam detection can also be modeled as a binary text classification problem Salton’s vector space model is quite common Feature processing and performance evaluation are also quite similar But what about feature generation…

How do we generate features? Sources of information  HTTP connections Hosting IP addresses Session headers  HTML content Textual properties Structural properties  URL linkage structure PageRank scores Neighbor properties

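To make two of these feature families concrete, the sketch below computes a content-style signal (how well the HTML compresses, which tends to be high for repetitive spam pages) and a link-based signal (PageRank over a toy link graph). zlib, networkx, and the toy data are illustrative assumptions, not the features used in any particular study:

```python
# One content-based and one link-based feature for web spam detection.
import zlib
import networkx as nx

html = "<html><body>" + "buy cheap pills " * 200 + "</body></html>"

# Repetitive (spammy) pages compress very well, so a low ratio is suspicious.
compression_ratio = len(zlib.compress(html.encode())) / len(html)
print("compression ratio:", round(compression_ratio, 3))

# PageRank score of each page in a small directed link graph.
graph = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a"),
                    ("spam1", "spam2"), ("spam2", "spam1")])
print(nx.pagerank(graph))
```
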
Classification History
- Davison (2000): was the first to investigate link-based web spam; built decision trees to successfully identify “nepotistic links”
- Becchetti et al. (2005): revisited the use of decision trees to identify link-based web spam; used link-based features such as PageRank and TrustRank scores

Classification History
- Drost and Scheffer (2005): used Support Vector Machines to classify web spam pages; relied on content-based features as well as link-based features
- Ntoulas et al. (2006): built decision trees to classify web spam; used content-based features (e.g., fraction of visible content, compressibility, etc.)

Classification History
Up to this point, web spam research was limited to small (on the order of a few thousand), private data sets.
- Webb et al. (2006): presented the Webb Spam Corpus, a first-of-its-kind large-scale, publicly available web spam corpus (almost 350K web spam pages)
- Castillo et al. (2006): presented the WEBSPAM-UK2006 corpus, a publicly available web spam corpus (only contains 1,924 web spam pages)

Classification History
- Castillo et al. (2007): created a cost-sensitive decision tree to identify web spam in the WEBSPAM-UK2006 data set; used link-based features from [Becchetti et al. (2005)] and content-based features from [Ntoulas et al. (2006)]
- Webb et al. (2008): compared various classifiers (e.g., SVM, decision trees, etc.) using HTTP session information exclusively; used the Webb Spam Corpus, WebBase data, and the WEBSPAM-UK2006 data set; found that these classifiers are comparable to (and in many cases, better than) existing approaches

Ongoing Research: Redirection, Phishing, Social Spam

Redirection
144,801 unique redirect chains (1.54 HTTP redirects per chain on average). 43.9% of web spam pages use some form of HTML or JavaScript redirection.

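A rough sketch of how HTML/JavaScript redirection can be flagged: scan a page's source for meta-refresh tags and simple location assignments. These regex heuristics and the sample HTML are illustrative, not the detection method used in the study:

```python
# Heuristically detect HTML meta-refresh or JavaScript location redirects.
import re

META_REFRESH = re.compile(r'<meta[^>]+http-equiv=["\']?refresh', re.IGNORECASE)
JS_REDIRECT = re.compile(r'(window\.location|document\.location)\s*(\.href)?\s*=',
                         re.IGNORECASE)

def uses_redirection(html: str) -> bool:
    return bool(META_REFRESH.search(html) or JS_REDIRECT.search(html))

sample = '<meta http-equiv="refresh" content="0; url=http://ads.example.com">'
print(uses_redirection(sample))   # True
```
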
Phishing
An interesting form of deception that affects email and web users; another form of adversarial classification.

Social Spam: comment spam, bulletin spam, message spam

Conclusions
Email spam and web spam are currently two of the largest information security problems. Classification techniques offer an effective way to filter this low quality information. Spammers are extremely dynamic, which generates various areas of important future research…

Questions