Ensembles in Adversarial Classification for Spam
Deepak Chinavle, Pranam Kolari, Tim Oates and Tim Finin
University of Maryland, Baltimore County
Presentation transcript:

[Figures: classification accuracy over time under full retraining, retraining with ensemble labels, and retraining with true labels, compared against no retraining]

The Problem with Spam
- Spam on the web is a constant problem
- Effective spam classifiers can be built with sufficient labeled training data
- As spammers change their tactics, however, the classifiers must be retrained
- Knowing when to retrain is a problem, and obtaining new labeled data is expensive

[Figure: an example of a spam blog on wordpress.com and software used to create spam blogs]

Ensemble of Classifiers Approach
- Use an ensemble with one classifier per feature set extracted from the data
- Changes in mutual agreement between the labels assigned by pairs of individual classifiers in the ensemble indicate concept drift
- Retrain drifting classifiers using ensemble labels; no need for hand-labeled data

Evaluation
- Evaluated the approach on spam blog data collected in 2005 and 2006 and hand labeled
- Used features from Kolari 2006
- Compared automatic retraining against retraining with true labels
- Measured accuracy and retraining cost under several policies

Example Feature Sets
- Bag of words: word frequency in the blog
- Word n-grams: frequency of short word phrases
- Character n-grams: frequency of short character sequences
- Anchor text: text in HTML links
- Tokenized URLs and outlinks: tokens extracted from links
- HTML tags: frequency of common tags, e.g., H1 and BOLD
- URL: tokens extracted from the link to the blog

Results
- Mutual agreement between classifiers is an accurate indicator of the need to retrain
- Concept drift typically affects a subset of the features, so the ensemble is robust
- Ensemble labels are almost as good as true labels for retraining an individual classifier

Mutual Agreement
- Percent of the time a pair of classifiers agrees on a label
- Evaluated for all pairs
- If one classifier is affected by concept drift but the others are not, its mutual agreement with them will change
- No need for true labels to detect drift
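The feature sets above (bag of words, word n-grams, character n-grams) can be sketched as simple frequency extractors. This is a minimal illustration, not the authors' implementation; the function names are hypothetical.

```python
from collections import Counter

def bag_of_words(text):
    # Word-frequency features: counts over lowercased, whitespace-split tokens
    return Counter(text.lower().split())

def word_ngrams(text, n=2):
    # Frequency of short word phrases (n consecutive words)
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def char_ngrams(text, n=3):
    # Frequency of short character sequences over the raw text
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))
```

In the ensemble approach, each such feature set would feed its own classifier, so drift in one representation (say, the words spammers use) need not disturb the others (say, URL tokens or HTML tags).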
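The mutual-agreement check described above can be sketched as follows: compute pairwise agreement between every pair of ensemble members on a window of unlabeled data, and flag any member whose average agreement with the others has dropped relative to a baseline. This is an illustrative sketch under assumed data structures (the names and the threshold are hypothetical), not the paper's code.

```python
from itertools import combinations

def mutual_agreement(labels_a, labels_b):
    # Fraction of instances on which two classifiers assign the same label
    assert len(labels_a) == len(labels_b)
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def drifting_members(predictions, baseline, threshold):
    # predictions: classifier name -> label sequence on the current window
    # baseline: classifier name -> earlier average pairwise agreement
    # Flags members whose average agreement dropped by more than `threshold`
    names = list(predictions)
    pairwise = {n: [] for n in names}
    for a, b in combinations(names, 2):
        m = mutual_agreement(predictions[a], predictions[b])
        pairwise[a].append(m)
        pairwise[b].append(m)
    avg = {n: sum(v) / len(v) for n, v in pairwise.items()}
    return [n for n in names if baseline[n] - avg[n] > threshold]
```

Note that no true labels appear anywhere above: drift is detected purely from disagreement among the classifiers themselves, which is the key point of the approach.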
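Once a drifting member is identified, it is retrained on ensemble labels rather than hand-labeled data. A minimal sketch of producing those labels by majority vote across members (the function name is an assumption; the paper does not prescribe this exact combination rule):

```python
from collections import Counter

def ensemble_labels(predictions):
    # predictions: classifier name -> label sequence on the current window
    # Returns one majority-vote pseudo-label per instance, usable as
    # training labels for a drifting member in place of hand labels
    per_instance = zip(*predictions.values())
    return [Counter(votes).most_common(1)[0][0] for votes in per_instance]
```

Because drift typically hits only a subset of the feature sets, the remaining members still vote accurately, and these pseudo-labels are reported to be almost as good as true labels for retraining.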