1 SIMS 290-2: Applied Natural Language Processing, Marti Hearst, October 18, 2004

2 How might we analyze email?
Identify different parts: reply blocks, signature blocks
Integrate email with workflow tasks
Build a social network: who do you know, and what is their contact info?
Reputation analysis – useful for anti-spam too

3 Today: email analysis and spam filtering

4 Recognizing Email Structure
Three tasks:
Does this message contain a signature block?
If so, which lines are in it?
Which lines are reply lines?
Three-way classification for each line.
Representation: an email is a sequence of lines; each line has features associated with it, and windows of lines are important for line classification.
Victor R. Carvalho & William W. Cohen, Learning to Extract Signature and Reply Lines from Email, in CEAS 2004.
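The per-line, three-way classification setup above can be sketched with a few hand-picked features per line. These features are illustrative guesses only, not the actual feature set used by Carvalho & Cohen:

```python
import re

def line_features(line: str, index: int, total: int) -> dict:
    """Features for one line of an email, for reply/signature detection.
    Illustrative guesses only -- not Carvalho & Cohen's actual feature set."""
    return {
        "starts_with_quote": line.lstrip().startswith(">"),   # reply lines
        "has_phone": bool(re.search(r"\b\d{3}[-.\s]\d{3,4}\b", line)),
        "has_url": ("http://" in line) or ("www." in line),
        "near_end": (total - index) <= 5,   # signatures cluster at the bottom
        "is_blank": line.strip() == "",
    }

msg = ["Hi Bob,", "", "> Did you get the draft?", "Yes, sending it now.",
       "--", "Alice Smith", "555-1234", "www.example.com"]
feats = [line_features(line, i, len(msg)) for i, line in enumerate(msg)]
```

A classifier would then label each line reply / signature / neither from these features, using a window of neighboring lines for context.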

5–10 [Figures from Victor R. Carvalho & William W. Cohen, Learning to Extract Signature and Reply Lines from Email, in CEAS 2004.]

11 The Cost of Spam
Most of the cost of spam is paid by the recipients:
A typical spam batch is 1,000,000 spams.
The spammer averages ~$250 commission per batch.
Cost to recipients to delete the load, at 2 seconds/spam and $5.15/hour: $2,861.
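The $2,861 figure can be verified with a quick calculation from the rates on the slide:

```python
# Quick check of the figures above (2 s per spam, $5.15/hour).
batch = 1_000_000                # spams in a typical batch
hours = batch * 2 / 3600         # ~555.6 hours of collective deleting
cost = hours * 5.15              # dollars lost by recipients
print(round(cost))               # -> 2861
```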

12 The Cost of Spam
Theft efficiency ratio of the spammer: profit to thief / cost to victims = ~10%.
A 10% theft efficiency ratio is typical of many other lines of criminal activity, such as fencing stolen goods (jewelry, hubcaps, car stereos).

13 How to Recognize Spam? What features and algorithms should we use?

14 Adapted from slide by Rohan Malkhare
Anti-spam Approaches
Legislation
Technology:
White-listing of addresses
Black-listing of addresses/domains
Challenge-response mechanisms
Content filtering:
Learning techniques
"Bayesian filtering" for spam has gotten a lot of press, e.g.:
"How to spot and stop spam", BBC News, 26/5/
"Sorting the ham from the spam", Sydney Morning Herald, 24/6/
The "Bayesian filtering" they are talking about is actually Naïve Bayes classification.

15 Adapted from slide by Rohan Malkhare
Research in Spam Classification
Spam filtering is really a classification problem: each email needs to be classified as either spam or not spam ("ham").
W. Cohen (1996): RIPPER, a rule-learning system; rules come out in a human-comprehensible format.
Pantel & Lin (1998): Naïve Bayes with words as features.
Sahami, Dumais, Heckerman, Horvitz (1998): Naïve Bayes with a mutual-information measure to select the features with the strongest resolving power; words and domain-specific attributes of spam used as features.
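A minimal word-feature Naïve Bayes spam classifier in the spirit of the approaches above might look like this. The toy corpus, uniform class prior, and add-one smoothing are my own assumptions for the sketch, not details of any of the cited systems:

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (text, label) pairs; returns word and total counts."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in docs:
        for w in text.lower().split():
            counts[label][w] += 1
            totals[label] += 1
    return counts, totals

def classify(text, counts, totals, vocab_size):
    """Pick the class maximizing the (log) Naive Bayes likelihood,
    with a uniform prior and add-one smoothing over the vocabulary."""
    scores = {}
    for label in ("spam", "ham"):
        s = 0.0
        for w in text.lower().split():
            s += math.log((counts[label][w] + 1) / (totals[label] + vocab_size))
        scores[label] = s
    return max(scores, key=scores.get)

docs = [("buy viagra now", "spam"), ("cheap pills buy now", "spam"),
        ("meeting agenda attached", "ham"), ("lunch tomorrow agenda", "ham")]
counts, totals = train(docs)
vocab = {w for c in counts.values() for w in c}
print(classify("buy cheap viagra", counts, totals, len(vocab)))  # -> spam
```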

16 Adapted from slide by Rohan Malkhare
Research in Spam Classification
Paul Graham (2002): A Plan for Spam
Very popular algorithm, credited with starting the craze for Bayesian filters.
Uses Naïve Bayes with words as features.
Bill Yerazunis (2002): CRM114 sparse binary polynomial hashing algorithm
Very accurate (over 99.7% accuracy).
Distinctive because of its powerful feature-extraction technique.
Uses the Bayesian chain rule for combining weights.
Available via SourceForge.
Others have used SVMs, etc.
New work: the first Email and Anti-Spam conference was just held.

17 Adapted from slide by William Yerazunis
Yerazunis' CRM114 Algorithm
Other Naïve Bayes approaches focused on single-word features.
CRM114 creates a huge number of n-grams and represents them efficiently.
The goal is to create a LOT of features, many of which will be invariant over a large body of spam (or nonspam).
(The name is a reference to a device in Dr. Strangelove.)
Sparse Binary Polynomial Hashing and the CRM114 Discriminator, William S. Yerazunis.

18 Adapted from slide by William Yerazunis
CRM114
1. Slide a window N words long over the incoming text.
2. For each window position, generate a set of order-preserving sub-phrases containing combinations of the windowed words.
3. Calculate 32-bit hashes of these order-preserving sub-phrases (for efficiency reasons).

19 Adapted from slide by William Yerazunis
CRM114 Feature Extraction Example
Step 1: slide a window N words long over the incoming text.
Ex: You can Click here to buy viagra online NOW!!!
Yields the windows "You can Click here to", "can Click here to buy", "Click here to buy viagra", ... and so on (on to step 2).

20 Adapted from slide by William Yerazunis
SBPH Example
Step 2: generate order-preserving sub-phrases from the words in each sliding window.
Sliding window text 'Click here to buy viagra' yields all these feature sub-phrases:
Click
Click here
Click to
Click here to
Click buy
Click here buy
Click to buy
Click here to buy
Click viagra
Click here viagra
Click to viagra
Click here to viagra
Click buy viagra
Click here buy viagra
Click to buy viagra
Click here to buy viagra
Note the binary counting pattern; this is the 'binary' in 'sparse binary polynomial hashing'.
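The sub-phrase generation above can be sketched as follows: keep the first word of the window, then run a binary counter over the remaining words, giving 2^(N-1) order-preserving sub-phrases for a window of N words (16 for the 5-word window on the slide):

```python
from itertools import product

def sbph_subphrases(window):
    """Order-preserving sub-phrases of one window, SBPH-style: the first
    word is always kept, and each later word is either kept or skipped,
    yielding 2**(N-1) sub-phrases for a window of N words."""
    first, rest = window[0], window[1:]
    return [
        " ".join([first] + [w for w, keep in zip(rest, mask) if keep])
        for mask in product([False, True], repeat=len(rest))
    ]

phrases = sbph_subphrases(["Click", "here", "to", "buy", "viagra"])
```

Each of these sub-phrases would then be hashed to a 32-bit value in step 3.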

21 Adapted from slide by William Yerazunis
SBPH Example
Step 3: make 32-bit hash value "features" from the sub-phrases:
Click → E06BF8AA
Click here → 12FAD10F
... and so on, one 32-bit hash per sub-phrase.

22 Adapted from slide by William Yerazunis
How to Use the Terms
For each phrase you can build, keep track of how many times you see that phrase in both the spam and nonspam categories.
When you need to classify some text:
Build up the phrases (each extra word adds 15 features).
Count up how many times all of the phrases appear in each of the two categories.
The category with the most phrase matches wins – but really it uses the Bayesian chain rule.

23 Adapted from slide by William Yerazunis
Learning and Classifying
Learning: each feature is bucketed into one of two bucket files (spam or nonspam).
Classifying: the comparable bucket counts of the two files generate rough estimates of each feature's 'spamminess':
P(F|C) = 0.5 + (|Fc| - |F~c|) / (2 * MaxF)

24 Adapted from slide by William Yerazunis
The Bayesian Chain Rule (BCR)
P(C|F) = P(F|C) P(C) / ( P(F|C) P(C) + P(F|~C) P(~C) )
Start with P(C) = P(~C) = 0.5.
For a new message, compute this for both P(spam) and P(not-spam); whichever has the higher score wins.
The denominator renormalizes to take into account whether most of the email is mainly one class or the other.
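Putting slides 23 and 24 together, a sketch of how the per-feature spamminess estimates might be folded in with the chain rule. The feature counts and the MaxF value below are hypothetical, chosen just to exercise the formulas:

```python
def local_prob(n_c, n_not_c, max_f):
    """Per-feature 'spamminess' from slide 23:
    P(F|C) = 0.5 + (|Fc| - |F~c|) / (2 * MaxF)."""
    return 0.5 + (n_c - n_not_c) / (2 * max_f)

def p_spam_given_features(feature_counts, max_f):
    """Fold the per-feature estimates together with the Bayesian chain
    rule, starting from the uniform prior P(C) = P(~C) = 0.5; after each
    feature, the posterior becomes the prior for the next one."""
    p = 0.5
    for spam_n, ham_n in feature_counts:
        p_f_spam = local_prob(spam_n, ham_n, max_f)
        p_f_ham = local_prob(ham_n, spam_n, max_f)
        p = (p_f_spam * p) / (p_f_spam * p + p_f_ham * (1 - p))
    return p

# hypothetical (spam_count, nonspam_count) pairs for four features:
# three seen mostly in spam, one mostly in ham
p = p_spam_given_features([(30, 2), (25, 5), (40, 1), (3, 20)], max_f=40)
```

With these counts the three spam-heavy features dominate, so the final P(spam|features) comes out close to 1.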

25 Adapted from slide by William Yerazunis
Evaluation
The feature set created by the SBPH feature hash gives better performance than single-word Bayesian systems.
Phrases in colloquial English are much more standardized than words alone – this makes filter evasion much harder.
A bigger corpus of example text is better.
Trained with 400 Kbytes of selected spams and 300 Kbytes of selected nonspams; no blacklists, whitelists, or other shenanigans.

26 Adapted froms slide by William Yerazunis > % The actual performance of CRM114 Mailfilter from Nov 1 to Dec 1, messages, (1935 spam, 3914 nonspam) 4 false accepts, ZERO false rejects, (and 2 messages I couldn't make head nor tail of). All messages were incoming mail 'fresh from the wild'. No canned spam. For comparison, a human* is only about 99.84% accurate in classifying spam v. nonspam in a “rapid classification” environment. Results

27 Adapted from slide by William Yerazunis
Results Stats
Filtering speed: classification about 20 Kbytes per second, learning about 10 Kbytes per second (on a Transmeta 666 MHz laptop).
Memory required: about 5 megabytes (404K spam features, 322K nonspam features).

28 Adapted from slide by William Yerazunis
Downsides?
The bad news: SPAM MUTATES.
Even a perfectly trained Bayesian filter will slowly deteriorate: new spams appear with new topics, as well as old topics with creative twists to evade anti-spam filters.

29 Revenge of the Spammers
How do spammers game these algorithms?
Break the tokenizer – split up words, use HTML tags, etc.
Throw in randomly ordered words – throws off the n-gram-based statistics.
Use few words – harder for the classifier to work with.
On Attacking Statistical Spam Filters. Gregory L. Wittel and S. Felix Wu, CEAS '04.

30 Next Time In-class work: creating categories for the Enron corpus