Sparse Binary Polynomial Hashing and the CRM114 Discriminator William S. Yerazunis Mitsubishi Electric Research Laboratories Cambridge, MA

Slides:



Advertisements
Similar presentations
Intrusion Detection Systems (I) CS 6262 Fall 02. Definitions Intrusion Intrusion A set of actions aimed to compromise the security goals, namely A set.
Advertisements

Anti-SPAM experience at LAL Michel Jouvin LAL / IN2P3
Arnd Christian König Venkatesh Ganti Rares Vernica Microsoft Research Entity Categorization Over Large Document Collections.
Basic Communication on the Internet:
Using Different Forms of Basic Knowledge of the 3 Different Platform: Outlook, AOL and HTML Prepared by Mitch.
System Design System Design - Mr. Ahmad Al-Ghoul System Analysis and Design.
Report : 鄭志欣 Advisor: Hsing-Kuo Pao 1 Learning to Detect Phishing s I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing s. In Proceedings.
CRM114 TeamKNN and Hyperspace Spam Sorting1 Sorting Spam with K-Nearest Neighbor and Hyperspace Classifiers William Yerazunis 1 Fidelis Assis 2 Christian.
Indian Statistical Institute Kolkata
CSC 380 Algorithm Project Presentation Spam Detection Algorithms Kyle McCombs Bridget Kelly.
6/1/2015 Spam Filtering - Muthiyalu Jothir 1 Spam Filtering Computer Security Seminar N.Muthiyalu Jothir – Media Informatics.
1 Foundations of Software Design Fall 2002 Marti Hearst Lecture 18: Hash Tables.
Evaluating Search Engine
Search Engines and Information Retrieval
Document Classification Comparison Evangel Sarwar, Josh Woolever, Rebecca Zimmerman.
Reverse Hashing for Sketch Based Change Detection in High Speed Networks Ashish Gupta Elliot Parsons with Robert Schweller, Theory Group Advisor: Yan Chen.
Spam Filters. What is Spam? Unsolicited (legally, “no existing relationship” Automated Bulk Not necessarily commercial – “flaming”, political.
1 SIMS 290-2: Applied Natural Language Processing Marti Hearst October 18, 2004.
Web Search – Summer Term 2006 II. Information Retrieval (Basics Cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University.
Hearth Bulk System Divisional Secretaries’ Briefing 2012.
23 October 2002Emmanuel Ormancey1 Spam Filtering at CERN Emmanuel Ormancey - 23 October 2002.
Spam Reduction Techniques Using greylisting and SpamAssassin.
An Effective Defense Against Spam Laundering Paper by: Mengjun Xie, Heng Yin, Haining Wang Presented at:CCS'06 Presentation by: Devendra Salvi.
Fundamentals of Python: From First Programs Through Data Structures
Learning at Low False Positive Rate Scott Wen-tau Yih Joshua Goodman Learning for Messaging and Adversarial Problems Microsoft Research Geoff Hulten Microsoft.
Spam Filtering Techniques Arnold Perez Joseph Tilley.
Achieving Better Reliability With Software Reliability Engineering Russel D’Souza Russel D’Souza.
Fundamentals of Python: First Programs
Chapter 8: Systems analysis and design
Search Engines and Information Retrieval Chapter 1.
1 Chapter 11 Implementation. 2 System implementation issues Acquisition techniques Site implementation tools Content management and updating System changeover.
Fast Portscan Detection Using Sequential Hypothesis Testing Authors: Jaeyeon Jung, Vern Paxson, Arthur W. Berger, and Hari Balakrishnan Publication: IEEE.
Chapter 6 : Software Metrics
A Neural Network Classifier for Junk Ian Stuart, Sung-Hyuk Cha, and Charles Tappert CSIS Student/Faculty Research Day May 7, 2004.
The Internet 8th Edition Tutorial 2 Basic Communication on the Internet: .
Data and its manifestations. Storage and Retrieval techniques.
Small Business Resource Power Point Series How to Avoid Your Marketing Messages Being Labelled as Spam.
Classifier Evaluation Vasileios Hatzivassiloglou University of Texas at Dallas.
Copyright (c) 2003 David D. Lewis (Spam vs.) Forty Years of Machine Learning for Text Classification David D. Lewis, Ph.D. Independent Consultant Chicago,
A Technical Approach to Minimizing Spam Mallory J. Paine.
SCAVENGER: A JUNK MAIL CLASSIFICATION PROGRAM Rohan Malkhare Committee : Dr. Eugene Fink Dr. Dewey Rundus Dr. Alan Hevner.
~Computer Virus~ The things you MUST know Brought to You By Sumanta Majumdar Dept. Of Electrical Engg. 2010,GNIT
Major objective of this course is: Design and analysis of modern algorithms Different variants Accuracy Efficiency Comparing efficiencies Motivation thinking.
What’s New in WatchGuard XCS v9.1 Update 1. WatchGuard XCS v9.1 Update 1  Enhancements that improve ease of use New Dashboard items  Mail Summary >
ICOM 6115: Computer Systems Performance Measurement and Evaluation August 11, 2006.
240-Current Research Easily Extensible Systems, Octave, Input Formats, SOA.
Silicon & Software Systems (S3)‏ Copyright © Silicon & Software Systems Limited Antispam protection IT Department 20/03/2008 Ondrej Valousek.
Face Detection Ying Wu Electrical and Computer Engineering Northwestern University, Evanston, IL
CHAPTER 6 Naive Bayes Models for Classification. QUESTION????
1 Fighting Against Spam. 2 How might we analyze ? Identify different parts – Reply blocks, signature blocks Integrate with workflow tasks Build.
Machine Learning for Spam Filtering 1 Sai Koushik Haddunoori.
Installing Java on a Home machine For Windows Users: Download/Install: Go to downloads html.
Onlinedeeneislam.blogspot.com1 Design and Analysis of Algorithms Slide # 1 Download From
13-1 ANSYS, Inc. Proprietary © 2009 ANSYS, Inc. All rights reserved. April 28, 2009 Inventory # Chapter 13 Solver.out File and CCL Introduction to.
The Anatomy of a Large-Scale Hypertextual Web Search Engine S. Brin and L. Page, Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pages , April.
Some of the utilities associated with the development of programs. These program development tools allow users to write and construct programs that the.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 15: Text Classification & Naive Bayes 1.
Opinion spam and Analysis 소프트웨어공학 연구실 G 최효린 1 / 35.
An Effective Defense Against Spam Laundering Author: Mengjun Xie, Heng Yin, Haining Wang Presented At: CCS’ 06 Prepared By: Amit Shrivastava.
Exponential Differential Document Count A Feature Selection Factor for Improving Bayesian Filters Fidelis Assis 1 William Yerazunis 2 Christian Siefkes.
Learning to Detect and Classify Malicious Executables in the Wild by J
Installing Java on a Home machine
Record Storage, File Organization, and Indexes
Chapter 13 Solver .out File and CCL
Improving Digest-Based Collaborative Spam Detection
Installing Java on a Home machine
Objective of This Course
Spam Fighting at CERN 12 January 2019 Emmanuel Ormancey.
NAÏVE BAYES CLASSIFICATION
Presentation transcript:

Sparse Binary Polynomial Hashing and the CRM114 Discriminator William S. Yerazunis Mitsubishi Electric Research Laboratories Cambridge, MA (include CRM114 somewhere in your )

Rough Guide to this Talk ● Introduction and Motivation ● Sparse Binary Polynomial Hashing (SBPH) ● Bayesian Chain Rule (BCR) ● The CRM114 Discriminator Language ● Accuracy of SBPH/BCR ● Limits to Accuracy ● 'Follow the Money' – Destroying the Spammer Business Model

Why filter spam? ● Spam is a low-level DoS attack ● Spam is a human time-sink, and costs bandwidth and disk space. ● Critical s may be lost in a spam-spasm (spam-spasm: where you delete large blocks of in rapid fire, because they're 'all spam') ● Spam may carry workplace-inappropriate statements or images

The Cost of Spam ● Most of the cost of spam is paid for BY THE RECIPIENTS: ● Typical spam batch is 1,000,000 spams ● Spammer averages ~$250 commission per batch

The Cost of Spam ● Most of the cost of spam is paid for BY THE RECIPIENTS: ● Typical spam batch is 1,000,000 spams ● Spammer averages ~$250 commission per batch ● Cost to recipients to delete the load of 2 seconds/spam, $5.15/hour: $2,861

The Cost of Spam ● Theft efficiency ratio of spammer: profit to thief = ~10 % cost to victims ● 10% theft efficiency ratio is typical in many other lines of criminal activity such as fencing stolen goods (jewelery, hubcaps, car stereos).

What is Sparse Binary Polynomial Hashing ? Sparse Binary Polynomial Hashing (SBPH) is a way to create a lot of distinctive features from an incoming text. The goal is to create a LOT of features, many of which will be invariant over a large body of spam (or nonspam).

How to do Sparse Binary Polynomial Hashing 1. Slide a window N words long over the incoming text 2. For each window position, generate a set of order- preserving sub-phrases containing combinations of the windowed words 3. Calculate 32-bit hashes of these order-preserved sub-phrases

What is an SBPH 'feature' ? ● Single words hash to features (e.g. ' S1618 ') ● Phrases hash to features (e.g. 'to unsubscribe') ● Even phrases with wildcards make features – (e.g. 'BUY ONLINE NOW' ) ● Each word in a text affects 2 N-1 features, we use N=5 (a five-word sliding window) in CRM114 ● SBPH creates far more features than there were words in the original text !

Step 1: slide a window N words long over the incoming text. ex: You can Click here to buy viagra online NOW!!! Yields: You can Click here to buy viagra online NOW!!!... and so on... (on to step 2) SBPH Example

Click Click here Click to Click here to Click buy Click here buy Click to buy Click here to buy Click viagra Click here viagra Click to viagra Click here to viagra Click buy viagra Click here buy viagra Click to buy viagra Click here to buy viagra...yields all these feature sub-phrases Note the binary counting pattern; this is the ‘binary’ in ‘sparse binary polynomial hashing’ Sliding Window Text : ‘Click here to buy viagra’ Step 2: generate order-preserving sub-phrases from the words in each of the sliding windows

SBPH Example Click Click here Click to Click here to Click buy Click here buy Click to buy Click here to buy Click viagra Click here viagra Click to viagra Click here to viagra Click buy viagra Click here buy viagra Click to buy viagra Click here to buy viagra Step 3: make 32-bit hash value “features” from the sub-phrases 32-bit hash E06BF8AA 12FAD10F 7B37C4F CF 1821F0E8 46B99AAD B7EE69BF 19A78B4D AE1B0B DE DBB..... and so on

Evaluate these features with the Naive Bayesian Chain Rule ● Learning: each feature is bucketed into one of one million buckets in one of two bucket files (one spam, one nonspam) ● Classifying: the comparable bucket counts of the two files generate rough estimates of each feature's ‘spamminess’ P(F|C) =0.5 + ( |Fc| - |F~c| ) / ( 2 * MaxF )

Quick review: the Bayesian Chain Rule (BCR) P ( F|C ) P ( C ) P (C|F ) = P( F|C ) P( C ) + P ( F|~C) P(~C)

Quick review: the Bayesian Chain Rule (BCR) P ( F|C ) P ( C ) P (C|F ) = P( F|C ) P( C ) + P ( F|~C) P(~C) Conditional probability of a class C given Feature F Renormalize for both possibilities of class C and not-class-C

● a-priori probability: P(C) = P(~C) = 0.5 ● Decision threshold: P = 0.5 ● SBPH/BCR has a VERY steep ROC curve: >99 % of messages evaluate to the top or bottom ’th of the probability range 0.0 – 1.0 (the ROC curve is basically a step function) ● SBPH/BCR is essentially insensitive to parameters Naive Bayesian Parameters

● Yes, CRM114 uses a Naive Bayesian classifier. ● The feature set created by the SBPH feature hash gives better performance than single-word Bayesian systems. ● Hashing and bucketing keeps all evidence from the training phase for use during classification ● Phrases in colloquial English are much more standardized than words alone - this makes filter evasion much harder So it's Naive Bayesian underneath?

● One option: make this engine into a PERL module. ● But frankly, PERL creeps me out... ● The obvious remaining logical option: make up a new language to express these kinds of programs naturally. That gives us probabilities...how do we use those for sorting mail?

● Interpreted generic mutilating filter language ● Based on regex matching as the primitive bind & test operator, and overlaid strings as the basic data structure. ● Data windowing for operation on truly infinite data streams (e.g. syslog streams ) ● ANSI C implementation language The CRM114 Language

● Input/output with native data windowing ● regex test, bind, block-structured iterators (approximate match “in testing”) ● hash, learn, classify, union, and intersection ● surgical and nonsurgical string-alter operations ● subprocess management (both synchronous and asynchronous) CRM114 primitives

“ Shrouding” a message (remove the words, but keep the word positions and statistics, so test messages can be shared between multiple spamfilter developers without overtly violating privacy or copyright) CRM114 Example

“Shrouding” a message (remove the words, but keep the word positions and statistics, so test messages can be shared between multiple spamfilter developers without overtly violating privacy or copyright) (yes, this algorithm is easily crackable- it's just a toy example. Work with me on this, OK?) CRM114 Example

Shrouding a message: #!/usr/bin/crm114 { match (:word:) /[[:graph:]]+/ hash (:word:) /:*:word:/ liaf # Loop Iterate Awaiting Failure of match } accept # so we accept it to standard output } CRM114 Example

Shrouding a message: #!/usr/bin/crm114 { match (:word:) /[[:graph:]]+/ hash (:word:) /:*:word:/ liaf # Loop Iterate Awaiting Failure of match } accept # so we accept it to standard output } CRM114 Example

The program, applied to itself 99E05DBF BF1C3F05 8BE19BD9 6CAF79C0 01E801E0 3DD0C452 11F911F D CBF25052 D14DD945 38EBC052 9B9E D CBF25052 D14DD945 6A05F F26 A11AA91A A4A53CB6 9F22D2A8 0971BCE BEB 6CAF79C0 9BF2ABEA A3704CCE A11AA91A EF ED42033 A3704CCE 192B4BF4 EE580C1C 8ED91AF9 AA8D73B8 30C3C80A 0D BF2ABEA

CRM114 and Mailfilter ● Mailfilter is a filter program, written in CRM114, to separate spam from nonspam. ● CRM114 provides the primitives; Mailfilter uses those primitives to create a moderately usable antispam filter

CRM114 and Mailfilter ● All access to Mailfilter training, whitelists, blacklists, etc. is done by mailing yourself a message with a command, a password, and arguments ● You can reprogram your Mailfilter remotely ● No special user mail agent is required- any mail reading program will work.

CRM114 and Mailfilter ● Mailfilter is inserted into the mail delivery chain either via Procmail or via the '.forward' hook of Sendmail ● This step of installation requires 'root' privs and some system knowledge.

Training CRM114 Mailfilter ● Just read your . If you get a misclassified message, mail it back to yourself, prefixed with the proper command and password ● Train Only Errors ( the TOE strategy) works well for SBPH/BCR; corpuses of under 100Kbytes can achieve >98% accuracy level

● A bigger corpus of example text is better ● With 400Kbytes selected spams, 300Kbytes selected nonspams trained in, no blacklists, whitelists, or other shenanigans: How good can it get ?

● A bigger corpus of example text is better ● With 400Kbytes selected spams, 300Kbytes selected nonspams trained in, no blacklists, whitelists, or other shenanigans: > % How good can it get ?

> % (N+1 accuracy) ● This is the actual performance of CRM114 Mailfilter from Nov 1 to Dec 1, ● 5849 messages, (1935 spam, 3914 nonspam) ● 4 false accepts, ZERO false rejects, (and 2 messages I couldn't make head nor tail of). ● All messages were incoming mail 'fresh from the wild'. No canned spam... How good can it get ?

> % (N+1 accuracy) ● Filtering speed: classification: about 20Kbytes per second, learning time: about 10Kbytes per second (on a Transmeta 666 MHz laptop) ● Memory required: about 5 megabytes ● 404K spam features, 322K nonspam features How good can it get ?

> % (N+1 accuracy) For comparison, a human* is only about 99.84% accurate in classifying spam v. nonspam in a “rapid classification” environment. * In this case, the author, because even a grad student won’t classify 3900 spams twice. How good can it get ?

The bad news: SPAM MUTATES Even a perfectly trained Bayesian filter will slowly deteriorate. New spams appear, with new topics, as well as old topics with creative twists to evade antispam filters. Downsides?

A VERY ROUGH* empirical estimate of the rate of spam evolution: Newspams = TotalSpams x days In realistic terms, this is really about 1 to 3 truly new spams or spam methods per month. How fast does spam mutate? * pronounced “by casual observation, wholly unsubstantiated”

Spamfiltering is shooting at a target that's not just moving, but one that's actively making evasive maneuvers! Sometimes these new spam technologies require a structural change to deal with (such as the Spammus Interruptus spams that started appearing around Pearl Harbor day, 2002) Cases like these require significant human intervention and/or evaluating the text in ‘eye-space’ rather than in ‘ascii-space’.

How good does it need to be? FOLLOW THE MONEY ● Spamming is a business ● Our target is not the spammer ● Our target is not the spammer's ISP

How good does it need to be? FOLLOW THE MONEY ● Spamming is a business ● Our target is not the spammer ● Our target is not the spammer's ISP ● Our target is the SPAMMER'S BUSINESS MODEL

How good does it need to be? FOLLOW THE MONEY ● The typical spammer commission is between $250 and $500 per million spams, and gets a.0001 response rate. ● Paper junk mail costs about 25 cents per unit, and gets a.05 response rate. ● We need to drive the amortized cost of a spam above that of junk mail.

FOLLOW THE MONEY..... ● Amortized cost of one spam response: roughly 2.5 cents ● Amortized cost of one junk mail response: roughly $5.00 ● We need widely implemented spam filters at 99.5% or better to drive the spammers out of business.

Conclusions Bayesian spamfilters (especially SBPH/BCR ) filters are more than adequate to destroy the spammer's business model.

Conclusions Bayesian spamfilters (especially SBPH/BCR ) filters are more than adequate to destroy the spammer's business model. Other high accuracy filter techniques exist- we should deploy as many different methods as possible to avoid monoculture-induced wide-area failures whenever spam evolves in an unforeseen way.

Concerns Full SBPH/BCR may be too computationally expensive for widespread (at-ISP) implementation.

Concerns Full SBPH/BCR may be too computationally expensive for widespread (at-ISP) implementation. Bayesian and other high-accuracy filters may become a tool for censorship or governmental monitoring

Concerns Full SBPH/BCR may be too computationally expensive for widespread (at-ISP) implementation. Bayesian and other high-accuracy filters may become a tool for censorship or governmental monitoring “It's a bad idea to implement good ideas that make it easy to do bad things”

Future Work Integrate an 8-bit-safe approximate matching regex engine (probably TRElib, by Ville Laurikari) Better classifier (steal Spambayes chi-square evaluator) Advanced mail sorting (e.g. 'parties', 'rants’, 'humor', 'flames', 'blather', 'spam') Speed improvements Ease-of-installation issues

Thank You All !! Questions? CRM114 is licensed under the GPL and available for free download at the URL: it's still alpha-quality, but it works.