Design and Evaluation of a Real-Time URL Spam Filtering Service

Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, and Dawn Song
IEEE Symposium on Security and Privacy, 2011

Outline
- Introduction: Monarch
- Related Work
- System Design
- Implementation
- Evaluation
- Discussion and Conclusion

Spam URLs
- Advertisement
- Harmful content: phishing, malware, and scams
- Use of compromised and fraudulent accounts, web services

Monarch
- Spam URL filtering as a service
- Tens of millions of features

Related Work
- “Detecting spammers on Twitter” (2010): post frequency, URLs, friends…
- “Behind phishing: an examination of phisher modi operandi” (2008): lexical characteristics of phishing URLs
- “Cantina: a content-based approach to detecting phishing web sites” (2007): parses HTML content

System Design
Monarch’s cloud infrastructure
- URL Aggregation: collects URLs from email providers and Twitter’s streaming API
- Feature Collection: visits each URL with web browsers to collect page content

System Design (cont.)
Monarch’s cloud infrastructure
- Feature Extraction: transforms the raw data into a sparse feature vector
- Classification: training and testing by distributed logistic regression

Collect Raw Features – Web Browser
- “A taxonomy of JavaScript redirection spam” (2007)
- A lightweight browser is not enough: poor HTML parsing, lack of JavaScript and plugins
- Instrumented version of Firefox: JavaScript enabled, Flash and Java installed
- Visits a URL and monitors a number of details

Raw Features
- Web Browser: initial URL and landing URL, redirects, sources and frames; HTML content, page links; JavaScript events, pop-up windows, plugins; HTTP headers
- DNS Resolver: initial, final, and redirect URLs
- IP Address Analysis: city, country, ASN
- Proxy and Whitelist (200 domains)

Feature Vector
Raw features => sparse feature vector
- Canonicalize URLs: remove obfuscation
- Tokenize the text corpus: split on non-alphanumeric characters
  => domain feature [adl,tw], path feature [dada,dada2,php], query parameters feature [a,1,b,3]
  => (1,3,a,adl,b,dada,dada2,php,tw)
  => (…,adl:true,adm:false,…,dada:true,…,tw:true,…)
- Total: 49,960,691 features (dimensions)
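
The tokenization step on this slide can be sketched in a few lines of Python. This is only an illustration, not Monarch's extraction code: the example URL is hypothetical (the slide's original URL did not survive), chosen so that its tokens match the domain, path, and query lists shown above, and the function names are made up for the sketch.

```python
import re
from urllib.parse import urlsplit

def tokenize(text):
    """Split on runs of non-alphanumeric characters and drop empty tokens."""
    return [tok for tok in re.split(r"[^0-9A-Za-z]+", text.lower()) if tok]

def url_tokens(url):
    """Tokenize the domain, path, and query string of a URL separately."""
    parts = urlsplit(url)
    return {
        "domain": tokenize(parts.netloc),
        "path": tokenize(parts.path),
        "query": tokenize(parts.query),
    }

def to_sparse_vector(tokens_by_field):
    """Represent a binary feature vector by the sorted set of tokens present.

    Every token that is absent is implicitly false, so only a handful of the
    tens of millions of dimensions need to be stored per URL.
    """
    present = set()
    for tokens in tokens_by_field.values():
        present.update(tokens)
    return sorted(present)

# Hypothetical URL, an assumption made so the output matches the slide.
example_url = "http://adl.tw/dada/dada2.php?a=1&b=3"
fields = url_tokens(example_url)
vector = to_sparse_vector(fields)
print(fields)
print(vector)
```

Storing only the tokens that are present is what makes a vector with roughly 50 million dimensions practical to keep per URL.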

Distributed Classifier Design
- Linear classification: x is the feature vector
- Determine a weight vector w
- A parallel online learner, with regularization to yield a sparse weight vector
- Labeled data, testing: -1 => non-spam site, 1 => spam site
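
A minimal sketch of how a linear classifier could score a sparse binary feature vector, assuming the weights live in a dictionary keyed by token. The weights, the threshold at zero, and the token names below are all made up for illustration; they are not values from the paper.

```python
def predict(weights, features):
    """Return 1 (spam) or -1 (non-spam) for a sparse binary feature vector.

    weights:  dict mapping token -> learned weight (missing tokens weigh 0)
    features: iterable of the tokens present in the URL's feature vector
    """
    score = sum(weights.get(tok, 0.0) for tok in features)
    return 1 if score > 0 else -1

# Hypothetical weights; a sparse weight vector stores only non-zero entries.
weights = {"php": 0.8, "tw": -0.2, "dada": 0.5}

print(predict(weights, ["adl", "dada", "php", "tw"]))
print(predict(weights, ["tw"]))
```

Because both the features and the weights are sparse, scoring touches only the tokens that actually occur in the URL.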

Training the weight vector
- Logistic regression with subgradient L1-regularization
- y_i (x_i · w) larger => f(w) smaller (classification margin, hyperplane)
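
The slide's formula did not survive, so the sketch below assumes the standard L1-regularized logistic regression objective, f(w) = sum_i log(1 + exp(-y_i (x_i · w))) + lambda * ||w||_1, and minimizes it with plain subgradient steps on a single machine. It is not the distributed algorithm from the paper; the toy data, learning rate, and lambda are arbitrary choices for illustration.

```python
import math

def dot(w, x):
    """Inner product of a sparse weight dict with a binary feature list."""
    return sum(w.get(tok, 0.0) for tok in x)

def sign(v):
    return (v > 0) - (v < 0)

def objective(w, data, lam):
    """Logistic loss over all examples plus the L1 penalty."""
    loss = sum(math.log1p(math.exp(-y * dot(w, x))) for x, y in data)
    return loss + lam * sum(abs(v) for v in w.values())

def train(data, lam=0.01, lr=0.1, epochs=50):
    """Online subgradient descent; data is a list of (tokens, label) pairs."""
    w = {}
    for _ in range(epochs):
        for x, y in data:
            # Clamp the margin so math.exp cannot overflow.
            margin = max(-30.0, min(30.0, y * dot(w, x)))
            # Derivative of log(1 + exp(-y * z)) with respect to z = x . w
            g = -y / (1.0 + math.exp(margin))
            for tok in x:
                w[tok] = w.get(tok, 0.0) - lr * g
            # Subgradient of lam * ||w||_1 pulls every weight toward zero.
            for tok in list(w):
                w[tok] -= lr * lam * sign(w[tok])
    return w

# Toy labeled data (assumption): 1 = spam, -1 = non-spam.
data = [(["php", "win"], 1), (["tw", "news"], -1)]
w = train(data)
print(objective({}, data, 0.01), objective(w, data, 0.01))
```

A larger margin y_i (x_i · w) shrinks the corresponding loss term, which is the relationship the slide points out. Note that plain subgradient steps only push weights toward zero; getting exact zeros in practice usually needs a truncation step, which this sketch omits.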

Distributed Classifier Algorithm

Data Set and Assumptions
- 1.25 million spam URLs
- 567,784 spam Twitter URLs
- 9 million non-spam Twitter URLs
- Checking all Twitter URLs against: Google Safebrowsing, SURBL, URIBL, APWG, Phishtank
- A URL is labeled spam if any of its source URLs becomes blacklisted

Data Set and Assumptions (cont.)
On Twitter: 36% scams, 60% phishing, 4% malware

After regularization

Implementation
Amazon Web Services (AWS) infrastructure
- URL Aggregation: a queue that keeps 300,000 URLs
- Feature Collection: 20x6 Firefox (4.0b4) on Ubuntu 10.04 with a custom extension; Firefox’s NPAPI, Linux’s “host” command, MaxMind GeoIP library, and Route Views
- Classifier: Hadoop Distributed File System on the 50-node cluster

Evaluation – Overall Accuracy
- 5-fold cross-validation
- 500,000 spam and 500,000 non-spam examples
- Training set size fixed at 400,000 examples, at ratios of 1:1, 4:1, 10:1
- Testing set size fixed at 200,000 examples, at a ratio of 1:1
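
The set sizes on this slide can be checked with a little arithmetic. The sketch below assumes the ratios are non-spam to spam and that the 400,000 training examples are subsampled from the four folds not used for testing; both are readings of the slide rather than statements taken from it.

```python
def class_counts(total, nonspam_per_spam):
    """Split `total` examples into (non-spam, spam) counts at a given ratio."""
    spam = round(total / (nonspam_per_spam + 1))
    return total - spam, spam

examples_per_class = 500_000
folds = 5

# One fold is held out for testing, the other four form the training pool.
test_fold_size = 2 * examples_per_class // folds
train_pool_size = 2 * examples_per_class - test_fold_size
print(test_fold_size, train_pool_size)

for ratio in (1, 4, 10):
    nonspam, spam = class_counts(400_000, ratio)
    print(f"{ratio}:1 -> {nonspam} non-spam, {spam} spam")
```

Under these assumptions each held-out fold has exactly the 200,000 examples listed for testing, and the training pool of 800,000 is large enough to draw 400,000 examples at any of the three ratios.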

Evaluation – Single Feature

Evaluation – Accuracy Over Time
Training only once vs. retraining every four days

Evaluation – Comparing Email and Tweet Spam
Log odds ratio:
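
The formula itself is missing from the transcript. A common definition, assumed here, is |log(p1 q2 / (p2 q1))|, where p_i is the likelihood that a feature appears in set i (for example email spam versus tweet spam) and q_i = 1 - p_i. The probabilities in the example are made up.

```python
import math

def log_odds_ratio(p1, p2):
    """Absolute log odds ratio of a feature appearing in set 1 versus set 2.

    p1, p2 must lie strictly between 0 and 1. A value of 0 means the feature
    is equally likely in both sets; larger values mean it favors one set.
    """
    q1, q2 = 1.0 - p1, 1.0 - p2
    return abs(math.log((p1 * q2) / (p2 * q1)))

# Hypothetical likelihoods of one feature in email spam and in tweet spam.
print(log_odds_ratio(0.5, 0.5))
print(log_odds_ratio(0.8, 0.2))
```

Taking the absolute value makes the measure symmetric, so swapping the two sets does not change the result.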

Evaluation – The Cost
For Twitter: $22,751 per month

Discussion and Conclusion
- Evasion: feature evasion, time-based evasion, crawler evasion
- Monarch: real-time system; spam URL filtering as a service; $22,751 a month
