Published by Lewis Pitts. Modified over 9 years ago.
1
Design and Evaluation of a Real-Time URL Spam Filtering Service
Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, and Dawn Song
IEEE Symposium on Security and Privacy 2011
2
OUTLINE
Introduction - Monarch
Related Work
System Design
Implementation
Evaluation
Discussion and Conclusion
3
Spam URL
Advertisement
Harmful content
  Phishing, malware, and scams
Use of compromised and fraudulent accounts
  Email, web services
4
Monarch
Spam URL Filtering as a Service
Tens of millions of features
5
Related Work
“Detecting spammers on Twitter” (2010)
  Post frequency, URLs, friends…
“Behind phishing: an examination of phisher modi operandi” (2008)
  Lexical characteristics of phishing URLs
“Cantina: a content-based approach to detecting phishing web sites” (2007)
  Parse HTML content
6
System Design
Monarch’s cloud infrastructure
URL Aggregation
  Email providers and Twitter’s streaming API
Feature Collection
  Visits a URL with web browsers to collect page content
7
System Design (cont.)
Monarch’s cloud infrastructure
Feature Extraction
  Transforms the raw data into a sparse feature vector
Classification
  Training and testing by distributed logistic regression
8
Collect Raw Features – Web Browser
“A taxonomy of JavaScript redirection spam” (2007)
Lightweight browser not enough
  Poor HTML parsing, lack of JavaScript and plugins
Instrumented version of Firefox
  JavaScript enabled
  Flash and Java installed
Visits a URL and monitors a number of details
9
Raw Features
Web Browser
  Initial URL and Landing URL, Redirects, Sources and Frames
  HTML Content, Page Links
  JavaScript Events, Pop-up Windows, Plugins
  HTTP Headers
DNS Resolver
  Initial, final, and redirect URLs
IP Address Analysis
  City, country, ASN
Proxy and Whitelist (200 domains)
10
Feature Vector
Raw Features => sparse feature vector
Canonicalize URLs
  Remove obfuscation
Tokenize the text corpus
  Splitting on non-alphanumeric characters
  => domain feature [adl,tw]
     path feature [dada,dada2,php]
     query parameters feature [a,1,b,3]
  => (1,3,a,adl,b,dada,dada2,php,tw)
  => (…,adl:true,adm:false,…,dada:true,…,tw:true,…)
Total: 49,960,691 features (dimensions)
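The tokenization step above can be sketched in a few lines of Python. This is only an illustration, not Monarch's extraction code: the function names are made up, and the example URL is a hypothetical one chosen so that it yields the tokens listed on the slide.

```python
import re
from urllib.parse import urlsplit


def tokenize(text):
    """Split on runs of non-alphanumeric characters and drop empty tokens."""
    return [tok for tok in re.split(r"[^0-9A-Za-z]+", text) if tok]


def url_features(url):
    """Return a sparse feature dict: only tokens that are present get an entry."""
    parts = urlsplit(url)
    domain = tokenize(parts.hostname or "")   # e.g. ['adl', 'tw']
    path = tokenize(parts.path)               # e.g. ['dada', 'dada2', 'php']
    query = tokenize(parts.query)             # e.g. ['a', '1', 'b', '3']
    # Absent tokens (such as 'adm') are simply not stored, which keeps the
    # vector sparse even when the full dictionary has millions of dimensions.
    return {tok: True for tok in sorted(set(domain + path + query))}


# Hypothetical URL, picked only to reproduce the slide's token lists.
features = url_features("http://adl.tw/dada/dada2.php?a=1&b=3")
print(list(features))
```

Storing only the tokens that occur is what makes a dictionary of roughly 50 million dimensions practical: each URL touches a handful of entries.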
11
Distributed Classifier Design
Linear classification
  x: feature vector
  Determine a weight vector w
A parallel online learner
  With regularization to yield a sparse weight vector
Labeled data, testing
  -1 => non-spam site
  1 => spam site
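As a rough illustration of linear classification over a sparse feature vector, the sketch below scores a URL as the dot product of its features with a weight vector and maps the sign to the slide's labels. The function name, the optional bias term, the example weights, and the choice to treat a score of exactly zero as non-spam are all assumptions made for the example.

```python
def classify(features, weights, bias=0.0):
    """Return 1 (spam) or -1 (non-spam) from the sign of x . w + bias.

    features: sparse dict of token -> True for tokens present in the URL.
    weights:  sparse dict of token -> learned weight (missing means 0.0).
    """
    score = bias + sum(
        weights.get(name, 0.0) for name, present in features.items() if present
    )
    # Assumption: a score of exactly zero is treated as non-spam.
    return 1 if score > 0 else -1


# Hypothetical weights, for illustration only.
weights = {"adl": 1.5, "php": -0.5}
print(classify({"adl": True, "php": True}, weights))
print(classify({"tw": True}, weights))
```

Because both dictionaries are sparse, the cost of scoring depends on the number of tokens in the URL, not on the size of the full feature space.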
12
Training the weight vector
Logistic Regression
With subgradient L1-Regularization
yi(xi·w) larger => f(w) smaller
(Classification margin, hyperplane)
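A minimal single-machine sketch of this training step follows. It assumes the standard L1-regularized logistic loss, f(w) = sum over i of log(1 + exp(-yi(xi·w))) + lam * ||w||_1, and takes plain subgradient steps on it. It is not the distributed algorithm from the talk, and the names `loss`, `sgd_step`, `lam`, and `lr` are chosen only for this example.

```python
import math


def dot(w, x):
    """Sparse dot product between weight dict w and feature dict x."""
    return sum(w.get(k, 0.0) * v for k, v in x.items())


def loss(w, data, lam):
    """L1-regularized logistic loss over (x, y) pairs with y in {+1, -1}."""
    total = sum(math.log1p(math.exp(-y * dot(w, x))) for x, y in data)
    return total + lam * sum(abs(v) for v in w.values())


def sgd_step(w, x, y, lam, lr):
    """One subgradient step on a single example; updates w in place."""
    margin = y * dot(w, x)
    # Derivative of log(1 + exp(-margin)) with respect to (x . w).
    g = -y / (1.0 + math.exp(margin))
    for k, v in x.items():
        w[k] = w.get(k, 0.0) - lr * g * v
    # Subgradient of lam * |w_k| is lam * sign(w_k), taking 0 at w_k == 0.
    for k in list(w):
        sign = (w[k] > 0) - (w[k] < 0)
        w[k] -= lr * lam * sign
    return w


# Tiny made-up data set: one spam-like and one non-spam-like example.
data = [({"spam": 1.0}, 1), ({"ham": 1.0}, -1)]
w = {}
before = loss(w, data, lam=0.01)
for _ in range(50):
    for x, y in data:
        sgd_step(w, x, y, lam=0.01, lr=0.1)
print(before, loss(w, data, lam=0.01))
```

A larger margin yi(xi·w) makes each log term smaller, which is the relationship the slide points out; the L1 term pulls weights toward zero so that uninformative features drop out.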
13
Distributed Classifier Algorithm
14
Data Set and Assumption
1.25 million spam URLs
567,784 spam Twitter URLs
9 million non-spam Twitter URLs
Checking all Twitter URLs against:
  Google Safe Browsing, SURBL, URIBL, APWG, PhishTank
A URL is labeled spam if any of its source URLs becomes blacklisted
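The labeling rule can be sketched as a set-membership check. This is an illustration under assumptions: the blacklist contents and URLs below are invented, exact string matching stands in for whatever matching the real blacklists perform, and `label_url` is a hypothetical helper name.

```python
def label_url(source_urls, blacklists):
    """Return 1 (spam) if any source URL appears on any blacklist, else -1.

    source_urls: URLs observed while visiting the page.
    blacklists:  dict of blacklist name -> set of blacklisted URLs.
    """
    # Merge all blacklists into one set for a single membership test.
    blacklisted = set().union(*blacklists.values()) if blacklists else set()
    return 1 if any(url in blacklisted for url in source_urls) else -1


# Invented entries, for illustration only.
blacklists = {
    "listA": {"http://bad.example/landing"},
    "listB": {"http://worse.example/frame"},
}
print(label_url(["http://ok.example/", "http://worse.example/frame"], blacklists))
print(label_url(["http://ok.example/"], blacklists))
```

Labels produced this way inherit the blacklists' blind spots, so they are best read as an approximation of ground truth.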
15
Data Set and Assumption (cont.)
On Twitter: 36% scams, 60% phishing, 4% malware
16
After regularization
17
Implementation
Amazon Web Services (AWS) infrastructure
URL Aggregation
  A queue, keeps 300,000 URLs
Feature Collection
  20x6 Firefox (4.0b4) on Ubuntu 10.04
  With a custom extension
  Firefox’s NPAPI, Linux’s “host” command, MaxMind GeoIP library and Route Views
Classifier
  Hadoop Distributed File System
  On the 50-node cluster
18
Evaluation – Overall Accuracy
5-fold cross-validation
  500,000 spam and non-spam each
Training set size to 400,000 examples
  1:1, 4:1, 10:1
Testing set size to 200,000 examples
  1:1
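For readers unfamiliar with k-fold cross-validation, here is a minimal index-splitting sketch. It assumes the examples are already shuffled, and `k_fold_indices` is a name made up for this example. With 1,000,000 examples and k = 5, each test fold holds 200,000 examples, which presumably is where the slide's test-set size comes from.

```python
def k_fold_indices(n, k=5):
    """Yield (train, test) index lists for k-fold cross-validation over n items.

    Assumes the data is already shuffled; folds are contiguous index ranges.
    """
    # Spread any remainder over the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


# Small demonstration with 10 items instead of a million.
for train, test in k_fold_indices(10, k=5):
    print(test, len(train))
```

Every example appears in exactly one test fold, so each of the five models is evaluated on data it never saw during training.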
19
Evaluation – Single Feature
20
Evaluation – Accuracy Over Time
Training only once vs. retraining every four days
21
Evaluation – Comparing Email and Tweet Spam
Log odds ratio:
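The formula itself is not preserved in this transcript. As a hedged illustration, the sketch below uses the textbook definition of a log odds ratio for a feature that occurs in a fraction p1 of one set and p2 of another: log((p1 / (1 - p1)) / (p2 / (1 - p2))). Whether the talk used exactly this form is an assumption, and the input values are made up.

```python
import math


def log_odds_ratio(p1, p2):
    """Log odds ratio of a feature occurring with rate p1 in one set, p2 in another.

    Positive values mean the feature is relatively more common in the first set,
    negative values mean it is more common in the second, and 0 means no difference.
    """
    if not (0.0 < p1 < 1.0 and 0.0 < p2 < 1.0):
        raise ValueError("rates must lie strictly between 0 and 1")
    return math.log(p1 / (1.0 - p1)) - math.log(p2 / (1.0 - p2))


# Made-up rates, for illustration only.
print(log_odds_ratio(0.8, 0.2))
print(log_odds_ratio(0.5, 0.5))
```

The measure is symmetric: swapping the two sets flips the sign but keeps the magnitude.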
22
Evaluation – The Cost
For Twitter, $22,751 per month
23
Discussion and Conclusion
Evasion
  Feature Evasion
  Time-based Evasion
  Crawler Evasion
Monarch
  Real-time system
  Spam URL Filtering as a Service
  $22,751 a month