Design and Evaluation of a Real-Time URL Spam Filtering Service Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, and Dawn Song IEEE Symposium on Security and Privacy 2011
OUTLINE Introduction - Monarch Related Work System Design Implementation Evaluation Discussion and Conclusion
Spam URL Advertisement Harmful content Phishing, malware, and scams Use of compromised and fraudulent accounts Email, web services
Monarch Spam URL Filtering as a Service Tens of millions of features
Related Work “Detecting spammers on Twitter” (2010) Post frequency, URLs, friends… “Behind phishing: an examination of phisher modi operandi” (2008) Lexical characteristics of phishing URLs “Cantina: a content-based approach to detecting phishing web sites” (2007) Parse HTML content
System Design Monarch’s cloud infrastructure URL Aggregation Collects URLs from email providers and Twitter’s streaming API Feature Collection Visits each URL with a web browser to collect page content
System Design (cont.) Monarch’s cloud infrastructure Feature Extraction Transforms the raw data into a sparse feature vector Classification Training and testing via distributed logistic regression
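The four stages above form a single pipeline. A minimal sketch of that flow in Python, assuming hypothetical crawl/extract/classifier components (the names are illustrative placeholders, not Monarch's actual API):

    # Sketch of the four-stage pipeline: aggregation feeds URLs in, feature
    # collection crawls them, extraction builds a sparse vector, and the
    # linear classifier scores it. All component names are illustrative.
    def filter_url(url, crawl, extract_features, classifier):
        raw = crawl(url)                   # feature collection: visit the URL in a browser
        x = extract_features(raw)          # feature extraction: raw data -> sparse vector
        return "spam" if classifier.decision(x) > 0 else "non-spam"

    def run_service(url_queue, crawl, extract_features, classifier):
        # URL aggregation hands URLs from email providers / the Twitter
        # stream to the pipeline one at a time
        return {url: filter_url(url, crawl, extract_features, classifier)
                for url in url_queue}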
Collect Raw Features – Web Browser “A taxonomy of JavaScript redirection spam” (2007) A lightweight browser is not enough: poor HTML parsing, no JavaScript or plugins Instrumented version of Firefox: JavaScript enabled, Flash and Java installed Visits each URL and monitors a number of details
Raw Features Web Browser Initial URL and Landing URL, Redirects, Sources and Frames HTML Content, Page Links JavaScript Events, Pop-up Windows, Plugins HTTP Headers DNS Resolver Initial, final, and redirect URLs IP Address Analysis City, country, ASN Proxy and Whitelist (200 domains)
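A rough sketch of the per-URL record those collectors produce, assuming a simple container type; the field names are illustrative, not Monarch's actual schema:

    from dataclasses import dataclass, field
    from typing import Dict, List

    # Illustrative container for the raw features listed above; one instance
    # per visited URL.
    @dataclass
    class RawFeatures:
        initial_url: str
        landing_url: str
        redirects: List[str] = field(default_factory=list)
        sources_and_frames: List[str] = field(default_factory=list)
        html_content: str = ""
        page_links: List[str] = field(default_factory=list)
        js_events: List[str] = field(default_factory=list)
        popup_windows: List[str] = field(default_factory=list)
        plugins: List[str] = field(default_factory=list)
        http_headers: Dict[str, str] = field(default_factory=dict)
        dns_records: Dict[str, List[str]] = field(default_factory=dict)
        ip_geo: Dict[str, str] = field(default_factory=dict)  # city, country, ASN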
Feature Vector Raw features => sparse feature vector Canonicalize URLs Remove obfuscation Tokenize the text corpus by splitting on non-alphanumeric characters http://adl.tw/~dada/dada2.php?a=1&b=3 => domain feature [adl, tw], path feature [dada, dada2, php], query parameter feature [a, 1, b, 3] => (…, adl:true, adm:false, …, dada:true, …, tw:true, …) with 49,960,691 features (dimensions) in total => (1, 3, a, adl, b, dada, dada2, php, tw)
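A small sketch of the tokenization step shown in this example: split the URL's host, path, and query on non-alphanumeric characters and emit the resulting tokens (the exact feature naming and encoding used by Monarch is an assumption here):

    import re
    from urllib.parse import urlparse, parse_qsl

    # Tokenize a URL into the boolean features illustrated above by splitting
    # each component on non-alphanumeric characters.
    def url_tokens(url):
        parts = urlparse(url)
        tokens = set()
        tokens.update(t for t in re.split(r"[^0-9a-zA-Z]+", parts.netloc) if t)
        tokens.update(t for t in re.split(r"[^0-9a-zA-Z]+", parts.path) if t)
        for k, v in parse_qsl(parts.query):
            tokens.add(k)
            tokens.add(v)
        return sorted(tokens)

    print(url_tokens("http://adl.tw/~dada/dada2.php?a=1&b=3"))
    # ['1', '3', 'a', 'adl', 'b', 'dada', 'dada2', 'php', 'tw']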
Distributed Classifier Design Linear classification: given a feature vector x, determine a weight vector w A parallel online learner with regularization to yield a sparse weight vector Labeled data for training and testing: -1 => non-spam site, +1 => spam site
Training the Weight Vector Logistic regression with subgradient L1-regularization: f(w) = Σ_i log(1 + exp(-y_i (x_i · w))) + λ ||w||_1 y_i (x_i · w) larger => f(w) smaller (classification margin to the separating hyperplane)
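A toy sketch of that objective with plain subgradient descent on dense numpy arrays; the real system trains sparse ~50-million-dimensional vectors with a distributed online learner, so this only makes the formula concrete:

    import numpy as np

    # L1-regularized logistic regression via subgradient descent:
    #   f(w) = sum_i log(1 + exp(-y_i * (x_i . w))) + lam * ||w||_1
    def train(X, y, lam=0.1, lr=0.01, epochs=100):
        d = X.shape[1]
        w = np.zeros(d)
        for _ in range(epochs):
            margins = y * (X @ w)                          # y_i * (x_i . w)
            grad_loss = -(X.T @ (y / (1.0 + np.exp(margins))))
            subgrad_l1 = lam * np.sign(w)                  # subgradient of ||w||_1
            w -= lr * (grad_loss + subgrad_l1)
        return w

    def predict(w, X):
        return np.where(X @ w > 0, 1, -1)                  # +1 spam, -1 non-spam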
Distributed Classifier Algorithm
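The algorithm itself appears as a figure on the slide. The sketch below shows one common way to distribute such a learner: partition the labeled data into shards, run the update locally on each shard, and average the per-shard weight vectors each round (iterative parameter mixing). The exact update rule and mixing schedule Monarch uses may differ, so treat this as an assumption-laden outline:

    import numpy as np

    # Distributed training by iterative parameter mixing: each shard takes a
    # local (sub)gradient step from the current global weights, then the
    # shard models are averaged (the reduce step).
    def distributed_train(shards, dim, rounds=10, lr=0.01, lam=0.1):
        w = np.zeros(dim)
        for _ in range(rounds):
            local_ws = []
            for X, y in shards:                            # one map task per shard
                w_local = w.copy()
                margins = y * (X @ w_local)
                grad = -(X.T @ (y / (1.0 + np.exp(margins)))) + lam * np.sign(w_local)
                local_ws.append(w_local - lr * grad)
            w = np.mean(local_ws, axis=0)                  # average the shard models
        return w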
Data Set and Assumptions 1.25 million spam email URLs, 567,784 spam Twitter URLs, 9 million non-spam Twitter URLs All Twitter URLs checked against Google Safe Browsing, SURBL, URIBL, APWG, and PhishTank A URL is labeled spam if any of its source URLs becomes blacklisted
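A one-line sketch of that labeling rule, where `blacklists` is assumed to be a set of blacklisted URLs or domains collected from the services above:

    # Label a sample spam (+1) if any URL it resolved through is blacklisted,
    # otherwise non-spam (-1). `blacklists` is a hypothetical lookup set.
    def label(source_urls, blacklists):
        return 1 if any(u in blacklists for u in source_urls) else -1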
Data Set and Assumptions (cont.) On Twitter: 36% scams, 60% phishing, 4% malware
After regularization
Implementation Amazon Web Services (AWS) infrastructure URL Aggregation A queue that keeps 300,000 URLs Feature Collection 20 × 6 Firefox (4.0b4) browsers on Ubuntu 10.04 with a custom extension Firefox’s NPAPI, Linux’s “host” command, the MaxMind GeoIP library, and Route Views Classifier Hadoop Distributed File System on a 50-node cluster
Evaluation – Overall Accuracy 5-fold cross-validation over 500,000 spam and 500,000 non-spam samples Training set of 400,000 examples at ratios of 1:1, 4:1, and 10:1 Testing set of 200,000 examples at a 1:1 ratio
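A sketch of that setup, assuming the listed ratios mean non-spam:spam in the training fold (an assumption; the slide only lists the numbers) and reusing the train/predict sketches from earlier by passing them in:

    import numpy as np

    # k-fold cross-validation where each training fold is subsampled to a
    # chosen non-spam:spam ratio and the held-out fold is scored as-is
    # (about 1:1 when the full data set is balanced).
    def cross_validate(X, y, train_fn, predict_fn, ratio=1, folds=5):
        idx = np.random.permutation(len(y))
        accs = []
        for f in range(folds):
            test = idx[f::folds]
            rest = np.setdiff1d(idx, test)
            spam = rest[y[rest] == 1]
            ham = rest[y[rest] == -1]
            spam = spam[: max(1, len(ham) // ratio)]       # enforce the training ratio
            train_idx = np.concatenate([spam, ham])
            w = train_fn(X[train_idx], y[train_idx])
            accs.append(np.mean(predict_fn(w, X[test]) == y[test]))
        return float(np.mean(accs))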
Evaluation – Single Feature
Evaluation – Accuracy Over Time Training once only vs. retraining every four days
Evaluation – Comparing Email and Tweet Spam Log odds ratio of a feature occurring in email spam (probability p1) vs. tweet spam (probability p2): log[(p1 / (1 - p1)) / (p2 / (1 - p2))]
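A small sketch of that comparison, where p_email and p_tweet are the (hypothetical) fractions of email-spam and tweet-spam samples containing a given feature; values near zero mean the feature is about equally common in both corpora:

    import math

    # Log odds ratio of a feature between the two spam corpora.
    def log_odds_ratio(p_email, p_tweet):
        return math.log((p_email / (1 - p_email)) / (p_tweet / (1 - p_tweet)))

    print(log_odds_ratio(0.30, 0.02))   # ~3.04: far more common in email spam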
Evaluation – The Cost For Twitter, $22,751 per month
Discussion and Conclusion Evasion: feature evasion, time-based evasion, crawler evasion Monarch: a real-time system offering spam URL filtering as a service at $22,751 a month