Design and Evaluation of a Real-Time URL Spam Filtering Service
Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, Dawn Song
University of California, Berkeley; International Computer Science Institute
Motivation
Spam targets all kinds of web services:
– Social networks (Facebook, Twitter)
– Web mail (Gmail, Live Mail)
– Blogs, services (Blogger, Yelp)
Motivation
Existing solutions:
– Blacklists
– Service-specific account heuristics
Develop a new spam filtering service:
– Filters spam: scams, phishing, malware
– Real-time, fine-grained, generalizable
Overview
Our system – Monarch:
– Accepts millions of URLs from web services
– Crawls and labels each URL in real time
Spam classification:
– Decision based on URL content, page behavior, hosting
– Large-scale, distributed collection and classification
Implemented as a cloud service
Monarch in Action
1. A spam account posts a spam message on the social network
2. The social network submits the message URL to Monarch
3. Monarch fetches the spam URL's content
4. Monarch returns a decision before the message reaches recipients
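The four-step flow above can be sketched as a minimal service loop. Every name below (`fetch_content`, `classify`, `handle_submission`) is a hypothetical stand-in, not the actual Monarch API; the crawler and classifier are stubbed.

```python
# Minimal sketch of the Monarch message flow (hypothetical names, stubbed logic).

def fetch_content(url):
    # Step 3: the real system drives a full browser; stubbed here.
    return {"url": url, "html": "<html>cheap pills, click now</html>"}

def classify(content):
    # Step 4: stand-in for the trained classifier (keyword check only).
    return "spam" if "cheap pills" in content["html"] else "non-spam"

def handle_submission(message_url):
    # Step 2: a web service submits a URL it saw in a message.
    content = fetch_content(message_url)
    return classify(content)

print(handle_submission("http://example.com/promo"))
```

The key design point the flow illustrates: the web service only ever ships a URL and receives a verdict, so Monarch stays service-agnostic.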
Challenges
– Accuracy
– Real-time operation
– Scalability
– Tolerance to feature evolution
Outline
– Architecture
– Results & Performance
– Limitations
– Conclusion
System Architecture
URL Aggregation
  Source                      Sample Size
  Spam email URLs             1.25 million
  Blacklisted Twitter URLs    567,000
  Non-spam Twitter URLs       9 million
Collection period: 9/8/2010 – 10/29/2010
Feature Collection
High-fidelity browser navigation:
– Lexical features of URLs (length, subdomains)
– Obfuscation (directory operations, nested encoding)
Hosting:
– IP/ASN
– A, NS, MX records
– Country, city if available
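The lexical features named above (URL length, number of subdomains) can be sketched in a few lines. This is an illustrative reading of the slide, not the paper's full feature extractor; the subdomain count below crudely assumes a two-label registered domain.

```python
from urllib.parse import urlparse

def lexical_features(url):
    """Sketch of lexical URL features: length, subdomain count, path depth.
    Not the paper's full feature set."""
    parsed = urlparse(url)
    host_labels = parsed.hostname.split(".") if parsed.hostname else []
    return {
        "url_length": len(url),
        # Crude: everything beyond "example.com" counts as a subdomain label.
        "num_subdomains": max(len(host_labels) - 2, 0),
        "path_depth": parsed.path.count("/"),
    }

print(lexical_features("http://a.b.example.com/x/y?z=1"))
```

Features like these are cheap to compute and need no crawl, which is why they complement the heavier browser-based collection.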
Feature Collection
Content:
– Common HTML templates, keywords
– Search engine optimization
– Content of request, response headers
Behavior:
– Preventing navigation away
– Pop-up windows
– Plugin, JavaScript redirects
Classification
Distributed logistic regression:
– Data overload for a single machine
L1-regularization:
– Reduces feature space, over-fitting
– 50 million features -> 100,000 features
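The mechanism behind that 50 million → 100,000 reduction is L1's tendency to drive uninformative weights to exactly zero. Below is a toy single-machine sketch (SGD with a soft-threshold/proximal step), nothing like the paper's distributed implementation; the data and hyperparameters are invented for illustration.

```python
import math

def train_l1_logreg(X, y, lam=0.01, lr=0.1, epochs=50):
    """Toy L1-regularized logistic regression: SGD plus a soft-threshold
    (proximal) step after each update. A sketch of the idea only."""
    dim = len(X[0])
    w = [0.0] * dim
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(w[j] * xi[j] for j in range(dim))
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(spam)
            g = p - yi                        # log-loss gradient w.r.t. z
            for j in range(dim):
                w[j] -= lr * g * xi[j]
                # Soft-threshold toward zero: this is what L1 buys us.
                w[j] = math.copysign(max(abs(w[j]) - lr * lam, 0.0), w[j])
    return w

# Feature 0 tracks the label; feature 1 is uncorrelated noise (+-1 coding).
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [1, 1, 0, 0]
w = train_l1_logreg(X, y)
print(w)  # the noise feature's weight stays near zero
```

At real scale the same shrinkage prunes tens of millions of rare features, leaving a model small enough to evaluate in real time.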
Implementation
System implemented as a cloud service on Amazon EC2:
– Aggregation: 1 machine
– Feature collection: 20 machines (Firefox, extension + modified source)
– Classification & feature extraction: 50 machines (Hadoop – Spark, Mesos)
Straightforward to scale the architecture
Result Overview
High-level summary:
– Performance
– Overall accuracy
– Important features
– Feature evolution
– Spam independence between services
Performance
– Rate: 638,000 URLs/day (cost: $1,600/mo)
– Processing time: 5.54 sec, of which network delay is 5.46 sec
– Can scale to 15 million URLs/day (estimated $22,000/mo)
Measuring Accuracy
Dataset: 12 million URLs (<2 million spam)
– Sample 500K spam (half tweets, half email)
– Sample 500K non-spam
Training and testing:
– 5-fold validation
– Vary the non-spam:spam ratio of the training folds
– Test fold: equal parts spam and non-spam
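The evaluation protocol above can be sketched as a fold generator: split each class into k folds, subsample the training non-spam to hit a chosen ratio, and balance the test fold. This is an illustrative reading of the slide; the paper's exact sampling procedure may differ, and all names are invented.

```python
import random

def ratio_folds(spam, nonspam, k=5, train_ratio=4, seed=0):
    """Sketch of k-fold validation with a controlled non-spam:spam
    training ratio and an equal-parts test fold."""
    rng = random.Random(seed)
    spam, nonspam = spam[:], nonspam[:]
    rng.shuffle(spam)
    rng.shuffle(nonspam)
    spam_folds = [spam[i::k] for i in range(k)]
    non_folds = [nonspam[i::k] for i in range(k)]
    for i in range(k):
        test_spam = spam_folds[i]
        test_non = non_folds[i][:len(test_spam)]              # equal parts
        train_spam = [u for j in range(k) if j != i for u in spam_folds[j]]
        train_non = [u for j in range(k) if j != i for u in non_folds[j]]
        train_non = train_non[:train_ratio * len(train_spam)]  # enforce ratio
        yield train_spam, train_non, test_spam, test_non

for tr_s, tr_n, te_s, te_n in ratio_folds(list(range(100)), list(range(1000))):
    print(len(tr_s), len(tr_n), len(te_s), len(te_n))
```

Varying `train_ratio` is what produces the accuracy/false-positive trade-off shown in the next slide.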
Overall Accuracy
  Training Ratio   Accuracy   False Positive Rate   False Negative Rate
  1:1              94%        4.23%                 7.5%
  4:1              91%        0.87%                 17.6%
  10:1             87%        0.29%                 26.5%
False positive: non-spam labeled as spam. False negative: spam labeled as non-spam. Accuracy: fraction of correctly labeled samples.
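The three metrics in the table follow directly from confusion-matrix counts. A small sketch, using hypothetical counts rather than the paper's data:

```python
def rates(tp, tn, fp, fn):
    """Metrics as defined under the table: a false positive is non-spam
    labeled as spam; a false negative is spam labeled as non-spam."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "fpr": fp / (fp + tn),   # non-spam labeled as spam
        "fnr": fn / (fn + tp),   # spam labeled as non-spam
    }

# Hypothetical confusion counts for illustration only:
print(rates(tp=90, tn=95, fp=5, fn=10))
```

Note the trade-off visible in the table: raising the non-spam:spam training ratio suppresses false positives (the costly error for a filtering service) at the price of more false negatives.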
Error by Feature
(chart: error rate (%) per feature category; Error = 1 - Accuracy)
Feature Evolution – Retraining Required
(chart: accuracy (%) over time)
Spam Independence
Unexpected result: Twitter and email spam are qualitatively different.
  Training Set   Testing Set   Accuracy   False Negatives
  Twitter        Twitter       94%        22%
  Twitter        Email         81%        88%
  Email          Twitter       80%        99%
  Email          Email         99%        4%
Distinct Email, Twitter Features
Email Features Shorter Lived
Limitations
Adversarial machine learning:
– We provide an oracle to spammers
– Can adversaries tweak content until it passes?
Time-based evasion:
– Change content after the URL is submitted for verification
Crawler fingerprinting:
– Identify Monarch's IP space, fingerprint Monarch's browser client
– Dual-personality DNS, page behavior
Related Work
– C. Whittaker, B. Ryner, and M. Nazif, "Large-Scale Automatic Classification of Phishing Pages"
– J. Ma, L. Saul, S. Savage, and G. Voelker, "Identifying Suspicious URLs: An Application of Large-Scale Online Learning"
– Y. Zhang, J. Hong, and L. Cranor, "Cantina: A Content-Based Approach to Detecting Phishing Web Sites"
– M. Cova, C. Kruegel, and G. Vigna, "Detection and Analysis of Drive-by-Download Attacks and Malicious JavaScript Code"
Conclusion
Monarch provides:
– Real-time scam, phishing, and malware detection
– 91% accuracy with 0.87% false positives in experiments
– A readily scalable cloud service
– Applicability to all URL-based spam
Spam is not guaranteed to overlap between web services:
– Twitter and email spam are qualitatively different
– Despite limited overlap, Monarch can still provide generalizable filtering
– Requires training data from each service