Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google May 5, 2010
Malicious Web Sites Spam-advertised goods Trojan downloads Phishing: which one is real?
3 Visiting Malicious Web Sites Predict what is safe without committing to risky actions Safe URL? Malicious download? Spam-advertised site? Phishing site? URL = Uniform Resource Locator
4 Problem in a Nutshell URL features to identify malicious Web sites No context, no content Different classes of URLs Benign, spam, phishing, exploits, scams... For now, distinguish benign vs. malicious facebook.com vs. fblight.com
5 What we want...
6 How to build this service? Blacklist Hand-picked features (properties of URL) Machine learning-based classifier
7 State of the Practice Current approaches Blacklists [SORBS, URIBL, SURBL, Spamhaus, SiteAdvisor, WOT, IronPort, WebSense] Learning on hand-tuned features [Kan & Thi '05, Garera et al. '07, CANTINA, Guan et al. '09] Limitations Cannot learn from newest examples quickly Cannot quickly adapt to newest features Arms race: fast feedback cycle is critical More automated approach?
8 Contributions URL classification system Used by large Web mail providers Practical large-scale machine learning in computer security Large feature sets + online learning Related work: Whittaker, Ryner, Nazif, “Large-Scale Automatic Classification of Phishing Pages” (NDSS'10)
9 Today's Talk Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion
11 Moving Beyond Blacklists Preliminary study: Do our features work? Batch algorithms for now 10^4 examples and features Outline System overview Features ← focus of this segment Experimental results [Ma, Saul, Savage, Voelker (KDD 2009)]
12 URL Classification System Label Example Hypothesis
13 Data Sets Malicious URLs 5,000 from PhishTank (phishing) 15,000 from Spamscatter (spam, phishing, etc) Benign URLs 15,000 from Yahoo Web directory 15,000 from DMOZ directory Malicious x Benign → 4 Data Sets 30,000 – 55,000 features per data set
14 Algorithms Logistic regression w/ L1-norm regularization Other models Naive Bayes Support vector machines (linear, RBF kernels) Implicit feature selection Easier to interpret
15 Features Example
16 Feature Vector Construction WHOIS registration: 3/25/2009 Hosted from /22 IP hosted in San Mateo Connection speed: T1 Has DNS PTR record? Yes Registrant “Chad”... [ _ _ … … …] Real-valued Host-based Lexical
17 Features to consider? 1)Blacklists 2)Simple heuristics 3)Domain name registration 4)Host properties 5)Lexical
18 (1) Blacklist Queries List of known malicious sites Providers: SORBS, URIBL, SURBL, Spamhaus Not comprehensive In blacklist? Yes No In blacklist? Blacklist queries as features
19 (2) Manually-Selected Features Considered by previous studies IP address in hostname? Number of dots in URL WHOIS (domain name) registration date Example: stopgap.cn registered 2 May 2010 [Fette et al., 2007][Zhang et al., 2007][Bergholz et al., 2008]
20 (3) WHOIS Features Domain name registration Date of registration, update, expiration Registrant: Who registered domain? Registrar: Who manages registration? Registered on 4 May 2010 By SpamMedia
21 (4) Host-Based Features Blacklisted? (SORBS, URIBL, SURBL, Spamhaus) WHOIS: registrar, registrant, dates IP address: Which ASes/IP prefixes? DNS: TTL? PTR record exists/resolves? Geography-related: Locale? Connection speed? / /20 facebook.com vs. fblight.com Bad Parts of the Internet
22 (5) Lexical Features Tokens in URL hostname + path Length of URL Number of dots [Kolari et al., 2006]
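The lexical features listed on this slide (tokens in the hostname and path, URL length, number of dots) together with the "IP address in hostname?" heuristic from the manually-selected set can be sketched as a small extractor. This is an illustrative sketch only, not the authors' implementation: the function name `lexical_features`, the feature-name prefixes, and the choice of path delimiters are all assumptions.

```python
import re
from urllib.parse import urlsplit

# Assumption: a dotted-quad check is enough to flag "IP address in hostname".
_IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def lexical_features(url):
    """Map a URL to a sparse dict of lexical features (hypothetical naming)."""
    # urlsplit only recognizes the host when a "//" is present.
    parts = urlsplit(url if "//" in url else "//" + url)
    host = (parts.hostname or "").lower()
    feats = {
        "url_length": float(len(url)),
        "num_dots": float(url.count(".")),
        "ip_in_hostname": 1.0 if _IPV4.match(host) else 0.0,
    }
    # Bag of words: one binary feature per hostname token.
    for tok in host.split("."):
        if tok:
            feats["host_tok=" + tok] = 1.0
    # One binary feature per path token; the delimiter set is an assumption.
    for tok in re.split(r"[/.?=\-_&]+", parts.path):
        if tok:
            feats["path_tok=" + tok.lower()] = 1.0
    return feats
```

Because every distinct token becomes its own binary feature, the number of lexical features grows with the number of distinct URLs seen.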
23 Which feature sets? [Bar chart: error rate (%) per feature set (Blacklist, Manual, WHOIS, Host-based, Lexical), annotated with # features per set, e.g. 4,000] More features → Better accuracy
24 Which feature sets? [Bar chart: error rate (%) per feature set (Blacklist, Manual, WHOIS, Host-based, Lexical, Full), annotated with # features per set, up to 30,000] 96-99% accuracy
25 Which feature sets? [Bar chart: error rate (%) per feature set (Blacklist, Manual, WHOIS, Host-based, Lexical, Full, Full w/o WHOIS/Blacklist), annotated with # features per set, including 30,000 and 26,000] Blacklists and WHOIS are not comprehensive
26 Beyond Blacklists Blacklist Full features Yahoo-PhishTank Higher detection rate for given false positive rate
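"Higher detection rate for a given false positive rate" is a statement about a single operating point on a ROC curve. The sketch below shows one way such a number could be computed from classifier scores; the function name `detection_rate_at_fpr` and the label convention (1 = malicious, 0 = benign) are assumptions, and the data in the usage note is made up for illustration.

```python
def detection_rate_at_fpr(scores, labels, max_fpr):
    """Best true positive rate over thresholds whose false positive rate
    does not exceed max_fpr. Assumes both classes are present."""
    pos = sum(1 for label in labels if label == 1)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Consume all examples tied at this score before evaluating the
        # threshold, so that ties are never split.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / neg <= max_fpr:
            best = max(best, tp / pos)
        i = j
    return best
```

For example, with scores `[0.9, 0.8, 0.7, 0.6, 0.2]` and labels `[1, 1, 0, 1, 0]`, allowing no false positives yields a detection rate of 2/3, while allowing a false positive rate of 0.5 yields 1.0.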
27 Summary Detect malicious URLs with high accuracy Only using URL Diverse feature set helps: 99% w/ 30,000+ features Model analysis (more in KDD'09 paper) What about scalable and adaptive malicious URL detection system?
28 Today's Talk Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion
29 URL Classification System Label Example Hypothesis
30 Live URL Classification System Label Example Hypothesis
31 Large-Scale Online Learning How do we scale to live, large-scale data? Outline Live training feed Challenges: scale and non-stationarity Need for large, fresh training sets Online learning [Ma, Saul, Savage, Voelker (ICML 2009)]
32 Live Training Feed Malicious URLs (spamming and phishing) 6,000—7,500 per day from Web mail provider Benign URLs From Yahoo Web directory Total of 20,000 URLs per day Collected Jan – May, days More than two million examples
33 Feature Vector Construction WHOIS registration: 3/25/2009 Hosted from /22 IP hosted in San Mateo Connection speed: T1 Has DNS PTR record? Yes Registrant “Chad”... [ _ _ … … …] Real-valued (60+ features) Host-based (1.1 million) Lexical (1.8 million) GROWING Day 100
34 New Features Encountered All the Time Many binary features Enumerating tokens, ISPs, registrants, etc million features after 100 days
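Because binary features are created by enumerating tokens, ISPs, registrants and so on, the feature space cannot be fixed in advance. One common way to handle this is a name-to-index map that assigns a fresh index the first time a feature is seen. The sketch below illustrates the idea; the class name `GrowingFeatureIndex` and its `grow` flag are hypothetical and not taken from the talk.

```python
class GrowingFeatureIndex:
    """Maps feature names to integer indices, adding new ones on demand."""

    def __init__(self):
        self._index = {}

    def __len__(self):
        return len(self._index)

    def vectorize(self, feats, grow=True):
        """Turn {name: value} into a sparse {index: value} vector.

        With grow=False, unseen names are dropped, which mimics a fixed
        feature set chosen at some earlier point in time.
        """
        vec = {}
        for name, value in feats.items():
            idx = self._index.get(name)
            if idx is None:
                if not grow:
                    continue
                idx = len(self._index)
                self._index[name] = idx
            vec[idx] = value
        return vec
```

Sparse vectors keep the per-example cost proportional to the number of active features rather than to the total number of features seen so far.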
35 Live URL Classification System Online learning
36 Practical Challenges of ML in Systems Industrial concerns Scale: millions of examples, features Non-stationarity: examples change over time (arms race w/ criminals) Pivotal decision: batch or online?
37 Batch vs. Online Learning Batch/offline learning SVM, logistic regression, decision trees, etc Multiple passes over data No incremental updates Potentially high memory and processing overhead Online learning Perceptron-style algorithms Single pass over data Incremental updates Low memory and processing overhead Online learning addresses scale and non-stationarity
38 Evaluations Online learning for URL reputation Need for large, fresh training sets Comparing online algorithms Continuous retraining Growing feature vector
39 Need lots of fresh training data? SVM trained once SVM retrained daily Fresh data helps
40 Need lots of fresh training data? Fresh data helps More data helps SVM trained once on 2 weeks SVM w/ 2-week sliding window
41 Which online algorithm? Perceptron Stochastic Gradient Descent for Logistic Regression Confidence-Weighted Learning
42 Perceptron [Rosenblatt, 1958] Update on each mistake: $w_{t+1} = w_t + y_t x_t$ Convergence result: number of mistakes $\le (R/\gamma)^2$, where $R$ is the radius of the data and $\gamma$ is the margin
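The perceptron rule (add $y_t x_t$ to the weights whenever an example is misclassified) fits in a few lines when examples are sparse dicts. This is a minimal sketch under assumed conventions: labels are +1 for malicious and -1 for benign, and the helper names `dot` and `perceptron_pass` are hypothetical.

```python
def dot(w, x):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(k, 0.0) * v for k, v in x.items())


def perceptron_pass(stream, w=None):
    """Single pass over (features, label) pairs; returns (weights, mistakes)."""
    w = {} if w is None else w
    mistakes = 0
    for x, y in stream:
        # A non-positive margin counts as a mistake and triggers an update.
        if y * dot(w, x) <= 0:
            mistakes += 1
            for k, v in x.items():
                w[k] = w.get(k, 0.0) + y * v
    return w, mistakes
```

Every mistake moves the weights by the same amount regardless of how wrong the prediction was, and features that first appear mid-stream simply start from weight zero.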
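43 Logistic Regression with SGD [Bottou, 1998] Log likelihood: $L(w) = \sum_t \log \sigma(y_t\, w \cdot x_t)$, where $\sigma(z) = 1/(1 + e^{-z})$ For every example: $w_{t+1} = w_t + \lambda\, \Delta_t x_t$, with $\Delta_t = \frac{y_t + 1}{2} - \sigma(w_t \cdot x_t)$ and step size $\lambda$ Update is proportional to $\Delta_t$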
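For logistic regression trained by stochastic gradient descent, each example moves the weights in proportion to the gap between the target (1 for malicious, 0 for benign) and the predicted probability. The sketch below assumes labels in {-1, +1} and a constant learning rate; the names `sigmoid`, `sgd_logistic_step`, and `lr` are illustrative choices, not taken from the talk.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_logistic_step(w, x, y, lr=0.1):
    """One SGD step on the log likelihood; updates w in place.

    Returns delta = (y + 1) / 2 - sigmoid(w . x), the prediction error
    that scales the update.
    """
    p = sigmoid(sum(w.get(k, 0.0) * v for k, v in x.items()))
    delta = (y + 1) / 2.0 - p
    for k, v in x.items():
        w[k] = w.get(k, 0.0) + lr * delta * v
    return delta
```

Unlike the perceptron, the model is updated on every example, and confidently correct predictions produce only tiny changes.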
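44 Confidence-Weighted Learning [Dredze et al., 2008] [Crammer et al., 2009] Maintain Gaussian distribution over weight vector: $w \sim \mathcal{N}(\mu, \Sigma)$ Constrained problem: $(\mu_{t+1}, \Sigma_{t+1}) = \arg\min_{\mu, \Sigma} D_{KL}(\mathcal{N}(\mu, \Sigma) \,\|\, \mathcal{N}(\mu_t, \Sigma_t))$ subject to $\Pr_{w \sim \mathcal{N}(\mu, \Sigma)}[y_t (w \cdot x_t) \ge 0] \ge \eta$ Closed-form update: $\mu_{t+1} = \mu_t + \alpha_t y_t \Sigma_t x_t$, with a matching closed-form update for $\Sigma_{t+1}$ Update features at different rates Diagonal covariance matrix for evals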
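A diagonal-covariance confidence-weighted learner keeps a mean and a variance per feature, so rarely seen features (high variance) move faster than well-established ones. The sketch below follows the variance-constraint form of the update from Dredze et al. (2008), with the covariance restricted to its diagonal. It is an illustration under stated assumptions, not the implementation used in the talk: the class name `DiagonalCW`, the parameter `phi` (the confidence parameter of that formulation), and the initial variance of 1.0 are all assumed.

```python
import math


class DiagonalCW:
    """Confidence-weighted linear classifier with a diagonal covariance."""

    def __init__(self, phi=1.0, init_var=1.0):
        self.phi = phi
        self.init_var = init_var
        self.mu = {}   # per-feature mean weight
        self.var = {}  # per-feature variance (lower = more confident)

    def margin(self, x):
        return sum(self.mu.get(k, 0.0) * v for k, v in x.items())

    def update(self, x, y):
        """Update on one example with label y in {-1, +1}; returns alpha."""
        phi = self.phi
        m = y * self.margin(x)
        v_t = sum(self.var.get(k, self.init_var) * v * v for k, v in x.items())
        if v_t == 0.0:
            return 0.0
        # Closed-form step size; the discriminant equals
        # (1 - 2*phi*m)**2 + 8*phi**2*v_t, so it is never negative.
        b = 1.0 + 2.0 * phi * m
        disc = b * b - 8.0 * phi * (m - phi * v_t)
        alpha = max(0.0, (-b + math.sqrt(disc)) / (4.0 * phi * v_t))
        if alpha > 0.0:
            for k, v in x.items():
                s = self.var.get(k, self.init_var)
                # Mean moves further for features with high variance.
                self.mu[k] = self.mu.get(k, 0.0) + alpha * y * s * v
                # Precision grows, so variance shrinks for active features.
                self.var[k] = 1.0 / (1.0 / s + 2.0 * alpha * phi * v * v)
        return alpha
```

Features absent from an example are left untouched, which is what lets the learner update features at different rates and absorb newly appearing features with full initial uncertainty.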
45 Which online algorithms? Perceptron
46 Which online algorithms? LR w/ SGD Proportional update helps Perceptron
47 Which online algorithms? Proportional update helps Per-feature confidence really helps Confidence-Weighted LR w/ SGD Perceptron
48 Batch... Fresh data helps More data helps Batch Batch
49 Batch vs. Online Confidence-Weighted Batch Batch Fresh data helps More data helps Online matches batch
50 Why online does well? SVM w/ 2-week sliding window Confidence-Weighted
51 Why online does well? Confidence-Weighted once-a-day More data eventually helps Continuous retraining helps SVM w/ 2-week sliding window Confidence-Weighted
52 Growing feature vector? Confidence-Weighted fixed features
53 Growing feature vector? Confidence-Weighted fixed features Confidence-Weighted growing features Growing feature vector helps
54 Fixed vs. Variable Features Perceptron fixed features Growing feature vector helps
55 Fixed vs. Variable Features Perceptron fixed features Growing feature vector helps Growing + CW really helps Perceptron growing features
56 Proof-of-Concept Plugin
57 Examine URL details...
58 It sure looks like phishing... The Real Site
59 Blacklisted later on
60 Summary Detecting malicious URLs Relevant real-world problem Successful application of online learning What helps? More, fresher data Continuous retraining Growing feature vector Confidence-Weighted vs. Batch As accurate More adaptive Fewer resources
61 Today's Talk Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion
62 Impact Public data set (UCI ML repo) Industrial impact Mail providers have adopted our approach for classifying URLs in messages Project Info + Data Set
63 Final Thoughts: Systems + ML Systems: high-impact, large-scale applications ML: Methodical approaches Systems: Embrace real-world constraints ML: More than “plug-and-play” solutions Systems ↔ Machine Learning