Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.

Presentation transcript:

Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google May 5, 2010

Malicious Web Sites Spam-advertised goods Trojan downloads Phishing: which one is real?

3 Visiting Malicious Web Sites Predict what is safe without committing to risky actions Safe URL? Malicious download? Spam-advertised site? Phishing site? URL = Uniform Resource Locator

4 Problem in a Nutshell URL features to identify malicious Web sites No context, no content Different classes of URLs Benign, spam, phishing, exploits, scams... For now, distinguish benign vs. malicious Example: facebook.com vs. fblight.com

5 What we want...

6 How to build this service? Blacklist Hand-picked features (properties of URL) Machine learning-based classifier

7 State of the Practice Current approaches Blacklists [SORBS, URIBL, SURBL, Spamhaus, SiteAdvisor, WOT, IronPort, WebSense] Learning on hand-tuned features [Kan & Thi '05, Garera et al. '07, CANTINA, Guan et al. '09] Limitations Cannot learn from newest examples quickly Cannot quickly adapt to newest features Arms race: fast feedback cycle is critical More automated approach?

8 Contributions URL classification system Used by large Web mail providers Practical large-scale machine learning in computer security Large feature sets + online learning Related work: Whittaker, Ryner, Nazif, “Large-Scale Automatic Classification of Phishing Pages” (NDSS'10)

9 Today's Talk Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion

10 Today's Talk Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion

11 Moving Beyond Blacklists Preliminary study: Do our features work? Batch algorithms for now $10^4$ examples and features Outline System overview Features ← focus of this segment Experimental results [Ma, Saul, Savage, Voelker (KDD 2009)]

12 URL Classification System [Diagram labels: Label, Example, Hypothesis]

13 Data Sets Malicious URLs 5,000 from PhishTank (phishing) 15,000 from Spamscatter (spam, phishing, etc) Benign URLs 15,000 from Yahoo Web directory 15,000 from DMOZ directory Malicious x Benign → 4 Data Sets 30,000 – 55,000 features per data set

14 Algorithms Logistic regression w/ L1-norm regularization Other models Naive Bayes Support vector machines (linear, RBF kernels) Implicit feature selection Easier to interpret
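The "implicit feature selection" of L1-regularized logistic regression can be illustrated with a small sketch. The slides do not say which solver was used; the code below swaps in plain proximal gradient descent (soft-thresholding), and the function names, step size, and regularization strength are assumptions chosen for illustration only.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(v, t):
    # Proximal operator of t * |v|: shrink toward zero, clip small values to 0.
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def train_l1_logreg(X, y, lam=0.05, lr=0.5, epochs=200):
    """Hypothetical helper. X: list of dense feature lists; y: labels in {0, 1}."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j] / n
        # Gradient step on the log loss, then the L1 shrinkage step.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w

# Feature 0 determines the label; feature 1 carries no information.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [1, 1, 0, 0]
weights = train_l1_logreg(X, y)
```

In this toy data the uninformative feature ends with a weight of exactly zero, which is what makes the resulting model sparse and easier to interpret.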

15 Features Example

16 Feature Vector Construction WHOIS registration: 3/25/2009 Hosted from /22 IP hosted in San Mateo Connection speed: T1 Has DNS PTR record? Yes Registrant “Chad”... Feature vector groups: Real-valued, Host-based, Lexical

17 Features to consider? 1) Blacklists 2) Simple heuristics 3) Domain name registration 4) Host properties 5) Lexical

18 (1) Blacklist Queries List of known malicious sites Providers: SORBS, URIBL, SURBL, Spamhaus Not comprehensive Blacklist queries as features: In blacklist? Yes / No

19 (2) Manually-Selected Features Considered by previous studies IP address in hostname? Number of dots in URL WHOIS (domain name) registration date Example: stopgap.cn registered 2 May 2010 [Fette et al., 2007][Zhang et al., 2007][Bergholz et al., 2008]

20 (3) WHOIS Features Domain name registration Date of registration, update, expiration Registrant: Who registered domain? Registrar: Who manages registration? Registered on 4 May 2010 By SpamMedia

21 (4) Host-Based Features Blacklisted? (SORBS, URIBL, SURBL, Spamhaus) WHOIS: registrar, registrant, dates IP address: Which ASes/IP prefixes? DNS: TTL? PTR record exists/resolves? Geography-related: Locale? Connection speed? / /20 facebook.com fblight.com Bad Parts of the Internet

22 (5) Lexical Features Tokens in URL hostname + path Length of URL Number of dots [Kolari et al., 2006]
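A minimal sketch of how lexical features such as these might be extracted into a sparse bag-of-words vector. The function name, the feature-name prefixes, and the delimiter set used to split the path are illustrative assumptions, not details taken from the slides.

```python
import re
from urllib.parse import urlsplit

def lexical_features(url):
    """Return a sparse {feature_name: value} dict for one URL (names are illustrative)."""
    # urlsplit needs a scheme to find the hostname, so add one if it is missing.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    path = parts.path or ""
    feats = {
        "url_length": float(len(url)),
        "num_dots": float(url.count(".")),
    }
    # One binary feature per hostname token, kept separate from path tokens.
    for tok in filter(None, host.split(".")):
        feats["host_token=" + tok] = 1.0
    # One binary feature per path token, split on common URL delimiters.
    for tok in filter(None, re.split(r"[/?.=\-_&]+", path)):
        feats["path_token=" + tok] = 1.0
    return feats

example = lexical_features("http://www.example.com/login/index.php")
```

Keeping hostname and path tokens in separate namespaces lets a classifier treat "com" in the hostname differently from "com" in the path.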

23 Which feature sets? [Chart: error rate (%) and number of features for the Blacklist, Manual, WHOIS, Host-based, and Lexical feature sets] More features → Better accuracy

24 Which feature sets? [Chart: error rate (%) and number of features for the Blacklist, Manual, WHOIS, Host-based, Lexical, and Full (30,000 features) feature sets] Full: 96 to 99% accuracy

25 Which feature sets? [Chart: error rate (%) and number of features for the Blacklist, Manual, WHOIS, Host-based, Lexical, Full (30,000 features), and w/o WHOIS/Blacklist (26,000 features) feature sets] Blacklists and WHOIS are not comprehensive

26 Beyond Blacklists Blacklist Full features Yahoo-PhishTank Higher detection rate for given false positive rate

27 Summary Detect malicious URLs with high accuracy Only using URL Diverse feature set helps: 99% w/ 30,000+ features Model analysis (more in KDD'09 paper) What about scalable and adaptive malicious URL detection system?

28 Today's Talk Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion

29 URL Classification System [Diagram labels: Label, Example, Hypothesis]

30 Live URL Classification System [Diagram labels: Label, Example, Hypothesis]

31 Large-Scale Online Learning How do we scale to live, large-scale data? Outline Live training feed Challenges: scale and non-stationarity Need for large, fresh training sets Online learning [Ma, Saul, Savage, Voelker (ICML 2009)]

32 Live Training Feed Malicious URLs (spamming and phishing) 6,000—7,500 per day from Web mail provider Benign URLs From Yahoo Web directory Total of 20,000 URLs per day Collected Jan – May, days More than two million examples

33 Feature Vector Construction WHOIS registration: 3/25/2009 Hosted from /22 IP hosted in San Mateo Connection speed: T1 Has DNS PTR record? Yes Registrant “Chad”... Feature vector groups at Day 100: Real-valued (60+ features), Host-based (1.1 million), Lexical (1.8 million), GROWING

34 New Features Encountered All the Time Many binary features Enumerating tokens, ISPs, registrants, etc million features after 100 days
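One way to cope with features that keep appearing is to assign vector indices lazily, as names are first seen. The sketch below is an assumption about how this could be done; the class name `GrowingFeatureIndex` and its `grow` flag are hypothetical and not from the talk.

```python
class GrowingFeatureIndex:
    """Maps feature names (tokens, ISPs, registrants, ...) to integer indices on demand."""

    def __init__(self):
        self.index = {}

    def vectorize(self, feats, grow=True):
        # feats: {feature_name: value}. Returns a sparse {index: value} vector.
        vec = {}
        for name, value in feats.items():
            if name not in self.index:
                if not grow:
                    continue  # fixed feature set: silently drop unseen features
                self.index[name] = len(self.index)
            vec[self.index[name]] = value
        return vec

idx = GrowingFeatureIndex()
day1 = idx.vectorize({"host_token=example": 1.0, "registrar=A": 1.0})
day2 = idx.vectorize({"registrar=A": 1.0, "registrar=B": 1.0})
frozen = idx.vectorize({"registrar=C": 1.0, "host_token=example": 2.0}, grow=False)
```

Calling `vectorize` with `grow=False` mimics a fixed feature set, where anything unseen at training time is ignored.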

35 Live URL Classification System Online learning

36 Practical Challenges of ML in Systems Industrial concerns Scale: millions of examples, features Non-stationarity: examples change over time (arms race w/ criminals) Pivotal decision: batch or online?

37 Batch vs. Online Learning Batch/offline learning SVM, logistic regression, decision trees, etc Multiple passes over data No incremental updates Potentially high memory and processing overhead Online learning Perceptron-style algorithms Single pass over data Incremental updates Low memory and processing overhead Online learning addresses scale and non-stationarity

38 Evaluations Online learning for URL reputation Need for large, fresh training sets Comparing online algorithms Continuous retraining Growing feature vector

39 Need lots of fresh training data? SVM trained once SVM retrained daily Fresh data helps

40 Need lots of fresh training data? Fresh data helps More data helps SVM trained once on 2 weeks SVM w/ 2-week sliding window
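A sliding-window retraining schedule can be sketched as follows. This is an illustrative assumption about the bookkeeping only: `sliding_window_batches` is a hypothetical helper, and the batch model that would be retrained on each window (an SVM in the talk) is left out.

```python
from collections import deque

def sliding_window_batches(daily_batches, window_days=14):
    """Yield (day, training_set) where training_set covers the last window_days days."""
    window = deque(maxlen=window_days)  # the oldest day falls out automatically
    for day, batch in enumerate(daily_batches):
        window.append(batch)
        yield day, [example for b in window for example in b]

# Toy feed: one placeholder example per day, labelled by its day number.
feed = [[d] for d in range(20)]
schedule = list(sliding_window_batches(feed))
```

Each yielded training set would be handed to the batch learner for a full retrain, which is the cost that online learning avoids.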

41 Which online algorithm? Perceptron Stochastic Gradient Descent for Logistic Regression Confidence-Weighted Learning

42 Perceptron [Rosenblatt, 1958] Update on each mistake: $w_{t+1} = w_t + y_t x_t$ Convergence result: number of mistakes $\le (R/\gamma)^2$, where $R$ is the radius of the data and $\gamma$ is the margin
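A minimal sketch of the mistake-driven Perceptron update on sparse feature vectors, assuming labels in {-1, +1} and examples given as `{feature_name: value}` dicts. The function name and data layout are illustrative choices.

```python
def perceptron_train(stream):
    """Single pass over (x, y) pairs; x is a sparse dict, y is -1 or +1."""
    w = {}
    mistakes = 0
    for x, y in stream:
        score = sum(w.get(k, 0.0) * v for k, v in x.items())
        if y * score <= 0:  # misclassified (or on the boundary)
            mistakes += 1
            for k, v in x.items():
                # Add the example toward its label: w <- w + y * x
                w[k] = w.get(k, 0.0) + y * v
    return w, mistakes

stream = [({"a": 1.0}, 1), ({"b": 1.0}, -1), ({"a": 1.0}, 1), ({"b": 1.0}, -1)]
w, mistakes = perceptron_train(stream)
```

Every mistake moves the weights by the same fixed amount, regardless of how wrong the prediction was.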

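43 Logistic Regression with SGD [Bottou, 1998] Log likelihood: $L(w) = \sum_t \log \sigma(y_t (w \cdot x_t))$ where $\sigma(z) = 1/(1 + e^{-z})$ For every example: $w_{t+1} = w_t + \rho \, \Delta_t x_t$ with $\Delta_t = \frac{y_t + 1}{2} - \sigma(w_t \cdot x_t)$ and learning rate $\rho$ Proportional update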
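A minimal sketch of stochastic gradient ascent on the logistic log likelihood, assuming labels in {-1, +1}, sparse dict examples, and a constant learning rate. The names `sgd_logreg_train` and `lr` are illustrative.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sgd_logreg_train(stream, lr=0.1):
    """Single pass over (x, y) pairs; x is a sparse dict, y is -1 or +1."""
    w = {}
    for x, y in stream:
        p = sigmoid(sum(w.get(k, 0.0) * v for k, v in x.items()))
        # Target is 1 for y = +1 and 0 for y = -1; delta is the prediction error.
        delta = (y + 1) / 2 - p
        for k, v in x.items():
            w[k] = w.get(k, 0.0) + lr * delta * v
    return w

first = sgd_logreg_train([({"a": 1.0}, 1)])
both = sgd_logreg_train([({"a": 1.0}, 1), ({"b": 1.0}, -1)])
```

Unlike the Perceptron, every example triggers an update, and its size is proportional to how far the predicted probability is from the label.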

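44 Confidence-Weighted Learning [Dredze et al., 2008] [Crammer et al., 2009] Maintain Gaussian distribution over weight vector: $w \sim \mathcal{N}(\mu, \Sigma)$ Constrained problem: $(\mu_{t+1}, \Sigma_{t+1}) = \arg\min_{\mu, \Sigma} D_{KL}(\mathcal{N}(\mu, \Sigma) \,\|\, \mathcal{N}(\mu_t, \Sigma_t))$ subject to $\Pr_{w \sim \mathcal{N}(\mu, \Sigma)}[y_t (w \cdot x_t) \ge 0] \ge \eta$ Closed-form update: $\mu_{t+1} = \mu_t + \alpha_t y_t \Sigma_t x_t$, with $\Sigma$ shrinking along $x_t$ Update features at different rates Diagonal covariance matrix for evals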
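A sketch of a diagonal confidence-weighted learner, written from the variance-constraint form of the update as it is commonly stated for Dredze et al. (2008). Treat the exact formula for the step size as an assumption to check against the papers; the class name, the default confidence `eta`, and the initial variance are illustrative choices.

```python
import math
from statistics import NormalDist

class DiagonalCW:
    """Per-feature mean and variance; labels are -1 or +1, examples are sparse dicts."""

    def __init__(self, eta=0.9, initial_variance=1.0):
        self.phi = NormalDist().inv_cdf(eta)  # requires eta > 0.5 so that phi > 0
        self.a = initial_variance
        self.mu = {}
        self.sigma = {}

    def score(self, x):
        return sum(self.mu.get(k, 0.0) * v for k, v in x.items())

    def predict(self, x):
        return 1 if self.score(x) >= 0 else -1

    def update(self, x, y):
        phi = self.phi
        m = y * self.score(x)  # signed margin of the mean
        v = sum(self.sigma.get(k, self.a) * val * val for k, val in x.items())
        if v == 0.0:
            return  # empty example: nothing to update
        b = 1.0 + 2.0 * phi * m
        gamma = (-b + math.sqrt(b * b - 8.0 * phi * (m - phi * v))) / (4.0 * phi * v)
        alpha = max(gamma, 0.0)
        if alpha == 0.0:
            return  # already confident enough about this example
        for k, val in x.items():
            s = self.sigma.get(k, self.a)
            # High-variance (rarely seen) features move the most.
            self.mu[k] = self.mu.get(k, 0.0) + alpha * y * s * val
            # Each update makes the learner more certain about this feature.
            self.sigma[k] = 1.0 / (1.0 / s + 2.0 * alpha * phi * val * val)

cw = DiagonalCW()
cw.update({"a": 1.0}, 1)
cw.update({"b": 1.0}, -1)
```

Because each feature has its own variance, a feature seen for the first time gets a large update while frequently seen features change slowly.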

45 Which online algorithms? Perceptron

46 Which online algorithms? LR w/ SGD Proportional update helps Perceptron

47 Which online algorithms? Proportional update helps Per-feature confidence really helps Confidence-Weighted LR w/ SGD Perceptron

48 Batch... Fresh data helps More data helps [Plot label: Batch]

49 Batch vs. Online Fresh data helps More data helps Online matches batch [Plot labels: Confidence-Weighted, Batch]

50 Why online does well? SVM w/ 2-week sliding window Confidence-Weighted

51 Why online does well? Confidence-Weighted once-a-day More data eventually helps Continuous retraining helps SVM w/ 2-week sliding window Confidence-Weighted

52 Growing feature vector? Confidence-Weighted fixed features

53 Growing feature vector? Confidence-Weighted fixed features Confidence-Weighted growing features Growing feature vector helps

54 Fixed vs. Variable Features Perceptron fixed features Growing feature vector helps

55 Fixed vs. Variable Features Perceptron fixed features Growing feature vector helps Growing + CW really helps Perceptron growing features

56 Proof-of-Concept Plugin

57 Examine URL details...

58 It sure looks like phishing... The Real Site

59 Blacklisted later on

60 Summary Detecting malicious URLs Relevant real-world problem Successful application of online learning What helps? More, fresher data Continuous retraining Growing feature vector Confidence-Weighted vs. Batch As accurate More adaptive Fewer resources

61 Today's Talk Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion

62 Impact Public data set (UCI ML repo) Industrial impact Mail providers have adopted our approach for classifying URLs in messages Project Info + Data Set

63 Final Thoughts: Systems + ML Systems: high-impact, large-scale applications ML: Methodical approaches Systems: Embrace real-world constraints ML: More than “plug-and-play” solutions Systems ↔ Machine Learning