Identifying Suspicious URLs: An Application of Large-Scale Online Learning Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering.

Slides:



Advertisements
Similar presentations
Koby Crammer Department of Electrical Engineering
Advertisements

PhishDef: URL Names Say It All Anh Le, Athina Markopoulou University of California, Irvine USA Michalis Faloutsos University of California, Riverside USA.
INTRODUCTION TO Machine Learning ETHEM ALPAYDIN © The MIT Press, Lecture Slides for.
Efficient Large-Scale Structured Learning
CSCI 347 / CS 4206: Data Mining Module 07: Implementations Topic 03: Linear Models.
Optimization Tutorial
“Identifying Suspicious URLs: An Application of Large-Scale Online Learning” Paper by Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker.
Supervised Learning Recap
Design and Evaluation of a Real-Time URL Spam Filtering Service
Confidence-Weighted Linear Classification Mark Dredze, Koby Crammer University of Pennsylvania Fernando Pereira Penn  Google.
Design and Evaluation of a Real- Time URL Spam Filtering Service Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, Dawn Song University of California,
Overview over different methods – Supervised Learning
Lecture 14 – Neural Networks
ETHEM ALPAYDIN © The MIT Press, Lecture Slides for.
Lecture 29: Optimization and Neural Nets CS4670/5670: Computer Vision Kavita Bala Slides from Andrej Karpathy and Fei-Fei Li
Linear Learning Machines  Simplest case: the decision function is a hyperplane in input space.  The Perceptron Algorithm: Rosenblatt, 1956  An on-line.
Linear Learning Machines  Simplest case: the decision function is a hyperplane in input space.  The Perceptron Algorithm: Rosenblatt, 1956  An on-line.
Internet Quarantine: Requirements for Containing Self-Propagating Code David Moore et. al. University of California, San Diego.
Prophiler: A fast filter for the large-scale detection of malicious web pages Reporter : 鄭志欣 Advisor: Hsing-Kuo Pao Date : 2011/03/31 1.
Online Learning Algorithms
CSC 4510 – Machine Learning Dr. Mary-Angela Papalaskari Department of Computing Sciences Villanova University Course website:
Detecting Spammers with SNARE: Spatio-temporal Network-level Automatic Reputation Engine Shuang Hao, Nadeem Ahmed Syed, Nick Feamster, Alexander G. Gray,
URLDoc: Learning to Detect Malicious URLs using Online Logistic Regression Presented by : Mohammed Nazim Feroz 11/26/2013.
Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science.
Chapter 9: Cooperation in Intrusion Detection Networks Authors: Carol Fung and Raouf Boutaba Editors: M. S. Obaidat and S. Misra Jon Wiley & Sons publishing.
Reporter: Li, Fong Ruei National Taiwan University of Science and Technology 9/19/2015Slide 1 (of 32)
1 Logistic Regression Adapted from: Tom Mitchell’s Machine Learning Book Evan Wei Xiang and Qiang Yang.
Information Filtering LBSC 796/INFM 718R Douglas W. Oard Session 10, April 13, 2011.
Selective Block Minimization for Faster Convergence of Limited Memory Large-scale Linear Models Kai-Wei Chang and Dan Roth Experiment Settings Block Minimization.
M Machine Learning F# and Accord.net. Alena Dzenisenka Software architect at Luxoft Poland Member of F# Software Foundation Board of Trustees Researcher.
CS 782 – Machine Learning Lecture 4 Linear Models for Classification  Probabilistic generative models  Probabilistic discriminative models.
Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google.
Center for Evolutionary Functional Genomics Large-Scale Sparse Logistic Regression Jieping Ye Arizona State University Joint work with Jun Liu and Jianhui.
Spamscatter: Characterizing Internet Scam Hosting Infrastructure By D. Anderson, C. Fleizach, S. Savage, and G. Voelker Presented by Mishari Almishari.
Leveraging Asset Reputation Systems to Detect and Prevent Fraud and Abuse at LinkedIn Jenelle Bray Staff Data Scientist Strata + Hadoop World New York,
Jun Li, Peng Zhang, Yanan Cao, Ping Liu, Li Guo Chinese Academy of Sciences State Grid Energy Institute, China Efficient Behavior Targeting Using SVM Ensemble.
Lexical Feature Based Phishing URL Detection Using Online Learning Reporter: Jing Chiu Advisor: Yuh-Jye Lee /3/17Data.
Reporter: Jing Chiu Advisor: Yuh-Jye Lee /3/17 1 Data Mining and Machine Learning Lab.
CHAPTER 8 DISCRIMINATIVE CLASSIFIERS HIDDEN MARKOV MODELS.
Trends in Circumventing Web-Malware Detection UTSA Moheeb Abu Rajab, Lucas Ballard, Nav Jagpal, Panayiotis Mavrommatis, Daisuke Nojiri, Niels Provos, Ludwig.
Security Analytics Thrust Anthony D. Joseph (UCB) Rachel Greenstadt (Drexel), Ling Huang (Intel), Dawn Song (UCB), Doug Tygar (UCB)
Online Multiple Kernel Classification Steven C.H. Hoi, Rong Jin, Peilin Zhao, Tianbao Yang Machine Learning (2013) Presented by Audrey Cheong Electrical.
Online Learning Rong Jin. Batch Learning Given a collection of training examples D Learning a classification model from D What if training examples are.
M Machine Learning F# and Accord.net.
ICDCS 2014 Madrid, Spain 30 June-3 July 2014
Tracking Malicious Regions of the IP Address Space Dynamically.
Carnegie Mellon School of Computer Science Language Technologies Institute CMU Team-1 in TDT 2004 Workshop 1 CMU TEAM-A in TDT 2004 Topic Tracking Yiming.
A Framework for Detection and Measurement of Phishing Attacks Reporter: Li, Fong Ruei National Taiwan University of Science and Technology 2/25/2016 Slide.
Strong Supervision From Weak Annotation Interactive Training of Deformable Part Models ICCV /05/23.
Big Data Infrastructure Week 8: Data Mining (1/4) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States.
CMPS 142/242 Review Section Fall 2011 Adapted from Lecture Slides.
Jan 27, Digital Preservation Seminar1 Effective Page Refresh Policies for Web Crawlers Written By: Junghoo Cho & Hector Garcia-Molina Presenter:
András Benczúr Head, “Big Data – Momentum” Research Group Big Data Analytics Institute for Computer.
COLIN O’HANLON & NICK CIGANKO Sam Spade: Network Query Tool.
Combining Models Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya.
Experience Report: System Log Analysis for Anomaly Detection
Under the Shadow of sunshine
Learning to Detect and Classify Malicious Executables in the Wild by J
Internet Quarantine: Requirements for Containing Self-Propagating Code
Multilayer Perceptrons
Multiplicative updates for L1-regularized regression
Lecture 07: Soft-margin SVM
BotCatch: A Behavior and Signature Correlated Bot Detection Approach
Lecture 07: Soft-margin SVM
Logistic Regression & Parallel SGD
Lecture 07: Soft-margin SVM
Binghui Wang, Le Zhang, Neil Zhenqiang Gong
Jonathan Elsas LTI Student Research Symposium Sept. 14, 2007
The Updated experiment based on LSTM
Logistic Regression Geoff Hulten.
Presentation transcript:

Identifying Suspicious URLs: An Application of Large-Scale Online Learning Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for ICML 2009 June 15, 2009

2 Detecting Malicious Web Sites Predict what is safe without committing to risky actions Safe URL? Web exploit? Spam-advertised site? Phishing site? URL = Uniform Resource Locator

3 Problem in a Nutshell ● URL features to identify malicious Web sites ● Different classes of URLs ● Benign, spam, phishing, exploits, scams... ● For now, distinguish benign vs. malicious facebook.comfblight.co m

4 Today's Talk ● Problem ● Approach ● Learning to detect malicious URLs ● Challenges: scale and non-stationarity ● Evaluations ● Need for large, fresh training sets ● Online learning ● Conclusion

5 State of the Practice ● Current approaches ● Blacklists ● Learning on hand-tuned features ● Limitations ● Cannot learn from newest examples quickly ● Cannot quickly adapt to newest features ● Arms race: fast feedback cycle is critical More automated approach?

6 Live URL Classification System LabelExampleHypothesis

7 Live Training Feed ● Malicious URLs (spamming and phishing) ● 6,000—7,500 per day from Web mail provider ● Benign URLs ● From Yahoo Web directory ● Total of 20,000 URLs per day ● Live collection since Jan. 5, 2009 ● Months of data ● Two million examples after 100 days

8 Feature vector construction WHOIS registration: 3/25/2009 Hosted from /22 IP hosted in San Mateo Connection speed: T1 Has DNS PTR record? Yes Registrant “Chad”... [ _ _ … … …] Real-valued Host-basedLexical 60+ features1.8 million 1.1 million GROWING Day 100

9 Live URL Classficiation System Online learning

10 Practical Challenges of ML in Systems ● Industrial concerns ● Scale: millions of examples, features ● Non-stationarity: examples change over time (arms race w/ criminals) ● Pivotal decision: batch or online?

11 Batch vs. Online Learning ● Batch/offline learning ● SVM, logistic regression, decision trees, etc ● Multiple passes over data ● No incremental updates ● Potentially high memory and processing overhead ● Online learning ● Perceptron-style algorithms ● Single pass over data ● Incremental updates ● Low memory and processing overheard Online learning addresses scale and non-stationarity

12 Evaluations ● Online learning for URL reputation ● Need for large, fresh training sets ● Comparing online algorithms

13 Need lots of fresh training data? SVM trained once SVM retrained daily ● Fresh data helps

14 Need lots of fresh training data? ● Fresh data helps ● More data helps SVM trained once on 2 weeks SVM w/ 2-week sliding window

15 Which online algorithm? ● Perceptron ● Stochastic Gradient Descent for Logistic Regression ● Confidence-Weighted Learning

16 Perceptron ● Convergence result: [Rosenblatt, 1958] + − − − − − − ● Update on each mistake: radius margin Number of mistakes ≤

17 Logistic Regression with SGD ● Log likelihood: where [Bottou, 1998] ● For every example: Proportional

18 Confidence-Weighted Learning ● Maintain Gaussian distribution over weight vector: [Dredze et al., 2008] [Crammer et al., 2009] ● Constrained problem: ● Closed-form update: Treat features differently

19 Which online algorithms? Perceptron

20 Which online algorithms? LR w/ SGD ● Proportional update helps Perceptron

21 Which online algorithms? ● Proportional update helps ● Per-feature confidence really helps Confidence-Weighted LR w/ SGD Perceptron

22 Batch... ● Fresh data helps ● More data helps Batch

23 Batch vs. Online Confidence-Weighted Batch ● Fresh data helps ● More data helps ● Online matches batch

24 Conclusion ● Detecting malicious URLs ● Relevant real-world problem ● Successful application of online learning ● Confidence-Weighted vs. Batch ● As accurate ● More adaptive ● Less resources ● Future work ● Scaling up for deployment