URLDoc: Learning to Detect Malicious URLs using Online Logistic Regression Presented by: Mohammed Nazim Feroz, 11/26/2013

Motivation  Web services drive new opportunities for people to interact, but they also create new opportunities for criminals  Google detects about 300,000 malicious websites per month, a clear indication that criminals are exploiting these opportunities  Almost all online threats have one thing in common: they require the user to click on a hyperlink or type in a website address

Motivation  The user needs to perform sanity checks and assess the risk of visiting a URL  Performing such an evaluation might be impossible for a novice user  As a result, users often end up clicking links without paying close attention to the URLs, which leaves them vulnerable to malicious websites that exploit them

Introduction  The openness of the web creates opportunities for criminals to upload malicious content  Do techniques exist to prevent malicious content from entering the web?

Current Techniques  Security practitioners have developed techniques such as blacklisting to protect users from malicious websites  Although this approach has minimal overhead, it does not provide complete protection: only about 55% of malicious URLs are present in blacklists  Another drawback is that a malicious website does not appear on any blacklist during the window before its detection

Current Techniques  Security researchers have also done extensive work on detecting accounts on social networks that are used to spread malicious messages  This approach still does not provide thorough protection in settings such as social networks, where interaction happens in real time: a profile of malicious activity must first be built, and that process can take a considerable amount of time

Current Techniques  The authors of TokDoc use a method that decides, on a per-token basis, whether a token requires automatic healing  Their work uses n-grams and length as features for detecting malicious URLs  This research builds on their idea by supplementing a subset of their features with host-based features, since the latter have been shown to carry a wealth of useful information

Approach  URLDoc classifies URLs automatically based on lexical (textual) and host-based features  Scalable machine learning algorithms from Mahout are used to develop and test the classifier  Online learning is chosen over batch learning  The classifier achieves 93-97% accuracy, detecting a large number of malicious hosts with a modest false positive rate

Approach  If the predictor variables are correctly identified and the URL metadata is carefully derived, the machine learning algorithms can sift through tens of thousands of features  Online algorithms are preferred over batch-learning algorithms  Batch learning algorithms look at every example in the training dataset on every step before updating the weights of the classifier – a costly operation when the number of training examples is large

Approach  Online algorithms update the weights according to the gradient of the error with respect to a single training example  As a result, online algorithms can process datasets far more efficiently than batch algorithms

Problem Formulation  URL classification lends itself naturally to a binary classification problem  The target variable y(i) can take one of two possible values: malicious or benign  With k predictor variables over all feature categories, each URL is described by x1(i), …, xk(i), a k-dimensional feature vector characterizing the URL  The goal is to learn a function h(x) = y that maps the space of input values to the space of output values, so that h(x) is a good predictor for the corresponding value of y
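Concretely, for logistic regression (the model used in this work) h takes the standard form below, with the output read as the probability that a URL is malicious; this is textbook background rather than something stated on the slide:

h(x) = 1 / (1 + exp(-(w0 + w1*x1 + … + wk*xk)))

A URL is labeled malicious when h(x) >= 0.5 and benign otherwise.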

Problem Formulation  There are two main phases involved in building a classification system  The first phase creates the model (i.e. the function h(x)) produced by the learning algorithm  The second phase uses that model to assign new data from the test dataset to its predicted target class  The selection of the training dataset and its predictor variables, the target classes, and the learning algorithm are vital in the first phase  Predicted labels are compared with known answers to evaluate the classifier
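A minimal sketch of these two phases, using scikit-learn's SGDClassifier with logistic loss as a stand-in for the Mahout classifier actually used in this work; the synthetic data, the 90/10 split, and all names below are illustrative assumptions:

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder data standing in for hashed URL feature vectors and labels (1 = malicious, 0 = benign)
rng = np.random.default_rng(0)
X = rng.random((1000, 20))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# 90/10 training/test split, mirroring one of the splits reported in the results
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# Phase 1: learn the model h(x) from the training dataset
model = SGDClassifier(loss="log_loss")  # logistic regression trained with SGD; use loss="log" on older scikit-learn
model.fit(X_train, y_train)

# Phase 2: apply the model to held-out data and compare predictions with the known answers
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))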

Overview of Features  Lexical features  These features have values of both types: binary and continuous  These features include  Length of the URL  Number of dots in the URL  Tokens present in the hostname, primary domain, and path parts of a URL  Tokens in the hostname are further characterized as bigrams  Bigrams are able to capture patterns in character strings that are permuted randomly and occur in certain combinations  Example: for the hostname depts.ttu.edu, the bigrams are depts ttu and ttu edu
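A minimal sketch of this lexical feature extraction (illustrative only; the token delimiters and the exact feature names are assumptions, and the example URL is just for demonstration):

import re
from urllib.parse import urlparse

def lexical_features(url):
    parsed = urlparse(url if "://" in url else "http://" + url)
    host, path = parsed.netloc, parsed.path
    # Continuous features
    features = {"url_length": len(url), "num_dots": url.count(".")}
    # Binary (presence) features: tokens from the hostname and path
    host_tokens = [t for t in host.split(".") if t]
    path_tokens = [t for t in re.split(r"[/\-_.?=&]", path) if t]
    for tok in host_tokens + path_tokens:
        features["token_" + tok] = 1
    # Hostname token bigrams, e.g. "depts ttu", "ttu edu"
    for a, b in zip(host_tokens, host_tokens[1:]):
        features["bigram_" + a + " " + b] = 1
    return features

print(lexical_features("http://depts.ttu.edu/somepath/index.html"))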

Overview of Features  Host-Based features  IP address of the URL – A Record  IP address of the Mail Exchanger – MX Record  IP address of the Name Server – NS Record  PTR Record  AS number  IP Prefix
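A rough sketch of how such host-based attributes could be gathered with the dnspython library; the library choice and error handling are assumptions (the slides do not name the lookup tooling), and AS number / IP prefix lookups would require an additional data source such as WHOIS or BGP routing tables:

import dns.resolver, dns.reversename  # pip install dnspython

def host_features(hostname):
    feats = {}
    try:
        a_records = [r.to_text() for r in dns.resolver.resolve(hostname, "A")]
        feats["a_records"] = a_records
        # PTR record of the first resolved IP address
        ptr_name = dns.reversename.from_address(a_records[0])
        feats["ptr"] = [r.to_text() for r in dns.resolver.resolve(ptr_name, "PTR")]
    except Exception:
        pass
    for rtype, key in [("MX", "mx_records"), ("NS", "ns_records")]:
        try:
            feats[key] = [r.to_text() for r in dns.resolver.resolve(hostname, rtype)]
        except Exception:
            pass
    return feats

print(host_features("example.com"))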

Overview of Features  Malicious websites have exhibited a pattern of being hosted in a particular “bad” portion of the Internet  Example: McColo provided hosting for major botnets, which in turn were responsible for sending 41% of the world’s spam just before McColo’s takedown in November 2008, and McColo was identifiable by its own AS number  These portions of the internet can be characterized on a regular basis by retraining on the predictor variables  This allows the classifier to keep track of concept drift

Online Logistic Regression with SGD  Logistic regression is a very flexible algorithm, as it allows the predictor variables to be of both types: continuous and binary  Mahout greatly helps in the learning process by choosing an optimum learning rate, thus allowing the classification system to converge to the global minimum
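The weight update that this learning rate controls has the standard per-example form (textbook background, not spelled out on the slide): for a training example (x(i), y(i)) with y(i) in {0, 1},

w ← w + η (y(i) − h(x(i))) x(i)

where h is the logistic function defined earlier and η is the learning rate.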

Online Logistic Regression with SGD  Compared to batch learning, online learning is usually much faster, adapts to changes continuously, and copes much better when the training and test datasets are large  Support Vector Machines were considered but not chosen, since they take longer to train than Online Logistic Regression  Online Logistic Regression converges more quickly if the malicious and benign URLs in the training dataset are presented in random order
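A minimal numpy sketch of this online training loop, shown only to make the update concrete (the work itself uses Mahout's OnlineLogisticRegression rather than this code; the fixed learning rate and the tiny synthetic dataset are assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_online(X, y, lr=0.1, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):   # random presentation order helps convergence
            p = sigmoid(w @ X[i])           # predicted probability of "malicious"
            w += lr * (y[i] - p) * X[i]     # update on a single training example
    return w

# Tiny synthetic example: two features plus a constant bias column
rng = np.random.default_rng(1)
X = np.hstack([rng.random((200, 2)), np.ones((200, 1))])
y = (X[:, 0] + X[:, 1] > 1.0).astype(float)
w = train_online(X, y)
preds = (sigmoid(X @ w) >= 0.5).astype(float)
print("training accuracy:", (preds == y).mean())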

Feature Vector  Feature hashing is used to encode the raw feature data into feature vectors  In this approach, a reasonable size (i.e. dimension) is picked for the feature vector and the data is hashed into feature vectors of the chosen size  After careful consideration of the datasets, the feature vectors used in this research have 100,000 dimensions
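A minimal sketch of this hashing trick (the hash function and the additive handling of collisions are assumptions made for illustration; the work itself relies on Mahout's feature encoders):

import hashlib

DIM = 100_000  # dimension chosen for the feature vectors

def hash_index(feature_name):
    digest = hashlib.md5(feature_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % DIM

def to_hashed_vector(features):
    """Map a {feature_name: value} dict into a sparse {index: value} vector."""
    vec = {}
    for name, value in features.items():
        idx = hash_index(name)
        vec[idx] = vec.get(idx, 0.0) + float(value)  # colliding features simply add up
    return vec

print(to_hashed_vector({"url_length": 37, "num_dots": 2, "token_ttu": 1, "bigram_depts ttu": 1}))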

Feature Vector Example  The data is encoded into the feature vector as continuous, categorical, word-like, and text-like features using the Mahout API

Results  Classifier performance for the 90/10 and 80/20 training/test dataset splits (table/plot not captured in this transcript)

Results  Classifier performance for the 50/50 training/test dataset split and varying benign:malicious ratios (table/plot not captured in this transcript)

Other Approaches Attempted  Term Frequency – Inverse Document Frequency (TF-IDF)  A bag-of-words approach was used and a term (lexical features) – document (URL) matrix was created  Online Logistic Regression turned out not to be affected by such word weighting  Clustering  The URLs are viewed as a set of vectors in a vector space  Cosine similarity was used as the similarity measure between URLs (a small sketch of both ideas follows below)  This research focused on classification rather than clustering since the target classes of the URLs were known – clustering is known to be useful when the target classes are unknown
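A minimal sketch of these two attempted approaches, using scikit-learn in place of whatever tooling was actually used; the example URLs and the tokenization pattern are illustrative assumptions:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Treat each URL as a "document" of lexical tokens
urls = [
    "http://depts.ttu.edu/somepath/index.html",
    "http://login-secure-paypa1.example.biz/update/account",
    "http://www.example.org/news/today",
]
vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z0-9]+")  # split on non-alphanumeric characters
tfidf_matrix = vectorizer.fit_transform(urls)                # term-document matrix with TF-IDF weights

# Pairwise cosine similarity between URL vectors
print(cosine_similarity(tfidf_matrix))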

Future Work  Study the various features extensively and use only those with the highest contributions – also add new features that would help improve classification  Try algorithms that can benefit from parallelization

Summary  A reliable framework for the classification of URLs is built  A supervised learning method is used to learn the characteristics of both malicious and benign URLs and to classify them in real time  The applicability and usefulness of Mahout for the URL classification task is demonstrated, and the benefits of using an online setting over a batch setting are illustrated: the online setting enabled learning new trends in the characteristics of URLs over time

Questions?