Unveiling Zeus Automated Classification of Malware Samples Abedelaziz Mohaisen Omar Alrawi Verisign Inc, VA, USA Verisign Labs, VA, USA

Slides:

Advertisements

Similar presentations

Data Mining For Credit Card Fraud: A Comparative Study

Advertisements

ECG Signal processing (2)

My name is Dustin Boswell and I will be presenting: Ensemble Methods in Machine Learning by Thomas G. Dietterich Oregon State University, Corvallis, Oregon.

Decision Tree Approach in Data Mining

Malware Identification and Classification

Ensemble Methods An ensemble method constructs a set of base classifiers from the training data Ensemble or Classifier Combination Predict class label.

Indian Statistical Institute Kolkata

Classification and Decision Boundaries

Multiple Criteria for Evaluating Land Cover Classification Algorithms Summary of a paper by R.S. DeFries and Jonathan Cheung-Wai Chan April, 2000 Remote.

Sparse vs. Ensemble Approaches to Supervised Learning

Mouse Movement Biometrics, Pace University, Fall'20071 Mouse Movement Biometrics Fall 2007 Capstone -Team Members Rafael Diaz Michael Lampe Nkem Ajufor.

1 Application of Metamorphic Testing to Supervised Classifiers Xiaoyuan Xie, Tsong Yueh Chen Swinburne University of Technology Christian Murphy, Gail.

Self-Supervised Segmentation of River Scenes Supreeth Achar *, Bharath Sankaran ‡, Stephen Nuske *, Sebastian Scherer *, Sanjiv Singh * * ‡

1 MACHINE LEARNING TECHNIQUES IN IMAGE PROCESSING By Kaan Tariman M.S. in Computer Science CSCI 8810 Course Project.

Distributed Representations of Sentences and Documents

05/06/2005CSIS © M. Gibbons On Evaluating Open Biometric Identification Systems Spring 2005 Michael Gibbons School of Computer Science & Information Systems.

Review Rong Jin. Comparison of Different Classification Models  The goal of all classifiers Predicating class label y for an input x Estimate p(y|x)

Beyond Anti-Virus by Dan Keller Fred Cohen- Computer Scientist “there is no algorithm that can perfectly detect all possible computer viruses”

Automated malware classification based on network behavior

CISC Machine Learning for Solving Systems Problems Presented by: Akanksha Kaul Dept of Computer & Information Sciences University of Delaware SBMDS:

Combining Supervised and Unsupervised Learning for Zero-Day Malware Detection © 2013 Narus, Inc. Prakash Comar 1 Lei Liu 1 Sabyasachi (Saby) Saha 2 Pang-Ning.

Niels Provos and Panayiotis Mavrommatis Google Google Inc. Moheeb Abu Rajab and Fabian Monrose Johns Hopkins University 17 th USENIX Security Symposium.

John P., Fang Yu, Yinglian Xie, Martin Abadi, Arvind Krishnamurthy University of California, Santa Cruz USENIX SECURITY SYMPOSIUM, August, 2010 John P.,

Data Mining Joyeeta Dutta-Moscato July 10, Wherever we have large amounts of data, we have the need for building systems capable of learning information.

1 All Your iFRAMEs Point to Us Mike Burry. 2 Drive-by downloads Malicious code (typically Javascript) Downloaded without user interaction (automatic),

Logistic Regression L1, L2 Norm Summary and addition to Andrew Ng’s lectures on machine learning.

Printing: This poster is 48” wide by 36” high. It’s designed to be printed on a large-format printer. Customizing the Content: The placeholders in this.

The identification of interesting web sites Presented by Xiaoshu Cai.

Detecting Semantic Cloaking on the Web Baoning Wu and Brian D. Davison Lehigh University, USA WWW 2006.

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Data mining for credit card fraud: A comparative study.

Introduction to machine learning and data mining 1 iCSC2014, Juan López González, University of Oviedo Introduction to machine learning Juan López González.

1 Figure 4-16: Malicious Software (Malware) Malware: Malicious software Essentially an automated attack robot capable of doing much damage Usually target-of-opportunity.

AccessMiner Using System- Centric Models for Malware Protection Andrea Lanzi, Davide Balzarotti, Christopher Kruegel, Mihai Christodorescu and Engin Kirda.

Automated Classification and Analysis of Internet Malware M. Bailey J. Oberheide J. Andersen Z. M. Mao F. Jahanian J. Nazario RAID 2007 Presented by Mike.

Combining multiple learners Usman Roshan. Bagging Randomly sample training data Determine classifier C i on sampled data Goto step 1 and repeat m times.

Dealing with Malware By: Brandon Payne Image source: TechTips.com.

By Gianluca Stringhini, Christopher Kruegel and Giovanni Vigna Presented By Awrad Mohammed Ali 1.

CISC Machine Learning for Solving Systems Problems Presented by: Sandeep Dept of Computer & Information Sciences University of Delaware Detection.

Prediction of Influencers from Word Use Chan Shing Hei.

CISC Machine Learning for Solving Systems Problems Presented by: Ashwani Rao Dept of Computer & Information Sciences University of Delaware Learning.

Identification of amino acid residues in protein-protein interaction interfaces using machine learning and a comparative analysis of the generalized sequence-

CISC Machine Learning for Solving Systems Problems Presented by: Satyajeet Dept of Computer & Information Sciences University of Delaware Automatic.

LOGOPolyUnpack: Automating the Hidden-Code Extraction of Unpack-Executing Malware Royal, P.; Halpin, M.; Dagon, D.; Edmonds, R.; Wenke Lee; Computer Security.

USE RECIPE INGREDIENTS TO PREDICT THE CATEGORY OF CUISINE Group 7 – MEI, Yan & HUANG, Chenyu.

Ensemble Learning for Low-level Hardware-supported Malware Detection

Advanced Malware Detection Group 8: Alex Finkelstein, Josh Suess, Dom Amos, Mike Hite, Kevin Hao.

What is Web Information retrieval from web Search Engine Web Crawler Web crawler policies Conclusion How does a web crawler work Synchronization Algorithms.

GENDER AND AGE RECOGNITION FOR VIDEO ANALYTICS SOLUTION PRESENTED BY: SUBHASH REDDY JOLAPURAM.

Random Forests Ujjwol Subedi. Introduction What is Random Tree? ◦ Is a tree constructed randomly from a set of possible trees having K random features.

Supervised Machine Learning: Classification Techniques Chaleece Sandberg Chris Bradley Kyle Walsh.

Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:

Final Report (30% final score) Bin Liu, PhD, Associate Professor.

SUPERVISED AND UNSUPERVISED LEARNING Presentation by Ege Saygıner CENG 784.

Feasibility of Using Machine Learning Algorithms to Determine Future Price Points of Stocks By: Alexander Dumont.

DATA MINING TECHNIQUES (DECISION TREES ) Presented by: Shweta Ghate MIT College OF Engineering.

 Using Touchloggers To Build User Profiles Through Machine Learning Craig Dezangle.

Cosc 4765 Antivirus Approaches. In a Perfect world The best solution to viruses and worms to prevent infected the system –Generally considered impossible.

Learning to Detect and Classify Malicious Executables in the Wild by J

Harvesting Runtime Values in Android Applications That Feature Anti-Analysis Techniques Presented by Vikraman Mohan.

Source: Procedia Computer Science（2015）70:

BotCatch: A Behavior and Signature Correlated Bot Detection Approach

Recognition using Nearest Neighbor (or kNN)

Categorizing networks using Machine Learning

Machine Learning Week 1.

The Assistive System Progress Report 2 Shifali Kumar Bishwo Gurung

iSRD Spam Review Detection with Imbalanced Data Distributions

RHMD: Evasion-Resilient Hardware Malware Detectors

MAS 622J Course Project Classification of Affective States - GP Semi-Supervised Learning, SVM and kNN Hyungil Ahn

CAMCOS Report Day December 9th, 2015 San Jose State University

QoI: Assessing Participation in Threat Information Sharing

Presentation transcript:

Unveiling Zeus Automated Classification of Malware Samples Abedelaziz Mohaisen Omar Alrawi Verisign Inc, VA, USA Verisign Labs, VA, USA CMSC 691 Advanced Malware Analysis Presented by Rajesh Avula

Classification Malware family classification is an age old problem that many AntiVirus (AV) companies have tackled. There are two common techniques used for classification, 1.Signature based 2.Behavior based Signature based classification uses a common sequence of bytes that appears in the binary code to identify and detect a family of malware. Behavior based classification uses artifacts created by malware during execution for identification. In this paper we report on a unique dataset we obtained from operations and classified using several machine learning techniques using the behavior based approach. Our main class of malware we are interested in classifying is the popular Zeus malware. For its classification we identify 65 features that are unique and robust for identifying malware families

Zeus malware (banking Trojan) The Zeus malware infects the system by writing a copy of itself to the APPDATA folder using a randomly generated file name. The stolen data is stored under the same directory, APPDATA, encrypted. The Zeus banking Trojan hooks several important Windows APIs to intercept data being sent from the browser and to modify pages seen by the victim. Because Zeus is capable of adding fields in web forms to collect additional information from the victim when visiting banking site. The configuration file can contain a list of backup command-and control servers, link to an updated version of Zeus, list of targeted websites, and list of HTML and JavaScript to be injected in the targeted websites.

Machine Learning algorithms Support Vector Machine Logistic Regression(L1 and L2 norm) Decision Tree K-Nearest Neighbor

Support Vector Machine(SVM) kernel trick Support Vector Classification: (also known as support vector machine; SVM) is a supervised deterministic binary classification algorithm that assigns a label to input determining which of two classes of the output it belongs to. Given a training set of samples (i.e., examples), each of which is marked as belonging to one of two categories, the algorithm builds a model that assigns each of the samples into one category or the other.

Logistic Regression same as the SVM, logistic regression allows us to predict an outcome, such as label, from a set of variables. The goal of logistic regression is to correctly predict the category of outcome for individual cases using the best model. In our experiments, we use both L1 regularization and L2 regularization In this study, we use both to identify the better one for the given dataset size and set of features.

K-Nearest Neighbor

Decision Tree It is a model used for predicting the label by mapping observations about the sample to the conclusions about its target value. In the training part, the samples are used to create a model in which a boundary is created for each label based on the features. In the test part of the algorithm, the decision model created earlier is used for identifying which label is associated with the sample based on how it is fitted on the classification tree.

Raw and Vector Features Extraction The dataset is a set of 1,980 sample of the Zeus Banking Trojan. We ran these 1,980 samples through our automated malware analysis system auto-mal. The system auto-mal is a virtual machine (VM) based system that is used to run samples of malware and capture behavior characteristics of that sample.

Dataset and Labeling Sample Labeling The Zeus Banking Trojan samples have been identified by hand and collected over time by analysts This process can be time consuming. At average, a previously unseen malware sample (not necessarily Zeus) can take more than 10 hours to manually characterize by experts. The data sources are from various AV vendors that we have partnered with for sharing malware samples. The malware feed is delivered with no AV signatures associated with samples. We run Yara signatures on the malware feed to identify malware of interest that we can feed into our automated malware analysis system. Training and Testing data The Zeus data set is split up into 2 different data sets one for learning and one for testing. The learning data set contains 1001 samples of Zeus and 1000 samples of other malware that is picked at random from their malware database collection. The testing set contains 979 samples of Zeus and 1000 samples of non Zeus malware and that is not found in the learning data set.

Results For the KNN classifier the number of nearest neighbors are set to 980 The false positive error (false alarm; for the class label Zeus, for example) measures the number of samples marked as Zeus, while they are in reality not Zeus. The false negative error (for the Zeus family) measures the number of samples that are marked as non-Zeus, while they are in fact Zeus.

Challenging problems when using machine learning techniques for classifying data, particularly when using supervised learning techniques, is the choice of the training set. To understand how results are robust to the initial training set, repeat the experiment by flipping the test and training datasets.

Thank You