Learning to Detect and Classify Malicious Executables in the Wild by J.Z. Kolter and M.A. Maloof

Presentation transcript:

Learning to Detect and Classify Malicious Executables in the Wild
by J.Z. Kolter and M.A. Maloof
Slides by Rakesh Verma
Machine Learning for Security - capex.cs.uh.edu 4/23/15

Outline
- Introduction
- Selected previous work
- Data collection
- Experimental design
- Experimental results
- Conclusion

Introduction
- Malware can cause harm or subvert a system's intended function
- Malware can be classified into many categories; the authors consider three: viruses, worms, and Trojan horses
- The authors use machine learning and data mining techniques to detect and classify malicious executables

Three main contributions
- Show how to detect and classify malicious executables using text classification ideas with machine learning
- Present empirical results from an extensive study of learning methods for detecting and classifying malicious executables
- Show that the methods achieve high detection rates even on new malicious executables

Several learning methods
- Implemented in the Waikato Environment for Knowledge Analysis (WEKA)
  - Nearest neighbors (IBk)
  - Naive Bayes
  - Support vector machine (SVM)
  - Decision trees (J48)
- Used the AdaBoost.M1 algorithm (Freund and Schapire, 1996), as implemented in WEKA
  - Boosted SVMs, J48, and naive Bayes
  - Boosting nearest neighbors was too expensive

Selected Previous Work
- Schultz et al. used methods such as naive Bayes to detect malware
- Three feature extraction methods
  - Binary profiling: list of DLLs, function calls from DLLs, and number of distinct system calls from each DLL
  - String sequences: the UNIX strings command extracts the printable strings in an object or binary file
  - Hex dumps: similar to the UNIX octal dump (od -x) command; prints the contents of an executable file as a hexadecimal sequence
- Each method was paired with a single learning algorithm
- Five-fold cross-validation

Data Collection
- Gathered data in early 2003
- Benign executables: 1971, from
  - Windows 2000 and XP operating systems
  - SourceForge
  - download.com
- Malicious executables: 1651, from
  - the Web site VX Heavens
  - MITRE Corporation, the sponsor of this project
- After the experiments, obtained 291 new malicious executables from VX Heavens

Data Collection
- Used the hexdump utility to convert each executable into hexadecimal codes in ASCII format
- Converted the codes into 4-grams (why?) by combining each four-byte sequence into a single term
  - Example: ab 12 bc 34 de 56 becomes ab12bc34, 12bc34de, and bc34de56
- Selected the top 500 4-grams from the training data
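The 4-gram construction above can be sketched in a few lines of Python. This is only an illustration of the sliding-window idea, not the authors' tooling; the function names byte_ngrams and ngrams_from_bytes are made up for this example.

```python
def byte_ngrams(hex_dump: str, n: int = 4) -> list[str]:
    """Return the overlapping byte n-grams of a whitespace-separated hex dump."""
    byte_tokens = hex_dump.split()
    # Slide a window of n bytes over the sequence, one byte at a time.
    return [
        "".join(byte_tokens[i:i + n])
        for i in range(len(byte_tokens) - n + 1)
    ]


def ngrams_from_bytes(data: bytes, n: int = 4) -> set[str]:
    """Return the set of distinct byte n-grams (as hex strings) in raw file contents."""
    return {data[i:i + n].hex() for i in range(len(data) - n + 1)}


# The example from the slide: six bytes yield three overlapping 4-grams.
print(byte_ngrams("ab 12 bc 34 de 56"))
# ['ab12bc34', '12bc34de', 'bc34de56']
```

Because the window advances one byte at a time, a file of b bytes yields at most b - 3 distinct 4-grams, which helps explain how a few thousand executables can produce many millions of distinct terms.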

Experimental Design
- Evaluated the methods using ten-fold cross-validation
- Conducted ROC analysis for each method
- Three experimental studies
  - Pilot study to determine the size of words and n-grams, and the number of n-grams relevant for prediction
  - Applied all of the classification methods to a small collection of executables
  - Applied the methodology to a larger collection of executables
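As a reminder of what ten-fold cross-validation involves, here is a minimal sketch that splits example indices into ten folds and yields each fold once as the test set. It is a plain, unstratified split with a hypothetical helper name (k_fold_indices); whether the original experiments stratified the folds is not stated in these slides.

```python
import random


def k_fold_indices(n_examples: int, k: int = 10, seed: int = 0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# Example: 1971 benign + 1651 malicious executables = 3622 examples.
for train_idx, test_idx in k_fold_indices(3622, k=10):
    # Train on train_idx, then evaluate on test_idx; every example is tested once.
    pass
```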

Pilot Studies
- Sequential pilot studies to determine three parameters
  - The number of n-grams
  - The n for n-grams
  - The size of words
- Extracted bytes from 476 malicious and 561 benign executables and produced n-grams for n = 4
- Selected the best 10, 20, ..., 100, 200, ..., 1000, 2000, ..., 10000 n-grams
- Selecting 500 n-grams gave the best results
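The candidate feature counts listed above step by 10 up to 100, by 100 up to 1000, and by 1000 up to 10000. A short sketch of that grid, with candidate_counts as an illustrative name:

```python
# Candidate numbers of n-grams tried in the pilot study:
# 10, 20, ..., 100, 200, ..., 1000, 2000, ..., 10000.
candidate_counts = (
    list(range(10, 100, 10))            # 10 .. 90
    + list(range(100, 1000, 100))       # 100 .. 900
    + list(range(1000, 10001, 1000))    # 1000 .. 10000
)

print(len(candidate_counts))  # 28
```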

Pilot Studies
- Fixed the number of n-grams at 500 and varied n, the n-gram size
  - Evaluated the same methods for n = 1, 2, ..., 10
  - n = 4 gave the best results
- Varied the size of the words (one byte, two bytes, etc.)
  - Single bytes gave better results

Feature Selection Details
- Formed training examples from the n-grams extracted from the executables
- Treated each n-gram as a Boolean attribute (present or absent)
- Selected the most relevant attributes by computing the information gain (IG) of each, where C_i denotes the ith class and v_j the value of the jth attribute
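In the notation of this slide, information gain is commonly written as IG(j) = sum over v_j in {0, 1} and over classes C_i of P(v_j, C_i) log [ P(v_j, C_i) / (P(v_j) P(C_i)) ]. The transcript gives only the symbol definitions, so this form is assumed here. The sketch below estimates the probabilities from counts and ranks n-grams by IG. The helper names and the base-2 logarithm are choices made for this illustration, and the base does not affect the ranking.

```python
import math
from collections import Counter


def information_gain(attribute_values, class_labels):
    """IG of one attribute: sum of P(v, c) * log2(P(v, c) / (P(v) * P(c)))."""
    n = len(class_labels)
    joint = Counter(zip(attribute_values, class_labels))
    value_counts = Counter(attribute_values)
    class_counts = Counter(class_labels)
    ig = 0.0
    # Only observed (value, class) pairs appear in the counter; unobserved
    # pairs have P(v, c) = 0 and contribute nothing to the sum.
    for (v, c), count in joint.items():
        p_vc = count / n
        ig += p_vc * math.log2(p_vc / ((value_counts[v] / n) * (class_counts[c] / n)))
    return ig


def top_k_ngrams(presence, class_labels, k=500):
    """Rank n-grams by IG; presence maps each n-gram to one Boolean per executable."""
    ranked = sorted(
        presence,
        key=lambda gram: information_gain(presence[gram], class_labels),
        reverse=True,
    )
    return ranked[:k]


labels = ["malicious", "malicious", "benign", "benign"]
presence = {
    "ab12bc34": [1, 1, 0, 0],  # appears only in malicious files: IG = 1 bit
    "00000000": [1, 0, 1, 0],  # independent of the class: IG = 0
}
print(top_k_ngrams(presence, labels, k=1))  # ['ab12bc34']
```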

Small Collection Experiment
- The executables produced 68,744,909 distinct n-grams
- Results reported as ROC curves and the areas under these curves (AUC), with 95% confidence intervals
- Boosted methods performed well
- Naive Bayes did not perform as well
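For readers unfamiliar with AUC: it equals the probability that a randomly chosen malicious executable receives a higher score than a randomly chosen benign one. The sketch below computes it directly from that definition. The function name roc_auc and the example scores are invented for illustration, and the confidence intervals reported in the study are not reproduced here.

```python
def roc_auc(malicious_scores, benign_scores):
    """Area under the ROC curve via pairwise comparison (ties count as 1/2)."""
    wins = 0.0
    for m in malicious_scores:
        for b in benign_scores:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5
    return wins / (len(malicious_scores) * len(benign_scores))


# Eight of the nine (malicious, benign) pairs are ordered correctly.
auc = roc_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.1])
print(round(auc, 3))  # 0.889
```

The double loop is quadratic in the number of examples; it is meant to show the definition, not to be efficient.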

Larger Collection Experiment
- Consisted of 1971 benign executables and 1651 malicious executables
- Produced over 255 million distinct 4-grams
- Results reported as ROC curves and the areas under these curves, with 95% confidence intervals
- Boosted J48 outperformed all other methods

Classifying Executables by Payload Function
- Classify malicious executables based on the function of their payload
- Goal: reduce the effort needed to characterize previously undiscovered malicious executables
- Results for three functional categories
  - Opened a backdoor
  - Mass-mailed
  - Executable virus
- One-versus-all classification
- Results not as good as for detection (refer to the paper)
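One-versus-all classification turns the multi-category problem into one binary problem per category: executables with that payload function are positives and all others are negatives. A minimal sketch of the label construction follows. The category strings, the helper name, and the assumption that one executable may carry several functions are illustrative, not taken from the slides.

```python
def one_versus_all_labels(payloads, categories):
    """Build one Boolean label vector per category (True = has that function)."""
    return {
        category: [category in functions for functions in payloads]
        for category in categories
    }


categories = ["backdoor", "mass-mailer", "virus"]
# Hypothetical payload annotations, one set per malicious executable.
payloads = [{"backdoor"}, {"mass-mailer", "backdoor"}, {"virus"}, set()]

labels = one_versus_all_labels(payloads, categories)
print(labels["backdoor"])  # [True, True, False, False]
```

A separate detector is then trained on each label vector, so three categories yield three binary classifiers.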

Evaluating Real-world, Online Performance
- Compared actual detection rates on the 291 new malicious executables (no training on these)
- Selected three desired false-positive rates: 0.01, 0.05, and 0.1
- Boosted J48 detected about 98% of the new malicious executables at a desired false-positive rate of 0.05
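One way to operate a detector at a desired false-positive rate is to pick a score threshold from benign examples and then measure how many new malicious files score above it. The sketch below illustrates that idea only. How the thresholds were actually chosen in the study is not described in these slides, and the function names and scores are hypothetical.

```python
def threshold_for_fpr(benign_scores, target_fpr):
    """Return a threshold t such that at most target_fpr of benign scores exceed t."""
    ranked = sorted(benign_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # benign files we may wrongly flag
    if allowed >= len(ranked):
        return float("-inf")  # every file may be flagged
    # Flag only scores strictly above the (allowed + 1)-th highest benign score.
    return ranked[allowed]


def detection_rate(malicious_scores, threshold):
    """Fraction of malicious examples whose score exceeds the threshold."""
    return sum(score > threshold for score in malicious_scores) / len(malicious_scores)


benign_scores = list(range(100))            # made-up scores 0 .. 99
t = threshold_for_fpr(benign_scores, 0.05)  # 94: only scores 95 .. 99 are flagged
print(t, detection_rate([99, 97, 50, 96], t))  # 94 0.75
```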

Conclusion
- Detected and classified unknown malicious executables using machine learning, data mining, and text classification
- Detecting malicious executables: boosted J48 produced the best detector, with an area under the ROC curve of 0.996
- Classifying malicious executables by payload function: boosted J48 produced the best detectors, with areas under the ROC curve around 0.9

References
- J.Z. Kolter and M.A. Maloof, Learning to Detect and Classify Malicious Executables in the Wild, JMLR 7 (2006), 2721-2744
- Some slides adapted from: mmnet.iis.sinica.edu.tw/botnet/file/20100719/20100719_2.ppt