A Hybrid Model to Detect Malicious Executables Mohammad M. Masud Latifur Khan Bhavani Thuraisingham Department of Computer Science The University of Texas.

A Hybrid Model to Detect Malicious Executables Mohammad M. Masud Latifur Khan Bhavani Thuraisingham Department of Computer Science The University of Texas at Dallas

Presentation Outline: Overview; Background; Our approach; Feature description; Feature extraction; Experiments; Results; Conclusion.

Overview. Goal: detecting malicious executables. Contributions: a new model that combines binary, assembly, and library-call features; an efficient technique to retrieve assembly features from binary features; a scalable solution to n-gram feature extraction. Novelty: combining classical binary n-gram features with features extracted through reverse engineering.

Malicious Executables: Background. Malicious executables are programs that perform malicious activities, such as destroying data, stealing information, or clogging the network. They come in different architectures: independent programs (e.g., worms) and programs that depend on (piggyback on) a host program (e.g., viruses). Propagation mechanisms: mobile code propagates automatically through networks (worms); static code propagates when infected files are transferred (viruses).

Detecting Malicious Executables. Traditional way: signature-based detection. Problems: it requires human intervention, and it is not effective against zero-day attacks because it is too slow. Requirements: fast detection with no human intervention (automatic). Recent techniques: signature auto-generation (Earlybird, Autograph, Polygraph) and data-mining-based detection (Stolfo et al., Maloof et al.).

Our Approach. Design goals: to obtain a solution that (1) is free of signatures, (2) requires no human intervention, and (3) can detect new variants and/or zero-day attacks. Our "Hybrid Feature Retrieval" (HFR) model is based on data mining and meets all three design goals. Steps: (1) collection of training data (malicious and benign .exe files); (2) feature extraction and selection; (3) training with a classifier; (4) testing and detection.

Top-level Architecture (diagram). Training: training data (executables) -> feature extraction -> feature selection -> training (SVM) -> classifier. Testing: new executable -> feature extraction -> testing (SVM) -> infected? If no, keep; if yes, delete.

Features: binary n-grams; assembly instruction sequences (corresponding to the binary n-grams); DLL function calls.

Binary n-gram Features. Each binary executable is a 'string' of bytes. An n-gram of the binary is a sequence of n consecutive bytes. Example: a string of four bytes, "ab05ef23" (in hexadecimal), has the 1-grams "ab", "05", "ef", "23" (single bytes); the 2-grams "ab05", "05ef", "ef23" (2-byte sequences); and the 3-grams "ab05ef", "05ef23" (3-byte sequences).
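The sliding-window definition above can be sketched in a few lines of Python. The function name byte_ngrams is hypothetical and chosen only for illustration; it is not taken from the authors' Java implementation.

```python
def byte_ngrams(hex_string, n):
    """Return every run of n consecutive bytes, each rendered as lowercase hex."""
    data = bytes.fromhex(hex_string)
    # Slide a window of n bytes over the data, one byte at a time.
    return [data[i:i + n].hex() for i in range(len(data) - n + 1)]


# The four-byte string used on the slide.
for n in (1, 2, 3):
    print(n, byte_ngrams("ab05ef23", n))
```

A string of b bytes yields b - n + 1 n-grams, which is why the four-byte example has four 1-grams, three 2-grams, and two 3-grams.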

Binary n-gram Feature Extraction. Each binary executable is scanned. Each extracted n-gram is stored in a balanced binary search tree to avoid duplicates. Each n-gram's frequency of occurrence in the training data is also stored in the tree.

Binary n-gram Feature Extraction (contd...). Using an AVL tree (a balanced binary search tree), we ensure fast insertion and searching. Using disk I/O, we overcome memory limitations. Example (figure): the executables being scanned are 1: "abcdef", 2: "93abcd", 3: "dc0ef2", 4: "0ef7gh". At the current scan position, the AVL tree storing 2-grams and their frequencies holds (93ab, 1), (abcd, 2), (cdef, 1), (dc0e, 1).
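The counting step can be sketched as follows. This is not the authors' implementation: Python's standard library has no AVL tree, so a collections.Counter (a hash map) stands in for the balanced tree, and the disk-I/O part is not modeled. The name count_ngrams and the chars_per_byte parameter are assumptions; the slide's sample strings are treated as text with two characters per byte, because "0ef7gh" is not valid hexadecimal.

```python
from collections import Counter


def count_ngrams(executables, n, chars_per_byte=2):
    """Count how often each n-gram occurs across all executables.

    Each executable is a string with chars_per_byte characters per byte.
    A Counter stands in for the AVL tree: it also stores one entry per
    distinct n-gram together with its frequency.
    """
    width = n * chars_per_byte
    counts = Counter()
    for exe in executables:
        # Advance one byte at a time so every n-gram starts on a byte boundary.
        for i in range(0, len(exe) - width + 1, chars_per_byte):
            counts[exe[i:i + width]] += 1
    return counts


executables = ["abcdef", "93abcd", "dc0ef2", "0ef7gh"]
counts = count_ngrams(executables, n=2)
# Iterating in sorted order mimics an in-order traversal of the tree.
for gram in sorted(counts):
    print(gram, counts[gram])
```

Stopping the scan after the first 2-gram of the third executable reproduces the tree contents listed on the slide: 93ab with count 1, abcd with count 2, cdef with count 1, and dc0e with count 1.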

Feature Selection. Motivation: the total number of extracted n-grams may be very large (on the order of millions), and a classifier cannot be trained with so many features. We select the K best n-grams using the information gain criterion. The information gain of a binary attribute A on a collection of examples S is given by $\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, \mathrm{Entropy}(S_v)$, where Values(A) is the set of all possible values for attribute A and $S_v$ is the subset of S for which attribute A has value v. The selected binary features are called the "Binary Feature Set" (BFS).
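A minimal sketch of information-gain ranking for binary (present/absent) features, assuming the standard entropy-based definition and a two-class problem (malicious versus benign). The function names and the toy data are hypothetical.

```python
from math import log2


def entropy(pos, neg):
    """Entropy in bits of a collection with pos positive and neg negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * log2(p)
    return result


def information_gain(present, labels):
    """present[i]: feature occurs in example i; labels[i]: example i is malicious."""
    total = len(labels)
    gain = entropy(sum(labels), total - sum(labels))
    for value in (1, 0):  # Values(A) for a binary attribute
        subset = [lab for has, lab in zip(present, labels) if has == value]
        if subset:
            gain -= len(subset) / total * entropy(sum(subset), len(subset) - sum(subset))
    return gain


def select_top_k(feature_presence, labels, k):
    """Return the k feature names with the highest information gain."""
    ranked = sorted(feature_presence,
                    key=lambda f: information_gain(feature_presence[f], labels),
                    reverse=True)
    return ranked[:k]


labels = [1, 1, 0, 0]                      # two malicious, two benign (toy data)
feature_presence = {"abcd": [1, 1, 0, 0],  # perfectly separates the classes
                    "cdef": [1, 0, 1, 0]}  # carries no class information
print(select_top_k(feature_presence, labels, k=1))
```

With K = 500, as in the experimental setup, select_top_k would be called with k=500 on the full table of extracted n-grams.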

Assembly Features. An assembly feature is a sequence of assembly instructions. We call these features "Derived Assembly Features" (DAFs). Every DAF corresponds to a selected binary n-gram. Motivation for extracting DAFs: a binary n-gram may contain only partial information, whereas a DAF contains more complete information.

Assembly Feature Extraction. Disassemble all executables. For each selected binary n-gram Q: S ← all assembly instruction sequences in the disassembled executables corresponding to Q; DAF_Q ← the best assembly instruction sequence in S according to information gain.

Assembly Feature Extraction (Contd...). Example: let “ ” be a selected 4-gram (Q). The assembly instruction sequences (S) corresponding to Q are found in the disassembled executables, and DAF_Q is selected from these sequences using information gain.
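One way to read "assembly instruction sequences corresponding to Q" is: the consecutive instructions whose encoded bytes overlap an occurrence of the n-gram. The sketch below works under that assumption; the slides do not spell out the exact rule. The disassembly listing, the 4-gram, and the function name are hypothetical and are not the slide's own example.

```python
def covering_instructions(disassembly, ngram_hex):
    """Find instruction sequences whose bytes overlap an occurrence of ngram_hex.

    disassembly: list of (hex bytes of one instruction, instruction text),
    in address order, using lowercase hex.
    """
    stream = "".join(code for code, _ in disassembly)
    # Start offset (in hex characters) of each instruction within the stream.
    starts, pos = [], 0
    for code, _ in disassembly:
        starts.append(pos)
        pos += len(code)

    matches = []
    idx = stream.find(ngram_hex)
    while idx != -1:
        if idx % 2 == 0:  # keep only byte-aligned occurrences
            end = idx + len(ngram_hex)
            seq = tuple(text for (code, text), s in zip(disassembly, starts)
                        if s < end and s + len(code) > idx)
            matches.append(seq)
        idx = stream.find(ngram_hex, idx + 1)
    return matches


# Hypothetical x86 listing: a typical function prologue followed by ret.
listing = [("55", "push ebp"),
           ("8bec", "mov ebp, esp"),
           ("83ec08", "sub esp, 8"),
           ("c3", "ret")]
print(covering_instructions(listing, "8bec83ec"))
```

Here the 4-gram 8bec83ec spans all of the second instruction and part of the third, so the candidate sequence is "mov ebp, esp; sub esp, 8". Collecting such candidates over all disassembled executables gives the set S, from which DAF_Q is then chosen by information gain.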

DLL Function Call Features. DLL function call features are the names of system functions called from the executables, e.g., a call to GetProcAddress(). These features are extracted from the executable header. We extract all the DLL call features from the training data and select a subset using information gain.

Combining Features. Each feature is considered a 'binary' feature. We create a vector V of all selected features, where V[i] corresponds to the i-th feature. This vector is called the Hybrid Feature Set (HFS). For each executable E in the training data, we create a binary feature vector B corresponding to V, where B[i] is 1 if V[i] is present in E and 0 if V[i] is absent from E. We train a classifier using these vectors.
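Building B from V is a membership test per feature. A minimal sketch, with hypothetical feature names standing in for one binary n-gram, one derived assembly feature, and one DLL call:

```python
def feature_vector(selected_features, executable_features):
    """Return B, where B[i] is 1 if selected_features[i] occurs in the executable."""
    present = set(executable_features)  # set lookup keeps each test O(1)
    return [1 if feature in present else 0 for feature in selected_features]


# V: the hybrid feature set (illustrative values only).
V = ["abcd", "mov ebp, esp; sub esp, 8", "GetProcAddress"]
# Features found in one executable E; "93ab" was not selected, so it is ignored.
E = ["abcd", "93ab", "GetProcAddress"]
print(feature_vector(V, E))
```

Every executable is mapped to a vector of the same length as V, which is what lets a single classifier be trained on all three kinds of feature at once.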

Experiments. Collect real samples of malicious and normal executables. Extract and select features. Combine the features into the HFS. We also extract assembly n-gram features (sequences of n assembly instructions), called the Assembly Feature Set (AFS). Test the accuracy of each of the three kinds of feature set (BFS, AFS, HFS) using an SVM with three-fold cross-validation.

Data Set. There are two datasets, with the following distribution: Malicious instances are collected from … Benign instances are collected from Windows XP machines and other sources.

Experimental Setup. OS and hardware platform: Sun Solaris and Linux machines, 2 GHz, 4 GB. Disassembler: PEdisassem, which disassembles Windows Portable Executables. Feature extraction is implemented in Java (JDK 1.5). K = 500 (number of binary n-grams selected). Support vector machine tool: libsvm. SVM parameters: C-SVC with a polynomial kernel.

Results. HFS (Hybrid Feature Set) has the highest accuracy (best values are circled). AFS: Assembly Feature Set. BFS: Binary Feature Set. DLL features are not shown because DLL n-gram features perform poorly for n > 1, so we use only DLL 1-grams in the HFS.

Results (Contd...). HFS (Hybrid Feature Set) has the lowest false positive and false negative rates. AFS: Assembly Feature Set. BFS: Binary Feature Set.
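For reference, the three reported measures can be computed from predictions as sketched below. The slides do not define the rates, so this assumes the usual definitions: the false positive rate is the fraction of benign files flagged as malicious, and the false negative rate is the fraction of malicious files that are missed. The labels and predictions are toy values, not the experimental data.

```python
def error_rates(actual, predicted):
    """Accuracy, false positive rate, and false negative rate (1 = malicious)."""
    pairs = list(zip(actual, predicted))
    false_pos = sum(1 for a, p in pairs if not a and p)
    false_neg = sum(1 for a, p in pairs if a and not p)
    negatives = sum(1 for a in actual if not a)
    positives = len(actual) - negatives
    correct = sum(1 for a, p in pairs if bool(a) == bool(p))
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "false_positive_rate": false_pos / negatives if negatives else 0.0,
        "false_negative_rate": false_neg / positives if positives else 0.0,
    }


actual = [1, 1, 1, 0, 0, 0, 0, 0]     # toy ground truth
predicted = [1, 1, 0, 0, 0, 0, 1, 0]  # toy classifier output
print(error_rates(actual, predicted))
```

In this toy run one of five benign files is flagged and one of three malicious files is missed, giving an accuracy of 0.75.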

Results (Contd...). Receiver Operating Characteristic (ROC) curves. HFS has the best ROC curve (a better curve has a greater area under it).

Results (Contd...) HFS has the greatest Area Under the Curve
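The area under a ROC curve can be computed without drawing the curve: it equals the probability that a randomly chosen malicious example receives a higher score than a randomly chosen benign one, with ties counted as one half. The sketch below uses that rank-based identity; treating the SVM decision value as the score is an assumption, and the numbers are toy values.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores (1 = malicious)."""
    pos = [s for lab, s in zip(labels, scores) if lab]
    neg = [s for lab, s in zip(labels, scores) if not lab]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    # Count positive/negative pairs ranked correctly; a tie counts as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # perfect ranking
print(auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))  # one pair out of four misordered
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; a sort-based version would be preferable for large test sets.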

Conclusion. The Hybrid Feature Retrieval (HFR) model retrieves a novel combination of three different kinds of features. We have implemented an efficient, scalable solution to n-gram feature extraction in general. Our results are better than those of other techniques. Future work: handle obfuscation; operate online, in real time.

Thank you