1 CISC 879 - Machine Learning for Solving Systems Problems
Presented by: Ashwani Rao, Dept of Computer & Information Sciences, University of Delaware
Learning to Detect and Classify Malicious Executables in the Wild
J. Zico Kolter, Marcus A. Maloof

2 Introduction
- Machine learning and data mining applied to identifying malicious code
- What is malicious code?
- Why not antivirus suites?
- Training set: 1971 benign and 1651 malicious executables
- Features: byte-code n-grams extracted from each executable; malicious executables are also classified by the function of their payload
- Learning algorithms: naive Bayes, SVMs, decision trees, and boosting

3 Goals of the Research Paper
- Show how established methods can be used to detect and classify malicious executables
- Present empirical results from an extensive study of inductive methods for detection and classification
- Show that these methods achieve high detection rates on new, previously unseen executables

4 Related Work
- Lo et al., 1995; Kephart et al., 1995; Tesauro et al., 1996; Schultz et al., 2001
- Lo et al., 1995: analysis of several programs
- Schultz et al., 2001: used data mining to detect malicious executables, with three kinds of features:
  - Binary profiling (RIPPER rule learner)
  - String sequences (naive Bayes)
  - Hex dumps (six naive Bayes classifiers)

5 Data Collection and Classification Methods
- 1971 benign and 1651 malicious executables in Windows PE format
- N-grams: each overlapping four-byte sequence is combined into a single term. For example, for the bytes ff 00 ab 3e 12 b3 the n-grams are ff00ab3e, 00ab3e12, and ab3e12b3.
- Each distinct n-gram is treated as an attribute (present or absent in an executable)
- The most relevant attributes (n-grams) are selected using information gain, also called average mutual information
- The 500 most relevant n-grams were kept
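The n-gram extraction and information-gain ranking described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function names (byte_ngrams, information_gain, top_ngrams) are hypothetical, and it assumes each n-gram is scored as a present/absent attribute against the class label.

```python
import math
from collections import Counter


def byte_ngrams(data, n=4):
    """Return the distinct overlapping n-byte sequences of `data` as hex strings."""
    return {data[i:i + n].hex() for i in range(len(data) - n + 1)}


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(present, labels):
    """H(class) - H(class | attribute) for one present/absent attribute."""
    base = entropy(Counter(labels).values())
    conditional = 0.0
    for value in (True, False):
        subset = [lab for p, lab in zip(present, labels) if p == value]
        if subset:
            conditional += len(subset) / len(labels) * entropy(Counter(subset).values())
    return base - conditional


def top_ngrams(samples, labels, k=500, n=4):
    """Rank every n-gram seen in `samples` by information gain and keep the top k."""
    grams_per_file = [byte_ngrams(s, n) for s in samples]
    vocabulary = set().union(*grams_per_file)
    scores = {
        g: information_gain([g in grams for grams in grams_per_file], labels)
        for g in vocabulary
    }
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]
```

For the slide's example, byte_ngrams(bytes.fromhex("ff00ab3e12b3")) yields the three terms ff00ab3e, 00ab3e12, and ab3e12b3. Holding the full vocabulary in memory, as this sketch does, would not scale to hundreds of millions of distinct n-grams.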

6 Classification methods

7 Classification methods
- Instance-based learner: stores a collection of training examples and classifies a new instance from the most similar stored examples
- Naive Bayes: probabilistic model based on the prior probability of each class, P(Ci), and the conditional probability of each attribute value given the class, P(vj | Ci)
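As a rough sketch of the naive Bayes bullet, the code below estimates P(Ci) and P(vj | Ci) from present/absent attributes and predicts the class with the highest posterior. The function names are hypothetical, and the add-one (Laplace) smoothing is an assumption of this sketch, not something stated on the slide.

```python
import math
from collections import Counter


def train_naive_bayes(rows, labels):
    """Estimate class priors P(Ci) and P(attribute j is present | Ci)."""
    class_counts = Counter(labels)
    n_attrs = len(rows[0])
    present_counts = {c: [0] * n_attrs for c in class_counts}
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            if value:
                present_counts[label][j] += 1
    priors = {c: count / len(labels) for c, count in class_counts.items()}
    # Add-one smoothing (an assumption here) keeps every probability nonzero.
    cond_present = {
        c: [(present_counts[c][j] + 1) / (class_counts[c] + 2) for j in range(n_attrs)]
        for c in class_counts
    }
    return priors, cond_present


def nb_predict(priors, cond_present, row):
    """Return the class maximizing log P(Ci) + sum over j of log P(vj | Ci)."""
    scores = {}
    for c, prior in priors.items():
        log_p = math.log(prior)
        for j, value in enumerate(row):
            p = cond_present[c][j]
            log_p += math.log(p if value else 1.0 - p)
        scores[c] = log_p
    return max(scores, key=scores.get)
```

Each row is a tuple of booleans, one per selected n-gram, so a model would be trained with train_naive_bayes(rows, labels) and queried with nb_predict(priors, cond_present, row).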

8 Classification methods
- Support vector machines: a vector of weights w and a threshold b define a linear classifier. A kernel function maps the training data into a higher-dimensional space in which the problem may become linearly separable.
- Decision trees: internal nodes correspond to attributes and leaf nodes correspond to class labels.
- Boosted classifiers: a method for combining multiple classifiers. Boosting produces a set of weighted models by iteratively learning a model from a weighted data set, evaluating it, and reweighting the data set based on the model's performance.
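The boosting loop in the last bullet (learn on weighted data, evaluate, reweight) can be illustrated with a small AdaBoost-style sketch. It assumes labels of +1 and -1 and uses one-attribute decision stumps as the base model; these choices and all function names are illustrative, not the configuration used in the study.

```python
import math


def stump_predict(row, j, polarity):
    """A one-attribute tree: predict `polarity` if attribute j is present, else its negation."""
    return polarity if row[j] else -polarity


def best_stump(rows, labels, weights):
    """Return (weighted_error, attribute, polarity) for the lowest-error stump."""
    best = None
    for j in range(len(rows[0])):
        for polarity in (1, -1):
            error = sum(
                w for row, y, w in zip(rows, labels, weights)
                if stump_predict(row, j, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, j, polarity)
    return best


def adaboost(rows, labels, rounds=10):
    """Build a weighted set of stumps; labels must be +1 or -1."""
    n = len(rows)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, j, polarity = best_stump(rows, labels, weights)
        if error >= 0.5:  # no stump is better than chance on the weighted data
            break
        error = max(error, 1e-10)  # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, j, polarity))
        # Raise the weight of misclassified examples, lower the rest, renormalize.
        weights = [
            w * math.exp(-alpha * y * stump_predict(row, j, polarity))
            for row, y, w in zip(rows, labels, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def ensemble_predict(ensemble, row):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(alpha * stump_predict(row, j, polarity) for alpha, j, polarity in ensemble)
    return 1 if score >= 0 else -1
```

On a toy problem where the label is the majority vote of three attributes, no single stump is correct on every example, but three boosting rounds are enough for the weighted vote to classify all eight cases correctly.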

9 Detecting malicious code using n-grams
- Used ten-fold cross-validation
- Pilot study: to determine the n-gram size and the number of relevant n-grams to keep. With n = 4, the n-grams were ranked by information gain; the top 500 produced the best results.
- Experiment with small collection: a small collection of executables with a total of 68,744,909 n-grams
- Experiment with large collection: 255 million distinct n-grams of size 4
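Ten-fold cross-validation partitions the data into ten folds and uses each fold once for testing while training on the other nine. The sketch below shows one simple way to generate the splits; the function name and the fixed shuffling seed are assumptions, and it does not stratify by class.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

A classifier would be trained on each train split and scored on the matching test split, with the ten sets of results pooled or averaged.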

10 Results of Small Collection
[Figure: ROC curve for detecting malicious executables in the small collection]

11 Results of Bigger Collection
[Figure: ROC curve for detecting malicious executables in the bigger collection]
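An ROC curve such as those on the two results slides plots the true-positive rate against the false-positive rate as the decision threshold is lowered. The following sketch shows how such points, and the area under the curve, could be computed from classifier scores. The function names are hypothetical, and the input must contain at least one malicious and one benign example.

```python
def roc_points(scores, is_malicious):
    """Return (false_positive_rate, true_positive_rate) points, one per distinct score."""
    positives = sum(1 for m in is_malicious if m)
    negatives = len(is_malicious) - positives
    ranked = sorted(zip(scores, is_malicious), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example that shares this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

A classifier that ranks every malicious example above every benign one gives an area of 1.0, while random ranking gives about 0.5.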

12 Classifying executables by payload function
- Goal: measure the extent to which classification methods can determine whether a given malicious executable opens a backdoor, mass-mails, or is an executable virus
- Identify and enumerate the functions of the payloads
- Many executables fell into more than one category
- Experimental design similar to the detection experiments, but for each function the data set is built from malicious executables only
- Used ten-fold cross-validation
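Because an executable can belong to several payload categories, one way to read the design above is as a separate yes/no data set per function. The sketch below illustrates that reading; the function name, file names, and category strings are made up for the example.

```python
def one_vs_rest_datasets(payloads):
    """Map each payload function to {executable: has_that_function} labels.

    `payloads` maps an executable name to the set of functions its payload performs.
    """
    functions = sorted(set().union(*payloads.values()))
    return {
        function: {name: function in funcs for name, funcs in payloads.items()}
        for function in functions
    }


# Hypothetical malicious executables and their payload functions.
example = {
    "sample_a.exe": {"backdoor", "mass-mailer"},
    "sample_b.exe": {"virus"},
    "sample_c.exe": {"mass-mailer"},
}
datasets = one_vs_rest_datasets(example)
```

Here datasets["mass-mailer"] labels sample_a.exe and sample_c.exe as positive and sample_b.exe as negative, and each such data set would be evaluated on its own.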

13 Experimental Results
[Figure: ROC curve for mass-mailing capabilities]

14 Experimental Results
[Figure: ROC curve for backdoor capabilities]

15 Evaluating Real-World Online Performance
- Applied the method to 291 real-world malicious executables discovered after the original data were gathered
- Classifiers were built from the original data, both benign and malicious
- Boosted decision trees detected 98% of the new malicious executables

16 Conclusion and Future Work
- Machine learning and data mining are useful and appropriate tools for detecting malware
- Boosted classifiers and support vector machines performed exceptionally well
- Boosting can reduce bias and variance, and boosted classifiers outperformed the other classifiers in the study
- The approach is scalable
- 20-25% of the executables were obfuscated using compression or encryption
- Future work for the payload-function experiments: remove obfuscation and rerun the experiments with a larger collection

17 Conclusion and Future Work
- Study the similarity of malicious executables and how they change over time; clustering can provide good insight into this
- Combining this approach with a search for known signatures and with executing and analyzing code in a virtual machine should provide better computer security

18 Q&A