Technische Universität München, Large Scale Malware Analysis, Master’s Seminar, SS 2016. LARGE-SCALE MALWARE CLASSIFICATION USING RANDOM PROJECTIONS AND NEURAL NETWORKS. Presented by Yulia Gembarzhevskaya.
Problem: automated malware detection
Solution: [figure]
Challenges: low false positive rate; low false negative rate; malware family; huge number of features.
The contributions of this paper:
A large-scale system: an implementation that classifies unknown files using random projections and neural networks.
Random projections are used to reduce the dimensionality of the input space (PCA via random projections).
Malware classifier: labeled data -> random projections -> neural network classifier.
Dataset: 2.6 million files (1,843,359 malicious and 817,485 benign); 134 malware families plus a generic malware class.
Features: 3 types of features; all of the distinct combinations of the three attribute sets give more than 50 million possible features; feature selection using mutual information reduces these to 179 thousand sparse binary features. “Mutual information measures how much information the presence/absence of a term contributes to making the correct classification decision” [4]
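Feature selection by mutual information can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes binary malware/benign labels, uses the term/class formulation from [4], and the names `mutual_information` and `select_top_features` as well as the toy data are hypothetical.

```python
import math

def mutual_information(feature, labels):
    """Mutual information (in bits) between a binary feature and binary labels."""
    n = len(feature)
    # counts[(f, c)] = number of files with feature value f and class c
    counts = {(f, c): 0 for f in (0, 1) for c in (0, 1)}
    for f, c in zip(feature, labels):
        counts[(f, c)] += 1
    mi = 0.0
    for (f, c), n_fc in counts.items():
        if n_fc == 0:
            continue  # the term 0 * log(0) is taken to be 0
        n_f = counts[(f, 0)] + counts[(f, 1)]  # files with this feature value
        n_c = counts[(0, c)] + counts[(1, c)]  # files with this class
        mi += (n_fc / n) * math.log2(n * n_fc / (n_f * n_c))
    return mi

def select_top_features(columns, labels, keep):
    """Return the indices of the `keep` feature columns with the highest MI."""
    scores = [mutual_information(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:keep]

# Toy data (assumption): 6 files, 1 = malware, 0 = benign.
labels = [1, 1, 1, 0, 0, 0]
columns = [
    [1, 1, 1, 0, 0, 0],  # present exactly in the malware files
    [1, 0, 1, 0, 1, 0],  # only weakly related to the label
    [1, 1, 1, 1, 1, 1],  # present everywhere, carries no information
]
selected = select_top_features(columns, labels, keep=2)
print(selected)
```

A feature that is present in every file scores zero and is dropped first, which is the behaviour one wants when reducing millions of candidate features to a fixed budget.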
Random projections [1]: $P_{n \times k} = X_{n \times d} \, R_{d \times k}$, where $X$ holds the $n$ input vectors of dimension $d$, $R$ is the random projection matrix, and $P$ holds the projected vectors of dimension $k$. “An approximate algorithm for estimating distances between pairs of points in a high-dimensional vector space” [3]
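The projection step can be sketched with the sparse ternary matrix described in [3], where each entry is sqrt(s) times +1, 0, or -1 with probabilities 1/(2s), 1 - 1/s, and 1/(2s). This is an assumption for illustration: the slides do not state which distribution the system uses, and the sizes `d = 1000` and `k = 50`, the function names, and the example indices are hypothetical.

```python
import math
import random

def sparse_projection_matrix(d, k, s, rng):
    """Build a d x k matrix with entries sqrt(s) * {+1, 0, -1}.

    Each entry is +sqrt(s) or -sqrt(s) with probability 1/(2s) each,
    and 0 otherwise, following the scheme in [3].
    """
    scale = math.sqrt(s)
    matrix = []
    for _ in range(d):
        row = []
        for _ in range(k):
            u = rng.random()
            if u < 1 / (2 * s):
                row.append(scale)
            elif u < 1 / s:
                row.append(-scale)
            else:
                row.append(0.0)
        matrix.append(row)
    return matrix

def project(active_indices, R, k):
    """Project one sparse binary vector, given as the indices of its 1-entries.

    For a binary input, x R is just the sum of the rows of R selected by x.
    The 1/sqrt(k) factor keeps squared distances unbiased in expectation.
    """
    p = [0.0] * k
    for i in active_indices:
        for j, r in enumerate(R[i]):
            p[j] += r
    return [v / math.sqrt(k) for v in p]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
d, k = 1000, 50                   # hypothetical sizes, far smaller than in the talk
s = math.sqrt(d)                  # sparsity setting suggested in [3]
R = sparse_projection_matrix(d, k, s, rng)
file_features = [3, 17, 256, 999] # indices of features present in one file
projected = project(file_features, R, k)
print(len(projected))
```

Because the inputs are sparse and binary, the projection costs one row addition per active feature, so the full d-dimensional vector never has to be materialised.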
Classifiers: logistic regression (with all features; with random projections) and neural networks (with pre-training: one-layer NN, two-layer NN; without pre-training: one-layer NN, two-layer NN, three-layer NN).
Proposed neural network architecture for malware classification: sparse binary inputs -> 4000 linear units -> sigmoid hidden units -> 136-way softmax output.
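The forward pass through the upper part of this architecture (projected input, one sigmoid hidden layer, softmax output) can be sketched as below. This is an illustration under stated assumptions, not the authors' code: only the 136 output classes come from the slide, while the layer sizes `N_PROJECTED` and `N_HIDDEN`, the random weights, and all function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def affine(x, W, b):
    """Compute W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def forward(p, W1, b1, W2, b2):
    """Projected input -> sigmoid hidden layer -> class probabilities."""
    hidden = [sigmoid(z) for z in affine(p, W1, b1)]
    return softmax(affine(hidden, W2, b2))

def random_layer(n_out, n_in, rng, scale=0.01):
    """Small random weights and zero biases (untrained, for illustration only)."""
    W = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

rng = random.Random(0)
N_PROJECTED, N_HIDDEN, N_CLASSES = 40, 16, 136  # first two sizes are hypothetical
W1, b1 = random_layer(N_HIDDEN, N_PROJECTED, rng)
W2, b2 = random_layer(N_CLASSES, N_HIDDEN, rng)
p = [rng.gauss(0.0, 1.0) for _ in range(N_PROJECTED)]  # stands in for a projected file
probs = forward(p, W1, b1, W2, b2)
predicted_class = max(range(N_CLASSES), key=lambda c: probs[c])
```

The 136 outputs are consistent with the dataset slide: 134 malware families, one generic malware class, and one benign class.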
Experimental results [2]. Table columns: method; test error (%); two-class test error (%); FPR (%); FNR (%); training time (min). Methods compared: logistic regression with all features; logistic regression with random projections; one-layer NN without pre-training; one-layer NN with pre-training; two-layer NN without pre-training; two-layer NN with pre-training; three-layer NN without pre-training.
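The four error columns can be computed from per-file predictions roughly as follows. This is a sketch assuming the usual definitions (the slides do not spell them out): test error counts any wrong class, two-class error only counts malware/benign confusions, FPR is the share of benign files flagged as malware, and FNR is the share of malware files labelled benign. The function name, label strings, and toy predictions are hypothetical, and the input is assumed to contain at least one benign and one malware file.

```python
def error_rates(y_true, y_pred, benign="benign"):
    """Return multiclass error, two-class error, FPR and FNR, all in percent."""
    pairs = list(zip(y_true, y_pred))
    n = len(pairs)
    wrong_class = sum(t != p for t, p in pairs)
    false_pos = sum(t == benign and p != benign for t, p in pairs)
    false_neg = sum(t != benign and p == benign for t, p in pairs)
    n_benign = sum(t == benign for t in y_true)
    n_malware = n - n_benign
    return {
        "test_error": 100.0 * wrong_class / n,
        "two_class_error": 100.0 * (false_pos + false_neg) / n,
        "fpr": 100.0 * false_pos / n_benign,
        "fnr": 100.0 * false_neg / n_malware,
    }

# Toy example: one false positive, one wrong family, one false negative.
y_true = ["benign", "benign", "famA", "famA", "famB", "generic", "benign", "famB"]
y_pred = ["benign", "famA",   "famA", "famB", "famB", "benign",  "benign", "famB"]
rates = error_rates(y_true, y_pred)
print(rates)
```

Confusing one malware family with another raises the test error but not the two-class error, which is why the two columns are reported separately.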
Figure: error rates with different random projection sizes, for logistic regression and neural networks [2].
Figure: error rates for neural networks as a function of the number of hidden units [2].
Conclusion: a novel, large-scale malware classification system that uses random projections; 43% reduction in the error rate compared to logistic regression with all features; 0.49% two-class error rate for a one-layer NN and 0.42% for an ensemble of NNs; less than 3 hours to train on 2.6 million examples; no benefit from pre-training or from adding hidden layers.
References
[1] 957
[2] George E. Dahl, Jack W. Stokes, Li Deng, and Dong Yu, “Large-scale malware classification using random projections and neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013.
[3] Ping Li, Trevor J. Hastie, and Kenneth W. Church, “Very sparse random projections,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006, pp. 287–296.
[4] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, An Introduction to Information Retrieval, Cambridge University Press, 2009.