Published by Howard Butler. Modified over 9 years ago.
1
Combining Supervised and Unsupervised Learning for Zero-Day Malware Detection
Prakash Comar (1), Lei Liu (1), Sabyasachi (Saby) Saha (2), Pang-Ning Tan (1), Antonio Nucci (2)
(1) Michigan State University, Michigan, USA
(2) Narus, Inc., Sunnyvale, California, USA
© 2013 Narus, Inc.
2
Introduction
Increasing threats
–Continuous and increasing attacks on infrastructure
–Threats to business and national security
Huge financial stakes
–Conficker: 10 million machines, $9.1 billion in losses
–Zeus: 3.6 million machines [HTML injection]
–Koobface: 2.9 million machines [social networking sites]
–TidServ: 1.5 million machines [email spam attachments]
Attacks are becoming more advanced and sophisticated!
Malware is …
–Malicious software
–Viruses, phishing, spam, …
3
Introduction (Contd.)
Host-based vs. network-based approaches
Limitations of existing techniques
–Signature-based approach
  Fails to detect zero-day attacks.
  Fails to detect threats with evolving capabilities, such as metamorphic and polymorphic malware.
–Anomaly-based approach
  Produces a high false alarm rate.
–Supervised-learning-based approach
  Performs poorly on new and evolving malware.
  Building a classifier model is challenging due to the diversity of malware classes, imbalanced class distribution, data imperfection issues, etc.
There is no silver bullet.
4
Our Goal
Focus on layer-3/layer-4 features
Threats often exhibit specific behavior in their layer-3/layer-4 flow-level features
–Even when the payload is encrypted
Machine-learning-based approach
–Two-level supervised learning approach to detect malicious flows and then identify the specific malware type
–Combine unsupervised learning with supervised learning to address the new class discovery problem
5
Challenges
Imbalanced class representation
–The majority of flows belong to a few dominant classes
Missing values
–The features used to characterize a network flow may contain missing values (only 7% of records have all features)
Noise in the training data
–Training data labeled as good by the IDS may contain malware
New class discovery
–Not all classes are present when the classifier is initially trained
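The first two challenges can be quantified before any model is trained. The following is a minimal sketch, assuming flows are stored as (feature dictionary, label) pairs with None marking a missing value. The toy records, feature names, and helper names are hypothetical and are not taken from the data set described in the slides.

```python
from collections import Counter

# Hypothetical toy flows: (features, label). None marks a missing feature value.
flows = [
    ({"bytes": 1200, "pkts": 10, "duration": 0.5}, "good"),
    ({"bytes": None, "pkts": 3, "duration": 0.1}, "good"),
    ({"bytes": 800, "pkts": None, "duration": None}, "good"),
    ({"bytes": 64, "pkts": 1, "duration": None}, "malware_A"),
]


def complete_fraction(records):
    """Fraction of records in which no feature value is missing."""
    complete = sum(
        1 for feats, _ in records if all(v is not None for v in feats.values())
    )
    return complete / len(records)


def class_shares(records):
    """Share of records per class label, to expose class imbalance."""
    counts = Counter(label for _, label in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


print(complete_fraction(flows))
print(class_shares(flows))
```

A low complete-record fraction, like the 7% reported on this slide, means that simply discarding incomplete records would throw away most of the data.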
6
System Architecture
7
Proposed Framework
Two-level malware detection framework:
Macro-level classifier
–Used to isolate malicious flows from non-malicious ones.
Micro-level classifier
–Further categorizes malicious flows as one of the pre-existing malware classes or as new malware.
8
Methodology: Two-layered Learning Framework
L1: Ensemble-learning-based binary classifier
–Classifies flows as unknown or malicious
–Random forest classifier
L2: One-class SVM with tree-based kernel, along with probabilistic class profiling, for specific malware class and novel class detection
Combined classification process
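The control flow of the two-layered decision can be sketched as follows. This is only an illustration under stated assumptions: the macro and micro models are replaced by hypothetical stand-in functions (`toy_macro`, `toy_micro`), and the rule "flag as new malware when no known class scores above a threshold" is an assumed simplification of the probabilistic class profiling named on the slide.

```python
def classify_flow(flow, macro_is_malicious, micro_scores, accept_threshold):
    """Two-level decision: macro-level filter first, then micro-level assignment."""
    # Level 1: flows not flagged as malicious stay in the "unknown" category.
    if not macro_is_malicious(flow):
        return "unknown"
    # Level 2: one score per known malware class (higher means better fit).
    scores = micro_scores(flow)
    best_class = max(scores, key=scores.get)
    # Assumed rule: if no known class fits well enough, report a new class.
    if scores[best_class] < accept_threshold:
        return "new malware"
    return best_class


# Hypothetical stand-ins for the trained models (not the authors' classifiers).
def toy_macro(flow):
    return flow["pkts_per_sec"] > 100


def toy_micro(flow):
    return {
        "malware_A": 1.0 if flow["dst_port"] == 445 else 0.0,
        "malware_B": 1.0 if flow["dst_port"] == 80 else 0.0,
    }


print(classify_flow({"pkts_per_sec": 500, "dst_port": 445}, toy_macro, toy_micro, 0.5))
```

In a real system, `macro_is_malicious` would wrap the random forest and `micro_scores` would wrap the per-class one-class SVMs.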
9
Proposed Framework
1-class SVM for known malware detection:
10
Proposed Framework
Tree-based feature transformation
11
Proposed Framework
Example of tree-based features with three classes (C1, C2, C3)
[Figure: an n × d data matrix X with entries x_11 … x_nd, and a label vector Y whose entries are 1, 2, or 3 according to class.]
12
[Figure: the records of the first class are labeled +1; m out of n records and f out of d features of X are sampled to build each of P trees.]
13
[Figure: the records of the second class are labeled +1; m out of n records and f out of d features of X are sampled to build each of P trees.]
14
[Figure: the records of the third class are labeled +1; m out of n records and f out of d features of X are sampled to build each of P trees.]
15
Proposed Framework
Example of tree-based feature transformation.
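The slides describe the transformation only through figures: for each class in turn, that class is labeled +1, and P trees are grown on samples of m out of n records and f out of d features. A rough sketch of that idea follows, with several assumptions made for brevity: depth-1 trees (stumps) stand in for full decision trees, sampling is done without replacement, the transformed feature vector is taken to be the vector of tree outputs, and a tree that hits a missing value outputs 0. All function names are hypothetical.

```python
import random


def train_stump(X, y, features):
    """Fit a depth-1 tree: choose the (feature, threshold, sign) with fewest errors.
    Records whose value for the candidate feature is missing (None) are skipped."""
    best = None
    for j in features:
        values = sorted({row[j] for row in X if row[j] is not None})
        for t in values:
            for sign in (1, -1):
                errors = sum(
                    1
                    for row, label in zip(X, y)
                    if row[j] is not None
                    and (sign if row[j] <= t else -sign) != label
                )
                if best is None or errors < best[0]:
                    best = (errors, j, t, sign)
    return best[1:] if best else None


def stump_output(stump, row):
    """Return +1 or -1, or 0 when the stump is empty or the feature is missing."""
    if stump is None:
        return 0
    j, t, sign = stump
    if row[j] is None:
        return 0
    return sign if row[j] <= t else -sign


def fit_transformers(X, y, classes, P, m, f, seed=0):
    """For each class, grow P stumps on m sampled records and f sampled features."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    stumps = []
    for c in classes:
        for _ in range(P):
            rows = rng.sample(range(n), m)
            feats = rng.sample(range(d), f)
            Xs = [X[i] for i in rows]
            ys = [1 if y[i] == c else -1 for i in rows]  # one-vs-rest labels
            stumps.append(train_stump(Xs, ys, feats))
    return stumps


def transform(stumps, row):
    """Map a raw record to its vector of tree outputs."""
    return [stump_output(s, row) for s in stumps]


# Hypothetical toy data with missing values (None).
X = [[1.0, None], [1.2, 5.0], [None, 5.5], [9.0, 0.5], [9.5, None], [8.7, 0.2]]
y = [1, 1, 1, 2, 2, 2]
stumps = fit_transformers(X, y, classes=[1, 2], P=3, m=4, f=1)
print(transform(stumps, X[0]))
```

Under these assumptions every record, however incomplete, maps to a fixed-length vector with entries in {-1, 0, +1}, which is what lets a downstream kernel avoid explicit imputation.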
16
Proposed Framework
Kernel matrix for 1-class SVM:
–Existing kernels, such as the RBF or polynomial kernel, assume that feature vectors do not have missing values
–We propose a weighted linear kernel matrix for the 1-class SVM, based on the transformed tree-based features, obtained by minimizing the following objective function.
–W_ij is the model regularizer and G_ij is a ground-truth kernel, which is defined as
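The objective function and the definition of the ground-truth kernel are not reproduced in this transcript, so the sketch below illustrates only the general form of a weighted linear kernel over transformed features, K[i][j] = sum over k of w[k] * Z[i][k] * Z[j][k]. The weights here are fixed by hand as an assumption; in the presentation they come from the minimization step. The function name and the toy values are hypothetical.

```python
def weighted_linear_kernel(Z, w):
    """Return K with K[i][j] = sum_k w[k] * Z[i][k] * Z[j][k]."""
    return [
        [sum(wk * a * b for wk, a, b in zip(w, zi, zj)) for zj in Z]
        for zi in Z
    ]


# Hypothetical transformed features (tree outputs) and hand-picked weights.
Z = [[1, -1, 0], [1, 1, -1]]
w = [0.5, 0.25, 0.25]
K = weighted_linear_kernel(Z, w)
print(K)
```

If missing information is encoded as 0 in the transformed features, the corresponding term simply drops out of the sum, so the kernel stays defined for incomplete records.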
17
Proposed Framework
Probabilistic profiling for new class discovery:
18
Experimental Evaluation
Data:
–Network flow data from an Internet service provider in Asia; a subset of 108 flow features was extracted.
–An IDS/IPS system generates the class label for each flow by analyzing the payload.
  38 different malware classes were identified by the IDS/IPS, including Conficker, Tidserv, Trojans, etc.
  Flows left unlabeled by the IDS/IPS are assigned to the “good” (unknown) category.
19
Experimental Evaluation
Data:
20
Experimental Evaluation
Comparison of tree-based feature transformation against missing value imputation
–Original: data without any missing value treatment
–OMI: imputation with the overall mean value of the feature across all classes
–CMI: imputation with the mean value of the feature for the given class
–LKNN: local KNN imputation
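The two mean-based baselines can be sketched in a few lines. This is a minimal illustration, assuming records are lists with None for missing values and that every feature has at least one observed value (per class, in the CMI case); otherwise the mean is undefined and the code raises ZeroDivisionError. LKNN is not shown. The function names and toy data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def overall_mean_impute(X):
    """OMI: replace a missing value with the feature mean over all records."""
    d = len(X[0])
    col_means = [mean([row[j] for row in X if row[j] is not None]) for j in range(d)]
    return [[col_means[j] if row[j] is None else row[j] for j in range(d)] for row in X]


def class_mean_impute(X, y):
    """CMI: replace a missing value with the feature mean within the record's class."""
    d = len(X[0])
    means = {}
    for c in set(y):
        rows = [row for row, label in zip(X, y) if label == c]
        means[c] = [mean([r[j] for r in rows if r[j] is not None]) for j in range(d)]
    return [
        [means[label][j] if row[j] is None else row[j] for j in range(d)]
        for row, label in zip(X, y)
    ]


# Hypothetical toy data: two features, two classes, None marks a missing value.
X = [[1.0, None], [3.0, 4.0], [None, 8.0], [8.0, 6.0]]
y = ["a", "a", "b", "b"]
print(overall_mean_impute(X))
print(class_mean_impute(X, y))
```

Note that CMI needs the class label, so it can be applied directly only to labeled training data.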
21
Experimental Evaluation
Results comparison at macro level
Results comparison at micro level
–ROC curve for new malware detection
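For readers unfamiliar with how a ROC curve is derived from detector scores, here is a small sketch. It assumes that a higher score means "more likely new malware" and that labels are 1 for new malware and 0 otherwise. The scores and labels are made up for illustration and are not results from the presentation.

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by sweeping a threshold from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical detector scores and ground-truth labels.
scores = [0.9, 0.8, 0.4, 0.2]
labels = [1, 0, 1, 0]
points = roc_points(scores, labels)
print(points)
print(auc(points))
```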
22
Experimental Evaluation
Overall results comparison for detecting both known and new malware
23
Conclusion
We proposed an effective malware detection framework based on statistical flow-level features
–Two-level ML-based classifier
–New class detection
–Encrypted data
A tree-based kernel for the 1-class SVM was proposed to handle data imperfection issues in network flow data
24
Future Work
Extend the formulation to an online learning setting
Develop a hierarchical multi-class learning method to improve testing efficiency when the number of malware classes becomes extremely large.