Classification of Computer Viruses from Binary Code using Ensemble Classifier and Recursive Feature Elimination
Prasit Usaphapanus, Krerk Piromsopa
Department of Computer Engineering, Chulalongkorn University, Thailand
ICDIM2017
Introduction
The number of malware samples keeps increasing: about 74,000 new samples per day [1]. Kaspersky Lab's web antivirus detected 69,277,289 unique malicious objects [2].
[1], [2] Kaspersky Security Bulletin. Overall Statistics for 2016
Thesis objective
Use supervised machine learning to detect unseen viruses (static analysis).
Inspired by Schultz et al.: M. Schultz, E. Eskin, F. Zadok, and S. Stolfo, "Data mining methods for detection of new malicious executables," in Proceedings 2001 IEEE Symposium on Security and Privacy (S&P), IEEE Computer Society, 2001, pp. 38–49.
Related research
- Type of feature: API calls; bytecode n-gram; opcode n-gram, basic block
- Feature selection: information gain, TF-IDF
- Classifier: random forest
- Data set
Measurement: F1-score

F1 = 2 × (precision × recall) / (precision + recall)
precision = true positives / (true positives + false positives)
recall = true positives / (true positives + false negatives)
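The definitions above can be checked with a short Python snippet; the counts below are illustrative, not results from the paper:

```python
# Compute precision, recall, and F1 from raw counts, e.g. a virus
# classifier with 90 true positives, 10 false positives, and 30
# missed viruses (false negatives). Counts are made up for illustration.
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(tp=90, fp=10, fn=30), 3))  # → 0.818
```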
Data set
- Benign binaries from a Windows installation; malicious binaries downloaded from http://vxheaven.org/ (3052 and 3267 files)
- Pipeline (Fig. 2): binary files → objdump → feature extraction → TF-IDF feature selection with recursive feature elimination → scikit-learn classifiers with soft voting → evaluation by 10-fold cross-validation (Table I, Fig. 3)
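A minimal sketch of this pipeline in Python, assuming scikit-learn and using toy opcode strings in place of real objdump output (the documents and labels below are invented for illustration):

```python
# Sketch of the slide's pipeline: TF-IDF features over whitespace-
# separated opcode "documents", then a random forest evaluated with
# 10-fold cross-validation. Data is synthetic, not from the paper.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

docs = ["cmp jz call ret", "mov add sub", "cmp cmp jz",
        "call ret mov", "add sub mul"] * 10        # 50 toy samples
labels = [0, 1, 0, 1, 1] * 10                      # 0 = benign, 1 = virus

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X, labels, cv=10)
print(scores.mean())
```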
Feature extraction
Feature types:
- Bytecode n-gram
- Opcode n-gram
- Opcode basic block
- Section line count
Feature extraction
Feature type: bytecode n-gram, e.g. ff ff ff ff a3 75
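Byte n-gram extraction can be sketched in a few lines of Python; the sample bytes mirror the slide's example (ff ff ff ff a3 75):

```python
# Extract overlapping byte n-grams from raw binary content.
def byte_ngrams(data: bytes, n: int):
    return [data[i:i + n].hex() for i in range(len(data) - n + 1)]

sample = bytes.fromhex("ffffffffa375")
print(byte_ngrams(sample, 2))
# → ['ffff', 'ffff', 'ffff', 'ffa3', 'a375']
```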
Feature extraction
Feature type: opcode n-gram, e.g. 2-grams over the opcode sequence addl-cmp-jz-subl-subf-call
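The same sliding-window idea applies to disassembled opcodes; here it is applied to the slide's example sequence:

```python
# Build n-grams from a disassembled opcode sequence.
def opcode_ngrams(opcodes, n):
    return ["-".join(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]

ops = ["addl", "cmp", "jz", "subl", "subf", "call"]
print(opcode_ngrams(ops, 2))
# → ['addl-cmp', 'cmp-jz', 'jz-subl', 'subl-subf', 'subf-call']
```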
Feature extraction
Feature type: opcode basic block, e.g. the block addl-cmp-jz-subl-subf-call treated as a single feature
Feature selection: TF-IDF

tfidf(t, D) = tf(t, D) × idf(t)
tf(w, D) = f(w, D), the count of term w in document D
idf(t) = 1 + log( C / (1 + df(t)) ), where C is the number of documents and df(t) is the number of documents containing t
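A direct Python translation of these formulas, on toy opcode "documents" invented for illustration:

```python
import math
from collections import Counter

# Manual TF-IDF per the slide's formulas: tf(w, D) is the count of w in
# document D, and idf(t) = 1 + log(C / (1 + df(t))), where C is the
# number of documents and df(t) the number of documents containing t.
docs = [["cmp", "jz", "call"], ["cmp", "cmp", "mov"], ["call", "ret"]]
C = len(docs)
df = Counter(t for d in docs for t in set(d))  # document frequency

def tfidf(term, doc):
    tf = doc.count(term)
    idf = 1 + math.log(C / (1 + df[term]))
    return tf * idf

# "cmp" appears twice in docs[1] and in 2 of 3 documents,
# so tf = 2 and idf = 1 + log(3/3) = 1, giving 2.0.
print(tfidf("cmp", docs[1]))  # → 2.0
```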
Evaluation
XGBoost: T. Chen and C. Guestrin, "XGBoost: A Scalable Tree Boosting System," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), ACM Press, Mar. 2016, pp. 785–794.
Soft voting
Soft vote over the predictions of Random Forest and XGBoost
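A hedged sketch of soft voting with scikit-learn; `GradientBoostingClassifier` stands in for XGBoost here so the snippet needs no external install, and the data is synthetic:

```python
# Soft voting averages the predicted class probabilities of the
# ensemble members instead of their hard class labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
vote = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    voting="soft",          # average predict_proba outputs
)
scores = cross_val_score(vote, X, y, cv=10, scoring="f1")
print(scores.mean())
```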
Recursive feature elimination [4]
Recursive feature elimination with cross-validation
[4] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, 46(1–3), 389–422, 2002.
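scikit-learn's `RFECV` implements recursive feature elimination with cross-validation; below is a minimal sketch on synthetic data, with a random forest providing the feature ranking (the paper's actual estimator and settings may differ):

```python
# RFECV repeatedly drops the lowest-ranked features and keeps the
# feature count that maximizes the cross-validated score.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=200, n_features=25,
                           n_informative=5, random_state=0)
selector = RFECV(RandomForestClassifier(n_estimators=50, random_state=0),
                 step=2, cv=5, scoring="f1")
selector.fit(X, y)
print(selector.n_features_)   # number of features kept
```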
Conclusion
- Type of feature: bytecode n-gram; opcode n-gram, basic block; mixed
- Feature selection: TF-IDF + recursive feature elimination
- Classifier: random forest, XGBoost, multi-layer perceptron, soft voting
- Data set