Prasit Usaphapanus Krerk Piromsopa


1 Classification of Computer Viruses from Binary Code using Ensemble Classifier and Recursive Feature Elimination
Prasit Usaphapanus, Krerk Piromsopa
Department of Computer Engineering, Chulalongkorn University, Thailand
ICDIM2017

2 Introduction
The amount of malware keeps increasing: 74,000 per day [1].
Kaspersky Lab’s web antivirus found 69,277,289 kinds of malicious objects [2].
[1]
[2] Kaspersky Security Bulletin. Overall Statistics for 2016.

3 Thesis objective
Use supervised machine learning to detect unseen viruses (static analysis).
Inspired by Matthew Schultz et al.
M. Schultz, E. Eskin, E. Zadok, and S. Stolfo, “Data mining methods for detection of new malicious executables,” in Proceedings 2001 IEEE Symposium on Security and Privacy. S&P 2001. IEEE Comput. Soc, 2001, pp. 38–49. [Online]. Available:

4 Related research
Type of feature: API calls; bytecode n-gram; opcode n-gram, basic block
Feature selection: information gain; TF-IDF
Classifier: random forest
Data set

5 Measurement
F1-Score
\[ F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \]
\[ \text{precision} = \frac{\text{true positive}}{\text{true positive} + \text{false positive}} \]
\[ \text{recall} = \frac{\text{true positive}}{\text{true positive} + \text{false negative}} \]
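The F1 definition above can be checked with a few lines of Python. This is a minimal sketch; the function name `f1_score` and the counts are hypothetical and are not results from the presentation.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1 from true-positive, false-positive and false-negative counts."""
    if tp == 0:
        return 0.0  # convention chosen for this sketch: no true positives gives F1 = 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
score = f1_score(tp=90, fp=10, fn=30)
```

With these counts precision is 0.9 and recall is 0.75, so F1 sits between the two and closer to the lower value.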

6 Data set
Sources: Windows installed; http://vxheaven.org/ download
File counts: 3052; 3267
[Pipeline diagram: binary files (Data set) -> objdump (Feature extraction) -> TF-IDF (Feature selection) -> Scikit-learn 10-fold cross-validation (Evaluation): Fig 2, Table I, Fig 3, Soft voting, Recursive feature elimination]

7 Feature extraction
Feature type:
Bytecode n-gram
Opcode n-gram
Opcode basic block
Section line count

8 Feature extraction
Feature type: Bytecode n-gram
Ex. ff ff ff ff a3 75
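A bytecode n-gram is a window of n consecutive bytes. The sketch below counts overlapping 2-grams over the example bytes from this slide. The function name `byte_ngrams` and the choice of overlapping windows are assumptions; the slides do not state how the windows were taken.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int) -> Counter:
    """Count every overlapping n-byte window, keyed by its hex string."""
    grams = Counter()
    for i in range(len(data) - n + 1):
        window = data[i:i + n]
        grams[" ".join(f"{b:02x}" for b in window)] += 1
    return grams


# Example bytes from the slide.
sample = bytes.fromhex("ff ff ff ff a3 75")
counts = byte_ngrams(sample, 2)
```

Six bytes yield five overlapping 2-grams; the repeated `ff ff` window is counted three times.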

9 Feature extraction
Feature type: Opcode n-gram
addl-cmp-jz-subl-subf-call
2-gram
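The opcode 2-grams of the sequence on this slide can be listed as follows. This is a sketch that assumes the opcode sequence has already been extracted from the disassembly; `opcode_ngrams` is a hypothetical helper name.

```python
def opcode_ngrams(opcodes, n):
    """Return every overlapping run of n opcodes, joined with '-'."""
    return ["-".join(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


# Opcode sequence from the slide.
sequence = "addl-cmp-jz-subl-subf-call".split("-")
bigrams = opcode_ngrams(sequence, 2)
```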

10 Feature extraction
Feature type: Opcode basic block
addl-cmp-jz-subl-subf-call
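One possible way to cut an opcode sequence into basic blocks is to end a block at every control-transfer instruction. The slides do not say which instructions end a block, so the `BLOCK_ENDERS` set and the helper `basic_blocks` below are assumptions for illustration only.

```python
# Assumption: these mnemonics end a block; the slides do not list the actual set.
BLOCK_ENDERS = {"jz", "jnz", "jmp", "call", "ret"}


def basic_blocks(opcodes):
    """Split an opcode sequence after each block-ending instruction."""
    blocks, current = [], []
    for op in opcodes:
        current.append(op)
        if op in BLOCK_ENDERS:
            blocks.append("-".join(current))
            current = []
    if current:  # trailing opcodes that never reached a block ender
        blocks.append("-".join(current))
    return blocks


blocks = basic_blocks("addl-cmp-jz-subl-subf-call".split("-"))
```

Under this assumption each block becomes one feature, so the feature length varies with the code instead of being fixed at n.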

11 Feature selection
\[ \mathit{tfidf} = \mathit{tf} \times \mathit{idf} \]
\[ \mathit{tf}(w, D) = f_{w,D} \]
\[ \mathit{idf}(t) = 1 + \log \frac{C}{1 + \mathit{df}(t)} \]
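The TF-IDF weighting can be sketched in plain Python. The sketch reads tf as the raw count of a term in a document, C as the number of documents, df(t) as the number of documents containing t, and the logarithm as natural; all four readings are assumptions, since the slide does not define them. The function name `tfidf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """documents: list of token lists. Returns one {token: weight} dict per document."""
    C = len(documents)  # assumed meaning of C: number of documents
    df = Counter(token for doc in documents for token in set(doc))
    idf = {t: 1 + math.log(C / (1 + df[t])) for t in df}
    return [{t: f * idf[t] for t, f in Counter(doc).items()} for doc in documents]


# Hypothetical documents made of opcode 2-grams.
docs = [["addl-cmp", "cmp-jz", "addl-cmp"], ["cmp-jz", "jz-subl"], ["subl-subf"]]
weights = tfidf(docs)
```

A term that occurs in fewer documents receives a larger idf, so rare n-grams weigh more than common ones at equal frequency.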

12 Evaluation
T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting System,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16. New York, New York, USA: ACM Press, Mar. 2016, pp. 785–794.
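The pipeline evaluates with 10-fold cross-validation in Scikit-learn. The idea behind the split can be sketched without any library: every sample is used for testing exactly once and for training in the other folds. The helper `k_fold_indices` is hypothetical, and it does no shuffling or stratification, which a real evaluation would normally add (for example through Scikit-learn's `KFold` or `StratifiedKFold`).

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; each sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Hypothetical sample count, for illustration only.
folds = list(k_fold_indices(25, k=10))
```

A classifier would be fitted on each `train` list and scored (for instance with F1) on the matching `test` list, and the ten scores averaged.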

13 Soft voting
Vote of random forest and XGBoost
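Soft voting averages the class probabilities predicted by each model and picks the class with the highest average. The sketch below assumes equal weights for the two models; the probabilities are hypothetical and `soft_vote` is an illustrative helper, not the code used in the study.

```python
def soft_vote(probabilities_per_model):
    """Average class probabilities across models and pick the most probable class."""
    n_models = len(probabilities_per_model)
    n_classes = len(probabilities_per_model[0])
    averaged = [
        sum(p[c] for p in probabilities_per_model) / n_models
        for c in range(n_classes)
    ]
    return averaged.index(max(averaged)), averaged


# Hypothetical [P(benign), P(virus)] outputs for one file.
random_forest = [0.60, 0.40]
xgboost = [0.20, 0.80]
label, averaged = soft_vote([random_forest, xgboost])
```

Here the two models disagree on the label, and the more confident model decides the outcome, which a majority vote over two models could not do. Scikit-learn offers this scheme as `VotingClassifier(voting="soft")`.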

14 Recursive feature elimination [4]
Recursive feature elimination with cross-validation
[4] Guyon, I., Weston, J., Barnhill, S., & Vapnik, V., “Gene selection for cancer classification using support vector machines”, Mach. Learn., 46(1-3), 389–422, 2002.
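Recursive feature elimination repeatedly ranks the surviving features with a model, drops the weakest, and ranks again. The loop can be sketched as below. The fixed importance table stands in for a fitted model and all names and values are hypothetical. The cross-validated variant additionally scores each candidate feature count on held-out folds to choose how many features to keep, which this sketch leaves out.

```python
def recursive_feature_elimination(features, importance_fn, n_keep, step=1):
    """Repeatedly re-rank the surviving features and drop the weakest ones."""
    remaining = list(features)
    while len(remaining) > n_keep:
        scores = importance_fn(remaining)  # re-score on the survivors only
        n_drop = min(step, len(remaining) - n_keep)
        weakest = sorted(remaining, key=lambda f: scores[f])[:n_drop]
        remaining = [f for f in remaining if f not in weakest]
    return remaining


# Stand-in for a model's feature importances (hypothetical values).
FIXED = {"ff ff": 0.05, "addl-cmp": 0.40, "cmp-jz": 0.30, "jz-subl": 0.10}


def toy_importance(features):
    return {f: FIXED[f] for f in features}


selected = recursive_feature_elimination(FIXED, toy_importance, n_keep=2)
```

In practice `importance_fn` would refit a classifier on the remaining features each round, since importances change once a feature is removed. Scikit-learn provides this as `RFE` and, with cross-validation, `RFECV`.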

15 Conclusion
Type of feature: bytecode n-gram; opcode n-gram, basic block; mixed
Feature selection: TF-IDF + recursive feature elimination
Classifier: random forest; XGBoost; multi-layer perceptron; soft voting
Data set

