1
Malware Classification and Novelty Detection Using PE Header Information Nasser Salim CS529 – Final Project April, 2011
2
Motivation Malware is becoming exponentially more prolific over time. "Report: Targeted Attacks Evolve, New Malware Variants Spike By 100 Percent," www.darkreading.com (2010). "Malware variants giving anti-virus firms a tough time," www.techrepublic.com (2008). "Malware variants may have hit half-million mark," http://www.securityfocus.com/brief/655 (2008). "When you are getting thousands of samples a day, you cannot just rely on human analysts, you need automation."
3
Detection Using Learning Rieck et al., "Automatic Analysis of Malware Behavior using Machine Learning" (2009): analysis using dynamic features collected from the runtime behavior of malware; novelty detection using prototypes of behavioral features. Shafiq et al., "PE-Miner: Realtime Mining of Structural Information to Detect Zero-Day Malicious Portable Executables" (2009): analysis using static features from the PE headers of malware; achieved good results sorting malware into families (~0.93 AUC) using decision trees.
4
Malware Countermeasures "... more than 40% of the total malware samples reduce their malicious behavior under virtual machines or with a debugger attached, and they account for potentially 90% of the Internet attacks during certain periods." Towards an Understanding of Anti-virtualization and Anti-debugging Behavior in Modern Malware (2010) "... many packers will take steps to obfuscate a binary's import table by compressing or encrypting the list of functions and libraries that the binary depends upon." Gray Hat Hacking: The Ethical Hacker's Handbook (2008)
5
Project Goals Validation: do the detection methods from PE-Miner still work on newer malware? Extension: if classification can be done so well using static features, can we also do novelty detection similar to Rieck et al.?
6
PE Header – Examples of Features
Continuous features: NumberOfSymbols, SizeOfCode, SizeOfStackReserve
Discrete features: DLL flag, LARGE_ADDRESS_AWARE flag, MajorOperatingSystemVersion
7
Malware Data Sources
Offensive Computing: on the order of 10^5 malware samples, updated frequently; unlabeled.
VX Heavens: ~270,000 malware samples from 2010, labeled by malware family and variant; successor to the dataset used in PE-Miner.
8
Malware Classes (the classes used in PE-Miner)
Class                 Number of Samples
Backdoor              50,773
DoS/Nuker             212
Constructor/VirTool   974
Flooder               566
Exploit/HackTool      1,371
Trojan                163,322
Virus                 3,132
Worm                  11,505
9
Malware Classes
Class                 Number of Samples
Backdoor              50,773
DoS/Nuker             212
Constructor/VirTool   974
Flooder               566
Exploit/HackTool      1,371
Trojan                163,322
Virus                 3,132
Worm                  11,505
Hoax                  1,128
Rootkit               3,179
Now including classes missing from PE-Miner.*
* SpamTool and Spoofer had too few samples to include.
10
Malware Classes
Class                 Number of Samples
Backdoor              50,773
DoS/Nuker             212
Constructor/VirTool   974
Flooder               566
Exploit/HackTool      1,371
Trojan                163,322
Virus                 3,132
Worm                  11,505
Hoax                  1,128
Rootkit               3,179
Dropping the following from analysis.
11
Feature Extraction Static feature extraction on O(10^5) samples is still slow and generates a lot of data. Sample down to O(10^3) per class before extracting features. Used the pefile parser, written in Python: http://code.google.com/p/pefile/ (a sketch follows).
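As a rough illustration only (not the project's actual extraction script), pulling a few of the header fields listed earlier with pefile might look like the sketch below; the flag masks are the standard COFF Characteristics bits, while the feature names and the command-line wrapper are assumptions.

#!/usr/bin/python
# Sketch: extract a handful of PE header features with pefile.
import sys
import pefile

def extract_features(path):
    pe = pefile.PE(path)
    chars = pe.FILE_HEADER.Characteristics
    return {
        # continuous features
        "NumberOfSymbols": pe.FILE_HEADER.NumberOfSymbols,
        "SizeOfCode": pe.OPTIONAL_HEADER.SizeOfCode,
        "SizeOfStackReserve": pe.OPTIONAL_HEADER.SizeOfStackReserve,
        "MajorOperatingSystemVersion": pe.OPTIONAL_HEADER.MajorOperatingSystemVersion,
        # discrete (flag) features
        "IsDLL": bool(chars & 0x2000),              # IMAGE_FILE_DLL bit
        "LargeAddressAware": bool(chars & 0x0020),  # IMAGE_FILE_LARGE_ADDRESS_AWARE bit
    }

if __name__ == "__main__":
    for sample in sys.argv[1:]:
        print sample, extract_features(sample)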
12
Learning using Orange Orange – a lightweight machine learning toolkit for Python: http://orange.biolab.si/ Used the C4.5 decision tree algorithm with 10-fold cross validation. C4.5 obtained the best results in PE-Miner. The mix of data types and scales makes metric-based algorithms challenging.
13
Orange Code
#!/usr/bin/python
# Load the extracted features, run C4.5 with 10-fold cross validation,
# and report classification accuracy (CA) and AUC for each learner.
import orange, orngTest, orngStat, orngTree

data = orange.ExampleTable("test_data")

c45 = orange.C45Learner()
c45.name = "C45"
learners = [c45]

results = orngTest.crossValidation(learners, data, folds=10)
for i in xrange(len(learners)):
    print learners[i].name, " : ",
    print orngStat.CA(results)[i], " : ",
    print orngStat.AUC(results)[i]
14
Classification Results Still experimenting with tree pruning. Classification accuracy > 65%; AUC > 0.8 (PE-Miner achieved > 0.9).
15
Novelty Detection Using Leader Clustering (an online variant of k-means; a sketch follows):
1. Specify a distance threshold.
2. Find the existing cluster center nearest to the new sample; if that distance is below the threshold, assign the sample to that cluster.
3. If no cluster center is within the threshold, create a new cluster using the new sample as its center.
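A minimal sketch of that loop, assuming plain numeric feature vectors and a plug-in distance function; the Euclidean default and the threshold are placeholders, not the project's implementation.

# Sketch of the Leader clustering loop described above.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def leader_cluster(samples, threshold, distance=euclidean):
    centers = []
    for s in samples:
        # distance from the sample to its nearest existing center, if any
        best_d = None
        for c in centers:
            d = distance(s, c)
            if best_d is None or d < best_d:
                best_d = d
        if best_d is None or best_d >= threshold:
            # no existing center is close enough: start a new cluster
            centers.append(s)
        # otherwise the sample is absorbed by its nearest existing cluster
    return centers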
16
Metric Problem PE data is a mix of data types and scales. Try: Hamming distance on only the binary features (the number of differences), and turn some features into binary form (e.g. rather than the number of symbols imported from a particular .dll, flag the usage of that .dll). A sketch follows.
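For illustration only, assuming a feature dictionary shaped like the earlier pefile sketch; the per-.dll import counts are hypothetical field names, not pefile output.

# Sketch of the binary-feature workaround: collapse mixed-type features
# into 0/1 flags and compare samples with Hamming distance.

def binarize(sample):
    return [
        int(sample["IsDLL"]),
        int(sample["LargeAddressAware"]),
        # flag the use of a .dll instead of counting imports from it
        int(sample.get("kernel32.dll_imports", 0) > 0),
        int(sample.get("ws2_32.dll_imports", 0) > 0),
    ]

def hamming(a, b):
    # number of positions where two equal-length flag vectors differ
    return sum(1 for x, y in zip(a, b) if x != y)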
17
Evaluation of Novelty Detection
For each class:
hold out that class's training data;
train the Leader algorithm on the remaining data;
add testing data that includes the held-out class;
measure the number of false-positive and false-negative novelty hits.
Average over all classes. (A sketch of this loop follows.)
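Purely as a sketch of that protocol, reusing the hypothetical leader_cluster() and hamming() helpers from the earlier sketches; the sample format and threshold are assumptions.

# Sketch of the hold-one-class-out evaluation described above.
# train_samples / test_samples are lists of (class_label, flag_vector) pairs.

def evaluate_novelty(train_samples, test_samples, classes, threshold):
    per_class = []
    for held_out in classes:
        # train only on the classes we pretend to already know about
        known = [f for c, f in train_samples if c != held_out]
        centers = leader_cluster(known, threshold, distance=hamming)
        fp = fn = 0
        for c, f in test_samples:
            novel = min(hamming(f, ctr) for ctr in centers) >= threshold
            if novel and c != held_out:
                fp += 1      # known class wrongly flagged as novel
            if not novel and c == held_out:
                fn += 1      # held-out (truly novel) class that was missed
        per_class.append((held_out, fp, fn))
    return per_class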
18
Questions?