Malware Recognition with Binary Fingerprint
Final Meeting
Students: Tal Greenshpan & Offer Akrabi
Supervisors: Ben Herzog & Amir Mizrahi (CheckPoint)
Goals
- Build an automated classifier for new malware
  - Using static analysis methods
- Help reverse engineers classify new malware
  - Comparing new functions to known functions
Methodology
- Static analysis
  - PE files
- Research important features in function comparison
  - Reverse engineering
- Extract key features in order to identify resemblance between functions
  - Keep only key features
- Develop an algorithm to determine feature similarity
  - Compare functions
  - Feature contribution
Methodology
- Build a database of known functions
  - MSSQL
- Develop extractor and classifier
  - Python
  - IDAPython
- Testing
- Extra: GUI
Achievements
Decided on a set of features to be used to differentiate functions:
- Function size
- Number of API calls: all API calls made and the number of occurrences of each
- Register count: number of times the registers were accessed
- Memory count: number of times the memory was accessed
- Arguments count: number of arguments the function has
- Local variables size: number of local variables
Features from the Function Call Graph (generated by IDA):
- Number of nodes
- Min/max out-degree
- Min/max in-degree
- Min/max Well Connected Components size
- Ratio of out-degrees that are larger than 1
- Ratio of in-degrees that are larger than 1
- Ratio of Well Connected Components that are larger than 1
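The degree-based graph features listed above can be illustrated with a minimal sketch. It assumes the graph has already been exported from IDA as a list of (source, destination) edges; the function name `call_graph_features` and the dictionary keys are hypothetical, and the Well Connected Components features are left out because the slides do not define them.

```python
def call_graph_features(edges):
    """Degree-based features from a list of (source, destination) edges.

    Assumption: every node appears in at least one edge, so isolated
    nodes are not represented in this sketch.
    """
    out_deg = {}
    in_deg = {}
    nodes = set()
    for src, dst in set(edges):  # set() drops duplicate edges
        nodes.update((src, dst))
        out_deg[src] = out_deg.get(src, 0) + 1
        in_deg[dst] = in_deg.get(dst, 0) + 1

    outs = [out_deg.get(n, 0) for n in nodes]
    ins = [in_deg.get(n, 0) for n in nodes]
    count = len(nodes)
    return {
        "num_nodes": count,
        "min_out": min(outs, default=0),
        "max_out": max(outs, default=0),
        "min_in": min(ins, default=0),
        "max_in": max(ins, default=0),
        # Fraction of nodes whose out-degree / in-degree exceeds 1
        "ratio_out_gt1": sum(d > 1 for d in outs) / count if count else 0.0,
        "ratio_in_gt1": sum(d > 1 for d in ins) / count if count else 0.0,
    }


if __name__ == "__main__":
    sample = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
    print(call_graph_features(sample))
```

Returning a flat dictionary keeps each feature addressable by name, which would make it straightforward to store one row per function in a database table.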
Achievements
- Automated mass feature extraction
  - Low runtime complexity
- Created an algorithm to differentiate functions
  - Feature contribution
    - Standard deviation
    - Using the NumPy Python library
  - Distance algorithm: Contribution = -log(distance)
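One way to read the Contribution = -log(distance) rule is sketched below. The slides do not spell out the exact formula, so several parts are assumptions: each feature distance is taken as the absolute difference divided by that feature's standard deviation over the database, distances are floored at a hypothetical `EPS` to avoid log(0), and the resemblance score is the sum of per-feature contributions. The project used NumPy for the standard deviation; this sketch uses `statistics.pstdev` from the standard library, which computes the same population standard deviation as NumPy's default.

```python
import math
from statistics import pstdev

EPS = 1e-3  # hypothetical floor: identical values would otherwise give log(0)


def feature_std(database):
    """Per-feature standard deviation over all known functions (rows)."""
    # A constant feature has std 0; fall back to 1.0 to avoid dividing by zero.
    return [pstdev(column) or 1.0 for column in zip(*database)]


def resemblance(query, candidate, std):
    """Sum of -log(normalized distance) over all features (assumed scoring)."""
    total = 0.0
    for q, c, s in zip(query, candidate, std):
        distance = abs(q - c) / s
        total += -math.log(max(distance, EPS))
    return total


def best_match(query, database, std):
    """Index and score of the most similar known function."""
    scores = [resemblance(query, row, std) for row in database]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    known = [[10.0, 2.0, 3.0], [50.0, 8.0, 1.0], [12.0, 2.0, 3.0]]
    std = feature_std(known)
    print(best_match([10.0, 2.0, 3.0], known, std))
```

Under these assumptions, features that agree closely contribute a large positive value and features that differ by more than one standard deviation contribute a negative one, so a high total indicates a close match.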
Achievements
Successfully matched functions from actual malware samples!
Example
- Two very similar simple C++ malware-like programs
  - Different number of arguments
  - Different number of local variables
  - Different order of declaration
- Database containing about 2,500 functions
Perfect match: Resemblance = 34
[Figure: Function Call Graphs (generated by IDA) for the encryption function in twin1.exe and twin2.exe]
Live Demonstration
- Database containing about 1,000 functions
  - Suspected Zeus malware related files
  - Locky ransomware samples
- Analysis of a different Locky sample, not in the database
  - File analyzed: 0deb_U.exe
  - Function analyzed: sub_402743
Conclusions
- Efficient classification of functions with selected features
  - The first set of features we selected did not give sufficient results
  - Euclidean distance is not good enough to differentiate functions
- Good classification accuracy
- Run time complexity for very large databases could be problematic
  - Run time can be improved significantly, at a cost to accuracy
    - Removing only one feature
  - Most of the run time is spent calculating the contribution of each feature, so if the database is left unchanged there is no need to calculate it again, which saves a lot of time
  - Our run time complexity is O(n^2)
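The point about not recalculating when the database is unchanged can be sketched as a small cache. This is an illustration under assumptions, not the project's code: the per-feature statistic is taken to be a standard deviation over the database, and the class name `FeatureStatsCache` and the `version` argument (for example a modification counter maintained by whoever updates the database) are hypothetical.

```python
from statistics import pstdev


class FeatureStatsCache:
    """Recompute per-feature statistics only when the database version changes."""

    def __init__(self):
        self._version = None
        self._std = None
        self.recomputations = 0  # counts how often the expensive step ran

    def get(self, database, version):
        # Assumption: the caller bumps `version` whenever rows are added or changed.
        if version != self._version:
            self._std = [pstdev(column) or 1.0 for column in zip(*database)]
            self._version = version
            self.recomputations += 1
        return self._std


if __name__ == "__main__":
    cache = FeatureStatsCache()
    db = [[1.0, 2.0], [3.0, 2.0]]
    cache.get(db, version=1)
    cache.get(db, version=1)  # served from the cache, no recomputation
    print(cache.recomputations)
```

With a cache of this kind, classifying many new functions against the same database pays for the statistics once, and only the per-function comparison remains on each query.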
Thank you!