Variant: A Malware Similarity Testing Framework
Purpose
- Define a standard dataset for use in malware variant testing
- Align malware variant detection with the broader binary classification field
- Test current variant detection tools against the proposed solution
Setting
- Sources are not testing datasets: VirusTotal, AV vendors, open source, malware code
- Derivations of these sources make poor testing datasets
  - Testing against poor references (AV signatures)
  - Lack of breadth in code modification
Previous Work
- Variant detection papers: BitShred, TLSH, FRASH
  - Derived from AV signatures or code
  - Varied sources; not reproducible with accuracy
- File similarity: FRASH, CTPH
  - Not all variation is as complex as malware
Hypothesis
- A static, reproducible malware dataset based on human grouping will provide more critical testing of proposed variant detection engines
- Benefits: static, reproducible, results based on the best known classification
Findings
Deriving Datasets through Algorithms
- Selection of sets from malware sources via antivirus identification
- Varying source code: use available source code, vary by algorithm, compile
What Are We Testing Against?
- Reproducing AV signature results? Reproduction of a flawed system
- Detecting a few untested, constructed variant engines? What about real-world breadth?
- Can we reproduce the dataset for further testing and comparison?
Gold Standard Dataset
- Representative of real-world data: samples are real, wild malware
- Knowledge of the dataset derived from the best available source: manual analysis
- Tests enable reproduction for peer review and further comparison: the dataset is static and its information is known
Alignment with the Broader Field
- Malware variant detection is binary/statistical classification, yet the field has disparate measurements and terms
- Alignment of nomenclature enables:
  - Apples-to-apples comparison against other malware projects
  - Apples-to-apples comparison with broader statistical classification projects
  - Removal of ambiguous terms (e.g., accuracy)
Measurements
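The measurements used throughout align with the binary classification field: recall, precision, and F-measure over pairwise variant decisions. A minimal sketch of the standard definitions (the function names and example counts are illustrative, not from the Variant code):

```python
# Standard binary classification measurements, computed from
# true-positive, false-positive, and false-negative counts of
# pairwise "is this a variant?" decisions.

def precision(tp: int, fp: int) -> float:
    """Fraction of reported variant pairs that are true variants."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    """Fraction of true variant pairs that were reported."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (F1)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives
print(precision(80, 20))       # 0.8
print(recall(80, 40))          # ~0.667
print(f_measure(80, 20, 40))   # ~0.727
```

Unlike "accuracy", these measures are not inflated by the overwhelming number of true-negative pairs in a malware corpus, which is why the field alignment above drops accuracy as ambiguous.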
Dataset
- Group 1: Ziyang RAT – 12 samples
- Group 2: LinseningSvr – 19 samples
- Group 3: BeepService – 20 samples
- Group 4: SimpleFileMover – 13 samples
- Group 5: DD Keylogger – 5 samples
- Group 6: PUP – 10 samples
- Group 7: Unspecified Backdoor – 3 samples
- Group 8: SvcInstaller – 3 samples
Dataset (Cont.)
- Manually analyzed: best possible information
- Static: reproducible results
- Small, but can be grown
Candidate Solutions
- CTPH (fuzzy hash, ssdeep; as published): triggered, n-gram, raw input, pairwise comparisons
- TLSH (as published): selective, n-gram, raw input, LSH comparisons
- sdhash (as published): full, n-gram, raw input, pairwise comparisons
- BitShred (re-implemented): full, n-gram, section input, pairwise comparisons
- FirstByte (in house): selective, n-gram, normalized input, LSH comparisons
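All five candidates share the same skeleton: extract n-grams from some view of the binary, then compare signatures. A toy illustration of that common scheme (byte 4-grams compared with Jaccard similarity; this is not the published CTPH, TLSH, sdhash, BitShred, or FirstByte code):

```python
# Illustrative only: a minimal n-gram, pairwise similarity in the
# spirit of the candidate solutions. Extract byte 4-grams from each
# input and compare the resulting sets with Jaccard similarity.

def ngrams(data: bytes, n: int = 4) -> set:
    """All overlapping n-byte substrings of the input."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def jaccard(a: bytes, b: bytes, n: int = 4) -> float:
    """Set overlap of the two inputs' n-grams, in [0, 1]."""
    sa, sb = ngrams(a, n), ngrams(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# Identical inputs score 1.0; unrelated inputs score near 0.
print(jaccard(b"push ebp; mov ebp, esp", b"push ebp; mov ebp, esp"))  # 1.0
```

The candidates differ in which n-grams are kept (triggered, selective, or full), which input they see (raw bytes, sections, or normalized code), and how signatures are compared (pairwise or LSH).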
Limiting and Equal Footing in Measurements
- 2x2 options for FirstByte:
  - Recursive disassembly vs. linear sweep disassembly
  - Library filtering on/off
- Selection: linear, noLibs
  - Faster signature generation
  - Near the performance curve of R-noLib
Limiting and Equal Footing in Measurements: TLSH Bounding
- TLSH is a distance measurement, not a similarity
  - Its authors argue distance is the better approach
- Either change four other projects, or change TLSH
- The authors state a distance of 300 is very dissimilar
- Sim = (300 – Distance) / 3, with Sim < 0 bounded to 0
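The bounding above, written out as code: map a TLSH distance onto the 0–100 similarity scale the other tools use, clamping anything at or beyond distance 300 (which the TLSH authors call very dissimilar) to 0. The function name is illustrative:

```python
# Convert a TLSH distance to a bounded similarity score,
# per Sim = (300 - Distance) / 3, with negative values clamped to 0.

def tlsh_distance_to_similarity(distance: int) -> float:
    sim = (300 - distance) / 3
    return max(sim, 0.0)   # bound Sim < 0 to 0

print(tlsh_distance_to_similarity(0))    # 100.0 (identical)
print(tlsh_distance_to_similarity(150))  # 50.0
print(tlsh_distance_to_similarity(450))  # 0.0  (bounded)
```

This keeps TLSH comparable to the similarity-reporting tools without altering the other four projects.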
Fmeasure over Threshold
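An F-measure-over-threshold curve comes from sweeping a decision cutoff across the pairwise similarity scores and recording the F-measure at each cutoff. A minimal sketch with made-up scores and ground-truth labels (1 = same human-assigned group, 0 = different); names and data are illustrative, not from the Variant results:

```python
# Sweep a similarity threshold, compute F-measure at each setting,
# and locate the peak threshold.

def f_measure_at(scores, labels, threshold):
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and not l)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l)
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

def peak_threshold(scores, labels, thresholds=range(0, 101)):
    """Threshold (0-100 similarity scale) maximizing F-measure."""
    return max(thresholds, key=lambda t: f_measure_at(scores, labels, t))

# Hypothetical pairwise scores: same-group pairs tend to score high.
scores = [95, 90, 80, 40, 30, 10]
labels = [1, 1, 1, 0, 1, 0]
t = peak_threshold(scores, labels)
```

The ROC and peak recall/precision results on the following slides are reported at each tool's peak threshold found this way.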
ROC Curve at Peak Threshold
Peak Recall and Precision
Signature Generation Performance
Comparison Performance
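Comparison cost is where the pairwise and LSH designs diverge: all-pairs comparison of n signatures costs O(n²), while LSH-style bucketing compares only signatures that collide in a bucket. A toy illustration of the difference (not any published tool's implementation; the prefix-bucket "LSH key" is a stand-in for a real locality-sensitive hash):

```python
# Candidate-pair counts under all-pairs vs. LSH-style bucketing.
from collections import defaultdict
from itertools import combinations

def all_pairs(sigs):
    """O(n^2): every signature compared against every other."""
    return list(combinations(sigs, 2))

def lsh_pairs(sigs, key_len=2):
    """Compare only signatures whose bucket keys collide."""
    buckets = defaultdict(list)
    for s in sigs:
        buckets[s[:key_len]].append(s)   # toy LSH key: first bytes
    pairs = []
    for group in buckets.values():
        pairs.extend(combinations(group, 2))
    return pairs

sigs = ["aa01", "aa02", "bb03", "bb04", "cc05"]
print(len(all_pairs(sigs)))   # 10 candidate pairs
print(len(lsh_pairs(sigs)))   # 2 candidate pairs
```

This scaling gap is why, in the comparative results, slower signature generation can still be overcome by cheaper comparison.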
Conclusion – Inconsistencies
- A common dataset used in testing reveals inconsistencies in other tests
- Most easily attributed to dataset generation techniques
Conclusion – Reproducibility and Alignment
- The dataset can be reproduced exactly
- The dataset represents a Gold Standard approach: best known (human) information vs. the tested project
- Measurements are aligned with the greater binary classification field: recall, precision, Fmeasure
Conclusion – Comparative Results
- Top-ranked overall speed: TLSH
  - 2.5x slower signature generation than sdhash
  - n log n all-pairs easily overcomes the slower signature generation
- Top-ranked precision and recall: FirstByte
  - 42% better than BitShred in Fmeasure
  - 95% better than TLSH in Fmeasure
  - 365K signatures/node/day limit
- Top-ranked if signature generation is a concern (365K/node/day):
  - BitShred if n² is not a concern
  - TLSH if n² is a concern
Impact: Gold Standard
- A static, reproducible dataset based on human classification is superior for assessing malware variant detection techniques: it is more representative of the "best known" classification requirement of a Gold Standard Dataset
Broader Contributions
- Testing of malware variant binary classification to date is suspect due to inaccuracies
Summary and Conclusions
- Variant reveals much lower recall and precision scores across projects
- Variant is reproducible
- Variant represents an evaluation of the ability to reproduce human results
- Variant aligns the field with binary classification
- Variant tests multiple tools under the same conditions and can be used in future tests
Remaining Questions / Future Work
- Growing the dataset
- Open contributions
- Retesting of proposed works
Jason Upchurch Jason.R.Upchurch@Intel.com