Ensemble Learning for Low-level Hardware-supported Malware Detection
Khaled N. Khasawneh*, Meltem Ozsoy***, Caleb Donovick**, Nael Abu-Ghazaleh*, and Dmitry Ponomarev**
* University of California, Riverside   ** Binghamton University   *** Intel Corp.
RAID 2015 – Kyoto, Japan, November 2015
Malware Growth
- McAfee Labs reports over 350M malware programs in its malware zoo
- 387 new threats every minute
Malware Detection Analysis
Static analysis
- Searches for signatures in the executable
- Can detect all known malware programs with no false alarms
- Cannot detect metamorphic malware, polymorphic malware, or targeted attacks
Dynamic analysis
- Monitors the behavior of the program
- Can detect metamorphic malware, polymorphic malware, and targeted attacks
- Adds substantial overhead to the system and produces false positives
Two-Level Malware Detection Framework
Two-Level Malware Detection
- MAP was introduced by Ozsoy et al. (HPCA 2015)
- Explored a number of sub-semantic feature vectors
- Single hardware-supported detector
- Detects malware online (in real time)
- Two-stage detection
Contributions of This Work
- Better hardware malware detection using an ensemble of detectors, each specialized for one type of malware
- Metrics to measure the advantages of the two-level malware detection framework
Evaluation Methodology: Workloads, Features, Performance Measures
Data Set & Data Collection
Source of programs:
- Malware: MalwareDB, 2011-2014 (3,690 malware programs in total)
- Regular: Windows system binaries, plus other applications such as WinRAR, Notepad++, and Acrobat Reader

| Type     | Total | Training | Testing / Cross-Validation (each) |
|----------|-------|----------|-----------------------------------|
| Backdoor | 815   | 489      | 163                               |
| Rogue    | 685   | 411      | 137                               |
| PWS      | 558   | 335      | 111                               |
| Trojan   | 1123  | 673      | 225                               |
| Worm     | 473   | 283      | 95                                |
| Regular  | 554   | 332      | n/a (not shown on slide)          |

Dynamic trace collection:
- Windows 7 virtual machine; firewall and security services were all disabled
- Intel Pin was used to collect the features during execution
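The table implies a roughly 60% / 20% / 20% split of each program type into training, testing, and cross-validation sets. A minimal sketch of such a split, assuming scikit-learn and a simple random partition (the slide does not state the exact procedure):

```python
# Hypothetical 60/20/20 split per program type; the actual split
# procedure used in the paper is not stated on the slide.
from sklearn.model_selection import train_test_split

def split_60_20_20(samples, seed=0):
    # First carve off 60% for training, then halve the remainder
    # into testing and cross-validation (20% each).
    train, rest = train_test_split(samples, train_size=0.6, random_state=seed)
    test, cv = train_test_split(rest, train_size=0.5, random_state=seed)
    return train, test, cv

train, test, cv = split_60_20_20(list(range(815)))  # e.g., 815 backdoor samples
print(len(train), len(test), len(cv))               # 489 163 163
```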
Feature Space
Instruction mix:
- INS1: frequency of instruction categories
- INS2: frequency of most-variant opcodes
- INS3: presence of instruction categories
- INS4: presence of most-variant opcodes
Memory reference patterns:
- MEM1: histogram (counts) of memory address distances
- MEM2: binary (presence) of memory address distances
Architectural events:
- ARCH: total numbers of memory reads, memory writes, unaligned memory accesses, immediate branches, and taken branches
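To make the feature vectors concrete, here is a minimal sketch of how a few of them could be computed from a dynamic trace. The trace format (a list of (opcode, category, memory_address) tuples per execution window) and the distance bucketing are assumptions for illustration; in the paper these features are collected below the semantic level, in hardware or via Pin.

```python
# Illustrative feature extraction; the trace format and the power-of-two
# bucketing are assumptions, not taken from the paper.
from collections import Counter

def ins1_frequency(trace, categories):
    # INS1: frequency of each instruction category in the window.
    counts = Counter(cat for _, cat, _ in trace)
    return [counts[c] for c in categories]

def ins3_presence(trace, categories):
    # INS3: binary presence of each instruction category.
    seen = {cat for _, cat, _ in trace}
    return [int(c in seen) for c in categories]

def mem1_histogram(trace, num_bins=32):
    # MEM1: histogram of distances between consecutive memory addresses,
    # bucketed here by power of two.
    hist = [0] * num_bins
    addrs = [a for _, _, a in trace if a is not None]
    for prev, cur in zip(addrs, addrs[1:]):
        hist[min(abs(cur - prev).bit_length(), num_bins - 1)] += 1
    return hist

trace = [("mov", "data", 0x1000), ("add", "arith", None), ("mov", "data", 0x1040)]
print(ins1_frequency(trace, ["data", "arith", "ctrl"]))  # [2, 1, 0]
print(mem1_histogram(trace)[7])  # one distance of 0x40 lands in bin 7
```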
Detection Performance Measures
- Sensitivity: percentage of malware that was detected (true positive rate)
- Specificity: percentage of correctly classified regular programs (true negative rate)
- Receiver Operating Characteristic (ROC) curve: summarizes prediction performance across a range of detection thresholds
- Area Under the Curve (AUC): the traditional summary metric for an ROC curve
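A short sketch of these measures, using scikit-learn for the ROC/AUC part (the tooling choice is ours, not stated on the slide); labels use 1 = malware, 0 = regular:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])                   # toy labels
y_score = np.array([0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1])  # detector outputs

y_pred = (y_score >= 0.5).astype(int)            # one fixed detection threshold
sensitivity = np.mean(y_pred[y_true == 1])       # true positive rate
specificity = np.mean(1 - y_pred[y_true == 0])   # true negative rate

# The ROC curve sweeps the threshold; AUC summarizes it in one number.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(sensitivity, specificity, auc)
```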
Specializing the Detectors for Different Malware Types
Constructing Specialized Detectors
- Specialized detectors for each malware type were trained only with the data of that type
- Supervised learning with logistic regression was used
[Figure: the MEM1 specialized detectors]
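A minimal sketch of the training setup the slide describes: one logistic-regression detector per malware type, trained on that type's malware plus regular programs. The data-handling helpers are hypothetical; only the per-type training with logistic regression comes from the slide.

```python
from sklearn.linear_model import LogisticRegression

MALWARE_TYPES = ["Backdoor", "Rogue", "PWS", "Trojan", "Worm"]

def train_specialized(malware_features_by_type, regular_features):
    # One detector per type: label 1 for that type's malware only,
    # label 0 for regular programs; other malware types are excluded.
    detectors = {}
    for mtype in MALWARE_TYPES:
        X = malware_features_by_type[mtype] + regular_features
        y = [1] * len(malware_features_by_type[mtype]) + [0] * len(regular_features)
        detectors[mtype] = LogisticRegression(max_iter=1000).fit(X, y)
    return detectors
```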
General vs. Specialized Detectors (AUC; n/a = value not shown on the slide)

| Features | Detector    | Backdoor | PWS   | Rogue | Trojan | Worm  |
|----------|-------------|----------|-------|-------|--------|-------|
| INS1     | General     | 0.713    | 0.909 | 0.949 | 0.715  | 0.705 |
|          | Specialized | 0.892    | 0.962 | n/a   | 0.727  | 0.819 |
| INS2     | General     | 0.905    | 0.946 | 0.993 | 0.768  | 0.810 |
|          | Specialized | 0.895    | 0.954 | 0.976 | 0.782  | 0.984 |
| INS3     | General     | 0.837    | 0.924 | 0.527 | 0.761  | 0.840 |
|          | Specialized | 0.888    | n/a   | 0.991 | 0.808  | 0.852 |
| INS4     | General     | 0.866    | 0.868 | 0.914 | 0.788  | 0.830 |
|          | Specialized | 0.891    | 0.941 | n/a   | 0.798  | 0.869 |
| MEM1     | General     | 0.729    | 0.893 | 0.424 | 0.650  | n/a   |
|          | Specialized | n/a      | 0.961 | 0.921 | 0.867  | 0.871 |
| MEM2     | General     | 0.833    | 0.947 | 0.903 | 0.843  | n/a   |
|          | Specialized | n/a      | 0.979 | n/a   | n/a    | 0.931 |
| ARCH     | General     | 0.702    | 0.919 | 0.965 | 0.763  | 0.602 |
|          | Specialized | 0.686    | 0.942 | 0.970 | 0.795  | 0.560 |
Is There an Opportunity? (AUC)

| Type     | Best General (INS4) | Best Specialized per Type | Difference |
|----------|---------------------|---------------------------|------------|
| Backdoor | 0.8662              | 0.8956                    | 0.0294     |
| PWS      | 0.8684              | 0.9795                    | 0.1111     |
| Rogue    | 0.9149              | 0.9937                    | 0.0788     |
| Trojan   | 0.7887              | 0.8676                    | 0.0789     |
| Worm     | 0.8305              | 0.9842                    | 0.1537     |
| Average  | 0.8537              | 0.9441                    | 0.0904     |
Ensemble Detectors
Ensemble Learning
- Multiple diverse base detectors (different learning algorithms, different data sets)
- Combined to solve a problem
Decision Functions
- Or'ing
- High-confidence or'ing
Decision Functions (continued)
- Majority voting
- Stacking
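Sketches of the four decision functions over the base detectors' outputs. `probs` holds one malware probability per base detector; the 0.5 vote threshold and the 0.9 high-confidence threshold are illustrative assumptions, not values from the paper.

```python
from sklearn.linear_model import LogisticRegression

def oring(probs, thresh=0.5):
    # Or'ing: flag if ANY base detector flags the program.
    return any(p >= thresh for p in probs)

def high_confidence_oring(probs, confident=0.9):
    # High-confidence or'ing: flag only if some base detector is very
    # confident, trading sensitivity for fewer false alarms.
    return any(p >= confident for p in probs)

def majority_voting(probs, thresh=0.5):
    # Majority voting: flag if more than half of the detectors flag.
    return sum(p >= thresh for p in probs) > len(probs) / 2

def train_stacker(base_outputs, labels):
    # Stacking: a meta-classifier learns how to combine the base
    # detectors' outputs; each row of base_outputs is one program's
    # vector of base-detector probabilities.
    return LogisticRegression().fit(base_outputs, labels)
```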
Ensemble Detectors
- General ensemble: combines multiple general detectors (the best of INS, MEM, and ARCH)
- Specialized ensemble: combines the best specialized detector for each malware type
- Mixed ensemble: combines the best general detector with the best specialized detectors from the same feature vector
Offline Detection Effectiveness (n/a = value not shown on the slide)

| Detector             | Decision Function | Sensitivity | Specificity | Accuracy |
|----------------------|-------------------|-------------|-------------|----------|
| Best General         | -                 | 82.4%       | 89.3%       | 85.1%    |
| General Ensemble     | Or'ing            | 99.1%       | 13.3%       | 65.0%    |
|                      | High Confidence   | 80.7%       | 92.0%       | n/a      |
|                      | Majority Voting   | 83.3%       | 92.1%       | 86.7%    |
|                      | Stacking          | n/a         | 96.0%       | 86.8%    |
| Specialized Ensemble | Or'ing            | 100%        | 5%          | 51.3%    |
|                      | High Confidence   | 94.4%       | 94.7%       | 94.5%    |
|                      | Stacking          | n/a         | 95.8%       | 95.9%    |
| Mixed Ensemble       | Or'ing            | 84.2%       | 70.6%       | 78.8%    |
|                      | Stacking          | n/a         | 81.3%       | 82.5%    |
Online Detection Effectiveness
- A decision is made after every 10,000 committed instructions
- An Exponentially Weighted Moving Average (EWMA) filters out transient false alarms

| Detector                        | Sensitivity | Specificity | Accuracy |
|---------------------------------|-------------|-------------|----------|
| Best General                    | 84.2%       | 86.6%       | 85.1%    |
| General Ensemble (Stacking)     | 77.1%       | 94.6%       | 84.1%    |
| Specialized Ensemble (Stacking) | 92.9%       | 92.0%       | 92.3%    |
| Mixed Ensemble (Stacking)       | 85.5%       | 90.1%       | 87.4%    |
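A minimal sketch of the online decision path, assuming per-window malware scores and illustrative values for the EWMA smoothing factor and alarm threshold (the slide gives neither):

```python
def online_alarm(window_scores, alpha=0.2, alarm_thresh=0.6):
    # window_scores: one detector score in [0, 1] per 10,000
    # committed-instruction window. The EWMA smooths the per-window
    # decisions so isolated spikes do not raise a false alarm.
    ewma = 0.0
    for score in window_scores:
        ewma = alpha * score + (1 - alpha) * ewma
        if ewma >= alarm_thresh:
            return True  # sustained malicious behavior detected
    return False
```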
Metrics to Assess the Relative Performance of the Two-Level Detection Framework
Metrics
- Work advantage
- Time advantage
- Detection performance
Time & Work Advantage Results
[Figures: Time Advantage and Work Advantage]
Hardware Implementation
Physical design overhead (ensemble vs. single general detector):
- Area: 2.8% (ensemble) vs. 0.3% (general)
- Power: 1.5% (ensemble) vs. 0.1% (general)
- Cycle time: 9.8% (ensemble) vs. 1.9% (general)
Conclusions & Future Work
- Ensemble learning with specialized detectors can significantly improve detection performance
- Hardware complexity increases, but several optimizations are still possible
- Some features are complex to collect; simpler features may carry the same information
Future work:
- Demonstrate a fully functional system
- Study how attackers could evolve, including adversarial machine learning
Thank you! Questions?