CISC Machine Learning for Solving Systems Problems. Presented by: Akanksha Kaul, Dept. of Computer & Information Sciences, University of Delaware. Paper: "SBMDS: an interpretable string based malware detection system using SVM ensemble with bagging" by Yanfang Ye, Lifei Chen, Dingding Wang, Tao Li, Qingshan Jiang, Min Zhao.
Motivation: There is an urgent need to detect malicious executables. Major threats: metamorphic executables, which reprogram themselves and can be capable of infecting two operating systems; polymorphic executables, which disguise themselves as non-malicious code; and previously unseen executables.
Need of the hour: SBMDS, the String Based Malware Detection System. What is this system about? It performs interpretable string analysis. Interpretable strings are the strings in a program, covering both API execution calls and important semantic strings, that represent the intent and goal of the program's writer.
What is an interpretable string? Example from the worm “Nimda”: “html script language = ‘javascript’ window.open(‘readme.eml’)”. Another example: “&gameid= %s&pass=%s; myparentthreadid=%d; myguid=%s”. Not all strings are interpretable, e.g. “!0&0h0m0o0t0y0” and “*3d%3dtgyhjij”.
Major steps: 1. Construct the interpretable strings by developing a feature parser. 2. Perform feature selection to select informative strings. 3. Use an SVM ensemble with bagging to construct the classifier. 4. Run the malware detector, which also predicts the exact type of the malware.
Step 1: Develop the feature parser. 39,838 executables were collected from the Kingsoft Anti-Virus lab. All executables are PE files. The parser extracts static features: API calls from the import table, and strings carrying semantic interpretation.
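The string half of the feature parser can be illustrated with a minimal sketch. This is not the authors' parser: the helper name `extract_strings`, the minimum length of 4, and the sample bytes are all assumptions made for illustration, and reading API calls from the import table (which would need a real PE parser) is not covered.

```python
import re
from typing import List

# Hypothetical helper (not from the paper): pull runs of printable ASCII
# characters out of raw file bytes. The default minimum length of 4 is an
# assumed cut-off, chosen only to drop very short noise.
def extract_strings(data: bytes, min_len: int = 4) -> List[str]:
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

# Made-up bytes that embed two of the example strings from the slides
# between non-printable bytes. Short runs such as "MZ" and "ab" are dropped.
sample = b"\x00\x01MZ\x90\x00window.open('readme.eml')\x00\xff&gameid=%s&pass=%s\x00ab\x00"
strings = extract_strings(sample)
print(strings)
```

A scan like this returns every printable run, interpretable or not, which is why a separate selection step is needed afterwards.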
Sample (Backdoor-Redgirl.exe): the string fragment “‘%s’ goto delete” typically indicates that the malware generates a “.bat” file to delete itself.
Step 2: Feature selection. Selects only the interpretable strings from the huge set of strings obtained in the previous step, and assigns these strings as the signatures of the PE files.
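The slides do not name the selection criterion, so the following is only a sketch under an assumption: it ranks candidate strings by information gain with respect to the malicious/benign label, a common stand-in for "informative". The function names and the four toy files are hypothetical.

```python
import math
from typing import List, Set, Tuple

# A file is modelled as (set of extracted strings, label),
# with label 1 = malicious and 0 = benign.
File = Tuple[Set[str], int]

def entropy(pos: int, neg: int) -> float:
    """Binary entropy (in bits) of a group with pos/neg label counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def label_entropy(labels: List[int]) -> float:
    return entropy(sum(labels), len(labels) - sum(labels))

def information_gain(files: List[File], feature: str) -> float:
    """Reduction in label entropy from knowing whether `feature` is present."""
    present = [label for strings, label in files if feature in strings]
    absent = [label for strings, label in files if feature not in strings]
    n = len(files)
    return (label_entropy([label for _, label in files])
            - len(present) / n * label_entropy(present)
            - len(absent) / n * label_entropy(absent))

def select_top_k(files: List[File], k: int) -> List[str]:
    """Return the k strings with the highest gain (ties broken by name)."""
    vocabulary = set().union(*(strings for strings, _ in files))
    return sorted(vocabulary, key=lambda f: (-information_gain(files, f), f))[:k]

# Toy data, invented for illustration only.
files = [
    ({"goto delete", "CreateFileA"}, 1),
    ({"goto delete", "readme.eml"}, 1),
    ({"CreateFileA", "About"}, 0),
    ({"About"}, 0),
]
selected = select_top_k(files, k=2)
print(selected)
```

Information gain also rewards strings that mark benign files, so a real system may want to filter on which class a selected string points to.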
Step 3: Using SVM to classify. Why SVM? SVMs have shown state-of-the-art results on classification problems. Problem: the training complexity of an SVM depends on the size of the data set.
Problem: training accuracy becomes constant once the size of the data set reaches 3,000.
Curse of dimensionality? The problem caused by the exponential increase in the volume of the feature space as dimensions are added. How does SVM deal with the curse of dimensionality? Solution: by using an SVM ensemble with bagging. What are an SVM ensemble and bagging?
3.1 SVM ensemble with bagging. An ensemble is a set of classifiers whose individual decisions are combined in some way to classify new samples. The bagging technique (Bootstrap AGGregating) is applied to the training set: each classifier is trained on a uniform random sample of the training data, drawn with replacement.
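The bagging step can be sketched as follows. The function names, the `seed` parameter, and the placeholder trainer are assumptions for illustration; in SBMDS the `train` callable would fit an SVM on each bootstrap sample, which this sketch does not do.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")  # type of one training example
M = TypeVar("M")  # type of one trained model

def bootstrap_sample(data: Sequence[T], rng: random.Random) -> List[T]:
    """Draw len(data) examples uniformly at random, with replacement."""
    return rng.choices(data, k=len(data))

def train_bagged_ensemble(
    data: Sequence[T],
    train: Callable[[List[T]], M],
    n_models: int,
    seed: int = 0,
) -> List[M]:
    """Train n_models base learners, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train(bootstrap_sample(data, rng)) for _ in range(n_models)]

# Placeholder "trainer" that just returns the sorted sample it was given,
# so the effect of resampling is visible. A real trainer would fit an SVM.
data = list(range(10))
models = train_bagged_ensemble(data, train=sorted, n_models=5, seed=42)
for sample in models:
    print(sample)
```

Because sampling is with replacement, each sample typically repeats some examples and omits others, which is what makes the base learners differ from one another.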
3.2 Multi-classification. There are various classes of malware. Majority voting combines the individual classifiers' decisions; when two different classes receive the same number of votes, the class with the smallest index is chosen. Class indices: 0 = benign files, 1 = backdoors, 2 = spyware, 3 = Trojans, 4 = worms.
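The voting rule on this slide, majority voting with ties going to the smallest class index, can be sketched in a few lines. The function name `majority_vote` and the example votes are illustrative assumptions; the class indices are the ones listed on the slide.

```python
from collections import Counter
from typing import Sequence

CLASS_NAMES = {0: "Benign file", 1: "Backdoor", 2: "Spyware", 3: "Trojan", 4: "Worm"}

def majority_vote(predictions: Sequence[int]) -> int:
    """Return the most frequent class index; ties go to the smallest index.

    Raises ValueError if predictions is empty.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)

# Five hypothetical base classifiers vote; classes 1 and 3 tie with two
# votes each, so the smaller index (1) is chosen.
votes = [3, 1, 3, 4, 1]
winner = majority_vote(votes)
print(winner, CLASS_NAMES[winner])
```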
Step 4: Malware detection. Unknown malware variants are used as input. The detector decides whether a file is malicious and, if so, to which malware class it belongs.
System architecture: 1. Feature Parser 2. Feature Selection 3. SVM Ensemble Classifier 4. Malware Detector
Reason why I chose this paper: comparisons with popular anti-virus software. Points of comparison: 1. Detecting known variants of malware. 2. Detecting unknown variants. 3. Efficiency (detection time). 4. Number of false-positive detections.
Detecting Known Variants
Detecting Unknown Variants
Efficiency (Detection Time)
Number of False Positives
Conclusion: This system has already been incorporated into the scanning tool of a commercial anti-virus product. The product's name is not disclosed.
Questions?
All's Well That Ends Well. Thank you.