Anomaly Detection in Data Science One-class Classification with Privileged Information for Malware Detection Pavel Erofeev, IITP RAS, Airbus Group Russia
Find the Panda
Anomaly Detection: Hadlum vs Hadlum. The birth of a child to Mrs. Hadlum happened 349 days after Mr. Hadlum left for military service. The average human pregnancy period is 280 days (40 weeks). Statistically, 349 days is an outlier.
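The Hadlum case can be worked through as a simple z-score calculation. The sketch below uses the 280-day mean and the 349-day observation from the slide. The standard deviation `SIGMA_DAYS` is an illustrative assumption, not a figure from the talk, and the Gaussian model is likewise only one possible choice.

```python
from statistics import NormalDist

MEAN_DAYS = 280.0      # average pregnancy length, from the slide
OBSERVED_DAYS = 349.0  # the Hadlum observation, from the slide
SIGMA_DAYS = 10.0      # ASSUMPTION: illustrative standard deviation, not from the slide


def z_score(x, mean, sigma):
    """Number of standard deviations between x and the mean."""
    return (x - mean) / sigma


z = z_score(OBSERVED_DAYS, MEAN_DAYS, SIGMA_DAYS)

# Upper-tail probability of seeing a value at least this large
# under the assumed Gaussian model of nominal pregnancies.
tail = 1.0 - NormalDist(mu=MEAN_DAYS, sigma=SIGMA_DAYS).cdf(OBSERVED_DAYS)

print(f"deviation = {OBSERVED_DAYS - MEAN_DAYS:.0f} days, z = {z:.1f}, tail = {tail:.2e}")
```

Under this assumed spread the observation lies 69 days, or about 6.9 standard deviations, above the mean. A larger assumed `SIGMA_DAYS` would shrink the z-score accordingly.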
An outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism. Hawkins, 1980
Defining Anomaly Detection. Digital representation: vectors describing observations. The data is a mixture of “nominal” and “abnormal” points. Anomaly points are generated by a different generative process than the nominal points.
Possible Settings in CS. Supervised (known attacks): training data are labeled “nominal” or “anomaly”. Clean (zero-day attacks): training data are all “nominal”; test data may be contaminated with “anomaly” points. Unsupervised (unknown attacks): training data consist of a mixture of “nominal” and “anomaly” points.
Real World Data Problems. Data is multivariate. There is usually more than one generating mechanism underlying the “normal” data. Anomalies may represent a different class of objects, so there are many of them. What counts as an anomaly is domain specific. Normality evolves in time.
Anomaly Taxonomy Point Anomaly
Anomaly Taxonomy Contextual Anomaly
Anomaly Taxonomy Causal Anomaly
Taxonomy
Imbalanced classification. Normal data: a lot of samples. Abnormal data: very few. Standard methods do not work as expected. Standard metrics do not apply.
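To illustrate why standard metrics mislead on imbalanced data, here is a minimal sketch with a made-up label set of 990 normal and 10 abnormal samples and a trivial detector that never raises an alarm. The class sizes and the detector are assumptions chosen for illustration only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) treating `positive` as the anomaly label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fp, fn, tn):
    # Fraction of true anomalies that were detected; 0.0 if there are none.
    return tp / (tp + fn) if (tp + fn) else 0.0


# ASSUMPTION: hypothetical data, 990 normal (0) and 10 abnormal (1) samples.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000  # trivial detector: everything is "normal"

counts = confusion_counts(y_true, y_pred)
acc = accuracy(*counts)
rec = recall(*counts)
print(f"accuracy = {acc:.2f}, recall on anomalies = {rec:.2f}")
```

The detector scores 99% accuracy while finding none of the anomalies, which is why recall, precision, or ranking-based metrics are usually reported instead.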
Imbalanced classification. Weights for classes: proved not to be helpful in most cases. Resampling methods: oversampling (bootstrap, SMOTE, etc.) and undersampling. How to choose which method to use? How to choose the resampling parameter? We compared several methods. We proposed a meta-model that on average gives the best results [Papanov, Erofeev, Burnaev, 2015].
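The two simplest resampling schemes named on the slide can be sketched as follows. This is only random (bootstrap) oversampling and random undersampling, not SMOTE and not the meta-model from the cited paper. The function names and the `ratio` parameter (target minority-to-majority size ratio) are hypothetical choices made for this example.

```python
import random


def random_oversample(majority, minority, ratio=1.0, seed=0):
    """Bootstrap the minority class (sampling with replacement) until
    len(minority) is about ratio * len(majority)."""
    rng = random.Random(seed)
    target = int(round(ratio * len(majority)))
    extra = [rng.choice(minority) for _ in range(max(0, target - len(minority)))]
    return list(majority), list(minority) + extra


def random_undersample(majority, minority, ratio=1.0, seed=0):
    """Drop majority samples at random (without replacement) until
    len(minority) is about ratio * len(majority)."""
    rng = random.Random(seed)
    target = min(len(majority), int(round(len(minority) / ratio)))
    return rng.sample(list(majority), target), list(minority)


# ASSUMPTION: toy data, 100 "normal" items and 5 "abnormal" items.
normal = list(range(100))
abnormal = ["a", "b", "c", "d", "e"]

over_major, over_minor = random_oversample(normal, abnormal, ratio=0.5)
under_major, under_minor = random_undersample(normal, abnormal, ratio=0.5)
print(len(over_major), len(over_minor))    # majority kept, minority grown
print(len(under_major), len(under_minor))  # majority shrunk, minority kept
```

Here `ratio` plays the role of the resampling parameter whose choice the slide asks about; in practice it would be tuned on validation data.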
Statistics-based models. Assumption on the normal data generation procedure (e.g. Gaussian distribution). PCA is a method commonly used to extract the highest-variance linear combinations of features in the data. PCA-based anomaly detection is good for highly correlated environments.
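One common way to turn PCA into an anomaly detector is to score each point by its reconstruction error, that is, its distance from the subspace spanned by the leading components. The sketch below is a minimal pure-Python version that keeps a single component and finds it by power iteration (swapped in here for a full eigendecomposition). The synthetic correlated data and all names are assumptions for illustration.

```python
import math
import random


def top_component(rows, iters=200):
    """Return (mean, v): the data mean and the leading principal direction,
    found by power iteration on the sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def reconstruction_error(x, mean, v):
    """Distance from x to its projection onto the line mean + t * v."""
    c = [xi - mi for xi, mi in zip(x, mean)]
    proj = sum(ci * vi for ci, vi in zip(c, v))
    resid = [ci - proj * vi for ci, vi in zip(c, v)]
    return math.sqrt(sum(r * r for r in resid))


# ASSUMPTION: synthetic, highly correlated nominal data (y is about 2 * x).
rng = random.Random(0)
nominal = []
for _ in range(200):
    t = rng.gauss(0.0, 1.0)
    nominal.append([t, 2.0 * t + rng.gauss(0.0, 0.05)])

mean, v = top_component(nominal)
typical = reconstruction_error([1.0, 2.0], mean, v)    # follows the correlation
anomaly = reconstruction_error([1.0, -2.0], mean, v)   # breaks the correlation
print(f"typical = {typical:.3f}, anomaly = {anomaly:.3f}")
```

The second point has unremarkable coordinates taken one at a time but violates the correlation structure, so its reconstruction error is large. This is the sense in which the approach suits highly correlated data.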
Density-based models. SVM-based and nearest-neighbour-based. How to choose the best kernel parameter?
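A nearest-neighbour detector can be sketched in a few lines: the anomaly score of a point is its distance to the k-th nearest training point, so points in sparse regions score high. This is one simple variant, not necessarily the one used in the talk. The training points and the value of `k` are assumptions for illustration.

```python
import math


def knn_score(x, train, k=3):
    """Distance from x to its k-th nearest training point.
    Larger values indicate lower local density, i.e. more anomalous."""
    dists = sorted(math.dist(x, t) for t in train)
    return dists[k - 1]


# ASSUMPTION: a tiny hand-made cluster of "nominal" points.
train = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]

inside = knn_score((0.05, 0.0), train)
outside = knn_score((3.0, 3.0), train)
print(f"inside = {inside:.3f}, outside = {outside:.3f}")
```

The neighbourhood size `k` plays a role similar to the kernel parameter in SVM-based models: it sets the scale at which density is measured and has to be chosen somehow, which is the question the slide raises.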
One-class SVM with Privileged Information. Evgeny Burnaev, Dmitry Smolyakov. Skoltech, IITP RAS
One-Class SVM
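The equations from these slides did not survive in the text. For reference, the standard one-class SVM of Schölkopf et al. separates the mapped training points $\Phi(x_1), \dots, \Phi(x_n)$ from the origin with maximal margin. The notation below is the textbook one and may differ from what was shown in the talk.

```latex
\begin{aligned}
\min_{w,\,\xi,\,\rho}\quad & \frac{1}{2}\lVert w\rVert^{2}
    + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho \\
\text{s.t.}\quad & \langle w, \Phi(x_i)\rangle \ge \rho - \xi_i,
    \qquad \xi_i \ge 0, \qquad i = 1,\dots,n, \\[4pt]
f(x) &= \operatorname{sign}\bigl(\langle w, \Phi(x)\rangle - \rho\bigr).
\end{aligned}
```

Here $\nu \in (0, 1]$ upper-bounds the fraction of training points treated as outliers and lower-bounds the fraction of support vectors, and $f(x) = -1$ marks a point as anomalous.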
One-Class SVM Kernel Trick
Kernel Trick
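The kernel trick can be checked numerically on a small case: for the homogeneous degree-2 polynomial kernel $k(x, y) = \langle x, y\rangle^2$ in two dimensions, the kernel value equals an ordinary inner product after the explicit map $\varphi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. The choice of this particular kernel is an assumption made for illustration; the slides do not say which kernel was shown.

```python
import math


def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: (x . y)^2."""
    return sum(a * b for a, b in zip(x, y)) ** 2


def poly2_features(x):
    """Explicit feature map for 2-D inputs matching poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x, y = (1.0, 2.0), (3.0, 4.0)
via_kernel = poly2_kernel(x, y)                            # computed in input space
via_features = dot(poly2_features(x), poly2_features(y))   # computed in feature space
print(via_kernel, via_features)
```

Both routes give the same number, but the kernel never builds the feature vectors. For kernels such as the Gaussian (RBF) kernel the feature space is infinite-dimensional, so the kernel route is the only practical one.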
Hyper-parameter Influence
Decision Functions
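The effect of the hyper-parameters and the shape of the decision function can be explored with scikit-learn's `OneClassSVM`. The sketch below assumes an RBF kernel and synthetic Gaussian training data; neither the data nor the parameter grid comes from the talk.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 2))  # ASSUMPTION: synthetic "nominal" data


def fit_ocsvm(nu, gamma):
    """Fit a one-class SVM with an RBF kernel on the training data."""
    return OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_train)


def flagged_fraction(model):
    """Fraction of training points predicted as outliers (label -1)."""
    return float(np.mean(model.predict(X_train) == -1))


# nu controls how many training points may fall outside the boundary;
# gamma controls how tightly the boundary wraps around the data.
for nu in (0.05, 0.3):
    for gamma in (0.1, 1.0):
        m = fit_ocsvm(nu, gamma)
        print(f"nu={nu} gamma={gamma} flagged={flagged_fraction(m):.2f} "
              f"support_vectors={len(m.support_)}")

# decision_function is positive inside the learned region, negative outside.
model = fit_ocsvm(nu=0.1, gamma=0.5)
queries = np.array([[0.0, 0.0], [6.0, 6.0]])
scores = model.decision_function(queries)
labels = model.predict(queries)
print(scores, labels)
```

Raising `nu` flags a larger share of the training set, and the far-away query point receives a negative score and the label `-1`.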
Learning with Privileged Info Example: Image classification with textual description
Learning with Privileged Info
Microsoft Malware Classification Challenge Kaggle.com competition data (2015)
Problem Description. 9 malware families: Ramnit, Lollipop, Kelihos ver3, Vundo, Simda, Tracur, Kelihos ver1, Obfuscator.ACY, Gatak. Raw data: hexadecimal representation of the raw binary content; meta-data extracted from the binaries, including function calls, strings, etc.
Features. Original features: information from binary files, such as frequencies of bytes, number of different N-grams, etc. Privileged features: information from the disassembled code, such as frequencies of commands and number of calls to external DLLs. Bytecode as an image: features based on image texture, which are commonly used for image classification.
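The byte-level features mentioned above can be sketched as follows: a normalised histogram of byte values and a count of distinct byte N-grams. The function names and the 16-byte hex sample are hypothetical. A real pipeline would read the hexadecimal dump of each binary rather than a literal string.

```python
from collections import Counter


def byte_frequencies(data: bytes):
    """Relative frequency of each of the 256 possible byte values."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return [counts.get(b, 0) / total for b in range(256)]


def distinct_ngrams(data: bytes, n=2):
    """Number of different byte n-grams occurring in the data."""
    return len({data[i:i + n] for i in range(len(data) - n + 1)})


# ASSUMPTION: a short illustrative hex string, not taken from the dataset.
sample = bytes.fromhex("4d5a90000300000004000000ffff0000")

freqs = byte_frequencies(sample)
bigrams = distinct_ngrams(sample, n=2)
print(f"share of 0x00 bytes = {freqs[0x00]:.4f}, distinct 2-grams = {bigrams}")
```

The 256-dimensional frequency vector and the N-gram counts for a few values of `n` could then be concatenated into the original feature vector for each file.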
Features
Experimental Setup
Results
Thanks! Any questions? pavel.erofeev@phystech.edu