Data Analysis
Mark Stamp
Topics
Experimental design
o Training set, test set, n-fold cross validation, thresholding, imbalance, etc.
Accuracy
o False positive, false negative, etc.
ROC curves
o Area under the ROC curve (AUC)
o Partial AUC (sometimes written as AUC_p)
Objective
Assume that we have a proposed method for detecting malware
We want to determine how well it performs on a specific dataset
o We want to quantify effectiveness
Ideally, compare to previous work
o But, often difficult to directly compare
Comparisons to AV products?
Basic Assumptions
We have a set of known malware
o All from a single (metamorphic) “family”…
o …or, at least all of a similar type
o For broader “families”, more difficult
Also, a representative non-family set
o Often, assumed to be benign files
o The more diverse, the more difficult
Much depends on problem specifics
Experimental Design
Want to test a malware detection score
o Refer to the malware dataset as the match set
o And the benign dataset as the nomatch set
Partition match set into…
o Training set, used to determine parameters of the scoring function
o Test set, reserved to test the scoring function generated from the training set
Note: Cannot test on the training set
Training and Scoring
Two phases: Training and scoring
Training phase
o Train a model using the training set
Scoring phase
o Score data in the test set and score the nomatch (benign) set
Analyze results from the scoring phase
o Assume representative of the general case
Scatterplots
Train a model on the training set
Apply score to test and nomatch sets
o Can visualize result as a scatterplot
[Figure: scatterplot of score vs. test case, showing match scores and nomatch scores]
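As a rough illustration, the following Python sketch plots match and nomatch scores against their test-case index; the score values here are made up purely for the example.

```python
import matplotlib.pyplot as plt

# Hypothetical scores, for illustration only
match_scores = [0.91, 0.87, 0.95, 0.78, 0.90]
nomatch_scores = [0.12, 0.33, 0.25, 0.41, 0.18]

plt.scatter(range(len(match_scores)), match_scores, marker="o", label="match scores")
plt.scatter(range(len(nomatch_scores)), nomatch_scores, marker="x", label="nomatch scores")
plt.xlabel("test case")
plt.ylabel("score")
plt.legend()
plt.show()
```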
Experimental Design
A couple of potential problems…
o How to partition the match set?
o How to get the most out of a limited data set?
Why are these things concerns?
o When we partition the match set, might get biased training/test sets, and…
o …more data points is “more better”
Cross validation solves these problems
n-fold Cross Validation
Partition match set into n equal subsets
o Denote subsets as S1, S2, …, Sn
Let training set be S2 ∪ S3 ∪ … ∪ Sn
o And test set is S1
Repeat with training set S1 ∪ S3 ∪ … ∪ Sn
o And test set S2
And so on, for each of n “folds”
o In our work, we usually select n = 5
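A minimal Python sketch of this folding scheme is below; train_fn and score_fn are hypothetical placeholders for whatever detector is being evaluated.

```python
def cross_validate(match_samples, train_fn, score_fn, n=5):
    """Score every match sample using n-fold cross validation.

    train_fn(train_set) -> model and score_fn(model, sample) -> float are
    placeholders for the detector under test.
    """
    folds = [match_samples[i::n] for i in range(n)]    # n roughly equal subsets S1..Sn
    scores = []
    for i in range(n):
        test_set = folds[i]                            # hold out S_i as the test set
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train_set)                    # train on the union of the other folds
        scores.extend(score_fn(model, x) for x in test_set)
    return scores                                      # each match sample scored exactly once
```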
n-fold Cross Validation
Benefits of cross validation?
Any bias in match data smoothed out
o Since bias only affects one/few of the Si
Obtain lots more match scores
o Usually, no shortage of nomatch data
o But match data can be very limited
And it’s easy to do, so why not?
o Best of all, it sounds so fancy…
Thresholding
Threshold based on test vs nomatch
o After training and scoring phases
Ideal is complete separation
o I.e., no overlap in scatterplot
o Usually, that doesn’t happen
o So, where to set the threshold?
In practical use, thresholding critical
o At research stage, more of a distraction
Thresholding
Where to set threshold?
o Left case is easy, right case, not so much
[Figure: two scatterplots of score vs. test case, one with well-separated scores, one with overlapping scores]
Quantifying Success
We need a way to quantify “better”
o Ideas?
[Figure: two scatterplots of score vs. test case]
Accuracy
Given a scatterplot and a threshold…
We have the following 4 cases
o True positive: correctly classified as +
o False positive: incorrectly classified as +
o True negative: correctly classified as −
o False negative: incorrectly classified as −
TP, FP, TN, FN, respectively, for short
o Append “R” to each for “rate”
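One way to picture these four cases is the small Python sketch below, which counts them for a given threshold under the assumed convention that higher scores indicate malware.

```python
def confusion_counts(match_scores, nomatch_scores, threshold):
    """Count TP, FP, TN, FN for one threshold (higher score = classified as malware)."""
    TP = sum(s >= threshold for s in match_scores)     # malware correctly flagged
    FN = sum(s < threshold for s in match_scores)      # malware missed
    TN = sum(s < threshold for s in nomatch_scores)    # benign correctly passed
    FP = sum(s >= threshold for s in nomatch_scores)   # benign incorrectly flagged
    return TP, FP, TN, FN
```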
Sensitivity and Specificity
The TPR is also known as sensitivity, while the TNR is known as specificity
Consider a medical test
o Sensitivity is the percentage of sick people who “pass” the test (as they should)
o Specificity is the percentage of healthy people who “fail” the test (as they should)
Inherent tradeoff between TPR/TNR
o Note that these depend on threshold
Accuracy
Let P be number of positive cases tested and N number of negative cases tested
o Note: P is size of test set, N size of nomatch set
o Also, P = TP + FN and N = TN + FP
Finally, Accuracy = (TP + TN) / (P + N)
o Note that accuracy ranges from 0 to 1
o Accuracy of 1 is the ideal case
o Accuracy 0? Don’t give up your day job…
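In code, the accuracy formula might look like this short sketch, reusing the TP, FP, TN, FN counts from the earlier confusion_counts sketch.

```python
def accuracy(TP, FP, TN, FN):
    """Accuracy = (TP + TN) / (P + N), with P = TP + FN and N = TN + FP."""
    return (TP + TN) / ((TP + FN) + (TN + FP))
```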
Balanced Accuracy
Often, there is a large imbalance between test set and nomatch set
o Test set is small relative to nomatch set
Define
Balanced accuracy = (TPR + TNR) / 2 = 0.5 (TP/P) + 0.5 (TN/N)
o Errors on both sets weighted the same
Consider imbalance issue again later
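A corresponding sketch for balanced accuracy, using the same counts:

```python
def balanced_accuracy(TP, FP, TN, FN):
    """Average of TPR (sensitivity) and TNR (specificity); each set weighted equally."""
    tpr = TP / (TP + FN)
    tnr = TN / (TN + FP)
    return (tpr + tnr) / 2
```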
Accuracy
Accuracy tells us something…
o But it depends on where the threshold is set
o How should we set the threshold?
o Seems we are going around in circles, like a dog chasing its tail
Bottom line? Still don’t have a good way to compare different techniques
o Next slide, please…
ROC Curves
Receiver Operating Characteristic
o Originated from electrical engineering
o But now widely used in many fields
What is an ROC curve?
o Plot TPR vs FPR by varying the threshold through the range of scores
o That is, FPR on x-axis, TPR on y-axis
o Equivalently, sensitivity vs 1 – specificity
o What the … ?
ROC Curve
Suppose threshold is set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 1.0
o FPR = 1.0 – TNR = 1.0 – 0.0 = 1.0
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
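The next several slides repeat this computation for higher and higher thresholds; a small Python sketch of a single (FPR, TPR) point, again assuming higher scores indicate malware, is:

```python
def roc_point(match_scores, nomatch_scores, threshold):
    """One (FPR, TPR) point of the ROC curve for a single threshold."""
    tpr = sum(s >= threshold for s in match_scores) / len(match_scores)
    fpr = sum(s >= threshold for s in nomatch_scores) / len(nomatch_scores)
    return fpr, tpr
```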
ROC Curve
Suppose threshold set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 1.0
o FPR = 1.0 – TNR = 1.0 – 0.2 = 0.8
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
ROC Curve
Suppose threshold set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 1.0
o FPR = 1.0 – TNR = 1.0 – 0.4 = 0.6
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
ROC Curve
Suppose threshold set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 1.0
o FPR = 1.0 – TNR = 1.0 – 0.6 = 0.4
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
ROC Curve
Suppose threshold set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 0.8
o FPR = 1.0 – TNR = 1.0 – 0.6 = 0.4
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
ROC Curve
Suppose threshold set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 0.6
o FPR = 1.0 – TNR = 1.0 – 0.6 = 0.4
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
ROC Curve
Suppose threshold set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 0.6
o FPR = 1.0 – TNR = 1.0 – 0.8 = 0.2
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
ROC Curve
Suppose threshold set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 0.4
o FPR = 1.0 – TNR = 1.0 – 0.8 = 0.2
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
ROC Curve
Suppose threshold set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 0.4
o FPR = 1.0 – TNR = 1.0 – 1.0 = 0.0
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
ROC Curve
Suppose threshold set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 0.2
o FPR = 1.0 – TNR = 1.0 – 1.0 = 0.0
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
ROC Curve
Suppose threshold set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 0.0
o FPR = 1.0 – TNR = 1.0 – 1.0 = 0.0
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
ROC Curve
Connect the dots…
This is the ROC curve
What good is it?
o Captures info wrt all possible thresholds
o Removes threshold as a factor in the analysis
What does it all mean?
[Figure: ROC plot of TPR vs. FPR]
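“Connecting the dots” amounts to sweeping the threshold through every observed score. A possible Python sketch, using the same match/nomatch score lists as before:

```python
def roc_points(match_scores, nomatch_scores):
    """Trace the ROC curve by sweeping the threshold through every observed score."""
    thresholds = sorted(set(match_scores) | set(nomatch_scores), reverse=True)
    points = [(0.0, 0.0)]                               # threshold above all scores
    for t in thresholds:
        tpr = sum(s >= t for s in match_scores) / len(match_scores)
        fpr = sum(s >= t for s in nomatch_scores) / len(nomatch_scores)
        points.append((fpr, tpr))
    return points                                       # ends at (1.0, 1.0): everything positive
```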
ROC Curve
Random classifier?
o Yellow 45 degree line
Perfect classifier?
o Red lines (Why?)
Above 45 degree line?
o Better than random
o The closer to the red, the closer to perfect
[Figure: ROC plot of TPR vs. FPR, with 45 degree line and perfect-classifier lines]
Area Under the Curve (AUC)
ROC curve lives within a 1x1 square
Random classifier?
o AUC ≈ 0.5
Perfect classifier (red)?
o AUC = 1.0
Example curve (blue)?
o AUC = 0.8
[Figure: ROC plot of TPR vs. FPR]
Area Under the Curve (AUC)
Area under ROC curve quantifies success
o 0.5 like flipping a coin
o 1.0 perfection achieved
AUC of ROC curve
o Enables us to compare different techniques
o And no need to worry about threshold
[Figure: ROC plot of TPR vs. FPR]
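Given the (FPR, TPR) points from the previous sketch, one way to approximate the AUC is the trapezoid rule:

```python
def auc(points):
    """Area under the ROC curve by the trapezoid rule, given (FPR, TPR) points."""
    pts = sorted(points)                                # order by FPR (then TPR)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```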
Partial AUC
Might only consider cases where FPR < p
“Partial” AUC is AUC_p
o Area up to FPR of p
o Normalized by p
In this example,
o AUC_0.4 = 0.2 / 0.4 = 0.5
o AUC_0.2 = 0.08 / 0.2 = 0.4
[Figure: ROC plot of TPR vs. FPR]
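A sketch of the partial AUC, restricting the same trapezoid sum to FPR ≤ p and normalizing by p; for simplicity this assumes the curve has a point exactly at FPR = p, otherwise the last segment would need to be interpolated.

```python
def partial_auc(points, p):
    """AUC restricted to FPR <= p, normalized by p (assumes a point at FPR = p)."""
    pts = sorted(q for q in points if q[0] <= p)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / p
```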
Imbalance Problem
Suppose we train a model for a given malware family
In practice, we expect to score many more non-family files than family files
o Number of negative cases is large
o Number of positive cases is small
So what? Let’s consider an example
Imbalance Problem
In practice, we need a threshold
For a given threshold, suppose sensitivity = 0.99, specificity = 0.98
o Then TPR = 0.99 and FPR = 0.02
Assume 1 in 1000 tested is malware
o Of the type our model is trained to detect
Suppose we scan, say, 100k files
o What do we find?
Imbalance Problem
Assuming TPR = 0.99 and FPR = 0.02
o And 1 in 1000 is malware
After scanning 100k files…
o Detect 99 of 100 actual malware (TP)
o Misclassify 1 malware as benign (FN)
o Correctly classify 97902 (out of 99900) benign as benign (TN)
o Misclassify 1998 benign as malware (FP)
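The arithmetic behind these numbers, spelled out in a few lines of Python:

```python
# 100,000 scanned files, 1-in-1000 malware prevalence, TPR = 0.99, FPR = 0.02
files = 100_000
malware = files // 1000          # 100 actual malware samples
benign = files - malware         # 99,900 benign files

TP = round(0.99 * malware)       # 99 malware detected
FN = malware - TP                # 1 malware missed
FP = round(0.02 * benign)        # 1998 benign files flagged as malware
TN = benign - FP                 # 97902 benign files correctly classified
print(TP, FN, FP, TN)            # 99 1 1998 97902
```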
Imbalance Problem
We have 97903 classified as benign
o Of those, 97902 are actually benign
o And 97902/97903 > 0.9999
We classified 2097 as malware
o Of these, only 99 are actual malware
o But 99/2097 < 0.05
Remember the “boy who cried wolf”?
o Here, we have a detector that cries wolf…
Imbalance Solution?
What to do? There is an inherent tradeoff between sensitivity and specificity
Suppose we can adjust the threshold so
o TPR = 0.92 and FPR = 0.0003
As before…
o We have 1 in 1000 is malware
o And we test 100k files
Imbalance Solution?
Assuming TPR = 0.92 and FPR = 0.0003
o And 1 in 1000 is malware
After scanning 100k files…
o Detect 92 of 100 actual malware (TP)
o Misclassify 8 malware as benign (FN)
o Correctly classify 99870 (out of 99900) benign as benign (TN)
o Misclassify 30 benign as malware (FP)
Imbalance Solution?
We have 99878 classified as benign
o Of those, all but 8 are actually benign
o And 99870/99878 > 0.9999
We classified 122 as malware
o Of these, 92 are actual malware
o And 92/122 > 0.75
Can adjust threshold to further reduce the “crying wolf” effect
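The “crying wolf” comparison can be summarized by the precision at each operating point, that is, the fraction of flagged files that really are malware; a quick sketch:

```python
def precision(tpr, fpr, prevalence=0.001):
    """Fraction of flagged files that are actually malware, at a given prevalence."""
    return (tpr * prevalence) / (tpr * prevalence + fpr * (1 - prevalence))

print(precision(0.99, 0.02))     # ~0.047: the detector "cries wolf"
print(precision(0.92, 0.0003))   # ~0.754: most alarms are now real malware
```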
Imbalance Problem
A better alternative? Instead of lowering TPR to reduce FPR…
o Perform secondary testing on files that are initially classified as malware
o We can thus weed out most FP cases
This gives us the best of both worlds
o High TPR, and few benign files ultimately reported as malware
No free lunch, so what’s the cost?
Bottom Line
Design your experiments properly
o Use n-fold cross validation (e.g., n = 5)
o Generally, cross validation is important
Thresholding is important in practice
o But not so useful for analyzing results
o Accuracy not so informative either
Use ROC curves and compute AUC
o Sometimes, partial AUC is better
Imbalance problem may be a significant issue
References
A.P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition, 30:1145–1159, 1997