Data Analysis
Mark Stamp
Topics
Experimental design
o Training set, test set, n-fold cross validation, thresholding, imbalance, etc.
Accuracy
o False positive, false negative, etc.
ROC curves
o Area under the ROC curve (AUC)
o Partial AUC (sometimes written as AUC_p)
Objective
Assume that we have a proposed method for detecting malware
We want to determine how well it performs on a specific dataset
o We want to quantify effectiveness
Ideally, compare to previous work
o But, often difficult to directly compare
Comparisons to AV products?
Basic Assumptions
We have a set of known malware
o All from a single (metamorphic) “family”…
o …or, at least all of a similar type
o For broader “families”, more difficult
Also, a representative non-family set
o Often, assumed to be benign files
o The more diverse, the more difficult
Much depends on problem specifics
Experimental Design
Want to test a malware detection score
o Refer to the malware dataset as the match set
o And the benign dataset as the nomatch set
Partition match set into…
o Training set, used to determine parameters of the scoring function
o Test set, reserved to test the scoring function generated from the training set
Note: Cannot test on the training set
Training and Scoring
Two phases: Training and scoring
Training phase
o Train a model using the training set
Scoring phase
o Score data in the test set and score the nomatch (benign) set
Analyze results from the scoring phase
o Assume representative of the general case
Scatterplots
Train a model on the training set
Apply score to test and nomatch sets
o Can visualize result as a scatterplot
[Figure: scatterplot of score vs. test case, showing match scores and nomatch scores]
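As a rough illustration, the following Python sketch plots match and nomatch scores against their test-case index; the score values here are made up purely for the example.

```python
import matplotlib.pyplot as plt

# Hypothetical scores, for illustration only
match_scores = [0.91, 0.87, 0.95, 0.78, 0.90]
nomatch_scores = [0.12, 0.33, 0.25, 0.41, 0.18]

plt.scatter(range(len(match_scores)), match_scores, marker="o", label="match scores")
plt.scatter(range(len(nomatch_scores)), nomatch_scores, marker="x", label="nomatch scores")
plt.xlabel("test case")
plt.ylabel("score")
plt.legend()
plt.show()
```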
Experimental Design
A couple of potential problems…
o How to partition the match set?
o How to get the most out of a limited data set?
Why are these things concerns?
o When we partition the match set, might get biased training/test sets, and…
o …more data points is “more better”
Cross validation solves these problems
n-fold Cross Validation
Partition match set into n equal subsets
o Denote subsets as S1, S2, …, Sn
Let training set be S2 ∪ S3 ∪ … ∪ Sn
o And test set is S1
Repeat with training set S1 ∪ S3 ∪ … ∪ Sn
o And test set S2
And so on, for each of n “folds”
o In our work, we usually select n = 5
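A minimal Python sketch of this folding scheme is below; train_fn and score_fn are hypothetical placeholders for whatever detector is being evaluated.

```python
def cross_validate(match_samples, train_fn, score_fn, n=5):
    """Score every match sample using n-fold cross validation.

    train_fn(train_set) -> model and score_fn(model, sample) -> float are
    placeholders for the detector under test.
    """
    folds = [match_samples[i::n] for i in range(n)]    # n roughly equal subsets S1..Sn
    scores = []
    for i in range(n):
        test_set = folds[i]                            # hold out S_i as the test set
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train_set)                    # train on the union of the other folds
        scores.extend(score_fn(model, x) for x in test_set)
    return scores                                      # each match sample scored exactly once
```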
n-fold Cross Validation
Benefits of cross validation?
Any bias in match data smoothed out
o Since bias only affects one/few of the Si
Obtain lots more match scores
o Usually, no shortage of nomatch data
o But match data can be very limited
And it’s easy to do, so why not?
o Best of all, it sounds so fancy…
Thresholding
Threshold based on test vs nomatch
o After training and scoring phases
Ideal is complete separation
o I.e., no overlap in scatterplot
o Usually, that doesn’t happen
o So, where to set the threshold?
In practical use, thresholding critical
o At research stage, more of a distraction
Thresholding
Where to set threshold?
o Left case is easy, right case, not so much
[Figure: two scatterplots of score vs. test case, one with well-separated scores, one with overlapping scores]
Quantifying Success
We need a way to quantify “better”
o Ideas?
[Figure: two scatterplots of score vs. test case]
Accuracy
Given a scatterplot and a threshold…
We have the following 4 cases
o True positive: correctly classified as +
o False positive: incorrectly classified as +
o True negative: correctly classified as −
o False negative: incorrectly classified as −
TP, FP, TN, FN, respectively, for short
o Append “R” to each for “rate”
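One way to picture these four cases is the small Python sketch below, which counts them for a given threshold under the assumed convention that higher scores indicate malware.

```python
def confusion_counts(match_scores, nomatch_scores, threshold):
    """Count TP, FP, TN, FN for one threshold (higher score = classified as malware)."""
    TP = sum(s >= threshold for s in match_scores)     # malware correctly flagged
    FN = sum(s < threshold for s in match_scores)      # malware missed
    TN = sum(s < threshold for s in nomatch_scores)    # benign correctly passed
    FP = sum(s >= threshold for s in nomatch_scores)   # benign incorrectly flagged
    return TP, FP, TN, FN
```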
Sensitivity and Specificity
The TPR is also known as sensitivity, while the TNR is known as specificity
Consider a medical test
o Sensitivity is the percentage of sick people who “pass” the test (as they should)
o Specificity is the percentage of healthy people who “fail” the test (as they should)
Inherent tradeoff between TPR/TNR
o Note that these depend on threshold
Accuracy
Let P be number of positive cases tested and N number of negative cases tested
o Note: P is size of test set, N size of nomatch set
o Also, P = TP + FN and N = TN + FP
Finally, Accuracy = (TP + TN) / (P + N)
o Note that accuracy ranges from 0 to 1
o Accuracy of 1 is the ideal case
o Accuracy 0? Don’t give up your day job…
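In code, the accuracy formula might look like this short sketch, reusing the TP, FP, TN, FN counts from the earlier confusion_counts sketch.

```python
def accuracy(TP, FP, TN, FN):
    """Accuracy = (TP + TN) / (P + N), with P = TP + FN and N = TN + FP."""
    return (TP + TN) / ((TP + FN) + (TN + FP))
```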
Balanced Accuracy
Often, there is a large imbalance between test set and nomatch set
o Test set is small relative to nomatch set
Define
Balanced accuracy = (TPR + TNR) / 2 = 0.5 (TP/P) + 0.5 (TN/N)
o Errors on both sets weighted the same
Consider imbalance issue again later
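A corresponding sketch for balanced accuracy, using the same counts:

```python
def balanced_accuracy(TP, FP, TN, FN):
    """Average of TPR (sensitivity) and TNR (specificity); each set weighted equally."""
    tpr = TP / (TP + FN)
    tnr = TN / (TN + FP)
    return (tpr + tnr) / 2
```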
Accuracy
Accuracy tells us something…
o But it depends on where the threshold is set
o How should we set the threshold?
o Seems we are going around in circles, like a dog chasing its tail
Bottom line? Still don’t have a good way to compare different techniques
o Next slide, please…
ROC Curves
Receiver Operating Characteristic
o Originated from electrical engineering
o But now widely used in many fields
What is an ROC curve?
o Plot TPR vs FPR by varying the threshold through the range of scores
o That is, FPR on x-axis, TPR on y-axis
o Equivalently, sensitivity vs 1 – specificity
o What the … ?
ROC Curve
Suppose threshold is set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 1.0
o FPR = 1.0 – TNR = 1.0 – 0.0 = 1.0
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
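The next several slides repeat this computation for higher and higher thresholds; a small Python sketch of a single (FPR, TPR) point, again assuming higher scores indicate malware, is:

```python
def roc_point(match_scores, nomatch_scores, threshold):
    """One (FPR, TPR) point of the ROC curve for a single threshold."""
    tpr = sum(s >= threshold for s in match_scores) / len(match_scores)
    fpr = sum(s >= threshold for s in nomatch_scores) / len(nomatch_scores)
    return fpr, tpr
```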
ROC Curve
Suppose threshold set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 1.0
o FPR = 1.0 – TNR = 1.0 – 0.2 = 0.8
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
ROC Curve
Suppose threshold set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 1.0
o FPR = 1.0 – TNR = 1.0 – 0.4 = 0.6
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
ROC Curve
Suppose threshold set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 1.0
o FPR = 1.0 – TNR = 1.0 – 0.6 = 0.4
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
ROC Curve
Suppose threshold set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 0.8
o FPR = 1.0 – TNR = 1.0 – 0.6 = 0.4
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
ROC Curve
Suppose threshold set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 0.6
o FPR = 1.0 – TNR = 1.0 – 0.6 = 0.4
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
ROC Curve
Suppose threshold set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 0.6
o FPR = 1.0 – TNR = 1.0 – 0.8 = 0.2
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
ROC Curve
Suppose threshold set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 0.4
o FPR = 1.0 – TNR = 1.0 – 0.8 = 0.2
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
ROC Curve
Suppose threshold set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 0.4
o FPR = 1.0 – TNR = 1.0 – 1.0 = 0.0
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
ROC Curve
Suppose threshold set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 0.2
o FPR = 1.0 – TNR = 1.0 – 1.0 = 0.0
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
ROC Curve
Suppose threshold set at yellow line
o Above yellow, classified as positive
o Below yellow is negative
In this case,
o TPR = 0.0
o FPR = 1.0 – TNR = 1.0 – 1.0 = 0.0
[Figure: scatterplot with threshold line; ROC plot of TPR vs. FPR]
ROC Curve
Connect the dots…
This is the ROC curve
What good is it?
o Captures info wrt all possible thresholds
o Removes threshold as a factor in the analysis
What does it all mean?
[Figure: ROC plot of TPR vs. FPR]
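“Connecting the dots” amounts to sweeping the threshold through every observed score. A possible Python sketch, using the same match/nomatch score lists as before:

```python
def roc_points(match_scores, nomatch_scores):
    """Trace the ROC curve by sweeping the threshold through every observed score."""
    thresholds = sorted(set(match_scores) | set(nomatch_scores), reverse=True)
    points = [(0.0, 0.0)]                               # threshold above all scores
    for t in thresholds:
        tpr = sum(s >= t for s in match_scores) / len(match_scores)
        fpr = sum(s >= t for s in nomatch_scores) / len(nomatch_scores)
        points.append((fpr, tpr))
    return points                                       # ends at (1.0, 1.0): everything positive
```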
ROC Curve
Random classifier?
o Yellow 45 degree line
Perfect classifier?
o Red lines (Why?)
Above 45 degree line?
o Better than random
o The closer to the red, the closer to perfect
[Figure: ROC plot of TPR vs. FPR, with 45 degree line and perfect-classifier lines]
Area Under the Curve (AUC)
ROC curve lives within a 1x1 square
Random classifier?
o AUC ≈ 0.5
Perfect classifier (red)?
o AUC = 1.0
Example curve (blue)?
o AUC = 0.8
[Figure: ROC plot of TPR vs. FPR]
Area Under the Curve (AUC)
Area under ROC curve quantifies success
o 0.5 like flipping a coin
o 1.0 perfection achieved
AUC of ROC curve
o Enables us to compare different techniques
o And no need to worry about threshold
[Figure: ROC plot of TPR vs. FPR]
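Given the (FPR, TPR) points from the previous sketch, one way to approximate the AUC is the trapezoid rule:

```python
def auc(points):
    """Area under the ROC curve by the trapezoid rule, given (FPR, TPR) points."""
    pts = sorted(points)                                # order by FPR (then TPR)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```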
Partial AUC
Might only consider cases where FPR < p
“Partial” AUC is AUC_p
o Area up to FPR of p
o Normalized by p
In this example,
o AUC_0.4 = 0.2 / 0.4 = 0.5
o AUC_0.2 = 0.08 / 0.2 = 0.4
[Figure: ROC plot of TPR vs. FPR]
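A sketch of the partial AUC, restricting the same trapezoid sum to FPR ≤ p and normalizing by p; for simplicity this assumes the curve has a point exactly at FPR = p, otherwise the last segment would need to be interpolated.

```python
def partial_auc(points, p):
    """AUC restricted to FPR <= p, normalized by p (assumes a point at FPR = p)."""
    pts = sorted(q for q in points if q[0] <= p)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / p
```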
Imbalance Problem
Suppose we train a model for a given malware family
In practice, we expect to score many more non-family files than family files
o Number of negative cases is large
o Number of positive cases is small
So what? Let’s consider an example
Imbalance Problem
In practice, we need a threshold
For a given threshold, suppose sensitivity = 0.99, specificity = 0.98
o Then TPR = 0.99 and FPR = 0.02
Assume 1 in 1000 tested is malware
o Of the type our model is trained to detect
Suppose we scan, say, 100k files
o What do we find?
Imbalance Problem
Assuming TPR = 0.99 and FPR = 0.02
o And 1 in 1000 is malware
After scanning 100k files…
o Detect 99 of 100 actual malware (TP)
o Misclassify 1 malware as benign (FN)
o Correctly classify 97902 (out of 99900) benign as benign (TN)
o Misclassify 1998 benign as malware (FP)
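The arithmetic behind these numbers, spelled out in a few lines of Python:

```python
# 100,000 scanned files, 1-in-1000 malware prevalence, TPR = 0.99, FPR = 0.02
files = 100_000
malware = files // 1000          # 100 actual malware samples
benign = files - malware         # 99,900 benign files

TP = round(0.99 * malware)       # 99 malware detected
FN = malware - TP                # 1 malware missed
FP = round(0.02 * benign)        # 1998 benign files flagged as malware
TN = benign - FP                 # 97902 benign files correctly classified
print(TP, FN, FP, TN)            # 99 1 1998 97902
```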
Imbalance Problem
We have 97903 classified as benign
o Of those, 97902 are actually benign
o And 97902/97903 > 0.9999
We classified 2097 as malware
o Of these, only 99 are actual malware
o But 99/2097 < 0.05
Remember the “boy who cried wolf”?
o Here, we have a detector that cries wolf…
Imbalance Solution?
What to do? There is an inherent tradeoff between sensitivity and specificity
Suppose we can adjust the threshold so
o TPR = 0.92 and FPR = 0.0003
As before…
o We have 1 in 1000 is malware
o And we test 100k files
Imbalance Solution?
Assuming TPR = 0.92 and FPR = 0.0003
o And 1 in 1000 is malware
After scanning 100k files…
o Detect 92 of 100 actual malware (TP)
o Misclassify 8 malware as benign (FN)
o Correctly classify 99870 (out of 99900) benign as benign (TN)
o Misclassify 30 benign as malware (FP)
Imbalance Solution?
We have 99878 classified as benign
o Of those, all but 8 are actually benign
o And 99870/99878 > 0.9999
We classified 122 as malware
o Of these, 92 are actual malware
o And 92/122 > 0.75
Can adjust threshold to further reduce the “crying wolf” effect
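The “crying wolf” comparison can be summarized by the precision at each operating point, that is, the fraction of flagged files that really are malware; a quick sketch:

```python
def precision(tpr, fpr, prevalence=0.001):
    """Fraction of flagged files that are actually malware, at a given prevalence."""
    return (tpr * prevalence) / (tpr * prevalence + fpr * (1 - prevalence))

print(precision(0.99, 0.02))     # ~0.047: the detector "cries wolf"
print(precision(0.92, 0.0003))   # ~0.754: most alarms are now real malware
```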
Imbalance Problem
A better alternative? Instead of lowering TPR to reduce FPR…
o Perform secondary testing on files that are initially classified as malware
o We can thus weed out most FP cases
This gives us the best of both worlds
o High TPR, and few benign files ultimately reported as malware
No free lunch, so what’s the cost?
Bottom Line
Design your experiments properly
o Use n-fold cross validation (e.g., n = 5)
o Generally, cross validation is important
Thresholding is important in practice
o But not so useful for analyzing results
o Accuracy not so informative either
Use ROC curves and compute AUC
o Sometimes, partial AUC is better
Imbalance problem may be a significant issue
References
A.P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition, 30:1145–1159, 1997