Decision Model Validation
- Discriminant functions are trained on a finite set of data. How much fitting should we do? What should the model's dimension be?
- The model must be used to identify a piece of evidence (data) it was not trained with.
- Accurate estimates of a decision model's error rates are critical in forensic science applications.
- The simplest is the apparent error rate: the error rate on the training set. A lousy estimate, but better than nothing.
Decision Model Validation
- Cross-validation: systematically hold out chunks of the data set for testing.
- Most common: hold-one-out (leave-one-out) CV:
  1. Omit a data vector from X
  2. Train the model
  3. Classify the held-out observation
  4. Repeat for all data vectors
- Simple, but gives a good estimate. Lots of literature backs up its efficacy. (A sketch follows below.)
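A minimal sketch of the hold-one-out procedure above, assuming a data frame of feature vectors X and a vector of group labels grp, and using lda() from MASS purely as a stand-in decision model (any classifier could be dropped in):

```r
# Hold-one-out cross-validation sketch
library(MASS)   # lda() used here only as a stand-in decision model

hoo_cv_error <- function(X, grp) {
  n <- nrow(X)
  pred <- character(n)
  for (i in seq_len(n)) {
    fit <- lda(X[-i, , drop = FALSE], grouping = grp[-i])               # 1-2: omit obs. i, train
    pred[i] <- as.character(predict(fit, X[i, , drop = FALSE])$class)   # 3: classify held-out obs.
  }
  mean(pred != as.character(grp))   # 4: repeat for all; report the estimated error rate
}

# e.g. hoo_cv_error(iris[, 1:4], iris$Species)   # iris used as a stand-in data set
```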
Decision Model Validation
- c-fold cross-validation: hold out data chunks of size c. Can become time consuming, and typically performance is not much better than simple HOO-CV.
- Caution! If the decision model is sensitive to group sizes (e.g. CVA), cross-validation may not work well. You should have, at the very least, 5 replicates per group. DON'T ARGUE WITH ME!!!!!!!!!!
Decision Model Validation
- Bootstrap: make up data sets from randomly selected observation vectors (with replacement).
- A bootstrap sample is the same size as X, so you'll get repeats.
  1. Train a decision model with the bootstrapped set, i.e. the decision rules are built with the bootstrapped data set. The model should not be sensitive to repeated observations! CVA is out!!!!
  2. Test the model with the original X and compute the error.
Decision Model Validation
  3. Test the model with the bootstrapped data set X* and compute the error (counting each observation vector as many times as it occurs in X*).
  4. Repeat 1-3 for B bootstrap samples, with B large.
  5. Compute the average "optimism": the mean over the B samples of (error on X minus error on X*).
  6. Compute the "refined" bootstrap error rate: apparent error rate + average optimism.
(A sketch follows below.)
*Now Exercise: Explore some data sets with boostrap.R and cv_boot_testset.R
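A sketch of the refined bootstrap estimate just described, with lda() again standing in for the decision model; the course's own versions are in boostrap.R and cv_boot_testset.R:

```r
# Refined bootstrap error-rate sketch (X: data frame of features, grp: group labels)
library(MASS)

refined_boot_error <- function(X, grp, B = 200) {
  err <- function(fit, dat, labs)
    mean(as.character(predict(fit, dat)$class) != as.character(labs))

  fit_all  <- lda(X, grouping = grp)
  apparent <- err(fit_all, X, grp)             # apparent (training-set) error rate

  optimism <- numeric(B)
  for (b in seq_len(B)) {
    idx   <- sample(nrow(X), replace = TRUE)   # bootstrap sample, same size as X (repeats!)
    fit_b <- lda(X[idx, , drop = FALSE], grouping = grp[idx])
    # optimism: error on the original X minus error on the bootstrap set itself
    optimism[b] <- err(fit_b, X, grp) - err(fit_b, X[idx, , drop = FALSE], grp[idx])
  }
  apparent + mean(optimism)                    # "refined" bootstrap error rate
}
```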
Probabilities
- t is a test for association between:
  - x_unk, data from an unknown (could be from the crime scene, could be from the suspect), and
  - a group of data from a source (could be from the suspect, could be from the crime scene).
- ANY decision rule output by a pattern recognition program can be considered a test for association.
Probabilities
- Codes:
  - t+ / t-: the test indicates inclusion / exclusion
  - S+ / S-: the evidence is / is not associated with a source
- Four probabilities are of interest.
- Probability that a test yields a positive association given that there is truly an association between evidence and a source:
  Pr(t+ | S+) = probability of a true positive (TP) = true positive rate (TPR) = probability of a true inclusion = sensitivity
- TPR is very important for forensic applications!
Probabilities
- Probability that a test yields a positive association given that there is truly no association between evidence and a source:
  Pr(t+ | S-) = probability of a false positive (FP) = false positive rate (FPR) = probability of a false inclusion
- FPR is very important for forensic applications!
- In traditional hypothesis testing, the FPR is sometimes called the significance level (the Type I error rate).
- 1 - FPR = specificity (TNR): the rate at which true exclusions are correctly excluded.
Probabilities
- Probability that a test yields a negative association given that there is truly no association between evidence and a source:
  Pr(t- | S-) = probability of a true negative (TN) = true negative rate (TNR) = probability of a true exclusion = specificity
- TNR estimates may be the most useful (and trustworthy) numbers that come out of applications of probability to physical evidence...
Probabilities
- Probability that a test yields a negative association given that there is truly an association between evidence and a source:
  Pr(t- | S+) = probability of a false negative (FN) = false negative rate (FNR) = probability of a false exclusion
- In traditional hypothesis testing, the FNR is sometimes called the Type II error rate.
- 1 - FNR = sensitivity (TPR): the rate at which true inclusions are correctly included.
Probabilities
Summary:
- Test indicates an inclusion (t+) and an association truly exists (S+): True Positive Rate
- Test indicates an inclusion (t+) and an association truly does not exist (S-): False Positive Rate (Type I error)
- Test indicates an exclusion (t-) and an association truly exists (S+): False Negative Rate (Type II error)
- Test indicates an exclusion (t-) and an association truly does not exist (S-): True Negative Rate
- 1 minus the false negative rate is called the test's power.
- Remember, these are all only ESTIMATES!
(A small worked example follows below.)
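A small worked example of the four rates, computed from a hypothetical 2x2 validation table (the counts are invented purely for illustration):

```r
# Hypothetical validation counts: rows = test result, columns = ground truth
counts <- matrix(c(90,  5,    # t+ : true positives, false positives
                   10, 95),   # t- : false negatives, true negatives
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("t+", "t-"), c("S+", "S-")))

TPR <- counts["t+", "S+"] / sum(counts[, "S+"])   # sensitivity
FPR <- counts["t+", "S-"] / sum(counts[, "S-"])   # Type I error rate
TNR <- counts["t-", "S-"] / sum(counts[, "S-"])   # specificity
FNR <- counts["t-", "S+"] / sum(counts[, "S+"])   # Type II error rate
c(TPR = TPR, FPR = FPR, TNR = TNR, FNR = FNR)     # 0.90, 0.05, 0.95, 0.10
```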
Bayes' Rule Again
- Much more difficult to objectively estimate, but of more interest in legal applications:
- Probability that an association exists given that a test indicates an association, also called the positive predictive value (PV+):
  Pr(S+ | t+) = Pr(t+ | S+) Pr(S+) / [ Pr(t+ | S+) Pr(S+) + Pr(t+ | S-) Pr(S-) ]
  where Pr(S+) is the prior probability that there is an association between evidence and a source.
- Probability that no association exists given that a test indicates an association:
  Pr(S- | t+) = 1 - Pr(S+ | t+)
Probabilities
- Dividing these, we get the "famous" (positive) likelihood ratio LR+:
  LR+ = Pr(t+ | S+) / Pr(t+ | S-) = TPR / FPR
- Odds form of Bayes' rule: LR+ links the prior odds to the posterior odds,
  Pr(S+ | t+) / Pr(S- | t+) = LR+ x [ Pr(S+) / Pr(S-) ]
  i.e. (posterior odds in favor of association, given the test indicates inclusion) = LR+ x (prior odds in favor of association).
(A small numeric sketch follows below.)
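Continuing the hypothetical rates above, a short sketch of LR+ and the odds form of Bayes' rule; the prior probability used here is purely illustrative:

```r
# LR+ and the odds form of Bayes' rule, with the hypothetical rates from above
TPR <- 0.90; FPR <- 0.05
LRplus <- TPR / FPR                 # Pr(t+|S+) / Pr(t+|S-) = 18

prior_prob <- 0.01                  # illustrative prior Pr(S+); NOT from the slides
prior_odds <- prior_prob / (1 - prior_prob)
post_odds  <- LRplus * prior_odds   # posterior odds of association given t+
post_prob  <- post_odds / (1 + post_odds)
round(c(LRplus = LRplus, post_prob = post_prob), 3)   # LR+ = 18, Pr(S+|t+) ~ 0.154
```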
Probabilities
- LR+ interpretations:
  - The ratio of the probability the test indicates inclusion given a true association vs. the probability the test indicates inclusion given a true exclusion.
  - LR+ serves as a multiplier for the prior odds in favor of an association.
  - LR+ gives the relative effect of a positive test result on the odds of same-source origin.
Probabilities
- Note: in building a decision model, TPR, TNR, FPR, FNR and LR+ are computed on a per-group basis. There is no overall TPR, TNR, FPR, FNR or LR+!
- Value comes into forensic science if one of the groups is a known suspect or crime-scene group, AND unknowns are tested against the suspect/crime-scene group.
- Confidence measures in the results are: TPR, FPR and LR+ computed on the suspect/crime-scene group.
Probabilities
How can these be used/stated in court?
- A striation pattern is found at a crime scene (CS); it has the same class characteristics as the suspect tool, and subclass characteristics have been eliminated from the data.
- Many striation patterns are generated by a tool associated with a suspect (SP).
- Include the SP set in a database (DB) and compute/test a discrimination model.
- Get TP, FP and LR+ for SP with respect to the DB.
- I.D. the CS pattern with the discrimination model; the result is an inclusion or an exclusion.
- The TP, FP and LR+ for SP apply to that result.
- State these in court along with the size of the DB.
Receiver Operating Characteristic
- In general, a classification rule t applied to a data point x yields a score: t(x) = score.
- For two groups, consider the two score distributions.
- The two groups can be right vs. wrong, positive vs. negative, association vs. no association, one vs. rest, one vs. one, etc.
[Figure: overlapping score distributions for the two groups, with a cut-off score marked on the score axis]
Receiver Operating Characteristic
- The cut-off score is adjustable; different choices give different TPR and FPR.
- The cut-off is related to the prior.
- Changing the cut-off traces out a curve on a graph of TPR vs. FPR: the ROC curve.
[Figure: ROC curve, TPR vs. FPR, with the "chance" diagonal; AUC = Mann-Whitney U]
*Now Exercise: Source roc_utilities.R and play with roc.R for PLS-DA
Receiver Operating Characteristic
- "Chance" diagonal: if your ROC curve looks like this, the score distributions for the two groups are right on top of each other, and there is a 50/50 chance of assigning an unknown to the correct group.
- Area under the curve (AUC): the probability that a randomly chosen member of one group scores higher than a randomly chosen member of the other, so 1 - AUC estimates the test's error rate. AUC range = 0 to 1 (really 0.5 to 1)*.
- Gini coefficient: the degree of "inequality" of the ROC curve relative to the chance diagonal = 2 AUC - 1.
(A sketch of computing an empirical ROC curve and AUC follows below.)
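A minimal sketch of tracing out an empirical ROC curve and getting the AUC through the Mann-Whitney U relationship; the scores are simulated stand-ins for a real decision model's output (roc_utilities.R and roc.R are the course's own tools):

```r
# Empirical ROC and AUC sketch with simulated scores for two groups
set.seed(1)
pos <- rnorm(100, mean = 1)    # scores for the "association" group
neg <- rnorm(100, mean = 0)    # scores for the "no association" group

# Sweep the cut-off over all observed scores, recording TPR and FPR at each cut-off
cuts <- sort(c(-Inf, pos, neg, Inf), decreasing = TRUE)
TPR  <- sapply(cuts, function(k) mean(pos >= k))
FPR  <- sapply(cuts, function(k) mean(neg >= k))
plot(FPR, TPR, type = "l", main = "ROC")
abline(0, 1, lty = 2)          # the "chance" diagonal

# AUC via the Mann-Whitney U statistic: AUC = U / (n_pos * n_neg)
U    <- wilcox.test(pos, neg)$statistic
AUC  <- as.numeric(U) / (length(pos) * length(neg))
gini <- 2 * AUC - 1            # Gini coefficient
```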
How good of a "match" is it? Conformal Prediction [Vovk]
- Can give a judge or jury an easy-to-understand measure of the reliability of a classification result.
- This is an orthodox "frequentist" approach, with roots in algorithmic information theory.
- Data should be IID, but that's it.
- Confidence is on a scale of 0%-100%.
- Testable claim: the long-run I.D. error rate should be the chosen significance level.
[Figure: cumulative number of errors vs. sequence of unknown observation vectors; 80% confidence (20% error) gives slope 0.2, 95% confidence (5% error) gives slope 0.05, 99% confidence (1% error) gives slope 0.01]
How Conformal Prediction works for us [Vovk]
- Given a "bag" of observations with known identities and one observation of unknown identity:
- Estimate how "wrong" each labeling is for each observation with a non-conformity score ("wrong-iness"). For us, the non-conformity scores come from one-vs-one SVMs.
- Looking at the "wrong-iness" of the known observations in the bag: does labeling i for the unknown have an unusual amount of "wrong-iness"?
- If not, i.e. if p_possible-ID_i >= the chosen level of significance, put ID i in the (1 - significance)*100% confidence interval.
(A stripped-down sketch of the p-value computation follows below.)
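A stripped-down sketch of the conformal p-value computation, using distance to a candidate group's mean as the non-conformity score; this is only a stand-in for the one-vs-one SVM scores the method above actually uses:

```r
# Conformal p-value sketch: non-conformity = distance to the candidate group's mean
conformal_pvalues <- function(X, grp, x_new) {
  sapply(levels(factor(grp)), function(g) {
    Xg      <- as.matrix(X[grp == g, , drop = FALSE])
    bag     <- rbind(Xg, x_new)                        # bag with provisional label g for x_new
    center  <- colMeans(bag)
    alpha_i <- sqrt(rowSums(sweep(bag, 2, center)^2))  # "wrong-iness" of every obs. in the bag
    mean(alpha_i >= alpha_i[nrow(bag)])                # p-value: how unusual is x_new's score?
  })
}

# Labels with p >= the chosen significance level go into the confidence set, e.g.:
# pv <- conformal_pvalues(iris[, 1:4], iris$Species, as.matrix(iris[1, 1:4]))
# names(pv)[pv >= 0.05]    # the 95% confidence set for the "unknown"
```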
Conformal Prediction
- 14D PCA-SVM decision model for screwdriver striation patterns.
- For a 95%-CPT (PCA-SVM), confidence intervals will not contain the correct I.D. 5% of the time in the long run.
- Theoretical (long-run) error rate: 5%. Empirical error rate: 5.3%.
- Straightforward validation/explanation picture for court.
Conformal Prediction Drawbacks
- CPT is an interval method: it can (and does) produce multi-label I.D. intervals. An interval holding all the labels is still a "correct" I.D.; this doesn't happen often in practice...
- Empty intervals count as "errors". Well..., what if the "correct" answer isn't in the database? This is an "open-set" problem, which Champod, Gantz and Saunders have pointed out.
- Must be run in "on-line" mode for the long-run guarantee; in practice we noticed that after 500+ I.D. attempts it can be run in "off-line" mode.
How good of a "match" is it? Empirical Bayes' [Efron]
- An I.D. is output for each questioned toolmark: this is a computer "match".
- What's the probability it is truly not a "match"?
- A similar problem arises in genomics when detecting disease from microarray data; they use data and Bayes' theorem to get an estimate.
- "No disease" in genomics corresponds to "not a true 'match'" for toolmarks.
Random Match Probability
[Figure: distribution of nDs from fragments at the crime scene overlaid on the distribution of nDs from fragments in the population; the interval covering 99% of the crime-scene nDs is the RMP "window", and the shaded area is the probability that a random fragment from the population would be ID'd as a crime-scene fragment]
Random Match Probability
Example: RMP ~ 0.46 x 100% = 46%
[Figure: distribution of nDs from glass fragments at the crime scene vs. distribution of nDs from glass fragments in the population]
(A sketch of the "window" computation follows below.)
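A sketch of the RMP "window" computation in the spirit of the example above: take the central 99% interval of the crime-scene nD values and find the fraction of population nD values falling inside it (all values simulated for illustration):

```r
# RMP "window" sketch with simulated refractive index (nD) values
set.seed(2)
nD_cs  <- rnorm(50,   mean = 1.5185, sd = 0.0002)   # crime-scene fragments
nD_pop <- rnorm(5000, mean = 1.5183, sd = 0.0004)   # population of fragments

window <- quantile(nD_cs, c(0.005, 0.995))           # window covering 99% of CS nDs
RMP    <- mean(nD_pop >= window[1] & nD_pop <= window[2])
RMP * 100   # percent of population fragments that would be "ID'd" as crime-scene
```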
Random Match Probability
Problems with random match probability computations:
- To get reliable probabilities, you need accurate probability density functions (pdfs).
- Higher-dimensional pdfs require exponential amounts of data to fit accurately (curse of dimensionality). How do you handle overlap in higher dimensions?
- How wide should RMP "windows" be?
- Use distributions of univariate "similarity" measures? Different measures correspond to different RMPs! There is no natural choice!
Empirical Bayes'
- We use Efron's machinery for the "empirical Bayes' two-groups model" [Efron]. Surprisingly simple! Use binned data to do a Poisson regression.
- Some notation:
  - S-: truly no association, the null hypothesis
  - S+: truly an association, the non-null hypothesis
  - z: a score derived from a machine learning task to I.D. an unknown pattern with a group
  - z is a Gaussian random variate under the null
Empirical Bayes'
- From Bayes' theorem we can get [Efron]:
  Pr(S- | z) = Pr(S-) f0(z) / f(z)
  the estimated probability of not a true "match" given the algorithm's output z-score associated with its "match".
- Names: posterior error probability (PEP) [Kall]; local false discovery rate (lfdr) [Efron].
- Suggested interpretation for casework: Pr(S+ | z) = estimated "believability" of a machine-made association.
- We agree with Gelman and Shalizi [Gelman]: "…posterior model probabilities …[are]… useful as tools for prediction and for understanding structure in data, as long as these probabilities are not taken too seriously."
Empirical Bayes'
- Use an SVM to get KM and KNM "Platt-score" distributions [Platt, e1071] on a "training" set.
- Use a bootstrap procedure to get an estimate of the KNM distribution of "Platt-scores", inspired by Storey and Tibshirani's null estimation method [Storey].
- Use this to get p-values/z-values on a "validation" set.
- From the z-score histogram, fit by Efron's method:
  - the "mixture" density f(z)
  - the z-density given KNM, f0(z), which should be Gaussian
  - the estimate of the prior for KNM, Pr(S-)
- What's the point?? We can test the fits to the null density and the KNM prior!
Bootstrap algorithm to estimate the KNM distribution (the null):
1. Draw a bootstrap sample of observations.
2. Train an SVM on the bootstrap sample.
3. Get Platt scores on the whole set.
4. Toss the KM Platt scores.
5. Toss the observations that appear in the bootstrap sample.
6. Randomly select a KNM score from each remaining observation.
7. Collect the selected scores.
8. Repeat.
(A rough sketch of one pass follows below.)
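A rough sketch of one pass of this bootstrap using e1071's SVM with Platt-scaled probabilities; treating "KNM" as the predicted probabilities for groups other than an observation's own group is my simplification, not necessarily the authors' exact one-vs-one setup:

```r
# One pass of the KNM Platt-score bootstrap sketch (X: features, grp: group labels)
library(e1071)

knm_scores_once <- function(X, grp) {
  idx <- sample(nrow(X), replace = TRUE)                    # 1: bootstrap sample
  fit <- svm(X[idx, , drop = FALSE], factor(grp[idx]),      # 2: train SVM with
             probability = TRUE)                            #    Platt-scaled outputs
  pr  <- attr(predict(fit, X, probability = TRUE),          # 3: Platt scores on whole set
              "probabilities")
  keep <- setdiff(seq_len(nrow(X)), idx)                    # 5: toss obs. in the bootstrap sample
  sapply(keep, function(i) {
    knm <- pr[i, colnames(pr) != as.character(grp[i])]      # 4: toss the KM (own-group) score
    knm[sample(length(knm), 1)]                             # 6: one random KNM score per obs.
  })
}

# 7-8: collect and repeat, e.g.
# knm_null <- unlist(replicate(200, knm_scores_once(X, grp), simplify = FALSE))
```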
[Figure: estimate of the log KNM Platt-score distribution]
- Fitting log(KNM) scores to a parametric form helps us avoid a plethora of 0 p-values for the KM validation set.
- The "problem" p-values are now taken care of.
Validation Set
- Lump the remaining data together as the "validation set".
- Sample to get a set of IID simulated log(KNM scores) ("reusing the data" less, too…??).
- Compute p-values for the validation set from the fit null.
- Check assumptions on the null: the null p-values should be uniform, and the null z-values should be close to N(0,1).
Fit local-fdr models
- Use locfdr [locfdr]: fit the classic Poisson regression for f(z).
- Or use a modified locfdr with JAGS [JAGS, Plummer] or Stan [Stan]: fit Bayesian hierarchical Poisson regressions.
(A minimal locfdr sketch follows below.)
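A minimal sketch of the classic (Efron) fit with the locfdr package; the z-values are simulated here so the sketch runs on its own, whereas in the method above they come from the validation set:

```r
# Local false discovery rate fit sketch with locfdr
library(locfdr)

# z: vector of z-values (from the validation step); simulated here so the sketch runs
set.seed(3)
z <- c(rnorm(950), rnorm(50, mean = 3))   # mostly null, a few non-null

fit <- locfdr(z, nulltype = 0)   # nulltype = 0: theoretical N(0,1) null; f(z) is fit
                                 # by Poisson regression on binned z counts
head(fit$fdr)                    # lfdr(z) = estimated Pr(truly no association | z)
fit$fp0                          # fitted null parameters, including the Pr(S-) estimate
```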
Posterior Association Probability: Believability Curve
[Figure: 12D PCA-SVM locfdr fit for Glock primer shear patterns, with +/- 2 standard error bands]
[Figure: comparison of fits on the test set: Poisson (Efron), Bayesian Poisson, Bayesian Poisson with intercept, and Bayesian over-dispersed Poisson with intercept]
Bayes Factors/Likelihood Ratios
- In the "forensic Bayesian framework", the likelihood ratio is the measure of the weight of evidence.
- LRs are called Bayes factors by most statisticians.
- LRs give the measure of support the "evidence" lends to the "prosecution hypothesis" vs. the "defense hypothesis".
- From Bayes' theorem:
  Pr(Hp | E) / Pr(Hd | E) = [ Pr(E | Hp) / Pr(E | Hd) ] x [ Pr(Hp) / Pr(Hd) ]
  i.e. posterior odds = LR x prior odds.
Bayes Factors/Likelihood Ratios
- Once the "fits" for the empirical Bayes method are obtained, it is easy to compute the corresponding likelihood ratios.
- Using the identity Pr(S+ | z) = 1 - Pr(S- | z), the likelihood ratio can be computed as:
  LR(z) = [ Pr(S+ | z) / Pr(S- | z) ] x [ Pr(S-) / Pr(S+) ] = [ (1 - lfdr(z)) / lfdr(z) ] x [ Pr(S-) / (1 - Pr(S-)) ]
(A sketch of this computation follows below.)
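A sketch of that computation from the locfdr fit in the earlier sketch, reading lfdr(z) as Pr(S- | z) and the fitted p0 as the prior Pr(S-); this is my reading of the identity above, not the authors' code:

```r
# Likelihood ratio from the empirical Bayes fit; continues the locfdr sketch above,
# where `fit <- locfdr(z, nulltype = 0)` was computed on the validation z-values
library(locfdr)

lfdr <- fit$fdr                   # estimated Pr(S- | z) for each z-value
p0   <- fit$fp0["thest", "p0"]    # estimated prior Pr(S-) (theoretical-null row;
                                  #  row/column names may differ across locfdr versions)

post_odds  <- (1 - lfdr) / lfdr   # Pr(S+ | z) / Pr(S- | z)
prior_odds <- (1 - p0) / p0       # Pr(S+) / Pr(S-)
LR <- post_odds / prior_odds      # likelihood ratio (Bayes factor) at each z
```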
Bayes Factors/Likelihood Ratios
- Using the fit posteriors and priors, we can obtain the likelihood ratios [Tippett, Ramos].
[Figure: Tippett plot of known-match LR values and known-non-match LR values]
Empirical Bayes': Some Things That Bother Me
- You need a lot of z-scores, and big data sets in forensic science largely don't exist.
- z-scores should be fairly independent; this is especially necessary for interval estimates around the lfdr [Efron].
- Requires "binning" into an arbitrary number of intervals.
- Also suffers from the "open-set" problem.
- Interpretation of the prior probability for this application: should Pr(S-) be 1, or very close to it? How close?