Decision Model Validation
- Discriminant functions are trained on a finite set of data. How much fitting should we do? What should the model's dimension be?
- The model must be used to identify a piece of evidence (data) it was not trained with.
- Accurate estimates of a decision model's error rates are critical in forensic science applications.
- The simplest is the apparent error rate: the error rate on the training set. A lousy estimate, but better than nothing.
Decision Model Validation
- Cross-validation: systematically hold out chunks of the data set for testing.
- Most common: hold-one-out (leave-one-out) CV:
  1. Omit a data vector from X
  2. Train the model
  3. Classify the held-out observation
  4. Repeat for all data vectors
- Simple, but gives a good estimate. Lots of literature backs up its efficacy. (A sketch follows below.)
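A minimal sketch of the hold-one-out procedure above, assuming a data frame of feature vectors X and a vector of group labels grp, and using lda() from MASS purely as a stand-in decision model (any classifier could be dropped in):

```r
# Hold-one-out cross-validation sketch
library(MASS)   # lda() used here only as a stand-in decision model

hoo_cv_error <- function(X, grp) {
  n <- nrow(X)
  pred <- character(n)
  for (i in seq_len(n)) {
    fit <- lda(X[-i, , drop = FALSE], grouping = grp[-i])               # 1-2: omit obs. i, train
    pred[i] <- as.character(predict(fit, X[i, , drop = FALSE])$class)   # 3: classify held-out obs.
  }
  mean(pred != as.character(grp))   # 4: repeat for all; report the estimated error rate
}

# e.g. hoo_cv_error(iris[, 1:4], iris$Species)   # iris used as a stand-in data set
```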
Decision Model Validation
- c-fold cross-validation: hold out data chunks of size c. Can become time consuming, and typically performance is not much better than simple HOO-CV.
- Caution! If the decision model is sensitive to group sizes (e.g. CVA), cross-validation may not work well. You should have, at the very least, 5 replicates per group. DON'T ARGUE WITH ME!!!!!!!!!!
Decision Model Validation
- Bootstrap: make up data sets from randomly selected observation vectors (with replacement).
- A bootstrap sample is the same size as X, so you'll get repeats.
  1. Train a decision model with the bootstrapped set, i.e. the decision rules are built with the bootstrapped data set. The model should not be sensitive to repeated observations! CVA is out!!!!
  2. Test the model with the original X and compute the error.
Decision Model Validation
  3. Test the model with the bootstrapped data set X* and compute the error (counting each observation vector as many times as it occurs in X*).
  4. Repeat 1-3 for B bootstrap samples, with B large.
  5. Compute the average "optimism": the mean over the B samples of (error on X minus error on X*).
  6. Compute the "refined" bootstrap error rate: apparent error rate + average optimism.
(A sketch follows below.)
*Now Exercise: Explore some data sets with boostrap.R and cv_boot_testset.R
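A sketch of the refined bootstrap estimate just described, with lda() again standing in for the decision model; the course's own versions are in boostrap.R and cv_boot_testset.R:

```r
# Refined bootstrap error-rate sketch (X: data frame of features, grp: group labels)
library(MASS)

refined_boot_error <- function(X, grp, B = 200) {
  err <- function(fit, dat, labs)
    mean(as.character(predict(fit, dat)$class) != as.character(labs))

  fit_all  <- lda(X, grouping = grp)
  apparent <- err(fit_all, X, grp)             # apparent (training-set) error rate

  optimism <- numeric(B)
  for (b in seq_len(B)) {
    idx   <- sample(nrow(X), replace = TRUE)   # bootstrap sample, same size as X (repeats!)
    fit_b <- lda(X[idx, , drop = FALSE], grouping = grp[idx])
    # optimism: error on the original X minus error on the bootstrap set itself
    optimism[b] <- err(fit_b, X, grp) - err(fit_b, X[idx, , drop = FALSE], grp[idx])
  }
  apparent + mean(optimism)                    # "refined" bootstrap error rate
}
```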
Probabilities
- t is a test for association between:
  - x_unk, data from an unknown (could be from the crime scene, could be from the suspect), and
  - a group of data from a source (could be from the suspect, could be from the crime scene).
- ANY decision rule output by a pattern recognition program can be considered a test for association.
Probabilities
- Codes:
  - t+ / t-: the test indicates inclusion / exclusion
  - S+ / S-: the evidence is / is not associated with a source
- Four probabilities are of interest.
- Probability that a test yields a positive association given that there is truly an association between evidence and a source:
  Pr(t+ | S+) = probability of a true positive (TP) = true positive rate (TPR) = probability of a true inclusion = sensitivity
- TPR is very important for forensic applications!
Probabilities
- Probability that a test yields a positive association given that there is truly no association between evidence and a source:
  Pr(t+ | S-) = probability of a false positive (FP) = false positive rate (FPR) = probability of a false inclusion
- FPR is very important for forensic applications!
- In traditional hypothesis testing, the FPR is sometimes called the significance level (the Type I error rate).
- 1 - FPR = specificity (TNR): the rate at which true exclusions are correctly excluded.
Probabilities
- Probability that a test yields a negative association given that there is truly no association between evidence and a source:
  Pr(t- | S-) = probability of a true negative (TN) = true negative rate (TNR) = probability of a true exclusion = specificity
- TNR estimates may be the most useful (and trustworthy) numbers that come out of applications of probability to physical evidence...
Probabilities
- Probability that a test yields a negative association given that there is truly an association between evidence and a source:
  Pr(t- | S+) = probability of a false negative (FN) = false negative rate (FNR) = probability of a false exclusion
- In traditional hypothesis testing, the FNR is sometimes called the Type II error rate.
- 1 - FNR = sensitivity (TPR): the rate at which true inclusions are correctly included.
Probabilities
Summary:
- Test indicates an inclusion (t+) and an association truly exists (S+): True Positive Rate
- Test indicates an inclusion (t+) and an association truly does not exist (S-): False Positive Rate (Type I error)
- Test indicates an exclusion (t-) and an association truly exists (S+): False Negative Rate (Type II error)
- Test indicates an exclusion (t-) and an association truly does not exist (S-): True Negative Rate
- 1 minus the false negative rate is called the test's power.
- Remember, these are all only ESTIMATES!
(A small worked example follows below.)
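A small worked example of the four rates, computed from a hypothetical 2x2 validation table (the counts are invented purely for illustration):

```r
# Hypothetical validation counts: rows = test result, columns = ground truth
counts <- matrix(c(90,  5,    # t+ : true positives, false positives
                   10, 95),   # t- : false negatives, true negatives
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("t+", "t-"), c("S+", "S-")))

TPR <- counts["t+", "S+"] / sum(counts[, "S+"])   # sensitivity
FPR <- counts["t+", "S-"] / sum(counts[, "S-"])   # Type I error rate
TNR <- counts["t-", "S-"] / sum(counts[, "S-"])   # specificity
FNR <- counts["t-", "S+"] / sum(counts[, "S+"])   # Type II error rate
c(TPR = TPR, FPR = FPR, TNR = TNR, FNR = FNR)     # 0.90, 0.05, 0.95, 0.10
```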
Bayes' Rule Again
- Much more difficult to objectively estimate, but of more interest in legal applications:
- Probability that an association exists given that a test indicates an association, also called the positive predictive value (PV+):
  Pr(S+ | t+) = Pr(t+ | S+) Pr(S+) / [ Pr(t+ | S+) Pr(S+) + Pr(t+ | S-) Pr(S-) ]
  where Pr(S+) is the prior probability that there is an association between evidence and a source.
- Probability that no association exists given that a test indicates an association:
  Pr(S- | t+) = 1 - Pr(S+ | t+)
Probabilities
- Dividing these, we get the "famous" (positive) likelihood ratio LR+:
  LR+ = Pr(t+ | S+) / Pr(t+ | S-) = TPR / FPR
- Odds form of Bayes' rule: LR+ links the prior odds to the posterior odds,
  Pr(S+ | t+) / Pr(S- | t+) = LR+ x [ Pr(S+) / Pr(S-) ]
  i.e. (posterior odds in favor of association, given the test indicates inclusion) = LR+ x (prior odds in favor of association).
(A small numeric sketch follows below.)
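Continuing the hypothetical rates above, a short sketch of LR+ and the odds form of Bayes' rule; the prior probability used here is purely illustrative:

```r
# LR+ and the odds form of Bayes' rule, with the hypothetical rates from above
TPR <- 0.90; FPR <- 0.05
LRplus <- TPR / FPR                 # Pr(t+|S+) / Pr(t+|S-) = 18

prior_prob <- 0.01                  # illustrative prior Pr(S+); NOT from the slides
prior_odds <- prior_prob / (1 - prior_prob)
post_odds  <- LRplus * prior_odds   # posterior odds of association given t+
post_prob  <- post_odds / (1 + post_odds)
round(c(LRplus = LRplus, post_prob = post_prob), 3)   # LR+ = 18, Pr(S+|t+) ~ 0.154
```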
Probabilities
- LR+ interpretations:
  - The ratio of the probability the test indicates inclusion given a true association vs. the probability the test indicates inclusion given a true exclusion.
  - LR+ serves as a multiplier for the prior odds in favor of an association.
  - LR+ gives the relative effect of a positive test result on the odds of same-source origin.
Probabilities
- Note: in building a decision model, TPR, TNR, FPR, FNR and LR+ are computed on a per-group basis. There is no overall TPR, TNR, FPR, FNR or LR+!
- Value comes into forensic science if one of the groups is a known suspect or crime-scene group, AND unknowns are tested against the suspect/crime-scene group.
- Confidence measures in the results are: TPR, FPR and LR+ computed on the suspect/crime-scene group.
Probabilities
How can these be used/stated in court?
- A striation pattern is found at a crime scene (CS); it has the same class characteristics as the suspect tool, and subclass characteristics have been eliminated from the data.
- Many striation patterns are generated by a tool associated with a suspect (SP).
- Include the SP set in a database (DB) and compute/test a discrimination model.
- Get TP, FP and LR+ for SP with respect to the DB.
- I.D. the CS pattern with the discrimination model; the result is an inclusion or an exclusion.
- The TP, FP and LR+ for SP apply to that result.
- State these in court along with the size of the DB.
Receiver Operating Characteristic
- In general, a classification rule t applied to a data point x yields a score: t(x) = score.
- For two groups, consider the two score distributions.
- The two groups can be right vs. wrong, positive vs. negative, association vs. no association, one vs. rest, one vs. one, etc.
[Figure: overlapping score distributions for the two groups, with a cut-off score marked on the score axis]
Receiver Operating Characteristic
- The cut-off score is adjustable; different choices give different TPR and FPR.
- The cut-off is related to the prior.
- Changing the cut-off traces out a curve on a graph of TPR vs. FPR: the ROC curve.
[Figure: ROC curve, TPR vs. FPR, with the "chance" diagonal; AUC = Mann-Whitney U]
*Now Exercise: Source roc_utilities.R and play with roc.R for PLS-DA
Receiver Operating Characteristic
- "Chance" diagonal: if your ROC curve looks like this, the score distributions for the two groups are right on top of each other, and there is a 50/50 chance of assigning an unknown to the correct group.
- Area under the curve (AUC): the probability that a randomly chosen member of one group scores higher than a randomly chosen member of the other, so 1 - AUC estimates the test's error rate. AUC range = 0 to 1 (really 0.5 to 1)*.
- Gini coefficient: the degree of "inequality" of the ROC curve relative to the chance diagonal = 2 AUC - 1.
(A sketch of computing an empirical ROC curve and AUC follows below.)
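A minimal sketch of tracing out an empirical ROC curve and getting the AUC through the Mann-Whitney U relationship; the scores are simulated stand-ins for a real decision model's output (roc_utilities.R and roc.R are the course's own tools):

```r
# Empirical ROC and AUC sketch with simulated scores for two groups
set.seed(1)
pos <- rnorm(100, mean = 1)    # scores for the "association" group
neg <- rnorm(100, mean = 0)    # scores for the "no association" group

# Sweep the cut-off over all observed scores, recording TPR and FPR at each cut-off
cuts <- sort(c(-Inf, pos, neg, Inf), decreasing = TRUE)
TPR  <- sapply(cuts, function(k) mean(pos >= k))
FPR  <- sapply(cuts, function(k) mean(neg >= k))
plot(FPR, TPR, type = "l", main = "ROC")
abline(0, 1, lty = 2)          # the "chance" diagonal

# AUC via the Mann-Whitney U statistic: AUC = U / (n_pos * n_neg)
U    <- wilcox.test(pos, neg)$statistic
AUC  <- as.numeric(U) / (length(pos) * length(neg))
gini <- 2 * AUC - 1            # Gini coefficient
```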
How good of a "match" is it? Conformal Prediction [Vovk]
- Can give a judge or jury an easy-to-understand measure of the reliability of a classification result.
- This is an orthodox "frequentist" approach, with roots in algorithmic information theory.
- Data should be IID, but that's it.
- Confidence is on a scale of 0%-100%.
- Testable claim: the long-run I.D. error rate should be the chosen significance level.
[Figure: cumulative number of errors vs. sequence of unknown observation vectors; 80% confidence (20% error) gives slope 0.2, 95% confidence (5% error) gives slope 0.05, 99% confidence (1% error) gives slope 0.01]
How Conformal Prediction works for us [Vovk]
- Given a "bag" of observations with known identities and one observation of unknown identity:
- Estimate how "wrong" each labeling is for each observation with a non-conformity score ("wrong-iness"). For us, the non-conformity scores come from one-vs-one SVMs.
- Looking at the "wrong-iness" of the known observations in the bag: does labeling i for the unknown have an unusual amount of "wrong-iness"?
- If not, i.e. if p_possible-ID_i >= the chosen level of significance, put ID i in the (1 - significance)*100% confidence interval.
(A stripped-down sketch of the p-value computation follows below.)
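A stripped-down sketch of the conformal p-value computation, using distance to a candidate group's mean as the non-conformity score; this is only a stand-in for the one-vs-one SVM scores the method above actually uses:

```r
# Conformal p-value sketch: non-conformity = distance to the candidate group's mean
conformal_pvalues <- function(X, grp, x_new) {
  sapply(levels(factor(grp)), function(g) {
    Xg      <- as.matrix(X[grp == g, , drop = FALSE])
    bag     <- rbind(Xg, x_new)                        # bag with provisional label g for x_new
    center  <- colMeans(bag)
    alpha_i <- sqrt(rowSums(sweep(bag, 2, center)^2))  # "wrong-iness" of every obs. in the bag
    mean(alpha_i >= alpha_i[nrow(bag)])                # p-value: how unusual is x_new's score?
  })
}

# Labels with p >= the chosen significance level go into the confidence set, e.g.:
# pv <- conformal_pvalues(iris[, 1:4], iris$Species, as.matrix(iris[1, 1:4]))
# names(pv)[pv >= 0.05]    # the 95% confidence set for the "unknown"
```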
Conformal Prediction
- 14D PCA-SVM decision model for screwdriver striation patterns.
- For a 95%-CPT (PCA-SVM), confidence intervals will not contain the correct I.D. 5% of the time in the long run.
- Theoretical (long-run) error rate: 5%. Empirical error rate: 5.3%.
- Straightforward validation/explanation picture for court.
Conformal Prediction Drawbacks
- CPT is an interval method: it can (and does) produce multi-label I.D. intervals. An interval holding all the labels is still a "correct" I.D.; this doesn't happen often in practice...
- Empty intervals count as "errors". Well..., what if the "correct" answer isn't in the database? This is an "open-set" problem, which Champod, Gantz and Saunders have pointed out.
- Must be run in "on-line" mode for the long-run guarantee; in practice we noticed that after 500+ I.D. attempts it can be run in "off-line" mode.
How good of a "match" is it? Empirical Bayes' [Efron]
- An I.D. is output for each questioned toolmark: this is a computer "match".
- What's the probability it is truly not a "match"?
- A similar problem arises in genomics when detecting disease from microarray data; they use data and Bayes' theorem to get an estimate.
- "No disease" in genomics corresponds to "not a true 'match'" for toolmarks.
Random Match Probability
[Figure: distribution of nDs from fragments at the crime scene overlaid on the distribution of nDs from fragments in the population; the interval covering 99% of the crime-scene nDs is the RMP "window", and the shaded area is the probability that a random fragment from the population would be ID'd as a crime-scene fragment]
Random Match Probability
Example: RMP ~ 0.46 x 100% = 46%
[Figure: distribution of nDs from glass fragments at the crime scene vs. distribution of nDs from glass fragments in the population]
(A sketch of the "window" computation follows below.)
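A sketch of the RMP "window" computation in the spirit of the example above: take the central 99% interval of the crime-scene nD values and find the fraction of population nD values falling inside it (all values simulated for illustration):

```r
# RMP "window" sketch with simulated refractive index (nD) values
set.seed(2)
nD_cs  <- rnorm(50,   mean = 1.5185, sd = 0.0002)   # crime-scene fragments
nD_pop <- rnorm(5000, mean = 1.5183, sd = 0.0004)   # population of fragments

window <- quantile(nD_cs, c(0.005, 0.995))           # window covering 99% of CS nDs
RMP    <- mean(nD_pop >= window[1] & nD_pop <= window[2])
RMP * 100   # percent of population fragments that would be "ID'd" as crime-scene
```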
Random Match Probability
Problems with random match probability computations:
- To get reliable probabilities, you need accurate probability density functions (pdfs).
- Higher-dimensional pdfs require exponential amounts of data to fit accurately (curse of dimensionality). How do you handle overlap in higher dimensions?
- How wide should RMP "windows" be?
- Use distributions of univariate "similarity" measures? Different measures correspond to different RMPs! There is no natural choice!
Empirical Bayes'
- We use Efron's machinery for the "empirical Bayes' two-groups model" [Efron]. Surprisingly simple! Use binned data to do a Poisson regression.
- Some notation:
  - S-: truly no association, the null hypothesis
  - S+: truly an association, the non-null hypothesis
  - z: a score derived from a machine learning task to I.D. an unknown pattern with a group
  - z is a Gaussian random variate under the null
Empirical Bayes'
- From Bayes' theorem we can get [Efron]:
  Pr(S- | z) = Pr(S-) f0(z) / f(z)
  the estimated probability of not a true "match" given the algorithm's output z-score associated with its "match".
- Names: posterior error probability (PEP) [Kall]; local false discovery rate (lfdr) [Efron].
- Suggested interpretation for casework: Pr(S+ | z) = estimated "believability" of a machine-made association.
- We agree with Gelman and Shalizi [Gelman]: "…posterior model probabilities …[are]… useful as tools for prediction and for understanding structure in data, as long as these probabilities are not taken too seriously."
Empirical Bayes'
- Use an SVM to get KM and KNM "Platt-score" distributions [Platt, e1071] on a "training" set.
- Use a bootstrap procedure to get an estimate of the KNM distribution of "Platt-scores", inspired by Storey and Tibshirani's null estimation method [Storey].
- Use this to get p-values/z-values on a "validation" set.
- From the z-score histogram, fit by Efron's method:
  - the "mixture" density f(z)
  - the z-density given KNM, f0(z), which should be Gaussian
  - the estimate of the prior for KNM, Pr(S-)
- What's the point?? We can test the fits to the null density and the KNM prior!
Bootstrap algorithm to estimate the KNM distribution (the null):
1. Draw a bootstrap sample of observations.
2. Train an SVM on the bootstrap sample.
3. Get Platt scores on the whole set.
4. Toss the KM Platt scores.
5. Toss the observations that appear in the bootstrap sample.
6. Randomly select a KNM score from each remaining observation.
7. Collect the selected scores.
8. Repeat.
(A rough sketch of one pass follows below.)
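A rough sketch of one pass of this bootstrap using e1071's SVM with Platt-scaled probabilities; treating "KNM" as the predicted probabilities for groups other than an observation's own group is my simplification, not necessarily the authors' exact one-vs-one setup:

```r
# One pass of the KNM Platt-score bootstrap sketch (X: features, grp: group labels)
library(e1071)

knm_scores_once <- function(X, grp) {
  idx <- sample(nrow(X), replace = TRUE)                    # 1: bootstrap sample
  fit <- svm(X[idx, , drop = FALSE], factor(grp[idx]),      # 2: train SVM with
             probability = TRUE)                            #    Platt-scaled outputs
  pr  <- attr(predict(fit, X, probability = TRUE),          # 3: Platt scores on whole set
              "probabilities")
  keep <- setdiff(seq_len(nrow(X)), idx)                    # 5: toss obs. in the bootstrap sample
  sapply(keep, function(i) {
    knm <- pr[i, colnames(pr) != as.character(grp[i])]      # 4: toss the KM (own-group) score
    knm[sample(length(knm), 1)]                             # 6: one random KNM score per obs.
  })
}

# 7-8: collect and repeat, e.g.
# knm_null <- unlist(replicate(200, knm_scores_once(X, grp), simplify = FALSE))
```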
[Figure: estimate of the log KNM Platt-score distribution]
- Fitting log(KNM) scores to a parametric form helps us avoid a plethora of 0 p-values for the KM validation set.
- The "problem" p-values are now taken care of.
Validation Set
- Lump the remaining data together as the "validation set".
- Sample to get a set of IID simulated log(KNM scores) ("reusing the data" less, too…??).
- Compute p-values for the validation set from the fit null.
- Check assumptions on the null: the null p-values should be uniform, and the null z-values should be close to N(0,1).
Fit local-fdr models
- Use locfdr [locfdr]: fit the classic Poisson regression for f(z).
- Or use a modified locfdr with JAGS [JAGS, Plummer] or Stan [Stan]: fit Bayesian hierarchical Poisson regressions.
(A minimal locfdr sketch follows below.)
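A minimal sketch of the classic (Efron) fit with the locfdr package; the z-values are simulated here so the sketch runs on its own, whereas in the method above they come from the validation set:

```r
# Local false discovery rate fit sketch with locfdr
library(locfdr)

# z: vector of z-values (from the validation step); simulated here so the sketch runs
set.seed(3)
z <- c(rnorm(950), rnorm(50, mean = 3))   # mostly null, a few non-null

fit <- locfdr(z, nulltype = 0)   # nulltype = 0: theoretical N(0,1) null; f(z) is fit
                                 # by Poisson regression on binned z counts
head(fit$fdr)                    # lfdr(z) = estimated Pr(truly no association | z)
fit$fp0                          # fitted null parameters, including the Pr(S-) estimate
```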
Posterior Association Probability: Believability Curve
[Figure: 12D PCA-SVM locfdr fit for Glock primer shear patterns, with +/- 2 standard error bands]
[Figure: comparison of fits on the test set: Poisson (Efron), Bayesian Poisson, Bayesian Poisson with intercept, and Bayesian over-dispersed Poisson with intercept]
Bayes Factors/Likelihood Ratios
- In the "forensic Bayesian framework", the likelihood ratio is the measure of the weight of evidence.
- LRs are called Bayes factors by most statisticians.
- LRs give the measure of support the "evidence" lends to the "prosecution hypothesis" vs. the "defense hypothesis".
- From Bayes' theorem:
  Pr(Hp | E) / Pr(Hd | E) = [ Pr(E | Hp) / Pr(E | Hd) ] x [ Pr(Hp) / Pr(Hd) ]
  i.e. posterior odds = LR x prior odds.
Bayes Factors/Likelihood Ratios
- Once the "fits" for the empirical Bayes method are obtained, it is easy to compute the corresponding likelihood ratios.
- Using the identity Pr(S+ | z) = 1 - Pr(S- | z), the likelihood ratio can be computed as:
  LR(z) = [ Pr(S+ | z) / Pr(S- | z) ] x [ Pr(S-) / Pr(S+) ] = [ (1 - lfdr(z)) / lfdr(z) ] x [ Pr(S-) / (1 - Pr(S-)) ]
(A sketch of this computation follows below.)
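A sketch of that computation from the locfdr fit in the earlier sketch, reading lfdr(z) as Pr(S- | z) and the fitted p0 as the prior Pr(S-); this is my reading of the identity above, not the authors' code:

```r
# Likelihood ratio from the empirical Bayes fit; continues the locfdr sketch above,
# where `fit <- locfdr(z, nulltype = 0)` was computed on the validation z-values
library(locfdr)

lfdr <- fit$fdr                   # estimated Pr(S- | z) for each z-value
p0   <- fit$fp0["thest", "p0"]    # estimated prior Pr(S-) (theoretical-null row;
                                  #  row/column names may differ across locfdr versions)

post_odds  <- (1 - lfdr) / lfdr   # Pr(S+ | z) / Pr(S- | z)
prior_odds <- (1 - p0) / p0       # Pr(S+) / Pr(S-)
LR <- post_odds / prior_odds      # likelihood ratio (Bayes factor) at each z
```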
Bayes Factors/Likelihood Ratios
- Using the fit posteriors and priors, we can obtain the likelihood ratios [Tippett, Ramos].
[Figure: Tippett plot of known-match LR values and known-non-match LR values]
Empirical Bayes': Some Things That Bother Me
- You need a lot of z-scores, and big data sets in forensic science largely don't exist.
- z-scores should be fairly independent; this is especially necessary for interval estimates around the lfdr [Efron].
- Requires "binning" into an arbitrary number of intervals.
- Also suffers from the "open-set" problem.
- Interpretation of the prior probability for this application: should Pr(S-) be 1, or very close to it? How close?