THOUGHTS ON MODEL ASSESSMENT Porco, DAIDD, December 2012.

Assessment  Some models are so wrong as to be useless.  Not really possible to argue that a model is “true”  Treacherous concept of “validation”: has a model that has been “validated” been shown to be “valid”?  Philosophy of science

Face Validity  Knowledge representation  Analogy  Does the model represent in some approximate sense facts about the system?

Verification  Does your simulation actually simulate what you think it simulates?  How do you know your program is not generating nonsense?  Unit testing  Boundary cases  “Test harnesses”  Formal software testing techniques

Thought  “…the gold standard of model performance is predictive power on data that wasn’t used to fit your model…” --Conway and White, Machine Learning for Hackers

Prediction  Not everything is predictable  Predict individuals? Communities? Counterfactuals?  Does the model somehow correspond to reality?

California TB

Probabilistic forecasting  Weather  Forecast expressed as a probability  Correctly quantify uncertainty  Uncertainty is needed to make the best use of a forecast  If you don’t know, don’t say you do  Not bet-hedging, but honesty

Two approaches  Assessment of probabilistic forecasts  Brier Score  Information Score

Assessing simple forecasts  Binary event: 0 for no, 1 for yes  Forecast: p is the probability you assigned to the event occurring  If p=1, you forecast it with certainty  If p=0, you forecast that it was certain not to occur  If 0<p<1, you were somewhere in between

Brier score

 Squared error of a probabilistic forecast  For an event that occurred: BS = (p-1)^2 + ((1-p)-0)^2 = 2(1-p)^2  Brier score has a negative orientation (as in golf, smaller is better)  Some authors do not use the factor of 2
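
The two-term score above can be sketched in Python (the deck's own code, later in the slides, is in R; the function name here is a hypothetical helper):

```python
def brier_score(p, outcome):
    """Two-term Brier score for a binary event (smaller is better).

    p: forecast probability that the event occurs.
    outcome: 1 if the event occurred, 0 if not.
    Uses the slides' two-term convention, which carries a factor of 2
    relative to the single-term version 2*(p - outcome)**2 / 2.
    """
    return (p - outcome) ** 2 + ((1 - p) - (1 - outcome)) ** 2

# A certain, correct forecast scores 0; a certain, wrong one scores 2.
print(brier_score(1.0, 1))  # 0.0
print(brier_score(0.0, 1))  # 2.0
print(brier_score(0.7, 1))  # about 0.18, i.e. 2*(1 - 0.7)^2
```

Note that the two terms are always equal for a binary event, which is where the factor of 2 comes from.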

Brier score  Example. Suppose I compute a forecast that trachoma will be eliminated in village A after two years as measured by pooled PCR for DNA in 50 randomly selected children. Suppose I say the elimination probability is 0.8.  If trachoma is in fact eliminated, the Brier score for this is (0.8-1)^2 + (0.2-0)^2 = [you work it out]

Brier score  What is the smallest possible Brier score?  What is the largest possible Brier score?

Brier score  Suppose we now forecast elimination in five villages. We can compute an overall Brier score by just adding up separate scores for each village. Elimination Prob.EliminationScore

Murphy decomposition  Classical form applies to binary predictions  Predictions are probabilities  Finite set of possible probabilities  Example: chance of a meteorological event happening might be predicted as 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100%.

Reliability (sensu Murphy)  Similar to Brier score – less is better  Looking at all identical forecasts, how similar is what really happened to what was forecast?

Examples

Prediction | Result        Prediction | Result
0.2 | 0                    0.8 | 1
0.2 | 1                    0.8 | 0
0.2 | 0                    0.8 | 0
0.2 | 0                    0.8 | 0
0.2 | 0                    0.8 | 1
0.2 | 1                    0.8 | 1

Example  Here, we get a total Brier score of We have to add up (0.2-0)^2 + (0.2-1)^2+… (and not forget the factor of two).

 Some terminology  Forecast levels: the different values possible for the forecast. Notation: f_k for forecast level k.  Observed means at the forecast levels: the average outcome (the fraction of times the event happened), stratified by forecast level. Notation: ō_k.  We have two forecast levels in the example, 0.2 and 0.8. For forecast level 0.2, we have an observed mean of 2/6. For forecast level 0.8, we have an observed mean of 3/6.
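
Computing the observed means at each forecast level can be sketched in Python; the twelve forecast/outcome pairs are the ones from the deck's example, and observed_means is a hypothetical helper:

```python
from collections import defaultdict

# The twelve forecast/outcome pairs used in the deck's example.
forecasts = [0.2] * 6 + [0.8] * 6
outcomes = [0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1]

def observed_means(forecasts, outcomes):
    """Mean outcome at each forecast level (the o-bar_k of the slides)."""
    by_level = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        by_level[f].append(o)
    return {f: sum(obs) / len(obs) for f, obs in by_level.items()}

print(observed_means(forecasts, outcomes))  # {0.2: 0.333..., 0.8: 0.5}
```

This reproduces the 2/6 and 3/6 quoted on the slide.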

Examples  We had 2/6 that were YES when we gave a 20% chance. Let’s compare the prediction to the successes just for the 20% predictions. For each one we have the same thing: (0.2 – 2/6) 2 +(0.8-4/6) 2 which is about  We had 3/6 that were YES when we gave an 80% chance. We do a similar computation and we get (0.8-3/6) 2 + (0.2-3/6) 2 =0.222 or so.

Reliability  So the total reliability score is going to be computed: This yields a total score of approximately for the reliability component REL. PredictionReliabilityRepeatsTotal

Resolution  When we said different things would happen, how different really were they? We’re going to contrast the observed means at different forecast levels.  Specifically, we want the variance of the distribution of the observed means at different forecast levels.

Example  Using the simple example: the overall mean is 5/12. Then compute 2*(2/6 – 5/12) 2 for every observation in the first group, and 2*(3/6-5/12) 2 for every observation in the second group.  The total resolution component RES is 6*2*(2/6 – 5/12) 2 +2*(3/6-5/12) 2 = 1/6.

Uncertainty  In the classical Murphy formula, this is computed by calculating the mean observation times one minus the mean—and adding this up for each observation.  For our example, the uncertainty is or so.

Murphy decomposition  Murphy (1973) showed that the Brier score can be decomposed as follows:  BS=REL-RES+UNC  N.B. the negative sign in front of resolution  High uncertainty contributes to high Brier score (all else being equal)  High discrepancy of the observed means at each level from the forecasts raises the Brier score  But having those observed means at each level separate from each other lowers the Brier score

Script

library(plyr)  # provides ddply, used in gen.obark below

brier.component <- function(forecast, observation) {
  2 * (forecast - observation)^2
}

reliability <- function(fk, ok) {
  2 * (fk - ok)^2
}

resolution <- function(obar, ok) {
  2 * (ok - obar)^2
}

# For each forecast, the observed mean outcome among all forecasts
# at the same level, returned in the original observation order.
gen.obark <- function(predictions, outcome, key) {
  tmp <- data.frame(key = key, outcome = outcome, pred = predictions)
  f1 <- merge(tmp,
              ddply(tmp, .(pred), function(x) { mean(x$outcome) }),
              by = "pred")
  f1 <- f1[order(f1$key), ]
  f1$V1
}

Example

ds <- c(0,1,0,0,0,1,1,0,0,0,1,1)
fs <- c(rep(0.2, 6), rep(0.8, 6))
obark <- gen.obark(fs, ds, 1:12)

rel <- sum(reliability(fs, obark))
res <- sum(resolution(mean(ds), obark))
unc <- 12 * 2 * (mean(ds) * (1 - mean(ds)))
brs <- sum(brier.component(fs, ds))

(rel - res + unc) - brs
# zero up to floating-point error (on the order of 1e-15)
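
For readers without R, the same decomposition check can be sketched in Python, using the same data and the same factor-of-2 convention:

```python
forecasts = [0.2] * 6 + [0.8] * 6
outcomes = [0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1]
n = len(outcomes)

# Observed mean outcome at each forecast level, and overall.
obar = {
    f: sum(o for ff, o in zip(forecasts, outcomes) if ff == f)
       / forecasts.count(f)
    for f in set(forecasts)
}
obar_all = sum(outcomes) / n

# All components use the deck's two-term (factor-of-2) convention.
BS = sum(2 * (f - o) ** 2 for f, o in zip(forecasts, outcomes))
REL = sum(2 * (f - obar[f]) ** 2 for f in forecasts)
RES = sum(2 * (obar[f] - obar_all) ** 2 for f in forecasts)
UNC = n * 2 * obar_all * (1 - obar_all)

print(BS)               # about 6.96
print(REL - RES + UNC)  # matches BS: REL - RES + UNC = BS
```

Here REL is about 1.293, RES is 1/6, and UNC is about 5.83, matching the worked example.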

General form  B: Brier score; K: number of forecast levels; n_k: the number of forecasts at level k; N = n_1 + … + n_K: total number of forecasts. Other quantities as defined earlier.

B = Σ_{k=1..K} 2 n_k (f_k - ō_k)^2 - Σ_{k=1..K} 2 n_k (ō_k - ō)^2 + 2 N ō (1 - ō)

where ō is the overall observed mean.

More general forms  More general decompositions (Stephenson, five terms)  Continuous forms

Alternatives to Brier score  Decomposition into Reliability, Resolution, Uncertainty works for information measures (Weijs et al)

Information measures  Surprise  Expected surprise  Surprising a model

Surprisal  How much do you learn if two equally likely alternatives are disclosed to you?  Heads vs tails; 1 vs 2

Surprisal  How much do you learn if 3 are disclosed?  A, B, or C  How much if I disclose one of 26 equally likely outcomes? You should have learned more

Surprisal  Standard example:  Now combine two independent things. If we had 1 vs 2, and A, B, or C:  A/1  C/2  B/1  B/2  …

Surprisal  Six equally likely things  s(2 x 3) = s(2) + s(3)  Uniquely useful way to do this: define surprisal as log(1/p), log of the reciprocal probability  Stevens information tutorial online

Surprisal  Different bases can be used for the logarithm  Log base 2 is traditional  How much information is transmitted if you learn which of two equally likely outcomes happened?  Log 2 (1/(1/2)) = 1  If we use base two, then the value is 1, referred to as 1 bit.  Disclosure of one of two equally likely outcomes reveals one bit of information.  Notation: log base 2 often written lg

Example  Sequence of values of flips of a bent coin with P(heads)=1/3, P(tails)=2/3 ObservedProbability1/ProbabilitySurprisal 02/ / / / / / The average surprisal was or so.

Expected surprisal  Every time an event with probability p i happens, the surprisal is lg(1/p i ).  What is the expected surprisal?

Expected value  E[X] = x_1 p_1 + x_2 p_2 + … = Σ_i x_i p_i

Expected square  E[X^2] = Σ_i x_i^2 p_i

Expected surprisal  E[s] = Σ_i p_i lg(1/p_i)  The expected surprisal is the Shannon entropy, usually written H
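
The expected surprisal (Shannon entropy) can be sketched as (entropy_bits is a hypothetical helper):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: the expected surprisal, sum of p * lg(1/p)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))    # 1.0 bit for a fair coin
print(entropy_bits([1/3, 2/3]))    # about 0.918 bits for the bent coin
print(entropy_bits([0.25] * 4))    # 2.0 bits for four equally likely outcomes
```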

 Shannon entropy can be used to quantify the amount of information provided by a model for a categorical outcome, in much the same way that the squared multiple correlation coefficient can quantify the amount of variance explained by a continuous model  Use of entropy and related measures to assess probabilistic predictions is sometimes recommended (Weijs)

Estimated entropy  If you have a Bernoulli trial, what is the entropy? The success probability is p.  H=-(1-p)lg(1-p) – p lg(p)  If we estimate p from data and plug this in, we get an estimator of the entropy.  Example: if we observe 8 successes in 20 trials, the observed frequency is 0.4, and the estimated entropy would be about

What does a model do?  If we now have a statistical model that provides a better prediction, we can compute the conditional entropy.  Without the model, our best estimate of the chance of something is just the observed relative frequency.  But with the model, maybe we have a better estimate.

Example  Logistic regression: use data from the Mycotic Ulcer Treatment Trial (Prajna et al 2012, Lietman group).  Using just the chance of having a transplant or perforation, the estimated entropy is About 16% of patients had such an event.

Example  Now, let’s use a covariate, such as what treatment a person got.  Now, we have a better (we hope) estimate of the chance of a bad outcome for each person. These are 11% for treatment 1 and 21% for treatment 2.

Example  For group 1, the chance of a bad outcome was 11%. The entropy given membership in this group is about  For group 2, the chance of a bad outcome was 21% or so. The entropy given membership in this group is about  The weighted average of these is the conditional entropy given the model. This gives us just

Example  So the entropy was 0.636, and now it’s The treatment model does not predict very much.  If we look at the proportional reduction, we get 2.2%; this model explained 2.2% of the uncertainty.  This is exactly the McFadden R 2 you get in logistic regression output.

Summary  Probabilistic forecasts can be assessed using the Brier score.  The Brier score can be decomposed into reliability, resolution, and uncertainty components.  Information theoretic measures can also be used in assessment.  Information theoretic measures are useful in other biostatistical applications as well.