National Research Council Canada / Conseil national de recherches Canada
User’s Guide to the ‘QDE Toolkit Pro’
Excel Tools for Presenting Metrological Comparisons
by B.M. Wood, R.J. Douglas & A.G. Steele
April 16, 2002

Chapter 5. Tables of Confidence-in-CMCs

This chapter presents the macro for calculating the confidence that a comparison establishes in a similar CMC. In this circumstance it can provide a rigorous basis for voting on others’ CMCs, and for defending the validity of your own CMC to other comparison participants. A graphical development of the confidence calculations is included to help you justify your conclusions to skeptics who are frightened of ‘probability calculus’: the graphs show that this is simply the adding of probabilities.

Quantified Demonstrated Confidence for agreement within $\mathrm{CMC}_{\mathrm{lab}}$
The Toolkit includes:
tk_mraQDC_TableBuilder: Creates a table of the probability that the column lab and the row lab agree within the row lab’s CMC confidence interval.
$\mathrm{CMC}_{\mathrm{lab}}$ defaults to $2u_{\mathrm{lab}}$, but it can be edited, and the edited value is used if the macro is re-run at the same anchor.
Your ROW lets you defend your CMC, and your COLUMN can substantiate your vote on others’ CMCs.
Again, many pop-up comments are added automatically.

tk_mraQDC_TableBuilder
You can edit the CMCs and re-run the macro in the same place.
Pop-up comments help to keep track of everything.
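
The row/column convention of the table can be illustrated with a short Python sketch. This is not the macro’s code: the function names `qdc` and `build_qdc_table`, the lab names, and the numerical values are hypothetical, and the sketch assumes uncorrelated labs with normal distributions, so that the difference between a row lab and a column lab is normal with mean $m_{\mathrm{row}} - m_{\mathrm{col}}$ and standard uncertainty $\sqrt{u_{\mathrm{row}}^2 + u_{\mathrm{col}}^2}$.

```python
import math
from statistics import NormalDist

def qdc(d, up, cmc):
    """Probability that a normal difference (mean d, std up) lies within +/-cmc."""
    z = NormalDist()
    return z.cdf((cmc - d) / up) - z.cdf((-cmc - d) / up)

def build_qdc_table(results, cmcs=None):
    """results: {lab: (value, standard uncertainty)}; cmcs: optional {lab: half-width}.

    Entry table[row][col] is the confidence for agreement of the column lab
    and the row lab within the ROW lab's CMC interval.
    """
    cmcs = dict(cmcs or {})
    table = {}
    for row, (m_row, u_row) in results.items():
        cmc_row = cmcs.get(row, 2.0 * u_row)  # default CMC = 2 * u_lab
        table[row] = {}
        for col, (m_col, u_col) in results.items():
            if col == row:
                table[row][col] = None  # no self-comparison
                continue
            up = math.hypot(u_row, u_col)  # assumption: zero correlation
            table[row][col] = qdc(m_row - m_col, up, cmc_row)
    return table

# Hypothetical comparison data, for illustration only.
results = {"Lab A": (10.0, 0.4), "Lab B": (10.9, 0.3), "Lab C": (10.2, 0.2)}
table = build_qdc_table(results)
for row, cols in table.items():
    cells = ["  --  " if p is None else f"{100 * p:5.1f}%" for p in cols.values()]
    print(row, *cells)

# Editing one CMC and rebuilding mimics re-running the macro with an edited CMC.
wider = build_qdc_table(results, cmcs={"Lab B": 1.0})
```

Reading along a lab’s row shows how well the comparison supports that lab’s own CMC; reading down its column shows the support it can offer for each of the other labs’ CMCs.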

Quantified Demonstrated Confidence for agreement within $\mathrm{CMC}_{\mathrm{lab}}$
Your tables will be more useful if it is easy for you to convert skeptics to your point of view. The following pages show the underlying simplicity of the confidence calculations for the QDC table.

Why are there skeptics? Inter-laboratory comparisons do not fit comfortably into either the frequentist or the Bayesian viewpoint of statistics. A bilateral comparison clearly involves two opinions (cheers from Bayesians) concerning the value of the measurand, but its utility rests almost entirely on its predictive power, which will be tested by further measurements (cheers from the frequentists). Statisticians from either school can be expected to try to answer slightly different questions that are more appealing to them.

As metrologists, we believe that for a bilateral inter-laboratory comparison, the best description is the reported difference, with the two opinions about the uncertainties combined into the pair uncertainty. For equivalence, we primarily address questions about how well future, similar measurements would agree WITH EACH OTHER, and only secondarily with the pooled mean of the “population” of the two laboratories, which statisticians have focussed on in the past but which is not usually available.

Skeptical statisticians are not the sole custodians of probability calculus: chemists and physicists doing calculations in quantum mechanics perform vastly more involved probability calculations than do most statisticians. Physicists and chemists also use more complete experimental validations! If you are seeking allies in probability calculus, check with them!

Probability Calculus of One Result
Key Concepts for Calculating Confidence
A metrologist’s result is a mean value and an uncertainty.
This result represents a probability distribution function, as described in the ISO Guide to the Expression of Uncertainty in Measurement.
The fractional probability of observing a value between a and b is the normalized integral of the probability distribution function over the interval [a, b].
This is just the addition of all the ‘bits’ of the function between a and b.
Joint probability distributions, which combine results from two different labs, can be calculated and handled in the same way.

One Lab’s Measurement Data
[Figure: normal distribution with the 1σ (68%) and 2σ (95%) intervals marked]
The Laboratory Value and Uncertainty are reported.
The Uncertainty is representative of an underlying normal probability distribution, calculated and reported per the ISO Guide to the Expression of Uncertainty in Measurement.
Integrating yields the familiar 68% for 1σ and 95% for 2σ.
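
The ‘adding of bits’ can be made concrete with a minimal Python sketch. The function name `summed_probability` and the numerical values are illustrative assumptions, not part of the Toolkit; the sketch simply adds up thin slices of a normal probability distribution function and recovers the familiar coverage figures of roughly 68% and 95%.

```python
from statistics import NormalDist

def summed_probability(mean, u, a, b, steps=20_000):
    """Add up pdf(x) * dx over [a, b] using the midpoint rule."""
    dist = NormalDist(mu=mean, sigma=u)
    dx = (b - a) / steps
    return sum(dist.pdf(a + (i + 0.5) * dx) for i in range(steps)) * dx

mean, u = 10.0, 0.5  # illustrative values, not taken from the guide
p_1s = summed_probability(mean, u, mean - u, mean + u)          # within 1 sigma
p_2s = summed_probability(mean, u, mean - 2 * u, mean + 2 * u)  # within 2 sigma
print(f"within 1 sigma: {p_1s:.3f}, within 2 sigma: {p_2s:.3f}")
```

The same sum over any other interval [a, b] gives the probability of observing a value in that interval, which is all that the later confidence calculations require.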

One Lab’s Measurement Data Repeated
If the measurement is repeated, the value may move around a little.
With many repeated experiments, the uncertainty may get a bit smaller, and the mean may move within the original error bar, but the metrologist expects repeated measurements to be close to the originally reported value.

Two Labs in a Comparison
Two metrologists (Lab 1 and Lab 2) measure the same artifact in a comparison experiment.
Two values ($m_1$ and $m_2$) and uncertainties ($u_1$ and $u_2$) are reported, perhaps with a correlation coefficient $r_{12}$.
Here, the results are representative of the Lab 1 and Lab 2 normal distributions, or of Student distributions with reported effective degrees of freedom $\nu_1$ and $\nu_2$.

Two Labs in a Comparison - Expectations
Neither the Red lab nor the Blue lab thinks that its mean will move “very far” in terms of its reported uncertainty.
Neither one thinks that anything will happen to its real mean value just because the other lab did or did not measure the artifact.

MRA Bilateral Degree of Equivalence
The degree of equivalence is the difference in the values together with the uncertainty of the difference.
This is a two-parameter description of the comparison.
We use two separate graphs to illustrate this.
$d_{12} = m_1 - m_2$
$u_p^2 = u_1^2 + u_2^2 - 2 r_{12} u_1 u_2$
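
As a worked instance of these two formulas, the following Python sketch computes the pair $(d_{12}, u_p)$. The function name `degree_of_equivalence` and the input values are illustrative assumptions; they are not the data behind the guide’s graphs.

```python
import math

def degree_of_equivalence(m1, u1, m2, u2, r12=0.0):
    """Return (d12, up): the difference and its standard (pair) uncertainty."""
    d12 = m1 - m2
    variance = u1**2 + u2**2 - 2.0 * r12 * u1 * u2
    up = math.sqrt(max(0.0, variance))  # guard against tiny negative rounding
    return d12, up

# Illustrative values only: uncorrelated labs.
d12, up = degree_of_equivalence(m1=10.9, u1=0.4, m2=10.0, u2=0.3)
print(f"d12 = {d12:.3f}, up = {up:.3f}")

# A positive correlation between the labs shrinks the pair uncertainty.
_, up_corr = degree_of_equivalence(m1=10.9, u1=0.4, m2=10.0, u2=0.3, r12=0.5)
```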

MRA Degree of Equivalence
The DoE is a two-parameter description of the difference between the laboratories, as revealed by the comparison experiments.
This is what goes into Appendix B of the MRA.
This is the most complete answer to the question: “What is the difference between Lab 1 and Lab 2?”
However, the DoE requires interpretation to answer simple questions about “equivalence”.

“Degree of Equivalence” and “Equivalence”
Can the DoE be used to answer a simple question: “Are the results from Lab 1 and Lab 2 equivalent to each other?”
This is what users of Appendix C of the MRA are really interested in, although the question is not precise enough to answer without further refinement.
Equivalence is a single-parameter concept. What can we do with the two-parameter DoE to convert it to a single-parameter description?

Equivalence and the Null Hypothesis
The common approach is to test whether or not the results for Lab 1 and Lab 2 can be considered as samples from a single population (remember, this was NOT QUITE the question we just asked).
With a very large number of re-measurements, the single-population or null hypothesis could be tested very well. It would likely reveal a small difference between the mean of the Lab 1 results and the mean of the Lab 2 results.
If (and only if) the mean difference is much less than the pair uncertainty, then the single population is a good description of the two labs’ results.

Equivalence and the Null Hypothesis
Test whether the results for Lab 1 and Lab 2 can be considered as samples from a single population. With just one comparison this is a very weak test.
Statistical criterion for answering this question: internal and external consistency, i.e. the Birge ratio $s_i/s_e \approx 1$.
Another method in use among the CCs: the normalized error, i.e. $E_n(k=2) < 1$.

Using $E_n$ to test the Null Hypothesis
For this example data: $E_n(k=2) = 0.9 < 1$
$u_p^2 = u_1^2 + u_2^2 - 2 r_{12} u_1 u_2$
$U_p = k\,u_p$
$E_n = |m_1 - m_2| / U_p$
$E_n < 1$, and so the Null Hypothesis is accepted: Lab 1 and Lab 2 are equivalent.
The simplest estimator for the Single Population is centered at the average value $(m_1 + m_2)/2$ and has standard deviation $u_p$.
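
The normalized-error test can be sketched in a few lines of Python. The function name `normalized_error` and the inputs are illustrative assumptions; the inputs were chosen only so that $E_n(k=2)$ comes out near 0.9, and they are not the example data plotted in the guide.

```python
import math

def normalized_error(m1, u1, m2, u2, r12=0.0, k=2.0):
    """E_n = |m1 - m2| / (k * up), with up the pair standard uncertainty."""
    up = math.sqrt(u1**2 + u2**2 - 2.0 * r12 * u1 * u2)
    expanded_up = k * up  # U_p = k * u_p
    return abs(m1 - m2) / expanded_up

# Illustrative values only (uncorrelated labs, k = 2).
en = normalized_error(m1=10.9, u1=0.4, m2=10.0, u2=0.3)
verdict = "accepted" if en < 1.0 else "not accepted"
print(f"En(k=2) = {en:.2f}: null hypothesis {verdict}")
```

The test returns only a yes/no verdict against the cutoff $E_n < 1$; it does not by itself state the confidence level associated with that verdict.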

Consequences of Null Hypothesis Testing
When the Null Hypothesis IS NOT ACCEPTED:
The methodology generates no further information about the results.
We do not have statistical confidence that the two results are from a Single Population.
When the Null Hypothesis IS ACCEPTED:
Lab 1 and Lab 2 must accept that their mean values were “really” at the aggregate mean for the Single Population (the KCRV).
Similar reasoning applies to their uncertainty claims.
We have no idea what confidence level for agreement was just accepted, because our “cutoff” rules are not quantified.

Fuzziness of Null Hypothesis Testing
What confidence do we have when $E_n(k=2) < 1$?
If $E_n = 0$, the confidence within $\pm U_p$ is 95% (note that this is $\pm U_p$, not $\pm U_1$ or $\pm U_2$, nor $\pm u_1$ or $\pm u_2$).
If $E_n = 1$, we have 95% confidence that the Null Hypothesis is false.
This is definitely not the same as saying “If $E_n < 1$, then we have demonstrated a reasonable level of confidence in the null hypothesis.”
For our given inputs, the confidence for agreement within some tolerance can be calculated.
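
One way to see this fuzziness is to compute the confidence for agreement within $\pm U_p$ as a function of $E_n$. The Python sketch below rests on a stated assumption: the difference in a repeat comparison is taken to be normally distributed about the observed difference $d_{12} = E_n U_p$ with standard deviation $u_p$. The function name `confidence_within_Up` is hypothetical.

```python
from statistics import NormalDist

def confidence_within_Up(en, k=2.0):
    """P(|D| < U_p) for D ~ Normal(mean = en * k * up, std = up).

    Working in units of up, the interval [-k, +k] is integrated for a
    unit-variance normal distribution centered at en * k.
    """
    z = NormalDist()
    center = en * k
    return z.cdf(k - center) - z.cdf(-k - center)

for en in (0.0, 0.5, 0.9, 1.0):
    print(f"En = {en:.1f}: confidence within +/-Up = {100 * confidence_within_Up(en):.1f}%")
```

Under this assumption the confidence falls steadily from about 95% at $E_n = 0$ to about 50% at $E_n = 1$, so passing the $E_n < 1$ cutoff does not correspond to any single confidence level.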

Quantifying Equivalence to this KCRV
What is the probability that a repeat comparison would have a KCRV value within Lab 1’s 95% uncertainty interval?
[Figure: Lab 1’s 95% interval marked on the distribution]
Probability Calculus tells us the answer: for this example, QDC = 71%.
This is the kind of thinking that prevailed in preparing the MRA.

Quantifying Equivalence to this KCRV
What is the probability that a repeat comparison would have a KCRV value within Lab 2’s 95% uncertainty interval?
[Figure: Lab 2’s 95% interval marked on the distribution]
Probability Calculus tells us the answer: for this example, QDC = 50%.
This is the kind of thinking that prevailed in preparing the MRA.

Quantifying Bilateral Equivalence
We believe that bilateral equivalence is more important (and often more meaningful!) than equivalence to a Key Comparison Reference Value.
We argued this point at BIPM meetings, so that Appendix B also includes bilateral tables of equivalence.
We can ask the same kinds of questions, ignoring the KCRV completely.
The same kind of probability calculus works here, but we use the joint probability distribution for Lab 1 and Lab 2 appropriate for the particular question.

Quantifying Equivalence
What is the probability that a repeat comparison would have a Lab 2 value within Lab 1’s 95% uncertainty interval?
[Figure: Lab 1’s $\mathrm{CMC}_1$ 95% interval marked on the distribution]
Probability Calculus tells us the answer: for this example, QDC = 47%.
Lab 1’s CMC interval will usually be a different size: if it is larger, then the integrated confidence is larger.
This is exactly the type of “awkward question” that a client might ask!
Recall that QDC = 71% for Lab 1’s interval including the KCRV.

Quantifying Equivalence
What is the probability that a repeat comparison would have a Lab 1 value within Lab 2’s 95% uncertainty interval?
[Figure: Lab 2’s $\mathrm{CMC}_2$ 95% interval marked on the distribution]
Probability Calculus tells us the answer: for this example, QDC = 22%.
Lab 2’s CMC interval will usually be a different size: if it is larger, then the integrated confidence is larger.
These subtly different “awkward” questions have very different “straightforward” answers!
Recall that QDC = 50% for Lab 2’s interval including the KCRV.
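
The two bilateral questions can be sketched with the same integral, changing only the interval. The Python below is a sketch under stated assumptions, not the Toolkit’s implementation: the difference in a repeat comparison is taken as normal with mean $d_{12}$ and standard deviation $u_p$, each lab’s interval is taken as $\pm 2u$, and the input values are invented. It therefore does not reproduce the 47% and 22% quoted for the guide’s example, whose data are not listed here.

```python
import math
from statistics import NormalDist

def qdc(d12, up, half_width):
    """Probability that the difference of a repeat comparison lies within +/-half_width."""
    z = NormalDist()
    return z.cdf((half_width - d12) / up) - z.cdf((-half_width - d12) / up)

# Illustrative, uncorrelated results (not the guide's example data).
m1, u1 = 10.0, 0.4
m2, u2 = 10.9, 0.3
d12 = m1 - m2
up = math.hypot(u1, u2)

qdc_in_lab1 = qdc(d12, up, 2.0 * u1)  # Lab 2 value within Lab 1's interval
qdc_in_lab2 = qdc(d12, up, 2.0 * u2)  # Lab 1 value within Lab 2's interval
print(f"within Lab 1's interval: {100 * qdc_in_lab1:.0f}%")
print(f"within Lab 2's interval: {100 * qdc_in_lab2:.0f}%")

# A larger interval always integrates more probability.
qdc_wider = qdc(d12, up, 3.0 * u1)
```

The two answers differ only because the two intervals differ in size, which is why the lab with the larger interval obtains the larger integrated confidence.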

Tricky things about Equivalence
[Figure: the two 95% intervals, with QDC = 47% and QDC = 22%]
Equivalence is not transitive: Lab 1 being equivalent to the KCRV and Lab 2 being equivalent to the KCRV does not necessarily imply that Lab 1 is equivalent to Lab 2!
Equivalence (within different CMCs) does not commute: we are asking two very different questions here!

Probability Calculus Conclusions
Degrees of Equivalence can be interpreted using $\mathrm{QDE}_{0.95}$ and QDC to provide quantitative answers to complicated questions.
Probability calculus illustrates, and can resolve, some of the misconceptions associated with the application of Null Hypothesis testing and with interpretations of “equivalence”.
Probability calculus can give quantitative guidance about the confidence in a CMC claim, as viewed from the perspective of another participant in a Key Comparison.
The above material is taken from an invited presentation titled “Probability Calculus…” given by A.G. Steele at NIST in July 2001.