Discussion on fitting and averages including “errors on errors”

Discussion on fitting and averages including “errors on errors”
LAL Orsay, 23 March 2018
Glen Cowan, Physics Department, Royal Holloway, University of London
g.cowan@rhul.ac.uk, www.pp.rhul.ac.uk/~cowan

Outline
Review of parameter estimation (pedagogical)
Some examples of averaging measurements
Averaging with errors on errors

Distribution, likelihood, model
Suppose the outcome of a measurement is x (e.g., a number of events, a histogram, or some larger set of numbers). The probability density (or mass) function, or ‘distribution’, of x, which may depend on parameters θ, is P(x|θ). (Independent variable is x; θ is a constant.) If we evaluate P(x|θ) with the observed data and regard it as a function of the parameter(s), then this is the likelihood: L(θ) = P(x|θ) (data x fixed; treat L as a function of θ). We will use the term ‘model’ to refer to the full function P(x|θ) that contains the dependence on both x and θ.

Maximum likelihood
The most important frequentist method for constructing estimators is to take the value of the parameter(s) that maximizes the likelihood. The resulting estimators are functions of the data and thus characterized by a sampling distribution with a given (co)variance; in general they may have a nonzero bias. Under conditions usually satisfied in practice, the bias of ML estimators is zero in the large-sample limit, and the variance is as small as possible for unbiased estimators. In some cases the ML estimator may not be regarded as the optimal trade-off between these criteria (cf. regularized unfolding).
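The slide's equations are not reproduced in the transcript; the standard definitions being referred to are

$$ \hat{\boldsymbol\theta} = \arg\max_{\boldsymbol\theta} L(\boldsymbol\theta), \qquad V_{ij} = \mathrm{cov}[\hat\theta_i, \hat\theta_j], \qquad b_i = E[\hat\theta_i] - \theta_i . $$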

Example: fitting a straight line
Data: (xi, yi), i = 1,...,n. Model: the yi are independent and each follows yi ~ Gauss(μ(xi), σi), with μ(x; θ0, θ1) = θ0 + θ1 x; assume the xi and σi are known. Goal: estimate θ0. Here suppose we don’t care about θ1 (an example of a “nuisance parameter”).

Maximum likelihood fit with Gaussian data
In this example the yi are assumed independent, so the likelihood function is a product of Gaussians. Maximizing the likelihood is here equivalent to minimizing χ²(θ0, θ1) = Σi (yi − μ(xi; θ0, θ1))²/σi², i.e., for Gaussian data, ML is the same as the Method of Least Squares (LS).
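As an illustration (not from the slides), a minimal numerical sketch of such a fit, using hypothetical data values and a general-purpose minimizer:

```python
# Sketch: ML / least-squares fit of a straight line mu(x) = theta0 + theta1*x
# to independent Gaussian data. The x, y, sigma values are made up for illustration.
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # control variable (known)
y = np.array([1.7, 2.3, 3.1, 3.4, 4.2])     # measured values
sigma = np.full_like(y, 0.3)                # known standard deviations

def chi2(theta):
    """chi^2 = -2 ln L (up to a constant) for independent Gaussian yi."""
    theta0, theta1 = theta
    mu = theta0 + theta1 * x
    return np.sum(((y - mu) / sigma) ** 2)

res = minimize(chi2, x0=[0.0, 1.0])         # default BFGS
theta_hat = res.x
# Covariance of the estimators from the inverse Hessian of chi^2/2
# (equivalent to the Delta-chi^2 = 1 prescription for one parameter).
cov = 2.0 * res.hess_inv
print("theta_hat =", theta_hat)
print("std. devs =", np.sqrt(np.diag(cov)))
```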

θ1 known a priori
For Gaussian yi, ML is the same as LS. Minimize χ² → estimator θ̂0. Come up one unit from χ²min (i.e., Δχ² = 1) to find σθ̂0.

ML (or LS) fit of θ0 and θ1
Standard deviations from the tangent lines to the χ² = χ²min + 1 contour. Correlation between θ̂0 and θ̂1 causes the errors to increase.

If we have a measurement t1 ~ Gauss(θ1, σt1), the information on θ1 improves the accuracy of θ̂0.

The Bayesian approach
In Bayesian statistics we can associate a probability with a hypothesis, e.g., a parameter value θ. Interpret the probability of θ as ‘degree of belief’ (subjective). We need to start with a ‘prior pdf’ π(θ); this reflects the degree of belief about θ before doing the experiment. Our experiment has data x → likelihood function L(x|θ). Bayes’ theorem tells us how our beliefs should be updated in light of the data x. The posterior pdf p(θ|x) contains all our knowledge about θ.
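The slide's equation is not reproduced in the transcript; the form of Bayes' theorem being used is the standard one:

$$ p(\theta \mid x) = \frac{L(x \mid \theta)\,\pi(\theta)}{\int L(x \mid \theta')\,\pi(\theta')\,d\theta'} \;\propto\; L(x \mid \theta)\,\pi(\theta) . $$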

Bayesian method
We need to associate prior probabilities with θ0 and θ1: e.g., a ‘non-informative’ prior for θ0 (in any case much broader than the likelihood), and for θ1 a prior based on the previous measurement t1. Putting this into Bayes’ theorem gives: posterior ∝ likelihood × prior.

Bayesian method (continued)
We then integrate (marginalize) p(θ0, θ1 | x) to find p(θ0 | x) = ∫ p(θ0, θ1 | x) dθ1. In this example we can do the integral in closed form (rare). Usually one needs numerical methods (e.g., Markov Chain Monte Carlo) to do the integral.

Digression: marginalization with MCMC
Bayesian computations involve integrals that are often of high dimensionality and impossible in closed form, and also impossible with ‘normal’ acceptance-rejection Monte Carlo. Markov Chain Monte Carlo (MCMC) has revolutionized Bayesian computation. MCMC (e.g., the Metropolis-Hastings algorithm) generates a correlated sequence of random numbers: it cannot be used for many applications, e.g., detector MC, and the effective statistical error is greater than if all values were independent. Basic idea: sample the full multidimensional parameter space, then look, e.g., only at the distribution of the parameters of interest.

MCMC basics: Metropolis-Hastings algorithm
Goal: given an n-dimensional pdf p(θ), generate a sequence of points θ1, θ2, θ3,...
Proposal density q(θ; θ0), e.g., a Gaussian centred about θ0.
1) Start at some point θ0
2) Generate a proposed point θ ~ q(θ; θ0)
3) Form the Hastings test ratio α = min[1, p(θ) q(θ0; θ) / (p(θ0) q(θ; θ0))]
4) Generate u ~ Uniform(0, 1)
5) If u ≤ α, move to the proposed point; else keep the old point (old point repeated)
6) Iterate

Metropolis-Hastings (continued)
This rule produces a correlated sequence of points (note how each new point depends on the previous one). For our purposes this correlation is not fatal, but the statistical errors are larger than if the points were independent. The proposal density can be (almost) anything, but choose it so as to minimize the autocorrelation. Often one takes the proposal density symmetric, q(θ; θ0) = q(θ0; θ); the test ratio is then (Metropolis-Hastings) α = min[1, p(θ)/p(θ0)]. I.e., if the proposed step is to a point of higher p(θ), take it; if not, take the step only with probability p(θ)/p(θ0). If the proposed step is rejected, hop in place (repeat the old point).
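A minimal code sketch of the Metropolis rule just described (symmetric Gaussian proposal), using a hypothetical two-parameter log-posterior as a stand-in; not taken from the slides:

```python
# Metropolis sampler sketch with a symmetric Gaussian proposal.
import numpy as np

def log_post(theta):
    # Placeholder: a correlated 2-d Gaussian standing in for ln p(theta0, theta1 | x).
    return -0.5 * (theta[0]**2 + theta[1]**2 + theta[0] * theta[1])

def metropolis(log_post, theta0, n_steps=10000, step_size=0.5, seed=1):
    rng = np.random.default_rng(seed)
    chain = np.empty((n_steps, len(theta0)))
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    for i in range(n_steps):
        prop = theta + step_size * rng.standard_normal(len(theta))  # symmetric proposal
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept with prob min(1, p'/p)
            theta, lp = prop, lp_prop             # move to proposed point
        chain[i] = theta                          # rejected -> old point repeated
    return chain

chain = metropolis(log_post, theta0=[0.0, 0.0])
# Marginalize by simply looking at one coordinate of the chain:
print("theta0: mean = %.3f, std = %.3f" % (chain[:, 0].mean(), chain[:, 0].std()))
```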

Example: posterior pdf from MCMC
Sample the posterior pdf from the previous example with MCMC. Summarize the pdf of the parameter of interest with, e.g., its mean, median, standard deviation, etc. Although the numerical values of the answer here are the same as in the frequentist case, the interpretation is different (sometimes unimportant?).

Bayesian method with alternative priors
Suppose we don’t have a previous measurement of θ1 but rather, e.g., a theorist says it should be positive and not too much greater than 0.1 “or so”; encode this as a prior π(θ1). From this we obtain (numerically) the posterior pdf for θ0, which summarizes all knowledge about θ0. Look also at the result from a variety of priors.

A more general fit (symbolic)
Given measurements y1,...,yN and (usually) their covariances Vij. The predicted value for each measurement is built from an expectation value μ(x; θ) (x = control variable, θ = parameters) plus a possible bias; often the bias is taken to be zero. Minimize the χ² formed from the residuals and the inverse covariance matrix. This is equivalent to maximizing L(θ) ∝ e^(−χ²/2), i.e., least squares is the same as maximum likelihood using a Gaussian likelihood function.
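The χ² referred to above is not written out in the transcript; under the stated Gaussian assumption it takes the standard generalized least-squares form

$$ \chi^2(\boldsymbol\theta) = \big(\mathbf{y} - \boldsymbol\mu(\boldsymbol\theta)\big)^{T} V^{-1} \big(\mathbf{y} - \boldsymbol\mu(\boldsymbol\theta)\big) . $$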

Its Bayesian equivalent
Take prior pdfs for θ and the bias b, form the joint probability for all parameters using Bayes’ theorem, and to get the desired probability for θ, integrate (marginalize) over b: p(θ | y) = ∫ p(θ, b | y) db. → The posterior is Gaussian with mode the same as the least-squares estimator, and σθ the same as from χ² = χ²min + 1. (Back where we started!)

The error on the error
Some systematic errors are well determined: error from a finite Monte Carlo sample.
Some are less obvious: do the analysis in n ‘equally valid’ ways and extract the systematic error from the ‘spread’ in results.
Some are educated guesses: guess the possible size of missing terms in a perturbation series; vary the renormalization scale.
Can we incorporate the ‘error on the error’? (cf. G. D’Agostini 1999; Dose & von der Linden 1999)

Motivating a non-Gaussian prior πb(b)
Suppose now the experiment is characterized by a model in which si is an (unreported) factor by which the systematic error is over/under-estimated. Assume the correct error for a Gaussian πb(b) would be siσisys, so πb(b) is obtained by averaging the Gaussian over a prior πs(si). The width σs of πs(si) reflects the ‘error on the error’.
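One natural way to write the prior just described (an assumption consistent with the text, not copied verbatim from the slide): marginalize the Gaussian bias prior over the uncertain scale factor s,

$$ \pi_b(b) = \int_0^\infty \mathrm{Gauss}\big(b;\, 0,\, s\,\sigma^{\rm sys}\big)\, \pi_s(s)\, ds . $$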

Error-on-error function πs(s)
A simple unimodal probability density for 0 < s < ∞ with adjustable mean and variance is the Gamma distribution: mean = b/a, variance = b/a². We want, e.g., an expectation value of 1 and an adjustable standard deviation σs. In fact, if we took πs(s) ~ inverse Gamma, we could integrate πb(b) in closed form (cf. D’Agostini, Dose, von der Linden). But Gamma seems more natural and the numerical treatment is not too painful.
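With the parametrization quoted above (mean = b/a, variance = b/a²), requiring a mean of 1 and standard deviation σs fixes the parameters (simple algebra, not shown on the slide):

$$ \frac{b}{a} = 1, \qquad \frac{b}{a^2} = \sigma_s^2 \;\;\Rightarrow\;\; a = b = \frac{1}{\sigma_s^2} . $$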

Prior for bias πb(b) now has longer tails:
Gaussian (σs = 0): P(|b| > 4σsys) = 6.3 × 10⁻⁵
σs = 0.5: P(|b| > 4σsys) = 0.65%

A simple test
Suppose a fit effectively averages four measurements. Take σsys = σstat = 0.1, uncorrelated. Case #1: the data appear compatible. (Plot: posterior p(μ|y) vs. μ together with the individual measurements per experiment.) Usually one summarizes the posterior p(μ|y) with its mode and standard deviation.

Simple test with inconsistent data
Case #2: there is an outlier. (Plot: posterior p(μ|y) vs. μ together with the individual measurements per experiment.) → The Bayesian fit is less sensitive to the outlier. (See also D'Agostini 1999; Dose & von der Linden 1999)

Goodness-of-fit vs. size of error
In an LS fit, the value of the minimized χ² does not affect the size of the error on the fitted parameter. In a Bayesian analysis with a non-Gaussian prior for the systematics, a high χ² corresponds to a larger error (and vice versa). (Plot: posterior σμ vs. the least-squares χ², from 2000 repetitions of the experiment with σs = 0.5 and no actual bias.)

Frequentist errors on errors
Despite the nice features of the Bayesian treatment, it has some important drawbacks: the Bayesian gamma-distributed error-on-error requires numerical integration (an inverse-gamma prior for s gives Student’s t, but this allows very large errors). The Particle Physics community does almost every analysis within a frequentist framework, so it is best if one can include errors on errors without changing the entire paradigm. Recently I have been studying the frequentist treatment of errors-on-errors; the outcome is very similar to the Bayesian approach (work in progress). So first, review some properties of frequentist averages...

ML example
Suppose we measure uncorrelated yi ~ Gauss(μ, σyi²), i = 1,...,N, so that the likelihood is a product of Gaussians. Maximizing the likelihood is equivalent to minimizing χ²(μ) = Σi (yi − μ)²/σyi². This gives a linear unbiased estimator with minimum variance (i.e., equivalent to BLUE).
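The resulting estimator and its variance (the standard inverse-variance weighted average, not written out in the transcript) are

$$ \hat\mu = \frac{\sum_i y_i/\sigma_{y_i}^2}{\sum_i 1/\sigma_{y_i}^2}, \qquad V[\hat\mu] = \frac{1}{\sum_i 1/\sigma_{y_i}^2} . $$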

ML example with systematics
Suppose now yi ~ Gauss(μ + bi, σi²), and we have estimates ui of the bias parameters bi, with ui ~ Gauss(bi, σu,i²), so that the likelihood acquires corresponding Gaussian constraint terms. After profiling over the nuisance parameters b, one obtains the same result as before but with σi² → σi² + σu,i². So again this is the same as BLUE, extended to the addition of the statistical and systematic errors in quadrature.
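A sketch of the profiled χ² implied by this model (a straightforward Gaussian-profiling exercise; the explicit formula is not in the transcript):

$$ -2\ln L(\mu,\mathbf b) = \sum_i \left[ \frac{(y_i - \mu - b_i)^2}{\sigma_i^2} + \frac{(u_i - b_i)^2}{\sigma_{u,i}^2} \right] + \text{const} \;\;\xrightarrow{\;\text{profile over } b\;}\;\; \chi^2(\mu) = \sum_i \frac{(y_i - u_i - \mu)^2}{\sigma_i^2 + \sigma_{u,i}^2} . $$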

Example extension: PDG scale factor
Suppose we do not want to take the quoted errors as known constants. Scale the variances by a factor ϕ, i.e., use ϕσi² in place of σi² in the likelihood. The estimator for μ is the same as before; for ϕ, ML gives an estimator which has a bias, while a rescaled version of it is unbiased. The variance of μ̂ is inflated by ϕ.
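The explicit formulas are not reproduced in the transcript; the standard relations consistent with the statements above (and with the PDG scale-factor procedure) are

$$ \hat\varphi_{\rm ML} = \frac{1}{N}\sum_i \frac{(y_i - \hat\mu)^2}{\sigma_i^2} \;\;(\text{biased}), \qquad \hat\varphi = \frac{\chi^2_{\min}}{N-1} \;\;(\text{unbiased}), \qquad V[\hat\mu] = \frac{\hat\varphi}{\sum_i 1/\sigma_i^2}, $$

so that the quoted error on μ̂ is scaled by the PDG factor S = (χ²min/(N − 1))^{1/2}.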

Frequentist errors on errors
Suppose we want to treat the systematic errors as uncertain, so let the σu,i be adjustable nuisance parameters. Suppose we have estimates si for σu,i, or equivalently vi = si² is an estimate of σu,i². Model the vi as independent and gamma distributed. We can set α and β so that they give, e.g., a desired relative uncertainty r in σu.

Gamma model for estimates of variance
Suppose the estimated variance v was obtained as the sample variance from n observations of a Gaussian distributed bias estimate u. In this case one can show v is gamma distributed, with parameters fixed by n and σu². We can relate α and β to the relative uncertainty r in the systematic uncertainty, as reflected by the standard deviation σs of the sampling distribution of s.
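The explicit parameters are not shown in the transcript; for the sample variance of n Gaussian observations the standard result is (n − 1)v/σu² ~ χ²_{n−1}, i.e.

$$ v \sim \mathrm{Gamma}(\alpha, \beta), \qquad \alpha = \frac{n-1}{2}, \qquad \beta = \frac{n-1}{2\sigma_u^2}, $$

and simple error propagation for s = √v gives r ≡ σs/σu ≈ 1/√(2(n−1)), so that α ≈ 1/(4r²) and β = α/σu².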

Full likelihood with errors on errors
Treated like data:
y1,...,yN (the real measurements)
u1,...,uN (estimates of the biases)
v1,...,vN (estimates of the variances of the bias estimates)
Parameters:
μ (parameter of interest)
b1,...,bN (bias parameters)
σu1,...,σuN (systematic errors = std. dev. of the bias estimates)

Full log-likelihood with errors on errors
Setting the parameters of the gamma distributions in terms of the relative uncertainties ri in the systematic errors and the systematic errors σui themselves gives the log-likelihood (written out on the slide).
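A sketch of the full likelihood implied by the ingredients listed above (the explicit expression from the slide is not in the transcript; the Gamma parameters follow the relation sketched earlier):

$$ L(\mu, \mathbf b, \boldsymbol\sigma_u) = \prod_{i=1}^{N} \mathrm{Gauss}(y_i;\, \mu + b_i,\, \sigma_i)\; \mathrm{Gauss}(u_i;\, b_i,\, \sigma_{u,i})\; \mathrm{Gamma}(v_i;\, \alpha_i, \beta_i), \qquad \alpha_i = \frac{1}{4 r_i^2}, \;\; \beta_i = \frac{\alpha_i}{\sigma_{u,i}^2} . $$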

Toy study with errors on errors
MINOS interval (= approximate confidence interval) based on the profile likelihood, i.e., the points where −2 ln L rises by one unit from its minimum. Increased discrepancy between the values to be averaged gives a larger interval. The interval length saturates at roughly the level of the absolute discrepancy between the input values. (Plot: interval size vs. discrepancy for several values of the relative error on the systematic error.)

Goodness of fit with errors on errors
Because we now treat the σui as adjustable parameters, −2 ln L is no longer a sum of squares, so the “usual” χ² is not usable for goodness-of-fit. To assess the goodness of fit, one can define a statistic q based on the profile likelihood ratio. In the numerator of q one assumes a single value of μ for all measurements (the model being tested); the denominator uses the “saturated model”, i.e., an adjustable μi for each measurement, μ = (μ1,..., μN). Asymptotically, from Wilks’ theorem, q ~ chi-square with N − 1 degrees of freedom.
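One natural way to write the statistic just described (the slide's formula is not reproduced; nuisance parameters are profiled in both numerator and denominator):

$$ q = -2 \ln \frac{\max_{\mu,\,\mathbf b,\,\boldsymbol\sigma_u} L(\mu, \mathbf b, \boldsymbol\sigma_u)}{\max_{\boldsymbol\mu,\,\mathbf b,\,\boldsymbol\sigma_u} L(\boldsymbol\mu, \mathbf b, \boldsymbol\sigma_u)} \;\sim\; \chi^2_{N-1} \quad \text{(asymptotically)} . $$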

Goodness of fit with errors on errors (N = 2,3) Asymptotic distribution works well with r = 0.2

Goodness of fit with errors on errors (N = 4,5) Asymptotic distribution works well with r = 0.2

Goodness of fit with errors on errors (N = 6,7) Asymptotic distribution works well with r = 0.2

Goodness-of-fit with large errors on errors: asymptotic distribution starts to break down for large r (panels: r = 0.4 and r = 0.6).

Connection between goodness-of-fit and size of confidence interval
Similar to the Bayesian case, the length of the confidence interval increases if the goodness-of-fit is bad (high q), but only if one includes enough error-on-error.

Further work
Next:
Study sensitivity of the average to outliers with errors-on-errors
Extend to analyses beyond averages, e.g., μ → μ = (μ1,..., μN) = μ(θ)
Still a number of software issues to overcome, e.g., problems with fit convergence and determining the MINOS interval (thanks to Lorenzo Moneta for help with this)

Extra slides

Using LS to combine measurements

Combining correlated measurements with LS

Example: averaging two correlated measurements

Negative weights in LS average

Example of “correlated systematics”
Suppose we carry out two independent measurements of the length of an object using two rulers with different thermal expansion properties. Suppose the temperature is not known exactly but must be measured (the lengths are measured together, so T is the same for both). The expectation value of the measured length Li (i = 1, 2) is related to the true length λ at a reference temperature τ0 through the thermal expansion of each ruler, and the (uncorrected) length measurements are modeled as Gaussian.
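A plausible form of the model equations referred to above, reconstructed from the description (the linear-expansion coefficients αi and the exact form are assumptions, not copied from the slide):

$$ E[L_i] = \lambda\,\big[\,1 + \alpha_i\,(T - \tau_0)\,\big], \qquad L_i \sim \mathrm{Gauss}\big(E[L_i],\, \sigma_i\big), \qquad T_{\rm meas} \sim \mathrm{Gauss}(T,\, \sigma_T) . $$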

Two rulers (2)
The model thus treats the measurements T, L1, L2 as uncorrelated, with standard deviations σT, σ1, σ2, respectively. Alternatively we could correct each raw measurement for the measured temperature, which introduces a correlation between y1, y2 and T. But the likelihood function (multivariate Gaussian in T, y1, y2) is the same function of τ and λ as before. In the language of y1, y2: the temperature gives a correlated systematic. In the language of L1, L2: the temperature gives a “coherent” systematic.

Two rulers (3)
The outcome has some surprises: the estimate of λ does not lie between y1 and y2, and the statistical error on the new estimate of the temperature is substantially smaller than the initial σT. These are features, not bugs, that result from our model assumptions.

Two rulers (4)
We may re-examine the assumptions of our model and conclude that, say, the parameters α1, α2 and τ0 were also uncertain. We may treat their nominal values as measurements (this needs a model; Gaussian?) and regard α1, α2 and τ0 as nuisance parameters.

Two rulers (5)
The outcome changes; some surprises may be “reduced”.

Non-Gaussian likelihoods
The tails of a Gaussian fall off very quickly; this is not always the most realistic model, especially for systematics. Can try, e.g., Student’s t: ν = 1 gives Cauchy, large ν gives Gaussian. Can either fix ν or constrain it like the other nuisance parameters. ML is then no longer equivalent to least squares / BLUE.

Gamma distribution for sys. errors
A first attempt treating the estimates of the systematic errors as log-normal distributed gives a long tail towards large errors; maybe more realistic to use a gamma distribution (here k = α, θ = 1/β). Take s ~ Gamma(α, β) as the “measurement” of the systematic error σu, with relative uncertainty r in the systematic error: α = 1/r², β = α/σu.