Statistical combinations, etc. Terascale Statistics School, DESY, Hamburg, March 23-27, 2015. Glen Cowan.


G. Cowan Terascale Statistics School 2015 / Combination 1 Statistical combinations, etc. Terascale Statistics School, DESY, Hamburg, March 23-27, 2015. Glen Cowan, Physics Department, Royal Holloway, University of London.

G. Cowan Terascale Statistics School 2015 / Combination 2 Outline
1. Review of some formalism and analysis tasks
2. Broad view of combinations & review of parameter estimation
3. Combinations of parameter estimates
4. Least-squares averages, including correlations
5. Comparison with Bayesian parameter estimation
6. Bayesian averages with outliers
7. PDG brief overview

G. Cowan Terascale Statistics School 2015 / Combination 3 Hypothesis, distribution, likelihood, model. Suppose the outcome of a measurement is x (e.g., a number of events, a histogram, or some larger set of numbers). A hypothesis H specifies the probability of the data, P(x|H). Often H is labeled by parameter(s) θ → P(x|θ). For the probability distribution P(x|θ), the variable is x; θ is a constant. If we evaluate P(x|θ) with the observed data and regard it as a function of the parameter(s), this is the likelihood: L(θ) = P(x|θ) (data x fixed; treat L as a function of θ). Here we use the term 'model' to refer to the full function P(x|θ), which contains the dependence on both x and θ. (Sometimes one writes L(x|θ) for the model or the likelihood, depending on context.)
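As a minimal numerical sketch of this distinction (not from the slides; the background mean b and observed count n are illustrative), one can write the Poisson probability P(n|s) once and use it either as a distribution in n or, with n held fixed, as the likelihood L(s):

```python
import math

def poisson_pmf(n, mu):
    """P(n|mu) for a Poisson-distributed count n with mean mu."""
    return math.exp(-mu) * mu**n / math.factorial(n)

# Model P(n|s): Poisson count with mean s + b (b = assumed known background).
b = 0.5       # illustrative background mean
n_obs = 5     # illustrative observed count

def likelihood(s):
    # Same function, but data fixed and viewed as a function of the parameter s.
    return poisson_pmf(n_obs, s + b)

L_at_3 = likelihood(3.0)   # one point on the likelihood curve
```

Note that L(s) is a function of the parameter only; it is not a probability density in s.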

G. Cowan Terascale Statistics School 2015 / Combination 4 Bayesian use of the term 'likelihood'. We can write Bayes' theorem as p(θ|x) = L(x|θ) π(θ) / ∫ L(x|θ′) π(θ′) dθ′, where L(x|θ) is the likelihood, π(θ) the prior, and p(θ|x) the posterior. The likelihood is the probability for x given θ, evaluated with the observed x, and viewed as a function of θ. Bayes' theorem only needs L(x|θ) evaluated with a given data set (the 'likelihood principle'). For frequentist methods, in general one needs the full model. For some approximate frequentist methods, the likelihood is enough.

G. Cowan Terascale Statistics School 2015 / Combination 5 Theory ↔ Statistics ↔ Experiment. (Diagram: predictions of the theory (model, hypothesis), passed through a simulation of the detector and cuts, are compared with experimental data after data selection.)

G. Cowan Terascale Statistics School 2015 / Combination 6 Nuisance parameters. In general our model L(x|θ) of the data is not perfect (figure: model pdf vs. the true distribution of x). We can improve the model by including additional adjustable parameters. Nuisance parameter ↔ systematic uncertainty. Some point in the parameter space of the enlarged model should be "true". The presence of a nuisance parameter decreases the sensitivity of the analysis to the parameter of interest (e.g., it increases the variance of the estimate).

G. Cowan Terascale Statistics School 2015 / Combination 7 Data analysis in particle physics. Observe events (e.g., pp collisions) and for each, measure a set of characteristics: particle momenta, number of muons, energy of jets,... Compare the observed distributions of these characteristics to the predictions of a theory. From this, we want to: estimate the free parameters of the theory; quantify the uncertainty in the estimates; assess how well a given theory stands in agreement with the observed data; test/exclude regions of the model's parameter space (→ limits).

G. Cowan Terascale Statistics School 2015 / Combination 8 Combinations. "Combination of results" can be taken to mean: how to construct a model that incorporates more data. E.g., several experiments: Experiment 1: data x, model P(x|θ) → upper limit θ_up,1; Experiment 2: data y, model P(y|θ) → upper limit θ_up,2. Or a main experiment and control measurement(s). The best way to do the combination is at the level of the data, e.g. (if x, y are independent), P(x,y|θ) = P(x|θ) P(y|θ) → "combined" limit θ_up,comb. If the data are not available but rather only the "results" (limits, parameter estimates, p-values), then the possibilities are more limited: usually OK for parameter estimates, difficult or impossible for limits and p-values without additional assumptions and loss of information.
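A hedged illustration of combining at the level of the data: assume two independent Poisson counts x and y with the same mean θ (the counts are illustrative, not from the slides). The combined model is the product P(x|θ)P(y|θ), maximized here by a simple grid scan:

```python
import math

def log_poisson(n, mu):
    """ln P(n|mu) for a Poisson count."""
    return -mu + n * math.log(mu) - math.lgamma(n + 1)

x, y = 8, 12   # hypothetical counts from two experiments, both with mean theta

def logL_combined(theta):
    # Independent data: P(x,y|theta) = P(x|theta) P(y|theta),
    # so the combined log-likelihood is just the sum.
    return log_poisson(x, theta) + log_poisson(y, theta)

grid = [i / 100 for i in range(1, 3001)]
theta_hat = max(grid, key=logL_combined)   # analytically (x + y)/2 = 10
```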

G. Cowan Terascale Statistics School 2015 / Combination 9 Quick review of frequentist parameter estimation. Suppose we have a pdf f(x; θ) characterized by one or more parameters, where x is a random variable and θ a parameter. Suppose we have a sample of observed values x_1, ..., x_n. We want to find some function of the data, θ̂(x_1, ..., x_n), to estimate the parameter(s) (the estimator is written with a hat). Sometimes we say 'estimator' for the function of x_1, ..., x_n and 'estimate' for the value of the estimator with a particular data set.

G. Cowan Terascale Statistics School 2015 / Combination 10 Maximum likelihood. The most important frequentist method for constructing estimators is to take the value of the parameter(s) that maximizes the likelihood. The resulting estimators are functions of the data and are thus characterized by a sampling distribution with a given (co)variance. In general they may have a nonzero bias, b = E[θ̂] − θ. Under conditions usually satisfied in practice, the bias of ML estimators is zero in the large-sample limit, and the variance is as small as possible for unbiased estimators. In some cases, however, the ML estimator may not be regarded as the optimal trade-off between these criteria (cf. regularized unfolding).
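For concreteness, a small sketch (illustrative data, not from the slides): for an exponential pdf f(t; τ) = (1/τ)e^(−t/τ), the ML estimator is the sample mean, which a direct scan of −ln L recovers:

```python
import math

def neg_logL(tau, data):
    """-ln L(tau) for data t_i ~ (1/tau) exp(-t/tau), up to a constant."""
    n = len(data)
    return n * math.log(tau) + sum(data) / tau

data = [1.2, 0.3, 2.5, 0.9, 1.8]           # illustrative decay times
grid = [i / 1000 for i in range(100, 5000)]
tau_hat = min(grid, key=lambda t: neg_logL(t, data))
tau_exact = sum(data) / len(data)            # analytic MLE: the sample mean
```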

G. Cowan Terascale Statistics School 2015 / Combination 11 Ingredients for ML. To find the ML estimate itself, one only needs the likelihood L(θ). In principle, to find the covariance of the estimators, one requires the full model L(x|θ): e.g., simulate many independent data sets and look at the distribution of the resulting estimates.
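The simulation approach can be sketched as follows (an assumed exponential model with illustrative sample sizes): generate many toy data sets from the full model, apply the estimator to each, and take the spread of the results as the estimator's standard deviation:

```python
import random
import statistics

random.seed(1)
tau_true, n_events, n_toys = 1.0, 100, 500   # illustrative values

estimates = []
for _ in range(n_toys):
    # one toy data set drawn from the full model f(t; tau_true)
    toy = [random.expovariate(1.0 / tau_true) for _ in range(n_events)]
    estimates.append(sum(toy) / len(toy))    # MLE tau_hat = sample mean

# spread of the estimates approximates sigma(tau_hat) = tau/sqrt(n) = 0.1
sd = statistics.stdev(estimates)
```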

G. Cowan Terascale Statistics School 2015 / Combination 12 Ingredients for ML (2). Often (e.g., in the large-sample case) one can approximate the covariances using only the likelihood L(θ). The contour ln L(θ) = ln L_max − 1/2 translates into a simple graphical recipe: → tangent lines to the contour give the standard deviations; → the angle of the ellipse is related to the correlation.

G. Cowan Terascale Statistics School 2015 / Combination 13 The method of least squares. Suppose we measure N values y_1, ..., y_N, assumed to be independent Gaussian r.v.s with means E[y_i] = μ(x_i; θ). Assume known values of the control variable x_1, ..., x_N and known variances σ_i². The likelihood function is L(θ) = ∏_i (2πσ_i²)^(−1/2) exp(−(y_i − μ(x_i; θ))² / 2σ_i²). We want to estimate θ, i.e., fit the curve to the data points.

G. Cowan Terascale Statistics School 2015 / Combination 14 The method of least squares (2). The log-likelihood function is therefore ln L(θ) = −(1/2) Σ_i (y_i − μ(x_i; θ))²/σ_i² + terms not depending on θ. So maximizing the likelihood is equivalent to minimizing χ²(θ) = Σ_i (y_i − μ(x_i; θ))²/σ_i². The minimum defines the least-squares (LS) estimator θ̂. Very often the measurement errors are approximately Gaussian, and so ML and LS are essentially the same. Often one minimizes χ² numerically (e.g., with the program MINUIT).
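For the linear case μ(x; θ0, θ1) = θ0 + θ1 x the χ² minimum has a closed form; a small sketch (data are illustrative):

```python
def ls_line_fit(x, y, sigma):
    """Weighted least-squares fit of y = theta0 + theta1*x (closed form)."""
    w = [1.0 / s**2 for s in sigma]          # weights 1/sigma_i^2
    S = sum(w)
    Sx = sum(wi * xi for wi, xi in zip(w, x))
    Sy = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    D = S * Sxx - Sx * Sx                    # determinant of the normal equations
    theta1 = (S * Sxy - Sx * Sy) / D
    theta0 = (Sxx * Sy - Sx * Sxy) / D
    return theta0, theta1

# exact line y = 1 + 2x should be recovered
t0, t1 = ls_line_fit([0, 1, 2, 3], [1, 3, 5, 7], [0.1] * 4)
```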

G. Cowan Terascale Statistics School 2015 / Combination 15 LS with correlated measurements. If the y_i follow a multivariate Gaussian with covariance matrix V, then maximizing the likelihood is equivalent to minimizing χ²(θ) = (y − μ(θ))ᵀ V⁻¹ (y − μ(θ)).
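A minimal sketch of this χ² for two correlated measurements (the 2×2 covariance is inverted explicitly; the numbers are illustrative):

```python
def chi2_correlated(y, mu, V):
    """(y - mu)^T V^{-1} (y - mu) for two correlated measurements (2x2 case)."""
    d0, d1 = y[0] - mu[0], y[1] - mu[1]
    det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
    # inverse of the 2x2 covariance matrix
    i00, i01 = V[1][1] / det, -V[0][1] / det
    i10, i11 = -V[1][0] / det, V[0][0] / det
    return d0 * (i00 * d0 + i01 * d1) + d1 * (i10 * d0 + i11 * d1)

V = [[1.0, 0.5], [0.5, 2.0]]               # covariance with positive correlation
chi2 = chi2_correlated([1.2, 0.8], [1.0, 1.0], V)
```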

G. Cowan Terascale Statistics School 2015 / Combination16 Linear LS problem

G. Cowan Terascale Statistics School 2015 / Combination17 Linear LS problem (2)

G. Cowan Terascale Statistics School 2015 / Combination 18 Linear LS problem (3). (Equals the minimum variance bound, MVB, if the y_i are Gaussian.)

G. Cowan Terascale Statistics School 2015 / Combination 19 Goodness-of-fit with least squares. The value of χ² at its minimum is a measure of the level of agreement between the data and the fitted curve. It can therefore be employed as a goodness-of-fit statistic to test the hypothesized functional form μ(x; θ). One can show that if the hypothesis is correct, the statistic t = χ²_min follows the chi-square pdf, where the number of degrees of freedom is n_d = (number of data points) − (number of fitted parameters).

G. Cowan Terascale Statistics School 2015 / Combination 20 Goodness-of-fit with least squares (2). The chi-square pdf has an expectation value equal to the number of degrees of freedom, so if χ²_min ≈ n_d the fit is 'good'. More generally, find the p-value p = ∫_{χ²_min}^∞ f(t; n_d) dt. This is the probability of obtaining a χ²_min as high as the one we got, or higher, if the hypothesis is correct.
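The chi-square survival function has an elementary closed form for an even number of degrees of freedom, which suffices to sketch the p-value computation (the inputs are illustrative):

```python
import math

def chi2_pvalue_even(t, ndof):
    """P(chi2 >= t) for an even number of degrees of freedom ndof,
    using the closed form of the chi-square survival function."""
    assert ndof % 2 == 0
    half = t / 2.0
    return math.exp(-half) * sum(half**j / math.factorial(j)
                                 for j in range(ndof // 2))

# chi2_min = 10 with n_d = 10: a 'good' fit, p ~ 0.44
p = chi2_pvalue_even(10.0, 10)
```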

G. Cowan Terascale Statistics School 2015 / Combination21 Using LS to combine measurements

G. Cowan Terascale Statistics School 2015 / Combination 22 Combining correlated measurements with LS. That is, if we take the estimator to be a linear form λ̂ = Σ_i w_i y_i and find the w_i that minimize its variance subject to the constraint Σ_i w_i = 1 (unbiasedness), we get the LS solution (= BLUE, Best Linear Unbiased Estimator).
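A sketch of the BLUE combination for two correlated measurements: the weights are w = V⁻¹1 / (1ᵀV⁻¹1), written out here for the 2×2 case (the numbers are illustrative):

```python
def blue_average(y, V):
    """BLUE combination of two correlated measurements (2x2 covariance V)."""
    det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
    a = (V[1][1] - V[0][1]) / det   # sum of the first row of V^{-1}
    b = (V[0][0] - V[1][0]) / det   # sum of the second row of V^{-1}
    norm = a + b                    # 1^T V^{-1} 1
    w = [a / norm, b / norm]        # weights sum to 1 (unbiasedness)
    mean = w[0] * y[0] + w[1] * y[1]
    var = 1.0 / norm                # variance of the combined estimate
    return mean, var, w

# uncorrelated case: weights proportional to 1/sigma_i^2
mean, var, w = blue_average([10.0, 12.0], [[1.0, 0.0], [0.0, 4.0]])
```

With a sufficiently strong positive correlation between the two inputs, one of the weights returned by `blue_average` becomes negative, the situation discussed on the "Negative weights in LS average" slide.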

G. Cowan Terascale Statistics School 2015 / Combination23 Example: averaging two correlated measurements

G. Cowan Terascale Statistics School 2015 / Combination24 Negative weights in LS average

G. Cowan Terascale Statistics School 2015 / Combination 25 Covariance, correlation, etc. For a pair of random variables x and y, the covariance and correlation are cov[x, y] = E[xy] − E[x]E[y] and ρ_xy = cov[x, y] / (σ_x σ_y). One only talks about the correlation of two quantities to which one assigns probability (i.e., random variables). So in frequentist statistics, estimators for parameters can be correlated, but not the parameters themselves. In Bayesian statistics it does make sense to say that two parameters are correlated, e.g., through their joint posterior pdf.

G. Cowan Terascale Statistics School 2015 / Combination 26 Example of "correlated systematics". Suppose we carry out two independent measurements of the length of an object using two rulers with different thermal expansion properties. Suppose the temperature is not known exactly but must be measured (the lengths are measured together, so T is the same for both), and the (uncorrected) length measurements are modeled as Gaussian. The expectation value of the measured length L_i (i = 1, 2) is related to the true length λ at a reference temperature τ_0 through the expansion coefficient α_i.

G. Cowan Terascale Statistics School 2015 / Combination 27 Two rulers (2). The model thus treats the measurements T, L_1, L_2 as uncorrelated, with standard deviations σ_T, σ_1, σ_2, respectively. Alternatively we could correct each raw measurement for the temperature, y_i = L_i − α_i (T − τ_0), which introduces a correlation between y_1, y_2 and T. But the likelihood function (a multivariate Gaussian in T, y_1, y_2) is the same function of τ and λ as before (equivalent!). In the language of y_1, y_2: the temperature gives a correlated systematic. In the language of L_1, L_2: the temperature gives a "coherent" systematic.

G. Cowan Terascale Statistics School 2015 / Combination 28 Two rulers (3). The outcome has some surprises: the estimate of λ does not lie between y_1 and y_2, and the statistical error on the new estimate of the temperature is substantially smaller than the initial σ_T. These are features, not bugs, that result from our model assumptions.

G. Cowan Terascale Statistics School 2015 / Combination 29 Two rulers (4). We may re-examine the assumptions of our model and conclude that, say, the parameters α_1, α_2 and τ_0 were also uncertain. We may treat their nominal values as measurements (this requires a model; Gaussian?) and regard α_1, α_2 and τ_0 as nuisance parameters.

G. Cowan Terascale Statistics School 2015 / Combination30 Two rulers (5) The outcome changes; some surprises may be “reduced”.

G. Cowan Terascale Statistics School 2015 / Combination 31 "Related" parameters. Suppose the models for two independent measurements x and y contain the same parameter of interest μ and a common nuisance parameter θ, such as the jet-energy scale. To combine the measurements, construct the full likelihood L(μ, θ) = P(x|μ, θ) P(y|μ, θ). Although one may think of θ as common to the two measurements, this could be a poor approximation (e.g., the two analyses use jets with different angles/energies, so a single jet-energy scale is not a good model). Better model: suppose the parameter for x is θ, and for y it is θ′ = θ + ε, where ε is an additional nuisance parameter expected to be small.

G. Cowan Terascale Statistics School 2015 / Combination 32 Model with nuisance parameters. The additional nuisance parameters in the model may spoil our sensitivity to the parameter of interest μ, so we need to constrain them with control measurements. Often we have no actual control measurements, but some "nominal values" θ̃ and ε̃ for θ and ε, which we treat as if they were measurements, e.g., with a Gaussian model. We started by considering θ and θ′ to be the same, so probably take ε̃ = 0. So we now have an improved model with which we can estimate μ, set limits, etc.

G. Cowan Terascale Statistics School 2015 / Combination 33 Example: fitting a straight line. Data: (x_i, y_i, σ_i), i = 1, ..., n. Model: the y_i are independent and each follows y_i ~ Gauss(μ(x_i), σ_i), with μ(x; θ_0, θ_1) = θ_0 + θ_1 x; assume the x_i and σ_i are known. Goal: estimate θ_0; here suppose we don't care about θ_1 (an example of a "nuisance parameter").

G. Cowan Terascale Statistics School 2015 / Combination 34 Maximum likelihood fit with Gaussian data. In this example the y_i are assumed independent, so the likelihood function is a product of Gaussians. Maximizing the likelihood is here equivalent to minimizing χ²(θ_0, θ_1) = Σ_i (y_i − μ(x_i; θ_0, θ_1))² / σ_i², i.e., for Gaussian data, ML is the same as the method of least squares (LS).

G. Cowan Terascale Statistics School 2015 / Combination 35 θ_1 known a priori. For Gaussian y_i, ML is the same as LS. Minimize χ² → estimator θ̂_0. Come up one unit from χ²_min to find σ_θ̂0.

G. Cowan Terascale Statistics School 2015 / Combination 36 ML (or LS) fit of θ_0 and θ_1. Correlation between the estimators θ̂_0 and θ̂_1 causes the errors to increase. Standard deviations from tangent lines to the contour χ² = χ²_min + 1.

G. Cowan Terascale Statistics School 2015 / Combination 37 If we have a measurement t_1 ~ Gauss(θ_1, σ_t1), the information on θ_1 improves the accuracy of θ̂_0.

G. Cowan Terascale Statistics School 2015 / Combination 38 Bayesian method. We need to associate prior probabilities with θ_0 and θ_1, e.g., a prior for θ_1 based on a previous measurement and a "non-informative" prior for θ_0 (in any case much broader than the likelihood). Putting this into Bayes' theorem gives: posterior ∝ likelihood × prior.

G. Cowan Terascale Statistics School 2015 / Combination 39 Bayesian method (continued). We then integrate (marginalize) p(θ_0, θ_1 | x) to find p(θ_0 | x) = ∫ p(θ_0, θ_1 | x) dθ_1. Usually one needs numerical methods (e.g., Markov chain Monte Carlo) to do the integral. In this example we can do the integral in closed form (rare).

G. Cowan Terascale Statistics School 2015 / Combination 40 Marginalization with MCMC. Bayesian computations involve integrals like p(θ_0 | x) = ∫ p(θ_0, θ_1 | x) dθ_1, often of high dimensionality and impossible in closed form, and also impossible with "normal" acceptance-rejection Monte Carlo. Markov chain Monte Carlo (MCMC) has revolutionized Bayesian computation. MCMC (e.g., the Metropolis-Hastings algorithm) generates a correlated sequence of random numbers: it cannot be used for many applications, e.g., detector MC, and the effective statistical error is greater than the naive √n. Basic idea: sample the full multidimensional parameter space and look, e.g., only at the distribution of the parameters of interest.
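A minimal Metropolis-Hastings sketch (a toy standard-Gaussian "posterior" stands in for p(θ|x); the step size and chain length are illustrative choices):

```python
import math
import random
import statistics

random.seed(42)

def log_post(theta):
    # toy log-posterior: a standard Gaussian, up to a constant
    return -0.5 * theta * theta

def metropolis(n_steps, step=1.0):
    chain, theta = [], 0.0
    lp = log_post(theta)
    for _ in range(n_steps):
        prop = theta + random.gauss(0.0, step)   # symmetric proposal
        lp_prop = log_post(prop)
        # accept with probability min(1, p(prop)/p(theta))
        if random.random() < math.exp(min(0.0, lp_prop - lp)):
            theta, lp = prop, lp_prop
        chain.append(theta)   # a correlated sequence, not independent draws
    return chain

chain = metropolis(20000)
burn = chain[2000:]               # discard burn-in
m = statistics.mean(burn)         # summarize the parameter of interest
s = statistics.stdev(burn)
```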

G. Cowan Terascale Statistics School 2015 / Combination 41 Example: posterior pdf from MCMC. Sample the posterior pdf from the previous example with MCMC. Summarize the pdf of the parameter of interest with, e.g., its mean, median, standard deviation, etc. Although the numerical values of the answer here are the same as in the frequentist case, the interpretation is different (sometimes unimportant?).

G. Cowan Terascale Statistics School 2015 / Combination 42 Bayesian method with alternative priors. Suppose we don't have a previous measurement of θ_1 but rather, e.g., a theorist says it should be positive and not too much greater than 0.1 "or so". From such a prior we obtain (numerically) the posterior pdf for θ_0. This summarizes all knowledge about θ_0. Look also at the result from a variety of priors.

G. Cowan Terascale Statistics School 2015 / Combination 43 The error on the error. Some systematic errors are well determined, e.g., the error from a finite Monte Carlo sample. Some are less obvious: do the analysis in n "equally valid" ways and extract the systematic error from the "spread" in the results. Some are educated guesses: guess the possible size of missing terms in the perturbation series; vary the renormalization scale. Can we incorporate the "error on the error"? (cf. G. D'Agostini 1999; Dose & von der Linden 1999)

G. Cowan Terascale Statistics School 2015 / Combination 44 A more general fit (symbolic). Given measurements y_i and (usually) covariances V_ij, the predicted value is E[y_i] = μ(x_i; θ) + b_i, with control variable x_i, parameters θ, and bias b_i. Minimize the corresponding χ²; this is equivalent to maximizing L(θ) ∝ e^(−χ²/2), i.e., least squares is the same as maximum likelihood using a Gaussian likelihood function.

G. Cowan Terascale Statistics School 2015 / Combination 45 Its Bayesian equivalent. Take the joint probability for all parameters, p(θ, b | y) ∝ L(y | θ, b) π(θ) π_b(b), and use Bayes' theorem. To get the desired probability for θ, integrate (marginalize) over b: p(θ | y) = ∫ p(θ, b | y) db. → The posterior is Gaussian with mode the same as the least-squares estimator, and σ_θ the same as from χ² = χ²_min + 1. (Back where we started!)

G. Cowan Terascale Statistics School 2015 / Combination 46 Motivating a non-Gaussian prior π_b(b). Suppose now the experiment is characterized by reported statistical and systematic errors, where s_i is an (unreported) factor by which the systematic error is over/under-estimated. Assume the correct error for a Gaussian π_b(b) would be s_i σ_i^sys. The width of π_s(s_i) reflects the "error on the error".

G. Cowan Terascale Statistics School 2015 / Combination 47 Error-on-error function π_s(s). A simple unimodal probability density for 0 < s < ∞ with adjustable mean and variance is the Gamma distribution, with mean = b/a and variance = b/a². Want, e.g., an expectation value of 1 and an adjustable standard deviation σ_s. In fact, if we took π_s(s) ~ inverse Gamma, we could integrate π_b(b) in closed form (cf. D'Agostini, Dose, von der Linden). But the Gamma seems more natural, and the numerical treatment is not too painful.

G. Cowan Terascale Statistics School 2015 / Combination 48 Prior for bias. π_b(b) now has longer tails. Gaussian (σ_s = 0): P(|b| > 4σ_sys) = 6.3 × 10⁻⁵; with σ_s = 0.5: P(|b| > 4σ_sys) = 0.65%.

G. Cowan Terascale Statistics School 2015 / Combination 49 A simple test. Suppose the fit effectively averages four measurements. Take σ_sys = σ_stat = 0.1, uncorrelated. Case #1: the data appear compatible. (Figure: measurement vs. experiment, and posterior p(μ|y).) Usually summarize the posterior p(μ|y) with its mode and standard deviation.

G. Cowan Terascale Statistics School 2015 / Combination 50 Simple test with inconsistent data. Case #2: there is an outlier. → The Bayesian fit is less sensitive to the outlier. → The error is now connected to goodness-of-fit. (Figure: measurement vs. experiment, and posterior p(μ|y).)

G. Cowan Terascale Statistics School 2015 / Combination 51 Goodness-of-fit vs. size of error. In the LS fit, the value of the minimized χ² does not affect the size of the error on the fitted parameter. In the Bayesian analysis with a non-Gaussian prior for the systematics, a high χ² corresponds to a larger error (and vice versa). (Figure: posterior σ_μ vs. χ² from least squares, for repetitions of the experiment with σ_s = 0.5 and no actual bias.)

G. Cowan Terascale Statistics School 2015 / Combination 52 Particle Data Group averages. The PDG needs pragmatic solutions for averages where the reported information may be incomplete or inconsistent. Often this involves taking the quadratic sum of statistical and systematic uncertainties for LS averages. If asymmetric errors (confidence intervals) are reported, the PDG has a recipe to reconstruct a model based on a Gaussian-like function whose sigma is a continuous function of the mean. Exclusion of inconsistent data is "sometimes quite subjective". If the minimum chi-squared is much larger than the number of degrees of freedom N_dof = N − 1, scale up the input errors by a factor S chosen so that the new χ² = N_dof; the error on the average is increased by S. K.A. Olive et al. (Particle Data Group), Chin. Phys. C 38 (2014); pdg.lbl.gov
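A simplified sketch of this scale-factor recipe (it ignores the PDG's criteria for which measurements enter the χ², and the numbers are illustrative):

```python
import math

def pdg_average(y, sigma):
    """Weighted LS average with a simplified PDG-style scale factor S."""
    w = [1.0 / s**2 for s in sigma]
    wsum = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / wsum
    err = 1.0 / math.sqrt(wsum)
    ndof = len(y) - 1
    chi2 = sum(wi * (yi - mean)**2 for wi, yi in zip(w, y))
    # scale errors only when chi2 exceeds the degrees of freedom
    S = math.sqrt(chi2 / ndof) if ndof > 0 and chi2 > ndof else 1.0
    return mean, err * S, S

# consistent inputs leave S = 1; an outlier inflates the error by S
mean, err, S = pdg_average([1.0, 1.2, 3.0], [0.1, 0.1, 0.1])
```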

G. Cowan Terascale Statistics School 2015 / Combination 53 Summary on combinations. The basic idea of combining measurements is to write down the model that describes all of the available experimental outcomes. If the original data are not available but only parameter estimates, then one treats the estimates (and their covariances) as "the data"; often a multivariate Gaussian model is adequate for these. If the reported values are limits, there are few meaningful options; the PDG does not combine limits unless they can be "deconstructed" back into a Gaussian measurement. The ATLAS/CMS 2011 combination of Higgs limits used the histograms of event counts (not the individual limits) to construct a full model (ATLAS-CONF , CMS PAS HIG ). The important point is to publish enough information so that meaningful combinations can be carried out.

G. Cowan Terascale Statistics School 2015 / Combination54 Extra slides

G. Cowan Terascale Statistics School 2015 / Combination 55 A quick review of frequentist statistical tests. Consider a hypothesis H_0 and alternative H_1. A test of H_0 is defined by specifying a critical region w of the data space such that there is no more than some (small) probability α, assuming H_0 is correct, to observe the data there, i.e., P(x ∈ w | H_0) ≤ α. (Need the inequality if the data are discrete.) α is called the size or significance level of the test. If x is observed in the critical region, reject H_0. (Figure: data space Ω with critical region w.)

G. Cowan Terascale Statistics School 2015 / Combination 56 Definition of a test (2). But in general there are an infinite number of possible critical regions that give the same significance level α. So the choice of the critical region for a test of H_0 needs to take into account the alternative hypothesis H_1. Roughly speaking, place the critical region where there is a low probability to be found if H_0 is true, but high if H_1 is true.

G. Cowan Terascale Statistics School 2015 / Combination 57 Type-I, Type-II errors. Rejecting the hypothesis H_0 when it is true is a Type-I error. The maximum probability for this is the size of the test: P(x ∈ W | H_0) ≤ α. But we might also accept H_0 when it is false and an alternative H_1 is true. This is called a Type-II error, and it occurs with probability P(x ∈ S − W | H_1) = β. One minus this is called the power of the test with respect to the alternative H_1: power = 1 − β.

G. Cowan Terascale Statistics School 2015 / Combination 58 p-values. Suppose hypothesis H predicts the pdf f(x|H) for a set of observations x = (x_1, ..., x_n). We observe a single point in this space: x_obs. What can we say about the validity of H in light of the data? Express the level of compatibility by giving the p-value for H: p = probability, under assumption of H, to observe data with equal or lesser compatibility with H relative to the data we got. This is not the probability that H is true! It requires one to say what part of the data space constitutes lesser compatibility with H than the observed data (implicitly, that region gives better agreement with some alternative).

G. Cowan Terascale Statistics School 2015 / Combination 59 Significance from p-value. Often define the significance Z as the number of standard deviations that a Gaussian variable would fluctuate in one direction to give the same p-value: p = 1 − Φ(Z) (in ROOT, 1 - TMath::Freq) and Z = Φ⁻¹(1 − p) (TMath::NormQuantile). E.g., Z = 5 (a "5 sigma effect") corresponds to p = 2.9 × 10⁻⁷.
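The p ↔ Z conversion can be sketched with the standard Gaussian tail (playing the role of the ROOT calls quoted on the slide); the inverse is obtained here by bisection, since the Python standard library has no normal quantile function:

```python
import math

def p_from_z(z):
    """One-sided Gaussian tail probability p = 1 - Phi(z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def z_from_p(p, lo=-10.0, hi=10.0):
    """Invert p_from_z by bisection (p_from_z is strictly decreasing in z)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if p_from_z(mid) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

p5 = p_from_z(5.0)   # the '5 sigma' p-value, ~2.9e-7
z = z_from_p(p5)     # recovers ~5.0
```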

G. Cowan Terascale Statistics School 2015 / Combination 60 Using a p-value to define a test of H_0. One can show the distribution of the p-value of H, under assumption of H, is uniform in [0,1]. So the probability to find the p-value of H_0, p_0, less than α is α. We can define the critical region of a test of H_0 with size α as the set of data space where p_0 ≤ α. Formally the p-value relates only to H_0, but the resulting test will have a given power with respect to a given alternative H_1.

The Poisson counting experiment

Suppose we do a counting experiment and observe n events. Events could be from a signal process or from background; we only count the total number. Poisson model:

P(n; s, b) = (s + b)^n e^−(s+b) / n!

where s = mean (i.e., expected) number of signal events and b = mean number of background events. The goal is to make an inference about s, e.g.,

test s = 0 (rejecting H0 ≈ "discovery of the signal process");
test all non-zero s (values not rejected = confidence interval).

In both cases we need to ask what is the relevant alternative hypothesis.

Poisson counting experiment: discovery p-value

Suppose b = 0.5 (known), and we observe n_obs = 5. Should we claim evidence for a new discovery? Give the p-value for the hypothesis s = 0:

p = P(n ≥ n_obs; b) = 1.7 × 10⁻⁴
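This p-value is a one-line Poisson sum; a minimal sketch with the standard library:

```python
from math import exp, factorial

def poisson_pmf(n, mu):
    """Poisson probability P(n; mu) = mu^n e^-mu / n!."""
    return mu**n * exp(-mu) / factorial(n)

# p-value for s = 0: probability, with background mean b alone,
# to observe n_obs or more events.
b = 0.5
n_obs = 5
p = 1.0 - sum(poisson_pmf(n, b) for n in range(n_obs))
print(p)  # about 1.7e-4
```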

Poisson counting experiment: discovery significance

Equivalent significance for p = 1.7 × 10⁻⁴: Z = Φ⁻¹(1 − p) ≈ 3.6. One often claims discovery if Z > 5 (p < 2.9 × 10⁻⁷, i.e., a "5-sigma effect").

In fact this tradition should be revisited: the p-value is intended to quantify the probability of a signal-like fluctuation assuming background only; it is not intended to cover, e.g., hidden systematics, the plausibility of the signal model, the compatibility of the data with the signal, the "look-elsewhere effect" (~multiple testing), etc.

Confidence intervals by inverting a test

Confidence intervals for a parameter θ can be found by defining a test of each hypothesized value θ (do this for all θ): specify values of the data that are 'disfavoured' by θ (the critical region) such that P(data in critical region | θ) ≤ α for a prespecified α, e.g., 0.05 or 0.1. If the data are observed in the critical region, reject the value θ.

Now invert the test to define a confidence interval as: the set of θ values that would not be rejected in a test of size α (the confidence level is 1 − α). The interval will cover the true value of θ with probability ≥ 1 − α. Equivalently, the parameter values in the confidence interval have p-values of at least α.

Frequentist upper limit on Poisson parameter

Consider again the case of observing n ~ Poisson(s + b). Suppose b = 4.5, n_obs = 5. Find the upper limit on s at 95% CL. The relevant alternative is s = 0 (critical region at low n), so the p-value of a hypothesized s is

p_s = P(n ≤ n_obs; s, b)

The upper limit s_up at CL = 1 − α is found from p_s = α.
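Since p_s decreases monotonically with s, the equation p_s = α can be solved numerically; a minimal sketch using bisection:

```python
from math import exp, factorial

def p_s(s, b, n_obs):
    """p-value of hypothesized s: P(n <= n_obs; s, b)."""
    mu = s + b
    return sum(mu**n * exp(-mu) / factorial(n) for n in range(n_obs + 1))

def upper_limit(b, n_obs, alpha=0.05, lo=0.0, hi=50.0):
    """Solve p_s(s_up) = alpha by bisection (p_s decreases with s)."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if p_s(mid, b, n_obs) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

s_up = upper_limit(b=4.5, n_obs=5, alpha=0.05)
print(s_up)  # about 6.0
```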

Frequentist upper limit on Poisson parameter (2)

[Plot: p_s versus the hypothesized s for n_obs = 5, b = 4.5; the upper limit s_up at CL = 1 − α is read off where p_s = α.]

Frequentist treatment of nuisance parameters in a test

Suppose we test a value of θ with the profile likelihood ratio

λ(θ) = L(θ, ν̂(θ)) / L(θ̂, ν̂),  t_θ = −2 ln λ(θ),

where ν̂(θ) denotes the (profiled) values of the nuisance parameters that maximize L for the given θ, and θ̂, ν̂ are the overall maximum-likelihood estimators. We want a p-value for θ:

p_θ = P(t_θ ≥ t_θ,obs | θ, ν)

Wilks' theorem says that in the large-sample limit (and under some additional conditions) f(t_θ|θ,ν) is a chi-square distribution with number of degrees of freedom equal to the number of parameters of interest (the number of components in θ). This gives a simple recipe for the p-value, and it holds regardless of the values of the nuisance parameters!
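For a single parameter of interest the chi-square tail probability has a closed form, P(χ²₁ ≥ t) = erfc(√(t/2)); a small sketch:

```python
from math import erfc, sqrt

def p_value_wilks_1dof(t_theta):
    """Asymptotic p-value of theta from t_theta = -2 ln lambda(theta),
    assuming one parameter of interest: p = P(chi2_1 >= t_theta)."""
    return erfc(sqrt(t_theta / 2.0))

print(p_value_wilks_1dof(3.84))  # about 0.05
```

So an observed t_θ of 3.84 corresponds to rejecting θ at about the 5% level, the familiar one-degree-of-freedom threshold.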

Frequentist treatment of nuisance parameters in a test (2)

But for a finite data sample, f(t_θ|θ,ν) still depends on ν. So what is the rule for saying whether we reject θ? The exact approach is to reject θ only if p_θ < α (e.g., 5%) for all possible ν. But then some values of θ might not be excluded only because of a value of ν known to be disfavoured. Fewer values of θ rejected → larger interval → higher probability to cover the true value ("over-coverage").

But why do we say some values of ν are disfavoured? If this is because of other measurements (say, data y) then include y in the model:

L(x, y | θ, ν) = L(x | θ, ν) L(y | ν)

Now ν is better constrained, and the new interval for θ is smaller.

Profile construction ("hybrid resampling")

An approximate procedure is to reject θ if p_θ ≤ α, where the p-value is computed assuming the values of the nuisance parameters that best fit the data for the specified θ (the profiled values ν̂(θ)). The resulting confidence interval will have the correct coverage for the points (θ, ν̂(θ)). Elsewhere it may under- or over-cover, but this is usually as good as we can do (check with MC if crucial, or for a small-sample problem).

Bayesian treatment of nuisance parameters

Conceptually straightforward: all parameters have a prior π(θ, ν). For θ the prior is often "non-informative" (broad compared to the likelihood); for ν it is usually "informative", reflecting the best available information on ν. Use the prior with the likelihood in Bayes' theorem:

p(θ, ν | x) ∝ L(x | θ, ν) π(θ, ν)

To find p(θ|x), marginalize (integrate) over the nuisance parameters:

p(θ | x) = ∫ p(θ, ν | x) dν
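A minimal numerical sketch of this marginalization for the Poisson counting experiment (the prior choices here are illustrative assumptions, not from the slides): n ~ Poisson(s + b), flat prior on s ≥ 0, Gaussian prior on b; the posterior for s is obtained by integrating out b on a grid:

```python
from math import exp, factorial

# Assumed toy inputs: observed count, and a Gaussian prior for b.
n_obs = 5
b0, sigma_b = 4.5, 0.5

def likelihood(n, s, b):
    mu = s + b
    return mu**n * exp(-mu) / factorial(n)

def prior_b(b):
    # Unnormalized Gaussian prior for the nuisance parameter b.
    return exp(-0.5 * ((b - b0) / sigma_b) ** 2)

s_grid = [0.05 * i for i in range(401)]                          # s in [0, 20]
b_grid = [b0 + sigma_b * (-5 + 0.05 * j) for j in range(201)]    # b0 +/- 5 sigma

# Marginal (unnormalized) posterior for s: sum over the b grid.
posterior = [sum(likelihood(n_obs, s, b) * prior_b(b) for b in b_grid)
             for s in s_grid]
norm = sum(posterior)
posterior = [p / norm for p in posterior]   # normalized on the grid
s_mode = s_grid[posterior.index(max(posterior))]
print(s_mode)  # mode near n_obs - b0 = 0.5
```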

Prototype analysis in HEP

Each event yields a collection of numbers: x1 = number of muons, x2 = pt of jet, ... The vector x = (x1,..., xn) follows some n-dimensional joint pdf, which depends on the type of event produced, i.e., signal or background.

1) What kind of decision boundary best separates the two classes?
2) What is the optimal test of the hypothesis that the event sample contains only background?

Test statistics

The boundary of the critical region for an n-dimensional data space x = (x1,..., xn) can be defined by an equation of the form

t(x1,..., xn) = t_cut,

where t is a scalar test statistic. We can work out the pdfs f(t|H0) and f(t|H1). The decision boundary is now a single 'cut' on t, defining the critical region. So for an n-dimensional problem we have a corresponding 1-d problem.

Test statistic based on likelihood ratio

For multivariate data x, it is not obvious how to construct the best test. The Neyman-Pearson lemma states: to get the highest power for a given significance level in a test of H0 (background) versus H1 (signal), the critical region should have

f(x|H1) / f(x|H0) ≥ c

inside the region, and ≤ c outside, where c is a constant that depends on the size of the test α. Equivalently, the optimal scalar test statistic is

t(x) = f(x|H1) / f(x|H0).

N.B. any monotonic function of this leads to the same test.
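A toy illustration of the last point (the Gaussian means here are illustrative assumptions): with H0 = N(0,1) and H1 = N(2,1) in one dimension, the log of the likelihood ratio is linear in x, so cutting on t(x) is equivalent to cutting on x itself:

```python
from math import exp, pi, sqrt

def gauss_pdf(x, mu):
    """Unit-width Gaussian pdf."""
    return exp(-0.5 * (x - mu) ** 2) / sqrt(2.0 * pi)

def t_stat(x, mu0=0.0, mu1=2.0):
    """Neyman-Pearson statistic t(x) = f(x|H1) / f(x|H0)."""
    return gauss_pdf(x, mu1) / gauss_pdf(x, mu0)

# For mu0 = 0: ln t(x) = mu1*x - mu1^2/2, monotonically increasing in x.
xs = [-1.0, 0.0, 1.0, 2.0]
ts = [t_stat(x) for x in xs]
print(ts == sorted(ts))  # True: t increases with x
```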

Ingredients for a frequentist test

In general, to carry out a test we need to know the distribution of the test statistic t(x), and this means we need the full model P(x|H). Often, however, one can construct a test statistic whose distribution approaches a well-defined form (almost) independent of the distribution of the data, e.g., the likelihood ratio used to test a value of θ:

t_θ = −2 ln [L(θ, ν̂(θ)) / L(θ̂, ν̂)]

In the large-sample limit t_θ follows a chi-square distribution with number of degrees of freedom equal to the number of components in θ (Wilks' theorem). So here one doesn't need the full model P(x|θ), only the observed value of t_θ.

MCMC basics: Metropolis-Hastings algorithm

Goal: given an n-dimensional pdf p(x), generate a sequence of points x⁰, x¹, x², ...

1) Start at some point x⁰.
2) Generate a proposal x' from a proposal density q(x'; xᵏ), e.g., a Gaussian centred about xᵏ.
3) Form the Hastings test ratio A = min[1, p(x') q(xᵏ; x') / (p(xᵏ) q(x'; xᵏ))].
4) Generate u uniformly in [0, 1].
5) If u ≤ A, move to the proposed point: xᵏ⁺¹ = x'; else stay at the old point: xᵏ⁺¹ = xᵏ.
6) Iterate (repeat from step 2).

Metropolis-Hastings (continued)

This rule produces a correlated sequence of points (note how each new point depends on the previous one). For our purposes this correlation is not fatal, but the statistical errors are larger than the naive √n scaling would suggest. The proposal density can be (almost) anything, but choose it so as to minimize the autocorrelation. Often one takes the proposal density symmetric, q(x'; x) = q(x; x'); the test ratio is then (Metropolis-Hastings)

A = min[1, p(x') / p(x)].

I.e. if the proposed step is to a point of higher p(x), take it; if not, take the step only with probability p(x')/p(x). If the proposed step is rejected, hop in place (repeat the old point in the chain).
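The steps above can be sketched in a few lines; this toy version (target and step size are illustrative assumptions) samples a 1-d standard Gaussian with a symmetric Gaussian proposal, so the Hastings ratio reduces to p(x')/p(x):

```python
import random
from math import exp

def target(x):
    """Unnormalized target density: standard Gaussian."""
    return exp(-0.5 * x * x)

def metropolis(n_steps, step_size=1.0, x0=0.0, seed=1):
    rng = random.Random(seed)
    chain = [x0]
    x = x0
    for _ in range(n_steps):
        x_prop = rng.gauss(x, step_size)          # symmetric proposal
        a = min(1.0, target(x_prop) / target(x))  # Metropolis test ratio
        if rng.random() <= a:
            x = x_prop                            # accept: move to x'
        chain.append(x)                           # on rejection, x repeats
    return chain

chain = metropolis(20000)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
print(mean, var)  # roughly 0 and 1 for the standard Gaussian target
```

Note the design point from the slide: a rejected proposal still appends the old point, which is what makes the chain's stationary distribution equal to the target.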