Statistical Methods for Experimental Particle Physics: Theory and Lots of Examples. Thomas R. Junk, Fermilab. TRIUMF Summer Institute, July 2009.


Statistical Methods for Experimental Particle Physics: Theory and Lots of Examples
Thomas R. Junk, Fermilab
TRIUMF Summer Institute, July 2009
Day 2: Hypothesis Testing and Confidence Intervals

Hypothesis Testing
Simplest case: deciding between two hypotheses, conventionally called the null hypothesis H0 and the test hypothesis H1.
Can't we be even simpler and just test one hypothesis, H0? Data are random: if we don't have another explanation of the data, we'd be forced to call any discrepancy a random fluctuation. Is this enough?
All models are wrong, but some are useful. H0 may be broadly right but its predictions slightly flawed. Look at enough distributions and you will surely spot one that is mismodeled. A second hypothesis provides guidance on where to look.
Popper: you can only prove models wrong, never prove one right. And proving one hypothesis wrong doesn't mean the proposed alternative must be right.

Frequentist Hypothesis Testing: Test Statistics and p-values
Step 1: Devise a quantity, computed from the observed data, that ranks outcomes as being more signal-like or more background-like. This is called a test statistic. Simplest case: searching for a new particle by counting events passing a selection requirement. Expect b events under H0 and s+b under H1; the event count n_obs is a good test statistic.
Step 2: Predict the distributions of the test statistic separately assuming (a) H0 is true and (b) H1 is true. (Two distributions; more on this later.)

Frequentist Hypothesis Testing: Test Statistics and p-values
Step 3: Run the experiment and get the observed value of the test statistic.
Step 4: Compute the p-value, p(n ≥ n_obs | H0).
Example: H0 has b = 6, and we observe n_obs = 10; the p-value is P(n ≥ 10 | b = 6), computed in the sketch below.
A p-value is not the "probability that H0 is true", although many often say that.
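A minimal sketch of the Step 4 calculation for the counting example above (b = 6, n_obs = 10 are the slide's numbers). The use of Python and scipy is my choice; the slides themselves quote ROOT.

```python
# Sketch: p-value for a Poisson counting experiment (slide's example: b = 6, n_obs = 10).
from scipy.stats import poisson

b = 6.0       # expected background under H0
n_obs = 10    # observed event count

# p-value = P(n >= n_obs | H0): the survival function evaluated at n_obs - 1
p_value = poisson.sf(n_obs - 1, b)
print(f"p-value = P(n >= {n_obs} | b = {b}) = {p_value:.4f}")   # about 0.084
```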

Common Standards of Evidence
Physicists like to talk about how many "sigma" a result corresponds to and generally have less feel for p-values. The number of "sigma" is called a "z-value" and is just a translation of a p-value using the integral of one tail of a Gaussian:
  Double_t zvalue = -TMath::NormQuantile(pvalue);
For example, 1σ corresponds to a p-value of 15.9%. (Tip: most physicists talk about p-values now but hardly use the term z-value.)
Folklore: 95% CL is good for exclusion; 3σ is "evidence"; 5σ is an "observation". Some argue for a more subjective scale.
z-value (σ) vs p-value: 1σ: 15.9%, 2σ: 2.28E-2, 3σ: 1.35E-3, 4σ: 3.17E-5, 5σ: 2.87E-7.
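A small sketch of the p-value/z-value translation described above, equivalent to the quoted ROOT call -TMath::NormQuantile(pvalue); scipy is used here to keep the example self-contained.

```python
# Sketch: translating between p-values and z-values ("number of sigma") using the
# one-tailed integral of a unit Gaussian, as described on the slide.
from scipy.stats import norm

def p_to_z(p):
    return norm.isf(p)      # inverse survival function: z such that P(X > z) = p

def z_to_p(z):
    return norm.sf(z)       # one-tailed upper-tail integral of a unit Gaussian

for z in (1, 2, 3, 4, 5):
    print(f"{z} sigma  ->  p = {z_to_p(z):.3g}")
# 1 sigma -> 0.159, 3 sigma -> 1.35e-3, 5 sigma -> 2.87e-7
```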

Sociological Issues
Discovery is conventionally 5σ. In the Gaussian asymptotic case, that would correspond to a ±20% measurement, yet less precise measurements are called "measurements" all the time. We are used to measuring undiscovered particles and processes.
In a background-dominated search, it can take years to climb up the sensitivity curve and get an observation, while evidence, measurements, etc. proceed. Referees can be confused.

A More Sophisticated Test Statistic
What if you have two or more bins in your histogram? It's not just a single counting experiment any more, but we still want to rank outcomes as more signal-like or less signal-like.
Neyman-Pearson Lemma: the likelihood ratio Q = L(data | H1) / L(data | H0) is the "uniformly most powerful" test statistic. -2lnQ acts like a difference of chi-squareds in the Gaussian limit.
[Figure: distributions of -2lnQ under H1 and H0, with signal-like and background-like regions; the yellow area is the p-value for ruling out H0, the green area the p-value for ruling out H1.]
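Below is a minimal sketch of the -2lnQ test statistic for a binned search with fixed signal and background templates (no nuisance parameters yet). The bin contents are made-up illustrative numbers, not taken from the slides.

```python
# Sketch: the likelihood-ratio test statistic -2 ln Q for a binned search with fixed
# signal (s_i) and background (b_i) templates.  Q = L(data|s+b)/L(data|b); the
# factorial terms cancel in the ratio, giving
#   -2 ln Q = 2 * sum_i [ s_i - n_i * ln(1 + s_i/b_i) ].
import numpy as np

def minus2lnQ(n, s, b):
    n, s, b = map(np.asarray, (n, s, b))
    return 2.0 * np.sum(s - n * np.log1p(s / b))

s = np.array([1.0, 3.0, 1.0])      # expected signal per bin (hypothetical)
b = np.array([10.0, 10.0, 10.0])   # expected background per bin (hypothetical)
n = np.array([11, 14, 9])          # observed counts (hypothetical)
print("-2 ln Q =", minus2lnQ(n, s, b))
```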

More Sensitivity or Less Sensitivity
[Figure: -2lnQ distributions under H0 and H1. With good separation, a background-like outcome gives a very small signal p-value and the signal is ruled out; when the distributions overlap completely, no statement can be made regardless of the experimental outcome.]

What's with θ̂₁ and θ̂₀?
A simple hypothesis is one for which the only free parameters are parameters of interest. A compound hypothesis is less specific: it may have parameters whose values we are not particularly concerned about but which affect its predictions. These are called nuisance parameters, labeled θ.
Example: H0 = SM, H1 = MSSM. Both make predictions about what may be seen in an experiment. A nuisance parameter would be, for example, the b-tagging efficiency: it affects the predictions, but at the end of the day we are really concerned about H0 and H1.

What's with θ̂₁ and θ̂₀?
We parameterize our ignorance of the model predictions with nuisance parameters. A model with a lot of uncertainty is hard to rule out: either it has many nuisance parameters, or it has one parameter that has a big effect on its predictions and whose value cannot be determined in other ways.
The test statistic becomes Q = L(data | H1, θ̂₁) / L(data | H0, θ̂₀), where θ̂₁ maximizes L under H1 and θ̂₀ maximizes L under H0.


Example: flat background, 30 bins, 10 background events per bin, plus a Gaussian signal. Run a pseudoexperiment (assuming s+b). Fit twice: once assuming H0 (flat background only) and once assuming H1 (flat background plus the known signal shape). The background rate is a nuisance parameter, θ = b. Use the fitted signal and background rates to calculate Q. (Fitting the signal rate as well is a separate option.) A sketch of this toy study follows below.
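A sketch of the toy study described on this slide. The signal yield, position and width chosen below are illustrative assumptions (the slide does not give them), and scipy's minimizer stands in for MINUIT.

```python
# Sketch: 30 flat-background bins (10 events/bin expected), a Gaussian signal on top,
# and the background rate fitted separately under H0 (background only) and
# H1 (background + known signal shape).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(1)

nbins = 30
edges = np.linspace(0.0, 30.0, nbins + 1)
b_true = 10.0 * np.ones(nbins)                      # 10 background events per bin
# known signal shape: Gaussian at x = 15 with sigma = 2, 25 events total (assumed)
frac = norm.cdf(edges[1:], 15.0, 2.0) - norm.cdf(edges[:-1], 15.0, 2.0)
s_true = 25.0 * frac

# one pseudoexperiment generated under H1 (s+b)
n = rng.poisson(b_true + s_true)

def nll(mu):           # Poisson negative log likelihood (constant terms dropped)
    return np.sum(mu - n * np.log(mu))

# H0 fit: flat background only; the per-bin rate is the nuisance parameter
fit0 = minimize_scalar(lambda r: nll(np.full(nbins, r)), bounds=(1e-3, 50.0), method="bounded")
# H1 fit: flat background plus the fixed, known signal template
fit1 = minimize_scalar(lambda r: nll(np.full(nbins, r) + s_true), bounds=(1e-3, 50.0), method="bounded")

minus2lnQ = 2.0 * (fit1.fun - fit0.fun)    # -2 ln [ L(H1, b fitted) / L(H0, b fitted) ]
print("fitted bg/bin under H0:", fit0.x, " under H1:", fit1.x)
print("-2 ln Q =", minus2lnQ)
```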

Fitting Nuisance Parameters to Reduce Sensitivity to Mismodeling
The means of the PDFs of -2lnQ are very sensitive to the background rate estimate. Even with fits, some residual sensitivity remains in the PDFs, because the probability of each outcome still varies with the background estimate.

Some Comments on Fitting
Fitting is an optimization step and is not needed for correctly handling systematic uncertainties on nuisance parameters (more on systematics later).
Some advocate just using -2lnQ with fits as the final step in quoting significance (Fisher, Rolke, Conrad, Lopez).
Fits can "fail": MINUIT can give strange answers (often not MINUIT's fault). It is good to explore distributions of possible fits, not just the one found in the data.

An Alternate Likelihood Ratio
Fit the signal freely in H1; H0 is then just a special case of H1 (with s = 0). Maximize over the parameters of interest. Since the numerator is maximized, it is always at least as big as the denominator, so 2lnQ ≥ 0. Under H0, 2lnQ is then distributed as a chi-squared with one degree of freedom (Wilks's theorem), but this needs to be checked: MINUIT can give strange answers.
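A minimal sketch of this "fit the signal freely" likelihood ratio for one counting bin with known background, reusing the earlier slide's numbers (b = 6, n_obs = 10). The factor 1/2 in the p-value accounts for the s ≥ 0 boundary, a standard refinement not spelled out on the slide.

```python
# Sketch: profile likelihood ratio for a single Poisson bin with known background.
# 2 ln Q = 2 [ ln L(s_hat + b) - ln L(b) ]; asymptotically chi-squared(1) under H0
# (Wilks), with a delta function at zero from the s >= 0 boundary.
import numpy as np
from scipy.stats import chi2

b, n_obs = 6.0, 10
s_hat = max(n_obs - b, 0.0)                      # maximum-likelihood signal estimate

two_lnQ = 2.0 * (n_obs * np.log((s_hat + b) / b) - s_hat)
p_asymptotic = 0.5 * chi2.sf(two_lnQ, df=1)      # one-sided asymptotic p-value
print("2 ln Q =", two_lnQ, " Z approx", np.sqrt(two_lnQ), " p approx", p_asymptotic)
```

Comparing with the exact Poisson p-value from the earlier sketch shows how good (or not) the asymptotic approximation is at these small event counts.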

Expected p-values and Error Rates
If H0 is true, the distribution of the p-value is uniform between 0 and 1. If H1 is true, the distribution of p-values is peaked towards smaller values (and can be quite small if our sensitivity is large).
We quote sensitivity as the median expected p-value if H1 is true. Physicists say "sensitivity"; statisticians say "power".
We need to set thresholds on the p-value to claim evidence or discovery (3σ and 5σ). These thresholds are the error rates: e.g., 2.87E-7 is the error rate for false 5σ discoveries. These are called "Type-I errors" in statistics jargon: rejecting H0 when it is true.
One can also calculate the probability of a 5σ discovery if H1 is true; spokespeople and lab directors like this.
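A small sketch of quoting sensitivity as the median expected p-value under H1, for the simple counting case; the signal rate s = 9 below is an illustrative assumption.

```python
# Sketch: sensitivity as the median expected p-value if H1 (s+b) is true,
# for a single-bin counting experiment.
from scipy.stats import poisson

b, s = 6.0, 9.0                                # assumed background and signal rates
n_med = int(poisson.ppf(0.5, s + b))           # median outcome if H1 is true
median_expected_p = poisson.sf(n_med - 1, b)   # p-value of that outcome under H0
print("median outcome under H1:", n_med, " median expected p-value:", median_expected_p)
```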

Incorporating Systematic Uncertainties into the p-Value
Two plausible options:
"Supremum p-value": choose the ranges of nuisance parameters for which the p-value is to be valid, scan over that space, calculate the p-value at each point, and take the largest (i.e., least significant, most "conservative") one. This is "frequentist" -- at least it's not Bayesian.
"Prior predictive p-value" (Cousins and Highland): when evaluating the distribution of the test statistic, vary the nuisance parameters within their prior distributions. The resulting p-values are no longer fully frequentist but are a mixture of Bayesian and frequentist reasoning. (In fact, adding statistical and systematic errors in quadrature is also a mixture of Bayesian and frequentist reasoning.) But this option is very popular. A small sketch of the prior-predictive approach follows below.
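A minimal sketch of the prior-predictive (Cousins-Highland style) p-value for a counting experiment: each background-only pseudoexperiment draws its own background rate from the prior describing the systematic uncertainty. The truncated-Gaussian prior and the numbers below are illustrative assumptions.

```python
# Sketch: prior-predictive p-value.  Each toy fluctuates the nuisance parameter
# (here, the background rate) within its prior before throwing the Poisson count.
import numpy as np

rng = np.random.default_rng(42)
b0, sigma_b = 6.0, 1.2          # nominal background and its systematic uncertainty (assumed)
n_obs = 10
ntoys = 200_000

b_toys = np.clip(rng.normal(b0, sigma_b, ntoys), 0.0, None)   # crude truncation at zero
n_toys = rng.poisson(b_toys)

p_prior_pred = np.mean(n_toys >= n_obs)
print("prior-predictive p-value:", p_prior_pred)
# compare with the no-systematics value P(n >= 10 | b = 6), about 0.084
```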

Fitting and Fluctuating
Monte Carlo pseudoexperiments are used to get p-values. The test statistic -2lnQ is not uncertain for the data; the distribution from which -2lnQ is drawn is what is uncertain!
Nuisance-parameter fits in the numerator and denominator of -2lnQ do not by themselves incorporate systematics into the result. Example: in a one-bin search, all test statistics are equivalent to the event count, fit or no fit. Instead, we fluctuate the probabilities of getting each outcome, since those are what we do not know: each pseudoexperiment gets random values of the nuisance parameters.
Why fit at all? It's an optimization: fitting reduces sensitivity to the uncertain true values and to the fluctuated values. For stability and speed, you can choose to fit a subset of the nuisance parameters (the ones that are constrained by the data), and you can do constrained or unconstrained fits; it's your choice.
If you are not using pseudoexperiments but relying on Wilks's theorem, then the fits are important for correctness, not just optimality.

The Trials Factor
Also called the "look-elsewhere effect"; bump-hunters are familiar with it. What is the probability of an upward fluctuation as big as the one I saw anywhere in my histogram? Lots of bins means lots of chances at a false discovery.
Approximation: multiply the smallest p-value by the number of "independent" models sought (not the number of histogram bins!). For bump hunters this is roughly (histogram width)/(mass resolution). A rough numerical sketch follows below.
Criticisms: the adjusted p-value can now exceed unity! What if histogram bins are empty? What if we seek things that have been ruled out already?
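A rough numerical sketch of the approximation above. The spectrum width, resolution and local p-value are made-up numbers, and the binomial-style variant in the last line is a common alternative not mentioned on the slide.

```python
# Sketch: rough trials-factor correction for a bump hunt.
hist_width = 200.0     # GeV of spectrum searched (assumed)
resolution = 5.0       # GeV mass resolution (assumed)
n_trials = hist_width / resolution

p_local = 1.0e-4
p_global_naive = min(1.0, n_trials * p_local)           # can saturate at 1, as criticized
p_global_indep = 1.0 - (1.0 - p_local) ** n_trials      # assumes independent tries
print(n_trials, p_global_naive, p_global_indep)
```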

The Trials Factor
More seriously, what do we do if the p-value comes from a big combination of many channels, each optimized at each m_H sought? The channels have different resolutions (and is "resolution" even the right word for a multivariate discriminant?), and they vary their weight in the combination as cross sections and branching ratios change with m_H.
Proper treatment: we want a p-value of p-values (use the p-value itself as a test statistic). Run pseudoexperiments, analyze each one at each m_H studied, and look at the distribution of smallest p-values. This is next to impossible unless the analyzers somehow supply how each pseudo-dataset looks at each test mass.

An internal CDF study that didn't make it to prime time: a dimuon mass spectrum with a signal fit. About 60.9 events were fit in the bigger signal peak (4σ? No!). [Figure: null-hypothesis pseudoexperiments with the largest fitted peak values (not enough pseudoexperiments).]

Looking Everywhere in an m_ee Plot
Method:
- scan along the mass spectrum in 1 GeV steps;
- at each point, work out the probability for the background to fluctuate up to or above the data in a window centred on that point (window size is 2 times the width of a Z' peak at that mass);
- systematics are included by smearing with a Gaussian whose mean and sigma are given by the background and its error;
- use pseudoexperiments to determine how often a given probability will occur, e.g. a probability ≤ 0.001 will occur somewhere in the spectrum 5-10% of the time.
[Figure: a single pseudoexperiment.]

Aside: Blind Analysis
The fear is of intentional or even unintentional biasing of results by experimenters modifying the analysis procedure after the data have been collected. The problem is bigger when event counts are small, since cuts can be designed around individual observed events.
Ideal case: construct and optimize the analysis before the experiment is run. Almost ideal: just don't look at the data.
The hadron collider environment requires data-driven calibration of backgrounds and efficiencies, so it is often necessary to look at "control regions" ("sidebands") to do calibrations. Be careful not to look "inside the box" until the analysis is finalized, and that includes the systematic errors!

At Least They Explained What They Did
Dijet mass sum in e+e- → jjjj. ALEPH Collaboration, Z. Phys. C71, 179 (1996): "the width of the bins is designed to correspond to twice the expected resolution... and their origin is deliberately chosen to maximize the number of events found in any two consecutive bins".

No Discovery and No Measurement? No Problem!
Often we are just not sensitive enough (yet) to discover a particular new particle we're looking for, even if it's truly there. Or we'd like to test a lot of models (each SUSY parameter choice is a model), and they can't all be true. It is our job as scientists to explain what we could have found had it been there: "How hard did you look?"
Strategy: exclude models, i.e. set limits! These can be frequentist, semi-frequentist, or Bayesian.

CLs Limits: an Extension of the p-value Argument
Advantages: exclusion and discovery p-values are consistent. For example, a 2σ upward fluctuation of the data with respect to the background prediction appears as such both in the limit and in the p-value. And CLs does not exclude where there is no sensitivity (with a big enough search region and small enough resolution you get a 5% dusting of random exclusions with CLs+b).
p-values:
CLb = P(-2lnQ ≥ -2lnQ_obs | b only)
CLs+b = P(-2lnQ ≥ -2lnQ_obs | s+b)   (the green area)
"1-CLb" = P(-2lnQ ≤ -2lnQ_obs | b only)   (the yellow area)
CLs ≡ CLs+b / CLb ≥ CLs+b. Exclude at 95% CL if CLs < 0.05; vary the signal rate scale r until CLs = 0.05 to get r_lim (apologies for the notation).

A Simple Case: CLs in a Counting Search
Here -2lnQ is just a monotonic function of the observed number of events. In this case more events is more "signal-like" (since s+b > b); that is not always the case.
CLs+b = p(n ≤ n_obs | s+b): the probability of s+b fluctuating downwards to n_obs or less (question: why not ask for equality?). "What is the chance of missing the signal this badly?"
CLb = p(n ≤ n_obs | b): not quite 1 minus the discovery p-value (the inequality is flipped).
CLs = CLs+b / CLb. A counting-experiment sketch follows below.
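A minimal sketch of CLs for the single-bin counting case, following the definitions on this slide; b and n_obs are illustrative numbers, and scipy's root finder does the "vary r until CLs = 0.05" step for the signal rate itself.

```python
# Sketch: CLs upper limit for a single-bin counting search.
# CLs+b = P(n <= n_obs | s+b), CLb = P(n <= n_obs | b), CLs = CLs+b / CLb;
# exclude at 95% CL if CLs < 0.05, and scale the signal until CLs = 0.05.
from scipy.stats import poisson
from scipy.optimize import brentq

b, n_obs = 6.0, 5      # assumed background and observed count

def cls(s):
    cl_sb = poisson.cdf(n_obs, s + b)   # P(n <= n_obs | s+b)
    cl_b = poisson.cdf(n_obs, b)        # P(n <= n_obs | b)
    return cl_sb / cl_b

s_lim = brentq(lambda s: cls(s) - 0.05, 1e-6, 100.0)   # solve CLs(s) = 0.05
print("95% CL (CLs) upper limit on s:", s_lim)
```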

Overcoverage on Exclusion
T. Junk, Nucl. Instrum. Meth. A 434 (1999) 435. There is no similar penalty for the discovery p-value 1-CLb.
Coverage: the "false exclusion rate" should be no more than 1 - confidence level. In this case, if a signal were truly there, we'd exclude it no more than 5% of the time (the "Type-II error rate": excluding H1 when it is true).
Exact coverage: 5% error rate (at 95% CL). Overcoverage: <5% error rate. Undercoverage: >5% error rate.
Overcoverage is introduced by the ratio CLs = CLs+b/CLb; it's the price we pay for not excluding what we have no sensitivity to.

Different Kinds of Analyses Switching On and Off
OPAL's flavor-independent search for a hadronically decaying Higgs boson used two overlapping analyses: one can pick the one with the smallest median CLs, or separate them into mutually exclusive sets. This is important for SUSY Higgs searches.

The "Neyman Construction" of Frequentist Confidence Intervals
Essentially a "calibration curve": pick an observable x somehow related to the parameter θ you'd like to measure; figure out what the distribution of observed x would be for each possible value of θ; draw bands containing 68% (or 95%, or whatever) of the outcomes; then invert the relationship. Proper coverage is guaranteed! A sketch of this construction for a Poisson observable follows below.
A pathology: you can get an empty interval, because the error rate has to be the specified one. Imagine publishing that all branching ratios between 0 and 1 are excluded at 95% CL.
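A sketch of the Neyman construction for a Poisson observable (a signal s on top of a known background b), using central 68% bands; the numbers and the central-band choice are illustrative assumptions, and the empty-interval pathology mentioned above can indeed occur for small observed counts.

```python
# Sketch: Neyman construction for n ~ Poisson(s + b) with known b.
# For each s on a grid, build a central band of outcomes containing at least 68%;
# the confidence interval for an observed n is the set of s whose band contains n.
import numpy as np
from scipy.stats import poisson

b = 3.0                                # assumed known background
s_grid = np.linspace(0.0, 20.0, 2001)
n_obs = 7                              # assumed observed count

accepted = []
for s in s_grid:
    mu = s + b
    lo = poisson.ppf(0.16, mu)         # drop roughly 16% in each tail
    hi = poisson.isf(0.16, mu)
    if lo <= n_obs <= hi:
        accepted.append(s)

if accepted:
    print("68% CL interval for s: [%.2f, %.2f]" % (min(accepted), max(accepted)))
else:
    print("empty interval (the pathology mentioned on the slide)")
```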

A Special Case of Frequentist Confidence Intervals: Feldman-Cousins
G. Feldman and R. Cousins, "A Unified Approach to the Classical Statistical Analysis of Small Signals", Phys. Rev. D 57, 3873 (1998); arXiv:physics/9711021.
Each horizontal band contains 68% of the expected outcomes (for 68% CL intervals), but Neyman doesn't prescribe which 68% of the outcomes you need to take! Take the lowest x values and you get lower limits; take the highest x values and you get upper limits.
Cousins and Feldman: sort outcomes by the likelihood ratio R = L(x|θ)/L(x|θ_best), where R = 1 when θ equals the best-fit value θ_best for that x. This ordering picks one-sided or two-sided intervals automatically, so there is no flip-flopping between limits and two-sided intervals, and there are no empty intervals!
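A sketch of the Feldman-Cousins likelihood-ratio ordering for a Poisson count with known background; the numbers, grid and truncation choices below are mine, not the paper's.

```python
# Sketch: Feldman-Cousins acceptance bands for n ~ Poisson(s + b) with known b.
# For a given s, outcomes n are ranked by R = P(n|s+b)/P(n|s_best+b) with
# s_best = max(0, n - b), and added in decreasing R until the band holds >= CL.
import numpy as np
from scipy.stats import poisson

def fc_band(s, b, cl=0.90, n_max=200):
    n = np.arange(n_max)
    p = poisson.pmf(n, s + b)
    s_best = np.maximum(n - b, 0.0)
    r = p / poisson.pmf(n, s_best + b)        # likelihood-ratio ordering variable
    accept, total = [], 0.0
    for i in np.argsort(-r):                  # most favoured outcomes first
        accept.append(n[i]); total += p[i]
        if total >= cl:
            break
    return min(accept), max(accept)

def fc_interval(n_obs, b, cl=0.90, s_grid=np.linspace(0.0, 30.0, 3001)):
    inside = []
    for s in s_grid:
        lo, hi = fc_band(s, b, cl)
        if lo <= n_obs <= hi:
            inside.append(s)
    return min(inside), max(inside)

print(fc_interval(n_obs=3, b=3.0))   # compare with the 90% CL tables in the FC paper
```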

Some Properties of Frequentist Confidence Intervals
Really just one: coverage. If the experiment is repeated many times, the intervals obtained will include the true value at the specified rate (say, 68% or 95%); conversely, the rest of them (a fraction 1 - CL) must not contain the true value. The interval obtained in a particular experiment may obviously be in the unlucky fraction.
Intervals may lack credibility but still cover. Example: 68% of the intervals run from -∞ to +∞, and 32% of them are empty. Coverage is good, but the power is terrible.
Feldman-Cousins solves some of these problems, but not all. You can still get a 68% CL interval that spans the entire domain of θ; imagine publishing that a branching ratio is between 0 and 1 at 68% CL. It is still possible to exclude models to which there is no sensitivity: FC assumes the model parameter space is complete, i.e. one of the models in there is the truth. If you find it, you can rule out others even if you cannot test them directly.