Statistical Methods for Data Analysis parameter estimate


1 Statistical Methods for Data Analysis parameter estimate
Luca Lista INFN Napoli

2 Statistical Methods for Data Analysis
Contents:
- Parameter estimates
- Likelihood function
- Maximum likelihood method
- Problems with asymmetric errors
Luca Lista Statistical Methods for Data Analysis

3 Meaning of parameter estimate
We are interested in some unknown physical parameters. Experiments provide samplings of some PDF which has, among its parameters, the physical unknowns we are interested in. The experiment's results are statistically "related" to the unknown PDF, so the PDF parameters can be determined from the sample within some approximation or uncertainty. Knowing a parameter within some error may mean different things:
- Frequentist: in the limit of a large number of experiments, a large fraction (usually 68% or 95%) of the experiments will contain the (fixed) unknown true value within the quoted confidence interval, usually [θ̂ − σ, θ̂ + σ] ("coverage")
- Bayesian: we determine a degree of belief that the unknown parameter is contained in a specified interval, which can be quantified as 68% or 95%
We will see that there is still some more degree of arbitrariness in the definition of confidence intervals…

4 Statistical inference
- Probability: Theory/Model → Data (data fluctuate according to the randomness of the process)
- Inference: Data → Theory/Model (model uncertainty due to fluctuations of the data sample)

5 Statistical Methods for Data Analysis
Hypothesis tests: Theory Model 1 vs. Theory Model 2. Which hypothesis is the most consistent with the experimental data?

6 Statistical Methods for Data Analysis
Parameter estimators: an estimator is a function of a given sample whose statistical properties are known and related to some PDF parameters ("best fit"). Simplest example:
- Assume we have a Gaussian PDF with a known σ and an unknown μ
- A single experiment will provide a measurement x
- We estimate μ as μ_est = x
- The distribution of μ_est (repeating the experiment many times) is the original Gaussian
- 68.27%, on average, of the experiments will provide an estimate within: μ − σ < μ_est < μ + σ
- We can quote: μ = μ_est ± σ
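A minimal simulation of the statement above; the true value and resolution (mu_true = 5, sigma = 2) are invented for illustration. Repeating the single-measurement experiment many times, the interval μ_est ± σ covers the fixed true μ in about 68.27% of the cases:

```python
import numpy as np

rng = np.random.default_rng(42)
mu_true, sigma = 5.0, 2.0  # hypothetical true parameter and known resolution
n_experiments = 100_000

# Each experiment yields a single measurement x; the estimate is mu_est = x.
x = rng.normal(mu_true, sigma, size=n_experiments)

# Fraction of experiments whose interval [mu_est - sigma, mu_est + sigma]
# contains the (fixed) true value: the frequentist coverage.
covered = np.abs(x - mu_true) < sigma
print(f"coverage: {covered.mean():.4f}")  # close to 0.6827
```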

7 Statistical Methods for Data Analysis
Likelihood function: given a sample of N events, each with variables (x1, …, xn), the likelihood function expresses the probability density of the sample as a function of the unknown parameters: L(x1, …, xn; θ1, …, θm). Sometimes the notation used for the parameters is the same as for conditional probability: L(x1, …, xn | θ1, …, θm). If the size N of the sample is also a random variable, the extended likelihood function is also used, where the sample size is modeled, most of the time, by a Poisson distribution whose average is a function of the unknown parameters. In many cases it is convenient to use −ln L or −2 ln L.

8 Maximum likelihood estimates
ML is the most widely used parameter estimator. The "best fit" parameters are the set that maximizes the likelihood function. It has "very good" statistical properties, as will be seen in the following. The maximization can be performed analytically for the simplest cases, and numerically for most cases. Minuit is historically the most used minimization engine in High Energy Physics (F. James, 1970s; recently rewritten in C++).
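A sketch of a numerical ML maximization, here of an exponential-decay likelihood using SciPy rather than Minuit; the sample and the true parameter value (τ = 2) are invented. For this model the analytic ML solution is the sample mean, which the numerical minimum reproduces:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
tau_true = 2.0  # hypothetical true lifetime
t = rng.exponential(tau_true, size=10_000)

# -2 ln L for the exponential PDF f(t; tau) = (1/tau) exp(-t/tau)
def nll2(tau):
    return 2.0 * (len(t) * np.log(tau) + np.sum(t) / tau)

res = minimize_scalar(nll2, bounds=(0.1, 10.0), method="bounded")
tau_hat = res.x
print(tau_hat, t.mean())  # numerical minimum vs. analytic ML solution
```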

9 Extended likelihood function
Given a sample of N measurements of the variables (x1, …, xn), the likelihood function expresses the probability density of the sample as a function of the unknown parameters. If the size N of the sample is also a random variable, the extended likelihood function is usually also used, where P(N; θ1, …, θm) is in practice always a Poisson distribution whose expected rate is a function of the unknown parameters. In many cases it is convenient to use −ln L or −2 ln L.

10 Extended likelihood function
For Poissonian signal and background processes we can fit simultaneously s, b and θ by minimizing −2 ln L. Sometimes s is replaced by μ s0, where s0 is the theory estimate and μ is called the signal strength.

11 Statistical Methods for Data Analysis
Example of ML fit: the exponential decay parameter λ, the Gaussian mean μ and standard deviation σ can be fit together with the signal and background yields s and b. Ps(m): Gaussian peak; Pb(m): exponential shape. The additional parameters beyond the parameters of interest (s in this case), used to model background, resolution, etc., are examples of nuisance parameters. In the plot, data are accumulated into bins of a given width; error bars usually represent the uncertainty on each bin count (in this case: Poissonian).
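A toy version of such an extended fit, with invented yields and shape parameters (Gaussian peak over an exponential background); for brevity the PDF normalization over the fit window is neglected, which a real fit would handle properly:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(6)
# Toy sample: Gaussian peak (signal) over an exponential shape (background)
m = np.concatenate([rng.normal(2.5, 0.2, size=300),
                    rng.exponential(2.0, size=700)])
m = m[(m > 0.0) & (m < 5.0)]

def nll2(params):
    # Crude positivity guard, adequate for this sketch
    s, b, mu, sigma, lam = np.abs(params)
    p_sig = norm.pdf(m, mu, sigma)
    p_bkg = lam * np.exp(-lam * m)  # normalization over [0, 5] neglected
    # Extended -2 ln L: Poisson yield term plus per-event mixture density
    return 2.0 * ((s + b) - np.sum(np.log(s * p_sig + b * p_bkg)))

res = minimize(nll2, x0=[250.0, 650.0, 2.4, 0.3, 0.6], method="Powell")
s_fit, b_fit, mu_fit, sigma_fit, lam_fit = np.abs(res.x)
print(f"s = {s_fit:.0f}, b = {b_fit:.0f}, mu = {mu_fit:.3f}, sigma = {sigma_fit:.3f}")
```

Here s and b are the parameters of interest and μ, σ, λ play the role of nuisance parameters.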

12 Gaussian case
If we have n independent measurements all modeled with (or approximated by) the same Gaussian PDF, we have: −2 ln L = Σi (xi − μ)²/σ² + constant terms (an example of a χ² variable). An analytical minimization of −2 ln L w.r.t. μ (assuming σ² is known) gives the arithmetic mean as the ML estimate of μ: μ̂ = (1/n) Σi xi. If σ² is also unknown, the ML estimate of σ² is: σ̂² = (1/n) Σi (xi − μ̂)². The above estimate can be demonstrated to have an unpleasant feature, called bias (see next slide).

13 Statistical Methods for Data Analysis
Estimator properties:
- Consistency
- Bias
- Efficiency
- Robustness

14 Estimator consistency
The estimator converges (in probability) to the true value. ML estimators are consistent:
\forall \varepsilon>0\,\,\,\lim_{n\rightarrow\infty}P(|\hat{\theta}_n -\theta|<\varepsilon) = 1

15 Efficiency of an estimator
The variance of any consistent estimator is subject to a lower bound (the Cramér-Rao bound), which involves the bias b(θ) of the estimator and the Fisher information I(θ). Efficiency can be defined as the ratio of the Cramér-Rao bound to the estimator's variance. The efficiency of ML estimators tends to 1 for a large number of measurements, i.e. ML estimates have, asymptotically, the smallest possible variance.
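A sketch of the bound in the notation above (b is the estimator's bias, I the Fisher information, ε the efficiency); the formulas shown on the slide are not in the transcript, so this is the standard textbook form:

```latex
\mathrm{Var}[\hat{\theta}] \;\ge\;
\frac{\left(1 + \dfrac{\partial b(\theta)}{\partial \theta}\right)^{2}}{I(\theta)},
\qquad
I(\theta) = \mathrm{E}\!\left[\left(\frac{\partial \ln L}{\partial \theta}\right)^{\!2}\right],
\qquad
\varepsilon(\theta) = \frac{\mathrm{Var}_{\mathrm{CR}}[\hat{\theta}]}{\mathrm{Var}[\hat{\theta}]}
```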

16 Statistical Methods for Data Analysis
Bias of an estimator: the bias is the average value of the estimator's deviation from the true value. ML estimators may have a bias, but the bias decreases with a large number of measurements (if the fit model is correct…!). E.g.: in the case of the estimate of a Gaussian's σ², the ML method underestimates the variance, and the unbiased estimate is the well known: s² = 1/(n−1) Σi (xi − x̄)².
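The bias of the ML variance estimate can be checked with a quick toy study (the numbers are illustrative): dividing by n underestimates σ² by the factor (n − 1)/n, while dividing by n − 1 does not:

```python
import numpy as np

rng = np.random.default_rng(7)
sigma2_true = 4.0
n = 5  # small samples make the bias clearly visible

samples = rng.normal(0.0, np.sqrt(sigma2_true), size=(200_000, n))
mean = samples.mean(axis=1, keepdims=True)

# ML estimate divides by n; the unbiased estimate divides by n - 1.
var_ml = ((samples - mean) ** 2).sum(axis=1) / n
var_unbiased = ((samples - mean) ** 2).sum(axis=1) / (n - 1)

print(var_ml.mean())        # ~ sigma2_true * (n - 1)/n = 3.2
print(var_unbiased.mean())  # ~ 4.0
```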

17 Statistical Methods for Data Analysis
Robustness: if the sample distribution has (slight?) deviations from the theoretical PDF model, e.g. unexpected tails ("outliers"), some estimators may deviate more than others from the true value.
- The median is a robust estimate of a distribution's average, while the mean is not
- Trimmed estimators: remove the n most extreme values
Evaluation of estimator robustness:
- Breakdown point: the maximum fraction of incorrect measurements above which the estimate may become arbitrarily large. Estimators trimmed at x% have a breakdown point of x; the median has a breakdown point of 0.5
- Influence function: deviation of the estimator if one measurement is replaced by an arbitrary (incorrect) measurement
Details are beyond the purpose of this course…
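A short illustration of the point about the median (toy numbers): one gross outlier drags the mean far away while the median barely moves:

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(10.0, 1.0, size=999)

# Replace one measurement with a gross outlier.
data_out = np.append(data, 1e6)

# The mean is dragged far from 10; the median is essentially unchanged.
print(np.mean(data_out))
print(np.median(data_out))
```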

18 Neyman’s confidence intervals
Procedure to determine frequentist confidence intervals (plot from the PDG statistics review):
- Scan the allowed range of an unknown parameter θ
- Given a value of θ, compute the interval [x1, x2] that contains x with a probability 1 − α equal to 68% (or 90%, 95%); a choice of interval is needed!
- Invert the confidence belt: for an observed value of x, find the interval [θ1, θ2]
A fraction of the experiments equal to 1 − α will measure x such that the corresponding [θ1, θ2] contains ("covers") the true value of θ ("coverage"). Note: the random variables are [θ1, θ2], not θ! α is the significance level.

19 Simplest example: Gaussian case
Assume a Gaussian distribution with unknown average μ and known σ = 1. The belt inversion is trivial and gives the expected result: central value μ̂ = x, [μ1, μ2] = [x − σ, x + σ] (1 − α = 68%). So we can quote: μ = x ± σ.

20 Statistical Methods for Data Analysis
Binomial intervals: the Neyman belt construction may only guarantee approximate coverage in the case of discrete variables. For a binomial distribution: find the interval {nmin, …, nmax} whose probability content is at least 1 − α. Clopper and Pearson (1934) solved the belt inversion problem for central intervals: for an observed n = k, find the lowest plo and highest pup such that:
P(n ≥ k | N, plo) = α/2, P(n ≤ k | N, pup) = α/2
E.g.: for n = N = 10, P(N | N, p) = p^N = α/2, hence: plo = (α/2)^(1/10) = 0.83 (68% CL), 0.74 (90% CL). A frequently used Gaussian approximation, which fails for n = 0 and n = N, is: p̂ = n/N ± √(p̂(1 − p̂)/N).
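The Clopper-Pearson interval can be computed from beta-distribution quantiles; a sketch reproducing the slide's n = N = 10 example (the helper name `clopper_pearson` is ours):

```python
from scipy.stats import beta

def clopper_pearson(k, N, cl=0.6827):
    """Central Clopper-Pearson interval for k successes out of N trials."""
    alpha = 1.0 - cl
    lo = 0.0 if k == 0 else beta.ppf(alpha / 2, k, N - k + 1)
    hi = 1.0 if k == N else beta.ppf(1 - alpha / 2, k + 1, N - k)
    return lo, hi

lo, hi = clopper_pearson(10, 10)
print(lo, hi)  # lo ~ 0.83 as on the slide, hi = 1.0
```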

21 Clopper-Pearson coverage (I)
CP intervals are often defined as "exact" in the literature, but exact coverage is often impossible to achieve for discrete variables (coverage plot for N = 10, 1 − α = 68%).

22 Clopper-Pearson coverage (II)
For larger N the coverage "ripple" gets closer to the nominal 68% (coverage plot for N = 100, 1 − α = 68%).

23 Approx. maximum likelihood errors
A parabolic approximation of −2 ln L around the minimum is equivalent to a Gaussian approximation, sufficiently accurate in many but not all cases. The covariance matrix is estimated from the 2nd-order partial derivatives w.r.t. the fit parameters at the minimum. Implemented in Minuit as the MIGRAD/HESSE function.
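A HESSE-like sketch (toy sample): estimate the parabolic error of a Gaussian mean from a finite-difference second derivative of −2 ln L at the minimum; the exact answer here is σ/√n:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma = 2.0
x = rng.normal(3.0, sigma, size=400)

def nll2(mu):
    # -2 ln L for a Gaussian with known sigma (constant terms dropped)
    return np.sum((x - mu) ** 2) / sigma**2

mu_hat = x.mean()  # analytic ML estimate

# Parabolic approximation: error^2 = 2 / (d^2(-2 ln L)/d mu^2), with the
# second derivative estimated by finite differences at the minimum.
h = 1e-3
d2 = (nll2(mu_hat + h) - 2 * nll2(mu_hat) + nll2(mu_hat - h)) / h**2
err = np.sqrt(2.0 / d2)
print(err, sigma / np.sqrt(len(x)))  # both ~ 0.1
```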

24 Statistical Methods for Data Analysis
Asymmetric errors: an alternative to the parabolic approximation is to evaluate the excursion range of −2 ln L. The error (nσ) is determined by the range around the minimum for which −2 ln L increases by +1 (+n² for nσ intervals): θ̂ − δ− < θ < θ̂ + δ+. Errors can be asymmetric. For a Gaussian PDF the result is identical to the one from the 2nd-order derivative matrix. Implemented in Minuit as the MINOS function.
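A MINOS-like sketch (toy sample): scan −2 ln L of an exponential-decay fit for the two crossings at minimum + 1; for this model the upper error comes out larger than the lower one:

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
t = rng.exponential(2.0, size=50)  # small sample -> visibly asymmetric errors

def nll2(tau):
    return 2.0 * (len(t) * np.log(tau) + t.sum() / tau)

tau_hat = t.mean()  # analytic ML estimate
target = nll2(tau_hat) + 1.0  # minimum + 1 defines the 1-sigma range

# Find the two crossings of -2 ln L = min + 1 (MINOS-like asymmetric errors).
lo = brentq(lambda x: nll2(x) - target, 0.1 * tau_hat, tau_hat)
hi = brentq(lambda x: nll2(x) - target, tau_hat, 10 * tau_hat)
print(f"tau = {tau_hat:.2f} -{tau_hat - lo:.2f} +{hi - tau_hat:.2f}")
```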

25 Error of the (Gaussian) average
We have the previous log-likelihood function: −2 ln L = Σi (xi − μ)²/σ² + const. The error on μ is given by the excursion Δ(−2 ln L) = 1, i.e. the error on the average is: σ_μ = σ/√n.

26 Statistical Methods for Data Analysis
Exercise: assume we have n independent measurements from an exponential PDF: f(t; λ) = λ e^(−λt). How can we estimate λ and its error by ML?

27 Statistical Methods for Data Analysis
2D intervals: in more dimensions one can determine 1σ and 2σ contours. Note: the probability content in 2D is different from the one-dimensional case, so 68% and 95% contours are usually preferable.

  Width   P(1D)    P(2D)
  1σ      0.6827   0.3934
  2σ      0.9545   0.8647
  3σ      0.9973   0.9889

To reach the 1D probability contents in 2D, the contours must be drawn at 1.515σ, 2.486σ and 3.439σ respectively.
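The table values follow from the χ² distribution with 1 and 2 degrees of freedom; a quick check (sketch):

```python
import numpy as np
from scipy.stats import chi2

# Probability content of an n-sigma contour in 1D vs. 2D.
for n_sigma in (1, 2, 3):
    p1d = chi2.cdf(n_sigma**2, df=1)
    p2d = chi2.cdf(n_sigma**2, df=2)
    print(n_sigma, round(p1d, 4), round(p2d, 4))

# Contour width needed in 2D to reach the 1D probability content:
r68 = np.sqrt(chi2.ppf(0.6827, df=2))
print(round(r68, 3))  # ~ 1.515
```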

28 Statistical Methods for Data Analysis
Example of 2D contour, from the previous fit example (Ps(m): Gaussian peak; Pb(m): exponential shape): the exponential decay parameter, Gaussian mean and standard deviation are fit together with the s and b yields. The 1σ contour (39.4% CL) shows in this case a mild correlation between s and b.

29 Statistical Methods for Data Analysis
Error propagation: assume we estimate from a fit the parameter set θ = (θ1, …, θn) and we know their covariance matrix Θij. We want to determine a new set of parameters that are functions of θ: η = (η1, …, ηm). For small uncertainties, a linear approximation may be sufficient: a Taylor expansion around the central values of θ gives the covariance matrix of η as Θ′kl = Σij (∂ηk/∂θi)(∂ηl/∂θj) Θij. A few examples in the case of no correlation: for η = θ1 ± θ2, ση² = σ1² + σ2²; for η = θ1 θ2 or η = θ1/θ2, (ση/η)² = (σ1/θ1)² + (σ2/θ2)².
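A sketch of the linear propagation Θ′ = J Θ Jᵀ for two invented functions (a product and a ratio) of two uncorrelated parameters:

```python
import numpy as np

# Central values and covariance of the fitted parameters (toy numbers).
theta = np.array([2.0, 3.0])
cov_theta = np.diag([0.1**2, 0.2**2])  # uncorrelated for simplicity

# Transformation eta = (theta1 * theta2, theta1 / theta2) and its Jacobian
# J_ij = d eta_i / d theta_j evaluated at the central values.
J = np.array([
    [theta[1], theta[0]],
    [1.0 / theta[1], -theta[0] / theta[1] ** 2],
])
cov_eta = J @ cov_theta @ J.T
print(np.sqrt(np.diag(cov_eta)))  # propagated errors on the new parameters
```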

30 Care with asymmetric errors
Be careful about:
- Asymmetric error propagation
- Combining measurements with asymmetric errors
- The difference between the "most likely value" and the "average value"
A naïve quadrature sum of σ+ and σ− leads to a wrong answer: it violates the central limit theorem, since the combined result should be more symmetric than the original sources! A model of the non-linear dependence may be needed for quantitative calculations, and biases are easy to introduce (depending on σ+ − σ− and on the non-linear model). It is much better to know the original PDF and propagate/combine the information properly, and to be careful about interpreting the meaning of the result: the average value and the variance propagate linearly, while the most probable value (mode) does not add linearly. Whenever possible, use a single fit rather than multiple cascaded fits, and quote the final asymmetric errors only.

31 Statistical Methods for Data Analysis
Non-linear models: the mean, variance and skewness add linearly when doing a convolution, but not the most probable values (fit)! For this model the central value is shifted. Online calculator (R. Barlow): java/statistics1.html. See: R. Barlow, PHYSTAT2003.

32 Statistical Methods for Data Analysis
Binned likelihood: sometimes data are available as a binned histogram. Most often each bin obeys Poissonian statistics (event counting). The likelihood function is the product of the Poisson PDFs corresponding to each bin with entries ni. The expected number of entries depends on some unknown parameters: μi = μi(θ1, …, θm), and the function to minimize is the corresponding −2 ln L. The expected number of entries μi is often approximated by a continuous function μ(x) evaluated at the center xi of the bin. Alternatively, μi can be a combination of other histograms ("templates"), e.g. the sum of different simulated processes with floating yields as fit parameters.
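A minimal binned Poisson fit along these lines (toy histogram; the model is evaluated at the bin centers, and the constant ln ni! terms are dropped from −2 ln L):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(11)
data = rng.normal(1.0, 0.5, size=2000)
counts, edges = np.histogram(data, bins=40, range=(-1.0, 3.0))
centers = 0.5 * (edges[:-1] + edges[1:])
width = edges[1] - edges[0]

# Expected entries per bin: continuous model evaluated at the bin center.
def expected(params):
    n_tot, mu, sigma = params
    return n_tot * width * norm.pdf(centers, mu, abs(sigma))

# -2 ln L for independent Poisson bins (constant ln(n_i!) terms dropped).
def nll2(params):
    mu_i = expected(params)
    return 2.0 * np.sum(mu_i - counts * np.log(mu_i))

res = minimize(nll2, x0=[2000.0, 0.8, 0.4], method="Powell")
n_fit, mu_fit, sigma_fit = res.x
print(mu_fit, abs(sigma_fit))
```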

33 Binned fits: minimum χ²
Bin entries can be approximated by Gaussian variables, for a sufficiently large number of entries, with standard deviation equal to √ni (Neyman's χ²). Maximizing L is then equivalent to minimizing:
χ² = Σi (ni − μ(xi; θ1, …, θm))² / ni
Sometimes the denominator ni is replaced (Pearson's χ²) by μi = μ(xi; θ1, …, θm) in order to avoid cases with zero or small ni. An analytic solution exists for linear and other simple problems (e.g. a linear fit model); most cases are treated numerically, as for unbinned ML fits.

34 Statistical Methods for Data Analysis
Binned fit example: binned fits are convenient w.r.t. unbinned fits because the number of inputs decreases from the number of entries to the number of bins, so they are usually simpler and faster numerically; unbinned fits become impractical for a very large number of entries. A fraction of the information is lost, hence a possible loss of precision for a small number of entries: treat bins with a small number of entries correctly! (Example: Gaussian fit determining the yield, μ and σ, with bins with a small number of entries in the tails.)

35 Statistical Methods for Data Analysis
Fit quality (χ² test): the maximum value of the likelihood function obtained from the fit doesn't usually give information about the goodness of the fit. The χ² of a fit with a Gaussian underlying model, instead, is distributed according to a known PDF P(χ²; n), where n is the number of degrees of freedom (number of bins − number of parameters). The cumulative distribution of P(χ²; n) follows a uniform distribution between 0 and 1 (p-value). If the model deviates from the assumed distribution, the distribution of the p-value will be more peaked around zero. Note! p-values are not the "probability of the fit hypothesis": that would be a Bayesian probability, with a different meaning, and should be computed in a different way.
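Computing the p-value from an observed χ² (the numbers are invented for illustration):

```python
from scipy.stats import chi2

# p-value of an observed chi^2 with ndf = (number of bins) - (fit parameters)
chi2_obs, n_bins, n_params = 42.0, 40, 3
ndf = n_bins - n_params
p_value = chi2.sf(chi2_obs, ndf)  # survival function: P(chi^2 >= chi2_obs)
print(round(p_value, 3))
```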

36 Binned likelihood ratio
A better alternative to the (Gaussian-inspired) Neyman and Pearson χ² has been proposed by Baker and Cousins, using the likelihood ratio:
χ²_λ = 2 Σi [μi − ni + ni ln(ni/μi)]
It has the same minimum as the Poisson likelihood function, since only a constant term has been added to the log-likelihood. In addition, it provides goodness-of-fit information, and asymptotically obeys a chi-squared distribution with n − m degrees of freedom (Wilks' theorem, see following slides). S. Baker, R. Cousins, NIM 221 (1984) 437.
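A direct transcription of the Baker-Cousins χ², handling empty bins where the ni ln(ni/μi) term vanishes (the function name is ours):

```python
import numpy as np

def baker_cousins_chi2(n, mu):
    """Poisson likelihood-ratio chi^2 (Baker-Cousins), valid for n_i = 0."""
    n, mu = np.asarray(n, dtype=float), np.asarray(mu, dtype=float)
    term = mu - n
    nonzero = n > 0
    term[nonzero] += n[nonzero] * np.log(n[nonzero] / mu[nonzero])
    return 2.0 * term.sum()

# Asymptotically distributed as chi^2 with (bins - parameters) d.o.f.
print(baker_cousins_chi2([5, 0, 12], [4.2, 0.3, 11.5]))
```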

37 Combining measurements
Assume two measurements of the same quantity m with different uncorrelated (Gaussian) errors: m1 ± σ1 and m2 ± σ2. Build the χ² = (m − m1)²/σ1² + (m − m2)²/σ2² and minimize it. The estimate of m is the weighted average with wi = σi⁻²:
m̂ = (Σi wi mi) / (Σi wi), with error estimate σ = (Σi wi)^(−1/2)
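The weighted average and its error, exactly as derived above (the helper name and input values are ours):

```python
import numpy as np

def weighted_average(values, errors):
    """Combine uncorrelated Gaussian measurements with weights w_i = 1/sigma_i^2."""
    w = 1.0 / np.asarray(errors, dtype=float) ** 2
    mean = np.sum(w * np.asarray(values, dtype=float)) / np.sum(w)
    error = 1.0 / np.sqrt(np.sum(w))
    return mean, error

m, err = weighted_average([10.2, 9.8], [0.3, 0.4])
print(m, err)
```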

38 Generalization of 𝜒2 to n dimensions
We have n measurements (m1, …, mn) with an n × n covariance matrix Cij. The expected values M1, …, Mn for m1, …, mn may depend on some theory parameter(s) θ. The following χ² can be minimized to obtain an estimate of the parameter(s) θ:
χ²(θ) = Σij (mi − Mi(θ)) (C⁻¹)ij (mj − Mj(θ))

39 Concrete examples

40 Global electroweak fit
A global χ² fit to electroweak measurements predicts the W mass, allowing a comparison with direct measurements.

41 More on electroweak fit
W mass vs top-quark mass from the global electroweak fit.

42 Fitting B(B J/) / B(B J/K)
Four variables: m = B reconstructed mass as J/ + charged hadron invariant mass E = Beam – B energy in the  mass hypothesis EK = Beam – B energy in the K mass hypothesis q = B meson charge Two samples: J/ , J/ ee Simultaneous fit of: Total yield of B J/, B J/K and background Resolutions separately for J/ , J/ ee Charge asymmetry (direct CP violation) Luca Lista Statistical Methods for Data Analysis

43 Statistical Methods for Data Analysis
E and EK Depend on charged hardron mass hypothesis! Luca Lista Statistical Methods for Data Analysis

44 Extended Likelihood function
To extract the ratio of branching ratios, the likelihood can be written separately, or combined, for the ee and μμ events. The fit contains the parameters of interest (mainly nπ, nK) plus uninteresting nuisance parameters. Separating q = +1 / −1 can be done by adding ACP as an extra parameter. The extended likelihood contains a Poisson term for the total yield plus the B → J/ψπ, B → J/ψK and background components (up to constant terms):
-\ln L = n_\pi + n_K + n_{bkg} - \sum_i \ln\left[ n_\pi P_\pi(\Delta E_{\pi i}, \Delta E_{K i}, m_i) + n_K P_K(\Delta E_{\pi i}, \Delta E_{K i}, m_i) + n_{bkg} P_{bkg}(\Delta E_{\pi i}, \Delta E_{K i}, m_i) \right]

45 Model for independent PDFs
EK D D E Luca Lista Statistical Methods for Data Analysis

46 Signals PDFs in new variables
(E, EK)  (E, EK-E), (EK,  EK-E) Luca Lista Statistical Methods for Data Analysis

47 Statistical Methods for Data Analysis
Background PDF: the background shape is taken from events in the mES sideband (mES < 5.27 GeV).

48 Dealing with kinematical pre-selection
Kinematic pre-selection: −120 MeV < ΔEπ, ΔEK < 120 MeV. The area is preserved after the transformation.

49 Statistical Methods for Data Analysis
Signal extraction: likelihood projections for the J/ψ → ee and J/ψ → μμ events, showing the B → J/ψK and B → J/ψπ signals and the background.

50 A concrete fit example (II)
Measurement of ms by CDF

51 B production at the Tevatron
Production: gg → bb̄, with no quantum-mechanical coherence, unlike at the B factories. The two b quarks have opposite flavor at production, so one of them can be used to tag the flavor of the other at production time; the fragmentation products also retain some memory of the b flavor. Ingredients: Bs reconstruction and b-quark flavor tagging. Since mixing here is incoherent (many particles are produced between the b and the b̄), the flavor at production is determined using the away-side B.

52 Bs vs Bd mixing
The mixing asymmetry A = (Nunmix − Nmix)/(Nunmix + Nmix) = cos(Δms t), with Δms >> Δmd: a different oscillation regime w.r.t. Bd, with many oscillations per B lifetime. Amplitude scan: rather than fitting for the frequency, perform a 'Fourier transform'-like scan of the oscillation amplitude as a function of Δms [ps⁻¹]. The proper-time resolution is a critical factor.

53 Mixing in the real world
Mixing in the real world: flavor tagging power (≈ 1.5%), proper-time resolution (0.1–0.4 ps). Significance =

54 Likelihood definition
The likelihood is a product of PDFs (for the i-th event) defined in terms of:
- +/−: same/opposite b flavor
- A: amplitude (A = 1 for the right Δms, 0 for a wrong value), fitted at each point of the scan with Δms fixed
- D: dilution factor, D = 1 − 2w (w = wrong-tag fraction)
- ε: trigger + selection efficiency, which depends on t and is taken from MC
- G: resolution function, Gaussian with resolution σ estimated event by event
- Γs: decay width of the Bs, the inverse of the decay time

55 Statistical Methods for Data Analysis
Bs mixing: method Mixing amplitude A fitted for each (fixed) value of m On average A=0 for every wrong m A=1 for right value of m Green band below 1  A=1 is excluded at 95% C.L. 1.645 = 95%C.L. Exclude range where green area is below the A=1 line Actual limit for a single experiment defined by the systematic band centered at the measured asymmetry “Sensitivity”: m for which the average 1.645r.m.s. of many “toy” experiments [with A = 0] reaches 1 Combining experiments as easy as averaging points! Average 1.645 of many toy experiments A single experiment Not real data, just a random experiment Luca Lista Statistical Methods for Data Analysis

56 Bs Mixing: Hadronic vs semilept.
Bs mixing, hadronic vs semileptonic: 95% CL sensitivities of 19.3 ps⁻¹ and 30.7 ps⁻¹ are quoted for the two samples (D0: limit of 7.3 ps⁻¹ with sensitivity 9.5 ps⁻¹, with 610 pb⁻¹). The reach at large Δms is limited by the incomplete reconstruction (ct)! This looks a lot like a signal!

57 Bs Mixing: combined CDF result
Likelihood ratio ms> % CL Sensitivity: 31.3 ps-1 - Log-likelihood ratio preferred to log-likelihood: Will be discussed in next slides Minimum: Luca Lista Statistical Methods for Data Analysis

58 Statistical Methods for Data Analysis
Likelihood Ratio

59 Statistical Methods for Data Analysis
Likelihood Ratio. PRL 97 (2006).

60 Statistical Methods for Data Analysis
References
- F. James, CERN Program Library Long Writeup D506 (Minuit)
- Robust statistics (introduction with some references)
- Use of chi-square and likelihood for binned samples: S. Baker and R. Cousins, Clarification of the Use of Chi-Square and Likelihood Functions in Fits to Histograms, NIM 221:437 (1984)
- Unified approach: G. Feldman, R. D. Cousins, Phys. Rev. D 57, 3873 (1998); G. Feldman, Fermilab Colloquium: Journeys of an Accidental Statistician
- Asymmetric error treatment: R. Barlow, proceedings of PHYSTAT2003; R. Barlow, arXiv:physics/ v1; G. D'Agostini, Asymmetric Uncertainties: Sources, Treatment and Potential Dangers, arXiv:physics/
- LEP Electroweak Working Group; ZFITTER: Comput. Phys. Commun. 174 (2006), hep-ph/
- Fitting B(B → J/ψπ) / B(B → J/ψK): BaBar collaboration, Phys. Rev. Lett. 92:241802 (2004), hep-ex/; Phys. Rev. D 65:091101 (2002), hep-ex/; F. Fabozzi, L. Lista: BaBar Analysis Document (BAD) 93, 574
- Bs mixing by CDF: CDF collaboration, Phys. Rev. Lett. 97:062003 (2006), hep-ex/; presentation by Alessandro Cerri (2006)

