Probability and Statistics Basic concepts II (from a physicist point of view) Benoit CLEMENT – Université J. Fourier / LPSC

Statistics
Parametric estimation: give one value to each parameter.
Interval estimation: derive an interval that probably contains the true value.
Non-parametric estimation: estimate the full pdf of the population.
[Diagram: PHYSICS (parameters θ) defines the POPULATION f(x;θ); an EXPERIMENT draws a finite SAMPLE of observables x_i; INFERENCE goes from the sample back to the parameters.]

Parametric estimation
From a finite sample {x_i}, estimate a parameter θ.
Statistic = a function S = f({x_i}). Any statistic can be considered as an estimator of θ. To be a good estimator it needs to satisfy:
- Consistency: the limit of the estimator for an infinite sample is the true value.
- Bias: difference between the mean value of the estimator and the true value.
- Efficiency: speed of convergence.
- Robustness: sensitivity to statistical fluctuations.
A good estimator should at least be consistent and asymptotically unbiased.
Efficient / unbiased / robust often contradict each other, so different applications lead to different choices.

Bias and consistency
As the sample is a set of realizations of random variables (or one vector variable), so is the estimator: it has a mean, a variance, ... and a probability density function.
Bias: mean deviation of the estimator from the true value, b(θ̂) = E[θ̂] - θ.
- Unbiased estimator: b = 0 for any sample size.
- Asymptotically unbiased: b → 0 when the sample size n → ∞.
Consistency: formally, the estimator converges in probability to the true value, lim_{n→∞} P(|θ̂ - θ| > ε) = 0 for any ε > 0; in practice, an estimator is consistent if it is asymptotically unbiased and its variance goes to 0 as n → ∞.
[Figure: sampling distributions of a biased, an asymptotically unbiased and an unbiased estimator.]

Empirical estimator
Sample mean x̄ = (1/n) Σ x_i is a good estimator of the population mean: by the weak law of large numbers it is convergent and unbiased.
Sample variance (1/n) Σ (x_i - x̄)² as an estimator of the population variance is biased (its mean is (n-1)/n · σ²), but asymptotically unbiased.
Unbiased variance estimator: s² = 1/(n-1) Σ (x_i - x̄)².
Variance of the mean estimator (convergence): V[x̄] = σ²/n.
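A minimal numerical check of these properties (hypothetical data, assuming numpy is available): the 1/n variance estimator is biased low by a factor (n-1)/n, while the 1/(n-1) version is unbiased.

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_toys = 5, 100_000          # small sample size to make the bias visible
sigma2_true = 4.0               # true population variance

biased, unbiased = [], []
for _ in range(n_toys):
    x = rng.normal(loc=10.0, scale=np.sqrt(sigma2_true), size=n)
    xbar = x.mean()
    biased.append(np.sum((x - xbar)**2) / n)          # 1/n estimator
    unbiased.append(np.sum((x - xbar)**2) / (n - 1))  # 1/(n-1) estimator

print("E[1/n estimator]     ~", np.mean(biased))    # ~ (n-1)/n * 4 = 3.2
print("E[1/(n-1) estimator] ~", np.mean(unbiased))  # ~ 4.0
```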

Errors on these estimators
Uncertainty on an estimator = its standard deviation. In practice, use an estimator of that standard deviation (which is itself biased!).
Mean: σ(x̄) = σ/√n, estimated by s/√n.
Variance: for a Gaussian population, V[s²] = 2σ⁴/(n-1), so the error on s² is estimated by s²·√(2/(n-1)).
Central-limit theorem: the empirical estimators of mean and variance are normally distributed for large enough samples, so ±1 estimated standard deviation defines a 68% confidence interval.

Likelihood function
Start from a generic function k(x, θ), with x the random variable(s) and θ the parameter(s).
Probability density function: fix θ = θ₀ (true value): f(x; θ₀) = k(x, θ₀), normalized so that ∫ f(x; θ) dx = 1. For a Bayesian, f(x|θ) = f(x; θ).
Likelihood function: fix x = u (one realization of the random variable): L(θ) = k(u, θ). There is no normalization in θ: ∫ L(θ) dθ = ??? For a Bayesian (with a flat prior), f(θ|x) = L(θ) / ∫ L(θ) dθ.
For a sample of n independent realizations of the same variable X: L(θ) = Π_i f(x_i; θ).
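As an illustration (a minimal sketch, not from the original slides; the counts are hypothetical), the likelihood of a small Poisson sample can be scanned over a grid of the parameter λ:

```python
import numpy as np
from scipy.stats import poisson

data = np.array([3, 7, 4, 6, 5])          # hypothetical counts, one observed sample
lam_grid = np.linspace(1.0, 10.0, 181)    # trial values of the parameter lambda

# L(lambda) = prod_i Pois(x_i; lambda); work with log L to avoid underflow
logL = np.array([poisson.logpmf(data, lam).sum() for lam in lam_grid])
print("lambda maximizing L:", lam_grid[np.argmax(logL)])   # close to the sample mean, 5.0
```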

Estimator variance
Start from the generic function k and differentiate the pdf normalization condition ∫ k(x, θ) dx = 1 twice with respect to θ: this gives E[∂lnL/∂θ] = 0 and E[∂²lnL/∂θ²] = -E[(∂lnL/∂θ)²].
Now differentiate the estimator bias b(θ) = E[θ̂] - θ with respect to θ.
Finally, using the Cauchy-Schwarz inequality, one obtains the Cramer-Rao bound:
V[θ̂] ≥ (1 + ∂b/∂θ)² / E[(∂lnL/∂θ)²]

Efficiency
For any unbiased estimator of θ, the variance cannot go below the Cramer-Rao bound: V[θ̂] ≥ 1 / E[(∂lnL/∂θ)²].
The efficiency of a convergent estimator is given by how close its variance is to this bound. An efficient estimator reaches the Cramer-Rao bound (at least asymptotically): it is the minimal variance estimator (MVE).
The MVE will often be biased, but asymptotically unbiased.
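A quick numerical illustration (a sketch, assuming numpy): for a Poisson sample the Cramer-Rao bound on an unbiased estimator of λ is λ/n, and the sample mean saturates it.

```python
import numpy as np

rng = np.random.default_rng(1)
lam_true, n, n_toys = 4.0, 20, 200_000

# variance of the sample mean over many pseudo-experiments vs the CR bound lambda/n
means = rng.poisson(lam_true, size=(n_toys, n)).mean(axis=1)
print("variance of the sample mean:", means.var())
print("Cramer-Rao bound lambda/n  :", lam_true / n)   # 0.2
```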

Maximum likelihood
For a sample of measurements {x_i}, the analytical form of the density is known, but it depends on several unknown parameters θ. E.g. event counting: the counts follow a Poisson distribution, with a parameter that depends on the physics, λ_i(θ).
The estimators of the parameters θ are the values that maximize the probability of observing the observed result, i.e. the maximum of the likelihood function: ∂L/∂θ = 0.
Remark: this becomes a system of equations for several parameters.
Remark: one often minimizes -ln L instead, which simplifies the expressions (products become sums).
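A minimal sketch of a numerical maximum-likelihood fit (hypothetical exponential data; scipy's scalar minimizer stands in for whatever tool one actually uses):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=500)   # f(x; tau) = exp(-x/tau)/tau

def nll(tau):
    # -ln L(tau) = n*ln(tau) + sum(x_i)/tau
    return len(data) * np.log(tau) + data.sum() / tau

res = minimize_scalar(nll, bounds=(0.1, 10.0), method="bounded")
print("tau_hat =", res.x, " (analytic MLE is the sample mean:", data.mean(), ")")
```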

Properties of MLE
Mostly asymptotic properties: valid for large samples, and often assumed in any case for lack of better information.
- Asymptotically unbiased.
- Asymptotically efficient (reaches the Cramer-Rao bound).
- Asymptotically normally distributed: multinormal law, with covariance given by the generalization of the Cramer-Rao bound, (V⁻¹)_ij = -E[∂²lnL/∂θ_i ∂θ_j].
Goodness of fit: the minimum of the χ² (i.e. -2 ln L_max for Gaussian errors) is χ²-distributed, with ndf = sample size - number of parameters. The p-value is the probability of getting a worse agreement.

Least squares
Set of measurements (x_i, y_i) with uncertainties Δy_i on the y_i, and a theoretical law y = f(x, θ).
Naive approach: use regression, Σ_i (y_i - f(x_i, θ))², and reweight each term by the error.
Maximum likelihood: assume each y_i is normally distributed with mean f(x_i, θ) and standard deviation Δy_i. Then the likelihood gives
χ² = -2 ln L + const = Σ_i (y_i - f(x_i, θ))² / Δy_i².
The least-squares or χ² fit is the MLE for Gaussian errors.
Generic case with correlations: χ² = (y - f(θ))ᵀ C⁻¹ (y - f(θ)), with C the covariance matrix of the measurements.

Errors on MLE
Errors on the parameters come from the covariance matrix, V̂ = (-∂²lnL/∂θ_i∂θ_j)⁻¹ evaluated at the maximum (only one realization of the estimator is available, so this is an empirical mean of 1 value...).
For one parameter, the 68% interval is [θ̂ - σ_θ̂, θ̂ + σ_θ̂].
More generally, confidence contours are defined by the equation 2 ln L(θ) = 2 ln L_max - β (i.e. Δχ² = β).
Values of β for different numbers of parameters n_θ and confidence levels α:
  α        n_θ=1   n_θ=2   n_θ=3
  68.3%    1.00    2.30    3.53
  95.4%    4.00    6.18    8.03
  99.7%    9.00    11.8    14.2

Example: fitting a line
For f(x) = a·x, the χ² is quadratic in a and the minimum is analytic: â = Σ_i (x_i y_i / Δy_i²) / Σ_i (x_i² / Δy_i²), with variance V[â] = 1 / Σ_i (x_i² / Δy_i²).
[Figure: data points with the one-parameter fitted line.]

Example: fitting a line
For f(x) = a·x + b, minimizing the χ² gives a 2×2 linear system. In matrix form, with design matrix A (columns x_i and 1) and covariance C of the measurements, the solution is θ̂ = (Aᵀ C⁻¹ A)⁻¹ Aᵀ C⁻¹ y, with covariance matrix (Aᵀ C⁻¹ A)⁻¹.
[Figure: data points with the two-parameter fitted line.]

Example: fitting a line
Two-dimensional error contours on a and b are curves of constant Δχ²:
68.3% : Δχ² = 2.30
95.4% : Δχ² = 6.18
99.7% : Δχ² = 11.8
[Figure: confidence ellipses in the (a, b) plane.]

Example: fitting a line
One-dimensional errors from the contours on a and b: the 1-d errors (68.3%) are given by the extreme edges of the Δχ² = 1 ellipse; their projections depend on the covariance between a and b.
[Figure: Δχ² = 1 ellipse with the projected 1-d intervals.]
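A sketch of such a straight-line χ² fit (hypothetical data; plain numpy linear algebra rather than a dedicated fitting package):

```python
import numpy as np

# hypothetical measurements y_i = a*x_i + b with Gaussian errors dy_i
x  = np.array([1., 2., 3., 4., 5., 6.])
y  = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
dy = np.array([0.3, 0.3, 0.4, 0.4, 0.5, 0.5])

# design matrix for f(x) = a*x + b, weighted by the errors
A = np.vstack([x, np.ones_like(x)]).T
W = np.diag(1.0 / dy**2)

cov = np.linalg.inv(A.T @ W @ A)        # covariance matrix of (a, b)
a_hat, b_hat = cov @ A.T @ W @ y        # chi2 minimum
print("a =", a_hat, "+/-", np.sqrt(cov[0, 0]))
print("b =", b_hat, "+/-", np.sqrt(cov[1, 1]))
# the 68.3% contour in the (a, b) plane is the Delta chi2 = 2.30 ellipse of cov
```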

Frequentist vs Bayes
Frequentist: the estimator of the parameters θ maximizes the probability to observe the data, i.e. the maximum of the likelihood function.
Bayesian: Bayes' theorem links the probability of θ knowing the data (the posterior) to the probability of the data knowing θ (the likelihood): f(θ|x) = L(θ) π(θ) / ∫ L(θ') π(θ') dθ', with π(θ) the prior.

Confidence interval
For a random variable, a confidence interval with confidence level α is any interval [a, b] such that P(a ≤ X ≤ b) = ∫_a^b f(x) dx = α (the probability of finding a realization inside the interval).
It generalizes the concept of uncertainty: an interval that contains the true value with a given probability. Frequentists and Bayesians attach slightly different concepts to it.
For Bayesians: the posterior density is the probability density of the true value, and can be used directly to derive an interval: ∫_a^b f(θ|x) dθ = α.
No such thing for a frequentist: the interval itself becomes the random variable; [a, b] is a realization of [A, B], with P(A ≤ θ ≤ B) = α independently of θ.

Confidence interval
Several standard choices of interval with the same confidence level:
- Mean-centered, symmetric interval: [μ - a, μ + a].
- Probability-symmetric (central) interval [a, b]: equal probability in each tail.
- Highest probability density (HPD) interval [a, b]: every point inside has a higher density than any point outside.
[Figure: the three interval choices on the same pdf.]

Confidence belt
To build a frequentist interval for an estimator of θ (a Monte Carlo sketch of the construction follows below):
1. Make pseudo-experiments for several values of θ and compute the estimator for each (MC sampling of the estimator pdf).
2. For each θ, determine θ̂_lo(θ) and θ̂_hi(θ) such that P(θ̂_lo(θ) ≤ θ̂ ≤ θ̂_hi(θ)) = α. These two curves form the confidence belt for a confidence level α.
3. Invert these functions: for the observed θ̂, the interval [θ̂_hi⁻¹(θ̂), θ̂_lo⁻¹(θ̂)] satisfies P(a ≤ θ ≤ b) = α.
[Figure: confidence belt for a Poisson parameter λ estimated with the empirical mean of 3 realizations (68% CL).]
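A minimal Monte Carlo sketch of this construction for the Poisson example (central 68% band of the mean of 3 counts, scanned over λ; purely illustrative, numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = 0.68
lam_grid = np.linspace(0.5, 15.0, 59)

belt = []
for lam in lam_grid:
    # pdf of the estimator (mean of 3 Poisson counts) sampled with pseudo-experiments
    means = rng.poisson(lam, size=(20_000, 3)).mean(axis=1)
    lo, hi = np.quantile(means, [(1 - alpha) / 2, 1 - (1 - alpha) / 2])
    belt.append((lam, lo, hi))

# invert the belt: keep every lambda whose band contains the observed mean
mean_obs = 4.0
inside = [lam for lam, lo, hi in belt if lo <= mean_obs <= hi]
print("68% CL interval for lambda: [%.2f, %.2f]" % (min(inside), max(inside)))
```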

Dealing with systematics
The variance of the estimator only measures the statistical uncertainty.
Often, we will have to deal with some parameters whose values are known with limited precision: systematic uncertainties.
The likelihood function then depends on both sets of parameters: L(θ, ν) = Π_i f(x_i; θ, ν). The imperfectly known parameters ν are nuisance parameters.

Bayesian inference
In Bayesian statistics, nuisance parameters are dealt with by assigning them a prior π(ν). Usually a multinormal law is used, with mean ν₀ and covariance matrix estimated from Δν₀ (plus correlations, if needed).
The final posterior for θ is obtained by marginalization over the nuisance parameters: f(θ|x) ∝ ∫ L(θ, ν) π(θ) π(ν) dν.

Profile likelihood
There is no true frequentist way to add systematic effects. The popular method of the day is profiling: deal with nuisance parameters as realizations of random variables and extend the likelihood, L(θ, ν) → L(θ, ν)·G(ν), where G(ν) is the likelihood of the new parameters (identical in form to the Bayesian prior).
For each value of θ, maximize the likelihood with respect to the nuisance parameters: this is the profile likelihood PL(θ) = max_ν L(θ, ν) G(ν).
PL(θ) has the same asymptotic statistical properties as the regular likelihood. A minimal numerical sketch follows below.
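A sketch of profiling for a single-bin counting experiment (hypothetical numbers: signal strength μ, background b with a Gaussian constraint b₀ ± σ_b; scipy assumed):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson, norm

n_obs, s_nom = 16, 5.0          # observed events, nominal signal yield
b0, sigma_b = 10.0, 1.5         # background estimate and its systematic uncertainty

def nll(mu, b):
    # extended likelihood: Poisson term times Gaussian constraint G(b)
    return -(poisson.logpmf(n_obs, mu * s_nom + b) + norm.logpdf(b, b0, sigma_b))

def profile_nll(mu):
    # maximize the likelihood over the nuisance parameter b for fixed mu
    res = minimize_scalar(lambda b: nll(mu, b), bounds=(0.1, 30.0), method="bounded")
    return res.fun

mu_grid = np.linspace(0.0, 3.0, 61)
pnll = np.array([profile_nll(mu) for mu in mu_grid])
print("profiled best-fit mu:", mu_grid[np.argmin(pnll)])
```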

Non-parametric estimation
Directly estimating the probability density function, useful for: likelihood-ratio discriminants, the separating power of variables, data/MC agreement, ...
Frequency table: for a sample {x_i}, i = 1..n,
1. Define successive intervals (bins) C_k = [a_k, a_k+1[.
2. Count the number of events n_k in C_k.
Histogram: graphical representation of the frequency table.

Histogram
N/Z for stable heavy nuclei:
1.321, 1.357, 1.392, 1.410, 1.428, 1.446, 1.464, 1.421, 1.438, 1.344, 1.379, 1.413, 1.448, 1.389, 1.366, 1.383, 1.400, 1.416, 1.433, 1.466, 1.500, 1.322, 1.370, 1.387, 1.403, 1.419, 1.451, 1.483, 1.396, 1.428, 1.375, 1.406, 1.421, 1.437, 1.453, 1.468, 1.500, 1.446, 1.363, 1.393, 1.424, 1.439, 1.454, 1.469, 1.484, 1.462, 1.382, 1.411, 1.441, 1.455, 1.470, 1.500, 1.449, 1.400, 1.428, 1.442, 1.457, 1.471, 1.485, 1.514, 1.464, 1.478, 1.416, 1.444, 1.458, 1.472, 1.486, 1.500, 1.465, 1.479, 1.432, 1.459, 1.472, 1.486, 1.513, 1.466, 1.493, 1.421, 1.447, 1.460, 1.473, 1.486, 1.500, 1.526, 1.480, 1.506, 1.435, 1.461, 1.487, 1.500, 1.512, 1.538, 1.493, 1.450, 1.475, 1.500, 1.512, 1.525, 1.550, 1.506, 1.530, 1.487, 1.512, 1.524, 1.536, 1.518, 1.577, 1.554, 1.586, 1.586
[Figure: histogram of these values.]

Histogram
Statistical description: the n_k are multinomial random variables with parameters p_k = ∫_{C_k} f(x) dx, so E[n_k] = n·p_k and V[n_k] = n·p_k(1 - p_k).
For a large sample: n_k/n → p_k. For small classes (width δ): p_k ≈ f(x_k)·δ.
So finally f(x_k) ≈ n_k / (n·δ): the histogram is an estimator of the probability density.
Each bin can be described by a Poisson density; the 1σ error on n_k is then √n_k.
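A short sketch of using a histogram as a density estimate with √n_k errors (hypothetical Gaussian sample; numpy assumed):

```python
import numpy as np

data = np.random.default_rng(7).normal(size=1000)    # hypothetical sample
counts, edges = np.histogram(data, bins=20)
width = edges[1] - edges[0]

density = counts / (len(data) * width)                # f(x_k) ~ n_k / (n*delta)
density_err = np.sqrt(counts) / (len(data) * width)   # 1-sigma Poisson error per bin
print(density[:5], density_err[:5])
```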

Kernel density estimators
The histogram is a step function, so one sometimes needs a smoother estimator. One possible solution: the kernel density estimator.
Attribute to each point of the sample a "kernel" function k(u): a triangle kernel, a parabolic (Epanechnikov) kernel, a Gaussian kernel, ...
w = kernel width, playing the same role as the bin width of the histogram.
The pdf estimator is f̂(x) = (1/(n·w)) Σ_i k((x - x_i)/w).
Remark: for a multidimensional pdf, use a product of one-dimensional kernels.
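A minimal Gaussian-kernel sketch of this estimator (a direct implementation; scipy.stats.gaussian_kde would do the same job):

```python
import numpy as np

def kde(x_eval, sample, w):
    # f_hat(x) = 1/(n*w) * sum_i k((x - x_i)/w), with a Gaussian kernel k
    u = (x_eval[:, None] - sample[None, :]) / w
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return k.sum(axis=1) / (len(sample) * w)

sample = np.random.default_rng(2).normal(size=500)    # hypothetical sample
grid = np.linspace(-4, 4, 201)
f_hat = kde(grid, sample, w=0.3)
print(f_hat.max())   # should be close to 1/sqrt(2*pi) ~ 0.40 for a standard normal
```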

Kernel density estimators
If the estimated density is normal, the optimal width is given by Silverman's rule, w = (4/(d+2))^{1/(d+4)} · σ · n^{-1/(d+4)}, with n the sample size and d the dimension (w ≈ 1.06 σ n^{-1/5} in one dimension).
As for the histogram binning, there is no generic result: try and see.

Statistical tests
Statistical tests aim at:
- Checking the compatibility of a dataset {x_i} with a given distribution.
- Checking the compatibility of two datasets {x_i}, {y_i}: do they come from the same distribution?
- Comparing different hypotheses: e.g. background vs signal+background.
In every case: build a statistic that quantifies the agreement with the hypothesis, then convert it into a probability of compatibility/incompatibility: the p-value.

Pearson test
Test for binned data: use the Poisson limit of the histogram.
Sort the sample into k bins C_i with contents n_i, and compute the probability of each class: p_i = ∫_{C_i} f(x) dx.
The test statistic compares, for each bin, the deviation of the observation (data) from the expected mean (Poisson mean) to the theoretical standard deviation (Poisson variance): χ² = Σ_i (n_i - n·p_i)² / (n·p_i).
Then χ² follows (asymptotically) a chi-square law with k-1 degrees of freedom (1 constraint: Σ n_i = n).
p-value: probability of doing worse, p = ∫_{χ²_obs}^∞ f_{χ²}(t; k-1) dt.
For a "good" agreement χ²/(k-1) ~ 1; more precisely χ²/(k-1) should lie within ±√(2/(k-1)) of 1 (1σ interval, ~68% CL).
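A sketch of the Pearson test against a hypothetical uniform model on [0, 1] (scipy used only for the chi-square survival function):

```python
import numpy as np
from scipy.stats import chi2

data = np.random.default_rng(5).uniform(size=300)    # hypothetical sample
k = 10
counts, edges = np.histogram(data, bins=k, range=(0.0, 1.0))
expected = len(data) * np.full(k, 1.0 / k)            # n * p_i for a uniform pdf

chi2_obs = np.sum((counts - expected)**2 / expected)
p_value = chi2.sf(chi2_obs, df=k - 1)                 # probability of doing worse
print("chi2/ndf =", chi2_obs / (k - 1), " p-value =", p_value)
```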

Kolmogorov-Smirnov test
Test for unbinned data: compare the sample cumulative distribution function to the tested one.
Sample cdf (from the ordered sample): F_n(x) = (number of x_i ≤ x) / n.
The Kolmogorov statistic is the largest deviation: D_n = sup_x |F_n(x) - F(x)|.
The distribution of D_n has been computed by Kolmogorov; [0, β] defines a confidence interval for D_n:
β = 0.9584/√n for 68.3% CL
β = 1.3754/√n for 95.4% CL
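A sketch of the test with scipy (hypothetical data tested against an exponential cdf, in the spirit of the example that follows):

```python
import numpy as np
from scipy.stats import kstest, expon

data = np.random.default_rng(9).exponential(scale=2.0, size=120)  # hypothetical sample
result = kstest(data, expon(scale=2.0).cdf)   # returns D_n and its p-value
print("D_n =", result.statistic, " p-value =", result.pvalue)
```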

Example
Test compatibility with an exponential law. Sample:
..., 0.036, 0.112, 0.115, 0.133, 0.178, 0.189, 0.238, 0.274, 0.323, 0.364, 0.386, 0.406, 0.409, 0.418, 0.421, 0.423, 0.455, 0.459, 0.496, 0.519, 0.522, 0.534, 0.582, 0.606, 0.624, 0.649, 0.687, 0.689, 0.764, 0.768, 0.774, 0.825, 0.843, 0.921, 0.987, 0.992, 1.003, 1.004, 1.015, 1.034, 1.064, 1.112, 1.159, 1.163, 1.208, 1.253, 1.287, 1.317, 1.320, 1.333, 1.412, 1.421, 1.438, 1.574, 1.719, 1.769, 1.830, 1.853, 1.930, 2.041, 2.053, 2.119, 2.146, 2.167, 2.237, 2.243, 2.249, 2.318, 2.325, 2.349, 2.372, 2.465, 2.497, 2.553, 2.562, 2.616, 2.739, 2.851, 3.029, 3.327, 3.335, 3.390, 3.447, 3.473, 3.568, 3.627, 3.718, 3.720, 3.814, 3.854, 3.929, 4.038, 4.065, 4.089, 4.177, 4.357, 4.403, 4.514, 4.771, 4.809, 4.827, 5.086, 5.191, 5.928, 5.952, 5.968, 6.222, 6.556, 6.670, 7.673, 8.071, 8.165, 8.181, 8.383, 8.557, 8.606, 9.032, ...
D_n = ..., p-value = ..., 1σ interval for D_n: [0, ...]

Hypothesis testing
Two exclusive hypotheses H0 and H1: which one is the most compatible with the data, and how incompatible is the other one? Compare P(data|H0) vs P(data|H1).
Build a statistic and define a critical interval w:
- if the observation falls into w: accept H1 (reject H0);
- else: accept H0.
Size of the test, α = P(x ∈ w | H0): how often you wrongly reject H0 (type I error).
Power of the test, 1-β = P(x ∈ w | H1): how often you correctly reject H0 when H1 is true (β is the type II error).
Neyman-Pearson lemma: at fixed size, the optimal statistic for testing the hypotheses is the likelihood ratio.

CLb and CLs
Two hypotheses, for a counting experiment:
- background only: expect 10 events;
- signal + background: expect 15 events.
You observe 16 events.
CLb = confidence in the background hypothesis, P(n ≤ n_obs | b) (related to the power of the test). Discovery: 1 - CLb < 5.7x10^-7 (the 5σ criterion).
CLs+b = confidence in the signal+background hypothesis, P(n ≤ n_obs | s+b) (the size of the test). Rejection: CLs+b < 5x10^-2.
Test for signal (non-standard): CLs = CLs+b / CLb.
For this example: CLb = ..., CLs+b = 0.663 (see the sketch below).
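A sketch of these numbers for the counting example above (Poisson probabilities via scipy; conventions vary, here CL = P(n ≤ n_obs | hypothesis)):

```python
from scipy.stats import poisson

n_obs, b, s = 16, 10.0, 5.0

cl_b  = poisson.cdf(n_obs, b)        # confidence in the background-only hypothesis
cl_sb = poisson.cdf(n_obs, b + s)    # confidence in the signal+background hypothesis
cl_s  = cl_sb / cl_b                 # CLs, the (non-standard) signal test

print("CLb   =", cl_b)     # ~0.97
print("CLs+b =", cl_sb)    # ~0.66, the value quoted on the slide
print("CLs   =", cl_s)
```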