Probability and Statistics for Particle Physics Javier Magnin CBPF – Brazilian Center for Research in Physics Rio de Janeiro - Brazil

Outline. Course: three one-hour lectures. 1st lecture: general ideas / preliminary concepts; probability and statistics; distributions. 2nd lecture: error matrix; combining errors / results; parameter fitting and hypothesis testing. 3rd lecture: parameter fitting and hypothesis testing (cont.); examples of fitting procedures.

2nd lecture

Two-dimensional Gaussian distribution and error matrix. 1. Assume that x and y are two uncorrelated Gaussian variables; their joint distribution is then the product of two one-dimensional Gaussians.

Given that x and y are independent variables, it follows that the exponent of the joint distribution is a sum of two quadratic terms or, in matrix form, a quadratic form built with the inverse error matrix.
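For reference, a standard form consistent with the text (two uncorrelated Gaussian variables with means μ_x, μ_y and widths σ_x, σ_y) is

\[
P(x,y) = \frac{1}{2\pi\sigma_x\sigma_y}
\exp\!\left[-\frac{1}{2}\left(\frac{(x-\mu_x)^2}{\sigma_x^2}+\frac{(y-\mu_y)^2}{\sigma_y^2}\right)\right]
= \frac{1}{2\pi\sigma_x\sigma_y}
\exp\!\left[-\frac{1}{2}\,\mathbf{u}^{T}M^{-1}\mathbf{u}\right],
\quad
\mathbf{u}=\begin{pmatrix}x-\mu_x\\ y-\mu_y\end{pmatrix},
\quad
M^{-1}=\begin{pmatrix}1/\sigma_x^{2} & 0\\ 0 & 1/\sigma_y^{2}\end{pmatrix}.
\]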

Error matrix. The diagonal terms are the variances of x and y, respectively; the off-diagonal terms are the covariances, and zeroes there indicate no correlation between x and y. The error matrix is symmetric. The general definition, valid even for non-Gaussian distributions, is the covariance of the variables (see below).
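In index notation, the general (covariance) definition reads

\[
M_{ij} = \operatorname{cov}(x_i,x_j) = \left\langle (x_i-\mu_i)(x_j-\mu_j) \right\rangle ,
\qquad M_{ii} = \sigma_i^{2}.
\]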

2. Correlated variables. Start with the uncorrelated case and perform a clockwise rotation by an angle θ; once you rename the rotated variables back to x and y, you obtain the general form of a Gaussian in two variables, now with correlations.

ρ “measures” the correlation between the variables x and y: ρ = 0 means no correlation (independent variables), ρ = ±1 means full correlation (the ellipse degenerates into a straight line). The corresponding error matrix is given below.
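The standard correlated two-variable Gaussian and its error matrix, consistent with the description above, are

\[
P(x,y)=\frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^{2}}}
\exp\!\left\{-\frac{1}{2(1-\rho^{2})}
\left[\frac{(x-\mu_x)^{2}}{\sigma_x^{2}}
-\frac{2\rho(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}
+\frac{(y-\mu_y)^{2}}{\sigma_y^{2}}\right]\right\},
\qquad
M=\begin{pmatrix}\sigma_x^{2} & \rho\,\sigma_x\sigma_y\\ \rho\,\sigma_x\sigma_y & \sigma_y^{2}\end{pmatrix}.
\]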

Combining errors / results. Very often we are confronted with a situation where the result of an experiment is given in terms of two or more measured variables, and what we want to know is the error on the final result in terms of the errors on the measured variables. This is the well-known problem of “propagation of errors”. A second (related) problem is how to combine the results of two or more experiments that have made the same measurement.

Combining errors: the linear situation. Consider the following example, where the variable a is given as a linear combination of the measured variables b and c.

The error on the result a can be calculated using the definition of the variance of a; it involves the variances of b and c and their covariance. If b and c are independent variables, cov(b,c) = 0 and the covariance term drops out.
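For a generic linear combination a = αb + βc (the coefficients α and β are illustrative, since the exact relation used on the slide is not spelled out in the text), the standard result is

\[
\sigma_a^{2} = \alpha^{2}\sigma_b^{2} + \beta^{2}\sigma_c^{2} + 2\alpha\beta\,\operatorname{cov}(b,c)
\;\longrightarrow\;
\alpha^{2}\sigma_b^{2} + \beta^{2}\sigma_c^{2}
\quad\text{when } \operatorname{cov}(b,c)=0 .
\]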

General case. Let f_k(x_1, x_2, ..., x_n), k = 1, ..., m, be a set of m linear functions of the variables x = {x_1, x_2, ..., x_n}, and let M_x be the error matrix of the x's.

Then the error matrix of the f_k is obtained from M_x and the matrix of derivatives T_ki = ∂f_k/∂x_i as M_f = T M_x T^T, which, in the case of uncorrelated errors in the x's, reduces to σ²_f_k = Σ_i (∂f_k/∂x_i)² σ_i². The simplest case, f = Σ_i a_i x_i (f = a^T x), gives

\[
\sigma_f^{2} = \sum_i\sum_j a_i (M_x)_{ij} a_j = a^{T} M_x\, a
= \sum_i a_i^{2}\sigma_i^{2} + \sum_i\sum_{j\neq i} a_i a_j \rho_{ij}\sigma_i\sigma_j .
\]
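As a concrete illustration (not part of the original slides), a minimal NumPy sketch of this propagation, assuming a hypothetical linear map f = T x and an example error matrix M_x:

```python
import numpy as np

# Hypothetical linear map f = T x: two functions of three variables.
T = np.array([[1.0, 2.0,  0.0],
              [0.0, 1.0, -1.0]])

# Example error matrix of the x's: variances on the diagonal,
# covariances (rho_ij * sigma_i * sigma_j) off the diagonal.
sig = np.array([0.10, 0.20, 0.05])
rho = np.array([[ 1.0,  0.3,  0.0],
                [ 0.3,  1.0, -0.2],
                [ 0.0, -0.2,  1.0]])
M_x = rho * np.outer(sig, sig)

# Error matrix of the f's: M_f = T M_x T^T
M_f = T @ M_x @ T.T
print("M_f =\n", M_f)
print("errors on f:", np.sqrt(np.diag(M_f)))
```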

Non-linear situation. If the f_k are non-linear functions of the variables x, they can be linearized by means of a first-order Taylor expansion around the measured point x⁰, f_k(x) ≈ f_k(x⁰) + Σ_i (∂f_k/∂x_i)|_x⁰ (x_i − x_i⁰). Since f_k(x⁰) is a constant, it does not contribute to the error on f_k; therefore the propagation of errors follows the linear case, with the derivatives evaluated at the measured values. For a non-linear function f(a,b) of two variables, this reduces to the two-variable formula given above.
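A small numerical sketch of the linearized propagation (an illustration, not from the slides) for a hypothetical non-linear function f(a,b) = a/b, using finite-difference derivatives:

```python
import numpy as np

def f(a, b):
    # Hypothetical non-linear function, used only for illustration.
    return a / b

def propagate(a, b, sig_a, sig_b, cov_ab=0.0, eps=1e-6):
    """First-order (linearized) error propagation for f(a, b)."""
    dfda = (f(a + eps, b) - f(a - eps, b)) / (2.0 * eps)
    dfdb = (f(a, b + eps) - f(a, b - eps)) / (2.0 * eps)
    var_f = dfda**2 * sig_a**2 + dfdb**2 * sig_b**2 + 2.0 * dfda * dfdb * cov_ab
    return np.sqrt(var_f)

# Example: a = 10.0 +/- 0.3, b = 2.0 +/- 0.1, uncorrelated.
print(propagate(10.0, 2.0, 0.3, 0.1))
```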

Comments (about the non-linear case). Error estimates for non-linear functions are biased because of the use of a truncated Taylor expansion; the extent of this bias depends on the nature of the function. If f(x_1, ..., x_n) is a function of n independent variables, the error is given by the quadratic sum shown below. For a linear function of the variables {x_1, ..., x_n}, that formula (or the corresponding one for correlated variables) is exact.
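For n independent variables, the quadratic-sum formula is

\[
\sigma_f^{2} = \sum_{i=1}^{n}\left(\frac{\partial f}{\partial x_i}\right)^{\!2}\sigma_{x_i}^{2}.
\]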

Averaging. Assume that you perform n independent measurements q_i of a quantity q, each with the same accuracy σ. The average of the n measurements is the arithmetic mean; propagating the errors gives the variance of the average and hence the error on the average. (Remember the comment on the variance of the mean in the first lecture.)
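Explicitly, the standard results are

\[
\bar{q} = \frac{1}{n}\sum_{i=1}^{n} q_i ,
\qquad
\sigma_{\bar q}^{2} = \frac{\sigma^{2}}{n} ,
\qquad
\sigma_{\bar q} = \frac{\sigma}{\sqrt{n}} .
\]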

Combining results of different experiments. Assume that several experiments have measured the same physical quantity a and obtained the set of values {a_i} with errors {σ_i}. Then the best estimates of a and of its error σ are given by the weighted average below (no proof given here). Note that if σ_i = σ for all i = 1, ..., n, the result reduces to the averaging case of the previous slide.
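The combination referred to is the standard inverse-variance weighted average, â = Σ_i(a_i/σ_i²) / Σ_i(1/σ_i²) with σ_â² = 1/Σ_i(1/σ_i²); a minimal Python sketch:

```python
import numpy as np

def combine(values, errors):
    """Inverse-variance weighted average of independent measurements."""
    values = np.asarray(values, dtype=float)
    w = 1.0 / np.asarray(errors, dtype=float) ** 2   # weights 1/sigma_i^2
    a_hat = np.sum(w * values) / np.sum(w)
    sigma_hat = 1.0 / np.sqrt(np.sum(w))
    return a_hat, sigma_hat

# Example with made-up numbers: three measurements of the same quantity.
print(combine([1.02, 0.98, 1.05], [0.03, 0.05, 0.04]))
# With equal errors this reduces to the plain mean with error sigma/sqrt(n).
```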

Parameter fitting: use the data to determine the value of the free parameter(s). Example: suppose you want to measure the spin alignment of the vector meson φ(1020) produced in p + p interactions at some c.m. energy. The spin alignment is described by a 3 × 3 matrix, the spin-density matrix, and the only measurable coefficient is ρ_00.

**  (1020)  (1020) decays via strong interactions  00 can be measured by measuring the angular distribution of the decay products (which is known as a function of the parameter  00 ) Now the question is: which value of  00 provides the best description of data ? And how accurately  00 can be determined ?

Comments. Hypothesis testing logically precedes parameter fitting: if the hypothesis is incorrect, there is no point in determining its free parameters. In practice one often does parameter fitting first anyway, since it may be impossible to test a hypothesis before fixing its free parameters to their optimum values. In this lecture we will consider two methods: maximum likelihood and least squares.

Comments II. Normalization: in many cases it is desirable to normalize the theoretical distribution to the data; this reduces the number of free parameters by one. In some cases, however, normalization is undesirable because it introduces distorting effects. Example: when fitting a straight line to data, the normalization involves the sum Σ y_i; a last point with a very large error is essentially useless for the fit, yet it enters the sum with the same weight as all the others, so the normalization introduces distortions.

Interpretation of estimates. Assume that a free parameter has been determined as ŷ ± σ_ŷ. Assume also that our estimate ŷ is Gaussian distributed and that the (unknown) true value is y_0. The probability that a measurement gives an answer in a specific range of y is the area under the relevant part of the Gaussian; for ±σ_ŷ = ±σ this probability is ~68%. Having an estimate ŷ, it is usual to write ŷ − σ_ŷ ≤ y_0 ≤ ŷ + σ_ŷ, where [ŷ − σ_ŷ, ŷ + σ_ŷ] is the confidence range for y_0.

Maximum likelihood method. A powerful method to find the values of unknown parameters. Example: consider an angular distribution in cos θ that depends on two parameters a and b.

Normalize the distribution (if not, the method does not work!); it then behaves as a probability density.

For each event i we calculate y_i = y(cos θ_i), which is the probability density of observing that event as a function of b/a. We now define the likelihood L as the product of the y_i. Then, for a specific value of (b/a), L is the joint probability density for obtaining the particular set of cos θ_i observed in the experiment.

Strictly, L is the probability density for obtaining the particular set of observations in the order in which we observed them. Since the ordering is irrelevant, a factor of 1/n! could be included but, as we are only interested in how L varies as a function of (b/a), that factor is irrelevant.

Finally, maximize L with respect to (b/a). Note the importance of the normalization: without the normalization factor N, L could be made as large as you want simply by increasing the value of (b/a), and L would have no absolute maximum.
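A minimal unbinned maximum-likelihood sketch (an illustration, not from the slides), assuming for concreteness that the angular distribution has the form y(cos θ) ∝ 1 + r cos²θ with r = b/a, so that the normalized density is (1 + r cos²θ) / (2(1 + r/3)):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)

def pdf(c, r):
    # Assumed normalized angular distribution in c = cos(theta).
    return (1.0 + r * c**2) / (2.0 * (1.0 + r / 3.0))

# Toy events with true r = b/a = 0.5, generated by accept-reject.
r_true, events = 0.5, []
while len(events) < 2000:
    c = rng.uniform(-1.0, 1.0)
    if rng.uniform(0.0, 1.0) < (1.0 + r_true * c**2) / (1.0 + r_true):
        events.append(c)
events = np.array(events)

def nll(r):
    # Negative log-likelihood: -sum_i ln y_i (the 1/n! factor is irrelevant).
    return -np.sum(np.log(pdf(events, r)))

fit = minimize_scalar(nll, bounds=(0.0, 5.0), method="bounded")
print("fitted b/a =", fit.x)
```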

The logarithm of the likelihood function. It is often more convenient to use the logarithm of the likelihood, l = ln L. For a large number of experimental observations n, L tends to a Gaussian distribution, at least in the vicinity of its maximum at p_0; there l is parabolic, l(p) ≈ l(p_0) − (p − p_0)²/(2c), so that l'' = −1/c.

When L is Gaussian, the following quantities are identical and any of them can be used as the definition of the error σ_p on a parameter p: the root mean square deviation of L about its mean; (−∂²l/∂p²)^(−1/2); the points where l(p_0 ± σ_p) = l(p_0) − 1/2. Clearly, Gaussian variables are better behaved than non-Gaussian ones, so make an adequate choice of variables; e.g. in decay processes it is better to measure the decay rate 1/τ than the lifetime τ.

Comments. The maximum likelihood method uses the events one at a time, so there is no need to construct histograms and no problems associated with binning. Functions of implicit variables are handled very easily. Data are used in the form of complete events rather than projections onto various axes, which makes the method a powerful tool to determine unknown parameters. In some situations the maximum likelihood and least squares methods are equivalent. Bounded parameters are easy to handle. One serious drawback is the large amount of computation that is very often required. The extension to several parameters is trivial.

Least squares method. Assume you have an experimental distribution (say a histogram) in which the number of events in bin i is y_i^obs ± σ_i, as a function of a given variable x_i. Assume you want to describe the experimental data by a functional form y^th(x; α_j); we then construct the weighted sum of squared residuals S (see below). If the theory is in good agreement with the data, y_i^obs and y_i^th do not differ by much and S will be small.
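The standard definition of S consistent with the text is

\[
S(\alpha_j) = \sum_i \frac{\left[\, y_i^{\mathrm{obs}} - y^{\mathrm{th}}(x_i;\alpha_j)\,\right]^{2}}{\sigma_i^{2}} .
\]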

Figure: sketch of a straight-line fit y^th(x) = α_1 + α_2 x to binned data points y_i^obs ± σ_i versus x_i. The bin size has to be chosen such that the number of events per bin is large enough to ensure that the error on the number of events in each bin is approximately Gaussian (remember that Poisson → Gaussian for n → ∞).

Comments. Start by choosing a suitable bin size; hopefully the results will be approximately independent of this choice. Bins may also have different sizes. It is desirable to avoid bins with too few events: the number of events should be large enough to ensure Gaussian errors and, since we use the experimental errors σ_i, we must also avoid the situations that arise because few events usually means large errors. The method is easy to generalize to several variables. If y^th(x; α_j) is linear in the parameters, the minimum of S can be found analytically. S_min is a measure of how well the theoretical hypothesis describes the data.
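A minimal sketch (an illustration, not from the slides) of the analytic weighted least-squares solution for a straight line y^th(x) = α_1 + α_2 x with uncorrelated errors, via the normal equations:

```python
import numpy as np

# Toy binned data (made-up numbers): bin centres, observed contents, errors.
x     = np.array([0.5, 1.5, 2.5, 3.5, 4.5])
y     = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
sigma = np.array([0.3, 0.3, 0.4, 0.4, 0.5])

# Design matrix for y_th(x) = alpha_1 + alpha_2 * x and weights 1/sigma^2.
A = np.column_stack([np.ones_like(x), x])
W = np.diag(1.0 / sigma**2)

# Normal equations: (A^T W A) alpha = A^T W y;
# the covariance of the fitted parameters is (A^T W A)^(-1).
cov   = np.linalg.inv(A.T @ W @ A)
alpha = cov @ (A.T @ W @ y)

residuals = y - A @ alpha
S_min = residuals @ W @ residuals   # minimum of S
print("alpha =", alpha)
print("errors =", np.sqrt(np.diag(cov)), " S_min =", S_min)
```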

Least squares with correlated errors. We now consider the modifications needed to deal with the case in which the errors on the y_i^obs are correlated with one another. Start with the two-variable uncorrelated case and perform a rotation by an angle θ.

S is then written in terms of the rotated variables, with the errors transformed accordingly, under the condition that the errors on the rotated variables z′ and y′ are independent.

Now write S in matrix form, in terms of the inverse of the error matrix of the observations (see below).
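With E the error matrix of the observations, the standard matrix form is

\[
S = \sum_{i,j}\left(y_i^{\mathrm{obs}}-y_i^{\mathrm{th}}\right)(E^{-1})_{ij}\left(y_j^{\mathrm{obs}}-y_j^{\mathrm{th}}\right)
= \left(\mathbf{y}^{\mathrm{obs}}-\mathbf{y}^{\mathrm{th}}\right)^{T} E^{-1}\left(\mathbf{y}^{\mathrm{obs}}-\mathbf{y}^{\mathrm{th}}\right),
\]

which reduces to the uncorrelated form S = Σ_i (y_i^obs − y_i^th)²/σ_i² when E is diagonal.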

Comparison of the two methods:
- How easy? Maximum likelihood: normalization and minimization can be difficult. Least squares: usually easy.
- Efficiency: Maximum likelihood: usually the most efficient. Least squares: sometimes equivalent to ML.
- Input data: Maximum likelihood: individual events. Least squares: histograms.
- Estimate of goodness of fit: Maximum likelihood: very difficult. Least squares: easy.

Comparison (continued):
- Constraints among parameters: Maximum likelihood: easy. Least squares: can be imposed.
- N-dimensional problems: Maximum likelihood: normalization and minimization can be difficult. Least squares: problems associated with the choice of the distribution.
- Weighted events: Maximum likelihood: can be used. Least squares: easy.
- Background subtraction: Maximum likelihood: can be problematic. Least squares: easy.
- Error estimate: Maximum likelihood: (−∂²l/∂p_i∂p_j)^(−1/2). Least squares: (½ ∂²S/∂p_i∂p_j)^(−1/2).