Summarizing Data by Statistics


1 Summarizing Data by Statistics
Recall that the estimates of the mean and variance of a normal density were very simple: $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2$. If we believe that these numbers contain all the relevant information found in the data, we could (for the purpose of learning the mean and variance) throw away all the sampled data. In particular, we could keep only these numbers and update them online, without retaining previous data, when new data points arrive. But is it really the case that they are sufficient? Do these estimators use and summarize all the information contained in the data regarding the respective parameters?
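A minimal sketch of such an online update (this is Welford's algorithm, a standard method not named in the slides; the class and variable names are mine):

```python
class RunningMeanVar:
    """Online estimates of mean and variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        # ML estimate (divide by n); use n - 1 for the unbiased version
        return self.m2 / self.n if self.n > 0 else float("nan")
```

Feeding the points one by one reproduces the batch estimates exactly, which is what makes discarding the raw data safe for these two parameters.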

2 Statistics
Def: A statistic is any function of the sampled data. Examples of statistics:
Sample mean - $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
Sample variance - $s^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$ (sample standard deviation $s$).
Median - halfway point of the sorted sample.
Mode - peak value (local maximum; we can have several modes).
Range - difference between the largest and smallest readings.
Midrange - average of the largest and smallest readings.
Mean deviation - average of the distances from the mean, or from the median.
Higher moments, and so on.
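For concreteness, a quick sketch computing each listed statistic with NumPy (the data values are mine, and the mode is taken as the most frequent value, one simple convention among several):

```python
import numpy as np

x = np.array([2.0, 3.5, 3.5, 4.0, 7.0, 9.5])

mean     = x.mean()
variance = x.var()                            # ML version, divides by n
median   = np.median(x)
values, counts = np.unique(x, return_counts=True)
mode     = values[np.argmax(counts)]          # most frequent value
rng      = x.max() - x.min()
midrange = (x.max() + x.min()) / 2
mean_dev = np.abs(x - x.mean()).mean()

print(mean, variance, median, mode, rng, midrange, mean_dev)
```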

3 Sufficient Statistics
A sufficient statistic is (informally) a function of the sampled data (i.e. a statistic) containing all the information relevant to estimating some parameter $\theta$. As can be expected (and will be proven), $\hat{\mu}$ and $\hat{\sigma}^2$ are sufficient for the true mean and variance of a normal distribution, respectively. Once they are computed, all the other statistics (e.g. mode, other moments, etc.), as well as the data $D$ itself, are superfluous.

4 Sufficient Statistics - Cntd.
Def: A statistic $T$ is sufficient for $\theta$ if $P(D \mid T = t, \theta)$ is independent of $\theta$. Note: if $\theta$ is a random variable (Bayesian approach) we would require that a sufficient statistic for $\theta$ satisfies $p(\theta \mid T = t, D) = p(\theta \mid T = t)$. Indeed, if $T$ is sufficient for $\theta$, we have $p(\theta \mid t, D) = \frac{P(D \mid t, \theta)\, p(\theta \mid t)}{P(D \mid t)} = p(\theta \mid t)$, since by sufficiency $P(D \mid t, \theta) = P(D \mid t)$. The converse is also true: the intuitive requirement for the Bayesian case, $p(\theta \mid t, D) = p(\theta \mid t)$, implies our definition.

5 Sufficient Statistics - Example
Example: Assume $D = (x_1, \dots, x_n)$ is drawn i.i.d. from a Bernoulli distribution: $P(x_i = 1 \mid \theta) = \theta$, $P(x_i = 0 \mid \theta) = 1 - \theta$. Consider the statistic $T(D) = \sum_{i=1}^{n} x_i$. $T$ is a binomially distributed random variable: $P(T = t \mid \theta) = \binom{n}{t} \theta^t (1 - \theta)^{n - t}$. We have $P(D \mid T = t, \theta) = \frac{P(D, T = t \mid \theta)}{P(T = t \mid \theta)} = \frac{\theta^t (1 - \theta)^{n - t}}{\binom{n}{t} \theta^t (1 - \theta)^{n - t}} = \binom{n}{t}^{-1}$. Clearly, this is independent of $\theta$, so $T$ is sufficient for $\theta$.
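A small sanity check of this claim (a sketch; the sample size, the two θ values, and t are arbitrary choices of mine): for two different θ, the empirical distribution over arrangements of the ones, conditioned on T = t, is the same — uniform over the $\binom{n}{t}$ arrangements.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
n, t = 5, 2

def conditional_counts(theta, trials=200_000):
    """Empirical distribution of the full sample given T = t."""
    samples = rng.binomial(1, theta, size=(trials, n))
    hits = samples[samples.sum(axis=1) == t]
    return Counter(map(tuple, hits))

for theta in (0.3, 0.8):
    counts = conditional_counts(theta)
    total = sum(counts.values())
    freqs = sorted(v / total for v in counts.values())
    print(theta, [round(f, 3) for f in freqs])  # ~1/C(5,2) = 0.1 each
```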

6 Fisher-Neyman Factorization Theorem
Factorization Thm: A statistic $T$ is sufficient for $\theta$ iff the probability can be factored as $P(D \mid \theta) = g(T(D), \theta)\, h(D)$. Proof (discrete case), first direction: assume $T$ is sufficient. Note that given $D$, the value $T(D)$ is fixed (there is only one possible non-zero probability), so $P(D \mid \theta) = P(D, T = T(D) \mid \theta) = P(T = T(D) \mid \theta)\, P(D \mid T = T(D), \theta)$. By sufficiency the second factor does not depend on $\theta$, so we may take $g(T(D), \theta) = P(T = T(D) \mid \theta)$ and $h(D) = P(D \mid T = T(D))$.

7 Factorization Theorem - Cntd.
Proof (converse): we have $P(D \mid \theta) = g(T(D), \theta)\, h(D)$ and want to show that $P(D \mid T = t, \theta)$ is independent of $\theta$. A fixed $t$ constrains the possible values of the data to be in $A_t = \{D' : T(D') = t\}$ (assume $P(T = t \mid \theta) > 0$; otherwise we are trivially done). We have $P(D \mid T = t, \theta) = \frac{P(D \mid \theta)}{P(T = t \mid \theta)} = \frac{g(t, \theta)\, h(D)}{\sum_{D' \in A_t} g(t, \theta)\, h(D')} = \frac{h(D)}{\sum_{D' \in A_t} h(D')}$, which is independent of $\theta$. (For $D \notin A_t$ the conditional probability is zero regardless of $\theta$.)

8 Sufficient Statistics - Remarks
The parameter $\theta$ can be an $r$-dimensional vector; the statistic $T$ can be an $s$-dimensional vector, but there is no necessary relation between $r$ and $s$. If $T(D) = D$, the statistic is trivial. We are not interested in a trivial statistic, nor in a statistic that merely re-encodes the entire data set in one or a few scalars. The statistic is interesting (and useful) only if $s \ll n$. In this case we can get an impressive data reduction: we reduce an extremely large data set down to a few parameters. The factorization is not unique. Letting $f$ be any positive function, we can factor $P(D \mid \theta) = \left[ f(T(D))\, g(T(D), \theta) \right] \cdot \left[ h(D) / f(T(D)) \right]$. Any 1-1 function of a sufficient statistic is also sufficient.

9 Factorization - Example
Again, assume that $D = (x_1, \dots, x_n)$ is i.i.d. Bernoulli: $P(D \mid \theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i}$. Now, $P(D \mid \theta) = \theta^{\sum_i x_i} (1 - \theta)^{n - \sum_i x_i}$. Letting $h(D) = 1$ and $g(t, \theta) = \theta^t (1 - \theta)^{n - t}$, we know, by the factorization theorem, that the statistic $T(D) = \sum_i x_i$ is sufficient for $\theta$. Clearly, $\bar{x} = T(D)/n$ is also sufficient for $\theta$. (Note that $n$ is fixed.)
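A quick numeric check of this factorization (a sketch with arbitrary data and θ values of my choosing): the likelihood computed point by point agrees with $g(T, \theta) \cdot h(D)$, $h \equiv 1$.

```python
import numpy as np

D = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
n, T = len(D), D.sum()

for theta in (0.2, 0.5, 0.7):
    likelihood = np.prod(theta**D * (1 - theta)**(1 - D))
    g = theta**T * (1 - theta)**(n - T)   # h(D) = 1
    assert np.isclose(likelihood, g)
print("likelihood depends on D only through T =", T)
```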

10 Kernel Densities
If $P(D \mid \theta) = g(T, \theta)\, h(D)$ is any factorization, we define a normalized "standard" factorization using $\bar{g}(T, \theta) = \frac{g(T, \theta)}{\int g(T, \theta')\, d\theta'}$. The function $\bar{g}$ is called the kernel density. In the case of Bayesian estimation, kernel densities have a special meaning. Using the factorization we get $p(\theta \mid D) = \frac{P(D \mid \theta)\, p(\theta)}{\int P(D \mid \theta')\, p(\theta')\, d\theta'} = \frac{g(T, \theta)\, p(\theta)}{\int g(T, \theta')\, p(\theta')\, d\theta'}$. If the prior is uniform, the posterior equals the kernel density. (The more the prior is "flat", the more the kernel density approximates the posterior.)
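A numeric sketch of the uniform-prior case for the Bernoulli family (data and grid resolution are my choices): the posterior computed from Bayes' rule coincides with the normalized kernel.

```python
import numpy as np

D = np.array([1, 1, 0, 1, 0, 1, 1, 1])   # Bernoulli sample: n = 8, T = 6
n, T = len(D), D.sum()

theta = np.linspace(1e-6, 1 - 1e-6, 10_001)
dtheta = theta[1] - theta[0]

g = theta**T * (1 - theta)**(n - T)       # kernel g(T, theta), h(D) = 1
kernel_density = g / (g.sum() * dtheta)   # normalized over theta

prior = np.ones_like(theta)               # uniform prior on [0, 1]
post = g * prior
posterior = post / (post.sum() * dtheta)

print(np.max(np.abs(posterior - kernel_density)))  # ~0 up to rounding
```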

11 Factorization - Example
Let $D = (x_1, \dots, x_n)$ be an i.i.d. sample from $N(\mu, \sigma^2)$. The sample's density is $p(D \mid \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right) = (2\pi\sigma^2)^{-n/2} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \right)$. Using $\sum_i (x_i - \mu)^2 = \sum_i (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2$, where $\bar{x} = \frac{1}{n} \sum_i x_i$ and $s^2 = \frac{1}{n} \sum_i (x_i - \bar{x})^2$, we have $p(D \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\!\left( -\frac{n}{2\sigma^2} \left( s^2 + (\bar{x} - \mu)^2 \right) \right)$, so that the density of the whole sample depends on the data only through $(\bar{x}, s^2)$.
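A numeric spot-check of the sum-of-squares identity used above (random data; μ arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
mu = 0.7

lhs = np.sum((x - mu)**2)
xbar = x.mean()
rhs = np.sum((x - xbar)**2) + len(x) * (xbar - mu)**2
assert np.isclose(lhs, rhs)
print(lhs, rhs)
```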

12 Factorization - Example - Cntd
We now treat two cases. Recall that $p(D \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\!\left( -\frac{n}{2\sigma^2} \left( s^2 + (\bar{x} - \mu)^2 \right) \right)$. 1. $\theta = \mu$ ($\sigma^2$ known): take $g(\bar{x}, \mu) = \exp\!\left( -\frac{n(\bar{x} - \mu)^2}{2\sigma^2} \right)$ and $h(D) = (2\pi\sigma^2)^{-n/2} \exp\!\left( -\frac{n s^2}{2\sigma^2} \right)$; hence $\bar{x}$ is sufficient for $\mu$. 2. $\theta = (\mu, \sigma^2)$ (both unknown): take $g((\bar{x}, s^2), \theta)$ to be the whole density and $h(D) = 1$; hence the pair $(\bar{x}, s^2)$ is sufficient for $(\mu, \sigma^2)$.

13 Exponential Families
Let $k$ be an integer. A parametric family of distributions $\{p(x \mid \theta)\}$ is called an exponential family if we can write $p(x \mid \theta) = \alpha(x)\, \beta(\theta) \exp\!\left( \sum_{i=1}^{k} \gamma_i(\theta)\, s_i(x) \right)$. Example: the normal distribution $N(\mu, \sigma^2)$ is in the exponential family (with $k = 2$): $p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{\mu^2}{2\sigma^2} \right) \exp\!\left( \frac{\mu}{\sigma^2}\, x - \frac{1}{2\sigma^2}\, x^2 \right)$, and we take $\alpha(x) = 1$, $\beta(\theta) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\mu^2 / 2\sigma^2}$, $\gamma_1(\theta) = \frac{\mu}{\sigma^2}$, $s_1(x) = x$, $\gamma_2(\theta) = -\frac{1}{2\sigma^2}$, $s_2(x) = x^2$.

14 Exponential Families - Examples
The exponential distribution is (trivially) in the family: $p(x \mid \theta) = \theta e^{-\theta x}$, $x \ge 0$; take $\alpha(x) = 1$, $\beta(\theta) = \theta$, $\gamma_1(\theta) = -\theta$, $s_1(x) = x$. The Gamma distribution: $p(x \mid a, b) = \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-bx}$; take $\gamma_1 = a - 1$, $s_1(x) = \ln x$, $\gamma_2 = -b$, $s_2(x) = x$. The (two-dimensional) Dirichlet distribution, i.e. the Beta distribution: $p(x \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, x^{a-1} (1 - x)^{b-1}$; take $\gamma_1 = a - 1$, $s_1(x) = \ln x$, $\gamma_2 = b - 1$, $s_2(x) = \ln(1 - x)$.

15 Reminder: the Gamma Function
Definition: $\Gamma(a) = \int_0^\infty x^{a-1} e^{-x}\, dx$. Some properties: $\Gamma(a + 1) = a\, \Gamma(a)$; for a positive integer $n$, $\Gamma(n) = (n - 1)!$; $\Gamma(1/2) = \sqrt{\pi}$.
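These properties are easy to sanity-check numerically (a sketch using only the standard library):

```python
from math import gamma, factorial, pi, sqrt, isclose

assert isclose(gamma(5), factorial(4))        # Gamma(n) = (n-1)!
assert isclose(gamma(3.5), 2.5 * gamma(2.5))  # Gamma(a+1) = a * Gamma(a)
assert isclose(gamma(0.5), sqrt(pi))          # Gamma(1/2) = sqrt(pi)
print("all Gamma identities hold")
```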

16 Exponential Family - Examples, Cntd.
Most of the common distributions are in (some) exponential family. Of course, there are some which aren't. Example: the Cauchy distribution $p(x \mid \theta) = \frac{1}{\pi \left( 1 + (x - \theta)^2 \right)}$ is not in any exponential family; it doesn't have a variance (or even a mean), and it doesn't have sufficient statistics of fixed dimension.

17 Exponential Families and Sufficient Statistics
For each member of the exponential family there is a general approach for finding a simple sufficient statistic. Consider a data set $D = (x_1, \dots, x_n)$ drawn i.i.d. from a member of an exponential family. We have $p(D \mid \theta) = \prod_{j=1}^{n} \alpha(x_j)\, \beta(\theta) \exp\!\left( \sum_{i=1}^{k} \gamma_i(\theta)\, s_i(x_j) \right) = \left( \prod_j \alpha(x_j) \right) \beta(\theta)^n \exp\!\left( \sum_{i=1}^{k} \gamma_i(\theta) \sum_{j=1}^{n} s_i(x_j) \right)$. Define the statistic $T(D) = \left( \sum_j s_1(x_j), \dots, \sum_j s_k(x_j) \right)$.

18 Exponential Families and Sufficient Statistics
Thus we have $p(D \mid \theta) = \underbrace{\beta(\theta)^n \exp\!\left( \sum_{i=1}^{k} \gamma_i(\theta)\, T_i(D) \right)}_{g(T(D),\, \theta)} \cdot \underbrace{\prod_j \alpha(x_j)}_{h(D)}$. So we have a Fisher-Neyman factorization, and $T(D)$ is a sufficient statistic for $\theta$. Example: for $N(\mu, \sigma^2)$, the following statistic is sufficient for $(\mu, \sigma^2)$: $T(D) = \left( \sum_j x_j,\; \sum_j x_j^2 \right)$.
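A sketch of that statistic in code (data and names mine): the pair of sums is all one needs to recover the ML estimates of μ and σ².

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=1000)

# sufficient statistic for (mu, sigma^2): (sum x_j, sum x_j^2)
T1, T2 = x.sum(), (x**2).sum()
n = len(x)

mu_hat = T1 / n
var_hat = T2 / n - mu_hat**2   # equals the ML variance estimate

assert np.isclose(mu_hat, x.mean())
assert np.isclose(var_hat, x.var())
print(mu_hat, var_hat)
```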

19 Estimators and their Properties
Let $\{p(x \mid \theta)\}$ be a parameterized set of distributions. Given a sample $D = (x_1, \dots, x_n)$, drawn i.i.d. from one of the distributions in the set, we would like to estimate its parameter $\theta$ (thus identifying the distribution). An estimator for $\theta$ w.r.t. $D$ is any function $\hat{\theta} = \hat{\theta}(D)$; notice that an estimator is a random variable. How do we measure the quality of an estimator? Consistency: an estimator $\hat{\theta}$ for $\theta$ is consistent if $\hat{\theta} \to \theta$ in probability, i.e. for every $\varepsilon > 0$, $\lim_{n \to \infty} P(|\hat{\theta} - \theta| > \varepsilon) = 0$. This is a (desirable) asymptotic property. But we are mainly interested in other measures for finite (and small!) sample sizes.
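A quick illustration of consistency (a sketch; the distribution, ε, and sample sizes are my choices): the fraction of trials on which the sample mean misses the true mean by more than ε shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, eps, trials = 1.5, 0.1, 1000

for n in (10, 100, 1000):
    means = rng.normal(theta, 2.0, size=(trials, n)).mean(axis=1)
    print(n, np.mean(np.abs(means - theta) > eps))  # -> 0 as n grows
```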

20 Estimators and their Properties
Bias: define the bias of an estimator $\hat{\theta}$ to be $b(\hat{\theta}) = E[\hat{\theta}] - \theta$. Here the expectation is w.r.t. the distribution $p(D \mid \theta)$. The estimator is unbiased if its bias is zero ($E[\hat{\theta}] = \theta$). Example: the estimators $\bar{x} = \frac{1}{n} \sum_i x_i$ and $x_1$ (a single sample point), for the mean of a normal distribution, are both unbiased. The estimator $\hat{\sigma}^2 = \frac{1}{n} \sum_i (x_i - \bar{x})^2$ for its variance is biased, whereas the estimator $s^2 = \frac{1}{n - 1} \sum_i (x_i - \bar{x})^2$ is unbiased. Variance: another important property of an estimator is its variance $\mathrm{Var}[\hat{\theta}] = E\!\left[ (\hat{\theta} - E[\hat{\theta}])^2 \right]$. We would like to find estimators with minimum bias and variance. Which is more important, bias or variance?
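A Monte Carlo sketch of the variance-estimator bias (the parameters are mine): dividing by n underestimates σ² on average, dividing by n − 1 does not.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2, n, trials = 4.0, 5, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
ml  = samples.var(axis=1, ddof=0)   # divide by n     -> biased
unb = samples.var(axis=1, ddof=1)   # divide by n - 1 -> unbiased

print(ml.mean(), unb.mean())  # ~ sigma2 * (n-1)/n = 3.2  vs  ~ 4.0
```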

21 Risky Estimators
We employ our decision-theoretic framework to measure the quality of estimators. Abbreviate $\hat{\theta} = \hat{\theta}(D)$ and consider the square-error loss function $L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2$. The risk associated with $\hat{\theta}$ when $\theta$ is the true parameter is $R(\theta) = E\!\left[ (\hat{\theta} - \theta)^2 \right]$. Claim: $E\!\left[ (\hat{\theta} - \theta)^2 \right] = b(\hat{\theta})^2 + \mathrm{Var}[\hat{\theta}]$. Proof: $E\!\left[ (\hat{\theta} - \theta)^2 \right] = E\!\left[ (\hat{\theta} - E[\hat{\theta}] + E[\hat{\theta}] - \theta)^2 \right] = E\!\left[ (\hat{\theta} - E[\hat{\theta}])^2 \right] + 2 (E[\hat{\theta}] - \theta)\, E\!\left[ \hat{\theta} - E[\hat{\theta}] \right] + (E[\hat{\theta}] - \theta)^2 = \mathrm{Var}[\hat{\theta}] + b(\hat{\theta})^2$, since $E\!\left[ \hat{\theta} - E[\hat{\theta}] \right] = 0$.
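A Monte Carlo check of the bias-variance decomposition of the risk, using the biased ML variance estimator from the previous slide (parameters mine):

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, trials = 4.0, 5, 200_000     # theta is the true variance

samples = rng.normal(0.0, np.sqrt(theta), size=(trials, n))
est = samples.var(axis=1, ddof=0)      # biased ML variance estimator

risk = np.mean((est - theta)**2)
bias2 = (est.mean() - theta)**2
var = est.var()
print(risk, bias2 + var)               # agree up to Monte Carlo noise
```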

22 Bias vs. Variance
So, for a given level of risk, there is a tradeoff between bias and variance. This tradeoff is among the most important problems in machine learning and pattern recognition. The classical approach: consider only unbiased estimators and try to find those with the minimum possible variance. The classical approach isn't necessarily the best.

23 The Score
The score of the family $\{p(x \mid \theta)\}$ is the random variable $V = \frac{\partial}{\partial \theta} \ln p(x \mid \theta) = \frac{\frac{\partial}{\partial \theta} p(x \mid \theta)}{p(x \mid \theta)}$. It measures the "sensitivity" of $p$ as a function of the parameter $\theta$. Claim: $E[V] = 0$. Proof: $E[V] = \int p(x \mid \theta)\, \frac{\frac{\partial}{\partial \theta} p(x \mid \theta)}{p(x \mid \theta)}\, dx = \int \frac{\partial}{\partial \theta} p(x \mid \theta)\, dx = \frac{\partial}{\partial \theta} \int p(x \mid \theta)\, dx = \frac{\partial}{\partial \theta} 1 = 0$. Corollary: $\mathrm{Var}[V] = E[V^2]$.
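A quick Monte Carlo check of $E[V] = 0$ for the normal family (μ and σ of my choosing), where the score w.r.t. μ works out to $(x - \mu)/\sigma^2$ (derived on the next slide):

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma = 2.0, 1.0

x = rng.normal(mu, sigma, size=1_000_000)
score = (x - mu) / sigma**2      # d/dmu of log N(x; mu, sigma^2)
print(score.mean())              # ~ 0
print((score**2).mean())         # ~ 1/sigma^2 (the Fisher information)
```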

24 The Score - Example
Consider the normal distribution $p(x \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$: $V = \frac{\partial}{\partial \mu} \ln p(x \mid \mu) = \frac{\partial}{\partial \mu} \left( -\ln(\sqrt{2\pi}\,\sigma) - \frac{(x - \mu)^2}{2\sigma^2} \right) = \frac{x - \mu}{\sigma^2}$. Clearly, $E[V] = \frac{E[x] - \mu}{\sigma^2} = 0$ and $\mathrm{Var}[V] = \frac{\mathrm{Var}[x]}{\sigma^4} = \frac{1}{\sigma^2}$.

25 The Score - Vector Form
In the case where $\theta = (\theta_1, \dots, \theta_r)$ is a vector, the score is the vector $V = \nabla_\theta \ln p(x \mid \theta)$, whose $i$th component is $V_i = \frac{\partial}{\partial \theta_i} \ln p(x \mid \theta)$. Example: for $\theta = (\mu, \sigma^2)$ in the normal family, $V = \left( \frac{x - \mu}{\sigma^2},\; \frac{(x - \mu)^2 - \sigma^2}{2\sigma^4} \right)$.

26 Fisher Information
Fisher information is designed to provide a measure of how much information a data set provides about a parameter in a parametric family. Definition (scalar form): the Fisher information (about $\theta$, based on $x$) is the variance of the score: $I(\theta) = \mathrm{Var}[V] = E\!\left[ \left( \frac{\partial}{\partial \theta} \ln p(x \mid \theta) \right)^2 \right]$. Example: consider a random variable $x \sim N(\mu, \sigma^2)$. From the previous example, $I(\mu) = \mathrm{Var}\!\left[ \frac{x - \mu}{\sigma^2} \right] = \frac{1}{\sigma^2}$.

27 Fisher Information - Cntd.
Whenever $\theta$ is a vector, the Fisher information is the matrix $I(\theta) = [I_{ij}(\theta)]$, where $I_{ij}(\theta) = E[V_i V_j] = E\!\left[ \frac{\partial \ln p}{\partial \theta_i} \cdot \frac{\partial \ln p}{\partial \theta_j} \right]$. Reminder: $E[V] = 0$, so this is exactly the covariance matrix of the score. Remark: the Fisher information is only defined when the distributions satisfy some regularity conditions. (For example, they should be differentiable w.r.t. $\theta$, and all the distributions in the parametric family must have the same support set.)

28 Fisher Information - Cntd.
Let $x_1, \dots, x_n$ be i.i.d. random variables with density $p(x \mid \theta)$. The score of the sample $D = (x_1, \dots, x_n)$ is the sum of the individual scores: $V(D) = \frac{\partial}{\partial \theta} \ln p(D \mid \theta) = \frac{\partial}{\partial \theta} \sum_{i=1}^{n} \ln p(x_i \mid \theta) = \sum_{i=1}^{n} V(x_i)$. Example: if $x_1, \dots, x_n$ are i.i.d. $N(\mu, \sigma^2)$, the score is $V(D) = \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2} = \frac{n(\bar{x} - \mu)}{\sigma^2}$.

29 Fisher Information - Cntd.
Based on the sample $D$, the Fisher information about $\theta$ is $I_n(\theta) = \mathrm{Var}[V(D)] = \mathrm{Var}\!\left[ \sum_{i=1}^{n} V(x_i) \right] = \sum_{i=1}^{n} \mathrm{Var}[V(x_i)] = n\, I(\theta)$, using independence. Thus, the Fisher information is additive w.r.t. i.i.d. random variables. Example: suppose $x_1, \dots, x_n$ are i.i.d. $N(\mu, \sigma^2)$. From the previous example we know that the Fisher information about the parameter $\mu$ based on one variable is $I(\mu) = 1/\sigma^2$. Therefore, based on the entire sample, $I_n(\mu) = n/\sigma^2$.

30 The Cramer-Rao Inequality
Theorem: let $\hat{\theta} = \hat{\theta}(D)$ be an unbiased estimator for $\theta$. Then $\mathrm{Var}[\hat{\theta}] \ge \frac{1}{I_n(\theta)}$. Proof: using $E[V(D)] = 0$ and $E[\hat{\theta}] = \theta$, we have $\mathrm{Cov}[V, \hat{\theta}] = E[V \hat{\theta}] - E[V]\, E[\hat{\theta}] = E[V \hat{\theta}]$.

31 The Cramer-Rao Inequality - Cntd.
Now $E[V \hat{\theta}] = \int p(D \mid \theta)\, \frac{\frac{\partial}{\partial \theta} p(D \mid \theta)}{p(D \mid \theta)}\, \hat{\theta}(D)\, dD = \int \frac{\partial}{\partial \theta} p(D \mid \theta)\, \hat{\theta}(D)\, dD = \frac{\partial}{\partial \theta} \int p(D \mid \theta)\, \hat{\theta}(D)\, dD = \frac{\partial}{\partial \theta} E[\hat{\theta}] = \frac{\partial}{\partial \theta}\, \theta = 1$.

32 The Cramer-Rao Inequality - Cntd.
So $\mathrm{Cov}[V, \hat{\theta}] = 1$. By the Cauchy-Schwarz inequality, $\mathrm{Cov}[V, \hat{\theta}]^2 \le \mathrm{Var}[V]\, \mathrm{Var}[\hat{\theta}]$. Therefore $1 \le I_n(\theta)\, \mathrm{Var}[\hat{\theta}]$, i.e. $\mathrm{Var}[\hat{\theta}] \ge \frac{1}{I_n(\theta)}$. For a biased estimator one can prove $E\!\left[ (\hat{\theta} - \theta)^2 \right] \ge \frac{\left( 1 + b'(\hat{\theta}) \right)^2}{I_n(\theta)} + b(\hat{\theta})^2$, where $b'$ denotes the derivative of the bias w.r.t. $\theta$.

33 The Cramer-Rao Inequality - Cntd.
Example: let $x_1, \dots, x_n$ be i.i.d. $N(\mu, \sigma^2)$. From the previous example, $I_n(\mu) = n/\sigma^2$. Now let $\bar{x} = \frac{1}{n} \sum_i x_i$ be an (unbiased) estimator for $\mu$: $\mathrm{Var}[\bar{x}] = \frac{\sigma^2}{n} = \frac{1}{I_n(\mu)}$. So $\bar{x}$ matches the Cramer-Rao lower bound. Def: an unbiased estimator whose variance meets the Cramer-Rao lower bound is called efficient.
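A numeric sketch of this example (the parameters are mine): the Monte Carlo variance of the sample mean sits at the Cramer-Rao bound σ²/n.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma2, n, trials = 0.0, 4.0, 20, 200_000

means = rng.normal(mu, np.sqrt(sigma2), size=(trials, n)).mean(axis=1)
print(means.var())      # ~ sigma2 / n = 0.2, the Cramer-Rao bound
print(sigma2 / n)
```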

