Summarizing Data by Statistics

Summarizing Data by Statistics
Recall that the estimates of the mean and variance of a normal density were very simple: $\hat\mu = \frac{1}{n}\sum_{i=1}^n x_i$ and $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \hat\mu)^2$. If we believe that these numbers contain all the relevant information found in the data, we could (for the purpose of learning the mean and variance) throw away all the sampled data. In particular, we could keep only these numbers and update them online, without storing previous data, as new data points arrive. But is it really the case that they are sufficient? Do these estimators use and summarize all the information contained in the data regarding the respective parameters?

Statistics
Def: A statistic is any function of the sampled data. Examples of statistics:
Sample mean - $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$.
Sample variance - $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$ (sample standard deviation - $s$).
Median - the halfway point of the sorted sample.
Mode - a peak value (a local maximum; there can be several modes).
Range - the difference between the largest and smallest readings.
Midrange - the average of the largest and smallest readings.
Mean deviation - the average of the distances from the mean, or from the median.
Higher moments, etc.
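As an illustration (not part of the original slides), here is a minimal Python sketch computing the statistics listed above for a small sample; the particular numbers are arbitrary:

```python
import numpy as np
from collections import Counter

x = np.array([2.1, 3.5, 1.8, 4.2, 3.5, 2.9])         # an arbitrary example sample

mean       = x.mean()                                  # sample mean
variance   = x.var(ddof=1)                             # sample variance (1/(n-1) convention)
std_dev    = x.std(ddof=1)                             # sample standard deviation
median     = np.median(x)                              # halfway point of the sorted sample
mode       = Counter(x.tolist()).most_common(1)[0][0]  # most frequent value
data_range = x.max() - x.min()                         # range
midrange   = (x.max() + x.min()) / 2                   # midrange
mean_dev   = np.abs(x - mean).mean()                   # mean deviation from the mean
```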

Sufficient Statistics
A sufficient statistic is (informally) a function of the sampled data (i.e. a statistic) containing all the information relevant to estimating some parameter $\theta$. As can be expected (and will be proven), $\hat\mu$ and $\hat\sigma^2$ are sufficient for the true mean and variance of a normal distribution, respectively. Once they are computed, all the other statistics (e.g. the mode, other moments, etc.), as well as the data $\mathcal{D}$ itself, are superfluous.

Sufficient Statistics - Cntd.
Def: A statistic $s$ is sufficient for $\theta$ if $P(\mathcal{D} \mid s, \theta)$ is independent of $\theta$.
Note: if $\theta$ is a random variable (the Bayesian approach) we would require that a sufficient statistic for $\theta$ satisfies $P(\theta \mid s, \mathcal{D}) = P(\theta \mid s)$. Indeed, if $s$ is sufficient for $\theta$, we have
$P(\theta \mid s, \mathcal{D}) = \frac{P(\mathcal{D} \mid s, \theta)\, P(\theta \mid s)}{P(\mathcal{D} \mid s)} = \frac{P(\mathcal{D} \mid s)\, P(\theta \mid s)}{P(\mathcal{D} \mid s)} = P(\theta \mid s).$
The converse is also true: the intuitive Bayesian requirement, $P(\theta \mid s, \mathcal{D}) = P(\theta \mid s)$, implies our definition.

Sufficient Statistics - Example
Example: Assume $\mathcal{D} = \{x_1, \dots, x_n\}$ is drawn i.i.d. from a Bernoulli distribution, $P(x \mid \theta) = \theta^x (1-\theta)^{1-x}$, $x \in \{0, 1\}$. Consider the statistic $s = \sum_{i=1}^n x_i$. $s$ is a binomially distributed random variable, $s \sim \mathrm{Bin}(n, \theta)$. We have
$P(\mathcal{D} \mid s, \theta) = \frac{P(\mathcal{D} \mid \theta)}{P(s \mid \theta)} = \frac{\theta^{s}(1-\theta)^{n-s}}{\binom{n}{s}\theta^{s}(1-\theta)^{n-s}} = \frac{1}{\binom{n}{s}}.$
Clearly, $P(\mathcal{D} \mid s, \theta)$ is independent of $\theta$, so $s$ is sufficient for $\theta$.
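A minimal simulation sketch (my own, not from the slides) of this fact: conditioned on $s$, the empirical distribution over binary sequences is approximately uniform over the $\binom{n}{s}$ sequences with $s$ ones, whatever $\theta$ is.

```python
import numpy as np
from collections import Counter

def conditional_dist(theta, n=4, s_target=2, trials=200_000, seed=0):
    """Empirical distribution of Bernoulli(theta) sequences of length n,
    conditioned on the sufficient statistic s = sum(x) being s_target."""
    rng = np.random.default_rng(seed)
    x = (rng.random((trials, n)) < theta).astype(int)
    kept = x[x.sum(axis=1) == s_target]
    counts = Counter(map(tuple, kept))
    total = sum(counts.values())
    return {seq: round(c / total, 3) for seq, c in sorted(counts.items())}

# Both calls give (approximately) probability 1/6 to each of the C(4,2) = 6
# sequences containing two ones -- the conditional law does not depend on theta.
print(conditional_dist(theta=0.3))
print(conditional_dist(theta=0.8))
```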

Fisher-Neyman Factorization Theorem
Factorization Thm: A statistic $s$ is sufficient for $\theta$ iff the probability $P(\mathcal{D} \mid \theta)$ can be factored as $P(\mathcal{D} \mid \theta) = g(s, \theta)\, h(\mathcal{D})$.
Proof (discrete case), first direction: suppose $s = \phi(\mathcal{D})$ is sufficient. Note that given $\mathcal{D}$, $s$ is fixed (there is only one possible non-zero probability), so
$P(\mathcal{D} \mid \theta) = P(\mathcal{D}, s \mid \theta) = P(\mathcal{D} \mid s, \theta)\, P(s \mid \theta).$
By sufficiency $P(\mathcal{D} \mid s, \theta)$ does not depend on $\theta$, so we may take $h(\mathcal{D}) = P(\mathcal{D} \mid s, \theta)$ and $g(s, \theta) = P(s \mid \theta)$.

Factorization Theorem - Cntd.
Proof (converse direction): we have $P(\mathcal{D} \mid \theta) = g(s, \theta)\, h(\mathcal{D})$ and want to show that $P(\mathcal{D} \mid s, \theta)$ is independent of $\theta$. A fixed $s$ constrains the possible values of the data to lie in $\mathcal{X}_s = \{\mathcal{D}' : \phi(\mathcal{D}') = s\}$ (assume that $P(s \mid \theta) > 0$; otherwise, we are trivially done). We have
$P(\mathcal{D} \mid s, \theta) = \frac{P(\mathcal{D} \mid \theta)}{P(s \mid \theta)} = \frac{g(s, \theta)\, h(\mathcal{D})}{\sum_{\mathcal{D}' \in \mathcal{X}_s} g(s, \theta)\, h(\mathcal{D}')} = \frac{h(\mathcal{D})}{\sum_{\mathcal{D}' \in \mathcal{X}_s} h(\mathcal{D}')},$
which is independent of $\theta$.

Sufficient Statistics - Remarks
The parameter $\theta$ can be a $k$-dimensional vector; the statistic $s$ can be an $m$-dimensional vector, but there is no necessary relation between $k$ and $m$.
If $s = \mathcal{D}$, the statistic is trivial. We are not interested in the trivial statistic, nor in a statistic that merely encodes the entire data set into one or a few scalars. The statistic is interesting (and useful) only if $m \ll n$. In this case we can get an impressive data reduction: an extremely large data set is reduced down to a few parameters.
The factorization is not unique. Letting $f$ be any positive function of $s$, we can also factor $P(\mathcal{D} \mid \theta) = [f(s)\, g(s, \theta)] \cdot [h(\mathcal{D}) / f(s)]$.
Any 1-1 function of a sufficient statistic is also sufficient.

Factorization - Example
Again, assume that $\mathcal{D} = \{x_1, \dots, x_n\}$ is i.i.d. Bernoulli($\theta$). Now,
$P(\mathcal{D} \mid \theta) = \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{\sum_i x_i} (1-\theta)^{n - \sum_i x_i}.$
Letting $h(\mathcal{D}) = 1$, we know, by the factorization theorem, that the statistic $s = \sum_{i=1}^n x_i$ is sufficient for $\theta$. Clearly, $\bar{x} = \frac{1}{n}\sum_i x_i$ is also sufficient for $\theta$. (Note that $n$ is fixed.)

Kernel Densities
If $P(\mathcal{D} \mid \theta) = g(s, \theta)\, h(\mathcal{D})$ is any factorization, we define a normalized "standard" factorization using
$\bar{g}(s, \theta) = \frac{g(s, \theta)}{\int g(s, \theta')\, d\theta'}.$
The function $\bar{g}(s, \theta)$ is called the kernel density. In the case of Bayesian estimation, kernel densities have a special meaning. Using the factorization we get
$p(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)\, p(\theta)}{\int P(\mathcal{D} \mid \theta')\, p(\theta')\, d\theta'} = \frac{g(s, \theta)\, p(\theta)}{\int g(s, \theta')\, p(\theta')\, d\theta'}.$
If the prior is uniform, the posterior equals the kernel density. (The more "flat" the prior is, the better the kernel density approximates the posterior.)
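A small numerical sketch (my own illustration, not from the slides) of the last point, using the Bernoulli factorization from the previous example: with a uniform prior on $\theta$, the normalized kernel density and the posterior coincide.

```python
import numpy as np

# Bernoulli data summarized by its sufficient statistic: s successes out of n.
n, s = 10, 7
theta = np.linspace(0.0, 1.0, 10_001)
dtheta = theta[1] - theta[0]

g = theta**s * (1 - theta)**(n - s)                 # g(s, theta) from the factorization
kernel = g / (g.sum() * dtheta)                     # normalized kernel density

prior = np.ones_like(theta)                         # uniform prior on [0, 1]
posterior = g * prior / ((g * prior).sum() * dtheta)

print(np.max(np.abs(kernel - posterior)))           # 0: the posterior equals the kernel density
```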

Factorization - Example
Let $\mathcal{D} = \{x_1, \dots, x_n\}$ be a normal i.i.d. sample from $N(\mu, \sigma^2)$. The sample's density is
$p(\mathcal{D} \mid \mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right).$
Using $\sum_i (x_i - \mu)^2 = s_2 - 2\mu s_1 + n\mu^2$, where $s_1 = \sum_i x_i$ and $s_2 = \sum_i x_i^2$, we have
$p(\mathcal{D} \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left(-\frac{s_2 - 2\mu s_1 + n\mu^2}{2\sigma^2}\right),$
so that the likelihood depends on the data only through $s = (s_1, s_2)$.
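A minimal sketch (my own check, not from the slides) that the normal likelihood indeed depends on the data only through $(s_1, s_2) = (\sum_i x_i, \sum_i x_i^2)$: the log-likelihood computed from the raw data and the one computed from the two sums agree for any $(\mu, \sigma)$.

```python
import numpy as np

def loglik_direct(x, mu, sigma):
    """Log-likelihood of an i.i.d. N(mu, sigma^2) sample, from the raw data."""
    n = len(x)
    return -n / 2 * np.log(2 * np.pi * sigma**2) - np.sum((x - mu)**2) / (2 * sigma**2)

def loglik_from_stats(s1, s2, n, mu, sigma):
    """Same log-likelihood, computed only from s = (sum x_i, sum x_i^2)."""
    return (-n / 2 * np.log(2 * np.pi * sigma**2)
            - (s2 - 2 * mu * s1 + n * mu**2) / (2 * sigma**2))

x = np.array([1.3, -0.2, 2.7, 0.9, 1.1])
s1, s2, n = x.sum(), (x**2).sum(), len(x)

for mu, sigma in [(0.0, 1.0), (1.2, 0.8), (2.0, 3.0)]:
    assert np.isclose(loglik_direct(x, mu, sigma), loglik_from_stats(s1, s2, n, mu, sigma))
```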

Factorization - Example - Cntd
We now treat the two cases: (1) only $\mu$ is unknown ($\sigma^2$ known), and (2) both $\mu$ and $\sigma^2$ are unknown. Recall that $\sum_i (x_i - \mu)^2 = s_2 - 2\mu s_1 + n\mu^2$.
1. When only $\mu$ is unknown, the factor $\exp(-s_2/(2\sigma^2))$ depends on the data alone and can be absorbed into $h(\mathcal{D})$, so $s_1 = \sum_i x_i$ by itself is a sufficient statistic for $\mu$.
2. When both $\mu$ and $\sigma^2$ are unknown, the pair $s = (s_1, s_2)$ is a sufficient statistic for $(\mu, \sigma^2)$.

Exponential Families
Let $k$ be an integer. A parametric family of distributions $\{p(x \mid \theta)\}$ is called a ($k$-parameter) exponential family if we can write:
$p(x \mid \theta) = a(\theta)\, b(x)\, \exp\!\left(\sum_{i=1}^k \pi_i(\theta)\, t_i(x)\right).$
Example: The normal distribution $N(\mu, \sigma^2)$ is in the exponential family (with $k = 2$):
$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\mu^2}{2\sigma^2}\right) \cdot \exp\!\left(\frac{\mu}{\sigma^2}\, x - \frac{1}{2\sigma^2}\, x^2\right),$
and we take $a(\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\mu^2/(2\sigma^2)}$, $b(x) = 1$, $\pi_1(\theta) = \mu/\sigma^2$, $t_1(x) = x$, $\pi_2(\theta) = -\frac{1}{2\sigma^2}$, $t_2(x) = x^2$.

Exponential Families - Examples
The exponential distribution is (trivially) in the family: $p(x \mid \lambda) = \lambda e^{-\lambda x}$, $x \ge 0$, with $a(\lambda) = \lambda$, $b(x) = 1$, $\pi_1(\lambda) = -\lambda$, $t_1(x) = x$.
The Gamma distribution: $p(x \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x} = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \exp\!\big((\alpha - 1)\ln x - \beta x\big)$, with $t_1(x) = \ln x$, $t_2(x) = x$.
The (two-dimensional) Dirichlet distribution (i.e. the Beta distribution): $p(x \mid \alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)}\, x^{\alpha_1 - 1} (1 - x)^{\alpha_2 - 1}$, with $t_1(x) = \ln x$, $t_2(x) = \ln(1 - x)$.

Reminder: the Gamma Function
Definition: $\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\, dt$, for $z > 0$.
Some properties: $\Gamma(z + 1) = z\,\Gamma(z)$; $\Gamma(n + 1) = n!$ for every non-negative integer $n$; $\Gamma(1/2) = \sqrt{\pi}$.

Exponential Family - Examples, Cntd.
Most of the common distributions are in (some) exponential family. Of course, there are some which aren't. Example: the Cauchy distribution, $p(x \mid \theta) = \frac{1}{\pi\,(1 + (x - \theta)^2)}$, is not in any exponential family; it has no (finite) variance and has no fixed-dimensional sufficient statistic (other than the sample itself).

Exponential Families and Sufficient Statistics
For each member of the exponential family, there is a general approach for finding a simple sufficient statistic. Consider a data set $\mathcal{D} = \{x_1, \dots, x_n\}$ drawn i.i.d. from a member of an exponential family. We have
$p(\mathcal{D} \mid \theta) = \prod_{j=1}^n a(\theta)\, b(x_j)\, \exp\!\left(\sum_{i=1}^k \pi_i(\theta)\, t_i(x_j)\right) = a(\theta)^n \left(\prod_{j=1}^n b(x_j)\right) \exp\!\left(\sum_{i=1}^k \pi_i(\theta) \sum_{j=1}^n t_i(x_j)\right).$
Define the statistic $s = (s_1, \dots, s_k)$ with $s_i = \sum_{j=1}^n t_i(x_j)$.

Exponential Families and Sufficient Statistics
Thus we have
$p(\mathcal{D} \mid \theta) = \underbrace{a(\theta)^n \exp\!\left(\sum_{i=1}^k \pi_i(\theta)\, s_i\right)}_{g(s,\,\theta)} \cdot \underbrace{\prod_{j=1}^n b(x_j)}_{h(\mathcal{D})}.$
So we have a Fisher-Neyman factorization, and $s$ is a sufficient statistic for $\theta$.
Example: For $N(\mu, \sigma^2)$, the following statistic is sufficient for $(\mu, \sigma^2)$: $s = \left(\sum_j x_j,\ \sum_j x_j^2\right)$.
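A minimal sketch (my own illustration, not from the slides) of this recipe: the sufficient statistic of an exponential-family sample is simply the vector of sums of the $t_i$ functions over the data.

```python
import numpy as np

def sufficient_stat(x, t_funcs):
    """Exponential-family sufficient statistic: s_i = sum_j t_i(x_j)."""
    return np.array([t(x).sum() for t in t_funcs])

x = np.random.default_rng(1).gamma(shape=2.0, scale=1.5, size=1000)

# Gamma family: t_1(x) = ln x, t_2(x) = x, so s = (sum ln x_j, sum x_j).
s_gamma = sufficient_stat(x, [np.log, lambda v: v])

# Normal family: t_1(x) = x, t_2(x) = x^2, so s = (sum x_j, sum x_j^2).
s_normal = sufficient_stat(x, [lambda v: v, np.square])
```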

Estimators and their Properties
Let $\mathcal{F} = \{p(x \mid \theta) : \theta \in \Theta\}$ be a parameterized set of distributions. Given a sample $X_1, \dots, X_n$ drawn i.i.d. from one of the distributions in the set, we would like to estimate its parameter $\theta$ (thus identifying the distribution). An estimator for $\theta$ w.r.t. $\mathcal{F}$ is any function $\hat\theta_n = \hat\theta(X_1, \dots, X_n)$; notice that an estimator is a random variable. How do we measure the quality of an estimator?
Consistency: An estimator $\hat\theta_n$ for $\theta$ is consistent if $\hat\theta_n \to \theta$ in probability, i.e., for every $\varepsilon > 0$, $\lim_{n \to \infty} P(|\hat\theta_n - \theta| > \varepsilon) = 0$. This is a (desirable) asymptotic property, but we are mainly interested in other measures for finite (and small!) sample sizes.

Estimators and their Properties
Bias: Define the bias of an estimator $\hat\theta$ to be $\mathrm{bias}(\hat\theta) = E[\hat\theta] - \theta$. Here, the expectation is w.r.t. the distribution $p(\cdot \mid \theta)$. The estimator is unbiased if its bias is zero ($E[\hat\theta] = \theta$). Example: the sample mean $\bar{x} = \frac{1}{n}\sum_i x_i$ is an unbiased estimator for the mean of a normal distribution (as is, e.g., the sample median). The estimator $\hat\sigma^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2$ for its variance is biased, whereas the estimator $s^2 = \frac{1}{n-1}\sum_i (x_i - \bar{x})^2$ is unbiased.
Variance: Another important property of an estimator is its variance $\mathrm{Var}(\hat\theta) = E\big[(\hat\theta - E[\hat\theta])^2\big]$. We would like to find estimators with minimum bias and variance. Which is more important, bias or variance?
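A quick simulation sketch (my own, not from the slides) of the bias of the $1/n$ variance estimator versus the unbiased $1/(n-1)$ version:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var, n, trials = 4.0, 10, 100_000

samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, n))
var_mle      = samples.var(axis=1, ddof=0)   # divides by n     -> biased
var_unbiased = samples.var(axis=1, ddof=1)   # divides by n - 1 -> unbiased

print(var_mle.mean())        # ~ 3.6 = (n-1)/n * true_var: systematically too low
print(var_unbiased.mean())   # ~ 4.0
```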

Risky Estimators
We employ our decision-theoretic framework to measure the quality of estimators. Abbreviate $\hat\theta = \hat\theta(X_1, \dots, X_n)$ and consider the squared-error loss function $L(\theta, \hat\theta) = (\theta - \hat\theta)^2$. The risk associated with $\hat\theta$ when $\theta$ is the true parameter is $R(\theta, \hat\theta) = E\big[(\hat\theta - \theta)^2\big]$.
Claim: $R(\theta, \hat\theta) = \mathrm{Var}(\hat\theta) + \mathrm{bias}(\hat\theta)^2$.
Proof:
$E\big[(\hat\theta - \theta)^2\big] = E\big[(\hat\theta - E[\hat\theta] + E[\hat\theta] - \theta)^2\big] = E\big[(\hat\theta - E[\hat\theta])^2\big] + 2\,(E[\hat\theta] - \theta)\, E\big[\hat\theta - E[\hat\theta]\big] + (E[\hat\theta] - \theta)^2 = \mathrm{Var}(\hat\theta) + \mathrm{bias}(\hat\theta)^2,$
since $E\big[\hat\theta - E[\hat\theta]\big] = 0$.
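A short simulation sketch (my own check, not from the slides) of the claim, using the biased $1/n$ variance estimator: its empirical squared-error risk equals its variance plus its squared bias, up to Monte Carlo noise.

```python
import numpy as np

rng = np.random.default_rng(1)
true_var, n, trials = 4.0, 10, 200_000

samples   = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
estimates = samples.var(axis=1, ddof=0)          # the biased (1/n) variance estimator

mse  = np.mean((estimates - true_var) ** 2)      # empirical risk under squared-error loss
var  = estimates.var()                           # variance of the estimator
bias = estimates.mean() - true_var               # bias of the estimator

print(mse, var + bias**2)                        # the two agree up to simulation noise
```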

Bias vs. Variance
So, for a given level of conditional risk, there is a tradeoff between bias and variance. This tradeoff is among the most important problems in machine learning and pattern recognition. The classical approach: consider only unbiased estimators and try to find those with the minimum possible variance. The classical approach isn't necessarily the best.

The Score
The score of the family $\{p(x \mid \theta)\}$ is the random variable
$V = \frac{\partial}{\partial\theta} \ln p(X \mid \theta) = \frac{\frac{\partial}{\partial\theta} p(X \mid \theta)}{p(X \mid \theta)}.$
$V$ measures the "sensitivity" of $\ln p(X \mid \theta)$ as a function of the parameter $\theta$.
Claim: $E[V] = 0$.
Proof:
$E[V] = \int p(x \mid \theta)\, \frac{\frac{\partial}{\partial\theta} p(x \mid \theta)}{p(x \mid \theta)}\, dx = \int \frac{\partial}{\partial\theta} p(x \mid \theta)\, dx = \frac{\partial}{\partial\theta} \int p(x \mid \theta)\, dx = \frac{\partial}{\partial\theta}\, 1 = 0.$
Corollary: $\mathrm{Var}(V) = E[V^2]$.

The Score - Example
Consider the normal distribution $p(x \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$, so $\ln p(x \mid \mu) = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(x - \mu)^2}{2\sigma^2}$ and the score w.r.t. $\mu$ is
$V = \frac{\partial}{\partial\mu} \ln p(x \mid \mu) = \frac{x - \mu}{\sigma^2}.$
Clearly, $E[V] = \frac{E[X] - \mu}{\sigma^2} = 0$ and $\mathrm{Var}(V) = \frac{\mathrm{Var}(X)}{\sigma^4} = \frac{1}{\sigma^2}$.
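A short Monte Carlo sketch (my own, not from the slides) of these two facts about the normal score:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.5, 2.0

x = rng.normal(mu, sigma, size=1_000_000)
score = (x - mu) / sigma**2          # score of N(mu, sigma^2) w.r.t. mu

print(score.mean())                  # ~ 0          (E[V] = 0)
print(score.var(), 1 / sigma**2)     # ~ 1/sigma^2  (Var(V) = 1/sigma^2)
```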

The Score - Vector Form
In the case where $\theta = (\theta_1, \dots, \theta_k)$ is a vector, the score is the vector $V$ whose $i$-th component is
$V_i = \frac{\partial}{\partial\theta_i} \ln p(X \mid \theta).$
Example: for $N(\mu, \sigma^2)$ with $\theta = (\mu, \sigma^2)$,
$V = \left(\frac{x - \mu}{\sigma^2},\ \ \frac{(x - \mu)^2}{2\sigma^4} - \frac{1}{2\sigma^2}\right).$

Fisher Information
Fisher information is designed to provide a measure of how much information a data set provides about a parameter in a parametric family.
Definition (scalar form): the Fisher information (about $\theta$, based on $X$) is the variance of the score:
$J(\theta) = \mathrm{Var}(V) = E\!\left[\left(\frac{\partial}{\partial\theta} \ln p(X \mid \theta)\right)^2\right].$
Example: consider a random variable $X \sim N(\mu, \sigma^2)$ with $\sigma^2$ known. From the previous example, the score w.r.t. $\mu$ is $V = (X - \mu)/\sigma^2$, so $J(\mu) = \mathrm{Var}(V) = 1/\sigma^2$.

Fisher Information - Cntd.
Whenever $\theta$ is a vector, the Fisher information is the matrix $J(\theta)$ where
$J_{ij}(\theta) = E\!\left[\frac{\partial}{\partial\theta_i} \ln p(X \mid \theta) \cdot \frac{\partial}{\partial\theta_j} \ln p(X \mid \theta)\right] = \mathrm{Cov}(V_i, V_j).$
Reminder: since $E[V] = 0$, $\mathrm{Cov}(V_i, V_j) = E[V_i V_j]$.
Remark: the Fisher information is only defined when the distributions satisfy some regularity conditions. (For example, they should be differentiable w.r.t. $\theta$, and all the distributions in the parametric family must have the same support set.)

Fisher Information - Cntd.
Let $X_1, \dots, X_n$ be i.i.d. random variables $\sim p(x \mid \theta)$. The score of the sample is the sum of the individual scores:
$V_n = \frac{\partial}{\partial\theta} \ln p(X_1, \dots, X_n \mid \theta) = \sum_{i=1}^n \frac{\partial}{\partial\theta} \ln p(X_i \mid \theta).$
Example: If $X_1, \dots, X_n$ are i.i.d. $N(\mu, \sigma^2)$, the score w.r.t. $\mu$ is
$V_n = \sum_{i=1}^n \frac{X_i - \mu}{\sigma^2} = \frac{n(\bar{X} - \mu)}{\sigma^2}.$

Fisher Information - Cntd.
Based on the sample $X_1, \dots, X_n$, the Fisher information about $\theta$ is
$J_n(\theta) = \mathrm{Var}(V_n) = \sum_{i=1}^n \mathrm{Var}\!\left(\frac{\partial}{\partial\theta} \ln p(X_i \mid \theta)\right) = n\, J(\theta).$
Thus, the Fisher information is additive w.r.t. i.i.d. random variables. Example: Suppose $X_1, \dots, X_n$ are i.i.d. $N(\mu, \sigma^2)$. From the previous example we know that the Fisher information about the parameter $\mu$ based on one variable is $J(\mu) = 1/\sigma^2$. Therefore, based on the entire sample, $J_n(\mu) = n/\sigma^2$.
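A brief Monte Carlo sketch (my own check, not from the slides) of the additivity claim for the normal case:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, trials = 1.5, 2.0, 10, 200_000

x = rng.normal(mu, sigma, size=(trials, n))

# Score of the whole i.i.d. sample w.r.t. mu: the sum of the individual scores.
sample_score = ((x - mu) / sigma**2).sum(axis=1)

print(sample_score.var())   # ~ n / sigma^2 = 2.5: information adds up across i.i.d. samples
print(n / sigma**2)
```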

The Cramer-Rao Inequality
Theorem: Let $\hat\theta$ be an unbiased estimator for $\theta$. Then
$\mathrm{Var}(\hat\theta) \ge \frac{1}{J_n(\theta)}.$
Proof: Using $E[V_n] = 0$ and the unbiasedness $E[\hat\theta] = \theta$, we have
$\mathrm{Cov}(\hat\theta, V_n) = E[\hat\theta\, V_n] - E[\hat\theta]\, E[V_n] = E[\hat\theta\, V_n].$

The Cramer-Rao Inequality - Cntd.
Now
$E[\hat\theta\, V_n] = \int \hat\theta(\mathcal{D})\, \frac{\partial}{\partial\theta} \ln p(\mathcal{D} \mid \theta)\, p(\mathcal{D} \mid \theta)\, d\mathcal{D} = \int \hat\theta(\mathcal{D})\, \frac{\partial}{\partial\theta} p(\mathcal{D} \mid \theta)\, d\mathcal{D} = \frac{\partial}{\partial\theta} \int \hat\theta(\mathcal{D})\, p(\mathcal{D} \mid \theta)\, d\mathcal{D} = \frac{\partial}{\partial\theta}\, E[\hat\theta] = \frac{\partial}{\partial\theta}\, \theta = 1.$

The Cramer-Rao Inequality - Cntd.
So, $\mathrm{Cov}(\hat\theta, V_n) = 1$. By the Cauchy-Schwarz inequality,
$1 = \mathrm{Cov}(\hat\theta, V_n)^2 \le \mathrm{Var}(\hat\theta)\, \mathrm{Var}(V_n) = \mathrm{Var}(\hat\theta)\, J_n(\theta).$
Therefore,
$\mathrm{Var}(\hat\theta) \ge \frac{1}{J_n(\theta)}.$
For a biased estimator, with bias $b(\theta) = E[\hat\theta] - \theta$, one can prove
$\mathrm{Var}(\hat\theta) \ge \frac{\big(1 + b'(\theta)\big)^2}{J_n(\theta)}.$

The Cramer-Rao Inequality - Cntd.
Example: Let $X_1, \dots, X_n$ be i.i.d. $N(\mu, \sigma^2)$. From the previous example, $J_n(\mu) = n/\sigma^2$. Now let $\hat\mu = \bar{X} = \frac{1}{n}\sum_i X_i$ be an (unbiased) estimator for $\mu$. Then $\mathrm{Var}(\hat\mu) = \sigma^2/n = 1/J_n(\mu)$, so $\bar{X}$ matches the Cramer-Rao lower bound.
Def: An unbiased estimator whose variance meets the Cramer-Rao lower bound is called efficient.
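A closing simulation sketch (my own, not from the slides) comparing the variance of the sample mean with the Cramer-Rao bound:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, trials = 1.5, 2.0, 10, 200_000

x = rng.normal(mu, sigma, size=(trials, n))
mu_hat = x.mean(axis=1)            # the sample mean as an estimator of mu

cr_bound = sigma**2 / n            # 1 / J_n(mu) = sigma^2 / n
print(mu_hat.var(), cr_bound)      # the two agree: the sample mean is efficient
```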