Nonparametric density estimation, or Smoothing the data. Eric Feigelson, Arcetri Observatory, April 2014.


Why density estimation? The goal of density estimation is to estimate the unknown probability density function of a random variable from a set of observations. In more familiar language, density estimation smooths collections of individual measurements into a continuous distribution, replacing the dots on a scatterplot with a smooth estimator curve or surface. When the parametric form of the distribution is known (e.g., from astrophysical theory) or assumed (e.g., a heuristic power law model), then the estimation of model parameters is the subject of regression (MSMA Chpt. 7). Here we make no assumption about the parametric form and are thus engaged in nonparametric density estimation.

Astronomical applications Galaxies in a rich cluster → underlying distribution of baryons. Lensing of background galaxies → underlying distribution of Dark Matter. Photons in a Chandra X-ray image → underlying X-ray sky. Cluster stars in a Hertzsprung-Russell diagram → stellar evolution isochrone. X-ray light curve of a gamma-ray burst afterglow → temporal behavior of a relativistic afterglow. Galaxy halo star streams → cannibalism of a satellite dwarf galaxy.

Problems with the histogram Discontinuities between bins are not present in the underlying population. No mathematical guidance for choosing the origin x_0. No mathematical guidance for the binning (grouping) method: equal spacing, equal number of points, etc. No mathematical guidance for choosing the `center' of a bin. Difficult to visualize multivariate histograms. `In terms of various mathematical descriptions of accuracy, the histogram can be quite substantially improved upon.' (Silverman 1992) `The examination of both undersmoothed and oversmoothed histograms should be routine.' (Scott 1992) Histograms are useful for exploratory visualization of univariate data, but are not recommended for statistical analysis. Fit models to the original data points and (cumulative) e.d.f.s, not the (differential) histogram, unless the data are intrinsically grouped.
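
As a minimal sketch of the binning arbitrariness described above (illustrative only; the synthetic sample, bin width, and origins are invented for this example, and NumPy is assumed to be available), shifting only the bin origin changes the histogram counts:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=200)       # synthetic univariate sample

h = 0.5                                             # fixed bin width
for x0 in (0.0, 0.25):                              # two choices of bin origin
    edges = np.arange(x0 - 5.0, x0 + 5.0 + h, h)    # bins anchored at the origin x0
    counts, _ = np.histogram(x, bins=edges)
    print(f"origin x0 = {x0:+.2f}: bins near zero -> {counts[8:12]}")
```

The same data thus look noticeably different for two equally defensible choices of x_0, which is one reason the histogram is discouraged for quantitative analysis.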

Kernel density estimation The most common nonparametric density estimation technique convolves the discrete data with a normalized kernel function to obtain a continuous estimator: f̂(x) = (1/nh) Σ_i K[(x − x_i)/h], where h is the bandwidth. A kernel must integrate to unity over −∞ < x < ∞ and must be symmetric, K(u) = K(−u) for all u. If K(u) is a kernel, then the rescaled K_h(u) = (1/h) K(u/h) is also a kernel. A normal (Gaussian) kernel is a good choice, although theorems show that the minimum variance is given by the Epanechnikov kernel (an inverted parabola). The uniform kernel (`boxcar', or Heaviside, kernel) gives substantially higher variance.
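
A minimal sketch of the estimator above, written directly from the formula with a Gaussian kernel (NumPy assumed; the sample, bandwidth, and evaluation grid are illustrative choices):

```python
import numpy as np

def gaussian_kde_1d(x_eval, data, h):
    """Kernel density estimate f_hat(x) = (1/nh) * sum_i K((x - x_i)/h)
    with a standard normal kernel K."""
    u = (x_eval[:, None] - data[None, :]) / h           # (x - x_i)/h for every pair
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)      # Gaussian kernel values
    return K.sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(1)
data = rng.normal(size=300)                             # synthetic sample
grid = np.linspace(-4, 4, 200)
f_hat = gaussian_kde_1d(grid, data, h=0.3)              # continuous density estimate
```

SciPy's scipy.stats.gaussian_kde provides an equivalent ready-made estimator with an automatic rule-of-thumb bandwidth.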

A narrow bandwidth follows the data closely (small bias) but has high noise (large variance). A wide bandwidth misses detailed structure (high bias) but has low noise (small variance). Statisticians often choose to minimize the L_2 risk function, the mean integrated square error (MISE), MISE = Bias^2 + Variance ~ c_1 h^4 + c_2 h^(−1). (The constant c_1 depends on the integral of the second derivative of the true p.d.f.) The `asymptotic mean integrated square error' (AMISE), based on a 2nd-order expansion of the MISE, is often used. The choice of bandwidth is tricky!
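
The h^4 versus h^(−1) tradeoff can be seen empirically with a small Monte Carlo experiment. This is only a sketch, assuming a known true density (a standard normal here) so that the integrated squared error of each simulated estimate can be computed; the sample size, bandwidths, and number of simulations are arbitrary choices:

```python
import numpy as np
from scipy.stats import norm

def gaussian_kde_1d(x_eval, data, h):
    u = (x_eval[:, None] - data[None, :]) / h
    return (np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)).sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(2)
grid = np.linspace(-4, 4, 400)
dx = grid[1] - grid[0]
f_true = norm.pdf(grid)                                # known true p.d.f.

for h in (0.05, 0.3, 2.0):                             # narrow, moderate, wide bandwidths
    ise = []
    for _ in range(100):                               # Monte Carlo estimate of the MISE
        sample = rng.normal(size=200)
        err = gaussian_kde_1d(grid, sample, h) - f_true
        ise.append((err**2).sum() * dx)                # integrated squared error
    print(f"h = {h:4.2f}: estimated MISE = {np.mean(ise):.4f}")
```

The narrow and wide bandwidths both give a larger mean integrated square error than the intermediate choice, reflecting the variance and bias terms respectively.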

KDE: Choice of bandwidth The choice of bandwidth h is more important than the choice of kernel function. Silverman's `rule of thumb' that minimizes the MISE for simple distributions is h = 0.9 A n^(−1/5), where A is the minimum of the standard deviation σ and the interquartile range IQR/1.34. More generally, statisticians choose kernel bandwidths using cross-validation. Important theorems written in the 1980s show that maximum likelihood bandwidths can be estimated from resamples of the dataset. One method uses leave-one-out samples with (n−1) points, and the cross-validated likelihood to be maximized is L(h) = Σ_i log f̂_{−i}(x_i), where f̂_{−i} is the estimator computed with x_i omitted. Alternatives include the `least squares cross-validation' estimator, and the `generalized cross-validation' (GCV) related to the Akaike/Bayesian Information Criteria.
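
A sketch of both bandwidth selectors above, coded directly from the formulas (NumPy assumed; the bandwidth grid and synthetic sample are illustrative):

```python
import numpy as np

def silverman_bandwidth(x):
    """Rule of thumb h = 0.9 * A * n**(-1/5), with A = min(std, IQR/1.34)."""
    n = len(x)
    q75, q25 = np.percentile(x, [75, 25])
    A = min(np.std(x, ddof=1), (q75 - q25) / 1.34)
    return 0.9 * A * n ** (-1.0 / 5.0)

def loo_log_likelihood(x, h):
    """Leave-one-out cross-validated log-likelihood sum_i log f_hat_{-i}(x_i)."""
    n = len(x)
    u = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    np.fill_diagonal(K, 0.0)                      # exclude x_i from its own estimate
    f_loo = K.sum(axis=1) / ((n - 1) * h)
    return np.log(f_loo).sum()

rng = np.random.default_rng(3)
x = rng.normal(size=300)
print("Silverman rule of thumb:", silverman_bandwidth(x))
h_grid = np.linspace(0.05, 1.0, 40)
h_cv = max(h_grid, key=lambda h: loo_log_likelihood(x, h))
print("Likelihood cross-validation:", h_cv)
```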

[Figure: Example of normal kernel smoothing. KDEs of a sample drawn from the N(0,1) true density with bandwidths h = 0.05, h = 0.337 (minimum MISE), and h = 2.0; the data points are shown as a rug plot.]

Rarely recognized by astronomers … Confidence bands around the kernel density estimator can be obtained. For large samples and simple p.d.f. behaviors, confidence intervals for normal KDEs can be obtained from asymptotic normality (i.e. the Central Limit Theorem). For small or large samples and nearly any p.d.f. behavior, confidence intervals for any KDE can be estimated from bootstrap resamples.
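
A sketch of the bootstrap approach (NumPy assumed; the fixed bandwidth, the number of resamples, and the 95% pointwise level are illustrative choices):

```python
import numpy as np

def gaussian_kde_1d(x_eval, data, h):
    u = (x_eval[:, None] - data[None, :]) / h
    return (np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)).sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(4)
x = rng.normal(size=200)                       # observed sample
grid = np.linspace(-4, 4, 100)
h = 0.3                                        # fixed bandwidth for the illustration

boot = np.empty((500, grid.size))
for b in range(500):                           # resample the data with replacement
    resample = rng.choice(x, size=x.size, replace=True)
    boot[b] = gaussian_kde_1d(grid, resample, h)

estimate = gaussian_kde_1d(grid, x, h)
lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)   # pointwise 95% band
```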

Adaptive smoothing Standard kernel density estimation (KDE) assumes that the statistical properties of the underlying population are stationary (the same at all values of x), so that a constant bandwidth h can be used. However, in many astronomical datasets the behavior is nonstationary: sources appear in images; events appear in time series; detailed structures are evident where data points are densely sampled; and so forth. Astronomers thus often seek adaptive smoothing techniques that give locally (rather than globally) optimal estimation of the population from the data. Unfortunately, modern statistics does not provide a unique optimal procedure for adaptive smoothing, and techniques developed within astronomy have not been evaluated. This is an area ripe for improved astrostatistical research.

Some adaptive smoothing techniques Scale the KDE bandwidth by the inverse square root of the local p.d.f. estimate. Silverman's adaptive kernel estimator (MSMA, eqns 6.13–6.15). Kth-nearest-neighbor (k-NN) estimators: the p.d.f. estimator at each data point x_i is the inverse of the distance to the kth nearest neighbor around that point (related to KDE with a uniform kernel). Abramson's estimator (Ann. Stat.) has minimal MSE at x = 0 (zero bias?). A sketch of the first approach is given below.
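
The sketch below implements a locally scaled Gaussian KDE in the spirit of Silverman's adaptive kernel estimator: each data point receives a bandwidth proportional to the inverse square root of a pilot density estimate. The pilot bandwidth, the normalization by the geometric mean, and the two-component test sample are illustrative choices, not prescriptions from the lecture:

```python
import numpy as np

def gaussian_kde_1d(x_eval, data, h):
    u = (x_eval[:, None] - data[None, :]) / h
    return (np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)).sum(axis=1) / (len(data) * h)

def adaptive_kde_1d(x_eval, data, h0):
    """Adaptive KDE with local bandwidths h_i = h0 * lambda_i,
    where lambda_i ~ 1/sqrt(pilot density at x_i)."""
    pilot = gaussian_kde_1d(data, data, h0)                # pilot estimate at the data points
    lam = np.sqrt(np.exp(np.mean(np.log(pilot))) / pilot)  # normalize by the geometric mean
    h_i = h0 * lam                                         # one bandwidth per data point
    u = (x_eval[:, None] - data[None, :]) / h_i[None, :]
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return (K / h_i[None, :]).sum(axis=1) / len(data)

rng = np.random.default_rng(5)
data = np.concatenate([rng.normal(0.0, 0.2, 300),          # dense, narrow component
                       rng.normal(3.0, 1.5, 100)])         # sparse, broad component
grid = np.linspace(-2, 8, 300)
f_adapt = adaptive_kde_1d(grid, data, h0=0.3)
```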

Kernel regression A regression approach to smoothing bivariate or multivariate data … E(Y | x) = f(x). Read: "the expected population value of the response variable Y given a chosen value of x is a specified function of x". A reasonable estimation approach with a limited data set is to find the mean value of Y in a window around x, [x − h/2, x + h/2), with h chosen to balance bias and variance. A more effective way might include more distant values of x, downweighted by some kernel p.d.f. function such as N(0, h^2). This is called kernel regression, a type of local regression. The `best fit' might be obtained by locally weighted least squares or maximum likelihood.

Two common nonparametric regressions Nadaraya-Watson estimator: a kernel-weighted local mean, f̂(x) = Σ_i K[(x − x_i)/h] y_i / Σ_j K[(x − x_j)/h].
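
A sketch of the Nadaraya-Watson estimator coded directly from this formula with a Gaussian kernel (NumPy assumed; the noisy sine-curve data are synthetic and illustrative):

```python
import numpy as np

def nadaraya_watson(x_eval, x, y, h):
    """y_hat(x) = sum_i K[(x - x_i)/h] y_i / sum_j K[(x - x_j)/h], Gaussian K."""
    u = (x_eval[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u**2)                       # unnormalized weights suffice (they cancel)
    return (K * y[None, :]).sum(axis=1) / K.sum(axis=1)

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
grid = np.linspace(0, 10, 100)
y_hat = nadaraya_watson(grid, x, y, h=0.5)        # kernel regression estimate of E(Y | x)
```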

Local polynomial smoother (LOWESS, LOESS, Friedman’s supersmoother)
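
A usage sketch of a local (LOWESS-type) smoother, assuming the statsmodels package is available; the span frac = 0.3 and the synthetic data are illustrative choices:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# frac sets the fraction of the data used in each local fit (the smoothing span)
fit = lowess(y, x, frac=0.3)             # returns (x, smoothed y) pairs sorted by x
x_smooth, y_smooth = fit[:, 0], fit[:, 1]
```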

Spline regression The function f(x) can be any (non)linear function but is often chosen to be a polynomial. If the polynomials are connected together at a series of knots to give a smooth curve, the result is called a spline. Variants that avoid high correlation among the fitted parameters are B-splines and natural splines. (Kass 2008)
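
A sketch of a least-squares cubic spline fit with user-chosen knots, using SciPy (the knot positions and synthetic data are illustrative):

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

rng = np.random.default_rng(8)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

knots = np.linspace(1, 9, 7)                       # 7 interior knots chosen by the user
spline = LSQUnivariateSpline(x, y, knots, k=3)     # cubic B-spline least-squares fit
grid = np.linspace(0, 10, 100)
y_fit = spline(grid)                               # smooth regression curve
```

As the next slide emphasizes, the number and placement of knots play the role of the bandwidth here and strongly affect the fit.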

[Figure: The challenge of spline knot selection. Fits with 5 knots chosen by R, 15 knots chosen by R, and 7 knots chosen by the user (Kass 2008).]

Comment for astronomers Due to unfamiliarity with kernel density estimation and nonparametric regression, astronomers too often fit data with simple heuristic functions (e.g. linear, linear with threshold, power law, …). Unless there are scientific reasons for such functions, it is often wiser to let the data speak for themselves, estimating a smooth distribution from the data points nonparametrically. A variety of often-effective techniques are available for this. Well-established methods like the KDE and the NW estimator have asymptotic confidence bands. For all methods, confidence bands can be estimated by bootstrap methods within the `window' determined by the (local/global) bandwidth. Astronomers thus do not have to sacrifice `error analysis' when using nonparametric regression techniques.