Kernel Density Estimation Theory and Application in Discriminant Analysis Thomas Ledl Universität Wien

Contents: Introduction – Theory – Aspects of Application – Simulation Study – Summary

Introduction

Theory Application Aspects Simulation Study Summary Introduction observations: Which distribution?

Kernel density estimator model: the kernel K(.) and the bandwidth h have to be chosen. Theory Application Aspects Simulation Study Summary Introduction
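For reference, the estimator in this model has the usual form (with kernel K(.) and bandwidth h):

\[ \hat f_h(x) \;=\; \frac{1}{n h} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right) \]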

Kernel / bandwidth (figure): triangular vs. gaussian kernel; „small“ h vs. „large“ h.

Theory Application Aspects Simulation Study Summary Introduction Question 1: Which choice of K(.) and h is the best for a descriptive purpose?

Introduction Theory Application Aspects Simulation Study Summary Introduction Classification:

Introduction Theory Application Aspects Simulation Study Summary Introduction Levelplot – LDA (based on assumption of a multivariate normal distribution): Classification:

Introduction Theory Application Aspects Simulation Study Summary Introduction Levelplot – KDE classifier: Classification:

Introduction Theory Application Aspects Simulation Study Summary Introduction Question 2: Performance of classification based on KDE in more than 2 dimensions?

Theory

Essential issues Optimization criteria Improvements of the standard model Resulting optimal choices of the model parameters K(.) and h Theory Application Aspects Simulation Study Summary Introduction

Theory Application Aspects Simulation Study Summary Introduction Optimization criteria: L_p-distances
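The L_p-distances between the estimate and the true density referred to here are of the usual form

\[ d_p\big(\hat f, f\big) \;=\; \left( \int \big| \hat f(x) - f(x) \big|^{p} \, dx \right)^{1/p}, \qquad 1 \le p < \infty, \]

with the maximum (sup) distance as the limiting case p = ∞.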

Theory Application Aspects Simulation Study Summary Introduction (figure: two density curves f(.) and g(.))

Theory Application Aspects Simulation Study Summary Introduction „Integrated absolute error“ =IAE =ISE „Integrated squared error“

Theory Application Aspects Simulation Study Summary Introduction =IAE „Integrated absolute error“ =ISE „Integrated squared error“
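Written out, the two criteria are

\[ \mathrm{IAE} = \int \big| \hat f(x) - f(x) \big| \, dx, \qquad \mathrm{ISE} = \int \big( \hat f(x) - f(x) \big)^{2} dx . \]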

Theory Application Aspects Simulation Study Summary Introduction Other ideas: consideration of horizontal distances for a more intuitive fit (Marron and Tsybakov, 1995); compare the number and position of modes; minimization of the maximum vertical distance.

Overview of some minimization criteria:
- L1-distance = IAE – difficult mathematical tractability
- L∞-distance = maximum difference – does not consider the overall fit
- „Modern“ criteria, which include some measure of the horizontal distances – difficult mathematical tractability
- L2-distance = ISE, MISE, AMISE, ... – most commonly used

ISE, MISE, AMISE, ... – ISE is a random variable; MISE = E(ISE) is its expectation; AMISE is a Taylor (asymptotic) approximation of MISE, easier to calculate.
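With the usual notation R(g) = ∫ g(x)² dx and μ₂(K) = ∫ u² K(u) du, the standard second-order-kernel expressions are

\[ \mathrm{MISE}(h) = E\big[\mathrm{ISE}(h)\big], \qquad \mathrm{AMISE}(h) = \frac{R(K)}{n h} + \frac{h^{4}}{4}\, \mu_2(K)^{2}\, R(f'') . \]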

Essential issues Optimization criteria Improvements of the standard model Resulting optimal choices of the model parameters K(.) and h Theory Application Aspects Simulation Study Summary Introduction

The AMISE-optimal bandwidth Theory Application Aspects Simulation Study Summary Introduction

The AMISE-optimal bandwidth depends on the kernel function K(.); the kernel-dependent factor is minimized by the „Epanechnikov kernel“.

The AMISE-optimal bandwidth also depends on the unknown density f(.) – how to proceed?
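Minimizing AMISE(h) over h gives the familiar expression

\[ h_{\mathrm{AMISE}} = \left[ \frac{R(K)}{\mu_2(K)^{2}\, R(f'')\, n} \right]^{1/5}, \]

which depends on the kernel through R(K) and μ₂(K), and on the unknown density through R(f'').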

Data-driven bandwidth selection methods:
- Leave-one-out selectors: maximum-likelihood cross-validation; least-squares cross-validation (Bowman, 1984)
- Criteria based on substituting R(f'') in the AMISE formula: „Normal rule“ („Rule of thumb“; Silverman, 1986); plug-in methods (Sheather and Jones, 1991; Park and Marron, 1990); smoothed bootstrap

Least-squares cross-validation (LSCV): the undisputed selector in the 1980s; gives an unbiased estimator of the ISE (up to a constant not depending on h); suffers from more than one local minimizer – no agreement about which one to use; bad convergence rate for the resulting bandwidth h_opt.
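A minimal numerical sketch of the LSCV criterion for a Gaussian kernel (function and variable names are illustrative, not from the talk); h_opt is then chosen as the minimizer over a bandwidth grid:

```python
import numpy as np

def lscv_score(x, h, grid_size=512):
    """LSCV(h) for a Gaussian-kernel KDE: integral of fhat^2 minus
    2/n times the sum of leave-one-out estimates at the data points."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # term 1: integral of fhat_h(t)^2 dt, evaluated numerically on a grid
    t = np.linspace(x.min() - 3 * h, x.max() + 3 * h, grid_size)
    kern = np.exp(-0.5 * ((t[:, None] - x[None, :]) / h) ** 2) / np.sqrt(2 * np.pi)
    fhat = kern.sum(axis=1) / (n * h)
    int_f2 = np.trapz(fhat ** 2, t)
    # term 2: leave-one-out estimates fhat_{-i}(x_i)
    d = (x[:, None] - x[None, :]) / h
    kmat = np.exp(-0.5 * d ** 2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(kmat, 0.0)
    loo = kmat.sum(axis=1) / ((n - 1) * h)
    return int_f2 - 2.0 * loo.mean()

# choose h by minimizing LSCV over a bandwidth grid
rng = np.random.default_rng(0)
x = rng.normal(size=200)
hs = np.linspace(0.05, 1.0, 60)
h_opt = hs[np.argmin([lscv_score(x, h) for h in hs])]
```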

Normal rule („Rule of thumb“): assumes f(x) to be N(μ, σ²); the easiest selector; often oversmooths the function. The resulting bandwidth is given by:
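For a Gaussian kernel this rule takes the familiar form

\[ \hat h_{\mathrm{NR}} = \left( \frac{4}{3 n} \right)^{1/5} \hat\sigma \;\approx\; 1.06\, \hat\sigma\, n^{-1/5}, \]

where \(\hat\sigma\) is the sample standard deviation (robust variants replace it by min(\(\hat\sigma\), IQR/1.34)).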

Plug-in methods (Sheather and Jones, 1991; Park and Marron, 1990): do not substitute R(f'') in the AMISE formula, but estimate it via R(f^(IV)), and R(f^(IV)) via R(f^(VI)), etc.; another parameter to choose (the number of stages to go back) – one stage is mostly sufficient; better rates of convergence; does not finally circumvent the problem of the unknown density, either.
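Schematically, the plug-in idea keeps the AMISE-optimal formula but replaces R(f'') by an estimate obtained with a pilot bandwidth g:

\[ \hat h = \left[ \frac{R(K)}{\mu_2(K)^{2}\, R\big(\hat f''_{g}\big)\, n} \right]^{1/5}, \]

where g is in turn based on an estimate of R(f^{(IV)}), and so on for the chosen number of stages.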

The multivariate case: the scalar bandwidth h becomes a bandwidth matrix H.
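With a bandwidth matrix H the estimator reads

\[ \hat f_H(x) = \frac{1}{n\, |H|^{1/2}} \sum_{i=1}^{n} K\!\big( H^{-1/2} (x - x_i) \big), \qquad x \in \mathbb{R}^{d} . \]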

Issues of generalization in d dimensions: d² bandwidth-matrix entries instead of one bandwidth parameter; unstable estimates; bandwidth selectors are essentially straightforward to generalize, but for plug-in methods it is „too difficult“ to give succinct expressions for d > 2 dimensions.

Aspects of Application

Application Aspects Theory Simulation Study Summary Introduction Essential issues: curse of dimensionality; connection between goodness-of-fit and optimal classification; two methods for discriminatory purposes.

Application Aspects Theory Simulation Study Summary Introduction The „curse of dimensionality“: the data „disappears“ into the distribution tails in high dimensions – a good fit in the tails is desired!

Application Aspects Theory Simulation Study Summary Introduction The „curse of dimensionality“: much data is necessary to maintain a constant estimation error in high dimensions.

Essential issues: AMISE-optimal parameter choice (L2-optimal) vs. optimal classification in high dimensions (L1-optimal, i.e. the misclassification rate). For classification the estimation of the tails is important, yet the fit is typically worse in the tails; moreover the estimate is calculation-intensive for large n, and many observations are required for a reasonable fit.

Application Aspects Theory Simulation Study Summary Introduction Method 1: reduce the data onto a subspace which allows a reasonably accurate estimation but does not destroy too much information – a „trade-off“; then use the multivariate kernel density concept to estimate the class densities.
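A sketch of Method 1 under these assumptions (helper names such as fit_kde_classifier are hypothetical): PCA projection onto a few components, one multivariate kernel density per class, classes combined via the Bayes rule.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity

def fit_kde_classifier(X, y, n_components=3, bandwidth=0.5):
    """Method 1 (sketch): project onto a few principal components, then
    estimate one multivariate (Gaussian-kernel) density per class."""
    pca = PCA(n_components=n_components).fit(X)
    Z = pca.transform(X)
    classes = np.unique(y)
    kdes = {c: KernelDensity(bandwidth=bandwidth).fit(Z[y == c]) for c in classes}
    priors = {c: np.mean(y == c) for c in classes}
    return pca, kdes, priors, classes

def predict_kde_classifier(X_new, pca, kdes, priors, classes):
    """Bayes rule: assign each point to the class with the largest
    prior times estimated class density."""
    Z = pca.transform(X_new)
    log_post = np.column_stack(
        [np.log(priors[c]) + kdes[c].score_samples(Z) for c in classes])
    return classes[np.argmax(log_post, axis=1)]
```

The single bandwidth here merely stands in for the multivariate „normal rule“ or LSCV choices used in the study.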

Application Aspects Theory Simulation Study Summary Introduction Method 2: use the univariate concept to „normalize“ the data nonparametrically; then use the classical methods like LDA and QDA for classification. Drawback: calculation-intensive.

Application Aspects Theory Simulation Study Summary Introduction Method 2:
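A sketch of Method 2 (names illustrative; the marginal „normalization“ is approximated here by the empirical CDF followed by the normal quantile function, as a simple stand-in for a kernel-based CDF estimate), with classical QDA applied afterwards:

```python
import numpy as np
from scipy.stats import norm
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def marginal_normalize(X_ref, X):
    """Transform each variable to approximate normality: empirical CDF of the
    reference sample, then the standard-normal quantile function."""
    Z = np.empty_like(X, dtype=float)
    n = X_ref.shape[0]
    for j in range(X.shape[1]):
        ranks = np.searchsorted(np.sort(X_ref[:, j]), X[:, j], side="right")
        u = np.clip(ranks / (n + 1), 1e-6, 1 - 1e-6)  # keep the quantiles finite
        Z[:, j] = norm.ppf(u)
    return Z

# illustration on synthetic data: normalize marginally, then apply classical QDA
rng = np.random.default_rng(1)
X = np.vstack([rng.exponential(size=(100, 3)), 1.5 + rng.exponential(size=(100, 3))])
y = np.repeat([0, 1], 100)
qda = QuadraticDiscriminantAnalysis().fit(marginal_normalize(X, X), y)
```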

Simulation Study

Theory Application Aspects Summary Introduction Criticism of former simulation studies: carried out years ago; out-dated parameter selectors; restriction to uncorrelated normals; fruitless estimation because of high dimensions; no dimension reduction.

Simulation Study Theory Application Aspects Summary Introduction The present simulation study: 21 datasets × 14 estimators × 2 error criteria = 588 classification scores – many results.

Simulation Study Theory Application Aspects Summary Introduction Each dataset has … classes for distinction and … observations per class; test observations: 100 produced by each class, therefore dimension 1400×10.

Univariate prototype distributions:

10 datasets having equal covariance matrices + 10 datasets having unequal covariance matrices + 1 insurance dataset = 21 datasets total.

14 estimators:
- Method 1 (multivariate density estimator): principal component reduction onto 2, 3, 4 and 5 dimensions (4) × multivariate „normal rule“ and multivariate LSCV criterion, resp. (2) = 8 estimators
- Method 2 („marginal normalizations“): univariate normal rule and Sheather-Jones plug-in (2) × subsequent LDA and QDA (2) = 4 estimators
- Classical methods: LDA and QDA = 2 estimators

Simulation Study Theory Application Aspects Summary Introduction Misclassification criteria: the classical misclassification rate („error rate“) and the Brier score.
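For N test observations with estimated class probabilities \(\hat P(c \mid x_i)\), the two criteria take the usual forms

\[ \text{Error rate} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{\hat y_i \ne y_i\}, \qquad \text{Brier score} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c} \big( \hat P(c \mid x_i) - \mathbf{1}\{y_i = c\} \big)^{2} . \]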

Simulation Study Theory Application Aspects Summary Introduction Results The choice of the misclassification criterion is not essential

Simulation Study Theory Application Aspects Summary Introduction Results The choice of the multivariate bandwidth parameter (method 1) is not essential in most cases; superiority of LSCV in the case of bimodal distributions having unequal covariance matrices.

Simulation Study Theory Application Aspects Summary Introduction Results The choice of the univariate bandwidth parameter (method 2) is not essential

Simulation Study Theory Application Aspects Summary Introduction Results The best trade-off is a projection onto 2-3 dimensions

Results

Is the additional calculation time justified? Results Simulation Study Theory Application Aspects Summary Introduction

Summary

Summary (1/3) – Classification Performance: restriction to only a few dimensions; improvements over the classical discrimination methods by marginal normalizations (especially for unequal covariance matrices); poor performance of the multivariate kernel density classifier; LDA is undisputed in the case of equal covariance matrices and equal prior probabilities; the additional computation time seems not to be justified.

Summary (2/3) – KDE for Data Description: great variety in error criteria, parameter selection procedures and additional model improvements (3 dimensions); no consensus on a feasible error criterion; it is unclear what is finally optimized („upper bounds“ in L1 theory; in L2 theory ISE vs. MISE vs. AMISE, several minima in LSCV, ...); different parameter selectors are of varying quality with respect to different underlying densities.

Summary (3/3) – Theory vs. Application: comprehensive theoretical results about optimal kernels or optimal bandwidths are not relevant for classification; for discriminatory purposes the issue of estimating log-densities is much more important; some univariate model improvements are not generalizable; the – widely ignored – „curse of dimensionality“ forces the user to strike a trade-off between necessary dimension reduction and information loss. Dilemma: much data is required for accurate estimates, but much data leads to an explosion of the computation time.

The End