Kernel Density Estimation: Theory and Application in Discriminant Analysis
Thomas Ledl, Universität Wien
Contents: Introduction, Theory, Aspects of Application, Simulation Study, Summary
Introduction
25 observations: which distribution?
[Figure: the 25 observations with several candidate densities, each marked „?“]
Kernel density estimator model: K(.) and h have to be chosen.
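The estimator itself is not spelled out in the export; its standard form is f_hat(x) = (1/(n·h)) · sum_i K((x − X_i)/h). A minimal sketch (not the author's code), assuming a Gaussian kernel and an artificial sample `x_obs` standing in for the 25 observations:

```python
import numpy as np
from scipy.stats import norm

def kde(x, x_obs, h, kernel=norm.pdf):
    """Kernel density estimate at points x from observations x_obs:
    f_hat(x) = (1/(n*h)) * sum_i K((x - X_i)/h)."""
    x = np.asarray(x, dtype=float)
    u = (x[:, None] - x_obs[None, :]) / h        # (m, n) matrix of scaled distances
    return kernel(u).sum(axis=1) / (len(x_obs) * h)

# illustrative use: evaluate on a grid with bandwidth h = 0.3
x_obs = np.random.normal(2.0, 0.5, size=25)      # placeholder for the 25 observations
grid = np.linspace(0, 4, 200)
f_hat = kde(grid, x_obs, h=0.3)
```

A triangular kernel can be swapped in via `kernel=lambda u: np.clip(1 - np.abs(u), 0, None)`.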
Kernel and bandwidth choices: triangular vs. Gaussian kernel; „small“ h vs. „large“ h.
Question 1: Which choice of K(.) and h is best for a descriptive purpose?
Classification example: level plot – LDA (based on the assumption of a multivariate normal distribution).
Classification example: level plot – KDE classifier.
Question 2: How does classification based on KDE perform in more than two dimensions?
Theory
Essential issues:
- optimization criteria
- improvements of the standard model
- resulting optimal choices of the model parameters K(.) and h
Optimization criteria: L_p-distances.
[Figure: two densities f(.) and g(.) whose distance is to be measured]
Two basic criteria: the „integrated absolute error“ (IAE) and the „integrated squared error“ (ISE).
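The formulas behind these names did not survive the slide export; in standard notation they are the L1- and L2-distances between the estimate and the true density f:

```latex
\mathrm{IAE}(\hat f) = \int \bigl|\hat f(x) - f(x)\bigr|\,dx,
\qquad
\mathrm{ISE}(\hat f) = \int \bigl(\hat f(x) - f(x)\bigr)^{2}\,dx .
```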
22
Theory Application Aspects Simulation Study Summary Introduction Consideration of horizontal distances for a more intuitive fit (Marron and Tsybakov, 1995) Compare the number and position of modes Minimization of the maximum vertical distance Other ideas:
Overview of some minimization criteria:
- L1-distance = IAE: difficult mathematical tractability
- L∞-distance = maximum difference: does not consider the overall fit
- „Modern“ criteria, which include a kind of measure of the horizontal distances: difficult mathematical tractability
- L2-distance = ISE, MISE, AMISE, ...: most commonly used
ISE, MISE, AMISE, ...: the ISE is a random variable; MISE = E(ISE) is the expectation of the ISE; AMISE is a Taylor approximation of the MISE that is easier to calculate.
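Written out (the slide formulas are missing from the export; this is the standard second-order expansion, with R(g) = ∫ g(x)² dx and μ₂(K) = ∫ x² K(x) dx):

```latex
\mathrm{MISE}(h) = \mathbb{E}\!\int \bigl(\hat f_h(x) - f(x)\bigr)^{2} dx,
\qquad
\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{1}{4}\,\mu_2(K)^{2}\, h^{4}\, R(f'') .
```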
The AMISE-optimal bandwidth
The AMISE-optimal bandwidth depends on the kernel function K(.); among all kernels, the AMISE is minimized by the „Epanechnikov kernel“.
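Minimizing the AMISE expression above over h gives the familiar closed form (the formula itself is not in the export):

```latex
h_{\mathrm{AMISE}} = \left( \frac{R(K)}{\mu_2(K)^{2}\, R(f'')\, n} \right)^{1/5} .
```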
The AMISE-optimal bandwidth also depends on the unknown density f(.) through R(f'') – how to proceed?
Data-driven bandwidth selection methods:
- Leave-one-out selectors: maximum likelihood cross-validation; least-squares cross-validation (Bowman, 1984)
- Criteria based on substituting R(f'') in the AMISE formula: „normal rule“ („rule of thumb“; Silverman, 1986); plug-in methods (Sheather and Jones, 1991; Park and Marron, 1990); smoothed bootstrap
Least-squares cross-validation (LSCV):
- undisputed selector in the 1980s
- gives an unbiased estimator of the ISE (up to a constant that does not depend on h)
- suffers from more than one local minimizer – no agreement about which one to use
- bad convergence rate for the resulting bandwidth h_opt
A sketch of the criterion follows below.
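A minimal sketch (not the author's code) of the LSCV score for a Gaussian kernel; the closed form of ∫f̂² uses the fact that the convolution of two Gaussian kernels is again Gaussian, and the bandwidth is then chosen by minimizing the score over a grid, where the multiple local minima mentioned above may appear:

```python
import numpy as np
from scipy.stats import norm

def lscv(h, x_obs):
    """LSCV(h) = int(f_hat^2) - (2/n) * sum_i f_hat_{-i}(x_i), Gaussian-kernel KDE."""
    n = len(x_obs)
    d = x_obs[:, None] - x_obs[None, :]                       # pairwise differences
    int_fhat_sq = norm.pdf(d, scale=h * np.sqrt(2)).sum() / n**2
    k = norm.pdf(d, scale=h)                                  # kernel matrix K_h(x_i - x_j)
    loo = (k.sum(axis=1) - norm.pdf(0, scale=h)) / (n - 1)    # leave-one-out f_hat_{-i}(x_i)
    return int_fhat_sq - 2.0 * loo.mean()

# illustrative grid search for the minimizer
# x_obs = ...                                                 # the sample
# grid = np.linspace(0.05, 1.0, 200)
# h_lscv = grid[np.argmin([lscv(h, x_obs) for h in grid])]
```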
Normal rule („rule of thumb“; Silverman, 1986):
- assumes f(x) to be N(μ, σ²)
- the easiest selector
- often oversmooths the function
The resulting bandwidth is given by the formula below.
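The formula referred to on the slide is not in the export; for a Gaussian kernel, the normal rule reads

```latex
h_{\mathrm{NR}} = \left(\frac{4}{3n}\right)^{1/5} \hat\sigma \;\approx\; 1.06\,\hat\sigma\, n^{-1/5} .
```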
Plug-in methods (Sheather and Jones, 1991; Park and Marron, 1990):
- do not substitute R(f'') in the AMISE formula, but estimate it via R(f^(IV)), R(f^(IV)) via R(f^(VI)), etc.
- another parameter to choose (the number of stages to go back) – one stage is mostly sufficient
- better rates of convergence
- do not finally circumvent the problem of the unknown density, either
The multivariate case: the scalar bandwidth h is replaced by H, the bandwidth matrix.
Issues of generalization in d dimensions:
- d² bandwidth parameters (the entries of H) instead of one
- unstable estimates
- bandwidth selectors are essentially straightforward to generalize
- for plug-in methods it is „too difficult“ to give succinct expressions for d > 2 dimensions
A minimal sketch of the resulting estimator follows.
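A minimal sketch (not the author's code) of the multivariate estimator with a full bandwidth matrix H, assuming a Gaussian kernel so that H simply acts as the kernel covariance; the commented line shows one common normal-reference choice of H:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mv_kde(x, data, H):
    """Multivariate KDE with bandwidth matrix H:
    f_hat(x) = (1/n) * sum_i K_H(x - X_i), where K_H is the N(0, H) density."""
    return np.mean([multivariate_normal.pdf(x, mean=xi, cov=H) for xi in data])

# normal-reference style choice of H for Gaussian kernels (data has shape (n, d)):
# n, d = data.shape
# H = (4.0 / ((d + 2) * n)) ** (2.0 / (d + 4)) * np.cov(data, rowvar=False)
```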
Aspects of Application
Essential issues:
- the curse of dimensionality
- the connection between goodness-of-fit and optimal classification
- two methods for discriminatory purposes
The „curse of dimensionality“: the data „disappears“ into the distribution tails in high dimensions, so a good fit in the tails is desired! (See the illustration below.)
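A quick numerical illustration of this effect, assuming a d-dimensional standard normal (so that ‖X‖² follows a χ² distribution with d degrees of freedom):

```python
from scipy.stats import chi2

# Probability mass lying more than two standard deviations from the centre:
# it grows rapidly with the dimension d, i.e. the data sits in the tails.
for d in (1, 2, 3, 5, 10):
    print(f"d = {d:2d}:  P(||X|| > 2) = {chi2.sf(4.0, df=d):.3f}")
```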
The „curse of dimensionality“: a lot of data is necessary to keep the estimation error constant in high dimensions.
Essential issues: the AMISE-optimal parameter choice is L2-optimal, whereas optimal classification (in high dimensions) corresponds to the L1-criterion (misclassification rate). Estimation of the tails is important for classification, yet the AMISE-optimal fit is worse in the tails; moreover, the estimation is calculation-intensive for large n and many observations are required for a reasonable fit.
Method 1: reduce the data to a subspace which still allows a reasonably accurate estimation yet does not destroy too much information (a „trade-off“); then use the multivariate kernel density concept to estimate the class densities.
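A minimal sketch of Method 1 under illustrative choices (not the author's code): principal components for the subspace, scipy's Gaussian-kernel KDE with its normal-reference bandwidth, and equal class priors; the actual study varies the subspace dimension (2-5) and the bandwidth selector.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.stats import gaussian_kde

def kde_classify(X_train, y_train, X_test, n_components=3):
    """Project onto a few principal components, estimate one multivariate
    kernel density per class, assign each test point to the class with the
    highest estimated density (equal priors assumed)."""
    pca = PCA(n_components=n_components).fit(X_train)
    Z_train, Z_test = pca.transform(X_train), pca.transform(X_test)
    classes = np.unique(y_train)
    # gaussian_kde expects data of shape (d, n); "silverman" = normal-reference rule
    dens = [gaussian_kde(Z_train[y_train == c].T, bw_method="silverman") for c in classes]
    scores = np.column_stack([f(Z_test.T) for f in dens])
    return classes[np.argmax(scores, axis=1)]
```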
Method 2: use the univariate concept to „normalize“ the data nonparametrically, variable by variable, and then apply the classical methods LDA and QDA for classification. Drawback: calculation-intensive.
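A minimal sketch of the marginal normalization step (not the author's code): a Gaussian-kernel estimate of each marginal CDF is mapped through the standard normal quantile function, followed by ordinary LDA; the fixed bandwidth `h` is purely illustrative (the study uses the univariate normal rule or the Sheather-Jones plug-in) and the QDA variant is left out.

```python
import numpy as np
from scipy.stats import norm
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def marginal_normalize(col_new, col_train, h):
    """Map one variable to approximate normality: smoothed CDF estimate
    F_hat(x) = mean_i Phi((x - X_i)/h), then the standard normal quantile."""
    F = norm.cdf((col_new[:, None] - col_train[None, :]) / h).mean(axis=1)
    return norm.ppf(np.clip(F, 1e-6, 1 - 1e-6))       # clip to avoid +/- infinity

def method2(X_train, y_train, X_test, h=0.3):
    Z_train = np.column_stack([marginal_normalize(X_train[:, j], X_train[:, j], h)
                               for j in range(X_train.shape[1])])
    Z_test = np.column_stack([marginal_normalize(X_test[:, j], X_train[:, j], h)
                              for j in range(X_test.shape[1])])
    return LinearDiscriminantAnalysis().fit(Z_train, y_train).predict(Z_test)
```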
Simulation Study
Criticism of former simulation studies:
- carried out 20-30 years ago
- out-dated parameter selectors
- restriction to uncorrelated normals
- fruitless estimation because of high dimensions
- no dimension reduction
The present simulation study: 21 datasets x 14 estimators x 2 error criteria = 588 classification scores – many results.
Each dataset has...
- 2 classes for distinction
- 600 observations per class
- 200 test observations, 100 produced by each class
- ...and therefore dimension 1400 x 10
Univariate prototype distributions:
21 datasets in total: 10 datasets with equal covariance matrices + 10 datasets with unequal covariance matrices + 1 insurance dataset.
14 estimators:
- Method 1 (multivariate density estimator): principal component reduction onto 2, 3, 4 and 5 dimensions (4) x multivariate „normal rule“ or multivariate LSCV criterion (2) = 8 estimators
- Method 2 („marginal normalizations“): univariate normal rule or Sheather-Jones plug-in (2) x subsequent LDA or QDA (2) = 4 estimators
- Classical methods: LDA and QDA = 2 estimators
Misclassification criteria: the classical misclassification rate („error rate“) and the Brier score.
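In formulas (standard definitions, with p̂_ik the estimated posterior probability of class k for test observation i and y_ik the 0/1 class indicator):

```latex
\text{Error rate} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\{\hat y_i \neq y_i\},
\qquad
\text{Brier score} = \frac{1}{N}\sum_{i=1}^{N}\sum_{k} \bigl(\hat p_{ik} - y_{ik}\bigr)^{2} .
```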
Results: the choice of the misclassification criterion is not essential.
Results: the choice of the multivariate bandwidth parameter (Method 1) is not essential in most cases; LSCV is superior for bimodal distributions with unequal covariance matrices.
Results: the choice of the univariate bandwidth parameter (Method 2) is not essential.
Results: the best trade-off is a projection onto 2-3 dimensions.
Results: is the additional calculation time justified?
Summary
Summary (1/3) – Classification Performance
- Restriction to only a few dimensions
- Improvements over the classical discrimination methods by marginal normalizations (especially for unequal covariance matrices)
- Poor performance of the multivariate kernel density classifier
- LDA is undisputed in the case of equal covariance matrices and equal prior probabilities
- The additional computation time seems not to be justified
Summary (2/3) – KDE for Data Description
- Great variety of error criteria, parameter selection procedures and additional model improvements (3 dimensions)
- No agreement on a feasible error criterion
- Nobody knows what is finally optimized („upper bounds“ in L1-theory; in L2-theory ISE vs. MISE vs. AMISE; several minima in LSCV, ...)
- Different parameter selectors are of varying quality with respect to different underlying densities
Summary (3/3) – Theory vs. Application
- Comprehensive theoretical results about optimal kernels or optimal bandwidths are not relevant for classification
- For discriminatory purposes the issue of estimating log-densities is much more important
- Some univariate model improvements are not generalizable
- The – widely ignored – „curse of dimensionality“ forces the user to strike a trade-off between necessary dimension reduction and information loss
- Dilemma: much data is required for accurate estimates, but much data leads to an explosion of the computation time
The End