
1 Kernel Density Estimation: Theory and Application in Discriminant Analysis. Thomas Ledl, Universität Wien

2 Contents: Introduction; Theory; Aspects of Application; Simulation Study; Summary

3 Introduction

4 Introduction. 25 observations: which distribution? [Figure: the 25 observations plotted on the axis from 0 to 4]

5 [Figure: the same data with several candidate densities, each marked "?"]

6 Kernel density estimator model: the kernel K(.) and the bandwidth h have to be chosen. [Figure: a kernel density estimate of the 25 observations]

7 Kernel and bandwidth: triangular vs. Gaussian kernel, "small" h vs. "large" h. [Figure: the resulting density estimates]
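To make the two choices concrete, here is a minimal univariate KDE sketch in Python (not part of the original slides; the sample data and the two kernels are illustrative assumptions):

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def triangular_kernel(u):
    # support [-1, 1], integrates to 1
    return np.clip(1.0 - np.abs(u), 0.0, None)

def kde(x_grid, data, h, kernel=gaussian_kernel):
    # f_hat(x) = (1 / (n*h)) * sum_i K((x - X_i) / h)
    u = (x_grid[:, None] - data[None, :]) / h
    return kernel(u).mean(axis=1) / h

# illustrative data: 25 observations on [0, 4], as in the introductory example
rng = np.random.default_rng(1)
data = rng.uniform(0.0, 4.0, size=25)
grid = np.linspace(0.0, 4.0, 200)

f_small_h = kde(grid, data, h=0.1)                               # "small" h: wiggly estimate
f_large_h = kde(grid, data, h=1.0)                               # "large" h: very smooth estimate
f_triangular = kde(grid, data, h=0.5, kernel=triangular_kernel)  # different kernel shape
```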

8 Question 1: Which choice of K(.) and h is best for a descriptive purpose?

9 Classification:

10 Classification: level plot of the LDA classifier (based on the assumption of a multivariate normal distribution). [figure]

11 Classification:

12 Classification: level plot of the KDE classifier. [figure]

13 Question 2: How does classification based on KDE perform in more than 2 dimensions?

14 Theory

15 Essential issues: optimization criteria; improvements of the standard model; resulting optimal choices of the model parameters K(.) and h

17 Optimization criteria: Lp-distances

18 [Figure: two densities f(.) and g(.) whose distance is to be measured]

20 IAE = "integrated absolute error"; ISE = "integrated squared error"
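Written out in the usual notation (the formulas themselves did not survive the transcript, so this is a reconstruction of the standard definitions):
$$\mathrm{IAE}(h) = \int \big|\hat f(x;h) - f(x)\big|\,dx, \qquad \mathrm{ISE}(h) = \int \big(\hat f(x;h) - f(x)\big)^2\,dx.$$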

22 Other ideas: consideration of horizontal distances for a more intuitive fit (Marron and Tsybakov, 1995); comparing the number and position of modes; minimizing the maximum vertical distance

23 Overview of some minimization criteria: the L1-distance (= IAE): difficult mathematical tractability; the L∞-distance (= maximum difference): does not consider the overall fit; "modern" criteria, which include a kind of measure of the horizontal distances: difficult mathematical tractability; the L2-distance (= ISE, MISE, AMISE, ...): most commonly used

24 ISE, MISE, AMISE, ...: ISE is a random variable; MISE = E(ISE), the expectation of ISE; AMISE is a Taylor approximation of MISE that is easier to calculate
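In the usual notation (a reconstruction, with R(g) = ∫ g(x)² dx and μ₂(K) = ∫ x² K(x) dx):
$$\mathrm{MISE}(h) = E\big[\mathrm{ISE}(h)\big], \qquad \mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{1}{4}\, h^4\, \mu_2(K)^2\, R(f'').$$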

26 The AMISE-optimal bandwidth:
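The formula on this slide is not in the transcript; presumably it is the standard minimizer of the AMISE expression above:
$$h_{\mathrm{AMISE}} = \left[ \frac{R(K)}{\mu_2(K)^2\, R(f'')\, n} \right]^{1/5}.$$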

27 The AMISE-optimal bandwidth depends on the kernel function K(.); this kernel-dependent part is minimized by the Epanechnikov kernel.

28 The AMISE-optimal bandwidth also depends on the unknown density f(.) (through R(f'')). How to proceed?

29 Data-driven bandwidth selection methods. Leave-one-out selectors: maximum-likelihood cross-validation; least-squares cross-validation (Bowman, 1984). Criteria based on substituting R(f'') in the AMISE formula: "normal rule" ("rule of thumb"; Silverman, 1986); plug-in methods (Sheather and Jones, 1991; Park and Marron, 1990); smoothed bootstrap.

31 Least-squares cross-validation (LSCV): the undisputed selector in the 1980s; gives an unbiased estimator for the ISE; suffers from more than one local minimizer, with no agreement about which one to use; bad convergence rate for the resulting bandwidth h_opt.
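A minimal sketch of the LSCV criterion for a Gaussian kernel (not from the slides; the grid-based integration and the candidate bandwidths are illustrative choices):

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x_grid, data, h):
    u = (x_grid[:, None] - data[None, :]) / h
    return gaussian_kernel(u).mean(axis=1) / h

def lscv_score(data, h, grid_size=512):
    # LSCV(h) = int f_hat^2 dx - (2/n) * sum_i f_hat_{-i}(X_i)
    n = len(data)
    grid = np.linspace(data.min() - 3 * h, data.max() + 3 * h, grid_size)
    int_f_squared = np.trapz(kde(grid, data, h) ** 2, grid)
    k = gaussian_kernel((data[:, None] - data[None, :]) / h) / h
    leave_one_out = (k.sum(axis=1) - np.diag(k)) / (n - 1)   # f_hat_{-i}(X_i)
    return int_f_squared - 2.0 * leave_one_out.mean()

# choose the bandwidth minimizing the LSCV score over a candidate grid
rng = np.random.default_rng(0)
data = rng.normal(size=100)
candidates = np.linspace(0.05, 1.0, 40)
h_lscv = candidates[np.argmin([lscv_score(data, h) for h in candidates])]
```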

33 Normal rule ("rule of thumb"): assumes f(x) to be N(μ, σ²); the easiest selector; often oversmooths the function. The resulting bandwidth is given by:
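The formula itself is not in the transcript; for a Gaussian kernel it is presumably the familiar Silverman expression
$$h_{\mathrm{NR}} = \left(\frac{4}{3n}\right)^{1/5}\hat\sigma \;\approx\; 1.06\,\hat\sigma\, n^{-1/5},$$
with σ̂ an estimate of the standard deviation.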

35 Plug-in methods (Sheather and Jones, 1991; Park and Marron, 1990): do not substitute R(f'') in the AMISE formula, but estimate it via R(f^(IV)), and R(f^(IV)) via R(f^(VI)), etc.; another parameter to choose is the number of stages to go back (one stage is mostly sufficient); better rates of convergence; do not finally circumvent the problem of the unknown density, either.

36 The multivariate case: the scalar bandwidth h is replaced by a bandwidth matrix H.
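Written out (a reconstruction of the standard multivariate kernel density estimator):
$$\hat f(x) = \frac{1}{n}\sum_{i=1}^{n} |H|^{-1/2}\, K\!\big(H^{-1/2}(x - X_i)\big), \qquad x \in \mathbb{R}^d.$$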

37 Issues of generalization in d dimensions: d² bandwidth parameters instead of one; unstable estimates; bandwidth selectors are essentially straightforward to generalize; for plug-in methods it is "too difficult" to give succinct expressions for d > 2 dimensions.

38 Aspects of Application

39 Essential issues: the curse of dimensionality; the connection between goodness-of-fit and optimal classification; two methods for discriminatory purposes

41 The "curse of dimensionality": the data "disappears" into the distribution tails in high dimensions, so a good fit in the tails is desired. [Figure plotted against the dimension d]

42 The "curse of dimensionality": much data is necessary to keep the estimation error constant in high dimensions.

44 Essential issues, connecting goodness-of-fit and optimal classification: the AMISE-optimal parameter choice is L2-optimal and gives a worse fit in the tails, whereas optimal classification (in high dimensions) is L1-optimal (misclassification rate) and makes estimation of the tails important; many observations are required for a reasonable fit, and the calculations become intensive for large n.

46 Method 1: reduce the data onto a subspace which allows a reasonably accurate estimation but does not destroy too much information (a "trade-off"); then use the multivariate kernel density concept to estimate the class densities.
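A minimal sketch of Method 1 (my own illustration, not the thesis code), assuming a principal-component reduction, a Gaussian product kernel with per-dimension normal-rule bandwidths, and equal prior probabilities:

```python
import numpy as np

def pca_project(X, k):
    # center the data and project it onto the first k principal components
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return (X - mean) @ Vt[:k].T, mean, Vt[:k]

def normal_rule_bandwidths(data):
    # rough per-dimension normal-reference bandwidths
    n, d = data.shape
    return 1.06 * data.std(axis=0, ddof=1) * n ** (-1.0 / (d + 4))

def kde_log_density(points, data, h):
    # product Gaussian kernel with a diagonal bandwidth vector h
    u = (points[:, None, :] - data[None, :, :]) / h              # shape (m, n, d)
    log_k = -0.5 * (u ** 2).sum(axis=2) - np.log(np.sqrt(2 * np.pi) * h).sum()
    return np.logaddexp.reduce(log_k, axis=1) - np.log(data.shape[0])

def method1_predict(X_train, y_train, X_test, k=2):
    # reduce to k principal components, estimate one KDE per class,
    # and assign each test point to the class with the larger estimated density
    Z_train, mean, V = pca_project(X_train, k)
    Z_test = (X_test - mean) @ V.T
    classes = np.unique(y_train)
    scores = np.column_stack([
        kde_log_density(Z_test, Z_train[y_train == c],
                        normal_rule_bandwidths(Z_train[y_train == c]))
        for c in classes
    ])
    return classes[np.argmax(scores, axis=1)]
```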

47 Method 2: use the univariate concept to "normalize" the data nonparametrically, then use the classical methods LDA and QDA for classification. Drawback: calculation intensive.
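One plausible reading of Method 2, sketched in Python (an assumption, not the thesis code): each marginal is pushed through an estimated kernel CDF followed by the standard normal quantile, and LDA is then fitted to the transformed data; scikit-learn is used here only for brevity.

```python
import numpy as np
from scipy.stats import norm
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def kernel_cdf(x_eval, data, h):
    # smooth CDF estimate: average of Gaussian CDFs centred at the observations
    return norm.cdf((x_eval[:, None] - data[None, :]) / h).mean(axis=1)

def marginal_normalize(X_train, X_eval):
    # push every marginal through its estimated CDF and the standard normal quantile
    Z = np.empty(X_eval.shape, dtype=float)
    n = X_train.shape[0]
    for j in range(X_train.shape[1]):
        h = 1.06 * X_train[:, j].std(ddof=1) * n ** -0.2       # univariate normal rule
        u = kernel_cdf(X_eval[:, j], X_train[:, j], h)
        Z[:, j] = norm.ppf(np.clip(u, 1e-6, 1.0 - 1e-6))
    return Z

def method2_predict(X_train, y_train, X_test):
    lda = LinearDiscriminantAnalysis()
    lda.fit(marginal_normalize(X_train, X_train), y_train)
    return lda.predict(marginal_normalize(X_train, X_test))
```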

48 Method 2: [illustration]

49 Simulation Study

50 Criticism of former simulation studies: carried out 20 to 30 years ago; outdated parameter selectors; restriction to uncorrelated normals; fruitless estimation because of high dimensions; no dimension reduction.

51 The present simulation study: 21 datasets x 14 estimators x 2 error criteria = 588 classification scores; many results.

53 Each dataset has 2 classes to distinguish, 600 observations per class, and 200 test observations (100 produced by each class), therefore dimension 1400 x 10.

54 Univariate prototype distributions:

55 10 datasets with equal covariance matrices + 10 datasets with unequal covariance matrices + 1 insurance dataset = 21 datasets in total.

57 The 14 estimators. Method 1 (multivariate density estimator): principal-component reduction onto 2, 3, 4 and 5 dimensions (4) x multivariate "normal rule" or multivariate LSCV criterion (2) = 8 estimators. Method 2 ("marginal normalizations"): univariate normal rule or Sheather-Jones plug-in (2) x subsequent LDA or QDA (2) = 4 estimators. Classical methods: LDA and QDA (2 estimators).

59 Misclassification criteria: the classical misclassification rate ("error rate") and the Brier score.
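Both criteria are simple to compute; a short sketch for the two-class case (illustrative, not the thesis code):

```python
import numpy as np

def error_rate(y_true, y_pred):
    # fraction of misclassified test observations
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

def brier_score(y_true, prob_class1, positive_label=1):
    # mean squared difference between the predicted probability of class 1
    # and the 0/1 indicator of the true class (two-class case)
    indicator = (np.asarray(y_true) == positive_label).astype(float)
    return float(np.mean((np.asarray(prob_class1) - indicator) ** 2))
```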

61 Results: the choice of the misclassification criterion is not essential.

62 Results: the choice of the multivariate bandwidth parameter (Method 1) is not essential in most cases; LSCV is superior in the case of bimodal distributions with unequal covariance matrices.

63 Results: the choice of the univariate bandwidth parameter (Method 2) is not essential.

64 Results: the best trade-off is a projection onto 2 to 3 dimensions.

65 Results

67 Results: is the additional calculation time justified?

68 Summary

69 Summary (1/3): Classification performance. Restriction to only a few dimensions; improvements over the classical discrimination methods by marginal normalizations (especially for unequal covariance matrices); poor performance of the multivariate kernel density classifier; LDA is undisputed in the case of equal covariance matrices and equal prior probabilities; the additional computation time does not seem to be justified.

74 Summary (2/3): KDE for data description. Great variety in error criteria, parameter selection procedures and additional model improvements (3 dimensions of choice); no consensus about a feasible error criterion; nobody knows what is finally optimized ("upper bounds" in L1 theory; ISE vs. MISE vs. AMISE in L2 theory; several minima in LSCV, ...); different parameter selectors are of varying quality with respect to different underlying densities.

78 Summary (3/3): Theory vs. application. Comprehensive theoretical results about optimal kernels or optimal bandwidths are not relevant for classification; for discriminatory purposes the issue of estimating log-densities is much more important; some univariate model improvements are not generalizable; the widely ignored "curse of dimensionality" forces the user to find a trade-off between necessary dimension reduction and information loss; the dilemma: much data is required for accurate estimates, but much data leads to an explosion of the computation time.

83 The End

