DENCLUE 2.0: Fast Clustering based on Kernel Density Estimation
Alexander Hinneburg, Martin-Luther-University Halle-Wittenberg, Germany
Hans-Henning Gabriel, 101tec GmbH, Halle, Germany
Overview
– Density-based clustering and DENCLUE 1.0
– Hill climbing as EM algorithm
– Identification of local maxima
– Applications of general EM acceleration
– Experiments
Density-Based Clustering
Assumption
– clusters are regions of high density in the data space
How to estimate density?
– parametric models: mixture models
– non-parametric models: histogram, kernel density estimation
Kernel Density Estimation
Idea
– the influence of a data point is modeled by a kernel
– the density is the normalized sum of all kernels
– the smoothing parameter h controls the width of the kernels
Gaussian kernel density estimate (formula below)
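For reference, the standard Gaussian kernel density estimate over data points x_1, ..., x_N in R^d with smoothing parameter h is:

    \hat{p}(x) = \frac{1}{N\,(2\pi)^{d/2} h^{d}} \sum_{i=1}^{N} \exp\left( -\frac{\lVert x - x_i \rVert^{2}}{2h^{2}} \right)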
DENCLUE 1.0 Framework
Clusters are defined by the local maxima of the density estimate
– find all maxima by gradient hill climbing
Problem
– the gradient hill climbing uses a constant step size (a sketch follows below)
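A minimal sketch of constant-step-size gradient hill climbing on the Gaussian kernel density estimate; the step size delta, bandwidth h, and iteration count are illustrative choices, not values from the talk:

    import numpy as np

    def kde_gradient(x, data, h):
        """Gradient of the Gaussian kernel density estimate at x, up to a positive constant."""
        diff = data - x                                      # (N, d) differences x_i - x
        w = np.exp(-np.sum(diff**2, axis=1) / (2 * h**2))    # kernel weight of each point
        return (w[:, None] * diff).sum(axis=0) / h**2

    def hill_climb_fixed_step(x, data, h, delta=0.05, n_steps=100):
        """DENCLUE 1.0-style ascent: move a fixed distance delta along the normalized gradient."""
        for _ in range(n_steps):
            g = kde_gradient(x, data, h)
            x = x + delta * g / (np.linalg.norm(g) + 1e-12)  # constant step length: never adapts
        return x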
Problem of Constant Step Size
Not efficient
– many unnecessarily small steps
Not effective
– does not converge to a local maximum, it only comes close
New Hill Climbing Approach
General approach
– differentiate the density estimate and set the gradient to zero
– the equation has no closed-form solution, but it can be used as a fixed-point iteration (see below)
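For the Gaussian kernel, setting the gradient to zero rearranges to a weighted mean of the data, which suggests the fixed-point iteration:

    \nabla \hat{p}(x) = 0
    \iff
    x = \frac{\sum_{i=1}^{N} \exp\left( -\frac{\lVert x - x_i \rVert^{2}}{2h^{2}} \right) x_i}
             {\sum_{i=1}^{N} \exp\left( -\frac{\lVert x - x_i \rVert^{2}}{2h^{2}} \right)}
    \quad\Rightarrow\quad
    x^{(t+1)} = \frac{\sum_{i} \exp\left( -\frac{\lVert x^{(t)} - x_i \rVert^{2}}{2h^{2}} \right) x_i}
                     {\sum_{i} \exp\left( -\frac{\lVert x^{(t)} - x_i \rVert^{2}}{2h^{2}} \right)}

Each iterate is a convex combination of the data points, so the effective step size shrinks automatically as the iteration approaches a maximum.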
New DENCLUE 2.0 Hill Climbing
Efficient
– automatically adjusted step size at no extra cost
Effective
– converges to a local maximum (proof follows)
A code sketch of the update follows below.
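A minimal sketch of this hill climbing using the Gaussian-kernel fixed-point update above; eps and max_iter are illustrative stopping choices (the talk's actual criterion, the sum of the last k step sizes, appears two slides later):

    import numpy as np

    def denclue2_step(x, data, h):
        """One step of the fixed-point update: a kernel-weighted mean of the data."""
        w = np.exp(-np.sum((data - x)**2, axis=1) / (2 * h**2))
        return (w[:, None] * data).sum(axis=0) / w.sum()

    def hill_climb(x, data, h, eps=1e-4, max_iter=200):
        """Iterate the update until the step length falls below eps."""
        for _ in range(max_iter):
            x_new = denclue2_step(x, data, h)
            if np.linalg.norm(x_new - x) < eps:
                return x_new
            x = x_new
        return x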
Proof of Convergence
Cast the problem of maximizing the kernel density as maximizing the likelihood of a mixture model
Introduce a hidden variable: which mixture component (data point) generated the observation
Proof of Convergence
The complete likelihood is maximized by the EM algorithm
– this also maximizes the original likelihood, which is the kernel density estimate
Starting the EM at the current point, each iteration (E-step, then M-step) performs exactly one hill-climbing step (see below)
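Reading the kernel density estimate as the likelihood of a mixture of N Gaussians with means x_i, equal weights 1/N, and covariance h^2 I, one EM iteration in x reads:

    \text{E-step:}\quad
    q_i = \frac{\exp\left( -\lVert x^{(t)} - x_i \rVert^{2} / 2h^{2} \right)}
               {\sum_{j=1}^{N} \exp\left( -\lVert x^{(t)} - x_j \rVert^{2} / 2h^{2} \right)}
    \qquad
    \text{M-step:}\quad
    x^{(t+1)} = \sum_{i=1}^{N} q_i \, x_i

The M-step result equals the fixed-point update above, so the monotone-likelihood property of EM guarantees that a hill-climbing step never decreases the density.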
Identification of Local Maxima
The EM algorithm iterates until
– an end point is reached, i.e.
– the sum of the last k step sizes falls below ε
Assumption
– the true local maximum lies in a ball of radius ε around the end point
– points whose end points are closer than 2ε belong to the same maximum M
In case of a non-unique assignment, do a few extra EM iterations (a grouping sketch follows below)
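A minimal sketch of the end-point grouping under the assumption above (a greedy merge at threshold 2*eps; a full implementation would also trigger the extra EM iterations for ambiguous points):

    import numpy as np

    def group_endpoints(endpoints, eps):
        """Assign a cluster label to each hill-climbing end point; end points
        closer than 2*eps are assumed to share the same local maximum."""
        labels = -np.ones(len(endpoints), dtype=int)
        n_clusters = 0
        for i, e in enumerate(endpoints):
            if labels[i] >= 0:
                continue                  # already grouped with an earlier end point
            labels[i] = n_clusters
            for j in range(i + 1, len(endpoints)):
                if labels[j] < 0 and np.linalg.norm(endpoints[j] - e) < 2 * eps:
                    labels[j] = n_clusters
            n_clusters += 1
        return labels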
Acceleration
Sparse EM (sketch below)
– update only the fraction p of points with the largest posterior
– saves the remaining fraction 1-p of kernel computations after the first iteration
Data reduction
– use only a fraction p of the data as representative points
– chosen by random sampling or as k-means centers
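A minimal sketch of the sparse-EM idea, assuming the Gaussian-kernel update from above; the fraction frac and iteration count are illustrative, and the active set is fixed after the first full E-step:

    import numpy as np

    def sparse_hill_climb(x, data, h, frac=0.2, n_iter=50):
        """Sparse-EM hill climbing: after one full E-step, only the fraction
        `frac` of points with the largest kernel weights is recomputed in later
        iterations; the remaining weights stay frozen at their last value."""
        w = np.exp(-np.sum((data - x)**2, axis=1) / (2 * h**2))   # full first E-step
        top = np.argsort(w)[-max(1, int(frac * len(data))):]      # active point set
        for _ in range(n_iter):
            x = (w[:, None] * data).sum(axis=0) / w.sum()         # M-step over all weights
            w[top] = np.exp(-np.sum((data[top] - x)**2, axis=1) / (2 * h**2))  # sparse E-step
        return x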
Experiments
Comparison of DENCLUE 1.0 (FS, fixed step size) vs. DENCLUE 2.0 (SSA, automatic step size adjustment)
– 16-dimensional artificial data
– both methods are tuned to find the correct clustering
Experiments
Comparison of the acceleration methods
Experiments
Clustering quality (normalized mutual information, NMI) vs. sample size for random sampling (RS)
Experiments
Cluster quality (NMI) of DENCLUE 2.0 (SSA), the acceleration methods, and k-means on real data
– sample sizes 0.8, 0.4, 0.2
Conclusion
– new hill climbing for DENCLUE
– automatic step size adjustment
– convergence proof by reduction to EM
– allows the application of general EM accelerations
Future work
– automatic setting of the smoothing parameter h (so far tuned manually)
Thank you for your attention!