Dimensionally distributed density estimation
Pasi Fränti and Sami Sieranoja, 24.1.2019
P. Fränti and S. Sieranoja, "Dimensionally distributed density estimation", Int. Conf. Artificial Intelligence and Soft Computing (ICAISC), Zakopane, Poland, 343-353, June 2018.
Density in clustering
Density in outlier detection
Definitions
Definition of density: Density = mass / volume. For point data, the mass is the number of points N inside a neighborhood, and the volume is determined by the neighborhood radius r.
Two ways to estimate density. Input: a point; output: the density around that point.
- Distance-based: fix the neighborhood radius (R) and count the points (N).
- Neighbor-based: fix the number of neighbors (k) and measure the size of the neighborhood (R).
Two ways to estimate density:
- Distance-based: input is the radius R, output is the point count within it.
- Neighbor-based: input is the number of neighbors k, output is the mean distance to them.
(Example figure: density values computed by both methods for a small point set.)
Summary:
- Distance-based: fixed constant R, measured quantity N (point count).
- Neighbor-based: fixed constant k, measured quantity R (distance).
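As a concrete illustration of the two estimators, here is a minimal brute-force sketch in Python (the NumPy-based helpers and their names are my own, not from the paper). Each query costs O(N), so estimating the density of every point this way costs O(N²):

import numpy as np

def distance_based_density(points, query, R):
    # Count how many points fall within radius R of the query point.
    dist = np.linalg.norm(points - query, axis=1)
    return int(np.sum(dist <= R))            # larger count  -> denser region

def neighbor_based_density(points, query, k):
    # Mean distance to the k nearest neighbors of the query point.
    dist = np.sort(np.linalg.norm(points - query, axis=1))
    return float(dist[1:k + 1].mean())       # smaller value -> denser region
                                             # (dist[0] is the query itself when
                                             #  it belongs to the data set)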
Choice of the parameters.
Distance-based:
- R = 10-100% * average distance to the data center [2]
- R = average pairwise distance of all data points [28]
- R = 90% * first peak in the pairwise distance histogram [17]
- R = 0.07 [26]
Neighbor-based:
- k = 10 [18]
- k = 30 [12]
- k = 10-100 [27]
- k = 30-200 [5]
- k = N [19]
- k = min{50, N/(2K)}, where K is the number of clusters
Bottleneck: finding the neighbors takes O(N²) time, both for the distance-based variant (compare d(x,y) against R for every other point) and for the neighbor-based variant (find the k nearest), as in the brute-force sketch above.
Dimensionally distributed density estimation (DDDE)
Density in categorical data: estimate the popularity of the individual attribute values (Cao, Liang, Bai, Expert Systems with Applications, 2009). Example records: [Zhang, Farmer, Mandarin] and [Malinen, Scientist, Finnish].
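A rough sketch of this categorical idea in Python (my own illustration of attribute-value popularity, not code from Cao et al.): count how often each attribute value occurs, and score a record by the summed counts of its values.

from collections import Counter

def categorical_density(records):
    # One frequency table per attribute position.
    n_attrs = len(records[0])
    counts = [Counter(rec[a] for rec in records) for a in range(n_attrs)]
    # Density of a record = summed popularity of its attribute values.
    return [sum(counts[a][rec[a]] for a in range(n_attrs)) for rec in records]

records = [("Zhang", "Farmer", "Mandarin"), ("Malinen", "Scientist", "Finnish")]
print(categorical_density(records))          # -> [3, 3] (every value occurs once)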
Sorting in each dimension: the points are sorted separately by their x-values and by their y-values, and a sliding window is moved along each sorted order.
Independent density estimates: the sliding window over the x-projection and the one over the y-projection each give a per-dimension estimate; the density value of a point is their sum (in the example, 0.6 + 0.6 = 1.2).
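A minimal sketch of this scheme in Python (my own simplified reading of DDDE, not the authors' implementation; window handling at the array boundaries is simplified). Each dimension is sorted, the mean gap to the window neighbors in that sorted order is taken as the per-dimension estimate, and the estimates are summed; as in the example above, a smaller sum means a denser neighborhood.

import numpy as np

def ddde(points, k=4):
    # Dimensionally distributed density estimate (simplified sketch).
    n, d = points.shape
    half = k // 2
    density = np.zeros(n)
    for dim in range(d):
        order = np.argsort(points[:, dim])     # O(N log N) per dimension
        x = points[order, dim]
        for i in range(n):                     # window of up to k sorted neighbors
            lo, hi = max(0, i - half), min(n - 1, i + half)
            gaps = np.abs(np.concatenate([x[lo:i], x[i + 1:hi + 1]]) - x[i])
            if gaps.size:
                density[order[i]] += gaps.mean()
    return density                             # smaller value -> denser region

The inner loop above costs O(N·k) per dimension; the sliding window technique on the next slide removes the factor k.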
Density estimates: visual comparison of DDDE against 2-NN based estimates.
Sliding window technique: the sums of the window values below (m-) and above (m+) the current point are maintained incrementally; when the window slides forward by one position, the value that leaves each half is subtracted and the value that enters it is added, so every update costs O(1).
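This incremental bookkeeping can be sketched as follows (an assumed reconstruction from the slide, not the authors' code; boundary positions with fewer than 'half' neighbors on a side are left out for brevity):

def window_sums(x, half):
    # x: one sorted coordinate column.  Returns, for each interior position i,
    # the running sums of the 'half' values below and above x[i].
    n = len(x)
    below, above = [None] * n, [None] * n
    s_below = sum(x[0:half])                        # x[i-half .. i-1] for i = half
    s_above = sum(x[half + 1:2 * half + 1])         # x[i+1 .. i+half] for i = half
    for i in range(half, n - half):
        below[i], above[i] = s_below, s_above
        if i + 1 < n - half:                        # slide both halves one step right
            s_below += x[i] - x[i - half]
            s_above += x[i + half + 1] - x[i + 1]
    return below, above

The per-dimension estimate at position i then follows directly from the two sums: the mean gap to the 2*half window neighbors is ((half*x[i] - below[i]) + (above[i] - half*x[i])) / (2*half) = (above[i] - below[i]) / (2*half), computed in O(1) per point.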
DDDE algorithm: sorting each of the D dimensions takes O(N log N) and the sliding-window pass over each sorted order takes O(N), giving O(D·N log N + D·N) in total.
Potential false detections
Experiments
Methods compared.
Clustering algorithms:
- Density-based sorting + k-means
- Density peaks [26]
- Repeated k-means (as reference point)
Density estimations:
- Full search, O(N²)
- Using a subsample (s = 2%), O(sN²)
- Using DDDE, O(N log N)
Datasets: S1, S2, S3, S4, Unbalance, Birch1, Birch2, DIM32, A1, A2, A3.
Centroid index (CI) [Fränti, Rezaei, Zhao, Pattern Recognition 2014]: the prototypes (pigeons) are mapped to the real clusters (pigeon holes), and CI counts the real clusters left without any prototype. In the example, 15 prototypes are mapped to 15 real clusters and four clusters remain empty, so CI = 4.
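A compact sketch of this measure in Python, as I read the definition in Fränti, Rezaei and Zhao (2014) (my reconstruction, not code from the paper): map each centroid of one solution to its nearest centroid of the other, count the "empty pigeon holes", and take the maximum over both mapping directions.

import numpy as np

def centroid_index(protos, truth):
    # Cluster-level difference between two sets of centroids (sketch).
    def orphans(a, b):
        # Map every centroid of a to its nearest centroid of b.
        nearest = {int(np.argmin(np.linalg.norm(b - p, axis=1))) for p in a}
        return len(b) - len(nearest)         # b-centroids left without a match
    return max(orphans(protos, truth), orphans(truth, protos))

CI = 0 indicates that the two solutions share the same cluster-level structure.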
Quality comparison (centroid index).
Effect on Density Peaks: full-search density estimates vs. DDDE.
Speed comparison (seconds).
Speed vs. quality
Time profiling
Time profiling
Conclusions:
- Rapid O(D·N log N) time algorithm.
- A remarkable 160:1 speed-up.
- Density estimation is no longer the bottleneck.