Dimensionally distributed density estimation
Pasi Fränti and Sami Sieranoja

P. Fränti and S. Sieranoja, "Dimensionally distributed density estimation", Int. Conf. Artificial Intelligence and Soft Computing (ICAISC), Zakopane, Poland, June 2018.

Density in clustering

Density in outlier detection

Definitions

Definition of density
Density = mass / volume
In the figure, the volume is the neighborhood of radius r around a point, and the mass is the number of points N inside it.

Two ways to estimate density
Input: a point. Output: the density around the point.
Distance-based: fix the neighborhood (R), count the points (N).
Neighbor-based: fix the number of points (N), measure the size of the neighborhood (R).

Distance-based: the input is the radius R, the output is the point count.
Neighbor-based: the input is the number of neighbors k, the output is the mean distance.

[Figure: distance-based point counts and neighbor-based mean distances for the example points.]

Summary
Distance-based: fixed constant R, measure N.
Neighbor-based: fixed constant N, measure R.
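The two estimators summarized above can be sketched with a full search. This is a minimal illustration, not the authors' code: the function names are hypothetical, Euclidean distance is assumed, and the full distance matrix makes both functions O(N^2).

```python
import numpy as np

def pairwise_distances(X):
    """Full O(N^2) Euclidean distance matrix (assumed metric)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def distance_based_density(X, R):
    """Fix the radius R, count the other points within it (larger = denser)."""
    d = pairwise_distances(X)
    return (d <= R).sum(axis=1) - 1  # subtract 1 to exclude the point itself

def neighbor_based_density(X, k):
    """Fix k, measure the mean distance to the k nearest points (smaller = denser)."""
    d = np.sort(pairwise_distances(X), axis=1)
    return d[:, 1:k + 1].mean(axis=1)  # column 0 is the zero self-distance

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
counts = distance_based_density(X, R=1.5)
mean_dists = neighbor_based_density(X, k=2)
```

Note that the two outputs point in opposite directions: a high count means high density, while a high mean distance means low density.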

Choice of the parameters
Distance-based:
  R = % * average distance to data center [2]
  R = Average pairwise distance of all data points [28]
  R = 90% * first peak in the pairwise distance histogram [17]
  R = 0.07 [26]
Neighbor-based:
  k = 10 [18]
  k = 30 [12]
  k = [27]
  k = [5]
  k = N [19]
  k = min{50, N/(2K)} where K is the number of clusters
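Two of the listed rules are simple enough to state in code. The sketch below is illustrative only: the function names are hypothetical, Euclidean distance is assumed, and rounding N/(2K) down with a lower bound of 1 is an assumption the slide does not state.

```python
import numpy as np

def radius_average_pairwise(X):
    """R = average pairwise distance of all data points (full O(N^2) search)."""
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=2))
    n = len(X)
    return d.sum() / (n * (n - 1))  # the zero self-distances are excluded from the count

def k_from_cluster_count(n_points, n_clusters):
    """k = min{50, N/(2K)}; the floor and the minimum of 1 are assumptions."""
    return max(1, min(50, n_points // (2 * n_clusters)))

R = radius_average_pairwise(np.array([[0.0, 0.0], [3.0, 4.0]]))
k_large = k_from_cluster_count(5000, 15)
k_small = k_from_cluster_count(300, 15)
```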

Bottleneck: finding neighbors
Full search takes O(N^2) time in both cases.
Distance-based: test d(x,y) > R.
Neighbor-based: find the k-nearest points.

Dimensionally distributed density estimation (DDDE)

Density in categorical data
Estimate the popularity of individual attributes. Cao, Liang, Bai, Expert Systems with Applications, 2009.
Example records (labeled A and B in the figure): [Zhang, Farmer, Mandarin], [Malinen, Scientist, Finnish]

Sorting in each dimension
Sort by x-values and apply a sliding window.
Sort by y-values and apply a sliding window.

Independent density estimates
[Figure: density is estimated independently on the x-projection and on the y-projection, each with a sliding window. The example point gets the density value 1.2.]

Density estimates
[Figure panels: DDDE, 2-NN]

Sliding window technique
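Sorted values: 17 21 26 29 44 47 67 75 77 88 95
Window at y[i] = 47: m- = 33 (mean of 26, 29, 44), m+ = 73 (mean of 67, 75, 77).
Window at y[i] = 67: m- = 40 (mean of 29, 44, 47), m+ = 80 (mean of 75, 77, 88).
Moving the window updates the left side by -26 +47 and the right side by -67 +88.
Figure labels: 15, 25.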
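The numbers on this slide are consistent with running means over the three values before and after y[i] in sorted order: 33 and 73 at y[i] = 47, then 40 and 80 at y[i] = 67, with each move subtracting the value that leaves a window and adding the one that enters. A minimal sketch of that update follows; the function name is hypothetical, and skipping positions that lack k neighbors on both sides is a simplification, not something the slide specifies.

```python
def sliding_window_means(sorted_values, k):
    """Return (y[i], m_minus, m_plus) for every i with k neighbors on each side.

    m_minus is the mean of the k values before y[i], m_plus the mean of the
    k values after it. Each step costs O(1): one value leaves and one value
    enters each window.
    """
    y = sorted_values
    n = len(y)
    results = []
    if n < 2 * k + 1:
        return results
    left_sum = sum(y[0:k])            # window before position k
    right_sum = sum(y[k + 1:2 * k + 1])  # window after position k
    for i in range(k, n - k):
        results.append((y[i], left_sum / k, right_sum / k))
        if i + 1 < n - k:
            left_sum += y[i] - y[i - k]           # e.g. -26 +47
            right_sum += y[i + k + 1] - y[i + 1]  # e.g. -67 +88
    return results

values = [17, 21, 26, 29, 44, 47, 67, 75, 77, 88, 95]
windows = sliding_window_means(values, k=3)
```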

DDDE algorithm
Total time: O(DN log N + DN)
Per dimension: sorting O(N log N), sliding window O(N)
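The overall structure (sort each dimension, scan it once, combine the per-dimension results) can be sketched as follows. Several parts are assumptions rather than the authors' method: the per-dimension score used here is the mean absolute distance to up to k predecessors and k successors in sorted order, the scores are summed over dimensions, and prefix sums stand in for the explicit add/subtract window update (both are O(N) per dimension). The function name `ddde_sketch` is hypothetical.

```python
import numpy as np

def ddde_sketch(X, k):
    """One score per point; a smaller value means a denser neighborhood.

    ASSUMPTIONS: mean absolute distance to the window neighbors as the
    per-dimension score, and summation across dimensions.
    """
    X = np.asarray(X, dtype=float)
    n, dims = X.shape
    score = np.zeros(n)
    idx = np.arange(n)
    lo = np.maximum(idx - k, 0)      # first index of the left window
    hi = np.minimum(idx + k, n - 1)  # last index of the right window
    n_left, n_right = idx - lo, hi - idx
    for d in range(dims):
        order = np.argsort(X[:, d], kind="stable")       # O(N log N)
        y = X[order, d]
        prefix = np.concatenate(([0.0], np.cumsum(y)))   # O(N)
        left_sum = prefix[idx] - prefix[lo]              # sum of y[lo..i-1]
        right_sum = prefix[hi + 1] - prefix[idx + 1]     # sum of y[i+1..hi]
        total = (n_left * y - left_sum) + (right_sum - n_right * y)
        # Write the result back to the original point order.
        score[order] += total / np.maximum(n_left + n_right, 1)
    return score

X = np.array([[10.0, 0.0], [0.0, 0.0], [2.0, 0.0], [1.0, 0.0]])
scores = ddde_sketch(X, k=1)
```

Because each dimension is handled separately, no pairwise distance between full D-dimensional points is ever computed, which is where the O(DN log N) bound comes from.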

Potential false detections

Experiments

Methods compared
Clustering algorithms:
  Density-based sorting + k-means
  Density peaks [26]
  Repeated k-means (as reference point)
Density estimations:
  Full search, O(N^2)
  Using subsample (s=2%), O(sN^2)
  Using DDDE, O(N log N)

Datasets
S1, S2, S3, S4, Unbalance, Birch1, Birch2, DIM32, A1, A2, A3

Centroid index (CI)
[Fränti, Rezaei, Zhao, Pattern Recognition 2014]
Example: 15 prototypes (pigeons) and 15 real clusters (pigeon holes). Four holes are empty, so CI = 4.
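The pigeonhole picture can be turned into a short computation: map every centroid of one solution to its nearest centroid in the other, count the centroids that receive no mapping (the empty holes), and take the larger count over both directions. The sketch below assumes Euclidean distance and uses hypothetical function names.

```python
import numpy as np

def orphan_count(A, B):
    """Map each centroid in A to its nearest in B; count centroids of B left unmapped."""
    diff = A[:, None, :] - B[None, :, :]
    nearest = (diff ** 2).sum(axis=2).argmin(axis=1)
    return len(B) - len(np.unique(nearest))

def centroid_index(A, B):
    """Symmetric centroid index: the larger orphan count of the two directions."""
    return max(orphan_count(A, B), orphan_count(B, A))

truth = np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 0.0]])
found = np.array([[0.0, 0.0], [0.5, 0.0], [20.0, 0.0]])  # two prototypes share one cluster
ci = centroid_index(found, truth)
```

A value of 0 means the two solutions have the same cluster-level structure; each unit above 0 corresponds to one wrongly allocated prototype.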

Quality comparison
[Figure: centroid index of the compared methods.]

Effect for density peaks
[Figure panels: full search, DDDE]

Speed comparison
[Figure: running time in seconds.]

Speed vs. quality

Time profiling

Conclusions
Rapid O(DN log N) time algorithm.
Remarkable 160:1 speed-up.
Density estimation is no longer the bottleneck.
