MAFIA: Adaptive Grids for Clustering Massive Data Sets Harsha Nagesh, Sanjay Goil, Alok Choudhary -Udeepta Bordoloi.


1 MAFIA: Adaptive Grids for Clustering Massive Data Sets Harsha Nagesh, Sanjay Goil, Alok Choudhary -Udeepta Bordoloi

2 Clustering Algorithms BIRCH ROCK CLIQUE –Inputs: grid size and density threshold –Prunes subspaces MAFIA –Adaptive grid size –Inputs: density threshold –No pruning of subspaces

3 Grids: CLIQUE way Along each dimension: Divide the whole range into intervals (windows) of size given by user. Threshold the number of points in each interval by the user input density to get clusters.

4 Grids: MAFIA way Along each dimension: Divide the whole range into many small windows. Compute a histogram (assuming discrete data here). –E.g., we can divide the range of natural numbers (1-15) into 5 windows (1-3, 4-6,…,13-15). Value of a window = max(histogram value within the window) –E.g., if there are three 1s, zero 2s, and five 3s, then the value of the first window (1-3) = five.
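The window-value rule can be sketched in a few lines of Python. This is only an illustration of the max-of-histogram rule stated on the slide; the function name `window_values` and its parameters are made up for this example, and the data is assumed to be discrete integers.

```python
from collections import Counter

def window_values(data, lo, hi, width):
    """Value of each fixed-width window over the integer range [lo, hi].

    A window's value is the largest histogram count of any single
    data value that falls inside the window.
    """
    hist = Counter(data)  # histogram of the discrete values
    values = []
    for start in range(lo, hi + 1, width):
        end = min(start + width - 1, hi)
        values.append(max(hist[v] for v in range(start, end + 1)))
    return values

# Three 1s, zero 2s, five 3s; the rest of the range 1-15 is empty.
data = [1, 1, 1, 3, 3, 3, 3, 3]
print(window_values(data, lo=1, hi=15, width=3))  # [5, 0, 0, 0, 0]
```

With the slide's counts, taking the maximum over the window 1-3 gives 5 (the count of the value 3).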

5 Grids: MAFIA way Along each dimension: (contd.) From left to right, merge adjacent windows whose values differ by less than a threshold β. –Could be made a user input, but the authors hard-coded it (25-75%). What if no partition can be detected? –Divide the range equally.
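A minimal sketch of the left-to-right merge follows. Several details are assumptions made for this example and are not given on the slide: the difference is measured relative to the larger of the two values, a merged window keeps the larger value, and the equal-division fallback splits the range into a hypothetical `fallback_parts` pieces.

```python
def merge_windows(values, beta):
    """Merge adjacent windows from left to right.

    values: per-window values (e.g. max histogram counts).
    beta:   merging threshold as a fraction in [0, 1] (assumed relative).
    Returns (first_index, last_index, value) for each merged window.
    """
    merged = [[0, 0, values[0]]]
    for i in range(1, len(values)):
        cur = merged[-1]
        larger = max(cur[2], values[i])
        # Relative difference; two empty windows count as identical.
        diff = abs(cur[2] - values[i]) / larger if larger else 0.0
        if diff < beta:
            cur[1] = i
            cur[2] = larger  # assumption: merged window keeps the larger value
        else:
            merged.append([i, i, values[i]])
    return [tuple(m) for m in merged]

def adaptive_windows(values, beta, fallback_parts=2):
    """Merge windows; if everything collapses into one, divide equally."""
    merged = merge_windows(values, beta)
    if len(merged) > 1:
        return merged
    n = len(values)
    size = -(-n // fallback_parts)  # ceiling division
    return [(s, min(s + size, n) - 1, max(values[s:s + size]))
            for s in range(0, n, size)]

print(adaptive_windows([5, 6, 5, 20, 22, 1], beta=0.5))
# [(0, 2, 6), (3, 4, 22), (5, 5, 1)]
print(adaptive_windows([4, 4, 4, 4], beta=0.5))
# [(0, 1, 4), (2, 3, 4)]
```

In the second call the data looks uniform, so no boundary is found and the range is simply cut into equal parts.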

6 Compare… CLIQUE MAFIA

7 Which windows are cluster candidates? CLIQUE: use user input threshold MAFIA: use user input threshold normalized to window size –Cluster dominance factor: α –Reports clusters as DNF expressions –Cluster candidates henceforth referred to as Candidate Dense Units (CDU)
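The slide says only that the threshold is normalized to window size. One plausible reading, assumed here, is that a window is a candidate when its count exceeds α times the count expected if the N points were spread uniformly over the dimension's range. The function names and the half-open window bounds below are hypothetical.

```python
def window_threshold(alpha, window_width, range_width, n_points):
    """Count expected under a uniform spread, scaled by alpha (assumed form)."""
    return alpha * n_points * window_width / range_width

def candidate_windows(windows, alpha, range_width, n_points):
    """Keep windows whose point count exceeds their own normalized threshold.

    windows: list of (low, high, count) with half-open bounds [low, high).
    """
    return [(lo, hi, c) for lo, hi, c in windows
            if c > window_threshold(alpha, hi - lo, range_width, n_points)]

# 1000 points over a range of width 100, split into three uneven windows.
windows = [(0, 10, 40), (10, 15, 400), (15, 100, 560)]
print(candidate_windows(windows, alpha=1.5, range_width=100, n_points=1000))
# [(10, 15, 400)]
```

The wide window holds the most points (560) yet is not a candidate, because its threshold (1275) grows with its width; the narrow window passes easily against a threshold of 75.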

8 Algorithm Initialization B = number of records that fit into memory. 1. Read data in chunks of B and build a histogram for each dimension. 2. Determine the adaptive windows for each dimension, and the normalized threshold for each window. 3. Get the candidate windows in each dimension. 4. Set the working dimension, k = 1.
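Step 1 can be sketched as a single pass that never holds more than B records at once. The helper names `chunks` and `build_histograms` are made up for this example, and records are assumed to be tuples of discrete values; a real implementation would read the chunks from disk.

```python
from collections import Counter
from itertools import islice

def chunks(records, b):
    """Yield lists of at most b records from any iterable."""
    it = iter(records)
    while True:
        chunk = list(islice(it, b))
        if not chunk:
            return
        yield chunk

def build_histograms(records, n_dims, b):
    """Build one histogram per dimension, reading the data chunk by chunk."""
    hists = [Counter() for _ in range(n_dims)]
    for chunk in chunks(records, b):
        for rec in chunk:
            for d in range(n_dims):
                hists[d][rec[d]] += 1
    return hists

records = [(1, 7), (1, 8), (3, 7), (3, 7), (2, 9)]
hists = build_histograms(records, n_dims=2, b=2)
print(hists[0][1], hists[0][3], hists[1][7])  # 2 2 3
```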

9 Main Loop Repeat: 1. k++; 2. Find candidate dense units (by combining dimensions); 3. Read through the data to find how many points lie in each of these CDUs; 4. Find the true dense units. Until (no more dense units found). Report the true dense units as clusters.
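Steps 3 and 4 of the loop amount to one pass over the data followed by a filter. The sketch below assumes a CDU is a frozenset of (dimension, window index) pairs and that a per-CDU threshold is supplied by the caller; how that threshold is derived from the per-window thresholds is not stated on the slide, so it is left as an input. All names are hypothetical.

```python
def count_points(records, cdus, windows):
    """Count how many records fall inside each CDU in one pass over the data.

    windows[d]: list of half-open (low, high) bounds for dimension d.
    cdus:       list of frozensets of (dimension, window_index) pairs.
    """
    counts = {cdu: 0 for cdu in cdus}
    for rec in records:
        for cdu in cdus:
            if all(windows[d][w][0] <= rec[d] < windows[d][w][1]
                   for d, w in cdu):
                counts[cdu] += 1
    return counts

def true_dense_units(counts, thresholds):
    """Keep the CDUs whose count exceeds the caller-supplied threshold."""
    return [cdu for cdu, c in counts.items() if c > thresholds[cdu]]

windows = {0: [(0, 5), (5, 10)], 1: [(0, 5), (5, 10)]}
low_low = frozenset({(0, 0), (1, 0)})
high_high = frozenset({(0, 1), (1, 1)})
records = [(1, 1), (2, 3), (6, 7), (1, 2), (7, 1)]

counts = count_points(records, [low_low, high_high], windows)
print(counts[low_low], counts[high_high])  # 3 1
dense = true_dense_units(counts, {low_low: 2, high_high: 2})
print(dense == [low_low])  # True
```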

10 Building CDUs CDUs in k dimensions –merge two dense units of (k-1) dimensions. –such that they share any (k-2) dimensions. –each dense unit of (k-1) dims has to be compared with every other dense unit. –can lead to duplicate CDUs, so every CDU is compared with every other CDU to remove them. Dense units that cannot be combined are potential clusters (in a subspace).
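The merge rule can be sketched as follows, assuming each dense unit is represented as a frozenset of (dimension, window index) pairs. That representation and the name `build_cdus` are choices made for this example. Collecting results in a Python set removes duplicates automatically, standing in for the explicit CDU-versus-CDU comparison pass described on the slide.

```python
from itertools import combinations

def build_cdus(dense_units):
    """Combine (k-1)-dimensional dense units into k-dimensional CDUs.

    dense_units: iterable of frozensets of (dimension, window_index) pairs,
                 all of the same size k-1.
    """
    cdus = set()
    for a, b in combinations(list(dense_units), 2):
        merged = a | b
        # Sharing k-2 pairs means the union has exactly k pairs.
        if len(merged) != len(a) + 1:
            continue
        # The two non-shared pairs must lie in different dimensions.
        if len({d for d, _ in merged}) == len(merged):
            cdus.add(merged)
    return cdus

# Three 2D dense units; every pair of them yields the same 3D CDU.
units = [
    frozenset({(0, 1), (1, 2)}),
    frozenset({(1, 2), (2, 3)}),
    frozenset({(0, 1), (2, 3)}),
]
cdus = build_cdus(units)
print(len(cdus))  # 1
```

The three pairwise merges all produce the unit covering window 1 of dimension 0, window 2 of dimension 1, and window 3 of dimension 2, which is why duplicate elimination is needed.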

11 Building CDU example (2D → 3D) We can get repeated CDUs. Two passes required: 1. To combine two 2D units into one 3D unit. 2. To eliminate repeated CDUs.

12 Variables (Recap) Cluster dominance factor, α: –High α, strong clusters and vice-versa. –Usual value: 1.5 Window merging threshold, β: –High β, more merging and hence coarser windows, and vice-versa.

13 MAFIA vs. CLIQUE (speedup) CLIQUE used –without pruning. –with 10 bins for each dimension. –with different thresholds ?

14 MAFIA vs. CLIQUE (number of CDUs computed) Single 7D cluster in a 10D data space CLIQUE: 75 6D clusters, 546 7D clusters

15 MAFIA vs. CLIQUE (quality) Two 4D clusters in a 10D data space. CLIQUE: cluster boundary very unreliable –On using a variable number of (fixed size) bins in each dimension (how?), it misses one cluster.

16 MAFIA (scalability)

