DB Seminar Series: The Subspace Clustering Problem By: Kevin Yip (17 May 2002)


1 DB Seminar Series: The Subspace Clustering Problem By: Kevin Yip (17 May 2002)

2 Presentation Outline Problem definition Different approaches Focus: the projective clustering approach

3 Problem Definition – Traditional Clustering Traditional clustering problem: to divide data points into disjoint groups such that the value of an objective function is optimized. Objective function: minimize intra-cluster distance and maximize inter-cluster distance. Distance function: defined over all dimensions, numeric or categorical.

4 Problem Definition – Traditional Clustering Example Problem: clustering points in 2-D space. Distance function: Euclidean distance, dist(p, q) = sqrt( Σ_{i=1..d} (p_i − q_i)^2 ), where d is the number of dimensions (2 in this case).
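As a small illustration of the distance function on this slide, the sketch below computes the Euclidean distance over all d dimensions. The function name `euclidean_distance` and the sample points are illustrative assumptions, not part of the original slides.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# d = 2: the two points are the ends of the hypotenuse of a 3-4-5 triangle.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```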

5 Problem Definition – Traditional Clustering Example (source: CURE, SIGMOD 1998)

6 Problem Definition – Distance Function Problem Observation: distance measures defined over all dimensions are sometimes inappropriate. Example (source: DOC, SIGMOD 2002): C_1: (x_1, x_2), C_2: (x_2, x_3), C_3: (x_1, x_3)

7 Problem Definition – Distance Function Problem As the number of noise dimensions increases, distance functions defined over all dimensions become less and less accurate. => For each cluster, besides the set of data points, we also need to find the set of “related dimensions” (“bounded attributes”).

8 Problem Definition – The Subspace Clustering Problem Formal Definition: Given a dataset of N data points and d dimensions, we want to divide the points into k disjoint clusters, each relating to a subset of dimensions, such that an objective function is optimized. Objective function: usually intra-cluster distance, each cluster uses its own set of dimensions in distance calculation.

9 Problem Definition – The Subspace Clustering Problem Observation: normal distance functions (Manhattan, Euclidean, etc.) give a smaller value if fewer dimensions are involved. => 1. Use a normalized distance function. => 2. Should also try to maximize the number of dimensions. Example (DOC): score(C, D) = |C| · (1/β)^|D|, where C is the set of points in a cluster, D is the set of related attributes, and β is a constant.
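To make the trade-off in the DOC-style score concrete, here is a minimal sketch. It assumes β < 1, so that each extra dimension multiplies the score by 1/β > 1; the default β = 0.25 and the name `doc_score` are illustrative assumptions.

```python
def doc_score(num_points, num_dims, beta=0.25):
    """DOC-style quality score |C| * (1/beta)^|D|.

    beta = 0.25 is an assumed value chosen for illustration only.
    """
    return num_points * (1.0 / beta) ** num_dims

# With beta = 0.25, one extra dimension is worth a 4x increase in points:
print(doc_score(100, 3))  # 6400.0
print(doc_score(400, 2))  # 6400.0
```

Under this score, a cluster with fewer points can still win if it is bounded in more dimensions, which is the second requirement on the slide.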

10 Different Approaches – Overview Grid-based dimension selection Association rule hypergraph partitioning Context-specific Bayesian clustering Projective clustering (Focus)

11 Different Approaches – Grid-Based Dimension Selection CLIQUE (98), ENCLUS (99), MAFIA (99), etc. Basic idea: A cluster is a region with high density. – Divide the domain of each dimension into units. – For each dimension, find all dense units – units with many points. – Merge neighboring dense units into “clusters”. – After finding all 1-d clusters, find 2-d dense units. – Repeat with higher dimensions.
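The first steps of the grid-based idea (find dense 1-d units, then merge neighbours) can be sketched as follows. This is not the CLIQUE implementation; the function names, the equal-width grid, and the sample values are assumptions made for illustration.

```python
from collections import Counter

def dense_units_1d(values, low, high, num_units, threshold):
    """Return the sorted indices of equal-width units holding >= threshold values."""
    width = (high - low) / num_units
    counts = Counter()
    for v in values:
        # Clamp so that v == high falls into the last unit.
        idx = min(int((v - low) / width), num_units - 1)
        counts[idx] += 1
    return sorted(i for i, c in counts.items() if c >= threshold)

def merge_neighbors(unit_indices):
    """Merge runs of consecutive unit indices into (first, last) clusters."""
    clusters = []
    for i in unit_indices:
        if clusters and i == clusters[-1][1] + 1:
            clusters[-1] = (clusters[-1][0], i)
        else:
            clusters.append((i, i))
    return clusters

xs = [0.5, 0.7, 0.9, 1.1, 1.4, 1.8, 5.2, 7.1, 7.3, 7.6]
dense = dense_units_1d(xs, low=0.0, high=10.0, num_units=10, threshold=3)
print(dense)                   # [0, 1, 7]
print(merge_neighbors(dense))  # [(0, 1), (7, 7)]
```

Higher-dimensional units would be built only from combinations of dense lower-dimensional units, which is where the exponential dependency on dimensionality comes from.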

12 Different Approaches – Grid-Based Dimension Selection A 2-D dataset for illustration:

13 Different Approaches – Grid-Based Dimension Selection 1. Divide the domain of each dimension into sub-units.

14 Different Approaches – Grid-Based Dimension Selection 2. Find all dense units – units with many points. (assume density threshold = 3 points)

15 Different Approaches – Grid-Based Dimension Selection 3. Merge neighboring dense units into “clusters”.

16 Different Approaches – Grid-Based Dimension Selection 4. Find 2-d dense units. Merge neighboring dense units, if any.

17 Different Approaches – Grid-Based Dimension Selection 5. Repeat with higher dimensions.

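18 Different Approaches – Grid-Based Dimension Selection Results: the 1-d and 2-d clusters found (the cluster lists shown on the original slide are not preserved in this transcript).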

19 Different Approaches – Grid-Based Dimension Selection Problems with the grid-based dimension selection approach: – Non-disjoint clusters. – Exponential dependency on the number of dimensions.

20 Different Approaches – Association Rule Hypergraph Partitioning (1997) Cluster related items (attribute values) using association rules, then cluster related transactions (data points) using the clusters of items.

21 Different Approaches – Association Rule Hypergraph Partitioning Procedures: 1. Find all frequent itemsets in the dataset. 2. Construct a hypergraph with each item as a vertex, and each hyperedge corresponding to a frequent itemset. (If {A, B, C} is a frequent itemset, there is a hyperedge connecting the vertices of A, B, and C.)

22 Different Approaches – Association Rule Hypergraph Partitioning Procedures: 3. Each hyperedge is assigned a weight equal to a function of the confidences of all the association rules between the connecting items. (If there are association rules {A}=>{B,C} (c. 0.8), {A,B}=>{C} (c. 0.4), {A,C}=>{B} (c. 0.6), {B}=>{A,C} (c. 0.4), {B,C}=>{A} (c. 0.8) and {C}=>{A,B} (c. 0.6), then the weight of the hyperedge ABC can be the average of the confidences, i.e. 0.6)
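The hyperedge weight from the example on this slide can be checked with a few lines. Averaging is only one possible weighting function, as the slide notes; the name `hyperedge_weight` and the dictionary layout are illustrative assumptions.

```python
def hyperedge_weight(confidences):
    """Average the confidences of all rules among the items of one hyperedge."""
    values = list(confidences.values())
    return sum(values) / len(values)

# Rule confidences for the frequent itemset {A, B, C}, taken from the slide.
abc_rules = {
    "A => B,C": 0.8,
    "A,B => C": 0.4,
    "A,C => B": 0.6,
    "B => A,C": 0.4,
    "B,C => A": 0.8,
    "C => A,B": 0.6,
}
print(round(hyperedge_weight(abc_rules), 2))  # 0.6
```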

23 Different Approaches – Association Rule Hypergraph Partitioning Procedures: 4. Use a hypergraph partitioning algorithm (e.g. hMETIS, 97) to divide the hypergraph into k partitions, so that the sum of the weights of the hyperedges that straddle partitions is minimized. Each partition forms a cluster with a different subset of items. 5. Assign each transaction to a cluster, based on a scoring function (e.g. percentage of matched items).
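Step 5 can be sketched as below. The slide only says "percentage of matched items"; this sketch assumes that means the fraction of the transaction's items that appear in an item cluster. The function name and the sample item clusters are hypothetical.

```python
def assign_transaction(transaction, item_clusters):
    """Return the index of the item cluster that matches the largest
    fraction of the transaction's items (assumed scoring function)."""
    items = set(transaction)
    if not items:
        raise ValueError("transaction must contain at least one item")

    def score(i):
        return len(items & item_clusters[i]) / len(items)

    # max() returns the first index on ties.
    return max(range(len(item_clusters)), key=score)

item_clusters = [{"A", "B", "C"}, {"D", "E"}]
print(assign_transaction(["A", "B", "E"], item_clusters))  # 0
print(assign_transaction(["D", "E", "A"], item_clusters))  # 1
```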

24 Different Approaches – Association Rule Hypergraph Partitioning Problems with the association rule hypergraph partitioning approach: – In real clusters, an item can be related to multiple clusters. – May not be applicable to numeric attributes.

25 Different Approaches – Context-Specific Bayesian Clustering Naïve-Bayesian classification: given a training set with classes C_i (i = 1..k), a data point with attribute values x_1, x_2, …, x_d is classified by
  P(C=C_i | x_1, x_2, …, x_d)
  = P(x_1, x_2, …, x_d | C=C_i) P(C=C_i) / P(x_1, x_2, …, x_d)
  ∝ P(x_1, x_2, …, x_d | C=C_i) P(C=C_i)
  = P(x_1|C=C_i) P(x_2|C=C_i) … P(x_d|C=C_i) P(C=C_i)
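The classification rule above can be sketched for categorical attributes as follows. The data layout (`priors`, `cond_probs`) and all probability values are made-up assumptions for illustration; a real classifier would estimate them from the training set.

```python
def naive_bayes_classify(x, priors, cond_probs):
    """Pick the class maximizing P(C=c) * prod_j P(x_j | C=c).

    priors[c] = P(C=c); cond_probs[c][j][v] = P(x_j = v | C=c).
    """
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for j, value in enumerate(x):
            score *= cond_probs[c][j].get(value, 0.0)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

priors = {"C1": 0.5, "C2": 0.5}
cond_probs = {
    "C1": [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}],
    "C2": [{"a": 0.3, "b": 0.7}, {"a": 0.6, "b": 0.4}],
}
print(naive_bayes_classify(("a", "b"), priors, cond_probs))  # C1
print(naive_bayes_classify(("b", "a"), priors, cond_probs))  # C2
```

The denominator P(x_1, …, x_d) is the same for every class, so it can be dropped when only the best class is needed.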

26 Different Approaches – Context-Specific Bayesian Clustering (a RECOMB 2001 paper) Context-specific independence (CSI) model: each attribute X_i depends only on the classes in a set L_i. E.g. if k = 5 and L_1 = {1, 4}, then P(X_1|C=C_2) = P(X_1|C=C_3) = P(X_1|C=C_5) = P(X_1|C=C_def), i.e. the classes outside L_1 share one default distribution for X_1.

27 Different Approaches – Context-Specific Bayesian Clustering A CSI model M contains: k – the number of classes. G – the set of attributes that depend on some classes. L_i – the “local structures” of the attributes. Parameters for a CSI model, θ_M: P(C=C_i), P(X_i|L_i=C_j)

28 Different Approaches – Context-Specific Bayesian Clustering Recall that P(C=C_i | x_1, x_2, …, x_d) ∝ P(x_1|C=C_i) P(x_2|C=C_i) … P(x_d|C=C_i) P(C=C_i). In the CSI model, the right-hand side equals P(X_1|L_1=C_j) P(X_2|L_2=C_j) … P(X_d|L_d=C_j) P(C=C_i). So, for a dataset without class labels, if we can guess a CSI model and its parameters, then we can assign each data point to a class => clustering.

29 Different Approaches – Context-Specific Bayesian Clustering Searching for the best model and parameters: – Define a score to rank the current model and parameters (BIC(M, θ_M) or CS(M, θ_M)). – Randomly pick a model and a set of parameters and calculate the score. – Try modifying the model (e.g. add an attribute to a local structure) and recalculate the score. – If the score is better, keep the change and try modifying a parameter.

30 Different Approaches – Context-Specific Bayesian Clustering Repeat until a stopping criterion is reached (e.g. using simulated annealing). (M_1, θ_M1) -> (M_2, θ_M1) -> (M_2, θ_M2) -> (M_3, θ_M2) -> …

31 Different Approaches – Context-Specific Bayesian Clustering The scoring functions (just have a taste):

32 Different Approaches – Context-Specific Bayesian Clustering Problems with the context-specific Bayesian clustering approach: – Cluster quality and execution time are not guaranteed. – The search can easily get stuck in a local minimum.

33 Focus: The Projective Clustering Approach PROCLUS (99), ORCLUS (00), etc. K-medoid partitional clustering. Basic idea: use a set of sample points to determine the relating dimensions for each cluster, assign points to the clusters according to the dimension sets, throw away some bad medoids and repeat.

34 Focus: The Projective Clustering Approach Algorithm (3 phases): Initialization phase – Input k: target number of clusters. – Input l: average number of dimensions in a cluster. – Draw Ak samples randomly from the dataset, where A is a constant. – Use max-min algorithm to draw Bk points from the sample, where B is a constant < A. Call this set of points M.
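The max-min step in the initialization phase can be sketched as a greedy farthest-first selection: repeatedly add the sample point whose distance to its nearest already-chosen point is largest. This is a sketch under assumptions: it uses Manhattan distance and a fixed starting index (`first_index`) instead of a random start so that the result is reproducible; all names are illustrative.

```python
def manhattan(p, q):
    """Manhattan distance over all dimensions."""
    return sum(abs(a - b) for a, b in zip(p, q))

def max_min_select(points, count, first_index=0):
    """Greedy max-min selection of `count` well-separated points."""
    chosen = [points[first_index]]
    # nearest[i] = distance from points[i] to its closest chosen point.
    nearest = [manhattan(p, chosen[0]) for p in points]
    while len(chosen) < count:
        idx = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(points[idx])
        nearest = [min(d, manhattan(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return chosen

sample = [(0, 0), (1, 0), (10, 0), (10, 10), (0, 9)]
print(max_min_select(sample, 3))  # [(0, 0), (10, 10), (10, 0)]
```

Picking points that are far apart makes it more likely that the candidate set M contains at least one point from every true cluster.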

35 Focus: The Projective Clustering Approach Iterative Phase – Draw k medoids from M. – For each medoid m_i, calculate the Manhattan distance δ_i (involving all dimensions) to the nearest other medoid. – Find all points in the whole dataset that are within a distance δ_i of m_i.
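The locality computation above can be sketched as follows. Treating "within a distance δ_i" as "less than or equal to δ_i" is an assumption, as are the function names and the sample data.

```python
def manhattan(p, q):
    """Manhattan distance over all dimensions."""
    return sum(abs(a - b) for a, b in zip(p, q))

def localities(points, medoids):
    """For each medoid m_i return (delta_i, points within delta_i of m_i),
    where delta_i is the distance from m_i to the nearest other medoid."""
    result = []
    for i, m in enumerate(medoids):
        delta = min(manhattan(m, other)
                    for j, other in enumerate(medoids) if j != i)
        near = [p for p in points if manhattan(p, m) <= delta]
        result.append((delta, near))
    return result

points = [(0, 0), (1, 2), (1, -2), (8, 0), (7, 2), (20, 20)]
medoids = [(0, 0), (8, 0)]
for delta, near in localities(points, medoids):
    print(delta, near)
# 8 [(0, 0), (1, 2), (1, -2), (8, 0)]
# 8 [(0, 0), (8, 0), (7, 2)]
```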

36 Focus: The Projective Clustering Approach Finding the set of surrounding points for a medoid (figure labels: A, B, C, δ).

37 Focus: The Projective Clustering Approach – The average distance between the surrounding points and the medoid is calculated along each dimension. – Among all k·d (medoid, dimension) pairs, select the k·l with exceptionally small average distances. An extra restriction is that each medoid must pick at least 2 dimensions. – Whether the distance from the medoid along a particular dimension is “exceptionally small” in a cluster is determined by its standard score: Z_ij = (X_ij − Y_i) / σ_i, where X_ij is the average distance from medoid i along dimension j, Y_i is the mean of X_ij over all d dimensions, and σ_i is their sample standard deviation.

38 Focus: The Projective Clustering Approach Scoring dimensions (figure labels: A, B, C).

39 Focus: The Projective Clustering Approach Example: – In cluster C_1, the average distances from the medoid are 10 along dimension D_1, 15 along D_2 and 13 along D_3. In cluster C_2, the average distances are 7, 6 and 12. – Mean(C_1) = (10 + 15 + 13) / 3 = 12.67. – S.D.(C_1) = 2.52. – Z(C_1 D_1) = (10 − 12.67) / 2.52 = −1.06. – Similarly, Z(C_1 D_2) = 0.93, Z(C_1 D_3) = 0.13, Z(C_2 D_1) = −0.41, Z(C_2 D_2) = −0.73, Z(C_2 D_3) = 1.14. – So the order in which to pick the dimensions is C_1 D_1 -> C_2 D_2 -> C_2 D_1 -> C_1 D_3 -> C_1 D_2 -> C_2 D_3.
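The worked example can be reproduced with the short sketch below. It uses the sample standard deviation, which is what yields S.D.(C_1) = 2.52. The greedy handling of the "at least 2 dimensions per medoid" restriction is one plausible reading, not necessarily the exact PROCLUS procedure; indices are 0-based and all names are illustrative.

```python
from statistics import mean, stdev

def dimension_z_scores(avg_dist):
    """avg_dist[i][j] = average distance from medoid i along dimension j.
    Returns {(i, j): z-score} using each medoid's own mean and sample S.D."""
    scores = {}
    for i, row in enumerate(avg_dist):
        mu, sigma = mean(row), stdev(row)  # assumes the row is not constant
        for j, x in enumerate(row):
            scores[(i, j)] = (x - mu) / sigma
    return scores

def pick_dimensions(avg_dist, total, min_per_cluster=2):
    """Pick `total` (medoid, dimension) pairs with the smallest z-scores,
    first guaranteeing min_per_cluster dimensions for every medoid."""
    scores = dimension_z_scores(avg_dist)
    ranked = sorted(scores, key=scores.get)
    chosen = []
    for i in range(len(avg_dist)):
        own = [pair for pair in ranked if pair[0] == i]
        chosen.extend(own[:min_per_cluster])
    for pair in ranked:
        if len(chosen) >= total:
            break
        if pair not in chosen:
            chosen.append(pair)
    return sorted(chosen)

avg = [[10, 15, 13], [7, 6, 12]]  # rows: C_1, C_2; columns: D_1, D_2, D_3
z = dimension_z_scores(avg)
order = sorted(z, key=z.get)
print(order)  # [(0, 0), (1, 1), (1, 0), (0, 2), (0, 1), (1, 2)]
print(pick_dimensions(avg, total=4))  # [(0, 0), (0, 2), (1, 0), (1, 1)]
```

The printed order matches the slide: C_1 D_1, C_2 D_2, C_2 D_1, C_1 D_3, C_1 D_2, C_2 D_3.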

40 Focus: The Projective Clustering Approach Iterative Phase (cont’d) – Now, each medoid has a related set of dimensions. Assign each point in the whole dataset to its closest medoid (using a normalized distance function involving only the selected dimensions of that medoid). – Calculate the overall score of the clustering. Record the cluster definitions (relating attributes and assigned points) if the score is the new best one. – Throw away medoids with too few points. Replace them with points remaining in M.
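The assignment step can be sketched with a Manhattan distance restricted to each medoid's dimensions and divided by the number of dimensions used (the PROCLUS paper calls this the Manhattan segmental distance). Function names and sample data are illustrative assumptions.

```python
def segmental_distance(p, q, dims):
    """Manhattan distance over `dims` only, normalized by len(dims)."""
    return sum(abs(p[j] - q[j]) for j in dims) / len(dims)

def assign_points(points, medoids, dim_sets):
    """Label each point with the index of its closest medoid, measuring the
    distance to medoid i only along that medoid's own dimensions."""
    labels = []
    for p in points:
        labels.append(min(
            range(len(medoids)),
            key=lambda i: segmental_distance(p, medoids[i], dim_sets[i]),
        ))
    return labels

medoids = [(0, 0, 0), (10, 10, 10)]
dim_sets = [(0, 1), (1, 2)]  # 0-based dimension indices per medoid
points = [(1, 1, 50), (50, 9, 11), (0, 0, 9)]
print(assign_points(points, medoids, dim_sets))  # [0, 1, 0]
```

Dividing by the number of dimensions keeps medoids with many selected dimensions from being penalized relative to medoids with few.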

41 Focus: The Projective Clustering Approach Refinement Phase – After determining the best set of medoids, use the assigned points to re-determine the sets of dimensions, and reassign all points. – If the distance between a point and its medoid is longer than the distance between the medoid and its closest medoid, the point is marked as an outlier.

42 Focus: The Projective Clustering Approach Experiment: – Dataset: synthetic, 100,000 points, 20 dimensions. – Set 1: 5 clusters, each with 7 dimensions. – Set 2: 5 clusters, with 2-7 dimensions. – Machine: 233-MHz IBM RS/6000, 128M RAM, running AIX. Dataset stored on a 2GB SCSI drive. – Comparison: CLIQUE (grid-based)

43 Focus: The Projective Clustering Approach Result accuracy (set 1):

Actual clusters
Input      Dimensions                   Points
A          3, 4, 7, 9, 14, 16, 17       21391
B          3, 4, 7, 12, 13, 14, 17      23278
C          4, 6, 11, 13, 14, 17, 19     18245
D          4, 7, 9, 13, 14, 16, 17      15728
E          3, 4, 9, 12, 14, 16, 17      16357
Outliers   -                            5000

PROCLUS results
Found      Dimensions                   Points
1          4, 6, 11, 13, 14, 17, 19     18701
2          3, 4, 7, 9, 14, 16, 17       21915
3          3, 4, 7, 12, 13, 14, 17      23975
4          4, 7, 9, 13, 14, 16, 17      16018
5          3, 4, 9, 12, 14, 16, 17      16995
Outliers   -                            2396

44 Focus: The Projective Clustering Approach Result accuracy (set 1): InputABCDEOut. Output 1001824500456 2213910000523 312327801010697 4000157280290 5000016357638 Out.000002396 Confusion Matrix of PROCLUS

45 Focus: The Projective Clustering Approach Result accuracy (set 1): InputABCDEOut. Output 21112800000 151951000000 3100010100 3200011100 470001284900 Confusion Matrix of CLIQUE

46 Focus: The Projective Clustering Approach Result accuracy (set 2):

Actual clusters
Input      Dimensions                   Points
A          2, 3, 4, 9, 11, 14, 18       21391
B          2, 3, 7                      23278
C          2, 12                        18245
D          2, 3, 4, 12, 13, 17          15728
E          2, 4                         16357
Outliers   -                            5000

PROCLUS results
Found      Dimensions                   Points
1          2, 3, 7                      22051
2          2, 4                         16800
3          2, 3, 4, 12, 13, 17          15387
4          2, 12                        18970
5          2, 3, 4, 9, 11, 14, 18       21498
Outliers   -                            5294

47 Focus: The Projective Clustering Approach Result accuracy (set 2):

Confusion Matrix of PROCLUS (rows: output clusters, columns: input clusters)
Output   A       B       C       D       E       Out.
1        0       20992   267     416     18      358
2        34      0       0       0       16097   669
3        0       9       1       15309   10      58
4        0       2256    16536   0       0       178
5        21357   0       0       2       10      129
Out.     0       21      1441    1       222     3609

48 Focus: The Projective Clustering Approach Scalability (with dataset size):

49 Focus: The Projective Clustering Approach Scalability (with average dimensionality):

50 Focus: The Projective Clustering Approach Scalability (with space dimensionality):

51 Focus: The Projective Clustering Approach Problems with the projective clustering approach: – Need to know l, the average number of dimensions, in advance. – A cluster with a very small number of selected dimensions will absorb the points of other clusters. – Using a distance measure over the whole dimension space to select the sets of dimensions may not be accurate, especially when the number of noise attributes is large.

52 Summary The subspace clustering problem: given a dataset of N data points and d dimensions, we want to divide the points into k disjoint clusters, each relating to a subset of dimensions, such that an objective function is optimized. Grid-based dimension selection Association rule hypergraph partitioning Context-specific Bayesian clustering Projective clustering

53 References Grid-based dimension selection: – “Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications” (SIGMOD 1998) – “Entropy-based Subspace Clustering for Mining Numerical Data” (SIGKDD 1999) – “MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets” (Technical Report 9906-010, Northwestern University 1999) Association rule hypergraph partitioning: – “Clustering Based On Association Rule Hypergraphs” (Clustering Workshop 1997)

54 References – “Multilevel Hypergraph Partitioning: Application in VLSI Domain” (DAC 1997) Context-specific Bayesian clustering: – “Context-Specific Bayesian Clustering for Gene Expression Data” (RECOMB 2001) Projective clustering – “Fast Algorithms for Projected Clustering” (SIGMOD 1999) – “Finding Generalized Projected Clusters in High Dimensional Spaces” (SIGMOD 2000) – “A Monte Carlo Algorithm for Fast Projective Clustering” (SIGMOD 2002)

