DB Seminar Series: Semi-supervised Projected Clustering By: Kevin Yip (4th May 2004)
Outline Introduction –Projected clustering –Semi-supervised clustering Our problem Our new algorithm Experimental results Future works and extensions
Projected Clustering Where are the clusters?
Projected Clustering Pattern-based projected cluster:
Projected Clustering Goal: to discover clusters and their relevant dimensions that optimize a certain objective function. Previous approaches: –Partitional: PROCLUS, ORCLUS –One cluster at a time: DOC, FastDOC, MineClus –Hierarchical: HARP
Projected Clustering Limitations of the approaches: –Cannot detect clusters of extremely low dimensionalities (clusters with low percentage of relevant dimensions, e.g. only 5% of input dimensions are relevant) –Require the input of parameter values that are hard for users to supply –Performance sensitive to the parameter values –High time complexity
Semi-supervised Clustering In some applications, there is usually a small amount of domain knowledge available (e.g. the functions of 5% of the genes probed on a microarray). The knowledge may not be suitable/sufficient for carrying out classification. Clustering algorithms make little use of external knowledge.
Semi-supervised Clustering The idea of semi-supervised clustering: –Use the models implicitly assumed behind a clustering algorithm (e.g. compact hyperspheres of k-means, density-connected irregular regions of DBSCAN) –Use external knowledge to guide the tuning of model parameters (e.g. location of cluster centers)
Semi-supervised Clustering Why not clustering? –The clusters produced may not be the ones required. –There could be multiple possible groupings. –There is no way to utilize the domain knowledge that is accessible (active learning vs. passive validation). (Guha et al., 1998)
Semi-supervised Clustering Why not classification? –There is insufficient labeled data: Objects are not labeled. The amount of labeled objects is statistically insignificant. The labeled objects do not cover all classes. The labeled objects of a class do not cover all cases (e.g. they are all found at one side of a class). –It is not always possible to find a classification method with an underlying model that fits the data (e.g. pattern-based similarity).
Our Problem Data Model: –The input dataset has n objects and d dimensions –The dataset contains k disjoint clusters, and possibly some outlier objects –Each cluster is associated with a set of relevant dimensions –If a dimension is relevant to a cluster, the projections of the cluster members on the dimension are random samples of a local Gaussian distribution –Other projections are random samples of a global distribution (e.g. uniform distribution or Gaussian distribution with a standard deviation much larger than those of the local distributions)
Our Problem Resulting data: if a dimension is relevant to a cluster, the projections of its members on the dimension will be close to each other (the within-cluster variance is much smaller than on irrelevant dimensions). Example: X Y Z C1 N(5, 1) C2 N(8, 1) N(6, 1) U(0, 10) C3 U(0, 10)
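The data model can be illustrated with a small simulation. The following is a minimal sketch, not the speaker's generator: it samples members of cluster C2 from the example, assuming X and Y are its relevant dimensions and Z follows the global uniform distribution. The helper name `sample_cluster`, the cluster size of 200 and the random seed are illustrative assumptions.

```python
import random
import statistics

def sample_cluster(n_members, samplers):
    """Draw n_members objects; samplers maps a dimension name to a zero-argument sampler."""
    return [{name: draw() for name, draw in samplers.items()} for _ in range(n_members)]

rng = random.Random(0)  # fixed seed so that the run is repeatable (an assumption)

# Cluster C2: X ~ N(8, 1) and Y ~ N(6, 1) are relevant, Z ~ U(0, 10) is irrelevant.
c2_samplers = {
    "X": lambda: rng.gauss(8, 1),
    "Y": lambda: rng.gauss(6, 1),
    "Z": lambda: rng.uniform(0, 10),
}
c2 = sample_cluster(200, c2_samplers)

# Within-cluster sample variance on each dimension.
variances = {name: statistics.variance([obj[name] for obj in c2]) for name in c2_samplers}
print(variances)
```

The variances printed for X and Y should be close to 1, while the one for Z should be close to the variance of U(0, 10), which is 100/12.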
Our Problem Problem definition: –Inputs: The dataset D; the target number of clusters k; a (possibly empty) set $I_o$ of labeled objects (obj. ID, class label), which may or may not cover all classes; a (possibly empty) set $I_v$ of labeled relevant dimensions (dim. ID, class label), which may or may not cover all classes. A single dimension can be specified as relevant to multiple clusters
Our Problem Problem definition (cont’d): –Outputs: A set of k disjoint projected clusters with a (locally) optimal objective score A (possibly empty) set of outlier objects
Our Problem Assumptions made in this study: –There is a primary clustering target (cf. biclustering) –Disjoint, axis-parallel clusters (cf. subspace clustering and ORCLUS) –Distance-based similarity –One cluster per class (cf. decision tree) –All inputs are correct (but can be biased, i.e., with projections on the relevant dimensions deviating from the cluster center)
Our New Algorithm Basic idea: k-medoid/median 1.Determine the potential medoids (seeds) and relevant dimensions of each cluster 2.Assign every object to the cluster (or to the outlier list) that gives the greatest improvement to the objective score 3.Decide which medoids are good/bad –A good medoid: replace by cluster median, refine selected dimensions –A bad medoid: replace by another seed 4.Repeat 2 and 3 until no improvements can be obtained in a certain number of iterations
Our New Algorithm Issues to consider: –Design of the objective function –Selection of relevant dimensions for a cluster –Determination of seeds and the relevant dimensions of the corresponding potential clusters –Replacement of medoids
Our New Algorithm Design goals of the objective function: –Should not have a trivial best score (e.g. when each cluster selects only one dimension) –Should not be ruined by the selection of a small amount of irrelevant dimensions –Should be robust (clustering accuracy should not degrade seriously when the input parameter values are not very accurate)
Our New Algorithm The objective function: –Overall score $\phi$ –Score component $\phi_i$ of cluster $C_i$ –Contribution $\phi_{ij}$ of selected dimension $v_j$ to the score component of $C_i$ –A normalization factor
Our New Algorithm Characteristics of the objective function: –Higher score => better clustering –No trivial best score when each cluster selects only one dimension or selects all dimensions –Relevant dimensions (dimensions with smaller within-cluster variance) contribute more to the objective score –Robust? (To be discussed soon…)
Our New Algorithm Dimension selection: –In order to maximize, all dimensions with should be selected. –Appropriate values of : Should be at least $\sigma_j^2$, the global variance of dimension $v_j$ Scheme 1: Scheme 1b:, but only dimensions with are selected => easier to compare the results with different m
Our New Algorithm Scheme 2: estimate the probability for an irrelevant dimension to be selected (global distribution needs to be known). If the global distribution is Gaussian… –If $n_i$ values are randomly sampled from the global distribution of an irrelevant dimension $v_j$, the random variable $(n_i - 1)\,\sigma_{ij}^2 / \sigma_j^2$ has a chi-square distribution with $n_i - 1$ degrees of freedom. –Suppose we want the probability of selecting an irrelevant dimension to be p, then From the cumulative chi-square distribution, the corresponding can be computed.
Our New Algorithm Probability density function and cumulative distribution ($n_i = 30$): $(30 - 1)\,\sigma_{ij}^2 / \sigma_j^2 < 19 \Rightarrow \sigma_{ij}^2 < 0.66\,\sigma_j^2\ (= m\,\sigma_j^2)$
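The arithmetic of this example can be checked numerically. The sketch below computes the chi-square cumulative distribution with a plain series expansion, chosen only to avoid external dependencies (a statistics library would normally be used), and inverts it by bisection. The function names are illustrative assumptions, and the threshold is expressed as a ratio m between the cluster variance and the global variance.

```python
import math

def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, h = df / 2.0, x / 2.0
    term, total, k = 1.0, 1.0, 1
    while term > 1e-16 * total:
        term *= h / (a + k)  # term_k = h^k / ((a+1)...(a+k))
        total += term
        k += 1
    return total * math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))

def chi2_quantile(p, df):
    """Invert the CDF by bisection on a bracket wide enough for moderate df."""
    lo, hi = 0.0, df + 20.0 * math.sqrt(2.0 * df) + 20.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def variance_ratio_threshold(p, n_i):
    """Ratio m such that an irrelevant dimension has sample variance below m * global variance with probability p."""
    return chi2_quantile(p, n_i - 1) / (n_i - 1)

# Probability mass below 19 for 29 degrees of freedom, and the ratio m for p = 0.10.
print(chi2_cdf(19.0, 29))
print(variance_ratio_threshold(0.10, 30))
```

With $n_i = 30$, a cut-off of 19 divided by 29 degrees of freedom gives the ratio of about 0.66 quoted on the slide.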
Our New Algorithm Robustness of the algorithm: –A good value of m should be… Large enough to tolerate local variances Small enough to distinguish local variances from global variances –The best value to use is data-dependent, but provided the difference between local and global variances is large, there is usually a wide range of values that lead to results with acceptable performance (e.g. 0.3 < m < 0.7)
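The robustness argument can be illustrated with a threshold-based selection rule. This is a minimal sketch, assuming a dimension is selected when the cluster's sample variance on it is below m times the global sample variance; the function name `select_dimensions` and the toy data are made up for illustration.

```python
import statistics

def select_dimensions(cluster, dataset, m=0.5):
    """Return the indices j whose within-cluster variance is below m times the global variance."""
    selected = []
    for j in range(len(dataset[0])):
        global_var = statistics.variance([row[j] for row in dataset])
        local_var = statistics.variance([row[j] for row in cluster])
        if local_var < m * global_var:
            selected.append(j)
    return selected

# Toy data: the cluster is tight on dimension 0 and spread out on dimension 1.
cluster = [(5.0, 0.0), (5.1, 10.0), (4.9, 5.0), (5.0, 2.0)]
others = [(0.0, 1.0), (10.0, 9.0), (2.0, 4.0), (8.0, 6.0)]
dataset = cluster + others

for m in (0.3, 0.5, 0.7):
    print(m, select_dimensions(cluster, dataset, m))
```

Because the local and global variances differ widely in this toy data, every m in the range gives the same selection, which is the behaviour the slide describes.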
Our New Algorithm Determination of seeds and the relevant dimensions of the corresponding potential clusters: –Traditional approach: One seed pool Seeds determined randomly/by max-min distance method/by preclustering (e.g. hierarchical) Relevant dimensions of each cluster determined by a set of objects near the medoid (in the input space)
Our New Algorithm Our proposal – seed group: –Seeds are stored in separate seed groups; each seed group contains a small number (e.g. 5) of seeds –One private seed group for each cluster with some inputs –A number of public seed groups are shared by all clusters without external inputs –The seeds of the cluster with the largest amount of inputs are initialized first (as we are most confident in their correctness), then those with fewer inputs, and so on. Finally, the public seed groups are initialized.
Our New Algorithm Our proposal – seeds selection: –Based on low-dimensional histograms (grids) –Relevant dimension => small variance => high density –Procedures: Determine starting point Hill-climbing –=> Need to determine both the dimensions used in constructing the grid and the starting point
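The hill-climbing step can be sketched as follows, assuming an equal-width grid over the chosen dimensions and a move rule that always goes to the densest adjacent cell. The cell width, the neighbourhood definition and the function names are assumptions, not details given in the talk.

```python
from collections import Counter
from itertools import product

def cell_of(point, dims, width):
    """Grid cell (tuple of bin indices) of a point on the chosen dimensions."""
    return tuple(int(point[j] // width) for j in dims)

def hill_climb(points, dims, width, start):
    """Move from the start cell to denser neighbouring cells until no neighbour is denser."""
    density = Counter(cell_of(p, dims, width) for p in points)
    current = cell_of(start, dims, width)
    # All adjacent cells, including diagonal ones, excluding the cell itself.
    offsets = [o for o in product((-1, 0, 1), repeat=len(dims)) if any(o)]
    while True:
        neighbours = [tuple(c + o for c, o in zip(current, off)) for off in offsets]
        best = max(neighbours, key=lambda c: density[c])
        if density[best] <= density[current]:
            return current
        current = best

# Toy data: a dense region in cell (5, 5), a sparser trail leading to it; dimension 2 is ignored.
points = [(5.1, 5.2, 0), (5.5, 5.5, 9), (5.9, 5.3, 3), (5.4, 5.8, 7), (5.6, 5.1, 1),
          (4.5, 4.5, 2), (4.2, 4.8, 8), (3.5, 3.5, 4)]
peak = hill_climb(points, dims=(0, 1), width=1.0, start=(3.5, 3.5, 4))
print(peak)
```

Starting from a sparse cell, the search climbs towards the densest nearby cell, which is how a slightly misplaced starting point can still lead to a dense region.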
Our New Algorithm Determining the grid-constructing dimensions and the starting point: –Case 1: a cluster with both labeled objects and labeled relevant dimensions –Case 2: a cluster with only labeled objects –Case 3: a cluster with only labeled relevant dimensions –Case 4: a cluster with no inputs
Our New Algorithm Case 1: both kinds of inputs are available 1.Form a seed cluster by the input objects 2.Rank all dimensions by $\phi_{ij}$ 3.All dimensions with positive $\phi_{ij}$ or in the input set $I_v$ are candidate dimensions for constructing the histograms 4.Relative chance of being selected: $\phi_{ij}$ if dimension $v_j$ is not in $I_v$, 1 otherwise 5.The starting point is the median of the seed cluster
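Steps 3 and 4 amount to weighted sampling of candidate dimensions. The sketch below assumes the per-dimension scores are scaled so that they do not exceed 1 and draws dimensions without replacement. The function name, the number of dimensions to pick and the score used for dimension "z" and the labeled dimension "w" are hypothetical.

```python
import random

def pick_grid_dimensions(scores, labeled_dims, n_pick, rng):
    """Sample up to n_pick distinct candidate dimensions.

    scores: dict mapping a dimension to its score contribution (assumed scaled to at most 1).
    labeled_dims: dimensions given as relevant in the input; they get weight 1.
    """
    weights = {j: 1.0 for j in labeled_dims}
    for j, score in scores.items():
        if j not in weights and score > 0:  # unlabeled dimensions need a positive score
            weights[j] = score
    chosen = []
    while weights and len(chosen) < n_pick:
        dims = list(weights)
        j = rng.choices(dims, weights=[weights[x] for x in dims], k=1)[0]
        chosen.append(j)
        del weights[j]  # sample without replacement
    return chosen

rng = random.Random(1)
scores = {"x": 0.68, "y": 0.83, "z": -0.2}  # the value for "z" is invented
picked = pick_grid_dimensions(scores, labeled_dims={"w"}, n_pick=2, rng=rng)
print(picked)
```

Dimensions with non-positive scores that are not labeled never enter the candidate pool, so they cannot be drawn.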
Our New Algorithm Example: cluster 2 –$\phi_{2x}$: 0.68 –$\phi_{2y}$: 0.83 –$\phi_{2z}$: The hill-climbing mechanism fixes errors due to biased inputs
Our New Algorithm Case 2: labeled objects only –Similar to case 1, but the chance for each dimension to be selected is based on $\phi_{ij}$ only Case 3: labeled dimensions only –Similar to case 1, but with no starting point, i.e., all cells are examined, and the one with the highest density will be returned
Our New Algorithm Case 4: no inputs –The tentative seed is the one with the maximum projected distance to the closest selected seeds (modified max-min distance method) –For each dimension, a one-dimensional histogram is constructed to determine the density of objects around the projection of the tentative seed –The chance for each dimension being selected to construct the grid is based on the density –The tentative seed is used as the starting point
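The modified max-min distance step can be sketched as below. The talk does not spell out how the projected distance is computed, so this sketch assumes a Euclidean distance restricted to the relevant dimensions of each already selected seed and averaged per dimension. Both function names are illustrative.

```python
import math

def projected_distance(obj, seed, dims):
    """Euclidean distance on dims only, averaged per dimension (an assumed normalization)."""
    return math.sqrt(sum((obj[j] - seed[j]) ** 2 for j in dims) / len(dims))

def next_tentative_seed(objects, seeds):
    """Return the object whose distance to its closest selected seed is largest.

    seeds: list of (seed_point, relevant_dims) pairs for the seeds selected so far.
    """
    def distance_to_closest(obj):
        return min(projected_distance(obj, s, dims) for s, dims in seeds)
    return max(objects, key=distance_to_closest)

objects = [(0, 0, 5), (1, 0, 9), (10, 10, 0), (0, 9, 3)]
seeds = [((0, 0, 0), (0, 1))]            # one seed, relevant on dimensions 0 and 1
first = next_tentative_seed(objects, seeds)
seeds.append((first, (0, 1)))            # assume the new seed uses the same dimensions
second = next_tentative_seed(objects, seeds)
print(first, second)
```

Each new tentative seed lands far from all seeds chosen so far, so consecutive seeds tend to fall in different clusters.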
Our New Algorithm Medoid drawing/replacement: –The medoid of each cluster is initially drawn from… The corresponding private seed group, if available; a unique public seed group, otherwise –After assignment, the medoid for cluster $C_i$ is likely to be a bad one if… $\phi_i / \max_i \phi_i$ is small: the cluster has a low quality compared to other clusters; $\phi_i / \max \phi_i$ is small: the cluster has a low quality compared to a perfect cluster; the cluster is very similar to another cluster
Our New Algorithm Medoid drawing/replacement (cont’d): –Each time, only one potentially bad medoid is replaced, since the probability of simultaneously correcting multiple medoids is low –The target bad medoid is replaced by a seed from the corresponding private seed group or a new public seed group –The medoids of other clusters are replaced by cluster medians, and the relevant dimensions are reselected –The algorithm keeps track of the best set of medoids and relevant dimensions
Experimental Results Dataset 1: n=1000, d=100, k=5, $l_{real}$=5-40 (5-40% of d) No external inputs Algorithms: –HARP –PROCLUS –SSPC –CLARANS (non-projected control)
Experimental Results Best performance (results with the best ARI values):
Experimental Results Best performance vs. average performance:
Experimental Results Robustness ($l_{real}$=10):
Experimental Results Dataset 2: n=150, d=3000, k=5, $l_{real}$=30 (1% of d) Inputs: –$I_o$ size, $I_v$ size=1-9 –4 combinations: both, labeled objects only, labeled relevant dimensions only, none –Coverage: 1-5 clusters (20-100%)
Experimental Results Increasing input size (100% coverage):
Experimental Results Increasing coverage (input size=3):
Experimental Results Increasing coverage (input size=6):
Future Works and Extensions Other required experiments: –Biased inputs –Multiple labeling methods for a single dataset –Scalability –Real data –Imperfect data with artificial outliers and errors –Searching for the best k
Future Works and Extensions To be considered in the future: –Other input types (e.g. must-links and cannot-links) –Wrong/Inconsistent inputs –Pattern-based and range-based similarity –Non-disjoint clusters
References Projected clustering: –HARP: A Hierarchical Algorithm with Automatic Relevant Attribute Selection for Projected Clustering (DB Seminar on 20 Sep 2002) Semi-supervised clustering: –The Semi-supervised Clustering Problem (DB Seminar on 2 Jan 2004)