Relevant Overlapping Subspace Clusters on CATegorical Data (ROCAT)
Xiao He1, Jing Feng1, Bettina Konte1, Son T. Mai1, Claudia Plant2
1: University of Munich, 2: Helmholtz Zentrum München, Technische Universität München
{he, feng, konte, mtson}@dbs.ifi.lmu.de, claudia.plant@helmholtz-muenchen.de
Presented by George Hodulik

Motivation of ROCAT algorithm
Subspace clusters are more common in data than full-dimensional clusters.
Most current subspace clustering algorithms have at least one of the following problems:
- Heavily depend on input parameters
- Produce many redundancies
- Partition-based (subspace clusters cannot overlap)
- Require fault-tolerant data
- Only relevant for numerical data
- Greatly affected by outliers

Use data compression as a measurement of similarity – Minimum Description Length (MDL)
MDL Principle: The subspace clusters that compress the data optimally are the most relevant subspace clusters.
[Figure: the same data set encoded under three models. Panel labels: subspace clustering, full-D clustering, no clustering. Coding costs shown: 6076.5 bits, 6147.1 bits, 6670.9 bits. Legend: subspace cluster C_i, non-clustered area.]

Shannon Entropy as a measurement of MDL
Shannon entropy is a lower bound on the average code length achievable by lossless compression.
We do not need to actually compress the data, so we use Shannon entropy as the measure of coding cost for MDL.
Formulas shown on the slide: entropy of an attribute A_j; entropy of a subspace cluster C_i.
We want to minimize the sum of the coding cost of each cluster, the coding cost of the non-clustered area, and the model description of the subspace clusters. This minimization gives us the most relevant subspace clusters.
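To make the entropy-as-coding-cost idea concrete, here is a minimal sketch in Python. It computes the standard Shannon entropy of one categorical attribute and a simplified block cost (number of rows times the summed per-attribute entropy). The function names and the assumption that attributes are coded independently are illustrative only; the paper's exact cluster cost and its model-description term are not reproduced here.

```python
from collections import Counter
from math import log2


def attribute_entropy(values):
    """Shannon entropy, in bits per value, of one categorical attribute."""
    n = len(values)
    counts = Counter(values)
    return sum(-(c / n) * log2(c / n) for c in counts.values())


def coding_cost(rows):
    """Simplified cost in bits of a block of rows.

    Assumption: each attribute is coded independently, so the cost is
    (number of rows) * (sum of per-attribute entropies).
    """
    if not rows:
        return 0.0
    n = len(rows)
    return sum(n * attribute_entropy(list(col)) for col in zip(*rows))


# Toy data: two groups of identical rows.
rows = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
no_clustering = coding_cost(rows)
two_clusters = coding_cost(rows[:2]) + coding_cost(rows[2:])
```

On this toy data, coding everything as one block costs 8 bits, while coding the two groups separately costs 0 bits before any model-description cost is added. That gap is what the MDL principle exploits when ranking candidate clusterings.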

ROCAT Algorithm
Input: Data set D
Output: List of subspace clusters in D
3 phases:
- Searching
- Combining
- Reassigning

Searching: Find subspace clusters
Keep finding the best pure subspace cluster until adding it no longer decreases the Shannon entropy (coding cost) of the data set.
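The stopping rule of the searching phase can be sketched as a greedy loop. This is only an outline under stated assumptions: `find_best_pure` and `total_cost` are hypothetical callables supplied by the caller (the slides do not give their definitions), and the loop simply stops once the next candidate fails to lower the cost.

```python
def search(data, find_best_pure, total_cost):
    """Greedy search sketch: add clusters while the total cost keeps dropping.

    find_best_pure(data, clusters) -> candidate cluster or None (hypothetical)
    total_cost(data, clusters)     -> coding cost in bits (hypothetical)
    """
    clusters = []
    best = total_cost(data, clusters)
    while True:
        candidate = find_best_pure(data, clusters)
        if candidate is None:
            break  # nothing left to propose
        new_cost = total_cost(data, clusters + [candidate])
        if new_cost >= best:
            break  # the cost no longer decreases, so stop searching
        clusters.append(candidate)
        best = new_cost
    return clusters
```

Passing the two callables in keeps the loop independent of how candidates are generated and how the cost is defined.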

Searching: Find best pure cluster
A pure subspace cluster is one in which all objects have the same value on each attribute of the subspace.
Algorithm FindBestPure
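The purity definition translates directly into a small check. The sketch below assumes the data set is a list of tuples and that objects and attributes are referred to by integer index; these representation choices and the helper names are illustrative, and this is not the FindBestPure algorithm itself.

```python
def is_pure(data, objects, attributes):
    """True if all given objects share one value on every given attribute."""
    return all(
        len({data[o][a] for o in objects}) <= 1
        for a in attributes
    )


def pure_attributes(data, objects):
    """Indices of the attributes on which the given objects all agree."""
    n_attrs = len(data[0])
    return [a for a in range(n_attrs) if is_pure(data, objects, [a])]


data = [
    ("red", "small", "x"),
    ("red", "small", "y"),
    ("blue", "small", "x"),
]
subspace_for_first_two = pure_attributes(data, [0, 1])
```

Here objects 0 and 1 form a pure subspace cluster on attributes 0 and 1, but not on attribute 2.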

How FindBestPure works

Combining Phase
For each pair of clusters C_i and C_j that overlap, split/combine them as shown, choosing the option which minimizes the Shannon entropy of the data set.

Reassigning phase
For each subspace cluster C_i:
- Find each object o that matches the (attribute, value) description of C_i.
- Add o to C_i, or remove it from C_i, if doing so reduces the Shannon entropy.
Then, for each C_i that was changed, try adding attributes to C_i, keeping each one that decreases the Shannon entropy. Trying attributes in order of their Shannon entropy makes this more efficient.
Repeat both steps until nothing changes.
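The object-reassignment step can be sketched as follows. The sketch assumes each row is a dict from attribute name to value, that a cluster description is a dict of (attribute, value) pairs that must all match exactly, and that `total_cost` is a hypothetical callable returning the coding cost for a given member set. None of these details come from the slides.

```python
def matches(row, description):
    """True if the row has the described value on every described attribute."""
    return all(row[a] == v for a, v in description.items())


def reassign(data, members, description, total_cost):
    """One pass over the data: toggle each matching object in or out of the
    cluster whenever that lowers total_cost(members) (hypothetical cost)."""
    members = set(members)
    changed = False
    for idx, row in enumerate(data):
        if not matches(row, description):
            continue
        candidate = members ^ {idx}  # add idx if absent, remove it if present
        if total_cost(candidate) < total_cost(members):
            members = candidate
            changed = True
    return members, changed
```

The returned `changed` flag is what an outer loop would use to decide whether another pass is needed, mirroring "repeat both steps until nothing changes".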

Runtime Complexity
N objects, M attributes
Searching Phase = O(M^2 * N)
Combining Phase = O(k * M * N), where k is the number of subspace clusters found in the Searching phase; close to O(M * N)
Reassigning Phase = O(i * (M * N)), where i is the number of iterations of the reassigning phase until convergence; it normally converges very fast, so close to O(M * N)

Comparable performance on synthetic data
[Figure panels: cluster quality (F-Measure); subspace cluster quality (F-Measure)]

Comparable scalability on synthetic data
52 attributes used on left, 960 objects on right

Robustness against outliers

Real world data – Congressional Votes
Survey with 16 attributes, 435 instances, 2 classes (Democratic and Republican).
ROCAT produces very pure classes and identifies outliers, while DHCC takes no notice of outliers and MTV is overwhelmed by them. SUBCAD also performs well, but note that its subspace clusters span only 3 dimensions, while ROCAT’s span 12.

Real world data - Mushrooms
8124 records, 22 categorical attributes, 2 classes (edible and poisonous).
Nearly all ROCAT clusters have very high purity (cluster 15 being the only one that is not pure), while all other methods show significant impurity. Notice that MTV has decent precision but fails to classify hundreds of mushrooms, which are left in the Noise category.

Real world data - Splice
3190 instances, 60 attributes, 3 classes (EI Exon/Intron, IE Intron/Exon, Neither).
ROCAT and DHCC produce quite pure results, while all others perform relatively poorly. Again, MTV performs well but is very sensitive to outliers.

Real world data – overall precision
ROCAT significantly outperforms almost all other methods with respect to precision.
- Recall that SUBCAD subspace clusters in Vote have much lower dimensionality than ROCAT’s.
- Recall that MTV in Mushroom fails to classify hundreds of samples.
- DHCC and ROCAT both perform well on Splice.

Conclusions
- ROCAT is a notable algorithm for finding non-redundant, overlapping subspace clusters in categorical data, with no input parameters and without being negatively affected by outliers.
- Data compression is an intuitive way to represent similarity.
- The combining phase seems redundant, since the reassigning phase also removes redundancy and does so more completely.
- No single algorithm is a fix-all (yet). Some algorithms had results as good as or better than ROCAT’s on certain data sets.

Thank you! Questions?
