Relevant Overlapping Subspace Clusters on CATegorical Data (ROCAT)
Xiao He1, Jing Feng1, Bettina Konte1, Son T. Mai1, Claudia Plant2
1: University of Munich, 2: Helmholtz Zentrum München, Technische Universität München
Presented by George Hodulik

Motivation of the ROCAT algorithm. Subspace clusters are more common in data than full-dimensional clusters, yet most current subspace clustering algorithms have at least one of the following problems: they depend heavily on input parameters; they produce many redundant clusters; they are partition-based, so subspace clusters cannot overlap; they require fault-tolerant data; they only apply to numerical data; or they are strongly affected by outliers.

Use data compression as a measure of similarity via the Minimum Description Length (MDL) principle: the set of subspace clusters that compresses the data best is the most relevant one. [Figure: coding cost in bits of each subspace cluster C_i and of the non-clustered area, compared for subspace clustering, full-dimensional clustering, and no clustering.]

Shannon entropy as a measure of description length. Shannon entropy is the lower bound of lossless compression, so we do not need to actually compress the data; we use Shannon entropy as the measure of coding cost. The entropy of an attribute A_j is H(A_j) = −Σ_v p(v) · log2 p(v), summing over the values v of A_j, and the coding cost of a subspace cluster C_i is obtained analogously from the entropies of its attributes over the objects in C_i. We want to minimize the sum of the coding cost of each cluster, the coding cost of the non-clustered area, and the model description of the subspace clusters; this minimization yields the most relevant subspace clusters.
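To make the coding-cost idea concrete, here is a minimal sketch (not the authors' code) of how the Shannon-entropy coding cost of a block of categorical data could be computed; the function names and the toy attributes are illustrative.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy in bits per object of a categorical attribute, given its values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def coding_cost(objects, attributes):
    """Coding cost in bits of a block of objects over the given attributes:
    number of objects times the per-object entropy of each attribute."""
    n = len(objects)
    if n == 0:
        return 0.0
    return sum(n * shannon_entropy([obj[a] for obj in objects]) for a in attributes)

# A pure block (the same value everywhere in a column) costs 0 bits for that column.
block = [{"color": "red", "shape": "round"},
         {"color": "red", "shape": "square"}]
print(coding_cost(block, ["color", "shape"]))  # 2.0: 0 bits for color, 1 bit * 2 objects for shape
```

ROCAT's objective adds to this the analogous cost of the non-clustered area and the cost of describing the cluster models themselves.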

ROCAT algorithm. Input: a data set D. Output: a list of subspace clusters in D. The algorithm runs in three phases: searching, combining, and reassigning.

Searching: find subspace clusters. Keep finding the best pure subspace cluster until the Shannon entropy (total coding cost) of the data set no longer decreases.

Searching: find the best pure cluster. A pure subspace cluster is one in which every object has the same value on each attribute of the cluster's subspace. Such clusters are found by the algorithm FindBestPure.

How FindBestPure works
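As a rough, hedged sketch of what a greedy FindBestPure-style search could look like; the data layout and the cost-saving estimate below are assumptions for illustration, not the paper's exact procedure.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of a sequence of categorical values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values()) if n else 0.0

def find_best_pure(objects, attributes):
    """Greedy sketch in the spirit of FindBestPure (illustrative, not the exact procedure):
    grow a set of (attribute, value) conditions, at each step taking the most frequent
    value of a not-yet-used attribute among the current members, and return the prefix
    of conditions with the largest estimated coding-cost saving."""
    if not objects:
        return {}, []
    conditions, members = {}, list(objects)
    best_saving, best = 0.0, ({}, [])
    while len(conditions) < len(attributes):
        candidates = []
        for a in attributes:
            if a not in conditions:
                value, freq = Counter(o[a] for o in members).most_common(1)[0]
                candidates.append((freq, a, value))
        freq, attr, value = max(candidates)          # condition that retains the most members
        conditions[attr] = value
        members = [o for o in members if o[attr] == value]
        # A pure cluster codes its own cells essentially for free; the saving is roughly
        # what those cells would cost in the non-clustered area (model cost ignored here).
        saving = sum(len(members) * entropy([o[a] for o in objects]) for a in conditions)
        if saving > best_saving:
            best_saving, best = saving, (dict(conditions), list(members))
    return best
```

In the searching phase such a routine would be called repeatedly, with each newly found pure cluster accepted only while the total coding cost of the data set keeps decreasing.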

Combining phase. For each pair of overlapping clusters C_i and C_j, split or combine them, choosing the option that minimizes the Shannon entropy of the data set.
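The concrete split/combine options appear as a figure on the original slide. Purely to illustrate the selection mechanism, the sketch below compares a few candidate restructurings of an overlapping pair by total coding cost and keeps the cheapest; the candidate set, cluster representation, and `total_cost` interface are assumptions for this example.

```python
def resolve_overlap(ci, cj, other_clusters, total_cost, data):
    """For one overlapping pair of clusters, compare candidate restructurings by the
    total coding cost of the resulting clustering and keep the cheapest option.
    Clusters are dicts with `members` and `attributes` sets; the candidate options
    below are illustrative, not necessarily the paper's exact split/combine choices."""
    merged = {"members": ci["members"] | cj["members"],
              "attributes": ci["attributes"] & cj["attributes"]}
    trimmed = {"members": cj["members"] - ci["members"],
               "attributes": cj["attributes"]}
    options = [
        [ci, cj],        # keep both clusters unchanged
        [merged],        # combine into one cluster on the shared attributes
        [ci, trimmed],   # split: remove the shared objects from the second cluster
    ]
    return min(options, key=lambda option: total_cost(other_clusters + option, data))
```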

Reassigning phase. For each subspace cluster C_i, find every object o that matches the (attribute, value) description of C_i and add it to, or remove it from, C_i if doing so reduces the Shannon entropy. Then, for each C_i that was changed, try adding attributes to C_i if this decreases the Shannon entropy; trying attributes in order of their Shannon entropy makes this step more efficient. Repeat both steps until nothing changes.
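A sketch of the reassigning loop, assuming a simple cluster representation and a `total_cost(clusters, data)` function implementing the MDL objective from the earlier slides; this is an illustration, not the authors' implementation.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of a sequence of categorical values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values()) if n else 0.0

def reassign(clusters, data, all_attributes, total_cost):
    """Sketch of the reassigning phase. Each cluster is a dict with a `description`
    ({attribute: value}), a `members` set of object indices, and an `attributes` set;
    `total_cost(clusters, data)` is assumed to implement the MDL objective.
    Toggle matching objects in or out of each cluster, then try extending clusters that
    changed with further attributes, keeping only changes that lower the total coding
    cost; repeat until a full pass changes nothing."""
    changed = True
    while changed:
        changed = False
        for c in clusters:
            cluster_changed = False
            # 1) add or remove objects matching the cluster's (attribute, value) description
            for i, o in enumerate(data):
                if all(o[a] == v for a, v in c["description"].items()):
                    before = total_cost(clusters, data)
                    c["members"] ^= {i}                 # toggle membership of object i
                    if total_cost(clusters, data) < before:
                        cluster_changed = True
                    else:
                        c["members"] ^= {i}             # revert: the toggle did not help
            # 2) for clusters changed in step 1, try adding attributes, lowest entropy first
            if cluster_changed:
                for a in sorted(set(all_attributes) - c["attributes"],
                                key=lambda attr: entropy([o[attr] for o in data])):
                    before = total_cost(clusters, data)
                    c["attributes"].add(a)
                    if total_cost(clusters, data) < before:
                        cluster_changed = True
                    else:
                        c["attributes"].discard(a)      # revert
            changed = changed or cluster_changed
    return clusters
```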

Runtime complexity. With N objects and M attributes: the searching phase is O(M^2 · N); the combining phase is O(c · M · N), where c is the number of subspace clusters found in the searching phase, which is usually small, so this is close to O(M · N); the reassigning phase is O(i · M · N), where i is the number of iterations until convergence, and since it normally converges very fast this is also close to O(M · N).

Comparable performance on synthetic data. [Figures: cluster quality (F-measure) and subspace cluster quality (F-measure).]

Comparable scalability on synthetic data. [Figures: runtime scalability; 52 attributes used in the left plot, 960 objects in the right plot.]

Robustness against outliers

Real-world data – Congressional Votes. Survey with 16 attributes, 435 instances, and 2 classes (Democratic and Republican). ROCAT produces very pure classes and flags outliers, while DHCC takes no notice of outliers and MTV is overwhelmed by them. SUBCAD also performs well, but its subspace clusters span only 3 dimensions, while ROCAT's span 12.

Real-world data – Mushrooms. 8124 records, 22 categorical attributes, 2 classes (edible and poisonous). Nearly all ROCAT clusters have very high purity (cluster 15 being the only impure one), while all other methods produce clusters with significant impurity. Note that MTV has decent precision but fails to classify hundreds of mushrooms, which are left in the noise category.

Real-world data – Splice. 3190 instances, 60 attributes, 3 classes (EI exon/intron, IE intron/exon, neither). ROCAT and DHCC produce quite pure results, while all other methods perform relatively poorly. Again, MTV performs well overall but is very sensitive to outliers.

Real-world data – overall precision. ROCAT significantly outperforms almost all other methods in precision. Recall that SUBCAD's subspace clusters on Vote have much lower dimensionality than ROCAT's, and that MTV fails to classify hundreds of samples on Mushrooms. DHCC and ROCAT both perform well on Splice.

Conclusions. ROCAT is a notable algorithm for finding non-redundant, overlapping subspace clusters in categorical data: it requires no input parameters and is not strongly affected by outliers. Data compression is an intuitive way to represent similarity. The combining phase seems somewhat redundant, since the reassigning phase also removes redundancy and does so more thoroughly. No single algorithm is a fix-all (yet): some algorithms produced results as good as or better than ROCAT's on certain data sets.

Thank you! Questions?