CACTUS-Clustering Categorical Data Using Summaries


CACTUS-Clustering Categorical Data Using Summaries. Advisor: Dr. Hsu. Graduate: Min-Hung Lin. IDSL seminar 2001/10/30

Outline: Motivation, Objective, Related Work, Definitions, CACTUS, Performance Evaluation, Conclusions, Comments

Motivation: Clustering with categorical attributes has received attention. Previous algorithms do not give a formal description of the clusters. Some of them need to post-process the output of the algorithm to identify the final clusters.

Objective: Introduce a novel formalization of a cluster for categorical attributes. Describe a fast summarization-based algorithm, CACTUS, that discovers clusters. Evaluate the performance of CACTUS on synthetic and real datasets.

Related Work: EM algorithm [Dempster et al., 1977]: iterative clustering technique. STIRR algorithm [Gibson et al., 1998]: iterative algorithm based on non-linear dynamical systems. ROCK algorithm [Guha et al., 1999]: hierarchical clustering algorithm.

DEF:Support

DEF:Strongly Connected

DEF:Strongly Connected(cont’d)

Formal Definition of a Cluster

Formal Definition of a Cluster (cont’d): C_i is the cluster-projection of C on A_i. C is called a sub-cluster if it satisfies conditions (1) and (3). A cluster C over a subset S of all attributes is called a subspace cluster on S; if |S| = k, then C is called a k-cluster.

DEF:Similarity

Inter-attribute Summaries

Intra-attribute Summaries

Experiments

Result: STIRR fails to discover clusters consisting of overlapping cluster-projections on any attribute, and clusters where two or more clusters share the same cluster-projection. CACTUS correctly discovers all clusters.

CACTUS: Three-phase clustering algorithm. Summarization Phase: compute the summary information. Clustering Phase: discover a set of candidate clusters. Validation Phase: determine the actual set of clusters.

Summarization Phase: Inter-attribute Summaries, Intra-attribute Summaries
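
The slide names the two summaries without defining them. The following is a minimal Python sketch of an inter-attribute summary only, assuming the definitions from the CACTUS paper: the support of a value pair is the number of tuples containing both values, and a pair is strongly connected when its support exceeds alpha times the support expected if the two attributes were independent and uniformly distributed. The function name, the `alpha` parameter, the use of observed values as the attribute domains, and the toy data are illustrative assumptions, not part of the slides.

```python
from collections import Counter
from itertools import combinations


def inter_attribute_summaries(tuples, alpha):
    """For each attribute pair (i, j), return the strongly connected
    value pairs together with their supports.

    Assumption: the domain of an attribute is the set of values observed
    in that column, and expected support is |D| / (|D_i| * |D_j|).
    """
    n_rows = len(tuples)
    n_attrs = len(tuples[0])
    domains = [{t[i] for t in tuples} for i in range(n_attrs)]

    # Single scan of the data: count co-occurrences for every attribute pair.
    counts = {pair: Counter() for pair in combinations(range(n_attrs), 2)}
    for t in tuples:
        for i, j in counts:
            counts[(i, j)][(t[i], t[j])] += 1

    summaries = {}
    for (i, j), counter in counts.items():
        expected = n_rows / (len(domains[i]) * len(domains[j]))
        # Keep only the pairs whose support exceeds alpha * expected support.
        summaries[(i, j)] = {
            pair: support
            for pair, support in counter.items()
            if support > alpha * expected
        }
    return summaries


rows = [("a1", "b1")] * 3 + [("a2", "b2"), ("a3", "b3"), ("a2", "b3")]
print(inter_attribute_summaries(rows, alpha=2.0))
```

With six tuples and three values per attribute, the expected support of any pair is 6/9, so with `alpha=2.0` only the pair that occurs three times is retained.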

Clustering Phase: Computing cluster-projections on attributes. Level-wise synthesis of clusters.

Computing Cluster-Projections on Attributes: Step 1: pairwise cluster-projection. Step 2: intersection.

Computing Cluster-Projections on Attributes (cont’d)

Level-wise synthesis of clusters

Level-wise synthesis of clusters (cont’d): Generation procedure

Level-wise synthesis of clusters (cont’d): Candidate cluster

Validation: Some of the candidate clusters may not have enough support, because some of their 2-clusters may be due to different sets of tuples. Check whether the support of each candidate cluster is greater than the threshold: α times the expected support of the cluster. Only clusters whose support on D passes the threshold are retained.
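
The threshold test can be written out as follows. This is a reconstruction under assumptions rather than a formula from the slides: it takes a candidate cluster C = ⟨C_1, ..., C_n⟩ with C_i a subset of the domain D_i of attribute A_i, writes σ_D(C) for the support of C on the dataset D, and computes the expected support under attribute independence with uniformly distributed values. The multiplier α is the user-chosen threshold factor.

```latex
\[
  E\big[\sigma_D(C)\big] \;=\; |D| \prod_{i=1}^{n} \frac{|C_i|}{|D_i|},
  \qquad
  C \text{ is retained} \iff \sigma_D(C) \;\ge\; \alpha \cdot E\big[\sigma_D(C)\big].
\]
```

For example, with |D| = 6, two attributes of three values each, and a candidate with one value per attribute, the expected support is 6 · (1/3) · (1/3) = 2/3, so with α = 2 the candidate needs a support of at least 4/3.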

Validation Procedure: Set the supports of all candidate clusters to zero. For each tuple t in D, increment the support of the candidate cluster to which t belongs. At the end of the scan, delete all candidate clusters whose support is less than the threshold.
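
The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a candidate cluster is represented as a tuple of value sets (one per attribute), a tuple belongs to a cluster when each of its values lies in the corresponding set, and the threshold is alpha times the expected support under attribute independence. The names `validate`, `domain_sizes`, and `alpha` are hypothetical.

```python
from math import prod


def validate(tuples, candidates, domain_sizes, alpha):
    """Return the (cluster, support) pairs that survive validation.

    candidates:   list of clusters, each a tuple of sets of attribute values
    domain_sizes: number of distinct values of each attribute (assumed known)
    """
    n_rows = len(tuples)
    support = [0] * len(candidates)  # set all supports to zero

    # One scan of the dataset: credit every candidate the tuple belongs to.
    for t in tuples:
        for k, cluster in enumerate(candidates):
            if all(value in proj for value, proj in zip(t, cluster)):
                support[k] += 1

    kept = []
    for cluster, s in zip(candidates, support):
        # Expected support if attributes were independent and uniform.
        expected = n_rows * prod(
            len(proj) / size for proj, size in zip(cluster, domain_sizes)
        )
        if s >= alpha * expected:  # drop clusters below the threshold
            kept.append((cluster, s))
    return kept


rows = [("a1", "b1")] * 3 + [("a2", "b2"), ("a3", "b3"), ("a2", "b3")]
candidates = [({"a1"}, {"b1"}), ({"a2"}, {"b1"})]
print(validate(rows, candidates, domain_sizes=[3, 3], alpha=2.0))
```

In this toy run the first candidate has support 3 and is kept, while the second has support 0 and is deleted. The sketch tests every candidate against every tuple for clarity; an index over candidate clusters would be needed for large candidate sets.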

Extensions: Large Attribute Value Domains, Clusters in Subspaces

Performance Evaluation: Evaluation of CACTUS on synthetic and real datasets. Comparison of the performance of CACTUS with that of STIRR.

Synthetic Datasets: The test datasets were generated using the data generator developed by Gibson et al. (1 million tuples, 10 attributes, 100 attribute values for each attribute).

Real Datasets: Two sets of bibliographic entries: 7766 entries are database-related and 30919 entries are theory-related. Four attributes: the first author, the second author, the conference, and the year. The attribute domain sizes for the database-related, theory-related, and mixed datasets are {3418,3529,1631,44}, {8043,8190,690,42}, and {10212,10527,2315,52}, respectively.

Real Datasets (cont’d): Database-related, Theory-related, Mixture

Results: CACTUS is very fast and scalable (only two scans of the dataset). CACTUS outperforms STIRR by a factor between 3 and 10.

Conclusions: Formalized the definition of a cluster for categorical attributes. Introduced a fast summarization-based algorithm, CACTUS, for discovering such clusters in categorical data. Evaluated the algorithm on both synthetic and real datasets.

Future Work: Relax the cluster definition by allowing sets of attribute values that are “almost” strongly connected to each other. Inter-attribute summaries can be incrementally maintained, which leads to an incremental clustering algorithm. Rank the clusters based on a measure of interestingness.

Comments: Pairwise cluster-projection is an NP-complete problem. A large number of candidate clusters is still a problem.