Clustering Categorical Data Using Summaries


1 Clustering Categorical Data Using Summaries
CACTUS: Clustering Categorical Data Using Summaries
Venkatesh Ganti
Joint work with Johannes Gehrke and Raghu Ramakrishnan (University of Wisconsin-Madison)

2 Introduction
- Most research on clustering has focused on n-dimensional numeric data, e.g., BIRCH [ZRL96], CURE [GRS98], the scalable clustering framework [BFR98], WaveCluster [SCZ98].
- Much data also consists of categorical attributes, e.g., the UC-Irvine collection of datasets.
- Problem: similarity functions are not defined for categorical data.

3 CACTUS
- Goal: a fast, scalable algorithm for discovering well-defined clusters.
- Similarity: use attribute value co-occurrence (as in STIRR [GKR98]).
- Speed and scalability: exploit the small domain sizes of categorical attributes.

4 Preliminaries and Notation
- A set of n categorical attributes with domains D1,…,Dn.
- A tuple consists of one value from each domain, e.g., (a1,b2,c1).
- Dataset: a set of tuples.
- Note: the sizes of D1,…,Dn are typically very small.
[Figure: three attributes A, B, C with values a1-a4, b1-b4, c1-c4.]

5 Similarity, between attributes
- "Similarity" between a1 and b1: support(a1,b1) = #tuples containing (a1,b1).
- a1 and b1 are strongly connected if support(a1,b1) is higher than expected.
- {a1,a2,a3,a4} and {b1,b2} are strongly connected if all of their pairs are.
[Figure: values of A, B, and C with edges marking strongly connected pairs; some pairs are labeled "not strongly connected".]
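
The slides leave the "higher than expected" test informal. A minimal sketch of one natural reading, assuming the expected support under attribute independence is N / (|Di| * |Dj|) and a user-chosen threshold alpha > 1 (both are assumptions, not details fixed by the slides):

    from collections import Counter

    def pair_supports(tuples, i, j):
        """support(x, y): number of tuples whose attributes i and j
        take the values x and y respectively."""
        return Counter((t[i], t[j]) for t in tuples)

    def strongly_connected_pairs(tuples, domains, i, j, alpha=3.0):
        """Value pairs of attributes i and j whose support exceeds
        alpha times the expected support under independence."""
        expected = len(tuples) / (len(domains[i]) * len(domains[j]))
        return {pair for pair, s in pair_supports(tuples, i, j).items()
                if s > alpha * expected}

For instance, with 1,000 tuples and two attributes of domain size 10, the expected support of any value pair is 10, so alpha = 3 flags pairs that occur more than 30 times.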

6 Similarity, within an attribute
- simA(b1,b2): the number of values of A that are strongly connected with both b1 and b2.

  sim*(B)    thru A    thru C
  (b1,b2)       4         2
  (b1,b3)
  (b1,b4)
  (b2,b3)
  (b2,b4)

[Figure: values of A, B, and C with their strong-connection edges.]

7 Definitions
- support(ai,bk) is the number of tuples that contain both ai and bk.
- ai and bk are strongly connected if support(ai,bk) >> its expected value.
- Sets Si and Sk are strongly connected if every pair of values in Si x Sk is strongly connected.
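
The set-level definition transcribes directly, reusing the pairwise test sketched above (sc_pairs is the output of strongly_connected_pairs):

    def sets_strongly_connected(Si, Sk, sc_pairs):
        """Si and Sk are strongly connected iff every pair in
        Si x Sk is a strongly connected value pair."""
        return all((x, y) in sc_pairs for x in Si for y in Sk)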

8 An Example
- Intuitively, a cluster is a high-density region.
- Region: {a1,a2} x {b1,b2} x {c1,c2}.
- Note: dense regions lead to strongly connected sets.
[Figure: the region {a1,a2} x {b1,b2} x {c1,c2} highlighted among the values of A, B, and C.]

9 Cluster Definition
- Region: a cross-product of sets of attribute values, C1 x … x Cn.
- C = C1 x … x Cn is a cluster iff:
  - Ci and Cj are strongly connected, for all i, j;
  - Ci is maximal, for all i;
  - support(C) >> its expected value.
- Ci is called the cluster projection of C on Ai.

10 CACTUS: Outline
- Idea: compute and use data summaries for clustering.
- Three phases:
  - Summarization: compute summaries of the data.
  - Clustering: use the summaries to compute candidate clusters.
  - Validation: validate the set of candidate clusters from the clustering phase.

11 Summaries
- Two types of summaries:
  - Inter-attribute summaries
  - Intra-attribute summaries

12 Inter-Attribute Summaries
- Supports of all strongly connected attribute value pairs from different attributes.
- Similar in nature to "frequent" 2-itemsets, and so is their computation.

  IJ(A,B)    IJ(A,C)    IJ(B,C)
  (a1,b1)    (a1,c1)    (b1,c1)
  (a1,b2)    (a1,c2)    (b1,c2)
  (a2,b1)    (a2,c1)    (b2,c1)
  (a2,b2)    (a2,c2)    (b2,c2)
  (a3,b1)               (b3,c1)

[Figure: values of A, B, and C with their strong-connection edges.]
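
As with frequent 2-itemset counting, all pair counters fit in memory and can be maintained in a single scan. A sketch, again using the assumed alpha-times-expected test from above:

    from collections import Counter
    from itertools import combinations

    def inter_attribute_summaries(tuples, domains, alpha=3.0):
        """IJ(Ai,Aj) for every attribute pair: supports of the
        strongly connected cross-attribute value pairs, computed
        in one scan of the dataset."""
        n, d = len(tuples), len(domains)
        counts = {ij: Counter() for ij in combinations(range(d), 2)}
        for t in tuples:                          # single dataset scan
            for i, j in combinations(range(d), 2):
                counts[(i, j)][(t[i], t[j])] += 1
        return {
            (i, j): {p: s for p, s in c.items()
                     if s > alpha * n / (len(domains[i]) * len(domains[j]))}
            for (i, j), c in counts.items()
        }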

13 Intra-attribute summaries
- simA(B): similarities thru A of the attribute value pairs of B.

  sim*(B)    thru A    thru C
  (b1,b2)       4         2
  (b1,b3)
  (b1,b4)
  (b2,b3)
  (b2,b4)

[Figure: values of A, B, and C with their strong-connection edges.]

14 Computing Intra-attribute Summaries
SQL query to compute simA(B):

  Select T1.B, T2.B, count(*)
  From IJ(A,B) as T1(A,B), IJ(A,B) as T2(A,B)
  Where T1.B < T2.B and T1.A = T2.A
  Group By T1.B, T2.B
  Having count(*) > 0;

- Note: the inter-attribute summaries are sufficient; the dataset is not accessed!
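
The same computation in Python makes it explicit that only IJ(A,B) is read. A sketch; ij_ab can be a set of strongly connected (a, b) pairs, or the per-pair dict produced by inter_attribute_summaries above, since iterating a dict yields its keys:

    from collections import Counter
    from itertools import combinations

    def sim_thru_A(ij_ab):
        """sim_A(b, b'): number of A-values strongly connected to
        both b and b', from the inter-attribute summary alone."""
        b_values = {}                  # a -> B-values paired with a
        for a, b in ij_ab:
            b_values.setdefault(a, set()).add(b)
        sim = Counter()
        for bs in b_values.values():   # each shared a adds 1 per B-pair
            for b1, b2 in combinations(sorted(bs), 2):
                sim[(b1, b2)] += 1
        return sim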

15 Memory Requirements for Summaries
- Attribute domains are small: typically fewer than 100 values.
  - E.g., the largest attribute value domain in the UC-Irvine collection has 100 values (Pendigits dataset).
- The summaries for 50 attributes with domain sizes of 100 fit in 100 MB of main memory.
- Only one scan of the dataset is needed to compute the inter-attribute summaries.
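
A back-of-the-envelope check of the slide's figures (the 4-byte counter width is an assumption):

    n_attrs, dom_size = 50, 100
    attr_pairs = n_attrs * (n_attrs - 1) // 2    # 1,225 attribute pairs
    counters = attr_pairs * dom_size * dom_size  # 12,250,000 pair counters
    print(counters * 4 / 2**20)                  # ~47 MB, under 100 MB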

16 CACTUS
- Summarization
- Clustering Phase
- Validation

17 Clustering Phase
- Compute cluster projections on each attribute.
- Join cluster projections across attributes: candidate cluster generation.
- Example: identify the cluster projections {a1,a2}, {b1,b2}, {c1,c2}; then identify the cluster {a1,a2} x {b1,b2} x {c1,c2}.
[Figure: values of A, B, and C with the cluster {a1,a2} x {b1,b2} x {c1,c2} highlighted.]

18 Computing Cluster Projections
[Figure: values of A, B, and C with their strong-connection edges.]
- Lemma: computing all projections of clusters on attribute pairs is NP-complete.

19 Distinguishing Set Assumption
- Each cluster projection Ci on Ai is distinguished by a small set of attribute values.
- The size of a distinguishing set is bounded by k, the distinguishing number.
- Values of k are typically small.

20 Distinguishing Set Assumption
[Figure: the cluster {a1,a2} x {b1,b2} x {c1,c2} among the values of A, B, and C.]
- Cluster: {a1,a2} x {b1,b2} x {c1,c2}; {a1} (or {a2}) distinguishes {a1,a2}.
- Approach: compute distinguishing sets and extend them to cluster projections.

21 Candidate Cluster Generation
- Given cluster projections S1,…,Sn on A1,…,An, candidates come from the cross-product S1 x … x Sn.
- Level-wise synthesis: form S1 x S2, prune, then add S3, and so on (see the sketch below).
- The result may contain some dubious clusters!
- Example: S1 = {C',C1}; S2 = {C2}; S3 = {C3}; C' x C2 x C3 is not a cluster.
[Figure: regions C1, C2, C3 and the spurious region C'.]
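
A sketch of the level-wise synthesis. The callback compatible(i, ci, j, cj) stands in for the pairwise test (every value pair across the two projections strongly connected), which the summaries make cheap to answer:

    def synthesize_candidates(projections, compatible):
        """projections[i]: cluster projections (value sets) on Ai.
        Start from A1's projections and extend one attribute at a
        time, pruning any partial product whose newest projection
        is not strongly connected to every earlier one."""
        candidates = [(c,) for c in projections[0]]
        for j in range(1, len(projections)):
            candidates = [
                partial + (cj,)
                for partial in candidates
                for cj in projections[j]
                if all(compatible(i, ci, j, cj)
                       for i, ci in enumerate(partial))
            ]
        return candidates

As the slide warns, the surviving cross-products are only candidates (like C' x C2 x C3 above); the validation phase removes the spurious ones.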

22 The CACTUS Algorithm
- Summarization:
  - inter-attribute summaries (scans the dataset)
  - intra-attribute summaries
- Clustering phase:
  - compute cluster projections
  - level-wise synthesis of cluster projections to form candidate clusters
- Validation: requires one scan of the dataset.

23 STIRR [GKR98]
- An iterative dynamical system over weighted nodes in a graph.
- In each iteration, weights are propagated between connected nodes (connections are determined by the tuples in the dataset).
- Each iteration requires a dataset scan.
- Iteration stops when a fixed point is reached.
- Similar nodes end up with similar weights.
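
A sketch of one iteration with the simple additive combiner; [GKR98] studies several combiner and normalization choices, so the details here are assumptions. Node weights are keyed by (attribute index, value):

    import math
    from collections import defaultdict

    def stirr_iteration(tuples, weights):
        """Each node's new weight is the summed weight of the other
        values in every tuple containing it, followed by a
        per-attribute normalization (unit L2 norm, one common choice)."""
        new = defaultdict(float)
        for t in tuples:                           # one dataset scan
            for i, v in enumerate(t):
                new[(i, v)] += sum(weights[(j, u)]
                                   for j, u in enumerate(t) if j != i)
        by_attr = defaultdict(list)
        for key in new:
            by_attr[key[0]].append(key)
        for keys in by_attr.values():
            norm = math.sqrt(sum(new[k] ** 2 for k in keys)) or 1.0
            for k in keys:
                new[k] /= norm
        return dict(new)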

24 Experimental Evaluation
- Compare CACTUS with STIRR.
- Synthetic datasets: quasi-random data, as in STIRR [GKR98].
  - Fix the domain of each attribute.
  - Randomly generate tuples from these domains.
  - Identify clusters and plant additional (5%) data within the clusters.

25 Synthetic Datasets: Cactus and STIRR
- Planted clusters: {0,…,9} x {0,…,9} and {10,…,19} x {10,…,19}.
- Both CACTUS and STIRR identified the two clusters exactly.
[Figure: the two planted clusters on a two-attribute domain of size 100.]

26 Synthetic Dataset (contd.)
- Planted clusters:
  {0,…,9} x {0,…,9} x {0,…,9}
  {10,…,19} x {10,…,19} x {10,…,19}
  {0,…,9} x {10,…,19} x {10,…,19}
- CACTUS identifies all 3 clusters.
- STIRR instead returns:
  {0,…,9} x {0,…,19} x {0,…,9}
  {10,…,19} x {0,…,19} x {10,…,19}
[Figure: the three planted clusters on a three-attribute domain.]

27 Scalability with #Tuples
- #Attributes: 10; domain size: 100.
- CACTUS is about 10 times faster than STIRR.
[Figure: running time vs. number of tuples.]

28 Scalability with #Attributes
- 1 million tuples; domain size: 100.
[Figure: running time vs. number of attributes.]

29 Scalability with Domain Size
- 1 million tuples; #attributes: 4.
[Figure: running time vs. domain size.]

30 Bibliographic Data
- 38,500 database and theory bibliography entries [Wie].
- Attributes: first author, second author, conference/journal, and year.
- Example cluster projections on the conference attribute:
  (1) ACM Sigmod, VLDB, ACM TODS, ICDE, ACM Sigmod Record
  (2) ACMTG, CompGeom, FOCS, Geometry, ICALP, IPL, JCSS, …
  (3) PODS, Algorithmica, FOCS, ICALP, INFCTRL, IPL, JCSS, …

31 Conclusions
- A formal definition of a cluster for categorical data.
- A scalable, fast, summarization-based clustering algorithm for categorical data.
- Outperforms an earlier algorithm (STIRR) by almost an order of magnitude.
- Subspace clustering.

33 Extensions
- Dealing with large attribute value domains: in some rare cases, the inter-attribute or intra-attribute summaries may not fit in main memory.
- Finding clusters in subspaces when the number of attributes is large.

34 Related Work
- Conceptual clustering (e.g., [Fisher87]) and EM [DLR77]: assume that datasets fit in main memory.
- Recent scalable clustering algorithms for categorical data: STIRR [GKR98] and ROCK [GRS99]; the definition of the clusters they produce is not clear.

35 Limitations
- The cluster definition may be too strong for certain applications: we require every pair of attribute values across attributes to be strongly connected.
- Consequence: a large number of clusters.

36 Outline of the talk
- Notion of similarity
- Cluster definition
- The CACTUS algorithm
- Experimental evaluation
- Extensions to CACTUS
- Conclusions

37 Validation
- Scan the dataset once more.
- Compute the supports of the candidate clusters.
- Retain only those with significantly high support (see the sketch below).
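
A sketch of the validation scan; the support threshold is left to the user, as on the slide:

    def validate(tuples, candidates, threshold):
        """Count, in one scan, how many tuples fall inside each
        candidate region C1 x ... x Cn; keep those above threshold."""
        regions = [tuple(set(ci) for ci in cand) for cand in candidates]
        support = [0] * len(regions)
        for t in tuples:                           # one dataset scan
            for k, region in enumerate(regions):
                if all(t[i] in ci for i, ci in enumerate(region)):
                    support[k] += 1
        return [cand for cand, s in zip(candidates, support)
                if s >= threshold]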

38 Computing Cluster Projections: Algorithm
- For attribute A1, compute the cluster projections from the clusters on the pairs (A1,A2), (A1,A3), …, (A1,An): an intersection join on their A1-side projections (cf. the lemma on the next slide).

39 Computing Cluster Projections
[Figure: values of A, B, and C with their strong-connection edges.]
- Lemma: let C = C1 x … x Cn be a cluster. Then Ci is the intersection of {Ci' : (Ci',Ck) is a cluster on (Ai,Ak)}.
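
The lemma translates directly into a set intersection over the pairwise results. A sketch; sides is an assumed list holding, for each other attribute Ak, the Ai-side projection of the relevant cluster on the pair (Ai,Ak):

    def cluster_projection(sides):
        """Ci as the intersection of the Ai-side projections from
        the clusters on the attribute pairs (Ai,Ak), k != i."""
        return set.intersection(*map(set, sides))

In the running example, the A-side projections from the pairs (A,B) and (A,C) are both {a1,a2}, so the intersection recovers the cluster projection {a1,a2}.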

