CACTUS – Clustering Categorical Data Using Summaries
By Venkatesh Ganti, Johannes Gehrke and Raghu Ramakrishnan
RongEn Li, School of Informatics, Edinburgh
Overview
– Introduction and motivation
– Existing tools for clustering categorical data: STIRR and ROCK
– Definition of a cluster over categorical data
– The algorithm – CACTUS
– Experiments and results
– Summary
Introduction and motivation
– Numeric data: {1, 2, 3, 4, 5, …}
– Categorical data: {LFD, PMR, DME}; such domains usually contain a small number of attribute values, and it is typically hard to infer useful information from large domains.
– Solution: use relations! Relations contain different attributes, but the cross product of the attribute domains can be large.
– CACTUS: a fast summarisation-based algorithm which uses summary information to find well-defined clusters.
Existing tools for clustering categorical data
STIRR
– Each attribute value is represented as a weighted vertex in a graph.
– Multiple copies b1,…,bm (basins) of the weighted vertices are maintained; the same vertex can carry different weights in different basins.
– Starting step: assign a set of weights to all vertices in all basins.
– Iterative step: for each tuple t = <t1,…,tn> and each basin bi, increment the weight on vertex tj using a function that combines the weights (within bi) of the vertices other than tj.
– At the fixed point, the large positive weights and the small negative weights across the basins isolate two groups of attribute values on each attribute.
ROCK
– Starts with each tuple in its own cluster.
– Merges close clusters until a required, user-specified number of clusters remains; closeness is defined by a similarity function.
STIRR is used for the comparison with CACTUS.
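A minimal sketch of STIRR's iterative step, assuming a single basin and the simple additive weight combiner; the function and variable names here are mine, not the paper's:

```python
from collections import defaultdict

def stirr_iteration(tuples, weights):
    """tuples: a list of n-tuples of attribute values.
    weights: {(attr_index, value): weight} for every vertex.
    Returns the re-normalised weights after one propagation pass."""
    n = len(tuples[0])
    new_w = defaultdict(float)
    for t in tuples:
        for j in range(n):
            # Combine the weights of the other vertices in the tuple
            # (the simple additive combiner) and add the result to t[j].
            new_w[(j, t[j])] += sum(weights[(i, t[i])] for i in range(n) if i != j)
    # Re-normalise so the weights stay bounded across iterations.
    norm = sum(w * w for w in new_w.values()) ** 0.5 or 1.0
    return {key: w / norm for key, w in new_w.items()}
```

Iterating this from an initial weight assignment until the weights stabilise, with several basins run in parallel, yields the fixed-point pattern the slide describes.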
Definitions: Interval region, support and belonging
– A1,…,An is a set of categorical attributes with domains D1,…,Dn respectively. D is a set of tuples, where each tuple t ∈ D1 × … × Dn.
– Interval region: S = S1 × … × Sn, where Si ⊆ Di for all i ∈ {1,…,n}; the analogue of intervals over numeric data.
– Support of a value pair: σD(ai,aj) = |{t ∈ D : t.Ai = ai and t.Aj = aj}|. The support σD(S) of a region S is the number of tuples in D contained in S.
– Belonging: a tuple t = <t.A1,…,t.An> ∈ D belongs to a region S if t.Ai ∈ Si for all i ∈ {1,…,n}.
Definitions: expected support, strongly connected
Expected support under the attribute-independence assumption:
– Of a region S: E[σD(S)] = |D| · (|S1| × … × |Sn|) / (|D1| × … × |Dn|)
– Of a pair (ai,aj): E[σD(ai,aj)] = |D| / (|Di| · |Dj|)
Strongly connected (α is normally set to 2 or 3):
– ai and aj: if σD(ai,aj) > α · E[σD(ai,aj)], then σ*D(ai,aj) = σD(ai,aj); otherwise σ*D(ai,aj) = 0.
– ai ∈ Si w.r.t. Sj: ai and x are strongly connected for all x ∈ Sj.
– Si and Sj: each ai ∈ Si is strongly connected with each aj ∈ Sj, and each aj ∈ Sj is strongly connected with each ai ∈ Si.
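As a small illustration (the function name and data representation are my own assumptions), the pairwise supports and the σ* filter for one attribute pair can be computed in a single pass:

```python
from collections import Counter

def strongly_connected_pairs(tuples, i, j, dom_i, dom_j, alpha=3.0):
    """Returns {(ai, aj): sigma*_D(ai, aj)} for attributes Ai and Aj."""
    support = Counter((t[i], t[j]) for t in tuples)      # sigma_D(ai, aj)
    expected = len(tuples) / (len(dom_i) * len(dom_j))   # E[sigma_D(ai, aj)]
    # Keep the support only where it exceeds alpha times the expectation;
    # everything else is implicitly sigma* = 0 by omission.
    return {pair: s for pair, s in support.items() if s > alpha * expected}
```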
Definitions: Cluster, cluster-projection, sub-cluster and subspace cluster
C = <C1,…,Cn> is a cluster over {A1,…,An} if:
– 1. Ci and Cj are strongly connected for all i ≠ j.
– 2. There exists no proper superset C'i of Ci such that C'i and each Cj (j ≠ i) are strongly connected, i.e. each Ci is maximal.
– 3. The support σD(C) of C is ≥ α times the expected support of C under the attribute-independence assumption.
Ci is a cluster-projection of C on Ai. C is a sub-cluster if it satisfies only conditions 1 and 3. A cluster C over a proper subset S of the attributes {A1,…,An} is a subspace cluster on S.
Definitions: similarity, inter-attribute summaries, intra-attribute summaries
Similarity of two values of Ai w.r.t. Aj:
– γj(a1,a2) = |{x ∈ Dj : σ*D(a1,x) > 0 and σ*D(a2,x) > 0}|
Inter-attribute summary:
– Σij = {(ai, aj, σ*D(ai,aj)) : ai ∈ Di, aj ∈ Dj, and σ*D(ai,aj) > 0}
– The strongly connected pairs of attribute values, where the values in each pair come from different attributes.
Intra-attribute summary:
– Σii = {(a1, a2, γj(a1,a2)) : a1, a2 ∈ Di, and γj(a1,a2) > 0}
– Similarities between attribute values of the same attribute.
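A sketch of how the intra-attribute similarities can be derived from an inter-attribute summary without re-reading the data, assuming the σ* map produced by the previous sketch (names are mine):

```python
from collections import defaultdict
from itertools import combinations

def intra_attribute_summary(sigma_star):
    """sigma_star: {(ai, x): sigma*} for one attribute pair (Ai, Aj),
    as produced by the previous sketch. Returns {(a1, a2): gamma_j(a1, a2)}."""
    # Group the Ai-values strongly connected to each x in Dj.
    by_x = defaultdict(set)
    for ai, x in sigma_star:
        by_x[x].add(ai)
    gamma = defaultdict(int)
    for values in by_x.values():
        # Every pair of Ai-values sharing this x gains one common witness.
        for a1, a2 in combinations(sorted(values), 2):
            gamma[a1, a2] += 1
    return dict(gamma)
```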
CACTUS Vs STIRR: clusters found by CACTUS
CACTUS Vs STIRR: clusters found by STIRR
CACTUS: CAtegorical ClusTering Using Summaries
Central idea: the data summary (the inter- and intra-attribute summaries) is sufficient to find candidate clusters, which can then be validated.
A three-phase clustering algorithm:
– Summarisation
– Clustering
– Validation
Summarisation Phase
Assumption: the inter- and intra-attribute summaries of any pair of attributes fit easily into main memory.
Inter-attribute summaries:
– Use a counter, initially 0, for each pair (ai,aj) ∈ Di × Dj.
– Scan the dataset, incrementing the counter for each pair that occurs.
– After the scan, compute σ*D(ai,aj), reset the counters of pairs whose support does not exceed α · E[σD(ai,aj)], and store the surviving value pairs.
Intra-attribute summaries:
– Computed by joining the inter-attribute summary with itself: pairs of entries (T1, T2) with T1.x = T2.x witness that T1.a and T2.a are both strongly connected with the same x, which is exactly what γ counts.
– A very fast operation, hence computed only when needed.
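Composing the earlier sketches gives a toy version of the summarisation phase; it assumes the strongly_connected_pairs helper from above and is an illustration, not the paper's implementation:

```python
def summarise(tuples, domains, alpha=3.0):
    """domains: list of the attribute domains D1..Dn.
    Returns {(i, j): sigma*-map} for every attribute pair i < j;
    intra-attribute summaries are derived from these on demand."""
    n = len(domains)
    return {
        (i, j): strongly_connected_pairs(tuples, i, j, domains[i], domains[j], alpha)
        for i in range(n)
        for j in range(i + 1, n)
    }
```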
Clustering Phase
A two-step operation:
– Step 1: analyse each attribute to compute all cluster-projections on it.
– Step 2: synthesise candidate clusters on sets of attributes from the cluster-projections on the individual attributes.
Clustering Phase continued
Step 1: compute cluster-projections on attributes.
– Step A: find all cluster-projections on Ai of clusters over (Ai,Aj).
– Step B: compute all cluster-projections on Ai of clusters over {A1,…,An} by intersecting the sets of cluster-projections from Step A.
– Step A is NP-hard! Solution: use distinguishing sets, which identify the different cluster-projections. Construct candidate distinguishing sets on Ai and extend some of them w.r.t. Aj. The detailed steps are too long for this presentation, sorry!
– Step B uses the intersection join: S1 ∩' S2 = {s : there exist s1 ∈ S1 and s2 ∈ S2 such that s = s1 ∩ s2 and |s| > 1}, applied to all the sets of attribute values on Ai (see the sketch below).
Step 2: try to augment a candidate cluster <c1,…,ck> with a cluster-projection ck+1 on attribute Ak+1. If (ci, ck+1) is a sub-cluster on (Ai,Ak+1) for every i ∈ {1,…,k}, extend the candidate to <c1,…,ck,ck+1>.
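The intersection join of Step B is easy to state in code; this hedged sketch assumes the cluster-projections are represented as frozensets:

```python
def intersection_join(S1, S2):
    """S1, S2: collections of frozensets of attribute values on Ai.
    Keeps every pairwise intersection with more than one value."""
    return {s1 & s2 for s1 in S1 for s2 in S2 if len(s1 & s2) > 1}
```

For example, intersection_join({frozenset('abc')}, {frozenset('bcd')}) yields {frozenset({'b', 'c'})}.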
Validation Phase
Use a support threshold to recognise false candidates: a candidate cluster may lack sufficient support because some of the 2-clusters combined to form it may be supported by different sets of tuples.
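A sketch of that validation check, reusing the support condition from the cluster definition (support at least α times the expected support); the names here are assumptions:

```python
def validate(tuples, domains, candidate, alpha=3.0):
    """candidate: a list of value sets C1..Cn, one per attribute.
    Returns True if the candidate meets the support requirement."""
    # Actual support: the number of tuples contained in C1 x ... x Cn.
    support = sum(all(t[i] in Ci for i, Ci in enumerate(candidate)) for t in tuples)
    # Expected support of the region under attribute independence.
    expected = len(tuples)
    for Ci, Di in zip(candidate, domains):
        expected *= len(Ci) / len(Di)
    return support >= alpha * expected
```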
Experiments and Results
– Setup for the comparison with STIRR: 1 million tuples, 10 attributes, and 100 attribute values per attribute.
– Result: CACTUS discovers a broader class of clusters than STIRR.
Conclusion
– The authors formalised the definition of a cluster over categorical data.
– CACTUS is a fast and efficient algorithm for clustering categorical data.
– I am sorry that I did not show some parts of the algorithm due to time constraints.
Question Time