CACTUS – Clustering Categorical Data Using Summaries
By Venkatesh Ganti, Johannes Gehrke and Raghu Ramakrishnan
RongEn Li, School of Informatics, Edinburgh
Overview
– Introduction and motivation
– Existing tools for clustering categorical data: STIRR and ROCK
– Definition of a cluster over categorical data
– The algorithm – CACTUS
– Experiments and results
– Summary
Introduction and motivation
– Numeric data: {1, 2, 3, 4, 5, …}
– Categorical data: {LFD, PMR, DME}; such domains usually contain a small number of attribute values, and it is typically hard to infer useful information from large domains.
– Solution: use relations! Relations contain different attributes, but the cross product of the attribute domains can be large.
– CACTUS: a fast summarisation-based algorithm which uses summary information to find well-defined clusters.
Existing tools for clustering categorical data
STIRR
– Each attribute value is represented as a weighted vertex in a graph.
– Multiple copies b1,…,bm (basins) of the weighted vertices are maintained; the same vertex can carry different weights in different basins.
– Starting step: assign a set of weights to all vertices in all basins.
– Iterative step: for each tuple t = <t1,…,tn> and each basin bi, increment the weight on vertex tj using a function that combines the weights (within bi) of the vertices other than tj.
– At the fixed point, the large positive weights and the small negative weights across the basins isolate two groups of attribute values on each attribute.
ROCK
– Starts with each tuple in its own cluster.
– Merges close clusters until a required, user-specified number of clusters remains; closeness is defined by a similarity function.
STIRR is used for the comparison with CACTUS.
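A minimal sketch of STIRR's iterative step, assuming a single basin and the simple additive weight combiner; the function and variable names here are mine, not the paper's:

```python
from collections import defaultdict

def stirr_iteration(tuples, weights):
    """tuples: a list of n-tuples of attribute values.
    weights: {(attr_index, value): weight} for every vertex.
    Returns the re-normalised weights after one propagation pass."""
    n = len(tuples[0])
    new_w = defaultdict(float)
    for t in tuples:
        for j in range(n):
            # Combine the weights of the other vertices in the tuple
            # (the simple additive combiner) and add the result to t[j].
            new_w[(j, t[j])] += sum(weights[(i, t[i])] for i in range(n) if i != j)
    # Re-normalise so the weights stay bounded across iterations.
    norm = sum(w * w for w in new_w.values()) ** 0.5 or 1.0
    return {key: w / norm for key, w in new_w.items()}
```

Iterating this from an initial weight assignment until the weights stabilise, with several basins run in parallel, yields the fixed-point pattern the slide describes.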
Definitions: Interval region, support and belonging
– A1,…,An is a set of categorical attributes with domains D1,…,Dn respectively. D is a set of tuples, where each tuple t ∈ D1 × … × Dn.
– Interval region: S = S1 × … × Sn, where Si ⊆ Di for all i ∈ {1,…,n}; the analogue of intervals over numeric data.
– Support of a value pair: σD(ai,aj) = |{t ∈ D : t.Ai = ai and t.Aj = aj}|. The support σD(S) of a region S is the number of tuples in D contained in S.
– Belonging: a tuple t = <t.A1,…,t.An> ∈ D belongs to a region S if t.Ai ∈ Si for all i ∈ {1,…,n}.
Definitions: expected support, strongly connected
Expected support under the attribute-independence assumption:
– Of a region S: E[σD(S)] = |D| · (|S1| × … × |Sn|) / (|D1| × … × |Dn|)
– Of a pair (ai,aj): E[σD(ai,aj)] = |D| / (|Di| · |Dj|)
Strongly connected (α is normally set to 2 or 3):
– ai and aj: if σD(ai,aj) > α · E[σD(ai,aj)], then σ*D(ai,aj) = σD(ai,aj); otherwise σ*D(ai,aj) = 0.
– ai ∈ Si w.r.t. Sj: ai and x are strongly connected for all x ∈ Sj.
– Si and Sj: each ai ∈ Si is strongly connected with each aj ∈ Sj, and each aj ∈ Sj is strongly connected with each ai ∈ Si.
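As a small illustration (the function name and data representation are my own assumptions), the pairwise supports and the σ* filter for one attribute pair can be computed in a single pass:

```python
from collections import Counter

def strongly_connected_pairs(tuples, i, j, dom_i, dom_j, alpha=3.0):
    """Returns {(ai, aj): sigma*_D(ai, aj)} for attributes Ai and Aj."""
    support = Counter((t[i], t[j]) for t in tuples)      # sigma_D(ai, aj)
    expected = len(tuples) / (len(dom_i) * len(dom_j))   # E[sigma_D(ai, aj)]
    # Keep the support only where it exceeds alpha times the expectation;
    # everything else is implicitly sigma* = 0 by omission.
    return {pair: s for pair, s in support.items() if s > alpha * expected}
```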
Definitions: Cluster, cluster-projection, sub-cluster and subspace cluster
C = <C1,…,Cn> is a cluster over {A1,…,An} if:
– 1. Ci and Cj are strongly connected for all i ≠ j.
– 2. There exists no proper superset C'i of Ci such that C'i and each Cj (j ≠ i) are strongly connected, i.e. each Ci is maximal.
– 3. The support σD(C) of C is ≥ α times the expected support of C under the attribute-independence assumption.
Ci is a cluster-projection of C on Ai. C is a sub-cluster if it satisfies only conditions 1 and 3. A cluster C over a proper subset S of the attributes {A1,…,An} is a subspace cluster on S.
Definitions: similarity, inter-attribute summaries, intra-attribute summaries
Similarity of two values of Ai w.r.t. Aj:
– γj(a1,a2) = |{x ∈ Dj : σ*D(a1,x) > 0 and σ*D(a2,x) > 0}|
Inter-attribute summary:
– Σij = {(ai, aj, σ*D(ai,aj)) : ai ∈ Di, aj ∈ Dj, and σ*D(ai,aj) > 0}
– The strongly connected pairs of attribute values, where the values in each pair come from different attributes.
Intra-attribute summary:
– Σii = {(a1, a2, γj(a1,a2)) : a1, a2 ∈ Di, and γj(a1,a2) > 0}
– Similarities between attribute values of the same attribute.
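A sketch of how the intra-attribute similarities can be derived from an inter-attribute summary without re-reading the data, assuming the σ* map produced by the previous sketch (names are mine):

```python
from collections import defaultdict
from itertools import combinations

def intra_attribute_summary(sigma_star):
    """sigma_star: {(ai, x): sigma*} for one attribute pair (Ai, Aj),
    as produced by the previous sketch. Returns {(a1, a2): gamma_j(a1, a2)}."""
    # Group the Ai-values strongly connected to each x in Dj.
    by_x = defaultdict(set)
    for ai, x in sigma_star:
        by_x[x].add(ai)
    gamma = defaultdict(int)
    for values in by_x.values():
        # Every pair of Ai-values sharing this x gains one common witness.
        for a1, a2 in combinations(sorted(values), 2):
            gamma[a1, a2] += 1
    return dict(gamma)
```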
CACTUS Vs STIRR: clusters found by CACTUS
CACTUS Vs STIRR: clusters found by STIRR
CACTUS: CAtegorical ClusTering Using Summaries
Central idea: the data summary (the inter- and intra-attribute summaries) is sufficient to find candidate clusters, which can then be validated.
A three-phase clustering algorithm:
– Summarisation
– Clustering
– Validation
Summarisation Phase
Assumption: the inter- and intra-attribute summaries of any pair of attributes fit easily into main memory.
Inter-attribute summaries:
– Use a counter, initially 0, for each pair (ai,aj) ∈ Di × Dj.
– Scan the dataset, incrementing the counter for each pair that occurs.
– After the scan, compute σ*D(ai,aj), reset the counters of pairs whose support does not exceed α · E[σD(ai,aj)], and store the surviving value pairs.
Intra-attribute summaries:
– Computed by joining the inter-attribute summary with itself: pairs of entries (T1, T2) with T1.x = T2.x witness that T1.a and T2.a are both strongly connected with the same x, which is exactly what γ counts.
– A very fast operation, hence computed only when needed.
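Composing the earlier sketches gives a toy version of the summarisation phase; it assumes the strongly_connected_pairs helper from above and is an illustration, not the paper's implementation:

```python
def summarise(tuples, domains, alpha=3.0):
    """domains: list of the attribute domains D1..Dn.
    Returns {(i, j): sigma*-map} for every attribute pair i < j;
    intra-attribute summaries are derived from these on demand."""
    n = len(domains)
    return {
        (i, j): strongly_connected_pairs(tuples, i, j, domains[i], domains[j], alpha)
        for i in range(n)
        for j in range(i + 1, n)
    }
```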
Clustering Phase
A two-step operation:
– Step 1: analyse each attribute to compute all cluster-projections on it.
– Step 2: synthesise candidate clusters on sets of attributes from the cluster-projections on the individual attributes.
Clustering Phase continued
Step 1: compute cluster-projections on attributes.
– Step A: find all cluster-projections on Ai of clusters over (Ai,Aj).
– Step B: compute all cluster-projections on Ai of clusters over {A1,…,An} by intersecting the sets of cluster-projections from Step A.
– Step A is NP-hard! Solution: use distinguishing sets, which identify the different cluster-projections. Construct candidate distinguishing sets on Ai and extend some of them w.r.t. Aj. The detailed steps are too long for this presentation, sorry!
– Step B uses the intersection join: S1 ∩' S2 = {s : there exist s1 ∈ S1 and s2 ∈ S2 such that s = s1 ∩ s2 and |s| > 1}, applied to all the sets of attribute values on Ai (see the sketch below).
Step 2: try to augment a candidate cluster <c1,…,ck> with a cluster-projection ck+1 on attribute Ak+1. If (ci, ck+1) is a sub-cluster on (Ai,Ak+1) for every i ∈ {1,…,k}, extend the candidate to <c1,…,ck,ck+1>.
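The intersection join of Step B is easy to state in code; this hedged sketch assumes the cluster-projections are represented as frozensets:

```python
def intersection_join(S1, S2):
    """S1, S2: collections of frozensets of attribute values on Ai.
    Keeps every pairwise intersection with more than one value."""
    return {s1 & s2 for s1 in S1 for s2 in S2 if len(s1 & s2) > 1}
```

For example, intersection_join({frozenset('abc')}, {frozenset('bcd')}) yields {frozenset({'b', 'c'})}.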
Validation Phase
Use a support threshold to recognise false candidates: a candidate cluster may lack sufficient support because some of the 2-clusters combined to form it may be supported by different sets of tuples.
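A sketch of that validation check, reusing the support condition from the cluster definition (support at least α times the expected support); the names here are assumptions:

```python
def validate(tuples, domains, candidate, alpha=3.0):
    """candidate: a list of value sets C1..Cn, one per attribute.
    Returns True if the candidate meets the support requirement."""
    # Actual support: the number of tuples contained in C1 x ... x Cn.
    support = sum(all(t[i] in Ci for i, Ci in enumerate(candidate)) for t in tuples)
    # Expected support of the region under attribute independence.
    expected = len(tuples)
    for Ci, Di in zip(candidate, domains):
        expected *= len(Ci) / len(Di)
    return support >= alpha * expected
```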
Experiments and Results
– Setup for the comparison with STIRR: 1 million tuples, 10 attributes, and 100 attribute values per attribute.
– Result: CACTUS discovers a broader class of clusters than STIRR.
Conclusion
– The authors formalised the definition of a cluster over categorical data.
– CACTUS is a fast and efficient algorithm for clustering categorical data.
– I am sorry that I did not show some parts of the algorithm due to time constraints.
Question Time