An Efficient Method for Projected Clustering


1 An Efficient Method for Projected Clustering
Hongyin Cui, Jiang Ye
School of Computing Science, Simon Fraser University

2 Introduction
Clustering is a widely used technique for data mining, indexing and classification. Most clustering algorithms do not work efficiently or effectively in high-dimensional spaces because of the inherent sparsity of the data.

3 Clusters may exist in different subspaces composed of different combinations of attributes

4 Related Work CLIQUE: density-based and grid-based
It partitions each dimension into the same number of equal-length intervals.
It partitions an m-dimensional data space into non-overlapping rectangular units.
A unit is dense if the fraction of total data points contained in it exceeds an input model parameter.
A cluster is a maximal set of connected dense units within a subspace.
A bottom-up greedy algorithm, with an exponential dependency on the number of dimensions.

5 Related Work (Cont.) PROCLUS:
Finds the best set of medoids by a “hill climbing” process.
Searches not just the space of possible medoids but also the space of possible dimensions associated with each medoid.
Because it uses a locality analysis, its result may be a local optimum.

6 Related Work (Cont.) DOC:
A dense projective cluster is a pair (C, D), where C is a subset of the data set S and D is a subset of the full dimension set [d].
|C| must be sufficiently large, i.e. |C| ≥ α|S|.
∀i ∈ D: max_{p∈C} p_i − min_{q∈C} q_i ≤ w.
∀i ∈ [d] − D: max_{p∈C} p_i − min_{q∈C} q_i > w.
The algorithm repeatedly chooses p ∈ S and X ⊆ S via random sampling and computes the corresponding cluster (C, D).
It reports the best cluster found, which is an approximation of the optimal projective cluster.

7 Problem Definition key observations:
Often, many records in a database share similar values for several attributes.
Identifying and grouping records that share similar values on some attributes both gives useful insight into the data (projected clusters) and yields a more parsimonious representation of the data.

8 Problem Definition (cont.)
The user can define discretization criteria by specifying an interval wi for each attribute i, or by using a global interval w for all attributes.
We say a group of records shares a similar value on attribute i if they have the same discretized value on i (i.e. they fall in the same interval).
For example, Figure 1: a fragment of the NHL Players' Statistics Table (1996):

Name          Position  Points  Played Mins  Penalty Mins
Blake         Defense   43      395          34
Borque                  80      430          22
Gullimore               3       30           18
Gretzky       Centre    89      458          26
Konstantinov            10      560          120
May           Winger    35      290          180
Odjick                  9       115          245
Tkachuk       Center    82      475          160
Wotton                  5       38           6

9 Problem Definition (cont.)
In Figure 1, suppose the discretization intervals imposed on the attributes are: Position => already discrete; wPoints = 10, wPlayedMins = 60, wPenaltyMins = 20.
We find, for example:
{Borque, Gretzky, Tkachuk} share {Points, Played Mins}: they played and scored a lot.
{Gullimore, Wotton} share {Position, Points, Played Mins, Penalty Mins}: same position; they played, scored and were penalized sparingly.
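As an aside not on the original slides, here is a minimal Python sketch of the discretization above, assuming intervals of the form [k·wi, (k+1)·wi); the data come from Figure 1 and the helper name is ours.

```python
# Minimal sketch (not the authors' code): discretize the three high-scoring players
# from Figure 1 with w_Points = 10, w_PlayedMins = 60, w_PenaltyMins = 20,
# assuming intervals of the form [k*w, (k+1)*w).
players = {
    "Borque":  {"Points": 80, "PlayedMins": 430, "PenaltyMins": 22},
    "Gretzky": {"Points": 89, "PlayedMins": 458, "PenaltyMins": 26},
    "Tkachuk": {"Points": 82, "PlayedMins": 475, "PenaltyMins": 160},
}
widths = {"Points": 10, "PlayedMins": 60, "PenaltyMins": 20}

def discretize(value, width):
    # Interval index of a numeric value.
    return int(value // width)

for name, attrs in players.items():
    print(name, {a: discretize(v, widths[a]) for a, v in attrs.items()})
# All three land in interval 8 on Points and interval 7 on Played Mins, but not in
# the same Penalty Mins interval, so they form the projected cluster
# {Borque, Gretzky, Tkachuk} on {Points, Played Mins}.
```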

10 Problem Definition (cont.)
Let p = (p1, …, pd) be a point in R^d, let [d] denote the set of the d dimensions, and let wi ≥ 0 for each i ∈ [d].
∀i ∈ [d], dimension i is partitioned and pi is discretized by wi.
Let S be a set of points in R^d. For any 0 ≤ α ≤ 1, a projected cluster in S is a pair (C, D), C ⊆ S, D ⊆ [d], such that:
|C| ≥ α|S|;
∀j ∈ D, all points in C share the same discretized value on attribute j (i.e. they fall in the same interval);
no D' ⊃ D also satisfies the above two conditions.
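A hedged Python sketch of this definition (our own illustration, not the authors' code): it checks the first two conditions for a candidate (C, D), but not the maximality of D, which would need the full data set.

```python
from typing import Dict, List, Set

def satisfies_cluster_conditions(C: List[Dict[str, float]], S_size: int,
                                 D: Set[str], widths: Dict[str, float],
                                 alpha: float) -> bool:
    # Condition 1: the cluster is large enough, |C| >= alpha * |S|.
    if len(C) < alpha * S_size:
        return False
    # Condition 2: on every attribute in D, all points of C fall in the same interval.
    for j in D:
        intervals = {int(p[j] // widths[j]) for p in C}
        if len(intervals) > 1:
            return False
    return True
```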

11 FIPCLUS: Mining projected clusters via frequent closed itemsets
Basic Steps
Step 1: discretize each point p on each attribute.
Step 2: create a transaction database.
Step 3: mine frequent closed itemsets with the CLOSET algorithm; each itemset identifies one subspace.
Step 4: find the corresponding group of points for each subspace by scanning the DB once (see the sketch after this list).
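Step 4 has no slide of its own, so here is a minimal sketch of it (our own code, with hypothetical names): one scan over the transaction database assigns every record to each frequent closed itemset, i.e. subspace signature, that it contains.

```python
from collections import defaultdict

def groups_for_subspaces(transactions, closed_itemsets):
    # transactions: list of iterables of encoded items (one per record).
    # closed_itemsets: iterable of frozensets produced by Step 3.
    groups = defaultdict(list)
    for record_id, items in enumerate(transactions):
        item_set = set(items)
        for itemset in closed_itemsets:
            if itemset <= item_set:
                groups[itemset].append(record_id)
    return groups
```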

12 FIPCLUS: Mining projected clusters via frequent closed itemsets (cont.)
Step 1 i[d], partition dimension i and discretize pi by wi. Or Discretize pi using users specified criteria. Or ignore step 1, if users provide discretized data. 2019/5/2

13 FIPCLUS: Mining projected clusters via frequent closed itemsets (cont.)
Step 2 i[d], enumerate and number each discretized value with a different integer j, and all numbers are continuous. E.g. Position={defense, center, winger}, then defense=1, center=2 and winger=3 Substitute each discretized value in [d] with an unique integer, i*d+j. The original database is transformed into a transaction database. 2019/5/2

14 FIPCLUS: Mining projected clusters via frequent closed itemsets (cont.)
Step 3: based on CLOSET [Jian Pei, Jiawei Han]
Definition (frequent closed itemset): an itemset X is a closed itemset if there exists no itemset X' such that (1) X' is a proper superset of X, and (2) every transaction containing X also contains X'. A closed itemset X is frequent if its support passes the given support threshold.
CLOSET is based on the FP-tree and needs no candidate generation.
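To make the definition concrete, a small hedged check (our illustration, not part of CLOSET): an itemset is closed exactly when it equals the intersection of all transactions that contain it.

```python
def is_closed(itemset, transactions):
    # transactions: list of sets of items.
    itemset = frozenset(itemset)
    supporting = [set(t) for t in transactions if itemset <= set(t)]
    if not supporting:
        return False
    # The closure of X is the intersection of all transactions containing X;
    # X is closed iff it equals its own closure.
    return set.intersection(*supporting) == itemset
```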

15 FIPCLUS: Mining projected clusters via frequent closed itemsets (cont.)
CLOSET
Input: transaction database TDB and support threshold min_sup.
Output: the complete set of frequent closed itemsets.
Method:
Initialization: let FCI be the set of frequent closed itemsets; initialize FCI = ∅.
Find frequent items: scan the transaction database TDB and compute the frequent item list f_list.
Mine frequent closed itemsets recursively: call CLOSET(∅, TDB, f_list, FCI).

16 FIPCLUS: Mining projected clusters via frequent closed itemsets (cont.)
CLOSET(X, DB, f_list, FCI)
Parameters:
X: the current frequent itemset.
DB: the X-conditional database, i.e. the subset of transactions in TDB containing X.
f_list: the frequent item list of DB.
FCI: the set of frequent closed itemsets found so far.

17 FIPCLUS: Mining projected clusters via frequent closed itemsets (cont.)
CLOSET(X, DB, f_list, FCI) {
Extract the set Y of items appearing in every transaction of DB; insert X ∪ Y into FCI if it is not a subset of some itemset in FCI with the same support.
Build the FP-tree for DB, excluding the items in Y; directly extract frequent closed itemsets from the FP-tree.
∀i ∈ rest of f_list, form the conditional database DB|i and compute its local frequent item list f_list_i.
∀i ∈ rest of f_list, call CLOSET(i ∪ X, DB|i, f_list_i, FCI) if i ∪ X is not a subset of any frequent closed itemset in FCI with the same support.
}
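The following is a hedged, FP-tree-free sketch of this recursion (our own simplification, not the real CLOSET): conditional databases are kept as plain lists of sets, which preserves the divide-and-conquer structure while dropping the compressed-tree optimization.

```python
def mine_closed(X, db, min_sup, fci):
    # X: current itemset (set of items); db: X-conditional database, a list of sets,
    # each a full transaction containing X; fci: dict closed itemset -> support.
    support = len(db)
    if support < min_sup:
        return
    # Items appearing in every transaction of db form the closure of X.
    closure = frozenset(set.intersection(*db)) if db else frozenset(X)
    # Record it unless a superset with the same support is already known.
    if not any(closure <= other and fci[other] == support for other in fci):
        fci[closure] = support
    # Recurse on each remaining item, forming i-conditional databases.
    remaining = {i for t in db for i in t} - closure
    for i in sorted(remaining):
        cond_db = [t for t in db if i in t]
        if len(cond_db) >= min_sup:
            mine_closed(set(closure) | {i}, cond_db, min_sup, fci)

# Usage: fci = {}; mine_closed(set(), [set(t) for t in transactions], min_sup, fci)
```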

18 Evaluation & Comparison
Definition: more flexible and meaningful.
No assumption on the distribution of C in D.
Allows a different interval wi on each dimension, or other flexible discretization criteria.
CLIQUE: partitions each dimension into a fixed number of intervals, which is not flexible; it is also hard to determine the density threshold for each unit.
PROCLUS: distance-based, so it has all the flaws of distance-based methods.
DOC: a very similar definition, but it uses a global interval width w for every dimension, which is not flexible.

19 Evaluation & Comparison
Our algorithm: solves the clustering problem by mining frequent itemsets, so it is more efficient, scalable and faster on large databases.
Runtime complexity is O(N), where N = |DB|; typically 4 or 5 scans of the DB.
CLIQUE: the bottom-up construction generates a huge number of candidates, each of which needs one scan of the DB. --- not efficient
PROCLUS: finds the best set of medoids by a “hill climbing” process; it uses a locality analysis, so its result may be a local optimum. Runtime complexity is O(N·k·l + N·k·d), where k is the number of clusters, l the average dimensionality of the subspaces, and d the full dimensionality. --- less efficient

20 Evaluation & Comparison
DOC: finds an approximation of the clusters via random sampling.
Not complete, and the quality cannot be guaranteed.
Runtime complexity is O(N·d^(c+1)), where c is a constant, d the full dimensionality, and N = |DB|. --- less efficient

21 Conclusion We proposed FIPCLUS, which
Efficiently mines projected clusters via frequent closed itemsets.
Applies a compressed FP-tree structure for mining frequent closed itemsets without candidate generation.
Generates a much smaller set of frequent itemsets, leading to fewer and more interesting projected clusters.

22 Weakness & Future Work
Weakness:
FIPCLUS may generate some overlapping clusters. E.g. for (C1, D1) and (C2, D2) with C1 = {a, b, c}, D1 = {d1, d2, d3, d4} and C2 = {a, b, c, e, f}, D2 = {d1, d2}.
Future work:
Modify FIPCLUS to mine maximal frequent itemsets to address the above weakness; in the example above it would then output only (C1, D1). This is a tradeoff, since maximal frequent itemsets may lose some interesting clusters and information (see the sketch after this list).
Evaluate its effectiveness.
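A minimal sketch (ours, with hypothetical names) of the subsumption filter that mining maximal itemsets would effectively give: keep a cluster only if its dimension set is not a proper subset of another cluster's dimension set.

```python
def keep_dimension_maximal(clusters):
    # clusters: list of (C, D) pairs, where C and D are frozensets.
    kept = []
    for C, D in clusters:
        if not any(D < D_other for _, D_other in clusters):
            kept.append((C, D))
    return kept

# On the example above this keeps (C1, D1) = ({a, b, c}, {d1, d2, d3, d4})
# and drops (C2, D2) = ({a, b, c, e, f}, {d1, d2}).
```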

23 References
[1] R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications.
[2] C. Procopiuc, M. Jones, P. Agarwal, T. M. Murali. A Monte Carlo Algorithm for Fast Projective Clustering.
[3] C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, J. Park. Fast Algorithms for Projected Clustering.
[4] J. Pei, J. Han, R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets.
[5] K. Yip, D. Cheung, M. Ng. A Highly-Usable Projected Clustering Algorithm for Gene Expression Profiles.
[6] H. V. Jagadish, J. Madar, R. Ng. Semantic Compression and Pattern Extraction with Fascicles.

