Feature Selection for Clustering
Authors: Manoranjan Dash, Kiseok Choi, Peter Scheuermann, Huan Liu (IEEE, 2002)
Advisor: Dr. Hsu
Graduate student: Yu-Cheng Chen
Intelligent Database Systems Lab, National Yunlin University of Science and Technology
Outline
- Motivation
- Objective
- Introduction
- Properties of Feature Selection
- Distance-based Entropy Measure and Its Efficient Calculation
- Feature Selection Algorithm
- Experimental Evaluation
- Conclusions
- Personal Opinion
Motivation
- Only a few methods have been proposed for feature selection for clustering, and most of them are 'wrapper' techniques, which suffer from:
  - heavy reliance on clustering algorithms;
  - lack of clustering criteria to evaluate clusterings in different subspaces.
Objective
- Propose a feature selection method that is independent of any clustering algorithm.
Introduction
- Why is feature selection important? Removing unimportant features:
  - reduces the data size;
  - improves learning accuracy.
- Drawbacks of the wrapper method:
  - there is no generally accepted criterion to estimate the accuracy of a clustering;
  - it relies on clustering algorithms.
Introduction
- We propose and evaluate a 'filter' method for feature selection that is independent of clustering algorithms.
- It is based on the observation that data with clusters has a very different point-to-point distance histogram than data without clusters.
Properties of Feature Selection
- Here we show an example using synthetic data in (3, 2, 1)-dimensional spaces, and observe the following two distinct scenarios:
  1. A single feature defines clusters independently.
  2. Individual features do not define clusters, but correlated features do.
Properties of Feature Selection
- Distance histogram:
  - has 100 buckets;
  - distances are normalized into [0, 1];
  - the x value is the bucket number and the y value is the frequency of bucket x (see the sketch below).
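As a minimal sketch (not from the paper; the function name and the min-max normalization are our own assumptions), such a histogram can be computed as follows:

```python
import numpy as np
from scipy.spatial.distance import pdist

def distance_histogram(X, n_buckets=100):
    """Return bucket frequencies of normalized pairwise distances of X (N x M)."""
    d = pdist(X)                                  # all N*(N-1)/2 point-to-point distances
    d = (d - d.min()) / (d.max() - d.min())       # normalize distances into [0, 1]
    freq, _ = np.histogram(d, bins=n_buckets, range=(0.0, 1.0))
    return freq                                   # x = bucket number, y = freq[x]
```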
Properties of Feature Selection
- An important distinction between the two histograms:
  - the histogram for data without clusters has a bell shape;
  - the histogram for data with clusters has a different distribution.
- Typically, if the dataset consists of clusters, the majority of intra-cluster distances are smaller than the majority of inter-cluster distances.
Distance-based Entropy Measure and Its Efficient Calculation
- Entropy is defined as

  E = -\sum_{i=1}^{N} p(X_i) \log p(X_i),

  where X_i is the i-th data point, X_{ik} is the k-th feature value of the i-th point, i = 1...N, and k = 1...M.
Distance-based Entropy Measure and Its Efficient Calculation
- Because we do not know the probabilities of the points, we estimate the entropy with the following distance-based proxy:

  E = -\sum_{i=1}^{N} \sum_{j=1}^{N} \big( D_{ij} \log_2 D_{ij} + (1 - D_{ij}) \log_2 (1 - D_{ij}) \big),

  where D_{ij} is the normalized distance in the range [0, 1] between X_i and X_j.
- Per pair, entropy is 0.0 when the distance is 0.0 or 1.0, and maximal (1.0) when the distance is 0.5.
- The goal is to assign low entropy to intra-cluster and inter-cluster distances, and higher entropy to noisy distances.
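The stated per-pair properties match the binary entropy function of the distance, so the proxy can be sketched as follows (a minimal illustration under that assumption; the normalization and clipping steps are ours):

```python
import numpy as np
from scipy.spatial.distance import pdist

def distance_entropy(X, eps=1e-12):
    """Total entropy of normalized pairwise distances.

    Each distance D contributes -D*log2(D) - (1-D)*log2(1-D):
    0 when D is 0.0 or 1.0, and maximal (1.0) when D is 0.5.
    """
    d = pdist(X)
    d = (d - d.min()) / (d.max() - d.min())   # normalize into [0, 1]
    d = np.clip(d, eps, 1.0 - eps)            # keep log() finite at the endpoints
    return float(np.sum(-d * np.log2(d) - (1.0 - d) * np.log2(1.0 - d)))
```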
Distance-based Entropy Measure and Its Efficient Calculation
- The following figure shows the entropy-distance relationship, but it still has two drawbacks:
  - the meeting point (μ) of the two sides can be an inter-cluster distance, yet it is still assigned the highest entropy;
  - entropy increases rapidly for very small distances.
Distance-based Entropy Measure and Its Efficient Calculation
- For the first drawback, we set the meeting point μ so as to separate the intra-cluster and inter-cluster distances more accurately.
- The second drawback can be overcome by incorporating a coefficient β into the equation (one possible form is sketched below).
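Below is one piecewise-exponential form consistent with the properties the slides describe; it is an assumption, not necessarily the authors' exact equation. It is 0 at d = 0 and d = 1, peaks at d = μ, and stays low for very small distances as β grows:

```python
import numpy as np

def entropy_mu_beta(d, mu=0.5, beta=10.0):
    """Piecewise-exponential entropy for normalized distances d in [0, 1].

    Rises from 0 at d = 0 to 1 at d = mu, then falls back to 0 at d = 1.
    A larger beta keeps entropy low for very small distances (drawback 2);
    mu shifts the meeting point of the two sides (drawback 1).
    NOTE: an assumed form, not confirmed to be the paper's equation.
    """
    d = np.asarray(d, dtype=float)
    left = (np.exp(beta * d) - 1.0) / (np.exp(beta * mu) - 1.0)
    right = (np.exp(beta * (1.0 - d)) - 1.0) / (np.exp(beta * (1.0 - mu)) - 1.0)
    return np.where(d <= mu, left, right)
```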
Distance-based Entropy Measure and Its Efficient Calculation
- Figure 3(a) shows that increasing the β value decreases the entropy.
- Figure 3(b) shows that varying μ shifts the meeting point of the two sides of the entropy curve.
Distance-based Entropy Measure and Its Efficient Calculation
- Among the different β values we experimented with, β = 10 works well.
- The way to estimate μ is based on the observation that estimating the range of intra-cluster distances is easier and more accurate than estimating that of inter-cluster distances, because intra-cluster distances occupy the lowest portion of the complete range of distances.
Distance-based Entropy Measure and Its Efficient Calculation
- Using the intra-cluster distance range to estimate μ (a hypothetical sketch of such a heuristic follows):
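The slides do not spell out the estimation procedure, so the following is a purely hypothetical heuristic consistent with the observation above: treat the first peak of the distance histogram as the intra-cluster range and place μ at the first valley after it.

```python
import numpy as np

def estimate_mu(freq):
    """Hypothetical heuristic (not the paper's stated procedure):
    place mu at the first valley after the first histogram peak,
    treating that peak as the intra-cluster distance range."""
    peak = int(np.argmax(freq[: len(freq) // 2]))  # assume the intra-cluster peak lies in the lower half
    for b in range(peak + 1, len(freq) - 1):
        if freq[b] <= freq[b + 1]:                 # first local minimum after the peak
            return (b + 0.5) / len(freq)           # bucket center as a normalized distance
    return 0.5                                     # fallback when no valley is found
```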
Distance-based Entropy Measure and Its Efficient Calculation
- Figure 4 illustrates the above procedure. A notable difference between the two figures: for data with clusters, entropy is low even for many bins with considerably high frequency counts.
Distance-based Entropy Measure and Its Efficient Calculation
- The distance calculation required to compute the entropy is O(N^2).
- So far we have aimed to assign low entropy to intra-cluster and inter-cluster distances, i.e., to minimize the entropy for data with clusters.
- A large portion of the distances have very low entropy, and together they make an insignificant contribution to the overall entropy.
Distance-based Entropy Measure and Its Efficient Calculation
- This suggests considering only distances whose entropy is higher than a threshold:
  - we can find a range of distances R_c with entropy higher than the threshold;
  - the entropy for any distance outside this range can be set to a very low constant value (see the sketch below).
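A sketch of how this thresholding could be realized, reusing the hypothetical entropy_mu_beta form from above; the closed-form bounds come from inverting that assumed piecewise expression, so they are assumptions as well:

```python
import numpy as np

def entropy_thresholded(d, mu=0.5, beta=10.0, threshold=0.05, low_const=0.0):
    """Solve E(d) = threshold on both sides of mu to get Rc = [d_lo, d_hi];
    evaluate the exact entropy only inside Rc and use a small constant outside."""
    d_lo = np.log(1.0 + threshold * (np.exp(beta * mu) - 1.0)) / beta
    d_hi = 1.0 - np.log(1.0 + threshold * (np.exp(beta * (1.0 - mu)) - 1.0)) / beta
    d = np.asarray(d, dtype=float)
    out = np.full_like(d, low_const)
    inside = (d >= d_lo) & (d <= d_hi)
    out[inside] = entropy_mu_beta(d[inside], mu, beta)   # exact values only inside Rc
    return float(out.sum())
```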
Feature Selection Algorithm
- The process has two main steps:
  1. searching for the optimal subset, i.e., the one whose entropy is minimum;
  2. evaluating subsets of features.
- Here we use the forward selection method: first find the best single feature, then, using the already selected features, find the next best feature, and so on (sketched below).
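A minimal sketch of forward selection using the distance_entropy measure sketched earlier; the number of features to select, k, is our own parameter, since the slides do not state a stopping criterion:

```python
import numpy as np

def forward_select(X, k):
    """Greedy forward selection: at each step add the feature whose
    inclusion yields the minimum entropy of the resulting subspace."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best = min(remaining, key=lambda f: distance_entropy(X[:, selected + [f]]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Each step costs one O(N^2) entropy evaluation per candidate feature, which is why the thresholded calculation above matters in practice.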
Experimental Evaluation
- Evaluating a feature selection method for clustering means checking the correctness of the selected features.
- We first evaluate the proposed method on synthetic datasets for which the important features are known.
- We then evaluate it on benchmark and real datasets for which the important features are known or can be found by visualization.
Experimental Evaluation
- Synthetic datasets and real datasets.
Conclusions
- The experimental results show that our method correctly finds the most important feature subsets.
Personal Opinion
……