
1 Feature Selection in k-Median Clustering Olvi Mangasarian and Edward Wild University of Wisconsin - Madison

2 Principal Objective  Find a reduced number of input space features such that clustering in the reduced space closely replicates the clustering in the full dimensional space

3 Basic Idea  Based on rigorous optimization theory, make a simple but fundamental modification to one of the two steps of the k-median algorithm  In each cluster, find a point closest in the 1-norm to all points in that cluster and to the median of ALL data points  The proposed approach can lead to a feature reduction as high as 64%, with clustering comparable to within 4% of that obtained with the original set of features  As the weight given to the data median increases, more features are deleted from the problem

4 FSKM Example  Start with median at origin  Apply k-median algorithm  As weight of data median increases, features are removed from the problem

5 Outline of Talk  Ordinary k-median algorithm  Two steps of the algorithm  Feature Selecting k-Median (FSKM) Algorithm  Overall optimization objective  Basic idea  Mathematical optimization formulation  Algorithm statement  Numerical examples  Conclusion & outlook

6 Ordinary k-Median Algorithm  Given m data points in n-dimensional input feature space  Find k cluster centers with the following property  The sum of the 1-norm distances between each data point and the closest cluster center is minimized  This objective is a sum of minima of linear functions, so minimizing it is a concave minimization problem and is NP-hard  However, the two-step k-median algorithm terminates in a finite number of steps at a point satisfying the minimum principle necessary optimality condition
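
A minimal Python sketch (not the authors' code; the function name kmedian_objective is illustrative) of the objective just described: the sum of 1-norm distances from each data point to its closest cluster center.

```python
# Sketch of the k-median clustering objective.
import numpy as np

def kmedian_objective(X, centers):
    """X: (m, n) data points; centers: (k, n) cluster centers."""
    # (m, k) matrix of 1-norm distances between every point and every center
    dists = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
    # each point contributes its distance to the nearest center
    return dists.min(axis=1).sum()
```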

7 Two-Step k-Median Algorithm  (0) Start with k initial cluster centers  (1) Assign each data point to a 1-norm closest cluster center  (2) For each cluster compute a new cluster center that is 1-norm closest to all points in the cluster (median of cluster)  (3) Stop if all cluster centers are unchanged else go to (1)  Algorithm terminates in a finite number of steps at a point satisfying the minimum principle necessary optimality conditions
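
A minimal sketch of this two-step procedure, relying on the fact that the coordinate-wise median of a cluster minimizes the sum of 1-norm distances to its points; function and variable names are illustrative, not from the paper.

```python
# Sketch of the two-step k-median algorithm.
import numpy as np

def kmedian(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # (0) start with k initial cluster centers drawn from the data
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # (1) assign each data point to a 1-norm closest cluster center
        dists = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        # (2) new center = coordinate-wise median of the cluster's points
        new_centers = np.array([
            np.median(X[labels == j], axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # (3) stop if all cluster centers are unchanged, else repeat
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```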

8 Key Change in Step (2) of k-Median Algorithm  (0)  (1)  (2) For each cluster compute a new cluster center that minimizes the sum of 1-norm distances to all points in the cluster plus a weighted 1-norm distance to the median of all data points  (3)  The weight of the 1-norm distance to the dataset median determines the number of features deleted:  For a zero weight, no features are suppressed  For a sufficiently large weight, all features are suppressed
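
A hedged sketch of how this modified step (2) could be implemented. The weighted objective separates by coordinate, and each coordinate reduces to a weighted median in which the median of ALL data points enters with the chosen weight (called lam below; the symbol and helper names are assumptions, not the paper's notation). For lam = 0 this is the ordinary cluster median; for a sufficiently large lam every coordinate snaps to the dataset median, suppressing the corresponding features.

```python
# Sketch of the modified cluster-center step with a weighted pull toward
# the median of all data points.
import numpy as np

def weighted_median(values, weights):
    """Return a minimizer of sum_i weights[i] * |c - values[i]| over c."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    # smallest value at which the cumulative weight reaches half the total
    return v[np.searchsorted(cum, 0.5 * w.sum())]

def fskm_center(cluster_points, data_median, lam):
    n = cluster_points.shape[1]
    center = np.empty(n)
    for j in range(n):
        # cluster values plus the dataset median, which carries weight lam
        vals = np.append(cluster_points[:, j], data_median[j])
        wts = np.append(np.ones(len(cluster_points)), lam)
        center[j] = weighted_median(vals, wts)
    return center
```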

9 FSKM Theory

10 Subgradients  A subgradient ∂f(x) of a convex function f satisfies f(y) - f(x) ≥ ∂f(x)'(y - x) for all x, y in R^n  Consider ||x||_1 for x in R^1  If x < 0, ∂||x||_1 = -1  If x > 0, ∂||x||_1 = 1  If x = 0, ∂||x||_1 ∈ [-1, 1]
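
For completeness (standard convex analysis rather than text taken from the slides), the same componentwise reasoning gives the subdifferential of the 1-norm of a vector x in R^n, which underlies the zeroing condition on the following slides:

```latex
\partial \|x\|_1 = \left\{ g \in \mathbb{R}^n \;:\;
  g_i = \operatorname{sign}(x_i) \ \text{if } x_i \neq 0,\quad
  g_i \in [-1,\,1] \ \text{if } x_i = 0 \right\}
```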

11 FSKM Theory (Continued)

12 Zeroing Cluster Features (Based on Necessary and Sufficient Optimality Conditions for Nondifferentiable Convex Optimization)

13 FSKM Algorithm

14 FSKM Example (Revisited)  Start with median at origin  Apply k-median algorithm  Compute the per-feature, per-cluster quantities from the zeroing condition: for feature x they are 1 (cluster 1) and 0 (cluster 2); for feature y they are 5 (cluster 1) and 4 (cluster 2)  Maximum over clusters: 1 for x, 5 for y  For a data-median weight of 1, feature x is removed from the problem

15 Numerical Testing  FSKM tested on five publicly available labeled datasets  Labels were used only to test effectiveness of FSKM  Data is first clustered using k-median then FSKM is applied to delete one feature at a time  Without using data labels, “error” in FSKM clustering with reduced features is obtained by comparison with the “gold standard” clustering with the full set of features  FSKM clustering error curve obtained without labels is compared with classification error curve obtained using data labels
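
A hedged sketch of how such a clustering "error" could be computed: compare the reduced-feature cluster assignments with the gold-standard full-feature assignments, after matching cluster labels to maximize agreement. The paper's exact measure may differ; names below are illustrative.

```python
# Sketch of clustering error relative to a gold-standard clustering.
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_error(gold_labels, reduced_labels, k):
    # contingency table: points in gold cluster i and reduced cluster j
    table = np.zeros((k, k), dtype=int)
    for g, r in zip(gold_labels, reduced_labels):
        table[g, r] += 1
    # best one-to-one matching of reduced clusters to gold clusters
    rows, cols = linear_sum_assignment(-table)
    agreement = table[rows, cols].sum()
    return 1.0 - agreement / len(gold_labels)
```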

16 3-Class Wine Dataset 178 Points in 13-dimensional Space

17 Remarks  Curves close together  Largest increase in error as last few features are removed  Reduced 13 features to 4:  Clustering error < 4%  Classification error decreased by 0.56 percentage points

18 2-Class Votes Dataset 435 Points in 16-dimensional Space

19 Remarks  Curves have similar shape  Largest increase in error as last few features are removed  Reduced 16 features to 3:  Clustering error < 10%  Classification error increased by 1.84 percentage points

20 2-Class WDBC Dataset (Wisconsin Diagnostic Breast Cancer) 569 Points in 30-dimensional Space

21 Remarks  Curves have similar shape for 14 and fewer features  First 3 features removed cause no change to either error curve  Reduced 30 features to 7:  Clustering error < 10%  Classification error increased by 3.69 percentage points

22 2-Class Star/Galaxy-Bright Dataset 2462 Points in 14-dimensional Space

23 Remarks  Clustering error increases gradually as number of features is reduced  Some features obstructing classification  Reduced 14 features to 4:  Clustering error < 10%  Classification error decreased by 1.42 percentage points

24 2-Class Cleveland Heart Dataset 297 Points in 13-dimensional Space

25 Remarks  Largest increase in both curves going from 13 to 9 features  Most features useful?  Reduced 13 features to 8:  Clustering error < 17%  Classification error increased by 7.74 percentage points

26 Conclusion  FSKM is a fast method for selecting relevant features while maintaining clusters similar to those in the original full dimensional space  Features selected by FSKM without labels may be useful for labeled data classification as well  FSKM eliminates the costly combinatorial search for an appropriately reduced feature set for clustering in lower dimensional spaces (e.g. 14-choose-6 = 3003 k-median runs to find the best 6 features out of 14 for the Star/Galaxy-Bright dataset, compared to 9 k-median runs required by FSKM)
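
Illustrative arithmetic for the run counts quoted above (an assumption about the accounting: the slide's figure of 9 FSKM runs is consistent with one full run plus eight single-feature deletions).

```python
# Exhaustive subset search vs. FSKM's one-feature-at-a-time deletion.
import math

exhaustive_runs = math.comb(14, 6)  # 3003 candidate 6-feature subsets
fskm_runs = 1 + (14 - 6)            # 9 runs: initial run + 8 deletions
print(exhaustive_runs, fskm_runs)
```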

27 Outlook  Feature & data selection for support vector machines  Sparse kernel approximation methods  Gene expression selection  Incorporation of prior knowledge into learning  Optimization-based clustering may be useful in other machine learning applications  Minimalist supervised & unsupervised learning  Select minimal knowledge for best model

28 Web Pages (Containing Paper & Talk)

