DATA CLUSTERING WITH KERNEL K-MEANS++


Matt Strautmann, Dept. of Electrical and Computer Engineering
Dr. Donald C. Wunsch II, Dept. of Electrical and Computer Engineering

PROJECT OBJECTIVES
o PROJECT GOAL
   - Experimentally demonstrate the application of Kernel K-Means to non-linearly clusterable data sets
o ACADEMIC IMPORTANCE
   - Expand the application of the Kernel K-Means clustering algorithm to non-traditional uses

BACKGROUND
o WHAT IS K-MEANS CLUSTERING?
   - K-Means clustering aims to divide the dataset into clusters (“groups”) in which each data point belongs to the cluster with the nearest mean vector.
o WHAT IS KERNEL K-MEANS?
   - Sum-of-squares algorithm
   - Two-step process: data point assignment and update
o WHAT IS THE PLUS PLUS INITIALIZATION SCHEME?
   - The first mean vector is a randomly selected data point
   - Each subsequent mean vector is chosen by evaluating randomly selected data points against a vector weighting probability

[Figure: K-Means iteration. Source: http://en.wikipedia.org/wiki/K-means_clustering]
1.) Initial Mean Orientations
2.) Voronoi Diagram Generated by the Means (data points associated with nearest cluster mean)
3.) Cluster Centroid Becomes New Cluster Mean
4.) Steps 2 and 3 Repeated until Convergence

APPROACH
o Evaluate standard K-Means (Soft++) against 4 datasets to form a benchmark
o Hybridize Soft K-Means++ with Kernel K-Means to form Kernel K-Means++
o Test Kernel K-Means++ on small size, small dimension Gaussian, large dimension Gaussian, and large size datasets

PROJECT DATASETS
[Figures: Iris Plant Dataset; 2 Dimension, 2 Cluster Dataset (Gaussian 2D2K). Sources: lans.ece.utexas.edu, eleves.ens.fr]

DISCUSSION
o Kernel K-Means++ was found to cluster the test datasets better than Soft K-Means++
o Kernel data-mapping was seen to resolve the overlapping data sets by:
   - Mapping the data, before clustering, to a higher-dimensional feature space using a nonlinear function
   - Partitioning the points with linear separators in the new space
o Soft K-Means++ could not successfully cluster the Lung Cancer Dataset; only one cluster out of three was successfully clustered
o Soft K-Means++ clustered the two dimension, two cluster Gaussian dataset with only one error out of the one thousand data points

SOFT K-MEANS++ VS. KERNEL K-MEANS++

Soft K-Means++ (all statistics over ten runs)
Dataset                  Clustering Accuracy Average   Standard Deviation of Accuracy   Variance of Accuracy
Iris Plant Dataset       28.00%                        8.218%                           2.867%
Lung Cancer Dataset      43.75%                        -                                -
2D2K Gaussian Dataset    99.00%                        -                                -
8D5K Gaussian Dataset    58.50%                        2.082%                           0.043%

Kernel K-Means++ (all statistics over ten runs)
Dataset                  Clustering Accuracy Average   Standard Deviation of Accuracy   Variance of Accuracy
Iris Plant Dataset       57.00%                        5.009%                           2.238%
Lung Cancer Dataset      62.00%                        6.878%                           0.473%
2D2K Gaussian Dataset    96.50%                        1.677%                           0.028%
8D5K Gaussian Dataset    76.31%                        10.366%                          1.075%

RESULTS COMPARISON
o Kernel K-Means++ clustering accuracy was superior in all cases except the two dimensional, two cluster dataset.
o The clustering accuracy of the datasets increased by the following amounts:
   - Iris Plant: 104%
   - Lung Cancer: 38%
   - 2D2K: -2.5%
   - 8D5K: 30%

CONCLUDING REMARKS
o The initialization was seen to be the most important factor in the algorithm converging
o The “PLUS PLUS” cluster mean initialization was seen to improve the results
o Kernel assignment works better than the maximum responsibility calculation of Soft K-Means
o Kernel K-Means++ can handle small or large dimension datasets well; higher dimensionality seemed to be advantageous, with the Lung Cancer Dataset (56 dimensions) clustered more accurately than the Iris Plant Dataset (4 dimensions)
o Kernel K-Means++ produced superior results to Soft K-Means++ when clustering the Lung Cancer Dataset and demonstrated recognition of all three clusters

FUTURE WORK
o Further improvement of the mean vector initialization is believed possible over the “PLUS PLUS” initialization
o Other options for the mean-squared error calculation for data point evaluation are possible
o The time analysis of the algorithm must be calculated

ACKNOWLEDGEMENTS
The author would like to acknowledge the expertise of Dr. Rui Xu in advising this project.
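
The PLUS PLUS initialization and the two-step K-Means iteration (assignment, then update, repeated until convergence) can be sketched in a few lines of Python. This is a minimal illustration, not the poster's code: it uses hard assignments rather than the Soft K-Means responsibilities, it assumes the "weighting probability" is the usual squared distance to the nearest mean already chosen, and the names sq_dist, plus_plus_init and k_means are hypothetical. It also assumes the data contain at least k distinct points.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def plus_plus_init(points, k, rng):
    """Choose k initial means from the data (assumed D^2 weighting)."""
    # The first mean is a uniformly random data point.
    means = [list(rng.choice(points))]
    while len(means) < k:
        # Weight each point by its squared distance to the nearest chosen mean.
        weights = [min(sq_dist(p, m) for m in means) for p in points]
        means.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return means


def k_means(points, k, seed=0, max_iter=100):
    """Hard-assignment K-Means with PLUS PLUS seeding."""
    rng = random.Random(seed)
    means = plus_plus_init(points, k, rng)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, means[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: each mean becomes the centroid of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old mean if a cluster is empty
                means[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, means


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(k_means(data, 2, seed=1))
```

Passing a fixed seed makes a run repeatable, which helps when averaging accuracy over several runs as in the tables above.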
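
The kernel mapping described in the Discussion never has to build the higher-dimensional feature space explicitly: the squared distance from a point to a cluster mean in that space can be computed from kernel values alone. The sketch below illustrates this under stated assumptions. The poster does not say which kernel was used, so the RBF kernel and its gamma value are assumptions, and the function names are hypothetical. Initial labels are passed in by the caller; a PLUS PLUS style seeding could supply them.

```python
import math


def rbf_kernel_matrix(points, gamma=0.5):
    """Gram matrix K[i][j] = exp(-gamma * ||x_i - x_j||^2) (assumed kernel)."""
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(p, q))) for q in points]
        for p in points
    ]


def feature_space_sq_dist(K, i, members):
    """Squared feature-space distance from point i to the mean of `members`.

    ||phi(x_i) - m||^2 = K_ii - (2/n) * sum_j K_ij + (1/n^2) * sum_{j,l} K_jl
    """
    n = len(members)
    cross = sum(K[i][j] for j in members)
    within = sum(K[j][l] for j in members for l in members)
    return K[i][i] - 2.0 * cross / n + within / (n * n)


def kernel_k_means(K, labels, k, max_iter=100):
    """Reassign points to the nearest feature-space mean until labels settle."""
    labels = list(labels)
    for _ in range(max_iter):
        clusters = [[i for i, lab in enumerate(labels) if lab == c] for c in range(k)]
        new_labels = []
        for i in range(len(K)):
            # An empty cluster is treated as infinitely far away.
            dists = [
                feature_space_sq_dist(K, i, m) if m else math.inf for m in clusters
            ]
            new_labels.append(dists.index(min(dists)))
        if new_labels == labels:
            break
        labels = new_labels
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    K = rbf_kernel_matrix(pts, gamma=0.5)
    print(kernel_k_means(K, [0, 0, 1, 1, 1, 1], 2))
```

With a linear kernel (plain dot products) the same distance formula reduces to the ordinary squared Euclidean distance to the centroid, so standard K-Means is the special case with no nonlinear mapping.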
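
The percentages in the Results Comparison appear to be relative changes in average accuracy from Soft K-Means++ to Kernel K-Means++. The poster does not state the formula, so the calculation below is an assumption. Applied to the tabulated averages, it closely reproduces the Iris Plant, 2D2K and 8D5K figures; for the Lung Cancer Dataset it gives about 41.7% rather than the reported 38%, which may reflect rounding or a different set of runs.

```python
# Average clustering accuracy (percent) copied from the two tables.
soft = {"Iris Plant": 28.00, "Lung Cancer": 43.75, "2D2K": 99.00, "8D5K": 58.50}
kernel = {"Iris Plant": 57.00, "Lung Cancer": 62.00, "2D2K": 96.50, "8D5K": 76.31}


def relative_change(old, new):
    """Relative change from old to new, in percent (assumed formula)."""
    return 100.0 * (new - old) / old


changes = {name: relative_change(soft[name], kernel[name]) for name in soft}

for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```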

