Fast and Exact K-Means Clustering
Ruoming Jin, Anjan Goswami, Gagan Agrawal
The Ohio State University

Hello, everyone, my name is Ruoming Jin. Today, I will present the paper, "Fast and Exact K-Means Clustering".
Mining Out-of-Core Datasets
- The need to efficiently process disk-resident datasets
  - In many cases, the data is simply too large to fit into main memory
  - The processor-memory performance gap, and consequently the processor-secondary-storage gap, keeps growing
    - Moore's law (roughly 50% per year)
    - Latency gap (disk ~5 ms vs. DRAM ~50 ns, a ratio of roughly 100,000)
- The problem
  - Most mining algorithms are I/O (data) intensive
  - Many mining algorithms, such as decision tree construction and k-means clustering, have to rewrite or scan the dataset many times
- Some remedies
  - Approximate mining algorithms
  - Working on samples
- How can we develop efficient out-of-core mining algorithms without losing accuracy?
Processor/Disk Race
How can we make Carl Lewis (the fast processor) do more of the running, so that the turtle (the slow disk) has less distance to cover?
Sampling-Based Approach
- Use samples to obtain approximate results or information
- Find criteria to test or estimate the accuracy of the sample results, and prune the statistically incorrect ones
- Scan the complete dataset, collecting the information needed (guided by the approximate results) to derive the exact final results
- If the estimate from the sample is wrong and causes a false pruning, a re-scan is needed
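To make this workflow concrete, here is a minimal Python sketch of the sample-then-verify pattern. The function names (mine, estimate, correct) are placeholders of my own, not an API from the paper; the loop simply re-scans whenever a false pruning is detected.

```python
def sample_then_verify(dataset, sample, mine, estimate, correct):
    """Skeleton of the sampling-based approach: mine the sample, derive
    accuracy criteria, then scan the complete dataset to turn the
    approximate result into an exact one, re-scanning on a false pruning."""
    approx = mine(sample)                    # approximate result from the sample
    criteria = estimate(approx, sample)      # statistical accuracy criteria
    result, valid = correct(dataset, approx, criteria)      # one full scan
    while not valid:                         # false pruning detected
        result, valid = correct(dataset, result, criteria)  # re-scan
    return result
```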
Applications of This Approach
- Scaling and parallelizing decision tree construction
  - Uses RainForest (RF-read) as the basis
  - Statistically Pruning Intervals for Enhanced Scalability (SPIES) – SDM 2003
  - Reduces both memory and communication requirements
- Fast and Exact K-means (this paper)
- Distributed, Fast, and Exact K-means (submitted)
K-means Algorithm
- An iterative assign-and-shift procedure
- In-core datasets
  - Using kd-trees (Pelleg and Moore)
- Out-of-core datasets
  - Single-pass approximate algorithms
    - Bradley and Fayyad
    - Farnstrom and colleagues
    - Domingos and Hulten
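For reference, a minimal in-memory version of this assign-and-shift loop (plain Lloyd's k-means in NumPy; this is generic textbook code, not any of the out-of-core variants cited above):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: repeatedly assign every point to its nearest center,
    then shift each center to the mean of the points assigned to it."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: index of the closest center for every point
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # update (shift) step: move each center to the mean of its points
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return centers, labels
```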
The Problem
- Can we have an algorithm that requires fewer passes over the entire dataset and still produces the same results as the original k-means?
- Fast and Exact K-Means algorithm (FEKM)
  - Typically requires only one or a small number of passes over the entire dataset
  - Provably produces the same cluster centers as reported by the original k-means algorithm
  - Experimental results on a number of real and synthetic datasets show speedups between a factor of 2 and 4.5 compared to k-means
Basic Ideas of FEKM
- Run the original k-means algorithm on a sample; store the centers computed after each iteration on the sampled data
- Build a confidence radius for every cluster center at every iteration
  - An estimate of the difference between the sample-derived centers and the corresponding exact k-means centers
- Apply the confidence radius to find the points likely to get a different center assignment when k-means runs on the complete dataset
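A sketch of this first phase as I read the slide: run k-means on the sample and, for every iteration, record the centers together with a per-center confidence radius. The confidence_radius callback is my own placeholder (one possible heuristic is sketched a few slides below); the exact bookkeeping in the paper may differ.

```python
import numpy as np

def kmeans_on_sample(sample, k, n_iter, confidence_radius, seed=0):
    """Run k-means on the in-memory sample and keep, for each iteration,
    the centers plus their confidence radii for later boundary tests."""
    rng = np.random.default_rng(seed)
    centers = sample[rng.choice(len(sample), size=k, replace=False)]
    history = []                             # one (centers, radii) pair per iteration
    for _ in range(n_iter):
        dist = np.linalg.norm(sample[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([sample[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        radii = confidence_radius(sample, labels, centers, new_centers)
        history.append((new_centers.copy(), radii))
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return history
```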
Sampling the Datasets
K-means Clustering on Samples
- Confidence radius: an estimate of the upper bound on the distance between a sample-derived center and the corresponding exact k-means center
Boundary Points
- A point with closest center C1 (confidence radius δ1, distance d1) and another center C2 (confidence radius δ2, distance d2) is a boundary point if |d1 - d2| < δ1 + δ2
- In other words, another center is close enough, relative to the closest center, that the point's assignment could change under the exact centers
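The test on this slide maps directly to code. A minimal sketch, assuming centers is a k-by-d array of sample-derived centers and radii holds their confidence radii:

```python
import numpy as np

def is_boundary_point(x, centers, radii):
    """x is a boundary point if, for the closest center c1 (distance d1,
    radius delta1), some other center c2 (distance d2, radius delta2)
    satisfies |d1 - d2| < delta1 + delta2, i.e. c2 is close enough that
    the exact k-means centers might assign x differently."""
    d = np.linalg.norm(centers - x, axis=1)
    i1 = d.argmin()
    for i2 in range(len(centers)):
        if i2 != i1 and abs(d[i1] - d[i2]) < radii[i1] + radii[i2]:
            return True
    return False
```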
Processing the Complete Dataset
- Determine whether each point is a boundary point
- Boundary points are simply kept in main memory
- Other points (stable points) are assigned to the closest sample-derived center for each stored iteration
- Sufficient statistics for stable points are cached in the CA-table (size: number of clusters × number of iterations × dimensionality)
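A sketch of this full-dataset pass under one reasonable reading of the slide (not the paper's exact bookkeeping), reusing is_boundary_point from the previous sketch: points that are boundary points for any stored iteration are buffered in memory, and every other point is folded into the CA-table of per-iteration, per-cluster sufficient statistics (coordinate sum and count).

```python
import numpy as np

def scan_full_dataset(stream, history):
    """stream yields data points; history is the per-iteration list of
    (centers, radii) pairs recorded while clustering the sample."""
    n_iter = len(history)
    k, dim = history[0][0].shape
    ca_sum = np.zeros((n_iter, k, dim))             # CA-table: coordinate sums
    ca_cnt = np.zeros((n_iter, k), dtype=np.int64)  # CA-table: point counts
    boundary = []
    for x in stream:                                # a single scan of the disk-resident data
        if any(is_boundary_point(x, c, r) for c, r in history):
            boundary.append(x)                      # assignment may change: keep in memory
        else:                                       # stable point: summarize only
            for t, (centers, _) in enumerate(history):
                j = np.linalg.norm(centers - x, axis=1).argmin()
                ca_sum[t, j] += x
                ca_cnt[t, j] += 1
    return ca_sum, ca_cnt, boundary
```

Roughly speaking, the exact centers for any iteration can then be recovered from the CA-table plus a reassignment of only the in-memory boundary points, which is why a single pass usually suffices when the sample estimates hold.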
Confidence Radius
- Computation of the confidence radius
  - A large radius marks more points as boundary points (safer, but more memory); a small radius risks a false pruning and a re-scan
  - Heuristics are used to balance the two
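The slide leaves the computation to heuristics, so the snippet below is only an illustrative stand-in of my own (not the formula from the paper) to show the shape of that trade-off: the radius is inflated by a safety factor so that boundary detection stays conservative.

```python
import numpy as np

def confidence_radius(sample, labels, old_centers, new_centers, factor=2.0):
    """Illustrative heuristic only -- not the paper's formula: bound the
    sample center's error by how far the center moved in this iteration,
    inflated by a safety factor. `sample` and `labels` are accepted so a
    statistically derived radius (e.g. one based on within-cluster
    variance and sample size) could be plugged in instead."""
    drift = np.linalg.norm(new_centers - old_centers, axis=1)
    return factor * drift
```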
Discussion
- Correctness: FEKM is guaranteed to find the same clusters as the original k-means
- Performance analysis: the cost is determined by the number of passes over the dataset
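A rough cost model consistent with this slide (the notation is mine): if scanning an n-point dataset costs T_scan(n), then k-means with I iterations scans everything I times, whereas FEKM runs those iterations on a much smaller sample of size s and adds only p full passes, with p typically one to a few per the results that follow.

```latex
T_{\text{k-means}} \approx I \cdot T_{\text{scan}}(n)
\qquad\text{vs.}\qquad
T_{\text{FEKM}} \approx I \cdot T_{\text{scan}}(s) + p \cdot T_{\text{scan}}(n),
\qquad s \ll n,\quad p \ll I
```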
Experimental Setup and Datasets
- Machines: 700 MHz Pentium processors, 1 GB memory
- Synthetic datasets
  - Similar to the ones used by Bradley et al.
  - 18 datasets of 1.1 GB and 2 datasets of 4.4 GB
  - 5, 10, 20, 50, 100, and 200 dimensions
  - 5, 10, and 20 clusters
- Real datasets (UCI ML archive)
  - KDDCup99 (38 dimensions, 1.8 GB, k=5)
  - Corel image features (32 dimensions, 1.9 GB, k=16)
  - Reuters text database (258 dimensions, 2 GB, k=25)
  - Super-sampled and normalized to [0, 1]
Performance of k-means and FEKM Algorithms on Synthetic Datasets
[Table: running time in seconds with 20 clusters, on the 1.1 GB and 4.4 GB synthetic datasets at dimensionalities from 5 to 200. Columns: dataset size, dimensions, number of k-means iterations, time of k-means, time of FEKM, sample size (%), and number of FEKM passes. In every configuration FEKM ran substantially faster than k-means while using only a small sample and one to a few full passes.]
Performance of k-means and FEKM Algorithms on Real Datasets
[Table: running time in seconds on KDDCup99 (19 iterations), Corel (43 iterations), and Reuters (20 iterations). Columns: dataset, number of k-means iterations, time of k-means, time of FEKM, sample size (%), number of passes, and the squared error between the final centers and the centers obtained from the sampling phase. On all three datasets FEKM ran several times faster than k-means: roughly 7151 s vs. 2100-2500 s on KDDCup99, 28442 s vs. 9300-12600 s on Corel, and 41290 s vs. 9200-11200 s on Reuters, depending on the sample size used.]
Summary
- Both algorithms (SPIES and FEKM) use information derived from samples to decide what needs to be cached, summarized, or dropped from the complete dataset
  - SPIES constructs detailed class histograms for the unpruned intervals
  - FEKM caches sufficient statistics for stable points and stores the boundary points
- Both algorithms achieve significant performance gains without losing any accuracy