1
Fast and Exact K-Means Clustering
Ruoming Jin, Anjan Goswami, Gagan Agrawal
The Ohio State University
2/23/2019
Hello, everyone, my name is Ruoming Jin. Today, I will present the paper, “Communication and Memory Efficient Parallel Decision Tree Construction”.
2
Mining Out-of-Core Datasets
The need to efficiently process disk-resident datasets
  In many cases, the data is too large to fit into main memory.
  The processor-memory performance gap, and consequently the processor-secondary memory performance gap, keeps growing.
    Moore's law (50% per year)
    Latency gap (disk 5 ms vs. DRAM 50 ns, ratio > 100)
The problem
  Most mining algorithms are I/O (data) intensive.
  Many mining algorithms, such as decision tree construction and k-means clustering, have to rewrite or scan the dataset many times.
Some remedies
  Approximate mining algorithms
  Working on samples
How can we develop efficient out-of-core mining algorithms without losing accuracy?
3
Processor/Disk race
How can we let Carl Lewis do more running to reduce the turtle's running distance?
4
Sampling Based Approach
Use samples to get approximate results or information.
Find criteria to test or estimate the accuracy of the sample results, and prune the statistically incorrect results.
Scan the complete dataset, collecting the information needed, based on the approximate results, to derive the accurate final results.
If the estimate from the sample is wrong and results in a false pruning, a re-scan is needed.
5
Applications of this Approach
Scaling and parallelizing decision tree construction
  Uses RainForest (RF-read) as the basis
  Statistically Pruning Intervals for Enhanced Scalability (SPIES), SDM 2003
  Reduces both memory and communication requirements
Fast and Exact k-means (this paper)
Distributed, Fast, and Exact K-means (submitted)
6
K-means Algorithm
An iterative assigning and shifting procedure
In-core datasets
  Using kd-trees (Pelleg and Moore)
Out-of-core datasets
  Single-pass approximate algorithms
    Bradley and Fayyad
    Farnstrom and his colleagues
    Domingos and Hulten
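The assigning and shifting procedure can be sketched in a few lines. This is a minimal in-memory illustration of standard k-means, not the authors' implementation; the function names `assign`, `shift`, and `kmeans` are hypothetical, and keeping the old center for an empty cluster is an assumption made only to keep the sketch well defined.

```python
import math

def assign(points, centers):
    """Assigning step: index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def shift(points, labels, centers):
    """Shifting step: move each center to the mean of its assigned points."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(p[d] for p in members) / len(members)
                                 for d in range(len(old))))
    return new_centers

def kmeans(points, centers, max_iter=100):
    """Repeat assign/shift until the centers stop moving.

    Returns the final centers and the list of centers seen after each
    iteration (the initial centers are the first entry).
    """
    centers = list(centers)
    history = [centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = shift(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
        history.append(centers)
    return centers, history
```

Every iteration touches every point, which is why each iteration costs one full scan when the dataset lives on disk.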
7
The Problem
Can we have an algorithm that requires fewer passes over the entire dataset and still produces the same results as the original k-means?
Fast and Exact K-Means Algorithm (FEKM)
  Typically requires only one or a small number of passes over the entire dataset
  Provably produces the same cluster centers as reported by the original k-means algorithm
  Experimental results on a number of real and synthetic datasets show speedups between a factor of 2 and 4.5 compared to k-means
8
Basic ideas of FEKM
Run the original k-means algorithm on samples; store the centers computed after each iteration on the sampled data.
Build a confidence radius for every cluster center at every iteration: an estimate of the difference between the centers from the samples and the corresponding exact k-means centers.
Apply the confidence radius to find the points likely to have a different center assignment when k-means runs on the complete dataset.
9
Sampling the datasets
10
K-means clustering on samples
Confidence radius: an estimate of the upper bound on the distance between a sample center and the corresponding k-means center.
11
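Boundary Points
A point lies at distance d1 from center C1 (confidence radius δ1) and at distance d2 from center C2 (confidence radius δ2).
It is a boundary point when |d1 - d2| < δ1 + δ2, that is, when another center is close enough compared to the closest center.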
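The boundary test can be written directly from the inequality on this slide. The sketch below is illustrative only: the name `is_boundary` is hypothetical, and comparing the closest center against every other center (rather than just one neighbor) is an assumption about how the two-center picture generalizes to k centers.

```python
import math

def is_boundary(point, centers, radii):
    """True if some other center is 'close enough' to the closest one.

    centers: list of sample centers C_j
    radii:   list of confidence radii delta_j, one per center
    """
    dists = [math.dist(point, c) for c in centers]
    closest = min(range(len(centers)), key=dists.__getitem__)
    for j in range(len(centers)):
        if j == closest:
            continue
        # Slide criterion: |d1 - d2| < delta1 + delta2
        if abs(dists[closest] - dists[j]) < radii[closest] + radii[j]:
            return True
    return False
```

With centers at (0, 0) and (10, 0) and both radii set to 0.5, a point at (4.8, 0) is a boundary point, while a point at (1, 0) is stable.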
12
Processing the Complete Dataset
Identify whether each point is a boundary point.
  A boundary point is simply kept in main memory.
  Every other point (a stable point) is assigned to the closest center derived from the samples, for each possible iteration.
Cache sufficient statistics for the stable points (CA-Table: number of clusters * number of iterations * dimensions).
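One possible reading of this pass is sketched below. It is a simplification under stated assumptions, not the paper's procedure: a point is treated as a boundary point if it fails the test at any sampled iteration, the CA-Table is modeled as a nested list holding a count and a per-dimension sum for each (iteration, cluster) cell, and the names `scan_dataset`, `centers_per_iter`, and `radii_per_iter` are hypothetical.

```python
import math

def scan_dataset(points, centers_per_iter, radii_per_iter):
    """Single pass: buffer boundary points, summarize stable points."""
    n_iter = len(centers_per_iter)
    k = len(centers_per_iter[0])
    dim = len(centers_per_iter[0][0])
    # ca_table[i][j] = [count, per-dimension sum] for cluster j at iteration i
    ca_table = [[[0, [0.0] * dim] for _ in range(k)] for _ in range(n_iter)]
    boundary = []
    for p in points:
        labels = []
        for centers, radii in zip(centers_per_iter, radii_per_iter):
            dists = [math.dist(p, c) for c in centers]
            best = min(range(k), key=dists.__getitem__)
            if any(j != best and
                   abs(dists[best] - dists[j]) < radii[best] + radii[j]
                   for j in range(k)):
                labels = None  # assumption: boundary at any iteration
                break
            labels.append(best)
        if labels is None:
            boundary.append(p)  # kept in main memory
            continue
        # Stable point: add it to the sufficient statistics of every iteration.
        for i, j in enumerate(labels):
            cell = ca_table[i][j]
            cell[0] += 1
            for d in range(dim):
                cell[1][d] += p[d]
    return ca_table, boundary
```

The table size depends only on the number of clusters, iterations, and dimensions, so memory use is driven by how many boundary points must be buffered.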
13
Confidence Radius
Computation of the confidence radius
Large radius / small radius
Heuristics
14
Discussion
Correctness
  FEKM is guaranteed to find the same clusters as the original k-means.
Performance analysis
  Determined by the number of passes over the dataset.
15
Experimental Setup and Datasets
Machines
  700 MHz Pentium processors
  1 GB memory
Datasets
  Synthetic datasets
    Similar to the ones used by Bradley et al.
    18 datasets of 1.1 GB and 2 datasets of 4.4 GB
    5, 10, 20, 50, 100, and 200 dimensions
    5, 10, and 20 clusters
  Real datasets (UCI ML archive)
    KDDCup99 (38 dimensions, 1.8 GB, k=5)
    Corel image (32 dimensions, 1.9 GB, k=16)
    Reuters text database (258 dimensions, 2 GB, k=25)
    Super-sampling, normalized to [0,1]
16
Performance of k-means and FEKM Algorithms on synthetic Datasets
No. iterations, Time of k-means, Time of FEKM, Samples (%), Passes, Size, Dimensions
1.1GB 200 100 10 2 3 584.88 5 1 585.63 50 882.36 20 6 4.4GB
Running time in seconds, 20 clusters
17
Performance of k-means and FEKM Algorithms on Real Datasets
No. iterations, Time of k-means, Time of FEKM, Samples (%), Passes, Squared Error
Kdd99 19 7151 2317 10 2 4.0
kdd99 2529 15 3.5
2136 5 4.2
Corel 43 28442 10503 3 2.2
12603 2.15
9342 3.24
Reuter 20 41290 10311 10.1
11204 8.6
9214 14.9
Running time in seconds; squared error between the final centers and the centers after sampling
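The speedups implied by these timings can be checked with a short calculation. The sketch assumes that the single k-means time listed for each dataset applies to all three FEKM runs of that dataset; the dictionary names are hypothetical and the numbers are copied from the table above.

```python
# Running times in seconds, copied from the table above.
kmeans_seconds = {"Kdd99": 7151, "Corel": 28442, "Reuter": 41290}
fekm_seconds = {
    "Kdd99": [2317, 2529, 2136],
    "Corel": [10503, 12603, 9342],
    "Reuter": [10311, 11204, 9214],
}

def speedups(baseline, runs):
    """Speedup of each FEKM run relative to the k-means baseline."""
    return {name: [baseline[name] / t for t in times]
            for name, times in runs.items()}

if __name__ == "__main__":
    for name, values in speedups(kmeans_seconds, fekm_seconds).items():
        print(name, [round(v, 2) for v in values])
```

Under that assumption, every ratio falls inside the range of 2 to 4.5 quoted earlier in the talk.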
18
Summary
Both algorithms (SPIES and FEKM) use information derived from samples to decide what needs to be cached, summarized, or dropped from the complete dataset.
  SPIES constructs detailed class histograms for unpruned intervals.
  FEKM caches the sufficient statistics for stable points and stores the boundary points.
Both algorithms achieve significant performance gains without losing any accuracy.