Parallel k-means++ for Multiple Shared-Memory Architectures
Patrick Mackey (Pacific Northwest National Laboratory), Robert R. Lewis (Washington State University)
ICPP 2016
This Paper Describes approaches for parallelizing k-means++ on three distinct hardware architectures: OpenMP (shared-memory multi-core processors), the Cray XMT (a massively multithreaded architecture), and a high-performance GPU.
k-means++ A method that improves the quality of k-means clustering by selecting a set of initial seeds that, on average, yields better clustering than purely random selection. Uses a probabilistic approach for selecting seeds: the probability of picking a data point is proportional to its squared distance from its nearest previously selected seed.
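For reference, the standard k-means++ selection rule (not written out on the slide) is, with D(x) the distance from x to its nearest already-chosen seed:

    p(x) = D(x)² / Σ_{x′ ∈ X} D(x′)²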
Pseudocode of Serial k-means++
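The pseudocode figure itself is not reproduced in these notes. As a stand-in, here is a minimal C++ sketch of the serial seeding loop it describes; choose_seeds, dist_sq, and min_dist_sq are illustrative names, not the authors'.

    #include <algorithm>
    #include <cstdlib>
    #include <limits>
    #include <vector>

    // One data point: an m-dimensional vector.
    using Point = std::vector<double>;

    static double dist_sq(const Point& a, const Point& b) {
        double s = 0.0;
        for (size_t i = 0; i < a.size(); ++i) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    // Serial k-means++ seeding: pick k initial centers from the points in X.
    std::vector<Point> choose_seeds(const std::vector<Point>& X, int k) {
        std::vector<Point> seeds;
        seeds.push_back(X[std::rand() % X.size()]);   // first seed: uniform random

        // min_dist_sq[i] = squared distance from X[i] to its nearest chosen seed.
        std::vector<double> min_dist_sq(X.size(), std::numeric_limits<double>::max());

        while ((int)seeds.size() < k) {
            double total = 0.0;
            for (size_t i = 0; i < X.size(); ++i) {
                min_dist_sq[i] = std::min(min_dist_sq[i], dist_sq(X[i], seeds.back()));
                total += min_dist_sq[i];
            }
            // Weighted random selection: point i is chosen with
            // probability min_dist_sq[i] / total.
            double r = total * (double(std::rand()) / RAND_MAX);
            size_t chosen = 0;
            for (double acc = 0.0; chosen < X.size() - 1; ++chosen) {
                acc += min_dist_sq[chosen];
                if (acc >= r) break;
            }
            seeds.push_back(X[chosen]);
        }
        return seeds;
    }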
Pseudocode of Weighted_Rand_Index
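The Weighted_Rand_Index pseudocode figure is likewise omitted here. A minimal sketch of a weighted random index selection (the roulette-wheel draw inlined in the previous sketch), with illustrative naming:

    #include <cstdlib>
    #include <vector>

    // Return index i with probability weights[i] / sum(weights).
    // Assumes the weights are non-negative and not all zero.
    size_t weighted_rand_index(const std::vector<double>& weights) {
        double total = 0.0;
        for (double w : weights) total += w;

        double r = total * (double(std::rand()) / RAND_MAX);  // uniform in [0, total]
        double acc = 0.0;
        for (size_t i = 0; i < weights.size(); ++i) {
            acc += weights[i];
            if (acc >= r) return i;
        }
        return weights.size() - 1;   // guard against floating-point round-off
    }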
Parallel k-means++ Parallelizing the probabilistic selection is challenging: a dependence exists between iterations of the while loop, so simple loop parallelism will not work. Instead, each thread is given a partition of the data points and makes its own seed selection from its subset of weighted probabilities using the same basic algorithm. This produces a list of potential seed choices and their probabilities.
Parallel k-means++ (Cont.) Performs one more weighted probability selection over that list to decide the final chosen seed.
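No code for this step is shown on the slide. A minimal sketch of the final selection, assuming each candidate carries its partition's total squared-distance weight (an assumption of this sketch; the slide only says "their probabilities"); Candidate and final_seed are illustrative names:

    #include <vector>

    size_t weighted_rand_index(const std::vector<double>& weights);  // earlier sketch

    // Hypothetical per-thread result: the seed a thread picked from its partition,
    // plus the total D(x)^2 weight of that partition.
    struct Candidate {
        size_t point_index;       // index of the chosen point in X
        double partition_weight;  // sum of squared distances over the partition
    };

    // Final selection: one more weighted draw, this time over the partition weights.
    size_t final_seed(const std::vector<Candidate>& candidates) {
        std::vector<double> weights;
        weights.reserve(candidates.size());
        for (const Candidate& c : candidates) weights.push_back(c.partition_weight);
        return candidates[weighted_rand_index(weights)].point_index;
    }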
Proof of Correctness Let x ∈ X be an arbitrary vector. p_par(x): probability of selecting x in the parallel algorithm. p(x): the true probability of selecting x ∈ X by weighted probability. Theorem: p_par(x) = p(x).
Proof Let X′ be the partition (the set of vectors assigned to one thread) that contains x. Then p_par(x) = p(X′) · p(x|X′) = p(x) · p(X′|x) by the definition of conditional probability, and since p(X′|x) = 1.0 (x always lies in its own partition), p_par(x) = p(x). ∎
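Spelling the cancellation out concretely (the expressions below assume the final selection weights each thread's candidate by its partition's total squared-distance weight; that weighting is an assumption of these notes, not stated on the slide):

    p_par(x) = p(X′) · p(x|X′)
             = [Σ_{y∈X′} D(y)² / Σ_{y∈X} D(y)²] · [D(x)² / Σ_{y∈X′} D(y)²]
             = D(x)² / Σ_{y∈X} D(y)²
             = p(x)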
k-means++ for OpenMP
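The authors' OpenMP code is not included in these notes. A minimal sketch of how the per-thread candidate selection might be structured with OpenMP, reusing the illustrative helpers from the earlier sketches:

    #include <omp.h>
    #include <vector>

    // Types and helpers from the earlier sketches:
    struct Candidate { size_t point_index; double partition_weight; };
    size_t weighted_rand_index(const std::vector<double>& weights);
    size_t final_seed(const std::vector<Candidate>& candidates);

    // One round of parallel seed selection (sketch). min_dist_sq[i] holds D(X[i])^2.
    // Each thread handles a contiguous partition, draws a local candidate from it,
    // and records the partition's total weight; one final draw picks the winner.
    size_t parallel_select_seed(const std::vector<double>& min_dist_sq) {
        int num_threads = omp_get_max_threads();
        std::vector<Candidate> candidates(num_threads);

        #pragma omp parallel num_threads(num_threads)
        {
            int t = omp_get_thread_num();
            size_t n = min_dist_sq.size();
            size_t begin = n * t / num_threads;
            size_t end   = n * (t + 1) / num_threads;

            // Local weighted draw over this thread's partition.
            // (A real implementation would use a per-thread RNG; rand() is not thread-safe.)
            std::vector<double> local(min_dist_sq.begin() + begin,
                                      min_dist_sq.begin() + end);
            size_t local_pick = weighted_rand_index(local);

            double weight = 0.0;
            for (double w : local) weight += w;

            candidates[t] = Candidate{begin + local_pick, weight};
        }

        return final_seed(candidates);   // final weighted draw over the candidates
    }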
k-means++ for Massively Multithreaded Architecture (Cray XMT)
Weighted_Rand() on Massively Multithreaded Architecture
k-means++ on GPU Implemented with Nvidia’s Thrust library for C++.
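The Thrust kernels themselves are not shown on the slide. As a rough illustration only (not the authors' code), the weighted draw can be expressed with standard Thrust primitives: an inclusive scan builds the cumulative weights and a device-side binary search locates the drawn value.

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>
    #include <thrust/binary_search.h>
    #include <cstdlib>

    // Weighted random selection on the GPU (sketch).
    // min_dist_sq holds D(x)^2 for every point, already resident on the device.
    int gpu_weighted_pick(const thrust::device_vector<float>& min_dist_sq) {
        // Inclusive prefix sum turns the weights into a cumulative distribution.
        thrust::device_vector<float> cdf(min_dist_sq.size());
        thrust::inclusive_scan(min_dist_sq.begin(), min_dist_sq.end(), cdf.begin());

        float total = cdf.back();                            // sum of all weights
        float r = total * (float(std::rand()) / RAND_MAX);   // uniform draw in [0, total]

        // Device-side binary search for the first cumulative weight >= r.
        return thrust::lower_bound(cdf.begin(), cdf.end(), r) - cdf.begin();
    }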
Prob_Reduce()
Scaling Performance Results
Platform Performance Comparison Conducted a series of experiments varying n, m, and k on different platforms. n: the number of data points. m: the dimensionality of the data. k: the number of clusters. Platforms: GPU (Nvidia Tesla C1060), OpenMP (8 cores), OpenMP (4 cores), Cray XMT (128 processors), Cray XMT (64 processors), Cray XMT (32 processors).
Linear Regression A linear regression model was fit for each platform to predict its execution time from n, m, and k. Accuracy was measured with root-mean-square error (RMSE). “The average deviation among all our platforms was just 4.4% of the average predicted time, with no platform having an RMSE greater than 11% of the mean.”
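For reference, the usual RMSE definition (generic symbols, not taken from the paper): with N experiments per platform, predicted times t̂_i, and measured times t_i,

    RMSE = sqrt( (1/N) · Σ_{i=1..N} (t̂_i − t_i)² )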
Comparison Visualization “Every single platform had a range of values for n, m, and k in which it was predicted to be the fastest of all our tested platforms.”
Summary The GPU dominated when the dimensionality of the data was small. The Cray XMT excelled when the dimensionality of the data was high or the number of data points became exceedingly large. Shared-memory multi-core processors (OpenMP) outperformed the others when the data was small or the number of clusters desired was small.
Summary (Cont.) “Using a number of threads equal to the number of processors will not always be the most efficient.” A program could be implemented that selects a more efficient number of threads for running the algorithm, with the added benefit of making more resources available for other processes.