Parallel k-means++ for Multiple Shared-Memory Architectures

Parallel k-means++ for Multiple Shared-Memory Architectures. Patrick Mackey (Pacific Northwest National Laboratory), Robert R. Lewis (Washington State University). ICPP 2016.

This Paper Describes approaches for parallelizing k-means++ on three distinct hardware architectures: OpenMP (shared-memory systems with multiple multi-core processors), the Cray XMT (a massively multithreaded architecture), and a high-performance GPU.

k-means++ A method that improves the quality of k-means clustering by selecting a set of initial seeds that, on average, yield better clustering than purely random selection. It uses a probabilistic approach for selecting seeds: the probability of choosing a data point is based on its distance from all previously selected seeds.
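
Concretely, in the standard k-means++ formulation each new seed is drawn with probability proportional to the squared distance from a point to its nearest already-selected seed:

$$P(x) = \frac{D(x)^2}{\sum_{x' \in X} D(x')^2},$$

where $D(x)$ denotes the distance from $x$ to the closest seed chosen so far.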

Pseudocode of Serial k-means++
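
The slide's pseudocode is not reproduced in this transcript. As a stand-in, here is a minimal C++ sketch of the serial seeding phase, assuming squared Euclidean distances and a weighted_rand_index helper (sketched under the next slide); this is an illustration, not the authors' code:

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

// Squared Euclidean distance between two m-dimensional points.
static double dist2(const std::vector<double>& a, const std::vector<double>& b) {
    double d = 0.0;
    for (std::size_t j = 0; j < a.size(); ++j) {
        const double diff = a[j] - b[j];
        d += diff * diff;
    }
    return d;
}

// Weighted selection helper; a sketch appears under the next slide.
std::size_t weighted_rand_index(const std::vector<double>& weights, std::mt19937& gen);

// k-means++ seeding: choose k initial centers (by index) from the data set X.
std::vector<std::size_t> kmeanspp_seeds(const std::vector<std::vector<double>>& X,
                                        std::size_t k, std::mt19937& gen) {
    std::vector<std::size_t> seeds;
    std::uniform_int_distribution<std::size_t> uniform(0, X.size() - 1);
    seeds.push_back(uniform(gen));  // first seed: chosen uniformly at random

    // dist[i] holds the squared distance from X[i] to its nearest seed so far.
    std::vector<double> dist(X.size(), std::numeric_limits<double>::max());
    while (seeds.size() < k) {
        for (std::size_t i = 0; i < X.size(); ++i)
            dist[i] = std::min(dist[i], dist2(X[i], X[seeds.back()]));
        // Next seed: drawn with probability proportional to D(x)^2.
        seeds.push_back(weighted_rand_index(dist, gen));
    }
    return seeds;
}
```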

Pseudocode of Weighted_Rand_Index
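
Again, the slide's pseudocode is not included here; a minimal sketch of a weighted random selection (the usual "roulette wheel" approach, not necessarily the authors' exact formulation) looks like:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Select an index i with probability weights[i] / sum(weights):
// draw r uniformly from [0, total) and walk the weights until r falls inside one.
std::size_t weighted_rand_index(const std::vector<double>& weights, std::mt19937& gen) {
    double total = 0.0;
    for (double w : weights) total += w;

    std::uniform_real_distribution<double> uniform(0.0, total);
    double r = uniform(gen);

    for (std::size_t i = 0; i < weights.size(); ++i) {
        if (r < weights[i]) return i;
        r -= weights[i];
    }
    return weights.size() - 1;  // guard against floating-point round-off
}
```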

Parallel k-means++ Parallelizing the probabilistic selection is challenging: a dependence exists between iterations of the while loop, so simple loop parallelism will not work. Instead, each thread is given a partition of the data points and makes its own seed selection from its subset of weighted probabilities using the same basic algorithm. This produces a list of potential seed choices and their probabilities.

Parallel k-means++ (Cont.) A second weighted probability selection is then performed on this list to decide the final chosen seed.
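
As a small worked example (assuming, as the correctness argument on the next slides requires, that each thread's candidate enters the final draw weighted by its partition's total weight): let the squared-distance weights be $w = (1, 3, 2, 2)$ with total $W = 8$, split between two threads as $X_1 = \{1, 3\}$ (total 4) and $X_2 = \{2, 2\}$ (total 4). The point with weight 3 becomes the seed with probability

$$\tfrac{3}{4} \times \tfrac{4}{8} = \tfrac{3}{8}$$

(chosen within $X_1$, then $X_1$ chosen in the final draw), exactly the $3/8$ that the serial algorithm would assign.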

Proof of Correctness Let x ∈ X be an arbitrary vector. p_par(x): the probability of selecting x in the parallel algorithm. p(x): the true probability of selecting x ∈ X by weighted probability. Theorem: p_par(x) = p(x).

Proof Let X′ be the set of vectors assigned to the thread whose partition contains x. The parallel algorithm selects x exactly when x wins the thread-local selection within X′ and X′ then wins the final selection, so p_par(x) = p(x | X′) · p(X′). Since x always lies in its own partition, p(X′ | x) = 1.0, and therefore p(x) = p(x ∩ X′) = p(x | X′) · p(X′) = p_par(x).
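
Written out in terms of the weights, with $w(x)$ the weight of $x$, $W(S) = \sum_{y \in S} w(y)$, and the final draw assumed to weight each thread's candidate by its partition total $W(X')$:

$$p_{\mathrm{par}}(x) = \underbrace{\frac{w(x)}{W(X')}}_{p(x \mid X')} \cdot \underbrace{\frac{W(X')}{W(X)}}_{p(X')} = \frac{w(x)}{W(X)} = p(x).$$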

k-means++ for OpenMP
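
The slide's OpenMP pseudocode is not included in the transcript. The sketch below is a hedged illustration of the two-level selection described above, not the authors' code; it assumes a blocked partition of the weights, per-thread RNGs, and a final draw weighted by each block's total weight, and it reuses the weighted_rand_index helper from the earlier sketch:

```cpp
#include <omp.h>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Weighted selection helper from the earlier sketch.
std::size_t weighted_rand_index(const std::vector<double>& weights, std::mt19937& gen);

// One parallel seed-selection step: each thread draws a candidate from its own
// block of the weights, then a final weighted draw over the per-thread candidates
// (weighted by each block's total weight) picks the seed. Returns the seed index.
std::size_t parallel_weighted_select(const std::vector<double>& weights, std::mt19937& gen) {
    const int nthreads = omp_get_max_threads();
    std::vector<std::size_t> candidate(nthreads, 0);
    std::vector<double> block_weight(nthreads, 0.0);

    // Seed one RNG per thread up front (gen itself is not thread-safe).
    std::vector<std::uint32_t> rng_seeds(nthreads);
    for (int t = 0; t < nthreads; ++t) rng_seeds[t] = static_cast<std::uint32_t>(gen());

    #pragma omp parallel num_threads(nthreads)
    {
        const int t = omp_get_thread_num();
        std::mt19937 local_gen(rng_seeds[t]);

        // Blocked partition of the index range handled by this thread.
        const std::size_t lo = weights.size() * t / nthreads;
        const std::size_t hi = weights.size() * (t + 1) / nthreads;

        double total = 0.0;
        for (std::size_t i = lo; i < hi; ++i) total += weights[i];
        block_weight[t] = total;

        // Thread-local weighted selection over [lo, hi).
        std::uniform_real_distribution<double> uniform(0.0, total);
        double r = uniform(local_gen);
        std::size_t pick = 0;
        for (std::size_t i = lo; i < hi; ++i) {
            pick = i;                      // fallback: last index, guards round-off
            if (r < weights[i]) break;
            r -= weights[i];
        }
        candidate[t] = pick;
    }

    // Final selection over the per-thread candidates, weighted by block totals.
    return candidate[weighted_rand_index(block_weight, gen)];
}
```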

K-means++ for Massively Multithreaded Architecture

Weighted_Rand() on Massively Multithreaded Architecture

K-means++ on GPU Implemented with Nvidia’s Thrust library for C++.
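
The GPU code itself is not reproduced on the slide. As one hedged illustration of how a weighted selection can be expressed with Thrust (not necessarily how the paper structures it), the D(x)^2 weights can be scanned on the device and the seed found with a binary search; this assumes the weights already reside in device memory and is compiled as CUDA C++:

```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/binary_search.h>
#include <cstddef>
#include <random>

// Weighted selection on the GPU with Thrust: P(i) is proportional to weights[i].
// Build the inclusive prefix sums of the weights on the device, draw r uniformly
// in [0, total), and binary-search for the first prefix sum that exceeds r.
std::size_t gpu_weighted_select(const thrust::device_vector<double>& weights,
                                std::mt19937& gen) {
    thrust::device_vector<double> prefix(weights.size());
    thrust::inclusive_scan(weights.begin(), weights.end(), prefix.begin());

    const double total = prefix.back();          // last prefix sum = total weight
    std::uniform_real_distribution<double> uniform(0.0, total);
    const double r = uniform(gen);

    // Index of the first prefix sum strictly greater than r.
    return thrust::upper_bound(prefix.begin(), prefix.end(), r) - prefix.begin();
}
```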

Prob_Reduce()

Scaling Performance Results

Platform Performance Comparison Conducted a series of experiments varying n, m, and k on different platforms, where n is the number of data points, m is the dimensionality of the data, and k is the number of clusters. Platforms: GPU (Nvidia Tesla C1060), OpenMP (8 cores), OpenMP (4 cores), Cray XMT (128 processors), Cray XMT (64 processors), Cray XMT (32 processors).

Linear Regression A linear regression model was fit for each platform to predict run time from n, m, and k. Accuracy was measured by root-mean-square error (RMSE): "The average deviation among all our platforms was just 4.4% of the average predicted time, with no platform having an RMSE greater than 11% of the mean."
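
For reference, the standard definition of RMSE over $N$ measured runs with predicted times $\hat{t}_i$ and observed times $t_i$ (this formula is the usual one, not reproduced from the paper):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{t}_i - t_i\right)^2}.$$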

Comparison Visualization "Every single platform had a range of values for n, m, and k in which it was predicted to be the fastest of all our tested platforms."

Summary The GPU dominated when the dimensionality of the data was small. The Cray XMT excelled when the dimensionality was high or the number of data points became exceedingly large. The shared-memory multi-core platform (OpenMP) outperformed the others when the data was small or the desired number of clusters was small.

Summary (Cont.) "Using a number of threads equal to the number of processors will not always be the most efficient." A program could be implemented that selects a better-suited number of threads for the algorithm, with the added benefit of making more resources available for other processes.
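
A hedged sketch of what such a selection might look like; the function name and thresholds below are purely illustrative and not taken from the paper:

```cpp
#include <omp.h>
#include <cstddef>

// Illustrative heuristic: pick a thread count from a rough work estimate
// instead of always using every processor (thresholds are made up).
int choose_num_threads(std::size_t n, std::size_t m, std::size_t k) {
    const std::size_t work = n * m * k;          // crude proxy for total work
    const int max_threads = omp_get_max_threads();
    if (work < 100000)   return 1;               // small problems: avoid parallel overhead
    if (work < 10000000) return max_threads / 2 > 0 ? max_threads / 2 : 1;
    return max_threads;
}

// Usage: omp_set_num_threads(choose_num_threads(n, m, k));
```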