Progressive Clustering of Big Data with GPU Acceleration and Visualization Jun Wang1, Eric Papenhausen1, Bing Wang1, Sungsoo Ha1, Alla Zelenyuk2, and Klaus.

Slides:

Advertisements

Similar presentations

GPU-Accelerated Incremental Correlation Clustering of Large Data with Visual Feedback Eric Papenhausen and Bing Wang (Stony Brook University) Sungsoo Ha.

Advertisements

1 Optimizing compilers Managing Cache Bercovici Sivan.

PARTITIONAL CLUSTERING

Efficient Sparse Matrix-Matrix Multiplication on Heterogeneous High Performance Systems AACEC 2010 – Heraklion, Crete, Greece Jakob Siegel 1, Oreste Villa.

Database Management Systems, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 11.

External Sorting CS634 Lecture 10, Mar 5, 2014 Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.

Extensible Scalable Monitoring for Clusters of Computers Eric Anderson U.C. Berkeley Summer 1997 NOW Retreat.

External Sorting R & G Chapter 13 One of the advantages of being

External Sorting R & G Chapter 11 One of the advantages of being disorderly is that one is constantly making exciting discoveries. A. A. Milne.

External Sorting 198:541. Why Sort?  A classic problem in computer science!  Data requested in sorted order e.g., find students in increasing gpa order.

1 External Sorting for Query Processing Yanlei Diao UMass Amherst Feb 27, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.

Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.

FLANN Fast Library for Approximate Nearest Neighbors

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, ECE408 Applied Parallel Programming Lecture 11 Parallel.

IIIT, Hyderabad Performance Primitives for Massive Multithreading P J Narayanan Centre for Visual Information Technology IIIT, Hyderabad.

© David Kirk/NVIDIA and Wen-mei W. Hwu ECE408/CS483/ECE498al, University of Illinois, ECE408 Applied Parallel Programming Lecture 12 Parallel.

Evaluating FERMI features for Data Mining Applications Masters Thesis Presentation Sinduja Muralidharan Advised by: Dr. Gagan Agrawal.

Performance Prediction for Random Write Reductions: A Case Study in Modelling Shared Memory Programs Ruoming Jin Gagan Agrawal Department of Computer and.

Parallel dynamic batch loading in the M-tree Jakub Lokoč Department of Software Engineering Charles University in Prague, FMP.

Parallel Algorithms Patrick Cozzi University of Pennsylvania CIS Spring 2012.

Paradyn Project Paradyn / Dyninst Week Madison, Wisconsin April 29-May 3, 2013 Mr. Scan: Efficient Clustering with MRNet and GPUs Evan Samanas and Ben.

A Parallel Implementation of MSER detection GPGPU Final Project Lin Cao.

Computer Science and Engineering Parallelizing Defect Detection and Categorization Using FREERIDE Leonid Glimcher P. 1 ipdps’05 Scaling and Parallelizing.

IIIT Hyderabad Scalable Clustering using Multiple GPUs K Wasif Mohiuddin P J Narayanan Center for Visual Information Technology International Institute.

Efficient Local Statistical Analysis via Integral Histograms with Discrete Wavelet Transform Teng-Yok Lee & Han-Wei Shen IEEE SciVis ’13Uncertainty & Multivariate.

5/29/2008AI UEC in Japan Chapter 12 Clustering: Large Databases Written by Farial Shahnaz Presented by Zhao Xinyou Data Mining Technology.

An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-body Algorithm By Martin Burtscher and Keshav Pingali Jason Wengert.

© David Kirk/NVIDIA and Wen-mei W. Hwu University of Illinois, CS/EE 217 GPU Architecture and Parallel Programming Lecture 10 Reduction Trees.

Introduction to Database Systems1 External Sorting Query Processing: Topic 0.

Database Management Systems, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 11.

External Sorting. Why Sort? A classic problem in computer science! Data requested in sorted order –e.g., find students in increasing gpa order Sorting.

Incremental Reduced Support Vector Machines Yuh-Jye Lee, Hung-Yi Lo and Su-Yun Huang National Taiwan University of Science and Technology and Institute.

ITree: Exploring Time-Varying Data using Indexable Tree Yi Gu and Chaoli Wang Michigan Technological University Presented at IEEE Pacific Visualization.

Large-scale geophysical electromagnetic imaging and modeling on graphical processing units Michael Commer (LBNL) Filipe R. N. C. Maia (LBNL-NERSC) Gregory.

Accelerating K-Means Clustering with Parallel Implementations and GPU Computing Janki Bhimani Miriam Leeser Ningfang Mi

PDES Introduction The Time Warp Mechanism

External Sorting Chapter 13

Accelerating MapReduce on a Coupled CPU-GPU Architecture

Recitation 2: Synchronization, Shared memory, Matrix Transpose

Spare Register Aware Prefetching for Graph Algorithms on GPUs

Yu Su, Yi Wang, Gagan Agrawal The Ohio State University

Physical Join Operators

Linchuan Chen, Peng Jiang and Gagan Agrawal

Localizing the Delaunay Triangulation and its Parallel Implementation

Database Management Systems (CS 564)

Lecture#12: External Sorting (R&G, Ch13)

Optimizing MapReduce for GPUs with Effective Shared Memory Usage

Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform

External Sorting Chapter 13

KMeans Clustering on Hadoop Fall 2013 Elke A. Rundensteiner

Selected Topics: External Sorting, Join Algorithms, …

GPGPU: Parallel Reduction and Scan

DATA MINING Introductory and Advanced Topics Part II - Clustering

Parallel Computation Patterns (Reduction)

CS 179: Lecture 3.

Parallelization of Sparse Coding & Dictionary Learning

External Sorting.

Sorting We may build an index on the relation, and then use the index to read the relation in sorted order. May lead to one disk block access for each.

BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies

Parallel k-means++ for Multiple Shared-Memory Architectures

Visual Causality Analysis Made Practical

CS179: GPU PROGRAMMING Recitation 2 GPU Memory Synchronization

Parallel Programming in C with MPI and OpenMP

External Sorting Chapter 13

N-Body Gravitational Simulations

Dynamic Load Balancing of Unstructured Meshes

Index Structures Chapter 13 of GUW September 16, 2019

L. Glimcher, R. Jin, G. Agrawal Presented by: Leo Glimcher

Presentation transcript:

Progressive Clustering of Big Data with GPU Acceleration and Visualization Jun Wang1, Eric Papenhausen1, Bing Wang1, Sungsoo Ha1, Alla Zelenyuk2, and Klaus Muller1 1Computer Science Department, Stony Brook University 2Pacific Northwest National Laboratory 2017 New York Scientific Data Summit (NYSDS) Jun Wang

The Internet of Things and People

Big Data for Scientific Research Nature - Big Data Sept. 3, 2008, Vol. 455, Issue 7209 Science - Dealing with Data Feb. 11, 2011, Vol. 331, Issue 6018

Our Data – Aerosol Science Understand the processes that control formation, physicochemical properties and transformations of particles Acquired by a state-of-the-art single particle mass spectrometer (SPLAT II) often deployed in an aircraft

Our Data – Aerosol Science SPLAT II can acquire up to 100 particles per second at sizes between 50-3,000 nm at a precision of 1 nm Creates a 450-D mass spectrum for each particle Overall Goal: Build the hierarchical structure of particles that can be used in automated classification of new particle acquisitions SpectraMiner

Our Data – Aerosol Science Data Scale: 450 dimensions Typically, several millions of points Goal: Overall: hierarchical (tree) structure In this talk: parallel clustering algorithms for Redundancy elimination Learning the leaf level of the tree

Incremental k-Means – Sequential The old CPU-based solution Input: data points P, distance threshold t Output: clusters C C = empty set for each unclustered point p in P if C is empty then Make p a new cluster center and add it into C else p = next unclustered point Find the cluster center c in C closest to p let d = distance(c, p) if d < t then Cluster p into c else Make p a new cluster center added to C end if end for return C

Incremental k-Means – Parallel NEW GPU-based solution Input: data points P, distance threshold t, batch size b, max iteration M Output: clusters C C = empty set while number of un-clustered points in P > 0 Run Alg. 1 until a number of b clusters B emerge Iteration 𝑖=0 while 𝑖<𝑀 and B is not stable in parallel: for each unclustered point 𝑝 𝑖 Find the center 𝑏 𝑖 in B closest to 𝑝 𝑖 if 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒 𝑏 𝑖 , 𝑝 𝑖 <𝑡 then 𝑐 𝑖 = 𝑏 𝑖 else 𝑐 𝑖 = null end for on CPU: Assign 𝑝 𝑖 to 𝑏 𝑖 if 𝑐 𝑖 is not null in parallel: update centers of B end while Add B to C return C

Incremental k-Means – Parallel GPU Thread Block: 32 × 32 threads If we set the batch size b = 96: A group of 96/32 = 3 cluster centers is mapped to a column of the block Number of thread block launched: N/32

Dimension Reduction and the Threshold t Dimension standard deviations by parallel reduction

Comments and Observations Algorithm merges the incremental k-means algorithm with a parallel implementation (k=C) Design choices: C=96 good balance between CPU and GPU utilization With C>96 algorithm becomes CPU-bound With C<96 the GPU would be underutilized A multiple of 32 avoids divergent warps on the GPU Max iterations = 5 worked best Advantages of the new scheme: Second pass of previous scheme no longer needed

GPU Implementation Platform Parallelism 1-4 Tesla K20 GPUs Installed in a remote ‘cloud’ server Parallelism Launch N/32 thread blocks of size 32 x 32 each Each thread compares a point with 3 cluster centers Make use of shared memory to avoid non-coalesced memory accesses

Quality Measures Cluster quality measure: Davies-Bouldin (DB) index 𝐷𝐵= 1 𝑛 𝑖=1 𝑛 max⁡( 𝜎 𝑖 + 𝜎 𝑗 𝑀 𝑖𝑗 ) 𝜎 𝑖 - intra-cluster distance 𝑀 𝑖𝑗 - inter-cluster distance The lower the DB, the better the quality

Acceleration by Sub-Thresholding Points with small distance to the center are settled and will not be re-visited in inner iterations This also somehow improves the DB index Data Size Sequential Parallel ST: 0.3 ST: 0.2 ST: 0.15 10k 0.527 0.539 0.540 0.537 0.529 50k 0.546 0.590 0.548 0.554 100k 0.550 0.584 0.600 0.570 0.544 200k 0.564 0.587 0.640 0.593

Results – Different Settings About 33x speedup

Results – Multi-GPU 4-GPU has about 100x speedup over sequential

In-Situ Visual Feedback (1) Visualize cluster centers as summary snapshots Glimmer MDS algorithm Intuitive 2D layout Color map: Small clusters map to mostly white Large clusters map to saturated blue We find that early visualizations are already quite revealing This is shown by cluster size histogram Cluster size of M>10 is considered significant

In-Situ Visual Feedback (2) 79/96 998/3360 2004/13920

In-Situ Visual Feedback (3) 3001/52800 4002/165984 4207/336994

Conclusions and Future Work Current approach quite promising Good speedup and results In-situ visualization of data reduction process with early valuable feedback Future work More efficient data storage facilities Load-balancing point for multi-GPU Accelerate the hierarchy building A comprehensive VA system

Final Slide Thanks for attending! And thanks to NSF and DOE for funding Any questions?