Paradyn Project
Paradyn / Dyninst Week, Madison, Wisconsin, April 29-May 3, 2013
Mr. Scan: Efficient Clustering with MRNet and GPUs
Evan Samanas and Ben Welton
Density-based clustering
o Discovers the number of clusters
o Finds oddly-shaped clusters
Clustering Example (DBSCAN [1])
Goal: Find regions that meet minimum density and spatial distance characteristics.
The two parameters that determine whether a point is in a cluster are Epsilon (Eps) and MinPts.
If the number of points within Eps of a point is > MinPts, the point is a core point.
For every discovered point, this same calculation is performed until the cluster is fully expanded.
[Figure: example point with its Eps radius and MinPts neighbors marked; MinPts: 3]
[1] M. Ester et al., A density-based algorithm for discovering clusters in large spatial databases with noise (1996)
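The core-point test above can be sketched in a few lines. This is a minimal illustration, not the presenters' code: the function name `is_core_point` is hypothetical, points are assumed to be 2-D tuples, and the count includes the point itself. It follows the slide's strict "> MinPts" test; Ester et al. [1] state the condition as at least MinPts.

```python
from math import dist

def is_core_point(p, points, eps, min_pts):
    """Return True if p has more than min_pts points within distance eps.

    The count includes p itself when p is in `points`. The strict '>'
    follows the slide; the original DBSCAN paper uses '>='.
    """
    neighbors = sum(1 for q in points if dist(p, q) <= eps)
    return neighbors > min_pts

# Small example: the origin has three neighbors plus itself within eps = 1.0.
points = [(0, 0), (1, 0), (0, 1), (-1, 0), (5, 5)]
core_flags = [is_core_point(p, points, eps=1.0, min_pts=3) for p in points]
```

With MinPts = 3, only the origin qualifies here; the isolated point at (5, 5) counts only itself.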
Scaling DBSCAN
o PDBSCAN (1999) [2]
  o Quality equivalent to single-node DBSCAN
  o Linear speedup up to 8 nodes
o DBDC (2004) [3]
  o Sacrifices quality
  o ~30x speedup on 15 nodes
o PDSDBSCAN (2012) [4]
  o Quality equivalent to single-node DBSCAN
  o 5675x speedup on 8192 nodes (72 million points)
o 2 Map/Reduce attempts (2011, 2012)
  o Quality equivalent to single-node DBSCAN
  o 6x speedup on 12 nodes
[2] X. Xu et al., A fast Parallel Clustering Algorithm for Large Spatial Databases (1999)
[3] E. Januzaj et al., DBDC: Density Based Distributed Clustering (2004)
[4] M. Patwary et al., A new scalable parallel DBSCAN algorithm using the disjoint-set data structure (2012)
Challenges of scaling DBSCAN
o Data distribution
  o How do we effectively take an input file and create partitions that can be clustered by DBSCAN?
  o Distributed 2-D partitioner reading from a distributed file system
o Load balancing
  o How do we keep variance in clustering times across nodes to a minimum?
  o Dense Box
o Merge
  o How do we reduce the amount of data needed for the merge while keeping accuracy high?
  o Representative points
MRNet – Multicast / Reduction Network
o General-purpose TBON API
  o Network: user-defined topology
  o Stream: logical data channel
    o to a set of back-ends
    o multicast, gather, and custom reduction
  o Packet: collection of data
  o Filter: stream data operator
    o synchronization
    o transformation
o Widely adopted by HPC tools
  o CEPBA toolkit
  o Cray ATP & CCDB
  o Open|SpeedShop & CBTF
  o STAT
  o TAU
[Figure: TBON diagram with FE at the root, CP nodes applying a filter F(x1,…,xn), and BE nodes attached to application processes]
TBON Computation
Ideal Characteristics:
o Filter output size constant or decreasing
o Computation rate similar across levels
o Adjustable for load balance
[Figure: TBON diagram (FE, CP, BE app) annotated with Data Size: 10 MB per BE; Packet Size: ≤ 10 MB at each level; timings ~10 sec, ~40 sec (4x), ~10 sec; Total Time: ~30 sec and Total Time: ~60 sec]
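The first characteristic, filter output that stays constant in size, can be illustrated with a toy tree reduction. This is a plain Python simulation and does not use the MRNet API; the dictionary-based tree layout and the names `reduce_tree` and `top_k_filter` are assumptions made for the sketch.

```python
def top_k_filter(packets, k=3):
    """Merge child packets and keep only the k largest values.

    The output never exceeds k values, however many children feed it.
    """
    merged = sorted(value for packet in packets for value in packet)
    return merged[-k:]

def reduce_tree(node, filter_fn):
    """Reduce a tree bottom-up: leaves send packets, inner nodes filter them."""
    if "children" not in node:
        return node["packet"]  # back-end: send the local packet upward
    child_packets = [reduce_tree(child, filter_fn) for child in node["children"]]
    return filter_fn(child_packets)  # inner node: apply the filter

# Hypothetical topology: one root, two inner nodes, four back-ends.
tree = {"children": [
    {"children": [{"packet": [1, 9, 4]}, {"packet": [7, 2]}]},
    {"children": [{"packet": [8, 3]}, {"packet": [6, 5, 10]}]},
]}
result = reduce_tree(tree, top_k_filter)
```

Each inner node forwards at most three values, so the packet reaching the root is no larger than the packets one level down.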
Intro to Mr. Scan
Mr. Scan Phases
Partition: Distributed
DBSCAN: BE)
Merge: CPU (x #levels)
Sweep: CPU (x #levels)
[Figure: TBON diagram with FS feeding the BEs, DBSCAN at the BEs, Merge at the CP and FE levels, and Sweep from the FE back down to the BEs]
Mr. Scan Architecture
Clustering 6.5 Billion Points (Time: 0 to Time: 18.2 Min)
Phases: Partitioner, DBSCAN, Merge & Sweep
Timeline:
o FS Read: 224 Secs
o FS Write: 489 Secs
o MRNet Startup: 130 Secs
o FS Read: 24 Secs
o DBSCAN: 168 Secs
o Merge Time: 6 Secs
o Sweep Time: 4 Secs
o Write Output: 19 Secs
Partition Phase
o Goal: Partitions computationally equivalent to DBSCAN
o Algorithm:
  o Form initial partitions
  o Add shadow regions
  o Rebalance
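The first two algorithm steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the distributed 2-D partitioner itself: it splits along x only, takes the strip boundaries as given (rebalancing is not shown), and treats a shadow region as the points within Eps outside a strip's edges. The name `partition_with_shadow` is hypothetical.

```python
def partition_with_shadow(points, boundaries, eps):
    """Split 2-D points into vertical strips and attach shadow regions.

    boundaries: sorted x-coordinates [b0, b1, ..., bk]; strip i owns
    points with b_i <= x < b_(i+1). The shadow of a strip holds points
    owned by neighboring strips that lie within eps of its edges.
    """
    partitions = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        owned = [p for p in points if lo <= p[0] < hi]
        shadow = [p for p in points
                  if lo - eps <= p[0] < lo or hi <= p[0] < hi + eps]
        partitions.append({"owned": owned, "shadow": shadow})
    return partitions

points = [(0.5, 0.0), (0.95, 0.0), (1.05, 0.0), (1.5, 0.0)]
parts = partition_with_shadow(points, boundaries=[0.0, 1.0, 2.0], eps=0.1)
```

In this example the two points that straddle x = 1.0 each appear in the other strip's shadow, so either node can see neighbors that lie just across its boundary.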
Distributed Partitioner
GPU DBSCAN Filter
DBSCAN is performed in two distinct steps
Step 1: Detect Core Points
[Figure: GPU thread layout, Block 1 through Block 900, each with threads T1 through T512]
Step 2: Expand core points and color
[Figure: GPU thread layout, Block 1 through Block 900, each with threads T1 through T512]
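The two steps can be sketched serially on the CPU. This is an illustration of the two-step structure only, not the GPU filter: the per-point loop in step 1 stands in for the independent per-thread work shown in the figure, the function names are hypothetical, and the strict "> MinPts" test is taken from the DBSCAN slide.

```python
from math import dist

def detect_core_points(points, eps, min_pts):
    """Step 1: flag each point whose eps-neighborhood holds > min_pts points.

    Each iteration is independent of the others, which is what makes
    this step easy to spread across many threads.
    """
    return [sum(1 for q in points if dist(p, q) <= eps) > min_pts
            for p in points]

def expand_and_color(points, core, eps):
    """Step 2: grow clusters outward from core points, one color per cluster."""
    labels = [-1] * len(points)  # -1 means noise or not yet colored
    color = 0
    for seed in range(len(points)):
        if not core[seed] or labels[seed] != -1:
            continue
        labels[seed] = color
        frontier = [seed]
        while frontier:
            i = frontier.pop()
            for j, q in enumerate(points):
                if labels[j] == -1 and dist(points[i], q) <= eps:
                    labels[j] = color
                    if core[j]:  # only core points keep expanding
                        frontier.append(j)
        color += 1
    return labels

points = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1),
          (5, 5), (5.1, 5), (5, 5.1), (5.1, 5.1), (10, 10)]
core = detect_core_points(points, eps=0.2, min_pts=3)
labels = expand_and_color(points, core, eps=0.2)
```

The two tight groups receive colors 0 and 1, and the isolated point keeps the noise label -1.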
Dense Box
One significant scalability issue is dealing with dense regions of data.
Density increases the computation cost of DBSCAN: requires more comparison operations.
[Figure: regions R1 and R2 of differing density]
We reduce the computation cost of high-density regions by pre-clustering these regions.
o KD-Tree: look at each leaf bounding box, looking for boxes with point count > minpts and size < 0.35 * eps
o DBSCAN no longer needs to expand these regions
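The dense-box test on a single KD-tree leaf might look like the sketch below. It assumes that "size" means the longer side of the leaf's bounding box, which the slide does not spell out; the name `is_dense_box` and the `factor` parameter are hypothetical.

```python
def is_dense_box(box_points, eps, min_pts, factor=0.35):
    """Return True if a leaf holds > min_pts points in a box smaller than factor * eps.

    Assumption: 'size' is the longer side of the points' bounding box.
    """
    if not box_points:
        return False
    xs = [p[0] for p in box_points]
    ys = [p[1] for p in box_points]
    longest_side = max(max(xs) - min(xs), max(ys) - min(ys))
    return len(box_points) > min_pts and longest_side < factor * eps

leaf = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.05), (0.3, 0.3)]
dense = is_dense_box(leaf, eps=1.0, min_pts=3)
```

Under that reading, a box whose sides are below 0.35 * eps has a diagonal below about 0.495 * eps, so every pair of points inside it is within eps of each other and the whole leaf can be treated as one pre-clustered unit.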
Merge Algorithm
o Merge overlapping clusters found on different nodes.
o Two steps in the merge operation
  1. Select representative points (BE)
  2. Merge operation
Representative Points
o These are points that represent the core points in the dataset.
o They create a boundary within which at least one core point shared between overlapping clusters must be contained.
o Representative points are the points closest to the corners and to the middle of each side of the eps box.
o These points create a boundary (shaded region in the figure) which a point must fall in to merge overlapping clusters.
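Selecting the representative points can be sketched as a nearest-point search against eight anchor positions. This is an illustration under assumptions: the box is passed in as `(xmin, ymin, xmax, ymax)`, distance is Euclidean, and the name `representative_points` is hypothetical.

```python
from math import dist

def representative_points(core_points, box):
    """Pick the core points nearest the 4 corners and 4 side midpoints of box.

    box is (xmin, ymin, xmax, ymax). At most 8 points are returned; fewer
    if one core point is nearest to several anchors.
    """
    xmin, ymin, xmax, ymax = box
    xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
    anchors = [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax),
               (xmid, ymin), (xmid, ymax), (xmin, ymid), (xmax, ymid)]
    reps = []
    for anchor in anchors:
        nearest = min(core_points, key=lambda p: dist(p, anchor))
        if nearest not in reps:
            reps.append(nearest)
    return reps

core_points = [(0.1, 0.1), (0.1, 0.9), (0.9, 0.1), (0.9, 0.9),
               (0.5, 0.1), (0.5, 0.9), (0.1, 0.5), (0.9, 0.5), (0.5, 0.5)]
reps = representative_points(core_points, box=(0.0, 0.0, 1.0, 1.0))
```

The interior point at (0.5, 0.5) is never the closest to a corner or side midpoint, so it is left out and only the eight outer points are sent up the tree.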
Merge Algorithm
The merge algorithm is responsible for merging overlapping clusters detected on different DBSCAN nodes.
Need to handle the merge with low overhead and without the full dataset.
1. Core/Core overlap: core point in common. 64 operations to detect.
2. Non-core/Core overlap: core point seen as non-core by one node. MinPts * 2 operations required to detect.
[Figure: for each case, clusters on Node 1 and Node 2 with core points and non-core points marked]
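The core/core case can be sketched as a pairwise comparison of two clusters' representative points; with eight points per cluster that is at most 8 x 8 = 64 comparisons. The predicate used here, two core points within Eps of each other (which also covers an identical shared point), is an assumption based on the DBSCAN definition rather than a detail given on the slide, and the name `core_core_overlap` is hypothetical.

```python
from math import dist

def core_core_overlap(reps_a, reps_b, eps):
    """Return True if any representative core point of A is within eps of one of B.

    Worst case is len(reps_a) * len(reps_b) distance checks, i.e. 64
    when both clusters carry 8 representative points.
    """
    return any(dist(a, b) <= eps for a in reps_a for b in reps_b)

reps_node1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
reps_node2 = [(1.05, 1.0), (3.0, 3.0)]
should_merge = core_core_overlap(reps_node1, reps_node2, eps=0.1)
```

Because only representative points are compared, the check needs neither cluster's full point set, which matches the low-overhead requirement stated above.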
Sweep Step
o Get cluster identifiers and file offsets down to the BEs to write final clusters.
o FE gives each cluster a unique ID and a file offset.
o This data is passed back down to the BE that holds the data in the cluster.
o Data is written out to disk by the BE.
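The ID and offset assignment at the FE can be sketched as a running sum over cluster sizes. The sketch assumes fixed-size records and clusters written back to back in ID order; neither detail is stated on the slide, and `assign_ids_and_offsets` and `record_bytes` are hypothetical names.

```python
def assign_ids_and_offsets(cluster_sizes, record_bytes):
    """Give each cluster a unique ID and the byte offset where its output starts.

    cluster_sizes: number of points in each merged cluster.
    record_bytes: assumed fixed size of one output record.
    """
    assignments = []
    offset = 0
    for cluster_id, size in enumerate(cluster_sizes):
        assignments.append((cluster_id, offset))
        offset += size * record_bytes  # next cluster starts after this one
    return assignments

assignments = assign_ids_and_offsets([3, 5, 2], record_bytes=16)
```

Each (ID, offset) pair would then travel back down the tree so that a BE can write its part of the cluster without coordinating with other BEs.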
Experiment Setup
o Dataset: Generated data with distribution from real Twitter data
o Measuring:
  o Weak scaling up to 8192 GPUs
  o Strong scaling
  o Quality compared to single-threaded DBSCAN
Results
Weak Scaling:
o 4096x data/compute increase
o 18.48x-31.68x time increase
Results Breakdown – Partition
6.5 Billion Points:
o 65.9% of Mr. Scan’s time
o 94.6% I/O time
Results Breakdown – GPU Cluster Time
Strong Scaling
Quality
Future Work
o Remove partitioner’s I/O bottleneck
o Multiple dimensions
Conclusion
o Clustered 6.5 billion points with DBSCAN in 18.2 minutes
o Controlled computational variance of DBSCAN
o Partitioner I/O = scaling enemy
Questions?
Summary of previous Mr. Scan implementation
Algorithm Steps
SpatialDecomp: FE)
DBSCAN: CPU or BE)
DrawBoundBox: CPU or GPU
MergeCluster: CPU (x #levels)
[Figure: TBON diagram (FE, CP, BE) with DBSCAN at the BEs and MergeCluster at the upper levels]