Data-Intensive Computing: From Clouds to GPUs
Gagan Agrawal
June 1, 2016
Motivation
- Growing need for analysis of large-scale data
  - Scientific
  - Commercial
- Data-Intensive Supercomputing (DISC)
- Map-Reduce has received a lot of attention
  - Database and data mining communities
  - High-performance computing community
Motivation (2)
- Processor architecture trends
  - Clock speeds are not increasing
  - Trend towards multi-core architectures
  - Accelerators like GPUs are very popular
  - Clusters of multi-cores / GPUs are becoming common
- Trend towards the cloud
  - Use storage/computing/services from a provider
  - How many of you prefer gmail over your cse/osu account for email?
  - Utility model of computation
  - Need high-level APIs and adaptation
My Research Group
- Data-intensive theme at multiple levels
  - Parallel programming models
  - Multi-cores and accelerators
  - Adaptive middleware
  - Scientific data management / workflows
  - Deep web integration and analysis
Personnel
- Currently
  - 10 PhD students
  - 4 MS thesis students
- Graduated PhDs
  - 7 PhDs graduated between 2005 and 2008
This Talk
- Parallel programming API for data-intensive computing
  - An alternate API and system for Google's Map-Reduce
  - Show actual comparison
- Data-intensive computing on accelerators
  - Compilation for GPUs
- Overview of other topics
  - Scientific data management
  - Adaptive and streaming middleware
  - Deep web
Map-Reduce
- Simple API for (data-intensive) parallel programming
- Computation is:
  - Apply map on each input data element
  - Produce (key, value) pair(s)
  - Sort them using the key
  - Apply reduce to the set of values for each distinct key
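The four steps above can be sketched sequentially, using wordcount (the same application measured later in the talk). This is a minimal illustration, not the Hadoop API; the helper names `word_map`, `word_reduce`, and `map_reduce` are hypothetical:

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// map: emit one (key, value) pair per word in the input element
std::vector<std::pair<std::string, int>> word_map(const std::string& line) {
    std::vector<std::pair<std::string, int>> out;
    std::istringstream in(line);
    std::string w;
    while (in >> w) out.emplace_back(w, 1);
    return out;
}

// reduce: combine all values that share one key
int word_reduce(const std::vector<int>& vals) {
    int sum = 0;
    for (int v : vals) sum += v;
    return sum;
}

std::map<std::string, int> map_reduce(const std::vector<std::string>& input) {
    // group values by key (the sort/shuffle phase; std::map keeps keys ordered)
    std::map<std::string, std::vector<int>> groups;
    for (const auto& elem : input)
        for (const auto& kv : word_map(elem))
            groups[kv.first].push_back(kv.second);
    // apply reduce once per distinct key
    std::map<std::string, int> result;
    for (const auto& g : groups) result[g.first] = word_reduce(g.second);
    return result;
}
```

Note that the intermediate `groups` structure materializes every (key, value) pair before any reduction happens; this grouping step is exactly where the sorting overhead discussed later comes from.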
Map-Reduce Execution
Map-Reduce: Positives and Questions
- Positives:
  - Simple API
    - Based on functional languages
    - Very easy to learn
  - Support for fault tolerance
    - Important for very large-scale clusters
- Questions:
  - Performance? Comparison with other approaches
  - Suitability for different classes of applications?
Class of Data-Intensive Applications
- Many different types of applications
  - Data-center kind of applications
    - Data scans, sorting, indexing
  - More "compute-intensive" data-intensive applications
    - Machine learning, data mining, NLP
    - Map-Reduce / Hadoop is being widely used for this class
  - Standard database operations
    - A SIGMOD 2009 paper compares Hadoop with databases and OLAP systems
- What is Map-Reduce suitable for? What are the alternatives?
  - MPI/OpenMP/Pthreads: too low-level?
Hadoop Implementation
- HDFS
  - Almost GFS, but no file update
  - Cannot be directly mounted by an existing operating system
  - Fault tolerance
- Name Node
- Job Tracker
- Task Tracker
FREERIDE: Goals
Framework for Rapid Implementation of Data Mining Engines (developed at Ohio State)
- The ability to rapidly prototype a high-performance mining implementation
  - Distributed-memory parallelization
  - Shared-memory parallelization
  - Ability to process disk-resident datasets
  - Only modest modifications to a sequential implementation for the above three
FREERIDE – Technical Basis
- Popular data mining algorithms share a common canonical loop
- Generalized reduction
- Can be used as the basis for supporting a common API
- Demonstrated for popular data mining and scientific data processing applications

While( ) {
  forall (data instances d) {
    I = process(d)
    R(I) = R(I) op f(d)
  }
  …….
}
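As one concrete instance of this canonical loop, consider a single 1-D k-means iteration: `process(d)` selects the index of the nearest centroid, and `op` accumulates a (sum, count) pair into the reduction object. This is an illustrative sketch, not FREERIDE's actual API; `Reduc` and `kmeans_step` are hypothetical names:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// The reduction object: one (sum, count) slot per centroid.
struct Reduc { double sum = 0; int count = 0; };

// One pass of the canonical loop specialized to 1-D k-means:
//   I = process(d)        -> index of nearest centroid
//   R(I) = R(I) op f(d)   -> element-wise (sum, count) accumulation
std::vector<double> kmeans_step(const std::vector<double>& data,
                                std::vector<double> centroids) {
    std::vector<Reduc> R(centroids.size());
    for (double d : data) {                       // forall (data instances d)
        size_t i = 0;                             // i = process(d)
        for (size_t c = 1; c < centroids.size(); ++c)
            if (std::fabs(d - centroids[c]) < std::fabs(d - centroids[i]))
                i = c;
        R[i].sum += d;                            // R(i) = R(i) op f(d)
        R[i].count += 1;
    }
    // the outer sequential loop would recompute centroids and iterate
    for (size_t c = 0; c < centroids.size(); ++c)
        if (R[c].count > 0) centroids[c] = R[c].sum / R[c].count;
    return centroids;
}
```

Unlike the map-reduce formulation, no (key, value) pairs are emitted or sorted: updates go directly into the programmer-managed reduction object.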
Comparing Processing Structure
Similar, but with subtle differences
Observations on Processing Structure
- Map-Reduce is based on a functional idea
  - Does not maintain state
  - This can lead to sorting overheads
- The FREERIDE API is based on a programmer-managed reduction object
  - Not as 'clean'
  - But avoids sorting
  - Can also help shared-memory parallelization
  - Helps better fault recovery
Experiment Design
- Tuning parameters in Hadoop
  - Input split size
  - Max number of concurrent map tasks per node
  - Number of reduce tasks
- For comparison, we used four applications
  - Data mining: KMeans, KNN, Apriori
  - Simple data scan application: Wordcount
- Experiments on a multi-core cluster
  - 8 cores per node (8 map tasks)
Results – Data Mining
KMeans: varying # of nodes (Dataset: 6.4 GB; K: 1000; Dim: 3)
[Chart: Avg. Time Per Iteration (sec) vs. # of nodes]
Results – Data Mining (II)
Apriori: varying # of nodes (Dataset: 900 MB; Support level: 3%; Confidence level: 9%)
[Chart: Avg. Time Per Iteration (sec) vs. # of nodes]
Results – Data Mining (III)
KNN: varying # of nodes (Dataset: 6.4 GB; K: 1000; Dim: 3)
[Chart: Avg. Time Per Iteration (sec) vs. # of nodes]
Results – Datacenter-like Application
Wordcount: varying # of nodes (Dataset: 6.4 GB)
[Chart: Total Time (sec) vs. # of nodes]
Scalability Comparison
KMeans: varying dataset size (K: 100; Dim: 3; on 8 nodes)
[Chart: Avg. Time Per Iteration (sec) vs. dataset size]
Scalability – Word Count
Wordcount: varying dataset size (on 8 nodes)
[Chart: Total Time (sec) vs. dataset size]
Overhead Breakdown
- Four components affect Hadoop's performance
  - Initialization cost
  - I/O time
  - Sorting/grouping/shuffling
  - Computation time
- What is the relative impact of each? An experiment with KMeans
Analysis with K-means
Varying the number of clusters (k) (Dataset: 6.4 GB; Dim: 3; on 16 nodes)
[Chart: Avg. Time Per Iteration (sec) vs. # of KMeans clusters]
Analysis with K-means (II)
Varying the number of dimensions (Dataset: 6.4 GB; K: 1000; on 16 nodes)
[Chart: Avg. Time Per Iteration (sec) vs. # of dimensions]
Observations
- Initialization costs and the limited I/O bandwidth of HDFS are significant in Hadoop
- Sorting is also an important limiting factor for Hadoop's performance
This Talk
- Parallel programming API for data-intensive computing
  - An alternate API and system for Google's Map-Reduce
  - Show actual comparison
- Data-intensive computing on accelerators
  - Compilation for GPUs
- Overview of other topics
  - Scientific data management
  - Adaptive and streaming middleware
  - Deep web
Background – GPU Computing
- Many-core architectures/accelerators are becoming more popular
- GPUs are inexpensive and fast
- CUDA is a high-level language for GPU programming
CUDA Programming
- Significant improvement over use of graphics libraries, but:
  - Need detailed knowledge of the GPU architecture and a new language
  - Must specify the grid configuration
  - Deal with memory allocation and movement
  - Explicit management of the memory hierarchy
Parallel Data Mining
Common structure of data mining applications (FREERIDE):

/* outer sequential loop */
while() {
  /* Reduction loop */
  Foreach (element e) {
    (i, val) = process(e);
    Reduc(i) = Reduc(i) op val;
  }
}
Porting on GPUs
- High-level parallelization is straightforward
- Details of data movement
- Impact of thread count on reduction time
- Use of shared memory
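The core idea of the port, replicating the reduction object per thread and then combining, can be sketched in plain C++ with `std::thread` standing in for GPU threads. This is an illustrative analogue, not the system's generated CUDA code; `parallel_histogram` is a hypothetical name:

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Each thread updates its own private copy of the reduction object; the
// copies are merged in a final global-combination step. This mirrors the
// per-thread replication the code generator performs in device memory.
std::vector<long> parallel_histogram(const std::vector<int>& data,
                                     int bins, int nthreads) {
    std::vector<std::vector<long>> copies(nthreads,
                                          std::vector<long>(bins, 0));
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&, t] {
            // each thread walks a strided slice of the input
            for (size_t j = t; j < data.size(); j += nthreads)
                copies[t][data[j] % bins] += 1;   // update private copy only
        });
    for (auto& w : workers) w.join();
    std::vector<long> global(bins, 0);            // global combination
    for (const auto& c : copies)
        for (int b = 0; b < bins; ++b) global[b] += c[b];
    return global;
}
```

Because each thread touches only its own copy, no locking is needed inside the reduction loop; the cost moves to the replicated memory and the combination pass, which is why thread count affects reduction time.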
Architecture of the System
[Diagram: user input (variable information, reduction functions, optional functions) feeds a Code Analyzer (in LLVM) consisting of a Variable Analyzer and a Code Generator, which derive variable access patterns and combination operations; the output is a host program (grid configuration and kernel invocation) plus kernel functions, compiled into an executable]
User Input
- A sequential reduction function
- Optional functions (initialization function, combination function, …)
- Values of each variable, or size of each array
- Variables to be used in the reduction function
Analysis of Sequential Code
- Get the access features of each variable
- Determine the data to be replicated
- Get the operator for global combination
- Identify variables for shared memory
Memory Allocation and Copy
- Each thread needs its own copy of the reduction data
- Copy the updates back to host memory after the kernel reduction function returns
[Figure: reduction arrays A, B, C replicated across threads T0–T63]
Generating CUDA Code and C/C++ Code
- Invoking the kernel function
  - Memory allocation and copy
  - Thread grid configuration (block number and thread number)
- Global function
  - Kernel reduction function
  - Global combination
Optimizations
- Using shared memory
- Providing user-specified initialization functions and combination functions
- Specifying variables that are allocated once
Applications
- K-means clustering
- EM clustering
- PCA
Experimental Results
[Chart: speedup of k-means]
[Chart: speedup of k-means on GeForce 9800X2]
[Chart: speedup of EM]
[Chart: speedup of PCA]
Deep Web Data Integration
- The emergence of the deep web
  - The deep web is huge
  - Different from the surface web
- Challenges for integration
  - Not accessible through search engines
  - Inter-dependences among deep web sources
Our Contributions
- Structured query processing on the deep web
- Schema mining/matching
- Analysis of deep web data sources
Adaptive Middleware (ICAC 2008)
- Enable time-critical event handling to achieve the maximum benefit, while satisfying the time/budget constraint
- Be compatible with grid and web services
- Enable easy deployment and management with minimum human intervention
- Be usable in a heterogeneous distributed environment
HASTE Middleware Design
Workflow Composition System
Summary
- Growing data is creating new challenges in HPC, grid, and cloud environments
- A number of topics are being addressed
- Many opportunities for involvement
  - 888 meets Thursdays 5:00 – 6:00, DL 280