Data-Intensive Computing: From Clouds to GPU Clusters


1 Data-Intensive Computing: From Clouds to GPU Clusters
Gagan Agrawal

2 Motivation
Parallel data mining is a special case of data-intensive computing.
Cloud environments have emerged: elastic, pay-as-you-go, with reliable long-term storage.
High-performance systems are changing: accelerator-based CPU-GPU clusters are among the fastest systems today.

3 Background
Previously developed MATE (a Map-Reduce system with an AlternaTE API) for multi-core environments.
Phoenix implemented Map-Reduce for shared-memory systems.
MATE adopted Generalized Reduction, first proposed in FREERIDE, which was developed at Ohio State.
Compared MATE and Phoenix for data mining applications: comparing performance and API, and understanding performance overheads.
MATE provided an alternative API that outperformed Map-Reduce for some data-intensive applications.

4 Map-Reduce Execution

5 Comparing Processing Structures
The Reduction Object represents the intermediate state of the execution.
The reduce function is commutative and associative.
Sorting and grouping overheads are eliminated by the reduction function/object.

6 Observations on Processing Structures
Map-Reduce is based on a functional idea and does not maintain state. This can lead to overheads for managing intermediate results between map and reduce, and map can generate intermediate results of very large size.
The reduction-based approach is based on a programmer-managed reduction object. It is not as 'clean', but it avoids sorting of intermediate results, helps shared-memory parallelization, and enables better fault recovery (a minimal sketch of this style follows).
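To make the contrast concrete, here is a minimal sketch of the generalized-reduction style for a k-means-like computation; the names and layout are illustrative assumptions, not the actual MATE/FREERIDE API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reduction object for k-means: running sums and counts
// per cluster. The object IS the intermediate state, so nothing is
// emitted, sorted, or grouped between a "map" and a "reduce" phase.
struct KMeansReduction {
    std::vector<double> sums;    // coordinate sums per cluster
    std::vector<size_t> counts;  // points assigned per cluster

    // Local reduction: fold one point directly into the object.
    void accumulate(double point, size_t cluster) {
        sums[cluster] += point;
        counts[cluster] += 1;
    }

    // Global reduction: merge another thread's or node's object.
    // Valid because the operation is commutative and associative.
    void merge(const KMeansReduction& other) {
        for (size_t c = 0; c < sums.size(); ++c) {
            sums[c]   += other.sums[c];
            counts[c] += other.counts[c];
        }
    }
};
```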

7 Outline
MATE-EC2: MATE supported in the Amazon EC2 environment; processes data resident in S3; can use heterogeneous environments.
MATE-CG: targets GPU clusters; uses both a multi-core CPU and a GPU for the same computation.

8 MATE-EC2: Motivation
MATE: Map-Reduce with an AlternaTE API.
MATE-EC2: an implementation of MATE for AWS environments.
Cloud resources are black boxes; there is a need for services and tools that can get the most out of cloud resources and help their users with easy APIs.
Virtualization hides the scheduling operations and the underlying architecture of the cloud environment from users, so the cloud can be seen as a functional black box. This property is desirable for the end user, but developers still need the details of the cloud environment to provide efficient services and tools. Understanding the characteristics of the cloud environment is therefore only the first step; those characteristics must then be properly exploited by the tools and services that run on the cloud, so that cloud users get the most out of the available resources.

9 MATE-EC2 Design
Data organization: three levels (buckets, chunks, and units) plus metadata information.
Chunk retrieval: threaded data retrieval and selective job assignment.
Load balancing and handling heterogeneity: pooling mechanism.
Data organization:
- Buckets: the physical presentation of the data on S3 (data is physically stored in data objects and presented as buckets).
- Chunks: logical data blocks inside buckets (improves memory utilization).
- Data units: the minimum units of data processed by the application (improves cache utilization).
- Metadata information: an index file holding the full path of the data object, the offset of the chunk, the size of the chunk, and the total number of units inside the chunk (a sketch of one entry follows the list).
Chunk retrieval:
- Threaded data retrieval: each chunk is requested by a number of threads, so the bandwidth usage of the processing node is maximized.
- Selective job assignment: a chunk is selected from the data object with the fewest active connections, so each bucket's upload bandwidth is exploited.
Load balancing and handling heterogeneity:
- Pooling mechanism: whenever a processing node finishes a job, it requests another from the master node, which assigns one from the job pool using selective job assignment; job selection is thus neither sequential nor random.
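As an illustration of the metadata layout described above, one index entry might look like the following; the struct name and field types are assumptions, not the actual MATE-EC2 index format.

```cpp
#include <cstdint>
#include <string>

// Hypothetical index-file entry describing one chunk of an S3 data
// object, mirroring the fields named above: full path, chunk offset,
// chunk size, and the number of data units inside the chunk.
struct ChunkMetadata {
    std::string object_path;  // full path of the S3 data object (bucket/key)
    uint64_t    offset;       // byte offset of this chunk within the object
    uint64_t    size;         // chunk size in bytes
    uint32_t    num_units;    // number of data units in the chunk
};
```

With such entries, threaded retrieval can split a chunk's byte range (offset, size) across several parallel range requests, which is how the bandwidth of a processing node would be saturated.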

10 MATE-EC2 Processing Flow
[Figure: processing flow. Retrieval threads (T1, T2, T3) fetch chunks (C5 ... Cn) of an S3 data object into the computing layer on an EC2 slave node; the job scheduler and metadata file reside on the EC2 master node, and each slave requests/retrieves another job from it.]

11 Experiments
Goals: finding the most suitable settings for AWS; performance of MATE-EC2 in heterogeneous and homogeneous environments; performance comparison of MATE-EC2 and Map-Reduce.
Applications: KMeans and PCA.
Resources: 4 Large EC2 instances for processing and 1 Large instance for the master; 16 data objects on S3 (8.2 GB total data set for both applications).

12 Different Data Chunk Sizes
KMeans, 16 retrieval threads.
Performance increase, 8 MB vs. other chunk sizes: 1.13 to 1.30.
1-thread vs. 16-thread versions: 1.24 to 1.81.

13 Different Numbers of Threads
128 MB chunk size.
Performance increase in the figure (KMeans): 1.37 to 1.90.
Performance increase for PCA: 1.38 to 1.71.

14 Heterogeneous Environments
L: Large instances; S: Small instances. 128 MB chunk size.
Overheads in the figure (KMeans): under 1%.
Overheads for PCA: 1.1% to 11.7%.
The comparison accounts for the available bandwidth of the instances as well as their throughput. PCA shows more overhead because of its number of data retrievals (the data is retrieved twice) and its additional synchronization points.

15 MATE-EC2 vs. Map-Reduce
Scalability (MATE): 90% efficiency.
Scalability (MR): 74% efficiency.
Speedups of MATE vs. MR: 3.54 to 4.58.

16 Outline
MATE-EC2: MATE supported in the Amazon EC2 environment; processes data resident in S3; can use heterogeneous environments.
MATE-CG: targets GPU clusters; uses both a multi-core CPU and a GPU for the same computation.

17 MATE-CG: System Design and Implementation
Execution overview of the MATE-CG system.
API support for heterogeneous computing: data types Input_Space and Reduction_Object; functions CPU_Reduction and GPU_Reduction.
Runtime: partitioning the disk-resident dataset among nodes; managing the large reduction object on disk; managing large intermediate data; using GPUs to accelerate computation.
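A hedged sketch of what the user-facing API could look like, built only from the type and function names listed above; the actual MATE-CG declarations may differ.

```cpp
// Hypothetical declarations assembled from the names on this slide.
struct Input_Space;       // a split of the (possibly disk-resident) input
struct Reduction_Object;  // the programmer-managed intermediate state

// User-supplied reduction over the CPU's share of a data block,
// executed by the multi-core CPU threads.
void CPU_Reduction(const Input_Space* input, Reduction_Object* robj);

// User-supplied reduction over the GPU's share of the same block.
void GPU_Reduction(const Input_Space* input, Reduction_Object* robj);
```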

18 MATE-CG Overview
Execution work-flow.

19 System API
Data types and functions.

20 Implementation Considerations (I)
A multi-level data partitioning scheme:
First, the partitioning function partitions inputs into blocks and distributes them to different nodes; data locality should be considered.
Second, heterogeneous data mapping cuts each block into two parts, one for the CPU and the other for the GPU. How do we identify the best data mapping?
Third, the splitting function splits each part of a data block into smaller chunks. Observation: a smaller chunk size suits the CPU and a larger chunk size suits the GPU (see the sketch below).
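A small sketch of the mapping and splitting steps under these observations; the fraction p comes from the auto-tuning framework described later, and the chunk sizes here are only illustrative defaults, not the tuned values.

```cpp
#include <cstdint>

// Hypothetical mapping of one data block: the first p fraction goes
// to the CPU, the remainder to the GPU, and each part is then split
// into chunks of different granularity (smaller for the CPU, larger
// for the GPU, per the observation above).
struct BlockMapping {
    uint64_t cpu_bytes, gpu_bytes;
    uint64_t cpu_chunk_size, gpu_chunk_size;
};

BlockMapping map_block(uint64_t block_size, double p) {
    BlockMapping m;
    m.cpu_bytes = static_cast<uint64_t>(p * block_size);
    m.gpu_bytes = block_size - m.cpu_bytes;
    m.cpu_chunk_size = 16u  << 20;  // e.g., 16 MB chunks for the CPU
    m.gpu_chunk_size = 512u << 20;  // e.g., 512 MB chunks for the GPU
    return m;
}
```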

21 Implementation Considerations (II)
Management of the large reduction object and large intermediate data:
Reducing disk I/O for large reduction objects: data access patterns are used to reuse splits of the reduction object as much as possible; this is transparent to user code.
Reducing network costs for large intermediate data: a generic solution that invokes an all-to-all broadcast among all nodes would cause severe performance losses, so application-driven optimizations are used to improve performance.

22 Auto-Tuning Framework
Auto-tuning problem: given an application, find the optimal parameter setting for distributing data between the CPU and the GPU, which have different processing capabilities. For example: 20/80? 50/50? 70/30?
Our approach exploits the iterative nature of many data-intensive applications, which perform similar computations over a number of iterations:
Construct an analytical model to predict performance.
The optimal value is learnt over the first few iterations.
No compile-time search or tuning is needed, and runtime overheads are low when the number of iterations is large.

23 The Analytical Model (I)
We focus on the two main components of the overall running time on each node: the data processing time on the CPU and/or the GPU, and the overheads on the CPU.
First, consider the CPU only; second, consider the GPU only; third, let Tcg represent the heterogeneous execution time using both the CPU and the GPU (the corresponding equations are sketched after the next slide).

24 The Analytical Model (II)
Let p represent the fraction of data assigned to the CPU; expressing the CPU and GPU times in terms of p relates Tcg to p, as illustrated on the next slide.
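The equations on slides 23 and 24 were images in the original deck; the following is a plausible reconstruction consistent with the surrounding text, where $D$ is the per-node data size, $v_c$ and $v_g$ are the CPU and GPU processing rates, and $o_c$ is the CPU-side overhead. All symbols are assumptions, not the slides' exact notation.

```latex
% Hypothetical reconstruction of the missing model equations.
\begin{align*}
  T_c(p)    &= \frac{p\,D}{v_c} + o_c             \\ % CPU time on its fraction p
  T_g(1-p)  &= \frac{(1-p)\,D}{v_g}               \\ % GPU time on the remainder
  T_{cg}(p) &= \max\bigl(T_c(p),\, T_g(1-p)\bigr)    % both parts run concurrently
\end{align*}
```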

25 The Analytical Model (III)
[Figure: illustration of the relationship between Tcg and p.]

26 The Analytical Model (IV)
To minimize Tcg, we compute the optimal p with a simple heuristic:
First, set p to 1: use CPUs only.
Second, set p to 0: use GPUs only.
Obtain the necessary values for the other parameters in the expression and predict an initial p.
Adjust p in future iterations to account for variance in the measured values, making the CPU and the GPU finish simultaneously.
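Under the same assumed model, Tcg is minimized when both parts finish simultaneously, which gives a closed form for the optimal p; again a reconstruction, not the slide's exact expression.

```latex
% Setting T_c(p*) = T_g(1 - p*) and solving for p*:
\[
  \frac{p^{*}\,D}{v_c} + o_c \;=\; \frac{(1-p^{*})\,D}{v_g}
  \quad\Longrightarrow\quad
  p^{*} \;=\; \frac{v_c\,(D - o_c\,v_g)}{D\,(v_c + v_g)}
\]
% The run at p = 1 measures v_c and o_c; the run at p = 0 measures
% v_g; later iterations adjust p* for variance in the measurements.
```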

27 Applications: Three Representatives
Gridding kernel from scientific computing. Single pass: converts visibilities into a grid model of the sky.
The Expectation-Maximization (EM) algorithm from data mining. Iterative: estimates a vector of parameters in two consecutive steps, the Expectation step (E-step) and the Maximization step (M-step).
PageRank from graph mining. Iterative: calculates the relative importance of web pages; essentially a matrix-vector multiplication algorithm (see the sketch below).
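Since the slide characterizes PageRank as matrix-vector multiplication, here is a minimal single-node sketch of one iteration over a CSR matrix. Names and the CSR layout are illustrative; MATE-CG's distributed version partitions this work across nodes, CPU cores, and GPUs.

```cpp
#include <vector>

// One PageRank iteration as a sparse matrix-vector product:
// rank' = d * A * rank + (1 - d) / n, with A in CSR form, where
// weight[k] is typically 1/outdegree of the source page.
std::vector<double> pagerank_step(const std::vector<int>& row_ptr,
                                  const std::vector<int>& col_idx,
                                  const std::vector<double>& weight,
                                  const std::vector<double>& rank,
                                  double d = 0.85) {
    const int n = static_cast<int>(rank.size());
    std::vector<double> next(n, (1.0 - d) / n);  // teleport term
    for (int i = 0; i < n; ++i)
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            next[i] += d * weight[k] * rank[col_idx[k]];
    return next;
}
```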

28 Applications: Optimizations (I)
The Expectation-Maximization algorithm: the large intermediate matrix between the E-step and the M-step could cause heavy network communication costs if broadcast among all nodes. Optimization: on the same node, the M-step reads the same subset of the intermediate matrix produced by the E-step (use of a common partitioner).
PageRank: data-copying overheads are significant on GPUs, and smaller input-vector splits are shared by larger matrix blocks that need further splitting. Optimization: copy shared input-vector splits only once to save copying time (fine-grained copying, sketched below).
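A sketch of the fine-grained copying idea for PageRank, written as CUDA host code: each input-vector split is staged to the device at most once and then reused by every matrix chunk that multiplies it. The cache structure and all names here are hypothetical, not MATE-CG's implementation.

```cpp
#include <cstddef>
#include <cuda_runtime.h>
#include <unordered_map>

// Hypothetical cache: split id -> device-resident copy of that split.
static std::unordered_map<int, double*> g_split_cache;

// Returns a device pointer for the given input-vector split, copying
// it host-to-device only on first use; later matrix chunks that share
// the split reuse the resident copy, saving transfer time.
double* get_device_split(int split_id, const double* host, size_t n) {
    auto it = g_split_cache.find(split_id);
    if (it != g_split_cache.end())
        return it->second;  // already resident: no copy needed
    double* dev = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&dev), n * sizeof(double));
    cudaMemcpy(dev, host, n * sizeof(double), cudaMemcpyHostToDevice);
    g_split_cache[split_id] = dev;  // copy once, reuse afterwards
    return dev;
}
```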

29 Applications: Optimizations (II)
Outline of data copying and computation on GPUs.

30 Experiments Design (I)
Experiments platform: a heterogeneous CPU-GPU cluster.
Each node has one Intel 8-core CPU and an NVIDIA Tesla (Fermi) GPU with 448 cores.
Used up to 128 CPU cores and 7,168 GPU cores on 16 nodes.

31 Experiments Design (II)
Three representative applications: the gridding kernel, EM, and PageRank. Each application runs in four modes on the cluster:
CPU-1: 1 CPU core per node (baseline).
CPU-8: 8 CPU cores per node.
GPU-only: only the GPU on each node.
CPU-8-n-GPU: both 8 CPU cores and the GPU on each node.

32 Experiments Design (III)
We focused on four aspects: scalability; the performance improvement from heterogeneous computing; the effectiveness of the auto-tuning framework; and the performance impact of application-driven optimizations.

33 Results: Scalability with # of GPUs (I)
PageRank: 64 GB dataset; a graph of 1 billion nodes and 4 billion edges.
[Figure: speedups of 7.0, 6.8, 6.3, and 5.0; 16%.]

34 Results: Scalability with # of GPUs (II)
Gridding kernel: 32 GB dataset; a collection of 800 million visibilities and a 6.4 GB sky grid.
[Figure: speedups of 7.5, 7.2, 6.9, and 6.5; 25%.]

35 Results: Scalability with # of GPUs (III)
EM: 32 GB dataset; a cluster of 1 billion points.
[Figure: 7.6, 6.8, 5.0, 15.0, 3.0.]

36 Results: Auto-tuning (I)
PageRank: 64 GB dataset on 16 nodes.
[Figure: 7%; p = 0.30.]

37 Results: Auto-tuning (II)
EM: 32 GB dataset on 16 nodes.
[Figure: E-step 29%, M-step 24%; E-step p = 0.31, M-step p = 0.27.]

38 Results: Heterogeneous Execution
Gridding kernel: 32 GB dataset on 16 nodes.
[Figure: >=56%; >=42%.]

39 Results: App-Driven Optimizations (I)
EM: 4 GB dataset with a 20 GB intermediate matrix.
[Figure: 1.7; 7.7.]

40 Results: App-Driven Optimizations (II)
PageRank: 32 GB dataset with a block size of 512 MB and a GPU chunk size of 128 MB.
[Figure: 24%.]

41 Results: Examples of System Tuning
Gridding kernel: 32 GB dataset; varying cpu_chunk_size and gpu_chunk_size.
[Figure: 16 MB (cpu_chunk_size); 512 MB (gpu_chunk_size).]

42 Insights
GPUs can significantly accelerate certain classes of computations, but programming difficulties and data-copying overheads remain.
Data mapping between the CPU and the GPU is crucial.
Application-specific opportunities should be exploited, and automatic optimization would be desirable.

43 Summary
Emerging environments, including clouds and GPU clusters, are posing new challenges.
Middleware support can enable data-intensive computing in these environments.

