1 MATE-CG: A MapReduce-Like Framework for Accelerating Data-Intensive Computations on Heterogeneous Clusters
Wei Jiang and Gagan Agrawal

2 Outline
- Background
- System Design and Implementation
- Auto-Tuning Framework
- Applications
- Experiments
- Related Work
- Conclusion

3 Background (I)
- Map-Reduce
  - Simple API
  - Easy to write parallel programs: map and reduce
  - Fault-tolerant for large-scale data centers
- Performance?
  - Always a concern for the HPC community
- Generalized Reduction as a variant
  - Shares a similar processing structure
  - The key difference lies in a programmer-managed reduction object
  - Outperformed Map-Reduce for a sub-class of data-intensive applications

4 Background (II)
- Parallel Computing Environments
  - CPU clusters (multi-core)
    - Most widely used as traditional HPC platforms
    - Support MapReduce and many of its variants
  - GPU clusters (many-core)
    - Higher performance with better cost and energy efficiency
    - Low programming productivity
    - Limited MapReduce-like support: Mars for a single GPU, a recent IDAV effort for GPU clusters, ...
  - CPU-GPU clusters (heterogeneous systems)
    - No MapReduce-like support to date!

5 Background (III)
- Previously developed MATE (a Map-Reduce system with an AlternaTE API) for multi-core environments
  - Phoenix implemented Map-Reduce in shared-memory systems
  - MATE adopted Generalized Reduction, first proposed in FREERIDE, which was developed at Ohio State (2001-2003)
- Comparison between MATE and Phoenix for
  - Data mining applications
  - Comparing performance and API
  - Understanding performance overheads
- MATE provided an alternative API better than Map-Reduce for some data-intensive applications
  - Assumption: the reduction object must fit in memory

6 Background (IV)
- To address the limitation in MATE, we developed Ex-MATE to support the management of large reduction objects, which are required by graph mining applications
  - Developed support for managing a disk-resident reduction object and updating it efficiently in distributed environments
- Evaluated Ex-MATE against PEGASUS, a Hadoop-based graph mining system
  - Ex-MATE outperformed PEGASUS for three graph mining applications by factors ranging from 9 to 35

7 Map-Reduce Execution

8 Comparing Processing Structures
- The reduction object represents the intermediate state of the execution
- The reduce function is commutative and associative
- Sorting, grouping, and similar overheads are eliminated with the reduction function/object
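
To make the contrast concrete, here is a minimal C++ sketch of the two processing structures using a simple histogram as the computation; the function names and structure are illustrative only and are not the actual MATE or Hadoop APIs.

```cpp
#include <cstdio>
#include <map>
#include <vector>

// Map-Reduce style: "map" emits (key, 1) pairs, which are stored, grouped
// by key (the sort/shuffle step), and only then reduced by summation.
std::map<int, long> histogram_mapreduce(const std::vector<int>& input) {
    std::multimap<int, long> intermediate;            // intermediate pairs can grow very large
    for (int v : input) intermediate.insert({v, 1});  // map: emit (v, 1)
    std::map<int, long> out;                          // group + reduce: sum per key
    for (auto& kv : intermediate) out[kv.first] += kv.second;
    return out;
}

// Generalized Reduction style: a programmer-managed reduction object is
// updated in place; no intermediate pairs, no sorting or grouping.
std::map<int, long> histogram_reduction(const std::vector<int>& input) {
    std::map<int, long> robj;                         // reduction object
    for (int v : input) robj[v] += 1;                 // commutative/associative update
    return robj;                                      // partial objects from threads/nodes
}                                                     // are later combined the same way

int main() {
    std::vector<int> data = {1, 2, 2, 3, 3, 3};
    for (auto& kv : histogram_reduction(data))
        std::printf("%d -> %ld\n", kv.first, kv.second);
}
```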

9 Observations on Processing Structures
- Map-Reduce is based on a functional idea
  - Does not maintain state
  - This can lead to overheads for managing intermediate results between map and reduce
  - Map could generate intermediate results of very large size
- The reduction-based approach relies on a programmer-managed reduction object
  - Not as 'clean'
  - But avoids sorting of intermediate results
  - Can also help shared-memory parallelization
  - Helps better fault recovery

10 System Design and Implementation
- Execution overview of MATE-CG
- System API
  - Support for heterogeneous computing
  - Data types: Input_Space and Reduction_Object
  - Functions: CPU_Reduction and GPU_Reduction
- Runtime
  - Partitioning the disk-resident dataset among nodes
  - Managing a large reduction object on disk
  - Managing large intermediate data
  - Using GPUs to accelerate computation

11 MATE-CG Overview
- Execution work-flow

12 System API
- Data types and functions
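
The slide's API figure is not captured in this transcript. The following C++ sketch shows one way the listed data types and functions could fit together; apart from the names Input_Space, Reduction_Object, CPU_Reduction, and GPU_Reduction taken from the previous slide, all fields and signatures are assumptions.

```cpp
#include <cstddef>

struct Input_Space {          // a chunk of the disk-resident input assigned
    void*  data;              // to one processing unit (CPU cores or GPU)
    size_t length;            // length of the chunk in bytes
};

struct Reduction_Object {     // programmer-managed intermediate state; may be
    void*  state;             // disk-resident when it does not fit in memory
    size_t size;
};

// User-supplied reduction functions: the runtime applies CPU_Reduction to the
// CPU's share of each block and GPU_Reduction (typically a wrapper that
// launches a CUDA kernel) to the GPU's share, then combines the resulting
// partial reduction objects across threads and nodes.
typedef void (*CPU_Reduction)(const Input_Space* in, Reduction_Object* robj);
typedef void (*GPU_Reduction)(const Input_Space* in, Reduction_Object* robj);
```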

13 Implementation Considerations (I)
- A multi-level data partitioning scheme (see the sketch below)
  - First, partitioning function: partition the input into blocks and distribute them to different nodes
    - Data locality should be considered
  - Second, heterogeneous data mapping: cut each block into two parts, one for the CPU and the other for the GPU
    - How to identify the best data mapping?
  - Third, splitting function: split each part of a data block into smaller chunks
    - Observation: a smaller chunk size works better for the CPU, a larger chunk size for the GPU
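
A minimal C++ sketch of the second and third levels of this scheme, assuming each block is a flat byte range; the helper names, the single fraction p, and the chunking logic are illustrative assumptions, not the MATE-CG implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Chunk { size_t offset, length; };   // one unit of work for a device

// Level 3: split one device's share of a block into fixed-size chunks
// (smaller chunks for the CPU, larger ones for the GPU, per the slide).
std::vector<Chunk> split_into_chunks(size_t offset, size_t length, size_t chunk_size) {
    std::vector<Chunk> chunks;
    for (size_t o = 0; o < length; o += chunk_size)
        chunks.push_back({offset + o, std::min(chunk_size, length - o)});
    return chunks;
}

// Level 2: cut one node-local block into a CPU part and a GPU part using the
// fraction p chosen by the auto-tuning framework (p = share given to the CPU).
void map_block(size_t block_offset, size_t block_size, double p,
               size_t cpu_chunk, size_t gpu_chunk,
               std::vector<Chunk>& cpu_chunks, std::vector<Chunk>& gpu_chunks) {
    size_t cpu_bytes = static_cast<size_t>(p * block_size);
    cpu_chunks = split_into_chunks(block_offset, cpu_bytes, cpu_chunk);
    gpu_chunks = split_into_chunks(block_offset + cpu_bytes,
                                   block_size - cpu_bytes, gpu_chunk);
}
// Level 1 (not shown): the partitioning function distributes whole blocks of
// the disk-resident input across nodes, taking data locality into account.
```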

14 Implementation Considerations (II)
- Management of a large reduction object and large intermediate data:
  - Reduce disk I/O for large reduction objects:
    - Data access patterns are used to reuse splits of the reduction object as much as possible
    - Transparent to user code
  - Reduce network costs for large intermediate data:
    - A generic solution that invokes an all-to-all broadcast among all nodes would cause severe performance losses
    - Application-driven optimizations can be used to improve performance

15 Auto-Tuning Framework
- Auto-tuning problem: given an application, find the best setting of the parameter that splits data between the CPU and the GPU, which have different processing capabilities
  - For example: 20/80? 50/50? 70/30?
- Our approach: exploit the iterative nature of many data-intensive applications, which perform similar computations over a number of iterations
  - Construct an analytical model to predict performance
  - The optimal value is learned over the first few iterations
  - No compile-time search or tuning is needed
  - Low runtime overhead when the number of iterations is large

16 The Analytical Model (I)
- We focus on the two main components of the overall running time on each node: the data processing time on the CPU and/or the GPU, and the overheads on the CPU
- First, considering the CPU only, we have: (equation on the slide, not in the transcript)
- Second, on the GPU, we have: (equation on the slide, not in the transcript)
- Third, letting T_cg represent the heterogeneous execution time using both the CPU and the GPU, we have: (equation on the slide, not in the transcript)

17 The Analytical Model (II)
- Let p represent the fraction of data assigned to the CPU; then we have: (equations on the slide, not in the transcript)
- Overall, to relate T_cg to p, we have the illustration on the next slide

18 The Analytical Model (III)
- Illustration of the relationship between T_cg and p (figure)

19 The Analytical Model (IV)
- To minimize T_cg, we compute the optimal p: (expression on the slide, not in the transcript; see the hedged reconstruction below)
- To identify the best p, a simple heuristic is used:
  - First, set p to 1: use the CPUs only
  - Second, set p to 0: use the GPU only
  - Obtain the necessary values for the other parameters in the expression and predict an initial p
  - Adjust p in later iterations to account for variance in the measured values: make the CPU and the GPU finish simultaneously
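
The model's equations appear only as figures in the original slides. A hedged reconstruction consistent with the heuristic above (CPU and GPU should finish at the same time) is given below; the symbols T_c and T_g for the measured whole-input processing times are assumptions introduced here.

```latex
% Hedged reconstruction (the original equations were slide figures):
%   T_c = time to process the whole input with the CPU cores only (p = 1)
%   T_g = time to process the whole input with the GPU only (p = 0)
%   T_o = CPU-side overheads, p = fraction of data assigned to the CPU
%
% Assuming processing time is roughly proportional to the data assigned,
% and that T_cg is minimized when both devices finish simultaneously:
\[
  T_{cg}(p) \;=\; \max\bigl(p\,T_c,\;(1-p)\,T_g\bigr) + T_o,
  \qquad
  p\,T_c = (1-p)\,T_g
  \;\;\Longrightarrow\;\;
  p_{\mathrm{opt}} \;=\; \frac{T_g}{T_c + T_g}.
\]
```

Under this reconstruction, the first two iterations (p = 1 and p = 0) measure T_c and T_g, the initial p follows directly, and later iterations adjust p using the per-device times actually observed.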

20 Applications: three representatives
- Gridding kernel from scientific computing
  - Single pass: convert visibilities into a grid model of the sky
- The Expectation-Maximization (EM) algorithm from data mining
  - Iterative: estimates a vector of parameters
  - Two consecutive steps: the Expectation step (E-step) and the Maximization step (M-step)
- PageRank from graph mining
  - Iterative: calculates the relative importance of web pages
  - Essentially a matrix-vector multiplication algorithm (see the sketch below)
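
As a generic illustration (not MATE-CG code) of why PageRank is essentially a matrix-vector multiplication, one iteration can be written as a sparse matrix-vector product plus a teleportation term; the edge-list representation and the damping factor d below are standard choices assumed for this sketch.

```cpp
#include <cstddef>
#include <vector>

struct Edge { int src, dst; };   // one link: src -> dst

// One PageRank iteration: next = (1 - d)/n + d * M * rank, where M is the
// column-normalized link matrix represented implicitly by the edge list.
std::vector<double> pagerank_step(const std::vector<Edge>& edges,
                                  const std::vector<int>& out_degree,
                                  const std::vector<double>& rank,
                                  double d = 0.85) {
    const std::size_t n = rank.size();
    std::vector<double> next(n, (1.0 - d) / n);          // teleportation term
    for (const Edge& e : edges)                          // sparse matrix-vector product
        next[e.dst] += d * rank[e.src] / out_degree[e.src];
    return next;                                         // in a reduction-based system,
}                                                        // 'next' is the reduction object
```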

21 Applications: Optimizations (I)
- The Expectation-Maximization algorithm
  - Large intermediate matrix between the E-step and the M-step
  - Broadcasting such a large matrix among all nodes could incur high network communication costs
  - Optimization: on the same node, the M-step reads the same subset of the intermediate matrix as produced by the E-step (use of a common partitioner)
- PageRank
  - Data-copying overheads are significant on GPUs
  - Smaller input-vector splits are shared by larger matrix blocks that need further splitting
  - Optimization: copy shared input-vector splits only once to save copying time (fine-grained copying)

22 Applications: Optimizations (II)
- Outline of data copying and computation on GPUs (figure)

23 Example User Code
- Gridding kernel (the code appears as a figure on the slide; a hedged sketch follows)
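
The actual user code on this slide is not in the transcript. Below is a hedged C++ sketch of what a CPU-side gridding reduction function might look like under the API sketched earlier; all type names and the nearest-cell accumulation scheme are illustrative assumptions rather than the authors' code.

```cpp
#include <cmath>
#include <cstddef>

struct Visibility { float u, v, value; };           // one observed visibility sample
struct Grid { float* cells; int width, height; };   // the sky-grid reduction object

// CPU-side reduction function: accumulate each visibility into the nearest
// grid cell of the programmer-managed reduction object (a single pass).
void gridding_cpu_reduction(const Visibility* vis, std::size_t count, Grid* grid) {
    for (std::size_t i = 0; i < count; ++i) {
        int x = static_cast<int>(std::lround(vis[i].u));
        int y = static_cast<int>(std::lround(vis[i].v));
        if (x >= 0 && x < grid->width && y >= 0 && y < grid->height)
            grid->cells[y * grid->width + x] += vis[i].value;  // commutative update
    }
}
// A GPU_Reduction counterpart would launch a CUDA kernel performing the same
// accumulation with atomic adds on a device-resident copy of the grid.
```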

24 Experiments Design (I)
- Experimental platform
  - A heterogeneous CPU-GPU cluster
  - Each node has one Intel 8-core CPU and an NVIDIA Tesla (Fermi) GPU with 448 cores
  - Used up to 128 CPU cores and 7168 GPU cores on 16 nodes

25 Experiments Design (II)
- Three representative applications
  - Gridding kernel, EM, and PageRank
- Each application was run in four modes on the cluster:
  - CPU-1: 1 CPU core per node, as the baseline
  - CPU-8: 8 CPU cores per node
  - GPU-only: only the GPU on each node
  - CPU-8-n-GPU: both the 8 CPU cores and the GPU on each node

26 Experiments Design (III)
- We focused on three aspects:
  - For each application: scalability with an increasing number of GPUs and performance improvement over CPUs
  - Performance improvement from heterogeneous computing and effectiveness of the auto-tuning framework
  - Performance impact of application-driven optimizations and examples of system tuning

27 Results: Scalability with # of GPUs (I)
- PageRank: 64GB dataset; a graph of 1 billion nodes and 4 billion edges
- (Chart; annotations from the slide: 7.0, 6.8, 6.3, 5.0, 16%)

28 Results: Scalability with # of GPUs (II)
- Gridding kernel: 32GB dataset; a collection of 800 million visibilities and a 6.4GB sky grid
- (Chart; annotations from the slide: 7.5, 7.2, 6.9, 6.5, 25%)

29 Results: Scalability with # of GPUs (III)
- EM: 32GB dataset; a cluster of 1 billion points
- (Chart; annotations from the slide: 7.6, 6.8, 5.0, 15.0, 3.0)

30 Results: Auto-tuning (I)
- PageRank: 64GB dataset on 16 nodes
- (Chart; annotations from the slide: 7%, p=0.30)

31 Results: Auto-tuning (II)
- EM: 32GB dataset on 16 nodes
- (Chart; annotations from the slide: E-step: 29%, p=0.31; M-step: 24%, p=0.27)

32 Results: Heterogeneous Execution
- Gridding kernel: 32GB dataset on 16 nodes
- (Chart; annotations from the slide: >=56%, >=42%)

33 Results: App-Driven Optimizations (I)
- EM: 4GB dataset with a 20GB intermediate matrix
- (Chart; annotations from the slide: 1.7, 7.7)

34 Results: App-Driven Optimizations (II)
- PageRank: 32GB dataset with a block size of 512MB and a GPU chunk size of 128MB
- (Chart; annotation from the slide: 24%)

35 Results: Examples of System Tuning
- Gridding kernel: 32GB dataset; varying cpu_chunk_size and gpu_chunk_size
- (Chart; annotations from the slide: 16MB, 512MB)

36 Insights
- GPUs can significantly accelerate certain classes of computations, but they are harder to program and introduce data-copying overheads
- A suitable data mapping between the CPU and the GPU is crucial for overall performance in heterogeneous computing
- Application-specific opportunities should be exploited, through optional APIs, to make the best use of a parallel system
- Automatic optimization is desirable for choosing the right set of system parameter settings

37 Related Work
- Data-intensive computing with map-reduce-like models
  - Multi-core CPUs: Phoenix, Phoenix-rebirth, ...
  - A single GPU: Mars, MapCG, ...
  - GPU clusters: MITHRA, IDAV, ...
  - CPU-GPU clusters: our MATE-CG
- Programming heterogeneous systems
  - Software: Merge, EXOCHI, Harmony, Qilin, ...
  - Hardware: CUBA, CUDA, OpenCL, ...
- Auto-tuning: many prior efforts
  - Basic idea: search for the best solution among all possibilities
    - Map solutions to performance metrics
    - Map hardware/software characteristics to parameters
  - Very useful for library generators, compilers, runtime systems, ...

38 Conclusions
- MATE-CG supports a map-reduce-like API that eases the difficulty of programming a heterogeneous CPU-GPU cluster
- The system achieves good scalability with an increasing number of nodes, and heterogeneous execution further improves performance over CPU-only or GPU-only execution
- It introduces a novel and effective auto-tuning approach for choosing the best data mapping between the CPU and the GPU
- Application-specific optimizations should be considered in the user code, and a high-level API should be coupled with auto-tuning to identify the right system parameter settings automatically

39 Questions?

