MATE-CG: A MapReduce-Like Framework for Accelerating Data-Intensive Computations on Heterogeneous Clusters
Wei Jiang and Gagan Agrawal
Outline
Background
System Design and Implementation
Auto-Tuning Framework
Applications
Experiments
Related Work
Conclusion
Background (I)
Map-Reduce
- Simple API: easy to write parallel programs with map and reduce
- Fault-tolerant for large-scale data centers
- Performance? Always a concern for the HPC community
Generalized Reduction as a variant
- Shares a similar processing structure
- The key difference lies in a programmer-managed reduction object
- Outperformed Map-Reduce for a subclass of data-intensive applications
Background (II)
Parallel computing environments
CPU clusters (multi-core)
- Most widely used as traditional HPC platforms
- Support MapReduce and many of its variants
GPU clusters (many-core)
- Higher performance with better cost and energy efficiency
- Low programming productivity
- Limited MapReduce-like support: Mars for a single GPU, the recent IDAV work for GPU clusters, …
CPU-GPU clusters (heterogeneous systems)
- No MapReduce-like support to date!
Background (III)
Previously developed MATE (a Map-Reduce system with an AlternaTE API) for multi-core environments
- Phoenix implemented Map-Reduce for shared-memory systems
- MATE adopted Generalized Reduction, first proposed in FREERIDE, which was developed at Ohio State in 2001-2003
Comparison between MATE and Phoenix for data mining applications
- Comparing performance and API
- Understanding performance overheads
MATE provided an alternative API that is better than "Map-Reduce" for some data-intensive applications
- Assumption: the reduction object must fit in memory
Background (IV)
To address this limitation in MATE, we developed Ex-MATE to support the management of large reduction objects, which are required by graph mining applications
- Developed support for managing disk-resident reduction objects and updating them efficiently in distributed environments
- Evaluated Ex-MATE against PEGASUS, a Hadoop-based graph mining system
- Ex-MATE outperformed PEGASUS for three graph mining applications by factors ranging from 9 to 35
Map-Reduce Execution
Comparing Processing Structures
- The reduction object represents the intermediate state of the execution
- The reduce function is commutative and associative
- Sorting, grouping, and similar overheads are eliminated by the reduction function/object
Observations on Processing Structures
Map-Reduce is based on a functional idea
- Does not maintain state
- This can lead to overheads for managing intermediate results between map and reduce
- Map could generate intermediate results of very large size
The reduction-based approach relies on a programmer-managed reduction object
- Not as 'clean'
- But avoids sorting of intermediate results
- Can also help shared-memory parallelization
- Enables better fault recovery
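To make the contrast concrete, here is a minimal C++ sketch of the reduction-based processing structure described above: each worker folds its input split directly into a reduction object, and the per-worker objects are then combined. The names (ReductionObject, local_reduce, global_combine) are illustrative assumptions, not the MATE-CG API.

```cpp
// Minimal sketch of generalized reduction (illustrative, not the MATE-CG API).
// Each thread folds its input split directly into a local reduction object;
// local objects are then combined with a commutative/associative operation.
#include <map>
#include <string>
#include <thread>
#include <vector>

using ReductionObject = std::map<std::string, long>;   // e.g. word -> count

// Fold one input element into the reduction object (no intermediate key-value pairs).
void local_reduce(ReductionObject& obj, const std::string& word) {
    obj[word] += 1;
}

// Merge two reduction objects; the operation must be commutative and associative.
void global_combine(ReductionObject& into, const ReductionObject& from) {
    for (const auto& [k, v] : from) into[k] += v;
}

ReductionObject run(const std::vector<std::vector<std::string>>& splits) {
    std::vector<ReductionObject> locals(splits.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < splits.size(); ++i)
        workers.emplace_back([&, i] {
            for (const auto& w : splits[i]) local_reduce(locals[i], w);
        });
    for (auto& t : workers) t.join();

    ReductionObject result;                             // final combine step
    for (const auto& lo : locals) global_combine(result, lo);
    return result;
}
```

Because the intermediate state lives in the reduction object, there is no shuffle, sort, or grouping phase between the "map" and "reduce" work.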
System Design and Implementation
Execution overview of the MATE-CG system
System API
- Support for heterogeneous computing
- Data types: Input_Space and Reduction_Object
- Functions: CPU_Reduction and GPU_Reduction
Runtime
- Partitioning the disk-resident dataset among nodes
- Managing large reduction objects on disk
- Managing large intermediate data
- Using GPUs to accelerate computation
MATE-CG Overview
Execution workflow
System API
Data types and functions
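The original slide presented the API as a figure. Below is a hedged C++ sketch of what the declarations might look like, using only the names mentioned earlier (Input_Space, Reduction_Object, CPU_Reduction, GPU_Reduction); the exact signatures are assumptions, not the published interface.

```cpp
// Hedged sketch of a MATE-CG-style API (signatures are assumptions; only the
// type and function names Input_Space, Reduction_Object, CPU_Reduction and
// GPU_Reduction come from the slides).
#include <cstddef>

struct Input_Space {            // one chunk/split of the disk-resident input
    const void* data;
    std::size_t length;         // size of this split in bytes
};

struct Reduction_Object {       // programmer-managed intermediate state
    void*       buffer;
    std::size_t size;
};

// User-supplied reduction over one input split, executed on CPU cores.
using CPU_Reduction = void (*)(const Input_Space& split, Reduction_Object& obj);

// User-supplied reduction over one input chunk, executed on the GPU.
// A real system would likely also pass device pointers and launch configuration.
using GPU_Reduction = void (*)(const Input_Space& chunk, Reduction_Object& obj);

// The runtime would register both variants and choose which one to invoke,
// depending on how each data block is split between the CPU and the GPU.
void register_reductions(CPU_Reduction cpu_fn, GPU_Reduction gpu_fn);
```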
Implementation Considerations (I)
A multi-level data partitioning scheme
- First, partitioning function: partition the input into blocks and distribute them to different nodes; data locality should be considered
- Second, heterogeneous data mapping: cut each block into two parts, one for the CPU and one for the GPU; how do we identify the best data mapping?
- Third, splitting function: split each part of a data block into smaller chunks; observation: a smaller chunk size works better for the CPU and a larger chunk size for the GPU
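A minimal C++ sketch of the second and third levels just described, assuming a tunable CPU fraction p and separate CPU/GPU chunk sizes; all names here are illustrative, not MATE-CG code.

```cpp
// Illustrative sketch of splitting one node-local block between CPU and GPU
// and cutting each part into device-sized chunks (not MATE-CG code).
#include <algorithm>
#include <cstddef>
#include <vector>

struct Chunk { std::size_t offset, length; bool on_gpu; };

std::vector<Chunk> split_block(std::size_t block_bytes, double p,
                               std::size_t cpu_chunk, std::size_t gpu_chunk) {
    std::vector<Chunk> chunks;
    std::size_t cpu_bytes = static_cast<std::size_t>(block_bytes * p);
    std::size_t gpu_bytes = block_bytes - cpu_bytes;

    auto cut = [&](std::size_t start, std::size_t total, std::size_t sz, bool gpu) {
        for (std::size_t off = 0; off < total; off += sz)
            chunks.push_back({start + off, std::min(sz, total - off), gpu});
    };
    cut(0, cpu_bytes, cpu_chunk, /*gpu=*/false);          // smaller chunks for the CPU
    cut(cpu_bytes, gpu_bytes, gpu_chunk, /*gpu=*/true);   // larger chunks for the GPU
    return chunks;
}
```

The fraction p is exactly the parameter the auto-tuning framework later learns at runtime.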
Implementation Considerations (II)
Management of large reduction objects and intermediate data:
- Reducing disk I/O for large reduction objects: data access patterns are used to reuse splits of the reduction object as much as possible; transparent to user code
- Reducing network costs for large intermediate data: a generic solution that invokes an all-to-all broadcast among all nodes would cause severe performance losses; application-driven optimizations can be used instead to improve performance
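The slides do not show how split reuse is implemented; below is a generic, hedged C++ sketch of one way to keep recently used reduction-object splits in memory and write them back only on eviction, purely to illustrate the idea of reducing disk I/O.

```cpp
// Generic sketch of reusing in-memory reduction-object splits
// (illustration only; this is not the MATE-CG implementation).
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

class SplitCache {
public:
    explicit SplitCache(std::size_t max_splits) : capacity_(max_splits) {}

    // Return the in-memory copy of a split, loading it from disk on a miss
    // and evicting the least recently used split when over capacity.
    std::vector<char>& get(int split_id) {
        auto it = index_.find(split_id);
        if (it != index_.end()) {                        // reuse: already resident
            lru_.splice(lru_.begin(), lru_, it->second);
            return it->second->second;
        }
        if (lru_.size() == capacity_) {                  // evict and write back
            write_back(lru_.back().first, lru_.back().second);
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
        lru_.emplace_front(split_id, load_from_disk(split_id));
        index_[split_id] = lru_.begin();
        return lru_.front().second;
    }

private:
    // Stubs standing in for real disk I/O in this sketch.
    std::vector<char> load_from_disk(int /*split_id*/) {
        return std::vector<char>(split_bytes_, 0);
    }
    void write_back(int /*split_id*/, const std::vector<char>& /*data*/) {}

    std::size_t capacity_;
    std::size_t split_bytes_ = 4096;
    std::list<std::pair<int, std::vector<char>>> lru_;
    std::unordered_map<int, decltype(lru_)::iterator> index_;
};
```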
Auto-Tuning Framework
Auto-tuning problem: given an application, find the optimal parameter setting for distributing data between the CPU and the GPU, which have different processing capabilities. For example: 20/80? 50/50? 70/30?
Our approach: exploit the iterative nature of many data-intensive applications, which perform similar computations over a number of iterations
- Construct an analytical model to predict performance
- The optimal value is learned over the first few iterations
- No compile-time search or tuning is needed
- Low runtime overheads when there are a large number of iterations
The Analytical Model (I)
We focus on the two main components of the overall running time on each node: the data processing time on the CPU and/or the GPU, and the overheads on the CPU
- First, consider using the CPU only and write down its processing time
- Second, do the same for the GPU
- Third, let T_cg represent the heterogeneous execution time when both the CPU and the GPU are used
The Analytical Model (II)
Let p represent the fraction of data assigned to the CPU (the GPU processes the remaining 1 - p). Relating T_cg to p gives the relationship illustrated on the next slide.
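The equations on these slides were rendered as images and are not in the transcript. The following is a plausible reconstruction of the model under the stated assumptions (processing time proportional to the data fraction, plus a CPU-side overhead term); it is a sketch, not the authors' exact formulation.

```latex
% Hedged reconstruction of the per-node cost model (notation is assumed):
% t_c, t_g are the times to process the whole block on the CPU alone and on
% the GPU alone; T_o collects CPU-side overheads (I/O, copying, bookkeeping).
\begin{align*}
  T_c(p)    &\approx p\,t_c,
      && \text{CPU processing time for a fraction } p, \\
  T_g(1-p)  &\approx (1-p)\,t_g,
      && \text{GPU processing time for the remainder}, \\
  T_{cg}(p) &\approx \max\bigl(T_c(p),\, T_g(1-p)\bigr) + T_o,
      && \text{overall heterogeneous execution time.}
\end{align*}
```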
The Analytical Model (III)
Illustration of the relationship between T_cg and p.
The Analytical Model (IV)
T_cg is minimized by computing the optimal p. To identify the best p, a simple heuristic is used:
- First, set p to 1: use the CPUs only
- Second, set p to 0: use the GPUs only
- Obtain the values needed for the other parameters in the expression and predict an initial p
- Adjust p in future iterations to account for variance in the measured values: make the CPU and the GPU finish simultaneously
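Under the reconstructed model above, T_cg is minimized when the CPU and GPU parts finish at the same time, i.e. p* t_c = (1 - p*) t_g, which gives p* = t_g / (t_c + t_g). Here is a minimal C++ sketch of the heuristic just listed, assuming processing time scales linearly with the data fraction; the function names are illustrative.

```cpp
// Illustrative auto-tuning heuristic (not the MATE-CG implementation).
// Assumes processing time scales linearly with the fraction of data assigned.

// Initial prediction from the first two iterations:
//   iteration 1 runs with p = 1 (CPU only) -> measures t_c
//   iteration 2 runs with p = 0 (GPU only) -> measures t_g
// Both devices finish together when p * t_c == (1 - p) * t_g.
double predict_initial_p(double t_c, double t_g) {
    return t_g / (t_c + t_g);
}

// Later iterations: re-derive per-unit costs from what each device actually
// took for its share, then re-balance so both finish simultaneously.
double adjust_p(double p, double measured_cpu_time, double measured_gpu_time) {
    double per_unit_cpu = measured_cpu_time / p;
    double per_unit_gpu = measured_gpu_time / (1.0 - p);
    return per_unit_gpu / (per_unit_cpu + per_unit_gpu);
}
```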
Applications: three representatives
Gridding kernel, from scientific computing
- Single pass: converts visibilities into a grid model of the sky
The Expectation-Maximization (EM) algorithm, from data mining
- Iterative: estimates a vector of parameters
- Two consecutive steps: the Expectation step (E-step) and the Maximization step (M-step)
PageRank, from graph mining
- Iterative: calculates the relative importance of web pages
- Essentially a matrix-vector multiplication algorithm (a sketch follows)
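To show the matrix-vector view of PageRank mentioned above, here is a minimal C++ sketch of one iteration over a sparse link matrix; it is a generic formulation, not MATE-CG's partitioned, disk-resident version.

```cpp
// One PageRank iteration as a sparse matrix-vector multiplication
// (generic illustration; MATE-CG processes the matrix in partitioned,
// disk-resident blocks and accumulates into a reduction object).
#include <vector>

struct Edge { int src, dst; };   // link from page src to page dst

std::vector<double> pagerank_step(const std::vector<Edge>& edges,
                                  const std::vector<int>& out_degree,
                                  const std::vector<double>& rank,
                                  double damping = 0.85) {
    const auto n = rank.size();
    std::vector<double> next(n, (1.0 - damping) / n);   // teleport term
    for (const Edge& e : edges)                         // next += d * M * rank
        next[e.dst] += damping * rank[e.src] / out_degree[e.src];
    return next;
}
```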
Applications: Optimizations (I)
The Expectation-Maximization algorithm
- Large intermediate matrix between the E-step and the M-step
- Broadcasting such a large matrix among all nodes could incur significant network communication costs
- Optimization: on each node, the M-step reads the same subset of the intermediate matrix that the E-step produced there (use of a common partitioner)
PageRank
- Data-copying overheads are significant on GPUs
- Smaller input vector splits are shared by larger matrix blocks that need further splitting
- Optimization: copy shared input vector splits to the GPU only once to save copying time (fine-grained copying; see the sketch after the next slide)
Applications: Optimizations (II)
Outline of data copying and computation on GPUs
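The original slide showed this outline as a figure. Below is a hedged C++ sketch of the fine-grained copying idea from the previous slide: a shared vector split is copied to the GPU only the first time any matrix chunk needs it. The copy_split_to_gpu and process_chunk_on_gpu helpers are hypothetical stand-ins for real device transfers and kernels.

```cpp
// Illustrative sketch of fine-grained copying (not MATE-CG code):
// shared input-vector splits are copied to the GPU at most once.
#include <unordered_map>
#include <vector>

struct MatrixChunk { int id; int vector_split_id; };     // chunk + the split it needs

// Hypothetical stubs standing in for real device transfers and kernel launches.
void* copy_split_to_gpu(int /*split_id*/) { return nullptr; }
void  process_chunk_on_gpu(const MatrixChunk& /*chunk*/, void* /*device_split*/) {}

void process_block(const std::vector<MatrixChunk>& chunks) {
    std::unordered_map<int, void*> resident;             // split id -> device buffer
    for (const MatrixChunk& c : chunks) {
        auto it = resident.find(c.vector_split_id);
        if (it == resident.end())                         // first use: copy once
            it = resident.emplace(c.vector_split_id,
                                  copy_split_to_gpu(c.vector_split_id)).first;
        process_chunk_on_gpu(c, it->second);              // later uses: reuse the copy
    }
}
```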
Example User Code: Gridding Kernel
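The code shown on this slide is not preserved in the transcript. The following is a hedged sketch of what user code for the gridding kernel might look like with a reduction-object style API: the sky grid is the reduction object and each visibility is accumulated into it. All types and signatures here are assumptions, not the actual MATE-CG user code.

```cpp
// Hedged sketch of gridding-kernel user code (illustration only).
#include <complex>
#include <cstddef>
#include <vector>

// Assumes u, v are already scaled to nonnegative grid coordinates.
struct Visibility { double u, v; std::complex<double> value; };

// The sky grid acts as the reduction object: visibilities are accumulated
// into grid cells, and per-worker grids can later be summed.
struct Grid {
    std::size_t width = 0, height = 0;
    std::vector<std::complex<double>> cells;             // width * height entries
};

// CPU-side reduction over one split of visibilities (nearest-cell gridding;
// a real kernel would spread each visibility with a convolution function).
void cpu_reduction(const std::vector<Visibility>& split, Grid& grid) {
    for (const Visibility& vis : split) {
        std::size_t x = static_cast<std::size_t>(vis.u) % grid.width;
        std::size_t y = static_cast<std::size_t>(vis.v) % grid.height;
        grid.cells[y * grid.width + x] += vis.value;      // accumulate into the grid
    }
}

// Combining two grids is a commutative, associative element-wise sum.
void combine(Grid& into, const Grid& from) {
    for (std::size_t i = 0; i < into.cells.size(); ++i)
        into.cells[i] += from.cells[i];
}
```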
Experiments Design (I)
Experimental platform
- A heterogeneous CPU-GPU cluster
- Each node has one Intel 8-core CPU and an NVIDIA Tesla (Fermi) GPU with 448 cores
- Used up to 128 CPU cores and 7,168 GPU cores on 16 nodes
Experiments Design (II)
Three representative applications: the gridding kernel, EM, and PageRank
Each application is run in four modes on the cluster:
- CPU-1: 1 CPU core per node (baseline)
- CPU-8: 8 CPU cores per node
- GPU-only: only the GPU on each node
- CPU-8-n-GPU: both the 8 CPU cores and the GPU on each node
Experiments Design (III)
We focused on three aspects:
- For each application: scalability as the number of GPUs increases, and performance improvement over CPUs
- Performance improvement from heterogeneous execution and effectiveness of the auto-tuning framework
- Performance impact of application-driven optimizations, plus examples of system tuning
Results: Scalability with # of GPUs (I)
PageRank: 64 GB dataset; a graph of 1 billion nodes and 4 billion edges
(chart annotations: 7.0, 6.8, 6.3, 5.0; 16%)
Results: Scalability with # of GPUs (II)
Gridding kernel: 32 GB dataset; a collection of 800 million visibilities and a 6.4 GB sky grid
(chart annotations: 7.5, 7.2, 6.9, 6.5; 25%)
Results: Scalability with # of GPUs (III)
EM: 32 GB dataset; 1 billion points
(chart annotations: 7.6, 6.8, 5.0, 15.0, 3.0)
Results: Auto-tuning (I)
PageRank: 64 GB dataset on 16 nodes
(chart annotations: 7%; p = 0.30)
Results: Auto-tuning (II)
EM: 32 GB dataset on 16 nodes
(chart annotations: E: 29%, p = 0.31; M: 24%, p = 0.27)
Results: Heterogeneous Execution
Gridding kernel: 32 GB dataset on 16 nodes
(chart annotations: >=56%; >=42%)
Results: App-Driven Optimizations (I)
EM: 4 GB dataset with a 20 GB intermediate matrix
(chart annotations: 1.7; 7.7)
Results: App-Driven Optimizations (II)
PageRank: 32 GB dataset with a block size of 512 MB and a GPU chunk size of 128 MB
(chart annotation: 24%)
Results: Examples of System Tuning
Gridding kernel: 32 GB dataset; varying cpu_chunk_size and gpu_chunk_size
(chart annotations: 16 MB; 512 MB)
Insights
- GPUs can significantly accelerate certain classes of computations, but they bring programming difficulties and introduce data-copying overheads
- A suitable data mapping between the CPU and the GPU is crucial for overall performance in heterogeneous computing
- Application-specific opportunities should be exploited, with optional APIs, to make the best use of a parallel system
- Automatic optimization is desirable for choosing the right set of system parameter settings
Related Work
Data-intensive computing with map-reduce-like models
- Multi-core CPUs: Phoenix, Phoenix-rebirth, …
- A single GPU: Mars, MapCG, …
- GPU clusters: MITHRA, IDAV, …
- CPU-GPU clusters: our MATE-CG
Programming heterogeneous systems
- Software end: Merge, EXOCHI, Harmony, Qilin, …
- Hardware end: CUBA, CUDA, OpenCL, …
Auto-tuning: many prior efforts
- Basic idea: search for the best solution among all possibilities; map solutions to performance metrics; map hardware/software characteristics to parameters
- Very useful for library generators, compilers, runtime systems, …
Conclusions
- MATE-CG supports a map-reduce-like API to ease the difficulty of programming a heterogeneous CPU-GPU cluster
- The system achieves good scalability with an increasing number of nodes, and heterogeneous execution further improves performance over CPU-only or GPU-only execution
- It introduces a novel and effective auto-tuning approach for choosing the best data mapping between the CPU and the GPU
- Application-specific optimizations should be considered in the user code, and a high-level API should be coupled with substantial auto-tuning to identify the right system parameter settings automatically
Questions?