1 A Map-Reduce-Like System for Programming and Optimizing Data-Intensive Computations on Emerging Parallel Architectures
Wei Jiang
Advisor: Dr. Gagan Agrawal

2 Motivation
Growing need for Data-Intensive SuperComputing
  High-performance data processing
  Data management and efficiency requirements
Data-intensive computations
Emergence of various parallel architectures
  Traditional CPU clusters (multi-cores)
  GPU clusters (many-cores)
  CPU-GPU clusters (heterogeneous systems)
Exploiting parallel architectures for high-end applications
  Better programmer productivity and runtime efficiency
  Programming models and middleware support!

3 Introduction (I)
Map-Reduce
  Simple API: map and reduce (illustrated below)
  Easy to write parallel programs
  Fault-tolerant for large-scale data centers
  High programming productivity
  Performance? Always a concern for the HPC community
Data-intensive applications
  Various subclasses:
    Data center-oriented: search technologies
    Data mining
    Large intermediate structures: pre-processing
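
As a concrete illustration of the two-function API, here is a minimal word-count-style sketch. It is a generic illustration only; the names and the grouping container are assumptions, not the Hadoop or MATE interface.

```cpp
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Intermediate (key, value) pairs, grouped by key as the runtime would do.
using Intermediate = std::map<std::string, std::vector<int>>;

// map(): process one input record and emit (word, 1) pairs.
void map_record(const std::string& line, Intermediate& inter) {
    std::istringstream words(line);
    std::string word;
    while (words >> word) inter[word].push_back(1);
}

// reduce(): fold all values for one key into a single count.
int reduce_key(const std::vector<int>& values) {
    int total = 0;
    for (int v : values) total += v;
    return total;
}

int main() {
    Intermediate inter;
    map_record("the quick brown fox jumps over the lazy dog", inter);
    for (const auto& kv : inter)
        std::cout << kv.first << ": " << reduce_key(kv.second) << "\n";
    return 0;
}
```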

4 Introduction (II)
Parallel computing environments
  CPU clusters (multi-cores)
    Most widely used as traditional HPC platforms
    Motivated MapReduce and many of its variants
  GPU clusters (many-cores)
    Higher performance with better cost and energy efficiency
    Low programming productivity
    Limited MapReduce-like support
  CPU-GPU clusters
    Emerging heterogeneous systems
    No MapReduce-like support to date, except our recent work!
  New hybrid architecture
    CPU + GPU on the same chip

5 Thesis Goals
Examine APIs for data-intensive computations
  Map-reduce-like models
Needs of different data-intensive applications
  Data mining, graph mining, and scientific computing
Implementation on emerging architectures
  Multi-core, many-core, hybrid, etc.
Automatic tuning and scheduling
  Automatic optimization
  Efficient resource utilization and sharing

6 Outline
Background
Current Work
Proposed Work
Conclusion

7 Map-Reduce Execution

8 Hadoop Implementation
HDFS
  Almost GFS, but with no file update
  Cannot be directly mounted by an existing operating system
  Fault tolerance
  One name node; Job Tracker and Task Trackers
Optimizations
  Data locality
    Schedule a map task near a replica of its input data
  Optional combiner (illustrated below)
    Use local reduction to save network bandwidth
  Backup tasks
    Alleviate the problem of stragglers
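
The combiner idea is simply a local reduction applied before intermediate data crosses the network. A minimal sketch of the concept (not the Hadoop Combiner class; names are illustrative):

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Without a combiner, every (word, 1) pair emitted by map would cross the
// network. A combiner applies the reduce logic locally first, so only one
// (word, local_count) pair per distinct key leaves the node.
std::vector<std::pair<std::string, int>>
combine_locally(const std::vector<std::pair<std::string, int>>& map_output) {
    std::unordered_map<std::string, int> local;
    for (const auto& kv : map_output) local[kv.first] += kv.second;  // local reduction
    return {local.begin(), local.end()};
}
```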

9 Outline
Background
Current Work
Proposed Work
Conclusion

10 Current Work
Three systems on different parallel architectures:
  MATE (Map-reduce with an AlternaTE API)
    For multi-core environments
  Ex-MATE (Extended MATE)
    For clusters of multi-cores
    Provides large-sized reduction object support
  MATE-CG (MATE for CPU-GPU)
    For heterogeneous CPU-GPU clusters
    Provides an auto-tuning framework for data distribution

11 Phoenix-based Implementation
Phoenix is based on the same principles as MapReduce
  Targets shared-memory systems
  Consists of a simple API visible to application programmers
    Users define functions such as splitter, map, and reduce
  An efficient runtime that handles low-level details
    Parallelization using pthreads
    Fault detection and recovery
MATE (Map-Reduce system with an AlternaTE API)
  Built on top of Phoenix, with the use of a reduction object
  Adopts the generalized reduction model

12 Comparing Processing Structures
The reduction object represents the intermediate state of the execution (see the sketch below)
The reduction function is commutative and associative
Sorting, grouping, and similar overheads are eliminated with the reduction function/object
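
A minimal sketch of the generalized reduction model, using k-means-style accumulation into a reduction object. The types and function names are assumptions for illustration; the actual MATE API differs.

```cpp
#include <array>
#include <vector>

constexpr int K = 4;    // number of clusters
constexpr int DIM = 3;  // point dimensionality

// Reduction object: the only intermediate state kept during execution.
struct ReductionObject {
    std::array<std::array<double, DIM>, K> sum{};  // per-cluster coordinate sums
    std::array<long, K> count{};                   // per-cluster point counts
};

// reduction(): accumulate one input element into the reduction object.
// Commutative and associative, so no sorting or grouping of pairs is needed.
void reduction(const std::array<double, DIM>& point, int nearest_cluster,
               ReductionObject& ro) {
    for (int d = 0; d < DIM; ++d) ro.sum[nearest_cluster][d] += point[d];
    ro.count[nearest_cluster] += 1;
}

// combination(): merge a thread-private copy into the global object.
void combination(const ReductionObject& local, ReductionObject& global) {
    for (int k = 0; k < K; ++k) {
        for (int d = 0; d < DIM; ++d) global.sum[k][d] += local.sum[k][d];
        global.count[k] += local.count[k];
    }
}

// finalize(): turn the accumulated state into new centroids.
std::vector<std::array<double, DIM>> finalize(const ReductionObject& ro) {
    std::vector<std::array<double, DIM>> centroids(K);
    for (int k = 0; k < K; ++k)
        for (int d = 0; d < DIM; ++d)
            centroids[k][d] = ro.count[k] ? ro.sum[k][d] / ro.count[k] : 0.0;
    return centroids;
}
```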

13 Phoenix Runtime

14 MATE Runtime Dataflow
Basic one-stage dataflow (Full Replication scheme)

15 System Design and Implementation
Shared-memory technique in MATE
  Full Replication scheme
Three sets of functions in MATE
  Internally used functions, invisible to users
  One set of API functions provided by the runtime
  Another set of API functions defined or customized by the user

16 Functions
APIs defined/customized by the user (R = required, O = optional; a usage sketch follows the table):
  Function                                               Description   R/O
  int  (*splitter_t)(void *, int, reduction_args_t *)                  O
  void (*reduction_t)(reduction_args_t *)                              R
  void (*combination_t)(void *)
  void (*finalize_t)(void *)
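
A sketch of how an application might supply these callbacks. The function-pointer types are copied from the table; the layout of reduction_args_t and the registration structure are hypothetical, since neither appears in the transcript.

```cpp
#include <cstddef>

// Hypothetical stand-in for the fields of reduction_args_t, used only so the
// sketch compiles; the real structure is not shown on the slide.
struct reduction_args_t {
    void*  data;    // pointer to one input split/element (assumed)
    size_t length;  // length of the split in bytes (assumed)
};

// Function-pointer types from the table.
typedef int  (*splitter_t)(void*, int, reduction_args_t*);
typedef void (*reduction_t)(reduction_args_t*);
typedef void (*combination_t)(void*);
typedef void (*finalize_t)(void*);

// User-supplied callbacks (bodies elided).
static int  my_splitter(void*, int, reduction_args_t*) { return 0; }  // carve out next split
static void my_reduction(reduction_args_t*) { /* accumulate into the reduction object */ }
static void my_combination(void*) { /* merge per-thread reduction objects */ }
static void my_finalize(void*) { /* post-process the final state */ }

// Hypothetical registration structure handed to the runtime.
struct scheduler_args_t {
    splitter_t    splitter    = my_splitter;
    reduction_t   reduction   = my_reduction;
    combination_t combination = my_combination;
    finalize_t    finalize    = my_finalize;
};
```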

17 Implementation Considerations
Focus on the API differences in the programming models
Data partitioning among multiple threads:
  Splits are assigned dynamically to worker threads, as in Phoenix
Buffer management: two temporary buffers
  One for reduction objects
  The other for combination results
Fault tolerance: re-execute failed tasks
  Checkpointing may be a better solution, since the reduction object maintains the computation state

18 Experiments Design
For comparison, we used three applications
  Data mining: KMeans, PCA, Apriori
  Also evaluated the single-node performance of Hadoop on KMeans and Apriori
Experiments on two multi-core platforms
  8 cores on one WCI node (Intel CPU)
  16 cores on one AMD node (AMD CPU)

19 Results: Data Mining (I)
K-Means on the 8-core and 16-core machines: 400 MB dataset, 3-dimensional points, k = 100
(Chart: avg. time per iteration (sec) vs. number of threads)

20 Results: Data Mining (II)
PCA on the 8-core and 16-core machines: 8000 × … matrix
(Chart: total time (sec) vs. number of threads)

21 Results: Data Mining (III)
Apriori on the 8-core and 16-core machines: 1,000,000 transactions, 3% support
(Chart: avg. time per iteration (sec) vs. number of threads)

22 Current Work
Three systems on different parallel architectures:
  MATE (Map-reduce with an AlternaTE API)
    For multi-core environments
  Ex-MATE (Extended MATE)
    For clusters of multi-cores
    Provides large-sized reduction object support
  MATE-CG (MATE for CPU-GPU)
    For heterogeneous CPU-GPU clusters
    Provides an auto-tuning framework for data distribution

23 Extending MATE
Main issue with the original MATE:
  It assumes that the reduction object MUST fit in memory
We extended MATE to address this limitation
  Focus on graph mining: an emerging class of applications
    Requires large-sized reduction objects as well as large-scale datasets
    E.g., PageRank could have an 8 GB reduction object!
  Support for managing arbitrary-sized reduction objects
    Large-sized reduction objects are disk-resident
  Evaluated Ex-MATE against PEGASUS
    PEGASUS: a Hadoop-based graph mining system

24 Ex-MATE Runtime Overview
Basic one-stage execution

25 Implementation Considerations
Support for processing very large datasets
  Partitioning function: partition and distribute the data to a number of nodes
  Splitting function: use the multi-core CPU on each node
Management of a large reduction object (R.O.): reduce disk I/O!
  Outputs (R.O.) are updated in a demand-driven way
  Partition the reduction object into splits
  Inputs are re-organized based on data access patterns
  Reuse an R.O. split as much as possible while it is in memory
Example: matrix-vector multiplication (see the sketch after the next slide)

26 A Matrix-Vector Multiplication Example
(Figure: blocked matrix-vector multiplication; input matrix blocks (1,1), (1,2), (2,1), ... combine input-vector splits into output-vector splits)
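
A minimal sketch of the access pattern the figure illustrates: the output vector (the reduction object) is partitioned into splits, and all matrix blocks that update one split are processed before moving to the next, so each split is reused as much as possible. This in-memory dense version is an assumption for illustration; Ex-MATE's inputs are sparse graphs and the splits are disk-resident.

```cpp
#include <vector>

// Dense blocks for illustration; block indices follow the (row, column) labels in the figure.
struct MatrixBlock {
    int row_split;                            // which output-vector split it updates
    int col_split;                            // which input-vector split it reads
    std::vector<std::vector<double>> values;  // block_rows x block_cols entries
};

// Process all blocks that update one output split before moving on, so each
// reduction-object split is loaded (and written back) only once per pass.
void multiply(const std::vector<MatrixBlock>& blocks,
              const std::vector<std::vector<double>>& in_splits,
              std::vector<std::vector<double>>& out_splits) {
    for (size_t r = 0; r < out_splits.size(); ++r) {          // one R.O. split at a time
        std::vector<double>& out = out_splits[r];
        for (const MatrixBlock& b : blocks) {
            if (static_cast<size_t>(b.row_split) != r) continue;
            const std::vector<double>& in = in_splits[b.col_split];
            for (size_t i = 0; i < out.size(); ++i)
                for (size_t j = 0; j < in.size(); ++j)
                    out[i] += b.values[i][j] * in[j];          // local reduction into the split
        }
    }
}
```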

27 Experiments Design
Applications:
  Three graph mining algorithms: PageRank, Diameter Estimation (HADI), and Finding Connected Components (HCC), parallelized using the GIM-V method
Evaluation:
  Performance comparison with PEGASUS
    PEGASUS provides a naive version and an optimized version
  Speedups with an increasing number of nodes
  Scalability with an increasing size of datasets
Experimental platform:
  A cluster of multi-core CPU machines
  Used up to 128 cores (16 nodes)

28 Results: Graph Mining (I)
16 GB datasets: Ex-MATE achieves roughly 10x speedup
(Charts for PageRank, HADI, and HCC: avg. time per iteration (min) vs. number of nodes)

29 Scalability: Graph Mining (II)
HCC: better scalability with larger datasets
(Chart: avg. time per iteration (min) vs. number of nodes for 8 GB, 32 GB, and 64 GB datasets)

30 Current Work
Three systems on different parallel architectures:
  MATE (Map-reduce with an AlternaTE API)
    For multi-core environments
  Ex-MATE (Extended MATE)
    For clusters of multi-cores
    Provides large-sized reduction object support
  MATE-CG (MATE for CPU-GPU)
    For heterogeneous CPU-GPU clusters
    Provides an auto-tuning framework for data distribution

31 MATE-CG
Adopts generalized reduction
  Built on top of MATE and Ex-MATE
  Accelerates data-intensive computations on heterogeneous systems
Focus on CPU-GPU clusters
  A multi-level data partitioning
Proposes a novel auto-tuning framework
  Exploits the iterative nature of many data-intensive applications
  Automatically decides the workload distribution between CPUs and GPUs

32 MATE-CG Overview
Execution workflow

33 Auto-Tuning Framework
Auto-tuning problem: given an application, find the optimal data distribution between the CPU and the GPU to minimize the overall running time
  For example: 20/80? 50/50? 70/30?
Our approach:
  Exploit the iterative nature of many data-intensive applications, which repeat similar computations over a number of iterations
  Construct an analytical model to predict performance
  The optimal value is computed and learned over the first few iterations
  No compile-time search or tuning is needed
  Low runtime overhead when the number of iterations is large

34 The Analytical Model
Illustration of the relationship between T_cg and p, the fraction of data assigned to the CPU:
  T_c: processing time on the CPU for the fraction p
  T_g: processing time on the GPU for the fraction (1 - p)
  T_cg: overall processing time on CPU + GPU as a function of p
(Chart: T_c, T_g, and T_cg plotted against p; a plausible closed form is sketched below)
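
The transcript names the quantities but does not carry the closed form of the model. Under the simplifying assumptions that the CPU and GPU process disjoint fractions concurrently and that each side's time scales linearly with its fraction, one plausible form is:

```latex
% Assumed closed form, not taken from the slide:
T_c(p)    = t_c \, p
T_g(p)    = t_g \, (1 - p)
T_{cg}(p) = T_o + \max\{\, T_c(p),\; T_g(p) \,\}
% Minimized where the two sides finish together, T_c(p) = T_g(p):
p^{*} = \frac{t_g}{t_c + t_g}
```

Here t_c and t_g would be the full-dataset processing times on the CPU and GPU, estimated from the first few iterations, and T_o a fixed per-iteration overhead; all three symbols are assumptions introduced for this sketch.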

35 Experiments Design
Experimental platform
  A heterogeneous CPU-GPU cluster
  Each node has one multi-core CPU and one GPU
    Intel 8-core CPU
    NVIDIA Tesla (Fermi) GPU (448 cores)
  Used up to 128 CPU cores and 7168 GPU cores on 16 nodes
Three representative applications
  Gridding kernel, EM, and PageRank
For each application, we run it in four modes in the cluster
  CPU-1, CPU-8, GPU-only, and CPU-8-n-GPU

36 Results: Scalability with an Increasing Number of GPUs
GPU-only is better than CPU-8
(Charts for PageRank, EM, and the Gridding Kernel: avg. time per iteration (sec) vs. number of nodes)

37 Results: Auto-Tuning
On 16 nodes
(Charts for EM, PageRank, and the Gridding Kernel: execution time in one iteration (sec) vs. iteration number)

38 Outline
Background
Current Work
Proposed Work
Conclusion

39 Proposed Work
Performance-Aware Application Consolidation in Cluster Environments
Towards Automatic Optimization of Reduction-Based Applications
A High-Level API for Programming Heterogeneous Systems Using OpenCL

40 Performance-Aware Application Consolidation in Cluster Environments
Motivation
  So far we have focused on running a single application at a time in a cluster
  A single application may not utilize all the resources on each node
  It is desirable to consolidate a set of applications to run concurrently
    Towards resource sharing
    While remaining performance-aware
Problem formulation: given a set of data-intensive applications, schedule them concurrently to maximize resource usage in CPU clusters while maintaining the performance of each application

41 Two Sub-Problems
Performance and resource modeling
  Predict the performance based on a resource allocation plan
  Identify the best resource allocation plan within the performance constraints
    E.g., use the least average resource cost
Application-consolidation scheduling
  Map each application to a set of nodes
  Towards efficient overall resource utilization per node in the cluster

42 Scheduling Problem
The application-consolidation scheduling algorithm
  Maps a set of applications with different resource allocation plans to nodes, towards efficient resource utilization
  Reduces to the bin-packing problem, which is NP-hard
  Heuristics are needed in practice
One possible heuristic (sketched in code below):
  Step 1: calculate the weighted average of resource demands for all applications and sort them in descending order
  Step 2: for each application in decreasing order, select qualified nodes that meet its demands
    If such nodes exist: assign the application to them and update the residual resources
    Otherwise: assign it to the least loaded nodes or delay its execution
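
A minimal sketch of that heuristic in a first-fit-decreasing style. It simplifies the proposal to one node per application, and the demand weights, resource dimensions, and the "delay" fallback (an empty placement) are assumptions.

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct App  { std::string name; double cpu, mem; };   // per-node resource demands
struct Node { std::string name; double cpu, mem; };   // residual capacities

// Weighted average of the demand vector (equal weights are an assumption).
double weight(const App& a) { return 0.5 * a.cpu + 0.5 * a.mem; }

// Step 1: sort applications by weighted demand, descending.
// Step 2: place each one on the first node with enough residual capacity.
std::vector<std::pair<std::string, std::optional<std::string>>>
schedule(std::vector<App> apps, std::vector<Node>& nodes) {
    std::sort(apps.begin(), apps.end(),
              [](const App& a, const App& b) { return weight(a) > weight(b); });
    std::vector<std::pair<std::string, std::optional<std::string>>> placement;
    for (const App& a : apps) {
        std::optional<std::string> chosen;   // empty => delay or least-loaded fallback
        for (Node& n : nodes) {
            if (n.cpu >= a.cpu && n.mem >= a.mem) {
                n.cpu -= a.cpu;              // update residual resources
                n.mem -= a.mem;
                chosen = n.name;
                break;
            }
        }
        placement.emplace_back(a.name, chosen);
    }
    return placement;
}
```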

43 Proposed Work
Performance-Aware Application Consolidation in Cluster Environments
Towards Automatic Optimization of Reduction-Based Applications
A High-Level API for Programming Heterogeneous Systems Using OpenCL

44 Towards Automatic Optimization of Reduction-Based Applications
An improvement of the MATE-CG system
  Ease the programming difficulties on GPUs
    Code generation
  Provide an auto-tuning framework over a set of system parameters
Auto-tuning problem: for a set of system parameters, automatically identify the best execution plan for a given application so as to achieve the best possible performance metric (e.g., running time)
Potential approaches:
  Performance-model approach
  Dynamic-profiling approach
  Sampling-based approach

45 Auto-Tuning: Potential Approaches (I)
Performance-model approach
  Establish a performance model to predict performance
  Search for the best configuration in the solution space
  Learning over time can be used, e.g., by exploiting the iterative nature of applications
Dynamic-profiling approach
  Investigate some parameters upfront, before the application runs
  Exploit application similarities, e.g., some applications fit the same algorithm
  Code analysis can be used, e.g., for GPU settings

46 Auto-Tuning: Potential Approaches (II)
Sampling-based approach (sketched below)
  Start multiple small instances before the main instance
  Rely on feedback from the sampled instances
  Adaptively identify the best configuration
  E.g., useful for single-pass applications
A hybrid approach
  Combine the above three
  Target a complete set of system parameters
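
A minimal sketch of the sampling-based idea: time a few small sampled instances under candidate configurations and keep the fastest before launching the main run. The configuration fields and the run_sample hook are assumptions, not part of the proposal.

```cpp
#include <chrono>
#include <functional>
#include <limits>
#include <vector>

struct Config { int gpu_threads_per_block; double cpu_fraction; };  // assumed parameters

// Time one small sampled instance under a candidate configuration.
// run_sample is assumed to execute the kernel on a small slice of the input.
double time_sample(const std::function<void(const Config&)>& run_sample,
                   const Config& cfg) {
    auto start = std::chrono::steady_clock::now();
    run_sample(cfg);
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Pick the fastest candidate before launching the main instance.
Config pick_best(const std::vector<Config>& candidates,
                 const std::function<void(const Config&)>& run_sample) {
    Config best = candidates.front();
    double best_time = std::numeric_limits<double>::max();
    for (const Config& cfg : candidates) {
        double t = time_sample(run_sample, cfg);
        if (t < best_time) { best_time = t; best = cfg; }
    }
    return best;
}
```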

47 Proposed Work
Performance-Aware Application Consolidation in Cluster Environments
Towards Automatic Optimization of Reduction-Based Applications
A High-Level API for Programming Heterogeneous Systems Using OpenCL

48 A High-Level API for Programming Heterogeneous Systems Using OpenCL
OpenCL is emerging as the open standard for programming heterogeneous systems
  One kernel code per application
A typical OpenCL program execution overview (see the host-side sketch below)
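
To illustrate the execution flow a typical OpenCL program goes through (device discovery, context and queue creation, program build, kernel launch, read-back), here is a minimal vector-add host program using the standard OpenCL 1.x C API. Error checks are omitted for brevity; this is a generic sketch, not the proposed high-level API.

```cpp
#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Minimal vector-add kernel: one kernel code for one application.
static const char* kSource = R"(
__kernel void vadd(__global const float* a, __global const float* b,
                   __global float* c) {
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
})";

int main() {
    const size_t n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    // Discover a platform and device, then create a context and command queue.
    cl_platform_id platform; clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);
    cl_int err;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    // Build the kernel from source at runtime.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, &err);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "vadd", &err);

    // Move data to the device, launch, and read the result back.
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), a.data(), &err);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), b.data(), &err);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), nullptr, &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &da);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &db);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &dc);
    size_t global = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(queue, dc, CL_TRUE, 0, n * sizeof(float), c.data(), 0, nullptr, nullptr);
    std::printf("c[0] = %f\n", c[0]);

    // Release resources.
    clReleaseMemObject(da); clReleaseMemObject(db); clReleaseMemObject(dc);
    clReleaseKernel(kernel); clReleaseProgram(prog);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}
```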

49 Data Distribution for OpenCL
Focus on CPU-GPU clusters
A loose categorization of OpenCL programs
  CPUs only
  GPUs only
  Both CPUs and GPUs, with proper data distribution
A hierarchical model for workload distribution
  First-level predictors
    Does the program fit CPUs only or GPUs only?
    Binary classifiers based on training data
  Second-level predictor
    Used when the first level does not lead to a conclusion
    Auto-tuning: search for the best hybrid data mapping

50 Summary
Develop a high-level API on top of OpenCL for heterogeneous systems
Research agenda
  Conduct an empirical study to understand OpenCL programs
  Develop the hierarchical model for data distribution among CPUs and GPUs
  Provide useful insights for optimizing OpenCL programs

51 Outline
Background
Current Work
Proposed Work
Conclusion

52 Conclusions (I)
Map-Reduce-like models provide high programming productivity on traditional HPC platforms
  Performance is not satisfactory
Our previous work provided an alternate API with better performance on emerging parallel architectures
  A comparative study showed that a variant of map-reduce is promising in performance efficiency
  The MATE system focused on understanding and comparing the performance differences between the two APIs
  Ex-MATE added support for graph mining applications with large-sized reduction objects
  MATE-CG focused on utilizing the massive power of CPU-GPU clusters with a novel auto-tuning framework

53 Conclusions (II)
We propose future work in three directions:
  Consolidate a set of data-intensive applications in CPU clusters, towards better resource utilization while maintaining performance
  Design a map-reduce-like API to ease the programming difficulties on GPUs, and provide automatic runtime optimization by auto-tuning a set of system parameters
  Develop a high-level API using OpenCL for programming heterogeneous systems, with a hierarchical data distribution model

54 Thank You, and Acknowledgments
Questions and comments
Wei Jiang -
Gagan Agrawal -
Our research is supported by:

55 Related Work (I)
Data-intensive computing with map-reduce-like models
  Multi-core CPUs: Phoenix, Phoenix-rebirth, ...
  A single GPU: Mars, MapCG, ...
  GPU clusters: MITHRA, IDAV, ...
  CPU-GPU clusters: our MATE-CG
Improvements of the Map-Reduce API:
  Integrating pre-fetching and pre-shuffling into Hadoop
  Supporting online queries
  Enforcing less restrictive synchronization semantics between Map and Reduce

56 Related Work (II)
Google's Pregel system:
  Map-reduce may not be well suited for graph operations
  Proposed to target graph processing
  Open-source version: the HAMA project in Apache
Variants of Map-Reduce:
  Dryad/DryadLINQ from Microsoft
  Sawzall from Google
  Pig/Map-Reduce-Merge from Yahoo!
  Hive from Facebook

57 Related Work (III)
Programming heterogeneous systems
  Software end: Merge, EXOCHI, Harmony, Qilin, ...
  Hardware end: CUBA, CUDA, OpenCL, ...
Auto-tuning and automatic optimization:
  Automatically search for the best solution among all possibilities
    Map solutions to performance metrics
    Map hardware/software characteristics to parameters
  Very useful for library generators, compilers, and runtime systems
Scheduling and consolidation
  Consolidate a set of applications concurrently in cloud/virtualized environments
  Consider performance, resources, energy, etc.

