A Map-Reduce-Like System for Programming and Optimizing Data-Intensive Computations on Emerging Parallel Architectures
Wei Jiang
Advisor: Dr. Gagan Agrawal

Motivation
- Growing need for Data-Intensive SuperComputing
  - High-performance data processing
  - Data management & efficiency requirements of data-intensive computations
- Emergence of various parallel architectures
  - Traditional CPU clusters (multi-cores)
  - GPU clusters (many-cores)
  - CPU-GPU clusters (heterogeneous systems)
- Exploiting parallel architectures for high-end applications
  - Provide better programmer productivity & runtime efficiency
  - Programming models and middleware support!

Introduction (I)
- Map-Reduce
  - Targets data-intensive applications
  - Simple API: map and reduce (a minimal sketch follows)
  - Easy to write parallel programs
  - Fault-tolerant for large-scale data centers
  - High programming productivity
  - Performance? Always a concern for the HPC community
- Data-intensive applications: various subclasses
  - Data-center-oriented: search technologies
  - Data mining
  - Large intermediate structures: pre-processing
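To make the two-function API concrete, here is a minimal word-count-style sketch (illustrative only; the function names and signatures are assumptions, not those of any particular framework):

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// map: emit an intermediate (word, 1) pair for every word in a split.
void map_func(const std::string& split,
              std::vector<std::pair<std::string, int>>& out) {
    std::istringstream in(split);
    std::string word;
    while (in >> word) out.emplace_back(word, 1);
}

// reduce: combine all counts grouped under one key into a single value.
int reduce_func(const std::vector<int>& counts) {
    int sum = 0;
    for (int c : counts) sum += c;
    return sum;
}
```

The runtime handles everything between the two calls: partitioning inputs, grouping intermediate pairs by key, and re-executing failed tasks.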

Introduction (II)
Parallel computing environments:
- CPU clusters (multi-cores)
  - Most widely used as traditional HPC platforms
  - Motivated MapReduce and many of its variants
- GPU clusters (many-cores)
  - Higher performance with better cost & energy efficiency
  - Low programming productivity
  - Limited MapReduce-like support
- CPU-GPU clusters
  - Emerging heterogeneous systems
  - No MapReduce-like support to date, except our recent work!
- New hybrid architecture: CPU+GPU on the same chip

Thesis Goals
- Examine APIs for data-intensive computations
  - Map-reduce-like models
- Needs of different data-intensive applications
  - Data mining, graph mining, and scientific computing
- Implementation on emerging architectures
  - Multi-core, many-core, hybrid, etc.
- Automatic tuning and scheduling
  - Automatic optimization
  - Efficient resource utilization and sharing

Outline
- Background
- Current Work
- Proposed Work
- Conclusion

Map-Reduce Execution
[Figure: map-reduce execution flow]

Hadoop Implementation
- HDFS
  - Almost GFS, but no file updates
  - Cannot be directly mounted by an existing operating system
  - Fault tolerance
  - One name node; Job Tracker & Task Tracker
- Optimizations
  - Data locality: schedule a map task near a replica of the corresponding input data
  - Optional combiner: use local reduction to save network bandwidth
  - Backup tasks: alleviate the problem of stragglers

Outline
- Background
- Current Work
- Proposed Work
- Conclusion

Current Work
Three systems on different parallel architectures:
- MATE (Map-reduce with an AlternaTE API)
  - For multi-core environments
- Ex-MATE (Extended MATE)
  - For clusters of multi-cores
  - Provided support for large-sized reduction objects
- MATE-CG (MATE for CPU-GPU)
  - For heterogeneous CPU-GPU clusters
  - Provided an auto-tuning framework for data distribution

Phoenix-based Implementation
- Phoenix is based on the same principles as MapReduce
  - Targets shared-memory systems
  - Consists of a simple API that is visible to application programmers; users define functions like splitter, map, reduce, etc.
  - An efficient runtime that handles low-level details: parallelization using Pthreads, fault detection and recovery
- MATE (Map-Reduce system with an AlternaTE API)
  - Built on top of Phoenix, with the use of a reduction object
  - Adopted the generalized reduction model

Comparing Processing Structures
- The reduction object represents the intermediate state of the execution
- The reduction function is commutative and associative
- Sorting, grouping, and similar overheads are eliminated by the reduction function/object (a sketch follows)
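A minimal sketch of the generalized reduction idea, using a k-means-style accumulation purely for illustration (the struct and function names are assumptions, not MATE's actual code): each input element folds directly into a shared reduction object, so no intermediate (key, value) pairs need to be sorted or grouped.

```cpp
#include <vector>

// Hypothetical reduction object for 1-D k-means: per-cluster running sums.
// Vectors are assumed to be pre-sized to the number of clusters.
struct ReductionObject {
    std::vector<double> sum;    // sum[k]: coordinate sum of cluster k
    std::vector<long>   count;  // count[k]: points assigned to cluster k
};

// Generalized reduction: fold one element into the reduction object.
// The update is commutative and associative, so threads may process
// elements in any order (with per-bucket locking or replication).
void reduce_element(ReductionObject& ro, double point, int nearest) {
    ro.sum[nearest]   += point;
    ro.count[nearest] += 1;
}
```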

Phoenix Runtime
[Figure: the Phoenix runtime architecture]

MATE Runtime Dataflow
- Basic one-stage dataflow (Full Replication scheme)

System Design and Implementation
- Shared-memory technique in MATE
  - Full Replication scheme
- Three sets of functions in MATE
  - Internally used functions, invisible to users
  - One set of API provided by the runtime
  - Another set of API to be defined or customized by the users

Functions
APIs defined/customized by the user (R = required, O = optional; a usage sketch follows the table):

  Function                                              R/O
  int  (*splitter_t)(void *, int, reduction_args_t *)    O
  void (*reduction_t)(reduction_args_t *)                R
  void (*combination_t)(void *)
  void (*finalize_t)(void *)
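A hedged sketch of how an application might implement the required reduction_t (only the signatures above come from the slides; the field layout of reduction_args_t below is an assumption made for illustration):

```cpp
// Assumed shape of the runtime-supplied argument struct; the real
// MATE definition is not shown in the slides.
typedef struct {
    void *data;     // current input chunk handed out by the splitter
    int   length;   // number of elements in the chunk
    void *red_obj;  // pointer into the shared reduction object
} reduction_args_t;

// User-defined reduction_t: fold a chunk of doubles into a running sum.
void sum_reduction(reduction_args_t *args) {
    double *in  = (double *)args->data;
    double *acc = (double *)args->red_obj;
    for (int i = 0; i < args->length; i++)
        *acc += in[i];
}
```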

Implementation Considerations
- Focus on the API differences in programming models
- Data partitioning among multiple threads: dynamically assign splits to worker threads, same as Phoenix
- Buffer management: two temporary buffers
  - One for reduction objects
  - The other for combination results
- Fault tolerance: re-execute failed tasks
  - Checkpointing may be a better solution, since the reduction object maintains the computation state

Experimental Design
- For comparison, we used three data mining applications: KMeans, PCA, Apriori
- Also evaluated the single-node performance of Hadoop on KMeans and Apriori
- Experiments on two multi-core platforms
  - 8 cores on one WCI node (Intel CPU)
  - 16 cores on one AMD node (AMD CPU)

Results: Data Mining (I)
- K-Means on the 8-core and 16-core machines: 400MB dataset, 3-dim points, k = 100
- [Charts: avg. time per iteration (sec) vs. # of threads]

Results: Data Mining (II)
- PCA on the 8-core and 16-core machines: 8000 × 1024 matrix
- [Charts: total time (sec) vs. # of threads]

Results: Data Mining (III)
- Apriori on the 8-core and 16-core machines: 1,000,000 transactions, 3% support
- [Charts: avg. time per iteration (sec) vs. # of threads]

Current Work
Three systems on different parallel architectures:
- MATE (Map-reduce with an AlternaTE API)
  - For multi-core environments
- Ex-MATE (Extended MATE)
  - For clusters of multi-cores
  - Provided support for large-sized reduction objects
- MATE-CG (MATE for CPU-GPU)
  - For heterogeneous CPU-GPU clusters
  - Provided an auto-tuning framework for data distribution

Extending MATE
- Main issue of the original MATE: it assumes the reduction object MUST fit in memory
- We extended MATE to address this limitation
  - Focus on graph mining: an emerging class of applications
  - Requires large-sized reduction objects as well as large-scale datasets
  - E.g., PageRank could have an 8GB reduction object!
- Support for managing arbitrary-sized reduction objects
  - Large-sized reduction objects are disk-resident
- Evaluated Ex-MATE using PEGASUS
  - PEGASUS: a Hadoop-based graph mining system

Ex-MATE Runtime Overview
- Basic one-stage execution

Implementation Considerations
- Support for processing very large datasets
  - Partitioning function: partition and distribute the data to a number of nodes
  - Splitting function: use the multi-core CPU on each node
- Management of a large reduction object (R.O.): reduce disk I/O!
  - Outputs (the R.O.) are updated in a demand-driven way
  - Partition the reduction object into splits
  - Inputs are re-organized based on data access patterns
  - Reuse an R.O. split as much as possible in memory
- Example: matrix-vector multiplication (see the figure and sketch below)

A MV-Multiplication Example
[Figure: an input matrix, divided into blocks such as (1,1), (1,2), and (2,1), is multiplied by an input vector to produce an output vector]
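A minimal sketch of the reuse idea under these assumptions (the block layout and all names are illustrative, not Ex-MATE's actual code): matrix blocks are re-ordered so that every block touching one output-vector split is processed while that split of the reduction object is memory-resident.

```cpp
#include <cstddef>
#include <vector>

// One dense matrix block covering rows [row_begin, row_end).
struct MatrixBlock {
    int row_begin, row_end;
    std::vector<std::vector<double>> a;  // a[i][j], i relative to row_begin
};

// Process every block that touches one reduction-object split while
// that split is in memory, so each split is read and written once.
void process_split(std::vector<double>& out_split, int split_begin,
                   const std::vector<MatrixBlock>& blocks,
                   const std::vector<double>& in_vec) {
    for (const MatrixBlock& b : blocks)
        for (int i = b.row_begin; i < b.row_end; ++i)
            for (std::size_t j = 0; j < in_vec.size(); ++j)
                out_split[i - split_begin] += b.a[i - b.row_begin][j] * in_vec[j];
}
```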

Experimental Design
- Applications: three graph mining algorithms, PageRank, Diameter Estimation (HADI), and Finding Connected Components (HCC), parallelized using the GIM-V method
- Evaluation:
  - Performance comparison with PEGASUS, which provides a naïve version and an optimized version
  - Speedups with an increasing number of nodes
  - Scalability with an increasing size of datasets
- Experimental platform: a cluster of multi-core CPU machines; used up to 128 cores (16 nodes)

Results: Graph Mining (I)
- 16GB datasets: Ex-MATE achieves a ~10× speedup
- [Charts: avg. time per iteration (min) vs. # of nodes, for PageRank, HADI, and HCC]

Scalability: Graph Mining (II)
- HCC: better scalability with larger datasets
- [Charts: avg. time per iteration (min) vs. # of nodes, for 8GB, 32GB, and 64GB datasets]

Current Work
Three systems on different parallel architectures:
- MATE (Map-reduce with an AlternaTE API)
  - For multi-core environments
- Ex-MATE (Extended MATE)
  - For clusters of multi-cores
  - Provided support for large-sized reduction objects
- MATE-CG (MATE for CPU-GPU)
  - For heterogeneous CPU-GPU clusters
  - Provided an auto-tuning framework for data distribution

MATE-CG
- Adopts generalized reduction, on top of MATE and Ex-MATE
- Accelerates data-intensive computations on heterogeneous systems
  - Focus on CPU-GPU clusters
  - A multi-level data partitioning
- Proposes a novel auto-tuning framework
  - Exploits the iterative nature of many data-intensive applications
  - Automatically decides the workload distribution between CPUs and GPUs

MATE-CG Overview
- Execution workflow

Auto-Tuning Framework
- Auto-tuning problem: given an application, find the optimal data distribution between the CPU and the GPU to minimize the overall running time
  - For example: 20/80? 50/50? 70/30?
- Our approach:
  - Exploit the iterative nature of many data-intensive applications, with similar computations over a number of iterations
  - Construct an analytical model to predict performance
  - The optimal value is computed and learnt over the first few iterations
  - No compile-time search or tuning is needed
  - Low runtime overheads with a large number of iterations

The Analytical Model
Illustration of the relationship between T_cg and p (one possible concrete form follows):
- T_c: processing time on the CPU, which receives a fraction p of the data
- T_g: processing time on the GPU, which receives the remaining fraction (1 - p)
- T_cg: overall processing time on CPU+GPU, as a function of p
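One plausible way to write the model down; this concrete form is an assumption for illustration (the slides only state the dependence on p). If the CPU and GPU process their shares concurrently and each side scales roughly linearly in its data fraction, the overall time is governed by the slower side, and the best p lies where the two curves cross:

```latex
T_c(p) \approx p \, T_c^{\mathrm{full}}, \qquad
T_g(p) \approx (1 - p) \, T_g^{\mathrm{full}}, \qquad
T_{cg}(p) = \max\bigl(T_c(p),\, T_g(p)\bigr)

% T_cg is minimized where the two sides balance:
p^{*} = \frac{T_g^{\mathrm{full}}}{T_c^{\mathrm{full}} + T_g^{\mathrm{full}}}
```

The full-data times T_c^full and T_g^full can be estimated from the first few iterations, which is why no compile-time search is needed.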

Experimental Design
- Experimental platform: a heterogeneous CPU-GPU cluster
  - Each node has one multi-core CPU and one GPU
    - Intel 8-core CPU
    - NVIDIA Tesla (Fermi) GPU (448 cores)
  - Used up to 128 CPU cores and 7168 GPU cores on 16 nodes
- Three representative applications: Gridding Kernel, EM, and PageRank
- Each application is run in four modes in the cluster: CPU-1, CPU-8, GPU-only, and CPU-8-n-GPU

Results: Scalability with an Increasing # of GPUs
- GPU-only is better than CPU-8
- [Charts: avg. time per iteration (sec) vs. # of nodes, for PageRank, EM, and the Gridding Kernel]

Results: Auto-Tuning
- On 16 nodes
- [Charts: execution time in one iteration (sec) vs. iteration number, for EM, PageRank, and the Gridding Kernel]

Outline
- Background
- Current Work
- Proposed Work
- Conclusion

Proposed Work
- Performance-Aware Application Consolidation in Cluster Environments
- Towards Automatic Optimization of Reduction-Based Applications
- A High-Level API for Programming Heterogeneous Systems Using OpenCL

Performance-Aware Application Consolidation in Cluster Environments
- Motivation
  - So far, we have focused on running a single application at a time in a cluster
  - A single application may not utilize all the resources on each node
  - Desirable to consolidate a set of applications to run concurrently
    - Towards resource sharing, while also remaining performance-aware
- Problem formulation: given a set of data-intensive applications, schedule them concurrently to maximize resource usage in CPU clusters while maintaining the performance of each application

Two Sub-Problems
- Performance and resource modeling
  - Predict the performance based on a resource allocation plan
  - Identify the best resource allocation plan within the performance constraints
    - E.g., use the least average resource costs
- Application-consolidation scheduling
  - Map each application to a set of nodes
  - Towards efficient overall resource utilization per node in the cluster

Scheduling Problem
- The application-consolidation scheduling algorithm maps a set of applications with different resource allocation plans to nodes, towards efficient resource utilization
- Reduces to the bin-packing problem, which is NP-hard, so heuristics are needed in practice
- One heuristic could be as follows (a sketch follows the list):
  - Step 1: calculate the weighted average of resource demands for all applications and sort them in descending order
  - Step 2: for each application in decreasing order, select qualified nodes that meet its demands
    - If such nodes exist: assign the application to them and update the residual resources
    - Otherwise: assign it to the least loaded nodes, or delay its execution
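A minimal sketch of this two-step heuristic, a first-fit-decreasing flavor of bin packing (the single scalar demand per application and all names are simplifying assumptions; the delay option is omitted):

```cpp
#include <algorithm>
#include <vector>

struct App  { int id; double demand; };                  // weighted avg. demand
struct Node { double residual; std::vector<int> apps; }; // remaining capacity

// Step 1: sort applications by weighted demand, descending.
// Step 2: place each app on the first node that meets its demand;
// otherwise fall back to the least loaded node. Assumes nodes is non-empty.
void schedule(std::vector<App> apps, std::vector<Node>& nodes) {
    std::sort(apps.begin(), apps.end(),
              [](const App& x, const App& y) { return x.demand > y.demand; });
    for (const App& app : apps) {
        Node* target = nullptr;
        for (Node& n : nodes)
            if (n.residual >= app.demand) { target = &n; break; }
        if (!target)  // no qualified node: least loaded = largest residual
            target = &*std::max_element(nodes.begin(), nodes.end(),
                [](const Node& x, const Node& y) { return x.residual < y.residual; });
        target->apps.push_back(app.id);
        target->residual -= app.demand;
    }
}
```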

Proposed Work
- Performance-Aware Application Consolidation in Cluster Environments
- Towards Automatic Optimization of Reduction-Based Applications
- A High-Level API for Programming Heterogeneous Systems Using OpenCL

Towards Automatic Optimization of Reduction-Based Applications
- An improvement of the MATE-CG system
  - Ease the programming difficulties on GPUs: code generation
  - Provide an auto-tuning framework over a set of system parameters
- Auto-tuning problem: for a set of system parameters, automatically identify the best execution plan for a given application to achieve the best possible performance metric (e.g., running time)
- Potential approaches:
  - Performance-model approach
  - Dynamic-profiling approach
  - Sampling-based approach

Auto-Tuning: Potential Approaches (I)
- Performance-model approach
  - Establish a performance model to predict the performance
  - Search for the best configuration in the solution space
  - Learning-over-time could be used, e.g., exploiting the iterative nature of applications
- Dynamic-profiling approach
  - Investigate some parameters upfront, before the application runs
  - Exploit application similarities, e.g., some applications fit the same algorithm
  - Code analysis could be used, e.g., for GPU settings

Auto-Tuning: Potential Approaches (II)
- Sampling-based approach
  - Start multiple small instances before the main instance
  - Rely on feedback from the sampled instances
  - Adaptively identify the best configuration
  - E.g., useful for single-pass applications
- A hybrid approach
  - Combine the above three
  - Target a complete set of system parameters

Proposed Work
- Performance-Aware Application Consolidation in Cluster Environments
- Towards Automatic Optimization of Reduction-Based Applications
- A High-Level API for Programming Heterogeneous Systems Using OpenCL

A High-Level API for Programming Heterogeneous Systems Using OpenCL
- OpenCL is emerging as the open standard for programming heterogeneous systems
- One kernel code for one application
- A typical OpenCL program execution overview (a minimal host-side sketch follows)
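For context, a minimal host-side OpenCL skeleton using the OpenCL 1.x API (vector addition is assumed purely as an example; error checking is omitted for brevity). The same kernel source can be built for either a CPU or a GPU device, which is what makes OpenCL attractive for heterogeneous systems:

```cpp
#include <CL/cl.h>
#include <cstdio>

static const char* src =
    "__kernel void vadd(__global const float* a, __global const float* b,\n"
    "                   __global float* c) {\n"
    "    int i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main() {
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = float(i); b[i] = 2.0f * i; }

    // Pick the first platform/device; a real runtime would enumerate both
    // CPU and GPU devices here and distribute the work between them.
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, nullptr);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, nullptr);

    // Build the kernel from source at runtime: one kernel code per app.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, nullptr);
    clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "vadd", nullptr);

    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof a, a, nullptr);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof b, b, nullptr);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, nullptr, nullptr);

    clSetKernelArg(k, 0, sizeof da, &da);
    clSetKernelArg(k, 1, sizeof db, &db);
    clSetKernelArg(k, 2, sizeof dc, &dc);

    size_t gsize = N;
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &gsize, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, nullptr, nullptr);

    std::printf("c[1] = %f\n", c[1]);  // expect 3.0
    return 0;
}
```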

Data Distribution for OpenCL
- Focus on CPU-GPU clusters
- A loose categorization of OpenCL programs:
  - CPUs only
  - GPUs only
  - Both CPUs and GPUs, with proper data distribution
- A hierarchical model for workload distribution (a tiny sketch follows)
  - First-level predictors
    - Does the program fit CPUs only, or GPUs only?
    - Binary classifiers based on training data
  - Second-level predictor
    - Used when the first level does not lead to a conclusion
    - Auto-tuning: search for the best hybrid data mapping
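A tiny sketch of how the two levels might be wired together (the predictor interface is a hypothetical placeholder; in practice the classifiers would be trained on profiling data):

```cpp
#include <functional>

enum class Placement { CpuOnly, GpuOnly, Hybrid };

// First level: two binary classifiers, passed in as callables so the
// sketch stays self-contained (hypothetical trained predictors).
// Second level: fall through to the auto-tuned hybrid mapping when the
// first level does not reach a conclusion.
Placement decide(const std::function<bool()>& fits_cpu_only,
                 const std::function<bool()>& fits_gpu_only) {
    if (fits_cpu_only()) return Placement::CpuOnly;   // first level
    if (fits_gpu_only()) return Placement::GpuOnly;   // first level
    return Placement::Hybrid;  // second level: search the hybrid mapping
}
```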

Summary
- Develop a high-level API on top of OpenCL for heterogeneous systems
- Research agenda
  - Conduct an empirical study to understand OpenCL programs
  - Develop the hierarchical model for data distribution among CPUs and GPUs
  - Provide useful insights for optimizing OpenCL programs

Outline
- Background
- Current Work
- Proposed Work
- Conclusion

Conclusions (I)
- Map-Reduce-like models provide high programming productivity on traditional HPC platforms, but their performance is not satisfactory
- Our previous work provided an alternate API with better performance on emerging parallel architectures
  - A comparative study showed that a variant of map-reduce is promising in performance efficiency
  - The MATE system focused on the two APIs, to better understand and compare their performance differences
  - Ex-MATE added support for graph mining applications with large-sized reduction objects
  - MATE-CG focused on utilizing the massive power of CPU-GPU clusters, with a novel auto-tuning framework

Conclusions (II)
In the future, we propose to pursue three directions:
- Consolidate a set of data-intensive applications in CPU clusters, towards better resource utilization while maintaining performance
- Design a map-reduce-like API to ease the programming difficulties on GPUs, and provide automatic runtime optimization by auto-tuning a set of system parameters
- Develop a high-level API using OpenCL for programming heterogeneous systems, with a hierarchical data distribution model

Thank You, and Acknowledgments
- Questions and comments
  - Wei Jiang - jiangwei@cse.ohio-state.edu
  - Gagan Agrawal - agrawal@cse.ohio-state.edu
- Our research is supported by:

Related Work (I)
- Data-intensive computing with map-reduce-like models
  - Multi-core CPUs: Phoenix, Phoenix-rebirth, ...
  - A single GPU: Mars, MapCG, ...
  - GPU clusters: MITHRA, IDAV, ...
  - CPU-GPU clusters: our MATE-CG
- Improvements of the Map-Reduce API:
  - Integrating pre-fetching and pre-shuffling into Hadoop
  - Supporting online queries
  - Enforcing less restrictive synchronization semantics between Map and Reduce

Related Work (II)
- Google's Pregel system
  - Map-reduce may not be so suitable for graph operations
  - Proposed to target graph processing
  - Open-source version: the HAMA project in Apache
- Variants of Map-Reduce
  - Dryad/DryadLINQ from Microsoft
  - Sawzall from Google
  - Pig/Map-Reduce-Merge from Yahoo!
  - Hive from Facebook

Related Work (III)
- Programming heterogeneous systems
  - Software end: Merge, EXOCHI, Harmony, Qilin, ...
  - Hardware end: CUBA, CUDA, OpenCL, ...
- Auto-tuning & automatic optimization
  - Automatically search for the best solution among all possibilities
    - Map solutions to performance metrics
    - Map hardware/software characteristics to parameters
  - Very useful for library generators, compilers, and runtime systems
- Scheduling & consolidation
  - Consolidate a set of applications concurrently in cloud/virtualized environments
  - Consider performance, resource, energy, etc.