Data-Intensive Computing: From Clouds to GPU Clusters


Data-Intensive Computing: From Clouds to GPU Clusters Gagan Agrawal

Motivation: Parallel data mining is a special case of data-intensive computing. Cloud environments have emerged, offering elasticity, pay-as-you-go pricing, and reliable long-term storage. High-performance systems are also changing: accelerator-based CPU-GPU clusters are among the fastest systems today.

Background: We previously developed MATE (a Map-Reduce system with an AlternaTE API) for multi-core environments. Phoenix implemented Map-Reduce on shared-memory systems; MATE instead adopted Generalized Reduction, first proposed in FREERIDE, which was developed at Ohio State in 2001-2003. A comparison between MATE and Phoenix for data mining applications, covering both performance and API and examining performance overheads, showed that MATE's alternative API outperforms Map-Reduce for some data-intensive applications.

Map-Reduce Execution (figure: execution flow of the Map-Reduce model).

Comparing Processing Structures: The reduction object represents the intermediate state of the execution. The reduction function is commutative and associative, so sorting and grouping overheads are eliminated by the reduction function/object.

Observations on Processing Structures: Map-Reduce is based on a functional idea and does not maintain state, which can lead to overheads in managing intermediate results between map and reduce; the map phase may also generate intermediate results of very large size. The reduction-based approach instead relies on a programmer-managed reduction object. It is not as 'clean', but it avoids sorting of intermediate results, helps shared-memory parallelization, and enables better fault recovery.
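
To make this contrast concrete, here is a minimal C++ sketch of a generalized-reduction style computation (a k-means-like accumulation); the names ReductionObject, local_reduce, and combine are illustrative and are not the actual FREERIDE/MATE API.

```cpp
// Minimal sketch of the generalized-reduction style of processing described
// above; names are illustrative, not the actual MATE/FREERIDE API.
#include <cstddef>
#include <vector>

// Programmer-managed reduction object: accumulates state in place, so no
// large intermediate (key, value) lists need to be sorted or grouped.
struct ReductionObject {
    std::vector<double> sum;   // per-cluster coordinate sums
    std::vector<long>   count; // per-cluster point counts
    ReductionObject(std::size_t k, std::size_t dim)
        : sum(k * dim, 0.0), count(k, 0) {}
};

// Local reduction: each element updates the reduction object directly.
// Must be commutative and associative so threads/nodes can apply it in any order.
void local_reduce(ReductionObject& ro, const double* point,
                  std::size_t dim, std::size_t nearest_cluster) {
    for (std::size_t d = 0; d < dim; ++d)
        ro.sum[nearest_cluster * dim + d] += point[d];
    ro.count[nearest_cluster] += 1;
}

// Global combination: merge per-thread / per-node reduction objects.
void combine(ReductionObject& dst, const ReductionObject& src) {
    for (std::size_t i = 0; i < dst.sum.size(); ++i)   dst.sum[i]   += src.sum[i];
    for (std::size_t i = 0; i < dst.count.size(); ++i) dst.count[i] += src.count[i];
}
```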

Outline: MATE-EC2 and MATE-CG. MATE-EC2: MATE ported to the Amazon EC2 environment; processes data resident in S3; can use heterogeneous environments. MATE-CG: targets GPU clusters; uses both a multi-core CPU and a GPU for the same computation.

MATE-EC2: Motivation. MATE is MapReduce with an AlternaTE API; MATE-EC2 is its implementation for AWS environments. Cloud resources are black boxes, so there is a need for services and tools that can get the most out of cloud resources and help their users with easy APIs. Virtualization hides all scheduling-related operations and the underlying architecture of the cloud environment from users, so the cloud can be seen as a functional black box. This property is desirable for the end user, but developers still need to know the details of the cloud environment in order to provide efficient services and tools; thus there is a need to understand the characteristics of cloud environments. Understanding these characteristics is only the first step: they must then be properly exploited by the tools and services that run on the cloud, so that cloud users can get the most out of the available resources.

MATE-EC2 Design: data organization (three levels: buckets, chunks, and units, plus metadata information); chunk retrieval (threaded data retrieval and selective job assignment); load balancing and handling of heterogeneity (pooling mechanism).
Data organization: buckets are the physical representation of the data on S3 (data is physically stored in data objects and presented as buckets); chunks are logical data blocks inside buckets (improving memory utilization); data units are the minimum units of data processed by the application (improving cache utilization). The metadata is an index file that records, for each chunk, the full path of the data object, the offset of the chunk, the size of the chunk, and the number of units inside the chunk.
Chunk retrieval: with threaded data retrieval, each chunk is requested by several threads, so the bandwidth of the processing node is fully used; with selective job assignment, the chunk is selected from the data object with the fewest active connections, so each bucket's upload bandwidth is exploited.
Load balancing and heterogeneity: a pooling mechanism is used; whenever a processing node finishes a job, it requests another from the master node, which assigns a job from the job pool. Job selection uses the selective assignment above rather than a sequential or random choice.
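
As a rough illustration of this design, the following sketch shows the kind of chunk metadata entry and master-side job pool it implies; the struct and class names are hypothetical, not MATE-EC2's actual code.

```cpp
// Hypothetical sketch of the chunk metadata and job-pool idea described above;
// not the actual MATE-EC2 implementation.
#include <cstdint>
#include <deque>
#include <mutex>
#include <optional>
#include <string>

// One entry of the metadata index file: where a logical chunk lives inside an
// S3 data object and how many application-level units it contains.
struct ChunkMeta {
    std::string object_path;  // full path of the S3 data object (bucket/key)
    std::uint64_t offset;     // byte offset of the chunk in the object
    std::uint64_t size;       // chunk size in bytes
    std::uint32_t num_units;  // number of processing units in the chunk
};

// Master-side job pool: slaves request a job whenever they finish one, which
// naturally balances load across heterogeneous instances.
class JobPool {
public:
    void add(ChunkMeta job) {
        std::lock_guard<std::mutex> lock(mu_);
        jobs_.push_back(std::move(job));
    }
    // In the real system the choice is "selective" (fewest active connections
    // per data object); here we simply hand out the next queued chunk.
    std::optional<ChunkMeta> next() {
        std::lock_guard<std::mutex> lock(mu_);
        if (jobs_.empty()) return std::nullopt;
        ChunkMeta j = jobs_.front();
        jobs_.pop_front();
        return j;
    }
private:
    std::mutex mu_;
    std::deque<ChunkMeta> jobs_;
};
```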

MATE-EC2 Processing Flow (diagram: an EC2 master node, holding the job scheduler and metadata file, assigns chunks of S3 data objects to EC2 slave nodes; retrieval threads on each slave fetch chunks for the computing layer, and a slave requests/retrieves another job when it finishes).

Experiments. Goals: finding the most suitable settings for AWS; performance of MATE-EC2 in heterogeneous and homogeneous environments; performance comparison of MATE-EC2 and Map-Reduce. Applications: KMeans and PCA. Resources: 4 Large EC2 instances for processing and 1 Large instance for the master; 16 data objects on S3 (8.2 GB total data set for both applications).

Different Data Chunk Sizes (KMeans, 16 retrieval threads): comparing the 8 MB chunk size against the others, the performance difference ranges from 1.13x to 1.30x; comparing the 1-thread and 16-thread versions, the improvement ranges from 1.24x to 1.81x.

Different Numbers of Threads (128 MB chunk size): the performance increase shown in the figure for KMeans ranges from 1.37x to 1.90x; for PCA it ranges from 1.38x to 1.71x.

Heterogeneous Environments (L: Large instances, S: Small instances; 128 MB chunk size): the overheads shown in the figure for KMeans are under 1%; for PCA they range from 1.1 to 11.7. The comparison accounts for the available bandwidth of the instances as well as their throughput. PCA has more overhead because it retrieves the data twice and has more synchronization points.

MATE-EC2 vs. Map-Reduce (scalability charts): MATE scales with about 90% efficiency and Map-Reduce with about 74%; the speedup of MATE over Map-Reduce ranges from 3.54x to 4.58x.

Outline (revisited): MATE-EC2 and MATE-CG. MATE-EC2: MATE ported to the Amazon EC2 environment; processes data resident in S3; can use heterogeneous environments. MATE-CG: targets GPU clusters; uses both a multi-core CPU and a GPU for the same computation.

MATE-CG: System Design and Implementation. Execution overview of the MATE-CG system. System API support for heterogeneous computing: data types Input_Space and Reduction_Object; functions CPU_Reduction and GPU_Reduction. Runtime: partitioning the disk-resident dataset among nodes; managing a large reduction object on disk; managing large intermediate data; using GPUs to accelerate computation.

MATE-CG Overview (figure: execution workflow).

System API (figure: data types and functions).
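
Based only on the type and function names listed on these slides (Input_Space, Reduction_Object, CPU_Reduction, GPU_Reduction), here is a hedged sketch of what user code against such an API might look like; the signatures and fields are assumptions, not MATE-CG's published interface.

```cpp
// Hedged sketch of a MATE-CG-style user API, inferred from the names on the
// slides; the signatures and fields are illustrative only.
#include <cstddef>

// A split of the disk-resident input assigned to one device for one pass.
struct Input_Space {
    const char* data;    // pointer to the chunk's bytes
    std::size_t size;    // chunk size in bytes
    std::size_t unit;    // size of one application-level unit
};

// Programmer-managed reduction object shared across iterations.
struct Reduction_Object {
    double* cells;       // accumulator cells (e.g., cluster sums)
    std::size_t n_cells;
};

// User-supplied reduction for the CPU part of a data block:
// walk every unit in `in` and accumulate into `ro`.
void CPU_Reduction(const Input_Space& in, Reduction_Object& ro) {
    for (std::size_t off = 0; off + in.unit <= in.size; off += in.unit) {
        // Application logic would interpret in.data + off as one unit
        // (e.g., one data point) and accumulate into ro.cells here.
        if (ro.n_cells > 0) ro.cells[0] += 1.0;  // placeholder: count units
    }
}

// User-supplied reduction for the GPU part of a block; in a real system this
// would copy the chunk to device memory and launch a kernel.
void GPU_Reduction(const Input_Space& in, Reduction_Object& ro) {
    CPU_Reduction(in, ro);  // placeholder: same logic on the host
}
```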

Implementation Considerations (I): a multi-level data partitioning scheme. First, the partitioning function partitions the input into blocks and distributes them to different nodes, taking data locality into account. Second, heterogeneous data mapping cuts each block into two parts, one for the CPU and one for the GPU; identifying the best mapping is the key question. Third, the splitting function splits each part of a data block into smaller chunks; the observation is that a smaller chunk size works better for the CPU and a larger chunk size for the GPU.
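
A compact sketch of this three-level idea, assuming a block is first cut by a CPU fraction p and then chopped into device-appropriate chunks; the function name is hypothetical, and the 16 MB / 512 MB example chunk sizes are taken from the tuning example later in the talk.

```cpp
// Sketch of the multi-level partitioning described above; the CPU fraction p
// and the chunk sizes are placeholders chosen by the runtime/auto-tuner.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Chunk { std::size_t offset, size; bool on_cpu; };

// Split one node-local block of `block_size` bytes into a CPU part and a GPU
// part (fraction p to the CPU), then into device-appropriate chunks: smaller
// chunks for the CPU, larger chunks for the GPU.
std::vector<Chunk> partition_block(std::size_t block_size, double p,
                                   std::size_t cpu_chunk,   // e.g. 16 MB
                                   std::size_t gpu_chunk) { // e.g. 512 MB
    std::vector<Chunk> chunks;
    std::size_t cpu_bytes = static_cast<std::size_t>(block_size * p);
    std::size_t off = 0;
    while (off < cpu_bytes) {                       // CPU part, small chunks
        std::size_t s = std::min(cpu_chunk, cpu_bytes - off);
        chunks.push_back({off, s, true});
        off += s;
    }
    while (off < block_size) {                      // GPU part, large chunks
        std::size_t s = std::min(gpu_chunk, block_size - off);
        chunks.push_back({off, s, false});
        off += s;
    }
    return chunks;
}
```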

Implementation Considerations (II): management of a large reduction object and large intermediate data. To reduce disk I/O for large reduction objects, data access patterns are used to reuse splits of the reduction object as much as possible; this is transparent to user code. To reduce the network cost of large intermediate data, note that a generic solution that invokes an all-to-all broadcast among all nodes would cause severe performance losses; application-driven optimizations can be used instead to improve performance.

Auto-Tuning Framework. The auto-tuning problem: given an application, find the optimal parameter setting for distributing data between the CPU and the GPU, which have different processing capabilities (for example, 20/80? 50/50? 70/30?). Our approach exploits the iterative nature of many data-intensive applications, which perform similar computations over a number of iterations: an analytical model is constructed to predict performance, and the optimal value is learned over the first few iterations. No compile-time search or tuning is needed, and the runtime overhead is low when there are many iterations.

The Analytical Model (I): We focus on the two main components of the overall running time on each node: the data-processing time on the CPU and/or the GPU, and the overheads on the CPU. First, we write an expression for the CPU-only time; second, for the GPU-only time; third, letting Tcg denote the heterogeneous execution time using both the CPU and the GPU, we combine the two (the equations appeared as figures on the slide).

The Analytical Model (II): Let p represent the fraction of data assigned to the CPU; the CPU and GPU times can then be written as functions of p, and Tcg is related to p as shown in the illustration on the next slide.
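
The slide's equations were images and are not preserved in this transcript; the following is a plausible reconstruction, assuming processing time grows linearly with the data fraction. The symbols T_o (CPU-side overhead) and t_c, t_g (full-dataset processing times on CPU alone and GPU alone) are introduced here for illustration and are not necessarily the original notation.

```latex
% Plausible form of the model (reconstruction, not the slide's exact equations):
% p = fraction of data processed on the CPU.
\begin{align}
  T_c(p)    &= T_o + p \, t_c             \\ % CPU part of each block
  T_g(p)    &= T_o + (1 - p) \, t_g       \\ % GPU part of each block
  T_{cg}(p) &= \max\bigl(T_c(p),\, T_g(p)\bigr) % the two parts run concurrently
\end{align}
```

Minimizing the max gives the balance point T_c(p*) = T_g(p*), which matches the heuristic two slides later of making the CPU and GPU finish simultaneously.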

The Analytical Model (III): illustration of the relationship between Tcg and p (figure).

The Analytical Model (IV): Tcg is minimized by computing the optimal p. To identify the best p, a simple heuristic is used: first, run with p = 1 (CPUs only); second, run with p = 0 (GPUs only); from these runs, obtain the values of the other parameters in the expression and predict an initial p; then adjust p in later iterations to account for variance in the measured values, so that the CPU and the GPU finish simultaneously.
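
A small sketch of this calibration-and-adjustment heuristic, under the same linear-time assumption as the reconstructed model above; the function names and update rule are illustrative, not MATE-CG's actual code.

```cpp
// Sketch of the p-tuning heuristic described above; names are hypothetical and
// the update rule assumes processing time is linear in the data fraction.
struct Iteration { double cpu_time, gpu_time; };

// Calibration: iteration 0 with p = 1 (CPU only) measures t_c_full,
// iteration 1 with p = 0 (GPU only) measures t_g_full.
double initial_p(double t_c_full, double t_g_full) {
    // Balance point of p * t_c_full = (1 - p) * t_g_full (overheads ignored).
    return t_g_full / (t_c_full + t_g_full);
}

// Later iterations: recompute the balance point from the measured times of the
// previous iteration, which used fraction p on the CPU and 1 - p on the GPU.
double adjust_p(double p, const Iteration& last) {
    if (p <= 0.0 || p >= 1.0) return p;       // degenerate: keep current value
    double t_c = last.cpu_time / p;           // estimated full-data CPU time
    double t_g = last.gpu_time / (1.0 - p);   // estimated full-data GPU time
    return t_g / (t_c + t_g);                 // p at which both finish together
}
```

Under this rule, if the GPU part of an iteration takes longer than the CPU part, the next iteration shifts more data to the CPU, which is exactly the "finish simultaneously" condition on the slide.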

Applications: three representatives. The gridding kernel from scientific computing: single pass; converts visibilities into a grid model of the sky. The Expectation-Maximization (EM) algorithm from data mining: iterative; estimates a vector of parameters through two consecutive steps, the Expectation step (E-step) and the Maximization step (M-step). PageRank from graph mining: iterative; computes the relative importance of web pages, and is essentially a matrix-vector multiplication algorithm.

Applications: Optimizations (I). Expectation-Maximization: the large intermediate matrix between the E-step and the M-step could incur high network communication costs if broadcast among all nodes. Optimization: on each node, the M-step reads the same subset of the intermediate matrix that the E-step produced there (use of a common partitioner). PageRank: data-copying overheads are significant on GPUs, and smaller input-vector splits are shared by larger matrix blocks that need further splitting. Optimization: copy shared input-vector splits to the GPU only once to save copying time (fine-grained copying).
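
To illustrate the fine-grained copying idea for PageRank, here is a hedged sketch that caches input-vector splits on the GPU so each shared split is copied only once; cudaMalloc/cudaMemcpy are the standard CUDA runtime calls, while the cache structure and names are assumptions rather than the actual implementation.

```cpp
// Sketch of fine-grained copying: shared input-vector splits are copied to the
// GPU only once and reused by all matrix blocks that need them.
#include <cuda_runtime.h>
#include <unordered_map>
#include <vector>

// Device-side cache of input-vector splits, keyed by split id.
std::unordered_map<int, float*> resident_splits;

float* get_split_on_gpu(int split_id, const std::vector<float>& host_split) {
    auto it = resident_splits.find(split_id);
    if (it != resident_splits.end())
        return it->second;                       // already on the GPU: reuse it
    float* d_ptr = nullptr;
    size_t bytes = host_split.size() * sizeof(float);
    cudaMalloc(reinterpret_cast<void**>(&d_ptr), bytes);   // allocate once
    cudaMemcpy(d_ptr, host_split.data(), bytes, cudaMemcpyHostToDevice);
    resident_splits[split_id] = d_ptr;           // remember for later blocks
    return d_ptr;
}
```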

Applications: Optimizations (II): outline of data copying and computation on GPUs (figure).

Experiments Design (I). Platform: a heterogeneous CPU-GPU cluster; each node has one 8-core Intel CPU and an NVIDIA Tesla (Fermi) GPU with 448 cores; we used up to 128 CPU cores and 7,168 GPU cores on 16 nodes.

Experiments Design (II). Three representative applications: the gridding kernel, EM, and PageRank. Each application is run on the cluster in four modes: CPU-1 (1 CPU core per node, as the baseline), CPU-8 (8 CPU cores per node), GPU-only (only the GPU on each node), and CPU-8-n-GPU (both 8 CPU cores and the GPU on each node).

Experiments Design (III). We focused on four aspects: scalability; the performance improvement from heterogeneous computing; the effectiveness of the auto-tuning framework; and the performance impact of the application-driven optimizations.

Results: Scalability with the Number of GPUs (I). PageRank: 64 GB dataset; a graph of 1 billion nodes and 4 billion edges (chart annotations: 7.0, 6.8, 6.3, 5.0; 16%).

Results: Scalability with the Number of GPUs (II). Gridding kernel: 32 GB dataset; a collection of 800 million visibilities and a 6.4 GB sky grid (chart annotations: 7.5, 7.2, 6.9, 6.5; 25%).

Results: Scalability with the Number of GPUs (III). EM: 32 GB dataset; a cluster of 1 billion points (chart annotations: 7.6, 6.8, 5.0, 15.0, 3.0).

Results: Auto-tuning (I). PageRank: 64 GB dataset on 16 nodes (chart annotations: 7%; p = 0.30).

Results: Auto-tuning (II). EM: 32 GB dataset on 16 nodes (chart annotations: E-step 29%, M-step 24%; E-step p = 0.31, M-step p = 0.27).

Results: Heterogeneous Execution. Gridding kernel: 32 GB dataset on 16 nodes (chart annotations: >=56%, >=42%).

Results: Application-Driven Optimizations (I). EM: 4 GB dataset with a 20 GB intermediate matrix (chart annotations: 1.7, 7.7).

Results: Application-Driven Optimizations (II). PageRank: 32 GB dataset with a block size of 512 MB and a GPU chunk size of 128 MB (chart annotation: 24%).

Results: Examples of System Tuning. Gridding kernel: 32 GB dataset; varying cpu_chunk_size and gpu_chunk_size (chart annotations: 16 MB, 512 MB).

Insights: GPUs can significantly accelerate certain classes of computations, but bring programming difficulties and data-copying overheads. Data mapping between the CPU and the GPU is crucial. Application-specific opportunities should be exploited, and automatic optimization would be desirable.

Summary: Emerging environments, namely clouds and GPU clusters, pose new challenges; middleware support can ease data-intensive computing on them.