DISC: A Domain-Interaction Based Programming Model With Support for Heterogeneous Execution
Mehmet Can Kurt, The Ohio State University
Gagan Agrawal, The Ohio State University
Heterogeneity in HPC - Present and Future
Present:
- Use of accelerators, e.g., a CPU + MIC cluster
Future:
- Decreasing feature sizes will increase process variation
- Power-efficient technologies such as NTV (near-threshold voltage) will compound process variation
- Local power and thermal optimizations
- Relative speeds are application-specific
- Variations can even be dynamic
Application Development for Heterogeneous HPC
Existing programming models (MPI, PGAS):
- Designed (largely) for homogeneous settings
- Require explicit partitioning and communication
Explicit partitioning:
- Must know the relative speed of CPU and MIC cores
- Code is not portable
- Handles only static variations
Task models:
- Not suitable/popular for communication-oriented applications
Our Work
DISC: a high-level programming model
- Built on the notion of a domain and interactions between domain elements
- Suitable for most classes of popular scientific applications
- Abstractions hide data distribution and communication, captured through a domain-interaction API
Key features:
- Automatic partitioning and communication
- Heterogeneous execution support with work redistribution
- Automated resilient execution (ongoing work)
Scientific Applications
- Structured and unstructured grids, N-body simulations
Similarities:
- Iterative structure
- A domain and interactions among domain elements
- Interactions drive the computation
Programming involves bookkeeping:
- Partitions and task assignment
- Identifying data to send/receive
- Preparing input/output buffers
DISC Abstractions: Domain
- The input space is represented as a multidimensional domain
- Data points are domain elements
- Domain initialization through the API leverages automatic partitioning
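A minimal sketch of how a 2D grid domain might be declared through a domain-interaction style API. The disc_domain type and the disc_create_grid_domain call are illustrative assumptions rather than the actual DISC interface; the real runtime would choose the local partition boundaries automatically from the number of processes.

    /* Hypothetical sketch of declaring a 2D grid domain; disc_* names here
     * are illustrative only, not the actual DISC API. */
    #include <stdio.h>

    typedef struct {
        int ndims;          /* dimensionality of the domain           */
        int size[3];        /* extent of the domain in each dimension */
        int lo[3], hi[3];   /* bounds of the locally owned subdomain  */
    } disc_domain;

    /* Stand-in for the runtime call that registers the global domain and
     * returns the automatically chosen local partition. */
    static disc_domain disc_create_grid_domain(int nx, int ny) {
        disc_domain d = { .ndims = 2, .size = { nx, ny, 1 } };
        /* In the real runtime the partition boundaries would be computed
         * from the process count; here the single process owns everything. */
        d.lo[0] = 0; d.lo[1] = 0;
        d.hi[0] = nx - 1; d.hi[1] = ny - 1;
        return d;
    }

    int main(void) {
        disc_domain grid = disc_create_grid_domain(1024, 1024);
        printf("local subdomain: [%d..%d] x [%d..%d]\n",
               grid.lo[0], grid.hi[0], grid.lo[1], grid.hi[1]);
        return 0;
    }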
DISC Abstractions: Interaction between Domain Elements
- Grid-based interactions (inferred from the domain type)
- Radius-based interactions (specified by a cutoff distance)
- Explicit-list based interactions (specified by point connectivity)
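A sketch of how these interaction types could be registered. disc_add_interaction() is named later in the talk for explicit-list interactions, but its exact signature, and the radius-based disc_set_cutoff() helper, are assumptions made purely for illustration.

    /* Sketch of registering interactions between domain elements.
     * Signatures are illustrative; only the disc_add_interaction() name
     * comes from the talk. */
    #include <stdio.h>

    #define MAX_EDGES 16

    typedef struct { int src, dst; } disc_edge;

    static disc_edge interaction_list[MAX_EDGES];
    static int       interaction_count = 0;
    static double    cutoff_radius     = 0.0;

    /* Explicit-list interaction: point `src` reads attributes of point `dst`. */
    static void disc_add_interaction(int src, int dst) {
        if (interaction_count < MAX_EDGES)
            interaction_list[interaction_count++] = (disc_edge){ src, dst };
    }

    /* Radius-based interaction: every pair of points within `rc` interacts. */
    static void disc_set_cutoff(double rc) { cutoff_radius = rc; }

    int main(void) {
        /* Unstructured mesh: edges of the mesh define the interactions. */
        disc_add_interaction(0, 1);
        disc_add_interaction(1, 2);

        /* Molecular dynamics: interactions given by a cutoff distance. */
        disc_set_cutoff(2.5);

        printf("%d explicit interactions, cutoff = %.1f\n",
               interaction_count, cutoff_radius);
        return 0;
    }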
Compute-Function and Computation-Space
Compute-function:
- Calculates new values for point attributes
- Invoked by the runtime at each iteration
Computation-space (one per subdomain):
- Updates are performed on the computation-space
- Leverages automatic repartitioning
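A sketch of a user-supplied compute-function for a 5-point Jacobi stencil. The signature is an assumption (the real API presumably passes the subdomain and computation-space through runtime handles), but it illustrates the stated roles: invoked by the runtime once per iteration, reading the domain and writing updates into a separate computation-space.

    /* Sketch of a compute-function for a 5-point Jacobi stencil; the
     * signature is illustrative, not the actual DISC API. */
    #include <stdio.h>

    #define NX 6
    #define NY 6

    /* Updates go to a separate computation-space (out) so the runtime can
     * repartition it independently of the input domain (in). */
    static void jacobi_compute(double in[NX][NY], double out[NX][NY]) {
        for (int i = 1; i < NX - 1; ++i)
            for (int j = 1; j < NY - 1; ++j)
                out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j] +
                                    in[i][j-1] + in[i][j+1]);
    }

    int main(void) {
        static double in[NX][NY], out[NX][NY];
        for (int i = 0; i < NX; ++i)
            for (int j = 0; j < NY; ++j)
                in[i][j] = (i == 0 || j == 0 || i == NX-1 || j == NY-1) ? 1.0 : 0.0;

        jacobi_compute(in, out);   /* one iteration; the runtime would loop */
        printf("out[1][1] = %.2f\n", out[1][1]);
        return 0;
    }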
Runtime Communication Generation from the Domain-Interaction API
- Each subdomain needs updated attributes of the elements it interacts with in other subdomains
- The DISC runtime knows the partitioning (boundaries of each subdomain) and the nature of interaction among points
- Automatic communication generation identifies which elements should be sent where and places received values in runtime structures
Runtime Communication Generation from the Domain-Interaction API
Grid-based interactions:
- Seen in stencil patterns; acquire ghost rows and columns
- Single exchange with immediate neighbors (east, west, north, south); sketched below
Radius-based interactions:
- Seen in molecular dynamics (cutoff distance r_c); acquire all elements inside a sphere of radius r_c
- One or more exchanges (depending on r_c) with immediate neighbors
Explicit-list based interactions:
- Specified explicitly by the disc_add_interaction() routine
- Exchanges with any subdomains (not just immediate neighbors)
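The grid-based exchange below is a hand-written MPI illustration of what the DISC runtime generates automatically for stencil patterns: each rank trades one ghost row with its immediate north/south neighbors in a 1D row-wise decomposition. The decomposition and buffer layout are assumptions for the example, not DISC internals.

    /* Illustration of the ghost-row exchange the runtime automates for
     * grid-based interactions (1D row-wise decomposition). */
    #include <mpi.h>
    #include <stdio.h>

    #define NY 8   /* columns per row */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Rows 1..2 are owned; rows 0 and 3 are ghost rows to be filled. */
        double local[4][NY];
        for (int j = 0; j < NY; ++j) {
            local[0][j] = -1.0;        /* north ghost row */
            local[1][j] = rank;        /* first owned row */
            local[2][j] = rank + 0.5;  /* last owned row  */
            local[3][j] = -1.0;        /* south ghost row */
        }

        int north = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int south = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        /* Send first owned row north, receive the south ghost row, and
         * vice versa; MPI_PROC_NULL makes boundary ranks no-ops. */
        MPI_Sendrecv(local[1], NY, MPI_DOUBLE, north, 0,
                     local[3], NY, MPI_DOUBLE, south, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(local[2], NY, MPI_DOUBLE, south, 1,
                     local[0], NY, MPI_DOUBLE, north, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d: north ghost = %.1f, south ghost = %.1f\n",
               rank, local[0][0], local[3][0]);
        MPI_Finalize();
        return 0;
    }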
Work Redistribution for Heterogeneity
Main idea: shrinking/expanding a subdomain changes a processor's workload
- t_i: unit-processing time of processor i, defined as t_i = T_i / n_i
- T_i: total time spent on compute-functions
- n_i: number of local points in the assigned subdomain
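A sketch of how each process could measure its unit-processing time t_i = T_i / n_i as defined above: accumulate the time spent in compute-functions over the iterations and divide by the number of locally owned points. The timing loop body and the point count are placeholders.

    /* Sketch of measuring t_i = T_i / n_i on each process. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int n_local = 100000;       /* points in this rank's subdomain (n_i) */
        double T_compute = 0.0;     /* total time in compute-functions (T_i) */

        for (int iter = 0; iter < 10; ++iter) {
            double t0 = MPI_Wtime();
            /* ... invoke the compute-function over the local subdomain ... */
            volatile double s = 0.0;
            for (int p = 0; p < n_local; ++p) s += p * 1e-9;
            T_compute += MPI_Wtime() - t0;
        }

        double t_unit = T_compute / n_local;   /* t_i = T_i / n_i */
        printf("rank %d: unit-processing time = %g s/point\n", rank, t_unit);

        MPI_Finalize();
        return 0;
    }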
Work Redistribution for Heterogeneity
1D case:
- Size of each subdomain is made inversely proportional to its unit-processing time (sketched below)
2D/3D case: expressed as a non-linear optimization problem
    min T_max
    s.t.  x_r1 * y_r1 * t_1 <= T_max
          x_r2 * y_r1 * t_2 <= T_max
          ...
          x_r1 + x_r2 + x_r3 = x_r
          y_r1 + y_r2 = y_r
- Each constraint bounds one processor's time (the points in its x_ri-by-y_rj rectangle times its unit-processing time); the partition widths and heights must sum to the domain extents x_r and y_r
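A sketch of the 1D repartitioning rule stated above: each processor's new subdomain size is made inversely proportional to its unit-processing time, so all processors need roughly equal time for their shares. The example unit-processing times and the naive rounding are assumptions.

    /* Sketch of the 1D rule: subdomain size inversely proportional to t_i. */
    #include <stdio.h>

    int main(void) {
        const int    N = 1200;                      /* total points in the domain */
        const double t[] = { 1.0, 1.0, 1.4, 1.4 };  /* unit-processing times t_i  */
        const int    P = sizeof t / sizeof t[0];

        double inv_sum = 0.0;
        for (int i = 0; i < P; ++i) inv_sum += 1.0 / t[i];

        int assigned = 0;
        for (int i = 0; i < P; ++i) {
            /* share_i = N * (1/t_i) / sum_j (1/t_j); last rank absorbs rounding */
            int share = (i == P - 1) ? N - assigned
                                     : (int)(N * (1.0 / t[i]) / inv_sum);
            assigned += share;
            printf("processor %d: t_i = %.1f, new subdomain size = %d\n",
                   i, t[i], share);
        }
        return 0;
    }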
Example Scenario (figures: subdomains before repartitioning and after repartitioning)
Implementation: Putting it All Together
Other Benefits of DISC
Can support restart with a different number of nodes, i.e., partition to a different number of processes
Why?
- Failure with no replacement node
- Performance within a power budget
- Exploiting cloud elasticity
- More flexible scheduling on HPC platforms
- Switching off nodes/cores for power/thermal reasons
DISC and Automated Resilient Execution
- Supports automated application-level checkpointing using the notion of domains and computation-spaces
- Can also help with soft errors:
  - Separates data and control, so communication and synchronization can be protected
  - Exposes the iterative structure
  - The applicable technique can depend on the nature of the interactions
- Ongoing work
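One plausible illustration (not the actual DISC mechanism, which is ongoing work) of application-level checkpointing over a computation-space: each rank periodically serializes its locally owned points together with the iteration number, so a restart can rebuild the domain, possibly on a different number of processes. The file layout and the disc_checkpoint name are assumptions.

    /* Hypothetical sketch of checkpointing a computation-space to disk. */
    #include <stdio.h>
    #include <stdlib.h>

    static int disc_checkpoint(int rank, int iter, const double *space, int n) {
        char path[64];
        snprintf(path, sizeof path, "checkpoint_rank%d.bin", rank);
        FILE *f = fopen(path, "wb");
        if (!f) return -1;
        /* Header: iteration number and local point count, then the data. */
        fwrite(&iter, sizeof iter, 1, f);
        fwrite(&n, sizeof n, 1, f);
        fwrite(space, sizeof *space, (size_t)n, f);
        fclose(f);
        return 0;
    }

    int main(void) {
        const int n = 1000, rank = 0;
        double *space = malloc(n * sizeof *space);
        for (int i = 0; i < n; ++i) space[i] = i * 0.5;  /* fake attribute values */

        for (int iter = 1; iter <= 50; ++iter) {
            /* ... compute-function updates `space` here ... */
            if (iter % 10 == 0 && disc_checkpoint(rank, iter, space, n) != 0)
                fprintf(stderr, "checkpoint failed at iteration %d\n", iter);
        }
        free(space);
        return 0;
    }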
Experiments
- Implemented in C on MPICH2
- Each node has two quad-core 2.53 GHz Intel(R) Xeon(R) processors with 12 GB RAM
- Up to 128 nodes (using a single core on each node)
Applications:
- Stencil (Jacobi, Sobel)
- Unstructured grid (Euler)
- Molecular dynamics (MiniMD)
Homogeneous Configurations (charts: MiniMD, Euler)
- Comparison against MPI implementations
- Average overheads: 2.7% (MiniMD), < 1% (Euler)
Homogeneous Configurations (charts: Jacobi, Sobel)
- Average overheads: 0.5% (Jacobi), 3.8% (Sobel)
Heterogeneous Configurations (varying number of cores slowed by 40%; charts: MiniMD, Euler)
- Slowdown reduction: 54%, 10-15%, 67-73%, 41-47%
Heterogeneous Configurations (varying number of cores slowed by 40%; charts: Jacobi, Sobel)
- Slowdown reduction: 47-51%, 8-25%, 56%, 14%
Heterogeneous Configurations (64 cores slowed by varying percentages; charts: MiniMD, Euler)
- disc-perfect: T_disc x (P_homogeneous / P_heterogeneous), i.e., the ideal execution time if the lost processing capacity were perfectly absorbed
- Chart annotations (25% and 25-50% slowdown cases): 25%, 9%, 83%, 18%, 36%, 25%, 111%, 55%
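Assuming P_homogeneous and P_heterogeneous denote the aggregate processing power of the homogeneous and heterogeneous configurations, a worked instance of the disc-perfect baseline for a hypothetical run of 128 cores with 64 of them slowed by 25% would read:

    T_{\mathrm{disc\text{-}perfect}} = T_{\mathrm{disc}} \cdot \frac{P_{\mathrm{homogeneous}}}{P_{\mathrm{heterogeneous}}}
                                     = T_{\mathrm{disc}} \cdot \frac{128}{64 + 64 \cdot 0.75}
                                     = T_{\mathrm{disc}} \cdot \frac{128}{112} \approx 1.14 \, T_{\mathrm{disc}}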
Charm++ Comparison: Euler (4 nodes slowed down out of 16)
- Different load-balancing strategies for Charm++ (RefineLB); load-balance once at the beginning
- (a) Homogeneous: Charm++ % slower than DISC
- (c) Heterogeneous with load balancing: Charm++, at 64 chares (best case), 14.5% slower than DISC
Decomposition across CPU and Accelerator (chart)
- Process I (CPU), Process II (GPU)
- Asterisks (*) mark DISC's decision
Conclusion
- A parallel programming model for scientific applications
- Automatic work partitioning and communication
- Automatic repartitioning for heterogeneity support
Thank you. Questions?