Dynamic Scheduling Monte-Carlo Framework for Multi-Accelerator Heterogeneous Clusters
Authors: Anson H.T. Tse, David B. Thomas, K.H. Tsoi, Wayne Luk

Presentation transcript:

Dynamic Scheduling Monte-Carlo Framework for Multi-Accelerator Heterogeneous Clusters
Authors: Anson H.T. Tse, David B. Thomas, K.H. Tsoi, Wayne Luk
Source: Field-Programmable Technology (FPT), pp. , Dec.
Presenter: Ming-Chih Li, ESL, Dept. of CSIE, CCU, 2011/09/16

Outline
- Introduction
- Heterogeneous Framework
- Scheduling Policies
- Applications
- Performance Evaluation
- Conclusions

Outline
- Introduction
- Heterogeneous Framework
- Scheduling Policies
- Applications
- Performance Evaluation
- Conclusions

Introduction
- Goal: increase the raw computation capacity of a system
  - Computational power
  - Number of processing units
- High Performance Computing (HPC) systems use:
  - Co-processing accelerators (FPGA, GPU)
  - Distributed computing: several nodes in a cluster

Introduction (cont'd)
- Design challenges:
  - Hardware accelerators are customized for specific computation and communication patterns
  - High non-recurring engineering cost
  - Communication overhead

Introduction (cont'd)
- This research focuses on Monte-Carlo (MC) simulation problems
- Contributions:
  - A scalable distributed Monte-Carlo framework for multi-accelerator heterogeneous clusters
  - Load-balancing schemes with dynamic runtime scheduling
  - Two applications mapped onto the framework

Introduction (cont'd)
- What is a Monte-Carlo simulation problem?
  - A class of computational algorithms that rely on repeated random sampling to compute their results
  - Used, for example, in financial applications in banks
- Example: calculating the value of PI
  - Draw random points (x, y) with x, y in [0, 1]
  - Area of the square: 1 * 1 = 1
  - (# of in-circle points / total points) * area of square ≈ area of circle = PI * r^2, from which PI is recovered
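A minimal sketch of this estimator in C (illustrative only, not the paper's code). It uses the quarter circle of radius 1 inside the unit square, whose area is PI/4, so PI ≈ 4 * (in-circle points / total points):

```c
/* Minimal sketch (not from the paper): Monte-Carlo estimation of PI
 * by sampling random points in the unit square and counting how many
 * fall inside the quarter circle of radius 1. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const long total = 10000000;   /* number of random samples */
    long inside = 0;
    srand(42);                     /* fixed seed for reproducibility */
    for (long i = 0; i < total; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)  /* point lies inside the quarter circle */
            inside++;
    }
    /* quarter-circle area is PI/4, so PI ~= 4 * (inside / total) */
    printf("PI ~= %f\n", 4.0 * inside / total);
    return 0;
}
```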

Outline
- Introduction
- Heterogeneous Framework
- Scheduling Policies
- Applications
- Performance Evaluation
- Conclusions

Heterogeneous Framework
- Three major concerns:
  - Application programmer productivity: no new languages and tool chains
  - Scalability of the approach: a hierarchical model
  - Resource utilization efficiency: extensible dynamic scheduling policies based on computational performance or energy consumption

Heterogeneous Framework (cont'd)
[Three figure-only slides: framework architecture diagrams; the images are not preserved in the transcript.]

Outline
- Introduction
- Heterogeneous Framework
- Scheduling Policies
- Applications
- Performance Evaluation
- Conclusions

Scheduling Policies
- Computational performance differs between nodes and between different accelerators of the same node
- Improper task distribution leads to a drastic performance reduction
- Example: computing rates of FPGA = 1000 tasks/s and CPU = 1 task/s
  - [Figure: in one node, the MC distributor splits the tasks evenly between FPGA and CPU]
  - With an even split of, e.g., 2000 tasks, the CPU alone needs 1000 s, so the total time is 1000 s

Scheduling Policies (cont'd)
- Same setting: FPGA = 1000 tasks/s and CPU = 1 task/s
  - [Figure: the MC distributor assigns tasks in proportion to each device's computing rate]
  - With the same 2000 tasks split in proportion to the rates (about 1998 to the FPGA, 2 to the CPU), both finish together and the total time is about 2 s

Scheduling Policies (cont'd)
- One static and two dynamic scheduling policies are proposed:
  A. Constant-Size policy
  B. Linear-Incremental policy
  C. Exponential-Incremental policy
- Definitions:
  - TS_init: initial task size for all child processes
  - TS_i^j: task size for child i at the j-th round of simulation
  - R_d: remaining uncompleted task size

Scheduling Policies (cont'd)
A. Constant-Size policy: TS_i^j = min(TS_init, R_d)
- Example: with total simulation task size = 120 and TS_init = 50:
  - TS_i^1 = 50, R_d = 70
  - TS_i^2 = 50, R_d = 20
  - TS_i^3 = 20, R_d = 0

Scheduling Policies (cont'd)
B. Linear-Incremental policy: TS_i^j = min(TS_init + (j − 1) * c, R_d), where c is the increment per round
- Example: with total simulation task size = 120, TS_init = 50, and c = 5:
  - TS_i^1 = 50, R_d = 70
  - TS_i^2 = 55, R_d = 15
  - TS_i^3 = 15, R_d = 0

Scheduling Policies (cont'd)
C. Exponential-Incremental policy: TS_i^j = min(TS_init * m^(j−1), R_d), where m is the multiplier per round
- Example: with total simulation task size = 500, TS_init = 50, and m = 2:
  - TS_i^1 = 50, R_d = 450
  - TS_i^2 = 100, R_d = 350
  - TS_i^3 = 200, R_d = 150
  - TS_i^4 = 150, R_d = 0
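The three policies fit in a few lines of code. Below is a minimal C sketch (not the authors' implementation; the min(...) forms are reconstructed from the worked examples on these slides):

```c
/* Minimal sketch (not the authors' code): next task size for child i at
 * round j under the three policies. ts_init, c (linear increment) and
 * m (multiplier) follow the slide examples. */
#include <stdio.h>

typedef enum { CONSTANT, LINEAR, EXPONENTIAL } policy_t;

long next_task_size(policy_t p, int j, long ts_init, long c, long m, long rd) {
    long ts = ts_init;
    switch (p) {
        case CONSTANT:    ts = ts_init;                        break;
        case LINEAR:      ts = ts_init + (long)(j - 1) * c;    break;
        case EXPONENTIAL: for (int k = 1; k < j; k++) ts *= m; break;
    }
    return ts < rd ? ts : rd;   /* never exceed the remaining task size */
}

int main(void) {
    /* Reproduce the Exponential-Incremental example: total = 500,
     * TS_init = 50, m = 2  ->  chunks of 50, 100, 200, 150 */
    long rd = 500;
    for (int j = 1; rd > 0; j++) {
        long ts = next_task_size(EXPONENTIAL, j, 50, 0, 2, rd);
        rd -= ts;
        printf("TS^%d = %ld, Rd = %ld\n", j, ts, rd);
    }
    return 0;
}
```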

Scheduling Policies (cont'd)
- Other possible policies:
  - Mixed scheduling policy: use the Linear-Incremental policy at the beginning, then switch to Constant-Size after a certain number of iterations
  - Energy-Equal scheduling policy: each MC worker consumes the same amount of computational energy

Outline
- Introduction
- Heterogeneous Framework
- Scheduling Policies
- Applications
- Performance Evaluation
- Conclusions

Applications
- Two applications have been implemented in the proposed framework:
  - Asian option pricing using the control variate method
  - GARCH asset simulation

Applications (cont'd)
- FPGA kernel: the Constant-Size scheduling policy is the best choice, as all MC cores finish the computation in exactly the same cycle

Applications (cont'd)
- The number of pipeline stages must be identical for all pipelined loops in order to guarantee a consistent computation schedule

Applications (cont'd)
- Implemented on a Xilinx Virtex-5 xc5vlx330t FPGA

Applications (cont'd)
- GPU kernel:
  - GPUs are Single Instruction Multiple Data (SIMD) computing devices
  - Implemented as CUDA kernels
- CPU kernel:
  - C language, using the Intel Math Kernel Library (MKL)
  - Compiled with the Intel compiler (icc) 11.1 with -O3
  - Parallelized with OpenMP using the parallel-for #pragma
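A hedged sketch of what such an OpenMP parallel-for MC loop looks like in C (illustrative only; the authors' kernel prices options rather than estimating PI):

```c
/* Minimal sketch (not the authors' kernel): an OpenMP parallel-for
 * Monte-Carlo loop in the style described for the CPU kernel.
 * Compile with, e.g.:  icc -O3 -qopenmp mc.c   or   gcc -O3 -fopenmp mc.c */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const long n = 10000000;   /* task size handed to this MC worker */
    long inside = 0;

    #pragma omp parallel reduction(+:inside)
    {
        unsigned int seed = 1234u + omp_get_thread_num(); /* per-thread seed */
        #pragma omp for
        for (long i = 0; i < n; i++) {
            double x = (double)rand_r(&seed) / RAND_MAX;
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0)   /* inside the quarter circle */
                inside++;
        }
    }
    printf("PI ~= %f\n", 4.0 * inside / n);
    return 0;
}
```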

Outline
- Introduction
- Heterogeneous Framework
- Scheduling Policies
- Applications
- Performance Evaluation
- Conclusions

Performance Evaluation
- An accelerator cluster consisting of 8 server nodes, each with:
  - two AMD Phenom 9650 Quad-Core 2.3 GHz CPUs
  - one NVIDIA Tesla C1060 GPU
  - one Xilinx Virtex-5 xc5vlx330t FPGA

Performance Evaluation (cont'd)
- Dynamic scheduling analysis of a single node:
  - The number of Monte-Carlo simulations is 10,000,000
  - Using the Linear-Incremental policy with TS_init = 1000

Performance Evaluation (cont'd)
- Dynamic scheduling analysis of a single node [figure-only slide; the results chart is not preserved in the transcript]

Performance Evaluation (cont'd)
- Performance, energy, and efficiency analysis of accelerator allocation in a cluster
- Acceleration performance versus energy consumption, measured with a power monitor
- Additional Power Consumption for Computation (APCC): APCC = run-time power − static power
- Additional Energy Consumption for Computation (AECC): the APCC accumulated over the execution time
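As a worked example with illustrative numbers (not measurements from the paper): if a node draws 300 W under load and 200 W when idle, its APCC is 300 − 200 = 100 W; a 50 s run then has an AECC of 100 W × 50 s = 5000 J.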

Performance Evaluation (cont'd)
- The total number of Monte-Carlo simulations is 100M
- Each node uses the Linear-Incremental policy with TS_init = 1000
- The Constant-Size scheduling policy is employed at the higher-level MC distributor, with TS_init = 100M, 50M, 25M, and 12.5M for clusters with 1, 2, 4, and 8 nodes respectively
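A minimal C sketch of this two-level distribution (illustrative assumptions: the linear increment c = 5 and the number of rounds shown are not given on the slides):

```c
/* Minimal sketch (not the authors' code) of the two-level scheduling
 * described above: the cluster-level MC distributor uses Constant-Size
 * chunks of total/nodes simulations, and each node-level distributor
 * then issues Linear-Incremental chunks (TS_init = 1000) to its local
 * workers. The increment c = 5 is an illustrative assumption. */
#include <stdio.h>

static long issue(long ts, long *rd) {   /* clamp a chunk to what is left */
    if (ts > *rd) ts = *rd;
    *rd -= ts;
    return ts;
}

int main(void) {
    const long total = 100000000;        /* 100M simulations */
    const int  nodes = 4;                /* 4-node cluster -> 25M per node */
    long cluster_rd = total;

    for (int n = 0; n < nodes; n++) {
        /* Cluster level: Constant-Size with TS_init = total/nodes. */
        long node_rd = issue(total / nodes, &cluster_rd);
        /* Node level: Linear-Incremental; first three rounds shown. */
        long ts_init = 1000, c = 5;
        for (int j = 1; node_rd > 0 && j <= 3; j++) {
            long ts = issue(ts_init + (long)(j - 1) * c, &node_rd);
            printf("node %d, round %d: issue %ld (remaining %ld)\n",
                   n, j, ts, node_rd);
        }
    }
    return 0;
}
```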

Performance Evaluation (cont'd)
[Two figure-only slides: performance and energy results charts; the images are not preserved in the transcript.]

Outline
- Introduction
- Heterogeneous Framework
- Scheduling Policies
- Applications
- Performance Evaluation
- Conclusions

Conclusions
- A dynamic scheduling Monte-Carlo framework is proposed for collaborative computation in a multi-accelerator heterogeneous cluster
- The load-balancing process is automated by employing dynamic scheduling policies within the proposed framework
- The framework is scalable and extensible to a variety of dynamic scheduling policies
- The framework's viability is shown by mapping two applications involving financial computation onto it

Conclusions (cont'd)
- Future work:
  - Automating design development within the framework
  - Testing applications involving data dependency
  - Collaborating with other institutes to form a "cluster of heterogeneous clusters" for solving practical scientific problems

Thanks for your attention!