
Towards Dynamic Reconfigurable Load-Balancing for Hybrid Desktop Platforms
Alécio P. D. Binotto, Carlos E. Pereira, Dieter W. Fellner
24th IEEE International Parallel and Distributed Processing Symposium, PhD Forum

Motivation
Desktop accelerators (such as GPUs) form a powerful heterogeneous platform in conjunction with multi-core CPUs. To improve application performance on these platforms, load balancing plays an important role in distributing the workload, and it is even more important to consider online parameters that cannot be known a priori when scheduling. A Computational Fluid Dynamics (CFD) application developed at Fraunhofer IGD is used to exemplify the implementation of iterative solvers for Systems of Linear Equations (SLEs) and the need for a hybrid CPU-GPU approach to optimize performance.

Innovation
The approach abstracts the processing units (PUs) using the OpenCL API as a platform-independent programming model, and proposes to extend OpenCL with a high-level module that schedules and balances the workload between the CPU and the GPUs for this case study. Starting from an initial scheduling configuration established when the application starts, an online profiler monitors and stores task execution times and platform conditions. Since the tasks are non-deterministic, a reconfigurable dynamic scheduling is performed during execution, taking changes in runtime conditions into account. [Figures on the poster depict the overall approach and the two-phase scheduling described next.]

First Assignment: the first guess faces a multidimensional scheduling problem of NP-hard complexity, which becomes even harder with more than two PUs and several tasks. This assignment is therefore based on heuristics that use the performance benchmark described under Preliminary Results. The first scheduling is coded statically but evaluated dynamically just after the application starts, considering the domain size and the premise that the PUs are idle, and applying a rule based on the break-even point values presented in the next section. However, this strategy can lead to a PU overload, i.e., too many tasks being scheduled to the same PU, which shows the need for context-aware adaptation.

Dynamic Reconfiguration: after the first assignment, the information provided by the profiler is taken into account. Based on estimated costs, dynamic parameters, and awareness of runtime conditions, a task is moved to another PU only if its estimated execution time on the new PU is lower than on its current PU.
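The poster contains no code; the following is a minimal C++ sketch of the two-phase rule just described. All names (Task, PU, firstAssignment, reconfigure) are hypothetical, and the 2K break-even threshold is borrowed from the CG results reported below.

```cpp
// Hypothetical sketch of the two-phase scheduling rule; not the poster's code.
#include <cstddef>
#include <vector>

enum class PU { CPU, GPU_8800GT, GPU_GTX285 };

struct Task {
    std::size_t unknowns;  // problem (domain) size
    PU assigned;
};

// Phase 1: first assignment from a break-even rule -- below the measured
// break-even point (~2K unknowns for CG in the poster) the CPU wins.
PU firstAssignment(const Task& t, std::size_t breakEven = 2048) {
    return (t.unknowns <= breakEven) ? PU::CPU : PU::GPU_GTX285;
}

// Phase 2: dynamic reconfiguration -- move a task only if the profiled
// estimate on another PU beats the estimate on its current PU.
template <typename EstimateFn>  // EstimateFn: (Task, PU) -> seconds
PU reconfigure(const Task& t, const std::vector<PU>& pus, EstimateFn estimate) {
    PU best = t.assigned;
    double bestTime = estimate(t, t.assigned);
    for (PU pu : pus) {
        double time = estimate(t, pu);  // cost model fed by the online profiler
        if (time < bestTime) { bestTime = time; best = pu; }
    }
    return best;
}
```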

Preliminary Results
Specifically, three iterative methods for solving SLEs are analyzed: Jacobi, red-black Gauss-Seidel (GS), and Conjugate Gradient (CG). The break-even points over the PUs (a quad-core CPU, an 8800GT GPU, and a GTX285 GPU), used in the first assignment of the scheduling, are presented as first results. The poster plots show the performance of the solvers on the most powerful PU used, the GTX285 GPU: for small problem sizes GS and Jacobi are faster, while for large problems CG is. A further plot compares CG across the PUs: for problems up to 2K unknowns, the CPU outperforms the GPUs.

The GPU implementation was made using CUDA and optimized for memory coalescing; one plot compares CG with and without our new strategy for enabling coalescing (a sketch of the coalesced access pattern follows the acknowledgments). CG was also used to compare performance on multiple GPUs: about 2M unknowns are needed before using two GPUs pays off, since with fewer elements the communication overhead grows. The multi-GPU results demonstrate that the speedup depends on the problem size; the achieved speedup was 1.7 for 8M unknowns.

Importance
This PhD research follows the state of the art, as several scientific applications can now be executed on multi-core desktop platforms. To our knowledge, research is still needed on supporting load balancing over CPU-GPU (or CPU-accelerator) platforms. The work shows its relevance by analyzing not only platform characteristics, but also the platform's execution context.

Conclusions and Remaining Challenges
The performance evaluation verified the need for dynamic scheduling strategies that improve on OpenCL. We propose a method for dynamic load balancing over CPU and GPU, applied to SLE solvers. The remaining challenges are to develop the load-balancing reconfiguration phase for the presented case study, and to analyze the application's performance without the proposed method in order to evaluate the strategy's overhead.

Acknowledgments
Thanks to Daniel Weber and Christian Daniel for their support. A. Binotto thanks the support given by DAAD and the Alβan programme, scholarship no. E07D402961BR.
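To make the coalescing point concrete, here is a minimal CUDA sketch, an assumption rather than the poster's code, of one Jacobi sweep on a 2D grid stored row-major: consecutive threads along x load consecutive addresses, which is exactly the coalesced access pattern such an optimization targets.

```cuda
// Hypothetical CUDA sketch of one Jacobi sweep for a 2D Poisson-like SLE;
// not taken from the poster. Row-major storage with threadIdx.x mapped to
// the fastest-varying index makes neighbouring threads load consecutive
// addresses, i.e. the global-memory accesses are coalesced.
__global__ void jacobiSweep(const float* __restrict__ u,
                            float* __restrict__ uNew,
                            const float* __restrict__ rhs,  // holds h^2 * f
                            int nx, int ny) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // x index (contiguous)
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // y index
    if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
        int idx = j * nx + i;
        // 5-point stencil update: average of neighbours plus scaled RHS
        uNew[idx] = 0.25f * (u[idx - 1] + u[idx + 1] +
                             u[idx - nx] + u[idx + nx] + rhs[idx]);
    }
}
```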