Accelerators for HPC: Programming Models
StreamIt on GPU; High Performance Applications on Heterogeneous Windows Clusters

Presentation transcript:

Microsoft External Research Initiative

StreamIt
StreamIt is a high-level programming model in which graph nodes represent computation and arcs represent communication. It exposes task, data, and pipeline parallelism.
Example: a two-band equalizer
Signal → Duplicate Splitter → Bandpass Filter | Bandpass Filter → Combiner → Output

StreamIt on GPUs
StreamIt provides a natural way of implementing DSP applications on GPUs. Challenges:
– Determining the optimal execution configuration for each filter
– Harnessing the data parallelism within each SM
– Orchestrating the execution across SMs to exploit task and pipeline parallelism
– Register constraints

Our Approach
– A good execution configuration is determined by profiling: identify a near-optimal number of concurrent thread instances per filter, taking register constraints into consideration.
– Work scheduling and processor (SM) assignment are formulated as a unified Integer Linear Programming (ILP) problem that takes communication bandwidth restrictions into account.
– An efficient buffer layout scheme ensures that all accesses to GPU memory are coalesced.
– Ongoing work: stateful filters are assigned to CPUs, enabling synergistic execution across CPUs and GPUs.

GPUs
GPUs are massively data-parallel computation engines, capable of delivering TFLOP-level throughput.
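The two-band equalizer graph above can be sketched in plain Python. This is only an illustration of the split-join structure: the FIR taps and helper names are hypothetical (real StreamIt programs are written in StreamIt's own language, and the poster does not give filter coefficients):

```python
# Sketch of the StreamIt two-band equalizer: a duplicate splitter feeds
# two band-pass filters whose outputs are recombined by a combiner.
# The 3-tap FIR coefficients below are illustrative, not from the poster.

def fir(signal, taps):
    """One StreamIt 'filter' node: an FIR filter with zero pre-padding."""
    n = len(taps)
    padded = [0.0] * (n - 1) + list(signal)
    return [sum(taps[k] * padded[i + n - 1 - k] for k in range(n))
            for i in range(len(signal))]

def duplicate_splitter(signal, ways=2):
    """Duplicate splitter: every branch sees the full input stream."""
    return [list(signal) for _ in range(ways)]

def combiner(streams):
    """Combiner: sum corresponding elements of the branch outputs."""
    return [sum(vals) for vals in zip(*streams)]

def two_band_equalizer(signal, low_taps, high_taps):
    low, high = duplicate_splitter(signal)
    return combiner([fir(low, low_taps), fir(high, high_taps)])

out = two_band_equalizer([1.0, 2.0, 3.0, 4.0],
                         low_taps=[0.25, 0.5, 0.25],    # smoothing branch
                         high_taps=[-0.25, 0.5, -0.25])  # differencing branch
```

Each function plays the role of one node in the stream graph; a StreamIt compiler would map such nodes onto SMs and schedule them to exploit the task, data, and pipeline parallelism the graph exposes.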
GPU Architecture
GPUs are organized as a set of Streaming Multiprocessors (SMs), each of which can be viewed as a very wide SIMD unit. They are typically programmed using CUDA or OpenCL, extensions of C/C++ that come with a steep learning curve and portability issues.

The Process

The Results
– Speedups of up to 33.82x over single-threaded CPU execution
– The software-pipelined scheme outperforms a naive serial SAS schedule in most cases
– Coalesced accesses yield large performance gains, as is evident from the poor performance of SWPNC
– Further details in "Software Pipelined Execution of Stream Programs on GPUs" (CGO 2009)

The Problem
– A variety of general-purpose hardware accelerators: GPUs (NVIDIA, ATI, Intel) and accelerators such as ClearSpeed and the Cell BE – all said and done, a plethora of instruction sets
– HPC design using accelerators must exploit instruction-level parallelism, data-level parallelism on SIMD units, and thread-level parallelism on multiple units and multi-cores
– Challenges: portability across different generations and platforms, and the ability to exploit different types of parallelism

What a Solution Needs to Provide
– Rich abstractions for functionality – not a lowest common denominator
– Independence from any single architecture
– Portability without compromises on efficiency – don't forget the high-performance goals of the ISA
– Scalability both ways: from micro engines to massively parallel devices, from single-core embedded processors to multi-core workstations
– The ability to take advantage of accelerators (GPU, Cell, etc.)
– Transparent access to distributed memory
– Management and load balancing across heterogeneous devices

– Operator: Add, Mult, …
– Vector: a 1-D bulk data type of base types
– Distributor: distributes an operator over a vector – e.g., par add returns a Vector
– Vector composition: concat, slice, gather, scatter, …

Enter PLASMA…

The PLASMA Framework
– "CPLASM", a prototype high-level assembly language
– A prototype PLASMA IR compiler
– Currently supported targets: C (scalar), SSE3, CUDA (NVIDIA GPUs)
– Future targets: Cell, ATI, ARM NEON, …
– Compiler optimizations for the "Vector" IR
– Initial performance results

Supercomputer Education & Research Centre, Indian Institute of Science
R. Govindarajan, Sreepathi Pai, Matthew J. Thazhuthaveetil, Abhishek Udupa
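The Operator/Vector/Distributor abstraction can be illustrated with a small Python sketch. The names (`par`, `concat`, `gather`, `scatter`) are illustrative stand-ins chosen to match the bullets above, not CPLASM's actual instruction syntax; a real PLASMA backend would lower the distributed operator to SSE3 or CUDA rather than a Python loop:

```python
# Sketch of the PLASMA-style vector IR: scalar operators, 1-D vectors,
# and a distributor ("par") that lifts a scalar operator elementwise
# over vectors, plus the vector-composition primitives from the text.
import operator

def par(op):
    """Distributor: turn a scalar operator into a vector operator."""
    def vector_op(*vectors):
        return [op(*elems) for elems in zip(*vectors)]
    return vector_op

# Vector composition primitives.
def concat(a, b):
    return list(a) + list(b)

def slice_(v, lo, hi):
    return v[lo:hi]

def gather(v, idx):
    """Read v at the given indices."""
    return [v[i] for i in idx]

def scatter(v, idx, n):
    """Write v's elements into a length-n vector at the given indices."""
    out = [0] * n
    for val, i in zip(v, idx):
        out[i] = val
    return out

par_add = par(operator.add)   # the "par add" example from the text
par_mul = par(operator.mul)

result = par_add([1, 2, 3], [10, 20, 30])  # elementwise vector add
```

The point of the abstraction is that `par_add` names *what* to compute (an operator distributed over vectors) without fixing *how*, which is what lets one IR target scalar C, SSE3, and CUDA alike.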