Computing with Accelerators: Overview
ITS Research Computing, Mark Reed

Presentation transcript:

Computing with Accelerators: Overview
ITS Research Computing, Mark Reed

Objectives
- Learn why computing with accelerators is important
- Understand accelerator hardware
- Learn what types of problems are suitable for accelerators
- Survey the programming models available
- Know how to access accelerators for your own use

Logistics
- Course format: lecture and discussion
- Breaks
- Facilities
- UNC Research Computing

Agenda
The answers to all your questions: What? Why? Where? How? When? Who? Which?
- What are accelerators?
- Why accelerators?
- Which programming models are available?
- When is it appropriate?
- Who should be using them?
- Where can I run the jobs?
- How do I run jobs?

What is a computational accelerator?

Related terms: computational accelerator, hardware accelerator, offload engine, co-processor, heterogeneous computing
Examples of what we mean by accelerators:
- GPU
- MIC
- FPGA
- but not vector instruction units, SSDs, AVX, ...
... by any other name still as sweet

Why Accelerators?
What's wrong with plain old CPUs?
- The heat problem
- Processor speed has plateaued
- Green computing: flops/watt
The future looks like some form of heterogeneous computing. Your choices: multi-core or many-core :)

The Heat Problem (chart; credit: Jack Dongarra, UT)

More Parallelism (chart; credit: Jack Dongarra, UT)

Free Lunch is Over (chart: Intel CPU introductions)
From “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software” by Herb Sutter

Accelerator Hardware
- Generally speaking, you trade off clock speed for lower power
- Processing cores are low-power, slower CPUs (~1 GHz)
- Lots of cores, high parallelism (hundreds of threads)
- Memory on the accelerator is smaller (e.g. 6 GB)
- Data transfer is over PCIe and is slow, and therefore computationally expensive (see the sketch below)
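Because every byte must cross the PCIe bus, minimizing host-device traffic is usually the first optimization on any accelerator. Below is a minimal host-side sketch in C against the CUDA runtime API; the buffer names and the omitted kernel are placeholders, not code from these slides. The point is the shape: copy in once, compute many times, copy out once.

    #include <cuda_runtime.h>
    #include <stdlib.h>

    /* Sketch: keep data resident on the GPU across repeated kernel
     * launches instead of paying the PCIe cost every iteration. */
    void run_steps(float *host_buf, size_t n, int iters)
    {
        float *dev_buf;
        cudaMalloc((void **)&dev_buf, n * sizeof(float));

        /* one host-to-device copy up front ... */
        cudaMemcpy(dev_buf, host_buf, n * sizeof(float),
                   cudaMemcpyHostToDevice);

        for (int i = 0; i < iters; i++) {
            /* launch a kernel on dev_buf here (compiled with nvcc) */
        }

        /* ... and one device-to-host copy at the end */
        cudaMemcpy(host_buf, dev_buf, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
        cudaFree(dev_buf);
    }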

Programming Models
- CUDA
- OpenACC (PGI directives, HMPP directives)
- OpenCL
- Xeon Phi

Credit: “A Comparison of Programming Models” by Jeff Larkin, NVIDIA (formerly with Cray)

OpenACC
- Directives-based HPC parallel programming model: Fortran comment statements and C/C++ pragmas
- Performance and portability
- OpenACC compilers can manage data movement between CPU host memory and a separate memory on the accelerator
- Compiler availability: CAPS entreprise, Cray, and The Portland Group (PGI); GNU support is coming
- Language support: Fortran, C, and (some) C++
- The OpenMP specification will include this

OpenACC Trivial Example

Fortran:

    !$acc parallel loop reduction(+:pi)
    do i = 0, n-1
       t = (i + 0.5_8)/n
       pi = pi + 4.0/(1.0 + t*t)
    end do
    !$acc end parallel loop

C:

    #pragma acc parallel loop reduction(+:pi)
    for (i = 0; i < N; i++) {
        double t = (i + 0.5)/N;
        pi += 4.0/(1.0 + t*t);
    }
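To try the C version, a plausible PGI compile line (assuming the loop lives in a file pi.c) would be the following: -acc enables the directives, -ta=nvidia targets Nvidia GPUs, and -Minfo=accel makes the compiler report what it offloaded.

    pgcc -acc -ta=nvidia -Minfo=accel -o pi pi.c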

OpenCL
Open Computing Language. OpenCL lets programmers write a single portable program that uses ALL resources in the heterogeneous platform (GPU, FPGA, DSP, CPU, Xeon Phi, and others).
To use OpenCL, you must (see the skeleton below):
- Define the platform
- Execute code on the platform
- Move data around in memory
- Write (and build) programs
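Those four steps look like this in a minimal C host skeleton (a sketch only: error checking is omitted, and the scale kernel is a made-up placeholder, not from these slides):

    #include <CL/cl.h>

    /* A trivial kernel, supplied to OpenCL as a string at run time. */
    const char *src =
        "__kernel void scale(__global float *x) {"
        "  int i = get_global_id(0);"
        "  x[i] *= 2.0f;"
        "}";

    void run(float *data, size_t n)
    {
        /* 1. Define the platform: platform -> device -> context -> queue */
        cl_platform_id plat;
        cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        /* 2. Write (and build) the program from source */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "scale", NULL);

        /* 3. Move data around in memory: host array -> device buffer */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    n * sizeof(float), data, NULL);

        /* 4. Execute code on the platform, then read the result back */
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(float), data,
                            0, NULL, NULL);
    }

Even this stripped-down version shows why the API is often called verbose; the equivalent OpenACC code is the single pragma shown earlier.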

Intel Xeon Phi (chart; credit: Bill Barth, TACC)

What types of problems work well?
GPU strengths are flops and memory bandwidth, so good problems have:
- Lots of parallelism
- Little branching
Conversely, these problems do not work well:
- Most graph algorithms (too unpredictable, especially in memory access)
- Sparse linear algebra (but that is bad on the CPU too)
- Small signal-processing problems (FFTs smaller than 1000 points, for example)
- Search
- Sort
A sketch of a GPU-friendly loop follows.
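As an illustration (not from the slides), SAXPY is the textbook case of a loop that matches these strengths: branch-free, perfectly independent iterations, limited mainly by memory bandwidth. In C with OpenACC:

    /* SAXPY: y = a*x + y. Every iteration is independent and
     * branch-free, so the loop maps cleanly onto thousands of
     * GPU threads. */
    void saxpy(int n, float a, const float *restrict x,
               float *restrict y)
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }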

GPU Applications
- See accelerated-applications-for-hpc.pdf, a 16-page guide of ported applications covering computational chemistry (MD and QC), materials science, bioinformatics, physics, and weather and climate forecasting
- Or see applications.html for a searchable guide

CUDA Pros and Cons
Pros:
- Best possible performance
- Most control over the memory hierarchy, data movement, and synchronization (see the sketch below)
Cons:
- Limited portability
- Steep learning curve
- Must maintain multiple code paths
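A small host-side sketch of that control (plain C against the CUDA runtime API; buffer names are placeholders): pinned host memory and streams let the programmer overlap PCIe transfers with computation and choose exact synchronization points, details that directive-based models handle for you.

    #include <cuda_runtime.h>

    /* Sketch: explicit control over data movement and synchronization.
     * Pinned (page-locked) host memory enables asynchronous copies,
     * and a stream lets transfers overlap with kernel execution. */
    void overlap_example(size_t n)
    {
        float *host, *dev;
        cudaStream_t stream;

        cudaMallocHost((void **)&host, n * sizeof(float)); /* pinned */
        cudaMalloc((void **)&dev, n * sizeof(float));
        cudaStreamCreate(&stream);

        cudaMemcpyAsync(dev, host, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        /* ... kernels in other streams could run during the copy ... */
        cudaStreamSynchronize(stream); /* explicit sync point */

        cudaStreamDestroy(stream);
        cudaFree(dev);
        cudaFreeHost(host);
    }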

OpenACC Pros and Cons
Pros:
- Possible to achieve CUDA-level performance; directives control data movement (illustrated below), but actual performance may depend on the maturity of the compiler
- Incremental development is possible
- Directives-based, so a single code base can be used
Cons:
- Compiler availability is limited
- Not as low-level as CUDA or OpenCL
See 2_1ip.pdf for a detailed report.
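For instance (a sketch, under the same assumptions as the earlier SAXPY example), an acc data region keeps arrays resident on the device across several loops, so they cross PCIe only once:

    /* Sketch: the data region hoists copies out of the loops.
     * copyin(x) moves x to the device on entry; copy(y) moves y
     * in on entry and back out on exit. */
    void two_passes(int n, float a, const float *restrict x,
                    float *restrict y)
    {
        #pragma acc data copyin(x[0:n]) copy(y[0:n])
        {
            #pragma acc parallel loop
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];

            #pragma acc parallel loop
            for (int i = 0; i < n; i++)
                y[i] = y[i] * y[i];
        }
    }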

OpenCL Pros and Cons
Pros:
- Low-level, so you can get good performance (though generally not as good as CUDA)
- Portable across both hardware and OS
- There is a large body of available code
Cons:
- OpenCL is an API for C, so Fortran programs can't access it directly
- The OpenCL API is verbose, and there are a lot of steps to run even a basic program

Where can I run jobs?
- If you have a workstation/laptop with an Nvidia card, you can run on that (supports the Nvidia CUDA developer toolkit)
- Killdevil cluster on campus
- XSEDE resources:
  - Keeneland, GPGPU cluster at Georgia Tech
  - Stampede, Xeon Phi cluster at TACC (also has some GPUs)

Killdevil GPU Hardware
- Nvidia M2070: Tesla GPU, Fermi microarchitecture
- 2 GPUs per CPU
- 1 rack of GPUs, all c-186-* nodes: 32 nodes, 64 GPUs
- 448 CUDA cores, 1.15 GHz clock
- 6 GB memory
- PCIe gen 2 bus
- Does both DP and SP (double and single precision)
A short program to verify these specs follows.
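To check these specifications on an actual node, the CUDA runtime can report them from a few lines of C (a sketch; the deviceQuery sample shipped with the CUDA toolkit does the same more thoroughly):

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Print name, memory, and clock for each visible GPU. */
    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; d++) {
            struct cudaDeviceProp p;
            cudaGetDeviceProperties(&p, d);
            printf("GPU %d: %s, %.1f GB, %.2f GHz\n", d, p.name,
                   p.totalGlobalMem / 1.0e9, p.clockRate / 1.0e6);
        }
        return 0;
    }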

Running on Killdevil
See gpu-nodes-on-killdevil/
- Add the module:
  module add cuda/
  module initadd cuda/
- Submit to the GPU nodes:
  -q gpu -a gpuexcl_t
- Tools:
  nvcc - CUDA compiler
  computeprof - CUDA visual profiler
  cuda-gdb - debugger
A sample submission line follows.
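Putting the pieces together, a hypothetical LSF submission for a GPU binary on Killdevil might look like this (the queue and application flags are from the slide; the executable name is a placeholder):

    bsub -q gpu -a gpuexcl_t ./my_gpu_app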

Questions and Comments?
For assistance please contact the Research Computing Group:
- Phone: HELP
- Submit a help ticket