CS6235 L16: Libraries, OpenACC, and OpenCL

Administrative

Remaining lectures:
- Monday, April 15: CUDA 5 Features (small exercise)
- Wednesday, April 17: Parallel Architectures and Getting to Exascale

Final projects:
- Poster session, April 24 (dry run April 22)
- Final report, May 1

Final Project Presentation

Dry run on April 22; presentation on April 24
- Easels, tape, and poster board provided
- Tape a set of PowerPoint slides to a standard 2'x3' poster, or bring your own poster

Final report on projects due May 1
- Submit code
- And a written document, roughly 10 pages, based on the earlier submission
- In addition to the original proposal, include:
  - Project plan and how the work was decomposed (from the design review)
  - Description of the CUDA implementation
  - Performance measurement
  - Related work (from the design review)

Sources for Today's Lecture

References:
- OpenACC home page
- OpenCL home page
- Overview of OpenCL: overview.pdf
- 2012 tutorial: tutorial.pdf

Additional reference (new book):
- Heterogeneous Computing with OpenCL, B. Gaster, L. Howes, D. Kaeli, P. Mistry, D. Schaa, Morgan Kaufmann, 2012.

Many Different Paths to Parallel Programming

In this course we have focused on CUDA, but there are other programming technologies, some of which we will discuss today.

Range of solutions:
- Libraries
- Higher levels of abstraction
  - OpenACC (today)
  - PyCUDA, Copperhead, etc.
- Non-proprietary, but similar, solutions
  - Specifically, OpenCL

A Few Words About Libraries

Take a look at what you find in /usr/local/cuda-5.0/lib:
- CUBLAS: linear algebra
- CUSPARSE: sparse linear algebra
- CUFFT: signal processing (FFTs)

More at https://developer.nvidia.com/gpu-accelerated-libraries

Also:
- Thrust: STL-like templated interfaces to several algorithms and data structures, designed for high-performance heterogeneous parallel computing
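To make the "parallelism is embedded in the library" point concrete, here is a minimal sketch (not from the lecture) that uses CUBLAS to compute a SAXPY on the GPU; the problem size and values are made up for illustration, and error checking is omitted.

```c
/* Hedged sketch: y = alpha*x + y via CUBLAS (v2 API); compile with nvcc and link -lcublas. */
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>

int main(void) {
    enum { N = 1024 };                 /* illustrative problem size */
    const float alpha = 2.0f;
    float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = (float)i; }

    /* Move the data to the device. */
    float *d_x, *d_y;
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_y, N * sizeof(float));
    cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N * sizeof(float), cudaMemcpyHostToDevice);

    /* The library call hides the kernel launch and thread/block decomposition entirely. */
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, N, &alpha, d_x, 1, d_y, 1);
    cublasDestroy(handle);

    cudaMemcpy(y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[10] = %f\n", y[10]);     /* expect 12.0 */
    cudaFree(d_x); cudaFree(d_y);
    return 0;
}
```

Note that the developer never writes a kernel or chooses a grid configuration; that is exactly the trade-off discussed on the next slide.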

Advantages of Libraries

Some common algorithms can be encapsulated in a library:
- The library is developed by a performance expert
- It can be used by an average developer
- It accelerates the development process

Often the programmer does not even need to think about parallelism:
- Parallelism is embedded in the library

Disadvantages of libraries:
- Difficult to customize to specific contexts
- Complex and difficult to understand

What is OpenACC?

High-level directives can be added to C/C++ or Fortran programs:
- Mark loops or blocks of statements to be offloaded to an attached accelerator
- Portable across host, OS, and accelerator
- Similar in spirit to OpenMP

Example constructs:
- #pragma acc parallel [clause[,]...] newline, followed by a block of code or a loop
- Example clauses: async, copyin(), private(), copyout(), ...

Announced at SC11, November 2011; a partnership between Nvidia, Cray, PGI, and CAPS.
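As a concrete illustration (not taken from the lecture slides), here is a minimal sketch of a SAXPY loop offloaded with an OpenACC directive; the function and array names are hypothetical, and it assumes an OpenACC-capable compiler such as PGI's pgcc with the -acc flag.

```c
/* Hedged OpenACC sketch: the pragma asks the compiler to offload the loop to the accelerator. */
#include <stdio.h>

void saxpy(int n, float a, const float *x, float *y) {
    /* copyin/copy clauses describe data movement; the compiler generates the kernel and transfers. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    enum { N = 1 << 20 };
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy(N, 3.0f, x, y);
    printf("y[0] = %f\n", y[0]);   /* expect 5.0 */
    return 0;
}
```

Removing the pragma leaves a valid sequential C program, which is much of the appeal discussed on the next slide.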

Why OpenACC?

Advantages:
- Ability to add GPU code to an existing program with very low effort, similar to OpenMP vs. Pthreads
- The compiler deals with the complexity of index expressions, data movement, and synchronization
- Has the potential to be portable and non-proprietary if adopted by other vendors
- Real gains are being achieved on real applications

Disadvantages:
- Performance may suffer, possibly a lot
- Cannot express everything that can be expressed in CUDA
- Still not widely adopted by the community, but it is new, so this may change

What is OpenCL?

An open standard specification for developing heterogeneous parallel applications:
- i.e., parallel codes that use a mix of different functional units
- The goal is to unify how parallelism is expressed, how computation is offloaded to accelerators like GPUs, and how code is ported from one platform to another

Advantages over CUDA:
- Not proprietary
- Not specific to Nvidia GPUs

Disadvantages (my list):
- Portable, but not necessarily performance portable
- CUDA is customized to the Nvidia hardware execution model, so it can be made faster
- A bit low level in terms of programmer abstractions
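To give a feel for the "a bit low level" point, the following sketch (not from the lecture) implements a vector add in OpenCL; error checking is omitted and the kernel and buffer names are made up. Note how much host-side setup is explicit compared with a single CUDA kernel launch.

```c
/* Hedged OpenCL sketch: platform/device discovery, runtime compilation, and an NDRange launch. */
#include <CL/cl.h>
#include <stdio.h>

/* The kernel is compiled at runtime from this source string (unlike CUDA's offline nvcc path). */
static const char *src =
    "__kernel void vadd(__global const float *a, __global const float *b,\n"
    "                   __global float *c) {\n"
    "    int i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void) {
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Discover a platform and a GPU device, then create a context and command queue. */
    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Build the kernel from source for this device. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", NULL);

    /* Device buffers; COPY_HOST_PTR initializes the inputs from the host arrays. */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(a), a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(b), b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), NULL, NULL);

    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

    size_t global = N;   /* one work-item per element, analogous to one CUDA thread */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);

    printf("c[10] = %f\n", c[10]);   /* expect 30.0 */
    return 0;
}
```

The extra generality (explicit platforms, devices, contexts, and runtime compilation) is also the source of the startup and scheduling overheads discussed in the comparison below.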

OpenCL Short Presentation

Taken from:

A recent comparison, from Vetter, "Programming Scalable Heterogeneous Systems Productively", SOS15, March.

See also: Anthony Danalis, Gabriel Marin, Collin McCurdy, Jeremy S. Meredith, Philip C. Roth, Kyle Spafford, Vinod Tipparaju, and Jeffrey S. Vetter. "The Scalable Heterogeneous Computing (SHOC) Benchmark Suite." In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU '10), ACM, New York, NY, USA.

What Are the Differences?

This is not scientific, but I have read the following:
- Startup cost is much higher in OpenCL because of its generality
- Data transfer may also be slower due to a more general protocol
- Computation is slightly slower due to the generality of scheduling

Bottom line:
- Generality slows down performance
- One can ask whether programmability is improved, or portability is achieved (unclear)

What Will It Take for OpenCL to Dethrone CUDA?

For problems not on Nvidia GPUs, it may already have:
- CUDA programs often do not run on other hardware
- But sometimes the CUDA support exists

But what happens for people who want the ultimate in performance?
- They will continue to use CUDA until OpenCL codes run as fast or faster
- This may not be achievable, for the reasons on the previous slide
- CUDA is getting a larger and larger applications community
- CUDA codes are tuned at a low level for Nvidia architecture features

Alternatively, if portability across high-end platforms becomes important, then performance may be sacrificed.

Eventually CUDA will be replaced with something, if history is any guide.