CUDACL: A Tool for CUDA and OpenCL Programmers. Ferosh Jacob 1, David Whittaker 2, Sagar Thapaliya 2, Purushotham Bangalore 2, Marjan Mernik 3, and Jeff Gray 1.

CUDACL: A Tool for CUDA and OpenCL Programmers
Ferosh Jacob 1, David Whittaker 2, Sagar Thapaliya 2, Purushotham Bangalore 2, Marjan Mernik 3, and Jeff Gray 1
1 Department of Computer Science, University of Alabama
2 Department of Computer and Information Sciences, University of Alabama at Birmingham
3 Faculty of Electrical Engineering and Computer Science, University of Maribor
Contact:

Challenges to GPU programmers
1. Accidental complexity: improper level of abstraction
   1. Is it possible to allow programmers to use GPUs as just another tool in the IDE?
   2. A programmer often selects a block of code and specifies the device on which it is to execute in order to understand the performance concerns. Should performance tuning happen at a high level of abstraction, similar to what a Graphical User Interface (GUI) developer sees in a What You See Is What You Get (WYSIWYG) editor?

Challenges to GPU programmers
2. Accidental complexity: configuration details
   1. A simple analysis of CUDA or OpenCL reveals that there are only a few configuration parameters, but many technical details involved in the execution. How far can default values be provided without much performance loss?
   2. How much information should be shown to the user for performance tuning, and how much should be hidden to keep programming tasks simple?

GPU program analysis
- Goal: Explore the abstraction possibilities observed from the common capabilities of GPU programs.
- Goal: Study the flow of data from GPU to CPU and vice versa, the flow of data between multiple threads, and the flow of data within the GPU (e.g., shared to global or constant memory).
- Goal: Extract the possible templates from an OpenCL program.
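To make the data-flow goal concrete, the fragment below (a generic illustration, not taken from the slides) shows the flows such an analysis tracks inside a kernel: reads from global memory staged in shared memory and results written back to global memory, with the commented host calls showing the CPU-to-GPU and GPU-to-CPU transfers.

    // Illustrative CUDA kernel: data flows a GPU program analysis would track.
    __global__ void scale(const float *in, float *out, int n) {
        __shared__ float tile[256];                    // staging buffer in shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tile[threadIdx.x] = in[i];          // flow: global -> shared
        __syncthreads();
        if (i < n) out[i] = 2.0f * tile[threadIdx.x];  // flow: shared -> global
    }
    // Host side (flows between CPU and GPU):
    //   cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);   // CPU -> GPU
    //   scale<<<blocks, 256>>>(d_in, d_out, n);
    //   cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost); // GPU -> CPU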

CUDA analysis
For the analysis, 42 kernels were selected from 25 randomly chosen programs provided as code samples with the NVIDIA CUDA installation package. The kernels were grouped into three categories: Level A, Level B, and Level C.

OpenCL analysis
- Motivation: In the program oclMatrVecMul from the NVIDIA OpenCL installation package, three steps (1. creating the OpenCL context, 2. creating a command queue for device 0, and 3. setting up the program) take 34 lines of code.
- Data: 15 programs were randomly selected from the code samples shipped with the NVIDIA OpenCL installation package.
- If the steps of a general OpenCL program can be identified, templates can be provided that free the programmer from writing much of the common code manually. Furthermore, one or more steps can be abstracted into standard functions to simplify the development process.
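For reference, the three setup steps listed above look roughly like the following condensed sketch using the standard OpenCL C host API (a generic illustration with error handling omitted, not the actual code of oclMatrVecMul).

    #include <CL/cl.h>

    /* Condensed sketch of the common OpenCL setup steps; error checks omitted. */
    cl_context setup(const char *kernel_src, cl_command_queue *queue, cl_program *program) {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        /* 1. Create the OpenCL context. */
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

        /* 2. Create a command queue for device 0. */
        *queue = clCreateCommandQueue(ctx, device, 0, NULL);

        /* 3. Set up the program: create it from source and build it. */
        *program = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, NULL);
        clBuildProgram(*program, 1, &device, NULL, NULL, NULL);

        return ctx;
    }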

Analysis conclusions
CUDA
- Automatic code conversion: For a fair share of the programs (48%), the GPU code could be generated automatically if the sequential code is available.
- Copy mismatch: In some programs (12%), not all variables used in the kernel are copied to the device. Even though these variables are not copied, memory must still be allocated in the host code before the first access.
OpenCL
- Default template: Every OpenCL program consists of creating a context, setting up the program, and cleaning up the OpenCL resources.
- Device specification: Each kernel could be annotated with the device, or multiple devices, on which it is meant to be executed.
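The copy-mismatch case usually corresponds to output-only buffers: the host allocates device memory but never copies anything into it, because the kernel's first access is a write. A minimal generic CUDA sketch of this pattern (an assumed shape, not taken from the analyzed samples):

    #include <cuda_runtime.h>

    __global__ void fill(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = (float)i;          /* first access is a write */
    }

    void run(float *h_out, int n) {
        float *d_out;
        size_t bytes = n * sizeof(float);
        /* Allocated before the first access, but never copied host-to-device: */
        /* the kernel only writes d_out, so no cudaMemcpyHostToDevice is needed. */
        cudaMalloc((void **)&d_out, bytes);
        fill<<<(n + 255) / 256, 256>>>(d_out, n);
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_out);
    }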

Design of CUDACL
(Architecture overview.) C or Java code is parsed with CDT/JDT and expressed as C or Java code with abstract APIs; predefined functions and configuration parameters feed into code generation, which produces C with OpenCL, C with CUDA, Java with OpenCL, or Java with CUDA.

CUDACL in action: host code of the Arrayadd example (screenshot).
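The slide shows a screenshot of the Arrayadd host code. As a stand-in, here is a minimal generic CUDA host side for array addition (an assumed shape for the example, not the code generated by CUDACL); error handling is omitted.

    #include <cuda_runtime.h>

    __global__ void array_add(const float *a, const float *b, float *c, int n);  /* kernel: see next slide */

    /* Generic host side for array addition. */
    void array_add_host(const float *h_a, const float *h_b, float *h_c, int n) {
        float *d_a, *d_b, *d_c;
        size_t bytes = n * sizeof(float);

        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_b, bytes);
        cudaMalloc((void **)&d_c, bytes);

        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        array_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_a);
        cudaFree(d_b);
        cudaFree(d_c);
    }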

CUDACL in action: kernel code of the Arrayadd example (screenshot).
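Again as a generic stand-in for the screenshot, a typical CUDA kernel for element-wise array addition looks like this (assumed, not the slide's exact code):

    __global__ void array_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one element per thread */
        if (i < n)
            c[i] = a[i] + b[i];
    }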

CUDACL in action: Eclipse plug-in editor for configuration (screenshot).

Case study 1: creation of an octree with CUDA
- An octree is a popular tree data structure that is often used to represent volumetric data; volumetric 3D data is widely used in computer graphics.
- The program is written in C.
- Level A program.

Case study 2: edge detection with OpenCL
- Farmland is found by using a pattern-matching algorithm to search for large, flat, contiguous squares and circles in the edge map of the land, which is created by running a Sobel edge detector.
- The program is written in Java.
- Level C program.
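For readers unfamiliar with the Sobel step, a generic edge-detection kernel is sketched below in CUDA for consistency with the earlier sketches (the case study itself uses Java with OpenCL, and this is not its actual code).

    /* Generic Sobel edge-detection kernel for a grayscale image of width w, height h. */
    __global__ void sobel(const unsigned char *in, unsigned char *out, int w, int h) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1) return;

        /* Horizontal and vertical gradients from the 3x3 neighborhood. */
        int gx = -in[(y-1)*w + (x-1)] + in[(y-1)*w + (x+1)]
                 - 2*in[y*w + (x-1)]  + 2*in[y*w + (x+1)]
                 - in[(y+1)*w + (x-1)] + in[(y+1)*w + (x+1)];
        int gy = -in[(y-1)*w + (x-1)] - 2*in[(y-1)*w + x] - in[(y-1)*w + (x+1)]
                 + in[(y+1)*w + (x-1)] + 2*in[(y+1)*w + x] + in[(y+1)*w + (x+1)];

        int mag = abs(gx) + abs(gy);            /* L1 approximation of the gradient magnitude */
        out[y*w + x] = mag > 255 ? 255 : (unsigned char)mag;
    }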

Related work
Sequential code to:
- Parallel: PFC [1], PAT [2]
- CUDA: hiCUDA [3], CUDA-lite [4]
- OpenCL: CUDACL

1. R. Allen and K. Kennedy, "PFC: A program to convert Fortran to parallel form," in Supercomputers: Design and Applications, pp. 186–203.
2. B. Appelbe, K. Smith, and C. McDowell, "Start/Pat: A parallel-programming toolkit," IEEE Software, pp. 29–38.
3. T. D. Han and T. S. Abdelrahman, "hiCUDA: A high-level directive-based language for GPU programming," in Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units, Washington, D.C., March 2009, pp. 52–.
4. S.-Z. Ueng, M. Lathara, S. S. Baghsorkhi, and W.-M. W. Hwu, "CUDA-lite: Reducing GPU programming complexity," in Proceedings of the International Workshop on Languages and Compilers for Parallel Computing, Edmonton, Canada, July 2008, pp. 1–15.

Conclusion and future work
Conclusion
- A graphical interface with default values is available to configure a GPU parallel block in C and Java programs and to generate CUDA and OpenCL programs.
- Two case studies were presented to show that CUDACL, and its associated tool support within Eclipse, can be useful for a wide range of applications.
Future work
- As an extension of this work, a Domain-Specific Language (DSL) is being designed to represent the configuration interface.
- More examples should be tried to evaluate the tool.

Thank you. Questions?
Ferosh Jacob 1, David Whittaker 2, Sagar Thapaliya 2, Purushotham Bangalore 2, Marjan Mernik 3, and Jeff Gray 1
1 Department of Computer Science, University of Alabama
2 Department of Computer and Information Sciences, University of Alabama at Birmingham
3 Faculty of Electrical Engineering and Computer Science, University of Maribor
Contact: