OpenCL Framework for Heterogeneous CPU/GPU Programming
A very brief introduction to build excitement
NCCS User Forum, March 20, 2012
György (George) Fekete

What happened just two years ago? Top 3 in 2010:

    System      TFlop/s   Processors        GPUs                Power
    Tianhe-1A   4,701     14,336 Xeon       7,168 Tesla M2050   4,040 kW
    Jaguar      1,759     224,256 Opteron   none                6,950 kW
    Nebulae     1,271     9,280 Xeon        4,640 Tesla         2,580 kW

Before 2009: novelty, experimental, gamers and hackers. Recently: GPUs demand serious attention in supercomputing.

How are GPUs changing computation?
Example: compute field strength in the neighborhood of a molecule. Field strength at each grid point depends on the distance from each atom and the charge of each atom; sum all contributions.

    for each grid point p
        for each atom a
            d = dist(p, a)
            val[p] += field(a, d)

Run on CPU only. Single core: about a minute.

Run on 16 cores. 16 threads on 16 cores: about 5 seconds.

Run with OpenCL. With OpenCL and a GPU device: a blink of an eye (< 0.2 s).

Test run timings (table: Time and Speedup for CPU, GPU not optimized, GPU optimized).

Why is the GPU so fast? (figure: GPU vs. CPU chip architecture)

GPU vs CPU (2008):

                        GTX 280                   Q9450
    Bus                 512 bits                  128 bits
    Memory              1 GB GDDR3, dual port     8 GB, single port
    Memory bandwidth    141 GB/s                  12.1 GB/s
    Cache               16 kB + 16 kB per block   12 MB
    Cores               240                       4

Why should I care about heterogeneous computing?
Increased computational power no longer comes from increased clock speeds; it comes from parallelism with multiple CPUs and programmable GPUs.
CPU: multicore computing. GPU: data-parallel computing. Together: heterogeneous computing.

What is OpenCL?
Open Computing Language: a standard for parallel programming of heterogeneous systems consisting of parallel processors like CPUs and GPUs.
The specification is developed by many companies and maintained by the Khronos Group (also home of OpenGL and other open-specification technologies).
It is implemented by hardware vendors; an implementation is compliant if it conforms to the specification.

What is an OpenCL device?
Any piece of hardware that is OpenCL compliant. A device contains compute units, which in turn contain processing elements. Both a multicore CPU and many graphics adapters (Nvidia, AMD) qualify.
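
For a feel of how that hierarchy is exposed, here is a minimal host-side sketch in C that asks the first device on the first platform for its name and compute-unit count (no error checking; illustrative only):

    #include <CL/cl.h>
    #include <stdio.h>

    int main(void)
    {
        cl_platform_id plat;
        cl_device_id dev;
        cl_uint cus;
        char name[128];

        clGetPlatformIDs(1, &plat, NULL);                        /* first platform */
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL); /* first device  */
        clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof name, name, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof cus, &cus, NULL);
        printf("%s: %u compute units\n", name, cus);
        return 0;
    }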

A Dali-gpu node is an OpenCL device

OpenCL features
- Clean API
- ANSI C99 language support, with additional data types and built-ins (see the sketch below)
- Thread management framework: application- and thread-level synchronization; easy to use, lightweight
- Uses all resources in your computer
- IEEE 754-compliant rounding behavior
- Provides guidelines for future hardware designs
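
As a taste of those additional data types and built-ins, a tiny illustrative kernel (not from the slides; the name cosines is made up):

    /* float4 is one of OpenCL C's extra vector types; normalize() and
       dot() are built-in geometric functions. Each work-item computes
       the cosine of the angle between its vector and the x axis. */
    __kernel void cosines(__global const float4 *v, __global float *out)
    {
        size_t i = get_global_id(0);
        out[i] = dot(normalize(v[i]), (float4)(1.0f, 0.0f, 0.0f, 0.0f));
    }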

OpenCL's place in data-parallel computing
From coarse grain to fine grain: Grid, MPI, OpenMP/pthreads, SIMD/vector engines. OpenCL sits at the fine-grained, SIMD/vector end of this spectrum.

OpenCL: the one big idea
Remove one level of loops; each processing element has a global ID.

Then:
    for i in 0...(n-1) {
        c[i] = f(a[i], b[i]);
    }

Now:
    id = get_global_id(0);
    c[id] = f(a[id], b[id]);
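
To make the idea concrete, here is a minimal sketch of the "now" column as a complete OpenCL C kernel (the name vec_add and the bounds argument n are illustrative):

    /* Each work-item replaces one iteration of the old loop: it computes
       the single element selected by its global ID. */
    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c,
                          const unsigned int n)
    {
        size_t id = get_global_id(0);
        if (id < n)   /* guard: there may be more work-items than elements */
            c[id] = a[id] + b[id];
    }

The host enqueues roughly n work-items and the device schedules them across its processing elements; no loop over i ever appears in the kernel.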

How are GPUs changing computation? (revisited)
Example: compute field strength in the neighborhood of a molecule.

Then (one CPU walks the whole grid):
    for each grid point p
        for each atom a
            d = dist(p, a)
            val[p] += field(a, d)

Now (one work-item per grid point p; the outer loop is gone):
    for each atom a
        d = dist(p, a)
        val[p] += field(a, d)
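
A sketch of what that might look like as an OpenCL C kernel, assuming atoms arrive as a float4 array (xyz = position, w = charge) and using an illustrative inverse-square field; none of these names come from the talk:

    /* One work-item per grid point; only the atom loop remains. */
    __kernel void field_strength(__global const float4 *grid,   /* grid point coordinates */
                                 __global const float4 *atoms,  /* xyz = position, w = charge */
                                 __global float *val,
                                 const unsigned int natoms)
    {
        size_t p = get_global_id(0);
        float3 gp = grid[p].xyz;
        float sum = 0.0f;
        for (unsigned int a = 0; a < natoms; a++) {
            float d = distance(gp, atoms[a].xyz);  /* built-in distance() */
            sum += atoms[a].w / (d * d);           /* illustrative inverse-square field */
        }
        val[p] = sum;
    }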

What kind of problems can OpenCL help with?
Data-parallel programming 101: apply the same operation to each element of an array, independently. F operates on one element of a data[] array; each processor works on one element of the array at a time. There are 4 processors in this example, shown in four colors (a real GPU has many more processors).

    define F(x) { ... }
    i = get_global_id(0)
    end = len(data)
    while (i < end) {
        F(data[i])
        i = i + ncpus
    }
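
The same stride pattern is valid OpenCL C almost verbatim; a sketch with get_global_size(0) standing in for ncpus and a made-up F (doubling each element):

    /* Grid-stride pattern: each work-item handles every
       get_global_size(0)-th element, like the four colored
       processors in the figure. */
    __kernel void apply_F(__global float *data, const unsigned int n)
    {
        for (size_t i = get_global_id(0); i < n; i += get_global_size(0))
            data[i] = data[i] * 2.0f;   /* illustrative F */
    }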

Is the GPU a cure for everything?
Problems that map well (the problem separates into independent parts):
- linear algebra
- random number generation
- sorting (radix sort, bitonic sort)
- regular language parsing
Not so well:
- inherently sequential problems
- non-local calculations
- anything with communication dependence (!)
- device dependence (!!)

How do I program them? (See the host-side sketch below.)
- C++: supported by Nvidia, AMD, ...
- Fortran: FortranCL, an OpenCL interface to Fortran 90; v0.1 alpha is coming up to speed
- Python: PyOpenCL
- Libraries
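
Whatever the binding, the host-side flow is the same: pick a device, build the kernel from source at run time, move data, enqueue, read back. A minimal sketch in plain C against the OpenCL 1.x API (error checking omitted; all names illustrative):

    #include <CL/cl.h>
    #include <stdio.h>

    static const char *src =
        "__kernel void vec_add(__global const float *a,"
        "                      __global const float *b,"
        "                      __global float *c) {"
        "    size_t id = get_global_id(0);"
        "    c[id] = a[id] + b[id];"
        "}";

    int main(void)
    {
        enum { N = 1024 };
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

        cl_platform_id plat;
        cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        /* compile the kernel source at run time */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "vec_add", NULL);

        /* copy inputs to the device, run N work-items, read the result back */
        cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof a, a, NULL);
        cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof b, b, NULL);
        cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);
        clSetKernelArg(k, 0, sizeof da, &da);
        clSetKernelArg(k, 1, sizeof db, &db);
        clSetKernelArg(k, 2, sizeof dc, &dc);

        size_t global = N;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);

        printf("c[5] = %g\n", c[5]);   /* expect 15 */
        return 0;
    }

Run-time compilation via clBuildProgram is the design choice that lets the identical program target a CPU device, a GPU device, or both.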

OpenCL environments
Drivers: Nvidia, AMD, Intel, IBM
Libraries:
- OpenCL toolbox for MATLAB
- OpenCLLink for Mathematica
- OpenCL Data Parallel Primitives Library (clpp)
- ViennaCL: linear algebra library

OpenCL environments (continued)
Other language bindings:
- WebCL: JavaScript, in Firefox and WebKit
- Python: PyOpenCL
- The Open Toolkit library: C#, OpenGL, OpenAL, Mono/.NET
- Fortran
Tools: gDEBugger, clcc, SHOC (Scalable Heterogeneous Computing Benchmark Suite), ImageMagick

Myths about GPUs
Myth: hard to program. Reality: just a different programming model, one that resembles the MasPar more than x86, with C, assembler, and Fortran interfaces.
Myth: not accurate. Reality: IEEE 754 FP operations; exact address generation.

Possible future discussions
- High-level GPU programming: easy learning curve, moderate acceleration
- GPU libraries for traditional problems: linear algebra, FFT; the list is growing!
- Close to the silicon: steep learning curve, more impressive acceleration
Send me your problem!

The time is now...
Andreas Klöckner et al., "PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation," Parallel Computing, vol. 38, no. 3, March 2012.