APARAPI: Java™ platform's 'Write Once Run Anywhere'® now includes the GPU. Gary Frost, AMD PMTS, Java Runtime Team


3 | AGENDA
• The age of heterogeneous computing is here
• The supercomputer in your desktop/laptop
• Why Java™?
• Current GPU programming options for Java developers
• Are developers likely to adopt emerging Java OpenCL™/CUDA™ bindings?
• Aparapi
  – What is it
  – How it works
• Performance
• Examples/Demos
• Proposed Enhancements
• Future work

4 | THE AGE OF HETEROGENEOUS COMPUTE IS HERE
• GPUs originally developed to accelerate graphics operations
• Early adopters repurposed their GPUs for 'general compute' by performing 'unnatural acts' with shader APIs
• OpenGL allowed shaders/textures to be compiled and executed via extensions
• OpenCL™/GLSL/CUDA™ standardized/formalized how to express GPU compute and simplified host programming
• New programming models are emerging and lowering barriers to adoption

5 | THE SUPERCOMPUTER IN YOUR DESKTOP
• Some interesting tidbits from
  – November 2000: "ASCI White is new #1 with 4.9 TFlops on the Linpack"
  – November 2002: "3.2 TFlops are needed to enter the top 10"
• May 2011
  – AMD Radeon™ TFlops single precision performance

6 | WHY JAVA?
• One of the most widely used programming languages
• Established in domains likely to benefit from heterogeneous compute
  – BigData, Search, Hadoop+Pig, Finance, GIS, Oil & Gas
• Even if applications are not implemented in Java, they may still run on the Java Virtual Machine (JVM)
  – JRuby, JPython, Scala, Clojure, Quercus (PHP)
• Acts as a good proxy/indicator for enablement of other runtimes/interpreters
  – JavaScript, Flash, .NET, PHP, Python, Ruby, Dalvik?

7 | GPU PROGRAMMING OPTIONS FOR JAVA PROGRAMMERS
• Emerging Java GPU APIs require coding a 'Kernel' in a domain-specific language:

// JOCL/OpenCL kernel code
__kernel void squares(__global const float *in, __global float *out){
    int gid = get_global_id(0);
    out[gid] = in[gid] * in[gid];
}

• As well as writing the Java 'host' CPU-based code to:
  – Initialize the data
  – Select/initialize the execution device
  – Allocate or define memory buffers for args/parameters
  – Compile the 'Kernel' for the selected device
  – Enqueue/send arg buffers to the device
  – Execute the kernel
  – Read result buffers back from the device
  – Clean up (remove buffers/queues/device handles)
  – Use the results

import static org.jocl.CL.*;
import org.jocl.*;

public class Sample {
    public static void main(String args[]) {
        // Create input and output data
        int size = 10;
        float inArr[] = new float[size];
        float outArray[] = new float[size];
        for (int i = 0; i < size; i++) {
            inArr[i] = i;
        }
        Pointer in = Pointer.to(inArr);
        Pointer out = Pointer.to(outArray);

        // Obtain the platform IDs and initialize the context properties
        cl_platform_id platforms[] = new cl_platform_id[1];
        clGetPlatformIDs(1, platforms, null);
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);

        // Create an OpenCL context
        cl_context context = clCreateContextFromType(contextProperties,
            CL_DEVICE_TYPE_CPU, null, null, null);

        // Obtain the cl_device_id for the first device
        cl_device_id devices[] = new cl_device_id[1];
        clGetContextInfo(context, CL_CONTEXT_DEVICES, Sizeof.cl_device_id,
            Pointer.to(devices), null);

        // Create a command-queue
        cl_command_queue commandQueue = clCreateCommandQueue(context, devices[0], 0, null);

        // Allocate the memory objects for the input and output data
        cl_mem inMem = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
            Sizeof.cl_float * size, in, null);
        cl_mem outMem = clCreateBuffer(context, CL_MEM_READ_WRITE,
            Sizeof.cl_float * size, null, null);

        // Create the program from the source code
        cl_program program = clCreateProgramWithSource(context, 1, new String[]{
            "__kernel void sampleKernel(" +
            "   __global const float *in," +
            "   __global float *out){" +
            "   int gid = get_global_id(0);" +
            "   out[gid] = in[gid] * in[gid];" +
            "}"
        }, null, null);

        // Build the program
        clBuildProgram(program, 0, null, null, null, null);

        // Create and extract a reference to the kernel
        cl_kernel kernel = clCreateKernel(program, "sampleKernel", null);

        // Set the arguments for the kernel
        clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(inMem));
        clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(outMem));

        // Execute the kernel
        clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
            new long[]{inArr.length}, null, 0, null, null);

        // Read the output data
        clEnqueueReadBuffer(commandQueue, outMem, CL_TRUE, 0,
            outArray.length * Sizeof.cl_float, out, 0, null, null);

        // Release kernel, program, and memory objects
        clReleaseMemObject(inMem);
        clReleaseMemObject(outMem);
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        clReleaseCommandQueue(commandQueue);
        clReleaseContext(context);

        for (float f : outArray){
            System.out.printf("%5.2f, ", f);
        }
    }
}

8 | ARE DEVELOPERS LIKELY TO ADOPT EMERGING JAVA OPENCL/CUDA BINDINGS?
• Some will
  – Early adopters
  – Prepared to learn new languages
  – Motivated to squeeze all the performance they can from available compute devices
  – Prepared to implement algorithms both in Java and in CUDA/OpenCL
• Many won't
  – OpenCL/CUDA's C99 heritage is likely to disenfranchise Java developers
    • Many walked away from C/C++ or possibly never encountered it at all (due to CS education shifts)
    • It is difficult to expose low-level concepts (such as the GPU memory model) to developers who have 'moved on' and just expect the JVM to 'do the right thing'
    • Who pays for retraining of Java developers?
  – The notion of writing code twice (once for Java execution, again for the GPU/APU) is alien
    • Where's my 'Write Once, Run Anywhere'?

9 | WHAT IS APARAPI?
• An API for expressing data parallel workloads in Java
  – Developer extends a Kernel base class
  – Compiles to Java bytecode using the existing tool chain
  – Uses the existing/familiar Java tool chain to debug the logic of their Kernel implementations
• A runtime component capable of either:
  – Executing the Kernel via a Java thread pool
  – Converting the Kernel bytecode to OpenCL and executing it on the GPU

10 | AN EMBARRASSINGLY PARALLEL USE CASE
• First let's revisit our earlier code example: calculate square[0..size] for a given input in[0..size]

final int[] square = new int[size];
final int[] in = new int[size];
// populating in[0..size] omitted
for (int i = 0; i < size; i++){
    square[i] = in[i] * in[i];
}

• Note that the order we traverse the loop is unimportant
• Ideally Java would provide a way to indicate that the body of the loop need not be executed sequentially
• Something like a parallel-for?

parallel-for (int i = 0; i < size; i++){

• However we don't want to modify the language, compiler or tool chain.

11 | REFACTORING OUR EXAMPLE TO USE APARAPI

final int[] square = new int[size];
final int[] in = new int[size];
// populating in[0..size] omitted
for (int i = 0; i < size; i++){
    square[i] = in[i] * in[i];
}

new Kernel(){
    @Override public void run(){
        int i = getGlobalId();
        square[i] = in[i] * in[i];
    }
}.execute(size);

12 | EXPRESSING DATA PARALLEL IN APARAPI
• What happens when we call execute(n)?

Kernel kernel = new Kernel(){
    @Override public void run(){
        int i = getGlobalId();
        square[i] = in[i] * in[i];
    }
};
kernel.execute(size);

13 | FIRST CALL OF KERNEL.EXECUTE(SIZE) WHEN OPENCL/GPU IS AVAILABLE
• Reload the classfile via the classloader and locate all methods and fields
• For the run() method and all methods reachable from run():
  – Convert method bytecode to an IR
    • Expression trees
    • Conditional sequences analyzed and converted to if{}, if{}else{} and for{} constructs
  – Create a list of fields accessed by the bytecode
    • Note the access type (read/write/read+write)
    • Accessed fields will be turned into args and passed to the generated OpenCL
    • Create an OpenCL buffer for each accessed primitive array (read, write or readwrite)
  – Create and compile the OpenCL
• Bail and revert to the Java thread pool if we encounter any issues in the previous steps
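To make the bytecode-to-OpenCL conversion concrete, here is a rough sketch of the kind of OpenCL Aparapi could emit for the square kernel used throughout these slides. It is an illustration only: the argument names and exact layout produced by the real code generator may differ.

__kernel void run(
    __global int *in,        // field read by the kernel, turned into an arg and written to the device
    __global int *square){   // field written by the kernel, read back after execution
    int i = get_global_id(0);
    square[i] = in[i] * in[i];
}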

14 | ALL CALLS OF KERNEL.EXECUTE(SIZE) WHEN OPENCL/GPU IS AVAILABLE
• Lock any accessed primitive arrays (so the garbage collector doesn't move or collect them)
• For each field readable by the kernel:
  – If the field is an array → enqueue a buffer write
  – If the field is scalar → set the kernel arg value
• Enqueue kernel execution
• For each array writeable by the kernel:
  – Enqueue a buffer read
• Wait for all enqueued requests to complete
• Unlock the accessed primitive arrays

15 | KERNEL.EXECUTE(SIZE) WHEN OPENCL/GPU IS NOT AN OPTION
• Create a thread pool
• One thread per core
• Clone the Kernel once for each thread
• Each Kernel is accessed exclusively from a single thread
• Each Kernel loops globalSize/threadCount times
• Update globalId, localId, groupSize, globalSize on the Kernel instance
• Execute the run() method on the Kernel instance
• Wait for all threads to complete
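The thread-pool fallback can be pictured with a short Java sketch. This is not the Aparapi runtime's actual source: Kernel.clone() is assumed to be public, and setGlobalId() is a hypothetical setter standing in for the internal bookkeeping the slide describes.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal sketch of the Java thread pool fallback (assumed names, remainder handling omitted).
static void executeOnThreadPool(final Kernel prototype, final int globalSize) throws InterruptedException {
    final int threadCount = Runtime.getRuntime().availableProcessors(); // one thread per core
    final int chunk = globalSize / threadCount;                         // iterations per clone
    ExecutorService pool = Executors.newFixedThreadPool(threadCount);
    final CountDownLatch done = new CountDownLatch(threadCount);
    for (int t = 0; t < threadCount; t++) {
        final Kernel clone = prototype.clone();  // each thread works on its own Kernel clone
        final int start = t * chunk;
        pool.submit(new Runnable() {
            public void run() {
                for (int id = start; id < start + chunk; id++) {
                    clone.setGlobalId(id); // hypothetical; the real runtime updates these ids internally
                    clone.run();           // execute the user's kernel body for this global id
                }
                done.countDown();
            }
        });
    }
    done.await();    // wait for all threads to complete
    pool.shutdown();
}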

16 | ADOPTION CHALLENGES (APARAPI VS EMERGING JAVA GPU BINDINGS)

Task                                                     | Emerging GPU bindings | Aparapi
Learn OpenCL/CUDA                                        | DIFFICULT             | N/A
Locate potential data parallel opportunities             | MEDIUM                | MEDIUM
Refactor existing code/data structures                   | MEDIUM                | MEDIUM
Create kernel code                                       | DIFFICULT             | EASY
Create code to coordinate execution and buffer transfers | MEDIUM                | N/A
Identify GPU performance bottlenecks                     | DIFFICULT             | DIFFICULT
Iterate code/debug algorithm logic                       | DIFFICULT             | MEDIUM
Solve build/deployment issues                            | DIFFICULT             | MEDIUM

17 | MANDELBROT EXAMPLE

new Kernel(){
    @Override public void run() {
        int gid = getGlobalId();
        float x = (((gid % w) - (w / 2)) / (float) w);
        float y = (((gid / w) - (h / 2)) / (float) h);
        float zx = x, zy = y, new_zx = 0f;
        int count = 0;
        while (count < maxIterations && zx * zx + zy * zy < 8) {
            new_zx = zx * zx - zy * zy + x;
            zy = 2 * zx * zy + y;
            zx = new_zx;
            count++;
        }
        rgb[gid] = pallette[count];
    }
}.execute(width * height);

18 | EXPRESSING DATA PARALLEL IN JAVA WITH APARAPI BY EXTENDING KERNEL

class SquareKernel extends Kernel{
    final int[] in, square;

    public SquareKernel(final int[] in){
        this.in = in;
        this.square = new int[in.length];
    }

    @Override public void run(){
        int i = getGlobalId();
        square[i] = in[i] * in[i];
    }

    public int[] square(){
        execute(in.length);
        return square;
    }
}

int[] in = new int[size];
SquareKernel squareKernel = new SquareKernel(in);
// populating in[0..size] omitted
int[] result = squareKernel.square();

• The square() method 'wraps' the execution mechanics
• Provides a more natural Java API

19 | EXPRESSING DATA PARALLELISM IN APARAPI USING PROPOSED JAVA 8 LAMBDAS
• JSR 335 'Project Lambda' proposes the addition of 'lambda' expressions to Java 8
• How we expect Aparapi will make use of the proposed Java 8 extensions:

final int[] square = new int[size];
final int[] in = new int[size];
// populating in[0..size] omitted
Kernel.execute(size, #{ i -> square[i] = in[i] * in[i]; });

20 | HOW APARAPI EXECUTES ON THE GPU
• At runtime Aparapi converts Java bytecode to OpenCL
• The OpenCL compiler converts OpenCL to device-specific ISA (for the GPU/APU)
• The GPU is comprised of multiple SIMD (Single Instruction, Multiple Data) cores
• SIMD performance stems from executing the same instruction on different data at the same time
  – Think of a single program counter shared across multiple threads
  – All SIMD lanes execute at the same time (in lock-step)

new Kernel(){
    @Override public void run(){
        int i = getGlobalId();
        int temp = in[i] * 2;
        temp = temp + 1;
        out[i] = temp;
    }
}.execute(4);

Lock-step execution of the four work-items:

         i=0                  i=1                  i=2                  i=3
         int temp = in[0]*2   int temp = in[1]*2   int temp = in[2]*2   int temp = in[3]*2
         temp = temp+1        temp = temp+1        temp = temp+1        temp = temp+1
         out[0] = temp        out[1] = temp        out[2] = temp        out[3] = temp

21 | DEVELOPER IS RESPONSIBLE FOR ENSURING THE PROBLEM IS DATA PARALLEL
• Data dependencies may violate the 'in any order' contract

for (int i = 1; i < 100; i++){
    out[i] = out[i-1] + in[i];
}

new Kernel(){
    @Override public void run(){
        int i = getGlobalId();
        out[i] = out[i-1] + in[i];
    }
}.execute(100);

out[i-1] refers to a value resulting from a previous iteration which may not have been evaluated yet.

• Loops mutating shared data will need to be refactored or will necessitate atomic operations

for (int i = 0; i < 100; i++){
    sum += in[i];
}

new Kernel(){
    @Override public void run(){
        int i = getGlobalId();
        sum += in[i];
    }
}.execute(100);

sum += in[i] causes a race condition: it almost certainly will not be atomic when translated to OpenCL, and it is not safe in multi-threaded Java either.

22 | SOMETIMES WE CAN REFACTOR TO RECOVER SOME PARALLELISM

// Original serial reduction
for (int i = 0; i < 100; i++){
    sum += in[i];
}

// Naive kernel: every work-item races on sum
new Kernel(){
    @Override public void run(){
        int i = getGlobalId();
        sum += in[i];
    }
}.execute(100);

// Refactored: each work-item accumulates its own partial sum...
new Kernel(){
    @Override public void run(){
        int n = getGlobalId();
        for (int i = 0; i < 10; i++){
            partial[n] += data[n*10 + i];
        }
    }
}.execute(10);

// ...and the host combines the partials
for (int i = 0; i < 10; i++){
    sum += partial[i];
}

// Serial equivalent of the refactored form
for (int n = 0; n < 10; n++){
    for (int i = 0; i < 10; i++){
        partial[n] += data[n*10 + i];
    }
}
for (int i = 0; i < 10; i++){
    sum += partial[i];
}

23 | TRY TO AVOID BRANCHING WHEREVER POSSIBLE
• SIMD performance is impacted when code contains branches
  – To stay in lock-step the SIMD lanes must process both the 'then' and 'else' blocks
  – The result of the 'condition' is used to predicate instructions (conditionally mask them to a no-op)

new Kernel(){
    @Override public void run(){
        int i = getGlobalId();
        int temp = in[i] * 2;
        if (i % 2 == 0)
            temp = temp + 1;
        else
            temp = temp - 1;
        out[i] = temp;
    }
}.execute(4);

Lock-step execution of the four work-items:

         i=0                  i=1                  i=2                  i=3
         int temp = in[0]*2   int temp = in[1]*2   int temp = in[2]*2   int temp = in[3]*2
         (0%2==0) → true      (1%2==0) → false     (2%2==0) → true      (3%2==0) → false
         if: temp = temp+1    (masked)             if: temp = temp+1    (masked)
         (masked)             else: temp = temp-1  (masked)             else: temp = temp-1
         out[0] = temp        out[1] = temp        out[2] = temp        out[3] = temp

24 | CHARACTERISTICS OF IDEAL DATA PARALLEL WORKLOADS
• Code which iterates over large arrays of primitives
  – 32/64-bit data types preferred
  – Where the order of iteration is not critical
• Avoid data dependencies between iterations
  – Each iteration contains sequential code (few branches)
• A good balance between data size (low) and compute (high)
  – Transfer of data to/from the GPU can be costly
    • Although APUs are likely to mitigate this over time
  – Trivial compute is often not worth the transfer cost
  – May still benefit by freeing up the CPU for other work
[Chart: compute vs. data size; the ideal region is high compute with data that fits in GPU memory]

25 | APARAPI NBODY EXAMPLE
• NBody is a common OpenCL/CUDA benchmark/demo
  – For each particle/body, calculate a new position based on the gravitational force exerted on it by every other body
  – Essentially an N² problem: if we double the number of bodies, we perform four times the positional calculations
• The following charts compare:
  – A naive Java version (single loop)
  – The Aparapi version using a Java thread pool
  – The Aparapi version running on the GPU (ATI Radeon™ 5870)
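For readers who want to see what such a kernel looks like in Aparapi form, here is a minimal sketch of a body-position update. It is not the demo's actual source: the flattened xyz/vxyz arrays, the constants (delT, espSqr, mass) and the simplified physics are assumptions made for illustration.

// Sketch of an NBody position-update kernel (illustrative, not the benchmark source).
class NBodyKernel extends Kernel {
    final float[] xyz;   // x,y,z position triples, one per body
    final float[] vxyz;  // x,y,z velocity triples, one per body
    final int bodies;
    final float delT = 0.005f, espSqr = 1.0f, mass = 5f; // assumed constants

    NBodyKernel(float[] xyz, float[] vxyz, int bodies) {
        this.xyz = xyz;
        this.vxyz = vxyz;
        this.bodies = bodies;
    }

    @Override public void run() {
        int i = getGlobalId() * 3;
        float accX = 0f, accY = 0f, accZ = 0f;
        for (int j = 0; j < bodies * 3; j += 3) {   // every other body contributes a force
            float dx = xyz[j] - xyz[i];
            float dy = xyz[j + 1] - xyz[i + 1];
            float dz = xyz[j + 2] - xyz[i + 2];
            // Math.sqrt is assumed to map to the OpenCL sqrt when translated
            float dist = (float) Math.sqrt(dx * dx + dy * dy + dz * dz + espSqr);
            float s = mass / (dist * dist * dist);
            accX += s * dx; accY += s * dy; accZ += s * dz;
        }
        xyz[i]     += vxyz[i]     * delT + 0.5f * accX * delT * delT;
        xyz[i + 1] += vxyz[i + 1] * delT + 0.5f * accY * delT * delT;
        xyz[i + 2] += vxyz[i + 2] * delT + 0.5f * accZ * delT * delT;
        vxyz[i]     += accX * delT;
        vxyz[i + 1] += accY * delT;
        vxyz[i + 2] += accZ * delT;
    }
}

One frame of the simulation is then a single kernel.execute(bodies) call, issued once per rendered frame.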

26 | APARAPI NBODY PERFORMANCE (FRAME RATE VS. NUMBER OF BODIES)
[Chart: frames per second vs. number of bodies for the three implementations]

27 | NBODY PERFORMANCE: CALCULATIONS PER µSEC VS. NUMBER OF BODIES
[Chart: position calculations per µs vs. number of bodies/particles]

28 | APARAPI EXPLICIT BUFFER MANAGEMENT
• This code demonstrates a fairly common pattern, namely a Kernel executed inside a loop:

int[] buffer = new int[HUGE];
int[] unusedBuffer = new int[HUGE];
Kernel k = new Kernel(){
    @Override public void run(){
        // mutates buffer contents
        // no reference to unusedBuffer
    }
};
for (int i = 0; i < 1000; i++){
    k.execute(HUGE);   // transfers buffer to the GPU and back from the GPU on every iteration
}

• Although Aparapi analyzes kernel methods to optimize host buffer transfer requests, it has no knowledge of buffer accesses from the enclosing loop.
• Aparapi must assume that the buffer is modified between invocations. This forces (possibly unnecessary) buffer copies to and from the device for each invocation of Kernel.execute(int).

29 | APARAPI EXPLICIT BUFFER MANAGEMENT
• Using the new explicit buffer management APIs:

int[] buffer = new int[HUGE];
Kernel k = new Kernel(){
    @Override public void run(){
        // mutates buffer contents
    }
};
k.setExplicit();
k.put(buffer);
for (int i = 0; i < 1000; i++){
    k.execute(HUGE);
}
k.get(buffer);

• The developer takes control of all buffer transfers by marking the kernel as explicit, then coordinates when/if transfers take place
• Here we save 999 buffer writes and 999 buffer reads

30 | APARAPI EXPLICIT BUFFER MANAGEMENT
• A possible alternative might be to expose the 'host' code to Aparapi:

int[] buffer = new int[HUGE];
Kernel k = new Kernel(){
    @Override public void run(){
        // mutates buffer contents
    }
    public void host(){
        for (int i = 0; i < 1000; i++){
            execute(HUGE);
        }
    }
};
k.host();

• The developer exposes the host code to Aparapi by overriding the host() method
• By analyzing the bytecode of host(), Aparapi can determine when/if buffers are mutated and can 'inject' appropriate put()/get() requests behind the scenes

31 | APARAPI BITONIC SORT WITH EXPLICIT BUFFER MANAGEMENT
• Bitonic mergesort is a parallel-friendly 'in place' sorting algorithm
• On 10/18/2010 the following post appeared on the Aparapi forums: "Aparapi 140x slower than single thread Java?! what am I doing wrong?"
  – Source code (for the bitonic sort) was included in the post
  – An Aparapi Kernel (for each sort pass) was executed inside a Java loop
  – Aparapi was forcing unnecessary buffer copies
• The following chart compares:
  – A single-threaded Java version
  – The Aparapi/GPU version without explicit buffer management (default AUTO mode)
  – The Aparapi/GPU version with the recent explicit buffer management feature enabled
• Both Aparapi versions running on an ATI Radeon™ 5870
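A sketch of the explicit-buffer fix for this pattern follows. It assumes a hypothetical BitonicPassKernel with a setStageAndPass() helper and is not the forum poster's code; only setExplicit()/put()/get() come from the API shown on the previous slide.

// Illustrative only: one kernel execution per bitonic sort pass, with explicit buffer
// management so the data crosses the host/device boundary only twice in total.
int[] data = new int[n];                           // n a power of two, populated elsewhere
BitonicPassKernel k = new BitonicPassKernel(data); // hypothetical Kernel subclass for one pass
k.setExplicit();                                   // developer takes control of transfers
k.put(data);                                       // single host -> device copy before the passes
for (int stage = 2; stage <= n; stage <<= 1) {
    for (int pass = stage >> 1; pass > 0; pass >>= 1) {
        k.setStageAndPass(stage, pass);            // hypothetical setter for the pass parameters
        k.execute(n);                              // no implicit buffer copies in explicit mode
    }
}
k.get(data);                                       // single device -> host copy after all passes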

32 | EXPLICIT BUFFER MANAGEMENT EFFECT ON THE BITONIC SORT IMPLEMENTATION
[Chart: sort time in ms for the three versions]

33 | PROPOSED APARAPI ENHANCEMENT: ALLOW ACCESS TO ARRAYS OF OBJECTS
• A Java developer implementing an 'nbody' solution would probably define a class for each particle:

public class Particle{
    private int x, y, z;
    private String name;
    private Color color;
    //...
}

• ... would make all fields private and limit access via setters/getters:

public void setX(int x){ this.x = x; }
public int getX(){ return this.x; }
// same for y, z, name etc.

• ... and expect to create a Kernel to update positions for an array of such particles:

Particle[] particles = new Particle[1024];
ParticleKernel kernel = new ParticleKernel(particles);
while (displaying){
    kernel.execute(particles.length);
    updateDisplayPositions(particles);
}

34 | PROPOSED APARAPI ENHANCEMENT: ALLOW ACCESS TO ARRAYS OF OBJECTS
• Unfortunately the current 'alpha' version of Aparapi would fail to convert this kernel to OpenCL
  – It would fall back to using a thread pool
• Aparapi currently requires that the previous code be refactored so that the data is held in primitive arrays:

int[] x = new int[1024];
int[] y = new int[1024];
int[] z = new int[1024];
Color[] color = new Color[1024];
String[] name = new String[1024];
Positioner.position(x, y, z);

• This is clearly a potential 'barrier to adoption'

35 | PROPOSED APARAPI ENHANCEMENT: ALLOW ACCESS TO ARRAYS OF OBJECTS
• The proposed enhancement will allow Aparapi Kernels to access arrays (or array-based collections) of objects
• At runtime Aparapi:
  – Tracks all fields accessed via objects reachable from Kernel.run()
  – Extracts the data from these fields into a parallel-array form
  – Executes the Kernel using the parallel-array form
  – Returns the data back into the original object array
• These extra steps do impact performance (compared with the refactored data-parallel form)
  – However, we can still demonstrate performance gains over non-Aparapi versions
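Conceptually, the runtime would wrap each execute() call with something like the following extract/execute/write-back cycle. This is a hand-written illustration of the steps listed above, not Aparapi's implementation; the getters and setters are those of the Particle class sketched on slide 33.

// Illustrative extract -> execute -> write-back cycle for a Particle[] (not Aparapi source).
int n = particles.length;
int[] xs = new int[n], ys = new int[n], zs = new int[n];
for (int i = 0; i < n; i++) {        // extract the accessed fields into parallel arrays
    xs[i] = particles[i].getX();
    ys[i] = particles[i].getY();
    zs[i] = particles[i].getZ();
}
kernel.execute(n);                   // the kernel reads/writes xs, ys, zs on the device
for (int i = 0; i < n; i++) {        // return the results to the original object array
    particles[i].setX(xs[i]);
    particles[i].setY(ys[i]);
    particles[i].setZ(zs[i]);
}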

36 | FUTURE WORK
• Sync with 'Project Lambda' (Java 8) and allow kernels to be represented as lambda expressions
• Continue to investigate automatic extraction of buffer transfers from object collections
• Hand more explicit control to 'power users'
  – Explicit buffer (or even sub-buffer) transfers
  – Expose local memory and barriers
• Open source
  – Aiming for a Q3 open source release of Aparapi
  – License TBD, probably a BSD variant
  – Still reviewing hosting options
  – Encourage community contributions

37 | SIMILAR INTERESTING/RELATED WORK
• Tidepowerd
  – Offers a similar solution for .NET
  – NVIDIA cards only at present
• java-gpu
  – An open source project for extracting kernels from nested loops
  – Extracts code structure from bytecode
  – Creates CUDA behind the scenes
• GRAPHITE-OpenCL
  – Auto-detects data parallel loops in the gcc compiler and generates OpenCL + host code for those loops

38 | SUMMARY
• APUs/GPUs offer unprecedented performance for the appropriate workload
• Don't assume everything can/should execute on the APU/GPU
• Profile your Java code to uncover potential parallel opportunities
• Aparapi provides an ideal framework for executing data-parallel code on the GPU
• Find out more about Aparapi at
• Participate in the upcoming Aparapi Open Source community

QUESTIONS

40 | Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, AMD Radeon, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. OpenCL is a trademark of Apple Inc. used under license to the Khronos Group, Inc. NVIDIA, the NVIDIA logo, and CUDA are trademarks or registered trademarks of NVIDIA Corporation. Java, JVM, JDK and "Write Once, Run Anywhere" are trademarks of Oracle and/or its affiliates. © 2011 Advanced Micro Devices, Inc. All rights reserved.