Using Open64 for High Performance Computing on a GPU
Mike Murphy, Gautam Chakrabarti, and Xiangyun Kong
© 2010 NVIDIA Corporation


Outline
- Background and Overview
- Functionality Work
- Performance Work
- Concluding Thoughts

Why Use a GPU?
- Sequential processors have hit a wall
- The GPU is an efficient parallel processor:
  - lots of big ALUs
  - multithreading can hide latency
  - context switching is basically free
  - all threads run the same sequential program: SIMT (Single Instruction, Multiple Thread)

GPU for Compute
- CUDA (Compute Unified Device Architecture):
  - augments C/C++ with minimal abstractions
  - divides programs into sequential host code and parallel device code
  - lets programmers focus on parallel algorithms

CUDA Example

    // Compute vector sum C = A + B
    // Each thread performs one pair-wise addition
    __global__ void vecAdd(float* A, float* B, float* C, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n)
            C[i] = A[i] + B[i];
    }

    // Host code
    int main()
    {
        // Run N/256 blocks of 256 threads each
        vecAdd<<<N/256, 256>>>(d_A, d_B, d_C, n);
    }

CUDA Successes
- The Nebulae computer uses NVIDIA GPUs: #4 on the Green500 list (MFLOPS/Watt), #2 on the Top500 list
- DARPA “exascale supercomputer” grant just announced
- Thousands of applications, e.g. AMBER (scientific), Numerix (financial), Adobe Premiere Pro (video), PhysX (game collision physics)

CUDA Accelerating Computation
- 146X: interactive visualization of volumetric white matter connectivity
- 36X: ionic placement for molecular dynamics simulation on GPU
- 19X: transcoding HD video stream to H.264
- 17X: simulation in MATLAB using a .mex-file CUDA function
- 100X: astrophysics N-body simulation
- 149X: financial simulation of LIBOR model with swaptions
- 47X: an M-script API for linear algebra operations on GPU
- 20X: ultrasound medical imaging for cancer diagnostics
- 24X: highly optimized object-oriented molecular dynamics
- 30X: Cmatch exact string matching to find similar proteins and gene sequences

Why Open64?
- Previously, GPUs were hard to program:
  - used an optimizing assembler on short shaders
  - did scheduling, register allocation, and peephole optimizations
- For CUDA we want the ability to code in C/C++:
  - needed a high-level optimizing compiler
  - Open64 was open source and a good optimizer

Where Open64? (compilation flow)
- CUDA code is split by cudafe into host code and device code
- host code goes to the host compiler
- device code goes to Open64, which emits ptx
- ptxas compiles ptx into device ELF code, which is combined with the host code into the executable

What Open64?
- Flow: preprocessed input → gfec → inliner → be → ptx
- No Fortran, no IPA, no LNO, minimal CG
- ptxas does register allocation, scheduling, and peephole optimizations

How Open64?
- Functional enhancements
- Performance enhancements

Windows Host
- Hosted on 32- and 64-bit Linux, Mac, and Windows
- The Windows build uses MinGW: Cygwin is needed to build, but the compiler runs without Cygwin; it can also be built with Visual Studio from Cygwin
- No DSOs or DLLs: be, wopt, cg, and target are combined into one executable

PTX Target
- Unlimited virtual registers of different sizes
- Explicit memory spaces (e.g. ld.global)
- Strongly typed instructions
- No stack
- Abstracted call syntax
- Vector memory accesses

Handling Virtual Registers
- PTX has unlimited virtual registers of different sizes
- By default, targ_info and cg use static arrays of registers
- That causes compile-time problems when there are 100,000 registers
- Most of the per-register info is the same, so use sparse arrays, hash maps, or no array at all (recalculate)
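
As an illustration (a hypothetical sketch, not the actual targ_info/cg data structures): a sparse map with a shared default entry replaces the statically sized per-register array.

    #include <cstdint>
    #include <unordered_map>

    // Sketch: most registers share the same info, so store only the
    // exceptions in a sparse map instead of a 100,000-entry static array.
    struct RegInfo { uint8_t size_bytes; bool is_live; };

    class RegInfoTable {
        std::unordered_map<uint32_t, RegInfo> overrides_;  // touched regs only
        RegInfo default_{4, false};                        // the common case
    public:
        const RegInfo &get(uint32_t reg) const {
            auto it = overrides_.find(reg);
            return it == overrides_.end() ? default_ : it->second;
        }
        void set(uint32_t reg, RegInfo ri) { overrides_[reg] = ri; }
    };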

PTX Target – No Stack
- Try to store all local variables in registers, even for –g (the local memory space is limited)
- Keep small structs and unions in registers (enhancements in VHO and CGEXP)
- Use local memory if a value cannot be put in a register (e.g. its address is taken)

Abstracted call syntax:
- Use the param space in PTX; ptxas will utilize param registers and the stack
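
For instance (a sketch; the exact placement depends on the compiler version): a small struct local can stay in registers, while taking its address forces it into the local memory space.

    struct Pair { float x, y; };

    __device__ float norm2(float a, float b)
    {
        Pair p = {a, b};            // small struct: fields kept in registers
        return p.x * p.x + p.y * p.y;
    }

    __device__ float norm2_addr(float a, float b)
    {
        Pair p = {a, b};
        Pair *q = &p;               // address taken: p goes to local memory
        return q->x * q->x + q->y * q->y;
    }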

How Open64? (continued)
- Functional enhancements
- Performance enhancements

Vectorizing Memory Accesses
Vector memory accesses save memory latency. We optimize on scalars; then, in CG, we coalesce loads and stores into vectors. For example:

    ld.f32 f1, [arr+4];
    S1;
    ld.f32 f2, [arr+0];

can be vectorized to:

    ld.v2.f32 {f2,f1}, [arr+0];
    S1;

if arr is 8-byte aligned and S1 does not use f2.
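
At the CUDA source level the pattern looks like this (a sketch; whether ld.v2.f32 is actually emitted depends on the alignment CG can prove):

    __global__ void sum2(const float *arr, float *out)
    {
        float b = arr[1];   // ld.f32 f1, [arr+4];
        float a = arr[0];   // ld.f32 f2, [arr+0];
        *out = a + b;       // coalesced: ld.v2.f32 {f2,f1}, [arr+0];
    }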

Rematerialization
- Rematerialize across basic blocks to reduce register pressure
- Some instructions, like shared-memory loads, can be folded into the final object code
- Use dominator info to find the last reaching def: among the defs in BB_dom, the def that dominates the others is the last reaching def
- A def can be rematerialized if there is no intervening def, alias, or barrier
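
A self-contained sketch of the dominator test described above (BB, Def, Dominates, and LastReachingDef are toy stand-ins for the Open64 structures):

    #include <vector>

    struct BB { std::vector<BB*> dominated; };   // blocks this BB dominates

    bool Dominates(BB *a, BB *b) {
        if (a == b) return true;
        for (BB *d : a->dominated) if (d == b) return true;
        return false;
    }

    struct Def { BB *bb; };

    // Among the defs that dominate the use, the last reaching def is the
    // one that all the other dominating defs dominate. It may then be
    // rematerialized at the use, provided no intervening def, alias, or
    // barrier invalidates it (that check is omitted here).
    Def *LastReachingDef(const std::vector<Def*> &defs, BB *use_bb) {
        Def *last = nullptr;
        for (Def *d : defs)
            if (Dominates(d->bb, use_bb) && (!last || Dominates(last->bb, d->bb)))
                last = d;
        return last;
    }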

32->16-bit Optimization
- C rules promote arithmetic to int (32 bits)
- But packing values into 16 bits uses fewer registers
- Some 16-bit instructions are faster (e.g. multiply)
- A pass analyzes 16-bit loads, stores, and converts, propagates the information forwards and backwards, and changes operations to 16-bit when 16 bits are enough
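
For example (a sketch of the kind of code the pass targets):

    // C promotes the short operands to int, but the value is loaded from
    // and stored to 16-bit memory, so the forward/backward analysis can
    // prove 16 bits are enough and keep the multiply a 16-bit instruction.
    __global__ void mul16(const short *a, const short *b, short *c)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        c[i] = (short)(a[i] * b[i]);
    }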

Hierarchy of Memory Spaces
- Per-thread local memory
- Per-block shared memory
- Per-device global memory
- Generic memory overlays the other spaces
(diagram: each thread owns its local memory, each block its shared memory, and kernels on a device share global memory)
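
In CUDA source the spaces correspond to declarations like these (a minimal sketch, assuming a 256-thread block):

    __device__ float d_scale;                  // per-device global memory

    __global__ void blur(const float *in, float *out)
    {
        __shared__ float tile[256];            // per-block shared memory
        float t = in[threadIdx.x];             // per-thread local (a register)
        tile[threadIdx.x] = t * d_scale;
        __syncthreads();
        out[threadIdx.x] = tile[threadIdx.x];
    }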

Handling Memory Spaces
- Infer the address space of every memory access
- Use “generic” (unified) addressing when an access cannot be resolved statically
- A generic access has more latency than a specific memory access
- Specific memory accesses are good for performance
- Hence: pointer class analysis

Pointer Class Analysis
An example:

    __shared__ int sharedvar;
    __device__ void devicefunction(void)
    {
        int *lvar = &sharedvar;
        ... = *lvar;   // generate a generic ld or ld.shared?
        *lvar = ...;   // generate a generic st or st.shared?
    }

Pointer Class Analysis (continued)
Another example:

    __shared__ int sharedvar;
    __device__ void devicefunction(int *input)
    {
        int *lvar = &sharedvar;
        ... = *lvar + *input;   // generic ld or ld.shared for lvar?
        *lvar = ...;            // generic st or st.shared for lvar?
    }

- What address space does “input” point to?
- Inlining may help disambiguate “input”
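
To make the last point concrete, a small hypothetical example (readthrough and caller are illustrative names): inlining exposes which variable the pointer refers to, so the analysis can choose the specific space.

    __shared__ int sharedvar2;

    __device__ int readthrough(int *p)   // p's space unknown here: generic ld
    {
        return *p;
    }

    __device__ int caller(void)
    {
        // After inlining readthrough, the load is visibly from &sharedvar2,
        // so pointer class analysis can emit ld.shared instead of a generic ld.
        return readthrough(&sharedvar2);
    }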

Pointer Class-based Alias Analysis
- Memory accesses to different address spaces do not overlap
- Address-space information is used to help resolve aliases
- Pointer class information is maintained for each memory access
- The Alias Manager takes the pointer class into consideration
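
For example (a sketch): a store to shared memory cannot alias a load from global memory, so the compiler is free to reorder the two accesses.

    __global__ void k(const int *gp, int *out)
    {
        __shared__ int s[256];
        s[threadIdx.x] = threadIdx.x;  // st.shared
        int x = gp[threadIdx.x];       // ld.global: different space, so it
                                       // cannot alias the shared store and
                                       // may be scheduled before it
        out[threadIdx.x] = x + s[threadIdx.x];
    }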

Why Pointer Class Analysis?
- Generic addressing is more expensive than specific addressing
- It improves Open64’s alias analysis
- Specific memory accesses help memory disambiguation in ptxas
- Some optimizations apply only to certain memory spaces (e.g. LDU)
- In summary, pointer class analysis benefits application performance

Variance Analysis
CUDA’s execution model:
- At a given time, all participating threads run the same kernel in parallel
- Each thread has its own registers, thread ID, and local memory

Variance Analysis (2)
- Although all participating threads execute the same instructions, different threads may get different results because their thread IDs differ
- Variance analysis finds the instructions that may produce thread-dependent (variant) results
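
For example (a sketch): values derived from the thread ID are variant, while values computed only from kernel parameters are the same in every thread.

    __global__ void k(float *a, float n)
    {
        int i = threadIdx.x;    // variant: differs per thread
        float s = 2.0f * n;     // non-variant: same for all threads
        a[i] = s;               // the store address is variant (depends on i)
    }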

Why Variance Analysis?
CUDA’s execution model for if (cond) then S1 else S2:
- step 1: all threads execute cond
- step 2: the threads with a true cond execute S1
- step 3: the threads with a false cond execute S2

Why Variance Analysis (2)?
- If cond is not variant, only one of the branches is executed
- If cond is variant, both branches under the condition are executed (sequentially)
- Therefore, avoid placing code into multiple branches under a variant condition

Why Variance Analysis (3)?
An example:

    if (x > 0) S1 else S2 endif
    S3
    if (x > 0) S4 else S5 endif

Assuming x is not changed in any of the statements, the following transformation may be desirable:

    if (x > 0) S1; S3; S4 else S2; S3; S5 endif

But if x is variant, then x > 0 may be variant, and the transformation may increase run time, since S3 could be executed twice.

Why Variance Analysis (4)?
Variance analysis also helps exploit certain architecture-specific properties:
- issue only one load request when a LOAD reads from a non-variant address; otherwise a load request must be issued for each thread
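
A sketch of that load case: every thread reads coeff[0], a non-variant address, so one load request (e.g. PTX ldu) can serve all threads, while the access to a[threadIdx.x] is variant and needs a request per thread.

    __global__ void scaleAll(float *a, const float *coeff)
    {
        a[threadIdx.x] *= coeff[0];   // non-variant address: one shared request
    }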

Variance Analysis Algorithm
1) Build the forward data-flow graph
2) Collect an initial set of variant values
3) Chase the forward data-flow from variant values to find the values they affect
4) Mark variant conditions, collect the new variant values they induce, and go back to 3)
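
A self-contained worklist sketch of steps 2)–4) over a toy data-flow graph (Value and its fields are hypothetical stand-ins for the compiler's IR):

    #include <queue>
    #include <set>
    #include <vector>

    struct Value {
        std::vector<Value*> uses;      // 1) forward data-flow edges
        bool is_branch_cond = false;
        std::vector<Value*> merged;    // values merged under this condition
    };

    std::set<Value*> FindVariant(const std::vector<Value*> &seeds) {
        std::set<Value*> variant;      // 2) initial variant values (thread IDs, ...)
        std::queue<Value*> work;
        for (Value *v : seeds) if (variant.insert(v).second) work.push(v);
        while (!work.empty()) {
            Value *v = work.front(); work.pop();
            for (Value *u : v->uses)   // 3) chase the forward data-flow
                if (variant.insert(u).second) work.push(u);
            if (v->is_branch_cond)     // 4) variant conditions add new values
                for (Value *m : v->merged)
                    if (variant.insert(m).second) work.push(m);
        }
        return variant;
    }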

Concluding Thoughts
Open64 has been used successfully for both functionality and performance in CUDA.

Concluding Thoughts (continued)
What obstacles have we faced in using Open64?
- Open64 was originally designed for superscalar CPUs; the GPU presents different issues with registers and the memory model
- Register pressure is a critical issue, and some optimizations increase register pressure
- The GPL license limits some uses of the compiler in the embedded space

Questions?