Compiling for GPUs
Peter Oostema & Rajnish Aggarwal, 6th March 2019

Papers
- gpucc: An Open-Source GPGPU Compiler
- Lift: A Functional Data-Parallel IR for High-Performance GPU Code Generation
- Twin Kernels: An Execution Model to Improve GPU Hardware Scheduling at Compile Time

Gpucc Motivation
An open-source, "probably the first", LLVM-based GPGPU compiler for CUDA. Nvidia refuses to open source the entire compiler toolchain, hampering:
- Research
- Data-center-specific optimizations
- Security auditing of vendor proprietary code
- Bug turnaround times
- ...
Major contributions:
- A frontend supporting dual-mode compilation
- Opening the space for LLVM-based GPU optimizations
- A host code generator that handles PTX
We'll go over the front-end and the device (GPU) optimizations in some detail, and skim through most of the code generation part.

Gpucc Architecture (Frontend)
Clang-based; parses the mixed host/device source file in two passes using dual-mode compilation, emitting LLVM IR for both the host and the device.
Complications:
- Target-specific macros (__PTX__, __SSE__)
- Language features (exceptions, inline assembly)
- Host and device code can be interdependent and may internally call built-in functions
Standard compilation of a CUDA C/C++ program compiles the functions that run on the device into a virtual ISA format dubbed PTX. The PTX code is then compiled at runtime by the driver into the low-level machine instruction set called SASS (Shader ASSembler) that executes natively on NVIDIA GPU hardware. nvcc uses separate compilation: it first separates the code into device (GPU) and host (CPU) parts and then compiles each, needing 4 passes; the split is done by a splitter, a Clang-based source-to-source translator.
Gpucc solves the template issue: it keeps the complete translation unit, which carries the template instantiation information for both host and device. It predefines all the macros (almost all are agreed upon by C++ and CUDA), which resolves most conflicts; the ones not agreed upon are specially handled by suppressing warnings, and erroring when compiling for the other architecture. It has certain restrictions with respect to function calls, distinguishing them based on caller and callee.
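To see why dual-mode parsing is needed, consider a minimal mixed CUDA source file (a hypothetical example of my own, not from the paper): one translation unit contains device code, host code, and a template instantiated on both sides, so the frontend must see all of it in each compilation pass.

    #include <cstdio>

    template <typename T>
    __host__ __device__ T square(T x) { return x * x; }   // instantiated on host and device

    __global__ void squares(float* out, int n) {          // device code
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = square(out[i]);
    }

    int main() {                                          // host code launching the kernel
        float* d;
        int n = 256;
        cudaMalloc(&d, n * sizeof(float));
        squares<<<(n + 127) / 128, 128>>>(d, n);          // host and device interdepend
        cudaDeviceSynchronize();
        printf("host-side square: %f\n", square(3.0f));
        cudaFree(d);
        return 0;
    }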

Gpucc Architecture (Code generator)
Device:
- After LLVM optimizations, generates PTX using NVPTX (open sourced by Nvidia)
- The PTX produced for the device is injected into the host code as a string literal constant when producing the binary
Host:
- Inserts CUDA runtime API calls to load and launch kernels
- Wraps the PTX in a global struct __cuda_fatbin_wrapper
- Inserts a static initializer __cuda_module_ctor that loads the PTX from the fatbin wrapper and registers the kernels using __cudaRegisterFunctions
- Generates a kernel stub that prepares the arguments for each kernel and launches it
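For intuition, here is a rough hand-written sketch of the kind of glue the host side amounts to, using the public CUDA Driver API rather than the internal runtime entry points gpucc actually emits; the PTX string and the kernel name are placeholders.

    #include <cuda.h>

    // In gpucc's output this would be the NVPTX-generated PTX, embedded as a literal.
    static const char* kPtx = "/* ...PTX text... */";

    int main() {
        CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;
        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);
        cuModuleLoadData(&mod, kPtx);              // load embedded PTX (JIT-ed to SASS)
        cuModuleGetFunction(&fn, mod, "squares");  // look up the kernel by name

        CUdeviceptr d;
        int n = 256;
        cuMemAlloc(&d, n * sizeof(float));
        void* args[] = { &d, &n };                 // the "kernel stub": marshal arguments
        cuLaunchKernel(fn, (n + 127) / 128, 1, 1,  // grid dimensions
                       128, 1, 1,                  // block dimensions
                       0, 0, args, 0);             // shared memory, stream, params, extra
        cuCtxSynchronize();
        cuMemFree(d);
        cuCtxDestroy(ctx);
        return 0;
    }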

Gpucc optimizations
- Loop unrolling and function inlining*: jumps and function calls are more expensive on GPUs, which have no out-of-order execution; unrolling and inlining also create more opportunities for constant propagation and improved register allocation.
- Inferring memory spaces: GPUs have different types of load instructions, and knowing which memory space a pointer refers to lets the compiler emit the faster, space-specific ones. A fixed-point data-flow analysis determines shared spaces; it can infer, for example, that a pointer p accesses the shared memory space.
- Pointer alias analysis: reports two pointers from different memory spaces as not aliasing.
*Can be harmful as well!
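A minimal sketch of the pattern the fixed-point analysis handles (my own example, shaped like the one in the paper): p always points into a __shared__ array, so loads through it can use the cheaper shared-memory load instruction.

    __global__ void blur(const float* in, float* out) {
        __shared__ float buf[256];   // assumes a block size of 256
        float* p = buf;              // p provably points into shared memory
        int i = threadIdx.x;
        int g = blockIdx.x * blockDim.x + i;
        buf[i] = in[g];
        __syncthreads();
        // With p's space inferred, the backend can emit ld.shared.f32 here
        // instead of a generic load that resolves the address space at run time.
        out[g] = (i == 0) ? p[0] : 0.5f * (p[i - 1] + p[i]);
    }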

Gpucc optimizations (contd.)
Straight-line scalar optimizations:
- Strength reduction (SLSR)
- Common subexpression elimination (CSE)
- Pointer arithmetic reassociation (PAR)
- Global reassociation
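A before/after sketch (my own illustration) of what these passes do to the address arithmetic that dominates unrolled GPU code:

    // Before: every access recomputes (i + k) * s from scratch.
    __device__ float before(const float* p, int i, int s) {
        return p[i * s] + p[(i + 1) * s] + p[(i + 2) * s];
    }

    // After strength reduction and pointer reassociation (conceptually):
    // the related addresses become one strength-reduced base plus offsets,
    // so only one multiply survives.
    __device__ float after(const float* p, int i, int s) {
        const float* q = p + i * s;
        return q[0] + q[s] + q[2 * s];
    }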

Gpucc optimizations (contd.)
- Speculative execution
- Bypassing 64-bit divisions: silly but very effective, although it only fired on Google's internal benchmarks!
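The division bypass is easy to sketch: 64-bit division expands to a slow instruction sequence on the GPU, so when both operands happen to fit in 32 bits the code takes a cheap path instead. A minimal hand-written version of the idea (gpucc performs the rewrite on the IR, not in source):

    #include <cstdint>

    __device__ uint64_t div_bypass(uint64_t a, uint64_t b) {
        if (((a | b) >> 32) == 0) {
            // Both operands fit in 32 bits: use the much cheaper 32-bit divide.
            return (uint32_t)a / (uint32_t)b;
        }
        return a / b;  // rare slow path: full 64-bit division
    }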

Gpucc results
On par with nvcc. All of the optimizations actually speed up execution, with a geometric mean improvement of 21%.

Lift: A Functional Data-Parallel IR for High-Performance GPU Code Generation
Problem: optimizing map and reduce patterns is device-specific.
Solution: an internal representation for the parallel patterns used in GPU programs.

What is Lift IR?
- A high-level representation of data-parallel programs
- A compiler from the IR to OpenCL

Lift Functions
- Algorithmic patterns (e.g. map, reduce, iterate)
- Data patterns (e.g. split, join, zip)

What Lift does
Translates a high-level pattern-based program into an OpenCL program.

Lift Program/Graph for Dot Product
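In Lift, the dot product is written as a composition of patterns, roughly reduce(+, 0) ∘ map(*) ∘ zip(xs, ys); what the compiler ultimately produces is an ordinary GPU kernel. Below is a rough equivalent of such generated code, written in CUDA for consistency with the other sketches even though Lift itself emits OpenCL.

    __global__ void dot(const float* xs, const float* ys, float* out, int n) {
        __shared__ float partial[256];   // assumes a block size of 256
        int tid = threadIdx.x;
        float acc = 0.0f;
        // zip + map(*) + per-thread reduce, fused into one strided loop
        for (int i = blockIdx.x * blockDim.x + tid; i < n;
             i += gridDim.x * blockDim.x)
            acc += xs[i] * ys[i];
        partial[tid] = acc;
        __syncthreads();
        // tree reduction over the block's partial sums
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) partial[tid] += partial[tid + s];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = partial[0];  // one partial result per block
    }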

Lift IR optimizations
Array accesses: generate simple OpenCL index expressions for accesses. Composed data-layout patterns otherwise yield deeply nested index computations, which are simplified symbolically.
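For instance (a made-up but representative case), a split(4) immediately undone by a join introduces an index of the form (i / 4) * 4 + i % 4, which the simplifier folds back to i:

    __global__ void copy_naive(float* out, const float* in) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[(i / 4) * 4 + (i % 4)] = in[(i / 4) * 4 + (i % 4)];  // index from join(split(4, xs))
    }

    __global__ void copy_simplified(float* out, const float* in) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];  // (i / 4) * 4 + i % 4 == i
    }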

Lift IR optimizations
- Barrier elimination: threads must synchronize on accesses to the same memory; by tracking memory accesses, the compiler inserts only the barriers that are actually necessary.
- Control flow simplification: only emit for loops for map and iterate functions that have more iterations than threads (see the sketch below).
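A sketch of the control-flow simplification, again in CUDA for consistency (my own example): when the compiler can prove there are at least as many threads as elements, the strided loop a map would normally need collapses into a single guarded statement.

    // Generic lowering of map(f): each thread loops over a strided chunk.
    __global__ void map_general(float* out, const float* in, int n) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            out[i] = 2.0f * in[i];
    }

    // With #threads >= n known at compile time, the loop body runs at most
    // once, so it simplifies to a guard (or to nothing when #threads == n).
    __global__ void map_simplified(float* out, const float* in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];
    }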

Lift IR results
Performance of Lift-generated code compared to manually optimized OpenCL.

Twin Kernels Overview
- An execution model that improves GPU hardware scheduling at compile time.
- Improves the performance of memory-bound kernels by emitting two binaries for the same code (a "twin kernel").
- Relaxes the strict ordering of execution on GPUs, where all warps otherwise march through the same instruction schedule in lock-step.
- A compiler-based solution requiring minimal (initialization of the program counter) or no change to the hardware.
- In a way, it performs out-of-order execution for an in-order processor.
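A toy source-level illustration of the idea (entirely my own; the real system reorders instruction schedules in the emitted binary, not in CUDA source): give even and odd warps equivalent but differently ordered code, so their memory requests stop arriving in lock-step.

    __global__ void twin_axpy(float* y, const float* x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float xv, yv;
        if (((threadIdx.x >> 5) & 1) == 0) {  // even warps: schedule A
            xv = x[i];
            yv = y[i];
        } else {                              // odd warps: schedule B, same ops reordered
            yv = y[i];
            xv = x[i];
        }
        y[i] = a * xv + yv;
    }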

Twin Kernel Motivation

Twin Kernels Implementation
- The candidate with the heavier register usage is used for the register usage information.
- The finalizer chooses among the candidate binaries and permutes them to find which two give the best performance in combination, then passes them to the assembler, which emits both in a single file.
- How to pick which binary to execute? Performance tuning based on memory distance: each schedule's performance is profiled, and profiling and evaluation are considered together to build an execution schedule, giving higher probability to kernels that execute faster.
- The last step is execution of the best chosen kernels.

Twin Kernels Results 8% faster across all platforms.

Discussion and Topics
- What could be added to the gpucc IR for optimization?
- Would generating non-OpenCL code from a data-parallel IR have room for optimizations?
- Can we apply Twin Kernels to other devices with similar bottlenecks?
- Can we make the Twin Kernels model more generic, identifying and scattering bottleneck instructions?
- Can we use machine learning to pick the best Twin Kernels?