NVIDIA’s Experience with Open64
Mike Murphy, NVIDIA
© NVIDIA Corporation 2008

Outline
- Why Open64
- How we use Open64
- What we did to Open64
- Future work in Open64

Compiling CUDA for GPUs
- NVCC takes a C/C++ CUDA application and splits it into GPU code and CPU code.
- The compiled GPU and CPU code are then combined into a single executable (a minimal source example follows).
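To make the split concrete, here is a minimal sketch (not taken from the slides; the kernel name, sizes, and launch configuration are invented): a single .cu file contains both the device code that goes down the GPU path and the host code that is handed to the CPU compiler.

    #include <cstdio>

    // Device (GPU) side: nvcc routes this through the GPU toolchain.
    __global__ void scale(float *data, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= factor;
    }

    // Host (CPU) side: compiled by the host C/C++ compiler.
    int main()
    {
        const int n = 256;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));

        scale<<<n / 64, 64>>>(d_data, 2.0f);   // CPU code launching GPU code
        cudaDeviceSynchronize();

        cudaFree(d_data);
        printf("done\n");
        return 0;
    }

Both halves end up in the single executable shown on the right of the slide's diagram.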

Why Open64
- We had a low-level code generator for graphics codes, but for CUDA we needed high-level optimization of C/C++ code.
- Options considered: writing our own compiler, gcc, or Open64.
- Writing our own would take too long; gcc has good long-term support; Open64 gave the best performance (kudos to PathScale).

NVCC processing of GPU code
- cudafe extracts C code for the GPU from the CUDA source.
- nvopencc (Open64) compiles that C code to PTX.
- OCG translates the PTX into GPU object code (see the example below for inspecting these intermediates).
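Not from the slides, but a rough way to see these stages on a real installation: compiling a trivial kernel with nvcc's -ptx option (stop after PTX generation) or -keep option (retain intermediate files) exposes the cudafe output and the PTX emitted by the Open64-based nvopencc. The file and kernel names below are made up.

    // pipeline_demo.cu (hypothetical file name)
    //
    //   nvcc -ptx  pipeline_demo.cu    // writes pipeline_demo.ptx, the nvopencc output
    //   nvcc -keep pipeline_demo.cu    // keeps intermediates, including the cudafe-generated C/C++
    //
    // The .ptx file is the hand-off point between nvopencc and OCG.
    __global__ void copy(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }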

Changes: Rehosting Open64
- Our compiler has to run on 32- and 64-bit Linux, 32- and 64-bit Windows, and Mac OS.
- The main Open64 source tree supports only Linux.
- This is an area where sharing our changes can help grow the user base by making it easier to port Open64.
- For Windows we build using MinGW under Cygwin.

Changes: Memory and registers
- We don't have a stack or fast memory, so we want to keep data in registers.
- Inline everything and optimize as much as possible.
- Try to keep small structs in registers by expanding struct copies into per-field copies, rather than taking the address and generating a loop to copy bytes (see the sketch below).
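A sketch of the struct-copy expansion, using an invented Pair type and function; this is illustrative source, not the compiler's actual output. The point is that a small aggregate assignment becomes per-field register moves instead of an addressed byte-copy loop.

    struct Pair { float x; float y; };   // small struct: fits in two registers

    __device__ Pair pick(Pair a, Pair b, bool takeFirst)
    {
        // Source level: a whole-struct copy.
        Pair r = takeFirst ? a : b;

        // What the optimization aims for, conceptually:
        //   r.x = takeFirst ? a.x : b.x;
        //   r.y = takeFirst ? a.y : b.y;
        // i.e. field-by-field moves kept in registers, never a loop that
        // copies bytes through a taken address.
        return r;
    }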

Changes: Vector loads and stores
- Coalesce adjacent loads and stores into vector accesses for performance (see the sketch below).
- Do this in CG (the code generator):
  - Iterate through the ops, trying to add them to vectors.
  - Check for intervening kills.
  - Change alignment and use dummy registers for padding if that helps create a wider vector (e.g., a 4-word vector for a 3-word struct).
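A rough illustration of the payoff, with an invented 3-word struct and kernel; this is not compiler code. Given suitable alignment, the three scalar loads can be replaced by one 4-word vector load, with the padding word read into a dummy register.

    struct __align__(16) Vec3 { float x, y, z; };   // 3 words of data, aligned (and padded) to 16 bytes

    __global__ void sum3(const Vec3 *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Naively this is three scalar loads: in[i].x, in[i].y, in[i].z.
        // With 16-byte alignment the backend can instead issue a single
        // 4-word vector load (e.g. PTX ld.global.v4.f32), discarding the
        // fourth (padding) word via a dummy register.
        Vec3 v = in[i];
        out[i] = v.x + v.y + v.z;
    }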

Changes: 16-bit optimization
- It is cheaper to use 16-bit registers and operations, but C promotes shorts to int.
- So we add a pass in CG that converts back to 16-bit (see the sketch below):
  - Mark 16-bit loads, stores, and converts.
  - Propagate 16-bit-ness forwards and backwards.
  - Unmark 16-bit-ness where an operation cannot be 16-bit.
  - Change the remaining registers and instructions to be 16-bit.
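A source-level illustration of what that pass recovers, with invented names; this is not compiler code. C's integer promotion makes the arithmetic below formally 32-bit, but because only the low 16 bits survive the store, the whole chain can legally be demoted back to 16-bit operations.

    __global__ void add_shorts(const short *a, const short *b, short *c)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Per C's promotion rules, a[i] and b[i] are widened to int and
        // added as 32-bit values, then truncated back to short by the
        // store.  The CG pass marks the loads, the add, and the store as
        // 16-bit and drops the widening converts, since only the low 16
        // bits are observable.
        c[i] = a[i] + b[i];
    }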

Future work
- Grown from 1 person to 4 people working with Open64.
- New application TBA.
- Merging changes into the trunk (thanks to Sun Chan and Shin!).
- Investigating register pressure in WOPT: we want better control of register pressure during optimization.
- Investigating other features (LNO, IPA, etc.).

Questions?