SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot: An Open Source Debugging and Compilation Framework for CUDA Gregory.

Slides:

Advertisements

Similar presentations

Instructor Notes This lecture describes the different ways to work with multiple devices in OpenCL (i.e., within a single context and using multiple contexts),

Advertisements

Intermediate GPGPU Programming in CUDA

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Characterization and Transformation of Unstructured Control Flow in GPU.

Instructor Notes We describe motivation for talking about underlying device architecture because device architecture is often avoided in conventional.

Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.

A Complete GPU Compute Architecture by NVIDIA Tamal Saha, Abhishek Rawat, Minh Le {ts4rq, ar8eb,

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Sponsors: National Science Foundation, LogicBlox Inc., and NVIDIA Kernel.

Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.

1 Lawrence Livermore National Laboratory By Chunhua (Leo) Liao, Stephen Guzik, Dan Quinlan A node-level programming model framework for exascale computing*

Instructor Notes This lecture deals with how work groups are scheduled for execution on the compute units of devices Also explain the effects of divergence.

GPUs on Clouds Andrew J. Younge Indiana University (USC / Information Sciences Institute) UNCLASSIFIED: 08/03/2012.

Acceleration of the Smith– Waterman algorithm using single and multiple graphics processors Author : Ali Khajeh-Saeed, Stephen Poole, J. Blair Perot. Publisher:

Contiki A Lightweight and Flexible Operating System for Tiny Networked Sensors Presented by: Jeremy Schiff.

Programming with CUDA, WS09 Waqar Saleem, Jens Müller Programming with CUDA and Parallel Algorithms Waqar Saleem Jens Müller.

Informationsteknologi Friday, November 16, 2007Computer Architecture I - Class 121 Today’s class Operating System Machine Level.

Ritu Varma Roshanak Roshandel Manu Prasanna

CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.

Chapter 2: Impact of Machine Architectures What is the Relationship Between Programs, Programming Languages, and Computers.

The PTX GPU Assembly Simulator and Interpreter N.M. Stiffler Zheming Jin Ibrahim Savran.

University of Michigan Electrical Engineering and Computer Science Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke Sponge: Portable.

To GPU Synchronize or Not GPU Synchronize? Wu-chun Feng and Shucai Xiao Department of Computer Science, Department of Electrical and Computer Engineering,

Accelerating SQL Database Operations on a GPU with CUDA Peter Bakkum & Kevin Skadron The University of Virginia GPGPU-3 Presentation March 14, 2010.

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY OCELOT: PTX EMULATOR 1.

SAGE: Self-Tuning Approximation for Graphics Engines

Previous Next 06/18/2000Shanghai Jiaotong Univ. Computer Science & Engineering Dept. C+J Software Architecture Shanghai Jiaotong University Author: Lu,

Training Program on GPU Programming with CUDA 31 st July, 7 th Aug, 14 th Aug 2011 CUDA Teaching UoM.

OpenMP in a Heterogeneous World Ayodunni Aribuki Advisor: Dr. Barbara Chapman HPCTools Group University of Houston.

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Sponsors: National Science Foundation, LogicBlox Inc., and NVIDIA Kernel.

MacSim Tutorial (In ICPADS 2013) 1. |The Structural Simulation Toolkit: A Parallel Architectural Simulator (for HPC) A parallel simulation environment.

CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA

Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2012.

CSC 310 – Imperative Programming Languages, Spring, 2009 Virtual Machines and Threaded Intermediate Code (instead of PR Chapter 5 on Target Machine Architecture)

CUDA Optimizations Sathish Vadhiyar Parallel Programming.

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY OCELOT: SUPPORTED DEVICES.

Hardware process When the computer is powered up, it begins to execute fetch-execute cycle for the program that is stored in memory at the boot strap entry.

Automatic translation from CUDA to C++ Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo 31 August, 2015.

GPU Architecture and Programming

Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.

(1) Kernel Execution ©Sudhakar Yalamanchili and Jin Wang unless otherwise noted.

Issues Autonomic operation (fault tolerance) Minimize interference to applications Hardware support for new operating systems Resource management (global.

Harmony: A Run-Time for Managing Accelerators Sponsor: LogicBlox Inc. Gregory Diamos and Sudhakar Yalamanchili.

GPUs: Overview of Architecture and Programming Options Lee Barford firstname dot lastname at gmail dot com.

OpenCL Programming James Perry EPCC The University of Edinburgh.

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY HPCDB Satisfying Data-Intensive Queries Using GPU Clusters November.

Compiler and Runtime Support for Enabling Generalized Reduction Computations on Heterogeneous Parallel Configurations Vignesh Ravi, Wenjing Ma, David Chiu.

An Execution Model for Heterogeneous Multicore Architectures Gregory Diamos, Andrew Kerr, and Sudhakar Yalamanchili Computer Architecture and Systems Laboratory.

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY OCELOT: PTX EMULATOR 1.

Processes and Virtual Memory

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY EXAMPLE: ADDING NEW INSTRUCTIONS - PREFETCH 1.

Full and Para Virtualization

1 University of Maryland Runtime Program Evolution Jeff Hollingsworth © Copyright 2000, Jeffrey K. Hollingsworth, All Rights Reserved. University of Maryland.

Hardware process When the computer is powered up, it begins to execute fetch-execute cycle for the program that is stored in memory at the boot strap entry.

Introduction to CUDA CAP 4730 Spring 2012 Tushar Athawale.

Shangkar Mayanglambam, Allen D. Malony, Matthew J. Sottile Computer and Information Science Department Performance.

Sunpyo Hong, Hyesoon Kim

The Execution System1. 2 Introduction Managed code and managed data qualify code or data that executes in cooperation with the execution engine The execution.

My Coordinates Office EM G.27 contact time:

1 The user’s view  A user is a person employing the computer to do useful work  Examples of useful work include spreadsheets word processing developing.

Chapter Goals Describe the application development process and the role of methodologies, models, and tools Compare and contrast programming language generations.

Computer Engg, IIT(BHU)

Gwangsun Kim, Jiyun Jeong, John Kim

Processes and threads.

Many-core Software Development Platforms

GPU Programming using OpenCL

©Sudhakar Yalamanchili and Jin Wang unless otherwise noted

Outline Operating System Organization Operating System Examples

6- General Purpose GPU Programming

Dynamic Binary Translators and Instrumenters

Peter Oostema & Rajnish Aggarwal 6th March, 2019

Presentation transcript:

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot: An Open Source Debugging and Compilation Framework for CUDA Gregory Diamos*, Andrew Kerr*, Sudhakar Yalamanchili Computer Architecture and Systems Laboratory School of Electrical and Computer Engineering Georgia Institute of Technology Special thanks to our sponsors: IBM, Intel, LogicBlox, NSF, NVIDIA

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Outline Ocelot Vision and Goals Multiplatform Dynamic Compilation for CUDA Performance Modeling CUDA Productivity Tools

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Productivity Challenges of Heterogeneity Execution Portability – Systems evolve over time – New systems Performance – Debugging and optimization  Application Migration – Protect investments in existing code bases esd.lbl.gov math.harvard.edu Sandia.gov Harmony Ocelot/LLVMNVIDIA JIT OS/VM Language Front-End Emerging Software Stacks Productivity Tools

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot Vision Just-in-time code generation and optimization for CUDA applications Basis for a range of productivity and workload characterization tools esd.lbl.gov

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Project Goals Encourage proliferation of GPU computing Lower the barriers to entry for researchers and developers Understand performance behavior of CUDA applications on multiple platforms Translate, optimize, debug, and execute CUDA applications on multiple platforms

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Outline Ocelot Vision and Goals Multiplatform Dynamic Compilation for CUDA Performance Modeling CUDA Productivity Tools

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot: Dynamic Execution Infrastructure 7 Dynamic Translation (LLVM)PTX Emulator Performance Analysis and Modeling Productivity ToolsMultiplatform Support Multi-GPU Support PTX Kernel

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot PTX Intermediate Representation (IR) Backend compiler framework for PTX Full-featured PTX IR Class hierarchy for PTX instructions/directives PTX control flow graph Static single-assignment form Dataflow/dominance analysis Enables PTX optimization IR to IR translation From PTX to other IRs LLVM (x86/PowerPC/ARM) CAL (AMD GPUs) PTX Kernel

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Example Usage Scenarios Domain specific IR PTX IR Interface Switchable Compute Run- Time Flexible Event Interface Programming Languages Compiler Optimizations Operating Systems & Run-Times Computer Architecture Research Optimization of Database Applications Thread fusion, control flow optimization, etc. Kernel scheduling and data movement optimizations Instruction and/or memory trace generation Research CommunitiesOcelot InterfaceDriving Research Workload Characterization/Tuning Flexible Event Interface Control flow behavior, data sharing patterns

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY NVIDIA’s Parallel Thread eXecution ISA PTX defines an execution model that abstracts cores as CTAs and hardware threads/SIMD units as threads

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot PTX Emulator Performs functional simulation of the PTX execution model Enables detailed performance evaluation/debugging Program correctness checking Workload characterization and modeling Architecture simulation PTX 1.4 support Fermi support in progress Abstract machine model

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY CUDA on CPUs Execution model translation Ocelot fuses CUDA threads into heavy-weight pthreads

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Switchable Compute Switch devices at runtime Load balancing Instrumentation Fault-and-emulate Remote execution

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Outline Ocelot Vision and Goals Multiplatform Dynamic Compilation for CUDA Performance Modeling CUDA Productivity Tools

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Statistical Modeling Prior work has used Ocelot for statistical workload/architecture modeling Application/machine metrics collected via static analysis, instrumentation, and emulation Principal Component Analysis (PCA) combines related metrics Cluster analysis identifies related sets of applications Regression models use metrics to predict application behavior

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Performance Predictions New application on same GPU Left 12 applications train right 12 GTX280 Same applications on new GPU Trained on: 8600 GS, GTX280, C1060 Test: GeForce 8800 GTX

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Outline Ocelot Vision and Goals Multiplatform Dynamic Compilation for CUDA Performance Modeling Productivity Tools Program correctness Performance Workload characterization

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY PTX Emulator – CUDA Debugging – Race Detection // file: raceCondition.cu __global__ void raceCondition(int *A) { __shared__ int SharedMem[64]; SharedMem[threadIdx.x] = A[threadIdx.x]; // no synchronization barrier! A[threadIdx.x] = SharedMem[64 - threadIdx.x]; // line 9 - faulting load }... raceCondition >>( validPtr );... ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z13raceConditionPi" with exception: ==Ocelot== [PC 15] [thread 0] [cta 0] ld.shared.s32 %r14, [%r ] - Shared memory race condition, address 0xfc was previously written by thread 63 without a memory barrier in between. ==Ocelot== Near raceCondition.cu:9:0

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY PTX Emulator – Performance Tuning.... for (int offset = 1; offset < n; offset *= 2) {// line 61 pout = 1 - pout; pin = 1 - pout; __syncthreads(); temp[pout*n+thid] = temp[pin*n+thid]; if (thid >= offset) { temp[pout*n+thid] += temp[pin*n+thid - offset]; }.... Identify critical regions Memory demand Floating-point intensity Shared memory bank conflicts hot cold

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Questions? Download Ocelot: (New BSD License) Contact us: Mailing list:

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Additional Material

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Example PTX  PTX Optimization Ocelot includes a comprehensive optimization framework for PTX Implementation modeled after LLVM passes Optimizations can be applied dynamically between kernel executions

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY PTX IR – CPU Backend – Memory Traversal 2 This pattern significantly affects throughput-bound applications

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot CUDA Runtime A complete reimplementation of the CUDA Runtime API Clean device abstraction All back-ends implement same interface Compatible with existing applications Link against libocelot.so instead of libcudart Ocelot API Extensions Add/remove trace generators Compile/launch kernels directly in PTX Device memory sharing among host threads Device switching

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Outline Ocelot Vision and Goals Multiplatform Dynamic Compilation for CUDA Full Function Instruction Set Emulation Productivity Tools Correctness and debugging functionality Performance Analysis and Workload Characterization

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY PTX Emulator – CUDA Debugging- Erroneous Memory Accesses // file: memoryCheck.cu __global__ void badMemoryReference(int *A) { A[threadIdx.x] = 0; // line 3 - faulting store } int main() { int *invalidPtr = 0x0234; // arbitrary pointer does not refer // to an existing memory allocation int *validPtr = 0; cudaMalloc((void **)&validPtr, sizeof(int)*64); badMemoryReference >>( invalidPtr ); return 0; } ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z18badMemoryReferencePi" with exception: ==Ocelot== [PC 5] [thread 0] [cta 0] st.global.s32 [%r4 + 0], %r0 - Global memory access 0x234 is not within any allocated or mapped range. ==Ocelot== ==Ocelot== Nearby Device Allocations ==Ocelot== [0x12fa2e0] - [0x12fa3e0] (256 bytes) ==Ocelot== ==Ocelot== Near memoryCheck.cu:3:0

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY PTX Emulator – Interactive Debugger Interactive command-line debugger implemented using TraceGenerator interface $./TestCudaSequence A_gpu = 0x16dcbe0 (ocelot-dbg) Attaching debugger to kernel 'sequence' (ocelot-dbg) (ocelot-dbg) watch global address 0x16dcbe4 s32[3] set #1: watch global address 0x16dcbe4 s32[3] - 12 bytes (ocelot-dbg) (ocelot-dbg) continue st.global.s32 [%r11 + 0], %r7 watchpoint #1 - CTA (0, 0) thread (1, 0, 0) - store to 0x16dcbe4 4 bytes old value = -1 new value = 2 thread (2, 0, 0) - store to 0x16dcbe8 4 bytes old value = -1 new value = 4 thread (3, 0, 0) - store to 0x16dcbec 4 bytes old value = -1 new value = 6 break on watchpoint (ocelot-dbg) Enables Inspection of application state Single-stepping of instructions Breakpoints and watchpoints Faults in MemoryChecker and RaceDetector invoke ocelot- dbg automatically

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Trace Generator Interface Emulator broadcasts events to trace generators during execution Events describe instruction execution, memory addresses, etc. Used for error checking, instrumentation, and simulation

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Trace Generator Interface //! Base class for generating traces class TraceGenerator { public: TraceGenerator(); virtual ~TraceGenerator(); // called when a traced kernel is launched to // retrieve some parameters from the kernel virtual void initialize( const executive::ExecutableKernel& kernel); // Called whenever an event takes place. virtual void event(const TraceEvent & event); // called when an event is committed virtual void postEvent(const TraceEvent & event); // Called when a kernel is finished. There will // be no more events for this kernel. virtual void finish(); };

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY TraceEvent Object Captures execution of a dynamic instruction PC and internal representation of PTX instruction Set of active threads executing instruction Kernel grid and CTA dimensions Memory addresses and size of transfer Branch target(s)

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Example – Thread Load Imbalance //! Computes number of dynamic instructions for each thread class ThreadLoadImbalance: public trace::TraceGenerator { public: std::vector dynamicInstructions; // For each dynamic instruction, increment counters of each // thread that executes it virtual void event(const TraceEvent & event) { if (!dynamicInstrucitons.size()) dynamicInstructions.resize(event.active.size(), 0); for (int i = 0; i < event.active.size(); i++) { if (event.active[i]) dynamicInstructions[i]++; } }; Mandelbrot (CUDA SDK)

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Outline Ocelot Vision and Goals Multiplatform Dynamic Compilation for CUDA Full Function Instruction Set Emulation Productivity Tools Correctness and debugging functionality Performance Analysis and Workload Characterization

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Workload Characterization Describe applications using architecture-independent metrics Control flow, memory intensity/efficiency, data sharing, parallelism Many functions constructed using the trace generator interface

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Activity Factor Fraction of threads active for each dynamic instruction Utilization of CTA-wide SIMD processor Sensitive to reconvergence mechanism

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Interthread Data Flow Which kernels exchange computed results through shared memory? Track id of producer thread Ensure threads are well synchronized Optionally ignore uses of shared memory to transfer working sets

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY PTX Emulator – Statistical Modeling Data flow correlated to Kernels launched, live registers Kernels imply barriers Data sharing across CTAs Applications that share data do so at all levels of the thread hierarchy

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Architecture Simulation Ocelot PTX emulator has been used as an architecture simulator front-end Can explore impact of architectural changes eg. Instruction fetch and memory scheduling policies MacSim Architecture Simulator (H. Kim-GT) J. Lee, N. B. Lakshminarayana, H. Kim, R.Vuduc, “Many-Thread Aware Prefetching Mechanisms for GPGPU Applications,” MICRO 43, December 2010.

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ongoing Work

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Ocelot: PTX Emulator CUDADatalogOpenCL Correctness & Performance Tools Harmony Runtime Instrumentation database Enterprise SW front- end from LogicBlox Inc. (in progress) Future PlansToday Ocelot Front Ends

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY AMD Backend Goal: Explore CUDA a cross-platform alternative to OpenCL CUDA as a generic programming model Understand the relationships between algorithms and architecture Explore execution and performance portability issues Architecture specific code generation issues, VLIW+SIMD vs. Superscalar+SIMD Unstructured control flow People: Rodrigo Dominguez

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Software Warp-Formation People: Andrew Kerr

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Harmony Runtime: Multiple Devices People: Gregory Diamos

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Contributors Ocelot Team: Gregory Diamos, Rodrigo Dominguez, Naila Farooqui, Andrew Kerr, Si Li, Tri Pho, Haicheng Wu, Sudhakar Yalamanchili Special Thanks: Nathan Bell, Sylvan Collange, Jared Hoberook, David Luebke, Ryuta Suzuki, Steve Worley, Bruno Coutinho

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY PTX IR – CPU Backend – Memory Traversal PTX optimizations can change memory access patterns