1 "MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs" John A. Stratton, Sam S. Stone and Wen-mei W. Hwu. Presentation for class TDT24, Yngve Sneen Lindal

2 The article
Implementation paper. Suggests a source-to-source compiler (CUDA C to C).
Chapters:
– Introduction
– Programming model background (CUDA features, mapping possibilities)
– Kernel translation (implementation challenges, mostly synchronization)
– Implementation and performance
– Related work
– Conclusions

3 Introduction
Motivation: why write the same code twice? The programming models should map well onto each other.
Aims to maintain the synchronization and data-locality benefits of CUDA kernels in order to achieve good performance on the CPU.

4 Programming model background
CUDA: threads are organized in blocks on a grid (hereafter called logical threads). Per-block thread synchronization. Overview of the different memory types. Branching is expensive; warps (SIMD) use a stack-based reconvergence algorithm.
Performance strategy: assign each block to a specific core to avoid synchronization overhead across cores and to preserve high locality. Similar control flow/operations across threads should enable the use of vector instructions.
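As a concrete illustration of this model, here is a minimal CUDA kernel (a generic sketch with made-up names, not a kernel from the paper) that stages data in block-shared memory and uses the per-block barrier:

// Illustrative CUDA kernel: logical threads are grouped into blocks, each
// block has shared memory, and __syncthreads() is a per-block barrier.
// Assumes a block size of exactly 256 threads.
__global__ void pair_sum(float *out, const float *in, int n)
{
    __shared__ float tile[256];                 // block-shared memory
    int tid = threadIdx.x;                      // index within the block
    int gid = blockIdx.x * blockDim.x + tid;    // global logical thread index

    tile[tid] = (gid < n) ? in[gid] : 0.0f;     // stage data in shared memory
    __syncthreads();                            // barrier for the whole block

    if (gid < n)
        out[gid] = tile[tid] + tile[tid ^ 1];   // sum with the paired thread's element
}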

5 Programming model background
Thread-local (register) and block-shared memory fit well in the L1 cache. Constant memory should fit well in the L2 cache (which is often shared among CPU cores).

6 Kernel translation
One OS thread per logical GPU thread would work against the locality goals and be very expensive to schedule. Instead, blocks are assigned to cores and each block is run sequentially. The block's work is expressed as "thread loops", which we will return to shortly.
Involves three explicit transformation stages (performed on the AST):
– Transform a kernel into a serial function (fig. 1)
– Enforce synchronization (translate __syncthreads()) (fig. 2)
– Replicate thread-local data (fig. 3)
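A rough sketch of stage 1 for a barrier-free kernel (my own example, not the paper's fig. 1): the kernel body is wrapped in a thread loop over the logical threads of one block, and the block/thread indices become ordinary parameters.

/* Translated per-block function, sketch only. A kernel such as
 *   __global__ void scale(float *out, const float *in, int n)
 *   { int gid = blockIdx.x * blockDim.x + threadIdx.x;
 *     if (gid < n) out[gid] = 2.0f * in[gid]; }
 * becomes a serial C function with an explicit thread loop: */
void scale_block(float *out, const float *in, int n,
                 int blockIdx_x, int blockDim_x)
{
    for (int tid = 0; tid < blockDim_x; ++tid) {      /* thread loop */
        int gid = blockIdx_x * blockDim_x + tid;
        if (gid < n)
            out[gid] = 2.0f * in[gid];
    }
}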

7 Transforming a thread block into a serial function
Introduces an iterative structure called a "thread loop":
– No synchronization needed inside it
– No side entries or exits
Thread loops expose similar instructions in a "non-branching environment", and thereby help an optimizing C compiler to generate fast code.

8 Enforcing synchronization with deep fission
For loops become while loops to get rid of the initialization and update statements (removing side effects from the loop header).
A loop fission at a synchronization statement S creates two thread loops, placed before and after S, or it splits an existing thread loop into two thread loops (more on that later).
Algorithm 1 is applied to each synchronization statement and, in turn, to the constructs that contain it.
Any conditional affecting a synchronization statement must evaluate the same way (all true or all false) across the threads of a block; this is part of the CUDA spec.
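To make this concrete, here is a hand-written sketch (not the paper's fig. 2) of how the __syncthreads() in the kernel from slide 4 turns into two thread loops; the shared array simply becomes a local array of the per-block function.

/* Per-block function after loop fission around the barrier (sketch only).
 * Assumes a block size of at most 256 for the shared array. */
void pair_sum_block(float *out, const float *in, int n,
                    int blockIdx_x, int blockDim_x)
{
    float tile[256];                                  /* was __shared__ */

    for (int tid = 0; tid < blockDim_x; ++tid) {      /* thread loop 1 */
        int gid = blockIdx_x * blockDim_x + tid;
        tile[tid] = (gid < n) ? in[gid] : 0.0f;
    }
    /* __syncthreads() was here: every logical thread finishes loop 1
     * before any of them starts loop 2, so the barrier semantics hold. */
    for (int tid = 0; tid < blockDim_x; ++tid) {      /* thread loop 2 */
        int gid = blockIdx_x * blockDim_x + tid;
        if (gid < n)
            out[gid] = tile[tid] + tile[tid ^ 1];
    }
}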

9 Enforcing synchronization with deep fission
One more pass over the AST is needed to fix any control flow that would otherwise be incorrect inside the thread loops.

10 Replicating thread-local data
Shared memory: straightforward.
Local variables: "universal replication" turns each local variable into an array[num_threads] with one value per logical thread. Inefficient, since many variables do not actually need a per-thread copy and their storage could be reused.
Live-variable analysis detects which variables can be reused and which must be replicated; replicating only the latter is called "selective replication".
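A sketch of what replication looks like in the translated C (illustrative names, not the paper's): a local that is live across the barrier gets one slot per logical thread, while one that is dead at the barrier stays a plain scalar.

#define MAX_THREADS 256   /* assumed upper bound on the block size */

/* Bounds checks omitted for brevity. */
void replicated_block(float *out, const float *in,
                      int blockIdx_x, int blockDim_x)
{
    int gid[MAX_THREADS];                           /* live across the barrier: replicated */

    for (int tid = 0; tid < blockDim_x; ++tid) {    /* thread loop 1 */
        gid[tid] = blockIdx_x * blockDim_x + tid;
        float tmp = 2.0f * in[gid[tid]];            /* dead at the barrier: not replicated */
        out[gid[tid]] = tmp;
    }
    /* barrier point */
    for (int tid = 0; tid < blockDim_x; ++tid)      /* thread loop 2 */
        out[gid[tid]] += 1.0f;
}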

11 Work distribution and runtime framework
Iterating through all blocks and calling the per-block function for each one is correct, but not optimal on a multi-core processor. Scheduling a portion of the blocks to each core is better and corresponds to the programming model, since blocks are independent.
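For reference, the naive serial runtime would look roughly like this (my sketch; scale_block is the translated per-block function from the earlier sketch):

void scale_block(float *out, const float *in, int n,
                 int blockIdx_x, int blockDim_x);   /* translated per-block function */

void launch_serial(float *out, const float *in, int n,
                   int gridDim_x, int blockDim_x)
{
    /* Correct, but uses only a single core. */
    for (int bx = 0; bx < gridDim_x; ++bx)
        scale_block(out, in, n, bx, blockDim_x);
}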

12 Implementation and performance analysis
Uses OpenMP's "parallel for" to distribute the blocks across multiple OS threads on a multi-core machine.
Benchmarks a selection of algorithms that also have highly optimized CPU versions.
Roughly linear scaling with the number of cores suggests good exploitation of locality and truly "independent" blocks.
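A minimal sketch of that work distribution (assumed names; the real runtime in the paper is more involved): because the translated blocks are independent, a plain OpenMP parallel for over the block indices spreads them across the cores.

void scale_block(float *out, const float *in, int n,
                 int blockIdx_x, int blockDim_x);   /* translated per-block function */

void launch_parallel(float *out, const float *in, int n,
                     int gridDim_x, int blockDim_x)
{
    #pragma omp parallel for schedule(static)       /* one chunk of blocks per core */
    for (int bx = 0; bx < gridDim_x; ++bx)
        scale_block(out, in, n, bx, blockDim_x);
}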

13 Related work
According to the article, no one has done this before (why am I not surprised?).
Nvidia's CUDA CPU emulation is meant for debugging, not performance. MCUDA is less suitable for debugging, since the code is compiled.
Mentions some other frameworks that use different approaches (parallelizing serial code).

14 Conclusions
Translated kernels perform comparably to optimized serial code (based on these benchmarks). That means high locality is preserved and computational regularity is exposed to the optimizing compiler.
Trade-off between portability and performance.

15 My thoughts
Unconventional (and a bit cool) problem, but when do you need the CPU? GPUs are cheap. A reversed GPGPU development cycle?
Maybe some more benchmarks? The examples use a quite simplified kernel.
This is very conceptual, but I guess one has to refine the problem when building something like this. What about C++?