EECE571R -- Harnessing Massively Parallel Processors http://www.ece.ubc.ca/~matei/EECE571/ Lecture 1: Introduction to GPU Programming By Samer Al-Kiswany Acknowledgement: some slides borrowed from presentations by Kayvon Fatahalian and Mark Harris
Outline Hardware Software Programming Model Optimizations
GPU Architecture Intuition
GPU Architecture A GPU attached to a host machine contains N multiprocessors. Each multiprocessor holds M processors, an instruction unit, shared memory, and registers. The device also exposes constant memory, texture memory, and global memory.
GPU Architecture SIMD architecture. Four memory spaces: Device (a.k.a. global): slow (400-600 cycles access latency), large (256 MB - 1 GB) Shared: fast (4 cycles access latency), small (16 KB) Texture: read-only Constant: read-only
GPU Architecture – Program Flow 1. Preprocessing 2. Data transfer in (host to GPU) 3. GPU processing 4. Data transfer out (GPU to host) 5. Postprocessing TTotal = TPreprocessing + TDataHtoG + TProcessing + TDataGtoH + TPostProc
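The five steps above map directly onto a typical CUDA host program. A minimal sketch (the kernel name, sizes, and error handling are illustrative, not from the slides):

```cuda
// Sketch of the five-step program flow; hypothetical kernel "process".
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                        // 3. GPU processing
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);                 // 1. preprocessing on host
    float *d;
    cudaMalloc(&d, bytes);

    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // 2. transfer in
    process<<<(n + 255) / 256, 256>>>(d, n);           // 3. processing
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // 4. transfer out
    // 5. postprocessing on h ...

    cudaFree(d);
    free(h);
    return 0;
}
```

Each step contributes one term to TTotal, which is why later slides optimize each of them separately.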
Outline Hardware Software Programming Model Optimizations
GPU Programming Model Programming Model: Software representation of the Hardware
GPU Programming Model Kernel: a function executed over a grid. The grid is divided into blocks, and each block into threads.
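The grid/block/thread model can be illustrated with a minimal kernel (names are illustrative):

```cuda
// The kernel body runs once per thread; blockIdx, blockDim, and
// threadIdx together give each thread a unique global index.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard against overshoot
}

// Host side: launch a grid with enough 256-thread blocks to cover n.
// add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```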
GPU Programming Model In reality, scheduling granularity is a warp (32 threads); a warp takes 4 cycles to complete a single instruction. Threads in a block can share state through shared memory. Threads in a block can synchronize. Global atomic operations are available.
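All three block-level features (shared memory, barrier synchronization, global atomics) appear together in a block-level reduction; a sketch, assuming a block size of 256 and blockDim.x a power of two:

```cuda
// Threads in a block share state through shared memory, synchronize
// with __syncthreads(), and one global atomic combines block results.
__global__ void sum(const float *in, float *out, int n) {
    __shared__ float buf[256];                       // shared within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                                 // block-wide barrier
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(out, buf[0]);    // global atomic op
}
```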
Outline Hardware Software Programming Model Optimizations
Optimizations Can be roughly categorized into the following categories: Memory Related Computation Related Data Transfer Related
Optimizations - Memory Use shared memory Use texture (1D, 2D, or 3D) and constant memory Avoid shared memory bank conflicts Coalesced memory access (one approach: padding)
Optimizations - Memory Shared Memory Complications Shared memory is organized into 16 banks of 1 KB each. Complication I: Concurrent accesses to the same bank are serialized (a bank conflict), slowing execution. Tip: assign different threads to different banks. Complication II: Banks are interleaved at 4-byte granularity: consecutive 4-byte words (addresses 0, 4, 8, ...) map to consecutive banks (bank 0, bank 1, bank 2, ...).
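The padding approach mentioned earlier follows directly from the interleaving rule; a sketch with a hypothetical 16x16 tile:

```cuda
// With 16 banks interleaved at 4-byte granularity, column access to a
// 16x16 float tile puts every row's element of a column in the SAME
// bank (rows are 16 words apart), serializing the half-warp.
// Padding each row by one element makes rows 17 words apart, so the
// column spreads across all 16 banks and accesses proceed in parallel.
__shared__ float tile[16][16 + 1];   // +1 column of padding

// Column read, conflict-free thanks to the pad:
// float v = tile[threadIdx.x][col];
```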
Optimizations - Memory Global Memory Coalesced Access
Optimizations - Memory Global Memory Non-Coalesced Access
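The contrast between the two access patterns can be sketched in one kernel (illustrative names):

```cuda
// Coalesced vs. non-coalesced global memory access.
__global__ void copy(float *dst, const float *src, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced: consecutive threads touch consecutive addresses, so
    // the hardware combines them into a few wide memory transactions.
    if (i < n) dst[i] = src[i];

    // Non-coalesced (for contrast): a strided pattern such as
    //   dst[i] = src[i * stride];
    // scatters the threads' accesses, forcing separate transactions
    // and wasting most of each memory fetch.
}
```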
Optimizations Can be roughly categorized into the following categories: Memory Related Computation Related Data Transfer Related
Optimizations - Computation Use thousands of threads to best use the GPU hardware. Use full warps (32 threads): make block sizes a multiple of 32. Lower code branch divergence. Avoid synchronization. Loop unrolling (fewer instructions, more room for compiler optimizations).
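Two of these tips, divergence and unrolling, can be sketched concretely (illustrative code, not from the slides):

```cuda
// Threads of one warp execute in lockstep, so divergent branches
// serialize both paths.

// Divergent: neighboring threads in the same warp disagree.
//   if (threadIdx.x % 2 == 0) { ... } else { ... }

// Better: branch at warp granularity so a whole warp takes one path.
//   if ((threadIdx.x / 32) % 2 == 0) { ... } else { ... }

// Loop unrolling: #pragma unroll removes loop-counter instructions
// and gives the compiler more room to optimize.
__global__ void scale(float *a) {
    #pragma unroll
    for (int k = 0; k < 4; k++)
        a[threadIdx.x * 4 + k] *= 2.0f;
}
```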
Optimizations Can be roughly categorized into the following categories: Memory Related Computation Related Data Transfer Related
Optimizations – Data Transfer Reduce amount of data transferred between host and GPU Hide transfer overhead through overlapping transfer and computation (Asynchronous transfer)
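Overlap is typically expressed with streams; a sketch, assuming h_in/h_out are pinned host buffers (cudaMallocHost), "process" is a kernel defined elsewhere, and the work splits cleanly in half:

```cuda
// While stream s0 computes on its half of the data, stream s1's copy
// engine can be transferring the other half, hiding transfer time.
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

cudaMemcpyAsync(d_in,         h_in,         half, cudaMemcpyHostToDevice, s0);
cudaMemcpyAsync(d_in + n / 2, h_in + n / 2, half, cudaMemcpyHostToDevice, s1);
process<<<blocks / 2, threads, 0, s0>>>(d_in,         d_out,         n / 2);
process<<<blocks / 2, threads, 0, s1>>>(d_in + n / 2, d_out + n / 2, n / 2);
cudaMemcpyAsync(h_out,         d_out,         half, cudaMemcpyDeviceToHost, s0);
cudaMemcpyAsync(h_out + n / 2, d_out + n / 2, half, cudaMemcpyDeviceToHost, s1);
cudaDeviceSynchronize();   // wait for both streams to finish
```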
Summary GPUs are highly parallel devices. Easy to program for (functionality). Hard to optimize for (performance). Optimization: many optimizations exist, but you often do not need them all (iterate between profiling and optimization). Optimizations may bring hard tradeoffs (more computation vs. less memory, more computation vs. better memory access, etc.).