GPU baseline architecture and GPGPU-Sim

GPU baseline architecture and GPGPU-Sim. Presented by 王建飞, 2017.9.28

A typical GPGPU: related terminology:
- GPC: SM cluster (graphics processing cluster)
- SM: streaming multiprocessor, also called a SIMT core (single instruction, multiple threads; closely related to SIMD)
On-chip memory:
- RF: register file (large)
- L1D cache: private per SM, weakly coherent
- Shared memory: programmer-controlled scratchpad

Runtime of GPGPU 1:

Runtime of GPGPU 2:
- Warp scheduler: LRR (loose round-robin) or GTO (greedy-then-oldest)
- SIMT stack: post-dominator based reconvergence for branch divergence
- Operand collector: buffers and arbitrates register-file (RF) accesses
- Execution lanes: SP, SFU, MEM

A typical code study 1:
- Constants (fixed per launch): gridDim.x, blockDim.x
- Variables (differ per thread): blockIdx.x, threadIdx.x
- With blocksPerGrid = 32 and threadsPerBlock = 256: gridDim.x = 32, blockDim.x = 256
- __global__: called from the host, executed on the device
- __device__: called from device code, executed on the device
Source: CUDA by Example

A typical code study 2:

GPGPU-Sim: a cycle-level GPU performance simulator focused on "GPU computing" (general-purpose computation on GPUs).
- Replaces the CUDA API and supplies a configurable GPU model
- Simulation model: functional simulation (cuda-sim.h/.cc) plus timing simulation (shader.h/.cc)
- gpu-cache.h/.cc: the cache model

Simulation pipeline:
- register_set: temporary buffer holding instructions between pipeline stages
- m_fu: the functional units: sp, sfu, ldst_unit
References: GPGPU-Sim manual; NVIDIA Fermi/Kepler architecture whitepapers

Instruction Set Architecture:
- PTX: Parallel Thread eXecution, a pseudo-assembly (virtual) instruction set; ptxas lowers it to the native ISA
- SASS: the native GPU ISA (produced after strength reduction, instruction scheduling, and register allocation)
- PTXPlus: PTX extended with the features needed for a one-to-one mapping to SASS

Instruction Set Architecture:

Instruction Set Architecture:

//SASS
S2R R0, SR_CTAid_X;
S2R R2, SR_Tid_X;

//PTX
mov.u32 %r3, %ctaid.x;
mov.u32 %r5, %tid.x;

//PTXPlus
mad.lo.u16 $r0, %ctaid.x, 0x00000200, $r0;
mov.u16 $r4.lo, 0x00000000;

Thanks