Microbenchmarking the GT200 GPU
Objective
- Measure low-level performance
- Speculate about the microarchitecture of the GT200 GPU
Outline
- Pipeline
  - Several pipelines
  - Pipeline latency and throughput
- Memory hierarchy
  - Caches
  - Off-chip DRAM latency and throughput
Pipeline Benchmarks
- Use clock() for timing (compiles to two instructions: a move and a left shift)
- Unrolled sequence of arithmetic ops
- One thread for latency measurement, 512 threads for throughput

  uint start_time = clock();
  repeat256(a = a + b;)
  uint stop_time = clock();
  [Save timers to global memory]
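A minimal device-code sketch of the timing kernel described above. The macro names, kernel name, and output buffer are illustrative, not taken from the original benchmark code:

```cuda
// Sketch of the latency microbenchmark: 256 unrolled dependent adds
// bracketed by clock() reads. REPEAT256 expands to 256 copies of OP.
#define OP a = a + b;
#define REPEAT4(x)   x x x x
#define REPEAT16(x)  REPEAT4(REPEAT4(x))
#define REPEAT256(x) REPEAT16(REPEAT16(x))

__global__ void latency_test(unsigned int *out, unsigned int b)
{
    unsigned int a = threadIdx.x;
    unsigned int start_time = clock();   // read the SM cycle counter
    REPEAT256(OP)                        // unrolled dependent adds
    unsigned int stop_time = clock();
    // Save timers to global memory; storing a as well keeps the
    // compiler from eliminating the measured instructions as dead code.
    out[0] = start_time;
    out[1] = stop_time;
    out[2] = a;
}
```

Launched with one thread, (stop_time - start_time) / 256 approximates per-instruction latency; with 512 threads, it reflects throughput.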
Pipeline Benchmark Caveats
- Unrolled loop must fit in the 8 KB L1 icache
- Variable instruction length (32- and 64-bit)
- Instruction cache cold misses: measure only the second iteration of the loop
- PTX "instructions" do not map one-to-one to real machine instructions
- Optimization by nvopencc and ptxas: need to circumvent optimizations by careful coding
Pipeline Functional Units

  Unit  Example ops               Latency (clocks)  Throughput (ops/clock)
  SP    add, sub (int, float)     24                8
  SFU   rsqrt, log2f (float)      28                2
  DPU   double-precision ops      48                1
SP Pipeline
32-Bit Multiplication
- r1 = r1 * r2 is compiled into four 24-bit operations:

    r5 = r2.lo * r1.hi          (mul24)
    r5 = (r2.hi * r1.lo) + r5   (mad24)
    r5 = r5 << 16               (shl)
    r1 = (r2.lo * r1.lo) + r5   (mad24)

- Latency is 96 cycles
- Throughput is 2 operations per cycle
Instruction Cache
- Sequence of independent 64-bit SP instructions (cvt)
- Loop over a varying amount of code
- Cache misses increase measured runtime
- L2 icache appears to be shared with the L2 constant cache
Instruction Cache Sizes: 8 KB L1, 32 KB L2
Register File Allocation
- Register file has 64 banks
- Cannot utilize the entire register file unless the thread count is a multiple of 64
- No bank conflicts, as alluded to in the manual
- 8 banks per SP allows single-ported memories to be used for the register file: ½ bandwidth for the ALU, ½ bandwidth for memory fetch
Global Memory Read Latency
- 435–450 cycles
- Chain of dependent reads: repeat1024(j = array[j]), where array[j] = j + stride