Effect of Instruction Fetch and Memory Scheduling on GPU Performance Nagesh B Lakshminarayana, Hyesoon Kim.

1 Effect of Instruction Fetch and Memory Scheduling on GPU Performance Nagesh B Lakshminarayana, Hyesoon Kim

2 Outline: Background and Motivation, Policies, Experimental Setup, Results, Conclusion

3 GPU Architecture (based on the Tesla architecture). SM – Streaming Multiprocessor; SP – Scalar Processor; SIMT – Single Instruction, Multiple Thread.

4 SM Architecture (based on the Tesla architecture). Fetch mechanism: fetch one instruction for the selected warp; stall fetch for a warp when it executes a load/store or encounters a branch. Scheduler policy: oldest-first and in-order (within a warp). Caches: I-cache, shared memory, constant cache, and texture cache.

5 Handling Multiple Memory Requests. MSHR/memory request queue: allows merging of memory requests (intra-core). DRAM controller: allows merging of memory requests (inter-core).
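The intra-core merging performed by the MSHRs can be sketched as follows. This is an illustrative model only, not the simulator's implementation; the class and field names (Mshr, pending) are hypothetical.

```python
class Mshr:
    """Toy MSHR file: merges outstanding requests to the same cache line."""

    def __init__(self, line_size=64):
        self.line_size = line_size
        self.pending = {}  # line address -> list of requesting warp ids

    def access(self, warp_id, addr):
        line = addr // self.line_size
        if line in self.pending:
            # Merge: piggyback on the outstanding request;
            # no additional DRAM traffic is generated.
            self.pending[line].append(warp_id)
            return "merged"
        # Allocate a new entry and send one request to DRAM.
        self.pending[line] = [warp_id]
        return "miss"

# Three warps touch the same 64B line: one DRAM request serves all three.
mshr = Mshr()
results = [mshr.access(w, 0x1000 + 4 * w) for w in range(3)]
```

This is exactly the effect exploited by the MonteCarlo example on the next slides: threads with corresponding ids in different blocks read the same locations, so their requests collapse into one.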

6 Intra-core Merging

7 Code Example – Intra-Core Merging (from MonteCarlo in the CUDA SDK):

for (iSum = threadIdx.x; iSum < SUM_N; iSum += blockDim.x) {
    ...
    for (int i = iSum; i < pathN; i += SUM_N) {
        real r = d_Samples[i];
        real callValue = endCallValue(S, X, r, MuByT, VBySqrtT);
        sumCall.Expected += callValue;
        sumCall.Confidence += callValue * callValue;
    }
    ...
}

Notation: A_{X,Y} denotes variable A in block X, thread Y. Multiple blocks are assigned to the same SM, and threads with corresponding ids in different blocks access the same memory locations: iSum_{0,2} = iSum_{1,2} = iSum_{2,2} = 2, hence i_{0,2} = i_{1,2} = i_{2,2} = 2 and r_{0,2} = r_{1,2} = r_{2,2} = d_Samples[2].

8 Inter-core Merging

9 Why look at Fetch? Fetch gives implicit control over the resources allocated to a warp within a core, and hence over the warp's progress; fetching more often for critical warps can boost performance.

10 Why look at DRAM Scheduling? The memory system is a performance bottleneck for several applications. DRAM scheduling decides the order in which memory requests are granted, so it can prioritize warps based on criticality – implicit performance control across cores.

11 By controlling fetch and DRAM scheduling, we can control performance.

12 How is This Useful? Understand applications and their behavior better; detect patterns or behavioral groups across applications; design new policies that improve the performance of GPGPU applications.

13 Outline: Background and Motivation, Policies, Experimental Setup, Results, Conclusion

14 Fetch Policies. Round Robin (RR) [default in the Tesla architecture]. FAIR: ensures uniform progress of all warps. ICOUNT [Tullsen'96]: same as ICOUNT in SMT; tries to increase throughput by giving priority to fast-moving threads. Least Recently Fetched (LRF): prevents starvation of warps.
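Two of these fetch policies can be sketched as warp-selection functions. This is a minimal sketch under assumed per-warp state; the record fields used here (stalled, in_flight, last_fetch) are illustrative, not the simulator's actual bookkeeping.

```python
def icount_select(warps):
    """ICOUNT: fetch for the ready warp with the fewest instructions
    in the pipeline, favouring fast-moving warps (Tullsen'96)."""
    ready = [w for w in warps if not w["stalled"]]
    return min(ready, key=lambda w: w["in_flight"])["id"]

def lrf_select(warps):
    """LRF: fetch for the ready warp fetched longest ago,
    which prevents starvation."""
    ready = [w for w in warps if not w["stalled"]]
    return min(ready, key=lambda w: w["last_fetch"])["id"]

warps = [
    {"id": 0, "stalled": False, "in_flight": 2, "last_fetch": 90},
    {"id": 1, "stalled": True,  "in_flight": 1, "last_fetch": 10},
    {"id": 2, "stalled": False, "in_flight": 5, "last_fetch": 50},
]
icount_pick = icount_select(warps)  # warp 1 is stalled, so warp 0 wins
lrf_pick = lrf_select(warps)        # warp 2 was fetched longest ago
```

Note that the two policies can pick different warps from the same state, which is exactly why their performance diverges on memory-bound applications.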

15 New Oracle-Based Fetch Policies. ALL: gives priority to longer warps (total length until termination); ensures all warps finish at the same time, which results in higher occupancy. Priorities: warp 0 > warp 1 > warp 2 > warp 3.

16 New Oracle-Based Fetch Policies. BAR: gives priority to warps with a greater number of instructions to the next barrier; the idea is to reduce wait time at barriers. Priorities (first example): warp 0 > warp 1 > warp 2 > warp 3; (second example): warp 2 > warp 1 > warp 0 > warp 3.

17 New Oracle-Based Fetch Policies. MEM_BAR: similar to BAR, but gives higher priority to warps with more memory instructions. Priority(Wa) > Priority(Wb) if MemInst(Wa) > MemInst(Wb), or if MemInst(Wa) = MemInst(Wb) and Inst(Wa) > Inst(Wb). Priorities (first example): warp 0 > warp 2 > warp 1 = warp 3; (second example): warp 1 > warp 0 = warp 2 > warp 3.
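The MEM_BAR rule above is a lexicographic comparison, which can be expressed as a sort key: memory-instruction count first, total instruction count as the tie-breaker. The field names (mem_inst, inst) are illustrative.

```python
def mem_bar_priority(warp):
    # Lexicographic key matching the slide's rule:
    # Priority(Wa) > Priority(Wb) if MemInst(Wa) > MemInst(Wb),
    # or if MemInst equal and Inst(Wa) > Inst(Wb).
    return (warp["mem_inst"], warp["inst"])

warps = [
    {"id": 0, "mem_inst": 4, "inst": 20},
    {"id": 1, "mem_inst": 6, "inst": 10},
    {"id": 2, "mem_inst": 4, "inst": 30},
]
# Highest priority first: warp 1 (most memory instructions),
# then warp 2 (ties with warp 0 on mem_inst, wins on inst), then warp 0.
order = [w["id"] for w in sorted(warps, key=mem_bar_priority, reverse=True)]
```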

18 DRAM Scheduling Policies. FCFS. FRFCFS [Rixner'00]. FR_FAIR (new policy): row hit with fairness; ensures uniform progress of warps. REM_INST (new Oracle-based policy): row hit with priority for warps with a greater number of instructions remaining until termination; prioritizes longer warps.
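The FR-FCFS baseline [Rixner'00] can be sketched as a two-level pick: among queued requests for a bank, prefer row-buffer hits; break ties by age. The request representation here (arrival_time, row) is illustrative, not the simulator's.

```python
def frfcfs_pick(queue, open_row):
    """Pick the next request for one bank.

    queue: list of (arrival_time, row) tuples.
    First-Ready: requests hitting the open row go first;
    FCFS: among those, the oldest request wins.
    """
    hits = [r for r in queue if r[1] == open_row]
    candidates = hits if hits else queue
    return min(candidates)  # smallest arrival_time among candidates

queue = [(0, 7), (1, 3), (2, 3)]
hit_pick = frfcfs_pick(queue, open_row=3)   # row hit beats the older request
miss_pick = frfcfs_pick(queue, open_row=9)  # no hits: plain oldest-first
```

FR_FAIR and REM_INST keep the row-hit-first structure but replace the age tie-breaker with a fairness or remaining-instructions criterion, respectively.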

19 Outline: Background and Motivation, Policies, Experimental Setup, Results, Conclusion

20 Experimental Setup. Simulated GPU architecture: 8 SMs; frontend: 1-wide, 1KB I-cache, branch stall; execution: 8-wide SIMD execution unit, in-order scheduling, 4-cycle latency for most instructions; caches: 64KB software-managed cache, 8 load accesses/cycle; memory: 32B-wide bus, 8 DRAM banks; baseline: RR fetch, FRFCFS DRAM scheduling. Trace-driven, cycle-accurate simulator; per-warp traces generated using GPU Ocelot [Kerr'09].

21 Benchmarks. Taken from: CUDA SDK 2.2 – MonteCarlo, Nbody, ScalarProd; PARBOIL [UIUC'09] – MRI-Q, MRI-FHD, CP, PNS; RODINIA [Che'09] – Leukocyte, Cell, Needle. Classification based on lengths of warps: symmetric if divergence <= 2%, asymmetric otherwise (results included in the paper).

22 Outline: Background and Motivation, Policies, Experimental Setup, Results, Conclusion

23 Results – Symmetric Applications. Compute-intensive: no variation with different fetch policies. Memory-bound: improvement with fairness-oriented fetch policies, i.e., FAIR, ALL, BAR, MEM_BAR. Baseline: RR + FRFCFS.

24 Results – Symmetric Applications. On average, better than FRFCFS; MersenneTwister shows a huge improvement; the REM_INST DRAM policy performs similarly to FR_FAIR. Baseline: RR + FRFCFS.

25 Analysis: MonteCarlo (FRFCFS DRAM scheduling). Fairness-oriented fetch policies improve performance by increasing intra-core merging.

26 Analysis: MersenneTwister. Fair DRAM scheduling (FR_FAIR, REM_INST) improves performance by increasing the DRAM row-buffer hit ratio. Baseline: RR + FRFCFS.

27 Analysis: BlackScholes (FRFCFS DRAM scheduling). Fairness-oriented fetch policies increase MLP; the increased MLP together with the higher row-buffer hit ratio improves performance.

28 Outline: Background and Motivation, Policies, Experimental Setup, Results, Conclusion

29 Conclusion. Compute-intensive applications: fetch and DRAM scheduling do not matter. Symmetric memory-intensive applications: fairness-oriented fetch policies (FAIR, ALL, BAR, MEM_BAR) and DRAM policies (FR_FAIR, REM_INST) provide performance improvements – MonteCarlo (40%), MersenneTwister (50%), BlackScholes (18%). Asymmetric memory-intensive applications: no correlation between performance and the fetch and DRAM scheduling policies.

30 THANK YOU!

