EECS 583 – Class 20
Research Topic 2: Stream Compilation, GPU Compilation
University of Michigan
December 3, 2012
Guest speakers today: Daya Khudia and Mehrzad Samadi
Announcements & Reading Material

- Exams graded and will be returned in Wednesday's class
- This class:
  - "Orchestrating the Execution of Stream Programs on Multicore Platforms," M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Jun. 2008
- Next class (Research Topic 3: Security):
  - "Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software," James Newsome and Dawn Song, Proceedings of the Network and Distributed System Security Symposium, Feb. 2005
Stream Graph Modulo Scheduling
Stream Graph Modulo Scheduling (SGMS)

- Coarse-grain software pipelining:
  - Equal work distribution
  - Communication/computation overlap
  - Synchronization costs
- Target: Cell processor
  - Cores with disjoint address spaces
  - Explicit copy to access remote data; DMA engine independent of the PEs
- Filters = operations, cores = function units

[Figure: Cell block diagram: eight SPEs (SPE0-SPE7), each an SPU with a 256 KB local store and an MFC (DMA) engine, connected by the EIB to the PPE (PowerPC) and DRAM]
Preliminaries

- Synchronous Data Flow (SDF) [Lee '87]
- StreamIt [Thies '02]
- Filters push and pop items from input/output FIFOs

Stateless FIR filter in StreamIt:

    int->int filter FIR(int N, int wgts[N]) {
      work pop 1 push 1 {
        int i, sum = 0;
        for (i = 0; i < N; i++)
          sum += peek(i) * wgts[i];
        push(sum);
        pop();
      }
    }

Stateful variant: the filter instead keeps int wgts[N]; as mutable internal state and updates it on each invocation with wgts = adapt(wgts);, creating a dependence between successive invocations.
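To make the peek/pop/push semantics concrete, here is a minimal C sketch of the same FIR filter running against an explicit FIFO; the Fifo type, the fifo_* helpers, and the fixed buffer size are assumptions for illustration, not part of StreamIt:

    /* Hypothetical FIFO model of the stateless FIR filter above. */
    #define N 8

    typedef struct { int buf[1024]; int head, tail; } Fifo;

    int  fifo_peek(Fifo *f, int i) { return f->buf[(f->head + i) % 1024]; }
    int  fifo_pop (Fifo *f)        { return f->buf[f->head++ % 1024]; }
    void fifo_push(Fifo *f, int v) { f->buf[f->tail++ % 1024] = v; }

    /* One work invocation: peeks N items, pushes 1 result, pops 1 input.
       Because the output depends only on FIFO contents (no mutable filter
       state), invocations on disjoint data can run in parallel; this is
       what makes a stateless filter eligible for fission later. */
    void fir_work(Fifo *in, Fifo *out, const int wgts[N]) {
      int i, sum = 0;
      for (i = 0; i < N; i++)
        sum += fifo_peek(in, i) * wgts[i];
      fifo_push(out, sum);
      fifo_pop(in);
    }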
SGMS Overview

[Figure: four-PE schedule. A single PE executes iterations back-to-back in time T1 each; under SGMS the same work is spread across PE0-PE3 with DMA transfers overlapped, so after a pipeline prologue the steady state completes an iteration roughly every T1/4, until a final epilogue drains the pipeline]
SGMS Phases

1. Fission + processor assignment (goal: load balance)
2. Stage assignment (goals: causality, DMA overlap)
3. Code generation
Processor Assignment: Maximizing Throughput

- Assigns each filter to a processor (filters i = 1, ..., N; PEs j = 1, ..., P)
- Objective: minimize II, i.e. the maximal workload W assigned to any PE

[Figure: six-filter stream graph A-F with per-filter workloads W (values 20, 30, 50, 30, ...) mapped onto four PEs; one PE alone takes T1 = 170 per iteration, while a balanced four-PE assignment reaches the minimum II of 50, so T2 = 50 and throughput improves by T1/T2 = 3.4]
Need More Than Just Processor Assignment

- Assign filters to processors
  - Goal: equal work distribution
- Graph partitioning? Bin packing?

Example on two PEs: the original program has filters A, B, C, D with workloads 5, 40, 10, 5 (total 60). The best assignment isolates B on its own PE, so the bottleneck load is 40 and speedup = 60/40 = 1.5. Fissing the stateless B into B1 and B2 behind a split S and join J lets the two halves run on different PEs; the bottleneck load drops to 32 and speedup = 60/32 ≈ 2. A sketch of such round-robin fission follows below.
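As referenced above, a minimal C sketch of what fission does to a stateless filter: a round-robin split feeds two copies whose outputs are rejoined in order. The function names and the two-way split are illustrative assumptions, not the Streamroller implementation:

    /* Hypothetical two-way fission of a stateless filter B: split S deals
       inputs round-robin to B1/B2; join J re-interleaves the outputs. */
    #include <stddef.h>

    static int B(int x) { return 2 * x + 1; }  /* stand-in stateless work */

    void fissed_B(const int *in, int *out, size_t n) {
      /* Conceptually B1 handles even-indexed items, B2 odd-indexed ones.
         Because B keeps no state across items, the two halves could run
         on different PEs; here they run sequentially for clarity. */
      for (size_t i = 0; i < n; i += 2) out[i] = B(in[i]);  /* B1 */
      for (size_t i = 1; i < n; i += 2) out[i] = B(in[i]);  /* B2 */
    }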
Filter Fission Choices

[Figure: candidate ways of fissing the example graph across PE0-PE3; can the right combination of splits reach a speedup of ~4?]
Integrated Fission + PE Assignment

- Exact solution based on Integer Linear Programming (ILP)
- Split/join overhead factored in
- Objective function: minimize the maximal load on any PE
- Result:
  - Number of times to "split" each filter
  - Filter-to-processor mapping
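In skeleton form, and leaving out the fission variables and split/join-overhead terms that the paper adds, the assignment ILP looks like the following; the variable names are illustrative, not taken from the paper:

    \begin{aligned}
    \text{minimize}\quad   & II \\
    \text{subject to}\quad & \sum_{j=1}^{P} a_{ij} = 1
                             && \text{(each filter $i$ runs on exactly one PE)} \\
                           & \sum_{i=1}^{N} w_i \, a_{ij} \le II
                             && \text{(every PE $j$'s load is bounded by $II$)} \\
                           & a_{ij} \in \{0, 1\}
    \end{aligned}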
Step 2: Forming the Software Pipeline

- To achieve speedup:
  - All chunks should execute concurrently
  - Communication should be overlapped
- Processor assignment alone is insufficient information

[Figure: with A on PE0 and B on PE1, running A_i and B_i in the same time slot violates the A→B dependence; software pipelining instead overlaps A_{i+2} with B_i, leaving a full stage in between for the A→B DMA transfer]
Stage Assignment

- Preserve causality (producer-consumer dependence): if filter i feeds filter j on the same PE, then S_j >= S_i
- Communication-computation overlap: if filter i on PE1 feeds filter j on PE2, insert a DMA stage with S_DMA > S_i and set S_j = S_DMA + 1
- Assign stages by a data-flow traversal of the stream graph, applying the two rules above (see the sketch below)
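A small C sketch of that traversal under simple assumptions: filters arrive in topological order, and the minimal cross-PE gap of two stages (DMA at S_i + 1, consumer at S_i + 2) follows the rules above. The graph representation and names are hypothetical:

    /* Hypothetical stage assignment by topological data-flow traversal. */
    #define MAX_FILTERS 64

    typedef struct {
      int pe;                   /* PE this filter was assigned to        */
      int npred;                /* number of producers                   */
      int pred[MAX_FILTERS];    /* producer filter ids                   */
    } Filter;

    /* filters[0..n-1] must be in topological (data-flow) order. */
    void assign_stages(const Filter *filters, int n, int stage[]) {
      for (int i = 0; i < n; i++) {
        stage[i] = 0;
        for (int k = 0; k < filters[i].npred; k++) {
          int p = filters[i].pred[k];
          int gap = (filters[p].pe == filters[i].pe)
                      ? 0   /* same PE: S_j >= S_i suffices              */
                      : 2;  /* cross PE: DMA at S_i + 1, j at S_i + 2    */
          if (stage[p] + gap > stage[i])
            stage[i] = stage[p] + gap;
        }
      }
    }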
Stage Assignment Example

[Figure: the fissed example on two PEs. Stage 0: A, S, B1 on PE0; stage 1: DMA transfers AtoC, StoB2, B1toJ; stage 2: C, B2, J on PE1; stage 3: DMA transfers JtoD, CtoD; stage 4: D on PE0]
Step 3: Code Generation for Cell

- Target the Synergistic Processing Elements (SPEs)
  - PS3: up to 6 SPEs
  - QS20: up to 16 SPEs
- One thread per SPE
- Challenge:
  - Making a collection of independent threads implement a software pipeline
  - Adapt the kernel-only code schema of a modulo schedule
Complete Example

SPE1's code predicates each pipeline stage on a flag array (the kernel-only schema); the empty stages correspond to DMA transfers in flight or to filters running on the other SPE. Only the predicated loop body survives on the slide, so the predicate shift at the bottom of the loop is filled in here as one plausible way to advance the pipeline:

    void spe1_work() {
      char stage[5] = {0};
      int i, j;
      stage[0] = 1;
      for (i = 0; i < MAX; i++) {
        if (stage[0]) { A(); S(); B1(); }
        if (stage[1]) { /* DMA: AtoC, StoB2, B1toJ in flight */ }
        if (stage[2]) { JtoD(); CtoD(); }
        if (stage[3]) { /* DMA: JtoD, CtoD in flight */ }
        if (stage[4]) { D(); }
        /* advance the pipeline: each stage predicate moves one step down */
        for (j = 4; j > 0; j--) stage[j] = stage[j - 1];
        barrier();
      }
    }

[Figure: timeline of SPE1, DMA1, SPE2, DMA2 filling the pipeline: iteration 1 runs A, S, B1 on SPE1; DMA transfers AtoC, StoB2, B1toJ overlap iteration 2; SPE2 runs C, B2, J two stages behind; JtoD and CtoD then feed D, which first executes in iteration 5]
SGMS (ILP) vs. Greedy (MIT method, ASPLOS '06)

[Figure: benchmark speedups comparing ILP-based SGMS against the greedy MIT partitioning]
- ILP solver time < 30 seconds for 16 processors
SGMS Conclusions

- Streamroller:
  - Efficient mapping of stream programs to multicore
  - Coarse-grain software pipelining
- Performance summary:
  - 14.7x speedup on 16 cores
  - Up to 35% better than the greedy solution (11% on average)
- Scheduling framework:
  - Trades off memory space vs. load balance
  - Memory-constrained (embedded) systems vs. cache-based systems
Discussion Points

- Is it possible to convert stateful filters into stateless ones?
- What if the application does not behave as you expect?
  - Filters change execution time?
  - Memory faster/slower than expected?
- Could this be adapted for a more conventional multiprocessor with caches?
- Can C code be automatically streamized?
- You have now seen three forms of software pipelining:
  - 1) Instruction-level modulo scheduling, 2) decoupled software pipelining, 3) stream graph modulo scheduling
  - Where else can it be used?
Compilation for GPUs
Why GPUs?
Efficiency of GPUs

                        Core i7         GTX 285         GTX 480
Memory bandwidth        32 GB/s         159 GB/s
SP flop rate            102 GFLOPS      1062 GFLOPS
DP flop rate            51 GFLOPS       88.5 GFLOPS     168 GFLOPS
Flops per watt          0.78 GFLOP/W    5.2 GFLOP/W
Flops per dollar        0.36 GFLOP/$    3.54 GFLOP/$
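The flops-per-watt rows follow from dividing the single-precision flop rates by each part's TDP; the 204 W (GTX 285) and 130 W (Core i7) TDPs below are my assumption about how the slide's numbers were derived, not figures stated on the slide:

    \frac{1062~\text{GFLOPS}}{204~\text{W}} \approx 5.2~\text{GFLOP/W},
    \qquad
    \frac{102~\text{GFLOPS}}{130~\text{W}} \approx 0.78~\text{GFLOP/W}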
GPU Architecture

[Figure: 30 streaming multiprocessors (SM 0 - SM 29), each with shared memory, registers, and eight scalar cores (0-7), connected through an interconnection network to global (device) memory; a PCIe bridge links the GPU to the CPU and host memory]
CUDA

- "Compute Unified Device Architecture"
- General-purpose programming model
  - User kicks off batches of threads on the GPU
- Advantages of CUDA:
  - Interface designed for compute: graphics-free API
  - Orchestration of on-chip cores
  - Explicit GPU memory management
  - Full support for integer and bitwise operations
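A minimal CUDA sketch of the model just described: the host explicitly manages device memory and kicks off a batch of threads. The kernel and the sizes are illustrative:

    #include <cuda_runtime.h>
    #include <stdlib.h>

    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread id */
      if (i < n) c[i] = a[i] + b[i];
    }

    int main(void) {
      const int n = 1 << 20;
      const size_t bytes = n * sizeof(float);
      float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes),
            *hc = (float *)malloc(bytes);
      for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

      /* Explicit GPU memory management: allocate and copy to the device. */
      float *da, *db, *dc;
      cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
      cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
      cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

      /* Kick off a batch of threads: a grid of 256-thread blocks. */
      vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
      cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

      cudaFree(da); cudaFree(db); cudaFree(dc);
      free(ha); free(hb); free(hc);
      return 0;
    }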
Programming Model

[Figure: over time, the host launches Kernel 1 as Grid 1 on the device, then launches Kernel 2 as Grid 2]
GPU Scheduling

[Figure: the thread blocks of Grid 1 are distributed across the SMs, each SM holding its assigned blocks' state in its shared memory and register file]
Warp Generation

[Figure: blocks 0-3 assigned to SM0, sharing its shared memory and registers; within a block, threads with ThreadId 0-31 form Warp 0 and threads 32-63 form Warp 1]
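A short CUDA illustration of the fixed 32-thread grouping shown above; warpSize is the built-in CUDA constant, while the kernel itself is a hypothetical example:

    __global__ void whichWarp(int *warpOf) {
      int tid = threadIdx.x;        /* thread id within its block        */
      /* Hardware issues threads in fixed groups of warpSize (32):
         threads 0-31 form warp 0, threads 32-63 form warp 1, and so on. */
      warpOf[tid] = tid / warpSize;
    }

Launched with, say, 64 threads per block, warpOf records warp 0 for the first 32 threads of each block and warp 1 for the rest.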
Memory Hierarchy

Scope        Memory space       Declaration example
Per thread   Registers          int RegisterVar;
Per thread   Local memory       int LocalVarArray[10];
Per block    Shared memory      __shared__ int SharedVar;
Per app      Global memory      __device__ int GlobalVar;
Per app      Constant memory    __constant__ int ConstVar;
Per app      Texture memory     texture<int> TextureVar;

Registers, local memory, and shared memory live on the device; global, constant, and texture memory are also visible to the host.
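A small kernel sketch touching several of these spaces at once; the block-sum reduction it performs, and all the names, are illustrative:

    __constant__ float scale;          /* per-app constant memory; host   */
                                       /* sets it via cudaMemcpyToSymbol  */
    __device__ float totals[64];       /* per-app global memory           */

    /* Launch with 256-thread blocks and at most 64 blocks. */
    __global__ void blockSum(const float *in) {
      __shared__ float buf[256];       /* per-block shared memory         */
      int tid = threadIdx.x;           /* per-thread register             */

      buf[tid] = in[blockIdx.x * blockDim.x + tid] * scale;
      __syncthreads();

      /* Tree reduction within the block, staying in shared memory. */
      for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
      }
      if (tid == 0) totals[blockIdx.x] = buf[0];  /* back to global memory */
    }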