EECS 583 – Class 20
Research Topic 2: Stream Compilation, GPU Compilation
University of Michigan
December 3, 2012
Guest Speakers Today: Daya Khudia and Mehrzad Samadi
Announcements & Reading Material
v Exams graded and will be returned in Wednesday’s class
v This class
  » “Orchestrating the Execution of Stream Programs on Multicore Platforms,” M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Jun. 2008
v Next class – Research Topic 3: Security
  » “Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software,” James Newsome and Dawn Song, Proceedings of the Network and Distributed System Security Symposium, Feb. 2005
Stream Graph Modulo Scheduling
Stream Graph Modulo Scheduling (SGMS)
v Coarse grain software pipelining
  » Equal work distribution
  » Communication/computation overlap
  » Synchronization costs
v Target: Cell processor
  » Cores with disjoint address spaces
  » Explicit copy to access remote data
  » DMA engine independent of PEs
v Filters = operations, cores = function units
[Figure: Cell block diagram – SPE0 through SPE7, each an SPU with 256 KB local store and an MFC (DMA), connected by the EIB to the PPE (PowerPC) and DRAM]
Preliminaries
v Synchronous Data Flow (SDF) [Lee ’87]
v StreamIt [Thies ’02]
  » Filters push and pop items from input/output FIFOs

Stateless FIR filter in StreamIt (the weights arrive as a construction parameter; no state is carried across work firings):

int->int filter FIR(int N, int wgts[N]) {
  work pop 1 push 1 {
    int i, sum = 0;
    for (i = 0; i < N; i++)
      sum += peek(i) * wgts[i];
    push(sum);
    pop();
  }
}

Stateful variant: declaring int wgts[N] as a filter field and updating it across firings (wgts = adapt(wgts);) makes the filter stateful.
SGMS Overview
[Figure: Execution pipelined across PE0–PE3 with overlapped DMA; a prologue and epilogue surround the steady state, and the single-PE time T1 versus the pipelined time T4 gives a speedup of ≈ 4]
SGMS Phases
v Fission + processor assignment → load balance
v Stage assignment → causality, DMA overlap
v Code generation
Processor Assignment: Maximizing Throughput
v Assigns each filter to a processor
  » For all filters i = 1, …, N and all PEs j = 1, …, P: minimize II, the maximal per-PE load (an ILP sketch follows below)
[Figure: Example stream graph with filters A–F and per-filter workloads (W); on a single PE one iteration takes T1 = 170, while a balanced assignment onto four processing elements (PE0–PE3) reaches the minimum II = 50 (T2 = 50), i.e. a throughput gain of T1/T2 = 3.4]
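A minimal sketch of the load-balancing ILP this slide implies, assuming binary assignment variables a_ij (filter i placed on PE j) and per-filter work estimates w_i; the symbols are illustrative rather than the paper’s exact formulation:

\begin{aligned}
\text{minimize}\quad & II \\
\text{subject to}\quad & \sum_{j=1}^{P} a_{ij} = 1, \qquad i = 1,\dots,N && \text{(each filter runs on exactly one PE)} \\
& \sum_{i=1}^{N} w_i\, a_{ij} \le II, \qquad j = 1,\dots,P && \text{(II bounds every PE's load)} \\
& a_{ij} \in \{0,1\}
\end{aligned}

The optimal II is the steady-state initiation interval of the software pipeline: one graph iteration completes every II time units.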
Need More Than Just Processor Assignment
v Assign filters to processors
  » Goal: equal work distribution
v Graph partitioning?
v Bin packing?
[Figure: The original stream program A → B/C → D mapped onto PE0/PE1 gives speedup = 60/40 = 1.5; fissing B into B1/B2 (with split S and join J) rebalances the load and gives speedup = 60/32 ≈ 2]
Filter Fission Choices
[Figure: Alternative ways of splitting filters across PE0–PE3 – which choice achieves speedup ≈ 4?]
Integrated Fission + PE Assignment
v Exact solution based on Integer Linear Programming (ILP) …
  » Split/join overhead factored in
v Objective function: maximal load on any PE
  » Minimize it (see the min–max statement below)
v Result
  » Number of times to “split” each filter
  » Filter → processor mapping
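One way to read this objective, stated as a min–max over replication counts and placements (the actual ILP linearizes it); here r_i is the number of copies of filter i, a stateless filter’s work w_i is assumed to split evenly across its copies, and o_i(r_i) stands for the split/join overhead – all of these symbols are illustrative:

II^{*} \;=\; \min_{r,\ \text{placement}} \;\; \max_{j = 1,\dots,P} \sum_{\text{copies } (i,k) \text{ on PE } j} \left( \frac{w_i}{r_i} + o_i(r_i) \right)

Solving it yields both how many times to split each filter (r_i) and which PE each copy runs on.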
Step 2: Forming the Software Pipeline
v To achieve speedup
  » All chunks should execute concurrently
  » Communication should be overlapped
v Processor assignment alone is insufficient information
[Figure: Naively running A on PE0 and B on PE1 serializes them through the A→B transfer; software pipelining instead overlaps iteration A_{i+2} on PE0 with B_i on PE1, hiding the DMA in between]
Stage Assignment
v Preserve causality (producer–consumer dependence): if filter i feeds filter j on the same PE, then S_j ≥ S_i
v Communication–computation overlap: if i and j are assigned to different PEs, the transfer gets its own DMA stage with S_DMA > S_i, and the consumer runs at S_j = S_DMA + 1
v Data flow traversal of the stream graph
  » Assign stages using the above two rules (a sketch follows below)
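A rough sketch of such a traversal, assuming a hypothetical Filter struct that already carries the processor chosen in the previous step; all field and function names here are illustrative, not taken from Streamroller:

#define MAX_PREDS 16

typedef struct Filter {
    int pe;                          /* PE chosen by fission + processor assignment */
    int stage;                       /* stage computed here                         */
    int num_preds;                   /* number of producers feeding this filter     */
    struct Filter *preds[MAX_PREDS];
} Filter;

/* Visit filters in data-flow (topological) order and apply the two rules. */
void assign_stages(Filter *order[], int n)
{
    for (int k = 0; k < n; k++) {
        Filter *j = order[k];
        j->stage = 0;
        for (int p = 0; p < j->num_preds; p++) {
            Filter *i = j->preds[p];
            if (i->pe == j->pe) {
                /* Rule 1: same PE - preserve causality, S_j >= S_i */
                if (i->stage > j->stage)
                    j->stage = i->stage;
            } else {
                /* Rule 2: different PEs - a DMA stage S_dma > S_i is inserted,
                 * and the consumer runs one stage later: S_j = S_dma + 1.      */
                int s_dma = i->stage + 1;
                if (s_dma + 1 > j->stage)
                    j->stage = s_dma + 1;
            }
        }
    }
}

Because stages only ever increase along data-flow edges, a single topological visit is enough in this sketch.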
Stage Assignment Example
[Figure: The fissed graph (A, S, B1, B2, C, J, D) after stage assignment across PE0 and PE1 – Stage 0: A, S, B1; Stage 1: DMA; Stage 2: C, B2, J; Stage 3: DMA; Stage 4: D]
Step 3: Code Generation for Cell
v Target the Synergistic Processing Elements (SPEs)
  » PS3 – up to 6 SPEs
  » QS20 – up to 16 SPEs
v One thread / SPE
v Challenge
  » Making a collection of independent threads implement a software pipeline
  » Adapt the kernel-only code schema of a modulo schedule
Complete Example

void spe1_work() {
  char stage[5] = {0};
  stage[0] = 1;
  for (int i = 0; i < MAX; i++) {
    if (stage[0]) { A(); S(); B1(); }
    if (stage[1]) { /* no work for SPE1 in this stage */ }
    if (stage[2]) { JtoD(); CtoD(); }
    if (stage[3]) { /* no work for SPE1 in this stage */ }
    if (stage[4]) { D(); }
    if (i < 4) stage[i + 1] = 1;   /* prologue: enable one more stage per iteration */
    barrier();
  }
}

[Figure: Pipeline fill over time across SPE1, DMA1, SPE2, and DMA2 – each iteration SPE1 runs A, S, B1; DMA transfers (AtoC, StoB2, B1toJ) overlap with SPE2 running C, B2, J; further transfers (JtoD, CtoD) feed D, which executes once the pipeline is full]
SGMS (ILP) vs. Greedy (MIT method, ASPLOS ’06)
v Solver time < 30 seconds for 16 processors
SGMS Conclusions
v Streamroller
  » Efficient mapping of stream programs to multicore
  » Coarse grain software pipelining
v Performance summary
  » 14.7x speedup on 16 cores
  » Up to 35% better than the greedy solution (11% on average)
v Scheduling framework
  » Tradeoff memory space vs. load balance (memory-constrained embedded systems vs. cache-based systems)
Discussion Points
v Is it possible to convert stateful filters into stateless ones?
v What if the application does not behave as you expect?
  » Filters change execution time?
  » Memory faster/slower than expected?
v Could this be adapted for a more conventional multiprocessor with caches?
v Can C code be automatically streamized?
v Now you have seen 3 forms of software pipelining:
  » 1) Instruction-level modulo scheduling, 2) Decoupled software pipelining, 3) Stream graph modulo scheduling
  » Where else can it be used?
Compilation for GPUs
Why GPUs?
Efficiency of GPUs
v High memory bandwidth: GTX 285: 159 GB/s vs. i7: 32 GB/s
v High flop rate: GTX 285: 1062 GFLOPS vs. i7: 102 GFLOPS
v High double-precision flop rate: GTX 285: 88.5 GFLOPS, GTX 480: 168 GFLOPS vs. i7: 51 GFLOPS
v High flops per watt: GTX 285: 5.2 GFLOP/W vs. i7: 0.78 GFLOP/W
v High flops per dollar: GTX 285: 3.54 GFLOP/$ vs. i7: 0.36 GFLOP/$
GPU Architecture
[Figure: 30 streaming multiprocessors (SM 0 … SM 29), each with shared memory and registers, connected by an interconnection network to global (device) memory; the GPU attaches via a PCIe bridge to the CPU and host memory]
CUDA
v “Compute Unified Device Architecture”
v General purpose programming model
  » User kicks off batches of threads on the GPU
v Advantages of CUDA
  » Interface designed for compute – graphics-free API
  » Orchestration of on-chip cores
  » Explicit GPU memory management
  » Full support for integer and bitwise operations
Programming Model
[Figure: Over time, the host launches Kernel 1 and then Kernel 2 on the device; each kernel executes as a grid of thread blocks (Grid 1, Grid 2) – a sketch of this host/device flow follows below]
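A minimal sketch of the host/device flow, assuming a hypothetical vec_add kernel; all names and sizes here are illustrative, not from the lecture:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical kernel: each thread handles one element of the output. */
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index in the grid */
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    /* Explicit GPU memory management: allocate device buffers, copy inputs over PCIe. */
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* The host kicks off a grid of thread blocks; blocks are scheduled onto SMs by hardware. */
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    vec_add<<<grid, block>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();                         /* host waits for the grid to finish */

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[42] = %f\n", h_c[42]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}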
GPU Scheduling
[Figure: The thread blocks of Grid 1 are distributed by the hardware scheduler across the streaming multiprocessors (SM 0, SM 1, …), each with its own shared memory and registers]
Warp Generation
[Figure: Blocks 0–3 resident on SM0 share its registers and shared memory; within each block, threads with consecutive thread IDs are grouped into warps (Warp 0, Warp 1, …) – a small sketch follows below]
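A small sketch of how a thread can discover its warp, assuming the usual warp size of 32; the kernel and buffer names are illustrative:

__global__ void warp_ids(int *warp_out, int *lane_out) {
    int tid  = threadIdx.x;                    /* linear thread ID within the block (1-D block) */
    int gtid = blockIdx.x * blockDim.x + tid;  /* global thread ID within the grid              */
    warp_out[gtid] = tid / 32;                 /* consecutive groups of 32 threads form a warp  */
    lane_out[gtid] = tid % 32;                 /* lane (position) within that warp              */
}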
Memory Hierarchy
v Per-thread registers: int RegisterVar
v Per-thread local memory: int LocalVarArray[10]
v Per-block shared memory: __shared__ int SharedVar
v Per-application global memory: __device__ int GlobalVar
v Per-application constant memory: __constant__ int ConstVar
v Per-application texture memory: Texture TextureVar
[Figure: Thread 0 in Block 0 uses its registers and local memory plus the block’s shared memory; all blocks in Grid 0, as well as the host, access the device’s global, constant, and texture memory – a compilable sketch follows below]
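A compilable sketch tying these memory spaces together; the variable names mirror the slide, while the kernel body itself is purely illustrative:

__device__   int GlobalVar;        /* per-application global (device) memory            */
__constant__ int ConstVar;         /* per-application constant memory (read-only cache) */

__global__ void memory_spaces(int *out) {
    __shared__ int SharedVar;      /* per-block shared memory, visible to the whole block */
    int RegisterVar = threadIdx.x;        /* scalar automatics live in per-thread registers  */
    int LocalVarArray[10];                /* per-thread arrays may be placed in local memory */

    for (int i = 0; i < 10; i++)
        LocalVarArray[i] = RegisterVar + i;

    if (threadIdx.x == 0)
        SharedVar = GlobalVar + ConstVar; /* one thread initializes the shared value   */
    __syncthreads();                      /* all threads wait before reading SharedVar */

    out[blockIdx.x * blockDim.x + threadIdx.x] = SharedVar + LocalVarArray[9];
}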