Published by Denis Summers. Modified over 9 years ago.
1
Orchestrating the Execution of Stream Programs on Multicore Platforms
Manjunath Kudlur, Scott Mahlke
Advanced Computer Architecture Lab.
University of Michigan, Electrical Engineering and Computer Science
2
Cores are the New Gates (Shekhar Borkar, Intel)
[Figure: # cores/chip over time. Courtesy: Gordon’06]
Programming models: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind
Stream Programming
3
Stream Programming
Programming style
–Embedded domain: audio/video (H.264), wireless (WCDMA)
–Mainstream: continuous query processing (IBM SystemS), search (Google Sawzall)
Stream
–Collection of data records
Kernels/Filters
–Functions applied to streams
–Input/Output are streams
–Coarse grain dataflow
–Amenable to aggressive compiler optimizations [ASPLOS’02, ’06, PLDI ’03]
4
Compiling Stream Programs
[Figure: a stream program mapped onto a multicore system (Core 1 to Core 4 plus memory); the mapping step is marked "?".]
Heavy lifting
–Equal work distribution
–Communication
–Synchronization
5
Stream Graph Modulo Scheduling (SGMS)
Coarse grain software pipelining
–Equal work distribution
–Communication/computation overlap
Target: Cell processor
–Cores with disjoint address spaces
–Explicit copy to access remote data
–DMA engine independent of PEs
Filters = operations, cores = function units
[Figure: Cell processor with SPE0 to SPE7, each an SPU with a 256 KB local store (LS) and an MFC (DMA), connected through the EIB to the PPE (PowerPC) and DRAM.]
6
Streamroller Software Overview
Inputs: StreamIt, SPEX (C++ with stream extensions), Stylized C
Streamroller: Stream IR, Profiling, Analysis, Scheduling, Code Generation
Targets
–Custom Application Engine
–SODA (low power multicore processor for SDR)
–Multicore (Cell, Core2Quad, Niagara): focus of this talk
7
Preliminaries
Synchronous Data Flow (SDF) [Lee ’87]
StreamIt [Thies ’02]
–Push and pop items from input/output FIFOs

Stateless filter:
    int->int filter FIR(int N, int wgts[N]) {
      work pop 1 push 1 {
        int i, sum = 0;
        for (i = 0; i < N; i++)
          sum += peek(i) * wgts[i];
        push(sum);
        pop();
      }
    }

Stateful variant (wgts is kept as filter state and updated):
    int wgts[N];
    wgts = adapt(wgts);
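To make the peek/pop/push semantics of the FIR filter concrete, here is a minimal C sketch of one way such a filter could be run over array-backed FIFOs. The names `fifo_t`, `FIFO_CAP`, `fifo_push`, `fifo_peek`, `fifo_pop`, `fir_work`, and `fir_run` are hypothetical and are not part of StreamIt or Streamroller; the sketch only assumes the rates shown on the slide (peek N, pop 1, push 1).

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_CAP 64 /* hypothetical capacity, chosen for illustration */

/* Simple ring-buffer FIFO standing in for a stream channel. */
typedef struct {
    int data[FIFO_CAP];
    size_t head;  /* index of the oldest item */
    size_t count; /* number of items currently stored */
} fifo_t;

void fifo_push(fifo_t *f, int v) {
    assert(f->count < FIFO_CAP);
    f->data[(f->head + f->count) % FIFO_CAP] = v;
    f->count++;
}

/* Read the i-th oldest item without removing it. */
int fifo_peek(const fifo_t *f, size_t i) {
    assert(i < f->count);
    return f->data[(f->head + i) % FIFO_CAP];
}

int fifo_pop(fifo_t *f) {
    int v = fifo_peek(f, 0);
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return v;
}

/* One firing of the FIR filter: peek n items, push 1 result, pop 1 item. */
void fir_work(fifo_t *in, fifo_t *out, int n, const int *wgts) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += fifo_peek(in, (size_t)i) * wgts[i];
    fifo_push(out, sum);
    (void)fifo_pop(in);
}

/* Fire the filter while at least n items are available; returns the firing count. */
int fir_run(fifo_t *in, fifo_t *out, int n, const int *wgts) {
    int firings = 0;
    while (in->count >= (size_t)n) {
        fir_work(in, out, n, wgts);
        firings++;
    }
    return firings;
}
```

Because `fir_work` touches nothing besides its FIFOs and the read-only weights, two firings on different input windows are independent, which is the property that makes a stateless filter a candidate for fission.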
8
SGMS Overview
[Figure: execution on a single PE (time T1) compared with the software-pipelined schedule on PE0 to PE3 (time T4), with T1/T4 ≈ 4. DMA transfers, prologue, and epilogue are marked.]
9
SGMS Phases
Fission + Processor assignment: load balance
Stage assignment: causality, DMA overlap
Code generation
10
Processor Assignment
Assign filters to processors
–Goal: equal work distribution
Graph partitioning? Bin packing?
Original stream program: filters A, B, C, D with work 5, 40, 10, 5 (60 in total)
–PE0: B; PE1: A, C, D
–Speedup = 60/40 = 1.5
Modified stream program: B fissed into B1 and B2, with splitter S and joiner J
–PE0: B2, C, J; PE1: B1, A, S, D
–Speedup = 60/32 ≈ 1.9
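The speedup figures on this slide follow from dividing the sequential work by the load of the most heavily loaded PE. The C sketch below reproduces that arithmetic. The function names `max_pe_load` and `speedup` are hypothetical, and the splitter/joiner cost of 2 units each used in the usage note is an assumption chosen so that the maximum load comes out at the slide's value of 32; the slide does not state those costs.

```c
#include <assert.h>

/* Largest total work assigned to any single PE.
 * work[f]  : estimated work of filter f
 * pe_of[f] : index of the PE that filter f is mapped to */
int max_pe_load(const int *work, const int *pe_of, int n_filters, int n_pes) {
    int max = 0;
    for (int p = 0; p < n_pes; p++) {
        int load = 0;
        for (int f = 0; f < n_filters; f++)
            if (pe_of[f] == p)
                load += work[f];
        if (load > max)
            max = load;
    }
    return max;
}

/* Steady-state speedup over sequential execution: the pipeline's
 * throughput is limited by the most heavily loaded PE. */
double speedup(int sequential_work, int max_load) {
    assert(max_load > 0);
    return (double)sequential_work / max_load;
}
```

For the original program, `work = {5, 40, 10, 5}` with `pe_of = {1, 0, 1, 1}` gives a maximum load of 40. For the fissed program (A, S, B1, B2, J, C, D with the assumed costs 5, 2, 20, 20, 2, 10, 5), both PEs carry 32 units.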
11
Filter Fission Choices
[Figure: candidate fissions of the stream graph mapped onto PE0 to PE3. Which choice gives a speedup of ~4?]
12
Integrated Fission + PE Assign
Exact solution based on Integer Linear Programming (ILP) …
Split/join overhead factored in
Objective function
–Minimize the maximal load on any PE
Result
–Number of times to “split” each filter
–Filter → processor mapping
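The objective (minimize the maximal load on any PE) can be illustrated without an ILP solver. The sketch below is not the ILP formulation from the talk: it swaps in a plain exhaustive search over all filter-to-PE mappings, which is only practical for a handful of filters, and it takes the fission decision as already made. `min_max_load`, `MAX_FILTERS`, and `MAX_PES` are hypothetical names.

```c
#include <assert.h>
#include <limits.h>

#define MAX_FILTERS 16 /* illustrative limits for the sketch */
#define MAX_PES 8

/* Try every assignment of n filters to n_pes PEs and return the smallest
 * achievable maximum PE load. The best mapping is written to best_pe_of.
 * Cost grows as n_pes^n, so this is only for tiny graphs. */
int min_max_load(const int *work, int n, int n_pes, int *best_pe_of) {
    assert(n > 0 && n <= MAX_FILTERS && n_pes > 0 && n_pes <= MAX_PES);
    int pe_of[MAX_FILTERS] = {0};
    int best = INT_MAX;
    for (;;) {
        int load[MAX_PES] = {0};
        int max = 0;
        for (int f = 0; f < n; f++)
            load[pe_of[f]] += work[f];
        for (int p = 0; p < n_pes; p++)
            if (load[p] > max)
                max = load[p];
        if (max < best) {
            best = max;
            for (int f = 0; f < n; f++)
                best_pe_of[f] = pe_of[f];
        }
        /* Advance pe_of like an odometer in base n_pes. */
        int f = 0;
        while (f < n && ++pe_of[f] == n_pes) {
            pe_of[f] = 0;
            f++;
        }
        if (f == n)
            break; /* every digit wrapped: all mappings visited */
    }
    return best;
}
```

An ILP solver reaches the same optimum while also choosing how many times to split each filter and accounting for split/join overhead, which this sketch leaves out.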
13
SGMS Phases
Fission + Processor assignment: load balance
Stage assignment: causality, DMA overlap
Code generation
14
Forming the Software Pipeline
To achieve speedup
–All chunks should execute concurrently
–Communication should be overlapped
Processor assignment alone is insufficient information
[Figure: filters A, B, C mapped onto PE0 and PE1, with a timeline of iterations A1, A2, A3 on PE0, B1 on PE1, and the A→B transfers between them. Running A and B of the same iteration back to back is marked X; instead, overlap A_{i+2} with B_i.]
15
Stage Assignment
Preserve causality (producer-consumer dependence)
–Producer i and consumer j on the same PE: S_j ≥ S_i
Communication-computation overlap
–Producer i on PE 1, consumer j on PE 2, DMA in between: S_DMA > S_i and S_j = S_DMA + 1
Data flow traversal of the stream graph
–Assign stages using above two rules
16
Stage Assignment Example
Stream graph: A, S, B1, B2, C, J, D
Stage 0: A, S, B1 (PE 1)
Stage 1: DMA
Stage 2: C, B2, J (PE 0)
Stage 3: DMA
Stage 4: D (PE 1)
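The two stage-assignment rules can be applied in a single data flow traversal. The following C sketch is one possible reading of those rules, not the authors' implementation: it assumes filters are numbered in topological order, and the names `edge_t`, `assign_stages`, and `dma_stage` are hypothetical. The edge list used in the usage note (A→S, S→B1, S→B2, A→C, B1→J, B2→J, J→D, C→D) is inferred from the filter and DMA names in the example.

```c
#include <assert.h>

typedef struct {
    int src; /* producer filter index */
    int dst; /* consumer filter index */
} edge_t;

/* Assign a pipeline stage to every filter.
 * Assumes filters are numbered in topological order (src < dst on every edge).
 * Same PE:      S_dst >= S_src
 * Different PE: a DMA stage sits in between, so S_dst >= S_src + 2 */
void assign_stages(const edge_t *edges, int n_edges,
                   const int *pe_of, int n_filters, int *stage) {
    for (int j = 0; j < n_filters; j++) {
        stage[j] = 0;
        for (int e = 0; e < n_edges; e++) {
            if (edges[e].dst != j)
                continue;
            int i = edges[e].src;
            assert(i < j); /* topological numbering */
            int need = (pe_of[i] == pe_of[j]) ? stage[i] : stage[i] + 2;
            if (need > stage[j])
                stage[j] = need;
        }
    }
}

/* Stage of the DMA on a cross-PE edge: S_dst = S_DMA + 1, and since
 * S_dst >= S_src + 2 this also satisfies S_DMA > S_src. */
int dma_stage(const edge_t *e, const int *stage) {
    return stage[e->dst] - 1;
}
```

With filters indexed A=0, S=1, B1=2, B2=3, C=4, J=5, D=6 and A, S, B1, D on PE 1 and B2, C, J on PE 0, the sketch yields stages 0, 0, 0, 2, 2, 2, 4, matching the example.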
17
SGMS Phases
Fission + Processor assignment: load balance
Stage assignment: causality, DMA overlap
Code generation
18
Code Generation for Cell
Target the Synergistic Processing Elements (SPEs)
–PS3: up to 6 SPEs
–QS20: up to 16 SPEs
One thread / SPE
Challenge
–Making a collection of independent threads implement a software pipeline
–Adapt kernel-only code schema of a modulo schedule
19
Complete Example
    void spe1_work() {
      char stage[5] = {0};
      int i;
      stage[0] = 1;
      for (i = 0; i < MAX; i++) {
        if (stage[0]) { A(); S(); B1(); }
        if (stage[1]) { }
        if (stage[2]) { JtoD(); CtoD(); }
        if (stage[3]) { }
        if (stage[4]) { D(); }
        barrier();
      }
    }
[Figure: stage layout (A, S, B1; DMA; C, B2, J; DMA; D) next to a timeline over SPE1, DMA1, SPE2, DMA2. The pipeline fills step by step: first A, S, B1 alone; then the transfers B1toJ, StoB2, AtoC; then B2, J, C; then JtoD, CtoD; then D, after which all stages run concurrently.]
20
Experiments
StreamIt benchmarks
–Signal processing: DCT, FFT, FMRadio, radar
–MPEG2 decoder subset
–Encryption: DES
–Parallel sort: Bitonic
–Range of 30 to 90 filters/benchmark
Platform
–QS20 blade server: 2 Cell chips, 16 SPEs
Software
–StreamIt to C, gcc 4.1
–IBM Cell SDK 2.1
21
Evaluation on QS20
Split/join overhead reduces benefit from fission
Barrier synchronization (1 per iteration of the stream graph)
22
SGMS (ILP) vs. Greedy (MIT method, ASPLOS’06)
Solver time < 30 seconds for 16 processors
23
Conclusions
Streamroller
–Efficient mapping of stream programs to multicore
–Coarse grain software pipelining
Performance summary
–14.7x speedup on 16 cores
–Up to 35% better than greedy solution (11% on average)
Scheduling framework
–Tradeoff memory space vs. load balance
–Memory constrained (embedded) systems
–Cache based system
26
Naïve Unfolding
[Figure: the stream graph A to F replicated four times, one copy per PE (PE1 to PE4). First panel: completely stateless stream program, with stream data dependences only. Second panel: stream program with stateful nodes, where state data dependences between the copies are carried by DMA.]
27
SGMS (ILP) vs. Greedy (MIT method, ASPLOS’06)
Solver time < 30 seconds for 16 processors
Highly dependent on graph structure
DMA overlapped in both ILP and Greedy
Speedup drops with exposed DMA
28
Unfolding vs. SGMS
29
Pros and Cons of Unfolding
Pros
–Simple
–Good speedups for mostly stateless programs
Cons
–All input data need to be available for iterations to begin
–Long latency for one iteration
–Not suitable for real time scheduling
–Speedup affected by filters with large state
[Figure: graph A, B, C, D unfolded across PE 1 to PE 3, with Sync + DMA between successive iterations.]
30
SGMS Assumptions
Data independent filter work functions
–Static schedule
High bandwidth interconnect
–EIB 96 Bytes/cycle
–Low overhead DMA, low observed latency
Cores can run MIMD efficiently
–Not (directly) suitable for GPUs
31
Speedup Upper Bounds
15% work in one stateless filter
Max speedup = 100/15 = 6.7
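The bound on this slide is the total work divided by the work of the largest filter that stays on one PE, capped by the number of PEs. A minimal C sketch of that arithmetic follows; `speedup_upper_bound` and its parameter names are hypothetical, and the cap at `n_pes` is an added assumption that the slide does not spell out.

```c
#include <assert.h>

/* Upper bound on pipeline speedup: no schedule can finish an iteration
 * faster than the largest piece of work that must stay on a single PE,
 * and no schedule can beat the number of PEs. */
double speedup_upper_bound(double total_work, double largest_unsplit_work,
                           int n_pes) {
    assert(total_work > 0 && largest_unsplit_work > 0 && n_pes > 0);
    double bound = total_work / largest_unsplit_work;
    return bound < n_pes ? bound : (double)n_pes;
}
```

With `total_work = 100`, `largest_unsplit_work = 15`, and 16 PEs, the bound is 100/15, roughly 6.7, which is the figure given on the slide.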
32
Summary
14.7x speedup on 16 cores
–Preprocessing steps (fission) necessary
–Hiding communication important
Extensions
–Scaling to more cores
–Constrained memory system
Other targets
–SODA, FPGA, GPU
33
Stream Programming
Algorithmic style suitable for
–Audio/video processing
–Signal processing
–Networking (packet processing, encryption/decryption)
–Wireless domain, software defined radio
Characteristics
–Independent filters: segregated address space, independent threads of control
–Explicit communication
–Coarse grain dataflow
–Amenable to aggressive compiler optimizations [ASPLOS’02, ’06, PLDI ’03, ’08]
[Figure: MPEG decoder stream graph with filters Bitstream parser, Zigzag unorder, Inverse Quant AC, Inverse Quant DC, Saturate, 1D-DCT (row), 1D-DCT (column), Bounded saturate, Motion decode.]
34
Effect of Exposed DMA
35
Scheduling via Unfolding Evaluation
IBM SDK 2.1, pthreads, gcc 4.1.1, QS20 Blade server
StreamIt benchmarks characteristics
Columns: Benchmark, Total filters, Stateful, Peeking, State size (bytes)
bitonic 40104
channel 55134252
dct 40104
des 55004
fft 17104
filterbank 85132508
fmradio 43114508
tde 55104
mpeg2 39104
vocoder 54116112
radar 574401032
36
Absolute Performance
Benchmark     GFLOPS with 16 SPEs
bitonic       N/A
channel       18.72949
dct           17.2353
des           N/A
fft           1.65881
filterbank    18.41478
fmradio       17.05445
tde           6.486601
mpeg2         N/A
vocoder       5.488236
radar         30.29937
37
Versions of the actor: original (node 0_0), fissed 2x (nodes 1_0 to 1_3), fissed 3x (nodes 2_0 to 2_4)
\[
\begin{aligned}
\sum_{i=1}^{P} a_{1,0,0,i} - b_{1,0} &= 0 \\
\sum_{i=1}^{P} \sum_{j=0}^{3} a_{1,1,j,i} - b_{1,1} &\le M\, b_{1,1} \\
\sum_{i=1}^{P} \sum_{j=0}^{3} a_{1,1,j,i} - b_{1,1} - 2 &\ge -M + M\, b_{1,1} \\
\sum_{i=1}^{P} \sum_{j=0}^{4} a_{1,2,j,i} - b_{1,2} &\le M\, b_{1,2} \\
\sum_{i=1}^{P} \sum_{j=0}^{4} a_{1,2,j,i} - b_{1,2} - 3 &\ge -M + M\, b_{1,2} \\
b_{1,0} + b_{1,1} + b_{1,2} &= 1
\end{aligned}
\]
38
SPE Code Template
    void spe_work() {
      char stage[N] = {0, 0,...,0};            /* Bit mask to control active stages */
      int i, j;
      stage[0] = 1;                            /* Activate stage 0 */
      for (i = 0; i < max_iter+N-1; i++) {     /* Go through all input items */
        if (stage[N-1]) { }                    /* Bit mask controls what gets executed */
        if (stage[N-2]) { }
        ...
        if (stage[0]) {
          Start_DMA();                         /* Start DMA operation */
          FilterA_work();                      /* Call filter work function */
        }
        /* Left shift bit mask (activate more stages). The shift comes before
           stage[0] is cleared so that the last item still reaches stage 1. */
        for (j = N-1; j >= 1; j--) stage[j] = stage[j-1];
        if (i == max_iter-1) stage[0] = 0;
        wait_for_dma();                        /* Poll for completion of all outstanding DMAs */
        barrier();                             /* Barrier synchronization */
      }
    }
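The stage mask is what produces the prologue and epilogue of the software pipeline, and its update order matters. The C sketch below simulates only the mask logic (no DMA, no barrier, no filter work) and counts how often each stage fires. `simulate_pipeline` and `MAX_STAGES` are hypothetical names. The sketch shifts the mask before clearing `stage[0]`; if `stage[0]` were cleared first, the cleared bit would be copied into `stage[1]` and every later stage would run one time too few.

```c
#include <assert.h>
#include <string.h>

#define MAX_STAGES 16 /* illustrative limit for the sketch */

/* Simulate the stage bit mask of a software-pipelined SPE loop.
 * On return, runs[s] holds how many times stage s was active.
 * For a correct pipeline every stage runs exactly max_iter times. */
void simulate_pipeline(int n_stages, int max_iter, int *runs) {
    char stage[MAX_STAGES] = {0};
    assert(n_stages >= 1 && n_stages <= MAX_STAGES && max_iter >= 1);
    memset(runs, 0, (size_t)n_stages * sizeof *runs);
    stage[0] = 1; /* only stage 0 is active at the start (prologue) */
    for (int i = 0; i < max_iter + n_stages - 1; i++) {
        /* Execute active stages, last stage first, as in the template. */
        for (int s = n_stages - 1; s >= 0; s--)
            if (stage[s])
                runs[s]++;
        /* Shift first: the work done in stage s moves on to stage s+1. */
        for (int j = n_stages - 1; j >= 1; j--)
            stage[j] = stage[j - 1];
        /* Then stop feeding new items once the input is used up (epilogue). */
        if (i == max_iter - 1)
            stage[0] = 0;
    }
}
```

For a 5-stage pipeline and 10 input items the loop runs 14 times, and each stage is active in exactly 10 of them.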