Static Translation of Stream Programming to a Parallel System
S. M. Farhad, PhD Student
Supervisor: Dr. Bernhard Scholz
Programming Language Group, School of Information Technology, University of Sydney
Multicores Are Here!
[Chart: number of cores per chip over time, from uniprocessors (Pentium, P2, P3, P4, Itanium, Athlon, Power4, Opteron, Power6, Yonah, PA-8800, Xeon MP, Opteron 4P) to multicores (Raw, Niagara, PExtreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, Cisco CSR-1, Picochip PC102, Broadcom 1480, Ambric AM2045)]
Multicores Are Here!
For uniprocessors, C was:
– Portable
– High Performance
– Composable
– Malleable
– Maintainable
Uniprocessors: C is the common machine language
[Same cores-per-chip-over-time chart as the previous slide]
Multicores Are Here!
What is the common machine language for multicores?
[Same cores-per-chip-over-time chart as the previous slides]
Common Machine Languages
Uniprocessors:
– Common properties: single flow of control; single memory image
– Differences: register file, ISA, functional units (handled by register allocation, instruction selection, instruction scheduling)
Multicores:
– Common properties: multiple flows of control; multiple local memories
– Differences: number and capabilities of cores, communication model, synchronization model
von-Neumann languages represent the common properties and abstract away the differences. We need common machine language(s) for multicores.
Properties of Stream Programs
– A large (possibly infinite) amount of data
– Limited lifespan of each data item
– Little processing of each data item
– A regular, static computation pattern: the stream program structure is relatively constant
– Many opportunities for compiler optimizations
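A minimal sketch of these properties, using Python generators rather than StreamIt (all names and numbers here are illustrative only): an effectively unbounded source, small per-item computations, and a pipeline structure that is fixed before any data flows.

```python
# Sketch of the streaming properties above: a potentially unbounded
# source, each item touched briefly by a fixed, regular pipeline of
# small per-item computations.

def source(n):
    # stands in for a large (possibly infinite) input stream
    for i in range(n):
        yield float(i)

def scale(stream, k):
    for x in stream:
        yield k * x          # little processing per item

def clip(stream, lo, hi):
    for x in stream:
        yield min(max(x, lo), hi)

# The pipeline structure is static: fixed before any data flows,
# which is what gives a compiler room to optimize.
pipeline = clip(scale(source(10), 2.0), 0.0, 10.0)
print(list(pipeline))  # each item is consumed and discarded immediately
```

Each item is produced, transformed, and retired in turn, so only a bounded window of data is ever live at once.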
Streaming as a Common Machine Language
– Regular and repeating computation
– Independent filters with explicit communication: segregated address spaces and multiple program counters
– Natural expression of parallelism: producer/consumer dependencies; enables powerful, whole-program transformations
[Example stream graph: FM radio. AtoD → FMDemod → Scatter → three LPF/HPF branches (LPF 1–3, HPF 1–3) → Gather → Adder → Speaker]
Applications of Stream Programming
Model of Computation
Synchronous Dataflow [Lee '92]:
– Graph of autonomous filters
– Communicate via FIFO channels
Static I/O rates:
– Compiler decides on an order of execution (schedule)
– Static estimation of computation
[Example graph: A/D → Band Pass → Duplicate → four parallel LED Detect filters]
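Because I/O rates are static, the compiler can solve the SDF balance equations (producer rate × producer firings = consumer rate × consumer firings on every edge) to get a steady-state schedule. A hedged Python sketch, not from the talk, for a chain-shaped graph:

```python
# Sketch: compute the minimal steady-state repetition vector of a
# chain-structured synchronous dataflow graph from its static rates,
# using the SDF balance equations.

from fractions import Fraction
from functools import reduce
from math import gcd, lcm

def repetition_vector(chain):
    """chain: list of (push_rate, pop_rate) pairs, one per edge between
    consecutive actors. Returns minimal integer firing counts."""
    reps = [Fraction(1)]
    for push, pop in chain:
        # consumer fires just often enough to drain the producer's tokens
        reps.append(reps[-1] * Fraction(push, pop))
    denom = lcm(*(r.denominator for r in reps))
    counts = [int(r * denom) for r in reps]
    g = reduce(gcd, counts)
    return [c // g for c in counts]

# A pushes 2 per firing; B pops 3 and pushes 1; C pops 2:
print(repetition_vector([(2, 3), (1, 2)]))  # -> [3, 2, 1]
```

In the example, firing A 3 times produces 6 tokens, exactly what 2 firings of B consume, and B's 2 output tokens feed one firing of C, so the steady state leaves every channel unchanged.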
Example StreamIt Filter (stateless)

    float->float filter FIR (int N, float[N] weights) {
      work push 1 pop 1 peek N {
        float result = 0;
        for (int i = 0; i < N; i++) {
          result += weights[i] * peek(i);
        }
        pop();
        push(result);
      }
    }

[Diagram: input stream → FIR → output stream]
Example StreamIt Filter (stateful)

    float->float filter FIR (int N) {
      float[N] weights;
      work push 1 pop 1 peek N {
        float result = 0;
        weights = adaptChannel(weights);
        for (int i = 0; i < N; i++) {
          result += weights[i] * peek(i);
        }
        pop();
        push(result);
      }
    }

[Diagram: input stream → FIR → output stream]
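The peek/pop/push channel semantics of the work function can be modeled outside StreamIt. The following Python sketch (an assumed translation, with illustrative weights) mirrors one firing of the stateless FIR's `push 1 pop 1 peek N` contract:

```python
# Sketch of StreamIt FIFO semantics: peek(i) reads without consuming,
# pop() consumes one input item, push() appends one output item.

from collections import deque

def fir_work(weights, in_chan, out_chan):
    """One firing of the stateless FIR: peek N, pop 1, push 1."""
    N = len(weights)
    result = 0.0
    for i in range(N):
        result += weights[i] * in_chan[i]   # peek(i)
    in_chan.popleft()                       # pop()
    out_chan.append(result)                 # push(result)

inp = deque([1.0, 2.0, 3.0, 4.0])
out = deque()
fir_work([0.5, 0.5], inp, out)   # fires on items 1.0 and 2.0
fir_work([0.5, 0.5], inp, out)   # fires on items 2.0 and 3.0
print(list(out))  # -> [1.5, 2.5]
```

Because each firing reads a sliding window but consumes only one item, successive firings overlap on the input, which is exactly what the `peek N` declaration expresses.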
StreamIt Language Overview
– StreamIt is a novel language for streaming
– Exposes parallelism and communication
– Architecture independent
– Modular and composable: simple structures are composed to create complex graphs
– Malleable: change program behavior with small modifications
Hierarchical stream structures: filter, pipeline, splitjoin (a splitter fanning out to parallel streams, merged by a joiner), and feedback loop; each component may be any StreamIt language construct.
Mapping to Multicores
– Baseline techniques
– 3-phase solution
Baseline 1: Task Parallelism
[Stream graph: a splitter feeding two pipelines of BandPass, Compress, Process, Expand, BandStop, merged by a joiner into Adder]
– Inherent task parallelism between the two processing pipelines
– Task parallel model: only parallelize explicit task parallelism (fork/join parallelism)
– Executing this on a 2-core machine gives ~2x speedup over a single core
– What about 4, 16, 1024, … cores?
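The fork/join structure can be sketched as follows (a Python stand-in, not StreamIt; `band_pipeline` is a hypothetical placeholder for the BandPass-through-BandStop pipeline). With exactly two pipelines, the parallelism is capped near 2x regardless of core count, which is the slide's point:

```python
# Sketch of the task-parallel baseline: fork the two independent
# pipelines onto separate workers, then join their outputs at the Adder.

from concurrent.futures import ThreadPoolExecutor

def band_pipeline(samples, gain):
    # placeholder for BandPass -> Compress -> Process -> Expand -> BandStop
    return [gain * s for s in samples]

def run_task_parallel(samples):
    with ThreadPoolExecutor(max_workers=2) as pool:   # fork
        lo = pool.submit(band_pipeline, samples, 0.5)
        hi = pool.submit(band_pipeline, samples, 2.0)
        a, b = lo.result(), hi.result()               # join
    return [x + y for x, y in zip(a, b)]              # Adder

print(run_task_parallel([1.0, 2.0]))  # -> [2.5, 5.0]
```

Adding more workers would not help: only two independent tasks exist in the graph.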
Baseline 2: Fine-Grained Data Parallelism
– Each filter in the example is stateless, so we can introduce data parallelism
– Fine-grained data parallel model: fiss each stateless filter N ways (N is the number of cores); remove scatter/gather if possible
– Example: 4 cores; each fission group occupies the entire machine
[Stream graph: every filter (BandPass, Compress, Process, Expand, BandStop) and the Adder replicated four ways, each replica group wrapped in its own splitter/joiner]
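Fission of a stateless filter can be sketched as a round-robin scatter, N replicas, and an order-restoring gather (a Python stand-in for illustration; threads play the role of cores):

```python
# Sketch of "fissing" a stateless filter N ways: scatter inputs
# round-robin to N replicas, run them in parallel, and gather the
# lane outputs back into the original order.

from concurrent.futures import ThreadPoolExecutor

def fiss(filter_fn, items, n_ways):
    # scatter: deal items round-robin to the N replicas
    lanes = [items[i::n_ways] for i in range(n_ways)]
    with ThreadPoolExecutor(max_workers=n_ways) as pool:
        results = list(pool.map(lambda lane: [filter_fn(x) for x in lane],
                                lanes))
    # gather: interleave lane outputs to restore the original order
    out = [None] * len(items)
    for lane_idx, lane_out in enumerate(results):
        out[lane_idx::n_ways] = lane_out
    return out

print(fiss(lambda x: x * x, [1, 2, 3, 4, 5, 6, 7, 8], 4))
# -> [1, 4, 9, 16, 25, 36, 49, 64]
```

This only works because the filter is stateless: every replica computes the same function on its slice, so interleaving is the only bookkeeping needed.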
3-Phase Solution [2006]
Target a 4-core machine.
[Vocoder stream graph: AdaptDFT filters under a splitter, two pipelines of Amplify, Diff, Unwrap, Accum, a joiner, then RectPolar and PolarRect. The filters are data parallel, but most have too little work to fill the machine on their own.]
Data Parallelize
Target a 4-core machine.
[Same graph with RectPolar and PolarRect data-parallelized behind splitter/joiner pairs]
Data + Task Parallel Execution
Target a 4-core machine.
[Schedule: execution time across the 4 cores; the steady state occupies 21 time units]
We Can Do Better!
Target a 4-core machine.
[Same schedule, annotated with the idle time a better mapping could reclaim]
Phase 3: Coarse-Grained Software Pipelining
[Schedule diagram: a prologue followed by the new steady state]
– The new steady state is free of dependencies
– Schedule the new steady state using a greedy partitioning
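The effect of software pipelining can be sketched abstractly: after a prologue fills the pipeline, stage i of one iteration runs concurrently with earlier stages of later iterations, and those concurrent firings are mutually independent. A small Python model (illustrative only, not the compiler's algorithm):

```python
# Sketch of coarse-grained software pipelining: at each time step every
# in-flight iteration advances one stage, and a new iteration enters at
# stage 0. The per-step firing sets are dependence-free, so on a real
# machine they run on different cores.

def software_pipeline(stages, inputs):
    """Returns (outputs, schedule); schedule[t] lists the
    (iteration, stage) pairs that execute concurrently at step t."""
    n = len(stages)
    active = []                      # [iteration, next_stage, value]
    outputs, schedule = [], []
    next_iter = 0
    while next_iter < len(inputs) or active:
        if next_iter < len(inputs):  # a new iteration enters stage 0
            active.append([next_iter, 0, inputs[next_iter]])
            next_iter += 1
        schedule.append([(it, s) for it, s, _ in active])
        for rec in active:           # all active stages fire "in parallel"
            rec[2] = stages[rec[1]](rec[2])
            rec[1] += 1
        outputs.extend(r[2] for r in active if r[1] == n)
        active = [r for r in active if r[1] < n]
    return outputs, schedule

stages = [lambda x: x + 1, lambda x: x * 2]
outs, sched = software_pipeline(stages, [0, 1, 2])
print(outs)      # -> [2, 4, 6]
print(sched[1])  # stage 1 of iteration 0 overlaps stage 0 of iteration 1
```

`sched[0]` is the prologue (only one iteration in flight); from `sched[1]` on, multiple iterations overlap, which is the new dependence-free steady state.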
Greedy Partitioning
Target a 4-core machine.
[Schedule: the filters to schedule are packed onto the 4 cores; the resulting steady state takes 16 time units]
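The greedy partitioning step can be sketched as longest-processing-time bin packing: place filters largest-first, each onto the currently least-loaded core. (Python sketch; the per-filter work estimates below are invented for illustration, not measurements from the talk.)

```python
# Sketch of greedy partitioning: assign filters to cores largest-first,
# always onto the least-loaded core (LPT heuristic). The makespan of the
# most-loaded core bounds the steady-state length.

import heapq

def greedy_partition(work, n_cores):
    """work: dict filter -> cost. Returns (assignment, makespan)."""
    heap = [(0, core) for core in range(n_cores)]  # (load, core id)
    heapq.heapify(heap)
    assignment = {}
    for f in sorted(work, key=work.get, reverse=True):
        load, core = heapq.heappop(heap)           # least-loaded core
        assignment[f] = core
        heapq.heappush(heap, (load + work[f], core))
    makespan = max(load for load, _ in heap)
    return assignment, makespan

# Hypothetical work estimates for the vocoder-like graph above:
work = {"RectPolar": 16, "AdaptDFT1": 5, "AdaptDFT2": 5, "PolarRect": 6,
        "Amplify": 1, "Diff": 1, "Unwrap": 1, "Accum": 1}
assignment, makespan = greedy_partition(work, 4)
print(makespan)  # -> 16 (the single largest filter dominates)
```

The example also shows the heuristic's limit: once one filter is as large as everything else combined, no partitioning can beat its cost, which is why fission precedes partitioning.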
Orchestrating the Execution of Stream Programs on Multicore Platforms (SGMS) [June 2008]
– An integrated unfolding and partitioning step based on integer linear programming: data-parallel actors are unfolded as needed and actors are maximally packed onto cores
– Next, actors are assigned to pipeline stages such that all communication is maximally overlapped with computation on the cores
– Experimental results on the Cell architecture
Cell Architecture
Motivation
– It is critical that the compiler leverage a synergistic combination of parallelism while avoiding structural and resource hazards
– Objective: maximize concurrent execution of actors across multiple cores, while hiding communication overhead to minimize stalls
SGMS: A Phase-Ordered Approach (2 Steps)
First, an integrated actor fission and partitioning step assigns actors to processors while ensuring maximum work balance:
– Data-parallel actors are selectively replicated and split to increase the opportunities for even work distribution (formulated as an integer linear program)
Second, stage assignment:
– Each actor is assigned to a pipeline stage
– Data dependencies are satisfied
– Inter-processor communication latency is maximally overlapped with computation
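The stage-assignment rule can be sketched as a pass over the graph in topological order: a consumer on the same processor may share its producer's stage, while a cross-processor edge inserts a DMA stage in between, pushing the consumer at least two stages later. (Python sketch; the processor placement below is assumed for illustration, not taken from the paper.)

```python
# Sketch of stage assignment: walk edges in topological order; same-core
# consumers may share the producer's stage, while a cross-core edge
# reserves stage+1 for the DMA transfer, so the consumer lands at
# stage+2 or later.

def assign_stages(edges, core_of):
    """edges: (producer, consumer) pairs, producers in topological order.
    core_of: actor -> core id. Returns actor -> stage."""
    stage = {}
    for p, c in edges:
        stage.setdefault(p, 0)                 # sources start at stage 0
        if core_of[p] == core_of[c]:
            needed = stage[p]                  # same core: no DMA needed
        else:
            needed = stage[p] + 2              # DMA occupies stage[p] + 1
        stage[c] = max(stage.get(c, 0), needed)
    return stage

# Two-core example with B fissed into B1/B2 (placement assumed):
edges = [("A", "S"), ("A", "C"), ("S", "B1"), ("S", "B2"),
         ("B1", "J"), ("B2", "J"), ("J", "D"), ("C", "D")]
core_of = {"A": 1, "B1": 1, "S": 2, "B2": 2, "J": 2, "C": 2, "D": 2}
print(assign_stages(edges, core_of))
```

Taking the maximum over all incoming edges is what guarantees every data dependence is satisfied, while the +2 slack is what lets each DMA overlap with unrelated computation in its own stage.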
Orchestrating the Execution of Stream Programs on Multicore Platforms
[Left: original stream graph A → {B, C} → D. Right: fission and processor assignment, with B split into B1 and B2 behind a splitter S and joiner J.]
Maximum speedup (unmodified graph) = 60/40 = 1.5
Orchestrating the Execution of Stream Programs on Multicore Platforms
[Fission and processor assignment: B is fissed into B1 and B2 via splitter S and joiner J, giving loads of Proc 1 = 27 and Proc 2 = 32.]
Maximum speedup (unmodified graph) = 60/40 = 1.5
Maximum speedup (modified graph) = 60/32 ≈ 2
Stage Assignment
[Left: the fission and processor assignment above (Proc 1 = 27, Proc 2 = 32). Right: the same actors placed into pipeline stages 0–4, with DMA transfers occupying their own stages between producers and consumers on different processors.]
Schedule Running on Cell
[Steady-state schedule across SPE1, DMA1, DMA2, SPE2: SPE1 repeatedly runs A, S, B1 and SPE2 runs C, J, B2, while the DMA engines carry the AtoC, StoB2, B1toJ, CtoD, and JtoD transfers for neighboring iterations; a short prologue fills the pipeline before the steady state.]
Naïve Unfolding [1991]
[Three unrolled iterations of the stream graph (actors A–F, subscripted 1–3) mapped across PE1, PE2, PE3 with DMA between them; the legend distinguishes state data dependences from stream data dependences]
Benchmark Characteristics

Benchmark    Actors  Stateful  Peeking  State size (bytes)
bitonic        28       2         0            4
channel        —        —         —            —
dct            36       2         0            4
des            33       2         0            4
fft            17       2         0            4
filterbank     —        —         —            —
fmradio        —        —         —            —
tde            28       2         0            4
mpeg           —        —         —            —
vocoder        —        —         —            —
radar          —        —         —            —
Results
Compare
Effect of Exposed DMA Latency
ILP vs Greedy Partitioning