1
Static Translation of Stream Programming to a Parallel System
S. M. Farhad, PhD Student
Supervisor: Dr. Bernhard Scholz
Programming Language Group, School of Information Technology, University of Sydney
2
Multicores Are Here!
[Chart: number of cores per chip over time (1970-20??). Single-core processors (4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon, Opteron, Xeon MP, PA-8800) give way to multicores such as Power4, Power6, Niagara, Yonah, PExtreme, Tanglewood, Cell, Raw, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, Cisco CSR-1, Picochip PC102, Broadcom 1480, and Ambric AM2045, with core counts rising from 1 toward 512.]
3
Multicores Are Here!
Uniprocessors: C is the common machine language. For uniprocessors, C was:
- Portable
- High performance
- Composable
- Malleable
- Maintainable
[Same chart of core counts per chip over time as the previous slide.]
4
Multicores Are Here!
What is the common machine language for multicores?
[Same chart of core counts per chip over time as the previous slide.]
5
Common Machine Languages
Uniprocessors:
- Common properties: single flow of control, single memory image
- Differences: register file, ISA, functional units (hidden by register allocation, instruction selection, instruction scheduling)
Multicores:
- Common properties: multiple flows of control, multiple local memories
- Differences: number and capabilities of cores, communication model, synchronization model
von-Neumann languages represent the common properties and abstract away the differences.
We need common machine language(s) for multicores.
6
Properties of Stream Programs
- A large (possibly infinite) amount of data
- Limited lifespan of each data item
- Little processing of each data item
- A regular, static computation pattern: stream program structure is relatively constant
- Many opportunities for compiler optimizations
7
Streaming as a Common Machine Language
- Regular and repeating computation
- Independent filters with explicit communication: segregated address spaces and multiple program counters
- Natural expression of parallelism:
  - Producer / consumer dependencies
  - Enables powerful, whole-program transformations
[Stream graph of an FM radio pipeline: AtoD, FMDemod, a Scatter/Gather splitjoin of low-pass and high-pass filters (LPF 1-3, HPF 1-3), Adder, Speaker.]
8
Applications of Stream Programming
9
Model of Computation
- Synchronous Dataflow [Lee '92]
  - Graph of autonomous filters
  - Communicate via FIFO channels
- Static I/O rates
  - Compiler decides on an order of execution (schedule); see the scheduling sketch below
  - Static estimation of computation
[Example stream graph: an A/D source and Band Pass filter feeding a Duplicate splitter with four LED Detect filters.]
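Because the I/O rates are static, the compiler can solve for a steady-state schedule ahead of time. As an illustrative sketch (editor's addition, not from the slides), the following Python fragment computes steady-state repetition counts for a two-filter pipeline from its push/pop rates, using the balance equation push_rate x reps(producer) = pop_rate x reps(consumer); the example rates are made up.

from math import gcd

# Minimal sketch: steady-state repetition counts for a two-stage
# synchronous-dataflow pipeline, derived from its static push/pop rates.
def steady_state_reps(push_rate: int, pop_rate: int) -> tuple[int, int]:
    """Smallest (producer_reps, consumer_reps) such that
    push_rate * producer_reps == pop_rate * consumer_reps."""
    g = gcd(push_rate, pop_rate)
    return pop_rate // g, push_rate // g

if __name__ == "__main__":
    # e.g. a producer pushing 3 items per firing feeding a consumer popping 2
    prod_reps, cons_reps = steady_state_reps(push_rate=3, pop_rate=2)
    print(prod_reps, cons_reps)  # -> 2 3: fire the producer twice, the consumer three times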
10
Example StreamIt Filter
[Figure: the FIR filter peeks at a window of N items on the input tape (indices 0..11 shown), pops one, and pushes one output item per firing.]

float->float filter FIR (int N, float[N] weights) {
  work push 1 pop 1 peek N {
    float result = 0;
    for (int i = 0; i < N; i++) {
      result += weights[i] * peek(i);
    }
    pop();
    push(result);
  }
}

Stateless
11
Example StreamIt Filter
[Figure: the same FIR sliding window over the input tape.]

float->float filter FIR (int N) {
  float[N] weights;
  work push 1 pop 1 peek N {
    float result = 0;
    weights = adaptChannel(weights);
    for (int i = 0; i < N; i++) {
      result += weights[i] * peek(i);
    }
    pop();
    push(result);
  }
}

Stateful (the weights field persists and is updated across firings)
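To make the peek/pop/push semantics concrete, here is a small Python simulation of the stateless FIR work function above (editor's illustration; a deque stands in for the input and output tapes). Because this version carries no mutable state between firings, its firings can be distributed across cores; the stateful variant's weights field creates a dependence between consecutive firings and cannot be split this way.

from collections import deque

def fir_work(tape_in: deque, tape_out: deque, weights: list[float]) -> None:
    """One firing: peek N items, push one weighted sum, pop one item."""
    n = len(weights)
    result = sum(weights[i] * tape_in[i] for i in range(n))  # peek(i)
    tape_in.popleft()         # pop()
    tape_out.append(result)   # push(result)

if __name__ == "__main__":
    weights = [0.25, 0.5, 0.25]
    tape_in = deque([1.0, 2.0, 3.0, 4.0, 5.0])
    tape_out = deque()
    while len(tape_in) >= len(weights):   # fire while a full peek window exists
        fir_work(tape_in, tape_out, weights)
    print(list(tape_out))  # [2.0, 3.0, 4.0]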
12
StreamIt Language Overview
StreamIt is a novel language for streaming:
- Exposes parallelism and communication
- Architecture independent
- Modular and composable: simple structures are composed to create complex graphs
- Malleable: change program behavior with small modifications
[Figure: the hierarchical StreamIt constructs - filter, pipeline, splitjoin (splitter/joiner), and feedback loop (joiner/splitter). Each component may be any StreamIt language construct, which is how the graphs express parallel computation.]
13
Mapping to Multicores
- Baseline techniques
- 3-phase solution
14
Baseline 1: Task Parallelism
[Stream graph: a splitter feeding two pipelines (BandPass, Compress, Process, Expand, BandStop), rejoined and summed by an Adder.]
- Inherent task parallelism between the two processing pipelines
- Task parallel model: only parallelize explicit task parallelism (fork/join parallelism)
- Executing this on a 2-core machine gives ~2x speedup over a single core
- What about 4, 16, 1024, ... cores?
15
Baseline 2: Fine-Grained Data Parallelism
- Each of the filters in the example is stateless, so we can introduce data parallelism
- Fine-grained data parallel model (see the fission sketch below):
  - Fiss each stateless filter N ways (N = number of cores)
  - Remove scatter/gather if possible
- Example: 4 cores; each fission group occupies the entire machine
[Stream graph: every filter (BandPass, Compress, Process, Expand, BandStop, Adder) is replicated four ways behind its own splitter/joiner pair.]
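A minimal sketch (editor's addition, not from the slides) of what fissing a stateless filter N ways means: a round-robin splitter deals items to N replicas of the filter and a joiner interleaves their outputs back, which preserves the sequential output order under the assumption that the filter pops and pushes one item per firing.

# Minimal sketch of N-way fission of a stateless filter
# (assumes the filter pops 1 and pushes 1 per firing, so a
# round-robin scatter/gather preserves the sequential output order).
def fiss(filter_fn, items, n_cores):
    """Run `filter_fn` as n_cores data-parallel replicas over `items`."""
    # Scatter: deal items round-robin to the replicas
    lanes = [items[i::n_cores] for i in range(n_cores)]
    # Each replica applies the same stateless work function
    results = [[filter_fn(x) for x in lane] for lane in lanes]
    # Gather: interleave the per-replica outputs back round-robin
    out = []
    for i in range(len(items)):
        out.append(results[i % n_cores][i // n_cores])
    return out

if __name__ == "__main__":
    def double(x):
        return 2 * x
    print(fiss(double, [1, 2, 3, 4, 5, 6, 7, 8], n_cores=4))
    # [2, 4, 6, 8, 10, 12, 14, 16] - same as running the filter sequentially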
16
3-Phase Solution [2006]
[Stream graph of a simplified vocoder: AdaptDFT filters (work estimate 66) in a splitjoin, RectPolar (20), a splitjoin whose two branches each contain Amplify (2), Diff (1), UnWrap (1), and Accum (1), and PolarRect. RectPolar and PolarRect are the data parallel filters.]
- Target a 4-core machine
- Data parallel, but too little work!
17
Data Parallelize
[Same vocoder graph with RectPolar and PolarRect each fissed four ways behind splitter/joiner pairs, cutting their per-core work from 20 to 5.]
- Target a 4-core machine
18
Data + Task Parallel Execution
[Schedule chart: combining data and task parallelism fills the 4 cores for a steady-state execution time of 21 units.]
- Target a 4-core machine
19
We Can Do Better!
[Schedule chart: an ideal mapping would keep all 4 cores busy for only 16 time units.]
- Target a 4-core machine
20
Phase 3: Coarse-Grained Software Pipelining
[Figure: a prologue schedule followed by a new steady state in which RectPolar firings from different iterations overlap.]
- The new steady state is free of dependencies
- Schedule the new steady state using a greedy partitioning
21
Greedy Partitioning
[Schedule chart: greedily partitioning the filters to be scheduled across the 4 cores achieves the 16-time-unit steady state; a partitioning sketch follows below.]
- Target a 4-core machine
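The slides do not spell out the greedy heuristic; a common choice, shown here as a hedged editor's sketch, is longest-processing-time-first assignment to the currently least-loaded core. The actor names and work values below are made up.

import heapq

# Minimal sketch of a greedy (longest-processing-time-first) partitioner.
def greedy_partition(work: dict[str, int], n_cores: int) -> list[list[str]]:
    """Assign each actor to the least-loaded core, in decreasing order of work."""
    heap = [(0, core, []) for core in range(n_cores)]  # (load, core id, actors)
    heapq.heapify(heap)
    for actor in sorted(work, key=work.get, reverse=True):
        load, core, actors = heapq.heappop(heap)
        heapq.heappush(heap, (load + work[actor], core, actors + [actor]))
    return [actors for _, _, actors in sorted(heap, key=lambda t: t[1])]

if __name__ == "__main__":
    # Made-up work estimates in the spirit of the running example
    work = {"AdaptDFT1": 16, "AdaptDFT2": 16, "RectPolar": 20, "Rest": 12}
    for core, actors in enumerate(greedy_partition(work, n_cores=4)):
        print(core, actors)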
22
Orchestrating the Execution of Stream Programs on Multicore Platforms (SGMS) [June 08]
- An integrated unfolding and partitioning step based on integer linear programming: data parallel actors are unfolded as needed and actors are maximally packed onto the cores.
- Next, the actors are assigned to pipeline stages such that all communication is maximally overlapped with computation on the cores.
- Experimental results on the Cell architecture.
23
Cell Architecture
24
Motivation
- It is critical that the compiler leverage a synergistic combination of parallelism while avoiding structural and resource hazards
- The objective is to maximize concurrent execution of actors across multiple cores, while hiding communication overhead to minimize stalls
25
SGMS: A Phase-Ordered Approach (2 Steps)
- First, an integrated actor fission and partitioning step assigns actors to processors while ensuring maximum work balance (formulated as an integer linear program; a sketch follows below):
  - Data parallel actors are selectively replicated and split to increase the opportunities for even work distribution
- Second, stage assignment:
  - Each actor is assigned to a pipeline stage
  - Data dependencies are satisfied
  - Inter-processor communication latency is maximally overlapped with computation
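The slides say the first step is an integer linear program but do not give the formulation. As a minimal editor's sketch of just its load-balancing core (the real SGMS ILP also folds in the fission and replication decisions), the fragment below minimizes the maximum per-processor load; it assumes the third-party PuLP package, and the actor work estimates are taken from the running two-processor example.

import pulp

# Editor's sketch of the load-balancing core of an ILP partitioner.
actors = {"A": 5, "B1": 20, "B2": 20, "C": 10, "D": 5, "S": 2, "J": 2}
procs = [0, 1]

prob = pulp.LpProblem("partition", pulp.LpMinimize)
assign = {(a, p): pulp.LpVariable(f"x_{a}_{p}", cat="Binary")
          for a in actors for p in procs}
max_load = pulp.LpVariable("II", lowBound=0)    # initiation interval to minimize

prob += max_load                                 # objective: minimize the bottleneck load
for a in actors:                                 # every actor runs on exactly one processor
    prob += pulp.lpSum(assign[a, p] for p in procs) == 1
for p in procs:                                  # no processor exceeds the bottleneck load
    prob += pulp.lpSum(actors[a] * assign[a, p] for a in actors) <= max_load

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for p in procs:
    print(p, [a for a in actors if assign[a, p].value() == 1])
print("II =", max_load.value())                  # 32.0 for this data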
26
Orchestrating the Execution of Stream Programs on Multicore Platforms
[Original stream graph: A (work 5) feeds B (40) and C (10), which feed D (5).]
[Fission and processor assignment: B is split into B1 and B2 (20 each) behind a splitter S and joiner J (2 each).]
Maximum speedup (unmodified graph) = 60/40 = 1.5 (total work 60 divided by the bottleneck actor B, which must stay on one core)
27
Orchestrating the Execution of Stream Programs on Multicore Platforms
[Original stream graph as before; maximum speedup (unmodified graph) = 60/40 = 1.5.]
[Fission and processor assignment: with B split into B1 and B2, processor 1 carries 27 units of work and processor 2 carries 32.]
Maximum speedup (modified graph) = 60/32 ≈ 2
28
Stage Assignment
[Figure: the fissed graph (A, S, B1, B2, C, J, D) assigned to pipeline stages 0 through 4; DMA operations occupy their own stages between producers and consumers mapped to different processors. A stage-assignment sketch follows below.]
[Fission and processor assignment as on the previous slide: Proc 1 = 27, Proc 2 = 32.]
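A minimal sketch of the stage-assignment rule suggested by the figure (editor's illustration, not the paper's algorithm): each actor is placed no earlier than its producers, and a consumer on a different processor is pushed at least two stages later so the intervening stage can hold the DMA transfer. The edge list and the two-processor assignment below are illustrative, not taken from the slide.

from graphlib import TopologicalSorter

# Editor's sketch of pipeline stage assignment.
def assign_stages(edges: list[tuple[str, str]], proc: dict[str, int]) -> dict[str, int]:
    graph: dict[str, set[str]] = {}
    for src, dst in edges:
        graph.setdefault(dst, set()).add(src)
        graph.setdefault(src, set())
    stage: dict[str, int] = {}
    for actor in TopologicalSorter(graph).static_order():
        stage[actor] = 0
        for src, dst in edges:
            if dst == actor:
                # same processor: no gap; different processor: leave a DMA stage
                gap = 0 if proc[src] == proc[actor] else 2
                stage[actor] = max(stage[actor], stage[src] + gap)
    return stage

if __name__ == "__main__":
    # The running example: A -> {S -> {B1, B2} -> J, C} -> D
    edges = [("A", "S"), ("A", "C"), ("S", "B1"), ("S", "B2"),
             ("B1", "J"), ("B2", "J"), ("C", "D"), ("J", "D")]
    proc = {"A": 1, "S": 1, "B1": 1, "C": 1, "B2": 2, "J": 2, "D": 1}
    print(assign_stages(edges, proc))
    # {'A': 0, 'S': 0, 'C': 0, 'B1': 0, 'B2': 2, 'J': 2, 'D': 4} - stages 0 through 4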
29
Schedule Running on Cell
[Figure: prologue followed by the steady-state schedule on two SPEs and two DMA channels. In each steady-state iteration SPE1 runs A, S, B1 and SPE2 runs C, J, B2, while the DMA engines carry the B1-to-J, S-to-B2, A-to-C, J-to-D, and C-to-D transfers from earlier iterations, overlapping communication with computation.]
30
Naïve Unfolding [1991]
[Figure: the stream graph A-F unrolled three times (A1-F1, A2-F2, A3-F3) and mapped across PE1, PE2, PE3 with DMA between them; the legend distinguishes state data dependences from stream data dependences.]
31
Benchmark Characteristics

Benchmark    Actors  Stateful  Peeking  State size (bytes)
bitonic        28       2         0          4
channel        54       2        34        252
dct            36       2         0          4
des            33       2         0          4
fft            17       2         0          4
filterbank     68       2        32        508
fmradio        29       2        14        508
tde            28       2         0          4
mpeg2          26       3         0          4
vocoder        96      11        17        112
radar          54      44         0       1032
32
Results
33
Compare
34
Effect of Exposed DMA Latency
35
ILP vs Greedy Partitioning