Published by Conrad Moore. Modified over 9 years ago.
Slide 1: Orchestration by Approximation: Mapping Stream Programs onto Multicore Architectures
S. M. Farhad (University of Sydney)
Joint work with Yousun Ko, Bernd Burgstaller, and Bernhard Scholz
Slide 2: Outline
- Motivation
  - Multicore
  - Stream programming
  - Research question
- Our work
  - Contributions
  - Overview
  - Details
- Summary
Slide 3: Cores are the New Gates (Shekhar Borkar, Intel)
[Figure: number of cores per chip, annotated with programming models: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind, and stream programming. Courtesy: Scott '08.]
Slide 4: Stream Programming Paradigm
- Research topic in parallel programming
- Various forms of parallelism: pipeline, task, and data
- Applications: signal processing, multimedia, high-performance computing
- Programs are expressed as stream graphs
  - Stream: an infinite sequence of data elements (a.k.a. tokens)
  - Actor: a function applied to streams
[Figure: an actor with its input and output streams]
Slide 5: Properties of Stream Programs
- Regular and repeating computation
- Independent actors with explicit communication
- Producer/consumer dependencies
[Figure: stream graph with actors AtoD, FMDemod, Splitter, LPF 1 to LPF 3, HPF 1 to HPF 3, Joiner, Adder, and Speaker]
Slide 6: StreamIt Language [ASPLOS '02 and '06, PLDI '03]
- An implementation of stream programming
- Model of computation: synchronous data flow
  - The program is a graph of independent actors and streams
  - Actors have an execution step with known input/output rates
- The compiler schedules the actors and manages the buffers
[Figure: pipeline A, B, C with rate labels 1, 5, 1, 1 and repetition counts x5, x1, x1]
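The repetition counts in the figure follow from the balance equations of synchronous data flow. The sketch below is illustrative only: it assumes the rate labels mean that A pushes 1 token, B pops 5 and pushes 1, and C pops 1, and the helper name `repetition_vector` is hypothetical rather than part of StreamIt.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(edges):
    """Smallest integer firing counts for a linear pipeline.

    edges[i] = (push, pop): tokens pushed by actor i and popped by
    actor i + 1 per firing (assumed reading of the slide's labels).
    """
    reps = [Fraction(1)]
    for push, pop in edges:
        # Balance equation: reps[i] * push == reps[i + 1] * pop
        reps.append(reps[-1] * Fraction(push, pop))
    # Scale the fractions to the smallest integer solution.
    common = reduce(lambda a, b: a * b // gcd(a, b),
                    (r.denominator for r in reps), 1)
    scaled = [int(r * common) for r in reps]
    divisor = reduce(gcd, scaled)
    return [s // divisor for s in scaled]


# A pushes 1, B pops 5 and pushes 1, C pops 1.
schedule = repetition_vector([(1, 5), (1, 1)])
print(schedule)  # [5, 1, 1]
```

Under that reading, A fires five times for every firing of B and C, which matches the x5, x1, x1 labels.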
Slide 7: StreamIt Language Contd.
- Each construct has a single input/output stream
- Hierarchical structure
- Filters can be stateful or stateless
[Figure: StreamIt constructs: filter, pipeline, splitjoin (splitter, parallel computation, joiner), and feedback loop (joiner, splitter); each component may be any StreamIt language construct]
Slide 8: Research Question: How to Orchestrate a Stream Graph?
- Mapping actors
- Eliminating bottlenecks (a.k.a. hot actors)
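Slide 9: What We Focus On
- Algorithm for actor allocation
[Figure: stream graph with actors A (5), B (60), C, and D (5) allocated to Core 1, Core 2, and Core 3; A and D share a core with load 10, and the most heavily loaded core has load 60]
- Makespan = 60, speedup = 130/60 = 2.17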
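The makespan and speedup figures on this slide can be reproduced with a few lines. This is a minimal sketch: the work of C is not legible in the transcript and is assumed to be 60 so that the sequential total is 130, and the function name is hypothetical.

```python
def makespan_and_speedup(work, allocation):
    """Return per-core loads, the makespan, and the speedup over one core."""
    loads = [sum(work[actor] for actor in core) for core in allocation]
    makespan = max(loads)  # the slowest core bounds the steady state
    speedup = sum(work.values()) / makespan
    return loads, makespan, speedup


# Work per actor; C = 60 is an assumption inferred from the total of 130.
work = {"A": 5, "B": 60, "C": 60, "D": 5}
allocation = [["A", "D"], ["B"], ["C"]]

loads, makespan, speedup = makespan_and_speedup(work, allocation)
print(loads, makespan, round(speedup, 2))  # [10, 60, 60] 60 2.17
```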
Slide 10: What We Focus On Contd.
- Bottleneck elimination by hot actor duplication
[Figure: in the graph A (5), B (60), C, D (5), the hot actors B and C are each replaced by three replicas of work 20 (B_1 to B_3, C_1 to C_3), with splitters s1, s2 and joiners j1, j2 of work 2 each. Core 1: A, s1, B_1, C_1. Core 2: B_2, j1, s2, C_2, j2. Core 3: B_3, C_3, D.]
- Loads: 47, 46, and 45
- Makespan = 47, speedup = 130/47 = 2.77
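The effect of duplicating a hot actor can be checked numerically. The sketch below assumes that replicas share the work evenly and that each splitter and joiner costs 2, as on the slide; the `duplicate` helper and the value C = 60 are assumptions for illustration, not the compiler's actual transformation.

```python
def duplicate(work, actor, replicas, tag, overhead=2):
    """Replace `actor` by equal replicas plus a splitter and a joiner."""
    new_work = {name: w for name, w in work.items() if name != actor}
    for i in range(1, replicas + 1):
        new_work[f"{actor}_{i}"] = work[actor] / replicas
    new_work[f"s{tag}"] = overhead  # splitter cost (from the slide)
    new_work[f"j{tag}"] = overhead  # joiner cost (from the slide)
    return new_work


original = {"A": 5, "B": 60, "C": 60, "D": 5}  # C = 60 is inferred
graph = duplicate(duplicate(original, "B", 3, tag=1), "C", 3, tag=2)

# Allocation read off the slide.
cores = [["A", "s1", "B_1", "C_1"],
         ["B_2", "j1", "s2", "C_2", "j2"],
         ["B_3", "C_3", "D"]]
core_loads = [sum(graph[a] for a in core) for core in cores]
dup_speedup = sum(original.values()) / max(core_loads)
print(core_loads, round(dup_speedup, 2))  # [47.0, 46.0, 45.0] 2.77
```

The speedup is measured against the original sequential work of 130, so the splitter and joiner overhead counts against the parallel version only.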
Slide 11: Orchestration of Stream Programs Contd.
- Current state of the art
  - Integer linear programming: intractable
  - Heuristics: unknown performance
- How to find a fast and good solution?
- Approximation algorithms that have
  - Polynomial running time
  - A quality bound for the solution
Slide 12: Our Contribution
- A simple quantitative analysis to detect and eliminate bottlenecks
- A novel 2-approximation algorithm for deploying stream graphs on multicore platforms
- Experiments and comparison
  - Results are within 5% of the optimal solution
  - Achieves a geometric mean speedup of 6.95x for 8 processors over single-processor execution
Slide 13: Focus of Our Work
[Figure: StreamIt compiler phases: Stream Graph Scheduling, Stream Graph Partitioning, Layout on Target Architecture, and Communication Scheduling, together with the components Linear Functional Equation Solver, Bottleneck Resolver, and Actor Allocation on Processors]
Slide 14: Our Data Transfer Model
- The arrival rate depends on the data rates of the actors (it is the quantity to maximize)
- The data transfer model forms a system of simultaneous linear functional equations
- We compute a closed form of the output data rate
- We also consider a processor utilization function for each actor
[Figure: pipeline A, B, C with rate labels 1, 5, 1, 1 and arrival rate z]
Slide 15: Bottleneck Analysis
- The arrival rate is limited by
  - Processor capacity of the cores
  - Memory bandwidth
- A quantitative analysis determines
  - An upper bound on the arrival rate imposed by an actor
  - An upper bound on the arrival rate imposed by the parallel system
- Hot actor: upper bound (actor) < upper bound (system)
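The hot-actor test can be illustrated with a deliberately simplified model. The sketch assumes that an actor's utilization grows linearly with the arrival rate z (utilization = z times work), so one actor caps z at 1/work and P cores cap z at P/(total work). Memory bandwidth is ignored, and none of these formulas are taken from the slides.

```python
def hot_actors(work, cores):
    """Actors whose own bound on z is below the system-wide bound.

    Assumed linear model: utilization(actor) = z * work[actor].
    """
    system_bound = cores / sum(work.values())  # all cores 100% busy
    return sorted(actor for actor, w in work.items()
                  if 1.0 / w < system_bound)   # actor saturates a core first


# Running example; C = 60 is inferred from the sequential total of 130.
hot = hot_actors({"A": 5, "B": 60, "C": 60, "D": 5}, cores=3)
print(hot)  # ['B', 'C']
```

Under this model B and C are flagged as hot on three cores, which agrees with the actors that the deck duplicates.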
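Slide 16: Resolving Bottlenecks
[Figure: two ways to resolve the bottleneck in the graph A (5), B (60), C, D (5). Hot region duplication: B and C are replicated together as B_1 to B_3 and C_1 to C_3 (work 20 each) between a single splitter s1 (2) and a single joiner j1 (2). Hot actor duplication: B and C are replicated separately, with s1 and j1 around B_1 to B_3 and s2 and j2 around C_1 to C_3.]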
Slide 17: Actor Allocation Problem
- The actor allocation problem (AAP) is NP-hard
- For a fixed arrival rate, the AAP reduces to the standard bin-packing problem (closed form)
- There exist approximation algorithms for bin packing
  - Polynomial running time
  - Solution quality is bounded
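For a fixed arrival rate, each actor has a fixed utilization and each core is a bin of capacity 100%. The sketch below uses first-fit decreasing, a well-known bin-packing approximation; the slides do not say which bin-packing algorithm the authors use, so this is a stand-in, and the utilization values are made up for illustration.

```python
def first_fit_decreasing(utilization, capacity=100):
    """Pack actors (utilization in percent) into as few cores as first-fit finds."""
    bins, loads = [], []
    for actor, u in sorted(utilization.items(), key=lambda kv: -kv[1]):
        if u > capacity:
            raise ValueError(f"{actor} alone exceeds one core")
        for i, load in enumerate(loads):
            if load + u <= capacity:  # first core with enough room
                bins[i].append(actor)
                loads[i] += u
                break
        else:                         # no core had room: open a new one
            bins.append([actor])
            loads.append(u)
    return bins, loads


# Hypothetical utilizations (percent) at some fixed arrival rate.
utilization = {"A": 10, "s1": 4, "B_1": 40, "B_2": 40, "B_3": 40,
               "C_1": 40, "C_2": 40, "C_3": 40, "j1": 4, "D": 10}
bins, bin_loads = first_fit_decreasing(utilization)
print(len(bins), bin_loads)  # 3 [100, 88, 80]
```

The allocation is feasible on P cores when the number of bins used does not exceed P.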
Slide 18: Actor Allocation Constraint
[Figure: actors 1 to n with their utilizations are packed onto cores; each core has a capacity of 100% utilization]
Slides 19 to 21: Binary Search
[Figure, three animation steps: the solution space of arrival rates, with axis labels 0, ub(z), and 1.0 and markers left, mid, and right. At each step the midpoint is tested with the question "Allocation possible?" and the search interval is narrowed accordingly.]
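The search over arrival rates can be sketched as a bisection with a feasibility oracle. Everything here is an assumption made for illustration: the linear utilization model (z times work), the first-fit decreasing feasibility check, and the function names. A heuristic check like this one is not guaranteed to be monotone in z for every input, although it is for the example shown.

```python
def allocation_possible(work, z, cores):
    """First-fit decreasing check: do all actors fit on `cores` at rate z?"""
    loads = []
    for w in sorted(work.values(), reverse=True):
        u = z * w  # assumed linear utilization model
        if u > 1.0:
            return False
        for i in range(len(loads)):
            if loads[i] + u <= 1.0:
                loads[i] += u
                break
        else:
            if len(loads) == cores:
                return False
            loads.append(u)
    return True


def max_arrival_rate(work, cores, upper_bound, eps=1e-7):
    """Bisect [0, upper_bound] for the largest rate that still fits."""
    left, right = 0.0, upper_bound
    while right - left > eps:
        mid = (left + right) / 2
        if allocation_possible(work, mid, cores):
            left = mid   # feasible: try a higher rate
        else:
            right = mid  # infeasible: lower the rate
    return left


work = {"A": 5, "s1": 2, "B_1": 20, "B_2": 20, "B_3": 20,
        "C_1": 20, "C_2": 20, "C_3": 20, "j1": 2, "D": 5}
z = max_arrival_rate(work, cores=3, upper_bound=3 / sum(work.values()))
print(round(1 / z, 2))  # close to 45, the best makespan found
```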
Slide 22: Actor Allocation of a Bottleneck-Free Program
[Figure: efficient bottleneck resolving followed by mapping. The graph after hot region duplication (A, s1, B_1 to B_3, C_1 to C_3, j1, D) is mapped onto three cores. Core 1: A (5), B_1 (20), C_1 (20). Core 2: s1 (2), B_2 (20), j1 (2), C_2 (20). Core 3: B_3 (20), C_3 (20), D (5).]
- Loads: 45, 44, and 45
- Makespan = 45, speedup = 130/45 = 2.89
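A simple greedy rule reaches the same makespan on this example. The sketch uses longest-processing-time-first (LPT) list scheduling, which is a standard heuristic and not the algorithm presented in this deck; the actor weights are read off the slide and the function name is hypothetical.

```python
def lpt_allocate(work, cores):
    """Assign actors, heaviest first, to the currently least-loaded core."""
    loads = [0] * cores
    mapping = [[] for _ in range(cores)]
    for actor, w in sorted(work.items(), key=lambda kv: -kv[1]):
        target = loads.index(min(loads))  # least-loaded core, lowest index on ties
        mapping[target].append(actor)
        loads[target] += w
    return mapping, loads


# Graph after hot region duplication, weights as shown on the slide.
region_graph = {"A": 5, "s1": 2, "B_1": 20, "B_2": 20, "B_3": 20,
                "j1": 2, "C_1": 20, "C_2": 20, "C_3": 20, "D": 5}
mapping, lpt_loads = lpt_allocate(region_graph, cores=3)
lpt_speedup = 130 / max(lpt_loads)  # 130 = sequential work before duplication
print(lpt_loads, round(lpt_speedup, 2))  # [45, 45, 44] 2.89
```

The individual core assignments differ from the slide, but the makespan of 45 and the speedup of 2.89 are the same.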
Slide 23: Experiments
- Our method is implemented as an extension of the StreamIt compiler
- We compare against an ILP-based method [Scott 08] (solved with CPLEX)
- Hardware setup
  - 2.33 GHz dual quad-core Intel Xeon processors
  - 16 GB memory
  - Linux kernel version 2.6.23
- The profiler uses the x86-64's hardware cycle counters
Slide 24: Experiments Contd.
Experimental process:
1. Profiling
2. Computing the closed form
3. Resolving bottlenecks
4. Computing the mapping
5. Computing the layout scheduling
6. Invoking the StreamIt back end
7. Finally, measuring the performance
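Slide 25: Experiments Contd.
Benchmark characteristics (the Actors and Stateful columns are run together in the source):
Benchmark         Actors/Stateful
DCT               2218
FMRadio           6723
TDE               5527
FFT               2614
MergeSort         312
FilterBankNew     5334
RadixSort         132
EqualizerProgram  6522
BitonicSort       4522
DES               375180
MPEG              397
MatrixMult        522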
Slide 26: Experimental Results for 2 to 4 Processors
Our method's run time: < 1 s

Benchmark    Proc#  ILP time, optimal (s)
DCT          2      0.27
DCT          3      1585.69
DCT          4      2285.01
FMRadio      2      0.08
FMRadio      3      3.22
FMRadio      4      1.29
TDE          2      0.09
TDE          3      0.17
TDE          4      274.69
FFT          2      0.46
FFT          3      44694.25
FFT          4      249240.09
Equalizer    2      0.06
Equalizer    3      5.29
Equalizer    4      57553.83
BitonicSort  2      0.3
BitonicSort  3      3.06
BitonicSort  4      16371.99
DES          2      0.51
DES          3      2.73
DES          4      11.24
MPEG         2      0.12
MPEG         3      1.37
MPEG         4      0.44
Slide 27: Experimental Results for 2 to 4 Processors

Benchmark    Proc#  Arrival rate ratio (approx./optimal)  Approx. arrival rate (MBps)
DCT          2      0.99                                  42.56
DCT          3      0.97                                  62.46
DCT          4      0.96                                  82.57
FMRadio      2      1.00                                  2.69
FMRadio      3      1.00                                  4.00
FMRadio      4      0.99                                  5.34
TDE          2      1.00                                  13.31
TDE          3      1.00                                  19.80
TDE          4      1.00                                  28.81
FFT          2      0.98                                  43.91
FFT          3      0.98                                  46.85
FFT          4      0.95                                  95.50
Equalizer    2      1.00                                  0.56
Equalizer    3      1.00                                  0.83
Equalizer    4      1.00                                  1.59
BitonicSort  2      1.00                                  3.16
BitonicSort  3      1.00                                  4.73
BitonicSort  4      1.00                                  10.14
DES          2      1.00                                  0.12
DES          3      1.00                                  0.18
DES          4      1.00                                  0.24
MPEG         2      1.00                                  36.59
MPEG         3      0.99                                  54.68
MPEG         4      1.00                                  73.22
Slide 28: Speedup Results
[Figure: speedup results]
Slide 29: Summary
- Approximation algorithm for solving the actor allocation problem
- Data transfer rate model that resolves bottlenecks
- We separate bottleneck elimination from actor allocation
- We implemented our approach and compared it with an optimal approach
  - The optimal approach has an unpredictable running time
  - Our approach has a negligible running time for all benchmarks
  - The quality of our approach is at most 5% off the optimum
- For up to 8 processors, we achieve a geometric mean speedup of 6.95x over single-processor execution
Slide 30: Related Works
[1] Static Scheduling of SDF Programs for DSP [Lee '87]
[2] StreamIt: A Language for Streaming Applications [Thies '02]
[3] Phased Scheduling of Stream Programs [Thies '03]
[4] Exploiting Coarse Grained Task, Data, and Pipeline Parallelism in Stream Programs [Thies '06]
[5] Orchestrating the Execution of Stream Programs on Cell [Scott '08]
[6] Software Pipelined Execution of Stream Programs on GPUs [Udupa '09]
[7] Synergistic Execution of Stream Programs on Multicores with Accelerators [Udupa '09]
Slide 31: Questions?