Profile Guided Deployment of Stream Programs on Multicores
S. M. Farhad, The University of Sydney
Joint work with Yousun Ko, Bernd Burgstaller, and Bernhard Scholz
Outline
- Motivation: the multicore trend and stream programming
- Research questions: How to profile communication overhead on multicores? How to deploy stream programs?
- Related work
Motivation
[Figure: growth in cores per chip (courtesy Scott '08) alongside the proliferation of parallel programming models: C/C++/Java, CUDA, X10, PeakStream, Fortress, Accelerator, Ct, CTM, RStream, RapidMind, and stream programming]
Stream Programming Paradigm
- Programs are expressed as stream graphs
- Streams: infinite sequences of data elements
- Actors: functions applied to streams
[Figure: an actor consuming and producing streams]
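A minimal sketch of the paradigm in Python (an illustration of the concepts, not StreamIt; all names here are hypothetical): streams are modeled as generators, and actors as functions applied element by element to a conceptually infinite stream.

```python
def source():
    """Source actor: produces a conceptually infinite stream of elements."""
    n = 0
    while True:
        yield n
        n += 1

def scale(stream, factor):
    """Actor: applies a function to every element of its input stream."""
    for x in stream:
        yield x * factor

def take(stream, k):
    """Helper: observe a finite prefix of an infinite stream."""
    return [next(stream) for _ in range(k)]

# A two-actor pipeline: source -> scale
pipeline = scale(source(), factor=10)
print(take(pipeline, 4))  # [0, 10, 20, 30]
```

Composing actors by nesting generator calls mirrors how a stream graph chains producers and consumers.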
Properties of Stream Programs
- Regular and repeating computation
- Independent actors with explicit communication
- Producer/consumer dependencies
[Figure: FM radio stream graph with actors AtoD, FMDemod, a splitter feeding LPF 1-3 and HPF 1-3, a joiner, Adder, and Speaker]
StreamIt Language
- An implementation of stream programming
- Hierarchical structure: each construct has a single input stream and a single output stream
- The parallel computation inside a splitjoin may be any StreamIt construct
- Constructs: filter, pipeline, splitjoin (splitter/joiner), and feedback loop
Outline
- Motivation: the multicore trend and stream programming
- Research questions: How to profile communication overhead on multicores? How to deploy stream programs?
- Related work
How to Estimate the Communication Overhead on Multicores?
Problems in Measuring Communication Overhead on Multicores
Reasons:
- Multicores are not communication-exposed architectures
- Complex cache hierarchies
- Cache coherence protocols
Consequence: the communication cost cannot be measured directly; instead, it is estimated by measuring the execution times of actors
Measuring the Communication Overhead of an Edge
[Figure: actors i and k both placed on Processor 1 (no communication cost) versus i on Processor 1 and k on Processor 2 (with communication cost)]
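The measurement idea can be sketched as follows (a hypothetical helper; `t_same` and `t_diff` stand for the measured execution times of the two placements above): the communication cost of an edge is estimated as the execution-time difference between placing its endpoint actors on different processors and placing them on the same processor.

```python
def edge_comm_cost(t_same, t_diff):
    """Estimate the communication cost of an edge i -> k.

    t_same: execution time with i and k on the same processor
            (the edge incurs no inter-processor communication).
    t_diff: execution time with i and k on different processors
            (the edge crosses the processor boundary).
    """
    return max(0.0, t_diff - t_same)

# Made-up timings in microseconds:
print(edge_comm_cost(50.0, 75.0))  # 25.0
```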
How to Minimize the Required Number of Experiments
[Figure: a pipeline A -> B -> C whose two edges are graph-colored, requiring 2 + 1 profiling steps; a six-actor graph A-F split across Processor 1 and Processor 2, with an even number of edges across the partition in one mapping and an odd number in another]
Observation 1: a stream graph contains no cycle of three actors
[Figure: actors i, k, and l distributed across Processor 1 and Processor 2]
Observation 2: edges that do not share an endpoint actor do not interfere with each other
[Figure: stream graph A-F; the blue-colored edges are measured simultaneously on processors P-1 through P-4]
Removing Interference
- Convert the stream graph to a line graph
- Add interference edges
- Apply a vertex-coloring algorithm
[Figure: stream graph with actors A-F and its line graph with vertices AB, BC, BD, CE, DE, EF]
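The three steps above can be sketched in Python (a simplified illustration using a greedy heuristic in place of a full vertex-coloring algorithm; the graph and edge names are taken from the slide): every edge of the stream graph becomes a vertex of the line graph, two vertices interfere when the underlying edges share an actor, and each color class is a set of edges that can be profiled in one step.

```python
from itertools import combinations

def line_graph(edges):
    """Line graph: vertices are stream-graph edges; two vertices are
    adjacent (interfere) when the edges share an endpoint actor."""
    adj = {e: set() for e in edges}
    for e, f in combinations(edges, 2):
        if set(e) & set(f):
            adj[e].add(f)
            adj[f].add(e)
    return adj

def greedy_coloring(adj):
    """Give each vertex the smallest color not used by its neighbors."""
    color = {}
    for v in adj:
        used = {color[u] for u in adj[v] if u in color}
        color[v] = next(c for c in range(len(adj)) if c not in used)
    return color

# Stream graph from the slide: A->B, B->C, B->D, C->E, D->E, E->F
edges = [("A", "B"), ("B", "C"), ("B", "D"),
         ("C", "E"), ("D", "E"), ("E", "F")]
steps = greedy_coloring(line_graph(edges))
# Edges sharing a color can be measured in the same profiling step.
```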
Processor Leveling Graph
[Figure: for the blue-colored edge of the stream graph A-F, actors B, C, D, and E are merged, yielding the processor leveling graph A -> {B, C, D, E} -> F]
Coloring the Processor Leveling Graph
[Figure: the leveling graph A -> {B, C, D, E} -> F is two-colored across Processor 1 and Processor 2, assigning A and F to one processor and the merged node {B, C, D, E} to the other]
Measuring the Communication Cost
[Figure: for the blue-colored edge, the stream graph A-F and its processor leveling graph are mapped across Processor 1 and Processor 2]
Profiling Performance

Benchmark    Total Edges  Prof. Steps  Steps/Edge (%)  Err (%)
SAR               44            3            7            10
MatrixMult        88           21           24            17
MergeSort         37            4           11            31
FMRadio           21            3           14            24
DCT               28            9           32            14
RadixSort         12            2           17             5
FFT               26            3           12            27
MPEG              56           17           30            15
Channel           22            6           27            11
BeamFormer        39            5           13
GM                                          17%           15%
Outline
- Motivation: the multicore trend and stream programming
- Research questions: How to profile communication overhead? How to deploy stream programs?
- Related work
Deployment of Stream Programs
[Figure: a pipeline A (5) -> B (40) -> C (40) -> D (5) with edge communication costs A-B = 25, B-C = 5, and C-D = 25; A and B are mapped to Processor 1, C and D to Processor 2]
Load(Processor 1) = (5 + 40) + 5 = 50
Load(Processor 2) = (40 + 5) + 5 = 50
Makespan = 50, speedup = 90/50 = 1.8
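The slide's arithmetic can be reproduced with a small sketch (hypothetical helper; the edge costs A-B = 25, B-C = 5, C-D = 25 are inferred from the load sums on this and the next slide): a processor's load is its actors' work plus the cost of every edge crossing the partition, the makespan is the maximum load, and the speedup is total sequential work divided by the makespan.

```python
def makespan(work, comm, assign):
    """work: actor -> computation cost; comm: (u, v) -> edge cost;
    assign: actor -> processor. A cut edge charges both endpoints."""
    load = {}
    for actor, w in work.items():
        load[assign[actor]] = load.get(assign[actor], 0) + w
    for (u, v), c in comm.items():
        if assign[u] != assign[v]:
            load[assign[u]] += c
            load[assign[v]] += c
    return max(load.values())

work = {"A": 5, "B": 40, "C": 40, "D": 5}
comm = {("A", "B"): 25, ("B", "C"): 5, ("C", "D"): 25}
good = {"A": 1, "B": 1, "C": 2, "D": 2}  # mapping from this slide
m = makespan(work, comm, good)
print(m, sum(work.values()) / m)  # 50, speedup 1.8
```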
Deploying Stream Programs Without Considering Communication
[Figure: the same pipeline, but with A and C mapped to Processor 1 and B and D to Processor 2, so all three edges cross the partition]
Load(Processor 1) = (5 + 40) + (25 + 5 + 25) = 100
Load(Processor 2) = (40 + 5) + (25 + 5 + 25) = 100
Makespan = 100, speedup = 90/100 = 0.9
Degradation = (100 - 50) x 100% / 50 = 100%
Deployment Performance

Benchmark     m (us)   ḿ (us)   (ḿ - m)/m (%)
SAR           45.54     45.54        0
MatrixMult    67.80    111.14       64
MergeSort      1.63      6.99      329
FMRadio        1.57      7.00      346
DCT            4.64      7.68       66
RadixSort      1.49      3.08      107
FFT           18.28     34.15       87
MPEG          37.26     37.26        0
Channel       89.00     91.20        2
BeamFormer     7.29      7.29        0
Speedups Obtained for 2, 4, and 6 Processors
[Figure: speedup chart]
Summary
- We propose an efficient profiling technique for multicores that minimizes the number of profiling steps
- We propose an ILP-based approach that minimizes the makespan
- Experiments show:
  - The number of profiling steps is on average only 17% of the number of edges
  - The profiling scheme shows only 15% average error in the random-mapping test
  - Speedups of 3.11x for 4 processors and 4.02x for 6 processors
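As a stand-in for the ILP formulation (which the slides do not spell out), a brute-force search over all actor-to-processor assignments illustrates what the ILP minimizes; the pipeline below reuses the hypothetical costs from the deployment slides.

```python
from itertools import product

def best_deployment(work, comm, n_procs):
    """Exhaustive search for the assignment with minimum makespan
    (illustrative stand-in for the ILP; exponential in #actors)."""
    actors = list(work)
    best_span, best_assign = float("inf"), None
    for procs in product(range(n_procs), repeat=len(actors)):
        assign = dict(zip(actors, procs))
        load = [0.0] * n_procs
        for a, w in work.items():
            load[assign[a]] += w
        for (u, v), c in comm.items():
            if assign[u] != assign[v]:
                load[assign[u]] += c
                load[assign[v]] += c
        if max(load) < best_span:
            best_span, best_assign = max(load), assign
    return best_span, best_assign

work = {"A": 5, "B": 40, "C": 40, "D": 5}
comm = {("A", "B"): 25, ("B", "C"): 5, ("C", "D"): 25}
span, assign = best_deployment(work, comm, 2)
print(span)  # 50, matching the best mapping on the deployment slide
```

An ILP solver reaches the same optimum without enumerating every assignment; the search above is only practical for tiny graphs.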
Related Works
[1] Static Scheduling of SDF Programs for DSP [Lee '87]
[2] StreamIt: A Language for Streaming Applications [Thies '02]
[3] Phased Scheduling of Stream Programs [Thies '03]
[4] Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs [Thies '06]
[5] Orchestrating the Execution of Stream Programs on Cell [Scott '08]
[6] Software Pipelined Execution of Stream Programs on GPUs [Udupa '09]
[7] Synergistic Execution of Stream Programs on Multicores with Accelerators [Udupa '09]
[8] Orchestration by Approximation [Farhad '11]
Questions?
Minimizing Errors in the Profiling Process
- Errors are likely in any profiling process
- We chose an architecture with a uniform cache hierarchy
- We pin the threads using the likwid-pin tool
Cache Topology of the Processor
[Figure: a hexa-core AMD Phenom(tm) II X6 1090T at 800 MHz, Cores #0 to #5, each with 64 kB L1 and 512 kB L2, sharing a 6 MB L3]