Profile Guided Deployment of Stream Programs on Multicores. S. M. Farhad, The University of Sydney. Joint work with Yousun Ko, Bernd Burgstaller, and Bernhard Scholz.
Outline: Motivation (multicore trend; stream programming). Research questions (How to profile communication overhead on multicores? How to deploy stream programs?). Related works.
Motivation [Figure: number of cores per chip over time. Courtesy: Scott '08.] Programming models: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind; stream programming.
Stream Programming Paradigm: programs are expressed as stream graphs. Streams: infinite sequences of data elements. Actors: functions applied to streams. [Figure: a stream graph with its actors and streams labeled.]
Properties of Stream Programs: regular and repeating computation; independent actors with explicit communication; producer/consumer dependencies. [Figure: FM radio stream graph with actors AtoD, FMDemod, Splitter, LPF 1, LPF 2, LPF 3, HPF 1, HPF 2, HPF 3, Joiner, Adder, Speaker.]
StreamIt Language: an implementation of stream programming with a hierarchical structure. Each construct has a single input stream and a single output stream, and each child of a construct may be any StreamIt language construct. [Figure: the constructs filter, pipeline, splitjoin (splitter, parallel computation, joiner) and feedback loop (joiner, splitter).]
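The hierarchical composition described above can be mimicked in a few lines of Python. This is only an illustrative sketch and not StreamIt code: the helper names `filter_actor`, `pipeline`, and `splitjoin` are hypothetical, streams are modeled as Python iterators, and the splitjoin assumes a duplicating splitter with a round-robin joiner. The feedback loop is left out.

```python
from itertools import count, islice, tee

def filter_actor(fn):
    """Wrap a per-item function as an actor: one input stream, one output stream."""
    def actor(stream):
        for item in stream:
            yield fn(item)
    return actor

def pipeline(*actors):
    """Compose actors sequentially; the result is itself an actor."""
    def actor(stream):
        for a in actors:
            stream = a(stream)
        return stream
    return actor

def splitjoin(*branches):
    """Duplicating splitter and round-robin joiner (an assumed policy)."""
    def actor(stream):
        copies = tee(stream, len(branches))
        outputs = [b(c) for b, c in zip(branches, copies)]
        for items in zip(*outputs):
            yield from items
    return actor

# A pipeline whose second stage is a splitjoin with two filter branches.
graph = pipeline(
    filter_actor(lambda x: x + 1),
    splitjoin(filter_actor(lambda x: x * 2), filter_actor(lambda x: -x)),
)

# The input stream is infinite, so only a prefix of the output is taken.
first_six = list(islice(graph(count()), 6))
```

Because every helper returns an actor with one input and one output, a `splitjoin` can be nested inside a `pipeline` and vice versa, which is the property the slide attributes to StreamIt constructs.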
How to Estimate the Communication Overhead on Multicores?
Problems in Measuring Communication Overhead on Multicores. Reasons: multicores are non-communication-exposed architectures; the cache hierarchy is complex; cache coherence protocols are involved. Consequence: the communication cost cannot be measured directly. Instead, the communication cost is estimated by measuring the execution time of actors.
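Measuring the Communication Overhead of an Edge [Figure: actors i and k connected by an edge. First configuration: both actors on Processor 1, no communication cost. Second configuration: one actor on Processor 1 and the other on Processor 2, with communication cost.]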
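One way to read the two configurations is that the overhead of an edge is the extra execution time observed when its endpoints are moved onto different processors. The sketch below assumes exactly that; the function name `edge_comm_cost`, the averaging over repeated runs, the clamping at zero, and all timing values are hypothetical and not taken from the slides.

```python
from statistics import mean

def edge_comm_cost(times_same_proc, times_split_procs):
    """Estimate the overhead of one edge as the mean slowdown observed
    when its two actors run on different processors.
    Negative differences (measurement noise) are clamped to zero."""
    return max(0.0, mean(times_split_procs) - mean(times_same_proc))

# Hypothetical timings in microseconds for repeated runs of each configuration.
same_proc = [44.0, 46.0]     # actors i and k on the same processor
split_procs = [49.0, 51.0]   # actors i and k on different processors

overhead = edge_comm_cost(same_proc, split_procs)
```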
How to Minimize the Required Number of Experiments. Pipeline graph coloring requires 2+1 steps. [Figure: pipeline A, B, C with edges 1 and 2; pipeline A to F mapped onto Processor 1 and Processor 2, once with the even edges across the partition and once with the odd edges across the partition.]
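The even/odd partitioning of a pipeline can be sketched as follows. The function name `pipeline_mapping` is hypothetical, edges are numbered from 1 along the pipeline, and the "+1" is read here as an additional baseline run without communication, which is an assumption.

```python
def pipeline_mapping(actors, parity):
    """Map a pipeline onto two processors (0 and 1) so that exactly the edges
    whose 1-based index has the given parity (0 = even, 1 = odd) cross the
    partition."""
    mapping = {actors[0]: 0}
    proc = 0
    for idx in range(1, len(actors)):   # edge idx connects actors[idx-1] to actors[idx]
        if idx % 2 == parity:
            proc = 1 - proc             # this edge must cross: switch processor
        mapping[actors[idx]] = proc
    return mapping

actors = ["A", "B", "C", "D", "E", "F"]
even_step = pipeline_mapping(actors, 0)   # edges 2 and 4 cross the partition
odd_step = pipeline_mapping(actors, 1)    # edges 1, 3 and 5 cross the partition
```

With these two mappings every edge of the pipeline crosses the partition in exactly one step, independent of the pipeline length.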
Obs. 1: There is no loop of three actors in a stream graph. [Figure: actors i, k and l placed on Processor 1 and Processor 2.]
Obs. 2: There is no interference of adjacent nodes between edges. [Figure: stream graph with actors A to F; for the blue-colored edges, the actors are placed on processors P-1 to P-4.]
Remove Interference: convert the stream graph to a line graph; add interference edges; use a vertex coloring algorithm. [Figure: stream graph with actors A to F and its line graph with vertices AB, BC, BD, CE, DE, EF.]
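The three steps above can be sketched on the example graph. Two points are assumptions rather than statements from the slides: an interference edge is added between two stream edges whenever some stream edge connects an endpoint of one to an endpoint of the other, and a simple greedy coloring stands in for the unspecified vertex coloring algorithm. The function name `color_edges` is hypothetical.

```python
from itertools import combinations

def color_edges(edges):
    """Greedy vertex coloring of the line graph extended by interference edges.
    Stream edges that receive the same color can be profiled in the same step."""
    adjacent = {frozenset(e) for e in edges}

    def conflict(e, f):
        if set(e) & set(f):
            return True   # line-graph edge: the two stream edges share an actor
        # interference edge (assumed definition): their endpoints are adjacent
        return any(frozenset((u, v)) in adjacent for u in e for v in f)

    conflicts = {e: set() for e in edges}
    for e, f in combinations(edges, 2):
        if conflict(e, f):
            conflicts[e].add(f)
            conflicts[f].add(e)

    colors = {}
    for e in edges:       # greedy: smallest color not used by a colored neighbor
        used = {colors[n] for n in conflicts[e] if n in colors}
        colors[e] = next(c for c in range(len(edges)) if c not in used)
    return colors

edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E"), ("E", "F")]
colors = color_edges(edges)
```

Under these assumptions only AB and EF are free of conflicts with each other, so they share a color and the six edges need five profiling steps.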
Processor Leveling Graph [Figure: stream graph with actors A to F and, for the blue-colored edges, its processor leveling graph with nodes A; B, C, D, E; F.]
Coloring the Processor Leveling Graph [Figure: processor leveling graph with nodes A; B, C, D, E; F colored with two colors, so that A and F go to one processor and B, C, D, E to the other.]
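Measuring the Communication Cost [Figure: for the blue-colored edges of the stream graph with actors A to F, the nodes A; B, C, D, E; F of the processor leveling graph are mapped onto Processor 1 and Processor 2.]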
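The construction in the last three slides can be sketched as follows: contract every edge that is not measured in the current step, then two-color the contracted graph so that each measured edge crosses the two processors. This is an interpretation of the figures, not the authors' implementation; the function name `processor_assignment` and the processor numbers 0 and 1 are hypothetical.

```python
from collections import deque

def processor_assignment(actors, edges, measured):
    """Contract all edges not in `measured`, then 2-color the contracted graph."""
    parent = {a: a for a in actors}

    def find(a):                       # union-find lookup with path halving
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for u, v in edges:                 # contraction of the unmeasured edges
        if (u, v) not in measured:
            parent[find(u)] = find(v)

    neighbors = {find(a): set() for a in actors}
    for u, v in measured:
        ru, rv = find(u), find(v)
        if ru == rv:
            raise ValueError("measured edge collapsed into a single group")
        neighbors[ru].add(rv)
        neighbors[rv].add(ru)

    proc = {}
    for start in neighbors:            # breadth-first 2-coloring per component
        if start in proc:
            continue
        proc[start] = 0
        queue = deque([start])
        while queue:
            g = queue.popleft()
            for h in neighbors[g]:
                if h not in proc:
                    proc[h] = 1 - proc[g]
                    queue.append(h)
                elif proc[h] == proc[g]:
                    raise ValueError("contracted graph is not 2-colorable")
    return {a: proc[find(a)] for a in actors}

actors = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E"), ("E", "F")]
assignment = processor_assignment(actors, edges, measured={("A", "B"), ("E", "F")})
```

For the edges AB and EF this places A and F on one processor and B, C, D, E on the other, so only the two measured edges incur communication.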
Profiling Performance [Table columns: Benchmark | Total Edge | Prof Steps | Steps/Edge (%) | Err (%). Rows: SAR, MatrixMult, MergeSort, FMRadio, DCT, RadixSort, FFT, MPEG, Channel, BeamFormer (39513). GM: Steps/Edge 17%, Err 15%.]
Deployment of Stream Programs [Figure: actors A (5), B (40), C (40), D (5) mapped onto Processor 1 and Processor 2.] Processor 1 load = (5 + 40) + 5 = 50. Processor 2 load = (40 + 5) + 5 = 50. Makespan = 50, Speedup = 90/50 = 1.8.
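The load and speedup arithmetic above can be reproduced with a short sketch. Several details are assumptions: the actors are taken to form a pipeline A, B, C, D with A and B on Processor 1 and C and D on Processor 2, the single crossing edge has overhead 5, and that overhead is charged to both processors, which matches the two load formulas. The function name `processor_loads` is hypothetical.

```python
def processor_loads(work, mapping, comm):
    """Load of a processor = work of its actors + cost of every crossing edge
    incident to it (the cost is charged to both sides of the edge)."""
    load = {}
    for actor, proc in mapping.items():
        load[proc] = load.get(proc, 0) + work[actor]
    for (u, v), cost in comm.items():
        if mapping[u] != mapping[v]:
            load[mapping[u]] += cost
            load[mapping[v]] += cost
    return load

work = {"A": 5, "B": 40, "C": 40, "D": 5}          # actor workloads from the slide
comm = {("B", "C"): 5}                             # assumed crossing edge and overhead
mapping = {"A": 1, "B": 1, "C": 2, "D": 2}         # assumed actor placement

load = processor_loads(work, mapping, comm)
makespan = max(load.values())                      # the slowest processor dominates
speedup = sum(work.values()) / makespan            # sequential time / makespan
```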
Deploying Stream Programs without Considering Communication [Figure: actors A (5), B (40), C (40), D (5) mapped onto Processor 1 and Processor 2.] Processor 1 load = (5 + 40) + ( ) = 100. Processor 2 load = (40 + 5) + ( ) = 100. Makespan = 100, Speedup = 90/100 = 0.9. Increase over the communication-aware makespan of 50: (100 – 50) × 100% / 50 = 100%.
Deployment Performance [Table columns: Benchmark | m (us) | ḿ (us) | (ḿ – m)/m (%). Rows: SAR, MatrixMult, MergeSort, FMRadio, DCT, RadixSort, FFT, MPEG, Channel, BeamFormer.]
Speedups obtained for 2, 4 and 6 processors
Summary: We propose an efficient profiling technique for multicores that minimizes the number of profiling steps. We propose an ILP-based approach that minimizes the makespan. In our experiments, the number of profiling steps is on average only 17% of the number of edges; the profiling scheme shows only 15% error on average in the random mapping test; and the deployment obtains a speedup of 3.11x for 4 processors and 4.02x for 6 processors.
Related Works
[1] Static Scheduling of SDF Programs for DSP [Lee '87]
[2] StreamIt: A language for streaming applications [Thies '02]
[3] Phased Scheduling of Stream Programs [Thies '03]
[4] Exploiting Coarse Grained Task, Data, and Pipeline Parallelism in Stream Programs [Thies '06]
[5] Orchestrating the Execution of Stream Programs on Cell [Scott '08]
[6] Software Pipelined Execution of Stream Programs on GPUs [Udupa '09]
[7] Synergistic Execution of Stream Programs on Multicores with Accelerators [Udupa '09]
[8] Orchestration by approximation [Farhad '11]
Questions?
Minimizing Errors in the Profiling Process: errors are likely in any profiling process. We chose an architecture that has a uniform cache hierarchy. We pin the threads using the likwid-pin tool.
Cache Topology of the Processor: 800 MHz hexa-core AMD Phenom(tm) II X6 1090T. Cores #0 to #5. L1: 64 kB; L2: 512 kB; L3: 6 MB.