Compiler Support for Exploiting Coarse-Grained Pipelined Parallelism. Wei Du, Renato Ferreira, Gagan Agrawal, Ohio State University.

2 Compiler Support for Exploiting Coarse-Grained Pipelined Parallelism. Wei Du, Renato Ferreira, Gagan Agrawal, Ohio State University

3 Coarse-Grained Pipelined Parallelism (CGPP) Definition –Computations associated with an application are carried out in several stages, which are executed on a pipeline of computing units Example: K-nearest Neighbor. Given a 3-D range R and a point (a, b, c), we want to find the K nearest neighbors of the point within R. Stages: Range_query → Find the K-nearest neighbors

4 Coarse-Grained Pipelined Parallelism is Desirable & Feasible Application scenarios (figure: Internet data)

5 Coarse-Grained Pipelined Parallelism is Desirable & Feasible Our belief –A coarse-grained pipelined execution model is a good match for such applications (figure: Internet data)

6 Coarse-Grained Pipelined Parallelism needs Compiler Support Computation needs to be decomposed into stages Decomposition decisions are dependent on execution environment –availability and capacity of computing sites and communication links Code for each stage follows the same processing pattern, so it can be generated by compiler Shared or distributed memory parallelism needs to be exploited High-level language and compiler support are necessary

7 Outline Motivation Overview of the system DataCutter runtime system Language dialect Compiler techniques Experimental results Related work Future work & Conclusions

8 Overview (system diagram: Java dialect → compiler support (decomposition, code generation) → DataCutter runtime system)

9 DataCutter Runtime System Ongoing project at OSU / Maryland (Kurc, Catalyurek, Beynon, Saltz et al.) Targets a distributed, heterogeneous environment Allows decomposition of application-specific data processing operations into a set of interacting processes Provides a specific low-level interface –filter –stream –layout & placement (figure: filter1 → filter2 → filter3, connected by streams)
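DataCutter's actual filter/stream interface is not shown on the slide; the filter-stream idea it describes can be mimicked in a few lines of plain Java. All names below are ours, not DataCutter's API: each filter drains buffers from its input stream and emits transformed buffers on its output stream.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Function;

// A toy filter-stream pipeline: each filter consumes buffers from its input
// stream and emits buffers on its output stream (mimicking, not reproducing,
// the DataCutter filter/stream interface).
public class FilterStreamDemo {
    static Queue<int[]> runFilter(Queue<int[]> in, Function<int[], int[]> work) {
        Queue<int[]> out = new ArrayDeque<>();
        while (!in.isEmpty()) out.add(work.apply(in.poll()));
        return out;
    }

    public static void main(String[] args) {
        Queue<int[]> stream = new ArrayDeque<>();
        stream.add(new int[]{1, 2, 3});
        stream.add(new int[]{4, 5, 6});
        // filter1 scales each element; filter2 shifts it (stand-ins for real stages)
        stream = runFilter(stream, buf -> {
            int[] r = new int[buf.length];
            for (int i = 0; i < buf.length; i++) r[i] = buf[i] * 2;
            return r;
        });
        stream = runFilter(stream, buf -> {
            int[] r = new int[buf.length];
            for (int i = 0; i < buf.length; i++) r[i] = buf[i] + 1;
            return r;
        });
        System.out.println(java.util.Arrays.toString(stream.poll())); // [3, 5, 7]
    }
}
```

In the real system the filters run as separate processes on different hosts and the streams carry buffers over the network; the queue here only stands in for that transport.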

10 Language Dialect Goal –to give compiler information about independent collections of objects, parallel loops, reduction operations, and pipelined parallelism Extensions of Java –Pipelined_loop –Domain & Rectdomain –Foreach loop –reduction variables

11 ISO-Surface Extraction Example Code

public class isosurface {
  public static void main(String arg[]) {
    float iso_value;
    RectDomain CubeRange = [min:max];
    CUBE[1d] InputData = new CUBE[CubeRange];
    Point p, b;
    RectDomain PacketRange = [1:runtime_def_num_packets];
    RectDomain EachRange = [1:(max-min)/runtime_def_num_packets];
    Pipelined_loop (b in PacketRange) {
      Foreach (p in EachRange) {
        InputData[p].ISO_SurfaceTriangles(iso_value, …);
      }
      … …
    }
  }
}

Sequential equivalent: for (int i = min; i < max-1; i++) { // operate on InputData[i] }

A Pipelined_loop (b in PacketRange) { 0. foreach (…) { … } 1. foreach (…) { … } … n-1. S; } is followed by a Merge; e.g., RectDomain PacketRange = [1:4];

12 Overview of the Challenges for the Compiler Filter Decomposition –Identify the candidate filter boundaries –Compute the communication volume between two consecutive filters –Build a cost model –Determine a mapping from computations in a loop to processing units in a pipeline Filter Code Generation

13 Compute Required Communication ReqComm(b) = the set of values that need to be communicated through boundary b Cons(B) = the set of variables that are used in B but not defined in B Gens(B) = the set of variables that are defined in B and still alive at the end of B ReqComm(b 2 ) = ReqComm(b 1 ) – Gens(B) + Cons(B), where B is the block between boundaries b 1 and b 2
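The propagation rule above is just set arithmetic, so it can be sketched directly; the method name and string representation of variables are ours, not the compiler's actual implementation:

```java
import java.util.HashSet;
import java.util.Set;

// Backward propagation of required communication across one code block B:
// ReqComm(b2) = (ReqComm(b1) - Gens(B)) + Cons(B)
public class ReqCommDemo {
    public static Set<String> propagate(Set<String> reqCommBelow,
                                        Set<String> gens, Set<String> cons) {
        Set<String> out = new HashSet<>(reqCommBelow);
        out.removeAll(gens);   // values generated inside B need not cross the boundary
        out.addAll(cons);      // values B consumes must be received across the boundary
        return out;
    }

    public static void main(String[] args) {
        // With ReqComm(b1) = {X, Y} and B having Gens = {X, Y}, Cons = {A},
        // the rule yields ReqComm(b2) = {A}
        Set<String> req = propagate(Set.of("X", "Y"), Set.of("X", "Y"), Set.of("A"));
        System.out.println(req);
    }
}
```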

14 Filter Decomposition A pipeline of computing units C 1, …, C m connected by links L 1, …, L m-1 ; candidate filter boundaries b 1, …, b n split the computation into filters f 1, f 2, …, f n+1. Goal: find a mapping L i → b j, to minimize the predicted execution time, where 1 ≤ i ≤ m-1, 1 ≤ j ≤ n. Intuitively, the candidate filter boundary b j is inserted between computing units C i and C i+1. Exhaustive search (on the order of choose(n+m-1, m-1) mappings) is too expensive.

15 Filter Decomposition: Dynamic Programming (figure: the last computing units C m-2, C m-1, C m, links L m-2, L m-1, and candidate filters f n, f n+1 )

16 Filter Decomposition: Dynamic Programming T[i,j]: min cost of performing computations f 1, …, f i on computing units C 1, …, C j, where the results of f i end up on C j. T[i,j] = min of T[i-1,j] + Cost_comp(P(C j ), Task(f i )) and T[i,j-1] + Cost_comm(B(L j-1 ), Vol(f i )). Goal: T[n+1,m]. Cost: O(mn)
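A direct rendering of this recurrence in plain Java. The cost matrices and the base cases are our assumptions for illustration; in the paper, Cost_comp and Cost_comm would be supplied by the cost model:

```java
// Dynamic-programming filter decomposition, following T[i,j] above.
// comp[i][j]: cost of computing filter f_{i+1} on unit C_{j+1} (0-based here)
// comm[i][j]: cost of sending f_{i+1}'s required data over link L_{j+1}
public class FilterDecomposition {
    public static double minCost(double[][] comp, double[][] comm) {
        int n1 = comp.length;       // n+1 filter candidates f_1 .. f_{n+1}
        int m  = comp[0].length;    // m computing units C_1 .. C_m
        double[][] T = new double[n1][m];
        T[0][0] = comp[0][0];
        for (int j = 1; j < m; j++)          // f_1's results carried further down
            T[0][j] = T[0][j - 1] + comm[0][j - 1];
        for (int i = 1; i < n1; i++)         // all work kept on the first unit
            T[i][0] = T[i - 1][0] + comp[i][0];
        for (int i = 1; i < n1; i++)
            for (int j = 1; j < m; j++)
                T[i][j] = Math.min(T[i - 1][j] + comp[i][j],      // run f_i on C_j
                                   T[i][j - 1] + comm[i][j - 1]); // ship over L_{j-1}
        return T[n1 - 1][m - 1];             // goal: T[n+1, m]
    }

    public static void main(String[] args) {
        double[][] comp = {{5, 1}, {5, 1}};  // C_2 is much faster than C_1
        double[][] comm = {{3}, {2}};        // one link L_1
        System.out.println(minCost(comp, comm)); // best: f_1 on C_1, f_2 on C_2 -> 9.0
    }
}
```

The O(mn) cost claimed on the slide is visible here: one table entry per (filter, unit) pair, each filled in constant time.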

17 Code Generation Abstraction of the work each filter does –Read in a buffer of data from the input stream –Iterate over the set of data –Write out the results to the output stream Code generation issues –How to get Cons(b) from the input stream (unpacking data) –How to organize the output data for the successive filter (packing data)

18 Experimental Results Goal –To show that compiler-generated code is efficient Configurations (# data sites – # computing sites – user machine) –1-1-1 –2-2-1 –4-4-1 (varying the width of the pipeline) (figure: data site(s) → computing site(s) → user machine for each configuration)

19 Experimental Results Versions –Default version: the site hosting the data only reads and transmits data (no processing); the user's desktop only views the results (no processing); all the work is done by the computing nodes. The computing nodes' workload is heavy and the communication volume is high. –Compiler-generated version: intelligent decomposition is done by the compiler; more computation is performed on the end nodes to reduce the communication volume. The workload is balanced between the nodes and the communication volume is reduced. –Manual version: hand-written DataCutter filters with a similar decomposition to the compiler-generated version

20 Experimental Results: ISO-Surface Rendering (varying width of the pipeline) –Small dataset (150M): speedups 1.92 and 3.34 –Large dataset (600M): speedups 1.99 and 3.82 –20% improvement over the default version

21 Experimental Results: KNN (varying width of the pipeline) –K = 3 (108M): speedups 1.89 and 3.38 –K = 200 (108M): speedups 1.87 and 3.82 –>150% improvement over the default version

22 Experimental Results: Virtual Microscope (varying width of the pipeline) –Small query (800M dataset, 512×512) –Large query (800M dataset, 2048×2048) –≈40% improvement over the default version

23 Experimental Results Summary –The compiler-decomposed versions achieve an improvement between 10% and 150% over default versions –In most cases, increasing the width of the pipeline results in near-linear speedup –Compared with the manual version, the compiler-decomposed versions are generally quite close

24 Related Work No previous work on language & compiler support for CGPP StreamIt (MIT) –Targets streaming applications –A language for communication-exposed architectures –A compiler performs stream-specific optimizations –Lower-level language interface –Targets a different architecture Ziegler et al (USC/ISI) –Target pipelined FPGA architectures –Consider different granularities of communication between FPGAs

25 Related Work Run-time support for CGPP –Stampede (Georgia Tech): multimedia applications; support is in the form of cluster-wide threads and shared objects –Yang et al (Penn State): scheduler for vision applications, executed in a pipelined fashion within a cluster –Remos (CMU): resource monitoring system for network-aware applications to get information about the execution environment –Active Stream (Georgia Tech): a middleware approach for distributed applications

26 Future Work & Conclusion Future Work –Buffer size optimization –Cost model refinement & implementation –More applications –More realistic environment settings: resources dynamically available (compiler-directed adaptation)

27 Future Work & Conclusion Conclusion –Coarse-Grained Pipelined Parallelism is desirable & feasible –Coarse-Grained Pipelined Parallelism needs language & compiler support –An algorithm for required communication analysis is given –A dynamic programming algorithm for filter decomposition is developed –A cost model is designed –Results of detailed evaluation of our compiler are encouraging

28 Thank you !!!


30 Cost Model –A sequence of m computing units, C 1, …, C m, with computing powers P(C 1 ), …, P(C m ) –A sequence of m-1 network links, L 1, …, L m-1, with bandwidths B(L 1 ), …, B(L m-1 ) –A sequence of n candidate filter boundaries b 1, …, b n (figure: C 1 → L 1 → C 2 → L 2 → C 3 ) Say L 2 is the bottleneck stage; then T = T(C 1 )+T(L 1 )+T(C 2 )+N*T(L 2 )+T(C 3 )

31 Identify the Candidate Filter Boundaries Three types of candidate boundaries –Start & end of a foreach loop –Conditional statement, e.g.: if (point[p].inRange(high, low)) { local_KNN(point[p]); } –Start & end of a function call within a foreach loop Any non-foreach loop must be completely inside a single filter

32 Coarse-Grained Pipelined Parallelism is Desirable & Feasible A new class of data-intensive applications –scientific data analysis –data mining –data visualization –image analysis –and more … Two direct ways to implement such applications –Downloading all the data to the user's machine –Computing at the data repository

33 Compute Required Communication (example) Boundaries top to bottom: b 2, b 1, b 0, analyzed bottom-up. ReqComm(b 0 ) = { }. The block between b 1 and b 0 has Cons = {X, Y}, Gens = {Z}, so ReqComm(b 1 ) = {X, Y}. The block B between b 2 and b 1 has Cons = {A}, Gens = {X, Y}, so ReqComm(b 2 ) = ReqComm(b 1 ) – Gens(B) + Cons(B) = {A}

34 Compute Required Communication Code block B:
  Z = A + 48
  if Z > 0
    Y = Z * A
  X = Z + A
Cons(B) = the set of variables that are used in B but not defined in B; Gens(B) = the set of variables that are defined in B and still alive at the end of B. Per statement: Z = A + 48: Cons(s) = {A}, Gens(s) = {Z}; if Z > 0: Cons(s) = {Z}, Gens(s) = { }; Y = Z * A: Cons(s) = {Z, A}, Gens(s) = { } (Y is not alive at the end of B); X = Z + A: Cons(s) = {Z, A}, Gens(s) = {X}. For the whole block: Cons(B) = {A}, Gens(B) = {X, Z}
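Given per-statement use/def sets and the live-out set of the block, Cons(B) and Gens(B) fall out of one forward pass; a minimal sketch, under our own representation of statements as use/def string sets:

```java
import java.util.*;

// Cons(B): variables used in B before any definition in B.
// Gens(B): variables defined in B and still live at the end of B.
public class ConsGens {
    public static Set<String> cons(List<Set<String>> uses, List<Set<String>> defs) {
        Set<String> defined = new HashSet<>(), cons = new HashSet<>();
        for (int s = 0; s < uses.size(); s++) {
            for (String v : uses.get(s))
                if (!defined.contains(v)) cons.add(v);  // used before defined in B
            defined.addAll(defs.get(s));
        }
        return cons;
    }

    public static Set<String> gens(List<Set<String>> defs, Set<String> liveOut) {
        Set<String> defined = new HashSet<>();
        for (Set<String> d : defs) defined.addAll(d);
        defined.retainAll(liveOut);                     // keep only values alive at end
        return defined;
    }

    public static void main(String[] args) {
        // The four statements of the example: Z=A+48; if Z>0; Y=Z*A; X=Z+A
        List<Set<String>> uses = List.of(Set.of("A"), Set.of("Z"),
                                         Set.of("Z", "A"), Set.of("Z", "A"));
        List<Set<String>> defs = List.of(Set.of("Z"), Set.of(),
                                         Set.of("Y"), Set.of("X"));
        Set<String> liveOut = Set.of("X", "Z");         // Y is dead after the block
        System.out.println(cons(uses, defs));           // {A}
        System.out.println(gens(defs, liveOut));        // {X, Z}
    }
}
```

The liveness information consumed by gens() would itself come from a standard backward liveness analysis, which this sketch takes as given.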

35 Code Generation Two ways to organize data in a buffer –Instance-wise –Field-wise Class C { int x; float y; int z; } Instance-wise: X Y Z X Y Z … Field-wise: X X … Y Y … Z Z …
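The two layouts above can be illustrated by packing three-field records into a flat array; the field types are simplified to int for the sketch (the class on the slide mixes int and float):

```java
import java.util.Arrays;

// Instance-wise: x1 y1 z1 x2 y2 z2 ...   Field-wise: x1 x2 ... y1 y2 ... z1 z2 ...
public class BufferLayout {
    public static int[] instanceWise(int[][] records) {   // records[i] = {x, y, z}
        int[] buf = new int[records.length * 3];
        int k = 0;
        for (int[] r : records)
            for (int f = 0; f < 3; f++) buf[k++] = r[f];  // whole record together
        return buf;
    }

    public static int[] fieldWise(int[][] records) {
        int[] buf = new int[records.length * 3];
        int k = 0;
        for (int f = 0; f < 3; f++)          // all x's, then all y's, then all z's
            for (int[] r : records) buf[k++] = r[f];
        return buf;
    }

    public static void main(String[] args) {
        int[][] recs = {{1, 2, 3}, {4, 5, 6}};
        System.out.println(Arrays.toString(instanceWise(recs))); // [1, 2, 3, 4, 5, 6]
        System.out.println(Arrays.toString(fieldWise(recs)));    // [1, 4, 2, 5, 3, 6]
    }
}
```

Which layout a generated filter should use depends on how the consuming loops touch the fields, which is exactly what the next slide distinguishes.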

36 Code Generation Ways that fields of an object are used –In the same loop (instance-wise layout fits): for (int i = 0; i < count; i++) { … = InputData[i].x + …; … = … + InputData[i].y; } –In different loops (field-wise layout fits): for (int i = 0; i < count; i++) { … = InputData[i].x + …; } for (int i = 0; i < count; i++) { … = … + InputData[i].y; }

37 Cost Model (figure: pipeline C 1 → L 1 → C 2 → L 2 → C 3 ; time vs. stage) Say L 2 is the bottleneck stage; then T = T(C 1 )+T(L 1 )+T(C 2 )+N*T(L 2 )+T(C 3 ). Say C 2 is the bottleneck stage; then T = T(C 1 )+T(L 1 )+N*T(C 2 )+T(L 2 )+T(C 3 )
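Both formulas above have the same shape: for N data packets the bottleneck stage is traversed N times and every other stage once, so the total is the sum of all stage times plus (N-1) times the bottleneck time. A minimal sketch of that computation (function and parameter names are ours):

```java
// T = T(C1)+T(L1)+...+N*T(bottleneck)+...  ==  sum(stages) + (N-1)*max(stages)
public class PipelineCost {
    public static double totalTime(double[] stageTimes, long nPackets) {
        double sum = 0, bottleneck = 0;
        for (double t : stageTimes) {
            sum += t;                             // every stage crossed once
            bottleneck = Math.max(bottleneck, t); // slowest stage dominates
        }
        return sum + (nPackets - 1) * bottleneck; // bottleneck crossed N times total
    }

    public static void main(String[] args) {
        // Stages C1, L1, C2, L2, C3 with L2 the bottleneck, N = 10 packets:
        // 1 + 2 + 1 + 10*4 + 1 = 45
        double[] stages = {1, 2, 1, 4, 1};
        System.out.println(totalTime(stages, 10)); // 45.0
    }
}
```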

38 Experimental Results: ISO-Surface Rendering (Active Pixel Based) (varying width of the pipeline) –Small dataset (150M) –Large dataset (600M) –Speedup close to linear –>15% improvement over the default version

