Compiler Supported Coarse-Grained Pipelined Parallelism: Why and How


1 Compiler Supported Coarse-Grained Pipelined Parallelism: Why and How
Gagan Agrawal, Wei Du, Tahsin Kurc, Umit Catalyurek, Joel Saltz. The Ohio State University

2 Overall Context
NGS grant titled "An Integrated Middleware and Language/Compiler Framework for Data-Intensive Applications", funded September 2002 – August 2005.
Project components:
- Runtime optimizations in the DataCutter system
- Compiler optimization of DataCutter filters
- Automatic generation of DataCutter filters (the focus of this talk)

3 General Motivation
Language and compiler support for many forms of parallelism has been explored:
- Shared-memory parallelism
- Instruction-level parallelism
- Distributed-memory parallelism
- Multithreaded execution
Application and technology trends are making another form of parallelism desirable and feasible: coarse-grained pipelined parallelism.

4 Coarse-Grained Pipelined Parallelism (CGPP)
Definition: the computation associated with an application is carried out in several stages, which are executed on a pipeline of computing units.
Example: K-nearest neighbors. Given a 3-D range R = <(x1, y1, z1), (x2, y2, z2)> and a point P = (a, b, c), we want to find the K nearest neighbors of P within R. The computation splits into two stages: range_query, then finding the K nearest neighbors.
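As a concrete illustration of the two stages, here is a minimal plain-Java sketch; the Point record and the stage methods are invented for this example, not code from the talk:

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    // Sketch of the two-stage KNN pipeline: stage 1 restricts the data
    // to the range R, stage 2 selects the K nearest neighbors of P.
    class KnnPipeline {
        record Point(double x, double y, double z) {
            double dist2(Point o) {
                double dx = x - o.x, dy = y - o.y, dz = z - o.z;
                return dx * dx + dy * dy + dz * dz;
            }
        }

        // Stage 1: range_query keeps only the points inside R = <lo, hi>.
        static List<Point> rangeQuery(List<Point> data, Point lo, Point hi) {
            return data.stream()
                .filter(q -> q.x >= lo.x && q.x <= hi.x
                          && q.y >= lo.y && q.y <= hi.y
                          && q.z >= lo.z && q.z <= hi.z)
                .collect(Collectors.toList());
        }

        // Stage 2: sort the surviving points by distance to P, keep K.
        static List<Point> kNearest(List<Point> candidates, Point p, int k) {
            return candidates.stream()
                .sorted(Comparator.comparingDouble((Point q) -> q.dist2(p)))
                .limit(k)
                .collect(Collectors.toList());
        }
    }

In a pipelined execution, rangeQuery can run at the data repository and kNearest closer to the user, so only the points inside R cross the network.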

5 Coarse-Grained Pipelined Parallelism is Desirable & Feasible
Application scenarios: [Figure: data repositories connected to user machines over the Internet]

6 Coarse-Grained Pipelined Parallelism is Desirable & Feasible
A new class of data-intensive applications:
- Scientific data analysis
- Data mining
- Data visualization
- Image analysis
Two direct ways to implement such applications:
- Downloading all the data to the user's machine: often not feasible
- Computing at the data repository: usually too slow

7 Coarse-Grained Pipelined Parallelism is Desirable & Feasible
Our belief: a coarse-grained pipelined execution model is a good match for these applications. [Figure: pipeline stages spanning the data repository, the Internet, and the user's machine]

8 Coarse-Grained Pipelined Parallelism Needs Compiler Support
- Computation needs to be decomposed into stages.
- Decomposition decisions depend on the execution environment:
  - How many computing sites are available
  - How many computing cycles are available on each site
  - Which communication links are available
  - What the bandwidth of each link is
- Code for each stage follows the same processing pattern, so it can be generated by the compiler.
- Shared- or distributed-memory parallelism needs to be exploited.
Therefore, high-level language and compiler support are necessary.

9 Outline
- Coarse-grained pipelined parallelism is desirable & feasible
- Coarse-grained pipelined parallelism needs high-level language & compiler support
- Overall picture of the system
- DataCutter runtime system & language dialect
- Overview of the challenges for the compiler
- Compiler techniques
- Experimental results
- Related work
- Future work & conclusions

10 Overall Picture of the System
[Figure: Java dialect → compiler support (decomposition, code generation) → DataCutter runtime system]

11 DataCutter Runtime System
- Ongoing project at OSU / Maryland (Kurc, Catalyurek, Beynon, Saltz et al.)
- Targets a distributed, heterogeneous environment
- Allows decomposition of application-specific data processing operations into a set of interacting processes
- Provides a specific low-level interface: filters, streams, and their layout & placement
[Figure: filter1 → stream → filter2 → stream → filter3]

12 Language Dialect
Goal: to give the compiler information about independent collections of objects, parallel loops, reduction operations, and pipelined parallelism.
Extensions of Java:
- Pipelined_loop
- Domain & RectDomain
- Foreach loop
- Reduction variables

13 ISO-Surface Extraction Example Code
The original sequential loop:

    for (int i = min; i < max-1; i++) {
      // operate on InputData[i]
    }

In the dialect, the loop becomes a Pipelined_loop enclosing a sequence of foreach stages S1, ..., Sn, followed by a merge:

    public class isosurface {
      public static void main(String arg[]) {
        float iso_value;
        RectDomain<1> CubeRange = [min:max];
        CUBE[1d] InputData = new CUBE[CubeRange];
        Point<1> p, b;
        RectDomain<1> PacketRange = [1:runtime_def_num_packets];
        RectDomain<1> EachRange = [1:(max-min)/runtime_def_num_packets];
        Pipelined_loop (b in PacketRange) {
          Foreach (p in EachRange) {
            InputData[p].ISO_SurfaceTriangles(iso_value, ...);
          }
          // ... further foreach stages and a merge ...
        }
      }
    }

14 Overview of the Challenges for the Compiler
Filter decomposition:
- Identify the candidate filter boundaries
- Compute the communication volume between two consecutive filters
- Build a cost model
- Compute a mapping from computations in a loop to computing units in a pipeline
Filter code generation

15 Identify the Candidate Filter Boundaries
Three types of candidate boundaries:
- The start and end of a foreach loop
- A conditional statement, e.g.:

    if (point[p].inRange(high, low)) {
      local_KNN(point[p]);
    }

- The start and end of a function call within a foreach loop
Any non-foreach loop must be completely inside a single filter.

16 Compute Required Communication
- ReqComm(b): the set of values that need to be communicated across boundary b
- Cons(B): the set of variables that are used in B but not defined in B
- Gens(B): the set of variables that are defined in B and still live at the end of B
For a code block B lying between an earlier boundary b2 and a later boundary b1, the analysis works backwards:
ReqComm(b2) = ReqComm(b1) - Gens(B) + Cons(B)
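To make the recurrence concrete, the following is a minimal sketch (class and method names are invented, not the compiler's actual code) that propagates ReqComm backwards over the blocks between consecutive boundaries, representing each set as variable names:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Backward required-communication analysis over a sequence of blocks.
    class ReqCommAnalysis {
        // A code block B between two candidate boundaries.
        record Block(Set<String> cons, Set<String> gens) {}

        // blocks are in execution order; reqAfterLast is ReqComm at the
        // final boundary. Returns ReqComm at the boundary before each block.
        static List<Set<String>> reqComm(List<Block> blocks, Set<String> reqAfterLast) {
            List<Set<String>> req = new ArrayList<>(
                Collections.nCopies(blocks.size() + 1, Set.<String>of()));
            req.set(blocks.size(), new HashSet<>(reqAfterLast));
            for (int i = blocks.size() - 1; i >= 0; i--) {
                Set<String> r = new HashSet<>(req.get(i + 1));
                r.removeAll(blocks.get(i).gens());  // - Gens(B)
                r.addAll(blocks.get(i).cons());     // + Cons(B)
                req.set(i, r);
            }
            return req;
        }
    }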

17 Cost Model
- A sequence of m computing units C1, …, Cm with computing powers P(C1), …, P(Cm)
- A sequence of m-1 network links L1, …, Lm-1 with bandwidths B(L1), …, B(Lm-1)
- A sequence of n candidate filter boundaries b1, …, bn

18 Cost Model
[Figure: timing diagram of the pipeline C1 → L1 → C2 → L2 → C3 processing N data packets]
If L2 is the bottleneck stage: T = T(C1) + T(L1) + T(C2) + N*T(L2) + T(C3)
If C2 is the bottleneck stage: T = T(C1) + T(L1) + N*T(C2) + T(L2) + T(C3)
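The formula can be evaluated mechanically: charge every stage its per-packet time once, then charge the slowest stage N-1 extra times. A minimal sketch, assuming the per-packet stage times have already been derived from P(Ci) and B(Li):

    // Evaluate the pipeline cost model: the bottleneck stage is paid
    // N times, every other stage once (N = number of data packets).
    class PipelineCost {
        // stageTimes holds per-packet times in pipeline order:
        // T(C1), T(L1), T(C2), T(L2), ..., T(Cm).
        static double predictedTime(double[] stageTimes, int numPackets) {
            double sum = 0, max = 0;
            for (double t : stageTimes) {
                sum += t;
                max = Math.max(max, t);
            }
            return sum + (numPackets - 1) * max;
        }
    }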

19 Filter Decomposition
Goal: find a mapping Li → bj that minimizes the predicted execution time, where 1 ≤ i ≤ m-1 and 1 ≤ j ≤ n. Intuitively, mapping Li to bj inserts the candidate filter boundary bj between computing units Ci and Ci+1.
[Figure: pipeline C1, L1, …, Lm-1, Cm with candidate boundaries b1, …, bn separating code segments f1, …, fn+1]
Exhaustive search over all mappings of the m-1 links to the n candidate boundaries is prohibitively expensive.

20 Filter Decomposition: A Greedy Algorithm
The goal is to minimize the predicted execution time. The algorithm considers each link in turn and estimates the cost of mapping it to every remaining candidate boundary.
[Figure: pipeline C1, L1, C2, L2, C3, L3, C4 with candidate boundaries b1, …, b4 separating code segments f1, …, f5. For link L1 the estimated costs are L1 to b1: T1, L1 to b2: T2, L1 to b3: T3, L1 to b4: T4; since Min{T1, …, T4} = T2, boundary b2 is chosen for L1, placing f1 and f2 on C1.]
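The slides show only one greedy step in a picture; the following is a hedged reconstruction of the full loop (the cost function is left abstract, and all names are invented). Each link Li is considered in pipeline order, and the not-yet-used boundary with the lowest estimated time is assigned to it:

    import java.util.function.BiFunction;

    // Greedy filter decomposition: map each link to one candidate boundary.
    // Assumes numBoundaries >= numLinks and that boundaries keep their
    // program order, so later links only consider later boundaries.
    class GreedyDecomposition {
        // cost.apply(link, boundary) = predicted time if that boundary
        // is placed on that link, given the choices made so far.
        static int[] decompose(int numLinks, int numBoundaries,
                               BiFunction<Integer, Integer, Double> cost) {
            int[] choice = new int[numLinks];
            int first = 0;  // boundaries before 'first' are already taken
            for (int link = 0; link < numLinks; link++) {
                int best = -1;
                double bestT = Double.POSITIVE_INFINITY;
                for (int b = first; b < numBoundaries; b++) {
                    double t = cost.apply(link, b);
                    if (t < bestT) { bestT = t; best = b; }
                }
                choice[link] = best;   // e.g., Min{T1, ..., T4} = T2 picks b2
                first = best + 1;
            }
            return choice;
        }
    }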

21 Code Generation
Abstraction of the work each filter does:
- Read in a buffer of data from the input stream
- Iterate over the set of data
- Write out the results to the output stream
Code generation issues (see the sketch below):
- How to get Cons(b) from the input stream (unpacking data)
- How to organize the output data for the successive filter (packing data)
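As an illustration of the packing/unpacking issue (a hypothetical helper, not the compiler's actual generated code), a filter can serialize the values required at a boundary into one flat buffer for the output stream and decode it on the receiving side:

    import java.nio.ByteBuffer;

    // Sketch of packing/unpacking filter data for a stream, assuming the
    // values to communicate at boundary b are a flat array of floats.
    class FilterBuffers {
        // Pack the outgoing values, prefixed with an element count.
        static byte[] pack(float[] values) {
            ByteBuffer buf = ByteBuffer.allocate(4 + 4 * values.length);
            buf.putInt(values.length);
            for (float v : values) buf.putFloat(v);
            return buf.array();
        }

        // Unpack Cons(b) from the next filter's input stream.
        static float[] unpack(byte[] data) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            float[] values = new float[buf.getInt()];
            for (int i = 0; i < values.length; i++) values[i] = buf.getFloat();
            return values;
        }
    }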

22 Experimental Results
Goal: to show that the compiler-generated code is efficient.
Environment settings:
- 700 MHz Pentium machines
- Connected through Myrinet LANai 7.0
Configurations (# data sites - # computing sites - user machine):
- 1-1-1
- 2-2-1
- 4-4-1

23 Experimental Results: Versions
- Default version: the site hosting the data only reads and transmits data, and the user's desktop only views the results; all the work is done by the compute nodes. Computing-node workload is heavy and communication volume is high.
- Compiler-generated version: intelligent decomposition is done by the compiler; more computation is performed on the end nodes to reduce the communication volume. Workload is balanced across nodes and communication volume is reduced.
- Manual version: hand-written DataCutter filters with a decomposition similar to the compiler-generated version.

24 Experimental Results: ISO-Surface Rendering (Z-Buffer Based)
[Charts: speedup vs. width of pipeline, for a small dataset (150M) and a large dataset (600M)]
20% improvement over the default version.

25 Experimental Results: ISO-Surface Rendering (Active Pixel Based)
[Charts: speedup vs. width of pipeline, for a small dataset (150M) and a large dataset (600M)]
More than 15% improvement over the default version; speedup is close to linear.

26 Experimental Results: KNN
[Charts: speedup vs. width of pipeline]
More than 150% improvement over the default version.

27 Experimental Results: Virtual Microscope
[Charts: speedup vs. width of pipeline, for a small query (800M, 512×512) and a large query (800M, 2048×2048)]
Approximately 40% improvement over the default version.

28 Experimental Results: Summary
- The compiler-decomposed versions achieve an improvement between 10% and 150% over the default versions.
- In most cases, increasing the width of the pipeline results in near-linear speedup.
- The compiler-decomposed versions are generally quite close to the manual versions.

29 Ongoing and Future Work
- Buffer size optimization
- Cost model refinement & implementation
- More applications
- More realistic environment settings, with resources becoming available dynamically

30 Conclusions
- Coarse-grained pipelined parallelism is desirable & feasible.
- Coarse-grained pipelined parallelism needs language & compiler support.
- An algorithm for required communication analysis is given.
- A greedy algorithm for filter decomposition is developed.
- A cost model is designed.
- The results of a detailed evaluation of our compiler are encouraging.

