Compiler Support for Exploiting Coarse-Grained Pipelined Parallelism Wei Du Renato Ferreira Gagan Agrawal Ohio State University



Coarse-Grained Pipelined Parallelism (CGPP) Definition –Computations associated with an application are carried out in several stages, which are executed on a pipeline of computing units Example — K-nearest Neighbor: given a 3-D range R and a point p = (a, b, c), find the K nearest neighbors of p within R. Two pipeline stages: Range_query, then Find the K-nearest neighbors

Coarse-Grained Pipelined Parallelism is Desirable & Feasible Application scenarios (figure: data accessed over the Internet)

Coarse-Grained Pipelined Parallelism is Desirable & Feasible Our belief –A coarse-grained pipelined execution model is a good match for such applications (figure: data accessed over the Internet)

Coarse-Grained Pipelined Parallelism needs Compiler Support Computation needs to be decomposed into stages Decomposition decisions are dependent on the execution environment –availability and capacity of computing sites and communication links Code for each stage follows the same processing pattern, so it can be generated by the compiler Shared or distributed memory parallelism needs to be exploited High-level language and compiler support are necessary

Outline Motivation Overview of the system DataCutter runtime system & language dialect Compiler techniques Experimental results Related work Future work & Conclusions

Overview The system takes a Java dialect as input; compiler support performs decomposition and code generation, targeting the DataCutter runtime system

DataCutter Runtime System Ongoing project at OSU / Maryland (Kurc, Catalyurek, Beynon, Saltz et al) Targets a distributed, heterogeneous environment Allows decomposition of application-specific data processing operations into a set of interacting processes Provides a specific low-level interface –filter –stream –layout & placement (figure: filter1 → filter2 → filter3, connected by streams)
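The filter-stream abstraction above can be illustrated with a minimal sketch (hypothetical Java, not the actual DataCutter interface, which is a C++ framework): filters run concurrently and communicate only through unidirectional streams, modeled here as bounded queues.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal filter-stream sketch: three filters connected by two streams.
// All names here are illustrative, not DataCutter's real API.
public class FilterPipeline {
    static final Integer EOS = Integer.MIN_VALUE; // end-of-stream marker

    static int runPipeline() throws InterruptedException {
        BlockingQueue<Integer> s1 = new ArrayBlockingQueue<>(16);
        BlockingQueue<Integer> s2 = new ArrayBlockingQueue<>(16);

        Thread filter1 = new Thread(() -> {        // produces data buffers
            try {
                for (int i = 1; i <= 5; i++) s1.put(i);
                s1.put(EOS);
            } catch (InterruptedException ignored) { }
        });
        Thread filter2 = new Thread(() -> {        // transforms each buffer
            try {
                for (Integer v; !(v = s1.take()).equals(EOS); ) s2.put(v * v);
                s2.put(EOS);
            } catch (InterruptedException ignored) { }
        });
        filter1.start();
        filter2.start();

        int sum = 0;                               // filter3: consumes results
        for (Integer v; !(v = s2.take()).equals(EOS); ) sum += v;
        filter1.join();
        filter2.join();
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runPipeline()); // prints 55 (1+4+9+16+25)
    }
}
```

Because each filter blocks only on its own stream, the stages overlap in time, which is exactly the pipelined execution the deck targets.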

Language Dialect Goal –to give the compiler information about independent collections of objects, parallel loops, reduction operations, and pipelined parallelism Extensions of Java –Pipelined_loop –Domain & Rectdomain –Foreach loop –reduction variables

ISO-Surface Extraction Example Code

public class isosurface {
  public static void main(String arg[]) {
    float iso_value;
    RectDomain<1> CubeRange = [min:max];
    CUBE[1d] InputData = new CUBE[CubeRange];
    Point<1> p, b;
    RectDomain<1> PacketRange = [1:runtime_def_num_packets];
    RectDomain<1> EachRange = [1:(max-min)/runtime_define_num_packets];
    Pipelined_loop (b in PacketRange) {
      Foreach (p in EachRange) {
        InputData[p].ISO_SurfaceTriangles(iso_value, …);
      }
      …
    }
  }
}

Sequential equivalent:
  for (int i = min; i < max-1; i++) { // operate on InputData[i] }

General pattern:
  Pipelined_loop (b in PacketRange) {
    0. foreach (…) { … }
    1. foreach (…) { … }
    …
    n-1. S;
  }
  Merge

Example: RectDomain<1> PacketRange = [1:4];

Overview of the Challenges for the Compiler Filter Decomposition –Identify the candidate filter boundaries –Compute the communication volume between two consecutive filters –Cost model –Determine a mapping from computations in a loop to processing units in a pipeline Filter Code Generation

Compute Required Communication ReqComm(b) = the set of values that need to be communicated through boundary b Cons(B) = the set of variables that are used in B but not defined in B Gens(B) = the set of variables that are defined in B and still alive at the end of B For a block B between boundaries b1 and b2: ReqComm(b2) = ReqComm(b1) – Gens(B) + Cons(B)

Filter Decomposition Given computing units C1,…,Cm connected by links L1,…,Lm-1, candidate filter boundaries b1,…,bn divide the computation into atomic units f1,…,fn+1 Goal: find a mapping Li → bj that minimizes the predicted execution time, where 1 ≤ i ≤ m-1, 1 ≤ j ≤ n Intuitively, mapping Li to bj inserts the candidate filter boundary bj between computing units Ci and Ci+1 Exhaustive search considers C(n+m-1, m-1) possible mappings

Filter Decomposition: Dynamic Programming (figure: placing fn and fn+1 on the last computing units Cm-2, Cm-1, Cm across links Lm-2, Lm-1)

Filter Decomposition: Dynamic Programming T[i,j]: the minimum cost of performing computations f1,…,fi on computing units C1,…,Cj, where the results of fi end up on Cj T[i,j] = min of: T[i-1,j] + Cost_comp(P(Cj), Task(fi)) T[i,j-1] + Cost_comm(B(Lj-1), Vol(fi)) Goal: T[n+1,m] Cost: O(mn)
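The recurrence can be sketched directly as an O(mn) dynamic program. This is a hypothetical implementation, not the paper's: the cost functions are simplified to Task/P for computation and Vol/B for communication, and the input data is assumed to start on C1.

```java
public class FilterDecomposition {
    static final double INF = Double.POSITIVE_INFINITY;

    // p[j-1] = P(Cj), bw[k-1] = B(Lk), task[i-1] = Task(fi), vol[i-1] = Vol(fi)
    static double minCost(double[] p, double[] bw, double[] task, double[] vol) {
        int n1 = task.length, m = p.length;         // n+1 computations, m units
        double[][] T = new double[n1 + 1][m + 1];
        for (int j = 2; j <= m; j++) T[0][j] = INF; // input data starts on C1
        for (int i = 1; i <= n1; i++) {
            for (int j = 1; j <= m; j++) {
                // run fi on Cj, after f1..f(i-1) already ended on Cj
                double stay = T[i - 1][j] + task[i - 1] / p[j - 1];
                // or ship fi's results from C(j-1) to Cj over link L(j-1)
                double move = (j > 1) ? T[i][j - 1] + vol[i - 1] / bw[j - 2] : INF;
                T[i][j] = Math.min(stay, move);
            }
        }
        return T[n1][m]; // = T[n+1, m] in the slide's notation
    }

    public static void main(String[] args) {
        // 2 units (powers 1 and 2), 1 link (bandwidth 1),
        // 2 computations (work 4 each), output volumes 2 and 1
        System.out.println(minCost(new double[]{1, 2}, new double[]{1},
                                   new double[]{4, 4}, new double[]{2, 1})); // 8.0
    }
}
```

In the tiny instance above, the cheapest plan computes f1 on C1 (cost 4), ships its 2 units of output over L1 (cost 2), and computes f2 on the faster C2 (cost 2), matching T[2,2] = 8.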

Code Generation Abstraction of the work each filter does –Read in a buffer of data from the input stream –Iterate over the set of data –Write out the results to the output stream Code generation issues –How to get Cons(b) from the input stream --- unpacking data –How to organize the output data for the successive filter --- packing data

Experimental Results Goal –To show that compiler-generated code is efficient Configurations (# data sites --- # computing sites --- user machine) –1-1-1 –2-2-1 –4-4-1 –width of a pipeline (figure: data → compute → user topologies)

Experimental Results Versions –Default version The site hosting the data only reads and transmits data, no processing at all The user's desktop only views the results, no processing at all All the work is done by the computing nodes (computing nodes' workload is heavy; communication volume is high) –Compiler-generated version Intelligent decomposition is done by the compiler More computations are performed on the end nodes to reduce the communication volume (workload is balanced between nodes; communication volume is reduced) –Manual version Hand-written DataCutter filters with a similar decomposition as the compiler-generated version

Experimental Results: ISO-Surface Rendering (chart: speedup vs. width of pipeline, small dataset 150M and large dataset 600M; improvement over the default version)

Experimental Results: KNN (chart: speedup vs. width of pipeline for two values of K on a 108M dataset; >150% improvement over the default version)

Experimental Results: Virtual Microscope (chart: speedup vs. width of pipeline, small query 800M, 512*512 and large query 800M, 2048*2048; ≈40% improvement over the default version)

Experimental Results Summary –The compiler-decomposed versions achieve an improvement between 10% and 150% over default versions –In most cases, increasing the width of the pipeline results in near-linear speedup –Compared with the manual version, the compiler-decomposed versions are generally quite close

Related Work No previous work on language & compiler support for CGPP StreamIt (MIT) –Targets streaming applications –A language for communication-exposed architectures –A compiler that performs stream-specific optimizations –Lower-level language interface, and targets a different architecture Ziegler et al (USC/ISI) –Target pipelined FPGA architectures –Consider different granularities of communication between FPGAs

Related Work Run-time support for CGPP –Stampede (Georgia Tech): multimedia applications; support is in the form of cluster-wide threads and shared objects –Yang et al (Penn State): scheduler for vision applications, executed in a pipelined fashion within a cluster –Remos (CMU): resource monitoring system for network-aware applications to get information about the execution environment –Active Stream (Georgia Tech): a middleware approach for distributed applications

Future Work & Conclusion Future Work –Buffer size optimization –Cost model refinement & implementation –More applications –More realistic environment settings: resources dynamically available --- compiler-directed adaptation

Future Work & Conclusion Conclusion –Coarse-Grained Pipelined Parallelism is desirable & feasible –Coarse-Grained Pipelined Parallelism needs language & compiler support –An algorithm for required communication analysis is given –A dynamic programming algorithm for filter decomposition is developed –A cost model is designed –Results of a detailed evaluation of our compiler are encouraging

Thank you !!!

Cost Model –A sequence of m computing units C1,…,Cm with computing powers P(C1),…,P(Cm) –A sequence of m-1 network links L1,…,Lm-1 with bandwidths B(L1),…,B(Lm-1) –A sequence of n candidate filter boundaries b1,…,bn Example (C1 → L1 → C2 → L2 → C3): if L2 is the bottleneck stage, then for N packets T = T(C1) + T(L1) + T(C2) + N*T(L2) + T(C3)
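Read as a steady-state estimate, the formula above charges every non-bottleneck stage once (pipeline fill and drain) and the bottleneck stage once per packet. A minimal sketch under that assumption (hypothetical helper, not the paper's cost model implementation):

```java
public class PipelineCost {
    // stageTimes alternates computing units and links, e.g.
    // {T(C1), T(L1), T(C2), T(L2), T(C3)}; n is the number of packets.
    // The slowest stage is paid n times, every other stage once.
    static double pipelineTime(double[] stageTimes, int n) {
        int bottleneck = 0;
        for (int i = 1; i < stageTimes.length; i++)
            if (stageTimes[i] > stageTimes[bottleneck]) bottleneck = i;
        double t = 0;
        for (int i = 0; i < stageTimes.length; i++)
            t += (i == bottleneck ? n : 1) * stageTimes[i];
        return t;
    }

    public static void main(String[] args) {
        // L2 (time 5) is the bottleneck; 10 packets:
        // T = 1 + 2 + 3 + 10*5 + 1 = 57
        System.out.println(pipelineTime(new double[]{1, 2, 3, 5, 1}, 10)); // 57.0
    }
}
```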

Identify the Candidate Filter Boundaries Three types of candidate boundaries –Start & end of a foreach loop –Conditional statement:
  If ( point[p].inRange(high, low) ) {
    local_KNN(point[p]);
  }
–Start & end of a function call within a foreach loop Any non-foreach loop must be completely inside a single filter

Coarse-Grained Pipelined Parallelism is Desirable & Feasible A new class of data-intensive applications –scientific data analysis –data mining –data visualization –image analysis –and more … Two direct ways to implement such applications –Downloading all the data to the user's machine –Computing at the data repository

Compute Required Communication Example: ReqComm(b0) = { } The block between b0 and b1 has Cons = {X, Y}, Gens = {Z}, so ReqComm(b1) = {X, Y} The block between b1 and b2 has Cons = {A}, Gens = {X, Y}, so ReqComm(b2) = {A} In general, ReqComm(b2) = ReqComm(b1) – Gens(B) + Cons(B)

Compute Required Communication Example block B:
  Z = A + 48
  If Z > 0
    Y = Z * A
  X = Z + A
Cons(B) = the set of variables that are used in B but not defined in B Gens(B) = the set of variables that are defined in B and still alive at the end of B Statement-level sets:
  Z = A + 48: Cons(s) = {A}, Gens(s) = {Z}
  If Z > 0: Cons(s) = {Z}
  Y = Z * A: Cons(s) = {Z, A}
  X = Z + A: Cons(s) = {Z, A}, Gens(s) = {X}
For the whole block: Cons(B) = {A}, Gens(B) = {X, Z}
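Cons(B) can be computed in one forward pass over the block: any use of a variable not yet defined inside B must come from outside. A sketch under a hypothetical statement representation (Gens(B) additionally needs liveness information, which is why Y, dead at the end of B, is excluded from Gens(B) = {X, Z}; the sketch computes only Cons):

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ConsGens {
    // One statement: the variable it defines (null for a pure test)
    // and the variables it uses. Hypothetical IR for illustration only.
    record Stmt(String def, List<String> uses) { }

    // Cons(B): variables used in B before being defined in B.
    static Set<String> cons(List<Stmt> block) {
        Set<String> defined = new HashSet<>();
        Set<String> cons = new LinkedHashSet<>();
        for (Stmt s : block) {
            for (String u : s.uses())
                if (!defined.contains(u)) cons.add(u);
            if (s.def() != null) defined.add(s.def());
        }
        return cons;
    }

    public static void main(String[] args) {
        List<Stmt> b = List.of(
            new Stmt("Z", List.of("A")),       // Z = A + 48
            new Stmt(null, List.of("Z")),      // if Z > 0
            new Stmt("Y", List.of("Z", "A")),  // Y = Z * A
            new Stmt("X", List.of("Z", "A"))); // X = Z + A
        System.out.println(cons(b));           // prints [A]
    }
}
```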

Code Generation Two ways to organize data in a buffer –Instance-wise: X Y Z X Y Z ... –Field-wise: X X ... Y Y ... Z Z ... Class C { int x; float y; int z; }

Code Generation Ways that fields of an object are used
–In the same loop (instance-wise layout):
  for (int i = 0; i < count; i++) {
    … = InputData[i].x + …;
    … = … + InputData[i].y;
  }
–In different loops (field-wise layout):
  for (int i = 0; i < count; i++) { … = InputData[i].x + …; }
  for (int i = 0; i < count; i++) { … = … + InputData[i].y; }
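The two layouts can be sketched with plain arrays (hypothetical helper names; the compiler-generated packing code also handles mixed field types and stream framing — here both fields are widened to float just to keep one buffer type):

```java
import java.util.Arrays;

public class BufferLayouts {
    // Pack count instances of {int x; float y;} — instance-wise: x0 y0 x1 y1 …
    static float[] packInstanceWise(int[] x, float[] y) {
        float[] buf = new float[2 * x.length];
        for (int i = 0; i < x.length; i++) {
            buf[2 * i] = x[i];
            buf[2 * i + 1] = y[i];
        }
        return buf;
    }

    // Field-wise: x0 x1 … y0 y1 … (better when fields are read in separate loops)
    static float[] packFieldWise(int[] x, float[] y) {
        float[] buf = new float[2 * x.length];
        for (int i = 0; i < x.length; i++) buf[i] = x[i];
        for (int i = 0; i < x.length; i++) buf[x.length + i] = y[i];
        return buf;
    }

    public static void main(String[] args) {
        int[] x = {1, 2};
        float[] y = {0.5f, 1.5f};
        System.out.println(Arrays.toString(packInstanceWise(x, y))); // [1.0, 0.5, 2.0, 1.5]
        System.out.println(Arrays.toString(packFieldWise(x, y)));    // [1.0, 2.0, 0.5, 1.5]
    }
}
```

Field-wise packing lets a filter that touches only one field stream through a contiguous region of the buffer, which is the point of choosing the layout from how the loops use the fields.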

Cost Model Pipeline C1 → L1 → C2 → L2 → C3, processing N packets: If L2 is the bottleneck stage, T = T(C1) + T(L1) + T(C2) + N*T(L2) + T(C3) If C2 is the bottleneck stage, T = T(C1) + T(L1) + N*T(C2) + T(L2) + T(C3)

Experimental Results: ISO-Surface Rendering (Active Pixel Based) (chart: speedup vs. width of pipeline, small dataset 150M and large dataset 600M; speedup close to linear, >15% improvement over the default version)