Compiler Supported Coarse-Grained Pipelined Parallelism: Why and How

Presentation transcript:

Compiler Supported Coarse-Grained Pipelined Parallelism: Why and How
Gagan Agrawal, Wei Du, Tahsin Kurc, Umit Catalyurek, Joel Saltz
The Ohio State University

Overall Context
NGS grant titled "An Integrated Middleware and Language/Compiler Framework for Data-Intensive Applications", funded September 2002 to August 2005.
Project components:
- Runtime optimizations in the DataCutter system
- Compiler optimization of DataCutter filters
- Automatic generation of DataCutter filters
The compiler-related components are the focus of this talk.

General Motivation
Language and compiler support for many forms of parallelism has been explored:
- Shared-memory parallelism
- Instruction-level parallelism
- Distributed-memory parallelism
- Multithreaded execution
Application and technology trends are making another form of parallelism desirable and feasible: coarse-grained pipelined parallelism.

Coarse-Grained Pipelined Parallelism (CGPP)
Definition: the computations associated with an application are carried out in several stages, which are executed on a pipeline of computing units.
Example, K-nearest neighbors: given a 3-D range R = <(x1, y1, z1), (x2, y2, z2)> and a point (a, b, c), find the K nearest neighbors of that point within R. The computation splits naturally into two stages: a range query followed by finding the K nearest neighbors.
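As a rough illustration of how these two stages could look as ordinary code, here is a minimal plain-Java sketch; the class and method names (Point3D, KnnPipeline, rangeQuery, nearestK) are illustrative and are not taken from the paper.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    class Point3D {
        final double x, y, z;
        Point3D(double x, double y, double z) { this.x = x; this.y = y; this.z = z; }
        double distanceTo(Point3D o) {
            double dx = x - o.x, dy = y - o.y, dz = z - o.z;
            return Math.sqrt(dx * dx + dy * dy + dz * dz);
        }
    }

    class KnnPipeline {
        // Stage 1 (near the data repository): forward only the points inside range R.
        static List<Point3D> rangeQuery(List<Point3D> data, Point3D lo, Point3D hi) {
            List<Point3D> inRange = new ArrayList<>();
            for (Point3D p : data) {
                if (p.x >= lo.x && p.x <= hi.x
                        && p.y >= lo.y && p.y <= hi.y
                        && p.z >= lo.z && p.z <= hi.z) {
                    inRange.add(p);
                }
            }
            return inRange;
        }

        // Stage 2 (downstream): keep the K points closest to the query point q.
        static List<Point3D> nearestK(List<Point3D> candidates, Point3D q, int k) {
            candidates.sort(Comparator.comparingDouble(p -> p.distanceTo(q)));
            return candidates.subList(0, Math.min(k, candidates.size()));
        }
    }

Communication between the two stages then consists only of the points that pass the range test, which is the kind of volume reduction a pipelined decomposition aims to exploit.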

Coarse-Grained Pipelined Parallelism is Desirable & Feasible
Application scenarios: [Figure: multiple data repositories connected to users over the Internet]

Coarse-Grained Pipelined Parallelism is Desirable & Feasible
A new class of data-intensive applications: scientific data analysis, data mining, data visualization, image analysis.
Two direct ways to implement such applications:
- Downloading all the data to the user's machine: often not feasible
- Computing at the data repository: usually too slow

Coarse-Grained Pipelined Parallelism is Desirable & Feasible
Our belief: a coarse-grained pipelined execution model is a good match. [Figure: a pipeline of processing stages between the data repository and the user, connected over the Internet]

Coarse-Grained Pipelined Parallelism Needs Compiler Support
- Computation needs to be decomposed into stages.
- Decomposition decisions depend on the execution environment: how many computing sites are available, how many computing cycles are available on each site, which communication links are available, and what the bandwidth of each link is.
- Code for each stage follows the same processing pattern, so it can be generated by the compiler.
- Shared- or distributed-memory parallelism needs to be exploited.
Therefore, high-level language and compiler support are necessary.

Outline
- Coarse-grained pipelined parallelism is desirable & feasible
- Coarse-grained pipelined parallelism needs high-level language & compiler support
- An overall picture of the system
- DataCutter runtime system & language dialect
- Overview of the challenges for the compiler
- Compiler techniques
- Experimental results
- Related work
- Future work & conclusions

An Overall Picture
[Figure: a program written in the Java dialect is handled by the compiler support, which performs decomposition and code generation and produces filters that run on the DataCutter runtime system.]

DataCutter Runtime System
- Ongoing project at OSU / Maryland (Kurc, Catalyurek, Beynon, Saltz et al.)
- Targets a distributed, heterogeneous environment
- Allows decomposition of application-specific data processing operations into a set of interacting processes
- Provides a specific low-level interface: filters connected by streams, with explicit stream layout & placement
[Figure: filter1, filter2, and filter3 connected by streams]

Language Dialect
Goal: extensions of Java that give the compiler information about independent collections of objects, parallel loops, reduction operations, and pipelined parallelism.
Extensions of Java:
- Pipelined_loop
- Domain & Rectdomain
- Foreach loop
- Reduction variables

ISO-Surface Extraction Example Code
General form of a Pipelined_loop, with stages 0 through n-1 whose results are merged:

    Pipelined_loop (b in PacketRange) {
        0. foreach (...) { ... }
        1. foreach (...) { ... }
        ...
        n-1. S;
    }   // Merge

The sequential loop being parallelized:

    for (int i = min; i < max-1; i++) {
        // operate on InputData[i]
    }

The same computation written in the dialect:

    public class isosurface {
        public static void main(String arg[]) {
            float iso_value;
            RectDomain<1> CubeRange = [min:max];
            CUBE[1d] InputData = new CUBE[CubeRange];
            Point<1> p, b;
            RectDomain<1> PacketRange = [1:runtime_def_num_packets];   // e.g., [1:4]
            RectDomain<1> EachRange = [1:(max-min)/runtime_def_num_packets];
            Pipelined_loop (b in PacketRange) {
                Foreach (p in EachRange) {
                    InputData[p].ISO_SurfaceTriangles(iso_value, ...);
                }
                ...
            }
        }
    }

Overview of the Challenges for the Compiler
Filter decomposition:
- Identify the candidate filter boundaries
- Compute the communication volume between two consecutive filters
- Cost model: compute a mapping from computations in a loop to computing units in a pipeline
Filter code generation

Identify the Candidate Filter Boundaries
Three types of candidate boundaries:
- Start & end of a foreach loop
- A conditional statement, for example:
      if (point[p].inRange(high, low)) { local_KNN(point[p]); }
- Start & end of a function call within a foreach loop
Any non-foreach loop must be completely inside a single filter.

Compute Required Communication
For a code block B, with boundary b2 just above it and boundary b1 just below it:
- ReqComm(b) = the set of values that need to be communicated through boundary b
- Cons(B) = the set of variables that are used in B but not defined in B
- Gens(B) = the set of variables that are defined in B and still live at the end of B
The analysis propagates backwards across B:
    ReqComm(b2) = ReqComm(b1) - Gens(B) + Cons(B)
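The equation can be read as one step of a backward set computation. Below is a minimal sketch of that step, assuming variables are represented as plain string sets; it is an illustration of the equation, not the compiler's actual implementation.

    import java.util.HashSet;
    import java.util.Set;

    class RequiredCommAnalysis {
        // One step of the backward propagation across a block B:
        //   ReqComm(b2) = (ReqComm(b1) - Gens(B)) + Cons(B)
        // where b2 is the boundary just above B and b1 the boundary just below it.
        static Set<String> propagateAcross(Set<String> reqCommBelow,
                                           Set<String> gens,
                                           Set<String> cons) {
            Set<String> reqCommAbove = new HashSet<>(reqCommBelow);
            reqCommAbove.removeAll(gens); // values B itself produces need not cross b2
            reqCommAbove.addAll(cons);    // values B uses but does not define must cross b2
            return reqCommAbove;
        }
    }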

Cost Model
- A sequence of m computing units C1, ..., Cm with computing powers P(C1), ..., P(Cm)
- A sequence of m-1 network links L1, ..., Lm-1 with bandwidths B(L1), ..., B(Lm-1)
- A sequence of n candidate filter boundaries b1, ..., bn

Cost Model (continued)
[Figure: pipeline timing diagram showing packets flowing through C1, L1, C2, L2, C3 over time]
With N data packets flowing through the pipeline:
- If link L2 is the bottleneck stage: T = T(C1) + T(L1) + T(C2) + N*T(L2) + T(C3)
- If computing unit C2 is the bottleneck stage: T = T(C1) + T(L1) + N*T(C2) + T(L2) + T(C3)
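In general, every stage contributes its per-packet time once (pipeline fill and drain), and the bottleneck stage contributes it N times in total. The sketch below assumes the per-packet stage times are already known; it illustrates the formula above and is not the paper's cost model code.

    class PipelineCostModel {
        // stageTimes holds the per-packet times T(C1), T(L1), T(C2), ..., in pipeline order.
        static double predictedTime(double[] stageTimes, int numPackets) {
            double sumOfStages = 0.0;
            double bottleneck = 0.0;
            for (double t : stageTimes) {
                sumOfStages += t;
                bottleneck = Math.max(bottleneck, t);
            }
            // Every stage contributes once; the bottleneck stage contributes
            // an extra (N - 1) times while the pipeline is full.
            return sumOfStages + (numPackets - 1) * bottleneck;
        }
    }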

Filter Decomposition
Goal: find a mapping Li -> bj that minimizes the predicted execution time, where 1 <= i <= m-1 and 1 <= j <= n. Intuitively, mapping Li to bj means the candidate filter boundary bj is inserted between computing units Ci and Ci+1.
[Figure: code segments f1, ..., fn+1 separated by candidate boundaries b1, ..., bn, to be mapped onto computing units C1, ..., Cm connected by links L1, ..., Lm-1]
Exhaustive search over all such mappings grows combinatorially with n and m, which motivates the greedy algorithm on the next slide.

Filter Decomposition: A Greedy Algorithm
To minimize the predicted execution time, the links are assigned to boundaries one at a time. In the example, link L1 is tentatively mapped to each candidate boundary in turn, giving estimated costs: L1 to b1: T1, L1 to b2: T2, L1 to b3: T3, L1 to b4: T4. The minimum, Min{T1, ..., T4} = T2, selects the mapping L1 -> b2, and the algorithm continues with the next link and the boundaries that follow b2.
[Figure: code segments f1, ..., f5, candidate boundaries b1, ..., b4, and computing units C1, ..., C4 connected by links L1, L2, L3]
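A compact sketch of this greedy strategy, parameterized by an abstract cost estimator; the interface and names below are placeholders rather than the compiler's actual data structures, and the per-step cost is assumed to come from the cost model above.

    import java.util.ArrayList;
    import java.util.List;

    class GreedyFilterDecomposition {
        // Placeholder for the cost model: predicted execution time if 'link' is
        // placed at candidate boundary 'boundary', given the boundaries already
        // chosen for the earlier links.
        interface CostEstimator {
            double estimate(List<Integer> chosenBoundaries, int link, int boundary);
        }

        // Assign each of the m-1 links to one candidate boundary, in order,
        // each time picking the boundary that minimizes the predicted time.
        // Assumes there are at least as many candidate boundaries as links.
        static List<Integer> decompose(int numLinks, int numBoundaries, CostEstimator cost) {
            List<Integer> mapping = new ArrayList<>();
            int firstAvailable = 0; // boundaries must be used in program order
            for (int link = 0; link < numLinks; link++) {
                int best = firstAvailable;
                double bestTime = Double.MAX_VALUE;
                for (int b = firstAvailable; b < numBoundaries; b++) {
                    double t = cost.estimate(mapping, link, b);
                    if (t < bestTime) { bestTime = t; best = b; }
                }
                mapping.add(best);
                firstAvailable = best + 1;
            }
            return mapping;
        }
    }

With n candidate boundaries and m-1 links, this sketch performs on the order of n*m cost evaluations instead of an exhaustive search over all mappings.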

Code Generation
Abstraction of the work each filter does:
- Read in a buffer of data from the input stream
- Iterate over the set of data
- Write out the results to the output stream
Code generation issues:
- How to recover Cons(b) from the input stream (unpacking data)
- How to organize the output data for the next filter in the pipeline (packing data)
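The per-filter pattern can be pictured as the skeleton below, with packing and unpacking left abstract. This is an illustrative Java sketch of the abstraction described above, not the actual DataCutter filter interface or the compiler's generated code.

    // Illustrative skeleton of a generated filter.
    abstract class GeneratedFilterSkeleton {
        abstract byte[] readBuffer();             // read one packet from the input stream
        abstract void writeBuffer(byte[] packet); // write one packet to the output stream
        abstract Object[] unpack(byte[] packet);  // recover Cons(b) values from the packet
        abstract byte[] pack(Object[] results);   // lay out outputs for the next filter
        abstract Object[] processElements(Object[] data); // this filter's stage of the loop

        // The common processing pattern: read, unpack, iterate, pack, write.
        void run(int numPackets) {
            for (int i = 0; i < numPackets; i++) {
                Object[] data = unpack(readBuffer());
                Object[] results = processElements(data);
                writeBuffer(pack(results));
            }
        }
    }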

Experimental Results
Goal: to show that compiler-generated code is efficient.
Environment settings: 700 MHz Pentium machines connected through Myrinet LANai 7.0.
Configurations (# data sites --- # computing sites --- user machine): 1-1-1, 2-2-1, 4-4-1.

Experimental Results: Versions
- Default version: the site hosting the data only reads and transmits data, with no processing at all; the user's desktop only views the results, with no processing at all; all of the work is done by the compute nodes. The compute nodes' workload is heavy and the communication volume is high.
- Compiler-generated version: intelligent decomposition is done by the compiler, and more computation is performed on the end nodes to reduce the communication volume. The workload is balanced across the nodes and the communication volume is reduced.
- Manual version: hand-written DataCutter filters with a decomposition similar to the compiler-generated version.

Experimental Results: ISO-Surface Rendering (Z-Buffer Based)
[Charts: execution time vs. width of pipeline for a small dataset (150 MB) and a large dataset (600 MB)]
About 20% improvement over the default version; speedups of 1.92 and 3.34 on the small dataset and 1.99 and 3.82 on the large dataset as the pipeline width increases.

Experimental Results: ISO-Surface Rendering (Active Pixel Based)
[Charts: execution time vs. width of pipeline for the small (150 MB) and large (600 MB) datasets]
More than 15% improvement over the default version; speedup is close to linear.

Experimental Results: KNN
[Charts: execution time vs. width of pipeline for two experiments]
More than 150% improvement over the default version; speedups of 1.89 and 3.38 in one experiment and 1.87 and 3.82 in the other as the pipeline width increases.

Experimental Results: Virtual Microscope
[Charts: execution time vs. width of pipeline for a small query (800 MB dataset, 512x512) and a large query (800 MB dataset, 2048x2048)]
Roughly 40% improvement over the default version.

Experimental Results: Summary
- The compiler-decomposed versions achieve an improvement between 10% and 150% over the default versions.
- In most cases, increasing the width of the pipeline results in near-linear speedup.
- The compiler-decomposed versions are generally quite close to the manual versions.

Ongoing and Future Work
- Buffer size optimization
- Cost model refinement & implementation
- More applications
- More realistic environment settings: resources that become available dynamically

Conclusions
- Coarse-grained pipelined parallelism is desirable & feasible.
- Coarse-grained pipelined parallelism needs language & compiler support.
- An algorithm for required communication analysis is given.
- A greedy algorithm for filter decomposition is developed.
- A cost model is designed.
- The results of a detailed evaluation of our compiler are encouraging.