Compositional Development of Parallel Programs
Nasim Mahmood, Guosheng Deng, and James C. Browne

Overview
- Motivation
- Goals
- Programming Model
- Example
- Case Study
- Conclusions
- Current and Ongoing Work
- Future Directions

Motivation
- Optimization and adaptation of parallel programs is effort-intensive
  - Different execution environments
  - Different problem instances
- Direct modification of a complete application is effort-intensive
- Maintenance and evolution of parallel programs is a complex task

Goals
- Order-of-magnitude productivity enhancement for program families
  - Develop parallel programs from sequential components
  - Reuse components
  - Enable development of program families from multiple versions of components
  - Automatic composition of parallel programs from components

Programming Model
- Component development
  - Domain analysis
  - Component development
  - Encapsulate
- Program instance development
  - Analyze problem instance and execution environment
  - Identify attributes and attribute values
  - Identify data flow graph
  - Specify the program using the components and their interfaces

Component
- Accepts interface (profile, transaction, protocol)
- Sequential computation
- Requests interface (selector, transaction, protocol)
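The accepts/requests structure above can be sketched as plain data. This is an illustrative encoding, not the actual CODE/P-COM representation — the class and field names are mine — populated with the Distribute component from the 2D FFT example that follows.

```python
# Illustrative sketch of a component with an accepts interface
# (profile, transaction, protocol) and requests interfaces
# (selector, transaction, protocol). Names are assumptions.
from dataclasses import dataclass, field

@dataclass
class AcceptsInterface:
    profile: dict          # attribute/value pairs advertising what this component provides
    transaction: str       # signature of the transaction it accepts
    protocol: str = "dataflow"

@dataclass
class RequestsInterface:
    selector: dict         # attribute/value pairs a matching profile must satisfy
    transaction: str       # signature of the transaction it invokes
    protocol: str = "dataflow"

@dataclass
class Component:
    name: str
    accepts: AcceptsInterface
    requests: list = field(default_factory=list)  # a component may request several services

# The Distribute component from the 2D FFT example, encoded in this sketch:
distribute = Component(
    name="Distribute",
    accepts=AcceptsInterface(
        profile={"domain": "matrix", "function": "distribute",
                 "element_type": "complex", "distribute_by_row": True},
        transaction="int distribute(in mat2 grid_re, in mat2 grid_im, in int n, in int m, in int p)"),
    requests=[RequestsInterface(
        selector={"domain": "fft", "input": "matrix", "element_type": "complex",
                  "algorithm": "Cooley-Tukey", "apply_per_row": True},
        transaction="int fft_row(out mat2 out_grid_re[], out mat2 out_grid_im[], out int n, out int m)")])
```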

2D FFT Example
Steps for the 2D FFT computation:
- Partition the given matrix row-wise
- Apply a 1D FFT to each row of the partition
- Combine the partitions and transpose the matrix
- Partition the transposed matrix row-wise
- Apply a 1D FFT to each row of the partition
- Combine the partitions and transpose the matrix
- The transposed matrix is the 2D FFT of the original matrix
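The steps above can be sketched directly. This is a minimal pure-Python illustration (not the composed component program), with a naive O(n²) DFT standing in for a real 1D FFT routine so the example stays self-contained; the decomposition — row transforms, transpose, row transforms, transpose — is the point.

```python
# 2D transform built from 1D row transforms and transposes, mirroring
# the slide's steps. dft_1d is a naive stand-in for a 1D FFT.
import cmath

def dft_1d(row):
    n = len(row)
    return [sum(row[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def fft_2d(matrix):
    step1 = [dft_1d(row) for row in matrix]   # 1D transform of each row
    step2 = transpose(step1)                  # combine and transpose
    step3 = [dft_1d(row) for row in step2]    # 1D transform of each row again
    return transpose(step3)                   # transposed matrix is the 2D FFT
```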

2D FFT Example (Cont’d)

2D FFT Example (Cont’d)

Requests interface of Initialize:

selector:
  string domain == "matrix";
  string function == "distribute";
  string element_type == "complex";
  bool distribute_by_row == true;
transaction:
  int distribute(out mat2 grid_re, out mat2 grid_im,
                 out int n, out int m, out int p);
protocol: dataflow;

Accepts interface of Distribute:

profile:
  string domain = "matrix";
  string function = "distribute";
  string element_type = "complex";
  bool distribute_by_row = true;
transaction:
  int distribute(in mat2 grid_re, in mat2 grid_im,
                 in int n, in int m, in int p);
protocol: dataflow;

Compilation Process
- Matching of
  - Selector and profile
  - Transactions
  - Profiles
- Matching starts from the selector of the start component
- Applied recursively to each matched component
- Output is a generalized data flow graph as defined in CODE (Newton ’92)
- The data flow graph is compiled to a parallel program for a specific architecture
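The recursive matching step can be sketched under simplifying assumptions: profiles and selectors are reduced to attribute/value dictionaries, and transaction and protocol matching is elided. The function names and the first-match policy below are mine, not the actual compiler's.

```python
# Illustrative sketch of selector/profile matching and recursive
# composition into dataflow edges. Not the P-COM compiler itself.
def matches(selector, profile):
    """A selector matches a profile when every attribute/value pair it
    requires appears in the profile with the same value."""
    return all(profile.get(attr) == val for attr, val in selector.items())

def compose(start, library):
    """Starting from the start component's selectors, recursively resolve
    each matched component's requests, returning (requester, provider)
    edges of the resulting dataflow graph."""
    edges, frontier, seen = [], [start], {start["name"]}
    while frontier:
        comp = frontier.pop()
        for selector in comp.get("requests", []):
            for candidate in library:
                if matches(selector, candidate["profile"]):
                    edges.append((comp["name"], candidate["name"]))
                    if candidate["name"] not in seen:
                        seen.add(candidate["name"])
                        frontier.append(candidate)
                    break  # first match wins in this sketch
    return edges

# A tiny library mirroring the start of the 2D FFT example:
library = [
    {"name": "Distribute",
     "profile": {"domain": "matrix", "function": "distribute"},
     "requests": [{"domain": "fft", "apply_per_row": True}]},
    {"name": "FFT_Row",
     "profile": {"domain": "fft", "apply_per_row": True},
     "requests": []},
]
start = {"name": "Initialize", "profile": {},
         "requests": [{"domain": "matrix", "function": "distribute"}]}
```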

2D FFT Example (Cont’d)

Requests interface (partial) of Distribute:

{
  selector:
    string domain == "fft";
    string input == "matrix";
    string element_type == "complex";
    string algorithm == "Cooley-Tukey";
    bool apply_per_row == true;
  transaction:
    int fft_row(out mat2 out_grid_re[], out mat2 out_grid_im[],
                out int n/p, out int m);
  protocol: dataflow;
} index [ p ]

Accepts interface of FFT_Row:

profile:
  string domain = "fft";
  string input = "matrix";
  string element_type = "complex";
  string algorithm = "Cooley-Tukey";
  bool apply_per_row = true;
transaction:
  int fft_row(in mat2 grid_re, in mat2 grid_im, in int n, in int m);
protocol: dataflow;

2D FFT Example (Cont’d)

Requests interface of FFT_Row:

selector:
  string domain == "matrix";
  string function == "gather";
  string element_type == "complex";
  bool combine_by_row == true;
  bool transpose == true;
transaction:
  int gather_transpose(out mat2 out_grid_re, out mat2 out_grid_im,
                       out int me);
protocol: dataflow;

Accepts interface of Gather_Transpose:

profile:
  string domain = "matrix";
  string function = "gather";
  string element_type = "complex";
  bool combine_by_row = true;
  bool transpose = true;
transaction:
  int get_no_of_p(in int n, in int m, in int p, in int state); >
  int gather_transpose(in mat2 grid_re, in mat2 grid_im, in int inst);
protocol: dataflow;

2D FFT Example (Cont’d)

Requests interface (partial) of Gather_Transpose:

selector:
  string domain == "matrix";
  string function == "distribute";
  string element_type == "complex";
  bool distribute_by_row == true;
transaction:
  %{ exec_no == 1 && gathered == p }%
  int distribute(out mat2 out_grid_re, out mat2 out_grid_im,
                 out int m, out int n*p, out int p);
protocol: dataflow;

Fast Multipole in Short
- Computes the Coulomb energy of point charges in linear time
- Transforms the information about a cluster of charges into a simpler representation, which is used to compute the influence of the cluster on objects at large distances, by scaling all particles into a hierarchy of cubes at different levels
- Can be extended and applied to astrophysics, plasma physics, molecular dynamics, fluid dynamics, partial differential equations, and numerical complex analysis

Generalized Fast Multipole Solver – Matrix Version
- Generalized FMM is an extension of the FMM algorithm to multiple “charge types”
- Many generalized N-body problems can be treated as multiple FMM problems that share the same geometry; this can be exploited by combining the generalized charges into a vector
- More efficient FMM translation routines can be built using BLAS routines
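The "charges combined into a vector" idea can be illustrated on the direct-interaction part alone; the same reshaping applies to the translation operators. This is a sketch in pure Python with names of my choosing — in the real solver the product is handed to optimized BLAS routines.

```python
# With m charge types stacked as the columns of an n-by-m matrix Q, the
# potentials for all m types at every particle come from a single
# kernel-matrix product Phi = K Q instead of m separate passes.
def kernel_matrix(points):
    n = len(points)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d = sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
                K[i][j] = 1.0 / d      # Coulomb kernel 1/r, self-term excluded
    return K

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def potentials(points, Q):
    """Potentials for every charge type at once: Phi = K Q (n x m)."""
    return matmul(kernel_matrix(points), Q)
```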

FMM Domain Analysis
Six translation components:
- Particle charge to multipole (finest partitioning level)
- Multipole to multipole (between all partitioning levels, from the finest to the coarsest)
- Multipole to local (all partitioning levels)
- Local to local (between all partitioning levels)
- Local to particle potential and forces (finest partitioning level)
- Direct interaction (finest partitioning level)
Two utility components:
- Distribute: distribute pre-calculated local coefficient matrices according to the interaction list
- Gather: gather local coefficients

Space–Computation Tradeoffs for the Matrix-Structured Formulation of the FMM Algorithm
- Simultaneous computation of cell potentials for multiple charge types
- Use of optimized library routines for vector-matrix and matrix-matrix multiply
- Loop interchange over the two outer loops to improve locality
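The loop-interchange bullet can be illustrated on the matrix-multiply kernel the matrix-structured formulation reduces to. Both orderings below compute the same product; the i-k-j form accesses B and C row-wise, which is where the locality improvement comes from on real hardware. This is a sketch of the transformation, not the solver's actual loops, and pure Python shows only the equivalence, not the speedup.

```python
# Two loop orderings for C = A * B. Interchanging the j and k loops
# changes the memory access pattern but not the result.
def matmul_ijk(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]   # strides down a column of B
    return C

def matmul_ikj(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            a = A[i][p]
            for j in range(m):
                C[i][j] += a * B[p][j]         # row-wise access to B and C
    return C
```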

Flow Graph for Sequential FMM
[Diagram: nodes Initialization, P2M, M2M, M2L, L2L, Distribute, Collect, L2P, Direct, Terminate]

Flow Graph for the Parallel Version
[Diagram: the translation chain (P2M, M2M, M2L, L2L, Distribute, Collect, L2P, Direct) replicated across processes P[0] and P[1], between Initialization and Terminate]

Sequential Running Time – New/Old with respect to the number of charge types

Conclusions
- Effort in domain analysis is not trivial
- Suitable for:
  - Large applications that are to be optimized for several different execution environments
  - Large applications that are expected to evolve over a substantial period of time
  - Large applications with multiple instances
- Competitive program performance

Current and Ongoing Work
- Implement evolutionary performance models of programs through composition of components
  - Abstract components
  - Concrete components
  - Performance model for a specific architecture
- Componentize an hp-adaptive finite element code and a Method of Lines (MOL) code

Future Directions
- Combine with a dynamic-linking runtime system based on associative interfaces [Kane ’02]
- Implement more powerful precedence and sequencing operators for state machine specifications
- Integrate with the Broadway [Guyer/Lin ’99] annotational compiler to overcome the “many components” problem