MATLAB HPCS Extensions


MATLAB HPCS Extensions
Presented by David Padua, University of Illinois at Urbana-Champaign
PERCS Program Review

Contributors
Gheorghe Almasi - IBM Research
Calin Cascaval - IBM Research
Siddhartha Chatterjee - IBM Research
Basilio Fraguela - University of Illinois
Jose Moreira - IBM Research
David Padua - University of Illinois

Objectives
To develop MATLAB extensions for accessing, prototyping, and implementing scalable parallel algorithms.
As a result, to give programmers of high-end machines access to all the powerful features of MATLAB: array operations and kernels, an interactive interface, and rendering.

Uses of the MATLAB Extension
Interface for users of parallel libraries.
Interface for developers of parallel libraries.
Input to a "conventional" compiler.
Input to a linear algebra compiler: a library generator/tuner for parallel machines. This leverages an NSF-ITR project with K. Pingali (Cornell) and J. DeJong (Illinois).

Design Requirements
A minimal extension: a natural extension to MATLAB that is easy to use.
Extensions for direct control of parallelism and communication, on top of the ability to access parallel library routines; it does not seem possible to encapsulate all the important parallelism in library routines.
Extensions that provide the necessary information and can be analyzed automatically and effectively for compilation and translation.

The Design
No existing MATLAB extension had the characteristics we needed. We designed a data type that we call hierarchically tiled arrays (HTAs). These are arrays whose components may themselves be arrays or other HTAs. Operations on HTAs represent either computation or communication.
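To make the nesting concrete, the following plain MATLAB fragment illustrates the kind of structure an HTA captures. It uses ordinary cell arrays and mat2cell only to show the idea of tiles containing tiles; it is not the HTA toolbox itself:

% Illustration only (ordinary MATLAB, not the HTA toolbox):
% a 2 x 2 outer level of tiles, each of which is itself a 2 x 2
% arrangement of 4 x 4 dense blocks.
T = cell(2,2);
for i = 1:2
  for j = 1:2
    T{i,j} = mat2cell(rand(8), [4 4], [4 4]);   % inner level of tiles
  end
end
innerBlock = T{1,2}{2,1};    % a 4 x 4 dense block at the bottom level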

Approach In our approach, the programmer interacts with a copy of MATLAB running on a workstation. The workstation controls parallel computation on servers.

Approach (Cont.) All conventional MATLAB operations are executed on the workstation. The parallel servers operate on the HTAs. The HTA type is implemented as a MATLAB toolbox. This enables implementation as a language extension and simplifies porting to future versions of MATLAB.
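A brief sketch of why a toolbox suffices as a language extension: MATLAB lets a class overload indexing and arithmetic, so objects of a new type can be used with ordinary MATLAB syntax. The class below is hypothetical and uses the newer classdef syntax, which postdates the original toolbox; it only illustrates the mechanism, not the actual HTA implementation:

% Hypothetical sketch (save as myhta.m); not the actual HTA toolbox code.
classdef myhta
  properties
    tiles   % cell array whose entries are matrices or nested myhta objects
  end
  methods
    function obj = myhta(t)
      obj.tiles = t;
    end
    function c = plus(a, b)
      % Overloads a + b so that tile-wise addition looks like ordinary MATLAB.
      c = myhta(cellfun(@plus, a.tiles, b.tiles, 'UniformOutput', false));
    end
  end
end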

Interpretation and Compilation A first implementation based on the MATLAB interpreter has been developed. This implementation will be used to improve our understanding of the extensions and will serve as a basis for further development and tuning. Interpretation overhead may hinder performance, but parallelism can compensate for the overhead. Future work will include the implementation of a compiler for MATLAB and our extensions based on the effective strategies of L. DeRose and G. Almasi.

From: G. Almasi and D. Padua, "MaJIC: Compiling MATLAB for Speed and Responsiveness," PLDI 2002.

Hierarchically Tiled Arrays
Array tiles are a powerful mechanism to enhance locality in sequential computations and to represent data distribution across parallel systems.
Several levels of tiling are useful to distribute data across parallel machines with a hierarchical organization and to represent data distribution and memory layout simultaneously. For example, a two-level hierarchy of tiles can represent the data distribution on a parallel system and the memory layout within each component.
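As an example of the two-level idea, the HTA toolbox described in the group's later publications builds an HTA by tiling an existing matrix and mapping the tiles onto a processor mesh. The constructor name and argument order below are assumptions drawn from those later papers, not from this talk:

M = rand(16);                      % ordinary 16 x 16 MATLAB matrix
% Partition M into 2 x 2 tiles of size 8 x 8 and map one tile to each
% node of a 2 x 2 processor mesh (constructor form assumed from later papers):
A = hta(M, {1:8:16, 1:8:16}, [2 2]);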

Hierarchically Tiled Arrays (Cont.) Computation and communication are represented as array operations on HTAs. Using array operations for communication and computation raises the level of abstraction and, at the same time, facilitates optimization.

Using HTAs for Locality Enhancement
Tiled matrix multiplication using conventional arrays:

for I=1:q:n
  for J=1:q:n
    for K=1:q:n
      for i=I:I+q-1
        for j=J:J+q-1
          for k=K:K+q-1
            C(i,j) = C(i,j) + A(i,k)*B(k,j);
          end
        end
      end
    end
  end
end
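A minimal setup under which the loop nest above runs as ordinary sequential MATLAB; the sizes are arbitrary, and n must be a multiple of the tile size q:

n = 512;  q = 64;                    % problem size and tile size (arbitrary choices)
A = rand(n);  B = rand(n);  C = zeros(n);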

Using HTAs for Locality Enhancement (Cont.)
Tiled matrix multiplication using HTAs. Here C{i,j}, A{i,k}, and B{k,j} represent submatrices, and the * operator is MATLAB matrix multiplication:

for i=1:m
  for j=1:m
    for k=1:m
      C{i,j} = C{i,j} + A{i,k}*B{k,j};
    end
  end
end
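The same tile-level loop can be mimicked in plain MATLAB by storing the submatrices in a cell array. This is only an illustration of the semantics (sequential, with no distribution), not the HTA toolbox:

n = 8;  q = 4;  m = n/q;                        % small sizes for illustration
A = mat2cell(rand(n),  repmat(q,1,m), repmat(q,1,m));
B = mat2cell(rand(n),  repmat(q,1,m), repmat(q,1,m));
C = mat2cell(zeros(n), repmat(q,1,m), repmat(q,1,m));
for i=1:m
  for j=1:m
    for k=1:m
      C{i,j} = C{i,j} + A{i,k}*B{k,j};          % per-tile matrix multiply
    end
  end
end
% cell2mat(C) now equals cell2mat(A)*cell2mat(B) up to round-off.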

Using HTAs to Represent Data Distribution and Parallelism: Cannon's Algorithm
Initial (skewed) placement of the tiles of A and B, one pair per node of a 4 x 4 mesh:

A{1,1} B{1,1}   A{1,2} B{2,2}   A{1,3} B{3,3}   A{1,4} B{4,4}
A{2,2} B{2,1}   A{2,3} B{3,2}   A{2,4} B{4,3}   A{2,1} B{1,4}
A{3,3} B{3,1}   A{3,4} B{4,2}   A{3,1} B{1,3}   A{3,2} B{2,4}
A{4,4} B{4,1}   A{4,1} B{1,2}   A{4,2} B{2,3}   A{4,3} B{3,4}
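The skewed placement above is the standard initial alignment of Cannon's algorithm: row i of A is shifted circularly left by i-1 tiles and column j of B is shifted circularly up by j-1 tiles, so node (i,j) initially holds A{i, i+j-1} and B{i+j-1, j} with indices wrapping around at n. A sketch of that alignment on plain m x m cell arrays of tiles follows; it illustrates the textbook step only and is not the code the authors elide from the slide below:

% A and B are m x m cell arrays of tiles (as in the earlier sketches).
for i = 2:m
  A(i,:) = A(i, [i:m, 1:i-1]);     % shift row i of A left by i-1 tiles
end
for j = 2:m
  B(:,j) = B([j:m, 1:j-1], j);     % shift column j of B up by j-1 tiles
end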

[Animation frames omitted: after each multiplication step, the tiles of A shift circularly left along each row and the tiles of B shift circularly up along each column.]

Cannon's Algorithm in MATLAB with HPCS Extensions

C{1:n,1:n} = zeros(p,p);             %communication
…
for k=1:n
  C{:,:} = C{:,:} + A{:,:}*B{:,:};   %computation
  A{i,1:n} = A{i,[2:n, 1]};          %communication
  B{1:n,i} = B{[2:n,1],i};           %communication
end

Cannon's Algorithm in C + MPI

for (km = 0; km < m; km++) {
  char *chn = "T";
  dgemm(chn, chn, lclMxSz, lclMxSz, lclMxSz, 1.0, a, lclMxSz, b, lclMxSz, 1.0, c, lclMxSz);
  MPI_Isend(a, lclMxSz * lclMxSz, MPI_DOUBLE, destrow, ROW_SHIFT_TAG, MPI_COMM_WORLD, &requestrow);
  MPI_Isend(b, lclMxSz * lclMxSz, MPI_DOUBLE, destcol, COL_SHIFT_TAG, MPI_COMM_WORLD, &requestcol);
  MPI_Recv(abuf, lclMxSz * lclMxSz, MPI_DOUBLE, MPI_ANY_SOURCE, ROW_SHIFT_TAG, MPI_COMM_WORLD, &status);
  MPI_Recv(bbuf, lclMxSz * lclMxSz, MPI_DOUBLE, MPI_ANY_SOURCE, COL_SHIFT_TAG, MPI_COMM_WORLD, &status);
  MPI_Wait(&requestrow, &status);
  aptr = a;  a = abuf;  abuf = aptr;
  MPI_Wait(&requestcol, &status);
  bptr = b;  b = bbuf;  bbuf = bptr;
}

[Figure: speedups on a four-processor IBM SP-2.]

[Figure: speedups on a nine-processor IBM SP-2.]

Flattening
Elements of an HTA are referenced using a tile index for each level in the hierarchy, followed by an array index. Each tile index tuple is enclosed in braces and the array index is enclosed in parentheses. In the matrix multiplication code, C{i,j}(3,4) refers to element (3,4) of submatrix {i,j}. Alternatively, the tiled array can be accessed as a flat array, as shown in the next slide. This feature is useful when a global view of the array is needed in the algorithm, and also while transforming a sequential code into parallel form.
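For the special case of one level of tiling with fixed-size q x q tiles, the correspondence between the two forms of indexing is easy to state: C{i,j}(a,b) refers to the same element as the flat reference C((i-1)*q+a, (j-1)*q+b). A plain MATLAB check of that correspondence, with a cell array again standing in for the HTA, so this is only an illustration:

M = reshape(1:64, 8, 8);                     % a flat 8 x 8 array
T = mat2cell(M, [4 4], [4 4]);               % the same data viewed as 2 x 2 tiles of 4 x 4
q = 4;  i = 2;  j = 1;  a = 3;  b = 4;
hierarchical = T{i,j}(a,b);                  % tiled view: element (3,4) of tile {2,1}
flat = M((i-1)*q + a, (j-1)*q + b);          % flattened view of the same element
assert(isequal(hierarchical, flat));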

[Figure: two ways of referencing the elements of an 8 x 8 array.]

Status We have completed the implementation of practically all of our initial language extensions (for the IBM SP-2 and Linux clusters). Following the toolbox approach has been a challenge, but we have been able to overcome all obstacles.

Conclusions
We have developed parallel extensions to MATLAB. With these extensions it is possible to write highly readable parallel code for both dense and sparse computations. The HTA objects and operations have been implemented as a MATLAB toolbox, which allowed them to be provided as language extensions.