MPJava: High-Performance Message Passing in Java using java.nio. Bill Pugh and Jaime Spacco, University of Maryland, College Park.


Message Passing Interface (MPI) ● MPI standardized how we pass data on a cluster ● MPI: – Single Program Multiple Data (SPMD) – Provides point-to-point as well as collective communications – Is a set of library routines – Is an interface with several free and commercial implementations available – source code is portable – Has C, Fortran and C++ bindings, but not Java

Previous Java + MPI work: ● mpiJava (Carpenter) – Native wrappers around C libraries – Worse performance than native MPI ● jmpi – Pure-Java implementation of a proposed standard for Java/MPI bindings – Performance also worse than native MPI

MPJava ● Pure-Java Message Passing framework ● Makes extensive use of java.nio – select() mechanism – direct buffers – efficient conversions between primitive types ● Provides point-to-point and collective communications similar to MPI ● We experiment with different broadcast algorithms ● We try to use threads – More work needed to get this right ● Performance is pretty good
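A minimal sketch (not the MPJava code itself) of the java.nio pieces listed above: a direct buffer, a DoubleBuffer view for bulk conversion of primitives, and a Selector driving a non-blocking channel. The peer host and port are placeholders.

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.DoubleBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public class NioSend {
    public static void main(String[] args) throws Exception {
        // Direct buffer: allocated outside the Java heap, so the OS can move
        // bytes in and out of it without an extra copy.
        ByteBuffer buf = ByteBuffer.allocateDirect(1024 * 8);   // 1024 doubles

        // View buffer: bulk conversion between double[] and bytes,
        // avoiding per-element writeDouble() calls.
        double[] data = new double[1024];
        DoubleBuffer view = buf.asDoubleBuffer();
        view.put(data);

        // Connect (blocking), then switch to non-blocking mode for the send.
        SocketChannel ch = SocketChannel.open(new InetSocketAddress("peer-host", 9000));
        ch.configureBlocking(false);

        // select() lets one thread drive many channels at once.
        Selector selector = Selector.open();
        ch.register(selector, SelectionKey.OP_WRITE);

        while (buf.hasRemaining()) {
            selector.select();                       // block until the socket can take data
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isWritable()) {
                    ((SocketChannel) key.channel()).write(buf);
                }
            }
            selector.selectedKeys().clear();
        }
        ch.close();
        selector.close();
    }
}
```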

Benchmarks ● PingPong ● Alltoall ● NAS Parallel Benchmarks Conjugate Gradient
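For context, the PingPong benchmark times round trips of a message between two nodes. A rough sketch of the measurement loop (illustrative only, not the authors' harness; it assumes the peer echoes every message back unchanged):

```java
import java.io.EOFException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class PingPong {
    // Sends the buffer and waits for an echo of the same size, reps times.
    // Returns the average round-trip time in nanoseconds.
    static double pingPong(SocketChannel ch, ByteBuffer buf, int reps) throws Exception {
        long start = System.nanoTime();
        for (int r = 0; r < reps; r++) {
            buf.rewind();
            while (buf.hasRemaining()) ch.write(buf);   // ping
            buf.rewind();
            while (buf.hasRemaining()) {                // pong (echoed back)
                if (ch.read(buf) < 0) throw new EOFException("peer closed connection");
            }
        }
        return (System.nanoTime() - start) / (double) reps;
    }
}
```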

Results ● MHz PIII machines, 768 MB memory ● RedHat 7.3 ● Two 100 Mbps channel-bonded NICs ● Fortran compiler: g77 v2.96 – tried a commercial compiler (pgf90) but saw no difference for these benchmarks ● LAM-MPI ● JDK b04

[Diagram slides: all-to-all broadcast among four nodes. Initially node 0 holds a, node 1 holds b, node 2 holds c, node 3 holds d; after the broadcast every node holds a, b, c, d. A second sequence shows the exchange done in stages: nodes first pair up so that nodes 0 and 1 hold ab and nodes 2 and 3 hold cd, then the pairs exchange so that all four nodes hold abcd.]
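One way to implement the staged exchange in the second diagram is recursive doubling. The sketch below assumes a power-of-two node count and a hypothetical sendRecv function (passed in as a parameter) that swaps one array with a partner node; it illustrates the idea, not MPJava's implementation.

```java
import java.util.function.BiFunction;

public class AllGather {
    /**
     * All-to-all broadcast (allgather) by recursive doubling: in each stage a
     * node exchanges everything it has collected so far with the node whose
     * rank differs in that stage's bit. After log2(P) stages every node holds
     * all P pieces. Assumes numProcs is a power of two.
     * sendRecv.apply(partner, piece) stands in for a point-to-point exchange:
     * it sends `piece` to `partner` and returns the piece received from it.
     */
    static double[][] allGather(int rank, int numProcs, double[] myPiece,
                                BiFunction<Integer, double[], double[]> sendRecv) {
        double[][] pieces = new double[numProcs][];
        pieces[rank] = myPiece;
        for (int groupSize = 1; groupSize < numProcs; groupSize *= 2) {
            int partner = rank ^ groupSize;                    // flip this stage's bit
            int myBase = (rank / groupSize) * groupSize;       // block I already own
            int partnerBase = (partner / groupSize) * groupSize;
            // Swap my groupSize pieces for the partner's groupSize pieces.
            for (int i = 0; i < groupSize; i++) {
                pieces[partnerBase + i] = sendRecv.apply(partner, pieces[myBase + i]);
            }
        }
        return pieces;
    }
}
```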

NAS PB Conjugate Gradient ● Class C: sparse matrix is 150,000 x 150,000 with 36,121,000 total nonzero elements (about 240 nonzero elements per row) ● Class B: sparse matrix is 75,000 x 75,000 with 13,708,000 total nonzero elements (about 183 nonzero elements per row)
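The heart of the CG benchmark is a sparse matrix-vector product. A standard compressed-sparse-row multiply of the kind involved (illustrative, not the NAS reference code):

```java
public class SparseMatVec {
    /**
     * y = A * x, with A stored in compressed sparse row (CSR) form:
     * rowStart[i]..rowStart[i+1]-1 index the nonzeros of row i,
     * colIdx[k] is the column of the k-th nonzero and val[k] its value.
     */
    static void multiply(int[] rowStart, int[] colIdx, double[] val,
                         double[] x, double[] y) {
        int n = y.length;
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = rowStart[i]; k < rowStart[i + 1]; k++) {
                sum += val[k] * x[colIdx[k]];
            }
            y[i] = sum;
        }
    }
}
```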

Simple approach to parallelizing the matrix-vector multiply A · p = q: stripe across rows. Requires an all-to-all broadcast to reconstruct the vector p. [Diagram: the 4x4 matrix with rows (a b c d), (e f g h), (i j k l), (m n o p) times the vector (w, x, y, z) gives (aw + bx + cy + dz, ew + fx + gy + hz, iw + jx + ky + lz, mw + nx + oy + pz); with row striping, node 0 owns row (a b c d), node 1 owns (e f g h), node 2 owns (i j k l), node 3 owns (m n o p).]
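A sketch of the local work under row striping (dense and illustrative; the benchmark's matrix is sparse): each node multiplies only its own rows, and the resulting slices are then concatenated on every node by the all-to-all broadcast, e.g. the allGather sketch shown earlier.

```java
public class StripedMatVec {
    /**
     * Local part of q = A * p when A is striped by rows: this node owns
     * rows lo..hi-1 of A. The per-node slices of q are afterwards combined
     * on every node by an all-to-all broadcast so the next iteration can run.
     */
    static double[] localMatVec(double[][] a, double[] p, int lo, int hi) {
        double[] qLocal = new double[hi - lo];
        for (int i = lo; i < hi; i++) {
            double sum = 0.0;
            for (int j = 0; j < p.length; j++) {
                sum += a[i][j] * p[j];
            }
            qLocal[i - lo] = sum;
        }
        return qLocal;
    }
}
```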

[Diagram: multi-dimensional matrix-vector multiply decomposition, shown on the same 4x4 example.]

[Diagram: multi-dimensional matrix-vector multiply decomposition: reduction along the decomposed rows.]

[Diagram: multi-dimensional matrix-vector multiply decomposition: exchanging vector pieces. Node 4 needs w and has y, z; node 3 needs z and has w, x: they swap. Node 2 needs y and has w, x; node 5 needs x and has y, z: they swap.]

Conclusion ● A pure-Java message passing framework can provide performance competitive with widely available Fortran and MPI implementations ● java.nio is much faster than the older I/O model ● Java Just-In-Time compilers can deliver competitive performance ● Java has many other useful features – type safety – bounds checks – extensive libraries – portable (is performance portable too?) – easy to integrate with databases, web servers, GRID applications

Future Work ● Exploiting asynchronous pipes (sketched below) – Great for work-stealing and work-sharing algorithms, but… – subject to thread scheduling woes ● What about clusters of SMPs? – Different bottlenecks – More use for multiple threads on a single node – Importance of interleaving communication and computation ● Is MPI the right target? – BLAS, LAPACK, NetSolve, etc. suggest that programmers will use libraries
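For reference, the asynchronous pipes mentioned above are java.nio Pipe channels. A minimal sketch of one thread handing task ids to a worker through a pipe (illustrative only, not an MPJava feature):

```java
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;

public class PipeSketch {
    public static void main(String[] args) throws Exception {
        Pipe pipe = Pipe.open();

        // Worker thread: blocks on the pipe's source end until work arrives.
        Thread worker = new Thread(() -> {
            try {
                ByteBuffer in = ByteBuffer.allocate(8);
                while (true) {
                    in.clear();
                    while (in.hasRemaining()) {
                        if (pipe.source().read(in) < 0) return;  // sink closed: no more work
                    }
                    in.flip();
                    System.out.println("worker got task " + in.getLong());
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        worker.start();

        // Producer: pushes task ids into the pipe's sink end.
        ByteBuffer out = ByteBuffer.allocate(8);
        for (long task = 0; task < 3; task++) {
            out.clear();
            out.putLong(task);
            out.flip();
            while (out.hasRemaining()) pipe.sink().write(out);
        }
        pipe.sink().close();   // end-of-stream for the worker
        worker.join();
    }
}
```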

Where do we go next? ● Java has the reputation that it’s too slow for scientific programming! – Is this still accurate? – Or were we lucky with our benchmarks? ● Interest in message passing for Java was high a couple of years ago, but has waned – Because of performance? ● Does anyone care? – Are people happy with Fortran? – Is there enough interest in Java for scientific computing?

Java may be fast enough but... ● No operator overloading ● No multiarrays package (yet) – Also need syntactic sugar to replace .get()/.set() methods with brackets! ● Autoboxing ● Generics (finally available in 1.5) ● Fast, efficient support for a Complex datatype (see the sketch below) – Stack-allocatable objects in general? ● C# provides all/most of these features
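To make the operator-overloading and Complex points concrete, a hand-written sketch (not a proposed library) of what complex arithmetic looks like in Java today:

```java
/**
 * Immutable complex number: every arithmetic step allocates a new object on
 * the heap and must be written as a method call, since Java has no operator
 * overloading and no stack-allocatable value objects.
 */
final class Complex {
    final double re, im;

    Complex(double re, double im) { this.re = re; this.im = im; }

    Complex plus(Complex o)  { return new Complex(re + o.re, im + o.im); }
    Complex times(Complex o) {
        return new Complex(re * o.re - im * o.im, re * o.im + im * o.re);
    }
}

// With operator overloading:   c = a * b + a;
// Without it:                  Complex c = a.times(b).plus(a);
```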


[Diagram: multi-dimensional matrix-vector multiply decomposition. The NAS PB implementation uses a better algorithm; note the additional swap required for the "inner" nodes.]