1 Grid-Aware Numerical Libraries To enable the use of the Grid as a seamless computing environment

2 Early Focus of GrADS
- To create an execution environment that supports reliable performance for numerical libraries on Grid computing platforms.
- Interested in structuring and optimizing an application and its environment for execution on target computing platforms whose nature is not known until just before run time.

3 Milestones from GrADS
- Library interface for numerical libraries
- Infrastructure for defining and validating performance contracts
- Adaptive GrADS runtime interface for scheduling and forecasting systems in the Grid

4 ScaLAPACK
- ScaLAPACK is a portable distributed-memory numerical library
- Complete numerical library for dense matrix computations
- Designed for distributed parallel computing (MPPs & clusters) using MPI
- One of the first math software packages to do this
- Numerical software that will work on a heterogeneous platform
- In use today by IBM, HP-Convex, Fujitsu, NEC, Sun, SGI, Cray, NAG, IMSL, …
  - They tailor performance & provide support

5 ScaLAPACK Demo
- Implement a version of a ScaLAPACK library routine that runs on the Grid.
  - Make use of resources at the user's disposal
  - Provide the best time to solution
  - Proceed without the user's involvement
- Make as few changes as possible to the numerical software.
- Assumption is that the user is already "Grid enabled" and runs a program that contacts the execution environment to determine where the execution should take place.

6 How ScaLAPACK Works
- To use ScaLAPACK a user must:
  - Download the package and auxiliary packages to the machines
  - Write an SPMD program (see the sketch after this slide) which
    - Sets up the logical process grid
    - Places the data on the logical process grid
    - Calls the library routine in an SPMD fashion
    - Collects the solution after the library routine finishes
  - Allocate the processors and decide how many processors the application will run on
  - Start the application, e.g. "mpirun –np N user_app"
    - The number of processors is fixed at run time
  - Upon completion, return the processors to the pool of resources
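To make the SPMD structure above concrete, here is a minimal sketch of the kind of driver a user has to write by hand before any Grid machinery enters the picture. It assumes a C compiler, MPI, and a ScaLAPACK/BLACS installation whose Fortran routines are callable from C with trailing underscores; the 2x2 process grid, block size, and the diagonally dominant test matrix are illustrative choices, not part of the original demo.

```c
/* Sketch of the hand-written SPMD driver ScaLAPACK normally requires.
 * Assumes Fortran ScaLAPACK/BLACS symbols with trailing underscores.
 * Build (paths illustrative): mpicc driver.c -lscalapack -lblas
 * Run:                        mpirun -np 4 ./driver
 */
#include <stdio.h>
#include <stdlib.h>

extern void Cblacs_pinfo(int *mypnum, int *nprocs);
extern void Cblacs_get(int ctxt, int what, int *val);
extern void Cblacs_gridinit(int *ctxt, const char *order, int nprow, int npcol);
extern void Cblacs_gridinfo(int ctxt, int *nprow, int *npcol, int *myrow, int *mycol);
extern void Cblacs_gridexit(int ctxt);
extern void Cblacs_exit(int cont);
extern int  numroc_(const int *n, const int *nb, const int *iproc,
                    const int *isrcproc, const int *nprocs);
extern void descinit_(int *desc, const int *m, const int *n, const int *mb,
                      const int *nb, const int *irsrc, const int *icsrc,
                      const int *ictxt, const int *lld, int *info);
extern void pdgesv_(const int *n, const int *nrhs, double *a, const int *ia,
                    const int *ja, const int *desca, int *ipiv, double *b,
                    const int *ib, const int *jb, const int *descb, int *info);

int main(void) {
  int iam, nprocs, ctxt, nprow = 2, npcol = 2, myrow, mycol;
  int n = 1000, nb = 40, izero = 0, ione = 1, info;

  /* 1. Set up the logical process grid. */
  Cblacs_pinfo(&iam, &nprocs);
  Cblacs_get(-1, 0, &ctxt);
  Cblacs_gridinit(&ctxt, "Row", nprow, npcol);
  Cblacs_gridinfo(ctxt, &nprow, &npcol, &myrow, &mycol);
  if (myrow < 0) { Cblacs_exit(0); return 0; }   /* process not in the grid */

  /* 2. Place the data: allocate the local pieces of the block-cyclic layout. */
  int mloc = numroc_(&n, &nb, &myrow, &izero, &nprow);
  int nloc = numroc_(&n, &nb, &mycol, &izero, &npcol);
  int lld  = mloc > 1 ? mloc : 1;
  double *A = malloc((size_t)mloc * nloc * sizeof *A);
  double *B = malloc((size_t)mloc * sizeof *B);
  int *ipiv = malloc((size_t)(mloc + nb) * sizeof *ipiv);
  int descA[9], descB[9];
  descinit_(descA, &n, &n,    &nb, &nb, &izero, &izero, &ctxt, &lld, &info);
  descinit_(descB, &n, &ione, &nb, &nb, &izero, &izero, &ctxt, &lld, &info);

  /* Fill the local blocks with a simple diagonally dominant test matrix. */
  for (int j = 0; j < nloc; j++) {
    int gj = ((j / nb) * npcol + mycol) * nb + j % nb;   /* global column */
    for (int i = 0; i < mloc; i++) {
      int gi = ((i / nb) * nprow + myrow) * nb + i % nb; /* global row */
      A[i + (size_t)j * lld] = (gi == gj) ? n : 1.0;
    }
  }
  for (int i = 0; i < mloc; i++) B[i] = 1.0;

  /* 3. Call the library routine in SPMD fashion. */
  pdgesv_(&n, &ione, A, &ione, &ione, descA, ipiv, B, &ione, &ione, descB, &info);
  if (iam == 0) printf("pdgesv info = %d\n", info);

  /* 4. The solution is left distributed in B; a real driver would now
   *    collect it.  Release the processors. */
  free(A); free(B); free(ipiv);
  Cblacs_gridexit(ctxt);
  Cblacs_exit(0);
  return 0;
}
```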

7 GrADS Numerical Library
- Want to relieve the user of some of these tasks
- Make decisions on which machines to use based on the user's problem and the state of the system
  - Optimize for the best time to solution
  - Distribute the data to the processors and collect the results
  - Start the SPMD library routine on all the platforms
  - Check to see if the computation is proceeding as planned
    - If not, perhaps migrate the application

8 GrADS Library Sequence: User → Library Routine
- The user makes a sequential call to a numerical library routine.
- The library routine has "crafted code" which invokes the other components.
- Assumption is that the Autopilot Manager has been started and Globus is available.

9 GrADS Library Sequence: … → Resource Selector
- The library routine calls a Grid-based routine to determine which resources could be used.
- The Resource Selector returns a "bag of processors" (coarse grid) that are available.

10 GrADS Library Sequence: … → Performance Model
- The library routine calls the Performance Modeler to determine the best set of processors to use for the given problem.
- This may be done by evaluating a formula or running a simulation.
- It may assign several processes to one processor.
- At this point we have a fine grid.

11 GrADS Library Sequence: … → Contract Development
- The library routine calls the Contract Development routine to commit the fine grid for this call.
- A performance contract is generated for this run.

12 GrADS Library Sequence: … → Application Launcher
- The Application Launcher starts the run: "mpirun –machinefile fine_grid grid_linear_solve"

13 Grid Environment for this Experiment
- U Tennessee: torc0–torc8, dual PIII 550 MHz, 100 Mb switch
- U Tennessee: cypher01–cypher16, dual PIII 500 MHz, 1 Gb switch
- UIUC: amajor–dmajor, PII 266 MHz, 100 Mb switch; opus0, opus13–opus16, PII 450 MHz, Myrinet
- UCSD: Quidam, Mystere, Soleil, PII 400 MHz, 100 Mb switch; Dralion, Nouba, PIII 450 MHz, 100 Mb switch
- ferret.usc.edu: 64-processor SGI
- jupiter.isi.edu: 10-processor SGI
- lunar.uits.indiana.edu
- In total: 4 cliques, 43 boxes

14 Components Used
- Globus version
- Autopilot version 2.3
- NWS version 2.0.pre2
- MPICH-G version
- ScaLAPACK version 1.6
- ATLAS/BLAS version
- BLACS version 1.1
- PAPI version
- GrADS' "crafted code"
- Millions of lines of code

15 Heterogeneous Grid
- TORC: cluster of 8 dual Pentium III nodes; Red Hat Linux SMP; 512 MB memory; 550 MHz; Fast Ethernet (100 Mbit/s, 3Com 3C905B) with a 16-port switch (BayStack 350T)
- CYPHER: cluster of 16 dual Pentium III nodes; Debian Linux SMP; 512 MB memory; 500 MHz; Gigabit Ethernet (SK-9843) with a 24-port switch (Foundry FastIron II)
- OPUS: cluster of 8 Pentium II nodes; Red Hat Linux; 128 or 256 MB memory; 265–448 MHz; IP over Myrinet (LANai 4.3) plus Fast Ethernet (3Com 3C905B) with 16-port switches (M2M-SW16 & Cisco Catalyst 2924 XL)

16 The Grads_lib_linear_solve Routine Performs the Following Operations (see the control-flow sketch after this slide):
1. Gets information on the user's problem.
2. Creates the "coarse grid" of processors and their NWS statistics by calling the resource selector.
3. Refines the "coarse grid" into a "fine grid" by calling the performance modeler.
4. Invokes the contract developer to commit the resources in the "fine grid" for the problem. Steps 2–4 are repeated until a "fine grid" is committed for the problem.
5. Launches the application to execute on the committed "fine grid".
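The steps above form a retry loop around resource selection, performance modeling, and contract development. A minimal sketch of that control flow is below; it is not the GrADS crafted code itself, and every type and helper in it (problem_t, grid_t, resource_selector, performance_model, contract_commit, launch_app) is a hypothetical stand-in for the real components.

```c
/* Control-flow sketch of steps 1-5 on this slide.  All types and helper
 * functions below are hypothetical stand-ins for the GrADS components. */
#include <stdio.h>
#include <stdbool.h>

typedef struct { int n; int nb; } problem_t;                 /* user's problem     */
typedef struct { int nprocs; const char **hosts; } grid_t;   /* a set of machines  */

/* Stub components: a real implementation would query MDS/NWS, run the
 * ScaLAPACK cost model, and negotiate a performance contract. */
static grid_t resource_selector(void) {
    static const char *h[] = { "torc0", "torc1" };
    return (grid_t){ 2, h };
}
static grid_t performance_model(grid_t coarse, problem_t p) { (void)p; return coarse; }
static bool   contract_commit(grid_t fine)                  { (void)fine; return true; }
static void   launch_app(grid_t fine, problem_t p) {
    printf("launching on %d procs, n = %d\n", fine.nprocs, p.n);
}

int grads_lib_linear_solve(problem_t p)            /* step 1: problem info arrives here */
{
    grid_t fine;
    bool committed = false;
    while (!committed) {
        grid_t coarse = resource_selector();       /* step 2: coarse grid + NWS data    */
        fine = performance_model(coarse, p);       /* step 3: refine to a fine grid     */
        committed = contract_commit(fine);         /* step 4: try to commit resources   */
    }                                              /* repeat 2-4 until committed        */
    launch_app(fine, p);                           /* step 5: run on the fine grid      */
    return 0;
}

int main(void) { return grads_lib_linear_solve((problem_t){ .n = 10000, .nb = 40 }); }
```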

17 GrADS Library Sequence: Library Routine
- The library routine has "crafted code" to make things work correctly and together.
- Assumptions: the Autopilot Manager has been started and Globus is available.

18 Resource Selector
- Uses MDS and NWS to build arrays of values
  - 2 matrices (bandwidth, latency) and 2 arrays (CPU, available memory)
  - The matrix information is clique based
- On return from the Resource Selector, the crafted code filters the information to use only machines that have the necessary software and are really eligible to be used.
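As a way to picture the data described here, the sketch below shows one plausible layout for what the resource selector hands back: two pairwise matrices and two per-host arrays. The struct and field names are invented for illustration and are not the GrADS data structures.

```c
/* Hypothetical layout for the resource selector's output: two n-by-n
 * matrices (bandwidth, latency) and two length-n arrays (CPU, free memory).
 * Field names are invented; the real GrADS structures differ. */
typedef struct {
    int     nhosts;        /* machines in the coarse grid                   */
    char  **hostnames;
    int    *clique_id;     /* which clique each host belongs to             */
    double *bandwidth;     /* nhosts*nhosts; dense within a clique, only
                              clique-leader entries between cliques         */
    double *latency;       /* nhosts*nhosts; same sparsity pattern          */
    double *cpu_load;      /* per-host NWS CPU availability                 */
    double *free_memory;   /* per-host available memory                     */
} coarse_grid_t;

/* Convenience accessor for a pairwise measurement. */
static inline double bw(const coarse_grid_t *g, int i, int j)
{
    return g->bandwidth[i * g->nhosts + j];
}
```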

19 Arrays of Values Generated by the Resource Selector
- Clique based
  - UT, UCSD, UIUC
  - Full at the cluster level, plus the connections between cliques (clique leaders)
  - The bandwidth and latency information looks like this: [figure: block-sparse matrix, one dense block per clique plus entries between clique leaders]
  - Linear arrays for CPU and memory

20 After the Resource Selector…
- The matrices of values are filled out to generate complete, dense matrices.
- At this point we have a workable coarse grid.
  - Workable in the sense that we know what is available, the connections, and the power of the machines.

21 ScaLAPACK Performance Model
- Total number of floating-point operations per processor
- Total number of data items communicated per processor
- Total number of messages
- Time per floating-point operation
- Time per data item communicated
- Time per message
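These six quantities combine in the usual way for a distributed-memory cost model. Written out (the notation is mine, not from the slides), the predicted per-processor time is

\[
T_{\text{proc}} \;=\; N_{\text{flop}}\, t_{\text{flop}} \;+\; N_{\text{word}}\, t_{\text{word}} \;+\; N_{\text{msg}}\, t_{\text{msg}},
\]

where \(N_{\text{flop}}, N_{\text{word}}, N_{\text{msg}}\) are the per-processor operation, data-item, and message counts from the first three bullets and \(t_{\text{flop}}, t_{\text{word}}, t_{\text{msg}}\) are the corresponding per-unit times from the last three; the predicted run time is then typically the maximum of \(T_{\text{proc}}\) over the processors.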

22 Performance Model
- The Performance Model uses the information generated by the Resource Selector to decide on the fine grid (a sketch of this loop follows below):
  - Pick the machine that is closest to every other machine in the collection.
  - If there is not enough memory, add machines until the problem can be solved.
  - Run the cost model on this set.
  - Add a machine to the group and rerun the cost model.
  - If the result is "better", repeat the last step; if not, stop.
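A minimal sketch of this greedy selection is below. It is a simplification: the cost model here depends only on the number of machines and the "closest to every other machine" seed is assumed to be machine 0, whereas the real selector uses the full bandwidth/latency and memory data from the Resource Selector. The function names and the toy cost model are illustrative, not GrADS routines.

```c
/* Greedy fine-grid selection as described on this slide (simplified). */
#include <stdio.h>
#include <stddef.h>

typedef double (*cost_fn)(size_t nmachines, size_t problem_bytes);

size_t select_fine_grid(const double *mem_bytes, size_t n,
                        size_t problem_bytes, cost_fn cost)
{
    /* Assume machine 0 is the one "closest to every other machine";
     * a real selector would compute this from the latency matrix. */
    size_t k = 1;
    double have = mem_bytes[0];

    /* Add machines until the aggregate memory can hold the problem. */
    while (have < (double)problem_bytes && k < n)
        have += mem_bytes[k++];

    /* Keep adding machines while the predicted time improves. */
    double best = cost(k, problem_bytes);
    while (k < n) {
        double t = cost(k + 1, problem_bytes);
        if (t >= best) break;            /* no longer "better": stop */
        best = t; k++;
    }
    return k;                            /* machines 0..k-1 form the fine grid */
}

/* Toy cost model: compute time shrinks with more machines while a
 * per-machine communication overhead grows. */
static double toy_cost(size_t p, size_t bytes)
{
    return (double)bytes / (double)p + 5e-4 * (double)bytes * (double)p;
}

int main(void)
{
    double mem[8] = { 256e6, 256e6, 256e6, 256e6, 512e6, 512e6, 512e6, 512e6 };
    size_t k = select_fine_grid(mem, 8, 1000000000, toy_cost);
    printf("fine grid uses %zu machines\n", k);
    return 0;
}
```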

23 Performance Model
- The Performance Model does a simulation of the actual application using the information from the Resource Selector.
  - It literally runs the program without doing the computation or data movement.
- There is no backtracking implemented.
  - This is an area for enhancement and experimentation.
  - Only point-to-point information is available for the cost model, i.e., there is no broadcast information between cliques.
- At this point we have a fine grid.

24 Contract Development
- Today the Contract Development phase is not fully enabled.
- It should validate the fine grid.
- It should iterate between the Contract Development and Performance Model phases to get a workable fine grid.
- In the future, look for enforcement.

25 Application Launcher
- The Application Launcher starts the job: "mpirun –machinefile –globusrsl fine_grid grid_linear_solve"

26 Things to Keep in Mind About the Results
- MPICH-G is not thread safe, so only one processor of each dual machine can be used.
  - This is really a problem with MPI in general.
- For large problems on a well-connected cluster, ScaLAPACK gets ~3/4 of the matrix-multiply execution rate, and matrix multiply using ATLAS on a Pentium processor gets ~3/4 of peak. Since 3/4 × 3/4 ≈ 56%, we would expect roughly 50% of peak for ScaLAPACK in the best situation.

27 Performance Model vs Runs

28

29
- N=600, NB=40, 2 torc procs. Ratio:
- N=1500, NB=40, 4 torc procs. Ratio:
- N=5000, NB=40, 6 torc procs. Ratio: 2.25
- N=8000, NB=40, 8 torc procs. Ratio: 1.52
- N=10,000, NB=40, 8 torc procs. Ratio: 1.29

30 [Chart residue — machine combinations used in the runs: 5 OPUS; 8 OPUS; 8 OPUS, 6 CYPHER; 8 OPUS, 2 TORC, 6 CYPHER; 6 OPUS, 5 CYPHER; 2 OPUS, 4 TORC, 6 CYPHER; 8 OPUS, 4 TORC, 4 CYPHER. Legend: OPUS; OPUS, CYPHER; OPUS, TORC, CYPHER]

31 Largest Problem Solved
- Matrix of size 30,000
  - 7.2 GB for the data
  - 32 processors to choose from at UIUC and UT
    - Not all machines have 512 MB; some have as little as 128 MB
  - The Performance Model chose 17 machines in 2 clusters from UT
  - The computation took 84 minutes
    - 3.6 Gflop/s total
    - 210 Mflop/s per processor
    - ScaLAPACK on a cluster of 17 processors would get about 50% of peak
    - The processors are 500 MHz, i.e., 500 Mflop/s peak
    - This Grid computation is about 20% slower than ScaLAPACK on such a cluster
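As a sanity check, the figures on this slide are consistent with each other (assuming the standard \(2n^3/3\) flop count for LU factorization in double precision):

\[
30000^2 \times 8\,\text{bytes} \approx 7.2\,\text{GB}, \qquad
\tfrac{2}{3}\,(30000)^3 \approx 1.8 \times 10^{13}\ \text{flops},
\]
\[
\frac{1.8 \times 10^{13}}{84 \times 60\ \text{s}} \approx 3.6\ \text{Gflop/s}, \qquad
\frac{3.6\ \text{Gflop/s}}{17} \approx 210\ \text{Mflop/s per processor},
\]

which is about \(210/250 \approx 84\%\) of the \(\sim\)250 Mflop/s (50% of peak) a dedicated 500 MHz cluster would reach, i.e., roughly 20% less.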

32 Futures (1)
- Activate the sensors in the code to verify that the application is doing what the performance model predicted
- Enable contract enforcement
  - If the "contract" is violated, we want to migrate the application dynamically.
- Develop a better strategy for choosing the best set of machines.
- Implement fault tolerance and migration
- Use the compiler efforts to instrument code for contract monitoring and performance model development.
- Results are non-deterministic; we need some way to cope with this.

33 Futures (2)
- Would like to be in a position to make decisions about which software to run depending on the configuration and the problem.
  - Dynamically choose the algorithm to fit the situation.
- Develop into a general numerical library framework
- Work on iterative solvers
- Latency-tolerant algorithms in general
  - Overlap communication and computation

34