Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures Ahmed Sameh and Ananth Grama Computer Science Department Purdue University.

Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures. Ahmed Sameh and Ananth Grama, Computer Science Department, Purdue University. Linear Solvers Grant Kickoff Meeting, 9/26/06.

Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures: Project Overview
Objectives and Methodology: Design scalable sparse solvers (direct, iterative, and hybrid) and evaluate their scaling/communication characteristics. Evaluate architectural features and their impact on scalable solver performance. Evaluate performance and productivity aspects of programming models -- PGAS (CAF, UPC) and MPI.
Challenges and Impact: Generalizing the space of linear solvers. Implementation and analysis on parallel platforms. Performance projection to the petascale. Guidance for architecture and programming model design / performance envelope. Benchmarks and libraries for HPCS.
Milestones / Schedule: Final deliverable: comprehensive evaluation of the scaling properties of existing (and novel) solvers. Six-month target: comparative performance of solvers on multicore SMPs and clusters. Twelve-month target: comprehensive evaluation of CAF/UPC/MPI implementations on Cray X1, BG, and JS20/21.

Introduction A critical aspect of high productivity is identifying points/regions in the algorithm/architecture/programming-model space that are well suited to petascale systems. This project aims to identify such points in the context of commonly used sparse linear system solvers and to develop novel solvers. These novel solvers emphasize reducing memory and remote accesses at the expense of (possibly) higher FLOP counts – yielding much better actual performance.

Project Rationale Sparse solvers are the most commonly used kernels on HPC machines. The design of HPC architectures and programming models must be influenced by their suitability for these (and related) kernels. The extreme need for concurrency, together with novel architectural models, requires a fundamental re-examination of conventional solvers.

Project Goals Develop a generalization of direct and iterative solvers – the Spike polyalgorithm. Implement this generalization on various architectures (multicore, multicore SMP, multicore SMP aggregates) and programming models (PGAS languages, messaging APIs). Analytically quantify performance and project it to petascale platforms. Compare relative performance, identify architecture/programming-model features, and guide algorithm/architecture/programming-model co-design.

Background Personnel: –Ahmed Sameh, Samuel Conte Chair in Computer Science, has worked on the development of parallel sparse solvers for four decades. –Ananth Grama, Professor and University Scholar, has worked on both the numerical aspects of sparse solvers and analytical frameworks for parallel systems. –(To be named – Postdoctoral Researcher)* will be primarily responsible for implementation and benchmarking. *We have identified three candidates for this position and will shortly be hiring one of them.

Background Technical –We have built extensive infrastructure for parallel sparse solvers – including the Spike parallel toolkit, augmented-spectral ordering techniques, and multipole-based preconditioners. –We have diverse hardware infrastructure, including Intel/AMD multicore SMP clusters, JS20/21 blade servers, BlueGene/L, and Cray X1.

Background Technical (continued) –We have initiated the installation of Co-Array Fortran and Unified Parallel C on our machines and begun porting our toolkits to these PGAS languages. –We have extensive experience in the analysis of performance and scalability of parallel algorithms, including development of the isoefficiency metric for scalability.

Technical Highlights The SPIKE Toolkit (Dr. Sameh, could you include a few slides here).
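Until those slides are added, a minimal sketch of the SPIKE idea may help orient the reader: factor the block-diagonal part of a banded matrix independently per partition, compute the coupling "spikes", solve a small reduced system involving only the interface unknowns, and recover the rest. The Python/NumPy code below illustrates this for two partitions with dense solves; the function name, the two-partition restriction, and the dense storage are simplifications for exposition, not the toolkit's actual interface (which uses banded factorizations and distributes partitions across processors).

import numpy as np

def spike_two_partitions(A, f, m, k):
    """Solve A x = f for a banded A (half-bandwidth k) split into two
    partitions of sizes m and n - m.  Illustrative sketch of the SPIKE
    scheme, not the production toolkit."""
    n = A.shape[0]
    A1, A2 = A[:m, :m], A[m:, m:]
    B = A[m - k:m, m:m + k]      # rows of partition 1, columns of partition 2
    C = A[m:m + k, m - k:m]      # rows of partition 2, columns of partition 1

    # Solve with the block-diagonal part D = diag(A1, A2); each partition
    # is handled independently (the embarrassingly parallel step).
    g1 = np.linalg.solve(A1, f[:m])
    g2 = np.linalg.solve(A2, f[m:])

    # Spikes: V = A1^{-1} [0; B] (right spike), W = A2^{-1} [C; 0] (left spike).
    V = np.linalg.solve(A1, np.vstack([np.zeros((m - k, k)), B]))
    W = np.linalg.solve(A2, np.vstack([C, np.zeros((n - m - k, k))]))

    # The reduced system couples only the k interface unknowns of each partition.
    R = np.block([[np.eye(k), V[m - k:, :]],
                  [W[:k, :],  np.eye(k)]])
    y = np.linalg.solve(R, np.concatenate([g1[m - k:], g2[:k]]))
    xb, xt = y[:k], y[k:]        # bottom of x1, top of x2

    # Retrieve the remaining unknowns from the spikes.
    return np.concatenate([g1 - V @ xt, g2 - W @ xb])

# Quick check against a dense solve on a random, diagonally dominant banded system.
rng = np.random.default_rng(0)
n, k, m = 40, 2, 20
A = np.triu(np.tril(rng.standard_normal((n, n)), k), -k) + 10.0 * np.eye(n)
f = rng.standard_normal(n)
assert np.allclose(spike_two_partitions(A, f, m, k), np.linalg.solve(A, f))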

Technical Highlights Analysis of Scaling Properties –In early work, we developed the isoefficiency metric for scalability. –With systems of up to 100K processing cores a likely scenario, this work becomes critical. –Isoefficiency quantifies the performance of a parallel system (a parallel program and the underlying architecture) as the number of processors is increased.

Technical Highlights Isoefficiency Analysis –The efficiency of any parallel program on a fixed problem instance decreases as the number of processors increases. –For a family of parallel programs (formally referred to as scalable programs), increasing the problem size results in an increase in efficiency.

Technical Highlights Isoefficiency is the rate at which the problem size must be increased with respect to the number of processors to maintain constant efficiency. This rate is critical, since it is ultimately limited by total memory size. Isoefficiency is a key indicator of a program's ability to scale to very large machine configurations. Isoefficiency analysis will be used extensively for performance projection and for characterizing scaling properties.
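For reference, using the standard definitions (the notation here is ours): let $W$ denote the problem size measured as serial work, $T_p$ the parallel runtime on $p$ processors, and $T_o(W,p) = p\,T_p - W$ the total overhead. Then

$$
E \;=\; \frac{S}{p} \;=\; \frac{W}{p\,T_p} \;=\; \frac{1}{1 + T_o(W,p)/W},
\qquad\text{so constant } E \text{ requires}\qquad
W \;=\; \frac{E}{1-E}\,T_o(W,p).
$$

The asymptotic rate at which $W$ must grow with $p$ to satisfy the last relation is the isoefficiency function; for example, if $T_o(W,p)$ grows as $p \log p$, the problem size must grow at least as $p \log p$ to hold efficiency constant.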

Architecture We target the following currently available architectures: –IBM JS20/21 and BlueGene/L platforms –Cray X1/XT3 –AMD Opteron multicore SMP and SMP clusters –Intel Xeon multicore SMP and SMP clusters These platforms represent a wide range of architectural extremes.

Implementation Current implementations are MPI based. The Spike toolkit (iterative as well as direct solvers) will be ported to –POSIX threads and OpenMP –UPC and CAF –Titanium and X10 (if releases are available) These implementations will be comprehensively benchmarked across platforms.

Benchmarks/Metrics We aim to formally specify a number of benchmark problems (sparse systems arising in structures, CFD, and fluid-structure interaction). We will abstract architecture characteristics – processor speed, memory bandwidth, link bandwidth, bisection bandwidth. We will quantify solvers on the basis of wall-clock time, FLOP count, parallel efficiency, scalability, and performance projected to petascale systems.
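As a concrete illustration of how the reported metrics relate to raw measurements, a small helper of the following form could be used (the function name, interface, and sample numbers are ours, purely for illustration):

def solver_metrics(t_serial, t_parallel, p, flop_count):
    """Derive reported metrics from measured wall-clock times (seconds),
    processor count p, and the solver's floating-point operation count."""
    speedup = t_serial / t_parallel
    efficiency = speedup / p                  # parallel efficiency, ideally close to 1
    gflops = flop_count / t_parallel / 1e9    # achieved GFLOP/s on p processors
    return {"speedup": speedup, "efficiency": efficiency, "gflops": gflops}

# Example: a solve taking 120 s serially and 2.4 s on 64 cores,
# with 3.0e11 floating-point operations (illustrative numbers only).
print(solver_metrics(120.0, 2.4, 64, 3.0e11))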

Progress/Accomplishments –Implementation of the parallel Spike polyalgorithm toolkit. –Incorporation of a number of direct (SuperLU, MUMPS) and iterative solvers (preconditioned Krylov subspace methods) into Spike. –Evaluation of Spike on IBM SP and Intel multicore platforms, and integration into the Intel MKL library.
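To make "preconditioned Krylov subspace methods" concrete, the sketch below is a generic preconditioned conjugate-gradient loop (for the symmetric positive definite case), in which the preconditioner application apply_Minv could be, for example, a SPIKE solve with a banded approximation of A. The interface and names are illustrative assumptions, not the toolkit's API.

import numpy as np

def pcg(apply_A, apply_Minv, b, tol=1e-8, maxiter=500):
    """Preconditioned conjugate gradients for SPD systems.
    apply_A(v) returns A @ v; apply_Minv(r) applies the preconditioner,
    e.g. a banded/SPIKE solve with an approximation of A."""
    x = np.zeros_like(b)
    r = b - apply_A(x)
    z = apply_Minv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxiter):
        Ap = apply_A(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = apply_Minv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x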

Milestones Final deliverable: comprehensive evaluation of the scaling properties of existing (and new) solvers. Six-month target: comparative performance of solvers on multicore SMPs and clusters. Twelve-month target: comprehensive evaluation of CAF/UPC/MPI implementations on Cray X1, BG, and JS20/21.

Financials The total cost of this project is approximately $150K for its one-year duration. The budget primarily covers a postdoctoral researcher's salary/benefits and modest summer support for the PIs. Together, these three project personnel are responsible for accomplishing project milestones and reporting.

Concluding Remarks This project takes a comprehensive view of linear system solvers and their suitability for petascale HPC systems. Its results will directly influence ongoing and future development of HPC systems. A number of major challenges are likely to emerge, both as a result of this project and from impending architectural innovations.

Concluding Remarks Architectural features include –Scalable multicore platforms: 64 to 128 cores on the horizon. –Heterogeneous multicore: cores are likely to be heterogeneous – some with floating-point units, others with vector units, yet others with programmable hardware (such chips are already common in cell phones). –Significantly higher pressure on the memory subsystem.

Concluding Remarks Impact of architectural features on algorithms and programming models: –Affinity scheduling is important for performance – we need ways to specify tasks that must be co-scheduled (suitable programming abstractions are needed). –Programming constructs for exploiting heterogeneity.

Concluding Remarks Impact of architectural features on algorithms and programming models (continued): –FLOPS are cheap, memory references are expensive – explore new families of algorithms that minimize the latter. –Algorithmic techniques and programming constructs for specifying algorithmic asynchrony (used to mask system latency). –Many of the optimizations are likely to be beyond the technical reach of applications programmers – hence the need for scalable library support. –Increased emphasis on scalability analysis.