Trilinos Package Summary (Objective: Package(s))

Discretizations
- Meshing & Spatial Discretizations: phdMesh, Intrepid, Pamgen
- Time Integration: Rythmos

Methods
- Automatic Differentiation: Sacado
- Mortar Methods: Moertel

Core
- Linear algebra objects: Epetra, Jpetra, Tpetra
- Abstract interfaces: Thyra, Stratimikos, RTOp
- Load Balancing: Zoltan, Isorropia
- "Skins": PyTrilinos, WebTrilinos, Star-P, ForTrilinos, CTrilinos
- C++ utilities, I/O, thread API: Teuchos, EpetraExt, Kokkos, Triutils, TPI

Solvers
- Iterative (Krylov) linear solvers: AztecOO, Belos, Komplex
- Direct sparse linear solvers: Amesos
- Direct dense linear solvers: Epetra, Teuchos, Pliris
- Iterative eigenvalue solvers: Anasazi
- ILU-type preconditioners: AztecOO, IFPACK
- Multilevel preconditioners: ML, CLAPS
- Block preconditioners: Meros
- Nonlinear system solvers: NOX, LOCA
- Optimization (SAND): MOOCHO, Aristos
- Stochastic PDEs: Stokhos
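These packages are designed to compose. As a point of reference (not taken from the slides), a minimal sketch of how the core and solver layers fit together: Epetra objects feeding an AztecOO CG solve with Jacobi preconditioning. The 1-D Laplacian matrix, problem size, and tolerance below are assumptions made for illustration, not part of the presentation.

// Minimal sketch (illustrative only): assemble a 1-D Laplacian in Epetra and
// solve it with CG from AztecOO. Problem size, tolerance, and preconditioner
// choice are assumptions made for this example.
#include <mpi.h>
#include "Epetra_MpiComm.h"
#include "Epetra_Map.h"
#include "Epetra_CrsMatrix.h"
#include "Epetra_Vector.h"
#include "Epetra_LinearProblem.h"
#include "AztecOO.h"

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  Epetra_MpiComm comm(MPI_COMM_WORLD);

  const int numGlobalRows = 1000;              // assumed problem size
  Epetra_Map map(numGlobalRows, 0, comm);      // linear row distribution
  Epetra_CrsMatrix A(Copy, map, 3);            // at most 3 nonzeros per row

  // Fill the locally owned rows of a tridiagonal (1-D Laplacian) matrix.
  for (int i = 0; i < map.NumMyElements(); ++i) {
    int row = map.GID(i);
    double vals[3];
    int cols[3];
    int n = 0;
    if (row > 0)                 { vals[n] = -1.0; cols[n++] = row - 1; }
    vals[n] = 2.0; cols[n++] = row;
    if (row < numGlobalRows - 1) { vals[n] = -1.0; cols[n++] = row + 1; }
    A.InsertGlobalValues(row, n, vals, cols);
  }
  A.FillComplete();

  Epetra_Vector x(map), b(map);
  b.PutScalar(1.0);                            // simple right-hand side

  Epetra_LinearProblem problem(&A, &x, &b);
  AztecOO solver(problem);
  solver.SetAztecOption(AZ_solver, AZ_cg);     // CG for the SPD matrix
  solver.SetAztecOption(AZ_precond, AZ_Jacobi);
  solver.Iterate(500, 1.0e-8);                 // max iterations, tolerance

  MPI_Finalize();
  return 0;
}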

Changing Scope of Trilinos
- Capabilities:
  - Past: Solver capabilities and supporting components.
  - Now: Any library for science/engineering (Zoltan, Intrepid, …).
- Customers:
  - Past: Sandia and other NNSA customers.
  - Now: Expanding to Office of Science applications, DoD, DHS, CRADAs and WFO.
- Platforms:
  - Past: All platforms using a command-line installer (Autotools); Linux/Unix bias.
  - Now: Expanding to GUI & binary installer (CMake); native Windows/Mac process.

The Changing Scope of the Trilinos Project, Michael A. Heroux, Technical Report, Sandia National Laboratories, SAND, December 2007.

Capability Leaders: New Layer of Proactive Leadership
- Areas:
  - Framework, Tools & Interfaces (J. Willenbring).
  - Discretizations (P. Bochev).
  - Geometry, Meshing & Load Balancing (K. Devine).
  - Scalable Linear Algebra (M. Heroux).
  - Linear & Eigen Solvers (J. Hu).
  - Nonlinear, Transient & Optimization Solvers (A. Salinger).
- Each leader provides strategic direction across all Trilinos packages within area.

A Few HPCCG Multicore Results
- Float useful: mixed precision algorithms (sketched below).
- Bandwidth even more important: saturation means loss of cores.
- Memory placement a concern: shared memory allows remote placement.
- Niagara T2 threads hide latency: easiest node to program.
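The "float useful" point is the basis for mixed precision algorithms: do the bulk of the work in single precision, where both bandwidth and arithmetic are cheaper, and recover double-precision accuracy with an inexpensive correction. A minimal sketch of that idea follows; the diagonally dominant test matrix and the Jacobi-sweep inner solver (solve_single) are assumptions for this example, and any cheap single-precision solver could play the same role.

// Sketch of mixed-precision iterative refinement (illustrative, not Trilinos code):
// solve A x = b by repeatedly solving the correction equation A d = r in float
// while accumulating x and the residual r in double.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Assumed inner solver: a few Jacobi sweeps in single precision.
static void solve_single(const std::vector<float>& A, const std::vector<float>& rhs,
                         std::vector<float>& d, int n, int sweeps = 50) {
  std::vector<float> next(n, 0.0f);
  std::fill(d.begin(), d.end(), 0.0f);
  for (int s = 0; s < sweeps; ++s) {
    for (int i = 0; i < n; ++i) {
      float sum = rhs[i];
      for (int j = 0; j < n; ++j)
        if (j != i) sum -= A[i * n + j] * d[j];
      next[i] = sum / A[i * n + i];
    }
    d.swap(next);
  }
}

int main() {
  const int n = 4;
  // Diagonally dominant test matrix (assumption for this sketch).
  std::vector<double> A = {10, 1, 0, 2,
                            1, 9, 3, 0,
                            0, 3, 8, 1,
                            2, 0, 1, 7};
  std::vector<double> b = {13, 13, 12, 10};     // exact solution is all ones
  std::vector<float>  Af(A.begin(), A.end());   // single-precision copy of A

  std::vector<double> x(n, 0.0), r(n);
  std::vector<float>  rf(n), d(n);

  for (int iter = 0; iter < 10; ++iter) {
    // Residual r = b - A x, computed in double precision.
    double nrm = 0.0;
    for (int i = 0; i < n; ++i) {
      r[i] = b[i];
      for (int j = 0; j < n; ++j) r[i] -= A[i * n + j] * x[j];
      nrm += r[i] * r[i];
    }
    std::printf("iter %d  ||r|| = %.3e\n", iter, std::sqrt(nrm));
    if (std::sqrt(nrm) < 1e-12) break;

    // Correction A d = r solved cheaply in single precision.
    for (int i = 0; i < n; ++i) rf[i] = static_cast<float>(r[i]);
    solve_single(Af, rf, d, n);

    // Update kept in double precision.
    for (int i = 0; i < n; ++i) x[i] += d[i];
  }
  return 0;
}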

More Float vs Double: Barcelona
- pHPCCG:
  - Float faster than double.
  - Float scales better.

Multi-Programming Model Runtime Environment: Niagara2, MPI & MPI+threads
- App: scales (superlinearly); MPI-only is sufficient.
- Solver: bandwidth-limited; MPI+threads can help (a skeleton follows below).
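The MPI+threads configuration referred to here is the standard hybrid model: a few MPI processes per node with a thread team inside each process. A minimal sketch of how such a runtime is initialized and how work splits between MPI ranks and OpenMP threads; the loop body and sizes are placeholders for illustration, not the HPCCG solver itself.

// Hybrid MPI + OpenMP skeleton (illustrative): MPI ranks own subdomains,
// OpenMP threads share the work inside each rank.
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  // Ask for a threading level that allows the main thread to make MPI calls.
  int provided = 0;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

  int rank = 0, nranks = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // Placeholder local work: a threaded vector update followed by a global dot product.
  const int nlocal = 1 << 20;
  std::vector<double> x(nlocal, 1.0), y(nlocal, 2.0);
  double local_dot = 0.0;

  #pragma omp parallel for reduction(+ : local_dot)
  for (int i = 0; i < nlocal; ++i) {
    y[i] += 0.5 * x[i];
    local_dot += x[i] * y[i];
  }

  // MPI call made outside the parallel region (consistent with FUNNELED).
  double global_dot = 0.0;
  MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  if (rank == 0)
    std::printf("%d ranks x %d threads, dot = %g\n",
                nranks, omp_get_max_threads(), global_dot);

  MPI_Finalize();
  return 0;
}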

Library Preparations for New Node Architectures (Decision Made Years Ago)
- We knew node architectures would change…
- Abstract Parallel Machine Interface: Comm class.
- Abstract Linear Algebra Objects:
  - Operator class: action of operator only, no knowledge of how.
  - RowMatrix class: serve up a row of coefficients on demand.
  - Pure abstract layer: no unnecessary constraints at all.
- Model Evaluator: highly flexible API for linear/nonlinear solver services.
- Templated scalar and integer types:
  - Compile-time resolution: float, double, quad, …; int, long long, … (see the sketch below).
  - Mixed precision algorithms.
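A rough C++ sketch of what two of these decisions look like in code: an abstract operator that exposes only its action, and a scalar-templated kernel so the same code serves float, double, or quad. This is illustrative shorthand, not the actual Thyra/Epetra/Tpetra class hierarchy; names such as AbstractOperator, Laplace1D, and axpy are invented for the example.

// Illustrative sketch (not Trilinos source): an abstract action-only operator
// interface plus a scalar-templated kernel.
#include <cstddef>
#include <vector>

// Action-only operator: callers can apply it, but never see how it is stored.
template <typename Scalar>
class AbstractOperator {
 public:
  virtual ~AbstractOperator() = default;
  virtual void apply(const std::vector<Scalar>& x, std::vector<Scalar>& y) const = 0;
};

// One possible concrete operator: a 1-D Laplacian stencil, kept private.
template <typename Scalar>
class Laplace1D : public AbstractOperator<Scalar> {
 public:
  explicit Laplace1D(std::size_t n) : n_(n) {}
  void apply(const std::vector<Scalar>& x, std::vector<Scalar>& y) const override {
    for (std::size_t i = 0; i < n_; ++i) {
      Scalar v = Scalar(2) * x[i];
      if (i > 0)      v -= x[i - 1];
      if (i + 1 < n_) v -= x[i + 1];
      y[i] = v;
    }
  }
 private:
  std::size_t n_;
};

// Scalar-templated kernel: the precision is fixed at compile time.
template <typename Scalar>
void axpy(Scalar alpha, const std::vector<Scalar>& x, std::vector<Scalar>& y) {
  for (std::size_t i = 0; i < x.size(); ++i) y[i] += alpha * x[i];
}

int main() {
  Laplace1D<float>  Af(100);   // single-precision instantiation
  Laplace1D<double> Ad(100);   // double-precision instantiation
  std::vector<float>  xf(100, 1.0f), yf(100, 0.0f);
  std::vector<double> xd(100, 1.0),  yd(100, 0.0);
  Af.apply(xf, yf);  axpy(0.5f, xf, yf);
  Ad.apply(xd, yd);  axpy(0.5,  xd, yd);
  return 0;
}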

Library Effort in Response to Node Architecture Trends
- Block Krylov Methods (Belos & Anasazi):
  - Natural for UQ, QMU, Sensitivity Analysis, …
  - Superior node and network complexity (see the sketch below).
- Templated Kernel Libraries (Tpetra & Tifpack):
  - Choice of float vs double made when the object is created.
  - High-performance multiprecision algorithms.
- Threaded Comm Class (Tpetra):
  - Intel TBB support, compatible with OpenMP, Pthreads, …
  - Clients of Tpetra::TbbMpiComm can access a static, ready-to-work thread pool.
  - Code above the basic kernel level is unaware of threads.
- Specialized sparse matrix data structures: sparse diagonal, sparse-dense, composite.
- MPI-only + MPI/PNAS:
  - Application runs MPI-only (8 flat MPI processes on a dual quad-core).
  - Solver runs MPI-only when interfacing with the app using a partitioned nodal address space (PNAS); 2 MPI processes, 4 threads each when solving the problem.
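The node-complexity argument for block Krylov methods is that one pass over the matrix can be amortized over several vectors at once. A rough sketch of that kernel, a CRS matrix applied to a block of vectors, is below; the data layout and the names CrsMatrix and spmm are assumptions for illustration, not Belos/Tpetra internals.

// Illustrative sketch: apply one sparse matrix to k vectors in a single sweep,
// so each matrix entry is read once but used k times (the block-Krylov payoff).
#include <cstddef>
#include <vector>

struct CrsMatrix {                     // minimal compressed-row storage
  std::size_t nrows;
  std::vector<std::size_t> rowPtr;     // size nrows + 1
  std::vector<std::size_t> colInd;     // column index per nonzero
  std::vector<double>      values;     // value per nonzero
};

// Y (nrows x k) = A * X (ncols x k); both blocks stored row-major.
void spmm(const CrsMatrix& A,
          const std::vector<double>& X, std::vector<double>& Y, std::size_t k) {
  for (std::size_t i = 0; i < A.nrows; ++i) {
    std::vector<double> acc(k, 0.0);
    for (std::size_t nz = A.rowPtr[i]; nz < A.rowPtr[i + 1]; ++nz) {
      const double a = A.values[nz];        // matrix entry loaded once ...
      const std::size_t j = A.colInd[nz];
      for (std::size_t v = 0; v < k; ++v)   // ... and reused for all k vectors
        acc[v] += a * X[j * k + v];
    }
    for (std::size_t v = 0; v < k; ++v) Y[i * k + v] = acc[v];
  }
}

With k = 1 this reduces to an ordinary sparse matrix-vector product; increasing k raises arithmetic intensity without rereading the matrix, which is exactly the bandwidth argument made on the earlier multicore slides.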

What is BEC?
- A programming model developed at Sandia based on careful analysis of Sandia applications, the strengths and weaknesses of past and current programming models, and the technology evolution path.
- Code example:

shared int A[10000], B[10000], C[10000];  /* globally shared data */
BEC_request(A[3]);
BEC_request(B[8]);                        /* Bundle requests */
BEC_exchange();                           /* Exchange bundled requests globally */
C[10] = A[3] + B[8];                      /* Computation using shared data like local data */

- BEC model:
  - BEC combines the convenience of virtual shared memory (a.k.a. Global Address Space, GAS) with the efficiency of Bulk Synchronous Parallel (BSP).
  - BEC has built-in capabilities for efficient support of high-volume, random, fine-grained communication (accesses to virtual shared memory).

BEC Application: HPCCG
- Form & solve a linear system using the Conjugate Gradient (CG) method (a serial sketch follows below).
- MPI version: part of the Mantevo toolset.
- The BEC and MPI versions have two main steps:
  - Bundle preparation (message queue setup, data bundling, organizing remotely fetched data, etc.)
  - CG iterations (until convergence)
- UPC version: a benchmark from the UPC Consortium ("fully optimized").

Number of lines of code:

Task | BEC | MPI | UPC
Bundle preparation | 6 | 240 | N/A
CG iterations (computation related code) | 60 | 87 | N/A
Communication related code | 11 | 277 | N/A
Whole program (excluding empty lines, comments) | | | About 900
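To make the kernel names used on the next slide concrete, here is a minimal serial CG written in the same vocabulary (mv, ddot, waxpby). The 1-D Laplacian operator, sizes, and tolerance are assumptions for illustration; the real HPCCG versions distribute the matrix and vectors across processes.

// Minimal serial CG using the same kernel names (mv, ddot, waxpby) as the
// BEC/MPI versions; the matrix is an implicit 1-D Laplacian chosen for illustration.
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

typedef std::vector<double> Vec;

void mv(const Vec& p, Vec& Ap) {            // Ap = A*p for the 1-D Laplacian
  const std::size_t n = p.size();
  for (std::size_t i = 0; i < n; ++i) {
    double v = 2.0 * p[i];
    if (i > 0)     v -= p[i - 1];
    if (i + 1 < n) v -= p[i + 1];
    Ap[i] = v;
  }
}

void ddot(const Vec& a, const Vec& b, double* result) {     // *result = a . b
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  *result = s;
}

void waxpby(double alpha, const Vec& x, double beta, const Vec& y, Vec& w) {
  for (std::size_t i = 0; i < x.size(); ++i) w[i] = alpha * x[i] + beta * y[i];
}

int main() {
  const std::size_t n = 100;
  const double tolerance = 1e-8;
  const int max_iter = 500;
  Vec x(n, 0.0), b(n, 1.0), r(b), p(b), Ap(n);   // x0 = 0, so r0 = b, p0 = r0

  double rtrans, oldrtrans, alpha, beta, normr = 0.0;
  ddot(r, r, &rtrans);
  for (int iter = 1; iter < max_iter; ++iter) {
    mv(p, Ap);                       // serial version: no bundle/exchange needed
    ddot(p, Ap, &alpha);
    alpha = rtrans / alpha;
    waxpby(1.0, x, alpha, p, x);     // x = x + alpha*p
    waxpby(1.0, r, -alpha, Ap, r);   // r = r - alpha*Ap
    oldrtrans = rtrans;
    ddot(r, r, &rtrans);
    normr = std::sqrt(rtrans);
    if (normr <= tolerance) break;   // converged
    beta = rtrans / oldrtrans;
    waxpby(1.0, r, beta, p, p);      // p = r + beta*p
  }
  std::printf("final residual norm = %.3e\n", normr);
  return 0;
}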

Iteration in BEC (many phases of Bundle-Exchange-Compute)

// Note: since the requests for the shared data are the same in every iteration of this
// application, there is no need to make explicit BEC_request() calls again; just call
// BEC_repeat_requests() on the persistent bundle.
for (int iter = 1; iter < max_iter; iter++) {
  BEC_repeat_requests(bundle);
  BEC_exchange();

  // Compute
  mv(A, p, bundle, Ap);            /* BEC and MPI versions of mv() are similar */
  ddot(p, Ap, &alpha);
  alpha = rtrans / alpha;
  waxpby(1.0, x, alpha, p, x);     /* BEC and MPI versions of waxpby() are similar */
  waxpby(1.0, r, -alpha, Ap, r);
  oldrtrans = rtrans;
  ddot(r, r, &rtrans);
  normr = sqrt(rtrans);
  if (normr <= tolerance) break;   // converged
  beta = rtrans / oldrtrans;
  waxpby(1.0, r, beta, p, p);
}

BEC Application: Graph Coloring
- Vertex coloring.
- Algorithm (heuristic): Largest Degree First (sketched below).

Number of lines of code:

Task | BEC | MPI
Communication related code (including bundling) | 10 | 69
Computation | 58 | 61
Whole program (excluding empty lines, comments) | |
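For reference, the heuristic itself is simple: visit vertices in order of decreasing degree and give each one the smallest color not already used by a colored neighbor. A small sequential sketch follows; the adjacency-list representation is an assumption for illustration, and the distributed BEC/MPI versions add the communication code counted in the table above.

// Sequential sketch of Largest Degree First greedy vertex coloring.
#include <algorithm>
#include <numeric>
#include <vector>

// adj[v] lists the neighbors of vertex v; returns a color per vertex (0-based).
std::vector<int> largest_degree_first(const std::vector<std::vector<int>>& adj) {
  const int n = static_cast<int>(adj.size());

  // Order vertices by decreasing degree.
  std::vector<int> order(n);
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(),
            [&](int a, int b) { return adj[a].size() > adj[b].size(); });

  std::vector<int> color(n, -1);     // -1 means "not yet colored"
  std::vector<char> used(n + 1, 0);  // scratch: colors used by neighbors

  for (int v : order) {
    // Mark the colors already taken by colored neighbors.
    for (int u : adj[v])
      if (color[u] >= 0) used[color[u]] = 1;

    // Pick the smallest free color.
    int c = 0;
    while (used[c]) ++c;
    color[v] = c;

    // Reset the scratch marks for the next vertex.
    for (int u : adj[v])
      if (color[u] >= 0) used[color[u]] = 0;
  }
  return color;
}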

Graph Coloring Performance Comparison (on “Franklin”, Cray XT3 at NERSC)

Using BEC (1)
- BEC Beta Release available for download.
- Portable to machines (including PCs) with: Unix/Linux, a C++ compiler, and a message-passing library (e.g. MPI, Portals).
- BEC can be used alone or mixed with MPI (e.g. in the same function).
- BEC includes:
  - An extension to ANSI C to allow "shared" variables (array syntax).
  - A runtime library, BEC lib.
- Use BEC with the language extension: BEC code → (C code) + (BEC lib function calls).
- Use BEC lib directly: (C or Fortran code) + (BEC lib function calls); "A[10] = …" is written as "BEC_write(A, 10, …)".

Using BEC (2): BEC Language Extension vs. BEC Lib

BEC language extension:

// Declaration
typedef struct { int x; int y; } my_type;
shared my_type my_partition A[n];
shared double B[m][n];

// Shared data requests
BEC_request(A[3].x);
BEC_request(B[3][4]);

// Exchange of shared data
BEC_exchange();

// Computation
S = B[3][4] + A[3].x;

BEC lib used directly:

// Declaration
typedef struct { int x; int y; } my_type;
BEC_shared_1d(my_type, my_partition) A;
BEC_shared_2d(double, equal) B;

// Shared space allocation
BEC_joint_allocate(A, n);
BEC_joint_allocate(B, m, n);

// Shared data requests
BEC_request(BEC_element_attribute(A, x), 3);
BEC_request(B, 3, 4);

// Global exchange of shared data
BEC_exchange();

// Computation
S = BEC_read(B, 3, 4) + BEC_read(BEC_element_attribute(A, x), 3);

BEC Lib Functions

Basic:
- BEC_initialize(); BEC_finalize();
- BEC_request(); BEC_exchange();
- BEC_read(); BEC_write();

For dynamic shared space:
- BEC_joint_allocate(); BEC_joint_free();

For performance optimization:
- Communication & computation overlap: BEC_exchange_begin(); BEC_exchange_end();
- Special requests: BEC_request_to(); BEC_apply_to();
- Reusable bundle object: BEC_create_persistent_bundle(); BEC_delete_bundle(); BEC_repeat_requests();
- Directly getting data out of a bundle: BEC_get_index_in_bundle(); BEC_increment_index_in_bundle(); BEC_get_value_in_bundle();
- Local portion of a shared array: BEC_local_element_count(); BEC_local_element();

Miscellaneous:
- BEC_get_global_index(); BEC_proc_count(); BEC_my_proc_id();

Some Future Directions
- Increased compatibility with MPI:
  - Light-weight "translation" of BEC global arrays into MPI (plain) local arrays and vice versa.
  - BEC embeddable into an existing MPI app (can already do in most instances).
- Hybrid BEC+MPI HPCCG:
  - Setup/exchange done with BEC.
  - Key kernels performed using plain local arrays.
  - Vector ops using either BEC or MPI collectives.
- Control structures: parallel loops.
- Virtualized processors: > MPI size.
- Bundle as a full first-class object: already mostly there.
- Simultaneous exchange_begin()'s.
- Queuing, pipelining of iterations.