Trilinos Package Summary

Discretizations:
- Meshing & Spatial Discretizations: phdMesh, Intrepid, Pamgen
- Time Integration: Rythmos

Methods:
- Automatic Differentiation: Sacado
- Mortar Methods: Moertel

Core:
- Linear algebra objects: Epetra, Jpetra, Tpetra
- Abstract interfaces: Thyra, Stratimikos, RTOp
- Load Balancing: Zoltan, Isorropia
- "Skins": PyTrilinos, WebTrilinos, Star-P, ForTrilinos, CTrilinos
- C++ utilities, I/O, thread API: Teuchos, EpetraExt, Kokkos, Triutils, TPI

Solvers:
- Iterative (Krylov) linear solvers: AztecOO, Belos, Komplex
- Direct sparse linear solvers: Amesos
- Direct dense linear solvers: Epetra, Teuchos, Pliris
- Iterative eigenvalue solvers: Anasazi
- ILU-type preconditioners: AztecOO, IFPACK
- Multilevel preconditioners: ML, CLAPS
- Block preconditioners: Meros
- Nonlinear system solvers: NOX, LOCA
- Optimization (SAND): MOOCHO, Aristos
- Stochastic PDEs: Stokhos
Changing Scope of Trilinos

- Capabilities:
  - Past: solver capabilities and supporting components.
  - Now: any library for science/engineering (Zoltan, Intrepid, ...).
- Customers:
  - Past: Sandia and other NNSA customers.
  - Now: expanding to Office of Science applications, DoD, DHS, CRADAs, and WFO.
- Platforms:
  - Past: all platforms, using a command-line installer (Autotools); Linux/Unix bias.
  - Now: expanding to a GUI and binary installer (CMake); native Windows/Mac process.

Reference: Michael A. Heroux, "The Changing Scope of the Trilinos Project," Technical Report SAND2007-7775, Sandia National Laboratories, December 2007.
Capability Leaders: New Layer of Proactive Leadership

Areas:
- Framework, Tools & Interfaces (J. Willenbring).
- Discretizations (P. Bochev).
- Geometry, Meshing & Load Balancing (K. Devine).
- Scalable Linear Algebra (M. Heroux).
- Linear & Eigen Solvers (J. Hu).
- Nonlinear, Transient & Optimization Solvers (A. Salinger).

Each leader provides strategic direction across all Trilinos packages within their area.
A Few HPCCG Multicore Results

- Float is useful: it enables mixed-precision algorithms.
- Bandwidth is even more important: saturation means loss of cores.
- Memory placement is a concern: shared memory allows remote placement.
- Niagara2 threads hide latency: easiest node to program.
More Float vs. Double: Barcelona pHPCCG

- Float is faster than double.
- Float scales better.
Multi-Programming Model Runtime Environment: Niagara2, MPI & MPI+threads

- Application: scales (superlinearly); MPI-only is sufficient.
- Solver: bandwidth-limited; MPI+threads can help.
Library Preparations for New Node Architectures (Decisions Made Years Ago)

We knew node architectures would change...

- Abstract parallel machine interface: Comm class.
- Abstract linear algebra objects:
  - Operator class: action of the operator only, no knowledge of how it is implemented.
  - RowMatrix class: serves up a row of coefficients on demand.
  - Pure abstract layer: no unnecessary constraints at all.
- Model Evaluator: highly flexible API for linear/nonlinear solver services.
- Templated scalar and integer types (see the sketch after this list):
  - Compile-time resolution of float, double, quad, ...; int, long long, ...
  - Mixed-precision algorithms.
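As a concrete illustration of the compile-time approach, here is a minimal C++ sketch (hypothetical code, not actual Trilinos source): the same kernel can be instantiated for any scalar/ordinal pair, and a mixed-precision variant stores data in float while accumulating in double.

    // Hypothetical sketch: a dot product templated on scalar and ordinal types.
    // Precision and index width are resolved at compile time, with no runtime dispatch.
    template <typename Scalar, typename Ordinal>
    Scalar dot(Ordinal n, const Scalar* x, const Scalar* y) {
        Scalar sum = 0;
        for (Ordinal i = 0; i < n; ++i)
            sum += x[i] * y[i];
        return sum;
    }

    // Mixed precision: store vectors in float (half the memory bandwidth),
    // but accumulate in double for accuracy.
    template <typename Ordinal>
    double dot_mixed(Ordinal n, const float* x, const float* y) {
        double sum = 0.0;
        for (Ordinal i = 0; i < n; ++i)
            sum += static_cast<double>(x[i]) * static_cast<double>(y[i]);
        return sum;
    }

Instantiating dot<float, int> or dot<double, long long> selects the arithmetic at compile time; dot_mixed shows why float data is attractive on bandwidth-limited nodes.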
Library Effort in Response to Node Architecture Trends

- Block Krylov methods (Belos & Anasazi):
  - Natural for UQ, QMU, sensitivity analysis, ...
  - Superior node and network complexity.
- Templated kernel libraries (Tpetra & Tifpack):
  - Choice of float vs. double is made when the object is created.
  - High-performance multiprecision algorithms.
- Threaded Comm class (Tpetra):
  - Intel TBB support; compatible with OpenMP, Pthreads, ...
  - Clients of Tpetra::TbbMpiComm can access a static, ready-to-work thread pool.
  - Code above the basic kernel level is unaware of threads.
- Specialized sparse matrix data structures: sparse diagonal, sparse-dense, composite.
- MPI-only + MPI/PNAS (sketched below):
  - Application runs MPI-only (8 flat MPI processes on a dual quad-core node).
  - Solver runs MPI-only when interfacing with the application through the partitioned nodal address space (PNAS), and 2 MPI processes with 4 threads each when solving the problem.
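The MPI+threads side of that hybrid model can be sketched as follows (a hedged illustration assuming OpenMP for the solver threads; the slides name TBB, OpenMP, and Pthreads as options, and the PNAS handoff between the 8-process and 2-process views is not shown):

    #include <mpi.h>

    int main(int argc, char** argv) {
        int provided;
        // Request thread support so the solver phase may use threads.
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        // Application phase: flat MPI, one process per core, no threads.
        // ... assemble the local part of the linear system ...

        // Solver phase: fewer MPI processes, several threads each, so that
        // bandwidth-limited kernels share a node's memory system.
        #pragma omp parallel num_threads(4)
        {
            // ... threaded sparse kernels (e.g., the local matrix-vector product) ...
        }

        MPI_Finalize();
        return 0;
    }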
What is BEC?

A programming model developed at Sandia, based on careful analysis of Sandia applications, the strengths and weaknesses of past and current programming models, and the technology evolution path.

Code example:

    shared int A[10000], B[10000], C[10000];  /* globally shared data */

    BEC_request(A[3]);
    BEC_request(B[8]);    /* bundle requests */
    BEC_exchange();       /* exchange bundled requests globally */
    C[10] = A[3] + B[8];  /* computation using shared data as if it were local */

The BEC model:
- BEC combines the convenience of virtual shared memory (a.k.a. Global Address Space, GAS) with the efficiency of Bulk Synchronous Parallel (BSP).
- BEC has built-in capabilities for efficiently supporting high-volume, random, fine-grained communication, i.e., accesses to virtual shared memory (see the sketch below).
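The same bundle-exchange-compute pattern extends to the irregular gathers BEC is designed for. A hedged sketch using only the constructs shown above (N, idx, and nidx are illustrative names, not part of the BEC API):

    shared double x[N];   /* globally shared vector */

    /* Bundle phase: request every remote element this process will read. */
    for (int k = 0; k < nidx; k++)
        BEC_request(x[idx[k]]);

    /* Exchange phase: one bundled, global communication step (BSP-style). */
    BEC_exchange();

    /* Compute phase: use the shared data as if it were local. */
    double sum = 0.0;
    for (int k = 0; k < nidx; k++)
        sum += x[idx[k]];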
BEC Application: HPCCG

Form and solve a linear system using the Conjugate Gradient (CG) method.

- MPI version: part of the Mantevo toolset.
- The BEC and MPI versions have two main steps: bundle preparation (message queue setup, data bundling, organizing remotely fetched data, etc.) and CG iterations (until convergence).
- UPC version: a benchmark from the UPC Consortium ("fully optimized").

Lines of code (excluding empty lines and comments):

    Task                                       BEC    MPI    UPC
    Bundle preparation                          62     40    N/A
    CG iterations (computation-related code)    60     87    N/A
    Communication-related code                  11    277    N/A
    Whole program                              233    733   ~900
Iteration in BEC (many phases of Bundle-Exchange-Compute)

    // Note: Since the requests for the shared data are the same in every iteration of
    // this application, there is no need to make explicit BEC_request() calls again;
    // just call BEC_repeat_requests().
    for (int iter = 1; iter < max_iter; iter++) {
        BEC_repeat_requests(bundle);  // re-issue the same bundled requests
        BEC_exchange();               // exchange bundled data globally

        // Compute -----------------------------------------------------
        mv(A, p, bundle, Ap);           /* BEC and MPI versions of mv() are similar */
        ddot(p, Ap, &alpha);            // alpha temporarily holds p'*(A*p)
        alpha = rtrans / alpha;         // step length
        waxpby(1.0, x, alpha, p, x);    /* BEC and MPI versions of waxpby() are similar */
        waxpby(1.0, r, -alpha, Ap, r);  // update residual
        oldrtrans = rtrans;
        ddot(r, r, &rtrans);            // rtrans = r'*r
        normr = sqrt(rtrans);
        if (normr <= tolerance) break;  // converged
        beta = rtrans / oldrtrans;
        waxpby(1.0, r, beta, p, p);     // p = r + beta*p
    }
BEC Application: Graph Coloring

Vertex coloring algorithm (heuristic): Largest Degree First.

Lines of code (excluding empty lines and comments):

    Task                                               BEC    MPI
    Communication-related code (including bundling)     10     69
    Computation                                         58     61
    Whole program                                      131    201
Graph Coloring Performance Comparison (on “Franklin”, Cray XT3 at NERSC)
Using BEC (1)

- BEC Beta Release available for download: http://www.cs.sandia.gov/BEC
- Portable to machines (including PCs) with Unix/Linux, a C++ compiler, and a message-passing library (e.g., MPI, Portals).
- BEC can be used alone or mixed with MPI (e.g., within the same function).
- BEC includes:
  - An extension to ANSI C that allows "shared" variables (array syntax).
  - A runtime library, BEC lib.
- Two ways to use BEC:
  - With the language extension: BEC code (C code) + BEC lib function calls.
  - With BEC lib directly: C or Fortran code + BEC lib function calls, where "A[10] = ..." is written as "BEC_write(A, 10, ...)".
Using BEC (2): BEC Language Extension vs. BEC Lib

With the language extension:

    // Declaration
    typedef struct { int x; int y; } my_type;
    shared my_type my_partition A[n];
    shared double B[m][n];

    // Shared data requests
    BEC_request(A[3].x);
    BEC_request(B[3][4]);

    // Exchange of shared data
    BEC_exchange();

    // Computation
    S = B[3][4] + A[3].x;

With BEC lib directly:

    // Declaration
    typedef struct { int x; int y; } my_type;
    BEC_shared_1d(my_type, my_partition) A;
    BEC_shared_2d(double, equal) B;

    // Shared space allocation
    BEC_joint_allocate(A, n);
    BEC_joint_allocate(B, m, n);

    // Shared data requests
    BEC_request(BEC_element_attribute(A, x), 3);
    BEC_request(B, 3, 4);

    // Global exchange of shared data
    BEC_exchange();

    // Computation
    S = BEC_read(B, 3, 4) + BEC_read(BEC_element_attribute(A, x), 3);
BEC Lib Functions

Basic:
- BEC_initialize(); BEC_finalize();
- BEC_request(); BEC_exchange();
- BEC_read(); BEC_write();

For dynamic shared space:
- BEC_joint_allocate(); BEC_joint_free();

For performance optimization:
- Communication & computation overlap (illustrated below): BEC_exchange_begin(); BEC_exchange_end();
- Special requests: BEC_request_to(); BEC_apply_to();
- Reusable bundle object: BEC_create_persistent_bundle(); BEC_delete_bundle(); BEC_repeat_requests();
- Directly getting data out of a bundle: BEC_get_index_in_bundle(); BEC_increment_index_in_bundle(); BEC_get_value_in_bundle();
- Local portion of a shared array: BEC_local_element_count(); BEC_local_element();

Miscellaneous:
- BEC_get_global_index(); BEC_proc_count(); BEC_my_proc_id();
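For example, the overlap pair can hide exchange latency behind independent computation. A sketch reusing the shared arrays from the earlier example (the slides do not show argument lists for these two calls, so the zero-argument forms here are assumptions):

    BEC_request(A[3]);
    BEC_request(B[8]);
    BEC_exchange_begin();   /* start the bundled exchange */
    /* ... computation that does not touch A[3] or B[8] ... */
    BEC_exchange_end();     /* wait for the bundled data to arrive */
    C[10] = A[3] + B[8];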
Some Future Directions

- Increased compatibility with MPI:
  - Lightweight "translation" of BEC global arrays into plain (local) MPI arrays and vice versa.
  - BEC embeddable into an existing MPI application (already possible in most instances).
- Hybrid BEC+MPI HPCCG:
  - Setup/exchange done with BEC.
  - Key kernels performed using plain local arrays.
  - Vector operations using either BEC or MPI collectives.
- Control structures: parallel loops.
- Virtualized processors: more virtual processors than the MPI size.
- Bundle as a full first-class object (already mostly there):
  - Simultaneous exchange_begin()'s.
  - Queuing and pipelining of iterations.