Automatic Differentiation: Introduction
Automatic differentiation (AD) is a technology for transforming a subprogram that computes some function into a subprogram that computes the derivatives of that function.
Derivatives are used in optimization, nonlinear solvers, sensitivity analysis, and uncertainty quantification.
The forward mode of AD is efficient for problems with few independent variables or for Jacobian-vector products Jv (see the sketch below).
The reverse mode of AD is efficient for problems with few dependent variables or for Jᵀv products.
The efficiency of the generated code depends on the sophistication of the underlying compiler analysis and combinatorial algorithms.
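To make the forward mode concrete, here is a minimal hand-written sketch in the dual-number (operator-overloading) style; it is illustrative only and is not output from the tools discussed below.

```cpp
#include <cmath>
#include <cstdio>

// Minimal forward-mode AD via a "dual number": each value carries its
// derivative with respect to one chosen independent variable.
struct Dual {
    double val;  // function value
    double dot;  // derivative (tangent) value
};

Dual operator*(Dual a, Dual b) {   // product rule
    return {a.val * b.val, a.dot * b.val + a.val * b.dot};
}
Dual operator+(Dual a, Dual b) {   // sum rule
    return {a.val + b.val, a.dot + b.dot};
}
Dual sin(Dual a) {                 // chain rule for an intrinsic
    return {std::sin(a.val), std::cos(a.val) * a.dot};
}

// f(x1, x2) = x1*x2 + sin(x1); one forward sweep per independent variable.
Dual f(Dual x1, Dual x2) { return x1 * x2 + sin(x1); }

int main() {
    // Seed dot = 1 on x1 to get df/dx1; a second sweep with the seed on
    // x2 would give df/dx2.
    Dual x1{2.0, 1.0}, x2{3.0, 0.0};
    Dual y = f(x1, x2);
    std::printf("f = %g, df/dx1 = %g\n", y.val, y.dot);
}
```

Seeding the dot fields with an arbitrary tangent vector instead of a unit vector yields the Jacobian-vector product Jv in a single sweep, which is why forward mode pays off when the number of independents is small.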

AD: Current Capabilities
Fortran 77: ADIFOR 2.0/3.0
–Robust, mature tool with excellent language coverage
–Excellent compiler analysis
–Efficient forward mode (small number of independents)
–Adequate reverse mode (small number of dependents)
C/C++: ADIC 2.0
–Semi-mature tool with full C language coverage
–Sophisticated differentiation algorithms
–Efficient forward mode
Fortran 90: OpenAD/F
–New tool with partial language coverage
–Sophisticated differentiation algorithms
–Accurate and novel compiler analysis
–Innovative templating mechanism
–Efficient forward and reverse modes
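For contrast with the forward-mode sketch above, here is a hand-written reverse-mode (adjoint) analogue for the same function. It illustrates the kind of code these tools generate in reverse mode, but it is not their actual output.

```cpp
#include <cmath>

// f(x1, x2) = x1*x2 + sin(x1), differentiated in reverse mode by hand:
// a forward sweep computes and stores intermediates, then a reverse
// sweep propagates adjoints ("bar" values) from the output back to the
// inputs, yielding the full gradient in one pass regardless of the
// number of inputs.
void f_and_grad(double x1, double x2,
                double &y, double &x1bar, double &x2bar) {
    // Forward sweep.
    double t = x1 * x2;
    y = t + std::sin(x1);

    // Reverse sweep: seed the output adjoint, then apply the chain rule
    // to each statement in reverse order.
    double ybar = 1.0;
    double tbar = ybar;                 // from y = t + sin(x1)
    x1bar = ybar * std::cos(x1);        // sin(x1) term
    x1bar += tbar * x2;                 // from t = x1 * x2
    x2bar = tbar * x1;
}
```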

AD: Application Highlight
Sensitivity of flow through Drake Passage to bottom topography, using the MIT shallow water model:

                          Runtime (m:s)   Ratio    Memory
Simulation alone          2:20            1.0      —
Basic adjoint             143:…           …        … M
Improved checkpointing    141:…           …        … M
Add compiler analysis     21:…            …        … M
Finite differences        23 days         14,400   —

AD: Future Capabilities
C/C++: ADIC 2.x
–Enhanced support for C++ (basic templating, operator overloading)
Fortran 90: OpenAD/F
–Improved language coverage (user-defined types, pointers, etc.)
Both tools
–New differentiation algorithms
–New checkpointing mechanisms
–Advanced compiler analysis
–Efficient forward and reverse modes
–Integration with CSCAPES coloring algorithms
–Ease of use through integration with the PETSc and Zoltan toolkits

Load Balancing: Introduction
Goal: provide software and algorithms for load balancing (partitioning) that can easily be used by parallel applications.
Load balancing distributes work evenly among processors while minimizing communication cost, reducing parallel run time (a simple imbalance metric is sketched below).
Static load balancing (often called “partitioning”)
–Application computation and communication patterns do not change
–Partition and distribute data once
Dynamic load balancing
–In dynamic or adaptive applications, computation and communication change over time
–Load balancing should be invoked at certain intervals
–Try to reduce data migration (the amount of application data to move)
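As a minimal, hypothetical illustration of the quantity a balancer drives down, the sketch below computes the load-imbalance ratio (maximum load over average load); a perfectly balanced partition has ratio 1.0. It is not part of any CSCAPES tool.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Load imbalance of a partition: max per-process load divided by the
// average load. 1.0 is perfect balance; 2.0 means the slowest process
// carries twice the average work.
double imbalance(const std::vector<double> &load_per_proc) {
    double maxload = *std::max_element(load_per_proc.begin(),
                                       load_per_proc.end());
    double avg = std::accumulate(load_per_proc.begin(),
                                 load_per_proc.end(), 0.0) /
                 load_per_proc.size();
    return maxload / avg;
}
```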

Load Balancing: Current Capabilities
Zoltan: software toolkit for parallel data management and load balancing
–Available at …
Collection of many load-balancing methods
–Geometric: RCB, space-filling curves
–Graph and hypergraph partitioning
Data-structure-neutral interface (see the sketch below)
–Call-back functions
–Single, common interface for many methods allows applications to “plug and play”
Portable, parallel code (MPI)
–Used in many DOE and Sandia applications
–Can run on thousands of processors
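The sketch below shows how an application might plug into Zoltan's callback interface. It follows Zoltan's documented C API (Zoltan_Create, Zoltan_Set_Param, the query-function setters, Zoltan_LB_Partition), but the Mesh structure and the callbacks are hypothetical stand-ins for an application's own data, and details may vary across Zoltan versions.

```cpp
#include <mpi.h>
#include <zoltan.h>

// Hypothetical application data: each process owns some points in 2D.
struct Mesh { int num_local; double *x, *y; unsigned int *gids; };

// Query callbacks: Zoltan asks the application for its objects instead
// of imposing a data structure (the "data-structure-neutral interface").
static int num_obj(void *data, int *ierr) {
    *ierr = ZOLTAN_OK;
    return static_cast<Mesh *>(data)->num_local;
}
static void obj_list(void *data, int, int, ZOLTAN_ID_PTR gids,
                     ZOLTAN_ID_PTR lids, int, float *, int *ierr) {
    Mesh *m = static_cast<Mesh *>(data);
    for (int i = 0; i < m->num_local; i++) { gids[i] = m->gids[i]; lids[i] = i; }
    *ierr = ZOLTAN_OK;
}
static int num_geom(void *, int *ierr) { *ierr = ZOLTAN_OK; return 2; }
static void geom_multi(void *data, int, int, int nobj, ZOLTAN_ID_PTR,
                       ZOLTAN_ID_PTR lids, int, double *vec, int *ierr) {
    Mesh *m = static_cast<Mesh *>(data);
    for (int i = 0; i < nobj; i++) {
        vec[2 * i]     = m->x[lids[i]];
        vec[2 * i + 1] = m->y[lids[i]];
    }
    *ierr = ZOLTAN_OK;
}

// Assumes MPI_Init and Zoltan_Initialize have already been called.
void rebalance(Mesh *mesh) {
    struct Zoltan_Struct *zz = Zoltan_Create(MPI_COMM_WORLD);
    Zoltan_Set_Param(zz, "LB_METHOD", "RCB");  // geometric recursive bisection

    Zoltan_Set_Num_Obj_Fn(zz, num_obj, mesh);
    Zoltan_Set_Obj_List_Fn(zz, obj_list, mesh);
    Zoltan_Set_Num_Geom_Fn(zz, num_geom, mesh);
    Zoltan_Set_Geom_Multi_Fn(zz, geom_multi, mesh);

    int changes, ngid, nlid, nimp, nexp;
    ZOLTAN_ID_PTR imp_g, imp_l, exp_g, exp_l;
    int *imp_p, *imp_tp, *exp_p, *exp_tp;
    // Zoltan returns import/export lists; migrating the data is up to
    // the application (or Zoltan's optional migration tools).
    Zoltan_LB_Partition(zz, &changes, &ngid, &nlid,
                        &nimp, &imp_g, &imp_l, &imp_p, &imp_tp,
                        &nexp, &exp_g, &exp_l, &exp_p, &exp_tp);
    /* ... send each exported object to exp_p[i] ... */
    Zoltan_LB_Free_Part(&imp_g, &imp_l, &imp_p, &imp_tp);
    Zoltan_LB_Free_Part(&exp_g, &exp_l, &exp_p, &exp_tp);
    Zoltan_Destroy(&zz);
}
```

Because the interface is callback-based, switching from RCB to, say, hypergraph partitioning is a one-line parameter change rather than a data-structure rewrite.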

Load Balancing: Applications
Large variety of applications, requirements, and data structures:
–Multiphysics simulations
–Linear solvers and preconditioners (Ax = b)
–Adaptive mesh refinement
–Crash simulations
–Particle methods
–Parallel electronics networks
–Cell modeling

Load Balancing: Future Capabilities
Scalable hypergraph partitioning
–Hypergraphs accurately model communication volume
–We aim to improve scalability to thousands of processors
2d matrix partitioning
–Reduce communication compared to the standard 1d distribution (see the sketch below)
Multiconstraint partitioning
–Multi-physics simulations
Complex-objective partitioning
–E.g., simultaneously balance computation and memory
Parallel sparse matrix ordering (nested dissection)
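To illustrate the 1d-vs-2d point, here is a hypothetical owner mapping for the entries of an n×n matrix on p processes. Under a 1d row distribution, a matrix-vector product may need vector entries from every other process; under a 2d block distribution on a √p×√p grid, each process communicates only within its block row and column, reducing per-process volume from O(n) to O(n/√p) in the dense worst case.

```cpp
#include <cmath>

// Hypothetical owner maps for entry (i, j) of an n-by-n matrix on p
// processes (p assumed to be a perfect square for the 2d case).

// 1d row-block distribution: the owner depends on i only, so a matvec
// may need x-entries owned by every other process.
int owner_1d(int i, int n, int p) { return i / ((n + p - 1) / p); }

// 2d block distribution on a sqrt(p)-by-sqrt(p) process grid: the owner
// depends on both i and j, and communication stays within one process
// row and one process column.
int owner_2d(int i, int j, int n, int p) {
    int q = static_cast<int>(std::sqrt(static_cast<double>(p)));
    int b = (n + q - 1) / q;            // block size per grid dimension
    return (i / b) * q + (j / b);
}
```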

Reordering Transformations: Introduction
Irregular memory access patterns make performance sensitive to data and iteration orders.
Run-time reordering transformations schedule data accesses and iterations to maximize performance.
Preliminary work on reordering heuristics shows that hypergraph models outperform graph models.
Full sparse tiling: a new inspector/executor strategy that exploits inter-iteration locality.
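As a concrete example of the inspector/executor pattern, here is a minimal sketch of consecutive packing (CPack, one of the data-reordering heuristics listed below): the inspector scans the access pattern once and packs data in first-use order, so the executor's irregular accesses become more nearly sequential. This is an illustrative reimplementation, not the project's code.

```cpp
#include <vector>

// Inspector: given accesses data[idx[i]] in iteration order, build a
// permutation that packs data elements consecutively in first-use order.
std::vector<int> cpack_permutation(const std::vector<int> &idx, int ndata) {
    std::vector<int> newpos(ndata, -1);
    int next = 0;
    for (int i : idx)                    // first touch wins
        if (newpos[i] == -1) newpos[i] = next++;
    for (int d = 0; d < ndata; d++)      // pack never-touched elements last
        if (newpos[d] == -1) newpos[d] = next++;
    return newpos;                       // old index -> new index
}

// Executor setup: apply the permutation to the data and the index array;
// the computation itself is unchanged but now enjoys better locality.
void reorder(std::vector<double> &data, std::vector<int> &idx,
             const std::vector<int> &newpos) {
    std::vector<double> packed(data.size());
    for (size_t d = 0; d < data.size(); d++) packed[newpos[d]] = data[d];
    data.swap(packed);
    for (int &i : idx) i = newpos[i];
}
```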

RT: Current Capabilities
Open-source package implementing several data and iteration reordering heuristics: Data_N_Comp_Reorder
Data reordering heuristics
–Breadth-first search (graph-based)
–Consecutive packing
–Partitioning (graph-based)
–Breadth-first search (hypergraph-based)
–Consecutive packing (hypergraph-based)
–Partitioning (hypergraph-based)
Iteration reordering heuristics
–Breadth-first search (hypergraph-based)
–Lexicographical sorting and various approximations
–Consecutive packing (hypergraph-based)
–Partitioning (hypergraph-based)
Full sparse tiling implementation for model problems

RT: Application Highlight
Reordering for a mesh-quality improvement code (FeasNewt, T. Munson).
Hypergraph-BFS data reordering coupled with CPack iteration reordering offers the best performance.
Reordering leads to performance within 90% of the memory-bandwidth limit for sparse matrix-vector products.

RT: Future Capabilities
New hypergraph-based run-time reordering transformations
Comparison between hypergraph-based and bipartite-graph-based run-time reordering transformations
Hypergraph partitioners for load balancing modified to work well for reordering transformations
Hierarchical full sparse tiling for hierarchical parallel systems

Graph Coloring and Matching: Introduction
Graph coloring deals with partitioning a set of binary-related objects into few groups of “independent” objects.
Sparsity exploitation in the computation of Jacobians and Hessians leads to a variety of graph coloring problems (see the sketch below). Sources of problem variation:
–Unsymmetric (Jacobian) vs. symmetric (Hessian) matrix
–Direct vs. substitution method
–Uni- vs. bi-directional partitioning

Matrix      1d partition          2d partition          Method
Jacobian    Distance-2 coloring   Star bicoloring       Direct
Hessian     Star coloring         NA                    Direct
Jacobian    NA                    Acyclic bicoloring    Substitution
Hessian     Acyclic coloring      NA                    Substitution

Matching deals with finding a “large” set of independent edges in a graph.
Variant matching problems occur in load balancing, process scheduling, linear solvers, preconditioners, etc.
Orthogonal sources of variation in matching problems:
–Bipartite vs. general graphs
–Cardinality vs. weighted problems
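To show the Jacobian case concretely, here is a minimal greedy sketch of 1d column partitioning (the distance-2 coloring entry in the table above): two columns may share a color only if they are structurally orthogonal, i.e., no row has a nonzero in both. This is an illustrative greedy heuristic, not the CSCAPES implementation.

```cpp
#include <vector>

// Greedy 1d partition of Jacobian columns into structurally orthogonal
// groups. Input: for each column, its nonzero row indices. Output: a
// color per column; the number of colors is the number of seed vectors
// needed to recover the whole Jacobian.
std::vector<int> color_columns(const std::vector<std::vector<int>> &col_rows,
                               int nrows) {
    int ncols = static_cast<int>(col_rows.size());
    std::vector<int> color(ncols, -1);
    // row_has[c][r] is true if some column of color c has a nonzero in row r.
    std::vector<std::vector<bool>> row_has;
    for (int j = 0; j < ncols; j++) {
        int c = 0;
        for (;; c++) {                       // smallest conflict-free color
            if (c == static_cast<int>(row_has.size()))
                row_has.emplace_back(nrows, false);
            bool ok = true;
            for (int r : col_rows[j])
                if (row_has[c][r]) { ok = false; break; }
            if (ok) break;
        }
        color[j] = c;
        for (int r : col_rows[j]) row_has[c][r] = true;
    }
    return color;
}
```

Each color then contributes one seed vector (the sum of the unit vectors of its columns) to forward-mode AD or finite differencing, so the number of Jv products needed drops from the number of columns to the number of colors.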

GCM: Current Capabilities
Coloring
Serial:
–Developed novel greedy algorithms for the distance-1, distance-2, star, and acyclic coloring problems
–A package implementing these algorithms and corresponding variant ordering routines is available
Parallel:
–Developed a scheme for parallelizing greedy coloring algorithms on distributed-memory computers
–MPI implementations of distance-1 and distance-2 coloring are available via Zoltan
Matching
–Algorithms that compute optimal solutions to matching problems run in polynomial time, but are slow and difficult to parallelize
–High-quality approximate solutions can be computed in (near) linear time, and approximation techniques make parallelization easier (see the sketch below)
–Developed fast approximation algorithms for several matching problems
–Efficient implementations of exact matching algorithms are available
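As a concrete instance of the approximation approach, here is a minimal sketch of the classic greedy heuristic for weighted matching, which runs in near-linear time (dominated by the sort) and is guaranteed to achieve at least half of the optimal weight. It stands in for, but is not, the algorithms developed in CSCAPES.

```cpp
#include <algorithm>
#include <vector>

struct Edge { int u, v; double w; };

// Greedy 1/2-approximation for maximum weight matching: scan edges in
// decreasing weight order and take an edge whenever both endpoints are
// still unmatched. O(m log m) time from the sort.
std::vector<Edge> greedy_matching(std::vector<Edge> edges, int nverts) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge &a, const Edge &b) { return a.w > b.w; });
    std::vector<bool> matched(nverts, false);
    std::vector<Edge> matching;
    for (const Edge &e : edges)
        if (!matched[e.u] && !matched[e.v]) {
            matching.push_back(e);
            matched[e.u] = matched[e.v] = true;
        }
    return matching;
}
```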

GCM: Application Highlights
Coloring
–Automatic differentiation (sparse Jacobians and Hessians)
–Parallel computation (discovery of concurrency, data migration)
–Frequency allocation
–Register allocation in compilers, etc.
Matching
–Numerical preprocessing of sparse linear systems: permute a matrix so that its diagonal or block diagonal is heavy
–Block triangular decomposition of sparse linear systems: decompose a system of equations into smaller systems
–Graph partitioning: guide the coarsening phase of multilevel graph partitioning methods

GCM: Future Capabilities
Develop and implement star and acyclic bicoloring algorithms for Jacobian computation
Develop parallel algorithms for the various coloring problems (distance-1, distance-2, star, acyclic) that scale to thousands of processors
Integrate coloring software with automatic differentiation tools
Develop petascale parallel matching algorithms based on approximation techniques