Integrated Performance Analysis in the Uintah Computational Framework
Steven G. Parker
Alan Morris, Scott Bardenhagen, Biswajit Banerjee, James Bigler, Jared Campbell, Curtis Larsen, Dav de St. Germain, Dawen Li, Divya Ramachandran, David Walter, Jim Guilkey, Todd Harman, John Schmidt, Jesse Hall, Jun Wang, Kurt Zimmerman, John McCorquodale, Misha Ovchinnikov, Jason Morgan, Nick Benson, Phil Sutton, Rajesh Rawat, Scott Morris, Seshadri Kumar, Steven Parker, Jennifer Spinti, Honglai Tan, Wing Yee, Wayne Witzel, Xiaodong Chen, Runing Zhang

The Beginning
C-SAFE funded in September 1997
- SCIRun PSE existed: shared memory only
- Combustion code existed: NOT parallel; steady state, NOT transient
- C-SAFE MPM code did not exist
- C-SAFE ICE code did not exist

ASCI

Example situation

C-SAFE Goal

Now: Scalable Simulations (September 2001)
- SCIRun
- Uintah: distributed memory, CCA-based component model; shared-memory visualization tools
- Arches: modular, parallel, transient
- C-SAFE MPM: modular, parallel, transient
- C-SAFE ICE: modular, parallel, transient; coupled with MPM

How did we get here?
- Designed and implemented a parallel component architecture (Uintah)
- Designed and implemented the Uintah Computational Framework (UCF) on top of the component architecture
(High-level architecture diagram: PSE components (Simulation Controller, Scheduler, Data Manager, Visualization, Post Processing and Analysis, Parallel Services, Resource Management, Performance Analysis) and non-PSE components (mixing, fluid, and subgrid models, chemistry database controller and databases, high-energy simulations, numerical solvers, material properties database, Blazer database), exchanging UCF data, control/light data, and checkpointing; the Problem Specification feeds the Simulation Controller, and the UCF is implicitly connected to all components.)

Introduction to Components

Good Fences Make Good Neighbors
- A component architecture is all about building (and sometimes enforcing) the fences
- Popular in the software industry (Microsoft COM, CORBA, Enterprise JavaBeans)
- Commercial component architectures are not suitable for scientific computing (the CCA Forum was organized to address this point)
- Visual programming is sometimes used to connect components together

Parallel Components
(Diagram: Fluid Model, Simulation Controller, MPM, Data Manager)
- Two ways to split up work: task-based or data-based (or a combination). Which is right?
- Key point: components, by definition, make local decisions
- However, parallelism (scalability) is a global decision

Uintah Scalability Challenges
- Wide range of computational loads, due to:
  - AMR
  - Particles in a subset of space
  - Cost of ODE solvers can vary spatially
  - Radiation models
- Architectural communication limitations

UCF Architecture Overview
Application programmers provide:
- A description of the computation (tasks and variables)
- Code to perform each task on a single Patch (a subregion of space); C++ or Fortran supported
The UCF uses this information to create a scalable parallel simulation.
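To make the slide's point concrete, here is a minimal, hypothetical sketch of what "a description of the computation" plus a per-Patch callback can look like. The types and names below (Task, Patch, DataWarehouse, the requires/computes lists, makeInterpolateTask) are illustrative assumptions in the spirit of the UCF, not its exact API:

```cpp
// Hypothetical sketch of a UCF-style task declaration (names assumed,
// not the exact Uintah API): the programmer describes what a task
// reads ("requires") and writes ("computes"), plus a per-Patch callback.
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

struct Patch { int id; };                 // a subregion of space
struct DataWarehouse { /* variable storage, omitted */ };

struct Task {
  std::string name;
  std::vector<std::string> requires_;     // input variables (dependencies)
  std::vector<std::string> computes_;     // output variables
  std::function<void(const Patch&, DataWarehouse&)> callback;
};

// The application component builds tasks; the scheduler wires them
// into a taskgraph and decides where and when each one runs.
Task makeInterpolateTask() {
  Task t;
  t.name = "MPM::interpolateParticlesToGrid";
  t.requires_ = {"particle.mass", "particle.velocity"};
  t.computes_ = {"grid.mass", "grid.velocity"};
  t.callback = [](const Patch& p, DataWarehouse& dw) {
    // numerical kernel for one Patch goes here (C++ or Fortran)
  };
  return t;
}

int main() {
  Task t = makeInterpolateTask();
  std::printf("declared task: %s\n", t.name.c_str());
}
```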

How Does It Work?
(Dataflow diagram: an XML Problem Specification feeds the Simulation Controller; the Simulation (one of Arches, ICE, MPM, MPMICE, MPMArches, ...) and the Data Archiver describe Tasks to the Scheduler; the Scheduler issues Callbacks and MPI assignments, guided by a Configuration from the Load Balancer.)

How does the scheduler work?
- The Scheduler component uses the description of the computation to create a taskgraph
- The taskgraph gets mapped to processing resources using the Load Balancer component

What is a graph?

CS Graphs
(Diagram: vertices/nodes A, B, C, D joined by edges)
Taskgraph: a graph where the nodes are tasks (jobs) to be performed, and the edges are dependencies between those tasks.
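A small self-contained illustration (generic C++, not Uintah code) of executing a taskgraph in dependency order, using Kahn's algorithm for topological sorting:

```cpp
// Self-contained illustration (not Uintah code): run a taskgraph's
// nodes in an order that respects the dependency edges, using
// Kahn's algorithm for topological sorting.
#include <cstdio>
#include <map>
#include <queue>
#include <string>
#include <vector>

int main() {
  // Edges point from a task to the tasks that depend on it.
  std::map<std::string, std::vector<std::string>> deps = {
      {"A", {"B", "C"}},   // B and C need A's output
      {"B", {"D"}},
      {"C", {"D"}},        // D needs both B and C
      {"D", {}}};

  std::map<std::string, int> indegree;
  for (auto& [task, outs] : deps) indegree[task];  // ensure key exists
  for (auto& [task, outs] : deps)
    for (auto& t : outs) ++indegree[t];

  std::queue<std::string> ready;   // tasks whose inputs are all satisfied
  for (auto& [task, d] : indegree)
    if (d == 0) ready.push(task);

  while (!ready.empty()) {
    std::string t = ready.front(); ready.pop();
    std::printf("run %s\n", t.c_str());  // a scheduler would dispatch here
    for (auto& next : deps[t])
      if (--indegree[next] == 0) ready.push(next);
  }
}
```

Any order the algorithm emits respects the edges, which is exactly the freedom a scheduler exploits when it maps tasks to processors.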

Example Taskgraphs

Taskgraph advantages
- Can accommodate flexible integration needs
- Can accommodate a wide range of unforeseen workloads
- Can accommodate a mix of static and dynamic load balance
- Helps manage the complexity of a mixed threads/MPI programming model
- Allows pieces (including the scheduler) to evolve independently

Looking forward to AMR
- The entire UCF infrastructure is designed around complex meshes
- Able to achieve scalability like a structured-grid code
- Some codes can currently handle irregular boundaries

Achieving scalability
- Parallel taskgraph implementation
- Use 125 (of 128) processors per box; the remaining 3 perform O/S functions
- The 125 processors are organized into a 5x5x5 cube; multiple boxes are handled by abutting cubes
- The Nirvana load balancer performs this mapping for regular-grid problems
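As a sketch of the index arithmetic behind such a mapping (the x-fastest layout and the direction in which boxes abut are assumptions for illustration):

```cpp
// Illustrative sketch (layout convention assumed): map an MPI rank on a
// 128-CPU box to coordinates in the 5x5x5 compute cube, and offset whole
// boxes so abutted cubes tile a larger regular grid.
#include <cstdio>

struct Coord { int x, y, z; };

// Ranks 0..124 of a box fill the 5x5x5 cube in x-fastest order.
Coord cubeCoord(int rankInBox) {
  return { rankInBox % 5, (rankInBox / 5) % 5, rankInBox / 25 };
}

// Boxes themselves are laid out along a grid; here, a 1-D row of boxes
// abutted along x for simplicity.
Coord globalCoord(int box, int rankInBox) {
  Coord c = cubeCoord(rankInBox);
  c.x += 5 * box;
  return c;
}

int main() {
  Coord c = globalCoord(/*box=*/1, /*rankInBox=*/124);
  std::printf("(%d, %d, %d)\n", c.x, c.y, c.z);  // prints (9, 4, 4)
}
```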

Performance Analysis Tools
Integrated tools:
- TAU calls describe costs for each Task
- Post-processing tools for:
  - Average/standard-deviation timings
  - Critical-path and near-critical-path analysis
  - Performance regression testing
  - Load imbalance
- TAU/VAMPIR analysis
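The per-task timing uses standard TAU timer macros; a simplified sketch of how a scheduler might wrap each task callback (the surrounding runTask function is hypothetical, and building requires a TAU installation):

```cpp
// Sketch of per-task TAU instrumentation using TAU's standard timer
// macros; how Uintah wires this into its scheduler is simplified here.
#include <TAU.h>

void runTask(const char* taskName /*, patch, data warehouse, ... */) {
  // One TAU timer per task name, so profiles report cost per Task.
  TAU_PROFILE_TIMER(t, taskName, "", TAU_USER);
  TAU_PROFILE_START(t);
  // ... invoke the task's per-Patch callback here ...
  TAU_PROFILE_STOP(t);
}
```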

Integration of TAU from Oregon
- Working with Allen Malony and friends to help with the integration
- Have identified bottlenecks, which influenced the design of the new scalable scheduler
- Have identified numerous ways to collaborate in the future
TAU: Tuning and Analysis Utilities

MPM Simulation 27 processors

Arches Simulation 40 of 125 processors

XPARE
- Performance tuning is typically done only for final products, or sometimes just once or twice during development
- XPARE enables performance analysis throughout the development process:
  - Retrospective analysis becomes possible
  - Understanding the impact of design decisions
  - More informed optimization later

XPARE
- Regression analyzer: alerts interested parties to violations of the thresholds
- Comparison tool: used by the automation system to report violations; can also be run manually
- Integrated into a weekly testing harness for Uintah / C-SAFE
- Performance comparisons across compiler flags, O/S upgrades, and platforms
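A minimal sketch of the kind of threshold check such a comparison tool performs; the data layout, task names, and 10% tolerance below are assumptions for illustration, not XPARE's actual format:

```cpp
// Minimal sketch of a performance-regression check in the spirit of
// XPARE's comparison tool (data layout and threshold policy assumed).
#include <cstdio>
#include <map>
#include <string>

int main() {
  // Mean per-task timings (seconds) from a baseline run and a new run.
  std::map<std::string, double> baseline = {
      {"MPM::interpolateParticlesToGrid", 12.0},
      {"ICE::advectAndAdvanceInTime", 8.5}};
  std::map<std::string, double> current = {
      {"MPM::interpolateParticlesToGrid", 14.1},
      {"ICE::advectAndAdvanceInTime", 8.4}};

  const double threshold = 0.10;  // alert if a task slows by more than 10%
  int violations = 0;
  for (auto& [task, base] : baseline) {
    double now = current[task];
    double slowdown = (now - base) / base;
    if (slowdown > threshold) {
      std::printf("ALERT %s: %.1f%% slower (%.2fs -> %.2fs)\n",
                  task.c_str(), 100.0 * slowdown, base, now);
      ++violations;
    }
  }
  return violations == 0 ? 0 : 1;  // nonzero exit flags a regression
}
```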

XPARE: eXPeriment Alerting and REporting
Alan Morris – Utah
Allen D. Malony – Oregon
Sameer S. Shende – Oregon
J. Davison de St. Germain – Utah
Steven G. Parker – Utah

Load balancing
- The taskgraph provides a nice mechanism for flexible load-balancing algorithms
- To date, simple static mechanisms have sufficed
- But we are outgrowing those
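For reference, the simplest static policy amounts to a fixed patch-to-processor map, e.g. round-robin; a sketch (illustrative only, not the actual Uintah load balancer):

```cpp
// Sketch of the simplest static load-balancing policy: assign patches
// to processors round-robin, fixed for the whole run (illustrative only).
#include <cstdio>

int assignPatch(int patchId, int numProcs) {
  return patchId % numProcs;  // fixed mapping, no runtime cost feedback
}

int main() {
  const int numProcs = 4;
  for (int patch = 0; patch < 10; ++patch)
    std::printf("patch %d -> rank %d\n", patch, assignPatch(patch, numProcs));
}
```

Such a mapping ignores per-patch cost variation (particles, AMR, spatially varying ODE solver cost), which is why the static mechanisms are being outgrown.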

Real-world scalability
- Parallel I/O
- Parallel compiles
- A production run obtained a speedup of 1.95 going from 500 to 1000 processors
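For context: doubling the processor count would ideally double throughput, so a speedup of 1.95 from 500 to 1000 processors corresponds to a relative parallel efficiency of 1.95 / 2 = 0.975, about 97.5% of ideal scaling over that range.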

New scalability - MPM

Breakdown

Mixed MPI/thread scheduler
- Most ASCI platforms have SMP nodes
- Multi-threading and asynchronous MPI could give us a ~2X speed improvement
- The SGI MPI implementation is supposedly thread-safe, but...

Network traffic into the Utah Visual Supercomputing Center
(Charts: 2-hour average and 1-day average)

Volume Rendering

MPM Simulation
6.8 million particles, 22 timesteps, interactively visualized using the real-time ray tracer (6-10 fps)

RTRT with MPM Data

Other SCIRun Applications

Geo Sciences

Conclusions
- Holistic performance approach: architecture and tools
- Scalability achieved; now we can keep it