Component Frameworks:

Component Frameworks: Laxmikant (Sanjay) Kale, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign. http://charm.cs.uiuc.edu

Motivation. Parallel computing in science and engineering: competitive advantage, pain in the neck, necessary evil. It is not so difficult, but it is tedious and error-prone, with new issues: race conditions, load imbalances, modularity in the presence of concurrency, ... Just have to bite the bullet, right?

But wait... Parallel computation structures: the set of parallel applications is diverse and complex, yet the underlying parallel data structures and communication structures are small in number: structured and unstructured grids, trees (AMR, ...), particles, and interactions among these, in space and time. One should be able to reuse those and avoid doing the same parallel programming again and again: domain-specific frameworks.

A Unique Twist. Many apps require dynamic load balancing, so reuse load re-balancing strategies: it should be possible to separate load balancing code from application code. This strategy is embodied in Charm++: express the program as a collection of interacting entities (objects), and let the system control mapping to processors.
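To make the model concrete, here is a minimal Charm++-style sketch (our illustration, using standard Charm++ idioms rather than code from the slides): the work is expressed as an over-decomposed 1D array of chare objects, the runtime decides where each element lives, and a pup() routine would let the load balancer migrate elements. The module and class names are invented for the example.

    // overdecomp.ci -- interface file (schematic)
    mainmodule overdecomp {
      readonly CProxy_Main mainProxy;
      mainchare Main {
        entry Main(CkArgMsg *m);
        entry void done();
      };
      array [1D] Chunk {
        entry Chunk();
        entry void work();
      };
    };

    // overdecomp.C
    #include "overdecomp.decl.h"
    CProxy_Main mainProxy;

    class Main : public CBase_Main {
      int pending;
    public:
      Main(CkArgMsg *m) {
        delete m;
        mainProxy = thisProxy;
        pending = 4 * CkNumPes();                 // over-decompose: many more chunks than PEs
        CProxy_Chunk chunks = CProxy_Chunk::ckNew(pending);
        chunks.work();                            // asynchronous method invocation (broadcast)
      }
      void done() { if (--pending == 0) CkExit(); }
    };

    class Chunk : public CBase_Chunk {
    public:
      Chunk() {}
      Chunk(CkMigrateMessage *) {}                // needed so the runtime can migrate this object
      void work() {
        CkPrintf("chunk %d running on PE %d\n", thisIndex, CkMyPe());
        mainProxy.done();
      }
      // A pup() method describing the chunk's data would allow the load
      // balancer to move it between processors during the run.
    };
    #include "overdecomp.def.h"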

Multi-partition decomposition. Idea: divide the computation into a large number of pieces, independent of (and typically larger than) the number of processors, and let the system map entities to processors.

Object-based Parallelization: the user is concerned only with the interaction between objects (the user view); mapping the objects onto processors is the system's implementation.

Charm Component Frameworks: object-based decomposition; reuse of specialized parallel structures; load balancing; automatic checkpointing; flexible use of clusters; out-of-core execution.

Goals for Our Frameworks. Ease of use: C++ and Fortran versions; retain the “look-and-feel” of sequential programs; provide commonly needed features; application-driven development; portability. Performance: low overhead; dynamic load balancing via Charm++; cache performance.

Current Set of Component Frameworks. FEM / unstructured meshes: “mature”, with several applications already. Multiblock: multiple structured grids; new, but very promising. AMR: oct- and quad-trees.

Using the Load Balancing Framework (diagram): automatic conversion from MPI, cross-module interpolation, structured grids, FEM, MPI-on-Charm, Irecv+; a framework path and a migration path feed the load database + balancer; all layered on Charm++ and Converse.

Finite Element Framework Goals: hide the parallel implementation in the runtime system; allow adaptive parallel computation and dynamic automatic load balancing; leave physics and numerics to the user; present a clean, “almost serial” interface.

Serial code (entire mesh):
    begin time loop
      compute forces
      update node positions
    end time loop

Framework code (one mesh partition):
    begin time loop
      compute forces
      communicate shared nodes
      update node positions
    end time loop

FEM Framework: Responsibilities. The FEM application handles initialization, registration of nodal attributes, loops over elements, and finalization. The FEM framework handles updates of nodal properties and reductions over nodes or partitions; underneath sit the partitioner and combiner (using METIS), I/O, and Charm++ (dynamic load balancing, communication).

Structure of an FEM Program. Serial init() and finalize() subroutines do serial I/O: read the serial mesh and call FEM_Set_Mesh. A parallel driver() main routine, one per partitioned mesh chunk, runs in a thread; its time loop looks like the serial version, doing the computation and calling FEM_Update_Field. The framework handles partitioning, parallelization, and communication.

Structure of an FEM Application (diagram): init(), then multiple driver() routines that repeatedly update shared nodes with one another, then finalize().

Framework Calls. FEM_Set_Mesh: called from initialization to set the serial mesh; the framework partitions the mesh into chunks. FEM_Create_Field: registers a node data field with the framework; supports user data types. FEM_Update_Field: updates a node data field across all processors; handles all parallel communication. Other parallel calls (reductions, etc.).
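What FEM_Update_Field has to accomplish can be shown with a small self-contained sketch (our illustration, not the framework's code): nodes shared by several mesh partitions hold only partial values locally, and the update combines them so every partition sees the same total. The Partition struct and updateSharedNodes function are invented for the example; in the framework the combination happens across processors.

    #include <cstdio>
    #include <map>
    #include <vector>

    // One mesh partition: values on its shared nodes plus their global ids.
    struct Partition {
      std::vector<int> sharedGlobalIds;   // global id of each shared local node
      std::vector<double> sharedValues;   // partial values computed locally
    };

    // Combine shared-node contributions across partitions (done serially
    // here for illustration; the framework does this in parallel).
    void updateSharedNodes(std::vector<Partition>& parts) {
      std::map<int, double> total;
      for (auto& p : parts)
        for (size_t i = 0; i < p.sharedGlobalIds.size(); ++i)
          total[p.sharedGlobalIds[i]] += p.sharedValues[i];
      for (auto& p : parts)
        for (size_t i = 0; i < p.sharedGlobalIds.size(); ++i)
          p.sharedValues[i] = total[p.sharedGlobalIds[i]];
    }

    int main() {
      // Two partitions share global node 7: each computed part of the force.
      std::vector<Partition> parts = { {{7}, {0.5}}, {{7}, {1.5}} };
      updateSharedNodes(parts);
      std::printf("node 7 on partition 0: %g\n", parts[0].sharedValues[0]); // prints 2
    }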

Dendritic Growth: studies the evolution of solidification microstructures using a phase-field model computed on an adaptive finite element grid. Adaptive refinement and coarsening of the grid involves re-partitioning.

Crack Propagation: an explicit FEM code. Zero-volume cohesive elements are inserted near the crack; as the crack propagates, more cohesive elements are added near the crack, which leads to severe load imbalance. Figure: decomposition into 16 chunks (left) and 128 chunks, 8 per PE (right); the middle area contains cohesive elements. Pictures: S. Breitenfeld and P. Geubelle.

Crack Propagation (figure): decomposition into 16 chunks (left) and 128 chunks, 8 per PE (right). The middle area contains cohesive elements; both decompositions were obtained using METIS. Pictures: S. Breitenfeld and P. Geubelle.

“Overhead” of Multipartitioning (performance figure).

Load balancer in action (figure): automatic load balancing in crack propagation. 1. Elements added; 2. load balancer invoked; 3. chunks migrated.

Scalability of FEM Framework (performance figure).

Scalability of FEM Framework: 3.1M elements, 1.5M nodes, on ASCI Red. Time on 1 processor: 8.24 s; on 1024 processors: 7.13 ms.

Parallel Collision Detection: detect collisions (intersections) between objects scattered across processors. Approach, based on Charm++ arrays: overlay a regular, sparse 3D grid of voxels (boxes); send objects to all voxels they touch; collide voxels independently and collect results; leave collision response to user code.
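The voxel overlay can be sketched in a few lines of self-contained C++ (our illustration; the real library distributes the voxels as Charm++ array elements across processors rather than keeping them in one map): every object is sent to each voxel its bounding box touches, and candidate pairs are then tested only within a voxel.

    #include <cstdio>
    #include <map>
    #include <tuple>
    #include <vector>

    struct Box { double lo[3], hi[3]; int id; };

    bool overlap(const Box& a, const Box& b) {
      for (int d = 0; d < 3; ++d)
        if (a.hi[d] < b.lo[d] || b.hi[d] < a.lo[d]) return false;
      return true;
    }

    int main() {
      const double voxel = 1.0;                      // voxel edge length
      std::vector<Box> objs = { {{0.1,0.1,0.1},{0.4,0.4,0.4},0},
                                {{0.3,0.3,0.3},{0.6,0.6,0.6},1},
                                {{2.0,2.0,2.0},{2.2,2.2,2.2},2} };
      // Send each object to every voxel its bounding box touches.
      std::map<std::tuple<int,int,int>, std::vector<int>> voxels;
      for (const Box& b : objs)
        for (int x = int(b.lo[0]/voxel); x <= int(b.hi[0]/voxel); ++x)
          for (int y = int(b.lo[1]/voxel); y <= int(b.hi[1]/voxel); ++y)
            for (int z = int(b.lo[2]/voxel); z <= int(b.hi[2]/voxel); ++z)
              voxels[{x,y,z}].push_back(b.id);
      // Collide each voxel independently; collision response is left to the
      // user.  (A pair sharing several voxels would be reported more than
      // once; a real implementation deduplicates such pairs.)
      for (const auto& [key, ids] : voxels)
        for (size_t i = 0; i < ids.size(); ++i)
          for (size_t j = i + 1; j < ids.size(); ++j)
            if (overlap(objs[ids[i]], objs[ids[j]]))
              std::printf("collision: %d %d\n", ids[i], ids[j]);
    }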

Collision Detection Speed: O(n) serial performance on a single Linux PC, about 2 µs per polygon. Good speedups to 1000s of processors: ASCI Red, 65,000 polygons per processor scaling problem (up to 100 million polygons).

FEM: Future Plans. Better support for implicit computations: interfaces to solvers, e.g. ESI (PETSc), ScaLAPACK, or POOMA's linear solvers; better discontinuous Galerkin method support. Fully distributed startup. Fully distributed insertion, eliminating the serial bottleneck in the insertion phase. An abstraction to allow multiple active meshes, needed for multigrid methods.

Multiblock framework: for collections of structured grids. Older versions: Gengbin Zheng, 1999-2000; a recent, completely new version was motivated by ROCFLO. Like FEM, the user writes driver subroutines that deal with the life cycle of a single chunk of the grid. Ghost arrays are managed by the framework, based on registration of data by the user program. Support for “connecting up” multiple blocks: makemblock processes the geometry info.

Multiblock Constituents (figure).

Terminology (figure).

Multiblock structure. Steps: feed geometry information to makemblock (input: top-level blocks and the number of partitions desired; output: a block file containing the list of partitions and the communication structure); then run the parallel application, which reads the block file and initializes its data. Manual and info: http://charm.cs.uiuc.edu/ppl_research/mblock/

Multiblock code example: main loop

    do tStep=1,nSteps
       call MBLK_Apply_bc_All(grid, size, err)
       call MBLK_Update_field(fid, ghostWidth, grid, err)
       do k=sk,ek
          do j=sj,ej
             do i=si,ei
                ! Only relax along I and J directions -- not K
                newGrid(i,j,k) = cenWeight*grid(i,j,k) &
                     + neighWeight*(grid(i+1,j,k) + grid(i,j+1,k) &
                     + grid(i-1,j,k) + grid(i,j-1,k))
             end do
          end do
       end do
    end do

Multiblock Driver

    subroutine driver()
      implicit none
      include 'mblockf.h'
      ...
      call MBLK_Get_myblock(blockNo, err)
      call MBLK_Get_blocksize(size, err)
      ...
      call MBLK_Create_field( &
           size, 1, MBLK_DOUBLE, 1, &
           offsetof(grid(1,1,1), grid(si,sj,sk)), &
           offsetof(grid(1,1,1), grid(2,1,1)), fid, err)
      ! Register boundary condition functions
      call MBLK_Register_bc(0, ghostWidth, BC_imposed, err)
      ...
      ! Time loop
      ...
    end

Multiblock: Future work. Support other stencils (currently diagonal elements are not used). Applications: we need volunteers! We will write demo apps ourselves.

Adaptive Mesh Refinement: used in various engineering applications where there are regions of greater interest (e.g. http://www.damtp.cam.ac.uk/user/sdh20/amr/amr.html): global atmospheric modeling, numerical cosmology, hyperbolic partial differential equations (M.J. Berger and J. Oliger). Problems with uniformly refined meshes for the above: a grid that is too fine-grained wastes resources; a grid that is too coarse gives inaccurate results.

AMR Library: implements a distributed grid that can be dynamically adapted at runtime; uses arbitrary-bit indexing of arrays; requires synchronization only before refinement or coarsening; interoperability because of Charm++; uses the dynamic load balancing capability of chare arrays.

Indexing of array elements. Question: who are my neighbors? (Figure: quad-tree for a 2D 4x4 mesh, with cells labeled (x, y, bits): root 0,0,0; its children 0,0,2 0,1,2 1,0,2 1,1,2; leaves 0,0,4 through 1,3,4; each node marked as node/root, leaf, or virtual leaf.)

Indexing of array elements (contd.). Mathematically (for 2D): if the parent is (x, y) using n bits, then child1 is (2x, 2y), child2 is (2x, 2y+1), child3 is (2x+1, 2y), and child4 is (2x+1, 2y+1), each using n+2 bits.

Pictorially (figure, showing cell 0,0,4 and its surroundings).

Communication with Neighbors. In dimension x, the two neighbors are obtained as the -nbor at x-1 (where x is not 0) and the +nbor at x+1 (where x is not the maximum index at that refinement level). Likewise in dimension y: the -nbor at y-1 (where y is not 0) and the +nbor at y+1 (where y is not the maximum index).

Examples (see figure). Case 1: neighbor of 1,1,2 in the y dimension: -nbor 1,0,2. Case 2: neighbor of 1,1,2 in the x dimension: -nbor 0,1,2. Case 3: neighbor of 1,3,4 in the x dimension: +nbor 2,3,4; neighbor of 1,2,4 in the x dimension: +nbor 2,2,4.

Communication (contd.). Assumption: the refinement levels of adjacent cells differ by at most one (a requirement of the indexing scheme used). The indexing scheme is similar for the 1D and 3D cases.
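A small self-contained sketch of this indexing scheme (our illustration; the library encodes these triples as arbitrary-bit chare array indices): a cell is identified by (x, y, bits), children double the coordinates and add two bits, and same-level neighbors step x or y by one within the level's range. It reproduces the worked cases above (e.g. the neighbors of 1,1,2 are 0,1,2 and 1,0,2); neighbors across a one-level refinement difference are not shown.

    #include <cstdio>
    #include <vector>

    // A cell of the 2D quad-tree: coordinates plus total index bits,
    // matching the slides' (x, y, bits) labels (bits/2 per dimension).
    struct Cell { int x, y, bits; };

    // Children of (x, y, n bits): (2x,2y), (2x,2y+1), (2x+1,2y), (2x+1,2y+1),
    // each using n+2 bits.
    std::vector<Cell> children(const Cell& c) {
      return { {2*c.x,   2*c.y,   c.bits+2}, {2*c.x,   2*c.y+1, c.bits+2},
               {2*c.x+1, 2*c.y,   c.bits+2}, {2*c.x+1, 2*c.y+1, c.bits+2} };
    }

    // Same-level neighbors in x and y, skipping the edges of the domain.
    std::vector<Cell> neighbors(const Cell& c) {
      int maxIdx = (1 << (c.bits/2)) - 1;   // highest coordinate at this level
      std::vector<Cell> n;
      if (c.x > 0)      n.push_back({c.x-1, c.y, c.bits});
      if (c.x < maxIdx) n.push_back({c.x+1, c.y, c.bits});
      if (c.y > 0)      n.push_back({c.x, c.y-1, c.bits});
      if (c.y < maxIdx) n.push_back({c.x, c.y+1, c.bits});
      return n;
    }

    int main() {
      Cell c{1, 1, 2};   // the cell labeled 1,1,2 in the figure
      for (const Cell& k : children(c))  std::printf("child %d,%d,%d\n", k.x, k.y, k.bits);
      for (const Cell& k : neighbors(c)) std::printf("nbor  %d,%d,%d\n", k.x, k.y, k.bits);
    }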

AMR Interface. Library tasks: creation of the tree; creation of data at cells; communication between cells; calling the appropriate user routines in each iteration; refining, based on criteria specified by the user. User tasks: writing the user data structure kept by each cell; fragmenting and combining data for the neighbors; fragmenting the cell's data for refinement; writing the sequential computation code for each cell.

Some Related Work. PARAMESH, Peter MacNeice et al., http://sdcd.gsfc.nasa.gov/RIB/repositories/inhouse_gsfc/Users_manual/amr.htm: implemented in Fortran 90, supported on the Cray T3E and SGIs. Parallel Algorithms for Adaptive Mesh Refinement, Mark T. Jones and Paul E. Plassmann, SIAM J. on Scientific Computing, 18 (1997), pp. 686-708 (also MSC Preprint P421-0394), http://www-unix.mcs.anl.gov/sumaa3d/Papers/papers.html. DAGH (Dynamic Adaptive Grid Hierarchies), by Manish Parashar and James C. Browne: in C++ using MPI.

Future work: a specialized version for structured grids; integration with multiblock; a Fortran interface (the current version is C++ only, unlike the FEM and Multiblock frameworks, which support Fortran 90; relatively easy to do).

Summary: the frameworks are ripe for use, and well tested in some cases. Questions and answers: MPI libraries? Performance issues? Future plans: provide all features of Charm++.

AMPI: Goals. Runtime adaptivity for MPI programs, based on the multi-domain decomposition and dynamic load balancing features of Charm++; minimal changes to the original MPI code; full MPI 1.1 standard compliance; additional support for coupled codes; automatic conversion of existing MPI programs (diagram: the original MPI code passes through AMPIzer to become AMPI code running on the AMPI runtime).

Charm++: parallel C++ with data-driven objects. Object arrays / object collections; object groups: a global object with a “representative” on each PE; asynchronous method invocation; prioritized scheduling. Mature, robust, portable. http://charm.cs.uiuc.edu

Data-driven execution (figure: each processor runs a scheduler pulling work from a message queue).

Load Balancing Framework: based on object migration and measurement of load information. Partition the problem more finely than the number of available processors; partitions are implemented as objects (or threads) and mapped to the available processors by the LB framework. The runtime system measures the actual computation time of every partition, as well as communication patterns. A variety of “plug-in” LB strategies are available.
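As an illustration of what a plug-in strategy does with the measured data, here is a sketch of a greedy rebalancing rule (our simplification, not the framework's code): repeatedly assign the heaviest remaining object, by measured compute time, to the currently least-loaded processor.

    #include <algorithm>
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <vector>

    // Greedy rebalance: heaviest measured object goes to the least-loaded PE.
    // loads[i] is the measured compute time of object i; returns object -> PE.
    std::vector<int> greedyMap(std::vector<double> loads, int numPEs) {
      std::vector<int> order(loads.size());
      for (size_t i = 0; i < loads.size(); ++i) order[i] = (int)i;
      std::sort(order.begin(), order.end(),
                [&](int a, int b) { return loads[a] > loads[b]; });

      // Min-heap of (current PE load, PE id).
      using PE = std::pair<double,int>;
      std::priority_queue<PE, std::vector<PE>, std::greater<PE>> pes;
      for (int p = 0; p < numPEs; ++p) pes.push({0.0, p});

      std::vector<int> assignment(loads.size());
      for (int obj : order) {
        auto [load, pe] = pes.top(); pes.pop();
        assignment[obj] = pe;
        pes.push({load + loads[obj], pe});
      }
      return assignment;
    }

    int main() {
      std::vector<double> measured = {5.0, 1.0, 1.0, 4.0, 2.0, 3.0};
      std::vector<int> map = greedyMap(measured, 2);
      for (size_t i = 0; i < map.size(); ++i)
        std::printf("object %zu -> PE %d\n", i, map[i]);
    }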

Load Balancing Framework (figure).

Building on Object-based Parallelism. Application-induced load imbalances; environment-induced performance issues: dealing with extraneous loads on shared machines, vacating workstations; automatic checkpointing; automatic prefetching for out-of-core execution; heterogeneous clusters; reuse via object-based components. But: you must use Charm++!


Adaptive MPI: a bridge between legacy MPI codes and the dynamic load balancing capabilities of Charm++. AMPI = MPI + dynamic load balancing. Based on Charm++ object arrays and Converse's migratable threads. Minimal modification is needed to convert existing MPI programs (to be automated in the future). Bindings for C, C++, and Fortran 90; currently supports most of the MPI 1.1 standard.
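The point is that an ordinary MPI program, such as the toy ring exchange below (our example), is also a valid AMPI program: under AMPI each rank becomes a migratable user-level thread, so many virtual ranks can share a physical processor and be moved by the load balancer. Build and launch details vary by installation (compiler wrappers and an option selecting the number of virtual processors), so they are omitted here.

    // ring.cpp -- a plain MPI program; the same source runs under AMPI,
    // where each rank is a migratable thread rather than an OS process.
    // Assumes at least two ranks.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      // Pass a token around the ring once.
      int token = 0;
      if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("token came back: %d\n", token);
      } else {
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
      }
      MPI_Finalize();
      return 0;
    }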

AMPI Features: over 70 common MPI routines; C, C++, and Fortran 90 bindings; tested on the IBM SP, SGI Origin 2000, and Linux clusters. Automatic conversion: AMPIzer, based on the Polaris front end, is a source-to-source translator that converts MPI programs to AMPI and generates the supporting code for migration. Very low “overhead” compared with native MPI.

AMPI Extensions. Integration of multiple MPI-based modules; example: integrated rocket simulation (ROCFLO, ROCSOLID, ROCBURN, ROCFACE). Each module gets its own MPI_COMM_WORLD; all COMM_WORLDs form MPI_COMM_UNIVERSE; point-to-point communication among different MPI_COMM_WORLDs uses the same AMPI functions, and communication across modules is also considered when balancing load. Automatic checkpoint-and-restart, possibly on a different number of processors: the number of virtual processors stays the same, but they can be mapped to a different number of physical processors.

(Diagram: the PPL software stack.) Applications such as rocket simulation and other apps, molecular dynamics, and Faucets sit on top of component frameworks, mini-languages and libraries, standard libraries for parallel programming, and other programming paradigms; these are supported by load balancing, object allocation and remapping, a web-based interface, performance visualization, and instrumentation, all built on Charm++ and Converse.

Application Areas and Collaborations. Molecular dynamics: simulation of biomolecules; material properties and electronic structures. CSE applications: rocket simulation, industrial process simulation, cosmology visualizer. Combinatorial search: state-space search, game-tree search, optimization.

Molecular Dynamics: a collection of (charged) atoms, with bonds, evolved with Newtonian mechanics. At each time-step: calculate the forces on each atom (bonded, and non-bonded: electrostatic and van der Waals), then calculate velocities and advance positions. A 1-femtosecond time-step, and millions of them are needed; thousands of atoms (1,000 - 100,000).
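A bare-bones sketch of one such time-step loop (our illustration, not NAMD's code), with a simple pairwise non-bonded force and explicit velocity and position updates:

    #include <cmath>
    #include <cstdio>
    #include <vector>

    struct Atom { double x[3], v[3], f[3], charge; };

    // Pairwise non-bonded (Coulomb-like) forces; bonded terms omitted for brevity.
    void computeForces(std::vector<Atom>& atoms) {
      for (Atom& a : atoms) a.f[0] = a.f[1] = a.f[2] = 0.0;
      for (size_t i = 0; i < atoms.size(); ++i)
        for (size_t j = i + 1; j < atoms.size(); ++j) {
          double d[3], r2 = 0.0;
          for (int k = 0; k < 3; ++k) { d[k] = atoms[i].x[k] - atoms[j].x[k]; r2 += d[k]*d[k]; }
          double scale = atoms[i].charge * atoms[j].charge / (r2 * std::sqrt(r2));
          for (int k = 0; k < 3; ++k) { atoms[i].f[k] += scale*d[k]; atoms[j].f[k] -= scale*d[k]; }
        }
    }

    int main() {
      const double dt = 1e-15;            // 1 femtosecond time-step
      std::vector<Atom> atoms = { {{0,0,0},{0,0,0},{0,0,0}, 1.0},
                                  {{1,0,0},{0,0,0},{0,0,0},-1.0} };
      for (int step = 0; step < 3; ++step) {          // millions of steps in practice
        computeForces(atoms);
        for (Atom& a : atoms)
          for (int k = 0; k < 3; ++k) {               // advance velocities, then positions
            a.v[k] += dt * a.f[k];                    // unit mass assumed
            a.x[k] += dt * a.v[k];
          }
      }
      std::printf("atom 0 x = %.6f\n", atoms[0].x[0]);
    }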

BC1 complex: 200k atoms (figure).

Performance Data: SC2000 (figure).

Rocket Simulation. Our approach: multi-partition decomposition; data-driven objects (Charm++); the automatic load balancing framework; AMPI as a migration path for existing MPI+Fortran 90 codes (ROCFLO, ROCSOLID, and ROCFACE).

Timeshared parallel machines: how to use parallel machines effectively? We need resource management: shrink and expand individual jobs to the available sets of processors. Example: a machine with 100 processors. Job1 arrives and can use 20-150 processors, so assign 100 processors to it; Job2 arrives, can use 30-70 processors, and will pay more if we meet its deadline. We can do this with migratable objects!

Faucets: Multiple Parallel Machines. The faucet submits a request with a QoS contract: CPU seconds, min-max CPUs, deadline, interactive or not. Parallel machines submit bids: a job for 100 CPU hours may get a lower price bid if it has a less tight deadline and a more flexible PE range; a job that requires 15 CPU minutes with a deadline of 1 minute will generate a variety of bids; a machine with idle time on its hands bids low.
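A toy sketch of the data involved (our illustration of the idea, not the Faucets protocol): a QoS contract describing the job, and a simple rule that accepts the cheapest bid among machines promising to meet the deadline. All struct and function names are invented for the example.

    #include <cstdio>
    #include <vector>

    // What the slide calls a QoS contract: resource needs plus a deadline.
    struct QoSContract {
      double cpuSeconds;      // estimated total work
      int    minPEs, maxPEs;  // acceptable processor range
      double deadline;        // seconds from now
      bool   interactive;
    };

    // A bid from one parallel machine.
    struct Bid { int machineId; double price; double promisedFinish; };

    // Pick the cheapest bid that still meets the job's deadline (-1 if none).
    int chooseBid(const QoSContract& job, const std::vector<Bid>& bids) {
      int best = -1;
      for (size_t i = 0; i < bids.size(); ++i)
        if (bids[i].promisedFinish <= job.deadline &&
            (best < 0 || bids[i].price < bids[best].price))
          best = (int)i;
      return best;
    }

    int main() {
      QoSContract job{15 * 60.0, 30, 70, 60.0, true};  // 15 CPU minutes, 1-minute deadline
      std::vector<Bid> bids = { {0, 5.0, 55.0}, {1, 2.5, 90.0}, {2, 4.0, 40.0} };
      int k = chooseBid(job, bids);
      if (k >= 0)
        std::printf("accept bid from machine %d at price %.2f\n",
                    bids[k].machineId, bids[k].price);
    }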

Faucets QoS and Architecture. The user specifies desired job parameters such as min PEs, max PEs, estimated CPU-seconds, priority, etc.; the user does not specify a machine. Planned: integration with Globus. (Architecture diagram: a faucet client and a web browser talk to a central server, which dispatches jobs to several workstation clusters.)

How to make all of this work? The key is a fine-grained resource management model: work units are objects and threads rather than processes; data units are object data, thread stacks, etc., rather than pages; work and data units can be migrated automatically during a run.

Time-Shared Parallel Machines (figure).

Appspector: Web-based Monitoring and Steering of Parallel Programs. Parallel jobs are submitted via a server, which maintains a database of running programs. The Charm++ client-server interface allows one to inject messages into a running application. From any web browser you can attach to a job (if authenticated), monitor its performance and behavior, and interact with and steer the job (send commands).

BioCoRE. Goal: provide a web-based way to virtually bring scientists together. Project-based workbench for modeling; conferences/chat rooms; lab notebook; joint document preparation. http://www.ks.uiuc.edu/Research/biocore/

Some New Projects. Load balancing for really large machines: 30k-128k processors. Million-processor, petaflops-class machines: emulation for software development, simulation for performance prediction. Operations research: combinatorial optimization. Parallel discrete event simulation.

Summary: exciting times for parallel computing ahead. We are preparing an object-based infrastructure (Charm++, AMPI, automatic load balancing) to exploit future apps on future machines; application-oriented research that produces enabling CS technology; a rich set of collaborations. More information: http://charm.cs.uiuc.edu