Component Frameworks
Laxmikant (Sanjay) Kale
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign

Group Mission and Approach
To enhance performance and productivity in programming complex parallel applications
– Performance: scalable to thousands of processors
– Productivity: of human programmers
– Complex: irregular structure, dynamic variations
Approach: application-oriented yet CS-centered research
– Develop enabling technology for a wide collection of applications
– Develop, use, and test it in the context of real applications
– Optimal division of labor between "system" and programmer: decomposition done by the programmer, everything else automated
– Develop a standard library of reusable parallel components

Motivation
Parallel computing in science and engineering
– Competitive advantage
– Pain in the neck
– Necessary evil
It is not so difficult
– But tedious and error-prone
– New issues: race conditions, load imbalances, modularity in the presence of concurrency, ...
– Just have to bite the bullet, right?

But wait…
Parallel computation structures
– The set of parallel applications is diverse and complex
– Yet the underlying parallel data structures and communication structures are small in number: structured and unstructured grids, trees (AMR, ...), particles, interactions between these, space-time
One should be able to reuse those
– Avoid doing the same parallel programming again and again

A Second Idea
Many problems require dynamic load balancing
– We should be able to reuse load-rebalancing strategies
– It should be possible to separate load balancing code from application code
This strategy is embodied in Charm++
– Express the program as a collection of interacting entities (objects)
– Let the system control mapping to processors

Charm Component Frameworks
Object-based decomposition
Reuse of specialized parallel structures
Component frameworks
Load balancing
Automatic checkpointing
Flexible use of clusters
Out-of-core execution

Current Set of Component Frameworks
FEM / unstructured meshes
– "Mature", with several applications already
Multiblock: multiple structured grids
– New, but very promising
AMR: oct-trees and quad-trees

Multiblock Constituents

Terminology

Multi-partition Decomposition
Idea: divide the computation into a large number of pieces
– Independent of the number of processors
– Typically larger than the number of processors
– Let the system map entities to processors
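
To make the idea concrete, here is a minimal, framework-independent sketch (not Charm++ code; the names and the round-robin map are purely illustrative) of splitting a 1D domain into many more chunks than processors and recording an initial chunk-to-PE placement that a runtime could later change:

```cpp
#include <cstdio>
#include <vector>

// Illustrative sketch: overdecompose a 1D domain of nCells cells into nChunks
// pieces, where nChunks is chosen independently of (and typically much larger
// than) the number of processors nPes. A trivial round-robin map stands in
// for the runtime's mapping decision.
int main() {
    const int nCells  = 1000000;
    const int nPes    = 64;        // physical processors
    const int nChunks = 8 * nPes;  // virtualization ratio of 8

    struct Chunk { int begin, end, pe; };
    std::vector<Chunk> chunks(nChunks);

    for (int c = 0; c < nChunks; ++c) {
        chunks[c].begin = (long long)c * nCells / nChunks;
        chunks[c].end   = (long long)(c + 1) * nCells / nChunks;
        chunks[c].pe    = c % nPes;  // initial placement; a runtime could remap later
    }

    std::printf("chunk 0: cells [%d,%d) on PE %d\n",
                chunks[0].begin, chunks[0].end, chunks[0].pe);
    return 0;
}
```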

Object-based Parallelization
(Figure: user view vs. system implementation)
The user is only concerned with the interaction between objects

Charm++
Parallel C++ with data-driven objects
Object arrays / object collections
Object groups
– A global object with a "representative" on each PE
Asynchronous method invocation
Prioritized scheduling
Mature, robust, portable
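
For flavor, a minimal Charm++ sketch in the style of the standard "hello" example: a 1D chare array whose elements are invoked asynchronously, with the runtime deciding where each element lives. The module, class, and method names here are illustrative, and the build steps (charmc invocation, makefile) are omitted.

```cpp
// ---- hello.ci (Charm++ interface file) ----
mainmodule hello {
  readonly CProxy_Main mainProxy;
  mainchare Main {
    entry Main(CkArgMsg* m);
    entry void done();
  };
  array [1D] Hello {
    entry Hello();
    entry void sayHi();
  };
};

// ---- hello.C ----
#include "hello.decl.h"

/*readonly*/ CProxy_Main mainProxy;

class Main : public CBase_Main {
  int count;
public:
  Main(CkArgMsg* m) : count(0) {
    delete m;
    mainProxy = thisProxy;
    CProxy_Hello arr = CProxy_Hello::ckNew(8);  // 8 chares; placement is up to the runtime
    arr.sayHi();                                // asynchronous broadcast to all elements
  }
  void done() {                                 // called back by each element
    if (++count == 8) CkExit();
  }
};

class Hello : public CBase_Hello {
public:
  Hello() {}
  Hello(CkMigrateMessage* m) {}                 // needed if elements ever migrate
  void sayHi() {
    CkPrintf("Hello from chare %d on PE %d\n", thisIndex, CkMyPe());
    mainProxy.done();
  }
};

#include "hello.def.h"
```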

Data-Driven Execution
(Figure: scheduler and message queue)

Load Balancing Framework
Based on object migration and measurement of load information
Partition the problem more finely than the number of available processors
Partitions are implemented as objects (or threads) and mapped to available processors by the LB framework
The runtime system measures the actual computation time of every partition, as well as communication patterns
A variety of "plug-in" LB strategies are available
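
A sketch of what an application object typically provides in order to participate in this, assuming the AtSync load-balancing protocol; the matching .ci declarations and build details are omitted, and the class and method names are illustrative:

```cpp
#include "worker.decl.h"   // generated from the corresponding .ci file (not shown)
#include <pup_stl.h>       // PUP support for STL containers
#include <vector>

class Worker : public CBase_Worker {
  std::vector<double> state;   // per-partition data that must move with the object
  int step;
public:
  Worker() : state(100000, 0.0), step(0) {
    usesAtSync = true;             // opt in to AtSync-based load balancing
  }
  Worker(CkMigrateMessage* m) {}   // migration constructor

  // Pack/UnPack: tells the runtime how to serialize this object for migration.
  void pup(PUP::er& p) {
    CBase_Worker::pup(p);
    p | state;
    p | step;
  }

  void doStep() {
    // ... compute on 'state'; the runtime measures how long this object takes ...
    if (++step % 20 == 0)
      AtSync();                          // periodically let the balancer run
    else
      thisProxy[thisIndex].doStep();     // otherwise continue asynchronously
  }
  void ResumeFromSync() {                // invoked after migrations are applied
    thisProxy[thisIndex].doStep();
  }
};

#include "worker.def.h"
```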

Load Balancing Framework (figure)

Building on Object-based Parallelism
Application-induced load imbalances
Environment-induced performance issues:
– Dealing with extraneous loads on shared machines
– Vacating workstations
– Automatic checkpointing
– Automatic prefetching for out-of-core execution
– Heterogeneous clusters
Reuse: object-based components
But: must use Charm++!

AMPI: Goals
Runtime adaptivity for MPI programs
– Based on the multi-domain decomposition and dynamic load balancing features of Charm++
– Minimal changes to the original MPI code
– Full MPI 1.1 standard compliance
– Additional support for coupled codes
– Automatic conversion of existing MPI programs (AMPIzer: original MPI code converted to AMPI code, run on the AMPI runtime)

Adaptive MPI
A bridge between legacy MPI codes and the dynamic load balancing capabilities of Charm++
AMPI = MPI + dynamic load balancing
Based on Charm++ object arrays and Converse's migratable threads
Minimal modification needed to convert existing MPI programs (to be automated in the future)
Bindings for C, C++, and Fortran 90
Currently supports most of the MPI 1.1 standard
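
For example, an ordinary MPI 1.1 program like the one below can be compiled against AMPI and run with many more virtual processors than physical ones. The comment marks where AMPI's migration collective would go; its exact name and signature vary by AMPI version, so it is only noted here, not called.

```cpp
// Plain MPI 1.1 code like this runs unchanged under AMPI: each "rank" becomes
// a migratable user-level thread instead of an OS process, so it can be
// load-balanced by the Charm++ runtime.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = rank + 1.0, sum = 0.0;
    for (int step = 0; step < 100; ++step) {
        // ... per-rank computation would go here ...
        MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        // Under AMPI one would periodically call its migration collective here
        // (called MPI_Migrate in AMPI of this era; the exact name and signature
        // depend on the AMPI version) to let the runtime rebalance ranks.
    }

    if (rank == 0) std::printf("sum = %g on %d virtual processors\n", sum, size);
    MPI_Finalize();
    return 0;
}
```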

AMPI Features
Over 70 common MPI routines
– C, C++, and Fortran 90 bindings
– Tested on IBM SP, SGI Origin 2000, and Linux clusters
Automatic conversion: AMPIzer
– Based on the Polaris front-end
– Source-to-source translator for converting MPI programs to AMPI
– Generates supporting code for migration
Very low "overhead" compared with native MPI

AMPI Extensions
Integration of multiple MPI-based modules
– Example: integrated rocket simulation (ROCFLO, ROCSOLID, ROCBURN, ROCFACE)
– Each module gets its own MPI_COMM_WORLD; all COMM_WORLDs form MPI_COMM_UNIVERSE
– Point-to-point communication among different MPI_COMM_WORLDs uses the same AMPI functions
– Communication across modules is also considered when balancing load
Automatic checkpoint-and-restart
– On a different number of processors
– The number of virtual processors remains the same, but they can be mapped to a different number of physical processors

(Figure: software stack with Charm++ layered on Converse)

Application Areas and Collaborations
Molecular dynamics:
– Simulation of biomolecules
– Material properties and electronic structures
CSE applications:
– Rocket simulation
– Industrial process simulation
– Cosmology visualizer
Combinatorial search:
– State-space search, game-tree search, optimization

Molecular Dynamics
A collection of [charged] atoms, with bonds
Newtonian mechanics
At each time-step:
– Calculate forces on each atom: bonded, and non-bonded (electrostatic and van der Waals)
– Calculate velocities and advance positions
1-femtosecond time-step; millions of steps needed!
Thousands of atoms
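
As a schematic of what one such time-step involves, here is a generic velocity-Verlet step in C++ (not NAMD source; the force routine is a stub and the data layout is illustrative):

```cpp
#include <vector>

struct Vec3 { double x, y, z; };

struct Atom {
    Vec3 pos, vel, force;
    double mass;
};

// Stub: a real MD code sums bonded forces plus non-bonded (electrostatic and
// van der Waals) interactions here, usually with a spatial cutoff.
void computeForces(std::vector<Atom>& atoms) {
    for (Atom& a : atoms) a.force = {0.0, 0.0, 0.0};
}

// One velocity-Verlet time step: half-kick, drift, recompute forces, half-kick.
void timeStep(std::vector<Atom>& atoms, double dt /* ~1 femtosecond */) {
    for (Atom& a : atoms) {
        double s = 0.5 * dt / a.mass;
        a.vel.x += s * a.force.x;  a.vel.y += s * a.force.y;  a.vel.z += s * a.force.z;
        a.pos.x += dt * a.vel.x;   a.pos.y += dt * a.vel.y;   a.pos.z += dt * a.vel.z;
    }
    computeForces(atoms);   // forces at the new positions
    for (Atom& a : atoms) {
        double s = 0.5 * dt / a.mass;
        a.vel.x += s * a.force.x;  a.vel.y += s * a.force.y;  a.vel.z += s * a.force.z;
    }
}
```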

BC1 complex: 200k atoms (figure)

Performance Data: SC2000 (figure)

Component Frameworks: Using the Load Balancing Framework
(Figure: stack built on Converse and Charm++ with a load database + balancer; FEM and Structured frameworks, MPI-on-Charm (Irecv+), automatic conversion from MPI, cross-module interpolation; migration path vs. framework path)

Finite Element Framework Goals
Hide the parallel implementation in the runtime system
Allow adaptive parallel computation and dynamic, automatic load balancing
Leave physics and numerics to the user
Present a clean, "almost serial" interface:

Serial code (entire mesh):
  begin time loop
    compute forces
    update node positions
  end time loop

Framework code (per mesh partition):
  begin time loop
    compute forces
    communicate shared nodes
    update node positions
  end time loop

FEM Framework: Responsibilities
(Figure: layered responsibilities)
– FEM application: initialize, registration of nodal attributes, loops over elements, finalize
– FEM framework: update of nodal properties, reductions over nodes or partitions; partitioner (METIS), combiner, I/O
– Charm++: dynamic load balancing, communication

Structure of an FEM Program
Serial init() and finalize() subroutines
– Do serial I/O, read the serial mesh, and call FEM_Set_Mesh
Parallel driver() main routine:
– One driver per partitioned mesh chunk
– Runs in a thread: the time loop looks like the serial version
– Does computation and calls FEM_Update_Field
The framework handles partitioning, parallelization, and communication
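
A rough structural sketch of such a program is below. It only shows the init/driver/finalize shape; the FEM_* routines named on the slide are indicated in comments rather than called, since their exact signatures are not given here, and the Partition type is a hypothetical stand-in for the application's per-chunk data.

```cpp
// Structural sketch of an FEM-framework application (not real framework code):
// the framework expects a serial init()/finalize() pair and a parallel driver()
// that runs once per mesh partition.

struct Partition {            // hypothetical stand-in for one mesh chunk's data
    int    nNodes = 0;
    double t      = 0.0;
};

extern "C" void init() {
    // Serial: read the whole mesh from disk, then register it with the
    // framework by calling FEM_Set_Mesh(...), which triggers partitioning.
}

extern "C" void driver() {
    Partition my;             // each driver thread works on its own partition
    const int nSteps = 1000;
    for (int step = 0; step < nSteps; ++step) {
        // compute forces on local elements ...
        // call FEM_Update_Field(...) to combine values on shared nodes ...
        // update local node positions ...
        my.t += 1.0e-3;
    }
}

extern "C" void finalize() {
    // Serial: gather and write results.
}
```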

Structure of an FEM Application
(Figure: init(), then per-partition driver() threads updating shared nodes, then finalize())

Dendritic Growth
Studies the evolution of solidification microstructures using a phase-field model computed on an adaptive finite element grid
Adaptive refinement and coarsening of the grid involves re-partitioning

Crack Propagation
Decomposition into 16 chunks (left) and 128 chunks, 8 for each PE (right). The middle area contains cohesive elements. Both decompositions were obtained using METIS.
Pictures: S. Breitenfeld and P. Geubelle

"Overhead" of Multipartitioning (figure)

Load Balancer in Action
Automatic load balancing in crack propagation:
1. Elements added
2. Load balancer invoked
3. Chunks migrated

Parallel Collision Detection
Detect collisions (intersections) between objects scattered across processors
Approach, based on Charm++ arrays:
– Overlay a regular, sparse 3D grid of voxels (boxes)
– Send objects to all voxels they touch
– Collide voxels independently and collect results
– Leave collision response to user code
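
The voxelization step can be illustrated with a small sketch (not the Charm++ collision library; names are illustrative): map an object's axis-aligned bounding box onto the cells of a regular grid, producing the set of voxels the object must be sent to.

```cpp
#include <array>
#include <cmath>
#include <cstdio>
#include <vector>

struct Box { double lo[3], hi[3]; };   // axis-aligned bounding box

// Return the indices of every voxel (of edge length voxelSize) that the box touches.
std::vector<std::array<int,3>> touchedVoxels(const Box& b, double voxelSize) {
    std::vector<std::array<int,3>> out;
    int lo[3], hi[3];
    for (int d = 0; d < 3; ++d) {
        lo[d] = (int)std::floor(b.lo[d] / voxelSize);
        hi[d] = (int)std::floor(b.hi[d] / voxelSize);
    }
    for (int i = lo[0]; i <= hi[0]; ++i)
        for (int j = lo[1]; j <= hi[1]; ++j)
            for (int k = lo[2]; k <= hi[2]; ++k)
                out.push_back({i, j, k});
    return out;
}

int main() {
    Box b = {{0.2, 0.2, 0.2}, {1.7, 0.9, 0.4}};
    for (const auto& v : touchedVoxels(b, 1.0))
        std::printf("voxel (%d,%d,%d)\n", v[0], v[1], v[2]);
    return 0;
}
```

Each voxel then collides the objects it received independently of all other voxels, which is what makes the phase embarrassingly parallel.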

Collision Detection Speed
O(n) serial performance: about 2 µs per polygon on a single Linux PC
Good speedups to 1000s of processors: scaling study on ASCI Red with 65,000 polygons per processor (up to 100 million polygons)

Rocket Simulation
Our approach:
– Multi-partition decomposition
– Data-driven objects (Charm++)
– Automatic load balancing framework
AMPI: a migration path for existing MPI + Fortran 90 codes
– ROCFLO, ROCSOLID, and ROCFACE

Timeshared Parallel Machines
How to use parallel machines effectively?
Need resource management
– Shrink and expand individual jobs to the available sets of processors
Example: a machine with 100 processors
– Job 1 arrives and can use a range of processors; assign 100 processors to it
– Job 2 arrives, can also use a range of processors, and will pay more if we meet its deadline
We can do this with migratable objects!

Faucets: Multiple Parallel Machines
A faucet submits a request with a QoS contract:
– CPU seconds, min-max CPUs, deadline, interactive?
Parallel machines submit bids:
– A job for 100 CPU-hours may get a lower price bid if it has a less tight deadline or a more flexible PE range
– A job that requires 15 CPU-minutes with a deadline of 1 minute will generate a variety of bids
– A machine with idle time on its hands: low bid
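
As a purely hypothetical illustration (this is not the Faucets wire format or API, just the parameters the slide names), the contract and bid could be thought of as small records like these:

```cpp
#include <cstdio>

struct QosContract {
    double cpuSeconds;    // estimated total work
    int    minPes;        // smallest processor count the job can run on
    int    maxPes;        // largest processor count the job can exploit
    double deadlineSecs;  // wall-clock deadline
    bool   interactive;   // does the user expect interactive response?
};

struct Bid {
    const char* machine;
    double price;         // a machine with idle capacity can bid low
};

int main() {
    QosContract job{15.0 * 60.0, 4, 64, 60.0, true};   // 15 CPU-minutes, 1-minute deadline
    Bid bids[] = {{"clusterA", 0.8}, {"clusterB", 1.4}};
    for (const Bid& b : bids)
        std::printf("%s bids %.2f for a %.0f CPU-second job\n",
                    b.machine, b.price, job.cpuSeconds);
    return 0;
}
```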

Faucets QoS and Architecture
The user specifies desired job parameters such as min PEs, max PEs, estimated CPU-seconds, priority, etc.
The user does not specify a machine.
Planned: integration with Globus
(Figure: central server, faucet client, web browser, workstation cluster)

How to Make All of This Work?
The key: a fine-grained resource management model
– Work units are objects and threads rather than processes
– Data units are object data and thread stacks, rather than pages
– Work/data units can be migrated automatically during a run

Time-Shared Parallel Machines (figure)

Appspector: Web-based Monitoring and Steering of Parallel Programs
Parallel jobs are submitted via a server
– The server maintains a database of running programs
– The Charm++ client-server interface allows one to inject messages into a running application
From any web browser you can:
– Attach to a job (if authenticated)
– Monitor performance
– Monitor behavior
– Interact with and steer the job (send commands)

BioCoRE
Goal: provide a web-based way to virtually bring scientists together
– Project-based workbench for modeling
– Conferences / chat rooms
– Lab notebook
– Joint document preparation

Some New Projects
Load balancing for really large machines: 30k-128k processors
Million-processor petaflops-class machines
– Emulation for software development
– Simulation for performance prediction
Operations research: combinatorial optimization
Parallel discrete event simulation

Summary
Exciting times for parallel computing ahead
We are preparing an object-based infrastructure
– To exploit future applications on future machines
– Charm++, AMPI, automatic load balancing
Application-oriented research that produces enabling CS technology
Rich set of collaborations
More information: