IPDPS Workshop: Apr 2002
A Parallel-Object Programming Model for PetaFLOPS Machines and BlueGene/Cyclops
Gengbin Zheng, Arun Singla, Joshua Unger, Laxmikant Kalé
Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign

Massively Parallel Processors-In-Memory (MPPIM)
–A large number of identical chips
–Each chip contains multiple processors and memory
Blue Gene/C
–34 x 34 x 36 cube of nodes
–Multi-million hardware threads
Challenges
–How do we program such a machine?
–Software must be cost-effective

Need for an Emulator
Emulate the BG/C machine API on conventional supercomputers and clusters.
–The emulator lets programmers develop, compile, and run software against the programming interface that will be used on the actual machine
Performance estimation (with proper time stamping)
Allows further research on high-level parallel languages such as Charm++
The low memory-to-processor ratio makes emulation feasible
–Half a terabyte of memory requires only about 1000 processors with 512 MB each

Emulation on a Parallel Machine
(diagram: each simulating (host) processor hosts a set of emulated BG/C nodes, each containing multiple hardware threads)

Blue Gene Emulator: structure of one BG/C node
–Communication threads and worker threads
–inBuffer for incoming packets
–Affinity message queues and non-affinity message queues
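A minimal C++ sketch of how such an emulated node might be represented, based only on the components named on this slide; the type and field names are illustrative assumptions, not the emulator's actual source:

#include <deque>
#include <vector>
#include <mutex>

// Hypothetical packet type: a handler ID plus a raw payload.
struct BgPacket {
  int handlerID;
  std::vector<char> payload;
};

// Illustrative sketch of one emulated BG/C node.
struct BgNodeState {
  int x, y, z;                                        // node coordinates in the machine
  std::deque<BgPacket> inBuffer;                      // packets arriving from the network
  std::deque<BgPacket> nonAffinityQueue;              // work any worker thread may pick up
  std::vector<std::deque<BgPacket>> affinityQueues;   // one queue per worker thread
  std::mutex lock;                                    // protects the shared queues
};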

Blue Gene Programming API
Low-level:
–Machine initialization
  Get node ID: (x, y, z)
  Get Blue Gene size
–Register handler functions on a node
–Send packets to other nodes (x, y, z), tagged with a handler ID
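Inferred from the Ring example on the next slide, the low-level C API looks roughly like the following declarations; the exact signatures, and the BgGetSize and BgRegisterHandler names in particular, are assumptions:

/* Sketch of the emulator's low-level API, inferred from the Ring example below. */
typedef void (*BgHandler)(char *msg);

void BgGetXYZ(int *x, int *y, int *z);       /* this node's ID within the machine */
void BgGetSize(int *nx, int *ny, int *nz);   /* assumed name: machine dimensions */
int  BgRegisterHandler(BgHandler h);         /* assumed name: returns a handler ID */
void BgSendPacket(int x, int y, int z, int handlerID,
                  int workType, int numbytes, char *data);
void BgShutdown(void);                       /* terminate the emulation */

/* User-supplied entry function, invoked once on every emulated node. */
void BgNodeStart(int argc, char **argv);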

Blue Gene application example: Ring

/* iter, MAXITER, passRingID, and nextxyz are defined elsewhere in the program. */
typedef struct {
  char core[CmiBlueGeneMsgHeaderSizeBytes];
  int data;
} RingMsg;

void BgNodeStart(int argc, char **argv) {
  int x, y, z, nx, ny, nz;
  RingMsg msg;
  msg.data = 888;
  BgGetXYZ(&x, &y, &z);
  nextxyz(x, y, z, &nx, &ny, &nz);
  /* node (0,0,0) injects the message that travels around the ring */
  if (x == 0 && y == 0 && z == 0)
    BgSendPacket(nx, ny, nz, passRingID, LARGE_WORK, sizeof(int), (char *)&msg);
}

void passRing(char *msg) {
  int x, y, z, nx, ny, nz;
  BgGetXYZ(&x, &y, &z);
  nextxyz(x, y, z, &nx, &ny, &nz);
  /* each time the message returns to (0,0,0), one iteration completes */
  if (x == 0 && y == 0 && z == 0)
    if (++iter == MAXITER) BgShutdown();
  BgSendPacket(nx, ny, nz, passRingID, LARGE_WORK, sizeof(int), msg);
}
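The slide does not show nextxyz. One plausible definition, assuming the ring simply walks the machine in x-, then y-, then z-order with wrap-around, and that MAXX, MAXY, MAXZ are constants holding the machine dimensions:

/* Hypothetical helper: advance (x,y,z) to the next node in a ring ordering. */
void nextxyz(int x, int y, int z, int *nx, int *ny, int *nz) {
  *nx = x; *ny = y; *nz = z;
  if (++(*nx) == MAXX) {        /* step in x; on wrap-around, carry into y */
    *nx = 0;
    if (++(*ny) == MAXY) {      /* on wrap-around in y, carry into z */
      *ny = 0;
      *nz = (z + 1) % MAXZ;     /* wraps back to (0,0,0) at the end */
    }
  }
}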

Emulator Status
Implemented on Charm++/Converse
–8 million processors emulated on 100 ASCI Red processors
How does the emulation time compare with the time the same run would take on a real BG/C?
–Timestamp module
Emulation efficiency
–On a Linux cluster, the emulation shows good speedup (later slides)

Programming Issues for MPPIMs
A higher-level programming language is needed to manage:
–Data locality
–Parallelism
–Load balancing
Charm++ is a good candidate programming model for MPPIMs

Charm++
Parallel C++ with data-driven objects
–Object arrays / object collections
–Object groups: a global object with a "representative" on each PE
–Asynchronous method invocation
–Built-in runtime load balancing
–Mature, robust, portable
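For flavor, here is a minimal Charm++ chare-array sketch; it is a generic hello-style example, not code from the talk, and the file names and chare names (hello.ci, hello.C, Main, Hello) are made up for illustration. Note how the number of array elements is chosen independently of, and larger than, the number of processors:

// hello.ci (interface definition)
mainmodule hello {
  readonly CProxy_Main mainProxy;
  mainchare Main {
    entry Main(CkArgMsg *m);
    entry void done();
  };
  array [1D] Hello {
    entry Hello();
    entry void greet();
  };
};

// hello.C
#include "hello.decl.h"

/*readonly*/ CProxy_Main mainProxy;

class Main : public CBase_Main {
  int count, total;
public:
  Main(CkArgMsg *m) : count(0) {
    delete m;
    mainProxy = thisProxy;
    total = 8 * CkNumPes();                    // over-decompose: many pieces per PE
    CProxy_Hello arr = CProxy_Hello::ckNew(total);
    arr.greet();                               // asynchronous broadcast to all elements
  }
  void done() {
    if (++count == total) CkExit();            // finish once every element has reported
  }
};

class Hello : public CBase_Hello {
public:
  Hello() {}
  Hello(CkMigrateMessage *m) {}
  void greet() {
    CkPrintf("Element %d running on PE %d\n", thisIndex, CkMyPe());
    mainProxy.done();                          // asynchronous method invocation back to Main
  }
};

#include "hello.def.h"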

Multi-partition Decomposition
Idea: divide the computation into a large number of pieces (parallel objects)
–Independent of the number of processors
–Typically many more pieces than processors
–Let the system map the pieces to processors
Optimal division of labor between "system" and programmer: decomposition is done by the programmer; everything else is automated

Object-based Parallelization
(diagram contrasting the user view with the system implementation: the user is concerned only with the interactions between objects; Charm++ maps the objects onto PEs)

Data-Driven Execution
(diagram: on each processor, a scheduler picks messages from the message queue and invokes the corresponding object's method)
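The scheduler in the diagram can be pictured roughly as the loop below; this is an illustrative sketch, not the actual Converse/Charm++ scheduler:

#include <deque>
#include <functional>

// Hypothetical message: carries a bound entry-method invocation on some object.
struct Message {
  std::function<void()> invoke;
};

// Illustrative data-driven scheduler: no object code runs until a message
// addressed to that object is picked from the queue.
void schedulerLoop(std::deque<Message> &messageQ, const bool &done) {
  while (!done) {
    if (messageQ.empty()) continue;   // idle: wait for work (busy-wait for simplicity)
    Message m = messageQ.front();
    messageQ.pop_front();
    m.invoke();                       // run the entry method on its target object
  }
}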

Load Balancing Framework
Based on object migration
–Partitions, implemented as objects (or threads), are mapped to available processors by the LB framework
Measurement-based load balancers:
–Principle of persistence: computational loads and communication patterns tend to persist over time
–The runtime system measures the actual computation time of every partition, as well as its communication pattern
A variety of "plug-in" LB strategies are available
–Scalable to a few thousand processors
–Including strategies for situations where the principle of persistence does not apply
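In Charm++ terms, an object opts into measurement-based balancing roughly as follows. This is a minimal sketch: the Block chare, its data, and the file names block.decl.h/block.def.h are made up for illustration and assume a matching "array [1D] Block" declaration in the .ci file, while usesAtSync/AtSync/ResumeFromSync and the PUP interface are the standard Charm++ mechanism:

#include "block.decl.h"   // assumed generated header for this example
#include "pup_stl.h"
#include <vector>

class Block : public CBase_Block {
  std::vector<double> data;          // this element's partition of the problem
public:
  Block() { usesAtSync = true; }     // opt into measurement-based load balancing
  Block(CkMigrateMessage *m) {}      // migration constructor

  void doTimestep() {
    compute();                       // the runtime measures the time spent here
    AtSync();                        // hand control to the LB framework
  }
  void ResumeFromSync() {            // called once load balancing finishes
    doTimestep();                    // next step, possibly on a different PE
  }                                  // (termination condition omitted)
  void pup(PUP::er &p) {             // serialize state so the object can migrate
    CBase_Block::pup(p);
    p | data;
  }
private:
  void compute() { /* ... per-step work ... */ }
};
#include "block.def.h"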

Charm++ is a Good Match for MPPIMs
–Message-driven / data-driven execution
–Encapsulation: objects
–Explicit cost model: object data, read-only data, remote data; the program is aware of the cost of accessing remote data
–Migration and resource management: automatic
–One-sided communication
–Asynchronous global operations (reductions, ...)

Charm++ Applications
Charm++ was developed in the context of real applications. Current applications we are involved with:
–Molecular dynamics (NAMD)
–Crack propagation
–Rocket simulation: fluid dynamics + structures +
–QM/MM: material properties via quantum mechanics
–Cosmology simulations: parallel analysis + visualization
–Cosmology: gravitational interactions with multiple timestepping

Molecular Dynamics
Collection of [charged] atoms, with bonds, evolved by Newtonian mechanics
At each time-step:
–Calculate forces on each atom: bonded forces, plus non-bonded electrostatic and van der Waals forces
–Calculate velocities and advance positions
1 femtosecond time-step, millions of steps needed!
Thousands of atoms
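The per-timestep loop described above amounts to something like the following schematic integration sketch; the types and helper functions are made up for illustration, and the real NAMD/LeanMD kernels are far more elaborate:

#include <vector>

struct Vec3 { double x, y, z; };

struct Atom {
  Vec3 pos, vel, force;
  double mass;
};

void computeBondedForces(std::vector<Atom> &atoms);     // assumed helper: bonds, angles, ...
void computeNonbondedForces(std::vector<Atom> &atoms);  // assumed helper: electrostatics + van der Waals

// One MD time-step: compute forces on every atom, then advance velocities and positions.
void timestep(std::vector<Atom> &atoms, double dt /* ~1 femtosecond */) {
  computeBondedForces(atoms);
  computeNonbondedForces(atoms);
  for (Atom &a : atoms) {
    a.vel.x += dt * a.force.x / a.mass;   // a = F/m
    a.vel.y += dt * a.force.y / a.mass;
    a.vel.z += dt * a.force.z / a.mass;
    a.pos.x += dt * a.vel.x;
    a.pos.y += dt * a.vel.y;
    a.pos.z += dt * a.vel.z;
  }
}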

Performance Data: SC2000

Further Match with MPPIMs
Ability to predict:
–Which data will be needed and which code will execute
–Based on the ready queue of object method invocations
–So we can prefetch data accurately, and prefetch code if needed

Blue Gene/C Charm++
Implemented Charm++ on the Blue Gene/C Emulator
–Almost all existing Charm++ applications can run without change on the emulator
Case studies on some real applications
–LeanMD: fully functional MD with cutoff only (PME later)
–AMR
Time stamping (ongoing work)
–Log generation and correction

Parallel Object Programming Model
(layer diagram; labels: Charm++, Converse, UDP/TCP / MPI / Myrinet etc., NS Selector, BGConverse Emulator. In the native stack, Charm++ runs on Converse over UDP/TCP, MPI, Myrinet, etc.; in the emulated stack, Charm++ runs on the BGConverse emulator, which is layered on Converse over the same transports)

BG/C Charm++: Object Affinity
–Object mapped to a BG node:
  A message can be executed by any thread on that node
  Load balancing at the node level
  Locking needed
–Object mapped to a BG thread:
  The object is created on a particular thread
  All messages to the object go to that thread
  No locking needed; load balancing at the thread level
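A minimal sketch of how message delivery might differ between the two mappings, reusing the hypothetical BgNodeState/BgPacket types sketched earlier (illustrative only, not the emulator's actual dispatch code):

#include <mutex>

// Deliver an incoming packet on a node. targetThread is the affinity thread,
// or -1 when the destination object is mapped to the node as a whole.
void deliver(BgNodeState &node, const BgPacket &pkt, int targetThread) {
  if (targetThread >= 0) {
    // Object mapped to a thread: always enqueue on that thread's queue,
    // so only one thread ever touches the object and no locking is needed.
    node.affinityQueues[targetThread].push_back(pkt);
  } else {
    // Object mapped to the node: any worker may execute it, so the shared
    // queue (and the object's state) must be protected by a lock.
    std::lock_guard<std::mutex> guard(node.lock);
    node.nonAffinityQueue.push_back(pkt);
  }
}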

Applications on the Current System
LeanMD:
–Research-quality molecular dynamics
–Version 0: only electrostatics + van der Waals
Simple AMR kernel
–Adaptive tree generating millions of objects, each holding a 3D array
–Communication with "neighbors": the tree makes it harder to find neighbors, but Charm++ makes it easy

LeanMD
K-array molecular dynamics simulation using Charm++ chare arrays
–10x10x threads each
–11x11x11 cells
–cell-to-cell computes

Correction of Timestamps at Runtime
Timestamp:
–Per-thread timer
–Message arrival time is calculated at sending time, based on the number of hops and corners traversed
–The thread timer is updated when the message arrives
Correction is needed for out-of-order messages
–Correction messages are sent out
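A rough sketch of the arrival-time estimate this implies; the latency constants and the corner-counting rule are illustrative assumptions, not the emulator's actual timing model:

#include <cstdlib>

// Hypothetical per-hop / per-corner latency model for a 3D mesh.
const double HOP_LATENCY_US    = 0.005;   // assumed cost of each link traversed
const double CORNER_LATENCY_US = 0.010;   // assumed extra cost of each turn in the route

double estimateArrivalTime(double sendTime,
                           int sx, int sy, int sz,    // sender coordinates
                           int dx, int dy, int dz) {  // destination coordinates
  int hops = abs(dx - sx) + abs(dy - sy) + abs(dz - sz);
  // A dimension-ordered route turns once per dimension in which the nodes differ.
  int corners = (sx != dx) + (sy != dy) + (sz != dz) - 1;
  if (corners < 0) corners = 0;
  return sendTime + hops * HOP_LATENCY_US + corners * CORNER_LATENCY_US;
}

// On arrival, the receiving thread's virtual clock never runs backwards:
//   threadTimer = max(threadTimer, estimatedArrivalTime);
// Out-of-order arrivals trigger correction messages, as described above.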

Performance Analysis Tool: Projections

Example run: 200,000 atoms, using 4 simulating processors

Summary
Emulation of BG/C with millions of threads
–On conventional supercomputers and clusters
Charm++ on the BG emulator
–Legacy Charm++ applications run unchanged
–Load balancing (needs more research)
We have implemented multi-million-object applications using Charm++
–And tested them on the emulated Blue Gene/C
Obtaining accurate simulated timing data
More info:
–Both the emulator and BG Charm++ are available for download

Processor-in-Memory Architecture
Motivation:
–Growing gap between processor and memory performance
–Processor-centric optimizations that bridge the gap, such as prefetching, speculation, and multithreading, hide latency but lead to memory-bandwidth problems
–Logic close to memory 'may' provide high-bandwidth, low-latency access to memory
–Advances in fabrication technology make the integration of logic and memory practical
The dream: simple, cellular, scalable, inherently parallel PIM systems
–Mixing significant logic and memory on the same chip
–Enabling huge improvements in latency and bandwidth