1 Basic Charm++ and Load Balancing Gengbin Zheng charm.cs.uiuc.edu 10/11/2005

2 Charm++ Basics

3 Charm++
  - Parallel library for object-oriented C++ applications
  - Invoke functions remotely: messaging via remote method calls (like CORBA)
  - Communication through "proxy" objects
  - Methods are called by the scheduler; the system determines who runs next
  - Multiple objects per processor
  - Object migration fully supported, even with broadcasts and reductions

4 Virtualized Programming Model
  - User view: the user writes code in terms of communicating objects
  - System implementation: the system maps objects to processors

5 Chares – Concurrent Objects
  - Can be created dynamically on any available processor
  - Can be accessed from remote processors
  - Send messages to each other asynchronously
  - Contain "entry methods"

6 Charm++ Features: Object Arrays
  Applications are written as a set of communicating objects.
  (Figure: user's view of an object array A[0], A[1], A[2], A[3], ..., A[n])

7 Charm++ Features: Object Arrays
  Charm++ maps those objects onto processors, routing messages as needed.
  (Figure: user's view of A[0]..A[n] versus the system view, where elements such as A[0] and A[3] reside on particular processors)

8 Charm++ Features: Object Arrays
  Charm++ can re-map (migrate) objects for communication, load balance, fault tolerance, etc.
  (Figure: user's view versus system view after migration of elements such as A[0] and A[3])

9 Charm++ Array Definition
  Interface (.ci) file:
    array [1D] foo {
      entry foo(int problemNo);
      entry void bar(int x);
    };
  In a .C file:
    class foo : public CBase_foo {
    public:
      // Remote calls
      foo(int problemNo) { ... }
      void bar(int x) { ... }
      // Migration support:
      foo(CkMigrateMessage *m) {}
      void pup(PUP::er &p) { ... }
    };

10 Charm++ Remote Method Calls
  To call a method on a remote C++ object foo, use the local "proxy" C++ object CProxy_foo generated from the interface file.
  Interface (.ci) file:
    array [1D] foo {
      entry foo(int problemNo);
      entry void bar(int x);
    };
  In a .C file (CProxy_foo is the generated class, i selects the i'th object, and 17 is the method parameter):
    CProxy_foo someFoo = ...;
    someFoo[i].bar(17);
  This results in a network message, and eventually in a call to the real object's method, in another .C file:
    void foo::bar(int x) { ... }

11 Charm++ Startup Process: Main
  Interface (.ci) file:
    mainmodule myModule {
      array [1D] foo {
        entry foo(int problemNo);
        entry void bar(int x);
      };
      mainchare myMain {                      // special startup object
        entry myMain(int argc, char **argv);
      };
    };
  In a .C file (CBase_myMain is the generated class; the mainchare constructor is called at startup on PE 0):
    #include "myModule.decl.h"
    class myMain : public CBase_myMain {
    public:
      myMain(int argc, char **argv) {
        int nElements = 7, i = nElements/2;
        CProxy_foo f = CProxy_foo::ckNew(2, nElements);
        f[i].bar(3);
      }
    };
    #include "myModule.def.h"

12 "Hello World!"
  .ci file (charmc generates hello.decl.h and hello.def.h from it):
    mainmodule hello {
      mainchare mymain {
        entry mymain(CkArgMsg *m);
      };
    };
  .C file:
    #include "hello.decl.h"
    class mymain : public CBase_mymain {
    public:
      mymain(CkArgMsg *m) {
        ckout << "Hello World!" << endl;
        CkExit();
      }
    };
    #include "hello.def.h"

13 Compile and Run the Program
  Compiling: charmc (common flags: -o, -g, -language, -module, -tracemode)
    pgm: pgm.ci pgm.h pgm.C
      charmc pgm.ci
      charmc pgm.C
      charmc -o pgm pgm.o -language charm++
  To run a Charm++ program named "pgm" on four processors, type:
    charmrun pgm +p4
  Nodelist file (for network architectures): a list of machines to run the program on.
  Example nodelist file:
    group main ++shell ssh
      host Host1
      host Host2

14 Charm++: Portability
  Runs on:
  - Any machine with MPI, including IBM SP and Blue Gene/L
  - Cray XT3
  - Origin2000
  - PSC's Lemieux (Quadrics Elan)
  - Clusters with Ethernet (UDP/TCP)
  - Clusters with Myrinet (GM)
  - Clusters with Ammasso cards
  - Apple clusters
  - Even Windows!
  SMP-aware (pthreads)

15 Build Charm++
  - Download from the website (charm.cs.uiuc.edu)
  - Build Charm++: ./build [compile flags], for example
      ./build charm++ net-linux gm -g
    Parallel make is supported (-j2)
  - Compile code using charmc, a portable compiler wrapper; link with "-language charm++"
  - Run code using charmrun

16 How Charmrun Works
    charmrun +p4 ./pgm
  (Figure: charmrun launches the program processes via ssh; they connect back and acknowledge)

17 Charmrun (batch mode)
    charmrun ++batch 8
  (Figure: the same ssh launch, connect, and acknowledge sequence, with processes started in batches)

18 Debugging Charm++ Applications
  - printf
  - gdb
    - Sequentially (standalone mode): gdb ./pgm +vp16
    - Run the debugger in an xterm:
        charmrun +p4 pgm ++debug
        charmrun +p4 pgm ++debug-no-pause
  - Memory paranoid
  - Parallel debugger

19 Charm++ Features

20 Message-Driven Execution
  Virtualization leads to message-driven execution.
  (Figure: each processor runs a scheduler that picks the next message from its message queue)

21 Prioritized Messages
  - The number of priority bits is passed during message allocation:
      FooMsg *msg = new (size, nbits) FooMsg;
  - Priorities are stored at the end of messages
  - Signed integer priorities:
      *(int*)CkPriorityPtr(msg) = -1;
      CkSetQueueing(msg, CK_QUEUEING_IFIFO);
  - Unsigned bitvector priorities:
      ((unsigned int*)CkPriorityPtr(msg))[0] = 0x7fffffff;
      CkSetQueueing(msg, CK_QUEUEING_BFIFO);
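
  Putting the pieces above together, a minimal sketch of sending one prioritized message. It follows the allocation form shown on this slide; FooMsg, foo::process(), fooProxy, and i are hypothetical names used only for illustration.
    // Sketch only: assumes a message type FooMsg declared in the .ci file
    // ("message FooMsg;") and an entry method foo::process(FooMsg *).
    FooMsg *msg = new (0, 8*sizeof(int)) FooMsg;   // no extra data, 32 priority bits
    *(int*)CkPriorityPtr(msg) = -5;                // smaller integer = higher priority
    CkSetQueueing(msg, CK_QUEUEING_IFIFO);         // integer-FIFO queueing strategy
    fooProxy[i].process(msg);                      // scheduler delivers in priority order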

22 Advanced Message Features
  - Expedited messages: messages do not go through the Charm++ scheduler (faster); top-priority messages
  - Immediate messages: entries are executed in an interrupt or on the communication thread; very fast, but tough to get right

23 Object Migration

24 How to Migrate a Virtual Processor?
  Move all application state to the new processor:
  - Stack data (threads): subroutine variables and calls, managed by the compiler
  - Heap data: allocated with malloc/free, managed by the user
  - Global variables
  - Open files, environment variables, etc. (not handled yet!)

25 Migration Solutions
  - Stack data (threads): automatic, using isomalloc stacks
  - Heap data: use "-memory isomalloc", or write pup routines
  - Global variables: use "-swapglobals" (works on ELF platforms such as Linux and Sun; just a pointer swap, no data copying), or remove globals entirely

26 Migrate Heap Data: PUP
  Packing/unpacking of user-allocated data. The basic contract: "here is my data."
  - Sizing: counts up the data size
  - Packing: copies data into a message
  - Unpacking: copies data back out
  The same call works for network, memory, disk I/O, ...

27 Migrate Heap Data: PUP C++ Example
    #include "pup.h"
    #include "pup_stl.h"

    class myMesh {
      std::vector<float> nodes;   // element types assumed (float nodes, int elts),
      std::vector<int>   elts;    // matching the F90 example on the next slide
    public:
      ...
      void pup(PUP::er &p) {
        p|nodes;
        p|elts;
      }
    };

28 Migrate Heap Data: PUP F90 Example
    TYPE myMesh
      INTEGER :: nn, ne
      REAL*4, ALLOCATABLE :: nodes(:)
      INTEGER, ALLOCATABLE :: elts(:)
    END TYPE

    SUBROUTINE pupMesh(p, mesh)
      USE ...                       ! module providing the fpup_* routines
      INTEGER :: p
      TYPE(myMesh) :: mesh

      CALL fpup_int(p, mesh%nn)
      CALL fpup_int(p, mesh%ne)
      IF (fpup_isUnpacking(p)) THEN
        ALLOCATE(mesh%nodes(mesh%nn))
        ALLOCATE(mesh%elts(mesh%ne))
      END IF
      CALL fpup_floats(p, mesh%nodes, mesh%nn)
      CALL fpup_ints(p, mesh%elts, mesh%ne)
      IF (fpup_isDeleting(p)) CALL deleteMesh(mesh)
    END SUBROUTINE

29 Automatic Load Balancing

30 Motivation
  Irregular or dynamic applications:
  - Initial static load balancing is not enough; application behavior changes dynamically
  - Difficult to implement with good parallel efficiency
  Versatile, automatic load balancers:
  - Application independent
  - Little or no user effort is needed for load balancing
  - Work for both Charm++ and Adaptive MPI

31 Using Dynamic Mapping to Processors
  Migrate objects between processors, and use that for dynamic (and initial, static) load balancing.
  Two major approaches:
  - No predictability of load patterns: fully dynamic; early work on state-space search, branch-and-bound, ...
  - With some predictability: measurement-based load balancing strategy (CSE, molecular dynamics simulation)

32 Applications Lacking Predictability
  - The application generates a continuous flow of tasks
  - The goal of the load balancing strategy is to spread these tasks across the system for fast response time and better throughput
  - Tasks are assigned at creation time, with no migration afterwards

33 Seed Load Balancing
  Neighborhood averaging, with work stealing when idle, using immediate messages:
  - Load balancing among neighboring processors
  - Load is represented by the length of the task queue
  - Work stealing at idle time uses interruption-based (immediate) messages, for fast response to the steal request
  (Figure: results for a run with 10% heavy objects)
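
  The averaging decision itself is simple. Below is a minimal standalone sketch in plain C++ (not the actual Charm++ seed balancer code), assuming load is measured as queue length as on this slide: a processor sheds its surplus seeds to neighbors that sit below the neighborhood average.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // For each neighbor whose queue is shorter than the neighborhood average,
    // offload enough seeds to pull both sides toward that average.
    std::vector<int> seedsToSend(int myLoad, const std::vector<int> &neighborLoad) {
        int total = myLoad;
        for (int l : neighborLoad) total += l;
        int avg = total / (int)(neighborLoad.size() + 1);   // neighborhood average

        std::vector<int> send(neighborLoad.size(), 0);
        int surplus = myLoad > avg ? myLoad - avg : 0;      // only shed surplus work
        for (size_t i = 0; i < neighborLoad.size() && surplus > 0; ++i) {
            if (neighborLoad[i] < avg) {
                int give = std::min(surplus, avg - neighborLoad[i]);
                send[i] = give;
                surplus -= give;
            }
        }
        return send;
    }

    int main() {
        std::vector<int> neighbors = {2, 9, 1};             // neighbor queue lengths
        std::vector<int> out = seedsToSend(12, neighbors);  // local queue length = 12
        for (size_t i = 0; i < out.size(); ++i)
            std::printf("send %d seeds to neighbor %zu\n", out[i], i);
        return 0;
    }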

34 Link with a Seed Load Balancer
  Use -balance at link time:
    charmc -o pgm pgm.o -balance neighbor
  Specify the topology at run time with +LBTopo

35 Principle of Persistence
  Once an application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time, in spite of dynamic behavior:
  - Abrupt and large, but infrequent, changes (e.g. AMR)
  - Slow and small changes (e.g. particle migration)
  This is the parallel analog of the principle of locality: a heuristic that holds for most CSE applications and makes run-time instrumentation possible.

36 Measurement-Based Load Balancing
  Runtime instrumentation:
  - Measures CPU load per object
  - Measures communication volume between objects
  Measurement-based load balancers:
  - Use the instrumented database periodically to make new decisions
  - A load balancing strategy takes the database as input and generates a new object-to-processor mapping

37 Load Balancing – Graph Partitioning
  (Figure: Charm++ maps objects onto PEs; the load balancer sees a weighted object graph and chooses a new mapping of objects to processors)

38 Charm++ Load Balancer in Action
  (Figure: automatic load balancing in a crack propagation simulation)

39 Load Balancer Categories
  Centralized:
  - Object load data are sent to processor 0 and integrated into a complete object graph
  - Migration decisions are broadcast from processor 0
  - Requires a global barrier
  Distributed:
  - Load balancing among neighboring processors
  - Builds a partial object graph
  - Migration decisions are sent only to neighbors
  - No global barrier

40 Main Centralized Load Balancing Strategies
  - GreedyCommLB: a "greedy" strategy that uses object loads and the communication graph to map the most heavily loaded objects onto the least loaded processors, while trying to keep communicating objects on the same processor (the greedy idea is sketched after this slide)
  - RefineLB: incremental adjustment, moving objects off overloaded processors onto under-utilized ones until loads approach the average
  - MetisLB: uses the METIS graph partitioning library to partition the object-communication graph, with object weights on nodes and communication loads on edges
  - OrbLB: treats objects as having spatial coordinates and applies orthogonal recursive bisection to produce a balanced division of space
  - Others: the manual discusses several other load balancers that are used less often but may be useful in some cases; more are being developed
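
  The greedy idea behind GreedyCommLB (ignoring communication) can be illustrated with a short standalone sketch in plain C++, not the actual Charm++ implementation: sort objects by measured load and repeatedly place the heaviest remaining object on the currently least loaded processor.
    #include <algorithm>
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    // Greedy mapping: heaviest object first, onto the PE with the smallest total load.
    std::vector<int> greedyMap(std::vector<double> objLoad, int numPEs) {
        std::vector<int> order(objLoad.size());
        for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return objLoad[a] > objLoad[b]; });

        // Min-heap of (current load, PE id)
        using PE = std::pair<double, int>;
        std::priority_queue<PE, std::vector<PE>, std::greater<PE>> heap;
        for (int p = 0; p < numPEs; ++p) heap.push({0.0, p});

        std::vector<int> mapping(objLoad.size());
        for (int obj : order) {
            PE pe = heap.top(); heap.pop();
            mapping[obj] = pe.second;      // assign object to the least loaded PE
            pe.first += objLoad[obj];      // update that PE's total load
            heap.push(pe);
        }
        return mapping;
    }

    int main() {
        std::vector<double> load = {5.0, 1.0, 3.0, 2.0, 4.0};  // measured object loads
        std::vector<int> map = greedyMap(load, 2);
        for (size_t i = 0; i < map.size(); ++i)
            std::printf("object %zu -> PE %d\n", i, map[i]);
        return 0;
    }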

41 Load Balancing Strategies

42 Neighborhood Load Balancing Strategies
  - NeighborLB: each processor tries to average out its load only among its neighbors
  - WSLB: a load balancer for timeshared workstation clusters; it can detect load changes on desktops and adjust the load without interfering with others' use of the desktop

43 Compiler Interface
  Link-time options:
  - -module: link load balancers as modules; multiple modules can be linked into one binary
      -module EveryLB
  - -balancer: select a load balancer at link time
      -balancer GreedyCommLB
      -balancer RefineLB
      -balancer ComboCentLB:GreedyLB,RefineLB

44 Runtime Options
  Run-time options do the same thing, but override the compile-time options.
  +balancer: invoke a load balancer; multiple load balancers can be given:
    +balancer GreedyCommLB +balancer RefineLB

45 When to Re-balance Load?
  Programmer control: ReadyLoadBalance()
  - Enables load balancing at a specific point: the object is ready to migrate, and the runtime re-balances if needed
  - ReadyLoadBalance() is called when your chare is ready to be load balanced; load balancing may not start right away
  - ResumeFromSync() is called when load balancing for this chare has finished
  Default: the load balancer is periodic; provide the period as a runtime parameter (+LBPeriod)
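
  A schematic sketch of this pattern for an array element, using the calls named on this slide. doStep() and computeOneIteration() are hypothetical names, and the exact interface differs across Charm++ versions (newer versions use usesAtSync and AtSync()), so treat this as a sketch rather than verbatim API.
    // Schematic only: follows the calls named on this slide.
    class foo : public CBase_foo {
      int iter;
    public:
      foo(int problemNo) : iter(0) {}
      foo(CkMigrateMessage *m) {}

      void computeOneIteration();       // hypothetical application work

      void doStep() {                   // entry method driving the iteration loop
        computeOneIteration();
        if (++iter % 20 == 0)
          ReadyLoadBalance();           // object is ready to migrate; balancing
                                        // may not start right away
        else
          thisProxy[thisIndex].doStep();
      }

      void ResumeFromSync() {           // called when balancing for this chare is done
        thisProxy[thisIndex].doStep();  // continue with the next iteration
      }

      void pup(PUP::er &p) { p|iter; }  // migration support
    };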

46 Thank You!
  Free source, binaries, manuals, and more information at charm.cs.uiuc.edu
  Parallel Programming Lab at the University of Illinois