1 Basic Charm++ and Load Balancing Gengbin Zheng charm.cs.uiuc.edu 10/11/2005.

2 Charm++ Basics

3 Charm++
  - Parallel library for object-oriented C++ applications
  - Invoke functions remotely
    - Messaging via remote method calls (like CORBA)
    - Communication "proxy" objects
  - Methods called by scheduler
    - System determines who runs next
  - Multiple objects per processor
  - Object migration fully supported
    - Even with broadcasts, reductions

4 Virtualized Programming Model
  - User writes code in terms of communicating objects
  - System maps objects to processors
  [Figure: user view vs. system implementation]

5 Chares: Concurrent Objects
  - Can be dynamically created on any available processor
  - Can be accessed from remote processors
  - Send messages to each other asynchronously
  - Contain "entry methods"

6 Charm++ Features: Object Arrays
  - Applications are written as a set of communicating objects
  [Figure: user's view of the array elements A[0], A[1], A[2], A[3], ..., A[n]]

7 Charm++ Features: Object Arrays
  - Charm++ maps those objects onto processors, routing messages as needed
  [Figure: user's view (A[0], A[1], A[2], A[3], ..., A[n]) and system view (A[3] and A[0] placed on processors)]

8 Charm++ Features: Object Arrays
  - Charm++ can re-map (migrate) objects for communication, load balance, fault tolerance, etc.
  [Figure: user's view (A[0], A[1], A[2], A[3], ..., A[n]) and system view (A[3] and A[0] placed on processors)]

9 Charm++ Array Definition

  Interface (.ci) file:

    array[1D] foo {
      entry foo(int problemNo);
      entry void bar(int x);
    };

  In a .C file:

    class foo : public CBase_foo {
    public:
      // Remote calls
      foo(int problemNo) { ... }
      void bar(int x) { ... }
      // Migration support:
      foo(CkMigrateMessage *m) {}
      void pup(PUP::er &p) { ... }
    };

10 Charm++ Remote Method Calls
  To call a method on a remote C++ object foo, use the local "proxy" C++ object CProxy_foo generated from the interface file.

  Interface (.ci) file:

    array[1D] foo {
      entry foo(int problemNo);
      entry void bar(int x);
    };

  In a .C file:

    CProxy_foo someFoo = ...;   // CProxy_foo: generated class
    someFoo[i].bar(17);         // [i]: i'th object; bar(17): method and parameters

  This results in a network message, and eventually in a call to the real object's method.

  In another .C file:

    void foo::bar(int x) { ... }

11 Charm++ Startup Process: Main

  Interface (.ci) file (the mainchare is the special startup object):

    module myModule {
      array[1D] foo {
        entry foo(int problemNo);
        entry void bar(int x);
      }
      mainchare myMain {
        entry myMain(int argc, char **argv);
      }
    };

  In a .C file:

    #include "myModule.decl.h"
    class myMain : public CBase_myMain {   // CBase_myMain: generated class
    public:
      // Called at startup on PE 0
      myMain(int argc, char **argv) {
        int nElements = 7, i = nElements / 2;
        CProxy_foo f = CProxy_foo::ckNew(2, nElements);
        f[i].bar(3);
      }
    };
    #include "myModule.def.h"

12 "Hello World!"

  .ci file:

    mainmodule hello {
      mainchare mymain {
        entry mymain(CkArgMsg *m);
      };
    };

  Generates hello.decl.h and hello.def.h.

  .C file:

    #include "hello.decl.h"
    class mymain : public CBase_mymain {
    public:
      mymain(CkArgMsg *m) {
        ckout << "Hello World" << endl;
        CkExit();
      }
    };
    #include "hello.def.h"

13 Compile and Run the Program
  Compiling
    - charmc options: -o, -g, -language, -module, -tracemode
    - Makefile rule:
        pgm: pgm.ci pgm.h pgm.C
            charmc pgm.ci
            charmc pgm.C
            charmc -o pgm pgm.o -language charm++
  Running
    - To run a Charm++ program named "pgm" on four processors, type:
        charmrun pgm +p4
  Nodelist file (for network architecture)
    - List of machines on which to run the program, one host entry per machine
    - Example nodelist file:
        group main ++shell ssh
        host Host1
        host Host2

14 Charm++: Portability
  Runs on:
    - Any machine with MPI, including
      - IBM SP, Blue Gene/L
      - Cray XT3
      - Origin2000
    - PSC's Lemieux (Quadrics Elan)
    - Clusters with Ethernet (UDP/TCP)
    - Clusters with Myrinet (GM)
    - Clusters with Ammasso cards
    - Apple clusters
    - Even Windows!
  SMP-aware (pthreads)

15 Build Charm++
  - Download from the website: http://charm.cs.uiuc.edu/download.html
  - Build Charm++
      ./build [compile flags]
      ./build charm++ net-linux gm -g
    - Parallel make (-j2)
  - Compile code using charmc
    - Portable compiler wrapper
    - Link with "-language charm++"
  - Run code using charmrun

16 How Charmrun Works
    charmrun +p4 ./pgm
  [Figure labels: Charmrun, ssh, connect, acknowledge]

17 Charmrun (Batch Mode)
    charmrun ++batch 8
  [Figure labels: Charmrun, ssh, connect, acknowledge]

18 Debugging Charm++ Applications
  - printf
  - gdb
    - Sequentially (standalone mode):
        gdb ./pgm +vp16
    - Run debugger in xterm:
        charmrun +p4 pgm ++debug
        charmrun +p4 pgm ++debug-no-pause
  - Memory paranoid
  - Parallel debugger

19 Charm++ Features

20 Message Driven Execution
  - Virtualization leads to message-driven execution
  [Figure: a scheduler on each processor picking work from its message queue]
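To make the scheduler-and-queue picture concrete, here is a minimal sketch in plain C++, not Charm++ code. The `Scheduler` and `Counter` types and `run_demo` are hypothetical names invented for this illustration; the real runtime also handles networking, priorities, and many object types. The only point shown is that a method runs when the scheduler delivers its message, not when the sender calls it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for a chare: its methods run only when the
// scheduler delivers a message addressed to it.
struct Counter {
    int total = 0;
    void add(int x) { total += x; }
};

// Hypothetical single-processor scheduler: one FIFO queue of pending
// method invocations shared by every object on the processor.
class Scheduler {
public:
    // Enqueue an invocation and return immediately (asynchronous send).
    void send(std::function<void()> invocation) {
        assert(invocation != nullptr);
        queue_.push(std::move(invocation));
    }

    // Deliver messages one at a time until the queue is empty.
    // Returns the number of messages delivered.
    std::size_t run() {
        std::size_t delivered = 0;
        while (!queue_.empty()) {
            std::function<void()> next = std::move(queue_.front());
            queue_.pop();
            next();  // a handler may enqueue further messages
            ++delivered;
        }
        return delivered;
    }

private:
    std::queue<std::function<void()>> queue_;
};

// Two objects share one scheduler; the first handler sends a follow-up
// message to b. Returns {a.total, b.total, messages delivered}.
inline std::vector<int> run_demo() {
    Scheduler sched;
    Counter a, b;
    sched.send([&] { a.add(1); sched.send([&] { b.add(10); }); });
    sched.send([&] { b.add(2); });
    sched.send([&] { a.add(3); });
    const std::size_t delivered = sched.run();
    return {a.total, b.total, static_cast<int>(delivered)};
}
```

Because several objects share one queue, an object waiting for data does not block the processor: the scheduler simply runs whichever message is next.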

21 Prioritized Messages
  - Number of priority bits passed during message allocation:
      FooMsg *msg = new (size, nbits) FooMsg;
  - Priorities stored at the end of messages
  - Signed integer priorities:
      *CkPriorityPtr(msg) = -1;
      CkSetQueueing(msg, CK_QUEUEING_IFIFO);
  - Unsigned bitvector priorities:
      CkPriorityPtr(msg)[0] = 0x7fffffff;
      CkSetQueueing(msg, CK_QUEUEING_BFIFO);
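The effect of integer priorities with FIFO tie-breaking can be sketched with a standard-library heap. This is an illustration only, not the Charm++ queue implementation: `PendingMsg`, `ServedAfter`, `PriorityMsgQueue`, and `drain_demo` are hypothetical names, and the sketch assumes that numerically smaller priority values are served first and that equal priorities are served in arrival order.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <string>
#include <vector>

// Hypothetical message record: a signed integer priority plus a sequence
// number that preserves arrival order among equal priorities.
struct PendingMsg {
    int priority;
    std::uint64_t seq;
    std::string payload;
};

// True when a should be served after b: larger priority value first,
// then later arrival.
struct ServedAfter {
    bool operator()(const PendingMsg& a, const PendingMsg& b) const {
        if (a.priority != b.priority) return a.priority > b.priority;
        return a.seq > b.seq;
    }
};

class PriorityMsgQueue {
public:
    void enqueue(int priority, const std::string& payload) {
        heap_.push(PendingMsg{priority, next_seq_++, payload});
    }

    bool empty() const { return heap_.empty(); }

    // Remove and return the payload that should run next.
    std::string dequeue() {
        assert(!heap_.empty());
        std::string payload = heap_.top().payload;
        heap_.pop();
        return payload;
    }

private:
    std::priority_queue<PendingMsg, std::vector<PendingMsg>, ServedAfter> heap_;
    std::uint64_t next_seq_ = 0;
};

// Drain order for a small mixed workload.
inline std::vector<std::string> drain_demo() {
    PriorityMsgQueue q;
    q.enqueue(0, "normal-1");
    q.enqueue(-1, "urgent");
    q.enqueue(5, "background");
    q.enqueue(0, "normal-2");
    std::vector<std::string> order;
    while (!q.empty()) order.push_back(q.dequeue());
    return order;
}
```

In this sketch the message with priority -1 overtakes everything enqueued before it, while the two priority-0 messages keep their arrival order.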

22 Advanced Message Features
  - Expedited messages
    - Messages do not go through the Charm++ scheduler (faster)
    - Top-priority messages
  - Immediate messages
    - Entries are executed in an interrupt or the communication thread
    - Very fast, but tough to get right

23 Object Migration

24 How to Migrate a Virtual Processor?
  - Move all application state to the new processor
  - Stack data (threads)
    - Subroutine variables and calls
    - Managed by compiler
  - Heap data
    - Allocated with malloc/free
    - Managed by user
  - Global variables
  - Open files, environment variables, etc. (not handled yet!)

25 Migration Solutions
  - Stack data (threads)
    - Automatic: isomalloc stacks
  - Heap data
    - Use "-memory isomalloc" -or-
    - Write pup routines
  - Global variables
    - Use "-swapglobals"
      - Works on ELF platforms (Linux and Sun)
      - Just a pointer swap, no data copying
    - -or- Remove globals entirely

26 Migrate Heap Data: PUP
  - Packing/unpacking user-allocated data
  - Basic contract: "here is my data"
    - Sizing: counts up data size
    - Packing: copies data into message
    - Unpacking: copies data back out
  - Same call works for network, memory, disk I/O, ...
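The sizing, packing, and unpacking passes can be illustrated with a miniature in plain C++. This is a sketch of the idea only and does not use the real PUP::er class: `MiniPupper`, `Mesh`, and `roundtrip` are hypothetical names, and the byte-copy approach assumes trivially copyable element types. What it shows is how a single `pup` routine serves all three passes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical miniature of the pup idea (not the real PUP::er class):
// one object that sizes, packs, or unpacks depending on its mode.
class MiniPupper {
public:
    enum Mode { Sizing, Packing, Unpacking };

    explicit MiniPupper(Mode mode, std::vector<char>* buffer = nullptr)
        : mode_(mode), buffer_(buffer) {}

    bool isUnpacking() const { return mode_ == Unpacking; }
    std::size_t size() const { return offset_; }

    // Process n raw bytes at p according to the current mode.
    void bytes(void* p, std::size_t n) {
        if (n == 0) return;
        if (mode_ == Packing) {
            assert(buffer_ && offset_ + n <= buffer_->size());
            std::memcpy(buffer_->data() + offset_, p, n);
        } else if (mode_ == Unpacking) {
            assert(buffer_ && offset_ + n <= buffer_->size());
            std::memcpy(p, buffer_->data() + offset_, n);
        }
        offset_ += n;  // in Sizing mode only the byte count advances
    }

private:
    Mode mode_;
    std::vector<char>* buffer_;
    std::size_t offset_ = 0;
};

// Hypothetical user type that owns heap data.
struct Mesh {
    std::vector<float> nodes;
    std::vector<int> elts;

    // The single routine used for all three passes.
    void pup(MiniPupper& p) {
        std::size_t nn = nodes.size(), ne = elts.size();
        p.bytes(&nn, sizeof nn);
        p.bytes(&ne, sizeof ne);
        if (p.isUnpacking()) {  // allocate before the data is copied in
            nodes.resize(nn);
            elts.resize(ne);
        }
        p.bytes(nodes.data(), nn * sizeof(float));
        p.bytes(elts.data(), ne * sizeof(int));
    }
};

// Size, pack, then unpack into a fresh object, as a migration would.
inline Mesh roundtrip(Mesh src) {
    MiniPupper sizer(MiniPupper::Sizing);
    src.pup(sizer);
    std::vector<char> buffer(sizer.size());
    MiniPupper packer(MiniPupper::Packing, &buffer);
    src.pup(packer);
    Mesh dst;
    MiniPupper unpacker(MiniPupper::Unpacking, &buffer);
    dst.pup(unpacker);
    return dst;
}
```

Writing the field list once keeps the three passes in step: a field added to `pup` is automatically counted, packed, and restored in the same order.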

27 Migrate Heap Data: PUP C++ Example

    #include "pup.h"
    #include "pup_stl.h"

    class myMesh {
      // Element types assumed here, matching the F90 example
      std::vector<float> nodes;
      std::vector<int> elts;
    public:
      ...
      void pup(PUP::er &p) {
        p|nodes;
        p|elts;
      }
    };

28 Migrate Heap Data: PUP F90 Example

    TYPE myMesh
      INTEGER :: nn, ne
      REAL*4, ALLOCATABLE :: nodes(:)
      INTEGER, ALLOCATABLE :: elts(:)
    END TYPE

    SUBROUTINE pupMesh(p, mesh)
      USE MODULE...
      INTEGER :: p
      TYPE(myMesh) :: mesh
      CALL fpup_int(p, mesh%nn)
      CALL fpup_int(p, mesh%ne)
      IF (fpup_isUnpacking(p)) THEN
        ALLOCATE(mesh%nodes(mesh%nn))
        ALLOCATE(mesh%elts(mesh%ne))
      END IF
      CALL fpup_floats(p, mesh%nodes, mesh%nn)
      CALL fpup_ints(p, mesh%elts, mesh%ne)
      IF (fpup_isDeleting(p)) CALL deleteMesh(mesh)
    END SUBROUTINE

29 Automatic Load Balancing

30 Motivation
  - Irregular or dynamic applications
    - Initial static load balancing
    - Application behaviors change dynamically
    - Difficult to implement with good parallel efficiency
  - Versatile, automatic load balancers
    - Application independent
    - Little or no user effort is needed for load balancing
    - Works for both Charm++ and Adaptive MPI

31 Using Dynamic Mapping to Processors
  - Migrate objects between processors
    - Use that for dynamic (and static, initial) load balancing
  - Two major approaches
    - No predictability of load patterns
      - Fully dynamic
      - Early work on State Space Search, Branch&Bound, ...
    - With certain predictability
      - Measurement-based load balancing strategy
      - CSE, molecular dynamics simulation

32 Applications Lacking Predictability
  - Flow of tasks: the application generates a continuous flow of tasks
  - The goal of the load balancing strategies is to balance these tasks across the system for fast response time and better throughput
  - Tasks are assigned at creation time, with no migration afterwards

33 Seed Load Balancing
  - Neighborhood averaging with work-stealing when idle, using immediate messages
    - Load balancing among neighboring processors
    - Load is represented by length of queue
    - Work-stealing at idle time with interruption-based messages
    - Fast response to the request
  [Figure: 80000 objects, 10% heavy objects]

34 Link with a Seed Load Balancer
  - Use -balance:
      charmc -o pgm pgm.o -balance neighbor
  - Specify topology: +LBTopo

35 Principle of Persistence
  - Once an application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time
  - In spite of dynamic behavior
    - Abrupt and large, but infrequent changes (e.g., AMR)
    - Slow and small changes (e.g., particle migration)
  - Parallel analog of the principle of locality
    - A heuristic that holds for most CSE applications
  - Run-time instrumentation is possible

36 Measurement-Based Load Balancing
  - Runtime instrumentation
    - Measures CPU load per object
    - Measures communication volume between objects
  - Measurement-based load balancers
    - Use the instrumented database periodically to make new decisions
    - A load balancing strategy takes the database as input and generates a new object-to-processor mapping
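As an illustration of what such a database lets a strategy compute, the sketch below scores a candidate object-to-processor mapping. It is plain C++ under assumed, simplified data structures: `LBDatabase`, `ObjectLoad`, `CommEdge`, `MappingStats`, and `evaluate` are hypothetical names and do not reflect the layout of the actual Charm++ load balancing database.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified view of the instrumented database.
struct ObjectLoad { double cpu_seconds; };               // measured CPU load per object
struct CommEdge { std::size_t from, to; double bytes; }; // measured traffic between two objects

struct LBDatabase {
    std::vector<ObjectLoad> objects;
    std::vector<CommEdge> edges;
};

struct MappingStats {
    std::vector<double> pe_load;  // summed object load per processor
    double max_over_avg;          // 1.0 means perfectly balanced
    double remote_bytes;          // traffic that crosses processors
};

// Score a candidate mapping, where mapping[i] is the processor of object i.
inline MappingStats evaluate(const LBDatabase& db,
                             const std::vector<std::size_t>& mapping,
                             std::size_t num_pes) {
    assert(num_pes > 0 && mapping.size() == db.objects.size());
    MappingStats s{std::vector<double>(num_pes, 0.0), 1.0, 0.0};

    double total = 0.0;
    for (std::size_t i = 0; i < db.objects.size(); ++i) {
        assert(mapping[i] < num_pes);
        s.pe_load[mapping[i]] += db.objects[i].cpu_seconds;
        total += db.objects[i].cpu_seconds;
    }
    for (const CommEdge& e : db.edges) {
        if (mapping[e.from] != mapping[e.to]) s.remote_bytes += e.bytes;
    }

    const double avg = total / static_cast<double>(num_pes);
    const double max = *std::max_element(s.pe_load.begin(), s.pe_load.end());
    if (avg > 0.0) s.max_over_avg = max / avg;
    return s;
}
```

The two numbers pull in different directions: spreading objects evenly lowers `max_over_avg` but can raise `remote_bytes`, which is the trade-off a strategy has to weigh.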

37 Load Balancing: Graph Partitioning
  [Figure: mapping of objects onto Charm++ PEs, next to the weighted object graph as seen by the load balancer (LB view)]

38 Charm++ Load Balancer in Action
  [Figure: automatic load balancing in crack propagation]

39 Load Balancer Categories
  - Centralized
    - Object load data are sent to processor 0
    - Integrated into a complete object graph
    - Migration decision is broadcast from processor 0
    - Global barrier
  - Distributed
    - Load balancing among neighboring processors
    - Build partial object graph
    - Migration decision is sent to its neighbors
    - No global barrier

40 Main Centralized Load Balancing Strategies
  - GreedyCommLB: a "greedy" load balancing strategy that uses the process load and communication graph to map the processes with the highest load onto the processors with the lowest load, while trying to keep communicating processes on the same processor
  - RefineLB: incremental adjustment that moves objects off overloaded processors onto under-utilized processors to reach the average load
  - MetisLB: uses the METIS graph partitioning library to partition the object-communication graph, with object loads as node weights and communication loads as edge weights
  - OrbLB: for objects with spatial coordinates; applies an orthogonal recursive bisection algorithm that attempts to provide a more balanced division of space
  - Others: the manual discusses several other load balancers that are used less often but may be useful in some cases; more are being developed
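The greedy idea (heaviest work first, onto the currently lightest processor) can be sketched in a few lines of plain C++. This is a simplified illustration, not the Charm++ implementation: `greedy_map` is a hypothetical name, and unlike GreedyCommLB the sketch considers only computational load and ignores the communication graph.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <queue>
#include <utility>
#include <vector>

// Returns mapping[i] = processor chosen for object i.
// Simplification: only object load is considered, not communication.
inline std::vector<std::size_t> greedy_map(const std::vector<double>& object_load,
                                           std::size_t num_pes) {
    assert(num_pes > 0);

    // Visit objects from heaviest to lightest (stable for equal loads).
    std::vector<std::size_t> order(object_load.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return object_load[a] > object_load[b];
                     });

    // Min-heap of (current load, processor id): top is the lightest processor.
    using PeEntry = std::pair<double, std::size_t>;
    std::priority_queue<PeEntry, std::vector<PeEntry>, std::greater<PeEntry>> pes;
    for (std::size_t pe = 0; pe < num_pes; ++pe) pes.push(PeEntry(0.0, pe));

    std::vector<std::size_t> mapping(object_load.size());
    for (std::size_t obj : order) {
        PeEntry lightest = pes.top();
        pes.pop();
        mapping[obj] = lightest.second;
        lightest.first += object_load[obj];
        pes.push(lightest);
    }
    return mapping;
}
```

A strategy of this kind rebuilds the whole mapping from scratch, so it may move many objects; a refinement strategy such as RefineLB instead starts from the current mapping and moves only a few.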

41 Load Balancing Strategies

42 Neighborhood Load Balancing Strategies
  - NeighborLB: each processor tries to average out its load only among its neighbors
  - WSLB: a load balancer for timeshared workstation clusters, which can detect load changes on desktops and adjust load without interfering with others' use of the desktop

43 Compiler Interface
  Link-time options
    -module: link load balancers as modules
        -module EveryLB
      Link multiple modules into the binary
    -balancer:
        -balancer GreedyCommLB
        -balancer RefineLB
        -balancer ComboCentLB:GreedyLB,RefineLB

44 Runtime Options
  - Run-time options do the same thing, but override the compile-time options
  - +balancer: invoke a load balancer
  - Can have multiple load balancers:
      +balancer GreedyCommLB +balancer RefineLB

45 When to Re-balance Load?
  - Programmer control: ReadyLoadBalance()
    - Enable load balancing at a specific point
      - Object ready to migrate
      - Re-balance if needed
    - ReadyLoadBalance() is called when your chare is ready to be load balanced; load balancing may not start right away
    - ResumeFromSync() is called when load balancing for this chare has finished
  - Default: load balancer is periodic
    - Provide the period as a runtime parameter (+LBPeriod)

46 Thank You!
  Free source, binaries, manuals, and more information at: http://charm.cs.uiuc.edu/
  Parallel Programming Lab at University of Illinois
