1 Advanced Charm++ Tutorial
Charm Workshop Tutorial
Sameer Kumar, Orion Sky Lawlor
charm.cs.uiuc.edu, 2004/10/19
2 How to Become a Charm++ Hacker: Advanced Charm++
- Advanced messaging
- Writing system libraries: groups, delegation, communication framework
- Advanced load balancing
- Checkpointing
- Threads
- SDAG
3 Advanced Messaging
4 Prioritized Execution
- If several messages are available, Charm++ will process the message with the highest priority; otherwise, the oldest message (FIFO)
- Has no effect if only one message is available (common for network-bound applications!) or on outgoing messages
- Very useful for speculative work, ordering timesteps, etc.
5 Priority Classes
The Charm++ scheduler has three queues: high, default, and low.
- As signed integer priorities: -MAXINT (highest) up through -1, then 0 (default), then +1 down through +MAXINT (lowest)
- As unsigned bitvector priorities: 0x0000 (highest) up through 0x7FFF, then 0x8000 (default), then 0x8001 down through 0xFFFF (lowest)
6 Prioritized Marshalled Messages
Pass a CkEntryOptions object as the last parameter.
For signed integer priorities:
  CkEntryOptions opts;
  opts.setPriority(-1);
  fooProxy.bar(x, y, opts);
For bitvector priorities:
  CkEntryOptions opts;
  unsigned int prio[2] = {0x7FFFFFFF, 0xFFFFFFFF};
  opts.setPriority(64, prio);
  fooProxy.bar(x, y, opts);
7 Prioritized Messages
The number of priority bits is passed during message allocation:
  FooMsg *msg = new (size, nbits) FooMsg;
Priorities are stored at the end of the message.
Signed integer priorities:
  *(int *)CkPriorityPtr(msg) = -1;
  CkSetQueueing(msg, CK_QUEUEING_IFIFO);
Unsigned bitvector priorities:
  ((unsigned int *)CkPriorityPtr(msg))[0] = 0x7fffffff;
  CkSetQueueing(msg, CK_QUEUEING_BFIFO);
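Putting the slide together into one flow, a minimal sketch; FooMsg, fooProxy, and the entry void bar(FooMsg *) are placeholder names, and this message is assumed to have no varsize arrays, so only the priority-bit count is passed to new:
  // Allocate room for one machine word (8*sizeof(int) bits) of priority.
  FooMsg *msg = new (8*sizeof(int)) FooMsg;
  *(int *)CkPriorityPtr(msg) = -1;         // above the default priority of 0
  CkSetQueueing(msg, CK_QUEUEING_IFIFO);   // signed-integer FIFO queueing
  fooProxy.bar(msg);                       // the priority travels with the message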
8 Advanced Message Features
- Read-only messages: the entry method agrees not to modify or delete the message; this avoids a message copy for broadcasts, saving time
- Expedited messages: the message does not go through the Charm++ scheduler (faster)
- Immediate messages: entries are executed in an interrupt or on the communication thread; very fast, but tough to get right
9 Read-Only, Expedited, Immediate
All are declared in the .ci file:
  entry [nokeep] void foo_readonly(Msg *);
  entry [expedited] void foo_exp(Msg *);
  entry [immediate] void foo_imm(Msg *);
// Immediate messages currently work only for NodeGroups
10 Groups
11 Object Groups
- A collection of objects (chares), also called branch-office chares
- Exactly one representative on each processor; ideally suited for system libraries
- A single proxy for the group as a whole
- Similar to arrays: broadcasts, reductions, indexing
- But not completely like arrays: non-migratable, one per processor
12 Declarations
.ci file:
  group mygroup {
    entry mygroup();           //Constructor
    entry void foo(foomsg *);  //Entry method
  };
C++ file:
  class mygroup : public Group {
    mygroup() {}
    void foo(foomsg *m) { CkPrintf("Do Nothing"); }
  };
13 Creating and Calling Groups
Creation:
  p = CProxy_mygroup::ckNew();
Remote invocation:
  p.foo(msg);     //broadcast
  p[1].foo(msg);  //asynchronous invocation
Direct local access:
  mygroup *g = p.ckLocalBranch();
  g->foo(...);    //local invocation
Danger: if you migrate, the group stays behind!
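Because the group branch stays behind when an array element migrates, a cached mygroup* pointer can become stale. A minimal sketch of the safer pattern, assuming a hypothetical migratable array element that stores the group proxy rather than the raw pointer:
  class myElement : public ArrayElement1D {   // hypothetical array element
    CProxy_mygroup groupProxy;                // safe to keep and to pup
  public:
    void doWork(foomsg *m) {
      // Look up the branch on whatever processor we are on right now,
      // instead of caching the mygroup* across possible migrations.
      mygroup *g = groupProxy.ckLocalBranch();
      g->foo(m);                              // direct local invocation
    }
  };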
14 Delegation
15 Delegation
- Enables Charm++ proxy messages to be forwarded to a delegation manager group
- The delegation manager can trap calls to proxy sends and apply optimizations
- A delegation manager must inherit from CkDelegateMgr
- The user program must call proxy.ckDelegate(mgrID);
16 Delegation Interface
.ci file:
  group MyDelegateMgr {
    entry MyDelegateMgr();  //Constructor
  };
.h file:
  class MyDelegateMgr : public CkDelegateMgr {
    MyDelegateMgr();
    void ArraySend(..., int ep, void *m, const CkArrayIndexMax &idx, CkArrayID a);
    void ArrayBroadcast(...);
    void ArraySectionSend(..., CkSectionID &s);
    ...
  };
17 Communication Optimization
18 Automatic Communication Optimizations
- The parallel-objects runtime system can observe, instrument, and measure communication patterns
- Communication libraries can optimize by substituting the most suitable algorithm for each operation, learning at runtime
- Example: all-to-all communication. Performance depends on many runtime characteristics, so the library switches between different algorithms
- Communication is from/to objects, not processors
- Streaming-messages optimization
19 Managing Collective Communication
- A communication operation in which all (or most of) the processors participate, for example broadcast, barrier, all-reduce, all-to-all
- Applications: NAMD multicast, NAMD PME, CPAIMD
- Issues: collectives are a performance impediment; naïve implementations often do not scale; synchronous implementations do not utilize the co-processor effectively
20 All-to-All Communication
- All processors send data to all other processors
- All-to-all personalized communication (AAPC): MPI_Alltoall
- All-to-all multicast/broadcast (AAMC): MPI_Allgather
21 Strategies for AAPC
- Short messages: dominated by high per-message software overhead (α); addressed by message combining
- Large messages: dominated by network contention
22 Short Message Optimizations
- Direct all-to-all communication is α (overhead) dominated
- Message combining for small messages reduces the total number of messages
- A multistage algorithm sends messages along a virtual topology: groups of messages are combined and sent to an intermediate processor, which then forwards them to their final destinations
- The AAPC strategy may send the same message multiple times
23 Virtual Topology: Mesh
- Organize the processors in a 2D (virtual) mesh
- Phase 1: processors send messages to their row neighbors; a message from (x1,y1) to (x2,y2) goes via (x1,y2)
- Phase 2: processors send messages to their column neighbors
- Roughly 2*sqrt(P) messages per processor instead of P-1
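As a concrete check of that message count (numbers mine, not from the slides): with P = 64 processors arranged as an 8 x 8 mesh, each processor sends sqrt(P) - 1 = 7 combined messages in the row phase and 7 more in the column phase, 14 messages in total instead of the P - 1 = 63 messages of a direct all-to-all; the same bytes still move, so the scheme wins when the per-message overhead α dominates.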
24 AAPC Performance
25 Large Message Issues
- Network contention
- Contention-free schedules
- Topology-specific optimizations
26 Ring Strategy for Collective Multicast
- Performs all-to-all multicast by sending messages along a ring formed by the processors (0, 1, 2, ..., i, i+1, ..., P-1)
- Congestion-free on most topologies
27 Streaming Messages
- Programs often have streams of short messages
- The streaming library combines a batch of messages and sends them off together
- Strips the large Charm++ header; short array-message packing
- Effective per-message performance of about 3 us
28 Using the Communication Library
Communication optimizations are embodied as strategies:
- EachToManyMulticastStrategy
- RingMulticast
- PipeBroadcast
- Streaming
- MeshStreaming
29 Bracketed vs. Non-bracketed
- Bracketed strategies require the user to mark specific end points for each iteration of message sends; the end points are declared by calling ComlibBegin() and ComlibEnd(). Example: EachToManyMulticast
- Non-bracketed strategies need no such end points. Examples: Streaming, PipeBroadcast
30 Accessing the Communication Library
From mainchare::main:
Creating a strategy, for example:
  Strategy *strat = new EachToManyMulticastStrategy(USE_MESH);
or
  Strategy *strat = new StreamingStrategy();
  strat->enableShortMessagePacking();
Associating a proxy with a strategy:
  ComlibAssociateProxy(strat, myproxy);
myproxy should then be passed to all array elements.
31 Sending Messages
  ComlibBegin(myproxy);  // bracketed strategies only
  for (...) {
    ...
    myproxy.foo(msg);
    ...
  }
  ComlibEnd();           // bracketed strategies only
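Putting slides 30 and 31 together, a minimal sketch of driving an all-to-all through the library; myproxy is the chare array proxy from the slides, while numElements, dest, and FooMsg are placeholder names:
  // One-time setup, e.g. in mainchare::main: pick a strategy, attach the proxy.
  Strategy *strat = new EachToManyMulticastStrategy(USE_MESH);
  ComlibAssociateProxy(strat, myproxy);

  // Each iteration, bracket the sends so the strategy knows when to flush.
  ComlibBegin(myproxy);
  for (int dest = 0; dest < numElements; dest++) {
    myproxy[dest].foo(new FooMsg);   // sends are combined and routed by the strategy
  }
  ComlibEnd();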
32 Handling Migration
A migrating array element PUPs the comlib-associated proxy:
  FooArray::pup(PUP::er &p) {
    ...
    p | myProxy;
    ...
  }
33 Compiling
You must include the compile-time option -module commlib
34 Advanced Load-balancers: Writing a Load-balancing Strategy
35 Advanced load balancing: Writing a new strategy
Inherit from CentralLB and implement the work(...) function:
  class fooLB : public CentralLB {
  public:
    ...
    void work(CentralLB::LDStats *stats, int count);
    ...
  };
36 LB Database
  struct LDStats {
    ProcStats *procs;
    LDObjData *objData;
    LDCommData *commData;
    int *to_proc;
    int n_objs;   // number of objects (field name assumed; elided on the slide)
    //...
  };
A dummy work function that assigns all objects to processor 0 (don't actually implement it this way!):
  void fooLB::work(CentralLB::LDStats *stats, int count) {
    for (int obj = 0; obj < stats->n_objs; obj++)
      stats->to_proc[obj] = 0;
  }
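For contrast, a minimal sketch of a work() function that spreads objects round-robin across processors, assuming count is the number of available processors and using the n_objs/to_proc fields shown above; a real strategy would also use the measured loads in procs and objData:
  void fooLB::work(CentralLB::LDStats *stats, int count) {
    for (int obj = 0; obj < stats->n_objs; obj++)
      stats->to_proc[obj] = obj % count;   // round-robin over the count processors
  }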
37 Compiling and Integration
- Edit and run Makefile_lb.sh; this creates Make.lb, which is included by the LDB Makefile
- Run make depends to correct the dependencies
- Rebuild Charm++
38 Checkpoint Restart
39 Checkpoint/Restart
- Any long-running application must be able to save its state
- When you checkpoint an application, it uses the pup routines to store the state of all objects
- State information is saved in a directory of your choosing
- Restore also uses pup, so no additional application code is needed (pup is all you need)
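As a reminder of what such a pup routine looks like, a minimal sketch for a hypothetical object whose state is two scalars and a fixed-size array (names are illustrative, not from the tutorial):
  class Cell {
    int step;
    double energy;
    double temps[8];
  public:
    void pup(PUP::er &p) {
      p | step;               // operator| sizes, packs, or unpacks as appropriate
      p | energy;
      PUParray(p, temps, 8);  // helper for fixed-size arrays
    }
  };
For a chare or array element the same routine serves both migration and checkpointing; the element should also forward to its base class's pup routine.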
40 Checkpointing a Job
- In AMPI, use MPI_Checkpoint(...), passing the checkpoint directory; a collective call that returns when the checkpoint is complete
- In Charm++, use CkCheckpoint(..., ...), passing the checkpoint directory and a resume callback; it is called on one processor and invokes the resume callback when the checkpoint is complete
41 Restart Job from Checkpoint
- The charmrun option ++restart is used to restart
- The number of processors need not be the same
- You can also restart groups by marking them migratable and writing a PUP routine; they still will not load balance, though
42 Threads
43 Why use Threads?
- They provide one key feature: blocking. Suspend execution (e.g., at a message receive), do something else, resume later (e.g., after the message arrives). Example: MPI_Recv, MPI_Wait semantics
- A function-call interface is more convenient than message passing: regular call/return structure (no CkCallbacks), and it allows blocking in the middle of a deeply nested communication subroutine
44 Why not use Threads?
- Slower: around 1 us of context-switching overhead is unavoidable; creation/deletion costs perhaps 10 us
- More complexity, more bugs
- Breaks a lot of machines! (but we have workarounds)
- Migration is more difficult: the state of a thread is scattered through the stack, which is maintained by the compiler; by contrast, the state of an object is maintained by the user
- These thread disadvantages are the motivation for SDAG (later)
45 What are (Charm) Threads?
- One flow of control (instruction stream): machine registers and program counter, plus an execution stack
- Like pthreads (kernel threads), only different: implemented at user level (in Converse), scheduled at user level (non-preemptive), and migratable between nodes
46 How do I use Threads?
Many options:
- AMPI: always uses threads via the TCharm library
- Charm++: [threaded] entry methods run in a thread; [sync] methods
- Converse: C routines CthCreate/CthSuspend/CthAwaken; everything else is built on these
Implemented using SYSV makecontext/setcontext or POSIX setjmp/alloca/longjmp
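A minimal sketch of the Charm++ route, assuming a hypothetical chare Worker; run() on one instance blocks on a [sync] call to another instance reached through otherProxy (a CProxy_Worker), and ValueMsg is a placeholder message type:
  // .ci file (sketch)
  chare Worker {
    entry Worker();
    entry [threaded] void run();        // runs in its own user-level thread
    entry [sync] ValueMsg *getValue();  // caller blocks until this returns
  };

  // C++ file (sketch)
  void Worker::run() {
    // Because run() is [threaded], this blocking [sync] call suspends only
    // this user-level thread; the processor keeps scheduling other messages.
    ValueMsg *m = otherProxy.getValue();
    CkPrintf("got %d\n", m->value);
    delete m;
  }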
47 How do I use Threads? (example)
Blocking API routine: find the array element
  int requestFoo(int src) {
    myObject *obj = ...;
    return obj->fooRequest(src);
  }
Send the request and suspend:
  int myObject::fooRequest(int src) {
    proxy[dest].fooNetworkRequest(thisIndex);
    stashed_thread = CthSelf();
    CthSuspend();  // blocks until the awaken call
    return stashed_return;
  }
Awaken the thread when the data arrives:
  void myObject::fooNetworkResponse(int ret) {
    stashed_return = ret;
    CthAwaken(stashed_thread);
  }
48 How do I use Threads? (example)
Send request, suspend, receive, awaken, return:
  int myObject::fooRequest(int src) {
    proxy[dest].fooNetworkRequest(thisIndex);
    stashed_thread = CthSelf();
    CthSuspend();
    return stashed_return;
  }
  void myObject::fooNetworkResponse(int ret) {
    stashed_return = ret;
    CthAwaken(stashed_thread);
  }
49 The Horror of Thread Migration
50 Stack Data
- The stack is used by the compiler to track function calls and provide temporary storage: local variables, subroutine parameters, C alloca storage
- Most of the variables in a typical application are stack data
- Users have no control over how the stack is laid out
51 Migrate Stack Data
- Without compiler support, we cannot change the stack's address, because we cannot fix up the stack's interior pointers (return frame pointers, function arguments, etc.)
- Solution: "isomalloc" addresses. Reserve address space on every processor for every thread stack, and use mmap to scatter the stacks in virtual memory efficiently
- The idea comes from PM2
52 Migrate Stack Data (diagram: processor A's memory holds thread stacks 1-4 alongside code, globals, and heap in the 0x00000000-0xFFFFFFFF address space; thread 3 migrates to processor B's memory)
53 Migrate Stack Data: Isomalloc (diagram: after migration, thread 3's stack occupies the same reserved virtual addresses on processor B that it held on processor A)
54 Migrate Stack Data
- Isomalloc is a completely automatic solution: no changes needed in the application or the compilers; just like a software shared-memory system, but with proactive paging
- But it has a few limitations
- It depends on having large quantities of virtual address space (best on 64-bit); 32-bit machines can only have a few gigabytes of isomalloc stacks across the whole machine
- It depends on unportable mmap: which addresses are safe? (We must guess!) What about Windows? Or Blue Gene?
55 Aliasing Stack Data (diagram: thread 2 and thread 3 stacks stored within processor A's memory, alongside code, globals, and heap on both processors)
56 Aliasing Stack Data: Run Thread 2 (diagram: an execution copy of thread 2's stack is placed at the fixed stack address to run it)
57 Aliasing Stack Data (diagram: memory layout after thread 2 finishes running)
58 Aliasing Stack Data: Run Thread 3 (diagram: an execution copy of thread 3's stack is placed at the fixed stack address to run it)
59 Aliasing Stack Data (diagram: thread 3 migrates from processor A to processor B)
60 Aliasing Stack Data (diagram: memory layout after the migration)
61 Aliasing Stack Data (diagram: an execution copy of thread 3's stack runs on processor B)
62 Aliasing Stack Data
- Does not depend on having large quantities of virtual address space, so it works well on 32-bit machines
- Requires only one mmap'd region at a time; works even on Blue Gene!
- Downsides: each thread context switch requires an munmap/mmap (about 3 us), and only one thread can be running at a time (so no SMPs!)
63 Heap Data
- Heap data is any dynamically allocated data: C malloc and free, C++ new and delete, F90 ALLOCATE and DEALLOCATE
- Arrays and linked data structures are almost always heap data
64 Migrate Heap Data
- Automatic solution: isomalloc all heap data, just like the stacks! Use the "-memory isomalloc" link option; it overrides malloc/free and needs no new application code. Same limitations as isomalloc stacks, plus page-granularity allocation (huge!)
- Manual solution: the application moves its own heap data. You need to be able to size the message buffer, pack the data into the message, and unpack it on the other side; the "pup" abstraction does all three
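A minimal sketch of what pup looks like for heap data, assuming a hypothetical object that owns a dynamically allocated array; the single routine sizes, packs, and unpacks depending on the PUP::er it is given:
  class Particles {
    int n;
    double *pos;   // heap array of length 3*n
  public:
    void pup(PUP::er &p) {
      p | n;                    // move the count first
      if (p.isUnpacking())      // on the receiving side,
        pos = new double[3*n];  //   allocate before filling
      PUParray(p, pos, 3*n);    // then move the heap data itself
    }
  };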
65 SDAG
66 Structured Dagger
What is it? A coordination language built on top of Charm++.
Motivation:
- Charm++'s asynchrony is efficient and reliable, but tough to program: flags, buffering, out-of-order receives, etc.
- Threads are easy to program, but less efficient and less reliable: implementation complexity, porting headaches
- We want the benefits of both!
67 Structured Dagger Constructs
- when {code}: do not continue until the named entry method is called; internally generates the flags, checks, etc., and does not use threads
- atomic {code}: call ordinary sequential C++ code
- if/else/for/while: C-like control flow
- overlap {code1 code2 ...}: execute the code segments in parallel
- forall: "parallel do", like a parameterized overlap
68 Stencil Example Using Structured Dagger
  array[1D] myArray {
    ...
    entry void GetMessages() {
      when rightmsgEntry(), leftmsgEntry() {
        atomic {
          CkPrintf("Got both left and right messages\n");
          doWork(right, left);
        }
      }
    };
    entry void rightmsgEntry();
    entry void leftmsgEntry();
    ...
  };
69 Overlap for LeanMD Initialization
  array[1D] myArray {
    ...
    entry void waitForInit(void) {
      overlap {
        when recvNumCellPairs(myMsg *pMsg) {
          atomic { setNumCellPairs(pMsg->intVal); delete pMsg; }
        }
        when recvNumCells(myMsg *cMsg) {
          atomic { setNumCells(cMsg->intVal); delete cMsg; }
        }
      }
    };
  };
70 The for construct in the LeanMD timeloop
  entry void doTimeloop(void) {
    for (timeStep_ = 1; timeStep_ <= SimParam.NumSteps; timeStep_++) {
      atomic { sendAtomPos(); }
      overlap {
        for (forceCount_ = 0; forceCount_ < numForceMsg_; forceCount_++) {
          when recvForces(ForcesMsg *msg) { atomic { procForces(msg); } }
        }
        for (pmeCount_ = 0; pmeCount_ < nPME; pmeCount_++) {
          when recvPME(PMEGridMsg *m) { atomic { procPME(m); } }
        }
      }
      atomic { doIntegration(); }
      if (timeForMigrate()) { ... }
    }
  };
71 Conclusions
72 Conclusions
- AMPI and Charm++ provide a fully virtualized runtime system: load balancing via migration, communication optimizations, checkpoint/restart
- Virtualization can significantly improve performance for real applications
73 Thank You!
Free source, binaries, manuals, and more information at http://charm.cs.uiuc.edu/
Parallel Programming Lab at the University of Illinois