1 Advanced Charm++ Tutorial Charm Workshop Tutorial Sameer Kumar Orion Sky Lawlor charm.cs.uiuc.edu 2004/10/19

2 How to Become a Charm++ Hacker Advanced Charm++ Advanced Messaging Writing system libraries Groups Delegation Communication framework Advanced load-balancing Checkpointing Threads SDAG

3 Advanced Messaging

4 Prioritized Execution If several messages are available, Charm++ will process the message with the highest priority Otherwise, the oldest message (FIFO) Has no effect: If only one message is available (common for network-bound applications!) On outgoing messages Very useful for speculative work, ordering timesteps, etc...

5 Priority Classes Charm++ scheduler has three queues: high, default, and low As signed integer priorities: -MAXINT is the highest priority, 0 the default priority, and MAXINT the lowest priority As unsigned bitvector priorities: 0x0000 to 0x7FFF are the highest priorities, 0x8000 is the default priority, and 0x8001 to 0xFFFF are the lowest priorities

6 Prioritized Marshalled Messages Pass a “CkEntryOptions” object as the last parameter For signed integer priorities: CkEntryOptions opts; opts.setPriority(-1); fooProxy.bar(x,y,opts); For bitvector priorities: CkEntryOptions opts; unsigned int prio[2]={0x7FFFFFFF,0xFFFFFFFF}; opts.setPriority(64,prio); fooProxy.bar(x,y,opts);

7 Prioritized Messages Number of priority bits passed during message allocation FooMsg * msg = new (size, nbits) FooMsg; Priorities stored at the end of messages Signed integer priorities: *CkPriorityPtr(msg)=-1; CkSetQueueing(m, CK_QUEUEING_IFIFO); Unsigned bitvector priorities CkPriorityPtr(msg)[0]=0x7fffffff; CkSetQueueing(m, CK_QUEUEING_BFIFO);
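Putting slide 7 together, a minimal end-to-end sketch (FooMsg, doWork, and workerProxy are assumed to be declared elsewhere; note that for a plain message the priority bits are the only placement-new argument, while the slide's (size, nbits) form applies to varsize messages):

  // Sketch only: allocate a message with room for one signed-integer
  // priority, set the priority, pick the matching queueing strategy, send.
  FooMsg *msg = new (8 * sizeof(int)) FooMsg;   // reserve 8*sizeof(int) priority bits
  *(int *)CkPriorityPtr(msg) = -10;             // more negative = higher priority
  CkSetQueueing(msg, CK_QUEUEING_IFIFO);        // signed-integer FIFO queueing
  workerProxy[0].doWork(msg);                   // ownership of msg passes to the runtime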

8 Advanced Message Features Read-only messages Entry method agrees not to modify or delete the message Avoids a message copy for broadcasts, saving time Expedited messages Messages do not go through the Charm++ scheduler (faster) Immediate messages Entries are executed in an interrupt handler or on the communication thread Very fast, but tough to get right

9 Read-Only, Expedited, Immediate All declared in the .ci file: { ... entry [nokeep] void foo_readonly(Msg *); entry [expedited] void foo_exp(Msg *); entry [immediate] void foo_imm(Msg *); }; // Immediate messages currently only work for NodeGroups

10 Groups

11 Object Groups A collection of objects (chares) Also called branch office chares Exactly one representative on each processor Ideally suited for system libraries A single proxy for the group as a whole Similar to arrays: Broadcasts, reductions, indexing But not completely like arrays: Non-migratable; one per processor

12 Declarations .ci file: group mygroup { entry mygroup(); //Constructor entry void foo(foomsg *); //Entry method }; C++ file: class mygroup : public Group { public: mygroup() {} void foo(foomsg *m) { CkPrintf("Do Nothing"); } };

13 Creating and Calling Groups Creation p = CProxy_mygroup::ckNew(); Remote invocation p.foo(msg); //broadcast p[1].foo(msg); //asynchronous invocation Direct local access mygroup *g=p.ckLocalBranch(); g->foo(….); //local invocation Danger: if you migrate, the group stays behind!

14 Delegation

15 Delegation Enables Charm++ proxy messages to be forwarded to a delegation manager group Delegation manager can trap calls to proxy sends and apply optimizations Delegation manager must inherit from CkDelegateMgr The user program must call proxy.ckDelegate(mgrID);

16 Delegation Interface .ci file: group MyDelegateMgr { entry MyDelegateMgr(); //Constructor }; .h file: class MyDelegateMgr : public CkDelegateMgr { MyDelegateMgr(); void ArraySend(..., int ep, void *m, const CkArrayIndexMax &idx, CkArrayID a); void ArrayBroadcast(...); void ArraySectionSend(..., CkSectionID &s); ... };
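To show how the pieces fit, here is a minimal sketch of hooking such a manager up to an array proxy (MyDelegateMgr, myArrayProxy, foo, and msg are assumed names; whether ckDelegate takes the manager's group ID, as on slide 15, or a pointer to its local branch has varied across Charm++ versions -- the local-branch form is used below):

  // Sketch only: create the delegation manager group and delegate a proxy to it.
  CProxy_MyDelegateMgr mgr = CProxy_MyDelegateMgr::ckNew();
  myArrayProxy.ckDelegate(mgr.ckLocalBranch());  // trap subsequent sends through the manager
  myArrayProxy[3].foo(msg);                      // now routed through MyDelegateMgr::ArraySend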

17 Communication Optimization

18 Automatic Communication Optimizations The parallel-objects Runtime System can observe, instrument, and measure communication patterns Communication libraries can optimize By substituting most suitable algorithm for each operation Learning at runtime E.g. All to all communication Performance depends on many runtime characteristics Library switches between different algorithms Communication is from/to objects, not processors Streaming messages optimization

19 Managing Collective Communication Communication operations where all (or most) of the processors participate For example: broadcast, barrier, all-reduce, all-to-all communication Applications: NAMD multicast, NAMD PME, CPAIMD Issues: Performance impediment Naïve implementations often do not scale Synchronous implementations do not utilize the co-processor effectively

20 All to All Communication All processors send data to all other processors All to all personalized communication (AAPC) MPI_Alltoall All to all multicast/broadcast (AAMC) MPI_Allgather

21 Strategies For AAPC Short message optimizations High software overhead (α) Message combining Large messages Network contention

22 Short Message Optimizations Direct all to all communication is α dominated Message combining for small messages Reduce the total number of messages Multistage algorithm to send messages along a virtual topology Group of messages combined and sent to an intermediate processor which then forwards them to their final destinations AAPC strategy may send same message multiple times

23 Virtual Topology: Mesh Organize processors in a 2D (virtual) mesh Phase 1: Processors send messages to row neighbors Message from (x1,y1) to (x2,y2) goes via (x1,y2) Phase 2: Processors send messages to column neighbors About 2*(√P - 1) messages per processor instead of P-1
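To make the mesh routing concrete, here is a small stand-alone C++ sketch (a plain illustration of the algorithm, not part of the comlib API; all names are invented) that computes the phase-1 intermediate processor and the per-processor message count for a √P x √P mesh:

  #include <cmath>
  #include <cstdio>

  struct MeshCoord { int row, col; };   // position on the virtual mesh (row-major)

  static MeshCoord toCoord(int rank, int side) { return { rank / side, rank % side }; }
  static int toRank(MeshCoord c, int side)     { return c.row * side + c.col; }

  // Phase 1 combines messages along the source's row: a message from
  // (x1,y1) to (x2,y2) is first sent to (x1,y2), which forwards it down
  // its column in phase 2.
  static int intermediateRank(int src, int dst, int side) {
      MeshCoord s = toCoord(src, side), d = toCoord(dst, side);
      return toRank({ s.row, d.col }, side);
  }

  int main() {
      int P = 16, side = (int)std::sqrt((double)P);
      printf("message 1 -> 14 is relayed via processor %d\n", intermediateRank(1, 14, side));
      printf("per-processor sends: %d instead of %d\n", 2 * (side - 1), P - 1);
      return 0;
  }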

24 AAPC Performance

25 Large Message Issues Network contention Contention free schedules Topology specific optimizations

26 Ring Strategy for Collective Multicast Performs all-to-all multicast by sending messages along a ring formed by the processors Congestion-free on most topologies [Figure: processors 0, 1, 2, ..., i, i+1, ..., P-1 connected in a ring]

27 Streaming Messages Programs often have streams of short messages Streaming library combines a bunch of messages and sends them off Stripping large charm++ header Short array message packing Effective message performance of about 3us

28 Using communication library Communication optimizations embodied as strategies EachToManyMulticastStrategy RingMulticast PipeBroadcast Streaming MeshStreaming

29 Bracketed vs. Non-bracketed Bracketed Strategies Require user to give specific end points for each iteration of message sends Endpoints declared by calling ComlibBegin() and ComlibEnd() Examples: EachToManyMulticast Non bracketed strategies No such end points necessary Examples: Streaming, PipeBroadcast

30 Accessing the Communication Library From mainchare::main Creating a strategy Strategy *strat = new EachToManyMulticastStrategy(USE_MESH); strat = new StreamingStrategy(); strat->enableShortMessagePacking(); Associating a proxy with a strategy ComlibAssociateProxy(strat, myproxy); myproxy should be passed to all array elements

31 Sending Messages ComlibBegin(myproxy);// Bracketed Strategies for(....) {.... myproxy.foo(msg);.... } ComlibEnd(); // Bracketed strategies

32 Handling Migration A migrating array element PUPs its comlib-associated proxy: void FooArray::pup(PUP::er &p) { p | myProxy; }

33 Compiling You must include the compile-time option -module commlib
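For example, a link line might look like the following (the program and object file names are illustrative; only the -module flag comes from this slide):

  charmc -o pgm pgm.o -module commlib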

34 Advanced Load-balancers Writing a Load-balancing Strategy

35 Advanced load balancing: Writing a new strategy Inherit from CentralLB and implement the work(…) function class foolb : public CentralLB { public: void work (CentralLB::LDStats* stats, int count); };

36 LB Database struct LDStats { ProcStats *procs; LDObjData* objData; LDCommData* commData; int *to_proc; // ... }; // Dummy work function which assigns all objects to processor 0 (don't implement it like this!) void fooLB::work(CentralLB::LDStats* stats, int count) { for (int obj = 0; obj < nobjs; obj++) stats->to_proc[obj] = 0; }
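For contrast, here is a minimal sketch of a slightly more realistic work() that greedily places each object on the currently least-loaded processor. The fields n_objs and objData[i].wallTime are assumptions about the LDStats layout of this era; check your Charm++ version's headers before copying:

  #include <vector>

  // Sketch only, not the exact CentralLB API.
  void fooLB::work(CentralLB::LDStats* stats, int count) {
      std::vector<double> load(count, 0.0);              // predicted load per processor
      for (int i = 0; i < stats->n_objs; i++) {
          int lightest = 0;                              // find the least-loaded processor
          for (int p = 1; p < count; p++)
              if (load[p] < load[lightest]) lightest = p;
          stats->to_proc[i] = lightest;                  // assign the object there
          load[lightest] += stats->objData[i].wallTime;  // assumed per-object cost field
      }
  }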

37 Compiling and Integration Edit and run Makefile_lb.sh Creates Make.lb which is included by the LDB Makefile Run make depends to correct dependencies Rebuild charm++

38 Checkpoint Restart

39 Checkpoint/Restart Any long running application must be able to save its state When you checkpoint an application, it uses the pup routine to store the state of all objects State information is saved in a directory of your choosing Restore also uses pup, so no additional application code is needed (pup is all you need)

40 Checkpointing Job In AMPI, use MPI_Checkpoint(<dir>); Collective call; returns when the checkpoint is complete In Charm++, use CkCheckpoint(<callback>, <dir>); Called on one processor; the callback is invoked when the checkpoint is complete
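A minimal sketch of triggering a checkpoint from a mainchare follows; in recent Charm++ the call is spelled CkStartCheckpoint(dirname, callback), which may differ from the CkCheckpoint spelling on this slide, and Main, mainProxy, and resumeFromCheckpoint are invented names:

  // Sketch only: request a checkpoint and resume via a callback.
  void Main::startCheckpoint() {
      // Deliver resumeFromCheckpoint() to the mainchare once the checkpoint is written.
      CkCallback cb(CkIndex_Main::resumeFromCheckpoint(), mainProxy);
      CkStartCheckpoint("log", cb);   // "log" is the checkpoint directory
  }

  void Main::resumeFromCheckpoint() {
      CkPrintf("Checkpoint complete; continuing.\n");
  }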

41 Restart Job from Checkpoint The charmrun option ++restart is used to restart Number of processors need not be the same You can also restart groups by marking them migratable and writing a PUP routine – they still will not load balance, though
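As an illustrative invocation only (processor count, program name, and directory are made up, and the flag spelling has varied between +restart and ++restart across Charm++ versions):

  ./charmrun ./pgm +p8 ++restart log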

42 Threads

43 Why use Threads? They provide one key feature: blocking Suspend execution (e.g., at message receive) Do something else Resume later (e.g., after message arrives) Example: MPI_Recv, MPI_Wait semantics Function call interface more convenient than message-passing Regular call/return structure (no CkCallbacks) Allows blocking in middle of deeply nested communication subroutine

44 Why not use Threads? Slower Around 1us context-switching overhead unavoidable Creation/deletion perhaps 10us More complexity, more bugs Breaks a lot of machines! (but we have workarounds) Migration more difficult State of thread is scattered through stack, which is maintained by compiler By contrast, state of object is maintained by users Thread disadvantages form the motivation to use SDAG (later)

45 What are (Charm) Threads? One flow of control (instruction stream) Machine Registers & program counter Execution stack Like pthreads (kernel threads) Only different: Implemented at user level (in Converse) Scheduled at user level; non-preemptive Migratable between nodes

46 How do I use Threads? Many options: AMPI Always uses threads via TCharm library Charm++ [threaded] entry methods run in a thread [sync] methods Converse C routines CthCreate/CthSuspend/CthAwaken Everything else is built on these Implemented using SYSV makecontext/setcontext POSIX setjmp/alloca/longjmp
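As a small sketch (Worker, ResultMsg, compute, and value are invented names, and ResultMsg is assumed to be declared as a message in the .ci file), a [threaded] entry can call a [sync] entry and simply block until the reply arrives:

  // In the .ci file:
  array [1D] Worker {
    entry Worker();
    entry [threaded] void run();             // runs in its own user-level thread
    entry [sync] ResultMsg *compute(int n);  // caller's thread blocks until this returns
  };

  // In the C++ file:
  void Worker::run() {
    // Legal only because run() is [threaded]: the sync call suspends this
    // thread (not the processor) until element 0 returns its ResultMsg.
    ResultMsg *r = thisProxy[0].compute(42);
    CkPrintf("Worker %d got %d\n", thisIndex, r->value);
    delete r;
  }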

47 How do I use Threads (example) Blocking API routine: find array element int requestFoo(int src) { myObject *obj=...; return obj->fooRequest(src); } Send request and suspend int myObject::fooRequest(int src) { proxy[dest].fooNetworkRequest(thisIndex); stashed_thread=CthSelf(); CthSuspend(); // -- blocks until awaken call -- return stashed_return; } Awaken thread when data arrives void myObject::fooNetworkResponse(int ret) { stashed_return=ret; CthAwaken(stashed_thread); }

48 How do I use Threads (example) Send request, suspend, recv, awaken, return int myObject::fooRequest(int src) { proxy[dest].fooNetworkRequest(thisIndex); stashed_thread=CthSelf(); CthSuspend(); return stashed_return; } void myObject::fooNetworkResponse(int ret) { stashed_return=ret; CthAwaken(stashed_thread); }

49 The Horror of Thread Migration

50 Stack Data The stack is used by the compiler to track function calls and provide temporary storage Local Variables Subroutine Parameters C “alloca” storage Most of the variables in a typical application are stack data Users have no control over how stack is laid out

51 Migrate Stack Data Without compiler support, we cannot change the stack's address Because we can't change the stack's interior pointers (return frame pointer, function arguments, etc.) Solution: “isomalloc” addresses Reserve address space on every processor for every thread stack Use mmap to scatter stacks in virtual memory efficiently Idea comes from PM2

52 Migrate Stack Data [Diagram: Processor A's memory and Processor B's memory, each showing Code, Globals, and Heap between 0x00000000 and 0xFFFFFFFF; Processor A holds the stacks of Threads 1-4; caption: Migrate Thread 3]

53 Migrate Stack Data: Isomalloc [Diagram: the same layout after migration; Thread 3's stack now sits in Processor B's memory at the same reserved virtual addresses, while Threads 1, 2, and 4 remain on Processor A]

54 Migrate Stack Data Isomalloc is a completely automatic solution No changes needed in application or compilers Just like a software shared-memory system, but with proactive paging But has a few limitations Depends on having large quantities of virtual address space (best on 64-bit) 32-bit machines can only have a few gigs of isomalloc stacks across the whole machine Depends on unportable mmap Which addresses are safe? (We must guess!) What about Windows? Or Blue Gene?

55 Aliasing Stack Data [Diagram: Processor A's and Processor B's memory (Code, Globals, Heap, 0x00000000 to 0xFFFFFFFF), with saved stacks for Thread 2 and Thread 3]

56 Aliasing Stack Data: Run Thread 2 [Diagram: Thread 2's saved stack is copied into the fixed execution stack region so it can run; labels: Execution, Copy]

57 Aliasing Stack Data [Diagram: the memory layout between runs, with the saved stacks of Thread 2 and Thread 3 and no thread executing]

58 Aliasing Stack Data: Run Thread 3 [Diagram: Thread 3's saved stack is copied into the same fixed execution stack region; labels: Execution, Copy]

59 Aliasing Stack Data [Diagram: Thread 3 is migrated from Processor A to Processor B; caption: Migrate Thread 3]

60 Aliasing Stack Data [Diagram: the memory layout after the migration, showing the saved stacks of Thread 2 and Thread 3]

61 Aliasing Stack Data [Diagram: Thread 3 now runs on Processor B, its saved stack copied into B's execution stack region; labels: Execution, Copy]

62 Aliasing Stack Data Does not depend on having large quantities of virtual address space Works well on 32-bit machines Requires only one mmap’d region at a time Works even on Blue Gene! Downsides: Thread context switch requires munmap/mmap (3us) Can only have one thread running at a time (so no SMP’s!)

63 Heap Data Heap data is any dynamically allocated data C “malloc” and “free” C++ “new” and “delete” F90 “ALLOCATE” and “DEALLOCATE” Arrays and linked data structures are almost always heap data

64 Migrate Heap Data Automatic solution: isomalloc all heap data just like stacks! “-memory isomalloc” link option Overrides malloc/free No new application code needed Same limitations as isomalloc; page allocation granularity (huge!) Manual solution: application moves its heap data Need to be able to size message buffer, pack data into message, and unpack on other side “pup” abstraction does all three
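For the manual route, a pup routine for a heap-allocated array might look roughly like this (a sketch; MyChare, n, and data are invented, and the CBase_ superclass call follows current Charm++ conventions):

  class MyChare : public CBase_MyChare {
    int n;          // number of elements in the heap array
    double *data;   // heap-allocated array
  public:
    MyChare() : n(0), data(NULL) {}
    MyChare(CkMigrateMessage *m) : n(0), data(NULL) {}   // migration constructor
    ~MyChare() { delete[] data; }

    void pup(PUP::er &p) {
      CBase_MyChare::pup(p);                       // pup the superclass first
      p | n;                                       // size must be pup'ed before the array
      if (p.isUnpacking()) data = new double[n];   // allocate on the receiving side
      PUParray(p, data, n);                        // pack or unpack the n doubles
    }
  };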

65 SDAG

66 Structured Dagger What is it? A coordination language built on top of Charm++ Motivation Charm++’s asynchrony is efficient and reliable, but tough to program Flags, buffering, out-of-order receives, etc. Threads are easy to program, but less efficient and less reliable Implementation complexity Porting headaches Want benefits of both!

67 Structured Dagger Constructs when {code} Do not continue until method is called Internally generates flags, checks, etc. Does not use threads atomic {code} Call ordinary sequential C++ code if/else/for/while C-like control flow overlap {code1 code2...} Execute code segments in parallel forall “Parallel Do” Like a parameterized overlap

68 Stencil Example Using Structured Dagger array[1D] myArray { … entry void GetMessages () { when rightmsgEntry(), leftmsgEntry() { atomic { CkPrintf(“Got both left and right messages \n”); doWork(right, left); } } }; entry void rightmsgEntry(); entry void leftmsgEntry(); … };

69 Overlap for LeanMD Initialization array[1D] myArray { … entry void waitForInit(void) { overlap { when recvNumCellPairs(myMsg* pMsg) { atomic { setNumCellPairs(pMsg->intVal); delete pMsg; } } when recvNumCells(myMsg* cMsg) { atomic { setNumCells(cMsg->intVal); delete cMsg; } } } } … };

70 For for LeanMD timeloop entry void doTimeloop(void) { for (timeStep_=1; timeStep_<=SimParam.NumSteps; timeStep_++) { atomic { sendAtomPos(); } overlap { for (forceCount_=0; forceCount_<numForceMsg_; forceCount_++) { when recvForces(ForcesMsg* msg) { atomic { procForces(msg); } } } for (pmeCount_=0; pmeCount_<nPME; pmeCount_++) { when recvPME(PMEGridMsg* m) { atomic { procPME(m); } } } } atomic { doIntegration(); } if (timeForMigrate()) {... } } }

71 Conclusions

72 Conclusions AMPI and Charm++ provide a fully virtualized runtime system Load balancing via migration Communication optimizations Checkpoint/restart Virtualization can significantly improve performance for real applications

73 Thank You! Free source, binaries, manuals, and more information at: http://charm.cs.uiuc.edu/ Parallel Programming Lab at the University of Illinois