Screen shots – Load imbalance

Screen shots – Load imbalance
Jacobi, 2048 x 2048, threshold 0.1, 32 chares, 4 processors

Timelines – load imbalance

Migration
Array objects can migrate from one PE to another.
To migrate, an object must implement a pack/unpack (pup) method.
pup combines three functions into one:
  Data structure traversal: compute the message size, in bytes
  Pack: write the object into a message
  Unpack: read the object out of a message
Basic contract: "here are my fields" (types, sizes, and a pointer)

Pup – How to write it?

  class ShowPup {
    double a;
    int x;
    char y;
    unsigned long z;
    float q[3];
    int *r;   // heap-allocated memory
  public:
    ... other methods ...
    void pup(PUP::er &p) {
      p(a); p(x); p(y); p(z);
      p(q, 3);
      if (p.isUnpacking())
        r = new int[ARRAY_SIZE];
      p(r, ARRAY_SIZE);
    }
  };

The system uses this one pup routine to do both packing and unpacking by passing different types of PUP::er to it. You can determine what type of PUP::er has been passed to you with the p.isPacking(), p.isUnpacking(), and p.isSizing() methods. The p.isDeleting() method returns true if the pupped object will be deleted after packing.

Load Balancing
All you need is a working pup.
Link an LB module: -module <strategy>
  Strategies: CommLB, Comm1LB, GreedyLB, GreedyRefLB, MetisLB, NeighborLB, RandCentLB, RandRefLB, RecBisectBfLB, RefineLB
  EveryLB includes all load balancing strategies.
Select the strategy at runtime with the option +balancer, e.g. +balancer CommLB
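For concreteness, a minimal sketch of linking and launching with a balancer (the file names and processor count here are illustrative, not from the slides):

  charmc -language charm++ -o jacobi jacobi.o -module GreedyLB
  ./charmrun +p4 ./jacobi +balancer GreedyLB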

Centralized Load Balancing
Uses information about activity on all processors to make load balancing decisions.
Advantage: since it has the entire object communication graph, it can make the best global decision.
Disadvantage: higher communication cost and latency, since it requires information from all running chares.

Neighborhood Load Balancing
Balances load among a small set of processors (the neighborhood) to decrease communication costs.
Advantage: lower communication costs, since communication stays within a smaller subset of processors.
Disadvantage: can leave the system poorly balanced globally.

Centralized Load Balancing Strategies
RandCentLB – randomly assigns objects to processors, with no reference to the object communication graph
GreedyLB – starting with no load on any processor, places the object with the highest load on the processor with the lowest load, until all objects are assigned to a processor
RefineLB – moves objects off overloaded processors onto under-utilized processors to reach the average load
RandRefLB – randomly assigns objects to processors, then refines
GreedyRefLB – assigns objects to processors using the greedy load balancer, then refines

Centralized Load Balancing Strategies, Part 2
RecBisectBfLB – recursively partitions the object communication graph until there is one partition per processor
MetisLB – uses Metis to partition the object communication graph
CommLB – similar to the greedy load balancer, but also takes the communication graph into account
Comm1LB – a variation of CommLB

Neighborhood Load Balancing Strategies
NeighborLB – neighborhood load balancer; currently uses a neighborhood of 4 processors

When to Re-balance Load?
Default: the load balancer migrates objects when needed.
Programmer control:
  AtSync method – enable load balancing at a specific point
    Object is ready to migrate; re-balance if needed
    AtSync(), ResumeFromSync()
  Manual trigger – specify exactly when to do load balancing
    All objects are ready to migrate; re-balance now
    TurnManualLBOn(), StartLB()
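As a minimal sketch of AtSync-based balancing (assuming a 1D chare array Worker whose step() is an entry method; the class and member names are illustrative, not from the slides):

  class Worker : public CBase_Worker {
    int iter;
  public:
    Worker() : iter(0) {
      usesAtSync = true;             // opt in to AtSync-based load balancing
    }
    Worker(CkMigrateMessage *m) {}
    // a pup() method is also needed for actual migration (omitted here)

    void step() {
      iter++;
      // ... do one iteration of work ...
      if (iter % 64 == 0)
        AtSync();                    // ready to migrate; balancing starts once
                                     // every array element has called AtSync()
      else
        thisProxy[thisIndex].step(); // continue iterating
    }

    void ResumeFromSync() {          // invoked by the runtime after balancing
      thisProxy[thisIndex].step();   // resume the iteration loop
    }
  };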

Processor Utilization: After Load Balance
Interval: 48.5 s / 41.6 s
Jacobi, 2048 (dimension), 0.1 (threshold), 32 chares, 4 processors

Timelines: Before and After Load Balancing
Each chare waits for the others to complete before starting the next step.
Before: 2 steps take approx. 500 us
After: 2 steps take approx. 300 us

Other tools: LiveViz

LiveViz – What is it?
A Charm++ library and visualization tool
Inspect your program's current state
The client runs on any machine (Java)
You code the image generation
2D and 3D modes

LiveViz – Monitoring Your Application
LiveViz allows you to watch your application's progress.
You can use it from work or home.
It doesn't slow down the computation when there is no client connected.

LiveViz – Compilation
Compile the liveViz library itself (you must have built Charm++ first!).
From the charm directory, run:
  cd tmp/libs/ck-libs/liveViz
  make

Running LiveViz
Build and run the server. In detail:
  cd pgms/charm++/ccs/liveViz/serverpush
  make
  ./run_server

Running LiveViz
Run the client:
  cd pgms/charm++/ccs/liveViz/client
  ./run_client [<host> [<port>]]
You should get a result window.

LiveViz Request Model
  The client sends a "get image" request to the liveViz server code.
  The server buffers the request; the parallel application polls for requests.
  When a poll returns a request, the application does the work and passes image chunks to the server.
  The server combines the image chunks and sends the assembled image to the client.

Jacobi 2D Example Structure
Main: set up the worker array and pass data to the workers.
Workers: start looping
  Send messages with ghost rows to all neighbors
  Wait for all neighbors to send their ghost rows to me
  Once they arrive, do the regular Jacobi relaxation
  Calculate the maximum error, and do a reduction to compute the global maximum error
  If the timestep is a multiple of 64, load balance the computation; then restart the loop

LiveViz Setup

Without liveViz (original):

  void main::main(...) {
    // Do misc initialization stuff

    // Now create the (empty) jacobi 2D array
    work = CProxy_matrix::ckNew(0);

    // Distribute work to the array, filling it as you do
  }

With liveViz:

  #include <liveVizPoll.h>

  void main::main(...) {
    // Do misc initialization stuff

    // Create the workers and register with liveViz
    CkArrayOptions opts(0);          // by default, allocate 0 array elements
    liveVizConfig cfg(true, true);   // color image = true, animate image = true
    liveVizPollInit(cfg, opts);      // initialize the library

    // Now create the jacobi 2D array
    work = CProxy_matrix::ckNew(opts);

    // Distribute work to the array, filling it as you do
  }

Adding LiveViz To Your Code

  void matrix::serviceLiveViz() {
    liveVizPollRequestMsg *m;
    while ((m = liveVizPoll((ArrayElement *)this, timestep)) != NULL) {
      requestNextFrame(m);
    }
  }

Without liveViz (original):

  void matrix::startTimeSlice() {
    // Send ghost rows north, south, east, west, ...
    sendMsg(dims.x-2, NORTH, dims.x+1, 1, +0, -1);
  }

With liveViz:

  void matrix::startTimeSlice() {
    // Send ghost rows north, south, east, west, ...
    sendMsg(dims.x-2, NORTH, dims.x+1, 1, +0, -1);

    // Now, having sent all our ghosts, service liveViz
    // while waiting for the neighbors' ghosts to arrive.
    serviceLiveViz();
  }

Generate an Image For a Request

  void matrix::requestNextFrame(liveVizPollRequestMsg *m) {
    // Compute the dimensions of the image piece we'll send.

    // Compute the image data of the chunk we'll send –
    // image data is just a linear array of bytes in row-major
    // order. For greyscale it's 1 byte per pixel, for color
    // it's 3 bytes (rgb).

    // The liveViz library routine colorScale(value, min, max,
    // *array) will rainbow-color your data automatically.

    // Finally, return the image data to the library.
    liveVizPollDeposit((ArrayElement *)this, timestep, m,
                       loc_x, loc_y, width, height, imageBits);
  }

Link With The LiveViz Library

  OPTS=-g
  CHARMC=charmc $(OPTS)
  LB=-module RefineLB
  OBJS = jacobi2d.o

  all: jacobi2d

  jacobi2d: $(OBJS)
  	$(CHARMC) -language charm++ \
  	  -o jacobi2d $(OBJS) $(LB) -lm \
  	  -module liveViz

  jacobi2d.o: jacobi2d.C jacobi2d.decl.h
  	$(CHARMC) -c jacobi2d.C

The only change from the original Makefile is adding -module liveViz to the link line.

LiveViz Summary
Easy-to-use visualization library
Simple code handles any number of clients
Doesn't slow the computation when no clients are connected
Works in parallel, with load balancing, etc.

Advanced Features: Groups
Groups are similar to arrays, except only one element is on each processor – the index used to access the group is the processor ID.
Advantage: groups can be used to batch messages from chares running on a single processor, which cuts down on message traffic.
Disadvantage: groups do not allow for effective load balancing, since they are stationary (they are not virtualized).
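As an illustration (the group name, message type, and entry method are hypothetical, not from the slides), a group is declared in the .ci file and addressed by processor number:

  // .ci file
  message DataMsg;

  group MessageAggregator {
    entry MessageAggregator();
    entry void deposit(DataMsg *m);
  };

  // In C++ code, the element on processor p is reached as:
  //   aggregatorProxy[p].deposit(msg);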

Advanced Features: Node Groups
Similar to groups, but there is one element per node instead of one per processor – the index is the node number.
Similar advantages and disadvantages as well: node groups can be used to batch messages on a single node, but they are not virtualized and do not participate in load balancing.
Node groups can have exclusive entry methods – only one exclusive entry method may be running on a node at once.

Advanced Features: Priorities
Messages can be assigned different priorities.
The simplest priorities just specify whether the message should go at the end of the queue (the standard behavior) or at the beginning of the queue.
Specific priorities can also be assigned to messages.
Priorities can be specified with either numbers or bit vectors.
For numeric priorities, lower numbers mean higher priority.
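A minimal sketch of attaching an integer priority to a message (the message type MyMsg, the proxy, and the entry method are illustrative; the priority-bits placement argument, CkPriorityPtr, and CkSetQueueing follow the Charm++ message API):

  // Allocate the message with room for an integer priority field.
  MyMsg *msg = new (8 * sizeof(int)) MyMsg();
  *(int *)CkPriorityPtr(msg) = 5;            // lower number = higher priority
  CkSetQueueing(msg, CK_QUEUEING_IFIFO);     // integer-priority FIFO queueing
  workerProxy[i].doWork(msg);                // hypothetical entry method call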

Advanced Features: Custom Array Indexes
Standard, system-supplied indexes are available for 1D, 2D, and 3D arrays.
You can create your own custom index for higher-dimensional arrays or for custom indexing information.
You need to create a custom class that provides the indexing functionality and supply it with the array definition.
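A hedged sketch of the pattern (the Tuple4 type is hypothetical, and the CkArrayIndex subclassing follows the general pattern in the Charm++ manual; check the manual for the exact requirements of your Charm++ version):

  // A hypothetical 4-integer index type.
  struct Tuple4 {
    int w, x, y, z;
    Tuple4(int w_, int x_, int y_, int z_) : w(w_), x(x_), y(y_), z(z_) {}
  };

  // Wrap it in a CkArrayIndex subclass so the runtime can copy and hash it.
  class CkArrayIndexTuple4 : public CkArrayIndex {
    Tuple4 *idx;
  public:
    CkArrayIndexTuple4(const Tuple4 &in) {
      idx = new (index) Tuple4(in);           // store the data inside the index
      nInts = sizeof(Tuple4) / sizeof(int);   // tell the runtime how many ints it uses
    }
  };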

Advanced Features: Entry Method Attributes

  entry [attribute1, ..., attributeN] void EntryMethod(parameters);

Attributes:
  threaded – entry methods which are run in their own non-preemptible threads
  sync – entry methods which return a message as a result
    Sync entry methods are special in that calls to them are blocking: they do not return control to the caller until the method finishes executing completely. Sync methods may have return values; however, they may only return messages.
  exclusive – entry methods, which exist only on node groups, that do not execute while another exclusive entry method of the same node group is executing on the same node. If one exclusive method of a node group is executing on node 0 and another is scheduled to run on that same node, the second exclusive method waits for the first to finish before it executes. To make an entry method exclusive, add the keyword exclusive to that entry method's attribute list.
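For illustration, a hypothetical .ci declaration using these attributes (the chare array, node group, message types, and method names are made up for this sketch):

  // .ci file
  message QueryMsg;
  message ResultMsg;
  message DataMsg;

  array [1D] Worker {
    entry Worker();
    entry [threaded] void longComputation();      // runs in its own thread
    entry [sync] ResultMsg *query(QueryMsg *m);   // blocking; returns a message
  };

  nodegroup NodeCache {
    entry NodeCache();
    entry [exclusive] void update(DataMsg *m);    // only one exclusive method runs per node at a time
  };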

Advanced Features: Reductions
Callbacks transfer control back to a client after a library has finished.
Various pre-defined callbacks exist, e.g. one that exits the program (CkExit).
Callbacks in reductions:
  The reduction client can be specified in the main chare on processor 0:
    myProxy.ckSetReductionClient(new CkCallback(...));
  Or it can be specified in the call to contribute by passing the callback:
    contribute(sizeof(x), &x, CkReduction::sum_int, processResult);
  The reduction result is sent to the reduction client function.
A more general interface is needed in several cases. A callback is a single object that can represent, among other things:
  an entry method of a particular chare instance
  an entry method of a particular array element
  a broadcast to a particular entry method of an array
Build a callback object and pass it as a parameter to contribute instead of the client function.
For any commutative, associative operation, you can use the framework in Charm++ to build your own reduction operation.
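As a minimal sketch (the chare, proxy, and method names are hypothetical), contributing to a reduction whose result is delivered through a callback might look like this:

  // In an array element: contribute the local maximum error to a max-reduction.
  void Worker::finishIteration(double localMaxError) {
    CkCallback cb(CkIndex_Main::reportError(NULL), mainProxy);
    contribute(sizeof(double), &localMaxError, CkReduction::max_double, cb);
  }

  // In the main chare: the reduction result arrives as a CkReductionMsg.
  void Main::reportError(CkReductionMsg *msg) {
    double globalMax = *(double *)msg->getData();
    delete msg;
    CkPrintf("global max error = %f\n", globalMax);
    // ... decide whether to continue iterating ...
  }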

Reductions, Part 2
Predefined reductions:
  Sum values or arrays with CkReduction::sum_[int, float, double]
  Calculate the product of values or arrays with CkReduction::product_[int, float, double]
  Calculate the maximum contributed value with CkReduction::max_[int, float, double]
  Calculate the minimum contributed value with CkReduction::min_[int, float, double]
  Calculate the logical AND of integer values with CkReduction::logical_and

Reductions, Part 3
Predefined reductions, continued:
  Calculate the logical OR of contributed integers with CkReduction::logical_or
  Form a set of all contributed values with CkReduction::set
  Concatenate the bytes of all contributed values with CkReduction::concat

Reductions, Part 4
User-defined reductions perform a user-defined operation on user-defined data.
A reduction function is defined as:
  CkReductionMsg *reductionFn(int nMsg, CkReductionMsg **msgs)
It is registered using CkReduction::addReducer, which returns a reduction type you can pass to contribute.
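A hedged sketch of such a reduction function (here it sums contributions that are pairs of doubles; the function name and the pair layout are assumptions for the example):

  // Sum pairs of doubles across all contributions.
  CkReductionMsg *sumPair(int nMsg, CkReductionMsg **msgs) {
    double result[2] = {0.0, 0.0};
    for (int i = 0; i < nMsg; i++) {
      double *d = (double *)msgs[i]->getData();
      result[0] += d[0];
      result[1] += d[1];
    }
    return CkReductionMsg::buildNew(2 * sizeof(double), result);
  }

  // Registration (typically done once on every PE, e.g. from an initnode routine):
  //   CkReduction::reducerType sumPairType = CkReduction::addReducer(sumPair);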

Benefits of Virtualization
Better software engineering: logical units are decoupled from the number of processors.
Message-driven execution:
  Adaptive overlap between computation and communication
  Predictability of execution
Flexible and dynamic mapping to processors:
  Flexible mapping on clusters
  Change the set of processors for a given job
Automatic checkpointing
Principle of persistence

More Information
http://charm.cs.uiuc.edu
  Manuals
  Papers
  Download files
  FAQs
ppl@cs.uiuc.edu