Comparative Study of Techniques for Parallelization of the Grid Simulation Problem
Prepared by Arthi Ramachandran
Kansas State University
2 The Problem
Iterative schemes, e.g.
– Jacobi method
– Kinetic Monte Carlo
– Gauss-Seidel method
Input data has a grid/matrix topology.
Computing the value of a cell at time-step ‘t’ requires the values of its neighboring cells at time-step ‘t-1’.
This work studies the parallelization of this problem.
3 The Problem (contd…) Jacobi Simulation of Heat flow
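The dependence of time-step ‘t’ on time-step ‘t-1’ can be illustrated with a small sequential sketch. This is not the code from the study: the class name, the grid size and the choice to keep all four edges fixed are assumptions made for illustration; only the 5.0 first-row temperature and the four-neighbor average follow the slides.

```java
import java.util.Arrays;

class JacobiHeatSketch {
    /** Builds an n x n grid: first row held at 5.0, every other cell starts at 0.0. */
    static double[][] initialGrid(int n) {
        double[][] grid = new double[n][n];
        Arrays.fill(grid[0], 5.0);
        return grid;
    }

    /** One Jacobi sweep: reads only prev (time-step t-1) and writes a fresh grid (time-step t). */
    static double[][] step(double[][] prev) {
        int n = prev.length;
        double[][] next = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                boolean boundary = i == 0 || i == n - 1 || j == 0 || j == n - 1;
                if (boundary) {
                    next[i][j] = prev[i][j]; // assumption: all four edges stay fixed
                } else {
                    next[i][j] = (prev[i - 1][j] + prev[i + 1][j]
                                + prev[i][j - 1] + prev[i][j + 1]) / 4.0;
                }
            }
        }
        return next;
    }

    /** Runs the given number of sweeps, swapping in the new grid after each one. */
    static double[][] simulate(int n, int iterations) {
        double[][] grid = initialGrid(n);
        for (int t = 0; t < iterations; t++) {
            grid = step(grid);
        }
        return grid;
    }

    public static void main(String[] args) {
        for (double[] row : simulate(4, 2)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Because each sweep writes into a separate array, every cell of a sweep can be computed independently, which is what makes the problem a candidate for parallelization.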
4 Outline Goal Available Solutions Parallel Adaptive Distributed Simulations – Architecture Performance Comparison Results Conclusions Future Work
5 The Goal
A framework in which the user/programmer codes only the problem-specific logic; the framework manages the parallelization of the user’s application.
Load balancing
– required for some problems to achieve good performance
– the model that we build aims at good performance by moving fixed-size jobs across machines
– a later slide shows that load balancing gives a substantial improvement in performance
6 Solution 1 – OptimalGrid
Developed by IBM Almaden Research Lab
Specifically built to parallelize connected problems
Built using Java
User only needs to supply
– Java code for the problem to be parallelized
– Changes to configuration files to fine-tune the behaviour of OptimalGrid (if required)
7 OptimalGrid Architecture
– Manages the compute agents
– Invokes the Problem Builder if necessary, which automatically partitions the problem
– Distributes the problem among the agents
– Monitors the agents: tracks their status and performs optimizations if necessary
8 Data Units
– Original Problem Cell (OPC): an OPC is surrounded by 4 neighbor cells
– Variable Problem Partition (VPP)
– VPP edge
9 Implementation of Jacobi method of Heat Flow – OptimalGrid (1)
Class name: EntityJacobi extends EntityAbstract
Data members:
– double temperature
Methods:
– double getTemperature()
– void setTemperature(double)
– void initFromXML(Element entity)
– Element getXML()
10 Implementation of Jacobi method of Heat Flow – OptimalGrid (2)
Class name: OPCJacobi extends OPCAbstract
Data members:
– Version Identifier
Methods:
– propagate()
– localInteraction()

propagate(){
    // Collect the occupants of every neighboring OPC
    ArrayList newOccupants = new ArrayList();
    ArrayList neibList = this.getAllOpcNeighbors();
    for(each entry in neibList){
        newOccupants.add(neibListEntry.getOccupants());
    }
    return newOccupants;
}

localInteraction(ArrayList occupants){
    double temperature = 0.0;
    if(this.loc.x == 0){
        // Cell is on first row of grid
        temperature = 5.0;
    }
    else if(this.loc.y == 0 || this.loc.y == Grid_Dimension-1){
        temperature = 0;
    }
    else{
        // Inner cells – compute the average of the 4 neighbors
        for(each neighbor entry in occupants list){
            temperature += occupantsEntry.getTemperature();
        }
        temperature /= 4;
    }
    occupants.get(0).setTemperature(temperature);
    Remove all neighbor entries from occupants list
}
11 MPI Solution
Message Passing Interface (MPI) library – a specification
Various implementations of this specification are available, e.g.
– LAM MPI – developed by the Ohio Supercomputer Center
– MPICH – developed by Argonne National Laboratory and Mississippi State University (used in this work)
12 Features of MPI
Flexible Send and Receive APIs
– void Comm::Send(const void *buf, int count, const Datatype& datatype, int destination, int tag) const
– void Comm::Recv(void *buf, int count, const Datatype& datatype, int source, int tag) const
Collective communications support
– Broadcast
– Scatter and Gather operations between a set of processes
– Collective computation operations such as ‘minimum’, ‘maximum’, ‘sum’, etc.
13 Features of MPI (contd.)
Virtual topologies
Communication modes
– Non-blocking versions of Send/Receive APIs
– Synchronous mode
– Buffered mode
Debugging and profiling hooks
14 MPI Solution overview
[Figure: Process #0, Process #1, Process #2, Process #3]
15 MPI Implementation - Jacobi Iteration method
Inputs: number of processes (N), partition size, number of iterations
All processes (0…N-1):
– Create a Cartesian matrix of processes: Intracomm::Create_cart()
– Obtain the ids of the left, top, right and bottom neighbor processes: Cartcomm::Shift()
16 MPI Implementation - Jacobi Iteration method (contd.)
– Process #0 computes the grid co-ordinates of the partition to be assigned to each process and sends them
– Processes #1 … N-1 wait for the partition co-ordinates from Process #0
– Each process allocates boundary buffers for its partition
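The partition assignment can be sketched without MPI as plain index arithmetic. The sketch below assumes square partitions, a grid dimension that is an exact multiple of the partition size, and row-major numbering of processes; the class and method names are hypothetical and -1 is an illustrative marker for a missing neighbor.

```java
import java.util.Arrays;

class PartitionSketch {
    /** Returns {rowStart, colStart, rowEnd, colEnd} (ends exclusive) for the given process rank. */
    static int[] partitionFor(int rank, int gridDim, int partitionSize) {
        if (gridDim % partitionSize != 0) {
            throw new IllegalArgumentException("grid must divide evenly into partitions");
        }
        int perSide = gridDim / partitionSize;          // partitions along each side
        if (rank < 0 || rank >= perSide * perSide) {
            throw new IllegalArgumentException("rank out of range");
        }
        int pRow = rank / perSide;                      // row-major placement (assumption)
        int pCol = rank % perSide;
        return new int[] {
            pRow * partitionSize, pCol * partitionSize,
            (pRow + 1) * partitionSize, (pCol + 1) * partitionSize
        };
    }

    /** Returns neighbor ranks as {left, top, right, bottom}; -1 marks a missing neighbor. */
    static int[] neighborsOf(int rank, int perSide) {
        int pRow = rank / perSide;
        int pCol = rank % perSide;
        int left = pCol > 0 ? rank - 1 : -1;
        int top = pRow > 0 ? rank - perSide : -1;
        int right = pCol < perSide - 1 ? rank + 1 : -1;
        int bottom = pRow < perSide - 1 ? rank + perSide : -1;
        return new int[] { left, top, right, bottom };
    }

    public static void main(String[] args) {
        // A 100 x 100 grid split into 50 x 50 partitions gives a 2 x 2 process matrix.
        for (int rank = 0; rank < 4; rank++) {
            System.out.println("rank " + rank
                + " partition " + Arrays.toString(partitionFor(rank, 100, 50))
                + " neighbors " + Arrays.toString(neighborsOf(rank, 2)));
        }
    }
}
```

In the MPI version the neighbor ids come from the Cartesian communicator rather than from hand-written arithmetic; the sketch only shows what information each process ends up holding.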
17 MPI Implementation - Jacobi Iteration method (contd.)
While iterations are not finished:
– Issue calls to Isend and Irecv, the non-blocking methods to send/receive data
– Compute inner cells
– Wait for the Isend and Irecv calls to complete
– Compute outer cells
When iterations are finished:
– Send the result data to Process #0
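The point of splitting the work into inner and outer cells is that inner cells need no data from other processes, so they can be computed while the boundary exchange is still in flight. The sketch below imitates that overlap in plain Java, with a CompletableFuture standing in for the pending Irecv calls. It is not MPI code; the class name and the halo layout (top row, bottom row, left column, right column) are assumptions.

```java
import java.util.concurrent.CompletableFuture;

class OverlapSketch {
    /**
     * One Jacobi step on an n x n partition. haloFuture eventually yields the neighbors'
     * boundary values: halo[0] = row above, halo[1] = row below,
     * halo[2] = column to the left, halo[3] = column to the right (each of length n).
     */
    static double[][] step(double[][] cur, CompletableFuture<double[][]> haloFuture) {
        int n = cur.length;
        double[][] next = new double[n][n];

        // Inner cells depend only on local data, so compute them before the halo arrives.
        for (int i = 1; i < n - 1; i++) {
            for (int j = 1; j < n - 1; j++) {
                next[i][j] = (cur[i - 1][j] + cur[i + 1][j] + cur[i][j - 1] + cur[i][j + 1]) / 4.0;
            }
        }

        // Counterpart of waiting for Isend/Irecv to complete.
        double[][] halo = haloFuture.join();

        // Outer cells: take any missing neighbor from the received halo.
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                boolean inner = i > 0 && i < n - 1 && j > 0 && j < n - 1;
                if (inner) continue;
                double up = i == 0 ? halo[0][j] : cur[i - 1][j];
                double down = i == n - 1 ? halo[1][j] : cur[i + 1][j];
                double left = j == 0 ? halo[2][i] : cur[i][j - 1];
                double right = j == n - 1 ? halo[3][i] : cur[i][j + 1];
                next[i][j] = (up + down + left + right) / 4.0;
            }
        }
        return next;
    }

    public static void main(String[] args) {
        double[][] cur = { {4, 4, 4}, {4, 4, 4}, {4, 4, 4} };
        // The halo is produced on another thread, standing in for a neighbor process.
        CompletableFuture<double[][]> halo = CompletableFuture.supplyAsync(() -> new double[4][3]);
        for (double[] row : step(cur, halo)) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

The larger the partition, the larger the share of inner cells, and the more communication time this ordering can hide.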
18 Parallel Adaptive Distributed Simulations – A new model … But why?
– Experiment with bringing together the concepts of partitions, double buffers, thread pools, jobs, synchronization schemes in thread pools, and load balancing by moving fixed-size jobs across machines.
– OptimalGrid does some of this; however, it is proprietary software, so there is no access to its source.
– It is fairly easy to code the application using MPI; however, for problems such as atomistic motion simulation, load balancing is a required feature.
– Can we do better?
19 Thread Pool
20 Jobs and partitions
Data members:
– int phase
Methods:
– bool execute

public bool execute(phase){
    switch(phase){
        case 0:
            // Sequential code for phase 0
            // About to enter synchronization part: advance phase by 1
            phase = 1;
            if(!synchronizationMethod(this)){
                // Synchronization returns false: the job has to wait on some condition,
                // so the job thread should relinquish this job
                return false;
            }
            else{
                // Synchronization returns true: the job can continue
                return true;
            }
        case 1:
            …
    }
}
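The phase mechanism can be tried out with a small self-contained sketch. Everything beyond the phase field, the execute method and the true/false convention is an assumption: the boundariesReady flag is a hypothetical stand-in for whatever condition the real synchronization method checks, and phase 1 is filled in with an illustrative time-step increment because the slide elides it.

```java
class PhasedJobSketch {
    static class Job {
        private int phase = 0;
        private int timeStep = 0;
        private boolean boundariesReady = false; // hypothetical wait condition

        /** Called by whoever supplies the neighbor data (hypothetical). */
        synchronized void deliverBoundaries() {
            boundariesReady = true;
        }

        /** Stand-in for synchronizationMethod(this): true if the job may proceed. */
        private boolean synchronizationMethod() {
            return boundariesReady;
        }

        /** Returns true if the job can continue, false if the worker thread should give it up. */
        synchronized boolean execute() {
            switch (phase) {
                case 0:
                    // Sequential code for phase 0 would run here.
                    phase = 1; // advance first, so a later call resumes after the sync point
                    return synchronizationMethod();
                case 1:
                    // Illustrative phase 1: consume the boundaries and finish the time-step.
                    boundariesReady = false;
                    timeStep++;
                    phase = 0;
                    return true;
                default:
                    throw new IllegalStateException("unknown phase " + phase);
            }
        }

        synchronized int getPhase() { return phase; }
        synchronized int getTimeStep() { return timeStep; }
    }

    /** Runs the job until it must wait or reaches targetStep; returns true if the target was reached. */
    static boolean runUntilBlocked(Job job, int targetStep) {
        while (job.getTimeStep() < targetStep) {
            if (!job.execute()) {
                return false; // a pool thread would now pick up a different job
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Job job = new Job();
        System.out.println("first run reached target: " + runUntilBlocked(job, 1));
        job.deliverBoundaries();
        System.out.println("second run reached target: " + runUntilBlocked(job, 1));
    }
}
```

Advancing the phase before the synchronization call is what lets a different thread resume the job later without repeating phase 0.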
21 PADS - Architecture
Controller:
– Parse input
– Open and monitor communication channels with controller agents
– Partition the input grid
– Assign the jobs to the hosts/nodes
– Co-ordinate load balancing
– Collect results from all hosts after iterations are completed
– Emit output in the specified format
Controller agent:
– Establish communication channels with the controller as well as with other controller agents
– Initialize the thread pool on the host
– Deploy jobs received from the controller in the thread pool
– Handle the communication requirements of each job
– Respond to controller messages (load balancing)
– Send results back to the controller after iterations are complete
[Figure labels: Node3, connected sockets, connection-less socket]
22 Communication between Controller Agents
23 Synchronization – Jobs and Controller agent
All communication between jobs goes through the controller agent(s).
Hence, synchronization is required only between the controller agent and the job.
Shared job data:
– Time step of job
– Boundary buffers
– Waiting flag
– Frozen flag
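One way to picture the shared job data is as a small monitor object that both the job thread and the controller agent lock. The sketch below is a guess at how the four listed fields could interact, not the PADS implementation: the method names, the four-sided boundary layout, and the meaning given to the waiting flag (job is parked until data arrives) and the frozen flag (job must not advance, for example while it is being moved) are all assumptions.

```java
class SharedJobStateSketch {
    private int timeStep = 0;
    private final double[][] boundary = new double[4][]; // left, top, right, bottom buffers
    private int received = 0;
    private boolean waiting = false;
    private boolean frozen = false;

    /** Called by the job thread: advance one time-step if all boundaries are in, else mark waiting. */
    synchronized boolean tryAdvance() {
        if (frozen || received < 4) {
            waiting = true;
            return false;
        }
        for (int side = 0; side < 4; side++) {
            boundary[side] = null; // buffers are consumed by this time-step
        }
        received = 0;
        waiting = false;
        timeStep++;
        return true;
    }

    /** Called by the controller agent; returns true if the parked job should be rescheduled. */
    synchronized boolean deliver(int side, double[] values) {
        if (boundary[side] == null) {
            received++;
        }
        boundary[side] = values;
        return waiting && !frozen && received == 4;
    }

    synchronized void setFrozen(boolean value) { frozen = value; }
    synchronized boolean isWaiting() { return waiting; }
    synchronized int getTimeStep() { return timeStep; }

    public static void main(String[] args) {
        SharedJobStateSketch state = new SharedJobStateSketch();
        System.out.println("advance before data: " + state.tryAdvance());
        boolean reschedule = false;
        for (int side = 0; side < 4; side++) {
            reschedule = state.deliver(side, new double[] { 0.0 });
        }
        System.out.println("reschedule: " + reschedule + ", advance: " + state.tryAdvance());
    }
}
```

Since every method takes the same lock, the job and the agent never observe a half-updated combination of time step, buffers and flags.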
24 Load Balancer
[Figure: controller, nodes (N), jobs (J), job movement and messages]
25 Overview of Load Balancer Module
26 Overview of Load Balancer Module (2)
27 Overview of Load Balancer Module (3)
28 Overview of Load Balancer Module (4)
29 Performance Comparison Experiment 1
30 Performance Comparison Experiment 2
31 Performance Comparison Experiment 3
32 PADS – Performance comparison for varying number of threads per node (50 x 50 partition size)
33 PADS – Performance comparison for varying number of threads per node (25 x 25 partition size)
34 PADS – Performance comparison for varying number of threads per node (10 x 10 partition size)
35 Preliminary Results for Load Balancer
36 Conclusions
– OptimalGrid seems to perform better than PADS and the MPI solution for a larger grain size (≥ 10 μs). (Accuracy of System.nanoTime()?)
– PADS and MPI perform better than OptimalGrid by an order of magnitude for small grain sizes (4 ~ 10 ns).
– OptimalGrid provides features that can be used easily by the user.
– MPI provides hooks for logging and debugging which can be used by the programmer.
– OptimalGrid and PADS allow load balancing to be done automatically.
– With PADS, the simulation results show that load balancing gives a good performance improvement.
– MPI does not provide dynamic load balancing.
37 Future Work
– Formulate and implement policies for dynamic load balancing in PADS.
– Experiment with flexibility: allow partitions to have variable dimensions. However, synchronization and communication will become more complex and might give rise to more overhead.
– Allow heterogeneity among jobs in the duration of a time-step and in the computation, which is needed to implement certain problems.
– Develop a GUI for PADS.
38 Acknowledgements
– Dr. Virgil E. Wallentine
– Dr. Daniel A. Andresen
– Dr. Gurdip Singh
– Dr. Masaaki Mizuno
Questions?