Scientific Computing on Heterogeneous Clusters using DRUM (Dynamic Resource Utilization Model)
Jamal Faik (1), J. D. Teresco (2), J. E. Flaherty (1), K. Devine (3), L. G. Gervasio (1)
(1) Department of Computer Science, Rensselaer Polytechnic Institute
(2) Department of Computer Science, Williams College
(3) Computer Science Research Institute, Sandia National Labs

Load Balancing on Heterogeneous Clusters
- Objective: generate partitions such that the number of elements in each partition matches the capabilities of the processor to which that partition is mapped (a simple capability-proportional sizing is sketched below)
- Minimize inter-node and/or inter-cluster communication
- Examples:
  - Single SMP: strict balance
  - Uniprocessors: minimize communication
  - Four 4-way SMPs: minimize communication across the slow network
  - Two 8-way SMPs: minimize communication across the slow network
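As an illustration only (not DRUM's partitioning code), the sketch below shows the basic idea of sizing partitions in proportion to per-processor power values; the processor count, power values, and element total are hypothetical.

```c
#include <stdio.h>

/* Illustration only: give each processor a share of the mesh elements
 * proportional to its relative power (all numbers here are hypothetical). */
int main(void) {
    double power[] = {1.0, 1.0, 1.5, 1.5};   /* two "slow" and two "fast" processors */
    int nprocs = 4, total_elements = 100000;

    double total_power = 0.0;
    for (int i = 0; i < nprocs; i++) total_power += power[i];

    for (int i = 0; i < nprocs; i++) {
        int target = (int)(total_elements * power[i] / total_power + 0.5);
        printf("processor %d: target %d elements\n", i, target);
    }
    return 0;
}
```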

Resource Capabilities
- Which capabilities should be monitored?
  - Processing power
  - Network bandwidth
  - Communication volume
  - Used and available memory
- How should the heterogeneity be quantified?
- On what basis should nodes be compared?
- How should SMPs be handled?

DRUM: Dynamic Resource Utilization Model
- A tree-based model of the execution environment
- Internal nodes model communication points (switches, routers)
- Leaf nodes model uniprocessor (UP) computation nodes or symmetric multiprocessors (SMPs)
- Can be used by existing load balancers with minimal modifications
(Diagram: a router connecting switches, with UP and SMP leaf nodes attached to them.)
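A minimal sketch of how such a tree might be represented in C; the type and field names are assumptions for illustration, not DRUM's actual data structures.

```c
/* Illustrative sketch of a DRUM-style machine-model tree.
 * Type and field names are hypothetical, not DRUM's real API. */
typedef enum { NODE_ROUTER, NODE_SWITCH, NODE_UP, NODE_SMP } node_kind;

typedef struct model_node {
    node_kind kind;
    int num_cpus;                  /* greater than 1 only for SMP leaves */
    double processing_power;       /* p_n, filled in by monitoring */
    double communication_power;    /* c_n, filled in by monitoring */
    struct model_node **children;  /* non-NULL only for internal nodes */
    int num_children;
} model_node;
```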

Node Power
- For each node in the tree, quantify its capabilities by computing a power value
- The power of a node is the percentage of the total load it can handle in accordance with its capabilities
- The power of a node n includes its processing power (p_n) and its communication power (c_n)
- It is computed as a weighted sum of processing power and communication power:
  power_n = w_cpu * p_n + w_comm * c_n
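A sketch, with hypothetical names, of computing this weighted sum for every node and normalizing so that the powers express each node's share of the total load.

```c
/* Sketch (hypothetical names): power_n = w_cpu * p_n + w_comm * c_n,
 * then normalize across nodes so the powers sum to 1 and give each
 * node's fraction of the total load. */
void compute_powers(const double *p, const double *c, int n_nodes,
                    double w_cpu, double w_comm, double *power) {
    double total = 0.0;
    for (int i = 0; i < n_nodes; i++) {
        power[i] = w_cpu * p[i] + w_comm * c[i];
        total += power[i];
    }
    for (int i = 0; i < n_nodes; i++)
        power[i] /= total;   /* fraction of the total load for node i */
}
```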

Processing (CPU) power
- Combines a static part obtained from benchmarks and a dynamic part:
  p_n = b_n (u_n + i_n)
  where b_n = benchmark value, u_n = CPU utilization by the local process, and i_n = percentage of CPU idle time
- The processing power of an internal node is computed as the sum of the powers of the node's immediate children
- For an SMP node n with m CPUs and k_n running application processes, p_n is computed with an SMP-specific variant of this formula (a sketch of the uniprocessor and internal-node cases follows below)
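A minimal sketch of the two cases spelled out above, with hypothetical function names; the SMP-specific formula is not reproduced on the slide, so it is omitted here.

```c
/* Sketch (hypothetical names): dynamic processing power of a uniprocessor node,
 * p_n = b_n * (u_n + i_n).
 *   b_n: static benchmark value (e.g., MFLOPS from Linpack)
 *   u_n: fraction of CPU time used by the local application process
 *   i_n: fraction of CPU idle time
 * u_n and i_n are measured over the current monitoring period. */
double processing_power_up(double b_n, double u_n, double i_n) {
    return b_n * (u_n + i_n);
}

/* Processing power of an internal node: sum of its children's powers. */
double processing_power_internal(const double *child_power, int num_children) {
    double sum = 0.0;
    for (int i = 0; i < num_children; i++) sum += child_power[i];
    return sum;
}
```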

Communication power
- The communication power c_n of a node n is estimated as the sum of the average available bandwidth across all communication interfaces of node n
- If, during a given monitoring period T, λ_n,i and μ_n,i denote the average rates of incoming and outgoing packets at node n over interface i, k is the number of communication interfaces (links) at node n, and s_n,i is the maximum bandwidth of communication interface i, then c_n is obtained by summing, over the k interfaces, the bandwidth left available on each interface (a hedged sketch of such an estimate follows below)
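The slide's exact formula is not reproduced above, so the following is only an assumed illustration of the "available bandwidth" idea: interface capacity minus observed traffic, summed over interfaces. The function name, the average-packet-size parameter, and the clamping at zero are assumptions.

```c
/* Assumed illustration only (not DRUM's actual formula): estimate a node's
 * communication power as the bandwidth left over on its interfaces after
 * subtracting the traffic observed during the monitoring period. */
double communication_power(const double *lambda,   /* incoming packets/s per interface */
                           const double *mu,       /* outgoing packets/s per interface */
                           const double *s,        /* max bandwidth per interface (bytes/s) */
                           int k,                  /* number of interfaces */
                           double avg_packet_bytes /* assumed average packet size */) {
    double c_n = 0.0;
    for (int i = 0; i < k; i++) {
        double used = (lambda[i] + mu[i]) * avg_packet_bytes; /* observed traffic, bytes/s */
        double avail = s[i] - used;
        if (avail < 0.0) avail = 0.0;  /* clamp when an interface is saturated */
        c_n += avail;
    }
    return c_n;
}
```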

Weights
- What values should be used for w_comm and w_cpu?
- w_comm + w_cpu = 1
- The values depend on the application's communication-to-processing ratio during the monitoring period
- This ratio is hard to estimate, especially when communication and processing are overlapped

Implementation
- The topology is described in an XML file generated by a graphical configuration tool (DRUMHead)
- A benchmark (Linpack) is run to obtain MFLOPS ratings for all computation nodes
- Dynamic monitoring runs in parallel with the application to collect the data needed for the power computation

Configuration tool
- Used to describe the topology
- Also used to run the benchmark (LINPACK) to obtain MFLOPS ratings for the computation nodes
- Computes bandwidth values for all communication interfaces
- Generates the XML file describing the execution environment

Dynamic Monitoring
- Dynamic monitoring is implemented by two kinds of monitors:
  - CommInterface monitors collect communication traffic information
  - CpuMem monitors collect CPU information
- Monitors are run in separate threads (a possible threading sketch is shown below)
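A minimal pthreads sketch of a monitor running in its own thread and sampling periodically; the structure, names, and the placeholder measurement are assumptions for illustration, not DRUM's implementation.

```c
/* Sketch (hypothetical, not DRUM's code): a monitor thread that samples
 * resource usage periodically until it is asked to stop. */
#include <pthread.h>
#include <unistd.h>

typedef struct {
    volatile int stop;      /* set to 1 by the main thread to end monitoring */
    unsigned interval_sec;  /* probing frequency, e.g. 1 second */
    double latest_power;    /* most recent power estimate */
} monitor;

static double sample_power(void) { return 1.0; }  /* placeholder measurement */

static void *monitor_main(void *arg) {
    monitor *m = (monitor *)arg;
    while (!m->stop) {
        m->latest_power = sample_power();  /* e.g. read /proc or interface counters */
        sleep(m->interval_sec);
    }
    return NULL;
}
```

Starting the thread with pthread_create(&tid, NULL, monitor_main, &m), then setting m.stop = 1 and joining, would correspond roughly to the Start/Stop/GetPower operations shown on the next slide.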

Monitoring
(Diagram: CommInterface and CpuMem monitors, each exposing Open, Start, Stop, and GetPower operations, attached to the execution environment: a tree of routers R1-R4 and computation nodes N11-N14.)

Interface to LB algorithms
- DRUM_createModel
  - Reads the XML file and generates the tree structure
  - Specific computation nodes (representatives) monitor one (or more) communication nodes
  - On SMPs, one processor monitors communication
- DRUM_startMonitoring
  - Starts monitors on every node in the tree
- DRUM_stopMonitoring
  - Stops the monitors and computes the powers
- A possible calling sequence is sketched below
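A hedged sketch of how a load balancer might call these three routines around an application phase; the routine names come from the slide, but the argument lists and return types here are assumed, and the application hooks are hypothetical.

```c
/* Sketch only: the DRUM_* names come from the slide; their signatures here
 * are assumed for illustration. */
void *DRUM_createModel(const char *xml_path);
void  DRUM_startMonitoring(void *model);
void  DRUM_stopMonitoring(void *model);

/* Hypothetical application hooks, not part of DRUM. */
void run_application_phase(void);
void repartition_using_powers(void *model);

void rebalance_with_drum(const char *topology_xml) {
    void *model = DRUM_createModel(topology_xml);  /* build the tree from the XML description */

    DRUM_startMonitoring(model);   /* monitors run while the application computes */
    run_application_phase();       /* one compute/communicate phase of the application */
    DRUM_stopMonitoring(model);    /* stop the monitors and compute node powers */

    /* Feed the resulting powers to the load balancer, e.g. as partition sizes
     * proportional to each node's power (as in the Octree and HSFC runs). */
    repartition_using_powers(model);
}
```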

Experimental results
- Obtained by running a two-dimensional Rayleigh-Taylor instability problem
- Sun cluster with "fast" and "slow" nodes
- Fast nodes are approximately 1.5 times faster than slow nodes
- Same number of slow and fast nodes
- Used a modified Zoltan Octree LB algorithm
(Table: total execution time in seconds for Octree and Octree + DRUM at several processor counts, with the percentage improvement from DRUM.)

DRUM on homogeneous clusters?
- We ran Rayleigh-Taylor on a collection of homogeneous clusters and used the DRUM-enabled Octree algorithm
- Experiments used a probing frequency of 1 second
(Table: execution time in seconds for Octree and Octree + DRUM on homogeneous sets of fast and of slow processors.)

PHAML results with HSFC (Hilbert Space-Filling Curve)
- Used DRUM to guide load balancing in the solution of a Laplace equation on the unit square
- Used Bill Mitchell's (NIST) Parallel Hierarchical Adaptive Multi-Level (PHAML) software
- Runs on a combination of "fast" and "slow" processors
- The "fast" processors are 1.5 times faster than the slow ones

PHAML experiments on the Williams College Bullpen cluster
- We used DRUM to guide resource-aware HSFC load balancing in the adaptive solution of a Laplace equation on the unit square, using PHAML
- After 17 adaptive refinement steps, the mesh has 524,500 nodes
- Runs were performed on the Williams College Bullpen cluster

PHAML experiments (1)

PHAML experiments (2)

PHAML experiments: Relative Change vs. Degree of Heterogeneity
- The improvement gained by using DRUM is more substantial when the cluster is more heterogeneous
- We used a measure of the degree of heterogeneity based on the variance of the nodes' MFLOPS ratings obtained from the benchmark runs (a sketch of such a measure follows below)
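The slide does not give the exact definition, so the following is only an assumed illustration of a variance-based heterogeneity measure over the per-node MFLOPS ratings; the normalization by the squared mean is also an assumption.

```c
/* Assumed illustration (not necessarily the measure used in this work):
 * degree of heterogeneity as the variance of per-node MFLOPS ratings,
 * normalized by the squared mean so the measure is scale-free. */
double heterogeneity(const double *mflops, int n) {
    double mean = 0.0, var = 0.0;
    for (int i = 0; i < n; i++) mean += mflops[i];
    mean /= n;
    for (int i = 0; i < n; i++) {
        double d = mflops[i] - mean;
        var += d * d;
    }
    var /= n;
    return var / (mean * mean);  /* squared coefficient of variation */
}
```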

PHAML experiments: non-dedicated usage
- A synthetic, purely computational load (no communication) was added on the last two processors

Latest DRUM efforts
- Implementation using NWS (Network Weather Service) measurements
- Integration with Zoltan's new hierarchical partitioning and load balancing
- Porting to Linux and AIX
- Interaction between the DRUM core and DRUMHead

The primary funding for this work has been through Sandia National Laboratories by contract and by the Computer Science Research Institute. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Backup 1: Adaptive applications
- Discretize the solution domain with a mesh
- Distribute the mesh over the available processors
- Compute the solution on each element's domain and integrate
- The error resulting from the discretization drives refinement/coarsening of the mesh (mesh enrichment)
- Mesh enrichment results in an imbalance in the number of elements assigned to each processor
- Load balancing becomes necessary (a sketch of this adaptive loop follows below)
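A short sketch of the adaptive loop described above; all function names are hypothetical placeholders for the corresponding application steps.

```c
/* Sketch of the adaptive solve/refine/rebalance loop (hypothetical hooks). */
void partition_mesh(void);
void compute_solution(void);
void estimate_error(void);
void refine_and_coarsen(void);
double measure_imbalance(void);
void rebalance(void);

void adaptive_solve(int max_steps, double imbalance_tolerance) {
    partition_mesh();                       /* initial distribution over processors */
    for (int step = 0; step < max_steps; step++) {
        compute_solution();                 /* solve on the current mesh */
        estimate_error();                   /* per-element discretization-error estimate */
        refine_and_coarsen();               /* mesh enrichment driven by the error */
        if (measure_imbalance() > imbalance_tolerance)
            rebalance();                    /* dynamic load balancing (e.g. Octree/SFC) */
    }
}
```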

Dynamic Load Balancing
- Graph-based methods (Metis, Jostle)
- Geometric methods
  - Recursive Inertial Bisection
  - Recursive Coordinate Bisection
- Octree/SFC methods

Backup 2: PHAML experiments, communication weight study