Memory-Aware Scheduling for LU in Charm++ Isaac Dooley, Chao Mei, Jonathan Lifflander, Laxmikant V. Kale

Problem
Unrestricted parallelism may lead to a continuous increase of memory usage on a node
– e.g., LU lookahead
Previous solutions
– Statically restricting concurrency (HPL)
– Dynamically restricting concurrency, but also restricting some tasks to eliminate deadlock (Husbands and Yelick)

A timeline view, colored by memory usage, of an LU program run on 64 processors of BG/P using a block-cyclic mapping for an N = sized matrix with 512 x 512 blocks. The traditional block-cyclic mapping suffers from limited concurrency at the end (the right portion of this plot). This is most problematic for small matrices.

Goal
The language runtime system should provide a mechanism to schedule for memory usage
– Adaptive runtime systems (RTS) are the future
Memory-aware scheduling is a case study of one adaptive technique that could be exploited in an RTS
– The Charm++ RTS is used as the framework to study this technique

Charm++ Essentials
Computation is expressed as a collection of objects that interact via asynchronous method invocations
– The RTS controls the mapping of objects to PEs
– Adaptive techniques are naturally introduced
AMPI provides the same functionality for MPI applications
Schedulers in the Charm++ RTS
– Queues with priorities
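
As a concrete illustration of this model, here is a minimal Charm++ sketch. The module, chare, and entry method names are hypothetical and chosen only to show an asynchronous invocation through a proxy; termination handling is omitted.

```cpp
// hello.ci (Charm++ interface file; names are hypothetical):
//
//   mainmodule hello {
//     mainchare Main {
//       entry Main(CkArgMsg* m);
//     };
//     array [1D] Worker {
//       entry Worker();
//       entry void compute(int step);
//     };
//   };

// hello.C
#include "hello.decl.h"

class Main : public CBase_Main {
public:
  Main(CkArgMsg* m) {
    // Create a collection of 8 worker objects; the RTS decides their placement.
    CProxy_Worker workers = CProxy_Worker::ckNew(8);
    // Asynchronous method invocation: this call returns immediately and the
    // RTS delivers a message to each element, wherever it happens to live.
    workers.compute(0);
    delete m;
  }
};

class Worker : public CBase_Worker {
public:
  Worker() {}
  Worker(CkMigrateMessage*) {}   // migration constructor required for array elements
  void compute(int step) {
    CkPrintf("element %d running step %d on PE %d\n", thisIndex, step, CkMyPe());
  }
};

#include "hello.def.h"
```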

Memory-Aware Scheduling
In the parallel interface file:
– Tag entry methods known to decrease memory usage with [memcritical]
– At runtime, set a memory threshold
Scheduler
– When the threshold is reached:
  Perform a linear scan of the priority queues
  Schedule the first task known to reduce memory usage
  Repeat until memory usage is below the threshold
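
The scan itself is straightforward; below is a rough stand-alone C++ sketch of the policy. It is not the actual Charm++ scheduler code: the message descriptor, memory probe, and execute() hook are placeholders standing in for runtime internals.

```cpp
#include <cstddef>
#include <deque>

// Placeholder message descriptor: in the real runtime this would be a
// Charm++ message whose target entry method may carry the [memcritical] tag.
struct Msg {
  int priority;
  bool memCritical;   // true if the target entry method was tagged [memcritical]
};

// Placeholder hooks; the real runtime would query the allocator / OS and
// deliver the message to its entry method.
static std::size_t currentMemUsage() { return 0; }
static void execute(const Msg&) {}

// One scheduling step.  Normally the highest-priority message is taken from
// the front of the queue; when memory usage exceeds the configured threshold,
// the queue is scanned linearly and the first memory-reducing task is run
// instead, repeating until usage drops back below the threshold.
void schedulerStep(std::deque<Msg>& queue, std::size_t memThreshold) {
  while (currentMemUsage() > memThreshold) {
    bool ranMemCritical = false;
    for (auto it = queue.begin(); it != queue.end(); ++it) {
      if (it->memCritical) {
        Msg m = *it;
        queue.erase(it);
        execute(m);              // this task frees more memory than it allocates
        ranMemCritical = true;
        break;
      }
    }
    if (!ranMemCritical) break;  // nothing memory-reducing is queued; give up
  }
  if (!queue.empty()) {
    execute(queue.front());      // ordinary priority-order scheduling
    queue.pop_front();
  }
}
```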

Memory-Aware Scheduling Overhead
– In an LU program with an N = x matrix and 512 x 512 block size, the average time spent in scheduler code is seconds
– The LU factorization takes seconds
– Negligible overhead of 0.014%

LU in Charm++
- LU solve on the diagonal block
- Broadcast of L and U across the row and column
- Triangular solves for L and U in the row and column
- Trailing updates for the submatrix
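
For reference, step k of a right-looking block LU factorization (the textbook formulation, without pivoting) matches the four operations above; the broadcast distributes the freshly factored diagonal blocks L_kk and U_kk to the rest of the row and column.

```latex
\begin{align*}
A_{kk} &= L_{kk} U_{kk} && \text{factor the diagonal block} \\
U_{kj} &= L_{kk}^{-1} A_{kj}, \quad j > k && \text{triangular solves across the row} \\
L_{ik} &= A_{ik} U_{kk}^{-1}, \quad i > k && \text{triangular solves down the column} \\
A_{ij} &\leftarrow A_{ij} - L_{ik} U_{kj}, \quad i, j > k && \text{trailing submatrix update}
\end{align*}
```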

Mapping Blocks to Processors
Block-cyclic mapping reduces concurrency at the end
– However, it decreases the cost of communication (by limiting the number of processors involved in each multicast across a row or column)
– For smaller matrices, another mapping scheme may perform better due to better load balance (even if it involves more processors in the multicasts)
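
For comparison, a conventional 2-D block-cyclic assignment of block (i, j) to a Pr x Pc processor grid can be written as below. This is a generic sketch, not the exact mapping code used in this work.

```cpp
#include <cstdio>

// Map block (i, j) to a processor in a Pr x Pc grid, 2-D block-cyclic style.
// With this scheme at most Pc processors participate in a row multicast and
// at most Pr in a column multicast, which bounds the communication cost.
int blockCyclicOwner(int i, int j, int Pr, int Pc) {
  return (i % Pr) * Pc + (j % Pc);
}

int main() {
  const int Pr = 8, Pc = 8;   // 64 processors, as in the BG/P runs
  std::printf("block (9, 3) -> PE %d\n", blockCyclicOwner(9, 3, Pr, Pc));
  return 0;
}
```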

Balanced Snake Mapping
Traverse blocks in roughly decreasing order of work
– As the diagram shows
Assign each block to the processor that has been assigned the least work so far
– Keep a list of processors and the amount of work each has been assigned
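
A sketch of the greedy assignment step is shown below (plain C++). The snake traversal order is abstracted into a sort by estimated work, and the per-block work estimates are placeholders; the actual traversal follows the diagram in the slides.

```cpp
#include <algorithm>
#include <vector>

struct Block { int i, j; double work; };   // estimated work per block (placeholder units)

// Assign each block, visited in roughly decreasing order of work, to the
// processor that currently has the least total assigned work.
// Returns owner[i][j] = processor assigned to block (i, j).
std::vector<std::vector<int>> balancedAssign(std::vector<Block> blocks,
                                             int numBlocksPerDim, int numProcs) {
  std::sort(blocks.begin(), blocks.end(),
            [](const Block& a, const Block& b) { return a.work > b.work; });
  std::vector<double> load(numProcs, 0.0);   // work assigned to each processor so far
  std::vector<std::vector<int>> owner(numBlocksPerDim,
                                      std::vector<int>(numBlocksPerDim, -1));
  for (const Block& blk : blocks) {
    int p = static_cast<int>(std::min_element(load.begin(), load.end()) - load.begin());
    owner[blk.i][blk.j] = p;
    load[p] += blk.work;
  }
  return owner;
}
```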

Balanced Snake Mapping

Memory Increase in LU
Trailing updates may be delayed
– They are only needed for the next diagonal block and the next set of triangular solves (which may also be delayed)
– These are scheduled using priorities
– Trailing updates accumulate in the queue (because of their relatively low priority), increasing memory usage
– Their priority is overridden and they are scheduled immediately if the memory threshold is reached

With Memory-Aware Scheduling

Without Memory-Aware Scheduling

Memory-Aware Scheduling

Performance

Future work
– Make the scheduler automatically detect which entry methods should be marked memory critical
– Respect priorities among messages marked memory critical in the scheduler
– Allow other messages to be marked as increasing memory, or as having no effect on memory

Conclusion
A general memory-aware scheduling technique is demonstrated
– It could be used in other runtime systems
– Charm++ is used as a case study
A new LU block mapping in a message-driven system
– Performs better for small matrices