Strong Scalability Analysis and Performance Evaluation of a SAMR CCA-based Reacting Flow Code
Sophia Lefantzi, Jaideep Ray and Sameer Shende

SAMR: Structured Adaptive Mesh Refinement, a multiscale algorithm for Cartesian meshes.

Pros:
- Fewer grid points, hence less compute time.
- Preservation of numerical accuracy.

Cons:
- Load-balancing problems.
- Lack of scalability.

The Problem: Algorithms that ensure efficient resolution and numerical accuracy may not be scalable, and scaling analyses of SAMR simulations are rare.

The Objective: Identify the sources of non-scalability.

Methodology:
1. Choose a test problem whose initial and final computational-domain states are vastly different, so that:
   i. it poses a challenge to domain partitioning, and
   ii. the quality of the partitioning affects scalability.
2. Analyze timing measurements and domain-decomposition details to identify the causes of the lack of scalability.

Problem Statement: The global problem size is held constant while the virtual machine is doubled at each step, starting with 7 processors (7, 14, 28, 56, 112).
- For more than 28 processors the measured speedup deviates substantially from linear. Possible causes:
  - excessive transfer times caused by network saturation;
  - synchronization costs due to non-optimal domain and load partitioning.
- We pick two processor configurations, 28 and 112 processors, and analyze them in detail.
- The algorithm is dynamically adaptive. Are all intermediate states equally scalable (or non-scalable)? No, they are not.

Results and Analysis: Communication and compute time distributions for the 28-processor run (panels: Timestep 40, Timestep 80).

Scalable State (Timestep 40)
On the left: average compute time (T_comp), its standard deviation (σ_comp), and average communication time (T_comm) for 7 to 112 processors at timestep 40. On the right: snapshot of the mesh hierarchy at timestep 40.
- T_comp dominates T_comm (which includes data-transfer and synchronization times).
- We model communication time with a transfer-rate-bound model: T_comm(model) ~ t_tr, where t_tr is the data-transfer time.
- Linear scalability suffers for p ≥ 56. Why?

Communication patterns for p=28 (left) and p=112 (right) at timestep 40.
On the left: average communication time versus average communication radius for 5 different runs (p = 7 to 112) at timestep 40. On the right: Clos network schematic.
- The average communication radius is 3.2 for p=28 and 8.1 for p=112; comparing the two communication patterns shows that as the number of processors grows, processors communicate with partners farther away in the virtual machine.
- As the communication radius approaches 8 nodes and substantial traffic crosses the tier-1 switches, communication times increase.
Note: T_comp >> T_comm, so timestep 40 is scalable.

Unscalable State (Timestep 80)
On the left: average compute time (T_comp), its standard deviation (σ_comp), and average communication time (T_comm) for 7 to 112 processors at timestep 80. On the right: snapshot of the mesh hierarchy at timestep 80.
- T_comm dominates T_comp (T_comm includes data-transfer and synchronization times).
- Since the mesh has become visibly sparser, the increase in T_comm is NOT due to data-transfer times.
- We model communication time with a synchronization-bound model: T_comm(model) ~ t_sync, where t_sync is the synchronization time.
- The synchronization-bound model holds only for p=7 and p=14.
- For p > 28, σ_comp/T_comp > 0.25 and the non-linear effects of σ_comp dominate; this analysis will be part of future work.
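The transcript applies these diagnostics by hand from the timing plots; the short sketch below shows how the same classification could be done in a post-processing script, assuming per-processor compute and communication times for a single timestep have already been collected (for example with a profiler such as TAU). The function name, the example timings, and the data layout are illustrative assumptions; only the 0.25 imbalance threshold and the T_comp-versus-T_comm comparison come from the analysis above.

```python
import statistics

def classify_timestep(t_comp, t_comm, imbalance_threshold=0.25):
    """Classify one timestep's scalability regime from per-processor timings.

    t_comp, t_comm: per-processor compute / communication times (seconds).
    The 0.25 threshold on sigma_comp / T_comp follows the analysis above;
    everything else here is an illustrative assumption.
    """
    T_comp = statistics.mean(t_comp)
    T_comm = statistics.mean(t_comm)
    sigma_comp = statistics.pstdev(t_comp)

    if sigma_comp / T_comp > imbalance_threshold:
        return "imbalance-dominated"        # non-linear effects of sigma_comp dominate
    if T_comp > T_comm:
        return "compute-bound (scalable)"   # e.g. timestep 40: T_comp >> T_comm
    return "sync-bound"                     # e.g. timestep 80: T_comm dominates

# Hypothetical timings for a 7-processor run (seconds).
t_comp = [4.1, 4.0, 4.3, 3.9, 4.2, 4.0, 4.1]
t_comm = [0.6, 0.7, 0.5, 0.6, 0.8, 0.6, 0.7]
print(classify_timestep(t_comp, t_comm))    # -> "compute-bound (scalable)"
```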
Communication patterns for p=28 (left) and p=112 (right) at timestep 80.
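The average communication radius quoted above (3.2 for p=28, 8.1 for p=112) can be extracted from such a communication pattern. The sketch below is a hedged illustration, not the authors' definition: it assumes the pattern is available as a p-by-p message-count matrix and measures distance as separation in rank space, which only approximates how ranks map onto the machine's nodes and switches.

```python
def average_communication_radius(msg_counts):
    """Message-volume-weighted average sender-receiver distance.

    msg_counts[i][j]: number of messages (or bytes) sent from rank i to rank j.
    Distance is taken as |i - j| in rank space -- an assumed mapping of ranks to nodes.
    """
    p = len(msg_counts)
    weighted, total = 0.0, 0.0
    for i in range(p):
        for j in range(p):
            if i == j:
                continue
            weighted += msg_counts[i][j] * abs(i - j)
            total += msg_counts[i][j]
    return weighted / total if total else 0.0

# Toy 4-rank pattern: mostly nearest-neighbour traffic plus one long-range exchange.
pattern = [
    [0, 10, 0, 1],
    [10, 0, 10, 0],
    [0, 10, 0, 10],
    [1, 0, 10, 0],
]
print(average_communication_radius(pattern))  # ~1.06: nearly nearest-neighbour
```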