Parallelization of CPAIMD using Charm++

Parallel Programming Lab

CPAIMD
- Collaboration with Glenn Martyna and Mark Tuckerman
- MPI code – PINY
  - Scalability problems when #procs >= #orbitals
- Charm++ approach
  - Better scalability using virtualization
  - Further divide the orbitals

The Iteration

The Iteration (contd.)
- Start with 128 “states”
  - State – the spatial representation of an electron
- FFT each of the 128 states
  - In parallel: planar decomposition => transpose
- Compute densities (DFT)
- Compute energies using the density
- Compute forces and move the electrons
- Orthonormalize the states
- Start over
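To make the data flow of the iteration concrete, here is a minimal serial NumPy sketch; the sizes are toy values, the energy/force step is elided, and the QR call merely stands in for the orthonormalization described later, so nothing here is the production algorithm.

```python
# A minimal serial sketch of one iteration's data flow, using NumPy.
# Toy sizes (the slides use 128 states on a much larger grid); the energy/force
# step is omitted, and QR stands in for the real orthonormalization scheme.
import numpy as np

n_states, grid = 8, 16
rng = np.random.default_rng(0)
states_g = rng.standard_normal((n_states, grid, grid, grid))  # reciprocal-space states

# 1. FFT each state to real space (done in parallel, plane by plane, in the real code)
states_r = np.fft.ifftn(states_g, axes=(1, 2, 3))

# 2. Compute the electron density from all states
density = np.sum(np.abs(states_r) ** 2, axis=0)

# 3. Energies and forces would be computed from the density here (omitted)

# 4. Orthonormalize the states (QR here; the real code uses an all-pairs
#    overlap computation, described on a later slide)
flat = states_r.reshape(n_states, -1)
q, _ = np.linalg.qr(flat.T)
states_r = q.T.reshape(n_states, grid, grid, grid)

# 5. Transform back to reciprocal space and start the next iteration
states_g = np.fft.fftn(states_r, axes=(1, 2, 3))
print("density grid:", density.shape, "total density:", float(density.sum()))
```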

Parallel View

Optimized Parallel 3D FFT
- To perform the 3D FFT, do a 1D FFT followed by a 2D FFT instead of a 2D FFT followed by a 1D FFT
  - Less computation
  - Less communication
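The decomposition itself is just separability of the DFT; below is a minimal NumPy check of the equivalence. The computation/communication savings presumably come from the sparse g-space layout in the real code, which is an inference from the slides and not modeled by this dense toy.

```python
# Minimal NumPy check of the decomposition: a 3D FFT equals a 1D FFT along one
# axis followed by a 2D FFT over the remaining two axes.
import numpy as np

x = np.random.default_rng(1).standard_normal((8, 8, 8))

full = np.fft.fftn(x)                  # one-shot 3D FFT
step = np.fft.fft(x, axis=0)           # 1D FFT along the first axis
step = np.fft.fft2(step, axes=(1, 2))  # then a 2D FFT over each plane

assert np.allclose(full, step)
print("1D-then-2D matches the 3D FFT")
```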

Orthonormalization
- All-pairs operation
  - The data of each state has to meet the data of all other states
- Our approach (picture follows)
  - A virtual processor (VP) acts as a meeting point for several pairs of states
  - Create lots of these VPs
  - The number of pairs meeting at a VP: n
    - Communication decreases with n
    - Computation increases with n
    - A balance is required (see the sketch below)
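To see why communication falls and per-VP computation grows with n, here is a toy sketch that assumes a simple block-tiling of the state-pair matrix with one tile per VP; the slides do not specify the exact mapping, so the tiling is an assumption.

```python
# Toy model of the trade-off, assuming the S x S pair matrix is tiled into
# block x block tiles with one tile per virtual processor (the exact mapping in
# the real code may differ).
import math

S = 128  # number of states

def cost(block):
    tiles_per_side = math.ceil(S / block)
    pairs_per_vp = block * block             # upper bound on pairs meeting at one VP
    vps_per_state = 2 * tiles_per_side - 1   # VPs in that state's tile row and column
    return pairs_per_vp, vps_per_state

for block in (1, 4, 8, 16, 32):
    n, msgs = cost(block)
    print(f"block={block:3d}  pairs per VP={n:5d}  VPs each state sends to={msgs:3d}")
```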

VP-based approach

Performance
- Existing MPI code – PINY
  - Does not scale beyond 128 processors
  - Best per-iteration time: 1.7 s
- Our performance:

  Processors   Time (s)
  128          2.07
  256          1.18
  512          0.65
  1024         0.48
  1536         0.39
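For context, the table translates into the following speedups and parallel efficiencies relative to the 128-processor run (simple arithmetic on the numbers above):

```python
# Speedup and parallel efficiency relative to the 128-processor run, computed
# directly from the per-iteration times in the table above.
times = {128: 2.07, 256: 1.18, 512: 0.65, 1024: 0.48, 1536: 0.39}
base_p, base_t = 128, times[128]
for p, t in times.items():
    speedup = base_t / t
    efficiency = speedup * base_p / p
    print(f"{p:5d} procs: speedup {speedup:4.2f}x, efficiency {efficiency:4.2f}")
```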

Load Balancing
- Load imbalance due to the distribution of data among the orbitals
  - Planes are sections of a sphere, hence the imbalance
  - Computation – more points on some planes
  - Communication – more data to send
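A rough sketch of why the planes differ in load, under the assumption that the g-space points fill a sphere and each plane is one circular cross-section of it (purely illustrative numbers):

```python
# Rough geometric picture of the imbalance, assuming the g-space points fill a
# sphere of radius R and each plane holds one z cross-section (a disc of radius
# sqrt(R^2 - z^2)); the numbers are purely illustrative.
import numpy as np

R = 20
z = np.arange(-R, R + 1)
points_per_plane = np.pi * (R**2 - z**2)  # area of each circular cross-section

print(f"heaviest plane ~{points_per_plane.max():.0f} points, "
      f"lightest ~{points_per_plane.min():.0f} points, "
      f"heaviest/mean ratio {points_per_plane.max() / points_per_plane.mean():.2f}")
```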

Load Imbalance
- Iteration time: 900 ms on 1024 processors

Improvement I
- Pair heavily loaded planes with lightly loaded planes
- Iteration time: 590 ms
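A small sketch of the pairing idea; the slide only says heavy planes are paired with light ones, so the specific heaviest-with-lightest matching below is an assumption.

```python
# Sketch of the pairing idea: co-locate the heaviest remaining plane with the
# lightest remaining one, so each pair carries a roughly equal share. The exact
# pairing rule in the real code is not stated on the slide.
def pair_heavy_with_light(loads):
    order = sorted(range(len(loads)), key=lambda i: loads[i])
    pairs = []
    while len(order) > 1:
        light = order.pop(0)   # lightest remaining plane
        heavy = order.pop()    # heaviest remaining plane
        pairs.append((heavy, light))
    return pairs

loads = [100, 90, 75, 60, 40, 25, 10, 5]
for heavy, light in pair_heavy_with_light(loads):
    print(f"planes {heavy} + {light}: combined load {loads[heavy] + loads[light]}")
```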

Charm++ Load Balancing
- Load balancing provided by the system
- Iteration time: 600 ms

Improvement II
- Use a load-vector-based scheme to map planes to processors
- A processor holding “heavy” planes is assigned correspondingly fewer “light” planes
- Iteration time: 480 ms
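One plausible reading of the load-vector scheme is a greedy mapping that always places the next-heaviest plane on the least-loaded processor; the sketch below is that interpretation, not necessarily the implemented algorithm.

```python
# One plausible reading of a load-vector-based mapping: greedily place the next
# heaviest plane on the currently least-loaded processor, so processors that got
# heavy planes receive fewer light ones. This greedy rule is an assumption, not
# the documented algorithm.
import heapq

def map_planes(loads, nprocs):
    heap = [(0.0, p) for p in range(nprocs)]   # (current load, processor id)
    heapq.heapify(heap)
    assignment = {p: [] for p in range(nprocs)}
    for plane in sorted(range(len(loads)), key=lambda i: -loads[i]):
        load, p = heapq.heappop(heap)          # least-loaded processor so far
        assignment[p].append(plane)
        heapq.heappush(heap, (load + loads[plane], p))
    return assignment

planes = [100, 90, 75, 60, 40, 25, 10, 5, 3, 2, 1, 1]
print(map_planes(planes, 4))
```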

Scope for Improvement
- Load balancing
  - The Charm++ load balancer shows encouraging results on 512 PEs
  - Combination of automated and manual load balancing
- Avoid copying when sending messages
  - In FFTs
  - When sending large read-only messages
- FFTs can be made more efficient
  - Use double packing (see the sketch below)
  - Make assumptions about the data distribution when performing the FFTs
- Alternative implementation of orthonormalization
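On the double-packing item: this is a standard trick for real-input FFTs, shown below; whether the production code uses exactly this formulation is an assumption, and the sketch only demonstrates the identity.

```python
# "Double packing": two real sequences are packed into one complex sequence, so
# a single complex FFT yields both transforms.
import numpy as np

rng = np.random.default_rng(2)
a, b = rng.standard_normal(64), rng.standard_normal(64)

Z = np.fft.fft(a + 1j * b)          # one complex FFT of the packed data
Zr = np.conj(np.roll(Z[::-1], 1))   # conj(Z[(N - k) mod N])
A = (Z + Zr) / 2                    # unpacked transform of a
B = (Z - Zr) / (2j)                 # unpacked transform of b

assert np.allclose(A, np.fft.fft(a)) and np.allclose(B, np.fft.fft(b))
print("one complex FFT recovered both real transforms")
```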