Parallel Multi-Reference Configuration Interaction on JAZZ
Ron Shepard (CHM), Mike Minkoff (MCS), Mike Dvorak (MCS)

The COLUMBUS Program System: Molecular Electronic Structure
A collection of individual programs that communicate through external files:
1: Atomic-Orbital Integral Generation
2: Orbital Optimization (MCSCF, SCF)
3: Integral Transformation
4: MR-SDCI
5: CI Density
6: Properties (energy gradient, geometry optimization)

Real Symmetric Eigenvalue Problem
Use the iterative Davidson method for the lowest (or lowest few) eigenpairs
Direct CI: H is not explicitly constructed; the products w = Hv are computed in "operator" form
Matrix dimensions are 10^4 to 10^9
All floating-point calculations are 64-bit

Davidson Method
Generate an initial vector x_1
MAINLOOP: DO n = 1, NITER
   Compute and save w_n = H x_n
   Compute the n-th row and column of G = X^T H X = W^T X
   Compute the subspace Ritz pair: (G - λ 1) c = 0
   Compute the residual vector r = W c - λ X c
   Check for convergence using |r|, c, λ, etc.
   IF (converged) THEN
      EXIT MAINLOOP
   ELSE
      Generate a new expansion vector x_{n+1} from r, λ, v = X c, etc.
   ENDIF
ENDDO MAINLOOP
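The loop above maps directly onto a small dense-matrix version of the method. The following stand-alone Fortran sketch is illustrative only (the production PCIUDG code never forms H explicitly and works with distributed, segmented vectors); it assumes LAPACK's DSYEV is available for the small subspace eigenproblem, and the diagonal-preconditioner step used to build the new expansion vector is the usual textbook Davidson choice rather than anything taken from the slides.

! Minimal dense-matrix Davidson sketch for the lowest eigenpair.
! Compile, e.g.:  gfortran davidson_sketch.f90 -llapack
program davidson_sketch
  implicit none
  integer, parameter :: n = 200, maxit = 60
  real(8) :: h(n,n), x(n,maxit+1), w(n,maxit), gmat(maxit,maxit)
  real(8) :: c(maxit), r(n), diag(n), lambda, denom
  real(8), allocatable :: gsub(:,:), eval(:), work(:)
  integer :: i, j, it, info

  call random_number(h)                    ! small diagonally dominant test matrix
  h = 0.01d0*(h + transpose(h))
  do i = 1, n
     h(i,i) = dble(i)
     diag(i) = h(i,i)
  end do

  x = 0.0d0
  x(1,1) = 1.0d0                           ! initial vector x_1

  mainloop: do it = 1, maxit
     w(:,it) = matmul(h, x(:,it))          ! w_n = H x_n (the expensive step)
     do j = 1, it                          ! n-th row/column of G = W^T X
        gmat(j,it) = dot_product(w(:,it), x(:,j))
        gmat(it,j) = gmat(j,it)
     end do
     allocate(gsub(it,it), eval(it), work(3*it))
     gsub = gmat(1:it,1:it)                ! subspace Ritz pair (G - lambda 1) c = 0
     call dsyev('V', 'U', it, gsub, it, eval, work, 3*it, info)
     lambda  = eval(1)
     c(1:it) = gsub(:,1)
     r = matmul(w(:,1:it), c(1:it)) - lambda*matmul(x(:,1:it), c(1:it))
     deallocate(gsub, eval, work)
     if (sqrt(dot_product(r,r)) < 1.0d-8) exit mainloop
     do i = 1, n                           ! diagonal (Davidson) preconditioner
        denom = lambda - diag(i)
        if (abs(denom) < 1.0d-6) denom = sign(1.0d-6, denom)
        r(i) = r(i)/denom
     end do
     do j = 1, it                          ! orthogonalize against the subspace
        r = r - dot_product(x(:,j), r)*x(:,j)
     end do
     x(:,it+1) = r/sqrt(dot_product(r,r))  ! new expansion vector x_{n+1}
  end do mainloop

  print '(a,i4,a,f14.8)', ' iterations: ', it, '   lowest eigenvalue = ', lambda
end program davidson_sketch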

Matrix Elements
H_mn = <m|H|n>
|n> = |φ_1(r_1)σ_1 φ_2(r_2)σ_2 … φ_n(r_n)σ_n| with σ_j = α, β

…Matrix Elements
h_pq and g_pqrs are computed and stored as arrays (with index symmetry)
<m|E_pq|n> and <m|e_pqrs|n> are coupling coefficients; these are sparse and are recomputed as needed

Matrix-Vector Products
w = H x
The challenge is to bring the different factors together in order to compute w efficiently.
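Written out with the integrals and coupling coefficients of the preceding slides, the product takes the standard direct-CI form (this expansion is implied by the slides rather than quoted from them):

\[
w_m \;=\; \sum_n H_{mn}\, x_n
     \;=\; \sum_{pq} h_{pq} \sum_n \langle m \lvert \hat{E}_{pq} \rvert n \rangle\, x_n
     \;+\; \tfrac{1}{2} \sum_{pqrs} g_{pqrs} \sum_n \langle m \lvert \hat{e}_{pqrs} \rvert n \rangle\, x_n
\]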

Coupling Coefficient Evaluation
Graphical Unitary Group Approach (GUGA)
Define a directed graph with nodes and arcs: the Shavitt graph
Nodes correspond to spin-coupled states consisting of a subset of the total number of orbitals
Arcs correspond to the (up to) four allowed spin couplings when an orbital is added to the graph
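As a concrete, deliberately tiny illustration of the walk picture, the stand-alone sketch below labels each node by the Paldus values (a, b) reached after k orbitals and counts all walks from the graph tail to the head; that count equals the number of spin-coupled CSFs. The example case (4 electrons in 4 orbitals, singlet) and the program itself are illustrative and not part of COLUMBUS; the four step values d = 0..3 and their increments are the standard GUGA conventions.

! Count walks (CSFs) on a tiny Shavitt graph by brute-force enumeration.
program shavitt_walks
  implicit none
  integer, parameter :: norb = 4, nelec = 4, twos = 0   ! 4 electrons, 4 orbitals, singlet
  integer(8) :: nwalk
  nwalk = count_walks(0, 0, 0)                          ! start at the graph tail
  print '(a,i0)', ' number of CSFs (walks) = ', nwalk   ! 20 for this example
contains
  recursive function count_walks(k, a, b) result(nw)
    integer, intent(in) :: k, a, b
    integer(8) :: nw
    integer :: d, a2, b2
    integer, parameter :: da(0:3) = (/0, 0, 1, 1/)      ! Paldus increments for
    integer, parameter :: db(0:3) = (/0, 1, -1, 0/)     ! step values d = 0,1,2,3
    if (k == norb) then                                 ! reached the graph head
       if (2*a + b == nelec .and. b == twos) then
          nw = 1
       else
          nw = 0
       end if
       return
    end if
    nw = 0
    do d = 0, 3
       a2 = a + da(d); b2 = b + db(d)
       if (b2 < 0) cycle                                ! spin coupling not allowed
       if (2*a2 + b2 > nelec) cycle                     ! too many electrons already
       nw = nw + count_walks(k+1, a2, b2)
    end do
  end function count_walks
end program shavitt_walks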

…Coupling Coefficient Evaluation
[Figure: Shavitt graph, showing the graph head and graph tail, the internal- and external-orbital regions, and the w, x, y, z walk types]

…Coupling Coefficient Evaluation

Integral Types (classified by the number of external-orbital indices; p, q, r, s are internal and a, b, c, d are external orbitals)
0: g_pqrs
1: g_pqra
2: g_pqab, g_pa,qb
3: g_pabc
4: g_abcd

Original Program (1980)
Need to optimize wave functions for N_csf = 10^5 to 10^6
Available memory was typically 10^5 words
Must segment the vectors v and w, partition the matrix H into subblocks, then work with one subblock at a time (as sketched below)
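A minimal sketch of the segmentation idea (illustrative only, not the COLUMBUS code): the vectors v and w are cut into segments, H is treated as a grid of subblocks, and w = H v is accumulated one subblock at a time, so only one subblock and two segments need to be resident in memory at once. The test matrix filled inside the loop is made up purely so the program runs stand-alone; in the real code the subblock would be read from external storage.

program segmented_hv
  implicit none
  integer, parameter :: nseg = 4, seglen = 3
  real(8) :: v(seglen, nseg), w(seglen, nseg), hblock(seglen, seglen)
  integer :: i, j, r, c
  call random_number(v)
  w = 0.0d0
  do i = 1, nseg                     ! segment of w being updated
     do j = 1, nseg                  ! segment of v being read
        do c = 1, seglen             ! fill the (i,j) subblock of H;
           do r = 1, seglen          ! in the real code this comes from disk
              hblock(r,c) = 1.0d0 / dble((i-1)*seglen + r + (j-1)*seglen + c)
           end do
        end do
        w(:,i) = w(:,i) + matmul(hblock, v(:,j))   ! accumulate w_i += H_ij v_j
     end do
  end do
  print '(a,es12.4)', ' |w| = ', sqrt(sum(w*w))
end program segmented_hv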

…First Parallel Program (1990)
Networked workstations using TCGMSG
Each matrix subblock corresponds to a compute task
Different tasks require different resources (pay attention to load balancing)
Same vector segmentation for all g_pqrs types
g_pqrs, w, and v were stored on external shared files (file-contention bottlenecks)

Current Parallel Program
Eliminate shared-file I/O by distributing data across the nodes with the GA (Global Arrays) library
Parallel efficiency depends on the vector segmentation and the corresponding H subblocking
Apply different vector segmentation for different g_pqrs types
Tasks are timed each Davidson iteration, then sorted into decreasing order and reassigned for the next iteration in order to optimize load balancing (see the sketch below)
Manual tuning of the segmentation is required for optimal performance
Capable of optimizing expansions up to N_csf = 10^9
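The reassignment step can be pictured with the small stand-alone sketch below. The task timings are random stand-ins, and the "next-longest task goes to the currently least-loaded process" rule (the classic LPT heuristic) is an assumption for illustration; the slide states that tasks are sorted by measured time and reassigned, but not the exact rule used in PCIUDG.

program task_reassign
  implicit none
  integer, parameter :: ntask = 12, nproc = 4
  real(8) :: t(ntask), load(nproc)
  integer :: owner(ntask), order(ntask)
  integer :: i, j, k, p
  call random_number(t)
  t = 0.1d0 + t                              ! measured task times (seconds)
  order = (/ (i, i = 1, ntask) /)
  do i = 1, ntask-1                          ! sort task indices by decreasing time
     k = i
     do j = i+1, ntask
        if (t(order(j)) > t(order(k))) k = j
     end do
     j = order(i); order(i) = order(k); order(k) = j
  end do
  load = 0.0d0                               ! greedy assignment (LPT heuristic):
  do i = 1, ntask                            ! next-longest task to least-loaded process
     p = minloc(load, dim=1)
     owner(order(i)) = p
     load(p) = load(p) + t(order(i))
  end do
  print '(a,4f8.3)', ' per-process load (s): ', load
end program task_reassign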

COLUMBUS Petaflops Application
Mike Dvorak, Mike Minkoff (MCS Division); Ron Shepard (Chemistry Division), Argonne National Lab

Notes on Software Engineering
PCIUDG parallel code
– Fortran 77/90
– Compiled with Intel/Myrinet on Jazz
70k lines in PCIUDG
– 14 files containing ~205 subroutines
Versioning system
– Currently distributed in a tar file
– Created an LCRC CVS repository for personal code mods

Notes on Software Engineering (cont.)
Homegrown preprocessing system
– Uses "*mdc*if parallel" statements to comment/uncomment parts of the code (example below)
– Could/should be replaced with CPP directives
Global Arrays library
– Provides a global address space for matrix computation
– Used mainly for chemistry codes, but applicable to other applications
– Ran with the most current version --> no performance gain
– Installed in SoftEnv on Jazz (version 3.2.6)
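For illustration, a guarded region in the style the slide describes might look like the fragment below. This is hypothetical code, not taken from PCIUDG: ga_sync is a real Global Arrays call, but the exact *mdc* keywords beyond "*mdc*if parallel" are assumed. The second form shows the CPP-style replacement the slide suggests.

*mdc*if parallel
      call ga_sync()
*mdc*endif

c     roughly equivalent form with standard CPP directives:
#ifdef PARALLEL
      call ga_sync()
#endif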

Gprof Output
270 subroutines called; the loopcalc subroutine uses ~20% of simulation time
Added user-defined MPE states to 50 loopcalc calls
– Challenge due to the large number of subroutines in the file
– 2 GB log-file size is a severe limiter on the number of procs
– Broken logging
Show actual output

Jumpshot/MPE Instrumentation
Live demo of a 20-proc run

Using FPMPI
Relinked code with FPMPI
Tells you the total number of MPI calls made
Output file size is smaller (compared to other tools, e.g., Jumpshot)
Produces a histogram of message sizes
Not installed in SoftEnv on Jazz yet
– ~riley/fpmpi-2.0
Problem for runs
– Double-zeta C2H4 without optimizing the load balance

Total Number of MPI calls

Max/Avg MPI Complete Time

Avg/Max Time MPI Barrier

COLUMBUS Performance Results

COLUMBUS Performance Data
R. Shepard, M. Dvorak, M. Minkoff

Timing of Steps (sec.)
Columns: Basis Set, Integral Time, Orbital Opt. Time, CI Time
QZ   …,221
TZ   …,415
DZ   …,281
(most timing values did not survive in the transcript)

Walks vs. Basis Set (millions)
Columns: Walk Type (Z, Y, X, W), Matrix Dim.
Rows: cc-pVQZ, cc-pVTZ, cc-pVDZ
(numerical entries did not survive in the transcript)

Timing of CI Iteration

Basic Model of Performance
Time = C1 + C2*N + C3/N   (N = number of processes)

Constrained Linear Term C2 > 0
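To make the model concrete, the stand-alone sketch below evaluates Time = C1 + C2*N + C3/N over a range of process counts N; the coefficient values are invented for illustration, since the slides give only the functional form and the C2 > 0 constraint, not fitted numbers. With C2 > 0 the linear term represents overhead that grows with the process count, so the model predicts a best process count at N* = sqrt(C3/C2), beyond which adding processes slows the calculation down.

program perf_model
  implicit none
  real(8), parameter :: c1 = 10.0d0, c2 = 0.05d0, c3 = 2000.0d0  ! assumed values
  integer :: n
  real(8) :: t
  do n = 8, 256, 8
     t = c1 + c2*dble(n) + c3/dble(n)            ! Time = C1 + C2*N + C3/N
     print '(a,i4,a,f10.2)', ' N = ', n, '   predicted time (s) = ', t
  end do
  print '(a,f8.1)', ' model minimum at N* = ', sqrt(c3/c2)
end program perf_model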