
Parallel Multi-Reference Configuration Interaction on JAZZ Ron Shepard (CHM) Mike Minkoff (MCS) Mike Dvorak (MCS)



Presentation transcript:

1 Parallel Multi-Reference Configuration Interaction on JAZZ Ron Shepard (CHM) Mike Minkoff (MCS) Mike Dvorak (MCS)

2 The COLUMBUS Program System: Molecular Electronic Structure
A collection of individual programs that communicate through external files:
1: Atomic-Orbital Integral Generation
2: Orbital Optimization (MCSCF, SCF)
3: Integral Transformation
4: MR-SDCI
5: CI Density
6: Properties (energy gradient, geometry optimization)

3 Real Symmetric Eigenvalue Problem
Use the iterative Davidson method for the lowest (or lowest few) eigenpairs
Direct CI: H is not constructed explicitly; the products w = Hv are formed in "operator" form
Matrix dimensions are 10^4 to 10^9
All floating-point calculations are 64-bit

4 Davidson Method
Generate an initial vector x_1
MAINLOOP: DO n = 1, NITER
  Compute and save w_n = H x_n
  Compute the n-th row and column of G = X^T H X = W^T X
  Compute the subspace Ritz pair: (G - lambda*1) c = 0
  Compute the residual vector r = W c - lambda X c
  Check for convergence using |r|, c, lambda, etc.
  IF (converged) THEN
    EXIT MAINLOOP
  ELSE
    Generate a new expansion vector x_{n+1} from r, lambda, v = Xc, etc.
  ENDIF
ENDDO MAINLOOP
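The loop above can be sketched in NumPy; `matvec` stands in for the direct-CI operator product w = Hv, and the diagonal of H (used for the standard Davidson update) is passed in explicitly. Names and the simple diagonal preconditioner are illustrative assumptions, not COLUMBUS's actual routines.

```python
import numpy as np

def davidson(matvec, diag, n, tol=1e-8, niter=50):
    """Davidson iteration for the lowest eigenpair of a real symmetric H.

    matvec(v) returns H @ v in "operator" form (H is never built);
    diag holds the diagonal of H for the preconditioner.
    """
    X = np.zeros((n, 1))
    X[int(np.argmin(diag)), 0] = 1.0               # initial vector x_1
    W = np.empty((n, 0))
    theta, c = 0.0, np.ones(1)
    for _ in range(niter):
        W = np.column_stack([W, matvec(X[:, -1])])  # save w_n = H x_n
        G = X.T @ W                                 # G = X^T H X = W^T X
        G = 0.5 * (G + G.T)                         # symmetrize round-off
        lam, C = np.linalg.eigh(G)
        theta, c = lam[0], C[:, 0]                  # lowest subspace Ritz pair
        r = W @ c - theta * (X @ c)                 # residual r = Wc - theta*Xc
        if np.linalg.norm(r) < tol:                 # convergence check on |r|
            break
        denom = theta - diag
        denom = np.where(np.abs(denom) < 1e-12, 1e-12, denom)
        d = r / denom                               # Davidson update vector
        d -= X @ (X.T @ d)                          # orthogonalize to subspace
        d /= np.linalg.norm(d)
        X = np.column_stack([X, d])                 # new expansion vector x_{n+1}
    return theta, X @ c
```

In the real code the expansion and product vectors are segmented and distributed (slides 12-14); here they are plain in-memory arrays.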

5 Matrix Elements
H_mn = <m|H|n>
|n> = |phi(r_1)sigma_1 phi(r_2)sigma_2 … phi(r_n)sigma_n| with sigma_j = alpha, beta

6 …Matrix Elements
h_pq and g_pqrs are computed and stored as arrays (with index symmetry)
<m|E_pq|n> and <m|e_pq,rs|n> are coupling coefficients; these are sparse and are recomputed as needed
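Assembled from these quantities, the matrix elements take the standard second-quantized form (reconstructed here from the standard expression, since the slide's symbols did not survive transcription):

```latex
H_{mn} = \sum_{pq} h_{pq}\,\langle m|\hat{E}_{pq}|n\rangle
       + \tfrac{1}{2}\sum_{pqrs} g_{pqrs}\,\langle m|\hat{e}_{pq,rs}|n\rangle
```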

7 Matrix-Vector Products
w = H x
The challenge is to bring the different factors (integrals and coupling coefficients) together so that w is computed efficiently

8 Coupling Coefficient Evaluation
Graphical Unitary Group Approach (GUGA)
Define a directed graph with nodes and arcs: the Shavitt graph
Nodes correspond to spin-coupled states built from a subset of the total number of orbitals
Arcs correspond to the (up to) four allowed spin couplings when an orbital is added to the graph

9 …Coupling Coefficient Evaluation
[Shavitt graph figure: graph head to graph tail, showing internal and external orbitals; external-orbital walks labeled w, x, y, z]

10 …Coupling Coefficient Evaluation

11 Integral Types
0: g_pqrs
1: g_pqra
2: g_pqab, g_pa,qb
3: g_pabc
4: g_abcd
(p, q, r, s: internal orbitals; a, b, c, d: external orbitals)

12 Original Program (1980)
Need to optimize wave functions for N_csf = 10^5 to 10^6
Available memory was typically 10^5 words
Must segment the vectors v and w, and partition the matrix H into subblocks, then work with one subblock at a time
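The segmented product can be sketched as follows; `get_block` is a hypothetical stand-in that produces one H subblock at a time (in the real code, subblocks are formed on the fly from integrals and coupling coefficients, not sliced from a stored matrix):

```python
import numpy as np

def segmented_matvec(get_block, v, nseg):
    """w = H v with v and w split into segments and H into subblocks,
    touching only one subblock (and two vector segments) at a time."""
    segs = np.array_split(np.arange(v.size), nseg)  # index ranges of the segments
    w = np.zeros_like(v)
    for rows in segs:                 # segment of w being accumulated
        for cols in segs:             # segment of v being consumed
            w[rows] += get_block(rows, cols) @ v[cols]  # one subblock in memory
    return w
```

With `get_block = lambda r, c: H[np.ix_(r, c)]` this reproduces `H @ v` exactly, which is a convenient way to check a segmentation scheme.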

13 …First Parallel Program (1990)
Networked workstations using TCGMSG
Each matrix subblock corresponds to a compute task
Different tasks require different resources (pay attention to load balancing)
Same vector segmentation for all g_pqrs types
g_pqrs, w, and v were stored in external shared files (file-contention bottlenecks)

14 Current Parallel Program
Eliminate shared-file I/O by distributing data across the nodes with the GA (Global Arrays) library
Parallel efficiency depends on the vector segmentation and the corresponding H subblocking
Apply different vector segmentations for different g_pqrs types
Tasks are timed each Davidson iteration, then sorted into decreasing order and reassigned for the next iteration to optimize the load balance
Manual tuning of the segmentation is required for optimal performance
Capable of optimizing expansions up to N_csf = 10^9
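One common way to realize the sort-and-reassign step is a longest-processing-time greedy assignment; this sketch (illustrative names, not the actual COLUMBUS scheduler) sorts the previous iteration's task timings into decreasing order and hands each task to the currently least-loaded worker:

```python
import heapq

def reassign_tasks(times, nworkers):
    """Greedy LPT assignment: tasks sorted by measured time, largest first,
    each placed on the worker with the smallest accumulated load so far."""
    order = sorted(range(len(times)), key=lambda t: times[t], reverse=True)
    heap = [(0.0, w) for w in range(nworkers)]   # (load, worker id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(nworkers)]
    for t in order:
        load, w = heapq.heappop(heap)            # least-loaded worker
        assignment[w].append(t)
        heapq.heappush(heap, (load + times[t], w))
    return assignment
```

Because the same subblocks recur every Davidson iteration, last iteration's timings are a good predictor for the next, which is what makes this reassignment effective.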

15 COLUMBUS Petaflops Application
Mike Dvorak, Mike Minkoff (MCS Division)
Ron Shepard (Chemistry Division)
Argonne National Lab

16 Notes on Software Engineering
PCIUDG parallel code
– Fortran 77/90
– Compiled with Intel/Myrinet on Jazz
70k lines in PCIUDG
– 14 files containing ~205 subroutines
Versioning system
– Currently distributed in a tar file
– Created an LCRC CVS repository for personal code mods

17 Notes on Software Engineering (cont)
Homegrown preprocessing system
– Uses "*mdc*if parallel" statements to comment/uncomment parts of the code
– Could/should be replaced with CPP directives
Global Arrays library
– Provides a global address space for matrix computation
– Used mainly for chemistry codes but applicable to other applications
– Ran with the most current version --> no performance gain
– Installed in SoftEnv on Jazz (version 3.2.6)
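The effect of the "*mdc*if parallel" conditionals can be illustrated with a toy preprocessor; the directive grammar below is an assumption inferred from the slide, not the real tool's full syntax:

```python
def mdc_preprocess(lines, keys):
    """Toy '*mdc*' conditional pass: code between '*mdc*if KEY' and
    '*mdc*endif' is kept when KEY is in keys, otherwise each line is
    turned into a Fortran comment."""
    out, active, stack = [], True, []
    for line in lines:
        s = line.strip().lower()
        if s.startswith('*mdc*if'):
            stack.append(active)                 # support nesting
            active = active and s.split()[-1] in keys
        elif s.startswith('*mdc*endif'):
            active = stack.pop()
        else:
            out.append(line if active else 'c ' + line)  # 'c' = F77 comment
    return out
```

A CPP replacement would express the same blocks as `#ifdef PARALLEL` / `#endif`, which is what the slide is suggesting.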

18 Gprof Output
270 subroutines called; the loopcalc subroutine uses ~20% of simulation time
Added user-defined MPE states to 50 loopcalc calls
– A challenge due to the large number of subroutines in the file
– The 2 GB file-size limit severely restricts the number of procs
– Broken logging (show actual output)

19 Jumpshot/MPE Instrumentation
Live demo of a 20-proc run

20 Using FPMPI
Relinked the code with FPMPI
Reports the total number of MPI calls made
Output file size is smaller (compared to other tools, e.g., Jumpshot)
Produces a histogram of message sizes
Not installed in SoftEnv on Jazz yet
– ~riley/fpmpi-2.0
Problem used for the runs
– Double-zeta C2H4 without optimizing the load balance

21 Total Number of MPI calls

22 Max/Avg MPI Complete Time

23 Avg/Max Time MPI Barrier

24 COLUMBUS Performance Results

25 COLUMBUS Performance Data R. Shepard, M. Dvorak, M. Minkoff

26 Timing of Steps (sec.)
Basis Set | Integral Time | Orbital Opt. Time | CI Time
QZ        | 388           | 11,806            | 382,221
TZ        | 26            | 104               | 31,415
DZ        | 1             | 34                | 3,281

27 Walks vs. Basis Set (millions)
Basis Set | Z    | Y  | X   | W   | Matrix Dim.
cc-pVQZ   | 0.08 | 15 | 536 | 305 | 858
cc-pVTZ   | 0.08 | 7  | 120 | 69  | 198
cc-pVDZ   | 0.08 | 2  | 13  | 8   | 24

28 Timing of CI Iteration

29 Basic Model of Performance Time = C1+C2*N+C3/N

30 Constrained Linear Term C2 > 0
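A minimal way to fit Time = C1 + C2*N + C3/N subject to C2 >= 0 (a sketch; the slides do not say how the fit was actually performed) is an unconstrained least squares followed by a refit with C2 pinned to zero when the constraint is violated, which is exact for a single sign constraint:

```python
import numpy as np

def fit_time_model(N, T):
    """Least-squares fit of T ~ C1 + C2*N + C3/N with the constraint C2 >= 0."""
    N = np.asarray(N, dtype=float)
    T = np.asarray(T, dtype=float)
    A = np.column_stack([np.ones_like(N), N, 1.0 / N])  # columns: 1, N, 1/N
    c, *_ = np.linalg.lstsq(A, T, rcond=None)
    if c[1] < 0.0:                   # constraint active: pin C2 = 0 and refit
        sol, *_ = np.linalg.lstsq(A[:, [0, 2]], T, rcond=None)
        c = np.array([sol[0], 0.0, sol[1]])
    return c                         # [C1, C2, C3]
```

The C3/N term captures the work that parallelizes, the C2*N term the overhead that grows with scale, and C1 the fixed cost; the constraint keeps the fitted overhead term physically meaningful.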


