UNEDF 2011 ANNUAL/FINAL MEETING
Progress report on the BIGSTICK configuration-interaction code
Calvin Johnson (1), Erich Ormand (2), Plamen Krastev (1,2,3)
(1) San Diego State University, (2) Lawrence Livermore National Laboratory, (3) Harvard University
Supported by DOE Grants DE-FG02-96ER40985, DE-FC02-09ER41587, and DE-AC52-07NA27344

We have good news and bad news, and they are the same thing: the postdoc (Plamen Krastev) got a permanent staff position in scientific computing at Harvard.

BIGSTICK:  General purpose M-scheme configuration interaction (CI) code  On-the-fly calculation of the many-body Hamiltonian  Fortran 90, MPI and OpenMP  35,000+ lines in 30+ files and 200+ subroutines  Faster set-up  Faster Hamiltonian application  Rewritten for “easy” parallelization  New parallelization scheme REDSTICKBIGSTICK 2

BIGSTICK:  Flexible truncation scheme: handles ‘no core’ ab initio Nhw truncation, valence-shell (sd & pf shell) orbital truncation; np-nh truncations; and more.  Applied to ab initio calculations, valence shell calculations (in particular level densities, random interaction studies, and benchmarking projected HF), cold atoms, and electronic structure of atoms (benchmarking RPA and HF for atoms). REDSTICKBIGSTICK 2 Version 6.5 is available at NERSC: unedf/lcci/BIGSTICK/v650/

BIGSTICK uses a factorization algorithm that reduces storage of the Hamiltonian arrays.

Comparison of nonzero matrix storage with factorization:

Nuclide | Space     | Basis dim | Matrix store | Factorization
56Fe    | pf        | 501 M     | 290 Gb       | 0.72 Gb
7Li     | Nmax = 12 | 252 M     | 3600 Gb      | 96 Gb
7Li     | Nmax = …  | … M       | 23 Tb        | 624 Gb
12C     | Nmax = 6  | 32 M      | 196 Gb       | 3.3 Gb
12C     | Nmax = 8  | 590 M     | 5000 Gb      | 65 Gb
12C     | Nmax = 10 | 7800 M    | 111 Tb       | 1.4 Tb
16O     | Nmax = 6  | 26 M      | 142 Gb       | 3.0 Gb
16O     | Nmax = 8  | 990 M     | 9700 Gb      | 130 Gb
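The origin of the savings can be stated heuristically (a schematic counting argument, not a quotation from the slides): because the basis factorizes into proton and neutron Slater determinants, nonzero matrix elements are generated as products of separately stored proton and neutron "jumps", while only the jumps themselves need to be kept:

\[
N_{\mathrm{nonzero}} \;\sim\; N^{\mathrm{jumps}}_{p} \times N^{\mathrm{jumps}}_{n},
\qquad
N^{\mathrm{factorized}}_{\mathrm{stored}} \;\sim\; N^{\mathrm{jumps}}_{p} + N^{\mathrm{jumps}}_{n},
\]

so the compression is roughly the product over the sum, consistent with the large reduction factors in the table above.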

BIGSTICK applications (figure slide): Micah Schuster, Physics MS project

BIGSTICK applications (figure slide): Joshua Staker, Physics MS project


Major accomplishment as of last year: excellent scaling of the mat-vec multiply. This demonstrates that our factorization algorithm, as predicted, facilitates efficient distribution of the mat-vec operations.

Major accomplishments after last UNEDF meeting:
• Rebalanced workload with an additional constraint on the dimension of local Lanczos vectors (Krastev)
• Fully distributed Lanczos vectors with hermiticity on (Krastev)
• Major steps toward distributing Lanczos vectors with suppressed hermiticity (Krastev)
• OpenMP implementations in the matrix-vector multiply (Ormand & Johnson)
• Significant progress in the 3-body implementation (Johnson & Ormand)
• Added restart option (Johnson)
• Implemented in-lined 1-body density matrices (Johnson)
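For reference, the one-body density matrices mentioned in the last item are, schematically (uncoupled form; the code's actual output conventions, typically angular-momentum coupled, may differ):

\[
\rho^{fi}_{ab} \;=\; \langle \Psi_f \,|\, \hat{c}^{\dagger}_{a}\, \hat{c}_{b} \,|\, \Psi_i \rangle ,
\]

computed "in-lined", i.e. with the same on-the-fly machinery used for the Hamiltonian rather than by storing an explicit operator matrix.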

Highlighting accomplishments:
• Add OpenMP
• Reduce memory load per node
  -- Lanczos vectors
  -- matrix information (matrix elements / jumps)
• Speed up reorthogonalization
  -- I/O is the bottleneck

Highlighting accomplishments: Add OpenMP
  -- Crude 1st generation by Johnson (about 70-80% efficiency)
  -- 2nd generation by Ormand (nearly 100% efficiency)
Hybrid OpenMP+MPI implemented; full testing delayed due to reorthogonalization issues.
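A minimal sketch (not the production code) of how OpenMP threading can be laid over the factorized multiply: here the parallel loop runs over the neutron index, so each thread writes a disjoint slice of the output vector and no atomic updates are needed. The array names, sizes, and jump data are illustrative assumptions.

   ! Schematic OpenMP threading of the factorized mat-vec.
   ! Threading over the neutron index gives each thread a disjoint
   ! set of output addresses, avoiding write conflicts.
   program omp_matvec_toy
      use omp_lib
      implicit none
      integer, parameter :: ndp = 1000, ndn = 800, njumps = 5000
      integer :: ip_i(njumps), ip_f(njumps)
      real(8) :: amp(njumps)
      real(8), allocatable :: vin(:), vout(:)
      integer :: j, in, istate, fstate

      allocate(vin(ndp*ndn), vout(ndp*ndn))
      call random_number(amp)
      ip_i = 1 + mod( (/ (j,   j=1,njumps) /), ndp )
      ip_f = 1 + mod( (/ (j*7, j=1,njumps) /), ndp )
      vin  = 1.0d0
      vout = 0.0d0

      !$omp parallel do private(j, istate, fstate)
      do in = 1, ndn
         do j = 1, njumps
            istate = (in-1)*ndp + ip_i(j)
            fstate = (in-1)*ndp + ip_f(j)
            vout(fstate) = vout(fstate) + amp(j)*vin(istate)
         end do
      end do
      !$omp end parallel do

      print *, 'sum(vout) =', sum(vout), ' threads =', omp_get_max_threads()
   end program omp_matvec_toy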

Highlighting accomplishments: Reduce memory load per node
  -- Lanczos vectors
  -- matrix information (matrix elements / jumps)
We break up the Lanczos vectors so that only a part of each vector resides on each node.
Future: separate forward/backward multiplication.

Lanczos vector distribution, hermiticity on (figure: Vin and Vout split into proton and neutron sectors):
• Forward and backward application of H on the same node.
• Each compute node needs at a minimum TWO sectors of the initial and TWO sectors of the final Lanczos vector.

Lanczos vector distribution, hermiticity off (suppressed):
• Forward application of H on one node and backward application of H on another node.
• Each compute node needs ONE sector of the initial and ONE sector of the final Lanczos vector.
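A schematic of what "forward" and "backward" application mean here (a toy dense block, not the actual jump machinery): a stored block Hb couples sector i of the input vector to sector f of the output; with hermiticity on, the same node also applies the transpose and so must hold both sectors of both vectors, whereas with hermiticity suppressed the transpose work goes to a different node. Sizes and names below are assumptions.

   ! Toy illustration of forward vs backward application of a Hamiltonian block.
   ! Hb couples input sector i (size ni) to output sector f (size nf).
   program fwd_bwd_toy
      implicit none
      integer, parameter :: ni = 3, nf = 4
      real(8) :: Hb(nf,ni)
      real(8) :: vin_i(ni), vin_f(nf), vout_i(ni), vout_f(nf)

      call random_number(Hb)
      vin_i  = 1.0d0;  vin_f  = 2.0d0
      vout_i = 0.0d0;  vout_f = 0.0d0

      ! Forward application:  vout_f += Hb * vin_i
      vout_f = vout_f + matmul(Hb, vin_i)

      ! Backward application (uses hermiticity, here just the transpose):
      !   vout_i += Hb^T * vin_f
      ! With "hermiticity on" this happens on the SAME node (needs vin_i, vin_f,
      ! vout_i, vout_f); with "hermiticity off" another node owning only
      ! (vin_f, vout_i) would do it, halving the vector memory per node.
      vout_i = vout_i + matmul(transpose(Hb), vin_f)

      print *, 'vout_f =', vout_f
      print *, 'vout_i =', vout_i
   end program fwd_bwd_toy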

Comparison of memory requirements for distributing Lanczos vectors:

Nuclide | Space     | Basis dim | Store  | Hermiticity ON | Hermiticity OFF
12C     | Nmax = …  | … M       | 117 GB | 8.44 GB        | 4.39 GB
60Zn    | pf        | 2300 M    | 34 GB  | 8.65 GB        | 4.45 GB

Store = memory required to hold 2 Lanczos vectors (double precision) on a single node.

The distribution scheme with suppressed hermiticity is the most memory-efficient; this is our scheme of choice.
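The "Store" column is simply the raw size of two full double-precision vectors; as a check for 60Zn in the pf shell (assuming the quoted GB are binary gigabytes):

\[
2 \times 2300\times 10^{6} \times 8\ \mathrm{bytes} \;=\; 3.68\times 10^{10}\ \mathrm{bytes} \;\approx\; 34\ \mathrm{GB},
\]

which the distributed schemes reduce to a few GB per node.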


Highlighting accomplishments: Speed up reorthogonalization (I/O is the bottleneck)
We (i.e., PK) spent time trying to make MPI-IO efficient for our needs via striping, etc. Analysis by Rebecca Hartman-Baker (ORNL) suggests our I/O is still running sequentially rather than in parallel. We will now store all Lanczos vectors in memory, a la MFDn (which makes restarting an interrupted run more difficult).
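For orientation, a minimal sketch of the reorthogonalization step this refers to, with all previous Lanczos vectors held in distributed memory rather than read back from disk. Array names, sizes, and the data layout (each rank owning a contiguous slice of every vector) are assumptions, not BIGSTICK's internals.

   ! Sketch: reorthogonalize the new Lanczos vector w against stored
   ! Lanczos vectors kept in RAM and distributed over MPI ranks
   ! (each rank owns a slice of length nloc of every vector).
   program reorth_sketch
      use mpi
      implicit none
      integer, parameter :: nloc = 100000, nkeep = 50
      real(8), allocatable :: lvec(:,:), w(:)
      real(8) :: dloc(nkeep), dglob(nkeep), norm2loc, norm2
      integer :: ierr, k

      call MPI_Init(ierr)
      allocate(lvec(nloc,nkeep), w(nloc))
      call random_number(lvec)          ! stand-in for previously stored Lanczos vectors
      call random_number(w)             ! stand-in for the newly generated vector

      ! local pieces of the dot products <v_k | w>
      do k = 1, nkeep
         dloc(k) = dot_product(lvec(:,k), w)
      end do
      ! one collective gives the full dot products; no file I/O involved
      call MPI_Allreduce(dloc, dglob, nkeep, MPI_DOUBLE_PRECISION, MPI_SUM, &
                         MPI_COMM_WORLD, ierr)

      ! subtract the projections, then renormalize
      do k = 1, nkeep
         w = w - dglob(k)*lvec(:,k)
      end do
      norm2loc = dot_product(w, w)
      call MPI_Allreduce(norm2loc, norm2, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                         MPI_COMM_WORLD, ierr)
      w = w / sqrt(norm2)

      call MPI_Finalize(ierr)
   end program reorth_sketch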

Next steps for the remainder of the project period:
• Store Lanczos vectors in RAM (end of summer)
• Write paper on the factorization algorithm (drafted; finish by 9/2011)
• Fully implement the MPI/OpenMP hybrid code (11/2011)
• Write up paper for publication of the code (early 2012)

UNEDF Deliverables for BIGSTICK
• The LCCI project will deliver final UNEDF versions of the LCCI codes, scripts, and test cases.
  -- Status: current version (6.5) at NERSC; expect final version by end of year; plans to publish in CPC or a similar venue.
• Improve the scalability of the BIGSTICK CI code up to 50,000 cores.
  -- Status: the main barrier was reorthogonalization; now putting Lanczos vectors in memory to minimize I/O.
• Use the BIGSTICK code to investigate isospin breaking in the pf shell.
  -- Status: delayed due to a problem with I/O hardware on Sierra.

SciDAC-3 possible deliverables for BIGSTICK (end of SciDAC-2: 3-body forces on 100,000 cores)
• Run with 3-body forces on up to 1,000,000 cores on Sequoia, Nmax = 10/12 for 12,14C
• Add in 4-body forces; investigate alpha clustering with effective 4-body forces (via SRG or Lee-Suzuki)
• Currently interfaces with Navratil's TRDENS to generate densities, spectroscopic factors, etc., needed for RGM reaction calculations; will improve this: develop fast post-processing with factorization
• Investigate general unitary-transform effective interactions, adding constraints on observables

Sample application: cold atomic gases at unitarity in a harmonic trap
• Using only one generator (d/dr), very much like UCOM
• Fit to A = 3: 1-, 0+; A = 4: 0+, 1+, 2+
• Starting rms = 2.32; final rms = 0.58

Cross-fertilization within the LCCI project:
• BIGSTICK and MFDn: on-the-fly construction of basis states and matrix elements; reorthogonalization and Lanczos vector management
• NuShellX: J-projected basis