Lattice QCD in Nuclear Physics, Robert Edwards, Jefferson Lab, NERSC 2011

Lattice QCD in Nuclear Physics
Robert Edwards, Jefferson Lab, NERSC 2011
Report: Robert Edwards, Martin Savage & Chip Watson

Current HPC Methods
Algorithms:
– Gauge generation
– Analysis phase
Codes:
– USQCD SciDAC codes
– Heavily used at NERSC: QDP++ & Chroma
Quantities that affect the scale of the simulations:
– Lattice size, lattice spacing & pion mass

Gauge generation
– Hybrid Monte Carlo (HMC)
– Hamiltonian integrator: 1st-order coupled PDEs
– Large, sparse matrix solve per step
– "Configurations" produced via importance sampling, using the Metropolis method (toy sketch below)
– Produce ~1000 useful configurations in a dataset
– Cost controlled by lattice size & spacing, quark mass
– Requires capability resources
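To make the trajectory-plus-accept/reject structure concrete, here is a minimal, self-contained toy sketch: a single degree of freedom with a Gaussian action, no gauge fields and no sparse solve. Everything in it is illustrative; a production code evolves gauge fields and performs a large sparse Dirac solve inside the force computation.

```cpp
// Toy HMC sketch: one degree of freedom with action S(q) = q^2/2.
// Shows the leapfrog molecular-dynamics trajectory + Metropolis step.
#include <cmath>
#include <iostream>
#include <random>

double action(double q) { return 0.5 * q * q; }
double force(double q)  { return -q; }          // -dS/dq

int main() {
    std::mt19937 rng(12345);
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    double q = 0.0;                              // the "configuration"
    const int n_traj = 1000, n_md = 20;
    const double dt = 0.1;
    int accepted = 0;

    for (int traj = 0; traj < n_traj; ++traj) {
        double p = gauss(rng);                   // refresh momenta
        double q_new = q, p_new = p;
        double h_old = 0.5 * p * p + action(q);

        // Leapfrog integration of the MD equations of motion
        p_new += 0.5 * dt * force(q_new);
        for (int s = 0; s < n_md; ++s) {
            q_new += dt * p_new;
            if (s < n_md - 1) p_new += dt * force(q_new);
        }
        p_new += 0.5 * dt * force(q_new);

        double h_new = 0.5 * p_new * p_new + action(q_new);

        // Metropolis accept/reject keeps the correct distribution
        if (unif(rng) < std::exp(h_old - h_new)) { q = q_new; ++accepted; }
    }
    std::cout << "acceptance = " << double(accepted) / n_traj << "\n";
}
```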

Analysis
Compute observables via averages over configurations.
For one configuration:
– Solve the Dirac equation repeatedly: D(A,m) ψ = χ
– Large, sparse matrix solve; use iterative methods (minimal CG sketch below)
– Either hold solutions in memory, or page to disk
– Tie solutions together to make "observables"
Can be 1000's of measurements for each configuration.
Repeat for 1000's of configurations.
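As an illustration of the iterative sparse-solver pattern, here is a minimal conjugate-gradient sketch on a toy tridiagonal operator. In production the operator is the lattice Dirac matrix (e.g. applied as D†D) acting matrix-free on distributed fermion fields, and the real codes use more sophisticated solvers; all names here are illustrative.

```cpp
// Minimal conjugate gradient for A x = b, A symmetric positive definite.
#include <cmath>
#include <functional>
#include <iostream>
#include <vector>

using Vec = std::vector<double>;
using ApplyOp = std::function<void(const Vec&, Vec&)>;   // y = A x

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

int cg_solve(const ApplyOp& A, const Vec& b, Vec& x, double tol, int max_iter) {
    const size_t n = b.size();
    Vec r(n), p(n), Ap(n);
    A(x, Ap);
    for (size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
    p = r;
    double rr = dot(r, r);
    for (int k = 0; k < max_iter; ++k) {
        if (std::sqrt(rr) < tol) return k;
        A(p, Ap);
        double alpha = rr / dot(p, Ap);
        for (size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;
        for (size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return max_iter;
}

int main() {
    // Toy operator: 1D Laplacian-like tridiagonal matrix (stand-in for D^dag D).
    const size_t n = 64;
    ApplyOp A = [n](const Vec& x, Vec& y) {
        for (size_t i = 0; i < n; ++i) {
            y[i] = 2.0 * x[i];
            if (i > 0)     y[i] -= x[i - 1];
            if (i + 1 < n) y[i] -= x[i + 1];
        }
    };
    Vec b(n, 1.0), x(n, 0.0);
    std::cout << "converged in " << cg_solve(A, b, x, 1e-10, 1000)
              << " iterations\n";
}
```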

Codes
SciDAC: developed a new code base, written in C++.
QDP++:
– Data-parallel interface, well suited for QCD (called Level 2)
– Hides architectural details
– Supports threading & comms (hybrid threads/MPI model); see the sketch below
– Thread package: customized, can outperform OpenMP
– Parallel file I/O support
Chroma:
– Built over QDP++
– High degree of code modularity
– Supports gauge generation & a task-based measurement system
Modern CS/software-engineering techniques.
2010: Chroma accounted for 1/3 of NERSC cycles.
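A generic sketch of the hybrid threads+MPI model mentioned above: each MPI rank owns a sub-lattice, threads work over the local sites, and communication goes through MPI. This uses plain OpenMP and MPI rather than QDP++'s customized thread package or its actual interface; all sizes and names are illustrative.

```cpp
// Schematic hybrid MPI + OpenMP pattern: ranks own sub-volumes,
// threads loop over local sites.  Not QDP++'s actual API.
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int local_sites = 1 << 20;             // sites owned by this rank
    std::vector<double> field(local_sites, 1.0);

    // Threaded, data-parallel work over the local sub-lattice
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < local_sites; ++i) {
        field[i] *= 2.0;                         // stand-in for site-local compute
        local_sum += field[i];
    }

    // Communication step (a global reduction here; a real code also does
    // nearest-neighbor halo exchanges)
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("ranks=%d threads=%d global_sum=%g\n",
                    size, omp_get_max_threads(), global_sum);

    MPI_Finalize();
}
```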

User base and support
Aware of many groups using SciDAC codes:
– LHPC, NPLQCD, QCDSF, UKQCD, FNAL, etc.
– Several external applications use Level 2 only
Used extensively at sites such as LLNL BG/L, clusters, GPUs, NERSC/UT/OLCF/UK Crays, ANL BG/P, TACC, QPACE (Cell).
Software web pages and documentation.
Codes available via tarballs and Git.
Nightly builds and regression checks.
184 citations to the Chroma/QDP++ paper (Lattice 2004, Edwards & Joo).

Current HPC Requirements
Architectures currently used: Cray XT4 & XT5 (variants), XE6, BG/L, BG/P, Linux clusters, GPUs.
Wallclock time: ~12 hours.
Compute/memory load:
– Gauge generation: 40k cores (XT5), 1 GB/core
– Analysis phase: 20k cores (NERSC), 2 GB/core
Data read/written:
– Gauge generation: read/write at most 10 GB
– Analysis phase: (a) read 10 GB, write ~1 TB temporary, write 100 GB permanent; (b) read 10 GB, write 100 GB permanent
Necessary software, services or infrastructure:
– Some analyses use LAPACK
– Analysis: global file systems useful for short-term paging

Current HPC Requirements
Hours allocated/used in 2010:
– USQCD = 58M (J/psi) + 19M BG/P + 2M (GPU) core-hours
– NSF TeraGrid = 95M SUs
– Conversion factors: XT4 = XT5 = 0.5 J/psi = 1.0 BG/P = SU
– Total = 303M MPP core-hours + 2M GPU-hours
NERSC hours used over the last year: 16M (charged) = 73M (used) MPP core-hours.

Current HPC Requirements
Known limitations/obstacles/bottlenecks:
– Performance on Cray variants & BG/P: inverter strong scaling
– On Cray XT4/5, hybrid MPI/threading is essential

Current HPC Requirements
Known limitations/obstacles/bottlenecks:
– Crays: loaded system -> comms interference -> performance degradation
[Scaling plots: BG/P, Crays (loaded), Cray (dedicated)]

Current HPC Requirements
NERSC Cray XE6: weak scaling improved; hybrid threading less essential.

HPC Usage and methods for the next 3-5 years
Gauge generation:
– Cost set by: reasonable statistics, box size and "physical" pion mass
– First milestone: 1 ensemble + analysis; second milestone: two ensembles + analysis
Cost (PF-years):
– State of the art: ~10 TF-yr
– First milestone (2014): ~0.5 PF-yr, ~4 B core-hrs
– Second milestone: >5 PF-yr, ~50 B core-hrs

Computational Requirements
Gauge generation : Analysis ratios for current calculations:
– Weak matrix elements: 1 : 1
– Baryon spectroscopy: 1 : 10
– Nuclear structure: 1 : 4
Overall, Gauge Generation : Analysis has shifted from 10 : 1 (2005) to 1 : 4 (2011).
Core work: Dirac inverters; use GPUs.

GPUs boost capacity resources
2010 total hours consumed (capability + capacity):
– 303M MPP core-hours + 2M GPU-hours
USQCD (JLab):
– Deployed 2010: 100 TF sustained -> ~870 M core-hrs
– 1 GPU ~ 100 cores
– Ramping up: 2M GPU-hrs ~ 200M core-hrs
Capacity is now increasing much faster than capability.

HPC Usage and methods for the next 3-5 years
Upcoming changes to codes/methods/approaches; new algorithms & software infrastructure are crucial:
– Extreme scaling: efficiency within steep memory hierarchies
– Domain decomposition techniques: new algorithms needed
– Fault tolerance: more sophisticated workflows
– Parallel I/O: staging/paging
– Moving data for capacity calculations
Need SciDAC-3 and partnerships.

HPC Usage and methods for the next 3-5 years
Need SciDAC-3 and partnerships. Current/future activities:
– USQCD researchers: co-designers of BG/Q, improved cache usage
– Collaboration with Intel Parallel Computing Labs: improving codes
– NSF PRAC Blue Waters development
– Development of the QUDA GPU software codes: hugely successful
– Exploiting domain decomposition techniques:
  – Push into integrators
  – Multigrid-based inverters (with TOPS)
– Improving physics-level algorithms/software:
  – Measurement methods for spectroscopy, hadron & nuclear structure
– Developing for Exascale

HPC Usage and methods for the next 3-5 years
Requirements/year by 2014: 100 TF-yr = 876 M core-hrs at 1 GF/core (see the check below).
Wallclock time: still 12 to 24 hours.
Compute/memory load:
– Gauge generation: typical problem 48³ x 384 (up from 32³ x 256)
  – Many-core: 100k cores, 1 GB/core
  – Heterogeneous: 1000's of GPUs, e.g., 2592*4 within XK6
– Analysis phase:
  – Many-core: 20k-40k cores, 2 GB/core
  – Heterogeneous: low 100's of GPUs
Data read/written:
– Gauge generation: read/write 50 GB
– Analysis phase: read 50 GB, write temporary, write 100 GB permanent
– Checkpoint size: 100 GB
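A quick arithmetic check of the 876 M core-hour figure, assuming the stated sustained rate of 1 GF/core:

```latex
\frac{100\ \text{TF}}{1\ \text{GF/core}} \times 8760\ \frac{\text{hr}}{\text{yr}}
  = 10^{5}\ \text{cores} \times 8760\ \text{hr}
  \approx 8.76 \times 10^{8}\ \text{core-hrs}
  = 876\ \text{M core-hrs}
```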

HPC Usage and methods for the next 3-5 years
On-line storage: 1 TB and 20 files.
Off-line storage: 100 TB and ~2000 files.
Necessary software, services or infrastructure:
– LAPACK support & GPU library support
– Possibly/probably heterogeneous-system compiler support
– Analysis: global file systems required for short-term paging
– Require parallel disk read/write for performance
– Off-site network: no real-time comms, but requirements are growing
Assumes capability & capacity resources are co-located.

HPC Usage and methods for the next 3-5 years
Anticipated limitations/obstacles/bottlenecks on a 10K-1000K PE system:
– Generically, the limitation is usually the comms/compute ratio
– Memory bandwidth is obviously an issue, but so is latency

Strategy for new architectures
Heterogeneous systems (GPUs)?
– Steep hierarchical memory subsystems
– Suggests a domain decomposition strategy
– Inverter: simple block-Jacobi, half-precision + GCR (GPUs, LLNL); mixed-precision sketch below
– Comparison: JaguarPF (XT5), ~16K cores
[Plots: total performance and time to solution; ~20 TF]
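A minimal sketch of the mixed-precision idea behind this inverter: an outer defect-correction loop keeps the residual in double precision while the inner approximate solve runs in reduced precision. Plain Jacobi on a toy 1D operator stands in for the block-Jacobi preconditioned GCR running in half precision on the GPUs; this is not the Chroma/QUDA implementation, and all names are hypothetical.

```cpp
// Mixed-precision defect correction: outer residual in double, inner
// approximate solve in single precision (toy stand-in for block-Jacobi + GCR).
#include <cmath>
#include <cstdio>
#include <vector>

using VecD = std::vector<double>;
using VecF = std::vector<float>;

// Toy SPD operator (diagonally dominant 1D stencil), templated on precision.
template <class Vec>
void apply_A(const Vec& x, Vec& y) {
    const size_t n = x.size();
    for (size_t i = 0; i < n; ++i) {
        y[i] = 3 * x[i];
        if (i > 0)     y[i] -= x[i - 1];
        if (i + 1 < n) y[i] -= x[i + 1];
    }
}

// Inner solver: plain Jacobi sweeps in single precision.
void inner_solve_float(const VecF& r, VecF& e, int sweeps) {
    const size_t n = r.size();
    e.assign(n, 0.0f);
    VecF Ae(n);
    for (int s = 0; s < sweeps; ++s) {
        apply_A(e, Ae);
        for (size_t i = 0; i < n; ++i)
            e[i] += (r[i] - Ae[i]) / 3.0f;      // Jacobi update, diagonal = 3
    }
}

int main() {
    const size_t n = 128;
    VecD b(n, 1.0), x(n, 0.0), r(n), Ax(n);
    VecF rf(n), ef(n);
    const double bnorm = std::sqrt(static_cast<double>(n));

    for (int outer = 0; outer < 20; ++outer) {
        apply_A(x, Ax);                                   // residual in double
        double rnorm = 0.0;
        for (size_t i = 0; i < n; ++i) { r[i] = b[i] - Ax[i]; rnorm += r[i] * r[i]; }
        rnorm = std::sqrt(rnorm);
        std::printf("outer %2d  |r|/|b| = %.3e\n", outer, rnorm / bnorm);
        if (rnorm / bnorm < 1e-10) break;

        for (size_t i = 0; i < n; ++i) rf[i] = static_cast<float>(r[i]);
        inner_solve_float(rf, ef, 60);                    // cheap low-precision solve
        for (size_t i = 0; i < n; ++i) x[i] += ef[i];     // accumulate in double
    }
}
```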

Strategy for new architectures
Heterogeneous systems (GPUs)? A possible path to exascale:
– At 20 TF on 256 GPUs (128 boxes): comparable to leadership systems
– BUT: only the inverter (& half precision)
– Great for capacity. Can we generalize?
Capability computing:
– Push domain decomposition into the integrators
– Amdahl's law: formally small bits of code
Under SciDAC-3:
– Domain-specific extensions to the data-parallel layer (QDP++)
– Portable/efficient on heterogeneous hardware; higher-level code will "port"
– Seek collaboration with other institutes

Strategy for new architectures
Many-core with 10-100 cores/node?
– LQCD deeply involved: Columbia/BNL/Edinburgh are co-designers of BG/Q
– Low cost of synchronizations
– Hybrid/threading model: already used
– Synchronizations not cheap and/or steep memory hierarchy?
  – Domain decomposition techniques
Under SciDAC-3:
– Extend aggressive cache optimizations further into the data-parallel layer

Summary
Nuclear Physics LQCD: significant efforts underway developing for Exascale.
Recommendations for NERSC:
– Deploy 10's of PF capability resources
– Deploy larger capacity resources

Backup/other