Lattice QCD and GPUs
Robert Edwards, Theory Group; Chip Watson, HPC & CIO; Jie Chen & Balint Joo, HPC
Jefferson Lab

Outline
Will describe how capability computing + capacity computing + SciDAC deliver science & NP milestones.
Collaborative efforts involve USQCD + the JLab & DOE+NSF user communities.

Hadronic & Nuclear Physics with LQCD
Hadronic spectroscopy
– Hadron resonance determinations
– Exotic meson spectrum (JLab 12 GeV)
Hadronic structure
– 3-D picture of hadrons from gluon & quark spin+flavor distributions
– Ground & excited E&M transition form factors (JLab 6 GeV + 12 GeV + Mainz)
– E&M polarizabilities of hadrons (Duke + CERN + Lund)
Nuclear interactions
– Nuclear processes relevant for stellar evolution
– Hyperon-hyperon scattering
– 3- & 4-nucleon interaction properties (collab. w/ LLNL; JLab + LLNL)
Beyond the Standard Model
– Neutron decay constraints on BSM from the Ultra Cold Neutron source (LANL)

Bridges in Nuclear Physics
[Figure: NP Exascale]

Spectroscopy
Spectroscopy reveals fundamental aspects of hadronic physics
– Essential degrees of freedom?
– Gluonic excitations in mesons: exotic states of matter?
Status
– Can extract excited hadron energies & identify spins
– Pursuing full QCD calculations with realistic quark masses
New spectroscopy programs world-wide
– E.g., BES III (Beijing), GSI/PANDA (Darmstadt)
– Crucial complement to the 12 GeV program at JLab
Excited nucleon spectroscopy (JLab); JLab GlueX: search for gluonic excitations

USQCD National Effort
US lattice QCD effort: Jefferson Lab, BNL, and FNAL
– FNAL: weak matrix elements
– BNL: RHIC physics
– JLab: hadronic physics
SciDAC: the R&D vehicle (software R&D)
INCITE resources (~20 TF-yr) + USQCD cluster facilities (17 TF-yr): impact on DOE's High Energy & Nuclear Physics program

QCD: Theory of Strong Interactions
QCD: the theory of quarks & gluons
Lattice QCD: approximate space-time with a grid
– Systematically improvable
Gluon (gauge) generation
– "Configurations" generated via importance sampling (summarized below)
– Rewrite as diff. eqns.: a sparse matrix solve per step; avoids the "determinant" problem
Analysis
– Compute observables via averages over configurations
Requires large-scale computing resources
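For reference, the importance-sampling step amounts to the standard path-integral estimate (a textbook relation, not specific to this talk):

    \langle O \rangle \;=\; \frac{1}{Z}\int \mathcal{D}U \; e^{-S[U]}\, O[U] \;\approx\; \frac{1}{N}\sum_{i=1}^{N} O[U_i]

where the gauge configurations U_i are drawn with weight proportional to e^{-S[U]}; generating each new configuration requires repeated sparse Dirac-matrix solves.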

Gauge Generation: Cost Scaling
Cost for reasonable statistics, box size, and "physical" pion mass.
Extrapolating in lattice spacing: 10 to 100 PF-yr.
[Plot: cost in PF-years; state of the art today ~10 TF-yr; 2011 target ~100 TF-yr]

Typical LQCD Workflow
Generate the configurations
– Leadership level: ~24k cores, 10 TF-yr
– Few big jobs, few big files
Analyze (propagators from t=0 to t=T)
– Typically mid-range level: ~256 cores
– Many small jobs, many big files; substantial I/O movement
Extract
– Extract information from measured observables
(The analysis stage is sketched below.)
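A minimal sketch of what the per-configuration analysis stage looks like as a program (all type and function names here are placeholders for illustration; the real workflow is driven by Chroma/QDP++ jobs and scripts):

    #include <string>
    #include <vector>

    // Placeholder types and stubs; illustrative only, not the production interfaces.
    struct GaugeField {};
    struct Propagator {};

    GaugeField load_configuration(const std::string&) { return {}; }               // read one large gauge-field file
    Propagator solve_dirac(const GaugeField&, int /*source_time*/) { return {}; }  // Dirac-matrix inversion
    void contract_and_save(const std::vector<Propagator>&) {}                      // build & write correlation functions

    int main() {
        std::vector<std::string> configs = {"cfg_0100.lime", "cfg_0110.lime"};  // hypothetical file names
        for (const auto& file : configs) {              // many independent, small-to-mid-size jobs
            GaugeField U = load_configuration(file);    // dominant I/O: big files moved per job
            std::vector<Propagator> props;
            for (int t0 : {0, 32, 64, 96})              // several source times per configuration
                props.push_back(solve_dirac(U, t0));    // GPU-friendly inner kernel (single/half precision)
            contract_and_save(props);                   // cluster step: double precision, big memory footprint
        }
        return 0;
    }

The point of the sketch is the job structure: each configuration is an independent task dominated by a large file read, several GPU-friendly inversions, and a memory-heavy contraction step.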

Computational Requirements
Ratio of gauge generation (INCITE) to analysis (LQCD) for current calculations:
– Weak matrix elements: 1 : 1
– Baryon spectroscopy: 1 : 10
– Nuclear structure: 1 : 4
Overall computational requirements, INCITE : LQCD computing: 1 : 1 (2005), 1 : 3 (2010)
Current availability: INCITE (~20 TF) : LQCD (17 TF)
Core work: solving a sparse matrix equation iteratively (see the sketch below)
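Since the core work is an iterative sparse solve, a minimal conjugate-gradient sketch gives the flavor of the kernel (illustrative only; the matrix appears solely as a matrix-vector product, and the production solvers in QUDA/Chroma do not look like this):

    #include <cmath>
    #include <cstddef>
    #include <functional>
    #include <vector>

    using Vec = std::vector<double>;
    using MatVec = std::function<void(const Vec&, Vec&)>;  // y = A x

    double dot(const Vec& a, const Vec& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    // Solve A x = b for symmetric positive-definite A, starting from x = 0
    // (in LQCD one typically solves the normal equations M^dagger M x = b).
    void cg_solve(const MatVec& A, const Vec& b, Vec& x, double tol, int max_iter) {
        x.assign(b.size(), 0.0);
        Vec r = b, p = b, Ap(b.size());
        double rr = dot(r, r);
        for (int k = 0; k < max_iter && std::sqrt(rr) > tol; ++k) {
            A(p, Ap);                                   // sparse matrix-vector product: the hot kernel
            double alpha = rr / dot(p, Ap);
            for (std::size_t i = 0; i < x.size(); ++i) x[i] += alpha * p[i];
            for (std::size_t i = 0; i < r.size(); ++i) r[i] -= alpha * Ap[i];
            double rr_new = dot(r, r);
            double beta = rr_new / rr;
            for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
            rr = rr_new;
        }
    }

Essentially all of the time goes into the A(p, Ap) call, which for the Dirac operator is a stencil over the lattice: exactly the memory-bandwidth-bound kernel the GPUs accelerate.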

SciDAC Impact
Software development
– QCD-friendly APIs and libraries: enable high user productivity
– Allow rapid prototyping & optimization
– Significant software effort for GPUs
Algorithm improvements
– Operators & contractions on clusters (Distillation: PRL (2009))
– Mixed-precision Dirac solvers: INCITE + clusters + GPUs, 2-3X (sketched below)
– Adaptive multi-grid solvers: clusters, ~8X (?)
Hardware development via USQCD facilities
– Adding support for new hardware
– GPUs
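To illustrate the mixed-precision idea, here is a generic iterative-refinement (defect-correction) sketch, not the actual QUDA/Chroma implementation: the expensive inner solve runs in single or half precision, while a double-precision outer loop corrects the residual so the final answer still reaches full accuracy.

    #include <cmath>
    #include <cstddef>
    #include <functional>
    #include <vector>

    using VecD = std::vector<double>;
    using MatVecD = std::function<void(const VecD&, VecD&)>;       // y = A x in double precision
    using LowPrecSolve = std::function<void(const VecD&, VecD&)>;  // e ~ A^{-1} r, computed in single/half

    // Outer double-precision defect-correction loop around a low-precision inner solver.
    void mixed_precision_solve(const MatVecD& A, const LowPrecSolve& inner_solve,
                               const VecD& b, VecD& x, double tol, int max_outer) {
        x.assign(b.size(), 0.0);
        VecD r = b, Ax(b.size()), e(b.size());
        for (int k = 0; k < max_outer; ++k) {
            double rnorm = 0.0;
            for (double ri : r) rnorm += ri * ri;
            if (std::sqrt(rnorm) < tol) break;          // converged to full (double) accuracy
            inner_solve(r, e);                          // cheap low-precision solve, e.g. on the GPU
            for (std::size_t i = 0; i < x.size(); ++i) x[i] += e[i];   // accumulate correction in double
            A(x, Ax);                                   // recompute the true residual in double
            for (std::size_t i = 0; i < r.size(); ++i) r[i] = b[i] - Ax[i];
        }
    }

Because the low-precision solve does the heavy lifting, most of the flops and memory traffic run at single/half-precision rates, which is why speedups of the quoted size are possible.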

Modern GPU Characteristics
Hundreds of simple cores: high flop rate
SIMD architecture (single instruction, multiple data)
Complex (high-bandwidth) memory hierarchy
Fast context switching hides memory access latency
Gaming cards: no memory error correction (ECC), a reliability issue
I/O bandwidth << memory bandwidth (rough estimate below)

    Commodity processors     x86 CPU                 NVIDIA GT200              New Fermi GPU
    #cores
    Clock speed              3.2 GHz                 1.4 GHz
    Main memory bandwidth    20 GB/s                 160 GB/s (gaming card)    180 GB/s (gaming card)
    I/O bandwidth            7 GB/s (dual QDR IB)    3 GB/s                    4 GB/s
    Power                    80 watts                200 watts                 250 watts
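A rough back-of-the-envelope estimate shows why memory bandwidth, not peak flops, governs the inverter (the per-site counts below are the commonly quoted Wilson-dslash numbers, not figures from this talk): each lattice site costs about 1320 flops and, with no data reuse, about (8 x 24 + 8 x 18 + 24) single-precision numbers of traffic.

    \frac{1320\ \text{flops}}{(8\times 24 + 8\times 18 + 24)\times 4\ \text{bytes}} \;=\; \frac{1320}{1440} \;\approx\; 0.9\ \text{flops/byte}, \qquad 0.9 \times 160\ \text{GB/s} \;\approx\; 150\ \text{Gflops}

So a 160 GB/s gaming card sustains on the order of 150 Gflops in single precision in this kernel; cutting the traffic (half precision, gauge-link compression, data reuse) is what raises the number further.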

Inverter Strong Scaling: V = 32³ × 256
[Plot: sustained inverter performance up to ~3 Tflops; scaling flattens once the local volume per GPU becomes too small (I/O bottleneck)]
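The flattening is the usual surface-to-volume effect (a generic estimate, not a number read off the plot): halo traffic per GPU scales with the surface of its local sub-lattice, while useful work scales with its volume,

    \frac{\text{halo traffic}}{\text{interior work}} \;\propto\; \frac{\text{surface}}{\text{volume}} \;\sim\; \frac{2d}{\ell}

for a local sub-lattice of linear size \ell partitioned in d directions. Strong scaling shrinks \ell at fixed global volume, so eventually the PCIe and InfiniBand links, rather than the GPU memory system, set the pace.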

Science / Dollar for (Some) LQCD Capacity Apps

A Large Capacity Resource: 530 GPUs at Jefferson Lab (July)
– ~200,000 cores (1,600 million core-hours / year)
– 600 Tflops peak single precision
– 100 Tflops aggregate sustained in the inverter (mixed half/single precision)
– A significant increase in dedicated USQCD resources
All this for only $1M with hosts, networking, etc.
Disclaimer: to exploit this performance, code has to run on the GPUs, not the CPU (the Amdahl's Law problem). The SciDAC-2 (& 3) software effort: move more inverters & other code to the GPU.

New Science Reach in QCD Spectrum
Gauge generation (next dataset)
– INCITE: Crays & BG/Ps, ~16K-24K cores
– Double precision
Analysis (existing dataset): two classes
– Propagators (Dirac matrix inversions): few-GPU level; single + half precision; no memory error correction
– Contractions: clusters, few cores; double precision + large memory footprint
Cost (TF-yr): new 10 TF-yr, old 1 TF-yr

Isovector Meson Spectrum

Exotic Matter?
Can we observe exotic matter?
[Figure: excited string; QED vs. QCD]

Exotic Matter
[Figure: exotics, world summary]

Exotic Matter: First GPU Results
– Suggests (many) exotics within range of JLab Hall D
– Previous work: photoproduction rates are high
– Current GPU work: (strong) decays, an important experimental input

Baryon Spectrum
The "missing resonance problem": what are the collective modes? What is the structure of the states?
– A major focus of (and motivation for) JLab Hall B
– Not resolved at 6 GeV

Nucleon & Delta Spectrum
First results from GPUs: < 2% error bars
[Figure bands labeled [56,2+] D-wave and [70,1-] P-wave]
– Discern structure: wave-function overlaps
– Change at lighter quark mass? Decays!
– Suggests a spectrum at least as dense as the quark model

Towards Resonance Determinations
Augment with multi-particle operators
– Needs "annihilation diagrams", provided by distillation; ideally suited for GPUs
Resonance determination
– Scattering in a finite box: discrete energy levels
– Lüscher finite-volume techniques (relation below)
– Phase shifts, and hence widths
First results (partially from GPUs)
– Seems practical (arxiv:)
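For reference, the s-wave Lüscher relation that turns a finite-volume two-particle energy level into an infinite-volume phase shift reads (standard form quoted from memory, not from the slide):

    p\,\cot\delta_0(p) \;=\; \frac{2}{\sqrt{\pi}\,L}\,\mathcal{Z}_{00}(1;q^2), \qquad q \;=\; \frac{pL}{2\pi}

where p is the relative momentum extracted from the measured energy in a box of side L and \mathcal{Z}_{00} is the Lüscher zeta function; scanning levels across volumes and frames maps out \delta_0(p), from which a resonance mass and width follow.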

Phase Shifts: demonstration

Extending Science Reach
USQCD
– Next calculations at physical quark masses: 100 TF-yr to 1 PF-yr
– New INCITE + Early Science applications (ANL + ORNL + NERSC)
– NSF Blue Waters petascale (PRAC)
Need SciDAC-3
– Significant software effort for next-generation GPUs & heterogeneous environments
– Participate in emerging ASCR Exascale initiatives
INCITE + LQCD synergy
– The ARRA GPU system is well matched to current leadership facilities

Path to Exascale
Enabled by some hybrid GPU system?
– Cray + NVIDIA??
NSF GaTech Tier 2 (experimental facility)
– Phase 1: HP cluster + GPU (NVIDIA Tesla)
– Phase 2: hybrid GPU+
ASCR Exascale facility
– Case studies for science, software + runtime, hardware
An Exascale capacity resource will be needed

Summary
Capability + capacity + SciDAC
– Deliver science & HEP+NP milestones
Petascale (leadership) + petascale (capacity) + SciDAC-3
– Spectrum + decays
– First contact with experimental resolution
Exascale (leadership) + exascale (capacity) + SciDAC-3
– Full resolution
– Spectrum + transitions
– Nuclear structure
Collaborative efforts: USQCD + JLab user communities

Backup slides
The end

JLab ARRA: Phase 1

JLab ARRA: Phase 2

Hardware: ARRA GPU Cluster
Host: 2.4 GHz Nehalem, 48 GB memory / node
65 nodes, 200 GPUs; original configuration:
– 40 nodes w/ 4 GTX-285 GPUs
– 16 nodes w/ 2 GTX, QDR IB
– 2 nodes w/ 4 Tesla C1050 or S1070
One quad-GPU node = one rack of conventional nodes

SciDAC Software Stack
QCD-friendly APIs/libraries
[Stack diagram layers: application level; high-level (LAPACK-like); architectural level (data parallel); GPUs]

Dirac Inverter with Parallel GPUs
Divide the problem among nodes. Trade-offs:
– On-node vs. off-node bandwidths
– Locality vs. memory bandwidth
Efficient at large problem size per node (halo-exchange sketch below)
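A minimal sketch of the communication pattern such a multi-GPU inverter must hide (generic MPI halo exchange for one partitioned direction; all names are placeholders, and the GPU packing/compute steps appear only as comments rather than real QMP/QUDA calls):

    #include <mpi.h>
    #include <vector>

    // Generic halo exchange for one partitioned lattice direction; placeholder names.
    void dslash_halo_exchange(std::vector<float>& send_fwd, std::vector<float>& send_bwd,
                              std::vector<float>& recv_fwd, std::vector<float>& recv_bwd,
                              int fwd_rank, int bwd_rank, MPI_Comm comm) {
        MPI_Request req[4];
        // 1) Boundary spinors would first be packed on the GPU and copied to the host
        //    send buffers over PCIe ("I/O bandwidth << memory bandwidth").
        MPI_Irecv(recv_fwd.data(), static_cast<int>(recv_fwd.size()), MPI_FLOAT, fwd_rank, 0, comm, &req[0]);
        MPI_Irecv(recv_bwd.data(), static_cast<int>(recv_bwd.size()), MPI_FLOAT, bwd_rank, 1, comm, &req[1]);
        MPI_Isend(send_fwd.data(), static_cast<int>(send_fwd.size()), MPI_FLOAT, fwd_rank, 1, comm, &req[2]);
        MPI_Isend(send_bwd.data(), static_cast<int>(send_bwd.size()), MPI_FLOAT, bwd_rank, 0, comm, &req[3]);
        // 2) While messages are in flight, the GPU applies the Dirac operator to interior sites.
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        // 3) Received halos are copied back to the GPU and the boundary sites are updated.
    }

Overlapping the Isend/Irecv pair with interior compute is what makes the off-node bandwidth tolerable; once the local volume shrinks, there is too little interior work left to cover the transfers.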

Amdahl's Law (Problem)
A major challenge in exploiting GPUs is Amdahl's Law: if 60% of the code is GPU-accelerated by 6x, the net gain is only 2x (worked out below).
Jefferson Lab has split this workload into two jobs (red and black) for two machines (conventional, GPU), giving a 2x clock-time improvement.
Also disappointing: the GPU is idle 80% of the time!
Conclusion: need to move more code to the GPU, and/or need task-level parallelism (overlapping CPU and GPU).
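The 2x figure is just Amdahl's Law evaluated for these numbers (standard formula, shown for completeness):

    S \;=\; \frac{1}{(1-f) + f/s} \;=\; \frac{1}{(1-0.6) + 0.6/6} \;=\; \frac{1}{0.5} \;=\; 2

with f = 0.6 the accelerated fraction and s = 6 its speedup; even s \to \infty would give only 1/0.4 = 2.5x, which is why moving more of the code onto the GPU matters more than making the inverter still faster.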

Considerable Software R&D is Needed
Up until now, the O/S & RTS form a "thin layer" between application & hardware:
– Hardware; device drivers; Linux or µ-kernel; RTS; MPI (?); user application space
An Exascale X-Stack (?) adds much more:
– Hardware; device drivers; power, RAS, memory; RTS (scheduling, load balancing, work stealing, programming-model coexistence); MPI (?); programming model (hybrid MPI + node parallelism; PGAS? Chapel?); libraries (BLAS, PETSc, Trilinos, ...); user application space
The current USQCD stack:
– Application layer: Chroma, CPS, MILC
– Level 3 (optimization): MDWF, Dirac operators, QOP
– Level 2 (data parallel): QDP++, QDP/C, QIO
– Level 1 (basics): QMP (message passing), QLA, QMT (threads)
– Plus QA0, GCC-BGL, workflow and visualization tools, and tools from collaborations with other SciDAC projects, e.g. PERI
Need SciDAC-3 to move to Exascale

Need SciDAC-3
Application porting to new programming models/languages
– Node abstraction & portability (like QDP++ now?)
– Interactions with a more restrictive (liberating?) exascale stack?
Performance libraries for Exascale hardware (like Level 3 currently)
Will need productivity tools
– Domain-specific languages (QDP++ is almost this)
– Code generators (more QA0, BAGEL, etc.)
– Performance monitoring
– Debugging, simulation
Algorithms for greater concurrency / reduced synchronization

NP Exascale Report