




1 Lattice QCD and GPUs
Robert Edwards, Theory Group
Chip Watson, HPC & CIO
Jie Chen & Balint Joo, HPC
Jefferson Lab

2 Outline
Will describe how capability computing + capacity computing + SciDAC:
– Deliver science & NP milestones
Collaborative efforts involve USQCD + JLab & DOE+NSF user communities

3 Hadronic & Nuclear Physics with LQCD
Hadronic spectroscopy
– Hadron resonance determinations
– Exotic meson spectrum (JLab 12 GeV)
Hadronic structure
– 3-D picture of hadrons from gluon & quark spin+flavor distributions
– Ground & excited E&M transition form-factors (JLab 6 GeV + 12 GeV + Mainz)
– E&M polarizabilities of hadrons (Duke+CERN+Lund)
Nuclear interactions
– Nuclear processes relevant for stellar evolution
– Hyperon-hyperon scattering
– 3- & 4-nucleon interaction properties [collab. w/ LLNL] (JLab+LLNL)
Beyond the Standard Model
– Neutron decay constraints on BSM from Ultra Cold Neutron source (LANL)

4 Bridges in Nuclear Physics
[Figure: NP Exascale]

5 Spectroscopy
Spectroscopy reveals fundamental aspects of hadronic physics
– Essential degrees of freedom?
– Gluonic excitations in mesons – exotic states of matter?
Status
– Can extract excited hadron energies & identify spins
– Pursuing full QCD calculations with realistic quark masses
New spectroscopy programs world-wide
– E.g., BES III (Beijing), GSI/PANDA (Darmstadt)
– Crucial complement to 12 GeV program at JLab
Excited nucleon spectroscopy (JLab)
JLab GlueX: search for gluonic excitations

6 USQCD National Effort
US Lattice QCD effort: Jefferson Laboratory, BNL and FNAL
– FNAL: weak matrix elements
– BNL: RHIC physics
– JLab: hadronic physics
SciDAC – R&D vehicle, software R&D
INCITE resources (~20 TF-yr) + USQCD cluster facilities (17 TF-yr): impact on DOE's High Energy & Nuclear Physics program

7 QCD: Theory of Strong Interactions
QCD: theory of quarks & gluons
Lattice QCD: approximate with a grid
– Systematically improvable
Gluon (gauge) generation:
– "Configurations" via importance sampling
– Rewrite as differential equations – one sparse-matrix solve per step – avoids the "determinant" problem
Analysis:
– Compute observables via averages over configurations (see the sketch below)
Requires large-scale computing resources
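To make the analysis step concrete, here is a minimal sketch of forming an ensemble average (with a naive statistical error) over gauge configurations; the measure() routine and the ensemble size are illustrative placeholders, not part of any USQCD library.

```cpp
#include <cmath>
#include <iostream>
#include <vector>

// Stand-in for reading a gauge configuration and evaluating an observable
// (e.g. a correlator) on it; real analysis code would do a lattice
// measurement here. The synthetic values are purely illustrative.
double measure(int config_id) {
    return 1.0 + 0.01 * std::sin(0.37 * config_id);
}

int main() {
    const int n_configs = 500;               // assumed ensemble size
    std::vector<double> obs(n_configs);
    for (int i = 0; i < n_configs; ++i)
        obs[i] = measure(i);                  // one value per configuration

    // Ensemble average ...
    double mean = 0.0;
    for (double x : obs) mean += x;
    mean /= n_configs;

    // ... and naive statistical error on the mean.
    double var = 0.0;
    for (double x : obs) var += (x - mean) * (x - mean);
    double err = std::sqrt(var / (n_configs * (n_configs - 1.0)));

    std::cout << "<O> = " << mean << " +/- " << err << "\n";
}
```

In practice the observables are correlation functions and the error analysis uses jackknife or bootstrap resampling over configurations.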

8 Gauge Generation: Cost Scaling
Cost: reasonable statistics, box size and "physical" pion mass
Extrapolating in lattice spacing: 10 – 100 PF-yr
[Figure: cost in PF-years; state of the art today ~10 TF-yr; 2011 ~100 TF-yr]

9 Typical LQCD Workflow
Generate the configurations (t = 0 … t = T)
– Leadership level: 24k cores, 10 TF-yr
– Few big jobs, few big files
Analyze
– Typically mid-range level: 256 cores
– Many small jobs, many big files
– I/O movement
Extract
– Extract information from measured observables

10 Computational Requirements
Gauge generation (INCITE) : Analysis (LQCD), current calculations:
– Weak matrix elements: 1 : 1
– Baryon spectroscopy: 1 : 10
– Nuclear structure: 1 : 4
Computational requirements, INCITE : LQCD computing:
– 1 : 1 (2005)
– 1 : 3 (2010)
Current availability: INCITE (~20 TF) : LQCD (17 TF)
Core work: solve a sparse matrix equation iteratively (see the sketch below)
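For concreteness, the core solve is typically a Krylov method; below is a minimal conjugate-gradient sketch with the matrix hidden behind an apply_A callback, which stands in for the sparse Dirac normal operator M†M. The sizes and tolerance are illustrative assumptions, not the production Chroma/QUDA solver.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Conjugate gradient for A x = b with A Hermitian positive definite.
// In lattice QCD this is typically applied to the normal operator
// M^dagger M built from the sparse Dirac matrix M; 'apply_A' stands in
// for that matrix-vector product.
int cg(const std::function<void(const Vec&, Vec&)>& apply_A,
       const Vec& b, Vec& x, double tol, int max_iter) {
    const std::size_t n = b.size();
    Vec r(n), p(n), Ap(n);
    apply_A(x, Ap);
    for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
    p = r;
    double rr = dot(r, r);
    for (int it = 0; it < max_iter; ++it) {
        apply_A(p, Ap);
        const double alpha = rr / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        const double rr_new = dot(r, r);
        if (std::sqrt(rr_new) < tol) return it + 1;   // converged
        const double beta = rr_new / rr;
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return max_iter;   // iteration budget exhausted
}

int main() {
    // Toy usage: A = 2*I, b = 1, so the solution is x = 0.5 everywhere.
    const std::size_t n = 16;
    Vec b(n, 1.0), x(n, 0.0);
    auto apply_A = [](const Vec& in, Vec& out) {
        for (std::size_t i = 0; i < in.size(); ++i) out[i] = 2.0 * in[i];
    };
    cg(apply_A, b, x, 1e-10, 100);
}
```

Production solvers add even/odd preconditioning, mixed precision, and (increasingly) multi-grid, as noted on the SciDAC slide.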

11 SciDAC Impact
Software development
– QCD-friendly APIs and libraries: enable high user productivity
– Allow rapid prototyping & optimization
– Significant software effort for GPUs
Algorithm improvements
– Operators & contractions: clusters (Distillation: PRL (2009))
– Mixed-precision Dirac solvers: INCITE + clusters + GPUs, 2-3X (see the sketch below)
– Adaptive multi-grid solvers: clusters, ~8X (?)
Hardware development via USQCD Facilities
– Adding support for new hardware
– GPUs
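The mixed-precision gain typically comes from iterative refinement: an inner solve in single (or half) precision wrapped in a double-precision residual-correction loop. A minimal sketch of that pattern, with a generic solve_single() standing in for the GPU inner solver (this is not the actual QUDA interface):

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using VecD = std::vector<double>;
using VecF = std::vector<float>;

// Iterative refinement: outer loop in double precision, inner solve in low
// precision. 'apply_A' applies the matrix in double precision; 'solve_single'
// is any low-precision inner solver (e.g. a single/half-precision GPU CG).
void mixed_precision_solve(
    const std::function<void(const VecD&, VecD&)>& apply_A,
    const std::function<void(const VecF&, VecF&)>& solve_single,
    const VecD& b, VecD& x, double tol, int max_outer) {
  const std::size_t n = b.size();
  VecD r(n), Ax(n);
  for (int outer = 0; outer < max_outer; ++outer) {
    // Double-precision residual r = b - A x.
    apply_A(x, Ax);
    double rnorm = 0.0;
    for (std::size_t i = 0; i < n; ++i) { r[i] = b[i] - Ax[i]; rnorm += r[i] * r[i]; }
    if (std::sqrt(rnorm) < tol) return;           // converged in double precision

    // Solve the correction equation A e = r in low precision.
    VecF r_f(n), e_f(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) r_f[i] = static_cast<float>(r[i]);
    solve_single(r_f, e_f);

    // Accumulate the correction in double precision.
    for (std::size_t i = 0; i < n; ++i) x[i] += static_cast<double>(e_f[i]);
  }
}

int main() {
  // Toy usage with A = 4*I and an exact "inner solver".
  const std::size_t n = 8;
  VecD b(n, 1.0), x(n, 0.0);
  auto A = [](const VecD& in, VecD& out) {
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = 4.0 * in[i];
  };
  auto inner = [](const VecF& r, VecF& e) {
    for (std::size_t i = 0; i < r.size(); ++i) e[i] = r[i] / 4.0f;
  };
  mixed_precision_solve(A, inner, b, x, 1e-12, 20);
}
```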

12 Modern GPU Characteristics
Hundreds of simple cores: high flop rate
SIMD architecture (single instruction, multiple data)
Complex (high-bandwidth) memory hierarchy
Fast context switching -> hides memory-access latency
Gaming cards: no memory error correction (ECC) – reliability issue
I/O bandwidth << memory bandwidth

Commodity processors     x86 CPU                NVIDIA GT200              New Fermi GPU
#cores                   8                      240                       480
Clock speed              3.2 GHz                1.4 GHz                   –
Main memory bandwidth    20 GB/s                160 GB/s (gaming card)    180 GB/s (gaming card)
I/O bandwidth            7 GB/s (dual QDR IB)   3 GB/s                    4 GB/s
Power                    80 watts               200 watts                 250 watts
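A back-of-the-envelope roofline estimate shows why "I/O bandwidth << memory bandwidth" matters: memory-bound kernels such as the Dirac operator are limited by bandwidth times arithmetic intensity, not by peak flops. The numbers below (the peak rate and the ~1 flop/byte intensity) are rough illustrative assumptions, not measurements.

```cpp
#include <algorithm>
#include <iostream>

int main() {
    // Illustrative numbers, roughly matching the table above.
    const double peak_gflops   = 1000.0;  // single-precision peak (assumed)
    const double mem_bw_gbytes = 160.0;   // GT200-class memory bandwidth
    const double intensity     = 1.0;     // flops per byte (rough assumption
                                          // for a Wilson-type Dirac kernel)

    // Roofline: sustained rate is capped by whichever resource runs out first.
    const double bw_bound  = mem_bw_gbytes * intensity;      // GF/s from bandwidth
    const double sustained = std::min(peak_gflops, bw_bound);

    std::cout << "Bandwidth-bound estimate: " << bw_bound  << " GF/s\n"
              << "Roofline estimate:        " << sustained << " GF/s\n";
}
```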

13 Inverter Strong Scaling: V = 32³ × 256
[Figure: inverter strong scaling, reaching 3 Tflops; performance drops when the local volume per GPU becomes too small (I/O bottleneck)]

14 Science / Dollar for (Some) LQCD Capacity Apps

15 A Large Capacity Resource: 530 GPUs at Jefferson Lab (July)
– ~200,000 cores (1,600 million core hours / year)
– 600 Tflops peak single precision
– 100 Tflops aggregate sustained in the inverter (mixed half/single precision)
– Significant increase in dedicated USQCD resources
All this for only $1M with hosts, networking, etc.
Disclaimer: to exploit this performance, code has to be run on the GPUs, not the CPU (Amdahl's Law problem).
SciDAC-2 (& 3) software effort: move more inverters & other code to the GPU

16 New Science Reach in 2010-2011: QCD Spectrum
Gauge generation (next dataset):
– INCITE: Crays & BG/Ps, ~16K – 24K cores
– Double precision
Analysis (existing dataset): two classes
– Propagators (Dirac matrix inversions): few-GPU level; single + half precision; no memory error correction
– Contractions: clusters, few cores; double precision + large memory footprint
Cost (TF-yr): new 10 TF-yr vs. old 1 TF-yr (propagators 10 TF-yr, contractions 1 TF-yr)

17 Isovector Meson Spectrum

18 Isovector Meson Spectrum

19 Exotic matter?
Can we observe exotic matter?
[Figure: excited string; QED vs. QCD]

20 Exotic matter
Exotics: world summary

21 Exotic matter: first GPU results
– Suggests (many) exotics within range of JLab Hall D
– Previous work: photoproduction rates are high
– Current GPU work: (strong) decays – important experimental input

22 Baryon Spectrum
"Missing resonance problem"
– What are the collective modes?
– What is the structure of the states?
Major focus of (and motivation for) JLab Hall B
Not resolved experimentally @ 6 GeV

23 Nucleon & Delta Spectrum
First results from GPUs; < 2% error bars
[Figure: spectrum with [56,2+] D-wave and [70,1-] P-wave bands labelled]
Discern structure: wave-function overlaps
Change at light quark mass? Decays!
Suggests a spectrum at least as dense as the quark model

24 Towards resonance determinations
Augment with multi-particle operators
– Needs "annihilation diagrams" – provided by distillation; ideally suited for GPUs
Resonance determination
– Scattering in a finite box – discrete energy levels
– Lüscher finite-volume techniques (see the relation below)
– Phase shifts → width
First results (partially from GPUs)
– Seems practical (arXiv:0905.2160)
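For reference, the s-wave Lüscher relation underlying this program maps a two-particle energy level in a cubic box of side L to the infinite-volume phase shift. Written schematically (conventions and higher partial-wave corrections vary between references):

```latex
% Two particles of mass m with back-to-back momentum p in the box:
%   E = 2\sqrt{m^2 + p^2}
% The measured level E fixes p, which determines the phase shift via the
% Luescher zeta function Z_{00}:
\begin{align}
  p \cot\delta_0(p) &= \frac{2}{\sqrt{\pi}\,L}\, Z_{00}\!\bigl(1; q^2\bigr),
  &
  q &= \frac{p L}{2\pi}.
\end{align}
```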

25 Phase Shifts: demonstration

26 Extending science reach
USQCD:
– Next calculations at physical quark masses: 100 TF – 1 PF-yr
– New INCITE + Early Science application (ANL+ORNL+NERSC)
– NSF Blue Waters petascale (PRAC)
Need SciDAC-3
– Significant software effort for next-generation GPUs & heterogeneous environments
– Participate in emerging ASCR Exascale initiatives
INCITE + LQCD synergy:
– ARRA GPU system well matched to current leadership facilities

27 Path to Exascale
Enabled by some hybrid GPU system?
– Cray + NVIDIA??
NSF GaTech: Tier 2 (experimental facility)
– Phase 1: HP cluster + GPU (NVIDIA Tesla)
– Phase 2: hybrid GPU
ASCR Exascale facility
– Case studies for science, software + runtime, hardware
An exascale capacity resource will be needed

28 Summary
Capability + capacity + SciDAC
– Deliver science & HEP+NP milestones
Petascale (leadership) + petascale (capacity) + SciDAC-3
– Spectrum + decays
– First contact with experimental resolution
Exascale (leadership) + exascale (capacity) + SciDAC-3
– Full resolution
– Spectrum + transitions
– Nuclear structure
Collaborative efforts: USQCD + JLab user communities

29 Backup slides
The end

30 JLab ARRA: Phase 1

31 JLab ARRA: Phase 2

32 Hardware: ARRA GPU Cluster
Host: 2.4 GHz Nehalem, 48 GB memory / node
65 nodes, 200 GPUs
Original configuration:
– 40 nodes w/ 4 GTX-285 GPUs
– 16 nodes w/ 2 GTX-285 + QDR IB
– 2 nodes w/ 4 Tesla C1050 or S1070
One quad-GPU node = one rack of conventional nodes

33 SciDAC Software Stack
QCD-friendly APIs/libs: http://www.usqcd.org
[Diagram of layers: application level; high-level (LAPACK-like); architectural level (data parallel, sketched below); GPUs]
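To illustrate what the "data parallel" layer provides, here is a toy C++ sketch in the spirit of QDP++-style whole-lattice expressions; the Lattice<T> type and its operators are invented for illustration and are not the actual QDP++ API.

```cpp
#include <cstddef>
#include <vector>

// Toy "lattice field": one value of type T per site. Real data-parallel
// APIs (e.g. QDP++) hide the site loop so the same user expression can
// run on a workstation, a cluster node, or a GPU.
template <typename T>
struct Lattice {
    std::vector<T> site;
    explicit Lattice(std::size_t volume, T init = T{}) : site(volume, init) {}
};

// Whole-lattice operators: the site loop lives here, not in user code.
template <typename T>
Lattice<T> operator+(const Lattice<T>& a, const Lattice<T>& b) {
    Lattice<T> r(a.site.size());
    for (std::size_t i = 0; i < a.site.size(); ++i) r.site[i] = a.site[i] + b.site[i];
    return r;
}

template <typename T>
Lattice<T> operator*(T s, const Lattice<T>& a) {
    Lattice<T> r(a.site.size());
    for (std::size_t i = 0; i < a.site.size(); ++i) r.site[i] = s * a.site[i];
    return r;
}

int main() {
    const std::size_t volume = 16 * 16 * 16 * 32;   // small illustrative volume
    Lattice<float> psi(volume, 1.0f), chi(volume, 2.0f);

    // User code is written as whole-field expressions (an "axpy" here):
    Lattice<float> phi = 0.5f * psi + chi;
    (void)phi;
}
```

The point of the real library is that code written at this level is independent of where the site loop actually executes, which is what made the GPU port tractable.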

34 Dirac Inverter with Parallel GPUs
Divide the problem among nodes. Trade-offs:
– On-node vs. off-node bandwidths
– Locality vs. memory bandwidth
Efficient at large problem size per node (see the halo-exchange sketch below)
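Parallelizing the inverter means partitioning the lattice into sub-volumes and exchanging boundary ("halo") sites on every iteration, which is where the on-node vs. off-node bandwidth trade-off enters. A minimal 1-D MPI halo-exchange sketch; the real code partitions in up to four dimensions and overlaps communication with GPU compute, and the local volume and halo size here are illustrative assumptions.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each rank owns a local sub-volume plus one halo slice on each side.
    const int local_sites = 1 << 16;          // illustrative local volume
    const int halo = 1024;                    // boundary sites per face (assumed)
    std::vector<float> field(local_sites + 2 * halo, static_cast<float>(rank));

    const int left  = (rank - 1 + size) % size;   // periodic neighbours in 1-D
    const int right = (rank + 1) % size;

    // Exchange boundary slices: send my edge sites, receive the neighbours'
    // edges into my halo regions. In the inverter this happens every iteration.
    MPI_Sendrecv(&field[halo], halo, MPI_FLOAT, left, 0,
                 &field[local_sites + halo], halo, MPI_FLOAT, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&field[local_sites], halo, MPI_FLOAT, right, 1,
                 &field[0], halo, MPI_FLOAT, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // ...apply the interior part of the Dirac operator, then the halo part...

    MPI_Finalize();
}
```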

35 Amdahl's Law (Problem)
A major challenge in exploiting GPUs is Amdahl's Law: if 60% of the code is GPU-accelerated by 6x, the net gain is only 2x.
Also disappointing: the GPU is idle 80% of the time!
Jefferson Lab has split this workload into two jobs (red and black) for two machines (conventional, GPU): 2x clock-time improvement.
Conclusion: need to move more code to the GPU, and/or need task-level parallelism (overlap CPU and GPU).
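The 2x figure follows directly from Amdahl's Law; a quick arithmetic check with the fraction (60%) and acceleration (6x) quoted on the slide:

```cpp
#include <iostream>

int main() {
    // Amdahl's Law: overall speedup when a fraction f of the runtime
    // is accelerated by a factor s.
    const double f = 0.60;   // fraction of the work that is GPU-accelerated
    const double s = 6.0;    // speedup of that fraction
    const double speedup = 1.0 / ((1.0 - f) + f / s);   // 1 / (0.4 + 0.1) = 2.0

    // GPU utilisation in the accelerated run: the GPU is busy only during
    // the accelerated portion, (f/s) out of 0.5 total -> 20% busy, 80% idle.
    const double gpu_busy = (f / s) / ((1.0 - f) + f / s);

    std::cout << "Net speedup:   " << speedup << "x\n";
    std::cout << "GPU idle time: " << (1.0 - gpu_busy) * 100 << "%\n";
}
```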

36 Considerable Software R&D is Needed
Up until now: O/S & RTS form a "thin layer" between application and hardware
– User application space | MPI (?) | Linux or µ-kernel RTS | device drivers | hardware
Exascale X-Stack (?):
– User application space; libraries (BLAS, PETSc, Trilinos, ...)
– Programming model: hybrid MPI + node parallelism; PGAS? Chapel?
– RTS: scheduling, load balancing, work stealing, programming-model coexistence
– MPI (?); device drivers; power, RAS, memory; hardware
SciDAC software layers:
– Application layer: Chroma, CPS, MILC
– Level 3 (optimization): MDWF, Dirac operators, QOP
– Level 2 (data parallel): QDP++, QDP/C
– Level 1 (basics): QLA, QIO, QMP message passing, QMT threads
– Plus QA0, GCC-BGL, workflow & viz. tools, and tools from collaborations with other SciDAC projects (e.g. PERI)
Need SciDAC-3 to move to Exascale

37 Need SciDAC-3
Application porting to new programming models/languages
– Node abstraction – portability (like QDP++ now?)
– Interactions with a more restrictive (liberating?) exascale stack?
Performance libraries for exascale hardware
– Like Level 3 currently
– Will need productivity tools
Domain-specific languages (QDP++ is almost this)
Code generators (more QA0, BAGEL, etc.)
Performance monitoring
Debugging, simulation
Algorithms for greater concurrency / reduced synchronization

38 NP Exascale Report




