LQCD Workflow: Gauge Generation, Prop Calcs, Analysis Robert Edwards Jefferson Lab HackFest 2014.


1 LQCD Workflow: Gauge Generation, Prop Calcs, Analysis Robert Edwards Jefferson Lab HackFest 2014

2 The “Good Ole’ Days”

3 Beginnings of Industrial Science

4 The Future of Industrial Science??
[Figure labels: Graduate students; Software manager; Old Fortran code]

5 I said Joke slide, not Goat slide

6 LQCD Workflow
Few big jobs, few big files
Many small jobs, many big files
I/O movement
[Diagram labels, in slide order: Generate the configurations; Leadership level; 60K cores, 10's of TF-yr; t=0 to t=T; Contract; 8 cores, CPUs; Correlators; 100K – 1M copies; Analyze; 100K copies; 4 Kepler GPUs; Propagators]

7 LQCD Workflow
Production cost: ~25%, ~75%
New analysis cost: > 5%
Leadership level
Throughput mode
[Diagram labels, in slide order: Generate the configurations; Leadership level; 60K cores, 10's of TF-yr; t=0 to t=T; Contract; 8 cores, CPUs; Correlators; 100K – 1M copies; Analyze; 100K copies; 4 Kepler GPUs; Now AMG!; Propagators]

8 LOTS of propagators
[Diagrams: isovector meson; isoscalar meson; t=0 to t]
Quark propagation between 0 & t
Quark propagation 0 to 0 & t to t
Expensive!

9 Reuse those propagators
Variational method:
Propagators
Operators
And lots of permutations of contractions…
[Diagram labels: "single particle"; "single to two-particle"]

10 Distillation - mesons
Smearing in correlator
Correlator
Factorizes: operators and perambulators

11 Rewrite correlator keeping track of smearing labels
Notation – keep track of labels as indices
Consider complexity
Mesons
[Figure label: t0]
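The equations on slides 10 and 11 did not survive in this transcript. As a minimal sketch of the factorization they describe, the example below assumes the two-point form commonly written in the distillation literature, C = Tr[Phi_sink tau_fwd Phi_src tau_bwd], with spin indices suppressed so that every object is an N x N matrix in distillation space. The function and variable names are hypothetical, and the matrices are random stand-ins rather than real operators or perambulators.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def trace(a):
    """Sum of the diagonal entries of a square matrix."""
    return sum(a[i][i] for i in range(len(a)))


def meson_correlator(phi_sink, tau_fwd, phi_src, tau_bwd):
    """Factorized evaluation: three N x N products (O(N^3) each), then a trace."""
    return trace(matmul(matmul(matmul(phi_sink, tau_fwd), phi_src), tau_bwd))


def meson_correlator_naive(phi_sink, tau_fwd, phi_src, tau_bwd):
    """Same quantity as an explicit sum over all four index labels (O(N^4))."""
    n = len(phi_sink)
    return sum(phi_sink[a][b] * tau_fwd[b][c] * phi_src[c][d] * tau_bwd[d][a]
               for a in range(n) for b in range(n)
               for c in range(n) for d in range(n))


def random_matrix(n, rng):
    """Random complex N x N matrix used as a stand-in (assumption)."""
    return [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
            for _ in range(n)]


rng = random.Random(0)
N = 4  # toy rank; the talk quotes N in the hundreds
mats = [random_matrix(N, rng) for _ in range(4)]
c_fast = meson_correlator(*mats)
c_naive = meson_correlator_naive(*mats)
```

Writing the labels as explicit indices, as in the naive version, makes the contraction pattern visible; grouping them into matrix products is what brings the meson cost down to O(N^3).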

12 More complicated mesons
Two-mesons (hadrons) require projection into irreducible representation
Can have many Wick contractions and mom. projections
–Worst case: rest -> p=100 -> 6x, p=110 -> 12x, p=111 -> 8x
[Figure: diagram with vertices labelled 1 to 6]

13 More complicated mesons
Two-mesons (hadrons) require projection into irreducible representation
Can have many Wick contractions and mom. projections
–Worst case: rest -> p=100 -> 6x, p=110 -> 12x, p=111 -> 8x
[Figure label: "Graph"]

14 Consider I=1, A_1^{++}, 20^3 x 128 lattice
Cost of a correlator driven by number of irrep graphs
–1752 unique irrep graphs (using SU(2) isospin symmetry)
–Time ~ 11100 sec. (N=128), 8-core Xeon, ATLAS for mat-muls ("zgemms")
Corresponding case, but all at rest
–14 unique irrep graphs
–Time ~ 85 sec.
Over all irreps & time-sources, 79MB in graph DBs, 34700 keys
Have seen up to ~2000 graphs
Correlation functions get to be expensive…
[Figure label: rest-frame]
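A quick check of the claim that cost is driven by the number of irrep graphs: dividing the two timings quoted on slide 14 by their graph counts gives roughly 6 seconds per graph in both cases. The dictionary keys below are descriptive labels added for illustration only.

```python
# (wall time in seconds, number of unique irrep graphs), as quoted on slide 14
cases = {
    "with momentum": (11100.0, 1752),
    "all at rest": (85.0, 14),
}

# Time per graph for each case
seconds_per_graph = {name: t / n for name, (t, n) in cases.items()}

for name, value in seconds_per_graph.items():
    print(f"{name}: {value:.2f} s per graph")
```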

15 Optimize order of operations
Traverse graphs along a t-slice
–10,000's of graphs
Also 3-particles and more…
Common sub-expression elimination
For fixed t-slice: 100's of vertices
[Figure: Graphs 1, 2, 3 over timeslices t=0, 1, 2, 3, ... T=48, spread across Devices 1 to 4. Example: I=1/2 K*π, arXiv:1406.4158]
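Common sub-expression elimination across graphs can be sketched with a shared cache keyed on the sub-expression. This is an illustrative toy, not the redstar implementation: the nested-tuple graph encoding, the `evaluate` helper, and the 2 x 2 integer "hadron nodes" are all assumptions made for the example.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def evaluate(expr, leaves, cache, counter):
    """Evaluate a nested ("mul", left, right) expression, reusing cached products."""
    if isinstance(expr, str):          # a leaf: look up the stored matrix
        return leaves[expr]
    if expr not in cache:              # compute each distinct product only once
        _, left, right = expr
        product = matmul(evaluate(left, leaves, cache, counter),
                         evaluate(right, leaves, cache, counter))
        counter["matmuls"] += 1
        cache[expr] = product
    return cache[expr]


# Toy leaves standing in for hadron nodes on one t-slice (hypothetical values)
leaves = {
    "A": [[1, 2], [3, 4]],
    "B": [[0, 1], [1, 0]],
    "C": [[2, 0], [0, 2]],
    "D": [[1, 1], [0, 1]],
}

ab = ("mul", "A", "B")                 # sub-expression shared by all graphs
graphs = [("mul", ab, "C"), ("mul", ab, "D"), ("mul", ab, "C")]

# One cache shared across all graphs
shared_counter = {"matmuls": 0}
shared_cache = {}
shared_results = [evaluate(g, leaves, shared_cache, shared_counter) for g in graphs]

# A fresh cache per graph, so nothing is reused between graphs
naive_counter = {"matmuls": 0}
naive_results = [evaluate(g, leaves, {}, naive_counter) for g in graphs]
```

With the shared cache the three graphs need 3 matrix products instead of 6, and the results are identical. With 10,000's of graphs per t-slice, the fraction of repeated sub-expressions decides how much this saves.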

16 Workflow is choreographed
Topology
–Vertices can have different ordinalities
Order of evaluation is important
–Baryons ~ O(N^4), Mesons ~ O(N^3)
[Figure: Graphs 1, 2, 3 over timeslices t=0, 1, 2, 3, ... T=48, spread across Devices 1 to 4. Labels: More graduate students; B, B, M, M]

17 Workflow is choreographed
Topology
–Vertices can have different ordinalities
Order of evaluation is important
–Baryons ~ O(N^4), Mesons ~ O(N^3)
–Avoid creating larger ranked objects
[Figure: Graphs 1, 2, 3 over timeslices t=0, 1, 2, 3, ... T=48, spread across Devices 1 to 4. Labels: More graduate students; B, B, M, M; contracting M with M first is marked "Yuk" and "Bad"]

18 Workflow is choreographed
Topology
–Vertices can have different ordinalities
Order of evaluation is important
–Baryons ~ O(N^4), Mesons ~ O(N^3)
–Avoid creating larger ranked objects
Lots of mat-muls
Obvious application for accelerators
[Figure: Graphs 1, 2, 3 over timeslices t=0, 1, 2, 3, ... T=48, spread across Devices 1 to 4. Labels: More graduate students; B, B, M, M; contracting B with M to give B', then B' with M to give B'', is marked "Good"]
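Why the order matters can be seen from a small cost model that tracks only index labels. In this sketch a baryon node carries three distillation indices and a meson node two; the specific index assignment ("ace", "cd", "ef", "adf") is a hypothetical fully contracted graph chosen for illustration, and `contract` and `evaluate_plan` are not part of any real library.

```python
def contract(x, y):
    """Contract two tensors given as index strings over their shared labels.

    Returns the index string of the result and the exponent p such that the
    contraction costs ~N**p (one loop per distinct index involved).
    """
    shared = set(x) & set(y)
    result = "".join(i for i in x + y if i not in shared)
    exponent = len(set(x) | set(y))
    return result, exponent


def evaluate_plan(tensors):
    """Contract left to right; return (largest rank seen, largest cost exponent)."""
    current = tensors[0]
    max_rank, max_exponent = len(current), 0
    for nxt in tensors[1:]:
        current, exponent = contract(current, nxt)
        max_rank = max(max_rank, len(current))
        max_exponent = max(max_exponent, exponent)
    return max_rank, max_exponent


baryon, meson_1, meson_2, antibaryon = "ace", "cd", "ef", "adf"

# "Good": fold each meson into the baryon, B -> B' -> B'', staying at rank 3
good = evaluate_plan([baryon, meson_1, meson_2, antibaryon])

# "Bad": combine the two mesons first; they share no index, so rank grows to 4
bad = evaluate_plan([meson_1, meson_2, baryon, antibaryon])

print("good (max rank, cost exponent):", good)
print("bad  (max rank, cost exponent):", bad)
```

In this toy graph the good ordering never exceeds rank 3 and costs at most N^4, consistent with the baryon scaling on the slide, while the bad ordering creates a rank-4 intermediate and an N^5 step.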

19 Smells like Industrial Science…
Three main components
–Gauge generation – leadership level (strong scaling)
–Propagator calculations (solver)
–Contraction calculations (distillation)
Choice of rank N ~ 100's of sources
–Larger N helps to resolve higher excited states
Our present largest calculation:
–4 spins, 256 timeslices, 384 vectors, 1 quark
–395,264 individual solves per configuration
–Huge # for LQCD
–Fast solves enable new class of algorithms: AMG for Phis or GPUs
–New ways to reduce contraction costs: Wick contractions (determinant) via L-U factorization
–All components obvious application for accelerators
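The solve count follows from one linear solve per spin, timeslice, distillation vector and quark flavour. Multiplying the four factors listed on slide 19 gives 393,216, slightly below the 395,264 quoted there; the quoted figure would correspond to 386 vectors rather than 384. The transcript does not say which number is intended, so the sketch below simply shows the arithmetic. Variable names are illustrative.

```python
spins = 4
timeslices = 256
vectors = 384
quarks = 1

# One solve per (spin, timeslice, vector, quark) source
solves_per_configuration = spins * timeslices * vectors * quarks

quoted_on_slide = 395_264
difference = quoted_on_slide - solves_per_configuration

print(f"product of listed factors: {solves_per_configuration:,}")
print(f"quoted on slide:           {quoted_on_slide:,}")
print(f"difference:                {difference:,}")
```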

20 Some left over bits

21 Hadron nodes
Internal unit of work is a "hadron node"
In perambulator language
Hadron nodes – matrices in distillation space
[Figure label: t0]

22 Code flow
Redstar gen_graph (redstar)
–Input: correlator xml
–Output: unique graphs & hadron nodes (xml "keys")
Hadron_node (colorvec)
–Input: hadron node xml, perambulators, elementals
Harom (harom – 3d code)
–Input: hadron node xml, 3d solution vectors (stochastic)
Hadron node sdbs
–Output: sdb - hadron node xml is key, value is matrix or tensor in distillation space
Redstar npt (redstar)
–Input: correlator xml, hadron node sdbs
–Output: sdb – correlator xml is key, value is array of complex numbers
[Diagram data labels: Perams; Elementals; Solutions; Eigs]

23 Stochastic estimation
Could add noise: V space rank N, space rank d
Only want minimal number of noise insertions
Arbitrary – choose noise on antiquarks
All BLAS mat-muls

24 Stochastic estimation
Could add noise: V space rank N, space rank d
Only want minimal number of noise insertions
Arbitrary – choose noise on antiquarks
Lots of ops of lattice objects – prefactors in scaling can be large

