LQCD Workflow: Gauge Generation, Prop Calcs, Analysis
Robert Edwards, Jefferson Lab
HackFest 2014
The “Good Ole’ Days”
Beginnings of Industrial Science
The Future of Industrial Science??
[Picture labels: graduate students, software manager, old Fortran code, graduate students]
I said Joke slide, not Goat slide
LQCD Workflow
Few big jobs, few big files; many small jobs, many big files; I/O movement
Generate the configurations – leadership level, 60K cores, 10's of TF-yr
Contract – 8 cores, CPUs; correlators; 100K – 1M copies
Analyze – 100K copies, 4 Kepler GPUs; propagators
[Diagram: lattice running from t=0 to t=T]
LQCD Workflow
Generate the configurations – leadership level, 60K cores, 10's of TF-yr
Contract – 8 cores, CPUs; correlators; 100K – 1M copies
Analyze – 100K copies, 4 Kepler GPUs (now AMG!); propagators
Cost labels on the diagram: ~25%, > 5%, ~75%; production cost, new analysis cost; leadership level, throughput mode
[Diagram: lattice running from t=0 to t=T]
LOTS of propagators
Isovector meson and isoscalar meson diagrams (t=0 to t)
– Quark propagation between 0 & t
– Quark propagation 0 to 0 & t to t: expensive!
Reuse those propagators
Variational method: propagators, operators
And lots of permutations of contractions…
– "single particle"
– "single to two-particle"
Distillation – mesons
Smearing in correlator
Correlator
Factorizes: operators and perambulators
Rewrite correlator keeping track of smearing labels
Notation – keep track of labels as indices
Consider complexity
[Diagram: mesons, source at t0]
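A minimal sketch of the factorized meson correlator, meant only to make the complexity count concrete. It assumes the usual distillation form C = Tr[Phi_snk tau Phi_src tau'], with every object an N x N matrix in distillation space. Spin structure and overall signs are left out, and all names (phi_snk, tau_fwd, phi_src, tau_bwd) and the random input data are hypothetical. Plain Python stands in for the zgemm calls a production code would use.

```python
import random


def random_matrix(n, rng):
    """N x N matrix of random complex numbers (hypothetical test data)."""
    return [[complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
            for _ in range(n)]


def matmul(a, b):
    """Plain-Python matrix product, standing in for a BLAS zgemm."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


def meson_correlator(phi_snk, tau_fwd, phi_src, tau_bwd):
    """Tr[phi_snk tau_fwd phi_src tau_bwd] as three mat-muls: O(N^3)."""
    return trace(matmul(matmul(matmul(phi_snk, tau_fwd), phi_src), tau_bwd))


def meson_correlator_indexed(phi_snk, tau_fwd, phi_src, tau_bwd):
    """Same quantity with every distillation label kept as an explicit index.

    Summing all four labels at once costs O(N^4); grouping the sums into
    matrix products, as above, brings this down to O(N^3).
    """
    n = len(phi_snk)
    total = 0j
    for a in range(n):
        for b in range(n):
            for c in range(n):
                for d in range(n):
                    total += (phi_snk[a][b] * tau_fwd[b][c]
                              * phi_src[c][d] * tau_bwd[d][a])
    return total


rng = random.Random(0)
N = 6  # tiny rank for illustration; the talk quotes N ~ 100's
phi_snk, tau_fwd, phi_src, tau_bwd = (random_matrix(N, rng) for _ in range(4))
c_matmul = meson_correlator(phi_snk, tau_fwd, phi_src, tau_bwd)
c_indexed = meson_correlator_indexed(phi_snk, tau_fwd, phi_src, tau_bwd)
```

Both functions return the same number up to rounding; only the grouping of the index sums differs.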
More complicated mesons
Two-mesons (hadrons) require projection into irreducible representation
Can have many Wick contractions and momentum projections
– Worst case: rest -> p=100 -> 6x, p=110 -> 12x, p=111 -> 8x
"Graph"
Consider I=1, A_1^{++}, 20^3 x 128 lattice
Cost of a correlator driven by number of irrep graphs
– 1752 unique irrep graphs (using SU(2) isospin symmetry)
– Time ~ sec. (N=128), 8-core Xeon, ATLAS for mat-muls ("zgemms")
Corresponding case, but all at rest (rest frame)
– 14 unique irrep graphs
– Time ~ 85 sec.
Over all irreps & time-sources, 79MB in graph DBs, keys
Have seen up to ~2000 graphs
Correlation functions get to be expensive…
Optimize order of operations
Traverse graphs along a t-slice
– 10,000's of graphs
Also 3-particles and more…
Common sub-expression elimination
For fixed t-slice – 100's of vertices
[Diagram: Graph 1, Graph 2, Graph 3 over timeslices t=0, t=1, t=2, t=3, … t=T = 48, distributed across Device 1 to Device 4]
I=1/2 K*π arXiv:
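A small sketch of what common sub-expression elimination across graphs can look like. The assumption here, not taken from the slides, is that each graph is a chain of node matrices multiplied left to right and that graphs sharing a leading chain can reuse its product. The node labels, graph names and the 2 x 2 integer matrices are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product, standing in for a BLAS mat-mul."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


def evaluate_naive(graphs, nodes):
    """Evaluate every graph from scratch and count the mat-muls."""
    results, n_matmuls = {}, 0
    for name, labels in graphs.items():
        acc = nodes[labels[0]]
        for label in labels[1:]:
            acc = matmul(acc, nodes[label])
            n_matmuls += 1
        results[name] = trace(acc)
    return results, n_matmuls


def evaluate_shared(graphs, nodes):
    """Cache each partial product, keyed by its tuple of node labels."""
    cache = {}
    n_matmuls = 0

    def product(labels):
        nonlocal n_matmuls
        if len(labels) == 1:
            return nodes[labels[0]]
        if labels not in cache:
            cache[labels] = matmul(product(labels[:-1]), nodes[labels[-1]])
            n_matmuls += 1
        return cache[labels]

    results = {name: trace(product(labels)) for name, labels in graphs.items()}
    return results, n_matmuls


nodes = {
    "A": [[1, 2], [3, 4]],
    "B": [[0, 1], [1, 0]],
    "C": [[2, 0], [0, 1]],
    "D": [[1, 1], [0, 1]],
}
graphs = {
    "graph1": ("A", "B", "C"),
    "graph2": ("A", "B", "D"),
    "graph3": ("A", "B", "C", "D"),
}
naive_results, naive_count = evaluate_naive(graphs, nodes)
shared_results, shared_count = evaluate_shared(graphs, nodes)
```

For these three toy graphs the shared evaluation needs 4 mat-muls instead of 7, because the products A·B and A·B·C are each computed once.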
Workflow is choreographed
Topology
– Vertices can have different ordinalities
Order of evaluation is important
– Baryons ~ O(N^4), Mesons ~ O(N^3)
– Avoid creating larger ranked objects
[Diagram: Graphs 1 to 3 over timeslices t=0, t=1, t=2, t=3, … t=T = 48 on Devices 1 to 4; baryon (B) and meson (M) nodes combined in a "Bad" order ("Yuk")]
More graduate students
Workflow is choreographed
Topology
– Vertices can have different ordinalities
Order of evaluation is important
– Baryons ~ O(N^4), Mesons ~ O(N^3)
– Avoid creating larger ranked objects
Lots of mat-muls
Obvious application for accelerators
[Diagram: Graphs 1 to 3 over timeslices t=0, t=1, t=2, t=3, … t=T = 48 on Devices 1 to 4; baryon (B) and meson (M) nodes combined step by step into B' and B'' in a "Good" order]
More graduate students
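A rough sketch of why the order of evaluation matters. The index pattern below (two rank-3 baryon nodes joined through two rank-2 meson nodes) is a hypothetical example, not a graph from the talk. The "good" order folds one meson at a time into the baryon tensor, so every intermediate stays rank 3 and each step costs N^4. The "bad" order first forms the outer product of the two mesons, a rank-4 object, and then pays N^5 to contract it.

```python
import itertools
import random


def random_tensor(shape, rng):
    """Nested lists of random complex numbers with the given shape."""
    if len(shape) == 1:
        return [complex(rng.uniform(-1, 1), rng.uniform(-1, 1))
                for _ in range(shape[0])]
    return [random_tensor(shape[1:], rng) for _ in range(shape[0])]


def contract_good(b1, m1, m2, b2):
    """sum B1[a,b,c] M1[c,d] M2[b,e] B2[a,e,d], one meson at a time."""
    n = len(m1)
    r = range(n)
    mults = 0
    b_prime = [[[0j] * n for _ in r] for _ in r]    # B'[a][b][d], rank 3
    for a, b, d, c in itertools.product(r, repeat=4):
        b_prime[a][b][d] += b1[a][b][c] * m1[c][d]
        mults += 1
    b_dprime = [[[0j] * n for _ in r] for _ in r]   # B''[a][e][d], rank 3
    for a, e, d, b in itertools.product(r, repeat=4):
        b_dprime[a][e][d] += b_prime[a][b][d] * m2[b][e]
        mults += 1
    total = 0j
    for a, e, d in itertools.product(r, repeat=3):
        total += b_dprime[a][e][d] * b2[a][e][d]
        mults += 1
    return total, mults


def contract_bad(b1, m1, m2, b2):
    """Same sum, but building the rank-4 outer product M1 x M2 first."""
    n = len(m1)
    r = range(n)
    mults = 0
    outer = [[[[0j] * n for _ in r] for _ in r] for _ in r]  # rank 4
    for c, d, b, e in itertools.product(r, repeat=4):
        outer[c][d][b][e] = m1[c][d] * m2[b][e]
        mults += 1
    u = [[[0j] * n for _ in r] for _ in r]
    for a, d, e, b, c in itertools.product(r, repeat=5):
        u[a][d][e] += b1[a][b][c] * outer[c][d][b][e]
        mults += 1
    total = 0j
    for a, e, d in itertools.product(r, repeat=3):
        total += u[a][d][e] * b2[a][e][d]
        mults += 1
    return total, mults


rng = random.Random(0)
N = 4  # tiny rank for illustration
b1, b2 = random_tensor((N, N, N), rng), random_tensor((N, N, N), rng)
m1, m2 = random_tensor((N, N), rng), random_tensor((N, N), rng)
good_value, good_mults = contract_good(b1, m1, m2, b2)
bad_value, bad_mults = contract_bad(b1, m1, m2, b2)
```

Both orders give the same number; the multiplication counts are 2N^4 + N^3 for the good order and N^5 + N^4 + N^3 for the bad one, and the gap widens quickly as N grows.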
Smells like Industrial Science…
Three main components
– Gauge generation: leadership level (strong scaling)
– Propagator calculations (solver)
– Contraction calculations (distillation)
Choice of rank N ~ 100's of sources
– Larger N helps to resolve higher excited states
Our present largest calculation:
– 4 spins, 256 timeslices, 384 vectors, 1 quark
– 395,264 individual solves per configuration
– Huge # for LQCD
– Fast solves enable new class of algorithms: AMG for Phis or GPUs
– New ways to reduce contraction costs: Wick contractions (determinant) via L-U factorization
– All components obvious application for accelerators
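Two small checks on this slide, offered as a sketch rather than the production method. First, the solve count: multiplying the four factors exactly as quoted (4 spins, 256 timeslices, 384 vectors, 1 quark) gives 393,216, slightly below the 395,264 on the slide, which would correspond to 386 vectors; the source does not say which figure is intended. Second, the determinant remark: the signed sum over all pairings of n sources with n sinks is the determinant of the n x n matrix of pairwise propagation amplitudes, so an L-U factorization gets it in O(n^3) instead of n! terms. The matrix g below is random, hypothetical data, and the function names are illustrative.

```python
import itertools
import random


def solves_per_configuration(spins, timeslices, vectors, quarks):
    """One solve per (spin, timeslice, distillation vector, quark flavour)."""
    return spins * timeslices * vectors * quarks


def det_by_permutations(m):
    """Signed sum over all n! permutations (the explicit Wick-style sum)."""
    n = len(m)
    total = 0
    for perm in itertools.permutations(range(n)):
        inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                         if perm[i] > perm[j])
        term = -1 if inversions % 2 else 1
        for row, col in enumerate(perm):
            term *= m[row][col]
        total += term
    return total


def det_by_lu(m):
    """Determinant from Gaussian elimination with partial pivoting: O(n^3)."""
    a = [row[:] for row in m]
    n = len(a)
    det = 1
    for k in range(n):
        pivot = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[pivot][k] == 0:
            return 0
        if pivot != k:
            a[k], a[pivot] = a[pivot], a[k]
            det = -det  # a row swap flips the sign
        det *= a[k][k]
        for i in range(k + 1, n):
            factor = a[i][k] / a[k][k]
            for j in range(k + 1, n):
                a[i][j] -= factor * a[k][j]
    return det


n_solves = solves_per_configuration(spins=4, timeslices=256, vectors=384, quarks=1)

rng = random.Random(1)
g = [[complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(5)]
     for _ in range(5)]
det_wick = det_by_permutations(g)  # 5! = 120 terms
det_lu = det_by_lu(g)              # same value from the factorization
```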
Some leftover bits
Hadron nodes
Internal unit of work is a "hadron node"
In perambulator language
– Hadron nodes: matrices in distillation space
[Diagram: hadron nodes, source at t0]
Code flow
Redstar gen_graph (redstar)
– Input: correlator xml
– Output: unique graphs & hadron nodes (xml "keys")
Hadron_node (colorvec)
– Input: hadron node xml, perambulators, elementals
Harom (harom, 3d code)
– Input: hadron node xml, 3d solution vectors (stochastic)
Output (hadron node sdbs): sdb; hadron node xml is key, value is matrix or tensor in distillation space
Redstar npt (redstar)
– Input: correlator xml, hadron node sdbs
– Output: sdb; correlator xml is key, value is array of complex numbers
Data labels on the diagram: Eigs, Perams, Elementals, Solutions
Stochastic estimation
Could add noise: V space rank N, space rank d
Only want minimal number of noise insertions
Arbitrary – choose noise on antiquarks
All BLAS mat-muls
Stochastic estimation
Could add noise: V space rank N, space rank d
Only want minimal number of noise insertions
Arbitrary – choose noise on antiquarks
Lots of ops of lattice objects – prefactors in scaling can be large
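For readers unfamiliar with the idea, here is a generic sketch of stochastic estimation with Z2 noise. It is not the specific noise scheme of the talk, whose formula did not survive on the slide. The assumption is the textbook one: d noise vectors eta with independent entries of ±1 satisfy E[eta eta^T] = 1, so the average of eta eta^T over the d vectors is a noisy stand-in for the identity in an N-dimensional space. All names and the toy matrix are hypothetical.

```python
import random


def z2_noise(n, d, rng):
    """d noise vectors of length n with independent entries +1 or -1."""
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(d)]


def noisy_identity(noise):
    """(1/d) * sum_r eta_r eta_r^T, a stochastic estimate of the identity."""
    d, n = len(noise), len(noise[0])
    return [[sum(eta[i] * eta[j] for eta in noise) / d for j in range(n)]
            for i in range(n)]


def stochastic_trace(matrix, noise):
    """(1/d) * sum_r eta_r^T A eta_r, a stochastic estimate of Tr A."""
    n = len(matrix)
    total = 0
    for eta in noise:
        total += sum(eta[i] * matrix[i][j] * eta[j]
                     for i in range(n) for j in range(n))
    return total / len(noise)


rng = random.Random(2)
N, d = 4, 8  # tiny sizes for illustration
noise = z2_noise(N, d, rng)
identity_estimate = noisy_identity(noise)

diagonal = [[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 3, 0], [0, 0, 0, 4]]
trace_estimate = stochastic_trace(diagonal, noise)
```

With Z2 noise the diagonal of the identity estimate is exactly 1 and only the off-diagonal entries fluctuate, so the trace of a diagonal matrix is reproduced exactly; for a general matrix the error comes from the off-diagonal terms and falls as more noise vectors are averaged.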