Published by Emerald Gilmore. Modified over 9 years ago.
1
LQCD Workflow: Gauge Generation, Prop Calcs, Analysis
Robert Edwards
Jefferson Lab
HackFest 2014
2
The “Good Ole’ Days”
3
Beginnings of Industrial Science
4
The Future of Industrial Science??
[Figure labels: Graduate students; Software manager; Old Fortran code; Graduate students]
5
I said Joke slide, not Goat slide
6
LQCD Workflow
Few big jobs, few big files
Many small jobs, many big files
I/O movement
[Diagram labels: Generate the configurations (leadership level; 60K cores, 10’s TF-yr); Propagators (t=0 to t=T; 4 Kepler GPUs); Contract (8 cores, CPUs); Correlators (100K – 1M copies); Analyze (100K copies)]
7
LQCD Workflow
[Diagram labels: Generate the configurations (leadership level; 60K cores, 10’s TF-yr); Propagators (t=0 to t=T; 4 Kepler GPUs, now AMG!); Contract (8 cores, CPUs); Correlators (100K – 1M copies); Analyze (100K copies)]
[Cost annotations: ~25%; > 5%; ~75%; Production cost; New analysis cost; Leadership level; Throughput mode]
8
LOTS of propagators
[Diagrams: isovector meson; isoscalar meson; time slices t=0 and t marked]
Quark propagation between 0 & t
Quark propagation 0 to 0 & t to t: expensive!
9
Reuse those propagators
Variational method: [equation not recoverable; terms labeled "Propagators" and "Operators"]
And lots of permutations of contractions…
[Diagrams: “single particle”; “single to two-particle”]
10
Distillation - mesons
Smearing in correlator
Correlator
Factorizes: operators and perambulators
11
Mesons
Rewrite correlator keeping track of smearing labels
Notation – keep track of labels as indices
Consider complexity
[Diagram label: t0]
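The factorization into operators and perambulators can be illustrated with a toy calculation. This is a minimal sketch, assuming the usual distillation form C(t, t0) = Tr[Φ(t) τ(t, t0) Φ(t0) τ(t0, t)], with spin folded into the matrix dimension. Random matrices stand in for the operator matrices (elementals) and perambulators; all names and sizes are hypothetical, and signs and gamma-matrix conventions are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vec, n_spin = 8, 4            # hypothetical sizes; the talk quotes N ~ 100's
dim = n_vec * n_spin            # spin folded into the distillation index


def random_complex(shape):
    """Random complex matrix standing in for real lattice data."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)


phi_sink = random_complex((dim, dim))   # operator at time t
phi_src = random_complex((dim, dim))    # operator at time t0
tau_fwd = random_complex((dim, dim))    # perambulator t <- t0
tau_bwd = random_complex((dim, dim))    # perambulator t0 <- t


def meson_correlator(phi_t, tau_t_t0, phi_t0, tau_t0_t):
    """Trace over a chain of matrix products; each product costs O(dim^3)."""
    return np.trace(phi_t @ tau_t_t0 @ phi_t0 @ tau_t0_t)


c_matmul = meson_correlator(phi_sink, tau_fwd, phi_src, tau_bwd)

# Same contraction with the smearing labels written out as explicit indices.
c_einsum = np.einsum("ab,bc,cd,da->", phi_sink, tau_fwd, phi_src, tau_bwd,
                     optimize=True)
print(np.isclose(c_matmul, c_einsum))
```

Writing the labels as indices, as in the einsum call, is what makes the complexity visible: every pairwise contraction is a matrix product in distillation space.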
12
More complicated mesons
Two-mesons (hadrons) require projection into irreducible representation
Can have many Wick contractions and momentum projections
–Worst case: rest -> p=100 -> 6x, p=110 -> 12x, p=111 -> 8x
[Diagram: vertices numbered 1 to 6]
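The multipliers 6x, 12x and 8x match the number of lattice momenta related to p=100, p=110 and p=111 by permuting components and flipping signs. The sketch below counts those momenta, assuming that this is where the quoted factors come from; the function name is hypothetical.

```python
from itertools import permutations, product


def momentum_orbit(p):
    """Distinct momenta obtained by permuting components of p and flipping signs."""
    orbit = set()
    for perm in permutations(p):
        for signs in product((1, -1), repeat=3):
            orbit.add(tuple(s * c for s, c in zip(signs, perm)))
    return sorted(orbit)


for p in [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]:
    print(p, len(momentum_orbit(p)))
```

Each momentum in an orbit needs its own projection, which is why the number of contractions grows quickly away from the rest frame.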
13
More complicated mesons
Two-mesons (hadrons) require projection into irreducible representation
Can have many Wick contractions and momentum projections
–Worst case: rest -> p=100 -> 6x, p=110 -> 12x, p=111 -> 8x
[Diagram label: “Graph”]
14
Correlation functions get to be expensive…
Consider I=1, A1++, 20^3 x 128 lattice
Cost of a correlator driven by number of irrep graphs
–1752 unique irrep graphs (using SU(2) isospin symmetry)
–Time ~ 11100 sec. (N=128) 8-core Xeon, ATLAS for mat-muls (“zgemms”)
Corresponding case, but all at rest
–14 unique irrep graphs
–Time ~ 85 sec.
Over all irreps & time-sources, 79MB in graph DBs, 34700 keys
Have seen up to ~2000 graphs
[Figure label: rest-frame]
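The two timings quoted on this slide can be compared per graph. A quick check using only the slide's numbers (the dictionary keys are hypothetical labels):

```python
# (unique irrep graphs, wall time in seconds) as quoted on the slide
cases = {
    "full": (1752, 11100.0),
    "at_rest": (14, 85.0),
}

per_graph = {name: secs / graphs for name, (graphs, secs) in cases.items()}
for name, secs in per_graph.items():
    print(f"{name}: {secs:.2f} s per graph")

graph_ratio = cases["full"][0] / cases["at_rest"][0]
time_ratio = cases["full"][1] / cases["at_rest"][1]
print(f"graphs x{graph_ratio:.0f}, time x{time_ratio:.0f}")
```

Both cases come out at roughly 6 seconds per graph, consistent with the statement that the cost of a correlator is driven by the number of irrep graphs.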
15
Optimize order of operations
Traverse graphs along a t-slice
–10,000’s of graphs
Also 3-particles and more…
Common sub-expression elimination
For fixed t-slice - 100’s vertices
I=1/2 K*π arXiv:1406.4158
[Diagram labels: t=0, t=1, t=2, t=3, t=T = 48; Graph 1, Graph 2, Graph 3; Device 1, Device 2, Device 3, Device 4]
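Common sub-expression elimination across graphs can be sketched as memoizing partial matrix products. This is an illustrative toy, not the scheme used in the production code: the graphs, node names and the `ProductCache` class are hypothetical, and each graph is treated as a simple chain of matrices closed by a trace.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
# Hypothetical hadron-node matrices, keyed by label.
nodes = {name: rng.standard_normal((n, n)) for name in "ABCD"}

# Hypothetical graphs that share leading sub-chains.
graphs = [("A", "B", "C"), ("A", "B", "D"), ("A", "B", "C", "D")]


class ProductCache:
    """Memoize left-to-right partial products so shared prefixes are computed once."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.cache = {}
        self.matmuls = 0

    def product(self, labels):
        labels = tuple(labels)
        if len(labels) == 1:
            return self.nodes[labels[0]]
        if labels not in self.cache:
            self.matmuls += 1
            self.cache[labels] = self.product(labels[:-1]) @ self.nodes[labels[-1]]
        return self.cache[labels]


cache = ProductCache(nodes)
values = [np.trace(cache.product(g)) for g in graphs]

naive_matmuls = sum(len(g) - 1 for g in graphs)
print(cache.matmuls, "matmuls with reuse vs", naive_matmuls, "without")
```

With thousands of graphs sharing vertices on a fixed t-slice, the saving from reuse is correspondingly larger than in this three-graph toy.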
16
Workflow is choreographed
Topology
–Vertices can have different ordinalities
Order of evaluation is important
–Baryons ~ O(N^4), Mesons ~ O(N^3)
More graduate students
[Diagram labels: t=0, t=1, t=2, t=3, t=T = 48; Graph 1, Graph 2, Graph 3; Device 1, Device 2, Device 3, Device 4; vertices B, B, M, M]
17
Workflow is choreographed
Topology
–Vertices can have different ordinalities
Order of evaluation is important
–Baryons ~ O(N^4), Mesons ~ O(N^3)
–Avoid creating larger ranked objects
More graduate students
[Diagram labels: t=0, t=1, t=2, t=3, t=T = 48; Graph 1, Graph 2, Graph 3; Device 1, Device 2, Device 3, Device 4; vertices M, M, B, B, M, M; annotations “Yuk”, “Bad”]
18
Workflow is choreographed
Topology
–Vertices can have different ordinalities
Order of evaluation is important
–Baryons ~ O(N^4), Mesons ~ O(N^3)
–Avoid creating larger ranked objects
Lots of mat-muls
Obvious application for accelerators
More graduate students
[Diagram labels: t=0, t=1, t=2, t=3, t=T = 48; Graph 1, Graph 2, Graph 3; Device 1, Device 2, Device 3, Device 4; vertices B, B, M, M, B’, B, M, B’’; annotation “Good”]
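Why the order of evaluation matters can be seen in a small example with rank-2 (meson-like) objects only. The sketch below evaluates the same trace Tr(ABCD) two ways: through pairwise matrix products, and by first forming an outer product that creates a rank-4 intermediate. The matrices are random stand-ins and the names are hypothetical; baryon nodes, which are rank 3, are not modelled here.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 12
a, b, c, d = (rng.standard_normal((n, n)) for _ in range(4))

# Good: every intermediate stays rank 2 and each step is a mat-mul, O(n^3).
good = np.trace((a @ b) @ (c @ d))

# Bad: joining two unconnected vertices first creates a rank-4 object
# with n^4 entries before any index is summed.
outer = np.einsum("ij,kl->ijkl", a, c)
bad = np.einsum("ijkl,jk,li->", outer, b, d)

print(np.isclose(good, bad))
print("largest good intermediate:", n * n, "entries")
print("largest bad intermediate:", outer.size, "entries")
```

Both orders give the same number, but only the first maps onto plain mat-muls, which is what makes the contractions a natural fit for accelerators.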
19
Smells like Industrial Science…
Three main components
–Gauge generation – leadership level (strong scaling)
–Propagator calculations (solver)
–Contraction calculations (distillation)
Choice of rank N ~ 100’s sources
–Larger N helps to resolve higher excited states
Our present largest calculation:
–4 spins, 256 timeslices, 384 vectors, 1 quark
–395,264 individual solves per configuration
–Huge # for LQCD
–Fast solves enable new class of algorithms
AMG for Phi-s or GPU-s
–New ways to reduce contraction costs
Wick contractions (determinant) via L-U factorization
–All components obvious application for accelerators
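The solve count for the largest calculation can be checked against the listed factors. A quick sketch, assuming one solve per (spin, timeslice, vector, quark) combination, which the slide does not state explicitly:

```python
spins, timeslices, vectors, quarks = 4, 256, 384, 1

# Assumption: the number of solves is the plain product of the factors.
solves = spins * timeslices * vectors * quarks
print(solves)

quoted = 395_264  # figure given on the slide
implied_vectors = quoted // (spins * timeslices * quarks)
print(implied_vectors)
```

The product of the listed factors is 393,216, slightly below the quoted 395,264; the quoted figure would correspond to 386 vectors. The slide does not say which of the two numbers is the intended one.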
20
Some left over bits
21
Hadron nodes
Internal unit of work is a “hadron node”
In perambulator language
Hadron nodes – matrices in distillation space
[Diagram label: t0]
22
Code flow
Redstar gen_graph (redstar)
Input: correlator xml
Output: unique graphs & hadron nodes (xml “keys”)
Hadron_node (colorvec)
Input: hadron node xml, perambulators, elementals
Harom (harom – 3d code)
Input: hadron node xml, 3d solution vectors (stochastic)
Output (hadron node sdbs): sdb - hadron node xml is key, value is matrix or tensor in distillation space
Redstar npt (redstar)
Input: correlator xml, hadron node sdbs
Output: sdb – correlator xml is key, value is array of complex-s
[Data labels in diagram: Eigs; Perams; Elementals; Solutions]
23
Stochastic estimation
Could add noise: V space rank N, space rank d
Only want minimal number of noise insertions
Arbitrary – choose noise on antiquarks
All BLAS mat-muls
24
Stochastic estimation
Could add noise: V space rank N, space rank d
Only want minimal number of noise insertions
Arbitrary – choose noise on antiquarks
Lots of ops of lattice objects – prefactors in scaling can be large