SciDAC Software Infrastructure for Lattice Gauge Theory
Richard C. Brower & Robert Edwards
June 24, 2003
K. Wilson (1989 Capri): "One lesson is that lattice gauge theory could also require a 10^8 increase in computer power AND spectacular algorithmic advances before useful interactions with experiment..."
ab initio Chemistry vs. ab initio QCD (by 2030?*):
– Chemistry: flops → 10 Mflops; Gaussian basis functions
– QCD: 10 Mflops → 1000 Tflops; clever collective variable?
* Hopefully sooner, but need $1/Mflops → $1/Gflops!
SciDAC: Scientific Discovery through Advanced Computing
QCD Infrastructure Project Funded (2005?)
HARDWARE:
– 10+ Tflops each at BNL, FNAL & JLab: BNL (2004), FNAL/JLab ( )
SOFTWARE:
– Enable US lattice physicists to use the BNL, FNAL & JLab machines
PHYSICS:
– Provide crucial lattice "data" that now dominate some tests of the Standard Model
– Deeper understanding of field theory (and even string theory!)
Software Infrastructure
GOALS: Create a unified software environment that will enable the US lattice community to achieve very high efficiency on diverse multi-terascale hardware.
TASKS / LIBRARIES:
I. QCD Data Parallel API: QDP
II. Optimized Message Passing: QMP
III. Optimized QCD Linear Algebra: QLA
IV. I/O, Data Files and Data Grid: QIO
V. Optimized Physics Codes: CPS/MILC/Chroma/etc.
VI. Execution Environment: unify BNL/FNAL/JLab
Participants in Software Project (partial list) * Software Coordinating Committee
Lattice QCD: extremely uniform
– Periodic or very simple boundary conditions
– SPMD: identical sublattices per processor
– Lattice operator: the Dirac operator (standard form sketched below)
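For reference, a minimal sketch of the standard Wilson form of the lattice Dirac operator (an assumption about which discretization the slide has in mind):

$$ D_{x,y} = \delta_{x,y} - \kappa \sum_{\mu=1}^{4} \left[ (1-\gamma_\mu)\, U_\mu(x)\, \delta_{x+\hat\mu,\,y} + (1+\gamma_\mu)\, U_\mu^\dagger(x-\hat\mu)\, \delta_{x-\hat\mu,\,y} \right] $$

Here $U_\mu(x)$ are the gauge links, $\gamma_\mu$ the Dirac matrices, and $\kappa$ the hopping parameter. The strictly nearest-neighbour coupling is what makes the shift-based, data-parallel operations of QDP (below) sufficient for applying the whole operator.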
SciDAC Software Structure
– Level 3: Optimised Dirac operators, inverters (optimised for P4 and QCDOC)
– Level 2: QDP (QCD Data Parallel): lattice-wide operations, data shifts (exists in C/C++)
– Level 1: QMP (QCD Message Passing), QLA (QCD Linear Algebra), QIO (XML I/O, DIME) (exists in C/C++, implemented over MPI, GM, QCDOC, gigE)
Overlapping communications and computations
C(x) = A(x) * shift(B, +mu):
– Send face forward, non-blocking, to the neighboring node.
– Receive face into a pre-allocated buffer.
– Meanwhile do A*B on the interior sites.
– "Wait" on the receive to perform A*B on the face.
Lazy evaluation (C style), as sketched below:
Shift(tmp, B, +mu);
Mult(C, A, tmp);
Data layout over processors
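As a concrete sketch of the recipe above: the code below follows the simplified declare notation of the "QMP Simple Example" slide later in the deck (the actual QMP library spells these arguments out in more detail), and the face-packing and site-loop helpers are hypothetical, not QDP internals.

#include <stddef.h>
#include "qmp.h"   /* assumed QMP header name */

/* Sketch: C(x) = A(x) * B(x+mu), overlapping the face exchange with interior
   work. The direction constant x and the simplified declare calls follow the
   "QMP Simple Example" slide; pack_face, mult_interior and mult_face are
   hypothetical helpers. */
extern void pack_face(double *send_buf, const double *B);
extern void mult_interior(double *C, const double *A, const double *B);
extern void mult_face(double *C, const double *A, const double *recv_buf);

void shifted_multiply(double *C, const double *A, const double *B,
                      double *send_buf, double *recv_buf, size_t face_bytes)
{
    QMP_msgmem_t    send_mm, recv_mm;
    QMP_msghandle_t send_mh, recv_mh;

    pack_face(send_buf, B);                            /* gather the boundary of B      */
    send_mm = QMP_declare_msgmem(send_buf, face_bytes);
    recv_mm = QMP_declare_msgmem(recv_buf, face_bytes);
    send_mh = QMP_declare_send_relative(send_mm, +x);  /* face forward, non-blocking    */
    recv_mh = QMP_declare_receive_from(recv_mm, -x);   /* into the pre-allocated buffer */

    QMP_start(recv_mh);
    QMP_start(send_mh);

    mult_interior(C, A, B);       /* meanwhile: A*B on the interior sites    */

    QMP_wait(recv_mh);
    mult_face(C, A, recv_buf);    /* face has arrived: finish boundary sites */
    QMP_wait(send_mh);
}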
QCDOC 1.5 Tflops (Fall 2003)
Performance of Dirac inverters (% of peak), for 2^4 and 4^4 sublattices per node:
– Clover Wilson (assembly): 2^4 56%, 4^4 59%
– Naive staggered (MILC): 2^4 14%, 4^4 22% (4^4 assembly: 38%)
– Asqtad force (MILC): 2^4 3%, 4^4 7%
– Asqtad force (1st attempt to optimize): 4^4 16%
As determined by the ASIC simulator with native SciDAC message passing (QMP).
Cluster Performance: 2002
Future Software Goals
Critical needs:
– Ongoing optimization, testing and hardening of the SciDAC software infrastructure
– Leverage the SciDAC QCD infrastructure through collaborative efforts with the ILDG and SciParC projects
– Develop a mechanism to maintain distributed software libraries
– Foster an international (Linux-style?) development of application code
Message Passing: QMP
Philosophy: a subset of MPI capability appropriate to QCD
– Broadcasts, global reductions, barrier
– Minimal copying / DMA where possible
– Channel-oriented / asynchronous communication
– Multidirection sends/receives for QCDOC
– Grid and switch model for node layout
– Implemented on GM and MPI; gigE nearly completed
QMP Simple Example
Sending node:
char buf[size];
QMP_msgmem_t mm;
QMP_msghandle_t mh;
mm = QMP_declare_msgmem(buf, size);
mh = QMP_declare_send_relative(mm, +x);
QMP_start(mh);
// Do computations
QMP_wait(mh);
The receiving node follows the same steps, except:
mh = QMP_declare_receive_from(mm, -x);
Multiple calls can be combined into a single message handle.
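Spelled out, the receiving side would look like the sketch below, using the same simplified notation as the slide (the actual QMP declare calls take more detailed arguments than this shorthand):

char buf[size];
QMP_msgmem_t mm;
QMP_msghandle_t mh;
mm = QMP_declare_msgmem(buf, size);
mh = QMP_declare_receive_from(mm, -x);  /* matches the sender's +x */
QMP_start(mh);
// Do computations that do not need the incoming data
QMP_wait(mh);                           /* buf now holds the neighbour's face */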
Data Parallel QDP/C, C++ API
– Hides architecture and layout
– Operates on lattice fields across sites
– Linear algebra tailored for QCD
– Shifts and permutation maps across sites
– Reductions
– Subsets
– Entry/exit: attach to existing codes
Data-parallel Operations
– Unary and binary: -a; a-b; …
– Unary functions: adj(a), cos(a), sin(a), …
– Random numbers (platform independent): random(a), gaussian(a)
– Comparisons (booleans): a <= b, …
– Broadcasts: a = 0, …
– Reductions: sum(a), …
Fields have various types (indices), as illustrated below:
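A few representative QDP++ lattice field types; the "Lattice" outer index runs over sites, while the inner indices run over colour and spin (the particular selection here is illustrative):

LatticeColorMatrix u;                   // 3x3 colour matrix on every site (gauge link)
LatticeDiracFermion psi;                // colour x spin components on every site
LatticeComplex phase;                   // one complex number per site
LatticeReal r;                          // one real number per site
multi1d<LatticeColorMatrix> gauge(Nd);  // one link field per space-time direction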
QDP Expressions
Can create expressions. QDP/C++ code:
multi1d<LatticeColorMatrix> u(Nd);
LatticeDiracFermion b, c, d;
int mu;
c[even] = u[mu] * shift(b, mu) + 2 * d;
PETE: Portable Expression Template Engine
Temporaries eliminated, expressions optimised
Linear Algebra Implementation
– Naive ops involve lattice temporaries: inefficient
– Eliminate lattice temporaries: PETE
– Allows further combining of operations (adj(x)*y)
– Overlap communications/computations
// Lattice operation
A = adj(B) + 2 * C;
// Naive evaluation: lattice temporaries
t1 = 2 * C;
t2 = adj(B);
t3 = t2 + t1;
A = t3;
// Merged lattice loop
for (i = ...; ...; ...) {
  A[i] = adj(B[i]) + 2 * C[i];
}
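To make the merged loop concrete, here is a toy, self-contained expression-template sketch in the spirit of PETE. It is not PETE's or QDP's actual machinery; the class names are invented for illustration, and adj() is omitted so the example can work on plain doubles.

#include <cstddef>
#include <vector>

// Toy expression templates: operators build a lightweight expression tree,
// and assignment evaluates it element by element in one merged loop.
template <class L, class R>
struct AddExpr {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

template <class E>
struct ScaleExpr {
    double s; const E& e;
    double operator[](std::size_t i) const { return s * e[i]; }
};

struct Field {
    std::vector<double> v;
    explicit Field(std::size_t n, double x = 0.0) : v(n, x) {}
    double operator[](std::size_t i) const { return v[i]; }

    // Assigning any expression runs a single loop: no whole-field temporaries.
    template <class E>
    Field& operator=(const E& expr) {
        for (std::size_t i = 0; i < v.size(); ++i) v[i] = expr[i];
        return *this;
    }
};

template <class L, class R>
AddExpr<L, R> operator+(const L& l, const R& r) { return {l, r}; }

template <class E>
ScaleExpr<E> operator*(double s, const E& e) { return {s, e}; }

int main() {
    Field A(16), B(16, 1.0), C(16, 2.0);
    A = B + 2.0 * C;   // evaluated as one loop: A[i] = B[i] + 2.0 * C[i]
    return 0;
}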
Binary File / Interchange Formats
– Metadata: data describing data, e.g. physics parameters
– Use XML for metadata
– File formats: files are mixed mode, XML (ascii) + binary
– Using DIME (similar to MIME) to package
– Use BinX (Edinburgh) to describe the binary
– Replica-catalog web-archive repositories
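As an illustration of the XML + binary split, a metadata record of the kind described here might look like the sketch below; the tag names and values are hypothetical, not the actual QIO/ILDG schema, and the binary payload would travel in the accompanying DIME record.

<?xml version="1.0"?>
<gaugeConfiguration>
  <action>Wilson</action>              <!-- hypothetical physics parameters -->
  <beta>6.0</beta>
  <lattice>16 16 16 32</lattice>
  <precision>double</precision>
  <checksum type="crc32">0x1a2b3c4d</checksum>
</gaugeConfiguration>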
Current Status
– Releases and documentation: QMP, QDP/C and QDP/C++ in first release
– Performance improvements and testing underway
– Porting and development of physics codes over QDP ongoing
– QIO/XML support near completion
– Cluster/QCDOC run-time environment in development
SciDAC Prototype Clusters
Myrinet + Pentium 4:
– 48 duals, 2.0 GHz, FNAL (Spring 2002)
– 128 singles, 2.0 GHz, JLab (Summer 2002)
– 128 duals, 2.4 GHz, FNAL (Fall 2002)
Gigabit Ethernet mesh + Pentium 4:
– 256 (8x8x4) singles, 2.8 GHz, JLab (Summer 2003)
– FPGA NIC for GigE (Summer 2003)
– 256 at FNAL (Fall 2003?)
Cast of Characters
Software Committee*: R. Brower (chair), C. DeTar, R. Edwards, D. Holmgren, R. Mawhinney, C. Mendes, C. Watson
Additional software: J. Chen, E. Gregory, J. Hetrick, B. Joó, C. Jung, J. Osborn, K. Petrov, A. Pochinsky, J. Simone, et al.
(* Minutes and working documents: )
Executive Committee: R. Brower, N. Christ, M. Creutz, P. Mackenzie, J. Negele, C. Rebbi, S. Sharpe, R. Sugar (chair), C. Watson