SciDAC Software Infrastructure for Lattice Gauge Theory
Richard C. Brower
Annual Progress Review, JLab, May 14, 2007
Code distribution: see
Software Committee: Rich Brower (chair), Carleton DeTar, Robert Edwards, Don Holmgren, Bob Mawhinney, Chip Watson, Ying Zhang
SciDAC-2 Minutes/Documents & Progress report
Major Participants in SciDAC Project
Arizona: Doug Toussaint, Dru Renner
BU: Rich Brower *, James Osborn, Mike Clark
BNL: Chulwoo Jung, Enno Scholz, Efstratios Efstathiadis
Columbia: Bob Mawhinney *
DePaul: Massimo DiPierro
FNAL: Don Holmgren *, Jim Simone, Jim Kowalkowski, Amitoj Singh
MIT: Andrew Pochinsky, Joy Khoriaty
North Carolina: Rob Fowler, Ying Zhang *
JLab: Chip Watson *, Robert Edwards *, Jie Chen, Balint Joo
IIT: Xian-He Sun
Indiana: Steve Gottlieb, Subhasish Basak
Utah: Carleton DeTar *, Ludmila Levkova
Vanderbilt: Ted Bapty
* Software Committee. Participants funded in part by the SciDAC grant.
QCD Software Infrastructure
Goal: Create a unified software environment that enables the US lattice community to achieve very high efficiency on diverse high-performance hardware.
Requirements:
1. Build on the 20-year investment in MILC/CPS/Chroma
2. Optimize critical kernels for peak performance
3. Minimize the software effort needed to port to new platforms and to create new applications
Solution for Lattice QCD
(Perfect) load balancing: uniform periodic lattices and identical sublattices per processor.
(Complete) latency hiding: overlap computation with communication.
Data parallel: operations on small 3x3 complex matrices per link.
Critical kernels (Dirac solver, HMC forces, etc.) dominate the run time, up to roughly 90%.
Lattice Dirac operator:
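The operator itself did not survive extraction from the slide. A standard form of the Wilson lattice Dirac hopping term, consistent with the "3x3 complex matrices per link" remark above, is shown here as a reconstruction, not necessarily the exact expression on the original slide:

```latex
D_{x,x'} \;=\; \sum_{\mu=1}^{4} \tfrac{1}{2}\Big[
   (1-\gamma_\mu)\, U_\mu(x)\, \delta_{x',\,x+\hat\mu}
 \;+\; (1+\gamma_\mu)\, U_\mu^{\dagger}(x-\hat\mu)\, \delta_{x',\,x-\hat\mu} \Big]
```

Here U_mu(x) is the SU(3) link matrix on the link leaving site x in direction mu, and gamma_mu are the Dirac matrices; applying D couples each site only to its nearest neighbors, which is why the sublattice-per-processor decomposition and latency hiding above work so well.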
SciDAC-1 QCD API
Level 3: Optimized Dirac operators, inverters (optimized for P4 clusters and QCDOC)
Level 2: QDP (QCD Data Parallel): lattice-wide operations, data shifts (exists in C and C++)
Level 1: QMP (QCD Message Passing): implemented over MPI, native QCDOC, M-VIA over GigE mesh; QLA (QCD Linear Algebra)
QIO: binary data / XML metadata files (ILDG collaboration)
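As an illustration of the Level 1 layer, the sketch below posts a nearest-neighbor transfer with the QMP C interface. The 2^4 machine grid, buffer size, and minimal error handling are made up for illustration; this is only a sketch of how the layer is used, not code from the project.

```c
/* Minimal sketch of the Level 1 QMP layer: initialize the machine,
 * declare a 4-d logical topology, and post a nearest-neighbor
 * send/receive pair along the t direction. */
#include <stdio.h>
#include <qmp.h>

int main(int argc, char **argv)
{
  QMP_thread_level_t provided;
  if (QMP_init_msg_passing(&argc, &argv, QMP_THREAD_SINGLE, &provided)
      != QMP_SUCCESS) {
    fprintf(stderr, "QMP init failed\n");
    return 1;
  }

  /* 2x2x2x2 logical machine grid (illustrative) */
  int dims[4] = {2, 2, 2, 2};
  QMP_declare_logical_topology(dims, 4);

  /* Surface buffers for a shift in the +t direction (illustrative size) */
  double sendbuf[64], recvbuf[64];
  QMP_msgmem_t  smem = QMP_declare_msgmem(sendbuf, sizeof(sendbuf));
  QMP_msgmem_t  rmem = QMP_declare_msgmem(recvbuf, sizeof(recvbuf));
  QMP_msghandle_t sh = QMP_declare_send_relative(smem, 3, +1, 0);
  QMP_msghandle_t rh = QMP_declare_receive_relative(rmem, 3, -1, 0);

  /* Start both transfers, overlap local computation here, then wait. */
  QMP_start(rh);
  QMP_start(sh);
  QMP_wait(sh);
  QMP_wait(rh);

  QMP_free_msghandle(sh);
  QMP_free_msghandle(rh);
  QMP_free_msgmem(smem);
  QMP_free_msgmem(rmem);

  QMP_finalize_msg_passing();
  return 0;
}
```

Starting both transfers before waiting is what permits the latency hiding (overlap of computation and communication) claimed on the previous slide.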
Data Parallel QDP/C, C++ API
- Hides architecture and layout
- Operates on lattice fields across sites
- Linear algebra tailored for QCD
- Shifts and permutation maps across sites
- Reductions
- Subsets
- Entry/exit: attaches to existing codes
Example of QDP++ Expression
Typical term in the Dirac operator, expressed as QDP/C++ code.
Uses the Portable Expression Template Engine (PETE): temporaries are eliminated and expressions are optimized.
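The equation and code body of this slide were lost in extraction. A hedged reconstruction of the kind of expression these talks typically show, a single hopping term chi(x) = U_mu(x) psi(x + mu-hat), written against the QDP++ types (LatticeFermion, LatticeColorMatrix, shift), is:

```cpp
// Sketch of a typical QDP++ data-parallel expression: one hopping term of
// the Dirac operator, chi(x) = U_mu(x) * psi(x + mu-hat) on every site.
#include "qdp.h"
using namespace QDP;

void hop_term(LatticeFermion& chi,
              const multi1d<LatticeColorMatrix>& u,  // gauge links, one per direction
              const LatticeFermion& psi,
              int mu)
{
  // shift(psi, FORWARD, mu) fetches psi from the neighboring site in
  // direction mu, communicating across node boundaries (via QMP) as needed.
  chi = u[mu] * shift(psi, FORWARD, mu);
}
```

Because the right-hand side is a PETE expression template, the matrix-vector multiply and the shifted access are fused into a single pass over the local sub-lattice, which is the "temporaries eliminated, expressions optimized" point above.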
SciDAC-2 QCD API (SciDAC-1 components in gold, SciDAC-2 in blue; PERI, TOPS)
Application codes: MILC / CPS / Chroma / roll your own
Level 4: Workflow and data analysis tools; uniform user environment (runtime, accounting, grid); QCD Physics Toolbox (shared algorithms, building blocks, visualization, performance tools)
Level 3: QOP (optimized in asm): Dirac operator, inverters, forces, etc.
Level 2: QDP (QCD Data Parallel): lattice-wide operations, data shifts; QIO: binary / XML files & ILDG
Level 1: QMP (QCD Message Passing); QLA (QCD Linear Algebra); QMC (QCD multi-core interface)
Level 3 Domain Wall CG Inverter
[Performance plot: L_s = 16; "4g" is a P4 2.8 GHz, 800 MHz FSB cluster; 32-node partition]
Asqtad Inverter on Kaon (FNAL)
[Performance plot: Mflop/s per core vs. L, comparing MILC C code vs. SciDAC/QDP on L^4 sub-volumes for 16- and 64-core partitions of the Kaon cluster]
Level 3 on QCDOC
[Performance plots: Asqtad CG Mflop/s vs. L on L^4 sub-volumes; Asqtad RHMC kernels on 24^3 x 32 with sub-volumes 6^3 x 18; DW RHMC kernels on 32^3 x 64 x 16 with sub-volumes 4^3 x 8 x 16]
Building on SciDAC-1
Common runtime environment:
1. File transfer, batch scripts, compile targets
2. Practical 3-laboratory "Metafacility"
Fuller use of the API in application code:
1. Integrate QDP into MILC and QMP into CPS
2. Universal use of QIO, file formats, QLA, etc.
3. Level 3 interface standards
Porting the API to INCITE platforms:
1. BG/L & BG/P: QMP and QLA using XLC and Perl scripts
2. Cray XT4 and Opteron clusters
New SciDAC-2 Goals
Workflow and cluster reliability:
1. Automated campaigns to merge lattices and propagators and extract physics (FNAL & Illinois Institute of Technology)
2. Cluster reliability (FNAL & Vanderbilt)
Toolbox of shared algorithms / building blocks:
1. RHMC, eigenvector solvers, etc.
2. Visualization and performance analysis (DePaul & PERC)
3. Multi-scale algorithms (QCD/TOPS collaboration)
Exploitation of multi-core:
1. Multi-core, not clock speed, is the new paradigm
2. Plans for a QMC API (JLab & FNAL & PERC)
See the SciDAC-2 kickoff workshop, Oct 27-28.
QMC: QCD Multi-Threading
General evaluation: OpenMP vs. an explicit thread library (Chen)
- An explicit thread library can do better than OpenMP, but OpenMP performance is compiler dependent
Simple threading API: QMC
- Based on the older smp_lib (Pochinsky)
- Uses pthreads; investigate barrier synchronization algorithms
Evaluate threads for SSE-Dslash
Consider a threaded version of QMP (Fowler and Porterfield at RENCI)
[Diagram: fork/join threading model: serial code, parallel region over sites 0..7, idle threads, finalize / thread join; see the sketch below]
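A minimal sketch of the fork/join pattern in the diagram, written with plain pthreads and a barrier. The thread count, the eight-site split, and all names here are illustrative; this is not the actual QMC interface, only the pattern it is meant to package.

```c
/* Fork/join over lattice sites with pthreads: serial code, a parallel
 * region over sites 0..7, a barrier, then thread join back to serial code.
 * NTHREADS, NSITES, and the per-site "kernel" are made up for illustration. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NSITES   8

static pthread_barrier_t barrier;
static double site_data[NSITES];

typedef struct { int tid; } worker_arg_t;

static void *worker(void *p)
{
  int tid = ((worker_arg_t *)p)->tid;

  /* Parallel region: each thread handles a strided block of sites. */
  for (int s = tid; s < NSITES; s += NTHREADS)
    site_data[s] = 2.0 * s;            /* stand-in for a per-site kernel */

  /* Synchronize before any step that would read neighboring sites. */
  pthread_barrier_wait(&barrier);
  return NULL;
}

int main(void)
{
  pthread_t    threads[NTHREADS];
  worker_arg_t args[NTHREADS];

  /* Serial code ... then fork worker threads. */
  pthread_barrier_init(&barrier, NULL, NTHREADS);
  for (int t = 0; t < NTHREADS; t++) {
    args[t].tid = t;
    pthread_create(&threads[t], NULL, worker, &args[t]);
  }

  /* Finalize / thread join back to serial code. */
  for (int t = 0; t < NTHREADS; t++)
    pthread_join(threads[t], NULL);
  pthread_barrier_destroy(&barrier);

  printf("site_data[7] = %g\n", site_data[7]);
  return 0;
}
```

Wrapping exactly this pattern behind a small API is what makes it possible to compare barrier synchronization algorithms without touching the per-site kernels.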
Conclusions
Progress has been made using a common QCD API and shared libraries for communication, linear algebra, I/O, optimized inverters, etc. But full implementation, optimization, documentation, and maintenance of shared codes remain a continuing challenge, and there is much work ahead to keep up with changing hardware and algorithms. Still, new users (young and old) with no prior lattice experience have initiated new lattice QCD research using SciDAC software. The bottom line: PHYSICS is being well served.