QDP++ and Chroma
Robert Edwards, Jefferson Lab

SciDAC Software Structure
Level 3: Optimised Dirac operators, inverters – optimised Wilson op for P4; Wilson and staggered ops for QCDOC
Level 2: QDP (QCD Data Parallel) – lattice-wide operations, data shifts; exists in C and C++, scalar and MPP, using QMP
Level 1: QMP (QCD Message Passing), QLA (QCD Linear Algebra) – exists, implemented over MPI, GM, gigE and on QCDOC
QIO: XML I/O, LIME

Data Parallel QDP C/C++ API
● Hides architecture and layout
● Operates on lattice fields across sites
● Linear algebra tailored for QCD
● Shifts and permutation maps across sites
● Reductions
● Subsets
● Entry/exit – attach to existing codes
● Implements SciDAC level 2

Data-parallel Operations
● Unary and binary: -a; a-b; …
● Unary functions: adj(a), cos(a), sin(a), …
● Random numbers (platform independent): random(a), gaussian(a)
● Comparisons (booleans): a <= b, …
● Broadcasts: a = 0, …
● Reductions: sum(a), …
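
For orientation, here is a rough usage sketch of these operations in QDP++-style code. It is not taken from the slides: the field names, the 4^4 lattice size, and the exact reduction return types are illustrative assumptions following common QDP++/Chroma idiom.

  // Usage sketch only (assumptions noted above), QDP++-style C++
  #include "qdp.h"
  using namespace QDP;

  int main(int argc, char** argv)
  {
    QDP_initialize(&argc, &argv);              // entry: attach QDP++ to an existing code

    multi1d<int> nrow(Nd);                     // small illustrative lattice (4^4)
    for (int i = 0; i < Nd; ++i) nrow[i] = 4;
    Layout::setLattSize(nrow);
    Layout::create();

    LatticeReal a, b, c;
    LatticeFermion phi;

    random(a);  random(b);                     // platform-independent uniform random fields
    gaussian(phi);                             // Gaussian-distributed fermion field

    c = -a;                                    // unary op across all sites
    c = a - b;                                 // binary op across all sites
    c = cos(a) + sin(b);                       // unary functions applied site by site

    LatticeBoolean flag;
    flag = (a <= b);                           // comparison gives a boolean lattice field

    b = zero;                                  // broadcast: every site set to zero
    Double s  = sum(c);                        // global reduction over the lattice
    Double n2 = norm2(phi);                    // reduction: |phi|^2

    c[rb[0]] = a + b;                          // restrict an operation to a subset (even checkerboard)

    QDP_finalize();                            // exit
    return 0;
  }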

QDP++ Type Structure
Lattice fields have various kinds of indices:
  Color: U_mu^{ab}(x)
  Spin: Gamma_{alpha beta}
  Mixed: psi_alpha^a(x), Q_{alpha beta}^{ab}(x)
The tensor product of indices forms the type (Lattice x Color x Spin x Complexity):
  Gauge fields:  Lattice   Matrix(Nc)   Scalar       Complex
  Fermions:      Lattice   Vector(Nc)   Vector(Ns)   Complex
  Scalars:       Scalar    Scalar       Scalar       Complex
  Propagators:   Lattice   Matrix(Nc)   Matrix(Ns)   Complex
  Gamma:         Scalar    Scalar       Matrix(Ns)   Complex
QDP++ forms these types via nested C++ templating.
Formation of new types (eg: half fermion) is possible.
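
As an illustration of the nested-template idea, the sketch below composes the table's index structure into C++ types. It is a simplified stand-in, not the actual QDP++ headers: the template names are invented, and the nesting follows the table's column order, which may differ from the real library.

  #include <complex>

  const int Nc = 3;   // number of colors
  const int Ns = 4;   // number of spin components

  template<class T>        struct Lattice { T* site; };        // one T per lattice site x
  template<class T>        struct Scalar  { T elem; };         // trivial (no) index
  template<class T, int N> struct Vector  { T elem[N]; };      // N components
  template<class T, int N> struct Matrix  { T elem[N][N]; };   // N x N components
  typedef std::complex<float> Complex;

  // Gauge field U_mu^{ab}(x):          Lattice . Matrix(Nc) . Scalar . Complex
  typedef Lattice< Matrix< Scalar<Complex>, Nc > >             GaugeField;

  // Fermion psi_alpha^a(x):            Lattice . Vector(Nc) . Vector(Ns) . Complex
  typedef Lattice< Vector< Vector<Complex, Ns>, Nc > >         Fermion;

  // Propagator Q_{alpha beta}^{ab}(x): Lattice . Matrix(Nc) . Matrix(Ns) . Complex
  typedef Lattice< Matrix< Matrix<Complex, Ns>, Nc > >         Propagator;

  // New combinations, e.g. a half fermion with Ns/2 spin components,
  // are formed the same way by nesting the templates differently.
  typedef Lattice< Vector< Vector<Complex, Ns/2>, Nc > >       HalfFermion;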

QDP++ Expressions
Can form expressions such as

  c^a_α(x) = U^{ab}_μ(x + ν̂) b^b_α(x) + 2 d^a_α(x)   for all x

QDP++ code:

  multi1d<LatticeColorMatrix> u(Nd);
  LatticeFermion c, b, d;
  int nu, mu;

  c = shift(u[mu], FORWARD, nu) * b + 2 * d;

PETE – Portable Expression Template Engine
Temporaries eliminated, expressions optimised.

Linear Algebra Example
Naïve ops involve lattice temporaries – inefficient. Eliminating lattice temporaries via PETE allows further combining of operations (adj(x)*y) and overlapping communications with computations.
Remaining performance limitations:
  – still have site temporaries
  – copies through "="
  – full performance requires expressions at the site level

  // Lattice operation
  A = adj(B) + 2 * C;

  // Lattice temporaries
  t1 = 2 * C;
  t2 = adj(B);
  t3 = t2 + t1;
  A = t3;

  // Merged lattice loop
  for (i = ...; ...; ...) {
    A[i] = adj(B[i]) + 2 * C[i];
  }
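
To make the mechanism concrete, below is a minimal, self-contained expression-template sketch in the spirit of PETE. It is not the actual QDP++/PETE code; the class names (Field, AddExpr, ScaleExpr) are invented. It fuses a = b + 2*c into one merged loop with no lattice temporaries.

  #include <vector>
  #include <cstddef>

  // CRTP base: anything deriving from Expr<E> can appear in an expression.
  template<class E>
  struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
  };

  // Expression node: elementwise sum of two sub-expressions.
  template<class L, class R>
  struct AddExpr : Expr< AddExpr<L, R> > {
    const L& l; const R& r;
    AddExpr(const L& l_, const R& r_) : l(l_), r(r_) {}
    double operator[](std::size_t i) const { return l[i] + r[i]; }
  };

  // Expression node: scalar multiple of a sub-expression.
  template<class E>
  struct ScaleExpr : Expr< ScaleExpr<E> > {
    double a; const E& e;
    ScaleExpr(double a_, const E& e_) : a(a_), e(e_) {}
    double operator[](std::size_t i) const { return a * e[i]; }
  };

  // Operators only build expression objects; nothing is evaluated yet.
  template<class L, class R>
  AddExpr<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return AddExpr<L, R>(l.self(), r.self());
  }
  template<class E>
  ScaleExpr<E> operator*(double a, const Expr<E>& e) {
    return ScaleExpr<E>(a, e.self());
  }

  struct Field : Expr<Field> {
    std::vector<double> data;
    explicit Field(std::size_t n = 0) : data(n) {}
    double operator[](std::size_t i) const { return data[i]; }

    // Assignment from an expression: one merged loop over sites, no lattice temporaries.
    template<class E>
    Field& operator=(const Expr<E>& expr) {
      for (std::size_t i = 0; i < data.size(); ++i) data[i] = expr.self()[i];
      return *this;
    }
  };

  int main() {
    Field a(16), b(16), c(16);
    a = b + 2.0 * c;   // evaluated as a single loop: a[i] = b[i] + 2*c[i]
    return 0;
  }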

QDP++ Optimization
Optimizations "under the hood":
● Numerically intensive operations are selected through template specialization.
● PETE recognises expression templates such as z = a*x + y from type information at compile time and calls a machine-specific optimised routine (axpyz).
● The optimized routine can use assembler, reorganize loops, etc.
● Optimized routines are selected at configuration time; unoptimized fallback routines exist for portability.
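
Schematically, the dispatch can be pictured as below. This is a sketch only, not the QDP++ implementation: evaluate() and the stand-in kernel vaxpyz are invented names, and the expression classes are reduced to bare structs.

  #include <vector>
  #include <cstddef>

  struct Field {
    std::vector<double> data;
    explicit Field(std::size_t n = 0) : data(n) {}
    double operator[](std::size_t i) const { return data[i]; }
  };

  // Expression types carrying operands (the machinery that builds them is omitted).
  template<class L, class R> struct Add {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
  };
  template<class E> struct Scale {
    double a; const E& e;
    double operator[](std::size_t i) const { return a * e[i]; }
  };

  // Stand-in for a machine-tuned routine (assembler, SSE, ...): z = a*x + y.
  void vaxpyz(Field& z, double a, const Field& x, const Field& y) {
    for (std::size_t i = 0; i < z.data.size(); ++i)
      z.data[i] = a * x.data[i] + y.data[i];
  }

  // Generic fallback: evaluate any expression elementwise (portable, unoptimized).
  template<class E>
  void evaluate(Field& dest, const E& expr) {
    for (std::size_t i = 0; i < dest.data.size(); ++i) dest.data[i] = expr[i];
  }

  // Specialization chosen at compile time when the expression has the
  // shape a*x + y; it forwards to the optimized kernel instead.
  template<>
  void evaluate(Field& dest, const Add<Scale<Field>, Field>& expr) {
    vaxpyz(dest, expr.l.a, expr.l.e, expr.r);
  }

  int main() {
    Field x(16), y(16), z(16);
    Scale<Field> ax{2.0, x};                // represents 2*x
    Add<Scale<Field>, Field> axpy{ax, y};   // represents 2*x + y
    evaluate(z, axpy);                      // dispatches to the optimized kernel
    return 0;
  }

In this picture the tuned kernel would be chosen at configuration time, with the generic elementwise loop kept as the portable fallback.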

Chroma: A Lattice QCD Library using QDP++ and QMP
Work in development. A lattice QCD toolkit/library built on top of QDP++. The library is a module – it can be linked with other codes.
Features:
● Utility libraries (gluonic measurements, smearing, etc.)
● Fermion support (DWF, Overlap, Wilson, Asqtad)
● Applications: spectroscopy, propagators & 3-pt functions, eigenvalues
● Not finished: heatbath, HMC
● Optimization hooks – level 3 Wilson-Dslash for Pentium and now QCDOC
Large commitment from UKQCD! E.g. McNeile computes propagators with CPS and measures pions with Chroma, all in the same code.

Performance Test Case – Wilson Conjugate Gradient

  LatticeFermion psi, p, r, mp, mmp;
  Real c, cp, a, d, b;

  for (int k = 1; k <= MaxCG; ++k)
  {
    // c = | r[k-1] |**2
    c = cp;

    // a[k] := | r[k-1] |**2 / < p[k], M^dag M p[k] >
    // Mp = M(u) * p
    M(mp, p, PLUS);                 // Dslash
    // d = | Mp |**2
    d = norm2(mp, s);
    a = c / d;

    // psi[k] += a[k] p[k]
    psi[s] += a * p;

    // r[k] -= a[k] M^dag M p[k]
    M(mmp, mp, MINUS);
    r[s] -= a * mmp;

    cp = norm2(r, s);
    if (cp <= rsd_sq) return;

    // b[k+1] := |r[k]|**2 / |r[k-1]|**2
    b = cp / c;

    // p[k+1] := r[k] + b[k+1] p[k]
    p[s] = r + b * p;
  }

In C++ there is significant room for performance degradation. The performance limitations here are in the linear algebra operations (VAXPY) and the norms.
Optimization: functions return a container holding the function type and operands; at "=", the expression is replaced with optimized code by template specialization:
  – VAXPY operations
  – norm squares

QCDOC Performance Benchmarks
Wilson, QCDOC at 350 MHz, 4 nodes. Mflops [% of peak] for Dslash, (a + b*D)*psi and CG:

  2^4:        279 [38.8%]   232 [32.2%]   216 [30%]   136 [19%] [Assem linalg]   124 [17%] [C++ linalg]
  4^4:        351 [48.8%]   324 [45%]     295 [41%]   283 [39%]                  236 [33%]
  4^2 x 8^2:  [49%]         323 [45%]     294 [41%]   293 [41%]                  243 [34%]

Assembly Wilson-dslash, with optimized and non-optimized vaxpy/norms. Optimized assembler routines by P. Boyle.
The percent of peak lost outside dslash reflects all overheads. QDP++ overhead is small (~1%) compared to the best code.

Pentium over Myrinet/GigE Performance Benchmarks
Wilson and DWF Dslash and CG performance, Mflops/node:
● GigE: 3D mesh, 2.6 GHz, 256 nodes / 8 nodes, 4^4 per node
● Myrinet: 3D mesh, 2.0 GHz, 128 nodes
SSE Wilson-dslash. Myrinet nodes: 2.0 GHz, 400 MHz front-side bus. GigE nodes: 2.6 GHz, 533 MHz front-side bus.
~200 Gflops sustained.

QDP++ Status
● Version 1: scalar and parallel versions
● Optimizations for P4 clusters and QCDOC
● Used now in production of propagators at JLab
● QIO (file I/O) with XML manipulation
● Supports both switch- and grid-based machines today
● Tested on QCDOC; default version on gigE and switches
● Adopted by, and supported by, UKQCD – special thanks to Balint Joo and Peter Boyle for their outstanding contributions
● High efficiency achievable on QCDOC

Future Work
QMP
  – Further QMP/GigE performance improvements
QDP++
  – Generalize the communication structure to parallel transporters – allows multi-direction shifts
  – Continue to leverage optimized routines
  – Increase the extent of optimizations for new physics applications
IO
  – Move to the new metadata standard for gauge configs
  – Ongoing infrastructure development for cataloguing/delivery
Chroma
  – Finish HMC implementations for various fermion actions (UKQCD – overlap)