Sunday, 12 July 2015. NCTS 2006 School on Modern Numerical Methods in Mathematics and Physics. Algorithms for Dynamical Fermions. A D Kennedy, School of Physics, The University of Edinburgh.


1 Algorithms for Dynamical Fermions
NCTS 2006 School on Modern Numerical Methods in Mathematics and Physics
A D Kennedy, School of Physics, The University of Edinburgh

2 Outline
Monte Carlo integration; importance sampling; Markov chains; detailed balance and the Metropolis algorithm; symplectic integrators; Hybrid Monte Carlo; pseudofermions; RHMC; QCD computers

3 Monte Carlo methods: I
Monte Carlo integration is based on the identification of probabilities with measures. There are much better methods of carrying out low-dimensional quadrature, but all other methods become hopelessly expensive for large dimensions. In lattice QFT there is one integration per degree of freedom: we are approximating an infinite-dimensional functional integral.

4 Monte Carlo methods: II
Generate a sequence of random field configurations chosen from the probability distribution, measure the value of Ω on each configuration, and compute the average.
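The two steps on this slide can be sketched in a few lines. This is a toy version only: the "configurations" are single real numbers drawn from a Gaussian weight exp(-x²/2), standing in for field configurations drawn from exp(-S), and the observable Ω(x) = x² is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=100_000)       # configurations ~ exp(-x^2/2)
omega = samples**2                       # observable Omega(x) = x^2 on each configuration
estimate = omega.mean()                  # Monte Carlo average, here estimating <x^2> = 1
error = omega.std(ddof=1) / np.sqrt(omega.size)   # naive 1/sqrt(N) statistical error
```

The 1/√N error estimate is only valid for independent samples; the later slides on autocorrelations explain what changes when the samples come from a Markov chain.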

5 Central Limit Theorem: I
Distribution of values for a single sample Ω = Ω(φ). The Law of Large Numbers states that the sample average converges to the expectation value; the Laplace–DeMoivre Central Limit Theorem is an asymptotic expansion for the probability distribution of the sample average, whose width is governed by the variance of the distribution of Ω.

6 Central Limit Theorem: II
Generating function for connected moments; the first few cumulants. Note that this is an asymptotic expansion.

7 Central Limit Theorem: III
Connected generating function; distribution of the average of N samples.

8 Central Limit Theorem: IV
Take the inverse Fourier transform to obtain the distribution.

9 Central Limit Theorem: V
Rescale to show convergence to the Gaussian distribution.

10 Importance Sampling: I
Sample from a distribution: probability, normalisation, integral, estimator of the integral, estimator of the variance.

11 Importance Sampling: II
Minimise the variance subject to the normalisation constraint (Lagrange multiplier); this yields the optimal measure and the optimal variance.

12 Importance Sampling: III
Example: optimal weight and optimal variance.

13 Importance Sampling: IV
1 Construct a cheap approximation to |sin(x)|
2 Calculate the relative area within each rectangle
3 Choose a random number uniformly
4 Select the corresponding rectangle
5 Choose another random number uniformly
6 Select the corresponding x value within the rectangle
7 Compute |sin(x)|
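The seven steps above can be sketched directly. Everything below except the seven steps themselves is an assumption for illustration: the integration range [0, 2π], the number of rectangles and samples, and the small floor added to the heights so that no rectangle gets zero weight.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rect, n_samp = 100, 50_000
edges = np.linspace(0.0, 2 * np.pi, n_rect + 1)
mids = 0.5 * (edges[:-1] + edges[1:])
width = edges[1] - edges[0]

heights = np.abs(np.sin(mids)) + 1e-3    # step 1: cheap rectangle approximation to |sin(x)|
areas = heights * width
probs = areas / areas.sum()              # step 2: relative area of each rectangle

idx = rng.choice(n_rect, size=n_samp, p=probs)   # steps 3-4: random number selects a rectangle
x = edges[idx] + width * rng.random(n_samp)      # steps 5-6: another selects x inside it
f = np.abs(np.sin(x))                            # step 7: compute |sin(x)|

weight = probs[idx] / width              # value of the normalised sampling density at x
estimate = np.mean(f / weight)           # importance-sampling estimate of the integral (= 4)
```

Because the sampling density closely follows |sin(x)|, the variance of f/weight is tiny, which is the whole point of importance sampling.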

14 Importance Sampling: V
With 100 rectangles we have V = 16.02328561. But we can do better! With an improved weight, again with 100 rectangles, we have V = 0.011642808.

15 Markov Chains: I
State space Ω; (ergodic) stochastic transitions P′: Ω → Ω; deterministic evolution of the probability distribution P: Q → Q. The distribution converges to a unique fixed point.

16 Convergence of Markov Chains: I
Define a metric on the space of (equivalence classes of) probability distributions. Prove that, with α > 0, the Markov process P is a contraction mapping, so the sequence Q, PQ, P²Q, P³Q, … is Cauchy. The space of probability distributions is complete, so the sequence converges to a unique fixed point.

17 Convergence of Markov Chains: II

18 Convergence of Markov Chains: III

19 Markov Chains: II
Use Markov chains to sample from Q. Suppose we can construct an ergodic Markov process P which has distribution Q as its fixed point. Start with an arbitrary state (“field configuration”) and iterate the Markov process until it has converged (“thermalized”); thereafter, successive configurations will be distributed according to Q, but in general they will be correlated. To construct P we only need relative probabilities of states: we do not know the normalisation of Q. We cannot use Markov chains to compute integrals directly, but we can compute ratios of integrals.

20 Markov Chains: III
How do we construct a Markov process with a specified fixed point? Detailed balance: integrate with respect to y to obtain the fixed-point condition; detailed balance is sufficient but not necessary for a fixed point. Metropolis algorithm: consider the two cases separately to obtain the detailed balance condition; the Metropolis choice is sufficient but not necessary for detailed balance, and other choices are possible.
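A minimal Metropolis chain makes the two points above concrete: the accept/reject step enforces detailed balance with respect to Q(x) ∝ exp(-S(x)), and only the ratio Q(x′)/Q(x) ever appears, so the normalisation of Q is never needed. The quadratic action, step size, and chain length are illustrative choices, not from the slides.

```python
import numpy as np

def metropolis(S, x0, n_steps, step=1.0, seed=2):
    """Metropolis chain with fixed point Q(x) ~ exp(-S(x))."""
    rng = np.random.default_rng(seed)
    x, chain = x0, []
    for _ in range(n_steps):
        x_new = x + step * rng.uniform(-1.0, 1.0)          # symmetric proposal
        if rng.random() < min(1.0, np.exp(S(x) - S(x_new))):
            x = x_new                                      # accept
        chain.append(x)                                    # a rejection repeats the old state
    return np.array(chain)

chain = metropolis(lambda x: 0.5 * x * x, x0=5.0, n_steps=200_000)
burnt = chain[10_000:]        # discard thermalisation, as described on slide 19
```

With S(x) = x²/2 the fixed point is a unit Gaussian, so the thermalized chain should have mean ≈ 0 and variance ≈ 1, though successive states are correlated.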

21 Markov Chains: IV
Composition of Markov steps: let P₁ and P₂ be two Markov steps which have the desired fixed-point distribution; they need not be ergodic. The composition of the two steps P₂ ∘ P₁ will also have the desired fixed point, and it may be ergodic. This trivially generalises to any (fixed) number of steps. For the case where P₁ is not ergodic but (P₁)ⁿ is, the terminology weakly and strongly ergodic is sometimes used.

22 Markov Chains: V
This result justifies “sweeping” through a lattice performing single-site updates. Each individual single-site update has the desired fixed point because it satisfies detailed balance; the entire sweep therefore has the desired fixed point, and is ergodic. But the entire sweep does not satisfy detailed balance. Of course, it would satisfy detailed balance if the sites were updated in a random order, but this is not necessary, and it is undesirable because it puts too much randomness into the system.

23 Markov Chains: VI
Coupling from the Past (Propp and Wilson, 1996): use a fixed set of random numbers. Flypaper principle: if states coalesce they stay together forever. Eventually, all states coalesce to some state α with probability one; any state from t = −∞ will coalesce to α, so α is a sample from the fixed-point distribution.

24 Markov Chains: VII
Suppose we have a lattice, that is, a partial ordering with a largest and a smallest element, and an update step that preserves it. Then once the largest and smallest states have coalesced, all the others must have too.

25 Markov Chains: VIII
A non-trivial example where this is possible is the ferromagnetic Ising model (β ≥ 0). The ordering is on spin configurations: A ≥ B iff A_s ≥ B_s for every site s. The update is a local heatbath.

26 Autocorrelations: I
Exponential autocorrelations: the unique fixed point of an ergodic Markov process corresponds to the unique eigenvector with eigenvalue 1; all its other eigenvalues must lie within the unit circle. In particular, the magnitude of the largest subleading eigenvalue corresponds to the exponential autocorrelation time.
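For a finite state space this eigenvalue statement can be checked directly. The 3-state transition matrix below is an arbitrary illustrative example (columns sum to 1); the leading eigenvalue is 1, and the standard identification τ_exp = -1/ln|λ₂| converts the subleading eigenvalue magnitude into an exponential autocorrelation time.

```python
import numpy as np

# Small ergodic transition matrix; column j gives the probabilities of
# moving from state j to each state, so each column sums to 1.
P = np.array([[0.90, 0.20, 0.10],
              [0.05, 0.70, 0.20],
              [0.05, 0.10, 0.70]])

evals = np.linalg.eigvals(P)
evals = evals[np.argsort(-np.abs(evals))]   # sort by magnitude, largest first
lam2 = np.abs(evals[1])                     # largest subleading eigenvalue
tau_exp = -1.0 / np.log(lam2)               # exponential autocorrelation time
```

Modes along the subleading eigenvector decay like |λ₂|^t = exp(-t/τ_exp), which is why |λ₂| close to 1 means slow decorrelation.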

27 Autocorrelations: II
Integrated autocorrelations: consider the autocorrelation of some operator Ω. Without loss of generality we may assume its expectation value vanishes. Define the autocorrelation function.

28 Autocorrelations: III
Define the integrated autocorrelation function. For a sufficiently large number of samples, the autocorrelation function must fall faster than the exponential autocorrelation.
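A numerical estimate of the integrated autocorrelation time can be sketched on synthetic data. The AR(1) series, the truncation window, and the convention τ_int = 1/2 + Σ_{t≥1} C(t) are all assumptions for this illustration; for an AR(1) process with coefficient ρ the autocorrelation function is C(t) = ρ^t, so the exact answer is 1/2 + ρ/(1-ρ).

```python
import numpy as np

rng = np.random.default_rng(3)
rho, n = 0.8, 200_000
noise = rng.normal(size=n) * np.sqrt(1 - rho**2)   # scaled so the stationary variance is 1
x = np.empty(n)
x[0] = 0.0
for t in range(1, n):
    x[t] = rho * x[t - 1] + noise[t]               # correlated "measurement" series

x = x - x.mean()                  # WLOG the expectation vanishes, as on slide 27
var = np.mean(x * x)
w_max = 100                       # truncation window; must be >> tau_exp
C = np.array([np.mean(x[:n - t] * x[t:]) for t in range(1, w_max)]) / var
tau_int = 0.5 + C.sum()           # here the exact value is 0.5 + 0.8/0.2 = 4.5
```

The effective number of independent samples is N/(2τ_int), which is how the naive 1/√N error bar from slide 4 must be inflated.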

29 Hybrid Monte Carlo: I
In order to carry out Monte Carlo computations including the effects of dynamical fermions we would like to find an algorithm which: updates the fields globally, because single-link updates are not cheap if the action is not local; takes large steps through configuration space, because small-step methods carry out a random walk, which leads to critical slowing down with a dynamical critical exponent z = 2 (z relates the autocorrelation to the correlation length of the system); and does not introduce any systematic errors.

30 Hybrid Monte Carlo: II
A useful class of algorithms with these properties is the (Generalised) Hybrid Monte Carlo (HMC) method. Introduce a “fictitious” momentum p corresponding to each dynamical degree of freedom q. Find a Markov chain with fixed point ∝ exp[−H(q,p)], where H is the “fictitious” Hamiltonian ½p² + S(q); the action S of the underlying QFT plays the rôle of the potential in the “fictitious” classical mechanical system. This gives the evolution of the system in a fifth dimension, “fictitious” or computer time. This generates the desired distribution exp[−S(q)] if we ignore the momenta p (i.e., the marginal distribution).

31 Hybrid Monte Carlo: III
The HMC Markov chain alternates two Markov steps: Molecular Dynamics Monte Carlo (MDMC) and (partial) momentum refreshment. Both have the desired fixed point; together they are ergodic.

32 MDMC: I
If we could integrate Hamilton’s equations exactly we could follow a trajectory of constant fictitious energy; this corresponds to a set of equiprobable fictitious phase-space configurations. Liouville’s theorem tells us that this also preserves the functional integral measure dp ∧ dq, as required. Any approximate integration scheme which is reversible and area-preserving may be used to suggest configurations to a Metropolis accept/reject test, with acceptance probability min[1, exp(−δH)].

33 MDMC: II
We build the MDMC step out of three parts: Molecular Dynamics (MD), an approximate integrator which is exactly area-preserving and reversible; a Metropolis accept/reject step, with y a uniformly distributed random number in [0,1]; and a momentum flip. The composition of these gives the MDMC update.
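The full HMC update of slides 30 to 33 can be sketched for a single degree of freedom. Everything here beyond the algorithm's structure is an illustrative assumption: the quadratic action S(q) = q²/2, full (rather than partial) momentum refreshment, and the trajectory parameters. The MD integrator is the leapfrog, which is reversible and area-preserving, so the Metropolis test with probability min[1, exp(-δH)] makes the update exact.

```python
import numpy as np

def hmc_update(q, rng, n_md=10, dt=0.1):
    """One HMC step for H(q, p) = p^2/2 + q^2/2 (toy action S(q) = q^2/2)."""
    dSdq = lambda q: q
    H = lambda q, p: 0.5 * p * p + 0.5 * q * q
    p = rng.normal()                       # momentum refreshment from a Gaussian heatbath
    q0, p0 = q, p
    p -= 0.5 * dt * dSdq(q)                # leapfrog: initial half step in p
    for _ in range(n_md - 1):
        q += dt * p
        p -= dt * dSdq(q)
    q += dt * p
    p -= 0.5 * dt * dSdq(q)                # final half step in p
    dH = H(q, p) - H(q0, p0)
    if rng.random() < min(1.0, np.exp(-dH)):
        return q                           # accept the proposed configuration
    return q0                              # reject: keep the old configuration

rng = np.random.default_rng(4)
q, samples = 0.0, []
for _ in range(50_000):
    q = hmc_update(q, rng)
    samples.append(q)
samples = np.array(samples)
```

Ignoring the momenta, the samples follow the marginal distribution exp[-S(q)], here a unit Gaussian; the momentum flip of slide 33 is omitted because with full refreshment it has no observable effect.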

34 Partial Momentum Refreshment
This mixes the Gaussian-distributed momenta p with Gaussian noise ξ. The Gaussian distribution of p is invariant under F. The extra momentum flip F ensures that for a small mixing angle the momenta are reversed after a rejection rather than after an acceptance; when the mixing angle is π/2 (complete refreshment) all momentum flips are irrelevant.

35 Symplectic Integrators: I
Baker-Campbell-Hausdorff (BCH) formula: if A and B belong to any (non-commutative) algebra then exp(A) exp(B) = exp(A + B + δ), where δ is constructed from commutators of A and B (i.e., it lies in the Free Lie Algebra generated by {A, B}).

36 Symplectic Integrators: II
Explicitly, the first few terms are ln(e^A e^B) = A + B + ½[A,B] + (1/12)([A,[A,B]] − [B,[A,B]]) + … In order to construct reversible integrators we use symmetric symplectic integrators. The corresponding identity for the symmetric product e^{A/2} e^B e^{A/2} follows directly from the BCH formula.

37 Symplectic Integrators: III
We are interested in finding the classical trajectory in phase space of a system described by the Hamiltonian H(q,p) = T(p) + S(q). The basic idea of such a symplectic integrator is to write the time-evolution operator as an exponential of the Hamiltonian vector field, and then to split that exponential.

38 Symplectic Integrators: IV
Define the operators corresponding to the two pieces of the Hamiltonian vector field. Since the kinetic energy T is a function only of p and the potential energy S is a function only of q, the action of each exponential may be evaluated trivially.

39 Symplectic Integrators: V
From the BCH formula we find that the PQP symmetric symplectic integrator (leapfrog) is given by alternating half steps and full steps of the two elementary flows. In addition to conserving energy to O(δτ²), such symmetric symplectic integrators are manifestly area-preserving and reversible.
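Both claims on this slide can be checked numerically for the harmonic oscillator H = p²/2 + q²/2 (an illustrative choice, not from the slides): the energy error after a trajectory of fixed length scales as δτ², and flipping the momentum and integrating again retraces the trajectory exactly, up to floating-point rounding.

```python
import numpy as np

def leapfrog(q, p, dt, n):
    """PQP leapfrog for H = p^2/2 + q^2/2 (force = -q), n steps of size dt."""
    p -= 0.5 * dt * q                 # initial half step in p
    for _ in range(n - 1):
        q += dt * p
        p -= dt * q
    q += dt * p
    p -= 0.5 * dt * q                 # final half step in p
    return q, p

def dH(dt, tau=1.0, q0=1.0, p0=0.3):
    """|energy error| after a trajectory of fixed length tau."""
    q, p = leapfrog(q0, p0, dt, round(tau / dt))
    return abs(0.5 * (q * q + p * p) - 0.5 * (q0 * q0 + p0 * p0))

ratio = dH(0.02) / dH(0.01)           # halving dt should reduce dH by about 4

q1, p1 = leapfrog(1.0, 0.3, 0.01, 100)
q2, p2 = leapfrog(q1, -p1, 0.01, 100) # momentum flip, then integrate again
# (q2, p2) should be the starting point (1.0, -0.3) to machine precision
```

Exact reversibility in exact arithmetic is what MDMC relies on; slide 49 discusses how rounding spoils it in practice.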

40 Symplectic Integrators: VI
For each symplectic integrator there exists a Hamiltonian H′ which is exactly conserved; it is given by substituting Poisson brackets for commutators in the BCH formula.

41 Symplectic Integrators: VII
Poisson brackets form a Lie algebra. Homework exercise: verify the Jacobi identity.

42 Symplectic Integrators: VIII
The explicit result for H′ follows from the BCH formula. Note that H′ cannot be written as the sum of a p-dependent kinetic term and a q-dependent potential term. As H′ is conserved, δH is of O(δτ²) for trajectories of arbitrary length, even if τ = O(δτ⁻ᵏ) with k > 1.

43 Symplectic Integrators: IX
Multiple timescales: split the Hamiltonian into pieces and introduce a symmetric symplectic integrator in which the cheaper pieces are integrated with a finer step size. If the step sizes are chosen appropriately then the instability in the integrator is tickled equally by each sub-step. This helps if the most expensive force computation does not correspond to the largest force.

44 Dynamical fermions: I
Pseudofermions. Direct simulation of Grassmann fields is not feasible. The problem is not that of manipulating anticommuting values in a computer; it is that the fermionic weight is not positive, and thus we get poor importance sampling. We therefore integrate out the fermion fields, which always occur quadratically, to obtain the fermion determinant. The overall sign of the exponent is unimportant.

45 Dynamical fermions: II
Any operator Ω can be expressed solely in terms of the bosonic fields: use Schwinger’s source method and integrate out the fermions. E.g., this yields the fermion propagator.

46 Dynamical fermions: III
Represent the fermion determinant as a bosonic Gaussian integral with a non-local kernel. The fermion kernel must be positive definite (all its eigenvalues must have positive real parts), otherwise the bosonic integral will not converge. The new bosonic fields are called pseudofermions. Including the determinant as part of the observable to be measured is not feasible: the determinant is extensive in the lattice volume, so again we would get poor importance sampling.

47 Dynamical fermions: IV
It is usually convenient to introduce two flavours of fermion and to write the determinant in a manifestly positive form. This not only guarantees positivity, but also allows us to generate the pseudofermions from a global heatbath by applying the appropriate kernel to a Gaussian-distributed random field. The evaluation of the pseudofermion action, and of the corresponding force in the equations of motion for the boson (gauge) fields, then requires us to find the solution of a (large) set of linear equations.

48 Dynamical fermions: V
It is not necessary to carry out the inversions required for the equations of motion exactly: there is a trade-off between the cost of computing the force and the acceptance rate of the Metropolis MDMC step. The inversion required to compute the pseudofermion action for the accept/reject step does need to be computed exactly, however. We usually take “exactly” to be synonymous with “to machine precision”.

49 Reversibility: I
Are HMC trajectories reversible and area-preserving in practice? The only fundamental source of irreversibility is the rounding error caused by using finite-precision floating-point arithmetic; floating-point arithmetic is not associative. It is more natural to store compact variables as scaled integers (fixed point): this saves memory, but does not solve the precision problem. For fermionic systems we can also introduce irreversibility by choosing the starting vector for the iterative linear-equation solver time-asymmetrically; we do this if we use a Chronological Inverter, which takes (some extrapolation of) the previous solution as the starting vector.

50 Reversibility: II
Data for SU(3) gauge theory and QCD with heavy quarks show that rounding errors are amplified exponentially: the underlying continuous-time equations of motion are chaotic, and Ляпунов (Lyapunov) exponents characterise the divergence of nearby trajectories. The instability in the integrator occurs when δH » 1, where the acceptance rate is zero anyhow.

51 Reversibility: III
In QCD the Ляпунов exponents appear to scale as the system approaches the continuum limit in such a way that a suitable combination remains constant. This can be interpreted as saying that the Ляпунов exponent characterises the chaotic nature of the continuum classical equations of motion, and is not a lattice artefact. Therefore we should not have to worry about reversibility breaking down as we approach the continuum limit. Caveat: the data are only for small lattices, and are not conclusive.

52 Polynomial approximation
What is the best polynomial approximation p(x) to a continuous function f(x) for x in [0,1]? “Best” is with respect to an appropriate Lₙ norm with n ≥ 1.

53 Weierstraß’ theorem
Weierstraß: any continuous function can be arbitrarily well approximated by a polynomial. Taking n → ∞, the Lₙ norm becomes the minimax norm.

54 Бернштейн polynomials
The explicit solution is provided by the Бернштейн (Bernstein) polynomials.

55 Чебышев’s theorem
Чебышев (Chebyshev): there is always a unique polynomial of any degree d which minimises the minimax error. The error |p(x) − f(x)| reaches its maximum at exactly d+2 points on the unit interval.

56 Чебышев’s theorem: Necessity
Suppose p − f has fewer than d+2 extrema of equal magnitude. Then at most d+1 maxima exceed some magnitude; this defines a “gap”. We can construct a polynomial q of degree d which has the opposite sign to p − f at each of these maxima (Lagrange interpolation) and whose magnitude is smaller than the gap. The polynomial p + q is then a better approximation than p to f.

57 Чебышев’s theorem: Sufficiency
Suppose there is a polynomial p′ of degree d that is at least as good an approximation as p. Then at each of the d+2 extrema the difference p′ − p changes sign, so p′ − p must have d+1 zeros on the unit interval. Thus p′ − p = 0, as it is a polynomial of degree d.

58 Чебышев polynomials
The best approximation of degree d−1 over [−1,1] to x^d is obtained by subtracting the monic Чебышев polynomial of degree d; the error is 2^{1−d}. Convergence is often exponential in d. The notation T_d is an old transliteration of Чебышев (Tchebyshev)!
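The statement above can be verified numerically with numpy's Chebyshev utilities. The leading coefficient of T_d is 2^{d-1}, so subtracting the monic polynomial 2^{1-d} T_d(x) from x^d cancels the degree-d term and leaves an error function 2^{1-d} T_d(x) whose maximum magnitude on [−1,1] is exactly 2^{1-d}. The choice d = 6 is arbitrary.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

d = 6
Td_power = cheb.cheb2poly(np.eye(d + 1)[d])    # T_d expressed in the power basis
monic = Td_power / Td_power[-1]                # monic version: leading coeff of T_d is 2^(d-1)

xd = np.zeros(d + 1)
xd[d] = 1.0                                    # coefficients of x^d
p_best = np.polynomial.Polynomial(xd - monic)  # best approximation x^d - 2^(1-d) T_d(x)
degree = p_best.trim(1e-12).degree()           # the x^d terms cancel, so degree <= d-1

x = np.linspace(-1.0, 1.0, 200_001)
error = x**d - p_best(x)                       # equals 2^(1-d) T_d(x)
max_err = np.abs(error).max()                  # should be 2^(1-d)
```

The error function equioscillates between ±2^{1-d} at d+1 points, matching Чебышев's characterisation on slide 55 (with f = x^d the extrema count is d+1 here because the approximation is over polynomials of degree d−1).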

59 Чебышев rational functions
Чебышев’s theorem is easily extended to rational approximations. Rational functions usually give a much better approximation than polynomials; those with nearly equal degrees of numerator and denominator are usually best, and convergence is still often exponential. A simple (but somewhat slow) numerical algorithm for finding the optimal Чебышев rational approximation was given by Ремез (Remez).

60 Чебышев rationals: Example
A realistic example of a rational approximation is one accurate to within almost 0.1% over the range [0.003, 1]. Using a partial-fraction expansion of such rational functions allows us to use a multishift linear equation solver, thus reducing the cost significantly; this appears to be numerically stable.

61 Polynomials versus rationals
The optimal L₂ approximation with an appropriate weight has an L₂ error of O(1/n); the optimal L∞ approximation cannot be too much better, or it would lead to a better L₂ approximation. Золотарев’s (Zolotarev’s) formula gives the L∞ error in closed form.

62 Non-linearity of CG solver
Suppose we want to solve A²x = b for Hermitian A by CG. The condition number κ(A²) = κ(A)², so it is better to solve Ax = y and Ay = b successively: the cost is then 2κ(A) < κ(A²) in general, and no tuning is needed to split the condition number evenly. Now suppose we want to solve Ax = b. Why don’t we solve A^{1/2}x = y and A^{1/2}y = b successively? The square root of A is uniquely defined if A > 0, which is the case for fermion kernels, and all this generalises trivially to nth roots. But how do we apply the square root of a matrix?
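A small dense experiment illustrates the point. The matrix size, spectrum, and tolerance below are illustrative assumptions; the CG routine is a textbook implementation, and the iteration counts it returns stand in for the cost of each approach.

```python
import numpy as np

def cg(A, b, tol=1e-8, max_iter=20_000):
    """Plain conjugate gradients for SPD A; returns (solution, iterations)."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for k in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            return x, k
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iter

rng = np.random.default_rng(5)
n = 400
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
A = Q @ np.diag(np.linspace(1.0, 400.0, n)) @ Q.T   # SPD with kappa(A) = 400
b = rng.normal(size=n)

x_direct, iters_direct = cg(A @ A, b)    # one solve at kappa(A^2) = kappa(A)^2
y, it1 = cg(A, b)
x_nested, it2 = cg(A, y)                 # two solves, each at kappa(A)
iters_nested = it1 + it2
```

Squaring the matrix squares the condition number, so the single solve at κ(A)² typically needs far more iterations than the two nested solves combined.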

63 Rational matrix approximation
Functions of matrices are defined for a Hermitian matrix by diagonalisation: if H = U D U⁻¹ then f(H) = f(U D U⁻¹) = U f(D) U⁻¹. Rational functions do not require diagonalisation: αHᵐ + βHⁿ = U(αDᵐ + βDⁿ)U⁻¹ and H⁻¹ = U D⁻¹ U⁻¹. Rational functions have nice properties: they are (relatively) cheap and accurate.

64 No Free Lunch Theorem
We must apply the rational approximation M^{1/n} ≈ r(M) within each CG iteration. The condition number for each term in the partial-fraction expansion is approximately κ(M), so the cost of applying M^{1/n} is proportional to κ(M), even though κ(M^{1/n}) = κ(M)^{1/n}, and even though κ(r(M)) ≈ κ(M)^{1/n}. So we don’t win this way…

65 Pseudofermions
We want to evaluate a functional integral including the fermionic determinant det M. We write this as a bosonic functional integral over a pseudofermion field with kernel M⁻¹.

66 Multipseudofermions
We introduce extra noise into the system by using a single pseudofermion field to sample this functional integral. This noise manifests itself as fluctuations in the force exerted by the pseudofermions on the gauge fields; this increases the maximum fermion force, which triggers the integrator instability, which in turn requires decreasing the integration step size. A better estimate is det M = [det M^{1/n}]ⁿ.

67 Hasenbusch’s method
A clever idea due to Hasenbusch: start with the Wilson fermion action M = 1 − κH, introduce the quantity M′ = 1 − κ′H, and use the identity M = M′(M′⁻¹M) to write the fermion determinant as det M = det M′ det(M′⁻¹M). Introduce separate pseudofermions for each determinant, and adjust κ′ to minimise the cost. This easily generalises to more than two pseudofermions and to the Wilson-clover action.
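The determinant splitting is easy to check numerically on a small dense model. The random symmetric "hopping matrix" H and the values of κ and κ′ are hypothetical stand-ins for the real Wilson kernel; the check below verifies both the identity det M = det M′ det(M′⁻¹M) and that each factor is better conditioned than M itself, which is the source of the cost reduction.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
H = rng.normal(size=(n, n))
H = 0.5 * (H + H.T)                              # symmetric "hopping" matrix
H /= np.abs(np.linalg.eigvalsh(H)).max()         # normalise the spectrum to [-1, 1]

kappa, kappa_prime = 0.15, 0.05                  # illustrative hopping parameters
M = np.eye(n) - kappa * H
M_prime = np.eye(n) - kappa_prime * H
ratio = np.linalg.solve(M_prime, M)              # the factor M'^-1 M

lhs = np.linalg.det(M)
rhs = np.linalg.det(M_prime) * np.linalg.det(ratio)
```

Because M and M′ are both functions of the same H, the eigenvalues of M′⁻¹M are (1 − κh)/(1 − κ′h), which spread less than those of M; splitting the determinant therefore splits the fluctuations between two milder pseudofermion kernels.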

68 Violation of NFL Theorem
So let’s try using our nth-root trick to implement multipseudofermions. Since κ(r(M)) = κ(M)^{1/n}, the maximum force is reduced by a factor of n·κ(M)^{(1/n)−1}. This is a good approximation if the condition number is dominated by a few isolated tiny eigenvalues, which is the case of interest. The cost is thus reduced by a factor of n·κ(M)^{(1/n)−1}, with optimal value n_opt ≈ ln κ(M), so the optimal cost reduction is (e ln κ)/κ. This works!

69 Rational Hybrid Monte Carlo: I
RHMC algorithm for a fermionic kernel: generate the pseudofermion from a Gaussian heatbath using an accurate rational approximation; use a less accurate approximation for the MD, chosen so that there are no double poles; and use an accurate approximation again for the Metropolis acceptance step.

70 Rational Hybrid Monte Carlo: II
Reminders: apply rational approximations using their partial-fraction expansions; the denominators are all just shifts of the original fermion kernel. All poles of optimal rational approximations are real and positive for the cases of interest (Miracle #1), and only simple poles appear (by construction!). Use a multishift solver to invert all the partial fractions using a single Krylov space; the cost is dominated by the Krylov-space construction, at least for O(20) shifts. The result is numerically stable, even in 32-bit precision. All partial fractions have positive coefficients (Miracle #2), so the MD force term is of the usual form for each partial fraction. This is applicable to any kernel.

71 Comparison with R algorithm: I
Binder cumulant of the chiral condensate, B₄, and RHMC acceptance rate A from a finite-temperature study (2+1 flavour naïve staggered fermions, Wilson gauge action, V = 8³×4, m_ud = 0.0076, m_s = 0.25, τ = 1.0):

Algorithm  δt      A     B₄
R          0.0019  ---   1.56(5)
R          0.0038  ---   1.73(4)
RHMC       0.055   84%   1.57(2)

72 Comparison with R algorithm: II
The different masses at which domain wall results were gathered, together with the step-sizes δt, acceptance rates A, and plaquettes P (V = 16³×32×8, DBW2 gauge action, β = 0.72); the step-size variation of the plaquette is shown at m_ud = 0.02:

Algorithm  m_ud  m_s   δt      A      P
R          0.04  0.04  0.01    ---    0.60812(2)
R          0.02  0.04  0.01    ---    0.60829(1)
R          0.02  0.04  0.005   ---    0.60817
RHMC       0.04  0.04  0.02    65.5%  0.60779(1)
RHMC       0.02  0.04  0.0185  69.3%  0.60809(1)

73 Comparison with R algorithm: III
The integrated autocorrelation time of the 13th time-slice of the pion propagator from the domain wall test, with m_ud = 0.04.

74 Multipseudofermions with multiple timescales
Semi-empirical observation: the largest force from a single pseudofermion does not come from the smallest shift. For example, look at the numerators in the partial-fraction expansion we exhibited earlier. Make use of this by using a coarser timescale for the more expensive smaller shifts.

75 L₂ versus L∞ Force Norms
Wilson fermion forces (from Urbach et al.), β = 5.6, 24³×32.

76 Berlin Wall for Wilson fermions
HMC results: C Urbach, K Jansen, A Schindler, and U Wenger, hep-lat/0506011 and hep-lat/0510064. Comparable performance to Lüscher’s SAP algorithm. RHMC? t.b.a.

77 Conclusions (RHMC)
Advantages of RHMC: it is exact (no step-size errors, hence no step-size extrapolations); it is significantly cheaper than the R algorithm; and it allows easy implementation of Hasenbusch (multipseudofermion) acceleration. Further improvements are possible, such as multiple timescales for different terms in the partial-fraction expansion. Disadvantages of RHMC: ???

78 QCD Machines: I
We want a cost-effective computer to solve interesting scientific problems. In fact we wanted a computer to solve lattice QCD, but it turned out that there was almost nothing in it that was not applicable to many other problems too. It is not necessary to solve all problems with one architecture (the demise of the general-purpose computer?). Development cost « hardware cost for one large machine. Simple OS and software model: interleave a few long jobs without time- or space-sharing.

79 QCD Machines: II
Take advantage of mass-market components: it is not cost- or time-effective to compete with the PC market in designing custom chips. Use standard software and tools whenever possible, but do not expect compilers or optimisers to do anything particularly smart: parallelism has to be built into algorithms and programs from the start. Hand-code critical kernels in assembler, and develop these along with the hardware design.

80 QCD Machines: III
Parallel applications: many real-world applications are intrinsically parallel, because they are approximations to continuous systems. Lattice QCD is a good example: the lattice is a discretisation of four-dimensional space-time, requiring lots of arithmetic on small complex matrices and vectors but a relatively tiny amount of I/O. Amdahl’s law: the amount of parallel work may be increased by working on a larger volume; the relevant parameter is σ, the number of sites per processor.

81 Strong Scaling
The amount of computation V^δ required to equilibrate a system of volume V increases faster than linearly. If we are to have any hope of equilibrating large systems the value of δ cannot be much larger than one; for lattice QCD we have algorithms with δ = 5/4. We are therefore driven to as small a value of σ as possible. This corresponds to “thin nodes”, as opposed to the “fat nodes” of PC clusters. Clusters are competitive in price/performance up to a certain maximum problem size; this borderline increases with time, of course.

82 Data Parallelism
All processors run the same code (not necessarily SIMD, where they share a common clock), synchronizing on communication or at explicit barriers. Types of data-parallel operation: pointwise arithmetic; nearest-neighbour shifts, perhaps simultaneously in several directions; and global operations such as broadcasts, sums, and other reductions.

83 Alternative Paradigms
Multithreading: parallelism comes from running many separate, more-or-less independent threads. Recent architectures propose running many lightweight threads on each processor to overcome memory latency. But what are the threads that need almost no memory? Calculating zillions of digits of π? Cryptography?

84 Computational Grids
In the future, carrying out large-scale computations using the Grid will be as easy as plugging into an electric socket.

85 Hardware Choices
Cost/MFlop: the goal is to be about 10 times more cost-effective than commercial machines; otherwise it is not worth the effort and risk of building our own machine. Our current Teraflops machines cost about $1/MFlop; for a Petaflops machine we will need to reach about $1/GFlop. Power/cooling: it is most cost-effective to use low-power components and high-density packaging, and life is much easier if the machine can be air-cooled.

86 Clock Speed
Peak/sustained speed: the sustained performance should be about 20%–50% of peak; otherwise there is either too much or too little floating-point hardware (real applications rarely have equal numbers of adds and multiplies). Clock speed: the “sweet spot” is currently at 0.5–1 GHz chips; high-performance chips running at 3 GHz are both hot and expensive, and using a moderate clock speed makes electrical design issues such as clock distribution much simpler.

87 Memory Systems
Memory bandwidth is currently the main bottleneck in most architectures. There are two obvious solutions. Data prefetching: vector processing is one way of doing this; it requires more sophisticated software, but is feasible for our class of applications because the control flow is essentially static (almost no data dependencies). Hierarchical memory system (NUMA). We make use of both approaches.

88 Memory Systems
On-chip memory: for QCD the memory footprint is small enough that we can put all the required data into an on-chip embedded DRAM. Cache: whether the on-chip memory is managed as a cache or as directly addressable memory is not too important, although cache flushing for communications DMA is a nuisance; caches are built into most μ-processors, so it is not worth designing our own! Off-chip memory: the amount of off-chip memory is determined by cost. If the cost of the memory is » 50% of the total cost then buy more processing nodes instead; after all, the processors are almost free!

89 Communications Network: I
This is where a massively parallel computer differs from a desktop or server machine. In the future the network will become the principal bottleneck for large data-parallel applications. We will end up designing networks and decorating them with processors and memories, which are almost free.

90 Communications Network Topology
Grid: this is the easiest to build. How many dimensions? QCDSP had 4, Blue Gene/L has 3, and QCDOC has 6; extra dimensions allow easier partitioning of the machine. Hypercube, fat tree, Ω network, butterfly network: these are all essentially an infinite-dimensional grid, good for FFTs. Switch: expensive, and does not scale well.

91 91 Sunday, 12 July 2015A D Kennedy 36 Global Operations 6810121234 5678 Grid wires Not very good error propagation O(N 1/4 ) latency (grows as the perimeter of the grid) O(N) hardware Combining network A tree which can perform arithmetic at each node Useful for global reductions But global operations can be performed using grid wires It is all a matter of cost Used in BGL, not in QCDx

92 Global Operations
Combining tree: good error propagation; O(log N) latency; O(N log N) hardware. Bit-wise operations allow arbitrary precision: used on QCDSP, but not on QCDOC or BGL, because data is sent byte-wise (for ECC).

93 Global Operations
[figure: worked bit-serial examples. Selecting the maximum of 110100 and 111000 proceeds MSB first, comparing one bit at a time and discarding the smaller operand as soon as the bits differ; Broadcast sends each bit on to all nodes as it arrives. Summing 001011 and 000111 proceeds LSB first, propagating the carry bit by bit to give 010010.]
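The two bit-serial operations on this slide can be reconstructed as a sketch. This Python model is illustrative only; the real hardware performs these operations serially on the wires as the bits stream past.

```python
# Bit-serial global operations as on a combining tree:
#  - maximum: compare bits MSB first; as soon as the operands differ,
#    the one with a 0 bit drops out of contention;
#  - sum: add bits LSB first, carrying into the next (more significant) bit.

def serial_max(a, b, width):
    """MSB-first selection of max(a, b) over `width`-bit words."""
    keep_a = keep_b = True
    for i in reversed(range(width)):            # MSB first
        bit_a, bit_b = (a >> i) & 1, (b >> i) & 1
        if keep_a and keep_b and bit_a != bit_b:
            keep_a, keep_b = bit_a == 1, bit_b == 1
    return a if keep_a else b

def serial_add(a, b, width):
    """LSB-first ripple addition of two `width`-bit words."""
    carry, result = 0, 0
    for i in range(width):                      # LSB first
        s = ((a >> i) & 1) + ((b >> i) & 1) + carry
        result |= (s & 1) << i
        carry = s >> 1
    return result

print(serial_max(0b110100, 0b111000, 6))  # 56, i.e. 0b111000
print(serial_add(0b001011, 0b000111, 6))  # 18, i.e. 0b010010
```

Note the asymmetry: selection must see the most significant bits first, while addition must see the least significant bits first, so the two operations stream the data in opposite orders.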

94 Communications Network: II
Parameters: bandwidth, latency, packet size. The ability to send small [O(10²) byte] packets between neighbours is vital if we are to run efficiently with a small number of sites per processor.
Control (DMA): the DMA engine needs to be sophisticated enough to interchange neighbouring faces without interrupting the CPU; block-strided moves; I/O completion can be polled (or interrupt driven).
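The interplay of latency, bandwidth, and packet size can be captured in a simple model. The parameter values below are illustrative assumptions, not measured QCDOC figures.

```python
# Effective bandwidth for a message of `size` bytes over a link with
# start-up latency `latency` seconds and asymptotic bandwidth `bw`
# bytes/s: transfer time = latency + size / bw.  Small packets are
# dominated by latency, which is why cheap O(100)-byte packets matter
# when there are few lattice sites per processor.

def effective_bandwidth(size, latency, bw):
    return size / (latency + size / bw)

latency = 1e-6   # 1 microsecond start-up cost (assumed)
bw = 5e8         # 500 Mbyte/s asymptotic link bandwidth (assumed)
for size in (100, 1_000, 10_000, 100_000):
    frac = effective_bandwidth(size, latency, bw) / bw
    print(f"{size:>7} bytes: {100 * frac:5.1f}% of peak")
```

With these assumed numbers a 100-byte packet achieves only about a sixth of peak bandwidth, while a 100 kbyte message is essentially at peak: low per-packet latency is what makes fine-grained parallelism viable.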

95 Packet Size

96 Block-strided Moves
[figure: a block-strided gather from a 16-element array into a contiguous buffer]
For each direction separately we specify in memory-mapped registers: source starting address; target starting address; block size; stride; number of blocks.
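A software model of the move described by those registers is sketched below. It is illustrative only; the real DMA engine works on physical addresses via memory-mapped registers rather than Python lists.

```python
# Model of a block-strided DMA move: copy `nblocks` blocks of
# `block_size` elements, starting `stride` apart in the source, into a
# contiguous target buffer.  This is exactly the access pattern needed
# to gather one face of a sub-lattice for communication.

def block_strided_move(src, src_start, block_size, stride, nblocks):
    dst = []
    for b in range(nblocks):
        base = src_start + b * stride
        dst.extend(src[base:base + block_size])
    return dst

# Gather the last column of a 4x4 array stored row-major:
memory = list(range(16))        # rows: 0-3, 4-7, 8-11, 12-15
face = block_strided_move(memory, src_start=3, block_size=1, stride=4, nblocks=4)
print(face)  # [3, 7, 11, 15]
```

Because the five parameters describe the whole pattern, the CPU writes them once and the engine streams the face without further interrupts.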

97 Hardware Design Process
Optimise the machine for the applications of interest: we run simulations of lattice QCD to tune our design. Most design tradeoffs can be evaluated using a spreadsheet, as the applications are mainly static (data independent). It also helps to debug the machine if you use the actual kernels you will run in production as test cases! But running QCD even on an RTL simulator is painfully slow.
Circuit design is done using hardware description languages (e.g., VHDL).
Time to completion is critical: there is a trade-off between the risk of a respin and delay in tape-out.

98 Performance Model
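A spreadsheet-style performance model is typically a minimum over resource bounds. The sketch below is a generic roofline-style model with made-up parameters, not the actual QCDOC design spreadsheet.

```python
# Roofline-style performance model: sustained Flops are limited by the
# slowest of compute, memory traffic, and network traffic for a kernel
# of given arithmetic intensity.  All numbers here are illustrative
# placeholders, not measured QCDOC figures.

def sustained_gflops(peak_gflops, mem_bw_gbs, net_bw_gbs,
                     flops_per_byte_mem, flops_per_byte_net):
    """Return the minimum of the compute-, memory-, and
    network-limited rates, in GFlops."""
    return min(peak_gflops,
               mem_bw_gbs * flops_per_byte_mem,
               net_bw_gbs * flops_per_byte_net)

# A Dslash-like kernel with assumed intensities of 0.25 flops/byte
# from memory and 4 flops/byte over the network:
rate = sustained_gflops(peak_gflops=1.0, mem_bw_gbs=2.6, net_bw_gbs=0.3,
                        flops_per_byte_mem=0.25, flops_per_byte_net=4.0)
print(rate)  # memory-bandwidth bound: min(1.0, 0.65, 1.2)
```

Because the kernels are data independent, a model like this predicts performance well enough that most design tradeoffs can be settled without simulating the full machine.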

99 VHDL

-- A 5-bit counter with asynchronous reset (res), synchronous reset
-- (reset), and count enable (set); written against an old pre-IEEE
-- vlbit library, in which prising detects a rising edge.
entity lcount5 is
  port(
    signal clk, res:   in    vlbit;
    signal reset, set: in    vlbit;
    signal count:      inout vlbit_1d(4 downto 0)
  );
end lcount5;

architecture bhv of lcount5 is
begin
  process
  begin
    wait until prising(clk) or res = '1';
    if res = '1' then
      count <= b"00000";
    else
      if reset = '1' then
        count <= b"00000";
      elsif set = '1' then
        count(4) <= count(4) xor (count(0) and count(1) and count(2) and count(3));
        count(3) <= count(3) xor (count(0) and count(1) and count(2));
        count(2) <= count(2) xor (count(0) and count(1));
        count(1) <= count(1) xor count(0);
        count(0) <= not count(0);
      end if;
    end if;
  end process;
end bhv;

100 QCDSP: Columbia University 0.4 TFlops; BNL 0.6 TFlops.

101 QCDOC ASIC Design
A complete processor node for a QCD computer on a chip, fabricated by IBM using the SA27E 0.18 μ logic + embedded DRAM process.
[figure: block diagram; the legend distinguishes IBM ASIC library components from custom-designed logic. Blocks include: DMA controller, DCR, SDRAM controller, interrupt controller, boot/clock support, PLB arbiter, PLB, FPU, 440 core, 3 × HSSL, SCU, PLL, EDRAM, PEC, MCMAL, Ethernet DMA, location number, IDC interface, serial number, slave OFB-PLB bridge, OFB arbiter, OFB, MII, 4 KB receive FIFO, EMACS.]
4 Mbytes of embedded DRAM
100 Mbit/sec Fast Ethernet
24 off-node links, 12 Gbit/sec bandwidth
24-link DMA communications control
2.6 Gbyte/sec EDRAM/SDRAM DMA
1 (0.8) GFlops double-precision RISC processor
8 Gbyte/sec memory-processor bandwidth
2.6 Gbyte/sec interface to external memory
PEC = Prefetching EDRAM Controller; EDRAM = Embedded DRAM; DRAM = Dynamic RAM; RAM = Random Access Memory

102 QCDOC Components

103 QCDOC

104 Edinburgh ACF
UKQCDOC is located in the ACF near Edinburgh. Our insurers will not let us show photographs of the building for security reasons!

105 Blue Gene/L

106 Vicious Design Cycle
1. Communications DMA controller required: otherwise control by the CPU is needed, requiring interrupts at each I/O completion, and the CPU becomes the bottleneck.
2. Introduce more sophisticated DMA instructions: block-strided moves for QCDSP; sequences of block-strided moves for QCDOC. Why not? It is cheap.
3. Use a CPU as the DMA engine: the second CPU in Blue Gene/L.
4. Add an FPU: why not? It is cheap, and you double your peak Flops.
5. Use the second FPU for computation: otherwise the sustained fraction of peak is embarrassingly small.
6. Go back to step 1.

107 Power Efficiencies

108 What Next?
We need a Petaflop computer for QCD. New fermion formulations solve a lot of problems, but cost significantly more in computer time. Algorithms are improving, slowly, by factors of about 2.
Can we build a machine about 10 times more cost-effective than commercial machines? Otherwise it is not worth the effort and risk of building our own. Blue Gene/L and its successors provide a new opportunity.

