Next KEK machine
Shoji Hashimoto (KEK) @ 3rd ILFT Network Workshop at Jefferson Lab, Oct. 3-6, 2005

KEK supercomputer
A leading computing facility of its time:
  1985  Hitachi S810/10    350 MFlops
  1989  Hitachi S820/80    3 GFlops
  1995  Fujitsu VPP500     128 GFlops
  2000  Hitachi SR8000 F1  1.2 TFlops
  2006  ???

Formality
The "KEK Large Scale Simulation Program": a call for proposals of projects to be performed on the supercomputer. Open to Japanese researchers working on high energy accelerator science (particle and nuclear physics, astrophysics, accelerator physics, and material science related to the Photon Factory). The Program Advisory Committee (PAC) decides on approval and machine-time allocation.

Usage
Lattice QCD is the dominant user, taking about 60-80% of the computer time.
– Of that, ~60% goes to the JLQCD collaboration.
– Others include Hatsuda-Sasaki, Nakamura et al., Suganuma et al., Suzuki et al. (Kanazawa), …
Simulation for accelerator design is another big user: beam-beam simulation for the KEKB factory.

JLQCD collaboration
1995~ (on VPP500):
– Continuum limit in the quenched approximation: B_K, f_B, f_D, m_s.

JLQCD collaboration
2000~ (on SR8000):
– Dynamical QCD with the improved Wilson fermion: m_V vs m_PS^2, f_B, f_Bs, the K_l3 form factor.

Around the triangle
[figure]

The wall
Chiral extrapolation: very hard to go beyond m_s/2. This is a problem for every physical quantity. It may be solved by new algorithms and machines…
[plot: JLQCD Nf=2 (2002), MILC coarse lattice (2004), the new generation of dynamical QCD]

Upgrade
Thanks to Hideo Matsufuru (Computing Research Center, KEK) for his hard work.
– Upgrade scheduled for March 1st, 2006.
– Bids were called for from vendors.
– At least 20x more computing power, measured mainly with the QCD codes.
– No restriction on architecture (scalar or vector, etc.), but some fraction must be a shared-memory machine.
– The decision was made recently.

The next machine
A combination of two systems:
– Hitachi SR11000 K1, 16 nodes, 2.15 TFlops peak performance;
– IBM Blue Gene/L, 10 racks, 57.3 TFlops peak performance.
Hitachi Ltd. is the prime contractor.

Hitachi SR11000 K1
– POWER5+: 2.1 GHz, dual core, 2 simultaneous multiply-adds per cycle (8.4 GFlops/core), 1.875 MB L2 (on chip), 36 MB L3 (off chip).
– 8.5 GB/s chip-memory bandwidth, hardware and software prefetch.
– 16-way SMP (134.4 GFlops/node), 32 GB memory (DDR2 SDRAM).
– 16 nodes (2.15 TFlops).
– Interconnect: Federation switch, 8 GB/s (bidirectional).
Will be announced tomorrow.

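As a quick sanity check, the quoted peak figures follow directly from the per-core numbers; a minimal back-of-the-envelope sketch in C, using only the values on this slide:

#include <stdio.h>

/* Back-of-the-envelope check of the SR11000 K1 peak numbers quoted above:
   2.1 GHz POWER5+, 2 simultaneous multiply-adds per cycle, 16-way SMP, 16 nodes. */
int main(void)
{
    double per_core = 2.1 * 2 * 2;            /* 2 FMAs/cycle x 2 flops each = 8.4 GFlops */
    double per_node = per_core * 16;          /* 16-way SMP = 134.4 GFlops                */
    double system   = per_node * 16 / 1000.0; /* 16 nodes   = ~2.15 TFlops                */
    printf("core %.1f GFlops, node %.1f GFlops, system %.2f TFlops\n",
           per_core, per_node, system);
    return 0;
}
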
SR11000 node [figure]
16-way SMP [figure]
High Density Module [figure]

IBM Blue Gene/L
– Node: 2 PowerPC440 cores (dual core), 700 MHz, double FPU (5.6 GFlops/chip), 4 MB on-chip L3 (shared), 512 MB memory.
– Interconnect: 3D torus, 1.4 Gbps/link (6 in + 6 out) from each node.
– Midplane: 8x8x8 nodes (2.87 TFlops); 1 rack = 2 midplanes.
– 10-rack system.
All the information in the following comes from the Redbooks (ibm.com/redbooks) and articles in the IBM Journal of Research and Development.

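The same sort of arithmetic for the Blue Gene/L side; again a minimal C sketch that just multiplies out the figures quoted on this slide:

#include <stdio.h>

/* Peak-performance bookkeeping for the BG/L configuration above:
   700 MHz, 2 cores, 2 FMA pipes per core (double FPU), 8x8x8 nodes per
   midplane, 2 midplanes per rack, 10 racks. */
int main(void)
{
    double chip     = 0.7 * 2 * 2 * 2;      /* 5.6 GFlops/chip         */
    double midplane = chip * 8 * 8 * 8;     /* 512 nodes: ~2867 GFlops */
    double rack     = midplane * 2;         /* ~5.73 TFlops            */
    double system   = rack * 10;            /* ~57.3 TFlops            */
    printf("chip %.1f GFlops, midplane %.2f TFlops, rack %.2f TFlops, system %.1f TFlops\n",
           chip, midplane / 1000, rack / 1000, system / 1000);
    return 0;
}
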
BG/L system: 10 racks [figure]

BG/L node ASIC
– A double floating-point unit (FPU) added to each PPC440 core: 2 fused multiply-adds per core per cycle.
– Not a true SMP: L1 has no cache coherency; L2 has a snoop. Shared 4 MB L3.
– Communication between the two cores goes through the "multiported shared SRAM buffer".
– Embedded memory controller and networks.

Compute node modes
– Virtual node mode: use both CPUs separately, running a different process on each core; communication via MPI, etc.; memory and bandwidth are shared.
– Co-processor mode: use the second processor as a co-processor for communication; peak performance is halved.
– Hybrid node mode: use the second processor for computation as well; needs special care with the L1 cache incoherency; used for Linpack.

QCD code optimization
By Jun Doi and Hikaru Samukawa (IBM Japan):
– use the virtual node mode;
– fully exploit the double FPU (hand-written assembler code);
– use a low-level communication API.

Double FPU
SIMD extension of the PPC440: 32 pairs of 64-bit FP registers, with shared addresses; quadword load and store. Primary and secondary pipelines, each with a fused multiply-add. Cross operations are possible, best suited for complex arithmetic.

Examples

  Instruction                                              Mnemonic   Primary                Secondary
  Load floating parallel double indexed                    lfpdx      P_T = dw(EA)           S_T = dw(EA+8)
  Store floating parallel double indexed                   stfpdx     dw(EA) = P_T           dw(EA+8) = S_T
  Floating parallel multiply-add                           fpmadd     P_T = P_A*P_C + P_B    S_T = S_A*S_C + S_B
  Floating cross multiply-add                              fxmadd     P_T = S_A*P_C + P_B    S_T = P_A*S_C + S_B
  Asymmetric cross copy-primary nsub-primary multiply-add  fxcpnpma   P_T = -P_A*P_C + P_B   S_T = P_A*S_C + S_B
  Floating cross complex nsub-primary multiply-add         fxcxnpma   P_T = -S_A*S_C + P_B   S_T = S_A*P_C + S_B

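To make the Primary/Secondary columns concrete, here is a purely illustrative scalar model in C of three of the rows above, treating a register pair as a (primary, secondary) struct; the reg_pair type and function names are invented for this sketch and are not the actual compiler intrinsics.

#include <stdio.h>

/* Illustrative scalar model of three Double-FPU operations from the table.
   A "register" is modeled as a (primary, secondary) pair of doubles; the
   type and function names are invented for this sketch. */
typedef struct { double p, s; } reg_pair;

/* fpmadd:   P_T = P_A*P_C + P_B,   S_T = S_A*S_C + S_B */
reg_pair model_fpmadd(reg_pair a, reg_pair c, reg_pair b)
{
    reg_pair t = { a.p * c.p + b.p, a.s * c.s + b.s };
    return t;
}

/* fxmadd:   P_T = S_A*P_C + P_B,   S_T = P_A*S_C + S_B */
reg_pair model_fxmadd(reg_pair a, reg_pair c, reg_pair b)
{
    reg_pair t = { a.s * c.p + b.p, a.p * c.s + b.s };
    return t;
}

/* fxcxnpma: P_T = -S_A*S_C + P_B,  S_T = S_A*P_C + S_B */
reg_pair model_fxcxnpma(reg_pair a, reg_pair c, reg_pair b)
{
    reg_pair t = { -a.s * c.s + b.p, a.s * c.p + b.s };
    return t;
}

int main(void)
{
    /* quick numerical check of the fxcxnpma model:
       A = (2,3), C = (4,5), B = (10,20)  ->  P = -3*5 + 10 = -5,  S = 3*4 + 20 = 32 */
    reg_pair a = { 2, 3 }, c = { 4, 5 }, b = { 10, 20 };
    reg_pair t = model_fxcxnpma(a, c, b);
    printf("fxcxnpma: P = %g, S = %g\n", t.p, t.s);
    return 0;
}

Reading a register pair as one complex number (primary = real part, secondary = imaginary part), fxcxnpma supplies exactly the "second half" of a complex multiply-accumulate, which is how it is used on the next slide.
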
SU(3) matrix * vector

y[0] = u[0][0] * x[0] + u[0][1] * x[1] + u[0][2] * x[2];
y[1] = u[1][0] * x[0] + u[1][1] * x[1] + u[1][2] * x[2];
y[2] = u[2][0] * x[0] + u[2][1] * x[1] + u[2][2] * x[2];

The complex multiplication u[0][0] * x[0]:

  FXPMUL  (y[0], u[0][0], x[0])        /* re(y[0]) = re(u[0][0])*re(x[0]);   im(y[0]) = re(u[0][0])*im(x[0])   */
  FXCXNPMA(y[0], u[0][0], x[0], y[0])  /* re(y[0]) += -im(u[0][0])*im(x[0]); im(y[0]) += im(u[0][0])*re(x[0])  */

The remaining terms + u[0][1] * x[1] + u[0][2] * x[2]:

  FXCPMADD(y[0], u[0][1], x[1], y[0])
  FXCXNPMA(y[0], u[0][1], x[1], y[0])
  FXCPMADD(y[0], u[0][2], x[2], y[0])
  FXCXNPMA(y[0], u[0][2], x[2], y[0])

This must be interleaved with the other rows to avoid a pipeline stall (a 5-cycle wait).

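For reference, a plain (unoptimized) C version of what the intrinsic sequence computes, y = U*x with a 3x3 complex matrix; the cmplx type is introduced only for this sketch.

#include <stdio.h>

/* Plain-C reference for the SU(3) matrix-times-vector operation above:
   y = U * x, with U a 3x3 complex matrix and x, y complex 3-vectors. */
typedef struct { double re, im; } cmplx;

void su3_mult(cmplx y[3], cmplx u[3][3], cmplx x[3])
{
    for (int i = 0; i < 3; i++) {
        y[i].re = 0.0;
        y[i].im = 0.0;
        for (int j = 0; j < 3; j++) {
            /* complex multiply-accumulate: y[i] += u[i][j] * x[j] */
            y[i].re += u[i][j].re * x[j].re - u[i][j].im * x[j].im;
            y[i].im += u[i][j].re * x[j].im + u[i][j].im * x[j].re;
        }
    }
}

int main(void)
{
    cmplx u[3][3] = { { {1,0}, {0,0}, {0,0} },
                      { {0,0}, {1,0}, {0,0} },
                      { {0,0}, {0,0}, {1,0} } };   /* identity matrix */
    cmplx x[3] = { {1,2}, {3,4}, {5,6} }, y[3];
    su3_mult(y, u, x);
    printf("y[0] = (%g, %g)\n", y[0].re, y[0].im);  /* (1, 2) */
    return 0;
}

Each of the nine complex multiply-accumulates here corresponds to one FXCPMADD/FXCXNPMA (or FXPMUL/FXCXNPMA) pair in the hand-written version.
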
Scheduling
The 32+32 registers can hold 32 complex numbers. A gauge link takes 3x3 (= 9); a spinor takes 3x4 (= 12), and two spinors are needed, for input and output. Load the gauge link while computing, using 6+6 registers. This is straightforward for y += U*x, but not for y += conjg(U)*x. Use the inline assembler of gcc; xlf and xlc have intrinsic functions. Early xlf/xlc was not good enough to produce such code, but has improved recently.

Parallelization on BG/L
Example: a 24^3 x 48 lattice, using the virtual node mode. For a midplane, divide the entire lattice over 2x8x8x8 processors; for one rack, 2x8x8x16 (the factor 2 is within a node). To use more than one rack, a 32^3 x 64 lattice is the minimum. Each processor then holds a 12x3x3x6 (or 12x3x3x3) local lattice.

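The local sizes quoted above are just the global lattice divided evenly over the process grid; a minimal C sketch of that bookkeeping:

#include <stdio.h>

/* Even domain decomposition of the 24^3 x 48 lattice over the process grids
   quoted above, reproducing the 12x3x3x6 and 12x3x3x3 local sizes. */
int main(void)
{
    int global[4]   = { 24, 24, 24, 48 };
    int midplane[4] = {  2,  8,  8,  8 };   /* virtual node mode, half rack */
    int rack[4]     = {  2,  8,  8, 16 };   /* one rack                     */

    for (int mu = 0; mu < 4; mu++)
        printf("dir %d: midplane local = %d, rack local = %d\n",
               mu, global[mu] / midplane[mu], global[mu] / rack[mu]);
    return 0;
}
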
Communication
Communication is fast: 6 links to the nearest neighbors, 1.4 Gbps (bidirectional) per link, and a latency of 140 ns per hop.
MPI is too heavy:
– it needs an additional buffer copy, which wastes cache and memory bandwidth;
– multi-threading is not available in the virtual node mode;
– overlapping computation and communication is not possible within MPI.

"QCD Enhancement Package"
A low-level communication API:
– send/recv directly by accessing the torus interface FIFO; no copy to a memory buffer;
– non-blocking send, blocking recv;
– up to 224 bytes of data per send/recv (a spinor at one site = 192 bytes);
– assumes nearest-neighbor communication.

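The 192-byte figure is just 4 spin components x 3 colors x 2 doubles (real and imaginary); a one-line check in C, with the structs invented for this sketch:

#include <stdio.h>

/* Size check for "spinor at one site = 192 bytes": 4 spin x 3 color x (re,im)
   doubles. The types are invented for this sketch. */
typedef struct { double re, im; } complex_d;
typedef struct { complex_d c[4][3]; } site_spinor;

int main(void)
{
    printf("sizeof(site_spinor) = %zu bytes\n", sizeof(site_spinor));  /* 192 */
    return 0;
}
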
An example

#define BGLNET_WORK_REG   30
#define BGLNET_HEADER_REG 30

BGLNetQuad* fifo;

BGLNet_Send_WaitReady(BGLNET_X_PLUS, fifo, 6);   /* create the packet header */
for (i = 0; i < Nx; i++) {
  /* put results to reg 24--29 */
  BGLNet_Send_Enqueue_Header(fifo);   /* put the packet header to the send buffer */
  BGLNet_Send_Enqueue(fifo, 24);      /* put the data to the send buffer */
  BGLNet_Send_Enqueue(fifo, 25);
  BGLNet_Send_Enqueue(fifo, 26);
  BGLNet_Send_Enqueue(fifo, 27);
  BGLNet_Send_Enqueue(fifo, 28);
  BGLNet_Send_Enqueue(fifo, 29);
  BGLNet_Send_Packet(fifo);           /* kick! */
}

Benchmark
Wilson solver (BiCGstab), 24^3 x 48 lattice on a midplane (8x8x8 = 512 nodes, half a rack):
– 29.2% of the peak performance;
– 32.6% if only the Dslash is measured.
Domain-wall solver (CG), 24^3 x 48 lattice on a midplane, Ns = 16; does not fit in the on-chip L3:
– ~22% of the peak performance.

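In absolute terms these efficiencies correspond to roughly 0.8 TFlops sustained on half a rack; a quick check in C, assuming the 5.6 GFlops/chip peak quoted on the earlier BG/L slide:

#include <stdio.h>

/* Sustained performance implied by the quoted efficiencies on a midplane
   (512 nodes x 5.6 GFlops/chip peak). */
int main(void)
{
    double peak = 512 * 5.6;   /* ~2867 GFlops midplane peak */
    printf("Wilson BiCGstab : %.0f GFlops\n", peak * 0.292);
    printf("Dslash only     : %.0f GFlops\n", peak * 0.326);
    printf("Domain-wall CG  : %.0f GFlops (approx.)\n", peak * 0.22);
    return 0;
}
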
Comparison
Compared with Vranas @ Lattice 2004: ~50% improvement.

Physics target
"Future opportunities: ab initio calculations at the physical quark masses"
– Using dynamical overlap fermions.
– Details are under discussion (actions, algorithms, etc.).
– A primitive code has been written; test runs are ongoing on the SR8000.
Many things to do by March…

Summary
The new KEK machine will be made available to the Japanese lattice community on March 1st, 2006: Hitachi SR11000 (2.15 TF) + IBM Blue Gene/L (57.3 TF).