Published by Jack Andrews
1
ANR Meeting / PetaQCD LAL / Paris-Sud University, May 10-11, 2010
2
Key Computation Issues
- Large volume of data (disk / memory / network)
- Significant number of solver iterations due to numerical intractability
- Redundant memory accesses coming from interleaved data dependencies
- Use of double precision because of accuracy requirements (hardware penalty)
- Misaligned data (inherent to specific data structures)
  - Exacerbates cache misses (depending on cache size)
  - Becomes a serious problem when considering accelerators
  - Leads to « false sharing » with the shared-memory paradigm (POSIX threads, OpenMP)
  - Padding is one solution but would dramatically increase memory requirements
- Memory/computation compromise in data organization (e.g. gauge replication)
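To give a feel for the padding trade-off mentioned above, here is a minimal sketch of the arithmetic. It assumes double precision, a spinor of 4 x 3 complex components, a 3 x 3 complex gauge link, and a 128-byte alignment boundary; the boundary and all names are illustrative assumptions, not values taken from the slides.

```python
# Assumptions (not from the slides): 8-byte reals, 128-byte alignment.
BYTES_PER_REAL = 8
SPINOR_REALS = 4 * 3 * 2      # 4 spin x 3 colour complex components
GAUGE_LINK_REALS = 3 * 3 * 2  # 3x3 complex SU(3) matrix
ALIGNMENT = 128               # assumed alignment boundary in bytes


def padded_size(nbytes, alignment=ALIGNMENT):
    """Round nbytes up to the next multiple of alignment."""
    return -(-nbytes // alignment) * alignment


def padding_overhead(nbytes, alignment=ALIGNMENT):
    """Fraction of extra memory spent on padding."""
    return padded_size(nbytes, alignment) / nbytes - 1.0


spinor_bytes = SPINOR_REALS * BYTES_PER_REAL    # 192 bytes
link_bytes = GAUGE_LINK_REALS * BYTES_PER_REAL  # 144 bytes

for name, size in (("spinor", spinor_bytes), ("gauge link", link_bytes)):
    print(f"{name}: {size} B -> {padded_size(size)} B "
          f"(+{padding_overhead(size):.0%})")
```

Under these assumptions, padding each object individually costs roughly a third more memory for spinors and over three quarters more for gauge links, which is why padding alone is an expensive fix.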
3
Why the CELL Processor?
- Highest computing power in a single « computing node »
- Fast memory access
- Asynchrony between data transfers and computation
Issues with the CELL Processor?
- Data alignment (both for calculations and transfers)
- Heavy use of list DMA
- Small size of the Local Store (SPU local memory)
- Resource sharing on the Dual Cell Based Blade
- Integration into an existing standard framework
4
What we have done
- Implementation of each critical kernel on the CELL processor
  - SIMD version of basic operators
  - Appropriate DMA mechanism (efficient list DMA and double buffering)
  - Merging of consecutive operations into a single operator (latency & memory reuse)
- Aggregation of all these implementations into a single standalone library
- Effective integration into the tmLQCD package
  - Successful tests (QS20 and QS22)
  - A single SPU thread holds the whole set of routines
  - SPU thread remains « permanently » active during a working session
  - Data re-alignment
  - Routine call replacement (invoke CELL versions in place of native ones)
- This should be the way to commit this back to tmLQCD (external library and « IsCELL » switch)
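The idea of merging consecutive operations can be illustrated with a generic example: a vector update followed by a norm computation, done either as two sweeps over memory or as a single fused sweep. This is only a sketch of the principle in plain Python; the function names are hypothetical and do not correspond to tmLQCD routines or to the authors' SPU kernels.

```python
def axpy_then_norm(r, q, alpha):
    """Two passes: r <- r - alpha*q, then a second sweep for |r|^2."""
    for i in range(len(r)):
        r[i] = r[i] - alpha * q[i]
    return sum(x.real * x.real + x.imag * x.imag for x in r)


def fused_axpy_norm(r, q, alpha):
    """One pass: update r and accumulate |r|^2 while the data is at hand."""
    norm2 = 0.0
    for i in range(len(r)):
        x = r[i] - alpha * q[i]
        r[i] = x
        norm2 += x.real * x.real + x.imag * x.imag
    return norm2


r1 = [1 + 2j, 3 - 1j]
r2 = list(r1)
q = [1j, 2 + 0j]
n_two_pass = axpy_then_norm(r1, q, 0.5)
n_fused = fused_axpy_norm(r2, q, 0.5)
print(n_two_pass, n_fused)
```

Both versions produce the same vector and the same norm; the fused one reads and writes each element once, which on an SPE would translate into fewer DMA transfers for the same work.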
5
Global Organization
- Task partitioning, distribution, and synchronization are done by the PPU
- Each SPE operates on its data portion with a typical loop of the form (DMA get + SIMD computation + DMA put)
- The SPE, always active, switches to the appropriate operation on each request
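The get/compute/put loop combined with double buffering can be sketched as follows. This is a sequential Python model, not SPU code: the "get" and "put" steps stand in for asynchronous DMA transfers and are plain copies here, and the function name and log format are assumptions made for illustration. The point is the ordering: the next block is requested before the current one is processed.

```python
def double_buffered_loop(blocks, compute):
    """Process blocks using two buffers and log the order of operations.

    In real SPU code the 'get' of block i+1 would run asynchronously
    while block i is being computed; here it is a plain copy.
    """
    results = [None] * len(blocks)
    log = []
    if not blocks:
        return results, log

    buffers = [None, None]
    buffers[0] = list(blocks[0])              # prefetch the first block
    log.append(("get", 0))

    for i in range(len(blocks)):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(blocks):
            buffers[nxt] = list(blocks[i + 1])  # issue the next transfer
            log.append(("get", i + 1))
        out = compute(buffers[cur])             # work on the current buffer
        log.append(("compute", i))
        results[i] = out                        # write the result back
        log.append(("put", i))
    return results, log


results, log = double_buffered_loop([[1, 2], [3, 4], [5, 6]],
                                    lambda block: [2 * x for x in block])
print(results)
print(log)
```

With two buffers, the transfer of block i+1 never touches the buffer being computed on, which is what allows transfers and computation to overlap on hardware that supports asynchronous DMA.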
6
Optimal list DMA organization for the Wilson-Dirac Operator
- Computing the Wilson-Dirac action for a set of K contiguous spinors requires fetching 8K spinors (example below with a 32x16^3 lattice and even-odd)

  S[0]  P[2048]  P[63488]  P[128]  P[1920]  P[8]   P[120]  P[0]  P[7]
  S[1]  P[2049]  P[63489]  P[129]  P[1921]  P[9]   P[121]  P[1]  P[0]
  S[2]  P[2050]  P[63490]  P[130]  P[1922]  P[10]  P[122]  P[2]  P[1]
  S[3]  P[2051]  P[63491]  P[131]  P[1923]  P[11]  P[123]  P[3]  P[2]

- A direct list DMA to get this « spinors matrix » involves 8x4 DMA items
- A list DMA to get the « transpose » involves 7 + 1 + 1 = 9 DMA items
- Generally, our list DMA is of size 8 + c k instead of 8K (bin packing)
- No impact on SPU performance because of the uniform access to the LS
- Significant improvement in global performance and scalability
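The item counts quoted for the example can be checked by counting runs of consecutive spinor indices, since each contiguous run can be fetched with one DMA list item. The sketch below uses the neighbour indices of the K = 4 example; the helper name is hypothetical, and the last entry of the second column is taken as 63491 on the assumption that this column is contiguous like the others.

```python
def count_dma_items(indices):
    """Number of contiguous runs in a sequence of spinor indices."""
    items = 0
    prev = None
    for idx in indices:
        if prev is None or idx != prev + 1:
            items += 1          # a new run starts here
        prev = idx
    return items


# One row per output spinor S[0..3], one column per neighbour direction.
# 63491 is assumed (contiguous with 63488..63490).
neighbours = [
    [2048, 63488, 128, 1920, 8, 120, 0, 7],
    [2049, 63489, 129, 1921, 9, 121, 1, 0],
    [2050, 63490, 130, 1922, 10, 122, 2, 1],
    [2051, 63491, 131, 1923, 11, 123, 3, 2],
]

# Direct order: fetch the 8 neighbours of each spinor in turn.
direct = sum(count_dma_items(row) for row in neighbours)

# Transposed order: fetch one direction at a time for all K spinors.
transposed = sum(count_dma_items(col) for col in zip(*neighbours))

print(direct, transposed)
```

Seven of the eight columns are single contiguous runs; the last one wraps around (7, 0, 1, 2) and needs two items, which gives the 7 + 1 + 1 = 9 quoted on the slide against 8 x 4 = 32 for the direct order.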
7
Performance results
We consider a 32x16^3 lattice and the CELL-accelerated version of tmLQCD.

QS20
  #SPE  Time(s)  Speedup  GFlops
  1     0.109    1.00     0.95
  2     0.054    2.00     1.92
  3     0.036    3.00     2.89
  4     0.027    3.99     3.85
  5     0.022    4.98     4.73
  6     0.018    5.96     5.78
  7     0.015    6.93     6.94
  8     0.013    7.88     8.01

QS22
  #SPE  Time(s)  Speedup  GFlops
  1     0.0374   1.00     2.76
  2     0.0195   1.91     5.31
  3     0.0134   2.79     7.76
  4     0.0105   3.56     9.90
  5     0.0090   4.15     11.56
  6     0.0081   4.61     12.84
  7     0.0076   4.92     13.88
  8     0.0075   5.75     14.02

INTEL i7 quadcore 2.83 GHz
  Without SSE         With SSE
  1 core   4 cores    1 core   4 cores
  0.0820   0.0370     0.040    0.0280

Solvers
                   INTEL i7    CELL (8 SPEs)
                   SSE + 4c    QS20       QS22
  GCR (57 iters)   11.05 s     3.78 s     2.04 s
  CG (685 iters)   89.54 s     42.25 s    22.78 s
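One way to read these tables is through parallel efficiency, i.e. the reported speedup divided by the number of SPEs, and through the ratio of solver times between platforms. The sketch below only rearranges the figures reported above; the variable and function names are illustrative.

```python
# Reported speedups per number of SPEs (copied from the tables above).
qs20_speedup = {1: 1.00, 2: 2.00, 3: 3.00, 4: 3.99,
                5: 4.98, 6: 5.96, 7: 6.93, 8: 7.88}
qs22_speedup = {1: 1.00, 2: 1.91, 3: 2.79, 4: 3.56,
                5: 4.15, 6: 4.61, 7: 4.92, 8: 5.75}

# Reported solver wall-clock times in seconds.
solver_times = {
    "GCR": {"i7": 11.05, "QS20": 3.78, "QS22": 2.04},
    "CG": {"i7": 89.54, "QS20": 42.25, "QS22": 22.78},
}


def efficiency(speedups):
    """Parallel efficiency: speedup divided by the number of SPEs."""
    return {n: s / n for n, s in speedups.items()}


def ratio(solver, slow, fast):
    """How many times faster 'fast' is than 'slow' for a given solver."""
    return solver_times[solver][slow] / solver_times[solver][fast]


eff20 = efficiency(qs20_speedup)
eff22 = efficiency(qs22_speedup)
for n in sorted(eff20):
    print(f"{n} SPEs: QS20 {eff20[n]:.2f}  QS22 {eff22[n]:.2f}")
for name in solver_times:
    print(f"{name}: i7/QS22 = {ratio(name, 'i7', 'QS22'):.2f}, "
          f"QS20/QS22 = {ratio(name, 'QS20', 'QS22'):.2f}")
```

Efficiency stays close to 1 on the QS20 across all eight SPEs, while on the QS22 it drops noticeably past four SPEs, which is the scalability issue visible in the speedup column.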
8
Comments
- We observed a factor of 2 between QS20 and QS22
- We observed a factor of 4 between QS22 and the Intel i7 quadcore 2.83 GHz
- Good scalability on QS20
- Scalability on QS22 degrades beyond 4 SPEs (probably a binding issue on the Dual Cell Based Blade, which should be easy to fix)
- Fixing this scalability issue on QS22 will double the current performance
9
Ways for improvement
- Implement the « non GAUGE COPY » version (significant memory reduction / packing)
- Explore the SU(3) reconstruct approach at SPE level (memory and bandwidth savings)
- Have the PPU participate in the calculations (makes sense in double precision)
- Try to scale up to the 16 SPEs on the QS22 Dual Cell Based Blade
- Experiment with a cluster of CELL processors
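The SU(3) reconstruct idea relies on a standard property: the third row of an SU(3) matrix is the complex conjugate of the cross product of the first two, so only two rows (12 reals instead of 18) need to be stored or transferred. The sketch below shows this in plain Python; the function names are hypothetical and this is not the authors' SPE implementation.

```python
import cmath
import math


def reconstruct_third_row(a, b):
    """Third row of an SU(3) matrix from its first two rows: conj(a x b)."""
    cross = (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )
    return tuple(z.conjugate() for z in cross)


def compress(u):
    """Keep only the first two rows (12 reals instead of 18)."""
    return tuple(u[0]), tuple(u[1])


def decompress(rows):
    """Rebuild the full 3x3 matrix from the two stored rows."""
    a, b = rows
    return [tuple(a), tuple(b), reconstruct_third_row(a, b)]


# Example: a diagonal SU(3) matrix with phases summing to zero.
t1, t2 = 0.3, 1.1
u = [
    (cmath.exp(1j * t1), 0j, 0j),
    (0j, cmath.exp(1j * t2), 0j),
    (0j, 0j, cmath.exp(-1j * (t1 + t2))),
]
rebuilt = decompress(compress(u))
max_error = max(abs(x - y) for ru, rr in zip(u, rebuilt)
                for x, y in zip(ru, rr))
print(max_error)
```

The trade-off is a few extra multiplications per link in exchange for one third less gauge data to move, which tends to pay off when the kernel is limited by memory bandwidth rather than arithmetic.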
10
END
Two accepted conference/workshop publications
- International Conference on Supercomputing
- International Workshop on Highly Efficient Accelerators and Reconfigurable Technologies