Cell processor implementation of a MILC lattice QCD application
Guochun Shi, Volodymyr Kindratenko, Steven Gottlieb
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
2 Presentation outline
- Introduction
  1. Our view of MILC applications
  2. Introduction to Cell Broadband Engine
- Implementation in Cell/B.E.
  1. PPE performance and STREAM benchmark
  2. Profile in CPU and kernels to be ported
  3. Different approaches
- Performance
- Conclusion
3 Introduction
- Our target: the MIMD Lattice Computation (MILC) Collaboration code, dynamical clover fermions (clover_dynamical), using the hybrid molecular dynamics R algorithm
- Our view of the MILC applications: a sequence of communication and computation blocks
[Figure: original CPU-based implementation. On the CPU, compute loop 1, compute loop 2, through compute loop n alternate with MPI scatter/gather steps (MPI scatter/gather for loop 2, for loop 3, through loop n+1).]
4 Introduction
- Cell/B.E. processor
- One Power Processor Element (PPE) and eight Synergistic Processing Elements (SPEs); each SPE has 256 KB of local storage
- 3.2 GHz processor
- 25.6 GB/s processor-to-memory bandwidth
- > 200 GB/s EIB sustained aggregate bandwidth
- Theoretical peak performance: GFLOPS (SP) and GFLOPS (DP)
6 Performance in PPE
- Step 1: try to run the code on the PPE
- On the PPE it runs ~2-3x slower than on a modern CPU
- MILC is bandwidth-bound
- This agrees with what we see with the STREAM benchmark
7 Execution profile and kernels to be ported
- 10 of these subroutines are responsible for >90% of overall runtime
- All kernels together are responsible for 98.8%
8 Kernel memory access pattern
- Kernel codes must be SIMDized
- Performance is determined by how fast you DMA the data in and out, not by the SIMDized code
- In each iteration, only small elements are accessed
  - Lattice site: 1832 bytes
  - su3_matrix: 72 bytes
  - wilson_vector: 96 bytes
- Challenge: how to get data into the SPUs as fast as possible?
  - Cell/B.E. has the best DMA performance when data is aligned to 128 bytes and the size is a multiple of 128 bytes
  - The data layout in MILC meets neither condition

One sample kernel from the udadu_mu_nu() routine:

    #define FORSOMEPARITY(i,s,choice) \
        for( i=((choice)==ODD ? even_sites_on_node : 0 ), \
             s= &(lattice[i]); \
             i< ( (choice)==EVEN ? even_sites_on_node : sites_on_node); \
             i++,s++)

    FORSOMEPARITY(i,s,parity) {
        mult_adj_mat_wilson_vec( &(s->link[nu]), ((wilson_vector *)F_PT(s,rsrc)), &rtemp );
        mult_adj_mat_wilson_vec( (su3_matrix *)(gen_pt[1][i]), &rtemp, &(s->tmp) );
        mult_mat_wilson_vec( (su3_matrix *)(gen_pt[0][i]), &(s->tmp), &rtemp );
        mult_mat_wilson_vec( &(s->link[mu]), &rtemp, &(s->tmp) );
        mult_sigma_mu_nu( &(s->tmp), &rtemp, mu, nu );
        su3_projector_w( &rtemp, ((wilson_vector *)F_PT(s,lsrc)), ((su3_matrix*)F_PT(s,mat)) );
    }

[Figure: data accesses for lattice site 0, including data from neighbor sites]
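The two DMA conditions above can be checked directly against the element sizes quoted on this slide. The following is a minimal sketch, assuming single-precision layouts modeled on MILC's types; the name `complex_f`, the constant `DMA_LINE`, and the helper `dma_friendly` are illustrative and not part of the MILC code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed single-precision layouts, modeled on MILC's types. */
typedef struct { float real, imag; } complex_f;       /* 8 bytes */
typedef struct { complex_f e[3][3]; } su3_matrix;     /* 9 * 8 = 72 bytes */
typedef struct { complex_f c[3]; } su3_vector;        /* 3 * 8 = 24 bytes */
typedef struct { su3_vector d[4]; } wilson_vector;    /* 4 * 24 = 96 bytes */

#define DMA_LINE 128u  /* alignment and size granularity named on the slide */

/* Hypothetical helper: returns 1 only if both the address and the size
   are multiples of 128 bytes, 0 otherwise. */
int dma_friendly(const void *addr, size_t size)
{
    assert(addr != NULL);
    return ((uintptr_t)addr % DMA_LINE == 0) && (size % DMA_LINE == 0);
}
```

Under these assumptions neither a 72-byte su3_matrix nor a 96-byte wilson_vector passes the size check, even when it happens to start on a 128-byte boundary.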
9 Approach I: packing and unpacking
- Good performance in DMA operations
- Packing and unpacking are expensive on the PPE
[Figure: struct site array in PPE main memory; packing and unpacking on the PPE; DMA operations to and from the SPEs]
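The packing step can be pictured as a gather from the array of sites into one contiguous buffer, and unpacking as the matching scatter. This is a minimal sketch, assuming a heavily reduced stand-in for MILC's struct site; the `site` fields shown and the functions `pack_links` and `unpack_tmp` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float real, imag; } complex_f;
typedef struct { complex_f e[3][3]; } su3_matrix;

/* Hypothetical, reduced stand-in for MILC's struct site. */
typedef struct {
    int x, y, z, t;
    su3_matrix link[4];
    su3_matrix tmp_mat;
} site;

/* Pack: copy link[dir] of every site into one contiguous buffer.
   This loop runs on the PPE and is the cost the slide refers to. */
void pack_links(const site *lattice, size_t nsites, int dir, su3_matrix *packed)
{
    assert(dir >= 0 && dir < 4);
    for (size_t i = 0; i < nsites; i++)
        packed[i] = lattice[i].link[dir];
}

/* Unpack: scatter results from the contiguous buffer back into each site. */
void unpack_tmp(site *lattice, size_t nsites, const su3_matrix *packed)
{
    for (size_t i = 0; i < nsites; i++)
        lattice[i].tmp_mat = packed[i];
}
```

In a real port the packed buffer would additionally be allocated on a 128-byte boundary and sized to a multiple of 128 bytes, so that the SPEs can move it with large, well-aligned DMA transfers.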
10 Approach II: indirect memory access
- Replace elements in struct site with pointers
- Pointers point to contiguous memory regions
- PPE overhead due to indirect memory access
[Figure: original lattice and modified lattice in PPE main memory; the modified lattice points into contiguous memory, which is moved to and from the SPEs by DMA operations]
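A rough illustration of the pointer-based layout: the field data moves into one contiguous, 128-byte-aligned pool and each site keeps only a pointer into it. This is a sketch under assumptions; `site_indirect` and `attach_links` are hypothetical names, and the real struct site has many more fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { float real, imag; } complex_f;
typedef struct { complex_f e[3][3]; } su3_matrix;

/* Hypothetical modified site: the gauge links live in a separate pool. */
typedef struct {
    int x, y, z, t;
    su3_matrix *link;   /* points at 4 consecutive matrices inside the pool */
} site_indirect;

/* Allocate one contiguous, 128-byte-aligned pool for all links and wire up
   the per-site pointers. Returns the pool (caller frees it) or NULL. */
su3_matrix *attach_links(site_indirect *lattice, size_t nsites)
{
    assert(lattice != NULL && nsites > 0);
    size_t bytes = nsites * 4 * sizeof(su3_matrix);
    bytes = (bytes + 127u) / 128u * 128u;   /* aligned_alloc wants a multiple of the alignment */
    su3_matrix *pool = aligned_alloc(128, bytes);
    if (pool == NULL)
        return NULL;
    for (size_t i = 0; i < nsites; i++)
        lattice[i].link = &pool[4 * i];
    return pool;
}
```

Every access from PPE code now goes through `s->link[dir]` with one extra pointer dereference, which is the indirect-access overhead named on the slide.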
11 Approach III: padding and small memory DMAs
- Pad elements to an appropriate size
- Pad struct site to an appropriate size
- Good bandwidth performance is gained, at the cost of padding overhead
- su3_matrix: from 3x3 complex to 4x4 complex matrix, 72 bytes to 128 bytes; bandwidth efficiency lost: 44%
- wilson_vector: from 4x3 complex to 4x4 complex, 96 bytes to 128 bytes; bandwidth efficiency lost: 25%
[Figure: original lattice and lattice after padding in PPE memory; DMA operations to and from the SPEs]
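The padding overhead follows from simple arithmetic on the payload and padded sizes. The sketch below assumes single-precision complex numbers (8 bytes each) and uses the 72-byte and 96-byte element sizes from slide 8; the padded type names and `padding_loss_percent` are illustrative.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float real, imag; } complex_f;   /* 8 bytes */

/* Assumed padded layouts: 4x4 single-precision complex = 16 * 8 = 128 bytes,
   forced onto a 128-byte boundary. */
typedef struct { _Alignas(128) complex_f e[4][4]; } su3_matrix_padded;
typedef struct { _Alignas(128) complex_f d[4][4]; } wilson_vector_padded;

/* Percentage of each padded transfer that carries no useful data. */
double padding_loss_percent(size_t payload_bytes, size_t padded_bytes)
{
    assert(padded_bytes > 0 && payload_bytes <= padded_bytes);
    return 100.0 * (double)(padded_bytes - payload_bytes) / (double)padded_bytes;
}
```

With these inputs, `padding_loss_percent(72, 128)` evaluates to 43.75 (about 44%) for su3_matrix and `padding_loss_percent(96, 128)` evaluates to 25.0 for wilson_vector.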
12 Struct site padding
- Strided access in multiples of 128 bytes performs differently for different stride sizes
- This is due to the 16 banks in main memory
- Strides that are an odd multiple of 128 bytes always reach peak
- We chose to pad struct site to 2688 (21*128) bytes
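One way to see why odd multiples behave well is a simplified bank model. The sketch below assumes that consecutive 128-byte lines map round-robin onto the 16 banks; that mapping is an assumption made for illustration, not a statement about the actual memory controller. Under it, a stride of k lines visits 16 / gcd(k, 16) distinct banks, so any odd k visits all 16. The helper names are hypothetical.

```c
#include <assert.h>

#define NUM_BANKS 16u

/* Greatest common divisor (Euclid). */
unsigned gcd_u(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Number of distinct banks visited by accesses spaced stride_lines * 128
   bytes apart, under the assumed round-robin line-to-bank mapping. */
unsigned banks_touched(unsigned stride_lines)
{
    assert(stride_lines > 0);
    return NUM_BANKS / gcd_u(stride_lines, NUM_BANKS);
}
```

In this model the chosen stride of 21 lines (2688 bytes) spreads accesses over all 16 banks, while a stride of 16 lines would hit a single bank and a stride of 20 lines only 4.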
14 Kernel performance
- GFLOPS are low for all kernels
- Bandwidth is around 80% of peak for most kernels
- Kernel speedup compared to the CPU is between 10x and 20x for most kernels
- The set_memory_to_zero kernel has ~40x speedup; su3mat_copy() speedup is >15x
15 Application performance
- Single Cell: application performance speedup ~8–10x, compared to a single Xeon core
- Cell blade: application performance speedup x, compared to a Xeon blade with 2 sockets and 8 cores
- Profile on Xeon: 98.8% parallel code (sped up on Cell), 1.2% serial code (slowed down on Cell)
- On Cell: 67-38% of overall runtime is kernel SPU time, 33-62% is PPU time
- The PPE is standing in the way of further improvement
[Figure: results for the 8x8x16x16 lattice and the 16x16x16x16 lattice]
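The shift from a 98.8% / 1.2% profile on Xeon to a large PPU share on Cell can be reproduced with an Amdahl-style estimate. This is a simplified model, not the authors' analysis: it assumes the kernel portion is sped up by one uniform factor and the serial portion is slowed down by one uniform factor. The function name and both factors are illustrative inputs.

```c
#include <assert.h>

/* Amdahl-style estimate. Xeon runtime is normalized to 1.
   parallel_frac   : fraction of Xeon runtime spent in the ported kernels
   kernel_speedup  : assumed uniform speedup of those kernels on the SPEs
   serial_slowdown : assumed uniform slowdown of the remaining code on the PPE
   Returns the modeled whole-application speedup. */
double modeled_speedup(double parallel_frac, double kernel_speedup,
                       double serial_slowdown)
{
    assert(parallel_frac >= 0.0 && parallel_frac <= 1.0);
    assert(kernel_speedup > 0.0 && serial_slowdown > 0.0);
    double cell_time = parallel_frac / kernel_speedup
                     + (1.0 - parallel_frac) * serial_slowdown;
    return 1.0 / cell_time;
}
```

As an illustration only, `modeled_speedup(0.988, 15.0, 3.0)` picks values inside the ranges quoted on slides 6 and 14 and gives a bit under 10x, with the serial part accounting for about 35% of the modeled Cell runtime. Even a 1.2% serial fraction dominates once the kernels are fast and the serial code runs slower.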
16 Application performance on two blades

                         Kernels         Rest of code                              Total
Two Intel Xeon blades    seconds         27.1 seconds (24.5 seconds due to MPI)    seconds
Two Cell/B.E. blades     15.9 seconds    67.9 seconds (47.6 seconds due to MPI)    83.8 seconds

Kernels: execution time of the 54 kernels considered for the SPE implementation.
Rest of code: execution time of the rest of the code (the PPE portion in the case of the Cell/B.E. processor).

- For comparison, we ran two Intel Xeon blades and two Cell/B.E. blades connected through Gigabit Ethernet
- More data is needed for Cell blades connected through InfiniBand
17 Application performance: a fair comparison
[Table: Intel Xeon time, Cell/B.E. time, and speedup for the 8x8x16x16 lattice and the 16x16x16x16 lattice, with rows: single-core Xeon vs. Cell/B.E. PPE; single-core Xeon vs. Cell/B.E. PPE + 1 SPE; quad-core Xeon vs. Cell/B.E. PPE + 8 SPEs; Xeon blade vs. Cell/B.E. blade]
- PPE is slower than Xeon
- PPE + 1 SPE is ~2x faster than Xeon
- A Cell blade is x faster than an 8-core Xeon blade
18 Conclusion
- We achieved reasonably good performance: Gflops in one Cell processor for the whole application
- We maintained the MPI framework
  - Without the assumption that the code runs on one Cell processor, certain optimizations cannot be done, e.g. loop fusion
- The current site-centric data layout forces us to take the padding approach
  - 25-44% efficiency lost for bandwidth
  - Fix: field-centric data layout desired
- The PPE slows the serial part, which is a problem for further improvement
  - Fix: IBM putting a full-version Power core in Cell/B.E.
- The PPE may impose problems in scaling to multiple Cell blades
  - PPE over InfiniBand test needed
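To make the proposed fix concrete, a field-centric layout stores each field as its own contiguous array over all sites instead of embedding every field in struct site. The following is a minimal sketch of that idea under assumptions, not the layout adopted by MILC; `lattice_fields` and its helper functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { float real, imag; } complex_f;
typedef struct { complex_f e[3][3]; } su3_matrix;   /* 72 bytes, unpadded */

/* Hypothetical field-centric container: one contiguous array per field. */
typedef struct {
    size_t nsites;
    su3_matrix *link[4];   /* link[dir][i] is the dir-direction link of site i */
} lattice_fields;

size_t round_up_128(size_t bytes)
{
    return (bytes + 127u) / 128u * 128u;
}

void lattice_fields_free(lattice_fields *f)
{
    for (int d = 0; d < 4; d++) {
        free(f->link[d]);
        f->link[d] = NULL;
    }
}

/* Allocate each field array on a 128-byte boundary. Returns 0 on success,
   -1 on allocation failure (nothing is left allocated in that case). */
int lattice_fields_init(lattice_fields *f, size_t nsites)
{
    assert(f != NULL && nsites > 0);
    f->nsites = nsites;
    for (int d = 0; d < 4; d++)
        f->link[d] = NULL;
    for (int d = 0; d < 4; d++) {
        f->link[d] = aligned_alloc(128, round_up_128(nsites * sizeof(su3_matrix)));
        if (f->link[d] == NULL) {
            lattice_fields_free(f);
            return -1;
        }
    }
    return 0;
}
```

Because the matrices of one field sit back to back, a block of 16 consecutive sites occupies 16 * 72 = 1152 bytes, which is exactly 9 * 128 bytes. Such blocks can be transferred with aligned, 128-byte-multiple DMAs without padding each element.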