Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, Katherine Yelick. Lawrence Berkeley National Laboratory. ACM International Conference on Computing Frontiers, May 2-6, 2006, Italy. Presentation by Aarul Jain.
Introduce a performance model of Cell. Implement key scientific computing kernels: dense matrix multiply, sparse matrix-vector multiply, stencil computations, and 1D/2D FFTs. Verify results from the performance model against published results and against implementations run on the Cell full-system simulator. Compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Propose micro-architectural modifications that could significantly improve the efficiency of double-precision calculations.
Details and results from the paper. ◦ Programming model used. ◦ Performance model used for simulation. ◦ “Cell+” architecture for DP performance improvement. ◦ Dense matrix-matrix multiply. ◦ Sparse matrix-vector multiply. ◦ Stencil computations. ◦ Fast Fourier transforms. Comments/Critiques. Project. Q/A.
Three programming models ◦ Task parallelism. ◦ Pipelined parallelism. ◦ Data parallelism. The data-parallel programming model is used. Kernels rely heavily on SIMD intrinsics rather than plain C. Double buffering is used to overlap data movement with computation on the SPEs. The first kernel took one month to implement and came to 600 lines of code.
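The double-buffering idea can be sketched in a few lines. This is an illustrative Python model, not SPE code: `fetch_block` is a hypothetical stand-in for a DMA load into local store, and a single worker thread plays the role of the DMA engine.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_block(data, start, size):
    """Hypothetical stand-in for a DMA load of one block into local store."""
    return data[start:start + size]

def double_buffered_sum(data, block_size):
    """Sum `data` block by block, fetching block i+1 while block i is processed."""
    total = 0
    starts = list(range(0, len(data), block_size))
    if not starts:
        return total
    with ThreadPoolExecutor(max_workers=1) as dma:
        # Prime the pipeline: start the transfer of the first block.
        pending = dma.submit(fetch_block, data, starts[0], block_size)
        for next_start in starts[1:] + [None]:
            current = pending.result()  # wait until the current buffer is filled
            if next_start is not None:
                # Kick off the next transfer before computing on the current buffer.
                pending = dma.submit(fetch_block, data, next_start, block_size)
            total += sum(current)       # computation overlaps the pending transfer
    return total
```

For example, `double_buffered_sum(list(range(10)), 3)` processes four blocks while at most one transfer is in flight at any time.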
The model exploits the deterministic behavior of software-controlled memory, together with the in-order execution and fixed load-store memory latency of the SPEs. Step 1: segment the code into snippets that operate on data already present in the SPE local store, and perform static timing analysis on their assembly. Step 2: build a model that tabulates the time required for the DMA loads and stores of the operands those snippets need. Total time is the sum over all outer-loop iterations, where each iteration costs the maximum of its snippet time and its DMA transfer time.
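The timing rule above can be written down directly. This is a minimal sketch: the function name and the cycle counts in the example are hypothetical, and the max-per-iteration form assumes double buffering fully overlaps compute and DMA.

```python
def predicted_cycles(iterations):
    """Estimate total time for a kernel.

    `iterations` is a list of (snippet_cycles, dma_cycles) pairs, one per
    outer-loop iteration. With compute and DMA overlapped, each iteration
    costs whichever of the two is slower.
    """
    return sum(max(snippet, dma) for snippet, dma in iterations)

# Hypothetical numbers: the second iteration is DMA-bound, the others are not.
example = [(100, 80), (100, 150), (60, 60)]
total = predicted_cycles(example)  # 100 + 150 + 60
```

An iteration is compute-bound when its snippet time dominates and memory-bound when its DMA time dominates, so the same table also shows where a kernel is limited.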
Double-precision operations are implemented using a 9-cycle pipelined FMA data path with 4 cycles of overhead for data movement. The SPE stalls for 6 cycles after issuing a DP instruction. The paper does not discuss the Cell+ architecture in much detail (proprietary?). It proposes a design with a longer forwarding network that eliminates all but one of the stall cycles. More details on the SPE pipeline may be found in: ◦ B. Flachs et al., A streaming processing unit for a CELL processor. ISSCC Dig. Tech. Papers, Feb. 2005.
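The effect of the stall cycles on peak DP throughput can be worked out with simple arithmetic. The sketch below relies on assumptions that do not appear on this slide: 8 SPEs, a 3.2 GHz clock, and 4 flops per DP SIMD instruction (a 2-wide fused multiply-add).

```python
def peak_dp_gflops(stall_cycles, n_spes=8, clock_ghz=3.2, flops_per_instr=4):
    """Peak double-precision GFlop/s when every DP instruction is followed
    by `stall_cycles` cycles in which nothing else can issue.

    n_spes, clock_ghz and flops_per_instr are assumed values, not taken
    from the slide.
    """
    cycles_per_instr = 1 + stall_cycles
    return n_spes * clock_ghz * flops_per_instr / cycles_per_instr

cell = peak_dp_gflops(stall_cycles=6)       # one DP instruction every 7 cycles
cell_plus = peak_dp_gflops(stall_cycles=1)  # one DP instruction every 2 cycles
```

Under these assumptions, removing all but one stall cycle raises the peak DP rate by a factor of 3.5.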
General Matrix Multiply and Add (GEMM) ◦ Column major ◦ Block data layout Each matrix is broken into 8n x n element tiles designed to fit into the memory available on the Cell chip. These are in turn split into eight n x n element tiles that fit into the 8 SPE local stores.
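The tiling idea can be illustrated with a plain blocked multiply. This is a sketch only: it uses row-major Python lists rather than the column-major or block data layouts named above, it runs serially rather than across 8 SPEs, and the function name is hypothetical.

```python
def blocked_gemm(A, B, C, n):
    """Compute C += A * B for square matrices stored as lists of lists,
    working on n x n tiles so each tile's operands stay small enough to
    fit in a local store."""
    N = len(A)
    for i0 in range(0, N, n):          # tile row of C
        for j0 in range(0, N, n):      # tile column of C
            for k0 in range(0, N, n):  # tile along the inner dimension
                for i in range(i0, min(i0 + n, N)):
                    for j in range(j0, min(j0 + n, N)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + n, N)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

Each (i0, j0, k0) triple touches one tile of A, one of B, and one of C, which is the unit of work that would be moved into a local store.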
Storage formats ◦ Compressed Sparse Row (CSR) ◦ Blocked Compressed Sparse Row (BCSR)
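As a reminder of what CSR looks like in practice, here is a minimal sparse matrix-vector multiply. The array names `values`, `col_idx`, and `row_ptr` are conventional choices, not taken from the paper, and the code is a serial sketch rather than the SPE implementation.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Return y = A * x for a matrix A stored in Compressed Sparse Row form.

    values  : nonzero entries, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row_ptr[r] .. row_ptr[r + 1] delimits row r inside `values`
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y

# The 3 x 3 matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR form:
values = [1.0, 2.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = csr_spmv(values, col_idx, row_ptr, [1.0, 1.0, 1.0])
```

BCSR follows the same pattern but stores small dense blocks instead of single entries, so one column index is shared by a whole block.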
Two kernels are used, derived from the Chombo and Cactus toolkits. Both apply a 7-point stencil in 3D at each grid point.
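A 7-point stencil in 3D reads each point together with its six face neighbors. The sketch below shows one sweep; the weights `alpha` and `beta` are hypothetical and do not reproduce the actual Chombo or Cactus kernels.

```python
def stencil7(u, alpha, beta):
    """One sweep of a 7-point stencil over the interior of a 3D grid.

    `u` is indexed as u[z][y][x]. Boundary points are copied unchanged.
    alpha weights the center point and beta weights the six neighbors
    (both are illustrative parameters).
    """
    nz, ny, nx = len(u), len(u[0]), len(u[0][0])
    out = [[row[:] for row in plane] for plane in u]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                neighbors = (u[z - 1][y][x] + u[z + 1][y][x] +
                             u[z][y - 1][x] + u[z][y + 1][x] +
                             u[z][y][x - 1] + u[z][y][x + 1])
                out[z][y][x] = alpha * u[z][y][x] + beta * neighbors
    return out
```

Each output point needs seven reads for a handful of flops, which is why stencil kernels tend to be limited by memory traffic rather than arithmetic.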
Computational intensity is lower than for matrix multiplication. Both 1D and 2D versions are analyzed. Look-up tables are used. No double buffering.
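A radix-2 FFT with a precomputed table gives a feel for the approach. This sketch assumes the look-up tables hold twiddle factors, which the slide does not state, and it is a plain recursive version rather than the SIMD implementation from the paper.

```python
import cmath

def make_twiddles(n):
    """Look-up table of the twiddle factors exp(-2*pi*i*k/n) for k < n/2."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]

def fft(x, twiddles=None):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    if twiddles is None:
        twiddles = make_twiddles(n)
    # A half-size transform needs every other entry of the same table.
    half_table = twiddles[0::2]
    even = fft(x[0::2], half_table)
    odd = fft(x[1::2], half_table)
    out = [0j] * n
    for k in range(n // 2):
        t = twiddles[k] * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

A 2D transform can be built from this by applying the 1D transform to every row and then to every column.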
Broadest quantitative study of Cell’s performance. Cell’s three-level software-controlled memory architecture provides several advantages over mainstream cache-based architectures. Disadvantage: lack of unaligned load support. The authors propose the Cell+ architecture for improving DP performance.
Cell is unique in its architecture. Will future architectures be based on Cell? The authors have done considerable work in analyzing Cell performance. Critique1 Critique2
Title: FAST FOURIER TRANSFORM IMPLEMENTATION ON CELL BROADBAND ENGINE ARCHITECTURE Main Objectives: ◦ Explore Cell Architecture and find out limitations/advantages of Cell Architecture. ◦ Get familiar with Cell programming environment.
ibm.com/developerworks/power/library/pa-cellperf