High Performance Linear Transform Program Generation for the Cell BE

Vas Chellappa, Franz Franchetti, Markus Püschel
Electrical & Computer Engineering, Carnegie Mellon University

Sponsors: DARPA-DESA, NSF, ARO, and Mercury Inc.
Cell Broadband Engine

- Multicore CPU (8 SPEs + 1 PPE)
- SPEs: SIMD cores designed for numerical computing
- 256 KB “local store” (LS) per SPE (scratchpad-like)
- Programmer-driven DMA
- 204 Gflop/s peak

[Figure: Cell BE chip, with SPEs and their local stores (LS) connected through the EIB to main memory]

How do we harness the Cell’s impressive peak performance?
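The 204 Gflop/s peak is consistent with the following back-of-the-envelope count. It assumes a 3.2 GHz clock and one 4-wide single-precision fused multiply-add per SPE per cycle; neither number appears on the slide, so treat both as assumptions.

```latex
\[
8~\text{SPEs} \times 3.2~\text{GHz} \times 4~\text{SIMD lanes} \times 2~\frac{\text{flops}}{\text{FMA}}
  = 204.8~\text{Gflop/s}
\]
```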
DFT on the Cell BE

[Figure: DFT performance on the Cell BE for Spiral-generated code (this paper), FFTC, FFTW, and Numerical Recipes; annotated "350x"]

Platform-tuned code is 350x faster, but hard to write!
Overview

- Background, Spiral Overview
- Generating DFTs for the Cell
- Performance Results
- Concluding Remarks

Reference: Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo: SPIRAL: Code Generation for DSP Transforms. Special issue, Proceedings of the IEEE 93(2), 2005.
“Fitting” Dataflow to Hardware

[Figure: three dataflow diagrams of the same staged computation: iterative algorithm (programming ease), recursive algorithm (memory hierarchy), and parallel execution on Core 0 and Core 1 (multicore)]

To “fit” the DFT to an architecture:
- Various traversals
- Various factorizations

How can the dataflow be mapped to the architecture automatically?
“Fitting” Dataflow to Platform (contd.)

[Figure: numbered dataflow stages mapped onto Core 0 and Core 1]

Intuition: rewrite formulas to obtain a suitable dataflow
Program Generation in Spiral

- Transform (user specified)
- Fast algorithm in SPL (many choices)
- ∑-SPL
- C code

Optimization at all abstraction levels: parallelization, vectorization, loop optimizations, constant folding, scheduling, ...

Iteration of this process to search for the fastest

But that’s not all …
Common Abstraction: SPL

- SPL (signal processing language): tensor-product representation
- Mathematical notation exposes structure
- Algorithms in SPL: products of structured sparse matrices
- Algorithms reduce arithmetic cost from O(n^2) to O(n log n)
- E.g.: Cooley-Tukey fast Fourier transform (FFT)
- Tensor products in SPL represent loop structures
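As a rough illustration of how tensor products correspond to loops, the sketch below applies I_n ⊗ DFT_2 and DFT_2 ⊗ I_n to a vector. The first becomes a loop over contiguous blocks, the second a loop over elements at stride n. The function names (`butterfly2`, `apply_I_tensor_F2`, `apply_F2_tensor_I`) are hypothetical and not Spiral output, and real instead of complex data is used to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* DFT_2 butterfly on two elements separated by `stride`.
   Hypothetical helper for illustration; not Spiral-generated code. */
static void butterfly2(const float *x, float *y, size_t stride)
{
    float a = x[0], b = x[stride];
    y[0]      = a + b;
    y[stride] = a - b;
}

/* y = (I_n (x) DFT_2) x: n independent butterflies on contiguous pairs. */
void apply_I_tensor_F2(const float *x, float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        butterfly2(x + 2 * i, y + 2 * i, 1);
}

/* y = (DFT_2 (x) I_n) x: n butterflies, each reading and writing at stride n. */
void apply_F2_tensor_I(const float *x, float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        butterfly2(x + i, y + i, n);
}
```

Both loops do the same arithmetic; only the access pattern differs, which is why rewriting one form into the other changes how well the computation fits a given memory system.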
Mapping DFTs to the Cell

Objective: high-performance transform library for the Cell BE

[Figure: DFT mapped onto the Cell BE chip (SPEs with local stores, EIB, main memory)]

Cell’s architectural paradigms:
- Vectorization: vectorize the DFT for the SIMD vector length
- Parallelization: parallelize the DFT across p SPEs, using a given DMA packet size
- Multibuffering: optimize the DFT for throughput (s DFTs required)

Tags guide formula rewriting
SPL to Parallel Code

Natural parallel construct in SPL:
- [Figure: x is mapped to y by independent copies of A, one each on Processor 0 to Processor 3]
- Independent, load-balanced, communication-free operation

Parallelizing other constructs in SPL:
- Permutations require message exchange (on-chip DMA communication)

Idea: rewrite all SPL constructs to parallel constructs + on-chip DMA
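To see why permutations need message exchange, the sketch below treats a stride permutation as the transpose of an m x n row-major matrix and counts how many elements change owner when the vector is split into p contiguous chunks, one per processor. The names `owner`, `stride_permute`, and `count_remote_moves` are hypothetical, and the equal-chunk data distribution is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Owner of element `idx` when a vector of length `len` is split into
   p equal contiguous chunks (assumed distribution, one chunk per processor). */
static size_t owner(size_t idx, size_t len, size_t p)
{
    return idx / (len / p);
}

/* Stride permutation viewed as an m x n row-major transpose:
   y[j*m + i] = x[i*n + j]. Out-of-place only. */
void stride_permute(const float *x, float *y, size_t m, size_t n)
{
    assert(x != NULL && y != NULL && x != y);
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++)
            y[j * m + i] = x[i * n + j];
}

/* Number of elements whose source and destination chunks differ,
   i.e. elements that would need an inter-processor transfer. */
size_t count_remote_moves(size_t m, size_t n, size_t p)
{
    size_t len = m * n, remote = 0;
    assert(p > 0 && len % p == 0);
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++)
            if (owner(i * n + j, len, p) != owner(j * m + i, len, p))
                remote++;
    return remote;
}
```

With one processor nothing moves between chunks. With several, a share of the elements crosses chunk boundaries, and on the Cell that traffic would be carried by on-chip DMA.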
SPL to Streaming Code

Streaming: overlapping computation with communication
- On-chip (SPE ↔ SPE) and off-chip (SPE ↔ main memory)

Useful for:
- Throughput-optimized code
- Large, out-of-chip sizes

Idea: tensor loops become multi-buffered loops
- i-th iteration: write A_(i-1), compute A_i, read A_(i+1)
- (Trickier for other SPL constructs)

Idea: rewrite the algorithm at the SPL level to achieve the largest DMA packets
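The write/compute/read schedule of a multi-buffered loop can be sketched in portable C as follows. This is only a structural illustration: `memcpy` stands in for DMA transfers (which on the Cell would be asynchronous and actually overlap with the computation), and `BLOCK`, `kernel`, and `multibuffered_loop` are hypothetical names with a placeholder kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 4   /* elements per block; hypothetical size */

/* Placeholder kernel A: scales one block by 2. */
static void kernel(const float *in, float *out)
{
    for (size_t k = 0; k < BLOCK; k++)
        out[k] = 2.0f * in[k];
}

/* Applies A to n consecutive blocks of x, writing to y. In iteration i the
   read of block i+1 and the write of block i-1 are issued around the
   computation of block i, using two input and two output buffers. */
void multibuffered_loop(const float *x, float *y, size_t n)
{
    float in[2][BLOCK], out[2][BLOCK];
    assert(x != NULL && y != NULL);
    if (n == 0)
        return;

    memcpy(in[0], x, sizeof in[0]);                 /* prologue: read block 0 */
    for (size_t i = 0; i < n; i++) {
        size_t cur = i % 2, other = (i + 1) % 2;
        if (i + 1 < n)                              /* read block i+1 */
            memcpy(in[other], x + (i + 1) * BLOCK, sizeof in[other]);
        if (i > 0)                                  /* write block i-1 */
            memcpy(y + (i - 1) * BLOCK, out[other], sizeof out[other]);
        kernel(in[cur], out[cur]);                  /* compute block i */
    }
    /* epilogue: write the last block */
    memcpy(y + (n - 1) * BLOCK, out[(n - 1) % 2], sizeof out[0]);
}
```

Two buffers per direction are enough here because block i-1 and block i+1 share the same parity, so the buffer not used by the current computation serves both the pending write and the next read.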
Generating Cell Code

- Transform (user specified)
- Fast algorithm in SPL, obtained by tag-guided rewriting:
  - Streamed from memory for throughput
  - Load balanced across p SPEs
  - SIMD kernel optimized for memory hierarchy
  - All-to-all communication (on-chip)
- Loop operations in ∑-SPL
- Cell-specific optimized C code (intrinsics, DMA, etc.)
Generated Code Sample

    /* Complex-to-complex DFT size 64 on 2 SPEs */
    void dft_c2c_64(float *X, float *Y, int spuid)
    {
        int i;

        // Block 1: (I x A) L
        for (i = 0; i <= 7; i++) {            // Rightmost gather (DMA)
            DMA_GATHER(gath_func(X, i), gath_func(T1, i), 4);   // uses spu_mfcdma()
        }
        spu_mfcstat(MFC_TAG_UPDATE_ALL);      // Wait on gather

        // compute vectorized DFT kernel of size m

        for (i = 0; i <= 7; i++) {            // Scatter at interface (DMA)
            DMA_SCATTER(scat_func(T1, i), scat_func(T2, i), 4);
        }
        all_to_all_synchronization_barrier(); // uses mailbox messages

        // Block 2: (A x I)
        /* Gather is a no-op since the scatter above accounted for it */

        // compute vectorized DFT kernel of size n

        for (i = 0; i <= 7; i++) {            // Leftmost scatter (DMA)
            DMA_SCATTER(scat_func(T1, i), scat_func(Y, i), 4);
        }
        all_to_all_synchronization_barrier();
    }

Slide callouts: vectorized, DMA, parallelized

DFT 2^16: 4,000+ lines of code!
Problem Space: Options

Parallelization:
- Base (vectorized) SPE DFT; vectorization assumed
- Single DFT parallelized across multiple SPEs
- Multiple independent DFTs on multiple SPEs (only for small DFTs)
- Multiple parallelized independent DFTs

Main memory operations:
- Latency optimized (default)
- Throughput, multibuffered
Problem Space: Combinations

Usage scenarios (throughput-optimized and latency-optimized):
- Parallel, multibuffered DFT
- Single DFT from main memory
- Independent DFTs multibuffered in parallel

Devise rewrite rules for tags. Nestings of tags describe all scenarios.
[Figure: single-precision SPE DFT performance on an IBM QS22 for 1, 2, 4, and 8 SPEs]
[Figure: SPE DFT performance of Spiral (1 SPE), Spiral (8 SPEs), FFTC, and FFTW]

4.5x faster than FFTW, 1.63x faster than FFTC
More Performance Results

- Single-SPE DFT code
- Split/interleaved complex formats
- Non-power-of-two sizes
- Double precision (PowerXCell 8i)

[Figure: performance comparison of Spiral, Mercury, Chow, and IBM SDK]
Other Linear Transforms

- Discrete sine and cosine transforms, DFT with real inputs (single SPE)
- 2-D DFTs
- Out-of-core sizes
- Limited to 2-D DFTs on 1 SPE (for now)

More performance results: Srinivas Chellappa, Franz Franchetti, and Markus Püschel: Computer Generation of Fast Fourier Transforms for the Cell Broadband Engine. Proceedings of the International Conference on Supercomputing (ICS), 2009.
Conclusion

Automatic generation of transform libraries:
- High performance
- Variety of scenarios, formats

High performance on Cell requires:
- Vectorization, multicore parallelization, streaming, DMA code
- Future processors are likely to have similar paradigms and tradeoffs

Spiral approach:
- Common abstraction of transform, algorithm, and architecture (SPL)
- Rewrite rules to go from transform to architecture

[Figure labels: architecture space, algorithm]