Aarul Jain, CSE520 Advanced Computer Architecture, Fall 2007
Three versions of the Fast Fourier Transform, implemented on the Cell BE simulator, with performance analyzed as the FFT order increases:
◦ FFT on the PPE / a single SPU.
◦ Data/task parallel on multiple SPUs (single-buffer vs. double-buffer performance comparison).
◦ Pipelined implementation on multiple SPUs.
Performance measured for both the FFT kernel and the DMA data transfer.
PPE: 64-bit Power architecture with VMX; in-order, 2-way SMT; 32KB L1 and 512KB L2 caches.
SPE: 256KB local store; in-order, no speculation; 128 registers for all data types.
EIB: four 16B data rings; over 100 outstanding requests.
FFT compute intensity: O(n log n).
Implementation on the PPU:
◦ Cache-based memory architecture; no software-controlled memory.
Implementation on the SPU:
◦ Software-controlled memory.
◦ The limited local store sets the maximum FFT size that can be implemented (data structure size = 16 bytes * FFT size => 8K-point FFT maximum), as the sketch below illustrates.
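A minimal sketch of the data layout this implies; the field names mirror those used in the code slides at the end, but the struct itself is my reconstruction:

/* Four single-precision floats per FFT point = 16 bytes/point.
   At 8K points this is 128KB, which fits in the 256KB local store
   alongside code and stack; anything larger does not. */
#define FFT_SIZE 8192

typedef struct {
    float RealIn[FFT_SIZE];
    float ImagIn[FFT_SIZE];
    float RealOut[FFT_SIZE];
    float ImagOut[FFT_SIZE];
} control_block;                /* sizeof = 16 * FFT_SIZE = 128KB */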
[Table: PPE vs. SPE execution times. Columns: N (points); N log N; PPE cycles (measured on PPE); SPE cycles for the FFT kernel, for DMA, and total cycle time (measured on SPE); thread-creation cycles (measured on PPE); difference. Accompanying plot: N vs. cycles.]
The number of cycles on both the PPU and the SPU scales as N log N.
Compute time on the PPU is greater than on a single SPU because of cache misses on the PPU; the SPU has no cache and accesses its local store directly. DMA is very efficient.
Thread creation on the SPE is very expensive, so SPUs need to be dedicated to a particular task for long enough to recoup the time it took to set them up.
Is the DIFFERENCE column (col. 8) too large? The exact reason is unknown. Possible reasons:
◦ Cycles spent exiting the thread. (Are local-store entries invalidated upon exit?)
◦ A profiling-tool problem. (IBM says the simulator is intended for profiling SPEs, not PPEs. Does this mean the intrinsics provided for measuring cycles on the PPE (__mftb) are not accurate?)
Multiple FFTs run on each SPU, with each SPU working on different data.
Local store memory limits the FFT size:
◦ Single-buffer approach => 8K points.
◦ Double-buffer approach => 4K points.
Compared: single buffer vs. double buffer, and performance as the number of active SPUs increases. A sketch of the double-buffering scheme follows.
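A minimal sketch of the double-buffer (ping-pong) idea, assuming the control_block layout sketched earlier with FFT_SIZE reduced to 4096 so that two buffers fit in the local store; dma_fetch, dma_wait, and dma_writeback are hypothetical wrappers around the chunked mfc_get/mfc_put calls shown in the code slides at the end:

control_block buf[2];   /* two buffers share the 256KB local store */
int cur = 0;

dma_fetch(&buf[cur], 0);                     /* prime the pipeline: fetch block 0 */
for (int i = 0; i < num_blocks; i++) {
    if (i + 1 < num_blocks)
        dma_fetch(&buf[cur ^ 1], i + 1);     /* overlap: next block streams in */
    dma_wait(cur);                           /* block i has fully arrived */
    fft_float(FFT_SIZE, buf[cur].RealIn, buf[cur].ImagIn,
              buf[cur].RealOut, buf[cur].ImagOut);
    dma_writeback(&buf[cur], i);             /* results stream out during next FFT */
    cur ^= 1;                                /* swap the buffers' roles */
}
dma_wait(0);                                 /* drain outstanding write-backs */
dma_wait(1);

The overlap pays off only when compute time exceeds transfer time; otherwise the second buffer just halves the maximum FFT order (cf. the Amdahl's-law caveat in the conclusions).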
[Table: single buffer vs. double buffer, per number of SPUs and N: thread-creation cycles, avg. cycles (FFT only), and avg. cycles (DMA, approx.) for each scheme.]
More compute power with multiple processors:
◦ For FFT, almost 8 times (with 8 SPUs) if thread creation is not counted.
Using double buffering may not always give a speed advantage (Amdahl's law).
The algorithm should be analyzed carefully to determine whether it is compute-intensive or memory-intensive with respect to the Cell architecture:
◦ Matrix multiplication is memory-intensive, but FFT becomes memory-intensive only at very large orders, where the FFT samples no longer fit in the Cell local store.
Reference GFLOPS calculation:
No. of cycles for a single 4K-point FFT = 24,688
No. of floating-point operations = 4096 * log2(4096) = 49,152
Frequency of system = 3.2 GHz; no. of SPUs = 8
GFLOPS = (49152 / 24688) * 8 * 3.2G = 50.96 GFLOPS
[Chart: IBM results vs. my results]
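A quick standalone sanity check of the arithmetic above (the 24,688-cycle figure is the measured value taken from the formula; the N log2 N operation count is the slide's own model):

#include <stdio.h>
#include <math.h>

int main(void) {
    double n      = 4096.0;                        /* 4K-point FFT */
    double flops  = n * log2(n);                   /* 4096 * 12 = 49152 operations */
    double cycles = 24688.0;                       /* measured cycles for one FFT */
    double gflops = (flops / cycles) * 8.0 * 3.2;  /* 8 SPUs at 3.2 GHz */
    printf("%.2f GFLOPS\n", gflops);               /* prints 50.97; the slide truncates to 50.96 */
    return 0;
}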
The Cell architecture and its programming environment are completely new, so unknown problems keep coming up.
Runtime error "bus error": normally because of unaligned access (see the alignment sketch below); in my case, the cause was DMA accesses larger than 16KB (fixed on the last slide).
Profiling is tricky, with the simulator supporting multiple modes; assembly intrinsics are required to measure actual cycles.
Running in "cycle" mode is very slow:
◦ An 8K-point FFT takes 2 days to run.
The simulator crashes when the mode is changed multiple times.
Debug support is very complex.
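Alignment bites on DMA in particular: source and destination addresses must be at least 16-byte aligned, and 128-byte alignment gives the best EIB performance. A minimal sketch, assuming the reconstructed control_block type from earlier:

/* Statically aligning the DMA buffers avoids the "bus error" above. */
static control_block cb1 __attribute__((aligned(128)));
static control_block cb2 __attribute__((aligned(128)));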
Use the IBM alphaWorks forum: excellent, with quick response times.
To profile accurately, run the simulation in cycle mode.
Intrinsics for profiling (usage sketch below):
◦ __mftb() for the PPE.
◦ spu_writech() / spu_readch() for the SPE.
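A usage sketch of those intrinsics; fft_float is the kernel from the code slides, and note that both counters tick at the timebase frequency rather than the core clock, so raw tick counts must be converted. On the SPE, the decrementer counts down:

#include <spu_intrinsics.h>

unsigned int t0, t1, ticks;
spu_writech(SPU_WrDec, 0x7fffffff);   /* arm the decrementer */
t0 = spu_readch(SPU_RdDec);
fft_float(FFT_SIZE, cb1.RealIn, cb1.ImagIn, cb1.RealOut, cb1.ImagOut);
t1 = spu_readch(SPU_RdDec);
ticks = t0 - t1;                      /* decrementer counts down, so subtract this way */

On the PPE, read the timebase register before and after:

#include <ppu_intrinsics.h>

unsigned long long tb0 = __mftb();
/* ... code being measured ... */
unsigned long long tb_ticks = __mftb() - tb0;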
Next steps: pipelined implementation of the FFT; standalone mode; higher-order FFTs; compiler performance.
Double-buffered DMA loop on the SPE (reformatted from the slide): cb1 and cb2 are the two local-store buffers, x and y index their 16KB chunks, and tag groups x and y+10 track their transfers. &cb1/&cb2 are cast to char * so the chunk offset is in bytes, not whole structs. Requires <spu_mfcio.h>.

#define CHUNK (sizeof(cb1) / (FFT_SIZE / 1024))   /* exactly 16KB per DMA request */

loop (   /* the slide's shorthand for the outer loop over data blocks */
    /* fetch the next chunk of cb1 */
    mfc_get((char *)&cb1 + x * CHUNK, argp + x * CHUNK, CHUNK, x, 0, 0);
    /* wait for cb2's previous transfer before reusing it, then fetch into cb2 */
    mfc_write_tag_mask(1 << (y + 10));
    mfc_read_tag_status_all();
    mfc_get((char *)&cb2 + y * CHUNK, argp + y * CHUNK, CHUNK, y + 10, 0, 0);
    /* wait for cb1's data, then compute on cb1 while cb2 streams in */
    mfc_write_tag_mask(1 << x);
    mfc_read_tag_status_all();
    fft_float(FFT_SIZE, cb1.RealIn, cb1.ImagIn, cb1.RealOut, cb1.ImagOut);
    /* wait for cb2's data, then write back cb1's results */
    mfc_write_tag_mask(1 << (y + 10));
    mfc_read_tag_status_all();
    mfc_put((char *)&cb1 + x * CHUNK, argp + x * CHUNK, CHUNK, x, 0, 0);
    /* compute on cb2 while cb1 streams out */
    fft_float(FFT_SIZE, cb2.RealIn, cb2.ImagIn, cb2.RealOut, cb2.ImagOut);
    mfc_write_tag_mask(1 << x);
    mfc_read_tag_status_all();
    mfc_put((char *)&cb2 + y * CHUNK, argp + y * CHUNK, CHUNK, y + 10, 0, 0);
)
/* drain the final cb2 write-back */
mfc_write_tag_mask(1 << (y + 10));
mfc_read_tag_status_all();
mfc_get(&cb1, argp, sizeof(cb1), x, 0, 0);   /* WON'T WORK for sizeof(cb1) > 16KB */

A single MFC DMA request can transfer at most 16KB, so the transfer must be recoded as a chunk loop. Each chunk here is sizeof(cb1) / (FFT_SIZE/1024) = 16 * 1024 bytes, exactly the limit:

for (x = 0; x < FFT_SIZE / 1024; x++) {
    mfc_get((char *)&cb1 + x * (sizeof(cb1) / (FFT_SIZE / 1024)),
            argp + x * (sizeof(cb1) / (FFT_SIZE / 1024)),
            sizeof(cb1) / (FFT_SIZE / 1024),
            x, 0, 0);
}