1
Accelerating PFA FFT: Performance Comparison
Michael Perrone, Acie Nobles, Jizhu Lu
2
Outline
PFA FFT Overview & Experimental Results
Implementation
Vectorization
PFA FFT on Cell - M. Perrone,
3
PFA FFT Algorithm Specifics
Prime-factor FFT algorithm (PFA)
2D FFT
Single precision
Complex-to-complex
Nominal size: 1K rows, 1600 points per row
Factors implemented: 2, 3, 4, 5, 7, 8, 9, 11, 13, 16
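As an illustration of the prime-factor idea named on this slide, here is a minimal Python sketch of a two-factor PFA using Good's index mapping. It is not the Cell implementation: the function names dft and pfa_dft are hypothetical, the small transforms are plain O(n^2) DFTs, and the example only assumes that the two factors are coprime (for instance 3 and 5, both in the implemented factor list). The point it shows is that, with this mapping, a length N1*N2 transform becomes an N1 x N2 two-dimensional transform with no twiddle factors between the stages.

```python
import cmath
from math import gcd


def dft(x):
    """Direct O(n^2) DFT; serves as reference and as the small-factor kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def pfa_dft(x, n1, n2):
    """DFT of length n1*n2 (n1 and n2 coprime) via Good's prime-factor mapping."""
    n = n1 * n2
    assert len(x) == n and gcd(n1, n2) == 1
    # Input map: index (a*n2 + b*n1) mod n places x into an n1 x n2 grid.
    grid = [[x[(a * n2 + b * n1) % n] for b in range(n2)] for a in range(n1)]
    # Length-n2 DFT along every row, then length-n1 DFT along every column.
    grid = [dft(row) for row in grid]
    cols = [dft([grid[a][b] for a in range(n1)]) for b in range(n2)]
    # Output map (Chinese remainder theorem): k1 = k mod n1, k2 = k mod n2.
    return [cols[k % n2][k % n1] for k in range(n)]
```

Comparing pfa_dft(x, 3, 5) with dft(x) on a 15-point input is a quick way to check the mapping.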
4
Performance Comparison
Cell vs Woodcrest
Cell vs Opteron
Execution Time Performance Comparison ( D images, in seconds)

Matrix Size   Intel     AMD       3SW      3SWO     2SW      2SWO
364x240       16.47     38.8      6.63     4.74     5.56     5.31
616x308       45.92     135.59    11.86    8.21     9.5      9.05
840x462       146.22    246.09    24.3     16.96    18.71    17.83
1008x616      218.24    393.27    34.72    23.07    27.58    26.29
1260x840      416.56    559.05    59.71    39.94    50.84    48.38
1540x1008     687.79    995.49    86.16    57.65    79.1     75.66
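The ratios behind this comparison can be recomputed from the table with a few lines of Python. This is only a sketch: it assumes that the Intel and AMD columns are the Woodcrest and Opteron baselines and that 3SW, 3SWO, 2SW and 2SWO are the Cell variants, which the slide does not spell out, so the column names are used as opaque labels. The helper name speedup is hypothetical.

```python
# Execution times in seconds, copied from the slide's table.
COLUMNS = ("Intel", "AMD", "3SW", "3SWO", "2SW", "2SWO")
ROWS = {
    "364x240":   (16.47, 38.8, 6.63, 4.74, 5.56, 5.31),
    "616x308":   (45.92, 135.59, 11.86, 8.21, 9.5, 9.05),
    "840x462":   (146.22, 246.09, 24.3, 16.96, 18.71, 17.83),
    "1008x616":  (218.24, 393.27, 34.72, 23.07, 27.58, 26.29),
    "1260x840":  (416.56, 559.05, 59.71, 39.94, 50.84, 48.38),
    "1540x1008": (687.79, 995.49, 86.16, 57.65, 79.1, 75.66),
}


def speedup(size, baseline, variant):
    """Ratio of the baseline column's time to the variant column's time."""
    row = dict(zip(COLUMNS, ROWS[size]))
    return row[baseline] / row[variant]


if __name__ == "__main__":
    for size in ROWS:
        # Assumed pairing: each baseline against the 3SWO column.
        print(size,
              round(speedup(size, "Intel", "3SWO"), 2),
              round(speedup(size, "AMD", "3SWO"), 2))
```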
5
Performance: All PFA Sizes – 3 Step & 2 Step Algs
6
Lessons Learned
NUMA utility "numactl -m 0 -c 0"
  Binds jobs to BEs
  Binds memory to BEs
2 runs instead of 1
Changed buffer size from 4096 to 4104 elements
  added one data envelope (128B)
  Better memory access pattern
Declare temporary variables locally
Combining 2nd and 3rd steps
7
Outline
PFA FFT Overview & Experimental Results
Implementation
Vectorization
8
Implementation Overview
FFT distributed across SPEs
Data vectorized
DMAs double buffered
Pass 1: For each buffer
  DMA Get buffer
  Transform signals to SIMD format
  Do four 1D FFTs in SIMD
  Tiles transposed
  DMA Put buffer
Pass 2: For each buffer
Pass 3: For each buffer
  Transform SIMD format to original data format
[Diagram labels: Tile, Buffer, Input Image, Transposed Image, Transposed Tile, Transposed Buffer]
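The "DMAs double buffered" item can be pictured with a small Python sketch that only produces the order of operations: while one buffer slot is being processed, the transfer for the next chunk is already issued into the other slot. The function name and the event tuples are hypothetical, and real SPE code would issue asynchronous DMA commands and wait on tags rather than build a list.

```python
def double_buffered_schedule(num_chunks):
    """Return the ordered (action, chunk, slot) events of a double-buffered loop."""
    events = []
    if num_chunks == 0:
        return events
    events.append(("get", 0, 0))  # prime the pipeline with the first chunk
    for i in range(num_chunks):
        slot = i % 2
        if i + 1 < num_chunks:
            # Issue the next transfer into the other slot before computing,
            # so that (on real hardware) it overlaps with the computation.
            events.append(("get", i + 1, 1 - slot))
        events.append(("compute", i, slot))
        events.append(("put", i, slot))
    return events


if __name__ == "__main__":
    for event in double_buffered_schedule(3):
        print(event)
```

A slot is reused only after its previous chunk has been put back, which is why two slots suffice in this ordering.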
9
Two Step PFA FFT Algorithm
1st Step
  Get input data from main RAM by using DMA
  Vectorization
  Vectorized PFA FFT for 1st dimension
  Transpose and write back to main memory
2nd Step
  Vectorized PFA FFT for 2nd dimension
  Combined Transpose & Un-vectorization
  Write back to main memory
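The two steps listed here follow the usual row-column structure of a 2D transform: transform along one dimension, transpose, transform along the other dimension, transpose back. The Python sketch below checks that structure against a direct 2D DFT. It leaves out DMA and vectorization entirely, uses a plain O(n^2) DFT in place of the PFA kernel, and all function names are hypothetical.

```python
import cmath


def dft(x):
    """Direct O(n^2) DFT standing in for the 1D PFA FFT kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def fft2_two_step(image):
    """2D DFT as two passes of row transforms, each followed by a transpose."""
    step1 = transpose([dft(row) for row in image])  # 1st dimension, then transpose
    step2 = transpose([dft(row) for row in step1])  # 2nd dimension, transpose back
    return step2


def dft2_direct(image):
    """Reference 2D DFT straight from the definition."""
    rows, cols = len(image), len(image[0])
    return [[sum(image[r][c]
                 * cmath.exp(-2j * cmath.pi * (r * k1 / rows + c * k2 / cols))
                 for r in range(rows) for c in range(cols))
             for k2 in range(cols)]
            for k1 in range(rows)]
```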
10
2nd Step Details
Load buffer 1
Load buffer 2
PFAFFT on buffer 1
PFAFFT on buffer 2
Do combined transpose and unvectorization on buffer1 & buffer2
DMA back to main RAM in right places
11
Time Distribution on 2nd Step
Efficiency = 6/13 = ~50%
[Timeline diagram, from the beginning to the end of the loop: the stages Load, Comp, Trans and Unload repeat. Loads and computations rotate over buf[0], buf[1] and buf[2]; each combined transpose-and-unload step (T & UNLD) handles a pair of buffers: buf[0] buf[1], then buf[1] buf[2], then buf[0] buf[1].]
12
Outline
PFA FFT Overview & Experimental Results
Implementation
Vectorization
13
Data Layout Change in 2-Step PFAFFT
Original Input Data (each trace 4 complex numbers x 16 traces)
[Diagram: 16 traces, a to p, of 8 values each (a1 to a8 through p1 to p8), four traces per buffer: 1st buffer a to d, 2nd buffer e to h, 3rd buffer i to l, 4th buffer m to p. Column labels: real real real real imaginary imaginary imaginary imaginary.]
14
Vectorization Shuffle Operation in 1st Step
[Diagram: shuffle applied to traces a to d (a1 to a8, b1 to b8, c1 to c8, d1 to d8). Input column labels: real real real real imaginary imaginary imaginary imaginary. Output column labels: real imaginary real imaginary real imaginary real imaginary.]
15
After Vectorization in 1st Step
[Diagram: in each buffer the four traces are interleaved element by element: a1 b1 c1 d1, a2 b2 c2 d2, ..., a8 b8 c8 d8 in the 1st buffer, and likewise traces e to h in the 2nd buffer, i to l in the 3rd buffer and m to p in the 4th buffer. Column labels: real imaginary real imaginary real imaginary real imaginary.]
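The interleaving shown on this slide, where the same element of four traces sits side by side so that one SIMD operation can work on four signals at once, can be sketched in Python as follows. The names LANES, vectorize and unvectorize are hypothetical, tuples stand in for 4-wide SIMD registers, and the lane order a, b, c, d is an assumption, since the order within each group is not reliable in the extracted slide text.

```python
LANES = 4  # assumed SIMD width: four single-precision values per vector


def vectorize(traces):
    """Interleave LANES equal-length traces: vector i holds element i of each trace."""
    assert len(traces) == LANES
    assert len({len(t) for t in traces}) == 1
    return [tuple(t[i] for t in traces) for i in range(len(traces[0]))]


def unvectorize(vectors):
    """Inverse of vectorize: recover the LANES separate traces."""
    return [[v[lane] for v in vectors] for lane in range(LANES)]


if __name__ == "__main__":
    traces = [[f"{name}{i}" for i in range(1, 9)] for name in "abcd"]
    for vec in vectorize(traces):
        print(vec)
```

Because every lane carries an independent trace, the 1D FFT code can be written once on vectors and still compute four transforms.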
16
After PFA FFT for 1st Dimension
[Diagram: same layout as after vectorization. In each buffer the four traces are interleaved element by element: a1 b1 c1 d1, a2 b2 c2 d2, ..., a8 b8 c8 d8 in the 1st buffer, and likewise traces e to h in the 2nd buffer, i to l in the 3rd buffer and m to p in the 4th buffer. Column labels: real imaginary real imaginary real imaginary real imaginary.]
17
Transposition Shuffle Operation in 1st Step
[Diagram: shuffle applied to the interleaved values of traces a to d (a1 b1 c1 d1 through a8 b8 c8 d8). Input and output column labels are both: real imaginary real imaginary real imaginary real imaginary.]
18
After Transposition DMA back to main RAM in 1st Step
[Diagram: each buffer now holds the values of its four traces grouped trace by trace: a, b, c, d in the 1st buffer, e to h in the 2nd, i to l in the 3rd and m to p in the 4th. The slide text lists the values of each trace in the order 1 3 7 5 2 4 8 6, for example a1 a3 a7 a5 a2 a4 a8 a6. Column labels: real imaginary real imaginary real imaginary real imaginary.]
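The implementation overview speaks of tiles being transposed before the data is written back, so the transposition in this step can be pictured as a tile-by-tile transpose. The Python sketch below shows that idea on a plain list-of-lists matrix. The function name, the default tile size of 4 and the data structure are assumptions for illustration; the actual code works on SIMD registers and DMA buffers.

```python
def transpose_tiled(matrix, tile=4):
    """Transpose a rows x cols matrix by visiting it one tile at a time."""
    rows, cols = len(matrix), len(matrix[0])
    out = [[None] * rows for _ in range(cols)]
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            # Each input tile lands, transposed, at the mirrored tile position.
            for r in range(r0, min(r0 + tile, rows)):
                for c in range(c0, min(c0 + tile, cols)):
                    out[c][r] = matrix[r][c]
    return out
```

Working tile by tile keeps both the reads and the writes confined to small blocks, which is the usual reason for structuring a transpose this way.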
19
After DMA Load in 2nd Step from main RAM (all in 1 buffer)
[Diagram: a single buffer holding all 16 traces, a to p, grouped trace by trace. The slide text lists the values of each trace in the order 1 3 7 5 2 4 8 6, for example a1 a3 a7 a5 a2 a4 a8 a6. Column labels: real imaginary real imaginary real imaginary real imaginary.]
20
After PFA FFT for 2nd Dimension (just 1 buffer)
[Diagram: same layout as after the DMA load. A single buffer holds all 16 traces, a to p, grouped trace by trace, each listed in the order 1 3 7 5 2 4 8 6, for example a1 a3 a7 a5 a2 a4 a8 a6. Column labels: real imaginary real imaginary real imaginary real imaginary.]
21
Transposition & Un-Vectorization Shuffle Operation in 2nd Step
[Diagram: shuffle applied to traces a to d, each listed in the order 1 3 7 5 2 4 8 6, for example a1 a3 a7 a5 a2 a4 a8 a6. Input column labels: real imaginary real imaginary real imaginary real imaginary. Output column labels: real real real real imaginary imaginary imaginary imaginary.]
22
After Combined Transposition and Un-vectorization Shuffle DMA back to main RAM
[Diagram: traces g to p, each listed in the order 1 2 4 3 5 6 8 7, for example g1 g2 g4 g3 g5 g6 g8 g7. Column labels: real real real real imaginary imaginary imaginary imaginary.]