FFT Accelerator Project Rohit Prakash Anand Silodia Date: June 7th, 2007
Objectives: analysis using random input points; percentage improvement (over the previous implementations); cache profiling
Improvements: calls to sine/cosine decreased; separate arrays for powers and some other terms (divisions decreased, multiplications decreased); error from last time corrected (FFTW floating point)
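The slides do not say how the sine/cosine calls were reduced. One common approach, sketched here as an assumption rather than as the project's actual code, is to obtain w from a single cosine/sine pair and derive w^2 and w^3 by complex multiplication. The type `cpx` and the functions `cpx_mul` and `twiddle_powers` are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-precision complex type (not from the slides). */
typedef struct { float re, im; } cpx;

/* Complex product a * b. */
cpx cpx_mul(cpx a, cpx b)
{
    cpx r;
    r.re = a.re * b.re - a.im * b.im;
    r.im = a.re * b.im + a.im * b.re;
    return r;
}

/* Given w (computed once from one cosine/sine pair by the caller),
 * fill out[0..2] with w, w^2 and w^3 using multiplications only,
 * so no further sine/cosine calls are needed for this butterfly. */
void twiddle_powers(cpx w, cpx out[3])
{
    assert(out != NULL);
    out[0] = w;
    out[1] = cpx_mul(w, w);       /* w^2 */
    out[2] = cpx_mul(out[1], w);  /* w^3 */
}
```

This trades two sine/cosine pairs for two complex multiplications per butterfly, at the cost of slightly more rounding error in w^2 and w^3 than direct evaluation would give.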
System Configuration: Intel Pentium 4 (HT) 3.0 GHz; RAM: 1 GB; Cache: 1 MB L2; OS: Fedora Core 3; Compiler: icc; Flags used: -xW, -O3, -ipo, -prec-div, -static
User time vs. FFTW (single precision): Radix-4 is 1.5 times slower than FFTW; Radix-8 is 1.6 times slower than FFTW
User time, previous (double) vs. new (float): approximately 20% improvement
User time, previous (double) vs. new (float): approximately 19% improvement
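The slides do not define how the percentage improvement was computed. A minimal sketch, assuming the usual definition of relative reduction in user time, is shown below. The function name `percent_improvement` is hypothetical, and the timings in the usage note are illustrative values, not measurements from this project.

```c
#include <assert.h>

/* Percentage improvement of a new timing over an old one:
 * 100 * (t_old - t_new) / t_old.
 * Assumed definition; the slides do not state the formula used. */
double percent_improvement(double t_old, double t_new)
{
    assert(t_old > 0.0);
    return 100.0 * (t_old - t_new) / t_old;
}
```

For example, with a hypothetical old user time of 10.0 s and a new user time of 8.0 s, `percent_improvement(10.0, 8.0)` gives 20.0.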
Cache Organization
Cache level   Size    Associativity   Line size (bytes)
L2            1 MB    8-way           64
I1            16 KB   4-way           64
D1            16 KB   4-way           64
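The figures above determine how many sets each cache has, which matters for FFT access patterns because power-of-two strides can map many elements to the same set. The sketch below assumes the line size is given in bytes and that addresses map to sets by simple modulo indexing. The function names `cache_sets` and `cache_set_index` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of sets = cache size / (associativity * line size). */
size_t cache_sets(size_t size_bytes, size_t ways, size_t line_bytes)
{
    assert(ways > 0 && line_bytes > 0);
    assert(size_bytes % (ways * line_bytes) == 0);
    return size_bytes / (ways * line_bytes);
}

/* Index of the set that a byte address maps to, assuming
 * simple modulo indexing on the line number. */
size_t cache_set_index(size_t addr, size_t sets, size_t line_bytes)
{
    assert(sets > 0 && line_bytes > 0);
    return (addr / line_bytes) % sets;
}
```

Under these assumptions the L2 cache has 2048 sets and D1 has 64 sets, so in D1 any two addresses that are 4096 bytes apart fall into the same set.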
Radix-4 L2 misses: approximately 30% fewer L2 misses
Radix-4 D1 misses: approximately 1.6% fewer D1 misses
Radix-8 L2 misses: approximately 13.6% fewer L2 misses
Radix-8 D1 misses: approximately 0.96% fewer D1 misses
Profiling results: using VTune
Profiling results: using gprof
Further Improvements: use SSE instructions
Vectorize the loop
T = A[r]
U = w*A[r+p]
V = w*w*A[r+2*p]
W = w*w*w*A[r+3*p]
Complex temp[4];
for (i = 1; i < 4; i++) { temp[i] = twiddle[i*p] * A[r + i*l]; }
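As a rough illustration of the vectorization proposed above, the four twiddle products can be computed in one pass if the real and imaginary parts are held in separate arrays. This is a sketch under stated assumptions, not the project's code: the function name `cmul4` and the split real/imaginary layout are hypothetical, and the caller is assumed to have gathered the four strided inputs into contiguous arrays, with lane 0 using a twiddle of 1 so that T = A[r]. It uses SSE intrinsics where the compiler defines `__SSE__` and falls back to a scalar loop elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* out[i] = a[i] * w[i] for i = 0..3, with real and imaginary
 * parts stored in separate arrays of four floats each. */
void cmul4(const float *a_re, const float *a_im,
           const float *w_re, const float *w_im,
           float *out_re, float *out_im)
{
    assert(a_re && a_im && w_re && w_im && out_re && out_im);
#if defined(__SSE__)
    /* Unaligned loads, so the arrays need no special alignment. */
    __m128 ar = _mm_loadu_ps(a_re);
    __m128 ai = _mm_loadu_ps(a_im);
    __m128 wr = _mm_loadu_ps(w_re);
    __m128 wi = _mm_loadu_ps(w_im);
    /* (ar + i*ai)(wr + i*wi) = (ar*wr - ai*wi) + i*(ar*wi + ai*wr) */
    __m128 re = _mm_sub_ps(_mm_mul_ps(ar, wr), _mm_mul_ps(ai, wi));
    __m128 im = _mm_add_ps(_mm_mul_ps(ar, wi), _mm_mul_ps(ai, wr));
    _mm_storeu_ps(out_re, re);
    _mm_storeu_ps(out_im, im);
#else
    /* Scalar fallback computing the same four products. */
    for (size_t i = 0; i < 4; i++) {
        out_re[i] = a_re[i] * w_re[i] - a_im[i] * w_im[i];
        out_im[i] = a_re[i] * w_im[i] + a_im[i] * w_re[i];
    }
#endif
}
```

The split layout avoids the shuffles that an interleaved (re, im) layout would need, but gathering the strided inputs has its own cost, so whether this is a net gain would have to be measured.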
Thank You