FFT Accelerator Project Rohit Prakash (2003CS10186) Anand Silodia (2003CS50210) 4 th October, 2007
FPGA: Overview □Work done □Structure of a sample program □Ongoing Work □Next Step
FPGA : work done □Register handling and console IO □Modified simple.c □Implemented an adder □Used VirtualBase member of ADMXRC2_SPACE_INFO □Registers can be indexed using (23 downto 2) bits of LAD (local address/data) signal when it is used to address the fpga
Structure of simple.vhd entity simple is port( All the local bus signals required); end simple architecture …
Ongoing work : ZBT □Structure of zbt_main seems to be similar to simple.c □zbt.vhd is a wrapper for zbt_main.vhd □Same port names defined in the same way and port mapped to each other □Do not understand the reason for this wrapper □C code not available in ADMXRC2 demos □Lalit’s code also uses zbt and block rams, so looking at his C and vhdl code
Next Step □To work with zbt and block RAMs □FFT implementation on the FPGA
Multiprocessor FFT Overview □Some improvements to the existing code □Improve the theoretical model □Compare theoretical run-time with actual run time □Statistics of each processor □Further refinement: Using BSP model □Pointers for Cache Analysis
Optimizations to the code □Removed other arrays (reducing memory references considerably) □Twiddle factors □Bit reversal addresses □Bit reversal faster using bit operations O(1) for each address calculation □All multiplications/divisions involving 2 implemented using shift operations O(1) □Power (2^n) in constant time using bit operations O(1)
Previously…
Now…
Improvement □For larger input size, our program (radix-2) is comparable to FFTW □Our program might surpass FFTW □Using SIMD □Higher radix (e.g. 4,8,16) □Coding in C
Redefining the execution time □For p processors, the total execution time is : (T N /p) + (1 – 1/p)(2N/B + K N ) □p is a power of 2 □This assumes “RAM Model” □Assumes a flat memory address space with unit- cost access to any memory location □We did not take into account the memory hierarchy □E.g. matrix multiplication actually takes O(n 5 ) instead of expected O(n 3 ) [Alpern et al. 1994]
Redefining the execution time □Some observations □If the #processors are p, then the actual FFT computed if FFT(N/p) time taken is T N/ p and NOT T N / p □Time taken to combine (O(n) in RAM model) should be taken as: Σ K N/2 i (i = 1 to log p) □NOT included the synchronization time □Currently looking execution time only from the perspective of master processor □The overheads for establishing sends and receives have been neglected (on measuring this (using ping-pong approach) the time was negligible
New Theoretical Formula □Time taken for parallel execution with p processors is T N/p + (1-1/p)(2N/B) + ΣK N/2 i (i = 1 to log p)
Execution Time:
Input: (p=2) Send(2) Recv(1) P1 P2FFT(N/2) Recv(2) Send(1) Combine T=0T=20.865T= T=26.591T= T= T= T=35.808T=35.555
Load Distribution: Processor 1
Load Distribution: Processor 2
Input: (p=4) Send(2) Recv(1) Send(3) Send(4) Recv(2) Recv(1) P1 P2 P3 P4 FFT(N/4) Send(1) Send(2) Recv(3) Combine Recv(1) Send(1) Combine T=0T=20.773T=26.464T= T= T= T= T= T=29.532T= T= T= T=31.045T=33.96 T= T= T= Recv(4) T= T= T=39.85 T= T=40.120
Load Distribution: Processor 1
Load Distribution: Processor 2
Load Distribution: Processor 3
Load Distribution: Processor 4
Execution Time:
Input: (p=2) Send(2) Recv(1) P1 P2FFT(N/2) Recv(2) Send(1) Combine T=0T= T= T= T= T= T= T= T=
Load Distribution: Processor 1
Load Distribution: Processor 2
Input: (p=4) Send(2) Recv(1) Send(3) Send(4) Recv(2) Recv(1) P1 P2 P3 P4 FFT(N/4) Send(1) Send(2) Recv(3) Combine Recv(1) Send(1) Combine T=0T=70.881T=91.281T= T= T= T= T= T=97.896T= T= T= T= T= T= T= T= Recv(4) T= T= T= T= T=
Load Distribution: Processor 1
Load Distribution: Processor 2
Load Distribution: Processor 3
Load Distribution: Processor 4
Execution Time:
Input: (p=2) Send(2) Recv(1) P1 P2FFT(N/2) Recv(2) Send(1) Combine T=0T= T= T= T= T= T= T= T=
Load Distribution: Processor 1
Load Distribution: Processor 2
Input: (p=4) Send(2) Recv(1) Send(3) Send(4) Recv(2) Recv(1) P1 P2 P3 P4 FFT(N/4) Send(1) Send(2) Recv(3) Combine Recv(1) Send(1) Combine T=0T= T= T= T= T= T= T= T= T= T= T= T= T= T= T= T= Recv(4) T= T= T= T= T=
Load Distribution: Processor 1
Load Distribution: Processor 2
Load Distribution: Processor 3
Load Distribution: Processor 4
Inference □The idle time is very less (for processor 1) □The theoretical model matches with actual results □But, we need to find a closed form solution for T N and K N
Calculating T N and K N □Depends upon □N : Size of the input □A: Cache Associativity □L: Cost incurred for a miss □M: Size of the cache □B: Number of Bytes it can transfer at a time
Contd… □Cache profilers give us the number of references that has been made to each level of the cache along with the number of misses □We have this table (computed in the summers) □We can multiply the total number of references and misses by the number of cycles it takes to do so to get an actual number
Theoretical Verification □S.Sen ET. Al. – “Towards a Theory of Cache-Efficient Algorithms” □It has given a formal method to analyze algorithms in Cache model (taking into account multiple memory hierarchy) □Still reading it
Modeling using BSP □BSP (Bulk Synchronous Parallel) model considers □The whole job as a series of supersteps □At each superstep, all processors do local computations and send messages to other processors. These messages are not available until the next synchronization has been finished
Modeling using BSP □BSP model uses the following parameters – □p the number of processors (p = ^2 for us) □w t the maximum local work performed by any processor □L the time machine needs for barrier synchronization (determined experimentally) □g the network bandwidth inefficiency (reciprocal of B,determined experimentally)
Modeling using BSP Send(2) Recv(1) Send(3) Send(4) Recv(1) P1 P2 P3 P4 FFT(N/4) Send(1) Send(2) Recv(1) Recv(3) Combine Recv(1) Send(1) Combine barrier step 0step 1step 2step 3step 4step 5step 6
Execution time □Step 0: L □Step1: L+max(time(Send(2)),time(Recv(1))) □Step 3: L+ max(time(Send(3),Send(4),Recv(1),Recv(2)) □Step 4: L+max(FFT i (N/p)) (0<=i<=p-1) □Step 5: L+ max(time(Send(2),Send(1),Recv(3),Recv(4)) □Step 6: L+max(time(combine i (N/4)) (i={1,2}) □Step 7: L+max(time(Send(1)),time(Recv(2))) □Step 8: L+ time(combine(N/2))
Generalizing this for p processors event(t) communications 0<= t < logp compute FFT(N/p) t = logp communications logp< t<= 3logp (t - logp odd) combine FFTs logp< t<= 3logp (t - logp even)
for t< logp Total # of steps = 2 t Sends and 2 t Recvs let time(send(N,i)) denote the time taken to send N data points to processor i let time(recv(N,j)) denote the time taken to receive N data points from parocessor j Total time taken for this group = ∑ max{time(send(N/(2 t+1 ),j-), time(send(N/(2 t+1 ), i-1))} +L(logp) 0<j<=2 t 2 t <i<=2 t+1 t=0 t=log p -1
t = logp □Let time(FFT i (N/p)) denote the time taken to compute FFT of size N/p on processor i □thus, time taken to calculate FFT of size N/p is max{FFT i (N/p)} + L 0<= i<= p-1
for t>logp (t-logp is odd) Time taken is only for communications Total time taken is ∑ max{time(send(N/h,j-1),time(recv(N/h,i-1))} +L(logp) 0<j<=h/2 h/2<i<=h t=log p +1 t=3log p -1 where h = 2 [|(t-3logp)/2|]+1 where | | refers to absolute and [] greatest integer function
for t>logp (t-logp is even) Time taken is only for combining Let time(combine i (N)) denote the time to combine Total time taken is ∑ max{time(combine i (N/2h))} +L(logp) - L t=log p +2 t=3log p where h = 2 [|(t-3logp)/2|]+1 where | | refers to absolute and [] greatest integer function 0<i<=h
Execution Time □The total time is the sum of all the above steps □In general, there would be 3(logp) steps □The actual time depends upon how well a particular part of the program schedules on a particular processor □(i.e.) the processing time can vary
Further Work □Formalize the BSP model for p divisions □Combine Inplace (using realloc) □Compare parallel FFT against parallel FFTW
References □S.Sen, S.Chatterjee, N.Dumir, 2000.Towards a Theory of Cache- Efficient Algorithms □Michael J. Quinn, Parallel Programming in C with MPI and OpenMP □L.G. Valiant, A bridging model for parallel computation
Thank You