FFT Accelerator Project
Rohit Prakash (2003CS10186)
Anand Silodia (2003CS50210)
4th October, 2007
FPGA: Overview
□ Work done
□ Structure of a sample program
□ Ongoing work
□ Next step
FPGA: work done
□ Register handling and console I/O
□ Modified simple.c
□ Implemented an adder
□ Used the VirtualBase member of ADMXRC2_SPACE_INFO
□ Registers can be indexed using bits (23 downto 2) of the LAD (local address/data) signal when it addresses the FPGA (see the sketch below)
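A minimal host-side sketch of this register indexing, assuming the card has been opened and the space info retrieved through the ADMXRC2 API (setup calls elided); the register roles shown are illustrative, not the project's actual layout:

#include <stdint.h>
#include "admxrc2.h"   /* ADM-XRC2 SDK header; name assumed */

/* Registers appear as 32-bit words in the mapped space, so the
   LAD(23 downto 2) word index is simply the C array index. */
void adder_demo(ADMXRC2_SPACE_INFO *info)
{
    volatile uint32_t *regs = (volatile uint32_t *)info->VirtualBase;

    regs[0] = 5;                  /* operand A -> register 0 (illustrative) */
    regs[1] = 7;                  /* operand B -> register 1 (illustrative) */
    uint32_t sum = regs[2];       /* adder result from register 2           */
    (void)sum;
}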
Structure of simple.vhd

entity simple is
  port(
    -- all the local bus signals required
  );
end simple;

architecture ...
Ongoing work: ZBT
□ The structure of zbt_main seems similar to simple.c
□ zbt.vhd is a wrapper for zbt_main.vhd
□ The same port names are defined in the same way and port-mapped to each other
□ We do not yet understand the reason for this wrapper
□ C code is not available in the ADMXRC2 demos
□ Lalit's code also uses ZBT and block RAMs, so we are looking at his C and VHDL code
Next Step
□ To work with ZBT and block RAMs
□ FFT implementation on the FPGA
Multiprocessor FFT: Overview
□ Some improvements to the existing code
□ Improve the theoretical model
□ Compare theoretical run time with actual run time
□ Statistics of each processor
□ Further refinement: using the BSP model
□ Pointers for cache analysis
Optimizations to the code
□ Removed the other arrays (reducing memory references considerably):
  □ Twiddle factors
  □ Bit-reversal addresses
□ Bit reversal made faster using bit operations: O(1) for each address calculation
□ All multiplications/divisions involving 2 implemented using shift operations: O(1)
□ Powers 2^n computed in constant time using bit operations: O(1)
(A sketch of these bit tricks follows below.)
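A minimal sketch of such constant-time bit tricks (one standard formulation; the project's actual code may differ):

#include <stdint.h>

/* 2^n in O(1): a single shift. */
static inline uint32_t pow2(unsigned n) { return 1u << n; }

/* Multiply and divide by 2 in O(1): shifts instead of * and /. */
static inline uint32_t mul2(uint32_t x) { return x << 1; }
static inline uint32_t div2(uint32_t x) { return x >> 1; }

/* Reverse all 32 bits in a fixed number of mask-and-shift steps,
   so each address costs O(1) regardless of the input size. */
static inline uint32_t reverse32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}

/* Bit-reversed index for an N = 2^bits point FFT. */
static inline uint32_t bit_reverse(uint32_t i, unsigned bits)
{
    return reverse32(i) >> (32 - bits);
}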
Previously…
Now…
Improvement
□ For larger input sizes, our program (radix-2) is comparable to FFTW
□ Our program might surpass FFTW by:
  □ Using SIMD
  □ Using a higher radix (e.g. 4, 8, 16)
  □ Coding in C
Redefining the execution time
□ For p processors, the total execution time is: (T_N)/p + (1 - 1/p)(2N/B + K_N)
□ p is a power of 2
□ This assumes the "RAM model": a flat memory address space with unit-cost access to any memory location
□ We did not take the memory hierarchy into account
□ E.g. matrix multiplication actually takes O(n^5) instead of the expected O(n^3) [Alpern et al. 1994]
Redefining the execution time
□ Some observations:
  □ With p processors, each processor actually computes an FFT of size N/p, so the time taken is T_{N/p} and NOT (T_N)/p
  □ The time taken to combine (O(n) in the RAM model) should be taken as Σ_{i=1}^{log p} K_{N/2^i}
  □ Synchronization time is not included
  □ We currently look at execution time only from the perspective of the master processor
  □ The overheads of establishing sends and receives have been neglected (measured with a ping-pong test, this time was negligible)
New Theoretical Formula
□ Time taken for parallel execution with p processors: T_{N/p} + (1 - 1/p)(2N/B) + Σ_{i=1}^{log p} K_{N/2^i} (see the sketch below)
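A small sketch of how this formula can be evaluated against the measured runs; T_of and K_of are hypothetical lookups returning the measured T_n (FFT) and K_n (combine) times, and B is the transfer rate in points per unit time:

#include <math.h>

/* Predicted parallel time:
   T_{N/p} + (1 - 1/p) * (2N/B) + sum_{i=1}^{log p} K_{N/2^i} */
double predicted_time(long N, int p, double B,
                      double (*T_of)(long), double (*K_of)(long))
{
    int logp = (int)lround(log2((double)p));   /* p is a power of 2 */
    double t = T_of(N / p) + (1.0 - 1.0 / p) * (2.0 * N / B);
    for (int i = 1; i <= logp; i++)
        t += K_of(N >> i);                     /* K_{N/2^i} */
    return t;
}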
Execution Time: 16777216
Input: 16777216 (p=2)
[Timeline diagram: P1 Send(2)/P2 Recv(1), FFT(N/2) on both processors, P2 Send(1)/P1 Recv(2), Combine on P1; timestamps from T=0 to T=35.808]
Load Distribution: Processor 1
Load Distribution: Processor 2
Input: 16777216 (p=4)
[Timeline diagram: two distribution rounds of Send/Recv from P1 and P2 to P2-P4, FFT(N/4) on all four processors, send-back rounds with Combine on P1 and P2, final Combine on P1; timestamps from T=0 to T=40.120]
Load Distribution: Processor 1
Load Distribution: Processor 2
Load Distribution: Processor 3
Load Distribution: Processor 4
Execution Time: 33554432
Input: 33554432 (p=2)
[Timeline diagram: P1 Send(2)/P2 Recv(1), FFT(N/2) on both processors, P2 Send(1)/P1 Recv(2), Combine on P1; timestamps from T=0 to T=133.851]
Load Distribution: Processor 1
Load Distribution: Processor 2
Input: 33554432 (p=4)
[Timeline diagram: two distribution rounds of Send/Recv from P1 and P2 to P2-P4, FFT(N/4) on all four processors, send-back rounds with Combine on P1 and P2, final Combine on P1; timestamps from T=0 to T=119.261]
Load Distribution: Processor 1
Load Distribution: Processor 2
Load Distribution: Processor 3
Load Distribution: Processor 4
Execution Time: 67108864
Input: 67108864 (p=2)
[Timeline diagram: P1 Send(2)/P2 Recv(1), FFT(N/2) on both processors, P2 Send(1)/P1 Recv(2), Combine on P1; timestamps from T=0 to T=324.062]
Load Distribution: Processor 1
Load Distribution: Processor 2
Input: 67108864 (p=4)
[Timeline diagram: two distribution rounds of Send/Recv from P1 and P2 to P2-P4, FFT(N/4) on all four processors, send-back rounds with Combine on P1 and P2, final Combine on P1; timestamps from T=0 to T=544.529]
Load Distribution: Processor 1
Load Distribution: Processor 2
Load Distribution: Processor 3
Load Distribution: Processor 4
Inference
□ The idle time is very low (for processor 1)
□ The theoretical model matches the actual results
□ But we need to find closed-form solutions for T_N and K_N
Calculating T_N and K_N
□ They depend upon:
  □ N: size of the input
  □ A: cache associativity
  □ L: cost incurred for a miss
  □ M: size of the cache
  □ B: number of bytes it can transfer at a time
Contd…
□ Cache profilers give us the number of references made to each level of the cache, along with the number of misses
□ We have this table (computed over the summer)
□ We can multiply the total numbers of references and misses by the number of cycles each takes, to get an actual cycle count (see the sketch below)
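A minimal sketch of that calculation, assuming per-level reference and miss counts from the profiler; the struct layout and cycle costs are illustrative placeholders:

struct cache_level {
    long refs;       /* references made at this level  */
    long misses;     /* misses at this level           */
    int  hit_cost;   /* cycles per hit                 */
    int  miss_cost;  /* extra cycles per miss          */
};

/* Weight each level's references and misses by their cycle
   costs and sum, giving an estimated total cycle count. */
long estimate_cycles(const struct cache_level *lv, int nlevels)
{
    long cycles = 0;
    for (int i = 0; i < nlevels; i++)
        cycles += lv[i].refs   * (long)lv[i].hit_cost
                + lv[i].misses * (long)lv[i].miss_cost;
    return cycles;
}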
Theoretical Verification
□ S. Sen et al., "Towards a Theory of Cache-Efficient Algorithms"
□ It gives a formal method for analyzing algorithms in the cache model (taking multiple levels of the memory hierarchy into account)
□ Still reading it
Modeling using BSP
□ The BSP (Bulk Synchronous Parallel) model considers:
  □ The whole job as a series of supersteps
  □ At each superstep, all processors do local computation and send messages to other processors; these messages are not available until the next synchronization has finished
Modeling using BSP
□ The BSP model uses the following parameters:
  □ p: the number of processors (a power of 2 for us)
  □ w_t: the maximum local work performed by any processor
  □ L: the time the machine needs for a barrier synchronization (determined experimentally)
  □ g: the network bandwidth inefficiency (the reciprocal of B, determined experimentally)
(The standard BSP cost these parameters feed is sketched below.)
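For reference, a minimal sketch of the standard BSP cost built from these parameters: a superstep doing w local work and communicating h words costs w + g*h + L (Valiant 1990). The per-superstep arrays here are hypothetical measurements:

/* Sum the standard BSP superstep costs w[s] + g*h[s] + L. */
double bsp_cost(const double *w, const double *h, int supersteps,
                double g, double L)
{
    double total = 0.0;
    for (int s = 0; s < supersteps; s++)
        total += w[s] + g * h[s] + L;
    return total;
}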
Modeling using BSP
[BSP diagram: the p=4 timeline redrawn as supersteps step 0 through step 6, separated by barriers: distribution Sends/Recvs, FFT(N/4) on P1-P4, send-back Sends/Recvs, and Combines]
Execution time
□ Step 0: L
□ Step 1: L + max(time(Send(2)), time(Recv(1)))
□ Step 2: L + max(time(Send(3)), time(Send(4)), time(Recv(1)), time(Recv(2)))
□ Step 3: L + max(time(FFT_i(N/p))), 0 <= i <= p-1
□ Step 4: L + max(time(Send(2)), time(Send(1)), time(Recv(3)), time(Recv(4)))
□ Step 5: L + max(time(combine_i(N/4))), i ∈ {1, 2}
□ Step 6: L + max(time(Send(1)), time(Recv(2)))
□ Step 7: L + time(combine(N/2))
Generalizing this for p processors:

  event               t
  communications      0 <= t < log p
  compute FFT(N/p)    t = log p
  communications      log p < t <= 3 log p   (t - log p odd)
  combine FFTs        log p < t <= 3 log p   (t - log p even)
For t < log p:
□ Total # of steps: 2^t Sends and 2^t Recvs
□ Let time(send(N, i)) denote the time taken to send N data points to processor i
□ Let time(recv(N, j)) denote the time taken to receive N data points from processor j
□ Total time taken for this group:
  Σ_{t=0}^{log p - 1} max{ time(send(N/2^{t+1}, j-1)), time(recv(N/2^{t+1}, i-1)) : 0 < j <= 2^t, 2^t < i <= 2^{t+1} } + L(log p)
t = log p
□ Let time(FFT_i(N/p)) denote the time taken to compute an FFT of size N/p on processor i
□ Thus, the time taken to calculate the FFT of size N/p is max_{0 <= i <= p-1}{ time(FFT_i(N/p)) } + L
For t > log p (t - log p odd):
□ Time is taken only by communications
□ Total time taken:
  Σ_{t=log p + 1}^{3 log p - 1} max{ time(send(N/h, j-1)), time(recv(N/h, i-1)) : 0 < j <= h/2, h/2 < i <= h } + L(log p)
  where h = 2^{⌊|(t - 3 log p)/2|⌋ + 1}, with |·| the absolute value and ⌊·⌋ the greatest-integer (floor) function
For t > log p (t - log p even):
□ Time is taken only by combining
□ Let time(combine_i(N)) denote the time to combine
□ Total time taken:
  Σ_{t=log p + 2}^{3 log p} max_{0 < i <= h}{ time(combine_i(N/2h)) } + L(log p) - L
  where h = 2^{⌊|(t - 3 log p)/2|⌋ + 1}, as above
Execution Time
□ The total time is the sum of all the above steps
□ In general, there are 3(log p) steps
□ The actual time depends upon how well a particular part of the program is scheduled on a particular processor, i.e. the processing time can vary
(These sums are assembled in the sketch below.)
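A hedged sketch assembling these superstep costs, following the p = 4 walk-through above; the cost callbacks are hypothetical stand-ins for the measured worst-processor send/recv/FFT/combine times:

#include <math.h>

typedef double (*cost_fn)(long n);  /* worst-processor time for an op on n points */

double total_parallel_time(long N, int p, double L,
                           cost_fn send_t, cost_fn recv_t,
                           cost_fn fft_t, cost_fn combine_t)
{
    int logp = (int)lround(log2((double)p));
    double total = L;                       /* step 0: initial barrier */

    /* distribution supersteps: block sizes N/2, N/4, ..., N/p */
    for (int t = 0; t < logp; t++) {
        long n = N >> (t + 1);
        double s = send_t(n), r = recv_t(n);
        total += (s > r ? s : r) + L;
    }

    total += fft_t(N / p) + L;              /* local FFTs of size N/p */

    /* send-back and combine rounds: block sizes N/p, 2N/p, ..., N/2 */
    for (int t = 0; t < logp; t++) {
        long n = (N / p) << t;
        double s = send_t(n), r = recv_t(n);
        total += (s > r ? s : r) + L;       /* communications (odd step)   */
        total += combine_t(n) + L;          /* combine two n-blocks (even) */
    }
    return total;
}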
Further Work
□ Formalize the BSP model for p divisions
□ Combine in place (using realloc)
□ Compare our parallel FFT against parallel FFTW
References
□ S. Sen, S. Chatterjee, N. Dumir, 2000. "Towards a Theory of Cache-Efficient Algorithms"
□ Michael J. Quinn. "Parallel Programming in C with MPI and OpenMP"
□ L. G. Valiant, 1990. "A bridging model for parallel computation"
Thank You