Acceleration of Cooley-Tukey algorithm using Maxeler machine

Presentation on theme: "Acceleration of Cooley-Tukey algorithm using Maxeler machine"— Presentation transcript:

1 Acceleration of Cooley-Tukey algorithm using Maxeler machine
Author: Nemanja Trifunović Mentor: Professor dr. Veljko Milutinović

2 Introduction
Cooley-Tukey algorithm: Fast Fourier Transform. Divide and conquer. Uses: digital signal processing, telecommunications, the analysis of sound signals, …
Maxeler platform: Data flow (vs control flow). FPGA.
Speaker notes: The paper presents an implementation of the Cooley-Tukey algorithm in C++ and its acceleration using a Maxeler machine. What is the Cooley-Tukey algorithm? What is the Maxeler platform?
Cooley-Tukey: an algorithm that computes the fast Fourier transform. That is, it computes the discrete Fourier transform (DFT) of an input sequence of samples, i.e. it converts the input signal from the time domain to the frequency domain. It works on the divide-and-conquer principle: it splits the input sequence into subsequences and determines the DFT of the whole sequence from their DFTs. It is the most widely used algorithm for computing the fast Fourier transform. It has applications in many areas of electrical engineering, such as digital signal processing, telecommunications, analysis of sound signals, ...
Maxeler: Maxeler machines are based on a dataflow architecture and are implemented in FPGA technology. FPGA stands for Field-Programmable Gate Array. This means that this kind of chip is programmable in the field, i.e. the technology allows the connections inside the chip to be changed after the chip has been delivered to the end user. There are major differences between the dataflow architecture and John von Neumann's control-flow architecture, which is used in microprocessors. Because of this, some problems are better suited to a control-flow architecture, while others are better suited to a dataflow architecture. Most often only a part of the program is accelerated using the Maxeler platform. The accelerated part is called the kernel of the program and usually corresponds to a program loop.
Example of Fourier transformation. (Source: illustration is published under a Creative Commons license)
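As a point of reference for the notes above, the discrete Fourier transform can be evaluated directly from its definition. The following is only a minimal Python sketch, not code from the paper: the function name `dft` and the test signal are illustrative assumptions. It costs O(N^2) operations, which is the baseline that the Cooley-Tukey algorithm improves on.

```python
import cmath


def dft(x):
    """Direct O(N^2) evaluation of X_k = sum_n x_n * exp(-2*pi*i*n*k/N)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


# Illustrative input: a constant signal has all of its energy
# in the zero-frequency bin.
spectrum = dft([1.0, 1.0, 1.0, 1.0])
```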

3 Problem statement Design and implementation of:
The fastest possible system for calculating the Fast Fourier Transform using a Maxeler machine.
A system that will outperform the currently existing solutions to this problem.
Speaker notes: As already stated, the paper deals with the problem of accelerating the Cooley-Tukey algorithm using a Maxeler machine. The goals of the paper are ...

4 Problem statement
Benefits of calculating the Fast Fourier Transform with Maxeler machines
Benefits: Higher speed of calculation. Lower power consumption. Lower space consumption.
Conditions: Huge amounts of data.
Speaker notes: What was the motivation for implementing the Cooley-Tukey algorithm on the Maxeler platform? What do we gain by it? The benefits we can expect are relative to an equivalent microprocessor implementation. To obtain the listed benefits, the following condition must be met. This will be explained in detail later.

5 Conditions and assumptions
Maxeler machine used: two Maxeler cards of type MAX3424A.
In experiments with multiprocessor systems only one processor core was used.
Speaker notes: The work was done on the Maxeler machine located at the Mathematical Institute. ... Using this machine, the implementation of the algorithm was synthesized for:
64 samples in the input sequence when two cards are used
32 samples in the input sequence when one card is used
Multiprocessor experiments: my own plus those of other authors.

6 Overview of existing solutions
FFT algorithms: Prime-factor, Bruun’s, Rader’s, Winograd, Bluestein’s, …
The time complexity: O(N log N).
Performance comparison of publicly available implementations: Matteo Frigo and Steven G. Johnson (from MIT).
Speaker notes: These algorithms use different mathematical methods in their computations. The methods used range from number theory through numerical mathematics to graph theory. What do the algorithms have in common? (complexity) Although there is no proof that a faster algorithm cannot be constructed, no such algorithm has been constructed so far. Free and commercial implementations. fftw3 - commercial version licensed by MIT. In their tests Matteo and Steven used only one processor core.

7 Illustration of Matteo Frigo’s and Steven G. Johnson’s experiments.
Speaker notes: Explain the mflops scale: mflops = 5 N log2(N) / (time needed to compute one FFT, in microseconds).
Illustration of Matteo Frigo’s and Steven G. Johnson’s experiments. (Source: )
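The mflops scale from the notes can be written as a one-line function. This is a minimal sketch: the function name and the sample timing are hypothetical, and it assumes the usual benchmark convention that the time for one FFT is given in microseconds, so that the result reads as millions of operations per second. The score is a normalized speed (5 N log2 N is a nominal operation count), not a measurement of the operations actually executed.

```python
import math


def fft_mflops(n, microseconds_per_fft):
    """Benchmark-style score: 5 * N * log2(N) / (time for one FFT in microseconds)."""
    return 5 * n * math.log2(n) / microseconds_per_fft


# Hypothetical timing, for illustration only: a 64-point FFT
# that takes 1 microsecond.
example_score = fft_mflops(64, 1.0)
```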

8 The proposed solution Parallelized radix 2 algorithm.
Pipeline of depth O(log N), where N is the length of the input sequence.
Latency is proportional to the depth of the pipeline.
After the initial delay (latency), one result is produced in every cycle.
Speaker notes: To achieve a speedup, the pipeline has to be filled and kept busy. If there is not enough data, the machine has low utilization.
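The effect of filling the pipeline can be illustrated with a simple cycle-count model. This is a sketch under stated assumptions, not a model of the actual Maxeler hardware: the function names are hypothetical, and the latency is taken to be exactly `depth` cycles, whereas the slide only says it is proportional to the depth.

```python
def pipeline_cycles(num_inputs, depth):
    """Cycles needed to process num_inputs: fill latency, then one result per cycle."""
    if num_inputs == 0:
        return 0
    return depth + (num_inputs - 1)


def utilization(num_inputs, depth):
    """Fraction of cycles in which a result leaves the pipeline."""
    cycles = pipeline_cycles(num_inputs, depth)
    return num_inputs / cycles if cycles else 0.0
```

With a hypothetical depth of 6, a single input sequence keeps the pipeline busy for only one cycle in six, while 1000 consecutive sequences bring utilization above 99%. This is why the speedup depends on having many consecutive transforms to compute.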

9 Formal analysis Radix 2 Cooley-Tukey algorithm operates as follows:
The input sequence is divided into two subsequences of equal length: the even-indexed elements form the first subsequence and the odd-indexed elements form the second. The DFT of the whole sequence is then calculated from the DFTs of the two subsequences.
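The even/odd split described on this slide can be sketched as a short recursive function. This is a minimal textbook version in Python, not the C++ reference implementation or the Maxeler kernel from the paper. The name `fft_radix2` is an illustrative assumption, and the input length is assumed to be a power of two.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])  # DFT of the even-indexed samples
    odd = fft_radix2(x[1::2])   # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor for index k
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out
```

Each level of recursion halves the problem and does O(N) work to combine the halves, which gives the O(N log N) running time.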

10 Formal analysis
A detailed derivation of the following formula is given in the paper.
The DFT of the even subsequence is denoted by E_k, the DFT of the odd subsequence is denoted by O_k, and e^(-2πik/N) is denoted by W_N^k.
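The formula itself did not survive in this transcript. For orientation, the standard radix-2 decimation-in-time identity, written in the notation defined on this slide, is sketched below. This is the textbook form and is assumed, not confirmed, to match the formula derived in the paper.

```latex
\begin{aligned}
X_k &= \sum_{n=0}^{N-1} x_n \, e^{-2\pi i\, nk/N} \\
    &= \sum_{m=0}^{N/2-1} x_{2m} \, e^{-2\pi i\, mk/(N/2)}
     + e^{-2\pi i\, k/N} \sum_{m=0}^{N/2-1} x_{2m+1} \, e^{-2\pi i\, mk/(N/2)} \\
    &= E_k + W_N^k \, O_k, \\
X_{k+N/2} &= E_k - W_N^k \, O_k, \qquad 0 \le k < N/2 .
\end{aligned}
```

The second line follows because E_k and O_k are periodic in k with period N/2, while W_N^{k+N/2} = -W_N^k.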

11 Illustration of pipelined execution of radix 2 algorithm.

12 Measurement and analysis of the performance of the proposed implementation
Types of performed experiments:
Calculation of the Fourier transform of 100, 1.000, , and consecutive input sequences of length 8, 16, 32 and 64 points.
Maxeler implementation vs reference CPU implementation
Maxeler implementation vs best publicly available implementations
Speaker notes: The following tests were carried out to measure the performance of the proposed implementation.

13 Generated graphs:
Maxeler vs best publicly available implementations of the FFT algorithm.
Run-times, depending on the number of consecutive FFT calculations (for input sequences of length 8, 16, 32 and 64).
Acceleration obtained using the Maxeler machine, compared to CPU execution, depending on the number of consecutive FFT calculations (for input sequences of length 8, 16, 32 and 64).

14 The average execution time in seconds of publicly available algorithms for calculating FFT on different architectures, for an input sequence of 8 elements.
Speaker notes: There are four graphs like this one (for input sequences of length 8, 16, 32 and 64).

15 Acceleration of the Maxeler implementation compared to the CPU implementation, depending on the number of elements in the input sequence.

16 Computation time of consecutive fast Fourier transforms, expressed in seconds, depending on the number of consecutive calculations.
Speaker notes: There are four graphs like this one (for input sequences of length 8, 16, 32 and 64).

17 Acceleration of the Maxeler implementation compared to the CPU implementation, depending on the number of consecutive calculations.
Speaker notes: There are four graphs like this one (for input sequences of length 8, 16, 32 and 64).

18 Analysis of scalability and bottlenecks of proposed solution
Transfer of data to and from the Maxeler card
Limited number of hardware resources on a single Maxeler card
Limited number of Maxeler cards
Speaker notes: The proposed implementation has bottlenecks and scalability problems that appear as the length of the input sequence grows.
Data transfer to and from the Maxeler card cannot keep up with the speed of the card. In this case the bottleneck is the I/O controller. The amount of data that has to be sent to the card and returned from it grows linearly with the length of the input sequence.
A single Maxeler card does not have enough hardware resources to synthesize one stage of the radix 2 algorithm. In this case the bottleneck is the number of resources available on a single card. The amount of hardware resources needed to synthesize one stage of the radix 2 algorithm grows linearly with the length of the input sequence.
The Maxeler machine does not have enough cards to synthesize the whole radix 2 algorithm. In this case the bottleneck is the total number of resources available in a given Maxeler system. To make the most of every available card, as many stages as will fit are synthesized on each card. The number of stages of the radix 2 algorithm grows as log N with the length of the input sequence. The amount of hardware resources needed to synthesize the whole radix 2 algorithm grows as N log N with the length of the input sequence.
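The growth rates named in the notes can be made concrete by counting stages and butterflies for a radix 2 FFT. This is a rough sketch: using the butterfly count as a stand-in for hardware resources is an assumption, and the function name and dictionary keys are hypothetical. Actual FPGA resource usage depends on details the slides do not give.

```python
def radix2_resources(n):
    """Stage and butterfly counts for an N-point radix-2 FFT (N a power of two)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    stages = n.bit_length() - 1        # log2(N) stages
    per_stage = n // 2                 # N/2 butterflies in every stage
    return {
        "stages": stages,
        "butterflies_per_stage": per_stage,            # grows linearly with N
        "total_butterflies": stages * per_stage,       # grows as N log N
        "values_in_plus_out": 2 * n,                   # transfer grows linearly with N
    }
```

For the two synthesized sizes mentioned in the talk, this gives 5 stages and 80 butterflies for N = 32, and 6 stages and 192 butterflies for N = 64.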

19 Analysis of implementation
The Maxeler implementation of the Cooley-Tukey algorithm consists of:
Rearrangement of the input sequence into bit-reversed order, and
The radix 2 algorithm.
Speaker notes: The paper gives a reference implementation that runs on a microprocessor, from which the Maxeler implementation was obtained with minimal transformation. Two kernels: the Bit Reverse kernel and the radix 2 kernel. Explain why bit-reversed order is needed.
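The reason for the rearrangement is that repeatedly splitting a sequence into even-indexed and odd-indexed halves leaves the samples in bit-reversed index order, so an iterative radix 2 FFT that combines neighbouring elements first has to start from that order. The permutation can be sketched as follows. This is a minimal Python illustration, not the Bit Reverse kernel from the paper, and the function name is an assumption.

```python
def bit_reverse_permute(x):
    """Return x reordered so that element i moves to the bit-reversed index of i."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    out = [None] * n
    for i in range(n):
        reversed_index = 0
        value = i
        for _ in range(bits):
            # Shift the lowest bit of value into reversed_index.
            reversed_index = (reversed_index << 1) | (value & 1)
            value >>= 1
        out[reversed_index] = x[i]
    return out


# For 8 points, index 1 (001) swaps with 4 (100) and 3 (011) with 6 (110).
order = bit_reverse_permute(list(range(8)))
```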

20 Illustration of the kernel
Speaker notes: The kernel contains both the bit reverse and the radix 2 algorithm.

21 Implementation details
Two input and two output streams.
These streams are of type arrayType:
DFEType floatType = dfeFloat(8, 24);
DFEArrayType<DFEVar> arrayType = new DFEArrayType<DFEVar>(floatType, n);
The coefficients W_N^k are not calculated on the Maxeler machine.
Parameters: N, first_level, last_level
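One way to read these details is that the W_N^k coefficients are tabulated on the host and passed in, and that first_level and last_level select which butterfly stages a given card executes. The slide names the parameters but does not define them, so this reading is an assumption. The Python sketch below models that idea on the CPU. The function names `twiddle_table` and `run_levels` are hypothetical, and nothing here is MaxJ code or the kernel from the paper.

```python
import cmath


def twiddle_table(n):
    """Host-side table of W_N^k = exp(-2*pi*i*k/N) for k = 0 .. N/2 - 1."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]


def run_levels(data, twiddles, first_level, last_level):
    """Apply butterfly levels first_level..last_level (1-based, inclusive).

    data is assumed to hold the sequence in bit-reversed order before level 1.
    """
    n = len(data)
    out = list(data)
    for level in range(first_level, last_level + 1):
        size = 1 << level       # points combined per group at this level
        half = size // 2
        stride = n // size      # W_size^j equals W_N^(j * stride)
        for start in range(0, n, size):
            for j in range(half):
                a = out[start + j]
                b = out[start + j + half] * twiddles[j * stride]
                out[start + j] = a + b
                out[start + j + half] = a - b
    return out


# Illustrative 4-point run split the way two cards might split the work:
# the input [0, 1, 2, 3] in bit-reversed order is [0, 2, 1, 3].
table = twiddle_table(4)
partial = run_levels([0, 2, 1, 3], table, 1, 1)   # first "card": level 1
spectrum = run_levels(partial, table, 2, 2)       # second "card": level 2
```

Splitting the levels between calls gives the same result as running them all in one call, which is the property a multi-card arrangement would rely on.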

22 Conclusion
It is shown that the proposed solution performs as expected and that it works correctly.
The performance of the proposed solution is better than that of any publicly available implementation of the Fast Fourier Transform.
To achieve these speedups, consecutive calculations of the Fast Fourier Transform are needed.

23 Thank you for attention
Q/A

