Acceleration of Cooley-Tukey algorithm using Maxeler machine

Acceleration of the Cooley-Tukey algorithm using a Maxeler machine. Author: Nemanja Trifunović. Mentor: Professor Dr. Veljko Milutinović.

Introduction. Cooley-Tukey algorithm: Fast Fourier Transform; divide and conquer; uses: digital signal processing, telecommunications, the analysis of sound signals, … Maxeler platform: data flow (vs. control flow); FPGA.
Notes: The paper presents one implementation of the Cooley-Tukey algorithm in C++ and its acceleration using a Maxeler machine. What is the Cooley-Tukey algorithm? What is the Maxeler platform?
Cooley-Tukey: an algorithm that computes the fast Fourier transform. It computes the discrete Fourier transform (DFT) of an input sequence of samples, i.e. it converts the input signal from the time domain to the frequency domain. It works on the divide-and-conquer principle: it splits the input sequence into subsequences and determines the DFT of the whole sequence from their DFTs. It is the most widely used algorithm for computing the fast Fourier transform, with applications in many areas of electrical engineering such as digital signal processing, telecommunications, the analysis of sound signals, ...
Maxeler: Maxeler machines are based on a dataflow architecture and are implemented in FPGA technology. FPGA stands for Field-Programmable Gate Array: the chip is programmable in the field, i.e. the connections inside the chip can be changed after it has been delivered to the end user. The dataflow architecture differs greatly from John von Neumann's control-flow architecture, which is used in microprocessors. As a result, some problems are better suited to a control-flow architecture and others to a dataflow architecture. Usually only part of a program is accelerated on the Maxeler platform. The accelerated part is called the kernel of the program and is usually a program loop.
Figure: Example of Fourier transformation. (Source: https://en.wikipedia.org/wiki/File:Rectangular_function.svg; https://en.wikipedia.org/wiki/File:Sinc_function_(normalized).svg; the illustration is published under a Creative Commons licence.) 1/22
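
As a rough illustration of what "converting from the time domain to the frequency domain" means, the sketch below evaluates the DFT definition directly, in O(N^2) time, on a short rectangular pulse. The function name `dft`, the pulse width and the sequence length are illustrative choices, not taken from the presentation.

```python
import cmath

def dft(x):
    """Naive O(N^2) discrete Fourier transform, straight from the definition."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# Illustrative input: a rectangular pulse of 4 ones in a 16-sample window.
pulse = [1.0] * 4 + [0.0] * 12
spectrum = dft(pulse)

# The magnitudes follow a sampled sinc-like envelope: largest at k = 0,
# with zeros at multiples of N / pulse_width = 4.
magnitudes = [abs(c) for c in spectrum]
```

The Cooley-Tukey algorithm produces the same values while avoiding the quadratic number of operations.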

Problem statement. Design and implementation of: the fastest possible system for calculating the Fast Fourier Transform using a Maxeler machine; a system that will outperform the existing solutions to this problem.
Notes: As already stated, the paper deals with the problem of accelerating the Cooley-Tukey algorithm using a Maxeler machine. The goals of the paper are ... 2/22

Problem statement. Benefits of calculating the Fast Fourier Transform with Maxeler machines. Benefits: higher speed of calculation; lower power consumption; lower space consumption. Conditions: huge amounts of data.
Notes: What was the motivation for implementing the Cooley-Tukey algorithm on the Maxeler platform? What do we gain from it? The benefits we can expect are … compared to an equivalent microprocessor implementation. To obtain the listed benefits, the following condition must be met. This will be explained in detail later. 3/22

Conditions and assumptions. Maxeler machine used: two Maxeler cards of type MAX3424A. In experiments with multiprocessor systems only one processor core was used.
Notes: The work was done on the Maxeler machine located at the Mathematical Institute. ... Using this machine, an implementation of the algorithm was synthesized for 64 samples in the input sequence when two cards are used, and for 32 samples in the input sequence when one card is used. Multiprocessor experiments: my own and those of other authors. 4/22

Overview of existing solutions. FFT algorithms: Prime-factor, Bruun's, Rader's, Winograd, Bluestein's, … Time complexity: O(N log N). Performance comparison of publicly available implementations: Matteo Frigo and Steven G. Johnson (MIT).
Notes: These algorithms use different mathematical methods in their computations, ranging from number theory through numerical mathematics to graph theory. What do the algorithms have in common? (Their complexity.) Although there is no proof that a faster algorithm cannot be constructed, no such algorithm has been constructed yet. Free and commercial implementations: fftw3, a commercial version of which is licensed by MIT. In their tests Matteo and Steven used only one processor core. 5/22

Notes: Explain the mflops scale: mflops = 5 N log2(N) / (time needed to compute one FFT, in microseconds).
Figure: Illustration of Matteo Frigo's and Steven G. Johnson's experiments. (Source: http://www.fftw.org/speed/Pentium4-3.60GHz-icc) 6/22
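
The mflops scale can be sketched as a small helper. It follows the FFTW benchmark convention of 5 N log2(N) nominal operations divided by the time for one FFT in microseconds; the function name `fft_mflops` and the sample timing are hypothetical values chosen for illustration, not measurements from the presentation.

```python
import math

def fft_mflops(n, seconds_per_fft):
    """Nominal mflops of one length-n FFT: 5 * N * log2(N) / time in microseconds."""
    microseconds = seconds_per_fft * 1e6
    return 5 * n * math.log2(n) / microseconds

# Hypothetical timing: a 1024-point FFT that takes 10.24 microseconds.
example = fft_mflops(1024, 10.24e-6)
```

Because the operation count is nominal rather than measured, the scale compares implementations of the same transform size rather than reporting the true number of floating-point operations executed.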

The proposed solution. Parallelized radix-2 algorithm. Pipeline of depth O(log N), where N is the length of the input sequence. Latency is proportional to the depth of the pipeline. After the initial delay (latency), one result is produced in every cycle.
Notes: To achieve a speedup, the pipeline has to be filled and kept busy. If there is not enough data, the machine is poorly utilized. 7/22
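
Why the pipeline needs many consecutive transforms can be seen with a simple cycle-count model. This is a sketch under a simplifying assumption that is not from the slides: each of the log2(N) stages takes exactly one cycle. The names `pipeline_cycles` and `utilization` are illustrative.

```python
def pipeline_cycles(n, transforms):
    """Cycles to push `transforms` length-n FFTs through a log2(n)-stage pipeline.

    Assumption (not from the slides): every radix-2 stage takes exactly one cycle.
    """
    depth = n.bit_length() - 1        # log2(n) stages for a power-of-two n
    return depth + (transforms - 1)   # fill latency, then one result per cycle

def utilization(n, transforms):
    """Fraction of cycles in which the pipeline delivers a result."""
    return transforms / pipeline_cycles(n, transforms)

single = utilization(64, 1)           # one transform: the latency dominates
batch = utilization(64, 1_000_000)    # long stream: close to one result per cycle
```

Under this model a single 64-point transform keeps the pipeline productive for only one cycle in six, while a stream of a million transforms brings the utilization close to 1.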

Formal analysis. The radix-2 Cooley-Tukey algorithm operates as follows: the input sequence is divided into two subsequences of equal length, the first made up of the even-indexed elements and the second of the odd-indexed elements. The DFT of the whole sequence is then calculated from the DFTs of the two subsequences. 8/22

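Formal analysis. A detailed derivation of the following formula is given in the paper: $X_k = E_k + W_N^k O_k$ and $X_{k+N/2} = E_k - W_N^k O_k$ for $k = 0, \dots, N/2 - 1$. The DFT of the even subsequence is denoted by $E_k$, the DFT of the odd subsequence by $O_k$, and $e^{-2\pi i k/N}$ by $W_N^k$. 9/22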
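
The even/odd split and recombination described on these two slides can be sketched as a short recursive function. This is a minimal textbook version for power-of-two lengths, not the author's C++ reference implementation; the names `fft_radix2` and `dft_reference` are illustrative.

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])   # E_k: DFT of the even-indexed samples
    odd = fft_radix2(x[1::2])    # O_k: DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # W_N^k
        out[k] = even[k] + w * odd[k]           # X_k
        out[k + n // 2] = even[k] - w * odd[k]  # X_{k+N/2}
    return out

def dft_reference(x):
    """Naive O(N^2) DFT, used only to check the fast version."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

samples = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, 0.5]
fast = fft_radix2(samples)
slow = dft_reference(samples)
max_error = max(abs(a - b) for a, b in zip(fast, slow))
```

Each level of the recursion does O(N) work and there are log2(N) levels, which is where the O(N log N) complexity comes from.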

Illustration of pipelined execution of radix 2 algorithm. 10/22

Measurement and analysis of the performance of the proposed implementation. Types of experiments performed: calculation of the Fourier transform of 100, 1,000, 10,000, 1,000,000 and 10,000,000 consecutive input sequences of length 8, 16, 32 and 64 points; Maxeler implementation vs. reference CPU implementation; Maxeler implementation vs. best publicly available implementations.
Notes: The following tests were carried out to measure the performance of the proposed implementation. 11/22

Generated graphs: Maxeler vs best publicly available implementations of FFT algorithm. Run-times, depending on the number of consecutive FFT calculations (for input sequences of length 8, 16, 32 and 64). Acceleration obtained using Maxeler machine, compared to the CPU execution, depending on the number of consecutive FFT calculations (for input sequences of length 8, 16, 32 and 64). 12/22

Notes: Four graphs of this kind (for input sequences of length 8, 16, 32 and 64).
Figure: The average execution time in seconds of publicly available algorithms for calculating FFT on different architectures, for an input sequence of 8 elements. 13/22

Figure: Acceleration of the Maxeler implementation compared to the CPU implementation, depending on the number of elements in the input sequence. 14/22

Notes: Four graphs of this kind (for input sequences of length 8, 16, 32 and 64).
Figure: Computation time of consecutive fast Fourier transforms, in seconds, depending on the number of consecutive calculations. 15/22

Notes: Four graphs of this kind (for input sequences of length 8, 16, 32 and 64).
Figure: Acceleration of the Maxeler implementation compared to the CPU implementation, depending on the number of consecutive calculations. 16/22

Analysis of scalability and bottlenecks of the proposed solution. Transfer of data to and from the Maxeler card. Limited number of hardware resources on a single Maxeler card. Limited number of Maxeler cards.
Notes: The proposed implementation has bottlenecks and scalability problems that appear as the length of the input sequence grows.
Data transfer to and from the Maxeler card cannot keep up with the speed of the card itself. Here the bottleneck is the I/O controller. The amount of data that has to be sent to the card and returned from it grows linearly with the length of the input sequence.
A single Maxeler card does not have enough hardware resources to synthesize one stage of the radix-2 algorithm. Here the bottleneck is the number of resources available on a single card. The amount of hardware needed to synthesize one stage of the radix-2 algorithm grows linearly with the length of the input sequence.
The Maxeler machine does not have enough cards to synthesize the whole radix-2 algorithm. Here the bottleneck is the total number of resources available in a given Maxeler system. To make the most of every available card, as many stages as will fit are synthesized on each card. The number of stages of the radix-2 algorithm grows as log N with the length of the input sequence, and the amount of hardware needed to synthesize the whole algorithm grows as N log N. 17/22
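
The growth rates named in these notes can be made concrete by counting radix-2 butterflies. Using the butterfly count as a stand-in for hardware resources is an assumption made here for illustration; the presentation does not give resource figures, and the name `radix2_resources` is hypothetical.

```python
def radix2_resources(n):
    """Return (stages, butterflies per stage, total butterflies, samples moved).

    Assumption: hardware cost is proportional to the number of butterflies,
    and one transform moves n samples to the card and n results back.
    """
    stages = n.bit_length() - 1       # log2(n) stages
    per_stage = n // 2                # grows linearly with n
    total = stages * per_stage        # grows as n * log2(n)
    samples_moved = 2 * n             # grows linearly with n
    return stages, per_stage, total, samples_moved

# Sequence lengths used in the experiments.
table = {n: radix2_resources(n) for n in (8, 16, 32, 64)}
```

Going from 8 to 64 points doubles the number of stages from 3 to 6 but raises the butterfly count sixteenfold, from 12 to 192, which is consistent with the longer sequences needing a second card.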

Analysis of implementation. The Maxeler implementation of the Cooley-Tukey algorithm consists of: rearrangement of the input sequence into bit-reversed order, and the radix-2 algorithm.
Notes: The paper gives a reference implementation that runs on a microprocessor, from which the Maxeler implementation was obtained by a minimal transformation. Two kernels: the Bit Reverse kernel and the radix-2 kernel. Explain why the bit-reversed order is needed. 18/22
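
One way to see why the bit-reversed order is needed: repeatedly splitting a sequence into even- and odd-indexed halves, as the radix-2 recursion does, leaves the samples in exactly that order, so an iterative version that starts from the smallest sub-transforms has to begin with the input arranged this way. The sketch below shows both views; the function names are illustrative and the code is not the author's Bit Reverse kernel.

```python
def bit_reverse_indices(n):
    """Index permutation that reverses the log2(n)-bit binary form of each index."""
    bits = n.bit_length() - 1
    result = []
    for i in range(n):
        r = 0
        for b in range(bits):
            if i & (1 << b):
                r |= 1 << (bits - 1 - b)
        result.append(r)
    return result

def even_odd_leaf_order(indices):
    """Order in which the recursive even/odd split reaches single elements."""
    if len(indices) == 1:
        return list(indices)
    return even_odd_leaf_order(indices[0::2]) + even_odd_leaf_order(indices[1::2])

order = bit_reverse_indices(8)
leaves = even_odd_leaf_order(list(range(8)))
```

For eight points both computations give the order 0, 4, 2, 6, 1, 5, 3, 7.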

Notes: The kernel contains both the bit-reverse and the radix-2 algorithm.
Figure: Illustration of the kernel. 19/22

Implementation details. Two input and two output streams. These streams are of type arrayType:
    DFEType floatType = dfeFloat(8, 24);
    DFEArrayType<DFEVar> arrayType = new DFEArrayType<DFEVar>(floatType, n);
The factors W_N^k (twiddle factors) are not calculated on the Maxeler machine. Parameters: N, first_level, last_level. 20/22
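
A host-side sketch of how precomputed factors W_N^k and a range of levels might fit together is given below. The meaning of `first_level` and `last_level` is an assumption made here (an inclusive, 0-based range of radix-2 stages, so that different cards can run different stages); the slides list the parameters without defining them. `twiddle_table` and `run_levels` are hypothetical names, and this is plain Python, not MaxJ kernel code.

```python
import cmath

def twiddle_table(n):
    """Precompute W_N^k = exp(-2*pi*i*k/N) for k = 0 .. N/2 - 1 (done off the card)."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]

def run_levels(data, twiddles, first_level, last_level):
    """Apply radix-2 stages first_level..last_level (assumed inclusive, 0-based).

    `data` must already be in bit-reversed order when level 0 is applied.
    """
    n = len(data)
    out = [complex(v) for v in data]
    for level in range(first_level, last_level + 1):
        half = 1 << level            # distance between the two butterfly inputs
        step = n // (2 * half)       # stride into the length-N twiddle table
        for start in range(0, n, 2 * half):
            for k in range(half):
                a = out[start + k]
                b = out[start + k + half] * twiddles[k * step]
                out[start + k] = a + b
                out[start + k + half] = a - b
    return out

samples = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, 0.5]
bit_reversed = [samples[i] for i in (0, 4, 2, 6, 1, 5, 3, 7)]
w = twiddle_table(8)

# Hypothetical split: levels 0-1 on one card, level 2 on a second card.
card_one = run_levels(bit_reversed, w, 0, 1)
card_two = run_levels(card_one, w, 2, 2)
full = run_levels(bit_reversed, w, 0, 2)
```

Running the levels in two parts gives the same result as running them all at once, which is the property a multi-card split relies on.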

Conclusion. It is shown that the proposed solution has the expected performance and that it works correctly. The performance of the proposed solution is better than that of any publicly available implementation of the Fast Fourier Transform. To achieve these speedups, many consecutive Fast Fourier Transform calculations have to be performed. 21/22

Thank you for your attention. Q/A