Optimizing the Fast Fourier Transform on a Multi-core Architecture

Optimizing the Fast Fourier Transform on a Multi-core Architecture. Paper by Long Chen, Ziang Hu, Junmin Lin, and Guang R. Gao; IEEE International Parallel and Distributed Processing Symposium, 2007. Presentation by Tao Liu.

Outline: FFT Introduction; FFT Implementation: State of the Art; This Paper; Summary; Relationship to Our Project; Q&A

FFT Introduction
Media applications (2D) and communication applications (1D). Original pictures and video require huge amounts of data; the DFT can reconstruct the picture from far fewer coefficients (example image: 256×256 pixels).

FFT Introduction (cont.)
Fundamental mathematics: the 1D DFT, and the 2D N×N DFT, which reduces to 2N one-dimensional N-point DFTs (N rows, then N columns).
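For reference, the standard definitions behind the slide (textbook formulas, not specific to the paper):

```latex
% 1D DFT of a length-N sequence x_n
X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i kn/N}, \qquad k = 0, \ldots, N-1.

% 2D DFT of an N x N array: separable into row and column passes
X_{k_1,k_2} = \sum_{n_1=0}^{N-1} e^{-2\pi i k_1 n_1/N}
    \left( \sum_{n_2=0}^{N-1} x_{n_1,n_2} \, e^{-2\pi i k_2 n_2/N} \right).
```

The inner sum is N row DFTs and the outer sum is N column DFTs, which is why the 2D case parallelizes so naturally.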

FFT Introduction (cont.)
8-point 1D FFT (the slide shows the radix-2 butterfly diagram: log2(8) = 3 stages of butterfly operations).
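As a concrete companion to the diagram, here is a minimal generic radix-2 implementation in C (a textbook sketch, not the paper's C64 code; for n = 8 it performs exactly the 3-stage butterfly network):

```c
#include <complex.h>
#include <math.h>

/* In-place iterative radix-2 DIT FFT; n must be a power of two. */
static void fft(double complex *x, int n)
{
    /* Bit-reversal permutation puts inputs in butterfly order. */
    for (int i = 1, j = 0; i < n; i++) {
        int bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) {
            double complex t = x[i];
            x[i] = x[j];
            x[j] = t;
        }
    }
    /* log2(n) stages; each stage applies n/2 butterflies. */
    for (int len = 2; len <= n; len <<= 1) {
        double complex wlen = cexp(-2.0 * M_PI * I / len);
        for (int i = 0; i < n; i += len) {
            double complex w = 1.0;
            for (int k = 0; k < len / 2; k++) {
                double complex u = x[i + k];
                double complex v = w * x[i + k + len / 2];
                x[i + k]           = u + v;
                x[i + k + len / 2] = u - v;
                w *= wlen;          /* advance the twiddle factor */
            }
        }
    }
}
```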

Outline: FFT Introduction; FFT Implementation: State of the Art; This Paper; Summary; Relationship to Our Project; Q&A

FFT Implementation: State of the Art
- Hypercube architectures [9, 13, 17]
- Array architectures [12]
- Mesh architectures [18]
- An instructive shared-memory FFT for the Alliant FX/8 [19]
- Additional performance studies of shared-memory FFTs [3, 16]
- The importance of the memory hierarchy to an effective FFT implementation [4]
- Local-memory FFT optimization on the CRAY-2 [5]
- EARTH [20], a fine-grained dataflow architecture
- etc.

Outline: FFT Introduction; FFT Implementation: State of the Art; This Paper; Summary; Relationship to Our Project; Q&A

About This Paper
Platform: IBM Cyclops-64 (C64). Contribution: optimized 1D and 2D FFT implementations.

IBM Cyclops-64 (C64)
- 80 64-bit processors; each processor has 1 floating-point unit (FPU) and 2 thread units (TUs)
- 64 64-bit registers and 32 KB SRAM
- 16 shared instruction caches (ICs)
- 4 off-chip DRAM controllers
- Crossbar network with a large number of ports (96×96), sustaining 4 GB/s per port, i.e. 384 GB/s in total
- Memory hierarchy: scratch-pad (SP) memory, on-chip global interleaved memory (GM), and off-chip DRAM
- Gigabit Ethernet controller and other I/O devices
- etc.

IBM Cyclops-64 (C64): chip diagram (80 cores, 160 hardware threads: 2 thread units per core).

1D FFT
- Base parallel implementation
- Optimal work unit
- Special handling of the first stages
- Eliminating unnecessary memory operations
- Loop unrolling
- Register renaming and instruction scheduling (done manually)
- Memory-hierarchy-aware compilation

1D FFT (Cont.)

Optimal Work Unit
The smallest work unit is 1 butterfly operation:
1. read the 2-point data and the twiddle factor from GM,
2. perform a butterfly operation on them, then
3. write the 2-point results back to GM.
In the base implementation, one butterfly operation is assigned to a thread at a time (see the sketch below).
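A minimal C sketch of that three-step butterfly (plain arrays stand in for GM; the names `butterfly` and `gm` are illustrative, not from the paper):

```c
#include <complex.h>

/* One radix-2 butterfly over points i0, i1 with twiddle factor w:
 * read two points, combine them, write two results back. */
static void butterfly(double complex *gm, int i0, int i1, double complex w)
{
    double complex x0 = gm[i0];   /* step 1: read from (simulated) GM */
    double complex x1 = gm[i1];
    double complex t  = w * x1;   /* step 2: twiddle multiply + combine */
    gm[i0] = x0 + t;              /* step 3: write results back */
    gm[i1] = x0 - t;
}
```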

Optimal Work Unit (figure): work units of 1 butterfly operation vs. 4 butterfly operations.

Optimal Work Unit
Number of cycles per butterfly operation vs. the size of the work unit: an 8-point work unit is the best. Larger work units cause register spilling (a 16-point unit needs 112 registers, more than the 64 available).
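To see why register pressure grows with work-unit size, here is a hedged sketch of an unrolled 4-butterfly work unit that keeps every operand in locals (and thus, ideally, in registers); the structure is illustrative, not the paper's code:

```c
#include <complex.h>

/* Four independent butterflies, fully unrolled: 8 complex points plus
 * 4 twiddles = 24 floating-point values live at once, before counting
 * temporaries and addresses.  Doubling the work unit roughly doubles
 * this demand, which is how large units exhaust the register file. */
static void butterfly_x4(double complex *gm, const int i0[4],
                         const int i1[4], const double complex w[4])
{
    /* bulk read phase */
    double complex a0 = gm[i0[0]], b0 = gm[i1[0]];
    double complex a1 = gm[i0[1]], b1 = gm[i1[1]];
    double complex a2 = gm[i0[2]], b2 = gm[i1[2]];
    double complex a3 = gm[i0[3]], b3 = gm[i1[3]];
    /* bulk compute phase */
    double complex t0 = w[0] * b0, t1 = w[1] * b1;
    double complex t2 = w[2] * b2, t3 = w[3] * b3;
    /* bulk write phase */
    gm[i0[0]] = a0 + t0;  gm[i1[0]] = a0 - t0;
    gm[i0[1]] = a1 + t1;  gm[i1[1]] = a1 - t1;
    gm[i0[2]] = a2 + t2;  gm[i1[2]] = a2 - t2;
    gm[i0[3]] = a3 + t3;  gm[i1[3]] = a3 - t3;
}
```

Grouping reads, computes, and writes also lets independent memory operations overlap, which is the point of the loop-unrolling and instruction-scheduling optimizations listed earlier.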

2D FFT
- Base parallel implementation
- Load balancing
- Work distribution and data reuse
- Memory-hierarchy-aware compilation

2D FFT – Base Parallel Implementation
All row FFTs are independent of each other, so they can be computed in parallel; the column FFTs then follow the row FFTs. Work units are distributed to threads in round-robin fashion, so some threads remain idle at the tail (e.g., with 180 rows and 160 threads, only 20 threads get a second row). A sketch follows.
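A minimal pthreads sketch of the round-robin row distribution (pthreads and the stub `fft_1d` are stand-ins for the C64 thread units and the real 1D kernel; all names here are illustrative):

```c
#include <pthread.h>

#define N        256   /* 256x256, as in the paper's 2D experiment */
#define NTHREADS 16    /* C64 offers 160 hardware threads; fewer here */

static double image[N][2 * N];          /* interleaved re/im per row */

/* Placeholder for a real in-place 1D FFT over one row. */
static void fft_1d(double *row, int n) { (void)row; (void)n; }

/* Round-robin: thread tid handles rows tid, tid + NTHREADS, ...
 * With 180 rows on 160 threads, only 20 threads would get a second
 * row -- the idle-thread imbalance noted on the slide. */
static void *row_worker(void *arg)
{
    long tid = (long)arg;
    for (int r = (int)tid; r < N; r += NTHREADS)
        fft_1d(image[r], N);
    return NULL;
}

int main(void)
{
    pthread_t th[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&th[t], NULL, row_worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(th[t], NULL);
    /* column FFTs would follow the same pattern after a transpose */
    return 0;
}
```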

2D FFT – Load Balancing
Instead of an entire row/column 1D FFT as the work unit, split it into small tasks: an 8-point work unit (8 inputs, 8 outputs). The trade-off is that more barriers are needed to synchronize threads working on the same row/column FFT.
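A hedged sketch of that extra synchronization cost, again with pthreads standing in for C64 primitives: every FFT stage of a shared row must complete before the next one starts:

```c
#include <pthread.h>

/* Initialized elsewhere with the number of cooperating threads:
 * pthread_barrier_init(&stage_barrier, NULL, nthreads); */
static pthread_barrier_t stage_barrier;

/* Each thread runs its 8-point work units for stage s, then waits:
 * stage s+1 reads the outputs of stage s, so no thread may race ahead. */
static void run_row_fft_stages(int nstages)
{
    for (int s = 0; s < nstages; s++) {
        /* ... process this thread's 8-point work units of stage s ... */
        pthread_barrier_wait(&stage_barrier);   /* the added barrier cost */
    }
}
```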

Results of This Paper: 1D FFT, 2^16 points (performance chart on the slide).

Results of This Paper: 2D FFT, 256×256 (performance chart on the slide).

Outline: FFT Introduction; FFT Implementation: State of the Art; This Paper; Summary; Relationship to Our Project; Q&A

Summary
Introduction: implemented and optimized the FFT on a specific architecture, the IBM Cyclops-64.
Strengths: several optimization techniques are proposed; achieved performance is over 20 Gflops for both the 1D and 2D FFT, about 4 times that of an Intel Xeon (Pentium 4) processor (5.5 Gflops). The essential idea: reduce memory operations.
Weaknesses: the fast scratch-pad (SP) memory of each thread unit has not been fully explored, and large FFT sizes whose data cannot fit in the on-chip memories remain a memory problem.

Relationship to Our Project
We implement the FFT on an FPGA platform, where the architecture is configurable: number of cores, memory/cache, performance/power, etc. What we can reuse from this paper: task partitioning and distribution, data reuse, etc.

Q&A

References
[1] R. Agarwal and J. Cooley. Vectorized mixed radix discrete Fourier transform algorithms. In Proceedings of the IEEE, volume 75, pages 1283–1292, 1987.
[2] M. Ashworth and A. G. Lyne. A segmented FFT algorithm for vector computers. Parallel Computing, 6:217–224, 1988.
[3] A. Averbuch, E. Gabber, B. Gordissky, and Y. Medan. A parallel FFT on an MIMD machine. Parallel Computing, 15:61–74, 1990.
[4] D. H. Bailey. FFTs in external or hierarchical memory. In Proceedings of Supercomputing '89, pages 234–242, 1989.
[5] D. A. Carlson. Using local memory to boost the performance of FFT algorithms on the CRAY-2 supercomputer. J. Supercomput., 4:345–356, 1990.
[6] J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comput., 19:297–301, 1965.
[7] J. del Cuvillo, W. Zhu, Z. Hu, and G. R. Gao. FAST: A functionally accurate simulation toolset for the Cyclops64 cellular architecture. In Workshop on Modeling, Benchmarking, and Simulation (MoBS 2005), in conjunction with the 32nd Annual International Symposium on Computer Architecture (ISCA 2005), Madison, Wisconsin, June 2005.
[8] M. Frigo and S. Johnson. FFTW.
[9] A. G. and P. I. Parallel implementation of 2-D FFT algorithms on a hypercube. In Proc. Parallel Computing Action, Workshop ISPRA, 1990.
[10] A. Gupta and V. Kumar. On the scalability of FFT on parallel computers. In FMPSC: Frontiers of Massively Parallel Scientific Computation. NASA, IEEE Computer Society Press, 1990.
[11] J. Johnson, R. Johnson, D. Rodriguez, and R. Tolimieri. A methodology for designing, modifying, and implementing Fourier transform algorithms on various architectures. Circuits, Systems, and Signal Processing, 9:449–500, 1990.
[12] S. Johnsson and D. Cohen. Computational arrays for the discrete Fourier transform. 1981.
[13] S. L. Johnsson and R. L. Krawitz. Cooley–Tukey FFT on the Connection Machine. Parallel Computing, 18(11):1201–1221, 1992.
[14] C. Van Loan. Computational Frameworks for the Fast Fourier Transform. SIAM, Philadelphia, 1992.
[15] H. Nguyen and L. K. John. Exploiting SIMD parallelism in DSP and multimedia algorithms using the AltiVec technology. In International Conference on Supercomputing, pages 11–20, 1999.
[16] A. Norton and A. J. Silberger. Parallelization and performance analysis of the Cooley–Tukey FFT algorithm for shared-memory architectures. IEEE Transactions on Computers, 36(5):581–591, 1987.
[17] S. L. Johnsson, R. L. Krawitz, D. M., and R. Frye. A radix-2 FFT on the Connection Machine. In Proceedings of Supercomputing '89, pages 809–819, 1989.
[18] V. Singh, V. Kumar, G. Agha, and C. Tomlinson. Scalability of parallel sorting on mesh multicomputers. In International Parallel Processing Symposium, pages 92–101, 1991.
[19] P. N. Swarztrauber. Multiprocessor FFTs. Parallel Computing, 5(1-2):197–210, 1987.
[20] K. B. Theobald. EARTH: An Efficient Architecture for Running Threads. PhD thesis, May 1999.
[21] P. Thulasiraman, K. B. Theobald, A. A. Khokhar, and G. R. Gao. Multithreaded algorithms for the fast Fourier transform. In ACM Symposium on Parallel Algorithms and Architectures, pages 176–185, 2000.