Accelerating PFA FFT: Performance Comparison

Presentation transcript:

Accelerating PFA FFT: Performance Comparison. Michael Perrone, Acie Nobles, Jizhu Lu. 2007.06.06

Outline: PFA FFT Overview & Experimental Results; Implementation; Vectorization. PFA FFT on Cell - M. Perrone, mpp@us.ibm.com

PFA FFT Algorithm Specifics
- Prime-factor FFT algorithm (PFA)
- 2D FFT
- Single precision
- Complex-to-complex
- Nominal size: 1K rows, 1600 points per row
- Factors implemented: 2, 3, 4, 5, 7, 8, 9, 11, 13, 16
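The prime-factor idea can be illustrated with a minimal sketch, assuming the standard Good-Thomas index mapping for a length that splits into two coprime factors. This is not the Cell implementation described in the slides: the function names `dft` and `pfa_fft` are hypothetical, the small DFTs are computed naively, and sizes with more than two coprime factors would apply the same mapping recursively.

```python
import cmath
from math import gcd


def dft(x):
    """Naive O(n^2) DFT, used here only as a small building block."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def pfa_fft(x, n1, n2):
    """Good-Thomas prime-factor DFT for len(x) == n1 * n2, gcd(n1, n2) == 1."""
    n = n1 * n2
    if len(x) != n or gcd(n1, n2) != 1:
        raise ValueError("need len(x) == n1 * n2 with coprime n1 and n2")

    # Input index map: element (a, b) of the grid is x[(a*n2 + b*n1) mod n].
    grid = [[x[(a * n2 + b * n1) % n] for b in range(n2)] for a in range(n1)]

    # Length-n2 DFT along each row, then length-n1 DFT along each column.
    # No twiddle factors are needed between the two stages.
    grid = [dft(row) for row in grid]
    cols = [dft([grid[a][b] for a in range(n1)]) for b in range(n2)]
    # cols[k2][k1] now holds the output for the index pair (k1, k2).

    # Output index map from the Chinese remainder theorem.
    inv2 = pow(n2, -1, n1)  # inverse of n2 modulo n1 (Python 3.8+)
    inv1 = pow(n1, -1, n2)  # inverse of n1 modulo n2
    out = [0j] * n
    for k1 in range(n1):
        for k2 in range(n2):
            out[(k1 * n2 * inv2 + k2 * n1 * inv1) % n] = cols[k2][k1]
    return out


# Usage: a length-15 transform built from length-3 and length-5 DFTs.
signal = [complex(i, -i) for i in range(15)]
spectrum = pfa_fft(signal, 3, 5)
```

The absence of twiddle factors between the stages is what distinguishes this mapping from a Cooley-Tukey split, and it is why the factors must be pairwise coprime.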

Performance Comparison: Cell vs Woodcrest, Cell vs Opteron
Execution Time Performance Comparison (40960 2D images, in seconds)

Matrix Size   Intel     AMD       3SW      3SWO     2SW      2SWO
364x240       16.47     38.8      6.63     4.74     5.56     5.31
616x308       45.92     135.59    11.86    8.21     9.5      9.05
840x462       146.22    246.09    24.3     16.96    18.71    17.83
1008x616      218.24    393.27    34.72    23.07    27.58    26.29
1260x840      416.56    559.05    59.71    39.94    50.84    48.38
1540x1008     687.79    995.49    86.16    57.65    79.1     75.66
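The relative speedups implied by the table can be computed directly from its numbers. The sketch below assumes that the Intel and AMD columns are the Woodcrest and Opteron baselines and that 3SW, 3SWO, 2SW and 2SWO are the Cell variants; the transcript does not spell out what those four labels stand for, so they are used only as opaque keys. The `speedup` helper is a hypothetical name.

```python
# Execution times in seconds, copied from the table above.
COLUMNS = ("Intel", "AMD", "3SW", "3SWO", "2SW", "2SWO")
ROWS = {
    "364x240": (16.47, 38.8, 6.63, 4.74, 5.56, 5.31),
    "616x308": (45.92, 135.59, 11.86, 8.21, 9.5, 9.05),
    "840x462": (146.22, 246.09, 24.3, 16.96, 18.71, 17.83),
    "1008x616": (218.24, 393.27, 34.72, 23.07, 27.58, 26.29),
    "1260x840": (416.56, 559.05, 59.71, 39.94, 50.84, 48.38),
    "1540x1008": (687.79, 995.49, 86.16, 57.65, 79.1, 75.66),
}
times = {size: dict(zip(COLUMNS, row)) for size, row in ROWS.items()}


def speedup(size, baseline, variant):
    """Ratio of baseline time to variant time for one matrix size."""
    return times[size][baseline] / times[size][variant]


# Print the speedup of every assumed Cell variant over both baselines.
for size in times:
    for baseline in ("Intel", "AMD"):
        ratios = ", ".join(
            f"{v}: {speedup(size, baseline, v):.1f}x" for v in COLUMNS[2:]
        )
        print(f"{size} vs {baseline}: {ratios}")
```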

Performance: All PFA Sizes, 3-Step & 2-Step Algorithms

Lessons Learned
- NUMA utility "numactl -m 0 -c 0"
  - Binds jobs to BEs
  - Binds memory to BEs
  - 2 runs instead of 1
- Changed buffer size from 4096 to 4104 elements
  - Added one data envelope (128B)
  - Better memory access pattern
- Declare temporary variables locally
- Combining 2nd and 3rd steps

Implementation Overview
- FFT distributed across SPEs
- Data vectorized
- DMAs double buffered
- Pass 1: For each buffer
  - DMA Get buffer
  - Transform signals to SIMD format
  - Do four 1D FFTs in SIMD
  - Tiles transposed
  - DMA Put buffer
- Pass 2: For each buffer
- Pass 3: For each buffer
  - Transform SIMD format to original data format
[Diagram labels: Input Image, Buffer, Tile, Transposed Tile, Transposed Buffer, Transposed Image]
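The "tiles transposed" step can be pictured as a blocked transpose, in which the image is walked tile by tile and each tile lands in the mirrored position of the output. The following is only a sketch of that idea on plain Python lists: the function name `tiled_transpose` and the default tile size of 4 are assumptions, and the DMA transfers and SIMD shuffles used on the SPEs are not modeled.

```python
def tiled_transpose(matrix, tile=4):
    """Transpose a rectangular list-of-lists matrix one tile at a time."""
    rows, cols = len(matrix), len(matrix[0])
    out = [[None] * rows for _ in range(cols)]
    for r0 in range(0, rows, tile):          # top edge of the current tile
        for c0 in range(0, cols, tile):      # left edge of the current tile
            # Copy one tile; edge tiles may be smaller than tile x tile.
            for r in range(r0, min(r0 + tile, rows)):
                for c in range(c0, min(c0 + tile, cols)):
                    out[c][r] = matrix[r][c]
    return out


# Usage: a 5 x 6 matrix, which exercises the partial edge tiles.
image = [[10 * r + c for c in range(6)] for r in range(5)]
transposed = tiled_transpose(image)
```

Working in tiles keeps both the reads and the writes within a small block at a time, which is the usual motivation for this pattern when a full row or column does not fit in fast local memory.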

Two Step PFA FFT Algorithm
- 1st Step
  - Get input data from main RAM by using DMA
  - Vectorization
  - Vectorized PFA FFT for 1st dimension
  - Transpose and write back to main memory
- 2nd Step
  - Get input data from main RAM by using DMA
  - Vectorized PFA FFT for 2nd dimension
  - Combined Transpose & Un-vectorization
  - Write back to main memory
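The two steps follow the usual row-column decomposition of a 2D transform: transform along one dimension, transpose, transform along the other dimension, and transpose back. A minimal sketch of that data flow is shown here, with a naive DFT standing in for the vectorized PFA FFT and with the DMA and vectorization stages left out; `dft`, `transpose` and `fft2_two_step` are hypothetical names.

```python
import cmath


def dft(x):
    """Naive 1D DFT, standing in for the vectorized PFA FFT."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transpose(matrix):
    """Return the transpose of a rectangular list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def fft2_two_step(image):
    """2D DFT computed as two row-wise passes separated by transposes."""
    # 1st step: 1D transform of every row, then transpose.
    step1 = transpose([dft(row) for row in image])
    # 2nd step: the rows of step1 are the columns of the original image,
    # so a second row-wise pass covers the 2nd dimension. The final
    # transpose restores the original orientation.
    return transpose([dft(row) for row in step1])


# Usage on a small 2 x 3 image.
result = fft2_two_step([[1, 2, 3], [4, 5, 6]])
```

Because every pass is a row-wise transform over contiguous data, the same 1D kernel can be reused for both dimensions, at the cost of the two transposes.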

2nd Step Details
- Load buffer 1
- Load buffer 2
- PFAFFT on buffer 1
- PFAFFT on buffer 2
- Do combined transpose and unvectorization on buffer1 & buffer2
- DMA back to main RAM in right places

Time Distribution on 2nd Step. Efficiency = 6/13 = ~50%. [Timeline diagram: from the beginning to the end of the loop, repeated Load, Comp, Trans and Unload phases over buf[0], buf[1] and buf[2], with the combined transpose and unload (T & UNLD) operating on pairs of buffers.]

Data Layout Change in 2-Step PFAFFT: Original Input Data (each trace 4 complex numbers x 16 traces). [Diagram: traces a through p with eight values each (a1 to a8, ..., p1 to p8), four traces per buffer in the 1st through 4th buffers; values labeled real, real, real, real, imaginary, imaginary, imaginary, imaginary.]

Vectorization Shuffle Operation in 1st Step. [Diagram: traces a through d (a1 to a8, ..., d1 to d8) shown with the labels real, real, real, real, imaginary, imaginary, imaginary, imaginary and with alternating real, imaginary labels.]

After Vectorization in 1st Step. [Diagram: each buffer holds eight 4-wide vectors, one per value index across four traces (a1 b1 c1 d1, a2 b2 c2 d2, ..., a8 b8 c8 d8 in the 1st buffer; traces e to h, i to l and m to p in the 2nd through 4th buffers); vectors labeled alternately real and imaginary.]
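The vectorized layout groups the value at the same position of four traces into one 4-wide vector, so that a single SIMD operation advances four 1D FFTs at once. A minimal sketch of that packing and its inverse follows, using the diagram's value names as plain strings. The helper names `vectorize` and `unvectorize` are hypothetical, and the real and imaginary labeling from the slides is not modeled.

```python
def vectorize(traces, width=4):
    """Pack value j of `width` consecutive traces into one vector."""
    if len(traces) % width:
        raise ValueError("number of traces must be a multiple of width")
    vectors = []
    for g in range(0, len(traces), width):
        group = traces[g:g + width]
        for j in range(len(group[0])):
            vectors.append([trace[j] for trace in group])
    return vectors


def unvectorize(vectors, trace_len, width=4):
    """Inverse of vectorize: rebuild the traces from the packed vectors."""
    traces = []
    for g in range(0, len(vectors), trace_len):
        block = vectors[g:g + trace_len]
        for lane in range(width):
            traces.append([vec[lane] for vec in block])
    return traces


# Usage: the four traces a..d of the first buffer, eight values each.
traces = [[f"{name}{j}" for j in range(1, 9)] for name in "abcd"]
packed = vectorize(traces)
print(packed[0])  # the first vector holds value 1 of traces a, b, c and d
```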

After PFA FFT for 1st Dimension. [Diagram: the same vectorized layout, four buffers of eight 4-wide vectors each, labeled alternately real and imaginary.]

Transposition Shuffle Operation in 1st Step. [Diagram: the eight vectors of traces a through d (a1 b1 c1 d1, ..., a8 b8 c8 d8), labeled alternately real and imaginary before and after the shuffle.]

After Transposition, DMA back to main RAM in 1st Step. [Diagram: for each trace, the odd-indexed values (a1 a3 a5 a7) and the even-indexed values (a2 a4 a6 a8) form separate 4-wide vectors, labeled alternately real and imaginary; traces a to d in the 1st buffer, e to h in the 2nd, i to l in the 3rd and m to p in the 4th.]

After DMA Load in 2nd Step from main RAM (all in 1 buffer). [Diagram: the transposed vectors of all 16 traces, a through p, held in a single buffer and labeled alternately real and imaginary.]

After PFA FFT for 2nd Dimension (just 1 buffer). [Diagram: the same single-buffer layout for traces a through p, labeled alternately real and imaginary.]

Transposition & Un-Vectorization Shuffle Operation in 2nd Step. [Diagram: the vectors of traces a through d shown with alternating real, imaginary labels and with the labels real, real, real, real, imaginary, imaginary, imaginary, imaginary.]

After Combined Transposition and Un-vectorization Shuffle, DMA back to main RAM. [Diagram: each trace laid out as eight consecutive values (traces g through p appear in the transcript), labeled real, real, real, real, imaginary, imaginary, imaginary, imaginary.]