
The FFTE Library and the HPC Challenge (HPCC) Benchmark Suite
Daisuke Takahashi
Center for Computational Sciences / Graduate School of Systems and Information Engineering, University of Tsukuba
First French-Japanese PAAP Workshop, 2007/11/2

Outline
HPC Challenge (HPCC) Benchmark Suite
–Overview
–The Benchmark Tests
–Example Results
FFTE: A High-Performance FFT Library
–Background
–Related Works
–Block Six-Step/Nine-Step FFT Algorithm
–Performance Results
–Conclusion and Future Work

Overview of the HPC Challenge (HPCC) Benchmark Suite
HPC Challenge (HPCC) is a suite of tests that examine the performance of HPC architectures using kernels.
The suite provides benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g.,
–Spatial locality
–Temporal locality

The Benchmark Tests
The HPC Challenge benchmark currently consists of 7 performance tests:
–HPL (High Performance Linpack)
–DGEMM (matrix-matrix multiplication)
–STREAM (sustainable memory bandwidth)
–PTRANS (A = A + B^T, parallel matrix transpose)
–RandomAccess (integer updates to random memory locations)
–FFT (complex 1-D discrete Fourier transform)
–b_eff (MPI latency/bandwidth test)

Targeted Application Areas in the Memory Access Locality Space
(Diagram: the HPCC kernels PTRANS, STREAM, RandomAccess, FFT, HPL and DGEMM, together with representative application areas such as CFD, radar cross-section, TSP and DSP, placed on axes of temporal and spatial locality.)

HPCC Testing Scenarios
Local (S-STREAM, S-RandomAccess, S-DGEMM, S-FFTE)
–Only a single MPI process computes.
Embarrassingly parallel (EP-STREAM, EP-RandomAccess, EP-DGEMM, EP-FFTE)
–All processes compute and do not communicate (explicitly).
Global (G-HPL, G-PTRANS, G-RandomAccess, G-FFTE)
–All processes compute and communicate.
Network only (RandomRing Bandwidth, etc.)

Sample results page
(Screenshot of the HPCC results web page.)

The winners of the 2006 HPC Challenge Class 1 Awards
G-HPL: 259 TFlop/s
–IBM Blue Gene/L
G-RandomAccess: 35 GUPS
–IBM Blue Gene/L
G-FFTE: 2311 GFlop/s
–IBM Blue Gene/L
EP-STREAM-Triad (system): 160 TB/s
–IBM Blue Gene/L

FFTE: A High-Performance FFT Library
FFTE is a Fortran subroutine library for computing the Fast Fourier Transform (FFT) in one or more dimensions.
It includes complex, mixed-radix and parallel transforms.
–Shared / distributed memory parallel computers (OpenMP, MPI and OpenMP + MPI)
It also supports Intel's SSE2/SSE3 instructions.
The FFTE library can be obtained from

Background
One goal for large FFTs is to minimize the number of cache misses.
Many FFT algorithms work well when data sets fit into a cache.
When a problem exceeds the cache size, however, the performance of these FFT algorithms decreases dramatically.
The conventional six-step FFT algorithm requires
–Two multicolumn FFTs.
–Three data transpositions.
→ The data transpositions are the chief bottleneck on cache-based processors.

Related Works
FFTW [Frigo and Johnson (MIT)]
–Recursive calls are employed to access main memory hierarchically.
–This technique is very effective when the total amount of data is not much larger than the cache size.
–For parallel FFT, the conventional six-step FFT is used.
SPIRAL [Pueschel et al. (CMU)]
–The goal of SPIRAL is to push the limits of automation in software and hardware development and optimization for digital signal processing (DSP) algorithms.

Approach
Some previously presented six-step FFT algorithms separate the multicolumn FFTs from the transpositions.
Taking the opposite approach, we combine the multicolumn FFTs and transpositions to reduce the number of cache misses.
We modify the conventional six-step FFT algorithm to reuse data in the cache memory.
→ We will call it a "block six-step FFT".

Discrete Fourier Transform (DFT)
The DFT is given by

  $y(k) = \sum_{j=0}^{n-1} x(j)\,\omega_n^{jk}, \qquad 0 \le k \le n-1,$

where $\omega_n = e^{-2\pi i/n}$.
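As an illustration (not part of FFTE), a direct NumPy evaluation of this definition can be checked against a library FFT:

```python
import numpy as np

def dft_direct(x):
    """Direct O(n^2) evaluation of the DFT definition above (illustrative only)."""
    n = len(x)
    j = np.arange(n)
    k = j.reshape(-1, 1)
    omega = np.exp(-2j * np.pi * j * k / n)   # omega_n^{jk}
    return omega @ x

x = np.random.rand(64) + 1j * np.random.rand(64)
print(np.allclose(dft_direct(x), np.fft.fft(x)))   # True
```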

2-D Formulation
If $n$ has factors $n_1$ and $n_2$ ($n = n_1 n_2$), then, writing $j = j_1 + j_2 n_1$ and $k = k_2 + k_1 n_2$,

  $y(k_1, k_2) = \sum_{j_1=0}^{n_1-1} \omega_{n_1}^{j_1 k_1}\,\omega_{n_1 n_2}^{j_1 k_2} \sum_{j_2=0}^{n_2-1} x(j_1, j_2)\,\omega_{n_2}^{j_2 k_2},$

where $x(j_1, j_2) = x(j_1 + j_2 n_1)$ and $y(k_1, k_2) = y(k_2 + k_1 n_2)$.
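The factorization follows by substituting the split indices into the DFT definition:

$$\omega_n^{jk} = \omega_n^{(j_1 + j_2 n_1)(k_2 + k_1 n_2)} = \omega_n^{j_1 k_2}\,\omega_n^{n_2 j_1 k_1}\,\omega_n^{n_1 j_2 k_2}\,\omega_n^{n_1 n_2 j_2 k_1} = \omega_{n_1 n_2}^{j_1 k_2}\,\omega_{n_1}^{j_1 k_1}\,\omega_{n_2}^{j_2 k_2},$$

since $\omega_n^{n_2} = \omega_{n_1}$, $\omega_n^{n_1} = \omega_{n_2}$ and $\omega_n^{n_1 n_2} = 1$. The inner sum is a set of $n_2$-point FFTs, the factor $\omega_{n_1 n_2}^{j_1 k_2}$ is the twiddle-factor multiplication, and the outer sum is a set of $n_1$-point FFTs.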

Six-Step FFT Algorithm
(Diagram of the algorithm: 1. transpose; 2. $n_1$ individual $n_2$-point FFTs; 3. twiddle-factor multiplication; 4. transpose; 5. $n_2$ individual $n_1$-point FFTs; 6. transpose.)
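A minimal NumPy sketch of these six steps, useful for checking the index bookkeeping (a reference implementation only, not FFTE's optimized Fortran kernels):

```python
import numpy as np

def six_step_fft(x, n1, n2):
    """1-D FFT of length n = n1*n2 via the six-step algorithm (reference sketch)."""
    n = n1 * n2
    # The input x(j1 + j2*n1) is viewed as an n1 x n2 matrix X[j1, j2].
    X = x.reshape(n2, n1).T
    # Step 1: transpose.
    A = np.ascontiguousarray(X.T)                # shape (n2, n1), A[j2, j1]
    # Step 2: n1 individual n2-point FFTs (over j2).
    B = np.fft.fft(A, axis=0)                    # B[k2, j1]
    # Step 3: twiddle-factor multiplication by w_n^(j1*k2).
    k2 = np.arange(n2).reshape(-1, 1)
    j1 = np.arange(n1).reshape(1, -1)
    C = B * np.exp(-2j * np.pi * j1 * k2 / n)
    # Step 4: transpose.
    D = np.ascontiguousarray(C.T)                # shape (n1, n2), D[j1, k2]
    # Step 5: n2 individual n1-point FFTs (over j1).
    E = np.fft.fft(D, axis=0)                    # E[k1, k2]
    # Step 6: reorder so that y(k2 + k1*n2) is returned; in C order this is E.ravel().
    return E.ravel()

x = np.random.rand(1024) + 1j * np.random.rand(1024)
print(np.allclose(six_step_fft(x, 32, 32), np.fft.fft(x)))   # True
```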

Block Six-Step FFT Algorithm
(Diagram: the individual $n_1$- and $n_2$-point column FFTs are performed on blocks of columns that fit in cache, and each block is fused with a partial transposition; a final transpose completes the algorithm.)
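A sketch of the blocking idea for the first multicolumn-FFT phase: each block of columns is FFT'd, twiddled, and written back transposed while it is still cache-resident. The block size nb and the exact fusion points are illustrative assumptions, not FFTE's actual kernel structure:

```python
import numpy as np

def blocked_first_half(x, n1, n2, nb=16):
    """First transpose + n2-point column FFTs + twiddles, fused in blocks of nb columns."""
    n = n1 * n2
    X = x.reshape(n2, n1)                         # X[j2, j1] = x(j1 + j2*n1)
    out = np.empty((n1, n2), dtype=complex)       # out[j1, k2]
    k2 = np.arange(n2).reshape(-1, 1)
    for j1 in range(0, n1, nb):
        cols = np.arange(j1, min(j1 + nb, n1))
        blk = X[:, cols]                                    # partial transpose: one cache-sized block
        blk = np.fft.fft(blk, axis=0)                       # n2-point FFTs on the block
        blk = blk * np.exp(-2j * np.pi * k2 * cols / n)     # twiddle factors w_n^(j1*k2)
        out[cols, :] = blk.T                                # write the block back transposed
    return out

x = np.random.rand(4096) + 1j * np.random.rand(4096)
D = blocked_first_half(x, 64, 64)
y = np.fft.fft(D, axis=0).ravel()      # remaining n1-point column FFTs + reordering
print(np.allclose(y, np.fft.fft(x)))   # True
```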

3-D Formulation
For very large FFTs, we should switch to a 3-D formulation.
If $n$ has factors $n_1$, $n_2$ and $n_3$ ($n = n_1 n_2 n_3$), then, writing $j = j_1 + j_2 n_1 + j_3 n_1 n_2$ and $k = k_3 + k_2 n_3 + k_1 n_2 n_3$,

  $y(k_1, k_2, k_3) = \sum_{j_1} \omega_{n_1}^{j_1 k_1}\,\omega_{n_1 n_2}^{j_1 k_2}\,\omega_{n}^{j_1 k_3} \sum_{j_2} \omega_{n_2}^{j_2 k_2}\,\omega_{n_2 n_3}^{j_2 k_3} \sum_{j_3} x(j_1, j_2, j_3)\,\omega_{n_3}^{j_3 k_3}.$

Parallel Block Nine-Step FFT
(Diagram: the data are distributed across the MPI processes; the blocked column FFTs are fused with partial transpositions, and the global transpositions are performed with all-to-all communication.)
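The global transposition is what the all-to-all communication realizes. A minimal mpi4py sketch of one such transpose of a block-row-distributed matrix; the slab shapes, packing scheme and variable names are illustrative assumptions, not FFTE's actual communication layer (FFTE itself is Fortran + MPI):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
P = comm.Get_size()
m, q = 8, 8                        # local tile sizes (assumed); global matrix is (P*m) x (P*q)
# This rank's slab: m rows of the global matrix.
local = (np.arange(m * P * q, dtype=np.complex128).reshape(m, P * q)
         + comm.Get_rank() * 1000.0)

# Pack: split the slab column-wise into P tiles, one per destination rank.
send = np.ascontiguousarray(local.reshape(m, P, q).swapaxes(0, 1))   # shape (P, m, q)
recv = np.empty_like(send)

comm.Alltoall(send, recv)          # the all-to-all exchange behind the global transpose

# Unpack: tile s came from rank s; assemble this rank's q rows of the transposed matrix.
transposed = recv.transpose(2, 0, 1).reshape(q, P * m)                # shape (q, P*m)
```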

Operation Counts for an n-point FFT
Conventional FFT algorithms (e.g., Cooley-Tukey FFT, Stockham FFT)
–Arithmetic operations: O(n log n)
–Main memory accesses: O(n log n), since every pass over the data sweeps main memory once the problem exceeds the cache
Block Nine-Step FFT
–Arithmetic operations: O(n log n)
–Main memory accesses (ideal case): O(n), since each element moves between cache and main memory only a small, constant number of times
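For reference, HPCC reports FFT performance by crediting the standard $5\,n\log_2 n$ operation count; for the largest problem size discussed below this gives

$$5\,n\log_2 n \,\Big|_{n = 2^{29}} = 5 \cdot 2^{29} \cdot 29 \approx 7.8 \times 10^{10}\ \text{floating-point operations.}$$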

Performance Results
To evaluate the implemented parallel FFTs, we compared:
–The implemented parallel FFT, named FFTE (ver. 4.0, supports SSE3, using MPI)
–FFTW (ver , does not support SSE3, using MPI)
Target parallel machine:
–A 32-node dual PC SMP cluster (Irwindale 3 GHz, 1 GB DDR2-400 SDRAM / node, Linux smp).
–Interconnected through a Gigabit Ethernet switch.
–LAM/MPI was used as the communication library.
–The compilers used were gcc and g

(Performance graphs comparing FFTE and FFTW on the cluster described above.)

Discussion
For N = 2^29 and P = 32, FFTE runs about 1.72 times faster than FFTW.
–The performance of FFTE remains at a high level even for the larger problem sizes, owing to cache blocking.
–Since FFTW uses the conventional six-step FFT, each column FFT does not fit into the L1 data cache.
–Moreover, FFTE exploits the SSE3 instructions.
These are the three reasons why FFTE is more advantageous than FFTW.

Conclusion and Future Work
The block nine-step FFT algorithm is most advantageous on processors that have a considerable gap between the speed of the cache memory and that of the main memory.
Towards petascale computing systems:
–Exploiting multi-level parallelism: SIMD or vector accelerators, multi-core, multi-socket, multi-node.
–Reducing the number of main memory accesses.
–Improving the all-to-all communication performance; in G-FFTE, the all-to-all communication occurs three times.