1 VSIPL++: Parallel Performance HPEC 2004 CodeSourcery, LLC September 30, 2004.

2 Challenge
“Object oriented technology reduces software cost.”
“Fully utilizing HPEC systems for SIP applications requires managing operations at the lowest possible level.”
“There is great concern that these two approaches may be fundamentally at odds.”

3 Parallel Performance Vision
[Diagram: SIP program code sits atop the VSIPL++ API, which rests on SIMD support and multiprocessor support.]
“Drastically reduce the performance penalties associated with deploying object-oriented software on high performance parallel embedded systems.”
“Automated to reduce implementation cost.”

4 Advantages of VSIPL
Portability: Code can be reused on any system for which a VSIPL implementation is available.
Performance: Vendor-optimized implementations perform better than most handwritten code.
Productivity: Reduces SLOC count. Code is easier to read. Skills learned on one project are applicable to others. Eliminates use of assembly code.

5 Limitations of VSIPL
Uses the C programming language: “Modern object oriented languages (e.g., C++) have consistently reduced the development time of software projects.” Manual memory management. Cumbersome syntax.
Inflexible: Abstractions prevent users from adding new high-performance functionality. No provisions for loop fusion. No way to avoid unnecessary block copies.
Not scalable: No support for MPI or threads. SIMD support must be entirely coded by the vendor; the user cannot take advantage of SIMD directly.

6 Parallelism: Current Practice
MPI is used for communication, but:
MPI code is often a significant fraction of the total program code.
MPI code is notoriously hard to debug.
There is a tendency to hard-code the number of processors, data sizes, etc., which reduces portability.
Conclusion: users should specify only the data layout.

7 Atop VSIPL’s Foundation
Both: Open standard (specification, reference implementation). Optimized vendor implementations: high performance.
VSIPL: C programming language. Limited extensibility. Serial.
VSIPL++: C++ (OOP, memory management). Extensible (operators, data formats, etc.). Scalable multiprocessor computation.

8 Leverage VSIPL Model
Same terminology: Blocks store data. Views provide access to data. Etc.
Same basic functionality: Element-wise operations. Signal processing. Linear algebra.

9 VSIPL++ Status
Serial specification (version 1.0a): Supports all functionality of VSIPL. Flexible block abstraction permits varying data storage formats. Specification permits loop fusion and efficient use of storage. Automated memory management.
Reference implementation (version 0.95): Supports the functionality in the specification. Used in several demo programs — see the next talks. Built atop the VSIPL reference implementation for maximum portability.
Parallel specification (version 0.5): High-level design complete.

10 k-ω Beamformer
Input: Noisy signal arriving at a row of uniformly distributed sensors.
Output: Bearing and frequency of signal sources.
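For a row of uniformly spaced sensors, the k-ω beamformer is essentially a 2-D Fourier transform of the sensor/time data, which is why the kernel in the following slides reduces to FFTs plus a magnitude-squared. Up to normalization and sign convention (the notation here is mine, not the slides'; d is the sensor spacing, w_n a window weight):

```latex
P(k,\omega) \;\propto\; \left| \sum_{n=0}^{N-1} \sum_{t=0}^{T-1}
    w_n \, x_n(t) \, e^{-i\,(k n d \,-\, \omega t)} \right|^2
```

A peak of P at (k, ω) gives the source frequency ω directly and, assuming plane-wave propagation at speed c, the bearing θ via k = (ω/c) sin θ.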

11 SIP Primitives Used
Computation: FIR filters. Element-wise operations (e.g., magsq). FFTs. Minimum/average values.
Communication: Corner turn (all-to-all communication). Minimum/average values (gather).

12 Computation
1. Filter signal to remove high-frequency noise. (FIR)
2. Remove side-lobes resulting from discretization of data. (mult)
3. Apply Fourier transform in time domain. (FFT)
4. Apply Fourier transform in space domain. (FFT)
5. Compute power spectra. (mult, magsq)

13 Diagram of the Kernel
[Diagram: input (one row per sensor) → FIR → multiply by weights (removes side lobes) → row-wise FFT → “corner turn” from sensor domain to time domain (optional) → column-wise FFT → magsq, ×1/n → add this matrix to the sum.]

14 VSIPL Kernel
Seven statements required:
for (i = n; i > 0; --i) {
  filtered = filter (firs, signal);
  vsip_mmul_f (weights, filtered, filtered);
  vsip_rcfftmpop_f (space_fft, filtered, fft_output);
  vsip_ccfftmpi_f (time_fft, fft_output);
  vsip_mcmagsq_f (fft_output, power);
  vsip_ssmul_f (1.0 / n, power);
  vsip_madd_f (power, spectra, spectra);
}

15 VSIPL++ Kernel
One statement required:
for (i = n; i > 0; --i)
  spectra += 1.0/n * magsq (time_fft (space_fft (weights * filter (firs, signal))));
No changes are required for distributed operation.

16 Distribution in User Code
Serial case:
  Matrix<…> signal_matrix;
Parallel case:
  typedef Dense<…> subblock;
  typedef Distributed<subblock, …> Block2R_t;
  Matrix<…, Block2R_t> signal_matrix;
User writes no MPI code.
(The template arguments were lost in transcription and are marked <…>.)

17 VSIPL++ Implementation
Added DistributedBlock:
  Uses a “standard” VSIPL++ block on each processor.
  Uses MPI routines for communication when performing block assignment.
Added specializations:
  FFT, FIR, etc. modified to handle DistributedBlock.

18 Performance Measurement
Test system: AFRL HPC system, a 2.2 GHz Pentium 4 cluster.
Measured only the main loop; no input/output.
Used the Pentium timestamp counter.
MPI all-to-all not included in timings; it accounts for 10–25%.

19 VSIPL++ Performance

20 Parallel Speedup
[Chart: speedup of VSIPL and VSIPL++ for 1, 2, 4, and 8 processes, plotted against perfect linear speedup with bands 10% faster and 10% slower than linear; both track linear speedup closely.]
Corner turn improves execution. Minimal parallel overhead.

21 Conclusions
VSIPL++ imposes no overhead: VSIPL++ performance is nearly identical to VSIPL performance.
VSIPL++ achieves near-linear parallel speedup: no tuning of MPI, VSIPL++, or application code was required.
Absolute performance is limited by the VSIPL implementation, the MPI implementation, and the compiler.

22 VSIPL++
Visit the HPEC-SI website:
  for the VSIPL++ specifications
  for the VSIPL++ reference implementation
  to participate in VSIPL++ development

23 VSIPL++: Parallel Performance HPEC 2004 CodeSourcery, LLC September 30, 2004