VSIPL++: Parallel Performance HPEC 2004


VSIPL++: Parallel Performance HPEC 2004 CodeSourcery, LLC September 30, 2004

Challenge
"Object oriented technology reduces software cost."
"Fully utilizing HPEC systems for SIP applications requires managing operations at the lowest possible level."
"There is great concern that these two approaches may be fundamentally at odds."

Parallel Performance Vision
(Diagram: SIP program code sits atop the VSIPL++ API, which in turn targets SIMD and multiprocessor support.)
"Drastically reduce the performance penalties associated with deploying object-oriented software on high performance parallel embedded systems." "Automated to reduce implementation cost."

Advantages of VSIPL
Portability: Code can be reused on any system for which a VSIPL implementation is available.
Performance: Vendor-optimized implementations perform better than most handwritten code.
Productivity: Reduces SLOC count. Code is easier to read. Skills learned on one project are applicable to others. Eliminates use of assembly code.

Limitations of VSIPL
Uses the C programming language: "Modern object oriented languages (e.g., C++) have consistently reduced the development time of software projects." Manual memory management. Cumbersome syntax.
Inflexible: Abstractions prevent users from adding new high-performance functionality. No provisions for loop fusion. No way to avoid unnecessary block copies.
Not scalable: No support for MPI or threads. SIMD support must be entirely coded by the vendor; the user cannot take advantage of SIMD directly.
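The loop-fusion point can be made concrete. Each VSIPL C call writes its full result to memory before the next call runs, while C++ expression templates let a compound expression evaluate in one fused loop. The sketch below is illustrative only, using hypothetical Vec/Mul types rather than the VSIPL++ block/view API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative expression-template sketch (hypothetical types, not the
// VSIPL++ API): operator* returns a lightweight expression object instead
// of a temporary vector, and operator+= evaluates it in one fused loop.
struct Vec {
    std::vector<double> data;
    Vec(std::size_t n, double v) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }

    template <typename Expr>
    Vec& operator+=(const Expr& e) {
        // Single loop: accumulates the expression element-wise, with no
        // intermediate vector ever allocated for the product.
        for (std::size_t i = 0; i < data.size(); ++i)
            data[i] += e[i];
        return *this;
    }
};

// Deferred element-wise product of two vectors.
struct Mul {
    const Vec& l;
    const Vec& r;
    double operator[](std::size_t i) const { return l[i] * r[i]; }
};

Mul operator*(const Vec& l, const Vec& r) { return Mul{l, r}; }
```

With this pattern, `s += a * b` compiles to one loop over the data; a C API must instead materialize the product into a scratch block first.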

Parallelism: Current Practice
MPI is used for communication, but:
MPI code is often a significant fraction of total program code.
MPI code is notoriously hard to debug.
There is a tendency to hard-code the number of processors, data sizes, etc., which reduces portability.
Conclusion: users should specify only the data layout.
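As a sketch of "specify only the data layout": the runtime, not the application, can derive each processor's share from the global extent and the process count. The helper below is hypothetical (not part of any library API) and computes a rank's row range under a row-block distribution, so neither processor count nor data size is hard-coded:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: a rank's half-open [begin, end) row range under a
// row-block distribution of `rows` rows over `nprocs` processes. The
// first (rows % nprocs) ranks receive one extra row.
struct RowRange {
    std::size_t begin;
    std::size_t end;
};

RowRange row_block(std::size_t rows, std::size_t nprocs, std::size_t rank) {
    const std::size_t base  = rows / nprocs;   // rows every rank gets
    const std::size_t extra = rows % nprocs;   // leftover rows
    const std::size_t begin = rank * base + std::min(rank, extra);
    const std::size_t size  = base + (rank < extra ? 1 : 0);
    return RowRange{begin, begin + size};
}
```

A program written this way runs unchanged when the process count or problem size changes, which is the portability the slide is asking for.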

Atop VSIPL's Foundation
VSIPL → VSIPL++:
Serial → scalable multiprocessor computation.
Limited extensibility → extensible operators, data formats, etc.
C programming language → C++: OOP, memory management.
Retained from VSIPL: optimized vendor implementations (high performance) and an open standard (specification, reference implementation).

Leverage VSIPL Model
Same terminology: blocks store data; views provide access to data; etc.
Same basic functionality: element-wise operations, signal processing, linear algebra.

VSIPL++ Status
Serial Specification: Version 1.0a. Support for all functionality of VSIPL. Flexible block abstraction permits varying data storage formats. The specification permits loop fusion and efficient use of storage. Automated memory management.
Reference Implementation: Version 0.95. Supports the functionality in the specification. Used in several demo programs (see next talks). Built atop the VSIPL reference implementation for maximum portability.
Parallel Specification: Version 0.5. High-level design complete.

k- Beamformer Input: Noisy signal arriving at a row of uniformly distributed sensors. Output: Bearing and frequency of signal sources. w1 w2

SIP Primitives Used
Computation: FIR filters; element-wise operations (e.g., magsq); FFTs; minimum/average values.
Communication: corner turn (all-to-all communication); gather.

Computation
1. Filter the signal to remove high-frequency noise (FIR).
2. Remove side lobes resulting from discretization of the data (mult).
3. Apply a Fourier transform in the time domain (FFT).
4. Apply a Fourier transform in the space domain (FFT).
5. Compute power spectra (mult, magsq).
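The processing chain above can be sketched serially in plain C++. The function names are hypothetical, a naive O(n²) DFT stands in for the library FFT, and only one FFT stage is shown (the real kernel runs a second FFT along the other matrix dimension); this is not the VSIPL++ API:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// FIR filter: y[i] = sum_k taps[k] * x[i-k].
std::vector<double> fir(const std::vector<double>& taps,
                        const std::vector<double>& x) {
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t k = 0; k < taps.size() && k <= i; ++k)
            y[i] += taps[k] * x[i - k];
    return y;
}

// Naive DFT, standing in for the library FFT.
std::vector<std::complex<double>> dft(const std::vector<double>& x) {
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> X(x.size());
    for (std::size_t k = 0; k < x.size(); ++k)
        for (std::size_t t = 0; t < x.size(); ++t)
            X[k] += x[t] * std::polar(1.0, -2.0 * pi * k * t / x.size());
    return X;
}

// One accumulation step: spectra += (1/n) * magsq(FFT(weights .* FIR(...))).
void accumulate_spectrum(std::vector<double>& spectra,
                         const std::vector<double>& taps,
                         const std::vector<double>& weights,
                         const std::vector<double>& signal, double n) {
    std::vector<double> f = fir(taps, signal);              // step 1
    for (std::size_t i = 0; i < f.size(); ++i)
        f[i] *= weights[i];                                 // step 2
    std::vector<std::complex<double>> X = dft(f);           // steps 3-4
    for (std::size_t k = 0; k < X.size(); ++k)
        spectra[k] += std::norm(X[k]) / n;                  // step 5 (norm == magsq)
}
```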

Diagram of the Kernel
(Diagram: the input matrix, one row per sensor, is FIR-filtered, multiplied by weights to remove side lobes, and row-wise FFT'd; an optional "corner turn" reorients the data from the sensor domain to the time domain; a column-wise FFT follows, then magsq and scaling by 1/n, and the resulting matrix is added to the sum.)
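In the parallel program the corner turn is an all-to-all redistribution; serially it reduces to a matrix transpose. A minimal sketch, assuming a hypothetical row-major flat-vector layout (not library code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial "corner turn": transpose a row-major rows x cols matrix so the
// other dimension becomes contiguous. In the distributed kernel the same
// reshuffle is realized with MPI all-to-all communication.
std::vector<double> corner_turn(const std::vector<double>& m,
                                std::size_t rows, std::size_t cols) {
    std::vector<double> t(m.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            t[c * rows + r] = m[r * cols + c];  // (r, c) -> (c, r)
    return t;
}
```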

VSIPL Kernel
Seven statements required:

for (i = n; i > 0; --i) {
    filtered = filter (firs, signal);
    vsip_mmul_f (weights, filtered, filtered);
    vsip_rcfftmpop_f (space_fft, filtered, fft_output);
    vsip_ccfftmpi_f (time_fft, fft_output);
    vsip_mcmagsq_f (fft_output, power);
    vsip_ssmul_f (1.0 / n, power);
    vsip_madd_f (power, spectra, spectra);
}

VSIPL++ Kernel
One statement required, and no changes are needed for distributed operation:

for (i = n; i > 0; --i)
    spectra += 1.0/n * magsq (time_fft (space_fft (weights * filter (firs, signal))));

Distribution in User Code
Serial case:

Matrix<float_t, Dense<2, float_t> > signal_matrix;

Parallel case (the user writes no MPI code):

typedef Dense<2, float_t> subblock;
typedef Distributed<2, float_t, subblock, ROW> Block2R_t;
Matrix<float_t, Block2R_t> signal_matrix;

VSIPL++ Implementation
Added DistributedBlock: uses a "standard" VSIPL++ block on each processor, and uses MPI routines for communication when performing block assignment.
Added specializations: FFT, FIR, etc. are modified to handle DistributedBlock.

Performance Measurement
Test system: AFRL HPC system, a 2.2 GHz Pentium 4 cluster.
Measured only the main loop (no input/output), using the Pentium timestamp counter.
MPI all-to-all is not included in the timings; it accounts for 10-25% of run time.

VSIPL++ Performance

Parallel Speedup
(Chart: speedup of VSIPL and VSIPL++ at 1, 2, 4, and 8 processes against perfect linear speedup.)
Minimal parallel overhead: at most about 10% slower than perfect linear speedup.
The corner turn improves execution: up to about 10% faster than perfect linear speedup.

Conclusions
VSIPL++ imposes no overhead: VSIPL++ performance is nearly identical to VSIPL performance.
VSIPL++ achieves near-linear parallel speedup with no tuning of MPI, VSIPL++, or application code.
Absolute performance is limited by the VSIPL implementation, the MPI implementation, and the compiler.

VSIPL++
Visit the HPEC-SI website http://www.hpec-si.org:
for the VSIPL++ specifications
for the VSIPL++ reference implementation
to participate in VSIPL++ development
