Parallel Vector & Signal Processing
Mark Mitchell, CodeSourcery, LLC (www.codesourcery.com)
April 3, 2003

Presentation transcript:

1 Parallel Vector & Signal Processing
Mark Mitchell, CodeSourcery, LLC
April 3, 2003

2 Deliverables
VSIPL++ Specification
  – CodeSourcery et al.
VSIPL++ Reference Implementation
  – CodeSourcery.
High-Performance Implementations
  – Various vendors.

3 Objectives
Parallelism
Performance
Portability
  – ANSI/ISO C++
Compatibility with VSIPL
  – On a conceptual level.
  – Allow VSIPL to be used to implement VSIPL++.

4 Major Components
Data distribution:
  – Mapping from logical arrays to physical processors.
High-performance serial computation:
  – Expression templates: intuitive syntax; loop fusion and elimination of temporaries (sketched below).
  – VSIPL: can be used to implement VSIPL++.
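To make the expression-template bullet concrete, here is a minimal sketch of the technique (the names Vec and Add are illustrative assumptions, not the VSIPL++ implementation): overloaded operators build a lightweight tree describing the expression, and assignment walks that tree once, so a sum of several vectors runs as a single fused loop with no intermediate temporaries.

  #include <cstddef>
  #include <vector>

  struct Vec;

  // Node representing the unevaluated sum of two subexpressions.
  template <typename L, typename R>
  struct Add {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
  };

  struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n) : data(n) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }

    // Assigning any expression evaluates it element by element:
    // one fused loop, no temporary vectors for intermediate sums.
    template <typename E>
    Vec& operator=(const E& e) {
      for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
      return *this;
    }
  };

  // operator+ builds tree nodes instead of computing anything.
  inline Add<Vec, Vec> operator+(const Vec& l, const Vec& r) { return {l, r}; }
  template <typename L, typename R>
  Add<Add<L, R>, Vec> operator+(const Add<L, R>& l, const Vec& r) { return {l, r}; }

With these definitions, x = a + b + c; evaluates in one pass over the data, which is exactly the loop fusion and elimination of temporaries the slide refers to.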

5 Data Distribution Model
Views
  – Lightweight access to data.
Blocks
  – Logically rectangular arrays of data.
Partitions
  – Block-cyclic data distribution determines subblocks (made concrete in the sketch below).
Processor Map
  – Map from subblocks to physical processors.
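A small sketch of 1-D block-cyclic ownership (the function name owner and its parameters are illustrative assumptions; VSIPL++ maps generalize this to multiple dimensions): consecutive elements are grouped into fixed-size blocks, and the blocks are dealt out to processors round-robin.

  #include <cstddef>
  #include <cstdio>

  // Element i lives in block i / block_size; blocks are assigned
  // to processors cyclically, giving the block-cyclic distribution.
  std::size_t owner(std::size_t i, std::size_t block_size, std::size_t nprocs) {
    return (i / block_size) % nprocs;
  }

  int main() {
    const std::size_t block_size = 4, nprocs = 3;
    for (std::size_t i = 0; i < 12; ++i)
      std::printf("element %2zu -> processor %zu\n",
                  i, owner(i, block_size, nprocs));
    // Blocks of 4 consecutive elements land on processors 0, 1, 2, ...
  }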

6 Computational Model
Data-driven computation: computation occurs on the processor that owns the data being updated (the "owner computes" rule, sketched below).
Parallel computation is divided into chunks that can be performed efficiently in serial.
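A minimal owner-computes sketch (hypothetical SPMD code, not the VSIPL++ API; it reuses the block-cyclic ownership rule from the previous slide): every processor scans the global index space but updates only the elements its rank owns, so each chunk is an ordinary serial loop over local data.

  #include <cstddef>

  // Each processor calls this with its own my_rank; the ownership
  // test restricts the update y[i] = a * x[i] + y[i] to locally
  // owned elements, so all computation happens where the data lives.
  void scaled_add_owner_computes(std::size_t n, double a,
                                 const double* x, double* y,
                                 std::size_t my_rank,
                                 std::size_t block_size,
                                 std::size_t nprocs) {
    for (std::size_t i = 0; i < n; ++i)
      if ((i / block_size) % nprocs == my_rank)
        y[i] = a * x[i] + y[i];
  }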

7 Implementation of zherk
A rank-k update of the Hermitian matrix C:

  C ← α · A · conjug(A)^t + β · C

  CMatrix A(10, 15);
  CMatrix C(10, 10);
  C = alpha * prodh(A, A) + beta * C;
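For reference, a plain serial loop spelling out what that one-liner computes (a sketch of the semantics only, not how VSIPL++ evaluates it; alpha and beta are taken as complex here for simplicity, whereas BLAS zherk restricts them to real):

  #include <complex>
  #include <cstddef>
  #include <vector>

  using cplx = std::complex<double>;

  // C <- alpha * A * conjug(A)^t + beta * C, with A of size m x k and
  // C of size m x m, both stored row-major in flat vectors.
  void zherk_ref(std::size_t m, std::size_t k, cplx alpha, cplx beta,
                 const std::vector<cplx>& A, std::vector<cplx>& C) {
    for (std::size_t i = 0; i < m; ++i)
      for (std::size_t j = 0; j < m; ++j) {
        cplx s = 0;
        for (std::size_t p = 0; p < k; ++p)
          s += A[i * k + p] * std::conj(A[j * k + p]);  // (A · conjug(A)^t)(i, j)
        C[i * m + j] = alpha * s + beta * C[i * m + j];
      }
  }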

8 Parallel Vector & Signal Processing
Mark Mitchell, CodeSourcery, LLC
April 3, 2003