PVTOL: A High-Level Signal Processing Library for Multicore Processors


PVTOL: A High-Level Signal Processing Library for Multicore Processors
Hahn Kim, Nadya Bliss, Ryan Haney, Jeremy Kepner, Matthew Marzilli, Sanjeev Mohindra, Sharon Sacco, Glenn Schrader, Edward Rutledge
HPEC 2007, 20 September 2007

This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

Outline
- Background
  - Motivation
  - Multicore Processors
  - Programming Challenges
- Parallel Vector Tile Optimizing Library
- Results
- Summary

Future DoD Sensor Missions

Platforms: space-based radar, SIGINT & ELINT, mobile command & control aircraft, unmanned ground vehicles, carrier groups, advanced destroyers, manpacks, submarines.

DoD missions must exploit:
- High-resolution sensors
- Integrated multi-modal data
- Short reaction times
- Many net-centric users

Future DoD missions will rely heavily on a multitude of high-resolution sensors on a number of different platforms. Missions will have short reaction times, requiring low latencies. Additionally, sensors are becoming increasingly networked, so data collected by each sensor will have multiple users. The combination of high resolution, low latency and net-centricity requirements imposes high computational demands on these sensors.

Embedded Processor Evolution

[Chart: MFLOPS/Watt vs. year (1990-2010) for high-performance embedded processors: i860 XR, SHARC, 603e, 750, MPC7400, MPC7410, MPC7447A, and Cell (estimated). The Cell is an asymmetric multicore processor with 1 PowerPC core and 8 SIMD cores.]

- 20 years of exponential growth in FLOPS/Watt
- Requires switching architectures every ~5 years
- The Cell processor is the current high-performance architecture

Embedded processor performance, in terms of FLOPS/Watt, has grown exponentially over the last 20 years. No single processing architecture has dominated over this period, so to leverage this increase in performance, embedded system designers must switch processing architectures approximately every 5 years. IBM's Cell Broadband Engine is the current high-performance architecture. The MFLOPS/Watt figures for the i860, SHARC, 603e, 750, 7400, and 7410 are extrapolated from board wattage and include other hardware energy use such as memory and memory controllers; the 7447A and Cell estimates are for the chip only. Effective FLOPS for all processors are based on 1024-point FFT timings; the Cell estimate uses hand-coded TDFIR timings.

Multicore Programming Challenge

Past programming model: von Neumann
- Great success of the Moore's Law era
- Simple model: load, op, store
- Many transistors devoted to delivering this model

Future programming model: ???
- Moore's Law is ending; transistors are needed for performance
- Processor topology includes registers, cache, local memory, remote memory, disk
- Multicore processors have multiple programming models

For decades, Moore's Law has enabled ever faster processors that support the traditional von Neumann programming model: load data from memory, process, then save the results to memory. As clock speeds near 4 GHz, physical limitations in transistor size are leading designers to build more processor cores (or "tiles") on each chip rather than faster processors. Multicore processors improve raw performance but expose the underlying processor and memory topologies. The result is increased programming complexity, i.e. the loss of the von Neumann programming model: increased performance comes at the cost of exposing complexity to the programmer.

Example: Time-Domain FIR

ANSI C:

    for (i = K; i > 0; i--) {
        /* Set accumulators and pointers for dot product for output point */
        r1 = Rin;  r2 = Iin;
        o1 = Rout; o2 = Iout;
        /* Calculate contributions from a single kernel point */
        for (j = 0; j < N; j++) {
            *o1 += *k1 * *r1 - *k2 * *r2;
            *o2 += *k2 * *r1 + *k1 * *r2;
            r1++; r2++; o1++; o2++;
        }
        /* Update input pointers */
        k1++; k2++;
        Rout++; Iout++;
    }

Cell Manager C (communication):

    /* Fill input vector and filter buffers and send to workers */
    mcf_m_tile_channel_get_buffer(..., &vector_in, ...);
    mcf_m_tile_channel_get_buffer(..., &filter, ...);
    init_buffer(&vector_in);
    init_buffer(&filter);
    mcf_m_tile_channel_put_buffer(..., &vector_in, ...);
    mcf_m_tile_channel_put_buffer(..., &filter, ...);
    /* Wait for worker to process and get a full output vector buffer */
    mcf_m_tile_channel_get_buffer(..., &vector_out, ...);

Cell Worker C (communication):

    while (mcf_w_tile_channel_is_not_end_of_channel(...)) {
        /* Get buffers */
        mcf_w_tile_channel_get_buffer(..., &vector_in, ...);
        mcf_w_tile_channel_get_buffer(..., &filter, ...);
        mcf_w_tile_channel_get_buffer(..., &vector_out, ...);
        /* Perform the filter operation */
        filter_vect(vector_in, filter, vector_out);
        /* Send results back to manager */
        mcf_w_tile_channel_put_buffer(..., &vector_out, ...);
        /* Put back input vector and filter buffers */
        mcf_w_tile_channel_put_buffer(..., &filter, ...);
        mcf_w_tile_channel_put_buffer(..., &vector_in, ...);
    }

Cell Worker C (computation, SIMD):

    /* Load reference data and shift */
    ir0 = *Rin++;
    ii0 = *Iin++;
    ir1 = (vector float) spu_shuffle(irOld, ir0, shift1);
    ii1 = (vector float) spu_shuffle(iiOld, ii0, shift1);
    Rtemp = spu_madd(kr0, ir0, Rtemp);
    Itemp = spu_madd(kr0, ii0, Itemp);
    Rtemp = spu_nmsub(ki0, ii0, Rtemp);
    Itemp = spu_madd(ki0, ir0, Itemp);

- ANSI C is easy to understand
- The Cell increases complexity: communication requires synchronization, and computation requires SIMD

The ANSI C code is oriented toward solving the problem. The major optimization techniques available to the programmer are accessing memory well, writing good C code, and selecting compiler switches; most C implementations have little sensitivity to the underlying hardware. With C extensions, the user has limited visibility into the hardware. Here the SIMD registers are used, which requires the "shuffle" instructions for the convolution. Adding these SIMD extensions to increase performance also increases the complexity of the code.
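The ANSI C fragment above omits its surrounding declarations. For reference, a minimal self-contained sketch of the same complex time-domain FIR (written in the same scatter-style loop structure) might look as follows; the function name, array-length conventions, and the requirement that the output be zero-initialized are assumptions made to keep the sketch compilable, not the benchmark's actual interface.

    #include <cstddef>

    /* Reference complex TDFIR, scatter form.
     * Rin/Iin:   input, length N
     * Kr/Ki:     filter, length K
     * Rout/Iout: output, length N + K - 1, zero-initialized by the caller */
    void tdfir_ref(const float* Rin, const float* Iin,
                   const float* Kr,  const float* Ki,
                   float* Rout, float* Iout,
                   std::size_t N, std::size_t K)
    {
        for (std::size_t i = 0; i < K; ++i) {       /* one filter tap per pass */
            for (std::size_t j = 0; j < N; ++j) {   /* scatter into the output */
                Rout[i + j] += Kr[i] * Rin[j] - Ki[i] * Iin[j];
                Iout[i + j] += Ki[i] * Rin[j] + Kr[i] * Iin[j];
            }
        }
    }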

Example: Time-Domain FIR — Performance vs. Effort

Software Lines of Code (SLOC) and performance for TDFIR:

                                   C       SIMD C   Hand Coding   Parallel (8 SPE)
    Lines of Code                  33      110      371           546
    Performance Efficiency (1 SPE) 0.014   0.27     0.88          0.82
    GFLOPS (2.4 GHz)               5.2     17       126           103

[Chart: speedup from C vs. SLOC / C SLOC for the C, SIMD C, hand-coded, and parallel implementations, with the PVTOL goal in the high-speedup, low-SLOC corner.]

PVTOL goal: achieve high performance with little effort.

Effort is intended in units of time, not actual time; the actual time will depend on the skills of the programmer. The implementation here is simple since it is not required to cover all possible cases. The usual estimate for implementing the first full convolution in an algorithms group is 1-2 weeks for most processors, including design, coding and debug time. Performance efficiency is used rather than time since it can be applied to any Cell. The TDFIR measurements here were made entirely on an SPE without communicating with outside memory; the decrementer was read just before and immediately after the code, so the results should scale to any clock frequency. The parallel numbers for 8 SPEs do not include lines-of-code or coding-effort measurements; their performance efficiency includes data transfers, while the other efficiencies were measured for the computational code only.

S. Sacco, et al. "Exploring the Cell with HPEC Challenge Benchmarks." HPEC 2006.

Outline
- Background
- Parallel Vector Tile Optimizing Library
  - Map-Based Programming
  - PVTOL Architecture
- Results
- Summary

Map-Based Programming

A map is an assignment of blocks of data to processing elements. Maps have been demonstrated in several technologies:

    Technology                Organization   Language   Year
    Parallel Vector Library   MIT-LL         C++        2000
    pMatlab                   MIT-LL         MATLAB     2003
    VSIPL++                   HPEC-SI        C++        2006

Example map:
    grid:  1x2
    dist:  block
    procs: 0:1

The grid specification, together with the processor list, describes where the data is distributed; the distribution specification describes how the data is distributed (e.g. block across Proc 0 and Proc 1 of a cluster).

Map-based programming is a method for simplifying the task of assigning data across processors. It has been demonstrated in several technologies, both at Lincoln and outside Lincoln. This slide shows an example illustrating how maps are used to distribute matrices.
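As a rough sketch, the 1x2 block map described above could be written with the PVTOL-style constructors that appear in the projective-transform examples later in these slides. This assumes a Pvtol object has already been constructed (Pvtol pvtol(argc, argv)) and that the processor set contains the two processors 0:1; treat the constructor shapes as illustrative rather than an authoritative API.

    typedef RuntimeMap<DataDist<BlockDist, BlockDist> > RuntimeMap;  // dist: block

    Grid       grid(1, 2, Grid::ARRAY);      // grid:  1x2
    ProcList   procs(pvtol.processorSet());  // procs: 0:1
    RuntimeMap map(grid, procs);             // where + how the data is distributed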

New Challenges

- Hierarchy: extend maps to support the entire storage hierarchy — registers, cache, local memory, remote memory, disk, with instruction operands, blocks, messages and pages as the units moved between levels.
- Heterogeneity: different architectures between processors (e.g. AMD, Intel, PowerPC) and different architectures within a processor (e.g. the Cell's PowerPC Processing Element and Synergistic Processing Elements, or tiled architectures).
- Automated mapping: allow maps to be constructed using automated techniques (e.g. for a signal flow graph of operations such as MULT and FFT over arrays A through E).

Several new challenges to map-based programming have emerged. Multicore architectures expose the storage hierarchy to the programmer. Processing platforms are increasingly composed of heterogeneous processing architectures, both within a processor and across processors. Automated mapping techniques are desirable for automatically constructing maps for the user. Maps must be extended to support all of these new challenges.

PVTOL Goals

PVTOL is a portable and scalable middleware library for multicore processors. It enables incremental development:
1. Develop serial code on a desktop
2. Parallelize the code on a cluster
3. Deploy the code on an embedded computer
4. Automatically parallelize the code

Goal: make parallel programming as easy as serial programming.

PVTOL is focused on addressing the programming complexity associated with emerging "topological processors", which require the programmer to understand the physical topology of the chip to get high efficiency. Many such processors are emerging in the market; the Cell processor is an important example. The current PVTOL effort is focused on getting high performance from the Cell processor on signal and image processing applications, while the PVTOL interface is designed to address a wide range of processors, including multicore and FPGAs. PVTOL enables software developers to develop a high-performance signal processing application on a desktop computer, parallelize the code on commodity clusters, then deploy the code onto an embedded computer with minimal changes to the code. PVTOL also includes automated mapping technology that will automatically parallelize the application for a given platform: applications developed on a workstation can be deployed on an embedded computer and the library will parallelize the application without any changes to the code.

PVTOL Architecture

- Portability: runs on a range of architectures
- Performance: achieves high performance
- Productivity: minimizes effort at the user level

This slide shows a layered view of the PVTOL architecture. At the top is the application. The PVTOL API exposes high-level structures for data (e.g. vectors), data distribution (e.g. maps), communication (e.g. conduits) and computation (e.g. tasks and computational kernels). High-level structures improve the productivity of the programmer. By being built on top of existing technologies optimized for different platforms, PVTOL provides high performance, and by supporting a range of processor architectures, PVTOL applications are portable. The end result is that rather than learning new programming models for new processor technologies, programmers keep the simple von Neumann load-store model they are used to: PVTOL preserves the simple load-store programming model in software.

PVTOL API

- Maps describe how to assign blocks of data across the storage hierarchy
- Data structures encapsulate data allocated across the storage hierarchy into objects
- Kernel objects/functions encapsulate common operations and operate on PVTOL data structures
- Tasks encapsulate computation; conduits pass data between tasks

The PVTOL API is composed of high-level objects that abstract the complexity of programming multicore processors; a small usage sketch follows.
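The sketch below shows, under stated assumptions, how the first three of these pieces might compose: a map, distributed data structures built on that map, and a kernel that operates on them. Types and constructor shapes follow the projective-transform examples later in these slides; some_kernel, Nrows and Ncols are illustrative placeholders, and tasks/conduits are not shown because these slides do not include their code.

    typedef Dense<2, float, tuple<0, 1> >               DenseBlk;
    typedef RuntimeMap<DataDist<BlockDist, BlockDist> > RuntimeMap;
    typedef Matrix<float, DenseBlk, RuntimeMap>         DistMatrix;

    void some_kernel(DistMatrix* in, DistMatrix* out);  // placeholder kernel

    const int Nrows = 512, Ncols = 512;                 // illustrative sizes

    int main(int argc, char** argv)
    {
        Pvtol pvtol(argc, argv);                 // initialize the library

        // Map: assign blocks of data across the processor set
        Grid       grid(1, 2, Grid::ARRAY);
        ProcList   procs(pvtol.processorSet());
        RuntimeMap map(grid, procs);

        // Data structures: matrices allocated according to the map
        DistMatrix in(Nrows, Ncols, map);
        DistMatrix out(Nrows, Ncols, map);

        // Kernel: operates on PVTOL data structures, independent of the mapping
        some_kernel(&in, &out);
        return 0;
    }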

Automated Mapping

- Simulate the architecture using the pMapper simulator infrastructure
- Use pMapper to automate mapping and predict performance

Predicted performance for an example signal flow graph (arrays A-E connected by MULT and FFT operations):

    #procs   Tp (s)
    1        9400
    2        9174
    4        2351
    8        1176
    11       937

PVTOL's automated mapping capability is built on pMapper technology. pMapper can both simulate the expected performance of the target architecture and automate the construction of maps for an application running on that architecture.

N. Bliss, et al. "Automatic Mapping of the HPEC Challenge Benchmarks." HPEC 2006.

Hierarchical Arrays

eXtreme Virtual Memory provides hierarchical arrays and maps:
- Hierarchical arrays hide details of the processor and memory hierarchy
- Hierarchical maps concisely describe the data distribution at each level

Example hierarchy for a Cell cluster:

    clusterMap (across the Cell cluster):     grid: 1x2   dist: block   nodes: 0:1   map: cellMap
    cellMap (across the SPEs of each Cell):   grid: 1x4   dist: block   policy: default   nodes: 0:3   map: speMap
    speMap (into each SPE's local store, LS): grid: 4x1   dist: block   policy: default

Maps describe how to partition an array. There are two types of maps: spatial and temporal. A spatial map describes how elements of an array are divided among multiple processors. A physical spatial map breaks an array into blocks that are physically located on separate processors, e.g. dividing an array between two Cell processors. A logical map differs from a physical map in that the array being partitioned remains intact; the array is logically divided into blocks that are owned by different processors. For example, on a single Cell processor the array resides in main memory, and a logical map may assign blocks of rows to each SPE. Finally, temporal maps divide arrays into blocks that are loaded into memory one at a time. For example, a temporal map may divide an array owned by a single SPE into blocks of rows; the array resides in main memory and the SPE loads one block at a time into its local store.

H. Kim, et al. "Advanced Hardware and Software Technologies for Ultra-long FFTs." HPEC 2005.
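A rough sketch of how the clusterMap / cellMap / speMap hierarchy above might be expressed follows, using the constructor shapes from the embedded projective-transform example later in these slides. It assumes the RuntimeMap/Grid/ProcList typedefs and the Pvtol object (pvtol) from those examples; the grid sizes come from this slide, and everything else is illustrative rather than an authoritative API.

    // speMap: stream 4x1 blocks through each SPE's local store (default policy)
    Grid           speGrid(4, 1, Grid::ARRAY);
    DataMgmtPolicy speDataPolicy;
    RuntimeMap     speMap(speGrid, speDataPolicy);

    // cellMap: distribute each Cell's block 1x4 across SPEs 0:3,
    // with speMap describing the level below
    Grid       cellGrid(1, 4, Grid::ARRAY);
    ProcList   speProcs(pvtol.tileProcessorSet());
    RuntimeMap cellMap(cellGrid, speProcs, speMap);

    // clusterMap: distribute the array 1x2 across Cell nodes 0:1,
    // with cellMap describing the level below
    Grid       clusterGrid(1, 2, Grid::ARRAY);
    ProcList   cellProcs(pvtol.processorSet());
    RuntimeMap clusterMap(clusterGrid, cellProcs, cellMap);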

Computational Kernels & Processor Interface

Computational kernels (e.g. convolution, FFT):
- Mercury Scientific Algorithm Library (SAL)
- Intel IPP
- Vector Signal and Image Processing Library (VSIPL, VSIPL++) — the HPEC Software Initiative's object-oriented, open-standards, interoperable and scalable effort (prototype, develop, demonstrate), targeting performance (1.5x), portability (3x) and productivity (3x)

Processor interfaces:
- Mercury Multicore Framework (MCF)
- IBM Cell API (POSIX-like threads)

PVTOL uses existing technologies to provide optimized computational kernels and interfaces to target processing architectures. For example, Mercury has a version of its Scientific Algorithm Library that contains optimized signal processing kernels for the Cell, and Intel's Integrated Performance Primitives provides optimized kernels for Intel architectures. VSIPL++ is a signal processing standard supported on a range of architectures. The Mercury Multicore Framework provides a programming interface to the Cell processor that is built on IBM's Cell API and interoperates with SAL.

Outline
- Background
- Parallel Vector Tile Optimizing Library
- Results
  - Projective Transform
  - Example Code
- Summary

Projective Transform

- Many DoD optical applications use mobile cameras
- Consecutive frames may be skewed relative to each other
- Standardizing the perspective allows feature extraction
- The projective transform is a homogeneous warp transform: each pixel in the destination image is mapped to a pixel in the source image
- Example of a real-life application with a complex data distribution

The projective transform is a useful kernel for many DoD optical sensor applications. It provides a useful example of a computational kernel, required by real-life applications, that has a complex data distribution whose implementation can be simplified by using PVTOL.

S. Sacco, et al. "Projective Transform on Cell: A Case Study." HPEC 2007.
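For readers unfamiliar with the math, a homogeneous warp maps each destination pixel (x, y) through a 3x3 coefficient matrix and a perspective divide to find its source location. The following is a minimal illustrative sketch with nearest-neighbor sampling; the row-major coefficient layout and out-of-bounds handling are assumptions, and this is not the library's or the case study's actual kernel.

    #include <cmath>

    // Reference projective (homogeneous) warp, nearest-neighbor sampling.
    // c is a 3x3 row-major coefficient matrix; degenerate w is not handled.
    void projective_transform_ref(const short* src, short* dst,
                                  int rows, int cols, const float c[9])
    {
        for (int y = 0; y < rows; ++y) {
            for (int x = 0; x < cols; ++x) {
                // Map destination pixel (x, y) back into the source image
                float w  = c[6] * x + c[7] * y + c[8];
                float sx = (c[0] * x + c[1] * y + c[2]) / w;
                float sy = (c[3] * x + c[4] * y + c[5]) / w;
                int   ix = static_cast<int>(std::lround(sx));
                int   iy = static_cast<int>(std::lround(sy));
                // Pixels that map outside the source image are set to zero
                bool inside = (ix >= 0 && ix < cols && iy >= 0 && iy < rows);
                dst[y * cols + x] = inside ? src[iy * cols + ix] : 0;
            }
        }
    }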

Projective Transform Data Distribution

The mapping between source and destination pixels is data dependent, so regular data distributions cannot be used for both source and destination. Instead:

1. Break the destination image into blocks
2. Map each destination block to the source image
3. Compute the extent box of the corresponding source block
4. Transfer the source and destination blocks to SPE local store
5. The SPE applies the transform to the source and destination blocks

This slide shows the process of distributing an image to be processed by the projective transform.

Projective Transform Code: Serial

    typedef Dense<2, short int, tuple<0, 1> > DenseBlk;
    typedef Dense<2, float, tuple<0, 1> > DenseCoeffBlk;
    typedef Matrix<short int, DenseBlk, LocalMap> SrcImage16;
    typedef Matrix<short int, DenseBlk, LocalMap> DstImage16;
    typedef Matrix<float, DenseCoeffBlk, LocalMap> Coeffs;

    int main(int argc, char** argv)
    {
        Pvtol pvtol(argc, argv);

        // Allocate 16-bit images and warp coefficients
        SrcImage16 src(Nrows, Ncols);
        DstImage16 dst(Nrows, Ncols);
        Coeffs coeffs(3, 3);

        // Load source image and initialize warp coefficients
        loadSourceImage(&src);
        initWarpCoeffs(&coeffs);

        // Perform projective transform
        projective_transform(&src, &dst, &coeffs);
    }

This slide shows the serial PVTOL code for the projective transform.

Projective Transform Code: Parallel

    typedef RuntimeMap<DataDist<BlockDist, BlockDist> > RuntimeMap;
    typedef Dense<2, short int, tuple<0, 1> > DenseBlk;
    typedef Dense<2, float, tuple<0, 1> > DenseCoeffBlk;
    typedef Matrix<short int, DenseBlk, LocalMap> SrcImage16;
    typedef Matrix<short int, DenseBlk, RuntimeMap> DstImage16;
    typedef Matrix<float, DenseCoeffBlk, LocalMap> Coeffs;

    int main(int argc, char** argv)
    {
        Pvtol pvtol(argc, argv);

        Grid dstGrid(1, 1, Grid::ARRAY);  // Allocate on 1 Cell
        ProcList pList(pvtol.processorSet());
        RuntimeMap dstMap(dstGrid, pList);

        // Allocate 16-bit images and warp coefficients
        SrcImage16 src(Nrows, Ncols);
        DstImage16 dst(Nrows, Ncols, dstMap);
        Coeffs coeffs(3, 3);

        // Load source image and initialize warp coefficients
        loadSourceImage(&src);
        initWarpCoeffs(&coeffs);

        // Perform projective transform
        projective_transform(&src, &dst, &coeffs);
    }

This slide shows the parallel PVTOL code for the projective transform. The code required to make the serial code parallel (highlighted in red on the original slide) is the RuntimeMap typedef, the Grid/ProcList/RuntimeMap construction, the RuntimeMap template parameter on DstImage16, and the dstMap argument to dst.

Projective Transform Code: Embedded

    typedef RuntimeMap<DataDist<BlockDist, BlockDist> > RuntimeMap;
    typedef Dense<2, short int, tuple<0, 1> > DenseBlk;
    typedef Dense<2, float, tuple<0, 1> > DenseCoeffBlk;
    typedef Matrix<short int, DenseBlk, LocalMap> SrcImage16;
    typedef Matrix<short int, DenseBlk, RuntimeMap> DstImage16;
    typedef Matrix<float, DenseCoeffBlk, LocalMap> Coeffs;

    int main(int argc, char** argv)
    {
        Pvtol pvtol(argc, argv);

        // Hierarchical map for the destination image
        Grid dstTileGrid(PT_BLOCKSIZE, PT_BLOCKSIZE, Grid::ELEMENT);  // Break into blocks
        DataMgmtPolicy tileDataPolicy;
        RuntimeMap dstTileMap(dstTileGrid, tileDataPolicy);

        Grid dstSPEGrid(1, pvtol.numTileProcessor(), Grid::ARRAY);  // Distribute across SPEs
        ProcList speProcList(pvtol.tileProcessorSet());
        RuntimeMap dstSPEMap(dstSPEGrid, speProcList, dstTileMap);

        Grid dstGrid(1, 1, Grid::ARRAY);  // Allocate on 1 Cell
        ProcList pList(pvtol.processorSet());
        RuntimeMap dstMap(dstGrid, pList, dstSPEMap);

        // Allocate 16-bit images and warp coefficients
        SrcImage16 src(Nrows, Ncols);
        DstImage16 dst(Nrows, Ncols, dstMap);
        Coeffs coeffs(3, 3);

        // Load source image and initialize warp coefficients
        loadSourceImage(&src);
        initWarpCoeffs(&coeffs);

        // Perform projective transform
        projective_transform(&src, &dst, &coeffs);
    }

This slide shows the hierarchical PVTOL code for the projective transform on an embedded Cell platform. The code required to make the parallel code hierarchical (highlighted in red on the original slide) consists of the tile-level and SPE-level maps, which are passed as child maps when constructing dstMap.

Projective Transform Code: Automapped

    typedef Dense<2, short int, tuple<0, 1> > DenseBlk;
    typedef Dense<2, float, tuple<0, 1> > DenseCoeffBlk;
    typedef Matrix<short int, DenseBlk, LocalMap> SrcImage16;
    typedef Matrix<short int, DenseBlk, AutoMap> DstImage16;
    typedef Matrix<float, DenseCoeffBlk, LocalMap> Coeffs;

    int main(int argc, char** argv)
    {
        Pvtol pvtol(argc, argv);

        // Allocate 16-bit images and warp coefficients
        SrcImage16 src(Nrows, Ncols);
        DstImage16 dst(Nrows, Ncols);
        Coeffs coeffs(3, 3);

        // Load source image and initialize warp coefficients
        loadSourceImage(&src);
        initWarpCoeffs(&coeffs);

        // Perform projective transform
        projective_transform(&src, &dst, &coeffs);
    }

This slide shows the automapped PVTOL code for the projective transform. The only change required to automap the serial code (highlighted in red on the original slide) is the AutoMap template parameter on DstImage16.

Results: Performance

[Chart: GOPS vs. image size (megapixels) for the baseline C, Mercury MCF and PVTOL implementations, reaching roughly 0.95, 20.14 and 19.04 GOPS respectively at the largest image size.]

PVTOL adds minimal overhead.

This slide shows the performance of three implementations of the projective transform. The baseline C code is an ANSI C implementation; the Mercury MCF code implements the projective transform directly with MCF; and the PVTOL code wraps the MCF implementation using prototype PVTOL constructs. The PVTOL prototype adds some overhead to the MCF version, but as data sizes increase the overhead becomes negligible. For small data sizes, the baseline C version outperforms both PVTOL and MCF due to the inherent overhead of the MCF library, but as data sizes increase, PVTOL and MCF outperform C.

Results: Performance vs. Effort

              GOPS*    SLOCs
    ANSI C    0.95     52
    MCF       20.14    736
    PVTOL     19.04    46

* 10-megapixel image; 3.2 GHz Intel Xeon (ANSI C), 3.2 GHz Cell with 8 SPEs (MCF and PVTOL)

PVTOL achieves high performance with effort comparable to ANSI C.

This slide compares the performance of the various implementations of the projective transform against the number of software lines of code required for each. On the original slide the SLOCs are broken into two columns: the first denotes the number of SLOCs the user must write — for ANSI C the user must actually implement the projective transform, while for MCF and PVTOL an optimized SPE kernel would be provided — and the second includes the SLOCs required to implement the SPE kernel (the ANSI C count is unchanged because it only runs on the PPE). The results show that the PVTOL code is comparable in size to ANSI C, keeping in mind that the computation code becomes the responsibility of the PVTOL library.

Outline
- Background
- Parallel Vector Tile Optimizing Library
- Results
- Summary

Future of Multicore Processors

- AMD vs. Intel (AMD Phenom, Intel Core 2 Extreme): different flavors of multicore — replication vs. hierarchy
- "Many-core" processors: in 2006, Intel achieved 1 TeraFLOP on an 80-core processor (Intel Polaris)
- Heterogeneity: multiple types of cores and architectures (e.g. Broadcom BCM1480, GPUs); different threads running on different architectures

Multicore processing is still a young field and the future will introduce even more programming challenges. Both AMD and Intel are introducing quad-core processors, but using different approaches: AMD's processor replicates a single core four times, while Intel replicates two dual-core processors. Hierarchical arrays will allow programmers to distribute data in a flat manner across AMD architectures or in a hierarchical manner across Intel's architectures. "Many-core" describes future multicore architectures that contain a large number of cores (>>16); the challenge becomes how to allocate applications to such a large number of cores. Finally, both multicore and many-core processors will increasingly encounter heterogeneity, both in having different processing architectures on the same chip and in having different threads of an application running on different cores. PVTOL's programming constructs will enable programmers to easily target these future multicore architectures.

PVTOL will extend to support future multicore designs.

Summary
- Emerging DoD intelligence missions will collect more data; real-time performance requires a significant amount of processing power
- Processor vendors are moving to multicore architectures, which are extremely difficult to program
- PVTOL provides a simple means to program multicore processors and has been demonstrated on a real-life application