PVTOL: Designing Portability, Productivity and Performance for Multicore Architectures
Hahn Kim, Nadya Bliss, Jim Daly, Karen Eng, Jeremiah Gale, James Geraci, Ryan Haney, Jeremy Kepner, Sanjeev Mohindra, Sharon Sacco, Eddie Rutledge

Presentation transcript:

PVTOL: Designing Portability, Productivity and Performance for Multicore Architectures Hahn Kim, Nadya Bliss, Jim Daly, Karen Eng, Jeremiah Gale, James Geraci, Ryan Haney, Jeremy Kepner, Sanjeev Mohindra, Sharon Sacco, Eddie Rutledge. HPEC 2008, 25 September 2008. Title slide. This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Outline: Background (Motivation, Multicore Processors, Programming Challenges); Tasks & Conduits; Maps & Arrays; Results; Summary. Outline slide.

SWaP* for Real-Time Embedded Systems Modern DoD sensors continue to increase in fidelity and sampling rates; real-time processing will always be a requirement. [Figure: U-2 and Global Hawk, illustrating decreasing SWaP] Modern DoD sensors continue to increase in fidelity and sampling rates, resulting in increasingly larger data sets. Yet size, weight and power (SWaP) requirements for embedded processing systems remain the same or even shrink. Consider the replacement of manned reconnaissance aircraft (e.g. the U-2) with UAVs (e.g. the Global Hawk). UAVs can impose very tight requirements for size (several cubic feet), weight (tens of pounds) and power (tens to hundreds of Watts) on real-time embedded systems. Modern sensor platforms impose tight SWaP requirements on real-time embedded systems. * SWaP = Size, Weight and Power

Embedded Processor Evolution [Chart: MFLOPS/Watt vs. year (1990-2010) for high-performance embedded processors: i860 XR, SHARC, 603e, 750, MPC7400, MPC7410, MPC7447A and the Cell, a multicore processor with 1 PowerPC core and 8 SIMD cores; legend: i860, SHARC, PowerPC, PowerPC with AltiVec, Cell (estimated), GPU, PowerXCell 8i] Embedded processor performance, in terms of FLOPS/Watt, has grown exponentially over the last 20 years. No single processing architecture has dominated over this period; hence, in order to leverage this increase in performance, embedded system designers must switch processing architectures approximately every 5 years. MFLOPS/W for the i860, SHARC, 603e, 750, 7400 and 7410 are extrapolated from board wattage and also include other hardware energy use such as memory, memory controllers, etc. The 7447A and the Cell estimate are for the chip only. Effective FLOPS for all processors are based on 1024-point FFT timings; the Cell estimate uses hand-coded TDFIR timings for effective FLOPS. Multicore processors help achieve performance requirements within tight SWaP constraints: 20 years of exponential growth in FLOPS/W, a required switch of architectures every ~5 years, and current high-performance architectures are multicore.

Parallel Vector Tile Optimizing Library PVTOL is a portable and scalable middleware library for multicore processors. It enables a unique software development process for real-time signal processing applications: 1. develop serial code on a desktop; 2. parallelize the code on a cluster; 3. deploy the code on an embedded computer; 4. automatically parallelize the code. PVTOL is focused on addressing the programming complexity associated with emerging hierarchical processors. Hierarchical processors require the programmer to understand the physical hierarchy of the chip to get high efficiency. There are many such processors emerging into the market; the Cell processor is an important example. The current PVTOL effort is focused on getting high performance from the Cell processor on signal and image processing applications. The PVTOL interface is designed to address a wide range of processors, including multicore, FPGAs and GPUs. PVTOL enables software developers to develop high-performance signal processing applications on a desktop computer, parallelize the code on commodity clusters, then deploy the code onto an embedded computer, with minimal changes to the code. PVTOL also includes automated mapping technology that will automatically parallelize the application for a given platform. Applications developed on a workstation can be deployed on an embedded computer and the library will parallelize the application without any changes to the code. Make parallel programming as easy as serial programming.

PVTOL Architecture Tasks & Conduits: concurrency and data movement. Maps & Arrays: distribute data across processor and memory hierarchies. Functors: abstract computational kernels into objects. This slide shows a layered view of the PVTOL architecture. At the top is the application. The PVTOL API exposes high-level structures for data (e.g. vectors), data distribution (e.g. maps), communication (e.g. conduits) and computation (e.g. tasks and computational kernels). High-level structures improve the productivity of the programmer. By being built on top of existing technologies, optimized for different platforms, PVTOL provides high performance. And by supporting a range of processor architectures, PVTOL applications are portable. The end result is that rather than learning new programming models for new processor technologies, PVTOL preserves the simple von Neumann programming model most programmers are used to. This presentation focuses on the three main parts of the API: Tasks & Conduits, Hierarchical Arrays and Functors. Portability: runs on a range of architectures. Performance: achieves high performance. Productivity: minimizes effort at the user level.

Outline: Background; Tasks & Conduits; Maps & Arrays; Results; Summary. Outline slide.

Multicore Programming Challenges
Inside the box (desktop, embedded board): threads (Pthreads, OpenMP), shared memory, pointer passing, mutexes and condition variables.
Outside the box (cluster, embedded multicomputer): processes (MPI such as MPICH or Open MPI, Mercury PAS), distributed memory, message passing.
Different programming paradigms are used for programming inside the box and outside the box. Technically speaking, different programming paradigms are used for programming inside a process, i.e. threads, and across processes, i.e. message passing. PVTOL provides consistent semantics for both multicore and cluster computing.
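To make the contrast concrete, the sketch below is a minimal illustration (not PVTOL code) of the two paradigms side by side, assuming a C++11 compiler and an MPI installation such as MPICH or Open MPI: inside the box, threads in one process touch a shared buffer directly; outside the box, separate processes exchange the same data with explicit messages. The two functions would normally live in separately built programs.

// Minimal sketch contrasting shared memory and message passing; not PVTOL code.
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>
#include <mpi.h>   // assumes an MPI implementation (e.g. MPICH, Open MPI)

void insideTheBox() {
    std::vector<double> data(1024, 1.0);                 // one buffer, visible to all threads
    auto worker = [&](std::size_t lo, std::size_t hi) {  // each thread scales its own half
        for (std::size_t i = lo; i < hi; ++i) data[i] *= 2.0;
    };
    std::thread t0(worker, std::size_t(0), data.size() / 2);
    std::thread t1(worker, data.size() / 2, data.size());
    t0.join();
    t1.join();                                           // synchronization via join
}

void outsideTheBox(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    std::vector<double> data(512, 1.0);                  // each process owns a private copy
    if (rank == 0) {
        MPI_Send(data.data(), static_cast<int>(data.size()), MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data.data(), static_cast<int>(data.size()), MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
}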

Tasks & Conduits Tasks provide concurrency: a Task is a collection of one or more threads in one or more processes; Tasks are SPMD, i.e. each thread runs the task code; Task Maps specify the locations of Tasks. Conduits move data: they move data safely, provide multibuffering and handle synchronization. [Diagram: DIT reads from disk, load(B), cdt1.write(B); DAT, cdt1.read(B), A = B, cdt2.write(A); DOT, cdt2.read(A), save(A)] A Task is a collection of parallel threads running in multiple processes. The allocation of threads to specific processors is specified using the Task Map. Conduits are used to send data between Tasks. The DIT-DAT-DOT is a pipeline structure commonly found in real-time embedded signal processing systems. In this example, the DIT will 1) read a vector, B, from disk and 2) pass B to the DAT via a conduit. The DAT will 3) accept B from the conduit, 4) allocate a new vector, A, and assign B to A, then 5) pass A to the DOT via another conduit. The DOT will 6) accept A from the conduit and 7) write A to disk. DIT: read data from source (1 thread). DAT: process data (4 threads). DOT: output results (1 thread). Conduits connect the DIT to the DAT and the DAT to the DOT. * DIT – Data Input Task, DAT – Data Analysis Task, DOT – Data Output Task
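The conduit abstraction can be pictured as a thread-safe, buffered handoff between a writer task and a reader task. The sketch below is not PVTOL's implementation; it is a minimal single-slot illustration of the pattern using the C++ standard library, with hypothetical names ToyConduit, write and read.

// Minimal single-slot "conduit" sketch (illustration only, not the PVTOL API).
#include <condition_variable>
#include <mutex>
#include <utility>

template <typename T>
class ToyConduit {
public:
    void write(T value) {                       // producer side (e.g. a DIT)
        std::unique_lock<std::mutex> lock(m_mutex);
        m_notFull.wait(lock, [this] { return !m_full; });
        m_slot = std::move(value);
        m_full = true;
        m_notEmpty.notify_one();
    }
    T read() {                                  // consumer side (e.g. a DAT)
        std::unique_lock<std::mutex> lock(m_mutex);
        m_notEmpty.wait(lock, [this] { return m_full; });
        T value = std::move(m_slot);
        m_full = false;
        m_notFull.notify_one();
        return value;
    }
private:
    std::mutex m_mutex;
    std::condition_variable m_notFull, m_notEmpty;
    T m_slot{};
    bool m_full = false;
};

A real conduit adds multibuffering (more than one slot), so the writer can fill the next buffer while the reader drains the previous one, and it works across processes as well as threads.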

Pipeline Example DIT-DAT-DOT

int main(int argc, char** argv)
{
  // Create maps (omitted for brevity)
  ...

  // Create the tasks
  Task<Dit> dit("Data Input Task", ditMap);
  Task<Dat> dat("Data Analysis Task", datMap);
  Task<Dot> dot("Data Output Task", dotMap);

  // Create the conduits
  Conduit<Matrix <double> > ab("A to B Conduit");
  Conduit<Matrix <double> > bc("B to C Conduit");

  // Make the connections
  dit.init(ab.getWriter());
  dat.init(ab.getReader(), bc.getWriter());
  dot.init(bc.getReader());

  // Complete the connections
  ab.setupComplete();
  bc.setupComplete();

  // Launch the tasks
  dit.run();
  dat.run();
  dot.run();

  // Wait for tasks to complete
  dit.waitTillDone();
  dat.waitTillDone();
  dot.waitTillDone();
}

The main function creates the tasks, connects the tasks with conduits and launches the task computation. [Diagram: dit -> ab -> dat -> bc -> dot] Parallel code using Tasks and Conduits. The code here creates the tasks and conduits and connects them together to create a DIT-DAT-DOT pipeline.

Pipeline Example Data Analysis Task (DAT)

class Dat {
private:
  Conduit<Matrix <double> >::Reader m_Reader;
  Conduit<Matrix <double> >::Writer m_Writer;

public:
  void init(Conduit<Matrix <double> >::Reader& reader,
            Conduit<Matrix <double> >::Writer& writer)
  {
    // Get data reader for the conduit
    reader.setup(tr1::Array<int, 2>(ROWS, COLS));
    m_Reader = reader;

    // Get data writer for the conduit
    writer.setup(tr1::Array<int, 2>(ROWS, COLS));
    m_Writer = writer;
  }

  void run()
  {
    Matrix <double>& B = m_Reader.getData();
    Matrix <double>& A = m_Writer.getData();
    A = B;
    m_Reader.releaseData();
    m_Writer.releaseData();
  }
};

Tasks read and write data using Reader and Writer interfaces to Conduits. Readers and Writers provide handles to data buffers. [Diagram: reader delivers B, the task computes A = B, the writer sends A] The code here shows the implementation of the DAT, which reads data from an input conduit using a reader object, processes the data, then writes the data to the output conduit using a writer object.

Outline: Background; Tasks & Conduits; Maps & Arrays (Hierarchy, Functors); Results; Summary. Outline slide.

Map-Based Programming A map is an assignment of blocks of data to processing elements. Maps have been demonstrated in several technologies:

Technology               Organization  Language  Year
Parallel Vector Library  MIT-LL*       C++       2000
pMatlab                  MIT-LL        MATLAB    2003
VSIPL++                  HPEC-SI†      C++       2006

Example maps for a 1x2 grid on procs 0:1 with block, cyclic and block-cyclic distributions. The grid specification together with the processor list describes where data are distributed; the distribution specification describes how data are distributed. Map-based programming is a method for simplifying the task of assigning data across processors. Map-based programming has been demonstrated in several technologies, both at Lincoln and outside Lincoln. This slide shows an example illustrating how maps are used to distribute matrices across a cluster (Proc 0, Proc 1). * MIT Lincoln Laboratory † High Performance Embedded Computing Software Initiative
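As a rough illustration of what a map encodes, the sketch below (hypothetical helper functions; not the PVL, pMatlab or VSIPL++ API) computes which processor owns a given column of a matrix under the block and cyclic distributions from the example above, assuming a 1xP grid over columns.

// Toy illustration of block vs. cyclic column ownership for a 1xP grid; not a library API.
#include <algorithm>

// Block distribution: contiguous chunks of ceil(nCols/nProcs) columns per processor.
int blockOwner(int col, int nCols, int nProcs) {
    int chunk = (nCols + nProcs - 1) / nProcs;   // ceil(nCols / nProcs)
    return std::min(col / chunk, nProcs - 1);
}

// Cyclic distribution: columns dealt out round-robin.
int cyclicOwner(int col, int nProcs) {
    return col % nProcs;
}

// For an 8-column matrix on procs 0:1 (grid 1x2):
//   block:  columns 0-3 -> proc 0, columns 4-7 -> proc 1
//   cyclic: even columns -> proc 0, odd columns -> proc 1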

PVTOL Machine Model Processor hierarchy: a processor is scheduled by the OS; a co-processor is dependent on the processor for program control. Memory hierarchy: each level in the processor hierarchy can have its own memory. [Diagram: disk and remote processor memory at the Cell cluster level; local processor memory on each Cell (CELL 0, CELL 1); cache/local co-processor memory and registers on each co-processor (SPE 0, SPE 1, ...); reads and writes of data and registers move data between levels] A processor hierarchy contains a main processor and one or more co-processors that depend on the main processor for program control. On the Cell, the PPE is responsible for spawning threads on the SPEs. Each level in the processor hierarchy can have its own memory. Unlike caches, levels in the memory hierarchy may be disjoint, requiring explicit movement of data through the hierarchy. On the Cell, the PPE first loads data into main memory, then the SPEs transfer data from main memory into local store via DMAs. PVTOL extends maps to support hierarchy.
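The key consequence of a disjoint memory hierarchy is that data must be staged explicitly: a block is copied into the small local memory, processed there, and copied back. The sketch below mimics that access pattern with ordinary memcpy calls standing in for DMA transfers; it is an illustration under assumed sizes, not Cell SDK code.

// Illustration of explicit staging through a small local memory; not Cell SDK code.
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

constexpr std::size_t kLocalElems = 256;   // pretend the local store holds 256 doubles

void scaleThroughLocalStore(std::vector<double>& mainMemory, double gain) {
    double localStore[kLocalElems];
    for (std::size_t base = 0; base < mainMemory.size(); base += kLocalElems) {
        std::size_t n = std::min(kLocalElems, mainMemory.size() - base);
        // "DMA in": main memory -> local store
        std::memcpy(localStore, mainMemory.data() + base, n * sizeof(double));
        // Compute entirely out of local store
        for (std::size_t i = 0; i < n; ++i) localStore[i] *= gain;
        // "DMA out": local store -> main memory
        std::memcpy(mainMemory.data() + base, localStore, n * sizeof(double));
    }
}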

PVTOL Machine Model Processor hierarchy: a processor is scheduled by the OS; a co-processor is dependent on the processor for program control. Memory hierarchy: each level in the processor hierarchy can have its own memory. [Diagram: the same hierarchy with an x86 cluster, x86/PPC processors and GPU/FPGA co-processors (GPU/FPGA 0, GPU/FPGA 1, ...) in place of the Cell and its SPEs] This hierarchical model supports other co-processor architectures. For example, an Intel or PowerPC processor may be the main processor and a GPU or FPGA may be the co-processor. The semantics for describing this type of processor hierarchy do not change. Semantics are the same across different architectures.

Hierarchical Maps and Arrays PVTOL provides hierarchical maps and arrays. Hierarchical maps concisely describe data distribution at each level; hierarchical arrays hide details of the processor and memory hierarchy. Program flow: 1. define a Block (data type, index layout, e.g. row-major); 2. define a Map for each level in the hierarchy (grid, data distribution, processor list); 3. define an Array for the Block; 4. parallelize the Array with the hierarchical Map (optional); 5. process the Array. [Diagram: serial array on a single PPC; parallel array distributed across PPC 0 and PPC 1 (cluster map with grid: 1x2, dist: block, procs: 0:1); hierarchical array further distributed across SPE 0 and SPE 1 local stores on CELL 0 and CELL 1 (block: 1x2)] Maps describe how to partition an array; a map breaks an array into blocks. In this example, the first map divides the array between two Cell processors. The next map divides the data on each Cell processor between two SPEs. Finally, the last map divides the data on each SPE into blocks that are loaded into SPE local store, processed, then written back to main memory one at a time.

Hierarchical Maps and Arrays Example - Serial

int main(int argc, char *argv[])
{
  PvtolProgram pvtol(argc, argv);

  // Allocate the array
  typedef Dense<2, int> BlockType;
  typedef Matrix<int, BlockType> MatType;
  MatType matrix(4, 8);
}

This slide shows PVTOL code to allocate the serial array.

Hierarchical Maps and Arrays Example - Parallel

int main(int argc, char *argv[])
{
  PvtolProgram pvtol(argc, argv);

  // Distribute columns across 2 Cells
  Grid cellGrid(1, 2);
  DataDistDescription cellDist(BlockDist(0), BlockDist(0));
  RankList cellProcs(2);
  RuntimeMap cellMap(cellProcs, cellGrid, cellDist);

  // Allocate the array
  typedef Dense<2, int> BlockType;
  typedef Matrix<int, BlockType, RuntimeMap> MatType;
  MatType matrix(4, 8, cellMap);
}

This slide shows PVTOL code to allocate the parallel array.

Hierarchical Maps and Arrays Example - Hierarchical

int main(int argc, char *argv[])
{
  PvtolProgram pvtol(argc, argv);

  // Distribute into 1x2 blocks
  unsigned int speLsBlockDims[2] = {1, 2};
  TemporalBlockingInfo speLsBlock(2, speLsBlockDims);
  TemporalMap speLsMap(speLsBlock);

  // Distribute columns across 2 SPEs
  Grid speGrid(1, 2);
  DataDistDescription speDist(BlockDist(0), BlockDist(0));
  RankList speProcs(2);
  RuntimeMap speMap(speProcs, speGrid, speDist, speLsMap);

  // Distribute columns across 2 Cells
  vector<RuntimeMap *> vectSpeMaps(1);
  vectSpeMaps.push_back(&speMap);
  Grid cellGrid(1, 2);
  DataDistDescription cellDist(BlockDist(0), BlockDist(0));
  RankList cellProcs(2);
  RuntimeMap cellMap(cellProcs, cellGrid, cellDist, vectSpeMaps);

  // Allocate the array
  typedef Dense<2, int> BlockType;
  typedef Matrix<int, BlockType, RuntimeMap> MatType;
  MatType matrix(4, 8, cellMap);
}

This slide shows PVTOL code to allocate the hierarchical array.

Functor Fusion Expressions contain multiple operations, e.g. A = B + C .* D (.* denotes elementwise multiplication). Functors encapsulate computation in objects. Fusing functors improves performance by removing the need for temporary variables. Let Xi be block i in array X.

Unfused: perform tmp = C .* D for all blocks: 1. load Di into SPE local store; 2. load Ci into SPE local store; 3. perform tmpi = Ci .* Di; 4. store tmpi in main memory. Then perform A = tmp + B for all blocks: 5. load tmpi into SPE local store; 6. load Bi into SPE local store; 7. perform Ai = tmpi + Bi; 8. store Ai in main memory.

Fused: perform A = B + C .* D for all blocks: 1. load Di into SPE local store; 2. load Ci into SPE local store; 3. perform tmpi = Ci .* Di; 4. load Bi into SPE local store; 5. perform Ai = tmpi + Bi; 6. store Ai in main memory.

Functors can further improve performance by fusing data-parallel functors in a single expression. Consider the expression A = B + C .* D. If A, B, C and D are allocated as hierarchical arrays, then SPEs must process blocks one at a time. A naive implementation would process the entire multiplication for all blocks in C and D, then the entire addition. This requires a temporary variable, tmp, in main memory that stores the results of the multiplication; the addition is then applied to B and tmp. This adds extra DMA transfers between the SPE local store and main memory to read and write tmp. A more efficient implementation recognizes that the multiplication and addition are part of a single expression and fuses the operations such that the entire expression is applied to a set of blocks across all arrays. Fusing operations removes the overhead of moving data between local store and main memory required by the temporary variable.
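The difference is easy to see in scalar C++ terms. The sketch below is an illustration of the idea, not PVTOL's functor machinery: plain loops stand in for functors and a vector of vectors stands in for a hierarchical array. The unfused version streams every block through local store twice and round-trips a temporary through main memory; the fused version touches each block once.

// Illustration of unfused vs. fused evaluation of A = B + C .* D over blocks; not PVTOL code.
#include <cstddef>
#include <vector>

using Block = std::vector<double>;          // one block's worth of elements
using HierArray = std::vector<Block>;       // array = sequence of blocks (assumes A is pre-sized)

void unfused(HierArray& A, const HierArray& B, const HierArray& C, const HierArray& D) {
    HierArray tmp(C.size());                              // temporary lives in main memory
    for (std::size_t b = 0; b < C.size(); ++b) {          // pass 1: tmp = C .* D
        tmp[b].resize(C[b].size());
        for (std::size_t i = 0; i < C[b].size(); ++i)     // load Cb, Db; store tmpb
            tmp[b][i] = C[b][i] * D[b][i];
    }
    for (std::size_t b = 0; b < B.size(); ++b)            // pass 2: A = tmp + B
        for (std::size_t i = 0; i < B[b].size(); ++i)     // load tmpb, Bb; store Ab
            A[b][i] = tmp[b][i] + B[b][i];
}

void fused(HierArray& A, const HierArray& B, const HierArray& C, const HierArray& D) {
    for (std::size_t b = 0; b < B.size(); ++b)            // single pass, no temporary array
        for (std::size_t i = 0; i < B[b].size(); ++i)     // load Bb, Cb, Db; store Ab
            A[b][i] = B[b][i] + C[b][i] * D[b][i];
}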

Outline: Background; Tasks & Conduits; Maps & Arrays; Results; Summary. Outline slide.

Persistent Surveillance Canonical Front End Processing Processing requirements: ~300 Gflops. [Logical block diagram: a sensor package (GPS/INS) produces video and GPS/IMU data; stabilization/registration (optic flow) at ~600 ops/pixel (8 iterations) x 10% = 120 Gflops; projective transform at ~40 ops/pixel = 80 Gflops; detection at ~50 ops/pixel = 100 Gflops; JPEG 2000 compression and a disk controller handle storage; hardware: 4U Mercury server with a 4 x AMD motherboard and two CABs on a PCI Express (PCI-E) bus] This slide shows the basic processing chain for aerial persistent surveillance. A camera captures video. Stabilization/registration determines a set of warp coefficients that is used by the projective transform to warp each image so that the scene in each frame is viewed from the same perspective. After images have been warped, detection locates moving targets in the images. A real-time system is being developed that will perform these steps in-flight using Cell-based processing hardware: a 4U Mercury server with a 2 x AMD CPU motherboard, 2 x Mercury Cell Accelerator Boards (CAB) and 2 x JPEG 2000 boards connected by a PCI Express (PCI-E) bus. Signal and image processing turn sensor data into viewable images.
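As a consistency check (my arithmetic, not a figure from the slide), the three per-stage budgets all correspond to an aggregate rate of roughly 2 Gpixel/s, and they sum to the stated total:

\[
\frac{120\ \text{Gflops}}{600 \times 0.10\ \text{ops/pixel}} =
\frac{80\ \text{Gflops}}{40\ \text{ops/pixel}} =
\frac{100\ \text{Gflops}}{50\ \text{ops/pixel}} \approx 2\ \text{Gpixel/s},
\qquad 120 + 80 + 100 = 300\ \text{Gflops}.
\]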

Post-Processing Software Current CONOPS: record video in-flight; apply registration and detection on the ground; analyze results on the ground. Future CONOPS: apply registration and detection in-flight; analyze data on the ground.

read(S)
gaussianPyramid(S)
for (nLevels) {
  for (nIters) {
    D = projectiveTransform(S, C)
    C = opticFlow(S, D)
  }
}
write(D)

[Diagram: S is read from disk, D is written to disk] Currently, video imagery is recorded in-flight, then downloaded from the storage drives and processed on the ground.

Real-Time Processing Software Step 1: Create a skeleton DIT-DAT-DOT. The input and output of the DAT should match the input and output of the application. [Diagram: DIT reads from disk, read(B), cdt1.insert(B); DAT, cdt1.extract(B), A = B, cdt2.insert(A); DOT, cdt2.extract(A), write(A)] The first step in the process of turning the post-processing application into a real-time application is to build a skeleton DIT-DAT-DOT. The input and output of the DAT will be read from and written to disk by the DIT and DOT, respectively. The input and output data should match that of the post-processing application. The DAT does no actual processing at this point; the goal is to establish the data flow. Tasks and Conduits separate I/O from computation. * DIT – Data Input Task, DAT – Data Analysis Task, DOT – Data Output Task

Real-Time Processing Software Step 2: Integrate application code into the DAT. Replace disk I/O with the conduit reader and writer; the input and output of the DAT should match the input and output of the application. [Diagram: DIT reads from disk, read(S), cdt1.insert(S); the skeleton DAT is replaced with the application code, whose read(S)/write(D) calls are replaced by the conduit reader and writer:

read(S)
gaussianPyramid(S)
for (nLevels) {
  for (nIters) {
    D = projectiveTransform(S, C)
    C = opticFlow(S, D)
  }
}
write(D)

DOT, cdt2.extract(D), write(D)] Next, the skeleton DAT is replaced with the post-processing application. The disk I/O in the post-processing application is replaced with the input and output conduits. The functionality is the same as the original post-processing application, but the DAT is now isolated from the I/O. Tasks and Conduits make it easy to change components.

Real-Time Processing Software Step 3: Replace the disk with the camera. Replace the DIT's disk I/O with bus I/O that retrieves data from the camera; the input and output of the DAT should match the input and output of the application. [Diagram: DIT reads from the camera, get(S), cdt1.insert(S); DAT runs the application code:

read(S)
gaussianPyramid(S)
for (nLevels) {
  for (nIters) {
    D = projectiveTransform(S, C)
    C = opticFlow(S, D)
  }
}
write(D)

DOT, cdt2.extract(D), put(D), writes to disk] Finally, the disk I/O in the DIT is replaced with images captured directly from the camera instead of recorded data stored on disk. The camera I/O is performed on the AMD motherboard while the processing occurs on the Cells.

Performance [Diagram: 1 image, 44 imagers per Cell]

# imagers per Cell   Registration time (w/o Tasks & Conduits)   Registration time (w/ Tasks & Conduits*)
All imagers          1188 ms                                    1246 ms
1/2 of imagers       594 ms                                     623 ms
1/4 of imagers       297 ms                                     311 ms

Real-time target time: 500 ms

The camera for our real-time system is comprised of 4 apertures, each with 44 imagers. Each quadrant of the image is registered by a Cell processor. Registration can be performed using varying numbers of imagers in a quadrant. The table shows the measured time for registration on all imagers, 1/2 of the imagers and 1/4 of the imagers for a quadrant. Times are shown for the application with and without Tasks and Conduits. Note that Tasks and Conduits impose little overhead. Tasks and Conduits incur little overhead. * Double-buffered
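For reference (my arithmetic, not a number from the slide), the overhead implied by the table is under five percent at every problem size:

\[
\frac{1246 - 1188}{1188} \approx 4.9\%, \qquad
\frac{623 - 594}{594} \approx 4.9\%, \qquad
\frac{311 - 297}{297} \approx 4.7\%.
\]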

Performance vs. Effort Post-processing application: runs on 1 Cell processor, reads from disk, non real-time. Real-time application (2-3% increase in code): runs on the integrated system, reads from disk or camera, real-time. Benefits of Tasks & Conduits: isolates I/O code from computation code; can switch between disk I/O and camera I/O; can create test jigs for computation code; I/O and computation run concurrently; can move I/O and computation to different processors; can add multibuffering. The addition of Tasks & Conduits increases the total number of software lines of code in registration by only 2-3%. This small increase in code size delivers significant benefits, including the ability to create modules that separate I/O and computation code. This allows us to change the I/O without having to change the computation code. This also enables us to create test jigs for the computation code. Tasks & Conduits also allow us to easily run I/O and computation concurrently. Currently, I/O and computation tasks run in separate threads, but they could run on separate processors simply by changing the Task Map. Finally, multibuffering can be scaled by simply changing a parameter in the Conduit.

Outline: Background; Tasks & Conduits; Hierarchical Maps & Arrays; Results; Summary. Outline slide.

Future (Co-)Processor Trends Multicore: IBM PowerXCell 8i (9 cores: 1 PPE + 8 SPEs, 204.8 GFLOPS single precision, 102.4 GFLOPS double precision, 92 W peak (est.)); Tilera TILE64 (64 cores, 443 GOPS, 15-22 W @ 700 MHz). FPGAs: Xilinx Virtex-5 (up to 330,000 logic cells, 580 GMACS using DSP slices, PPC 440 processor block); Curtiss-Wright CHAMP-FX2 (VPX-REDI, 2 Xilinx Virtex-5 FPGAs, dual-core PPC 8641D). GPUs: NVIDIA Tesla C1060 (PCI-E x16, ~1 TFLOPS single precision, 225 W peak, 160 W typical); ATI FireStream 9250 (~200 GFLOPS double precision, 150 W). This slide shows various technologies that can be used in future heterogeneous, co-processor architectures. * Information obtained from manufacturers' websites

Summary Modern DoD sensors have tight SWaP constraints, and multicore processors help achieve performance requirements within these constraints. Multicore architectures are extremely difficult to program and fundamentally change the way programmers have to think. PVTOL provides a simple means to program multicore processors. We refactored a post-processing application for real-time operation using Tasks & Conduits with no performance impact; the real-time application is modular and scalable. We are actively developing PVTOL for Intel and Cell, plan to expand to other technologies (e.g. FPGAs, automated mapping), and will propose PVTOL to HPEC-SI for standardization. Summary slide.

Acknowledgements Persistent Surveillance Team: Bill Ross, Herb DaSilva, Peter Boettcher, Chris Bowen, Cindy Fang, Imran Khan, Fred Knight, Gary Long, Bobby Ren. PVTOL Team: Bob Bond, Nadya Bliss, Karen Eng, Jeremiah Gale, James Geraci, Ryan Haney, Jeremy Kepner, Sanjeev Mohindra, Sharon Sacco, Eddie Rutledge.