





1 PVTOL: Designing Portability, Productivity and Performance for Multicore Architectures
Hahn Kim, Nadya Bliss, Jim Daly, Karen Eng, Jeremiah Gale, James Geraci, Ryan Haney, Jeremy Kepner, Sanjeev Mohindra, Sharon Sacco, Eddie Rutledge. HPEC 2008, 25 September 2008. Title slide. This work is sponsored by the Department of the Air Force under Air Force contract FA C. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

2 Outline Background Tasks & Conduits Maps & Arrays Results Summary
Background: Motivation, Multicore Processors, Programming Challenges. Tasks & Conduits. Maps & Arrays. Results. Summary. Outline slide 2

3 SWaP* for Real-Time Embedded Systems
Modern DoD sensors continue to increase in fidelity and sampling rates. Real-time processing will always be a requirement. [Image: U-2 and Global Hawk — decreasing SWaP.] Modern DoD sensors continue to increase in fidelity and sampling rates, resulting in increasingly large data sets. Yet size, weight and power (SWaP) requirements for embedded processing systems remain the same or even shrink. Consider the replacement of manned reconnaissance aircraft (e.g. the U-2) with UAVs (e.g. the Global Hawk). UAVs impose considerably tighter requirements for size (several cubic feet), weight (tens of pounds) and power (tens to hundreds of Watts) on real-time embedded systems. Modern sensor platforms impose tight SWaP requirements on real-time embedded systems. * SWaP = Size, Weight and Power

4 Embedded Processor Evolution
[Chart: MFLOPS/Watt vs. year, 1990–2010, for high-performance embedded processors — i860 XR, SHARC, 603e, 750, MPC7400, MPC7410, MPC7447A, Cell (estimated), GPU, PowerXCell 8i. The Cell is a multicore processor with 1 PowerPC core and 8 SIMD cores.] Embedded processor performance, in terms of FLOPS/Watt, has grown exponentially over the last 20 years. No single processing architecture has dominated over this period, so in order to leverage this increase in performance, embedded system designers must switch processing architectures approximately every 5 years. MFLOPS/W for the i860, SHARC, 603e, 750, 7400 and 7410 are extrapolated from board wattage and also include other hardware energy use such as memory, memory controllers, etc. The 7447A and the Cell estimate are for the chip only. Effective FLOPS for all processors are based on 1024-point FFT timings; the Cell estimate uses hand-coded TDFIR timings for effective FLOPS. Multicore processors help achieve performance requirements within tight SWaP constraints. 20 years of exponential growth in FLOPS/W. Must switch architectures every ~5 years. Current high performance architectures are multicore.

5 Parallel Vector Tile Optimizing Library
PVTOL is a portable and scalable middleware library for multicore processors. It enables a unique software development process for real-time signal processing applications: 1. Develop serial code on a desktop. 2. Parallelize the code on a cluster. 3. Deploy the code on an embedded computer. 4. Automatically parallelize the code. PVTOL is focused on addressing the programming complexity associated with emerging hierarchical processors. Hierarchical processors require the programmer to understand the physical hierarchy of the chip to get high efficiency. There are many such processors emerging into the market; the Cell processor is an important example. The current PVTOL effort is focused on getting high performance from the Cell processor on signal and image processing applications. The PVTOL interface is designed to address a wide range of processors, including multicore, FPGAs and GPUs. PVTOL enables software developers to develop high-performance signal processing applications on a desktop computer, parallelize the code on commodity clusters, then deploy the code onto an embedded computer, with minimal changes to the code. PVTOL also includes automated mapping technology that will automatically parallelize the application for a given platform. Applications developed on a workstation can then be deployed on an embedded computer, and the library will parallelize the application without any changes to the code. Make parallel programming as easy as serial programming.

6 PVTOL Architecture
Tasks & Conduits: concurrency and data movement. Maps & Arrays: distribute data across processor and memory hierarchies. Functors: abstract computational kernels into objects. This slide shows a layered view of the PVTOL architecture. At the top is the application. The PVTOL API exposes high-level structures for data (e.g. vectors), data distribution (e.g. maps), communication (e.g. conduits) and computation (e.g. tasks and computational kernels). High-level structures improve the productivity of the programmer. By being built on top of existing technologies, optimized for different platforms, PVTOL provides high performance. And by supporting a range of processor architectures, PVTOL applications are portable. The end result is that rather than learning new programming models for new processor technologies, PVTOL preserves the simple von Neumann programming model most programmers are used to. This presentation focuses on the three main parts of the API: Tasks & Conduits, Hierarchical Arrays and Functors. Portability: runs on a range of architectures. Performance: achieves high performance. Productivity: minimizes effort at user level.

7 Outline Background Tasks & Conduits Maps & Arrays Results Summary
Outline slide 7

8 Multicore Programming Challenges
Inside the Box (Desktop, Embedded Board): Threads — Pthreads, OpenMP. Shared memory — pointer passing, mutexes, condition variables. Outside the Box (Cluster, Embedded Multicomputer): Processes — MPI (MPICH, Open MPI, etc.), Mercury PAS. Distributed memory — message passing. Different programming paradigms are used for programming inside the box and outside the box. Technically speaking, different programming paradigms are used for programming inside a process, i.e. threads, and across processes, i.e. message passing. PVTOL provides consistent semantics for both multicore and cluster computing.
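To make the contrast concrete, here is a minimal illustrative sketch (not PVTOL code) of the same producer-to-consumer hand-off written both ways: with shared memory and a condition variable inside the box, and with MPI message passing outside the box.

#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>
#include <mpi.h>

// "Inside the box": shared memory -- the consumer sees the producer's buffer
// directly; only a pointer (plus a lock) is handed over, no copy is made.
std::vector<double> g_buffer;
std::mutex g_mutex;
std::condition_variable g_cv;
bool g_ready = false;

void producer_thread() {
    std::lock_guard<std::mutex> lock(g_mutex);
    g_buffer.assign(1024, 1.0);   // fill the shared buffer in place
    g_ready = true;
    g_cv.notify_one();
}

void consumer_thread() {
    std::unique_lock<std::mutex> lock(g_mutex);
    g_cv.wait(lock, [] { return g_ready; });
    // ... use g_buffer directly ...
}

void run_threaded() {
    std::thread p(producer_thread), c(consumer_thread);
    p.join();
    c.join();
}

// "Outside the box": distributed memory -- the data must be copied between
// address spaces with an explicit message (assumes MPI_Init has been called).
void producer_consumer_mpi() {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    std::vector<double> data(1024, 1.0);
    if (rank == 0) {
        MPI_Send(data.data(), 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data.data(), 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
}

PVTOL's Tasks & Conduits, introduced next, present the same read/write semantics to the programmer in both situations.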

9 Tasks & Conduits Tasks provide concurrency
Collection of 1+ threads in 1+ processes. Tasks are SPMD, i.e. each thread runs the task code. Task Maps specify the locations of Tasks. Conduits move data: they move data safely and provide multibuffering and synchronization. [Pipeline diagram: DIT — load(B); cdt1.write(B) → cdt1 → DAT — cdt1.read(B); A = B; cdt2.write(A) → cdt2 → DOT — cdt2.read(A); save(A).] A Task is a collection of parallel threads running in multiple processes. The allocation of threads to specific processors is specified using the Task Map. Conduits are used to send data between Tasks. The DIT-DAT-DOT is a pipeline structure commonly found in real-time embedded signal processing systems. In this example, the DIT will 1) read a vector, B, from disk and 2) pass B to the DAT via a conduit. The DAT will 3) accept B from the conduit, 4) allocate a new vector, A, and assign B to A, then 5) pass A to the DOT via another conduit. The DOT will 6) accept A from the conduit and 7) write A to disk. DIT: read data from source (1 thread). DAT: process data (4 threads). DOT: output results (1 thread). Conduits connect the DIT to the DAT and the DAT to the DOT. * DIT – Data Input Task, DAT – Data Analysis Task, DOT – Data Output Task

10 Pipeline Example DIT-DAT-DOT
int main(int argc, char** argv)
{
    // Create maps (omitted for brevity)
    ...

    // Create the tasks
    Task<Dit> dit("Data Input Task", ditMap);
    Task<Dat> dat("Data Analysis Task", datMap);
    Task<Dot> dot("Data Output Task", dotMap);

    // Create the conduits
    Conduit<Matrix <double> > ab("A to B Conduit");
    Conduit<Matrix <double> > bc("B to C Conduit");

    // Make the connections
    dit.init(ab.getWriter());
    dat.init(ab.getReader(), bc.getWriter());
    dot.init(bc.getReader());

    // Complete the connections
    ab.setupComplete();
    bc.setupComplete();

    // Launch the tasks
    dit.run();
    dat.run();
    dot.run();

    // Wait for tasks to complete
    dit.waitTillDone();
    dat.waitTillDone();
    dot.waitTillDone();
}

The main function creates the tasks, connects the tasks with conduits and launches the task computation. [Diagram: dit → ab → dat → bc → dot.] Parallel code using Tasks and Conduits: the code here creates the tasks and conduits and connects them together to create a DIT-DAT-DOT pipeline.

11 Pipeline Example Data Analysis Task (DAT)
class Dat {
private:
    Conduit<Matrix <double> >::Reader m_Reader;
    Conduit<Matrix <double> >::Writer m_Writer;

public:
    void init(Conduit<Matrix <double> >::Reader& reader,
              Conduit<Matrix <double> >::Writer& writer)
    {
        // Get data reader for the conduit
        reader.setup(tr1::Array<int, 2>(ROWS, COLS));
        m_Reader = reader;

        // Get data writer for the conduit
        writer.setup(tr1::Array<int, 2>(ROWS, COLS));
        m_Writer = writer;
    }

    void run()
    {
        Matrix <double>& B = m_Reader.getData();
        Matrix <double>& A = m_Writer.getData();
        A = B;
        m_Reader.releaseData();
        m_Writer.releaseData();
    }
};

Tasks read and write data using Reader and Writer interfaces to Conduits. Readers and Writers provide handles to data buffers. [Diagram: reader → (A = B) → writer.] The code here shows the implementation of the DAT, which reads data from the input conduit using a reader object, processes the data, then writes the data to the output conduit using a writer object.
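For completeness, a Data Input Task written in the same style might look like the sketch below. This is an illustrative sketch, not code from the presentation: loadFromDisk is a hypothetical helper, and ROWS and COLS are the same constants used in the DAT above.

class Dit {
private:
    Conduit<Matrix <double> >::Writer m_Writer;

public:
    void init(Conduit<Matrix <double> >::Writer& writer)
    {
        // Tell the conduit the shape of the data this task will produce
        writer.setup(tr1::Array<int, 2>(ROWS, COLS));
        m_Writer = writer;
    }

    void run()
    {
        // Get a handle to the conduit's buffer, fill it, then publish it
        Matrix <double>& B = m_Writer.getData();
        loadFromDisk(B);          // hypothetical disk-read helper
        m_Writer.releaseData();
    }
};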

12 Outline Background Tasks & Conduits Maps & Arrays Results Summary
Maps & Arrays: Hierarchy, Functors. Results. Summary. Outline slide 12

13 Map-Based Programming
A map is an assignment of blocks of data to processing elements. Maps have been demonstrated in several technologies:

  Technology               Organization   Language   Year
  Parallel Vector Library  MIT-LL*        C++        2000
  pMatlab                  MIT-LL         MATLAB     2003
  VSIPL++                  HPEC-SI†       C++        2006

[Figure: three example maps, each distributing a matrix across a cluster of two processors (Proc 0, Proc 1): grid: 1x2, dist: block, procs: 0:1; grid: 1x2, dist: cyclic, procs: 0:1; grid: 1x2, dist: block-cyclic, procs: 0:1.] The grid specification together with the processor list describes where data are distributed. The distribution specification describes how data are distributed. Map-based programming is a method for simplifying the task of assigning data across processors. Map-based programming has been demonstrated in several technologies, both at Lincoln and outside Lincoln. This slide shows an example illustrating how maps are used to distribute matrices. * MIT Lincoln Laboratory † High Performance Embedded Computing Software Initiative
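As a preview of PVTOL's C++ syntax (shown in full on the hierarchical-array slides later in the talk), the first map above — grid 1x2, block distribution, processors 0:1 — would be written roughly as follows; the class names are those used on slide 18.

// 1x2 grid, block distribution in each dimension, processors 0:1
Grid grid(1, 2);
DataDistDescription dist(BlockDist(0), BlockDist(0));
RankList procs(2);
RuntimeMap map(procs, grid, dist);

// A 4x8 matrix whose columns are split between the two processors
typedef Dense<2, int> BlockType;
typedef Matrix<int, BlockType, RuntimeMap> MatType;
MatType A(4, 8, map);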

14 PVTOL Machine Model
Processor Hierarchy — Processor: scheduled by the OS. Co-processor: dependent on the processor for program control. Memory Hierarchy — each level in the processor hierarchy can have its own memory. [Diagram: a Cell cluster (CELL 0, CELL 1) with SPE 0 and SPE 1 as co-processors; data moves between disk, remote processor memory, local processor memory, cache/local co-processor memory (SPE local store) and registers via reads and writes.] A processor hierarchy contains a main processor and one or more co-processors that depend on the main processor for program control. On the Cell, the PPE is responsible for spawning threads on the SPEs. Each level in the processor hierarchy can have its own memory. Unlike caches, levels in the memory hierarchy may be disjoint, requiring explicit movement of data through the hierarchy. On the Cell, the PPE first loads data into main memory, then SPEs transfer data from main memory into local store via DMAs. PVTOL extends maps to support hierarchy.
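To give a sense of what "explicit movement of data through the hierarchy" looks like on the Cell without PVTOL, the SPE-side sketch below uses the Cell SDK's MFC intrinsics to stage one block through local store; PVTOL's hierarchical arrays issue equivalent transfers on the programmer's behalf. The function and buffer names are illustrative, and the transfer size must obey the usual DMA alignment and size rules.

#include <spu_mfcio.h>

// SPE-side sketch: pull one block from main memory (effective address ea)
// into local store, process it, then push the result back.
void process_block(volatile float *ls_buf, unsigned long long ea, unsigned int bytes)
{
    const unsigned int tag = 1;

    mfc_get(ls_buf, ea, bytes, tag, 0, 0);     // DMA: main memory -> local store
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();                 // wait for the transfer to finish

    // ... compute on ls_buf in local store ...

    mfc_put(ls_buf, ea, bytes, tag, 0, 0);     // DMA: local store -> main memory
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}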

15 PVTOL Machine Model
Processor Hierarchy — Processor: scheduled by the OS. Co-processor: dependent on the processor for program control. Memory Hierarchy — each level in the processor hierarchy can have its own memory. [Diagram: the same hierarchy for an x86 cluster — x86/PPC 0 and x86/PPC 1 processors with GPU/FPGA 0 and GPU/FPGA 1 as co-processors.] This hierarchical model supports other co-processor architectures. For example, an Intel or PowerPC processor may be the main processor and a GPU or FPGA may be the co-processor. The semantics for describing this type of processor hierarchy do not change. Semantics are the same across different architectures.

16 Hierarchical Maps and Arrays
PVTOL provides hierarchical maps and arrays. Hierarchical maps concisely describe data distribution at each level. Hierarchical arrays hide details of the processor and memory hierarchy. [Diagram: serial — the whole array on one PPC; parallel — the array split across PPC 0 and PPC 1 in a cluster (grid: 1x2, dist: block, procs: 0:1); hierarchical — the array split across CELL 0 and CELL 1, then across SPE 0 and SPE 1, then into 1x2 blocks in SPE local store (block: 1x2).] Program Flow: 1. Define a Block — data type, index layout (e.g. row-major). 2. Define a Map for each level in the hierarchy — grid, data distribution, processor list. 3. Define an Array for the Block. 4. Parallelize the Array with the Hierarchical Map (optional). 5. Process the Array. Maps describe how to partition an array: a map breaks an array into blocks. In this example, the first map divides the array between two Cell processors. The next map divides the data on each Cell processor between two SPEs. Finally, the last map divides the data on each SPE into blocks that are loaded into SPE local store, processed, then written back to main memory one at a time.

17 Hierarchical Maps and Arrays Example - Serial
int main(int argc, char *argv[])
{
    PvtolProgram pvtol(argc, argv);

    // Allocate the array
    typedef Dense<2, int> BlockType;
    typedef Matrix<int, BlockType> MatType;
    MatType matrix(4, 8);
}

(Serial / parallel / hierarchical diagram as on slide 16.) This slide shows the PVTOL code to allocate the serial array.

18 Hierarchical Maps and Arrays Example - Parallel
int main(int argc, char *argv[])
{
    PvtolProgram pvtol(argc, argv);

    // Distribute columns across 2 Cells
    Grid cellGrid(1, 2);
    DataDistDescription cellDist(BlockDist(0), BlockDist(0));
    RankList cellProcs(2);
    RuntimeMap cellMap(cellProcs, cellGrid, cellDist);

    // Allocate the array
    typedef Dense<2, int> BlockType;
    typedef Matrix<int, BlockType, RuntimeMap> MatType;
    MatType matrix(4, 8, cellMap);
}

(Serial / parallel / hierarchical diagram as on slide 16.) This slide shows the PVTOL code to allocate the parallel array.

19 Hierarchical Maps and Arrays Example - Hierarchical
int main(int argc, char *argv[])
{
    PvtolProgram pvtol(argc, argv);

    // Distribute SPE data into 1x2 blocks in local store
    unsigned int speLsBlockDims[2] = {1, 2};
    TemporalBlockingInfo speLsBlock(2, speLsBlockDims);
    TemporalMap speLsMap(speLsBlock);

    // Distribute columns across 2 SPEs
    Grid speGrid(1, 2);
    DataDistDescription speDist(BlockDist(0), BlockDist(0));
    RankList speProcs(2);
    RuntimeMap speMap(speProcs, speGrid, speDist, speLsMap);

    // Distribute columns across 2 Cells
    vector<RuntimeMap *> vectSpeMaps(1);
    vectSpeMaps.push_back(&speMap);
    Grid cellGrid(1, 2);
    DataDistDescription cellDist(BlockDist(0), BlockDist(0));
    RankList cellProcs(2);
    RuntimeMap cellMap(cellProcs, cellGrid, cellDist, vectSpeMaps);

    // Allocate the array
    typedef Dense<2, int> BlockType;
    typedef Matrix<int, BlockType, RuntimeMap> MatType;
    MatType matrix(4, 8, cellMap);
}

(Serial / parallel / hierarchical diagram as on slide 16.) This slide shows the PVTOL code to allocate the hierarchical array.

20 Functor Fusion Expressions contain multiple operations
E.g. A = B + C .* D (.* = elementwise multiplication). Functors encapsulate computation in objects. Fusing functors improves performance by removing the need for temporary variables. Let Xi be block i in array X. Unfused — perform tmp = C .* D for all blocks: 1. Load Di into SPE local store. 2. Load Ci into SPE local store. 3. Perform tmpi = Ci .* Di. 4. Store tmpi in main memory. Then perform A = tmp + B for all blocks: 5. Load tmpi into SPE local store. 6. Load Bi into SPE local store. 7. Perform Ai = tmpi + Bi. 8. Store Ai in main memory. Fused — perform A = B + C .* D for all blocks: 1. Load Di into SPE local store. 2. Load Ci into SPE local store. 3. Perform tmpi = Ci .* Di. 4. Load Bi into SPE local store. 5. Perform Ai = tmpi + Bi. 6. Store Ai in main memory. [Diagram: DMA traffic between PPE main memory and SPE local store for the unfused and fused cases.] Functors can further improve performance by fusing data-parallel functors in a single expression. Consider the expression A = B + C .* D. If A, B, C and D are allocated as hierarchical arrays, then SPEs must process blocks one at a time. A naive implementation would process the entire multiplication for all blocks in C and D, then the entire addition. This requires a temporary variable, tmp, in main memory that stores the results of the multiplication; the addition is then applied to B and tmp. This adds extra DMA transfers between the SPE local store and main memory to read and write tmp. A more efficient implementation recognizes that the multiplication and addition are part of a single expression and fuses the operations so that the entire expression is applied to a set of blocks across all arrays. Fusing operations removes the overhead of moving data between local store and main memory required by the temporary variable.
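PVTOL's own fusion machinery is not shown in the slides; the generic C++ expression-template sketch below (an illustration, not PVTOL code) shows the underlying idea: the expression A = B + C .* D is captured as a type and evaluated element by element in one pass, so the temporary tmp is never materialized.

#include <cstddef>
#include <vector>

// Leaf node: a simple dense array
struct Array {
    std::vector<double> data;
    double operator[](std::size_t i) const { return data[i]; }
};

// Expression nodes only record the operation; nothing is computed yet
template <class L, class R>
struct Mul {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] * r[i]; }
};

template <class L, class R>
struct Add {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

template <class L, class R>
Mul<L, R> elemMul(const L& l, const R& r) { return Mul<L, R>{l, r}; }

template <class L, class R>
Add<L, R> operator+(const L& l, const R& r) { return Add<L, R>{l, r}; }

// Evaluation happens in a single fused loop: no temporary array for C .* D
template <class Expr>
void assign(Array& out, const Expr& e) {
    for (std::size_t i = 0; i < out.data.size(); ++i)
        out.data[i] = e[i];
}

// Usage:  assign(A, B + elemMul(C, D));   // A = B + C .* D in one pass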

21 Outline Background Tasks & Conduits Maps & Arrays Results Summary
Outline slide 21

22 Persistent Surveillance Canonical Front End Processing
Processing requirements: ~300 GFLOPS total. Sensor package (GPS/INS) provides video and GPS/IMU data. Stabilization/Registration (optic flow): ~600 ops/pixel (8 iterations) x 10% = 120 GFLOPS. Projective Transform: ~40 ops/pixel = 80 GFLOPS. Detection: ~50 ops/pixel = 100 GFLOPS. (Taken together, these per-stage figures imply an input rate of roughly 2 Gpixel/s.) [Logical block diagram: sensor → JPEG 2000 boards → disk controller → processing. Hardware: 4U Mercury server with a 2 x AMD CPU motherboard, 2 x Mercury Cell Accelerator Boards (CAB) and 2 x JPEG 2000 boards on a PCI Express (PCI-E) bus.] This slide shows the basic processing chain for aerial persistent surveillance. A camera captures video. Stabilization/registration determines a set of warp coefficients that is used by the projective transform to warp each image so that the scene in each frame is viewed from the same perspective. After images have been warped, detection locates moving targets in the images. A real-time system is being developed that will perform these steps in-flight using Cell-based processing hardware. Signal and image processing turn sensor data into viewable images.

23 Post-Processing Software
Current CONOPS: record video in-flight; apply registration and detection on the ground; analyze results on the ground. Future CONOPS: apply registration and detection in-flight; analyze data on the ground. Post-processing pseudocode (S = source imagery read from disk, D = registered output):

read(S)
gaussianPyramid(S)
for (nLevels) {
    for (nIters) {
        D = projectiveTransform(S, C)
        C = opticFlow(S, D)
    }
}
write(D)

Currently, video imagery is recorded in-flight, then downloaded from the storage drives and processed on the ground.

24 Real-Time Processing Software Step 1: Create skeleton DIT-DAT-DOT
The input and output of the DAT should match the input and output of the application. [Skeleton pipeline: DIT — read(B); cdt1.insert(B) → cdt1 → DAT — cdt1.extract(B); A = B; cdt2.insert(A) → cdt2 → DOT — cdt2.extract(A); write(A).] The first step in the process of turning the post-processing application into a real-time application is to build a skeleton DIT-DAT-DOT. The input and output of the DAT will be read from and written to disk by the DIT and DOT, respectively. The input and output data should match those of the post-processing application. The DAT does no actual processing at this point; the goal is to establish the data flow. Tasks and Conduits separate I/O from computation. * DIT – Data Input Task, DAT – Data Analysis Task, DOT – Data Output Task

25 Real-Time Processing Software Step 2: Integrate application code into DAT
Replace disk I/O with the conduit reader and writer. The input and output of the DAT should match the input and output of the application. Replace the skeleton DAT with the application code. [Pipeline: DIT — read(S); cdt1.insert(S) → cdt1 → DAT — the post-processing loop (gaussianPyramid, projectiveTransform, opticFlow) with its disk reads and writes replaced by cdt1 and cdt2 → cdt2 → DOT — cdt2.extract(D); write(D).] Next, the skeleton DAT is replaced with the post-processing application. The disk I/O in the post-processing application is replaced with the input and output conduits. The functionality is the same as in the original post-processing application, but the DAT is now isolated from the I/O. Tasks and Conduits make it easy to change components.
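A sketch of what the DAT's run() method might look like after this step, reusing the Reader/Writer pattern from the earlier pipeline example. This is an illustrative sketch rather than the actual flight code: gaussianPyramid, projectiveTransform and opticFlow are the routines from the post-processing pseudocode, and the warp coefficients C and the loop bounds nLevels and nIters are assumed to be members of the task.

void Dat::run()
{
    // Pull the source imagery S from the input conduit
    Matrix <double>& S = m_Reader.getData();
    // Get a handle to the output buffer D in the output conduit
    Matrix <double>& D = m_Writer.getData();

    gaussianPyramid(S);
    for (int level = 0; level < nLevels; ++level) {
        for (int iter = 0; iter < nIters; ++iter) {
            D = projectiveTransform(S, C);   // warp with the current coefficients
            C = opticFlow(S, D);             // refine the coefficients from the result
        }
    }

    m_Reader.releaseData();
    m_Writer.releaseData();   // publishes D to the output conduit
}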

26 Real-Time Processing Software Step 3: Replace disk with camera
The input and output of the DAT should match the input and output of the application. Replace the disk I/O in the DIT with bus I/O that retrieves data from the camera. [Pipeline: DIT — get(S) from the camera; cdt1.insert(S) → cdt1 → DAT — post-processing loop (gaussianPyramid, projectiveTransform, opticFlow) → cdt2 → DOT — cdt2.extract(D); put(D) to disk.] Finally, the disk I/O in the DIT is replaced with images captured directly from the camera instead of recorded data stored on disk. The camera I/O is performed on the AMD motherboard while the processing occurs on the Cells.

27 Performance
[Camera: 4 apertures, 44 imagers per Cell.] Registration time per quadrant (real-time target: 500 ms):

  # imagers per Cell   w/o Tasks & Conduits   w/ Tasks & Conduits*
  All imagers          1188 ms                1246 ms
  1/2 of imagers        594 ms                 623 ms
  1/4 of imagers        297 ms                 311 ms

The camera for our real-time system is comprised of 4 apertures, each with 44 imagers. Each quadrant of the image is registered by a Cell processor. Registration can be performed using varying numbers of imagers in a quadrant. The table shows the measured time for registration on all imagers, 1/2 of the imagers and 1/4 of the imagers for a quadrant. Times are shown for the application with and without Tasks and Conduits; in each case the overhead is roughly 5% (e.g. 1246 ms vs. 1188 ms). Tasks and Conduits incur little overhead. * Double-buffered

28 Performance vs. Effort Benefits of Tasks & Conduits
Before: runs on 1 Cell processor, reads from disk, non real-time. After (a 2-3% increase in code): runs on the integrated system, reads from disk or camera, real-time. Benefits of Tasks & Conduits: isolates I/O code from computation code; can switch between disk I/O and camera I/O; can create test jigs for computation code; I/O and computation run concurrently; can move I/O and computation to different processors; can add multibuffering. The addition of Tasks & Conduits increases the total number of software lines of code in registration by only 2-3%. This small increase in code size delivers significant benefits, including the ability to create modules that separate I/O and computation code. This allows us to change the I/O without having to change the computation code, and also enables us to create test jigs for the computation code. Tasks & Conduits also allow us to easily run I/O and computation concurrently. Currently, I/O and computation tasks run in separate threads, but they could run on separate processors simply by changing the Task Map. Finally, multibuffering can be scaled by simply changing a parameter in the Conduit.

29 Outline Background Tasks & Conduits Hierarchical Maps & Arrays Results
Summary Outline slide 29

30 Future (Co-)Processor Trends
Multicore: IBM PowerXCell 8i — 9 cores (1 PPE + 8 SPEs), 204.8 GFLOPS single precision, 102.4 GFLOPS double precision, 92 W peak (est.); Tilera TILE64 — 64 cores, 443 GOPS, 15 – MHz. FPGAs: Xilinx Virtex-5 — up to 330,000 logic cells, 580 GMACS using DSP slices, PPC 440 processor block; Curtiss-Wright CHAMP-FX2 — VPX-REDI, 2 Xilinx Virtex-5 FPGAs, dual-core PPC 8641D. GPUs: NVIDIA Tesla C1060 — PCI-E x16, ~1 TFLOPS single precision, 225 W peak, 160 W typical; ATI FireStream 9250 — ~200 GFLOPS double precision, 150 W. This slide shows various technologies that can be used in future heterogeneous, co-processor architectures. * Information obtained from manufacturers' websites

31 Summary Modern DoD sensors have tight SWaP constraints
Multicore processors help achieve performance requirements within these constraints. Multicore architectures are extremely difficult to program: they fundamentally change the way programmers have to think. PVTOL provides a simple means to program multicore processors. Refactored a post-processing application for real-time operation using Tasks & Conduits: no performance impact, and the real-time application is modular and scalable. We are actively developing PVTOL for Intel and Cell, plan to expand to other technologies (e.g. FPGAs, automated mapping), and will propose PVTOL to HPEC-SI for standardization. Summary slide

32 Acknowledgements Persistent Surveillance Team Bill Ross Herb DaSilva
Peter Boettcher Chris Bowen Cindy Fang Imran Khan Fred Knight Gary Long Bobby Ren PVTOL Team Bob Bond Nadya Bliss Karen Eng Jeremiah Gale James Geraci Ryan Haney Jeremy Kepner Sanjeev Mohindra Sharon Sacco Eddie Rutledge Acknowledgements




