Slide-1 HPEC-SI MITRE AFRL MIT Lincoln Laboratory High Performance Embedded Computing Software Initiative (HPEC-SI), www.hpec-si.org. Dr. Jeremy Kepner, MIT Lincoln Laboratory. This work is sponsored by the High Performance Computing Modernization Office under Air Force Contract F C. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Slide-2 MITRE AFRL Lincoln Outline: Introduction (DoD Need, Program Structure), Software Standards, Parallel VSIPL++, Future Challenges, Summary

Slide-3 MITRE AFRL Lincoln Overview - High Performance Embedded Computing (HPEC) Initiative. HPEC Software Initiative programs: Demonstration, Development, Applied Research (DARPA). Example ASARS-2 systems: the Enhanced Tactical Radar Correlator (ETRAC) on a shared memory server and the Common Imagery Processor (CIP) on an embedded multi-processor. Challenge: Transition advanced software technology and practices into major defense acquisition programs.

Slide-4 MITRE AFRL Lincoln Why Is DoD Concerned with Embedded Software? (Chart: estimated DoD expenditures for embedded signal and image processing hardware and software, $B. Source: “HPEC Market Study,” March 2001.) COTS acquisition practices have shifted the burden from “point design” hardware to “point design” software. Software costs for embedded systems could be reduced by one-third with improved programming models, methodologies, and standards.

Slide-5 MITRE AFRL Lincoln Issues with Current HPEC Development: Inadequacy of Software Practices & Standards. Today, embedded software is not portable, not scalable, difficult to develop, and expensive to maintain. (Figure: system development/acquisition stages over roughly 4 years of program milestones - system technology development, system field demonstration, engineering/manufacturing development, insertion to military asset - against signal processor evolution from 1st through 6th generation, across platforms such as NSSN, AEGIS, Rivet Joint, Standard Missile, Predator, Global Hawk, U-2, JSTARS, MSAT-Air, P-3/APS-137, F-16, and the MK-48 Torpedo.) High Performance Embedded Computing is pervasive through DoD applications: an airborne radar insertion program required an 85% software rewrite for each hardware platform; a missile common processor has processor board costs < $100k but software development costs > $100M; a torpedo upgrade required two software re-writes after changes in the hardware design.

Slide-6 MITRE AFRL Lincoln Evolution of Software Support Towards “Write Once, Run Anywhere/Anysize.” (Figure: timeline from 1990 of DoD software development versus COTS development, showing application, middleware, vendor software, and embedded software standards layers.) Application software has traditionally been tied to the hardware. Many acquisition programs are developing stove-piped middleware “standards.” Open software standards can provide portability, performance, and productivity benefits and support “Write Once, Run Anywhere/Anysize.”

Slide-7 MITRE AFRL Lincoln Overall Initiative Goals & Impact. Targets: Performance (1.5x), Portability (3x), Productivity (3x). HPEC Software Initiative cycle: Demonstrate, Develop, Prototype - object oriented, open standards, interoperable & scalable. Metrics - Portability: reduction in the lines of code that must change to port/scale to a new system; Productivity: reduction in overall lines of code; Performance: computation and communication benchmarks. Program Goals: develop and integrate software technologies for embedded parallel systems to address portability, productivity, and performance; engage the acquisition community to promote technology insertion; deliver quantifiable benefits.

Slide-8 MITRE AFRL Lincoln HPEC-SI Path to Success. Benefit to DoD Programs: reduces software cost & schedule, enables rapid COTS insertion, improves cross-program interoperability, basis for improved capabilities. Benefit to DoD Contractors: reduces software complexity & risk, easier comparisons/more competition, increased functionality. Benefit to Embedded Vendors: lower software barrier to entry, reduced software maintenance costs, evolution of open standards. The HPEC Software Initiative builds on proven technology, business models, and better software practices.

Slide-9 MITRE AFRL Lincoln Organization. Government Lead: Dr. Rich Linderman, AFRL. Executive Committee: Dr. Charles Holland, PADUSD(S+T); RADM Paul Sullivan, N77. Technical Advisory Board: Dr. Rich Linderman, AFRL; Dr. Richard Games, MITRE; Mr. John Grosh, OSD; Mr. Bob Graybill, DARPA/ITO; Dr. Keith Bromley, SPAWAR; Dr. Mark Richards, GTRI; Dr. Jeremy Kepner, MIT/LL. Demonstration: Dr. Keith Bromley, SPAWAR; Dr. Richard Games, MITRE; Dr. Jeremy Kepner, MIT/LL; Mr. Brian Sroka, MITRE; Mr. Ron Williams, MITRE; ... Development: Dr. James Lebak, MIT/LL; Dr. Mark Richards, GTRI; Mr. Dan Campbell, GTRI; Mr. Ken Cain, MERCURY; Mr. Randy Judd, SPAWAR; ... Applied Research: Mr. Bob Bond, MIT/LL; Mr. Ken Flowers, MERCURY; Dr. Spaanenburg, PENTUM; Mr. Dennis Cottel, SPAWAR; Capt. Bergmann, AFRL; Dr. Tony Skjellum, MPISoft; ... Advanced Research: Mr. Bob Graybill, DARPA. Partnership with ODUSD(S&T), Government Labs, FFRDCs, Universities, Contractors, Vendors, and DoD programs; over 100 participants from over 20 organizations.

Slide-10 MITRE AFRL Lincoln Outline: Introduction, Software Standards (Standards Overview, Future Standards), Parallel VSIPL++, Future Challenges, Summary

Slide-11 MITRE AFRL Lincoln Emergence of Component Standards. (Figure: a parallel embedded processor, nodes P0-P3 with a node controller, connected to a system controller, consoles, and other computers.) Control communication: CORBA, HP-CORBA. Data communication: MPI, MPI/RT, DRI. Computation: VSIPL, VSIPL++, ||VSIPL++. Definitions: VSIPL = Vector, Signal, and Image Processing Library; ||VSIPL++ = Parallel Object Oriented VSIPL; MPI = Message Passing Interface; MPI/RT = MPI Real-Time; DRI = Data Reorganization Interface; CORBA = Common Object Request Broker Architecture; HP-CORBA = High Performance CORBA. The HPEC Initiative builds on completed research and existing standards and libraries.
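
To make the layering concrete, below is a minimal sketch of the kind of point-to-point transfer the data-communication layer (MPI) provides; the buffer size and tag are arbitrary illustration values, not from the slide.

    #include <mpi.h>
    #include <vector>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        std::vector<float> samples(1024);          // one block of sensor data
        if (rank == 0) {                           // producer node
            MPI_Send(samples.data(), (int)samples.size(), MPI_FLOAT,
                     /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
        } else if (rank == 1) {                    // consumer node
            MPI_Recv(samples.data(), (int)samples.size(), MPI_FLOAT,
                     /*source=*/0, /*tag=*/0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank 1 received %zu samples\n", samples.size());
        }
        MPI_Finalize();
        return 0;
    }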

Slide-12 MITRE AFRL Lincoln The Path to Parallel VSIPL++. Phase 1 - Demonstration: Existing Standards (VSIPL, MPI); Development: Object-Oriented Standards; Applied Research: Unified Comp/Comm Lib. Goals: demonstrate insertions into fielded systems (e.g., CIP) and demonstrate 3x portability. Phase 2 - Demonstration: Object-Oriented Standards (VSIPL++); Development: Unified Comp/Comm Lib; Applied Research: Fault tolerance. Goals: high-level code abstraction, reduce code size 3x. Phase 3 - Demonstration: Unified Comp/Comm Lib (Parallel VSIPL++, the world's first parallel object oriented standard); Development: Fault tolerance; Applied Research: Self-optimization. Goals: unified embedded computation/communication standard, demonstrate scalability. Functionality grows over time from VSIPL and MPI through a VSIPL++ prototype to a Parallel VSIPL++ prototype. Status: first demo successfully completed; VSIPL++ v0.5 spec completed; VSIPL++ v0.1 code available; Parallel VSIPL++ spec in progress; high performance C++ demonstrated.

Slide-13 MITRE AFRL Lincoln Working Group Technical Scope. Development (VSIPL++): mapping (data parallelism), early binding (computations), compatibility (backward/forward), local knowledge (accessing local data), extensibility (adding new functions), remote procedure calls (CORBA), C++ compiler support, test suite, adoption incentives (vendor, integrator). Applied Research (Parallel VSIPL++): mapping (task/pipeline parallel), reconfiguration (for fault tolerance), threads, reliability/availability, data permutation (DRI functionality), tools (profiles, timers, ...), quality of service.

Slide-14 MITRE AFRL Lincoln Overall Technical Tasks and Schedule (FY01 through FY08; near, mid, and long term). Task rows: VSIPL (Vector, Signal, and Image Processing Library); MPI (Message Passing Interface); VSIPL++ (object oriented) with v0.1 spec, v0.1 code, v0.5 spec & code, and v1.0 spec & code milestones; Parallel VSIPL++ with v0.1 spec, v0.1 code, v0.5 spec & code, and v1.0 spec & code milestones; and fault tolerant / self-optimizing software. Each task progresses from applied research through development to demonstration, with the CIP demo and Demos 2 through 6 spread across the schedule.

Slide-15 MITRE AFRL Lincoln HPEC-SI Goals: 1st Demo Achievements. Performance: goal 1.5x, achieved 2x. Portability: goal 3x, achieved 10x+. Productivity: goal 3x, achieved 6x*. Demo metrics - Portability: zero code changes required; Productivity: DRI code 6x smaller vs MPI (est*); Performance: 3x reduced cost or form factor. HPEC Software Initiative cycle: Demonstrate, Develop, Prototype - object oriented, open standards, interoperable & scalable.

Slide-16 MITRE AFRL Lincoln Outline: Introduction, Software Standards, Parallel VSIPL++ (Technical Basis, Examples), Future Challenges, Summary

Slide-17 MITRE AFRL Lincoln Parallel Pipeline: mapping a signal processing algorithm onto a parallel computer. Stages: Filter, X_out = FIR(X_in); Beamform, X_out = w * X_in; Detect, X_out = |X_in| > c. Data parallel within stages; task/pipeline parallel across stages.
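
As a concrete reference, here is a plain serial sketch of the three stages (illustrative only; the tap values, weight w, and threshold c are made-up parameters). Data parallelism would split each loop across processors, and task/pipeline parallelism would assign each function to its own set of processors.

    #include <vector>
    #include <complex>

    using cvec = std::vector<std::complex<float>>;

    // Filter stage: X_out = FIR(X_in)
    cvec fir(const cvec& x, const cvec& taps) {
        cvec y(x.size());
        for (std::size_t n = 0; n < x.size(); ++n)
            for (std::size_t k = 0; k < taps.size() && k <= n; ++k)
                y[n] += taps[k] * x[n - k];
        return y;
    }

    // Beamform stage: X_out = w * X_in
    cvec beamform(const cvec& x, std::complex<float> w) {
        cvec y(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) y[i] = w * x[i];
        return y;
    }

    // Detect stage: X_out = |X_in| > c
    std::vector<bool> detect(const cvec& x, float c) {
        std::vector<bool> y(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) y[i] = std::abs(x[i]) > c;
        return y;
    }

    int main() {
        cvec x(4096, {1.0f, 0.0f}), taps = {{0.5f, 0.0f}, {0.5f, 0.0f}};
        auto detections = detect(beamform(fir(x, taps), {2.0f, 0.0f}), 0.5f);
        return detections.empty();
    }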

Slide-18 MITRE AFRL Lincoln Types of Parallelism. (Figure: input feeds FIR filters, which feed beamformers 1 and 2, which feed detectors 1 and 2, under a scheduler.) The pipeline illustrates task parallelism, pipeline parallelism, round-robin scheduling, and data parallelism.

Slide-19 MITRE AFRL Lincoln Current Approach to Parallel Code: the algorithm and the mapping are combined in the code. Two-processor-per-stage mapping (stage 1 on processors 1-2, stage 2 on processors 3-4):
    while (!done) {
      if (rank()==1 || rank()==2) stage1();
      else if (rank()==3 || rank()==4) stage2();
    }
Adding processors 5 and 6 to stage 2 requires editing the code:
    while (!done) {
      if (rank()==1 || rank()==2) stage1();
      else if (rank()==3 || rank()==4 || rank()==5 || rank()==6) stage2();
    }
Algorithm and hardware mapping are linked; the resulting code is non-scalable and non-portable.

Slide-20 MITRE AFRL Lincoln Scalable Approach (Lincoln Parallel Vector Library, PVL). The same A = B + C code runs under a single-processor mapping or a multi-processor mapping:
    #include
    void addVectors(aMap, bMap, cMap) {
      Vector< Complex<Float> > a('a', aMap, LENGTH);
      Vector< Complex<Float> > b('b', bMap, LENGTH);
      Vector< Complex<Float> > c('c', cMap, LENGTH);
      b = 1;  c = 2;
      a = b + c;
    }
Single-processor and multi-processor code are the same; maps can be changed without changing the software; high-level code is compact.
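
The key idea is that the map, not the computation, carries the parallelism. A self-contained sketch of that idea follows; the Map class and block-distribution rule are illustrative assumptions, not the real PVL API.

    #include <algorithm>
    #include <cstdio>
    #include <utility>
    #include <vector>

    struct Map {                      // block distribution over nProcs processors
        int nProcs;
        explicit Map(int p) : nProcs(p) {}
        // half-open [begin, end) range of global indices owned by this rank
        std::pair<int, int> localRange(int rank, int length) const {
            int chunk = (length + nProcs - 1) / nProcs;
            int b = std::min(length, rank * chunk);
            int e = std::min(length, b + chunk);
            return {b, e};
        }
    };

    // a = b + c, computed only over the indices this rank owns under the map
    void addVectors(const Map& m, int rank, int length,
                    std::vector<double>& a,
                    const std::vector<double>& b,
                    const std::vector<double>& c) {
        std::pair<int, int> r = m.localRange(rank, length);
        for (int i = r.first; i < r.second; ++i) a[i] = b[i] + c[i];
    }

    int main() {
        const int LENGTH = 8;
        std::vector<double> a(LENGTH), b(LENGTH, 1.0), c(LENGTH, 2.0);
        addVectors(Map(1), /*rank=*/0, LENGTH, a, b, c);  // single-processor map
        addVectors(Map(4), /*rank=*/2, LENGTH, a, b, c);  // same code, 4-way map
        std::printf("a[4] = %g\n", a[4]);
        return 0;
    }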

Slide-21 MITRE AFRL Lincoln C++ Expression Templates and PETE. The expression A = B + C * D produces the expression type BinaryNode<OpAssign, Vector, BinaryNode<OpAdd, Vector, BinaryNode<OpMultiply, Vector, Vector>>>. (Figure: parse tree for A = B + C.) Steps: 1. Pass B and C references to operator+. 2. Create the expression parse tree. 3. Return the expression parse tree. 4. Pass the expression tree reference to operator=. 5. Calculate the result and perform the assignment. Parse trees, not vectors, are created; expression templates enhance performance by allowing temporary variables to be avoided.
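
Below is a minimal, self-contained expression-template sketch in the same spirit (it is not PETE itself; Vec, Add, Mul, and BinExpr are illustrative names). Assignment walks the parse tree once per element, so no temporary vectors are created.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct Add { static double apply(double a, double b) { return a + b; } };
    struct Mul { static double apply(double a, double b) { return a * b; } };

    // One node of the compile-time parse tree.
    template <class Op, class L, class R>
    struct BinExpr {
        const L& l;
        const R& r;
        BinExpr(const L& l, const R& r) : l(l), r(r) {}
        double operator[](std::size_t i) const { return Op::apply(l[i], r[i]); }
    };

    struct Vec {
        std::vector<double> data;
        explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
        double operator[](std::size_t i) const { return data[i]; }
        // Assignment walks the parse tree element by element: no temporaries.
        template <class E>
        Vec& operator=(const E& e) {
            for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
            return *this;
        }
    };

    template <class L, class R>
    BinExpr<Add, L, R> operator+(const L& l, const R& r) { return BinExpr<Add, L, R>(l, r); }

    template <class L, class R>
    BinExpr<Mul, L, R> operator*(const L& l, const R& r) { return BinExpr<Mul, L, R>(l, r); }

    int main() {
        Vec A(8), B(8, 1.0), C(8, 2.0), D(8, 3.0);
        A = B + C * D;   // builds BinExpr<Add, Vec, BinExpr<Mul, Vec, Vec> > at compile time
        std::printf("A[0] = %g\n", A[0]);   // 1 + 2*3 = 7
        return 0;
    }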

Slide-22 MITRE AFRL Lincoln PETE Linux Cluster Experiments. (Plots: relative execution time versus vector length for A=B+C, A=B+C*D, and A=B+C*D/E+fft(F).) PVL with VSIPL has a small overhead; PVL with PETE can surpass VSIPL.

Slide-23 MITRE AFRL Lincoln PowerPC AltiVec Experiments (software technology comparison for A=B+C, A=B+C*D, A=B+C*D+E*F, A=B+C*D+E/F). AltiVec loop: C for loop with direct use of AltiVec extensions; assumes unit stride and vector alignment. VSIPL (vendor optimized): C with AltiVec-aware VSIPro Core Lite; no multiply-add; cannot assume unit stride or vector alignment. PETE with AltiVec: C++ PETE operators with indirect use of AltiVec extensions; assumes unit stride and vector alignment. Results: the hand-coded loop achieves good performance but is problem-specific and low level; optimized VSIPL performs well for simple expressions, worse for more complex expressions; PETE-style array operators perform almost as well as the hand-coded loop and are general, can be composed, and are high-level.
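
For reference, a hand-coded AltiVec inner loop for A = B + C*D might look like the sketch below (an assumption-laden illustration, not code from the slide): it relies on vec_madd and, as the slide notes, assumes unit stride and 16-byte alignment, and here also a length that is a multiple of 4.

    #include <altivec.h>

    // A = B + C*D, four floats at a time (PowerPC AltiVec).
    void add_mul_altivec(float* A, const float* B, const float* C,
                         const float* D, int n) {
        for (int i = 0; i < n; i += 4) {
            vector float b = vec_ld(0, B + i);
            vector float c = vec_ld(0, C + i);
            vector float d = vec_ld(0, D + i);
            vec_st(vec_madd(c, d, b), 0, A + i);   // c*d + b
        }
    }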

Slide-24 MITRE AFRL Lincoln Outline: Introduction, Software Standards, Parallel VSIPL++ (Technical Basis, Examples), Future Challenges, Summary

Slide-25 MITRE AFRL Lincoln Expression-template code generation for A = sin(A) + 2 * B;
Generated code (no temporaries):
    for (index i = 0; i < A.size(); ++i)
      A.put(i, sin(A.get(i)) + 2 * B.get(i));
Apply inlining to transform to:
    for (index i = 0; i < A.size(); ++i)
      Ablock[i] = sin(Ablock[i]) + 2 * Bblock[i];
Apply more inlining to transform to:
    T* Bp = &(Bblock[0]);
    T* Aend = &(Ablock[A.size()]);
    for (T* Ap = &(Ablock[0]); Ap < Aend; ++Ap, ++Bp)
      *Ap = fmadd(2, *Bp, sin(*Ap));
Or apply PowerPC AltiVec extensions. Each step can be automatically generated; the optimization level is whatever the vendor desires.

Slide-26 MITRE AFRL Lincoln BLAS zherk Routine. BLAS = Basic Linear Algebra Subprograms. Hermitian matrix M: conjug(M) = M^t. zherk performs a rank-k update of Hermitian matrix C: C <- alpha * A * conjug(A)^t + beta * C.
VSIPL code:
    A = vsip_cmcreate_d(10,15,VSIP_ROW,MEM_NONE);
    C = vsip_cmcreate_d(10,10,VSIP_ROW,MEM_NONE);
    tmp = vsip_cmcreate_d(10,10,VSIP_ROW,MEM_NONE);
    vsip_cmprodh_d(A,A,tmp);       /* A*conjug(A)^t */
    vsip_rscmmul_d(alpha,tmp,tmp); /* alpha*A*conjug(A)^t */
    vsip_rscmmul_d(beta,C,C);      /* beta*C */
    vsip_cmadd_d(tmp,C,C);         /* alpha*A*conjug(A)^t + beta*C */
    vsip_cblockdestroy(vsip_cmdestroy_d(tmp));
    vsip_cblockdestroy(vsip_cmdestroy_d(C));
    vsip_cblockdestroy(vsip_cmdestroy_d(A));
VSIPL++ code (also parallel):
    Matrix<complex<double> > A(10,15);
    Matrix<complex<double> > C(10,10);
    C = alpha * prodh(A,A) + beta * C;
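
For readers unfamiliar with zherk, the update C <- alpha * A * conjug(A)^t + beta * C is just the following (deliberately unoptimized) triple loop; this reference sketch is added for illustration and is not part of the original slide.

    #include <complex>
    #include <vector>

    using cd = std::complex<double>;
    using cmat = std::vector<std::vector<cd>>;

    // C <- alpha * A * conjug(A)^t + beta * C; A is rows x cols, C is rows x rows.
    void zherk_reference(double alpha, const cmat& A, double beta, cmat& C) {
        const std::size_t rows = A.size(), cols = A[0].size();
        for (std::size_t i = 0; i < rows; ++i)
            for (std::size_t j = 0; j < rows; ++j) {
                cd acc(0.0, 0.0);
                for (std::size_t k = 0; k < cols; ++k)
                    acc += A[i][k] * std::conj(A[j][k]);   // (A * conjug(A)^t)(i,j)
                C[i][j] = alpha * acc + beta * C[i][j];
            }
    }

    int main() {
        cmat A(2, std::vector<cd>(3, cd(1.0, 1.0)));
        cmat C(2, std::vector<cd>(2, cd(0.0, 0.0)));
        zherk_reference(1.0, A, 0.0, C);   // C = A * conjug(A)^t
        return C[0][0].real() > 0 ? 0 : 1;
    }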

Slide-27 MITRE AFRL Lincoln Simple Filtering Application:
    int main ()
    {
      using namespace vsip;
      const length ROWS = 64;
      const length COLS = 4096;
      vsipl v;
      FFT<Matrix<complex<float> >, complex<float>, FORWARD, 0, MULTIPLE, alg_hint()>
        forward_fft (Domain<2>(ROWS, COLS), 1.0);
      FFT<Matrix<complex<float> >, complex<float>, INVERSE, 0, MULTIPLE, alg_hint()>
        inverse_fft (Domain<2>(ROWS, COLS), 1.0);
      const Matrix<complex<float> > weights (load_weights (ROWS, COLS));
      try {
        while (1)
          output (inverse_fft (forward_fft (input ()) * weights));
      } catch (std::runtime_error) {
        // Successfully caught access outside domain.
      }
    }

Slide-28 MITRE AFRL Lincoln Explicit Parallel Filter:
    #include
    using namespace VSIPL;
    const int ROWS = 64;
    const int COLS = 4096;
    int main (int argc, char **argv)
    {
      Matrix<Complex<Float> > W (ROWS, COLS, "WMap"); // weights matrix
      Matrix<Complex<Float> > X (ROWS, COLS, "WMap"); // input matrix
      load_weights (W);
      try {
        while (1) {
          input (X);                      // some input function
          Y = IFFT ( mul (FFT(X), W) );
          output (Y);                     // some output function
        }
      } catch (Exception &e) { cerr << e << endl; };
    }

Slide-29 MITRE AFRL Lincoln Multi-Stage Filter (main):
    using namespace vsip;
    const length ROWS = 64;
    const length COLS = 4096;
    int main (int argc, char **argv)
    {
      sample_low_pass_filter<complex<float> > LPF;
      sample_beamform<complex<float> > BF;
      sample_matched_filter<complex<float> > MF;
      try {
        while (1)
          output (MF(BF(LPF(input ()))));
      } catch (std::runtime_error) {
        // Successfully caught access outside domain.
      }
    }

Slide-30 MITRE AFRL Lincoln Multi-Stage Filter (low pass filter):
    template <typename T>
    class sample_low_pass_filter {
    public:
      sample_low_pass_filter()
        : FIR1_(load_w1 (W1_LENGTH), FIR1_LENGTH),
          FIR2_(load_w2 (W2_LENGTH), FIR2_LENGTH) { }
      Matrix<T> operator () (const Matrix<T>& Input) {
        Matrix<T> output(ROWS, COLS);
        for (index row = 0; row < ROWS; row++)
          output.row(row) = FIR2_(FIR1_(Input.row(row)).second).second;
        return output;
      }
    private:
      FIR<T> FIR1_;
      FIR<T> FIR2_;
    };

Slide-31 MITRE AFRL Lincoln Multi-Stage Filter (beam former):
    template <typename T>
    class sample_beamform {
    public:
      sample_beamform() : W3_(load_w3 (ROWS, COLS)) { }
      Matrix<T> operator () (const Matrix<T>& Input) const {
        return W3_ * Input;
      }
    private:
      const Matrix<T> W3_;
    };

Slide-32 MITRE AFRL Lincoln Multi-Stage Filter (matched filter):
    template <typename T>
    class sample_matched_filter {
    public:
      sample_matched_filter()
        : W4_(load_w4 (ROWS, COLS)),
          forward_fft_ (Domain<2>(ROWS, COLS), 1.0),
          inverse_fft_ (Domain<2>(ROWS, COLS), 1.0) {}
      Matrix<T> operator () (const Matrix<T>& Input) const {
        return inverse_fft_ (forward_fft_ (Input) * W4_);
      }
    private:
      const Matrix<T> W4_;
      FFT<Matrix<T>, complex<float>, complex<float>, FORWARD, 0, MULTIPLE, alg_hint()> forward_fft_;
      FFT<Matrix<T>, complex<float>, complex<float>, INVERSE, 0, MULTIPLE, alg_hint()> inverse_fft_;
    };

Slide-33 MITRE AFRL Lincoln Outline: Introduction, Software Standards, Parallel VSIPL++, Future Challenges (Fault Tolerance, Self Optimization, High Level Languages), Summary

Slide-34 MITRE AFRL Lincoln Dynamic Mapping for Fault Tolerance. (Figure: a parallel processor with a spare node and a failed node; the input task X_in can run under Map 0 (nodes 0,1) or Map 1 (nodes 0,2), and the output task X_out runs under Map 2 (nodes 1,3).) Switching processors is accomplished by switching maps; no change to the algorithm is required.
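
A toy sketch of the map-swapping idea follows (class names and the failure check are illustrative assumptions, not the real PVL/VSIPL++ API): when a node fails, the task is rebound to a spare map and re-run, and the algorithm itself never changes.

    #include <cstdio>
    #include <vector>

    struct Map { std::vector<int> nodes; };           // which processors a task uses

    struct Task {
        const Map* map;                               // current binding
        void run() const {
            std::printf("running on nodes:");
            for (int n : map->nodes) std::printf(" %d", n);
            std::printf("\n");
        }
    };

    int main() {
        Map map0{{0, 1}};                             // primary mapping (nodes 0,1)
        Map map1{{0, 2}};                             // spare mapping   (nodes 0,2)
        Task inputTask{&map0};
        inputTask.run();                              // normal operation
        bool node1Failed = true;                      // failure detected on node 1
        if (node1Failed) inputTask.map = &map1;       // switch maps: algorithm unchanged
        inputTask.run();                              // same task, new processors
        return 0;
    }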

Slide-35 MITRE AFRL Lincoln Dynamic Mapping Performance Results. (Plot: relative time versus data size.) Good dynamic mapping performance is possible.

Slide-36 MITRE AFRL Lincoln Optimal Mapping of Complex Algorithms. Application: input X_in flows through a low pass filter (FIR1 with weights W1, FIR2 with weights W2), a beamformer (multiply by W3), and a matched filter (FFT, multiply by W4, IFFT) to the output X_out. Hardware targets: workstation, embedded multi-computer, PowerPC cluster, embedded board, Intel cluster. Each application/hardware pairing has a different optimal map, so the process of mapping an algorithm to hardware needs to be automated.

Slide-37 MITRE AFRL Lincoln Self-optimizing Software for Signal Processing (S3P). Find Min(latency | #CPU) and Max(throughput | #CPU). S3P selects the correct optimal mapping, with excellent agreement between S3P predicted and achieved latencies and throughputs. (Plots: latency in seconds and throughput in frames/sec versus #CPU for small (48x4K) and large (48x128K) problem sizes.)
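
In outline, the optimization S3P performs over candidate mappings can be pictured as the search below (a naive exhaustive scan over made-up candidate numbers, purely for illustration; S3P's actual search strategy and performance models are not shown on the slide).

    #include <cstdio>
    #include <vector>

    struct Mapping {
        int cpus;            // processors this candidate mapping uses
        double latency;      // predicted seconds per frame
        double throughput;   // predicted frames per second
    };

    // Min(latency | #CPU <= budget); Max(throughput | #CPU) would be analogous.
    const Mapping* minLatency(const std::vector<Mapping>& maps, int cpuBudget) {
        const Mapping* best = nullptr;
        for (const Mapping& m : maps)
            if (m.cpus <= cpuBudget && (best == nullptr || m.latency < best->latency))
                best = &m;
        return best;
    }

    int main() {
        std::vector<Mapping> candidates = {
            {4, 0.90, 1.1}, {6, 0.55, 1.8}, {8, 0.40, 2.3}   // made-up numbers
        };
        if (const Mapping* m = minLatency(candidates, 6))
            std::printf("best mapping: %d CPUs, latency %.2f s\n", m->cpus, m->latency);
        return 0;
    }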

Slide-38 MITRE AFRL Lincoln High Level Languages. (Figure: high performance Matlab applications - DoD sensor processing, DoD mission planning, scientific simulation, commercial applications - sit on a Parallel Matlab Toolbox, which provides the user interface and hardware interface to parallel computing hardware.) The parallel Matlab need has been identified: HPCMO (OSU). The required user interface has been demonstrated: Matlab*P (MIT/LCS), PVL (MIT/LL). The required hardware interface has been demonstrated: MatlabMPI (MIT/LL). A Parallel Matlab Toolbox can now be realized.

Slide-39 MITRE AFRL Lincoln MatlabMPI deployment (speedup). Maui: image filtering benchmark (300x on 304 CPUs). Lincoln: signal processing (7.8x on 8 CPUs), radar simulations (7.5x on 8 CPUs), hyperspectral (2.9x on 3 CPUs). MIT: LCS Beowulf (11x Gflops on 9 duals), AI Lab face recognition (10x on 8 duals). Other: Ohio St. EM simulations, ARL SAR image enhancement, Wash U hearing aid simulations, So. Ill. benchmarking, JHU digital beamforming, ISL radar simulation, URI heart modeling. The rapidly growing MatlabMPI user base demonstrates the need for parallel Matlab; scaling to 300 processors has been demonstrated. (Plot: performance in Gigaflops versus number of processors for image filtering on the IBM SP at the Maui Computing Center.)

Slide-40 MITRE AFRL Lincoln Summary. HPEC-SI expected benefit: open software libraries, programming models, and standards that provide portability (3x), productivity (3x), and performance (1.5x) benefits to multiple DoD programs. Invitation to participate: DoD program offices with signal/image processing needs, and academic and government researchers interested in high performance embedded computing. Contact:

Slide-41 MITRE AFRL Lincoln The Links: High Performance Embedded Computing Workshop; High Performance Embedded Computing Software Initiative; Vector, Signal, and Image Processing Library; MPI Software Technologies, Inc.; Data Reorganization Initiative; CodeSourcery, LLC; MatlabMPI.