1
High Performance Embedded Computing Software Initiative (HPEC-SI)
Dr. Jeremy Kepner / Lincoln Laboratory. This work is sponsored by the Department of Defense under Air Force Contract F C. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government. Title Slide: Technical and program review of HPEC-SI effort
2
Outline Introduction Demonstration Development Applied Research
Goals Program Structure Introduction Demonstration Development Applied Research Future Challenges Summary Outline Slide: Introduction to program goals and program structure
3
Overview - High Performance Embedded Computing (HPEC) Initiative
Figure: Common Imagery Processor (CIP) demonstration, with the HPEC Software Initiative programs (Applied Research, Development, Demonstration; DARPA), embedded multi-processor, shared memory server, ASARS-2, and Enhanced Tactical Radar Correlator (ETRAC). The core activity is transition of advanced software research into DoD programs. Challenge: transition advanced software technology and practices into major defense acquisition programs.
4
Why Is DoD Concerned with Embedded Software?
Chart: estimated DoD expenditures for embedded signal and image processing hardware and software ($B). Source: “HPEC Market Study,” March 2001. DoD HPEC SIP hardware and software costs are increasing. COTS acquisition practices have shifted the burden from “point design” hardware to “point design” software. Software costs for embedded systems could be reduced by one-third with improved programming models, methodologies, and standards.
5
System Development/Acquisition Stages
Issues with current HPEC development: inadequate software practices and standards. High Performance Embedded Computing is pervasive throughout DoD applications (NSSN, AEGIS, Rivet Joint, Standard Missile, Predator, Global Hawk, U-2, JSTARS, MSAT-Air, P-3/APS-137, F-16, MK-48 Torpedo). Airborne radar insertion program: 85% software rewrite for each hardware platform. Missile common processor: processor board costs < $100K, software development costs > $100M. Torpedo upgrade: two software rewrites required after changes in hardware design. System development/acquisition stages (4 years): program milestones of system technology development, system field demonstration, engineering/manufacturing development, and insertion to military asset, against continual signal processor evolution (1st gen. through 6th gen.); the DoD timescale is 10 years, while the COTS timescale is 18 months. Today, embedded software is not portable, not scalable, difficult to develop, and expensive to maintain.
6
Evolution of Software Support Towards “Write Once, Run Anywhere/Anysize”
DoD software development vs. COTS development: application software has traditionally been tied to the vendor software and hardware, and many acquisition programs are developing stove-piped middleware “standards.” The natural evolution of software (1990 to 2000 to 2005) is toward portable middleware and scalable applications built on embedded standards, supporting “Write Once, Run Anywhere/Anysize.” Open software standards can provide portability, performance, and productivity benefits.
7
Quantitative Goals & Impact
Program goals: develop and integrate software technologies for embedded parallel systems to address portability, productivity, and performance; engage the acquisition community to promote technology insertion; deliver quantifiable benefits in Performance (1.5x), Portability (3x), and Productivity (3x). (HPEC Software Initiative: Demonstrate, Develop, Prototype; Object Oriented; Open Standards; Interoperable & Scalable.) The goal of this initiative is to improve in a quantifiable manner the practice of developing software for embedded signal and image processing. Past and current efforts have demonstrated that the use of standards should provide at least a 3x improvement in software portability. Using object-oriented programming languages like C++ typically provides a 3x reduction in code size, and current efforts indicate that the signal and image processing domain should expect the same level of code compaction seen in other domains. Past and current efforts have also demonstrated that the performance overhead of object-oriented languages can be eliminated, and can even yield performance improvements as high as 1.5x. Portability: reduction in lines-of-code required to port/scale to a new system. Productivity: reduction in overall lines-of-code. Performance: computation and communication benchmarks.
8
Organization. Government Lead: Dr. Rich Linderman, AFRL. Executive Committee: Dr. Charles Holland, PADUSD(S+T), ... Technical Advisory Board: Mr. John Grosh, OSD; Mr. Bob Graybill, DARPA/ITO; Dr. Mark Richards, GTRI; Dr. Jeremy Kepner, MIT/LL. Demonstration: Dr. Keith Bromley, SPAWAR; Dr. Richard Games, MITRE; Dr. Jeremy Kepner, MIT/LL; Mr. Brian Sroka, MITRE; Mr. Ron Williams, MITRE; ... Development: Dr. James Lebak, MIT/LL; Mr. Dan Campbell, GTRI; Mr. Ken Cain, Mercury; Mr. Randy Judd, SPAWAR. Applied Research: Mr. Bob Bond, MIT/LL; Mr. Ken Flowers, Mercury; Dr. Spaanenburg, PENTUM; Mr. Dennis Cottel, SPAWAR; Capt. Bergmann, AFRL; Dr. Tony Skjellum, MPI Software Technologies. Advanced Research: Mr. Bob Graybill, DARPA. The Embedded Software Initiative involves a wide range of participants, with many from DoD labs, FFRDCs, and universities; as the effort progresses, vendor and contractor representatives will be added to the working groups. Partnership with ODUSD(S&T), government labs, FFRDCs, universities, contractors, vendors, and DoD programs. Over 100 participants from over 20 organizations.
9
HPEC-SI Capability Phases
Milestones to date: first demo successfully completed; second demo selected; VSIPL++ v0.8 spec completed; VSIPL++ v0.2 code available; Parallel VSIPL++ v0.1 spec completed; high-performance C++ demonstrated. Phase 1: Applied Research on a unified computation/communication library; Development of object-oriented standards; Demonstration of existing standards (VSIPL, MPI), with the goals of demonstrating insertions into fielded systems (CIP) and demonstrating 3x portability. Phase 2: Applied Research on fault tolerance; Development of the unified computation/communication library (Parallel VSIPL++); Demonstration of object-oriented standards (VSIPL++), with the goals of high-level code abstraction (AEGIS) and reducing code size 3x. Phase 3: Applied Research on hybrid architectures; Development of a fault tolerance prototype; Demonstration of the unified computation/communication library prototype (Parallel VSIPL++), with the goals of a unified embedded computation/communication standard and demonstrated scalability. The embedded software initiative has three planned phases, each of which culminates in demonstrations. The demonstrations have specific numerical targets for portability, productivity, and scalability. For all demonstrations, compute and communication performance will also be measured, with the expectation of no loss in performance.
10
Outline Introduction Demonstration Development Applied Research
Future Challenges Summary Common Imagery Processor AEGIS BMD (planned) Outline Slide: Demonstrations in Common Imagery Processor and AEGIS BMD
11
Common Imagery Processor - Demonstration Overview -
The Common Imagery Processor (CIP) is a cross-service component used in a wide variety of DoD assets (TEG and TES, JSIPS-N and TIS, JSIPS and CARS, ETRAC). Sample list of CIP modes: U-2 (ASARS-2, SYERS), F/A-18 ATARS (EO/IR/APG-73), LO HAE UAV (EO, SAR). Figure: system manager and 38.5” CIP rack (*CIP picture courtesy of Northrop Grumman Corporation).
12
Common Imagery Processor - Demonstration Overview -
Goals: demonstrate standards-based, platform-independent CIP processing (ASARS-2); assess the performance of current COTS portability standards (MPI, VSIPL); validate the software development productivity of the emerging Data Reorganization Interface. Participants: MITRE and Northrop Grumman. The first demonstration of this initiative is to port one mode of the APG-73 image formation pipeline to an embedded multi-processor using open software standards. A single code base optimized for all high-performance architectures (embedded multicomputers, shared-memory servers, commodity clusters, massively parallel processors) provides future flexibility.
13
Embedded Multicomputers
Software ports. Embedded multicomputers: CSPI 500 MHz PPC7410 (vendor loan); Mercury 500 MHz PPC7410 (vendor loan); Sky 333 MHz PPC7400 (vendor loan); Sky 500 MHz PPC7410 (vendor loan). Mainstream servers: HP/COMPAQ ES40LP Alpha EV6 (CIP hardware); HP/COMPAQ ES Alpha EV6 (CIP hardware); SGI Origin R10k (CIP hardware); SGI Origin R12k (ARL MSRC); IBM 1.3 GHz POWER4 (ARL MSRC); generic Linux cluster. CIP software was successfully ported to a wide range of HPEC and HPC platforms.
14
Portability: SLOC Comparison
SLOC comparison: relative to the sequential VSIPL baseline, the shared-memory VSIPL version showed a ~1% increase and the DRI VSIPL version a ~5% increase. CIP software showed minimal code bloat.
15
Shared Memory / CIP Server versus Distributed Memory / Embedded Vendor
Chart annotations: latency requirement; shared-memory limit for this Alpha server. HPEC systems were able to meet the requirement with 8 CPUs. The application can now exploit many more processors, embedded processors (3x form factor advantage), and Linux clusters (3x cost advantage).
16
Form Factor Improvements
Current configuration vs. possible configuration (IOP and IFP1/IFP2 in 6U VME). The CIP could accommodate more CPUs or shrink its overall form factor. IOP: 6U VME chassis (9 slots potentially available). IFP: HP/COMPAQ ES40LP. The IOP could support 2 G4 IFPs, a 2x form factor reduction; the 6U VME chassis can support 5 G4 IFPs, a 2.5x processing capability increase.
17
HPEC-SI Goals 1st Demo Achievements
Portability: zero code changes required. Productivity: DRI code 6x smaller vs. MPI (est*). Performance: 2x reduced cost or form factor. Against the HPEC Software Initiative goals (Demonstrate, Develop, Prototype; Object Oriented; Open Standards; Interoperable & Scalable): Portability goal 3x, achieved; Productivity goal 3x, achieved*; Performance goal 1.5x, achieved. All quantitative goals were achieved or exceeded. Portability: reduction in lines-of-code required to port/scale to a new system. Productivity: reduction in overall lines-of-code. Performance: computation and communication benchmarks.
18
Outline Introduction Demonstration Development Applied Research
Future Challenges Summary Object Oriented (VSIPL++) Parallel (||VSIPL++) Outline Slides: Development of VSIPL++ and ||VSIPL++ standards
19
Emergence of Component Standards
Figure: parallel embedded processor (system controller, node controllers, processors P0–P3, consoles, other computers). Data communication: MPI, MPI/RT, DRI. Control communication: CORBA, HP-CORBA. Computation: VSIPL, VSIPL++, ||VSIPL++. HPEC-SI builds on a number of existing HPEC and mainstream standards. Definitions: VSIPL = Vector, Signal, and Image Processing Library; ||VSIPL++ = Parallel Object-Oriented VSIPL; MPI = Message Passing Interface; MPI/RT = MPI Real-Time; DRI = Data Reorganization Interface; CORBA = Common Object Request Broker Architecture; HP-CORBA = High Performance CORBA. The HPEC Initiative builds on completed research and existing standards and libraries.
20
VSIPL++ Productivity Examples
BLAS zherk routine (BLAS = Basic Linear Algebra Subprograms). A Hermitian matrix M satisfies conjug(M) = M^t. zherk performs a rank-k update of a Hermitian matrix C: C <- a*A*conjug(A)^t + b*C.

VSIPL code:
A = vsip_cmcreate_d(10,15,VSIP_ROW,MEM_NONE);
C = vsip_cmcreate_d(10,10,VSIP_ROW,MEM_NONE);
tmp = vsip_cmcreate_d(10,10,VSIP_ROW,MEM_NONE);
vsip_cmprodh_d(A,A,tmp);       /* A*conjug(A)t */
vsip_rscmmul_d(alpha,tmp,tmp); /* a*A*conjug(A)t */
vsip_rscmmul_d(beta,C,C);      /* b*C */
vsip_cmadd_d(tmp,C,C);         /* a*A*conjug(A)t + b*C */
vsip_cblockdestroy(vsip_cmdestroy_d(tmp));
vsip_cblockdestroy(vsip_cmdestroy_d(C));
vsip_cblockdestroy(vsip_cmdestroy_d(A));

VSIPL++ code (also parallel):
Matrix<complex<double> > A(10,15);
Matrix<complex<double> > C(10,10);
C = alpha * prodh(A,A) + beta * C;

VSIPL++ code will be smaller and easier to understand. Sonar example: a K-W beamformer converted from C VSIPL to VSIPL++ used 2.5x fewer SLOCs.
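For reference, here is a minimal sketch in plain C++ (illustrative only, not part of the VSIPL or VSIPL++ APIs; the function name and row-major layout are assumptions for this example) of what the rank-k update above computes, written as explicit loops:

#include <complex>
#include <vector>

// Illustrative reference: C = a*A*conjug(A)^t + b*C, where A is rows x cols,
// C is rows x rows, and both are stored row-major.
void zherk_reference(const std::vector<std::complex<double> >& A,
                     std::vector<std::complex<double> >& C,
                     int rows, int cols, double a, double b)
{
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < rows; ++j) {
            std::complex<double> sum(0.0, 0.0);
            for (int k = 0; k < cols; ++k) {
                // (A*conjug(A)^t)(i,j) = sum over k of A(i,k)*conj(A(j,k))
                sum += A[i*cols + k] * std::conj(A[j*cols + k]);
            }
            C[i*rows + j] = a * sum + b * C[i*rows + j];
        }
    }
}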
21
PVL PowerPC AltiVec Experiments
Results: the hand-coded loop achieves good performance but is problem-specific and low level; optimized VSIPL performs well for simple expressions but worse for more complex ones; PETE-style array operators perform almost as well as the hand-coded loop while remaining general, composable, and high level. Benchmark expressions: A=B+C; A=B+C*D; A=B+C*D+E*F; A=B+C*D+E/F. Software technologies compared: AltiVec loop (C for loop; direct use of AltiVec extensions; assumes unit stride; assumes vector alignment); VSIPL, vendor optimized (C, AltiVec-aware VSIPro Core Lite; no multiply-add; cannot assume unit stride; cannot assume vector alignment); PETE with AltiVec (C++ PETE operators; indirect use of AltiVec extensions; assumes unit stride; assumes vector alignment). VSIPL++ code can also take better advantage of special hardware features.
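To make the PETE-style approach concrete, here is a minimal expression-template sketch in plain C++ (illustrative only; not the actual PETE or VSIPL++ code, and all type and variable names are invented for this example). The operators build lightweight expression objects, and the element loop is generated once at assignment, so a compound expression such as A = B + C*D evaluates in a single fused pass with no temporaries:

#include <cstddef>
#include <iostream>
#include <vector>

// Lazy element-wise expression nodes (illustrative expression templates).
template <class L, class R>
struct Add {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

template <class L, class R>
struct Mul {
    const L& l; const R& r;
    double operator[](std::size_t i) const { return l[i] * r[i]; }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n) : data(n) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }
    // Assignment runs a single fused loop over the whole expression tree.
    template <class Expr>
    Vec& operator=(const Expr& e) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

template <class L, class R> Add<L,R> operator+(const L& l, const R& r) { return {l, r}; }
template <class L, class R> Mul<L,R> operator*(const L& l, const R& r) { return {l, r}; }

int main() {
    Vec A(4), B(4), C(4), D(4);
    for (std::size_t i = 0; i < 4; ++i) { B[i] = 1; C[i] = 2; D[i] = 3; }
    A = B + C*D;                  // composed, high level, one pass over the data
    std::cout << A[0] << "\n";    // prints 7
}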
22
Parallel Pipeline Mapping
Signal processing algorithm: Filter (XOUT = FIR(XIN)), Beamform (XOUT = w*XIN), Detect (XOUT = |XIN| > c), mapped onto a parallel computer. Parallel mapping challenge: mapping is the process of deciding how an algorithm is assigned to elements of the hardware; in this example a three-stage pipeline is mapped to distinct sets of processors in a parallel computer. Preserving the software investment requires the ability to write software that will run efficiently on a variety of computing systems without having to be rewritten. The primary technical challenge in writing parallel software is how to assign different parts of the algorithm to processors in the parallel computing hardware. This process is referred to as “mapping” (see figure above) and is analogous to the “place and route” problem encountered by hardware designers. Thus, for a program to run on a variety of systems it must be map independent (i.e., the algorithm description does not depend upon the details of the parallel hardware). To achieve this level of hardware independence it is most efficient to write applications on top of a library built of hardware-independent pieces. The library, or middleware, we have developed to provide this capability is called the Parallel Vector Library (PVL). Data parallel within stages; task/pipeline parallel across stages. A notional sketch of a stage-to-processor map follows.
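The sketch below (plain C++, illustrative only; the StageMap type and stage names are invented and are not the PVL or ||VSIPL++ API) shows the idea behind a map: each pipeline stage is assigned a set of processor ranks, and the code asks the map whether the local processor participates in a stage rather than hard-coding processor assignments:

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Notional stage-to-processor map (not the actual PVL Map class).
struct StageMap {
    std::map<std::string, std::vector<int> > ranks;  // stage name -> processor ranks
    bool participates(const std::string& stage, int myRank) const {
        auto it = ranks.find(stage);
        if (it == ranks.end()) return false;
        for (int r : it->second) if (r == myRank) return true;
        return false;
    }
};

int main() {
    // Task/pipeline parallel across stages, data parallel within each stage:
    // each stage owns its own set of processors.
    StageMap smap{{{"filter", {0, 1}},
                   {"beamform", {2, 3, 4, 5}},
                   {"detect", {6, 7}}}};
    int myRank = 3;  // hypothetical; a real system would obtain this from the runtime (e.g., an MPI rank)
    for (const char* stage : {"filter", "beamform", "detect"}) {
        if (smap.participates(stage, myRank))
            std::cout << "rank " << myRank << " runs stage " << stage << "\n";
    }
}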
23
Scalable Approach Lincoln Parallel Vector Library (PVL)
Single-processor and multi-processor mappings of A = B + C use the same code:

#include <Vector.h>
#include <AddPvl.h>

void addVectors(aMap, bMap, cMap) {
  Vector< Complex<Float> > a('a', aMap, LENGTH);
  Vector< Complex<Float> > b('b', bMap, LENGTH);
  Vector< Complex<Float> > c('c', cMap, LENGTH);
  b = 1;
  c = 2;
  a = b + c;
}

The scalable approach separates the mapping from the algorithm: single-processor and multi-processor code are the same, maps can be changed without changing software, and high-level code is compact. Lincoln Parallel Vector Library (PVL).
24
Outline Introduction Demonstration Development Applied Research
Future Challenges Summary Fault Tolerance Parallel Specification Hybrid Architectures (see SBR) Outline Slide: Applied research being conducted on Fault Tolerance, Parallel Specification and Hybrid Architectures
25
Dynamic Mapping for Fault Tolerance
Figure: an input task and an output task mapped to node sets (e.g., Map0 on nodes 0,1; after a node failure, remapping to Map2 on nodes 1,3 using a spare in the parallel processor). Dynamic mapping for fault tolerance: the separation of mapping from the algorithm allows the application to adapt more easily to changing hardware by modifying the mapping at runtime. Remapping a matrix to spare processors allows the program to react to failed processors without requiring any fundamental change in the signal processing algorithm. The fundamental goal of ||VSIPL++ is to be able to write software that is independent of the underlying hardware, including when the hardware has to change because of a failure. By abstracting the hardware details, it is possible to implement a variety of fault tolerance strategies without modifying the core signal processing algorithm. For example, if a matrix was built on one set of processors, it is not difficult to destroy the matrix and rebuild it on another set of processors by simply providing the matrix with a new map (see figure above). Switching processors is accomplished by switching maps; no change to the algorithm is required. Requirements for ||VSIPL++ are being developed. A notional sketch of this remapping idea follows.
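A minimal sketch (plain C++, illustrative only; the DistMatrix and NodeMap names are invented and are not the PVL or ||VSIPL++ API) of rebuilding on a new node set: the distributed object owns its node assignment, and reacting to a failed node means handing it a new map that substitutes a spare, while the algorithm using the object is unchanged:

#include <algorithm>
#include <iostream>
#include <vector>

// Notional "map": the list of node IDs a distributed object lives on.
using NodeMap = std::vector<int>;

struct DistMatrix {
    NodeMap nodes;                                  // which nodes hold this matrix
    explicit DistMatrix(NodeMap m) : nodes(std::move(m)) {}
    // Rebuild on a new node set; a real library would also redistribute the data.
    void remap(const NodeMap& newMap) { nodes = newMap; }
};

// Produce a new map with the failed node replaced by a spare.
NodeMap substituteSpare(NodeMap m, int failedNode, int spareNode) {
    std::replace(m.begin(), m.end(), failedNode, spareNode);
    return m;
}

int main() {
    DistMatrix X(NodeMap{0, 1});     // originally mapped to nodes 0 and 1
    int failed = 1, spare = 3;       // hypothetical failure report from the system
    X.remap(substituteSpare(X.nodes, failed, spare));
    for (int n : X.nodes) std::cout << "node " << n << "\n";  // prints 0 then 3
}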
26
Parallel Specification
Clutter calculation (Linux cluster):

% Initialize
pMATLAB_Init;
Ncpus = comm_vars.comm_size;
% Map X to first half and Y to second half.
mapX = map([1 Ncpus/2],{},[1:Ncpus/2]);
mapY = map([Ncpus/2 1],{},[Ncpus/2+1:Ncpus]);
% Create arrays.
X = complex(rand(N,M,mapX),rand(N,M,mapX));
Y = complex(zeros(N,M,mapY));
% Initialize coefficients.
coefs = ...
weights = ...
% Parallel filter + corner turn.
Y(:,:) = conv2(coefs,X);
% Parallel matrix multiply.
Y(:,:) = weights*Y;
% Finalize pMATLAB and exit.
pMATLAB_Finalize;
exit;

Chart: parallel performance (speedup vs. number of processors). Parallel Matlab allows parallel specification using the same constructs as ||VSIPL++. Matlab is the main specification language for signal processing; pMatlab allows parallel specification using the same mapping constructs being developed for ||VSIPL++.
27
Outline Introduction Demonstration Development Applied Research
Future Challenges Summary Future Challenges to be addressed by HPEC-SI program
28
Optimal Mapping of Complex Algorithms
Application pipeline: Input (XIN), Low-Pass Filter (FIR1 with weights W1, FIR2 with weights W2), Beamform (multiply by weights W3), Matched Filter (FFT, multiply by W4, IFFT), producing XOUT at each stage. Different hardware (workstation, Intel cluster, PowerPC cluster, embedded multi-computer, embedded board) has different optimal maps. Optimal mapping challenge: real applications can easily involve several stages with potentially thousands of mapped vectors, matrices, and computations, and the maps that produce the best performance change depending on the hardware. Automated capabilities are necessary to generate maps and to determine which maps are best. PVL allows algorithm concerns to be separated from mapping concerns; however, the issue of determining the mapping of an application has not changed. As software is moved from one piece of hardware to another, the mapping will need to change (see figure above). The current approach is for humans with knowledge of both the algorithm and the parallel computer to use intuition and simple rules to estimate the best mapping for a particular application on particular hardware. This method is time consuming, labor intensive, and inaccurate, and does not in any way guarantee an optimal solution. Generating optimal mappings is an important problem because without this key piece of information it is not possible to write truly portable parallel software. To address this problem, we have developed a framework called Self-optimizing Software for Signal Processing (S3P). The process of mapping an algorithm to hardware needs to be automated; a notional sketch of the simplest form of automated map selection follows.
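The sketch below (plain C++, illustrative only; it is not the S3P framework, and the CandidateMap/pickBestMap names and the stand-in workload are invented for this example) shows what automated map selection amounts to in its simplest form: enumerate candidate maps, time the stage under each, and keep the fastest. S3P's actual search over multi-stage pipelines and thousands of mapped objects is considerably more sophisticated:

#include <chrono>
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Notional candidate map: a label plus the processor ranks it assigns to a stage.
struct CandidateMap {
    std::string name;
    std::vector<int> ranks;
};

// Time each candidate and return the one with the lowest measured runtime.
// 'runStage' stands in for executing the real mapped stage.
CandidateMap pickBestMap(const std::vector<CandidateMap>& candidates,
                         const std::function<void(const CandidateMap&)>& runStage) {
    CandidateMap best = candidates.front();
    double bestSeconds = 1e300;
    for (const CandidateMap& m : candidates) {
        auto t0 = std::chrono::steady_clock::now();
        runStage(m);  // in practice, run one or a few executions under this map
        auto t1 = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(t1 - t0).count();
        if (s < bestSeconds) { bestSeconds = s; best = m; }
    }
    return best;
}

int main() {
    std::vector<CandidateMap> candidates = {
        {"2 CPUs", {0, 1}},
        {"4 CPUs", {0, 1, 2, 3}},
        {"8 CPUs", {0, 1, 2, 3, 4, 5, 6, 7}}
    };
    // Hypothetical stand-in workload; real code would execute the mapped stage here.
    auto runStage = [](const CandidateMap& m) {
        volatile double x = 0;
        for (std::size_t i = 0; i < 1000000 / m.ranks.size(); ++i) x = x + 1.0;
    };
    std::cout << "best map: " << pickBestMap(candidates, runStage).name << "\n";
}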
29
HPEC-SI Future Challenges
Timeline through the end of the 5-year plan. Phase 3: Applied Research on hybrid architectures; Development of fault tolerance; Demonstration of the unified computation/communication library (Parallel VSIPL++), with the goals of a unified comp/comm standard and demonstrated scalability. Phase 4: Applied Research on PCA/self-optimization; Development of hybrid architectures (Hybrid VSIPL); Demonstration of fault tolerance (Fault Tolerant VSIPL), with the goals of demonstrated fault tolerance and increased reliability. Phase 5: Applied Research on higher languages (Java?); Development of a self-optimization prototype; Demonstration of a hybrid architectures prototype (Hybrid VSIPL), with the goals of portability across architectures and RISC/FPGA transparency. The HPEC-SI 5-year plan ends after Phase 3; future phases are being planned to leverage the HPEC-SI technology transition model.
30
Summary: the HPEC-SI program is on track toward changing software practice in DoD HPEC signal and image processing. Outside funding has been obtained for DoD program-specific activities (on top of the core HPEC-SI effort). The 1st demo is completed and the 2nd selected. World's first parallel, object-oriented standard. Applied research is under way into task/pipeline parallelism, fault tolerance, and parallel specification. Keys to success: program office support (a 5-year time horizon is a better match to DoD program development); quantitative goals for portability, productivity, and performance; engineering community support. Summary of HPEC-SI program accomplishments and keys to success.
31
Web Links: High Performance Embedded Computing Workshop; High Performance Embedded Computing Software Initiative; Vector, Signal, and Image Processing Library; MPI Software Technologies, Inc.; Data Reorganization Initiative; CodeSourcery, LLC; MatlabMPI. Important HPEC-SI weblinks.