A KASSPER Real-Time Embedded Signal Processor Testbed. Glenn Schrader, MIT Lincoln Laboratory. HPEC 2004.

Presentation transcript:

A KASSPER Real-Time Embedded Signal Processor Testbed
Glenn Schrader
Massachusetts Institute of Technology Lincoln Laboratory
28 September 2004
This work is sponsored by the Defense Advanced Research Projects Agency under Air Force Contract F C. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Outline
KASSPER Program Overview
KASSPER Processor Testbed
Computational Kernel Benchmarks
–STAP Weight Application
–Knowledge Database Pre-processing
Summary

Knowledge-Aided Sensor Signal Processing and Expert Reasoning (KASSPER)
Radar modes: Ground Moving Target Indication (GMTI) search and track; Synthetic Aperture Radar (SAR) spotlight and stripmap.
Program Objectives
–Develop a systems-oriented approach to revolutionize ISR signal processing for operation in difficult environments
–Develop and evaluate high-performance integrated radar system/algorithm approaches
–Incorporate knowledge into radar resource scheduling, signal, and data processing functions
–Define, develop, and demonstrate an appropriate real-time signal processing architecture

Representative KASSPER Problem
Scenario: an ISR platform carrying the KASSPER system surveils a launcher and two convoys (Convoy A and Convoy B).
Radar System Challenges
–Heterogeneous clutter environments
–Clutter discretes
–Dense target environments
These challenges result in decreased system performance (e.g., higher minimum detectable velocity, degraded probability of detection vs. false alarm rate).
KASSPER System Characteristics
–Knowledge processing: knowledge-enabled algorithms and a knowledge repository
–Multi-function radar: multiple waveforms plus algorithms, intelligent scheduling
Knowledge Types
–Mission goals
–Static knowledge
–Dynamic knowledge

Outline
KASSPER Program Overview
KASSPER Processor Testbed
Computational Kernel Benchmarks
–STAP Weight Application
–Knowledge Database Pre-processing
Summary

High-Level KASSPER Architecture
Sensor data and navigation inputs (Inertial Navigation System (INS), Global Positioning System (GPS), etc.) feed the signal processing. A knowledge database, populated from a static (i.e., "off-line") knowledge source (e.g., terrain knowledge), supports both the signal processing and a look-ahead scheduler, which also draws on mission knowledge, the operator, and external systems (e.g., real-time intelligence). A data processor (tracker, etc.) turns target detects into track data.

GMTI Signal Processing
Baseline signal processing chain: the GMTI functional pipeline runs from raw sensor data to target detects through an Input Task, Filtering Task, Beamformer Task, Filter Task, Detect Task, and Data Output Task. Knowledge processing surrounds this chain: a knowledge database (fed by INS, GPS, etc.), a look-ahead scheduler, and a data processor (tracker, etc.) that produces track data.
Baseline data cube sizes:
–3 channels x 131 pulses x 1676 range cells
–12 channels x 35 pulses x 1666 range cells
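The baseline data cube sizes translate directly into a per-dwell memory footprint. A minimal sketch, assuming single-precision interleaved complex samples (the format named in the benchmark test cases later in the deck):

```python
import numpy as np

# Baseline GMTI data cubes: channels x pulses x range cells,
# stored as single-precision complex (interleaved real/imag floats)
cube_a = np.zeros((3, 131, 1676), dtype=np.complex64)
cube_b = np.zeros((12, 35, 1666), dtype=np.complex64)

print(cube_a.nbytes / 2**20)  # roughly 5.0 MiB
print(cube_b.nbytes / 2**20)  # roughly 5.3 MiB
```

Both cubes are only a few megabytes, so a single dwell fits comfortably in memory; the challenge in the benchmarks below is throughput, not capacity.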

KASSPER Real-Time Processor Testbed
Processing architecture: sensor data storage (a surrogate for the actual radar system) and "knowledge" storage feed the KASSPER signal processor; an off-line knowledge source (DTED, VMAP, etc.) populates the knowledge database used by the signal processing, the look-ahead scheduler, and the data processor (tracker, etc.).
Software architecture: applications are built on the Parallel Vector Library (PVL) over optimized kernels, providing rapid prototyping, high-level abstractions, scalability, productivity, and rapid algorithm insertion.
KASSPER signal processor: 120 PowerPC G4 CPUs at 500 MHz, 4 GFlop/sec/CPU = 480 GFlop/sec.

KASSPER Knowledge-Aided Processing Benefits

Outline
KASSPER Program Overview
KASSPER Processor Testbed
Computational Kernel Benchmarks
–STAP Weight Application
–Knowledge Database Pre-processing
Summary

Space-Time Adaptive Processing (STAP)
Training data (A) is gathered from range samples (sample 1 through sample N) around the range cell of interest, and a steering vector (v) specifies the desired response. Solve for the STAP weights (W): R = A^H A, W = R^-1 v.
Training samples are chosen to form an estimate of the background. Multiplying the STAP weights by the data at a range cell suppresses features that are similar to the features that were in the training data, yielding a range cell with suppressed clutter.

Space-Time Adaptive Processing (STAP)
Simple training selection approaches may lead to degraded performance: the clutter in the training set may not match the clutter in the range cell of interest, which breaks the assumptions behind the weight solution (R = A^H A, W = R^-1 v).

Space-Time Adaptive Processing (STAP)
Using terrain knowledge to select the training samples can mitigate this degradation while keeping the same weight solution (R = A^H A, W = R^-1 v).
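The weight computation on these slides can be sketched in a few lines of NumPy. This is an illustrative sketch, not the testbed code: the dimensions, the random training data, and the placeholder steering vector are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 64, 16                        # training samples, space-time DOF (illustrative)

# Training data A: N range samples drawn from around the cell of interest
A = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))
v = np.zeros(M, dtype=complex)       # steering vector (placeholder)
v[0] = 1.0

R = A.conj().T @ A                   # R = A^H A: sample covariance estimate
W = np.linalg.solve(R, v)            # W = R^-1 v, without forming the inverse

x = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # one range cell
y = np.vdot(W, x)                    # apply weights to the cell of interest
```

Solving R W = v rather than explicitly inverting R is the usual numerically preferable choice; knowledge-aided training selection only changes which rows go into A, not this solve.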

STAP Weights and Knowledge
–STAP weights are computed from training samples within a training region.
–To improve processor efficiency, many designs apply the same weights across a swath of range, i.e., as a matrix multiply. When the same weights are used for all columns, weight application is equivalent to a matrix multiply.
–With knowledge-enhanced STAP techniques, one MUST assume that adjacent range cells will not use the same weights, i.e., weight selection/computation is data dependent. The weights select 1 of N based on external data, so a matrix multiply cannot be used, because the weights applied to each column are data dependent.

Benchmark Algorithm Overview
Benchmark data-dependent algorithm (multiple weight multiply): N pre-computed weight vectors; the weight vector applied to each column of the STAP input is selected sequentially, modulo N.
Benchmark reference algorithm (single weight multiply): STAP input x weights = STAP output.
The benchmark's goal is to determine the achievable performance of the data-dependent algorithm compared to the well-known matrix multiply algorithm.
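The two benchmark kernels can be sketched as follows. The modulo-N weight selection mirrors the slide; the sizes and random data are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
M, L, Nw = 12, 1000, 10               # DOF, range cells (columns), weight vectors
X = rng.standard_normal((M, L))       # STAP input: one column per range cell
Wbank = rng.standard_normal((Nw, M))  # Nw pre-computed weight vectors

# Reference algorithm (single weight multiply): one weight vector for
# every column, i.e. a plain matrix product over the whole range swath
y_single = Wbank[0] @ X

# Data-dependent algorithm (multiple weight multiply): the weight vector
# for column l is selected sequentially, modulo Nw, so the kernel becomes
# a per-column dot product instead of a single matrix multiply
sel = np.arange(L) % Nw
y_multi = np.einsum('lm,ml->l', Wbank[sel], X)
```

The per-column gather of `Wbank[sel]` is exactly the data-dependent memory traffic that, on the benchmark hardware, defeats loop unrolling and SIMD vectorization.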

Test Case Overview
Data sets are J x K x L. Single weight multiply: one J x K weight matrix applied across a K x L data set. Multiple weight multiply: N = 10 weight matrices applied across the L dimension.
Complex data:
–Single precision, interleaved
Two test architectures:
–PowerPC G4 (500 MHz)
–Pentium 4 (2.8 GHz)
Four test cases:
–Single weight multiply without cache flush
–Single weight multiply with cache flush
–Multiple weight multiply without cache flush
–Multiple weight multiply with cache flush

Data Set #1 – L x L x L
Single weight multiply: one L x L weight matrix applied to an L x L data set. Multiple weight multiply: 10 weight matrices.
[Chart: MFLOP/sec vs. length L for single and multiple weight multiply, each with and without cache flush]

Data Set #2 – 32 x 32 x L
[Chart: MFLOP/sec vs. length L for single and multiple weight multiply, each with and without cache flush]

Data Set #3 – 20 x 20 x L
[Chart: MFLOP/sec vs. length L for single and multiple weight multiply, each with and without cache flush]

Data Set #4 – 12 x 12 x L
[Chart: MFLOP/sec vs. length L for single and multiple weight multiply, each with and without cache flush]

Data Set #5 – 8 x 8 x L
[Chart: MFLOP/sec vs. length L for single and multiple weight multiply, each with and without cache flush]

Performance Observations
–Data-dependent processing cannot be optimized as well: branching prevents compilers from "unrolling" loops to improve pipelining.
–Data-dependent processing does not allow efficient re-use of already loaded data: algorithms cannot make simplifying assumptions about already having the data that is needed.
–Data-dependent processing does not vectorize: using either AltiVec or SSE is very difficult, since data movement patterns become data dependent.
–Higher memory bandwidth reduces the cache impact.
–Actual performance scales roughly with clock rate rather than with theoretical peak performance.
–Approximately 3x to 10x performance degradation; processing efficiency of 2-5%, depending on CPU architecture.

Outline
KASSPER Program Overview
KASSPER Processor Testbed
Computational Kernel Benchmarks
–STAP Weight Application
–Knowledge Database Pre-processing
Summary

Knowledge Database Architecture
Off-line pre-processing: raw a priori raster and vector knowledge (DTED, VMAP, etc.) passes through an off-line knowledge reformatter into the knowledge store.
Knowledge database: a knowledge manager services load/store/send commands from the look-ahead scheduler, serving stored data to the signal processing and to downstream processing (tracker, etc.); a knowledge cache sits between the manager and the knowledge store (lookup/update).
Knowledge processing: incorporates new knowledge (time-critical targets, etc.) derived from sensor data, GPS, INS, user inputs, and signal processing results back into the database.

Static Knowledge
Vector data:
–Each feature is represented by a set of point locations:
 Points: a longitude and latitude (e.g., towers)
 Lines: a list of points; the first and last points are not connected (e.g., roads, rail)
 Areas: a list of points; the first and last points are connected (e.g., forest, urban)
–Standard vector formats: Vector Smart Map (VMAP)
Matrix data:
–Rectangular arrays of evenly spaced data points
–Standard raster formats: Digital Terrain Elevation Data (DTED)
Examples: trees (area), roads (lines), DTED (raster).
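The three vector feature types above can be captured with simple containers. A hypothetical sketch; the class and field names are illustrative, not from any standard VMAP API:

```python
from dataclasses import dataclass
from typing import List, Tuple

LonLat = Tuple[float, float]       # (longitude, latitude) in degrees

@dataclass
class PointFeature:                # e.g. a tower
    location: LonLat

@dataclass
class LineFeature:                 # e.g. a road or rail line
    points: List[LonLat]           # first and last points are NOT connected

@dataclass
class AreaFeature:                 # e.g. a forest or urban area
    points: List[LonLat]           # first and last points ARE connected

    def closed_ring(self) -> List[LonLat]:
        """Return the boundary with the implied closing segment made explicit."""
        return self.points + self.points[:1]
```

The only structural difference between lines and areas is the implied closing segment, which `closed_ring` makes explicit for downstream geometry tests such as point-in-polygon.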

Terrain Interpolation
Converting from radar to geographic coordinates requires iterative refinement of the estimate: starting from the slant range and the earth radius, the terrain height at the current estimate of the geographic position is looked up in the sampled terrain height and type data, and the position is re-estimated until it converges.
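The iterative refinement can be sketched with a flat-earth simplification. Everything here is an illustrative assumption: the smooth terrain profile stands in for a DTED lookup, the geometry ignores the earth-radius curvature term shown in the figure, and the tolerances are arbitrary.

```python
import math

def terrain_height(ground_range_m):
    """Stand-in for a sampled DTED lookup (hypothetical smooth profile)."""
    return 100.0 + 50.0 * math.sin(ground_range_m / 5000.0)

def slant_to_ground(slant_range_m, platform_alt_m, tol_m=0.01, max_iter=50):
    """Iteratively refine the ground position implied by a slant-range
    measurement: guess a terrain height, convert slant range to ground
    range, look the height up again, and repeat until it converges."""
    h = 0.0                        # initial guess: terrain at sea level
    g = 0.0
    for _ in range(max_iter):
        dz = platform_alt_m - h
        g = math.sqrt(max(slant_range_m**2 - dz**2, 0.0))
        h_new = terrain_height(g)
        if abs(h_new - h) < tol_m:
            h = h_new
            break
        h = h_new
    return g, h

g, h = slant_to_ground(30000.0, 8000.0)
```

Each pass through the loop is a data-dependent terrain lookup, which is why this kernel shows the same poor vectorization and cache behavior as the STAP weight benchmark.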

Terrain Interpolation Performance Results

CPU type    Time per interpolation    Processing rate
P4          µsec/interpolation        144 MFlop/sec (1.2%)
G4          µsec/interpolation        46 MFlop/sec (2.3%)

–Data-dependent processing does not vectorize: using either AltiVec or SSE is very difficult.
–Data-dependent processing cannot be optimized as well: compilers cannot "unroll" loops to improve pipelining.
–Data-dependent processing does not allow efficient re-use of already loaded data: algorithms cannot make simplifying assumptions about already having the data that is needed.
–Processing efficiency is 1-3%, depending on CPU architecture.

Outline
KASSPER Program Overview
KASSPER Processor Testbed
Computational Kernel Benchmarks
–STAP Weight Application
–Knowledge Database Pre-processing
Summary

Summary
Observations:
–The use of knowledge processing provides a significant system performance improvement.
–Compared to traditional signal processing algorithms, the implementation of knowledge-enabled algorithms can result in a factor of 4 or 5 lower CPU efficiency (as low as 1%-5%).
–Next-generation systems that employ cognitive algorithms are likely to have similar performance issues.
Important Research Directions:
–Heterogeneous processors (i.e., a mix of COTS CPUs, FPGAs, GPUs, etc.) can improve efficiency by better matching hardware to the individual problems being solved. For what class of systems is the current technology adequate?
–Are emerging architectures for cognitive information processing needed (e.g., Polymorphous Computing Architecture (PCA), Application-Specific Instruction-set Processors (ASIPs))?