The HPEC Challenge Benchmark Suite
Ryan Haney, Theresa Meuse, Jeremy Kepner and James Lebak
Massachusetts Institute of Technology Lincoln Laboratory
HPEC 2005

The title of this talk is “The HPEC Challenge Benchmark Suite.”

This work is sponsored by the Defense Advanced Research Projects Agency under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.
Acknowledgements
  Lincoln Laboratory PCA Team: Matthew Alexander, Jeanette Baran-Gale, Hector Chan, Edmund Wong
  Shomo Tech Systems: Marti Bancroft
  Silicon Graphics Incorporated: William Harrod
  Sponsor: Robert Graybill, DARPA PCA and HPCS Programs
HPEC Challenge Benchmark Suite
  PCA program kernel benchmarks
    Single-processor operations
    Drawn from many different DoD applications
    Represent both “front-end” signal processing and “back-end” knowledge processing
  HPCS program Synthetic SAR benchmark
    Multi-processor compact application
    Representative of a real application workload
    Designed to be easily scalable and verifiable

The Polymorphous Computing Architectures (PCA) and High Productivity Computing Systems (HPCS) programs have combined their work to form a new set of benchmarks called the “High Performance Embedded Computing Challenge Benchmarks.” From the PCA program, eight single-processor kernel-level benchmarks were drawn from a survey of DoD applications. These kernels are representative of operations found in “front-end” signal processing and “back-end” knowledge processing. From the HPCS program, a Synthetic SAR multi-processor application was developed. The application was designed to be easily scalable in both its parallelism and its computation sizes, easily verifiable, and representative of workloads found in real-world applications.
Outline
  Introduction
  Kernel Level Benchmarks
  SAR Benchmark
    Overview
    System Architecture
    Computational Components
  Release Information
  Summary
Spotlight SAR System
  Principal performance goal: Throughput
    Maximize rate of results
    Overlapped IO and computing
  Intent of Compact App:
    Scalable
    High Compute Fidelity
    Self-Verifying

The principal performance goal is throughput (the rate at which answers are produced by a supercomputer). Overlapping IO and computing is allowed.

Intent of the compact application:
  Scalable: operates on a range of systems, from workstation to petascale computer
  High compute fidelity: representative computations of SAR processing
  Low physical fidelity: not a full spotlight SAR system (reduces unneeded benchmark complexity)
  Self-verifying

The benchmark is serial (sequential processing), and must be parallelizable.
SAR System Architecture
  [Figure: block diagram of the SAR system, divided into Front-End Sensor Processing and Back-End Knowledge Formation, with a legend distinguishing File IO from Computation. Blocks: Scalable Data and Template Generator; Kernel #1 Data Read and Image Formation; Template Insertion; Kernel #2 Image Storage; Kernel #3 Image Retrieval; Kernel #4 Detection; Validation. Data passed between blocks: Raw SAR Data Files, SAR Images, SAR Image Files, Templates, Template Files, Groups of Template Files, Sub-Image Detection Files, Detections.]
  HPEC community has traditionally focused on Computation …
  … but File IO performance is increasingly important

The full system architecture of the SAR system benchmark imposes demanding computational and I/O requirements. However, the benchmark can be run in various modes that turn I/O or computation on or off. For the purposes of this talk, we’ll be concentrating on the computational components only.
Data Generation and Computational Stages
  [Figure: the SAR system block diagram, highlighting the Scalable Data and Template Generator, Kernel #1 Image Formation, Template Insertion, Kernel #4 Detection, and Validation. Data passed between blocks: Raw SAR, SAR Image, Templates, Detections.]

There are 4 major computational components of the SAR system benchmark. The components are introduced here and discussed in more detail in the subsequent slides.

The Scalable Data Generator produces raw SAR data as well as template images that will later be inserted as targets. The raw data is taken into the Image Formation kernel, converted from the temporal to the spatial domain, and interpolated from a polar to a rectangular swath to form the desired image. The SAR images and target templates are passed to the Template Insertion block, where the template targets are pseudorandomly inserted into the SAR image. Next, the SAR image with templates is passed to the Detection kernel, where target detections are found and reported using simple image differencing, thresholding, and small correlations. These detections are passed on to a Validation block, which requires 100% object recognition with no false positives.
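The validation requirement (100% object recognition with no false positives) can be made concrete with a small check. This is a minimal sketch in Python, not the benchmark's own validation code; the function name `validate` and the representation of targets as a dictionary from location to letter are assumptions made for illustration.

```python
def validate(inserted, detected):
    """Pass only if every inserted target is found with the right ID and nothing else is.

    inserted, detected : dict mapping a location to a target ID
    (a hypothetical representation, chosen for this sketch).
    """
    # A target is missed if it is absent or reported with the wrong ID.
    missed = {loc for loc in inserted if detected.get(loc) != inserted[loc]}
    # A false alarm is a detection where nothing was inserted.
    false_alarms = {loc for loc in detected if loc not in inserted}
    return not missed and not false_alarms

inserted = {(0, 0): "A", (4, 4): "B"}
ok = validate(inserted, {(0, 0): "A", (4, 4): "B"})
missed_one = validate(inserted, {(0, 0): "A"})
false_alarm = validate(inserted, {(0, 0): "A", (4, 4): "B", (8, 8): "C"})
wrong_id = validate(inserted, {(0, 0): "A", (4, 4): "C"})
```

Only the first call passes; a missed target, an extra detection, or a wrong ID each fails the run.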
SAR Overview
  Radar captures echo returns from a ‘swath’ on the ground
  Notional linear FM chirp pulse train, plus two ideally non-overlapping echoes returned from different positions on the swath
  Summation and scaling of echo returns realizes a challengingly long antenna aperture along the flight path
  [Figure: swath geometry and waveforms. Labels: Synthetic Aperture, L; Fixed to Broadside; Range, X = 2X_0; Cross-Range, Y = 2Y_0; delayed transmitted SAR waveform; reflection coefficient scale factor, different for each return from the swath; received ‘raw’ SAR.]

The Scalable Synthetic Data Generator produces raw SAR data approximating what would be obtained from a real SAR system. As an airplane flies adjacent to the field of interest, or ‘swath,’ pulse trains are transmitted. The echo returns are scaled (to mimic different reflection coefficients at various points on the swath), time delayed (to mimic the different times at which echoes are returned from different points on the swath), and summed. The size of the SAR synthetic aperture is then determined by the distance that the sensor flies while the radar is capturing returns from the ground; this realizes a challengingly long antenna aperture length.

To reduce coding complexity, the benchmark makes some simplifications of the SAR problem (shown in red on the slide):
  Broadside-only processing (instead of being able to process different angular looks)
  The synthetic aperture is set equal to the swath’s cross-range
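The scale, delay, and sum model described in the notes can be sketched in a few lines. This is a minimal illustration in Python, not the benchmark's data generator; the function names `chirp` and `raw_return`, the sample rate, the chirp rate, and the reflector values are all assumed for the example.

```python
import cmath
import math

def chirp(t, duration, rate):
    """Baseband linear FM chirp: exp(j*pi*rate*t^2) for 0 <= t < duration, else 0."""
    if 0.0 <= t < duration:
        return cmath.exp(1j * math.pi * rate * t * t)
    return 0j

def raw_return(n_samples, fs, duration, rate, reflectors):
    """Sum scaled, time-delayed copies of the transmitted chirp.

    reflectors is a list of (scale, delay_seconds) pairs, one per point
    on the swath (hypothetical values, not taken from the benchmark).
    """
    out = []
    for n in range(n_samples):
        t = n / fs
        out.append(sum(a * chirp(t - d, duration, rate) for a, d in reflectors))
    return out

# Two non-overlapping echoes: the second is weaker and arrives later.
echo = raw_return(n_samples=64, fs=64.0, duration=0.25, rate=40.0,
                  reflectors=[(1.0, 0.0), (0.5, 0.5)])
```

With these values the first echo occupies samples 0 to 15 at unit magnitude, the second occupies samples 32 to 47 at half magnitude, and the samples in between are zero.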
Scalable Synthetic Data Generator
  Generates synthetic raw SAR complex data
  Data size is scalable to enable rigorous testing of high performance computing systems
    User defined scale factor determines the size of images generated
  Generates ‘templates’ that consist of rotated and pixelated capitalized letters
  [Figure: Spotlight SAR Returns; axes: Range, Cross-Range]

The raw complex data generated by the Data Generator is scalable. A user-defined scale factor determines the size of the ‘swath’ from which echo returns are gathered, and therefore the size of the images generated. This allows the user to scale data sizes to better stress high performance systems. Along with the raw complex SAR data, the generator produces target templates that consist of rotated, pixelated capital letters. The templates are passed on so that they can later be inserted as random targets into the SAR images; they are also passed on to the Detection kernel.
Kernel 1: SAR Image Formation
  [Figure: processing chain. The received samples $s(t,u)$, which fit a polar swath, pass through a Fourier transform $(t,u) \to (\omega,k_u)$ to give $s(\omega,k_u)$; matched filtering with $s_0^*(\omega,k_u)$; spatial frequency domain interpolation to $F(k_x,k_y)$; and an inverse Fourier transform $(k_x,k_y) \to (x,y)$ to give the processed samples $f(x,y)$, which fit a rectangular swath.]
  Interpolation: $k_x = \sqrt{4k^2 - k_u^2}$, $k_y = k_u$
  [Figure: Spotlight SAR Reconstruction; axes: Range, Pixels and Cross-Range, Pixels]

The SAR image formation step, Kernel 1, consists of:
  a fast Fourier transform (FFT) into the frequency domain, to ease processing further down the chain
  pulse compression (also called matched filtering), to remove the transmitted waveform’s spectral components
  spatial frequency domain interpolation, to change the swath’s representation from a polar to a rectangular coordinate system
  a two-dimensional inverse fast Fourier transform (IFFT), to produce an observable spatial-domain image

The synthesized SAR image displays a grid pattern of lobes corresponding to what would have been the placement of reflectors on the swath.
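The interpolation mapping on this slide, kx = sqrt(4k^2 - ku^2) with ky = ku, shows why the samples do not land on a rectangular grid. The sketch below is a small Python illustration under stated assumptions: the slide does not define k (it is presumably the wavenumber of the transmitted signal), the function name `spatial_frequency` is hypothetical, and discarding samples with |ku| > 2k is a choice made here, not something the source specifies.

```python
import math

def spatial_frequency(k, ku):
    """Map a (k, ku) sample to (kx, ky) with kx = sqrt(4k^2 - ku^2), ky = ku.

    Returns None when |ku| > 2k, where kx would be imaginary
    (treating such samples as unusable is an assumption of this sketch).
    """
    radicand = 4.0 * k * k - ku * ku
    if radicand < 0.0:
        return None
    return math.sqrt(radicand), ku

# For a fixed k, the mapped samples fall on an arc of radius 2k,
# so they must be interpolated onto an evenly spaced (kx, ky) grid
# before the two-dimensional inverse transform can be applied.
k = 5.0
arc = [spatial_frequency(k, ku) for ku in (-6.0, -3.0, 0.0, 3.0, 6.0)]
```

Every point in `arc` satisfies kx^2 + ky^2 = 4k^2, which is the polar geometry the interpolation step converts to a rectangular one.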
Template Insertion (untimed)
  Inserts rotated pixelated capital letter templates into each SAR image
    Non-overlapping locations and rotations
    Randomly selects 50%
    Used as ideal detection targets in Kernel 4
  [Figure: two panels, axes X Pixels and Y Pixels. Left: image if inserted with 100% of the templates. Right: image inserted with only a random 50% of the templates.]

The template insertion step inserts pixelated, rotated capital letter templates into the SAR image; these letters are the “targets” that will later be detected. Templates are placed at non-overlapping locations and rotations (to avoid image alignment problems later), and in the valleys of the SAR lobes. Template magnitude is set to the average power of the original raw SAR data (well below the magnitude of the SAR peaks). [Template magnitude is exaggerated in the figure to make the templates visible.]
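The insertion step can be illustrated with a short sketch: pick a random half of a set of non-overlapping candidate locations and write a pixelated template into each. This is a simplified Python example, not the benchmark's code; the function name `insert_templates`, the use of a single template with no rotation, the fixed seed, and the unit magnitude are all assumptions made to keep it small.

```python
import random

def insert_templates(image, template, slots, fraction=0.5, magnitude=1.0, seed=0):
    """Write `template` into a random `fraction` of the candidate `slots`.

    image    : list of rows (modified in place)
    template : list of rows of 0/1 pixels
    slots    : list of (row, col) top-left corners, assumed non-overlapping
    Returns the sorted list of slots that received a template.
    """
    rng = random.Random(seed)
    chosen = rng.sample(slots, int(len(slots) * fraction))
    for r0, c0 in chosen:
        for dr, row in enumerate(template):
            for dc, pixel in enumerate(row):
                if pixel:
                    image[r0 + dr][c0 + dc] += magnitude
    return sorted(chosen)

image = [[0.0] * 8 for _ in range(8)]
letter = [[1, 0], [1, 1]]                    # tiny stand-in for a pixelated letter
slots = [(0, 0), (0, 4), (4, 0), (4, 4)]     # non-overlapping corners
placed = insert_templates(image, letter, slots)
```

The returned list of occupied slots plays the role of ground truth that a later detection stage can be checked against.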
Kernel 4: Detection
  Detects targets in SAR images
    Image difference
    Threshold
    Sub-regions
    Correlate with every template; max is target ID
  Computationally difficult
    Many small correlations over random pieces of a large image
  100% recognition, no false alarms
  [Figure: panels showing Image A, Image B, Image Difference, Thresholded Difference, and a Sub-region]

The detection kernel compares two images to detect changes that represent moving targets. The two images are differenced, effectively removing the SAR components. Two failure modes can prevent this:
  1. If SAR reflector peaks are not formed (e.g., under-focused or blurred), templates may be irrecoverably buried in the blurred SAR image.
  2. If SAR reflector peaks are not well formed (floor-to-peak power ratio and placement), part or all of a template could be irrecoverably buried under a SAR lobe, so that it becomes unidentifiable.

The difference image is thresholded to distinguish newly visible targets from targets that moved out of the picture and thus ceased to be of interest. The image is broken up into sub-regions; each sub-region is correlated against each template, and the maximum correlation decides the target’s ID.

Requirement: throughput cannot be meaningfully measured until 100% recognition with no false alarms is achieved.
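The detection steps (difference, threshold, split into sub-regions, correlate with every template, take the maximum) can be sketched as follows. This is a minimal Python illustration, not the benchmark's Kernel 4; the function name `detect`, the sign convention (Image B minus Image A, keeping positive changes), the known sub-region corners, and the mapping of pixels to -1/+1 before correlating are all assumptions of this sketch.

```python
def detect(image_a, image_b, templates, slots, threshold):
    """Return {slot: template_name} for each sub-region that holds a new target.

    templates : dict of name -> square grid of 0/1 pixels (all the same size)
    slots     : (row, col) top-left corners of the sub-regions to examine
    """
    size = len(next(iter(templates.values())))
    found = {}
    for r0, c0 in slots:
        # Difference the two images over this sub-region and threshold it.
        region = [[1 if image_b[r0 + dr][c0 + dc] - image_a[r0 + dr][c0 + dc] > threshold
                   else 0
                   for dc in range(size)]
                  for dr in range(size)]
        if not any(map(any, region)):
            continue  # no pixel survived the threshold: nothing to identify
        # Correlate against every template (pixels mapped to -1/+1).
        scores = {
            name: sum((2 * region[dr][dc] - 1) * (2 * tmpl[dr][dc] - 1)
                      for dr in range(size) for dc in range(size))
            for name, tmpl in templates.items()
        }
        found[(r0, c0)] = max(scores, key=scores.get)  # max correlation is the ID
    return found

templates = {"L": [[1, 0], [1, 1]], "T": [[1, 1], [0, 1]]}
image_a = [[0.2] * 4 for _ in range(4)]          # background only
image_b = [row[:] for row in image_a]            # background plus two targets
for (r0, c0), name in {(0, 0): "L", (2, 2): "T"}.items():
    for dr in range(2):
        for dc in range(2):
            image_b[r0 + dr][c0 + dc] += templates[name][dr][dc]

found = detect(image_a, image_b, templates,
               slots=[(0, 0), (0, 2), (2, 0), (2, 2)], threshold=0.5)
```

Mapping pixels to -1/+1 keeps a template with more lit pixels from winning simply by covering more of the region; a plain 0/1 correlation would need some other normalization.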
Benchmark Summary and Computational Challenges
  [Figure: the SAR system block diagram, divided into Front-End Sensor Processing and Back-End Knowledge Formation. Blocks: Scalable Data and Template Generator; Kernel #1 Image Formation; Template Insertion; Kernel #4 Detection; Validation. Data passed between blocks: Raw SAR, SAR Image, Templates, Detections.]
  Data generator: scalable synthetic data generation
  Image formation: pulse compression; polar interpolation; FFT, IFFT (corner turn)
  File IO: sequential store; non-sequential retrieve; large and small IO
  Detection: large image difference and threshold; many small correlations on random pieces of a large image

SAR Image Formation has a large-scale parallel two-dimensional (2D) inverse fast Fourier transform (IFFT), which may require a ‘corner turn’ or a ‘gather scatter’ (depending on architecture) with large quantities of data. Polar interpolation is known to be even more computationally intense than the IFFT.

Streaming image data storage to an I/O device (write) may involve large block data transfers, storing one large image after another (Kernel 2).

Random-location image sequence retrieval from an I/O device (read) also involves large quantities of data, with stressful memory access patterns and locality issues (Kernel 3).

Kernel 4, detection, involves performing many small correlations on random pieces of large images.
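The ‘corner turn’ mentioned for the 2D IFFT is a transpose between the row pass and the column pass of the transform. The sketch below shows that structure in Python under stated assumptions: it uses a direct O(n^2) inverse DFT in place of a fast transform to stay short, runs on one processor, and the names `idft`, `corner_turn`, and `idft2` are hypothetical. On a parallel machine the transpose is the step that moves data between processors.

```python
import cmath

def idft(x):
    """Direct inverse DFT of one row (a stand-in for a real IFFT)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(2j * cmath.pi * k * m / n) for k in range(n)) / n
            for m in range(n)]

def corner_turn(matrix):
    """Transpose, so that columns become contiguous rows."""
    return [list(col) for col in zip(*matrix)]

def idft2(matrix):
    """2D inverse DFT as: row transforms, corner turn, row transforms, corner turn."""
    rows_done = [idft(row) for row in matrix]
    cols_as_rows = corner_turn(rows_done)
    cols_done = [idft(row) for row in cols_as_rows]
    return corner_turn(cols_done)

# A single nonzero value at zero frequency transforms to a constant image.
image = idft2([[4, 0], [0, 0]])
```

Each 1D pass touches only whole rows, so the rows can be distributed freely; the cost of the approach is concentrated in the two transposes.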
Outline
  Introduction
  Kernel Level Benchmarks
  SAR Benchmark
  Release Information
  Summary
HPEC Challenge Benchmark Release
  http://www.ll.mit.edu/HPECChallenge/
    Future site of documentation and software
  Initial release is available to PCA, HPCS, and HPEC-SI program members through respective program web pages
    Documentation
    ANSI C Kernel Benchmarks
    Single processor MATLAB SAR System Benchmark
  Complete release will be made available to the public in first quarter of CY06

The HPEC Challenge benchmarks will be released to the public on the web site listed above in the first quarter of calendar year 2006. In the meantime, a preliminary release of the benchmarks will be made to PCA, HPCS, and HPEC-SI members through their respective program web pages. This release will include documentation, source code for the ANSI C kernel benchmarks, and a single-processor MATLAB SAR system benchmark.
Summary
  The HPEC Challenge is a publicly available suite of benchmarks for the embedded space
    Representative of a wide variety of DoD applications
    Benchmarks stress computation, communication and I/O
  Benchmarks are provided at multiple levels
    Kernel: small enough to easily understand and optimize
    Compact application: representative of real workloads
    Single-processor and multi-processor
  For more information, see http://www.ll.mit.edu/HPECChallenge/