Rad Hard By Software for Space Multicore Processing
David Bueno, Eric Grobelny, Dave Campagna, Dave Kessler, and Matt Clark
Honeywell Space Electronic Systems, Clearwater, FL
HPEC 2008 Workshop, September 25, 2008

Slide 2: Outline
- Rad Hard By Software Overview
- ST8 Dependable Multiprocessor
- Next-Generation Dependable Multiprocessor Testbed
- Performance Results
- Conclusions and Future Work

Slide 3: Why Rad Hard By Software?
- Future payloads can be expected to require high-performance data processing
- Traditional component-hardening approaches to rad-hard processing suffer several key drawbacks:
  - Large capability gap between rad-hard and COTS processors
  - Poor SWaP characteristics vs. processing capacity
  - Extremely high cost vs. processing capacity
  - Dissimilarity with COTS technology drives high-cost software development units
- The Honeywell Rad Hard By Software (RHBS) approach solves these problems by moving most data processing to high-performance COTS single-board computers:
  - Leading-edge capability
  - Software fault mitigation = less hardware = reduced SWaP
  - Inexpensive
  - No difference between development and flight hardware
  - COTS software development tools and familiar programming models

Slide 4: DM Technology Advance: Overview
- A high-performance, COTS-based, fault-tolerant cluster onboard processing system that can operate in a natural space radiation environment
- NASA Level 1 Requirements (minimum in parentheses):
  - High throughput, low power, scalable, and fully programmable: >300 MOPS/watt (>100)
  - High system availability: >0.995 (>0.95)
  - High system reliability for timely and correct delivery of data: >0.995 (>0.95)
  - Technology-independent system software that manages a cluster of high-performance COTS processing elements
  - Technology-independent system software that enhances radiation upset tolerance
- Benefits to future users if the DM ST8 experiment is successful:
  - 10x-100x more delivered computational throughput in space than currently available
  - Enables heretofore unrealizable levels of science data and autonomy processing
  - Faster, more efficient application software development:
    - Robust, COTS-derived, fault-tolerant cluster processing
    - Port applications directly from laboratory to space environment (MPI-based middleware, compatible with standard cluster processing application software including existing parallel processing libraries)
  - Minimizes non-recurring development time and cost for future missions
  - Highly efficient, flexible, and portable software fault tolerance approach applicable to space and other harsh environments
  - DM technology directly portable to future advances in hardware and software technology

Slide 5: Next-Generation DM Testbed
- Dependable Multiprocessor (DM) is Honeywell's first-generation Rad Hard By Software technology
- Coarse-grained, software-based fault detection and recovery
  - Similar to the way modern communication protocols detect errors at the packet level rather than the byte level
  - Rad Hard By Software detects errors at the "operation" level rather than the instruction level
- Typical system:
  - One low-performance rad-hard SBC for "cluster" monitoring and severe upset recovery (could also serve as the spacecraft control processor)
  - One or more high-performance COTS SBCs for data processing, connected via high-speed interconnects
  - One or more fault-tolerant storage/memory cards for shared memory
  - Dependable Multiprocessing (DM) software stack
- [Figure: Honeywell next-generation DM testbed. A system controller (233 MHz PowerPC 750 running RHBS management) connects over Gigabit Ethernet to RHBS data processors: a dual-processor PowerPC 970FX SMP, a 1.5 GHz dual-core Freescale 8641D PowerPC, and a 2 GHz dual-core PA Semi PowerPC.]
- This work applies DM to multicore/multiprocessor targets including the PA Semi PA6T-1682M, Freescale 8641D, and IBM 970FX
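The slides do not show the DMM internals, so purely as an illustration of the operation-level (coarse-grained) detection idea above, here is a minimal Python sketch: execute an operation redundantly, compare result digests, and retry on mismatch. The function name, digest comparison, and retry policy are hypothetical conveniences, not DM's actual mechanism.

```python
import hashlib
import pickle

def run_with_operation_level_checks(operation, data, max_retries=3):
    """Hypothetical sketch: run an operation twice and compare result digests.

    Checking happens at the granularity of a whole operation (cf. packet-level
    CRCs in communication protocols), not per-instruction lockstep.
    """
    for attempt in range(max_retries):
        first = operation(data)
        second = operation(data)
        digest = lambda r: hashlib.sha256(pickle.dumps(r)).hexdigest()
        if digest(first) == digest(second):
            return first  # results agree: accept and move on
        # disagreement suggests a transient upset: re-run the whole operation
    raise RuntimeError("operation failed %d consecutive comparisons" % max_retries)

# Usage: a deterministic, fault-free operation passes on the first attempt.
result = run_with_operation_level_checks(lambda xs: [x * x for x in xs], [1, 2, 3])
```

A real middleware would also need timeouts and a path to escalate persistent failures to the rad-hard system controller, which this sketch omits.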

Slide 6: DMM (Dependable Multiprocessor Middleware)
[Figure: DMM software stack. The system controller (rad-hard SBC, WindRiver VxWorks 5.5 OS) and the high-performance COTS data processors (Linux OS) each run the DMM and are connected by Gigabit Ethernet. On the data processor side, mission-specific science/defense applications (with policies and configuration parameters) sit on an Application Programming Interface (API) above the DMM components and agents, which form a generic fault-tolerant framework resting on a System Abstraction Layer (SAL) over the OS and hardware; layers are marked application-specific, generic, or OS/hardware-specific. The system controller side includes spacecraft interface software plus state-of-health and experiment data collection.]

Slide 7: Application Benchmark Overview
- FFTW
  - 1K-, 8K-, or 64K-point radix-2 FFT (FFTW1K, FFTW8K, FFTW64K)
  - Single-precision floating point
  - Supports multi-threading via small alterations (~5 lines) to application source code and linking the multi-threaded FFTW library
- Matrix multiply
  - 800x800 and 3000x3000 variants (MM800/MM3000)
  - Single-precision floating point
  - Uses ATLAS/BLAS linear algebra libraries
  - Supports transparent multi-threading by linking the pthreads version of the BLAS library
- Hyperspectral imaging (HSI) detection and classification
  - 256x256x512 data cube
  - Single-precision floating point
  - Uses ATLAS/BLAS linear algebra libraries
  - Supports transparent multi-threading by linking the pthreads version of the BLAS library
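The first two kernels can be sketched in a few lines of Python with NumPy; this is only a stand-in for shape and precision, since the actual suite uses FFTW and ATLAS/BLAS from C. The sizes and the single-precision inputs come from the slide; the function names are invented here.

```python
import numpy as np

def fft_benchmark(n):
    """1-D FFT kernel, as in the FFTW1K/8K/64K benchmarks.

    Input is single precision; note NumPy's FFT computes internally in
    double precision, unlike the single-precision FFTW build in the suite.
    """
    x = np.arange(n, dtype=np.complex64)
    return np.fft.fft(x)

def matmul_benchmark(n):
    """Single-precision n x n matrix multiply, as in MM800/MM3000.

    NumPy dispatches this to the linked BLAS, which, like the pthreads
    ATLAS/BLAS in the suite, may be transparently multi-threaded.
    """
    a = np.ones((n, n), dtype=np.float32)
    b = np.ones((n, n), dtype=np.float32)
    return a @ b

y = fft_benchmark(1024)   # FFTW1K-sized transform
c = matmul_benchmark(8)   # tiny stand-in for the 800x800 MM800 case
```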

Slide 8: Application Performance Results
- Next-generation architectures provide significant performance improvements over the existing DM 7447A (ST8 baseline) for each application
- Largest speedups are on the large matrix multiply, which best exploits the parallelism in multi-core architectures
- FFTW does not efficiently exploit both processor cores, limiting speedup
- PA Semi provides 5x the performance of the DM ST8 baseline for the HSI application
  - Its advantage over the 8641D and 970FX for HSI is largely due to a custom-built ATLAS BLAS library for the PA Semi vs. precompiled binary libraries for the others

Slide 9: One-Thread vs. Two-Thread PA Semi Results
- Nearly 2x speedup for the 3000x3000 matrix multiply; the smaller matrix multiply suffers due to its dataset size
- HSI application speedup is limited to 1.63x by the highly serialized Weight Computation stage
  - The Autocorrelation Sample Matrix (ACSM) and Target Classification stages take advantage of both cores fairly efficiently
- FFTW actually slowed down in the multi-core implementations
  - Likely due to inefficiencies in fine-grained parallelization of a 1-D FFT; much better performance is expected for 2-D FFTs with coarse-grained parallelization
- Similar trends were observed on the 8641D (and on the 970FX SMP with two processors)
- [Table: dual-threaded PA Semi speedup (slowdown) vs. single-threaded PA Semi, with columns MM800, MM3000, HSI, FFTW1K, FFTW8K, FFTW64K. Most entries did not survive extraction; the recoverable slowdown factors (6.10), (1.41), and (1.05) are consistent with the FFTW slowdowns noted above.]
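The 1.63x ceiling on HSI is consistent with Amdahl's law. Inverting the law for the observed two-core speedup gives an estimated parallel fraction of about 0.77, i.e. roughly 23% of HSI runtime in the serialized Weight Computation stage. This is a back-of-envelope estimate derived here from the quoted 1.63x, not a figure stated on the slide.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup on `cores` cores with a serial remainder."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

# Invert the observed 2-core HSI speedup S = 1.63x to estimate the
# parallel fraction p:  S = 1 / ((1-p) + p/2)  =>  p = 2 * (1 - 1/S)
observed = 1.63
p = 2.0 * (1.0 - 1.0 / observed)   # ~0.77, so ~23% of the runtime is serial
```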

Slide 10: Comparison to "State of the Art" for Space
- The reference architecture is a 233 MHz PowerPC 750 with 512 KB of L2 cache
- The base DM system provides ~10x speedup over the PPC750 for FFTW and MM800
- More modern architectures improve upon this speedup by 2-4x
- Other applications did not run on the PPC750 due to memory limitations
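Chaining the two ratios above reproduces the overall advantage over the rad-hard reference that the summary slide quotes as ~20-40x. A quick arithmetic check under the figures as stated:

```python
base_dm_vs_ppc750 = 10.0         # base DM system vs. 233 MHz PPC750 (slide 10)
modern_vs_base_dm = (2.0, 4.0)   # next-generation boards vs. base DM (slide 10)

# Combined range of speedup vs. the rad-hard PPC750 reference
overall = tuple(base_dm_vs_ppc750 * r for r in modern_vs_base_dm)
```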

Slide 11: Estimated Throughput Density
- Power assumptions: 12 W for the 7447A board, 20 W for the PA Semi board, 35 W for the 8641D board, and 7 W for the 233 MHz PPC750 board
- The 970FX is not appropriate for space systems and is not included
- PA Semi provides significant throughput density enhancements vs. the 8641D and the ST8 7447A
- All architectures provide more than an order of magnitude throughput density enhancement vs. the PPC750
- HSI throughput density is conservative in most cases: the op count only includes the ACSM stage, which accounts for ~90% of execution time, while the time includes all compute stages
- The 8641D version still suffers due to the older ATLAS library
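Throughput density is simply delivered throughput divided by board power. Combining the slide's power assumptions with the ~5x PA Semi HSI speedup from slide 8 reproduces roughly the ~3x density gain over the 7447A baseline quoted in the summary. This is illustrative arithmetic; absolute MOPS figures are not given in the transcript, so only relative densities are computed.

```python
# Board power assumptions from the slide, in watts
power_w = {"7447A": 12.0, "PA Semi": 20.0, "8641D": 35.0, "PPC750": 7.0}

def relative_density_gain(speedup_vs_baseline, board, baseline="7447A"):
    """Relative MOPS/W gain: (relative throughput) / (relative power)."""
    return speedup_vs_baseline / (power_w[board] / power_w[baseline])

# PA Semi at 5x the 7447A's HSI throughput, drawing 20 W vs. 12 W
pa_semi_hsi_gain = relative_density_gain(5.0, "PA Semi")
```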

Slide 12: Summary
- DM provides a low-overhead approach for increasing the availability and reliability of COTS hardware in space
  - DM is easily portable to most Linux-based platforms
  - The 7447A processing platform was selected near the start of the NASA/JPL ST8 (DM) program, but better options now exist
- Modern processing platforms provided impressive overall speedups for existing DM applications with no additional development effort
  - ~5-6x speedup vs. the existing 7447A-based DM platform, leveraging libraries optimized for SIMD and multiprocessing
  - ~2-3x gain in throughput density (MFLOPS/W) vs. the existing DM solution
  - ~20-40x the performance of state-of-the-art rad-hard-by-process solutions
- Potential future work:
  - Exploration of high-speed networking technologies with DM
  - Enhancements to DM middleware for performance/availability/reliability
  - Explore options for using additional cores to increase reliability
  - Explore additional general-purpose multicore next-generation processing engines
    - The purchase of PA Semi by Apple potentially makes it a less attractive solution
    - New Freescale 2- and 8-core devices at 45 nm are a possible alternative
  - Explore ports of DM to advanced multicore architectures (Tilera TILE64, Cell Broadband Engine)
  - Further evaluation of future processing platforms (radiation testing, etc.)
- DM enables high-performance space computing with modern COTS processing engines