Scientific Computing in Space Using COTS Processors
Jeremy Ramos, Honeywell DSES, jeremy.ramos@honeywell.com
Roger Sowada, Honeywell DSES, roger.j.sowada@honeywell.com
David Lupia, Honeywell DSES, david.lupia@honeywell.com
Agenda
- Introduction
- Background
- Detail Description
- Implementation Approach
- Development Efforts
Acknowledgements
- University of Florida: key contributors to software prototype effort and research (Alan George and the High-performance Computing and Simulation Lab)
- Physical Sciences Inc.: SEU sensor provider (Gary Galica and Robin Cox)
- WW Technologies Inc.: RPI middleware provider (Chris Walters and technical staff)
- NASA New Millennium Program: program sponsor
Processing Platforms for New Science
- The success of recent rover missions is a perfect example of the type of science we want to support
- Though returns from rover missions are significant, they could be orders of magnitude greater with sufficient autonomy and on-board processing capabilities
- Similarly, deep space probes as well as Earth-orbiting instruments can benefit from increases in on-board processing capabilities
- In all cases, increases in science data returns are dependent on the capabilities of the spacecraft’s processing platform
Payload Processing Conceptual Model
[Figure: payload processing chain from Sensor Array to low-bandwidth Telemetry in three stages: Time Dependent Processing (TDP, sample-level signal processing), Object Dependent Processing (ODP, frame-level signal processing), and Mission Dependent Processing (MDP, high-level logic operations). Axes: Data Rates (Mbps), Algorithm Complexity (MIPS/MOPS), and Algorithm Complexity/Abstraction across TDP, ODP, and MDP.]
Technology Advance
A spacecraft onboard payload data processing system architecture, including a software framework and set of fault tolerance techniques, which provides:
- An architecture and methodology that enables COTS-based, high-performance, scalable, multi-computer systems, incorporating reconfigurable co-processors and supporting parallel/distributed processing for science codes, that accommodate future COTS parts/standards through upgrades.
- An application software development and runtime environment that is familiar to science application developers and facilitates porting of applications from the laboratory to the spacecraft payload data processor.
- An autonomous and adaptive controller for fault tolerance configuration, responsive to environment, application criticality, and system mode, that maintains required dependability and availability while optimizing resource utilization and system efficiency.
- Methods and tools which allow the prediction of the system’s behavior in the space environment, including predictions of availability, dependability, fault rates/types, and system-level performance.
Radiation Environments
- Traditionally, microelectronics have been designed and manufactured specifically for use in radiation environments
- Some COTS microelectronic manufacturing processes yield components that are partly resistant to radiation effects (tolerant to TID and latch-up immune)
- In most cases Single Event Effects are of greatest concern, resulting mostly in bit flips (SEUs) and functional interrupts (SEFIs)
The Department of Defense (DoD) requires radiation tolerant microelectronics to ensure that key military systems can perform in the combined nuclear and natural radiation environments. Without this capability, U.S. military power -- including conventional military power -- could be undermined on future battlefields. Although this need was widely recognized during the Cold War, it is less recognized today. In an era of nuclear proliferation where the possibility of limited nuclear usage may be greater than during the Cold War, the need for radiation tolerant microelectronics remains and may be increasing. For example, without radiation tolerant circuits, a single low-yield nuclear detonation in lower space could rapidly degrade the performance or cause catastrophic failure of many critical U.S. satellite systems, directly impacting command and control capabilities and degrading battlefield performance. Furthermore, some defense mission requirements demand survival at radiation levels that have no commercial equivalents.
[Figure: Natural Radiation. Discrete simulation for 7 orbits of a Xilinx V2 FPGA, showing the trend driven by changes in particle flux. Orbit: 300 km perigee, 1400 km apogee, 70° inclination.]
N-Modular Redundancy
- The popular approach for mitigating SEUs is to employ fixed component-level redundancy.
- This technique can be applied at all levels of the system hierarchy, from circuit to box.
- One major disadvantage of fixed redundancy is low efficiency and unrealized system capacity.
- Example of N-modular redundancy: TMR (Triple Modular Redundancy), typically used in COTS-based microprocessor and Xilinx FPGA-based reconfigurable designs.
[Figure: Module 1, Module 2, and Module 3 feeding a Majority Voter]
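The majority voter at the heart of TMR can be illustrated in a few lines. This is a minimal software sketch, assuming each module output is an integer word; the function names `majority_vote` and `disagreeing_modules` are hypothetical and are not taken from the EAFTC software.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three redundant module outputs."""
    # A result bit is 1 only if at least two of the three inputs have it set.
    return (a & b) | (a & c) | (b & c)


def disagreeing_modules(a: int, b: int, c: int) -> list[int]:
    """Return the 1-based indices of modules whose output differs from the vote."""
    voted = majority_vote(a, b, c)
    return [i for i, out in enumerate((a, b, c), start=1) if out != voted]


# Illustrative data: module 2 suffers a single bit flip (an SEU).
good = 0b10110010
upset = good ^ 0b00001000
voted = majority_vote(good, upset, good)
faulty = disagreeing_modules(good, upset, good)
```

With one upset module the vote still equals the fault-free value and `faulty` identifies module 2. The cost is visible in the call itself: three modules do the work of one, which is the unrealized capacity noted above.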
Adaptive Fault Tolerance
- Current COTS-based space computing/electronics systems use fixed-architecture designs based on brute-force, worst-case fault masking techniques.
  - Triple Modular Redundancy (TMR) is typically a hard-wired design approach for Rad Tolerant G4 PPC processors and Xilinx FPGAs
  - The effectiveness and performance (MIPS/W) gains that the COTS device brings are degraded substantially by the use of a fixed-design, worst-case redundancy scheme.
- EAFTC enables the computer subsystem to take advantage of changing orbital environments during a mission life to utilize the COTS processing elements more efficiently as the environment allows.
  - This allows the EAFTC system to adaptively trade performance versus reliability in real time.
[Figure: EAFTC-based system combining software-implemented FT, COTS processing components in a reconfigurable architecture, environmental sensing (radiation, position), and adaptive control algorithms]
EAFTC Operational Scenario
[Figure: MIPS per Watt and SEU rates versus orbit position, comparing the average MIPS/Watt for the EAFTC design with the MIPS/Watt for the worst-case design]
- EAFTC exploits the relation between SEU rate and orbit position as well as the variable criticality of system tasks
- The fundamental process implemented in the system consists of three steps:
  - measure the environment and system state
  - assess the environmental threat to the application’s availability
  - adapt the processing application’s configuration (i.e. fault tolerance) to effectively mitigate the threat presented by the environment
- On average, more computation can be performed with less energy using EAFTC
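The measure, assess, adapt cycle can be sketched as a small policy function. Everything specific here is an assumption made for illustration: the threat levels, the SEU-rate thresholds, the replica counts, and the names `Threat`, `Task`, `assess`, `replicas_for`, and `adapt` are hypothetical and do not come from the EAFTC controller.

```python
from dataclasses import dataclass
from enum import IntEnum


class Threat(IntEnum):
    LOW = 0
    MODERATE = 1
    HIGH = 2


@dataclass(frozen=True)
class Task:
    name: str
    critical: bool


def assess(seu_rate: float, moderate_at: float = 1e-4, high_at: float = 1e-2) -> Threat:
    """Step 2: map a measured SEU rate to a threat level (thresholds are made up)."""
    if seu_rate >= high_at:
        return Threat.HIGH
    if seu_rate >= moderate_at:
        return Threat.MODERATE
    return Threat.LOW


def replicas_for(task: Task, threat: Threat) -> int:
    """Step 3: choose a replication level from threat and task criticality."""
    if threat is Threat.HIGH:
        return 3                          # full TMR for every task
    if threat is Threat.MODERATE:
        return 3 if task.critical else 2
    return 2 if task.critical else 1      # benign environment: free up nodes


def adapt(tasks: list[Task], seu_rate: float) -> dict[str, int]:
    """One pass of the loop; seu_rate stands in for the measurement of step 1."""
    threat = assess(seu_rate)
    return {t.name: replicas_for(t, threat) for t in tasks}


tasks = [Task("science", critical=False), Task("nav", critical=True)]
quiet = adapt(tasks, seu_rate=1e-6)
storm = adapt(tasks, seu_rate=5e-2)
```

In the quiet case the non-critical task runs unreplicated, so the nodes that fixed TMR would have reserved are available for additional work; in the high-threat case the configuration falls back to triple replication for both tasks.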
Hardware Architecture
- APC Cluster
  - Consists of several APC nodes
  - Networked together with RapidIO
- Adaptive Processing Computer (APC)
  - Reconfigurable processing node
  - Multiple modes/configurations
  - High-performance COTS processor (PPC)
  - RapidIO network interface
  - Reconfigurable co-processor
- System Controller
  - Controller for the APC Cluster
  - Hosts EAFTC controller software and other experiment-related control software
  - RadHard processor and interfaces for reliable control of the COTS cluster
- SEU Alarm
  - Provides a measure of SEU-inducing flux and particle energy
  - Used by the EAFTC controller to determine the real-time SEU threat level
  - Separate heavy ion and proton sensors
Adaptive Processing Computer Conceptual Block Diagram
EAFTC Application Platform
[Figure: layered software stacks for the System Controller and the Data Processor, each built on FT Middleware, OS, and Hardware, with an FPGA on the Data Processor and a Network between nodes. Layer categories: Application Specific, Mission Specific, Generic Fault Tolerant Framework, OS/Hardware Specific. Labeled components: FT Manager, EAFTC Controller, Job Manager, FT Control Applications, Policies, Configuration Parameters, Scientific Application, Application Specific FT, Application FT Lib, Co Proc Lib, Application Programming Interface (API), Local Management Agents, Replication Services, Fault Detection, SAL (System Abstraction Layer).]
EAFTC Middleware
- Provides a high-performance platform for parallel/distributed applications
  - Cluster and job management to provide a single system view to the application
  - Message Passing Interface API
  - Platform abstraction to include OS system calls and hardware registers
  - Mission-level customization through policies
  - Scalable architecture to support clustering of resources on a multi-computer system
  - Reconfigurable co-processor devices for application acceleration
- Provides a high-availability platform for applications
  - An autonomous and adaptive controller for fault tolerance configuration that maintains required dependability and availability while optimizing resource utilization and system efficiency
  - Checkpoint and rollback service for application recovery in the event of a fault
  - Application-level replication services to facilitate reliable deployment of applications on SEU-susceptible COTS processing resources
- Offers numerous benefits as a system platform
  - Capitalizes on cost savings in the use of commercial hardware
  - Capitalizes on the latest processing technology through technology refresh
  - Reduces cost and extends system life through a software-based middleware solution
  - Scales to meet system requirements
  - Customizable degree of fault tolerance to meet specific system needs
EAFTC Software Architecture
- Fault Tolerance Management Services
  - Maintain the system in the configuration that delivers the highest availability to the application
  - Manage the pool of system resources: health, system-level error monitoring, error recovery, and error logging
- Job Management Services
  - Maintains list and description of jobs to be performed
  - Schedules jobs as resources become available or on a periodic schedule
  - Maintains resource status information
  - Facilitates fault tolerance recovery via process control
- Environmental Server Monitor (ESM)
  - Sensor measurements are continuously monitored by the ESM.
  - Based on the sensor inputs and the mission-defined resource susceptibility, the ESM generates a set of cluster-level alerts.
  - The ESM then publishes the alert level to the Job Manager, which responds by adapting the cluster computer’s configuration and application deployment, thereby enhancing the cluster’s fault tolerance.
- Fault Tolerant Message Passing Interface
  - Provides the primary communication mechanism for parallel applications
- Checkpoint and Rollback Service
  - Provides the application a checkpoint and rollback API and service to support application-level fault recovery
- Replication Service (RPS)
  - RPS employs a clustering approach to manage redundant resources
  - RPS provides transparent Fault Detection, Isolation and Removal services that monitor for deviations in replicate behavior to ensure that faulty participants are properly handled.
- Frame Scheduler Service
- Data Integrity Service
- Process Group Service
- FPGA Co-Processor Service
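Application-level checkpoint and rollback can be illustrated with a toy in-memory version. This is a sketch under stated assumptions, not the EAFTC service or its API: `CheckpointStore` and `run_with_recovery` are hypothetical names, the checkpoint lives in ordinary memory, and the fault is injected and detected by the same loop purely to exercise the recovery path.

```python
import copy


class CheckpointStore:
    """Holds the most recent saved copy of the application state."""

    def __init__(self):
        self._saved = None

    def checkpoint(self, state: dict) -> None:
        self._saved = copy.deepcopy(state)

    def rollback(self) -> dict:
        if self._saved is None:
            raise RuntimeError("no checkpoint has been taken")
        return copy.deepcopy(self._saved)


def run_with_recovery(values, fault_at=None, interval=2):
    """Sum `values`, checkpointing every `interval` items.

    A transient fault is simulated once, just before item `fault_at` is
    processed: the running total is corrupted and then restored by rollback.
    Returns (total, number_of_rollbacks).
    """
    store = CheckpointStore()
    state = {"next": 0, "total": 0}
    store.checkpoint(state)
    rollbacks = 0
    while state["next"] < len(values):
        i = state["next"]
        if i == fault_at:
            fault_at = None            # the fault is transient: it fires once
            state["total"] = -1        # simulated corruption of live state
            state = store.rollback()   # discard everything since the checkpoint
            rollbacks += 1
            continue
        state["total"] += values[i]
        state["next"] = i + 1
        if state["next"] % interval == 0:
            store.checkpoint(state)
    return state["total"], rollbacks


clean = run_with_recovery([1, 2, 3, 4, 5])
recovered = run_with_recovery([1, 2, 3, 4, 5], fault_at=3)
```

Both runs produce the same total; the faulted run repeats only the work done since the last checkpoint. The checkpoint interval trades overhead in the fault-free case against the amount of work lost on rollback.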
EAFTC Software Components Collaboration
EAFTC Technology Advances to TRL7
Stages of increasing fidelity and capability: TRL4 Technology Validation, TRL5 Technology Validation, TRL6 Technology Validation, TRL7 Flight Experiment Validation
TRL4 Validation
- Demonstrated basic EAFTC technologies in a laboratory environment on COTS hardware testbed including radiation source and sensor
- Environment Sensor
- Alert Generator
- High Availability Middleware
- Replication Services
NASA adds requirement for fault tolerant cluster and MPI capability
TRL5 Validation
- Demonstrate basic EAFTC technologies in a laboratory environment on testbed hardware with partially integrated Fault Tolerance Services
- Develop predictive models
- Validate and refine predictive models and predictive model parameters with experiment data
- Partial set of canonical fault injection experiments
TRL6 Validation
- Demonstrate enhanced EAFTC technologies in a laboratory environment on prototype flight hardware including exposure to radiation beam
- Validate and refine predictive models and predictive model parameters with experiment data
- Complete set of canonical fault injection experiments
TRL7 Validation
- Demonstrate EAFTC technologies in a real space environment
- Validate predictive models and predictive model parameters with experiment data
- TRL7 experiments will be identical to those performed and wrung out during TRL6 demonstration and validation
EAFTC Model Flow
Rad Effects Model
- Inputs: orbit; epoch; radiation characterization of components
- Outputs: particle fluxes, energies, and component SEE effects
Canonical Fault Model
- Inputs: system architecture; HW architecture (decomposed HW architecture); comprehensive fault model
- Outputs: canonical fault types
HW SEU Susceptibility Model
- Inputs: outputs of the Rad Effects Model; canonical fault types
- Outputs: fault rates for each fault type in the canonical fault model (λn)
Availability & Reliability Models
- Inputs: probability that a fault affects the application; detection coverage for each fault/error type in the canonical fault model; recovery coverage for each fault/error type in the canonical fault model; detection and recovery latencies for each fault; number of mode change types and rates; time to effect a mode change; probability that a mode change is successful
- Outputs: availability and reliability
Performance Model
- Inputs: mission application characterization and constraints; peak throughput per CPU; number of nodes in cluster; algorithm/architecture coupling efficiency for application; network-level parallelization efficiency; measured OS and FT services overhead; measured execution times for applications; availability and reliability
- Outputs: delivered throughput; delivered throughput density; effective system utilization
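The slide names the performance model's inputs and outputs but gives no equations. The sketch below shows one way such inputs could combine, assuming a purely multiplicative form; that form, the function names, and the example numbers are all assumptions for illustration and should not be read as the EAFTC performance model.

```python
def delivered_throughput(peak_mips_per_cpu: float,
                         nodes: int,
                         coupling_eff: float,
                         parallel_eff: float,
                         overhead_frac: float,
                         availability: float) -> float:
    """Assumed multiplicative model: peak capacity derated by each factor.

    coupling_eff  : algorithm/architecture coupling efficiency, 0..1
    parallel_eff  : network-level parallelization efficiency, 0..1
    overhead_frac : fraction of CPU time spent in OS and FT services, 0..1
    availability  : fraction of time the system is usable, 0..1
    """
    for name, x in (("coupling_eff", coupling_eff),
                    ("parallel_eff", parallel_eff),
                    ("overhead_frac", overhead_frac),
                    ("availability", availability)):
        if not 0.0 <= x <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {x}")
    peak = peak_mips_per_cpu * nodes
    return peak * coupling_eff * parallel_eff * (1.0 - overhead_frac) * availability


def effective_utilization(delivered: float, peak_mips_per_cpu: float, nodes: int) -> float:
    """Delivered throughput as a fraction of the cluster's peak throughput."""
    return delivered / (peak_mips_per_cpu * nodes)


# Made-up example values, chosen only to keep the arithmetic easy to follow.
example = delivered_throughput(1000.0, 4, 0.5, 0.5, 0.0, 1.0)
utilization = effective_utilization(example, 1000.0, 4)
```

Under this assumed form, any loss in availability or any added fault tolerance overhead reduces delivered throughput proportionally, which is why the availability and reliability results feed the performance model.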
TRL4 EAFTC System Technology Demonstration
- Successful demonstration of EAFTC system
- The EAFTC prototype comprises key technology elements
  - Cluster Computer
  - Autonomous Controller
  - Replication Services
- Environment input is simulated via SPENVIS radiation models
- Instrumentation for power utilization is included in the model
- Profiling is integrated on Data Processors for CPU utilization measurement
- Workload is provided via synthetic benchmark application on Data Processors
- Hardware Components
  - Ganymede Single Board Computer represents the System Controller SBC
  - Synergymicro Raptor-DX Single Board Computer represents the APC operating in Microprocessor mode
  - Ethernet-based interconnect represents the RapidIO interconnect
  - Honeywell Reconfigurable Space Computer prototype represents the APC operating in Reconfigurable mode
- Software Components
  - EAFTC Controller
  - Reliable Platform Middleware
  - Fault Tolerant Controller/Node Messaging Middleware
  - Benchmark Application
  - System Software: OS, profiling, and Honeywell Integrated Payload components
Computer Capacity Experiment
TMR 3-node system
- average power: 72 Watts
- average system effective MIPS: 973 MIPS
- average system efficiency: 13 MIPS/Watt
EAFTC 4-node system
- average power: 97 Watts
- average system effective MIPS: 2661 MIPS
- average system efficiency: 28 MIPS/Watt
Comparison: 35% increase in power consumption, 173% increase in effective MIPS, and 115% increase in efficiency
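The comparison figures follow directly from the quoted averages. The short check below recomputes them; the dictionary keys and the helper name `percent_increase` are chosen here for illustration.

```python
def percent_increase(baseline: float, new: float) -> float:
    """Relative increase of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Averages as quoted on the slide.
tmr = {"power_w": 72, "mips": 973, "mips_per_w": 13}
eaftc = {"power_w": 97, "mips": 2661, "mips_per_w": 28}

# Rounded to whole percent, matching the precision of the slide.
increase = {key: round(percent_increase(tmr[key], eaftc[key])) for key in tmr}
```

The result is 35% for power, 173% for effective MIPS, and 115% for efficiency, matching the comparison line. Note that dividing average MIPS by average power gives roughly 13.5 and 27.4 MIPS/Watt rather than the quoted 13 and 28; the slide does not say how the efficiency averages were computed, so the quoted values are used as given.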
TRL5 Platform
- Consists of 4 Data Processors implemented with COTS Single Board Computers (SBCs) and PCI Mezzanine Cards (PMCs)
- The SBCs will implement a PPC 750FX microprocessor running the Linux operating system and a software fault injector for fault simulation.
- Each PMC will implement a Xilinx Virtex2 FPGA that will serve as the co-processor for its host SBC
- The System Controller will be implemented with a software development unit of our flight SBC.
- All nodes in the cluster will be interconnected via a GigE switch.
- A Development Workstation will be used for software development, experiment control, and instrumentation data collection.
- Software Implemented Fault Injection (SWIFI) will be the primary method for simulating faults. Other methods may be used, such as manual node resets, network traffic fault injections (via software or hardware fault injection methods), and test-port-inserted faults
- Integrate, test, and demonstrate FT Cluster Manager and FT-MPI on EAFTC P5 system hardware with High Availability Middleware, and provide hooks for incorporating ABFT and other fault tolerance techniques in TRL6 validation, following a spiral development and demonstration plan (see TRL5 program schedule)
- Develop, test, and demonstrate FPGA support capability and performance improvement on EAFTC P5 system
- Implement and demonstrate a stressing NASA science mission application and a stressing fault tolerance application on the P5 system with and without injected faults (limited to key targeted faults)
- Perform radiation performance analysis for the proposed ST-8 mission orbit to determine the anticipated fault rates and expected number of upsets in the TRL7 flight experiment
- Develop Radiation Effects/HW SEU Susceptibility Model, Fault Model, Availability Model, and Performance Model, and use these models to successfully predict EAFTC P5 system performance
- Refine the existing system synthesis process and analysis models for evaluating the migration of COTS to space
New Millennium Program Space Technology 8
- NASA program for technology development
- Currently working on its 8th technology development program
- In Formulation phase to evaluate 4 subsystem technologies (one of them EAFTC)
- The objective of the NMP ST8 EAFTC mission is to validate EAFTC technology at TRL7 through experimentation in space.
- Milestones: SSR 7/05; PDR 5/06 (TRL5); CDR 5/07 (TRL6); Launch 12/08 (TRL7 after 6-month on-orbit experiment)
- Our team’s overall goal is to demonstrate that EAFTC is a competitive and low-risk solution for missions needing COTS high-performance on-board payload processing.
- We will demonstrate that by using EAFTC we can maximize and significantly improve the performance of a COTS-based computer in orbit.
Summary
- EAFTC is an enabling technology for high-performance spacecraft computing.
- As part of our NMP-sponsored efforts, a TRL4 system has been demonstrated.
- Efforts continue towards a TRL5 system demonstration.