High-Performance, Dependable Multiprocessor
John David Eriksen, Jamie Unger-Fink

Background and Motivation
• Traditional space computing limited primarily to mission-critical applications
  ◦ Spacecraft control
  ◦ Life support
• Data collected in space and processed on the ground
• Data sets in space applications continue to grow

Background and Motivation
• Communication bandwidth not growing fast enough to cope with increasing size of data sets
  ◦ Instruments and sensors grow in capability
• Increasing need for onboard data processing
  ◦ Perform data filtering and other operations onboard
  ◦ Autonomous systems demand more computing power

Related Work
• Advanced Onboard Signal Processor (AOSP)
  ◦ Developed in the 1970s and 1980s
  ◦ Helped develop understanding of the effects of radiation on computing systems and components
• Advanced Architecture Onboard Processor (AAOP)
  ◦ Engineered new approaches to onboard data processing

Related Work
• Space Touchstone
  ◦ First COTS-based, fault-tolerant (FT), high-performance system
• Remote Exploration and Experimentation
  ◦ Extended FT techniques to parallel and cluster computing
  ◦ Focused on low-cost, high-performance compute cluster designs with a good performance-to-power ratio

Goal
• Address need for increased data processing requirements
• Bring COTS systems to space
  ◦ COTS (Commercial Off-The-Shelf)
    • Less expensive
    • General-purpose
    • Need special considerations to meet requirements of aerospace environments
    • Fault-tolerance
    • High reliability
    • High availability

Dependable Multiprocessor is…
• A reconfigurable cluster computer with centralized control

Dependable Multiprocessor is…
• A hardware architecture
  ◦ High-performance characteristics
  ◦ Scalable
  ◦ Upgradable (thanks to reliance on COTS)
• A parallel processing environment
  ◦ Support common scientific computing development environment (FEMPI)
• A fault-tolerant computing platform
  ◦ System controllers provide FT properties
• A toolset for predicting application behavior
  ◦ Fault behavior, performance, availability…

Hardware Architecture
• Redundant radiation-hardened system controller
• Cluster of COTS-based reconfigurable data processors
• Redundant COTS-based packet-switched networks
• Radiation-hardened mass data store
• Redundancy available in:
  ◦ System controller
  ◦ Network
  ◦ Configurable N-of-M sparing in compute nodes
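
As an illustration of N-of-M sparing, the sketch below treats a group of M compute nodes as operational while at least N of them are healthy, and picks the N active nodes from the healthy ones. The function name and node labels are hypothetical and are not taken from the DM system; the real sparing logic is not described in the slides.

```python
def select_active_nodes(health, n_required):
    """Pick n_required healthy nodes out of M, or return None if too few remain.

    health: dict mapping node name -> True (healthy) / False (failed).
    Hypothetical helper for illustration; not part of the DM software.
    """
    healthy = sorted(name for name, ok in health.items() if ok)
    if len(healthy) < n_required:
        return None  # fewer than N healthy nodes: the N-of-M requirement is not met
    return healthy[:n_required]  # any remaining healthy nodes act as spares


# 3-of-4 sparing: one failure is tolerated, a second one is not.
nodes = {"dp0": True, "dp1": True, "dp2": True, "dp3": True}
nodes["dp1"] = False
after_one_failure = select_active_nodes(nodes, 3)
nodes["dp3"] = False
after_two_failures = select_active_nodes(nodes, 3)
```

With four nodes and N = 3, the first failure is absorbed by the spare, while the second leaves too few healthy nodes and the function returns `None`.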

Hardware Architecture
• Scalability
  ◦ Variable number of compute nodes
  ◦ Cluster-of-cluster
• Compute nodes
  ◦ IBM PowerPC 750FX general-purpose processor
  ◦ Xilinx Virtex-II 6000 FPGA co-processor
    • Reconfigurable to fulfill various roles
    • DSP processor
    • Data compression
    • Vector processing
    • Applications implemented in hardware can be very fast
  ◦ Memory and other support chips

Hardware Architecture
• Network Interconnect
  ◦ Gigabit Ethernet for data exchange
  ◦ A low-latency, low-bandwidth bus used for control
• Mission Interface
  ◦ Provides interface to rest of space vehicle’s computer systems
  ◦ Radiation-hardened

Hardware Architecture
• Current hardware implementation
  ◦ Four data processors
  ◦ Two redundant system controllers
  ◦ One mass data store
  ◦ Two Gigabit Ethernet networks including two network switches
  ◦ Software-controlled instrumented power supply
  ◦ Workstation running spacecraft system emulator software

• Platform layer: the lowest layer; interfaces hardware to middleware and contains hardware-specific software and network drivers
  ◦ Uses Linux, which allows use of many existing software tools
• Mission Layer:
• Middleware: includes DM System Services (fault tolerance, job management, etc.)

• DM Framework is application-independent and platform-independent
• API to communicate with the mission layer; SAL (System Abstraction Layer) for the platform layer
• Allows for future applications by facilitating porting to new platforms

• HA Middleware foundation includes: Availability Management (AMS), Distributed Messaging (DMS), Cluster Management (CMS)
• Primary functions
  ◦ Resource monitoring
  ◦ Fault detection, diagnosis, recovery and reporting
  ◦ Cluster configuration
  ◦ Event logging
  ◦ Distributed messaging
• Based on small, cross-platform kernel

• Hosted on the cluster’s system controller
• Managed Resources include:
  ◦ Applications
  ◦ Operating System
  ◦ Chassis
  ◦ I/O cards
  ◦ Redundant CPUs
  ◦ Networks
  ◦ Peripherals
  ◦ Clusters
  ◦ Other middleware

• Provides a reliable messaging layer for communications in the DM cluster
• Used for checkpointing, client/server communications, event notification, fault management, and time-critical communications
• Application opens a DMS connection (channel) to pass data to interested subscribers
• Since messaging is in middleware instead of lower layers, the application does not have to specify a destination address explicitly
• Messages are classified, and machines choose to receive messages of a certain type
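
The type-based delivery described above can be sketched as a small in-process publish/subscribe bus. This is only an illustration of the idea: the class, method names, and message types below are hypothetical and do not reflect the actual DMS API, and a real implementation would deliver across the network with reliability guarantees.

```python
from collections import defaultdict


class MessageBus:
    """Toy in-process publish/subscribe bus (hypothetical; not the DMS API)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # message type -> list of callbacks

    def subscribe(self, msg_type, callback):
        """Register interest in one class of message."""
        self._subscribers[msg_type].append(callback)

    def publish(self, msg_type, payload):
        """Deliver payload to every subscriber of msg_type; return the delivery count."""
        # The publisher names only the message type, never a destination address.
        for callback in self._subscribers[msg_type]:
            callback(payload)
        return len(self._subscribers[msg_type])


bus = MessageBus()
ft_inbox, log_inbox = [], []
bus.subscribe("node_failure", ft_inbox.append)
bus.subscribe("node_failure", log_inbox.append)
bus.subscribe("checkpoint", log_inbox.append)

delivered = bus.publish("node_failure", {"node": "dp2"})
bus.publish("checkpoint", {"job": 7})
```

Here the publisher of `node_failure` never learns who is listening; both subscribers receive the message because they registered for that type.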

• Manages physical nodes or instances of HA middleware
• Discovers and monitors nodes in a cluster
• Passes node failures to AMS and FT Manager via DMS

• Database Management
• Logging Services
• Tracing

• Interface to control computer or ground station
• Communicates with system via DMS
• Monitors system health with FT Manager
  ◦ “Heartbeat”

• Detects and recovers from system faults
• FT Manager (FTM) refers to a set of recovery policies at runtime
• Relies on distributed software agents to gather system and application liveness information
  ◦ Avoids monitoring bottleneck
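
Heartbeats are one common way to gather liveness information. The sketch below flags any agent whose last heartbeat is older than a timeout and then looks up a recovery action in a policy table. The class, agent names, timeout value, and policy names are all hypothetical; timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Flags agents whose last heartbeat is older than a timeout (illustrative only)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # agent name -> time of most recent heartbeat

    def beat(self, agent, now):
        """Record a heartbeat from an agent at time `now`."""
        self.last_seen[agent] = now

    def failed(self, now):
        """Return the agents that have been silent for longer than the timeout."""
        return sorted(a for a, t in self.last_seen.items() if now - t > self.timeout)


monitor = HeartbeatMonitor(timeout=3.0)
monitor.beat("job_manager", now=0.0)
monitor.beat("dp0_agent", now=0.0)
monitor.beat("job_manager", now=2.0)  # dp0_agent stays silent from here on
suspects = monitor.failed(now=4.0)

# Hypothetical policy table consulted at runtime; unknown agents get a default action.
recovery_policies = {"job_manager": "restore_from_checkpoint"}
actions = {agent: recovery_policies.get(agent, "restart") for agent in suspects}
```

Because each agent reports for itself, the monitor only compares timestamps, which keeps the central component light.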

• Provides application scheduling and resource allocation
• Opportunistic load balancing scheduler
• Jobs are registered and tracked by the Job Manager (JM) via tables
• Checkpointing to allow seamless recovery of the JM
• Sends heartbeats to the FT Manager via middleware
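
Opportunistic load balancing (OLB), in its usual textbook form, hands each job in arrival order to whichever node becomes free next, without considering how long the job is expected to run on that node. The sketch below simulates that policy; it is a generic illustration, not the JM's actual scheduler, and the function name, node names, and durations are assumptions.

```python
import heapq


def olb_schedule(job_durations, node_names):
    """Assign jobs in arrival order to whichever node frees up first.

    Durations are used only to simulate when a node becomes free again;
    the placement decision itself never looks at them.
    Returns a list of (job_id, node, start_time) tuples.
    """
    ready = [(0.0, name) for name in node_names]  # (time node becomes free, node)
    heapq.heapify(ready)
    assignment = []
    for job_id, duration in enumerate(job_durations):
        free_at, node = heapq.heappop(ready)  # next node to become available
        assignment.append((job_id, node, free_at))
        heapq.heappush(ready, (free_at + duration, node))
    return assignment


plan = olb_schedule([5.0, 2.0, 1.0, 4.0], ["dp0", "dp1"])
```

In this run the long first job occupies `dp0`, so the three shorter jobs all land on `dp1` as it repeatedly becomes free first.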

• Fault-Tolerant Embedded Message Passing Interface (FEMPI)
  ◦ Application-independent FT middleware
  ◦ Message Passing Interface (MPI) Standard
  ◦ Built on top of HA middleware

• Recovery from failure should be automatic, with minimal impact
• Needs to maintain global awareness of the processes in parallel applications
• 3 Stages:
  ◦ Fault Detection
  ◦ Notification
  ◦ Recovery
• Process failures vs. network failures
• Survives the crash of n-1 processes in an n-process job

• Proprietary nature of FPGA industry
• USURP - USURP’s Standard for Unified Reconfigurable Platforms
  ◦ Standard to interact with hardware
  ◦ Provides middleware for portability
  ◦ Black box IP cores
  ◦ Wrappers mask FPGA board

• Not a universal tool for mapping high-level code to hardware design
• OpenFPGA
• Adaptive Computing System (ACS) vs. USURP
  ◦ Object-oriented models vs. software APIs
• IGOL
• BLAST
• CARMA

• Responsible for:
• Unifying vendor APIs
• Standardizing HW interface
• Organization of data for the user application core
• Exposing the developer to common FPGA resources

• User-level protocol for system recovery
• Consists of:
  ◦ Server process that runs on Mass Data Store
    • DMS
  ◦ API for applications
    • C-type interfaces

• Algorithm-based Fault Tolerance (ABFT) Library
• Collection of mathematical routines that can detect and correct faults
• BLAS-3 Library
  ◦ Matrix multiply, LU decomposition, QR decomposition, singular value decomposition (SVD) and fast Fourier transform (FFT)
• Uses checksums
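
A well-known checksum technique for matrix multiply appends a row of column sums to the left operand and a column of row sums to the right operand; the product then carries both kinds of checksum, and a single corrupted element shows up as exactly one bad row and one bad column. The sketch below illustrates that general scheme in plain Python with integer matrices. It is not the DM library's implementation, all function names are hypothetical, and floating-point data would need a tolerance instead of exact comparison.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def with_column_checksum(a):
    """Append a row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row the sum of that row."""
    return [row + [sum(row)] for row in b]


def detect_and_correct(c):
    """Locate and fix a single corrupted data element of a full-checksum matrix.

    Returns the (row, col) that was repaired, or None if all checksums agree.
    """
    n_rows, n_cols = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(n_rows) if sum(c[i][:n_cols]) != c[i][n_cols]]
    bad_cols = [j for j in range(n_cols)
                if sum(c[i][j] for i in range(n_rows)) != c[n_rows][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern is not a single data-element fault")
    i, j = bad_rows[0], bad_cols[0]
    # Rebuild the element from its row checksum and the other entries in the row.
    c[i][j] = c[i][n_cols] - (sum(c[i][:n_cols]) - c[i][j])
    return (i, j)


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
c_full[0][1] += 100  # inject a fault into one result element
repaired = detect_and_correct(c_full)
product = [row[:2] for row in c_full[:2]]
```

After the injected fault is located at row 0, column 1 and repaired, `product` again equals the fault-free result of multiplying `a` by `b`.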

• Triple Modular Redundancy
• Process-Level Replication
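
Triple modular redundancy runs the same computation three times and takes the majority result, so one faulty replica is outvoted. A minimal voter might look like the sketch below; the function name is hypothetical, results are assumed to be hashable values, and the slides do not describe how the DM system actually performs voting.

```python
from collections import Counter


def tmr_vote(results):
    """Return the majority value among three replica results.

    Raises ValueError when no two replicas agree.
    """
    if len(results) != 3:
        raise ValueError("TMR expects exactly three replica results")
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority among replicas")
    return value


voted = tmr_vote([42, 42, 17])  # one replica produced a corrupted result
```

If all three replicas disagree, the voter raises an error instead of guessing, leaving the decision to a higher-level recovery policy.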

Conclusion
• System architecture has been defined
• Testbench has been assembled
• Improvements:
  ◦ More aggressively address power consumption issues
  ◦ Add support for other scientific computing platforms such as Fortran
