John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor.


1 John David Eriksen Jamie Unger-Fink High-Performance, Dependable Multiprocessor

2 Background and Motivation
• Traditional space computing limited primarily to mission-critical applications
  ◦ Spacecraft control
  ◦ Life support
• Data collected in space and processed on the ground
• Data sets in space applications continue to grow

3 Background and Motivation
• Communication bandwidth not growing fast enough to cope with increasing size of data sets
  ◦ Instruments and sensors grow in capability
• Increasing need for on-board data processing
  ◦ Perform data filtering and other operations on-board
  ◦ Autonomous systems demand more computing power

4 Related Work
• Advanced Onboard Signal Processor (AOSP)
  ◦ Developed in the 1970s and 1980s
  ◦ Helped develop understanding of the effects of radiation on computing systems and components
• Advanced Architecture Onboard Processor (AAOP)
  ◦ Engineered new approaches to onboard data processing

5 Related Work
• Space Touchstone
  ◦ First COTS-based, fault-tolerant (FT), high-performance system
• Remote Exploration and Experimentation
  ◦ Extended FT techniques to parallel and cluster computing
  ◦ Focused on low-cost, high-performance, good power-ratio compute cluster designs

6 Goal
• Address need for increased data processing requirements
• Bring COTS systems to space
  ◦ COTS (Commercial Off-The-Shelf)
    ▪ Less expensive
    ▪ General-purpose
• Need special considerations to meet requirements of aerospace environments
    ▪ Fault-tolerance
    ▪ High reliability
    ▪ High availability

7 Dependable Multiprocessor is…
• A reconfigurable cluster computer with centralized control.

8 Dependable Multiprocessor is…
• A hardware architecture
  ◦ High-performance characteristics
  ◦ Scalable
  ◦ Upgradable (thanks to reliance on COTS)
• A parallel processing environment
  ◦ Supports a common scientific computing development environment (FEMPI)
• A fault-tolerant computing platform
  ◦ System controllers provide FT properties
• A toolset for predicting application behavior
  ◦ Fault behavior, performance, availability…

9 Hardware Architecture
• Redundant radiation-hardened system controller
• Cluster of COTS-based reconfigurable data processors
• Redundant COTS-based packet-switched networks
• Radiation-hardened mass data store
• Redundancy available in:
  ◦ System controller
  ◦ Network
  ◦ Configurable N-of-M sparing in compute nodes
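The benefit of N-of-M sparing can be estimated with a short calculation. The sketch below is illustrative only: it assumes each compute node is up independently with the same probability `p`, which the slides do not state, and the function name `n_of_m_availability` is hypothetical.

```python
from math import comb


def n_of_m_availability(n: int, m: int, p: float) -> float:
    """Probability that at least n of m nodes are working.

    Assumption (not from the slides): node failures are independent and
    every node is up with the same probability p.
    """
    if not 0 <= n <= m:
        raise ValueError("need 0 <= n <= m")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    # Sum the binomial probabilities of exactly k working nodes, k = n..m.
    return sum(comb(m, k) * p**k * (1 - p) ** (m - k) for k in range(n, m + 1))


if __name__ == "__main__":
    # Three nodes required: compare no spare (3-of-3) with one spare (3-of-4).
    print(n_of_m_availability(3, 3, 0.9))
    print(n_of_m_availability(3, 4, 0.9))
```

Under these assumptions, adding a single spare node to a job that needs three nodes raises the chance of having enough working nodes from 0.9³ to 0.9⁴ + 4·0.9³·0.1.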

10

11 Hardware Architecture
• Scalability
  ◦ Variable number of compute nodes
  ◦ Cluster-of-cluster
• Compute nodes
  ◦ IBM PowerPC 750FX general-purpose processor
  ◦ Xilinx Virtex-II 6000 FPGA co-processor
    ▪ Reconfigurable to fulfill various roles
    ▪ DSP processor
    ▪ Data compression
    ▪ Vector processing
    ▪ Applications implemented in hardware can be very fast
  ◦ Memory and other support chips

12

13

14 Hardware Architecture
• Network Interconnect
  ◦ Gigabit Ethernet for data exchange
  ◦ A low-latency, low-bandwidth bus used for control
• Mission Interface
  ◦ Provides interface to rest of space vehicle’s computer systems
  ◦ Radiation-hardened

15 Hardware Architecture
• Current hardware implementation
  ◦ Four data processors
  ◦ Two redundant system controllers
  ◦ One mass data store
  ◦ Two Gigabit Ethernet networks including two network switches
  ◦ Software-controlled instrumented power supply
  ◦ Workstation running spacecraft system emulator software

16

17

18
• Platform layer: the lowest layer; interfaces hardware to middleware and contains hardware-specific software and network drivers
  ◦ Uses Linux, which allows use of many existing software tools
• Mission layer:
• Middleware: includes DM System Services (fault tolerance, job management, etc.)

19

20
• DM Framework is application-independent and platform-independent
• API to communicate with the mission layer, SAL (System Abstraction Layer) for the platform layer
• Allows for future applications by facilitating porting to new platforms

21
• HA Middleware foundation includes: Availability Management (AMS), Distributed Messaging (DMS), Cluster Management (CMS)
• Primary functions
  ◦ Resource monitoring
  ◦ Fault detection, diagnosis, recovery and reporting
  ◦ Cluster configuration
  ◦ Event logging
  ◦ Distributed messaging
• Based on small, cross-platform kernel

22
• Hosted on the cluster’s system controller
• Managed resources include:
  ◦ Applications
  ◦ Operating system
  ◦ Chassis
  ◦ I/O cards
  ◦ Redundant CPUs
  ◦ Networks
  ◦ Peripherals
  ◦ Clusters
  ◦ Other middleware

23
• Provides a reliable messaging layer for communications in the DM cluster
• Used for checkpointing, client/server, communications, event notification, fault management, time-critical communications
• Application opens a DMS connection (channel) to pass data to interested subscribers
• Since messaging is in middleware instead of lower layers, the application does not have to specify a destination address explicitly
• Messages are classified, and machines choose to receive messages of a certain type
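The type-based delivery described above resembles a publish/subscribe pattern. The following is a minimal in-process sketch of that idea, not the DMS API: the class `MessageBus`, its methods, and the message type strings are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Handler = Callable[[Any], None]


class MessageBus:
    """Toy publish/subscribe bus (hypothetical; not the real DMS interface)."""

    def __init__(self) -> None:
        # Maps a message type to the handlers that asked to receive it.
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, msg_type: str, handler: Handler) -> None:
        """Register interest in every message of the given type."""
        self._subscribers[msg_type].append(handler)

    def publish(self, msg_type: str, payload: Any) -> int:
        """Deliver payload to all subscribers of msg_type.

        The sender names only the message type, never a destination.
        Returns the number of handlers that received the message.
        """
        handlers = list(self._subscribers.get(msg_type, ()))
        for handler in handlers:
            handler(payload)
        return len(handlers)


if __name__ == "__main__":
    bus = MessageBus()
    bus.subscribe("node_failure", lambda msg: print("FT manager saw:", msg))
    bus.publish("node_failure", {"node": 3})
```

The sender never learns who is listening, which is what lets subscribers come and go without changes to the publishing application.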

24
• Manages physical nodes or instances of HA middleware
• Discovers and monitors nodes in a cluster
• Passes node failures to AMS and FT Manager via DMS

25
• Database Management
• Logging Services
• Tracing

26
• Interface to control computer or ground station
• Communicates with system via DMS
• Monitors system health with FT Manager
  ◦ “Heartbeat”

27
• Detects and recovers from system faults
• The FT Manager (FTM) consults a set of recovery policies at runtime
• Relies on distributed software agents to gather system and application liveness information
  ◦ Avoids monitoring bottleneck
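Heartbeat-based liveness tracking of the kind mentioned above can be sketched in a few lines. This is an illustration under stated assumptions, not the FT Manager's implementation: the class `HeartbeatMonitor`, the fixed timeout, and the agent names are hypothetical, and timestamps are passed in explicitly so the example stays deterministic.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat time per agent (hypothetical sketch)."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self._last_seen: Dict[str, float] = {}

    def beat(self, agent: str, now: float) -> None:
        """Record a heartbeat from an agent at time `now` (seconds)."""
        self._last_seen[agent] = now

    def failed(self, now: float) -> List[str]:
        """Return agents whose last heartbeat is older than the timeout."""
        return sorted(
            agent
            for agent, seen in self._last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout=2.0)
    monitor.beat("node-a", now=0.0)
    monitor.beat("node-b", now=0.0)
    monitor.beat("node-a", now=2.5)   # node-b stays silent
    print(monitor.failed(now=3.0))    # node-b has exceeded the timeout
```

In a real system the list returned by `failed` would be handed to whatever recovery policy applies; here it is only printed.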

28
• Provides application scheduling and resource allocation
• Opportunistic load balancing scheduler
• Jobs are registered and tracked by the Job Manager (JM) via tables
• Checkpointing to allow seamless recovery of the JM
• Sends heartbeats to the FT Manager via middleware
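The slides do not describe the scheduler's internals. As a rough illustration of the general opportunistic load balancing (OLB) heuristic, the sketch below hands each job, in arrival order, to whichever node is expected to become free first, without considering how well the job suits that node. The function name `olb_schedule` and the use of job durations to simulate node availability are assumptions made for the example.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def olb_schedule(
    jobs: Sequence[Tuple[str, float]], nodes: Sequence[str]
) -> Dict[str, List[str]]:
    """Assign (name, duration) jobs to nodes with a simple OLB heuristic.

    Hypothetical sketch: durations are used only to simulate when a node
    becomes free again, not to pick the best node for a job.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    # Min-heap of (time the node becomes free, node index).
    free_at = [(0.0, i) for i in range(len(nodes))]
    heapq.heapify(free_at)
    assignment: Dict[str, List[str]] = {node: [] for node in nodes}
    for name, duration in jobs:
        ready_time, i = heapq.heappop(free_at)   # next available node
        assignment[nodes[i]].append(name)
        heapq.heappush(free_at, (ready_time + duration, i))
    return assignment


if __name__ == "__main__":
    jobs = [("a", 5), ("b", 1), ("c", 1), ("d", 1)]
    print(olb_schedule(jobs, ["n0", "n1"]))
```

With one long job and three short ones on two nodes, the long job occupies the first node while the short jobs all land on the second.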

29
• Fault-Tolerant Embedded Message Passing Interface (FEMPI)
  ◦ Application-independent FT middleware
  ◦ Message Passing Interface (MPI) Standard
  ◦ Built on top of HA middleware

30
• Recovery from failure should be automatic, with minimal impact
• Needs to maintain global awareness of the processes in parallel applications
• 3 stages:
  ◦ Fault detection
  ◦ Notification
  ◦ Recovery
• Process failures vs. network failures
• Survives the crash of n-1 processes in an n-process job

31
• Proprietary nature of FPGA industry
• USURP: USURP’s Standard for Unified Reconfigurable Platforms
  ◦ Standard to interact with hardware
  ◦ Provides middleware for portability
  ◦ Black box IP cores
  ◦ Wrappers mask FPGA board

32
• Not a universal tool for mapping high-level code with hardware design
• OpenFPGA
• Adaptive Computing System (ACS) vs. USURP
  ◦ Object-oriented models vs. software APIs
• IGOL
• BLAST
• CARMA

33
• Responsible for:
    ▪ Unifying vendor APIs
    ▪ Standardizing HW interface
    ▪ Organization of data for the user application core
    ▪ Exposing the developer to common FPGA resources

34
• User-level protocol for system recovery
• Consists of:
  ◦ Server process that runs on Mass Data Store
    ▪ DMS
  ◦ API for applications
    ▪ C-type interfaces

35
• Algorithm-Based Fault Tolerance (ABFT) Library
• Collection of mathematical routines that can detect and correct faults
• BLAS-3 Library
  ◦ Matrix multiply, LU decomposition, QR decomposition, singular value decomposition (SVD) and fast Fourier transform (FFT)
• Uses checksums
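How checksums let a matrix routine detect and correct a fault can be shown with a common checksum-matrix scheme for matrix multiplication. The slides do not say which scheme the ABFT library uses, so the code below is a generic sketch with hypothetical function names. It uses integer matrices so comparisons are exact; floating-point data would need a tolerance.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def with_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append one column holding the sum of each row of b."""
    return [row[:] + [sum(row)] for row in b]


def checked_product(a, b):
    """Product whose last row and column are checksums of the result."""
    return matmul(with_column_checksum(a), with_row_checksum(b))


def detect_and_correct(c):
    """Repair at most one corrupted data element of a checksummed matrix.

    Returns the (row, col) that was fixed, or None if all checksums agree.
    """
    n_rows, n_cols = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(n_rows) if sum(c[i][:n_cols]) != c[i][n_cols]]
    bad_cols = [
        j for j in range(n_cols)
        if sum(c[i][j] for i in range(n_rows)) != c[n_rows][j]
    ]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # The row checksum says how far the corrupted element is off.
        c[i][j] += c[i][n_cols] - sum(c[i][:n_cols])
        return (i, j)
    raise ValueError("cannot correct: more than one element looks corrupted")


if __name__ == "__main__":
    c = checked_product([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    c[1][0] = 99                      # simulate a single upset
    print(detect_and_correct(c), c[1][0])
```

A single bad element breaks exactly one row checksum and one column checksum, and their intersection locates it.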

36
• Triple Modular Redundancy
• Process-Level Replication
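Triple modular redundancy runs a computation three times and takes the majority result. The sketch below shows only the voting step in software; the names `tmr_vote` and `run_tmr` are hypothetical, and running the three copies sequentially in one process is a simplification made for the example.

```python
def tmr_vote(a, b, c):
    """Return the value that at least two of the three replicas agree on."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: all three replicas disagree")


def run_tmr(func, *args):
    """Run func three times and vote on the results.

    Simplification: a real system would run the replicas on separate
    hardware or in separate processes.
    """
    return tmr_vote(func(*args), func(*args), func(*args))


if __name__ == "__main__":
    print(tmr_vote(42, 42, 7))        # one faulty replica is outvoted
    print(run_tmr(sum, [1, 2, 3]))
```

A single faulty replica is masked, while three mutually different results are reported as an unrecoverable disagreement.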

37 Conclusion
• System architecture has been defined
• Testbench has been assembled
• Improvements:
  ◦ More aggressively address power consumption issues
  ◦ Add support for other scientific computing platforms such as Fortran

