High-Performance, Dependable Multiprocessor
John David Eriksen and Jamie Unger-Fink

Background and Motivation
• Traditional space computing has been limited primarily to mission-critical applications
  ◦ Spacecraft control
  ◦ Life support
• Data is collected in space and processed on the ground
• Data sets in space applications continue to grow

Background and Motivation
• Communication bandwidth is not growing fast enough to cope with the increasing size of data sets
  ◦ Instruments and sensors continue to grow in capability
• Increasing need for on-board data processing
  ◦ Perform data filtering and other operations on-board
  ◦ Autonomous systems demand more computing power

Related Work
• Advanced Onboard Signal Processor (AOSP)
  ◦ Developed in the 1970s and 1980s
  ◦ Helped develop an understanding of radiation effects on computing systems and components
• Advanced Architecture Onboard Processor (AAOP)
  ◦ Engineered new approaches to onboard data processing

Related Work
• Space Touchstone
  ◦ First COTS-based, fault-tolerant, high-performance system
• Remote Exploration and Experimentation (REE)
  ◦ Extended fault-tolerance techniques to parallel and cluster computing
  ◦ Focused on low-cost, high-performance compute cluster designs with good power ratios

Goal
• Address the need for increased on-board data processing capability
• Bring COTS systems to space
  ◦ COTS (Commercial Off-The-Shelf)
    - Less expensive
    - General-purpose
    - Needs special consideration to meet the requirements of aerospace environments:
      - Fault tolerance
      - High reliability
      - High availability

Dependable Multiprocessor is…
• A reconfigurable cluster computer with centralized control

Dependable Multiprocessor is…
• A hardware architecture
  ◦ High-performance characteristics
  ◦ Scalable
  ◦ Upgradable (thanks to its reliance on COTS)
• A parallel processing environment
  ◦ Supports a common scientific computing development environment (FEMPI)
• A fault-tolerant computing platform
  ◦ System controllers provide fault-tolerance properties
• A toolset for predicting application behavior
  ◦ Fault behavior, performance, availability, …

Hardware Architecture
• Redundant radiation-hardened system controller
• Cluster of COTS-based reconfigurable data processors
• Redundant COTS-based packet-switched networks
• Radiation-hardened mass data store
• Redundancy available in:
  ◦ System controller
  ◦ Network
  ◦ Configurable N-of-M sparing in the compute nodes (sketched below)
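The following sketch illustrates the N-of-M sparing idea in C: keep N of the M installed data processors active and promote a spare when one fails. The node counts, state table, and function names are invented for illustration and are not part of the DM design.

```c
/* Hypothetical sketch of N-of-M sparing among compute nodes.
 * Node IDs, states, and function names are illustrative, not the DM API. */
#include <stdio.h>

#define M_INSTALLED 6   /* nodes physically present               */
#define N_ACTIVE    4   /* nodes required by the current mission  */

typedef enum { NODE_ACTIVE, NODE_SPARE, NODE_FAILED } node_state;

static node_state nodes[M_INSTALLED] = {
    NODE_ACTIVE, NODE_ACTIVE, NODE_ACTIVE, NODE_ACTIVE, NODE_SPARE, NODE_SPARE
};

/* Mark a node failed and, if that drops the system below N active nodes,
 * promote the first available spare to take its place. */
static int handle_node_failure(int failed_id)
{
    nodes[failed_id] = NODE_FAILED;

    int active = 0;
    for (int i = 0; i < M_INSTALLED; i++)
        if (nodes[i] == NODE_ACTIVE) active++;

    for (int i = 0; i < M_INSTALLED && active < N_ACTIVE; i++) {
        if (nodes[i] == NODE_SPARE) {
            nodes[i] = NODE_ACTIVE;       /* spare promoted to active duty */
            active++;
            printf("node %d promoted to replace node %d\n", i, failed_id);
        }
    }
    return active >= N_ACTIVE ? 0 : -1;   /* -1: cannot sustain N-of-M */
}

int main(void)
{
    return handle_node_failure(2) == 0 ? 0 : 1;
}
```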

Hardware Architecture
• Scalability
  ◦ Variable number of compute nodes
  ◦ Cluster-of-clusters configurations
• Compute nodes
  ◦ IBM PowerPC 750FX general-purpose processor
  ◦ Xilinx Virtex-II 6000 FPGA co-processor
    - Reconfigurable to fulfill various roles: DSP processor, data compression, vector processing
    - Applications implemented in hardware can be very fast
  ◦ Memory and other support chips

Hardware Architecture
• Network interconnect
  ◦ Gigabit Ethernet for data exchange
  ◦ A low-latency, low-bandwidth bus used for control
• Mission interface
  ◦ Provides the interface to the rest of the space vehicle's computer systems
  ◦ Radiation-hardened

Hardware Architecture
• Current hardware implementation:
  ◦ Four data processors
  ◦ Two redundant system controllers
  ◦ One mass data store
  ◦ Two Gigabit Ethernet networks, including two network switches
  ◦ Software-controlled, instrumented power supply
  ◦ A workstation running spacecraft system emulator software

• Platform layer: the lowest layer; interfaces the hardware to the middleware (hardware-specific software, network drivers)
  ◦ Uses Linux, allowing the use of many existing software tools
• Mission layer
• Middleware layer: includes the DM System Services (fault tolerance, job management, etc.)

• The DM Framework is application independent and platform independent
• An API communicates with the mission layer; a SAL (System Abstraction Layer) communicates with the platform layer (sketched below)
• Allows for future applications by facilitating porting to new platforms
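A minimal sketch of how a System Abstraction Layer can isolate the framework from platform details, assuming a table of function pointers that each port fills in; all type and function names here are hypothetical, not the actual DM interfaces.

```c
/* Hypothetical SAL sketch: framework code calls only these function
 * pointers, and each platform port supplies its own table. */
#include <stddef.h>
#include <stdio.h>

typedef struct {
    int  (*net_send)(int node_id, const void *buf, size_t len);
    int  (*net_recv)(int node_id, void *buf, size_t len);
    int  (*node_reset)(int node_id);
    long (*time_ms)(void);
} sal_ops;

/* A Linux/Ethernet port would fill the table with socket-based routines;
 * here the stubs only log, to keep the sketch self-contained. */
static int  stub_send(int n, const void *b, size_t l) { (void)b; printf("send %zu bytes to node %d\n", l, n); return 0; }
static int  stub_recv(int n, void *b, size_t l)       { (void)n; (void)b; (void)l; return 0; }
static int  stub_reset(int n)                         { printf("reset node %d\n", n); return 0; }
static long stub_time(void)                           { return 0; }

static const sal_ops linux_sal = { stub_send, stub_recv, stub_reset, stub_time };

/* Porting to a new platform means supplying a new table rather than
 * changing the framework code written against sal_ops. */
int main(void)
{
    const sal_ops *sal = &linux_sal;
    sal->net_send(1, "hello", 5);
    return 0;
}
```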

• The HA Middleware foundation includes: Availability Management (AMS), Distributed Messaging (DMS), and Cluster Management (CMS)
• Primary functions
  ◦ Resource monitoring
  ◦ Fault detection, diagnosis, recovery, and reporting
  ◦ Cluster configuration
  ◦ Event logging
  ◦ Distributed messaging
• Based on a small, cross-platform kernel

• Hosted on the cluster's system controller
• Managed resources include:
  ◦ Applications
  ◦ Operating system
  ◦ Chassis
  ◦ I/O cards
  ◦ Redundant CPUs
  ◦ Networks
  ◦ Peripherals
  ◦ Clusters
  ◦ Other middleware

• Provides a reliable messaging layer for communications in the DM cluster
• Used for checkpointing, client/server communications, event notification, fault management, and time-critical communications
• An application opens a DMS connection (channel) to pass data to interested subscribers
• Because messaging is handled in the middleware rather than in lower layers, the application does not have to specify an explicit destination address
• Messages are classified by type, and machines choose to receive messages of a certain type (see the sketch below)
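A small illustrative model of the type-based publish/subscribe pattern described above, written as a single C program; the dms_* names are invented, and the real DMS is a distributed service rather than an in-process table.

```c
/* Minimal publish/subscribe sketch in the spirit of the DMS description:
 * publishers tag messages with a type, and subscribers register interest
 * in types rather than in destination addresses. */
#include <stdio.h>
#include <string.h>

#define MAX_SUBS 8

typedef void (*dms_handler)(const char *msg);

typedef struct {
    char        type[32];   /* message classification, e.g. "fault"         */
    dms_handler handler;    /* callback invoked when such a message arrives */
} subscription;

static subscription subs[MAX_SUBS];
static int n_subs = 0;

static void dms_subscribe(const char *type, dms_handler h)
{
    if (n_subs < MAX_SUBS) {
        strncpy(subs[n_subs].type, type, sizeof subs[n_subs].type - 1);
        subs[n_subs].handler = h;
        n_subs++;
    }
}

/* The publisher names only the message type; delivery to every interested
 * subscriber is the middleware's job. */
static void dms_publish(const char *type, const char *msg)
{
    for (int i = 0; i < n_subs; i++)
        if (strcmp(subs[i].type, type) == 0)
            subs[i].handler(msg);
}

static void on_fault(const char *msg) { printf("fault manager got: %s\n", msg); }

int main(void)
{
    dms_subscribe("fault", on_fault);
    dms_publish("fault", "node 3 missed heartbeat");
    return 0;
}
```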

• Manages physical nodes, or instances of the HA middleware
• Discovers and monitors the nodes in a cluster
• Passes node failures to the AMS and the FT Manager via DMS

• Database management
• Logging services
• Tracing

• Interface to the control computer or ground station
• Communicates with the system via DMS
• Monitors system health with the FT Manager
  ◦ "Heartbeat"

• Detects and recovers from system faults
• The FTM consults a set of recovery policies at runtime
• Relies on distributed software agents to gather system and application liveness information (see the sketch below)
  ◦ Avoids a centralized monitoring bottleneck
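A rough sketch of heartbeat-style liveness monitoring of the kind the agents provide, assuming a fixed timeout; the table layout, timeout value, and function names are illustrative only.

```c
/* Sketch of heartbeat-based liveness monitoring: each node's agent reports
 * periodically, and the fault-tolerance manager flags nodes whose last
 * report is too old. */
#include <stdio.h>
#include <time.h>

#define NUM_NODES     4
#define HB_TIMEOUT_S  5   /* assumed heartbeat timeout in seconds */

static time_t last_heartbeat[NUM_NODES];

/* Called when a heartbeat message for node_id arrives (e.g. over DMS). */
static void record_heartbeat(int node_id)
{
    last_heartbeat[node_id] = time(NULL);
}

/* Periodic check run by the fault-tolerance manager. */
static void check_liveness(void)
{
    time_t now = time(NULL);
    for (int i = 0; i < NUM_NODES; i++) {
        if (last_heartbeat[i] != 0 && now - last_heartbeat[i] > HB_TIMEOUT_S)
            printf("node %d presumed failed: no heartbeat for %ld s\n",
                   i, (long)(now - last_heartbeat[i]));
    }
}

int main(void)
{
    record_heartbeat(0);
    record_heartbeat(1);
    check_liveness();   /* nothing reported: heartbeats are fresh */
    return 0;
}
```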

• Provides application scheduling and resource allocation
• Uses an opportunistic load-balancing scheduler (sketched below)
• Jobs are registered and tracked by the JM via tables
• Checkpointing allows seamless recovery of the JM itself
• Sends heartbeats to the FTM via the middleware
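A compact sketch of an opportunistic load-balancing policy: each new job is assigned to the node that currently has the fewest jobs. The job table layout and function names are invented; the real Job Manager also checkpoints this bookkeeping so it can recover seamlessly.

```c
/* Sketch of an opportunistic load-balancing scheduler with a simple job table. */
#include <stdio.h>

#define NUM_NODES 4
#define MAX_JOBS  16

typedef struct {
    int job_id;
    int node;        /* node the job was dispatched to */
} job_entry;

static int       load[NUM_NODES];    /* jobs currently assigned per node */
static job_entry job_table[MAX_JOBS];
static int       n_jobs = 0;

/* Pick the least-loaded node and record the job in the table so the JM
 * can track it (and, with checkpointing, recover the table after a fault). */
static int schedule_job(int job_id)
{
    int best = 0;
    for (int i = 1; i < NUM_NODES; i++)
        if (load[i] < load[best]) best = i;

    load[best]++;
    job_table[n_jobs].job_id = job_id;
    job_table[n_jobs].node   = best;
    n_jobs++;
    return best;
}

int main(void)
{
    for (int id = 1; id <= 6; id++)
        printf("job %d -> node %d\n", id, schedule_job(id));
    return 0;
}
```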

• FEMPI: Fault-Tolerant Embedded Message Passing Interface
  ◦ Application-independent fault-tolerant middleware
  ◦ Based on the Message Passing Interface (MPI) standard
  ◦ Built on top of the HA middleware

• Recovery from failure should be automatic, with minimal impact
• Needs to maintain global awareness of the processes in parallel applications
• Three stages:
  ◦ Fault detection
  ◦ Notification
  ◦ Recovery
• Distinguishes process failures from network failures
• Survives the crash of n-1 processes in an n-process job (see the example below)
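The example below shows the style of error-aware MPI code a fault-tolerant MPI layer is meant to support. It uses only standard MPI calls (FEMPI implements a subset of the MPI standard); the actual recovery action is system specific and left as a comment.

```c
/* Error-aware MPI usage: ask the library to return error codes instead of
 * aborting, so the application or surrounding middleware can react. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Return error codes to the caller rather than terminating the job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (rank == 0) value = 42;

    int rc = MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        /* In a fault-tolerant setting this is where notification and
         * recovery (e.g. restarting from a checkpoint) would begin. */
        fprintf(stderr, "rank %d: broadcast failed, initiating recovery\n", rank);
    } else {
        printf("rank %d received %d\n", rank, value);
    }

    MPI_Finalize();
    return 0;
}
```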

• The FPGA industry is proprietary in nature
• USURP: USURP's Standard for Unified Reconfigurable Platforms
  ◦ A standard for interacting with the hardware
  ◦ Provides middleware for portability
  ◦ Black-box IP cores
  ◦ Wrappers mask the details of the FPGA board

• Not a universal tool for mapping high-level code to hardware designs
• OpenFPGA
• Adaptive Computing System (ACS) vs. USURP
  ◦ Object-oriented models vs. software APIs
• IGOL
• BLAST
• CARMA

• Responsible for:
  ◦ Unifying vendor APIs
  ◦ Standardizing the hardware interface
  ◦ Organizing data for the user application core
  ◦ Exposing common FPGA resources to the developer (see the sketch below)
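A hypothetical illustration of the wrapper idea: the application core calls a small board-independent API, and the wrapper translates it into vendor-specific calls. Both the usurp_* and vendor_* names are invented for this sketch and are not the real USURP interfaces.

```c
/* Illustrative vendor-masking wrapper: the application sees only the
 * usurp_* calls; the vendor_* stubs stand in for a board SDK. */
#include <stdio.h>

/* Stand-ins for a vendor-specific library (normally in the board's SDK). */
static int vendor_configure(const char *bitfile)  { printf("vendor load %s\n", bitfile); return 0; }
static int vendor_poke(unsigned addr, unsigned v) { printf("vendor reg[%u]=%u\n", addr, v); return 0; }

/* Board-independent wrapper API exposed to the user application core. */
int usurp_load_core(const char *bitfile)
{
    return vendor_configure(bitfile);      /* mask the vendor-specific call */
}

int usurp_set_param(unsigned reg, unsigned value)
{
    return vendor_poke(reg, value);        /* uniform register access       */
}

int main(void)
{
    usurp_load_core("fft_core.bit");       /* hypothetical IP core          */
    usurp_set_param(0, 1024);              /* e.g. transform length         */
    return 0;
}
```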

• A user-level protocol for system recovery
• Consists of:
  ◦ A server process that runs on the Mass Data Store
    - DMS
  ◦ An API for applications (usage sketched below)
    - C-style interfaces
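A sketch of how an application might use a C-style checkpointing API of the kind described above. The ckpt_* functions are hypothetical stand-ins stubbed with local files; a real implementation would ship the data to the mass data store over DMS.

```c
/* Application-side checkpointing sketch: save progress each iteration and
 * resume from the last checkpoint after a restart. */
#include <stdio.h>

/* --- hypothetical checkpoint API, stubbed locally for illustration --- */
static int ckpt_store(const char *key, const void *buf, size_t len)
{
    FILE *f = fopen(key, "wb");              /* stand-in for mass data store */
    if (!f) return -1;
    fwrite(buf, 1, len, f);
    fclose(f);
    return 0;
}

static int ckpt_restore(const char *key, void *buf, size_t len)
{
    FILE *f = fopen(key, "rb");
    if (!f) return -1;                        /* no checkpoint: start fresh */
    size_t got = fread(buf, 1, len, f);
    fclose(f);
    return got == len ? 0 : -1;
}

/* --- application: periodically save its progress --- */
int main(void)
{
    int iteration = 0;

    /* On (re)start, try to resume from the last checkpoint. */
    if (ckpt_restore("app.ckpt", &iteration, sizeof iteration) == 0)
        printf("resuming at iteration %d\n", iteration);

    for (; iteration < 10; iteration++) {
        /* ... do one unit of work ... */
        ckpt_store("app.ckpt", &iteration, sizeof iteration);
    }
    return 0;
}
```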

• Algorithm-Based Fault Tolerance (ABFT) library
• A collection of mathematical routines that can detect and correct faults
• BLAS-3 library
  ◦ Matrix multiply, LU decomposition, QR decomposition, singular value decomposition (SVD), and fast Fourier transform (FFT)
• Uses checksums (see the example below)
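The example below demonstrates the checksum idea behind ABFT for matrix multiplication: the column sums of A, multiplied by B, must equal the column sums of C = A*B, so a corrupted element of C shows up as a checksum mismatch in its column. The matrix size, injected fault, and tolerance are chosen purely for illustration.

```c
/* Checksum-based fault detection for matrix multiply (ABFT-style). */
#include <stdio.h>
#include <math.h>

#define N 3   /* small square matrices for the example */

static void matmul(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

/* Returns the index of the first column whose checksum disagrees, or -1. */
static int abft_check(const double A[N][N], const double B[N][N],
                      const double C[N][N])
{
    for (int j = 0; j < N; j++) {
        double expected = 0.0, actual = 0.0;
        for (int k = 0; k < N; k++) {
            double colsum_k = 0.0;            /* sum of column k of A */
            for (int i = 0; i < N; i++) colsum_k += A[i][k];
            expected += colsum_k * B[k][j];   /* = column sum of C, column j */
        }
        for (int i = 0; i < N; i++) actual += C[i][j];
        if (fabs(expected - actual) > 1e-9) return j;
    }
    return -1;
}

int main(void)
{
    double A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double B[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
    double C[N][N];

    matmul(A, B, C);
    C[1][1] += 0.5;   /* inject a fault, e.g. a radiation-induced upset */

    int bad = abft_check(A, B, C);
    if (bad < 0) printf("checksums agree\n");
    else         printf("fault detected in column %d\n", bad);
    return 0;
}
```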

• Triple Modular Redundancy (TMR) (sketched below)
• Process-level replication
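A minimal sketch of triple modular redundancy in software: run the computation three times and majority-vote the results. In a real system the replicas would run on separate processors; here they run in one process purely for illustration.

```c
/* Software TMR sketch: three replicated computations and a majority voter. */
#include <stdio.h>

/* The replicated computation; a fault in one replica is masked by voting. */
static int compute(int x) { return x * x + 1; }

/* Majority vote over three results; returns -1 if all three disagree. */
static int tmr_vote(int a, int b, int c)
{
    if (a == b || a == c) return a;
    if (b == c)           return b;
    return -1;            /* no majority: uncorrectable, flag an error */
}

int main(void)
{
    int r1 = compute(6);
    int r2 = compute(6);
    int r3 = compute(6) ^ 0x4;   /* simulate a fault in one replica */

    int result = tmr_vote(r1, r2, r3);
    printf("voted result: %d (replicas: %d %d %d)\n", result, r1, r2, r3);
    return 0;
}
```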

Conclusion
• The system architecture has been defined
• A testbench has been assembled
• Future improvements:
  ◦ Address power consumption issues more aggressively
  ◦ Add support for other scientific computing platforms, such as Fortran