Low Overhead Fault Tolerant Networking (in Myrinet)


Low Overhead Fault Tolerant Networking (in Myrinet)
Architecture and Real-Time Systems (ARTS) Lab.
Department of Electrical and Computer Engineering
University of Massachusetts, Amherst, MA 01003

Motivation

- An increasing use of COTS components in systems has been motivated by the need to:
  - reduce cost in design and maintenance
  - reduce software complexity
- The emergence of low-cost, high-performance COTS networking solutions, e.g., Myrinet, SCI, Fibre Channel
- The increasing complexity of network interfaces has renewed concerns about their reliability: the amount of silicon used has increased tremendously
- We need to use what is available to provide fault detection and recovery; nothing fancy is typically available

The Basic Question

How can we incorporate fault tolerance into a COTS network technology without greatly compromising its performance?

Microprocessor-based Networks

- Most modern network technologies have processors in their interface cards that help achieve superior network performance
- Many of these technologies allow changes to the program running on the network processor
- Such programmable interfaces offer numerous benefits:
  - developing different fault tolerance techniques
  - validating fault recovery using fault injection
  - experimenting with different communication protocols
- We use Myrinet as the platform for our study

Myrinet

- Myrinet is a cost-effective, high-performance (2.2 Gb/s) packet-switching technology
- At its core is a powerful RISC processor
- It is scalable to thousands of nodes
- Low-latency communication (~8 µs) is achieved through direct interaction with the network interface ("OS bypass")
- Flow control, error control, and simple "heartbeat" mechanisms are incorporated in hardware
- Link and routing specifications are public and standard
- Myrinet support software is supplied open source

Myrinet Configuration

[Diagram: a host node (host processor, system memory, system bridge) attached via the I/O bus to the LANai 9 network interface, which contains a RISC core, SRAM, interval timers, PCI bridge/PCI-DMA, DMA engine, host interface, packet interface, and SAN/LAN conversion logic.]

Myrinet Control Program

[Diagram: the hardware/software stack — application, middleware (e.g., MPI), TCP/IP interface, and OS driver running on the host processor and system memory; across the I/O bus, the Myrinet card's network processor and local memory run the Myrinet Control Program (MCP), a programmable interface.]

Susceptibility to Failures

- Dependability evaluation was carried out using software-implemented fault injection
- Faults were injected into the Myrinet Control Program (MCP)
- A wide range of failures was observed:
  - unexpected latencies and reduction of bandwidth
  - the network processor can hang and stop responding
  - a host system can crash/hang
  - a remote network interface can get affected
- Similar types of failures can be expected from other high-speed networks
- Such failures can greatly impact the reliability/availability of the system
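
Fault injection of this kind can be sketched as flipping bits in a copy of the control-program image. The snippet below is an illustrative Python simulation (the function name and the 16-byte "image" are invented for the example; the actual study injected faults into the real MCP in LANai SRAM):

```python
import random

def inject_fault(image: bytearray, rng: random.Random) -> int:
    """Flip one randomly chosen bit in the given control-program
    image, emulating a transient memory fault in interface SRAM.
    Returns the byte offset that was corrupted."""
    offset = rng.randrange(len(image))
    bit = rng.randrange(8)
    image[offset] ^= 1 << bit
    return offset

# Example: corrupt a dummy 16-byte "MCP image" of all zeros.
rng = random.Random(42)
mcp = bytearray(16)
offset = inject_fault(mcp, rng)
assert sum(1 for b in mcp if b != 0) == 1   # exactly one byte changed
```

Running many such injections and classifying the resulting behavior (no impact, hang, crash, corrupted messages, ...) yields a failure breakdown like the one on the next slide.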

Summary of Experiments

Failure Category              Count   % of Injections
No Impact                      1205        57.9
Other Errors                     23         1.15
Host Computer Crash               9         0.43
MCP Restart                      65         3.1
Messages Dropped/Corrupted      264        12.7
Host Interface Hang             514        24.6
Total                          2080       100

More than 50% of the failures were host interface hangs.

Design Considerations

- Faults must be detected and diagnosed as quickly as possible
- The network interface must be back up and running as soon as possible
- The recovery process must ensure that no messages are lost or improperly received/sent; complete correctness should be achieved
- The overhead on normal operation of the system must be minimal
- Fault tolerance should be as transparent to the user as possible

Fault Detection

- Continuously polling the card can be very costly
- We use a spare interval timer to implement watchdog-timer functionality for fault detection
- We set the LANai to raise an interrupt when the timer expires
- A routine (L_timer) that the LANai is supposed to execute periodically resets this interval timer
- If the interface hangs, L_timer is not executed, causing our interval timer to expire and raise a FATAL interrupt
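
The watchdog logic above can be sketched in a few lines. This is a Python simulation for illustration only (the class and method names are invented; the real mechanism is a LANai interval timer reset by L_timer and a FATAL interrupt delivered to the host):

```python
class WatchdogTimer:
    """Software watchdog: the LANai's periodic L_timer routine is
    expected to 'pet' the watchdog; if no pet arrives within the
    timeout, the interface is presumed hung. Time is passed in
    explicitly (in ms) so the logic is deterministic and testable."""
    def __init__(self, timeout_ms: float):
        self.timeout_ms = timeout_ms
        self.last_pet_ms = 0.0
        self.fatal = False

    def pet(self, now_ms: float):
        # Called on behalf of L_timer: the interface is alive.
        self.last_pet_ms = now_ms

    def check(self, now_ms: float) -> bool:
        # Called when the host-side interval timer expires.
        if now_ms - self.last_pet_ms > self.timeout_ms:
            self.fatal = True        # model raising the FATAL interrupt
        return self.fatal

wd = WatchdogTimer(timeout_ms=50)
wd.pet(10)
assert not wd.check(40)   # L_timer ran 30 ms ago: healthy
assert wd.check(100)      # no reset for 90 ms: hang detected
```

A 50 ms timeout here matches the fault-detection latency reported later in the talk.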

Fault Recovery Summary

- The FATAL interrupt is picked up by the fault-recovery daemon on the host
- The failure is verified through numerous probing messages
- The control program is reloaded into the LANai SRAM
- Any process that was accessing the board prior to the failure is also restored to its original state
- Simply reloading the MCP will not ensure correctness

Myrinet Programming Model

- Flow control is achieved through send and receive tokens
- The Myrinet software (GM) provides reliable, in-order delivery of messages
- A modified form of the "Go-Back-N" protocol is used
- Sequence numbers for the protocol are provided by the MCP
- One stream of sequence numbers exists per destination
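
The Go-Back-N bookkeeping with one sequence-number stream per destination can be illustrated with a minimal sender sketch. This is simplified Python for exposition, not GM's actual (C) implementation, and all names are invented:

```python
class GoBackNSender:
    """Minimal Go-Back-N bookkeeping: unacked messages stay queued
    per destination; a cumulative ACK releases them, and a timeout
    (or interface recovery) retransmits everything still outstanding."""
    def __init__(self):
        self.next_seq = {}   # destination -> next sequence number
        self.unacked = {}    # destination -> [(seq, payload), ...]

    def send(self, dest, payload):
        seq = self.next_seq.get(dest, 0)
        self.next_seq[dest] = seq + 1
        self.unacked.setdefault(dest, []).append((seq, payload))
        return seq

    def ack(self, dest, seq):
        # Cumulative ACK: everything up to and including seq is delivered.
        self.unacked[dest] = [m for m in self.unacked.get(dest, []) if m[0] > seq]

    def resend_all(self, dest):
        # Go-Back-N: retransmit every unacked message, in order.
        return list(self.unacked.get(dest, []))

s = GoBackNSender()
s.send("nodeA", "m0"); s.send("nodeA", "m1"); s.send("nodeB", "m0")
s.ack("nodeA", 0)                           # first message to nodeA delivered
assert s.resend_all("nodeA") == [(1, "m1")]
assert s.resend_all("nodeB") == [(0, "m0")]
```

The crucial point for what follows: in unmodified GM the sequence counters live inside the MCP, so they are lost when the interface is reloaded.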

Typical Control Flow

Sender:
1. User process prepares the message and sets a send token
2. LANai sdmas the message and sends it
3. LANai receives the ACK and sends an event to the process
4. User process handles the notification event and reuses the buffer

Receiver:
1. User process provides a receive buffer and sets a recv token
2. LANai recvs the message, sends an ACK, and rdmas the message
3. LANai sends an event to the process
4. User process handles the notification event and reuses the buffer

Duplicate Messages

1. Sender: user process prepares the message and sets a send token; receiver: user process provides a receive buffer and sets a recv token
2. Sender LANai sdmas and sends the message
3. Receiver LANai recvs the message, sends an ACK, rdmas the message, and sends an event to the process
4. The sender LANai goes down; the ACK is lost
5. The driver reloads the MCP into the board and resends all unacked messages
6. The sender LANai sdmas and sends the message again; the user process handles the notification event and reuses the buffer
7. The receiver LANai recvs a duplicate message — ERROR!

Lack of redundant state information is the cause of this problem.

Lost Messages

1. Sender: user process prepares the message and sets a send token; receiver: user process provides a receive buffer and sets a recv token
2. Sender LANai sdmas and sends the message; receiver LANai recvs the message and sends an ACK
3. Sender LANai receives the ACK and sends an event to the process; the user process handles the notification event and reuses the buffer
4. The receiver LANai goes down before the message is rdmaed to host memory
5. The driver reloads the MCP into the board and sets all recv tokens again
6. The receiver LANai waits for a message that will never be resent — ERROR!

Incorrect commit point is the cause of this problem.

Fault Recovery

- We need to keep a copy of the state information
  - checkpointing can be a big overhead
  - logging critical message information is enough
- GM functions are modified so that:
  - a copy of the send and receive tokens is made with every send and receive call
  - the host processes provide the sequence numbers, one per (destination node, local port) pair
  - the copy of a send/receive token is removed when the send/receive completes successfully
- The MCP is modified so that an ACK is sent out only after the message has been DMAed to host memory
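
The receiver-side half of this fix (ACK only after the DMA to host memory, plus host-visible sequence numbers to suppress post-recovery duplicates) can be sketched as follows. This is an illustrative Python model with invented names, not the modified MCP code:

```python
class Receiver:
    """Receiver commits the message (models the DMA to host memory)
    *before* ACKing, and uses the host-provided sequence number to
    discard duplicates that a post-recovery resend may produce."""
    def __init__(self):
        self.expected = 0
        self.delivered = []

    def receive(self, seq, payload):
        if seq < self.expected:
            return ("ack", seq)          # duplicate: re-ACK, do not deliver
        if seq > self.expected:
            return ("drop", seq)         # out of order: Go-Back-N discards it
        self.delivered.append(payload)   # commit to host memory first...
        self.expected += 1
        return ("ack", seq)              # ...then ACK (correct commit point)

rx = Receiver()
assert rx.receive(0, "m0") == ("ack", 0)
assert rx.receive(0, "m0") == ("ack", 0)   # resend after NIC reload: re-ACKed
assert rx.delivered == ["m0"]              # but delivered exactly once
```

Because the sequence numbers now live on the host, they survive an MCP reload, which closes both the duplicate-message and lost-message windows shown earlier.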

Performance Impact

- The scheme has been integrated successfully into GM (over one man-year for the complete implementation)
- How much of the system's performance has been compromised? After all, one can't get a free lunch these days!
- Performance is measured using two key parameters:
  - bandwidth obtained with large messages
  - latency of small messages

Latency

Bandwidth

Summary of Results

Performance Metric                   GM         FTGM
Bandwidth                         92.4 MB/s    92 MB/s
Latency                           11.5 µs      13.0 µs
Host-CPU utilization for send      0.3 µs       0.55 µs
Host-CPU utilization for receive   0.75 µs      1.15 µs
LANai-CPU utilization              6.0 µs       6.8 µs

Host platform: Pentium III with 256 MB, RedHat Linux 7.2
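
From these numbers the relative cost of the fault-tolerant GM (FTGM) over plain GM follows directly (a simple sanity check, assuming the table's values):

```python
def pct_change(new, old):
    """Relative change between two measurements, in percent."""
    return 100.0 * (new - old) / old

# Small-message latency: 11.5 µs (GM) vs. 13.0 µs (FTGM).
latency_overhead = pct_change(13.0, 11.5)
# Large-message bandwidth: 92.4 MB/s (GM) vs. 92 MB/s (FTGM);
# computed as how far GM sits above FTGM.
bandwidth_loss = pct_change(92.4, 92.0)

assert round(latency_overhead, 1) == 13.0   # ~13% on small messages
assert 0 < bandwidth_loss < 1               # well under 1% for large messages
```

This is consistent with the concluding claim that the performance impact is under 1% for messages over 1 KB: large transfers are bandwidth-bound, where the logging cost is amortized, while the fixed per-message cost shows up only in small-message latency.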

Summary of Results

- Fault detection latency = 50 ms
- Fault recovery latency = 0.765 s
- Per-process recovery latency = 0.50 s

Our Contributions

- We have devised smart ways to detect and recover from network interface failures
- Our fault detection technique for network-processor hangs uses software-implemented watchdog timers
- Fault recovery time (including reloading of the network control program) is ~2 seconds
- Performance impact is under 1% for messages over 1 KB
- Complete user transparency was achieved