Mendosus: A SAN-Based Fault Injection Test-Bed for Construction of Highly Available Network Services
Xiaoyan Li, Richard Martin, Kiran Nagaraja, Thu D. Nguyen and Bin Zhang
Dept. of Computer Science, Rutgers University
Talk Outline Motivation Design Implementation Benchmarks Case Studies Related Work Future Work
Motivation
- Ubiquitous network access and exponential growth in network services
- Availability is one key challenge
  - Networked systems comprise large numbers of heterogeneous components
  - Faults are not uncommon
  - Complex interactions between components
  - Examples of costly failures: eBay, Britannica
- Currently difficult to assess service availability
  - How to analyze the impact of failures?
  - How to set up an appropriate test-bed?
Mendosus
- Goal: provide infrastructure for service designers to assess the availability of network services
- Overview:
  - Provide flexible infrastructure to accurately model a variety of networking systems from the application’s point of view
  - Run the application in real time and inject faults to assess the application’s behavior
- Two key components:
  - Real-time emulation of a variety of interconnects
  - General fault injection infrastructure
Vision Map available resources to emulated network
Design
Mendosus Architecture
[Diagram labels: Applications; User Level; Kernel; Mendosus daemon; Central Controller; Events; Emulator Module (Routing, Fault Inclusion, Latency); Network State; Fast & Reliable SAN]
Design Decisions
- Central controller
  - Advantage: consistent network and fault information
  - Disadvantage: limits scalability
  - Not involved in network emulation, so it should still scale well to targeted system sizes (thousands or tens of thousands of components)
- Entire network state is maintained at each end node
  - Advantage: performance
  - Disadvantage: limits scalability
  - Only maintain state for the LAN
- Emulation module embedded within the kernel
  - Advantage: no modifications to application code
  - Disadvantage: more difficult to modify and extend
Functional Components Topology Maintenance Fault Injection Emulation
Topology Maintenance
- Specification: simple ns-2-like topology scripts
  - Specify available resources
- Central controller manages topology
  - Initializes original topology on each node
  - Consistent view
- Real-time topology changes
  - Specified as scripted events
- Controller monitors network connectivity
  - Detects partitions
Fault Injection
- Every network component can have a fault profile
  - Switches, hubs, NICs, links, end nodes
- Fault specification: trace files or theoretical distributions
  - Exponential, Weibull, constant
- Simulate fail-stop components
  - MTTR: constant or following a distribution
  - E.g. unplugging, port shutdown
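As a rough illustration of a fault profile driven by a theoretical distribution, the sketch below draws time-to-failure values from an exponential, Weibull, or constant distribution and pairs each failure with a constant repair time. It is not Mendosus code: the function name `fault_schedule`, its parameters, and the use of the mean time to failure as the Weibull scale are assumptions made for this example.

```python
import random

def fault_schedule(mttf, mttr, horizon, dist="exponential", shape=1.5, seed=0):
    """Return (fail_time, repair_time) pairs for one fail-stop component.

    Hypothetical helper: mttf and mttr are in the same time unit as horizon.
    The repair time is constant here; a distribution could be used instead.
    """
    rng = random.Random(seed)
    events, now = [], 0.0
    while True:
        if dist == "exponential":
            uptime = rng.expovariate(1.0 / mttf)
        elif dist == "weibull":
            # Assumption: mttf is used as the Weibull scale, not its exact mean.
            uptime = rng.weibullvariate(mttf, shape)
        elif dist == "constant":
            uptime = mttf
        else:
            raise ValueError("unknown distribution: " + dist)
        fail_time = now + uptime
        if fail_time >= horizon:
            return events
        repair_time = min(fail_time + mttr, horizon)
        events.append((fail_time, repair_time))
        now = fail_time + mttr

# Example: a link that fails every 10 time units and takes 2 to repair.
schedule = fault_schedule(10, 2, 30, dist="constant")
```

A trace file could be replayed through the same interface by returning the recorded pairs instead of sampled ones.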
Emulation
- Completely distributed
  - Every node has enough network state
- Emulation messaging sequence
  - Application initiates communication
  - Routing: determine route
  - Fault Inclusion: effect of injected faults
  - Latency: corresponding to route taken
- We do not implement the innards of network components
  - Switching
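The three emulation steps (routing, fault inclusion, latency) can be pictured as a small per-message pipeline. The sketch below is only an illustration under simplifying assumptions: routes are precomputed and stored in a dictionary, a failed component on the route drops the message, and latency is the sum of fixed per-component delays. The name `emulate_message` and all data structures are hypothetical.

```python
def emulate_message(src, dst, routes, failed, latency_us):
    """Return the emulated one-way latency in microseconds, or None if dropped.

    routes:     dict mapping (src, dst) to the list of components traversed
    failed:     set of components currently marked as failed
    latency_us: dict mapping each component to its delay in microseconds
    """
    # Routing: determine the route from the locally held network state.
    route = routes[(src, dst)]
    # Fault inclusion: a failed component on the route drops the message.
    if any(component in failed for component in route):
        return None
    # Latency: delay corresponding to the route taken.
    return sum(latency_us[component] for component in route)

routes = {("a", "b"): ["nic_a", "link1", "switch1", "link2", "nic_b"]}
latency_us = {"nic_a": 5, "link1": 1, "switch1": 10, "link2": 1, "nic_b": 5}
delay = emulate_message("a", "b", routes, set(), latency_us)
dropped = emulate_message("a", "b", routes, {"switch1"}, latency_us)
```

Because each node holds the full network state, every step here is a local lookup with no per-message communication with a controller.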
Implementation
Ethernet LAN Emulation
- Routing
  - Emulate computation of the Ethernet spanning tree
  - Controller chooses root of tree
  - Emulator on each node computes identical spanning tree
  - Reconfiguration performed periodically (every 2 secs)
- Broadcast & Multicast
  - Emulate using a sequence of unicasts
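One way for every node to arrive at an identical spanning tree from the same topology and the same root is a deterministic traversal. The sketch below uses a breadth-first search with sorted neighbor order as a stand-in; it does not reproduce the real Ethernet spanning tree protocol (bridge IDs, port costs, timers), and the names `spanning_tree` and `broadcast_as_unicasts` are hypothetical.

```python
from collections import deque

def spanning_tree(adj, root):
    """Return a parent map describing a BFS spanning tree rooted at root.

    adj maps each switch to the set of its neighbors. Neighbors are visited
    in sorted order, so every node running this on the same topology and
    root obtains the same tree.
    """
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(adj[node]):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return parent

def broadcast_as_unicasts(sender, hosts):
    """Expand one broadcast into a sequence of (sender, receiver) unicasts."""
    return [(sender, host) for host in sorted(hosts) if host != sender]

adj = {"s1": {"s2", "s3"}, "s2": {"s1", "s3"}, "s3": {"s1", "s2"}}
tree = spanning_tree(adj, "s1")
sends = broadcast_as_unicasts("h1", {"h1", "h2", "h3"})
```

Rerunning `spanning_tree` after each topology change would correspond to the periodic reconfiguration step.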
Ethernet LAN Emulation - Faults
- Network partitions
  - Controller monitors connectivity
  - Multiple roots, one for each partition
- NIC fail-over
  - Multiple interfaces using IP aliasing support in Linux
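Detecting partitions and assigning one root per partition amounts to finding the connected components of the topology once failed links are removed. The sketch below is a minimal illustration, assuming the root of each partition is simply its lexicographically smallest member; the actual selection rule used by the controller is not stated here, and `partition_roots` is a hypothetical name.

```python
def partition_roots(adj, failed_links):
    """Map each partition's root to the sorted list of nodes in that partition.

    adj:          dict mapping each node to the set of its neighbors
    failed_links: set of frozenset({a, b}) pairs for links that are down
    """
    seen, partitions = set(), {}
    for start in sorted(adj):
        if start in seen:
            continue
        # Depth-first search over links that are still up.
        members, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            members.append(node)
            for neighbor in adj[node]:
                if neighbor in seen or frozenset((node, neighbor)) in failed_links:
                    continue
                seen.add(neighbor)
                stack.append(neighbor)
        # start is the smallest unseen node, hence the smallest in its partition.
        partitions[start] = sorted(members)
    return partitions

adj = {"s1": {"s2"}, "s2": {"s1", "s3"}, "s3": {"s2"}}
healthy = partition_roots(adj, set())
split = partition_roots(adj, {frozenset(("s2", "s3"))})
```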
Emulation completeness…

Feature                                 Ethernet                 Emulated Ethernet
Multicast                               Hardware                 Software (broadcast w/ filters)
Layer 3, 4 services (e.g. VLAN, IGMP)   Some advanced switches   Not implemented
Broadcast                               Hardware                 Software (multiple unicast)
P-to-P                                  Yes                      Yes
Micro-benchmarks
Emulation Limits
[Table: RTT (usec) and throughput (MB/sec) by number of switches in the topology, for Fast Ethernet, Gigabit Ethernet and the emulator]
Software Broadcast Scaling
Fault View Convergence
Case Studies
Group Membership
- Test protocol behavior under faults
  - Subtle interactions in distributed protocols
- Three Round Membership algorithm
  - Robust against multiple node failures, packet drops and network partitions
  - Two modes of operation: normal and FCM
Membership Observations
[Figure: nodes A, B, C and D, with link L]
Fault sequence:
1. NIC failure at B
2. Link L down
3. NIC at B recovers
4. Packet drops at A
5. Link L up
Multi-Level Switched Network
- Large enterprise LANs have multiple layers of network components
  - Access, core and aggregation switches
- How to evaluate availability vs. cost vs. complexity?
- Study service availability with increased redundancy
  - Faults following exponential distributions
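For intuition about why redundancy raises availability, the standard steady-state formulas can be evaluated directly. This is a textbook back-of-the-envelope calculation that assumes independent component failures; it is not the emulation-based measurement used in the case study, and the function names and example numbers are hypothetical.

```python
def availability(mttf, mttr):
    """Steady-state availability of one component: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

def redundant(avail, replicas):
    """Availability of a group of replicas that is up if any one replica is up."""
    return 1.0 - (1.0 - avail) ** replicas

def in_series(*avails):
    """Availability of a path that needs every listed stage to be up."""
    total = 1.0
    for a in avails:
        total *= a
    return total

# Illustrative numbers only: each switch layer has MTTF 99 and MTTR 1.
switch = availability(99.0, 1.0)
single_path = in_series(switch, switch, switch)
dual_path = in_series(*(redundant(switch, 2) for _ in range(3)))
```

With these illustrative numbers, duplicating each of the three layers raises path availability from about 0.970 to about 0.9997, at the price of twice the switches.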
Enterprise LAN
Availability Vs Redundancy
Related Work
- Network emulation
  - Distributed emulation: Emulab [Utah], DelayLine
  - Centralized emulation: NISTNET, Lancaster emulator
- Fault injection
  - Script-based probing and fault injection: Orchestra, DOCTOR
  - Correlated faults: Loki [UIUC]
- Simulation
  - NS-2, REAL [Cornell], SSFNet, x-sim [Arizona]
Future Work
- Extend Mendosus to emulate other networks
  - WAN: build in performance dynamics model
  - Wireless LAN: realistic fault and performance models
- Support pluggable modules within network components which add functionality and additional failures
  - Intelligent routing protocols (e.g. HSRP)
  - Dynamic DNS, RR DNS
Summary
- Test-bed for service designers to systematically analyze network and protocol designs against failures
- Results show that real-time emulation is feasible given the capability of current SANs
- Demonstrated the flexibility and usefulness of Mendosus through two case studies
- Another step towards building highly available services…