Published by Maryann Watkins. Modified over 6 years ago.
1
IMPROVING THE RELIABILITY OF COMMODITY OPERATING SYSTEMS
MICHAEL M. SWIFT, BRIAN N. BERSHAD, HENRY M. LEVY
Presenter: Shyam Sunder Santoshi Visamsetty
11/13/2007
2
Outline
Introduction
Motivation
Previous work
Nooks
Architecture
Implementation
Performance
Conclusion
3
Features of a Good Operating System
High Performance
High Scalability
High Reliability
4
Reliability Problems in Operating Systems
Crashes caused by:
Device drivers
Other extensions such as file systems, virus detectors, network protocols, etc.
5
Causes of System Crashes in Windows NT
Source: June 2000
6
Crashes in Windows XP
Source: Jan 2003
7
“The most notable reality is that the Windows operating system is not responsible for a majority of PC crashes in our data set. Poorly-written device drivers contribute most of the crashes in our data.”
-- Windows XP Kernel Crash Analysis, by Archana Ganapathi, Viji Ganapathi and David Patterson, University of California, Berkeley, 2006
8
Why Device Drivers?
Device drivers access system memory and hardware directly.
Device drivers and other extensions account for 70% of the code in a Linux release.
Faulty extension code can therefore crash the whole system.
9
Motivation
Reliability remains a crucial but unsolved problem.
Rising cost of failures
Increasing prevalence of OS extensions
Extensions are the leading cause of OS failures
Extensions are optional components that reside in the kernel address space and typically communicate with the kernel through published interfaces.
10
Previous Approaches to Enhance Reliability
Microkernels
Type-safe languages
New hardware: ring and segment architectures
Transaction-based systems
11
Nooks Approach
Conventional processor architecture
Conventional programming language
Conventional OS architecture
Existing extensions
Nooks virtualizes only the interface between the kernel and extensions.
Virtualization techniques typically run several entire operating systems on top of a virtual machine, so a faulty extension in one OS can cause only a few applications to fail.
The challenge for reliable extensibility is not in virtualizing the hardware.
VMs also incur slow IPC and need intelligent scheduling.
12
Goals
Isolation
Recovery
Backward compatibility
13
Nooks Architecture
Two core principles:
Design for fault resistance, not fault tolerance.
Design for mistakes, not abuse.
Following the second principle, Nooks occupies the design space between unprotected and safe.
14
Nooks: Implementation
Implemented on the Linux kernel.
Isolated kernel extensions are wrapped by Nooks wrapper stubs.
All extensions execute at ring 0.
Nooks does not use the Intel x86 protection rings or memory segmentation mechanisms.
15
Nooks Layered Architecture
16
Functions of Nooks
17
Isolation
Prevents extension errors from damaging the kernel.
Every extension executes within its own lightweight kernel protection domain.
Tasks:
Protection-domain management
Inter-domain control transfer
Protection-domain management involves the creation, manipulation and maintenance of lightweight protection domains.
Isolation services support control flow in both directions between extension domains and the kernel domain.
18
Isolation (Contd…)
Extension Procedure Call (XPC)
XPC is a control-transfer mechanism for isolating extensions within the kernel.
XPC occurs between asymmetrically trusted domains.
19
Isolation: Implementation
Two parts:
Memory management
Extension Procedure Call
To provide extensions with read access to the kernel, Nooks' memory management code maintains a synchronized copy of the kernel page table for each domain.
Each lightweight domain has private structures such as a dynamic local heap, a pool of stacks, physical memory mappings and kernel memory buffers.
Nooks currently does not protect the kernel from DMA by a device into the kernel address space.
20
Protection of Kernel Address Space
To provide extensions with read access to the kernel, Nooks' memory management code maintains a synchronized copy of the kernel page table for each domain.
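The per-domain page table copy can be pictured with a small model. This is only a sketch under stated assumptions: a page table is modeled as a Python dict from page number to a permission string, and the names ExtensionDomain, sync and map_private are hypothetical, not part of Nooks. It also assumes that kernel pages appear read-only in the extension's copy, which is one way to read "read access to the kernel".

```python
class ExtensionDomain:
    """Hypothetical lightweight protection domain with its own page table copy."""

    def __init__(self, name, kernel_page_table):
        self.name = name
        self.private_pages = set()   # pages owned by this domain (heap, stacks)
        self.page_table = {}
        self.sync(kernel_page_table)

    def sync(self, kernel_page_table):
        # Rebuild the copy: every kernel page is visible but read-only.
        self.page_table = {page: "r" for page in kernel_page_table}
        # Private pages stay writable for the extension.
        for page in self.private_pages:
            self.page_table[page] = "rw"

    def map_private(self, page):
        self.private_pages.add(page)
        self.page_table[page] = "rw"

    def can_read(self, page):
        return "r" in self.page_table.get(page, "")

    def can_write(self, page):
        return "w" in self.page_table.get(page, "")


kernel_page_table = {0: "rw", 1: "rw"}
domain = ExtensionDomain("sb", kernel_page_table)
domain.map_private(100)

# The kernel maps a new page; the copy must be re-synchronized to see it.
kernel_page_table[2] = "rw"
domain.sync(kernel_page_table)
```

In this model the extension can read every kernel page but write only its private pages, so a stray write to kernel memory would fault instead of silently corrupting kernel state.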
21
Isolation (Contd…)
Extension Procedure Call (XPC):
Transfers control between extension and kernel domains.
Two functions:
nooks_driver_call
nooks_kernel_call
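The control transfer can be sketched as a domain switch wrapped around an ordinary function call. The sketch below is a user-space model, not the kernel implementation: it assumes nooks_driver_call transfers from the kernel into an extension and nooks_kernel_call transfers from an extension back into the kernel, and the helper _xpc, the current_domain variable and the example functions are hypothetical.

```python
KERNEL = "kernel"
current_domain = KERNEL   # which protection domain is executing right now
trace = []                # (from_domain, to_domain) for every crossing


def _xpc(target_domain, func, *args):
    """Switch domains, run func, and always switch back (hypothetical helper)."""
    global current_domain
    caller = current_domain
    trace.append((caller, target_domain))
    current_domain = target_domain
    try:
        return func(*args)
    finally:
        current_domain = caller


def nooks_driver_call(extension_domain, func, *args):
    # Kernel -> extension transfer.
    return _xpc(extension_domain, func, *args)


def nooks_kernel_call(func, *args):
    # Extension -> kernel transfer.
    return _xpc(KERNEL, func, *args)


def kernel_alloc(size):
    # Stand-in for a kernel service; records the domain it ran in.
    return {"size": size, "allocated_in": current_domain}


def driver_open(size):
    # Stand-in for a driver entry point that needs a kernel service.
    buf = nooks_kernel_call(kernel_alloc, size)
    return buf, current_domain


buffer, ran_in = nooks_driver_call("e1000", driver_open, 256)
```

Each call crosses the boundary once in each direction, which is why the XPC rate of a workload matters for performance.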
22
Isolation (Contd…)
Deferred Call Mechanism
Maintains two queues:
Extension-domain queue
Kernel-domain queue
Changes to the Linux kernel:
Maintain coherency between the kernel and extension page tables.
Handle exceptions.
Handle co-location of the task structure.
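A minimal sketch of how two such queues could batch calls follows. It assumes that calls are queued in the domain where they are issued and executed together at the next domain crossing, so that many calls share one transfer; the slide only names the two queues, so the class and method names here are hypothetical.

```python
from collections import deque


class DeferredCalls:
    """Hypothetical model of batching calls across a domain boundary."""

    def __init__(self):
        # Kernel calls issued by the extension, waiting for the next crossing.
        self.extension_domain_queue = deque()
        # Extension calls issued by the kernel, waiting for the next crossing.
        self.kernel_domain_queue = deque()
        self.crossings = 0

    def defer_kernel_call(self, func, *args):
        self.extension_domain_queue.append((func, args))

    def defer_extension_call(self, func, *args):
        self.kernel_domain_queue.append((func, args))

    def cross_to_kernel(self):
        self.crossings += 1
        return self._drain(self.extension_domain_queue)

    def cross_to_extension(self):
        self.crossings += 1
        return self._drain(self.kernel_domain_queue)

    @staticmethod
    def _drain(queue):
        results = []
        while queue:
            func, args = queue.popleft()
            results.append(func(*args))
        return results


deferred = DeferredCalls()
log = []
for packet in ("p1", "p2", "p3"):
    deferred.defer_kernel_call(log.append, packet)
queued_before = len(log)      # nothing has run yet
deferred.cross_to_kernel()    # one crossing executes all three calls
```

Three deferred calls cost a single crossing here instead of three.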
23
Interposition
Integrates existing extensions into the Nooks environment.
Tasks:
All extension-to-kernel and kernel-to-extension control flow passes through the XPC mechanism.
All data transfer between the kernel and an extension is viewed and managed by Nooks' object-tracking mechanism.
24
Interposition (Contd…)
Wrapper stubs: the interface between the extension, the Nooks Isolation Manager (NIM) and the kernel.
The kernel views the stub as the extension's function entry point.
Extensions view the stub as the kernel's extension API.
25
Interposition: Implementation
Interposes wrapper stubs between extensions and the kernel.
Wrappers provide transparency and protect control and data transfers in both directions.
Changes to the kernel:
Standard module loader
Module initialization code
Protection of data transfers:
The Linux kernel exports many objects that are only read by extensions. These objects are linked directly into the extension so that they can be read freely.
Macros and inline functions that directly modify kernel objects are changed into wrapped function calls.
For object modifications that are not performance critical, Nooks converts the object access into an XPC into the kernel.
For data structures, a shadow copy of the kernel object is created within the extension's domain. The contents of the kernel object and the shadow object are synchronized before and after XPCs into the extension.
26
Wrappers
Two types of wrappers:
Kernel wrappers
Extension wrappers
Each wrapper performs three tasks:
It checks parameters for validity by verifying with the object tracker and memory manager that pointers are valid.
Object-tracking code creates a copy of kernel objects on the local heap or stack within the extension's protection domain.
It performs an XPC into the kernel or extension to execute the desired function.
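The three wrapper tasks can be sketched in user space as follows. This is an illustrative model only: kernel objects are plain dicts, TrackerStub stands in for the object tracker and memory manager, direct_xpc stands in for the real control transfer, and the copy-back step after the call is an assumption based on the object-tracking description elsewhere in the deck. All of these names are hypothetical.

```python
import copy


class TrackerStub:
    """Hypothetical stand-in for the object tracker's validity check."""

    def __init__(self):
        self._known = {}

    def register(self, obj):
        # Holding a reference keeps id(obj) unique while it is registered.
        self._known[id(obj)] = obj

    def is_valid(self, obj):
        return id(obj) in self._known


def make_wrapper(tracker, xpc, target):
    def wrapper(kernel_obj):
        # Task 1: check that the parameter refers to a tracked object.
        if not tracker.is_valid(kernel_obj):
            raise ValueError("untracked object passed across domains")
        # Task 2: copy the kernel object into the extension's domain.
        local_copy = copy.deepcopy(kernel_obj)
        # Task 3: perform the XPC to run the target on the copy.
        result = xpc(target, local_copy)
        # Assumed copy-back so the kernel sees the extension's changes.
        kernel_obj.clear()
        kernel_obj.update(local_copy)
        return result
    return wrapper


def direct_xpc(func, *args):
    # Stand-in for the control transfer; simply calls the function.
    return func(*args)


def driver_fill(packet):
    packet["len"] = 64
    return 0


tracker = TrackerStub()
packet = {"len": 0}
tracker.register(packet)
wrapped_fill = make_wrapper(tracker, direct_xpc, driver_fill)
status = wrapped_fill(packet)
```

Calling wrapped_fill with a dict that was never registered raises ValueError, which models a wrapper rejecting an invalid pointer before any control transfer happens.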
27
Control Flow of Extension and kernel Wrappers
28
Wrappers (Contd…)
Wrapper code sharing:
248 wrappers were implemented to isolate 463 imported and exported functions.
This implies that wrapper code is shared among multiple drivers.
29
Code Sharing among Wrappers
30
Object Tracking
Tasks:
Maintains a list of kernel data structures that are manipulated by an extension.
Controls all modifications to those structures.
Provides object information for clean-up when an extension fails.
Object-tracking code copies kernel objects into an extension domain so they can be modified, and copies them back after changes have been applied.
31
Object Tracking: Implementation
Manages the manipulation of kernel objects by extensions.
Records all kernel objects and types in use by extensions.
Performs two tasks:
Records the addresses of all objects in use by an extension.
Records an association between the kernel and extension versions of each object.
Garbage collection
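The two recording tasks suggest a per-extension table keyed by kernel address. The sketch below is one possible shape for it, not the Nooks data structure: the class layout, method names and the example addresses are hypothetical, and the type names are used only as illustrative labels.

```python
class ObjectTracker:
    """Hypothetical per-extension table of kernel objects in use."""

    def __init__(self):
        # extension name -> {kernel address: (type name, extension address)}
        self._objects = {}

    def record(self, extension, kernel_addr, type_name, extension_addr=None):
        # Task 1: remember the address and type of an object in use.
        # Task 2: remember which extension-side copy, if any, mirrors it.
        table = self._objects.setdefault(extension, {})
        table[kernel_addr] = (type_name, extension_addr)

    def extension_version(self, extension, kernel_addr):
        return self._objects[extension][kernel_addr][1]

    def in_use(self, extension):
        return sorted(self._objects.get(extension, {}))

    def forget(self, extension, kernel_addr):
        self._objects.get(extension, {}).pop(kernel_addr, None)


tracker = ObjectTracker()
tracker.record("e1000", 0xC0001000, "sk_buff", extension_addr=0xD0000040)
tracker.record("e1000", 0xC0002000, "net_device")
```

Because every object is listed per extension, the same table can later drive clean-up when that extension fails.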
32
Recovery
Software faults:
Occur when an extension invokes a kernel service improperly.
The recovery policy determines whether Nooks triggers recovery or, when possible, returns control to the extension with an error code.
Hardware faults:
Occur when an extension attempts to read unmapped memory.
Trigger recovery.
For software faults, a policy is maintained because there may be a few kernel data structures which may be in use by other extensions and also that other extensions which
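The distinction between the two fault classes can be written as a small decision function. This is a sketch of the rule as stated on the slide, with hypothetical names (Fault, Action, handle_fault) and a boolean standing in for whatever the recovery policy actually consults.

```python
from enum import Enum


class Fault(Enum):
    SOFTWARE = "software"   # extension invoked a kernel service improperly
    HARDWARE = "hardware"   # e.g. extension read unmapped memory


class Action(Enum):
    RECOVER = "recover"
    RETURN_ERROR = "return_error"


def handle_fault(fault, policy_prefers_error, error_return_possible):
    # Hardware faults always trigger recovery.
    if fault is Fault.HARDWARE:
        return Action.RECOVER
    # For software faults the policy decides, and an error code can only be
    # returned when that is possible for the failed call.
    if policy_prefers_error and error_return_possible:
        return Action.RETURN_ERROR
    return Action.RECOVER
```

For example, handle_fault(Fault.SOFTWARE, True, True) hands an error code back to the extension, while any hardware fault goes straight to recovery.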
33
Recovery: Implementation
Two parts:
Release of resources by the recovery manager.
Coordination of recovery through the user-mode agent.
The Nooks recovery manager is tasked with returning the system to a clean state from which it can continue.
The user-mode recovery agent facilitates flexible recovery.
Nooks disables interrupt processing for the device controlled by the extension, preventing the livelock that could occur if device interrupts are not properly dismissed.
34
Recovery: Implementation (contd..)
The recovery manager walks the list of objects known to the object tracker and releases, frees or unregisters all objects that will not be accessed by external devices.
It uses a recovery function that releases the objects to the kernel and removes all references from the kernel into the extension.
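The walk over tracked objects can be sketched as a simple loop. The model below is hypothetical: TrackedObject, its release callback and the accessed_by_device flag are illustrative names, and the per-object callback stands in for the recovery function mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TrackedObject:
    name: str
    release: Callable[[], None]        # per-object recovery function
    accessed_by_device: bool = False   # e.g. a buffer a device may still touch


def recover(tracked: List[TrackedObject]) -> Tuple[List[str], List[str]]:
    """Release every object that no external device will access."""
    released, left_alone = [], []
    for obj in tracked:
        if obj.accessed_by_device:
            left_alone.append(obj.name)
        else:
            obj.release()
            released.append(obj.name)
    return released, left_alone


freed = []
objects = [
    TrackedObject("timer", lambda: freed.append("timer")),
    TrackedObject("dma_buffer", lambda: freed.append("dma_buffer"),
                  accessed_by_device=True),
    TrackedObject("irq_handler", lambda: freed.append("irq_handler")),
]
released, left_alone = recover(objects)
```

Objects that a device may still access are deliberately skipped, since freeing them could let the device write into reused memory.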
35
Implementation Limitations
Complete isolation or fault tolerance is not achieved.
Nooks runs extensions in kernel mode, so it cannot prevent extensions from deliberately executing privileged instructions.
Recovery is limited to extensions that can be killed and restarted safely. This holds for device drivers, which can be dynamically loaded when hardware devices are connected to the system.
As a result of the above limitations, crashes may still occur.
36
Reliability Test
Test methodology: synthetic fault injection
Extensions isolated:
37
Test Environment
Four programs:
Sound drivers: play a short MP3 file.
Network drivers: ICMP ping and TCP streaming tests.
VFAT: untars and compiles a number of files.
kHTTPd: web load generator.
VMware virtual machine
400 trials were run for each extension in both native and Nooks mode.
38
Test Results
System crashes:
Native mode: 317 crashes over all trials (400 per extension)
Nooks: eliminated 313 (99%); the remaining 4 resulted in deadlock.
e1000 and pcnet32 are interrupt-oriented.
VFAT, sb and kHTTPd are process-oriented.
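The 99% figure follows directly from the two crash counts on this slide; the short calculation below only restates that arithmetic, with variable names chosen for illustration.

```python
native_crashes = 317       # system crashes observed in native mode
eliminated_by_nooks = 313  # crashes that no longer occurred under Nooks

remaining = native_crashes - eliminated_by_nooks
percent_eliminated = 100 * eliminated_by_nooks / native_crashes

print(f"remaining crashes: {remaining}")
print(f"eliminated: {percent_eliminated:.1f}%")
```

The exact ratio is a little under 99%, which the slide rounds up.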
39
Test Results (Contd…)
40
Test Results (Contd…)
Non-fatal extension failures:
For e1000 and pcnet32, failures that left the device in a non-functional state were not detected by Nooks.
For VFAT and sb, Nooks reduced the number of non-fatal extension failures.
For kHTTPd, only a small number of injected faults were caught by Nooks.
41
Test Results (Contd…)
42
Recovery Errors
For the network, sb and kHTTPd extensions, recovery is straightforward.
For VFAT, 90% of the cases resulted in on-disk corruption.
Reason: fault injection occurs after files and directories are created, and the abrupt shutdown and restart of the file system leaves it in a corrupted state.
43
Recovery Errors (Contd…)
Solution: synchronize the in-memory disk cache with the disk before releasing resources on a VFAT recovery.
Result: corruption cases reduced from 90% to 10%.
44
Other Tests
Manually injected errors: for errors such as improper initialization or removed null checks, Nooks automatically detected and recovered from all such failures.
Latent bugs: Nooks revealed several latent bugs in existing kernel extensions such as kHTTPd and the 3COM 3c90x Ethernet driver.
45
Summary of Reliability Tests
99% of the system crashes were detected and recovered.
Nearly 60% of non-fatal extension failures were recovered.
46
Performance: Benchmarks
Benchmark               Extension  XPC Rate (per sec)  Nooks Relative Performance  Native CPU Util. (%)  Nooks CPU Util. (%)
Play-mp3 (128 Kbps)     sb         150                 1                           4.8                   4.6
Receive Stream          e1000      8,923               0.92                        15.2                  15.5
Send-Stream             e1000      60,352              0.91                        21.4                  39.3
Compile-Local           VFAT       22,653              0.78                        97.5                  96.8
Serve-simple-web-page   kHTTPd     61,183              0.44                        96.6
Serve-complex-web-page  kHTTPd     1,960               0.97                        90.5                  92.6
47
Comparative Time-chart for Compilation BenchMark
48
Summary of Benchmark Results:
Nooks provides a substantial reliability improvement at a cost that depends on the extension being isolated.
Moreover, the performance impact depends on the CPU utilization imposed by the workload.
49
Conclusion
Nooks can be implemented with modest engineering effort.
Extensions can be isolated without any change to extension code.
Isolation and recovery dramatically improve system reliability.
However, when performance matters for extensions with a high XPC frequency, isolation may not be appropriate.
50
QUESTIONS AND COMMENTS
51
THANK YOU