Published by Steven Tyler. Modified over 9 years ago.
1
SHelp: Automatic Self-healing for Multiple Application Instances in a Virtual Machine Environment
  Gang Chen, Hai Jin, Deqing Zou, Weizhong Qiang, Gang Hu
  Bing Bing Zhou
  Cluster and Grid Computing Lab, Services Computing Technology and System Lab, Huazhong University of Science and Technology
  Centre for Distributed and High Performance Computing Services, School of Information Technologies, University of Sydney
2
Introduction
  Many applications need high availability
    Server downtime is very costly (1 hr = $84,000~$108,000)
  But there are still numerous security vulnerabilities
    Fixing all bugs during testing is impossible
  Virtualization technology brings new challenges
    There are more application instances on a single machine
  How to guarantee high availability?
3
Current Approaches & Limitations
  Rx
    Change the execution environment
  STEM
    Emulate the faulty function, and potentially others within a larger scope, to return error values
  Failure-oblivious computing
    Manufacture values for out-of-bounds reads
    Discard out-of-bounds writes
  Micro-reboot
    Software components are fail-stop and individually recoverable
  Limitations
    Deterministic bugs are still there
    Require program redesign
    Narrow suitability: only a small number of applications or memory bugs
    ……
  ASSURE addresses these problems better [ASPLOS’09]
  SHelp can be considered an extension of ASSURE to a virtualized computing environment
4
ASSURE Overview [ASPLOS’09]
  Bypass the “faulty” functions
    Rescue points: locations in the existing application code used to handle programmer-anticipated failures
    Error virtualization: force a heuristic-based error return in a function
  Quick recovery for future faults
    Take a checkpoint once the appropriate rescue point is called

    #include <string.h>

    int bad(char* buf)
    {
        char rbuf[10];
        int i = 0;
        if(buf == NULL)
            return -1;   /* programmer-anticipated failure: existing error return */
        /* no bounds check on rbuf: a long enough input overflows the stack buffer */
        while(i < strlen(buf)) {
            rbuf[i++] = *buf++;
        }
        return 0;
    }

  [Figure: Execution Graph and Rescue Graph. Node labels: input, foo(), bar(), bad(), other(). Steps: walk stack, create rescue graph.]
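The rescue-point and error-virtualization idea on this slide can be illustrated in-process with `setjmp`/`longjmp`. This is only a toy analogy, not ASSURE's mechanism: ASSURE checkpoints and rolls back the whole process, whereas here `rescue_env` merely stands in for that checkpoint. The names `rescue_env`, `copy_with_sensor`, and `rescue_point` are hypothetical, and the bounds check plays the role of a fault sensor.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the checkpoint taken at the rescue point. */
static jmp_buf rescue_env;

/* Bounds-checked variant of bad(): the check acts as the fault sensor. */
static int copy_with_sensor(const char *buf)
{
    char rbuf[10] = {0};
    size_t n, i;

    if (buf == NULL)
        return -1;                   /* anticipated failure, normal error path */
    n = strlen(buf);
    for (i = 0; i < n; i++) {
        if (i >= sizeof rbuf)
            longjmp(rescue_env, 1);  /* fault detected: roll back to the rescue point */
        rbuf[i] = buf[i];
    }
    return rbuf[0] == buf[0] ? 0 : -1;  /* always 0 here; keeps rbuf in use */
}

/* Rescue point: its callers already handle a -1 return (the NULL case). */
int rescue_point(const char *buf)
{
    if (setjmp(rescue_env) != 0)
        return -1;                   /* error virtualization: forced error return */
    return copy_with_sensor(buf);
}
```

Calling `rescue_point` with a short string returns 0, while an input longer than the buffer is reported to the caller as an ordinary -1 failure instead of corrupting the stack.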
5
ASSURE Limitations [ASPLOS’09]
  A potential problem arises when the appropriate rescue point is in the main procedure of an application
  [Figure: Rescue point B can survive faults]
  Two cases
    High overhead from frequent checkpointing
    No rescue point is appropriate
6
SHelp Main Idea
  “Weighted” rescue points
    Assign weight values to rescue points
    When an appropriate rescue point is chosen, its associated weight value is incremented
    Once a fault is detected, first test the rescue point with the largest weight value
  Error handling information sharing in VMs
    A two-level storage hierarchy for rescue point management
      A global rescue point database in Dom0
      A rescue point cache in each DomU
    Weight values are updated between Dom0 and DomUs for error handling information sharing
    The cumulative effect of added weight values in Dom0 provides a useful guideline for diagnosing serious bugs
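A minimal sketch of the two-level weight storage described above, under several assumptions: rescue points are identified by a small integer id, both the Dom0 database and a DomU cache are plain in-memory arrays, and the transport between domains is ignored. `rp_store`, `MAX_RP`, and the three functions are hypothetical names, not SHelp's actual interfaces.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RP 8                 /* hypothetical fixed number of rescue points */

/* Used both for the global database (Dom0) and for a per-VM cache (DomU). */
struct rp_store {
    unsigned weight[MAX_RP];     /* weight value per rescue point id */
};

/* A rescue point proved appropriate in some DomU: update the database at once. */
void rp_report_success(struct rp_store *dom0_db, size_t id)
{
    assert(id < MAX_RP);
    dom0_db->weight[id]++;
}

/* Periodic refresh: a DomU cache pulls the accumulated weights from Dom0. */
void rp_cache_refresh(const struct rp_store *dom0_db, struct rp_store *domu_cache)
{
    for (size_t i = 0; i < MAX_RP; i++)
        domu_cache->weight[i] = dom0_db->weight[i];
}

/* Pick the rescue point with the largest weight (lowest id wins ties). */
size_t rp_pick_heaviest(const struct rp_store *s)
{
    size_t best = 0;
    for (size_t i = 1; i < MAX_RP; i++)
        if (s->weight[i] > s->weight[best])
            best = i;
    return best;
}
```

With this layout, a fault handled in one VM raises a weight in the shared database, and every other VM sees that preference after its next cache refresh.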
7
SHelp Architecture
  Sensors for detecting software faults
  Recovery and Test component for choosing the appropriate rescue point
8
SHelp Procedure
  Determine candidate rescue points
  Prioritize candidate rescue points and test them one by one
    Test the rescue point with the largest weight value first
  Increment the corresponding weight values
  Quick recovery for the same stack smashing bug
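The prioritize-and-test steps above can be sketched as a sort followed by a linear scan. This is an illustration under assumptions: the real test step involves rollback and re-execution, which is reduced here to a callback, and `candidate`, `rp_test_fn`, `choose_rescue_point`, and `demo_survives` are hypothetical names. The demo hook simply accepts `serveconnection`, the function the slides name as the appropriate rescue point for Light-HTTPd.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct candidate {
    const char *name;    /* function that may serve as a rescue point */
    unsigned weight;     /* how often it was chosen before */
};

/* Hypothetical test hook: nonzero if an error return at rp survives the fault. */
typedef int (*rp_test_fn)(const struct candidate *rp);

/* Demo hook for illustration only: pretend serveconnection is appropriate. */
int demo_survives(const struct candidate *rp)
{
    return strcmp(rp->name, "serveconnection") == 0;
}

/* Returns the chosen rescue point, or NULL if no candidate is appropriate. */
struct candidate *choose_rescue_point(struct candidate *c, size_t n, rp_test_fn survives)
{
    /* Prioritize: stable insertion sort, largest weight first. */
    for (size_t i = 1; i < n; i++) {
        struct candidate key = c[i];
        size_t j = i;
        while (j > 0 && c[j - 1].weight < key.weight) {
            c[j] = c[j - 1];
            j--;
        }
        c[j] = key;
    }
    /* Test one by one; the first survivor gets its weight incremented. */
    for (size_t i = 0; i < n; i++) {
        if (survives(&c[i])) {
            c[i].weight++;
            return &c[i];
        }
    }
    return NULL;
}
```

Because the weight is incremented on success, a rescue point that keeps working moves toward the front of the list and is tested earlier on later faults.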
9
Implementation Details
  Updating the Rescue Point Cache
    At the application level -> LRU
    At the trace level of applications -> LFUM
      Considers the globally maximum weight value and the local hit rate for trace i
  Updating Weight Values of Rescue Points
    Real-time updating for the RP database
    Periodic updating for the RP cache
  Bug-Rescue List
    The stack is corrupted in a stack smashing bug
    Getting the trace requires replaying the program -> high overhead
    Record the appropriate rescue point related to the fault
    Choose it to probabilistically survive faults
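One way to picture the Bug-Rescue List is as a small lookup table from a fault to the rescue point that worked last time, so that a corrupted stack does not force a costly replay. The slides do not say how a fault is keyed; this sketch assumes a signal number such as `SIGSEGV`, and `bug_rescue_list`, `brl_record`, `brl_lookup`, and `BRL_CAP` are hypothetical names.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

#define BRL_CAP 16                   /* hypothetical fixed capacity */

struct brl_entry {
    int fault_kind;                  /* assumed key, e.g. SIGSEGV or SIGFPE */
    const char *rescue_point;        /* rescue point that survived this fault */
};

struct bug_rescue_list {
    struct brl_entry e[BRL_CAP];
    size_t len;
};

/* Record (or overwrite) the rescue point for a fault; -1 if the list is full. */
int brl_record(struct bug_rescue_list *l, int fault_kind, const char *rp)
{
    for (size_t i = 0; i < l->len; i++) {
        if (l->e[i].fault_kind == fault_kind) {
            l->e[i].rescue_point = rp;
            return 0;
        }
    }
    if (l->len == BRL_CAP)
        return -1;
    l->e[l->len].fault_kind = fault_kind;
    l->e[l->len].rescue_point = rp;
    l->len++;
    return 0;
}

/* Look up the recorded rescue point, or NULL if this fault is new. */
const char *brl_lookup(const struct bug_rescue_list *l, int fault_kind)
{
    for (size_t i = 0; i < l->len; i++)
        if (l->e[i].fault_kind == fault_kind)
            return l->e[i].rescue_point;
    return NULL;
}
```

A `NULL` result means the fault has not been seen before and the full candidate search is still needed; a hit is only a likely answer, which matches the slide's wording that the recorded rescue point survives faults probabilistically.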
10
Experimental Setup
  Implementation
    Linux 2.6.18.8 kernel with BLCR and TCPCP checkpoint support
    Xen 3.2.0 and Dyninst 6.0
  Platform
    Intel Xeon E6550, 4MB L2 cache, 1GB memory
    100Mbps Ethernet connection
  Applications
    Application      Version  Bug               Depth
    Apache           2.0.49   Off-by-one        2
    Apache           2.0.50   Heap overflow     2
    Apache           2.0.59   NULL dereference  3
    Light-HTTPd      0.1      Stack smashing    2
    Light-HTTPd-dbz  0.1      Divide-by-zero    2
    ATP-HTTPd        0.4b     Stack smashing    1
    Null-HTTPd       0.5.0    Heap overflow     1
    Null-HTTPd-df    0.5.0    Double free       3
11
Comparison between ASSURE and SHelp
  Web server application: Light-HTTPd
  The function serveconnection is selected as the appropriate rescue point
  Throughput is only about 60KB/s in ASSURE
12
SHelp Recovery Performance
  First-1: new faults occur
  First-2: the same faults occur again in the local VM or in other VMs
13
Benefits of the Bug-Rescue List
  Subsequent: with Bug-Rescue List
14
Checkpoint/Rollback Overhead Analysis
  Lightweight checkpoint and rollback
  Modified BLCR with TCPCP tool support
15
Conclusions and Future Work
  “Weighted” rescue points and a two-level storage hierarchy for rescue point management make the system more effective and efficient
  Future Work
    Integrate the COW mechanism in BLCR
    Evaluate the effectiveness of our system for more complex server and client applications
16
Thank you! Questions?