SHelp: Automatic Self-healing for Multiple Application Instances in a Virtual Machine Environment Gang Chen, Hai Jin, Deqing Zou, Weizhong Qiang, Gang.


1 SHelp: Automatic Self-healing for Multiple Application Instances in a Virtual Machine Environment
Gang Chen, Hai Jin, Deqing Zou, Weizhong Qiang, Gang Hu, Bing Bing Zhou
Cluster and Grid Computing Lab; Services Computing Technology and System Lab; Huazhong University of Science and Technology
Centre for Distributed and High Performance Computing Services, School of Information Technologies, University of Sydney

2 Introduction
- Many applications need high availability
  - Server downtime is very costly (1 hr = $84,000~$108,000)
- But there are still numerous security vulnerabilities
  - Fixing all bugs during testing is impossible
- Virtualization technology brings new challenges
  - There are more application instances on a single machine
- How to guarantee high availability?

3 Current Approaches & Limitations
- Rx
  - Change the execution environment
- STEM
  - Emulate the function, and potentially others within a larger scope, to return error values
- Failure-oblivious computing
  - Manufacture values for “out-of-bounds reads”
  - Discard “out-of-bounds writes”
- Micro-reboot
  - Software components are fail-stop and individually recoverable
- Limitations
  - Deterministic bugs are still there
  - Require program redesign
  - Narrow suitability: only a small number of applications or memory bugs
  - ……
- ASSURE better addresses these problems [ASPLOS’09]
- SHelp can be considered an extension of ASSURE to a virtualized computing environment

4 ASSURE Overview [ASPLOS’09]
- Bypass the “faulty” functions
- Rescue points
  - Locations in the existing application code used to handle programmer-anticipated failures
- Error virtualization
  - Force a heuristic-based error return in a function
- Quick recovery for future faults
  - Take a checkpoint once the appropriate rescue point is called

    int bad(char* buf) {
        char rbuf[10];
        int i = 0;
        if (buf == NULL)
            return -1;
        /* no bounds check on rbuf: a long enough input overflows the stack buffer */
        while (i < strlen(buf)) {
            rbuf[i++] = *buf++;
        }
        return 0;
    }

[Figure: execution graph (input, foo(), bar(), bad()) and rescue graph (input, foo(), bar(), other()); steps: walk stack, create rescue graph]
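The idea of error virtualization at a rescue point can be illustrated with a small sketch. This is not ASSURE's mechanism: ASSURE relies on process checkpoint and rollback, whereas the sketch below substitutes setjmp/longjmp as a stand-in for "return to the rescue point". The names bad_checked, rescue_checkpoint and the explicit bounds check acting as a fault sensor are assumptions made for illustration; bar() is taken from the slide's call graph.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stand-in for the checkpoint taken at the rescue point.
 * A single static buffer means this sketch is not reentrant. */
static jmp_buf rescue_checkpoint;

/* Bounds-checked variant of the slide's bad(): instead of overflowing
 * rbuf, it reports the fault by jumping back to the rescue point. */
static int bad_checked(const char *buf)
{
    char rbuf[10];
    size_t i = 0;

    if (buf == NULL)
        return -1;                          /* programmer-anticipated failure */
    while (*buf != '\0') {
        if (i >= sizeof rbuf)               /* assumed fault "sensor" fires */
            longjmp(rescue_checkpoint, 1);  /* roll back to the rescue point */
        rbuf[i++] = *buf++;
    }
    (void)rbuf;
    return 0;
}

/* Rescue point: bar() already propagates -1 for anticipated failures, so
 * the same value is forced when a fault is detected deeper in the stack. */
int bar(const char *input)
{
    if (setjmp(rescue_checkpoint) != 0)
        return -1;                          /* error virtualization */
    return bad_checked(input);
}
```

Under these assumptions, a caller of bar() sees an ordinary error return for an oversized input and continues through its existing error-handling path, which is the behaviour the rescue point is meant to exploit.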

5 ASSURE Limitations [ASPLOS’09]
- A potential problem
  - When the appropriate rescue point is in the main procedure of an application
- Rescue point B can survive faults
- Two cases
  - High overhead from frequent checkpointing
  - No rescue point is appropriate

6 SHelp Main Idea
- “Weighted” rescue point
  - Assign weight values to rescue points
  - When an appropriate rescue point is chosen, its associated weight value is incremented
  - Once a fault is detected, first select the rescue point with the largest weight value to test
- Error handling information sharing in VMs
  - A two-level storage hierarchy for rescue point management
    - A global rescue point database in Dom0
    - A rescue point cache in each DomU
  - Weight values are updated between Dom0 and DomUs for error handling information sharing
  - The accumulated weight values in Dom0 provide a useful guideline for diagnosing serious bugs
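A minimal sketch of how weights might flow through the two-level hierarchy, assuming rescue points are identified by a small integer id and each table is a fixed-size array. The struct rp_table type, MAX_RP and all function names are hypothetical; the slides do not describe the actual data layout or the Dom0/DomU communication channel, which is modelled here as plain function calls.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RP 8   /* hypothetical capacity, for illustration only */

/* One weight per rescue point, indexed by a hypothetical rescue-point id.
 * The same layout is used for the Dom0 database and each DomU cache. */
struct rp_table {
    unsigned weight[MAX_RP];
};

/* A DomU reports that rescue point `id` survived a fault: both its local
 * cache and the global Dom0 database are incremented. */
void rp_report_success(struct rp_table *dom0_db, struct rp_table *domu_cache,
                       size_t id)
{
    assert(id < MAX_RP);
    domu_cache->weight[id]++;
    dom0_db->weight[id]++;
}

/* Refresh a DomU cache from the Dom0 database, so that weights earned in
 * other VMs become visible locally. */
void rp_cache_refresh(const struct rp_table *dom0_db,
                      struct rp_table *domu_cache)
{
    for (size_t i = 0; i < MAX_RP; i++)
        domu_cache->weight[i] = dom0_db->weight[i];
}

/* Select the rescue point with the largest weight (lowest id wins ties). */
size_t rp_select_heaviest(const struct rp_table *t)
{
    size_t best = 0;
    for (size_t i = 1; i < MAX_RP; i++)
        if (t->weight[i] > t->weight[best])
            best = i;
    return best;
}
```

In this sketch, a rescue point that has survived faults in one DomU becomes the first candidate in another DomU after that DomU's next cache refresh, which is the information-sharing effect the slide describes.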

7 SHelp Architecture
- Sensors for detecting software faults
- Recovery and Test component for choosing the appropriate rescue point

8 SHelp Procedure
- Determine candidate rescue points
- Prioritize candidate rescue points and test them one by one
  - First test the rescue point with the largest weight value
- Increment the corresponding weight values
- Quick recovery for the same stack smashing bug
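The steps above can be sketched as a short loop: order the candidates by weight, test them from heaviest to lightest, and increment the weight of the first one that survives. The struct rescue_point layout, the rp_test_fn callback and shelp_recover are hypothetical names; in the real system the test step involves rollback and re-execution, which is abstracted here behind the callback. survives_if_named is only a toy callback for trying the loop out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct rescue_point {
    const char *function;   /* function acting as the rescue point */
    unsigned weight;        /* number of times it has survived a fault */
};

/* Hypothetical callback: rolls back to `rp`, forces an error return and
 * reports whether the application survived. */
typedef bool (*rp_test_fn)(const struct rescue_point *rp, void *ctx);

/* qsort comparator: larger weights sort first. */
static int by_weight_desc(const void *a, const void *b)
{
    const struct rescue_point *x = a, *y = b;
    return (x->weight < y->weight) - (x->weight > y->weight);
}

/* Tests candidates in descending weight order. Returns the first rescue
 * point that survives (after incrementing its weight), or NULL if none does.
 * Note that `cands` is reordered in place. */
struct rescue_point *shelp_recover(struct rescue_point *cands, size_t n,
                                   rp_test_fn test, void *ctx)
{
    assert(test != NULL);
    qsort(cands, n, sizeof *cands, by_weight_desc);
    for (size_t i = 0; i < n; i++) {
        if (test(&cands[i], ctx)) {
            cands[i].weight++;
            return &cands[i];
        }
    }
    return NULL;
}

/* Toy test callback: pretends only the function named by `ctx` survives. */
bool survives_if_named(const struct rescue_point *rp, void *ctx)
{
    return strcmp(rp->function, (const char *)ctx) == 0;
}
```

Sorting up front keeps the testing loop trivial; with only a handful of candidates per fault, the cost of the sort is negligible next to a single rollback.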

9 Implementation Details
- Updating the Rescue Point Cache
  - At the application level -> LRU
  - At the trace level of applications -> LFUM
    - Consider the globally maximum weight value and the local hit rate for trace i
- Updating Weight Values of Rescue Points
  - Real-time updating for the RP database
  - Periodic updating for the RP cache
- Bug-Rescue List
  - The stack is corrupted in a stack smashing bug
  - Getting the trace requires replaying the program -> high overhead
  - Record the appropriate rescue point related to the fault
  - Choose it to probabilistically survive faults
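The Bug-Rescue List can be pictured as a small lookup table from a fault to the rescue point that previously survived it, so that a repeat of the same stack smashing bug skips the expensive replay. The slides do not say how a fault is identified; the integer fault_id below is an assumption, as are BRL_CAPACITY and the brl_* function names.

```c
#include <assert.h>
#include <stddef.h>

#define BRL_CAPACITY 16   /* hypothetical capacity, for illustration only */

struct brl_entry {
    unsigned fault_id;          /* hypothetical identifier of the fault */
    const char *rescue_point;   /* rescue point that survived it */
};

struct bug_rescue_list {
    struct brl_entry entries[BRL_CAPACITY];
    size_t count;
};

/* Records the rescue point for a fault, overwriting any earlier entry for
 * the same fault. Returns 0 on success, -1 if the list is full. */
int brl_record(struct bug_rescue_list *l, unsigned fault_id, const char *rp)
{
    assert(l != NULL && rp != NULL);
    for (size_t i = 0; i < l->count; i++) {
        if (l->entries[i].fault_id == fault_id) {
            l->entries[i].rescue_point = rp;
            return 0;
        }
    }
    if (l->count == BRL_CAPACITY)
        return -1;
    l->entries[l->count].fault_id = fault_id;
    l->entries[l->count].rescue_point = rp;
    l->count++;
    return 0;
}

/* Returns the recorded rescue point for a fault, or NULL if the fault has
 * not been seen before and the full procedure is needed. */
const char *brl_lookup(const struct bug_rescue_list *l, unsigned fault_id)
{
    assert(l != NULL);
    for (size_t i = 0; i < l->count; i++)
        if (l->entries[i].fault_id == fault_id)
            return l->entries[i].rescue_point;
    return NULL;
}
```

A NULL result from brl_lookup corresponds to the first occurrence of a fault; a non-NULL result corresponds to the quick-recovery path for a fault that has been seen before.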

10 Experimental Setup
- Implementation
  - Linux 2.6.18.8 kernel with BLCR and TCPCP checkpoint support
  - Xen 3.2.0 and Dyninst 6.0
- Platform
  - Intel Xeon E6550, 4MB L2 cache, 1GB memory
  - 100Mbps Ethernet connection
- Applications

Application       Version  Bug               Depth
Apache            2.0.49   Off-by-one        2
Apache            2.0.50   Heap overflow     2
Apache            2.0.59   NULL dereference  3
Light-HTTPd       0.1      Stack smashing    2
Light-HTTPd-dbz   0.1      Divide-by-zero    2
ATP-HTTPd         0.4b     Stack smashing    1
Null-HTTPd        0.5.0    Heap overflow     1
Null-HTTPd-df     0.5.0    Double free       3

11 Comparison between ASSURE and SHelp
- Web server application Light-HTTPd
- Select the function serveconnection as the appropriate rescue point
- Throughput is only about 60KB/s in ASSURE

12 SHelp Recovery Performance
- First-1: new faults occur
- First-2: the same faults occur again in the local VM or in other VMs

13 Benefits of the Bug-Rescue List
- Subsequent: with Bug-Rescue List

14 Checkpoint/Rollback Overhead Analysis
- Lightweight checkpoint and rollback
- Modified BLCR with TCPCP tool support

15 Conclusions and Future Work
- “Weighted” rescue points and a two-level storage hierarchy for rescue point management make the system more effective and efficient.
- Future Work
  - Integrate the COW mechanism in BLCR
  - Evaluate the effectiveness of our system for more complex server and client applications

16 Thank you! Questions?

