SHelp: Automatic Self-healing for Multiple Application Instances in a Virtual Machine Environment Gang Chen, Hai Jin, Deqing Zou, Weizhong Qiang, Gang.

Presentation transcript:

SHelp: Automatic Self-healing for Multiple Application Instances in a Virtual Machine Environment
Gang Chen, Hai Jin, Deqing Zou, Weizhong Qiang, Gang Hu, Bing Bing Zhou
Cluster and Grid Computing Lab, Services Computing Technology and System Lab, Huazhong University of Science and Technology
Centre for Distributed and High Performance Computing, School of Information Technologies, University of Sydney

Introduction
- Many applications need high availability
  - Server downtime is very costly (1 hr = $84,000~$108,000)
- But there are still numerous security vulnerabilities
  - Fixing all bugs during testing is impossible
- Virtualization technology brings new challenges
  - There are more application instances on a single machine
- How to guarantee high availability?

Current Approaches & Limitations
- Rx
  - Change the execution environment
- STEM
  - Emulate the faulty function, and potentially others within a larger scope, to return error values
- Failure-oblivious computing
  - Manufacture values for out-of-bounds reads
  - Discard out-of-bounds writes
- Micro-reboot
  - Software components are fail-stop and individually recoverable
- Limitations
  - Deterministic bugs are still there
  - Require program redesign
  - Narrow suitability: only a small number of applications or memory bugs
  - ……
- ASSURE addresses these problems better [ASPLOS’09]
- SHelp can be considered an extension of ASSURE to a virtualized computing environment

ASSURE Overview [ASPLOS’09]
- Bypass the “faulty” functions
  - Rescue points: locations in the existing application code used to handle programmer-anticipated failures
  - Error virtualization: force a heuristic-based error return in a function
- Quick recovery for future faults
  - Take a checkpoint once the appropriate rescue point is called

Example faulty function:

    int bad(char* buf) {
        char rbuf[10];
        int i = 0;
        if (buf == NULL)
            return -1;
        /* no bounds check on rbuf: a long input overflows the stack buffer */
        while (i < strlen(buf)) {
            rbuf[i++] = *buf++;
        }
        return 0;
    }

[Figure: Execution Graph (input, foo(), bar(), bad()) and Rescue Graph (input, foo(), bar(), other()); steps: walk stack, create rescue-graph]
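The rescue-point and error-virtualization ideas on this slide can be illustrated with a minimal C sketch. It is not ASSURE's mechanism: `setjmp`/`longjmp` stands in for the process checkpoint and rollback, and the bounds check inside `bad()` is a hypothetical sensor that reports the fault instead of letting the overflow happen. `bar()` plays the rescue point because it already returns -1 when `bad()` fails, so forcing it to return -1 reuses error handling that its callers are assumed to have.

```c
#include <setjmp.h>
#include <stddef.h>
#include <string.h>

#define RBUF_SIZE 10

/* Stand-in for the checkpoint taken when the rescue point is called.
 * Assumption: a real system checkpoints the whole process instead. */
static jmp_buf rescue_checkpoint;

/* The slide's faulty function, plus a hypothetical sensor that detects
 * the out-of-bounds write before it corrupts the stack. */
static int bad(const char *buf)
{
    char rbuf[RBUF_SIZE];
    size_t i = 0;

    if (buf == NULL)
        return -1;
    while (i < strlen(buf)) {
        if (i >= RBUF_SIZE)
            longjmp(rescue_checkpoint, 1);  /* fault: roll back */
        rbuf[i++] = *buf++;
    }
    (void)rbuf;
    return 0;
}

/* Rescue point: after a rollback, bar() is forced to return the same
 * error value it would return for a programmer-anticipated failure. */
int bar(const char *buf)
{
    if (setjmp(rescue_checkpoint) != 0)
        return -1;              /* error virtualization */
    if (bad(buf) != 0)
        return -1;              /* anticipated failure path */
    return 0;
}
```

Calling `bar()` with a short string returns 0, while a sufficiently long string triggers the simulated fault and comes back as an ordinary -1 error return.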

ASSURE Limitations [ASPLOS’09]
- A potential problem
  - when the appropriate rescue point is in the main procedure of an application
- Rescue point B can survive faults
- Two cases
  - High overhead from frequent checkpointing
  - No rescue point is appropriate

SHelp Main Idea
- “Weighted” rescue points
  - Assign weight values to rescue points
  - When an appropriate rescue point is chosen, its associated weight value is incremented
  - Once a fault is detected, test the rescue point with the largest weight value first
- Error handling information sharing across VMs
  - A two-level storage hierarchy for rescue point management
    - a global rescue point database in Dom0
    - a rescue point cache in each DomU
  - Weight values are updated between Dom0 and the DomUs to share error handling information
  - The cumulative effect of added weight values in Dom0 provides a useful guideline for diagnosing serious bugs
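A minimal C sketch of the two-level weight store described on this slide follows. The names (`rp_store`, `rp_get`, `rp_report_success`, `rp_refresh_cache`, `rp_weight`), the fixed capacity, and the use of a function name as the key are all assumptions made for illustration. The slides do not say how Dom0 and the DomUs actually exchange data, so both levels are plain in-memory tables here.

```c
#include <stddef.h>
#include <string.h>

#define MAX_RP 16

typedef struct {
    const char *function;   /* rescue point, keyed by function name (assumed) */
    unsigned    weight;
} rescue_point;

/* One table type models both the Dom0 database and a DomU cache. */
typedef struct {
    rescue_point entries[MAX_RP];
    size_t       count;
} rp_store;

/* Find an entry, or add it with weight 0; NULL if the table is full. */
rescue_point *rp_get(rp_store *s, const char *function)
{
    for (size_t i = 0; i < s->count; i++)
        if (strcmp(s->entries[i].function, function) == 0)
            return &s->entries[i];
    if (s->count == MAX_RP)
        return NULL;
    s->entries[s->count].function = function;
    s->entries[s->count].weight = 0;
    return &s->entries[s->count++];
}

/* A DomU reports that `function` survived a fault: bump the global weight. */
int rp_report_success(rp_store *dom0_db, const char *function)
{
    rescue_point *rp = rp_get(dom0_db, function);
    if (rp == NULL)
        return -1;
    rp->weight++;
    return 0;
}

/* Refresh a DomU cache: copy global weights for entries it already holds. */
void rp_refresh_cache(rp_store *cache, const rp_store *dom0_db)
{
    for (size_t i = 0; i < cache->count; i++)
        for (size_t j = 0; j < dom0_db->count; j++)
            if (strcmp(cache->entries[i].function,
                       dom0_db->entries[j].function) == 0)
                cache->entries[i].weight = dom0_db->entries[j].weight;
}

/* Weight of `function` in a store, or 0 if it is not present. */
unsigned rp_weight(const rp_store *s, const char *function)
{
    for (size_t i = 0; i < s->count; i++)
        if (strcmp(s->entries[i].function, function) == 0)
            return s->entries[i].weight;
    return 0;
}
```

In this sketch a weight earned in one VM becomes visible to another VM only after that VM refreshes its cache, which is one way to read "error handling information sharing".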

SHelp Architecture
- Sensors for detecting software faults
- Recovery and Test component for choosing the appropriate rescue point

SHelp Procedure
- Determine candidate rescue points
- Prioritize candidate rescue points and test them one by one
  - test the rescue point with the largest weight value first
- Increment the corresponding weight values
- Quick recovery for the same stack smashing bug
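The procedure above can be sketched as a small selection loop in C. This is only an illustration under stated assumptions: `shelp_recover`, `candidate`, and the `rp_test_fn` callback are hypothetical names, and the callback stands in for "roll back to this rescue point, force an error return, and see whether the application survives". `demo_test` and the function names `main` and `handle_request` used with it are made up for the demonstration.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *function;   /* candidate rescue point */
    unsigned    weight;
} candidate;

/* Hypothetical test hook: nonzero if the rescue point survives the fault. */
typedef int (*rp_test_fn)(const char *function, void *ctx);

/* Test candidates in descending weight order. On the first success,
 * increment that candidate's weight and return it; NULL if none works.
 * The array is reordered in place as a side effect. */
candidate *shelp_recover(candidate *cands, size_t n, rp_test_fn test, void *ctx)
{
    for (size_t r = 0; r < n; r++) {
        size_t best = r;
        for (size_t i = r + 1; i < n; i++)
            if (cands[i].weight > cands[best].weight)
                best = i;

        candidate tmp = cands[r];       /* move heaviest untested to slot r */
        cands[r] = cands[best];
        cands[best] = tmp;

        if (test(cands[r].function, ctx)) {
            cands[r].weight++;
            return &cands[r];
        }
    }
    return NULL;
}

/* Demo hook: only the rescue point named in the context "works". */
typedef struct {
    const char *working;
    int         attempts;
} demo_ctx;

int demo_test(const char *function, void *ctx)
{
    demo_ctx *d = ctx;
    d->attempts++;
    return strcmp(function, d->working) == 0;
}
```

Because successful rescue points gain weight, a later fault of the same kind would try them earlier, which is the intended effect of the weighting.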

Implementation Details
- Updating the Rescue Point Cache
  - At the application level -> LRU
  - At the trace level of applications -> LFUM
    - Considers the globally maximum weight value and the local hit rate for trace i
- Updating Weight Values of Rescue Points
  - Real-time updating for the RP database
  - Periodic updating for the RP cache
- Bug-Rescue List
  - The stack is corrupted in a stack smashing bug
  - Getting the trace requires replaying the program -> high overhead
  - Record the appropriate rescue point related to the fault
  - Choose it to probabilistically survive faults
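The Bug-Rescue List can be pictured as a small lookup table from a fault to the rescue point that previously survived it. The C sketch below is an assumption-laden illustration: the slides do not say how a fault is identified, so `fault_id` is a hypothetical signature (for example, a faulting address), and `bug_rescue_list`, `brl_record`, and `brl_lookup` are invented names.

```c
#include <stddef.h>
#include <stdint.h>

#define BRL_CAPACITY 32

typedef struct {
    uintptr_t   fault_id;       /* hypothetical fault signature */
    const char *rescue_point;   /* rescue point that survived this fault */
} brl_entry;

typedef struct {
    brl_entry entries[BRL_CAPACITY];
    size_t    count;
} bug_rescue_list;

/* Return the recorded rescue point for a fault, or NULL if unknown. */
const char *brl_lookup(const bug_rescue_list *l, uintptr_t fault_id)
{
    for (size_t i = 0; i < l->count; i++)
        if (l->entries[i].fault_id == fault_id)
            return l->entries[i].rescue_point;
    return NULL;
}

/* Record (or overwrite) the rescue point for a fault; -1 if the list is full. */
int brl_record(bug_rescue_list *l, uintptr_t fault_id, const char *rescue_point)
{
    for (size_t i = 0; i < l->count; i++) {
        if (l->entries[i].fault_id == fault_id) {
            l->entries[i].rescue_point = rescue_point;
            return 0;
        }
    }
    if (l->count == BRL_CAPACITY)
        return -1;
    l->entries[l->count].fault_id = fault_id;
    l->entries[l->count].rescue_point = rescue_point;
    l->count++;
    return 0;
}
```

A hit in this list lets recovery skip the stack walk, which matters for stack smashing bugs where the stack is corrupted and obtaining the trace would otherwise require replaying the program.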

Experimental Setup
- Implementation
  - Linux kernel with BLCR and TCPCP checkpoint support
  - Xen and Dyninst 6.0
- Platform
  - Intel Xeon E6550, 4MB L2 cache, 1GB memory
  - 100Mbps Ethernet connection
- Applications

  Application      Version  Bug               Depth
  Apache                    Off-by-one
  Apache                    Heap overflow
  Apache                    NULL dereference  3
  Light-HTTPd      0.1      Stack smashing    2
  Light-HTTPd-dbz           Divide-by-zero    2
  ATP-HTTPd        0.4b     Stack smashing    1
  Null-HTTPd                Heap overflow     1
  Null-HTTPd-df             Double free       3

Comparison between ASSURE and SHelp
- Web server application Light-HTTPd
- Select the function serveconnection as the appropriate rescue point
- Throughput is only about 60KB/s in ASSURE

SHelp Recovery Performance
- First-1: new faults occur
- First-2: the same faults occur again in the local VM or in other VMs

Benefits of the Bug-Rescue List
- Subsequent: with Bug-Rescue List

Checkpoint/Rollback Overhead Analysis
- Lightweight checkpoint and rollback
- Modified BLCR with TCPCP tool support

Conclusions and Future Work
- “Weighted” rescue points and a two-level storage hierarchy for rescue point management make the system more effective and efficient
- Future Work
  - Integrate the COW mechanism in BLCR
  - Evaluate the effectiveness of our system for more complex server and client applications

Thank you! Questions?