Self Recovery in Server Programs
Vijay Nagarajan, Dennis Jeffrey, Rajiv Gupta
The University of California, Riverside
International Symposium on Memory Management, June 19, 2009

Server Availability
Long-running server programs: availability threatened by software failures
Memory errors a significant problem
  – Buffer overflows, format string errors, etc.
  – National Vulnerability Database, 2008: 30% of software failures caused by memory errors
Memory bug defect → triggered by user request → memory corruption → crash, or software attack due to malicious user input

Checkpointing/Rollback Technique
One commonly used technique to recover from a failure
  – [Gray ’86, Plank ’98, Qin ’05, Randell ’78, Tallam ’08]
Main principle
  – Periodically checkpoint state
  – On failure, roll back to most recent safe state
  – Replay benign user requests from safe state to failure point
Limitations
  – Checkpointing can have high overhead
  – Rollback can affect throughput and response time
  – Inherent tradeoff
      Small checkpoint interval → rollback improved, but higher time/space overhead
      Large checkpoint interval → overhead improved, but throughput and response time degrade
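The main principle on the slide above can be sketched in a few lines of Python. This is only an illustrative toy, not the mechanism of any cited system: the class `CheckpointedServer`, the dict-based state, and the convention that a request with value `None` triggers a crash are all assumptions made for the example.

```python
import copy


class CheckpointedServer:
    """Toy server whose whole state is a dict (hypothetical, for illustration)."""

    def __init__(self, checkpoint_interval):
        self.state = {}
        self.checkpoint_interval = checkpoint_interval
        self.checkpoint = copy.deepcopy(self.state)  # most recent safe state
        self.log = []       # benign requests applied since the checkpoint
        self.replayed = 0   # how many requests had to be re-executed

    def apply(self, request):
        key, value = request
        if value is None:
            # Stand-in for a memory bug: corrupt the state, then crash.
            self.state[key] = "<corrupt>"
            raise RuntimeError("crash")
        self.state[key] = value

    def handle(self, request):
        try:
            self.apply(request)
        except RuntimeError:
            self.rollback_and_replay()
            return False            # the faulty request is dropped
        self.log.append(request)
        if len(self.log) >= self.checkpoint_interval:
            # Periodic checkpoint: copy the state and truncate the log.
            self.checkpoint = copy.deepcopy(self.state)
            self.log.clear()
        return True

    def rollback_and_replay(self):
        # Roll back to the safe state, then replay the benign requests.
        self.state = copy.deepcopy(self.checkpoint)
        for request in self.log:
            self.apply(request)
            self.replayed += 1


server = CheckpointedServer(checkpoint_interval=2)
for req in [("a", 1), ("b", 2), ("c", 3), ("a", None)]:
    server.handle(req)
```

In this toy, a smaller `checkpoint_interval` means more `deepcopy` calls during normal operation but fewer requests to replay after a failure, which is the tradeoff the slide describes.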

Memory Corruption Propagation
Is an expensive checkpointing/rollback scheme necessary?
  – Perhaps, if memory corruption undergoes significant propagation
  – Perhaps not, if memory corruption is relatively isolated
Memory corruption
  – Memory location contains an unexpected value (e.g., overflow)
Memory corruption propagation
  – The spreading of memory corruption as computation progresses
  – New values become corrupt due to existing corrupt values
  – Corrupt values become uncorrupted when they are overwritten with an uncorrupt value (e.g., reusing a variable)

How Corruption Propagates
Does the number of corrupted locations steadily increase?
  – Dominated by “uncorrupt → corrupt” transitions
  – Very quickly, memory may be filled with corrupt locations and computation can’t proceed
  – May justify the need for checkpointing/rollback
Or, does the number of corrupted locations stabilize or vanish?
  – Significant “corrupt → uncorrupt” transitions
  – This “self cleansing” can be leveraged for recovery
  – May suggest that checkpoint/rollback is unnecessary

Study: Corruption Propagation
Understand how memory corruption can propagate in server programs
One way: study the memory corruption propagation in real bugs
  – Limited: depends on what is corrupted in the bugs we choose
A more comprehensive approach
  – Consider multiple runs, multiple requests per run
  – In each run, assume one of the store instruction instances causes corruption
      Do this for different store instruction instances in each run
  – Simulate memory corruption propagation
      For each instruction, if any of the source values are corrupt, then the target value is also corrupt (else, target is not corrupt)
  – Study the number of corrupt locations: max versus final

Subjects and Implementation
Server programs considered
  – mysqld: database management server
  – cvs: version control system
  – squid: web proxy cache server
  – apache: web server
Implementation
  – Associate a corruption bit with each memory location and register (initially, the bit is not set)
  – Every execution marks some unique store as corrupt
  – Propagate the corruption bits
  – Implemented in the Valgrind dynamic binary translation framework
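The propagation rule used in the study (target is corrupt if any source is corrupt, otherwise the target becomes uncorrupt) can be sketched over an abstract instruction trace. This is a simplified model, not the Valgrind tool itself: the trace format `(target, sources)`, the function name `simulate`, and the use of a Python set in place of per-location corruption bits are assumptions for illustration.

```python
def simulate(trace, corrupt_index):
    """Propagate corruption over a trace and report (max, final) counts.

    trace: list of (target, sources) pairs, one per executed instruction.
    corrupt_index: index of the instruction instance assumed to write a
                   corrupt value in this run.
    """
    corrupt = set()     # locations whose corruption bit is currently set
    max_corrupt = 0
    for i, (target, sources) in enumerate(trace):
        if i == corrupt_index or any(s in corrupt for s in sources):
            corrupt.add(target)        # corrupt source => corrupt target
        else:
            corrupt.discard(target)    # clean overwrite clears the bit
        max_corrupt = max(max_corrupt, len(corrupt))
    return max_corrupt, len(corrupt)


# A request that computes x -> y -> z, then reuses all three variables.
trace = [
    ("x", []), ("y", ["x"]), ("z", ["y"]),
    ("x", []), ("y", []), ("z", ["x"]),
]
early = simulate(trace, corrupt_index=0)
late = simulate(trace, corrupt_index=5)
```

With the corruption injected at the first instruction, three locations are corrupt at the peak and none at the end, because every variable is later overwritten with a clean value. Injecting at the last instruction leaves one corrupt location behind.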

Observations from Study
General patterns we observed
  – (A) # corrupted locations increases over time
  – (B) # corrupted locations increases, but is partially reverted to uncorrupt as memory gets deallocated / overwritten
  – (C) # corrupted locations increases, but is fully reverted by the end
Studied the distribution of max and final # of corrupted locations
  – Max: the highest point in each graph
  – Final: the right-most value in each graph
It turns out that case (C) is the most common case, and usually max is quite low
[Figure: three plots, (A), (B), and (C), each showing # Corrupt Locations against Execution Time]
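One possible way to assign a run to pattern (A), (B), or (C) from its max and final counts is sketched below. The thresholds are an assumption of this example (the slides do not say how runs were classified), and `classify_run` is a hypothetical helper name.

```python
def classify_run(counts):
    """Classify one run from its series of corrupt-location counts.

    counts: non-empty list of # corrupt locations, in execution order.
    Assumed rule: final == 0 -> (C), final < max -> (B), otherwise (A).
    """
    peak, final = max(counts), counts[-1]
    if final == 0:
        return "C"   # fully reverted by the end
    if final < peak:
        return "B"   # partially reverted
    return "A"       # still at its maximum at the end


patterns = [classify_run(c) for c in ([1, 2, 3, 3], [1, 4, 2, 2], [1, 3, 1, 0])]
```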

Illustration: Max vs. Final

Self Cleansing
In nearly 80% of cases, all corrupted locations revert back to uncorrupt state by the end
Key observation: a memory location marked corrupt usually corrupts relatively few other locations, most of which are uncorrupted by the end of processing of a user request
Why does this “self cleansing” occur?
  – Conduct a study of isolation to understand the cause

Study: Isolation
Goal: determine the degree of isolation across user requests inherent in servers
Methodology
  – Send several user requests to server
  – Determine memory shared across user requests
      Written by one request, read by subsequent request
      Maintain a request id with each memory location
  – Vary the request sizes and observe the effect on shared state
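The methodology above (tag each location with the id of the request that last wrote it, and flag a location as shared when a different request reads it) might look like the following sketch. The access-trace format and the name `find_shared` are assumptions for illustration. The study itself instrumented real servers rather than processing a list like this.

```python
def find_shared(accesses):
    """Return (shared, accessed) sets of addresses.

    accesses: iterable of (request_id, op, address), with op "r" or "w".
    A location counts as shared if it is read by a request other than
    the one that last wrote it.
    """
    last_writer = {}   # address -> request id kept with each location
    accessed = set()
    shared = set()
    for request_id, op, address in accesses:
        accessed.add(address)
        if op == "w":
            last_writer[address] = request_id
        elif address in last_writer and last_writer[address] != request_id:
            shared.add(address)
    return shared, accessed


accesses = [
    (1, "w", "global_counter"), (1, "w", "tmp1"), (1, "r", "tmp1"),
    (2, "r", "global_counter"), (2, "w", "tmp2"), (2, "r", "tmp2"),
]
shared, accessed = find_shared(accesses)
shared_percent = 100.0 * len(shared) / len(accessed)
```

Here only `global_counter` crosses a request boundary, so one of the three accessed locations is shared.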

Illustration: Shared vs. Not Shared

Observations from Isolation Study
Relatively small % of memory was shared across user requests
As request size increases:
  – # non-shared locations increases
  – # shared locations stays about the same
Evidence points to high isolation: only small global state is shared
  – Most state is local to a user request, and deallocated / overwritten before the next request

Program Name    % of Accessed Memory Locations that are Shared
mysqld          10.4%
cvs             15.6%
squid           6.2%
apache          35.7%

Summary of Study
Requests share small fixed global state
  – When global state corrupted, corruption persists across requests
Most state is local to a user request
  – When local state corrupted, corruption vanishes by the end of the user request
Take advantage of this self-cleansing property

SRS: Self Recovery in Servers
Key idea: when a server experiences a failure:
  – Do not actually crash → nullify and isolate the current request
  – Continue to process future user requests
      Ensure corrupt state is not visible during processing
Take advantage of self-cleansing for recovery without checkpointing/rollback
SRS approach
  – When a failure happens, do not crash; execute forward to trigger self cleansing
      Current request executed in crash suppression mode
  – Ensure isolation: handle (uncommon) case when benign request needs to access corrupted memory location
      On-demand restoration to an earlier uncorrupt value

Crash Suppression
When crash occurs during user request:
  – Execute forward to trigger self cleansing
  – Crash suppression prevents additional crashes
Semantics of suppression
  – When initial crash is detected at an instruction, suppress the crash and mark the target as “corrupt”
      Each register and memory location associated with a “corrupt” bit
  – Propagate the “corrupt” bits during rest of request
      If any source operands are “corrupt”, then suppress the instruction and mark the target “corrupt”
      Ensures that any subsequent instructions dependent upon the corrupt location are suppressed as well
  – But what about values already written by the faulty request?
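The suppression semantics can be illustrated with a toy interpreter. This is a sketch under several assumptions: instructions are modeled as `(target, function, sources)` triples, a Python exception stands in for a hardware fault, and an instruction with only clean sources is assumed to execute normally and clear the target's corrupt bit. None of these names come from the SRS implementation.

```python
def run_request(program, memory):
    """Execute a request in crash suppression mode; return corrupt targets.

    program: list of (target, fn, sources); fn takes the source values.
    memory:  dict mapping location names to values, updated in place.
    """
    corrupt = set()   # locations whose "corrupt" bit is set
    for target, fn, sources in program:
        if any(s in corrupt for s in sources):
            corrupt.add(target)       # suppress: do not execute, mark target
            continue
        try:
            memory[target] = fn(*(memory[s] for s in sources))
            corrupt.discard(target)   # clean write clears the bit
        except Exception:
            corrupt.add(target)       # suppress the crash, mark target
    return corrupt


memory = {"total": 10, "n": 0}
program = [
    ("avg", lambda t, n: t // n, ["total", "n"]),  # divides by zero
    ("report", lambda a: a + 1, ["avg"]),          # depends on corrupt avg
    ("avg", lambda: 0, []),                        # clean overwrite
]
corrupt = run_request(program, memory)
```

The division by zero does not terminate the request. The dependent instruction is skipped, and the later clean overwrite of `avg` is a small instance of self cleansing.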

Ensuring Isolation
Need to isolate faulty request from future requests
  – Values written by faulty request should be invisible to later requests
Only relatively small global state needs to be handled
Identify global state using profiling
  – TrackSet: set of memory locations used across user requests
Maintain a rollback value for each entry in TrackSet
  – When location in TrackSet is about to be written in a request, save the prior (uncorrupt) value into the TrackLog
When benign request needs to access a value written by a faulty request
  – Restore the prior uncorrupt value on demand from the TrackLog

Monitoring Shared Locations
Identify TrackSet through profiling
  – Give several user requests to the server
  – Identify the shared locations as well as the instructions that write to these shared locations
Maintain TrackLog
  – Used for reverting corrupt location to its prior uncorrupt state when necessary
  – Allocate memory for TrackLog at the start of a request
      Each TrackLog entry is an (address, prior-value) pair
  – When instruction writes to location in TrackSet, store prior value into TrackLog
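A minimal sketch of TrackLog maintenance follows. Two details are assumptions of this example rather than statements from the slides: only the first write to a location within a request is logged (so the entry holds the value from before the request began), and logs are kept per request id in a dict. The names `TrackLog` and `tracked_store` are illustrative.

```python
class TrackLog:
    """Per-request log of (address, prior-value) pairs for TrackSet locations."""

    def __init__(self, track_set):
        self.track_set = frozenset(track_set)  # found by offline profiling
        self.logs = {}                         # request id -> {address: prior}

    def start_request(self, request_id):
        # Allocate an empty log at the start of each request.
        self.logs[request_id] = {}

    def before_store(self, request_id, memory, address):
        log = self.logs[request_id]
        # Assumption: keep only the value seen before the request's first write.
        if address in self.track_set and address not in log:
            log[address] = memory.get(address)


def tracked_store(memory, track_log, request_id, address, value):
    """Store wrapper: save the prior value if needed, then write."""
    track_log.before_store(request_id, memory, address)
    memory[address] = value


memory = {"counter": 5}
track_log = TrackLog({"counter"})
track_log.start_request(2)
tracked_store(memory, track_log, 2, "counter", 6)
tracked_store(memory, track_log, 2, "counter", 7)
tracked_store(memory, track_log, 2, "tmp", 1)   # not in TrackSet, not logged
```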

Demand Driven Restoration
Identify when a request needs to read a corrupt location previously written by a faulty request
  – At each store, associate request id with the memory location, showing which request last defined it
  – When a crash occurs, remember the current request id as a “faulty” one
For benign requests after a crash, may need to perform on-demand restoration of corrupt values
  – For all loads, check if value to be loaded is corrupt (look at request id)
  – If corrupt, obtain the value from the TrackLog
      If value not in TrackLog, restart the server (fail-safe condition, can occur if shared location not identified during profiling)
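The load-time check described above can be sketched as a single function. The data layout is assumed for illustration: `last_writer` maps each address to the request id that last stored to it, `faulty_ids` is the set of requests that crashed, and `track_log` maps addresses to their saved prior values. `RestartServer` is a hypothetical exception standing in for the fail-safe restart.

```python
class RestartServer(Exception):
    """Raised when a corrupt value cannot be restored (fail-safe case)."""


def checked_load(address, values, last_writer, faulty_ids, track_log):
    """Load a value, restoring it first if a faulty request wrote it."""
    if last_writer.get(address) in faulty_ids:
        if address not in track_log:
            # Shared location missed by profiling: fall back to a restart.
            raise RestartServer(address)
        values[address] = track_log[address]  # on-demand restoration
        last_writer[address] = None           # no longer attributed to the faulty request
    return values[address]


# State left behind after request 2 crashed.
values = {"counter": 999, "stray": 1, "ok": 7}
last_writer = {"counter": 2, "stray": 2, "ok": 1}
faulty_ids = {2}
track_log = {"counter": 5}

restored = checked_load("counter", values, last_writer, faulty_ids, track_log)
untouched = checked_load("ok", values, last_writer, faulty_ids, track_log)
try:
    checked_load("stray", values, last_writer, faulty_ids, track_log)
    restart_needed = False
except RestartServer:
    restart_needed = True
```

Locations written only by benign requests pay just the request id comparison. The restore path runs only for values last written by a faulty request.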

Summary of SRS
Offline step
  – Profile to identify TrackSet and stores operating on it
Online steps
  – Monitor location accesses and handle crashes
      Maintain TrackLog for stores that operate on TrackSet
      Associate current request id with each store
      If crash detected: add current request id to faulty list, and run in crash suppression mode to the end of the current request
  – Additional steps when recovering from crash
      Check whether each load needs to access a corrupt location
        – Restore value from TrackLog if possible, else restart server

Experimental Evaluation
Benchmarks: mysqld, cvs, squid, apache
  – Considered 1 memory bug in each benchmark
For our memory study and for running SRS:
  – Used Valgrind shadow memory and instrumentation support
For performance evaluation:
  – Used DynamoRIO dynamic code modification system
      Because Valgrind imposes higher overheads
  – Required some manual source code changes to handle shadow memory

Running SRS: Recovery in Presence of Crash
Each subject program given 10 user requests
  – 5th request triggers a failure
SRS enters crash suppression mode once fault triggered
For future benign requests, SRS was able to perform on-demand restoration and recover
  – The need to restart server did not arise

Performance of SRS
Response time during normal run degraded by 5%
  – Additional store for each store
      Maintain TrackLog
      Store request id
  – Instrumentation tolerable since programs not CPU-intensive
Response time after recovery degraded by 8%
  – Additional check at each load
      Fail-safety check to possibly restore a corrupt value
Overhead during suppression about 3x (graph not shown)
  – Suppression only performed during a faulty request
  – Can be improved with DIFT hardware support

Related Work
Rebooting techniques
  – Whole program restart [Gray 1986]
  – Selective restarting of certain components [Candea 2004]
Recovery-oriented computing [Oppenheimer 2002, Patterson 2002]
  – Software components designed to be isolated to reduce failure impact
Checkpointing/rollback
  – Already described in this talk
Failure-oblivious computing [Rinard 2004]
  – Instead of crashing, server continues execution
      Discard illegal writes
      Manufacture appropriate values for illegal reads
  – Success hinges on “self cleansing” property also observed in our work
  – Whereas this approach does not cancel a faulty user request, our SRS approach attempts to isolate and nullify faulty requests

Conclusions
Conducted detailed study of memory corruption propagation in servers
  – Provided insight into “self cleansing” property
  – User requests share small global state
Proposed SRS approach
  – Takes advantage of “self cleansing” property
  – Recovery without checkpointing / rollback

Start Backup Slides

Observations from Study
Min Max = 1
  – This is the first location that is corrupted. No more locations are marked corrupt.
  – Used as loop counter
Max Max can be quite large
  – Median value much smaller
  – In 80% of the runs, max value is < 10
Max Final considerably smaller than Max Max
  – Significant # of corrupted locations revert back to uncorrupt state
Furthermore, Min Final = 0
  – All corrupted locations revert to uncorrupt state
  – Deallocated / overwritten
  – True for 80%-90% of the runs!
  – Self cleansing

Example
