Bristlecone: A Language for Robust Software Systems Brian Demsky Alokika Dash University of California, Irvine
Current Software is All or Nothing Most current software either executes perfectly or fails completely Small errors cause catastrophic failures Violate fundamental developer assumptions Violated assumptions prevent continued execution No clean way to recover from errors Unclear what parts of the program are affected Failure may leave key data structures partially updated
Degraded Service Can Be Desirable Consider a bug that affects an embedded application in a single web page Current browsers often close all browser windows and exit Users find this behavior frustrating Better option is to isolate the failure and halt only the embedded application component
Request r=receiverequest(); logRequest(r); processRequest(r); Motivating Example: Web Server
Request r=receiverequest(); logRequest(r); processRequest(r); Motivating Example: Web Server CRASH Failure in log operation Prevents serving this request If logging failure is independent of request, potentially causes system to fail to serve any requests
Real World Example First Flight of the Ariane 5 Rocket Uncaught integer overflow in computation that computed horizontal bias Overflow shutdown the inertial reference system Inertial reference sent debug information to the guidance system Guidance system used these invalid values to set incorrect nozzle deflections $120 Million rocket crashed Horizontal bias value is not even used! Lesson: Critical system operations coupled to non-critical operations
Observations about Recovery Challenging to recover from failure with traditional program structures Unclear what code was doing Are data structures consistent? What depends on the failed code? What is still safe to do? Code structure introduces artificial dependences In the absence of precise dependence information, we must assume the worst case Failures can propagate through artificial dependences small errors can cause catastrophic failures
Where do we lose information? Specifications describe functionality requirements Architecture/implementation phases map requirements into sequence of operations Mapping process loses information: Boundaries of operations (What is A and what is B?) Temporal dependences (Does B require A?) Data dependences (Does B use data produced by A?) Lost information introduces artificial dependences
Designing for Robustness Underlying assumption: All code contains bugs Goal: Mitigate the consequences Approach: Decompose application into many small tasks Specify dependences between these tasks Data dependences Control dependences Use transactions to prevent failures from exposing partially updated data structures Use dependence information to continue past failures
Bristlecone Language Program is specified as a set of tasks Task specifications describe task dependences Tasks have transactional semantics Runtime system reasons about dependences to execute past failures (automated recovery)
Web Server Example Read Request Log Request Send Page Decoupled operations Log Request and Send Page tasks are independent Failure of one does not affect the other Accept Connection
Specifying Object States Different object states support different functionality (Type State) Use flag construct to label conceptual object states Use these flags to determine when to perform operations Can differentiate between operations that have true data dependences and operations that just operate on same objects class WebRequest { Flag initialized; Flag send_page; Flag write_log; …}
Tagging Objects Motivation: Consider the web server example Each connection has: A Socket object that provides communication A WebRequest object that stores application specific state Need to pair the correct Socket and WebRequest objects together Solution: Tag the group of objects Socket Object WebRequest Object Connection Tag
Tagging Objects Tags group object instances Tags provide mechanism Tags have types Can create many instances of a tag type Each instance defines a group Can bind tag instances to objects Tags can specify that task parameters must be in the same group
Task Specifications Describe data dependences of tasks Describe affect of tasks on objects /* This task reads a request from a client. */ task readRequest(WebRequest w in initialized with connection t, Socket s in IO_Pending with connection t) {... taskexit(w: initialized:=false, send_page:=true, write_log:=true); }
Bristlecone Task Semantics Runtime invokes tasks Tasks can be invoked when objects are available in the heap that satisfy the task’s parameter guards Task have transactional memory semantics All operations are executed or none Task execution appears to occur in a single instance Failures cause transactions to abort and restore consistency
Failure-free Execution Read Request Log Request Send Page Accept Connection
Failure-free Execution Read Request Log Request Send Page Accept Connection
Failure-free Execution Read Request Log Request Send Page Accept Connection
Failure-free Execution Read Request Log Request Send Page Accept Connection
Error Detection Catching operating system signals Arithmetic exceptions Null pointer exceptions Library signals Socket errors … Runtime language checks Array out of bounds exceptions Assertions Imperative consistency checks Declarative data structure specifications
Failure Recovery Transactions restore data structures to previous consistent state Problem: Re-executing the same task will likely result in the same failure Solution: Use task specifications to determine what other tasks can be safely executed
Automatic Recovery Read Request logRequest Send Page Accept Connection CRASH
Automatic Recovery Read Request Log Request Send Page Accept Connection
Automatic Recovery Summary Read Request Log Request Send Page Accept Connection
Language Benefits Use specifications to understand failure in a meaningful way Use task specifications to reason how to recover from failures Task specifications eliminate artificial dependences
Task Dispatch Goal: Determine which parameter objects satisfy task guards Problem: Brute force search can be expensive Our Approach maintains: Parameter set of objects that satisfy an individual parameter’s guard Active task queue of sets of parameter objects that collectively satisfy all of task’s guards
Task Dispatch Precisely maintain parameter sets If an object is in a parameter set It satisfy the flag component of the guard Is bound to the correct types of tags All objects that satisfy parameter’s guard are in parameter set Active task queue is conservative If a set of objects could potentially satisfy all of task’s guards, it is in the task queue Must check that set of objects in a task queue invocation satisfies guards before invoking task
Task Dispatch When a new object is added to parameter set, create corresponding task queue invocations Search for objects that satisfy tag guards Idea: Use tags to prune search When we add an object with a tag guard to the set, use tags to prune search of other parameter objects that must be bound to the same tag
Task Binding Iteration Structure computation as a list of iterators over tags and objects Multiple types of iterators: Over tags bound to object Over objects bound to tag Over objects in parameter set Want to prune search early – ordering is important Statically generate iterator orderings for each parameter set of each task
Initial Experiences Implemented Bristlecone compiler and runtime Have evaluated system on several benchmarks including: Web Server Web Spider Chat Server Developed a Bristlecone and Java versions of each Java versions were designed to use threads to provide resilience to failures Randomly injected failures into executions
Web Spider Workload is a set of 100 web pages Java version implemented using a thread pool architecture 100 trials on each version Randomly injected 3 halting failures into each execution With injected failures Java version fetched average of 6 pages Bristlecone version fetched average of 91 pages
Web Server Web Server with support for e-commerce transactions Java version spawns a thread for each connection 200 trials on each version Randomly injected 50 halting failures into each execution With injected failures Java failed to serve inventory requests in 4.5% of trials, Bristlecone failed in 1.5% Java had correct inventory responses in 68.6%, Bristlecone in 100%
Chat Server Chat server allows multiple users to chat Java version spawns a thread for each connection 100 trials on each version Workload sent 800 messages Randomly injected 10 failures into each execution With injected failures Java version failed to serve 39.9% of messages Bristlecone version failed to serve 19.3% of messages
Related Work Traditional fault tolerance N-version programming Recovery blocks Exception handlers Languages Linda / Tuple spaces Orc Actors Argus Oz Erlang Software and Hardware Transactional Memory
Conclusions Bristlecone is a exciting approach to improve application reliability Initial experiences promising