Learning from the Past: Fast, On-Demand Analysis of Prior Executions with Eidetic Systems. Jason Flinn, Michael Chow, David Devecsery, Xianzheng Dou, Andrew Quinn. University of Michigan.
What is an Eidetic System? Eidetic: having “perfect memory” or “total recall”. Eidetic system: a system that can recall and trace through the lineage of any past computation. An eidetic system remembers all computation and can query its history. It can look through temporary variables, named pipes, and process address spaces, and it can track the history and evolution of the computer system. If you worked at an eidetic workstation, you would never have to worry about accidentally deleting files, etc. David Devecsery
Motivation - Heartbleed. Consider “Heartbleed”: people didn’t know whether they had been exploited. What leaked? Whose data leaked? How important was that data? What data was even available in the web server to leak? The two questions to answer: Was Heartbleed exploited? What data was leaked?
Motivation - Heartbleed. [Figure labels: Heartbleed Message, Leaked Data, Leaked Database Rows.] Was Heartbleed exploited? Yes. What data was leaked?
Motivation - Wrong Reference. [Figure label: Bad Citation.] How did I get the wrong citation? What else did this affect?
Motivation. [Figure label: ConfAid uses taint tracking.] How did I get the wrong citation? What else did this affect?
Motivation - Configuration Troubleshooting. Config file token: ExecCGI. Example program:
    file = open(config file)
    token = read_token(file)
    if (token equals "ExecCGI")
        execute_cgi = 1
    …
    if (execute_cgi == 1)
        ERROR()
ConfAid uses a top-down approach instead of the bottom-up approach that developers use, and it uses taint tracking to find causal relationships. Here is how it works: when the application reads something from the config file, ConfAid assigns a specific taint to it, and as the application runs, it propagates the taint via data flow and information flow. When the error happens, it uses these taints to link the error back to the configuration tokens that caused it. Mona Attariyan - University of Michigan
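The taint idea in the notes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ConfAid's implementation: the `Tainted` wrapper, `read_token`, `equals`, and `run` are hypothetical names, and the taint label is simply the token text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tainted:
    """A value plus the set of config tokens that influenced it (hypothetical type)."""
    value: object
    taint: frozenset = frozenset()

def read_token(text):
    # Assumption: each token read from the config file is labeled with itself.
    return Tainted(text, frozenset({text}))

def equals(token, literal):
    # The comparison result inherits the taint of the compared token.
    return Tainted(token.value == literal, token.taint)

def run(config_text):
    """Mirror the slide's program; return the tokens blamed for ERROR(), if any."""
    token = read_token(config_text)
    cond = equals(token, "ExecCGI")
    # A value assigned under a tainted condition carries that condition's taint.
    execute_cgi = Tainted(1 if cond.value else 0, cond.taint)
    if execute_cgi.value == 1:
        # Error site: link the error back to the tokens in the taint set.
        return sorted(execute_cgi.taint)
    return []

culprits = run("ExecCGI")
```

Here `run("ExecCGI")` reaches the error site and blames the `ExecCGI` token, while a config that never sets the flag reports nothing.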
Arnold: the first practical eidetic computer system. Efficiently records and recalls all user-space computation: process register/memory state and inter-process communication. Supports provenance queries: What data was affected? What states and outputs were affected? Targeted towards desktop/workstation use. Reasonable overheads: records 4 years of data on a $150 commodity HD, with under 8% performance overhead on most benchmarks.
Deterministic Record & Replay. Most execution on a computer system is deterministic, so an execution can be reproduced precisely by logging only its non-determinism. [Figure labels: RECORD, PLAY.] Michael Chow
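A toy sketch of the record/replay idea, assuming the only non-deterministic inputs are the ones routed through a hypothetical `nondet` hook. Arnold works at the system level; the `Recorder`, `Replayer`, and `program` names here are illustrative only.

```python
import random
import time

class Recorder:
    """Runs non-deterministic operations for real and logs their results."""
    def __init__(self):
        self.log = []

    def nondet(self, fn):
        value = fn()
        self.log.append(value)
        return value

class Replayer:
    """Feeds logged results back instead of re-running the operations."""
    def __init__(self, log):
        self._values = iter(log)

    def nondet(self, fn):
        return next(self._values)

def program(env):
    # Everything except the two nondet() calls is deterministic.
    seed = env.nondet(lambda: random.randint(0, 10**6))
    now = env.nondet(time.time)
    return (seed * 31 + int(now)) % 1000

rec = Recorder()
recorded = program(rec)
replayed = program(Replayer(rec.log))
```

Because the deterministic parts re-execute identically, the log needs only two entries to reproduce the whole run.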
Model-Based Compression. Formulate a model of a typical execution, and record only deviations from that model. Example: ret_val = sys_read(fd, buffer, count); (slide annotation: “usually equal”). After sys_read, we need to record only when the execution deviates from the model; otherwise, we record nothing. The better the model, the less we need in the log: log size is proportional to the deviations. Idea: partial determinism. Encourage the program to conform to the model. E.g. delta encoding, move-to-front transform for X server messages.
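A minimal sketch of logging only deviations from a model. The model assumed here, that a read returns exactly the number of bytes requested, is one reading of the slide's “usually equal” annotation; `record` and `replay` are hypothetical helpers, not Arnold's log format.

```python
def record(calls):
    """calls: list of (count, ret_val) pairs for successive reads.

    Assumed model: ret_val == count. Only deviations are logged,
    as (call index, actual ret_val).
    """
    return [(i, ret) for i, (count, ret) in enumerate(calls) if ret != count]

def replay(counts, log):
    """Rebuild every return value from the requested counts plus the deviation log."""
    deviations = dict(log)
    return [deviations.get(i, count) for i, count in enumerate(counts)]

calls = [(4096, 4096), (4096, 4096), (4096, 123), (4096, 0)]
log = record(calls)
restored = replay([count for count, _ in calls], log)
```

Two of the four calls match the model and cost nothing in the log; the log grows only with the number of deviations.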
Semi-Determinism Example: Time. Frequent time queries are non-deterministic. Use a partially deterministic clock: keep both a real-time clock and a deterministic clock, and bound the deviation between them.
    if (abs(deterministic_clock - real_time_clock) > threshold) {
        adjust deterministic_clock
        record deviation
    }
    return deterministic_clock
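The bounded-deviation clock can be sketched as follows. This assumes the deterministic clock advances by a fixed tick per query and that an adjustment snaps it to the real time; both choices, and the class and function names, are assumptions made for illustration.

```python
class SemiDeterministicClock:
    """Deterministic tick counter that is corrected only when it drifts too far."""
    def __init__(self, threshold, tick=1):
        self.threshold = threshold
        self.tick = tick
        self.det = 0
        self.queries = 0
        self.log = []  # (query number, adjusted clock value)

    def query(self, real_time):
        self.queries += 1
        self.det += self.tick  # deterministic advance, needs no logging
        if abs(self.det - real_time) > self.threshold:
            self.det = real_time  # adjust deterministic clock
            self.log.append((self.queries, self.det))  # record deviation
        return self.det

def replay(n_queries, log, tick=1):
    """Reproduce the returned times from the deviation log alone."""
    adjustments = dict(log)
    det, out = 0, []
    for q in range(1, n_queries + 1):
        det = adjustments.get(q, det + tick)
        out.append(det)
    return out

clock = SemiDeterministicClock(threshold=5)
real_times = [1, 2, 3, 50, 51, 52]
returned = [clock.query(t) for t in real_times]
```

In this run only the jump at the fourth query is logged; the other five answers are regenerated deterministically during replay.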
Space Optimizations. Result: 4 years of data on a $150 4TB commodity HD.
Performance Evaluation. [Figure.]
Querying Provenance. Two types of queries. Reverse: where did this data come from? Forward: what did this data affect? How does Arnold support these queries? The user specifies an initial state, and Arnold traces the lineage of the computation using intra-process tracking (DIFT) and inter-process tracking (via OS support).
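Reverse and forward queries can be pictured as reachability over a lineage graph. The sketch below is an assumption-laden toy: the `LineageGraph` class and the node names are hypothetical, and Arnold derives its edges from recorded executions rather than from hand-written `add_flow` calls.

```python
from collections import defaultdict, deque

class LineageGraph:
    """Directed graph where an edge src -> dst means 'dst was derived from src'."""
    def __init__(self):
        self.children = defaultdict(set)
        self.parents = defaultdict(set)

    def add_flow(self, src, dst):
        self.children[src].add(dst)
        self.parents[dst].add(src)

    def _reach(self, start, edges):
        # Breadth-first search over the chosen edge direction.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def reverse(self, node):
        """Where did this data come from?"""
        return self._reach(node, self.parents)

    def forward(self, node):
        """What did this data affect?"""
        return self._reach(node, self.children)

g = LineageGraph()
g.add_flow("web page", "refs.bib")    # hypothetical node names
g.add_flow("refs.bib", "paper.pdf")
g.add_flow("refs.bib", "slides.pdf")
origins = g.reverse("paper.pdf")
affected = g.forward("refs.bib")
```

The reverse query answers the "wrong citation" question (where did it come from), and the forward query lists what else the bad entry touched.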
Intra-Process Provenance. Use DIFT (taint tracking) for intra-process causality, run retroactively on the recorded execution. A query is a tuple of <sources, sinks, propagation function>. Arnold supports several propagation functions: Copy Only, Data Flow, Data + Index Flow, Control Flow. They span a spectrum from precision (strong input/output relation; may miss relations) to recall (weak input/output relation; misses few relations).
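A small sketch of how the choice of propagation function changes the result. The three-operation instruction format (`mov`, `add`, `load`) and the policy names are assumptions for illustration; control-flow propagation is left out, and the taint of the loaded table contents is ignored to keep the example short.

```python
COPY, DATA, DATA_INDEX = "copy", "data", "data+index"

def propagate(trace, sources, policy):
    """Run taint tracking over a trace of (op, dst, srcs) tuples.

    'mov'  copies a value, 'add' computes on values, and 'load' reads
    table[index], where srcs holds the index variable.
    """
    taint = {v: {v} for v in sources}
    for op, dst, srcs in trace:
        if op == "mov":
            allowed = True                         # every policy follows copies
        elif op == "add":
            allowed = policy in (DATA, DATA_INDEX)
        elif op == "load":
            allowed = policy == DATA_INDEX         # taint flows through the index
        else:
            allowed = False
        if allowed:
            taint[dst] = set().union(*(taint.get(s, set()) for s in srcs))
        else:
            taint[dst] = set()
    return taint

trace = [("mov", "b", ["a"]), ("add", "c", ["b", "k"]), ("load", "d", ["c"])]
copy_only = propagate(trace, ["a"], COPY)
data_flow = propagate(trace, ["a"], DATA)
data_index = propagate(trace, ["a"], DATA_INDEX)
```

Each more permissive policy relates the source `a` to one more variable, which is the precision-versus-recall trade-off in miniature.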
Inter-Process Lineage. Two notions of inter-process linkage. Process graph: tracks lineage through inter-process communication; precise; captures group-to-group communication. Human linkage: handles relations between user inputs and outputs; infers linkages based on data content and time; imprecise (may have false negatives and false positives); can capture linkages the process graph can miss.
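To make "infers linkages based on data content and time" concrete, here is one possible heuristic. The matching rule (a shared word of at least `min_len` characters, with the output following the input within `window` seconds) is purely an assumption for illustration and is not taken from Arnold.

```python
def human_links(inputs, outputs, window=60.0, min_len=4):
    """Link an input event to an output event by content and time.

    inputs, outputs: lists of (timestamp_seconds, text).
    Assumed rule: the output appears within `window` seconds after the
    input and contains one of the input's words of length >= min_len.
    """
    links = []
    for t_in, text_in in inputs:
        words = {w for w in text_in.split() if len(w) >= min_len}
        for t_out, text_out in outputs:
            close_in_time = 0 <= t_out - t_in <= window
            shares_content = any(w in text_out for w in words)
            if close_in_time and shares_content:
                links.append((text_in, text_out))
    return links

inputs = [(0.0, "Devecsery 2014 eidetic")]
outputs = [
    (10.0, "cite: eidetic systems"),  # close in time, shared word: linked
    (500.0, "eidetic again"),         # shared word but too late: missed
    (5.0, "unrelated"),               # close in time, no shared word: ignored
]
links = human_links(inputs, outputs)
```

A heuristic like this shows why such linkage is imprecise: a coincidental shared word gives a false positive, and a late or reworded output gives a false negative.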
Evaluation - Wrong Reference. [Figure labels: Copy, Copy, Data, Data + Index, Data, Human Linkage.] Few false positives (font files, LaTeX sty files, libc.so, libXt.so); no false negatives. Record time: 96.1s. Replay time: 2.2s. Replay + Pin time: 70.0s. Query time: 209.5s.
Evaluation - Heartbleed. [Figure labels: Data + Index, Data + Index, Data + Index.] No false positives or negatives. Record time: 230.3s. Replay time: 0.4s. Replay + Pin time: 139.5s. Query time: 235.1s.
Parallelizing Queries. Idea: parallelize across a compute cluster. Problem: DIFT is “embarrassingly sequential” (Ruwase ‘07). Solution: use deterministic replay to time-slice DIFT. [Figure label: DIFT execution.]
Time-slicing DIFT. The DIFT propagation function relates sources and sinks. Problem: each epoch only knows its local sources and sinks. Treat all addresses/registers at epoch start as sources. Treat all addresses/registers at epoch end as sinks. An aggregation phase “stitches together” the DIFT results from all epochs. Example program: A = source1; B = source2; C = A + B; D = C; Output D. Taint sets: A: {1}, B: {2}, C: {1,2}, D = {C}, Sink1 <- {C}. Aggregated result: {<Sink1,1>, <Sink1,2>}.
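The slide's example can be worked through in code by splitting the program into two epochs. This is a sequential sketch under stated assumptions: the instruction tuples, `run_epoch`, `resolve`, and `stitch` are hypothetical names, and the epoch boundary is placed after `C = A + B`.

```python
def run_epoch(instrs):
    """Local DIFT for one epoch, runnable without knowing earlier epochs.

    A location not yet written in this epoch is its own symbolic source,
    labeled ("start", loc). Returns (end_state, sinks).
    """
    state, sinks = {}, []

    def get(loc):
        return state.get(loc, {("start", loc)})

    for op, loc, srcs in instrs:
        if op == "source":
            state[loc] = {("src", srcs[0])}
        elif op == "assign":
            state[loc] = set().union(*(get(s) for s in srcs))
        elif op == "output":
            sinks.append(set(get(loc)))
    return state, sinks

def resolve(labels, prev):
    """Replace symbolic start-of-epoch labels with real source ids."""
    out = set()
    for kind, name in labels:
        if kind == "src":
            out.add(name)
        else:
            out |= prev.get(name, set())
    return out

def stitch(epoch_results):
    """Aggregation: walk the epochs in program order and resolve each sink."""
    prev, results = {}, []
    for state, sinks in epoch_results:
        results += [resolve(s, prev) for s in sinks]
        resolved = {loc: resolve(labels, prev) for loc, labels in state.items()}
        prev = {**prev, **resolved}
    return results

epoch1 = [("source", "A", [1]), ("source", "B", [2]), ("assign", "C", ["A", "B"])]
epoch2 = [("assign", "D", ["C"]), ("output", "D", [])]
local2 = run_epoch(epoch2)                       # knows only that the sink depends on C
sink_sources = stitch([run_epoch(epoch1), local2])
```

Epoch 2 alone yields the slide's `Sink1 <- {C}`; stitching with epoch 1 resolves `C` to sources 1 and 2, matching `{<Sink1,1>, <Sink1,2>}`.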
Aggregation. To scale, aggregation must run in parallel. Problem: the fully parallel version has obscenely high constant costs; the all-to-all address/register map does not fit in 256 GB of memory! Solution: organize epochs in a chain by program order, stream information about relations to sources forward along the chain, and stream information about relations to sinks backward along the chain.
Results: nginx. [Figure.]
Results: Firefox. [Figure.]
Results: Evince. [Figure.]
Conclusion. Eidetic systems can recall any past state and explain the provenance of that state. Arnold: the first practical eidetic system, with low runtime overhead and 4 years of computation on a commodity HD. Parallelizing DIFT queries across a cluster enables interactive response times (seconds, not hours) and supports powerful queries (what affected this value?).