Jason Flinn, Michael Chow, David Devecsery, Xianzheng Dou, Andrew Quinn

Presentation transcript:

Learning from the Past: Fast, On-Demand Analysis of Prior Executions with Eidetic Systems. Jason Flinn, Michael Chow, David Devecsery, Xianzheng Dou, Andrew Quinn. University of Michigan

What is an Eidetic System? Eidetic: having “perfect memory” or “total recall”. Eidetic system: a system that can recall and trace through the lineage of any past computation. An eidetic system remembers all computation and lets you query its history. That means it can look through temporary variables, named pipes, and process address spaces, and track the history and evolution of the computer system. At an eidetic workstation you would never have to worry about accidentally deleting files, etc. David Devecsery

Motivation - Heartbleed: Consider “Heartbleed”. People didn’t know whether they had been exploited. What leaked? Whose data leaked? How important was that data? What data was even available in the web server to leak? Two questions: Was Heartbleed exploited? What data was leaked?

Motivation - Heartbleed (diagram labels: Heartbleed Message, Leaked Data, Leaked Database Rows): Was Heartbleed exploited? Yes. What data was leaked?

Motivation – Wrong Reference (diagram label: Bad Citation): How did I get the wrong citation? What else did this affect?

Motivation: ConfAid uses taint tracking. How did I get the wrong citation? What else did this affect?

Motivation – Configuration Troubleshooting
Config file token: ExecCGI

    file = open(config file)
    token = read_token(file)
    if (token equals “ExecCGI”) execute_cgi = 1
    …
    if (execute_cgi == 1) ERROR()

ConfAid uses a top-down approach instead of the bottom-up approach that developers use, and it uses taint tracking to find causal relationships. Here is how it works: when the application reads something from the config file, ConfAid assigns a specific taint to it, and as the application runs, it propagates the taint via data flow and information flow. When the error happens, ConfAid uses these taints to link the error back to the configuration tokens that caused it. Mona Attariyan - University of Michigan
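The pseudocode on this slide can be turned into a small runnable sketch. This is only an illustration of the taint idea, not ConfAid's implementation: the `Tainted` wrapper and the `run` function are hypothetical names, and taint is propagated by hand at the one branch that matters.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tainted:
    """A value paired with the set of config tokens that influenced it (hypothetical helper)."""
    value: object
    taints: frozenset = field(default_factory=frozenset)


def read_token(text: str) -> Tainted:
    # Each token read from the config file is tainted with its own name.
    return Tainted(text, frozenset({text}))


def run(config_token: str):
    token = read_token(config_token)
    execute_cgi = Tainted(0)
    if token.value == "ExecCGI":
        # The assignment happens under a branch on a tainted value,
        # so the result inherits that branch's taint.
        execute_cgi = Tainted(1, token.taints)
    if execute_cgi.value == 1:
        # At the error, the accumulated taints name the candidate root causes.
        return ("ERROR", set(execute_cgi.taints))
    return ("OK", set())
```

Calling `run("ExecCGI")` reaches the error and reports `ExecCGI` as the token linked to it; any other token leaves the taint set empty.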

Arnold: First practical eidetic computer system. Efficiently records & recalls all user-space computation: process register/memory state and inter-process communication. Supports provenance queries: What data was affected? What states and outputs were affected? Targeted towards desktop/workstation use. Reasonable overheads: records 4 years of data on a $150 commodity HD; under 8% performance overhead on most benchmarks.

Deterministic Record & Replay: Most execution on a computer system is deterministic. An execution can be reproduced precisely by logging its non-determinism. Diagram labels: RECORD, PLAY. Michael Chow
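A minimal sketch of the record/replay idea follows. It is not Arnold's mechanism, which works at the system-call level. It only shows that logging the non-deterministic inputs is enough to reproduce the result. `Recorder`, `Replayer`, and `program` are hypothetical names.

```python
import random
import time


class Recorder:
    """Runs non-deterministic operations for real and logs their results."""
    def __init__(self):
        self.log = []

    def nondet(self, fn):
        value = fn()
        self.log.append(value)
        return value


class Replayer:
    """Ignores the real operation and returns the logged result instead."""
    def __init__(self, log):
        self._it = iter(log)

    def nondet(self, fn):
        return next(self._it)


def program(env):
    # Everything except the two nondet() calls is deterministic.
    seed = env.nondet(lambda: random.randint(0, 1_000_000))
    now = env.nondet(time.time)
    return (seed * 31 + int(now)) % 997


rec = Recorder()
recorded = program(rec)
replayed = program(Replayer(rec.log))
```

The log holds only two entries, yet `replayed` equals `recorded` because the rest of the computation is deterministic.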

Model-Based Compression: Formulate a model of a typical execution. Only record deviations from that model. Example: ret_val = sys_read (fd, buffer, count); (annotation: usually equal). Idea: partial determinism. Encourage the program to conform to the model, e.g. delta encoding, move-to-front transform for X server messages. After sys_read: We only need to record when.. , otherwise, we need to record nothing. The better the model, the less we need in our log. The log size is proportional to the deviations.
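The "record only deviations" idea can be sketched as follows. The model used here, that a read returns exactly the number of bytes requested, is an assumption based on the slide's "usually equal" annotation. `record` and `replay` are hypothetical helpers, not Arnold's log format.

```python
def record(calls):
    """calls: list of (count, ret_val) pairs for successive reads.

    Assumed model: ret_val == count. Only calls that deviate are logged,
    keyed by their position in the call sequence.
    """
    return {i: ret for i, (count, ret) in enumerate(calls) if ret != count}


def replay(counts, log):
    """Rebuild every return value from the model plus the logged deviations."""
    return [log.get(i, count) for i, count in enumerate(counts)]


calls = [(4096, 4096), (4096, 4096), (4096, 100), (4096, 0)]
log = record(calls)
restored = replay([count for count, _ in calls], log)
```

Two of the four calls match the model and cost nothing in the log, so the log grows with the number of deviations rather than the number of calls.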

Semi-Determinism Example: Time. Frequent time queries are non-deterministic. Use a partially deterministic clock: keep a real-time clock and a deterministic clock, and bound the deviation between them.

    if (deterministic_clock - real_time_clock < threshold) {
        adjust deterministic_clock
        record deviation
    }
    return deterministic_clock
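One way to read the partially deterministic clock is sketched below. Several details are assumptions, since the slide does not spell them out: the deterministic clock advances by a fixed tick per query, it snaps to real time when the absolute gap exceeds the threshold, and the log is keyed by query index. `SemiDeterministicClock` is a hypothetical name.

```python
class SemiDeterministicClock:
    """Deterministic clock that is corrected, and logged, only when it drifts too far."""

    def __init__(self, start, tick, threshold, log=None):
        self.det = start
        self.tick = tick
        self.threshold = threshold
        self.replaying = log is not None
        self.log = dict(log) if log else {}
        self.queries = 0

    def now(self, real_time=None):
        self.det += self.tick          # deterministic advance, never logged
        i = self.queries
        self.queries += 1
        if self.replaying:
            if i in self.log:          # apply a recorded correction
                self.det = self.log[i]
        elif abs(real_time - self.det) > self.threshold:
            self.det = real_time       # bound the deviation
            self.log[i] = real_time    # only this correction is recorded
        return self.det


real = [1.0, 2.0, 3.0, 10.0, 11.0]
rec = SemiDeterministicClock(0.0, 1.0, 2.0)
recorded = [rec.now(t) for t in real]
rep = SemiDeterministicClock(0.0, 1.0, 2.0, log=rec.log)
replayed = [rep.now() for _ in real]
```

Five time queries produce a single log entry here, and replay returns the same sequence without consulting a real clock.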

Space Optimizations: 4 years of data on a $150 4 TB commodity HD

Performance Evaluation

Querying Provenance: Two types of queries. Reverse: Where did this data come from? Forward: What did this data affect? How does Arnold support these queries? The user specifies an initial state, and Arnold traces the lineage of the computation using intra-process tracking (DIFT) and inter-process tracking (via OS support).

Intra-Process Provenance: Use DIFT (taint tracking) for intra-process causality. Run retroactively, on the recorded execution. A query is a tuple of <sources, sinks, propagation function>. Arnold supports several propagation functions, on a spectrum from precision to recall: Copy Only, Data Flow, Data+Index Flow, Control Flow. Precision end: strong input/output relation, may miss relations. Recall end: weak input/output relation, misses few relations.
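The effect of choosing a propagation function can be shown on a toy trace. The three-instruction trace format (`mov`, `add`, `load`) and the `propagate` function are assumptions made for illustration. Control-flow propagation is left out to keep the sketch short.

```python
COPY, DATA, DATA_INDEX = "copy", "data", "data+index"


def propagate(trace, sources, policy):
    """Return the taint set of every location after running the trace under a policy."""
    taint = {name: {name} for name in sources}
    for op, dst, *srcs in trace:
        if op == "mov":                      # direct copy: every policy follows it
            used = srcs
        elif op == "add":                    # computed value: data flow and above
            used = srcs if policy in (DATA, DATA_INDEX) else []
        elif op == "load":                   # srcs = [table cell, index]
            cell, index = srcs
            used = [cell, index] if policy == DATA_INDEX else [cell]
        else:
            raise ValueError(op)
        taint[dst] = set().union(*(taint.get(s, set()) for s in used))
    return taint


trace = [
    ("mov", "a", "in1"),
    ("add", "b", "a", "in2"),
    ("load", "c", "tbl", "in3"),   # c = tbl[in3], tbl itself is untainted
]
sources = ["in1", "in2", "in3"]
```

Under `COPY` only `a` is related to an input; `DATA` also relates `b` to `in1` and `in2`; `DATA_INDEX` additionally relates `c` to `in3`. Each step along the spectrum finds more relations at the cost of weaker ones.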

Inter-Process Lineage: Two notions of inter-process linkage. Process graph: tracks lineage through inter-process communication; precise; captures group-to-group communication. Human linkage: handles relations between user inputs and outputs; infers linkages based on data content and time; imprecise (may have false negatives and false positives); can capture linkages the process graph can miss.
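Reverse and forward queries over a process graph reduce to reachability in opposite directions. The sketch below assumes the graph is a list of (producer, consumer) edges between processes and files. The node names and the `reachable` helper are hypothetical, loosely modelled on the wrong-reference scenario.

```python
from collections import defaultdict, deque


def reachable(edges, start, forward=True):
    """Nodes reachable from start, following edges forward or in reverse."""
    graph = defaultdict(set)
    for src, dst in edges:
        if forward:
            graph[src].add(dst)
        else:
            graph[dst].add(src)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


edges = [
    ("browser", "refs.bib"),
    ("refs.bib", "latex"),
    ("editor", "paper.tex"),
    ("paper.tex", "latex"),
    ("latex", "paper.pdf"),
]
came_from = reachable(edges, "paper.pdf", forward=False)  # reverse query
affected = reachable(edges, "browser", forward=True)      # forward query
```

The reverse query answers "where did this data come from?" for `paper.pdf`, and the forward query answers "what did this data affect?" for `browser`.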

Evaluation – Wrong Reference
Diagram labels: Copy, Copy, Data, Data + Index, Data, Human Linkage
Few false positives (font files, LaTeX sty files, libc.so, libXt.so); no false negatives

Record Time   Replay Time   Replay + Pin Time   Query Time
96.1s         2.2s          70.0s               209.5s

Evaluation – Heartbleed
Diagram labels: Data + Index, Data + Index, Data + Index
No false positives or negatives

Record Time   Replay Time   Replay + Pin Time   Query Time
230.3s        0.4s          139.5s              235.1s

Parallelizing queries: Idea: parallelize across a compute cluster. Problem: DIFT is “embarrassingly sequential” (Ruwase ‘07). Solution: use deterministic replay to time-slice DIFT (diagram label: DIFT execution).

Time-slicing DIFT: The DIFT propagation function relates sources and sinks. Problem: each epoch only knows its local sources and sinks. Treat all addresses/registers at epoch start as sources. Treat all addresses/registers at epoch end as sinks. An aggregation phase “stitches together” DIFT results from all epochs.
Example:

    A = source1        A: {1}
    B = source2        B: {2}
    C = A + B          C: {1,2}
    D = C              D = {C}
    Output D           Sink1 <- {C}

Result: {<Sink1,1> , <Sink1,2>}
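The slide's example can be worked through in code by splitting it into two epochs. This is a simplified sketch, not Arnold's implementation: `run_epoch` and `stitch` are hypothetical names, the instruction format is invented, and the stitching here is a plain sequential pass over the epochs.

```python
def run_epoch(instrs):
    """Local DIFT for one epoch.

    A location not yet written in this epoch is treated as a source,
    labelled ("start", location). The final taint map acts as the
    epoch-end sink set.
    """
    taint, outputs = {}, {}

    def get(loc):
        return taint.get(loc, {("start", loc)})

    for op, dst, *srcs in instrs:
        if op == "source":          # dst receives external source srcs[0]
            taint[dst] = {("src", srcs[0])}
        elif op == "assign":        # dst is computed from srcs
            taint[dst] = set().union(*(get(s) for s in srcs))
        elif op == "output":        # dst is the sink name, srcs[0] the location
            outputs[dst] = get(srcs[0])
    return taint, outputs


def stitch(epoch_results):
    """Resolve each epoch's ("start", loc) labels against earlier epochs."""
    state, relations = {}, set()

    def resolve(labels):
        out = set()
        for kind, x in labels:
            if kind == "src":
                out.add(x)
            else:
                out |= state.get(x, set())
        return out

    for taint, outputs in epoch_results:
        for sink, labels in outputs.items():
            relations |= {(sink, s) for s in resolve(labels)}
        state.update({loc: resolve(labels) for loc, labels in taint.items()})
    return relations


epoch1 = [("source", "A", 1), ("source", "B", 2), ("assign", "C", "A", "B")]
epoch2 = [("assign", "D", "C"), ("output", "Sink1", "D")]
# Each epoch is analyzed independently, so these calls could run on different machines.
results = stitch([run_epoch(epoch1), run_epoch(epoch2)])
```

Epoch 2 alone only learns that `Sink1` depends on the start-of-epoch value of `C`; stitching resolves that to sources 1 and 2, matching the slide's result.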

Aggregation: To scale, aggregation must run in parallel. Problem: the fully parallel version has obscenely high constant costs; the all-to-all address/register map does not fit in 256 GB of memory! Solution: organize epochs in a chain by program order. Stream info about relations to sources forward along the chain. Stream info about relations to sinks backward along the chain.

Results: nginx

Results: Firefox

Results: Evince

Conclusion: Eidetic systems can recall any past state and explain the provenance of that state. Arnold: first practical eidetic system, with low runtime overhead and 4 years of computation on a commodity HD. Parallelizing DIFT queries across a cluster enables interactive response times (seconds, not hours) and supports powerful queries (what affected this value?).