EEC 688/788 Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing Lecture 6 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University wenbing@ieee.org

Outline
- Recovery-oriented computing
  - Overview
  - Application-level fault detection
  - Structural behavior monitoring
  - Path shape analysis
  - Microreboot and system-level undo/redo
12/2/2018

Recovery-Oriented Computing
- On the availability of soft real-time systems
  - Availability = MTTF / (MTTF + MTTR)
  - MTTF: mean time to failure
  - MTTR: mean time to recover
- Availability can be improved by increasing MTTF as well as by reducing MTTR
- Recovery-oriented computing focuses on reducing MTTR
  - Making fault detection faster and more accurate
  - Making recovery faster
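
The availability formula above can be checked numerically. The following is a minimal sketch; the function name and the MTTF/MTTR figures (in hours) are hypothetical and chosen only for illustration.

```python
def availability(mttf: float, mttr: float) -> float:
    """Availability = MTTF / (MTTF + MTTR); both arguments use the same time unit."""
    if mttf <= 0 or mttr < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf / (mttf + mttr)


# Hypothetical figures, in hours.
baseline = availability(mttf=1000.0, mttr=1.0)
doubled_mttf = availability(mttf=2000.0, mttr=1.0)
halved_mttr = availability(mttf=1000.0, mttr=0.5)
print(f"{baseline:.6f} {doubled_mttf:.6f} {halved_mttr:.6f}")
```

Halving MTTR yields the same availability as doubling MTTF (1000/1000.5 = 2000/2001), which is why faster detection and recovery are attractive when extending MTTF is hard.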

Fault Detection and Localization
- Fault detection: determine whether some component in the system has failed
- Fault localization: pinpoint the particular component that failed
- Low-level fault detection mechanism
  - Based on timeouts: each component is probed periodically with a heartbeat message
  - Cannot detect many application-level faults
- Recovery-oriented computing focuses on application-level fault detection and localization
  - 75% of the recovery time is spent on application-level fault detection
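
The low-level, timeout-based mechanism can be sketched as follows. The class and method names are hypothetical, and timestamps are passed in explicitly so that the example is deterministic. The sketch only shows why a component that stops sending heartbeats becomes suspected; a component that keeps sending heartbeats while returning wrong results would go unnoticed.

```python
class HeartbeatMonitor:
    """Suspects a component when no heartbeat arrived within timeout_s seconds."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen = {}  # component name -> time of last heartbeat

    def record_heartbeat(self, component: str, now_s: float) -> None:
        self.last_seen[component] = now_s

    def suspected(self, now_s: float) -> list:
        return sorted(
            name for name, seen in self.last_seen.items()
            if now_s - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.record_heartbeat("web", 0.0)
monitor.record_heartbeat("db", 0.0)
monitor.record_heartbeat("web", 2.0)
monitor.record_heartbeat("web", 4.0)   # "db" has been silent since t = 0
suspects = monitor.suspected(now_s=5.0)
print(suspects)
```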

Microreboot and System-Level Undo/Redo
- Microreboot: many problems can be fixed by simply restarting the faulty component
  - Works best with component-based systems
- For problems that cannot be fixed by a microreboot: perform a system-level undo, fix the problem, then carry out a system-level redo
  - Based on checkpointing and logging

System Model for Recovery-Oriented Computing
- Three-tier architecture
  - Separates application logic from data management
  - The middle tier is stateless or maintains only session state
- Component-based middleware
  - Java Platform, Enterprise Edition (Java EE, often referred to as J2EE)
  - Key component: Enterprise JavaBean (EJB)

Application-Level Fault Detection
- Fail-stop faults can be detected using timeouts
- Application-level faults can only be detected at the application level
- One plausible fault detection method: acceptance tests
  - Developers would have to write effective and efficient acceptance test routines
  - Not practical for Internet apps because of their scale, complexity, and rapid rate of change
- Recovery-oriented computing (ROC) approach: measure and monitor the structural behavior of an app
  - May detect app-level faults without a priori knowledge of the app's details

Structural Behavior Monitoring
- Interaction patterns between components reflect the app-level functionality
- Each component implements a specific app function, e.g.,
  - A stateful session bean to manage a user’s shopping cart
  - A set of singleton session beans to keep track of inventory
- The internal structural behavior can be monitored to infer whether the app is functioning normally
- To monitor: log the runtime path of each end-user request, including all incoming messages, outgoing messages, method invocations, etc.
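
One way to picture runtime path logging is a per-request list of events. This is a minimal sketch: the `PathEvent` and `RuntimePathLog` names, the event fields, and the component names are all hypothetical, and a real middleware would capture these events through instrumentation rather than explicit calls.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PathEvent:
    request_id: str
    kind: str      # e.g. "call" or "return"
    source: str    # component that emits the message
    target: str    # component that receives it


class RuntimePathLog:
    def __init__(self):
        self._events = defaultdict(list)  # request id -> ordered events

    def record(self, event: PathEvent) -> None:
        self._events[event.request_id].append(event)

    def path(self, request_id: str) -> list:
        return list(self._events.get(request_id, []))

    def components(self, request_id: str) -> set:
        spanned = set()
        for event in self.path(request_id):
            spanned.update((event.source, event.target))
        return spanned


log = RuntimePathLog()
for kind, source, target in [
    ("call", "web", "cart"),
    ("call", "cart", "inventory"),
    ("return", "inventory", "cart"),
    ("return", "cart", "web"),
]:
    log.record(PathEvent("req-1", kind, source, target))

print(len(log.path("req-1")), sorted(log.components("req-1")))
```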

Structural Behavior: Runtime Path Example
- Runtime path for a single end-user request
  - Spans 5 components
  - Consists of 10 events

Structural Behavior: Machine Learning
- Train reference models using machine learning
- Historical reference model: trained with aggregated runtime path data
  - Objective: anomaly detection based on historical behavior
  - May use a real workload as well as a synthetic workload that resembles the real one
- Peer reference model: trained with the most recent runtime path data
  - Objective: anomaly detection with respect to peer components
  - Must be trained with a real workload
- Fault (anomaly) detection: compare observed patterns with those in the reference models

Component Interactions Modeling
- Focus on interactions between a component instance and all other component classes
  - More scalable: can cope with cases where there are many instances of each class
  - Suitable for anomaly detection with the chi-square test

Component Interactions Modeling
- Given a system with n component classes, the interaction model for a component instance consists of a set of n-1 weighted links between the instance and the other n-1 component classes
  - We assume that instances of the same class do not interact with each other
  - We assume that interactions are symmetric (i.e., request and reply)
- The weight assigned to each link is the probability that the component instance interacts with the linked component class
- The weights on all links sum to 1, i.e., the component instance interacts with the other component classes with probability 1

Component Interaction Model: Example
- Class A: web component, handles end-user requests
- Class B: app logic, handles conversations with end-users, 3 instances
- Class C and Class D: also app logic, representing shared state
- Class E: database server, persistent state

Component Interaction Model: Example
- Machine learning: determine the link weights from training data
- Training data
  - A issued 400 remote invocations on b1
  - b1 issued 300 local method invocations on C and 300 invocations on D
  - What happened between C & E and between D & E is not important
- Link weight calculation
  - Total number of interactions that occurred at instance b1: 1000
  - P(b1-A) = 400/1000 = 0.4
  - P(b1-C) = 300/1000 = 0.3
  - P(b1-D) = 300/1000 = 0.3
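
The link weight calculation above amounts to normalizing the interaction counts observed at b1. A minimal sketch, with a hypothetical function name and the counts taken from the training data on this slide:

```python
def link_weights(interaction_counts: dict) -> dict:
    """Turn per-class interaction counts at one instance into link probabilities."""
    total = sum(interaction_counts.values())
    if total == 0:
        raise ValueError("no interactions observed")
    return {peer: count / total for peer, count in interaction_counts.items()}


# Interactions observed at instance b1 during training.
weights = link_weights({"A": 400, "C": 300, "D": 300})
print(weights)
```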

Anomaly Detection
- Compare current behavior with the trained behavior using the chi-square test
  - Prepare the observed data as a histogram
  - Compare the distributions using the formula $D = \sum_{i=1}^{n} \frac{(o_i - e_i)^2}{e_i}$
    - n: number of cells in the histogram
    - e_i: expected frequency in cell i
    - o_i: observed frequency in cell i
    - If e_i is 0, the cell should be pruned
- Each link is regarded as a cell
  - For an observation period of m requests, the expected frequency for link i is e_i = m * p_i, where p_i is the weight of link i
- No anomaly: ideally D = 0. In practice D is not 0 because of randomness; it follows a chi-square distribution
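
The statistic D can be computed directly from the observed and expected cell frequencies. This is a minimal sketch with a hypothetical function name; it prunes cells whose expected frequency is 0, as the slide requires. The sample values are made up.

```python
def chi_square_statistic(observed, expected) -> float:
    """D = sum over cells of (o_i - e_i)^2 / e_i, skipping cells with e_i == 0."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of cells")
    d = 0.0
    for o, e in zip(observed, expected):
        if e == 0:
            continue  # prune cells that the reference model never expects
        d += (o - e) ** 2 / e
    return d


identical = chi_square_statistic([10, 20], [10, 20])
shifted = chi_square_statistic([12, 18], [10, 20])   # 4/10 + 4/20
print(identical, shifted)
```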

Anomaly Detection: Chi-Square Test
- Anomaly detected: D > the (1-α) quantile of the chi-square distribution with k = n-1 degrees of freedom, at level of significance α
- Higher α => more sensitive => more false positives
  - Higher level of significance => higher probability of rejecting the hypothesis that there is no relationship => higher probability of confirming that there is a relationship => higher probability of detecting an abnormality
- Level of significance: the probability of rejecting the null hypothesis in a statistical test when it is true (http://www.merriam-webster.com/dictionary/level%20of%20significance)
- The null hypothesis is a general statement or default position that there is no relationship between two measured phenomena. Rejecting or disproving the null hypothesis means concluding that there are grounds for believing that there is a relationship between the two phenomena

Anomaly Detection: Chi-Square Test: Example
- Observation period: 100 requests
  - A issued 45 requests on b1
  - b1 issued 35 invocations on C and 20 invocations on D
- Link(A-b1): expected 100*0.4 = 40, observed 45
- Link(C-b1): expected 100*0.3 = 30, observed 35
- Link(D-b1): expected 100*0.3 = 30, observed 20
- D = (45-40)^2/40 + (35-30)^2/30 + (20-30)^2/30 = 4.79
- Chi-square test: 2 degrees of freedom (only 3 cells); for α = 0.1 the 90% quantile is 4.6, and 4.79 > 4.6 => anomaly detected
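
The worked example can be reproduced in a few lines. This sketch relies on one assumption that is not on the slide: for exactly 2 degrees of freedom the chi-square distribution is an exponential distribution with mean 2, so its (1 - alpha) quantile is -2 ln(alpha). For any other number of degrees of freedom a statistics library or a quantile table would be needed. The variable and function names are hypothetical.

```python
import math

weights = {"A": 0.4, "C": 0.3, "D": 0.3}   # trained link weights for b1
observed = {"A": 45, "C": 35, "D": 20}     # counts in the observation period
m = sum(observed.values())                  # 100 interactions in total

# Chi-square statistic over the three links (cells).
d = sum((observed[k] - m * p) ** 2 / (m * p) for k, p in weights.items())


def chi2_quantile_2dof(alpha: float) -> float:
    """(1 - alpha) quantile of the chi-square distribution; valid only for k = 2."""
    return -2.0 * math.log(alpha)


for alpha in (0.1, 0.05):
    threshold = chi2_quantile_2dof(alpha)
    print(f"alpha={alpha}: D={d:.2f}, threshold={threshold:.2f}, anomaly={d > threshold}")
```

With these numbers an anomaly is flagged at a significance level of 0.1 but not at 0.05, which illustrates how a higher level of significance makes the detector more sensitive.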

Path Shapes Modeling
- The shape of a runtime path is defined to be the ordered set of component classes
- A path shape is represented as a tree in which each node represents a component class
- A directed edge represents the causal relationship between two adjacent nodes

Path Shapes Modeling
- A probabilistic context-free grammar (PCFG), in Chomsky Normal Form (CNF), is used for path shape modeling
- A list of terminal symbols, Tk
  - The component classes in a path shape form Tk
- A list of nonterminal symbols, Ni
  - Denote the stages of the production rules
  - N1: start symbol, often denoted as S
  - $: the end of a rule
  - All other nonterminal symbols are to be replaced by production rules (see below)
- A list of production rules, Ni -> zj (zj is a list of terminals and nonterminals)
- A list of probabilities Rij = P(Ni -> zj)

Path Shape Modeling: Example
- Path shapes for 4 end-user requests
- A call transits from A to B with 100% probability
  - R1j: S -> A, p=1.0
  - R2j: A -> B, p=1.0

Path Shape Modeling: Example
- For B, 3 possible transitions: to C with 25% probability, to D with 25%, and to both C and D with 50%
  - R3j: B -> C, p=0.25 | B -> D, p=0.25 | B -> CD, p=0.5
- Once a call reaches C or D, it must transit to E, hence:
  - R4j: C -> E, p=1.0
  - R5j: D -> E, p=1.0
- E is the last stop for all paths
  - R6j: E -> $, p=1.0
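
The rule probabilities in this example can be estimated by counting how often each transition occurs in the training paths. The sketch below makes two assumptions: the four path shapes are hypothetical reconstructions chosen to match the stated probabilities (the original figure is not reproduced here), and each shape is encoded as a dictionary mapping a component class to the tuple of classes it calls.

```python
from collections import Counter, defaultdict


def estimate_rules(shapes) -> dict:
    """Estimate P(N -> z) as the relative frequency of z among the expansions of N."""
    counts = defaultdict(Counter)
    for shape in shapes:
        for parent, children in shape.items():
            counts[parent][tuple(children)] += 1
    return {
        parent: {rhs: n / sum(c.values()) for rhs, n in c.items()}
        for parent, c in counts.items()
    }


# Four hypothetical requests: one via C, one via D, two via both C and D.
via_c = {"S": ("A",), "A": ("B",), "B": ("C",), "C": ("E",), "E": ("$",)}
via_d = {"S": ("A",), "A": ("B",), "B": ("D",), "D": ("E",), "E": ("$",)}
via_cd = {"S": ("A",), "A": ("B",), "B": ("C", "D"),
          "C": ("E",), "D": ("E",), "E": ("$",)}

rules = estimate_rules([via_c, via_d, via_cd, via_cd])
print(rules["B"])
```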

Path Shape Modeling: Anomaly Detection
- The path shapes of new requests are checked to see whether they conform to the grammar
- An anomaly is detected if a path shape does not conform to the grammar
- The PCFG by itself only detects faults; it does not pinpoint the root cause (fault localization)
  - Another method, such as a decision tree, is needed for that
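
Checking conformance can be sketched as a lookup of every transition of a new path shape in the grammar. The grammar below transcribes the example rules from the preceding slides; the dictionary encoding of shapes and the function names are hypothetical, and treating "some transition has no rule" as the anomaly criterion is a simplification.

```python
GRAMMAR = {
    "S": {("A",): 1.0},
    "A": {("B",): 1.0},
    "B": {("C",): 0.25, ("D",): 0.25, ("C", "D"): 0.5},
    "C": {("E",): 1.0},
    "D": {("E",): 1.0},
    "E": {("$",): 1.0},
}


def shape_probability(shape: dict, grammar: dict = GRAMMAR) -> float:
    """Product of the rule probabilities used by the shape; 0.0 if a rule is missing."""
    probability = 1.0
    for parent, children in shape.items():
        p = grammar.get(parent, {}).get(tuple(children), 0.0)
        if p == 0.0:
            return 0.0
        probability *= p
    return probability


def is_anomalous(shape: dict, grammar: dict = GRAMMAR) -> bool:
    """A shape that uses a transition absent from the grammar does not conform."""
    return shape_probability(shape, grammar) == 0.0


normal = {"S": ("A",), "A": ("B",), "B": ("C",), "C": ("E",), "E": ("$",)}
skips_b = {"S": ("A",), "A": ("C",), "C": ("E",), "E": ("$",)}
print(shape_probability(normal), is_anomalous(normal), is_anomalous(skips_b))
```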

Microreboot
- Microreboot: many problems can be fixed by simply restarting the faulty component
  - Works best with component-based systems
- System design guidelines
  - Component-based: such as Java EE, with EJB
  - Separate application logic execution from state management
    - A reboot should not cause state loss
  - Loose coupling: to enable localized microreboot
    - Reduce dependencies among components: each is either self-contained, or its interactions with other components are mediated (e.g., via the Java EE container)
    - Key: any instance of the referenced component should be able to get the job done => when one instance undergoes a microreboot, another instance can provide the same service
  - Resilient inter-component interactions
  - Lease-based resource management

Microreboot
- Automatic recovery with microreboot
  - The system is equipped with a fault monitor and a recovery manager
  - The fault monitor implements some of the fault detection and localization algorithms described previously
  - The recovery manager recovers the system from the fault recursively: it first microreboots the identified faulty component; if the symptom does not disappear, it microreboots a group of components chosen according to a fault-dependency graph
  - If microrebooting does not work, the entire system is rebooted
  - The last resort is to notify a human operator
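
The escalation performed by the recovery manager can be sketched as a loop over progressively larger groups of components. Everything here is hypothetical: the function signature, the callbacks, the component names, and the simulated fault. In a real system the groups would come from the fault-dependency graph and the callbacks would talk to the middleware.

```python
def recover(escalation_groups, microreboot, symptom_gone, reboot_system, notify_operator):
    """Try each group in turn, then a full reboot, then hand over to a human."""
    for level, group in enumerate(escalation_groups, start=1):
        for component in group:
            microreboot(component)
        if symptom_gone():
            return f"microreboot level {level}"
    reboot_system()
    if symptom_gone():
        return "system reboot"
    notify_operator()
    return "operator notified"


# Simulated fault that clears only once "inventory" has been restarted.
rebooted = []
outcome = recover(
    escalation_groups=[["cart"], ["cart", "inventory"]],
    microreboot=rebooted.append,
    symptom_gone=lambda: "inventory" in rebooted,
    reboot_system=lambda: rebooted.append("<system>"),
    notify_operator=lambda: rebooted.append("<operator>"),
)
print(outcome, rebooted)
```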

Microreboot
- Fault-dependency graph (f-map): consists of components as nodes and fault-propagation paths as edges
  - Can be obtained using automatic failure-path inference (AFPI)
- AFPI
  - The f-map is constructed by observing the system’s behavior when faults are injected
  - The f-map is then refined during normal operation
- Cycles in the f-map: the nodes in a cycle are grouped into a single node, and the entire group is microrebooted as a single unit; f-map => r-map
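
Grouping the nodes of each cycle into a single node is the same as collapsing the strongly connected components of the f-map. The sketch below uses Kosaraju's algorithm for that step; this is a swapped-in standard graph technique for illustration, not necessarily how AFPI derives the r-map, and the graph encoding and component names are hypothetical.

```python
def collapse_cycles(f_map: dict) -> dict:
    """Return an r-map: each node is a frozenset of components rebooted together."""
    nodes = set(f_map) | {v for targets in f_map.values() for v in targets}

    # Pass 1: record nodes in order of DFS completion on the f-map.
    order, seen = [], set()

    def visit(u):
        seen.add(u)
        for v in f_map.get(u, ()):
            if v not in seen:
                visit(v)
        order.append(u)

    for u in sorted(nodes):
        if u not in seen:
            visit(u)

    # Pass 2: DFS on the reversed graph, in reverse completion order.
    reverse = {u: [] for u in nodes}
    for u, targets in f_map.items():
        for v in targets:
            reverse[v].append(u)
    root_of = {}

    def assign(u, root):
        root_of[u] = root
        for v in reverse[u]:
            if v not in root_of:
                assign(v, root)

    for u in reversed(order):
        if u not in root_of:
            assign(u, u)

    # Build the groups and the edges between different groups.
    members = {}
    for u, root in root_of.items():
        members.setdefault(root, set()).add(u)
    group = {u: frozenset(members[root]) for u, root in root_of.items()}
    r_map = {g: set() for g in group.values()}
    for u, targets in f_map.items():
        for v in targets:
            if group[u] != group[v]:
                r_map[group[u]].add(group[v])
    return r_map


# Hypothetical f-map with a cycle between B and C.
r_map = collapse_cycles({"A": ["B"], "B": ["C"], "C": ["B", "D"], "D": []})
print(r_map)
```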

Microreboot
- Automatic recovery with microreboot
  - Reboot both the reported faulty component and all components immediately downstream from it
  - If the faulty symptom persists, the upstream component in the r-map is also microrebooted
  - Recovery is carried out recursively until the entire system is rebooted

Microreboot
- Implications of microreboot
  - Microreboot faulty components before node-level failure
  - Tolerate more false positives
  - Proactive microreboot for software rejuvenation
  - Enhanced fault transparency for end-users

Overcoming Operator Errors
- System dependability is significantly reduced by human errors
- Checkpointing and logging are useful but not sufficient
- Operating system level
- State repair and selective replay
- System-level undo (rewind), repair, system-level redo (replay)
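
The undo, repair, redo cycle can be illustrated on a toy key-value store. This is a heavily simplified sketch under stated assumptions: the `UndoableStore` class is hypothetical, state is a plain dictionary, and the "repair" is modeled as dropping the erroneous logged operations before replaying the rest. Real system-level undo/redo must also deal with externally visible effects, which this sketch ignores.

```python
import copy


class UndoableStore:
    def __init__(self):
        self.state = {}
        self._checkpoint = {}
        self._log = []          # operations applied since the last checkpoint

    def checkpoint(self) -> None:
        self._checkpoint = copy.deepcopy(self.state)
        self._log = []

    def apply(self, key, value) -> None:
        self._log.append((key, value))
        self.state[key] = value

    def undo_repair_redo(self, is_erroneous) -> None:
        """Rewind to the checkpoint, drop erroneous operations, replay the others."""
        self.state = copy.deepcopy(self._checkpoint)      # undo (rewind)
        kept = [op for op in self._log if not is_erroneous(op)]  # repair
        self._log = []
        for key, value in kept:                             # redo (replay)
            self.apply(key, value)


store = UndoableStore()
store.apply("quota", 10)
store.checkpoint()
store.apply("quota", 0)         # operator error
store.apply("user", "alice")    # legitimate work done after the error
store.undo_repair_redo(lambda op: op == ("quota", 0))
print(store.state)
```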

Exercise 1. Identify the set of most recent checkpoints that can be used to recover the system shown here after the crash of P1

Exercise 2. The Chandy and Lamport distributed snapshot protocol is used to produce a consistent global state of the system shown below. Draw all control messages sent in the CL protocol and the checkpoints taken at P1 and P2, and specify the channel state for the P0 to/from P1 channels, the P1 to/from P2 channels, and the P2 to/from P0 channels

Exercise 3: Prove that the Chandy and Lamport Distributed Snapshot Protocol produces consistent checkpoints of the system.

Exercise 4: The following interactions occurred at instance b1 during a period in which the total number of invocations at b1 was 1200: the remote invocations on b1 by A numbered 300, and the local method invocations on C, D, E and F numbered 200, 300, 200 and 200, respectively. If the observed remote invocations between A and b1 and the observed local method invocations on C, D, E and F number 35, 25, 20, 15 and 25, respectively, are anomalies present in the system?