Heading Off Correlated Failures through Independence-as-a-Service CS 523 Paper Presentation Kai Huang
Abstract Today’s systems rely on redundancy to ensure reliability. Complex, multi-layered hardware/software stacks may share deep, hidden dependencies. These may undermine redundancy efforts and introduce unanticipated correlated failures.
Abstract Solution: Independence-as-a-Service (INDaaS). An architecture to audit the independence of redundant systems proactively. Uses pluggable dependency acquisition modules to collect structural dependency information (network, hardware, software…). Quantifies the independence of systems using pluggable auditing modules.
Introduction Example: A glitch on one Amazon Elastic Block Store (EBS) server disabled the EBS service. Led to correlated failures across multiple Elastic Compute Cloud (EC2) instances. Disabled applications designed for redundancy across these EC2 instances.
Introduction Existing techniques usually require human intervention (slow). Correlated failures can be hidden by non-transparent business contracts between cloud providers (e.g., EC2 and Azure were disabled at the same time because a storm took down a local power source and backup generator).
Introduction Proposes Independence-as-a-Service (INDaaS). Collects and audits structural dependency data to evaluate the independence of redundant systems before failures occur. Consists of: pluggable dependency acquisition modules that collect dependency data; pluggable auditing modules that quantify independence and identify common dependencies. Builds on traditional fault analysis techniques. Supports independence auditing even across mutually distrustful cloud providers who may be unwilling to share full dependency data (private independence auditing, or PIA).
Architecture Overview
Step 1: The auditing client, Alice, specifies to the auditing agent what services she wishes to audit and in what way. This specification includes: a) the relevant data sources; b) the level of redundancy desired; c) the types of components and dependencies to be considered; and d) the metrics used to quantify independence.
Step 2: The auditing agent issues a request to each data source Alice specified.
Step 3: Each specified data source uses one or more dependency acquisition modules to collect the dependency data for future independence auditing.
Step 4: In the private independence auditing (PIA) case, the data sources collaborate to obtain the auditing results without revealing their proprietary dependency data to each other.
Step 5: Each data source returns to the auditing agent either the full dependency data for structural independence auditing or, in the PIA case, the collaboratively computed independence auditing results.
Step 6: The auditing agent returns to Alice an auditing report quantifying the independence of various redundancy deployments, optionally computing some useful information such as estimates of correlated failure probabilities and ranked lists of potential risk groups.
Architecture Overview Three main types of entities. Auditing client: requests an audit of the independence of cloud systems; may request a one-time or periodic independence audit. Dependency data sources: providers of cloud systems (computation, storage, and networking components). Auditing agent: mediates the interaction between the auditing client and the data sources; constructs a dependency graph based on data from the data sources; then processes the dependency graph and quantifies its independence, or identifies any unexpected common dependencies, using a set of pluggable independence auditing modules.
Dependency Acquisition A sample distributed storage system. Suppose an auditing client desires two-way redundancy for her service running on two of the three servers S1-S3 within her cloud. She submits to the auditing agent a specification indicating: 1) the IP addresses of the three servers, and 2) the relevant software components running on these servers. The current prototype requires the auditing client to list software components of interest manually, e.g., Query Engine and Riak (a distributed database) in this example. With this specification, the auditing agent invokes the dependency acquisition modules (i.e., NSDMiner, lshw, and apt-rdepends) on each server to collect the network, hardware, and software dependencies, and stores them in the DepDB.
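The specification in this example can be sketched in a few lines of Python. This is only an illustration: the `AuditSpec` class, its field names, and the IP addresses (taken from the documentation range 192.0.2.0/24) are assumptions, not part of INDaaS. It shows that two-way redundancy over three servers yields three candidate deployments to audit.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class AuditSpec:
    """Hypothetical container for what the client sends to the auditing agent."""
    servers: dict      # server name -> IP address (placeholder addresses)
    software: tuple    # software components of interest, listed manually
    redundancy: int    # desired level of redundancy (2 = two-way)


def candidate_deployments(spec):
    """Enumerate every choice of `redundancy` servers out of those listed."""
    return list(combinations(sorted(spec.servers), spec.redundancy))


spec = AuditSpec(
    servers={"S1": "192.0.2.1", "S2": "192.0.2.2", "S3": "192.0.2.3"},
    software=("Query Engine", "Riak"),
    redundancy=2,
)
deployments = candidate_deployments(spec)
print(deployments)  # [('S1', 'S2'), ('S1', 'S3'), ('S2', 'S3')]
```

Each candidate pair would then be audited separately so that the client can pick the most independent one.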
Dependency Acquisition Three main categories of dependency. Network dependency – a route from source to destination via various network components (e.g., routers). Hardware dependency – a physical component, e.g., a disk or CPU of a server; the Hw field denotes the physical component, Type specifies its type (CPU, disk, RAM, etc.), and Dep specifies its model number. Software dependency – the package information of a software component; the Pgm field denotes the software component, Hw specifies the hardware on which it runs, and Dep lists the packages it uses.
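A minimal sketch of the three record types, assuming Python dataclasses as the container. The class names, the sample values, and the choice to use a hardware record's model number as its shared identifier are all illustrative assumptions; only the field names Hw, Type, Dep, and Pgm come from the slide. The sketch flattens each server's records into a component set and intersects two sets to expose common dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkDep:
    src: str
    dst: str
    route: tuple   # network components on the path from src to dst


@dataclass(frozen=True)
class HardwareDep:
    hw: str        # Hw: the physical component
    type: str      # Type: CPU, disk, RAM, ...
    dep: str       # Dep: model number of the component


@dataclass(frozen=True)
class SoftwareDep:
    pgm: str       # Pgm: the software component
    hw: str        # Hw: hardware on which it runs
    dep: tuple     # Dep: packages it uses


def component_set(records):
    """Flatten one server's dependency records into a set of component names."""
    components = set()
    for rec in records:
        if isinstance(rec, NetworkDep):
            components.update(rec.route)
        elif isinstance(rec, HardwareDep):
            components.add(rec.dep)      # assumption: model number is the identifier
        elif isinstance(rec, SoftwareDep):
            components.update(rec.dep)
    return components


# Made-up records for two servers.
records = {
    "S1": [NetworkDep("S1", "Internet", ("ToR1", "Core1")),
           HardwareDep("disk0", "Disk", "model-A"),
           SoftwareDep("Riak", "S1", ("libc6", "libssl"))],
    "S2": [NetworkDep("S2", "Internet", ("ToR1", "Core2")),
           HardwareDep("disk0", "Disk", "model-B"),
           SoftwareDep("Riak", "S2", ("libc6", "libssl"))],
}
common = component_set(records["S1"]) & component_set(records["S2"])
print(sorted(common))  # ['ToR1', 'libc6', 'libssl']
```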
Independence Auditing Two scenarios: Structural independence auditing (data sources are willing to provide full dependency data) Private independence auditing (support analysis across multiple cloud providers unwilling to reveal full dependency data)
Independence Auditing – structural independence auditing Generate an explicit dependency graph representation Adapt traditional fault tree models to a directed acyclic graph structure (DAG)
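One way to picture such a graph, as a rough sketch rather than the INDaaS data structure: each internal node carries an AND gate (redundancy, fails only if all children fail) or an OR gate (fails if any child fails), and leaves are basic failure events. The node names below are invented. Because the switch ToR1 is a child of both servers, the structure is a DAG rather than a tree.

```python
# Hypothetical fault graph: node -> (gate, children).
# Names that never appear as keys are basic failure events (leaves).
FAULT_GRAPH = {
    "service": ("AND", ["S1", "S2"]),   # redundant servers: both must fail
    "S1": ("OR", ["ToR1", "disk1"]),    # any failed dependency takes S1 down
    "S2": ("OR", ["ToR1", "disk2"]),    # ToR1 is shared with S1
}


def fails(node, failed, graph=FAULT_GRAPH):
    """Return True if `node` fails when the basic events in `failed` occur."""
    if node not in graph:               # leaf: failed only if listed
        return node in failed
    gate, children = graph[node]
    results = [fails(child, failed, graph) for child in children]
    return all(results) if gate == "AND" else any(results)


print(fails("service", {"disk1"}))           # False: S2 still works
print(fails("service", {"ToR1"}))            # True: shared switch defeats redundancy
print(fails("service", {"disk1", "disk2"}))  # True: both replicas lose their disk
```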
Independence Auditing – structural independence auditing Generalize the representation to express dependencies at three different levels of detail. Component-set – the most basic level of detail. Fault-set – additionally assigns a weight to each component and a probability to each failure event. Fault graph – assume a single level of redundancy across data sources
Independence Auditing – structural independence auditing Determine Risk Groups Minimal RG algorithm Failure sampling algorithm Ranking Risk Groups Size-based ranking Failure probability ranking
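The sketch below illustrates both steps on a tiny invented fault graph. It is a simplified stand-in, not the paper's algorithms: the minimal-RG search is a brute-force enumeration of subsets, the sampling routine fails each basic event with probability 0.5, and the failure probabilities are made up and assumed independent.

```python
import math
import random
from itertools import combinations

# Hypothetical fault graph: node -> (gate, children); other names are leaves.
GRAPH = {
    "service": ("AND", ["S1", "S2"]),
    "S1": ("OR", ["ToR1", "disk1"]),
    "S2": ("OR", ["ToR1", "disk2"]),
}
PROB = {"ToR1": 0.01, "disk1": 0.2, "disk2": 0.2}  # made-up, independent


def fails(node, failed, graph):
    """Evaluate whether `node` fails given the set of failed basic events."""
    if node not in graph:
        return node in failed
    gate, children = graph[node]
    results = [fails(c, failed, graph) for c in children]
    return all(results) if gate == "AND" else any(results)


def basic_events(graph):
    children = {c for _, cs in graph.values() for c in cs}
    return sorted(children - set(graph))


def minimal_risk_groups(graph, top):
    """Brute force: smallest subsets first, skipping supersets of known RGs."""
    found = []
    events = basic_events(graph)
    for size in range(1, len(events) + 1):
        for combo in combinations(events, size):
            group = frozenset(combo)
            if any(known <= group for known in found):
                continue
            if fails(top, group, graph):
                found.append(group)
    return found


def sample_risk_groups(graph, top, rounds, seed=0):
    """Randomly fail basic events and keep the sets that bring down `top`."""
    rng = random.Random(seed)
    events = basic_events(graph)
    groups = set()
    for _ in range(rounds):
        failed = frozenset(e for e in events if rng.random() < 0.5)
        if fails(top, failed, graph):
            groups.add(failed)   # may be non-minimal
    return groups


minimal = minimal_risk_groups(GRAPH, "service")
by_size = sorted(minimal, key=len)                       # size-based ranking
by_probability = sorted(
    minimal, key=lambda g: math.prod(PROB[e] for e in g), reverse=True
)                                                        # failure probability ranking
sampled = sample_risk_groups(GRAPH, "service", rounds=50)
```

With these invented numbers the two rankings disagree: size-based ranking puts the single shared switch first, while probability ranking puts the pair of disks first because 0.2 × 0.2 exceeds 0.01.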
Independence Auditing – private independence auditing Existing general approach: Use secure multi-party computation to compute and reveal overlap among the datasets of multiple cloud providers while keeping the data themselves private Problem: scales poorly due to complexity
Independence Auditing – private independence auditing Trust assumption. Three main types of entities: auditing client, cloud providers, auditing agent. Assume that auditing clients are potentially malicious and wish to learn as much as possible about the cloud providers’ private dependency data.
Independence Auditing – private independence auditing Techniques. Jaccard similarity: compute Jaccard similarity based on MinHash. Private set intersection cardinality protocol (P-SOP) – allows a group of parties, each with a local dataset, to compute the number of overlapping elements without learning any elements in other parties’ datasets.
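As a rough illustration of the first technique, the sketch below computes the exact Jaccard similarity |A ∩ B| / |A ∪ B| and a MinHash estimate of it. The salted SHA-256 hash family, the function names, and the sample component names are assumptions made for this example; the privacy-preserving part (P-SOP) is not shown here.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _hash(seed, item):
    """One member of a hash family, derived by salting SHA-256 with `seed`."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, m):
    """Keep the minimum hash value under each of m hash functions.

    `items` must be non-empty.
    """
    return [min(_hash(seed, x) for x in items) for seed in range(m)]


def minhash_jaccard(a, b, m=256):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    sig_a, sig_b = minhash_signature(a, m), minhash_signature(b, m)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / m


# Made-up component sets: 50 shared components out of 150 distinct ones.
provider_a = {f"component-{i}" for i in range(0, 100)}
provider_b = {f"component-{i}" for i in range(50, 150)}
exact = jaccard(provider_a, provider_b)           # 50 / 150
estimate = minhash_jaccard(provider_a, provider_b)
```

The estimate approaches the exact value as m grows; a lower similarity suggests fewer shared components and therefore a more independent pair of providers.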
Independence Auditing – private independence auditing Generate a local dependency graph at each cloud provider. Normalize the local dependency graphs to ensure that the same component shared across different cloud providers has the same identifier. Use P-SOP to compute the number of common/unique components across cloud providers. Use MinHash to deal with large datasets: if the cloud providers in a potential redundancy deployment have large component-sets, PIA uses M hash functions based on the MinHash technique to map each such component-set to a much smaller dataset Si, and then takes these MinHash-generated datasets as input to P-SOP as normal to get the number of common components across cloud providers.
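The normalization and counting steps can be imitated with a toy two-party protocol. This is not P-SOP as used in the paper and is not secure: it swaps in a textbook commutative encryption (exponentiation modulo a prime), draws keys from a seeded non-cryptographic generator, and omits the shuffling a real protocol would need. The `normalize` rule and all names are assumptions. The point is only that each side can blind its identifiers so that equal components still match after both keys are applied.

```python
import hashlib
import math
import random

P = 2**127 - 1  # a Mersenne prime used as a toy modulus


def normalize(name):
    """Hypothetical normalization: lower-case and collapse whitespace."""
    return " ".join(name.lower().split())


def to_group(name):
    """Hash a normalized identifier to an integer in [1, P - 1]."""
    digest = hashlib.sha256(normalize(name).encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def keygen(rng):
    """Pick an exponent coprime to P - 1 so that x -> x**k mod P is a bijection."""
    while True:
        k = rng.randrange(3, P - 1)
        if math.gcd(k, P - 1) == 1:
            return k


def encrypt(values, key):
    """Commutative blinding: (x**a)**b == (x**b)**a modulo P."""
    return {pow(v, key, P) for v in values}


def common_count(set_a, set_b, seed=0):
    rng = random.Random(seed)          # toy only: real keys need a CSPRNG
    key_a, key_b = keygen(rng), keygen(rng)
    a_once = encrypt(map(to_group, set_a), key_a)   # done by provider A
    b_once = encrypt(map(to_group, set_b), key_b)   # done by provider B
    a_twice = encrypt(a_once, key_b)                # B blinds A's values
    b_twice = encrypt(b_once, key_a)                # A blinds B's values
    return len(a_twice & b_twice)                   # only the count is revealed


shared = common_count(
    {"ToR Switch 1", "libc6", "Core-A"},
    {"tor switch 1", "libc6 ", "Core-B"},
)
print(shared)  # 2
```

Without the normalization step, "ToR Switch 1" and "tor switch 1" would hash to different values and the shared switch would go undetected.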
Limitations and Practical Issues Accurate failure probability acquisition may be challenging Only takes static software dependency into account – potential solution: access logs and configuration scripts Cloud providers may not have incentive to join Cloud providers may not behave honestly
Implementation and Deployment Auditing client written in Python. Dependency acquisition modules (written in Python) include three open-source tools: NSDMiner, lshw, apt-rdepends. Auditing agent (written in Python) uses the NetworkX library and collects dependency data from the dependency acquisition modules over SSH.
Evaluation Common Network Dependency. Across 190 different two-way redundancy deployments, 27 do not have unexpected RGs. Without INDaaS, a random selection has only a 14% probability of avoiding unexpected RGs.
Evaluation: Efficiency vs. Accuracy
Conclusion INDaaS: an architecture to audit the independence of redundant service deployments in the cloud.
Thank you!