1 Fortifying Distributed Applications against Errors
Saurabh Bagchi
School of Electrical and Computer Engineering
Department of Computer Science
Purdue University
Presentation available at: engineering.purdue.edu/dcsl

2 Roadmap
Dependability basics
- System design principles
- Terminology and basic approaches
Performance interference in cloud environments (Middleware ‘14, ICAC ‘15)
Recovery from failures in data centers (Eurosys ‘16)

3 What is Dependable Computing?
Dependability: the property that a computer system meets its specification despite the presence of faults.
Faults can be due to natural causes (bugs, defects in hardware) or maliciously induced (attacks from external or internal sources).
Terminology:
- Fault: the adjudged or hypothesized cause of an error.
- Error: the part of the system state that has been damaged by the fault and that, if uncorrected, can lead to a failure.
- Failure: a deviation of the delivered service from compliance with the specification.
Fault, error, failure chain: some faults are never activated, and some errors are masked; only a failure is externally visible. Fault tolerance is based on detecting latent errors before they become active and then replacing the erroneous part of the state with an error-free state.
Example: a programmer's mistake is a fault in the written software. When the system executes the erroneous instructions with certain data values, say x < 0, the fault becomes active and produces an error, say some pixels in error. If the output of the program deviates from the correct value as a result of this error, that is called a failure. But if the image is finally cropped and the erroneous pixels are left out, then the error does not become a failure.

4 Two Facets of Dependable Computing
Mercedes version:
- High hardware or software development costs
- High power consumption
- High space overhead
- Example: Boeing 777 fly-by-wire (FBW) system, which used triple modular redundancy (TMR) for all hardware resources, including the computing system, airplane electrical power, hydraulic power, and communication paths
- Example: AT&T's ESS telecommunication switch, which had a requirement of downtime < 2 minutes/year
Commodity systems:
- Cannot have very high development costs
- Cannot impede the performance of the system significantly
- Cannot take recourse to very high redundancy
- Example: ECC, parity, RAID

5 How do We Achieve the Objectives?
Layers, bottom to top: hardware (processing elements, memory, storage system, system network), operating system, middleware, application program interface (API), applications.
Hardware level: error-correcting codes, N-of-M and standby redundancy, voting, watchdog timers, reliable storage (RAID, mirrored disks).
Operating system level: reliable communications, memory management, detection of process failures, hooks to support software-implemented fault tolerance (SIFT) for applications.
Middleware and application levels: checkpointing and rollback, application replication, software voting (fault masking), process pairs, robust data structures, recovery blocks, N-version programming, CRC on messages, acknowledgments, watchdogs, heartbeats, consistency protocols.
Example: if the OS provides process health monitoring, then an application composed of multiple processes does not need to actively probe the health of all its processes.
Example: if the memory has ECC, then the OS does not need to verify, by reading back what it has just written, that the memory is behaving correctly.

6 Roadmap
Dependability basics
Performance interference in cloud environments (Middleware ‘14, ICAC ‘15)
- Performance interference in cloud
- Optimal parameter settings for a server
- Our solution - automatic reconfiguration: IC2
- Evaluation: IC2
- Motivation for an integrated configuration controller
- Our solution - load balancer + IC2: ICE
- Evaluation: ICE
Recovery from failures in data centers (Eurosys ‘16)

7 Problem Statement
Long and variable latency of cloud services degrades user experience.
Interference between VMs on the same physical machine is a key factor behind such latency perturbations.

8 Running Web Applications in the Cloud
(Figure: multiple hosts, each running a hypervisor with co-located VMs hosting web servers, application servers, and databases, connected to shared storage over the network.)

9 Imperfect Performance Isolation due to Shared Hardware Resources
Multi-core cache sharing: per-core L1 caches feed a shared L2 (last-level) cache.
Other shared resources:
- Memory bandwidth
- Network/IO
- Translation Lookaside Buffer (TLB)
The TLB is stored on the processor and is used for mapping virtual memory addresses to physical memory addresses.

10 Mitigating Performance Interference in Clouds
Performance of one VM suffers due to the activity of another co-located VM.
Why it happens: low-level hardware resources are not partitioned well; contention for cache, memory bandwidth, or network can degrade performance.
Our experiments with Amazon EC2 show that the performance of web servers can suffer drastically during interference:
- Cloudsuite application benchmark
- m1.large VM instances (2 cores, 7.5 GB)
- Run for 100 hours
- Tail latency ~ 55x the median on EC2; ~ 4x the median on the private cloud
Our results indicate that there are several instances when interference lasted for 30 s or longer, the longest duration being 140 s. While these interference instances are a small portion of the total number of requests, two conditions suggest we need to deal with them: they are unpredictable, and therefore worst-case provisioning for performance-critical applications suggests we must put in place mechanisms to deal with them; and when interferences do happen, they cause pathologically poor behavior of the application and may push the application into a "death spiral", such as due to filling of application queues.

11 Remediation Techniques
Traditional techniques for remediation:
- Better VM placement [Paragon, ASPLOS 2013]
- Hypervisor scheduling [Q-Clouds, Eurosys 2010]
- Dynamic live migration [DeepDive, ATC 2013]
These require changes in the hypervisor and are not feasible in a public cloud.
Our approach:
- Requirements: user-level control; fast response during interference
- Key idea: reconfigure the application to handle the change in operating environment (interference)
- IC2: Interference-aware Cloud application Configuration

12 Solution Overview

13 IC2: Agenda
Performance Interference in Cloud
Our approach
Solution Overview
Interference vs. Middleware Parameters
Interference Detection
Configuration Controller
IC2 in Operation
Key Results

14 Interference vs. Middleware Parameters
Setup:
- Servers: PowerEdge T320, Xeon E processor, 6 (12) cores, 16 GB memory
- Application: Cloudsuite (Olio, a social calendar application)
- Middleware: Apache + PHP-FPM
- Topology: Server 1 runs KVM with the web server VM and the interference VM; Server 2 runs the database; Server 3 runs the clients.

15 Interference vs. Middleware Parameters
Middleware parameters studied:
- Thread-pool parameters: Apache MaxClients (MXC); PHP-FPM pm.max_children (PhpMaxChildren)
- Timeout parameter: Apache KeepaliveTimeout (KAT)
Interference workloads:
- Dcopy from BLAS (cache read + write)
- LLCProbe from Ristenpart et al., CCS ‘12 (cache read)
- Varying sizes of Dcopy are used to create different levels of contention.
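For concreteness, here is a minimal Python sketch of a Dcopy-style cache interference generator. This is not the BLAS-based tool used in the experiments, and the buffer sizes are placeholders; it simply streams a large array copy in a loop so that the working set overflows the last-level cache and produces cache read + write traffic.

```python
import numpy as np

def dcopy_interference(size_bytes=1_500_000_000, iterations=None):
    """Stream a large array copy in a loop to generate LLC read+write traffic."""
    n = size_bytes // 8                       # number of float64 elements
    src = np.zeros(n, dtype=np.float64)       # source buffer
    dst = np.empty_like(src)                  # destination buffer
    i = 0
    while iterations is None or i < iterations:
        np.copyto(dst, src)                   # streaming copy evicts co-runners' cache lines
        i += 1

if __name__ == "__main__":
    dcopy_interference(size_bytes=200_000_000, iterations=50)   # smaller bounded run for testing
```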

16 Choice of Optimal Apache Parameters
Optimal MXC changes with interference.
Optimal KAT changes with interference.
Both depend on the degree of interference, so dynamic reconfiguration is needed.
In the baseline case (Dcopy 0 MB), the best response time is obtained at a certain MXC setting; however, the optimal value reduces to 1100 under Dcopy-1.5GB and LLC-15MB interference. With the default value of MXC, the response time under interference would be 70% higher, so the MXC parameter needs to be reconfigured under interference. For the KAT plot, MXC was kept constant at its baseline value. Interference from co-located VMs increases the optimal KAT value.

17 Parameter Dependency
Parameter dependency changes with interference.
The rule of thumb KAT = MXC / (#new connections per second) is no longer valid during interference.
With interference, a smaller MXC and a larger KAT are needed.
Are the two parameters independent? No: the parameters are dependent, and the exact nature of the dependency changes with interference. Without interference the response-time curve has a negative slope, while with interference it is a convex curve, so the choice of optimal KAT for a given MXC is significantly different in the two cases. As a general observation, lower KAT is better at baseline while higher KAT is better during interference.
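A quick worked example of the baseline rule of thumb above; the numbers are illustrative and not taken from the paper.

```python
def kat_rule_of_thumb(mxc, new_connections_per_sec):
    """Baseline heuristic: KAT = MXC / (#new connections per second), i.e. keep a
    connection alive roughly for the time it takes arrivals to fill the thread pool."""
    return mxc / new_connections_per_sec

print(kat_rule_of_thumb(mxc=2000, new_connections_per_sec=400))   # -> 5.0 (seconds)
```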

18 Root Cause Analysis
Optimal configuration values with interference: optimal MXC decreases, while optimal KAT and PHP values increase.
Server capacity with interference: the CPU saturates sooner.
Interference increases the CPU utilization of the web-server (WS) VM.
Interference increases the memory utilization of the WS VM; CPI and cache miss rate (CMR) rise.
Due to the large number of cache misses induced by the interference VM, the WS VM uses more of its CPU cycles fetching data from memory to cache, and consequently the CPI increases. It can be seen from Fig. 4(b) that the CPI values for the WS with interference are between 2 and 2.25, whereas the baseline CPI is only 1.5. This implies that, on average, a WS thread takes longer to finish execution. The overall effect is that a larger fraction of the WS VM's time slice is occupied by some busy thread, which is reflected as increased CPU utilization inside the guest VM.

19 Agenda: IC2
Performance Interference in Cloud
Our approach
Solution Overview
Interference vs. Middleware Parameters
Interference Detection
Configuration Controller
IC2 in Operation
Key Results

20 Solution Overview
Questions that we answer:
- How to detect interference?
- Which parameters to reconfigure during interference?
- How to determine the new parameter values?

21 Interference Detection
IC2 workflow: interference detection feeds the configuration controller.
Interference detection:
- Uses a decision-tree classifier.
- In EC2, uses system and application metrics to detect interference; load per operation (LPO) is a key indicator.
- Challenge: capturing the metric variations caused by configuration changes themselves.
More details on the decision tree are in the paper.
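A minimal sketch of a decision-tree interference detector, not the authors' implementation: the feature set, training samples, and labels below are placeholders chosen for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Placeholder feature vectors: [LPO, response_time_s, cpu_util_pct, throughput_rps]
X_train = [
    [0.020, 0.15, 55.0, 2600.0],   # baseline interval
    [0.021, 0.17, 60.0, 2500.0],   # baseline interval
    [0.045, 0.60, 92.0, 1800.0],   # interval under interference
    [0.050, 0.75, 95.0, 1600.0],   # interval under interference
]
y_train = ["baseline", "baseline", "interference", "interference"]

clf = DecisionTreeClassifier(max_depth=3)    # small tree, easy to inspect
clf.fit(X_train, y_train)

def classify_interval(sample):
    """Classify the metrics of one measurement interval."""
    return clf.predict([sample])[0]

print(classify_interval([0.048, 0.70, 94.0, 1650.0]))   # -> "interference"
```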

22 State Manager
In EC2, buffer states are used to deal with transient interference and noisy data: the system reconfigures only after two successive periods are classified as under interference. This also masks classifier errors.
We find empirically that a sharp increase in CMR and CPI is a leading indicator of interference; these counters are available on the local testbed but not on the EC2 testbed.
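A minimal sketch of the buffer-state idea (the exact state machine in IC2 may differ): reconfiguration fires only after two consecutive intervals are classified as interference, so a single noisy classification is absorbed.

```python
class StateManager:
    """Require N consecutive 'interference' classifications before reconfiguring."""

    def __init__(self, required=2):
        self.required = required
        self.streak = 0

    def update(self, label):
        """Feed one per-interval classification; return True when reconfiguration should fire."""
        self.streak = self.streak + 1 if label == "interference" else 0
        return self.streak >= self.required

sm = StateManager()
for label in ["baseline", "interference", "interference", "baseline"]:
    print(label, "->", sm.update(label))   # fires only on the 2nd consecutive interference
```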

23 Configuration Controller
The choice of parameters is driven by a knowledge base, created from the empirical results shown earlier; it can also be created by expert administrators.
Our heuristic (sketched below):
- Decrease MXC based on the proportional increase in LPO.
- Increase KAT based on the proportional increase in response time.
- For PHP, use two constant values (no-interference, interference).
Implementation: we modified Apache to handle graceful parameter updates; the modified server is called httpd-online.
Instead of using raw CPU utilization, which may show sharp fluctuations due to the stochastic nature of request arrivals, we use a normalized metric, Load Per Operation (LPO), defined as LPO = CPU_util / Throughput. We also define another derived metric, WorkDone = RT * THPT * CPU_util; intuitively, WorkDone approximates the number of CPU cycles spent to serve all the requests during the current measurement interval.
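The following Python sketch illustrates the shape of this heuristic; the gain constants, bounds, and the two PHP settings are assumptions, not the values used in IC2.

```python
def reconfigure(params, baseline, current, mxc_gain=1.0, kat_gain=1.0):
    """One reconfiguration step: shrink MXC with rising LPO, grow KAT with rising RT.

    params   : current settings, e.g. {"MXC": 2000, "KAT": 5, "PHP": 64}
    baseline : no-interference reference, e.g. {"LPO": 0.02, "RT": 0.15}
    current  : metrics from the latest interval, same keys as baseline
    """
    lpo_increase = (current["LPO"] - baseline["LPO"]) / baseline["LPO"]
    rt_increase = (current["RT"] - baseline["RT"]) / baseline["RT"]

    new = dict(params)
    new["MXC"] = max(100, int(params["MXC"] * (1 - mxc_gain * lpo_increase)))  # fewer threads
    new["KAT"] = min(60, int(params["KAT"] * (1 + kat_gain * rt_increase)))    # longer keep-alive
    new["PHP"] = 32 if lpo_increase > 0.2 else 64                              # two fixed settings
    return new

print(reconfigure({"MXC": 2000, "KAT": 5, "PHP": 64},
                  {"LPO": 0.02, "RT": 0.15},
                  {"LPO": 0.03, "RT": 0.30}))
# -> {'MXC': 1000, 'KAT': 10, 'PHP': 32}
```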

24 Agenda
Performance Interference in Cloud
Our approach
Interference vs. Middleware Parameters
Solution Overview
Interference Detection
Configuration Controller
IC2 in Operation
Key Results
Conclusion

25 IC2 in Operation
Setup:
- EC2: m1.large VMs (2 vCPUs, 7.5 GB memory); web server co-located with an interference VM; periodic interference of varying intensity and type (LLCProbe, Dcopy)
- Private testbed: VMs configured to match the EC2 specifications
Metrics to consider:
- Improvement in response time during interference
- Detection latency
- Detection accuracy

26 IC2 Improves Response Time
The effects of interference last longer in EC2.
The default Apache distribution has a high overhead of reconfiguration when applying new values; httpd-online solves this and reduces the overhead.

27 Results
IC2 improved response time by up to 40% in the private testbed and up to 29% in EC2 during interference.
Median interference detection latency: 15 s in the private testbed; 20 s in the EC2 testbed.
Classifier accuracy: interference detection has 89% recall and 73% precision; the majority of misclassifications are due to Interference or No-interference being detected as Transient.
From the onset of interference (red line in Fig. 9) up to 60 seconds is considered the first half; this is the period when interference detection and reconfiguration take place, effectively showing the overhead of IC2, especially in the case of httpd-basic. From 60 s after the onset until the interference stops (green line in Fig. 9) is considered the second half; this is the steady-state performance of IC2 during interference. Overall, IC2 showed higher improvement in response time on the local testbed, since it was able to compute Δ(MXC) and Δ(KAT) more precisely (no ambient interference as in EC2). We find that the response time improvements are significant considering the simplicity of our controller. It further establishes our point that, in a cloud deployment, an application configuration manager must be interference-aware.

28 Summary: IC2
Interference causes severe performance degradation in the cloud.
Optimal application configurations change during interference.
Web services can mitigate the effects of interference by reconfiguration.
We presented the design and implementation of IC2, which reconfigures web servers during interference.
Our evaluations showed a 40% reduction in response time in the private testbed and a 29% reduction in EC2.

29 Agenda
Introduction
Contributions
Review: Dependability of Smartphones
- Study of failures in Android and Symbian
- Robustness testing of Android ICC
Dependability of Cloud Applications
- IC2: Mitigating interference by middleware reconfiguration
- ICE: Two-level configuration engine for WS clusters
Directions for Future Research
Summary

30 ICE: An Integrated Configuration Engine for Interference Mitigation
Motivation:
- IC2 improves response time by configuring WS parameters, but WS reconfiguration is costly and limited.
- Use the residual capacity in a WS cluster efficiently.
Objectives:
- Make reconfiguration (interference mitigation) faster.
- Make existing load balancers interference-aware.
- Get better response time during interference (than IC2).
We use HAProxy as our baseline load balancer.

31 ICE Overview
Two-level reconfiguration:
1. Update the load-balancer weights: less overhead, more agile.
2. Update middleware parameters: only for long interferences; reduces the overhead of idle threads.

32 ICE Design
We use hardware counters for interference detection:
- Faster detection
- Hypervisor access is not required if the counters are virtualized

33 ICE: Load Balancer Reconfiguration
Objective: keep the WS VM's CPU utilization below a threshold Uthres.
If the predicted CPU utilization is above the threshold, find a new request rate that brings it below the threshold.
The request rate (RPS) determines the server's weight value in the load-balancer configuration.
An empirical function is used for load estimation: the predicted utilization is estimated from the past utilization, the CPI (an indicator of interference), and the RPS (a sketch follows below).
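A minimal sketch of this level of reconfiguration; the form of the load estimator and its coefficients are placeholders, not the empirical function fitted in the paper.

```python
def predict_util(past_util, cpi, rps, a=0.5, b=20.0, c=0.01):
    """Placeholder load estimator: higher CPI (more interference) and higher RPS
    both push the predicted utilization up. Coefficients are assumptions."""
    return a * past_util + b * (cpi - 1.0) + c * rps

def new_weight(weight, past_util, cpi, rps, u_thres=70.0):
    """Scale the load-balancer server weight down so that the predicted
    utilization at the reduced request rate falls below u_thres."""
    u_pred = predict_util(past_util, cpi, rps)
    if u_pred <= u_thres:
        return weight                          # no change needed
    scale = u_thres / u_pred                   # proportional back-off of the request rate
    return max(1, int(weight * scale))

print(new_weight(weight=100, past_util=85.0, cpi=2.2, rps=1500))   # -> 85
```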

34 Evaluation: Experimental Setup
Cloudsuite benchmark with different interferences.
We evaluate ICE with two different load-balancer scheduling policies:
- Weighted Round Robin (WRR, or simply RR): shows the performance of a static configuration.
- Weighted Least Connection (WLC, or simply LC): shows the performance of an out-of-the-box dynamic load balancer.

35 Response Time
(Response-time plots for Round Robin (RR) and Least Connection (LC), with annotations at 200 ms and 400 ms.)
ICE improves response time under both RR and LC.
LC (out of the box) reduces the effect of interference significantly, but occasional spikes remain.
ICE reduces the frequency of these spikes.

36 Results
ICE improves median response time by up to 94% compared to a static configuration (RR).
ICE improves median response time by up to 39% compared to a dynamic load balancer (LC).
Median interference detection latency: 3 s using ICE (versus 15-20 s for IC2).

37 ICE: Summary
The effect of interference can be mitigated by reducing load on the affected VM.
We presented ICE, a two-level configuration engine for WS clusters.
ICE improves median response time by up to 94% compared to a static configuration and by up to 39% compared to a dynamic out-of-the-box load balancer.
Median interference detection latency is 3 s.

38 Roadmap
Dependability basics
Performance interference in cloud environments (Middleware ‘14, ICAC ‘15)
Recovery from failures in data centers (Eurosys ‘16)
- Repairing from a storage failure: erasure coding
- Latency problem of EC storage
- Our solution: PPR
- Recovering from concurrent failures: m-PPR
- Evaluation of PPR and m-PPR

39 Need for storage redundancy
Data center storage is frequently affected by unavailability events:
- Unplanned unavailability: component failures, network congestion, software glitches, power failures
- Planned unavailability: software/hardware updates, infrastructure maintenance
How storage redundancy helps:
- Prevents permanent data loss (reliability)
- Keeps the data accessible to the user (availability)

40 Replication for storage redundancy
Keep multiple copies of the data on different machines: the data is divided into chunks, and each chunk is replicated multiple times.
Replication is the most widely used storage redundancy technique. The key idea is to store multiple replicas on different machines in such a way that not all the copies are lost at the same time in case of an unavailability event. In practice, the data is divided into chunks, and each chunk is replicated multiple times for fault tolerance; for example, triple replication can tolerate up to 2 failures out of the 3 available replicas. However, replication is not the ideal redundancy scheme for large amounts of data due to its storage overhead and the associated costs.

41 Erasure coded (EC) storage
Data is divided into k data chunks, from which m parity chunks are computed; the set of k+m chunks is called a stripe. A stripe can survive up to m chunk failures. Reed-Solomon (RS) is the most popular coding method.
Erasure coding has much lower storage overhead than replication while providing the same or better reliability, which has made it a very attractive alternative for distributed storage.
Example for 10 TB of data (redundancy method: total storage required, failures tolerated):
- Triple replication: 30 TB, 2 failures
- RS (k=6, m=3): 15 TB, 3 failures
- RS (k=12, m=4): 13.34 TB, 4 failures
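A short script reproducing the overhead numbers in the table above; triple replication is modeled as a (1, 2) code (three copies, two failures tolerated) purely for comparison.

```python
def total_storage_tb(data_tb, k, m):
    """Total raw storage for an RS(k, m) stripe layout: every k data chunks carry m parity chunks."""
    return data_tb * (k + m) / k

schemes = {
    "Triple replication": (1, 2),   # 3 copies, tolerates 2 failures
    "RS(k=6, m=3)":       (6, 3),
    "RS(k=12, m=4)":      (12, 4),
}

for name, (k, m) in schemes.items():
    print(f"{name}: {total_storage_tb(10, k, m):.2f} TB, tolerates {m} chunk failures")
```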

42 The repair problem in EC storage
Example: a (4, 2) RS code with chunk size 256 MB, with data and parity chunks distributed over servers S1 through S7.
A key problem prevents wide-scale adoption of erasure-coded storage: the reconstruction time for a missing chunk is very long. If server S1 goes down, the missing chunk is re-created on a new destination, S7, using four of the surviving chunks, say from servers S2 through S5. S7 first collects all of the required chunks and then performs the decoding computation to re-create the lost chunk. A network bottleneck is created at the link leading to S7, because all of the chunks must pass through that link; this network bottleneck slows down the repair process.

43 The repair problem in EC storage (2)
Repair time in EC is much longer than in replication.
Example for 10 TB of data (redundancy method: total storage required, failures tolerated, # chunks transferred during a repair):
- Triple replication: 30 TB, 2 failures, 1 chunk
- RS (k=6, m=3): 15 TB, 3 failures, 6 chunks
- RS (k=12, m=4): 13.33 TB, 4 failures, 12 chunks
If a chunk fails under replication, we can simply fetch a copy from one of its replicas. In erasure-coded storage, we need multiple chunk transfers over a network link, and chunk sizes are often large, for example 256 MB. For RS (k=12, m=4), that is 12 x 256 MB, about 24 Gbits of data transferred over a particular link for a single repair.

44 What triggers a repair?
Repairs are triggered in one of two ways:
- Regular repairs: a monitoring process finds unavailable chunks; the chunk is re-created on a new server.
- Degraded reads: a client finds missing or corrupted chunks while reading; the chunk is re-created at the client, on the critical path of the user application.

45 Existing solutions
Keep additional parities: needs additional storage. Huang et al. (ATC 2012), Sathiamoorthy et al. (VLDB 2013)
Mix replication and erasure coding: higher storage overhead than EC. Xia et al. (FAST 2015), Ma et al. (INFOCOM 2013)
Repair-friendly codes: restricted parameters. Khan et al. (FAST 2012), Xiang et al. (SIGMETRICS 2010), Hu et al. (FAST 2012), Rashmi et al. (SIGCOMM 2014)
Delay repairs: depends on policy; immediate repair is still needed for degraded reads. Silberstein et al. (SYSTOR 2014)
There is a large body of work that has attempted to solve this problem; the approaches can be broadly classified into these four categories.

46 Network transfer time dominates
We propose an approach which is orthogonal to the previous work and is motivated by the observation that up to 94% of the repair time is taken by the network transfer. PPR attempts to solve this problem by redistributing the reconstruction traffic more uniformly across the existing network links, thereby improving the utilization of network resources and decreasing reconstruction time.

47 Our solution approach
We introduce Partial Parallel Repair (PPR), a distributed repair technique targeted at reducing network transfer time:
- No additional storage
- No restrictions on the code
- Significantly lower repair time
In PPR, the repair is done not by a single node but by multiple nodes working concurrently.

48 Key insight: partial calculations
Our solution is based on key insights into the mathematical calculations involved in erasure-coded storage.
Encoding: to compute the parity chunks, the data chunks are multiplied by a matrix with special coefficients.
Repair: during repair of a missing chunk, one of two operations happens depending on whether a parity or a data chunk was lost. If a parity chunk is lost, it is simply re-encoded. If a data chunk is lost, a set of coefficients is computed from the surviving chunks and the lost chunk is reconstructed as a linear combination of the existing chunks.
Both equations have the same form, a summation of product terms (e.g., a2*C2 + a3*C3 + a4*C4 + a5*C5). The summation is associative, so the individual terms, and partial sums of them, can be calculated in parallel.
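The following toy sketch illustrates the associativity argument. Ordinary integer arithmetic stands in for the Galois-field arithmetic of a real RS code, and the coefficients and server assignments are placeholders; the point is only that pairwise partial sums give the same result as summing all terms at the destination.

```python
import numpy as np

rng = np.random.default_rng(0)
chunks = {i: rng.integers(0, 256, size=8, dtype=np.int64) for i in (2, 3, 4, 5)}  # surviving chunks C2..C5
coeffs = {2: 3, 3: 7, 4: 2, 5: 5}                                                 # decoding coefficients (placeholders)

# Traditional repair: the destination receives all k chunks and evaluates the full sum.
traditional = sum(coeffs[i] * chunks[i] for i in (2, 3, 4, 5))

# PPR: partial sums are computed near the sources and only aggregates travel onward.
partial_a = coeffs[2] * chunks[2] + coeffs[3] * chunks[3]   # computed at one intermediate server
partial_b = coeffs[4] * chunks[4] + coeffs[5] * chunks[5]   # computed at another intermediate server
ppr_result = partial_a + partial_b                          # combined at the new destination

assert np.array_equal(traditional, ppr_result)   # same reconstruction, no link carries all k chunks
```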

49 Partial Parallel Repair Technique
Traditional repair: a network bottleneck is created because all the computation is done at the new destination, so all k chunks travel over the single link leading to the destination server.
Partial Parallel Repair: the repair proceeds in a few logical steps. In each step, partial results are calculated in parallel at multiple nodes and sent to a peer node instead of every node sending directly to the destination; in the next logical step, those peers send the aggregated results to the final destination. No link carries all k chunks, so no bottleneck is created. In the (4, 2) example over servers S1-S7, one surviving server computes a2*C2 + a3*C3, another computes a4*C4 + a5*C5, and the destination S7 combines them into a2*C2 + a3*C3 + a4*C4 + a5*C5.
This bottleneck reduction is possible because the size of each partial result is exactly the same as the original chunk size: |a2*C2| = |a2*C2 + a3*C3|.

50 PPR communication patterns
Traditional repair vs. PPR:
- Network transfer time: O(k) vs. O(log2(k+1))
- Repair traffic flow: many-to-one vs. more evenly distributed
- Amount of transferred data: the same
Similar communication patterns arise for other RS coding parameters. The main advantage of PPR comes from better distributing the repair traffic and aggregating it like a tree: it can be shown that PPR takes only about O(log2(k+1)) time to complete the network transfer instead of O(k) time as in the traditional case, although the total amount of data transferred is exactly the same as in traditional repair.

51 When is PPR most useful?
Network transfer times during repair (a worked comparison follows below):
- Traditional RS (k, m): (chunk size / bandwidth) * k
- PPR RS (k, m): (chunk size / bandwidth) * ceil(log2(k+1))
(Plot: ratio of PPR time to traditional time as a function of k.)
PPR is most useful when:
- k is large: k is usually chosen based on the system's reliability requirement (Facebook uses k=10, Microsoft k=12, and some companies use k=20), and the effectiveness of PPR increases with k.
- The network is the bottleneck, since PPR reduces network transfer time.
- The chunk size is large, since larger chunks tend to create network bottlenecks.
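A small calculation of the two expressions above for a few values of k; the chunk size (256 MB) and link bandwidth (1 Gbps) are assumptions used only to make the numbers concrete.

```python
import math

def transfer_time_s(chunk_mb, bandwidth_mbps, k, ppr=False):
    """Idealized network transfer time for one repair, per the formulas above."""
    rounds = math.ceil(math.log2(k + 1)) if ppr else k
    return (chunk_mb * 8 / bandwidth_mbps) * rounds

for k in (4, 6, 10, 12, 20):
    t_trad = transfer_time_s(256, 1000, k)
    t_ppr = transfer_time_s(256, 1000, k, ppr=True)
    print(f"k={k:2d}: traditional {t_trad:5.1f} s, PPR {t_ppr:4.1f} s, ratio {t_ppr / t_trad:.2f}")
```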

52 Additional benefits of PPR
Maximum data transferred to/from any node is logarithmically lower. Implication: in a bandwidth-reservation-based system, less repair bandwidth needs to be reserved per node.
Computation is parallelized across multiple nodes. Implication: lower memory footprint per node and computation speedup.
PPR works as long as the encoding/decoding operations are associative. Implication: compatible with a wide range of codes, including RS, LRC, RS-Hitchhiker, Rotated-RS, and others proposed in prior work.

53 Can we try to reduce the repair time a bit more?
Disk I/O is the second dominant factor in the total repair time; we use a caching technique to bypass the disk I/O time.
The Repair Manager, a logically centralized entity that schedules repairs, keeps track of which chunks are cached on which server (chunk ID, last access time, server), so that a repair can read a chunk from a server's cache (for example, C1 in cache at server A, C2 in cache at server B) instead of from disk.
Our goal is to reduce the total repair time as much as possible to make erasure-coded storage a viable option. This caching technique is not specific to PPR and can also be used by the traditional repair technique.
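A minimal sketch of the bookkeeping the Repair Manager might do for this; the field names and the absence of an eviction policy are assumptions.

```python
import time

class ChunkCacheIndex:
    """Repair Manager's view of which chunk is cached on which server."""

    def __init__(self):
        self.entries = {}                          # chunk_id -> (server, last_access_time)

    def record_access(self, chunk_id, server):
        """Record that `server` recently read `chunk_id` and likely holds it in cache."""
        self.entries[chunk_id] = (server, time.time())

    def cached_source(self, chunk_id):
        """Prefer a cached copy as the repair source; None means fall back to disk."""
        entry = self.entries.get(chunk_id)
        return entry[0] if entry else None

index = ChunkCacheIndex()
index.record_access("C1", "A")
index.record_access("C2", "B")
print(index.cached_source("C1"), index.cached_source("C3"))   # -> A None
```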

54 Multiple simultaneous failures
m-PPR: a scheduling mechanism for running multiple PPR-based repair jobs.
At any point there can be failures in multiple stripes across the data center (chunk failures C1, C2, C3, ...). The Repair Manager schedules the repairs and chooses the source and destination servers for each. We propose a simple greedy heuristic, multi-PPR (m-PPR), that helps the Repair Manager prioritize these repairs and select the best source and destination servers, attempting to minimize resource contention and even out the load across all the repairs.

55 Multiple simultaneous failures (2)
Before scheduling each repair, the scheduler calculates a weight for every source and destination candidate (a selection sketch follows below):
- Wsrc = a1*hasCache - a2*(#reconstructions) - a3*userLoad
- Wdst = -b1*(#repair destinations) - b2*userLoad
The weights represent the "goodness" of a server for scheduling the next repair. The best k servers are chosen as the source servers, and similarly the best destination server is chosen.
All selections are subject to reliability constraints, e.g., chunks of the same stripe should be in separate failure domains/update domains.
Based on the timestamps collected as part of the caching, warm stripes are repaired first, followed by hot, followed by cold (a policy decision).
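A greedy-selection sketch based on the weight formulas above; the coefficient values and candidate data are placeholders, and the reliability-constraint filtering (failure/update domains) is omitted.

```python
def source_score(server, a1=1.0, a2=0.5, a3=0.5):
    """Wsrc = a1*hasCache - a2*(#reconstructions) - a3*userLoad (coefficients assumed)."""
    return a1 * server["has_cache"] - a2 * server["reconstructions"] - a3 * server["user_load"]

def dest_score(server, b1=0.5, b2=0.5):
    """Wdst = -b1*(#repair_destinations) - b2*userLoad (coefficients assumed)."""
    return -b1 * server["repair_destinations"] - b2 * server["user_load"]

def schedule_repair(candidate_sources, candidate_dests, k):
    """Greedy selection: the best k sources and the single best destination."""
    sources = sorted(candidate_sources, key=source_score, reverse=True)[:k]
    dest = max(candidate_dests, key=dest_score)
    return [s["name"] for s in sources], dest["name"]

srcs = [{"name": "S2", "has_cache": 1, "reconstructions": 0, "user_load": 0.2},
        {"name": "S3", "has_cache": 0, "reconstructions": 1, "user_load": 0.1},
        {"name": "S4", "has_cache": 1, "reconstructions": 2, "user_load": 0.6}]
dsts = [{"name": "S7", "repair_destinations": 0, "user_load": 0.3},
        {"name": "S8", "repair_destinations": 2, "user_load": 0.1}]
print(schedule_repair(srcs, dsts, k=2))   # -> (['S2', 'S4'], 'S7')
```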

56 Implementation and evaluation
Implemented on top of the Quantcast File System (QFS), which has an architecture similar to HDFS and existing support for erasure codes. The Repair Manager is implemented inside the Meta Server of QFS (the Meta Server is analogous to the NameNode in HDFS).
Evaluated with various coding parameters and chunk sizes.
Evaluated PPR with the Reed-Solomon code and two other repair-friendly codes (LRC and Rotated-RS).

57 Repair time improvements
(Plot: percentage reduction in repair time for different values of k and chunk sizes.)
PPR becomes more effective for higher values of k, as expected. A larger chunk size creates more network traffic, increasing the share of network transfer time in the repair, so the improvement also grows with chunk size. There are independent reasons why larger k and larger chunk sizes are useful in many systems.

58 Improvements for degraded reads
(Plot: degraded-read throughput as a function of the network bandwidth available to the client.)
PPR is also extremely useful for degraded reads, where the destination is the client, often behind a bandwidth-limited link. As the bandwidth available to the client decreases, PPR is able to sustain a much higher degraded-read throughput. PPR becomes more effective under constrained network bandwidth.

59 Relationship with chunk size
(Plot: repair time for different chunk sizes at a fixed coding parameter.)
For the same coding parameters, the advantage of PPR becomes more prominent at larger chunk sizes: a larger chunk size creates more network traffic, increasing the share of network transfer time, so the improvement is larger.

60 Benefits from caching Fig. 7e shows that chunk caching is more useful for lower values of k (e.g., (6, 3) code). For higher values of k or for higher chunk sizes, the benefit of caching becomes marginal because the improvement in the network transfer time dominates that of the disk IO time.

61 Compatibility with existing codes
PPR on top of LRC (Huang et al., ATC 2012) provides 19% additional savings.
PPR on top of Rotated Reed-Solomon (Khan et al., FAST 2012) provides 35% additional savings.
One of the most useful properties of PPR is its compatibility with a wide variety of codes. LRC and Rotated-RS are two codes proposed to reduce repair traffic; PPR can be used on top of them to obtain additional savings.

62 Improvements from m-PPR
(Plot: total repair time as the number of simultaneous failures increases.)
m-PPR makes a best effort to spread the repair operations across the data center so that resource contention is minimized. m-PPR can reduce repair time by 31%-47%. Its effectiveness decreases for a very large number of simultaneous failures, because in that case the network transfers are already spread widely and even a random selection of source and destination servers can be about as effective as m-PPR.

63 Concluding Insights: IC2 and ICE
The effect of interference can be mitigated by reducing load on the affected VM through a load balancer.
We presented ICE for two-level configuration in WS clusters:
- First level: reconfigure the load balancer.
- Second level: reconfigure the web service (only for longer-lasting interference).
ICE improves the median response time of a representative web service by 94% compared to a static configuration and by 39% compared to a dynamic out-of-the-box load balancer.
Median interference detection latency is low: 3 seconds.
The basic principle of ICE is also applicable to streaming servers.
Future work: handling other types of interference (network, storage, etc.); finding "useful" configuration parameters automatically.

64 Concluding Insights: PPR
Partial Parallel Repair (PPR) is a technique that distributes the repair task over multiple nodes and exploits concurrency.
PPR can reduce the total repair time by up to 60%; theoretically, the network transfer time is reduced by a factor of log(k)/k.
PPR is more attractive for higher values of k, larger chunk sizes, and when the repair is constrained by the network transfer.
PPR is compatible with any associative erasure code; in addition to Reed-Solomon, it works with a wide variety of codes proposed by prior research.

65 It Takes a Village
Todd, Martin, Rajesh, Greg (Google), Amiya, Ignacio, Dong, Moo-Ryong, Sasa (UIUC), Subrata (AT&T), and collaborators at Purdue and LLNL.

66 Thank You!


68 Backup Slides

69 The protocol
Here is the complete protocol for PPR.

