1 Smart Redundancy for Distributed Computation George Edwards Blue Cell Software, LLC Yuriy Brun University of Washington Jae young Bang University of Southern California Nenad Medvidovic University of Southern California

2 Distributed Computation Architectures Solve large computational problems and/or process large data sets Provide a platform and API for applications Transparently parallelize computation across a pool of computers Examples: – Clouds – Grids – Volunteer computing

3 DCA Applications Highly parallelizable problems – Find the 10^100th digit of π – Factor 2^2011 − 1 Driven by: – Basic research – Pharmaceutical applications – Web analytics – …

4 Volunteer Computing Attempts to leverage the more than 1 billion (mostly idle) machines on the Internet – Volunteers install a client – When idle, the client requests work from a server and sends back results Aids projects that have limited funding but large public appeal

5 Dealing with Faults Context: – Volunteers fail and maliciously return false results – Volunteers are not accountable – Malicious volunteers may collude – Well-formed but incorrect results are hard to detect – The reliability of volunteers is difficult to estimate Solution: – Redundancy and voting

6 System Model A task server subdivides computations into tasks The task server replicates each task into multiple identical jobs The task server assigns each job to a node in the node pool Nodes perform work, send results, and rejoin the pool New volunteer nodes may join the pool while other nodes may leave
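The system model above can be sketched in a few lines. This is only an illustration: the `Job` type, the `replicate` and `assign` helpers, and the round-robin assignment policy are assumptions made for the example, not details given on the slide.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """One replica of a task (hypothetical type for this sketch)."""
    task_id: int
    replica: int


def replicate(task_ids, copies):
    """Replicate each task into `copies` identical jobs."""
    return [Job(t, i) for t in task_ids for i in range(copies)]


def assign(jobs, node_pool):
    """Assign each job to a node; the node rejoins the pool afterwards.

    Round-robin order is an assumption of this sketch.
    """
    pool = deque(node_pool)
    assignment = []
    for job in jobs:
        node = pool.popleft()         # take an available node
        assignment.append((job, node))
        pool.append(node)             # node returns its result and rejoins
    return assignment


jobs = replicate([0, 1], copies=3)
plan = assign(jobs, ["node-a", "node-b"])
```

A real task server would also handle nodes joining and leaving while jobs are outstanding, which this sketch omits.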

7 k-vote Traditional Redundancy (TR) Performs k independent executions of each task Takes a vote on the correctness of the result Requires expending a factor of k resources or suffering a factor of k slowdown in performance Example: k = 19, r = 0.7
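To make the cost and benefit of k-vote traditional redundancy concrete, the sketch below computes the probability that a majority of k jobs returns the correct result. It assumes nodes fail independently with the same reliability r, and that all incorrect results agree with each other (a worst case, as with colluding volunteers); the function name is illustrative.

```python
from math import comb


def tr_reliability(k, r):
    """Probability that a strict majority of k independent jobs is correct.

    Assumes odd k, independent nodes with reliability r, and that every
    wrong result is the same wrong value (worst-case collusion).
    """
    majority = k // 2 + 1
    return sum(
        comb(k, i) * r**i * (1 - r) ** (k - i)
        for i in range(majority, k + 1)
    )


# The slide's example: 19 votes, 70% node reliability.
print(tr_reliability(19, 0.7))
```

Whatever reliability this yields, the cost is fixed: every task consumes exactly k jobs.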

8 Insights Redundant computations need not be simultaneous DCAs can dynamically adjust the level of redundancy based on run-time information k-vote traditional redundancy wastes computations Example: 19 independent computations (k = 19), 70% node reliability (r = 0.7): (0.7)^10 ≈ 2.8% of the time, the first 10 of them will return the correct result, and the last 9 results are irrelevant
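The arithmetic in this example can be checked directly. The helper name below is made up for the illustration; the numbers are the slide's.

```python
def prob_first_n_correct(r, n):
    """Probability that the first n independent results are all correct."""
    return r**n


k, r = 19, 0.7
majority = k // 2 + 1      # 10 matching results already decide a 19-way vote
wasted = k - majority      # remaining results cannot change the outcome

print(f"{prob_first_n_correct(r, majority):.4f}")  # prints 0.0282
print(wasted)                                      # prints 9
```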

9 k-vote Progressive Redundancy (PR) Distributes jobs in waves In each wave, distributes the minimum jobs needed to produce a consensus (assuming all agree) Repeats until a consensus is reached Example: k = 19, r = 0.7
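A minimal simulation of the wave scheme described above, under some simplifying assumptions that are not on the slide: results are binary (correct or wrong), nodes are independent with reliability r, and k is odd. The function name and return shape are illustrative.

```python
import random


def progressive_redundancy(k, r, rng):
    """Simulate one task under k-vote progressive redundancy.

    Returns (majority_is_correct, jobs_used). Assumes odd k, binary
    results, and independent nodes with reliability r.
    """
    needed = k // 2 + 1            # votes required for a consensus
    correct = wrong = 0
    while max(correct, wrong) < needed:
        # Smallest wave that could finish the vote if every job agrees
        # with the current leader.
        wave = needed - max(correct, wrong)
        for _ in range(wave):
            if rng.random() < r:
                correct += 1
            else:
                wrong += 1
    return correct > wrong, correct + wrong


ok, jobs = progressive_redundancy(19, 0.7, random.Random(0))
```

With k = 19 the first wave has 10 jobs, and a task never uses more than 19 jobs in total, so the cost is at most that of traditional redundancy.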

10 Insights The confidence level associated with a result can be computed k-vote progressive redundancy produces results with varying confidence Example: k = 19, r = 0.7 If the vote is 10-0, confidence level ≈ 99.98% If the vote is 10-9, confidence level = 70%
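One way to reproduce the two confidence levels in this example is a Bayesian calculation. The sketch assumes a binary result, a uniform prior over the two candidate answers, and independent nodes with reliability r; these assumptions are not stated on the slide, though they do yield the slide's numbers.

```python
def confidence(a, b, r):
    """Posterior probability that the majority answer is correct.

    a: votes for the majority answer, b: votes for the other answer.
    Assumes binary results, a uniform prior, and independent nodes
    with reliability r.
    """
    majority_right = r**a * (1 - r) ** b   # a nodes correct, b wrong
    majority_wrong = r**b * (1 - r) ** a   # b nodes correct, a wrong
    return majority_right / (majority_right + majority_wrong)


print(f"{confidence(10, 0, 0.7):.4f}")  # prints 0.9998
print(f"{confidence(10, 9, 0.7):.4f}")  # prints 0.7000
```

Under these assumptions the expression simplifies to r^(a−b) / (r^(a−b) + (1−r)^(a−b)), so the confidence depends only on the margin a − b, not on the total number of votes.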

11 Iterative Redundancy (IR) Distributes jobs in waves In each wave, distributes the minimum jobs required to achieve a desired confidence level Repeats until desired confidence level is reached Example: d = 4, r = 0.7
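A minimal simulation of iterative redundancy, assuming that the desired confidence level is expressed as a required margin d between the majority and minority vote counts (this is how the sketch reads "d = 4" in the example), that results are binary, and that nodes are independent with reliability r. The function name is illustrative.

```python
import random


def iterative_redundancy(d, r, rng):
    """Simulate one task under iterative redundancy with target margin d.

    Returns (majority_is_correct, jobs_used). Assumes binary results
    and independent nodes with reliability r (0.5 < r <= 1 in practice).
    """
    correct = wrong = 0
    while abs(correct - wrong) < d:
        # Smallest wave that reaches margin d if every job agrees
        # with the current leader.
        wave = d - abs(correct - wrong)
        for _ in range(wave):
            if rng.random() < r:
                correct += 1
            else:
                wrong += 1
    return correct > wrong, correct + wrong


ok, jobs = iterative_redundancy(4, 0.7, random.Random(0))
```

Unlike the k-vote schemes, there is no fixed upper bound on the number of jobs: a task whose early votes disagree keeps receiving jobs until the margin reaches d, while a task with unanimous early votes stops after d jobs.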

12 Algorithm Comparison System reliability approaches 1 exponentially for TR, PR, and IR IR produces the same reliability at a lower cost – Or, equivalently, higher reliability at the same cost IR is optimal with respect to cost – Guaranteed to use the minimum computation needed to achieve desired system reliability [Plot axes: Cost Factor, System Reliability]

13 Algorithm Comparison PR and IR perform best when the reliability of the node pool is high [Plot axes: Node Reliability, Ratio Improvement Over Traditional Recovery]

14 Adaptive Behavior IR maintains a constant system reliability as node reliability fluctuates – Injects redundancy where it is needed (“unlucky” situations) – Removes redundancy where it is unnecessary [Plot axes: Time, Node Reliability, Cost Factor, System Reliability]

15 Node Reliability Estimation Incorrectly estimating node reliability does not affect the performance of IR [Plot axes: Cost Factor, System Reliability]

16 Conclusions Iterative redundancy automatically replicates computation with optimal efficiency Iterative redundancy can be used when: – A computation can be broken down into independent tasks – Computation is performed by a pool of independent processing resources – Task deployment decisions can be made at runtime – The reliability of resources in the pool is unknown

17 For More Information To appear in ICDCS 2011: Smart Redundancy for Distributed Computation by Yuriy Brun, George Edwards, Jae young Bang and Nenad Medvidovic http://www.cs.washington.edu/homes/brun/pubs/pubs/Brun11icdcs.pdf