Robust Network Supercomputing with Malicious Processes (Reliably Executing Tasks Upon Estimating the Number of Malicious Processes). Kishori M. Konwar*, Sanguthevar Rajasekaran, Alexander A. Shvartsman. *Computer Science & Engineering Department, University of Connecticut, Storrs, CT
2 Motivation: Internet supercomputing is increasingly becoming a powerful tool for harnessing massive amounts of computational resources. With the availability of high-bandwidth Internet connections, there is an enormous number of processes around the world. This computing power comes at a cost substantially lower than acquiring a supercomputer or building a cluster of powerful machines.
6 PrimeNet Server: PrimeNet Server is a distributed, massively parallel scientific-computing Internet supercomputer that ranks among the most powerful computers in the world. The project comprises about 30,000 PCs and laptops. It currently sustains 22,296 billion floating-point operations per second (22,296 gigaflops); floating-point operations are operations that involve fractional numbers.
7 A project forming a massive distributed cooperative computer, used to analyze gigabytes of data for the Search for Extraterrestrial Intelligence (SETI). It comprises millions of volunteer machines around the world. The project reported its speed to be more than 57,290 billion floating-point operations per second.
8 Reliability Issues: The master and perhaps certain workers are reliable: they will correctly execute the tasks assigned by the server. However, workers are commonly unreliable: they may return incorrect results to the master due to unintended failures caused, e.g., by over-clocked processors, or they may deceivingly claim to have performed assigned work so as to obtain an incentive such as a higher rank.
10 Some Previous Studies: [FGLS05] assumed that worker processes might act maliciously and hence deliberately return wrong results. The goal is to design algorithms that enable the master to accept correct results with high probability at a lower cost. They provided a randomized algorithm; unfortunately, the cost complexity results depend on several parameters and are hard to interpret.
11 Some Previous Studies (cont’d): [GM05] considered the problem of maximizing the expected number of correct results when the tasks are dependent. Any worker computes correctly with probability p < 1, and any incorrectly computed task corrupts all dependent tasks. The goal is to compute a schedule that maximizes the expected number of correct results under a given time constraint. They showed the optimization problem to be NP-hard and provided some solutions on a restricted DAG.
12 Overview: Models of Computation; Stopping Rule Algorithm based solution; Detection of Faulty Processors; Performing Tasks with Faulty Workers; Conclusions
14 Models of Computation: Processes take steps in lockstep, i.e., synchronously. Processes communicate by exchanging messages. The tasks are independent and idempotent. Processes are subject to failures and can maliciously return incorrect results. There are n workers, P = {1, 2, ..., n}, and a master M.
15 Work Complexities: Work complexity, or available processor steps [CDS01]: all steps taken by processes during the execution of the algorithm are counted, including the steps of idling and waiting non-faulty processes. Task-oriented work [DHW92]: work is defined as the number of performed tasks, counting multiplicities; this approach does not charge for idling and waiting.
16 Few Comments: We say that an event E occurs with high probability (w.h.p.) to mean that $\Pr[E] = 1 - O(n^{-\alpha})$ for some constant $\alpha > 0$.
17 Modeling Failures: Failure model $F_a$: An f-fraction, 0 < f < ½, of the n workers may fail. Each possibly faulty worker independently exhibits faulty behavior with probability 0 < p < ½. The master has no a priori knowledge of f and p.
18 Modeling Failures (cont’d): Failure model $F_b$: There is a fixed bound on the f-fraction, 0 < f < ½, of the n workers that can be faulty. Any worker from the remaining (1-f)-fraction of the workers fails with probability 0 < p < ½, independently of other workers. The master knows the values of f and p.
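The two failure models can be made concrete with a small simulation. The following is a minimal sketch, not the authors' code: it assumes each worker is summarized by a single per-task fault probability, that "faulty" workers in $F_b$ always answer incorrectly (the slides do not specify their behavior), and that a wrong answer is simply the correct value plus one. The function names are hypothetical.

```python
import random

def make_workers_fa(n, f, p, rng):
    """Model F_a (as read from the slides): an f-fraction of workers is
    possibly faulty and errs with probability p; the rest never err."""
    suspects = set(rng.sample(range(n), int(f * n)))
    return [p if w in suspects else 0.0 for w in range(n)]

def make_workers_fb(n, f, p, rng):
    """Model F_b: an f-fraction of workers is faulty (assumed here to
    always err); every remaining worker errs with probability p."""
    bad = set(rng.sample(range(n), int(f * n)))
    return [1.0 if w in bad else p for w in range(n)]

def execute(fault_prob, correct_result, rng):
    """Return the correct result, or a wrong one with probability fault_prob.
    The wrong value (correct_result + 1) is an illustrative assumption."""
    if rng.random() < fault_prob:
        return correct_result + 1
    return correct_result

if __name__ == "__main__":
    rng = random.Random(0)
    workers = make_workers_fa(n=100, f=0.25, p=0.4, rng=rng)
    print(sum(1 for q in workers if q > 0), "possibly faulty workers")
```

Representing a worker by one probability keeps the two models interchangeable in later experiments: only the list of probabilities changes.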
19 Algorithmic Template
procedure for master process M, task T:
  Choose a set $S \subseteq P$
  Send task T to each processor $p \in S$
  Wait for the results from the processes in S
  Decide on the result value v from the responses
procedure for worker $w \in P$:
  Wait to receive a task from master M
  Upon receiving a task from M:
    Execute the task
    Send the result to M
21 $(\epsilon, \delta)$-approximation algorithm: Z is a random variable distributed in the interval [0,1] with mean $\mu_Z$. $Z_1, Z_2, \ldots$ are independently and identically distributed according to the random variable Z. An $(\epsilon, \delta)$-approximation algorithm, with $0 < \epsilon < 1$, $\delta > 0$, for estimating $\mu_Z$ satisfies $\Pr[\mu_Z(1-\epsilon) \le \tilde{\mu}_Z \le \mu_Z(1+\epsilon)] > 1 - \delta$, where $\tilde{\mu}_Z$ is the estimated value of $\mu_Z$.
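22 Stopping Rule Algorithm [Dagum, Karp, Luby, and Ross 1995]
  Input parameters: $(\epsilon, \delta)$ with $0 < \epsilon < 1$, $\delta > 0$
  Let $\Upsilon_1 = 1 + (1+\epsilon)\Upsilon$   // $\lambda = 0.72$ and $\Upsilon = 4\lambda \log(2/\delta)/\epsilon^2$
  Initialize $N \leftarrow 0$, $S \leftarrow 0$
  While $S < \Upsilon_1$ do: $N \leftarrow N+1$, $S \leftarrow S + Z_N$
  Output: $\tilde{\mu}_Z \leftarrow \Upsilon_1 / N$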
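The Stopping Rule Algorithm translates almost line for line into code. The sketch below is illustrative only: the function name `stopping_rule` and the `sample` callback are assumptions, $\lambda$ is taken as $e - 2 \approx 0.72$, and the logarithm is assumed to be the natural logarithm.

```python
import math
import random

def stopping_rule(sample, eps, delta):
    """Estimate the mean of a [0,1]-valued random variable.

    `sample` is a zero-argument callable returning one draw Z_N.
    Returns (estimate, number_of_draws)."""
    lam = math.e - 2                                   # about 0.72
    upsilon = 4 * lam * math.log(2 / delta) / eps ** 2
    upsilon1 = 1 + (1 + eps) * upsilon
    n, s = 0, 0.0
    while s < upsilon1:                                # stop once the sum reaches Upsilon_1
        n += 1
        s += sample()
    return upsilon1 / n, n

if __name__ == "__main__":
    rng = random.Random(1)
    # Bernoulli(0.3) draws, e.g. "did a randomly chosen worker answer wrongly?"
    est, draws = stopping_rule(lambda: 1.0 if rng.random() < 0.3 else 0.0,
                               eps=0.1, delta=0.05)
    print(f"estimate={est:.4f} after {draws} draws")
```

Note that the number of draws is not fixed in advance: it grows roughly as $\Upsilon_1 / \mu_Z$, so rare events need proportionally more samples.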
23 Stopping Rule Theorem: Theorem (Stopping Rule Theorem) [Dagum, Karp, Luby, and Ross]: Let Z be a random variable in [0,1] with $\mu_Z = E[Z] > 0$. Let $\tilde{\mu}_Z$ be the estimate produced and let $N_Z$ be the number of experiments that SRA runs with respect to Z on input $\epsilon$ and $\delta$. Then, (i) $\Pr[\mu_Z(1-\epsilon) \le \tilde{\mu}_Z \le \mu_Z(1+\epsilon)] > 1 - \delta$, (ii) $E[N_Z] \le \Upsilon_1 / \mu_Z$, and (iii) $\Pr[N_Z > (1+\epsilon)\Upsilon_1 / \mu_Z] \le \delta/2$.
24 Algorithm $A_{f,p}$ to estimate f and p
25 Work Complexity of $A_{f,p}$: Theorem: Algorithm $A_{f,p}$ is an $(\epsilon, \delta)$-approximation algorithm, $0 < \epsilon < 1$, $\delta > 0$, for the estimation of f and p with work complexity $O(\log^2 n)$, complexity $O(n \log n)$, message complexity $O(\log^2 n)$ and time complexity $O(\log n)$, with high probability.
27 Detection of Faulty Processors: Lemma: It is not possible to perform all n tasks correctly in the failure model $F_a$ with linear complexity (i.e., $O(n)$) with high probability.
28 Detection of Faulty Processors
procedure for master process M:
  Initially, $F \leftarrow \emptyset$
  For t = 0, …, k log n, k > 0
    Choose a set $S \subseteq P \setminus F$
    Send each process $p \in S$ a “test” task
    Wait for the results from the processes in S
    If a response is faulty
      $F \leftarrow F \cup \{p : p \text{ is a faulty process}\}$
    End If
  End For
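The detection loop can be simulated as follows. This is a simplified sketch under stated assumptions: each worker is represented only by its per-test fault probability, the master knows the correct answer to every "test" task, and every worker not yet flagged is tested in each round (the slide leaves the choice of S open, so this sketch does not reproduce the O(n) work bound). The name `detect_faulty` is hypothetical.

```python
import math
import random

def detect_faulty(fault_probs, k, rng):
    """Run about k*log2(n) test rounds and return the set of flagged workers.

    fault_probs[w] is the probability that worker w answers a test wrongly."""
    n = len(fault_probs)
    rounds = max(1, math.ceil(k * math.log2(n)))
    flagged = set()                        # the set F on the slide
    for _ in range(rounds):
        for w in range(n):                 # S = P \ F (simplifying assumption)
            if w in flagged:
                continue
            if rng.random() < fault_probs[w]:
                flagged.add(w)             # wrong answer on a test task
    return flagged

if __name__ == "__main__":
    rng = random.Random(7)
    probs = [0.4] * 30 + [0.0] * 70        # 30 possibly faulty workers
    found = detect_faulty(probs, k=5, rng=rng)
    print(len(found), "workers flagged")
```

A worker with fault probability p escapes all r rounds with probability $(1-p)^r$, which is polynomially small in n when r is proportional to log n; this is the intuition behind the logarithmic number of rounds.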
29 Detection of Faulty Processors: Lemma: The algorithm detects all faulty processes among the n workers in $O(\log n)$ time with $O(n)$ work, with high probability. Theorem [Karp 04]: Suppose that a(x) is a non-decreasing, continuous function that is strictly increasing on {x | a(x) > 0}, and m(x) is a continuous function. Then for every positive real x and every positive integer t, $\Pr[T(x) > u(x) + t\,a(x)] \le (m(x)/x)^t$, where u(x) is the solution to the equation $u(x) = a(x) + u(m(x))$, with $m^0(x) := x$ and $m^{i+1}(x) := m(m^i(x))$.
31 Performing Tasks under $F_a$
procedure for master process M:
  Initially, $C \leftarrow \emptyset$, $J \leftarrow$ set of n tasks
  Randomly choose a set, possibly with repetition, $S \subseteq P$, $|S| = kn/\log n$ workers, where k > 0 is a constant
  For i = 1, …, k' log n, k' > 0
    Send to each worker $p \in S$ a “test” task
    Collect the responses from all the workers
  End For
  If all the responses from a worker $p \in S$ are correct then
    $C \leftarrow C \cup \{p\}$
  End If
  For i = 1, …, n/|C|
    Send |C| jobs from J, not sent in previous iterations, one to each worker in C
    Collect the responses from the workers in C
  End For
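The two phases of this procedure (certify a sample of workers, then hand the real tasks only to certified workers) can be sketched as below. The sketch makes several assumptions not found in the slides: workers are represented by fault probabilities, tasks are integers whose correct result is their square, a wrong result is the square plus one, and the function name `perform_tasks_fa` is hypothetical.

```python
import math
import random

def perform_tasks_fa(fault_probs, tasks, k, k_prime, rng):
    """Return (results, certified) where results maps task -> reported value."""
    n = len(fault_probs)
    log_n = max(1.0, math.log2(n))
    sample_size = max(1, math.ceil(k * n / log_n))
    # Sampling with repetition; duplicates collapse in the set S.
    candidates = {rng.randrange(n) for _ in range(sample_size)}
    test_rounds = math.ceil(k_prime * log_n)
    # C: workers that answered every test task correctly.
    certified = [w for w in sorted(candidates)
                 if all(rng.random() >= fault_probs[w] for _ in range(test_rounds))]
    if not certified:
        raise RuntimeError("no worker passed all tests")
    results = {}
    # About n/|C| rounds, each sending one fresh task to each certified worker.
    for start in range(0, len(tasks), len(certified)):
        batch = tasks[start:start + len(certified)]
        for w, task in zip(certified, batch):
            wrong = rng.random() < fault_probs[w]
            results[task] = task * task + (1 if wrong else 0)
    return results, certified

if __name__ == "__main__":
    rng = random.Random(3)
    probs = [0.4] * 64 + [0.0] * 192
    results, certified = perform_tasks_fa(probs, list(range(256)), 2, 5, rng)
    print(len(certified), "certified workers,", len(results), "tasks done")
```

With |C| on the order of n/log n, the second phase needs on the order of log n rounds, which matches the shape of the time bound stated for this algorithm.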
32 Work and Time Complexities: Theorem: The algorithm performs all n tasks correctly in $O(\log n)$ time and has $O(n)$ complexity under both work measures, with high probability.
34 Performing Tasks under $F_b$
procedure for master process M:
  For t = 0, …, k log n, k > 0
    Choose a random permutation $\pi \in_R S_n$
    Foreach $j \in [n]$
      Send task j to processor $\pi(j)$
    End For
    Collect the responses from all the workers
  End For
  Foreach $j \in [n]$
    Choose the majority of the results of computation for task j as the result
  End For
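The permutation-and-majority scheme can be simulated in a few lines. This is a sketch under assumptions that go beyond the slides: faulty workers always return the same wrong value (a colluding worst case for majority voting), the correct result of task j is j squared, and the name `perform_tasks_fb` is hypothetical.

```python
import math
import random
from collections import Counter

def perform_tasks_fb(always_faulty, p, n, k, rng):
    """Run about k*log2(n) rounds; return the majority result for each task."""
    rounds = math.ceil(k * math.log2(n))
    votes = [Counter() for _ in range(n)]
    for _ in range(rounds):
        perm = list(range(n))
        rng.shuffle(perm)                  # random permutation of the workers
        for j in range(n):
            w = perm[j]                    # task j goes to worker perm[j]
            wrong = w in always_faulty or rng.random() < p
            votes[j][j * j + (1 if wrong else 0)] += 1
    # Majority vote per task over all rounds.
    return [votes[j].most_common(1)[0][0] for j in range(n)]

if __name__ == "__main__":
    rng = random.Random(11)
    answers = perform_tasks_fb(always_faulty={0, 1, 2}, p=0.05, n=64, k=8, rng=rng)
    print(sum(a == j * j for j, a in enumerate(answers)), "of 64 tasks correct")
```

A fresh permutation in every round spreads each task over different workers, so a single faulty worker contributes at most one vote per round to any given task.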
35 Work and Time Complexities: Theorem: The algorithm performs all n tasks correctly in $O(\log n)$ time and has $O(n \log n)$ complexity under both work measures, for $0 < f < 1/2$, with high probability.
37 Conclusions: Perform tasks under the above models where the tasks are dependent. The dependency graph can be a DAG. Quantify work and time complexities in terms of some characteristics of the DAG.