Robust Network Supercomputing with Malicious Processes (Reliably Executing Tasks Upon Estimating the Number of Malicious Processes) Kishori M. Konwar*


Robust Network Supercomputing with Malicious Processes (Reliably Executing Tasks Upon Estimating the Number of Malicious Processes) Kishori M. Konwar* Sanguthevar Rajasekaran Alexander A. Shvartsman *Computer Science & Engineering Department University of Connecticut Storrs, CT

2 Motivation • Internet supercomputing is increasingly becoming a powerful tool for harnessing massive amounts of computational resources • high-bandwidth Internet connections are widely available • there is an enormous number of processes around the world • it comes at a cost substantially lower than acquiring a supercomputer or building a cluster of powerful machines


4 TASKS (figure slide)


6 PrimeNet Server • PrimeNet Server is a distributed, massively parallel scientific-computing Internet supercomputer • Supported by Entropia.com, it ranks among the most powerful computers in the world • The project comprises about 30,000 PCs and laptops • It currently sustains 22,296 billion floating-point operations per second (gigaflops); floating-point operations are operations that involve fractional numbers

7 SETI@home • The SETI@home project is a massive distributed cooperative computer • Used to analyze gigabytes of data for the Search for Extraterrestrial Intelligence (SETI) • Comprises millions of volunteer machines around the world • The project reported its speed to be more than 57,290 billion floating-point operations per second

8 Reliability Issues • The master, and perhaps certain workers, are reliable • they correctly execute the tasks assigned to them • However, workers are commonly unreliable • they may return incorrect results to the master due to unintended failures caused, e.g., by over-clocked processors • they may deceitfully claim to have performed assigned work so as to obtain incentives, such as a higher rank


10 Some Previous Studies • [FGLS05] assumed that worker processes might act maliciously and hence deliberately return wrong results • the goal is to design algorithms that enable the master to accept correct results with high probability at low cost • they provided a randomized algorithm • unfortunately, the cost-complexity results depend on several parameters and are hard to interpret

11 Some Previous Studies (cont’d) • [GM05] considered the problem of maximizing the expected number of correct results • the tasks are dependent • any worker computes correctly with probability p < 1, and any incorrectly computed task corrupts all dependent tasks • the goal is to compute a schedule that maximizes the expected number of correct results under a given time constraint • they showed the optimization problem to be NP-hard • they provided some solutions for a restricted class of DAGs

12 Overview • Models of Computation • Stopping Rule Algorithm based solution • Detection of Faulty Processors • Performing Tasks with Faulty Workers • Conclusions

14 Models of Computation • Processes take steps in lock step, i.e., in synchrony • Processes communicate by exchanging messages • The tasks are independent and idempotent • Processes are subject to failures and can maliciously return incorrect results • There are n workers, P = {1, 2, ..., n}, and a master M

15 Work Complexities • Work complexity, or available processor steps [CDS01]: all steps taken by the processes during the execution of the algorithm are counted, including the steps of idling and waiting non-faulty processes • Task-oriented work [DHW92]: work is the number of performed tasks, counting multiplicities; this approach does not charge for idling and waiting

16 A Few Comments • The task-oriented work of an execution never exceeds its available-processor-steps work • We say that an event E occurs with high probability (w.h.p.) to mean that Pr[E] = 1 − O(n^−α) for some constant α > 0

17 Modeling Failures • Failure model F_a • an f-fraction, 0 < f < ½, of the n workers may be faulty • each possibly faulty worker independently exhibits faulty behavior with probability p, 0 < p < ½ • the master has no a priori knowledge of f or p

18 Modeling Failures (cont’d) • Failure model F_b • there is a fixed bound f, 0 < f < ½, on the fraction of the n workers that can be faulty • any worker from the remaining (1−f)-fraction of the workers fails with probability p, 0 < p < ½, independently of the other workers • the master knows the values of f and p
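The two failure models above can be simulated with a couple of small helpers, which the later sketches reuse; a minimal sketch in Python (the names `faulty_pool` and `one_answer` are my own, not from the slides):

```python
import random

def faulty_pool(n, f, rng):
    # Choose the f-fraction of the n workers that may be faulty.
    # Under F_a the master does not know this set, f, or p; under
    # F_b it knows f and p but still not the set itself.
    return set(rng.sample(range(n), int(f * n)))

def one_answer(w, pool, p, rng):
    # True = worker w returns a correct result on one task: a worker
    # outside the pool is always correct; a worker in the pool
    # independently misbehaves with probability p.
    return not (w in pool and rng.random() < p)
```

With n = 1000, f = 0.3, p = 0.2 this yields 300 potentially faulty workers, each of which answers a given task incorrectly with probability 0.2.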

19 Algorithmic Template
procedure for master process M, task T:
    Choose a set S ⊆ P
    Send task T to each processor p ∈ S
    Wait for the results from the processes in S
    Decide on the result value v from the responses
procedure for worker w ∈ P:
    Wait to receive a task from master M
    Upon receiving a task from M:
        Execute the task
        Send the result to M
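A single-process sketch of this template, with workers modeled as callables rather than message-passing processes (function names and the majority decision rule are illustrative choices, not prescribed by the slide):

```python
from collections import Counter

def master(task, workers, choose, decide):
    S = choose(workers)               # choose a set S of workers
    responses = [w(task) for w in S]  # "send" T to each and collect replies
    return decide(responses)          # decide on a result value v

def majority(responses):
    # One natural decision rule: take the most frequent response.
    return Counter(responses).most_common(1)[0][0]
```

For example, with four honest workers computing `t * 2` and one that always answers `-1`, `master(3, workers, lambda ws: ws, majority)` returns 6.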

21 (ε, δ)-approximation algorithm • Z is a random variable distributed in the interval [0, 1] with mean μ_Z • Z_1, Z_2, ... are independently and identically distributed according to the random variable Z • An (ε, δ)-approximation algorithm, with 0 < ε < 1 and δ > 0, for estimating μ_Z satisfies Pr[ μ_Z(1−ε) ≤ μ̂_Z ≤ μ_Z(1+ε) ] > 1 − δ, where μ̂_Z is the estimated value of μ_Z

22 Stopping Rule Algorithm [Dagum, Karp, Luby, and Ross 1995]
Input parameters: (ε, δ) with 0 < ε < 1, δ > 0
Let Υ = 4(e − 2) ln(2/δ)/ε²    // e − 2 ≈ 0.72
Let Υ₁ = 1 + (1 + ε)Υ
Initialize N ← 0, S ← 0
While S < Υ₁ do: N ← N + 1, S ← S + Z_N
Output: μ̂_Z ← Υ₁/N
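Assuming the standard Dagum–Karp–Luby–Ross formulation of the constants above, the stopping rule can be sketched in a few lines of Python:

```python
import math, random

def stopping_rule_estimate(sample_z, eps, delta, rng):
    """Stopping Rule Algorithm (Dagum, Karp, Luby, Ross): estimate
    mu_Z = E[Z] for a [0,1]-valued Z within relative error eps,
    with probability at least 1 - delta."""
    upsilon = 4 * (math.e - 2) * math.log(2 / delta) / eps ** 2  # e-2 ~ 0.72
    upsilon1 = 1 + (1 + eps) * upsilon
    n, s = 0, 0.0
    while s < upsilon1:     # keep sampling until the sum crosses Upsilon_1
        n += 1
        s += sample_z(rng)
    return upsilon1 / n     # the estimate of mu_Z
```

For instance, estimating the mean of a Bernoulli(0.3) variable with ε = 0.1 and δ = 0.05 draws a few thousand samples and, with probability at least 0.95, returns a value within 10% of 0.3.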

23 Stopping Rule Theorem
Theorem (Stopping Rule Theorem) [Dagum, Karp, Luby, and Ross]: Let Z be a random variable in [0, 1] with μ_Z = E[Z] > 0. Let μ̂_Z be the estimate produced and let N_Z be the number of experiments that SRA runs with respect to Z on input ε and δ. Then
(i) Pr[ μ_Z(1−ε) ≤ μ̂_Z ≤ μ_Z(1+ε) ] > 1 − δ,
(ii) E[N_Z] ≤ Υ₁/μ_Z, and
(iii) Pr[ N_Z > (1+ε)Υ₁/μ_Z ] ≤ δ/2

24 Algorithm A_{f,p} to estimate f and p

25 Work Complexity of A_{f,p}
Theorem: Algorithm A_{f,p} is an (ε, δ)-approximation algorithm, 0 < ε < 1, δ > 0, for the estimation of f and p, with task-oriented work complexity O(log² n), available-processor-steps complexity O(n log n), message complexity O(log² n), and time complexity O(log n), with high probability.

27 Detection of Faulty Processors • Lemma: It is not possible to perform all n tasks correctly, in the failure model F_a, with linear (i.e., O(n)) work complexity with high probability.

28 Detection of Faulty Processors
procedure for master process M:
    Initially, F ← ∅
    For t = 0, ..., k log n, k > 0:
        Choose a set S ⊆ P \ F
        Send each process p ∈ S a “test” task
        Wait for the results from the processes in S
        If a response is faulty:
            F ← F ∪ {p : p is a faulty process}
        End If
    End For
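A simulation sketch of this detection loop under failure model F_a (the faulty set, p, and the constant k are illustrative parameters; a "test" task here is one whose correct answer the master already knows):

```python
import math, random

def detect_faulty(n, pool, p, k, rng):
    """For ~k log n rounds, send every not-yet-flagged worker a test
    task and flag any worker that answers it incorrectly."""
    F = set()
    for _ in range(int(k * math.log(n)) + 1):
        for w in set(range(n)) - F:          # S = P \ F
            correct = not (w in pool and rng.random() < p)
            if not correct:
                F.add(w)                     # caught on a test task
    return F
```

Since a faulty worker evades detection in one round with probability 1 − p, after Θ(log n) rounds it is flagged with high probability, matching the lemma below.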

29 Detection of Faulty Processors
Lemma: The algorithm detects all faulty processes among the n workers in O(log n) time with O(n) work, with high probability.
Theorem [Karp 04]: Suppose that a(x) is a non-decreasing continuous function that is strictly increasing on {x : a(x) > 0}, and m(x) is a continuous function. Then for every positive real x and every positive integer t, Pr[T(x) > u(x) + t·a(x)] ≤ (m(x)/x)^t, where u(x) is the solution to the equation u(x) = a(x) + u(m(x)), with m_0(x) := 0 and m_{i+1}(x) := m(m_i(x)).

31 Performing Tasks under F_a
procedure for master process M:
    Initially, C ← ∅, J ← set of n tasks
    Randomly choose a set S ⊆ P, possibly with repetition, of |S| = kn/log n workers, where k > 0 is a constant
    For i = 1, ..., k' log n, k' > 0:
        Send each worker p ∈ S a “test” task
        Collect the responses from all the workers
    End For
    If all the responses from a worker p ∈ S are correct then
        C ← C ∪ {p}
    End If
    For i = 1, ..., n/|C|:
        Send |C| jobs from J, not sent in previous iterations, one to each worker in C
        Collect the responses from the |C| workers
    End For
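A sketch of this procedure, with the “correct” answer to task t taken to be t*t purely for illustration (the constants k and k', the faulty pool, and p are assumed parameters):

```python
import math, random

def perform_tasks_fa(n, pool, p, rng, k=1.0, kp=3.0):
    """Sample ~kn/log n workers (with repetition), screen them with
    ~k' log n test tasks, then farm the n real tasks out round-robin
    to the set C of workers that passed every test."""
    def answer(w, t):
        # Worker w's reply to task t: wrong (-1) with prob. p if faulty.
        return t * t if not (w in pool and rng.random() < p) else -1

    sample = {rng.randrange(n) for _ in range(int(k * n / math.log(n)))}
    tests = int(kp * math.log(n)) + 1
    # C: sampled workers whose every test answer was correct (w.h.p.
    # C is non-empty and contains no faulty worker).
    C = [w for w in sample if all(answer(w, 7) == 49 for _ in range(tests))]
    return C, [answer(C[t % len(C)], t) for t in range(n)]
```

Because only O(n/log n) workers are screened for O(log n) rounds each, the screening costs O(n) work, and the surviving set C then performs the n real tasks with O(n) additional work.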

32 Work and Time Complexities
Theorem: The algorithm performs all n tasks correctly in O(log n) time and has O(n) work and message complexities, with high probability.

34 Performing Tasks under F_b
procedure for master process M:
    For t = 0, ..., k log n, k > 0:
        Choose a random permutation π ∈_R S_n
        For each j ∈ [n]:
            Send the j-th task to processor π(j)
        End For
        Collect the responses from all the workers
    End For
    For each j ∈ [n]:
        Choose the majority of the results computed for the j-th task as the result
    End For
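A sketch of this permutation-and-majority scheme; task j's correct result is taken to be j as a hypothetical stand-in for the real computation, and the faulty pool, p, and k are assumed parameters:

```python
import math, random
from collections import Counter

def perform_tasks_fb(n, pool, p, k, rng):
    """For ~k log n rounds assign task j to worker pi(j) under a
    fresh random permutation pi, then take the per-task majority."""
    votes = [Counter() for _ in range(n)]
    for _ in range(int(k * math.log(n)) + 1):
        pi = list(range(n))
        rng.shuffle(pi)                   # pi chosen u.a.r. from S_n
        for j in range(n):
            w = pi[j]
            reply = j if not (w in pool and rng.random() < p) else -1
            votes[j][reply] += 1
    return [votes[j].most_common(1)[0][0] for j in range(n)]
```

In each round a task is answered incorrectly with probability at most f·p < ¼, so over Θ(log n) independent rounds the per-task majority is correct with high probability.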

35 Work and Time Complexities
Theorem: The algorithm performs all n tasks correctly in O(log n) time and has work and message complexities O(n log n), for 0 < f < ½ and 0 < p < ½, with high probability.

37 Conclusions • Future work: perform tasks under the above models where the tasks are dependent • the dependency graph can be a DAG • quantify work and time complexities in terms of characteristics of the DAG