
1 Hardening Functions for Large-Scale Distributed Computations
Doug Szajda, Barry Lawson, Jason Owen

2 Large-Scale Distributed Computations
Easily parallelizable, compute-intensive jobs
Divided into independent tasks to be executed on participant PCs
Significant results collected by a supervisor
Participants may receive credits
–Money, e-cash, ISP fees, fame and glory

3 Examples
seti@home – finding Martians
folding@home – protein folding
GIMPS (Entropia) – Mersenne prime search
United Devices, IBM, DOD: smallpox study
DNA sequencing
Graphics
Exhaustive regression
Genetic algorithms
Data mining
Monte Carlo simulation

4 The Problem
Code executes in untrusted environments
–Results may be corrupted, intentionally or unintentionally
–Significant results may be withheld
–Cheating: credit claimed for work not performed

5 An Obvious Solution: Assign Tasks Redundantly
Collusion may seem unlikely, but…
–Firms solicit participants from groups such as alumni associations and large corporations
Processor cycles are the primary resource, and redundancy consumes them
Some problems can tolerate some bad results

6 Related Work
Historical roots in result checking and self-correcting programs
Golle and Mironov (2001)
Golle and Stubblebine (2001)
Monrose, Wyckoff, Rubin (1999)

7 Related Work
Body of literature on protecting mobile agents from malicious hosts
–Sander and Tschudin, Vigna, Hohl, and others
Syverson (1998)

8 Adversary
Assumed to be intelligent
–Can decompile, analyze, and modify code
–Understands the task algorithms and the measures used to prevent corruption
Motivation may not be obvious…
–I.e., gaining credits may not be important
–E.g., a business competitor
But does not wish to be caught

9 Our Approach
Hardening functions
–"Hardening" as a verb, not an adjective
Does not guarantee the resulting computation returns correct results
Does not prevent an adversary from disrupting a computation
Significantly increases the likelihood that abnormal activity will be detected

10 The Model
The computation evaluates an algorithm f : D -> R for every input value x in D
Tasks are created by partitioning D into subsets D_i
Each task is assigned a filter function G_i that selects the significant results to return
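A minimal sketch of this model in Python; f, D_i, and G_i below are hypothetical placeholders, not code from the talk:

```python
def run_task(f, D_i, G_i):
    """Evaluate f on the task's slice D_i of the input space;
    report only the results the filter function G_i deems significant."""
    results = ((x, f(x)) for x in D_i)
    return {x: y for x, y in results if G_i(x, y)}
```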

11 Two General Classes
Non-sequential
–Computed values of f within a task are independent
Sequential
–Participant is given a single value x_0 and asked to compute the first m elements of the sequence x_n = f(x_{n-1})
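A sequential task in the same sketch style (names again hypothetical):

```python
def run_sequential_task(f, x0, m):
    """Compute the first m elements of the sequence x_n = f(x_{n-1})."""
    xs = [x0]
    while len(xs) < m:
        xs.append(f(xs[-1]))  # each step depends on the previous one
    return xs
```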

12 Hardening Non-sequentials
Plant each task's data set with ringer values r_i such that the following hold:
1. The supervisor knows f(r_i) for each i
2. Participants cannot distinguish the r_i from other data values, regardless of the number of tasks a participant completes

13 Hardening Non-sequentials
3. Participants do not know the number of r_i in the data space
4. For some known proportion of the r_i, f(r_i) is a significant result
5. Nice but not necessary: the same set of r_i can be used for several tasks
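A minimal sketch of planting and checking ringers, assuming the supervisor holds a precomputed table mapping each ringer r_i to f(r_i) and that each planted ringer's result passes the task's filter (property 4); all names are illustrative:

```python
import random

def build_task_data(real_inputs, ringer_table):
    """Mix ringer inputs into a task's data set; shuffling keeps them
    indistinguishable from ordinary inputs (property 2 above)."""
    data = list(real_inputs) + list(ringer_table)
    random.shuffle(data)
    return data

def check_ringers(returned_results, ringer_table):
    """A returned task is suspect unless every planted ringer r_i
    came back paired with the precomputed value f(r_i)."""
    return all(returned_results.get(r) == fr
               for r, fr in ringer_table.items())
```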

14 Difficulties
The r_i are indistinguishable only if they generate truly significant results
What is indistinguishable in theory may not be in practice
–E.g., DES key search: tasks are given a ciphertext C and a subset K_i of the key space, told to decrypt C with each key in K_i and return any key that generates plausible plaintext
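The key-search task might look like the sketch below; des_decrypt is a stand-in for a real DES routine and looks_like_plaintext is a hypothetical plausibility test. The difficulty is that a planted ringer key must itself decrypt C to something plausible, or it is distinguishable from ordinary keys:

```python
def looks_like_plaintext(data):
    """Hypothetical plausibility test: mostly printable ASCII bytes."""
    return sum(32 <= b < 127 for b in data) >= 0.95 * len(data)

def search_keys(C, K_i, des_decrypt):
    """Return every key in the subset K_i whose decryption of the
    ciphertext C looks like plausible plaintext."""
    return [k for k in K_i if looks_like_plaintext(des_decrypt(k, C))]
```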

15 Even the Filter Function Can Be Revealing...
E.g., Traveling Salesperson with five precomputed circuits of lengths 100, 105, 102, 113, 104. Candidate filters:
–Return any circuit whose length matches one of the above or is less than 100
–Return the ten best circuits found
–Return any circuit with length less than 120

16 Optimization Problems
Designate a small proportion of tasks as the initial distribution
Distribute each of these tasks redundantly
Check the returned values and handle non-matches appropriately
Retain the k best results and use them as ringers for the remaining tasks
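A sketch of this recipe; run_redundantly and the result's score attribute are assumptions for illustration, not part of the talk:

```python
import random

def make_ringers(tasks, proportion, k, run_redundantly):
    """Run a small initial distribution of tasks redundantly and
    keep the k best verified results as ringers for the rest."""
    initial = random.sample(tasks, max(1, int(proportion * len(tasks))))
    verified = []
    for task in initial:
        results, mismatch = run_redundantly(task)
        if mismatch:
            continue  # non-match: handle those participants separately
        verified.extend(results)
    verified.sort(key=lambda r: r.score)  # e.g., circuit length for TSP
    return verified[:k]
```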

17 Collusion
If a task in the initial distribution is assigned to colluding adversaries, the supervisor will initially miss this
Honest participants outside the initial distribution will eventually return results that do not match
The supervisor can then determine which participants have been dishonest

18 Size of Initial Distribution
The probability that at least k of the n best results lie in a proportion p of the space:

n      k    p     prob
50     8    0.25  0.9547
150    5    0.10  0.9967
10^5   100  0.02  ≈1

For 10^9 inputs, the best 10^5 results are in the top 0.01%.
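These entries follow from a binomial model: each of the n best results independently lands in the chosen proportion p of the space, so each probability is the upper tail P(X >= k) for X ~ Binomial(n, p). A quick check (scipy assumed available):

```python
from scipy.stats import binom

def prob_at_least_k_of_n(n, k, p):
    """P(at least k of the n best results fall in proportion p)."""
    return binom.sf(k - 1, n, p)  # sf(k-1) = P(X >= k)

print(prob_at_least_k_of_n(50, 8, 0.25))       # ~0.95
print(prob_at_least_k_of_n(150, 5, 0.10))      # ~0.997
print(prob_at_least_k_of_n(10**5, 100, 0.02))  # ~1.0
```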

19 Caveat
The previous figures assume:
–n and k are much smaller than the size of the data space
–the proportion of incorrect results is small
The probability should be adjusted to reflect the expected number of incorrect results returned in the initial distribution

20 The Good
No precomputing required
Hardening is achieved at a fraction of the cost of simple redundancy
Ringers can be used for multiple tasks
Additional good results can be used as new ringers
Collusion resistant, since ringers can be combined in many ways

21 The Bad
Assuming tasks require equal time, the cost of the compute job is at least doubled…
But by running multiple projects concurrently, overall throughput cost can be brought down to a factor of 1+p times that of the unmodified job
In some cases, implementation details can give away the identities of ringers (or require significant changes to the application)

22 Sequential Computations
Seeding the data with ringers is impractical
Often the validity of returned results can be checked only by performing the entire task
Ex: Mersenne primes
–The nth Mersenne number M_n is 2^n - 1
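For concreteness, the standard Lucas-Lehmer test for Mersenne primality is exactly this kind of sequential computation: p - 2 dependent squarings modulo M_p, with nothing meaningful to check until the end. (An illustrative sketch; GIMPS uses heavily optimized FFT-based arithmetic.)

```python
def is_mersenne_prime(p):
    """Lucas-Lehmer: M_p = 2**p - 1 (p an odd prime) is prime iff
    s_0 = 4, s_i = s_{i-1}**2 - 2 (mod M_p) reaches 0 after p - 2 steps."""
    M = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % M
    return s == 0

print([p for p in (3, 5, 7, 11, 13) if is_mersenne_prime(p)])
# [3, 5, 7, 13] -- M_11 = 2047 = 23 * 89 is composite
```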

23 The Strategy
Share the work of computing N tasks among K participants
K > N, yet K is a very small proportion of the total number of participants in the computation
Assume:
–Each task requires roughly m iterations
–K/N < 2, else simple redundancy is cheaper

24 The Algorithm
1. Divide tasks into S segments, each containing roughly J = m/S iterations
2. Each participant in the group is given an initial value and computes the first J iterations from that value
3. When the J iterations are complete, the results are returned to the supervisor

25 The Algorithm
4. The supervisor checks the correctness of the redundantly assigned subtasks
5. The supervisor permutes the N values and assigns them to the K participants as initial values for the next segment
6. Repeat until all S segments are completed
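A sketch of the supervisor's loop under the assumptions stated above; compute_segment is a stand-in for dispatching J iterations to one participant and collecting the result:

```python
import random

def run_group(initial_values, K, S, compute_segment):
    """Drive N sequence tasks through S segments on K participants
    (N < K < 2N), re-permuting assignments after every segment."""
    N = len(initial_values)
    values = list(initial_values)
    for _ in range(S):
        # K - N of the N values are assigned a second time, so each
        # task goes to at most two participants (the checked redundancy)
        assignment = values + random.sample(values, K - N)
        random.shuffle(assignment)
        returned = {}  # value -> results from its assignees
        for participant, v in enumerate(assignment):
            returned.setdefault(v, []).append(compute_segment(participant, v))
        for v, results in returned.items():
            if len(set(results)) > 1:
                raise RuntimeError(f"mismatch on {v}: a participant cheated")
        values = [returned[v][0] for v in values]  # inputs for next segment
    return values
```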

26 The Numbers
If K/N < 2, each task is assigned to no more than two participants, and the adversary cheats in L of the S segments, then in the absence of collusion

P(caught) = 1 - (1 - 2(K - N)/K)^L

27 Probabilities

K   N   S   L   P(caught)
5   4   5   5   0.9222
5   4   10  10  0.9939
5   4   10  2   0.64
5   4   20  4   0.8704
10  9   10  10  0.8926
10  9   10  2   0.36
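These entries follow directly from the slide-26 expression (note that S drops out once L is fixed):

```python
def p_caught(K, N, L):
    # probability that at least one of the L cheated segments falls
    # among the participant's redundantly checked assignments
    return 1 - (1 - 2 * (K - N) / K) ** L

for K, N, L in [(5, 4, 5), (5, 4, 10), (5, 4, 2),
                (5, 4, 4), (10, 9, 10), (10, 9, 2)]:
    print(K, N, L, round(p_caught(K, N, L), 4))
# 0.9222, 0.9939, 0.64, 0.8704, 0.8926, 0.36
```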

28 Redundancy vs. P Values

L = 1:
P    0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
K/N  1.05  1.11  1.18  1.25  1.33  1.43  1.54  1.67  1.82  2.0

L = 2:
P    0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
K/N  1.03  1.06  1.09  1.13  1.17  1.23  1.29  1.38  1.52  2.0
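Inverting the slide-26 expression reproduces these rows: the redundancy needed for a target detection probability P is K/N = 2 / (2 - q) with q = 1 - (1 - P)^(1/L).

```python
def redundancy_needed(P, L):
    """K/N required so an adversary cheating in L segments is
    caught with probability at least P."""
    q = 1 - (1 - P) ** (1 / L)  # per-segment detection probability needed
    return 2 / (2 - q)

print(round(redundancy_needed(0.5, 1), 2))  # 1.33
print(round(redundancy_needed(0.9, 2), 2))  # 1.52
```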

29 Advantages
Far fewer task compute cycles than simple redundancy
Values need not be precomputed
The method is relatively collusion resistant (unless the supervisor picks an entire group of colluding participants)
The method is tunable
Can also be applied to the non-sequential case

30 Disadvantages
Increased coordination and communication costs for the supervisor
The need for synchronization increases the time cost of the job
–Dial-up connectivity
–Sporadic task execution (owners using their PCs)

31 Disadvantages
The strategy does not protect well against an adversary who cheats only once
Cheating damage can be magnified
–Undetected incorrect results propagate through later segments

32 Conclusions
Presented two strategies for hardening distributed metacomputations
Non-sequential: seed the data with ringers
Sequential: share N tasks among K > N participants
Small increase in the average execution time of a modified task
Overall computing costs significantly less than redundantly assigning every task

