Hardening Functions for Large-Scale Distributed Computations
Doug Szajda, Barry Lawson, Jason Owen

Large-Scale Distributed Computations
- Easily parallelizable, compute intensive
- Divided into independent tasks to be executed on participant PCs
- Significant results collected by a supervisor
- Participants may receive credits
  – Money, e-cash, ISP fees, fame and glory

Examples
- Finding Martians
- Protein folding
- GIMPS (Entropia): Mersenne prime search
- United Devices, IBM, DOD: smallpox study
- DNA sequencing
- Graphics
- Exhaustive regression
- Genetic algorithms
- Data mining
- Monte Carlo simulation

The Problem
Code is executing in untrusted environments:
- Results may be corrupted, either intentionally or unintentionally
- Significant results may be withheld
- Cheating: credit claimed for work not performed

An Obvious Solution: Assign Tasks Redundantly
- Collusion may seem unlikely, but firms solicit participants from groups such as alumni associations and large corporations
- Processor cycles are the primary resource, so full redundancy is expensive
- Some problems can tolerate a proportion of bad results

Related Work
- Historical roots in result checking and self-correcting programs
- Golle and Mironov (2001)
- Golle and Stubblebine (2001)
- Monrose, Wyckoff, and Rubin (1999)

Related Work
- Body of literature on protecting mobile agents from malicious hosts: Sander and Tschudin, Vigna, Hohl, and others
- Syverson (1998)

Adversary
- Assumed to be intelligent
  – Can decompile, analyze, and modify code
  – Understands the task algorithms and the measures used to prevent corruption
- Motivation may not be obvious...
  – i.e., gaining credits may not be important
  – e.g., a business competitor
- But does not wish to be caught

Our Approach
- Hardening functions ("hardening" as a verb, not an adjective)
- Hardening does not guarantee the resulting computation returns correct results
- It does not prevent an adversary from disrupting a computation
- It does significantly increase the likelihood that abnormal activity will be detected

The Model
- A computation is the evaluation of an algorithm f : D -> R for every input value x in D
- Tasks are created by partitioning D into subsets D_i
- Each task i is assigned a filter function G_i that determines which computed values are returned to the supervisor

Two General Classes
- Non-sequential: the values of f computed within a task are independent
- Sequential: the participant is given a single value x_0 and asked to compute the first m elements of the sequence x_n = f(x_{n-1})
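To make the two classes concrete, here is a minimal Python sketch (our illustration, not the authors' code); the helper names and the filter-function signature are assumptions:

```python
# Illustrative sketch: packaging the two task classes from the model.

def make_nonsequential_tasks(domain, num_tasks):
    """Partition the input domain D into subsets D_i; each subset is one task."""
    chunk = max(1, len(domain) // num_tasks)
    return [domain[i:i + chunk] for i in range(0, len(domain), chunk)]

def run_nonsequential(task, f, is_significant):
    # The filter function G_i: report only results deemed significant.
    return [(x, f(x)) for x in task if is_significant(f(x))]

def run_sequential(x0, f, m):
    """Sequential task: the first m elements of x_n = f(x_{n-1})."""
    xs = [x0]
    for _ in range(m - 1):
        xs.append(f(xs[-1]))
    return xs

# Example use with a toy f:
f = lambda x: (x * x + 1) % 1_000_003
print(run_sequential(2, f, 5))
print(run_nonsequential(range(100), f, lambda y: y < 50))
```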

Hardening Non-sequentials
Plant each task's data set with ringer values r_i such that:
1. The supervisor knows f(r_i) for each i
2. A participant cannot distinguish the r_i from other data values, regardless of the number of tasks the participant completes

3. Participants do not know the number of r_i in the data space
4. For some known proportion of the r_i, f(r_i) is a significant result
5. Nice but not necessary: the same set of r_i can be used for several tasks
A minimal sketch of this seeding appears below.
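A small Python sketch of the seeding idea, under our own assumptions (the helper names and the shuffle-based mixing are illustrative, not the paper's implementation):

```python
import random

def plant_ringers(task_inputs, ringers):
    """Mix precomputed ringer inputs r_i into a task's input set so that a
    participant cannot tell them apart from ordinary inputs."""
    seeded = list(task_inputs) + [r for (r, _) in ringers]
    random.shuffle(seeded)  # the r_i must also *look* like ordinary inputs
    return seeded

def supervisor_check(returned, ringers):
    """Check a participant's returned {input: f(input)} results: every ringer
    whose f(r_i) is known to be significant must come back, and correctly."""
    return all(returned.get(r) == fr for (r, fr) in ringers)
```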

Difficulties
- The r_i are indistinguishable only if they generate truly significant results
- What is indistinguishable in theory may not be in practice
  – E.g., DES key search: tasks are given a ciphertext C and a subset K_i of the key space, and told to decrypt C with each k in K_i and return any key that generates plausible plaintext. A planted key would itself have to decrypt C to plausible plaintext, which is hard to arrange without altering the application.

Even the Filter Function Can Be Revealing...
E.g., Traveling Salesperson with five precomputed circuits of lengths 100, 105, 102, 113, and 104:
- Return any circuit whose length is any of the above, or less than 100 (the enumerated lengths give the ringers away)
- Return the ten best circuits found
- Return any circuit with length less than 120

Optimization Problems
1. Designate a small proportion of the tasks as the initial distribution
2. Distribute each of these tasks redundantly
3. Check the returned values and handle non-matches appropriately
4. Retain the k best results and use them as ringers for the remaining tasks
A sketch of this workflow follows.
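A sketch of that workflow in Python (the task/result representations, the 1% initial distribution, and the assumption that smaller objective values are better are ours):

```python
def harden_optimization(tasks, run_redundant, run_seeded, k):
    """Initial redundant distribution -> trusted results -> reuse k best as ringers."""
    n0 = max(1, len(tasks) // 100)          # small initial distribution
    initial, remaining = tasks[:n0], tasks[n0:]

    trusted = []
    for task in initial:
        a, b = run_redundant(task)          # same task to two participants
        if a == b:
            trusted.extend(a)               # matching results are trusted
        # non-matches handled separately (e.g., reassign and investigate)

    ringers = sorted(trusted)[:k]           # k best results (minimization assumed)

    suspicious = []
    for task in remaining:
        results = run_seeded(task, ringers) # task's data seeded with the ringers
        if not all(r in results for r in ringers):
            suspicious.append(task)         # a planted result went missing
    return ringers, suspicious
```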

Collusion
- If a task in the initial distribution is assigned to colluding adversaries, the supervisor will initially miss the cheating
- But honest participants outside the initial distribution will eventually return results that do not match
- The supervisor can then determine which participants have been dishonest

Size of Initial Distribution
The probability that at least k of the n best results fall in a proportion p of the space is the binomial tail

    P = sum_{j=k..n} C(n, j) p^j (1 - p)^(n - j)

[Table: values of n, k, p and the resulting probability; in each case the probability is ≈ 1.]
For 10^9 inputs, the best 10^5 results are the top 0.01% of the space.
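A quick numerical check of that tail probability in Python (the parameter values below are illustrative examples, not the slide's table entries):

```python
from math import comb

def prob_at_least_k(n, k, p):
    """P(at least k of the n best results fall in a random proportion p of
    the space), via the complement so the sum stays short for small k."""
    return 1.0 - sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k))

# E.g., wanting k = 10 of the n = 10**5 best results inside an initial
# distribution covering p = 0.1% of the space:
print(prob_at_least_k(10**5, 10, 0.001))  # ~= 1.0
```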

Caveat
The previous figures assume:
- n and k are much smaller than the size of the data space
- the proportion of incorrect results is small
The probability should be adjusted to reflect the expected number of incorrect results returned in the initial distribution.

The Good
- No precomputing required
- Hardening is achieved at a fraction of the cost of simple redundancy
- Ringers can be used for multiple tasks
- Additional good results can be used as new ringers
- Collusion resistant, since ringers can be combined in many ways

The Bad
- Assuming tasks require equal time, the cost of the compute job is at least doubled...
- But by running multiple projects concurrently, overall throughput can be brought within a factor of 1 + p of the rate of the unmodified job
- In some cases, implementation details can give away the identities of ringers (or require significant changes to the application)

Sequential Computations
- Seeding the data is impractical
- Often the validity of returned results can be checked only by performing the entire task
- Example: Mersenne primes. The nth Mersenne number is M_n = 2^n - 1
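To illustrate why such results resist spot-checking, here is the standard Lucas-Lehmer test for M_p (our example; the slides contain no code). Every iteration depends on the previous one, so verifying a claimed answer amounts to redoing the task:

```python
def lucas_lehmer(p):
    """M_p = 2**p - 1 is prime iff s_{p-2} == 0, where s_0 = 4 and
    s_n = (s_{n-1}**2 - 2) mod M_p (valid for odd prime p)."""
    if p == 2:
        return True  # M_2 = 3 is prime
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m  # inherently sequential: no shortcut to step n
    return s == 0

print([p for p in (3, 5, 7, 11, 13) if lucas_lehmer(p)])  # [3, 5, 7, 13]
```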

The Strategy
- Share the work of computing N tasks among K participants, K > N
- The group of K is a very small proportion of the total number of participants in the computation
- Assume each task requires roughly m iterations
- Assume K/N < 2; otherwise simple redundancy is cheaper

The Algorithm
1. Divide the tasks into S segments, each containing roughly J = m/S iterations
2. Each participant in the group is given an initial value and computes the first J iterations from that value
3. When the J iterations are complete, the results are returned to the supervisor

The Algorithm 4.Supervisor checks correctness of redundantly assigned subtasks 5.Supervisor permutes N values and assigns these values to K participants as initial value for next segment 6.Repeat until all S segments completed

The Numbers
If K/N < 2, each task is assigned to no more than two participants. If an adversary cheats in L of the S segments then, in the absence of collusion, the probability of detection grows with L: each cheated segment risks coinciding with a redundantly assigned value whose honest copy exposes the mismatch.

Probabilities
[Table: P(caught) for sample values of K, N, S, and L.]

Redundancy vs. P
[Plots: P(caught) versus K/N, for L = 1 and L = 2.]
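A Monte Carlo sketch of the detection probability, under our reading of the model (a lone cheater is caught exactly when a cheated segment lands on a value that was redundantly assigned; this interpretation is our assumption, not a formula from the slides):

```python
import random

def p_caught(K, N, S, L, trials=20_000):
    """Estimate P(caught) for one non-colluding adversary who cheats in
    L of the S segments, with K - N of the N values duplicated per segment."""
    caught = 0
    for _ in range(trials):
        cheat_segments = random.sample(range(S), L)
        for seg in cheat_segments:
            # Of the K participant slots, 2*(K - N) hold duplicated values.
            if random.random() < 2 * (K - N) / K:
                caught += 1   # the honest partner's result exposes the cheat
                break
    return caught / trials

print(p_caught(K=12, N=10, S=10, L=2))  # ~= 1 - (1 - 2*2/12)**2 ~= 0.56
```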

Advantages
- Far fewer task compute cycles than simple redundancy
- Values need not be precomputed
- Relatively collusion resistant (unless the supervisor picks an entire group of colluding participants)
- The method is tunable
- Can also be applied to the non-sequential case

Disadvantages
- Increased coordination and communication costs for the supervisor
- The need for synchronization increases the time cost of the job
  – Dial-up connectivity
  – Sporadic task execution (owners using their PCs)

Disadvantages
- The strategy does not protect well against an adversary who cheats once
- Cheating damage can be magnified
  – Propagation of undetected incorrect results

Conclusions
- Presented two strategies for hardening distributed metacomputations
- Non-sequential: seed the data with ringers
- Sequential: share N tasks among K > N participants
- Small increase in the average execution time of a modified task
- Overall computing costs significantly less than redundantly assigning every task