How To Tell A Secret Without Revealing It: Enhanced Data Privacy in a Distributed Implementation of the Smith-Waterman Genome Sequence Comparison Algorithm.


2 How To Tell A Secret Without Revealing It: Enhanced Data Privacy in a Distributed Implementation of the Smith-Waterman Genome Sequence Comparison Algorithm. Barry Lawson, University of Richmond

3 Outline: Distributed volunteer computing; A problem; Related work; A real-world application; Our enhanced privacy approach; Results & analysis; Conclusions, future & ongoing work

4 Our Scenario You have a very large, compute-intensive project: Your PC (few GFLOPS); NOW (GFLOPS); Supercomputer (< 280.6 TFLOPS), ~$128M

5 What To Do? Welcome to the Internet, my friend. How can I help you?

6 Distributed Volunteer Computing (DVC) Supervisor (Alice) Participants (Bob) (< 200 TFLOPS)

7 DVC Computation Large-scale distributed computation: –compute-intensive –easily parallelizable Supervisor: –divide computation into independent tasks –ship tasks to participants –collect significant results Participants: –download and execute tasks (when otherwise idle) –return significant results

8 Real-world Examples

9 DVC Group @ Richmond Faculty –Doug Szajda (CS) –Barry Lawson (CS) –Jason Owen (Statistics) Students –Current Mike Pohl, Greg Steffensen, Andy White –Past Ed Kenney (CMU), Dan Upton (UVA), Rom Chan, Trin C. –Four more this summer Stefan Chipilov, Brittany Williams, Matt King, Ivan Jibaja

10 Telling a Secret Without Revealing It A problem: –Participants are untrustworthy –Code executes outside supervisor’s control –Computation data may be proprietary Goal: –Participants provide meaningful results –Supervisor does not divulge data

11 Related Work (not exhaustive) Computing with encrypted data: –Alice has x, wants Bob to compute f(x) But does not want to divulge x –Alice gives Bob x’ and f’( ) –Bob computes f’(x’) The key: –Alice can determine f(x) from f’(x’) –Bob cannot determine x from x’ and/or f’(x’) Difficult (often impossible) in practice

12 Less Formally [Diagram: “Alice” transforms x and f into x’ and f’ and sends them to “Bob”; Bob computes f’(x’); can Alice recover f(x) from f’(x’)?]

13 Flexibility In Our Context The computation: –Alice (supervisor) has many x’s –Bob (participant) determines x’s that are significant Alice doesn’t need the value f(x) Alice will post-process –A few false positives are OK –Sufficient accuracy: flexibility in f’( )

14 The Adversary Assumed to be intelligent –can decompile, analyze, modify code –understands task algorithms –understands enhanced privacy scheme(s) Motivation –may not be obvious: business competitor? –may not care if leak is detected

15 General Model The Computation: –evaluate an algorithm f : D → R for all x in D Task T( ): –partition D into subsets D_i –T(D_i) evaluates f(x_i) for all x_i in D_i Filter function G( ): –determines “significance” –returns indices of significant x_i [Figure: D partitioned into D_1, D_2, D_3]

16 Our General Approach Transform D_i, f, G into D_i’, f’, G’ Replace task T(D_i) with T(D_i’) Desirable properties: –T(D_i) does not leak additional info about values in D_i –significance in T(D_i) ⇔ significance in T(D_i’) –any difference is reasonably small

17 In Reality… Providing desired properties is difficult –even with increased flexibility –impossible for some apps When possible, application-specific Bottom line: we have a potential approach –where few, if any, others exist

18 Application: Genome Sequence Comparison Compare sequences over genome alphabet ∑ = {A,C,G,T} Track evolutionary changes by aligning columns of sequences (an alignment) E.g.: CTGTTA / CAGTTA

19 Sequence Evolution Deletion: CTGTTA → CTG–TA Insertion: C–TGTTA → CGTGTTA Substitution: CTGTTA → CAGTTA (deletions and insertions together: indels)

20 Sequence Evolution After several “generations”: C–TGT––TA–– / CTA–TGCTACG Note: # of alignments is huge (for realistic-length sequences)

21 Alignment Types Global alignment –considers entire sequence Local alignment –considers substrings –biologists usually use local

22 Measuring Alignments Scoring function: +1 if symbols match, -1 if not Gap penalty: –g(k) = a + b(k-1) –k is gap length (# consecutive dashes in single sequence) Alignment score: –sum of column scores minus gap penalties

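23 A Simple Example Global alignment: –Scoring function: +1 match, -1 no match –Gap penalty: g(k) = 2 + 1(k-1) Top row: C – T G T – – T A – – Bottom row: C T A – T G C T A C G Scores per column or gap: +1 -2 -1 -2 +1 -3 +1 +1 -3 Alignment score: +4 - 11 = -7 (four matches, minus one mismatch and gap penalties totaling 10)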
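The arithmetic on the slide above can be checked with a short sketch. The function names and the use of an ASCII "-" for gaps are illustrative choices, not taken from the talk; the scoring values (+1, -1) and the gap penalty g(k) = 2 + 1(k-1) are the ones given on the slide.

```python
def gap_penalty(k, a=2, b=1):
    """Affine gap penalty g(k) = a + b(k-1) for a gap of length k."""
    return a + b * (k - 1)


def alignment_score(top, bottom, match=1, mismatch=-1, a=2, b=1):
    """Score a given alignment; '-' marks a gap position (hypothetical helper)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    run = 0          # length of the gap currently being read
    run_row = None   # which row (0 = top, 1 = bottom) holds that gap
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = 0 if x == "-" else 1
            if run and row == run_row:
                run += 1                       # same gap continues
            else:
                if run:                        # a gap in the other row just ended
                    score -= gap_penalty(run, a, b)
                run, run_row = 1, row
        else:
            if run:                            # close the open gap
                score -= gap_penalty(run, a, b)
                run = 0
            score += match if x == y else mismatch
    if run:                                    # gap running to the end
        score -= gap_penalty(run, a, b)
    return score


# The alignment from the slide, with ASCII dashes for gaps.
print(alignment_score("C-TGT--TA--", "CTA-TGCTACG"))  # -7
```

Each run of consecutive dashes in one row is charged once, as a single gap of length k, which is why the two length-2 gaps cost 3 each rather than 4.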

24 Smith-Waterman Dynamic programming algorithm Produces an optimal alignment Global: O(n^2); Local: O(n^3) Implemented on commercial DVC platforms
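For readers who have not seen the algorithm, here is a minimal sketch of the local-alignment recurrence with a general gap function, which takes cubic time for equal-length inputs. It returns only the best score, not the alignment itself, and is not the implementation used in the talk; the function name and default parameters are assumptions chosen to match the scoring on the earlier slides.

```python
def smith_waterman_score(s, t, match=1, mismatch=-1, a=2, b=1):
    """Best local alignment score of s and t (strings or lists)."""
    def g(k):
        # Affine gap penalty g(k) = a + b(k-1).
        return a + b * (k - 1)

    n, m = len(s), len(t)
    # H[i][j] = best score of a local alignment ending at s[i-1], t[j-1].
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # Try every gap length ending here, in either sequence.
            up = max(H[i - k][j] - g(k) for k in range(1, i + 1))
            left = max(H[i][j - k] - g(k) for k in range(1, j + 1))
            H[i][j] = max(0, diag, up, left)   # 0 restarts the local alignment
            best = max(best, H[i][j])
    return best


print(smith_waterman_score("ACGT", "ACGT"))   # 4
print(smith_waterman_score("AAA", "TTT"))     # 0
```

Because the function only indexes its inputs and compares elements for equality, it also accepts lists of integers.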

25 Significance in S-W Significance of scores based on probability Empirical evidence: –given randomly-generated sequences –scores exhibit extreme value distribution

26 Determining Significance Choosing a significance threshold p: –want small probability that a random score > p –typically, probability < 0.003 [Figure: score distribution with threshold p marked]
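One simple way to pick such a threshold is empirically, from scores of randomly generated sequence pairs. The sketch below does only that: it does not fit an extreme value distribution, and the function name and the choice to return the smallest qualifying observed score are assumptions made for illustration.

```python
from bisect import bisect_right


def significance_threshold(random_scores, alpha=0.003):
    """Smallest observed score p such that the fraction of scores > p is < alpha."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    s = sorted(random_scores)
    n = len(s)
    if n == 0:
        raise ValueError("need at least one score")
    for p in s:
        exceed = n - bisect_right(s, p)   # number of scores strictly above p
        if exceed / n < alpha:
            return p
    return s[-1]  # not reached for alpha > 0: the maximum always qualifies


# Toy input: 1000 distinct scores, so at most 2 may lie above the threshold.
print(significance_threshold(range(1, 1001)))  # 998
```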

27 A Smith-Waterman Task Pairwise comparison of two sets of sequences, A and B – A : proprietary sequences – B : sequences from public database Returned: indices of well-matched pairs Notation: T( A, B,s,g,p)

28 Our Transformation Use offset sequences: –compare relative distances b/w specific nucleotides X: GCACTTACGCCCTTACGACG –F(X,A) = {3,4,8,3} –F(X,C) = {2,2,4,2,1,1,4,3} –F(X,G) = {1,8,8,3} –F(X,T) = {5,1,7,1}

29 Modified Tasks X: GCACTTACGCCCTTACGACG F(X,C) = {2,2,4,2,1,1,4,3} Y: GCACTCGCCACTTAGCACG F(Y,C) = {2,2,2,2,1,2,5,2} Apply S-W to F(X,C) and F(Y,C) –Scoring function, gap penalty –“Goodness” threshold
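The offset sequences on the two slides above can be reproduced with a few lines. This is a sketch under one reading of the slides: the first offset is the 1-based position of the first occurrence, and each later offset is the distance from the previous occurrence. The function name is hypothetical.

```python
def offset_sequence(seq, symbol):
    """F(seq, symbol): distances between consecutive occurrences of symbol."""
    offsets = []
    prev = 0  # position of the previous occurrence (0 = before the sequence)
    for pos, ch in enumerate(seq, start=1):
        if ch == symbol:
            offsets.append(pos - prev)
            prev = pos
    return offsets


X = "GCACTTACGCCCTTACGACG"
Y = "GCACTCGCCACTTAGCACG"
print(offset_sequence(X, "A"))  # [3, 4, 8, 3]
print(offset_sequence(X, "C"))  # [2, 2, 4, 2, 1, 1, 4, 3]
print(offset_sequence(Y, "C"))  # [2, 2, 2, 2, 1, 2, 5, 2]
```

A participant who receives only F(X,C) learns where every C sits in X, but nothing about which of A, G, or T fills each remaining position.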

30 Intuition Similar sequences ⇒ similar offsets –consider effects of indels, substitutions What about false positives? –multiple nucleotides e.g., assign A & C tasks to distinct participants –good match if both tasks indicate significance

31 [Diagram: “Alice” holds the sequences CAGGATCTCAAGC and CAGCATATCACGT. The A offset sequences (2351 and 2323) go to “Bob 1” and the C offset sequences (1624 and 1352) go to “Bob 2”.]

32 Using Multiple Nucleotides Maximum method –one task for each of A,C,G,T –result significant if any of the four indicate Adding method –one task for each of A,C,G,T –result significant if sum of four scores indicates significance Costs reduced in either case –on average, 1/4 length of original sequence –runtime for an offset sequence ~1/64
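The two combination rules can be written down directly. In this sketch the per-nucleotide scores are assumed to have been returned by four separate tasks, and the threshold values are made up for the example; the slides do not give concrete thresholds.

```python
NUCLEOTIDES = "ACGT"


def maximum_method(scores, threshold):
    """Significant if any single nucleotide's score exceeds the threshold."""
    return any(scores[c] > threshold for c in NUCLEOTIDES)


def adding_method(scores, sum_threshold):
    """Significant if the sum of the four scores exceeds the threshold."""
    return sum(scores[c] for c in NUCLEOTIDES) > sum_threshold


# Hypothetical scores for one pair of sequences, one score per nucleotide task.
scores = {"A": 3, "C": 12, "G": 4, "T": 2}
print(maximum_method(scores, threshold=10))     # True: the C task alone is enough
print(adding_method(scores, sum_threshold=25))  # False: the sum is only 21
```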

33 Does This Provide Real Data Privacy? Recall desired properties: 1. T(D_i) does not leak additional info about values in D_i 2. significance in T(D_i) ⇔ significance in T(D_i’) 3. any difference is reasonably small

34 Data Privacy? Property 1 fails: –T(D_i) does leak additional info about values in D_i –adversary knows all info about one nucleotide How much info is leaked? –conditional entropy gives rough estimate –e.g., N = 600, C ∂ = N/4 ⇒ 487 bits (of 1200) leaked; 713 bits of uncertainty remain
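The figures on this slide can be reproduced under a simple model, which is an assumption here rather than something the slide spells out: symbols are independent and uniform over {A,C,G,T}, so a length-N sequence carries 2N bits, and the revealed nucleotide occurs exactly N/4 times. Once its positions are known, each of the other 3N/4 positions is still one of three symbols.

```python
import math


def leakage_estimate(n, alphabet_size=4):
    """Return (total, leaked, remaining) bits when one symbol's positions are known."""
    total_bits = n * math.log2(alphabet_size)
    revealed = n // alphabet_size                      # assumed count of the known symbol
    remaining_bits = (n - revealed) * math.log2(alphabet_size - 1)
    leaked_bits = total_bits - remaining_bits
    return total_bits, leaked_bits, remaining_bits


total, leaked, remaining = leakage_estimate(600)
print(total, round(leaked), round(remaining))  # 1200.0 487 713
```

The remaining uncertainty is 450 · log2(3) ≈ 713 bits, and the leaked amount is the difference from the 1200-bit total.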

35 Analysis Clearly, not provable security Suggests two questions: 1.Can adversary determine additional symbols; if so, how many? 2.How much info leakage is too much?

36 “4 out of 5 [Biologists] Agree” Given only the position of a single nucleotide literal: 1. No additional nucleotides can be inferred 2. No “biologically useful” information can be inferred Given current understanding of the structure and function of the genome

37 Does It Work? In general, yes –strong correlation b/w our scores and S-W –not as sensitive as S-W: some “weak” matches missed Via statistical inference: –very few false positives: < 10^-4 –very few false negatives (usually none)

38 An Extension Sequences can be “masked” –For each task, choose random binary mask –Remove from sequence all “zero” elements Our experiments suggest mask with “1” in 90% of positions works well X: 2 2 4 2 1 1 4 3 Mask: 1 1 1 0 1 1 1 0 Result: 2 2 4 1 1 4
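The masking step is small enough to sketch in full. The function names are hypothetical, and drawing each mask bit independently with probability 0.9 is one assumed way to get a mask with “1” in roughly 90% of positions; the slide does not say how the mask is generated.

```python
import random


def random_mask(length, keep_prob=0.9, rng=None):
    """Random binary mask; each position is 1 with probability keep_prob."""
    rng = rng or random.Random()
    return [1 if rng.random() < keep_prob else 0 for _ in range(length)]


def apply_mask(offsets, mask):
    """Drop every element of offsets whose mask bit is 0."""
    if len(offsets) != len(mask):
        raise ValueError("mask and sequence must have equal length")
    return [value for value, bit in zip(offsets, mask) if bit == 1]


# The example from the slide.
X = [2, 2, 4, 2, 1, 1, 4, 3]
mask = [1, 1, 1, 0, 1, 1, 1, 0]
print(apply_mask(X, mask))  # [2, 2, 4, 1, 1, 4]
```

After masking, the offsets no longer sum to exact positions in the original sequence.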

39 Simulation Results Well-matched sequences artificially generated –Substring mutated over several generations –Placed at random location into random sequences Scoring function: +1 match, -1 no match Gap penalty: g(k) = 2 + 1(k-1)

40 10000 comparisons, no mask, maximum method Sequence length 600-800, matching portion length 300, average of 52.5 subs and 52.5 indels

41 10000 comparisons, no mask, adding method Sequence length 600-800, matching portion length 300, average of 52.5 subs and 52.5 indels

42 1000 comparisons, no mask, maximum method Sequence length 2000, matching portion length 1000, average of 150 subs and 150 indels

43 1000 comparisons, 90% mask, maximum method Sequence length 1000-1300, matching portion length 500, average of 86.25 subs and 86.25 indels

44 Conclusions Introduced notion of sufficient accuracy Presented a strategy for enhancing data privacy in an important real-world application Presented an important real-world app that: –requires privacy –is efficiently parallelizable Potential first entry for benchmark suite of apps for privacy study

45 Future Work Solution is less than ideal –lack of formal privacy model / provable security –need more testing on real genetic data But it’s a start –general problem is very difficult –this is a potential avenue of attack –S-W requires more careful study in this context Consider additional apps

46 Ongoing DVC Work @ UR Augmenting BOINC software for campus-wide distribution –want to collect participant/server/data info & patterns –Greg Steffensen Exploring AI to catch malicious behavior –can we catch omitted results? –Andy White, Matt Kretchmar (Denison CS)

47 Thanks NSF CyberTrust Doug Szajda, Jason Owen All the UR students UR Biologists: –Rafael de Sa, Laura Runyen-Janecky, Joe Gindhart Tadayoshi Kohno (UCSD)

