Published by Aubrey Holmes. Modified over 8 years ago.
2
How To Tell A Secret Without Revealing It
Enhanced Data Privacy in a Distributed Implementation of the Smith-Waterman Genome Sequence Comparison Algorithm
Barry Lawson
University of Richmond
3
Outline
Distributed volunteer computing
A problem
Related work
A real-world application
Our enhanced privacy approach
Results & analysis
Conclusions, future & ongoing work
4
Our Scenario
You have a very large, compute-intensive project
Your PC (few GFLOPS)
NOW, a network of workstations (GFLOPS)
Supercomputer (< 280.6 TFLOPS) ~$128M
5
What To Do? Welcome to the Internet, my friend. How can I help you?
6
Distributed Volunteer Computing (DVC)
[Figure: Supervisor (Alice) and Participants (Bob), < 200 TFLOPS]
7
DVC Computation
Large-scale distributed computation:
  –compute intensive
  –easily parallelizable
Supervisor:
  –divide computation into independent tasks
  –ship tasks to participants
  –collect significant results
Participants:
  –download and execute tasks (when otherwise idle)
  –return significant results
8
Real-world Examples
9
DVC Group @ Richmond
Faculty
  –Doug Szajda (CS)
  –Barry Lawson (CS)
  –Jason Owen (Statistics)
Students
  –Current: Mike Pohl, Greg Steffensen, Andy White
  –Past: Ed Kenney (CMU), Dan Upton (UVA), Rom Chan, Trin C.
  –Four more this summer: Stefan Chipilov, Brittany Williams, Matt King, Ivan Jibaja
10
Telling a Secret Without Revealing It
A problem:
  –Participants are untrustworthy
  –Code executes outside supervisor’s control
  –Computation data may be proprietary
Goal:
  –Participants provide meaningful results
  –Supervisor does not divulge data
11
Related Work (not exhaustive)
Computing with encrypted data:
  –Alice has x, wants Bob to compute f(x), but does not want to divulge x
  –Alice gives Bob x’ and f’( )
  –Bob computes f’(x’)
The key:
  –Alice can determine f(x) from f’(x’)
  –Bob cannot determine x from x’ and/or f’(x’)
Difficult (often impossible) in practice
12
Less Formally
[Figure: “Alice” turns x and f into x’ and f’; “Bob” computes f’(x’), from which “Alice” obtains f(x)]
13
Flexibility In Our Context
The computation:
  –Alice (supervisor) has many x’s
  –Bob (participant) determines x’s that are significant
Alice doesn’t need the value f(x)
Alice will post-process
  –A few false positives are OK
  –Sufficient accuracy: flexibility in f’( )
14
The Adversary
Assumed to be intelligent
  –can decompile, analyze, modify code
  –understands task algorithms
  –understands enhanced privacy scheme(s)
Motivation
  –may not be obvious: business competitor?
  –may not care if leak is detected
15
General Model
The Computation:
  –evaluate an algorithm f : D → R for all x in D
Task T( ):
  –partition D into subsets D_i
  –T(D_i) evaluates f(x_i) for all x_i in D_i
Filter function G( ):
  –determines “significance”
  –returns indices of significant x_i
[Figure: D partitioned into D_1, D_2, D_3]
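The general model can be sketched in a few lines of Python. This is only an illustration: the helper names `partition` and `task`, and the toy choices of f and G, are hypothetical and do not come from the talk.

```python
from typing import Callable, List, Sequence, TypeVar

X = TypeVar("X")
R = TypeVar("R")


def partition(domain: Sequence[X], size: int) -> List[Sequence[X]]:
    """Split D into consecutive subsets D_i of at most `size` elements."""
    return [domain[i:i + size] for i in range(0, len(domain), size)]


def task(subset: Sequence[X], f: Callable[[X], R], G: Callable[[R], bool]) -> List[int]:
    """T(D_i): evaluate f on every x_i in D_i and return the indices
    (within D_i) of the inputs whose result the filter G marks significant."""
    return [i for i, x in enumerate(subset) if G(f(x))]


# Toy stand-ins (assumptions): f squares its input, G keeps results above 50.
D = list(range(12))
subsets = partition(D, 4)
results = [task(D_i, lambda x: x * x, lambda r: r > 50) for D_i in subsets]
print(results)
```

Each call to `task` is independent of the others, which is what lets a supervisor ship the subsets to different participants.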
16
Our General Approach
Transform D_i, f, G into D_i’, f’, G’
Replace task T(D_i) with T(D_i’)
Desirable properties:
  –T(D_i) does not leak additional info about values in D_i
  –significance in T(D_i) corresponds to significance in T(D_i’)
  –any difference is reasonably small
17
In Reality…
Providing desired properties is difficult
  –even with increased flexibility
  –impossible for some apps
When possible, application-specific
Bottom line: we have a potential approach
  –where few, if any, others exist
18
Application: Genome Sequence Comparison
Compare sequences over genome alphabet Σ = {A,C,G,T}
Track evolutionary changes by aligning columns of sequences (an alignment)
E.g.:
  CTGTTA
  CAGTTA
19
Sequence Evolution
Deletion:
  CTGTTA
  CTG–TA
Insertion:
  C–TGTTA
  CGTGTTA
Substitution:
  CTGTTA
  CAGTTA
Deletions and insertions together are called indels
20
Sequence Evolution
After several “generations”
  C–TGT––TA––
  CTA–TGCTACG
Note: # of alignments is huge (for realistic-length sequences)
21
Alignment Types
Global alignment
  –considers entire sequence
Local alignment
  –considers substrings
  –biologists usually use local
22
Measuring Alignments
Scoring function:
  +1 if symbols match
  -1 if not
Gap penalty
  –g(k) = a + b(k-1)
  –k is gap length (# consecutive dashes in single sequence)
Alignment score:
  –sum of column scores minus gap penalties
23
A Simple Example
Global alignment:
  –Scoring function: +1 match, -1 no match
  –Gap penalty: g(k) = 2 + 1(k-1)
   C  –  T  G  T  –  –  T  A  –  –
   C  T  A  –  T  G  C  T  A  C  G
  +1 -2 -1 -2 +1   -3   +1 +1   -3
Alignment score: +4 (matches) - 11 (mismatch and gap penalties) = -7
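The arithmetic in this example can be checked mechanically. The sketch below scores a finished alignment under the slide's rules (+1 match, -1 mismatch, g(k) = 2 + 1(k-1)); the function names are hypothetical, and an ASCII "-" stands in for the slide's dash.

```python
def gap_penalty(k: int, a: int = 2, b: int = 1) -> int:
    """Gap penalty g(k) = a + b(k - 1) for a gap of length k."""
    return a + b * (k - 1)


def alignment_score(top: str, bottom: str, gap: str = "-") -> int:
    """Score a finished alignment: +1 per match, -1 per mismatch,
    minus g(k) for every maximal run of k gap symbols in one sequence."""
    assert len(top) == len(bottom)
    score = 0
    run_row, run_len = None, 0  # row holding the current gap run, and its length
    for t, b in zip(top, bottom):
        row = 0 if t == gap else 1 if b == gap else None
        if row is not None and row == run_row:
            run_len += 1        # the current gap run continues
            continue
        if run_len:             # a gap run just ended: charge its penalty
            score -= gap_penalty(run_len)
        run_row, run_len = row, (1 if row is not None else 0)
        if row is None:         # ordinary column: match or mismatch
            score += 1 if t == b else -1
    if run_len:                 # gap run reaching the end of the alignment
        score -= gap_penalty(run_len)
    return score


print(alignment_score("C-TGT--TA--", "CTA-TGCTACG"))  # prints -7
```

The four matches contribute +4; the single mismatch (-1) and the four gaps (2 + 2 + 3 + 3) contribute -11.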
24
Smith-Waterman
Dynamic programming algorithm
Produces an optimal alignment
Global: O(n^2); Local: O(n^3)
Implemented on commercial DVC platforms
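For readers who have not seen the recurrence, here is a minimal sketch of the textbook Smith-Waterman local-alignment score with a general gap function g(k). It runs in cubic time, consistent with the O(n^3) figure above. It is not the code used on any DVC platform; the function name and defaults are assumptions, and it returns only the score, with no traceback.

```python
from typing import Callable, Sequence


def smith_waterman(x: Sequence, y: Sequence,
                   match: int = 1, mismatch: int = -1,
                   gap: Callable[[int], int] = lambda k: 2 + (k - 1)) -> int:
    """Best local alignment score of x and y under a general gap penalty."""
    n, m = len(x), len(y)
    # H[i][j] = best score of a local alignment ending at x[i-1], y[j-1]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cell = max(0, H[i - 1][j - 1] + s)      # restart, or extend diagonally
            for k in range(1, i + 1):               # gap of length k in y
                cell = max(cell, H[i - k][j] - gap(k))
            for k in range(1, j + 1):               # gap of length k in x
                cell = max(cell, H[i][j - k] - gap(k))
            H[i][j] = cell
            best = max(best, cell)
    return best


print(smith_waterman("CTGTTA", "CAGTTA"))
```

Because the function only compares elements for equality, it accepts any sequence type, including lists of integers.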
25
Significance in S-W
Significance of scores based on probability
Empirical evidence:
  –given randomly-generated sequences
  –scores exhibit extreme value distribution
26
Determining Significance
Choosing a significance threshold p:
  –want small probability that a random score > p
  –typically, probability < 0.003
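One simple way to pick such a threshold is empirically, from scores obtained by comparing random sequences. The sketch below is an assumption about how that could be done, not the procedure from the talk; `empirical_threshold` is a hypothetical helper, and the input shown is a placeholder rather than real alignment scores.

```python
from bisect import bisect_right
from typing import Sequence


def empirical_threshold(null_scores: Sequence[float], alpha: float = 0.003) -> float:
    """Smallest observed score p such that the fraction of null scores
    strictly greater than p is below alpha."""
    if not null_scores:
        raise ValueError("need at least one null score")
    ordered = sorted(null_scores)
    n = len(ordered)
    for p in ordered:
        exceed = n - bisect_right(ordered, p)  # number of scores strictly above p
        if exceed / n < alpha:
            return p
    return ordered[-1]  # not reached for alpha > 0; kept for safety


# Placeholder input (an assumption): real null scores would come from
# comparing randomly generated sequences, as described on the slide.
placeholder_scores = list(range(1000))
p = empirical_threshold(placeholder_scores)
print(p)
```

With only a modest number of null scores, the tail estimate is coarse; fitting an extreme value distribution to the scores is the usual alternative.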
27
A Smith-Waterman Task
Pairwise comparison of two sets of sequences, A and B
  –A: proprietary sequences
  –B: sequences from public database
Returned: indices of well-matched pairs
Notation: T(A, B, s, g, p)
28
Our Transformation
Use offset sequences:
  –compare relative distances between specific nucleotides
X: GCACTTACGCCCTTACGACG
  –F(X,A) = {3,4,8,3}
  –F(X,C) = {2,2,4,2,1,1,4,3}
  –F(X,G) = {1,8,8,3}
  –F(X,T) = {5,1,7,1}
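The offset values listed for X are consistent with the following reading: the first entry is the position of the first occurrence, and each later entry is the distance from the previous occurrence. A minimal sketch under that assumption (the function name `offsets` is hypothetical):

```python
from typing import List


def offsets(seq: str, nucleotide: str) -> List[int]:
    """F(seq, nucleotide): distance from the start of the sequence to the
    first occurrence, then the distance between consecutive occurrences."""
    result, previous = [], 0
    for position, symbol in enumerate(seq, start=1):
        if symbol == nucleotide:
            result.append(position - previous)
            previous = position
    return result


X = "GCACTTACGCCCTTACGACG"
for c in "ACGT":
    print(c, offsets(X, c))
```

Running this reproduces the four offset sequences shown on the slide.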
29
Modified Tasks
X: GCACTTACGCCCTTACGACG
F(X,C) = {2,2,4,2,1,1,4,3}
Y: GCACTCGCCACTTAGCACG
F(Y,C) = {2,2,2,2,1,2,5,2}
Apply S-W to F(X,C) and F(Y,C)
  –Scoring function, gap penalty
  –“Goodness” threshold
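A modified task can be sketched end to end: compute the C offsets of X and Y, then run a local alignment over the integer sequences. The slides do not spell out the scoring function for offset values, so exact equality with +1/-1 and g(k) = 2 + (k-1) is assumed here, borrowed from the simulation settings. Both helper names are hypothetical.

```python
from typing import List, Sequence


def offsets(seq: str, nucleotide: str) -> List[int]:
    """F(seq, nucleotide): gaps between successive occurrences, the first
    measured from the start of the sequence."""
    out, prev = [], 0
    for pos, sym in enumerate(seq, start=1):
        if sym == nucleotide:
            out.append(pos - prev)
            prev = pos
    return out


def local_score(x: Sequence[int], y: Sequence[int]) -> int:
    """Smith-Waterman local score with +1/-1 scoring and g(k) = 2 + (k - 1)."""
    H = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    best = 0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            diag = H[i - 1][j - 1] + (1 if x[i - 1] == y[j - 1] else -1)
            up = max(H[i - k][j] - (k + 1) for k in range(1, i + 1))    # g(k) = k + 1
            left = max(H[i][j - k] - (k + 1) for k in range(1, j + 1))
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best


X = "GCACTTACGCCCTTACGACG"
Y = "GCACTCGCCACTTAGCACG"
fx, fy = offsets(X, "C"), offsets(Y, "C")
score = local_score(fx, fy)  # the task would compare this to its "goodness" threshold
print(fx, fy, score)
```

The participant only ever sees `fx` and `fy`, never X or Y.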
30
Intuition
Similar sequences give similar offsets
  –consider effects of indels, substitutions
What about false positives?
  –multiple nucleotides
    e.g., assign A & C tasks to distinct participants
  –good match if both tasks indicate significance
31
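[Figure: “Alice” holds the sequences CAGGATCTCAAGC and CAGCATATCACGT. One participant (“Bob 1” or “Bob 2”) receives only the A offsets, 2 3 5 1 and 2 3 2 3; the other receives only the C offsets, 1 6 2 4 and 1 3 5 2.]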
32
Using Multiple Nucleotides
Maximum method
  –one task for each of A,C,G,T
  –result significant if any of the four indicates significance
Adding method
  –one task for each of A,C,G,T
  –result significant if sum of four scores indicates significance
Costs reduced in either case
  –on average, 1/4 length of original sequence
  –runtime for an offset sequence ~1/64
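The two combination rules are easy to state in code. In this sketch the function names, the example scores, and both thresholds are made up for illustration; in practice the thresholds would come from the score distributions.

```python
from typing import Dict


def significant_maximum(scores: Dict[str, float], threshold: float) -> bool:
    """Maximum method: significant if any single-nucleotide task exceeds
    the per-task threshold."""
    return max(scores.values()) > threshold


def significant_adding(scores: Dict[str, float], threshold_sum: float) -> bool:
    """Adding method: significant if the sum of the four scores exceeds
    a threshold chosen for the summed score."""
    return sum(scores.values()) > threshold_sum


# Hypothetical per-nucleotide scores for one sequence pair.
scores = {"A": 12, "C": 30, "G": 9, "T": 11}
print(significant_maximum(scores, threshold=25))    # True: the C task exceeds 25
print(significant_adding(scores, threshold_sum=70)) # False: the sum is 62

# Cost estimate from the slide: an offset sequence is about 1/4 as long,
# so a cubic-time comparison costs about (1/4)**3 = 1/64 as much.
relative_runtime = (1 / 4) ** 3
```

Four such tasks together therefore cost roughly 4/64 = 1/16 of one full comparison.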
33
Does This Provide Real Data Privacy?
Recall desired properties:
  1. T(D_i) does not leak additional info about values in D_i
  2. significance in T(D_i) corresponds to significance in T(D_i’)
  3. any difference is reasonably small
34
Data Privacy?
Property 1 fails:
  –T(D_i) does leak additional info about values in D_i
  –adversary knows all info about one nucleotide
How much info is leaked?
  –conditional entropy gives rough estimate
  –e.g., N = 600, C ∂ = N/4
    487 bits (of 1200) leaked
    713 bits of uncertainty remain
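The two figures can be reproduced with a short calculation. This is a reconstruction, not the derivation from the talk. It assumes each of the N symbols is uniform and independent (2 bits each), that the revealed nucleotide occupies N/4 positions, and that every other position remains uniform over the three remaining symbols.

```python
from math import log2


def binary_entropy(q: float) -> float:
    """Entropy in bits of a yes/no variable that is 'yes' with probability q."""
    return -q * log2(q) - (1 - q) * log2(1 - q)


N = 600                 # sequence length
count = N // 4          # positions holding the revealed nucleotide (assumed N/4)
total_bits = 2 * N      # 2 bits per symbol over {A, C, G, T}

# Knowing where one nucleotide sits answers a yes/no question per position.
leaked = N * binary_entropy(count / N)

# Each remaining position is still one of three equally likely symbols.
remaining = (N - count) * log2(3)

print(round(leaked), round(remaining), total_bits)  # prints 487 713 1200
```

Under these assumptions the leaked and remaining bits add up to exactly the 1200 bits of the original sequence.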
35
Analysis
Clearly, not provable security
Suggests two questions:
  1. Can adversary determine additional symbols; if so, how many?
  2. How much info leakage is too much?
36
“4 out of 5 [Biologists] Agree”
Given only the position of a single nucleotide literal:
  1. No additional nucleotides can be inferred
  2. No “biologically useful” information can be inferred
Given current understanding of the structure and function of the genome
37
Does It Work?
In general, yes
  –strong correlation between our scores and S-W
  –not as sensitive as S-W: some “weak” matches missed
Via statistical inference:
  –very few false positives: < 10^-4
  –very few false negatives (usually none)
38
An Extension
Sequences can be “masked”
  –For each task, choose random binary mask
  –Remove from sequence all “zero” elements
Our experiments suggest mask with “1” in 90% of positions works well
  X:      2 2 4 2 1 1 4 3
  mask:   1 1 1 0 1 1 1 0
  result: 2 2 4 1 1 4
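Masking is a one-line filter once the mask is chosen. In the sketch below, drawing each mask bit independently with probability 0.9 is an assumption; the slide only says that about 90% of positions hold a “1”. Both function names are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def random_mask(length: int, keep_prob: float = 0.9,
                rng: Optional[random.Random] = None) -> List[int]:
    """Random binary mask; each position is 1 with probability keep_prob."""
    rng = rng or random.Random()
    return [1 if rng.random() < keep_prob else 0 for _ in range(length)]


def apply_mask(seq: Sequence[int], mask: Sequence[int]) -> List[int]:
    """Drop every element of seq whose mask bit is 0."""
    return [value for value, bit in zip(seq, mask) if bit == 1]


# The example from the slide.
X = [2, 2, 4, 2, 1, 1, 4, 3]
mask = [1, 1, 1, 0, 1, 1, 1, 0]
print(apply_mask(X, mask))  # prints [2, 2, 4, 1, 1, 4]
```

A fresh mask per task means a participant cannot tell which offsets were removed.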
39
Simulation Results
Well-matched sequences artificially generated
  –Substring mutated over several generations
  –Placed at random location into random sequences
Scoring function: +1 match, -1 no match
Gap penalty: g(k) = 2 + 1(k-1)
40
10000 comparisons, no mask, maximum method Sequence length 600-800, matching portion length 300, average of 52.5 subs and 52.5 indels
41
10000 comparisons, no mask, adding method Sequence length 600-800, matching portion length 300, average of 52.5 subs and 52.5 indels
42
1000 comparisons, no mask, maximum method Sequence length 2000, matching portion length 1000, average of 150 subs and 150 indels
43
1000 comparisons, 90% mask, maximum method Sequence length 1000-1300, matching portion length 500, average of 86.25 subs and 86.25 indels
44
Conclusions
Introduced notion of sufficient accuracy
Presented a strategy for enhancing data privacy in an important real-world application
Presented an important real-world app that:
  –requires privacy
  –is efficiently parallelizable
Potential first entry for benchmark suite of apps for privacy study
45
Future Work
Solution is less than ideal
  –lack of formal privacy model / provable security
  –need more testing on real genetic data
But it’s a start
  –general problem is very difficult
  –this is a potential avenue of attack
  –S-W requires more careful study in this context
Consider additional apps
46
Ongoing DVC Work @ UR
Augmenting BOINC software for campus-wide distribution
  –want to collect participant/server/data info & patterns
  –Greg Steffensen
Exploring AI to catch malicious behavior
  –can we catch omitted results?
  –Andy White, Matt Kretchmar (Denison CS)
47
Thanks
NSF CyberTrust
Doug Szajda, Jason Owen
All the UR students
UR Biologists:
  –Rafael de Sa, Laura Runyen-Janecky, Joe Gindhart
Tadayoshi Kohno (UCSD)