How To Tell A Secret Without Revealing It Enhanced Data Privacy in a Distributed Implementation of the Smith-Waterman Genome Sequence Comparison Algorithm.


How To Tell A Secret Without Revealing It Enhanced Data Privacy in a Distributed Implementation of the Smith-Waterman Genome Sequence Comparison Algorithm Barry Lawson University of Richmond

Outline Distributed volunteer computing A problem Related work A real-world application Our enhanced privacy approach Results & analysis Conclusions, future & ongoing work

Our Scenario You have a very large, compute intensive project Your PC (few GFLOPS) NOW (GFLOPS) Supercomputer (< TFLOPS) ~$128M

What To Do? Welcome to the Internet, my friend. How can I help you?

Distributed Volunteer Computing (DVC) Supervisor (Alice) Participants (Bob) (< 200 TFLOPS)

DVC Computation Large-scale distributed computation: –compute intensive –easily parallelizable Supervisor: –divide computation into tasks (independent) –ship tasks to participants –collect significant results Participants: –download and execute tasks (when o/w idle) –return significant results

Real-world Examples

DVC Richmond Faculty –Doug Szajda (CS) –Barry Lawson (CS) –Jason Owen (Statistics) Students –Current Mike Pohl, Greg Steffensen, Andy White –Past Ed Kenney (CMU), Dan Upton (UVA), Rom Chan, Trin C. –Four more this summer Stefan Chipilov, Brittany Williams, Matt King, Ivan Jibaja

Telling a Secret Without Revealing It A problem: –Participants are untrustworthy –Code executes outside supervisor’s control –Computation data may be proprietary Goal: –Participants provide meaningful results –Supervisor does not divulge data

Related Work (not exhaustive) Computing with encrypted data: –Alice has x, wants Bob to compute f(x) But does not want to divulge x –Alice gives Bob x’ and f’( ) –Bob computes f’(x’) The key: –Alice can determine f(x) from f’(x’) –Bob cannot determine x from x’ and/or f’(x’) Difficult (often impossible) in practice

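Less Formally [Diagram: “Alice” transforms x and f into x’ and f’ and gives them to “Bob”; “Bob” computes f’(x’); “Alice” recovers f(x) from f’(x’), while “Bob” is left guessing (“?”) at x.]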

Flexibility In Our Context The computation: –Alice (supervisor) has many x’s –Bob (participant) determines x’s that are significant Alice doesn’t need the value f(x) Alice will post-process –A few false positives are OK –Sufficient accuracy: flexibility in f’( )

The Adversary Assumed to be intelligent –can decompile, analyze, modify code –understands task algorithms –understands enhanced privacy scheme(s) Motivation –may not be obvious: business competitor? –may not care if leak is detected

General Model The Computation: –evaluate an algorithm f : D → R for all x in D Task T( ): –partition D into subsets D_i –T(D_i) evaluates f(x_i) for all x_i in D_i Filter function G( ): –determines “significance” –returns indices of significant x_i [Diagram: D partitioned into D_1, D_2, D_3]
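The general model can be sketched in a few lines of Python. This is only an illustration under assumed names: `partition`, `task`, and the toy `f` and `g` below are hypothetical and do not come from the talk.

```python
from typing import Callable, List, Sequence


def partition(domain: Sequence, size: int) -> List[Sequence]:
    """Split D into consecutive subsets D_i of at most `size` elements."""
    return [domain[i:i + size] for i in range(0, len(domain), size)]


def task(subset: Sequence, f: Callable, g: Callable) -> List[int]:
    """T(D_i): evaluate f on every x_i, then let the filter G pick indices."""
    results = [f(x) for x in subset]
    return g(results)


# Toy stand-ins (assumptions): f squares its input, G flags results above 50.
def f(x):
    return x * x


def g(results):
    return [i for i, r in enumerate(results) if r > 50]


subsets = partition(list(range(10)), 4)
significant = [task(d_i, f, g) for d_i in subsets]
```

Each participant would run `task` on one subset and return only the flagged indices, which matches the "return significant results" role described earlier.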

Our General Approach Transform D_i, f, G into D_i’, f’, G’ Replace task T(D_i) with T(D_i’) Desirable properties: – T(D_i) does not leak additional info about values in D_i – significance in T(D_i) ≈ significance in T(D_i’) – any difference is reasonably small

In Reality… Providing desired properties is difficult –even with increased flexibility –impossible for some apps When possible, application-specific Bottom line: we have a potential approach –where few, if any, others exist

Application: Genome Sequence Comparison Compare sequences over genome alphabet ∑ = {A,C,G,T} Track evolutionary changes by aligning columns of sequences (an alignment) E.g.:
CTGTTA
CAGTTA

Sequence Evolution
Deletion:
CTGTTA
CTG–TA
Insertion:
C–TGTTA
CGTGTTA
Substitution:
CTGTTA
CAGTTA
Deletions and insertions together are called indels.

Sequence Evolution After several “generations”:
C–TGT––TA––
CTA–TGCTACG
Note: # of alignments is huge (for realistic-length sequences)

Alignment Types Global alignment –considers entire sequence Local alignment –considers substrings –biologists usually use local

Measuring Alignments Scoring function: +1 if symbols match -1 if not Gap penalty –g(k) = a + b(k-1) –k is gap length (# consecutive dashes in single sequence) Alignment score: –sum of column scores minus gap penalties

A Simple Example Global alignment: –Scoring function: +1 match, -1 no match –Gap penalty: g(k) = 2 + 1(k-1)
C – T G T – – T A – –
C T A – T G C T A C G
Alignment score: 4(+1) + 1(-1) - (2 + 3 + 3 + 2) = -7 (four matches, one mismatch, gaps of length 1, 2, 2 and 1)
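The score for this example can be checked mechanically. The sketch below assumes the scoring rules stated on the slide (+1 match, -1 mismatch, g(k) = a + b(k-1) with a = 2, b = 1) and writes gaps with an ASCII hyphen; the function names are illustrative.

```python
def gap_penalty(k, a=2, b=1):
    """g(k) = a + b(k-1) for a run of k consecutive gap symbols."""
    return a + b * (k - 1)


def alignment_score(top, bottom, match=1, mismatch=-1, a=2, b=1, gap="-"):
    """Sum of column scores minus the penalty for each gap run."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    # Column scores: only columns with two real symbols contribute.
    for s, t in zip(top, bottom):
        if s != gap and t != gap:
            score += match if s == t else mismatch
    # Gap penalties: one penalty per maximal run of gaps in each row.
    for row in (top, bottom):
        run = 0
        for ch in row + "x":  # sentinel closes a trailing gap run
            if ch == gap:
                run += 1
            elif run:
                score -= gap_penalty(run, a, b)
                run = 0
    return score


print(alignment_score("C-TGT--TA--", "CTA-TGCTACG"))  # -7
```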

Smith-Waterman Dynamic programming algorithm Produces an optimal alignment Global: O(n^2) Local: O(n^3) Implemented on commercial DVC platforms
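A minimal sketch of the local-alignment recurrence, assuming the +1/-1 scoring and the gap penalty g(k) = a + b(k-1) used elsewhere in the talk. The function name and defaults are illustrative, not the speakers' implementation. Trying every gap length at every cell is what makes this version cubic for equal-length inputs.

```python
def smith_waterman_score(x, y, match=1, mismatch=-1, a=2, b=1):
    """Best local alignment score between sequences x and y."""
    n, m = len(x), len(y)
    # H[i][j] = best score of a local alignment ending at x[i-1], y[j-1].
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            score = max(0, H[i - 1][j - 1] + s)
            # Gap of length k in y (skip k symbols of x).
            for k in range(1, i + 1):
                score = max(score, H[i - k][j] - (a + b * (k - 1)))
            # Gap of length k in x (skip k symbols of y).
            for k in range(1, j + 1):
                score = max(score, H[i][j - k] - (a + b * (k - 1)))
            H[i][j] = score
            best = max(best, score)
    return best


print(smith_waterman_score("ACGTACGT", "ACGTTACGT"))  # 6
```

Because the function only compares elements for equality, it accepts lists of integers as readily as strings of nucleotides.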

Significance in S-W Significance of scores based on probability Empirical evidence: –given randomly-generated sequences –scores exhibit extreme value distribution

Determining Significance Choosing a significance threshold p : –want small probability that a random score >p –typically, probability < p
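One way to turn an extreme value distribution into a threshold is sketched below. It assumes the Gumbel (type I) form with location `mu` and scale `beta`; both parameters, the name `alpha` for the tolerated tail probability, and the numeric values are hypothetical and would have to be fitted to scores from random sequences.

```python
import math


def gumbel_tail(score, mu, beta):
    """P(S > score) for a Gumbel-distributed random score S."""
    return 1.0 - math.exp(-math.exp(-(score - mu) / beta))


def threshold_for_tail(alpha, mu, beta):
    """Smallest threshold whose tail probability equals alpha."""
    return mu - beta * math.log(-math.log(1.0 - alpha))


# Illustrative parameters only (assumptions, not values from the talk).
mu, beta = 20.0, 3.0
p = threshold_for_tail(0.01, mu, beta)
```

A higher threshold lowers the chance that a random pair is flagged, at the cost of missing weaker genuine matches.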

A Smith-Waterman Task Pairwise comparison of two sets of sequences, A and B –A: proprietary sequences –B: sequences from public database Returned: indices of well-matched pairs Notation: T(A, B, s, g, p), with scoring function s, gap penalty g, and significance threshold p

Our Transformation Use offset sequences: –compare relative distances b/w specific nucleotides X: GCACTTACGCCCTTACGACG –F(X,A) = {3,4,8,3} –F(X,C) = {2,2,4,2,1,1,4,3} –F(X,G) = {1,8,8,3} –F(X,T) = {5,1,7,1}
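The offset sequences on this slide can be reproduced with a short function. The name `offsets` is illustrative; the convention assumed here is that the first offset is the 1-based position of the first occurrence and each later offset is the distance from the previous occurrence, which matches the values listed above.

```python
def offsets(sequence, symbol):
    """F(X, symbol): distances between successive occurrences of symbol."""
    result = []
    previous = 0
    for position, ch in enumerate(sequence, start=1):
        if ch == symbol:
            result.append(position - previous)
            previous = position
    return result


X = "GCACTTACGCCCTTACGACG"
for nucleotide in "ACGT":
    print(nucleotide, offsets(X, nucleotide))
```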

Modified Tasks X: GCACTTACGCCCTTACGACG F(X,C) = {2,2,4,2,1,1,4,3} Y: GCACTCGCCACTTAGCACG F(Y,C) = {2,2,2,2,1,2,5,2} Apply S-W to F(X,C) and F(Y,C) –Scoring function, gap penalty –“Goodness” threshold

Intuition Similar sequences ⇒ similar offsets –consider effects of indels, substitutions What about false positives? –multiple nucleotides e.g., assign A & C tasks to distinct participants –good match if both tasks indicate significance

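[Diagram: “Alice” holds the sequences CAGGATCTCAAGC and CAGCATATCACGT; “Bob 1” and “Bob 2” each receive the task for a single nucleotide (A, C) and are left guessing (“?”) at the rest.]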

Using Multiple Nucleotides Maximum method –one task for each of A,C,G,T –result significant if any of the four indicate significance Adding method –one task for each of A,C,G,T –result significant if sum of four scores indicates significance Costs reduced in either case –on average, an offset sequence is 1/4 the length of the original sequence –runtime for an offset sequence ~1/64, i.e. (1/4)^3 under the O(n^3) local algorithm
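The two ways of combining the four per-nucleotide results can be sketched as follows. The function names, the use of `>=`, and the threshold values are assumptions for illustration; the talk does not state them.

```python
def maximum_method(scores, threshold):
    """Significant if any single-nucleotide score reaches the threshold."""
    return any(score >= threshold for score in scores.values())


def adding_method(scores, sum_threshold):
    """Significant if the four scores together reach the threshold."""
    return sum(scores.values()) >= sum_threshold


# Hypothetical offset-sequence scores for one pair of sequences.
scores = {"A": 12, "C": 30, "G": 9, "T": 11}
print(maximum_method(scores, 25))  # True: the C task alone is significant
print(adding_method(scores, 70))   # False: the sum is 62
```

The maximum method lets each participant report independently, while the adding method needs the supervisor to collect raw scores before deciding.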

Does This Provide Real Data Privacy? Recall desired properties: 1. T(D_i) does not leak additional info about values in D_i 2. significance in T(D_i) ≈ significance in T(D_i’) 3. any difference is reasonably small

Data Privacy? Property 1 fails: –T(D_i) does leak additional info about values in D_i –adversary knows all info about one nucleotide How much info is leaked? –conditional entropy gives rough estimate –e.g., N = 600, C ∂ = N/4 → 487 bits (of 1200) leaked; 713 bits of uncertainty remain
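The figures on this slide can be reproduced with a short entropy calculation. This is one plausible reading, not necessarily the speakers' derivation: it assumes symbols are independent and uniform over the four nucleotides, so revealing which positions hold one nucleotide costs one binary entropy term per position, and each remaining position is uniform over the other three symbols.

```python
import math


def binary_entropy(q):
    """Entropy in bits of a yes/no event with probability q."""
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)


def leakage_estimate(n, fraction=0.25, alphabet=4):
    """Return (total, leaked, remaining) bits for a length-n sequence."""
    total = n * math.log2(alphabet)
    leaked = n * binary_entropy(fraction)
    remaining = n * (1 - fraction) * math.log2(alphabet - 1)
    return total, leaked, remaining


total, leaked, remaining = leakage_estimate(600)
print(round(total), round(leaked), round(remaining))  # 1200 487 713
```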

Analysis Clearly, not provable security Suggests two questions: 1.Can adversary determine additional symbols; if so, how many? 2.How much info leakage is too much?

“4 out of 5 [Biologists] Agree” Given only the position of a single nucleotide literal: 1. No additional nucleotides can be inferred 2. No “biologically useful” information can be inferred Given current understanding of the structure and function of the genome

Does It Work? In general, yes –strong correlation b/w our scores and S-W –not as sensitive as S-W some “weak” matches missed Via statistical inference: –very few false positives: < –very few false negatives (usually none)

An Extension Sequences can be “masked” –For each task, choose random binary mask –Remove from sequence all “zero” elements Our experiments suggest mask with “1” in 90% of positions works well X:
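A minimal sketch of the masking step. It assumes the mask keeps exactly 90% of positions (rounded) and is applied element by element to whatever sequence the task uses; the slide does not say whether the mask is drawn per position or with an exact count, so both choices, and the function names, are assumptions.

```python
import random


def random_mask(length, keep_fraction=0.9, rng=None):
    """Binary mask with 1 in a random keep_fraction of the positions."""
    rng = rng or random.Random()
    dropped = length - round(keep_fraction * length)
    zero_positions = set(rng.sample(range(length), dropped))
    return [0 if i in zero_positions else 1 for i in range(length)]


def apply_mask(sequence, mask):
    """Remove every element whose mask bit is zero."""
    return [item for item, bit in zip(sequence, mask) if bit]


mask = random_mask(20, rng=random.Random(0))
masked = apply_mask("GCACTTACGCCCTTACGACG", mask)
```

Drawing a fresh mask for each task means a participant cannot line up positions across tasks without also guessing which elements were dropped.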

Simulation Results Well-matched sequences artificially generated –Substring mutated over several generations –Placed at random location into random sequences Scoring function: +1 match, -1 no match Gap penalty: g(k) = 2 + 1(k-1)
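The generation procedure described here might look like the sketch below. The mutation rates, the choice to mutate two copies of a common ancestor, and all names are assumptions for illustration; the talk reports only average counts of substitutions and indels.

```python
import random

ALPHABET = "ACGT"


def random_sequence(length, rng):
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def mutate(seq, rng, p_sub=0.05, p_indel=0.05):
    """One generation: substitute, delete, or insert after each symbol."""
    out = []
    for ch in seq:
        r = rng.random()
        if r < p_sub:
            out.append(rng.choice([c for c in ALPHABET if c != ch]))
        elif r < p_sub + p_indel / 2:
            continue  # deletion
        elif r < p_sub + p_indel:
            out.extend([ch, rng.choice(ALPHABET)])  # insertion
        else:
            out.append(ch)
    return "".join(out)


def embed(core, total_length, rng):
    """Place core at a random offset inside random flanking symbols."""
    padding = total_length - len(core)
    if padding < 0:
        raise ValueError("core is longer than the requested sequence")
    left = rng.randrange(padding + 1)
    return random_sequence(left, rng) + core + random_sequence(padding - left, rng)


def make_pair(core_length, total_length, generations, rng):
    """Two sequences sharing a matching portion mutated independently."""
    a = b = random_sequence(core_length, rng)
    for _ in range(generations):
        a, b = mutate(a, rng), mutate(b, rng)
    return embed(a, total_length, rng), embed(b, total_length, rng)


first, second = make_pair(30, 400, 3, random.Random(1))
```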

10000 comparisons, no mask, maximum method Sequence length , matching portion length 300, average of 52.5 subs and 52.5 indels

10000 comparisons, no mask, adding method Sequence length , matching portion length 300, average of 52.5 subs and 52.5 indels

1000 comparisons, no mask, maximum method Sequence length 2000, matching portion length 1000, average of 150 subs and 150 indels

1000 comp, 90% mask, maximum method Sequence length , matching portion length 500, average of subs and indels

Conclusions Introduced notion of sufficient accuracy Presented a strategy for enhancing data privacy in an important real-world application Presented an important real-world app that: –requires privacy –is efficiently parallelizable Potential first entry for benchmark suite of apps for privacy study

Future Work Solution is less than ideal –lack of formal privacy model / provable security –need more testing on real genetic data But it’s a start –general problem is very difficult –this is a potential avenue of attack –S-W requires more careful study in this context Consider additional apps

Ongoing DVC UR Augmenting BOINC software for campus-wide distribution –want to collect participant/server/data info & patterns –Greg Steffensen Exploring AI to catch malicious behavior –can we catch omitted results? –Andy White, Matt Kretchmar (Denison CS)

Thanks NSF CyberTrust Doug Szajda, Jason Owen All the UR students UR Biologists: –Rafael de Sa, Laura Runyen-Janecky, Joe Gindhart Tadayoshi Kohno (UCSD)