Toward a Practical Data Privacy Scheme for a Distributed Implementation of the Smith-Waterman Genome Sequence Comparison Algorithm Doug Szajda Mike Pohl

Presentation transcript:

Toward a Practical Data Privacy Scheme for a Distributed Implementation of the Smith-Waterman Genome Sequence Comparison Algorithm Doug Szajda Mike Pohl * Jason Owen Barry Lawson

Large-Scale Distributed Computations Easily parallelizable, compute-intensive Divide into independent tasks to be executed on participant PCs Significant results collected by supervisor

Examples –Finding Martians –Protein folding GIMPS (Entropia) –Mersenne Prime search United Devices, IBM, DOD: Smallpox study DNA sequencing Graphics Exhaustive Regression Genetic Algorithms Data Mining Monte Carlo simulation

A Problem Code is executing in untrusted environments –Data required for task execution may be proprietary –Can we find a way to have participants execute tasks without divulging data?

Related Work (not exhaustive) Computing with Encrypted Data –Feigenbaum (1985) –Abadi, Feigenbaum, Kilian (1987) Secure Circuit Evaluation –Abadi and Feigenbaum (1990) –Sander, Young, and Yung (1999)

Related Work (not exhaustive) Privacy Homomorphisms –Rivest, Adleman, Dertouzos (1978) –Ahituv, Lapid, Neumann (1987) –Brickell and Yacobi (1987) Multiparty function computation –Yao (1986) –Goldreich, Micali, Wigderson (1987) –Ben-Or, Goldwasser, and Wigderson (1988) – Chaum, Crepeau, and Damgard (1988)

Computing With Encrypted Data Alice has x, wants Bob to compute f(x), but does not want to divulge x Alice gives Bob E(x) and f’, tells him to return f’(E(x)) Alice can determine f(x) from f’(E(x)), but Bob cannot determine x from knowledge of E(x), f’(E(x))

In Present Context Alice has several x values. Asks Bob to identify those that are significant –Alice doesn’t need f(x), so greater flexibility in definition of f’ (Sufficient Accuracy) –Post-filtering means that some false positives are OK. Lots of Bobs offering computing services

Adversary (as usual) Assumed to be intelligent –Can decompile, analyze, modify code –Understands task algorithms and measures used to prevent disclosure of data

The Model Computation: evaluate f : D -> R Partition D into subsets D_i Task T(D_i): evaluate f(x_i) for all x_i in D_i Each task assigned filter function G_i –G_i returns indices of interesting x_i
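The model above can be sketched in a few lines. This is only an illustration: the names `partition` and `run_task`, the fixed task size, and the choice to express the filter G_i as a predicate on f(x_i) are assumptions, not part of the original slides.

```python
from typing import Callable, List, Sequence, TypeVar

X = TypeVar("X")
R = TypeVar("R")

def partition(domain: Sequence[X], task_size: int) -> List[Sequence[X]]:
    """Split the domain D into consecutive subsets D_i (hypothetical scheme)."""
    return [domain[i:i + task_size] for i in range(0, len(domain), task_size)]

def run_task(f: Callable[[X], R],
             is_interesting: Callable[[R], bool],
             subset: Sequence[X]) -> List[int]:
    """T(D_i): evaluate f on every x_i in D_i and return the indices
    (within D_i) that the filter flags as interesting."""
    return [idx for idx, x in enumerate(subset) if is_interesting(f(x))]

# Toy computation: f squares its input, the filter keeps values above 20.
domain = list(range(10))
tasks = partition(domain, 4)
results = [run_task(lambda x: x * x, lambda y: y > 20, d) for d in tasks]
```

Each participant would run one `run_task` call and return only the index list to the supervisor.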

Basic Approach Transform D_i, f, G_i into D_i’, f’, G_i’ Replace T(D_i) with T(D_i’) such that 1. T(D_i’) does not leak additional information about values in D_i 2. Identifiers returned by T(D_i’) contain those that would be returned by T(D_i) 3. Difference between the two sets of returned identifiers is reasonably small

Reality Providing required properties is difficult (impossible for some apps) Even when possible, implementation is application specific Bottom line: A potential approach, where few (if any) others exist

An Example: Smith-Waterman Genome Sequence Comparison

Genetic Sequence Alignment Comparing sequences over alphabet ∑={A,C,G,T} Biologists track evolutionary changes by writing sequences with columns aligned (called an alignment) Ex. CTGTTA CAGTTA

Sequence Evolution Deletion: CTGTTA CTG-TA Insertion: C-TGTTA CGTGTTA Substitution: CTGTTA CAGTTA (deletions and insertions are together called indels)

Sequence Evolution (cont.) After several “generations”: C - TGT - TA - CTA - TGCT - CG Note: Number of alignments (for pair of realistic length sequences) is huge

Alignment “Types” Global alignment –Considers entire sequence Local alignment –Considers substrings –Biologists usually consider local alignments

Measuring Alignments Scoring function –+1 if symbols match –-1 if not Gap penalty –g(k) = a + b(k-1) –k is gap length (# consecutive dashes in single sequence) Alignment score is sum of column scores minus gap penalties
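As a rough illustration of the scoring rule above, the sketch below scores an already-aligned pair of rows. The function names are hypothetical, and the defaults a = 2, b = 1 are borrowed from the gap penalty used in the simulation slides later in this deck.

```python
def gap_penalty(k: int, a: int = 2, b: int = 1) -> int:
    """Affine gap penalty g(k) = a + b(k-1) for a run of k consecutive dashes."""
    return a + b * (k - 1)

def alignment_score(top: str, bottom: str, match: int = 1, mismatch: int = -1,
                    a: int = 2, b: int = 1) -> int:
    """Sum of column scores minus one gap penalty per run of dashes in a row."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    gap_row, gap_len = None, 0            # current run of dashes, if any
    for x, y in zip(top, bottom):
        row = "top" if x == "-" else "bottom" if y == "-" else None
        if row != gap_row and gap_len:    # a gap run just ended
            score -= gap_penalty(gap_len, a, b)
            gap_len = 0
        gap_row = row
        if row is None:
            score += match if x == y else mismatch
        else:
            gap_len += 1
    if gap_len:                           # gap run reaching the end of the rows
        score -= gap_penalty(gap_len, a, b)
    return score
```

With these defaults, the deletion example CTGTTA / CTG-TA has five matching columns and one gap of length 1, so its score is 5 - 2 = 3.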

Smith-Waterman Dynamic programming algorithm guaranteed to produce an optimal alignment –Global: O(n^2); local: O(n^3) Widely used by biologists Implemented on commercial volunteer distributed computing platforms
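A minimal sketch of the local-alignment recurrence follows, assuming the +1/-1 scoring function and the gap penalty g(k) = a + b(k-1) from the previous slide. It is written for clarity rather than speed: because it scans every possible gap length at each cell, it runs in cubic time for equal-length inputs. The function name and defaults are illustrative, not taken from the authors' implementation.

```python
def smith_waterman_score(s, t, match=1, mismatch=-1, a=2, b=1):
    """Best local alignment score of sequences s and t.

    H[i][j] is the best score of any local alignment ending at s[i-1], t[j-1];
    the 0 option lets an alignment start fresh anywhere.
    """
    n, m = len(s), len(t)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # Gap of length k in t (k symbols of s aligned to dashes).
            up = max(H[i - k][j] - (a + b * (k - 1)) for k in range(1, i + 1))
            # Gap of length k in s (k symbols of t aligned to dashes).
            left = max(H[i][j - k] - (a + b * (k - 1)) for k in range(1, j + 1))
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best
```

The function only compares elements with `==`, so it accepts lists of integers as well as strings.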

Using Smith-Waterman Significance of Smith-Waterman score based on probabilistic considerations Empirical Evidence: Similarity scores of randomly generated sequences exhibit an extreme value distribution Significance threshold p chosen so that the probability that a random score exceeds p is small (typically <0.003)

A Smith-Waterman Task Pairwise comparison of two sets of sequences, A and B –A: proprietary sequences –B: sequences from public database Returned: indices of well-matched pairs Notation: T(A, B, s, g, p)

Our Transformation Offset sequences: compare relative distances between occurrences of a specific nucleotide U: GCACTTACGCCCTTACGACG –F(U,A) = {3,4,8,3} –F(U,C) = {2,2,4,2,1,1,4,3} –F(U,G) = {1,8,8,3} –F(U,T) = {5,1,7,1}
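The offset sequences listed above are consistent with taking the 1-based position of the first occurrence of the literal, followed by the distance from each occurrence to the next. The sketch below assumes that reading; the function name `offsets` is hypothetical.

```python
def offsets(seq: str, literal: str) -> list:
    """F(seq, literal): position of the first occurrence of `literal`
    (counting from 1), then the gaps between successive occurrences."""
    result = []
    previous = 0
    for position, symbol in enumerate(seq, start=1):
        if symbol == literal:
            result.append(position - previous)
            previous = position
    return result

U = "GCACTTACGCCCTTACGACG"
f_u = {literal: offsets(U, literal) for literal in "ACGT"}
```

A participant who receives only `f_u["C"]` learns where every C sits in U, but not which of A, G, or T fills the remaining positions.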

Modified Tasks U: GCACTTACGCCCTTACGACG F(U,C) = {2,2,4,2,1,1,4,3} V: GCACTCGCCACTTAGCACG F(V,C) = {2,2,2,2,1,2,5,2} Apply S-W to F(U,C) and F(V,C) –Scoring function, gap penalty –“Goodness” threshold

Intuition Similar sequences should have similar offsets –Consider effects of indels, substitutions False positives can be reduced –Consider multiple nucleotides I.e., assign A and C info to distinct participants –Good match if both tasks indicate significance

Using Multiple Nucleotide Literals Maximum method –One task for each of A,C,G,T –Result significant if any of the four says so Adding method –One task for each of A,C,G,T, results passed to fifth participant –Result significant if sum of four scores indicates significance Costs reduced in either case
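The two combination rules might look like the sketch below. The function names, the per-literal score dictionary, and all numeric values are illustrative assumptions; the slides do not say how the thresholds are chosen.

```python
def significant_by_maximum(scores: dict, threshold: float) -> bool:
    """Maximum method: significant if any single-literal task exceeds its threshold."""
    return any(score > threshold for score in scores.values())

def significant_by_adding(scores: dict, sum_threshold: float) -> bool:
    """Adding method: a fifth participant sums the four scores and
    compares the total against a threshold for the sum."""
    return sum(scores.values()) > sum_threshold

# Made-up scores for one pair of sequences, one per nucleotide literal.
example_scores = {"A": 12, "C": 30, "G": 9, "T": 11}
max_result = significant_by_maximum(example_scores, threshold=25)
add_result = significant_by_adding(example_scores, sum_threshold=80)
```

In this made-up case the maximum method flags the pair because of the C task alone, while the adding method does not, which shows that the two rules can disagree.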

Security?

Recall… 1. T(D_i’) does not leak additional information about values in D_i 2. Identifiers returned by T(D_i’) contain those that would be returned by T(D_i) 3. Difference between the two sets of returned identifiers is reasonably small

Data Privacy? Property 1 fails: adversary will know all info about a single nucleotide literal Conditional entropy gives rough estimate of amount of information leaked –Bits leaked: 2N - (N - C_∂) log_2 3, where C_∂ is the number of occurrences of literal ∂ in the sequence –Ex. N = 600, C_∂ = N/4 gives 487 bits (of 1200) leaked (713 bits of uncertainty remain)
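The arithmetic in the example can be checked directly. The sketch assumes the usual reading of the estimate: a sequence of N symbols over a four-letter alphabet carries 2N bits, and once the positions of one literal are known, each of the remaining N - C positions holds one of three symbols, leaving log2(3) bits of uncertainty apiece. The function names are hypothetical.

```python
import math

def bits_remaining(n: int, count: int) -> float:
    """Uncertainty left after one literal's positions are revealed."""
    return (n - count) * math.log2(3)

def bits_leaked(n: int, count: int) -> float:
    """Rough leakage estimate: 2N - (N - C) * log2(3)."""
    return 2 * n - bits_remaining(n, count)

n = 600
count = n // 4
leaked = round(bits_leaked(n, count))        # 487
remaining = round(bits_remaining(n, count))  # 713
```

Under this estimate, revealing one literal that makes up a quarter of the sequence leaks roughly 40% of the total information.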

Analysis Clearly, our scheme does not provide provable security, but it does suggest two questions: 1.Can an adversary determine additional symbols (and if so, how many)? 2.How much information leakage is too much in this context?

“4 out of 5 [Biologists] Agree” Given only the position of a single nucleotide literal: 1.No additional elements can be inferred 2.There is no “biologically useful” information that can be inferred Given current understanding of the structure and function of the genome

An Extension Sequences can be “masked” –For each task, choose random binary mask –Remove from sequence all “zeroed” elements Our experiments suggest mask with “1” in 90% of positions works well
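A minimal sketch of masking follows. It assumes each position is kept independently with probability 0.9, which is one possible reading of "1 in 90% of positions"; the slides do not specify how the mask is drawn, and the function names are hypothetical.

```python
import random

def random_mask(length: int, keep_prob: float = 0.9, rng=None) -> list:
    """Random binary mask; each bit is 1 with probability keep_prob (assumed)."""
    rng = rng or random.Random()
    return [1 if rng.random() < keep_prob else 0 for _ in range(length)]

def apply_mask(seq: str, mask: list) -> str:
    """Remove from the sequence every element whose mask bit is 0."""
    return "".join(symbol for symbol, bit in zip(seq, mask) if bit)

# Fixed mask for illustration: drops the third and eighth symbols.
masked = apply_mask("GCACTTACGC", [1, 1, 0, 1, 1, 1, 1, 0, 1, 1])
```

Because removed elements shift every later position, a participant who sees only the masked data can no longer read exact positions in the original sequence from the offsets.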

Does it Work? In general, yes –Strong correlation between our scores and S-W –Not as sensitive as Smith-Waterman Some weak matches missed Statistical inference techniques show: –Very few false positives ( < ) –Very few false negatives (often none)

Simulation Results Well-matched sequences artificially generated –Substring mutated over several generations –Placed at random location into random sequences Scoring function as given earlier (1, -1) Gap penalty: g(k) = 2 + 1(k-1)

10000 comp, no mask, maximum method for determining significance Sequence length , matching portion length 300, average of 52.5 subs and 52.5 indels

10000 comp, no mask, adding method for determining significance Sequence length , matching portion length 300, average of 52.5 subs and 52.5 indels

1000 comp, no mask, maximum method for determining significance Sequence length 2000, matching portion length 1000, average of 150 subs and 150 indels

1000 comp, 90% mask, maximum method for determining significance Sequence length , matching portion length 500, average of subs and indels

Conclusions Introduced notion of sufficient accuracy Presented a strategy for enhancing data privacy in an important real-world application Presented an important real-world app that requires privacy and is efficiently parallelizable –These are relatively rare –Potential first entry for benchmark suite of apps for privacy study

In the Future Solution is less than ideal –Lack of formal privacy model / provable security –Need more testing on real genetic data But it’s a start –General problem is difficult, this is a potential avenue of attack –Smith-Waterman requires more careful study in this context Application behavior vs. application configurations