Privacy-Preserving Data Quality Assessment for High-Fidelity Data Sharing
Julien Freudiger, Shantanu Rane, Alex Brito, and Ersin Uzun
PARC



First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL

What about data quality?
- Alice does not know the data quality prior to acquisition.
- Dirty data costs US businesses ~$600 billion annually [1].
- Data cleaning accounts for up to 80% of development time.

[1] W. Eckerson. Data quality and the bottom line. TDWI Report, The Data Warehouse Institute.

Privacy concerns for Bob

Alice: “How many rows are complete?”
Bob: “All of them.”

Trust and privacy concerns for Alice

Problem: Privacy-Preserving Data Quality Assessment

Data Quality Metrics
- Many data quality metrics [1,2]: Completeness, Validity, Uniqueness, Consistency, Timeliness.
- Integrity constraints on attributes, built from =, >, [ ]. Example: age > 0.
- Dependency constraints across 2+ attributes, built from if, while, for. Example: if state == CA, then ZIP in [94000, 96199].

[1] Y. Lee, D. Strong, B. Kahn, and R. Wang. AIMQ: a methodology for information quality assessment. Information & Management, 40(2), 2002.
[2] P. Cykana, A. Paul, and M. Stern. DoD Guidelines on Data Quality Management. In IQ, pages 154–171.

Data Quality Metrics: Completeness
- Percentage of elements that are properly populated.
- Check for values such as NULL, “”, …

First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL

Data Quality Metrics: Validity
- Percentage of elements whose attributes possess meaningful values.

First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL

Data Quality Metrics: Consistency
- Degree to which the data attributes satisfy a dependency constraint.

First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL
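The three metrics above can be illustrated on the example table without any privacy protection. This is a minimal sketch, not the authors' implementation: the function names, the use of fractions rather than percentages, and the choice of `age > 0` and the CA ZIP rule as the constraints to check are assumptions taken from the examples on the earlier slide.

```python
# Example table from the slides; None stands for NULL.
rows = [
    {"first": "John", "last": "Steinbeck", "age": 32, "state": "CA", "zip": "94043"},
    {"first": "Jimi", "last": "Hendrix", "age": 27, "state": "WA", "zip": "01000"},
    {"first": "Isaac", "last": "Asimov", "age": -15, "state": "NY", "zip": None},
]

def completeness(rows):
    """Fraction of cells that are populated (neither None nor empty string)."""
    cells = [value for row in rows for value in row.values()]
    populated = [value for value in cells if value is not None and value != ""]
    return len(populated) / len(cells)

def validity(rows, column, is_valid):
    """Fraction of rows whose `column` value passes an integrity constraint."""
    return sum(1 for row in rows if is_valid(row[column])) / len(rows)

def consistency(rows, rule):
    """Fraction of rows that satisfy a dependency constraint across attributes."""
    return sum(1 for row in rows if rule(row)) / len(rows)

def ca_zip_rule(row):
    """if state == CA, then ZIP in [94000, 96199]; other states pass trivially."""
    if row["state"] != "CA":
        return True
    zip_code = row["zip"]
    return zip_code is not None and 94000 <= int(zip_code) <= 96199

completeness_score = completeness(rows)                                    # 14 of 15 cells
validity_score = validity(rows, "age", lambda a: a is not None and a > 0)  # 2 of 3 rows
consistency_score = consistency(rows, ca_zip_rule)                         # 3 of 3 rows
```

In the setting of the talk, Alice would like these numbers without seeing `rows`, and Bob should not learn which constraints she checks.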

Desired Privacy Properties
- Query Privacy: Bob should not learn the data quality constraint parameters or the resulting values.
- Data Privacy: Alice should not learn anything from Bob’s data besides the quality metric.

Application: High-Fidelity Cyber Threat Mitigation

[Figure: four log tables, each with columns IP, Port, Time, UID, APT]

[1] S. Katti, B. Krishnamurthy, and D. Katabi. Collaborating against common enemies. In IMC, 2005.
[2] J. Zhang, P. A. Porras, and J. Ullrich. Highly predictive blacklisting. In USENIX Security, 2008.
[3] P. Porras and V. Shmatikov. Large-scale collection and sanitization of network security data: risks and challenges. In NSPW, 2006.

Solutions
- Rely on existing cryptographic primitives.
- Develop a custom solution.

Private Set Intersection
- Set intersection or cardinality of set intersection [1,2].

[1] M. Freedman, K. Nissim, and B. Pinkas. Efficient private matching and set intersection. In EUROCRYPT, 2004.
[2] E. De Cristofaro, P. Gasti, and G. Tsudik. Fast and Private Computation of Cardinality of Set Intersection and Union. In CANS.

Private Set Intersection: Completeness
- Alice: {NULL}, expanded to {(1, NULL), (2, NULL), …, (n, NULL)}.
- Bob: {$d_1$, …, $d_n$}, expanded to {(1, $d_1$), (2, $d_2$), …, (n, $d_n$)}.
- The PSI-CA approach is inefficient.

Encrypted-domain Computation
- Given ciphertexts $E(d_1)$ and $E(d_2)$, the product $E(d_1) \cdot E(d_2)$ decrypts to $d_1 + d_2$ [1].

[1] P. Paillier. Public-key cryptosystems based on composite degree residuosity classes. In EUROCRYPT.
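For the Paillier cryptosystem [1], the additive property can be written out as follows. The symbols $n$ (public modulus), $g$ (public generator), $r$ (encryption randomness), and $k$ (a plaintext constant) are standard Paillier notation and do not appear on the slide; plaintext sums are taken modulo $n$.

```latex
\begin{align*}
E(d; r) &= g^{d} \, r^{n} \bmod n^{2} \\
E(d_1; r_1) \cdot E(d_2; r_2) &= g^{d_1 + d_2} \, (r_1 r_2)^{n} \bmod n^{2} = E(d_1 + d_2;\, r_1 r_2) \\
E(d; r)^{k} &= g^{k d} \, (r^{k})^{n} \bmod n^{2} = E(k d;\, r^{k})
\end{align*}
```

The last line is what lets a party holding a ciphertext scale the hidden value by a constant it knows, which is the operation needed for an encrypted dot product.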

Select & Aggregate: Setup
- Goal: Alice has a binary selector vector u, Bob has a data vector v. Alice should learn the sum of the selected elements of v.
- Query Privacy: Bob should not learn the selector vector.
- Data Privacy: Alice should not learn any information other than the selected aggregate.

Select & Aggregate: Protocol
1. Alice sends element-wise encryptions of u to Bob.
2. Bob computes the encrypted dot product of u and v using the additive homomorphic property, and sends it to Alice.
3. Alice decrypts the dot product.
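The three steps above can be sketched end to end with a toy Paillier implementation. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the primes are tiny demo values that offer no security, and `g = n + 1` is a common simplification of Paillier key generation.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy Paillier key pair. p and q are tiny demo primes, NOT secure."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1
    mu = pow(lam, -1, n)  # valid shortcut because g = n + 1
    return (n, g), (lam, mu)

def encrypt(pub, m):
    """E(m; r) = g^m * r^n mod n^2 for random r coprime to n."""
    n, g = pub
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(pub, priv, c):
    """m = L(c^lam mod n^2) * mu mod n, with L(x) = (x - 1) / n."""
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def alice_encrypt_selector(pub, u):
    """Step 1: Alice encrypts each selector bit separately."""
    return [encrypt(pub, bit) for bit in u]

def bob_select_and_aggregate(pub, enc_u, v):
    """Step 2: Bob computes E(sum u_i * v_i) as prod E(u_i)^v_i mod n^2."""
    n, _ = pub
    n2 = n * n
    acc = 1  # 1 is a (trivial) encryption of 0
    for c, value in zip(enc_u, v):
        acc = (acc * pow(c, value, n2)) % n2
    return acc

pub, priv = keygen()
u = [0, 1, 1, 0, 1]      # Alice's private selector
v = [23, 5, 7, 11, 2]    # Bob's private data
enc_u = alice_encrypt_selector(pub, u)
enc_sum = bob_select_and_aggregate(pub, enc_u, v)
result = decrypt(pub, priv, enc_sum)  # Step 3: 5 + 7 + 2 = 14
```

Bob only ever handles ciphertexts of the selector bits, and Alice only receives one ciphertext, so she learns the aggregate and nothing about individual entries of v beyond what the sum implies.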

Select & Aggregate: Complexity (vectors of length K)

                     Alice  Bob
# Encryptions        K      0
# Decryptions        1      0
# Multiplications    0      K
# Exponentiations    0      K
# Transmissions      K      1

Cannot afford O(# tuples) complexity for large databases.

Key Idea
1. Find a suitable low-dimensional representation.
2. Use Select & Aggregate to evaluate the quality metric.

Completeness Evaluation: Setup
- Example: Alice wants to find the number of NULL values in Bob’s data.
- Query Privacy: Bob does not discover that Alice is searching for the number of NULLs.
- Data Privacy: Alice discovers nothing else about Bob’s data.
- Trick: Alice generates a HashMap, Bob generates a Counting HashMap.

Alice’s HashMap:        0, …, H(NULL): 1, …, 0
Bob’s Counting HashMap: H($b_1$): 23, …, H(NULL): 5, …, H($b_t$): 2

Completeness Evaluation: Protocol
- Alice generates a public encryption key and a private decryption key for an additively homomorphic cryptosystem.
- The parties evaluate Select & Aggregate on Alice’s HashMap and Bob’s Counting HashMap.
- By construction, the protocol reveals the number of NULLs to Alice.

Alice’s HashMap:        0, …, H(NULL): 1, …, 0
Bob’s Counting HashMap: H($b_1$): 23, …, H(NULL): 5, …, H($b_t$): 2
Result of Secure Select & Aggregate: 5
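The two vectors fed into Select & Aggregate can be sketched as follows. The dot product is computed in the clear here only to show what the secure protocol would reveal to Alice. The table size, the SHA-256-based bucket function, and the `"NULL"` token are assumptions for illustration; the slides do not specify the hash H or how collisions are handled.

```python
import hashlib
from collections import Counter

TABLE_SIZE = 1024  # hypothetical size that both parties agree on

def bucket(value):
    """Shared hash H: map a value (None = NULL) to an index in [0, TABLE_SIZE)."""
    token = "NULL" if value is None else str(value)
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % TABLE_SIZE

def alice_hashmap(target=None):
    """Alice's selector: 1 at H(target), 0 everywhere else."""
    u = [0] * TABLE_SIZE
    u[bucket(target)] = 1
    return u

def bob_counting_hashmap(column):
    """Bob's data vector: number of column entries falling in each bucket."""
    v = [0] * TABLE_SIZE
    for index, count in Counter(bucket(x) for x in column).items():
        v[index] = count
    return v

zips = ["94043", "01000", None, "94043", None]  # Bob's ZIP column
u = alice_hashmap(None)
v = bob_counting_hashmap(zips)

# What Alice would learn from Select & Aggregate on (u, v).
null_count = sum(ui * vi for ui, vi in zip(u, v))
```

The vector length depends on the table size rather than on the number of tuples. A hash collision with the NULL bucket would inflate the count, so the table size trades accuracy against cost.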

Validity Evaluation: Setup
- Example: Alice wants to know how many of Bob’s entries are in the range [C,E].
- Query Privacy: Bob does not discover the range of Alice’s searches.
- Data Privacy: Alice discovers nothing else about Bob’s data.
- Trick: Bob generates a histogram vector, Alice generates a binary selector vector on the support of the histogram.

[Figure: Bob’s histogram of the attribute and Alice’s binary vector, both over the support A, …, Z]

Validity Evaluation: Protocol
- As before, Alice and Bob run the Select & Aggregate protocol on Alice’s selector vector and Bob’s histogram.
- By construction, the protocol reveals the number of “valid” values to Alice.
- The protocol works for arbitrary range queries, uniqueness, and timeliness.

[Figure: Alice’s binary vector and Bob’s histogram of the attribute over the support A, …, Z; result of Secure Select & Aggregate: 15]
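A minimal sketch of the two vectors for a range query, with the dot product shown in the clear to indicate what the secure protocol would output. The support A to Z follows the figure; the sample data and function names are assumptions.

```python
import string
from collections import Counter

# Support of the histogram, agreed on by both parties (A..Z as in the figure).
SUPPORT = list(string.ascii_uppercase)

def bob_histogram(values):
    """Bob's data vector: count of each support symbol in his attribute column.
    Values outside the agreed support are not counted."""
    counts = Counter(values)
    return [counts.get(symbol, 0) for symbol in SUPPORT]

def alice_range_selector(low, high):
    """Alice's selector: 1 for symbols inside [low, high], 0 elsewhere."""
    return [1 if low <= symbol <= high else 0 for symbol in SUPPORT]

bob_values = ["A", "C", "C", "D", "E", "F", "Z", "D", "B", "E"]
histogram = bob_histogram(bob_values)
selector = alice_range_selector("C", "E")

# What Alice would learn from Select & Aggregate: entries in [C, E].
in_range = sum(s * h for s, h in zip(selector, histogram))  # C, C, D, E, D, E -> 6
```

Both vectors have one entry per histogram bin, so the protocol cost is independent of how many rows Bob holds.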

Consistency Evaluation: Setup
- Example: Alice wants to know how many of Bob’s entries follow correct dependencies among attributes, e.g., State – Zipcode.
- Query Privacy: Bob doesn’t discover which dependencies Alice is checking.
- Data Privacy: Alice discovers nothing else about Bob’s data.
- Trick: Bob generates a vector of observed associations, Alice generates a vector of desired associations.

[Figure: observed dependencies and expected dependencies]

- Alice and Bob agree upon an ordering of attribute values.
- They also agree on a vectorization (flattening) pattern.
- Need to securely compute how many of Bob’s dependencies are consistent with Alice’s rules.

[Figure: Desired Dependencies and Observed Dependencies matrices, each with columns CA, MA, MN, …]

Consistency Evaluation: Protocol
- Alice and Bob run the Select & Aggregate protocol on Alice’s desired rule vector and Bob’s observed rule vector.
- The protocol reveals the number of “valid” dependencies to Alice.
- Works for dependencies among arbitrary attribute combinations.

[Figure: expected dependencies and observed dependencies; result of Secure Select & Aggregate: 4]
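The flattening step can be sketched for a State – ZIP dependency. Binning ZIP codes by their first digit, the row-major layout, and the sample records are assumptions chosen for illustration; the slides only say that the parties agree on an ordering and a vectorization pattern. The dot product is shown in the clear to indicate what the secure protocol would output.

```python
# Agreed orderings of attribute values (hypothetical; ZIPs binned by first digit).
STATES = ["CA", "MA", "MN"]
ZIP_BINS = [str(digit) for digit in range(10)]

def flat_index(state, zip_bin):
    """Agreed row-major flattening of the (state, ZIP bin) matrix."""
    return STATES.index(state) * len(ZIP_BINS) + ZIP_BINS.index(zip_bin)

def alice_desired(rules):
    """Alice's binary vector: 1 for each allowed (state, ZIP bin) association."""
    u = [0] * (len(STATES) * len(ZIP_BINS))
    for state, zip_bin in rules:
        u[flat_index(state, zip_bin)] = 1
    return u

def bob_observed(records):
    """Bob's count vector: how often each (state, ZIP bin) pair occurs."""
    v = [0] * (len(STATES) * len(ZIP_BINS))
    for state, zip_code in records:
        v[flat_index(state, zip_code[0])] += 1
    return v

# Illustrative rules: CA ZIPs start with 9, MA with 0, MN with 5.
rules = [("CA", "9"), ("MA", "0"), ("MN", "5")]
records = [
    ("CA", "94043"), ("CA", "90210"), ("MA", "02139"),
    ("MN", "55401"), ("MA", "94043"), ("MN", "01000"),
]

u = alice_desired(rules)
v = bob_observed(records)

# What Alice would learn from Select & Aggregate: number of consistent rows.
consistent = sum(ui * vi for ui, vi in zip(u, v))  # first four records match
```

The vector length is the product of the bin counts of the attributes involved, which is why the cost grows with the number of attributes in a dependency.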

Computational Complexity

Metrics                            Proposed Protocols          Using PSI-CA
Completeness                       O(# uniques)                O(# tuples)
Validity, Timeliness, Uniqueness   O(# histogram bins)         O(# tuples)
Consistency                        O((# histogram bins)^m)     O((# tuples)^m)

Example (AZ 2012 votes, attribute values D, R, L, G): # uniques = # bins = 4, # tuples = 2,306,559.

Conclusions & Discussion
- An important subclass of privacy-preserving data mining.
- Precursor to collaboration among untrusting entities.
- Existing protocols, e.g., PSI-CA, have high computational overhead.
- Can efficiently evaluate many DQ metrics via homomorphic operations on reduced-dimensionality descriptions.
- Future work:
  - DQ for non-numeric attributes.
  - Efficient protocols for testing sparse dependencies.
  - Extremely difficult: private evaluation of the reliability of data.