
Slide 1: Privacy-Preserving Data Quality Assessment for High-Fidelity Data Sharing. Julien Freudiger, Shantanu Rane, Alex Brito, and Ersin Uzun (PARC).

Slide 2: Example dataset offered for sale ($):

First Name | Last Name | Age | State | ZIP
-----------+-----------+-----+-------+------
John       | Steinbeck | 32  | CA    | 94043
Jimi       | Hendrix   | 27  | WA    | 01000
Isaac      | Asimov    | -15 | NY    | NULL

Slide 3: What about data quality?
- Alice does not know the data quality prior to acquisition.
- Dirty data costs US businesses roughly $600 billion annually [1].
- Data cleaning accounts for up to 80% of development time.
(The example table from slide 2 is shown again.)

[1] W. Eckerson. Data Quality and the Bottom Line. TDWI Report, The Data Warehousing Institute, 2002.

Slide 4: Privacy concerns for Bob. (The example table from slide 2 is shown again.)

Slide 5: Trust and privacy concerns for Alice. Alice asks "How many rows are complete?" and Bob answers "All of them," even though his table (the example from slide 2) contains a NULL entry.

Slide 6: Problem: Privacy-Preserving Data Quality Assessment.

Slide 7: Data Quality Metrics.
- Integrity constraints on single attributes (=, >, range membership), e.g., age > 0.
- Dependency constraints across two or more attributes (if-then style rules), e.g., if state == CA, then ZIP in [94000, 96199].
- Many data quality metrics exist [1,2]: completeness, validity, uniqueness, consistency, timeliness.
(A plain, non-private sketch of such constraint checks follows below.)

[1] Y. Lee, D. Strong, B. Kahn, and R. Wang. AIMQ: A Methodology for Information Quality Assessment. Information & Management, 40(2), 2002.
[2] P. Cykana, A. Paul, and M. Stern. DoD Guidelines on Data Quality Management. In IQ, pages 154-171, 1996.
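The following is a minimal, non-private sketch of how such constraints could be checked given direct access to the data; the helper names and sample rows are illustrative and not part of the talk. The rest of the deck is about obtaining such counts without revealing the constraints or the raw data.

```python
# Minimal, non-private sketch of the two kinds of constraints on slide 7.
# Helper names and sample rows are illustrative, not from the talk.

rows = [
    {"first": "John",  "last": "Steinbeck", "age": 32,  "state": "CA", "zip": "94043"},
    {"first": "Jimi",  "last": "Hendrix",   "age": 27,  "state": "WA", "zip": "01000"},
    {"first": "Isaac", "last": "Asimov",    "age": -15, "state": "NY", "zip": None},
]

def age_is_valid(row):
    """Integrity constraint on a single attribute: age > 0."""
    return row["age"] is not None and row["age"] > 0

def state_zip_consistent(row):
    """Dependency constraint across two attributes:
    if state == CA, then ZIP must lie in [94000, 96199]."""
    if row["state"] != "CA":
        return True                       # the rule only constrains CA rows
    if row["zip"] is None:
        return False
    return 94000 <= int(row["zip"]) <= 96199

print(sum(age_is_valid(r) for r in rows), "rows satisfy age > 0")                   # 2
print(sum(state_zip_consistent(r) for r in rows), "rows satisfy the CA/ZIP rule")   # 3
```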

Slide 8: Data Quality Metrics: Completeness.
- Percentage of elements that are properly populated.
- Check for values such as NULL, "", etc.
(The example table from slide 2 is shown again; a sketch of the metric follows below.)
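As a point of reference, a minimal non-private sketch of the completeness metric on the example table (variable names are illustrative):

```python
# Minimal, non-private sketch of the completeness metric on the example table.
# Variable names are illustrative.

rows = [
    ("John",  "Steinbeck", 32,  "CA", "94043"),
    ("Jimi",  "Hendrix",   27,  "WA", "01000"),
    ("Isaac", "Asimov",    -15, "NY", None),      # NULL ZIP
]

def is_populated(value):
    # Treat NULL/None and the empty string as missing.
    return value is not None and value != ""

values = [v for row in rows for v in row]
completeness = sum(is_populated(v) for v in values) / len(values)
print(f"Completeness: {completeness:.0%}")        # 14 of 15 elements -> 93%
```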

Slide 9: Data Quality Metrics: Validity.
- Percentage of elements whose attributes possess meaningful values.
(The example table from slide 2 is shown again.)

Slide 10: Data Quality Metrics: Consistency.
- Degree to which the data attributes satisfy dependency constraints.
(The example table from slide 2 is shown again.)

Slide 11: Desired Privacy Properties.
- Query privacy: Bob should not learn the data quality constraint parameters or the resulting metric values.
- Data privacy: Alice should not learn anything about Bob's data beyond the quality metric.

Slide 12: Application: High-Fidelity Cyber Threat Mitigation. Several organizations each hold threat logs with attributes (IP, Port, Time, UID, APT); collaborative threat mitigation benefits from sharing such data [1,2], but collecting and sharing network security data raises privacy risks [3].

[1] S. Katti, B. Krishnamurthy, and D. Katabi. Collaborating Against Common Enemies. In IMC, 2005.
[2] J. Zhang, P. A. Porras, and J. Ullrich. Highly Predictive Blacklisting. In USENIX Security, 2008.
[3] P. Porras and V. Shmatikov. Large-Scale Collection and Sanitization of Network Security Data: Risks and Challenges. In NSPW, 2006.

Slide 13: Solutions.
- Rely on existing cryptographic primitives.
- Develop a custom solution.

Slide 14: Private Set Intersection (PSI). Two parties compute the intersection of their sets, or only the cardinality of the intersection (PSI-CA), without revealing the sets themselves [1,2].

[1] M. Freedman, K. Nissim, and B. Pinkas. Efficient Private Matching and Set Intersection. In EUROCRYPT, 2004.
[2] E. De Cristofaro, P. Gasti, and G. Tsudik. Fast and Private Computation of Cardinality of Set Intersection and Union. In CANS, 2012.

Slide 15: Private Set Intersection for Completeness. Alice forms the set {(1, NULL), (2, NULL), ..., (n, NULL)} from {NULL}; Bob forms {(1, d_1), (2, d_2), ..., (n, d_n)} from his data {d_1, ..., d_n}; the cardinality of the intersection is the number of NULL entries. This PSI-CA approach is inefficient, since both sets grow with the number of tuples.

Slide 16: Encrypted-Domain Computation. With an additively homomorphic cryptosystem such as Paillier [1], multiplying ciphertexts adds the underlying plaintexts: decrypting E(d_1) * E(d_2) yields d_1 + d_2. (A sketch follows below.)

[1] P. Paillier. Public-Key Cryptosystems Based on Composite Degree Residuosity Classes. In EUROCRYPT, 1999.
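A minimal sketch of this property, assuming the python-paillier (phe) package; the choice of library is an illustration, not something prescribed by the talk:

```python
# Additive homomorphism with Paillier, using the python-paillier (phe) package.
# The package choice is an illustrative assumption; the talk only cites Paillier's scheme.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

d1, d2 = 27, 32
c1 = public_key.encrypt(d1)       # E(d1)
c2 = public_key.encrypt(d2)       # E(d2)

# phe exposes the ciphertext multiplication of Paillier as "+" on
# EncryptedNumber objects: E(d1) * E(d2) is an encryption of d1 + d2.
c_sum = c1 + c2

assert private_key.decrypt(c_sum) == d1 + d2
print(private_key.decrypt(c_sum))  # 59
```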

Slide 17: Select & Aggregate: Setup.
- Goal: Alice has a binary selector vector u, Bob has a data vector v; Alice should learn the sum of the selected elements of v.
- Query privacy: Bob should not learn the selector vector.
- Data privacy: Alice should not learn any information other than the selected aggregate.

Slide 18: Select & Aggregate: Protocol.
1. Alice sends element-wise encryptions of u to Bob.
2. Bob computes an encryption of the dot product of u and v using the additive homomorphic property and sends it to Alice.
3. Alice decrypts the dot product.
(A sketch of the protocol follows below.)
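A minimal end-to-end sketch of the protocol, again assuming the phe package; the vectors and variable names are illustrative:

```python
# Minimal sketch of the Select & Aggregate protocol with python-paillier (phe).
# The package, vectors, and variable names are illustrative assumptions.
from phe import paillier

# --- Alice (querier) ---
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)
u = [1, 0, 1, 1, 0]                                    # binary selector, kept secret from Bob
encrypted_u = [public_key.encrypt(bit) for bit in u]   # step 1: send E(u_i) to Bob

# --- Bob (data holder) ---
v = [5, 7, 2, 9, 4]                                    # data vector, kept secret from Alice
# Step 2: homomorphically compute E(u . v); with phe this is scalar
# multiplication and addition on EncryptedNumber objects.
encrypted_dot = encrypted_u[0] * v[0]
for e_ui, vi in zip(encrypted_u[1:], v[1:]):
    encrypted_dot = encrypted_dot + e_ui * vi

# --- Alice ---
print(private_key.decrypt(encrypted_dot))              # step 3: 5 + 2 + 9 = 16
```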

Slide 19: Select & Aggregate: Complexity. For vectors of length K (the number of tuples when the protocol is applied directly to the data):

Operation         | Alice | Bob
# Encryptions     | K     | 0
# Decryptions     | 1     | 0
# Multiplications | 0     | K
# Exponentiations | 0     | K
# Transmissions   | K     | 1

We cannot afford O(# tuples) complexity for large databases.

Slide 20: Key Idea.
1. Find a suitable low-dimensional representation of the data.
2. Use Select & Aggregate on that representation to evaluate the quality metric.

Slide 21: Completeness Evaluation: Setup.
- Example: Alice wants to find the number of NULL values in Bob's data.
- Query privacy: Bob does not discover that Alice is searching for the number of NULLs.
- Data privacy: Alice discovers nothing else about Bob's data.
- Trick: Alice generates a HashMap that is 1 in the bucket H(NULL) and 0 elsewhere; Bob generates a Counting HashMap of his values, e.g., H(b_1): 23, ..., H(NULL): 5, ..., H(b_t): 2.

Slide 22: Completeness Evaluation: Protocol.
1. Alice generates a public encryption key and a private decryption key for an additively homomorphic cryptosystem.
2. The parties run Select & Aggregate on Alice's HashMap (the selector) and Bob's Counting HashMap (the data vector).
3. By construction, the protocol reveals only the number of NULLs (5 in the figure) to Alice.
(A sketch follows below.)
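A minimal sketch of the completeness protocol, assuming the phe package; the hash function, bucket count, and sample values are illustrative assumptions:

```python
# Minimal sketch of the completeness protocol (phe assumed; the hash, bucket
# count, and sample values are illustrative assumptions).
import hashlib
from phe import paillier

NUM_BUCKETS = 64

def bucket(value):
    # Both parties must agree on the hash; SHA-256 of the string form is used
    # here purely for illustration.
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS

# --- Bob: Counting HashMap over his column values ---
bob_values = ["94043", "01000", None, "94305", None]
counting_hashmap = [0] * NUM_BUCKETS
for v in bob_values:
    counting_hashmap[bucket(v)] += 1

# --- Alice: HashMap that is 1 only in the NULL bucket, encrypted element-wise ---
public_key, private_key = paillier.generate_paillier_keypair()
hashmap = [1 if i == bucket(None) else 0 for i in range(NUM_BUCKETS)]
encrypted_hashmap = [public_key.encrypt(b) for b in hashmap]

# --- Bob: Select & Aggregate in the encrypted domain ---
encrypted_nulls = public_key.encrypt(0)
for e_bit, count in zip(encrypted_hashmap, counting_hashmap):
    if count:                              # empty buckets contribute nothing
        encrypted_nulls = encrypted_nulls + e_bit * count

# --- Alice ---
print("Number of NULLs:", private_key.decrypt(encrypted_nulls))   # 2, barring bucket collisions
```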

Slide 23: Validity Evaluation: Setup.
- Example: Alice wants to know how many of Bob's entries are in the range [C, E].
- Query privacy: Bob does not discover the range of Alice's searches.
- Data privacy: Alice discovers nothing else about Bob's data.
- Trick: Bob generates a histogram of the attribute over an agreed support (A, B, C, ..., Z); Alice generates a binary selector vector on the same support that is 1 on the buckets covered by her range. In the slide's figure the histogram is (0, 1, 4, 6, 7, 2, 0, 1) and the binary vector is (0, 0, 0, 1, 1, 1, 0, 0).

Slide 24: Validity Evaluation: Protocol.
- As before, Alice and Bob run the Select & Aggregate protocol on Alice's selector vector and Bob's histogram.
- By construction, the protocol reveals only the number of "valid" values to Alice (15 in the figure, i.e., 6 + 7 + 2).
- The protocol works for arbitrary range queries, and likewise for uniqueness and timeliness.
(A sketch follows below.)
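A minimal sketch of the validity protocol on a histogram, assuming the phe package; the support, attribute values, and the range [C, E] are illustrative:

```python
# Minimal sketch of the validity protocol on a histogram (phe assumed; the
# support, attribute values, and range [C, E] are illustrative assumptions).
from phe import paillier

SUPPORT = [chr(c) for c in range(ord("A"), ord("Z") + 1)]   # agreed histogram support

def select_and_aggregate(public_key, private_key, selector, data_vector):
    """Select & Aggregate from slide 18: Alice encrypts the selector, Bob
    homomorphically forms the dot product, Alice decrypts it."""
    encrypted_selector = [public_key.encrypt(b) for b in selector]   # Alice
    encrypted_dot = public_key.encrypt(0)                            # Bob
    for e_bit, value in zip(encrypted_selector, data_vector):
        if value:
            encrypted_dot = encrypted_dot + e_bit * value
    return private_key.decrypt(encrypted_dot)                        # Alice

# Bob: histogram of his attribute over the agreed support.
bob_attribute = ["C", "D", "D", "E", "B", "Q", "D"]
histogram = [sum(1 for x in bob_attribute if x == s) for s in SUPPORT]

# Alice: binary selector that is 1 exactly on her (secret) validity range [C, E].
selector = [1 if "C" <= s <= "E" else 0 for s in SUPPORT]

public_key, private_key = paillier.generate_paillier_keypair()
print("Valid entries:", select_and_aggregate(public_key, private_key, selector, histogram))  # 5
```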

Slide 25: Consistency Evaluation: Setup.
- Example: Alice wants to know how many of Bob's entries follow correct dependencies among attributes, e.g., State and ZIP code.
- Query privacy: Bob does not discover which dependencies Alice is checking.
- Data privacy: Alice discovers nothing else about Bob's data.
- Trick: Bob generates a vector of observed associations, Alice generates a vector of desired associations. In the slide's figure, the observed dependencies are (1, 0, 1, 1, 1, 0, 0, 1) and the expected dependencies are (1, 1, 0, 1, 1, 0, 1, 1).

Slide 26:
- Alice and Bob agree upon an ordering of attribute values and on a vectorization (flattening) pattern.
- They need to securely compute how many of Bob's dependencies are consistent with Alice's rules (a sketch of the vectorization follows below).

Desired dependencies (Alice):
ZIP   | CA | MA | MN | ...
94304 | 1  | 0  | 0  | 0
55414 | 0  | 0  | 1  | 0
02139 | 0  | 1  | 0  | 0
94305 | 1  | 0  | 0  | 0

Observed dependencies (Bob):
ZIP   | CA | MA | MN | ...
94304 | 0  | 0  | 1  | 0
55414 | 0  | 0  | 1  | 0
02139 | 0  | 1  | 0  | 0
94305 | 1  | 0  | 0  | 0
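A minimal, non-encrypted sketch of the agreed vectorization and of the count it yields; the attribute orderings and association maps are illustrative, and the final dot product is exactly what Select & Aggregate evaluates privately:

```python
# Non-encrypted sketch of the agreed vectorization for consistency checking.
# Attribute orderings and association maps are illustrative; the dot product
# computed at the end is what Select & Aggregate evaluates privately.

zips   = ["94304", "55414", "02139", "94305"]
states = ["CA", "MA", "MN"]

desired  = {"94304": "CA", "55414": "MN", "02139": "MA", "94305": "CA"}   # Alice's rules
observed = {"94304": "MN", "55414": "MN", "02139": "MA", "94305": "CA"}   # Bob's data

def flatten(assoc):
    """Indicator matrix over (ZIP, state) pairs, flattened row by row."""
    return [1 if assoc[z] == s else 0 for z in zips for s in states]

u = flatten(desired)     # Alice's selector vector
v = flatten(observed)    # Bob's data vector

consistent = sum(ui * vi for ui, vi in zip(u, v))
print("Consistent dependencies:", consistent)   # 3 of 4 (94304 maps to the wrong state)
```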

Slide 27: Consistency Evaluation: Protocol.
- Alice and Bob run the Select & Aggregate protocol on Alice's desired rule vector and Bob's observed rule vector.
- The protocol reveals only the number of "valid" dependencies to Alice; for the figure's vectors, expected (1, 1, 0, 1, 1, 0, 1, 1) and observed (1, 0, 1, 1, 1, 0, 0, 1), the result is 4.
- The approach works for dependencies among arbitrary attribute combinations.
(A sketch follows below.)
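For completeness, a short sketch of this step on the figure's vectors, again assuming the phe package; it is the same Select & Aggregate computation as before, applied to the dependency vectors:

```python
# Short sketch of the consistency protocol on the figure's vectors (phe assumed).
from phe import paillier

expected = [1, 1, 0, 1, 1, 0, 1, 1]    # Alice's desired rules, kept secret
observed = [1, 0, 1, 1, 1, 0, 0, 1]    # Bob's observed dependencies, kept secret

public_key, private_key = paillier.generate_paillier_keypair()
encrypted_expected = [public_key.encrypt(b) for b in expected]   # Alice -> Bob

encrypted_matches = public_key.encrypt(0)                        # Bob aggregates
for e_bit, obs in zip(encrypted_expected, observed):
    if obs:
        encrypted_matches = encrypted_matches + e_bit * obs

print("Valid dependencies:", private_key.decrypt(encrypted_matches))   # 4
```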

Slide 28: Computational Complexity. Example dataset: AZ 2012 votes, with # tuples = 2,306,559 and # uniques = # bins = 4 (labelled D, R, L, G on the slide).

Metric                             | Proposed Protocols      | Using PSI-CA
Completeness                       | O(# uniques)            | O(# tuples)
Validity / Timeliness / Uniqueness | O(# histogram bins)     | O(# tuples)
Consistency                        | O((# histogram bins)^m) | O((# tuples)^m)

Here m is the number of attributes involved in the dependency.

Slide 29: Conclusions & Discussion.
- Privacy-preserving data quality assessment is an important subclass of privacy-preserving data mining and a precursor to collaboration among untrusting entities.
- Existing protocols, e.g., PSI-CA, have high computational overhead.
- Many DQ metrics can be evaluated efficiently via homomorphic operations on reduced-dimensionality descriptions.
- Future work:
  - DQ for non-numeric attributes.
  - Efficient protocols for testing sparse dependencies.
  - Extremely difficult: private evaluation of the reliability of data.
Contact: {jfreudig,srane}@parc.com

