Privacy-Preserving Data Quality Assessment for High-Fidelity Data Sharing
Julien Freudiger, Shantanu Rane, Alex Brito, and Ersin Uzun
PARC
First Name  Last Name  Age  State  ZIP
John        Steinbeck   32  CA     94043
Jimi        Hendrix     27  WA     01000
Isaac       Asimov     -15  NY     NULL
What about data quality?
Alice does not know the data quality prior to acquisition.
Dirty data costs US businesses ~$600 billion annually [1].
Data cleaning accounts for up to 80% of development time.
[1] W. Eckerson. Data quality and the bottom line. TDWI Report, The Data Warehouse Institute,
Privacy concerns for Bob: he cannot simply show Alice the table so that she can judge its quality.
Alice asks: "How many rows are complete?" Bob answers: "All of them."
Trust and privacy concerns for Alice: she has to trust Bob's answer, and her question reveals what she is checking.
Problem: Privacy-Preserving Data Quality Assessment
Data Quality Metrics
Integrity constraints on attributes, built from operators such as =, >, [ ]. Example: age > 0.
Dependency constraints across 2+ attributes, built from constructs such as if, while, for. Example: if state == CA, then ZIP in [94000, 96199].
Many data quality metrics [1,2]: Completeness, Validity, Uniqueness, Consistency, Timeliness.
[1] Y. Lee, D. Strong, B. Kahn, and R. Wang. AIMQ: a methodology for information quality assessment. Information & Management, 40(2), 2002
[2] P. Cykana, A. Paul, and M. Stern. DoD Guidelines on Data Quality Management. In IQ, pages 154–171,
Data Quality Metrics: Completeness
Percentage of elements that are properly populated. Check for values such as NULL, “”, …
In the example table, the ZIP of the third row is NULL.
Data Quality Metrics: Validity
Percentage of elements whose attributes possess meaningful values.
In the example table, the Age of the third row is -15, which violates age > 0.
Data Quality Metrics: Consistency
Degree to which the data attributes satisfy dependency constraints.
In the example table, the second row pairs State WA with ZIP 01000, which violates a State–ZIP dependency.
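For reference, the three metrics can be computed directly when the table is available in the clear, which is exactly what the privacy requirements rule out. The following is a minimal sketch on the example table. The constraints `age > 0` and the CA ZIP range come from the slides; the function names and the choice to report per-row fractions for validity and consistency are illustrative assumptions.

```python
# Example table from the slides; None stands for NULL.
ROWS = [
    ("John", "Steinbeck", 32, "CA", "94043"),
    ("Jimi", "Hendrix", 27, "WA", "01000"),
    ("Isaac", "Asimov", -15, "NY", None),
]

def completeness(rows):
    """Fraction of elements that are populated (neither None nor empty)."""
    cells = [cell for row in rows for cell in row]
    populated = sum(1 for cell in cells if cell not in (None, ""))
    return populated / len(cells)

def validity(rows):
    """Fraction of rows whose Age satisfies the integrity constraint age > 0."""
    valid = sum(1 for row in rows if row[2] is not None and row[2] > 0)
    return valid / len(rows)

def consistency(rows):
    """Fraction of CA rows whose ZIP lies in [94000, 96199] (rule from the slides)."""
    ca_rows = [row for row in rows if row[3] == "CA"]
    ok = sum(
        1 for row in ca_rows
        if row[4] is not None and 94000 <= int(row[4]) <= 96199
    )
    # With no CA rows the rule is vacuously satisfied (an assumption of this sketch).
    return ok / len(ca_rows) if ca_rows else 1.0
```

On the example table, 14 of the 15 elements are populated and 2 of the 3 ages are valid.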
Desired Privacy Properties
Query Privacy: Bob should not learn the data quality constraint parameters or the resulting values.
Data Privacy: Alice should not learn anything from Bob’s data besides the quality metric.
Application: High-Fidelity Cyber Threat Mitigation
Figure: four tables, each with columns IP, Port, Time, UID, APT.
[1] S. Katti, B. Krishnamurthy, and D. Katabi. Collaborating against common enemies. In IMC, 2005
[2] J. Zhang, P. A. Porras, and J. Ullrich. Highly predictive blacklisting. In USENIX Security, 2008
[3] P. Porras and V. Shmatikov. Large-scale collection and sanitization of network security data: risks and challenges. In NSPW, 2006
Solutions
Rely on existing cryptographic primitives.
Develop a custom solution.
Private Set Intersection
Computes the set intersection [1] or the cardinality of the set intersection [2].
[1] M. Freedman, K. Nissim, and B. Pinkas. Efficient private matching and set intersection. In EUROCRYPT, 2004
[2] E. De Cristofaro, P. Gasti, and G. Tsudik. Fast and Private Computation of Cardinality of Set Intersection and Union. In CANS,
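Private Set Intersection: Completeness
Alice holds {NULL}; Bob holds {d_1, …, d_n}.
To count repeated NULLs, both sets are indexed by row: Alice uses {(1, NULL), (2, NULL), …, (n, NULL)} and Bob uses {(1, d_1), (2, d_2), …, (n, d_n)}.
The PSI-CA approach is inefficient, because both sets have one element per row.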
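The indexed-set construction can be sketched in the clear as follows. This is only a plaintext illustration of what the intersection cardinality yields and how large the sets are; it does not implement PSI-CA, and the function names are hypothetical.

```python
def indexed(values):
    """Pair each element with its row number so repeated values stay distinct."""
    return {(i, v) for i, v in enumerate(values, start=1)}

def nulls_via_intersection(column, null_token="NULL"):
    bob_set = indexed(column)                          # {(1, d_1), ..., (n, d_n)}
    alice_set = indexed([null_token] * len(column))    # {(1, NULL), ..., (n, NULL)}
    # A real PSI-CA protocol would reveal only this cardinality to Alice.
    cardinality = len(alice_set & bob_set)
    # The second value is the set size, which grows with the number of rows.
    return cardinality, len(alice_set)
```

For the ZIP column of the example table, `nulls_via_intersection(["94043", "01000", "NULL"])` returns one NULL found with sets of size three.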
Encrypted-domain Computation
Given ciphertexts E(d_1) and E(d_2), the product E(d_1) * E(d_2) decrypts to d_1 + d_2 [1].
[1] P. Paillier. Public-key cryptosystems based on composite degree residuosity classes. In EUROCRYPT,
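For reference, the additive property of the Paillier cryptosystem [1] can be written out as follows. The symbols n (public modulus), g (public generator), and r_i (fresh randomness per ciphertext) are standard Paillier notation and do not appear on the slide.

```latex
E(d_i) = g^{d_i}\, r_i^{\,n} \bmod n^2,
\qquad
E(d_1)\cdot E(d_2) \bmod n^2 = g^{d_1 + d_2}\,(r_1 r_2)^{n} \bmod n^2 = E(d_1 + d_2 \bmod n),
\qquad
E(d)^{k} \bmod n^2 = E(k\,d \bmod n).
```

The last identity is what lets one party scale an encrypted value by a plaintext constant, so a dot product with a plaintext vector needs only multiplications and exponentiations of ciphertexts.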
Select & Aggregate Setup
Goal: Alice has a binary selector vector u, Bob has a data vector v. Alice should discover the sum of the selected elements of v.
Query Privacy: Bob should not find the selector vector.
Data Privacy: Alice should not discover any information other than the selected aggregate.
Select & Aggregate Protocol
1. Alice sends element-wise encryptions of u to Bob.
2. Bob computes the encrypted dot product of u and v using the additive homomorphic property, and sends it to Alice.
3. Alice decrypts the dot product.
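The three steps can be sketched with a toy Paillier implementation. This is a minimal illustration under stated assumptions, not the authors' code: the key is built from two fixed, publicly known Mersenne primes and is therefore insecure, the generator is fixed to g = n + 1, and all function names are hypothetical.

```python
import math
import secrets

# Toy parameters for illustration only: two well-known Mersenne primes.
P, Q = 2**31 - 1, 2**61 - 1

def keygen(p=P, q=Q):
    n = p * q
    phi = (p - 1) * (q - 1)
    mu = pow(phi, -1, n)              # valid because the generator is g = n + 1
    return n, (phi, mu)               # public key, private key

def encrypt(n, m):
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def decrypt(n, priv, c):
    phi, mu = priv
    return (pow(c, phi, n * n) - 1) // n * mu % n

def alice_query(n, u):
    """Step 1: Alice encrypts her binary selector element-wise."""
    return [encrypt(n, bit) for bit in u]

def bob_respond(n, enc_u, v):
    """Step 2: Bob computes an encryption of sum(u_i * v_i) without seeing u."""
    n2 = n * n
    acc = 1
    for c, value in zip(enc_u, v):
        acc = acc * pow(c, value, n2) % n2    # E(u_i)^{v_i} = E(u_i * v_i)
    return acc

def select_and_aggregate(u, v):
    n, priv = keygen()                        # Alice keeps priv to herself
    enc_u = alice_query(n, u)                 # Alice -> Bob: one ciphertext per element
    enc_sum = bob_respond(n, enc_u, v)        # Bob -> Alice: a single ciphertext
    return decrypt(n, priv, enc_sum)          # Step 3: Alice decrypts
```

For example, `select_and_aggregate([1, 0, 1, 1], [5, 7, 11, 13])` selects 5, 11, and 13 and returns their sum. Bob performs one exponentiation and one multiplication per element, while Alice performs one encryption per element and a single decryption.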
Select & Aggregate Complexity (K = number of vector elements)
                    Alice  Bob
# Encryptions       K      0
# Decryptions       1      0
# Multiplications   0      K
# Exponentiations   0      K
# Transmissions     K      1
Cannot afford O(# tuples) complexity for large databases.
Key Idea
1. Find a suitable low-dimensional representation.
2. Use Select & Aggregate to evaluate the quality metric.
Completeness Evaluation Setup
Example: Alice wants to find the number of NULL values in Bob’s data.
Query Privacy: Bob does not discover that Alice is searching for the number of NULLs.
Data Privacy: Alice discovers nothing else about Bob’s data.
Trick: Alice generates a HashMap, Bob generates a Counting HashMap.
Figure: Alice’s HashMap holds 1 at H(NULL) and 0 elsewhere; Bob’s Counting HashMap holds one count per hashed value, e.g., H(b_1): 23, H(NULL): 5, H(b_t): 2.
Completeness Evaluation Protocol
Alice generates a public encryption key and a private decryption key for an additively homomorphic cryptosystem.
The parties evaluate Select & Aggregate on Alice’s HashMap and Bob’s Counting HashMap.
By construction, the protocol reveals the number of NULLs to Alice.
Figure: Alice’s HashMap (1 at H(NULL), 0 elsewhere) and Bob’s Counting HashMap (H(b_1): 23, H(NULL): 5, H(b_t): 2) enter the Secure Select & Aggregate Protocol, which outputs 5.
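The two hashmaps can be sketched as follows. A plaintext dot product stands in for the secure Select & Aggregate step, so this shows only the data representation, not the privacy mechanism. SHA-256 as the shared hash H, the sparse dictionary representation, and all function names are assumptions of this sketch; the slides do not specify them.

```python
import hashlib
from collections import Counter

def h(value):
    """Shared hash H (SHA-256 here is an assumption); None stands for NULL."""
    token = "NULL" if value is None else str(value)
    return hashlib.sha256(token.encode("utf-8")).hexdigest()

def alice_hashmap(missing_tokens=(None, "")):
    """Alice: 1 at H(token) for every token she treats as missing, 0 elsewhere."""
    return {h(token): 1 for token in missing_tokens}

def bob_counting_hashmap(column):
    """Bob: number of occurrences of each hashed value in his column."""
    return Counter(h(value) for value in column)

def plaintext_select_and_aggregate(selector, counts):
    """Stand-in for the secure protocol: dot product over the shared hash domain."""
    return sum(bit * counts.get(key, 0) for key, bit in selector.items())

def count_missing(column):
    return plaintext_select_and_aggregate(
        alice_hashmap(), bob_counting_hashmap(column)
    )
```

Bob's map has one entry per distinct value rather than one per row, which is where the reduction from O(# tuples) to O(# uniques) comes from. In the secure version both parties would additionally need to agree on a common ordering of the hash positions so that the vectors line up.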
Validity Evaluation Setup
Example: Alice wants to know how many of Bob’s entries are in the range [C,E].
Query Privacy: Bob does not discover the range of Alice’s searches.
Data Privacy: Alice discovers nothing else about Bob’s data.
Trick: Bob generates a histogram vector, Alice generates a binary selector vector on the support of the histogram.
Figure: Bob’s histogram of the attribute and Alice’s binary vector, both over the support A, B, C, …, Z.
Validity Evaluation Protocol
As before, Alice and Bob run the Select & Aggregate protocol on Alice’s selector vector and Bob’s histogram.
By construction, the protocol reveals the number of “valid” values to Alice.
The protocol works for arbitrary range queries, uniqueness, and timeliness.
Figure: Alice’s binary vector and Bob’s histogram of the attribute (support A, B, C, …, Z) enter the Secure Select & Aggregate Protocol, which outputs 15.
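The histogram and selector vectors for the range example can be sketched as follows. A plaintext dot product again stands in for the secure Select & Aggregate step. The support A to Z follows the figure; the function names and the sample column are hypothetical.

```python
import string
from collections import Counter

# Agreed ordering of attribute values (the support of the histogram).
SUPPORT = list(string.ascii_uppercase)

def bob_histogram(column):
    """Bob: occurrence count for each symbol of the support, in the agreed order.
    Values outside the support are ignored in this sketch."""
    counts = Counter(column)
    return [counts.get(symbol, 0) for symbol in SUPPORT]

def alice_selector(low, high):
    """Alice: 1 for every symbol inside her range [low, high], 0 elsewhere."""
    return [1 if low <= symbol <= high else 0 for symbol in SUPPORT]

def count_in_range(selector, histogram):
    """Plaintext stand-in for Select & Aggregate."""
    return sum(bit * count for bit, count in zip(selector, histogram))
```

For a column such as `list("ACCDEEZB")`, the selector for [C, E] picks the bins C, D, and E and the aggregate is their combined count. Both vectors have one entry per histogram bin, regardless of how many rows Bob holds.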
Consistency Evaluation Setup
Example: Alice wants to know how many of Bob’s entries follow correct dependencies among attributes, e.g., State–ZIP code.
Query Privacy: Bob doesn’t discover which dependencies Alice is checking.
Data Privacy: Alice discovers nothing else about Bob’s data.
Trick: Bob generates a vector of observed associations, Alice generates a vector of desired associations.
Figure: Observed dependencies (Bob) and Expected dependencies (Alice).
Alice and Bob agree upon an ordering of attribute values. They also agree on a vectorization (flattening) pattern.
They need to securely compute how many of Bob’s dependencies are consistent with Alice’s rules.
Figure: Desired Dependencies and Observed Dependencies, each indexed by state (CA, MA, MN, …).
Consistency Evaluation Protocol
Alice and Bob run the Select & Aggregate protocol on Alice’s desired rule vector and Bob’s observed rule vector.
The protocol reveals the number of “valid” dependencies to Alice.
It works for dependencies among arbitrary attribute combinations.
Figure: Expected dependencies and Observed dependencies enter the Secure Select & Aggregate Protocol, which outputs 4.
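The flattening of State–ZIP associations can be sketched as follows, with a plaintext dot product standing in for the secure Select & Aggregate step. Several choices here are assumptions made for illustration: ZIP codes are binned by their 3-digit prefix plus a NULL bin, the state ordering is arbitrary, the flattening is row-major, and all names are hypothetical. Only the CA range [94000, 96199] comes from the slides.

```python
STATES = ["CA", "MA", "MN", "NY", "WA"]                     # agreed ordering (hypothetical)
ZIP_BINS = [f"{i:03d}" for i in range(1000)] + ["NULL"]     # 3-digit prefixes plus NULL
SIZE = len(STATES) * len(ZIP_BINS)

def flat_index(state, bin_label):
    """Row-major flattening of the State x ZIP-bin association matrix."""
    return STATES.index(state) * len(ZIP_BINS) + ZIP_BINS.index(bin_label)

def to_bin(zip_code):
    return "NULL" if zip_code is None else zip_code[:3]

def bob_observed(rows):
    """Bob: count of each observed (state, ZIP bin) association."""
    vec = [0] * SIZE
    for state, zip_code in rows:
        vec[flat_index(state, to_bin(zip_code))] += 1
    return vec

def alice_desired(rules):
    """Alice: 1 for every association her rules allow, 0 elsewhere.
    rules maps a state to an inclusive range of 3-digit ZIP prefixes."""
    vec = [0] * SIZE
    for state, (low, high) in rules.items():
        for prefix in range(low, high + 1):
            vec[flat_index(state, f"{prefix:03d}")] = 1
    return vec

def count_consistent(desired, observed):
    """Plaintext stand-in for Select & Aggregate."""
    return sum(d * o for d, o in zip(desired, observed))

# CA range from the slides; the WA and NY ranges are illustrative assumptions.
RULES = {"CA": (940, 961), "WA": (980, 994), "NY": (100, 149)}
ROWS = [("CA", "94043"), ("WA", "01000"), ("NY", None)]
```

With these rules, `count_consistent(alice_desired(RULES), bob_observed(ROWS))` counts only the CA row, since the WA row carries an out-of-range ZIP and the NY row has no ZIP. The vector length is the product of the bin counts of the attributes involved, which matches the growth of the cost with the number of attributes in a dependency.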
Computational Complexity
Example (AZ 2012 votes): histogram bins D, R, L, G; # uniques = # bins = 4; # tuples = 2,306,559.
Metrics                            Proposed Protocols         Using PSI-CA
Completeness                       O(# uniques)               O(# tuples)
Validity, Timeliness, Uniqueness   O(# histogram bins)        O(# tuples)
Consistency                        O((# histogram bins)^m)    O((# tuples)^m)
Conclusions & Discussion
An important subclass of privacy-preserving data mining.
Precursor to collaboration among untrusting entities.
Existing protocols, e.g., PSI-CA, have high computational overhead.
Can efficiently evaluate many DQ metrics via homomorphic operations on reduced-dimensionality descriptions.
Future work:
–DQ for non-numeric attributes.
–Efficient protocols for testing sparse dependencies.
–Extremely difficult: private evaluation of the reliability of data.