Privacy Preserving OLAP. Rakesh Agrawal (IBM Almaden), Ramakrishnan Srikant (IBM Almaden), Dilys Thomas (Stanford University).



Horizontally Partitioned Personal Information. Each client C_i perturbs its original row r_i into a perturbed row p_i (i = 1, ..., n). The server collects p_1, p_2, ..., p_n into table T for analysis. EXAMPLE: What number of children in this county go to college?

Vertically Partitioned Enterprise Information
Original Relation D_1 (ID, C1): John 1; Alice 5; Bob 18
Perturbed Relation D'_1 (ID, C1): John 1; Alice 7; Bob 18
Original Relation D_2 (ID, C2, C3): John 279; Alice 536
Perturbed Relation D'_2 (ID, C2, C3): John 359; Alice 537
Perturbed Joined Relation D' (ID, C1, C2, C3): John 1359; Alice 7537
EXAMPLE: What fraction of United customers to New York fly Virgin Atlantic to travel to London?

Talk Outline: Motivation; Problem Definition; Query Reconstruction; Privacy Guarantees; Experiments

Privacy Preserving OLAP. Compute: select count(*) from T where P_1 and P_2 and P_3 and ... P_k, i.e. COUNT_T(P_1 and P_2 and P_3 and ... P_k). We need to: provide error bounds to the analyst; provide privacy guarantees to the data sources; scale to a larger number of attributes.

Uniform Retention Replacement Perturbation. Flip a coin with bias 0.2 for heads. HEADS: retain the value. TAILS: replace it with a value chosen uniformly at random (u.a.r.) from [1-5].

Retention Replacement Perturbation. Done for each column. The replacing pdf need not be uniform. Different columns can have different biases for retention.
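The per-column coin flip can be sketched in a few lines of Python. This is only an illustration: the function name `perturb_column`, its parameters, and the choice of a uniform replacing distribution over integers in [low, high] are assumptions of the sketch, not details from the talk.

```python
import random

def perturb_column(values, p, low, high, rng):
    """Retention replacement on one column.

    Each value is kept with probability p (heads); otherwise it is
    replaced by an integer drawn uniformly from [low, high] (tails).
    The uniform replacing distribution is an assumption of this sketch.
    """
    perturbed = []
    for v in values:
        if rng.random() < p:
            perturbed.append(v)                       # heads: retain
        else:
            perturbed.append(rng.randint(low, high))  # tails: replace
    return perturbed

rng = random.Random(0)  # fixed seed so the run is repeatable
original = [3, 1, 5, 2, 4, 4, 1]
print(perturb_column(original, p=0.2, low=1, high=5, rng=rng))
```

Calling the function once per column, each time with its own `p`, would correspond to giving different columns different retention biases.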

Talk Outline: Motivation; Problem Definition; Query Reconstruction (Inversion method: single attribute, multiple attributes; Iterative method); Privacy Guarantees; Experiments

Single Attribute Example. What is the fraction of people in this building with age 30-50? Assume age is between 0 and 100. Whenever a person enters the building, they flip a coin with bias p=0.2 for heads, say. Heads: report the true age. Tails: report a random number uniform in 0-100. In total, 100 randomized numbers are collected. Of these, 22 are in 30-50. How many among the original are in 30-50?

Analysis. Out of 100: 80 perturbed (0.8 fraction), 20 retained (0.2 fraction).

Analysis Contd. 20% of the 80 perturbed rows, i.e. 16 of them, satisfy Age[30-50]. The remaining 64 don't. Breakdown: 64 perturbed, NOT Age[30-50]; 16 perturbed, Age[30-50]; 20 retained.

Analysis Contd. Since there were 22 randomized rows in [30-50], 22 - 16 = 6 of them come from the 20 retained rows. Breakdown: 16 perturbed, Age[30-50]; 64 perturbed, NOT Age[30-50]; 6 retained, Age[30-50]; 14 retained, NOT Age[30-50].

Scaling up. Among the 20 retained rows, 6 have Age[30-50]; scaling to the total of 100 rows gives 30. Thus 30 people had age 30-50 in expectation.
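The arithmetic of this example can be written as a small function. The sketch below is illustrative only: the name `estimate_original_count` and its parameter names are assumptions, and the inputs are the numbers used in the example (100 rows, 22 observed in the range, retention probability 0.2, and 20% of replacement values falling in the range).

```python
def estimate_original_count(n, observed, p, b):
    """Estimate how many of the n original rows satisfy the predicate.

    n        -- total number of perturbed rows collected
    observed -- perturbed rows that satisfy the predicate
    p        -- retention probability
    b        -- probability that a replacement value satisfies the predicate
    """
    replaced_hits = (1 - p) * n * b            # expected replaced rows inside the range
    retained_hits = observed - replaced_hits   # hits attributed to retained rows
    return retained_hits / p                   # scale from the p*n retained rows to all n

# Numbers from the example: 100 rows, 22 observed in [30-50], p = 0.2, b = 0.2
estimate = estimate_original_count(n=100, observed=22, p=0.2, b=0.2)
print(round(estimate))
```

The three lines of the function body follow the three analysis steps: count the replaced rows expected in the range, subtract them from what was observed, and scale the remainder up.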

Multiple Attributes (k=2)
Query                 Estimated on T   Evaluated on T'
count(¬P_1 ∧ ¬P_2)    x_0              y_0
count(¬P_1 ∧ P_2)     x_1              y_1
count(P_1 ∧ ¬P_2)     x_2              y_2
count(P_1 ∧ P_2)      x_3              y_3

Architecture

Formally: select count(*) from R where P.
p = retention probability (0.2 in example).
1-p = probability that an element is replaced using the replacing p.d.f.
b = probability that an element drawn from the replacing p.d.f. satisfies predicate P (0.2 in example).
a = 1-b.

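Transition matrix
\[
\begin{pmatrix} \mathrm{Count}_T(\neg P) & \mathrm{Count}_T(P) \end{pmatrix}
\begin{pmatrix} (1-p)a + p & (1-p)b \\ (1-p)a & (1-p)b + p \end{pmatrix}
=
\begin{pmatrix} \mathrm{Count}_{T'}(\neg P) & \mathrm{Count}_{T'}(P) \end{pmatrix}
\]
i.e. solve $xA = y$. $A_{00}$ = probability that an original element satisfies $\neg P$ and after perturbation still satisfies $\neg P$. $p$ = probability it was retained; $(1-p)a$ = probability it was perturbed and satisfies $\neg P$. So $A_{00} = (1-p)a + p$.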
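As a sanity check, the single-attribute system xA = y can be solved directly for the age example. The sketch below is a hypothetical illustration in plain Python; the function names are assumptions, and state 0 stands for "does not satisfy P" while state 1 stands for "satisfies P".

```python
def transition_matrix(p, b):
    """2x2 matrix A with A[i][j] = P(perturbed state j | original state i).

    State 0 means "does not satisfy P", state 1 means "satisfies P".
    """
    a = 1 - b
    return [[(1 - p) * a + p, (1 - p) * b],
            [(1 - p) * a,     (1 - p) * b + p]]

def solve_row_system(A, y):
    """Solve x A = y for the row vector x using the explicit 2x2 inverse."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    x0 = (y[0] * A[1][1] - y[1] * A[1][0]) / det
    x1 = (y[1] * A[0][0] - y[0] * A[0][1]) / det
    return [x0, x1]

A = transition_matrix(p=0.2, b=0.2)
y = [78, 22]                      # perturbed counts: outside / inside [30-50]
x = solve_row_system(A, y)        # estimated original counts
print([round(v, 6) for v in x])
```

With the example's numbers the second component of `x` comes out at the 30 people found by the step-by-step analysis.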

Multiple Attributes. For $k$ attributes, $x$ and $y$ are vectors of size $2^k$, and $x = y A^{-1}$, where $A = A_1 \otimes A_2 \otimes \cdots \otimes A_k$ [tensor product].
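A sketch of the multi-attribute inversion in plain Python follows. It relies on the standard identity that the inverse of a tensor (Kronecker) product is the tensor product of the inverses, so only 2x2 inverses are needed. All function names, the second attribute's parameters, and the round-trip data are hypothetical choices for illustration.

```python
def transition_matrix(p, b):
    """Per-attribute 2x2 transition matrix (state 0 = not P, state 1 = P)."""
    a = 1 - b
    return [[(1 - p) * a + p, (1 - p) * b],
            [(1 - p) * a,     (1 - p) * b + p]]

def inverse_2x2(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

def kron(A, B):
    """Kronecker (tensor) product of two matrices given as nested lists."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def row_times_matrix(x, M):
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

def reconstruct(y, matrices):
    """Compute x = y A^{-1} with A the tensor product of the given matrices."""
    inv = [[1.0]]
    for A in matrices:
        inv = kron(inv, inverse_2x2(A))   # inverse of a product = product of inverses
    return row_times_matrix(y, inv)

# Round trip with made-up counts, ordered as
# (not P1, not P2), (not P1, P2), (P1, not P2), (P1, P2)
A1 = transition_matrix(p=0.2, b=0.2)
A2 = transition_matrix(p=0.5, b=0.3)
x_true = [40.0, 30.0, 20.0, 10.0]
y = row_times_matrix(x_true, kron(A1, A2))   # expected perturbed counts
x_est = reconstruct(y, [A1, A2])
print([round(v, 6) for v in x_est])
```

Building the inverse factor by factor avoids a general linear solver, though the vectors still have 2^k entries.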

Error Bounds. In our example, we want to say that when the estimated answer is 30, the actual answer lies in [28-32] with probability greater than 0.9. Given $T \to T'$ with $n$ rows, $f(T)$ is $(n, \epsilon, \delta)$ reconstructible by $g(T')$ if $|f(T) - g(T')| < \max(\epsilon, \epsilon f(T))$ with probability greater than $(1-\delta)$. In the above example, $\epsilon f(T) = 2$ and $\delta = 0.1$.

Results. The fraction $f$ of rows in [low, high] in the original table, estimated by matrix inversion on the table obtained after uniform perturbation, is an $(n, \epsilon, \delta)$ estimator for $f$ if $n > 4 \log(2/\delta)(p\epsilon)^{-2}$, by Chernoff bounds. The vector $x$ obtained by matrix inversion is the MLE (maximum likelihood estimator), shown by using the Lagrangian multiplier method and showing that the Hessian is negative.
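To get a feel for the row-count condition, the sketch below evaluates it numerically, reading the bound as n > 4 log(2/delta) / (p * eps)^2 with eps the error tolerance, delta the failure probability, and p the retention probability. Treating the logarithm as the natural log is an assumption of this sketch, as are the function name and the sample inputs.

```python
import math

def min_rows(p, eps, delta):
    """Smallest integer n with n > 4 * ln(2/delta) / (p * eps)**2.

    The natural logarithm is assumed here; the slide does not state the base.
    """
    bound = 4 * math.log(2 / delta) / (p * eps) ** 2
    return math.floor(bound) + 1

# Illustrative inputs only: retention 0.2, tolerance 0.1, failure probability 0.1
print(min_rows(p=0.2, eps=0.1, delta=0.1))
```

The bound grows as 1/p^2, so lowering the retention probability for more privacy raises the number of rows needed for the same accuracy.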

Talk Outline: Motivation; Problem Definition; Query Reconstruction (Inversion method; Iterative method); Privacy Guarantees; Experiments

Iterative Algorithm [AS00]
Initialize: $x^{0} = y$
Iterate [by application of Bayes rule]:
\[
x_p^{T+1} = \sum_{q=0}^{t} y_q \, \frac{a_{pq}\, x_p^{T}}{\sum_{r=0}^{t} a_{rq}\, x_r^{T}}
\]
Stop condition: two consecutive $x$ iterates do not differ much.
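A minimal Python sketch of this iteration is given below. The function name, the tolerance `tol`, the iteration cap `max_iter`, and the use of the largest componentwise change as the "do not differ much" test are all assumptions made for illustration.

```python
def iterative_reconstruct(y, A, tol=1e-10, max_iter=100000):
    """Bayes-rule iteration for the original counts x, given perturbed counts y.

    A[p][q] is the probability that original state p is observed as state q.
    """
    m = len(y)
    x = list(y)                                   # initialize x^0 = y
    for _ in range(max_iter):
        # denom[q] = sum_r a_rq * x_r: predicted mass of perturbed state q
        denom = [sum(A[r][q] * x[r] for r in range(m)) for q in range(m)]
        new_x = [
            sum(y[q] * A[p][q] * x[p] / denom[q] for q in range(m) if denom[q] > 0)
            for p in range(m)
        ]
        # stop when two consecutive iterates barely differ
        if max(abs(new_x[p] - x[p]) for p in range(m)) < tol:
            return new_x
        x = new_x
    return x

# Age example: p = 0.2, b = 0.2, perturbed counts 78 (outside) and 22 (inside)
A = [[0.84, 0.16],
     [0.64, 0.36]]
x = iterative_reconstruct([78.0, 22.0], A)
print([round(v, 4) for v in x])
```

Because each row of `A` sums to 1, every update keeps the total of `x` equal to the total of `y`, and nonnegative starting values stay nonnegative.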

Iterative Algorithm. RESULT [AA01]: The iterative algorithm gives the MLE with the additional constraint that $0 < x_i$, $\forall\, 0 < i < 2^k - 1$.

Talk Outline: Motivation; Problem Definition; Query Reconstruction; Privacy Guarantees; Experiments

Privacy Guarantees. Say we initially know with probability < 0.3 that Alice's age > 25. After seeing the perturbed value, we can say so with probability > 0.95. Then we say there is a (0.3, 0.95) privacy breach.

Privacy Guarantees. Let $X$, $Y$ be random variables where $X$ = original value and $Y$ = perturbed value. Let $Q$, $S$ be subsets of their domains.
$(\rho_1, \rho_2)$ privacy breach: a priori probability $P[X \in Q] = p_q \le \rho_1$ and posterior probability $P[X \in Q \mid Y \in S] \ge \rho_2$, where $0 < \rho_1 < \rho_2 < 1$ and $P[Y \in S] > 0$.
$(s, \rho_1, \rho_2)$ privacy breach: as above, where in addition $p_q / m_q < s$, i.e. $Q$ is a rare set ($m_q$ = probability of $Q$ under the replacing pdf).
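To make the prior and posterior in this definition concrete, the sketch below applies Bayes rule to retention replacement over a small finite domain, taking S to be a single observed value y. Everything named here (`posterior_in_Q`, `is_breach`, the dictionaries, the toy numbers) is a hypothetical illustration and not part of the talk; the two thresholds are written `rho1` and `rho2`.

```python
def posterior_in_Q(prior, replace_pdf, p, Q, y):
    """P[X in Q | Y = y] under retention replacement on a finite domain.

    prior[x]       -- assumed prior probability of original value x
    replace_pdf[v] -- probability that a replaced value equals v
    p              -- retention probability
    """
    def likelihood(x):
        # P[Y = y | X = x]: retained unchanged, or replaced by a draw equal to y
        return (p if x == y else 0.0) + (1 - p) * replace_pdf[y]

    total = sum(prior[x] * likelihood(x) for x in prior)
    in_q = sum(prior[x] * likelihood(x) for x in Q)
    return in_q / total

def is_breach(prior_q, posterior_q, rho1, rho2):
    """True if the prior is at most rho1 while the posterior is at least rho2."""
    return prior_q <= rho1 and posterior_q >= rho2

# Toy domain {0, 1}: value 1 is rare a priori, replacement is uniform
prior = {0: 0.9, 1: 0.1}
replace_pdf = {0: 0.5, 1: 0.5}
post = posterior_in_Q(prior, replace_pdf, p=0.2, Q={1}, y=1)
print(round(post, 4), is_breach(0.1, post, rho1=0.3, rho2=0.95))
```

In this toy case seeing y = 1 raises the belief that the original value was 1, but not nearly enough to cross a 0.95 threshold.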

$(s, \rho_1, \rho_2)$ vs $(\rho_1, \rho_2)$ metric. Provides more privacy to rare sets, e.g. in market basket data, medicines are rarer than bread, so we provide more privacy for medicines than for bread. For multiple columns, $s$ expresses correlations. Works for retention replacement perturbation on numeric attributes.

$(s, \rho_1, \rho_2)$ Guarantees. The median value of $s$ is 1. There is no $(s, \rho_1, \rho_2)$ privacy breach for $s < f(\rho_1, \rho_2, p)$ for retention replacement perturbation on single as well as multiple columns.

Application to Classification [AS00]. For the first split, to compute the split criterion / gini index, we need: Count(age[0-30] and class-var='-'); Count(age[0-30] and class-var='+'); Count(¬age[0-30] and class-var='-'); Count(¬age[0-30] and class-var='+').
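Given those four counts, the gini index of the candidate split follows from the standard textbook formula. The sketch below is a hypothetical illustration: the function names and sample counts are made up, and it assumes nonnegative counts (estimates produced by inversion may need clamping first).

```python
def gini(counts):
    """Gini impurity 1 - sum(f_c^2) of a node with the given class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(in_neg, in_pos, out_neg, out_pos):
    """Weighted gini index of the split on age[0-30] vs. not age[0-30].

    in_*  -- counts of class '-' and '+' with age in [0-30]
    out_* -- counts of class '-' and '+' with age outside [0-30]
    """
    left = in_neg + in_pos
    right = out_neg + out_pos
    n = left + right
    return (left / n) * gini([in_neg, in_pos]) + (right / n) * gini([out_neg, out_pos])

# Made-up counts for illustration
print(gini_split(in_neg=30, in_pos=10, out_neg=15, out_pos=45))
```

A split that separates the classes perfectly scores 0, so lower values indicate better candidate splits.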

Talk Outline: Motivation; Problem Definition; Query Reconstruction; Privacy Guarantees; Experiments

Experiments. Real data: Census data from the UCI Machine Learning Repository. Synthetic data: generated multiple columns of Zipfian data; the number of rows was varied from 1000 upward. Error metric: $l_1$ norm of the difference between $x$ and $y$, e.g. for 1-dim queries $|x_1 - y_1| + |x_0 - y_0|$.
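The error metric is simple enough to state as code. In the sketch below the function name and the sample vectors are illustrative assumptions; the vectors are written as fractions of rows rather than raw counts.

```python
def l1_error(x, y):
    """l1 norm of the difference between two equally long vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))

# 1-dim query, vectors given as fractions of rows (illustrative values)
print(l1_error([0.70, 0.30], [0.78, 0.22]))
```

When both vectors hold nonnegative fractions that each sum to 1, this distance cannot exceed 2.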

Inversion vs Iterative Reconstruction. [Figure panels: 2 attributes, Census Data; 3 attributes, Census Data.] The iterative algorithm (MLE on constrained space) outperforms inversion (global MLE).

Privacy Obtained. [Figure: privacy as a function of retention probability on 3 attributes of census data.]

Error vs Number of Columns: Census Data. [Figure panels: Inversion Algorithm; Iterative Algorithm.] Error increases exponentially with the number of columns.

Error as a Function of Number of Rows. Error has a square root n dependence on the number of rows.

Conclusion. It is possible to run OLAP on data across multiple servers so that probabilistically approximate answers are obtained and data privacy is maintained. The techniques have been tested experimentally on real and synthetic data; more experiments are in the paper. PRIVACY PRESERVING OLAP is PRACTICAL.

References
[AS00] Agrawal, Srikant: Privacy Preserving Data Mining
[AA01] Agrawal, Aggarwal: On the Quantification of…
[W65] Randomized Response..
[EGS] Evfimievski, Gehrke, Srikant: Limiting Privacy Breaches..
Others in the paper.

Error vs Number of Columns: Iterative Algorithm: Zipf Data. The error in the iterative algorithm flattens out, as its maximum value is bounded by 2.

Supported by Privacy Group at Stanford: Rajeev and Hector