Privacy Preserving Market Basket Data Analysis Ling Guo, Songtao Guo, Xintao Wu University of North Carolina at Charlotte.

Presentation transcript:

Privacy Preserving Market Basket Data Analysis Ling Guo, Songtao Guo, Xintao Wu University of North Carolina at Charlotte

2 Market Basket Data
TID   milk  sugar  bread  …  cereals
1     1     0      1      …  …
…     …     …      …      …  …
N     0     1      1      …  0
1: presence, 0: absence
- Association rule (R. Agrawal, SIGMOD 1993): A => B, with support and confidence
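
As a minimal sketch of the two rule measures named on this slide, the snippet below computes support and confidence on a binary basket matrix. The toy data and the helper names `support` and `confidence` are hypothetical; they are not taken from the slides.

```python
import numpy as np

# Hypothetical toy basket matrix: rows are transactions, columns are items.
ITEMS = ["milk", "sugar", "bread", "cereals"]
BASKETS = np.array([
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
])

def support(data, cols):
    """Fraction of transactions that contain every item in `cols`."""
    return float(np.mean(np.all(data[:, cols] == 1, axis=1)))

def confidence(data, antecedent, consequent):
    """support(antecedent and consequent) divided by support(antecedent)."""
    return support(data, antecedent + consequent) / support(data, antecedent)

milk, cereals = ITEMS.index("milk"), ITEMS.index("cereals")
s = support(BASKETS, [milk, cereals])        # support of milk => cereals
c = confidence(BASKETS, [milk], [cereals])   # confidence of milk => cereals
print(f"support={s:.3f} confidence={c:.3f}")
```

Both measures are simple proportions of the cell counts, which is why the later slides can treat them as functions of the joint distribution of the items.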

3 Other measures
- 2 x 2 contingency table
- Objective measures for A => B

4 Related Work
Privacy preserving association rule mining:
- Data swapping
- Frequent itemset or rule hiding
- Inverse frequent itemset mining
- Item randomization

5 Item Randomization
Original data:
TID   milk  sugar  bread  …  cereals
…
N     0     1      1      …  0
Randomized data:
TID   milk  sugar  bread  …  cereals
…
N     1     1      0      …  1
- To what extent does randomization affect mining results? (Focus)
- To what extent does it protect privacy?

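6 Randomized Response (Stanley Warner, JASA 1965)
Purpose: get the proportion ($\pi_A$) of population members that cheated in the exam.
- $A$: cheated in the exam
- $\bar{A}$: didn't cheat in the exam
Procedure: a randomization device presents each respondent with one of two questions:
- "Do you belong to $A$?" with probability $p$
- "Do you belong to $\bar{A}$?" with probability $1-p$
The respondent answers "Yes" or "No" without revealing which question was asked.
As $\lambda = P(\text{"Yes"}) = \pi_A p + (1-\pi_A)(1-p)$, the unbiased estimate of $\pi_A$ is $\hat{\pi}_A = \frac{\hat{\lambda} - (1-p)}{2p-1}$, for $p \neq 1/2$.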
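
The procedure can be sketched as a small simulation. This is an illustration under stated assumptions, not the authors' code: the true proportion, the probability p, the sample size, and the function names are all hypothetical, and the estimator is the standard Warner form (observed "yes" rate minus 1-p, divided by 2p-1).

```python
import numpy as np

def warner_responses(truth, p, rng):
    """Simulate answers: with probability p the device asks 'do you belong
    to A?', otherwise it asks the complementary question. 1 means 'yes'."""
    ask_a = rng.random(truth.shape[0]) < p
    return np.where(ask_a, truth, 1 - truth)

def warner_estimate(responses, p):
    """Unbiased estimate of the proportion in A; requires p != 0.5."""
    lam_hat = responses.mean()            # observed proportion of 'yes'
    return (lam_hat - (1 - p)) / (2 * p - 1)

rng = np.random.default_rng(0)
true_pi, p, n = 0.3, 0.8, 200_000         # hypothetical values
truth = (rng.random(n) < true_pi).astype(int)
answers = warner_responses(truth, p, rng)
pi_hat = warner_estimate(answers, p)
print(f"estimated proportion: {pi_hat:.4f}")
```

The closer p is to 0.5, the stronger the privacy protection and the larger the variance of the estimate, since the denominator 2p-1 shrinks toward zero.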

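7 Application of RR in MBD
- RR can be expressed in matrix form as $\lambda = P\pi$ (0: No, 1: Yes), where $P$ is the distortion matrix.
- Extension to multiple variables, e.g., for 2 variables: $\lambda = (P_1 \otimes P_2)\pi$, where $\otimes$ stands for the Kronecker product.
- The unbiased estimate of $\pi$ is $\hat{\pi} = P^{-1}\hat{\lambda} = (P_1^{-1} \otimes P_2^{-1})\hat{\lambda}$.
- …: diagonal matrix with elements …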

8 Analysis
The dispersion matrix of the estimate has two components:
- the dispersion matrix of the regular survey estimation
- a nonnegative definite matrix that represents the components of dispersion associated with the RR experiment
- …: diagonal matrix with elements …

9 Kronecker Product Example
[Worked matrices not preserved in the transcript]
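
As an illustration of the Kronecker product used to combine per-item distortion matrices, the sketch below builds a joint 4 x 4 matrix from two 2 x 2 matrices. The symmetric form of the matrices, the retention probabilities 0.9 and 0.8, and the helper name `distortion` are assumptions made for this example; the slide's own matrices are not available.

```python
import numpy as np

def distortion(p):
    """2 x 2 distortion matrix that keeps an item's value with probability p
    (symmetric form, assumed for illustration)."""
    return np.array([[p, 1 - p],
                     [1 - p, p]])

P1, P2 = distortion(0.9), distortion(0.8)   # hypothetical retention probabilities
P = np.kron(P1, P2)                         # 4 x 4 joint distortion matrix

# Each column of P is a probability distribution over the randomized outcomes.
column_sums = P.sum(axis=0)

# The inverse of a Kronecker product is the Kronecker product of the inverses,
# so the joint matrix never has to be inverted directly.
P_inv = np.kron(np.linalg.inv(P1), np.linalg.inv(P2))
print(P)
```

Inverting the small per-item matrices and combining them keeps the cost low when more than two items are randomized independently.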

10 Randomization example
A: Milk, B: Cereals
Original data (data owners):
TID   milk  sugar  bread  …  cereals
…
N     0     1      1      …  0
$\pi = (0.415, 0.043, 0.183, 0.359)'$
Randomized data after RR (data miners):
TID   milk  sugar  bread  …  cereals
…
N     0     1      0      …  1
$\lambda = (0.368, 0.097, 0.218, 0.316)'$
Estimate: $\hat{\pi} = (0.427, 0.031, 0.181, 0.362)'$
We can get the estimate, but how accurate is it?
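
The estimate on this slide can be reproduced approximately from the randomized vector. The sketch assumes a symmetric distortion matrix with p = 0.9 for both items and the cell order (00, 01, 10, 11); neither is stated in the transcript, and p = 0.9 is chosen here only because it matches the slide's numbers to within rounding.

```python
import numpy as np

p = 0.9  # assumption: the retention probability is not given on the slide
P1 = np.array([[p, 1 - p],
               [1 - p, p]])
P = np.kron(P1, P1)  # joint distortion matrix for (milk, cereals)

lam_observed = np.array([0.368, 0.097, 0.218, 0.316])  # randomized data
pi_slide = np.array([0.427, 0.031, 0.181, 0.362])      # estimate shown on the slide

# Solve P @ pi_hat = lam_observed rather than forming the inverse explicitly.
pi_hat = np.linalg.solve(P, lam_observed)
max_gap = float(np.max(np.abs(pi_hat - pi_slide)))
print(np.round(pi_hat, 3), f"max gap to slide values: {max_gap:.4f}")
```

The remaining gap comes from the three-decimal rounding of the randomized vector, which sums to 0.999 rather than 1.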

11 Motivation
[Figure: estimated values vs. original values with lower and upper bounds; legend: frequent set, not frequent set]
- Rule 6 is falsely recognized from the estimated value!
- Frequent set with high confidence vs. frequent set without confidence: both are frequent sets

12 Accuracy on Support S
- Estimate of support
- Variance of support
- Interquantile range (normal dist.)

13 Accuracy on Confidence C
- Estimate of confidence for A => B
- Variance of confidence
- Interquantile range (ratio dist. is F(w))
- Loose range derived from Chebyshev's theorem: let $X$ be a random variable with expected value $\mu$ and finite variance $\sigma^2$. Then for any real $k > 0$, $\Pr(|X - \mu| \geq k\sigma) \leq 1/k^2$.
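
To see how loose the Chebyshev range is compared with a normal-based range, the sketch below computes both for the same estimate and variance. Setting 1/k^2 equal to alpha gives k = 1/sqrt(alpha) for the Chebyshev case. The estimate, variance, and alpha are hypothetical values, and the function names are introduced only for this example.

```python
import math
from statistics import NormalDist

def chebyshev_interval(estimate, variance, alpha):
    """Distribution-free range holding with probability at least 1 - alpha."""
    half = math.sqrt(variance) / math.sqrt(alpha)
    return estimate - half, estimate + half

def normal_interval(estimate, variance, alpha):
    """Range under the assumption that the estimator is normally distributed."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * math.sqrt(variance)
    return estimate - half, estimate + half

estimate, variance, alpha = 0.362, 1e-4, 0.05   # hypothetical values
cheb = chebyshev_interval(estimate, variance, alpha)
norm = normal_interval(estimate, variance, alpha)
print("Chebyshev:", cheb)
print("Normal:   ", norm)
```

At alpha = 0.05 the Chebyshev half-width is about 4.47 standard deviations against about 1.96 for the normal case, which is the price of making no distributional assumption.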

14 Bounds of other measures Accuracy Bounds

15 General Framework
- Step 1: Estimation
  - Express the measure as a function of the observed variables (or their marginal totals).
  - Compute the estimated measure value.
- Step 2: Variance of the estimated measure
  - Get the variance of the estimated measure (a function of multiple known variables) through Taylor approximation.
- Step 3: Derive the interquantile range through Chebyshev's theorem.
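
The three steps can be sketched for one measure, the confidence of A => B. This is a hedged illustration, not the authors' derivation: it assumes a symmetric distortion matrix with p = 0.9, cell order (00, 01, 10, 11), a hypothetical n = 5000 transactions, and a multinomial model for the randomized counts, so that the covariance of the estimate is the inverse distortion matrix applied on both sides of the multinomial covariance. All function names are hypothetical.

```python
import math
import numpy as np

def estimate_pi(lam_hat, P):
    """Step 1a: recover the estimated original distribution."""
    return np.linalg.solve(P, lam_hat)

def cov_pi_hat(lam_hat, P, n):
    """Covariance of pi_hat = P^{-1} lam_hat, with multinomial randomized counts."""
    cov_lam = (np.diag(lam_hat) - np.outer(lam_hat, lam_hat)) / n
    P_inv = np.linalg.inv(P)
    return P_inv @ cov_lam @ P_inv.T

def confidence_and_gradient(pi):
    """Step 1b: confidence of A => B and its gradient w.r.t. the four cells."""
    p10, p11 = pi[2], pi[3]
    total = p10 + p11
    grad = np.array([0.0, 0.0, -p11 / total**2, p10 / total**2])
    return p11 / total, grad

p, n, alpha = 0.9, 5000, 0.05            # assumptions, not from the slides
P1 = np.array([[p, 1 - p], [1 - p, p]])
P = np.kron(P1, P1)
lam_hat = np.array([0.368, 0.097, 0.218, 0.316])

pi_hat = estimate_pi(lam_hat, P)                            # Step 1
conf, grad = confidence_and_gradient(pi_hat)
var_conf = float(grad @ cov_pi_hat(lam_hat, P, n) @ grad)   # Step 2: first-order Taylor
half = math.sqrt(var_conf / alpha)                          # Step 3: Chebyshev, k = 1/sqrt(alpha)
lower, upper = conf - half, conf + half
print(f"confidence={conf:.3f} range=({lower:.3f}, {upper:.3f})")
```

Other measures fit the same pattern by swapping in a different function and gradient in step 1.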

16 Example for a measure with two variables
- Step 1: Get the estimate of the measure.
- Step 2: Get the variance of the estimated measure.
- Step 3: Derive the interquantile range through Chebyshev's theorem.

17 Accuracy Bounds
With an unknown distribution, Chebyshev's theorem gives only loose bounds.
[Figure: bounds of the support vs. varying p]

18 Distortion
- All the above discussions assume the distortion matrices P are known to data miners.
- P could be exploited by attackers to improve the posterior probability of their predictions on sensitive items.
- How about not releasing P?
  - Disclosure risk is decreased.
  - Data mining result?

19 Unknown distortion P
Measures considered: correlation ($\phi$), mutual information ($M$), likelihood ratio ($G^2$), Pearson statistic ($\chi^2$)
- Some measures have monotonic properties.
- Other measures don't have such properties.

20 Applications: hypothesis test
- From the randomized data, if we discover an itemset whose test statistic exceeds the critical value, we can guarantee that dependence exists among the original itemset, since the statistic on the original data is at least as large as on the randomized data.
- We are still able to derive the strongly dependent itemsets from the randomized data.
- No false positives.
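
The no-false-positive argument can be checked numerically for the Pearson statistic. This sketch assumes symmetric distortion matrices with one retention probability p for both items (the slides do not give P), uses exact expected cell probabilities rather than a sampled data set, takes the original distribution from slide 10, and uses a hypothetical n = 5000. The function names are introduced here for illustration.

```python
import numpy as np

def distortion(p):
    """Symmetric 2 x 2 distortion matrix (assumed form)."""
    return np.array([[p, 1 - p], [1 - p, p]])

def pearson_chi2(joint, n):
    """Pearson chi-square for a 2 x 2 table of cell probabilities ordered
    (00, 01, 10, 11), scaled to n transactions."""
    table = np.asarray(joint, dtype=float).reshape(2, 2)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0))
    return float(n * np.sum((table - expected) ** 2 / expected))

n = 5000                                          # hypothetical sample size
pi = np.array([0.415, 0.043, 0.183, 0.359])       # original distribution (slide 10)
chi2_original = pearson_chi2(pi, n)

chi2_randomized = {}
for p in (0.6, 0.7, 0.8, 0.9, 1.0):
    lam = np.kron(distortion(p), distortion(p)) @ pi   # expected randomized distribution
    chi2_randomized[p] = pearson_chi2(lam, n)

CRITICAL_95 = 3.841   # chi-square critical value, 1 degree of freedom, alpha = 0.05
for p, value in chi2_randomized.items():
    print(f"p={p:.1f} chi2={value:10.2f} significant={value >= CRITICAL_95}")
```

In this setting the randomized statistic never exceeds the original one, so an itemset that passes the test on randomized data would also pass it on the original data. The converse does not hold: strong randomization can hide a dependence that exists in the original data.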

21 Conclusion
- Proposed a general approach to deriving accuracy bounds of various measures adopted in MBD analysis.
- Proved that some measures have a monotonic property, so some data mining tasks can be conducted directly on randomized data (without knowing the distortion). No false positive patterns exist in the mining result.

22 Future Work
- Which measures are more sensitive to randomization?
- The tradeoff between the privacy of individual data and the accuracy of data mining results
- Accuracy vs. disclosure analysis for general categorical data

23 Acknowledgement NSF IIS Ph.D. students Ling Guo Songtao Guo

24 Q & A