Measures of Disclosure Risk and Harm. Diane Lambert, Journal of Official Statistics, 9 (1993), pp. 313-331. Jim Lynch.

Presentation transcript:

1 Measures of Disclosure Risk and Harm. Diane Lambert, Journal of Official Statistics, 9 (1993), pp. 313-331. Jim Lynch, NISS/SAMSI & University of South Carolina

2 Measures of Disclosure Risk and Harm Outline: Introduction and Discussion (Section 7); What is Disclosure?; Risk of Perceived Identification; Modeling the Intruder; Risk of True Identification; Disclosure Harm.

3 Discussion (Section 7) It is the intruder, and not the structure of the data alone, that controls disclosure. When the intruder is sure enough that a released record belongs to a respondent –There is a reidentification. –It may be incorrect, but the intruder perceives there to be a reidentification.

4 Discussion (Section 7) The risk of perceived disclosure and the risk of true disclosure cannot be measured without considering the seriousness of the threat posed by the intruder's strategy. The harm that follows from a reidentification –Depends on the attributes, if any, that the intruder infers about the target. –Cannot be measured without considering the strategy that the intruder uses to infer sensitive attributes. Once the intruder's strategy is modeled, disclosure risk and harm can be evaluated: risk is measured in terms of probabilities, and harm is measured in losses or costs.

5 Discussion (Section 7) All the agency can do to reduce disclosure risk or harm is –to mask the data before release, –or carefully select the individuals and organizations that are given the data, or both. The models developed here imply that masking and releasing only a subset of records does not necessarily protect against disclosure. Masking may lower the risk of true reidentification. –But it may also lead to false reidentifications and false inferences about attributes. –The fact that inferred attributes may be wrong may be little comfort to the respondent whose record is reidentified.

6 Discussion (Section 7) Masking also complicates data analysis –An agency cannot be expected to predict and minimize all the effects of masking on all the analyses of interest. –Nor is it reasonable to expect the data analyst to describe how the data will be analyzed before the data are obtained so that the agency can verify that the conclusions will be the same for the masked data as they would have been for the original data. –Future masking techniques may preserve more general features of the data, but for now data masked enough to preserve confidentiality can be a challenge to analyze appropriately.

7 Discussion (Section 7) It does seem reasonable to put some of the burden for protecting confidentiality on the researcher. –Institutions and researchers have to abide by all sorts of conditions in experiments involving humans. –The experience in those and other areas ought to provide some guidance on protecting respondents in agency databases from unscrupulous intruders. –Placing some burden on the researcher would not necessarily remove the need for some masking, but it might reduce the need for heroic masking that severely limits the usefulness of the data. See “Confidentiality issues for medical data miners,” Jules J. Berman, Pathology Informatics Cancer Diagnosis Program, DCTD, NCI, NIH.

8 Discussion (Section 7) One could argue that models of disclosure are hopeless because the issues are too complex and the intruder too mysterious. This paper, though, argues that models of disclosure are indispensable. –They force definitions and assumptions to be stated explicitly. –When the assumptions are realistic, models of disclosure can lead to practical measures of disclosure risk and harm.

9 What is Disclosure? Key Attributes –Useful for identification but usually not sensitive. –E.g., age, location, marital status, and profession. Sensitive Attributes –E.g., disease, debts, credit rating. Scenario: A sample of records is released. –Obvious identifiers are removed. –Some attributes, such as marital status, are left intact. –Others are modified to protect confidentiality: incomes may be truncated, professions grouped more coarsely, ages swapped between pairs of records, and some attributes on some records may be missing or imputed.

10 What is Disclosure? Two major types of disclosure –Identification or Re-identification: equivalent to inadvertent release of an identifiable record. –Attribute Disclosure: occurs when the intruder believes something new has been learned about the respondent. It may occur with or without re-identification. E.g., the intruder may narrow the list of possible target records to two with nearly the same value of a sensitive attribute. Then the attribute is disclosed although the target record is not located. Or two records may be averaged so the released record belongs to no one, yet the debt on the averaged record may disclose something about the debt carried by the targeted individual. The agency must decide whether attribute disclosures without identifications are important.

11 What is Disclosure? The paper considers only disclosures that involve reidentifications, NOT attribute disclosures without reidentifications. Attribute disclosures that result from reidentification are considered to the extent that they harm the respondent. In this paper, the risk of disclosure is the risk of reidentifying a released record, and the harm from disclosure depends on what is learned from the identification.

12 What is Disclosure? Attribute disclosures that do not involve identification are ignored. This assumes that all intruders first look for the record that is most likely to be correct and then take information about the targeted attribute from that record. Intruders with other strategies are ignored.

13 What is Disclosure? The definition includes –true and false reidentifications and –true and false attribute disclosures. –Correct and incorrect inferences can be distinguished if desired (as happens with measures of harm). It distinguishes between –true identification and true attribute disclosure and –perceived identification and perceived attribute disclosure (the intruder believes the information is correct). The former pair applies when correct inferences are to be prevented, the latter when perceived inferences are to be prevented.

14 The Risk of Perceived Identification Basic Premise: Disclosure is limited only to the extent that the intruder is discouraged from making any inferences, correct or incorrect, about a particular target respondent.

15 The Risk of Perceived Identification Format (similar to Jerry’s last time) –Population of $N$ records denoted $Z$. –A random sample of $n$ masked records $X=(x_1,\dots,x_n)$ with $k$ attributes. –Masking suppresses attributes in $Z$, adds random noise, truncates outliers, or swaps values of an attribute between records. Knowing this, which, if any, record in the released file should be linked to the target respondent’s record $Y$?

16 The Risk of Perceived Identification A rational intruder has two options. –1. Decide that one of the released records belongs to the target respondent (i.e., link the $i$th released record $x_i$ to the target record $Y$). –2. Decide not to link any released record to $Y$ (the null link), perhaps because none of the released records is close enough to what the intruder expects for $Y$ or perhaps because too many released records are close to what the intruder expects for $Y$. A rational intruder chooses the link (nonnull or null) believed most likely to be correct whenever any incorrect choice incurs the same positive loss and a correct link incurs no loss. (See Duncan and Lambert (1989) for details.)

17 The Risk of Perceived Identification

18 The Risk of Perceived Identification Other Measures

19 Modeling the Intruder Example 4.1 – Population of 2 records: $N=n=2$. One continuous attribute. The intruder makes judgments about $M(Y)$, the masked version of target $Y$. A series of judgments leads the intruder to model the prior on $M(Y)$ as lognormal(0,1) (prior denoted $f_1(x)$). Information about the other respondent, $Y'$, is modeled as $M(Y') \sim$ lognormal(2,1), denoted $f_2(x)$. $E(M(Y))=1.65$ and $E(M(Y'))=12.2$. Released data is $X=(7,20)$.

20 Modeling the Intruder Example 4.1 – A “Posterior Calculation”: $p_1 = P(M(Y)=7 \mid X=(7,20)) = P(M(Y')=20 \mid X=(7,20)) = f_1(7)f_2(20)/[f_1(7)f_2(20)+f_2(7)f_1(20)] = 0.89$. In the original population $Y=Z_1<Z_2=Y'$; $p_1$ is just the probability that the order is preserved in the released data after masking. The terms “prior” and “posterior” do not imply that this is Bayesian; they just model the masking. If masking techniques require order to be preserved, then $p_1=1$ and the joint distribution of $M(Y)$ and $M(Y')$ is not $f_1 f_2$.

21 Modeling the Intruder Example 4.1 Suppose only one record is released and it is $x=7$. Then $p_1 = P(M(Y) \text{ is selected and } M(Y)=7 \mid X=(7)) = 0.5\,f_1(7)/[0.5\,f_1(7)+0.5\,f_2(7)] = 0.13$. In this case, $D(X)=\max(0.13, 0.87)=0.87$.
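The two Example 4.1 calculations can be checked numerically. The following is a minimal sketch, assuming only the lognormal(0,1) and lognormal(2,1) densities stated on the slides; the helper name `lognormal_pdf` and the variable names are illustrative and do not come from Lambert's paper.

```python
import math

def lognormal_pdf(x, mu, sigma=1.0):
    """Density at x > 0 of a variable whose log is Normal(mu, sigma^2)."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

def f1(x):
    # Intruder's model for the masked target M(Y): lognormal(0, 1)
    return lognormal_pdf(x, 0.0)

def f2(x):
    # Intruder's model for the other masked record M(Y'): lognormal(2, 1)
    return lognormal_pdf(x, 2.0)

# Both records released, X = (7, 20): is 7 the target's record?
order_kept = f1(7) * f2(20)
order_swapped = f2(7) * f1(20)
p1_both = order_kept / (order_kept + order_swapped)

# Only one record released, x = 7, each record equally likely to be selected.
p1_single = 0.5 * f1(7) / (0.5 * f1(7) + 0.5 * f2(7))
d_single = max(p1_single, 1.0 - p1_single)

print(round(p1_both, 2), round(p1_single, 2), round(d_single, 2))
# prints: 0.89 0.13 0.87
```

With two records the factors of 1/x cancel in the ratio, so only the log-scale distances of 7 and 20 from the two prior centers matter.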

22 Modeling the Intruder Example 4.2 – $n$ of $N$ records. The intruder believes that the $i$th record in population $Z$ will appear as $M_i=M(Y_i) \sim f_i(x)$. The probability that the $n$th released record belongs to target $Y_1$ is $p_n = P(Y_1 \text{ is sampled and } M_1=x_n \mid X) = P(x_n \text{ is sampled from } f_1 \text{ and } x_1,\dots,x_{n-1} \text{ are sampled from } f_2,\dots,f_N) \,/\, P(x_1,\dots,x_n \text{ are sampled from } f_1,\dots,f_N)$.

23 Modeling the Intruder Example n=2 of N=3 records Non-unique Records

24 Example n=1 of N=2 records Unknown respondents may be reidentifiable. Intruder’s priors on $Z$ –$Y_1 \sim \mathrm{Unif}[-4,4]$, $Y_2 \sim N(0,1)$, $x_1=-2.25$.
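The slide gives the priors and the released value but not the resulting probability. A rough sketch of how it could be computed is below. It assumes, as in the single-record case of Example 4.1, that each of the two records is equally likely to be the one released and that the stated priors already describe the released value; both assumptions are mine and are not stated on the slide.

```python
import math

def uniform_pdf(x, lo, hi):
    """Density of Unif[lo, hi] at x."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of Normal(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

x1 = -2.25
f1 = uniform_pdf(x1, -4.0, 4.0)   # diffuse prior for Y1
f2 = normal_pdf(x1)               # sharper prior for Y2

# Assumed: each record is released with probability 0.5.
p_y1 = 0.5 * f1 / (0.5 * f1 + 0.5 * f2)
print(round(p_y1, 3))
```

Under these assumptions the released value is far enough into the tail of the N(0,1) prior that the record with the diffuse prior is the more likely source, which is consistent with the slide's point that a respondent the intruder knows little about may still be reidentifiable.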

25 Example n=10 of N=100 records Sampling by itself need not protect confidentiality. The target is thought to be the smallest in the population. The priors: $Y_1 \sim \mathrm{LogN}(0, 0.5)$; $Y_2,\dots,Y_{100}$ iid $\sim \mathrm{LogN}(2, 0.5)$. Masking is iid multiplicative $\mathrm{LogN}(0, 0.5)$. Uncertainty in the released records (masking + intruder prior, with the log-scale variances of 0.5 adding to 1): $M_1 \sim \mathrm{LogN}(0,1)=f_1$; $M_2,\dots,M_{100}$ iid $\sim \mathrm{LogN}(2,1)=f_2$. $X=(0.05, 0.14, 1.5, 2.4, 3.2, 3.8, 4.6, 8.7, 10.3, 10.7)$.

26 Example n=10 of N=100 records Sampling by itself need not protect confidentiality

27 Example n=10 of N=100 records Sampling by itself need not protect confidentiality. Values of $P_{j1}(X)$.
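The table of probabilities for this example did not survive in the transcript. The sketch below shows one way such values could be computed from the Example 4.2 formula. It assumes simple random sampling of 10 records from 100, so the target is in the sample with probability n/N and, if sampled, is equally likely to sit in any position; that sampling assumption and all names are mine, so the numbers need not match the paper's table exactly.

```python
import math

def lognormal_pdf(x, mu, sigma=1.0):
    """Density at x > 0 of a variable whose log is Normal(mu, sigma^2)."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

N, n = 100, 10
X = [0.05, 0.14, 1.5, 2.4, 3.2, 3.8, 4.6, 8.7, 10.3, 10.7]

# Likelihood ratio f1/f2 for each released value: how much better the
# target's model (LogN(0,1)) explains it than the common model (LogN(2,1)).
ratios = [lognormal_pdf(x, 0.0) / lognormal_pdf(x, 2.0) for x in X]

# After dividing numerator and denominator by prod_i f2(x_i):
#   P(record j is the target | X) = (1/N) r_j / [(1 - n/N) + (1/N) sum_i r_i]
denom = (1.0 - n / N) + sum(ratios) / N
link_probs = [r / N / denom for r in ratios]
null_prob = (1.0 - n / N) / denom   # target was not sampled at all

for x, p in zip(X, link_probs):
    print(f"x = {x:5.2f}  P(target) = {p:.3f}")
print(f"P(target not released) = {null_prob:.3f}")
```

Under these assumptions the smallest released value receives by far the largest probability even though only 10% of the population was sampled, which illustrates the slide's claim that sampling alone need not protect the target.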

28 Risk of True Identification The agency cannot control the intruder's perceptions and actions once the data are released. All it can do is count the number of true identifications for an intruder with a given set of beliefs about the target and source file. A reasonable measure of the risk of true identification, then, is simply the fraction of released records (or number of released records) that an intruder can correctly reidentify.

29 Risk of True Identification Distinguishes “Risk of Matching” (Spruill, ) from “Risk of True Identification.” (Risk of Matching is the proportion of masked records whose closest source records are the actual source records that generated them.) To illustrate Risk of True Identification, consider the following example, where N is large and n is small so that we can calculate using sampling with replacement.

30 Risk of True Identification

31 Risk of True Identification

32 Risk of True Identification Risk of True Identification is low (zero if 0.078 is too low to link). Look down columns 15 and 30. Risk of matching is not zero for both records? Look across rows. 32 is matched with 30, which is incorrect, but 35 is matched with 30? (Why not 40.7?) Claimed risk of matching is ½? Risk of perceived re-identification? Look down all columns. If 1 minus the sum of a column is more than the maximum of the column, the intruder is wasting their time. This is an assumption about the intruder: their rational decision is that the record for that column has not been released. In this example, this is true for all the columns.
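The column rule described on this slide can be written as a small decision function. This is a sketch only: the function name, the tie-breaking choice, and the example probabilities (apart from the 0.078 mentioned on the slide) are hypothetical, since the table itself is not in the transcript.

```python
def choose_link(link_probs):
    """Pick the intruder's most probable link for one target.

    link_probs[i] is the intruder's probability that released record i
    belongs to the target (one column of the table). The remaining mass,
    1 - sum(link_probs), is the probability of the null link, i.e. that
    the target's record was not released.
    Returns (index, probability); index is None for the null link.
    """
    null_prob = 1.0 - sum(link_probs)
    if not link_probs:
        return None, null_prob
    best = max(range(len(link_probs)), key=lambda i: link_probs[i])
    # Ties go to the null link (an arbitrary choice for this sketch).
    if link_probs[best] > null_prob:
        return best, link_probs[best]
    return None, null_prob

# Hypothetical column in which no released record is convincing.
weak_choice = choose_link([0.078, 0.05])
# Hypothetical column dominated by the first released record.
strong_choice = choose_link([0.86, 0.11])
print(weak_choice, strong_choice)
```

A perceived identification occurs only when the function returns an index rather than `None`; whether that link is also a true identification depends on which source record actually generated the released record.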

33 Disclosure Harm Considers only harm to the respondent (not to agencies, researchers, etc.) whose released record has been reidentified or is perceived to have been reidentified. Scenario –Masked data $X=(x_1,\dots,x_n)$ is released, where $x_i=(x_{i1},\dots,x_{ik})$ and $x_{i1}$ is a binary attribute of interest. Assume that the target record is $Y_1$ and that the intruder has linked $Y_1$ to $x_1$. Let $x_{-11}=(x_{12},\dots,x_{1k})$ and $X_{-1}=(x_{-11},x_2,\dots,x_n)$. –Because of masking the intruder believes, independent of everything else, that $x_{11}=Y_{11}$ with probability $q$ and $x_{11}=1-Y_{11}$ with probability $1-q$.
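The slides that follow this scenario are blank in the transcript, so the intruder's actual inference is not shown. As a hedged illustration of how the masking probability $q$ would enter such an inference, the sketch below applies Bayes' rule to the binary attribute. The prior `prior_one` for $P(Y_{11}=1)$ is my assumption (on the logistic-regression slide it would presumably come from a model fitted to the other attributes); it is not given in the source.

```python
def posterior_attribute(x11, q, prior_one):
    """P(Y11 = 1 | released value x11) under the slide's masking model.

    x11       : released binary value (0 or 1)
    q         : probability that masking left the attribute unchanged
    prior_one : intruder's prior P(Y11 = 1) (hypothetical input)
    """
    like_if_one = q if x11 == 1 else 1.0 - q    # P(x11 | Y11 = 1)
    like_if_zero = 1.0 - q if x11 == 1 else q   # P(x11 | Y11 = 0)
    numerator = like_if_one * prior_one
    return numerator / (numerator + like_if_zero * (1.0 - prior_one))

# Hypothetical numbers: 90% chance the value is unmasked, prior of 0.2.
post = posterior_attribute(x11=1, q=0.9, prior_one=0.2)
print(round(post, 3))
```

With $q=0.5$ the released value carries no information and the posterior equals the prior, which shows how heavier masking weakens the intruder's attribute inference.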

34 Disclosure Harm

35 Disclosure Harm-Logistic Regression

36 Disclosure Harm-Measures of Harm Harm $H(Y_{11},X)$ is a variable that takes on various values depending on the action that the intruder takes based on their strategy. These values are losses and are –0 if the record is not identified –$c_{FN}$ if re-identification is incorrect and $Y_{11}$ is not inferred –$c_{TN}$ if re-identification is correct and $Y_{11}$ is not inferred –and

37 Disclosure Harm Some Possibly Delusionary Closing Comments Think of the source data, Y, as the parameter. The released data, X, is the sample. This is somewhat like a two-person game where the agency plays the role of Mother Nature and the intruder is the other player. The agency controls the way it generates the released data.

38 Disclosure Harm Some Possibly Delusionary Closing Comments When we describe the mechanism/structure/model that is used to generate released data, we are partly specifying the model X|Y. –Are we totally specifying this? –There are at the very least some nuisance parameters regarding weights, for example. Is there a meaningful interpretation in randomizing over the parameter from the agency's perspective? Perhaps we should reverse the roles of the agency and the intruder. The parameter is then the intruder’s strategy. In any event, Lambert is suggesting that we need to model the intruder's strategy and formulate the problem from a decision theory standpoint.

39 Disclosure Harm Some Possibly Delusionary Closing Comments Addendum Based on the Talk Last Week Last week, Bahjat described the mechanism/structure/model that could be used to generate released data based on swapping. There the model for X|Y was completely specified. Modeling the intruder’s behavior involves –Modeling the intruder’s prior $\pi(Y)$ on the source data $Y$. Here $Y$ is the source tabular data prior to the swapping (not the original data from which the source table was made). The prior has to satisfy the constraints imposed by swapping (column and row totals preserved) if known. –Then calculating the posterior $\pi_n(Y|X)$, where $n$ is the number of swaps. –One intruder scenario is that the intruder is interested in a target and has very precise information on some identifier variables $y_I$ for the target, where $I$ is a subset of the variables in the table. The intruder is really interested in determining $y_S$, where $y=(y_I,y_S)$, and should calculate the distribution of $y_S|y_I$ under $\pi_n(Y|X)$.

40 Disclosure Harm Some Possibly Delusionary Closing Comments Addendum Based on the Talk Last Week –For Bahjat’s example last time, with prior $\pi(a)=1/6$, $a=4,\dots,9$, the posterior is essentially the prior when $n=32$. E.g., $\pi_{32}(8|8)=\pi(8)P_{32}(8|8)/\pi_{32}(8)$=1/6( / ) =1/6( ). If this is the intruder’s prior, the intruder’s opinion about the actual table has changed very little by knowing the released table. –Since the process is reversible, $P_{32}(a'|a)$ is the posterior for the equilibrium prior, where $a$ is the disclosed table after 32 swaps. There is quite a bit of difference between $P_{32}(a'|a)$ and $\pi_{32}(a'|a)$. –In Bahjat’s example you have six tables labeled by $a$. For row 1 of these tables ($y_I=0$ in my notation) the entries for $y_S=0$ or 1 are (9,1), (8,2),…,(4,6). Thus, the distribution of $y_S|y_I$ under $\pi_n(Y|X)$ is approximately (39/60, 21/60) (the six tables are almost equally likely when $n=32$ under this prior). So, $y_S=0$ is nearly twice as likely as $y_S=1$ given $y_I=0$.
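The (39/60, 21/60) figure is an equal-weight mixture over the six candidate tables. A minimal sketch of that arithmetic follows, treating the six tables as exactly equally likely (the slide says only "almost"); the variable names are illustrative.

```python
from fractions import Fraction

# Row 1 (y_I = 0) of the six candidate tables: counts for y_S = 0 and y_S = 1.
row1_by_table = [(9, 1), (8, 2), (7, 3), (6, 4), (5, 5), (4, 6)]

# Assumed: posterior weight of exactly 1/6 on each table.
weight = Fraction(1, len(row1_by_table))

# Mix the within-row proportions over the tables.
p_s0 = sum(weight * Fraction(c0, c0 + c1) for c0, c1 in row1_by_table)
p_s1 = sum(weight * Fraction(c1, c0 + c1) for c0, c1 in row1_by_table)

print(p_s0, p_s1, float(p_s0 / p_s1))
```

`Fraction` reduces 39/60 to 13/20 and 21/60 to 7/20; the ratio 13/7 is about 1.86, so "nearly twice as likely" is the accurate reading.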