New Measures of Data Utility
Mi-Ja Woo, National Institute of Statistical Sciences

Question: How can we evaluate the characteristics of SDL (statistical disclosure limitation) methods?
- Previously, data utility measures were studied in the context of moments and linear regression models: differences in inferences obtained from the original and the masked data.
- The regression-based measures and the KL distance rely on the multivariate normality assumption.
Questions:
- Is that assumption satisfied in realistic situations?
- What happens if the assumption is violated?
The example below illustrates the problem.

Example: Two-dimensional original data and two masked data sets produced by synthetic and resampling methods.

- Different distributions, but the same moments and the same estimates of the regression coefficients.
- New measures are needed.

1. CDF utility measure
- Extension of the univariate case.
- Kolmogorov-Smirnov statistic: MD = max_x | F_O(x) - F_M(x) |.
- Cramer-von Mises statistic: MCM = Σ_x [ F_O(x) - F_M(x) ]², where the sum runs over the combined data points and F_O, F_M are the empirical distribution functions of the original and masked data.
- Large values of MD and MCM indicate that the two data sets are distributed differently.
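To make the measures concrete, here is a minimal Python sketch (not from the original slides) that evaluates the empirical distribution functions at the combined data points and returns MD and MCM; the function names and the toy data are illustrative only.

```python
import numpy as np

def empirical_cdf(points, data):
    """Empirical CDF of `data` (an n-by-d array) evaluated at each row of
    `points`: F(x) = proportion of records that are componentwise <= x."""
    return np.array([(data <= p).all(axis=1).mean() for p in points])

def cdf_utility(original, masked):
    """MD (Kolmogorov-Smirnov type) and MCM (Cramer-von Mises type) measures.
    Large values mean the original and masked data are distributed differently."""
    combined = np.vstack([original, masked])
    f_o = empirical_cdf(combined, original)
    f_m = empirical_cdf(combined, masked)
    md = np.max(np.abs(f_o - f_m))        # maximum CDF distance
    mcm = np.sum((f_o - f_m) ** 2)        # summed squared CDF distance
    return md, mcm

# Toy usage: original data versus a noisy "masked" version
rng = np.random.default_rng(0)
orig = rng.normal(size=(1000, 2))
masked = orig + rng.normal(scale=0.5, size=orig.shape)
print(cdf_utility(orig, masked))
```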

2. Cluster Data Utility
- A loose definition of clustering: "the process of organizing objects into groups whose members are similar in some way."
- A cluster is therefore a collection of objects that are similar to one another and dissimilar to objects belonging to other clusters.
- The merged data set is said to be randomly assigned when the proportion of observations from the original data in each cluster is constant (1/2 with equal numbers of observations in the two groups). The cluster utility measure is
  U_cluster = (1/G) Σ_{i=1}^{G} w_i [ n_iO / n_i - c ]²,
  where G is the number of clusters, n_i is the total number of records in cluster i, n_iO is the number of records from the original data in cluster i, w_i is the weight assigned to the i-th cluster, and c is the target proportion (1/2 with equal group sizes). Values near zero indicate that original and masked records mix well within clusters.
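Below is a minimal sketch of the cluster utility computation, assuming k-means as the clustering algorithm and equal cluster weights; the slides do not fix either choice, and all names here are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_utility(original, masked, n_clusters=50, random_state=0):
    """Cluster the merged data and measure how far each cluster's share of
    original records deviates from the overall proportion c.  Values near
    zero mean the original and masked records are well mixed."""
    merged = np.vstack([original, masked])
    is_masked = np.repeat([0, 1], [len(original), len(masked)])  # 0 = original record
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(merged)
    c = len(original) / len(merged)            # 1/2 with equal group sizes
    total = 0.0
    for i in range(n_clusters):
        in_cluster = labels == i
        n_i = in_cluster.sum()                              # records in cluster i
        n_i_orig = (is_masked[in_cluster] == 0).sum()       # original records in cluster i
        w_i = 1.0                                           # equal cluster weights assumed
        total += w_i * (n_i_orig / n_i - c) ** 2
    return total / n_clusters
```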

3. Propensity Score Data Utility
- A propensity score is generally defined as the conditional probability of assignment to a particular treatment given a vector of observed covariates (Rosenbaum and Rubin 1983).
- The merged data set is said to be randomly assigned when the propensity score for each record is constant (1/2 with equal numbers of observations in the two groups).
- In the propensity score method, a propensity score p_i is estimated for each record in the merged data, and utility is measured by
  U_p = (1/N) Σ_{i=1}^{N} [ p_i - c ]²,
  where N is the total number of records in the merged data and c is the target proportion (1/2 with equal group sizes).

Estimation of propensity scores:
- Combine the original and masked data sets, and create an indicator variable R_j with value 0 for observations from the original data and 1 otherwise.
1) Logistic regression model, such as
   logit[ P(R_j = 1 | x_j) ] = α + x_j' β,
   where x_j contains the covariates of record j (possibly with quadratic and interaction terms).
2) Tree model.
3) Modified logistic regression model: classify all data points into g groups and fit a logistic model within each group. This combines the logistic model with clustering and borrows strength from both methods.
- Cluster utility can be viewed as a special case of propensity score utility.
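A sketch of the logistic-regression route (option 1), assuming scikit-learn and a second-order polynomial expansion of the covariates; the slides do not prescribe a particular implementation, and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

def propensity_utility(original, masked, degree=2):
    """Fit a logistic model predicting whether a record is masked (R = 1)
    or original (R = 0), then measure how far the fitted propensity scores
    deviate from c.  Values near zero mean the masked data are hard to
    distinguish from the original data (high utility)."""
    merged = np.vstack([original, masked])
    r = np.repeat([0, 1], [len(original), len(masked)])   # indicator R_j
    # degree=2 adds quadratic and interaction terms to the covariates
    X = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(merged)
    p_hat = LogisticRegression(max_iter=1000).fit(X, r).predict_proba(X)[:, 1]
    c = r.mean()                                          # 1/2 with equal group sizes
    return np.mean((p_hat - c) ** 2)
```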

4. Simulation
- Eight types of two-dimensional data with n = 10,000, crossing three factors: 1) symmetric / non-symmetric, 2) high / low correlation, 3) negative / positive correlation.
- Masking strategies considered: synthetic data, microaggregation, microaggregation followed by added noise, rank swapping, and resampling.
- Computational details:
  1) Cluster utility: g = 500 (5%) and g = 1,000 (10%) clusters.
  2) Propensity score utility with the logistic model: logistic regressions fit with polynomial terms in the covariates (the choice of degree is examined in the summary).

  3) Propensity score utility with the tree model: tree size is controlled by the complexity parameter cp (one value considered is cp = 0.001); that is, any split that does not decrease the overall lack of fit by a factor of cp is not attempted.
  4) Propensity score utility with the modified logistic model: the number of groups is g = 100 (1%), and linear and quadratic logistic functions are used to fit the logistic regression models.
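A sketch of the modified logistic model (option 3 on the estimation slide), assuming k-means for the grouping step, which the slides leave unspecified; groups containing records from only one source are handled separately, since no logistic model can be fit there.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

def modified_logistic_utility(original, masked, n_groups=100, degree=2,
                              random_state=0):
    """Cluster the merged data into groups, fit a separate logistic model
    within each group, and pool the squared deviations of the fitted
    propensity scores from c."""
    merged = np.vstack([original, masked])
    r = np.repeat([0, 1], [len(original), len(masked)])   # 0 = original, 1 = masked
    groups = KMeans(n_clusters=n_groups, n_init=10,
                    random_state=random_state).fit_predict(merged)
    c = r.mean()
    sq_dev = np.empty(len(merged))
    for g in range(n_groups):
        idx = groups == g
        r_g = r[idx]
        if r_g.min() == r_g.max():
            # Only one source in this group: the propensity score is 0 or 1
            p_hat = np.full(idx.sum(), float(r_g[0]))
        else:
            X_g = PolynomialFeatures(degree=degree,
                                     include_bias=False).fit_transform(merged[idx])
            p_hat = LogisticRegression(max_iter=1000).fit(X_g, r_g).predict_proba(X_g)[:, 1]
        sq_dev[idx] = (p_hat - c) ** 2
    return sq_dev.mean()
```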

Results: Symmetric high negative case.

Symmetric low negative case.

Non-symmetric high negative case.

Non-symmetric low negative case.

Summary:
- CDF utility:
  1) Does not involve parameters.
  2) It is favorable to the rank swapping SDL method.
- Cluster utility:
  1) Does not measure differences between the structures of the original and masked data within a cluster (within-cluster variation).
  2) It is generally consistent with the overall results.
  3) For non-symmetric cases, a large number of clusters tends to produce worse utility for data masked by microaggregation, since there are three overlaps in the microaggregated data.

- Propensity score utility with the logistic model:
  1) The choice of polynomial degree is crucial.
  2) It is hard to apply to high-dimensional data.
- Propensity score utility with the tree model:
  1) A small tree cannot distinguish the utility of rank swapping from that of resampling.
  2) A large tree leads to poor utility for the microaggregation method; for some cases, a large tree cannot partition the space for the rank swapping method.
  3) It is favorable to the rank swapping SDL method.
- Propensity score utility with the modified logistic model:
  1) It possesses both the advantages and the disadvantages of the logistic model and of clustering, since it combines the cluster and propensity score utilities.
  2) It appears consistent with the overall results for all data structures.

END