Balancing Research & Privacy E. C. Hedberg Arizona State University & NORC at the University of Chicago

Today
Topics for conversation:
– Why is external research important
– Data products typically produced
– What is disclosure?
– Methods to avoid disclosure (aggregate tables)
– Methods to avoid disclosure (individual-level data)
Research results about a common disclosure method
Pilot work on the creation of synthetic data

Access
Current state of researcher access:
– It is very hard to get access to SLDS data
– In eight years of working in this area, I have had access to 16 states
– It is very rare to have this level of access

Privacy
Researchers, educators, and parents are increasingly concerned about what student data elements are recorded and who has access to them.
FERPA regulations set a high bar for the release of information:
– Personally identifiable information (PII) must be removed
But what constitutes PII?
– The Privacy Technical Assistance Center (PTAC) offers some advice

FERPA
34 CFR § 99.31(b)(1)

Search “student data” in news

PTAC Advice

Access
There is a wide variety of interpretations of FERPA:
– Some states allow data use through the audit and evaluation exception
– Some states don't allow researchers access at all

WHO CARES ABOUT RESEARCH?

Research
A valid question is: "Why allow research with state longitudinal data systems (SLDS) at all?"

Research
Premise: Good policy is based on the best available evidence as to:
1. Facts on the ground
2. The mechanisms of achievement
3. The results of previous policies
If we want to enact good policy, we need to (at least) know these three things.

Research
SLDS data provide a good source, and sometimes the only source, of evidence to support positions.
The budgets for national surveys of educational achievement are declining:
– Research about current mechanisms is more difficult
– States such as Arizona usually make up a small portion of those surveys due to sampling plans

Research
The only way to evaluate facts on the ground in Arizona is with Arizona data.
The only way to evaluate Arizona policies is with Arizona data.
Data from a sample of districts are not necessarily representative, and a complex, representative sample can be just as expensive as the SLDS.

Research
Finally, there is return on investment.
Nationally, over 600 million federal dollars have been invested in SLDS.
Who is going to analyze all these data? Much of it can be analyzed by the states, but it is also efficient and prudent to partner with trained researchers.

Research Ecosystem
The Arizona SLDS provides a key resource to support policy investigation to improve education for Arizona residents.
Arizona can partner with ASU and UofA researchers.
Researchers, in turn, get credit for their work, earn tenure, and provide return on investment for Arizona.

Research Ecosystem
However, this ecosystem is based on a risky exchange of information.
Private data, protected by FERPA, are the key resource.
The safest thing to do is not to collect them, but that cripples Arizona's ability to use evidence to support policies.

Key Question
How can we balance research and privacy?

Types of Data Products
The research ecosystem is supported by several types of data products.

Types of Data Products
– Aggregated tables
– Individual-level data for research
Also:
– Research centers (e.g., Texas)
– Web-based interfaces to analyze data on a server (e.g., Rhode Island)

Disclosure Risk
So, what are we worried about? We are concerned that an "intruder" will be able to identify individuals and obtain sensitive information (scores, income level, etc.) about them:
– Identification through the use of published tables
– Identification through access to individual-level data

Disclosure Risk
In survey research, the test is whether someone who knows a person is in the sample can identify that person. In administrative data, since (almost) everyone is in the data, the bar for disclosure risk is far lower.

AGGREGATE TABLES

Aggregate Tables
Descriptive tables report counts or other statistics broken down by nominal characteristics.
Each table needs to balance disclosure risk with data utility.

Example
Random sample from the ECLS: reading level by poverty by gender… by race.

Problem?

Conceptual Diagram
Taken from Duncan, G. T., Fienberg, S. E., Krishnan, R., Padman, R., & Roehrig, S. F. (2001). Disclosure limitation methods and information loss for tabular data. In Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies.

Options
One option is to enact cell suppression:
– If a cell in a table is based on n or fewer observations, the cell is suppressed
This is easy to implement, but has problems:
– It is often possible to reproduce a suppressed cell count using other cells and marginal totals
– Enacting complementary suppression to defeat such tactics is often complicated and removes even more data
A minimal sketch of primary suppression appears below.
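The sketch below, in Python with pandas, illustrates primary cell suppression; the table contents and the threshold of 5 are hypothetical, not values from the presentation.

```python
import pandas as pd

def suppress_small_cells(table: pd.DataFrame, threshold: int = 5) -> pd.DataFrame:
    """Replace any count of `threshold` or fewer with NaN (suppressed)."""
    return table.mask(table <= threshold)

# Hypothetical example: student counts by race/ethnicity and poverty status.
counts = pd.DataFrame(
    {"poor": [120, 4, 37], "not_poor": [210, 15, 3]},
    index=["White", "Black", "Hispanic"],
)
print(suppress_small_cells(counts))
# Suppressed cells print as NaN. Published marginal totals would still let
# an intruder recover them, which is why complementary suppression exists.
```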

Alternatives to Cell Suppression
Rounding:
– All cells in a table are rounded to mask true values
– Problems: it can destroy even more information than cell suppression, the rounding rules are hard to define, and tables may be inconsistent
A sketch of simple base rounding follows.
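For concreteness, here is a minimal sketch of naive base-5 rounding of a count table; the base is an assumed choice. It also illustrates the inconsistency problem: rounded interior cells need not sum to the rounded margins.

```python
import pandas as pd

def round_to_base(table: pd.DataFrame, base: int = 5) -> pd.DataFrame:
    """Round every cell to the nearest multiple of `base`."""
    return (table / base).round() * base

counts = pd.DataFrame(
    {"poor": [122, 4, 37], "not_poor": [211, 16, 3]},
    index=["White", "Black", "Hispanic"],
)
rounded = round_to_base(counts)
# Row totals computed from rounded cells generally disagree with the
# rounded true totals; this is the internal-consistency problem noted above.
print(rounded.assign(total=rounded.sum(axis=1)))
```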

Overall
Data products such as aggregate tables should be vetted by a specialist data auditor:
– A pre-specified level of acceptable risk is agreed upon
– Procedures such as linear programming are used to analyze cells and quantify risk
– Problem: it is an expensive position or service

MICRO-DATA FILES

Rounding, Perturbing
One option is to limit small cells by rounding (coarsening) covariates to larger units, so that large tables that identify individuals are not possible (a sketch follows).
– Problems: destroys data and may limit analyses
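A minimal sketch of coarsening one covariate, assuming a hypothetical age-in-months column; the one-year band width is an arbitrary choice.

```python
import pandas as pd

students = pd.DataFrame({"age_months": [128, 131, 135, 142, 149, 163]})

# Coarsen exact age in months into one-year bands so that detailed
# cross-tabulations can no longer isolate individual students by age.
students["age_band"] = pd.cut(
    students["age_months"],
    bins=range(120, 181, 12),
    labels=["10yr", "11yr", "12yr", "13yr", "14yr"],
)
print(students)
```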

Micro-aggregation
– Individuals are grouped based on nominal groups or through a cluster analysis
– Mean scores are assigned to each group
– Groups are analyzed using weights
A univariate sketch of the grouping step appears after the references. See, e.g.:
– Sande, G. (2001). Methods for data directed microaggregation in one dimension. Proceedings of New Techniques and Technologies for Statistics/Exchange of Technology and Know-how.
– Domingo-Ferrer, J., & Mateo-Sanz, J. M. (2002). Practical data-oriented microaggregation for statistical disclosure control. IEEE Transactions on Knowledge and Data Engineering, 14(1).
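Here is a minimal sketch of fixed-size univariate microaggregation in the spirit of the references above (not their exact algorithms): sort the sensitive values, partition them into groups of at least k, and release each group's mean. The group size k = 3 and the example scores are assumptions.

```python
import numpy as np

def microaggregate(values: np.ndarray, k: int = 3) -> np.ndarray:
    """Replace each value with the mean of its group of k (or more)
    nearest neighbors in sorted order; the last group absorbs leftovers."""
    order = np.argsort(values)
    out = np.empty(len(values), dtype=float)
    n = len(values)
    for start in range(0, n - n % k, k):
        # Let the final group run to the end if fewer than k values remain.
        stop = n if start + 2 * k > n else start + k
        idx = order[start:stop]
        out[idx] = values[idx].mean()
    return out

scores = np.array([512.0, 498.0, 530.0, 501.0, 525.0, 540.0, 495.0])
print(microaggregate(scores))  # each score is replaced by its group mean
```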

CONSEQUENCES OF REDACTION

Data Redaction
One common safeguard for securing privacy is to redact the data of unique individuals.
However, this strategy is harmful to the analysis.

Data Redaction
Common practice is to redact "small cells" from data before giving them to researchers.
For each demographic combination within a district:school:grade cell:
– If 5 or fewer students have that combination (gender, disability status, race/ethnicity, English learner, poverty status), their test scores are removed from the data
This presents major problems for even basic analyses; a sketch of the rule appears below.
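The sketch below implements this redaction rule with pandas; the column names are hypothetical stand-ins for a real SLDS extract.

```python
import pandas as pd

def redact_small_cells(df: pd.DataFrame, threshold: int = 5) -> pd.DataFrame:
    """Blank out test scores in demographic cells with few students."""
    cell = ["district", "school", "grade",
            "gender", "disability", "race", "ell", "poverty"]
    out = df.copy()
    cell_size = out.groupby(cell)["score"].transform("size")
    out.loc[cell_size <= threshold, "score"] = pd.NA
    return out

# Usage (hypothetical data frame): redacted = redact_small_cells(sld_extract)
```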

Data Redaction Test
Six states agreed to participate in a study about the consequences of data redaction (names withheld for this presentation):
– Original, unredacted data provided
– Analyses performed using the original data
– Redaction rules applied
– Reanalysis and comparison of results
– Math and reading, grades 3–8, analyzed

5th Graders

5th grade redaction rates

The redaction process can remove up to 35 percent of the data! For minority groups, much of the data can be removed.

Data Redaction Consequences
– Mean differences are exaggerated
– Intraclass correlations increase
– The cause is the removal of heterogeneous schools

Bias in mean differences

Bias is related to the level of redaction.

Group      Correlation
Black      0.45
Hispanic   0.50
Poor       0.65

The level of bias in the mean estimate from the redacted sample is positively correlated with the rate of redaction of that particular group (unit of analysis: state-subject-grade combinations).

Bias in design parameters

Alternatives to Data Redaction
Hedges and Hedberg have three active grants examining alternative methodologies to data redaction:
– Spencer Foundation pilot grant
– IES methodology grant
– NSF Education and Human Resources grant
The Spencer grant is completing now; the IES and NSF projects are in the data-gathering stage.

Pilot Test of Synthetic Data
– Data from the State of Arkansas, 2010
– Examine 5th grade literacy scores
– Use data with pretests from 4th and 3rd grade

Pilot Test of Synthetic Data
– Micro-data with sensitive columns (i.e., test scores)
– Replace the sensitive columns with synthetic data that preserve the variation and covariation with covariates
– Uses a model-based approach, similar to imputation, to produce synthetic test scores (a sketch follows)
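A minimal sketch of this kind of parametric synthesis, with hypothetical variable names and a deliberately simple model; it illustrates the general approach, not the presenter's exact procedure. Fit a regression of the score on covariates, then draw synthetic scores from the fitted conditional distribution.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2010)

def synthesize_scores(df: pd.DataFrame) -> pd.DataFrame:
    """Replace `score` with draws from a fitted normal linear model,
    preserving its covariation with the model covariates."""
    model = smf.ols("score ~ C(race) + C(gender) + pretest", data=df).fit()
    sigma = np.sqrt(model.scale)  # residual standard deviation
    out = df.copy()
    out["score"] = model.predict(df) + rng.normal(0.0, sigma, size=len(df))
    return out

# Usage (hypothetical data frame): synthetic = synthesize_scores(grade5)
```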

Two Different Tries
Simple model:
– Race, gender, and teacher effects
– Fast to implement
Complex model:
– Race, gender, teacher, and district effects
– Pretests
– Race-by-teacher and race-by-district effects
– Gender-by-teacher and gender-by-district effects
(Formula sketches below.)
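One way to read the two specifications as regression formulas, written here as patsy-style strings with hypothetical variable names (the slides do not give the exact equations):

```python
# Simple model: main effects only.
simple_formula = "score ~ C(race) + C(gender) + C(teacher)"

# Complex model: adds district effects, the two pretests, and the
# race/gender interactions with teacher and district.
complex_formula = (
    "score ~ C(race) + C(gender) + C(teacher) + C(district)"
    " + pretest_g4 + pretest_g3"
    " + C(race):C(teacher) + C(race):C(district)"
    " + C(gender):C(teacher) + C(gender):C(district)"
)
```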

Results of Pilot

Simple model-based synthetic data estimates the mean well

Results of Pilot

Simple model-based synthetic data doesn't do so well on the variance: gross underestimation

Results of Pilot

Complex model-based synthetic data does OK on estimating the mean

Results of Pilot

But the complex model-based synthetic data overestimates the variance

PILOT TEST ON MEAN DIFFERENCES

Results of Pilot

Simple model-based synthetic data underestimates the standard error of the Black/White difference

Results of Pilot

Complex model-based synthetic data overestimates the standard error

Pilot Test of Synthetic Data
These are not the only options for models. There are also technical details about the simulation procedures that we are glossing over; we have more options there as well.

Alternatives to Data Redaction
We are examining two other alternatives to data redaction:
– Masking, perturbing, and coarsening the data
– NORC's X-ID system of micro-grouping (micro-aggregation)

NORC X-ID

Thank you! E. C. Hedberg