Synthetic Data – A New Future for Public Use Micro-Data? John M. Abowd December 7, 2004.

Presentation transcript:

Synthetic Data – A New Future for Public Use Micro-Data? John M. Abowd December 7, 2004

Overview
What is the new NSF-Census research project?
Who are the research teams?
How do the RDC research projects and the synthetic data system work together?
What resources and opportunities does the new project bring to Census?
How can we best collaborate?

NSF-ITR Grant Goals and Overview

The Information Technologies Research Grant from NSF
A program that encourages innovative, high-payoff IT research and education
Our grant proposal cited the many research studies and data products created by previous NSF support for the Research Data Center network and the Longitudinal Employer-Household Dynamics Program

What Is It?
$2.9 million 3-year grant to the RDC network (Cornell is the coordinating institution)
To provide core support for scientific activities at the RDCs
To develop public use, analytically valid synthetic data from many of the RDC-accessible data sets
To facilitate collaboration with RDC projects that help design and test these products

Public Use Data Products Are the Lifeblood of Statistical Agencies
RDC-based teams understand the public use data products produced at Census and how they relate to the underlying confidential data products
In the demographic area there are many public use micro data products
  – But their confidentiality protection is increasingly challenging
In the economic area there are very few public use micro data products
  – But all the data are used for public use aggregate products

Public Use Data Products Are the Lifeblood of Statistical Agencies
Integrated data products, like LEHD, produce public use summary data (QWIs) but no micro data products
  – But synthetic data offer the possibility of releasing customized micro data products

The Research – Synthetic Data Feedback Cycle
Scientific Modeling
Data Synthesis
Confidentiality Protection
Analytic Validity

Research Teams Who is working with us

Principal Investigators
Ron Jarmin, Center for Economic Studies
Trivellore Raghunathan, University of Michigan
Stephen Roehrig, Carnegie Mellon University
Matthew D. Shapiro, University of Michigan
I am the coordinating PI

Project Teams and Coordinators
Neil Bennett, CUNY Baruch
Gail Boyd, Argonne National Laboratory
Marjorie McElroy, Duke University
Wayne Gray, Clark University
John Haltiwanger, University of Maryland
Andrew Hildreth, UC Berkeley
Margaret Levenstein, University of Michigan
Jerome Reiter, Duke University
Jeremy Wu, LEHD Census
Ray Bair, Argonne National Laboratory
Lars Vilhuber, Cornell University

Team Locations
At Census, in the Center for Economic Studies and the Longitudinal Employer-Household Dynamics Program
At the RDCs in Washington Plaza, Boston (NBER, Cambridge), California (UCLA and Berkeley), Chicago (Consortium administered by Northwestern), Ann Arbor (University of Michigan), Research Triangle (Consortium administered by Duke), New York (Cornell and Baruch)

Research Projects and Synthetic Data How they work together

The Multi-layer System
Basic confidential data
  – Fundamental product of virtually all Census programs
  – Leads to the publication of public-use products (summary data, micro data, narrative data)
Gold-standard confidential data
  – Edited, documented and archived research versions of confidential data
  – Used in internal Census research and at Research Data Centers

More Layers
Partially-synthetic micro data
  – Preserves the record structure or sampling frame of the gold standard micro data
  – Replaces the data elements with synthetic values sampled from an appropriate probability model
Fully-synthetic micro data
  – Uses only the population or record linkage structure of the gold standard micro data
  – Generates synthetic entities and data elements from appropriate probability models
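The difference between the two synthetic layers can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the record fields (`id`, `industry`, `earnings`), the function names, and the single normal model for earnings are hypothetical and are not taken from the project. A production synthesizer would draw from a far richer probability model.

```python
import random
import statistics

# Hypothetical "gold standard" records; all fields and values are invented.
gold = [
    {"id": 1, "industry": "retail", "earnings": 31000.0},
    {"id": 2, "industry": "retail", "earnings": 28500.0},
    {"id": 3, "industry": "manufacturing", "earnings": 45200.0},
    {"id": 4, "industry": "manufacturing", "earnings": 47900.0},
    {"id": 5, "industry": "services", "earnings": 39000.0},
]

def fit_normal(records, key):
    """Fit a (deliberately crude) normal model to one numeric field."""
    values = [r[key] for r in records]
    return statistics.mean(values), statistics.stdev(values)

def synthesize_partially(records, key, rng):
    """Keep every record (the record structure) but replace `key`
    with a draw from the fitted model."""
    mu, sigma = fit_normal(records, key)
    synthetic = []
    for record in records:
        copy = dict(record)           # leave the gold standard untouched
        copy[key] = rng.gauss(mu, sigma)
        synthetic.append(copy)
    return synthetic

def synthesize_fully(records, n, rng):
    """Generate n new entities: identifiers, industry, and earnings
    are all produced by the models, not copied record by record."""
    mu, sigma = fit_normal(records, "earnings")
    industries = [r["industry"] for r in records]  # empirical frequencies
    return [
        {
            "id": f"synthetic-{i}",
            "industry": rng.choice(industries),
            "earnings": rng.gauss(mu, sigma),
        }
        for i in range(n)
    ]

rng = random.Random(2004)  # fixed seed so the sketch is repeatable
partial = synthesize_partially(gold, "earnings", rng)
full = synthesize_fully(gold, 8, rng)
```

In the partially synthetic file the identifiers and industries line up one-to-one with the gold standard; in the fully synthetic file no output record corresponds to any particular input record.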

Example: SIPP-SSA-IRS Project
Links IRS detailed earnings records and Social Security benefit data to public use SIPP data
Basic confidential data: SIPP ( , 1996); W-2 earnings data; SSA benefit data
Gold standard: completely linked, edited version of the data with variables drawn from all of the sources
Partially-synthetic data: created using the record structure of the existing SIPP panels with all data elements synthesized using Bayesian bootstrap and sequential regression multivariate imputation methods
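The slide names the Bayesian bootstrap without describing it. As a rough illustration only, and not the project's actual code, one common formulation draws a set of Dirichlet(1, ..., 1) weights over the observed records and then resamples with those weights. The function names and the earnings values below are hypothetical.

```python
import random

def bayesian_bootstrap_weights(n, rng):
    """Draw one set of Dirichlet(1, ..., 1) weights over n records.

    Normalized independent Exponential(1) draws have exactly this
    distribution.
    """
    gammas = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(gammas)
    return [g / total for g in gammas]

def bayesian_bootstrap_draw(values, rng):
    """Resample the observed values using one fresh set of random weights."""
    weights = bayesian_bootstrap_weights(len(values), rng)
    return rng.choices(values, weights=weights, k=len(values))

# Invented earnings values standing in for a confidential variable.
observed = [31000.0, 28500.0, 45200.0, 47900.0, 39000.0]

rng = random.Random(1996)  # fixed seed so the sketch is repeatable
weights = bayesian_bootstrap_weights(len(observed), rng)
replicate = bayesian_bootstrap_draw(observed, rng)
```

Unlike the ordinary bootstrap, which resamples with equal probabilities, each replicate here uses a different random weighting of the observed records, so variation across replicates reflects uncertainty about the underlying distribution.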

Example: LBD
Longitudinal Business Database: longitudinal integration of Census Business Registers developed at CES
Basic confidential data: Business Registers, including co-mingled IRS data
Gold standard: LBD and associated metadata at CES and accessible via RDCs
Fully synthetic micro data: one of the major R&D tasks of the ITR grant

More Examples
CES/NBER productivity series
Decennial Census and PUMS/ACS
LEHD and QWI-Online
Geo-spatial matching of establishment and household data
Linked versions of CPS data similar to SIPP-SSA-IRS project

The Desired Result
Implement the feedback loop of developing releasable synthetic data and testing models on the confidential and synthetic data
Promote active collaboration between Census and RDC researchers
Reinforce the role of the RDCs as an important research and development arm of the Census Bureau
Focus on clear public-use products: a direct Title 13, Chapter 5 benefit
Using data makes them better
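One simple way to picture "testing models on the confidential and synthetic data" is to estimate the same quantity on both files and measure how much the resulting interval estimates overlap. The sketch below is an assumption for illustration, not a method stated in the slides: the overlap measure, the function names, and the data values are all hypothetical, and the interval uses a plain normal approximation.

```python
import math
import statistics

def mean_interval(values, z=1.96):
    """Approximate 95% interval for the mean (normal approximation)."""
    m = statistics.mean(values)
    half = z * statistics.stdev(values) / math.sqrt(len(values))
    return (m - half, m + half)

def interval_overlap(a, b):
    """Average share of each interval covered by their intersection.

    Returns 1.0 for identical intervals and 0.0 for disjoint ones.
    """
    lo = max(a[0], b[0])
    hi = min(a[1], b[1])
    if hi <= lo:
        return 0.0
    width = hi - lo
    return 0.5 * (width / (a[1] - a[0]) + width / (b[1] - b[0]))

# Invented values standing in for the same variable in each file.
confidential = [31000.0, 28500.0, 45200.0, 47900.0, 39000.0]
synthetic = [30200.0, 29900.0, 44100.0, 49500.0, 37800.0]

score = interval_overlap(mean_interval(confidential), mean_interval(synthetic))
```

A score near 1 suggests the synthetic file supports much the same inference as the confidential file for this estimate; a low score is the kind of signal that would feed back into revising the synthesis model.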

RDC Implementation Some details

Structure of Support
Explicit RDC and internal projects to produce synthetic data
Support for basic RDC expenses via subcontract to Census with an account for each RDC
Direct support for research teams that are developing the supercluster and virtual RDC

Expectations of Researchers
All projects at RDCs are part of the NSF-ITR and will benefit from it
  – RDC administrators are directly supported (about 1/3 of all grant funds are used for this)
All projects will archive results and programming files in a database, following metadata specs that the ITR will develop
The individual research projects will provide a laboratory for developing synthetic data and testing the analytic validity of the new products

Distance Learning Class
Social and Economic Data
To begin Spring 2005
Wednesdays 7:00-9:30pm beginning January 26
Census and RDCs invited to enroll students
Meets via distance-learning facilities intermediated by a professional producer
Examples drawn from many Census data products
Exercises developed on the Virtual RDC

Computer Resources The RDC Supercluster and the Virtual RDC

Two Computer Systems
The supercluster will be installed at Census facilities at Bowie
  – Fully integrated part of the RDC computing system and subject to all security and operating requirements of that system
  – ANL team to help design and optimize
Secondary system is a “Virtual RDC”
  – Configured in much the same way as the cluster is set up in the RDC environment
  – WITHOUT any confidential data (operated at Cornell)
  – WITH secure access from anywhere
  – Automated file release (simulating disclosure review)

Supercluster Specifications
4 x 64-way Itanium-based thick cluster
Internal memory 768GB (192GB in each node)
Operating System: SuSE Linux Enterprise Server (SLES) version 9
Integrated into the RDC SAN for disk storage
Software
  – SAS for Itanium/Linux
  – GAUSS
  – Stata
  – MPI and similar programming interfaces
  – Fortran and C compilers from Intel

Virtual RDC Will Provide
A close approximation of the main cluster environment
A close approximation of the RDC thin client
Disclosure-proofed metadata and synthetic data

Virtual RDC Goals
TESTING: initially test programs in a virtual RDC environment before deploying them in the RDC network
TRAINING: allow researchers to experience the “limitations” of the RDC environment (work through the thin client)
RESEARCH: once synthesized and disclosure-proofed data are available, actual research can be done on the virtual RDC server, though that may not be the only place where this occurs
FEEDBACK: run research, and provide output in a file format that is immediately transferable to the RDC network to feed back into the synthesizer

Collaboration Opportunities and Suggestions

Opportunities
The grant team consists of many Census-based researchers and long-term Census collaborators.
We are openly seeking more internal Census collaborators.
The ANL supercluster team, funded through the grant, brings enormous experience in designing and running Unix-based superclusters.

Feedback
Please suggest important areas where you would like to develop projects
Recruit internal and university-based research teams to collaborate via the RDC network
Other ideas?

Further Information
To request a copy of the NSF proposal, send to (same as
To participate in the distance learning course, contact Ron Jarmin at CES e-mail to
Thank you for your efforts.